Use Regular Expressions in Octoparse

Thursday, October 13, 2016 9:09 AM

Regular expressions are patterns used to match character combinations in strings. They are used with string methods such as match, replace, search, and split. When CSS selectors or XPath can’t reach the data, for example when the value you want sits inside a web page’s inline JavaScript, a regular expression can help you quickly pick out the information you need.
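As a rough illustration of those four operations outside Octoparse, here is a minimal Python sketch using the standard `re` module. The HTML fragment, the field names, and the number in it are made up for the example; they are not taken from the Glassdoor page.

```python
import re

# Hypothetical page fragment: the value lives inside inline JavaScript,
# so there is no element for a CSS selector or XPath to target.
html = '<script>var job = {"title": "Public Relations Manager", "id": 1042};</script>'

# search: find the first match and read the captured group
found = re.search(r'"title":\s*"([^"]+)"', html)
title = found.group(1) if found else ""

# findall (match all): collect every run of digits
numbers = re.findall(r"\d+", html)

# sub (replace): strip the opening and closing script tags
no_tags = re.sub(r"</?script>", "", html)

# split: break the title on whitespace
words = re.split(r"\s+", title)

print(title)
print(numbers)
print(no_tags)
print(words)
```

The same ideas (match, replace, split) are what the “Re-format extracted data” steps in this tutorial apply inside Octoparse.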

If you are not familiar with writing regular expressions, you can use our RegEx Tool to generate one for you.

In this tutorial I will use Glassdoor as an example to show you how to use regular expressions to scrape data from web pages.

(Download my extraction task of this tutorial HERE just in case you need it.)

 

Step 1.

Choose “Advanced Mode” ➜ Click “Start” ➜ Complete the basic information.

Enter the target URL in the built-in browser ➜ Click the “Go” icon to open the webpage.

(URL of the example: https://www.glassdoor.com/Job/public-relations-jobs-SRCH_KO0,16.htm )

 

Step 2.

Suppose we want to scrape the job requirements for the Public Relations Manager position. You will find that you can’t extract just the requirements directly; they come together with additional information.

So scrape the whole text first.

Click “Public Relations Manager” ➜ Click “Click an item” ➜ Click the title and extract the text.

 

Click the highlighted link ➜ Click “Extract Inner HTML, including the page source code, text with format and images”.

Then you need to customize your data. Click the field you want to customize ➜ Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step”.

 

You will see that there are line breaks in the job requirements. See the example in the screenshot.

 

To extract the whole requirements text, you need to delete the line breaks first. Here we use the escape sequence “\n” to match a line break.

Click “Replace with Regular Expression” ➜ Enter “\n” in the “Regular Expression” text box ➜ Leave the “Replace with” box blank ➜ Hit “Calculate”. The output is the text with the line breaks removed.
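For readers who want to see what this replacement does, the following Python sketch applies the same “\n” pattern with an empty replacement. The sample text is invented, and the second variant (handling “\r\n” and inserting a space) is an optional extra, not something the tutorial's steps require.

```python
import re

# Hypothetical extracted text with line breaks inside it
raw = "Candidates should possess:\nA bachelor's degree\nFive years of experience\n"

# Same idea as the step above: match each line break, replace it with nothing
flattened = re.sub(r"\n", "", raw)

# Optional variant: also accept Windows-style line endings and
# put a space where each line break was, so words do not run together
spaced = re.sub(r"\r?\n", " ", raw).strip()

print(flattened)
print(spaced)
```

Replacing with nothing joins the lines directly, which is what leaving the “Replace with” box blank does; a space is sometimes easier to read.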

 

Now we can use regular expressions to match the information we want. Click “Add step” again ➜ Click “Match with Regular Expression”.

If you know how to write a regular expression, you can enter one directly to match the information you want.

(Note: Click Here to learn more about regular expressions.)

If you don’t know how to write a regular expression, you can use “Try RegEx Tool”.

Click “Try RegEx Tool”. In the “Source Text”, you can see that the job requirement information begins with “possess:” and ends with “Swissôtel Hotels”. Click “Start With” and paste “possess:” ➜ Click “End With” and paste “Swissôtel Hotels” ➜ Click “Generate” ➜ Click “Match All” ➜ Click “Match” ➜ Click “Apply”. The job requirement information will be shown under “Output” ➜ Click “OK”.
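The “Start With” / “End With” idea can be sketched in Python as a pattern that captures everything between two literal markers. This is only an illustration: the helper name `between` and the sample text are hypothetical, and the expression the RegEx Tool actually generates may differ (for instance, it may include the markers in the match).

```python
import re

def between(start: str, end: str) -> re.Pattern:
    """Build a pattern that captures the text between two literal markers."""
    # re.escape treats the markers as plain text, not regex syntax.
    # (.*?) is non-greedy, so the match stops at the first end marker.
    # re.DOTALL lets "." cross any line breaks still present in the text.
    return re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)

# Hypothetical source text, shaped like the example in this tutorial
source = (
    "The ideal candidate will possess:"
    "<p>Strong writing skills</p><p>Media contacts</p>"
    "Swissôtel Hotels & Resorts"
)

pattern = between("possess:", "Swissôtel Hotels")
match = pattern.search(source)
requirements = match.group(1) if match else ""

print(requirements)
```

Escaping the markers matters when they contain characters such as “.” or “(”, which would otherwise be read as regex syntax.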

 

You will find unnecessary tags in the requirement information. Use regular expressions to remove these tags as well.

Click “Add step” again ➜ Click “Replace with Regular Expression” ➜ Enter “</p><p></p><p>” in the “Regular Expression” text box ➜ Leave the “Replace with” box blank ➜ Hit “Calculate”. The output is the text with those tags removed.

 

Remove “</p><p></p><p style="">” in the same way.
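The two tag replacements above can also be written as a single expression. The sketch below is a Python illustration with made-up sample text; it assumes the attribute appears exactly as `style=""` with straight quotes, which may not hold for every page.

```python
import re

# Hypothetical matched text containing both tag sequences named above
matched = (
    "Strong writing skills</p><p></p><p>Media contacts"
    '</p><p></p><p style="">Five years of experience'
)

# One pattern covers both variants: the style="" attribute is optional
tag_run = r'</p><p></p><p(?: style="")?>'

# Replace with nothing, as in the steps above
cleaned = re.sub(tag_run, "", matched)

# Optional variant: keep a separator so the items stay readable
separated = re.sub(tag_run, "; ", matched)

print(cleaned)
print(separated)
```

In Octoparse the same effect comes from the two separate “Replace with Regular Expression” steps; combining them is just a convenience.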

 

 

Step 3.

Click “Done” ➜ Click the “Field Name” to modify the name ➜ Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ Click “OK” to run the task on your computer. Octoparse will automatically extract all the selected data.

 

Step 4.

The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to an Excel file, a database, or another format, and save the file to your computer.

 

 

 

Author: The Octoparse Team

 

 

 
