Use Regular Expressions in Octoparse
Thursday, October 13, 2016 9:09 AM
If you are not familiar with regular expression syntax, you can use our RegEx Tool to generate an expression for you.
In this tutorial, I will use Glassdoor as an example to show you how to use regular expressions to scrape data from web pages.
(Download my extraction task of this tutorial HERE just in case you need it.)
Choose “Advanced Mode”. ➜ Click “Start” ➜ Complete basic information.
Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.
(URL of the example: https://www.glassdoor.com/Job/public-relations-jobs-SRCH_KO0,16.htm )
Suppose we want to scrape the requirements for the Public Relations Manager position. You will find that you cannot extract just that data directly, because it comes mixed with additional information.
Scrape the whole text first.
Click “Public Relations Manager”. ➜ Click “Click an item”. ➜ Click the title and extract the text.
Click the highlighted link ➜ Click “Extract Inner HTML, including the page source code, text with format and images”.
Then you need to customize your data. Click the field you want to customize. ➜ Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step”.
You will find that there are line breaks in the job requirements.
To extract the requirement information as a whole, you need to delete the line breaks first. Here we use the escape sequence “\n” to match a line break.
Click “Replace with Regular Expression”. ➜ Enter “\n” in the “Regular Expression” text box. ➜ Leave the “Replace with” box blank. ➜ Hit “Calculate” and the output is the value we want.
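For readers who want to see what this replace step does outside Octoparse, here is a minimal Python sketch. It is an illustration only, not Octoparse's internal implementation; the function name `remove_line_breaks` and the sample text are made up for this example, and the optional "\r" handles Windows-style line endings, which is an assumption beyond this tutorial.

```python
import re


def remove_line_breaks(text: str) -> str:
    # Same idea as "Replace with Regular Expression" with the pattern \n
    # and an empty "Replace with" box: every line break is deleted.
    # \r? also catches Windows-style "\r\n" endings (assumption, see above).
    return re.sub(r"\r?\n", "", text)


# Hypothetical sample text, not taken from the Glassdoor page.
sample = "The ideal candidate will possess:\nStrong writing skills\nFive years of experience"
print(remove_line_breaks(sample))
```

Because the replacement is empty, the lines are joined with nothing between them, which is exactly what leaving the “Replace with” box blank does.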
Now we can use regular expressions to match the information we want. Click “Add step” again. ➜ Click “Match with Regular Expression”.
If you know how to write a regular expression, you can enter one directly to match the information you want.
(Note: Click Here to learn more about regular expressions).
If you don’t know how to write a regular expression, you can use the “Try RegEx Tool” option.
Click “Try RegEx Tool”. In the “Source Text”, you will find that the job requirement information begins with “possess:” and ends with “Swissôtel Hotels”. ➜ Click “Start With” and paste “possess:” ➜ Click “End With” and paste “Swissôtel Hotels” ➜ Click “Generate” ➜ Click “Match All” ➜ Click “Match” ➜ Click “Apply”. The job requirement information will be shown in the “Output” browser. ➜ Click “OK”.
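To make the “Start With” / “End With” idea concrete, here is a rough Python equivalent. The pattern that the RegEx Tool actually generates is not shown in this tutorial, so the lookaround pattern below is an assumption, as are the function name `extract_between` and the sample text. This sketch leaves the two markers out of the result; the RegEx Tool may behave differently.

```python
import re
from typing import Optional


def extract_between(text: str, start: str, end: str) -> Optional[str]:
    # re.escape keeps characters such as ":" or "." literal.
    # (?<=...) and (?=...) are lookarounds, so the markers are not captured.
    # .*? is non-greedy: it stops at the first occurrence of `end`.
    # re.DOTALL lets "." match line breaks in case any are left in the text.
    pattern = re.compile(
        "(?<=" + re.escape(start) + ")(.*?)(?=" + re.escape(end) + ")",
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical sample text, not taken from the Glassdoor page.
sample = "The candidate will possess: a degree and relevant experience. Swissôtel Hotels & Resorts"
print(extract_between(sample, "possess:", "Swissôtel Hotels"))
```

The function returns `None` when either marker is missing, which makes it easy to spot pages where the job description does not follow the expected layout.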
You will find that there are unnecessary tags in the requirement information. Use regular expressions to remove these tags as well.
Click “Add step” again. ➜ Click “Replace with Regular Expression”. ➜ Enter “</p><p></p><p>” in the “Regular Expression” text box. ➜ Leave the “Replace with” box blank. ➜ Hit “Calculate” and the output is the value we want.
Remove “</p><p></p><p style="">” in the same way.
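The two tag sequences removed above differ only in the optional style attribute, so a single pattern can cover both. The Python sketch below shows one way to write it; combining the two steps is a suggestion of this example rather than something Octoparse requires, and the names `TAG_RUN` and `strip_tag_runs` as well as the sample text are hypothetical.

```python
import re

# Matches both '</p><p></p><p>' and '</p><p></p><p style="">' in one pass.
# (?: style="")? makes the empty style attribute optional.
TAG_RUN = re.compile(r'</p><p></p><p(?: style="")?>')


def strip_tag_runs(html: str, replacement: str = "") -> str:
    # An empty replacement mirrors leaving the "Replace with" box blank.
    return TAG_RUN.sub(replacement, html)


# Hypothetical sample text, not taken from the Glassdoor page.
sample = 'Degree in PR</p><p></p><p>Five years of experience</p><p></p><p style="">Strong writing'
print(strip_tag_runs(sample))
print(strip_tag_runs(sample, "; "))
```

With an empty replacement the requirements run together, as they do in the steps above; passing a separator such as "; " keeps them apart if that reads better in the exported data.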
Click “Done” ➜ Click the “Field Name” to modify the name ➜ Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” ➜ Click “OK” to run the task on your computer. Octoparse will automatically extract all the selected data.
The extracted data will be shown in the “Data Extracted” pane. Click the “Export” button to export the results to an Excel file, a database, or another format, and save the file to your computer.
Author: The Octoparse Team