Web Scraping Tutorials: Scraping Source Code from Web Pages
Thursday, March 09, 2017 8:50 PM
Octoparse enables you to scrape the source code of web pages so that you can extract exactly the information you need from them.
In this web scraping tutorial we will scrape detailed information about lawyers in New York from the search results on the lawyers.com website. Details such as the website and reviews can appear in different locations on different pages. Therefore, we use regular expressions, available through the Octoparse advanced options, to extract exactly the content we want.
The website URL we will use is http://www.lawyers.com/all-legal-issues/all-cities/new-york/law-firms/.
The data fields include law firm, serving region and website URL.
You can download the task (the .otd file) directly and begin collecting the data, or you can follow the steps below to build a scraping task that collects the information about these lawyers from this site. (Download my extraction task of this tutorial HERE just in case you need it.)
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2.
Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.
(URL of the example: http://www.lawyers.com/all-legal-issues/all-cities/new-york/law-firms/)
Step 3.
Move your cursor over the items with a similar layout, from which you will extract the details of each lawyer.
Click the first lawyer ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first lawyer has now been added to the list. ➜ Click "Continue to edit the list".
Click the second lawyer ➜ Click "Add current item to the list" again (now we get all the lawyers with similar layout) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the information of each lawyer.
Step 4. Extract the content from the web page.
Click the firm ➜ Select "Extract text".
Click the "Region" block ➜ Select "Extract Outer HTML, including the page source code, text with format and images". We will extract the serving region with the regular expression tool.
Click the "Contact Firm" block ➜ Select "Extract Outer HTML, including the page source code, text with format and images". We will extract the exact website URL from the source code with the regular expression tool.
After all the content has been selected in Data Fields ➜ Click the "Field Name" to modify it. Then click "Save".
Step 5. Re-format the data fields.
Step 5.1. We will extract the exact website URL from the data field "Website".
Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.
Click “Add step” ➜ Select “Match with Regular Expression” ➜ Use "Try RegEx Tool".
From the RegEx Tool window, we can see that the website URL is embedded in the code like:
.... data-omniture-type="website" href="http://www.materalaw.com" target="_blank"> ....
Thus we check the option "Start With" and enter the value ="website" href=" ➜ Check the option "End With" and enter the value " (a single double-quote character) ➜ Click “Generate” to create the regular expression ➜ Check the option "Match All" ➜ Click “Match” ➜ Click “Apply” ➜ Click “Calculate” ➜ Click “OK”.
Click “Done”. Then you will see the website URL has been extracted correctly. ➜ Click “Save”.
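For readers who want to see the idea outside Octoparse, the "Start With" / "End With" match above can be sketched in Python with a lookbehind and a lookahead. This is only an illustration: the expression that the RegEx Tool actually generates may differ, and the helper name `between` is hypothetical. The sample string is the fragment shown in the RegEx Tool window.

```python
import re

def between(start: str, end: str) -> re.Pattern:
    """Hypothetical helper: match the shortest text between two literal markers."""
    # The lookbehind and lookahead keep the markers out of the match itself.
    return re.compile(
        f"(?<={re.escape(start)}).*?(?={re.escape(end)})",
        re.DOTALL,
    )

# Fragment of the Outer HTML, as shown in the RegEx Tool window.
outer_html = ' data-omniture-type="website" href="http://www.materalaw.com" target="_blank">'

# "Start With" value and "End With" value from the step above.
website_pattern = between('="website" href="', '"')

# findall plays the role of the "Match All" option.
websites = website_pattern.findall(outer_html)
print(websites)
```

The match is non-greedy (`.*?`), so it stops at the first closing quote after the `href` value rather than running on to a later quote in the same tag.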
Step 5.2. We will extract the exact text from the data field "Region".
Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.
1) Click “Add step” ➜ Select “Replace with Regular Expression” ➜ Enter ‘\s+’ in the Regular Expression box ➜ Click “Calculate” ➜ Click “OK”.
This step deletes all whitespace characters (spaces, tabs and line breaks) from the extracted HTML.
2) Click “Add step” ➜ Select “Match with Regular Expression” ➜ Use "Try RegEx Tool".
From the RegEx Tool window, we can see that the region is embedded in the code like:
.... <span class="span-block-box">Serving<b>Melville,NY</b> ....
Thus we check the option "Start With" and enter the value Serving<b> ➜ Check the option "End With" and enter the value </b> ➜ Click “Generate” to create the regular expression ➜ Check the option "Match All" ➜ Click “Match” ➜ Click “Apply” ➜ Click “Calculate” ➜ Click “OK”.
Click “Done”. Then you will see the region of the law firm has been extracted correctly. ➜ Click “Save”.
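The two re-format steps for "Region" can be sketched in Python as a replace followed by a match. This is a rough equivalent under stated assumptions, not the expression Octoparse generates: the line breaks and indentation in the sample HTML are assumed for illustration, and only the `span`, `Serving` and `<b>` parts come from the snippet above.

```python
import re

# Sample Outer HTML for the "Region" block; the whitespace layout is assumed.
outer_html = """
<span class="span-block-box">
    Serving <b>Melville, NY</b>
</span>
"""

# Step 1: "Replace with Regular Expression" with \s+ and an empty replacement,
# which removes every run of whitespace.
compact = re.sub(r"\s+", "", outer_html)

# Step 2: "Match with Regular Expression" between Serving<b> and </b>.
regions = re.findall(r"(?<=Serving<b>).*?(?=</b>)", compact)
print(regions)
```

Because the first step removes all whitespace, the space inside the value is lost as well, which is why the region appears as `Melville,NY` in the RegEx Tool window. Removing the whitespace first means the "Start With" value `Serving<b>` matches regardless of how the source code is indented.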
Step 6. Check the workflow.
Now we need to check the workflow by clicking each action in order, starting from the beginning of the workflow.
Go to Web Page ➜ The Loop Item box ➜ Click Item ➜ Extract Data.
Step 7.
Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 8.
The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
Author: The Octoparse Team