Web Scraping Tutorials: Scraping Source Code from Web Pages

Thursday, March 09, 2017 8:50 PM

Octoparse can extract the source code (HTML) of a page element, which lets you pull exactly the information you need out of a web page.

In this web scraping tutorial, we will scrape detailed information about law firms in New York from the search results on lawyers.com. Details such as the website and reviews can appear in different locations on different pages, so we will use regular expressions, available through the Octoparse advanced options, to extract the exact content we want.

The website URL we will use is http://www.lawyers.com/all-legal-issues/all-cities/new-york/law-firms/.

The data fields are the law firm name, the serving region, and the website URL.

You can download the task (the .otd file) directly and start collecting data, or you can follow the steps below to build a scraping task for these lawyers yourself.

 

Step 1. Set up basic information.

Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".

 

 

 

Step 2. 

Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the web page.

(URL of the example: http://www.lawyers.com/all-legal-issues/all-cities/new-york/law-firms/)

 

Step 3. 

Move your cursor over the items that share a similar layout. These are the blocks from which we will extract the details of each lawyer.

Click the first lawyer ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".

The first lawyer is now added to the list ➜ Click "Continue to edit the list".

Click the second lawyer ➜ Click "Add current item to the list" again (now all the lawyers with a similar layout are in the list) ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the information of each lawyer.

 

Step 4. Extract the content from the web page.

Click the firm name ➜ Select "Extract text".

Click the "Region" block ➜ Select "Extract Outer HTML, including the page source code, text with format and images". We will extract the serving region with the regular expression tool.

Click the "Contact Firm" block ➜ Select "Extract Outer HTML, including the page source code, text with format and images". We will extract the exact website URL from the source code with the regular expression tool.

After all the content has been added to Data Fields ➜ Click each "Field Name" to rename it. Then click "Save".

 

Step 5. Re-format the data fields.

Step 5.1. We will isolate the website URL in the data field "Website".

Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.

Click “Add step” ➜ Select “Match with Regular Expression” ➜ Use "Try RegEx Tool".

From the RegEx Tool window, we can see that the website URL is embedded in the code like:

.... data-omniture-type="website" href="http://www.materalaw.com" target="_blank"> ....

Thus we check the option "Start With" and enter the value ="website" href=" ➜ Check the option "End With" and enter the value " ➜ Click “Generate” to create the regular expression ➜ Check the option "Match All" ➜ Click “Match” ➜ Click “Apply” ➜ Click “Calculate” ➜ Click “OK”.

Click “Done”. You will see that the website URL has been extracted correctly ➜ Click “Save”.
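To make the "Start With" / "End With" idea concrete outside of Octoparse, here is a minimal Python sketch. It assumes the source code looks like the snippet shown above; the sample string and the helper name `extract_between` are illustrative, and the pattern built here is not necessarily the exact expression the RegEx Tool generates.

```python
import re

# Sample fragment modeled on the source code snippet shown in this step.
html = ('<a data-omniture-type="website" '
        'href="http://www.materalaw.com" target="_blank">Website</a>')

START = '="website" href="'   # the "Start With" value
END = '"'                     # the "End With" value


def extract_between(text, start, end):
    """Return every shortest run of text found between start and end."""
    # re.escape makes the markers literal; (.*?) is a lazy capture so the
    # match stops at the first closing quote instead of the last one.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    return re.findall(pattern, text)


websites = extract_between(html, START, END)
print(websites)
```

The lazy quantifier matters here: with a greedy `(.*)` the capture would run on to the last double quote in the fragment and include `target="_blank` as well.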

 

Step 5.2. 

We will isolate the region text in the data field "Region".

Choose the data field ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data”.

 

1) Click “Add step” ➜ Select “Replace with Regular Expression” ➜ Enter ‘\s+’ in the Regular Expression box ➜ Click “Calculate” ➜ Click “OK”.

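This step deletes all whitespace (spaces, tabs, and line breaks), which is why the region appears without spaces in the source code shown below.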

 

2) Click “Add step” ➜ Select “Match with Regular Expression” ➜ Use "Try RegEx Tool".

 

From the RegEx Tool window, we can see that the region is embedded in the code like:

.... <span class="span-block-box">Serving<b>Melville,NY</b> ....

Thus we check the option "Start With" and enter the value Serving<b> ➜ Check the option "End With" and enter the value </b> ➜ Click “Generate” to create the regular expression ➜ Check the option "Match All" ➜ Click “Match” ➜ Click “Apply” ➜ Click “Calculate” ➜ Click “OK”.

Click “Done”. You will see that the region of the law firm has been extracted correctly ➜ Click “Save”.
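The two re-formatting steps for "Region" (strip whitespace, then match between two markers) can be sketched in Python as follows. The raw HTML string is an assumed example of what the Outer HTML might look like before cleaning; only the cleaned form matches the snippet shown above.

```python
import re

# Assumed raw Outer HTML for one "Region" block, with line breaks and
# indentation as they might appear in the page source.
raw = '''<span class="span-block-box">
    Serving <b>Melville, NY</b>
</span>'''

# Step 1: replace every run of whitespace with nothing.
compact = re.sub(r'\s+', '', raw)

# Step 2: capture whatever sits between "Serving<b>" and "</b>".
regions = re.findall(r'Serving<b>(.*?)</b>', compact)

print(compact)
print(regions)
```

Because step 1 removes all whitespace, the space inside the region name is lost too ("Melville,NY"). If that matters for your data, one option is to replace `\s+` with a single space instead and adjust the "Start With" value accordingly.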

 

Step 6. Check the workflow.

Now check the workflow by clicking each action in order, starting from the beginning of the workflow:

Go to Web Page ➜ The Loop Item box ➜ Click Item ➜ Extract Data.

 

Step 7. 

Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.

 

 

Step 8. 

The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
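For readers who post-process results outside Octoparse, the sketch below shows one way the three data fields could be written as CSV in Python. The column names and the sample row are placeholders invented for illustration; they are not output from the actual task.

```python
import csv
import io

# Hypothetical column names for the three data fields in this tutorial.
FIELDS = ["law_firm", "region", "website"]

# Placeholder row for illustration only, not real scraped data.
rows = [
    {"law_firm": "Example Law Firm",
     "region": "Melville,NY",
     "website": "http://www.example.com"},
]


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Note that the `csv` module quotes the region value automatically because it contains a comma, so the file still parses into three columns.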

Author: The Octoparse Team
