Web Scraping Case Study | Scraping Data from Yelp

Wednesday, April 12, 2017 11:03 AM

In this tutorial, I will walk through the steps for scraping data from Yelp and normalizing the extracted data fields.

 

Features Covered

  • Pagination 
  • Reformat Extracted Data
  • Build a list
  • Modify XPath
  • Set AJAX Timeout

 

Now, let's get started. 

Step 1. Set up basic information

  • Click "Quick Start"
  • Choose "New Task (Advanced Mode)"
  • Complete the basic information
  • Click "Next"

 

Step 2. Navigate to the target website

  • Enter the target URL in the built-in browser (URL used for the example: https://www.yelp.com/search?find_desc=Restaurants&find_loc=Washington,+DC)
  • Click the "Go" icon to open the webpage
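
The example URL is an ordinary search address with two query parameters, find_desc and find_loc. As a small illustration outside Octoparse, the Python sketch below builds the same kind of URL for any search term and location. The function name build_search_url is an assumption made for this example, and only the two parameters visible in the example URL are used.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.yelp.com/search"


def build_search_url(description: str, location: str) -> str:
    """Return a search URL for the given term and location (hypothetical helper)."""
    # urlencode escapes spaces as '+' and commas as '%2C'.
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"{BASE_URL}?{query}"


url = build_search_url("Restaurants", "Washington, DC")
print(url)

# The comma is percent-encoded here, but it decodes to the same parameters
# as the URL used in this tutorial.
print(parse_qs(urlsplit(url).query))
```

Changing the two arguments gives a starting URL for a different city or category.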

 

Step 3. Create a list of items

  • Move your cursor over the section that contains the first restaurant's information. When the whole section is selected and highlighted in blue, click on it

(Tip: If the section is not selected properly at first, click anywhere on the section and use the Window Expand button to extend the selection.)

  • When prompted, click "Create a list of items" (sections with similar layout)
  • Click "Add current item to the list"

Now that the first item has been added to the list, we need to add the remaining items.

  • Click "Continue to edit the list"
  • Click on the second section of similar layout
  • Click "Add current item to the list" once again

Now all the sections have been added to the list.

  • Click "Finish Creating List"
  • Click "Loop". This tells Octoparse to loop through each section of the list and extract the desired data.

 

Step 4. Select the data to be extracted 

After we finish creating the list, we'll be taken to the detail page of each item, where we will extract the data.

  • Click the Restaurant name "Un Je Ne Sais Quoi"
  • Select “Extract text”

Now we can see that the restaurant name has been extracted as a data field.

To extract the star rating, if the selection is not identified properly at first, expand the selection area until the outlined box includes every one of the stars.

  • Since nothing is shown under “Extract text”, select “Extract outer html”.

Now, the Star Rating's outer HTML has been extracted.

  • Follow the same steps to extract the other data.

 

Step 5. Rename the data fields

  • Rename any data fields if necessary.
  • Click "Save"

 

 

Step 6. Modify the XPath of data field "Name" in the Loop Item

Whenever a data field is selected to be extracted, Octoparse automatically identifies the XPath for that data field. However, the assigned XPath can't always locate the data field accurately, so we sometimes need to modify the XPath manually to extract the desired data correctly.

For this example, the XPath Octoparse identified for the data field "Name" cannot consistently locate the name of the restaurant. We need to modify the XPath manually to make sure all restaurant names get extracted.

  • Choose the "Name" field and click "Customize Field" to locate the item.
  • Choose "Define ways to locate an item".
  • Enter the modified XPath "//H1[contains(@class,'biz-page-title embossed-text-white')]" in the "Matching XPath" text box.
  • Click "OK" and "Save"
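
The modified XPath selects every H1 element whose class attribute contains the text 'biz-page-title embossed-text-white'. As a rough illustration outside Octoparse, the Python sketch below applies the same test to a small sample document. Python's built-in ElementTree does not support the XPath contains() function, so the predicate is emulated with a plain substring check. The sample markup and the function name find_names are assumptions made for this example, and the real page markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page fragment (well-formed so ElementTree can parse it).
SAMPLE = """
<html><body>
  <h1 class="biz-page-title embossed-text-white shortenough">Un Je Ne Sais Quoi</h1>
  <h1 class="other-title">Not the name</h1>
</body></html>
"""


def find_names(root, class_fragment):
    """Emulate //h1[contains(@class, class_fragment)] with a substring check."""
    return [
        (h1.text or "").strip()
        for h1 in root.iter("h1")
        if class_fragment in h1.get("class", "")
    ]


root = ET.fromstring(SAMPLE.strip())
names = find_names(root, "biz-page-title embossed-text-white")
print(names)
```

Because contains() is a substring test, the first heading still matches even though its class attribute carries an extra class name, while the second heading is skipped.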

 

Step 7. Re-format Extracted Data

Since the star rating was not selected properly, we need to re-format the data field "Star-rating" to extract the exact information we want.

  • Choose data field "Star" to reformat
  • Select the "Customize Field" 
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Match with Regular Expression"
  • Input the Regular Expression "(?<=title=")(.+?)(?= star)" to normalize the Outer HTML of "Star" (or you can use the Regex tool to generate a regex automatically).
  • Click "OK"
  • Note that the value of the "Star_rating" data field has turned into 4.5.
  • Click "Save"
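
To see what this regular expression does, here is a minimal Python sketch that applies it to a sample of outer HTML. The sample markup is an assumption made for illustration; the real outer HTML extracted from the page may look different.

```python
import re

# The regular expression from this step: take the text that follows title="
# and stop just before the word " star".
STAR_PATTERN = re.compile(r'(?<=title=")(.+?)(?= star)')

# Hypothetical outer HTML for a star-rating element.
outer_html = '<div class="i-stars" title="4.5 star rating"></div>'

match = STAR_PATTERN.search(outer_html)
star_rating = match.group(0) if match else None
print(star_rating)
```

The lookbehind and lookahead only mark where the match starts and ends, so the matched text is the rating alone.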

Next, we re-format the data field "Reviews" to remove the redundant words.

  • Choose data field "Review" to reformat
  • Select the "Customize Field"
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Match with Regular Expression"
  • Input the Regular Expression "(.+?)(?=review)" to normalize "Review" data field (or you can use the Regex tool to generate a regex automatically).
  • Click "OK"
  • Note that the value of “Review” has turned into 181.
  • Click "Save"
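
The same expression can be tried in Python. In the sketch below the input string "181 reviews" is an assumed example of the raw field value. In plain Python the match keeps the space before the word "review", so the sketch strips it.

```python
import re

# The regular expression from this step: everything before the word "review".
REVIEW_PATTERN = re.compile(r"(.+?)(?=review)")

# Hypothetical raw value of the "Review" data field.
raw_value = "181 reviews"

match = REVIEW_PATTERN.search(raw_value)
# The lazy group stops right before "review", so it still ends with a space.
review_count = match.group(1).strip() if match else None
print(review_count)
```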

Next, we re-format "Price" to remove the blanks within the expression.

  • Choose data field "Price" to reformat
  • Select the "Customize Field" 
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Match with Regular Expression"
  • Input the Regular Expression "\s+" (or you can use the Regex tool to generate a regex automatically).
  • Click "OK"
  • Click "Save"
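
The expression "\s+" matches one or more whitespace characters. The Python sketch below shows the intended effect, assuming each match is replaced with an empty string. The sample price value and the function name remove_whitespace are assumptions made for this example.

```python
import re


def remove_whitespace(value: str) -> str:
    """Delete every run of whitespace matched by \\s+ (hypothetical helper)."""
    return re.sub(r"\s+", "", value)


# Hypothetical raw value of the "Price" data field.
price = remove_whitespace("$11 - 30")
print(price)
```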

Now we know how to normalize the extracted data. Follow the same steps to re-format other data using the Regex Setting Tool if necessary.

 

Step 8. Set up pagination and adjust the relative loop sequence

As the restaurant listing spans many pages, we need to instruct Octoparse to go through each page and capture the desired data. This is done by setting up "pagination".

  • Click on "Next" to the right of the page numbers
  • When prompted, choose "Loop Click Next Page".

Now we need to manually adjust the nesting order of "Cycle Pages" and "Loop Item", since we need to scrape each page before we click to paginate.

  • Drag "Cycle Pages" out of the "Loop Item"
  • Drag "Loop Item" into "Cycle Pages" 
  • Click "Save"
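
The resulting nesting ("Loop Item" inside "Cycle Pages") behaves like two nested loops: scrape every item on the current page, then move to the next page. The Python sketch below illustrates only that order of operations. The PAGES data and the function name scrape_all are assumptions made for this example, not part of Octoparse.

```python
# Hypothetical stand-in for a paginated listing: each inner list is one results page.
PAGES = [
    ["Restaurant A", "Restaurant B"],
    ["Restaurant C", "Restaurant D"],
    ["Restaurant E"],
]


def scrape_all(pages):
    """Visit every item on a page before moving on to the next page."""
    rows = []
    if not pages:
        return rows
    page_index = 0
    while True:                            # outer loop: "Cycle Pages"
        for name in pages[page_index]:     # inner loop: "Loop Item"
            rows.append(name)
        if page_index + 1 >= len(pages):   # no "Next" button left
            break
        page_index += 1                    # "Click to Paginate"
    return rows


print(scrape_all(PAGES))
```

If the loops were nested the other way around, the task would click "Next" for each item instead of finishing the page first.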

 

Step 9. Set AJAX Timeout

As the web page is loaded with AJAX, we need to set an AJAX timeout for the action.

  • Navigate to "Click to Paginate" action of the outer "Cycle Pages" loop.
  • Check "AJAX Load"
  • Set an AJAX timeout of 2 seconds
  • Click "Save"
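
A timeout puts an upper bound on how long the task waits for AJAX content after a click. The Python sketch below illustrates the general idea of a bounded wait. It is not how Octoparse implements the setting, and the function name wait_for and the simulated 0.2-second delay are assumptions made for this example.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float = 2.0,
             poll_interval: float = 0.05) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return condition()  # one last check at the deadline


# Simulate content that becomes available 0.2 seconds after the "click".
ready_at = time.monotonic() + 0.2
loaded = wait_for(lambda: time.monotonic() >= ready_at, timeout=2.0)
print(loaded)
```

A timeout that is too short risks moving on before the new content appears, while a longer one slows down every page.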

 

Step 10. Create a list of reviews

Now, we will proceed to scrape the reviews and ratings for each restaurant. Similar to how we created the list of restaurants, build a list for the reviews before selecting the specific data fields to extract. 

If the review section is not selected properly at first, use the Expand Window button to expand the selection until the whole section is selected and highlighted.

 

Step 11.  Extract the review data

  • Click on the reviewer's name
  • Select "Extract text"

Now, observe how the reviewer's name has been extracted as a data field.

  • Follow the same steps to extract the other data

 

Step 12. Re-format Extracted Data

Re-format the data field "Star" according to the steps listed earlier.

 

Step 13. Set up pagination for items within inner loop

As there are usually multiple pages of reviews, we need to configure pagination to extract all of them.

 

 

Step 14. Set AJAX Timeout

Same as Step 9: we also need to set an AJAX timeout for the inner "Cycle Pages" action.

 

Step 15. Start running your task

  • After saving your task configuration, click "Next"
  • Choose to run the task locally (on your own machine) or in the cloud (for scheduled and faster extraction). To demonstrate, we'll run a local extraction for now.
  • Select "Local Extraction"
  • Click "OK" to start the task extraction

(Octoparse will automatically extract all the data selected. Check the "Data Extracted" pane for the extraction progress)

 

Step 16. Check extracted data

  • Check the data extracted
  • Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
  • Done!
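
For readers who want to picture the exported result, the Python sketch below writes one row of extracted fields as CSV text. The row simply reuses the values shown in this tutorial for illustration. The column names and the function name to_csv are assumptions made for this example, and the file Octoparse exports may be laid out differently.

```python
import csv
import io

# Illustrative row reusing values from this tutorial; column names are assumed.
rows = [
    {"Name": "Un Je Ne Sais Quoi", "Star_rating": "4.5", "Review": "181"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows, ["Name", "Star_rating", "Review"])
print(csv_text)
```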

Good job on completing the task!

 


 

 

Author: The Octoparse Team


 

 
