Web Scraping Case Study | Scraping Yelp Reviews

Saturday, December 31, 2016 12:39 AM

 

Brief Intro

Octoparse enables you to scrape reviews from yelp.com.    

In this tutorial, we will use Octoparse to scrape all reviews of car audio businesses in Brooklyn, NY, United States from yelp.com.

The website URL we will use is https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY
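
The search itself is encoded in the query string of this URL. As a small illustration outside of Octoparse, the sketch below decodes the two parameters with Python's standard library. The helper name search_params is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

# The example URL from this tutorial
SEARCH_URL = "https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY"

def search_params(url):
    """Return the query parameters of a search URL as a plain dict."""
    query = urlsplit(url).query
    # parse_qs decodes '+' as a space and '%2C' as a comma
    return {key: values[0] for key, values in parse_qs(query).items()}

params = search_params(SEARCH_URL)
print(params)  # {'find_desc': 'car audio', 'find_loc': 'Brooklyn, NY'}
```

Changing find_desc or find_loc in the URL should be enough to point the same task at a different search, assuming the site keeps these parameter names.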

The data fields include the company name, phone number, address, star rating, and the reviewers' comments about each car audio business.

 

Features covered

  • Pagination
  • Building a list
  • Setting AJAX
  • Re-format data fields

You can download the task directly (the .otd file) to begin collecting the data, or you can follow the steps below to build a scraping task for Yelp reviews. (Download my extraction task for this tutorial HERE in case you need it.)

 

Now, let's get started!

 

Step 1. Set up basic information

  • Before you scrape data with pagination, complete the basic information for the task

 

 

Step 2. Navigate to the target website

  • Enter the target URL in the built-in browser (URL of the example: https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY ) 
  • Click the "Go" icon to open the web page

 

 

Step 3. Create a list of items

 Move your cursor over the sections with similar layout that contain the content you want to extract.

  • Click anywhere on the first section on the web page
  • If the selection was not identified properly in the first place, click “Expand the selection area” until the outlined box includes all the content you want to scrape.

  • When prompted, click “Create a list of items” (sections with similar layout)
  • Click “Add current item to the list”

 

 

(Note:

1. If you want to extract information from every page of the search results, you need to add a page navigation action.

2. You can right-click the "Next" pagination link to avoid triggering the link.

3. You can click the "Expand the selection area" button until "Loop click in the element" appears.)

 

Now that the first item has been added to the list, we need to finish adding all the other items.

  • Click “Continue to edit the list”
  • Click a second section with similar layout
  • Click “Add current item to the list” again

Now all the sections with similar layout have been added to the list.

  • Click “Finish Creating List”
  • Click “loop”. This action tells Octoparse to click on each section in the list to extract the selected data

 

Step 4. Select the data to be extracted 

  • Click the data field "Company Name"
  • Select “Extract text”
  • Follow the same steps to extract the other data

 

Step 5. Re-format data fields

Since the star rating has not been extracted properly, we need to re-format the data field “Star-rating” to extract the exact information we want.

  • Choose the Star-Rating data field
  • Select the “Customize Field” button.
  • Choose “Re-format extracted data”
  • From the outer HTML of the data field, we know that the star rating score starts with ‘title=”’ and ends with ‘out of’.
  • Click "Add step"
  • Select “Match with Regular Expression”
  • Enter the Regular Expression: (?<=title=")(.+?)(?=star) to extract the star rating
  • Click “OK”
  • Then the value for the “Star_rating” data field turns into 5.0
  • Click "Save"
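
To see what this regular expression does outside of Octoparse, here is a minimal Python sketch. The outer HTML string is a hypothetical example written for this illustration, so the real markup on yelp.com may differ. The pattern is the one entered above. The captured text ends with a space in this sample, so the sketch strips it.

```python
import re

# Hypothetical outer HTML of the star-rating element (an assumption made
# for this example; the real page markup may be different).
outer_html = '<div class="rating" title="5.0 star rating"></div>'

# The same pattern that is entered in the "Match with Regular Expression" step
STAR_PATTERN = re.compile(r'(?<=title=")(.+?)(?=star)')

def extract_star_rating(html):
    """Return the text between 'title="' and 'star', or None if absent."""
    match = STAR_PATTERN.search(html)
    if match is None:
        return None
    # The lazy group stops right before "star", so trim the trailing space
    return match.group(1).strip()

rating = extract_star_rating(outer_html)
print(rating)  # 5.0
```

The lookbehind and lookahead keep the markers themselves out of the result, which is why only the score remains.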

 

(Note: Octoparse provides a built-in RegEx Tool that can generate regular expressions automatically.

  • Click “Try RegEx Tool”
  • In the RegEx Tool window, check the “Start with” and enter “title=””; check the “End with” and enter “out of”
  • Click “Generate”
  • Click “Match”
  • The matching result is 5.0
  • Click “Apply” )
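
One plausible way to turn a "Start with" and an "End with" marker into a regular expression is to wrap them in a lookbehind and a lookahead. The sketch below shows that idea in Python. It is an assumption about the general technique, not a description of how the Octoparse RegEx Tool is implemented, and both the function name build_between_pattern and the sample string are made up for this example.

```python
import re

def build_between_pattern(start_with, end_with):
    """Build a regex for the shortest text between two literal markers."""
    # re.escape makes sure the markers are treated as plain text
    return re.compile(
        "(?<=" + re.escape(start_with) + ")(.+?)(?=" + re.escape(end_with) + ")"
    )

# Hypothetical sample text (an assumption, not taken from the real page)
sample = 'title="5.0 out of 5"'

pattern = build_between_pattern('title="', "out of")
match = pattern.search(sample)
result = match.group(1).strip() if match else None
print(result)  # 5.0
```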

Step 6. Rename the data fields

  • Rename any data fields if necessary.
  • Click "Save"

 

Step 7. Set up pagination  

  • Click on “Next” to the right of the page numbers
  • Choose “Loop Click Next Page”.

This tells Octoparse to click through each page and repeat the extraction actions.
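
Conceptually, a "Loop Click Next Page" action behaves like a loop that keeps following the "Next" link until there is none. The Python sketch below shows that idea on a small in-memory stand-in for a website. FAKE_SITE, crawl_pages, and the shop names are all hypothetical and only illustrate the control flow.

```python
def crawl_pages(first_page, fetch_page):
    """Collect items page by page, following 'next' until there is none."""
    items = []
    seen = set()  # guards against a 'next' link that loops back
    page_id = first_page
    while page_id is not None and page_id not in seen:
        seen.add(page_id)
        page = fetch_page(page_id)
        items.extend(page["items"])
        page_id = page.get("next")
    return items

# Hypothetical in-memory "site": each page lists items and its next page
FAKE_SITE = {
    "page1": {"items": ["Shop A", "Shop B"], "next": "page2"},
    "page2": {"items": ["Shop C"], "next": "page3"},
    "page3": {"items": ["Shop D"], "next": None},
}

all_items = crawl_pages("page1", FAKE_SITE.__getitem__)
print(all_items)  # ['Shop A', 'Shop B', 'Shop C', 'Shop D']
```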

 

Step 8. Set AJAX Timeout

Octoparse can scrape content that websites load through AJAX.

(An AJAX request updates part of a web page without refreshing the entire page.

The timeout parameter is the amount of time to wait for the AJAX requests to finish before the next step is executed.)

Yelp uses AJAX to load the next page of results, so we need to set an AJAX timeout for the pagination action.

  • Navigate to "Click to Paginate" action
  • Tick "AJAX Load" checkbox
  • Set an AJAX timeout of 2 seconds
  • Click "Save"
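
The general idea behind a timeout like this can be sketched as a wait that gives up after a fixed number of seconds. The Python example below polls a condition until it is true or the timeout expires. It is only an illustration of the concept with a simulated delay. The name wait_for is hypothetical, and Octoparse's own waiting logic may work differently.

```python
import time

def wait_for(condition, timeout=2.0, poll_interval=0.05):
    """Poll condition() until it is true or `timeout` seconds have passed.

    Returns True if the condition was met in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)

# Simulate content that becomes available 0.2 seconds from now
ready_at = time.monotonic() + 0.2
loaded = wait_for(lambda: time.monotonic() >= ready_at, timeout=2.0)
print(loaded)  # True
```

A timeout that is too short moves on before the new content arrives, while one that is too long slows down every page, so a couple of seconds is a reasonable starting point to adjust from.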

 

Step 9. Start running your task

  • After saving your extraction configuration, click “Next”
  • Select “Local Extraction”
  • Click “OK” to run the task on your computer.

Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.

 

Step 10. Check the data and export

  • Check the data extracted
  • Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
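
If you later want to post-process the scraped fields yourself, the records map naturally onto rows and columns. The Python sketch below writes one made-up record to CSV text with the standard csv module. The column names and all values are placeholders invented for this example. They are not real scraped data and not Octoparse's export format.

```python
import csv
import io

# Hypothetical column names based on the data fields in this tutorial
FIELDS = ["company_name", "phone", "address", "rating", "review"]

# Made-up placeholder record (not real data)
records = [
    {
        "company_name": "Example Car Audio",
        "phone": "(555) 010-0000",
        "address": "123 Example St, Brooklyn, NY",
        "rating": "5.0",
        "review": "Placeholder review text.",
    },
]

def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(records, FIELDS)
print(csv_text)
```

Values that contain commas, such as the address, are quoted automatically by the csv module.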

 

Now you have learned how to scrape AJAX content from Yelp and how to re-format data using regular expressions.

 

