Scrape Yelp Review Data

Thursday, December 06, 2018

This tutorial shows how to scrape Yelp review data. We will open the detail page of each coffee shop and scrape the shop name, the reviewer's name, and the comment.

To follow along, you can use the URL from this tutorial:

https://www.yelp.com/search?find_desc=Coffee+%26+Tea&find_loc=Seattle%2C+WA&ns=1
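If the search URL looks opaque, the short sketch below decodes its query string with Python's standard library. The reading of `find_desc` as the search term and `find_loc` as the location is an assumption based on the parameter names, not on Yelp documentation, and `search_params` is a hypothetical helper name.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlsplit

# The search URL used in this tutorial.
SEARCH_URL = (
    "https://www.yelp.com/search"
    "?find_desc=Coffee+%26+Tea&find_loc=Seattle%2C+WA&ns=1"
)


def search_params(url: str) -> Dict[str, str]:
    """Return the decoded query parameters of a URL (first value per key)."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


params = search_params(SEARCH_URL)
print(params)
# {'find_desc': 'Coffee & Tea', 'find_loc': 'Seattle, WA', 'ns': '1'}

# Re-encoding the same parameters reproduces the original query string.
print(urlencode(params))
# find_desc=Coffee+%26+Tea&find_loc=Seattle%2C+WA&ns=1
```

Changing the value of `find_loc` before re-encoding is one plausible way to build a start URL for another city, assuming the site keeps the same parameter scheme.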

 

This tutorial will also cover:

      · Modifying the XPath to locate the cafe name accurately

 

Main steps in the tutorial: [Download demo task file here]

1) "Go To Web Page" - to open the targeted web page

2) Create a pagination loop - to scrape all the results from multiple pages

3) Create a "Loop Item" - to loop click into each item on each list

4) Extract data - loop capture review information on the list for extraction

5) Customize data field by modifying XPath – to improve the accuracy of a certain data field (Optional)

6) Save and start extraction - to run the task and get data

1) "Go To Web Page" - to open the targeted web page

· Create the task with "Advanced Mode".

· Paste the URL into the "Extraction URL" box and click "Save URL" to move on.

2) Create a pagination loop - to scrape all the results from multiple pages

· Scroll down and click the "Next Page" button on the webpage

· Click "Loop click next page" on "Action Tips"

Because this website uses AJAX to load new content, we need to set up "AJAX Load" so that Octoparse does not get stuck.

 · Uncheck "Auto-Retry"

 · Check "AJAX Load" and set up "AJAX Timeout"
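As a conceptual illustration only, the general idea of a timeout for AJAX-loaded content can be sketched as a bounded wait: keep checking for the new content, but give up after a fixed number of seconds instead of waiting forever. This is not Octoparse's implementation, and its "AJAX Timeout" setting may behave differently in detail; `wait_for_content` and `content_ready` are hypothetical names.

```python
import time
from typing import Callable


def wait_for_content(
    content_ready: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.05,
) -> bool:
    """Poll content_ready() until it returns True or timeout_s elapses.

    Returns True if the content appeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if content_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Content that is already there: returns immediately.
print(wait_for_content(lambda: True, timeout_s=0.5))   # True

# Content that never arrives: gives up once the timeout has passed.
print(wait_for_content(lambda: False, timeout_s=0.2))  # False
```

A timeout that is too short risks moving on before the new results have loaded, while one that is too long only slows the run down, which is why the value is worth adjusting to the page.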

3) Create a "Loop Item" - to loop click into each item on each list

We are now on the second page. A "Loop Item" should always be created starting from the first item on the first page, so we need to go back to the first page.

· Click "Go To Web Page" in the workflow.

· Select the pagination loop in the workflow

By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.

· Click the first cafe item

· Click "Select All" on the "Action Tips"

· Select "Loop click each element"

4) Extract data - loop capture review information on the list for extraction

· Click the cafe name on the webpage

· Click "Extract text of selected element" on the Action Tips to extract the cafe's name

Now, let's build a "loop item" to have all reviews captured.

· Click the first and second comment sections consecutively

Octoparse will intelligently identify all the comment sections on the page based on the pattern you've just defined. 

· Click "Extract text of the selected elements"

A "Loop Item" will be automatically generated and added to the workflow. By default, Octoparse automatically extracts data from the selected item. If this is not exactly what you need, you can delete it and add the data fields you want, as follows.

· Delete the unwanted data field

· Select the data you want in the comment area, such as the username, location, and comment

· Click "Extract text of the selected element"

· Click "OK" to save the result
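The workflow built so far is a set of nested loops: the pagination loop, the loop over cafes on each page, and the loop over reviews on each detail page. The sketch below mirrors that nesting in plain Python. It assumes one output row per review with the cafe name repeated on each row; the class names, field names, and sample values are hypothetical placeholders, not real Yelp data or Octoparse internals.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Review:
    username: str
    location: str
    comment: str


@dataclass
class CafePage:
    name: str
    reviews: List[Review]


def extract_rows(result_pages: List[List[CafePage]]) -> Iterator[Dict[str, str]]:
    """Yield one flat row per review, repeating the cafe name on each row."""
    for page in result_pages:            # pagination loop
        for cafe in page:                # "Loop Item" over cafes on the page
            for review in cafe.reviews:  # "Loop Item" over reviews
                yield {
                    "cafe_name": cafe.name,
                    "username": review.username,
                    "location": review.location,
                    "comment": review.comment,
                }


# Placeholder data: two result pages, two cafes, three reviews in total.
SAMPLE_PAGES = [
    [CafePage("Cafe A", [Review("user1", "City X", "comment 1"),
                         Review("user2", "City Y", "comment 2")])],
    [CafePage("Cafe B", [Review("user3", "City Z", "comment 3")])],
]

rows = list(extract_rows(SAMPLE_PAGES))
for row in rows:
    print(row)
```

Repeating the cafe name on every row keeps each row self-describing, so the output can be filtered or grouped by cafe later.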

Tips!

Here is a tutorial for capturing a list of items:

· Getting data - Capture a list of items

5) Customize data field by modifying XPath – to improve the accuracy of a certain data field (Optional)

In this case, the cafe name is not always located in the same place on every detail page. To avoid missing data caused by this inconsistency, we need to modify the XPath in Octoparse so that the element is located precisely on each page.

The revised XPath of the cafe name is:  

 .//*[@id='wrap']/div[2]/div/div[1]/div/div[3]/div[1]/div[1]/h1

· Click "Customize data field"

· Select "Customize XPath"

· Paste the revised XPath into the Matching XPath textbox

· Click "OK" to save the result.
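To see why the position of an element matters for an XPath, the following sketch evaluates two expressions against two simplified pages with Python's standard library. The markup is a hypothetical stand-in, not Yelp's real page structure, and both example expressions are illustrations rather than the XPath recommended in this step. Note that `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical detail pages: page A has an extra banner block before the title.
PAGE_A = (
    "<html><body><div id='wrap'>"
    "<div>banner</div><div><h1>Cafe A</h1></div>"
    "</div></body></html>"
)
PAGE_B = (
    "<html><body><div id='wrap'>"
    "<div><h1>Cafe B</h1></div>"
    "</div></body></html>"
)

# Relies on the title sitting in the second <div> under #wrap.
POSITIONAL = ".//*[@id='wrap']/div[2]/h1"
# Accepts any <h1> below #wrap, wherever it sits.
ANCHORED = ".//*[@id='wrap']//h1"


def extract_text(page: str, xpath: str) -> Optional[str]:
    """Return the text of the first element matching xpath, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


for label, page in (("A", PAGE_A), ("B", PAGE_B)):
    print(label, extract_text(page, POSITIONAL), extract_text(page, ANCHORED))
# A Cafe A Cafe A
# B None Cafe B
```

The positional expression misses the title on page B because the layout differs, which is the kind of gap that shows up as an empty field in the extracted data.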

Tips!

To improve the accuracy of a data field, we highly recommend modifying its XPath in Octoparse. Here are some related tutorials you might need:

· How to associate data with nearby text?

· Data fetched to the incorrect data fields

· Locate elements with XPath

6) Save and start extraction - to run the task and get data

· Click "Start Extraction"

· Select "Local Extraction" to run the task on your computer

 

 

Here is the sample output.

 

 
