The latest version of this tutorial is available here.
In this tutorial, we will show how to scrape Yelp review data. We will open the detail page of each coffee shop and scrape the shop name, the reviewer's name, and the comment.
To follow along, you may want to use the URL in this tutorial:
This tutorial will also cover:
· Modifying XPath to accurately locate the desired data (here, the cafe name)
Main steps in the tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the target web page
· Create the task with "Advanced Mode".
· Paste the URL into the "Extraction URL" box and click "Save URL" to move on.
2) Create a pagination loop - to scrape all the results from multiple pages
· Scroll down and click the "Next Page" button on the webpage
· Click "Loop click next page" on "Action Tips"
Because this website uses AJAX to load new content, we need to set up "AJAX Load" so that Octoparse does not get stuck waiting for the page.
· Uncheck "Auto-Retry"
· Check "AJAX Load" and set up "AJAX Timeout"
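Conceptually, an AJAX timeout caps how long the scraper waits for new content after a click before it moves on. The sketch below illustrates that idea in plain Python. It is not how Octoparse implements the setting, and `wait_for_change`, `read_content`, and the timing values are hypothetical names and numbers chosen for illustration.

```python
import time
from typing import Callable


def wait_for_change(read_content: Callable[[], str],
                    previous: str,
                    timeout_s: float = 5.0,
                    poll_s: float = 0.1) -> bool:
    """Poll until the content differs from `previous` or the timeout expires.

    Returns True if new content appeared, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_content() != previous:
            return True
        time.sleep(poll_s)
    return False


# Simulated page whose content changes on the third read (hypothetical data).
reads = iter(["page 1", "page 1", "page 2"])
changed = wait_for_change(lambda: next(reads), "page 1",
                          timeout_s=1.0, poll_s=0.01)

# Simulated page that never changes: the wait gives up after the timeout.
never_changed = wait_for_change(lambda: "page 1", "page 1",
                                timeout_s=0.05, poll_s=0.01)
```

A timeout that is too short risks moving on before the new results arrive, while one that is too long slows every page turn.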
To learn more about AJAX, please refer to:
3) Create a "Loop Item" - to loop click into each item on each list
We are now on the second page. A "Loop Item" should always start from the first item on the first page, so we need to go back to the first page.
· Click "Go To Web Page" in the workflow.
· Select the pagination loop in the workflow
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
· Click the first cafe item
· Click "Select All" on the "Action Tips"
· Select "Loop click each element"
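The workflow built so far nests the Loop Item inside the pagination loop: every cafe on a page is visited before the next page is opened. A minimal sketch of that execution order follows, assuming a hypothetical in-memory `SITE` structure in place of the real results pages (the page keys and cafe names are made up).

```python
from typing import Dict, Iterator, List, Optional

# Hypothetical stand-in for the paginated results: each page lists its
# cafes and names the page reached by clicking "Next Page" (None = last).
SITE: Dict[str, dict] = {
    "page1": {"cafes": ["Cafe A", "Cafe B"], "next": "page2"},
    "page2": {"cafes": ["Cafe C"], "next": None},
}


def visit_all(start: str) -> Iterator[str]:
    """Yield every cafe in visiting order: all items on a page, then the next page."""
    page: Optional[str] = start
    while page is not None:                # pagination loop
        for cafe in SITE[page]["cafes"]:   # Loop Item: click each element
            yield cafe
        page = SITE[page]["next"]          # "Loop click next page"


visited: List[str] = list(visit_all("page1"))
```

Starting the inner loop from the first page is what keeps the first page's items from being skipped.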
4) Extract data - to capture the review information from each item in a loop
· Click the cafe name on the webpage
· Click "Extract text of selected element" on the Action Tips to extract the cafe's name
Now, let's build a "Loop Item" to capture all the reviews.
· Click the first and second comment sections consecutively
Octoparse will intelligently identify all the comment sections on the page based on the pattern you've just defined.
· Click "Extract text of the selected elements"
A "Loop Item" will be automatically generated and added to the workflow. By default, Octoparse extracts data from the selected item. If this is not exactly what you are looking for, you can delete it and add the data fields you need as follows.
· Delete the unwanted data field
· Select the data you want in the comment area, such as the username, location, and comment
· Click "Extract text of the selected element"
· Click "OK" to save the result
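The extraction steps above amount to reading the cafe name once and then looping over every review block to collect the same fields from each. The sketch below shows that pattern with Python's standard `xml.etree.ElementTree`. The markup in `PAGE`, including the class names, is a hypothetical simplification and does not reflect Yelp's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for a detail page; real Yelp HTML differs.
PAGE = """
<html><body>
  <h1 class="biz-name">Example Cafe</h1>
  <div class="review">
    <span class="user">Alice</span>
    <span class="location">Austin, TX</span>
    <p class="comment">Great espresso.</p>
  </div>
  <div class="review">
    <span class="user">Bob</span>
    <span class="location">Dallas, TX</span>
    <p class="comment">Cozy place.</p>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# The cafe name is extracted once per detail page.
cafe = root.findtext(".//h1[@class='biz-name']")

rows = []
for review in root.findall(".//div[@class='review']"):  # the "Loop Item"
    rows.append({
        "cafe": cafe,
        "user": review.findtext("span[@class='user']"),
        "location": review.findtext("span[@class='location']"),
        "comment": review.findtext("p[@class='comment']"),
    })
```

Each entry in `rows` corresponds to one output row: the cafe name repeated alongside one reviewer's fields.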
Here is a tutorial for capturing a list of items:
5) Customize data field by modifying XPath – to improve the accuracy of a certain data field (Optional)
In this case, the cafe name is not always located in the same place on different detail pages. To avoid missing data caused by this irregular position, we need to modify the XPath in Octoparse so that the element is detected precisely on each page.
The revised XPath of the cafe name is:
· Click "Customize data field"
· Select "Customize XPath"
· Paste the revised XPath into the Matching XPath textbox
· Click "OK" to save the result.
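To see why a position-based XPath can miss data when the layout varies, the sketch below compares it with an attribute-based one on two made-up pages. The markup and both XPath expressions are hypothetical illustrations, not the revised XPath used in this tutorial, and the example uses the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Two hypothetical detail pages: the name sits in a different place on each.
PAGE_A = "<html><body><div><h1 class='biz-name'>Cafe A</h1></div></body></html>"
PAGE_B = ("<html><body><div><p>Ad banner</p></div>"
          "<div><h1 class='biz-name'>Cafe B</h1></div></body></html>")

POSITIONAL = "./body/div[1]/h1"            # depends on the exact page layout
BY_ATTRIBUTE = ".//h1[@class='biz-name']"  # matches wherever the element sits


def extract(page: str, xpath: str) -> Optional[str]:
    """Return the text of the first match, or None if nothing matches."""
    return ET.fromstring(page).findtext(xpath)


positional = [extract(p, POSITIONAL) for p in (PAGE_A, PAGE_B)]
by_attribute = [extract(p, BY_ATTRIBUTE) for p in (PAGE_A, PAGE_B)]
```

On the second page the positional path finds nothing because an extra block shifts the heading, while the attribute-based path still matches.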
Modifying XPath in Octoparse is highly recommended for improving the accuracy of a data field. Here are some related tutorials you might need:
6) Save and start extraction - to run the task and get data
Here is the sample output.