Web Scraping Case Study | Scraping Data from Yelp
Wednesday, April 12, 2017 11:03 AM
In this tutorial, I will walk you through the steps for scraping data from Yelp and normalizing the extracted data fields.
List Features Covered
Now, let's get started.
Step 1. Set up basic information
Step 2. Navigate to the target website
Step 3. Create a list of items
(Tip: If the selection was not identified properly the first time, click anywhere on the section and use the Window Expand button to extend the selection.)
Now that the first item has been added to the list, we need to finish adding the remaining items.
Now all the sections have been added to the list.
Step 4. Select the data to be extracted
After we finish creating the list, we'll be taken to the detail page of each item, where we will extract the data.
Now, we can observe the Restaurant name has been extracted as a data field.
To extract the star rating, if the selection was not identified properly the first time, we need to expand the selection area until the outlined box includes every one of the stars.
Now, the Star Rating's outer HTML has been extracted.
Step 5. Rename the data fields
Step 6. Modify the XPath of data field "Name" in the Loop Item
Whenever a data field is selected for extraction, Octoparse automatically identifies the XPath for that data field. However, the assigned XPath cannot always locate the data field accurately, so we sometimes need to modify the XPath manually to extract the desired data.
In this example, the XPath Octoparse identified for the data field "Name" cannot consistently locate the name of the restaurant. We need to modify the XPath manually to make sure every restaurant name gets extracted.
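To illustrate why an automatically generated XPath can miss some names, here is a minimal Python sketch. The markup, the class name `biz-name`, and both helper functions are assumptions made up for this illustration; they are not Yelp's actual HTML or the XPath Octoparse produces. It contrasts a position-based path with one that matches on an attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing markup; the real Yelp page differs.
LISTING = """
<ul>
  <li><span class="ad-tag">Ad</span><a class="biz-name">Blue Door Cafe</a></li>
  <li><a class="biz-name">Harbor Grill</a></li>
  <li><a class="biz-name">Noodle House</a></li>
</ul>
"""

root = ET.fromstring(LISTING)


def names_by_position(root):
    # Brittle: assumes the name is always the first child of each list item.
    return [li.find("./*[1]").text for li in root.findall("./li")]


def names_by_class(root):
    # More robust: matches on the class attribute, wherever the link sits.
    return [a.text for a in root.findall("./li/a[@class='biz-name']")]


print(names_by_position(root))  # the first item yields the "Ad" label, not the name
print(names_by_class(root))     # every item yields the restaurant name
```

The position-based path breaks as soon as one item carries an extra element, which is the kind of inconsistency that a manual XPath edit is meant to fix.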
Step 7. Re-format Extracted Data
Since the star rating was not selected cleanly and was extracted as outer HTML, we need to re-format the data field "Star-rating" to extract the exact information we want.
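The same kind of clean-up can be sketched with a regular expression in Python. The outer HTML string below and the `title="... star rating"` pattern are assumptions for illustration only; the actual markup and the expression you enter in the Regex Setting Tool may differ.

```python
import re

# Hypothetical outer HTML for a star-rating element; the real markup may differ.
outer_html = (
    '<div class="i-stars" title="4.5 star rating">'
    '<img alt="4.5 star rating"></div>'
)


def extract_rating(html):
    """Return the numeric rating found in the title attribute, or None."""
    match = re.search(r'title="(\d+(?:\.\d+)?) star rating"', html)
    return float(match.group(1)) if match else None


print(extract_rating(outer_html))
```

Returning `None` when nothing matches makes it easy to spot rows where the pattern needs adjusting.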
Now, we continue to re-format the data field "Reviews" to remove the redundant words.
Next, we continue to re-format "Price" to remove the blanks within the expressions.
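As a rough equivalent of these two clean-ups, the Python sketch below strips the extra words from a review count and the blanks from a price expression. The sample strings and the function names `clean_review_count` and `clean_price` are hypothetical; they only mirror the kind of values described above.

```python
import re


def clean_review_count(text):
    """Keep only the number, e.g. '128 reviews' becomes 128; None if no digits."""
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def clean_price(text):
    """Remove all whitespace inside a price expression."""
    return re.sub(r"\s+", "", text)


print(clean_review_count("128 reviews"))
print(clean_price(" $11 - 30 "))
```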
Now we know how to normalize the extracted data. Follow the same steps to re-format other data fields with the Regex Setting Tool if necessary.
Step 8. Set up pagination and adjust the relative loop sequence
As the restaurant listing spans many pages, we need to instruct Octoparse to go through each page and capture the desired data. This is done by setting up "pagination".
Now, we need to adjust the relative nesting sequence between the "Cycle Pages" and "Loop Item" manually, since we need to scrape each page before we click to paginate.
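The nesting order can be pictured as two loops, with the item loop sitting inside the page loop. The following Python sketch uses a made-up in-memory `PAGES` dictionary in place of real web pages; it is only meant to show the order of operations, not how Octoparse runs a task internally.

```python
# Hypothetical paginated listing: each page holds items and a pointer to the next page.
PAGES = {
    1: {"items": ["A", "B"], "next": 2},
    2: {"items": ["C", "D"], "next": 3},
    3: {"items": ["E"], "next": None},
}


def scrape_all(pages, start=1):
    collected = []
    page_no = start
    while page_no is not None:        # outer "Cycle Pages" loop
        page = pages[page_no]
        for item in page["items"]:    # inner "Loop Item": finishes the current page first
            collected.append((page_no, item))
        page_no = page["next"]        # move to the next page only after the page is done
    return collected


print(scrape_all(PAGES))
```

If the order were reversed, the next-page click would happen before the items on the current page had been collected.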
Step 9. Set AJAX Timeout
As the web page is loaded with AJAX, we need to set an AJAX timeout for the action.
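For readers unfamiliar with the idea, a timeout on asynchronously loaded content can be sketched in Python as a polling loop with a deadline. This is a general illustration under our own assumptions (the `wait_for` helper and its parameters are hypothetical); it does not describe how Octoparse implements its AJAX Timeout setting.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.1):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None  # give up: the content did not arrive in time
        time.sleep(interval)


# Usage: a stand-in for content that becomes available on the third check.
checks = {"count": 0}


def content_ready():
    checks["count"] += 1
    return "loaded" if checks["count"] >= 3 else None


print(wait_for(content_ready, timeout=1.0, interval=0.01))
```

A timeout that is too short risks moving on before the data appears, while one that is too long slows down every page.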
Step 10. Create a list of reviews
Now, we will proceed to scrape the reviews and ratings for each restaurant. Similar to how we created the list of restaurants, build a list for the reviews before selecting the specific data fields to extract.
Notice that the review section was not selected properly the first time. Use the Expand Window button to expand the selection until the whole section is selected and highlighted.
Step 11. Extract the review data
Now, observe that the reviewer's name has been extracted as a data field.
Step 12. Re-format Extracted Data
Re-format the data field "Star" according to the steps listed earlier.
Step 13. Set up pagination for items within inner loop
As there are usually multiple pages of reviews, we need to set up pagination for the reviews as well in order to extract all of them.
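With the inner pagination in place, the task has three levels: restaurants, review pages for each restaurant, and reviews on each page. A minimal Python sketch of that structure follows. The `REVIEW_PAGES` and `RESTAURANTS` data and both function names are invented for illustration and stand in for real pages.

```python
def iter_pages(pages, start):
    """Yield each page by following its 'next' key until there is none."""
    key = start
    while key is not None:
        page = pages[key]
        yield page
        key = page.get("next")


# Hypothetical review pages, keyed by an identifier, each pointing to the next page.
REVIEW_PAGES = {
    "grill-1": {"reviews": [("Ann", 5), ("Bo", 4)], "next": "grill-2"},
    "grill-2": {"reviews": [("Cy", 3)], "next": None},
    "cafe-1": {"reviews": [("Di", 4)], "next": None},
}
RESTAURANTS = [("Harbor Grill", "grill-1"), ("Blue Door Cafe", "cafe-1")]


def scrape_reviews(restaurants, review_pages):
    rows = []
    for name, first_page in restaurants:                   # outer loop: restaurants
        for page in iter_pages(review_pages, first_page):  # inner "Cycle Pages"
            for reviewer, stars in page["reviews"]:        # innermost: review items
                rows.append((name, reviewer, stars))
    return rows


for row in scrape_reviews(RESTAURANTS, REVIEW_PAGES):
    print(row)
```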
Step 14. Set AJAX Timeout
As in Step 9, we also need to set an AJAX timeout for the inner "Cycle Pages" action.
Step 15. Start running your task
(Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.)
Step 17. Check extracted data
Good job on completing the task!
Author: The Octoparse Team