Web Scraping Case Study | Scraping Yelp Reviews
Saturday, December 31, 2016 12:39 AM
Octoparse enables you to scrape reviews from yelp.com.
In this tutorial, we will use Octoparse to scrape all reviews of car audio businesses in Brooklyn, NY, United States from yelp.com.
The website URL we will use is https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY
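The search URL above is the Yelp search path plus two query parameters, `find_desc` and `find_loc`. As a minimal sketch using only the Python standard library (the helper name `build_search_url` is hypothetical and not part of Octoparse), the same URL can be rebuilt for other search terms or locations like this:

```python
from urllib.parse import urlencode


def build_search_url(description: str, location: str) -> str:
    """Build a Yelp search URL from a search term and a location."""
    # urlencode escapes spaces as "+" and commas as "%2C"
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"https://www.yelp.com/search?{query}"


url = build_search_url("car audio", "Brooklyn, NY")
print(url)
```

Changing the two arguments gives a starting URL for a different category or city, which can then be pasted into the task.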
The data fields include the company name, phone number, address, rating, and the reviews of each car audio business.
List features covered
You can download the task (the OTD file) directly to begin collecting the data, or you can follow the steps below to build a scraping task for Yelp reviews. (Download my extraction task for this tutorial HERE in case you need it.)
Now, let's get started!
Step 1. Set up basic information
Step 2. Navigate to the target website
Step 3. Create a list of items
Move your cursor over a section with a similar layout, from which you want to extract content.
If the selection was not identified properly in the first place:
1. If you want to extract information from every page of search results, you need to add a page navigation action.
2. You can right-click the "Next" pagination link to prevent triggering the link.
3. You can click the "Expand the selection area" button until "Loop click in the element" appears.
Now that the first item has been added to the list, we need to finish adding all the other items.
Now all the sections with a similar layout have been added to the list.
Step 4. Select the data to be extracted
Step 5. Re-format data fields
Once the star rating has been selected, we need to re-format the data field “Star-rating” to extract the exact information we want.
(Note: Octoparse provides a built-in Regex Tool that can generate regular expressions automatically.)
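To illustrate what re-formatting with a regular expression does, here is a small Python sketch. It assumes the raw "Star-rating" field holds text such as `4.5 star rating`. That sample text and the function name `extract_rating` are assumptions for illustration, and the pattern that Octoparse's Regex Tool generates may differ.

```python
import re
from typing import Optional

# Matches a number such as "4" or "4.5" that is followed by the word "star"
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*star", re.IGNORECASE)


def extract_rating(raw: str) -> Optional[float]:
    """Return the numeric rating found in raw text, or None if there is none."""
    match = RATING_PATTERN.search(raw)
    return float(match.group(1)) if match else None


print(extract_rating("4.5 star rating"))
```

Returning `None` for text without a rating keeps missing values distinct from a real rating of zero.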
Step 6. Rename the data fields
Step 7. Set up pagination
This tells Octoparse to click through to each page and repeat the extraction actions there.
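Conceptually, pagination is a loop that keeps following the "Next" link until there is none. The sketch below shows that idea in Python. It is not how Octoparse is implemented: `iterate_pages`, `find_next`, and the `NEXT_LINKS` table are hypothetical names, and the table stands in for real "Next" links on a site.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(
    first_url: str,
    find_next: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield page URLs, following the "Next" link until there is none."""
    url: Optional[str] = first_url
    visited = set()
    # Stop on a missing link, a repeated page, or the page limit
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        yield url
        url = find_next(url)


# Hypothetical lookup table standing in for the "Next" link on each page
NEXT_LINKS = {"page-1": "page-2", "page-2": "page-3", "page-3": None}

pages = list(iterate_pages("page-1", NEXT_LINKS.get))
print(pages)
```

The `visited` set and `max_pages` limit guard against a "Next" link that loops back to an earlier page.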
Step 8. Set AJAX Timeout
Octoparse enables you to scrape content that websites load through AJAX.
(An AJAX request updates part of a web page without refreshing the entire page.
The timeout parameter is the amount of time to wait for the AJAX requests to finish before the next step is executed.)
Yelp uses AJAX to load more results, so we need to set an AJAX timeout for the action.
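The idea behind a timeout can be sketched as a polling loop: keep checking whether the content has arrived, and give up once the time limit passes. The Python below is only an illustration of that concept, not Octoparse's implementation. `wait_for` and `content_loaded` are hypothetical names, and the AJAX request is simulated with a timer.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float,
             poll_interval: float = 0.05) -> bool:
    """Poll condition() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One last check once the deadline has passed
    return condition()


# Simulated AJAX request that "finishes" 0.2 seconds from now
finish_time = time.monotonic() + 0.2


def content_loaded() -> bool:
    return time.monotonic() >= finish_time


print(wait_for(content_loaded, timeout=1.0))
```

A timeout that is too short skips content that has not finished loading, while one that is too long slows down every page, so the value is a trade-off.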
Step 9. Start running your task
Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.
Step 10. Check the data and export
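For readers who post-process the exported data, the sketch below shows one way the fields from this tutorial could be represented and written out as CSV in Python. The class name `ReviewRecord`, the column names, and the sample row are all assumptions for illustration (the sample values are placeholders, not real Yelp data), and the columns in an actual Octoparse export depend on how you named the fields in Step 6.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass
class ReviewRecord:
    company_name: str
    phone_number: str
    address: str
    rating: float
    review: str


def to_csv(records: Iterable[ReviewRecord]) -> str:
    """Serialize records to CSV text with one header row."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(ReviewRecord)]
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values only, not real Yelp data
sample = [
    ReviewRecord("Example Car Audio", "(000) 000-0000",
                 "1 Example St, Brooklyn, NY", 4.5, "Great install."),
]
print(to_csv(sample))
```

The `csv` module quotes values that contain commas, such as the address, so they survive a round trip through a spreadsheet.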
Now you have learnt how to scrape AJAX content from Yelp and re-format data using regular expressions.