Web Scraping Case Study | Scraping Yelp Reviews
Saturday, December 31, 2016 12:39 AM
Octoparse enables you to scrape reviews from yelp.com.
In this tutorial, we will use Octoparse to scrape all reviews of car audio businesses in Brooklyn, NY, United States from yelp.com.
The website URL we will use is https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY
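The query string in this URL is ordinary URL encoding of the search term and the location. As a minimal sketch (the helper name `build_search_url` is made up for illustration and is not part of Octoparse), the same URL can be rebuilt with Python's standard library, which is handy if you want to adapt the tutorial to another search term or city:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.yelp.com/search"


def build_search_url(description: str, location: str) -> str:
    """Return a search URL for the given search term and location."""
    # urlencode uses quote_plus by default: spaces become "+", "," becomes "%2C".
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"{BASE_URL}?{query}"


url = build_search_url("car audio", "Brooklyn, NY")
print(url)

# Decoding the query string recovers the original search term and location.
params = parse_qs(urlsplit(url).query)
print(params)
```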
The data fields include the company name, phone number, address, rating, and customer reviews of each car audio business.
Features covered:
- Building a list
- Setting AJAX
- Re-formatting data fields
You can download the task directly (the .otd file) to begin collecting the data, or you can follow the steps below to build a scraping task that scrapes Yelp reviews.
Now, let's get started!
Step 1. Set up basic information
- Complete the task's basic information before you configure the extraction
Step 2. Navigate to the target website
- Enter the target URL in the built-in browser (URL of the example: https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY )
- Click the "Go" icon to open the webpage
Step 3. Create a list of items
Move your cursor over the sections with a similar layout, which contain the content you want to extract.
- Click anywhere on the first section on the web page
- If the selection was not identified properly at first, click “Expand the selection area” until the outlined box includes all the content you want to scrape
- When prompted, click “Create a list of items” (sections with similar layout)
- Click “Add current item to the list”
(Notes:
1. If you want to extract information from every page of search results, you need to add a page navigation action.
2. You can right-click the "Next" pagination link to avoid triggering the link.
3. You can click the "Expand the selection area" button until "Loop click in the element" appears.)
Now that the first item has been added to the list, we need to add the remaining items.
- Click “Continue to edit the list”
- Click a second section with similar layout
- Click “Add current item to the list” again
Now all the sections with a similar layout have been added to the list.
- Click “Finish Creating List”
- Click “Loop”. This action tells Octoparse to click on each section in the list to extract the selected data
Step 4. Select the data to be extracted
- Click the data field "Company Name"
- Select “Extract text”
- Follow the same steps to extract the other data
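Conceptually, Steps 3 and 4 build a loop over sections that share a layout and then pull the text of each field out of every section. The sketch below shows the same idea in plain Python. The markup, class names, and sample values are hypothetical and heavily simplified; the real Yelp page is structured differently, and this is not how Octoparse works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two sections with the same layout.
SAMPLE = """
<ul>
  <li class="result">
    <span class="name">Shop A</span>
    <span class="phone">(718) 555-0101</span>
    <span class="address">1 Example St, Brooklyn, NY</span>
  </li>
  <li class="result">
    <span class="name">Shop B</span>
    <span class="phone">(718) 555-0102</span>
    <span class="address">2 Example Ave, Brooklyn, NY</span>
  </li>
</ul>
"""

FIELDS = ("name", "phone", "address")


def extract_items(markup: str) -> list:
    """Loop over every similar section and extract the text of each field."""
    root = ET.fromstring(markup.strip())
    rows = []
    # "Create a list of items": select every element with the shared layout.
    for item in root.findall(".//li[@class='result']"):
        row = {}
        for field in FIELDS:
            # "Extract text": take the text content of the matching child.
            node = item.find(f"./span[@class='{field}']")
            row[field] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


rows = extract_items(SAMPLE)
for row in rows:
    print(row)
```

A missing field yields an empty string instead of an error, so one incomplete section does not stop the loop.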
Step 5. Re-format data fields
After the star rating has been selected, we need to re-format the “Star-rating” data field to extract the exact value we want.
- Choose the Star-Rating data field
- Select the “Customize Field” button.
- Choose “Re-format extracted data”
- From the outer HTML of the data field, we know that the star rating score starts with ‘title="’ and ends with ‘out of’.
- Click "Add step"
- Select “Match with Regular Expression”
- Enter the Regular Expression: (?<=title=")(.+?)(?=star) to extract the star rating
- Click “OK”
- Then the value for the “Star_rating” data field turns into 5.0
- Click "Save"
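To see what the regular expression from this step does, here is a small Python sketch that applies it to a sample string. The `outer_html` value is a hypothetical example of what the element's outer HTML might look like; check the actual outer HTML in Octoparse before relying on it.

```python
import re

# Hypothetical outer HTML for the star-rating element; real markup may differ.
outer_html = '<i class="star-img" title="5.0 star rating"></i>'

# The expression from the tutorial: the shortest text after 'title="'
# and before the word "star".
STAR_PATTERN = re.compile(r'(?<=title=")(.+?)(?=star)')


def extract_rating(html: str):
    """Return the rating text, or None if the pattern does not match."""
    match = STAR_PATTERN.search(html)
    # The lazy group stops right before "star", so it can end with a space.
    return match.group(1).strip() if match else None


rating = extract_rating(outer_html)
print(rating)
```

The lookbehind and lookahead keep the markers themselves out of the match, so only the number remains once the surrounding whitespace is stripped.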
(Note: Octoparse provides a built-in RegEx Tool that can generate regular expressions automatically.
- Click “Try RegEx Tool”
- In the RegEx Tool window, check “Start with” and enter ‘title="’; check “End with” and enter ‘out of’
- Click “Generate”
- Click “Match”
- The matching result is 5.0
- Click “Apply” )
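A "start with / end with" generator can be approximated in a few lines of Python. This is only one plausible way to build such a pattern and is not necessarily what the Octoparse RegEx Tool does; the function names and both sample strings are hypothetical. The tutorial mentions both "star" and "out of" as end markers, and which one matches depends on the actual outer HTML of the element.

```python
import re


def build_between_pattern(start: str, end: str):
    """Build a regex matching the shortest text between two literal markers."""
    # re.escape makes the markers literal, so characters such as "." or "("
    # in a marker are not interpreted as regex syntax.
    return re.compile(f"(?<={re.escape(start)})(.+?)(?={re.escape(end)})")


def match_between(text: str, start: str, end: str):
    """Return the stripped text between the markers, or None if absent."""
    match = build_between_pattern(start, end).search(text)
    return match.group(1).strip() if match else None


# Hypothetical sample strings, one for each end marker named in the tutorial.
print(match_between('title="5.0 star rating"', 'title="', "star"))
print(match_between('title="5.0 out of 5"', 'title="', "out of"))
```

If the generated expression returns no match in the tool, the end marker probably does not occur in the element's outer HTML and should be adjusted.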
Step 6. Rename the data fields
- Rename any fields if necessary.
- Click "Save"
Step 7. Set up pagination
- Click on “Next” to the right of the page numbers
- Choose “Loop Click Next Page”
This tells Octoparse to open each page in turn and repeat the extraction actions on it.
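The pagination loop boils down to "extract the current page, then follow the Next link until there is none". The sketch below illustrates that control flow with an in-memory stand-in for the website; `FAKE_SITE`, the page keys, and the `max_pages` safety limit are assumptions made for illustration and are not Octoparse settings.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each page yields its items plus the key of the next page (None on the last).
Page = Tuple[List[str], Optional[str]]

# Hypothetical in-memory "site" standing in for real search result pages.
FAKE_SITE: Dict[str, Page] = {
    "page-1": (["Shop A", "Shop B"], "page-2"),
    "page-2": (["Shop C"], "page-3"),
    "page-3": (["Shop D"], None),
}


def scrape_all_pages(
    start: str,
    fetch: Callable[[str], Page],
    max_pages: int = 50,
) -> List[str]:
    """Collect items page by page until there is no Next link."""
    results: List[str] = []
    url: Optional[str] = start
    pages_seen = 0
    # max_pages guards against a Next link that never ends or loops back.
    while url is not None and pages_seen < max_pages:
        items, url = fetch(url)
        results.extend(items)
        pages_seen += 1
    return results


all_items = scrape_all_pages("page-1", FAKE_SITE.__getitem__)
print(all_items)
```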
Step 8. Set AJAX Timeout
Octoparse can scrape websites that load content with AJAX.
(An AJAX request updates part of a web page without refreshing the entire page.
The timeout parameter is the amount of time to wait for the AJAX requests to finish before the next step is executed.)
Yelp uses AJAX to load more search results, so we need to set an AJAX timeout for the pagination action.
- Navigate to the "Click to Paginate" action
- Tick the "AJAX Load" checkbox
- Set an AJAX timeout of 2 seconds
- Click "Save"
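The idea behind a timeout is a bounded wait: give the page some time to finish loading, but never wait forever. The following sketch shows one common way to express that in Python by polling a condition until a deadline. It illustrates the concept only; the function `wait_until` and its parameters are hypothetical, and Octoparse may implement its AJAX timeout differently.

```python
import time


def wait_until(condition, timeout: float = 2.0, interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulate content that becomes available 0.2 seconds after the click.
clicked_at = time.monotonic()
loaded = wait_until(lambda: time.monotonic() - clicked_at > 0.2, timeout=2.0)

# Content that never arrives: the wait gives up once the timeout expires.
never_loaded = wait_until(lambda: False, timeout=0.2)

print(loaded, never_loaded)
```

A timeout that is too short risks moving on before the new results appear, while a longer one slows down every page, so a value of a few seconds is a trade-off between the two.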
Step 9. Start running your task
- After saving your extraction configuration, click “Next”
- Select “Local Extraction”
- Click “OK” to run the task on your computer.
Octoparse will automatically extract all the selected data. Check the "Data Extracted" pane for the extraction progress.
Step 10. Check the data and export
- Check the data extracted
- Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer
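If you post-process the data outside Octoparse, the exported rows map naturally onto a CSV file that Excel can open. The sketch below writes rows with the tutorial's data fields to CSV using Python's standard library; the column names and the sample row are made up for illustration and do not come from a real extraction.

```python
import csv
import io

# Hypothetical column names mirroring the data fields in this tutorial.
FIELDNAMES = ["company_name", "phone", "address", "star_rating", "review"]


def rows_to_csv(rows) -> str:
    """Serialize extracted rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up row; the commas in the address are quoted automatically.
sample_rows = [
    {
        "company_name": "Shop A",
        "phone": "(718) 555-0101",
        "address": "1 Example St, Brooklyn, NY",
        "star_rating": "5.0",
        "review": "Great install.",
    }
]

csv_text = rows_to_csv(sample_rows)
print(csv_text)

# Reading the text back recovers the same rows.
round_trip = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
```

To save to disk instead, the same writer can be pointed at a file opened with `open(path, "w", newline="", encoding="utf-8")`.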
Now you have learned how to scrape AJAX content from Yelp and how to re-format data using regular expressions.
Check out similar case studies:
- Scrape AJAX Pages from The Washington Post
- Scrape Article Information from Google Scholar
- How to Scrape Wordpress Posts
Or learn more about pagination:
- Pagination Loop issue: The extraction stops after 3 pages
- Create A Loop For Pagination Manually
- Pagination Scraping: Configure “Loop click next page” When It Can’t Be Detected
- Scraping from multi-pages: pagination with "Next" button