Scraping Yelp Reviews

Saturday, December 31, 2016 12:39 AM

 

Octoparse enables you to scrape reviews from yelp.com.    

 

In this tutorial, we will scrape all reviews of car audio shops in Brooklyn, NY, United States, from yelp.com with Octoparse.

The website URL we will use is https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY
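
The query string of this URL carries the search term and the location. As a small illustration outside Octoparse, the same URL can be assembled with Python's standard library. The function and variable names below are assumptions made for this sketch only.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/search"


def build_search_url(description: str, location: str) -> str:
    """Return a Yelp search URL for the given search term and location."""
    # urlencode escapes spaces as "+" and commas as "%2C".
    query = urlencode({"find_desc": description, "find_loc": location})
    return f"{BASE_URL}?{query}"


search_url = build_search_url("car audio", "Brooklyn, NY")
print(search_url)
```

Changing the two arguments is enough to point the same task at another search term or city.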

The data fields include the company name, phone number, address, car audio type, customer name, and the customer's review of the car audio shop.
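
To make the shape of one output row concrete, here is a minimal sketch of the fields as a Python dataclass. The class name, the field names, and the placeholder values are assumptions for illustration; Octoparse lets you name the fields yourself.

```python
from dataclasses import dataclass, asdict


@dataclass
class ReviewRecord:
    """One extracted row: shop details plus a single customer review."""
    company_name: str
    phone_number: str
    address: str
    car_audio_type: str
    customer_name: str
    review: str


# Placeholder values for illustration only; this is not real Yelp data.
example = ReviewRecord(
    company_name="Example Car Audio",
    phone_number="(000) 000-0000",
    address="1 Example St, Brooklyn, NY",
    car_audio_type="Car Stereo Installation",
    customer_name="Customer 1",
    review="Placeholder review text.",
)

# The field names double as column headers in the exported file.
columns = list(asdict(example).keys())
print(columns)
```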

 

You can download the task (the .otd file) directly to begin collecting the data, or you can follow the steps below to build a scraping task for Yelp reviews. (Download my extraction task of this tutorial HERE just in case you need it.)

 

Step 1. Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete the basic information ➜ Click "Next".

 

 

Step 2. Enter the target URL in the built-in browser ➜ Click the "Go" icon to open the webpage.

 

(URL of the example: https://www.yelp.com/search?find_desc=car+audio&find_loc=Brooklyn%2C+NY ) 

 

 

Step 3. Right-click the "Next" pagination link ➜ Choose "Loop click in the element" to turn the page.

 

 

(Note:

1. If you want to extract information from every page of search results, you need to add a page navigation action.

2. Right-click the "Next" pagination link rather than left-clicking it, so that the link is not triggered.

3. If "Loop click in the element" does not appear, click the "Expand the selection area" button until it does.)
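
The page navigation action amounts to a loop that keeps following the "Next" link until there is none. Here is a minimal sketch of that idea in Python, assuming a hypothetical `load_page` callable that returns the items on a page and the address of the next page. It uses an in-memory fake site instead of real requests and does not describe Octoparse internals.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (items found on the page, address of the next page or None on the last page)
Page = Tuple[List[str], Optional[str]]


def cycle_pages(start: str,
                load_page: Callable[[str], Page],
                max_pages: int = 100) -> Iterator[List[str]]:
    """Yield the items of each page, following "Next" until it runs out."""
    address: Optional[str] = start
    visited = 0
    # max_pages guards against a site whose "Next" link never ends.
    while address is not None and visited < max_pages:
        items, next_address = load_page(address)
        yield items
        address = next_address
        visited += 1


# Hypothetical two-page site used in place of real HTTP requests.
FAKE_SITE = {
    "page1": (["shop A", "shop B"], "page2"),
    "page2": (["shop C"], None),
}

all_items = [item
             for page in cycle_pages("page1", FAKE_SITE.__getitem__)
             for item in page]
print(all_items)
```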

 

Step 4. Move your cursor over the sections with a similar layout, from which you want to extract data.

 

Click the first item ➜ Click "Create a list of items" (sections with a similar layout) ➜ Click "Add current item to the list".

 

The first item has now been added to the list. ➜ Click "Continue to edit the list".

 

Click the second item ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements from each page.
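
Adding two sample items is enough for a tool to infer a pattern that matches every item in the list. One plausible way to do this, shown here as an assumption rather than as Octoparse's actual method, is to compare the XPaths of the two samples and drop the positional index where they differ. The paths below are hypothetical.

```python
import re


def generalize_xpath(first: str, second: str) -> str:
    """Merge two sample XPaths into one path that matches all siblings."""
    first_steps = first.split("/")
    second_steps = second.split("/")
    if len(first_steps) != len(second_steps):
        raise ValueError("the two sample paths must have the same depth")

    merged = []
    for a, b in zip(first_steps, second_steps):
        if a == b:
            merged.append(a)
            continue
        # The steps differ: strip a trailing positional index such as "[2]".
        name_a = re.sub(r"\[\d+\]$", "", a)
        name_b = re.sub(r"\[\d+\]$", "", b)
        if name_a != name_b:
            raise ValueError("the sample paths point at different elements")
        merged.append(name_a)
    return "/".join(merged)


pattern = generalize_xpath("/html/body/ul/li[1]/a", "/html/body/ul/li[2]/a")
print(pattern)
```

The merged path no longer pins a single list entry, so it selects the link in every entry of the list.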

 

 

Note:

We can tick the “Scroll Down” and “Page Acceleration” options to load the web page completely. Then click “Save”.

 

 

Step 5. Extract the information about the car audio shop.

 

If the page's URL still shows as loading after all of the page content has loaded, click the multiplication sign (×) to stop the loading.

 

Extract the phone number of the car audio shop: click the phone number ➜ Select "Extract text". Other content can be extracted in the same way. Add the current page URL as a data field.

All the extracted content will appear in Data Fields. ➜ Click the "Field Name" to modify it. Then click “Save”.
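
"Extract text" takes the text of the selected element and stores it under a field name. The sketch below shows the same idea in Python on a small, well-formed sample. The markup, class names, field names, and page URL are all hypothetical and do not reflect Yelp's real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a business page.
SAMPLE_PAGE = """
<div class="business">
  <h1 class="name">Example Car Audio</h1>
  <span class="phone"> (000) 000-0000 </span>
  <address>1 Example St, Brooklyn, NY</address>
</div>
"""


def extract_text(root: ET.Element, path: str) -> str:
    """Return the trimmed text of the first element matching path, or ""."""
    element = root.find(path)
    if element is None or element.text is None:
        return ""
    return element.text.strip()


root = ET.fromstring(SAMPLE_PAGE.strip())
record = {
    "company_name": extract_text(root, ".//h1[@class='name']"),
    "phone_number": extract_text(root, ".//span[@class='phone']"),
    "address": extract_text(root, ".//address"),
    # The current page URL is stored as an extra field (placeholder value).
    "page_url": "https://example.com/biz/placeholder",
}
print(record)
```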

 

 

Step 6. Extract the reviews about the car audio shop.

 

Step 6-1. Right-click the "Next" pagination link ➜ Choose "Loop click in the element" to turn the page.

 

 

Step 6-2. Move your cursor over the sections with a similar layout, from which you want to extract data.

 

Click the first section ➜ Click "Create a list of items" (sections with a similar layout) ➜ Click "Add current item to the list".

 

The first section has now been added to the list. ➜ Click "Continue to edit the list".

 

Click the second section ➜ Click "Add current item to the list" again. Now we have all the sections with a similar layout. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements from each page.

 

Step 6-3. Extract the reviews.

 

Click the customer name ➜ Select "Extract text". Reviews can be extracted in the same way.

All the extracted content will appear in Data Fields. ➜ Click the "Field Name" to modify it. Then click "Save".

 

 

 

Step 7. Drag the second "Loop Item" box before the "Click to paginate" action of the second “Cycle Pages” box so that we can grab all the reviews about the car audio shop from multiple pages.

 

Step 8. Drag the second "Loop Item" box before the "Click to paginate" action of the first “Cycle Pages” box so that we can grab all the reviews about all the car audio shops from multiple pages.
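
Steps 7 and 8 nest the review loop inside the shop loop, and each of those loops inside its own pagination loop. The nesting can be pictured as four nested loops. Here is a minimal sketch over in-memory placeholder data; the function name, dictionary keys, and sample values are assumptions for illustration.

```python
def run_workflow(search_pages):
    """Flatten search pages, shops, review pages, and reviews into rows."""
    rows = []
    for search_page in search_pages:                      # first "Cycle Pages" box
        for shop in search_page:                          # first "Loop Item" box
            # "Click Item" and the first "Extract Data": shop-level fields.
            details = {"company_name": shop["name"],
                       "phone_number": shop["phone"]}
            for review_page in shop["review_pages"]:      # second "Cycle Pages" box
                for review in review_page:                # second "Loop Item" box
                    # Second "Extract Data": one output row per review.
                    rows.append({**details,
                                 "customer_name": review["customer"],
                                 "review": review["text"]})
    return rows


# Placeholder data: two search pages, one shop each.
SAMPLE = [
    [{"name": "Shop A", "phone": "(000) 000-0001",
      "review_pages": [
          [{"customer": "Customer 1", "text": "placeholder review 1"},
           {"customer": "Customer 2", "text": "placeholder review 2"}],
          [{"customer": "Customer 3", "text": "placeholder review 3"}],
      ]}],
    [{"name": "Shop B", "phone": "(000) 000-0002",
      "review_pages": [
          [{"customer": "Customer 4", "text": "placeholder review 4"}],
      ]}],
]

rows = run_workflow(SAMPLE)
print(len(rows))
```

Each row repeats the shop-level fields next to one review, which is why the shop details appear once per review in the exported data.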

 

 

Step 9. Check the workflow.

 

Now we need to check the workflow by clicking each action in order, starting from the beginning of the workflow.

Go to the webpage ➜ The first Cycle Pages box ➜ The first Loop Item box ➜ Click Item ➜ Extract Data ➜ The second Cycle Pages box ➜ The second Loop Item box ➜ Extract Data ➜ Click to Paginate ➜ Click to Paginate.

While checking the workflow, we need to set a longer timeout for every action except the "Go To Web Page" action, and set an AJAX timeout for the two “Click to Paginate” actions.
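
A timeout of this kind gives content that loads after a click a bounded amount of time to appear. One common way to implement such a wait is to poll a condition until it holds or the time runs out. The helper below is a generic sketch with hypothetical names and is not a description of how Octoparse implements its timeouts.

```python
import time


def wait_until(condition, timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll condition() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated page whose content becomes available on the third check.
calls = {"count": 0}


def content_loaded() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


loaded = wait_until(content_loaded, timeout=2.0, interval=0.01)
print(loaded, calls["count"])
```

A timeout that is too short cuts extraction off before the content arrives, while a longer one only costs time on pages that load slowly.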

 

 

Step 10. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.

 

 

Step 11. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
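
If you post-process the results yourself, the same rows can be written as CSV, which Excel opens directly. Here is a minimal sketch using Python's standard `csv` module; the function name, field names, and placeholder row are assumptions for illustration.

```python
import csv
import io


def rows_to_csv(rows, fieldnames) -> str:
    """Serialize a list of dict rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row; the comma in the review forces CSV quoting.
rows = [{"company_name": "Example Car Audio",
         "customer_name": "Customer 1",
         "review": "Placeholder review, with a comma."}]

csv_text = rows_to_csv(rows, ["company_name", "customer_name", "review"])
print(csv_text)
```

To save a file instead of a string, pass a file opened with `open(path, "w", newline="")` to `csv.DictWriter` in place of the in-memory buffer.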

 

 

 

Author: The Octoparse Team

 

 

 
