
Scrape business information from Yelp

Wednesday, April 12, 2017 11:03 AM


 

Yelp is one of the largest business directory websites on the Internet. In this tutorial, we will show you how to collect business information on Yelp. For Yelp scraping, you could use our ready-to-use Task Template available on the home page or follow this tutorial to build the task from scratch. You can also check out the video below.

 

 

[Video: how to scrape Yelp data]

 


 

To demonstrate, we will use this URL as an example: https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1
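For readers who want to point the same task at another city, the sketch below decodes the query string of the example URL and swaps the location. It is only an illustration: the helper names `search_params` and `with_location` are hypothetical, and it assumes Yelp keeps reading the location from the `find_loc` parameter as it does in the example URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# The example URL used in this tutorial.
EXAMPLE_URL = "https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1"


def search_params(url):
    """Return the query parameters of a search URL as a plain dict."""
    query = urlsplit(url).query
    # keep_blank_values=True preserves the empty find_desc parameter.
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


def with_location(url, location):
    """Return a copy of the URL with find_loc replaced (hypothetical helper)."""
    parts = urlsplit(url)
    params = search_params(url)
    params["find_loc"] = location
    return urlunsplit(parts._replace(query=urlencode(params)))


params = search_params(EXAMPLE_URL)
portland_url = with_location(EXAMPLE_URL, "Portland, OR")
```

The resulting URL can then be pasted into Octoparse in step 1 in place of the Seattle one.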

 

Here are the main steps in this tutorial: [Download demo task file here]

1. Go to Web Page - to open the target webpage

2. Auto-detect webpage - to create the workflow

3. Create Click Item - to go into detail pages

4. Adjust settings for Pagination and Loop Item

5. Extract Data - to get data from the detail pages

6. Set up wait time - to control the scraping speed

7. Run task - to get the data

 

1. Go to Web Page - to open the target webpage

  • Paste the URL and click Start

 

2. Auto-detect webpage - to create the workflow

  • Select Auto-detect web page data
  • Wait for the detection to complete, then select Create workflow
  • Go to Data preview, double-click on a header to rename the field, or click ... to delete a field

 

3. Create Click Item - to go into detail pages

  • Select Click a link on the web page
  • Select the first URL from the drop-down menu (you can confirm that it is the correct link in Data Preview)
  • Confirm the setup

 

You will notice that a "Click URLs in the list" action has been created in the workflow.

 

4. Adjust settings for Pagination and Loop Item

Yelp loads its result pages with AJAX, so we need to set up AJAX loading for the pagination. The auto-generated XPaths for Pagination and Loop Item do not always work well, so we also replace them with more reliable ones.

  • Click on Click to Paginate - adjust Timeout to 10s
  • Click Apply to save the changes
  • Click on Pagination - paste the XPath //a[contains(@class,'next')]
  • Click Apply to save
  • Click on Loop Item - paste the XPath //*[@id="main-content"]/div/ul/li//h4/ancestor::li
  • Click Apply to confirm

 

Tip!

To learn more about XPath, you can check this tutorial: What is XPath and how to use it in Octoparse
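To make it easier to see what the two XPaths above select, here is a rough sketch that applies the same logic to a small, made-up page using only Python's standard library. The markup in `SAMPLE_PAGE` is hypothetical and not Yelp's real HTML. Because `xml.etree.ElementTree` supports only a subset of XPath (no `contains()` and no `ancestor::` axis), those two parts are approximated in plain Python, so treat this as an illustration of the idea rather than what Octoparse runs internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for a search results page.
SAMPLE_PAGE = """
<body>
  <div id="main-content">
    <div>
      <ul>
        <li><h4><a href="/biz/cafe-one">Cafe One</a></h4></li>
        <li><p>Sponsored results</p></li>
        <li><h4><a href="/biz/cafe-two">Cafe Two</a></h4></li>
      </ul>
    </div>
  </div>
  <a class="pagination-link" href="/search?start=0">1</a>
  <a class="next-link pagination-link" href="/search?start=10">Next</a>
</body>
"""


def listing_items(root):
    """Approximate //*[@id="main-content"]/div/ul/li//h4/ancestor::li.

    Keeps only the list items that contain an h4 (the business name),
    which skips separators and other non-listing rows.
    """
    items = root.findall(".//*[@id='main-content']/div/ul/li")
    return [li for li in items if li.find(".//h4") is not None]


def next_page_links(root):
    """Approximate //a[contains(@class,'next')] with a substring check."""
    return [a for a in root.iter("a") if "next" in a.get("class", "")]


root = ET.fromstring(SAMPLE_PAGE)
names = ["".join(li.itertext()).strip() for li in listing_items(root)]
next_hrefs = [a.get("href") for a in next_page_links(root)]
```

In this sample, the middle list item has no `h4`, so it is left out of the loop, and only the link whose class contains `next` is treated as the pagination button.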

 

5. Extract Data - to get data from the detail pages

  • Select the information on the web page then click Extract text of the selected element. Repeat the steps to extract all the data you need

 

Yelp does not display the rating as text on the page; it is shown as star icons. To get the rating, we need to extract an attribute value from the HTML code instead.

  • Click on the rating stars on the page

 

  • Choose Extract the text of the element
  • Click ... - choose Customize field - choose Extract attribute - choose aria-label 
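The extracted aria-label is a piece of text rather than a number. If you want a numeric rating after export, a small post-processing step like the sketch below can help. It assumes the label looks like "4.5 star rating"; the tutorial only confirms that the label contains "star rating", so the exact number format is an assumption, and `parse_rating` is a hypothetical helper name.

```python
import re

# Assumed label format: a number followed by "star rating", e.g. "4.5 star rating".
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*star rating")


def parse_rating(aria_label):
    """Return the rating as a float, or None if the label does not match."""
    match = RATING_PATTERN.search(aria_label)
    return float(match.group(1)) if match else None


rating = parse_rating("4.5 star rating")
missing = parse_rating("No reviews yet")
```

Returning `None` for labels that do not match keeps rows without a rating from breaking the rest of the clean-up.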

 

 

Some business pages do not display a phone number or a website address, so the position of each piece of information can change from page to page. To make each field always locate the correct info, we need to modify its XPath.

  • Switch to Vertical View
  • Double-click on the XPath of a field and paste in the matching XPath from the list below

 

We have prepared some useful XPaths for Yelp pages.

Business Website: //p[text()='Business website']/following-sibling::p[1]

Phone Number: //p[text()='Phone number']/following-sibling::p[1]

Address: //address

Business Owner: //p[text()='Business Owner']/../preceding-sibling::p[1]

Rating stars: //h1/../following-sibling::div//div[contains(@aria-label,'star rating')]
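The Business Website and Phone Number XPaths work by finding the label first and then taking the next `p` element after it, so they do not depend on where the field sits on the page. The sketch below mimics that idea with Python's standard library on two made-up snippets. The markup is hypothetical, `value_after_label` is an illustrative helper, and because `xml.etree.ElementTree` has no `following-sibling::` axis, the sibling lookup is done by hand (it also trims whitespace around the label, which the XPath itself does not).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one page lists a website and a phone number,
# the other lists only a phone number, so the phone block changes position.
WITH_WEBSITE = """
<section>
  <div><p>Business website</p><p>example.com</p></div>
  <div><p>Phone number</p><p>(206) 555-0100</p></div>
</section>
"""

PHONE_ONLY = """
<section>
  <div><p>Phone number</p><p>(206) 555-0100</p></div>
</section>
"""


def value_after_label(root, label):
    """Mimic //p[text()='<label>']/following-sibling::p[1]."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "p" and (child.text or "").strip() == label:
                # Take the first <p> that follows the label under the same parent.
                for sibling in children[index + 1:]:
                    if sibling.tag == "p":
                        return "".join(sibling.itertext()).strip()
    return None  # the page does not show this field


full_page = ET.fromstring(WITH_WEBSITE)
short_page = ET.fromstring(PHONE_ONLY)

phone_full = value_after_label(full_page, "Phone number")
phone_short = value_after_label(short_page, "Phone number")
website_short = value_after_label(short_page, "Business website")
```

The phone number is found on both pages even though its position differs, and the missing website simply comes back empty instead of picking up the wrong value.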

 

6. Set up wait time - to control the scraping speed

Yelp uses anti-scraping measures and may block your IP if you scrape too fast. We need to slow down the scraping by setting a wait time.

  • Select Extract Data in the workflow
  • Go to the Options section
  • Tick Wait before action and set it to 10s
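Conceptually, Wait before action just pauses for a fixed time before each step runs. The sketch below shows that idea in plain Python. It is not how Octoparse implements the option; `run_with_wait` and its parameters are hypothetical names, and the `sleep` argument exists only so the pause can be swapped out when trying the code.

```python
import time


def run_with_wait(actions, wait_seconds=10, sleep=time.sleep):
    """Run each action in order, pausing before every one of them."""
    results = []
    for action in actions:
        sleep(wait_seconds)  # the "wait before action" pause
        results.append(action())
    return results


# Record the pauses instead of really sleeping, to keep the demo fast.
waits = []
pages = run_with_wait(
    [lambda: "page 1", lambda: "page 2"],
    wait_seconds=10,
    sleep=waits.append,
)
```

With a 10-second wait, every extracted page adds at least 10 seconds to the run, which is the trade-off for a lower chance of being blocked.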

 

 

Below is what the final workflow looks like. Once everything is in place, you can run the task.

[Image: final workflow]

 

7. Run task - to get the data

  • Run the task from the top right corner
  • Select Run task on your device to run the task locally, or select Run task in the cloud to run it in the Cloud (for premium users only)

 

Here is the sample output:

[Image: sample output]


 

Happy Data Hunting!

Author: The Octoparse Team
