
Scrape business information from Yelp

Monday, January 9, 2017 1:06 AM


Yelp is one of the largest business directory websites on the Internet. In this tutorial, we will show you how to collect business information from Yelp. To scrape Yelp, you can either use our ready-to-use Task Template, available on the home page, or follow this tutorial to build the task from scratch.

You can also check out the video below:

[Video: how to scrape Yelp data]

 

 

To demonstrate, we will use this URL as an example: https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA&ns=1
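If you want to try a different city, the query string in the example URL can be rebuilt programmatically. The sketch below is only an illustration and is not part of Octoparse: the function name `build_yelp_search_url` is hypothetical, and the parameter names (`find_desc`, `find_loc`, `ns`) are simply copied from the example URL above.

```python
from urllib.parse import urlencode


def build_yelp_search_url(location, description=""):
    """Build a search URL shaped like the tutorial's example URL.

    The parameter names are taken from the example URL; what each one
    means to Yelp is an assumption based on its name.
    """
    query = urlencode({
        "find_desc": description,  # search term (empty in the example)
        "find_loc": location,      # location text, e.g. "Seattle, WA"
        "ns": 1,                   # copied as-is from the example URL
    })
    return "https://www.yelp.com/search?" + query


# Reproduces the URL used in this tutorial.
print(build_yelp_search_url("Seattle, WA"))
```

The resulting string can be pasted into Octoparse in step 1 in place of the Seattle example.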

 

Here are the main steps in this tutorial: [Download demo task file here]

1. Open Target Webpage

2. Auto-detect Webpage and Create Workflow

3. Adjust AJAX timeout for Paginate

4. Create Click Item to Go Into Detail Pages

5. Scrape More Info from Pages

6. Set Up Wait Time for Extract Data

7. Check the Workflow

8. Run Task and Export Data

  

1. Open Target Webpage

  • Paste the URL and click Start

 

2. Auto-detect Webpage and Create Workflow

  • Select Auto-detect web page data
  • Wait for the detection to complete, then select Create workflow
  • Go to Data preview to check the current data output. Double-click a header to rename it, or click ... to delete a field

 

[Image: edit-data-field]

 

3. Adjust AJAX timeout for Paginate

  • Click on Click to Paginate - adjust Timeout to 10s

 

[Image: adjust-timeout]

  • Click Apply to save the changes
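To picture what a timeout of this kind does, here is a minimal sketch, assuming the timeout means "wait at most this long for newly loaded content after a click". It is not Octoparse's implementation; `wait_until` and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns True or the timeout elapses.

    Returns True if the condition was met in time, False otherwise.
    clock and sleep are injectable so the function can be tested
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


# Example: pretend the next page of results finishes loading on the
# third check. Short intervals keep the demo fast.
checks = {"count": 0}

def page_loaded():
    checks["count"] += 1
    return checks["count"] >= 3

print(wait_until(page_loaded, timeout=1.0, poll_interval=0.01))
```

A longer timeout gives slow pages more time to finish loading, at the cost of a longer wait when a page never loads.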
  

4. Create Click Item to Go Into Detail Pages

  • Select Click a link on the web page
  • Select the first URL from the drop-down menu (you can confirm that it is the correct link in Data Preview)

 

[Image: click an extracted data]

  • Confirm the setup
 

5. Scrape More Info from Pages

  • Select information on the web page, then click Extract text of the selected element. Repeat this step to extract all the data you need
 

6. Set Up Wait Time for Extract Data

Yelp applies anti-scraping measures and may block your IP address if you scrape too fast, so we need to slow the scraping down by setting a wait time.
  • Select Extract Data in the workflow - go to the Options section - tick Wait before action and set it to 10s
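The idea behind Wait before action can be sketched in a few lines: pause for a fixed time before every extraction so that requests are spaced out. This is an illustrative sketch only, not how Octoparse works internally; `extract_with_delay`, `pages`, and `extract` are hypothetical names.

```python
import time


def extract_with_delay(pages, extract, wait_seconds=10.0, sleep=time.sleep):
    """Call extract(page) for each page, pausing before every call.

    sleep is injectable so the pacing can be checked without
    actually waiting.
    """
    results = []
    for page in pages:
        sleep(wait_seconds)  # wait before the action, as in step 6
        results.append(extract(page))
    return results


# Example with a tiny delay so the demo finishes quickly; the tutorial
# itself uses 10 seconds.
detail_pages = ["page-1", "page-2", "page-3"]
print(extract_with_delay(detail_pages, str.upper, wait_seconds=0.01))
```

With a 10-second wait, scraping 100 detail pages adds roughly 1,000 seconds (about 17 minutes) of waiting, which is the trade-off for a lower risk of being blocked.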
  

7. Check the Workflow

  • Review the final workflow. If everything is in place, you can go on to run the task
  

8. Run Task and Export Data

  • Start the task from the top-right corner: select Run task on your device to run it on your local device, or Run task in the cloud to run it in the Cloud (premium users only)

 

Here is the sample output:

[Image: sample yelp data]

  

Happy Data Hunting!

Author: The Octoparse Team
