Extracting Data / Web Harvesting from Yelp

3/8/2017 10:08:07 PM

                                         

As the world becomes overwhelmed with information, with news of all kinds updated every few seconds, a substantial volume of data is generated and spread across the web through popular social media platforms such as Twitter and Facebook. The data harvested from these platforms is of considerable value, because it captures many real-time events and topics: personal opinions, food preferences, protests, pandemics, climate disasters, crimes, and so on. Beyond serving as a cue for early event detection, the collective sentiment measured from this data can also reflect emerging trends in social phenomena such as popularity, elections, or movements in the stock market.

There are several ways to scrape data from popular websites such as Yelp, LinkedIn, and Twitter. In this post, I will share some techniques for harvesting data from Yelp.

In most cases, the first option that comes to mind for harvesting or scraping data from a website is its public API. Like the Facebook Graph API and the Twitter REST API, Yelp provides a REST API for retrieving its data. With the Yelp API, we can pull ratings, reviews, and geo-location (city, postal code). For example, you can search for a specific type of business, such as restaurants, and restrict the search to a geographical zone: a neighborhood, an apartment complex, or a city. The API responds to HTTP requests with JSON-formatted data that typically contains a wealth of matching information, including addresses, distances, ratings, transaction details, images, and URLs. Like LinkedIn, Yelp uses OAuth for authentication, which means you must register with Yelp to obtain a set of credentials for the API. Once authenticated, a script can issue REST requests; for example, the request shown below searches for restaurants in Colorado. The response body is parsed as JSON, and the required fields are printed for each matching business.

 

Using the Yelp API (yelp.rb) to retrieve data

 

#!/usr/bin/ruby

require 'rubygems'
require 'oauth'
require 'json'

# OAuth credentials obtained by registering with Yelp
consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

consumer = OAuth::Consumer.new(consumer_key, consumer_secret, {:site => api_host})
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

# Search for restaurants in Boulder, CO
path = "/v2/search?term=restaurants&location=Boulder,CO"

# Issue the request and parse the JSON response body
jresp = JSON.parse(access_token.get(path).body)

# Print name, phone, review count, and rating for each open business
jresp['businesses'].each do |business|
  if business['is_closed'] == false
    printf("%-32s  %10s  %3d  %1.1f\n",
           business['name'], business['phone'],
           business['review_count'], business['rating'])
  end
end

 

Sample output of the Yelp API Ruby script

 

$ ./yelp.rb

Frasca Food and Wine              3034426966  189  4.5

John's Restaurant                 3034445232   51  4.5

Leaf Vegetarian Restaurant        3034421485  144  4.0

Nepal Cuisine                     3035545828   65  4.5

Black Cat Bistro                  3034445500   72  4.0

The Mediterranean Restaurant      3034445335  306  4.0

Arugula Bar E Ristorante          3034435100   48  4.0

Ras Kassa's Ethiopia Restaurant   3034472919  101  4.0

L'Atelier                         3034427233   58  4.0

Bombay Bistro                     3034444721   87  4.0

Brasserie Ten Ten                 3039981010  200  4.0

Flagstaff House                   3034424640   86  4.5

Pearl Street Mall                 3034493774   77  4.0

Gurkhas on the Hill               3034431355   19  4.0

The Kitchen                       3035445973  274  4.0

Chez Thuy Restaurant              3034421700   99  3.5

Il Pastaio                        3034479572  113  4.5

3 Margaritas                      3039981234   11  3.5

Q's Restaurant                    3034424880   65  4.0

Julia's Kitchen                                 8  5.0

 

$
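The response carries far more than the four fields printed above. As a rough sketch of working with the richer fields, the snippet below parses a hypothetical sample payload (its values are made up, and a real response contains many more fields per business) and pulls out the address and category data:

```ruby
require 'json'

# Hypothetical sample payload mimicking the shape of a Yelp search
# response; the business values here are illustrative only.
sample = <<-EOS
{
  "businesses": [
    {
      "name": "Frasca Food and Wine",
      "rating": 4.5,
      "review_count": 189,
      "location": {
        "address": ["123 Main St"],
        "city": "Boulder",
        "state_code": "CO"
      },
      "categories": [["Italian", "italian"]]
    }
  ]
}
EOS

# Flatten each business into a one-line summary string
rows = JSON.parse(sample)['businesses'].map do |business|
  loc = business['location']
  # Join the street lines with the city and state into one address string
  address = (loc['address'] + [loc['city'], loc['state_code']]).join(', ')
  # Each category is a [display name, alias] pair; keep the display names
  categories = business['categories'].map { |pair| pair.first }.join('; ')
  "#{business['name']} | #{address} | #{categories}"
end

puts rows
```

In a real script, the parsed hash would come from `access_token.get(path).body` as in the example above, rather than from an inline string.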

 

Yelp provides detailed, comprehensive API documentation with data descriptions, examples, and error reporting. However, the API still has its limits: 100 calls by default, 1,000 calls for test-oriented use, and 10,000 calls if the application meets Yelp's display requirements.

Using an API to retrieve data enables fast and accurate crawling, and many websites expose part of their data publicly to extend their reach. Still, certain data fields are excluded from the public pool and require crawling techniques beyond the API. For example, parts of Yelp's user comments are not exposed, and each business ID returns at most three review items. Therefore, we can try other ways to harvest the data we need, such as automated scraping software. A number of scraping tools provide well-rounded scraping and harvesting services, such as Octoparse, Import.io, and Mozenda. In this post, I will share my own experience with Octoparse. You can also visit www.octoparse.com for more details.

 

Octoparse

Octoparse is a Windows-based scraping tool. Users are not required to program in order to scrape or crawl websites. The workflow designer in Octoparse is user-friendly, as shown in the figure below: the tips are clear, and the icons and operations are straightforward and easy to handle. By simulating a series of human browsing behaviors, such as opening a web page in the built-in browser and pointing and clicking on web elements through the pop-up designer window, Octoparse turns repetitive manual extraction into an automated web extraction process and retrieves the structured data that users need.

 

 

With a premium plan, you can scrape the site using Octoparse's cloud service, which supports an API, IP rotation, and task scheduling.

 

After extraction completes, the harvested data can be exported to various formats (Excel, CSV, HTML, TXT, etc.) or exported directly to databases such as MySQL, SQL Server, and Oracle.
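If you are scripting the pipeline yourself rather than using a tool's export feature, writing extracted records to CSV takes only the Ruby standard library. A minimal sketch, with illustrative records shaped like the Yelp example above:

```ruby
require 'csv'

# Illustrative records, as might come back from a scraping run
businesses = [
  { 'name' => 'Frasca Food and Wine', 'phone' => '3034426966', 'rating' => 4.5 },
  { 'name' => "John's Restaurant",    'phone' => '3034445232', 'rating' => 4.5 }
]

# Write a header row followed by one row per record
CSV.open('businesses.csv', 'w') do |csv|
  csv << ['name', 'phone', 'rating']
  businesses.each { |b| csv << [b['name'], b['phone'], b['rating']] }
end
```

The resulting file opens directly in Excel, or can be bulk-loaded into a database.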

 

 

 

Author: The Octoparse Team

Download Octoparse Today

For more information about Octoparse, please click here.

 

Author's Picks

Be the Best Junior Management Consultant: Skills You Need to Succeed

Web Scraping|Scrape Booking Reviews

Web Scraping|Scrape Data from Online Accommodation Booking Sites

5 Steps to Collect Big Data

A Must-Have Web Scraper for Data Comparison Software - Octoparse

10 Best Free Tools for Startups - Octoparse

30 Free Web Scraping Software

 

 


 

 

 

 

 
