Extract Data / Web Harvesting from Yelp
Saturday, March 18, 2017
As the world becomes overwhelmed with information, with news of all kinds updated every couple of seconds, a substantial volume of data is generated and spread across the web through popular social media such as Twitter and Facebook. The data harvested from these platforms is of considerable value, because it contains information about many real-time events and topics, e.g., personal inclinations, food preferences, protests, pandemics, climate disasters, and crimes. Besides serving as a cue for early event detection, the collective sentiment measured from this data can also reflect potential trends in social events, such as popularity, elections, or movements in the stock market.
There are several methods for scraping data from popular websites such as Yelp, LinkedIn, and Twitter. In this post, I will share some techniques for harvesting data from Yelp.
In most cases, the first thing that comes to mind when we need to harvest or scrape data from a website is to use the public API provided by the site. Like the Facebook Graph API and the Twitter REST API, Yelp provides a REST API for web harvesting and scraping needs. Using the Yelp API, we can retrieve ratings, reviews, and geo-location data (city, code). For example, you can search for a specific type of business, like restaurants, and restrict the search to a geographical zone, such as a neighborhood, an apartment building, or a city. After sending an HTTP request, users receive a JSON-formatted response. The response normally contains an abundance of matching information, including address info, distance, rating, transaction details, images, and URLs. Similar to LinkedIn, Yelp uses OAuth for authentication, which means users need to register with Yelp to obtain a set of credentials for the API. Once the script has authenticated with these credentials, it can send a request to a REST URL. For example, the request shown below searches for restaurants in Boulder, Colorado. The response body is parsed as JSON, and the script then loops over the returned businesses and prints the requested info for each one.
Using the Yelp API (yelp.rb) to retrieve data
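#!/usr/bin/ruby
# yelp.rb: search Yelp for restaurants in Boulder, CO and list the open ones.
require 'rubygems'
require 'oauth'
require 'json'

# OAuth credentials issued by Yelp when you register for API access.
consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

# Build an OAuth access token that signs requests to the API host.
consumer = OAuth::Consumer.new(consumer_key, consumer_secret, {:site => api_host})
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

# Request restaurants in Boulder, CO and parse the JSON response body.
path = "/v2/search?term=restaurants&location=Boulder,CO"
jresp = JSON.parse(access_token.get(path).body)

# Print name, phone, review count and rating for each business that is not closed.
jresp['businesses'].each do |business|
  if business['is_closed'] == false
    printf("%-32s %10s %3d %1.1f\n",
           business['name'], business['phone'],
           business['review_count'], business['rating'])
  end
end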
Yelp API Ruby Script
$ ./yelp.rb
Frasca Food and Wine             3034426966 189 4.5
John's Restaurant                3034445232  51 4.5
Leaf Vegetarian Restaurant       3034421485 144 4.0
Nepal Cuisine                    3035545828  65 4.5
Black Cat Bistro                 3034445500  72 4.0
The Mediterranean Restaurant     3034445335 306 4.0
Arugula Bar E Ristorante         3034435100  48 4.0
Ras Kassa's Ethiopia Restaurant  3034472919 101 4.0
L'Atelier                        3034427233  58 4.0
Bombay Bistro                    3034444721  87 4.0
Brasserie Ten Ten                3039981010 200 4.0
Flagstaff House                  3034424640  86 4.5
Pearl Street Mall                3034493774  77 4.0
Gurkhas on the Hill              3034431355  19 4.0
The Kitchen                      3035445973 274 4.0
Chez Thuy Restaurant             3034421700  99 3.5
Il Pastaio                       3034479572 113 4.5
3 Margaritas                     3039981234  11 3.5
Q's Restaurant                   3034424880  65 4.0
Julia's Kitchen                               8 5.0
$
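The search path in yelp.rb is written by hand. As a minimal sketch, the same kind of path could be assembled from a search term and a location, with both values URL-encoded so that spaces and commas are safe to send. The helper name search_path is hypothetical and is not part of the Yelp API; the only library call used is URI.encode_www_form from Ruby's standard library.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script or the Yelp API):
# builds a v2 search path from a search term and a location,
# URL-encoding both values.
def search_path(term, location)
  query = URI.encode_www_form('term' => term, 'location' => location)
  "/v2/search?#{query}"
end

puts search_path('restaurants', 'Boulder, CO')
# prints: /v2/search?term=restaurants&location=Boulder%2C+CO
```

The returned string could be passed to access_token.get in place of the hard-coded path, so the same script can search for other business types or other cities.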
Yelp provides detailed and comprehensive API documentation, including data descriptions, examples, and error reporting. However, the API is not perfect: it limits users to 100 API calls, 1,000 calls for testing purposes, and 10,000 calls if the user's application meets Yelp's display requirements.
Using an API to retrieve data enables fast and accurate data crawling, and many websites have made part of their data public to extend their impact. Yet certain data fields are still restricted from the public data pool and require crawling techniques other than the API. For example, part of the Yelp user comment data is not exposed, and for each business with a specific id, at most three data items can be retrieved. Therefore, we can try other ways to harvest the data we need, such as automated scraping software. There are a number of scraping tools that provide well-rounded scraping or harvesting services, like Octoparse, Import.io, and Mozenda. In this post, I will share my own experience with Octoparse. Of course, you can also visit www.octoparse.com for more detailed info.
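One way to stay inside such a limit is to count calls on the client side. The sketch below is only an illustration: the CallBudget class is hypothetical, it does not talk to Yelp at all, and it assumes the script is the only consumer of the quota.

```ruby
# Hypothetical helper: counts API calls and refuses to go past a fixed limit.
class CallBudget
  class Exhausted < StandardError; end

  attr_reader :used, :limit

  def initialize(limit)
    @limit = limit
    @used = 0
  end

  def remaining
    @limit - @used
  end

  # Records one call and then runs the block; raises if no calls are left.
  # The call is counted before the block runs, so a failed request still counts.
  def spend
    raise Exhausted, "call limit of #{@limit} reached" if remaining <= 0
    @used += 1
    yield
  end
end

budget = CallBudget.new(100) # 100 is the basic limit mentioned above
# A real script would wrap the request here, e.g. access_token.get(path).
budget.spend { :placeholder_response }
puts budget.remaining # prints: 99
```

Wrapping every request in budget.spend makes the script stop with a clear error instead of sending requests that the API would reject.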
Octoparse is a Windows-based scraping tool. Users do not need to write code to scrape or crawl websites. The workflow designer in Octoparse is user-friendly, as shown in the figure below. The tips are clear, and the icons and operations are straightforward and easy to handle. By simulating a series of human browsing behaviors, like opening a web page in the built-in browser and pointing at and clicking web elements by selecting the related options in the pop-up designer window, Octoparse turns repetitive manual extraction operations into an automated web extraction process and retrieves the structured data that users need.
With a premium plan, you can scrape the site using its cloud service, which supports API access, IP rotation, and task scheduling.
After data extraction is completed, the harvested data can be exported to different formats (Excel, CSV, HTML, TXT, etc.) or exported directly to various databases (MySQL, SQL Server, and Oracle).
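If you are harvesting with the API script rather than with Octoparse, a similar CSV export can be sketched in a few lines of Ruby. This is an assumption-laden illustration, not how Octoparse does it: the helper name businesses_to_csv is hypothetical, and the three rows are copied from the sample output of yelp.rb earlier in this post.

```ruby
require 'csv'

# Rows copied from the sample output of yelp.rb earlier in this post.
businesses = [
  { 'name' => 'Frasca Food and Wine', 'phone' => '3034426966', 'review_count' => 189, 'rating' => 4.5 },
  { 'name' => "John's Restaurant", 'phone' => '3034445232', 'review_count' => 51, 'rating' => 4.5 },
  { 'name' => "Julia's Kitchen", 'phone' => nil, 'review_count' => 8, 'rating' => 5.0 }
]

FIELDS = %w[name phone review_count rating].freeze

# Hypothetical helper: turns an array of business hashes into CSV text,
# with a header row followed by one row per business.
def businesses_to_csv(businesses)
  CSV.generate do |csv|
    csv << FIELDS
    businesses.each { |b| csv << FIELDS.map { |f| b[f] } }
  end
end

csv_text = businesses_to_csv(businesses)
puts csv_text
```

A missing phone number (nil) becomes an empty CSV field. To save the result, the text could be written out with File.write('businesses.csv', csv_text), where the file name is an arbitrary choice.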
Author: The Octoparse Team