How to Extract Information from Yelp

Wednesday, April 27, 2016 11:30 PM

Welcome to Octoparse’s tutorial. Octoparse is a web scraping tool specifically designed for mass-gathering of various data types. If you don’t have an account yet, please sign up at octoparse.com.

 

In this tutorial, I will show you how to use Octoparse to extract business details and reviews from Yelp.

 

Step 1

First, create a task category named Yelp and enter a task name. Then click “Next” to go to the second step.

 

 

Step 2

Open yelp.co.uk in the built-in browser. We will search for bars, so click the search box and choose “Enter text value”.

 

 

 

 

Enter “bar” and click Save. Then click the search button on the web page below and choose “Click an item”.

 

 

 

Click Advanced Options and choose the “Load page with AJAX” option. Select four seconds from the AJAX timeout drop-down box, then click Save.
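To illustrate the general idea behind an AJAX timeout, here is a minimal Python sketch of waiting for content with an upper time limit. It is not Octoparse's implementation; the function name wait_until and the polling interval are assumptions made for this example, and only the four-second limit comes from the step above.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 4.0,
               interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper: `condition` stands in for "the AJAX results
    have appeared on the page".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One final check once the deadline has passed.
    return condition()


# Usage: a stand-in condition that becomes true after about 0.3 seconds.
start = time.monotonic()
loaded = wait_until(lambda: time.monotonic() - start > 0.3)
print(loaded)
```

If the content never appears, the helper gives up once the timeout has passed and returns False, which is why a timeout that is too short can make a scraper move on before the results are there.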

 

 

 

Step 3

 

After the web page has loaded, we select all the bars on this page.

 

Click on the first bar > Create a list of item > Select Add current item to the list > Select Continue to edit the list.

Click on the second one > Add current item to the list > Finish creating list > Loop

 

 

 

 

 

Step 4

Now all the bars on this web page have been added to the list. Next, we go to the detail page to extract the data we want.

 

Click on the name, then choose “Extract text”.

 

 

 

Click on the star rating, then choose “Extract Outer HTML, including the page source code, text with format and image”. Then click “Customize Current Action” > “Define data extracted” > “Extract specified attribute of the item” > select “title” from the Attribute type drop-down list > OK.
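The star rating is read from an attribute rather than from visible text. As a rough sketch of what extracting the “title” attribute means, the Python below pulls title values out of an HTML fragment using only the standard library. The sample markup, class names and rating text are invented for illustration and are not Yelp's actual page source.

```python
from html.parser import HTMLParser
from typing import List


class TitleAttributeParser(HTMLParser):
    """Collect the value of every `title` attribute in the fed HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


def extract_titles(outer_html: str) -> List[str]:
    parser = TitleAttributeParser()
    parser.feed(outer_html)
    parser.close()
    return parser.titles


# Hypothetical outer HTML of a star-rating element.
sample_html = '<div class="rating"><i class="star-img" title="4.0 star rating"></i></div>'
print(extract_titles(sample_html))
```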

Click on the reviews, then choose “Extract text”.

 

 

 

Click on the address, then choose “Extract text”. Then click “Customize Current Action” > “Re-format extracted data” > “Add step” > “Replace with Regular Expression” > enter “ \s+” in the Regular Expression box > OK > Done.
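The \s+ pattern matches runs of whitespace such as the line breaks and indentation that often surround an extracted address. The tutorial does not say what the matches are replaced with, so the sketch below assumes a single space; the helper name and the sample address are made up for the example.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space (assumed replacement)
    and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw value, as it might come out of a multi-line address block.
raw_address = "12 Example Street\n        London\n   AB1 2CD"
print(collapse_whitespace(raw_address))
```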

Click on the phone number, then choose “Extract text”.

 

 

 

Click on the website, then choose “Extract text”.

 

 

 

Click on today's opening hours, then choose “Extract text”.

 

 

 

Click on the price range, then choose “Extract text”.

 

 

 

Click on the Hours section, then choose “Extract text”.

 

 

 

We need to customize each of the last three fields we extracted, one at a time.

 

Click on each of these fields in turn (today's opening hours, price range and Hours) and apply the following steps to each.

 

Click “Customize Current Action” > “Re-format extracted data” > “Add step” > “Replace with Regular Expression” > enter “ \s+” in the Regular Expression box > OK > Done. Then click the Tick button.

After the web page has loaded, we will extract the review information from the Recommended Reviews section.

 

Click on the first review > Select DIV > Create a list of item > Select Add current item to the list > Select Continue to edit the list.

 

Click on the second one > Select DIV > Add current item to the list > Finish creating list > Loop

 

 

In the workflow designer, we will extract the details of each review. Click the “Extract Data” action in the “Loop Item” we just created.

 

 

 

Click on the customer’s name, then choose “Extract text”.

 

 

 

Click on the star rating, then choose “Extract Outer HTML, including the page source code, text with format and image”.

 

 

 

 

Click on the date, then choose “Extract text”.

 

 

Click on the star rating field again, then click “Customize Current Action” > “Define data extracted” > “Extract specified attribute of the item” > OK.

 

 

 

 

 

 

Click on the text of the review, then choose “Extract text”. Then click the Tick button.
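At this point the task has two levels: a loop over bars, and inside each bar a loop over reviews. One way to picture the resulting data is sketched below in Python. The class names, field names and sample values are all hypothetical, and the one-row-per-review layout is an assumption about the output rather than a documented Octoparse format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Review:
    customer_name: str
    star_rating: str
    date: str
    text: str


@dataclass
class Bar:
    name: str
    star_rating: str
    address: str
    reviews: List[Review] = field(default_factory=list)


def flatten(bar: Bar) -> List[Dict[str, str]]:
    """Return one row per review, repeating the bar-level fields on each row."""
    rows = []
    for review in bar.reviews:
        rows.append({
            "bar_name": bar.name,
            "bar_star_rating": bar.star_rating,
            "bar_address": bar.address,
            "customer_name": review.customer_name,
            "review_star_rating": review.star_rating,
            "review_date": review.date,
            "review_text": review.text,
        })
    return rows


# Placeholder data, not real Yelp content.
bar = Bar(
    name="Example Bar",
    star_rating="4.0 star rating",
    address="12 Example Street London AB1 2CD",
    reviews=[
        Review("A. Reviewer", "5.0 star rating", "1/1/2016", "Great place."),
        Review("B. Reviewer", "3.0 star rating", "2/1/2016", "It was fine."),
    ],
)
rows = flatten(bar)
print(len(rows))
```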

 

 

 

Click the “Extract Data” action in the first “Loop Item”. Then edit the bar name field in the field definitions and change the XPath of this first field.

 

Click “Customize Current Action” > “Define ways to locate an item” > change the XPath to “//div[@class='biz-page-header-left']/h1” > OK > Tick button > Next > Next > Local Extraction.
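This XPath selects the h1 heading inside the div whose class is biz-page-header-left. The Python sketch below evaluates the same expression with the standard library's ElementTree, which supports a limited XPath subset and needs a leading dot for a search relative to the root. The class value is quoted, as XPath syntax requires, and the sample page is a made-up, well-formed fragment rather than Yelp's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the detail page.
sample_page = """
<html><body>
  <div class="biz-page-header-left">
    <h1>Example Bar</h1>
  </div>
  <div class="other"><h1>Not the name</h1></div>
</body></html>
""".strip()

root = ET.fromstring(sample_page)

# Same path as in the step above, written relative to the root element.
heading = root.find(".//div[@class='biz-page-header-left']/h1")
bar_name = heading.text.strip() if heading is not None and heading.text else None
print(bar_name)
```

Real pages are often not well-formed XML, so a dedicated HTML parser would usually be needed outside of a toy example like this one.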

The extracted data will be shown in this pane, where we can also see the configured rule of the task. You can also check the built-in browser to see whether the task runs as expected.

 

 

 

Export the results to Excel or another format and save the file to your computer.
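For readers who want to post-process the results themselves, here is a small Python sketch that writes extracted rows as CSV, a format Excel can open. It is independent of Octoparse's own export feature; the function name, column names and sample row are assumptions for illustration.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row, not real extracted data.
sample_rows = [{"name": "Example Bar", "rating": "4.0 star rating"}]
csv_text = rows_to_csv(sample_rows, ["name", "rating"])
print(csv_text)

# To save to disk instead, write csv_text to a file opened with newline="".
```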

 

See the data extracted. Cool, right?

 

You’ve seen how to extract data from the website quickly and effectively. Download Octoparse now and try it out!

 

Related documents for you:

Brief intro to XPath

Brief intro to HTML document

 
