Scraping restaurant info from Grubhub

Friday, February 01, 2019

In this tutorial, we will show you how to scrape restaurant information from Grubhub, using the following search results page as the example:

https://www.grubhub.com/search?orderMethod=delivery&locationMode=DELIVERY&facetSet=umamiV2&pageSize=20&hideHateos=true&searchMetrics=true&latitude=40.71277618&longitude=-74.00597382&variationId=0.5-new-gotos&sortSetId=umamiV2&sponsoredSize=3&countOmittingTimes=true
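Before pasting a long URL like this into Octoparse, it can help to see which query parameters it carries. The snippet below is a minimal sketch using only the Python standard library; the helper name query_params is our own, and reading latitude/longitude as the search location and pageSize as the number of results per page is an assumption based on the parameter names, not something Grubhub documents here.

```python
from urllib.parse import urlsplit, parse_qs

# The search URL from this tutorial, split across lines for readability.
SEARCH_URL = (
    "https://www.grubhub.com/search?orderMethod=delivery&locationMode=DELIVERY"
    "&facetSet=umamiV2&pageSize=20&hideHateos=true&searchMetrics=true"
    "&latitude=40.71277618&longitude=-74.00597382&variationId=0.5-new-gotos"
    "&sortSetId=umamiV2&sponsoredSize=3&countOmittingTimes=true"
)


def query_params(url: str) -> dict[str, str]:
    """Return the query-string parameters of a URL as a flat dict."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list of values per key; each key appears once here.
    return {key: values[0] for key, values in parsed.items()}


params = query_params(SEARCH_URL)
print(params["latitude"], params["longitude"], params["pageSize"])
```

If you want to scrape a different area, the parameters to look at first are probably latitude and longitude, but treat that as a guess to verify in the browser.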

Main steps in the tutorial: [Download demo task file here]

1. "Go To Web Page" - to open the targeted web page

2. Create a pagination loop - to scrape all the results from multiple pages

3. Create a "Loop Item" - to click into each restaurant on every page

4. Extract data - to select the data you need to scrape

5. Save and start extraction - to run your task and get data

1) "Go To Web Page" - to open the targeted web page

  • Create the task with "Advanced Mode".
  • Paste the URL into the "Extraction URL" box and click "Save URL" to move on.

2) Create a pagination loop - to scrape all the results from multiple pages

  • Scroll down and click the ">>" button on the webpage
  • Click "Loop click single element" on "Action Tips"

Because this website uses AJAX to load new content, we need to set up "AJAX Load" so that Octoparse does not get stuck waiting for the page.

  • Uncheck "Auto-Retry"
  • Check "AJAX Load" and set up "AJAX Timeout"
  • Click "Save"
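To illustrate what a timeout setting of this kind is for, here is a minimal Python sketch of a bounded wait: keep checking for the new content, and give up once the time limit is reached instead of waiting forever. This is not how Octoparse is implemented; the function name wait_for and its parameters are hypothetical and exist only for this illustration.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float, interval: float = 0.1) -> bool:
    """Poll condition() until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check once the deadline is reached


# Simulated AJAX content that becomes available after about 0.3 seconds.
start = time.monotonic()
content_ready = lambda: time.monotonic() - start > 0.3

loaded = wait_for(content_ready, timeout=2.0)       # content arrives in time
never_loaded = wait_for(lambda: False, timeout=0.5)  # gives up after 0.5 s
print(loaded, never_loaded)
```

A longer timeout is safer on slow connections but makes every page turn slower, so it is usually worth starting small and increasing it only if data goes missing.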

 Tips!

 To know more about AJAX, please refer to:

3) Create a "Loop Item" - to click into each restaurant on every page

We are now on the second page. When creating a "Loop Item", we should always start with the first item on the first page, so we need to go back to the first page.

  • Click "Go To Web Page" in the workflow.
  • Select the pagination loop in the workflow

By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.

  • Click the first restaurant item. Octoparse will automatically identify the similar URLs on the page.

The first restaurant item is highlighted in green while the others are highlighted in red

  • Click "Select All" on the "Action Tips"

 All the items are highlighted in green

  • Select "Loop click each element"
  • Uncheck "Auto-Retry"
  • Uncheck "Open the link in new tab"
  • Check "AJAX Load" and set up "AJAX Timeout"
  • Click "Save"
  • Click on "Loop Item" and set up some wait time to ensure the webpage loads completely
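The execution order we are aiming for can be sketched in Python as two nested loops: the pagination loop on the outside and the "Loop Item" inside it, so that every restaurant on a page is visited before ">>" is clicked. This is only a model of the workflow under that assumption; PAGES and run_workflow are hypothetical stand-ins, and nothing here talks to Grubhub or Octoparse.

```python
# Hypothetical stand-in for the site: each inner list is one page of results.
PAGES = [
    ["restaurant-1", "restaurant-2"],
    ["restaurant-3"],
]


def run_workflow(pages: list[list[str]]) -> list[str]:
    """Visit every item on every page, in the order the workflow would."""
    visited: list[str] = []
    if not pages:
        return visited
    page_index = 0                      # "Go To Web Page" opens the first page
    while True:                         # pagination loop
        for item in pages[page_index]:  # "Loop Item": each restaurant on the page
            visited.append(item)        # click the item, extract, then go back
        if page_index + 1 >= len(pages):
            break                       # no next page left: stop
        page_index += 1                 # click ">>" to load the next page
    return visited


print(run_workflow(PAGES))
```

If the "Loop Item" were placed outside the pagination loop instead, only the items of a single page would be visited, which is why the position in the workflow matters.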

4) Extract data - to select the data you need to scrape

  • Select the data you need to scrape on the item page, such as the restaurant name, address, opening hours, and phone number.
  • Select "Extract text of the selected element" and rename the "Field name" column if necessary.

Rename the fields by selecting a name from the pre-defined list or by typing in your own.

  • Click "OK" to save the result
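Each restaurant page then becomes one row of data, with one column per field name. As a rough sketch of that shape, the Python below models a row with the four fields mentioned above and writes it out as CSV. The class name, the snake_case column names, and the choice of CSV are all assumptions made for illustration, and the values are placeholders rather than real scraped data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Restaurant:
    """One extracted row; the field names are illustrative only."""
    name: str
    address: str
    opening_hours: str
    phone_number: str


def to_csv(rows: list[Restaurant]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Restaurant)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder values, not real data.
sample = Restaurant("<name>", "<address>", "<opening hours>", "<phone number>")
csv_text = to_csv([sample])
print(csv_text)
```

Choosing clear field names at this step saves renaming columns later, whatever format the data ends up in.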

Normally we can just click "<" (the return to the list page button) to generate a "Click Item" action, but Octoparse fails to do that here, so we need to:

  • Drop a "Click Item" action into the workflow designer
  • Click "Customize" and "Customize XPath"
  • Set the XPath "//BUTTON[contains(@class,'returnToSearch')]" to locate the "<" (return to the list page button)
  • Uncheck "Auto retry when no response"
  • Check "Load the page with AJAX" and set the timeout
  • Click "Save"
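The XPath above selects a button whose class attribute contains the text "returnToSearch", which keeps working even if the full class name carries extra words. The sketch below mimics that contains() test with the Python standard library, whose XPath support does not include contains(). The markup is hypothetical and much simpler than the real page, the helper name is our own, and the tag is written in lowercase because this parser is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real Grubhub page is not reproduced here.
SAMPLE = """
<div>
  <button class="s-btn returnToSearch-link">&lt;</button>
  <button class="s-btn addToCart">Add</button>
</div>
""".strip()


def find_by_class_substring(root: ET.Element, tag: str, needle: str) -> list[ET.Element]:
    """Mimic //tag[contains(@class, needle)] using plain iteration."""
    return [el for el in root.iter(tag) if needle in el.get("class", "")]


root = ET.fromstring(SAMPLE)
matches = find_by_class_substring(root, "button", "returnToSearch")
print([m.get("class") for m in matches])
```

Only the first button matches, because the substring test ignores the rest of the class value. That is the same reason the XPath in this step is less fragile than matching the whole class attribute exactly.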

To know more about XPath, please refer to this tutorial:

https://www.octoparse.com/tutorial-7/xpath

5) Save and start extraction - to run your task and get data

  • Click "Save"
  • Click "Start Extraction"

 

Here is the sample output:

 

Author: Momo

Editor: Suire

 
