Step-by-step tutorials for you to get started with web scraping
The latest version of this tutorial is available here. Please check it out!
In this tutorial, we are going to show you how to scrape restaurant information from Grubhub.
Main steps in the tutorial: [Download demo task file here]
1) "Go To Web Page" - to open the target web page
2) Create a pagination loop - to scrape all the results from multiple pages
Because this website uses AJAX to load new content, we need to set up "AJAX load" so that Octoparse does not get stuck.
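The idea behind an AJAX timeout can be illustrated outside Octoparse. The sketch below is only a generic illustration, not Octoparse's implementation: after a click, the page does not reload, so a scraper polls the content until it changes or a time limit passes. The function name `wait_for_ajax`, its parameters, and the simulated page content are all hypothetical.

```python
import time


def wait_for_ajax(read_content, previous, timeout=5.0, interval=0.1):
    """Poll until the content differs from `previous` or `timeout` seconds pass.

    Returns the new content, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_content()
        if current != previous:
            return current
        time.sleep(interval)
    return None


# Simulated page: the listing only changes on the third read.
reads = iter(["page 1", "page 1", "page 2"])
result = wait_for_ajax(lambda: next(reads), previous="page 1",
                       timeout=2.0, interval=0.01)
print(result)
```

Without the time limit, a scraper that waits for a full page load would wait indefinitely, which is the situation the "AJAX load" setting is meant to avoid.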
To learn more about AJAX, please refer to:
3) Create a "Loop Item" - to click into each restaurant on every page in turn
We are now on the second page. A "Loop Item" should always be created starting from the first item on the first page, so we should go back to the first page before creating it.
By doing this, we can help Octoparse decide the execution order and generate the Loop Item at the appropriate position in the workflow.
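The execution order this produces can be pictured as two nested loops. The sketch below is a rough model only, assuming the Loop Item sits inside the pagination loop; the page data and the function `run_workflow` are hypothetical and do not come from Grubhub or Octoparse.

```python
# Hypothetical listing: two result pages, each with a few restaurant names.
PAGES = [
    ["Restaurant A", "Restaurant B"],
    ["Restaurant C"],
]


def run_workflow(pages):
    """Visit every page (outer loop) and every item on it (inner loop)."""
    rows = []
    for page_number, items in enumerate(pages, start=1):  # pagination loop
        for position, name in enumerate(items, start=1):  # Loop Item
            # "Click" into the item, extract its fields, return to the list.
            rows.append(
                {"page": page_number, "position": position, "name": name}
            )
    return rows


rows = run_workflow(PAGES)
for row in rows:
    print(row)
```

Starting from the first item on the first page corresponds to both counters starting at 1, so no restaurant is skipped.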
The first restaurant item is highlighted in green while the others are highlighted in red
All the items are highlighted in green
4) Extract data - to select the data you need to scrape
Rename the fields by selecting a name from the pre-defined list or entering your own
Normally we can just click "<" (the button that returns to the list page) to generate a "Click Item" action, but Octoparse fails to do that here. So we need to:
To learn more about XPath, please refer to this tutorial:
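As a general illustration of how an XPath expression picks out a single element, here is a minimal sketch using Python's standard library. The markup is hypothetical and is not Grubhub's real page structure, and the expressions shown are not necessarily the ones needed in this step. Note that `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not Grubhub's real page structure.
SNIPPET = """
<div>
  <button class="back" title="Return to list">&lt;</button>
  <h1 class="name">Example Restaurant</h1>
  <span class="cuisine">Pizza</span>
</div>
"""

root = ET.fromstring(SNIPPET)

# Match an element anywhere under the root by tag name and attribute value.
back_button = root.find(".//button[@class='back']")
name = root.find(".//h1[@class='name']")

print(back_button.get("title"))
print(name.text)
```

Targeting the button by an attribute rather than by its position on the page is what makes the selection stable from one restaurant page to the next.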
5) Save and start extraction - to run your task and get data
Here is the sample output: