Web Scraping Cases: Scraping Restaurant Information from Yell.com
Friday, December 30, 2016 4:12 AM
In this tutorial, we will use Octoparse to scrape Yell.com for all the restaurants in London. We will set up the task to capture the name, address, telephone number and star-rating score of every restaurant in London listed on Yell.
First, we need the direct URL of the search results. Search for "restaurant" in "London", then copy the URL of the results page (https://www.yell.com/ucs/UcsSearchAction.do?keywords=restaurants&location=London&scrambleSeed=1180481156). This is the link we will use to start the extraction task.
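For readers who want to see how the search URL is put together, here is a minimal Python sketch that rebuilds it from the keywords and the location. The helper name build_search_url is hypothetical, and the sketch leaves out the scrambleSeed parameter on the assumption that it is optional; if the site requires it, copy the full URL from the browser instead.

```python
from urllib.parse import urlencode

# Base address taken from the URL copied in the tutorial.
BASE_URL = "https://www.yell.com/ucs/UcsSearchAction.do"


def build_search_url(keywords: str, location: str) -> str:
    """Return a Yell search URL for the given keywords and location.

    Hypothetical helper: the scrambleSeed parameter seen in the copied
    URL is left out here (assumed optional).
    """
    query = urlencode({"keywords": keywords, "location": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("restaurants", "London"))
```

urlencode also escapes spaces and special characters, so a multi-word search term still yields a valid URL.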
Step 1. Set up basic information
Step 2. Navigate to your target webpage
Step 3. Set up pagination
Sometimes Octoparse does not recognize "Next" on the first try. If that happens, here is what you can try:
1. Right-click on "Next" so that the link is not triggered and the page does not turn.
2. Click the icon for "Expand the selection area" until "Loop click in the element" shows up.
Step 4. Create a list of items
We are now ready to build the list for extraction. This tells Octoparse to look for the designated data fields in each item of the list.
The first item has now been added to the list.
Octoparse automatically recognizes all items on the web page that share the same layout as the first two items selected, so all items should now be added to the list automatically.
Step 5. Select the data to be extracted
With the list built, we can define which data fields to extract from the web page. In this example, we will extract the name of the restaurant, address, telephone and star-rating score.
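To make the shape of the extracted data concrete, here is a small Python sketch of one record with the four fields above, written out as CSV. The Restaurant class, the to_csv helper and the sample values are placeholders for illustration only; they are not Octoparse's own export format or real Yell data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Restaurant:
    """One extracted row; field names mirror the four fields in this tutorial."""
    name: str
    address: str
    telephone: str
    star_rating: str


def to_csv(rows):
    """Serialize a list of Restaurant records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Restaurant)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Placeholder values, not real data from Yell.
sample = [Restaurant("Example Bistro", "1 Example Street, London", "020 0000 0000", "4.5")]
print(to_csv(sample))
```

The csv module quotes the address automatically because it contains a comma, which keeps the columns aligned when the file is opened in a spreadsheet.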
Step 6. Re-format the data field
Since the star rating was not selected properly, we need to re-format the data field "Star-rating" to capture the exact information we want.
From the outer HTML of the data field, we can see that the star-rating score sits between 'title="' and 'out of'.
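The same "text between two markers" idea can be sketched in Python with a regular expression. The sample markup below is an assumption made for illustration; the real outer HTML on Yell may look different, and this is not how Octoparse implements the re-format step internally.

```python
import re

# Capture whatever sits between 'title="' and 'out of', trimming spaces.
RATING_PATTERN = re.compile(r'title="\s*(.*?)\s*out of')


def extract_star_rating(outer_html):
    """Return the star-rating score found in outer_html, or None if absent."""
    match = RATING_PATTERN.search(outer_html)
    return match.group(1) if match else None


# Hypothetical outer HTML for the star-rating element.
sample_html = '<span class="rating" title="4.5 out of 5 stars"></span>'
print(extract_star_rating(sample_html))  # prints 4.5
```

Returning None when the markers are missing makes it easy to spot listings that have no rating at all.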
Step 7. Re-order the workflow
This is a little trick for pagination. Since extraction from the first page should finish before moving to the second page, we need to re-position the second "Loop" action so that it sits right before "Click to paginate" within "Cycle Pages". This tells Octoparse to extract first and then turn the page. When done, click "Save".
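The "extract, then turn the page" order can be sketched as a simple loop in Python. Here each results page is simulated as a plain list of items, and the function name scrape_all_pages is hypothetical; the sketch only illustrates the ordering, not Octoparse's workflow engine.

```python
def scrape_all_pages(pages):
    """Collect items page by page: extract everything first, then paginate.

    pages is a list of lists, each inner list standing in for the items
    shown on one results page (a simulation, not a live site).
    """
    if not pages:
        return []
    results = []
    page_index = 0
    while True:
        # Inner loop: extract every item on the current page.
        for item in pages[page_index]:
            results.append(item)
        # Only after extraction, move on ("Click to paginate").
        if page_index + 1 >= len(pages):
            break  # no "Next" page left
        page_index += 1
    return results


print(scrape_all_pages([["A", "B"], ["C"]]))  # prints ['A', 'B', 'C']
```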
Step 8. Set the task to run locally
Step 9. Check data and export
Yell.com checks for malicious requests and may stop your extraction. To reduce that risk, set a longer timeout for each action except the "Go To Web Page" action. This extra step lowers the chance of being tracked (we set a 3-second timeout for each action in the workflow).
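As a rough illustration of the same idea outside Octoparse, the Python sketch below pauses before every action except "Go To Web Page". The run_workflow helper and the "Extract Data" action name are hypothetical, and treating the timeout as a pause before each action is an assumption about how the setting behaves.

```python
import time


def run_workflow(actions, wait_seconds=3.0, sleep=time.sleep):
    """Run (name, callable) pairs in order, pausing before each one
    except the "Go To Web Page" action."""
    results = []
    for name, action in actions:
        if name != "Go To Web Page":
            sleep(wait_seconds)  # slow down to look less like a bot
        results.append(action())
    return results


# Demo with a fake sleep that records the waits instead of pausing.
waits = []
workflow = [
    ("Go To Web Page", lambda: "opened"),
    ("Extract Data", lambda: "extracted"),  # hypothetical action name
    ("Click to paginate", lambda: "next page"),
]
print(run_workflow(workflow, sleep=waits.append))
print(waits)  # prints [3.0, 3.0]
```

Passing the sleep function as a parameter keeps the demo instant; in real use the default time.sleep applies the 3-second pause.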
Author: The Octoparse Team