Web Scraping Cases: Scraping Restaurant Information from yell.com
Friday, December 30, 2016 4:12 AM
In this tutorial, we will use Octoparse to scrape Yell.com for all the restaurants in London. We will set up the task to capture the name, address, telephone number and star rating of every London restaurant listed on Yell.
First, we need the direct URL of the search results. Search for "restaurant" in "London" on Yell.com, then copy the URL of the results page (https://www.yell.com/ucs/UcsSearchAction.do?keywords=restaurants&location=London&scrambleSeed=1180481156). This is the link we will use to start the extraction task.
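The search URL is an ordinary query string, so it can be taken apart or rebuilt for another keyword or location. The Python sketch below is only an illustration: the helper name build_search_url is hypothetical, and whether Yell accepts a URL without scrambleSeed is not confirmed by this tutorial, so the seed is passed through unchanged.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base path copied from the search URL in this tutorial.
BASE = "https://www.yell.com/ucs/UcsSearchAction.do"


def build_search_url(keywords, location, scramble_seed=None):
    """Assemble a search URL from its query parameters (hypothetical helper)."""
    params = {"keywords": keywords, "location": location}
    if scramble_seed is not None:
        # The purpose of scrambleSeed is not documented here; it is passed along as-is.
        params["scrambleSeed"] = scramble_seed
    return f"{BASE}?{urlencode(params)}"


url = build_search_url("restaurants", "London", "1180481156")
print(url)

# Splitting the URL again shows the individual parameters.
print(parse_qs(urlsplit(url).query))
```

Pasting the printed URL into the Octoparse built-in browser is equivalent to copying it from the address bar after a manual search.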
Step 1. Set up basic information
- Click “Quick Start”
- Choose "New Task (Advanced Mode)"
- Fill in the basic information
- Click “Next”
Step 2. Navigate to your target webpage
- Enter the target URL in the built-in browser
- Click the “Go” icon to open the webpage.
Step 3. Set up pagination
- Click “Next”, located to the right of the page numbers
- Choose “Loop Click Next Page”. This tells Octoparse to open each page in turn and extract data from it.
Sometimes Octoparse does not recognize "Next" on the first attempt. If that happens, try the following:
1. Right-click "Next" so that the link is not triggered and the page does not turn
2. Click the icon for “Expand the selection area” until “Loop click in the element” shows up
Step 4. Create a list of items
We are now ready to build the list for extraction. This tells Octoparse to look for the designated data fields in each item of the list we build.
- Click on the first item of the list and make sure the outlined box contains the data to be extracted.
- When prompted, click “Create a list of items” (sections with similar layout)
- Then click “Add current item to the list”
The first item has now been added to the list.
- Click “Continue to edit the list”
- Go back to the webpage and click the second item of the list. Note that the items of the list must share a similar layout.
- When prompted, click “Add current item to the list”
Octoparse automatically recognizes all items on the web page that share the same layout as the first two items selected, so all items of the list should now be added automatically.
- Click “Finish Creating List”
- Click “Loop”. This tells Octoparse to go through each item of the list and extract the designated data fields.
Step 5. Select the data to be extracted
With the list built, we can now define which data fields will be extracted from the web page. In this example, we will extract the name of the restaurant, its address, its telephone number and its star-rating score.
- Right-click on the name of the restaurant (since the name is a hyperlink, we use right-click here to prevent triggering the link)
- Select “Extract text”
- Follow the same steps to extract Address and Telephone number
- The star rating is not selected correctly on the first click, so expand the selection area until the outlined box includes every one of the stars.
- Since nothing is shown under “Extract text”, select “Extract outer HTML” instead.
- Rename any field names if necessary
- Click "Save"
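Steps 4 and 5 amount to "find every block with the same layout, then read the same fields from each block". The Python sketch below illustrates that idea only; it is not how Octoparse works internally. The markup, class names and restaurant details are made-up placeholders, not Yell's real HTML, and the standard-library XML parser is used on the assumption that the sample is well formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with placeholder data (NOT Yell's real HTML).
SAMPLE = """
<div id="results">
  <div class="listing">
    <a class="name" href="/biz/a">Restaurant A</a>
    <span class="address">1 Example Street, London</span>
    <span class="phone">020 0000 0001</span>
  </div>
  <div class="listing">
    <a class="name" href="/biz/b">Restaurant B</a>
    <span class="address">2 Example Road, London</span>
    <span class="phone">020 0000 0002</span>
  </div>
</div>
"""


def extract_listings(markup):
    """Return one dict per item that shares the 'listing' layout."""
    root = ET.fromstring(markup)
    rows = []
    # "Create a list of items": every div with the same class is one item.
    for item in root.findall(".//div[@class='listing']"):
        # "Extract text": read the same fields from each item.
        rows.append({
            "name": item.findtext("a[@class='name']", default="").strip(),
            "address": item.findtext("span[@class='address']", default="").strip(),
            "telephone": item.findtext("span[@class='phone']", default="").strip(),
        })
    return rows


for row in extract_listings(SAMPLE):
    print(row)
```

Real pages are rarely well-formed XML, so a dedicated HTML parser would be needed outside of a toy example like this one.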
Step 6. Re-format the data field
Because the star rating was captured as outer HTML rather than text, we need to re-format the data field “Star-rating” to capture the exact information we want.
- Choose the field you want to reformat, star-rating in this case
- Select the “Customize Field” button
- Choose “Re-format extracted data”
From the outer HTML of the data field, we can see that the star-rating score sits between ‘title="’ and ‘out of’.
- Click "Add step"
- Select “Match with Regular Expression”
- Click “Try RegEx Tool”
- In the RegEx Tool window, check “Start with” and enter title=" (including the double quote), then check “End with” and enter out of
- Click “Generate”
- Click “Match”
- The matching result is 4.3, which is exactly what we want
- Click “Apply”
- Click “OK”
- Click “Done”.
- The value of the “Star_rating” data field now becomes 4.3
- Click "Save"
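The same "starts with / ends with" match can be tried outside Octoparse. The Python sketch below uses a lookbehind and a lookahead in the same spirit; the exact expression Octoparse generates may differ. The sample outer HTML is hypothetical, and only the 'title="' and 'out of' markers come from this tutorial.

```python
import re

# Hypothetical outer HTML of the star-rating element (placeholder markup).
outer_html = '<span class="starRating" title="4.3 out of 5 stars"></span>'

# Match whatever sits after 'title="' and before 'out of', as briefly as possible.
PATTERN = re.compile(r'(?<=title=").+?(?=out of)')


def star_rating(html):
    """Return the rating text, or None when the markers are not found."""
    match = PATTERN.search(html)
    # The raw match keeps the space before 'out of', so strip it.
    return match.group(0).strip() if match else None


print(star_rating(outer_html))  # prints 4.3
```

Returning None for a missing rating keeps restaurants without reviews from breaking the rest of the extraction.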
Step 7. Re-order the workflow
This is a little trick for pagination. Because extraction from the first page must finish before Octoparse moves on to the second page, we need to re-position the second "Loop" action so that it sits right before "Click to paginate" within "Cycle Pages". This tells Octoparse to extract first and then turn the page. When done, click "Save".
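The ordering matters for the same reason it would in a hand-written scraper: if the page is turned before the items are read, the first page is lost. A minimal Python sketch of the intended order, where the PAGES list is a made-up stand-in for a paginated site:

```python
# Hypothetical stand-in for a paginated site: each inner list is one page of items.
PAGES = [["A", "B"], ["C", "D"], ["E"]]


def scrape_all(pages):
    """Extract every item on a page before moving on to the next page."""
    results = []
    if not pages:
        return results
    page_index = 0
    while True:
        # 1. "Loop": extract all items on the current page first.
        results.extend(pages[page_index])
        # 2. "Click to paginate": only then turn the page, stopping at the last one.
        if page_index + 1 >= len(pages):
            break
        page_index += 1
    return results


print(scrape_all(PAGES))  # prints ['A', 'B', 'C', 'D', 'E']
```

Swapping steps 1 and 2 in the loop would skip page one, which is the mistake the re-ordered workflow avoids.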
Step 8. Set the task to run locally
- Save task configuration, then click "Next"
- Click “Next” again to skip over "Extraction Options"
- Click “Local Extraction” to run the task on your computer
Step 9. Check data and export
- Check the data extracted
- Click the "Export" button to export the data to an Excel file, a database or another format. Save the file to your computer.
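Once the data is exported, a quick script can help with the "check the data" part. The Python sketch below assumes a CSV export; the column names and rows are made-up placeholders, and the helper rows_with_missing_fields is hypothetical.

```python
import csv
import io

# Made-up stand-in for an exported CSV file; column names are assumptions.
EXPORTED = """Name,Address,Telephone,Star_rating
Restaurant A,"1 Example Street, London",020 0000 0001,4.3
Restaurant B,"2 Example Road, London",,3.9
"""


def is_blank(value):
    """True for a missing column or a string that contains only whitespace."""
    return value is None or (isinstance(value, str) and not value.strip())


def rows_with_missing_fields(csv_text):
    """Return the rows in which at least one field is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if any(is_blank(v) for v in row.values())]


for row in rows_with_missing_fields(EXPORTED):
    print("Incomplete row:", row["Name"])
```

For a real export, the text would be read from the saved file, for example with open(path, newline="", encoding="utf-8").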
Yell.com checks for suspicious requests and may stop your extraction. If that happens, set a longer timeout for each action except the “Go To Web Page” action. This extra step lowers the chance of the task being flagged (we set a 3-second timeout for each action of the workflow).
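The effect of the timeout is simply a pause around each action. The Python sketch below shows the idea under two assumptions: the pause happens before each action, and the action names mirror the workflow labels. The function run_with_delays is hypothetical and not part of Octoparse.

```python
import time


def run_with_delays(actions, delay_seconds=3.0, sleep=time.sleep):
    """Run (name, callable) pairs, pausing before all but 'Go To Web Page'."""
    results = []
    for name, action in actions:
        if name != "Go To Web Page":
            # Slow the workflow down so requests are spaced out.
            sleep(delay_seconds)
        results.append(action())
    return results


# Demo with a fake sleep that only records the waits, so it finishes instantly.
recorded_waits = []
workflow = [
    ("Go To Web Page", lambda: "page opened"),
    ("Loop Item", lambda: "items found"),
    ("Extract Data", lambda: "row extracted"),
]
print(run_with_delays(workflow, sleep=recorded_waits.append))
print(recorded_waits)  # prints [3.0, 3.0]
```

Passing the sleep function in as a parameter is only a convenience for trying the sketch without waiting; leaving it at the default gives real 3-second pauses.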
Author: The Octoparse Team