Scraping Price Range on Yelp
Monday, January 09, 2017 1:06 AM
Octoparse enables you to scrape price ranges on Yelp. To speed up the extraction, you can use our Cloud Extraction to split the scraping task into many sub-tasks. Our cloud servers will then collect the data quickly and provide you with a structured dataset.
To scrape the price ranges of all the restaurants on Yelp.com as fast as possible, you can create two scraping tasks -- Task 1 and Task 2. Task 1 scrapes the URLs of all the restaurants, and Task 2 scrapes the price ranges of these restaurants on yelp.com.
In this tutorial we will scrape the price ranges of all the restaurants in New York, NY, United States on yelp.com with Octoparse.
The website URL we will use is https://www.yelp.com/search?find_desc=Restaurants&find_loc=New+York%2C+NY
The data fields include the restaurant name, the restaurant website, the price range, the menu, the telephone number, the star rating and the food type.
You can directly download the two tasks (the .otd files) to begin collecting the data. Or you can follow the steps below to build the scraping tasks and scrape the data from Yelp.com.
Task 1. Scraping the URLs needed for Task 2.
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information.
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example: https://www.yelp.com/search?find_desc=Restaurants&find_loc=New+York%2C+NY)
Set a timeout of 60 seconds under "Advanced Options" ➜ Click "Save".
Step 3. Click on the "Next" pagination link. ➜ Choose "Loop click in the element" to turn the page.
Note: If the page keeps loading even though its content has fully loaded, you can click the multiplication sign (×) to stop it from loading.
1. If you want to extract information from every page of the search results, you need to add a page-navigation action.
2. You can right-click the "Next" pagination link to avoid triggering the link.
3. You can click the "Expand the selection area" button until "Loop click in the element" appears.
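Conceptually, the "Cycle Pages" loop keeps following the "Next" link until no further page exists. A minimal Python sketch of that idea (the page values and `get_next_page` callback are hypothetical stand-ins for Octoparse's internal page handling):

```python
# Hypothetical sketch of a "click Next until it disappears" pagination loop.
def paginate(first_page, get_next_page, max_pages=100):
    """Collect pages by repeatedly asking for the next one; stop when None."""
    page, pages = first_page, []
    while page is not None and len(pages) < max_pages:
        pages.append(page)
        page = get_next_page(page)  # None means there is no "Next" link
    return pages

# Toy example: pages 1..3, then no "Next" link
print(paginate(1, lambda p: p + 1 if p < 3 else None))  # [1, 2, 3]
```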
Step 4. Move your cursor over the section with similar layout, where you would extract the URLs.
Click the first section ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first section has now been added to the list. ➜ Click "Continue to edit the list".
Click the second section ➜ Click "Add current item to the list" again. Now we have all the sections with similar layout. ➜ Then click "Finish Creating List" ➜ Click "loop" to process the list and extract the URLs of these items.
Step 5. Extract the URLs.
Extract the link of the first item. ➜ Click the item name ➜ Click "Expand the selection area" button ➜ Select "Extract link(href attribute of A tag) of this item". Then click "Save".
Step 6. Check the workflow.
Drag the “Loop Item” box into the “Cycle Pages” box, before the “Click to paginate” action, so that we can grab the URLs of the sections from multiple web pages.
Because the web page uses AJAX to load content across multiple pages, we need to set an AJAX timeout for the pagination action.
Navigate to the "Click to paginate" action ➜ Tick the "AJAX Load" checkbox ➜ Set an AJAX timeout of 3 seconds ➜ Click "Save".
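The AJAX timeout behaves like a bounded wait: keep checking whether the new content has arrived, and give up after the timeout instead of waiting forever. A small Python sketch of that idea (the `is_loaded` callback is a hypothetical stand-in for Octoparse's internal readiness check):

```python
import time

def wait_for_ajax(is_loaded, timeout=3.0, poll=0.1):
    """Poll is_loaded() until it returns True, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_loaded():
            return True
        time.sleep(poll)
    return False  # timed out -- proceed anyway, as Octoparse does

# Example: content that is already loaded succeeds immediately
print(wait_for_ajax(lambda: True))  # True
```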
Step 7. Click "Save" to save the configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the URLs.
Step 8. The data extracted will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or other formats and save the file to your computer. Copy the list of URLs for Task 2.
Task 2. Scraping price range from yelp.com
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Create a loop for a list of URLs.
Drag a "Loop Item" into the Workflow Designer and then choose "URL list" in the "Loop mode".
Paste a list of URLs into the "URL list" box and Click "Save".
You will see that the ‘Go To Web Page’ action is generated automatically and goes directly to the first URL. You can click the Loop Item box to see the full list of URLs.
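The "URL list" loop mode is equivalent to iterating over the URLs exported from Task 1 and visiting each page in turn to collect one record. A hypothetical Python sketch (the URLs are placeholders, and `scrape_one` stands in for the actual page load and field extraction):

```python
# Hypothetical sketch of the "URL list" loop: one record per restaurant URL.
urls = [
    "https://www.yelp.com/biz/example-restaurant-1",  # placeholder URLs
    "https://www.yelp.com/biz/example-restaurant-2",
]

def scrape_one(url):
    # Stand-in for loading the page and extracting the data fields.
    return {"url": url, "price_range": None, "star_rating": None}

records = [scrape_one(u) for u in urls]
print(len(records))  # one record per URL
```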
Step 3. Extract the product details.
Click the restaurant name ➜ Select "Extract text" ➜ Click the "Field Name" to modify it. The other fields can be extracted in the same way. For the data field "Star_rating", you need to extract the inner HTML and use a regular expression to get the exact information. You can add the current page URL directly. ➜ Then click "Save".
Step 4. Re-format the data fields.
After you have extracted all the data fields you want, check whether Octoparse correctly extracts the values from the page. For example, you can re-format the data fields “PriceRange” and “Star_rating” to extract the exact information.
For the data field “PriceRange”:
Choose the data field "PriceRange" ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data” ➜ Click “Add step” ➜ Select “Replace Strings” ➜ Enter the value you want to replace in the "Replace" textbox ➜ Leave blank in the "With" textbox ➜ Click "Calculate" ➜ Click “Done”.
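With an empty "With" textbox, the "Replace Strings" step simply deletes the unwanted text from the extracted value. In plain Python the same cleanup looks like this (the raw value shown is only an example; the actual text on the page may differ):

```python
# Strip unwanted boilerplate from the raw "PriceRange" value by replacing
# the unwanted substring with nothing (the empty "With" textbox).
raw = "Price range $$"  # example raw value; the real extracted text may differ
cleaned = raw.replace("Price range ", "")  # "Replace" = "Price range ", "With" = ""
print(cleaned)  # $$
```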
For the data field “Star_rating”:
Choose the data field "Star_rating" ➜ Select the “Customize Field” button ➜ Choose “Re-format extracted data” ➜ Click “Add step” ➜ Select “Match with Regular Expression”.
Use the "Try RegEx Tool" ➜ Check the "Start With" option with the value alt=" ➜ Check the "End With" and "Include End" options with the value star rating" ➜ Click “Generate” to create the regular expression ➜ Click “Match” ➜ Click “Apply”.
Click "Calculate" ➜ Click “OK” ➜ Click “Done”.
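The generated regular expression matches the text that follows alt=" up to and including star rating" in the extracted inner HTML. An equivalent check in Python (the sample HTML below is illustrative, not Yelp's exact markup):

```python
import re

# Illustrative inner HTML of a rating element; the real markup may differ.
inner_html = '<img alt="4.5 star rating" class="offscreen">'

# "Start With" = alt="  ;  "End With" + "Include End" = star rating"
match = re.search(r'alt="(.*?star rating")', inner_html)
print(match.group(1))  # 4.5 star rating"
```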
Then you will see that the data fields “PriceRange” and “Star_rating” have been extracted correctly. ➜ Click “Save”.
Step 5. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 6. The data extracted will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or other formats and save the file to your computer.
You may find that there are missing values for some data fields in the output. In that case, you need to figure out why Octoparse could not extract the value for those data fields. Check out this article to find out the reasons for missing values when using Local Extraction.
Sometimes the original XPath for a data field does not select the elements correctly, resulting in missing values for that field. In that case, you can modify the XPath expressions for these data fields, or follow this tutorial to modify XPath expressions in Octoparse.
Knowing how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials and FAQs below can help you pick up XPath quickly.
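As a quick taste of XPath, here is how a relative expression selects an element by its attribute, using Python's standard `xml.etree.ElementTree` (which supports a subset of XPath) on a simplified snippet. The class names below are made up for illustration and are not Yelp's real markup:

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup standing in for one search-result section.
snippet = """
<div>
  <div class="result">
    <a class="biz-name" href="/biz/example-restaurant">Example Restaurant</a>
    <span class="price-range">$$</span>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# XPath-style query: any <span> whose class attribute equals "price-range"
price = root.find('.//span[@class="price-range"]')
print(price.text)  # $$
```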
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today!