How to Extract information from Yelp
Wednesday, April 27, 2016 11:30 PM
Octoparse enables you to scrape data from eBay.com. To speed up the extraction, you can use our Cloud Extraction to split the scraping task into many sub-tasks. Our cloud servers will then collect the data quickly and provide you with a structured dataset.
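The idea of splitting one task into sub-tasks can be illustrated with a minimal Python sketch. This is not how Octoparse's Cloud Extraction is implemented; the function name `split_into_subtasks` and the example.com URLs are hypothetical and only show the general principle of dividing a list of pages into roughly equal chunks that separate workers could process.

```python
def split_into_subtasks(urls, n_subtasks):
    """Split a list of URLs into at most n_subtasks roughly equal chunks."""
    if n_subtasks < 1:
        raise ValueError("n_subtasks must be at least 1")
    size, extra = divmod(len(urls), n_subtasks)
    chunks, start = [], 0
    for i in range(n_subtasks):
        # The first `extra` chunks get one additional URL each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few URLs
            chunks.append(urls[start:end])
        start = end
    return chunks


# Hypothetical placeholder URLs, not real product pages.
urls = [f"https://example.com/itm/{i}" for i in range(10)]
subtasks = split_into_subtasks(urls, 3)
print([len(chunk) for chunk in subtasks])  # [4, 3, 3]
```

Each chunk keeps the original order, so joining the chunks back together gives the original list.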
To scrape product details from eBay.com as fast as possible, you can create two scraping tasks: Task 1 and Task 2. Task 1 scrapes the URLs of the product detail pages, and Task 2 scrapes the product details from those pages.
In this tutorial we will scrape the product detail pages from eBay.com with Octoparse.
The data fields include auction item name, item condition, end time, item price, number of items sold, HKD (including shipping), shipping price, shipping details, item location, seller ID, seller's representative, product details, item image URL and product detail page URL.
You can directly download the two tasks (the .otd files) to begin collecting the data, or you can follow the steps below to build the scraping tasks yourself.
Task 1. Scraping the URLs needed for Task 2.
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information.
Step 2. Enter the target URL in the built-in browser. ➜ Click the "Go" icon to open the webpage.
(URL of the example:
Step 3. Click on the "Next" pagination link. ➜ Choose "Loop click in the element" to turn the page.
1. If you want to extract information from every page of the search results, you need to add a page navigation action.
2. You can right-click the "Next" pagination link to prevent triggering the link.
3. You can click the "Expand the selection area" button until "Loop click in the element" appears.)
Step 4. Move your cursor over the sections with a similar layout, from which you will extract the URLs.
Click the first highlighted link ➜ Click "Create a list of items" (sections with similar layout) ➜ Click "Add current item to the list".
The first highlighted link is now added to the list. ➜ Click "Continue to edit the list".
Click the second highlighted link ➜ Click "Add current item to the list" again. Now we have all the links with a similar layout. ➜ Click "Finish Creating List" ➜ Click "loop" to process the list and extract the elements on each page.
Step 5. Extract the URLs.
Extract the link of the first item. ➜ Click the item name ➜ Select "Extract link (href attribute of A tag) of this item". Then click "Save".
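For readers who want to see what "href attribute of A tag" means outside the Octoparse interface, here is a minimal sketch using only Python's standard library. The sample HTML, the `item-link` class name, and the example.com URLs are assumptions made for illustration; they are not eBay's actual markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.css_class in classes:
            self.links.append(href)


# Hypothetical search-result markup: two item links and one pagination link.
SAMPLE_HTML = """
<ul>
  <li><a class="item-link" href="https://example.com/itm/111">First item</a></li>
  <li><a class="item-link" href="https://example.com/itm/222">Second item</a></li>
  <li><a class="nav" href="https://example.com/next">Next</a></li>
</ul>
"""

collector = LinkCollector("item-link")
collector.feed(SAMPLE_HTML)
print(collector.links)
# ['https://example.com/itm/111', 'https://example.com/itm/222']
```

The pagination link is skipped because it does not share the item links' class, which mirrors how the list of similar items excludes the "Next" link.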
Step 6. Drag the second “Loop Item” box so that it sits before the “Click to paginate” action inside the “Cycle Pages” box in the Workflow Designer. This way the elements are extracted from every page before Octoparse moves on to the next one.
Step 7. Check the workflow.
Check the workflow by clicking each action in order, starting from the beginning of the workflow:
Go to the webpage ➜ Cycle Pages box ➜ Loop Item box ➜ Extract Data ➜ Click to Paginate.
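The order of the workflow boxes corresponds to a nested loop: an outer loop over pages and an inner loop over the items on each page, with pagination as the last action of each outer iteration. The sketch below models that structure on in-memory data. The `PAGES` dictionary and the `run_workflow` function are hypothetical stand-ins, not part of Octoparse.

```python
# Hypothetical stand-in for two pages of search results.
PAGES = {
    "page1": {
        "links": ["https://example.com/itm/1", "https://example.com/itm/2"],
        "next": "page2",
    },
    "page2": {
        "links": ["https://example.com/itm/3"],
        "next": None,  # no "Next" link on the last page
    },
}


def run_workflow(pages, start):
    """Walk the pages in order and collect every item link."""
    collected = []
    current = start                  # "Go to the webpage"
    while current is not None:       # "Cycle Pages" box
        page = pages[current]
        for link in page["links"]:   # "Loop Item" box
            collected.append(link)   # "Extract Data"
        current = page["next"]       # "Click to Paginate"
    return collected


print(run_workflow(PAGES, "page1"))
# ['https://example.com/itm/1', 'https://example.com/itm/2', 'https://example.com/itm/3']
```

If the pagination step ran before the inner loop, the items on the first page would never be collected, which is why the "Loop Item" box has to come before "Click to Paginate".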
Step 8. Click "Save" to save the configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the URLs.
Step 9. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer. Copy the list of URLs for Task 2.
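If you export the results as CSV rather than Excel, the URL list for Task 2 can also be prepared with a short script. The following is a minimal sketch that assumes the export has a header row and a column named `URL`; the column name and the sample data are assumptions, so adjust them to match your own export.

```python
import csv
import io


def load_urls(csv_text, column="URL"):
    """Return the unique, non-empty URLs from one CSV column, in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen, urls = set(), []
    for row in reader:
        url = (row.get(column) or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical export with one duplicated URL.
EXPORTED = (
    "URL\n"
    "https://example.com/itm/1\n"
    "https://example.com/itm/2\n"
    "https://example.com/itm/1\n"
)

print("\n".join(load_urls(EXPORTED)))
```

Removing duplicates before pasting the list into Task 2 avoids scraping the same detail page twice.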
Task 2. Scraping product details from eBay.com
Step 1. Set up basic information.
Click "Quick Start" ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information ➜ Click "Next".
Step 2. Create a loop for a list of URLs.
Drag a "Loop Item" into the Workflow Designer and then choose "URL list" as the "Loop mode".
Paste the list of URLs into the "URL list" box and click "Save".
The ‘Go To Web Page’ action is generated automatically and opens the first URL. You can click the Loop Item box to see the full list of URLs.
Step 3. Extract the product details.
The first URL will sometimes not include all the content you want to extract. In that case, pick a URL in the loop whose page contains all the content you need. Here we choose the URL "http://www.ebay.com/itm/HATCHIMALS-COMPLETE-SET-OF-7-Includes-3-Limited-Editions-ToysRUS-Target-WM-/232138134111?hash=item360c82d25f:g:bpUAAOSwnbZYIpPJ".
Click the auction item ➜ Select "Extract text" ➜ Click the "Field Name" to modify it. Other content can be extracted in the same way. Then click "Save".
After you have extracted all the data fields you want, check that Octoparse extracts the values from the product detail page correctly. For example, you can re-format the first data field “Auction_Item” so that it contains exactly the information you need. You can also add the current page URL as a field, so that the output shows which detail pages have missing values. Then click "Save".
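Once the page URL is included as a field, spotting detail pages with missing values in the exported data is straightforward. Here is a minimal sketch, assuming each exported row is available as a dictionary. Only “Auction_Item” comes from this tutorial; the field names `Price` and `Page_URL`, the function `find_missing`, and the sample rows are hypothetical.

```python
def find_missing(records, required_fields, url_field="Page_URL"):
    """Map each page URL to the list of required fields that are empty."""
    report = {}
    for record in records:
        missing = [
            field
            for field in required_fields
            if not (record.get(field) or "").strip()
        ]
        if missing:
            report[record.get(url_field, "<unknown>")] = missing
    return report


# Hypothetical extracted rows; the second one has an empty price.
records = [
    {"Auction_Item": "Item A", "Price": "US $10.00",
     "Page_URL": "https://example.com/itm/1"},
    {"Auction_Item": "Item B", "Price": "",
     "Page_URL": "https://example.com/itm/2"},
]

print(find_missing(records, ["Auction_Item", "Price"]))
# {'https://example.com/itm/2': ['Price']}
```

The report tells you which pages to open when investigating why a value was not extracted.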
Step 4. Click "Save" to save your configuration. Then click "Next" ➜ Click "Next" ➜ Click "Local Extraction" to run the task on your computer. Octoparse will automatically extract all the selected data.
Step 5. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format, and save the file to your computer.
You may find that some data fields have missing values in the output. In this case, you need to figure out why Octoparse could not extract the values for those fields. See this article for the reasons for missing values when using Local Extraction.
The original XPath for some data fields may not select the elements correctly, which results in missing values for those fields. In this case, you can modify the XPath expressions for these data fields. Here I replace all the SPAN tags with the * wildcard for all the data fields. Click "Save" to save the configuration. You can follow this tutorial to modify XPath expressions in Octoparse.
Some knowledge of how to edit XPath expressions can help you solve many problems when scraping data from websites. The tutorials or FAQs below can help you pick up XPath quickly.
eBay.com does not show more than 10,000 results for a search, so your task may stop for this reason. If it does, refine your search to narrow the results.
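The effect of swapping a tag name for the * wildcard can be shown with Python's standard library, which supports a limited subset of XPath. In the sketch below the markup and the `id` values are hypothetical, not eBay's real page structure. The point is only that an expression tied to SPAN fails when the same value sits in a different tag on some pages, while the wildcard version still matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail-page fragment: the price sits in a DIV, not a SPAN.
SAMPLE = """
<div>
  <div id="price">US $25.00</div>
  <span id="condition">New</span>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Tag-specific expression: finds nothing because the element is not a SPAN.
strict = root.find(".//span[@id='price']")

# Wildcard expression: matches any tag that has the given id.
relaxed = root.find(".//*[@id='price']")

print(strict)        # None
print(relaxed.text)  # US $25.00
```

The wildcard is less specific, so it works best together with an attribute condition (such as an id) that still identifies the element uniquely.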
Author: The Octoparse Team