Scraping Data from Amazon
Saturday, December 31, 2016 2:20 AM
Octoparse enables you to scrape the best sellers from amazon.com.
In this web scraping tutorial we will scrape all the best sellers from one category (Books) from amazon.com with Octoparse.
The website URL we will use is https://www.amazon.com/best-sellers-books-Amazon/zgbs/books
The data fields include book name, author, best seller badge, hardcover, publisher, language, the number of reviews and star rating score.
You can download the task (the OTD file) directly and begin collecting the data, or you can follow the steps in this tutorial to build a scraping task that extracts book information from amazon.com.
(Download my extraction task of this tutorial HERE just in case you need it.)
Step 1. Set up basic information.
Click “Quick Start” ➜ Choose "New Task (Advanced Mode)" ➜ Complete basic information.
Step 2. Enter the target URL in the built-in browser. ➜ Click the “Go” icon to open the webpage.
(URL of the example: https://www.amazon.com/best-sellers-books-Amazon/zgbs/books)
Step 3. Extract data from multiple web pages (configure pagination).
Drag a “Loop” item into the workflow, under the "Click Item" action. ➜ Choose a “Loop Mode” under “Advanced Options”. ➜ Select “Single Element” option.
Enter the XPath expression that locates the link to the next page into the “Single Element” text box. ➜ Click “Save”.
The XPath expression is //li[@class='zg_page zg_selected']/following-sibling::li/.//a
Drop a “Click Item” action into the “Loop item” we've just created ➜ Choose “Click items in Loop Item box” under “Advanced Option” ➜ Click “Save”. Now you’ve configured pagination scraping.
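To see what the pagination XPath above selects, here is a minimal Python sketch. It is not part of Octoparse. The sample markup is a simplified, hypothetical stand-in for the real pagination bar, and because Python's standard library does not support the following-sibling axis, the helper next_page_links (a name made up for this example) emulates the expression by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup; the real page is more complex.
SAMPLE = """
<ol>
  <li class="zg_page"><a href="#1">1-20</a></li>
  <li class="zg_page zg_selected"><a href="#2">21-40</a></li>
  <li class="zg_page"><a href="#3">41-60</a></li>
  <li class="zg_page"><a href="#4">61-80</a></li>
</ol>
"""


def next_page_links(root):
    """Emulate //li[@class='zg_page zg_selected']/following-sibling::li/.//a."""
    links = []
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "li" and child.get("class") == "zg_page zg_selected":
                # following-sibling::li: every later <li> under the same parent
                for sibling in children[index + 1:]:
                    if sibling.tag == "li":
                        # .//a: every <a> nested inside that <li>
                        links.extend(sibling.iter("a"))
    return links


root = ET.fromstring(SAMPLE)
links = next_page_links(root)
print([a.get("href") for a in links])
```

The expression matches the links of every page after the currently selected one. Assuming only the first match is clicked on each pass of the loop, that first match is the link to the next page.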
Step 4. Move your cursor over the section with similar layout, where you would extract data.
Click the first section ➜ Click “Create a list of items” (sections with similar layout) ➜ Click “Add current item to the list”.
The first section has now been added to the list. ➜ Click “Continue to edit the list”.
Click the second section ➜ Click “Add current item to the list” again. Now we have all the links with similar layout. ➜ Click “Finish Creating List” ➜ Click “loop” to process the list and extract the elements on each page.
We need to modify the XPath for the Loop Item box to correctly select the items we want. The correct XPath is //div[@class='zg_itemWrapper']/div/A[@class='a-link-normal']
Click the Loop Item box. ➜ Enter the correct XPath into the Variable list textbox. ➜ Click "Save".
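If you want to check which elements the Loop Item XPath picks up, the following Python sketch may help. It runs outside Octoparse against hypothetical, simplified markup. Two adaptations are assumptions of this example: Python's ElementTree is case sensitive, so the tag is written as lowercase a, and the expression starts with a dot because ElementTree searches relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified list markup; the real page has many more attributes.
SAMPLE = """
<div id="zg_centerListWrapper">
  <div class="zg_itemWrapper">
    <div>
      <a class="a-link-normal" href="/book-one">Book One</a>
      <a class="a-size-small" href="/reviews-one">120</a>
    </div>
  </div>
  <div class="zg_itemWrapper">
    <div>
      <a class="a-link-normal" href="/book-two">Book Two</a>
    </div>
  </div>
</div>
"""

# Same shape as the tutorial's XPath, adapted for ElementTree (see lead-in).
ITEM_XPATH = ".//div[@class='zg_itemWrapper']/div/a[@class='a-link-normal']"

root = ET.fromstring(SAMPLE)
item_links = root.findall(ITEM_XPATH)
hrefs = [a.get("href") for a in item_links]
print(hrefs)
```

Only the links whose class is exactly a-link-normal are kept, one per item wrapper in this sample, so the loop visits each product once and skips the other links inside the same section.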
Step 5. Extract the detail information of the best sellers.
Click the best seller badge ➜ Select “Extract text”. Other contents can be extracted in the same way.
All the content will be selected in Data Fields. ➜ Click the “Field Name” to modify. Then click “Save”.
a) Right-click the content, if necessary, to avoid triggering its hyperlink.
b) Select an item that has all the information you need, since the first item sometimes does not include every field you want to extract.
c) You need to re-format some data fields, such as "Author" and "Language", to correctly extract the data you want from the product details page.
If you want to re-format the data field, select the data field ➜ Click “Customize Field” ➜ Click “Re-format extracted data” ➜ Click “Add step” ➜ Choose the options as needed ➜ Don't forget to click "Save".
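For readers who prefer to clean the fields after export instead, the idea behind re-formatting can be sketched in Python. The raw strings below are hypothetical examples of what a field might contain before cleaning, and strip_label and clean_author are helper names invented for this sketch, not Octoparse features.

```python
import re


def strip_label(raw, label):
    """Remove a leading 'Label:' prefix and surrounding whitespace."""
    pattern = r"^\s*" + re.escape(label) + r"\s*:?\s*"
    return re.sub(pattern, "", raw).strip()


def clean_author(raw):
    """Drop a leading 'by' and a trailing role such as '(Author)'."""
    text = re.sub(r"^\s*by\s+", "", raw, flags=re.IGNORECASE)
    text = re.sub(r"\s*\([^)]*\)\s*$", "", text)
    return text.strip()


# Hypothetical raw values, as they might look before re-formatting.
print(strip_label("Language: English", "Language"))
print(clean_author("by Jane Doe (Author)"))
```

Each helper does one small replacement, which is roughly what a single re-format step does: remove the label or decoration and keep only the value.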
Step 6. Drag the second "Loop Item" before the "Click Item" action of the first “Loop Item" box so that we can grab the best seller information from multiple pages.
Step 7. Click “Save” to save your configuration. Then click “Next” ➜ Click “Next” ➜ Click “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the data selected.
Step 8. The extracted data will be shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database or another format and save the file to your computer.
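Once the data is exported, you may want to load it into a script. The sketch below assumes a CSV export. The column names and the value formats (a review count with a thousands separator, a rating written as "4.5 out of 5 stars") are assumptions made for illustration, so adjust them to match your own file.

```python
import csv
import io

# Hypothetical export content; a real file would be opened with
# open(path, newline="") instead of io.StringIO.
EXPORTED = io.StringIO(
    "Book Name,Author,Reviews,Star Rating\n"
    'Book One,Jane Doe,"1,234",4.5 out of 5 stars\n'
    "Book Two,John Roe,87,3.9 out of 5 stars\n"
)


def load_rows(handle):
    """Read exported rows and convert the numeric fields."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "name": record["Book Name"],
            "author": record["Author"],
            # "1,234" -> 1234
            "reviews": int(record["Reviews"].replace(",", "")),
            # "4.5 out of 5 stars" -> 4.5
            "rating": float(record["Star Rating"].split()[0]),
        })
    return rows


rows = load_rows(EXPORTED)
print(rows[0])
```

Converting the review count and rating to numbers makes it easy to sort or filter the best sellers later.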
Author: The Octoparse Team