Scrape reviews from Amazon

Monday, January 14, 2019

In this tutorial, we will show you how to scrape the product reviews from Amazon.com.

To follow along, you can use this URL:

 

https://www.amazon.com/PlayStation-Portable-3000-System-Sony-PSP/dp/B001KMRN0M/ref=sr_1_1?s=videogames-intl-ship&ie=UTF8&qid=1546848761&sr=1-1&refinements=p_89%3ASony&th=1
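
The URL above carries a product path followed by several search and tracking parameters. As a minimal sketch (not part of the Octoparse workflow), the following Python snippet uses only the standard library to separate the product ID from the query string. It assumes the ID is the path segment right after "dp"; the helper name `split_product_url` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

URL = (
    "https://www.amazon.com/PlayStation-Portable-3000-System-Sony-PSP/dp/"
    "B001KMRN0M/ref=sr_1_1?s=videogames-intl-ship&ie=UTF8&qid=1546848761"
    "&sr=1-1&refinements=p_89%3ASony&th=1"
)

def split_product_url(url):
    """Return (product_id, query_params) for a URL shaped like the one above."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: the product ID is the segment right after "dp".
    product_id = segments[segments.index("dp") + 1]
    # parse_qs decodes percent-escapes, so "p_89%3ASony" becomes "p_89:Sony".
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return product_id, params

product_id, params = split_product_url(URL)
print(product_id)
print(params["refinements"])
```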

 

Here are the main steps in this tutorial:  [Download task file here]

1. "Go To Web Page" - to open the targeted web page

2. Create a pagination loop - to scrape all the reviews from multiple pages

3. Create a "Loop Item" - to scrape all the reviews on one page

4. Extract data - to select the data and remove the unwanted information

5. Run extraction - to run your task and get data

 

 

 

 

 

 

1) "Go To Web Page" - to open the targeted web page

  • Click "+ Task" to start a new task with Advanced Mode

      Advanced Mode is a highly flexible and powerful web scraping mode. If you want to scrape websites with complex structures, like Amazon.com, we strongly recommend starting your data extraction project in Advanced Mode.

  • Paste the URL into the "Input URL" box
  • Click "Save URL" to move on

 

 

 

 

 

 

2) Create a pagination loop - to scrape all the reviews from multiple pages

      If you are using the latest version of Octoparse, "Workflow Mode" is turned on automatically. If not, you can turn it on by switching the "Workflow" button in the top-right corner of Octoparse.

  • Scroll down the page and click “see all reviews”

       This step opens the full review listing, which we need in order to scrape all the reviews rather than only the few shown on the product page.

  • Scroll down the page and click the next page button ">"
  • Click "Loop click the selected link" on the "Action Tips"
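
Conceptually, the pagination loop keeps clicking the next page button until no such button is left. The sketch below illustrates that idea in plain Python. It is not how Octoparse is implemented, and the `PAGES` dictionary, its keys, and the `walk_pages` helper are all made-up stand-ins for real review pages.

```python
# Hypothetical stand-in for a paginated review listing: each page holds
# some reviews and the key of the next page (None on the last page).
PAGES = {
    "page-1": {"reviews": ["review A", "review B"], "next": "page-2"},
    "page-2": {"reviews": ["review C", "review D"], "next": "page-3"},
    "page-3": {"reviews": ["review E"], "next": None},
}

def walk_pages(pages, start, max_pages=100):
    """Yield (page_key, reviews) for each page, following "next" links."""
    current = start
    visited = 0
    # max_pages is a safety cap so a broken "next" chain cannot loop forever.
    while current is not None and visited < max_pages:
        page = pages[current]
        yield current, page["reviews"]
        visited += 1
        # Equivalent of clicking ">": move on, or stop when there is no next page.
        current = page["next"]

all_reviews = []
for key, reviews in walk_pages(PAGES, "page-1"):
    all_reviews.extend(reviews)

print(len(all_reviews), "reviews collected")
```

The `max_pages` cap is a design choice for the sketch: it bounds the loop even if the last page is never detected.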

 

 

 

 

 

3) Create a "Loop Item" - to scrape all the reviews on one page

      The reviews are organized on the page as a list. We need to build a "Loop Item" that extracts each review one by one.

  • Select the first review item in the built-in browser

        Make sure the whole block of the first review is selected; that is, the whole review block is highlighted in green, with all the sub-elements (title, customer name, date, content, and so on) in red, just as the following image shows:

  • Click "Select all sub-elements" on the "Action Tips"

       Now Octoparse will automatically recognize all the similar sections on this page and highlight them in red.

  • Click "Select all"
  • Click "Extract data in the loop"

       By default, all the data are automatically extracted from the items selected. We can delete the unwanted ones in the "Customize Action" area.
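
The "Loop Item" idea, one repeated block per review with the same sub-elements inside each block, can also be sketched outside Octoparse. The example below parses a small, invented snippet of markup with Python's standard `xml.etree.ElementTree`. The tag names, class names, and review text are assumptions for illustration and do not reflect Amazon's real page structure.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (not Amazon's real HTML).
SAMPLE = """
<div id="reviews">
  <div class="review">
    <span class="title">Great handheld</span>
    <span class="author">Alex</span>
    <span class="body">Works as described.</span>
  </div>
  <div class="review">
    <span class="title">Battery issue</span>
    <span class="author">Sam</span>
    <span class="body">Battery drains quickly.</span>
  </div>
</div>
"""

def extract_reviews(markup):
    """Return one dict per review block, keyed by sub-element class."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Outer step: find every block that looks like the first selected one.
    for block in root.findall(".//div[@class='review']"):
        # Inner step: pull each sub-element out of the current block.
        row = {span.get("class"): (span.text or "").strip()
               for span in block.findall("span")}
        rows.append(row)
    return rows

for row in extract_reviews(SAMPLE):
    print(row)
```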

 

Tips!

For more details on capturing a list of items, see this tutorial:

https://www.octoparse.com/tutorial-7/capture-a-list-of-items

 

 

 

 

 

 

4) Extract data - to select the data and remove the unwanted information

  • Delete the unwanted or useless data fields

      Hold "Shift" or "Ctrl" to select several unwanted data fields and delete them in one batch.

  • Select any information that Octoparse failed to capture automatically: click on it, then select "extract data" in the "Action Tips".
  • Rename the fields by choosing from the pre-defined list or entering your own names
  • Click "OK" to save the result.
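
In data terms, this step amounts to dropping some columns and renaming the rest. A minimal Python sketch of that cleanup follows. The auto-generated field names (`Field1`, `Field2`, and so on) and the sample row are hypothetical and only mimic the kind of default names a tool might produce.

```python
# Hypothetical raw row with auto-generated field names.
raw_rows = [
    {"Field1": "Great handheld", "Field2": "Alex", "Field3": "Report abuse",
     "Field4": "Works as described."},
]

UNWANTED = {"Field3"}          # fields to delete
RENAMES = {                    # old name -> new name
    "Field1": "title",
    "Field2": "author",
    "Field4": "content",
}

def clean_row(row, unwanted, renames):
    """Drop unwanted fields, then rename the remaining ones."""
    return {renames.get(name, name): value
            for name, value in row.items()
            if name not in unwanted}

clean_rows = [clean_row(row, UNWANTED, RENAMES) for row in raw_rows]
print(clean_rows[0])
```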

 

 

 

 

 

 

 

 

5) Run extraction - to run your task and get data

  • Click "Start Extraction"
  • Select "Local Extraction" to run the task on your computer
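
Once a run finishes, the extracted rows are ordinary tabular data. As a rough illustration of handling such rows afterwards, the sketch below writes them to CSV text with Python's `csv` module. The field names and sample rows are invented, and this is not Octoparse's own export feature.

```python
import csv
import io

# Invented sample rows; a real run would produce one row per review.
rows = [
    {"title": "Great handheld", "author": "Alex", "content": "Works as described."},
    {"title": "Battery issue", "author": "Sam", "content": "Drains quickly, sadly."},
]

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows, ["title", "author", "content"])
print(csv_text)
```

Using `io.StringIO` keeps the example self-contained; writing to a file opened with `newline=""` would work the same way.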

 

 

Below is the output sample. 

 

 


 

 

Author: Momo

Editor: Suire

 
