Amazon Scraping Case Study | Scrape Amazon Product Reviews and Ratings
Tuesday, September 19, 2017 11:30 AM
Welcome to Octoparse web scraping case study!
In this series of case study tutorials for Amazon, we will learn how to deal with various difficult-to-handle situations when scraping data from Amazon. In this tutorial, I will show you how to scrape Amazon product reviews and ratings.
List of features covered in this case study:
- Build a URL Loop List
- Set Regular Expression
- Run Local Extraction
Step 1. Set up basic information
- Click "Quick Start"
- Create a new task in Advanced Mode
- Complete the basic information
Step 2. Create a loop list
Octoparse can retrieve data from a set of web pages if you give it a list of target URLs. It then works through the list, opening each web page in turn.
First, drag a "Loop Item" action into the Workflow Designer pane to create a loop for a given list of URLs.
Copy the list of URLs you want to crawl and paste it into the "List of URLs" text box. → Click "OK". → Click "Save".
The list of URLs is now saved as the "Loop Item". Once the URL list is built, Octoparse opens the first URL automatically.
(Note: These web pages should share a similar layout so that the extraction actions set up for the first web page can be applied automatically to the rest of the list.)
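Before pasting a long list into the "List of URLs" text box, it can help to tidy it up first. The following is only a minimal sketch in Python, run outside Octoparse: the function name `clean_url_list` and the example.com addresses are hypothetical placeholders, not part of Octoparse. It drops blank lines, duplicates, and entries that are not well-formed http(s) URLs, while keeping the original order.

```python
from urllib.parse import urlparse


def clean_url_list(raw_text):
    """Return unique, well-formed http(s) URLs, one per input line, in order."""
    seen = set()
    urls = []
    for line in raw_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and duplicates
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical placeholder URLs; substitute your own product pages.
raw = """
https://www.example.com/product/AAA
https://www.example.com/product/BBB

https://www.example.com/product/AAA
not-a-url
"""

for url in clean_url_list(raw):
    print(url)
```

The printed lines can then be copied into the text box as they are.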
Step 3. Select the data to be extracted and rename data fields
Now we will begin to extract the overall reviews and ratings from the first web page.
Click the movie title in the first web page. → Select "Extract text".
Follow the same steps to extract the other data fields (ratings, reviews).
Rename the field names if necessary. → Click "Save".
Step 4. Set up regular expression
Regular expressions describe patterns to look for in the data. Octoparse lets you use regular expressions to reformat captured data.
In this case, if you look closely at the extracted “Ratings” field, you will see that it carries the unnecessary text “out of 5 stars”. To fix this, we can use the RegEx Tool to capture just the rating value.
- Select data field "Ratings", click the icon for "Customize Field"
- Choose "Re-format extracted data"
- Click "Add step"
- Select "Replace with Regular Expression"
- Click “Try RegEx Tool” to remove the “out of 5 stars” suffix from the string (or, if you know how to write regular expressions, enter "(.+?)(?=out of 5 stars)" directly in the “Regular Expression” box)
- Check “End with” and paste “out of 5 stars” into the text box so that the tool locates that string and captures the text before it
- Once done, click “Generate” → Check “Match all” → Click “Match” → Click “Apply” → Click “OK”
In the Replace box, the suffix disappears from the column data, leaving the remainder of the string intact. Click “OK” and “Save” to save the re-formatted data field.
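To see what the pattern does outside the RegEx Tool, here is a small sketch using Python's `re` module. It assumes that Python's handling of this pattern is close enough to Octoparse's for illustration. The helper name `clean_rating` and the sample strings are made up for this example. The lazy group `(.+?)` captures as little text as possible, and the lookahead `(?=out of 5 stars)` requires the suffix to follow without including it in the match.

```python
import re

# The lookahead pattern quoted in the step above.
PATTERN = re.compile(r"(.+?)(?=out of 5 stars)")


def clean_rating(text):
    """Return the part of `text` before 'out of 5 stars', or the text unchanged."""
    match = PATTERN.search(text)
    if match is None:
        return text.strip()  # no suffix found; leave the value as it is
    # The captured group still ends with the space before "out"; strip it.
    return match.group(1).strip()


print(clean_rating("4.5 out of 5 stars"))
```

Values that do not contain the suffix pass through unchanged, so a field that is already clean is not damaged.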
Step 5. Extract data from multiple web pages (configure pagination)
Now we need to extract the details of those reviews. As we want to extract reviews from multiple pages, we need to add a page navigation action.
Click “Next” ➜ “loop click next page” to create a loop action that processes all the web pages. The pagination action is now added to the extraction rule.
- If you want to extract information from every page of search results, you need to add a page navigation action.
- You can right-click the "Next" pagination link to select it without triggering the link.
- You can click the "Expand the selection area" button until "Loop click in the element" appears.
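Conceptually, the pagination loop keeps collecting items and following the "Next" link until no next page is left. The sketch below shows that idea in plain Python against a made-up in-memory "site". `crawl_all_pages`, `FAKE_PAGES`, and the page structure are assumptions for illustration only and say nothing about how Octoparse implements the loop internally.

```python
def crawl_all_pages(first_page, get_page):
    """Collect items from each page, following 'next' until there is none."""
    items = []
    visited = set()
    page_id = first_page
    while page_id is not None and page_id not in visited:
        visited.add(page_id)  # guard against a "Next" link that loops back
        page = get_page(page_id)
        items.extend(page["reviews"])
        page_id = page.get("next")
    return items


# Hypothetical in-memory "site" standing in for real review pages.
FAKE_PAGES = {
    "page1": {"reviews": ["review A", "review B"], "next": "page2"},
    "page2": {"reviews": ["review C"], "next": "page3"},
    "page3": {"reviews": ["review D"], "next": None},
}

all_reviews = crawl_all_pages("page1", FAKE_PAGES.__getitem__)
print(all_reviews)
```

The `visited` set is a small safety net: if the last page links back to an earlier one, the loop stops instead of running forever.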
Step 6. Create a loop list for multiple sections
To extract the elements of each review, we need a loop list that covers every review section.
Move your cursor over a section that shares its layout with the other sections you want to extract.
Click the first section ➜ Click “Create a list of items” (sections with similar layout) ➜ Click “Add current item to the list”.
The first section is now added to the list. ➜ Click “Continue to edit the list”.
Click the second section ➜ Click “Add current item to the list” again. Now all the sections with similar layout are captured. ➜ Click “Finish Creating List” ➜ Click “loop”.
(Note: You can click the "Expand the selection area" button until the section with similar layout is selected.)
Step 7. Extract the detail information of the reviews
Now come back to the first section. Click the reviewer ➜ Select “Extract text”.
The other fields (overview, time, rating, review) can be extracted in the same way.
All the selected content appears under Data Fields. ➜ Click a “Field Name” to rename it. Then click “Save”.
Re-format the data field “rating” by following Step 4 (Set up regular expression) above, so that only the rating value is kept.
If the first item does not include all the content you want to extract, select an item that has the full information you need instead.
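The idea behind looping over sections with a similar layout, and why a section with missing content matters, can be sketched in Python. The markup below is a hypothetical, simplified stand-in (real review pages use different tags and class names), and `extract_reviews` and `FIELDS` are names invented for this example. Each section is read with the same set of fields, and a field that is absent from a section becomes an empty string.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; real review pages differ.
SAMPLE = """
<div>
  <div class="review">
    <span class="reviewer">Alice</span>
    <span class="rating">5.0 out of 5 stars</span>
    <span class="body">Great film.</span>
  </div>
  <div class="review">
    <span class="reviewer">Bob</span>
    <span class="rating">3.0 out of 5 stars</span>
    <span class="date">September 1, 2017</span>
    <span class="body">Decent.</span>
  </div>
</div>
"""

FIELDS = ("reviewer", "rating", "date", "body")


def extract_reviews(markup):
    """Return one dict per review section; missing fields become empty strings."""
    root = ET.fromstring(markup)
    rows = []
    for section in root.findall(".//div[@class='review']"):
        row = {}
        for field in FIELDS:
            node = section.find(f"./span[@class='{field}']")
            has_text = node is not None and node.text
            row[field] = node.text.strip() if has_text else ""
        rows.append(row)
    return rows


for row in extract_reviews(SAMPLE):
    print(row)
```

In this sample the first section has no date. Had the field list been derived from that section alone, the date column would be missing for every review, which is why it pays to set up the fields on the most complete item.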
Step 8. Start running your task
Now that the task is configured, it's time to run it and get the data we want.
Click “Next” ➜ Click “Next” ➜ Click “Local Extraction” to run the task on your computer. Octoparse will automatically extract all the data selected.
(Note: Octoparse Cloud Extraction allows you to run the task without keeping your machine turned on, and extraction is also much faster (see the screenshots below). Features such as scheduled extraction, IP rotation, and API are also supported with the Cloud. Find out more about Octoparse Cloud Service here.)
Speed of Octoparse Cloud Extraction
Speed of Octoparse Local Extraction
Step 9. Check the data and export
The extracted data is shown in the "Data Extracted" pane. Click the "Export" button to export the results to an Excel file, a database, or another format, and save the file to your computer.
Now that you have learned how to crawl ratings and reviews from Amazon, get started with your own crawling task to extract any data you want.
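For readers who post-process the results in a script, the shape of a tabular export can be sketched with Python's standard `csv` module. This is an illustration only and not Octoparse's export code. The rows, the field names, and the helper `rows_to_csv` are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows standing in for extracted reviews.
rows = [
    {"reviewer": "Alice", "rating": "5.0", "review": "Great film."},
    {"reviewer": "Bob", "rating": "3.0", "review": "Decent, but long."},
]

csv_text = rows_to_csv(rows, ["reviewer", "rating", "review"])
print(csv_text)
```

Note that the `csv` module quotes any value containing a comma, so free-text reviews do not break the column layout when the file is opened in a spreadsheet.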
More tutorials about Amazon scraping: