Scrape product images from Amazon

Monday, January 14, 2019

In this tutorial, we are going to show you how to scrape product image URLs from Amazon.

To follow along, you may want to use the URL from this tutorial:

https://www.amazon.com/s/ref=sr_nr_p_n_theme_browse-bin_0?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A3736081%2Cn%3A13336081%2Ck%3Apainting%2Cp_n_feature_nineteen_browse-bin%3A17042452011%2Cp_n_theme_browse-bin%3A381190011&keywords=painting&ie=UTF8&qid=1546915257&rnid=38118901
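
To see which search filters this URL carries, its query string can be decoded with Python's standard library. This is only an illustrative sketch: the tutorial does not document what each Amazon parameter means, so the code simply prints the decoded values.

```python
from urllib.parse import urlsplit, parse_qs

# The listing URL used in this tutorial, split across lines for readability.
url = (
    "https://www.amazon.com/s/ref=sr_nr_p_n_theme_browse-bin_0"
    "?fst=as%3Aoff"
    "&rh=n%3A1055398%2Cn%3A3736081%2Cn%3A13336081%2Ck%3Apainting"
    "%2Cp_n_feature_nineteen_browse-bin%3A17042452011"
    "%2Cp_n_theme_browse-bin%3A381190011"
    "&keywords=painting&ie=UTF8&qid=1546915257&rnid=38118901"
)

# parse_qs percent-decodes each value and returns a dict of lists.
params = parse_qs(urlsplit(url).query)

# The "rh" value is a comma-separated list of "key:value" pairs.
filters = params["rh"][0].split(",")

for name, values in params.items():
    print(name, "=", values[0])
```

If you change the search on Amazon and copy a new URL, the same snippet shows how the parameters differ.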

 

This tutorial will also cover:

Customizing the data field using the RegEx Tool (Optional)

 

Here are the main steps in this tutorial: [Download task file here]

1. "Go To Web Page" - to open the target web page

2. Create a pagination loop - to scrape data from multiple listing pages

3. Create a "Loop Item" - to loop click into each product page on every listing page

4. Extract data - to extract the image URLs

5. Customize the data field - to get a normal size image (Optional)

6. Save and start extraction - to get all the URLs of the desired images

1) "Go To Web Page" - to open the target web page

  • Click "+Task" to start a new task with Advanced Mode
  • Paste the URL above into the "Extracting URL" box and click "Save URL" to move on

2) Create a pagination loop - to scrape data from multiple listing pages

  • Scroll down and click the "Next Page" button
  • Click "Loop click next page" on "Action Tips"
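
Conceptually, the pagination loop keeps clicking "Next Page" until there is no next page left. A minimal sketch of that control flow follows, using a small in-memory dictionary in place of real web pages. The page names, item names, and the `iter_pages` helper are all hypothetical, and this is not a description of how Octoparse works internally.

```python
# Hypothetical stand-in for a website: each page lists items and names the next page.
FAKE_SITE = {
    "page-1": {"items": ["painting A", "painting B"], "next": "page-2"},
    "page-2": {"items": ["painting C"], "next": "page-3"},
    "page-3": {"items": ["painting D", "painting E"], "next": None},
}

def iter_pages(site, start):
    """Yield (page name, page) pairs, following "next" links until there is none."""
    current = start
    seen = set()  # guard against a page that links back to an earlier one
    while current is not None and current not in seen:
        seen.add(current)
        page = site[current]
        yield current, page
        current = page["next"]

visited = [name for name, _ in iter_pages(FAKE_SITE, "page-1")]
print(visited)
```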

3) Create a "Loop Item" - to loop click into each product page on every listing page

We are now on the second page. A "Loop Item" should always be built starting from the first item on the first page, so we need to go back to the first page.

  • Click "Go To Web Page" in the workflow.
  • Select the pagination loop in the workflow

This helps Octoparse determine the execution order and generate the "Loop Item" at the appropriate position in the workflow.

A list of items on a listing page sometimes includes several "Ads" items. To exclude these promoted products, we can start building the "Loop Item" from the second row of products on this page.

  • Click the title of the first item in the second row
  • Click "Select All" on "Action Tips"

Octoparse will automatically identify other product links on the current page. The selected links will be highlighted in green while others will be highlighted in red.

  • Click "Loop click each element" to create a "Loop Item"

Octoparse will click through each link captured in the "Loop Item", and open the product detail page.
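
The resulting workflow is a nested loop: the pagination loop on the outside and the "Loop Item" on the inside, visiting each product link on the current page. The sketch below mirrors that order and leaves out promoted items. The data, the `sponsored` flag, and the `product_links` function are hypothetical; the tutorial itself avoids ads by starting the selection from the second row rather than by checking a flag.

```python
# Hypothetical listing pages; each item has a title, a link, and an ad flag.
LISTING_PAGES = [
    [
        {"title": "Ad canvas", "url": "/dp/AD001", "sponsored": True},
        {"title": "Oil painting", "url": "/dp/P001", "sponsored": False},
        {"title": "Watercolor", "url": "/dp/P002", "sponsored": False},
    ],
    [
        {"title": "Sketch print", "url": "/dp/P003", "sponsored": False},
    ],
]

def product_links(pages):
    """Return product links in workflow order: page by page, item by item."""
    links = []
    for page in pages:              # outer loop: pagination
        for item in page:           # inner loop: "Loop Item"
            if item["sponsored"]:   # leave out promoted products
                continue
            links.append(item["url"])
    return links

links = product_links(LISTING_PAGES)
print(links)
```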

4) Extract data - to extract image URLs

  • Click on one of the images
  • Select the "IMG" tag at the bottom of "Action Tips"

When an IMG element is selected, the selected tag should read "IMG". Normally there is no need to change the tag, as Octoparse automatically identifies the tags of selected items. In this case, however, we need to switch the tag at the bottom of "Action Tips".

  • Click "Extract the URL of the selected image"

We can then repeat the steps above to get the other image URLs.
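
For readers who want to see what "extract the URL of the selected image" amounts to outside Octoparse, here is a rough Python equivalent that reads the `src` attribute of every `<img>` tag. The HTML snippet and the example.com URLs are made up for illustration and do not reflect Amazon's actual markup.

```python
from html.parser import HTMLParser

class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

# Hypothetical snippet standing in for a product page's thumbnail strip.
SAMPLE_HTML = """
<ul>
  <li><span><img src="https://example.com/images/I/aaa._SS40_.jpg"></span></li>
  <li><span><img src="https://example.com/images/I/bbb._SS40_.jpg"></span></li>
</ul>
"""

collector = ImageSrcCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
```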

Tips!

If we need all of the image URLs extracted into a single cell, we can use the RegEx Tool to pick them all out of the element's HTML, as follows:

  • Select the highlighted area
  • Click "Extract inner HTML of the selected element" on "Action Tips"

Pick up the image URLs with the RegEx Tool.

  • Click "customize data field"
  • Select "Refine extracted data", click "Add step", and click "Match with Regular Expression"
  • Click "Try RegEx Tool"
  • Check "Start with" and enter img src=" in the box
  • Check "End with" and enter "> in the box
  • Click "Generate" to auto-generate the regular expression
  • Check "Match All" and click "Match" to get all the image URLs
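
The "Start with" / "End with" idea can be reproduced in Python with a lookbehind and a lookahead. This is a sketch under assumptions: the exact expression Octoparse generates may differ, and the inner HTML below is a made-up sample rather than real Amazon markup.

```python
import re

# Hypothetical inner HTML of the thumbnail area.
inner_html = (
    '<li><img src="https://example.com/images/I/aaa._SS40_.jpg"></li>'
    '<li><img src="https://example.com/images/I/bbb._SS40_.jpg"></li>'
)

start = 'img src="'   # the "Start with" text
end = '">'            # the "End with" text

# The lookbehind and lookahead keep the delimiters out of the match;
# .+? is non-greedy, so each match stops at the first closing delimiter.
pattern = re.compile("(?<=" + re.escape(start) + ").+?(?=" + re.escape(end) + ")")

# findall plays the role of "Match All": it returns every match, not just the first.
urls = pattern.findall(inner_html)
print(urls)
```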

5) Customize the data field - to get a normal size image (Optional)

The image URL we just extracted is the URL of a thumbnail image. To get a normal size image, we need to reformat its URL with the RegEx Tool.

  • Click "customize data field"
  • Select "Refine extracted data", click "Add step", and then select "Replace"
  • Enter the text between "._" and "_." into the "Replace" box. In this example, it is "SS40"
  • Click "Evaluate"
  • Click "OK" to save the result
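
The same replacement can be sketched in Python. The tutorial text does not say what "SS40" should be replaced with, so the value "SS500" below is purely an assumption for illustration, as are the example.com URL and the `resize_image_url` helper.

```python
import re

def resize_image_url(url, new_size):
    """Swap the size token between "._" and "_." for new_size."""
    # The token is assumed to consist of letters and digits only, e.g. "SS40".
    return re.sub(r"\._[A-Za-z0-9]+_\.", "._" + new_size + "_.", url, count=1)

thumb = "https://example.com/images/I/aaa._SS40_.jpg"  # hypothetical thumbnail URL

# Plain text replacement, mirroring the "Replace" step; "SS500" is an assumed value.
replaced = thumb.replace("SS40", "SS500")

# Pattern-based variant that works whatever the current size token is.
resized = resize_image_url(thumb, "SS500")

print(replaced)
print(resized)
```

The pattern-based variant is handy when thumbnails do not all share the same size token, while the plain replacement matches what the steps above do.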

If you want to learn about the Octoparse RegEx Tool in detail, please refer to the following tutorials:

Octoparse Regular Expression Tool

How and when to use Regular Expression in Octoparse - a guide for beginners

6) Save and start extraction - to get all the URLs of the desired images

  • Click "Save"
  • Click "Start Extraction"
  • Select "Local Extraction" to start execution.

 

Tips!

With the steps above, we can only extract the image URLs. To download all the images from the extracted URLs, please refer to "How to download images from a list of URLs?"
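
As a rough alternative to the linked tutorial, the exported URLs could also be downloaded with a short Python script. This is a minimal sketch using only the standard library; the function names and the file naming scheme are assumptions, and it does not handle retries, rate limits, or any site's terms of use.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlopen

def fetch_bytes(url):
    """Download one URL and return its raw bytes (performs a network request)."""
    with urlopen(url, timeout=30) as response:
        return response.read()

def download_images(urls, folder, fetch=fetch_bytes):
    """Save each URL into folder and return the list of file paths written."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        # Name the file after the last path segment, prefixed to avoid clashes.
        name = os.path.basename(urlsplit(url).path) or "image"
        path = os.path.join(folder, f"{index:03d}_{name}")
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

A call such as `download_images(urls, "images")`, with `urls` being the list exported from the task, would write the files into an `images` folder. The `fetch` parameter exists so the function can be tried out without network access.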

Here is the sample output.

 

 

Was this article helpful? Contact us any time if you need our help!

 

 

Author: Momo

Editor: Suire
