Step-by-step tutorials for you to get started with web scraping


Lesson 5: Getting data - Click on a list and capture data from each item page

Thursday, August 16, 2018

By now you have extracted simple text (see lesson 3) and text contained in a list (see lesson 4). Next, we’ll combine the two techniques and show you how to click into each link in a list and capture detailed information from every item page. Clicking through links this way is very handy when extracting information from e-commerce sites and directory sites.

[Image: web scraping with Octoparse - extract from item page]

Let’s see how it is done with an example. We will use the URL https://www.ebay.com/sch/Vehicle-Electronics-GPS-/3270/i.html for the example below. [Download the task file in this lesson]

 

1) Select the links to click into each item page

To do this, we will create a "Loop Item" that clicks each product link on the result page in turn.

  • Click on the first product title, which contains the URL to the item page. The selected item will be highlighted in green, while items with the same layout will be highlighted in red.
  • Click on the second product title, which also contains a URL.
  • Select "Loop click each URL" from "Action Tips". Notice that a "Loop Item" for the clicking action is auto-generated and added to the workflow.

 

Tips!

To loop click through items on the list, it is important that you select the anchor text. Octoparse automatically identifies the tag of each selected item, so when you select an item that carries a URL, the selected tag should be "A". "A" stands for anchor, the element that usually links one page to another.
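
If you are curious what selecting "A" tags amounts to outside the Octoparse interface, here is a minimal Python sketch using only the standard library. It is not how Octoparse works internally. The sample HTML, the example.com base URL and the AnchorCollector name are all invented for illustration. It gathers the link of every anchor on a list page, which is roughly the set of pages a "Loop Item" would click through.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "A" and "a" both arrive as "a".
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


# Hypothetical result-page fragment, invented for illustration.
SAMPLE_LIST_HTML = """
<ul>
  <li><a href="/itm/101">GPS unit A</a><span>$20</span></li>
  <li><a href="/itm/102">GPS unit B</a><span>$35</span></li>
  <li><span>Sponsored</span></li>
</ul>
"""

collector = AnchorCollector("https://www.example.com/")
collector.feed(SAMPLE_LIST_HTML)
item_urls = collector.links
print(item_urls)
```

Note that the third list entry has no anchor, so it yields no link. This is why selecting the anchor text, rather than a price or a label, matters when you build the loop.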

 

2) Select details on the item page to extract

Once the "Loop Item" has been created, Octoparse will load the first item page in the built-in browser. 

Now, set up an extraction template by designating the specific data fields to capture from the page; Octoparse will apply this template to the other item pages. 

  • Click on the target data fields, such as title, review, price, etc.
  • When you finish selecting, choose "Extract data" from "Action Tips" to complete the extraction action. Notice that an "Extract data" step is auto-generated and added to the workflow. The extracted data fields are displayed in the "Data field" pane next to the workflow designer.
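
The idea of one extraction template applied to every item page can be sketched in plain Python as well. This is only an illustration under stated assumptions: the element ids ("item-title", "item-price"), the sample pages and the FieldExtractor and extract_fields names are hypothetical, and real item pages are rarely this tidy.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Capture the text of elements whose id is named in the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template    # maps element id -> output field name
        self.current_field = None   # field currently being captured, if any
        self.row = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.template:
            self.current_field = self.template[element_id]
            self.row[self.current_field] = ""

    def handle_data(self, data):
        if self.current_field is not None:
            self.row[self.current_field] += data

    def handle_endtag(self, tag):
        # Simplification: stop capturing at the first closing tag.
        self.current_field = None


def extract_fields(html, template):
    """Apply one template to one page and return a dict of field values."""
    parser = FieldExtractor(template)
    parser.feed(html)
    parser.close()
    return {field: text.strip() for field, text in parser.row.items()}


# Hypothetical template and item pages, invented for illustration.
TEMPLATE = {"item-title": "title", "item-price": "price"}
ITEM_PAGES = [
    '<h1 id="item-title">GPS unit A</h1><span id="item-price">US $20.00</span>',
    '<h1 id="item-title">GPS unit B</h1><span id="item-price">US $35.00</span>',
]

# The same template is reused for every page, one row per item.
rows = [extract_fields(page, TEMPLATE) for page in ITEM_PAGES]
print(rows)
```

The template is defined once and reused for each page, which mirrors how the fields you pick on the first item page are applied to the rest of the list.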

 

 

Tips!

Setting up a wait time in "Advanced Options" for steps like "Click Item" or "Extract Data" helps avoid skipped data and makes the crawling process more human-like! (Usually 2-5 seconds works well.)
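
For readers who also script their own crawlers, the same wait-time idea can be sketched in Python. This is a hedged illustration and not an Octoparse setting: polite_wait and visit_all are hypothetical helper names, and the 2-5 second range is simply the one suggested in the tip above.

```python
import random
import time


def polite_wait(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Pause for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def visit_all(urls, visit, sleep=time.sleep):
    """Call visit(url) for each URL, waiting after every visit."""
    results = []
    for url in urls:
        results.append(visit(url))
        polite_wait(sleep=sleep)
    return results


# Demo: a list's append method stands in for time.sleep, so the delays are
# recorded instead of actually waited for and the example runs instantly.
waits = []
pages = visit_all(["page-1", "page-2"], visit=str.upper, sleep=waits.append)
print(pages, len(waits))
```

In a real run you would leave the sleep argument at its default so the script actually pauses. Randomizing the delay, rather than using a fixed one, is a design choice that makes the request pattern look less mechanical.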

 

Done! Learn how to set up pagination in lesson 6 to complete your scraping project!

 

 Lesson 6: Pagination - Capture data from multiple pages

 

