Step-by-step tutorials for you to get started with web scraping


Lesson 4: Getting data - Capture a list of items

Thursday, August 16, 2018

In the previous lesson, we learned how to capture simple text (see how to capture text from a page in lesson 3). We are now ready to move on to a more advanced scraping technique: capturing a list of items.

Content on webpages is usually organized in certain patterns, and one of the most commonly seen patterns is a list, such as a page of product listings.
 
 
Because lists are so common, Octoparse aims to make extracting them quick and easy by automatically detecting all the possible elements of a list. Now let’s see how it is done with an example.

 

 1. Build the list by defining a pattern

Tell Octoparse which items to include in the list by selecting any 2 items from the list.

· Click any 2 product sections consecutively. Notice that the other product sections on the page are automatically selected and highlighted in green, with all their sub-elements highlighted in red.

 

· Click "Extract text of the selected elements". A "Loop Item" will be automatically generated and added to the workflow. By default, Octoparse extracts the text of the selected item. If this is not exactly what you are looking for, you can delete it and add the data fields you need in the next step.

 

Tips!

1. For the list to be built correctly with the desired items, it is critical that the two selections are identical in structure, i.e. the highlighted content should have the same "look".


 

You can always expand the selection area by clicking on the tags (e.g. DIV, A, LI, etc.) at the bottom of "Action Tips".

2. If certain items on the list are still missing after the first two clicks, keep clicking on more products from the same list until all the desired items are selected and highlighted in green.
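
The idea behind selecting two items to define a pattern can be sketched in a few lines of Python. This is only a conceptual illustration, not how Octoparse works internally: the page fragment, the `signature` and `build_list` helpers, and the choice of "tag plus class" as the shared "look" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real product page would be far larger.
PAGE = """
<div id="results">
  <div class="banner">Summer sale</div>
  <div class="product"><h3>Kettle</h3><span class="price">$25</span></div>
  <div class="product"><h3>Toaster</h3><span class="price">$40</span></div>
  <div class="product"><h3>Blender</h3><span class="price">$60</span></div>
</div>
"""


def signature(element):
    """Describe an element by its tag and class, i.e. its assumed 'look'."""
    return (element.tag, element.get("class"))


def build_list(container, first, second):
    """Return every child of container that matches the two clicked examples."""
    if signature(first) != signature(second):
        # Mirrors the tip above: both selections must share one structure.
        raise ValueError("the two selected items differ in structure")
    pattern = signature(first)
    return [child for child in container if signature(child) == pattern]


root = ET.fromstring(PAGE.strip())
children = list(root)

# Simulate clicking the first two product sections (the banner is index 0).
items = build_list(root, children[1], children[2])
print([item.find("h3").text for item in items])
# prints ['Kettle', 'Toaster', 'Blender']
```

The third product is picked up without being clicked, while the banner is left out because its class differs from the two examples.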


2. Capture sub-elements within the selected item

2.1 From the item highlighted in green (usually the first one on the list), click to capture the desired sub-elements. This sets an extraction template for the other items on the list: configure the extraction step for the first item, and Octoparse will apply the template to the remaining items.

· Click to capture any sub-elements within the highlighted section

· When you have finished selecting, click "Extract text of the selected elements"
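
The notion of an extraction template, configured once on the first item and then applied to every item in the loop, can be illustrated with a short Python sketch. It is an analogy rather than Octoparse's actual mechanism: the page fragment, the `TEMPLATE` mapping, and the `extract` helper are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; note that the last product has no price.
PAGE = """
<div id="results">
  <div class="product"><h3>Kettle</h3><span class="price">$25</span></div>
  <div class="product"><h3>Toaster</h3><span class="price">$40</span></div>
  <div class="product"><h3>Blender</h3></div>
</div>
"""

# Template "configured" on the first item: field name -> path relative to the item.
TEMPLATE = {"title": "h3", "price": "span[@class='price']"}


def extract(item, template):
    """Apply the template to one list item and return a row of fields."""
    row = {}
    for field, path in template.items():
        node = item.find(path)
        # A sub-element missing from this item yields None instead of an error.
        row[field] = node.text.strip() if node is not None and node.text else None
    return row


root = ET.fromstring(PAGE.strip())
items = root.findall("div[@class='product']")

# The same template is reused for every item in the list.
rows = [extract(item, TEMPLATE) for item in items]
for row in rows:
    print(row)
```

Because the paths are relative to each item, the template only has to be defined once, and an item that lacks a sub-element simply produces an empty field.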

 

2.2. Capture all sub-elements automatically 

Besides the steps provided in 2.1, there is an alternative way to capture sub-elements in Octoparse 7.x. As you add items to the list, Octoparse automatically detects all sub-elements within the selected sections and highlights them in red. You can then click "Select all sub-elements" in "Action Tips" to have all the detected sub-elements selected at once.

 

Now, all sub-elements are selected and shown in the "Action Tips" panel.

· Click the "X" next to the data fields to delete any unnecessary columns.

· Once done, select "Extract data".

Notice that the extracted data fields are added to the "Data field" pane next to the workflow designer, where they can be customized further if necessary.
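
As a rough analogy for selecting all sub-elements and then deleting the unnecessary columns, the following Python sketch collects every text-bearing sub-element of one item and drops the fields that are not wanted. The item markup, the `detect_fields` and `drop_fields` helpers, and the rule for naming fields are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a single list item.
ITEM = """
<div class="product">
  <h3>Kettle</h3>
  <span class="price">$25</span>
  <span class="rating">4.5</span>
  <a href="/kettle">Details</a>
</div>
"""


def detect_fields(item):
    """Collect the text of every sub-element that carries text."""
    fields = {}
    for node in item.iter():
        if node is item:
            continue  # skip the item itself, keep only its sub-elements
        text = (node.text or "").strip()
        if text:
            # Assumed naming rule: use the class if present, else the tag.
            # Two sub-elements with the same name would overwrite each other.
            name = node.get("class") or node.tag
            fields[name] = text
    return fields


def drop_fields(fields, unwanted):
    """Remove unwanted columns, like clicking the "X" next to a data field."""
    return {name: value for name, value in fields.items() if name not in unwanted}


item = ET.fromstring(ITEM.strip())
fields = detect_fields(item)
kept = drop_fields(fields, {"a"})
print(kept)
# prints {'h3': 'Kettle', 'price': '$25', 'rating': '4.5'}
```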

 

 

 Lesson 5: Click on a list and capture data from each item page

 

 
