Step-by-step tutorials for you to get started with web scraping


Use lists to extract

Thursday, August 16, 2018

What is a list?

A list is a collection of recurring elements that share a similar HTML pattern. Lists, in their many forms, are one of the most common ways for websites to organize information.

Tips:
Octoparse detects the elements belonging to a list by their shared coding pattern in the underlying HTML source code.
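
To make the idea of a "coding pattern" concrete, here is a minimal Python sketch that looks for sibling elements sharing the same tag and class. It is only an illustration of the concept, not how Octoparse works internally. The sample markup and the function name find_repeated_children are hypothetical, and the sketch assumes well-formed markup so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed sample markup; real pages are usually messier.
SAMPLE = """
<div id="page">
  <h1>News</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
    <li class="item"><a href="/c">Third</a></li>
  </ul>
  <p>Footer text</p>
</div>
"""


def find_repeated_children(root, min_repeats=2):
    """Return (parent tag, (child tag, child class), count) for repeated siblings."""
    results = []
    for parent in root.iter():
        # Count direct children by their (tag, class) signature.
        signatures = Counter((child.tag, child.get("class")) for child in parent)
        for signature, count in signatures.items():
            if count >= min_repeats:
                results.append((parent.tag, signature, count))
    return results


root = ET.fromstring(SAMPLE)
print(find_repeated_children(root))
```

In this sample, only the three "item" entries under the ul element repeat, so they are the only candidates reported as a list.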

Now, let’s look at a few examples of how web pages organize information with lists. 


Since lists are so common, learning to extract data from a list, or by building one, is a key scraping technique to acquire. In this tutorial, I will cover a number of scenarios in which data extraction is done by setting up a list in Octoparse.





Extracting data from a list
Getting data such as text, URLs, or HTML directly from a selected list is the most basic list extraction technique. Follow the steps below to complete the action:

1) Click on any one of the elements of the target list


2) From the Action Panel, click "Select all"


3) Depending on the kind of data needed, follow the prompts on the Action Panel to finish the extraction action (e.g., "Extract link text").



Tips!
  • Use the expansion button from the Action Panel to expand the selection if necessary. 


  • Octoparse always has the selected element highlighted in green and detected elements highlighted in red. 
  • If Octoparse fails to detect all elements of the list after the first click, click on any element that was not detected. Octoparse learns from each newly clicked element and keeps refining the list.
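
For readers who think in code, the "Select all" plus "Extract link text" steps are roughly equivalent to the sketch below. It assumes hypothetical, well-formed sample markup and a hypothetical helper named extract_links, and it uses only the Python standard library. It is an analogy for what the steps produce, not a description of Octoparse itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample list; each item holds one link.
SAMPLE = """
<ul>
  <li class="item"><a href="/a">First</a></li>
  <li class="item"><a href="/b">Second</a></li>
  <li class="item"><a href="/c">Third</a></li>
</ul>
"""


def extract_links(markup):
    """Collect the link text and URL of every list item, in page order."""
    root = ET.fromstring(markup)
    return [
        {"text": (a.text or "").strip(), "url": a.get("href")}
        for a in root.findall("./li/a")  # one match per list item
    ]


for row in extract_links(SAMPLE):
    print(row["text"], row["url"])
```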





Extracting data from sections of a list  
When a list is made up of similar sections, each containing several pieces of information we want to capture, we can capture that detailed information directly by building a list of the sections.

For example, consider a page built as a list of news articles, where each article shows details such as its title, the date it was posted, and the name of the author.


To extract detailed information from each individual section of a list, we will split the extraction process into two steps:

1) Build a list of the target sections
2) Specify the detailed data fields to capture from each individual section

Follow the steps below to see how it is done in action: 
1) Build a list of the target sections
  • Click on any one section of the target list. Now the whole target section should be highlighted in green with all the sub-elements highlighted in red. 



Tips:
  • Hover the mouse over the section until the whole section desired is highlighted.
  • It is often difficult to pinpoint the exact section needed. In that case, click on the expansion icon from the Action Panel to expand the selection until all target data fields are included.

  • Click on another section of the target list. Octoparse automatically selects all the similar sections (highlighted). 

Tips:
  • The Action Panel offers prompts to extract the detected sub-elements, but since we want to extract from all the sections of the list and not just the clicked one, we continue building the list instead of selecting any of the prompted actions.
  • If some of the needed sections are still not selected automatically after two clicks, keep clicking on the unselected sections to help Octoparse refine the list.

  • From the Action Panel, select "Extract text of the selected element"


2) Extract the specific data fields from each individual section 
  • Click on the title of the article
  • From the Action Panel, select "Extract text of the selected element"
  • Follow the same steps for extracting the other data fields, such as the author, posted date, and abstract of the article. 



Tips!

It is important to make sure that you are selecting data fields from the highlighted section so Octoparse can relate the data fields to the corresponding sections accurately. 
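
The reason fields must be selected inside the highlighted section can be sketched in code: each field is looked up relative to its own section, so values from different articles never get mixed up. The markup, the field names, and the helper functions below are hypothetical and only illustrate the two-step idea (build the list of sections, then read fields from each one). They do not show Octoparse internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical news list; the second article has no author on purpose.
SAMPLE = """
<div class="news">
  <div class="article">
    <h2>Title one</h2>
    <span class="date">2018-08-01</span>
    <span class="author">Ann</span>
  </div>
  <div class="article">
    <h2>Title two</h2>
    <span class="date">2018-08-02</span>
  </div>
</div>
"""

# Each path is relative to one section, never to the whole page.
FIELDS = {
    "title": "./h2",
    "date": "./span[@class='date']",
    "author": "./span[@class='author']",
}


def text_or_none(section, path):
    """Return the stripped text at a relative path, or None if it is missing."""
    node = section.find(path)
    return node.text.strip() if node is not None and node.text else None


def extract_sections(markup):
    """Step 1: build the list of sections. Step 2: read each field per section."""
    root = ET.fromstring(markup)
    return [
        {name: text_or_none(section, path) for name, path in FIELDS.items()}
        for section in root.findall("./div[@class='article']")
    ]


for row in extract_sections(SAMPLE):
    print(row)
```

Because the lookup is per section, the missing author in the second article becomes an empty value in that row instead of shifting another article's author into it.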


3) Toggle the Workflow switch located on the upper right side. The workflow generated by Octoparse appears on the left, and the extracted data appears on the right. Rename the fields as needed, or delete any unnecessary data fields.


Tips!
To confirm if the data is being captured correctly for each item in the loop list, select different items from the loop then click "Extract data". Check to see if the data corresponding to each loop item is being extracted correctly.

 




Click each link in a list to extract

There is only so much that can be contained in the sections of a list. When more detail is needed, it is often necessary to click on the links in the list and then capture the information from the detail page.

Let’s take a look at the following example.


Some information, such as the product title and model number, is available directly from the list. But when we want something more specific, such as the features or specifications of the products, we need to click on the links in the list and then capture the desired data from the detail page. To do this, we will split the extraction process into two steps:

1) Build a list of the links to click open 
2) Specify the desired data fields to capture from the detail page

Follow the steps below to see how it is done:
1) Build a list of the desired links
  • Click on a link from the list


  • From the Action Panel, click "Select all"


  • From the Action Panel, click "Loop click each URL"


2) Specify the data fields to capture from the detail page
  • Click on the title of the product
  • From the Action Panel, click "Extract text of the selected element"
  • Follow the same steps to capture any other data fields, such as the model, SKU, rating, etc.


3) Toggle the Workflow switch located on the upper right side. The workflow generated by Octoparse appears on the left, and the extracted data appears on the right. Rename the fields as needed, or delete any unnecessary data fields.
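
The "Loop click each URL" pattern above corresponds to a simple loop: build the list of links, open each one, and read the fields from the detail page. The sketch below is a rough analogy under stated assumptions. The base URL, the sample pages, the field names, and the fetch function (a stand-in that returns canned markup instead of making a real HTTP request) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://example.com/products/"  # hypothetical site

# Hypothetical list page with links to two detail pages.
LIST_PAGE = """
<ul>
  <li><a href="p1.html">Printer A</a></li>
  <li><a href="p2.html">Printer B</a></li>
</ul>
"""

# Canned detail pages keyed by absolute URL (stand-in for the network).
DETAIL_PAGES = {
    "https://example.com/products/p1.html":
        "<div><h1>Printer A</h1><span class='sku'>SKU-001</span></div>",
    "https://example.com/products/p2.html":
        "<div><h1>Printer B</h1><span class='sku'>SKU-002</span></div>",
}


def fetch(url):
    """Stand-in for an HTTP request; returns canned markup for the URL."""
    return DETAIL_PAGES[url]


def scrape_details(list_markup, base_url, fetch_page):
    """Step 1: build the list of links. Step 2: read fields from each detail page."""
    rows = []
    for link in ET.fromstring(list_markup).findall("./li/a"):
        url = urljoin(base_url, link.get("href"))  # resolve relative links
        detail = ET.fromstring(fetch_page(url))
        rows.append({
            "url": url,
            "title": detail.findtext("./h1"),
            "sku": detail.findtext("./span[@class='sku']"),
        })
    return rows


for row in scrape_details(LIST_PAGE, BASE_URL, fetch):
    print(row)
```

Passing the fetch function in as a parameter keeps the loop itself independent of how pages are retrieved, which also makes it easy to try out on saved pages.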


 




Capture a table

A table is one of the most common forms of data display on the web. To capture data from a table with Octoparse, we apply the list extraction technique: treat each row of the table as a single section of a list, then specify the data fields (the columns) to extract from each row.


Follow the steps below to see how it is done in Octoparse:
1) Click on any one row of the table

 
Tips!
  • Keep clicking on the expansion icon from the Action Panel until the whole row is highlighted.


  • Reload the webpage using the "Reload" icon located next to the address bar.
  • As soon as all the information needed has loaded in the built-in browser, you can click on the "Stop loading" icon to proceed to the next action.

2) Click on another row of the same table

3) From the Action Panel, click "Extract text of the selected elements"


 

4) From the highlighted rows, click on desired data fields to capture.
5) From the Action Panel, click "Extract text of the selected elements".

6) Capture the other desired data fields from the highlighted row similarly.



Tips:
If you need the URL or HTML of the selected element instead of its text, click on the corresponding option from the Action Panel.

7) Toggle the Workflow switch located on the upper right side. On the left side is the workflow generated by Octoparse and on the right side is the data extracted. Rename the fields as needed or delete the unnecessary data fields. 


8) Click through the actions in the workflow to see if different rows have the data extracted correctly. 
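
Treating each table row as one section of a list can be sketched in a few lines of Python. The sample table and the function name table_to_rows are hypothetical, and the sketch assumes a simple, well-formed table whose first row holds the column headers. Real pages often wrap rows in extra elements such as tbody, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: the first row holds headers, the rest hold data.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""


def table_to_rows(markup):
    """Treat every data row as one list section and every column as a field."""
    table = ET.fromstring(markup)
    rows = table.findall("./tr")
    headers = [(th.text or "").strip() for th in rows[0].findall("./th")]
    records = []
    for row in rows[1:]:
        cells = [(td.text or "").strip() for td in row.findall("./td")]
        records.append(dict(zip(headers, cells)))  # pair each cell with its column
    return records


for record in table_to_rows(SAMPLE):
    print(record)
```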


 


 

 
