
List/Table Web Page - Advanced Mode

Thursday, March 24, 2016 5:14 AM


 

Content on web pages is usually organized in patterns. Two of the most common patterns are lists and tables. This tutorial covers both, starting with content laid out as a list.

 

Part I: Scrape from a list

Scraping a list is quick and easy with Octoparse's Auto-detect feature, which identifies the items in a list and generates the task workflow automatically. Let's see how it is done with an example.

This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...

[Screenshot: blog items]

 

Our goal is to extract the data into Excel like this:

[Screenshot: sample data]
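To make the target shape concrete, here is a minimal Python sketch of the same idea outside Octoparse: one row per list item and one column per field, written as CSV text that Excel can open. The sample rows and the `rows_to_csv` helper are hypothetical illustrations, not part of Octoparse or its export format.

```python
import csv
import io

# Hypothetical sample rows; real values would come from the scraped page.
rows = [
    {"title": "First post", "date": "2016-03-01", "keyword": "scraping"},
    {"title": "Second post", "date": "2016-03-08", "keyword": "tables"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts as CSV text, header row first."""
    # newline="" lets the csv module control line endings itself.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["title", "date", "keyword"])
print(csv_text)
```

Each dictionary corresponds to one item on the page, and each key corresponds to one data field in the task.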

  

Now, let's explore different ways to get this done in Octoparse.

1. Extract list with Auto-detect

2. Extract list manually 

 

To follow along, use this example URL: https://www.octoparse.com/blog

 

1. Extract list with Auto-detect

Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will then detect the data on the page, and you can click "Create workflow" to generate the workflow.

  

2. Extract list manually 

If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.

1) Method 1: 

  • Load the web page in Octoparse and hover your cursor over the first item until the entire section gets highlighted in blue
  • Click the second item, and you will find that all the similar items on the page have been selected
  • Choose "Extract text of the selected elements" and Octoparse will create a Loop Item automatically

 

You will notice that the first item is now highlighted in red. You can select the information like title, date, and keyword from the highlighted area.

  • Select the title and choose "Extract the text of the element"
  • Repeat the steps to get other information
  • Double click on the field name to rename it if needed 

 

Tip!

Please make sure that all the sub-elements you want to extract are included in this highlighted section.

[Screenshot: element selected]
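Conceptually, a Loop Item visits each list item in turn and extracts the same fields from inside it, which is why every sub-element has to sit within the highlighted section. The Python sketch below illustrates that idea only; it is not how Octoparse is implemented. The markup, class names, and `extract_items` function are assumptions made up for the example, and the real blog page's HTML will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed markup standing in for the page.
PAGE = """
<div id="posts">
  <div class="item">
    <h2 class="title">First post</h2>
    <span class="date">2016-03-01</span>
    <span class="keyword">scraping</span>
  </div>
  <div class="item">
    <h2 class="title">Second post</h2>
    <span class="date">2016-03-08</span>
    <span class="keyword">tables</span>
  </div>
</div>
"""


def extract_items(markup):
    """Return one dict per list item, with the same fields for each."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Outer loop: one pass per list item (the role of the Loop Item).
    for item in root.findall(".//div[@class='item']"):
        # Inner step: look up each field relative to the current item,
        # so only sub-elements inside the item can be found.
        rows.append({
            "title": item.findtext("h2[@class='title']"),
            "date": item.findtext("span[@class='date']"),
            "keyword": item.findtext("span[@class='keyword']"),
        })
    return rows


rows = extract_items(PAGE)
for row in rows:
    print(row)
```

Because every lookup is relative to the current item, a field that lives outside the item's section would come back as `None`, which mirrors the tip above.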

 


2) Method 2: 
 

  • Hover your cursor over the first item until the entire section gets highlighted in blue

You will notice that Octoparse detects sub-elements from the section and highlights them in red.

  • Choose "Select sub-elements"
  • Choose "Select all"
  • Select "Extract data". A loop item will be generated automatically to scrape the list of items on the page.

 

Tip!

If you want to edit or delete the extracted data fields, you can click "Extract Data" and modify the fields on the Data Preview panel.

[Screenshot: preview box]

 

Part II: Scrape from a table

Table data is also common on websites related to finance, sports, etc. This part of the tutorial shows how to scrape it.

If you have learned how to extract a list (see Part I), scraping a table is much the same. Treat each row of the table as one item in a list; each table cell is then a sub-element of that item.

Here is how to collect table data with Octoparse.
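The row-as-item, cell-as-sub-element idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Octoparse's own logic: the table markup, the company names, the numbers, and the `table_to_records` function are all hypothetical placeholders rather than data from the case URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table with placeholder values.
TABLE = """
<table>
  <tr><th>Company</th><th>Price</th><th>Change</th></tr>
  <tr><td>Example Corp</td><td>10.50</td><td>+0.25</td></tr>
  <tr><td>Sample Inc</td><td>7.80</td><td>-0.10</td></tr>
</table>
"""


def table_to_records(markup):
    """Treat each data row as a list item and each cell as a sub-element."""
    table = ET.fromstring(markup.strip())
    rows = table.findall("tr")
    # The first row supplies the field names.
    headers = [th.text for th in rows[0].findall("th")]
    records = []
    for row in rows[1:]:
        # Each remaining row is one item; its cells are the sub-elements.
        cells = [td.text for td in row.findall("td")]
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(TABLE)
for record in records:
    print(record)
```

The loop over rows plays the same role as the loop over list items in Part I, which is why the two workflows look so similar.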

 

Case URL: https://money.cnn.com/data/hotstocks/index.html

 

1. Use the Auto-detect function to set up the workflow

2. Set up workflow manually

 

1. Use the Auto-detect function to set up the workflow

Octoparse supports auto-detecting the table and capturing all the columns. With this feature, you just need to:

  • Enter the web page URL and select "Auto-detect the web page data"
  • Check if all table cells have been captured and click "Create workflow"

 

2. Set up workflow manually

  • Select the first cell in the first row of the table, and then expand the selection area until it selects the whole first row

(You can click "Turn OFF Auto-detect" or "Cancel Auto-detect" to stop auto-detect if it starts automatically)

 

The Tips panel will say "One or more sub-elements are found". "Sub-elements" are the specific data fields that Octoparse detects in each row of data. The panel is asking whether you want to select these sub-elements.

 

  • Choose "Select all sub-elements" from the Tips panel. 

All the sub-elements in the first row are selected, and Octoparse then finds other similar elements and highlights them in red.

  • Choose "Select all" from the Tips panel.

All the sub-elements in the table are selected and highlighted in green.

  • Choose "Extract data" from the Tips panel. 

Now, Octoparse will extract all the data fields from the table.

  • Edit data fields if needed (optional)

You now have all the data fields set up for the task. You can refine the data fields in the "Data Preview" section.

  • Double-click the field name to rename the data fields
  • Click the three dots on the field for more actions: delete, copy, clean data, etc.

 

If you have any further trouble extracting list/table data, you're welcome to submit a ticket to our Support team.

 

Happy Data Hunting!

Author: The Octoparse Team
