List/Table Web Page - Advanced Mode
Thursday, March 24, 2016 5:14 AM
Content on web pages is usually organized in patterns. Two of the most common patterns are lists and tables. Here are a few examples of content laid out as a list.
Part I: Scrape from a list
This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...
Our goal is to get the data extracted into Excel like this:
Now, let's explore different ways to get this done in Octoparse.
You may need this link to follow through: https://www.octoparse.com/blog
1. Extract list with Auto-detect
Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will detect the data on the page, and you can then click "Create workflow" to generate the workflow.
2. Extract list manually
If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.
1) Method 1:
- Load the web page in Octoparse and hover your cursor over the first item until the entire section gets highlighted in blue
- Click on the second item. Octoparse will then select all the similar items on the page.
- Choose "Extract text of the selected elements" and Octoparse will create a Loop Item automatically
You will notice that the first item is now highlighted in red. You can select the information like title, date, and keyword from the highlighted area.
- Select the title and choose "Extract the text of the element"
- Repeat the steps to get other information
- Double click on the field name to rename it if needed
Please make sure all the sub-elements you want to extract are included in this highlighted section.
2) Method 2:
- Hover your cursor over the first item until the entire section gets highlighted in blue
You will notice that Octoparse detects sub-elements from the section and highlights them in red.
- Choose "Select sub-elements"
- Choose "Select all"
- Select "Extract data". A loop item will be generated automatically to scrape the list of items on the page.
If you want to edit or delete the extracted data fields, you can click "Extract Data" and modify the fields on the Data Preview panel.
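Conceptually, a Loop Item with fields extracted inside it behaves like a loop over the items on the page, with one lookup per sub-element. The sketch below is only an illustration of that idea and not a description of how Octoparse works internally. It uses Python's standard html.parser on a made-up HTML snippet; the class names "item", "title", "date", and "keyword" are assumptions and may not match the real markup of the example page.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real page's tags and class names may differ.
SAMPLE_HTML = """
<div class="item"><h2 class="title">First post</h2><span class="date">2016-03-24</span><span class="keyword">scraping</span></div>
<div class="item"><h2 class="title">Second post</h2><span class="date">2016-03-25</span><span class="keyword">tables</span></div>
"""

FIELDS = {"title", "date", "keyword"}  # sub-elements to pull from each item


class ListExtractor(HTMLParser):
    """Collects one dict per item, with one key per sub-element."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished items, in page order
        self.current = None  # item being filled, or None outside an item
        self.field = None    # sub-element whose text is being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "item":
            self.current = {}          # a new item starts (one loop iteration)
        elif self.current is not None and css_class in FIELDS:
            self.field = css_class     # entering a sub-element we care about

    def handle_data(self, data):
        if self.field is not None:
            self.current[self.field] = data.strip()

    def handle_endtag(self, tag):
        if self.field is not None:
            self.field = None          # leaving the sub-element
        elif tag == "div" and self.current is not None:
            self.items.append(self.current)  # the item is complete
            self.current = None


def extract_items(html):
    parser = ListExtractor()
    parser.feed(html)
    return parser.items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item["title"], item["date"], item["keyword"])
```

A sub-element that sits outside the item's container is never seen by the loop, which is why every field you want has to fall inside the highlighted section.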
Part II: Scrape from a table
Table data is also common on websites related to finance, sports, etc. This tutorial will guide you through scraping table data.
If you have learned how to extract a list (see Part I), scraping table data works in much the same way. You can treat each row of the table as an item in a list, and each table cell as a sub-element of that item.
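To make the row-as-item idea concrete, here is a minimal Python sketch that reads a small table with the standard html.parser module. The table contents are invented for illustration, and the code is an analogy for the concept rather than a description of Octoparse's own implementation.

```python
from html.parser import HTMLParser

# Hypothetical table used only for illustration.
SAMPLE_TABLE = """
<table>
  <tr><th>Team</th><th>Wins</th><th>Losses</th></tr>
  <tr><td>Lions</td><td>10</td><td>2</td></tr>
  <tr><td>Tigers</td><td>7</td><td>5</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Treats each <tr> as a list item and each <th>/<td> as a sub-element."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, in page order
        self.row = None       # row being filled, or None outside a row
        self.in_cell = False  # True while reading text inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.row.append("")   # start a new, empty cell
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data  # accumulate the text of the current cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append([cell.strip() for cell in self.row])
            self.row = None


def extract_table(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows


rows = extract_table(SAMPLE_TABLE)
header, body = rows[0], rows[1:]
# Pair each cell with its column name, giving one record per table row.
records = [dict(zip(header, row)) for row in body]
print(records)
```

The structure mirrors the list case: the outer level walks over rows, and the inner level picks out the cells of each row.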
How do you collect table data with Octoparse? Follow the steps below.
1. Use the Auto-detect function to set up the workflow
Octoparse supports auto-detecting the table and capturing all the columns. With this feature, you just need to:
- Enter the web page URL and select "Auto-detect the web page data"
- Check if all table cells have been captured and click "Create workflow"
2. Set up workflow manually
- Select the first cell in the first row of the table, and then expand the selection area until it selects the whole first row
(You can click "Turn OFF Auto-detect" or "Cancel Auto-detect" to stop auto-detect if it starts automatically)
The Tips panel will say "One or more sub-elements are found". "Sub-elements" are the specific data fields that Octoparse detects in each row of data. The panel is asking whether you want to select these sub-elements.
- Choose "Select all sub-elements" from the Tips panel.
All the sub-elements in the first row are now selected, and Octoparse highlights the other similar elements it finds in red.
- Choose "Select all" from the Tips panel.
All the sub-elements in the table are selected and highlighted in green.
- Choose "Extract data" from the Tips panel.
Now, Octoparse will extract all the data fields from the table.
- Edit data fields if needed (optional)
You now have all the data fields set up for the task. You can refine the data fields in the "Data Preview" section.
- Double-click the field name to rename the data fields
- Click the three dots on the field for more actions: delete, copy, clean data, etc.
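If you post-process exported data outside Octoparse, the same kind of refinement (renaming and deleting fields) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names "Field1" to "Field3" and the sample values are hypothetical, and the result is written as CSV text, which Excel can open.

```python
import csv
import io

# Hypothetical extracted records with placeholder field names.
records = [
    {"Field1": "First post", "Field2": "2016-03-24", "Field3": "scraping"},
    {"Field1": "Second post", "Field2": "2016-03-25", "Field3": "tables"},
]

RENAME = {"Field1": "Title", "Field2": "Date"}  # old name -> new name
DROP = {"Field3"}                               # fields to delete


def refine(record):
    """Return a copy with DROP fields removed and RENAME applied."""
    return {RENAME.get(key, key): value
            for key, value in record.items() if key not in DROP}


def to_csv(rows):
    """Serialize a list of dicts to CSV text, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


refined = [refine(record) for record in records]
csv_text = to_csv(refined)
print(csv_text)
```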
If you have any further trouble extracting list/table data, you're welcome to submit a ticket to our Support team.
Happy Data Hunting!
Author: The Octoparse Team