Scrape a list of data

Content on web pages is usually organized in some kind of pattern, and one of the most common patterns is a list. Here are a few examples of content laid out as a list.

19.png

Scraping a list is quick with Octoparse's Auto-detect feature, which identifies the items in a list and generates the task workflow automatically. Let's see how it works with an example.

This particular web page consists of items sharing the same structure. Each item contains a title, date, keyword, article...

20.png

Our goal is to extract the data into an Excel sheet like this:

21.png

Now, let's explore different ways to get this done in Octoparse:

To follow along, use this example page: http://test-sites.octoparse.com/?page_id=6


1. Extract a list with Auto-detect

Once you've created a new task using the example URL, select "Auto-detect web page data". Octoparse will then detect the data on the page, and you can click "Create workflow" to generate the workflow.

After that, you can modify the fields in the Data Preview:

  • Delete unwanted fields

  • Rename the fields by double-clicking on the header
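The same clean-up can also be done on exported data outside Octoparse. The following is a minimal sketch, not part of Octoparse: the function name `tidy_fields`, the placeholder field names (`Field1`, `Field2`, `Field3`), and the sample row are all hypothetical.

```python
def tidy_fields(rows, drop=(), rename=None):
    """Return copies of the rows without the dropped fields and with
    the remaining fields renamed according to the rename mapping."""
    rename = rename or {}
    cleaned = []
    for row in rows:
        cleaned.append({
            rename.get(key, key): value  # keep the old name if no new one is given
            for key, value in row.items()
            if key not in drop           # skip unwanted fields
        })
    return cleaned


# Hypothetical exported row with auto-generated field names.
rows = [
    {"Field1": "First post", "Field2": "2019-01-01", "Field3": "unused"},
]

tidy = tidy_fields(
    rows,
    drop={"Field3"},
    rename={"Field1": "title", "Field2": "date"},
)
```

The original rows are left untouched, so the step can be repeated with different names if the first attempt is not what you wanted.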


2. Extract a list manually

If Auto-detect fails to detect the list, or if you are building a task without Auto-detect, you can always extract the list manually.

Method 1:

  • Hover your cursor over the first item until the entire section gets highlighted in blue, and click on it

  • Click on the second item; Octoparse will then select all the similar items on the page

  • Choose Text and Octoparse will create a Loop Item automatically

You can now select the information like title, date and keyword from the web page to create different fields.

  • Select the title and choose Text

  • Repeat the steps to get other information

  • Double-click on the field name to rename it if needed
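Conceptually, the steps above build a loop over the list items and read one field per selected element. The sketch below illustrates that idea with Python's standard library; it is not how Octoparse is implemented. The sample markup, the class names, and the selectors in `FIELDS` are assumptions, since the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for the list page.
SAMPLE = """
<div id="list">
  <div class="item">
    <h2 class="title">First post</h2>
    <span class="date">2019-01-01</span>
    <span class="keyword">scraping</span>
  </div>
  <div class="item">
    <h2 class="title">Second post</h2>
    <span class="date">2019-01-02</span>
    <span class="keyword">lists</span>
  </div>
</div>
"""

# Field name -> path relative to one list item (assumed selectors).
FIELDS = {
    "title": ".//h2[@class='title']",
    "date": ".//span[@class='date']",
    "keyword": ".//span[@class='keyword']",
}


def extract_rows(markup):
    """Loop over every list item and pull out one value per field."""
    root = ET.fromstring(markup.strip())
    rows = []
    # The "loop item": visit each element that matches the item pattern.
    for item in root.findall(".//div[@class='item']"):
        row = {}
        for name, path in FIELDS.items():
            node = item.find(path)
            has_text = node is not None and node.text
            row[name] = node.text.strip() if has_text else ""
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
```

Each dictionary in `rows` corresponds to one row of the exported sheet, and each key corresponds to one field name.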

Method 2:

  • Hover your cursor over the first item until the entire section gets highlighted in blue

You will notice that Octoparse detects sub-elements from the section and highlights them in red.

  • Choose Select all child elements

  • Choose Select all similar groups

  • Select Element data

A loop item will be generated automatically to scrape the list of items on the page.
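One way to picture "similar groups" is as sibling elements that share the structure of the first item you pointed at. The sketch below shows that idea only; Octoparse's actual matching logic is not documented here and may work differently. The sample markup and the helper names (`signature`, `similar_groups`, `child_data`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one unrelated paragraph plus three list items.
SAMPLE = """
<div id="list">
  <p class="intro">Latest articles</p>
  <div class="item"><h2>First post</h2><span>2019-01-01</span></div>
  <div class="item"><h2>Second post</h2><span>2019-01-02</span></div>
  <div class="item"><h2>Third post</h2><span>2019-01-03</span></div>
</div>
"""


def signature(element):
    """Describe an element by its tag and the tags of its direct children."""
    return (element.tag, tuple(child.tag for child in element))


def similar_groups(parent, first_item):
    """Return the children of parent that share first_item's structure."""
    target = signature(first_item)
    return [el for el in parent if signature(el) == target]


def child_data(group):
    """Collect the text of every direct child element in one group."""
    return [(child.text or "").strip() for child in group]


root = ET.fromstring(SAMPLE.strip())
first = root.find("div")  # stands in for the item the user hovered over
table = [child_data(group) for group in similar_groups(root, first)]
```

The introductory paragraph is skipped because its structure does not match the first item, while the three items with the same child elements end up as three rows in `table`.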
