Variable List, Fixed List, URL List and Text List – Which Is the Best One to Use for Your Scraping Task?

Friday, September 29, 2017 7:30 AM

Have you ever built a list in Octoparse? Have you noticed that a loop mode is selected automatically whenever a loop is created? It happens so quietly that you may stop noticing it after using Octoparse for a while. In this article, however, I would like to point out a few scraping scenarios in which you may want to switch from one mode to another manually.

 

Consider manually selecting or switching a loop mode if you want to:

  • Speed up an extraction task by splitting it
  • Search a website with multiple keywords, then extract the search results
  • Extract from multiple URLs with a similar page layout

 

Speeding up an extraction with a split-able list (using a Fixed List/URL List)

A variable list follows a single XPath and matches every element on a webpage (however many there are) that meets the criteria defined by that XPath. A variable list is not split-able. In contrast, a fixed list and a URL list are both split-able; hence, consider manually changing a variable list into a fixed list or URL list if you need to split a task for faster extraction.
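To make the difference concrete, here is a minimal Python sketch of the idea, not of Octoparse's internals. It assumes a fixed list pins down each item with its own XPath, while a variable list relies on one XPath that matches whatever is on the page. The page fragment, the XPaths and the variable names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real page will look different.
PAGE = """
<html><body>
  <ul id="results">
    <li class="item">Alpha</li>
    <li class="item">Beta</li>
    <li class="item">Gamma</li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE)

# Variable-list idea: a single XPath; the number of matches is only
# known once the page has been loaded and scanned.
variable_xpath = ".//li[@class='item']"
variable_items = [li.text for li in root.findall(variable_xpath)]

# Fixed-list idea (an assumption about how it behaves): one explicit
# XPath per item, so the number of items is known before the run starts.
fixed_xpaths = [f".//ul[@id='results']/li[{i}]" for i in range(1, 4)]
fixed_items = [root.find(xpath).text for xpath in fixed_xpaths]

print(variable_items)
print(fixed_items)
```

Because the fixed version is an ordinary list of known length, it can be cut into pieces ahead of time, which is what makes it split-able.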

If you need to speed up an extraction, you can:

  • Change a variable list into a fixed list (learn how).
  • Set up a crawler that first captures the URLs of all the webpages sharing a similar structure, then build a second crawler that visits each URL on the list and extracts from it using the same configuration (learn how).
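As a rough illustration of why a list with a known set of items can be split, the sketch below divides a URL list into near-equal chunks that could each be handled by a separate sub-task. The function name split_list and the example.com URLs are hypothetical, and this is not a description of how Octoparse itself divides a task.

```python
def split_list(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical URLs captured by a first crawler.
urls = [f"https://example.com/product/{n}" for n in range(1, 8)]
chunks = split_list(urls, 3)

for number, chunk in enumerate(chunks, start=1):
    print(f"sub-task {number}: {len(chunk)} URLs")
```

Each chunk is independent of the others, so the sub-tasks can run side by side and their results can simply be concatenated afterwards.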

 

Search a website with different keywords and capture the search results (using a Text List)

If you want to search and extract, you need to provide Octoparse with a list of keywords to search for. This is done by setting up a loop with a text list. Once the extraction starts, Octoparse automatically searches for the first keyword and captures the search results, then searches for the second keyword and captures the corresponding results, and so on.

The detailed steps are:

Step 1: Click on the search box

Step 2: Select “Enter text value”

Step 3: Drag a Loop action to the workflow

Step 4: Select “Text List” for loop mode

Step 5: Copy and paste the list of keywords into the text box, click “Save”

Step 6: Drag the Input Text action into the loop

Step 7: Under Advanced Options, check “Use the text in the loop item to fill in the text box”, click “Save”
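The workflow configured in the steps above behaves roughly like the loop sketched below: take one keyword per line, search for it, and collect the results before moving on to the next keyword. This is only a conceptual sketch; the in-memory CATALOG and the search function stand in for a real website's search box, and every name here is hypothetical.

```python
# One keyword per line, as it would be pasted into the text box.
KEYWORDS_TEXT = """laptop
phone
"""

# Stand-in for a website's searchable content.
CATALOG = ["Gaming Laptop", "Laptop Sleeve", "Smart Phone", "Desk Lamp"]


def parse_text_list(raw):
    """Turn pasted text into a list of keywords, skipping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def search(keyword):
    """Stand-in for typing a keyword into a search box and submitting it."""
    return [title for title in CATALOG if keyword.lower() in title.lower()]


def run_text_list_loop(raw, search_fn):
    """Search each keyword in turn and collect the results it returns."""
    rows = []
    for keyword in parse_text_list(raw):
        for result in search_fn(keyword):
            rows.append({"keyword": keyword, "result": result})
    return rows


rows = run_text_list_loop(KEYWORDS_TEXT, search)
for row in rows:
    print(row)
```

Keeping the keyword next to each result makes it easy to tell afterwards which search produced which row.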

 


Scrape from a list of URLs following the same webpage structure (using a URL List)

Octoparse extracts data from a webpage by interacting with the website and scanning the page for specific web elements according to the task configuration. Hence, in order to grab data consistently and accurately from multiple pages, it is important that those pages share the same page structure, for example, product detail pages on an e-commerce website (example), business detail pages on a directory website (example) or even user pages on a social media website (example). Pages that essentially “look” the same can be scraped efficiently with a loop of URL List.

The detailed steps are:

Step 1: Drag a Loop action to the workflow

Step 2: Select “URL List” for Loop Mode

Step 3: Copy and paste the pre-aggregated list of URLs into the text box, click “Save”

Step 4: Notice a Go To Webpage action gets added automatically
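Conceptually, a URL List loop applies one extraction rule to every page on the list, which is why the pages need the same structure. The sketch below shows that idea with Python's standard html.parser module. To stay runnable offline, the FAKE_SITE dictionary and the fetch function stand in for actually opening each page; both, along with the example.com URLs and the choice of the first h1 as the field to extract, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text inside the first <h1> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()


# Stand-in for real pages that share the same layout.
FAKE_SITE = {
    "https://example.com/product/1": "<html><body><h1>Red Mug</h1></body></html>",
    "https://example.com/product/2": "<html><body><h1>Blue Mug</h1></body></html>",
}


def fetch(url):
    """Stand-in for the Go To Webpage step; returns the page's HTML."""
    return FAKE_SITE[url]


def extract_title(html):
    """Apply the same extraction rule to any page with this layout."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title


url_list = list(FAKE_SITE)
rows = [{"url": url, "title": extract_title(fetch(url))} for url in url_list]
for row in rows:
    print(row)
```

If one of the pages used a different layout, say with no h1 at all, the same rule would return None for it, which is the kind of inconsistency a shared page structure avoids.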

 


That’s all for this tutorial. I hope you enjoyed reading it!

Feel free to reach out to support@octoparse.com if you have any questions.

 

Now check out similar case studies:

     · Get Started with Octoparse in 2 Minutes

     · 10 Essential Tutorials That Every Octoparse Newbie Should Know

     · Scrape data from multiple web pages

     · Speed up Cloud Extraction (1)

     · Speed up Cloud Extraction (2)

 

 
