Batch URL input

Monday, November 26, 2018

Extracting data from a list of URLs is one of the most efficient ways to run large-scale data scraping with Octoparse. When the list of URLs is large, Octoparse supports batch (bulk) URL import from local files (text or spreadsheet) or from another task, and it can also generate URLs from pre-defined patterns. These features reduce the tedious manual work that large-scale data extraction usually involves.

There are three ways to batch import URLs to any single task/crawler (up to a million URLs):

1. Batch import URLs from local files

2. Batch import URLs from another task

3. Batch generate URLs based on a pre-defined pattern

 

Tips!

Once the number of imported or generated URLs reaches the limit of 1 million, Octoparse stops importing or generating immediately.
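
If your own URL list may exceed the limit, it can help to check or split it before importing. The snippet below is a minimal Python sketch, not part of Octoparse: the names `URL_LIMIT`, `cap_urls` and `split_into_batches` are hypothetical, and the only fact taken from this article is the 1 million URL limit per task.

```python
from itertools import islice
from typing import Iterable, Iterator

URL_LIMIT = 1_000_000  # per-task limit stated in this article


def cap_urls(urls: Iterable[str], limit: int = URL_LIMIT) -> Iterator[str]:
    """Yield at most `limit` URLs; anything beyond the limit is ignored."""
    return islice(urls, limit)


def split_into_batches(urls: Iterable[str], limit: int = URL_LIMIT) -> Iterator[list[str]]:
    """Split a longer list into batches of at most `limit` URLs each."""
    iterator = iter(urls)
    while batch := list(islice(iterator, limit)):
        yield batch
```

Each batch produced by `split_into_batches` could then be saved to its own file and imported into a separate task.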

1. Batch import URLs from files

You can import URLs from any of the file formats below:

- CSV

- TXT

- Excel (.xlsx & .xls)

 

· Select "Advanced Mode" and click "+Task" to create a new task

· Select "Input from file"

 

· Click "Select file" then choose the file containing the URLs for the import

Octoparse automatically identifies and imports all the URLs from the file. Note that only the first 100 URLs are shown in the preview.

· Click "Save URL" to complete the import
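
If the URLs come from your own script or database, you first need a file to import. The sketch below shows one way to produce a TXT or CSV file in Python. The layout (one URL per line, a single column, no header row) is an assumption, since this article does not specify the expected file layout, and the function names are hypothetical.

```python
import csv


def write_url_txt(urls, path):
    """Write one URL per line to a plain-text file; return the number written."""
    lines = [u.strip() for u in urls if u.strip()]  # drop blank entries
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


def write_url_csv(urls, path):
    """Write URLs to a single-column CSV file (no header row, by assumption)."""
    rows = [u.strip() for u in urls if u.strip()]
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerows([u] for u in rows)
    return len(rows)
```

For example, `write_url_txt(my_urls, "urls.txt")` creates a file that can then be chosen with "Select file".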

2. Batch import URLs from another task

This feature integrates two tasks seamlessly when the URLs need to be extracted separately by another task. There is no need to export the URLs from one task and import them into the other.

 

· Select "Advanced Mode" and click "+Task" to create a new task

· Select "Input from task"

 

· Select the task containing the target URLs then specify the proper data field

· Click "Save URL" to complete the import

Note that the selected task (the one that contains the URLs to crawl further) is referred to as the parent task, and the new task being configured becomes the child task. The two tasks are associated automatically and can be run in association with one another.

When a task is selected as the parent task, Octoparse automatically retrieves all the data extracted by that task (both cloud and local).

 

Tasks that have not yet been run and have no URLs fetched can also be selected as the parent task: simply enter one example URL into the text box, then proceed to configure the child task.
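
Conceptually, the parent task's extracted rows supply the child task's URL list through the data field you pick. The Python sketch below only illustrates that idea and does not describe how Octoparse implements it: the function name, the field name `detail_url` and the sample rows are hypothetical, and skipping empty or duplicate values is a choice made for this sketch.

```python
def urls_from_parent_rows(rows, url_field):
    """Collect unique, non-empty URLs from one field of the parent task's rows."""
    seen = set()
    urls = []
    for row in rows:
        value = (row.get(url_field) or "").strip()
        if value and value not in seen:  # skip blanks and repeats
            seen.add(value)
            urls.append(value)
    return urls


# Hypothetical rows extracted by a parent task that scrapes a listing page.
parent_rows = [
    {"title": "Item A", "detail_url": "https://example.com/items/1"},
    {"title": "Item B", "detail_url": "https://example.com/items/2"},
    {"title": "Item C", "detail_url": ""},
]

child_urls = urls_from_parent_rows(parent_rows, "detail_url")
```

Here `child_urls` holds the two non-empty detail page URLs, which is the kind of list the child task would crawl.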

- Associated run 

When a child task is set to run, you can specify the criteria for starting the extraction.

· Click "Start Extraction" on the task configuration interface, or "Options" from the Dashboard

· Select "Parent Task settings" / "Config with start"

 

 

There are four options to select from. For example:

· Select "Run task as soon as its parent task starts" if you want the child task to run as soon as any URL is fetched by the parent task.

Tips!

1. If you set up an associated run by selecting any option from Parent task settings, both tasks will be executed in the cloud via Octoparse Cloud Service. Associated run is not available for Local Extraction.

2. When an associated run is set up, task scheduling is not available for the child task.

3. Batch generate URLs based on a pre-defined pattern

With the URL Batch Generate feature, you can generate a large number of URLs that follow a specific pattern by varying the parameters of one given URL.

This feature is especially useful for scraping a large number of pages from a particular website. Use the URL generator to produce all the page URLs at once and scrape all the pages together, with no need to go through them one by one.

· Select "Advanced Mode" and click "+Task" to create a new task

· Select "Batch generate"

· Input the URL to use as the base for batch generation

· Highlight the URL parameter you want to vary, and click "Add parameter"

· Select from the four Parameter Type options to define the pattern you need

· Click "Save URL" to save the list

- Four Parameter Type options

    - Type 1: Numbers

    - Type 2: Letters

    - Type 3: Date

    - Type 4: Custom list
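
To make the four parameter types concrete, here is a rough Python sketch of what pattern-based URL generation amounts to. It is an illustration only and not Octoparse's implementation: the `{param}` placeholder stands in for the highlighted part of the URL, and all function names and example URLs are hypothetical.

```python
from datetime import date, timedelta
from string import ascii_lowercase

PLACEHOLDER = "{param}"  # hypothetical marker for the highlighted URL part


def expand(template, values):
    """Substitute each value into the template's placeholder."""
    return [template.replace(PLACEHOLDER, str(v)) for v in values]


def number_values(start, end, step=1):
    """Type 1 (Numbers): inclusive numeric range, e.g. page numbers."""
    return range(start, end + 1, step)


def letter_values(first="a", last="z"):
    """Type 2 (Letters): inclusive run of lowercase letters."""
    i, j = ascii_lowercase.index(first), ascii_lowercase.index(last)
    return ascii_lowercase[i:j + 1]


def date_values(start, end, fmt="%Y-%m-%d"):
    """Type 3 (Date): every day from start to end inclusive, formatted with fmt."""
    days = (end - start).days
    return [(start + timedelta(days=n)).strftime(fmt) for n in range(days + 1)]


# Type 4 (Custom list) needs no helper: any list of values can be passed to expand().
pages = expand("https://example.com/list?page={param}", number_values(1, 3))
print(pages)
# ['https://example.com/list?page=1', 'https://example.com/list?page=2', 'https://example.com/list?page=3']
```

The same `expand` call works for the other types, for example `expand("https://example.com/news/{param}", date_values(date(2018, 11, 26), date(2018, 11, 28)))`.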

Related articles:  

Extract data from a list of URLs 

Run/Schedule tasks in the cloud 

Run tasks on local machine 

What's new in Octoparse 7.1? 
