Extract data from a list of URLs

Thursday, August 16, 2018

There isn’t just one way to scrape a webpage. Depending on how the webpage is structured, there are usually several approaches you can try. In this tutorial, we introduce an easy and powerful way to extract data from multiple web pages by using a list of URLs.

Question: When should you consider scraping by using a list of URLs?

Answer: When the desired data spans multiple pages that share the same page structure. For example, when you scrape listings from Yelp, you may need to paginate through the search results. Here, page 1, page 2, page 3, etc. all share the same page structure. As another example, if you are scraping news articles from a particular website, the article pages will most likely share the same page structure.
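When the pages differ only by a page number, the list of URLs can be generated rather than typed by hand. The following is a minimal Python sketch; the base URL and the `page` query parameter are assumptions made for illustration, so check how the target site actually numbers its pages.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, page_param="page"):
    """Return one URL per results page, from first_page to last_page inclusive.

    base_url and page_param are hypothetical; real sites may use a
    different parameter name or put the page number in the path.
    """
    urls = []
    for page in range(first_page, last_page + 1):
        # Append the page number as a query-string parameter.
        urls.append(f"{base_url}?{urlencode({page_param: page})}")
    return urls


if __name__ == "__main__":
    # Print one URL per line, ready to paste into a list of URLs.
    for url in build_page_urls("https://example.com/search", 1, 3):
        print(url)
```

The printed lines can be pasted into the "List of URLs" textbox as they are.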

 

To scrape by using a list of URLs, we simply set up a loop over all the URLs we need to scrape, then add a data extraction action right after it to get the data we need. Octoparse will load the URLs one by one and scrape the data from each page.
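To make the idea concrete, here is a rough Python sketch of the same pattern: a loop over URLs followed by a single extraction step that is applied to every page. This is an illustration only, not how Octoparse works internally. The `fetch` argument is a hypothetical callable that returns the HTML of a page, and extracting the page title stands in for whatever data you actually need.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """The extraction step: the same rule is applied to every page."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape_url_list(urls, fetch):
    """Load each URL in order and extract one row of data per page."""
    rows = []
    for url in urls:
        html = fetch(url)  # fetch is supplied by the caller (assumption)
        rows.append({"url": url, "title": extract_title(html)})
    return rows
```

Because every page is handled by the same extraction rule, this only works when the pages share the same structure, which is the same requirement the "List of URLs" mode has.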

With the "List of URLs" loop mode, Octoparse does not need to deal with extra steps like "Click to paginate" or "Click Item" to enter the item page. As a result, extraction is faster, especially for Cloud Extraction. When a task built using "List of URLs" is set to run in the Cloud, the task is split up into sub-tasks, which then run on various cloud servers simultaneously.
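The speed-up from splitting is easy to picture with a small sketch. The Python below divides a URL list into sub-lists and processes them concurrently with threads. It is only an analogy for the idea described above: how Octoparse actually splits and schedules cloud sub-tasks is not documented here, and `scrape_one` is a hypothetical function that scrapes a single URL.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(urls, subtask_count):
    """Distribute the URLs round-robin into subtask_count lists."""
    return [urls[i::subtask_count] for i in range(subtask_count)]


def run_subtask(urls, scrape_one):
    """Scrape every URL of one sub-task, one after another."""
    return [scrape_one(url) for url in urls]


def run_in_parallel(urls, scrape_one, subtask_count=4):
    """Run the sub-tasks at the same time and merge their results.

    The merged rows are grouped by sub-task, so their order can differ
    from the order of the input list.
    """
    subtasks = [s for s in split_into_subtasks(urls, subtask_count) if s]
    if not subtasks:
        return []
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        results = list(pool.map(lambda s: run_subtask(s, scrape_one), subtasks))
    # Flatten the per-sub-task lists into a single list of rows.
    return [row for subtask_rows in results for row in subtask_rows]
```

This kind of split is possible because each URL in the list can be loaded independently of the others, with no "next page" click that depends on the previous page.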

"List of URLs" mode is very effective. You can add particular web pages to the list, and it doesn't matter whether they are consecutive pages or not, as long as they share the same page layout. Octoparse will scrape data from each URL in the list, and no page would be omitted.

 

Tips!

1. Can I use URLs that do not share the same page layout?

Unfortunately, only URLs that share the same page structure can be extracted using "List of URLs". To make sure data is extracted consistently and accurately, the pages must share the same page layout.

To learn more about the "List of URLs" mode, you can check out the following articles:

5 Loop Modes in Octoparse 

Variable List, Fixed List, URL List and Text List – Which Is a Better One to Use for Your Scraping Task? 

2. Is there a limit to the number of URLs that I can add at a time?

Yes. We suggest adding no more than 20,000 URLs at a time; however, this number can change slightly depending on the length of the URLs. 
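If your list is longer than that, one option is to cut it into batches before adding it. The helper below is a small Python sketch; the default of 20,000 comes from the suggestion above, and the function name is our own.

```python
def batch_urls(urls, batch_size=20_000):
    """Split a list of URLs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

For example, a list of 45,000 URLs yields three batches of 20,000, 20,000 and 5,000 URLs. If your URLs are very long, you may want to choose a smaller `batch_size`.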

3. Can Octoparse automatically collect and add the URLs?

Unfortunately, you have to collect and add the URLs to the list manually. You can use Octoparse to extract the URLs, export the data, and then add the URLs to the "List of URLs". The Octoparse Advanced API also lets you modify the list of URLs without opening the app.
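If the URLs were exported as a CSV file, a short script can turn the export into a clean list. The Python sketch below assumes the export has a column named `URL`; that column name is hypothetical, so use the field name from your own task. It skips values that are not http(s) URLs and drops duplicates while keeping the original order.

```python
import csv
import io
from urllib.parse import urlparse


def urls_from_export(csv_text, column="URL"):
    """Return the unique, well-formed URLs found in one column of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    urls = []
    for row in reader:
        url = (row.get(column) or "").strip()
        parts = urlparse(url)
        # Keep only absolute http(s) URLs; skip blanks and malformed values.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Joining the result with `"\n".join(urls)` gives one URL per line, which can then be pasted into the "List of URLs".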

 

To extract with a list of URLs, the process can generally be broken down into 3 simple steps:

[Image: scraping with a list of URLs]

 

In Octoparse, there are two ways to create a "List of URLs" loop.

1) Start a new task with a list of URLs

2) Create a "List of URLs" loop in Workflow Designer

 

 

 

 

 

 

1) Start a new task with a list of URLs

1. Select "Advanced Mode" and click "+Task" to create a new task

 

2. Paste the list of URLs in the textbox and click "Save URL"

 

After clicking "Save URL", the "Loop Item" (which loops through each URL of the list) is automatically created in the workflow.

 

If you click on "Loop Item", you can see that the URLs that you entered have been added to the "Loop Item".

Octoparse enters the "List of URLs" loop mode by default when more than one URL (one per line) is added to "Extraction URL".

 

3. Set up "Wait before execution"

If Octoparse works too fast, a page may not have loaded completely before the data extraction step is executed, which can lead to missing or incomplete data. To avoid this, we can set up "Wait before execution".

Click on the Loop Item. Under "Advanced Options", set a wait time before the action is executed (2 seconds usually works).
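The effect of a wait time can be sketched in a few lines of Python. This is an analogy rather than Octoparse's implementation: `load_page` and `extract` are hypothetical callables, and the pause is only useful when `load_page` drives a real browser in which the page keeps rendering during the wait.

```python
import time


def scrape_with_wait(urls, load_page, extract, wait_seconds=2.0, sleep=time.sleep):
    """Load each URL, pause, then run the extraction step.

    load_page and extract are supplied by the caller (assumptions).
    sleep can be replaced in tests so that no real waiting happens.
    """
    rows = []
    for url in urls:
        page = load_page(url)
        # Give the page time to finish loading before extracting from it.
        sleep(wait_seconds)
        rows.append(extract(page))
    return rows
```

A longer wait makes extraction more reliable on slow pages but adds that many seconds to every URL in the list, so 2 seconds over 1,000 URLs is a little over half an hour of waiting in total.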

 

 

 

 

2) Create a "List of URLs" loop in Workflow Designer

1. Drop a "Loop Item" in the workflow

 

2. Go to "Loop mode" and select "List of URLs"

 

Click [icon] and enter or paste the list of URLs. Don’t forget to click "OK" to save the setting.

 

Notice that the "Go to Web Page" action is automatically generated in the workflow. By clicking on "Loop Item", you can see that the list of URLs has been added to it.

 

 

4. Set up "Wait before execution"

Octoparse will load each URL in the list before it starts extracting data. If a page doesn't load completely, Octoparse may have problems scraping data or executing the next step in the workflow. To keep Octoparse from starting extraction before the page has loaded completely, set up "Wait before execution" (2 seconds is recommended).

 

Now that a "List of URLs" loop has been created, you can proceed to extract the data on the webpage. Once the task configuration is complete, run your task with Local Extraction or Cloud Extraction.

 

 

Related articles:

Select and extract data/URL/image/HTML 

Extract multiple pages through pagination 

Use lists to extract 

Set up wait time 

Advanced API 

 
