Speed up Cloud Extraction (2)

Friday, January 20, 2017 1:30 AM

In Speed up Cloud Extraction (1), you learned how to speed up Cloud Extraction by having Octoparse split one task into multiple sub tasks. When you use the “Fixed list”, “List of URLs” or “Text list” loop mode, Octoparse splits the task into multiple sub tasks on the cloud platform.


In this tutorial, you will learn how to speed up Cloud Extraction in Octoparse by optimizing pagination.


When you configure pagination by clicking the “Next” button, a “Click to paginate” action is generated automatically in the Workflow. Because you clicked a single element, the loop defaults to the “Single element” loop mode, which cannot be split on the cloud platform. Assume that a task is meant to extract URLs from a list page.

Let’s say that opening a page takes 5s, extracting data takes 2s, and clicking “Next Page” takes 3s. On the cloud platform, the extraction process repeats these steps for every page.

(Note: Octoparse’s cloud servers extract data simultaneously.)


In this case, Cloud Extraction would be very slow, since pagination adds 3s to every page that is scraped. To optimize pagination, you need to use a splittable loop mode for pagination (“List of URLs” or “Text list”).
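
The effect of these timings can be estimated with a rough sketch. The 5s, 2s and 3s figures come from the example above; the page count, the number of sub tasks running at once, and the function names are hypothetical, and the model assumes that each page pays open + extract + click with click pagination but only open + extract with a splittable loop mode. It is an approximation, not a description of how Octoparse schedules work internally.

```python
import math

# Timings quoted in the example above, in seconds.
OPEN_PAGE = 5
EXTRACT = 2
CLICK_NEXT = 3


def click_pagination_time(pages: int) -> int:
    """Estimate for one unsplittable task: pages are handled one after another,
    and each page pays for opening, extracting and clicking "Next Page"."""
    return pages * (OPEN_PAGE + EXTRACT + CLICK_NEXT)


def split_list_time(pages: int, workers: int) -> int:
    """Estimate for a splittable loop mode: pages are spread over `workers`
    sub tasks that run at the same time, and no click on "Next Page" is needed."""
    pages_per_worker = math.ceil(pages / workers)
    return pages_per_worker * (OPEN_PAGE + EXTRACT)


if __name__ == "__main__":
    pages, workers = 100, 4  # hypothetical values, for illustration only
    print("Click to paginate:", click_pagination_time(pages), "s")
    print("Splittable loop mode:", split_list_time(pages, workers), "s")
```

Under these assumptions the saving comes from two places: the 3s click disappears from every page, and the remaining work is shared between sub tasks.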


1. Query string pagination - Use “List of URLs” loop mode


Query string pagination uses a simple URL with a query string parameter, such as “page=1”, “page=2”, “page=3”, and so on.

For query string pagination, use the “List of URLs” loop mode and enter the URLs directly, instead of creating a “Click to paginate” action by clicking the “Next” button.


Step 1. Create a list of URLs and copy them.


Step 2. Drop a “Loop Item” into the Workflow designer.


Step 3. Select the “List of URLs” loop mode and paste the URLs into the text box. Then click “OK” and “Save”.

Then continue to configure the task for Cloud Extraction.
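
For Step 1, the list of URLs does not have to be typed by hand. The following is a minimal sketch that generates one URL per page, assuming the site uses a “page” query string parameter; the base URL, the parameter name and the function name `build_page_urls` are hypothetical and should be replaced with the values of the site you are scraping.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def build_page_urls(base_url: str, first: int, last: int, param: str = "page") -> list[str]:
    """Return one URL per page number, setting `param` in the query string
    and keeping any other query parameters already present in `base_url`."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))
    urls = []
    for number in range(first, last + 1):
        query[param] = str(number)
        urls.append(urlunparse(parts._replace(query=urlencode(query))))
    return urls


if __name__ == "__main__":
    # Hypothetical list page; prints one URL per line, ready to copy.
    for url in build_page_urls("https://example.com/list?category=books", 1, 5):
        print(url)
```

The printed lines can then be copied and pasted into the text box in Step 3.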


2. Jumping to a Specific Page - Use “Text list” loop mode


When the website lets visitors enter a page number and jump to that page, use the “Text list” loop mode to enter the page numbers.


Step 1. Create a list of page numbers and copy them.


Step 2. Drop a “Loop Item” into the Workflow designer.


Step 3. Select the “Text list” loop mode and paste the text into the text box. Then click “OK” and “Save”.
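
The page numbers for Step 1 can also be generated rather than typed. This is a small sketch that assumes the text box accepts one page number per line; that format, the page range and the function name `page_number_lines` are assumptions for illustration.

```python
def page_number_lines(first: int, last: int) -> str:
    """Return the page numbers from `first` to `last`, one per line."""
    return "\n".join(str(number) for number in range(first, last + 1))


if __name__ == "__main__":
    # Hypothetical range: pages 1 to 20.
    print(page_number_lines(1, 20))
```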


Once pagination is optimized, Cloud Extraction no longer needs to click “Next Page”, so each page takes 3s less.


Once we know how to optimize pagination by switching to a different loop mode, we can make Cloud Extraction a lot faster.



Author: The Octoparse Team






