How to Improve the Speed of Data Extraction
Wednesday, July 20, 2016 11:48 PM
Octoparse offers a Cloud platform with many Cloud servers that run your tasks 24/7 and can extract data up to 6 to 20 times faster than local extraction.
But sometimes Cloud extraction is still slower than you would like.
In this tutorial, we explain how the Cloud speeds up extraction and how to revise a task so that it runs faster.
The principle of speeding up in the Cloud
Octoparse Cloud speeds up extraction by splitting one task into multiple sub-tasks and running the sub-tasks on multiple Cloud servers.
Each sub-task requires one Cloud server, so the speed depends on how many Cloud servers your account has and whether the task is splittable.
The Standard plan has 6 Cloud servers while the Professional plan has 20. You can easily upgrade to a higher plan to speed up.
If you don't want to change your plan, making the task splittable is essential.
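The idea of splitting can be sketched in a few lines of Python. This is only an illustration of the principle, not Octoparse's actual scheduler: the function names and the round-robin strategy are assumptions made for the example, and the estimate assumes every item takes about the same time to process.

```python
from math import ceil


def split_into_subtasks(items, server_count):
    """Deal items round-robin into at most `server_count` sub-tasks.

    Hypothetical helper: one sub-task per available Cloud server.
    Empty sub-tasks are dropped when there are fewer items than servers.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    buckets = [items[i::server_count] for i in range(server_count)]
    return [bucket for bucket in buckets if bucket]


def estimated_rounds(item_count, server_count):
    """Number of items the busiest server has to process."""
    return ceil(item_count / server_count)


if __name__ == "__main__":
    urls = [f"https://example.com/page/{n}" for n in range(1, 11)]
    for subtask in split_into_subtasks(urls, 6):
        print(subtask)
```

Under these assumptions, a splittable task with 120 items leaves 20 items for the busiest server on 6 servers and 6 items on 20 servers. A task that can't be split stays on one server regardless of the plan.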
What kind of tasks are splittable?
When you create a loop item in Octoparse, Octoparse automatically assigns it a loop mode based on the items selected and how they interact with the overall webpage structure.
There are three splittable loop modes in Octoparse:
1. List of URLs
2. Text list
3. Fixed list
1. List of URLs
A URL loop is used when you start an extraction task with more than one URL. This is especially handy if the desired data spans multiple web pages that share the same page structure.
You can easily set up a loop of URLs to go through each of these pages. Octoparse loads the URLs one by one and executes the same set of extraction actions on each page.
A URL loop is splittable, so when a task built with a list of URLs runs in the Cloud, Octoparse splits it into multiple sub-tasks for faster and more effective extraction.
To learn more about the List of URLs, please refer to the Batch URL input.
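Conceptually, a URL loop applies one set of extraction steps to every URL in the list. The sketch below mimics that with the Python standard library. The page contents, the example.com URLs, and the `fetch` callable are made up for illustration, and extracting the page title stands in for whatever fields a real task would capture.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def run_url_loop(urls, fetch):
    """Run the same extraction step on every URL, one by one."""
    return [{"url": url, "title": extract_title(fetch(url))} for url in urls]


# Stand-in for real pages so the sketch runs without network access.
FAKE_PAGES = {
    "https://example.com/page/1": "<html><head><title>Page one</title></head></html>",
    "https://example.com/page/2": "<html><head><title>Page two</title></head></html>",
}

if __name__ == "__main__":
    for row in run_url_loop(list(FAKE_PAGES), FAKE_PAGES.get):
        print(row)
```

Because each URL is processed independently of the others, the list can be divided among several workers without changing the result, which is what makes this loop mode splittable.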
2. Text List
A Text list loop works like a URL loop, but instead of looping through a list of URLs, it loops through a list of predefined text values.
For more about the Text list loop, please refer to Enter Text.
3. Fixed List
Many web pages, such as e-commerce websites, organize their content (e.g. product information) as a collection of recurring elements with a shared HTML pattern.
When you capture such elements, for example product titles, Octoparse detects all the elements sharing the same HTML pattern and generates a collection of XPaths, one for each element of that kind.
Because a Fixed list holds a separate XPath for each element, it can be split into sub-tasks.
Besides these three splittable loop modes, there are two loop modes that are not splittable: the single element loop and the variable list loop.
Because each of these loop modes relies on a single XPath, it can't be split into sub-tasks to speed up extraction.
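To make the difference concrete: a variable list is one XPath that matches many elements, while a fixed list has one XPath per element. The sketch below turns the first form into the second by adding a position index. It assumes you already know how many elements the XPath matches; the function name and the sample XPath are hypothetical, and the output is plain XPath 1.0 rather than anything specific to Octoparse.

```python
def to_fixed_list(variable_xpath, element_count):
    """Expand one XPath into one indexed XPath per matched element.

    XPath positions are 1-based, and wrapping the expression in
    parentheses makes the index apply to the whole result set.
    """
    if element_count < 0:
        raise ValueError("element_count must not be negative")
    return [f"({variable_xpath})[{i}]" for i in range(1, element_count + 1)]


if __name__ == "__main__":
    for xpath in to_fixed_list('//ul[@id="products"]/li/a', 3):
        print(xpath)
```

Each entry in the resulting list points at exactly one element, so the entries can be handed out to different sub-tasks.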
How do I make my task splittable?
1. For a task that uses a Variable List to click through a list of elements, we can:
- Change it to a Fixed List by listing the XPaths for every element on the page
- Scrape only the element URLs first, without clicking into the pages, and then create another task with those URLs to get the detailed data.
Here is an example: Scrape property data from Realtor.com
2. For tasks that scrape from multiple pages, we can use the URL of each page to build the workflow.
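Both approaches end with a plain list of URLs that can be pasted into a List of URLs loop. The sketch below shows one way to prepare such lists outside Octoparse. It assumes the site numbers its pages through a `{page}` placeholder in the URL and that the first task has already exported the links it found; the example.com URLs and function names are invented for the example.

```python
from urllib.parse import urljoin


def build_page_urls(template, first_page, last_page):
    """Fill the '{page}' placeholder for every page number in the range."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def collect_detail_urls(listing_url, hrefs):
    """Resolve links from a listing page and drop duplicates, keeping order."""
    seen = set()
    detail_urls = []
    for href in hrefs:
        absolute = urljoin(listing_url, href)
        if absolute not in seen:
            seen.add(absolute)
            detail_urls.append(absolute)
    return detail_urls


if __name__ == "__main__":
    pages = build_page_urls("https://example.com/list?page={page}", 1, 3)
    print("\n".join(pages))

    # Links as the first task might export them (relative, with a repeat).
    hrefs = ["/item/1", "/item/2", "/item/1"]
    print("\n".join(collect_detail_urls(pages[0], hrefs)))
```

Either list can then serve as the input of a second task, which only has to open each URL and extract the detail fields.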
If you are not sure how to make your task run faster, feel free to leave a message. Our support team will help you check how to modify the task settings.
Happy Data Hunting!
Author: The Octoparse Team