Octoparse Cloud Service - Splitting Tasks to Speed Up Cloud Extraction
Monday, March 13, 2017 8:38 AM
In many cases, if a task has not been split appropriately, cloud extraction will show no obvious speed difference compared to a local extraction. In most cases, a task can be split by tweaking the loop in the workflow.
Here are some lists that can be split:
Fixed List
List of URLs
And these lists can’t be split:
Variable List
Let’s see how to split an extraction task using a Fixed List.
In the screenshot below, a variable list is created by default. With a variable list, the extraction can only run on a single cloud server, so cloud extraction is not going to outperform local extraction in this case.
To make the extraction faster, we can split the task by manually creating a fixed list. Here, we take the XPath //DIV[@id='mainResults']/UL/LI and append a sequential index to it, giving //DIV[@id='mainResults']/UL/LI[i] (i = 1, 2, 3, ..., n). Input the first edited XPath into the fixed list, then hit ‘OK’ and ‘Save’. Now the first item in the loop is detected accurately, which is just what we want.
In the same way, edit the XPath with each following index and input the results one by one, then hit ‘OK’ and ‘Save’.
Now, all loop items are detected correctly (just like when using a variable list).
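Typing each indexed XPath by hand gets tedious when the list is long. As a minimal sketch, the short Python script below generates the indexed XPaths so they can be pasted into the fixed list. The function name build_fixed_list and the item count of 5 are assumptions made for illustration; they are not part of Octoparse.

```python
from typing import List


def build_fixed_list(base_xpath: str, count: int) -> List[str]:
    """Return one indexed XPath per loop item: base[1], base[2], ..., base[count]."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # XPath positions are 1-based, so the index starts at 1 rather than 0.
    return [f"{base_xpath}[{i}]" for i in range(1, count + 1)]


# The base XPath comes from the example above.
BASE_XPATH = "//DIV[@id='mainResults']/UL/LI"
# Assumed number of items on the page; replace with the real count.
ITEM_COUNT = 5

fixed_list = build_fixed_list(BASE_XPATH, ITEM_COUNT)

# Print one XPath per line, ready to paste into the fixed list.
print("\n".join(fixed_list))
```

Change ITEM_COUNT to match the number of items the loop should cover on your page.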
If for some reason you do not want the task to be split (for example, when running it locally), check the appropriate box to prevent the task from splitting.
Octoparse Professional plan users can have up to 14 cloud servers working at any time. Tasks that are split correctly will have multiple cloud servers extracting simultaneously, scraping data a lot faster.
Users can adjust the maximum number of tasks running at the same time or prioritize those that are more important. Higher-priority tasks run first, while the others wait in line until a cloud server becomes available.
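To get a feel for why a split task finishes sooner, here is a purely illustrative sketch of dividing a fixed list of loop items across up to 14 servers. The round-robin strategy and the name split_across_servers are assumptions for the sake of the example; this article does not describe how Octoparse actually distributes the work.

```python
from typing import List, Sequence


def split_across_servers(items: Sequence[str], max_servers: int = 14) -> List[List[str]]:
    """Distribute items round-robin over at most max_servers buckets (hypothetical model)."""
    # Never use more servers than there are items to process.
    servers = min(max_servers, len(items))
    buckets: List[List[str]] = [[] for _ in range(servers)]
    for index, item in enumerate(items):
        buckets[index % servers].append(item)
    return buckets


# 30 loop items, written as indexed XPaths like the ones in the fixed list.
items = [f"//DIV[@id='mainResults']/UL/LI[{i}]" for i in range(1, 31)]
buckets = split_across_servers(items)

# The busiest server determines roughly how long the whole run takes.
largest_share = max(len(bucket) for bucket in buckets)
print(f"{len(buckets)} servers, at most {largest_share} items each")
```

In this model, 30 items spread over 14 servers leave no server with more than 3 items, whereas a task that cannot be split would have one server handle all 30.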
Author: The Octoparse Team