What is Cloud Extraction?

Octoparse offers a powerful Cloud platform that lets premium users (Standard, Professional, and Enterprise) run tasks 24/7.

When a task is run with Cloud Extraction, it runs on multiple cloud nodes that use Octoparse's IPs. You can shut down the app or even your computer while the task is running, so there is no need to worry about hardware limitations. Extracted data is saved in the cloud and can be accessed at any time.

Cloud Extraction also supports task scheduling. To keep your data up to date, you can schedule a task to run as frequently as you need.


1. Run your task with Cloud Extraction:

When you finish configuring your task, click "Run" and select "Standard Mode" or "Boost Mode" under "Run in the Cloud" to execute a run in the cloud.

Once a task is set to run in the cloud, its status will change to "Running" on the dashboard.


2. Batch run tasks with Cloud Extraction:

Select the tasks you want to run, click "Start Cloud Run", and the tasks will run together in the Cloud.


3. Settings of Cloud Extraction:

Cloud Extraction can run multiple tasks simultaneously.

On the Standard Plan, you can run up to 6 concurrent tasks in the cloud (up to 6 cloud nodes available), and on the Professional Plan, you can run up to 20 concurrent tasks (up to 20 cloud nodes available). To set the maximum number of tasks running in parallel, select the desired number from the drop-down options.

TIPS:

  • How’s the performance of Cloud Extraction?

Extracting data in the Cloud can be a lot faster than running the task locally, provided that the task is splittable (Learn about when a task is splittable).

A splittable task can be broken down into multiple subtasks that run on multiple nodes simultaneously, which makes the extraction faster.

  • Can I run more tasks than the maximum number allows?

Yes, but some of the tasks will be queued until earlier tasks finish and cloud nodes become available.
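
To illustrate the first tip, here is a minimal sketch of how a splittable task could be divided into subtasks. It is not Octoparse's actual splitting logic, which this article does not describe. It assumes, purely for illustration, that a task is a list of URLs and that each subtask runs on one cloud node; the function name split_into_subtasks and the example.com URLs are hypothetical.

```python
def split_into_subtasks(urls, node_count):
    """Distribute URLs round-robin into at most node_count subtasks.

    Hypothetical model: one subtask per cloud node, no empty subtasks.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    subtasks = [urls[i::node_count] for i in range(node_count)]
    # Drop empty subtasks when there are fewer URLs than nodes.
    return [subtask for subtask in subtasks if subtask]


# Ten placeholder URLs spread across 4 nodes.
urls = [f"https://example.com/page/{n}" for n in range(1, 11)]
subtasks = split_into_subtasks(urls, 4)

for index, subtask in enumerate(subtasks, start=1):
    print(f"subtask {index}: {len(subtask)} URLs")
```

Under this model each node handles only a fraction of the URLs, which is why a splittable task can finish sooner in the Cloud than in a single local run.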


4. Schedule a run in the cloud:

4.1. For a single task

When you finish configuring your task, click Run and select Schedule Cloud Runs.

Select the frequency and customize the time and date according to your requirements. Click Schedule ON and the task will be run as scheduled.

Timing for the next run can be found on the dashboard in the Next Run column.


If you wish to cancel a scheduled run, click More and select Schedule OFF under Cloud runs.


4.2. For a group of tasks

Go to your dashboard, switch to Task Group view, select your target task group, and click on the clock icon to set a schedule for the task group.


5. Frequently Asked Questions

5.1. What's the default time zone for the Octoparse Cloud platform?

The next execution time shown on the dashboard is based on your local time zone (according to your operating system) by default. However, if you've built the task to extract "current date & time" in the Cloud, the extracted time & date will be in UTC±00:00 regardless of your actual location.

You can convert the data timezone by following this tutorial: Convert the current time field to another time zone
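
If you prefer to convert the time after exporting the data, the following is a minimal sketch in Python. It is separate from the linked tutorial and makes two assumptions: the extracted value is a UTC timestamp in the form "YYYY-MM-DD HH:MM:SS" (your field may use a different format), and the target time zone can be expressed as a fixed offset in hours. The function name utc_to_offset is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_to_offset(utc_text, offset_hours):
    """Convert a UTC timestamp string to a fixed UTC offset.

    Assumes the input looks like "2024-01-15 22:30:00".
    """
    utc_time = datetime.strptime(utc_text, "%Y-%m-%d %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    target_zone = timezone(timedelta(hours=offset_hours))
    return utc_time.astimezone(target_zone).strftime("%Y-%m-%d %H:%M:%S")


# A value extracted in the Cloud (UTC), shown in UTC+08:00.
print(utc_to_offset("2024-01-15 22:30:00", 8))  # 2024-01-16 06:30:00
```

A fixed offset does not account for daylight saving time, so adjust the offset seasonally if your region observes it.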

5.2. Why does the task get duplicated data when it runs multiple times?


Octoparse will store the data scraped from all the runs together and recognize duplicates. Duplicates will be deleted automatically from the Cloud.

For example, Octoparse scrapes 100 lines for the first run, with no duplicates. When you check all the data for the task, there will be 100 lines.

If the website adds 5 new data lines when the task runs for the second time, the task will scrape 105 lines with 100 duplicates, and only the 5 new lines will be saved. The 100 duplicated lines will be deleted.

When you check all the data of the task (from the first and second runs), you will see 105 data lines in total. If you check the data for the second run's batch, you will find only 5 lines.

If you want to keep all the duplicates, please check out this tutorial: How can I keep the duplicates in the Cloud runs?
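
The 100-line and 105-line example above can be modeled with a short sketch. This is only an illustration of the counting, not Octoparse's implementation. It assumes that a line counts as a duplicate when it is identical to a line already stored; the function name save_run and the sample lines are hypothetical.

```python
def save_run(stored_lines, scraped_lines):
    """Keep only lines not stored yet and return them as this run's batch."""
    new_lines = []
    for line in scraped_lines:
        if line not in stored_lines:
            stored_lines.add(line)
            new_lines.append(line)
    return new_lines


cloud_data = set()  # all data stored for the task

first_run = [f"line {n}" for n in range(1, 101)]   # 100 lines
second_run = [f"line {n}" for n in range(1, 106)]  # same 100 plus 5 new

batch_1 = save_run(cloud_data, first_run)
batch_2 = save_run(cloud_data, second_run)

print(len(batch_1), len(batch_2), len(cloud_data))  # 100 5 105
```

The second batch holds only the 5 new lines, while the task's combined data holds 105.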

5.3. What are concurrent Cloud Runs?

Concurrent Cloud Runs refers to the maximum number of tasks you can run at the same time. If you are on the Standard Plan, you can run at most 6 concurrent extractions in the Cloud because you have up to 6 Cloud nodes (one task needs at least one node to run).

Please note that your tasks may sometimes be queued because a single splittable task can take up several or all of the nodes in your account. Once one task takes up all the nodes, the other tasks must wait until Cloud nodes become available. Read this tutorial for more details about task splitting: How can I scrape data faster in Cloud?

5.4. What affects the number of concurrent runs?

The main factors influencing your concurrent runs are 1) the number of Cloud nodes you have and 2) the number of nodes your running tasks take up.

For example, suppose you are on the Standard Plan, which means you have up to 6 Cloud nodes. If you have 6 tasks, and each task takes up only 1 node when running, you will see 6 tasks running at the same time.

If one of the tasks takes up 2 nodes (it is split into 2 or more sub-tasks), then only 4 nodes are left for the other tasks, so you will see at most 5 tasks running at the same time (the split task plus 4 single-node tasks). If that task takes up all 6 nodes, then you will see only one task running.
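
The node arithmetic in this answer can be checked with a small sketch. It is a simplified model, not Octoparse's scheduler. It assumes each task needs a fixed number of nodes and starts only if that many nodes are free, in the order the tasks are listed; the function name allocate and the task names are hypothetical.

```python
def allocate(node_limit, tasks):
    """Start tasks in order while enough nodes are free; queue the rest.

    tasks is a list of (name, nodes_needed) pairs.
    """
    free_nodes = node_limit
    running, queued = [], []
    for name, nodes_needed in tasks:
        if nodes_needed <= free_nodes:
            free_nodes -= nodes_needed
            running.append(name)
        else:
            queued.append(name)
    return running, queued


STANDARD_NODES = 6  # Standard Plan: up to 6 Cloud nodes

# Six tasks that take 1 node each.
all_single = [(f"task {n}", 1) for n in range(1, 7)]
# The first task is split and takes 2 nodes.
one_double = [("task 1", 2)] + [(f"task {n}", 1) for n in range(2, 7)]
# The first task is split and takes all 6 nodes.
one_full = [("task 1", 6)] + [(f"task {n}", 1) for n in range(2, 7)]

for label, tasks in [("1 node each", all_single),
                     ("one task on 2 nodes", one_double),
                     ("one task on 6 nodes", one_full)]:
    running, queued = allocate(STANDARD_NODES, tasks)
    print(f"{label}: {len(running)} running, {len(queued)} queued")
```

In this model a 2-node task leaves 4 nodes for single-node tasks, and a 6-node task leaves none, so every other task waits in the queue.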
