How to Improve the Speed of Data Extraction
Wednesday, July 20, 2016 11:48 PM
Sometimes extracting data with Octoparse may feel a little slow. The speed of data extraction is mainly determined by several factors:
- The speed of your computer's CPU (Central Processing Unit), especially when you are using Local Extraction.
- The amount of data you plan to crawl.
- Free trial or paid account (Standard Edition/Professional Edition).
- Local Extraction or Cloud Extraction.
Method One: Cloud Extraction
If you do not need to extract much data, Octoparse can finish the job very quickly even when you are using Local Extraction.
But if the amount of data you want is very large and you would like to speed up the extraction task, Cloud Extraction is strongly recommended.
Feature: Cloud Extraction - Run the tasks you set up on the cloud servers we provide.
Free Edition: Not available
Standard Edition: With 4 cloud servers provided by Octoparse's cloud platform, 4 tasks can run in the cloud at the same time, for 4 times the extraction speed of Local Extraction.
Professional Edition: With 4 cloud servers provided by Octoparse's cloud platform, 10 tasks can run in the cloud at the same time, for 10 times the extraction speed of Local Extraction.
Large-scale simultaneous web scraping, based on distributed computing, is the most powerful feature of Octoparse. After you upload your configuration project to the cloud, you can choose to run the extraction concurrently on many cloud servers. If you need to scrape 10,000 web pages within a short time, the Octoparse cloud service is the best fit.
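The idea behind running several tasks at the same time can be illustrated with a small Python sketch. This is not how Octoparse's cloud platform is implemented: a local thread pool stands in for the cloud servers, and `fake_extract`, the example.com URLs, and the 0.05-second delay are all assumptions made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_extract(url: str) -> dict:
    """Hypothetical stand-in for extracting one page.

    Sleeps briefly to mimic network latency instead of making a request.
    """
    time.sleep(0.05)
    return {"url": url, "length": len(url)}


def run_sequential(urls):
    """Extract pages one after another, like a single local task."""
    return [fake_extract(u) for u in urls]


def run_concurrent(urls, workers):
    """Extract pages with `workers` tasks running at the same time.

    pool.map returns results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_extract, urls))


if __name__ == "__main__":
    # Placeholder URLs; no real requests are made.
    urls = [f"http://example.com/page/{i}" for i in range(20)]

    start = time.perf_counter()
    run_sequential(urls)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    run_concurrent(urls, workers=4)
    print(f"4 workers:  {time.perf_counter() - start:.2f}s")
```

Because each simulated page spends most of its time waiting, the version with 4 workers should finish noticeably sooner than the sequential one. Real speed-ups depend on the target website, the network, and how evenly the work is split.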
Method Two: Choose "URL List Extraction" to configure a rule file.
Besides using Cloud Extraction, you can also improve the speed of data extraction by choosing the right way to configure your rule.
When the website you plan to crawl is complicated and the amount of data you want is very large, I suggest you choose “URL List Extraction” to configure a rule file.
It will greatly speed up the data extraction.
Next, I will show you two ways to configure a rule file.
1. List (There are a lot of items in the list.)
The workflow is to set up a loop action to select all the items first, then set up another loop action to extract the detailed data of each one.
2. URL (Extract all page URLs first and then use “URL List Extraction” to extract the detailed data from these pages.)
When you extract data from a large number of webpages, you can use "URL List Extraction."
(Example site: http://snav.amadeus.fr)
1. "List rule configuration"
2. "URL rule configuration"
First, configure a rule file to extract all page URLs.
Second, configure another rule file to extract the detailed data from these pages.
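The two-step URL approach can be sketched in Python as follows. This is only an illustration of the idea, not an Octoparse rule file: the helper names (`collect_urls`, `extract_detail`, `run_two_phase`), the use of the page title as the "detailed data", and the in-memory example pages are all assumptions. Pages are passed in through a `fetch` function so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class TitleCollector(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def collect_urls(listing_html, base_url):
    """Step 1: extract all page URLs from a listing page."""
    parser = LinkCollector(base_url)
    parser.feed(listing_html)
    return parser.links


def extract_detail(page_html):
    """Step 2: extract the detailed data (here, just the title) from one page."""
    parser = TitleCollector()
    parser.feed(page_html)
    return {"title": parser.title.strip()}


def run_two_phase(listing_html, base_url, fetch):
    """Run both steps; `fetch` maps a URL to that page's HTML."""
    urls = collect_urls(listing_html, base_url)
    return [dict(url=u, **extract_detail(fetch(u))) for u in urls]


if __name__ == "__main__":
    # Made-up pages held in memory instead of being downloaded.
    listing = '<a href="/item/1">One</a> <a href="/item/2">Two</a>'
    pages = {
        "http://example.com/item/1": "<title>First item</title>",
        "http://example.com/item/2": "<title>Second item</title>",
    }
    for row in run_two_phase(listing, "http://example.com/list", pages.__getitem__):
        print(row)
```

Separating the steps means the second one works from a flat list of URLs, and a flat list is easy to split across several tasks that run at the same time.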