A few years ago, we wrote a web crawler to parse and extract data from websites. The most painful part of that process was that extraction tasks could be interrupted: a computer might shut down unexpectedly, or the IP might be blocked by the target website because of frequent access.
To resolve this problem, we developed Cloud Extraction.
#1 Cloud Extraction
Cloud Extraction means running your data extraction tasks in the cloud. You configure a rule and upload it to our cloud platform; your task is then automatically assigned to one or several cloud servers, which extract data simultaneously under central control. For example, suppose you have configured a rule to extract data across 99 pages. The task will be divided into three sections and evenly assigned to three cloud servers, which extract data at the same time. In this way, it takes only one third of the original time to extract data from all 99 pages.
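The page-splitting idea above can be sketched in a few lines. The partitioning helper below is our own illustration of dividing a paginated task among workers, not Octoparse's actual implementation:

```python
def partition_pages(total_pages, num_workers):
    """Split pages 1..total_pages into contiguous, near-even chunks,
    one chunk per worker (cloud server)."""
    base, extra = divmod(total_pages, num_workers)
    chunks, start = [], 1
    for i in range(num_workers):
        # The first `extra` workers take one extra page when the
        # total does not divide evenly.
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

# 99 pages across 3 servers -> 33 pages each
for worker_id, pages in enumerate(partition_pages(99, 3)):
    print(f"server {worker_id}: pages {pages.start}-{pages.stop - 1}")
```

With 99 pages and 3 servers, each server handles a contiguous block of 33 pages, which is where the "one third of the original time" estimate comes from.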
#2 Avoid IP Being Blacklisted
Moreover, Cloud Extraction helps avoid many common errors, so we no longer have to worry about occasional network interruptions: when one occurs, the cloud servers resume their work as soon as the connection is available again. We also no longer need to worry about an IP being blacklisted. In the Professional Edition, Cloud Extraction provides a large pool of IP addresses, and by distributing your tasks across several cloud servers it both avoids blocks and speeds up extraction.
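The resume-on-failure behavior described above boils down to retrying a fetch with backoff instead of abandoning the task. This is a generic sketch of that pattern, not Octoparse's internal logic; `fetch` stands in for whatever function performs the actual request:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    """Call fetch(url); on a network-style failure (OSError),
    wait with exponential backoff and try again."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(backoff * (2 ** attempt))
```

A worker wrapping each page request this way can ride out a brief network drop and continue where it left off.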
If you need to extract data at a specified time, or update your data once an hour, you can set up a scheduled task for Cloud Extraction.
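Under the hood, an hourly schedule amounts to computing the next run time and waiting until it arrives. A minimal sketch of that calculation (the task launch itself is out of scope here):

```python
import datetime

def next_hourly_run(now):
    """Return the next whole hour after `now` -- the moment a
    scheduled hourly task would fire next."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + datetime.timedelta(hours=1)

# e.g. at 10:30 the next run is at 11:00
print(next_hourly_run(datetime.datetime(2023, 1, 1, 10, 30)))
```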
If you find that some data were not extracted, you can launch Octoparse to extract the missing data again.
The cloud service also provides an API that links your own system closely to Octoparse, enabling you to export extracted data directly into your database. For anyone who needs to keep system data up to date in real time, Octoparse is a strong choice: simply schedule a task to obtain the latest data, and your system is linked and updated automatically.
Octoparse API documents:
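To make the export step concrete, here is a hedged sketch of loading extracted rows into a local database. The JSON shape and the `pages` table are illustrative assumptions on our part, not Octoparse's documented API response format:

```python
import json
import sqlite3

def save_rows(db_path, rows):
    """Insert extracted rows (dicts with 'url' and 'title' keys)
    into a SQLite table, returning the row count afterwards."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
    con.executemany("INSERT INTO pages VALUES (:url, :title)", rows)
    con.commit()
    return con.execute("SELECT COUNT(*) FROM pages").fetchone()[0]

# In a real pipeline the rows would come from an API response;
# this sample payload is hypothetical:
sample = '{"data": [{"url": "http://example.com", "title": "Example"}]}'
rows = json.loads(sample)["data"]
save_rows(":memory:", rows)
```

A scheduled task that calls the API and then a function like this is all it takes to keep a database refreshed automatically.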
We are pleased to announce that we have released a new version of Octoparse, and we are very excited about its unique features. Octoparse is a free web scraper for collecting data from the web. Building on its popularity in the Chinese market, where Octoparse already has more than 180,000 users, we decided to expand into the international market.
We are glad to help and to make our product even better for you. If you find a feature missing, please feel free to contact us.
Author: The Octoparse Team