A User Guide to Octoparse, an Easy-to-Use Web Scraping Tool

About Octoparse

Octoparse is a modern visual web data extraction tool. Both experienced and inexperienced users can use it to bulk-extract information from websites. Most scraping tasks require no coding.

Octoparse supports Windows XP, 7, 8, and 10. It works with both static and dynamic websites, including pages that use Ajax. You can export data in formats such as CSV, Excel, HTML, and TXT, or to databases (MySQL, SQL Server, and Oracle via API). Octoparse simulates human operations to interact with web pages.

Features such as filling out forms and entering search terms into text boxes make extracting web data easy. You can run an extraction project either on your local machine (Local Extraction) or in the cloud (Cloud Extraction).

Some of our clients use Octoparse’s cloud service, which can extract and store large amounts of data to meet large-scale extraction needs.

The free and paid editions of Octoparse share some features. Paid editions let users extract large amounts of data around the clock using Octoparse’s cloud service. The price of each plan can be viewed here.

Workflow

Octoparse provides a visual operation pane that is user-friendly and straightforward. It simulates human browsing behavior such as opening a web page, logging into an account, entering text, and pointing at and clicking web elements. Just click the information you want in the built-in browser and start the extraction to get structured data.

Octoparse has two extraction modes: Task Template and Advanced Mode. It takes only half an hour to get started, and users with programming experience will need even less time to become familiar with the tool.

Cloud Extraction

Large-scale concurrent scraping, based on distributed computing, is the most powerful feature of Octoparse. After you upload a scraping project to the cloud, you can run the extraction concurrently on many cloud servers. If you need to scrape 10,000 web pages within a short time, the Octoparse cloud service is the best fit. The Standard Edition limits you to 10 cloud servers, which still greatly speeds up data extraction. You can also schedule extractions to run at regular intervals.
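
To make the idea of concurrent extraction concrete, here is a minimal sketch of how a list of 10,000 pages could be divided among 10 servers. It only illustrates the general principle of splitting work; it is not how Octoparse distributes tasks internally, and the function name and example URLs are assumptions made for this illustration.

```python
from typing import List


def split_across_servers(urls: List[str], server_count: int) -> List[List[str]]:
    """Deal URLs round-robin into one batch per server."""
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    batches: List[List[str]] = [[] for _ in range(server_count)]
    for index, url in enumerate(urls):
        batches[index % server_count].append(url)
    return batches


# Hypothetical URLs, used only for illustration.
urls = [f"https://example.com/page/{n}" for n in range(1, 10001)]
batches = split_across_servers(urls, server_count=10)
print([len(batch) for batch in batches])  # ten batches of 1,000 URLs each
```

With the work split this way, each server handles a tenth of the pages, which is why even a 10-server limit shortens the total run time considerably.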

Video: How to Extract Data From Millions of Web Pages in the Cloud

Advanced Mode

Advanced Mode provides a rich set of tools, including:

# RegEx Tool #

# XPath Tool #

# Database Auto Export Tool #

# API #

…

To improve the user experience, Octoparse provides a built-in RegEx generator. Refining scraped fields often requires regular expressions, and the tool helps with both generating and verifying them.
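
As an illustration of what refining a scraped field with RegEx means, the sketch below pulls a numeric price out of raw text using Python's standard re module. The sample values and the pattern are assumptions chosen for this example; a pattern produced by the Octoparse RegEx generator may look different.

```python
import re

# Hypothetical raw field values as they might come out of a scrape.
raw_prices = ["Price: $1,299.00 (incl. tax)", "Now only $45", "Sold out"]

# A dollar sign, then either comma-grouped digits or plain digits, then optional cents.
PRICE_PATTERN = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")


def extract_price(text):
    """Return the first price found in text as a float, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    cents = match.group(2) or ""
    return float(whole + cents)


prices = [extract_price(value) for value in raw_prices]
print(prices)  # [1299.0, 45.0, None]
```

Testing a pattern against a few sample values like this, including one that should not match, is the same kind of verification the generator is meant to support.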

API

The Octoparse API makes it easy to connect your system to large amounts of data in real time. You can either import Octoparse data into your own database or use our API to request access to your own account’s data. Just configure the rule for your task, and Octoparse cloud servers will do the rest. Data is returned as XML.

Video: How to Extract Data to Your Database via API

To use the Octoparse Standard API, you will need to hold a Standard or Professional account with at least one runnable task set up.
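
Since the data comes back as XML, importing it into your own database mostly means parsing the XML and inserting rows. The sketch below shows one way to do that with Python's standard library. The XML layout, the field names, and the products table are assumptions made for illustration, not the documented Octoparse response format, and SQLite stands in for whichever database you actually use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical response body; real element names depend on your task's fields.
SAMPLE_XML = """
<rows>
  <row><title>Blue Widget</title><price>19.99</price></row>
  <row><title>Red Widget</title><price>24.50</price></row>
</rows>
"""


def parse_rows(xml_text):
    """Turn each <row> element into a dict mapping child tag to text."""
    root = ET.fromstring(xml_text.strip())
    return [{child.tag: (child.text or "") for child in row} for row in root.iter("row")]


def store_rows(connection, rows):
    """Insert parsed rows into a simple two-column table."""
    connection.execute("CREATE TABLE IF NOT EXISTS products (title TEXT, price REAL)")
    connection.executemany(
        "INSERT INTO products (title, price) VALUES (?, ?)",
        [(row["title"], float(row["price"])) for row in rows],
    )
    connection.commit()


rows = parse_rows(SAMPLE_XML)
connection = sqlite3.connect(":memory:")  # in-memory database for the demo
store_rows(connection, rows)
stored = connection.execute("SELECT title, price FROM products ORDER BY price").fetchall()
print(stored)  # [('Blue Widget', 19.99), ('Red Widget', 24.5)]
```

The same two steps, parse then insert, apply to MySQL, SQL Server, or Oracle with the corresponding database driver.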

Proxies

Does it ever drive you crazy when your IP address is banned and you can no longer access a website because you scraped it too frequently? This happens especially when you extract data from business directories that apply strict anti-bot measures. Octoparse lets you scrape these websites by rotating anonymous HTTP proxy servers. In Cloud Extraction, Octoparse uses many third-party proxies for automatic IP rotation. For Local Extraction, you can manually add a list of external proxy addresses and configure them for automatic rotation.

IPs are rotated at a time interval you set. This way, you can extract data from a website while reducing the risk of having your IP address banned.
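
The following sketch shows the general idea behind interval-based rotation: given a list of proxies and an interval, the active proxy changes each time the interval elapses. The class name and the proxy addresses are hypothetical, and this is only an illustration of the technique, not Octoparse's own implementation.

```python
import time
from typing import Callable, List


class TimedProxyRotator:
    """Cycle through a proxy list, switching after a fixed interval."""

    def __init__(self, proxies: List[str], interval_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self._proxies = list(proxies)
        self._interval = interval_seconds
        self._clock = clock
        self._started = clock()

    def current(self) -> str:
        """Return the proxy that is active at this moment."""
        elapsed = self._clock() - self._started
        slot = int(elapsed // self._interval)
        return self._proxies[slot % len(self._proxies)]


# Placeholder addresses from the documentation-only 192.0.2.0/24 range.
rotator = TimedProxyRotator(
    ["http://192.0.2.10:8080", "http://192.0.2.11:8080", "http://192.0.2.12:8080"],
    interval_seconds=30,
)
print(rotator.current())  # the first proxy, until 30 seconds have passed
```

A shorter interval spreads requests over more addresses in the same period, at the cost of switching connections more often.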