Generally, web crawlers such as Google's retrieve every web page they can reach. They follow the links on each page and read its content (usually text) to work out what the page is about, and that is how the pages get indexed for search.
Crawlers that run in Octoparse, by contrast, are driven by the rules you configure, and the data they extract is structured. Octoparse does not try to interpret web content with advanced algorithms; it grabs exactly the content you point it at.
Today we’ll talk about what an Octoparse RULE is.
The extraction rule is one of the most important features of Octoparse. The rule you configure tells Octoparse which website to open, where the data you plan to crawl is located, what kind of data you want, and so on.
You can configure a rule to paginate, to scrape a website behind a login, to collect data from web pages loaded with AJAX, or to scrape a website with infinite scrolling. None of this happens on its own, though: you have to build it into a rule.
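To make the idea concrete, here is a minimal sketch in Python of the three things a rule carries. It is only an illustration: `ExtractionRule`, its field names, the CSS-style selectors and the example.com URL are all hypothetical, and this is not how Octoparse stores rules internally.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionRule:
    """Hypothetical stand-in for a rule; not Octoparse's real format."""
    start_url: str        # which website to open
    item_selector: str    # where the data you plan to crawl is located
    fields: dict[str, str] = field(default_factory=dict)  # what data you want


# Assumed example values, chosen only for illustration.
rule = ExtractionRule(
    start_url="https://example.com/products",
    item_selector="div.product",
    fields={"title": "h2", "price": "span.price"},
)

print(rule.start_url)
print(sorted(rule.fields))
```

In Octoparse you would express the same information by clicking through the workflow designer rather than typing it out.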
Trust me, it's very easy. If you can use a web browser, you can use Octoparse. Moreover, Octoparse has a visual workflow designer that shows how the rule is put together.
You do not need to write any code in Octoparse. Just tell Octoparse what you want it to do by dragging actions into the workflow designer and selecting options to optimize the process.
Let’s take an example of a simple web page extraction with pagination.
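Octoparse builds this kind of task in the workflow designer, with no code involved. Purely to show the logic such a rule encodes (open a page, extract the items, follow the "next" link, repeat), here is a rough Python sketch. Everything in it is an assumption made for illustration: the two in-memory pages in `PAGES`, the `item` and `next` class names, and the `crawl` helper. It uses only the standard library and makes no network requests.

```python
from html.parser import HTMLParser

# Hypothetical site: two in-memory pages standing in for real HTTP responses.
PAGES = {
    "/list?page=1": (
        '<ul><li class="item">Alpha</li><li class="item">Beta</li></ul>'
        '<a class="next" href="/list?page=2">Next</a>'
    ),
    "/list?page=2": '<ul><li class="item">Gamma</li></ul>',
}


class PageParser(HTMLParser):
    """Collects the text of <li class="item"> and the href of <a class="next">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and attrs.get("class") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def crawl(start_url, fetch=PAGES.get, max_pages=10):
    """Extract items page by page, following the 'next' link until it runs out."""
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        if html is None:  # page not found: stop paginating
            break
        parser = PageParser()
        parser.feed(html)
        parser.close()
        rows.extend({"page": url, "name": name} for name in parser.items)
        url = parser.next_url  # None on the last page ends the loop
    return rows


rows = crawl("/list?page=1")
for row in rows:
    print(row["page"], row["name"])
```

The output is structured: one row per item, each tagged with the page it came from. The `seen` set and `max_pages` cap are there so the loop cannot run forever if a site's "next" link points back to an earlier page.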
Happy Data Hunting!
Author: The Octoparse Team