Data Harvesting Is Solving These Two Problems
Thursday, May 19, 2016
What is an element?
When you open a webpage, how do you know what information the page is trying to convey? How do you know that one element is the title of the article? How do you know that another element refers to the website owner?
We notice an element's position, its font size and color, and the text around it. Beyond that, we can judge whether the element's content looks like a person's name or a date, what it means in the surrounding context, what field the article belongs to, and so on. People bring a great deal of experience and knowledge to interpreting the information on a web page the moment they see it.
In the "XPath / RegEx" analytical mode, the data extraction process is set up manually: we interpret the webpage ourselves and locate the element that contains the information. In the "data highlighter" analytical mode, we mark up elements across multiple pages to show the machine where each attribute is.
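The "XPath / RegEx" mode can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the sample page, class names, and expressions are hypothetical, not Octoparse's actual implementation.

```python
# A sketch of the "XPath / RegEx" analytical mode: an XPath-style
# expression locates the element, and a regular expression describes
# the textual shape of the data inside it. The page is made up.
import re
import xml.etree.ElementTree as ET

page = """\
<html><body>
  <h1 class="article-title">How Web Scraping Works</h1>
  <span class="byline">By Jane Doe, May 19, 2016</span>
</body></html>"""

tree = ET.fromstring(page)

# The XPath expression describes *where* the element sits in the page.
title = tree.find('.//h1[@class="article-title"]').text

# The regular expression describes *what the data looks like* as text.
byline = tree.find('.//span[@class="byline"]').text
author = re.search(r"By (.+?),", byline).group(1)

print(title)   # How Web Scraping Works
print(author)  # Jane Doe
```

Both pieces of knowledge that a human reader applies automatically, position and textual shape, have to be written down explicitly for the program.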
But page structure is hard to understand for those without adequate knowledge. And just as natural language can be ambiguous, page structure is sometimes ambiguous as well, which makes it much harder for a computer to work out what the page says.
What is the difference between elements?
Because it is a program that extracts data in bulk, we have to tell it precisely which elements we want extracted. In the "XPath / RegEx" analytical mode, XPath and regular expressions are those descriptions of the information. Choosing a correct expression that covers different pages yet distinguishes the target from other attributes is not an easy job; it takes experience and skill.
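The trade-off between expressions can be shown concretely. In this sketch (the pages and class names are hypothetical), a positional path breaks as soon as the layout shifts, while an expression keyed on an identifying attribute keeps working across pages:

```python
# Why expression choice needs skill: a fragile, position-based path
# fails when the page layout changes; an attribute-based one covers
# both pages and still singles out the right element.
import xml.etree.ElementTree as ET

page_a = '<body><div><span class="price">$10</span></div></body>'
page_b = ('<body><div class="ad">Sale!</div>'
          '<div><span class="price">$12</span></div></body>')

def first_div_span(tree):
    # Fragile: assumes the price always sits inside the first <div>.
    node = tree.find('./div[1]/span')
    return node.text if node is not None else None

def by_class(tree):
    # Robust: keys on the attribute that identifies the data itself.
    return tree.find('.//span[@class="price"]').text

a, b = ET.fromstring(page_a), ET.fromstring(page_b)
print(first_div_span(a), first_div_span(b))  # $10 None
print(by_class(a), by_class(b))              # $10 $12
```

The second expression "covers different pages and differentiates from other attributes" in exactly the sense the text describes.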
In the "data highlighter" analytical mode, the software selects a correct expression automatically. And in the process of "wrapper generation and data extraction", the rules are likewise worked out by the computer.
After solving these two problems, we come to structural parsing.
Structural parsing is essentially the program's interpretation of a web page, whether that interpretation comes from rules created manually or from conventions built into the program.
Imagine that when someone opens a page, he or she knows what the page says and what information it contains. That ability to acquire knowledge from web pages is something people develop. Likewise, structural extraction is the process by which a computer acquires knowledge from web pages. In Octoparse, we use a "rule" to tell the program what data we want to grab from the web page.
Author: The Octoparse Team
For more information about Octoparse, please click here.
Sign up today.