Data Harvesting Is Solving These Two Problems

In our previous article “Three kinds of analytical modes to extraction data from websites — XPath, Data Highlighter and Wrapper”, we were trying to solve two problems.

What an element is talking about

When you open a page, how do you know what information the page is trying to convey? How do you know that one element is the title of the article? How do you know that another element refers to the author?

We might look at the position of an element, its font size and color, and the text in front of it. We can also judge whether its content looks like a person’s name or a time, what the surrounding context is talking about, which field the article belongs to, and so on. People draw on a lot of experience and knowledge to interpret the information on a web page.

In the “XPath / RegEx” analytical mode, this interpretation is done manually. We read the web page ourselves and work out the position of the element that contains the information. In the “data highlighter” analytical mode, we mark up multiple pages to tell the machine where each attribute is.
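As a rough illustration of the “XPath / RegEx” mode, the sketch below locates elements with path expressions and then pulls the exact values out with a regular expression. The page fragment, the class names and the function name extract_article are assumptions made up for this example. Python's standard xml.etree.ElementTree only understands a small subset of XPath and needs well-formed markup, so real pages usually call for a more lenient parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment invented for this example.
PAGE = """
<html><body>
  <div class="post">
    <h1 class="title">Data Harvesting Basics</h1>
    <p class="meta">By Jane Doe on 2017-11-29</p>
  </div>
</body></html>
"""

def extract_article(page: str) -> dict:
    """Locate elements by path, then refine their text with a regex."""
    root = ET.fromstring(page.strip())
    # Step 1: the path expressions say WHERE the information is.
    title = root.find(".//h1[@class='title']").text
    meta = root.find(".//p[@class='meta']").text
    # Step 2: the regular expression picks the exact values out of the text.
    match = re.search(r"By (?P<author>.+?) on (?P<date>\d{4}-\d{2}-\d{2})", meta)
    if match is None:
        raise ValueError("meta line does not have the assumed format")
    return {
        "title": title,
        "author": match.group("author"),
        "date": match.group("date"),
    }

record = extract_article(PAGE)
print(record)
```

Both the path expressions and the pattern were written by a person who had already looked at the page, which is exactly the manual interpretation described above.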

However, the style and structure of a page are hard to understand for those without adequate knowledge. For instance, our grandparents may not know what the page on the screen is saying. Moreover, just as language is ambiguous, page structure is sometimes ambiguous as well, which makes it very difficult for a computer to understand what the page says.

What is the difference between elements

Because it is the computer that extracts large quantities of data, we have to tell it exactly which elements we want to extract. In the “XPath / RegEx” analytical mode, XPath and regular expressions are the descriptions of that information. Selecting a correct expression, one that covers different pages and distinguishes the target from other attributes, is not an easy job and takes experience and skill.
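To make “covers different pages and distinguishes the target from other attributes” concrete, here is a minimal sketch that tests a few candidate expressions against two sample pages. The pages, the candidate list and the helper name is_reliable are invented for illustration, and the check (exactly one match per page, with the expected text) is only one possible definition of a “correct” expression.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages; the second one has an extra element before the author.
PAGES = [
    "<div><h1>First post</h1>"
    "<span class='author'>Ann</span><span class='date'>2021-01-21</span></div>",
    "<div><h1>Second post</h1><span class='ad'>Sponsored</span>"
    "<span class='author'>Bob</span><span class='date'>2021-01-25</span></div>",
]
EXPECTED_AUTHORS = ["Ann", "Bob"]

# Candidate expressions a person might try for the "author" attribute.
CANDIDATES = [".//span", ".//span[1]", ".//span[@class='author']"]

def is_reliable(xpath, pages, expected):
    """True if the expression selects exactly the expected element on every page."""
    for page, want in zip(pages, expected):
        matches = ET.fromstring(page).findall(xpath)
        if len(matches) != 1 or matches[0].text != want:
            return False
    return True

reliable = [x for x in CANDIDATES if is_reliable(x, PAGES, EXPECTED_AUTHORS)]
print(reliable)
```

In this toy data, `.//span` is too broad because it also selects the date, and `.//span[1]` breaks on the second page where another element comes first. Only the class-based expression survives both pages.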

In the “data highlighter” analytical mode, the software selects a correct expression automatically. In the process of “wrapper generation and data extraction”, the rules are also derived by the computer.
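The idea of software choosing an expression from marked-up examples can be sketched as follows. This is a toy induction procedure written for this article, not the algorithm used by any particular data highlighter or by Octoparse. The example pages and the function names candidates_for, selects_only and induce_rule are assumptions.

```python
import xml.etree.ElementTree as ET

def candidates_for(page, marked_text):
    """Propose simple expressions that could describe the marked element."""
    found = set()
    for el in ET.fromstring(page).iter():
        if (el.text or "").strip() == marked_text:
            found.add(f".//{el.tag}")
            for name, value in el.attrib.items():
                # Assumes attribute values contain no single quotes.
                found.add(f".//{el.tag}[@{name}='{value}']")
    return found

def selects_only(page, xpath, marked_text):
    """True if the expression selects the marked element and nothing else."""
    matches = ET.fromstring(page).findall(xpath)
    return len(matches) == 1 and (matches[0].text or "").strip() == marked_text

def induce_rule(examples):
    """Pick the shortest expression that works on every marked-up page."""
    if not examples:
        return None
    shared = set.intersection(*(candidates_for(p, t) for p, t in examples))
    working = [x for x in shared
               if all(selects_only(p, x, t) for p, t in examples)]
    return min(working, key=lambda x: (len(x), x)) if working else None

# Hypothetical pages, each paired with the text the user highlighted as "author".
EXAMPLES = [
    ("<div><h1>First post</h1><span class='author' id='a1'>Ann</span>"
     "<span class='date'>2021-01-21</span></div>", "Ann"),
    ("<div><h1>Second post</h1><span class='author' id='a7'>Bob</span>"
     "<span class='date'>2021-01-25</span></div>", "Bob"),
]

rule = induce_rule(EXAMPLES)
print(rule)
```

The id-based candidates drop out because they differ from page to page, and the bare tag drops out because it also matches the date. This is why marking up more than one page helps the software generalize.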

Structural Parsing

After solving these two problems, we come to structural parsing.

Structural parsing is essentially the computer’s interpretation of a web page, whether that interpretation is based on rules that we create manually and agree on with the computer, or on machine learning.

We can imagine that when someone opens a page, he or she knows what the page says and what information it contains. That is the ability, and the method, by which people acquire knowledge from web pages. Likewise, structural extraction is the process by which a computer acquires knowledge from web pages. In Octoparse, we use a “rule” to tell the computer what data we want to fetch from the web page.
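One way to picture a “rule” is as plain data that maps each field to a description of where it lives, which the computer then applies to every page. The format below is a hypothetical one invented for this sketch and is not Octoparse’s actual rule format. The pages, field names and the function apply_rule are likewise assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: field name -> (path expression, optional regex with one group).
RULE = {
    "title": (".//h1", None),
    "author": (".//p[@class='meta']", r"By (.+?) on "),
    "date": (".//p[@class='meta']", r"(\d{4}-\d{2}-\d{2})"),
}

def apply_rule(rule, page):
    """Turn one page into one structured row; missing fields become None."""
    root = ET.fromstring(page)
    row = {}
    for field, (xpath, pattern) in rule.items():
        node = root.find(xpath)
        text = (node.text or "").strip() if node is not None else ""
        if pattern and text:
            match = re.search(pattern, text)
            text = match.group(1) if match else ""
        row[field] = text or None
    return row

# Two hypothetical pages; the second has no author in its meta line.
PAGES = [
    "<article><h1>First post</h1><p class='meta'>By Ann on 2021-01-21</p></article>",
    "<article><h1>Second post</h1><p class='meta'>Posted 2021-01-25</p></article>",
]

rows = [apply_rule(RULE, page) for page in PAGES]
for row in rows:
    print(row)
```

Keeping the rule separate from the code that applies it means the same extraction logic can be reused across sites by swapping in a different rule.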
