Three Kinds of Analytical Modes to Extract Data from Websites
Thursday, May 19, 2016
XPath & RegEx
Many tools and platforms use XPath and regular expressions to extract data from websites. Software such as Octoparse, Mozenda, and import.io is based on this method. The working principle is to locate the relevant data with an XPath expression and then pull the exact values out of the located content with a regular expression. The advantage of this method is that it is very effective: data can be extracted from almost any web page this way. The drawback is the high learning cost. You need to understand what XPath and regular expressions are and how to write them. Although this is not particularly difficult, it can be a hurdle for those without coding experience.
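To make the two steps concrete, here is a minimal sketch in Python. It is not how Octoparse, Mozenda, or import.io work internally. It only illustrates the principle, using the standard library's limited XPath support. The page fragment, the `product` and `price` class names, and the `extract_products` helper are all assumptions made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div class="product"><h2>Blue Kettle</h2><span class="price">Price: $24.99</span></div>
  <div class="product"><h2>Red Toaster</h2><span class="price">Price: $39.50</span></div>
</body></html>
"""

# Regular expression step: pull the number out of text such as "Price: $24.99".
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_products(page):
    """Locate product blocks with XPath, then refine the price with a regex."""
    root = ET.fromstring(page.strip())
    rows = []
    # XPath step: find every <div class="product"> anywhere in the page.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price_text = node.findtext("span[@class='price']") or ""
        match = PRICE_RE.search(price_text)
        rows.append((name, float(match.group(1)) if match else None))
    return rows


products = extract_products(PAGE)
for name, price in products:
    print(name, price)
```

Real pages are rarely well-formed XML, so in practice an HTML-aware parser would replace `xml.etree.ElementTree`. The division of labor stays the same: XPath finds the node, the regular expression cleans up its text.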
Data Highlighter
Good data extraction tools generally add an important feature called a data highlighter. Given a set of similar pages, you mark up (highlight) the location of each attribute on the page, and the tool learns the characteristics of each attribute from the information tagged across multiple pages. The learned characteristic can be an XPath, the surrounding context, or a feature vector for a machine learning model.
(For more information, please visit http://googlewebmastercentral.blogspot.com/2012/12/introducing-data-highlighter-for-event.html)
The data highlighter is a mature technique and is currently the most versatile and easiest way to extract data. It learns the rules from the same attributes highlighted on multiple pages, which removes the cost of writing rules by hand. In practice, for pages built from a single template, rules can be generated by marking up just two pages, so setting rules with a data highlighter is much more efficient. The extraction of data from list pages in Octoparse is based on this principle: you select two links in the list, and the other links at the same position in the list are extracted as well.
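The "select two links, get the rest" behavior can be sketched as rule generalization over two highlighted paths. This is only an illustration of the idea, not Octoparse's actual algorithm. The `generalize` function, the sample page, and the two highlighted paths are hypothetical. Wherever the two paths differ only in a positional index, the index is dropped so the rule matches every sibling.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list page used only for illustration.
PAGE = """<html><body>
<ul>
  <li><a href="/item/1">First</a></li>
  <li><a href="/item/2">Second</a></li>
  <li><a href="/item/3">Third</a></li>
</ul>
</body></html>"""

# Splits a path step such as "li[2]" into its tag and optional position.
STEP_RE = re.compile(r"^(?P<tag>[^\[]+)(?:\[(?P<index>\d+)\])?$")


def generalize(path_a, path_b):
    """Merge two highlighted paths into one rule by dropping differing indexes."""
    steps_a, steps_b = path_a.split("/"), path_b.split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("highlighted elements sit at different depths")
    merged = []
    for step_a, step_b in zip(steps_a, steps_b):
        match_a, match_b = STEP_RE.match(step_a), STEP_RE.match(step_b)
        if not match_a or not match_b or match_a.group("tag") != match_b.group("tag"):
            raise ValueError("highlighted elements do not share a structure")
        # Identical steps are kept; steps that differ only by index lose the index.
        merged.append(step_a if step_a == step_b else match_a.group("tag"))
    return "/".join(merged)


# The user highlights the first and second link (paths relative to <html>).
rule = generalize("body/ul/li[1]/a", "body/ul/li[2]/a")

root = ET.fromstring(PAGE)
links = [(a.text, a.get("href")) for a in root.findall(rule)]
print(rule)
print(links)
```

Two examples are enough here because the only thing that varies between them is the list position. A real highlighter would also have to cope with optional elements and differing attributes.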
Wrapper Generation and Data Extraction
Building on the data highlighter, software like ours has added wrapper generation and data extraction. A wrapper rests on one assumption: structured web pages are produced by filling templates with data from a database. By analyzing the pages, we recover the page template and then use it to extract the structured data. Web developers know that a good website is generated from back-end data combined with front-end templates.
By comparing several similar web pages, software with wrapper generation finds the shared structure, or the elements holding the same data types, and then merges the DOM trees. By comparing nodes of the same type across different pages, it works out which parts change and which stay the same. The changing parts are the data, while the unchanging parts are the template. It is like holding a stack of printed sheets in front of a light source: the parts where the text changes look darker than the rest.
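The comparison described above can be sketched as a parallel walk over two DOM trees. This is a simplified illustration, not the method of any particular product. It assumes both pages have exactly the same tree shape and only compares the text directly inside each element. The sample pages and the `diff_trees`, `induce_wrapper`, and `apply_wrapper` names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three hypothetical pages generated from the same template.
TEMPLATE = ("<html><body><h1>Product details</h1>"
            "<div><span>Name:</span><b>{name}</b></div>"
            "<div><span>Price:</span><b>{price}</b></div>"
            "</body></html>")
PAGE_A = TEMPLATE.format(name="Blue Kettle", price="$24.99")
PAGE_B = TEMPLATE.format(name="Red Toaster", price="$39.50")
PAGE_C = TEMPLATE.format(name="Green Mug", price="$7.25")


def diff_trees(node_a, node_b, path, slots):
    """Walk two trees in parallel and record paths whose text differs."""
    if node_a.tag != node_b.tag or len(node_a) != len(node_b):
        raise ValueError(f"pages do not share a template at {path}")
    # Unchanged text belongs to the template; changed text is data.
    if (node_a.text or "").strip() != (node_b.text or "").strip():
        slots.append(path)
    seen = {}  # per-tag counter, used to build positional steps like div[2]
    for child_a, child_b in zip(node_a, node_b):
        seen[child_a.tag] = seen.get(child_a.tag, 0) + 1
        step = f"{child_a.tag}[{seen[child_a.tag]}]"
        diff_trees(child_a, child_b, f"{path}/{step}", slots)


def induce_wrapper(page_a, page_b):
    """Return the paths of the data slots found by comparing two pages."""
    slots = []
    diff_trees(ET.fromstring(page_a), ET.fromstring(page_b), ".", slots)
    return slots


def apply_wrapper(slots, page):
    """Extract the data from a new page that uses the same template."""
    root = ET.fromstring(page)
    return [(root.find(path).text or "").strip() for path in slots]


slots = induce_wrapper(PAGE_A, PAGE_B)
print(slots)
print(apply_wrapper(slots, PAGE_C))
```

The labels "Name:" and "Price:" are identical on both pages, so they are treated as template, while the two `<b>` elements become data slots that can then be read from any further page of the same kind.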
The basic idea of Octoparse is to capture the shared features of list pages as a template, using XPath and the data highlighter. You configure a rule in the workflow designer with simple actions that tell Octoparse which data you want to collect. No coding experience is required. Combining the key points of the three techniques described in this article, you use the rule configurator in Octoparse to highlight elements on one web page. Octoparse then uses that configuration as a template to find the same features on other web pages and extract exactly the data you want.
Author: The Octoparse Team