HTML Scraper

5/17/2016 11:49:38 PM

When someone asked Michelangelo how he had created a masterpiece like “David”, he replied, “It was easy. I went to the quarry and saw a huge block of marble. All I did was chip away everything that didn’t look like David.”

Similarly, we remove information we don’t need and extract what we need from a web page.

In our previous articles, we have talked about how to handle HTML with regular expressions. See the articles below.  

 

Using Regular Expression to Match HTML

Advanced Text - Recommendations to Handle HTML With Regular Expression

Extract Text From HTML Document

Comparison of HTML parsers (from Wikipedia)

 

 

Alternatives:

  

1. Regular Expression

 

The article Using Regular Expression to Match HTML, listed above, explains how to extract HTML content with regular expressions. However, this method is not recommended in practice. The main reasons are that regular expressions are relatively time-consuming to write and verify, their performance is difficult to predict, and they are hard to understand at a glance.
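
As a rough illustration of why this approach is fragile, here is a minimal Python sketch. The sample markup, the pattern and the function name extract_span_text are assumptions made for this example and do not come from the earlier articles. The pattern works on flat markup but returns a broken fragment as soon as the tags are nested.

```python
import re

# Hypothetical sample markup, modelled on the snippet in the CSS Selector section.
HTML = '<div class="test" id="testId"><p><span>Test</span></p></div>'

# Non-greedy match of whatever sits between <span ...> and the next </span>.
SPAN_TEXT = re.compile(r"<span[^>]*>(.*?)</span>", re.DOTALL | re.IGNORECASE)


def extract_span_text(html):
    """Return the captured content of every <span>...</span> match in html."""
    return SPAN_TEXT.findall(html)


# Flat markup: the pattern finds the text we want.
print(extract_span_text(HTML))  # ['Test']

# Nested markup: the match stops at the first </span>, so the result is a
# fragment of HTML rather than the text of either span.
print(extract_span_text("<span>a<span>b</span>c</span>"))  # ['a<span>b']
```

Handling the nested case correctly would require a noticeably more complicated pattern, which is the maintenance cost described above.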

 

2. XPath

 

XPath is well suited to extracting content from web pages and is strongly recommended. Its syntax is simple, and XPath expressions are easier to read, write and test than regular expressions. XPath libraries are available for many programming languages.
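
To give a feel for the syntax, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and requires well-formed markup. The sample markup and the helper name select_text are assumptions for illustration. Real web pages are often not well-formed, so in practice a tolerant HTML parser with fuller XPath support would likely be needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from a real page).
HTML = (
    '<body>'
    '<div class="test" id="testId"><p><span>Test</span></p></div>'
    '<span>Other</span>'
    '</body>'
)


def select_text(markup, path):
    """Parse markup and return the text of every element matched by path."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


# Every <span> anywhere below the root element, in document order.
print(select_text(HTML, ".//span"))  # ['Test', 'Other']

# Only the <span> inside the <div> whose id attribute is "testId".
print(select_text(HTML, ".//div[@id='testId']/p/span"))  # ['Test']
```

Each path reads as a description of where the element sits in the document, which is what makes it easier to write and test than an equivalent regular expression.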

The articles below may be helpful:

 

XPath - Brief Introduction

Brief Intro to HTML Document

Getting started with XPath 1

Getting started with XPath 2

 

 

3. CSS Selector

 

CSS selectors are also a good choice for web content extraction. In the browser, document.querySelector() returns the first HTML element that matches a selector, and document.querySelectorAll() returns every element that matches it. Like an XPath expression, a CSS selector describes elements by their tag names, attributes and position in the document. However, not every programming language has a CSS selector library.

 

Sample code:

 

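<div class="test" id="testId">
     <p><span>Test</span></p>
</div>
<script type="text/javascript">
     // Look up the container element by its id.
     var testElement = document.getElementById('testId');
     // Find the first descendant <span> that matches the CSS selector.
     var element = testElement.querySelector('.test span');
     // Print the text of the matched element.
     console.log(element.innerText);
</script>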

 

 

 

The output:

Test

If you want to learn more about scraping HTML with Python, you can read the article at http://docs.python-guide.org/en/latest/scenarios/scrape/

 

 

 

 

 

Author: The Octoparse Team

 

 

 
