Advanced Text - Recommendations for Handling HTML With Regular Expressions

5/10/2016 12:22:55 AM

 

 

In this article, we discuss how to handle HTML more effectively with regular expressions.

Two situations you may encounter when parsing HTML with regular expressions:

The HTML is a small section that is simple and well formed (no missing tags). In this case, regular expressions can handle it easily.

The HTML is complex, and parts of it are poorly laid out or have missing tags. In this case, it is better to use a library built specifically for parsing HTML, which makes it much easier to extract the data you need from the parsed content.
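The first situation can be sketched as follows. This is a minimal illustration, not a general recipe: the fragment in `SIMPLE_HTML`, the `price` class, and the `extract_prices` helper are all hypothetical names made up for the example, and the pattern only works because every tag in the fragment is written exactly the same way and is properly closed.

```python
import re
from typing import List

# Hypothetical, well-formed fragment: every <li> is closed and
# every opening tag is written in exactly the same way.
SIMPLE_HTML = '<ul><li class="price">19.99</li><li class="price">5.00</li></ul>'

# Non-greedy capture of the text between one opening tag and
# the nearest closing </li>.
PRICE_RE = re.compile(r'<li class="price">(.*?)</li>')


def extract_prices(html: str) -> List[str]:
    """Return the text of every price item found in the fragment."""
    return PRICE_RE.findall(html)


print(extract_prices(SIMPLE_HTML))  # ['19.99', '5.00']
```

For a small, predictable fragment like this one, a single short pattern is enough.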

If you try to write regular expressions for such complex HTML, you will find that most of the time it is practically impossible to write ones that work reliably. The same tag can vary in letter case, attributes, whitespace, and nesting, so it is hard for one regular expression, or even several, to match every variation.

 

Imagine that you have to parse a large amount of HTML, some of which has missing tags or does not follow the standard. In this case, even the complex and carefully tuned regular expressions you wrote become useless, because they no longer match the data you want.
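A small sketch of this failure, using made-up input: the pattern `ITEM_RE` and both sample strings are hypothetical. The second string describes the same list as the first, but with the closing `</li>` tags left out, mixed letter case, and an extra attribute, all of which browsers accept.

```python
import re

# Pattern written against the tidy version of the markup.
ITEM_RE = re.compile(r'<li>(.*?)</li>')

# Tidy input: lower-case tags, every <li> closed.
well_formed = '<ul><li>Apple</li><li>Pear</li></ul>'

# Same list as real pages often write it: no closing </li>,
# mixed case, and an attribute on one of the items.
messy = '<UL><li>Apple<LI class="x">Pear</ul>'

print(ITEM_RE.findall(well_formed))  # ['Apple', 'Pear']
print(ITEM_RE.findall(messy))        # []
```

The pattern is not wrong for the input it was written for; it simply has no way to cope with markup that deviates from that shape.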

So for complex HTML, it is more efficient, accurate, and time-saving to use an HTML parsing library available for your programming language than to write sophisticated regular expressions that sometimes still fail to match the HTML content.
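As one possible illustration of the library approach, the sketch below uses `html.parser.HTMLParser` from the Python standard library. The article does not name a specific library, so this choice is an assumption, and the `ListItemCollector` class, the `extract_items` helper, and the sample input are hypothetical names created for the example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of each <li>, even when </li> is missing."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <LI> and <li> look the same.
        if tag == "li":
            self.items.append("")  # a new <li> implicitly ends the previous one
            self._in_item = True

    def handle_endtag(self, tag):
        # Closing the item or its enclosing list ends the current item.
        if tag in ("li", "ul", "ol"):
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data


def extract_items(html):
    """Return the stripped text of every list item in the markup."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


# Hypothetical messy input: no closing </li>, mixed case, extra attribute.
messy = '<UL><li>Apple<LI class="x">Pear</ul>\n<p>footer</p>'
print(extract_items(messy))  # ['Apple', 'Pear']
```

The parser takes care of letter case and attributes, so the extraction logic only has to decide what counts as an item boundary.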

 

 

 

 

 

 

 

Author: The Octoparse Team

 

 

 
