In this article, we will talk about how to better handle HTML with regular expressions.
You may encounter two situations when parsing HTML with regular expressions:
The HTML is a small, simple, and well-formed snippet (no missing tags). In this case, regular expressions can handle it easily.
The HTML is complex, and parts of it are poorly laid out or have missing tags. In this case, it is better to use a library built specifically for parsing HTML, which also makes it much easier to extract the data you need from the parsed content.
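As a minimal sketch of the first situation, the snippet below uses Python's standard `re` module on a small, well-formed fragment. The sample HTML and the variable names are made up for illustration; the pattern assumes every `<li>` has no attributes and a matching closing tag.

```python
import re

# A small, well-formed fragment (hypothetical sample data).
html = '<ul><li>Apple</li><li>Banana</li><li>Cherry</li></ul>'

# Non-greedy capture of the text between each <li> and its closing tag.
items = re.findall(r'<li>(.*?)</li>', html)

print(items)  # ['Apple', 'Banana', 'Cherry']
```

This works only because the input is known in advance to be simple and regular.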
If you try to write regular expressions for such complex HTML, you will find that most of the time it is impossible to write ones that actually work, because it is hard for one regular expression, or even several, to match all the variations of HTML tags.
Imagine that you have to parse a large amount of HTML, some of which has missing tags or does not follow the standard. In this case, even the complex and precise regular expressions you wrote become useless, because they cannot match the data you want.
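To illustrate how this can go wrong, here is a small sketch in Python. The messy fragment is invented for this example: it mixes upper-case tags, an attribute, an unclosed `<li>`, and a line break inside an item. A pattern that looks reasonable for clean input then silently misses most of the data.

```python
import re

# Hypothetical messy input: upper-case tags, an attribute,
# a missing </li>, and a newline inside an item.
messy = '<UL><li class="fruit">Apple<li>Banana</li>\n<LI>Cherry\n</li></UL>'

# The same kind of pattern that works on clean, regular HTML.
pattern = re.compile(r'<li>(.*?)</li>')
matches = pattern.findall(messy)

print(matches)  # ['Banana'] -- "Apple" and "Cherry" are missed
```

Flags such as `re.IGNORECASE` and `re.DOTALL` and a looser pattern could patch this particular input, but each new irregularity tends to require yet another special case.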
So it is more accurate, more efficient, and less time-consuming to handle complex HTML with an HTML parsing library written for your programming language than to write sophisticated regular expressions that sometimes still fail to match the HTML content.
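As a rough sketch of the library approach, the example below uses `html.parser.HTMLParser` from the Python standard library. The class name `ListItemCollector` and the sample input are assumptions made for illustration, and treating a new `<li>` as the end of an unclosed one is a simplification chosen for this example, not a full implementation of HTML's error-recovery rules.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means we are outside any <li>

    def _flush(self):
        # Store the text gathered for the current item, if any.
        if self._buffer is not None:
            text = ''.join(self._buffer).strip()
            if text:
                self.items.append(text)
            self._buffer = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'li':
            self._flush()  # a new <li> implicitly ends an unclosed one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ('li', 'ul', 'ol'):
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Hypothetical messy input: upper-case tags, an attribute, a missing </li>.
messy = '<UL><li class="fruit">Apple<li>Banana</li>\n<LI>Cherry\n</li></UL>'

parser = ListItemCollector()
parser.feed(messy)
parser.close()

print(parser.items)  # ['Apple', 'Banana', 'Cherry']
```

The parser takes care of tag case and attributes, so the extraction logic only has to decide what to do with each tag and text event.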
Author: The Octoparse Team