Using Regular Expressions to Match HTML

Thursday, February 20, 2020

You will know how powerful regular expressions are once you use them. - A developer, sighing heartily.

 

“A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed (a line editor for the Unix operating system), an editor, and grep (a command-line utility for searching plain-text data sets for lines matching a regular expression), a filter (a computer program or subroutine to process a stream, producing another stream).” This excerpt from Wikipedia defines the regular expression.

So we can use regular expressions to match HTML tags and extract data from HTML documents.

HTML is essentially text, and what makes regular expressions so powerful is that a single expression can match many different strings. Admittedly, regular expressions are not the first choice for parsing HTML correctly, because regex-based parsing is prone to common mistakes such as missing closing tags and pairing the wrong tags. Besides, programmers are more likely to use dedicated HTML parsers such as PHPQuery, BeautifulSoup, and html5lib-Python. But if you want to match HTML tags quickly and you know a little regular expression syntax, which is easy to learn but hard to master, you can use this convenient tool to identify patterns in HTML documents. Programmers and anyone else who wants to extract web data are strongly encouraged to learn regular expressions, because they can improve working efficiency and productivity.
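To make the tag-mismatch risk concrete, here is a minimal Python sketch. The sample markup and the simple paired-tag pattern are illustrative assumptions, not taken from any particular page or library. It shows how a regular expression can pair an outer opening tag with an inner closing tag when elements are nested.

```python
import re

# Hypothetical sample: a <div> nested inside another <div>.
html = "<div><div>inner</div>outer</div>"

# A naive paired-tag pattern (an assumption for illustration):
# opening tag, lazy content, then a closing tag with the same name.
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

match = pattern.search(html)
matched_text = match.group(0)
captured_content = match.group(2)

print(matched_text)      # <div><div>inner</div>
print(captured_content)  # <div>inner
```

The outer `<div>` is closed by the first `</div>` the pattern finds, which belongs to the inner element. A real HTML parser tracks nesting and does not make this mistake, which is why regular expressions suit quick pattern matching better than full parsing.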

 

Let's look at a few examples:

 

Regular expressions for matching HTML tags:

 

<(.*)>.*?</\1>|<(.*) />

<(\S*?)[^>]*>.*?</\1>|<.*?/>
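As a rough illustration, the second pattern above can be tried with Python's built-in `re` module. The sample string and variable names below are made up for this sketch; only the pattern itself comes from the post.

```python
import re

# The back-reference \1 requires the closing tag to repeat the captured name;
# the second alternative handles self-closing tags such as <img ... />.
TAG_PATTERN = re.compile(r"<(\S*?)[^>]*>.*?</\1>|<.*?/>")

# Hypothetical sample markup.
sample = '<p class="intro">Hello</p><img src="a.png"/><b>bold</b>'

matches = [m.group(0) for m in TAG_PATTERN.finditer(sample)]
tag_names = [m.group(1) for m in TAG_PATTERN.finditer(sample)]

print(matches)
# ['<p class="intro">Hello</p>', '<img src="a.png"/>', '<b>bold</b>']
print(tag_names)
# ['p', None, 'b']
```

One caveat worth knowing: because `\S*?` is lazy, it may capture only a prefix of a tag name. For example, in `<br/><b>bold</b>` the pattern captures `b` from `<br/>` and pairs it with the later `</b>`, so the whole string is returned as a single match. Test patterns like this against your real data before relying on them.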

 

Regular expression to match all TD tags:

 

<td\s*.*>\s*.*<\/td>
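The sketch below, again in Python and with a made-up table row, shows how the pattern above behaves when several cells sit on one line. Because `.*` is greedy, the pattern returns one match that spans every cell. The lazy variant is an assumption added here for comparison; it is not part of the original pattern list.

```python
import re

# Hypothetical sample: two cells on the same line.
row = '<tr><td class="a">1</td><td>2</td></tr>'

# Pattern from the post: the greedy .* runs to the last </td> on the line.
GREEDY_TD = re.compile(r"<td\s*.*>\s*.*<\/td>")
greedy_matches = GREEDY_TD.findall(row)
print(greedy_matches)
# ['<td class="a">1</td><td>2</td>']

# Assumed alternative: [^>]* stays inside the opening tag and
# the lazy (.*?) stops at the first closing </td>.
LAZY_TD = re.compile(r"<td\b[^>]*>(.*?)</td>", re.DOTALL)
cells = LAZY_TD.findall(row)
print(cells)
# ['1', '2']
```

If each cell is on its own line, the greedy pattern matches one cell at a time, because `.` does not match a newline by default. The lazy variant is the safer choice when you need the content of each cell separately.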

 

Regular expression to match <img src="test.gif"/>:

 

<[a-zA-Z]+(\s+[a-zA-Z]+\s*=\s*("([^"]*)"|'([^']*)'))*\s*/>
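A short Python check of this pattern might look like the following. The variable names are illustrative. Group 3 holds a double-quoted attribute value and group 4 a single-quoted one.

```python
import re

# Pattern from the post: a tag name, zero or more name="value" or
# name='value' attributes, then a self-closing "/>".
SELF_CLOSING_TAG = re.compile(
    r"""<[a-zA-Z]+(\s+[a-zA-Z]+\s*=\s*("([^"]*)"|'([^']*)'))*\s*/>"""
)

tag = '<img src="test.gif"/>'
match = SELF_CLOSING_TAG.fullmatch(tag)

# Take the attribute value from whichever quoting style matched.
value = match.group(3) or match.group(4)
print(value)  # test.gif

# A tag without the trailing "/>" is not matched by this pattern.
not_self_closing = SELF_CLOSING_TAG.fullmatch('<img src="test.gif">')
print(not_self_closing)  # None
```

Note that the pattern accepts only letters in tag and attribute names, so attributes such as `data-id` would need a wider character class.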

 

With regular expressions like these, we can match a variety of HTML tags and easily extract data from HTML documents.

Octoparse

Octoparse, a visual web data collection tool, provides a tool for generating regular expressions. It can generate simple regular expressions to meet different needs when extracting content from HTML documents. Octoparse also supports verifying custom regular expressions.

 

 

Author: The Octoparse Team 
