Using This RegEx Tool to Match HTML Tags

Monday, August 9, 2021

"You will know how powerful regular expressions are once you use them." - a developer, sighing heartily.

 

What is a regular expression (RegEx)?

“A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text-processing utility ed (a line editor for the Unix operating system), an editor, and grep (a command-line utility for searching plain-text data sets for lines matching a regular expression), a filter (a computer program or subroutine to process a stream, producing another stream).” This definition is an excerpt from Wikipedia.

 

What can you do with RegEx?

Regular expressions can be used to match HTML tags and extract the data in HTML documents.

 

Here are some RegEx use cases:

Using RegEx to extract emails

Using RegEx to extract phone numbers

Using RegEx to reformat extracted data
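
To make these three use cases concrete, here is a minimal Python sketch. The patterns are deliberately simplified assumptions (a loose email shape and US-style 10-digit phone numbers), and the helper names `extract_contacts` and `normalize_phone` are hypothetical, not part of any tool mentioned in this post.

```python
import re

# Simplified, illustrative patterns; real-world emails and phone numbers vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumes US-style numbers such as (555) 123-4567 or 555.123.4567.
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")


def extract_contacts(text):
    """Return every email-like and phone-like substring found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


def normalize_phone(raw):
    """Reformat an extracted 10-digit number as 555-123-4567."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


sample = "Contact sales@example.com or call (555) 123-4567."
contacts = extract_contacts(sample)
clean_phones = [normalize_phone(p) for p in contacts["phones"]]
```

Extraction and reformatting are kept as separate steps here so that the matching pattern can stay loose while the output format stays strict.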

 

HTML is essentially text, and regular expressions are powerful because a single pattern can match many different strings. Admittedly, a regular expression is not the first choice for parsing HTML correctly: patterns commonly go wrong on missing closing tags, mismatched tags, and so on. Programmers are also more likely to reach for a dedicated HTML parser such as PHPQuery, BeautifulSoup, or html5lib-Python. But if you want to match HTML tags quickly and you know a little regular expression syntax, which is easy to learn but hard to master, regular expressions are a convenient way to identify patterns in HTML documents. Anyone who programs or wants to extract web data is strongly encouraged to learn them, because they can improve working efficiency and productivity.
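
The trade-off described above can be seen in a small Python sketch. It compares a naive pattern with the standard library's `html.parser` on a nested `<div>`. The sample markup and the `DivText` class are assumptions made up for illustration.

```python
import re
from html.parser import HTMLParser

html = "<div>outer <div>inner</div> tail</div>"

# A naive lazy pattern stops at the first closing tag, so the captured
# content is cut off in the middle of the nested element.
naive = re.search(r"<div>(.*?)</div>", html).group(1)


class DivText(HTMLParser):
    """Hypothetical helper: collects text that appears inside any <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.chunks = []    # text fragments seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = DivText()
parser.feed(html)
parser.close()
parsed_text = "".join(parser.chunks)
```

Here `naive` ends up as `outer <div>inner`, while the parser tracks nesting depth and recovers `outer inner tail`. For quick, flat matches the regular expression is shorter. For nested markup a parser is the safer choice.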

 

Let's look at a few examples:

 

  • Regular expressions for matching HTML tags:

 

<(.*)>.*?</\1>|<(.*) />

<(\S*?)[^>]*>.*?</\1>|<.*?/>
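
As a rough illustration, the second pattern above can be tried with Python's `re` module. The sample markup is an assumption made up for this sketch, and `re.DOTALL` is added so that `.` also matches newlines.

```python
import re

# First alternative: an opening tag, lazy content, and a closing tag whose
# name must equal the captured group (\1). Second alternative: a self-closing tag.
TAG_RE = re.compile(r"<(\S*?)[^>]*>.*?</\1>|<.*?/>", re.DOTALL)

html = '<p class="intro">Hello</p><br/><span>World</span>'

matches = [m.group(0) for m in TAG_RE.finditer(html)]
# Group 1 holds the tag name for paired tags and is None for self-closing ones.
names = [m.group(1) for m in TAG_RE.finditer(html)]

# Limitation: with nested tags of the same name, the lazy match stops at the
# first closing tag and leaves the outer element incomplete.
nested = TAG_RE.search("<div><div>a</div></div>").group(0)
```

On this input, `matches` contains the `<p>` element, the `<br/>` tag, and the `<span>` element, while `nested` is only `<div><div>a</div>`.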

 

  • Regular expression to match all TD tags:

 

<td\s*.*>\s*.*<\/td>
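
One thing to watch with this pattern is that `.*` is greedy, so two cells on the same line are matched as a single block. The Python sketch below shows that behavior next to a lazier variant. The variant (`CELL_RE`) is an assumption of this sketch, not a pattern from the original post, and the sample row is made up.

```python
import re

row = '<tr><td>Alice</td><td class="num">30</td></tr>'

# The pattern as given above: greedy, so it spans both cells at once.
ORIGINAL_TD = re.compile(r"<td\s*.*>\s*.*<\/td>")
greedy = ORIGINAL_TD.findall(row)

# Assumed alternative: [^>]* keeps the match inside the opening tag, and the
# lazy (.*?) stops at the nearest </td>, yielding one capture per cell.
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.DOTALL)
cells = CELL_RE.findall(row)
```

With this row, `greedy` holds one long match covering both cells, while `cells` holds the two values `Alice` and `30`.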

 

  • Regular expression to match <img src="test.gif"/>:

 

<[a-zA-Z]+(\s+[a-zA-Z]+\s*=\s*("([^"]*)"|'([^']*)'))*\s*/>
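
A short Python sketch of how this pattern might be used: first validate that a string is a self-closing tag, then pull out its attributes. The `ATTR` pattern and the `parse_self_closing` helper are hypothetical additions for illustration.

```python
import re

# The pattern from above, used to validate the whole tag.
SELF_CLOSING = re.compile(
    r"""<[a-zA-Z]+(\s+[a-zA-Z]+\s*=\s*("([^"]*)"|'([^']*)'))*\s*/>"""
)
# Assumed helper pattern: one name="value" or name='value' pair.
ATTR = re.compile(r"""([a-zA-Z]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_self_closing(tag):
    """Return the attributes of a self-closing tag as a dict, or None if it does not match."""
    if not SELF_CLOSING.fullmatch(tag):
        return None
    # findall yields (name, double_quoted, single_quoted); the unused slot is "".
    return {name: dq or sq for name, dq, sq in ATTR.findall(tag)}


img = parse_self_closing('<img src="test.gif"/>')
two_attrs = parse_self_closing("<img src=\"a.png\" alt='logo' />")
not_closed = parse_self_closing('<img src="test.gif">')
hyphenated = parse_self_closing('<img data-id="1"/>')
```

Note that attribute names are limited to letters, so a tag with a name like `data-id` is rejected, as is a tag without the trailing `/>`.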

 

We can match a variety of HTML tags with regular expressions like these and then easily extract data from HTML documents.

Regular expression tool

(Download Octoparse 8 - Open the software - Click the toolbox icon in the lower-left corner)

 

Free RegEx Tool - Octoparse

Octoparse, a visual web data collection tool, provides a tool for generating regular expressions. It can easily generate simple regular expressions for extracting content from HTML documents. Octoparse also fully supports verifying custom regular expressions.

 


Author: The Octoparse Team 
