The Primary Use of Regular Expression in Data Processing

Previously, I introduced how Regular Expressions are used to extract HTML content, along with their pros and cons.

Using Regular Expression to Match HTML

Advanced Text – Recommendations to Handle HTML With Regular Expression

Regular expressions can be used for:

Data validation. (Check whether a string follows the format of an email address, a phone number, etc.)
Replacing text. (Replace the characters we do not want with other characters.)
Extracting substrings from a string based on a pattern match. (Search for specific text in a document.)
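
The first use, data validation, can be sketched as follows. This is a minimal illustration, not a production validator: the InputValidator class, the IsPhoneNumber method, and the NNN-NNN-NNNN phone format are all assumptions made for the example, and real phone number formats vary widely.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputValidator
{
    // Assumed format for illustration only: three digits, three digits,
    // four digits, separated by hyphens (e.g. "555-123-4567").
    private static readonly Regex PhonePattern = new Regex(@"^\d{3}-\d{3}-\d{4}$");

    // Returns true when the whole input matches the assumed phone format.
    public static bool IsPhoneNumber(string input)
    {
        return PhonePattern.IsMatch(input);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(InputValidator.IsPhoneNumber("555-123-4567")); // True
        Console.WriteLine(InputValidator.IsPhoneNumber("555-1234"));     // False
    }
}
```

The ^ and $ anchors matter here: without them, IsMatch would also accept a longer string that merely contains a phone-like substring.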

In data processing, we mainly use Regular Expressions to replace text and extract substrings.

Examples:

Remove every character from a string except word characters (letters, digits, and underscores) and the allowed symbols (@ - .), then return the cleaned string.

Sample code (C#)

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Console.WriteLine(CleanInput("@octoparse*.123#facebook$%-JS"));
    }

    static string CleanInput(string strIn)
    {
        // Replace every character that is not a word character, '.', '@', or '-'
        // with an empty string.
        return Regex.Replace(strIn, @"[^\w\.@-]", "");
    }
}

The output

@octoparse.123facebook-JS

The characters we treat as invalid (* # $ %) are each replaced with an empty string, which removes them from the result.
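
The same pattern can also substitute a visible placeholder instead of deleting, which covers the "replace with other characters" use listed earlier. The sketch below is a variation on the example above, not part of the original sample: the InputMasker class, the MaskInvalid method, and the "_" placeholder are assumptions, and a + quantifier is added so that a run of invalid characters becomes a single placeholder.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputMasker
{
    // Replace each run of disallowed characters with one placeholder.
    // Allowed characters: word characters, '.', '@', and '-'.
    public static string MaskInvalid(string input, string placeholder = "_")
    {
        return Regex.Replace(input, @"[^\w\.@-]+", placeholder);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(InputMasker.MaskInvalid("@octoparse*.123#facebook$%-JS"));
        // Prints: @octoparse_.123_facebook_-JS
    }
}
```

Note that "$%" collapses into a single underscore because the + quantifier matches the two characters as one run.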

Examples:

Get the domain name from a URL.

Sample code (C#)

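using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string url = "http://www.octoparse.com/pricing/";

        // One or more dot-terminated labels followed by a final label,
        // e.g. "www." + "octoparse." + "com".
        Regex r = new Regex(@"((\w)+\.)+\w+");

        // Print the first substring of the URL that matches the pattern.
        Console.WriteLine(r.Match(url).Value);
    }
}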

The output

www.octoparse.com

As the code above shows, a Regular Expression can extract the part of a URL that looks like a domain name.

Even for hundreds of thousands of URLs like this one, a single regular expression is enough to extract their domain names.
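
Applying one expression to many URLs might look like the sketch below. It reuses the pattern from the example above; the DomainExtractor class, the ExtractDomains method, and the sample URLs are assumptions for illustration. Because \w does not match hyphens, domain names containing a hyphen would be cut short by this pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DomainExtractor
{
    // Built once and reused for every URL in the batch.
    private static readonly Regex DomainPattern =
        new Regex(@"((\w)+\.)+\w+", RegexOptions.Compiled);

    // Returns the first domain-like match from each URL; inputs without a match are skipped.
    public static List<string> ExtractDomains(IEnumerable<string> urls)
    {
        var domains = new List<string>();
        foreach (string url in urls)
        {
            Match m = DomainPattern.Match(url);
            if (m.Success)
            {
                domains.Add(m.Value);
            }
        }
        return domains;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical sample input.
        var urls = new[]
        {
            "http://www.octoparse.com/pricing/",
            "https://example.org/index.html",
            "not a url"
        };

        foreach (string domain in DomainExtractor.ExtractDomains(urls))
        {
            Console.WriteLine(domain);
        }
        // Prints:
        // www.octoparse.com
        // example.org
    }
}
```

Creating the Regex object once, rather than inside the loop, avoids re-parsing the pattern for every URL.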
