The Primary Uses of Regular Expressions in Data Processing

5/29/2016 7:41:08 AM

Previously, I introduced how regular expressions are used to extract HTML content, along with their pros and cons, in these posts:

 

Using Regular Expression to Match HTML

 

Advanced Text - Recommendations to Handle HTML With Regular Expression

 

Regular expressions can be used for:

  1. Data validation. (Check whether a string follows the rules for an email address, a phone number, etc.)

  2. Replacing text. (Replace the characters we don’t want with other characters.)

  3. Extracting substrings from a string based on a pattern match. (Search for specific text in a document.)

 

In data processing, we mainly use Regular Expressions to replace text and extract substrings.
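Data validation, the first use in the list, can be sketched briefly as well. The following is a minimal illustration, not a production validator: the class name, method names, and both patterns are assumptions made for this example, and the patterns are deliberately simple (real email addresses and phone numbers come in many more forms).

```csharp
using System;
using System.Text.RegularExpressions;

static class ValidationSketch
{
    // Hypothetical, deliberately simple patterns for illustration only.
    // Email: some text, one '@', some text, a dot, some text; no spaces.
    static readonly Regex EmailPattern = new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$");
    // Phone: exactly the form 555-123-4567.
    static readonly Regex PhonePattern = new Regex(@"^\d{3}-\d{3}-\d{4}$");

    public static bool LooksLikeEmail(string input) => EmailPattern.IsMatch(input);
    public static bool LooksLikePhone(string input) => PhonePattern.IsMatch(input);

    static void Main()
    {
        // Made-up sample inputs.
        Console.WriteLine(LooksLikeEmail("user@example.com")); // True
        Console.WriteLine(LooksLikeEmail("example.com"));      // False
        Console.WriteLine(LooksLikePhone("555-123-4567"));     // True
        Console.WriteLine(LooksLikePhone("5551234567"));       // False
    }
}
```

The ^ and $ anchors matter here: without them, IsMatch would return true for any string that merely contains something shaped like an email address or phone number.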

 

Example: Remove every character that is not a word character (a letter, digit, or underscore) or one of the allowed characters (@ - .) from the string, then return the cleaned string.

 

Sample code (C#)

 

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Console.WriteLine(CleanInput("@octoparse*.123#facebook$%-JS"));
    }

    // Keeps word characters (\w), '.', '@', and '-'; removes everything else.
    static string CleanInput(string strIn)
    {
        // The trailing '-' in the character class is a literal hyphen, not a range.
        return Regex.Replace(strIn, @"[^\w\.@-]", "");
    }
}

 

The output

@octoparse.123facebook-JS

 

 

 

 

 

We can see that the characters * # $ %, which we treat as invalid, are each replaced by an empty string, which removes them from the result.

 

Example: Get the domain name from a URL.

 

Sample code (C#)

 

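using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string url = "http://www.octoparse.com/pricing/";
        // One or more dot-terminated labels ("www.", "octoparse.") followed by a final label ("com").
        Regex r = new Regex(@"((\w)+\.)+\w+");
        // Value holds the text of the first match in the URL.
        Console.WriteLine(r.Match(url).Value);
    }
}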

 

The output

www.octoparse.com

 

 

 

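The code above shows that a regular expression can extract the part of a URL that looks like a domain name. Strictly speaking, the match is the full host name (www.octoparse.com), and because \w does not match hyphens, host names that contain a hyphen would need a broader pattern.

The same regular expression works unchanged on hundreds of thousands of URLs like this one.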
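To make the point about many URLs concrete, here is a minimal sketch that applies the same pattern to a list. The class name, the method name, and the sample URLs are assumptions made for this illustration; in practice the URLs would more likely come from a file or a database.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HostExtractionSketch
{
    // Same pattern as in the sample above, built once and reused for every URL.
    static readonly Regex HostPattern = new Regex(@"((\w)+\.)+\w+", RegexOptions.Compiled);

    // Returns the first host-like match in each URL; URLs with no match are skipped.
    public static List<string> ExtractHosts(IEnumerable<string> urls)
    {
        var hosts = new List<string>();
        foreach (string url in urls)
        {
            Match m = HostPattern.Match(url);
            if (m.Success)
            {
                hosts.Add(m.Value);
            }
        }
        return hosts;
    }

    static void Main()
    {
        // Made-up sample input; the last entry contains nothing that matches.
        var urls = new[]
        {
            "http://www.octoparse.com/pricing/",
            "https://example.com/index.html",
            "not a url"
        };

        foreach (string host in ExtractHosts(urls))
        {
            Console.WriteLine(host);
        }
        // Prints:
        // www.octoparse.com
        // example.com
    }
}
```

Creating the Regex object once and reusing it avoids rebuilding the pattern for every URL, which is worth doing when the input runs to hundreds of thousands of entries.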

 

 

 

Author: The Octoparse Team

 

 

 
