
Use Regular Expressions in Octoparse

Thursday, October 13, 2016 9:09 AM


 

A regular expression (RegEx) is a special text string that defines a search pattern, which string-searching algorithms use for "find" or "find and replace" operations on strings. You can pick up the basics of regular expressions here.

In Octoparse, you can use RegEx to match or replace characters in a field value, refining the extracted data directly.
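
As a rough illustration of what matching and replacing do to a field value, here is a minimal Python sketch. The sample value and both patterns are made up for illustration, and Python's `re` module stands in for Octoparse's own engine, which may differ in details.

```python
import re

# Hypothetical extracted field value (not taken from Octoparse).
field_value = "Price: $1,299.00 (incl. tax)"

# "Match": keep only the part of the value that fits the pattern.
match = re.search(r"\$[\d,]+\.\d{2}", field_value)
matched = match.group(0) if match else ""

# "Replace": substitute every occurrence of the pattern (here, with nothing).
replaced = re.sub(r"[$,]", "", matched)

print(matched)   # $1,299.00
print(replaced)  # 1299.00
```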

The Octoparse RegEx tool is a built-in tool that generates regular expressions automatically from criteria you set. It is especially helpful if you know little about regular expression syntax.

 

Where to find the RegEx tool?

In Octoparse, there are two ways to access the RegEx tool:

Method 1: Within the Octoparse Clean Data options

  • Select the data field you want to customize
  • Click "..." and choose "Clean Data"
  • Click "Add step"
  • Choose "Replace with Regular Expression" or "Match with Regular Expression"
  • Click "Not sure about RegEx? Try the RegEx tool!"


 

 

Method 2: From the Sidebar Navigation

  • Select the "Tool Box" icon from the bottom of the sidebar navigation
  • Click "RegEx Tool"


 

The interface of the RegEx tool

The main interface of the RegEx tool consists of 4 parts:


1. Original Text

If you open the RegEx tool within the Clean Data options, the extracted text string will be displayed here.

If you open it from the Sidebar Navigation, type or paste the text string into the Original Text box yourself.

 

2. Generate/Reference/Sample

This part has 3 tabs.

  • In the "Generate" tab, there are checkboxes for various options. Check the boxes you need and fill in the parameters, and Octoparse automatically generates the regular expression.
    • This section lets you set conditions that pick out the part of the data you want.
    • Details are in the section "How to use Octoparse Regular Expression Tool?" below.
  • The "Reference" and "Sample" tabs are currently empty since we haven't prepared the reference tutorials.

 

3. Regular Expression

The regular expression will be generated automatically in the "Regular Expression" box after you check the option boxes and fill in the parameters in the "Generate" tab.

Check "Match All" if you'd like to see all matches. Then click the "Match" button to check whether the expression finds what you want.

 

4. Matches

Once an expression has been generated and you click "Match", the first match is displayed in the Matches box.

If you've checked "Match All", all matches are displayed in the box, in order.
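
The difference between the first match and "Match All" can be sketched in Python as follows. The text and pattern are hypothetical, and `re.search` / `re.findall` are only an analogy for the tool's behavior, not its implementation.

```python
import re

# Hypothetical original text and pattern, for illustration only.
original_text = "SKU-1001, SKU-1002, SKU-1003"
pattern = r"SKU-\d+"

# Without "Match All": only the first match is reported.
first = re.search(pattern, original_text)
first_match = first.group(0) if first else None

# With "Match All": every match, in the order it appears in the text.
all_matches = re.findall(pattern, original_text)

print(first_match)  # SKU-1001
print(all_matches)  # ['SKU-1001', 'SKU-1002', 'SKU-1003']
```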

 

 

How to use Octoparse Regular Expression Tool?

Click 3 buttons in order (Generate, Match, Apply) to get the result you need.

Step 1: Check the options and fill in the needed parameters and click the "Generate" button

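There are 5 options provided ("Start with", "End with", "Include Start", "Include End", and "Contain One"), described here in three groups: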

  • "Start/End with"

Picks out the content that starts or ends with the character(s) you enter in the box. The entered characters themselves are excluded from the result.

  • "Include Start/End"

This option can only be used when "Start/End with" is checked. Once you check "Include Start/End", the match result includes the text string you've entered.

  • "Contain One"

Picks out the content that contains the character(s) you've entered.
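
To make the options above concrete, here is a minimal Python sketch of how such criteria could be turned into a pattern. `build_pattern` and `contain_pattern` are hypothetical helpers, not Octoparse's generator; the use of lookarounds for excluded boundaries and the "whole line" reading of "Contain One" are assumptions.

```python
import re

def build_pattern(start="", end="", include_start=False, include_end=False):
    """Hypothetical helper mirroring "Start/End with" and "Include Start/End"."""
    s, e = re.escape(start), re.escape(end)
    # Excluded boundaries become lookarounds: they must be present,
    # but they are not part of the returned text.
    prefix = (s if include_start else f"(?<={s})") if start else ""
    suffix = (e if include_end else f"(?={e})") if end else ""
    # Lazy body when an end boundary exists, so the match stops at the
    # first occurrence of the end text; otherwise run to the end of the line.
    body = ".*?" if end else ".*"
    return prefix + body + suffix

def contain_pattern(text):
    """Assumed reading of "Contain One": the whole line containing the text."""
    return ".*" + re.escape(text) + ".*"

# Hypothetical sample text.
sample = "<b>Color:</b> Navy Blue; <b>Size:</b> M"

excl = build_pattern(start="Color:</b> ", end=";")
incl = build_pattern(start="Color:</b> ", end=";",
                     include_start=True, include_end=True)

print(re.search(excl, sample).group(0))  # Navy Blue
print(re.search(incl, sample).group(0))  # Color:</b> Navy Blue;
print(re.search(contain_pattern("Size"), sample).group(0))
```

The entered text is passed through `re.escape` so that characters such as `.` or `(` are treated literally rather than as regex syntax.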

 

Step 2: Click the "Match" button 

Remember to check "Match All" if you'd like to have all matches.

 

Step 3: Click the "Apply" button to apply the result

 

Let's look at an example. We need to get "5 star rating" from the source text.

So we tick "Start with" and enter 'aria-label="' (the characters in front of 5 star rating), then tick "End with" and enter '"' (the character after 5 star rating). Click "Generate" and the regular expression appears in the "Regular Expression" box.
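
The generated expression itself is not reproduced in this text. As a sketch only, one pattern that expresses the same criteria is shown below in Python; the source text is a made-up snippet, and the lookaround form is an assumption that may differ from what the tool actually emits.

```python
import re

# Hypothetical source text; the real one is not reproduced here.
source_text = '<i class="a-icon-star" aria-label="5 star rating"></i>'

# Assumed pattern: "Start with" becomes a lookbehind and "End with" a
# lookahead, so neither boundary appears in the match.
pattern = r'(?<=aria-label=").*?(?=")'

match = re.search(pattern, source_text)
result = match.group(0) if match else None

print(result)  # 5 star rating
```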

For practical use cases, see "When and how to use Regular Expression Tool – a guide for beginners".

 

Happy Data Hunting!

Author: The Octoparse Team
