Use Regular Expressions in Octoparse
Thursday, October 13, 2016 9:09 AM
A Regular Expression (RegEx) is a special text string that defines a search pattern. String-searching algorithms use it for "find" or "find and replace" operations on strings. You can pick up the basics of Regular Expressions here.
In Octoparse, you can use RegEx to match or replace characters in a field value, so the extracted data is refined directly.
The Octoparse RegEx tool is a built-in tool that generates Regular Expressions automatically from criteria you set. It is especially helpful if you know little about regular expression syntax.
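To make "match" and "replace" concrete, here is a minimal sketch in Python, outside of Octoparse. The field value and both patterns are made up for illustration; they are not produced by the RegEx tool.

```python
import re

# Hypothetical extracted field value (not taken from Octoparse).
field_value = "Price: $1,299.00 (in stock)"

# "Match": keep only the part of the value that fits the pattern.
match = re.search(r"\$[\d,]+\.\d{2}", field_value)
matched = match.group(0) if match else ""

# "Replace": substitute every occurrence of the pattern with other text
# (here, remove the dollar sign and the thousands separator).
replaced = re.sub(r"[$,]", "", matched)

print(matched)   # $1,299.00
print(replaced)  # 1299.00
```

Matching narrows a field down to the part you want; replacing then cleans up what is left.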
Where to find the RegEx tool?
In Octoparse, there are two ways to access the RegEx tool:
Method 1: Within Octoparse Clean Data options
- Select the data field you want to customize
- Click "..." and choose "Clean Data"
- Click "Add step"
- Choose "Replace with Regular Expression"/"Match with regular expression"
- Click "Not sure about RegEx? Try the RegEx tool!"
Method 2: From the Sidebar Navigation
- Select the "Tool Box" icon from the bottom of the sidebar navigation
- Click "RegEx Tool"
The interface of the RegEx tool
The main interface of the RegEx tool consists of 4 parts:
1. Original Text
If you open the RegEx tool within the Clean Data options, the extracted text string will be displayed here.
If you open it from the Sidebar Navigation, enter the text string in the Original Text box yourself by typing or pasting it.
There are 3 tabs in this part.
- In the "Generate" tab, there are checkboxes for various options. Check these boxes and fill in the parameters, and Octoparse generates the Regular Expression you need automatically.
- This section lets you set conditions that pick out the part of the data you want.
- You can check details in the following section (How to use Octoparse Regular Expression Tool?).
- The "Reference" and "Sample" tabs are currently empty since we haven't prepared the reference tutorials.
3. Regular Expression
The regular expression will be generated automatically in the "Regular Expression" box after you check the option boxes and fill in the parameters in the "Generate" tab.
Check "Match All" if you'd like to have all matches. Then click the "Match" button to check whether the expression finds what you want.
4. Matches
Once you have generated an expression, the first match is displayed in the Matches box.
If you've checked "Match All", all matches are displayed in order in the box.
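The difference between the first match and "Match All" can be sketched in Python as follows. The sample text and pattern are hypothetical. `re.search` and `re.findall` only stand in for the tool's behavior; how Octoparse implements it internally is not described here.

```python
import re

# Hypothetical original text with several candidate matches.
text = "Rated 5 stars by Ann, 4 stars by Bob, 3 stars by Cy"
pattern = r"\d stars"

# Without "Match All": only the first match is returned.
first = re.search(pattern, text)
first_match = first.group(0) if first else None

# With "Match All": every match is returned, in order of appearance.
all_matches = re.findall(pattern, text)

print(first_match)  # 5 stars
print(all_matches)  # ['5 stars', '4 stars', '3 stars']
```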
How to use Octoparse Regular Expression Tool?
Simply click the 3 buttons in order (Generate, Match, Apply) to get the result you need.
Step 1: Check the options, fill in the needed parameters, and click the "Generate" button
There are 5 options provided:
- "Start/End with"
Extracts the content that starts or ends with the character(s) you enter in the box. The entered characters themselves are excluded from the result.
- "Include Start/End"
This option can only be used when "Start/End with" is checked. Once you check "Include Start/End", the match result includes the text string you've entered.
- "Contain One"
Extracts the content that contains the character(s) you've entered.
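As a rough illustration of how "Start/End with" and "Include Start/End" could translate into a pattern, the Python sketch below uses a hypothetical helper, `build_pattern`. It is an assumption about one reasonable way to express these options: lookarounds to exclude the markers, a plain pattern to include them. It is not the expression Octoparse actually generates.

```python
import re

def build_pattern(start: str, end: str, include: bool = False) -> str:
    """Hypothetical helper: pattern for the text between two markers."""
    s, e = re.escape(start), re.escape(end)
    if include:
        # "Include Start/End": the markers are part of the match.
        return f"{s}.*?{e}"
    # "Start/End with" only: the markers must be present but are
    # excluded from the match (lookbehind and lookahead).
    return f"(?<={s}).*?(?={e})"

# Hypothetical sample text.
text = "SKU: [AB-123] shipped"

excluded = re.search(build_pattern("[", "]"), text).group(0)
included = re.search(build_pattern("[", "]", include=True), text).group(0)

print(excluded)  # AB-123
print(included)  # [AB-123]
```

The `.*?` is lazy, so the match stops at the first occurrence of the end marker rather than the last.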
Step 2: Click the "Match" button
Remember to check "Match All" if you'd like to have all matches.
Step 3: Click the "Apply" button to apply the result
Let's look at an example. We need to get "5 star rating" from the source text.
So we tick "Start with" and enter 'aria-label="' (the characters in front of "5 star rating"), then tick "End with" and enter '"' (the character after "5 star rating"). Click "Generate" and the regular expression appears in the "Regular Expression" box.
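The same extraction can be tried in Python as a sanity check. The pattern below is an assumption about an equivalent expression, not necessarily the exact one the RegEx tool generates. The surrounding HTML in `source` is also invented; only the `aria-label="5 star rating"` part comes from the example.

```python
import re

# Hypothetical source text; only the aria-label attribute is from the example.
source = '<span class="rating" aria-label="5 star rating"></span>'

# Start with 'aria-label="' and end with '"', both excluded from the match.
pattern = r'(?<=aria-label=").*?(?=")'

match = re.search(pattern, source)
rating = match.group(0) if match else None

print(rating)  # 5 star rating
```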
For some practical use cases, see "When and how to use Regular Expression Tool – a guide for beginners".
Happy Data Hunting!
Author: The Octoparse Team