
Extract Text from HTML - Using RegExp Tool

Thursday, September 29, 2016 6:22 AM


 

During a web scraping project, you may want to clean data fields as they are extracted. Octoparse offers 9 data cleaning options for turning the extracted data into the format you need.

 

When should I refine the extracted data?

If you need a certain field in a specific format, you can use the "Clean Data" function to refine that field within Octoparse. Octoparse scrapes and refines the field in the same run, so there is no need to re-format it after exporting the data to an Excel file.

 

How to refine the extracted data in Octoparse?

To access these features in Octoparse, follow the 4 steps below:

1. Select the data field to refine

2. Click on the "..." icon and select "Clean data".

3. Click "Add step"

4. Select an operation to re-format your data

Tip!

In programming, a "string" is a sequence of characters such as letters, numerals, symbols, and punctuation marks. For example, " " (space) is a string; "Octoparse" is a string; and "Hello 2 *% World!" is also a string. A string can also contain no characters at all, in which case it is called an empty string. Replacing a word with an empty string is the same as deleting the word.

You will see the word "string" in many of the function descriptions for Octoparse's data reformat options. It means that the option works on any kind of characters in the extracted data: letters, words, sentences, numbers, spaces, symbols, and punctuation marks.

  

9 Data reformat options

1. Replace

2. Replace with regular expression

3. Match with regular expression

4. Trim spaces

5. Add a prefix

6. Add suffix

7. Reformat extracted date/time

8. Timestamp conversion

9. HTML transcoding

 

1. Replace

Function: Replace specific strings in the extracted data with the new strings that you want.
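As a rough illustration of what a literal replace does, here is a minimal Python sketch. It is not Octoparse code; the helper name `replace_text` and the sample values are made up for this example.

```python
def replace_text(value: str, old: str, new: str) -> str:
    """Replace every literal occurrence of `old` in `value` with `new`."""
    return value.replace(old, new)


# Hypothetical extracted field: swap a currency code for a symbol.
price = replace_text("USD 1,299.00", "USD ", "$")

# Replacing with an empty string deletes the matched text.
no_commas = replace_text(price, ",", "")

print(price)
print(no_commas)
```

The second call shows the "replace with an empty string" idea from the tip above: the commas are simply removed.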


 

2. Replace with regular expression

Function: Use a regular expression to find matching strings in the extracted data and replace them with the strings that you want.

You can learn more about regular expressions at W3Schools.
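The idea of a regex-based replace can be sketched in Python with the standard `re` module. This is only an illustration: the helper name `replace_with_regex` and the sample inputs are assumptions, and the regex engine inside Octoparse may differ from Python's in some syntax details.

```python
import re


def replace_with_regex(value: str, pattern: str, replacement: str) -> str:
    """Replace every substring that matches `pattern` with `replacement`."""
    return re.sub(pattern, replacement, value)


# Collapse any run of whitespace (spaces, tabs, newlines) into one space.
collapsed = replace_with_regex("4.5   out of\n5  stars", r"\s+", " ")

# Delete every character that is not a digit.
digits_only = replace_with_regex("(555) 010-2030", r"\D", "")

print(collapsed)
print(digits_only)
```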


 

3. Match with regular expression

Function: Use a regular expression to pick out the matching strings from the extracted data.

You can learn more about regular expressions at W3Schools.
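Matching keeps only the parts of the field that fit the pattern and discards the rest. A minimal Python sketch of that behavior follows; the helper `match_with_regex` and the sample text are hypothetical, and Octoparse's own engine may treat some patterns differently.

```python
import re
from typing import List


def match_with_regex(value: str, pattern: str) -> List[str]:
    """Return every substring of `value` that matches `pattern`, in order."""
    return [m.group(0) for m in re.finditer(pattern, value)]


# Pick the prices out of a longer, hypothetical text field.
prices = match_with_regex("Was $24.99, now $19.50", r"\$\d+\.\d{2}")

print(prices)
```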


4. Trim spaces

Function: Remove unwanted spaces from the start and/or the end of the extracted data.

If you want to delete spaces inside the data, use Replace or Replace with regular expression instead.
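The difference between trimming the ends and removing inner spaces can be seen in a short Python sketch. The sample value is invented, and note one assumption: Python's `strip` family removes all leading or trailing whitespace (including tabs and newlines), which may be broader than what Octoparse trims.

```python
import re

raw = "  New York  City \n"  # hypothetical extracted value

trim_both = raw.strip()    # trim start and end
trim_start = raw.lstrip()  # trim start only
trim_end = raw.rstrip()    # trim end only

# Trimming never touches inner whitespace; a regex replace does that.
no_inner = re.sub(r"\s+", "", raw)

print(repr(trim_both))
print(repr(no_inner))
```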

 

5. Add a prefix

Function: Add a string to the start of the extracted data.

 

6. Add suffix

Function: Add a string to the end of the extracted data.
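Prefix and suffix operations are plain string concatenation. A typical use, sketched here in Python with made-up helper names and sample values, is turning a relative link into a full URL or attaching a unit to a number.

```python
def add_prefix(value: str, prefix: str) -> str:
    """Put `prefix` in front of the extracted value."""
    return prefix + value


def add_suffix(value: str, suffix: str) -> str:
    """Append `suffix` to the end of the extracted value."""
    return value + suffix


# Hypothetical examples: a relative link and a bare number.
url = add_prefix("/product/123", "https://www.example.com")
weight = add_suffix("2.5", " kg")

print(url)
print(weight)
```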

 

7. Reformat extracted date/time

Function: Convert the extracted date/time into one of the built-in formats, or into your own customized format.
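Reformatting a date generally means parsing it with the format it arrived in and writing it back out in the format you want. The Python sketch below shows that idea with the standard `datetime` module; the helper name, the sample date, and the format codes are assumptions for illustration and are not the format syntax used in Octoparse.

```python
from datetime import datetime


def reformat_datetime(value: str, source_format: str, target_format: str) -> str:
    """Parse `value` using `source_format`, then render it in `target_format`."""
    return datetime.strptime(value, source_format).strftime(target_format)


# Hypothetical field scraped as day/month/year, rewritten as year-month-day.
iso_like = reformat_datetime(
    "29/09/2016 06:22", "%d/%m/%Y %H:%M", "%Y-%m-%d %H:%M:%S"
)

print(iso_like)
```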

 

 

8. Timestamp conversion

Function: Convert a Unix timestamp into your own customized format.

A Unix timestamp is a number that represents a specific date and time, counted as the seconds elapsed since January 1, 1970, 00:00:00 UTC. This function converts Unix time into a format that is easy to read.
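To make the conversion concrete, here is a small Python sketch. It assumes the timestamp is in seconds (some sites use milliseconds) and renders it in UTC; the helper name and the default output format are choices made for this example.

```python
from datetime import datetime, timezone


def convert_timestamp(seconds: int, target_format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Render a Unix timestamp (seconds since 1970-01-01 UTC) as text in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(target_format)


epoch_start = convert_timestamp(0)
sample = convert_timestamp(1475130120)

print(epoch_start)
print(sample)
```

Passing an explicit time zone keeps the result the same on every machine; without it, the output would depend on the local time zone of the computer running the code.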

9. HTML transcoding

Function: Convert HTML entities into plain text automatically. For example, transcode "&amp;" into "&".
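Decoding of this kind turns HTML entities such as `&amp;` back into the characters they stand for. Python's standard `html.unescape` does the same job and serves as a quick illustration; the sample text is invented, and the exact set of entities Octoparse handles is not stated here.

```python
import html

# Hypothetical extracted text that still contains HTML entities.
encoded = "Tom &amp; Jerry &lt;3 &quot;cheese&quot;"

decoded = html.unescape(encoded)

print(decoded)
```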


Tip!

All the steps you have added can be edited or deleted here by clicking their icons.


Octoparse Regex Tool

Octoparse also offers a RegEx Tool that generates the regular expression you need. Let's take a quick look at how to use it to generate and apply a regular expression. In this example, we want to pick out the star-rating number from the extracted outer HTML.

· Click "Try RegEx Tool"

· Enter the match criteria: start with "src="", end with " " "

· Click "generate" to produce the regular expression

· Click "Match" to pick up the matched strings

· Click "Apply"

· Click "Confirm" to save the settings
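One way to picture what a "start with / end with" generator produces is a pattern that captures whatever sits between two markers. The Python sketch below builds such a pattern with lookarounds. It is an assumption about the general technique, not the actual expression Octoparse generates; the helper `build_between_pattern` and the sample HTML are hypothetical.

```python
import re


def build_between_pattern(start: str, end: str) -> str:
    """Build a regex matching the shortest text between `start` and `end`.

    Both markers are escaped so they are treated as literal text, and they
    are placed in lookarounds so they are left out of the match itself.
    """
    return f"(?<={re.escape(start)}).*?(?={re.escape(end)})"


# Hypothetical outer HTML of a star-rating image.
outer_html = '<img src="/images/stars-4.5.png" alt="rating">'

pattern = build_between_pattern('src="', '"')
matches = re.findall(pattern, outer_html)

print(matches)
```

From a match like this, a second "Match with regular expression" step could pull out just the number.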

 

Click the link here for more information about the use of the Regex tool.

 

If you have questions, you are welcome to submit a request here. Our support team will get back to you later.

 

 

Author: The Octoparse Team

