Step-by-step tutorials for you to get started with web scraping


Re-format data extracted

Thursday, August 16, 2018

During a web scraping project, some of the extracted data might not be in the format you want. For these cases, Octoparse offers 8 data re-format options that let you further process or clean the extracted data into the right format.

 

Accessing these features in Octoparse takes 5 main steps:

1. Select the data field to reformat

 

2. Click the icon to customize the data field

 

3. Select "Refine extracted data"

 

4. Click "Add step"

 

5. Select an operation to re-format your data

 

Before introducing the 8 re-formatting options, we would like to explain the term "string" first.

In programming, a string is a sequence of characters such as letters, numerals, symbols and punctuation marks. For example, " " (a space) is a string; "Octoparse" is a string; and "Hello 2 *% World!" is also a string. A string can also contain no characters at all, in which case it is called an empty string. Replacing a word with an empty string is, in effect, the same as deleting the word.

You will see the word "string" in many of the function descriptions of Octoparse's data re-format options. Wherever it appears, it means that the option can handle any kind of characters in the extracted data, such as letters, words, sentences, numbers, spaces, symbols and punctuation marks.
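To make the idea concrete, here is a minimal Python sketch of the example strings above. Python is used purely for illustration and is not part of Octoparse; the variable names are made up for this example.

```python
# Each of these values is a string, including the single space and the empty one.
examples = [" ", "Octoparse", "Hello 2 *% World!", ""]

# Number of characters in each string; the empty string has length 0.
lengths = [len(s) for s in examples]

# Replacing a word with the empty string removes that word from the text.
sentence = "Hello 2 *% World!"
without_word = sentence.replace("World", "")

print(lengths)
print(repr(without_word))
```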

 

1. Replace

2. Replace with regular expression

3. Match with regular expression

4. Trim spaces

5. Add a prefix

6. Add a suffix

7. Reformat extracted date/time

8. HTML transcoding

 

 

 

 

 

 

 

1. Replace

Function: Replace specific strings in the extracted data with the new strings that you want.
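As a rough illustration of what a plain Replace step does, the sketch below performs the same kind of substitution in Python. The sample value is invented, and this only mirrors the idea; it is not how Octoparse implements the option.

```python
# Hypothetical value as it might come out of a scraped price field.
extracted = "Price: $1,299.00"

# Replace the label with an empty string (i.e. delete it),
# then remove the thousands separator the same way.
cleaned = extracted.replace("Price: ", "").replace(",", "")

print(cleaned)
```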

 

 

 

 

2. Replace with regular expression

Function: Use a regular expression to replace the matched strings in the extracted data with the strings that you want.

You can learn more about regular expressions at W3Schools.
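The following Python sketch shows the general idea of a regular-expression replace, using an invented phone-number value. It assumes Python's `re` syntax; the regular-expression dialect used inside Octoparse may differ in its details.

```python
import re

# Hypothetical extracted value containing a phone number with mixed punctuation.
extracted = "Tel: (555) 010-2030"

# \D matches any character that is not a digit; replacing every match
# with an empty string keeps only the digits.
digits_only = re.sub(r"\D", "", extracted)

print(digits_only)
```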

 

 

 

 

 

3. Match with regular expression

Function: Use a regular expression to pick out the matched strings from the extracted data.

You can learn more about regular expressions at W3Schools.

 

 

Octoparse also offers a RegEx Tool that auto-generates the regular expression you need. Let's take a quick look at how to use it to generate and apply a regular expression. In this example, we want to pick out the star-rating number from the extracted outer HTML.

· Click "Try RegEx Tool"

· Enter the match criteria: start with 'alt="', end with 'star rating'

· Click "generate" to produce the regular expression 

· Click "Match" to pick out the matched strings

· Click "Apply"

· Click "OK" to save the settings
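For readers who want to see what a "start with / end with" match looks like as an actual pattern, here is a small Python sketch. The HTML snippet is hypothetical, and the pattern is only one way to express the criteria above; the expression generated by the RegEx Tool may look different.

```python
import re

# Hypothetical outer HTML of a star-rating image.
outer_html = '<img class="rating" alt="4.5 star rating" src="stars.png">'

# (?<=alt=")        : the match must be preceded by alt="
# [\d.]+            : one or more digits or dots (the rating itself)
# (?= star rating)  : the match must be followed by " star rating"
pattern = r'(?<=alt=")[\d.]+(?= star rating)'

match = re.search(pattern, outer_html)
rating = match.group(0) if match else None

print(rating)
```

The lookbehind and lookahead keep the surrounding text out of the result, so only the number is returned.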

 

 

 

 

 

 

4. Trim spaces

Function: Remove unwanted spaces from the start and/or the end of the extracted data.

If you want to delete spaces in the middle of the data, you can use Replace or Replace with regular expression.
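The difference between trimming and removing inner spaces can be sketched in Python as follows. The sample text is invented, and note one assumption: Python's `strip()` removes all leading and trailing whitespace (including newlines), which may be broader than what the Trim spaces option does.

```python
import re

# Hypothetical extracted value with stray spaces at both ends and in the middle.
extracted = "   Octoparse  web   scraping  "

# Trim: only the leading and trailing whitespace is removed.
trimmed = extracted.strip()

# Plain replace: delete every space, including those between words.
no_spaces = trimmed.replace(" ", "")

# Regular-expression replace: collapse each run of spaces into a single space.
single_spaced = re.sub(r" +", " ", trimmed)

print(repr(trimmed))
print(repr(no_spaces))
print(repr(single_spaced))
```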

 

 

 

 

5. Add a prefix

Function: Add one or more strings to the front of the extracted data.
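A typical use of a prefix is turning a relative link into a full URL. The Python sketch below illustrates the idea with a made-up domain and path; it is an illustration only, not Octoparse's implementation.

```python
# Hypothetical relative link as extracted from a page.
relative_url = "/product/12345"

# The prefix is the part missing from the front of the extracted value.
prefix = "https://www.example.com"
full_url = prefix + relative_url

print(full_url)
```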

 

 

 

 

6. Add a suffix

Function: Add one or more strings to the end of the extracted data.
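A suffix is handy for appending something like a unit to a bare number. Here is a minimal Python sketch with an invented price value and currency code:

```python
# Hypothetical extracted price without a currency.
price = "19.99"

# The suffix is appended to the end of the extracted value.
suffix = " USD"
price_with_unit = price + suffix

print(price_with_unit)
```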

 

 

 

 

7. Reformat extracted date/time

Function: Convert the extracted date/time into one of the 14 built-in formats, or into your own customized format.
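Conceptually, reformatting a date means parsing the extracted text and writing it back out in a different layout. The Python sketch below shows this with two example target formats chosen for illustration; they are not taken from Octoparse's list of built-in formats, and the format codes are Python's, not Octoparse's.

```python
from datetime import datetime

# Hypothetical extracted date in a long, human-readable form.
extracted = "Thursday, August 16, 2018"

# Parse the text into a date object.
# Assumes English day and month names (the default "C" locale).
parsed = datetime.strptime(extracted, "%A, %B %d, %Y")

# Write the same date back out in two other layouts.
iso_date = parsed.strftime("%Y-%m-%d")
us_date = parsed.strftime("%m/%d/%Y")

print(iso_date)
print(us_date)
```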

 

 

 

 

8. HTML transcoding

Function: Convert HTML entities into plain text automatically. For example, transcode "&gt;" into ">" and "&nbsp;" into a space.
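The same kind of conversion can be sketched in Python with the standard library's `html.unescape`. The sample strings are invented. One assumption to note: Python decodes the non-breaking-space entity to the non-breaking space character (U+00A0), so the sketch replaces it with a regular space as an extra step.

```python
import html

# Hypothetical extracted text that still contains HTML entities.
extracted = "5 &gt; 3 &amp; 2 &lt; 4"
decoded = html.unescape(extracted)

# The nbsp entity becomes the non-breaking space character (U+00A0);
# replace it with an ordinary space if plain spaces are wanted.
with_nbsp = html.unescape("Octoparse&nbsp;8")
plain_space = with_nbsp.replace("\u00a0", " ")

print(decoded)
print(repr(plain_space))
```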

 

 

 

Related articles:

Extract data from source code 

Conglomerate data extracted 

Select and extract data/URL/image/HTML  

Case tutorial | Scrape business information from Yelp 

Learn more about regular expressions at W3Schools

 

 

 
