Use Regular Expression to Reformat Captured Data

Friday, August 18, 2017 3:54 AM

Data re-format in Octoparse is useful when the extracted data is not in the form you want. It offers 8 functions (Replace, Replace with regular expression, Match with regular expression, Trim spaces, Add prefix, Add suffix, Re-format extracted date/time, and Html transcoding). This tutorial shows how to use each of them in some web scraping cases.

1. Replace

Replace swaps a string or keyword in the extracted data for something else. Let’s see the following case.

Let's say you want to replace “Details about” with nothing, that is, delete “Details about” from the data. You can do so by following the steps below.

  • Choose data field "title" to reformat
  • Click the "Customize Field"
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Replace"
  • Input the words you want to replace in the "Replace" box (remember to include all the spaces); here we input "Details about   "
  • Do not input anything in the "With" box
  • Click "Calculate" to check if the result is right
  • Click "Ok" to save the result
  • Click "Save"
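
For readers who want to check the result outside Octoparse, the Replace step above can be approximated in Python as a plain substring replacement. This is only a sketch: the function name `replace_text` and the sample title are assumptions made for illustration, not something Octoparse exposes.

```python
def replace_text(value: str, old: str, new: str = "") -> str:
    """Replace every occurrence of `old` in `value` with `new`.

    Leaving `new` empty mirrors leaving the "With" box blank.
    """
    return value.replace(old, new)


# Hypothetical extracted title; note the three spaces after "about",
# matching the text entered in the "Replace" box.
title = "Details about   LG 55-inch Smart TV"
cleaned = replace_text(title, "Details about   ")
print(cleaned)  # LG 55-inch Smart TV
```

If the spaces in the search text do not match the data exactly, nothing is replaced, which is why the steps insist on entering all of them.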

 

You can have a try using this example URL: https://www.ebay.com/sch/Televisions-/11071/i.html?_dcat=11071&Brand=LG

 

2. Replace with Regular Expression

Replace with regular expression replaces the content matched by a regular expression with what you want. For example, in the following case, the extracted “Time” field is messy, with too many blanks. To fix this, we first match every run of blanks and then replace each run with a single space.

  • Select data field "Time", click the icon for "Customize Field" 
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Replace with Regular Expression"
  • Input "\s+" for "Regular Expression" (or use the Regex tool to generate a regex automatically) and a single space for "Replace with"
  • Once done, click "OK"
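
The effect of the "\s+" pattern can be tried in Python with `re.sub`. This is a minimal sketch assuming that Python's `\s` behaves closely enough to the regex engine Octoparse uses; the sample string and the function name `collapse_whitespace` are made up for illustration.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", value)


# Hypothetical "Time" field with stray blanks and line breaks.
raw_time = "10:45 AM \n\n      2h 15m   \t nonstop"
print(collapse_whitespace(raw_time))  # 10:45 AM 2h 15m nonstop
```

Because "+" matches one or more characters, a whole run of blanks becomes one space rather than one space per blank.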

You can also follow this tutorial to have a try: http://www.octoparse.com/tutorial/web-scraping-case-study-crawling-flight-information-from-ticket-websites/

 

3. Match with Regular Expression

Sometimes the data you want is buried among a lot of messy text, and you need to pick out just the target keywords. That is exactly what “Match with regular expression” does for you. Look at the case below. Since the star rating was not selected properly, we need to re-format the data field "Star" to extract the exact information we want.

  • Choose data field "Star" to reformat
  • Select the "Customize Field" 
  • Choose "Re-format extracted data"
  • Click "Add step"
  • Select "Match with Regular Expression"
  • Input the regular expression "(?<=title=")(.+?)(?= star)" to pull the rating out of the Outer HTML of "Star" (or use the Regex tool to generate a regex automatically)
  • Click "OK"
  • Note that the value of the "Star" data field has changed to 4.5
  • Click "Save"
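
The regular expression in the steps above relies on a lookbehind, `(?<=title=")`, and a lookahead, `(?= star)`, so that only the text between them is returned. The sketch below tries the same pattern in Python; the sample Outer HTML and the helper name `extract_star` are assumptions for illustration, and the real markup on the page may differ.

```python
import re
from typing import Optional

# Same pattern as in the steps: text after title=" and before " star".
STAR_PATTERN = re.compile(r'(?<=title=")(.+?)(?= star)')


def extract_star(outer_html: str) -> Optional[str]:
    """Return the first match of the pattern, or None if nothing matches."""
    match = STAR_PATTERN.search(outer_html)
    return match.group(0) if match else None


# Hypothetical Outer HTML of the "Star" field.
sample = '<div class="i-stars" title="4.5 star rating"></div>'
print(extract_star(sample))  # 4.5
```

The lookarounds only assert what surrounds the match, so `title="` and ` star` themselves are not part of the result.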

Follow this tutorial to have a try: http://www.octoparse.com/tutorial/web-scraping-case-study-scraping-data-from-yelp/

 

4. Trim Spaces

Trim spaces deletes the spaces before and after the extracted data. It only affects spaces at the beginning and/or the end of the data; spaces inside the data are not deleted. The steps are similar to those for the functions above:

  • Click the field you want to change
  • Click "Customize Field"
  • Click "Re-format extracted data"
  • Click “Add step”
  • Choose “Trim Spaces”
  • Select “Trim Start”, “Trim End” or “Trim Both” to trim spaces at the beginning, at the end, or on both sides of the extracted data
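
The three trim options map naturally onto Python's `lstrip`, `rstrip` and `strip`. This is a sketch under one assumption worth stating: Python's methods remove all leading or trailing whitespace (tabs and newlines included), which may be broader than what Octoparse trims. The function name `trim` and its `mode` values are hypothetical.

```python
def trim(value: str, mode: str = "both") -> str:
    """Trim whitespace at the start, the end, or both ends of `value`.

    Whitespace inside the text is left untouched.
    """
    if mode == "start":
        return value.lstrip()
    if mode == "end":
        return value.rstrip()
    if mode == "both":
        return value.strip()
    raise ValueError(f"unknown mode: {mode!r}")


raw = "   LG Smart TV   "
print(repr(trim(raw, "start")))  # 'LG Smart TV   '
print(repr(trim(raw, "end")))    # '   LG Smart TV'
print(repr(trim(raw, "both")))   # 'LG Smart TV'
```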

 

 

5. Add Prefix

“Add prefix” adds what you need (numbers, characters, symbols, etc.) to the beginning of the extracted data. To use this function:

  • Click the field you want to change
  • Click "Customize Field"
  • Click "Re-format extracted data"
  • Click “Add step”
  • Select “Add prefix”
  • Input what you need in the “Prefix” box

 

 

6. Add Suffix

“Add suffix” adds something to the end of the data, which is just the opposite of “Add prefix”. To use this function:

  • Click the field you want to change
  • Click "Customize Field"
  • Click "Re-format extracted data"
  • Click “Add step”
  • Select “Add suffix” and enter the text to append
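
Both “Add prefix” and “Add suffix” amount to string concatenation, which the following Python sketch illustrates. The function names and the sample values (a relative link and a price) are assumptions chosen for illustration only.

```python
def add_prefix(value: str, prefix: str) -> str:
    """Put `prefix` in front of the extracted value."""
    return prefix + value


def add_suffix(value: str, suffix: str) -> str:
    """Append `suffix` to the end of the extracted value."""
    return value + suffix


# Hypothetical uses: completing a relative link and adding a unit.
print(add_prefix("/itm/12345", "https://www.ebay.com"))  # https://www.ebay.com/itm/12345
print(add_suffix("499.99", " USD"))                      # 499.99 USD
```

No separator is inserted automatically in this sketch, so any space or slash has to be part of the prefix or suffix itself.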

 

 

 

7. Re-format the date/time

This function converts a date/time into the form you want. Let’s have a look at a case: the time extracted from the web is in the form dd/mm/yy, but you need it to be yy/mm/dd. Octoparse makes this easy:

  • Click the field you want to change
  • Click "Customize Field"
  • Click "Re-format extracted data"
  • Click “Add step”
  • Choose “Re-format the date/time”
  • Select the form you need (here we choose yyyy/MM/dd)
  • Click “Calculate” to check the form and then click “OK” to save the result
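
The same conversion can be sketched in Python by parsing the text with one format and printing it with another. The format codes below are Python's (`%d/%m/%y` for dd/mm/yy and `%Y/%m/%d` for a four-digit year, month, day), not Octoparse's own format strings, and the function name `reformat_date` is hypothetical.

```python
from datetime import datetime


def reformat_date(value: str,
                  source_format: str = "%d/%m/%y",
                  target_format: str = "%Y/%m/%d") -> str:
    """Parse `value` using `source_format` and render it in `target_format`."""
    return datetime.strptime(value, source_format).strftime(target_format)


print(reformat_date("18/08/17"))  # 2017/08/18
```

Parsing first, rather than shuffling substrings, means that a value which does not fit the expected form raises an error instead of producing a silently wrong date.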

The example URL: http://www.octoparse.com/blog/regex-how-to-extract-all-email-addresses-from-txt-files-or-strings/

 

8. Html Transcoding

Html transcoding is used when you extract the HTML source and need to convert HTML entities into plain text (for example, it can transcode “&gt;” into “>” or “&nbsp;” into a space). It is seldom needed, but if you do need it, follow these steps:

  • Click "Customize Field"
  • Click "Re-format extracted data"
  • Click “Add step”
  • Choose “Html transcoding”
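
In Python, a comparable decoding is available through the standard library's `html.unescape`. The sketch below is an approximation, not Octoparse's implementation: `html.unescape` turns "&nbsp;" into a non-breaking space (U+00A0), so the code replaces that character with an ordinary space to match the behaviour described above. The function name `transcode_html` and the sample string are assumptions.

```python
import html


def transcode_html(value: str) -> str:
    """Decode HTML entities and turn non-breaking spaces into plain spaces."""
    decoded = html.unescape(value)
    return decoded.replace("\u00a0", " ")


# Hypothetical fragment of extracted HTML source.
raw = "5 &gt; 3 &amp;&amp; price&nbsp;&lt; 10"
print(transcode_html(raw))  # 5 > 3 && price < 10
```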

 

Now you have learned all the functions of Data Re-format. See the following tutorials to learn more about Data Re-format and Regular Expression:

Re-format Captured Data (Add prefix, replace text, etc.) in Octoparse

Extracting Stock Prices using Regular expression (Example: Finance.Yahoo.com)

Scrape Emails from Facebook Pages

Extract Text from HTML - Using RegExp Tool

 

Author: The Octoparse Team

 

 
