Raw scraped data is rarely ready to use as-is. Octoparse lets you refine extracted fields before export, so you can clean text, reshape values, remove unwanted characters, and extract only the part of a field you need. If the wrong element is being extracted, fix the field selection first. If the right element is selected but the value format is messy, use refinement rules.

Scenarios it handles

Clean messy values

Remove, replace, trim, or reformat text before it is exported.

Extract part of a value

Use matching rules or regular expressions to keep only the text pattern you need.

Standardize across rows

Add prefixes, remove repeated text, or normalize inconsistent values across rows.

Reformat dates and timestamps

Convert date formats, relative dates, Unix timestamps, or timezone offsets into a consistent output.
Common examples:
  • Removing labels such as Price: or Rating:
  • Replacing unwanted characters or extra spaces
  • Adding a prefix or suffix to each value
  • Reformatting dates or converting timezones
  • Decoding HTML entities into plain text
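
The common examples above can be sketched outside Octoparse as ordinary string operations. This is a minimal illustration in Python, not Octoparse's implementation; the function name `clean_value` and its parameters are hypothetical.

```python
import html
import re


def clean_value(raw: str, label: str = "", prefix: str = "", suffix: str = "") -> str:
    """Apply a few basic cleanup steps to one extracted value (illustrative only)."""
    value = html.unescape(raw)            # decode HTML entities such as &amp;
    if label:
        value = value.replace(label, "")  # replacing with "" removes the label
    value = re.sub(r"\s+", " ", value)    # collapse runs of whitespace to one space
    value = value.strip()                 # trim leading and trailing spaces
    return f"{prefix}{value}{suffix}"     # add optional fixed prefix and suffix


print(clean_value("  Price:   $19.99 ", label="Price:"))       # $19.99
print(clean_value("Tom &amp; Jerry"))                          # Tom & Jerry
print(clean_value("/item/42", prefix="https://example.com"))   # https://example.com/item/42
```

Each line in the function corresponds to one cleaning step, which mirrors how steps are applied one after another to a field.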

Access Clean Data

1. Select the field
   In the data preview or field list, select the extracted field you want to clean.

2. Open the field menu
   Click the ... menu for that field.

3. Choose Clean Data
   Select Clean Data to open the data cleaning workflow.

4. Add a cleaning step
   Click Add Step, then choose the operation you want to apply.

5. Preview the result
   Check the preview value before saving the rule.

Common refinement operations

| Operation | Use it for |
|---|---|
| Replace | Replace a specific string with another value, or remove it by replacing it with an empty string |
| Replace with Regular Expression | Use a regex pattern to find and replace matched strings |
| Match with Regular Expression | Keep only the part of a value that matches the pattern, discard everything else |
| Trim spaces | Remove unwanted spaces from the start or end of a value |
| Add a prefix | Add fixed text before each extracted value |
| Add a suffix | Add fixed text after each extracted value |
| Reformat extracted date/time | Convert date formats or relative dates (2 days ago → specific date). Built-in format presets include yyyy/MM/dd hh:mm:ss, yyyy-MM-dd, 01/01/2026, Thu, 01 01 2026, and more. |
| Timestamp conversion | Convert Unix timestamps into human-readable date formats |
| Timezone conversion | Adjust date and time to a target timezone |
| HTML transcoding | Convert HTML entities to plain text (e.g. &amp; → &) |

A string is a sequence of characters: it can be a word, number, space, symbol, or punctuation mark. An empty string means no characters at all, so replacing matched text with an empty string effectively removes it from the output.
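
To make the date and time operations listed above concrete, here is a rough Python sketch of equivalent conversions. It is not how Octoparse implements them. The function names are hypothetical, and Python uses `strftime` codes such as `%Y-%m-%d` rather than presets like yyyy-MM-dd.

```python
from datetime import datetime, timedelta, timezone


def reformat_date(text: str, in_fmt: str, out_fmt: str) -> str:
    """Parse a date string in one format and write it out in another."""
    return datetime.strptime(text, in_fmt).strftime(out_fmt)


def relative_to_date(days_ago: int, today: datetime) -> str:
    """Turn a relative value such as '2 days ago' into a specific date."""
    return (today - timedelta(days=days_ago)).strftime("%Y-%m-%d")


def unix_to_text(ts: int, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Convert a Unix timestamp (seconds) into a readable UTC date and time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)


def shift_timezone(text: str, fmt: str, offset_hours: int) -> str:
    """Treat the input as UTC and shift it to a fixed-offset target timezone."""
    utc_time = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    target = timezone(timedelta(hours=offset_hours))
    return utc_time.astimezone(target).strftime(fmt)


print(reformat_date("01/02/2026", "%m/%d/%Y", "%Y-%m-%d"))              # 2026-01-02
print(relative_to_date(2, today=datetime(2026, 1, 3)))                  # 2026-01-01
print(unix_to_text(86400))                                              # 1970-01-02 00:00:00
print(shift_timezone("2026-01-01 12:00:00", "%Y-%m-%d %H:%M:%S", 8))    # 2026-01-01 20:00:00
```

The sketch assumes the source value is in UTC when shifting timezones. Real data may need the source timezone stated explicitly.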

Use RegEx for pattern-based cleanup

Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace or trim rules:
  • Extract a number or price embedded in a sentence
  • Match text before or after a known delimiter (e.g. :, |, -)
  • Keep only part of an HTML attribute value
  • Remove patterns that repeat differently across rows
  • Isolate a substring from a value that varies slightly from page to page
You can access the RegEx tool from the Clean Data workflow, or from the Tools area in the left sidebar. Octoparse also includes an AI RegEx generator — describe what you want to extract in plain language and the tool generates the pattern for you, so you don’t need to write the expression manually.
| Goal | Pattern | Input | Output |
|---|---|---|---|
| Extract price | \$\d+(?:\.\d{1,2})? | Only $19.99 today | $19.99 |
| Extract email | [\w.-]+@[\w.-]+\.\w+ | Contact: info@example.com | info@example.com |
| Extract date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | Published 2025-06-10 | 2025-06-10 |
| Extract digits only | \d+ | Rating: 4.5 out of 5 | 4 |
| Remove HTML tags | <[^>]+> | <b>Bold text</b> | Bold text |
For a full list of patterns and syntax, see the RegEx Cheatsheet for Data Extraction in the Help Center.
Use RegEx only when simpler cleaning rules are not enough. For straightforward cleanup, operations such as replace, trim, prefix, and suffix are easier to maintain.
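
The example patterns above can be tried out before adding them to a task. The following is a small sketch using Python's `re` module. The helper names `match_first` and `remove_tags` are hypothetical, and the regex flavor used inside Octoparse may differ from Python's in edge cases.

```python
import re

# Patterns copied from the examples above
PATTERNS = {
    "price": r"\$\d+(?:\.\d{1,2})?",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
    "date": r"\d{4}-\d{2}-\d{2}",
    "digits": r"\d+",
}


def match_first(pattern: str, text: str) -> str:
    """Keep only the first part of the text that matches, or '' if nothing matches."""
    found = re.search(pattern, text)
    return found.group(0) if found else ""


def remove_tags(text: str) -> str:
    """Replace every HTML tag with an empty string."""
    return re.sub(r"<[^>]+>", "", text)


print(match_first(PATTERNS["price"], "Only $19.99 today"))           # $19.99
print(match_first(PATTERNS["email"], "Contact: info@example.com"))   # info@example.com
print(match_first(PATTERNS["date"], "Published 2025-06-10"))         # 2025-06-10
print(match_first(PATTERNS["digits"], "Rating: 4.5 out of 5"))       # 4
print(remove_tags("<b>Bold text</b>"))                               # Bold text
```

Note that `\d+` stops at the decimal point, which is why the rating example yields 4. A pattern such as `\d+(?:\.\d+)?` would keep 4.5 instead.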

Example: extract a value from an attribute

Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image attribute such as alt="5 stars" or in a source attribute such as src.
1. Select the element
   Select the element that contains the value you need, such as a rating icon or text block.

2. Choose the source value
   Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.

3. Customize the field
   Open the field menu and choose Customize Field or Clean Data.

4. Extract the target value
   Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.

5. Preview before saving
   Confirm that the preview shows the expected value before running the task.
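
As an illustration of the RegEx route, the sketch below pulls an attribute value out of an OuterHTML string in Python. The HTML snippet and the helper `attribute_value` are made up for this example. The pattern assumes double-quoted attribute values and is not a full HTML parser.

```python
import re


def attribute_value(outer_html: str, attribute: str) -> str:
    """Return the double-quoted value of one attribute, or '' if it is absent."""
    pattern = rf'\b{re.escape(attribute)}\s*=\s*"([^"]*)"'
    found = re.search(pattern, outer_html)
    return found.group(1) if found else ""


# Hypothetical OuterHTML of a rating icon
snippet = '<img class="rating" src="/img/stars-5.png" alt="5 stars">'

alt_text = attribute_value(snippet, "alt")
rating = re.search(r"\d+", alt_text)

print(alt_text)                          # 5 stars
print(rating.group(0))                   # 5
print(attribute_value(snippet, "src"))   # /img/stars-5.png
```

Selecting the attribute directly, where that option is available, avoids the regex entirely and is usually easier to maintain.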

Limits of field refinement

Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a multi-line text block appears as several lines visually but is actually one single element in the page source, Octoparse may treat it as one field. In that case, you may not be able to split it into separate fields by visual line breaks alone. Check the source structure and use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.
If a value cannot be separated because the website stores it as one single element, data cleaning may not be enough. You may need to adjust the selected element, inspect the HTML, or extract a different source value.
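
When a block is stored as one element, one possible workaround is to extract its OuterHTML and split it after export. The Python sketch below assumes the visual lines are separated by `<br>` tags, which will not be true for every site. The sample markup and the function name `split_lines` are hypothetical.

```python
import html
import re


def split_lines(outer_html: str) -> list[str]:
    """Split one element's HTML into text lines at <br> tags."""
    # Turn <br>, <br/> and <br /> into newlines
    text = re.sub(r"<br\s*/?>", "\n", outer_html, flags=re.IGNORECASE)
    # Drop any remaining tags, then decode entities and trim each line
    text = re.sub(r"<[^>]+>", "", text)
    return [html.unescape(line).strip() for line in text.split("\n") if line.strip()]


# Hypothetical single element that renders as three visual lines
block = "<p>Acme Corp<br>12 Main St<br/>Springfield</p>"

print(split_lines(block))   # ['Acme Corp', '12 Main St', 'Springfield']
```

If the lines are separated only by CSS or wrapping, there is nothing in the source to split on, and adjusting the selected element is the more reliable fix.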

Best practices

  • Refine fields after confirming the correct element is selected.
  • Use simple cleaning steps before trying RegEx.
  • Preview each step before saving.
  • Keep field names clear so exported data is easy to understand.
  • Avoid over-cleaning if the downstream system can handle formatting later.
  • Document complex RegEx patterns so teammates can maintain the task.