Refine data

Raw scraped data is rarely ready to use as-is. Octoparse lets you refine extracted fields before export, so you can clean text, reshape values, remove unwanted characters, and extract only the part of a field you need. If the wrong element is being extracted, fix the field selection first. If the right element is selected but the value format is messy, use refinement rules.

Scenarios it handles

Clean messy values

Remove, replace, trim, or reformat text before it is exported.

Extract part of a value

Use matching rules or regular expressions to keep only the text pattern you need.

Standardize across rows

Add prefixes, remove repeated text, or normalize inconsistent values across rows.

Reformat dates and timestamps

Convert date formats, relative dates, Unix timestamps, or timezone offsets into a consistent output.

Common examples:

Removing labels such as Price: or Rating:
Replacing unwanted characters or extra spaces
Adding a prefix or suffix to each value
Reformatting dates or converting timezones
Decoding HTML entities into plain text
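To make these examples concrete, the sketch below chains the same kinds of steps in plain Python, outside Octoparse. It is only an illustration of what each step does to a value, not how Octoparse implements them; the function name `clean_value` and its parameters are hypothetical.

```python
import html
import re


def clean_value(raw: str, label: str = "Price:", prefix: str = "", suffix: str = "") -> str:
    """Apply a chain of simple cleaning steps to one extracted value."""
    value = html.unescape(raw)           # decode HTML entities such as &amp;
    value = value.replace(label, "")     # remove a label by replacing it with an empty string
    value = re.sub(r"\s+", " ", value)   # collapse runs of whitespace into one space
    value = value.strip()                # trim spaces at the start and end
    return f"{prefix}{value}{suffix}"    # add an optional prefix and suffix


print(clean_value("  Price:   $19.99 &amp; free shipping "))
print(clean_value("Rating: 4.5", label="Rating:", suffix=" / 5"))
```

The order of steps matters: decoding entities first means later replace and trim steps see the final characters rather than the encoded ones.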

Access Clean Data

Select the field

In the data preview or field list, select the extracted field you want to clean.

Open the field menu

Click the ... menu for that field.

Choose Clean Data

Select Clean Data to open the data cleaning workflow.

Add a cleaning step

Click Add Step, then choose the operation you want to apply.

Preview the result

Check the preview value before saving the rule.

Common refinement operations

Operation	Use it for
Replace	Replace a specific string with another value, or remove it by replacing it with an empty string
Replace with Regular Expression	Use a regex pattern to find and replace matched strings
Match with Regular Expression	Keep only the part of a value that matches the pattern, discard everything else
Trim spaces	Remove unwanted spaces from the start or end of a value
Add a prefix	Add fixed text before each extracted value
Add a suffix	Add fixed text after each extracted value
Reformat extracted date/time	Convert date formats or relative dates (`2 days ago` → specific date). Built-in format presets include yyyy/MM/dd hh:mm:ss, yyyy-MM-dd, 01/01/2026, Thu, 01 01 2026, and more.
Timestamp conversion	Convert Unix timestamps into human-readable date formats
Timezone conversion	Adjust date and time to a target timezone
HTML transcoding	Convert HTML entities to plain text (e.g. `&amp;` → `&`)

A string is a sequence of characters: a word, number, space, symbol, or punctuation mark. An empty string contains no characters at all, so replacing matched text with an empty string removes that text from the output.
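The date, timestamp, and timezone operations in the table can be pictured with a small Python sketch. This is an offline illustration under stated assumptions, not Octoparse's implementation: the function names are hypothetical, Python format codes such as `%Y-%m-%d` stand in for presets like yyyy-MM-dd, and the timezone is modelled as a fixed UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def reformat_date(text: str, in_fmt: str, out_fmt: str = "%Y-%m-%d") -> str:
    """Parse a date in one format and write it in another."""
    return datetime.strptime(text, in_fmt).strftime(out_fmt)


def relative_to_date(text: str, now: datetime) -> Optional[datetime]:
    """Turn a phrase such as '2 days ago' into a specific date relative to `now`."""
    m = re.search(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text)
    if m is None:
        return None
    amount, unit = int(m.group(1)), m.group(2)
    return now - timedelta(**{unit + "s": amount})


def unix_to_text(ts: int, utc_offset_hours: float = 0.0) -> str:
    """Convert a Unix timestamp to readable text in a fixed-offset timezone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz).strftime("%Y-%m-%d %H:%M:%S")


now = datetime(2025, 6, 10, 12, 0)  # fixed "current time" so the example is repeatable
print(reformat_date("06/10/2025", "%m/%d/%Y"))  # 2025-06-10
print(relative_to_date("2 days ago", now))      # 2025-06-08 12:00:00
print(unix_to_text(1749513600))                 # 2025-06-10 00:00:00
print(unix_to_text(1749513600, -5))             # 2025-06-09 19:00:00
```

Relative dates depend on when the task runs, so a value such as `2 days ago` resolves to a different date on each run.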

Use RegEx for pattern-based cleanup

Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace or trim rules:

Extract a number or price embedded in a sentence
Match text before or after a known delimiter (e.g. :, |, -)
Keep only part of an HTML attribute value
Remove patterns that repeat differently across rows
Isolate a substring from a value that varies slightly from page to page

You can access the RegEx tool from the Clean Data workflow, or from the Tools area in the left sidebar. Octoparse also includes an AI RegEx generator — describe what you want to extract in plain language and the tool generates the pattern for you, so you don’t need to write the expression manually.

Goal	Pattern	Input	Output
Extract price	`\$\d+(?:\.\d{1,2})?`	`Only $19.99 today`	`$19.99`
Extract email	`[\w.-]+@[\w.-]+\.\w+`	`Contact: info@example.com`	`info@example.com`
Extract date (YYYY-MM-DD)	`\d{4}-\d{2}-\d{2}`	`Published 2025-06-10`	`2025-06-10`
Extract digits only (first match)	`\d+`	`Rating: 4.5 out of 5`	`4`
Remove HTML tags (replace matches with an empty string)	`<[^>]+>`	`<b>Bold text</b>`	`Bold text`

For a full list of patterns and syntax, see the RegEx Cheatsheet for Data Extraction in the Help Center.

Use RegEx only when simpler cleaning rules are not enough. For straightforward cleanup, operations such as replace, trim, prefix, and suffix are easier to maintain.
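Before adding a pattern to a task, it can help to try it offline. The sketch below runs the patterns from the table with Python's `re` module; it assumes that Octoparse's regex engine treats these particular patterns the same way, which is worth confirming in the Clean Data preview. The helper name `match_first` is hypothetical.

```python
import re


def match_first(pattern: str, text: str) -> str:
    """Return the first match of `pattern` in `text`, or an empty string if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


# "Match" style rules: keep only the text that matches the pattern.
print(match_first(r"\$\d+(?:\.\d{1,2})?", "Only $19.99 today"))           # $19.99
print(match_first(r"[\w.-]+@[\w.-]+\.\w+", "Contact: info@example.com"))  # info@example.com
print(match_first(r"\d{4}-\d{2}-\d{2}", "Published 2025-06-10"))          # 2025-06-10
print(match_first(r"\d+", "Rating: 4.5 out of 5"))                        # 4

# "Replace" style rule: remove every match by replacing it with an empty string.
print(re.sub(r"<[^>]+>", "", "<b>Bold text</b>"))                         # Bold text
```

Note that `\d+` stops at the decimal point, so it returns `4` rather than `4.5`; a pattern such as `\d+(?:\.\d+)?` would keep the decimal part.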

Example: extract a value from an attribute

Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image's alt attribute, such as alt="5 stars", or encoded in the file name inside its src attribute.

Select the element

Select the element that contains the value you need, such as a rating icon or text block.

Choose the source value

Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.

Customize the field

Open the field menu and choose Customize Field or Clean Data.

Extract the target value

Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.

Preview before saving

Confirm that the preview shows the expected value before running the task.
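As a rough picture of the last steps, the sketch below pulls a rating out of an OuterHTML string with a regular expression. The HTML snippet, the attribute layout, and the function name `extract_rating` are all hypothetical; real pages will differ, so the pattern needs to be adapted to the markup you actually see.

```python
import re
from typing import Optional

# Hypothetical OuterHTML value for a rating icon.
outer_html = '<img class="rating" src="/img/stars-5.png" alt="5 stars">'


def extract_rating(markup: str) -> Optional[str]:
    """Return the number stored in an alt attribute such as alt="5 stars"."""
    m = re.search(r'alt="(\d+(?:\.\d+)?)\s*stars?"', markup)
    return m.group(1) if m else None


print(extract_rating(outer_html))  # 5
```

When the attribute itself can be selected as the source value, a simple Replace step that removes the word `stars` is usually easier to maintain than a regex over the full HTML.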

Limits of field refinement

Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a multi-line text block appears as several lines visually but is actually one single element in the page source, Octoparse may treat it as one field. In that case, you may not be able to split it into separate fields by visual line breaks alone. Check the source structure and use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.

If a value cannot be separated because the website stores it as one single element, data cleaning may not be enough. You may need to adjust the selected element, inspect the HTML, or extract a different source value.
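To illustrate why inspecting the HTML helps, the sketch below takes a single element whose visual lines are separated by `<br>` tags and splits it in plain Python. The markup and the function name `split_lines` are hypothetical, and this runs outside Octoparse; inside a task, the comparable approach would be to extract the OuterHTML and apply a separate regex rule for each part you need.

```python
import re
from typing import List

# Hypothetical single element that renders as three visual lines.
outer_html = '<p class="address">221B Baker Street<br>London<br/>NW1 6XE</p>'


def split_lines(markup: str) -> List[str]:
    """Split one element's HTML on <br> tags and strip the remaining tags."""
    parts = re.split(r"<br\s*/?>", markup, flags=re.IGNORECASE)
    return [re.sub(r"<[^>]+>", "", part).strip() for part in parts]


print(split_lines(outer_html))  # ['221B Baker Street', 'London', 'NW1 6XE']
```

If the page source contains no delimiter at all between the parts, no cleaning rule can recover them reliably, and adjusting the selected element is the better fix.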

Best practices

Refine fields after confirming the correct element is selected.
Use simple cleaning steps before trying RegEx.
Preview each step before saving.
Keep field names clear so exported data is easy to understand.
Avoid over-cleaning if the downstream system can handle formatting later.
Document complex RegEx patterns so teammates can maintain the task.
