Scenarios it handles
Clean messy values
Remove, replace, trim, or reformat text before it is exported.
Extract part of a value
Use matching rules or regular expressions to keep only the text pattern you need.
Standardize across rows
Add prefixes, remove repeated text, or normalize inconsistent values across rows.
Reformat dates and timestamps
Convert date formats, relative dates, Unix timestamps, or timezone offsets into a consistent output.
- Removing labels such as
Price:orRating: - Replacing unwanted characters or extra spaces
- Adding a prefix or suffix to each value
- Reformatting dates or converting timezones
- Decoding HTML entities into plain text
Access Clean Data
Common refinement operations
| Operation | Use it for |
|---|---|
| Replace | Replace a specific string with another value, or remove it by replacing it with an empty string An empty string means no characters — replacing with an empty string effectively removes the matched text. |
| Replace with Regular Expression | Use a regex pattern to find and replace matched strings |
| Match with Regular Expression | Keep only the part of a value that matches the pattern, discard everything else |
| Trim spaces | Remove unwanted spaces from the start or end of a value |
| Add a prefix | Add fixed text before each extracted value |
| Add a suffix | Add fixed text after each extracted value |
| Reformat extracted date/time | Convert date formats or relative dates (2 days ago → specific date) Built-in format presets include yyyy/MM/dd hh:mm:ss, yyyy-MM-dd, 01/01/2026, Thu, 01 01 2026, and more. |
| Timestamp conversion | Convert Unix timestamps into human-readable date formats |
| Timezone conversion | Adjust date and time to a target timezone |
| HTML transcoding | Convert HTML entities to plain text (e.g. & → &) |
Use RegEx for pattern-based cleanup
Regular expressions are useful when the value follows a pattern but cannot be cleaned reliably with simple replace or trim rules:- Extract a number or price embedded in a sentence
- Match text before or after a known delimiter (e.g.
:,|,-) - Keep only part of an HTML attribute value
- Remove patterns that repeat differently across rows
- Isolate a substring from a value that varies slightly from page to page
| Goal | Pattern | Input | Output |
|---|---|---|---|
| Extract price | \$\d+(?:\.\d{1,2})? | Only $19.99 today | $19.99 |
| Extract email | [\w.-]+@[\w.-]+\.\w+ | Contact: info@example.com | info@example.com |
| Extract date (YYYY-MM-DD) | \d{4}-\d{2}-\d{2} | Published 2025-06-10 | 2025-06-10 |
| Extract digits only | \d+ | Rating: 4.5 out of 5 | 4 |
| Remove HTML tags | <[^>]+> | <b>Bold text</b> | Bold text |
Example: extract a value from an attribute
Some websites store useful data in attributes rather than visible text. For example, a rating may be stored in an image attribute such asalt="5 stars" or in a source value such as src.
Select the element
Select the element that contains the value you need, such as a rating icon or text block.
Choose the source value
Use options such as Image URL, OuterHTML, or Other Attributes depending on where the value is stored.
Extract the target value
Select the relevant attribute, or use RegEx to match the part of the HTML you want to keep.
Limits of field refinement
Refinement rules clean the value Octoparse has already extracted. They do not change how the web page is structured. For example, if a multi-line text block appears as several lines visually but is actually one single element in the page source, Octoparse may treat it as one field. In that case, you may not be able to split it into separate fields by visual line breaks alone. Check the source structure and use field selection, extraction settings, or RegEx cleanup depending on how the data is actually stored.Best practices
- Refine fields after confirming the correct element is selected.
- Use simple cleaning steps before trying RegEx.
- Preview each step before saving.
- Keep field names clear so exported data is easy to understand.
- Avoid over-cleaning if the downstream system can handle formatting later.
- Document complex RegEx patterns so teammates can maintain the task.