Common patterns for cleanup
A small library of reusable patterns handles most real-world cases:
- Prices. `[\d,.]+` captures numeric values with commas and decimals, leaving out currency symbols and surrounding text. Refined to `(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)`, it matches standard formats like `1,299.99` more precisely.
- Phone numbers. `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` handles common US formats, with or without parentheses, separated by dashes, dots, or spaces.
- Emails. `[\w.+-]+@[\w-]+\.[\w.]+` catches most standard email addresses, one of the best-known regex use cases.
- Dates. `\d{4}-\d{2}-\d{2}` matches ISO format; `\w+ \d{1,2}, \d{4}` handles `May 15, 2026` style strings.
- HTML tag stripping. `<[^>]+>` matches tags so they can be removed from a string.
- Whitespace normalization. `\s+` matches runs of whitespace, which can then be replaced with a single space.
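As a rough illustration, the patterns above can be collected into a small helper module. This is a minimal sketch using Python's standard `re` module; the names `PATTERNS`, `clean_text`, and `extract`, the choice to return only the first match, and the sample string are all assumptions made for the example, not part of any particular scraping library.

```python
import re
from typing import Dict, Optional

# Patterns copied from the list above; the dictionary keys are illustrative.
PATTERNS = {
    "price": re.compile(r"(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    "phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "long_date": re.compile(r"\w+ \d{1,2}, \d{4}"),
}
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_text(html: str) -> str:
    """Replace tags with spaces, then collapse whitespace runs."""
    return WHITESPACE.sub(" ", TAG.sub(" ", html)).strip()


def extract(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each pattern, or None if nothing matches."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        results[name] = match.group(0) if match else None
    return results


# Made-up sample input for demonstration only.
sample = (
    "<p>Price:   <b>$1,299.99</b></p>"
    "<p>Call (555) 123-4567 or mail sales@example.com "
    "by May 15, 2026 (2026-05-15).</p>"
)

text = clean_text(sample)
print(text)
print(extract(text))
```

Tags are replaced with a space rather than an empty string so that text from adjacent elements does not run together; the whitespace pass then tidies up the extra spaces. Note that the unanchored price pattern will also match any other short number in the text, so in practice it usually needs surrounding context.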
Practical tips
- Test against real samples. Real pages have quirks idealized examples don’t. Run the pattern against output from the actual target site before trusting it.
- Use non-greedy quantifiers (`*?`, `+?`). When the engine has a choice between a short and a long match, the default `*` and `+` are greedy and take the longest. That’s often not what you want.
- Use capture groups. Parentheses `()` let you extract just the part you care about from a larger match, which is useful when you need to match context around the data but only keep the data.
- Watch locale edge cases. A price regex built for US formatting (`1,299.99`) will misbehave on European numbers (`1.299,99`), where the comma and period swap roles. The same applies to dates, phone numbers, and decimal notations.
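The last three tips can be seen side by side in a short sketch. It assumes Python's standard `re` module; the HTML snippet, the `EU_PRICE` pattern (a mirror image of the US pattern), and the `parse_price` helper are hypothetical and exist only to illustrate the points above.

```python
import re

# Made-up markup for demonstration.
html = '<span class="name">Widget</span><span class="price">$1,299.99</span>'

# Greedy: .* runs to the last ">" and swallows the whole string.
greedy = re.findall(r"<span.*>", html)
# Non-greedy: .*? stops at the first ">" and yields each opening tag.
lazy = re.findall(r"<span.*?>", html)

# Capture group: match the surrounding context, keep only the number.
price_match = re.search(r'class="price">\$([\d,.]+)<', html)
price_text = price_match.group(1) if price_match else None

# Locale: the US pattern silently truncates a European-formatted number.
US_PRICE = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
EU_PRICE = re.compile(r"\d{1,3}(?:\.\d{3})*(?:,\d{2})?")  # assumed mirror pattern


def parse_price(raw: str, decimal_sep: str = ".") -> float:
    """Convert a matched price string to a float, given its decimal separator."""
    thousands_sep = "," if decimal_sep == "." else "."
    return float(raw.replace(thousands_sep, "").replace(decimal_sep, "."))


eu_text = "Preis: 1.299,99"
wrong = US_PRICE.search(eu_text).group(0)   # truncated match
right = EU_PRICE.search(eu_text).group(0)   # full match

print(len(greedy), len(lazy))
print(price_text)
print(wrong, right, parse_price(right, decimal_sep=","))
```

The US pattern does not fail outright on the European string; it returns a plausible but wrong value, which is why testing against real samples from the target locale matters.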