
Why Your Data Parsing Fails (And How to Fix Messy Data)


You just finished scraping a website. Your Python script ran without errors. You have all the data you wanted. But when you open the file, you see something like this:

<div class="price">$29.99</div>
<span class="product-name">Blue Widget</span>
<!-- Sometimes the price is here -->
<p class="cost">Price: 45.00 USD</p>

This isn’t what you needed. You needed a clean spreadsheet with columns for “Product Name” and “Price.” What went wrong?

Nothing, actually. Your web scraper did its job—it got the data from the website. But scraping and parsing are two different things.

What Is Data Parsing?

Data parsing means taking the raw data you saved and organizing it into a format you can actually use.

If web scraping is like photocopying pages from a book, data parsing is like reading those pages and typing the important information into a spreadsheet.

When you scrape a website, you get HTML, the code that web browsers read to display pages. HTML looks messy to us humans, and parsing is how you turn that HTML into organized data.
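
As a tiny illustration, here is what that turn looks like with BeautifulSoup (the library this article's later examples use); the product markup is a made-up sample:

```python
from bs4 import BeautifulSoup

# Raw HTML, as a scraper might save it
html = '''
<div class="product">
  <span class="product-name">Blue Widget</span>
  <div class="price">$29.99</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Pull out just the pieces we care about
name = soup.find('span', class_='product-name').text
price = soup.find('div', class_='price').text

print(name, price)  # Blue Widget $29.99
```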

Every Website Structures Data Differently

Every website organizes its data in a unique way. Two websites selling the same products could use completely different HTML structures to display prices, names, and descriptions. This makes data parsing challenging because you cannot use the same parsing rules across different sites.

Store A puts prices like this:

html
<span class="price">$29.99</span>

Store B does it like this:

html
<div class="product-cost">
  Price: $29.99 USD
</div>

Store C uses this:

html
<p class="amount">29.99</p>

All three websites show the same information (a price), but they structure it completely differently, which means your data parsing code needs to handle each one separately.

But this gets worse when you realize:

  • Prices might include currency symbols or not
  • Some sites use commas (1,299.99) and others use periods (1.299,99)
  • Sales prices might be in different HTML tags than regular prices
  • The website might change its structure next month
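
Even just the formatting quirks above take real code to handle. Here is one sketch of a normalizer; the heuristic (treat whichever separator appears last as the decimal point) is my own assumption, not a standard rule:

```python
import re

def normalize_price(raw: str) -> float:
    """Strip currency symbols and guess the decimal separator."""
    # Keep only digits, commas, and periods
    cleaned = re.sub(r'[^\d.,]', '', raw)
    # Whichever separator appears last is treated as the decimal point
    if cleaned.rfind(',') > cleaned.rfind('.'):
        # European style: 1.299,99 -> 1299.99
        cleaned = cleaned.replace('.', '').replace(',', '.')
    else:
        # US style: 1,299.99 -> 1299.99
        cleaned = cleaned.replace(',', '')
    return float(cleaned)

print(normalize_price('$1,299.99'))  # 1299.99
print(normalize_price('1.299,99'))   # 1299.99
```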

Common Data Parsing Problems You’ll Face

Problem 1: Finding the Right Data

Look at this HTML from a real product page:

html
<div class="container">
  <span class="label">Price:</span>
  <span class="value">$29.99</span>
  <span class="old-price">$39.99</span>
  <span class="tax-note">Plus tax</span>
</div>

Which price do you want? The current one or the old one? And what about the tax note? Do you need that?

You need to write code that finds the exact piece you want. In Python with BeautifulSoup, you could write:

python
price = soup.find('span', class_='value').text

But if the website changes class='value' to class='current-price', your code breaks. You won’t know until you run it again and get errors.

Problem 2: Cleaning the Data

Even when you find the right data, it’s usually not clean. You might extract:

  • “$29.99” (includes dollar sign)
  • “Price: $29.99” (includes label)
  • ” $29.99 ” (has extra spaces)
  • “$29.99 USD” (includes currency code)

You need to clean each one:

python
# Remove dollar sign
price = price.replace('$', '')

# Remove "Price:" label
price = price.replace('Price:', '')

# Remove spaces
price = price.strip()

# Convert to number
price = float(price)

Now multiply this by every field you’re collecting (name, description, rating, availability, etc.). You end up writing a lot of data cleaning code.
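
One pattern that keeps this manageable is mapping each field name to a small cleaning function, so the repetitive code lives in one place. A sketch (the field names and sample values are just examples):

```python
# One cleaning function per field, applied in a loop
cleaners = {
    'price': lambda s: float(s.replace('Price:', '').replace('$', '').strip()),
    'name': lambda s: s.strip(),
    'rating': lambda s: s.strip(),
}

raw_row = {'price': ' Price: $29.99 ', 'name': '  Blue Widget ', 'rating': '4.5 '}
clean_row = {field: cleaners[field](value) for field, value in raw_row.items()}
print(clean_row)  # {'price': 29.99, 'name': 'Blue Widget', 'rating': '4.5'}
```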

Problem 3: Handling Missing Data

Sometimes a product doesn’t have a review rating. Or the price isn’t listed. Your code might crash:

python
rating = soup.find('span', class_='rating').text
# AttributeError: 'NoneType' object has no attribute 'text'

You need to add checks everywhere:

python
rating_element = soup.find('span', class_='rating')
if rating_element:
    rating = rating_element.text
else:
    rating = 'No rating'
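
Rather than sprinkling that if/else around every field, you could wrap it in a helper once (a sketch; the function name and default are mine):

```python
from bs4 import BeautifulSoup

def safe_text(soup, tag, class_name, default='N/A'):
    """Return the element's text, or a default if the element is missing."""
    element = soup.find(tag, class_=class_name)
    return element.text.strip() if element else default

soup = BeautifulSoup('<span class="price">$29.99</span>', 'html.parser')
print(safe_text(soup, 'span', 'price'))   # $29.99
print(safe_text(soup, 'span', 'rating'))  # N/A
```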

Problem 4: Nested Data

Many websites have data inside other data. For example, a product page might have multiple reviews, and each review has a rating, date, and comment.

Your HTML looks like:

html
<div class="reviews">
  <div class="review">
    <span class="rating">5 stars</span>
    <p class="comment">Great product!</p>
  </div>
  <div class="review">
    <span class="rating">4 stars</span>
    <p class="comment">Pretty good.</p>
  </div>
</div>

You need to loop through each review and extract multiple pieces from each one:

python
reviews = soup.find_all('div', class_='review')
for review in reviews:
    rating = review.find('span', class_='rating').text
    comment = review.find('p', class_='comment').text
    # Save each review somewhere

This gets complicated fast, especially when reviews can have replies, or products have multiple variations.
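
Collecting each review into a dictionary keeps the nesting manageable. This sketch extends the loop above; the `int(...)` conversion assumes every rating looks like "5 stars" with the number first:

```python
from bs4 import BeautifulSoup

html = '''
<div class="reviews">
  <div class="review"><span class="rating">5 stars</span><p class="comment">Great product!</p></div>
  <div class="review"><span class="rating">4 stars</span><p class="comment">Pretty good.</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

parsed = []
for review in soup.find_all('div', class_='review'):
    parsed.append({
        # "5 stars" -> 5 (assumes the number always comes first)
        'rating': int(review.find('span', class_='rating').text.split()[0]),
        'comment': review.find('p', class_='comment').text,
    })

print(parsed)
# [{'rating': 5, 'comment': 'Great product!'}, {'rating': 4, 'comment': 'Pretty good.'}]
```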

When Code-Based Parsing Makes Sense

Writing Python scripts to parse data is powerful. You should use code when:

  • You’re building a system that will run automatically for months
  • You need to do complex calculations or transformations on the data
  • You’re integrating with other systems (databases, APIs, etc.)
  • You have specific requirements that no tool can handle
  • You’re comfortable debugging and maintaining code

The advantage of code is complete control. You can handle any situation if you write the logic for it.

When Code-Based Parsing Gets Hard

Here’s where many people I know struggle:

Time: Writing parsing code for one website might take 2-4 hours. If you need data from 10 different websites, that’s 20-40 hours of coding.

Maintenance: Websites change their structure. When they do, your code breaks. You need to fix it, test it, and deploy it again. If you’re parsing 10 websites, you might spend several hours per month just on maintenance.

Team access: If you’re the only person who codes, you become the bottleneck. Your team members can’t help collect data or fix broken scrapers.

One-off projects: Sometimes you just need data once. Writing and testing a script might take longer than the actual data analysis.

Data Parsing Examples: How to Parse Amazon Price Data

Let me walk you through a real example that shows why parsing is tricky: getting price data from Amazon product pages. I’ll show you exactly what I encounter when I try to do this.

What I’m Trying to Get

For price monitoring purposes, I want a simple spreadsheet with three columns:

  • Product name
  • Current price
  • Original price (if on sale)

Seems straightforward. But Amazon’s HTML makes this complicated.

What the HTML Actually Looks Like

When you scrape an Amazon product page, you don’t get clean data. You get HTML that looks like this:

html
<span id="priceblock_ourprice" class="a-size-medium a-color-price">
  $24.99
</span>

Or sometimes like this:

html
<span class="a-price-whole">24</span>
<span class="a-price-fraction">99</span>

Or even like this:

html
<span class="a-offscreen">$24.99</span>
<span aria-hidden="true">
  <span class="a-price-symbol">$</span>
  <span class="a-price-whole">24</span>
  <span class="a-price-decimal">.</span>
  <span class="a-price-fraction">99</span>
</span>

All three HTML snippets show the same price ($24.99), but they’re structured completely differently.

Why Amazon Does This

Amazon uses different HTML structures for different situations:

  • Regular products vs Prime-exclusive products
  • Items sold by Amazon vs third-party sellers
  • Products on sale vs regular price
  • Desktop view vs mobile view
  • A/B tests (Amazon shows different layouts to different users)

This isn’t Amazon being difficult. They update their website constantly to improve shopping experience. But it makes data parsing harder.

My First Parsing Attempt: The Simple Approach

I start with simple code to find the price:

python
from bs4 import BeautifulSoup

# Parse the page source saved by the scraper (the html variable)
soup = BeautifulSoup(html, 'html.parser')

# Get the price
price = soup.find('span', id='priceblock_ourprice')
print(price.text)

This works for some products. But on other products, I get an error:

AttributeError: 'NoneType' object has no attribute 'text'

The error happens because priceblock_ourprice doesn’t exist on every product page. I need a better approach.

My Second Attempt: Multiple Selectors

I realize I need to check multiple locations. So I update my code:

python
# Try the first location
price = soup.find('span', id='priceblock_ourprice')

# If not found, try another location
if not price:
    price = soup.find('span', id='priceblock_dealprice')

# Still not found? Try another
if not price:
    price = soup.find('span', class_='a-price-whole')

# Get the text if we found anything
if price:
    price_text = price.text
else:
    price_text = 'Price not found'

But some pages split the price into whole and fraction parts, so my parsing code also needs to find both pieces and combine them:

python
whole = soup.find('span', class_='a-price-whole')
fraction = soup.find('span', class_='a-price-fraction')

if whole and fraction:
    price = whole.text + '.' + fraction.text

Data Parsing Problems I Faced:

1. Hidden vs Visible Prices

Amazon sometimes shows two versions of the price—one for screen readers and one visible to users:

html
<span class="a-offscreen">$24.99</span>
<span aria-hidden="true">$24<sup>.99</sup></span>

Which one should I grab? The offscreen version is usually cleaner (no extra HTML tags), but it’s not always there.

2. Sale Prices

When a product is on sale, Amazon shows both prices:

html
<span class="a-price a-text-price" data-a-color="secondary">
  <span class="a-offscreen">$39.99</span>
</span>
<span class="a-price" data-a-color="price">
  <span class="a-offscreen">$24.99</span>
</span>

I need to figure out which is the current price and which is the original price. Usually the one with data-a-color="price" is the current price, but not always.
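
One way to separate the two is to match on that attribute. This is a sketch based on the snippet above; Amazon's actual markup varies, so the selectors are an assumption:

```python
from bs4 import BeautifulSoup

html = '''
<span class="a-price a-text-price" data-a-color="secondary">
  <span class="a-offscreen">$39.99</span>
</span>
<span class="a-price" data-a-color="price">
  <span class="a-offscreen">$24.99</span>
</span>
'''
soup = BeautifulSoup(html, 'html.parser')

def offscreen_price(color):
    """Find the a-offscreen text inside the span with the given data-a-color."""
    wrapper = soup.find('span', attrs={'data-a-color': color})
    return wrapper.find('span', class_='a-offscreen').text if wrapper else None

current = offscreen_price('price')       # $24.99
original = offscreen_price('secondary')  # $39.99
print(current, original)
```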

3. Currency Symbols and Formatting

Even after I find the price, I need to clean it:

python
# What I extract: "$24.99"
# What I need: 24.99 (as a number)

price_text = "$24.99"

# Remove dollar sign
price_text = price_text.replace('$', '')

# Remove any commas (for prices like $1,299.99)
price_text = price_text.replace(',', '')

# Convert to number
price_number = float(price_text)

But what if the price is listed as “Currently unavailable”? Or “See options for pricing”? My conversion to a number will crash.

I need more error handling:

python
try:
    price_number = float(price_text.replace('$', '').replace(',', ''))
except ValueError:
    price_number = None  # or 0, or 'N/A'

My Complete Parsing Function

Here’s what my code looks like after I handle these cases:

python
def parse_amazon_price(soup):
    # Try multiple price locations
    price_locations = [
        {'id': 'priceblock_ourprice'},
        {'id': 'priceblock_dealprice'},
        {'class_': 'a-offscreen'},
        {'class_': 'a-price-whole'}
    ]
    
    price_text = None
    
    for location in price_locations:
        element = soup.find('span', **location)
        if element:
            price_text = element.text.strip()
            break
    
    # If price is split into parts
    if not price_text:
        whole = soup.find('span', class_='a-price-whole')
        fraction = soup.find('span', class_='a-price-fraction')
        if whole and fraction:
            price_text = whole.text + '.' + fraction.text
    
    # Clean the price
    if price_text:
        # Remove currency symbols and extra text
        price_text = price_text.replace('$', '').replace(',', '')
        price_text = price_text.strip()
        
        # Convert to number
        try:
            return float(price_text)
        except ValueError:
            return None
    
    return None

This is 30+ lines of code just to get one price. And it still doesn’t handle every case I might encounter.

However, let’s say I get this working today. Next month, Amazon updates their website. They change some class names. My code breaks.

I need to:

  1. Notice that it broke (maybe I don’t check the output regularly)
  2. Figure out what changed on Amazon’s page
  3. Update my parsing logic
  4. Test it again
  5. Hope it works until the next change

If I’m parsing 20 different products, or getting data from Amazon plus five other stores, this maintenance adds up fast.

What My Clean Data Should Look Like

After all the data parsing work, here’s what I want in my CSV file:

Product Name,Current Price,Original Price
Wireless Mouse,24.99,39.99
USB Cable,8.99,
Laptop Stand,45.00,55.00

Clean. Simple. Ready to analyze.

The parsing process is everything between the messy HTML and this clean table.
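
Writing the table out is the easy part; with Python's standard csv module it might look like this (the rows are the sample data from above):

```python
import csv

rows = [
    ('Wireless Mouse', 24.99, 39.99),
    ('USB Cable', 8.99, None),  # no original price: not on sale
    ('Laptop Stand', 45.00, 55.00),
]

with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name', 'Current Price', 'Original Price'])
    for name, current, original in rows:
        # Leave the cell empty when there is no original price
        writer.writerow([name, f'{current:.2f}',
                         f'{original:.2f}' if original is not None else ''])
```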

How I Used Octoparse for Data Parsing

I found that Octoparse’s Amazon Scraper deals with Amazon’s complexity differently. Instead of writing code to check multiple locations, here’s what I do:

  1. I click on the price I want on one product page
  2. I click on a few more examples
  3. Octoparse learns the pattern
  4. It extracts prices from all products

With Octoparse, I can:

  • Check multiple HTML locations automatically
  • Combine split price elements
  • Clean currency symbols
  • Convert to numbers

When Amazon changes their HTML, I usually just re-click examples rather than rewriting code. This saves me hours of debugging and maintenance work.

For Amazon specifically, both approaches work. The question is whether you want to spend time writing and maintaining parsing code, or whether you want to click examples and move on to analyzing the data.

Conclusion

Data parsing is the bridge between raw web data and useful information. As I’ve shown you with the Amazon example, it’s about handling inconsistent structures, cleaning messy formats, and maintaining your solution when websites change.

If you’re comfortable with Python and need custom logic for complex projects, writing your own parsing code gives you complete control. You’ll spend time upfront building the solution, but you’ll have exactly what you need.

If you’re collecting data from multiple websites or need results quickly, visual tools like Octoparse handle the common parsing challenges automatically. You focus on what data you want, not how to extract it from different HTML structures.


FAQs

1. What are common data parsing techniques?

Popular techniques include XPath and CSS selectors to locate HTML elements, regular expressions (RegEx) to extract patterns, and JSON parsing for API responses.

Data parsing also involves cleaning and normalizing data such as removing currency symbols, trimming spaces, and handling nested data structures.
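
For instance, a regular expression can pull a price pattern out of arbitrary text. A sketch; the pattern here is deliberately simple and assumes US-style formatting with cents:

```python
import re

text = 'Only $1,299.99 while supplies last!'

# Optional dollar sign, digits with optional thousands commas, then cents
match = re.search(r'\$?(\d{1,3}(?:,\d{3})*\.\d{2})', text)
if match:
    price = float(match.group(1).replace(',', ''))
    print(price)  # 1299.99
```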

2. How to handle dynamic content while parsing?

Dynamic content loaded by JavaScript or AJAX requires additional parsing steps. Try using headless browsers to render pages fully before you try to extract data, or intercept API calls directly for cleaner JSON data.

Handling this requires parsing the rendered DOM or API responses rather than static page HTML.
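
When you can reach the underlying API, the JSON it returns is much easier to parse than rendered HTML. The payload below is a made-up example of what such a response might contain:

```python
import json

# A hypothetical API response, as captured from the browser's network tab
payload = '{"product": {"name": "Blue Widget", "price": {"amount": 29.99, "currency": "USD"}}}'

data = json.loads(payload)
price = data['product']['price']['amount']
print(price)  # 29.99
```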

3. How do I use XPath and CSS selectors for data parsing?

XPath and CSS selectors are languages used to pinpoint specific elements within HTML for parsing:

  • XPath navigates the HTML structure like a tree, allowing precise selection based on element hierarchy, attributes, or position. For example, //div[@class='price']/text() selects the text inside a div with class “price”.
  • CSS selectors target elements based on tags, classes, IDs, or attributes with simple syntax like div.price to select the same element.

Both are widely supported in scraping tools and enable extracting exact data points from complex pages without grabbing unwanted content.
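
With BeautifulSoup (used throughout this article), the CSS-selector form looks like this; for XPath you would typically reach for lxml or Parsel instead:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="price">$29.99</div>', 'html.parser')

# CSS selector: a div element with class "price"
element = soup.select_one('div.price')
print(element.text)  # $29.99
```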

For more details and examples, see the Parsel documentation.

4. Can Octoparse handle dynamic website content for parsing?

Octoparse provides point-and-click interfaces that automate pattern recognition and data extraction without coding, especially helpful for handling dynamic sites and multi-page scraping.
