What to collect
E-commerce scraping usually starts with a product catalog.

| Data type | Example fields |
|---|---|
| Product identity | Title, brand, ASIN/SKU/GTIN/UPC, model, product URL |
| Pricing | Current price, list price, discount, coupon, subscription price |
| Availability | In stock, out of stock, delivery estimate, seller availability |
| Seller data | Seller name, marketplace seller ID, fulfilled-by signal |
| Product content | Images, description, feature bullets, specifications |
| Reviews | Rating, review count, review text, review date, helpful votes |
| Ranking | Best-seller rank, search position, category rank |
| Variants | Size, color, pack count, style, region |
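One way to carry these fields through a pipeline is a single typed record per product observation. The sketch below is a minimal illustration covering a subset of the table: the class name `ProductRecord`, its field names, the `discount_pct` helper, and the sample values are all hypothetical and should be adapted to the sites being collected.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical schema for one scraped product observation."""
    # Product identity
    url: str
    title: str
    sku: Optional[str] = None
    brand: Optional[str] = None
    # Pricing (Decimal avoids float rounding on money)
    current_price: Optional[Decimal] = None
    list_price: Optional[Decimal] = None
    currency: Optional[str] = None
    # Availability and seller
    availability: Optional[str] = None
    seller_name: Optional[str] = None
    # Reviews and ranking
    rating: Optional[float] = None
    review_count: Optional[int] = None
    search_position: Optional[int] = None
    # Variant attributes, e.g. {"size": "M", "color": "blue"}
    variants: dict = field(default_factory=dict)
    # When this observation was scraped (UTC)
    scraped_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def discount_pct(self) -> Optional[Decimal]:
        """Percent off list price, or None if it cannot be computed."""
        if self.current_price is None or self.list_price is None:
            return None
        if self.list_price <= 0:
            return None
        off = (self.list_price - self.current_price) / self.list_price * 100
        return off.quantize(Decimal("0.1"))


# Sample values are made up for illustration.
record = ProductRecord(
    url="https://shop.example/p/123",
    title="Example Widget",
    sku="EX-123",
    current_price=Decimal("15.00"),
    list_price=Decimal("20.00"),
    currency="USD",
    availability="in_stock",
    variants={"color": "blue"},
)
print(record.discount_pct())
```

Keeping optional fields as `None` rather than empty strings makes it easier to tell "not collected" apart from "collected but empty" later in analysis.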
Common workflows
Catalog monitoring
Scrape category pages or search results to discover products, sellers, and rankings. Store product URLs and IDs as refresh targets.

Product detail enrichment
Visit detail pages for discovered products. Collect descriptions, specs, images, variants, seller information, and availability.

Review analysis
Collect reviews separately from product facts. Review pages often paginate independently and may require sorting by newest to support monitoring.

Price and stock tracking
Refresh selected products on a schedule. Store timestamped snapshots so the team can detect price changes, promotions, stockouts, and seller changes.

Platform differences
| Platform type | Notes |
|---|---|
| Large marketplaces | Rich data, heavy anti-bot defenses, many variants and sellers |
| Brand stores | Cleaner product structure, often Shopify or similar commerce platforms |
| Long-tail retailers | Less standardization, but lighter defenses |
| Review-heavy marketplaces | Strong sentiment value, separate review pagination |
| B2B catalogs | Often require login, quote requests, or region-specific pricing |
Data normalization
E-commerce data needs cleanup before analysis.

- Normalize currency and region.
- Convert pack prices into unit prices using the pack count.
- Separate product price from shipping.
- Standardize availability states.
- Map variants to parent products.
- Deduplicate identical products across URLs.
- Preserve source timestamps.
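A few of the cleanup steps above can be sketched as small pure functions. This is a minimal illustration, not a complete normalizer: the function names, the `AVAILABILITY_STATES` mapping, and the sample inputs are hypothetical, and the price parser assumes `,` is a thousands separator and `.` the decimal point, which does not hold for every locale.

```python
import re
from decimal import Decimal
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from raw site labels to a small set of states.
AVAILABILITY_STATES = {
    "in stock": "in_stock",
    "available": "in_stock",
    "out of stock": "out_of_stock",
    "sold out": "out_of_stock",
    "currently unavailable": "out_of_stock",
    "pre-order": "preorder",
}


def normalize_availability(raw: str) -> str:
    """Map a raw availability label to a standard state, else 'unknown'."""
    key = " ".join(raw.lower().split()).rstrip(".")
    return AVAILABILITY_STATES.get(key, "unknown")


def parse_price(raw: str) -> Optional[Decimal]:
    """Extract a decimal amount from text such as '$1,299.00'.

    Assumes ',' groups thousands and '.' marks decimals.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def unit_price(pack_price: Decimal, pack_count: int) -> Decimal:
    """Price per unit, rounded to two decimal places."""
    if pack_count <= 0:
        raise ValueError("pack_count must be positive")
    return (pack_price / pack_count).quantize(Decimal("0.01"))


def canonical_url(url: str) -> str:
    """Drop query string and fragment so URL variants compare equal.

    Assumes the query string carries no product identity, which is
    not true for every site.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


price = parse_price("$1,299.00")
per_unit = unit_price(Decimal("12.00"), 6)
state = normalize_availability("  In Stock. ")
key = canonical_url("https://Shop.example/p/123/?ref=abc#reviews")
```

Using the canonical URL (or a product ID where one exists) as the deduplication key lets the same product reached through different tracking links collapse into one record.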