A web scraper isn’t a single thing; it’s a pipeline. Whether you write it in Python, drive a headless browser yourself, or build it visually in a no-code tool, the same stages happen in the same order. Knowing the stages makes it much easier to see where a scraper breaks, where it slows down, and which technique fixes what.
The first five stages run in order on every page; navigate loops back to fetch for the next page, and the whole loop runs at scale — in the cloud, behind rotating proxies, past CAPTCHAs.
Fetch the page
Everything starts with a request. The scraper sends an HTTP request to a URL and receives a response, usually HTML, sometimes JSON. For simple, server-rendered pages this single step delivers everything you need. The catch is that the response is only the initial payload the server sends; on many modern sites, that’s a near-empty shell.
Render the page
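Whether a page needs rendering can often be judged from the fetch response itself. The sketch below uses only Python's standard library; the function names `fetch` and `needs_rendering`, the `User-Agent` string, and the two sample documents are illustrative assumptions, not part of any particular tool. The check is a rough heuristic: if text you can see in a browser is missing from the raw response, scripts are probably filling it in later.

```python
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> str:
    """Send one GET request and return the decoded response body."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def needs_rendering(html: str, expected_text: str) -> bool:
    """Heuristic: True when text visible in a browser is absent from the raw HTML."""
    return expected_text not in html


# Hypothetical responses: a server-rendered page and a JavaScript shell.
SERVER_RENDERED = "<html><body><h1>Laptop Pro 14</h1></body></html>"
JS_SHELL = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

print(needs_rendering(SERVER_RENDERED, "Laptop Pro 14"))
print(needs_rendering(JS_SHELL, "Laptop Pro 14"))
```

In practice you would pass the result of `fetch(url)` to `needs_rendering` along with a string you expect to find on the page, such as a known product title.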
When a site builds its content with JavaScript, the data you want isn’t in the initial HTML; it appears only after the page’s scripts run. Rendering is the stage that executes those scripts in a real browser so the full content materializes. The browser runtime you use here determines how faithfully the page loads and how much it costs in memory and speed, and it’s the core of scraping JavaScript-rendered pages.
Locate the data
Once the page is fully loaded, the scraper has to point at the specific pieces you want: a price, a title, a row in a table. This is done with XPath or CSS selectors, which describe where an element sits in the page structure. Good selectors are the difference between a scraper that survives small layout changes and one that breaks on every site update.
Extract and refine
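The selector step that feeds extraction can be sketched with the limited XPath support in Python's standard `xml.etree.ElementTree`. This is an illustration under assumptions: the product markup is hypothetical and deliberately well-formed, whereas real pages are rarely valid XML and usually call for a dedicated HTML parser. The example compares an attribute-based selector with a positional one after a small layout change.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product markup used only for illustration.
PAGE = """
<html><body>
  <div class="banner">Free shipping</div>
  <div class="product">
    <h2 class="title">Laptop Pro 14</h2>
    <span class="price">$1,299.00 USD</span>
  </div>
</body></html>
"""

# The same page after a small redesign that drops the banner.
PAGE_AFTER_REDESIGN = PAGE.replace('<div class="banner">Free shipping</div>', "")


def locate_price(page_xml: str):
    """Return the price element found by attribute and by position."""
    root = ET.fromstring(page_xml)
    # Attribute-based: matches the element wherever it sits in the tree.
    by_attribute = root.find(".//span[@class='price']")
    # Positional: assumes the product is the second <div> inside <body>.
    by_position = root.find("./body/div[2]/span")
    return by_attribute, by_position


before = locate_price(PAGE)
after = locate_price(PAGE_AFTER_REDESIGN)

print(before[0].text, "|", before[1].text)
print(after[0].text, "|", after[1])
```

Both selectors find the price on the original page. Once the banner is gone, the positional path matches nothing and returns `None`, while the attribute-based selector still returns the element.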
Locating an element gets you its raw content; extraction pulls the value out, and refining cleans it up. A scraped price might arrive as "$1,299.00 USD" when you only want the number. Regular expressions and post-processing rules strip, split, and reformat raw text into the structured fields you actually intend to store.
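A refinement rule for the price example above might look like the following sketch. The function name `refine_price` and the pattern are assumptions chosen for illustration, and the pattern assumes US-style formatting (comma as thousands separator, period as decimal point); other locales would need a different rule.

```python
import re
from decimal import Decimal

# Digits with optional thousands commas and decimals, then an optional
# three-letter currency code. Assumes US-style number formatting.
PRICE_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([A-Z]{3})?")


def refine_price(raw: str):
    """Split raw price text into a (Decimal amount, currency code or None) pair."""
    match = PRICE_PATTERN.search(raw)
    if match is None:
        raise ValueError(f"no price found in {raw!r}")
    # Strip the thousands separators before converting to a number.
    amount = Decimal(match.group(1).replace(",", ""))
    currency = match.group(2)
    return amount, currency


amount, currency = refine_price("$1,299.00 USD")
print(amount, currency)
```

`Decimal` is used instead of `float` so that the stored amount keeps the exact digits that appeared on the page.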