A web scraper isn’t a single thing — it’s a pipeline. Whether you write it in Python, drive a headless browser yourself, or build it visually in a no-code tool, the same stages happen in the same order. Knowing the stages makes it much easier to see where a scraper breaks, where it slows down, and which technique fixes what.

1. Fetch the page: Send an HTTP request, get HTML back.
2. Render the page: Run the page’s JavaScript in a real browser.
3. Locate the data: Point at elements with XPath or CSS selectors.
4. Extract and refine: Pull values out, clean them up with regex.
5. Navigate the site: Work through pagination and login walls.
6. Store the output: CSV, JSON, a database, or a downstream API.

The first five stages run in order on every page, with navigate looping back to fetch for the next page. The whole loop runs at scale: in the cloud, behind rotating proxies, and past CAPTCHAs.

Fetch the page

Everything starts with a request. The scraper sends an HTTP request to a URL and receives a response — usually HTML, sometimes JSON. For simple, server-rendered pages this single step delivers everything you need. The catch is that the response is only the initial payload the server sends; on many modern sites, that’s a near-empty shell.
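
As a minimal sketch of this stage, the following uses only Python's standard library to request a URL and decode the body. The function name `fetch_html`, the User-Agent string, and the 10-second timeout are illustrative assumptions, not requirements of any particular scraper.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send one GET request and return the response body as text."""
    request = urllib.request.Request(
        url,
        # Hypothetical User-Agent; identify your client however is appropriate.
        headers={"User-Agent": "example-scraper/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset the response declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `fetch_html("https://example.com/")` returns only the initial payload, so on a JavaScript-heavy site the result may be the near-empty shell described above.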

Render the page

When a site builds its content with JavaScript, the data you want isn’t in the initial HTML — it appears only after the page’s scripts run. Rendering is the stage that executes those scripts in a real browser so the full content materializes. The browser runtime you use here determines how faithfully the page loads and how much it costs in memory and speed, and it’s the core of scraping JavaScript-rendered pages.
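
One rough way to decide whether a page needs this stage at all is to measure how much visible text the initial HTML contains. The sketch below is a heuristic under stated assumptions: the helper name `looks_like_js_shell` and the 200-character threshold are hypothetical, and a real scraper may need a different signal.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text that is not inside <script>, <style>, or <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """Return True when the HTML carries very little visible text."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    visible = " ".join(parser.chunks)
    return len(visible) < min_chars
```

If the check returns True, the content is probably produced by scripts, and the page is a candidate for rendering in a browser rather than parsing the raw response.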

Locate the data

Once the page is fully loaded, the scraper has to point at the specific pieces you want — a price, a title, a row in a table. This is done with XPath or CSS selectors, which describe where an element sits in the page structure. Good selectors are the difference between a scraper that survives small layout changes and one that breaks on every site update.
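
To make the difference between brittle and robust selectors concrete, here is a small sketch using the standard library's `xml.etree.ElementTree`. It supports only a limited subset of XPath and needs well-formed markup, so real pages usually call for a lenient HTML parser; the `PAGE` snippet and its class names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing used only for illustration.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2 class="title">Laptop</h2>
      <span class="price">$1,299.00 USD</span>
    </div>
    <div class="product">
      <h2 class="title">Mouse</h2>
      <span class="price">$19.50 USD</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Brittle: depends on the exact position of each element in the tree.
first_price_by_position = root.find("./body/div[1]/span[1]")

# More robust: matches on what the element is, not where it sits.
products = []
for card in root.findall(".//div[@class='product']"):
    products.append({
        "title": card.findtext("./h2[@class='title']"),
        "price": card.findtext("./span[@class='price']"),
    })
```

The positional path stops matching as soon as a wrapper element or banner is inserted above the first product, while the attribute-based paths keep working as long as the class names stay the same.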

Extract and refine

Locating an element gets you its raw content; extraction pulls the value out, and refining cleans it up. A scraped price might arrive as "$1,299.00 USD" when you only want the number. Regular expressions and post-processing rules strip, split, and reformat raw text into the structured fields you actually intend to store.

Navigate the site

Most data sets span more than one page. The scraper has to walk through pagination (next-page links, “load more” buttons, infinite scroll) to reach every record. Some content also sits behind login walls, which the scraper has to authenticate past before any of the earlier stages can run.
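
The price cleanup and the next-page loop can be sketched together. Everything named here is hypothetical: `PAGES` is an in-memory stand-in for a real site, `parse_price` assumes US-style numbers with comma thousands separators, and `crawl` follows only explicit next links (not "load more" buttons or infinite scroll).

```python
import re
from decimal import Decimal

# Hypothetical in-memory "site": raw prices per page plus an optional next link.
PAGES = {
    "/items?page=1": {"prices": ["$1,299.00 USD", "$19.50 USD"], "next": "/items?page=2"},
    "/items?page=2": {"prices": ["USD 2,000"], "next": None},
}

# A digit, then digits or commas, then an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(raw: str) -> Decimal:
    """Pull the first number out of raw text and drop thousands separators."""
    match = _NUMBER.search(raw)
    if match is None:
        raise ValueError(f"no number found in {raw!r}")
    return Decimal(match.group().replace(",", ""))


def crawl(start: str, max_pages: int = 100) -> list:
    """Follow next-page links from `start`, collecting cleaned prices."""
    results = []
    seen = set()
    url = start
    # Stop on a missing next link, a repeated URL, or the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = PAGES[url]  # stand-in for fetch + render + locate
        results.extend(parse_price(raw) for raw in page["prices"])
        url = page["next"]
    return results
```

Tracking visited URLs and capping the page count are defensive choices: some sites link the last page back to itself, which would otherwise loop forever.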

Run it at scale

A scraper that works once on your laptop still has to survive thousands of runs. Running in the cloud keeps tasks going without tying up your machine. Rotating proxies spread requests across many IP addresses so a site doesn’t rate-limit or block you. CAPTCHAs and anti-bot services like Cloudflare are the obstacles this stage exists to handle.
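
Proxy rotation can be as simple as cycling through a list of endpoints. The sketch below is a round-robin version built on the standard library; the proxy addresses are placeholders, and the helper name `next_opener` is an assumption made for this example.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; substitute the ones your provider gives you.
PROXIES = [
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
    "http://proxy3.example:8080",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener), where the opener routes traffic via that proxy."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)
```

Each request would then go through a fresh opener, for example `proxy, opener = next_opener()` followed by `opener.open(url, timeout=10)`. Round-robin is the simplest policy; a production setup would likely also drop proxies that fail and back off after rate-limit responses.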

Store the output

Finally, the extracted data lands somewhere usable — a CSV or Excel file, a JSON export, a database, or a downstream API. This is the stage that turns a scraping run into a dataset.
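
For the file-based targets, a minimal sketch with Python's `csv` and `json` modules might look like this. The `records` list and the helper names `to_csv` and `to_json` are made up for illustration; the functions return strings so the same output could be written to a file or sent to an API.

```python
import csv
import io
import json

# Hypothetical records, shaped like the output of the earlier stages.
records = [
    {"title": "Laptop", "price": "1299.00"},
    {"title": "Mouse", "price": "19.50"},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize a list of dicts to indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records, ["title", "price"])
json_text = to_json(records)
```

When writing CSV straight to disk, open the file with `newline=""` so the `csv` module controls line endings itself.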

How Octoparse fits

With a hand-coded scraper, you wire each of these stages together yourself: a request library, a browser driver, a selector engine, cleanup code, pagination logic, proxy management, and an export step. Octoparse collapses the whole pipeline into one visual workflow. You point and click to set selectors, and rendering, pagination, proxy rotation, cloud execution, and export are built in — configured rather than coded. The stages are the same; what changes is that you describe what to collect instead of engineering how each stage runs.