When you’re picking a tool for web scraping, the decision that matters most isn’t language or browser engine — it’s whether the browser runs headless (no visible window, programmatic only) or headed (a real, visible browser window). Almost the entire ecosystem — Puppeteer, Playwright, Selenium, every cloud browser API — defaults to headless. A smaller category, the integrated scraping platforms, runs headed by design; Octoparse is the clearest current example. The two choices serve different operators, succeed against different sites, and break in different ways.

Why headless became the default

Headless browsers were built for developers. They run on servers without displays, fit cleanly into containers and CI/CD pipelines, skip the rendering overhead a human user needs (window chrome, GPU compositing, autofill, telemetry), and let one machine run many concurrent sessions. When the operator is writing code, reading logs, and describing the page programmatically, there’s no reason to see it. Everything in the browser runtime landscape above the integrated-platform tier — Puppeteer, Playwright, Selenium, Splash, every cloud browser API — assumes this posture. Headless is the silent default of code-based scraping.

What headless costs you

The cost shows up in four places, and it shows up consistently:
  • Bot detection signals. navigator.webdriver reads true under automation, the default headless user-agent contains HeadlessChrome, plugins are missing, and canvas and WebGL produce anomalous fingerprints. Anti-bot services like Cloudflare and DataDome are tuned to spot these. The whole “stealth variant” sub-ecosystem (puppeteer-extra-plugin-stealth, undetected-chromedriver, nodriver, Patchright) exists because plain headless leaks them.
  • Viewport-driven behavior. Lazy-loaded images, intersection-observer content, visibility-gated scripts — these are designed around a page actually being rendered and “seen.” Headless can fake the viewport, but the boundary is fragile and the behaviors are easy to miss.
  • No human in the loop. A CAPTCHA, an unexpected modal, a session-expired login — headless is blind. The script doesn’t know it’s stuck, only that it stopped returning data.
  • Debugging by log, not by sight. When a selector breaks on page 42 of an overnight run, you reproduce it locally — often headed — to actually see what changed. The debugging tool is the headed posture; the production tool isn’t.
None of these are dealbreakers. They’re the steady tax that code-based, headless scraping pays.
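The first item in the list above can be made concrete with a small check. The sketch below tests a snapshot of browser properties against three of the signals named there: navigator.webdriver, a HeadlessChrome user-agent, and an empty plugin list. The `BrowserSnapshot` type and `headless_signals` function are hypothetical names for illustration only. Real anti-bot services combine many more signals and do not publish their rules, so treat this as a way to reason about leaks, not a model of any vendor's detector.

```python
from dataclasses import dataclass


@dataclass
class BrowserSnapshot:
    """Hypothetical container for values read from a page (illustration only)."""
    webdriver: bool      # value of navigator.webdriver
    user_agent: str      # value of navigator.userAgent
    plugin_count: int    # value of navigator.plugins.length


def headless_signals(snap: BrowserSnapshot) -> list[str]:
    """Return the names of the headless tells present in the snapshot."""
    signals = []
    if snap.webdriver:
        signals.append("navigator.webdriver is true")
    if "HeadlessChrome" in snap.user_agent:
        signals.append("user-agent contains HeadlessChrome")
    if snap.plugin_count == 0:
        signals.append("no plugins reported")
    return signals


# Two made-up snapshots: a plain headless session and a desktop Chrome session.
plain_headless = BrowserSnapshot(
    webdriver=True,
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36",
    plugin_count=0,
)
desktop_chrome = BrowserSnapshot(
    webdriver=False,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    plugin_count=5,
)

print(headless_signals(plain_headless))
print(headless_signals(desktop_chrome))
```

Running the same check against your own scraper's browser is a quick way to see which of these tells it exposes before pointing it at a protected site.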

The headed-by-design category

There’s a smaller category of scraping tools where the page isn’t an internal implementation detail — it is the interface. The operator selects elements by clicking on a rendered page; watches a task execute step by step; sees a CAPTCHA when it appears and clears it; debugs by looking. Octoparse is the clearest current example. ParseHub follows the same pattern, and the older Web Scraper.io Chrome extension shares the lineage. The economics of headed-by-design only work if the runtime is purpose-built. A stock Chrome with all the human-user machinery (extensions, syncing, autofill, full GPU compositing, telemetry) is too heavy to run at scraping scale. So integrated platforms ship runtimes built specifically for the headed-scraping case — heavy enough to be authentic, light enough to run densely.

Inside Octoparse’s two headed runtimes

Octoparse ships two purpose-matched headed runtimes, and switches between them based on what the target site demands.

Electron Chromium, stripped and optimized

The first is a customized Chromium runtime built into Electron. Rather than running a stock browser, Octoparse has stripped and optimized this runtime specifically for scraping — removing unnecessary overhead like extensions, background processes, and rendering features that a human user needs but a scraper doesn’t. The result is a lightweight engine that loads pages faster, consumes significantly less memory and CPU, and can handle many concurrent sessions without bogging down a machine. Compared to running a full browser instance through Puppeteer or Selenium, this purpose-built approach offers a noticeable performance advantage, particularly when running tasks locally or on hardware with limited resources. The tight integration with Octoparse’s visual editor also means users configure and execute tasks in the same environment — no context switching between tools.

Chrome for Testing driven by Puppeteer

The second is Chrome for Testing driven by Puppeteer. Chrome for Testing is Google’s Chrome build aimed at automation: a full Chrome browser, pinned to a version and without auto-update, controlled programmatically and rendering pages the way a user’s Chrome would. It’s the better option for sites with aggressive bot detection, fingerprinting, or compatibility checks that expect a standard Chrome environment. It’s heavier on resources than the Electron runtime, but the browser authenticity it provides is sometimes essential.

When to use which

The key advantage of having both built in is flexibility without complexity. With standalone tools like Puppeteer or Playwright, users need to manage browser binaries, handle versioning, configure launch options, and deal with infrastructure concerns themselves. Octoparse abstracts all of that away. The optimized Electron runtime handles the vast majority of tasks efficiently, while Chrome for Testing serves as a ready fallback when full browser fidelity is needed — and switching between them is a configuration choice, not an engineering project. The team has also indicated that additional runtime options are on the roadmap, reflecting the reality that no single browser engine is ideal for every scraping scenario. Whether a task runs on your own machine or in the cloud, the runtime choice stays the same simple toggle.
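The default-plus-fallback rule described here can be sketched as a small selection function. Everything named in the block (`SiteProfile`, `pick_runtime`, the two runtime labels) is hypothetical and chosen for illustration. It is not Octoparse's API or configuration format, only one way to express "use the lightweight runtime unless the site demands full Chrome fidelity."

```python
from dataclasses import dataclass

# Hypothetical labels for the two runtimes; not Octoparse identifiers.
ELECTRON_CHROMIUM = "electron-chromium"
CHROME_FOR_TESTING = "chrome-for-testing"


@dataclass
class SiteProfile:
    """Hypothetical description of a target site (illustration only)."""
    aggressive_bot_detection: bool = False
    fingerprinting: bool = False
    expects_standard_chrome: bool = False


def pick_runtime(site: SiteProfile) -> str:
    """Default to the lightweight runtime; fall back when full fidelity is needed."""
    needs_full_chrome = (
        site.aggressive_bot_detection
        or site.fingerprinting
        or site.expects_standard_chrome
    )
    return CHROME_FOR_TESTING if needs_full_chrome else ELECTRON_CHROMIUM


print(pick_runtime(SiteProfile()))                     # site with no special demands
print(pick_runtime(SiteProfile(fingerprinting=True)))  # site that fingerprints
```

The three flags mirror the conditions listed for the Chrome for Testing runtime: aggressive bot detection, fingerprinting, or checks that expect a standard Chrome environment.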

When headed wins, when headless wins

Pick this | When
Headless code library (Puppeteer / Playwright / Selenium) | A developer owns and operates the scraper, you’re scaling on servers, and the target site isn’t heavily anti-bot
Headless cloud API (Browserless / Zyte / Browserbase) | Same as above, but you don’t want to host the browser yourself
Headless + stealth variant (puppeteer-extra-plugin-stealth, undetected-chromedriver, nodriver) | Same as above, but the target site fingerprints aggressively
Headed-by-design platform (Octoparse / ParseHub) | The operator isn’t an engineer; the target site has serious anti-bot defenses; you need to debug visually or intervene on CAPTCHAs and logins
The choice isn’t “easier” or “better” in the abstract — it’s about who the operator is and what the target site does. A developer scraping a public-data site can run plain Puppeteer and skip every other concern on this page. The case for headed grows as the target hardens and the operator moves away from code. For the full enumeration of tools on each side, see The browser runtime landscape.
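One way to read the table above is as a short decision procedure. The sketch below is an interpretation, not a rule stated on this page. The function name, its parameters, and the order of the checks are assumptions, and the table's last row is read as "any of these conditions" where the original leaves that open.

```python
def recommend_posture(
    operator_is_developer: bool,
    needs_human_intervention: bool,
    site_fingerprints_aggressively: bool,
    self_hosted: bool,
) -> str:
    """Map operator and site traits to one of the four rows of the table."""
    # Non-engineer operators, or tasks needing visual debugging or CAPTCHA and
    # login intervention, point to the headed-by-design row.
    if not operator_is_developer or needs_human_intervention:
        return "headed-by-design platform"
    # A developer facing aggressive fingerprinting reaches for a stealth variant.
    if site_fingerprints_aggressively:
        return "headless + stealth variant"
    # A developer who does not want to host the browser uses a cloud API.
    if not self_hosted:
        return "headless cloud API"
    # Otherwise a plain headless library is enough.
    return "headless code library"


# A developer scraping a public-data site on their own servers.
print(recommend_posture(True, False, False, True))
# An analyst who needs to clear CAPTCHAs by hand.
print(recommend_posture(False, True, True, False))
```

The ordering puts the operator first and the site second, which matches the closing point that the case for headed grows as the operator moves away from code and the target hardens.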