Most articles on browser automation talk about one tool at a time. The more useful question is what the space looks like — what kinds of runtimes exist, what each kind is good at, and which axis actually matters when you’re picking one for scraping. The most decisive axis isn’t language or browser engine; it’s whether the runtime runs headless by default or headed by design. Almost the entire ecosystem sits on the headless side. The integrated scraping platforms — with Octoparse the clearest example — sit on the headed side. Knowing which side a runtime is on tells you more about how it will behave on real sites than any spec sheet.

At a glance

| Tool | Interface | Browser engine(s) | Default mode | Type | Strongest at |
|---|---|---|---|---|---|
| Puppeteer | Node API | Chromium | Headless | OSS library | Chrome scraping in JS |
| Playwright | Node, Python, Java, .NET | Chromium, Firefox, WebKit | Headless | OSS library | Cross-browser, modern code |
| Selenium | Most languages | Most browsers (WebDriver) | Configurable | OSS library | Widest browser & legacy support |
| Splash | HTTP API (Lua) | WebKit | Headless | OSS service | JS rendering inside Scrapy |
| WebdriverIO | Node | WebDriver / CDP | Configurable | OSS library | Test-style scraping in Node |
| chromedp / Rod | Go | Chromium | Headless | OSS library | Go-native scrapers |
| Pyppeteer | Python | Chromium | Headless | OSS library | Puppeteer-shaped API in Python |
| HtmlUnit | Java | Pure-Java browser | Headless | OSS library | JVM scraping without a browser binary |
| puppeteer-extra-stealth | Node plugin | Chromium | Headless | OSS plugin | Puppeteer + bot evasion |
| undetected-chromedriver | Python | Chromium via Selenium | Configurable | OSS library | Selenium + bot evasion |
| nodriver | Python | Chromium via CDP | Headless | OSS library | Modern stealth, no driver binary |
| Patchright | Node | Chromium | Headless | OSS fork | Playwright + stealth patches |
| Browserless | HTTP API | Chromium | Headless | Cloud / self-host | Hosted browsers behind an API |
| Browserbase | HTTP API | Chromium | Headless | Cloud service | Managed browsers for AI agents |
| Steel.dev | HTTP API | Chromium | Headless | Cloud / OSS | OSS-friendly cloud browsers |
| Bright Data Scraping Browser | HTTP API | Chromium | Headless | Cloud service | Browser + built-in unblocking |
| Zyte API | HTTP API | Chromium | Headless | Cloud service | Browser + anti-bot handling |
| ScrapingBee | HTTP API | Chromium | Headless | Cloud service | Simple “render this URL” API |
| ScrapingAnt | HTTP API | Chromium | Headless | Cloud service | Budget scraping with proxies |
| Apify Browser Actors | Apify platform | Chromium, Firefox | Configurable | Cloud platform | Apify-native large-scale scraping |
| Octoparse | Visual workflow + cloud | Electron Chromium, Chrome for Testing | Headed | Integrated platform | No-code, WYSIWYG selection, headed by design |
| ParseHub | Visual workflow + cloud | Chromium | Headed | Integrated platform | No-code, similar concept |
The Octoparse row marks the lane where “headed” is the default rather than an option. ParseHub is the only other entry that shares it; that’s the lane Octoparse owns.

Open-source automation libraries

This is where most code-based scraping starts. Puppeteer drives Chromium from Node: fast, modern, and Chrome-first (it can also drive Firefox, but Chrome is the well-trodden path). Playwright, often described as Puppeteer’s successor, covers Chromium, Firefox, and WebKit across Node, Python, Java, and .NET; for a new project today, it’s usually the better default unless you specifically need Chrome-only. Selenium is the elder statesman: slower and heavier, but it speaks to nearly every browser through the WebDriver protocol, which still matters when a project needs Safari, Edge legacy, or mobile-browser bindings.

Outside the big three, the field branches by language and ecosystem. Splash is the JS-rendering service that fits inside Scrapy pipelines, scripted in Lua. WebdriverIO brings a WebDriver/CDP-driven API to Node-heavy projects with a test-runner feel. In Go, chromedp and Rod are the two practical choices, with Rod often preferred for ergonomics. Pyppeteer is the Python port of Puppeteer for teams that want Puppeteer’s shape without leaving Python. HtmlUnit is the outlier: a pure-Java browser implementation with no Chromium binary involved, useful when the JVM ecosystem matters more than JS-engine fidelity.

Most of these run headless by default; Selenium and WebdriverIO are configurable either way. All of them can run headed, but the friction is real: you need a display (or a virtual one like Xvfb), and most scripts in the wild don’t bother. Their normal posture is invisible.
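The display requirement can be made concrete with a minimal sketch in Playwright's Python API. The helper names (`has_display`, `resolve_headless`, `fetch_html`) are illustrative, and the display check is an assumption: it treats `DISPLAY` or `WAYLAND_DISPLAY` as the signal on Linux and assumes a desktop session on other platforms.

```python
import os
import sys
from typing import Mapping


def has_display(env: Mapping[str, str], platform: str = sys.platform) -> bool:
    """Assumption: Linux needs DISPLAY (X11 or Xvfb) or WAYLAND_DISPLAY;
    other platforms are assumed to have a desktop session."""
    if not platform.startswith("linux"):
        return True
    return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))


def resolve_headless(want_headed: bool, env: Mapping[str, str],
                     platform: str = sys.platform) -> bool:
    """Headless unless headed was requested and a display exists."""
    if not want_headed:
        return True
    return not has_display(env, platform)


def fetch_html(url: str, want_headed: bool = False) -> str:
    """Render a page in Chromium and return its HTML."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    headless = resolve_headless(want_headed, os.environ)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()
```

On a server without a display, `fetch_html(url, want_headed=True)` quietly falls back to headless in this sketch; starting the script under a virtual display such as Xvfb would let the headed request through.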

Stealth and anti-detection variants

When a target site fingerprints the runtime, plain Puppeteer or Selenium gets caught quickly: navigator.webdriver, missing plugins, the headless Chrome user-agent, canvas / WebGL anomalies. The stealth variants patch those leaks. puppeteer-extra-stealth (published on npm as puppeteer-extra-plugin-stealth) is the most established: a Puppeteer plugin that ships a stack of evasions for the common headless fingerprints. undetected-chromedriver does the same for Selenium-driven Chrome and is the go-to in the Python anti-bot space. nodriver is a newer, driver-less CDP approach from the same author, designed to look like an organic browser session from the network up. Patchright is a Playwright fork with similar stealth patches baked in, for teams already on Playwright.

These don’t change the headless/headed posture: each inherits the default of the tool it patches. They reduce the gap between headless and “real,” but they’re playing defense against a continuously updated detection layer.
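To show the kind of check these variants are patching against, here is a simplified sketch of a detector-side test over a few of the signals named above. The `fp` dictionary and its keys are hypothetical (a real anti-bot service collects far more, including canvas and WebGL data, which this sketch omits).

```python
from typing import Any, List, Mapping


def headless_leaks(fp: Mapping[str, Any]) -> List[str]:
    """Return the headless giveaways found in a collected fingerprint.

    `fp` is a hypothetical dict of values read in the page, e.g.
    navigator.webdriver, navigator.userAgent, navigator.plugins.length.
    """
    leaks: List[str] = []
    if fp.get("webdriver") is True:
        leaks.append("navigator.webdriver is true")
    if "HeadlessChrome" in str(fp.get("user_agent", "")):
        leaks.append("user agent contains HeadlessChrome")
    if fp.get("plugin_count") == 0:
        leaks.append("no plugins reported")
    return leaks


plain = {
    "webdriver": True,
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0.0.0 Safari/537.36",
    "plugin_count": 0,
}
patched = {
    "webdriver": False,
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/120.0.0.0 Safari/537.36",
    "plugin_count": 5,
}
print(headless_leaks(plain))
print(headless_leaks(patched))
```

A stealth layer's job is to move a session from the first shape to the second; the detection side keeps adding checks, which is why the patches need ongoing maintenance.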

Cloud browser APIs

Instead of self-hosting browsers, you call an HTTP endpoint and get a rendered page or a controllable session back. Browserless is the most established: it works as managed cloud or self-hosted, and acts as a drop-in endpoint for Puppeteer or Playwright. Browserbase and Steel.dev are newer entrants oriented toward AI agents (Steel is OSS-friendly). Bright Data Scraping Browser and Zyte API bundle browser execution with anti-bot handling and unblocking infrastructure; you pay more, and in exchange you typically get a higher success rate against hard targets. ScrapingBee and ScrapingAnt are simpler “render this URL” APIs aimed at smaller teams. Apify is a platform of its own, with Browser Actors that combine cloud-hosted Chromium with Apify’s queueing and storage.

All of these run the browser somewhere on a server with no display attached. Headless is the only mode that makes economic sense in this category: you can’t see what they’re doing, only what they return.
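The "call an endpoint, get a page back" shape looks roughly like the sketch below. Everything provider-specific here is a placeholder: the endpoint URL, the `url` and `wait_ms` fields, and the bearer-token header are assumptions for illustration, and each real service defines its own request format.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the provider's documented URL.
RENDER_ENDPOINT = "https://api.example.com/v1/render"


def build_render_request(target_url: str, api_key: str,
                         wait_ms: int = 0) -> urllib.request.Request:
    """Build (but do not send) a POST asking the service to render a URL."""
    payload = {"url": target_url, "wait_ms": wait_ms}  # hypothetical fields
    return urllib.request.Request(
        RENDER_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def render(target_url: str, api_key: str) -> str:
    """Send the request and return the rendered HTML as text."""
    req = build_render_request(target_url, api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The caller never touches a browser process: the only observable surface is the response body, which is the trade-off the category makes.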

Integrated scraping platforms

This category is structurally different from everything above. Instead of a library you call from code, or an API you POST URLs to, an integrated platform gives you a visual workflow editor with a browser embedded inside it. You build the scraper by clicking on the page, not by writing selectors.

Octoparse is the clearest example. It runs two runtimes: a stripped, optimized Electron Chromium for everyday tasks, and Chrome for Testing driven by Puppeteer for sites that need a fully authentic browser. Crucially, both are headed by design: the browser window is visible because the visible page is the editor. ParseHub sits in the same category with a similar approach. Older entries like the Web Scraper.io Chrome extension share the headed lineage too, since a browser extension can only operate inside a headed Chrome window.

This is the only category where headed is the default rather than a configuration option. That isn’t a limitation; it’s the design choice the workflow depends on.

Headless vs headed: the axis that matters

The headless/headed split tracks who the runtime is for.

Headless makes sense when a developer is the operator. You’re writing code, reading logs, scaling out on servers without displays; you don’t need to see the page because you’re describing it programmatically. The whole ecosystem above the integrated-platform line is built on this assumption.

Headed makes sense when the page itself is the interface. You’re selecting elements visually, watching a task run, intervening on a login or CAPTCHA, debugging by seeing rather than logging. That’s the Octoparse posture, and it’s also why Octoparse’s stripped Electron runtime exists: headed isn’t necessarily heavy if the underlying browser is purpose-built for scraping rather than general browsing.

Two practical consequences fall out of this:
  • Bot detection. Real headed browsers — visible window, real rendering, real input events — leak fewer of the signals anti-bot services hunt for. Headless tools have to add stealth layers; headed-by-design platforms get this largely for free.
  • Operator skill. Headless tools assume engineering ownership: someone maintains the script, the proxies, the captcha solver, the deploy. Headed-by-design platforms assume the operator is closer to the data — analyst, ops, growth — and the platform owns the engineering.
For a deeper look at why headed-by-design is a deliberate choice rather than a missing feature, see Headed vs headless browsers.

How to pick

A few decision rules that hold up across most projects:
  • Writing your own scraper in code, just need to render a page? Playwright is the default. Puppeteer if Chrome-only and you’re already in Node. Selenium only if you need a browser those two don’t support.
  • Code-based scraper, target site fingerprints aggressively? Move to a stealth variant (puppeteer-extra-stealth, undetected-chromedriver, nodriver) — or skip to a cloud API that bundles anti-bot handling.
  • Don’t want to host browsers at all? Cloud APIs: Browserless / Browserbase / Steel for plain rendering; Zyte API / Bright Data Scraping Browser for unblocking-included.
  • Don’t want to write code at all? Integrated platform: Octoparse (or ParseHub). Headed by design, visual selection, runtime bundled with workflow and cloud execution.
  • Operator isn’t an engineer, and the target site has anti-bot defenses? This is the strongest case for headed by design — fewer detection signals to leak, and a human can step in when a CAPTCHA appears.
The runtime decision is rarely permanent. Many teams start in a cloud API for one-off rendering, move to a stealth-equipped code library for repeat jobs, and reach for an integrated platform when the operator needs to be someone other than the engineer.
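The decision rules above can be written down as a small function. This is one possible encoding, not a canonical one: the `Project` fields and the order in which the rules are checked are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    writes_code: bool                        # is the operator an engineer writing a scraper?
    hosts_browsers: bool = True              # willing to run browsers yourself?
    aggressive_fingerprinting: bool = False  # does the target fingerprint the runtime?
    needs_other_browser: bool = False        # a browser Playwright/Puppeteer don't support


def pick_runtime(p: Project) -> str:
    """Map a project profile to a runtime category, following the rules in order."""
    if not p.writes_code:
        return "Integrated platform: Octoparse (or ParseHub)"
    if not p.hosts_browsers:
        if p.aggressive_fingerprinting:
            return "Cloud API with unblocking: Zyte API / Bright Data Scraping Browser"
        return "Cloud API for plain rendering: Browserless / Browserbase / Steel"
    if p.aggressive_fingerprinting:
        return "Stealth variant: puppeteer-extra-stealth, undetected-chromedriver, nodriver"
    if p.needs_other_browser:
        return "Selenium"
    return "Playwright"


print(pick_runtime(Project(writes_code=True)))
print(pick_runtime(Project(writes_code=False, aggressive_fingerprinting=True)))
```

Because the choice is rarely permanent, a team can rerun the same function as its profile changes, for example when the operator stops being the engineer.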