
Is Web Scraping Legal and Why

Generally, scraping public information from websites is legal, whereas scraping private account data will raise privacy concerns. Here are a few popular use cases to show how different industries do web scraping in a well-accepted way.

E-commerce: Retailers use web scraping to automate marketplace price monitoring, build up product profiles, and collect customer reviews for sentiment analysis across online shopping platforms like Amazon and eBay.

Marketing and Advertising: Content creators apply web crawling to collect data from various social media platforms like Twitter and YouTube to generate new ideas for content marketing and understand what audiences are interested in.

Real Estate: Realtors scrape listings from property websites like Realtor.com to aggregate loads of research data for comparison. In that way, they predict if the real estate market will skyrocket any time soon or see in what price range their property will compete.

How You Use Scraped Data

If you’re crawling the web for your own purposes, such as market research or academic research, the legal risk is generally low, and some of these uses may qualify under the fair use doctrine. The complications start when you use scraped data on behalf of others, especially for commercial purposes.

According to Wikipedia, eBay v. Bidder’s Edge, 100 F.Supp.2d 1058 (N.D. Cal. 2000), was a leading case applying the trespass to chattels doctrine to online activities. In 2000, eBay, an online auction company, successfully used the ‘trespass to chattels’ theory to obtain a preliminary injunction preventing Bidder’s Edge, an auction data aggregator, from using a ‘crawler’ to gather data from eBay’s website. The opinion’s analysis has since been criticized in more recent jurisprudence.

The versatility of web scraping allows access to data so easily that it would be natural to worry about potential information abuse or misuse. For people who want to decrease the likelihood of legal controversies in web scraping, it is important to identify the legal risks around web scraping.

Here comes the ultimate question: to scrape or not to scrape? Is web scraping illegal or not? What are the potential legal implications of using web scraping? Unfortunately, there is no short answer to these questions. Because web scraping is still relatively new in a legal context, the line between legitimate and abusive use of this technique is hard to define in most countries. For a decade or so, web scraping was only guided by a set of related, fundamental legal theories and laws, such as:

  • Copyright Infringement
  • Breach of Contract
  • Violation of the Computer Fraud and Abuse Act (CFAA)
  • Trespass to Chattels

In most countries, law enforcement specifically for web scraping is not clearly defined yet. However, with the onset of GDPR regulations, more and more people have realized the need to comply with legal standards before proceeding with a scraping project to avoid falling into a tricky legal situation. As international legal circumstances vary widely, this part only discusses the legal risks of web scraping in the United States and Europe.

The Case of the United States

In the US, the law regarding web scraping is still developing and implicates a large number of statutory regimes and areas of common law. There are several major types of legal claims that website owners can use to deter undesired web scraping. For example, web-scraping activity may implicate federal statutes, such as the Computer Fraud and Abuse Act (CFAA), Digital Millennium Copyright Act (DMCA) and insider trading laws; state blue sky laws; privacy laws; and common law claims, such as breach of contract, fraud, and trespass to chattels.

CFAA

The CFAA proscribes “intentionally access[ing] a computer without authorization or exceed[ing] authorized access, and thereby obtain[ing] … information from any protected computer.” Courts have disagreed, however, on what constitutes access without authorization or exceeding authorization.

In hiQ Labs, Inc. v. LinkedIn Corp., the court held that a user’s act of accessing data made available by the owner to the general public does not constitute access “without authorization” under the CFAA. Yet in Facebook, Inc. v. Power Ventures, Inc., the court held that a user accesses a computer “without authorization” when he or she continues to circumvent technological measures employed by the operator to block that user’s access.

Some significant court decisions in 2020 also bear on whether scraping data that one is authorized to access for certain purposes (such as browsing as a potential customer or participating as a member of a social media network) but not authorized to access for web-scraping purposes constitutes a breach of the CFAA. We are not going to elaborate on that in this article. In short, although the scope of the CFAA’s access provision is unsettled, significant authority suggests that the scraping of publicly available information, such as from LinkedIn member profiles, does not violate the CFAA. Likewise, it suggests that violation of a website’s terms of use alone, without more, may not violate the CFAA.

Copyright/DMCA

The operator of a website that is the target of web scraping may bring a claim for copyright infringement against the user of the web-scraping device by proving:

  • its ownership of a valid copyright;
  • the user’s copying of the original elements of the work in question.

At least one federal court has held that a party faces liability under Section 1201(a)(1)(A) of the DMCA when it uses bots to circumvent security measures that control nonhuman access to copyrighted material on a webpage.

It is also worth noting the general copyright principle that, although compilations of facts can be protected by copyright, authors may not copyright their ideas or the facts they narrate. Accordingly, if the data scraped are purely facts without a creative component, then there is no copyright claim.

Privacy Statutes

Web scraping may also implicate the privacy statutes of states and other jurisdictions. For example, the E.U.’s General Data Protection Regulation and the California Consumer Privacy Act of 2018 grant consumers a variety of rights and protections with respect to their personal information. Web-scraping activity that compiles personally identifiable information could implicate a variety of privacy statutes – and potentially subject a web scraper to government and private litigation.

Insider Trading

Under certain circumstances, web scraping could also potentially violate federal insider trading law or state blue sky laws. For example, using affirmative misrepresentations to obtain material nonpublic information through web scraping and then trading based on that information could potentially constitute insider trading.

However, the law in this area is unsettled, and it remains to be seen how strict an approach regulators and law enforcement may take when deciding what constitutes a breach of duty or deception in the web-scraping context.

Breach of contract

In addition to the boundaries imposed by the statutes discussed above, a plaintiff could seek to invoke various common law remedies in an attempt to stem or curtail web scraping. For instance, some website operators have attempted to assert claims for breach of contract against alleged web scrapers. Courts, however, have held that defendants must be on notice of a website’s terms of service for the terms to be enforced against them.

The Case of Europe

Today, 69% of the population above the age of 16 in the EU have heard about the GDPR, and 71% have heard about their national data protection authority, according to results published in a survey from the EU Fundamental Rights Agency.

Though still in its infancy, the GDPR is one of the most comprehensive and impactful data protection laws to date. It has radically changed how businesses scrape the web in Europe. If your scraping project requires you to collect personally identifiable information (PII), make sure you’re GDPR-compliant to avoid hefty fines. See our blog on GDPR: GDPR Compliance In Web Scraping, which covers almost everything you need to know about GDPR.

Important regulations include:

  • GDPR requires lawful basis for processing personal data, even if public
  • Digital Single Market Directive permits data mining for research and innovation
  • Database Directive protects substantial database investments
  • National variations across 27 EU member states

UK Post-Brexit: Similar protections under UK GDPR, Data Protection Act 2018, and Computer Misuse Act.

The key difference: unlike the US approach, the EU treats publicly available personal data as still requiring a lawful basis, such as consent or legitimate interest.

China, India, and Japan are all implementing frameworks that make European compliance look simple.

China’s new reality is harsh but predictable. The January 2025 Network Data Security Management Regulations established mandatory data localization for major operators and enhanced audit requirements. The good news is if you focus on “publicly available information” and avoid personal data entirely, compliance becomes much simpler.

India’s Digital Personal Data Protection Act creates unique opportunities through its narrow territorial scope. Unlike GDPR’s complex balancing tests, India’s framework provides a clear exemption for “publicly available information” made available by data subjects themselves. Smart companies are restructuring operations to leverage this exemption.

Japan is taking the most innovation-friendly approach. Proposed 2025 amendments would permit personal data use for AI training without individual consent, while reciprocal adequacy arrangements with the EU and UK create streamlined compliance pathways.

Structure data collection across jurisdictions based on each region’s comparative regulatory advantages while maintaining global compliance standards.

TL;DR:

  1. Check robots.txt files before scraping
  2. Respect rate limits to avoid server overload
  3. Review terms of service for scraping policies
  4. Use APIs when available as the preferred method
  5. Avoid personal data unless compliant with privacy laws
  6. Don’t republish copyrighted content without permission
  7. Implement proper attribution for data sources
  8. Monitor legal developments as regulations evolve

On the whole, the law on web scraping is still developing, and only further court decisions and legal pronouncements will thoroughly define its parameters. To avoid being involved in lawsuits, the following is a non-exhaustive list of practical tips for users who have engaged in web scraping.

1. Respect and follow the Terms of Service.

Always review the website’s Terms of Service (ToS) and robots.txt file before starting any web scraping or data collection activity. If possible, get prior permission from the owner of the website.
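As a rough illustration of the robots.txt check, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `example-research-bot` user agent are all made up for the example. A real scraper would load the live file from the target site instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
# In practice you would call set_url("https://<site>/robots.txt") and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-research-bot"  # hypothetical user agent name


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent may fetch each (hypothetical) URL.
allowed_listing = parser.can_fetch(USER_AGENT, "https://example.com/listings/1")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Crawl-delay is a non-standard directive; crawl_delay() returns None if absent.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_listing, allowed_private, delay)
```

Passing a robots.txt check does not replace reading the Terms of Service. It only tells you what the site operator has asked crawlers to avoid.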

2. Scrape at a reasonable and moderate rate.

Be gentle and don’t be aggressive. Give the scraped website some breathing space. When you’re scraping, you should hit the website within a reasonable time interval and keep the number of requests in control. Avoid adversely impacting a website’s physical operation, which could lead to a claim for trespass to chattels or similar claims.
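One simple way to keep requests "within a reasonable time interval" is to enforce a minimum gap between consecutive requests. The sketch below is a hypothetical `Throttle` helper, not part of any scraping library, and the two-second interval is an arbitrary assumption. The right value depends on the site and on any Crawl-delay it publishes.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep until the interval has elapsed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Usage sketch: call wait() before every request.
# throttle = Throttle(min_interval=2.0)  # assumed interval of 2 seconds
# for url in urls:
#     throttle.wait()
#     fetch(url)  # fetch() is a placeholder for your own request code
```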

3. Monitor and consider any actions a website takes to restrict web scraping.

If a website clearly restricts your web scraping activities with various anti-scraping measures, such as the use of CAPTCHAs, rate limits, blocking of IP addresses, etc., you need to be cautious of potential legal risks. Be prepared to stop if asked to do so through a cease-and-desist letter or otherwise.
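To make "monitor any actions a website takes" concrete, a scraper can check each response for signs that it is being blocked and stop rather than work around them. The following Python sketch is only a heuristic under stated assumptions: the status codes and the CAPTCHA marker strings are illustrative guesses, and `looks_blocked` is a hypothetical helper name.

```python
# Status codes that commonly indicate the site is refusing or limiting access.
# This set is an assumption for illustration, not an exhaustive rule.
BLOCK_STATUSES = {401, 403, 429}

# Hypothetical phrases that often appear on CAPTCHA or challenge pages.
CAPTCHA_MARKERS = ("captcha", "are you a robot", "unusual traffic")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if the response suggests the site is restricting scraping."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


# Usage sketch: stop the job instead of rotating IPs or solving the challenge.
# if looks_blocked(response_status, response_text):
#     raise SystemExit("Site appears to restrict scraping; stopping.")
```

The design choice here is to treat a block as a signal to stop and reassess, which matches the advice above, instead of as an obstacle to get around.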

4. Avoid collecting personally identifiable information.

Consider whether any data to be scraped includes the PII of people in the EU. Under the GDPR, you can only process such data if one of the lawful bases below applies:

  • Consent – The consent of the data subject
  • Contract – A contract with the data subject
  • Compliance – Necessity for compliance with a legal obligation.
  • Vital Interest, Public Interest, or Official Authority – Necessity to protect someone’s vital interests, or to carry out a task in the public interest or under official authority.
  • Legitimate Interest – Necessity for other legitimate interests
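If you have no lawful basis for keeping personal data, one practical precaution is to strip obvious identifiers before anything is stored. The Python sketch below redacts email addresses only. The regular expression is a simplified assumption that will miss some valid addresses, `redact_emails` is a hypothetical helper, and removing emails alone does not make a dataset GDPR-compliant.

```python
import re

# Simplified email pattern (an assumption; real-world addresses vary more).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED_EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


sample = "Contact jane.doe@example.com for details."  # made-up sample text
print(redact_emails(sample))
```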

5. Avoid scraping copyrighted content.

Don’t scrape copyrighted or otherwise protected content, because you could become involved in a copyright infringement claim.

6. Make good use of the scraped data.

Don’t share the scraped data randomly with others. Use data wisely to generate more insights and help improve your business.

Did you know a web scraper can crash a website’s server?

Speed is tempting. When you’re staring at thousands of product listings, job postings, or search results, the urge to scrape everything as fast as possible is understandable. Some scraping tools even advertise it as a feature: “Collect millions of data points in seconds!”

But here’s what those ads don’t mention: that speed can land you in court.

1. The real cost of scraping too fast:

When a web scraper sends requests faster than a server can handle, the consequences cascade quickly. The server’s CPU maxes out. Memory fills up. Response times slow to a crawl. And if you’re hitting a smaller website or one that’s already under heavy load, your scraper might be the straw that breaks the camel’s back—causing the entire site to crash.

At that point, you’re no longer just collecting data. You’re causing real, measurable harm. And under a legal doctrine called “trespass to chattels”—a principle dating back to English common law that protects property from harmful interference—you can be held personally liable for that damage.

The landmark case here is eBay v. Bidder’s Edge (2000), where the court found that even automated queries that don’t immediately crash a server can constitute trespass to chattels if they burden the system. More recently, legal scholars Dryer and Stockton (2013) have documented how this principle applies specifically to aggressive web scraping that overloads server resources.

2. What “trespass to chattels” means for you:

This isn’t just academic legal theory. If your scraping causes a website to go down, the owner can sue you for:

  • Direct damages: Server repair costs, lost revenue during downtime, and the cost of implementing new anti-scraping measures
  • Consequential damages: Loss of customer trust, reputation damage, and missed business opportunities
  • Injunctive relief: A court order permanently banning you from accessing the website
  • Legal fees: In some cases, you’ll pay the website owner’s attorney costs too

And you don’t need to intend harm. The doctrine of trespass to chattels doesn’t require malicious intent—just that your actions caused damage to someone else’s property.

The deceptive appeal of “maximum speed scraping”:

Here’s how the problem typically unfolds. You find a scraping tool that promises to collect data at incredible speeds. You configure it to extract 10,000 product pages. You click start. And for a few glorious minutes, data pours in faster than you ever imagined.

Then the requests start timing out. The website slows to a crawl. You might see HTTP 503 or 504 errors—signs the server is overloaded. If you’re scraping a major e-commerce site with enterprise infrastructure, their systems might handle it (though you’ll likely get IP-banned). But if you’re targeting a smaller business, a regional news site, or an industry-specific database? Your scraper might have just brought down their entire online presence.

The site owner won’t see “researcher gathering data.” They’ll see what looks like a distributed denial-of-service (DDoS) attack. And they’ll respond accordingly—with lawyers.
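The overload signals described above (HTTP 503 and 504 errors) can be handled in code by slowing down instead of retrying at full speed. Below is a minimal Python sketch of exponential backoff that also honors a `Retry-After` header when it is given in seconds. The function names, the base delay, and the cap are assumptions chosen for illustration, and the HTTP-date form of `Retry-After` is deliberately not handled.

```python
from typing import Optional

# Status codes treated as "the server is struggling or rate limiting us".
OVERLOAD_STATUSES = {429, 503, 504}


def should_back_off(status_code: int) -> bool:
    """Return True if the response suggests the server is overloaded."""
    return status_code in OVERLOAD_STATUSES


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,   # assumed starting delay in seconds
    cap: float = 300.0,  # assumed upper bound in seconds
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # If the server states a delay in seconds, respect it as given.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise double the delay on each attempt, up to the cap.
    return min(base * (2 ** attempt), cap)


# Usage sketch: 1s, 2s, 4s, 8s ... between retries, then give up.
# for attempt in range(5):
#     status = fetch_status(url)  # placeholder for your own request code
#     if not should_back_off(status):
#         break
#     time.sleep(backoff_delay(attempt))
```

If the errors persist after a few attempts, stopping the job altogether is the safer choice, since continuing is exactly the behavior that can look like an attack.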

The Facebook v. Power Ventures (2016) case illustrates why compliant scraping matters.

Power Ventures continued scraping Facebook after being explicitly told to stop. The court didn’t just find against them on the merits—it awarded Facebook substantial damages. Contrast that with companies like OpenAI and Google, which negotiated licensing agreements with Reddit in 2024. When Reddit later sued other scrapers, those licenses demonstrated the proper path.

Can I Scrape Public Data for Commercial AI Training Without Permission?

On October 22, 2025, Reddit filed a lawsuit in New York federal court against Perplexity AI and three data-scraping companies—SerpAPI, Oxylabs UAB, and AWMProxy—alleging “industrial-scale, unlawful” scraping of Reddit content for commercial AI training. Four months earlier, Reddit had sued Anthropic, maker of Claude AI, on similar grounds. Together, these cases highlight a critical question: when does scraping public data for AI training cross legal boundaries?

According to Reddit’s complaint, defendants accessed nearly two billion Google search results containing Reddit data. When Reddit couldn’t be scraped directly due to anti-scraping protections, defendants allegedly scraped Reddit content from Google’s search results instead—what Reddit’s chief legal officer Ben Lee called hijacking “the armored truck when you can’t get into the vault.”

The alleged methods included rotating IP addresses, fake user agents, and large-scale proxy networks. Reddit even set a trap: they created a test post visible only to Google. Within hours, it appeared in Perplexity’s answers, demonstrating the indirect scraping.

After Reddit sent Perplexity a cease-and-desist in May 2024, Reddit alleges that Perplexity’s citations of Reddit content increased 40-fold—not exactly the response of a company acting in good faith.

Moreover, Reddit’s lawsuits invoke multiple legal theories:

Digital Millennium Copyright Act (DMCA) Section 1201 prohibits circumventing technological measures controlling access to copyrighted works. Using proxies and fake user agents to bypass anti-scraping protections may violate this—regardless of whether the underlying use would be copyright infringement.

Computer Fraud and Abuse Act (CFAA) addresses unauthorized computer access. While the Ninth Circuit’s 2022 hiQ Labs v. LinkedIn decision held that scraping public data doesn’t automatically violate CFAA, continuing after a cease-and-desist demonstrates lack of authorization.

Unfair competition and breach of contract claims focus on commercial harm and Terms of Service violations. If you’re using scraped data to build products that compete with the source platform, you’re on shaky legal ground.

This reveals the “open internet” tension

Perplexity defends its practices as protecting “users’ rights to freely and fairly access public knowledge.” Some commentators, including Techdirt founder Mike Masnick, worry that Reddit’s position could harm the open internet’s fundamental model.

But Reddit counters: major companies like Google and OpenAI paid for licensing agreements. If those industry leaders recognized that commercial AI training requires permission, why should others get a free pass?

The Anthropic lawsuit adds complexity. Reddit claims Anthropic CEO Dario Amodei’s own research acknowledged using Reddit data to train Claude since 2021. Anthropic’s defense strategy—arguing the claims should be preempted by federal copyright law—may shape how platforms can assert control over their data.

A pattern is emerging across the industry

These Reddit cases are part of a broader wave: The New York Times sued OpenAI, Getty Images sued Stability AI, and music labels sued over song lyrics in training data. The message from content creators is clear: build your AI empire, but don’t do it with our data without asking.

What this means for you

If you’re considering scraping for AI training, here’s what matters:

  • Licensing is becoming standard. Budget for it as a core expense. When industry giants are paying for access, unauthorized use is high-risk.
  • “Public” doesn’t mean “freely usable for commercial AI”. Visibility on your screen doesn’t grant unlimited commercial rights.
  • Technical evasion signals bad faith. Proxies, IP rotation, and ignoring cease-and-desist letters are factors courts weigh heavily against you. This is what sank the defendants in Facebook v. Power Ventures (2016) and Craigslist v. 3Taps (2013).
  • Scale matters. A small academic project is different from scraping billions of records for a commercial product.
  • Document good faith. If you’re scraping legitimately, show you tried to do it properly: respect robots.txt, implement rate limits, and seek permission when appropriate.

The legal landscape around AI training data is evolving rapidly. These cases won’t be decided for years, but their outcomes will define the rules. For now, the safest path is clear: when scraping at scale for commercial AI applications, seek licensing agreements. It’s cheaper than litigation and increasingly the only viable path forward.

Wrap-up

Web scraping itself is not illegal, but you need to be careful about how you use it, because the law around web scraping still has many gray areas. Even if a project raises no red flags today, that does not guarantee it will remain in the clear in the future. It is wise to stay up to date on evolving laws in this area. If you are unsure whether to scrape a certain website, the safer course is to consult a lawyer for advice.

In addition, it is important to make an informed choice of web scraping tools if you want to lower your legal risks. Consider using popular web scraping tools like Octoparse. It has a large user base and only processes or shares data based on the five legal bases mentioned above. Download Octoparse for a free 14-day trial today! We wish you a safer web scraping journey!
