Web crawling, also called web scraping, screen scraping, or web data extraction, is a technique in which a computer program extracts large amounts of data from websites and converts it into easy-to-read structured formats. Combined with the power of automation, web scraping offers a scalable way to access, rank, collect, organize, and analyze the huge amount of documentation and data on the web. Modern web scrapers have streamlined data extraction and saved us from the repetitive work of copying and pasting.
For a long time, web crawling has been considered a gray zone, because in most situations people use this technique to grab data from web pages without the consent of the site’s webmaster. As web scraping becomes more prevalent, more and more people are paying attention to its legality. In this article, we’ll discuss the legal considerations of web crawling and how to avoid legal issues while pulling data from websites.
Is Web Crawling Legal?
Let’s start with the question in the title: is web crawling legal? Well, it depends. The answer turns on how you scrape, how you use the scraped data, and which legal theories and laws apply.
How You Do Web Crawling
Generally, scraping publicly available information from websites is considered legal, whereas scraping private account data raises privacy concerns. Here are a few popular use cases that show how different industries do web scraping in a well-accepted way.
E-commerce: Retailers use web scraping to automate marketplace price monitoring, build up product profiles, and collect customer reviews for sentiment analysis across online shopping platforms like Amazon and eBay.
Marketing and Advertising: Content creators apply web crawling to collect data from various social media platforms like Twitter and YouTube to generate new ideas for content marketing and understand what audiences are interested in.
Real Estate: Realtors scrape listings from property websites like Realtor.com to aggregate research data for comparison. This helps them anticipate where the real estate market is heading and see in which price range their property will compete.
How You Use Scraped Data
If you’re doing web crawling for your own purposes, it is more likely to be considered acceptable, and uses such as market research and academic research may fall under the fair use doctrine. The complications start when you use scraped data for others, especially for commercial purposes.
According to Wikipedia.org, eBay v. Bidder’s Edge, 100 F.Supp.2d 1058 (N.D. Cal. 2000), was a leading case applying the trespass to chattels doctrine to online activities. In 2000, eBay, an online auction company, successfully used the ‘trespass to chattels’ theory to obtain a preliminary injunction preventing Bidder’s Edge, an auction data aggregator, from using a ‘crawler’ to gather data from eBay’s website. The court’s analysis, however, has been criticized in more recent jurisprudence.
The Legal Theories and Laws in Different Countries
The versatility of web scraping allows access to data so easily that it would be natural to worry about potential information abuse or misuse. For people who want to decrease the likelihood of legal controversies in web scraping, it is important to identify the legal risks around web scraping.
Here comes the ultimate question: to scrape or not to scrape? Is web scraping illegal, and what are the potential legal implications of using it? Unfortunately, there is no short answer to these questions. Because web scraping is still relatively new in a legal context, the line between legitimate and abusive use of this technique is hard to define in most countries. For a decade or so, web scraping has been guided only by a set of related, fundamental legal theories and laws, such as:
- Copyright Infringement
- Breach of Contract
- Violation of the Computer Fraud and Abuse Act (CFAA)
- Trespass to Chattels
In most countries, the law that applies specifically to web scraping is not clearly defined yet. However, since the GDPR came into force, more and more people have realized the need to comply with legal standards before starting a scraping project, to avoid falling into a tricky legal situation. As legal circumstances vary widely between countries, this part only discusses the legal risks of web scraping in the United States and Europe.
The Case of the United States
In the US, the law regarding web scraping is still developing and implicates a large number of statutory regimes and areas of common law. Website owners can rely on several major types of legal claims to stop undesired web scraping. For example, web-scraping activity may implicate federal statutes, such as the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), and insider trading laws; state blue sky laws; privacy laws; and common law claims, such as breach of contract, fraud, and trespass to chattels.
The CFAA proscribes “intentionally access[ing] a computer without authorization or exceed[ing] authorized access, and thereby obtain[ing] … information from any protected computer.” Courts have disagreed, however, on what constitutes access without authorization or exceeding authorization.
In hiQ Labs, Inc. v. LinkedIn Corp., the court held that a user’s act of accessing data made available by the owner to the general public does not constitute access “without authorization” under the CFAA. Yet in Facebook, Inc. v. Power Ventures, Inc., the court held that a user accesses a computer “without authorization” when he or she continues to circumvent technological measures employed by the operator to block that user’s access.
The operator of a website that is the target of web scraping may bring a claim for copyright infringement against the user of the web-scraping device by proving:
- its ownership of a valid copyright; and
- the user’s copying of the original elements of the work in question.
At least one federal court has held that a party faces liability under Section 1201(a)(1)(A) of the DMCA when it uses bots to circumvent security measures that control nonhuman access to copyrighted material on a webpage.
It is also worth noting the general copyright principle that, although compilations of facts can be protected by copyright, authors may not copyright their ideas or the facts they narrate. Accordingly, if the data scraped are purely facts without a creative component, then there is no copyright claim.
Web scraping may also implicate the privacy statutes of states and other jurisdictions. For example, the E.U.’s General Data Protection Regulation and the California Consumer Privacy Act of 2018 grant consumers a variety of rights and protections with respect to their personal information. Web-scraping activity that compiles personally identifiable information could implicate a variety of privacy statutes – and potentially subject a web scraper to government and private litigation.
Under certain circumstances, web scraping could also potentially violate federal insider trading law or state blue sky laws. For example, using affirmative misrepresentations to obtain material nonpublic information through web scraping and then trading based on that information could potentially constitute insider trading.
However, the law in this area is unsettled, and it remains to be seen how strict an approach regulators and law enforcement may take when deciding what constitutes a breach of duty or deception in the web-scraping context.
Breach of Contract
In addition to the boundaries imposed by the statutes discussed above, a plaintiff could seek to invoke various common law remedies in an attempt to stem or curtail web scraping. For instance, some website operators have attempted to assert claims for breach of contract against alleged web scrapers. Courts, however, have held that defendants must be on notice of a website’s terms of service for the terms to be enforced against them.
The Case of Europe
Today, 69% of the population above the age of 16 in the EU have heard about the GDPR, and 71% have heard about their national data protection authority, according to a survey published by the EU Fundamental Rights Agency. Though still in its infancy, the GDPR is one of the most comprehensive and impactful data protection laws to date, and it has radically changed how businesses scrape the web in Europe. If your scraping project involves personally identifiable information (PII), make sure you are GDPR-compliant to avoid hefty fines. See our blog post, GDPR Compliance In Web Scraping, which covers almost everything you need to know about the GDPR.
6 Tips for Doing Web Scraping Properly
On the whole, the law on web scraping is still developing, and only further court decisions and legal pronouncements will thoroughly define its parameters. To help you avoid lawsuits, the following is a non-exhaustive list of practical tips for anyone who engages in web scraping.
1. Respect and follow the Terms of Service.
Always review the website’s Terms of Service (ToS) and robots.txt file before starting any web scraping or data collection activity. If possible, get prior permission from the owner of the website.
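The robots.txt part of this review can be automated. Below is a minimal sketch using Python's standard `urllib.robotparser` module. The sample rules, the `my-crawler` user agent, and the `example.com` URLs are made up for illustration, and a script like this does not replace reading the Terms of Service yourself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-crawler"  # assumed name for your own scraper


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer permission queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Ask whether a given URL may be fetched by our user agent.
print(rules.can_fetch(USER_AGENT, "https://example.com/products"))
print(rules.can_fetch(USER_AGENT, "https://example.com/private/data"))

# Some sites also publish a requested delay (in seconds) between requests.
print(rules.crawl_delay(USER_AGENT))
```

Against a live site, you would typically call `set_url()` with the address of the site's robots.txt file and then `read()`, instead of passing text to `parse()`.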
2. Scrape at a reasonable and moderate rate.
Be gentle, not aggressive, and give the scraped website some breathing space. Leave a reasonable time interval between requests and keep the total number of requests under control. Avoid adversely impacting a website’s physical operation, which could lead to a claim for trespass to chattels or similar claims.
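One simple way to keep a steady pace is to enforce a minimum delay between consecutive requests. The sketch below is one possible approach, not a prescribed one: the `Throttle` class and its parameter names are hypothetical, and the right interval depends on the site (a robots.txt crawl delay, if published, is a reasonable starting point).

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.1)  # short interval for demonstration only
    for page in range(3):
        throttle.wait()
        print("would fetch page", page)    # replace with your actual request
```

Calling `throttle.wait()` before every request means the pace stays bounded even if the rest of the scraper runs faster than expected.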
3. Monitor and consider any actions a website takes to restrict web scraping.
If a website clearly restricts your web scraping activities with anti-scraping measures, such as CAPTCHAs, rate limits, or IP address blocking, you need to be cautious of potential legal risks. Be prepared to stop if asked to do so through a cease-and-desist letter or otherwise.
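A scraper can watch for some of these signals itself and react conservatively instead of trying to work around them. The following sketch is an assumed policy, not a standard: the `next_action` function, the choice of status codes, the 60-second default, and the crude "captcha" text check are all illustrative, and header lookup here is case-sensitive.

```python
from typing import Mapping, Optional, Tuple

STOP_STATUSES = {401, 403}       # access denied: treat as a request to stop
SLOW_DOWN_STATUSES = {429, 503}  # too many requests / service unavailable
DEFAULT_BACKOFF_SECONDS = 60.0   # assumed fallback when no Retry-After is given


def next_action(status: int, headers: Mapping[str, str],
                body: str = "") -> Tuple[str, Optional[float]]:
    """Decide whether to continue, wait, or stop based on one HTTP response."""
    # A block page or CAPTCHA challenge signals that scraping is not welcome.
    if status in STOP_STATUSES or "captcha" in body.lower():
        return ("stop", None)
    if status in SLOW_DOWN_STATUSES:
        # Retry-After may be a number of seconds or an HTTP date;
        # only the numeric form is handled in this sketch.
        retry_after = headers.get("Retry-After", "").strip()
        wait = float(retry_after) if retry_after.isdigit() else DEFAULT_BACKOFF_SECONDS
        return ("wait", wait)
    return ("continue", None)


print(next_action(200, {}))
print(next_action(429, {"Retry-After": "30"}))
print(next_action(403, {}))
```

The point of the design is that a "stop" result ends the run rather than triggering a retry through another IP address or a CAPTCHA solver.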
4. Avoid collecting personally identifiable information.
Consider whether the data to be scraped includes the PII of people in the EU. Under the GDPR, you can only process such data on one of the five legal grounds below:
- Consent – The consent of the data subject
- Contract – A contract with the data subject
- Compliance – Necessity for compliance with a legal obligation
- Vital Interest, Public Interest, or Official Authority – Necessity to protect someone’s vital interests, or to perform a task in the public interest or under official authority
- Legitimate Interest – Necessity for other legitimate interests
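When a project does not need personal data at all, the simplest safeguard is to drop it before anything is stored. The sketch below illustrates that idea under stated assumptions: the field names in `PII_FIELDS`, the `minimize` helper, and the email pattern are hypothetical, and pattern matching like this will miss many kinds of PII, so it does not by itself make a project GDPR-compliant.

```python
import re

# Hypothetical field names; adjust to whatever your scraper actually collects.
PII_FIELDS = {"name", "email", "phone", "address"}

# A deliberately simple pattern for email addresses embedded in free text.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def minimize(record, pii_fields=PII_FIELDS):
    """Return a copy of record without PII fields and with emails masked in text values."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in pii_fields:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_PATTERN.sub("[redacted email]", value)
        cleaned[key] = value
    return cleaned


scraped = {
    "product": "Kettle",
    "price": 19.99,
    "email": "seller@example.com",
    "review": "Great! Contact jane.doe@example.com for details",
}
print(minimize(scraped))
```

Running the cleaning step at collection time, rather than later, keeps personal data out of logs, caches, and exports as well.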
5. Consider whether any data to be scraped is protected by copyright.
Avoid scraping copyrighted content, because reproducing it could expose you to a copyright infringement claim.
6. Make good use of the scraped data.
Don’t share the scraped data indiscriminately with others. Use data wisely to generate more insights and help improve your business.
Web scraping itself is not illegal, but you need to be careful about how you use it, because there are still many gray areas in how the law applies to web scraping. Even if none of the concerns above applies to your project today, that does not necessarily clear you to proceed with it in the future. It is wise to stay up to date on evolving laws in this area. If you are unsure whether to scrape a certain website, the safer course is to consult a lawyer.
In addition, it is important to make an informed choice of web scraping tools if you want to lower your legal risks. Consider using a popular web scraping tool like Octoparse, which has a large user base and only processes or shares data based on the five legal bases mentioned above. Download Octoparse for a free 14-day trial today! We wish you a safer web scraping journey!