The Rising Prevalence of Web Scraping
Web scraping, also called web crawling, screen scraping, or web data extraction, is the automated extraction of data from web pages, with or without the consent of the site’s webmaster. A popular term in the data-driven era, web scraping combined with automation offers a scalable way to access, rank, collect, organize, and analyze the huge amount of documents and data on the web. Modern web scrapers have streamlined data extraction and spare us the repetitive work of copying and pasting.
The possibilities around web scraping are enormous. As one of the cornerstone technologies of the Internet, it lays the foundation for modern search engines. The Google Search index, for example, is built largely from crawled pages: Google’s web crawlers gather information from hundreds of billions of web pages and organize it in the Search index. On a smaller scale, businesses across many industries use web scraping to harvest third-party data and extract insights from it.
Here are a few popular use cases to show how prevalent web scraping is:
- E-commerce: Retailers use web scraping to automate marketplace price monitoring, build up product profiles, and collect customer reviews for sentiment analysis.
- Marketing/Advertising: Content creators use web scraping to collect data from various social media platforms to help generate new ideas for content marketing.
- Real estate: Realtors scrape listings from property websites such as Gumtree.com and Realtor.com to aggregate data for comparison. With that data, they try to anticipate where the real estate market is heading and to see in what price range their property will compete.
The Legality of Web Scraping
While web scraping for business has become a common practice, the legality of web scraping is still in a grey area. The versatility of web scraping allows access to data so easily that it would be natural to worry about potential information abuse or misuse. For people who want to decrease the likelihood of legal controversies in web scraping, it is important to identify the legal risks around web scraping.
Here comes the ultimate question: to scrape or not to scrape? Is web scraping illegal? What are the potential legal implications of using it? Unfortunately, there is no short answer to these questions. Because web scraping is relatively new in a legal context, the line between legitimate and abusive use of the technique is still hard to draw in most countries. For a decade or so, web scraping was guided only by a set of related, fundamental legal theories and laws, such as:
- Copyright Infringement
- Breach of Contract
- Violation of the Computer Fraud and Abuse Act (CFAA)
- Trespass to Chattels
In most countries, law specific to web scraping is not yet clearly defined. However, since the GDPR took effect, more and more people have realized the need to check the applicable legal standards before starting a scraping project, to avoid falling into a tricky legal situation. Because legal circumstances vary widely from country to country, this article discusses only the legal risks of web scraping in the United States and Europe.
The Case of the United States
In the US, the law regarding web scraping is still developing and implicates a large number of statutory regimes and areas of common law. Website owners can rely on several major types of legal claims to stop undesired web scraping. For example, web-scraping activity may implicate federal statutes, such as the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), and insider trading laws; state blue sky laws; privacy laws; and common law claims, such as breach of contract, fraud, and trespass to chattels.
The CFAA proscribes “intentionally access[ing] a computer without authorization or exceed[ing] authorized access, and thereby obtain[ing] . . . information from any protected computer.” Courts have disagreed, however, on what constitutes access without authorization or exceeding authorization.
In hiQ Labs, Inc. v. LinkedIn Corp., the court concluded that a user’s act of accessing data made available by the owner to the general public does not constitute access “without authorization” under the CFAA. Yet in Facebook, Inc. v. Power Ventures, Inc., the court held that a user accesses a computer “without authorization” when he or she continues to circumvent technological measures employed by the operator to block that user’s access.
The operator of a website that is the target of web scraping may bring a claim for copyright infringement against the user of the web-scraping device by proving:
- its ownership of a valid copyright; and
- the user’s copying of the original elements of the work in question.
At least one federal court has held that a party faces liability under Section 1201(a)(1)(A) of the DMCA when it uses bots to circumvent security measures that control nonhuman access to copyrighted material on a webpage.
It is also worth noting the general copyright principle that, although compilations of facts can be protected by copyright, authors may not copyright their ideas or the facts they narrate. Accordingly, if the data scraped are purely facts without a creative component, then there is no copyright claim.
Web scraping may also implicate the privacy statutes of states and other jurisdictions. For example, the E.U.’s General Data Protection Regulation and the California Consumer Privacy Act of 2018 grant consumers a variety of rights and protections with respect to their personal information. Web-scraping activity that compiles personally identifiable information could implicate a variety of privacy statutes – and potentially subject a web scraper to government and private litigation.
Under certain circumstances, web scraping could also potentially violate federal insider trading law or state blue sky laws. For example, using affirmative misrepresentations to obtain material nonpublic information through web scraping and then trading based on that information could potentially constitute insider trading.
However, the law in this area is unsettled, and it remains to be seen how strict an approach regulators and law enforcement may take when deciding what constitutes a breach of duty or deception in the web-scraping context.
Breach of Contract
In addition to the boundaries imposed by the statutes discussed above, a plaintiff could seek to invoke various common law remedies in an attempt to stem or curtail web scraping. For instance, some website operators have attempted to assert claims for breach of contract against alleged web scrapers. Courts, however, have held that defendants must be on notice of a website’s terms of service for the terms to be enforced against them.
The Case of Europe
According to a survey published by the EU Fundamental Rights Agency, 69% of people above the age of 16 in the EU have heard of the GDPR, and 71% have heard of their national data protection authority. Though still in its infancy, the GDPR is one of the most comprehensive and impactful data protection laws to date. It has radically changed how businesses scrape the web in Europe. If your scraping project requires you to collect personally identifiable information (PII), make sure you are GDPR compliant to avoid hefty fines. See our blog post on the topic, GDPR Compliance In Web Scraping, which covers almost everything you need to know about the GDPR.
Advice for Users That May Engage in Web Scraping
On the whole, the law on web scraping is still developing, and only further court decisions and legal pronouncements will thoroughly define its parameters. To help you stay out of lawsuits, the following is a non-exhaustive list of practical tips for users who engage in web scraping.
1. Respect and follow the Terms of Service (ToS).
2. Scrape at a reasonable and moderate rate.
Be gentle rather than aggressive, and give the scraped website some breathing space. Space your requests at reasonable intervals and keep the total number of requests under control. Avoid adversely impacting a website’s physical operation, which could lead to a claim for trespass to chattels or similar claims.
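One way to keep requests spaced out is a small throttle that enforces a minimum delay between consecutive calls. The following is a minimal sketch, not a definitive implementation: the `Throttle` class, the `crawl` helper, the `fetch` callable, and the 2-second interval in the usage comment are all illustrative assumptions, and a suitable interval depends on the website in question.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request; None before the first one

    def wait(self):
        """Block until at least min_interval_s has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, throttle):
    """Fetch each URL in turn, pausing via the throttle before every request.

    `fetch` is any callable that takes a URL and returns the page content;
    it is an assumption of this sketch, not a specific library call.
    """
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results


# Example usage (assumed values):
#   pages = crawl(list_of_urls, my_fetch_function, Throttle(min_interval_s=2.0))
```

Keeping the delay in one object means the same limit applies no matter how many places in the program issue requests.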
3. Monitor and consider any actions a website takes to restrict web scraping.
If a website clearly restricts web scraping through anti-scraping measures such as CAPTCHAs, rate limits, and IP address blocking, be cautious of the potential legal risks. Be prepared to stop if asked to do so through a cease-and-desist letter or otherwise.
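As a rough illustration of monitoring for such measures, the sketch below stops a crawl at the first sign of a restriction instead of trying to work around it. The function names and the `fetch` callable are hypothetical. HTTP 429 (Too Many Requests) and 403 (Forbidden) are standard status codes, but treating them as anti-scraping signals, and searching the response body for the word "captcha", are crude heuristics assumed for this example.

```python
def restriction_signal(status_code, body_text=""):
    """Return a short reason if the response looks like an anti-scraping measure, else None."""
    if status_code == 429:
        return "rate limited (HTTP 429)"
    if status_code == 403:
        return "access forbidden (HTTP 403), possibly an IP block"
    if "captcha" in body_text.lower():
        # Heuristic only: real CAPTCHA pages vary widely.
        return "CAPTCHA challenge in response body"
    return None


def crawl_until_restricted(urls, fetch):
    """Collect pages until a restriction signal appears, then stop.

    `fetch` is an assumed callable returning (status_code, body_text) for a URL.
    Returns (pages_collected, reason), where reason is None if no signal was seen.
    """
    pages = []
    for url in urls:
        status, body = fetch(url)
        reason = restriction_signal(status, body)
        if reason is not None:
            return pages, reason  # stop here rather than circumventing the measure
        pages.append(body)
    return pages, None
```

Returning the reason alongside the collected pages leaves a record of why the crawl ended, which can be useful when reviewing whether to continue at all.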
4. Avoid collecting personally identifiable information.
Consider whether any data to be scraped is PII of individuals in the EU. Under the GDPR, you may process such data only on one of the lawful bases below:
- Consent – The consent of the data subject.
- Contract – Necessity for a contract with the data subject.
- Compliance – Necessity for compliance with a legal obligation.
- Vital Interest, Public Interest, or Official Authority – Necessity to protect someone’s vital interests, or to perform a task in the public interest or in the exercise of official authority.
- Legitimate Interest – Necessity for other legitimate interests.
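Where none of these bases applies, one practical precaution is to drop personal fields before scraped records are stored. The following is a minimal sketch under stated assumptions: the field names in `PII_FIELDS`, the email pattern, and the `strip_pii` function are illustrative only. A filter like this catches only the obvious cases and does not by itself make a project GDPR compliant.

```python
import re

# Assumed field names that commonly hold personal data; adjust for your own schema.
PII_FIELDS = {"name", "email", "phone", "address"}

# Simple pattern for email-like strings; it will not catch every valid address format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_pii(record):
    """Return a copy of `record` without PII fields and with emails in text redacted."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned
```

For example, a listing record with an `email` field and a free-text description containing a contact address would come back with the field removed and the address replaced by `[redacted]`.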
5. Consider whether any data to be scraped is protected by copyright.
Avoid scraping copyrighted material, because reproducing it could expose you to a claim of copyright infringement.
6. Make good use of the scraped data.
Don’t share the scraped data indiscriminately with others. Use the data wisely to generate insights and help improve your business.
Web scraping itself is not illegal, but because the law around it still has many grey areas, people need to be careful about how they use the technique. Passing every check above today does not necessarily give you clearance to proceed with the same scraping project in the future, so it is wise to stay up to date on evolving laws in this area. If you are unsure whether to scrape a certain website, the safer course is to consult a lawyer for advice.
In addition, it is important to make an informed choice of web scraping tools if you want to lower your legal risks. Consider using a popular web scraping tool like Octoparse. It has a large user base and only processes or shares data based on the legal bases listed above. Download Octoparse for a free 14-day trial today! We wish you a safe web scraping journey!