Blog > Web Scraping > Post

5 Things You Need to Know Before Scraping Data From Facebook

Wednesday, January 30, 2019


1. Actually Facebook disallows any scrapers, according to its robots.txt file


When planning to scrape a website, you should always check its robots.txt first. Robots.txt is a file used by websites to let "bots" know if or how the site should be crawled and indexed. You could access the file by adding "/robots.txt" by the end of the link of your target website. 

Enter in your browser, and let’s check the robots file of Facebook. These two lines could be found at the bottom of the file:

The lines state that Facebook prohibits all automated scrapers. That is, no part of the website should be visited by an automated crawler.


Why do we need to respect robots.txt?

Websites use the robots file to specify a set of rules on how you/bots should interact with them. When a website blocks all access to crawlers, the best thing to do is leave that site alone. To follow the robots file is to avoid unethical data gathering, as well as any legal ramifications.



2. Technically, the only legal way to collect data from Facebook with a crawler is to obtain the prior written permission


Facebook warns at the very beginning of their robots file: "Crawling Facebook is prohibited unless you have express written permission."

Check the link in the second line, you could find Facebook’s Automated Data Collection Terms, last revised on April 15th, 2010.


Like any other terms and conditions on this world, Facebook Automated Data Collection Terms are long (in abnormally small characters), and full of legal terms that few people could fully understand.

These terms look so familiar, as we would see them each time we install a new app on our mobile phone or sign up to a website. 

"By obtaining permission to…you agree to abide by…"

"You agree that you will not…"

"You agree that any violation of these terms may result in…"


However, they may not be the same innocent

As the social media giant, Facebook has money, time and the dedicated legal team. If you proceed with scraping Facebook ignoring their Automated Data Collection Terms, that’s OK, but just be warned that they have been reminded you to at least obtain a "written permission". Sometimes they could be quite aggressive towards illegitimate scraping. 



3. But surely you are still able to scrape data from Facebook as you need


If you have done crawling without respecting the robots.txt, it doesn't mean you would get into legal complications because you've violated the rules.

Data scraped from social media is undoubtedly the largest and most dynamic dataset about human behavior and real-world events. For more than a decade, researchers and business experts around the world have harvested information from Facebook using scrapers, producing representative samples to understand individuals, groups and society, as well as exploring the brand new opportunities hidden in the data.

For users, they would agree that the use of social data is not always a bad thing. For example, it is the use of social data to personalize marketing that keeps the internet free and makes the ads and content we see more relevant.


Tools you could use for obtaining Facebook data

In response to the public outcry following the Cambridge Analytica scandal, Facebook implemented dramatic access restrictions on its APIs in April last year.

Application Programming Interfaces (APIs) are software interfaces designed for consumption by computer programs, which allow people to retrieve large-scale data with automated process. Nowadays many companies provide a public API as a means for users, researchers and third-party app developers to access their infrastructure.

Facebook's API lockdown and radical data access restrictions as an attempt to protect the user information is quite arguable. But still, as a result, now people are left with only one choice.

Without APIs, now we could only obtain Facebook data through the interfaces for users, that is, the web pages. This is exactly when web scrapers comes into play. We have written a blog about some best social media scraping tools. 👉 Check our article Top 5 Social Media Scraping Tools for 2018



4. After GDPR in force, however, there’s more chance to get sued if you’re trying to scrape personal data


The EU General Data Protection Regulation, or GDPR as it is more commonly known, came into force on 25th May 2018. It is said to be the most important change in data privacy regulation in 20 years, setting to force sweeping changes in everything from technology to advertising, and medicine to banking.

Companies or organizations that hold and process large amounts of consumer data, such as technology firms like Facebook, are affected most under GDPR. Before it was all up to these companies themselves to enforce the rules to protect the user data. Now under GDPR, they need to make sure they are in full compliance with the law.


The good news is…

GDPR only applies to personal data.

Here "personal data" refers to the data that could be used to directly or indirectly identify a specific individual. This kind of information is known as Personally Identifiable Information(PII), which includes a person's name, physical address, email address, phone number, IP address, date of birth, employment info and even video/audio recording.

If you aren't scraping personal data, then GDPR does not apply.

In short, unless you have the person's explicit consent it is now illegal to scrape an EU resident personal data under GDPR. 





5. And you could try Facebook alternative sources for your scraping project


As mentioned above, though Facebook prohibits all automated crawlers, it is still technically feasible to scrape data from the site. The problem is —

It is risky.

Apart from the legal ramifications, you could also find it may get harder to retrieve the desired data on a regular base, as Facebook would blocks suspicious IPs, and could even implement harder blocking mechanisms in the future, which may make scraping data from the site totally impossible.

Hence, it is recommended to look for more reliable sources for social media data to gain business intelligence and insights on your target market.


Four data sources alternative to Facebook


With about 500 million tweets generated per day, Twitter is a sea of information that can be used as the great source for brand monitoring and customer sentiment measurement. Unlike Facebook, Twitter allows people to retrieve data on a large scale via Twitter's APIs.



Having as many users as Twitter, Reddit is one of the greatest source of UGC (User Generated Content) in the world. Reddit also provides public APIs that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.


VKontakte (VK)

VK is a Russian social media platform geared toward Russians and other Eastern European users. By far, it boasts over 90 million unique visitors per month, and 9 billion page views every day. As a Russian company, VK adheres to Russian laws, and if you check its robots file you’ll find it is quite friendly with crawlers.



Owned by Facebook, Instagram focuses more on visual content sharing, especially videos and pictures. The platform is used by many brands to humanize their content for better connecting customers and growing brand awareness. Alongside Facebook’s data lockdown last year, however, Instagram also has implemented radical restrictions on data access, which made the site much less reliable than before. 



Written by: Ellen Y (Octoparse Team)

Download Octoparse to start web scraping or contact us for any
question about web scraping!

Contact Us Download