5 Things You Need to Know About Bypassing CAPTCHA for Web Scraping
Monday, January 14, 2019
If you have ever tried to log in to a website, there's a good chance you have been asked to type in some characters that are hard to read. This kind of challenge is called a CAPTCHA. It is a little annoying for ordinary users, and it often drives people who run web scrapers crazy, because scraping bots have a hard time dealing with it.
Today we are going to talk about 5 things you need to know about CAPTCHA to help you better bypass it for web scraping.
1. What is CAPTCHA?
According to Wikipedia, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a type of challenge-response test used in computing to determine whether or not the user is human.
It is commonly used across the internet, particularly when purchasing products online or logging in to a website.
2. How does CAPTCHA work?
CAPTCHA is inspired by the Turing test, which asks whether a machine can behave in a way that is indistinguishable from a human. The goal of CAPTCHA is to pose questions or challenges that computers struggle with but humans do not. It usually shows a distorted string of random letters or numbers. It works because a human looking at a distorted picture can read the characters with little effort, while a scraping tool cannot recognize them easily. Even a sophisticated automated system that has been programmed to scan a picture of a printed page and read the words in it still has difficulty when the words are heavily obscured or distorted.
3. What are the common types of CAPTCHA?
CAPTCHA comes in many forms. The most common types are text-based CAPTCHA, image-based CAPTCHA, and audio-based CAPTCHA.
A text-based CAPTCHA test is made up of two simple parts: a randomly generated sequence of letters and/or numbers that appear as a distorted image, and a text box. To pass the test and prove your human identity, simply type the characters you see in the image into the text box.
Simply showing the characters is not that difficult for bots. To increase the difficulty, there are variations such as mathematical CAPTCHA, which presents a basic math problem with easy-to-read numbers, and 3D CAPTCHA, which displays the characters with a 3D effect.
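To make the mathematical CAPTCHA idea concrete, here is a minimal Python sketch of how a site might generate and check such a challenge. The function names `make_math_captcha` and `check_answer` are made up for illustration and do not come from any particular CAPTCHA library. A real implementation would also render the question as a distorted image rather than plain text.

```python
import random


def make_math_captcha(rng):
    """Build a simple arithmetic challenge.

    Returns a (question, answer) pair. The name and question format are
    illustrative assumptions, not taken from any real CAPTCHA product.
    """
    a = rng.randint(1, 9)
    b = rng.randint(1, 9)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        answer = a + b
    elif op == "-":
        answer = a - b
    else:
        answer = a * b
    return f"What is {a} {op} {b}?", answer


def check_answer(submitted, expected):
    """Return True only if the submitted text is the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Non-numeric input can never be the right answer.
        return False


if __name__ == "__main__":
    question, answer = make_math_captcha(random.Random())
    print(question)
    print("Correct!" if check_answer(input("> "), answer) else "Try again.")
```

The server keeps the expected answer on its side and only sends the question to the visitor, which is why a bot has to actually interpret the challenge to get through.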
Image-based CAPTCHA usually provides users with images of objects, animals, people or landscapes, instead of distorted text, to distinguish a human from a computer program. Users are required to select the correct images that they are asked to identify or drag a block into an image to make it complete.
Audio-based CAPTCHA takes random words or numbers from recordings, combines them, and often adds some noise. Users are required to enter the words or numbers they hear in the recording. Audio CAPTCHAs are harder to deal with than text and image CAPTCHAs, as it is not easy to teach a scraping bot to listen.
4. Why do websites apply CAPTCHA?
Computing has become pervasive, and computerized tasks and services are commonplace, so security matters more than ever. CAPTCHA was developed so that websites can make sure they are dealing with a human in situations where human interaction is essential to security, for example, logging in to a website or paying online.
CAPTCHA also blocks spammers and bots that try to harvest online data automatically, or to sign up for and use websites, blogs, or forums automatically. It protects websites from being overrun by spam, fraudulent registrations, and other abusive behavior.
5. How to deal with CAPTCHA for web scraping
CAPTCHA can easily break the crawlers you set up once it shows up during extraction, so dealing with it is essential for web scraping. The best way to deal with CAPTCHA is to try your best to avoid encountering it :). Don't send too many requests to a website in a short time, and make your scraper act more like a human. (We have another article talking about how to avoid blocks during scraping and you can check it here.)
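If you code your own scraper, one simple way to act more like a human is to pause for a random amount of time between requests. The sketch below shows the idea in Python. The names `polite_delay` and `crawl_politely`, and the default delays of 2 to 5 seconds, are assumptions picked for illustration. The `fetch` argument stands for whatever function you already use to download a page.

```python
import random
import time


def polite_delay(base_seconds, jitter_seconds, rng=random):
    """Return a wait time: a fixed base plus a random extra amount."""
    return base_seconds + rng.uniform(0, jitter_seconds)


def crawl_politely(urls, fetch, base_seconds=2.0, jitter_seconds=3.0,
                   sleep=time.sleep, rng=random):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is any callable that takes a URL and returns the page content.
    `sleep` and `rng` can be swapped out, which makes the function easy to test.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            sleep(polite_delay(base_seconds, jitter_seconds, rng))
        results.append(fetch(url))
    return results
```

Random delays will not guarantee that a CAPTCHA never appears, but evenly spaced, rapid-fire requests are one of the easier patterns for a website to flag.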
But there are still many CAPTCHAs that cannot be avoided, such as the CAPTCHA on a login page. In Octoparse, you can solve the CAPTCHA manually, just as you normally would when browsing a site. (Check an example here.)
For people who code their own scrapers, there are many CAPTCHA solvers that can be integrated into their scraping system. For example, Death by CAPTCHA and Bypass CAPTCHA let users connect to the service via an API so that CAPTCHAs are solved automatically during the scraping process. These CAPTCHA solving tools can deal with normal text CAPTCHA and even reCAPTCHA.
CAPTCHA can be a painful headache for web scraping. But don’t worry. Every new generation of CAPTCHA is followed by a new generation of bots. With the rise of scraping tools and CAPTCHA solvers, CAPTCHA has become beatable, and these tools can help you scrape with far fewer interruptions.
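Here is a rough Python sketch of where a solver could sit in a scraper's workflow. Everything in it is hypothetical: `fetch_page`, `solve_captcha`, and `submit_solution` are placeholder callables that you would write yourself, and they do not represent the actual API of Death by CAPTCHA, Bypass CAPTCHA, or any other service. The keyword check in `looks_like_captcha` is also just an assumed heuristic, so check your provider's documentation and the target page before relying on it.

```python
# Assumed heuristic: strings that often appear in pages showing a CAPTCHA.
CAPTCHA_MARKERS = ("captcha", "g-recaptcha")


def looks_like_captcha(html):
    """Guess whether a page is showing a CAPTCHA instead of real content."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_with_solver(url, fetch_page, solve_captcha, submit_solution,
                      max_attempts=3):
    """Fetch a page, handing any CAPTCHA to a solver before giving up.

    All three callables are hypothetical placeholders:
      fetch_page(url)              -> HTML of the page
      solve_captcha(html)          -> answer from your solver service
      submit_solution(url, answer) -> HTML returned after submitting the answer
    Returns the page HTML, or None if a CAPTCHA is still shown after
    max_attempts tries.
    """
    html = fetch_page(url)
    for _ in range(max_attempts):
        if not looks_like_captcha(html):
            return html
        answer = solve_captcha(html)
        html = submit_solution(url, answer)
    return None if looks_like_captcha(html) else html
```

Passing the solver in as a function keeps the scraper independent of any single service, so you can switch providers, or fall back to solving by hand, without touching the crawl logic.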
Author: Yina Huang (Octoparse Team)
Proofread: Isabel Li (Octoparse Team)