undefined
Blog > Web Scraping > Post

5 Things You Need to Know of Bypassing CAPTCHA for Web Scraping

Friday, August 6, 2021

If you have ever tried to log in to a website, there's a good chance that you have been asked to enter some characters which are not easy to read. The illegible characters are called CAPTCHA. They are a little bit annoying for users and often drive people who are using web scrapers crazy as they are hard to deal with by scraping bots.

We are going to talk about 5 things you should know about CAPTCHA and help you better bypass it for web scraping.

 

5 Things You Should Know of Bypassing Captcha

1. What is CAPTCHA?

2. How does CAPTCHA work 

3. What are the common types of CAPTCHA 

4. Why do websites apply CAPTCHA

5. How to deal with CAPTCHA for web scraping

 

 

1. What is CAPTCHA?

According to Wikipedia, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a type of challenge-response test used in computing to determine whether or not the user is human. This is a way to detect malicious robot behaviors, block the robot and protect the website from harm.

It is commonly used across the internet, particularly when purchasing products online or logging into a website.

 

2. How does CAPTCHA work

CAPTCHA technology is based on the Turing Test. It is used to test whether a machine can think like humans. The goal of CAPTCHA is to ask questions or make challenges that computers are unable to deal with. It usually shows a distorted string of random characters or numbers. It works because a human looking at a distorted picture can read the words without any challenge, while a scraping tool doesn’t recognize them easily.

Even the most sophisticated automated system, which has been programmed to scan a picture of printed text and read the words, would still find it difficult to identify the words when they are too much distorted.

 

3. What are the common types of CAPTCHA

CAPTCHA comes in several sizes and of different types. The most common types of CAPTCHA are:

    • Text-based Captcha
    • Image-based Captcha
    • Audio-based Captcha
    • ReCaptcha vs. Captcha

 

Text-based CAPTCHA 

A text-based CAPTCHA test is made up of two parts: a randomly generated sequence of letters and/or numbers that appear as a distorted image, and a text box for input. To pass the test and prove your human identity, simply type the characters you see in the image into the text box.

CAPTCHA

Simply showing the characters are not that difficult for bots. To increase the difficulty, there is mathematical CAPTCHA, which involves a basic math problem with easy-to-read numbers, and 3D CAPTCHA, which displays the characters with 3D effect.

        bypass captcha type 3

 

Image-based CAPTCHA

Image-based CAPTCHA usually provides users with images of objects, animals, people or landscapes, instead of distorted text, to distinguish a human from a computer program. Users are required to select the correct images that they are asked to identify or drag a block into an image to make it complete.

bypass captcha type 4

 

Audio-based CAPTCHA

Audio-based CAPTCHA utilizes random words or numbers drawn from recordings, combines them and even adds some noise to them. The users are required to enter the words or numbers in the recording. Sound CAPTCHAs are harder to deal with comparing with content and picture CAPTCHAs as it is not easy to let a scraping bot learn to listen.

bypass captcha type 5

 

ReCaptcha vs. hCaptcha

Compared to Captcha, Google's reCaptcha now is more widely used among websites. There are fair reasons:

    • For developers, it is easier to set up and maintain
    • The test is more friendly for users to solve (sometimes those squiggly letters can be really tricky)
    • Free service is available and Google is taking good care of it

However, even reCaptcha with an easy question can interrupt the smooth browsing journey and annoy the user. So there comes invisible reCaptcha.

 

"Google's Invisible reCAPTCHA service, which is able to differentiate humans from bots without additional input from the website user. reCAPTCHA uses an advanced risk analysis engine and adaptive CAPTCHAs to keep automated software from engaging in abusive activities on your site. It does this while letting your valid users pass through with ease."

——Quoted from InterGen.com

google recaptcha

You may have heard about hCaptcha and wonder what is the difference between hCaptcha and reCaptcha.

In fact, reCaptcha is offered by Google and with the service set up on your site, every time when your users solve a captcha, the user data is fed back to Google. Google may use this data to improve their services, for example teach the machine to categorize photographs more intelligently. While it can be sensitive as well in regard of personal privacy.

Hcaptcha is provided by the Intuitive Machine which is far from a data tycoon and claims to protect user privacy.

 

 

4. Why do websites apply a CAPTCHA

Nowadays, computing has become pervasive, and computerized tasks and services are commonplace, so increased levels of security have been more important. The development of CAPTCHA for computers is to ensure that they are dealing with humans in situations where human interaction is essential to security, for example, logging into a website or paying on the Internet.

CAPTCHA also blocks spammers and bots that try to automatically harvest online data, try to automatically sign up for or make use of websites, blogs, or forums. It protects websites from overrun by spam, fraudulent registrations, and other illegal behaviors.

 

5. How to deal with CAPTCHA for web scraping

CAPTCHA can easily break down the crawlers you set up once it shows in the process of extraction, so dealing with it is quite essential for web scraping. The best way to deal with a CAPTCHA is to try your best to avoid encountering it in the face :). 

That means we try to avoid triggering the Captcha in the first place:

    • Slow down the scraping to make your behaviors less robot-like
    • Make use of proxy servers to minimize IP tracing
    • Be careful of honeypot traps

 

When you face CAPTCHA head-on and no coming back, there are ways to solve it.

If you use Octoparse for web scraping, you can switch the built-in browser to the manual mode and solve the CAPTCHA just as easily as a human.

For people who code their own scrapers, there are many CAPTCHA solvers that can be integrated. These Captcha solvers 

      • Death by CAPTCHA: this service allows users to connect the service via API to realize solving CAPTCHA automatically during the scraping process.
      • Bypass CAPTCHA: this CAPTCHA solving tool can deal with normal text CAPTCHA and even reCAPTCHA.
      • 2CAPTCHA: 2Captcha is a wonderful service provider to help you solve the problem.

 

CAPTCHA can be a painful headache for web scraping. But don’t worry. With every generation of CAPTCHA, there is every generation of bots. CAPTCHA has become defeatable with the rise of scraping tools and CAPTCHA solvers. You can enjoy web scraping unimpededly with the help of these tools.

 

 

 

Author: Yina Huang

Proofread: Isabel Li 

 

Artículo en español: 5 Cosas que Debes Saber al Evitar CAPTCHA para El Web Scraping
También puede leer artículos de web scraping en El Website Oficial
 

 

More Resources
 

Top 20 Web Scraping Tools to Scrape the Websites Quickly

How to Bypass Craigslist Captcha

25 Hacks to Grow you Business with Web Data Extraction

Web Scraping Templates Take Away

Video: Create Your First Scraper with Octoparse 8

 

 

 

Download Octoparse to start web scraping or contact us for any
question about web scraping!

Contact Us Download
We use cookies to enhance your browsing experience. Read about how we use cookies and how you can control them by clicking cookie settings. If you continue to use this site, you consent to our use of cookies.
Accept decline