How To Solve CAPTCHA While Web Scraping?

CAPTCHAs are one of the most popular anti-scraping techniques used by website owners. reCaptcha v3 is Google’s CAPTCHA integration for detecting bot traffic on websites, and NuCaptcha and hCaptcha are other advanced CAPTCHA solutions. CAPTCHAs are irritating, not just for users but also for web scrapers, and solving them is one of the top challenges scrapers face. This article covers what CAPTCHAs are, the main types, different ways of solving them while you scrape your target website’s content, and tips for avoiding them in the first place.

What is a CAPTCHA? And What’s a reCaptcha?

A Completely Automated Public Turing Test to Tell Computers and Humans Apart (CAPTCHA) is an automated, algorithm-generated test that can be textual, visual, or audio-based. Solving CAPTCHAs requires three skills that humans are much better at than computers:

  • Invariant recognition (identifying the same letter or object across different shapes and images),
  • Segmentation (separating overlapping letters), and
  • Parsing context (holistically understanding the image, text, or audio)

reCaptcha is the most popular CAPTCHA-generating solution. It’s from Google and can be easily integrated into a website. The most common types of CAPTCHA are described below.

1. Normal Captcha

This is the most widely used CAPTCHA: an image contains distorted text that is still readable by humans. To solve a normal CAPTCHA, you enter the distorted text in the text box.

2. Text Captcha

TextCaptcha is not that popular but is great for visually impaired users. It is not image-based; it is purely text. A curl example of TextCaptcha:

$ curl http://api.textcaptcha.com/myemail@example.com.json

{ "q": "If tomorrow is Saturday, what day is today?",
  "a": ["f6f7fec07f372b7bd5eb196bbca0f3f4",
        "dfc47c8ef18b4689b982979d05cf4cc6"] }

CAPTCHA: If tomorrow is Saturday, what day is today?

SOLUTION: Friday.
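
The 32-character strings in the "a" array look like MD5 checksums. Assuming they are MD5 digests of the accepted answers in lower case, a client could check a candidate answer as in the minimal sketch below. The function name `is_correct_answer` is hypothetical, and the sample hash is computed locally for illustration rather than taken from the API response above.

```python
import hashlib


def is_correct_answer(user_answer: str, accepted_hashes: list[str]) -> bool:
    """Return True if the normalised answer's MD5 digest is among the accepted hashes."""
    # Assumption: the service hashes the lower-cased answer without surrounding spaces.
    normalised = user_answer.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    return digest in accepted_hashes


# Illustrative data only: this hash is computed here, not copied from the API.
sample_hashes = [hashlib.md5(b"friday").hexdigest()]
print(is_correct_answer("  Friday ", sample_hashes))  # True
print(is_correct_answer("Monday", sample_hashes))     # False
```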

3. Key Captcha

KeyCaptcha is another CAPTCHA integration service where you’re supposed to solve a puzzle.

4. Click Captcha

Click CAPTCHAs are image CAPTCHAs built on classification puzzles. reCaptcha, ASIRRA, and Snapchat’s Ghost Captcha are popular examples of classification-based Click CAPTCHAs.

5. Rotate Captcha

These are CAPTCHA puzzles based on image orientation. In Rotate CAPTCHAs, you click once or several times to rotate an image until it meets the verification condition. The most popular condition is to get an object the “right way up”. FunCaptcha is one of the “rotate CAPTCHA” integration providers, but it seems broken. RVerify.js is an open-source JavaScript library to verify image orientation.

6. GeeTest CAPTCHA

GeeTest CAPTCHAs are interesting: you have to move a piece of a puzzle, often by dragging a slider, or select certain images in a particular order.

7. hCaptcha

hCaptcha is very similar to reCaptcha. The main difference is who benefits from the data labeling that users perform when they solve challenges on websites: with hCaptcha, several companies can make use of the labeled data, whereas with reCaptcha only Google benefits from the crowdsourced labeling.

8. Capy puzzle

Similar to KeyCaptcha, Capy Puzzle is a puzzle-based CAPTCHA service. CAPY.ME is a service for integrating Capy puzzles into websites.

Read more on types of CAPTCHAs.

How to solve/bypass reCAPTCHAs while scraping?

Whether you’re scraping with an advanced “click and scrape” no-code screen-scraping tool or your scraper is written in Python, Java, or JavaScript, it is possible to solve or bypass all sorts of CAPTCHAs. No service or solution guarantees a 100% CAPTCHA-solving rate, but popular tools such as DeathByCaptcha and 2Captcha can reach success rates of up to 90%.

There are two popular approaches to solving CAPTCHAs.

Human-based Captcha Solving 

CAPTCHAs are made to be solved by humans. Some companies employ thousands of people to solve CAPTCHAs in real time at a very low price. The success rate is quite high, but latency is an issue with this approach.

So, how should you use a CAPTCHA-solving service while scraping?

There are several CAPTCHA-solving service providers on the market. Some notable ones are:

  • DeathByCaptcha
  • AZCaptcha
  • ImageTyperZ
  • EndCaptcha
  • BypassCaptcha
  • CaptchaTronix
  • AntiCaptcha
  • 2Captcha
  • CaptchaSniper

All these service providers follow a similar approach:

  1. Register on their website and get a token and credentials, either after paying or for free if a trial is available.
  2. Implement their API or plugin in a language of your choice, e.g., Python, PHP, Java, or JavaScript.
  3. Send your CAPTCHAs to their API.
  4. Receive the solved CAPTCHAs in the API response.
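
Steps 3 and 4 usually amount to a submit-then-poll loop. The sketch below shows that loop in Python under stated assumptions: it does not use any real provider’s API, and the `submit` and `fetch_result` callables are hypothetical stand-ins for whatever HTTP calls your chosen provider documents.

```python
import time
from typing import Callable, Optional


def solve_with_service(
    image_bytes: bytes,
    submit: Callable[[bytes], str],
    fetch_result: Callable[[str], Optional[str]],
    timeout: float = 60.0,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a CAPTCHA, then poll until the service returns the solved text."""
    task_id = submit(image_bytes)         # step 3: send the CAPTCHA
    waited = 0.0
    while waited <= timeout:
        answer = fetch_result(task_id)    # step 4: ask for the solution
        if answer is not None:
            return answer
        sleep(poll_interval)              # human solvers need a few seconds
        waited += poll_interval
    raise TimeoutError(f"CAPTCHA task {task_id} not solved within {timeout} seconds")


# Fake provider for demonstration: the answer is ready on the third poll.
pending = iter([None, None, "x7Kp2"])
answer = solve_with_service(
    b"fake-image-bytes",
    submit=lambda data: "task-1",
    fetch_result=lambda task_id: next(pending),
    sleep=lambda seconds: None,           # skip real waiting in this demo
)
print(answer)  # x7Kp2
```

Passing the provider calls in as functions keeps the loop independent of any one service, so switching providers only means swapping the two callables.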

Solving CAPTCHAs using OCRs (Optical Character Recognition)

This is a programmatic approach to solving CAPTCHAs. OCR stands for optical character recognition (or optical character reader): an electronic or mechanical way of converting typed, handwritten, or printed text into machine-encoded text. You can feed an OCR engine a scanned document, a picture, or a scene (for example, a billboard). Open-source tools such as Tesseract, GOCR, and Ocrad mean you don’t need to start from scratch. OCR engines can successfully solve different types of image-based CAPTCHAs.
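
As a rough illustration, the sketch below runs Tesseract on a CAPTCHA image from Python. It assumes the third-party Pillow and pytesseract packages and the Tesseract binary are installed. The threshold of 140 and the helper names are arbitrary choices for this example, and heavily distorted CAPTCHAs will likely need more preprocessing than this.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Keep only letters and digits, dropping whitespace and stray punctuation."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)


def solve_image_captcha(path: str, threshold: int = 140) -> str:
    """Binarise the image and run Tesseract on it as a single line of text."""
    # Imported here so clean_ocr_text works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    image = Image.open(path).convert("L")                        # greyscale
    image = image.point(lambda p: 255 if p > threshold else 0)   # black and white
    raw = pytesseract.image_to_string(image, config="--psm 7")   # psm 7: one text line
    return clean_ocr_text(raw)


print(clean_ocr_text(" x7 Kp-2\n"))  # x7Kp2
```

Calling `solve_image_captcha("captcha.png")` (a hypothetical file name) would return the recognised characters for a saved CAPTCHA image.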

Self-solving 

If you’re scraping a single site that only occasionally verifies users with reCAPTCHA, you may want to solve the challenge manually. In that case, you can configure your scraping workflow to:

  • detect a reCAPTCHA and then, while you solve it, either
    • pause scraping for a fixed time, say 7-8 seconds,
    • wait for an element on the page to become visible, or
    • wait for your input before it starts scraping again
  • resume scraping as usual once the CAPTCHA is solved
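
The pause-and-resume logic can be sketched as a small polling helper. This is a minimal example under assumptions: `captcha_present` is a hypothetical callable that you would back with your own check (for instance, a browser-automation lookup for the CAPTCHA element), and the timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_for_manual_solve(
    captcha_present: Callable[[], bool],
    timeout: float = 120.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pause until the CAPTCHA disappears; return False if the timeout passes first."""
    waited = 0.0
    while captcha_present():
        if waited >= timeout:
            return False          # give up so the scraper can log and skip the page
        sleep(poll_interval)      # scraping stays paused while a human solves it
        waited += poll_interval
    return True                   # CAPTCHA gone: safe to resume scraping


# Demo with a fake check: the CAPTCHA disappears on the third look.
checks = iter([True, True, False])
print(wait_for_manual_solve(lambda: next(checks), sleep=lambda s: None))  # True
```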

To detect a reCaptcha, it’s important to understand its implementation.

How is reCaptcha integrated into websites?

Integrating reCaptcha involves the following steps:

1. Loading the Javascript API

<script src="https://www.google.com/recaptcha/api.js?render=reCAPTCHA_site_key"></script>

2. Calling a function to handle the callback and binding it to a button or an action.

<button class="g-recaptcha"
        data-sitekey="reCAPTCHA_site_key"
        data-callback='onSubmit'
        data-action='submit'>Submit
</button>

Function:

<script>
  function onSubmit(token) {
    document.getElementById("demo-form").submit();
  }
</script>

Now, to detect a CAPTCHA, use an XPath expression that looks for an element whose class attribute contains “recaptcha”:

XPath: //*[contains(@class, "recaptcha")]

If such an element is present, there is a CAPTCHA on the page that needs to be solved. You can pause your scraper, solve the CAPTCHA, and resume scraping once it is solved.
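
The same check can be approximated without an XPath engine. The sketch below uses only Python’s standard-library `html.parser` to flag any element whose class attribute contains “recaptcha”. The class and function names are made up for this example, and, unlike the XPath expression, the match here ignores case.

```python
from html.parser import HTMLParser


class RecaptchaDetector(HTMLParser):
    """Flags any element whose class attribute contains 'recaptcha'."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = dict(attrs).get("class") or ""
        if "recaptcha" in classes.lower():
            self.found = True


def has_recaptcha(html: str) -> bool:
    detector = RecaptchaDetector()
    detector.feed(html)
    return detector.found


page = '<form id="demo-form"><button class="g-recaptcha" data-sitekey="KEY">Submit</button></form>'
print(has_recaptcha(page))                       # True
print(has_recaptcha("<p class='intro'>Hi</p>"))  # False
```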

Bypassing reCaptcha in Octoparse

What is Octoparse?

As mentioned earlier, you can scrape the web using “click and scrape” no-code solutions. Octoparse is an industry-leading no-code web scraping solution. It’s free to download and use for scraping, and it offers affordable plans for scalable, high-speed scraping. If you’re new to Octoparse, you can find great resources here. If you’re already acquainted with Octoparse, here’s how you can solve CAPTCHAs in it:

Local machine scraping

When you use Octoparse to scrape the web on your local machine, it is recommended to use the “wait before execution” or “wait until specific element appears” features, found under the advanced customization options of the Octoparse scraping workflow. If a version update means you don’t see these options, check the help center for support.

Cloud scraping

For large projects, the Octoparse team offers a JavaScript template customization service to get around the CAPTCHA/reCAPTCHA issue.

Tips to prevent CAPTCHAs from interrupting your scraping experience

1. Use rotating IP proxies, rotate user agents, and clear your cookies. Octoparse provides options to configure these. Normally, a website triggers its anti-scraping detection service when the same IP starts hitting its servers aggressively. If you rotate through thousands of proxies, you may avoid facing CAPTCHAs.

2. Obey the robots.txt file. This file contains the website’s rules for crawlers: for example, whether the website allows you to scrape it at all and, if so, which URLs it does not want you to scrape.

3. Use headless browsers if you’re writing your own web scraper. Tools like Octoparse take care of this automatically, as they are smart browsers.

4. Send headers such as Referer in your requests to the server if you’re not using a full-scale browser.

5. When scraping data behind logins, save your cookies. This is how you do it in Octoparse.

6. Beware of invisible honeypot traps on websites. These are elements or links that are not visible to users, so if your crawler follows them, the website knows it’s a bot: a human can’t click such a link in a normal browser like Chrome or Firefox.

7. Keep random delays between consecutive requests, especially when you’re hitting the website repeatedly from the same IP addresses.

8. Use CAPTCHA-solving services.
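
Several of the tips above (obeying robots.txt, rotating user agents and proxies, sending headers, and adding random delays) can be combined in a hand-written scraper. The sketch below uses only Python’s standard library. The user-agent strings, proxy addresses, URLs, and helper names are all placeholders made up for this example, and the requests are built but never sent.

```python
import itertools
import random
import urllib.request
from urllib.robotparser import RobotFileParser

# Placeholder pools: substitute your own user agents and proxy addresses.
USER_AGENTS = itertools.cycle([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
])
PROXIES = itertools.cycle(["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"])


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url: str, referer: str) -> tuple[urllib.request.Request, str]:
    """Return a request with rotated headers plus the proxy to send it through."""
    headers = {
        "User-Agent": next(USER_AGENTS),   # tip 1: rotate user agents
        "Referer": referer,                # tip 4: send browser-like headers
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers), next(PROXIES)


def random_delay(minimum: float = 2.0, maximum: float = 6.0) -> float:
    """Pick a random pause (in seconds) to sleep between consecutive requests (tip 7)."""
    return random.uniform(minimum, maximum)


robots = load_robots("User-agent: *\nDisallow: /private/\n")
for url in ["https://example.com/products", "https://example.com/private/data"]:
    if not robots.can_fetch("*", url):     # tip 2: obey robots.txt
        print("skip", url)
        continue
    request, proxy = build_request(url, referer="https://example.com/")
    print("fetch", url, "via", proxy, "after", round(random_delay(), 1), "s")
```

To actually send a request through the chosen proxy, one option is an opener built with `urllib.request.ProxyHandler`, calling `time.sleep` with the value from `random_delay` before each fetch.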

Conclusion

Scraping the web to extract data is crucial for businesses that want to gain insights and make data-led decisions. Web data is also important for training machine learning algorithms. In this article, we looked at the different types of CAPTCHAs, approaches to solving reCaptcha, ways to prevent CAPTCHAs, and solving CAPTCHAs in Octoparse. As a reminder, for large projects we provide JavaScript template customization to integrate top CAPTCHA-solving services into Octoparse.
