
How To Solve CAPTCHA While Web Scraping?

Monday, August 9, 2021

CAPTCHAs are one of the most popular anti-scraping techniques used by website owners. reCaptcha v3 is Google’s CAPTCHA integration solution for detecting bot traffic on websites; NuCaptcha and hCaptcha are other advanced CAPTCHA solutions. CAPTCHAs are irritating not just for users but also for web scrapers, and solving them is one of the top challenges scrapers face. Read on to find different ways of solving CAPTCHAs while you scrape your target website’s content. Here’s how the article is structured:

 

Table of Contents:

What is a CAPTCHA? And What’s a reCaptcha?

Popular types of CAPTCHAs

How to solve/bypass reCAPTCHAs while scraping?

Bypassing reCaptcha in Octoparse?

Tips to prevent CAPTCHAs from interrupting your scraping experience

Conclusion

  

What is a CAPTCHA? And What’s a reCaptcha?

Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is an automated algorithm-generated textual, visual, or audio-based test. Solving CAPTCHAs requires three skills that humans are much better at than computers:

  • Invariant recognition (identifying different shapes and renderings of the same letter or object),
  • Segmentation (separating overlapping letters), and
  • Parsing context (holistically understanding the image, text, or audio)

 

reCaptcha is the most popular CAPTCHA-generating solution. It comes from Google and is easy to integrate into a website.

 

 

What are some popular types of CAPTCHAs?

1. Normal Captcha


 

This is the most widely used CAPTCHA: an image contains distorted text that is still readable by humans. To solve a normal CAPTCHA, you enter the distorted text in the text box.

 

2. Text Captcha

TextCaptcha is less common but works well for visually impaired users, because it is purely text-based rather than image-based. A curl example of TextCaptcha:

 

$ curl http://api.textcaptcha.com/myemail@example.com.json
{ "q": "If tomorrow is Saturday, what day is today?",
  "a": ["f6f7fec07f372b7bd5eb196bbca0f3f4",
        "dfc47c8ef18b4689b982979d05cf4cc6"] }

 

CAPTCHA: If tomorrow is Saturday, what day is today?

SOLUTION: Friday.
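
A site that uses TextCaptcha has to compare the visitor's answer against the hashes in the "a" field. The sketch below shows one way that check could look in Python. It assumes the hashes are MD5 hex digests of the trimmed, lower-cased answer; the function name and the demo hash are made up for illustration and are not real API output.

```python
import hashlib


def is_correct_answer(user_answer: str, answer_hashes: list[str]) -> bool:
    """Return True if the normalized answer's MD5 digest is in answer_hashes.

    Assumption: the service hashes the trimmed, lower-cased answer with MD5.
    """
    normalized = user_answer.strip().lower()
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return digest in answer_hashes


# Demo with a locally generated hash (not taken from a real API response).
demo_hashes = [hashlib.md5(b"friday").hexdigest()]
print(is_correct_answer("  Friday ", demo_hashes))  # True
print(is_correct_answer("Monday", demo_hashes))     # False
```

Because only hashes are exchanged, the plain-text answer never appears in the API response, which is why a scraper cannot simply read the solution from the JSON.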

 

3. Key Captcha


 

KeyCaptcha is another CAPTCHA integration service where you’re supposed to solve a puzzle.

 

 

4. Click Captcha


 

Click CAPTCHAs are image CAPTCHAs built around classification puzzles. reCaptcha, ASIRRA, and Snapchat’s Ghost Captcha are popular examples of classification-based Click CAPTCHAs.

 

 

5. Rotate Captcha

These are CAPTCHA puzzles based on image orientation. In Rotate CAPTCHAs, you click once or multiple times to rotate an image until it fulfills the verification terms. The most popular verification condition is to get an object the “right way up”. FunCaptcha is one of the “rotate CAPTCHA” integration providers, but it seems broken. RVerify.js is an open-source JavaScript library to verify image orientation.

 

 

6. GeeTest CAPTCHA

 

GeeTest CAPTCHAs are interesting: you have to move a piece of a puzzle, often by dragging a slider, or select certain images in a particular order.

 

 

7. hCaptcha


 

hCaptcha is very similar to reCaptcha. The main difference is who benefits from the data labeling that users perform when they solve the challenges: with hCaptcha, several companies can use that crowdsourced labeling, whereas with reCaptcha only Google benefits.

 

 

8. Capy puzzle


Similar to KeyCaptcha, Capy Puzzle is a puzzle-based CAPTCHA service. CAPY.ME is a service for integrating Capy puzzles into websites.

 

Read more on types of CAPTCHAs.

 

How to solve/bypass reCAPTCHAs while scraping?

Whether you’re scraping with an advanced “click and scrape” no-code screen-scraping tool or with a scraper written in Python, Java, or JavaScript, it is possible to solve and bypass all sorts of CAPTCHAs. No service or solution guarantees a 100% CAPTCHA solving rate, but popular tools such as DeathByCaptcha and 2captcha can reach success rates of up to 90%.

 

There are three common approaches to solving CAPTCHAs:

 

  • Human-based Captcha Solving 

CAPTCHAs are made to be solved by humans. There are companies that employ thousands of people to solve CAPTCHAs in real time at a very low price. The success rate is quite high, but latency is an issue with this approach.

 

So, how should you use a CAPTCHA solving service while scraping?

There are several CAPTCHA solving service providers on the market. Some notable ones are:

  • DeathByCaptcha
  • AZCaptcha
  • ImageTyperZ
  • EndCaptcha
  • BypassCaptcha
  • CaptchaTronix
  • AntiCaptcha
  • 2Captcha
  • CaptchaSniper

 

All these service providers follow a similar approach:

  1. Register on their website and get a token and credentials after paying, or for free if a trial is available.
  2. Implement their API/plugin in a language of your choice, e.g., Python, PHP, Java, or JS.
  3. Send your CAPTCHAs to their API.
  4. Receive the solved CAPTCHAs in the API response.
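
Steps 3 and 4 usually amount to a submit-then-poll loop, because a human solver needs a few seconds to answer. The Python sketch below illustrates that pattern without tying it to any real provider: `submit` and `fetch_result` are hypothetical callables that you would implement with the HTTP calls your chosen service documents, and the demo uses an in-memory fake instead of a network request.

```python
import time
from typing import Callable, Optional


def solve_captcha(
    submit: Callable[[bytes], str],
    fetch_result: Callable[[str], Optional[str]],
    image: bytes,
    poll_interval: float = 5.0,
    max_attempts: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Send a CAPTCHA image to a solver and poll until the answer is ready."""
    task_id = submit(image)  # step 3: hand the CAPTCHA to the service
    for _ in range(max_attempts):
        answer = fetch_result(task_id)  # step 4: None means "not ready yet"
        if answer is not None:
            return answer
        sleep(poll_interval)
    raise TimeoutError(f"no answer for task {task_id} after {max_attempts} polls")


# Demo with an in-memory fake provider (hypothetical, no network calls).
_pending = {"task-1": [None, None, "W9H5K"]}


def fake_submit(image: bytes) -> str:
    return "task-1"


def fake_fetch(task_id: str) -> Optional[str]:
    return _pending[task_id].pop(0)


print(solve_captcha(fake_submit, fake_fetch, b"fake-image-bytes",
                    sleep=lambda seconds: None))  # W9H5K
```

The polling interval and attempt limit are where the latency of human-based solving shows up, so tune them to the response times your provider actually delivers.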

 

  • Solving CAPTCHAs using OCRs (Optical Character Recognition)

This is a programmatic approach to solving CAPTCHAs. OCR stands for optical character recognition or optical character reader. OCR is an electronic or mechanical way of converting typed, handwritten, or printed text into machine-encoded text. You can feed an OCR tool a scanned document, a picture, or a scene (for example, a billboard). There are open-source tools such as Tesseract, GOCR, and OCRAD to get you started, so you don’t need to start from scratch. OCR tools can successfully solve different types of image-based CAPTCHAs.
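
As a rough illustration of the OCR approach, here is a minimal Python sketch that binarizes a CAPTCHA image and passes it to Tesseract. It assumes the Pillow and pytesseract packages and the Tesseract binary are installed; the threshold value and function names are illustrative choices, and heavily distorted CAPTCHAs will usually need more preprocessing than this.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Keep only letters and digits; OCR output often has stray spaces or newlines."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)


def solve_image_captcha(path: str, threshold: int = 140) -> str:
    """Binarize the image, run Tesseract on it, and return the cleaned text."""
    # Imported lazily so clean_ocr_text works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert("L")  # convert to grayscale
    # Map every pixel to pure black or white to suppress background noise.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    # --psm 7 tells Tesseract to treat the image as a single line of text.
    raw = pytesseract.image_to_string(binary, config="--psm 7")
    return clean_ocr_text(raw)


print(clean_ocr_text(" aB 3x\n"))  # aB3x
```

You would call `solve_image_captcha("captcha.png")` on a downloaded CAPTCHA image (the file name is a placeholder) and type the returned string into the form.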

  

  • Self-solving 

If you’re scraping a single site that only occasionally verifies real users with reCAPTCHAs, you may want to solve the reCaptcha manually. In that case, you can configure your scraping workflow to:

  • detect a reCAPTCHA and, while you solve it, do one of the following:
    • pause scraping for a specified time, say 7-8 seconds, or
    • wait for an element on the page to become visible, or
    • wait for your input before it starts scraping again
  • resume scraping as usual once the CAPTCHA is solved
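
The "wait for an element" option can be approximated with a small polling helper. The Python sketch below is one possible shape for it; `wait_until` and its parameters are illustrative names, and `condition` stands for whatever check your scraper uses, such as "the CAPTCHA element is gone" or "the results table is visible".

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll condition() until it returns True or timeout seconds have passed.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


# Demo: the "page" becomes ready on the third check.
checks = {"count": 0}


def page_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(page_ready, sleep=lambda seconds: None))  # True
```

A fixed pause of 7-8 seconds is simpler, but polling lets the scraper resume as soon as you finish solving the CAPTCHA.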

 

To detect a reCaptcha, it’s important to understand its implementation.

 

How is reCaptcha integrated into websites?

Integrating reCaptcha involves the following steps:

1. Loading the Javascript API

<script src="https://www.google.com/recaptcha/api.js?render=reCAPTCHA_site_key"></script>

 

2. Calling a function to handle the callback and binding it to a button or an action.

<button class="g-recaptcha"
        data-sitekey="reCAPTCHA_site_key"
        data-callback='onSubmit'
        data-action='submit'>Submit</button>

 

Function:

<script>
  function onSubmit(token) {
    document.getElementById("demo-form").submit();
  }
</script>

 

To detect a reCaptcha, use an XPath that looks for an element whose class attribute contains “recaptcha”:

XPath: //*[contains(@class, "recaptcha")]

 

If an element is present, it means there is a Captcha on the page that should be solved. You can pause your scraper, solve the captcha and resume scraping again once it is solved.
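
The same class-based check can be sketched in Python. To stay dependency-free, the example below uses the standard library's html.parser instead of an XPath engine, but it applies the same test: does any element have a class attribute containing "recaptcha"? The class and function names are illustrative.

```python
from html.parser import HTMLParser


class RecaptchaDetector(HTMLParser):
    """Sets .found when any tag has a class attribute containing 'recaptcha'."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "class" and value and "recaptcha" in value:
                self.found = True


def has_recaptcha(html: str) -> bool:
    """Return True if the HTML contains an element with a recaptcha class."""
    detector = RecaptchaDetector()
    detector.feed(html)
    return detector.found


page = '<form id="demo-form"><button class="g-recaptcha">Submit</button></form>'
print(has_recaptcha(page))                      # True
print(has_recaptcha("<p>No challenge here</p>"))  # False
```

A scraper could call `has_recaptcha` on each fetched page and switch to its pause-and-solve routine whenever it returns True.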

 

Now, we shall see how to solve a reCaptcha in Octoparse.

 

 

Bypassing reCaptcha in Octoparse?

What is Octoparse?


As mentioned earlier, you can scrape the web using click-and-scrape no-code solutions. Octoparse is an industry-leading no-code web scraping solution. It’s free to download and use, and it offers affordable plans for scalable, high-speed scraping. If you’re new to Octoparse, you can find great resources here. If you’re already acquainted with Octoparse, here’s how you can solve CAPTCHAs in it:

1. Local machine scraping:

When you use Octoparse to scrape the web on your local machine, we recommend the “wait before execution” or “wait until specific element appears” features, found under the advanced customization options of the Octoparse scraping workflow.


 

 

2. Cloud scraping

For large projects, the Octoparse team offers a JavaScript template customization service to get around the CAPTCHA/reCAPTCHA issue.

 

Tips to prevent CAPTCHAs from interrupting your scraping experience

1. Use rotating IP proxies, rotate user agents, and clear your cookies. Octoparse provides options to configure these. Normally, a website triggers its anti-scraping detection service when the same IP starts hitting its servers aggressively. If you use thousands of proxies and rotate them, you may avoid CAPTCHAs altogether.


 

 

2. Obey the robots.txt file. This file contains the website’s crawling rules: for example, whether the site allows scraping at all and, if it does, which URLs it does not want you to scrape.

3. Use a headless browser if you’re writing your own web scraper. Tools like Octoparse take care of this automatically, because they work as smart browsers.

4. Set headers and referrers in your requests to the server if you’re not using a full-scale browser.

5. To scrape data behind logins, save cookies. This is how you do it in Octoparse.

6. Beware of invisible honeypot traps on websites. These are elements or links that are not visible on the page, so if your crawler follows them, the website knows it’s a bot, because a human can’t click such a link in a normal browser like Chrome or Firefox.

7. Keep random delays between consecutive requests, especially when you’re hitting the website repeatedly from the same IP address.

8. Use CAPTCHA solving services.
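
If you write your own scraper, two of these tips (obeying robots.txt and adding random delays) are easy to sketch with Python's standard library. The example below parses an inline sample robots.txt rather than fetching a real one; the user agent name, URLs, and delay bounds are illustrative assumptions.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(min_seconds: float = 2.0, max_seconds: float = 6.0,
                 sleep=time.sleep) -> float:
    """Sleep for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Sample rules for illustration only (not taken from a real site).
sample_rules = """User-agent: *
Disallow: /private/
"""
robots = build_robots_parser(sample_rules)
print(robots.can_fetch("my-scraper", "https://example.com/products"))      # True
print(robots.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

Against a live site you would instead call `set_url()` with the site's robots.txt address followed by `read()`, then check `can_fetch()` and call `polite_delay()` before each request.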

 

 

Conclusion

Scraping the web to extract data is crucial for businesses that want to gain insights and make data-led decisions. Web data is also important for training machine learning algorithms. In this article, we covered the different types of CAPTCHAs, approaches to solving reCaptcha, ways to prevent CAPTCHAs, and how to solve CAPTCHAs in Octoparse. As a reminder, for large projects we provide JavaScript template customization to integrate top CAPTCHA solving services into Octoparse. Contact our team for any scraping requirements. Happy CAPTCHA-free Scraping!

 

Author: Nishant
