Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Has reCaptcha been cracked / hacked / OCR'd / defeated / broken? [closed]

I notice that almost all the answers here relate to the ineffectiveness of the concept of CAPTCHA, in principle - and while I very much agree with them, in fact gave a talk at OWASP a few months ago explaining just that - the question is very specific, so I will provide for a demonstration.
But first, I will reiterate that demonstration aside, re-read the other comments, since it's truth that CAPTCHA is pointless and not helpful, irrelevant of implementation....

But really, check out CAPTCHA Killer. You can upload a CAPTCHA image, and it will automatically, if not immediately, provide the OCR'd answer. It also provides for an API (REST, I think, but maybe also SOAP). I personally tried numerous reCAPTCHA images, and it was actually some of the easiest ones (or at least quickest) broken.

UPDATE: CAPTCHA Killer's website is now taken down, apparently under legal pressure. See http://captcha.org/ for a complete overview of the topic.

And yeah, OCR is not the best way to break a CAPTCHA protected site - there are many other better ways.


You might be interested in this detailed report on how 4chan defeated reCAPTCHA, and used it to manipulate Time.com's annual TIME 100 Poll results.

Hacking Recaptcha (aka ‘The Penis Flood’)

The next tactic used was to see if they could find a flaw in the reCAPTCHA implementation. One thing they discovered about reCAPTCHA was that it always presents two words to a user for decoding - one word is a control word known by the reCAPTCHA system, while the other is an unknown word (reCAPTCHA uses the humans to help correct OCR errors). Wikipedia describes the process: “Scanned text is subjected to analysis by two different optical character recognition programs; in cases where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word already known and is labeled by the human. Those words that are consistently given a single label by human judges are recycled as control words”. 2iasdo4 What Anonymous realized was that if they always labeled the unknown scanned text with the same word - and if they did this thousands and thousands of times eventually a large percentage of the unknown words would be mislabeled with their word. All they had to do was look at the two words in the captcha, enter the proper label for the ‘easy’ one (presumably that would be the one that the two optical scanners would agree upon) and enter the word “penis” for the hard one. If they did this often enough, then soon a significant percentage of the images would be labeled as ‘penis’ and the ability to autovote would be restored (one side effect, that was not lost on Anonymous, was the notion that for years to come there would be a number of digital books with the word ‘penis’ randomly inserted throughout the text. Update: I asked Ben Maurer, chief engineer of reCAPTCHA about this ‘penis flood‘ attack, Ben says that they’ve anticipated this type of attack and they have numerous protections that will keep the penises from penetrating the reCAPTCHA barrier.

Optimizing reCAPTCHA

As appealing as the notion of sprinkling the word ‘penis’ into texts, the Anonymous team knew that the clock was ticking, and if they were going to restore the Message they didn’t have time to wait for the autovoters to come back online - they were going to have to vote manually, many, many times. And so they needed to be able to enter captcha’s as fast as they could. They developed a set of guidelines that allowed them to quickly decide which reCAPTCHA words they could skip. For example:

You will be given 2 words: 1 real, 1 fake.

For [REAL FAKE] or [FAKE REAL], you can just type in REAL and it should be accepted.

If it’s [LOOKSREAL LOOKSREAL] or [LOOKSFAKE LOOKSFAKE], it’s usually just quicker to just type in both words. Don’t waste precious time deciding which one of them is real.

Use both the appearance and the type of word to identify a fake word. Don’t rely on just one of them.

The whole ruleset is here: fake captcha.


The weakness of CAPTCHA systems is that people set up rooms full of people in China whose only job it is is to look at a CAPTCHA image and type in the result, which plugs into the automated system that's actually doing the spamming.

Not much you can do about that really.

It's also far cheaper than trying to do image recognition, OCR, etc on the actual image (you may get a response for under $0.01 the other way).


Before giving in to the pressure of using captcha, consider creative workarounds such as having a field labeled "Your Comments" that is hidden by CSS. If the field is entered, the request is dropped by the server. Most bots will fall for it even if there is still not a good way to defeat the room full of underpaid laborers, which captcha does not help with anyways.

UPDATE: Just read a case study where removing CAPTCHA increased conversion rates by almost 10%. That would indicate to me that it is rather broken if you are losing 10% of your leads just to filter out bots. Imagine what 10% means to most businesses.


My favorite captcha is from Microsoft: http://research.microsoft.com/en-us/um/redmond/projects/asirra/

Asirra (Animal Species Image Recognition for Restricting Access) is a HIP that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but our user studies have shown that people can accomplish it quickly and accurately. Many even think it's fun!

It is a free service and they have example code to get you started.

I wonder how long it will be before it is cracked.