
How to Bypass Google Recaptcha while scraping with Requests

Python code to request the URL:

import requests

# Browser-like User-Agent, sent in the hope of avoiding the block
agent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
# Make the request to the link
response = requests.get('https://www.naukri.com/jobs-in-andhra-pradesh', headers=agent)

Output when printing the HTML:

<!DOCTYPE html>

    <title>Naukri reCAPTCHA</title> #a reCAPTCHA page title is returned instead of the actual title of the URL I requested
    <meta name="robots" content="noindex, nofollow">
        <link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />      
        <script src="https://www.google.com/recaptcha/api.js" async defer></script>   
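
A minimal sketch of how a scraper could notice that it received the challenge page shown above rather than the job listings. The function name `looks_like_recaptcha` is hypothetical, and the two markers it checks are taken from the output above. A page that embeds reCAPTCHA for an ordinary form would also match, so treat this as a rough heuristic.

```python
def looks_like_recaptcha(html: str) -> bool:
    """Return True if the HTML appears to be a reCAPTCHA challenge page."""
    lowered = html.lower()
    # Markers taken from the response shown in the question.
    return (
        "google.com/recaptcha/api.js" in lowered
        or "<title>naukri recaptcha</title>" in lowered
    )


# Sample built from the lines in the question's output.
sample = (
    "<title>Naukri reCAPTCHA</title>"
    '<script src="https://www.google.com/recaptcha/api.js" async defer></script>'
)
print(looks_like_recaptcha(sample))
```

With the question's code, this would be called as `looks_like_recaptcha(response.text)` before attempting to parse the page.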
k monish asked Apr 23 '20 04:04

People also ask

How do I bypass Google CAPTCHA when scraping?

If your web scraper is encountering CAPTCHAs, your first recourse should be to rotate your IP address. This helps surprisingly often, especially if you're using a quality proxy network. Otherwise, there are two main approaches to bypassing CAPTCHAs: you can either try to solve the challenge or avoid it altogether.
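
One way the IP rotation mentioned above could look is sketched below, assuming you have a list of proxies available. The proxy addresses are placeholders from a documentation-only range and `next_proxy_config` is a hypothetical helper name; only the shape of the `proxies` mapping accepted by `requests.get` is standard.

```python
import itertools

# Placeholder proxy addresses (assumption); replace with proxies you control.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]
_proxy_pool = itertools.cycle(PROXIES)


def next_proxy_config() -> dict:
    """Return a requests-style proxies mapping for the next proxy in the pool."""
    proxy = next(_proxy_pool)
    return {"http": proxy, "https": proxy}


first = next_proxy_config()
second = next_proxy_config()
third = next_proxy_config()  # the pool wraps around to the first proxy
```

Each request would then pass a fresh mapping, for example `requests.get(url, headers=agent, proxies=next_proxy_config())`.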

Can Google reCAPTCHA be bypassed?

Use a VPN. Changing your VPN location can let you legitimately get past Google's reCAPTCHA roadblocks. For the best results, choose a well-known VPN service instead of a free VPN, which comes with its own set of problems. Good VPNs disguise your traffic, protect your device details and don't record logs.

How do you bypass CAPTCHA request in Python?

You'll find the site key by inspecting the reCAPTCHA element of the page. Copy this site key and store it. When you send a request to 2captcha to solve the captcha, you receive the solved captcha in the response, which you need to enter in the hidden text field with ID g-recaptcha-response.
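
Instead of inspecting the element by hand, the site key can usually be read from the page source, since the reCAPTCHA v2 widget carries it in a `data-sitekey` attribute. The sketch below assumes that attribute is present in the HTML you downloaded; `extract_sitekey` and the sample key are hypothetical, and the call to the solving service is not shown.

```python
import re
from typing import Optional

# Matches data-sitekey="..." or data-sitekey='...'
SITEKEY_RE = re.compile(r'data-sitekey=["\']([^"\']+)["\']')


def extract_sitekey(html: str) -> Optional[str]:
    """Return the reCAPTCHA site key found in the HTML, or None if absent."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


# Hypothetical sample markup with a made-up key.
sample = '<div class="g-recaptcha" data-sitekey="HYPOTHETICAL_SITE_KEY"></div>'
print(extract_sitekey(sample))
```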

1 Answer

Using Google Cache along with a referer (in the header) will help you bypass the captcha.
Things to note:

  • Don't send more than 2 requests/sec, or you may get blocked.
  • The result you receive is a cached copy, so this will not be effective if you are trying to scrape real-time data.
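import requests

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",  # assumed value: the answer mentions a referer, but the original line is missing
}

r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)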

This gives:

>>> r.content
[Squeezed 2554 lines]
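
To stay under the limit of 2 requests/sec noted in this answer, a small throttle could be called before every request. This is a sketch only: the `Throttle` class is a hypothetical helper, not part of `requests`.

```python
import time


class Throttle:
    """Block so that calls to wait() happen at most max_per_second times a second."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(max_per_second=2)
# Call throttle.wait() immediately before each requests.get(...) call.
```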
Joshua answered Sep 27 '22 15:09
