How to bypass Google reCAPTCHA while scraping with Requests

Tags: python, beautifulsoup, python-requests, web-scraping

Python code to request the URL:

import requests

# Send a browser-like User-Agent header to work around the blocking issue
agent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"}
# Make the request to the link
response = requests.get("https://www.naukri.com/jobs-in-andhra-pradesh", headers=agent)

Output when printing the HTML:

<!DOCTYPE html>
<html>
  <head>
    <title>Naukri reCAPTCHA</title> <!-- this title is returned in place of the title of the URL I requested -->
    <meta name="robots" content="noindex, nofollow">
    <link rel="stylesheet" href="https://static.naukimg.com/s/4/101/c/common_v62.min.css" />
    <script src="https://www.google.com/recaptcha/api.js" async defer></script>
  </head>
</html>
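
To tell this challenge page apart from a real listing page in code, one option is to inspect the returned HTML before parsing it. The following is only a heuristic sketch: the helper name `is_captcha_page` is hypothetical, and the two markers it checks (the word "reCAPTCHA" in the title and the reCAPTCHA script URL) are taken from the output above, so other sites or future versions of this page may need different checks.

```python
import re


def is_captcha_page(html):
    """Heuristically decide whether `html` is a reCAPTCHA challenge page.

    Hypothetical helper: it looks for the two markers visible in the
    output above, not for anything guaranteed by the site.
    """
    # Pull out the <title> text, if there is one.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1) if match else ""

    # Marker 1: the page title mentions reCAPTCHA.
    # Marker 2: the page loads Google's reCAPTCHA script.
    lowered = html.lower()
    return "recaptcha" in title.lower() or "google.com/recaptcha/api.js" in lowered
```

With the request above, this would be called as `is_captcha_page(response.text)`.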

asked Apr 23 '20 04:04

k monish

1 Answer

Requesting the page through Google Cache, with a referer set in the request headers, will help you bypass the CAPTCHA.
Things to note:

- Don't send more than 2 requests per second, or you may get blocked.
- The result you receive is a cached copy, so this will not work if you need real-time data.

Example:

import requests

# Browser-like User-Agent plus a Google referer
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "referer": "https://www.google.com/",
}

# Fetch Google's cached copy of the page instead of the live site
r = requests.get("http://webcache.googleusercontent.com/search?q=cache:www.naukri.com/jobs-in-andhra-pradesh", headers=header)

This gives:

>>> r.content
[Squeezed 2554 lines]
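
The limit of 2 requests per second mentioned above can be enforced with a small throttle that sleeps between calls. This is a minimal sketch, not part of the original answer: the class name `Throttle` and its parameters are hypothetical, and the default interval of 0.5 seconds simply corresponds to 2 requests per second.

```python
import time


class Throttle:
    """Keep at least `min_interval` seconds between successive wait() calls."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        # 0.5 s between requests corresponds to at most 2 requests per second.
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

To use it, create one `Throttle()` and call its `wait()` method immediately before each `requests.get(...)` call in the scraping loop.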

answered Sep 27 '22 15:09

Joshua
