I am trying to scrape some articles from xyz. However, after a certain number of scrapes, a captcha appears, and I am running into major issues.
I am using fake_useragent (from fake_useragent import UserAgent) to randomize my headers.
I am using random sleep times between requests.
I am changing my IP address using a VPN once a captcha appears. However, a captcha still appears even after my IP address changes.
It is also strange that while a captcha appears in the request response, no captcha appears in the browser.
So, I assume that my header is just wrong.
I turned off JS and cookies when capturing this request, because with cookies and JS enabled there is clearly information the website can track me with.
headers = {
"authority": "seekingalpha.com",
"method": "GET",
"path": "/article/4230872-dillards-still-room-downside",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding": "gzip, deflate, br",
"accept-language": 'en-US,en;q=0.9',
"upgrade-insecure-requests": "1",
"user-agent": RANDOM
}
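Putting the pieces together, a minimal sketch of building these headers with a randomized user-agent might look like the following. The small USER_AGENTS pool here is purely illustrative; in practice fake_useragent supplies the strings from a much larger, regularly updated list:

```python
import random

# Illustrative pool of real desktop user-agent strings. In practice,
# fake_useragent's UserAgent().random would supply these.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_headers(path):
    """Return browser-like headers with a randomly chosen user-agent."""
    return {
        "authority": "seekingalpha.com",
        "method": "GET",
        "path": path,
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "upgrade-insecure-requests": "1",
        "user-agent": random.choice(USER_AGENTS),
    }
```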
This is close to what the browser sends. The browser additionally adds:
"cache-control": "max-age=0",
"if-none-match": 'W/"6f11a6f9219176fda72f3cf44b0a2059"',
From my research, this is an ETag, which is used for caching and can also be used to track people. The 'W/...' value changes with each request.
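Since the ETag only matters for caching, one option (a sketch, not something the site documents) is to strip conditional-request headers before sending, so the server cannot correlate requests through them:

```python
# Headers copied from a browser may carry per-session identifiers.
# Dropping conditional-request headers avoids ETag-based correlation.
TRACKING_HEADERS = {"if-none-match", "if-modified-since", "cookie"}

def strip_tracking(headers):
    """Return a copy of headers without cache/conditional tracking fields."""
    return {k: v for k, v in headers.items() if k.lower() not in TRACKING_HEADERS}
```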
Also, when I use wkhtmltopdf to print the page as a PDF, a captcha never appears. I have also tried using Selenium, which is even worse. In addition, I have tried using proxies as seen here.
So there definitely is a way of doing this; I am just not doing it correctly. Does anyone have an idea what I am doing wrong?
Edit:
Sessions do not seem to be working
Random headers do not seem to be working
Random sleeps do not seem to be working
I am able to access the webpage using my VPN. Even once a captcha appears via requests, there is no captcha on the website in the browser.
Selenium does not work.
I really do not want to pay for a service to solve captchas.
I believe the issue is that I am not mimicking the browser well enough.
It is not easy to pinpoint the exact reason for being blocked and facing a captcha. Here are a few thoughts:
Sometimes, the Captcha service (in this case, Google) may blacklist common VPN IP addresses and treat them as potential threats, since many people are using them and they generate a lot of traffic.
Sometimes, proxy servers (especially free ones) are not anonymous and can send your actual IP address in the request headers (specifically, the X-Forwarded-For header).
There are certain headers that are important to have in your request. The easiest way to make your requests look legitimate is to use the "Network" tab in your browser's "Developer Tools", and copy all the headers your browser sends.
An important header to have is referer. While it may or may not be checked by the website, it is safer to just have it there with the URL of one of the website's pages (or the homepage):
referer: https://seekingalpha.com/
Try to increase the timeouts between your requests. A few seconds should be reasonable.
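A small sketch of a randomized delay between requests, so the timing does not look machine-regular (the 3-7 second range is an assumption, not a known threshold for this site):

```python
import random
import time

def polite_sleep(base=3.0, jitter=4.0):
    """Sleep for base + a random extra interval so request timing looks human."""
    delay = base + random.uniform(0, jitter)  # defaults give roughly 3-7 seconds
    time.sleep(delay)
    return delay
```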
Finally, try using the Session objects in requests. They automatically maintain cookies across multiple requests; if you also set the referer yourself as you move from page to page, this emulates a real user browsing the website. I found them to be the most helpful when it comes to overcoming scraping protections.
The last resort is to use a service to break the captcha. There are many such services (mostly paid) online; a popular one is DeathByCaptcha. Keep in mind that you may be breaking the website's terms of use, which I do not recommend :)