I am trying to scrape some articles from xyz. However, after a certain number of scrapes, a captcha appears, and I am running into major issues.
I am using fake_useragent (from fake_useragent import UserAgent) to randomize my headers.
I am using random sleep times between requests.
I am changing my IP address using a VPN once a captcha appears. However, a captcha still appears even after my IP address changes.
It is also strange that while a captcha appears in the request response, no captcha appears in the browser.
So, I assume that my header is just wrong.
I turned off JS and cookies when capturing this request, because with cookies and JS enabled there is clearly information the website can track me with.
headers = {
"authority": "seekingalpha.com",
"method": "GET",
"path": "/article/4230872-dillards-still-room-downside",
"scheme": "https",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding": "gzip, deflate, br",
"accept-language": 'en-US,en;q=0.9',
"upgrade-insecure-requests": "1",
"user-agent": RANDOM
}
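Putting the pieces together, a minimal sketch of building these headers with a randomized user-agent might look like the following. The small USER_AGENTS pool here is purely illustrative; in practice fake_useragent supplies the strings from a much larger, regularly updated list:

```python
import random

# Illustrative pool of real desktop user-agent strings. In practice,
# fake_useragent's UserAgent().random would supply these.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_headers(path):
    """Return browser-like headers with a randomly chosen user-agent."""
    return {
        "authority": "seekingalpha.com",
        "method": "GET",
        "path": path,
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "upgrade-insecure-requests": "1",
        "user-agent": random.choice(USER_AGENTS),
    }
```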
This is close to what the browser sends. The browser additionally adds:
"cache-control": "max-age=0",
"if-none-match": 'W/"6f11a6f9219176fda72f3cf44b0a2059"',
From my research, this is an ETag, which is used for caching and can also be used to track people. The 'W/...' value changes with each request.
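Since the ETag only matters for caching, one option (a sketch, not something the site documents) is to strip conditional-request headers before sending, so the server cannot correlate requests through them:

```python
# Headers copied from a browser may carry per-session identifiers.
# Dropping conditional-request headers avoids ETag-based correlation.
TRACKING_HEADERS = {"if-none-match", "if-modified-since", "cookie"}

def strip_tracking(headers):
    """Return a copy of headers without cache/conditional tracking fields."""
    return {k: v for k, v in headers.items() if k.lower() not in TRACKING_HEADERS}
```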
Also, when I use wkhtmltopdf to print the page as a PDF, a captcha never appears. I have also tried using Selenium, which is even worse. In addition, I have tried using proxies as seen here.
So there definitely is a way of doing this; I am just not doing it correctly. Does anyone have an idea what I am doing wrong?
Edit:
Sessions do not seem to be working
Random headers do not seem to be working
Random sleeps do not seem to be working
I am able to access the webpage using my VPN. Even once a captcha appears via requests, there is no captcha on the website in the browser.
Selenium does not work.
I really do not want to pay for a service to solve captchas.
I believe the issue is that I am not mimicking the browser well enough.
It is not easy to pinpoint the exact reason for being blocked and facing a captcha. Here are a few thoughts:
Sometimes, the Captcha service (in this case, Google) may blacklist common VPN IP addresses and treat them as potential threats, since many people are using them and they generate a lot of traffic.
Sometimes, proxy servers (especially free ones) are not anonymous and can send your actual IP address in the request headers (specifically, the X-Forwarded-For header).
There are certain headers that are important to have in your request. The easiest way to make your requests look legitimate is to use the "Network" tab in your browser's "Developer Tools", and copy all the headers your browser sends.
An important header to have is referer. While it may or may not be checked by the website, it is safer to just have it there with the URL of one of the website's pages (or the homepage):
referer: https://seekingalpha.com/
Try to increase the timeouts between your requests. A few seconds should be reasonable.
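A small sketch of a randomized delay between requests, so the timing does not look machine-regular (the 3-7 second range is an assumption, not a known threshold for this site):

```python
import random
import time

def polite_sleep(base=3.0, jitter=4.0):
    """Sleep for base + a random extra interval so request timing looks human."""
    delay = base + random.uniform(0, jitter)  # defaults give roughly 3-7 seconds
    time.sleep(delay)
    return delay
```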
Finally, try using the Session objects in requests. They automatically maintain cookies across multiple requests; if you also set the referer yourself as you move from page to page, this emulates a real user browsing the website. I found them to be the most helpful when it comes to overcoming scraping protections.
The last resort is to use a service to break the captcha. There are many such services (mostly paid) online; a popular one is DeathByCaptcha. Keep in mind that you may be breaking the website's terms of use, which I do not recommend :)