
Captcha using requests even after changing headers and IP. How am I being tracked?

I am trying to scrape some articles from xyz. After a certain number of requests, a captcha appears.

I have tried the following, but I am still running into major issues:

  1. I am using fake_useragent (from fake_useragent import UserAgent) to randomize my User-Agent header.

  2. I am using random sleep times between requests.

  3. I am changing my IP address using a VPN once a captcha appears. However, somehow a captcha still appears once my IP address changes.

It is also strange because while a captcha appears in the request response, a captcha does not appear in the browser.

So, I assume that my header is just wrong.

I turned off JS and cookies in the browser when capturing this request, because with cookies and JS enabled there is clearly information that the website can use to track me.

from fake_useragent import UserAgent

RANDOM = UserAgent().random  # random User-Agent string

headers = {
    "authority": "seekingalpha.com",
    "method": "GET",
    "path": "/article/4230872-dillards-still-room-downside",
    "scheme": "https",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
    "user-agent": RANDOM,
}

This is close to what the browser sends to the website. The browser also adds:

"cache-control": "max-age=0",
"if-none-match": 'W/"6f11a6f9219176fda72f3cf44b0a2059"',

From my research, this is an ETag, which is used for caching and can be used to track people. The 'W/...' value changes with each request.

Also, when I use wkhtmltopdf to print the page as a PDF, a captcha never appears. I have also tried using Selenium, which is even worse. In addition, I have tried using proxies as seen here.

So there definitely is a way of doing this. However, I am not doing it correctly. Does anyone have an idea what I am doing wrong?

Edit:

  1. Sessions do not seem to be working.

  2. Random headers do not seem to be working.

  3. Random sleeps do not seem to be working.

  4. I am able to access the webpage using my VPN. Even once a captcha appears using requests, there is no captcha on the website in the browser.

  5. Selenium does not work.

  6. I really do not want to pay for a service to solve captchas.

I believe the issue is that I am not mimicking the browser well enough.

asked Jan 02 '19 19:01 by user2330624



1 Answer

It is not easy to pinpoint the exact reason for being blocked and facing a captcha. Here are a few thoughts:

VPN and Proxies

Sometimes, the Captcha service (in this case, Google) may blacklist common VPN IP addresses and treat them as potential threats, since many people are using them and they generate a lot of traffic.

Sometimes, proxy servers (especially free ones) are not anonymous and can send your actual IP address in the request headers (specifically, the X-Forwarded-For header).
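
One way to check a proxy for this kind of leak is to send a request through it to an endpoint that echoes the request headers back, then look for your real address in the result. The sketch below covers only the checking step. The function name leaked_ip_headers and the list of header names are assumptions chosen for illustration, and the echoed headers are passed in as a plain dict.

```python
# Header names that proxies commonly use to pass along the client address.
# This list is an illustrative assumption, not an exhaustive one.
FORWARDING_HEADERS = ("x-forwarded-for", "x-real-ip", "forwarded", "via", "client-ip")


def leaked_ip_headers(echoed_headers, real_ip):
    """Return the forwarding headers whose value contains real_ip.

    This is a simple substring check, so it is only a rough indicator.
    """
    leaks = {}
    for name, value in echoed_headers.items():
        if name.lower() in FORWARDING_HEADERS and real_ip in value:
            leaks[name] = value
    return leaks


# Example: headers as an echo endpoint might report them for a transparent proxy.
echoed = {
    "User-Agent": "Mozilla/5.0",
    "X-Forwarded-For": "203.0.113.7, 198.51.100.20",
    "Via": "1.1 proxy.example",
}
print(leaked_ip_headers(echoed, "203.0.113.7"))
```

With requests, the echoed headers could come from an echo service such as httpbin.org/headers, fetched through the proxy under test. An empty result does not prove the proxy is anonymous, only that none of the listed headers carried your address.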

Request Headers

There are certain headers that are important to have in your request. The easiest way to make your requests look legitimate is to use the "Network" tab in your browser's "Developer Tools", and copy all the headers your browser sends.

An important header to have is referer. While it may or may not be checked by the website, it is safer to just have it there with the URL of one of the website's pages (or homepage):

referer: https://seekingalpha.com/
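
Headers copied from the Network tab can be turned into a dict for requests with a few lines of parsing. The following is a minimal sketch, assuming the headers were copied as raw "name: value" lines; parse_copied_headers is a hypothetical helper. It skips HTTP/2 pseudo-headers such as :authority, :method, :path, and :scheme, which Chrome lists next to the real headers but which describe the request itself rather than being headers you set by hand.

```python
def parse_copied_headers(raw):
    """Parse 'name: value' lines into a dict, skipping HTTP/2 pseudo-headers."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            # Blank line, or a pseudo-header such as ':authority: example.com'.
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # Not a 'name: value' line.
        headers[name.strip().lower()] = value.strip()
    return headers


# Sample input shaped like a copy from the browser's developer tools.
raw = """
:authority: seekingalpha.com
:method: GET
accept-language: en-US,en;q=0.9
referer: https://seekingalpha.com/
upgrade-insecure-requests: 1
"""
headers = parse_copied_headers(raw)
print(headers)
```

The resulting dict can be passed as the headers argument of a requests call.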

Timeouts and Sessions

Try to increase the delay between your requests. A few seconds should be reasonable.

Finally, try using Session objects in requests. They automatically persist cookies across multiple requests, which helps emulate a real user browsing the website; the referer is not updated for you, so set it yourself on each request. I found them to be the most helpful when it comes to overcoming scraping protections.
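
A rough sketch of the delay and session advice combined is shown here. The helper names make_session and fetch_all, the default delay range, and the timeout value are all assumptions for illustration. The sketch sets the referer explicitly on each request rather than relying on the session for it, and takes the sleep function as a parameter so the loop can be tried without waiting.

```python
import random
import time


def make_session(user_agent):
    """Create a requests.Session with browser-like default headers."""
    import requests  # third-party; imported here so the rest runs without it

    session = requests.Session()
    session.headers.update({
        "user-agent": user_agent,
        "accept-language": "en-US,en;q=0.9",
    })
    return session


def fetch_all(session, urls, start_referer, min_delay=2.0, max_delay=6.0,
              sleep=time.sleep):
    """Fetch urls in order with one session, pausing randomly between requests.

    Returns a dict mapping each URL to its HTTP status code.
    """
    statuses = {}
    referer = start_referer
    for url in urls:
        # The session sends back any cookies it received on earlier requests.
        response = session.get(url, headers={"referer": referer}, timeout=10)
        statuses[url] = response.status_code
        referer = url  # the next request looks like a click from this page
        sleep(random.uniform(min_delay, max_delay))
    return statuses
```

A call would look like fetch_all(make_session(some_user_agent), urls, "https://seekingalpha.com/"), where some_user_agent is a User-Agent string copied from your own browser. Keeping one User-Agent for the lifetime of a session is probably more consistent with a real browser than changing it on every request.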

Captcha

The last resort is to use a service to break the captcha. There are many services online (mostly paid) that do that. A popular one is DeathByCaptcha. Keep in mind that you may be breaking the website's terms of use, which I do not recommend :)

answered Sep 29 '22 12:09 by Aziz