 

python requests bot detection?

I have been using the requests library to scrape this website. I haven't made too many requests to it within 10 minutes, say 25, but all of a sudden the website gives me a 404 error.

My question is: I read somewhere that fetching a URL with a browser is different from fetching it with something like requests, because a requests fetch does not handle cookies and other things that a browser would. Is there an option in requests to emulate a browser so the server doesn't think I'm a bot? Or is this not an issue?

asked Apr 09 '14 by jason

People also ask

Can websites detect scraping?

Websites detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If a site finds your traffic suspicious, you start receiving CAPTCHAs, and eventually your requests get blocked once your crawler is detected.


3 Answers

Basically, at least one thing you can do is send a User-Agent header:

import requests

url = 'https://example.com'  # the page you are scraping

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}

response = requests.get(url, headers=headers)

Besides requests, you can simulate a real user by using selenium, which drives a real browser; in that case there is clearly no easy way to distinguish your automated user from other users. Selenium can also make use of a "headless" browser, as in the sketch below.
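A minimal sketch, assuming Chrome with a matching chromedriver on your PATH (the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('https://example.com')  # placeholder URL
html = driver.page_source  # fully rendered HTML, with cookies and JavaScript handled
driver.quit()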

Also, check whether the web site you are scraping provides an API. If there is no API, or you are not using it, make sure the site actually allows automated web crawling like this: study the Terms of Use. There is probably a reason why they block you after too many requests in a given period of time.
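One quick check you can automate is the site's robots.txt; a small sketch using only the standard library (the URLs are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder site
rp.read()
# Ask whether any user agent may fetch a given path
print(rp.can_fetch('*', 'https://example.com/some/page'))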

Also see:

  • Sending "User-agent" using Requests library in Python
  • Headless Selenium Testing with Python and PhantomJS

edit1: note that a Selenium-driven browser is itself detectable: it exposes navigator.webdriver = true in the page's JavaScript context, a flag that fingerprinting scripts can check, so in some ways it is easier to detect than plain requests.
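You can see that flag for yourself from Selenium (a small sketch; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL
# Sites can read this same property to spot automation
print(driver.execute_script('return navigator.webdriver'))  # prints True
driver.quit()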

answered Oct 17 '22 by alecxe


Things that can help in general (a combined sketch follows this list):

  • Headers should be similar to common browsers, including:
    • User-Agent: use a recent one (see https://developers.whatismybrowser.com/useragents/explore/), or better, use a random recent one if you make multiple requests (see https://github.com/skratchdot/random-useragent)
    • Accept-Language: something like "en,en-US;q=0.5" (adapt for your language)
    • Accept: a standard one would be "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  • Navigation:
    • If you make multiple requests, put a random delay between them
    • If you open links found in a page, set the Referer header accordingly
    • Or better, simulate mouse activity to move, click, and follow links
  • Images should be enabled
  • JavaScript should be enabled
    • Check that "navigator.plugins" and "navigator.language" are set in the client JavaScript page context
  • Use proxies
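A combined sketch of the header and pacing advice above (the URLs are placeholders, and since random-useragent is a JavaScript project, a small hand-rolled list stands in for it here):

import random
import time
import requests

# A few recent desktop user agents; swap in your own up-to-date list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
]

session = requests.Session()  # keeps cookies between requests, like a browser

def fetch(url, referer=None):
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en,en-US;q=0.5',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }
    if referer:
        headers['Referer'] = referer  # set when following a link from a page
    response = session.get(url, headers=headers)
    time.sleep(random.uniform(2, 6))  # random delay between requests
    return response

page = fetch('https://example.com')  # placeholder URL
sub = fetch('https://example.com/some/page', referer='https://example.com')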
answered Oct 17 '22 by Grubshka


The first answer is a bit off: Selenium is still detectable, because it runs a webdriver rather than a normal browser, and that leaves hardcoded values which can be read with JavaScript. Most websites use fingerprinting libraries that look for these values. Luckily, there is a patched chromedriver called undetected-chromedriver that bypasses such checks.
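A minimal sketch with that package (install with pip install undetected-chromedriver; the URL is a placeholder):

import undetected_chromedriver as uc

driver = uc.Chrome()  # patched Chrome that hides common webdriver fingerprints
driver.get('https://example.com')  # placeholder URL
print(driver.page_source[:200])
driver.quit()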

answered Oct 17 '22 by ahmed mani