I have been using the requests library to mine this website. I haven't made too many requests to it: say, 25 within 10 minutes. All of a sudden, the website gives me a 404 error.
My question is: I read somewhere that getting a URL with a browser is different from getting a URL with something like requests, because a requests fetch does not get cookies and other things that a browser would. Is there an option in requests to emulate a browser so the server doesn't think I'm a bot? Or is this not an issue?
Websites detect web crawlers and web scraping tools by checking their IP addresses, user agents, browser parameters, and general behavior. If a website finds your traffic suspicious, you start receiving CAPTCHAs, and eventually your requests get blocked once your crawler is detected.
Basically, at least one thing you can do is send a User-Agent header:
import requests

# Send a browser-like User-Agent instead of the default python-requests one
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}
response = requests.get(url, headers=headers)
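The question also asks about cookies. As a minimal sketch, a requests.Session keeps cookies between calls and can carry a fixed set of browser-like headers. The helper name make_session and the Accept and Accept-Language values are assumptions chosen for illustration, not something the site is known to require:

```python
import requests

# Illustrative header values (assumed); any current browser's values would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def make_session():
    """Return a Session that sends browser-like headers and keeps cookies."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


session = make_session()
# Cookies set by a response are stored in session.cookies and sent back
# automatically on later requests made through the same session, e.g.:
# response = session.get(url)
```

This only makes the HTTP traffic look more like a browser's; it does not run JavaScript, so checks that rely on it will still fail.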
Besides requests, you can simulate a real user by using selenium. It uses a real browser, so in this case there is clearly no easy way to distinguish your automated user from other users. Selenium can also make use of a "headless" browser.
Also, check if the website you are scraping provides an API. If there is no API or you are not using it, make sure you know whether the site actually allows automated web crawling like this: study its Terms of use. You know, there is probably a reason why they block you after too many requests in a given period of time.
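As a rough sketch of behaving politely, the snippet below checks a URL against robots.txt rules and spaces out consecutive requests. Note that robots.txt is only one machine-readable signal and does not replace reading the Terms of use. The function names, the 2-second delay, and the jitter are assumptions for illustration; the site's actual limits are not known here:

```python
import random
import time
from urllib import robotparser


def allowed_by_robots(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, min_delay=2.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    min_delay and jitter are assumed values; tune them to what the site allows.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a randomized interval so requests are not evenly spaced
            sleep(min_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results
```

In practice, fetch could be something like lambda u: session.get(u) with a requests session, and robots_lines could come from downloading the site's /robots.txt and splitting it into lines.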
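edit1: Selenium drives the browser through a WebDriver, and that is detectable: a WebDriver-controlled browser exposes navigator.webdriver = true to JavaScript on the page (a browser property, not an HTTP header), making it far easier to detect than requests.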
Things that can help in general:
The first answer is a bit off: Selenium is still detectable because the browser is driven by a WebDriver rather than used like a normal browser. It has hardcoded values that can be detected using JavaScript, and many websites use fingerprinting libraries that can find these values. Luckily, there is a patched chromedriver called undetected_chromedriver that is designed to bypass such checks.