I'm writing a script using Scrapy, but I'm having trouble with failed HTTP responses. Specifically, I'm trying to scrape "https://www.crunchbase.com/" but I keep getting HTTP status code 416. Can websites block spiders from scraping their content?
What's happening is that the website inspects the headers attached to your request, decides that you're not a browser, and blocks the request.
However, there is nothing the website can do to differentiate Scrapy from Firefox/Chrome/IE/Safari if you send the same headers a browser would. In Chrome, open the Network tab of the Developer Tools and you will see exactly which headers it sends. Copy those headers into your Scrapy request and everything should work.
You might want to start by sending the same User-Agent header as your browser.
How to send these headers with your Scrapy request is documented here.
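As a minimal sketch of the idea, here is how you might bundle browser-like headers and attach them to a request. The header values below are illustrative, not Crunchbase-specific — copy the exact ones your own browser sends. The example uses only the standard library so it is self-contained; in Scrapy you would pass the same dict via the `headers=` argument of `scrapy.Request`, or set it globally with the `DEFAULT_REQUEST_HEADERS` and `USER_AGENT` settings.

```python
import urllib.request

# Browser-like headers copied from Chrome's Network tab.
# These values are illustrative; use the ones your browser actually sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Build a request that carries browser-like headers,
    so the server sees a browser-style fingerprint."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = build_request("https://www.crunchbase.com/")
```

The Scrapy equivalent is a one-liner inside your spider: `yield scrapy.Request(url, headers=BROWSER_HEADERS, callback=self.parse)`.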
You are right: http://crunchbase.com blocks bots. It still serves an HTML page, "Pardon Our Interruption", which explains why they think you are a bot and provides a form to request unblocking (even though the status code is 416).
According to the VP of Marketing at Distil Networks, Crunchbase uses Distil Networks' anti-bot service:
https://www.quora.com/How-does-distil-networks-bot-and-scraper-detection-work
After several attempts, even my browser was blocked there. I submitted an unblock request and access was restored. I'm not sure about other Distil-protected sites, but you can try asking Crunchbase management nicely.