How to get around Scrapy failed responses (status code 416, 999, ...)

I'm writing a script using Scrapy, but I'm having trouble with failed HTTP responses. Specifically, I'm trying to scrape "https://www.crunchbase.com/", but I keep getting HTTP status code 416. Can websites block spiders from scraping their content?

nowhereman asked Apr 27 '15

2 Answers

What's happening is that the website is looking at the headers attached to your request and deciding that you're not a browser and therefore blocking your request.

However, there is nothing the website can do to tell Scrapy apart from Firefox/Chrome/IE/Safari if you send the same headers a browser does. In Chrome, open the Network tab of the DevTools console and you will see exactly which headers it sends. Copy those headers into your Scrapy request and everything should work.

You might want to start by sending the same User-Agent header as your browser.
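For example, a minimal settings.py sketch, using Scrapy's standard USER_AGENT setting; the User-Agent string below is only an illustrative placeholder, copy the exact value your own browser sends:

```python
# settings.py -- minimal sketch; the User-Agent value is a placeholder,
# replace it with the exact string your own browser sends.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/108.0.0.0 Safari/537.36"
)
```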

How to send these headers with a Scrapy request is documented in the Scrapy docs: scrapy.Request accepts a headers argument.
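As a rough sketch of that approach (the spider name and header values are placeholders, not anything Crunchbase-specific), browser headers can be passed per request via the headers argument:

```python
import scrapy


class CrunchbaseSpider(scrapy.Spider):
    name = "crunchbase"

    def start_requests(self):
        # Placeholder values -- copy the real headers your browser sends
        # (Chrome DevTools -> Network tab -> Request Headers).
        browser_headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/108.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
        }
        yield scrapy.Request(
            "https://www.crunchbase.com/",
            headers=browser_headers,
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Status %s, %d bytes", response.status, len(response.body))
```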

14 revs, 12 users 16% answered Sep 27 '22

You are right: http://crunchbase.com blocks bots. It still serves an HTML page, "Pardon our Interruption", which explains why they think you are a bot and provides a form to request an unblock (even though the response comes back with status code 416).
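By default Scrapy's HttpError middleware filters out non-2xx responses before they reach your callback, so to look at that "Pardon our Interruption" page yourself you have to let the 416 through. A minimal sketch (the spider name and logging are illustrative, not from the answer):

```python
import scrapy


class BlockedPageSpider(scrapy.Spider):
    name = "blocked_page"
    start_urls = ["https://www.crunchbase.com/"]
    # Allow 416 responses to reach parse() instead of being dropped
    # by the HttpError middleware.
    handle_httpstatus_list = [416]

    def parse(self, response):
        if response.status == 416:
            # Dump the start of the "Pardon our Interruption" page.
            self.logger.warning("Blocked (416). Body starts with:\n%s",
                                response.text[:500])
        else:
            self.logger.info("Got status %d", response.status)
```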

According to the VP of Marketing at Distil Networks, Crunchbase uses Distil Networks' anti-bot protection.

https://www.quora.com/How-does-distil-networks-bot-and-scraper-detection-work

After several attempts, even my browser was blocked there. I submitted an unblock request and access was restored. I'm not sure about other Distil-protected sites, but you can try asking Crunchbase management nicely.

Serge answered Sep 28 '22