I'm writing a script using Scrapy, but I'm having trouble with failed HTTP responses. Specifically, I'm trying to scrape "https://www.crunchbase.com/" but I keep getting HTTP status code 416. Can websites block spiders from scraping their content?
What's happening is that the website inspects the headers attached to your request, decides that you're not a browser, and blocks the request.
However, there is nothing the website can do to differentiate Scrapy from Firefox/Chrome/IE/Safari if you send the same headers a browser would. In Chrome, open the Network tab of the Developer Tools and you will see exactly which headers it sends. Copy those headers into your Scrapy request and everything should work.
You might want to start by sending the same User-Agent header as your browser.
How to send these headers with your Scrapy request is documented here.
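As a minimal sketch of the idea, here is how you might bundle browser-like headers and attach them to a request. The header values below are illustrative, not Crunchbase-specific — copy the exact ones your own browser sends. The example uses only the standard library so it is self-contained; in Scrapy you would pass the same dict via the `headers=` argument of `scrapy.Request`, or set it globally with the `DEFAULT_REQUEST_HEADERS` and `USER_AGENT` settings.

```python
import urllib.request

# Browser-like headers copied from Chrome's Network tab.
# These values are illustrative; use the ones your browser actually sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Build a request that carries browser-like headers,
    so the server sees a browser-style fingerprint."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = build_request("https://www.crunchbase.com/")
```

The Scrapy equivalent is a one-liner inside your spider: `yield scrapy.Request(url, headers=BROWSER_HEADERS, callback=self.parse)`.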
You are right: http://crunchbase.com blocks bots. It still serves an HTML page, "Pardon Our Interruption", which explains why they think you are a bot and provides a form to request unblocking (even though the status code is 416).
According to the VP of Marketing at Distil Networks, Crunchbase uses Distil Networks' anti-bot service:
https://www.quora.com/How-does-distil-networks-bot-and-scraper-detection-work
After several attempts, even my browser was blocked there. I submitted an unblock request and access was restored. I'm not sure about other Distil-protected sites, but you can try asking Crunchbase management nicely.