I'm using Scrapy to scrape the adidas site (http://www.adidas.com/us/men-shoes).
But it always fails with this error:
User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..
It retries 5 times and then fails completely.
I can open the URL in Chrome, but it doesn't work in Scrapy.
I've tried custom user agents and emulating the browser's request headers, but it still doesn't work.
Below is my code:
import scrapy


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):
        urls = ['http://www.adidas.com/us/men-shoes']
        # Headers copied from a Chrome request to the same page.
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        # Scrapy callbacks must yield items (e.g. dicts) or Requests, not raw bytes.
        yield {"body": response.body}
Scrapy log:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 224,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'retry/count': 1,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}
After looking at the request headers with Fiddler and running some tests, I found what was causing the issue. Scrapy sends a Connection: close header by default, and because of it I get no response from the adidas site.
After replaying the same request in Fiddler without the Connection: close header, I got the response correctly. So the problem now is: how do I remove the Connection: close header?
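If you don't have Fiddler at hand, one way to see exactly which headers a client sends is to point it at a small local server that echoes them back. Below is a minimal sketch using only the Python standard library. The names EchoHeadersHandler, fetch_sent_headers and run_demo are made up for this example, and the client here is http.client rather than Scrapy, so it only illustrates the inspection technique and says nothing about what Scrapy itself sends.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Replies with the request headers it received, encoded as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch_sent_headers(port, extra_headers):
    """Send one GET to the local echo server and return the headers it saw."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/", headers=extra_headers)
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()


def run_demo():
    """Start the echo server on a free port and compare two requests."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with_close = fetch_sent_headers(port, {"Connection": "close"})
        with_keep_alive = fetch_sent_headers(port, {"Connection": "keep-alive"})
    finally:
        server.shutdown()
        server.server_close()
    return with_close, with_keep_alive


if __name__ == "__main__":
    closed, kept = run_demo()
    print("Connection header, first request: ", closed.get("Connection"))
    print("Connection header, second request:", kept.get("Connection"))
```

To check a spider instead, you could keep the server running and use its http://127.0.0.1:<port>/ address as the start URL, then read the echoed headers from response.body.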
Since Scrapy doesn't let you edit the Connection: close header, I used scrapy-splash instead, so the requests are made through Splash.
Now the Connection: close header can be overridden and everything works. The downside is that the page has to load all of its assets before I get the response from Splash, so it is slower, but it works.
Scrapy should add an option to edit its default Connection: close header. It is hardcoded in the library and cannot be overridden easily.
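On the speed downside: Splash's render.html endpoint accepts arguments such as images, wait and resource_timeout, so skipping image downloads may cut some of the load time. A rough sketch of building such a request URL with the standard library follows. The Splash address http://localhost:8050 and the helper name build_render_url are assumptions for illustration, and I have not measured how much this helps on the adidas pages.

```python
from urllib.parse import urlencode

# Assumed address of a locally running Splash instance.
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def build_render_url(page_url, images=False, wait=0.5, resource_timeout=10):
    """Build a Splash render.html URL for page_url (hypothetical helper)."""
    params = {
        "url": page_url,                       # page Splash should render
        "images": 1 if images else 0,          # 0 = do not download images
        "wait": wait,                          # seconds to wait after page load
        "resource_timeout": resource_timeout,  # per-resource timeout in seconds
    }
    return SPLASH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_render_url("http://www.adidas.com/us/men-shoes?sz=120&start=0"))
```

With scrapy-splash, the same values would normally go into the args dict of SplashRequest, for example args={"images": 0, "wait": 0.5}.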
Below is my working code:
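import scrapy
from scrapy_splash import SplashRequest


class AdidasSpider(scrapy.Spider):
    name = "adidas"
    # Browser-like headers, with Connection set to keep-alive instead of close.
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "www.adidas.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def start_requests(self):
        url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
        # Route the request through Splash; self.parse is the spider's usual callback.
        yield SplashRequest(url, self.parse, headers=self.headers)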
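For SplashRequest to work, the project also needs the scrapy-splash middlewares enabled and a running Splash instance. The settings.py fragment below follows the setup described in the scrapy-splash README as I remember it. The SPLASH_URL value assumes Splash is listening locally on port 8050, so adjust it to your setup and double-check the middleware names against the version you install.

```python
# settings.py (fragment)

# Assumed address of the Splash instance, e.g. one started from the Splash Docker image.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that send SplashRequest objects through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```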