
Scrapy error: User timeout caused connection failure

I'm using Scrapy to scrape the adidas site: http://www.adidas.com/us/men-shoes. But the request always fails with this error:

User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..

It retries 5 times and then fails completely.
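
The 180-second figure matches Scrapy's default DOWNLOAD_TIMEOUT, so every failed attempt costs three minutes. A minimal sketch for failing faster while debugging, assuming shorter limits are acceptable; the name DEBUG_SETTINGS and the values are illustrative and not from the original post:

```python
# Hypothetical debugging overrides; the values are examples only.
DEBUG_SETTINGS = {
    "DOWNLOAD_TIMEOUT": 15,  # seconds to wait per request (Scrapy's default is 180)
    "RETRY_TIMES": 1,        # number of retries after the first failed attempt
    "LOG_LEVEL": "DEBUG",    # keep the per-request log lines visible
}
```

In the spider this could be applied by setting `custom_settings = DEBUG_SETTINGS` as a class attribute.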

I can access the URL in Chrome, but it doesn't work in Scrapy.
I've tried using custom user agents and emulating the browser's request headers, but it still doesn't work.

Below is my code:

import scrapy


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):

        urls = ['http://www.adidas.com/us/men-shoes']

        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }

        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

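    def parse(self, response):
        # Callbacks must yield items (e.g. dicts) or Requests, not raw bytes.
        yield {"url": response.url, "body": response.body}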

Scrapy log:

{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 224,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}

Update

After inspecting the request headers with Fiddler and running some tests, I found the cause: Scrapy sends a Connection: close header by default, and with that header I get no response from the adidas site.


After replaying the same request in Fiddler without the Connection: close header, I got the response correctly. So the question now is: how do I remove the Connection: close header?
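
The Fiddler experiment can also be repeated from plain Python, outside Scrapy. What follows is a rough sketch using only the standard library; the helper names `build_request` and `status_line` are made up for illustration, and the shortened User-Agent is a placeholder:

```python
import socket


def build_request(host, path, connection):
    """Build a raw HTTP/1.1 GET request with an explicit Connection header."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: Mozilla/5.0",  # placeholder; substitute a full browser UA
        "Accept: text/html",
        f"Connection: {connection}",
        "",  # the two empty entries produce the blank line ending the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def status_line(host, path, connection, timeout=10.0):
    """Send the request over plain HTTP; return the status line, or None if nothing arrives."""
    try:
        with socket.create_connection((host, 80), timeout=timeout) as sock:
            sock.sendall(build_request(host, path, connection))
            data = sock.recv(4096)
    except OSError:  # covers timeouts and connection errors
        return None
    if not data:
        return None
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling `status_line("www.adidas.com", "/us/men-shoes", "close")` and then the same with `"keep-alive"` would show whether the server answers differently depending on that one header.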


asked Dec 24 '22 10:12 by Biswajit Chopdar


1 Answer

Scrapy doesn't let you edit the Connection: close header, so I used scrapy-splash instead and made the requests through Splash.

With Splash the Connection: close header can be overridden, and everything works now. The downside is that the web page has to load all of its assets before I get the response from Splash, so it's slower, but it works.

Scrapy should add an option to change its default Connection: close header. It is hardcoded in the library and cannot be overridden easily.

Below is my working code:

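import scrapy
from scrapy_splash import SplashRequest


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    # Browser-like headers; scrapy-splash passes them on to Splash,
    # which makes the actual request to the site.
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "www.adidas.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def start_requests(self):
        url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
        # self.parse is the callback from the question's spider.
        yield SplashRequest(url, self.parse, headers=self.headers)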
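
SplashRequest only works once scrapy-splash is enabled in the project settings and a Splash instance is running. A minimal settings.py sketch following the scrapy-splash README; the SPLASH_URL value is an assumption (Splash running locally on port 8050) and should be adjusted to your setup:

```python
# settings.py (sketch) -- wiring scrapy-splash into a Scrapy project.

# Assumption: Splash is reachable locally, e.g. started from its Docker image.
SPLASH_URL = "http://localhost:8050"

# Middlewares that route SplashRequest objects through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicates Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```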

answered Dec 25 '22 23:12 by Biswajit Chopdar