Scrapy request+response+download time

Tags:

scrapy

UPD: Not closing the question, because I don't think my approach is as clean as it should be.

Is it possible to get the current request + response + download time and save it to an Item?

In "plain" Python I do:

from time import time
import urllib2

start_time = time()
urllib2.urlopen('http://example.com').read()
print time() - start_time

But how can I do this with Scrapy?

UPD:

This solution is good enough for me, but I'm not sure about the quality of the results. If you have many connections that hit timeout errors, the measured download time may be wrong (even as high as DOWNLOAD_TIMEOUT * 3).

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}

In middlewares.py:

from time import time
from scrapy.http import Response


class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # returning None lets processing continue through the middlewares
        # with a higher order number than this one
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # the response must be returned so it keeps flowing

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        # build a dummy response; status 110 (ETIMEDOUT) marks the failure
        return Response(
            url=request.url,
            status=110,
            request=request)

Inside spider.py, in def parse(...):

from scrapy import log

log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)
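To actually save the time to an Item, as the question asks, the two meta timestamps can be combined in the callback. A minimal sketch (a plain dict stands in for a scrapy.Item, which supports the same key access; the field names here are illustrative assumptions, while the meta keys come from the DownloadTimer middleware above):

```python
# Example timestamp values as the DownloadTimer middleware would set them.
meta = {'__start_time': 100.0, '__end_time': 100.75}

# A dict used in place of a scrapy.Item; 'url' and 'download_time'
# are hypothetical field names.
item = {
    'url': 'http://example.com',
    'download_time': meta['__end_time'] - meta['__start_time'],
}
print('Download time: %.2f' % item['download_time'])  # → Download time: 0.75
```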
asked Apr 05 '13 by b1_

People also ask

What is download delay in Scrapy?

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. DOWNLOAD_DELAY = 0.25 # 250 ms of delay.

What does Scrapy request return?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.

How do you get cookie response from Scrapy?

self.log(cook1)
self.log("end cookie2")
return Request("http://something.net/some/sa/" + response.headers.getlist('Location')[0],
               cookies={cook1[0]: cook1[1]},
               callback=self.check_login_response)
. . .

What are Middlewares in Scrapy?

The spider middleware is a framework of hooks into Scrapy's spider processing mechanism where you can plug custom functionality to process the responses that are sent to Spiders for processing and to process the requests and items that are generated from spiders.


2 Answers

You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
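The idea described here can be shown with a dependency-free sketch (a stand-in class replaces scrapy.Request so the logic is runnable on its own; the meta key names are assumptions):

```python
from time import time

# Stand-in for scrapy.Request, just enough to carry a meta dict.
class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}

# Sketch of the timing middleware: stash a start timestamp on the way
# out, compute the elapsed time when the response comes back.
class TimerMiddleware:
    def process_request(self, request, spider=None):
        request.meta['__start_time'] = time()
        return None

    def process_response(self, request, response, spider=None):
        request.meta['__download_time'] = time() - request.meta['__start_time']
        return response

req = FakeRequest('http://example.com')
mw = TimerMiddleware()
mw.process_request(req)
mw.process_response(req, response=None)
print(req.meta['__download_time'] >= 0)  # → True
```

In a real project the spider would read the same meta key from response.meta in its callback and copy it onto the item.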

This downloader middleware sounds like it could be useful on many projects.

answered Sep 21 '22 by Shane Evans

Not sure you need a middleware here. Scrapy has a response.meta which you can query and yield. For the download latency, simply yield

download_latency=response.meta.get('download_latency'),

The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
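A minimal sketch of reading that key in a callback (a plain dict stands in for response.meta, and the item field names are hypothetical):

```python
# Build an item carrying Scrapy's built-in download_latency meta key.
# meta.get() returns None if the response was never downloaded.
def build_item(meta, url):
    return {
        'url': url,
        'download_latency': meta.get('download_latency'),
    }

# Example value as Scrapy would set it once the response is downloaded.
item = build_item({'download_latency': 0.42}, 'http://example.com')
print(item['download_latency'])  # → 0.42
```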

answered Sep 20 '22 by Sam