Replay a Scrapy spider on stored data

I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:

http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168

But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?

del asked Oct 14 '11 10:10




1 Answer

If your Scrapy version still has the record option, run crawl --record=[cache.file] [scraper]; you'll then be able to use replay [scraper].

Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:

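DOWNLOADER_MIDDLEWARES = {
    # Import path from older Scrapy releases; later releases moved the class to
    # scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}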

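If you do this, every time you run the scraper it will check the cache on the file system first and reuse any stored response instead of downloading the page again. That lets you change your parsing functions and re-run the spider against the same raw data.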
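
For more recent Scrapy releases, the cache is normally switched on through settings rather than by listing the middleware yourself. Here is a minimal sketch of a settings.py fragment, assuming a release that supports the HTTPCACHE_* settings; the directory name "httpcache" is just an illustrative choice, and you should check the values against the documentation for your version:

```python
# settings.py fragment (a sketch; assumes a Scrapy release with HTTPCACHE_* settings)

# Turn on the HTTP cache so responses are stored and reused across runs.
HTTPCACHE_ENABLED = True

# Where cached responses are kept; a relative path is resolved against the
# project data directory. The name "httpcache" is only an example.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire, which suits offline replay.
HTTPCACHE_EXPIRATION_SECS = 0

# DummyPolicy caches every response without honouring HTTP cache headers.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

# Store each response as files on disk.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set to True for a strictly offline run: requests that are not in the
# cache are then ignored instead of being downloaded.
HTTPCACHE_IGNORE_MISSING = False
```

With something like this in place, one online run of scrapy crawl [scraper] should fill the cache, and later runs after a model or parser change would read the stored responses instead of hitting the sites again.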

Tim McNamara answered Oct 13 '22 07:10
