I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:
http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168
But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
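For illustration, here is a minimal sketch of that round trip (the spider name, URLs, and callbacks are hypothetical): the spider yields Request objects, and the Responses produced by the Downloader come back to the named callback methods.

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider showing the Request/Response round trip.
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # The Downloader executed the initial Request and handed the
        # Response back to this callback.
        yield {"title": response.css("title::text").get()}

        # Yielding a new Request sends it through the engine to the
        # Downloader; its Response arrives at parse_page.
        yield scrapy.Request("http://example.com/page2",
                             callback=self.parse_page)

    def parse_page(self, response):
        yield {"url": response.url}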
Because Scrapy has built-in support for selecting and extracting data from various sources, as well as for generating feed exports in multiple formats, it is generally faster than Beautiful Soup. That said, a Beautiful Soup workflow can be sped up with multithreading, as sketched below.
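For example, a rough sketch of a multithreaded Beautiful Soup fetch using requests and concurrent.futures (the URL list and the fetch_title helper are illustrative, not part of either library):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

URLS = [
    "http://example.com/page1",  # placeholder URLs
    "http://example.com/page2",
]

def fetch_title(url):
    # Download one page and parse it with Beautiful Soup.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

# A thread pool lets the downloads overlap, which is usually the
# slow part of a Beautiful Soup workflow.
with ThreadPoolExecutor(max_workers=8) as executor:
    for title in executor.map(fetch_title, URLS):
        print(title)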
If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].
Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}
If you do this, every time you run the scraper it will check the file system first and serve cached responses instead of downloading the pages again.
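For reference, a rough settings.py sketch using the HTTP cache settings from the Scrapy docs; adjust for your Scrapy version, since newer releases enable the cache via HTTPCACHE_ENABLED rather than the scrapy.contrib middleware path shown above:

# settings.py (sketch; adapt names/paths to your Scrapy version)
HTTPCACHE_ENABLED = True         # turn the HTTP cache on
HTTPCACHE_DIR = 'httpcache'      # stored under the project's .scrapy data dir
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire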