I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:
http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168
But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
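For illustration, here is a minimal sketch of that round trip (the spider name, URLs, and callbacks are hypothetical): the spider yields Request objects, and the Responses produced by the Downloader come back to the named callback methods.

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider showing the Request/Response round trip.
    name = "example"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # The Downloader executed the initial Request and handed the
        # Response back to this callback.
        yield {"title": response.css("title::text").get()}

        # Yielding a new Request sends it through the engine to the
        # Downloader; its Response arrives at parse_page.
        yield scrapy.Request("http://example.com/page2",
                             callback=self.parse_page)

    def parse_page(self, response):
        yield {"url": response.url}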
Because Scrapy has built-in support for selecting and extracting data from various sources, as well as for generating feed exports in multiple formats, it is generally faster than Beautiful Soup. That said, a Beautiful Soup workflow can be sped up with multithreading, as sketched below.
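For example, a rough sketch of a multithreaded Beautiful Soup fetch using requests and concurrent.futures (the URL list and the fetch_title helper are illustrative, not part of either library):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

URLS = [
    "http://example.com/page1",  # placeholder URLs
    "http://example.com/page2",
]

def fetch_title(url):
    # Download one page and parse it with Beautiful Soup.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

# A thread pool lets the downloads overlap, which is usually the
# slow part of a Beautiful Soup workflow.
with ThreadPoolExecutor(max_workers=8) as executor:
    for title in executor.map(fetch_title, URLS):
        print(title)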
If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].
Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}
If you do this, every time you run the scraper it will check the file system first and serve cached responses instead of downloading the pages again.
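For reference, a rough settings.py sketch using the HTTP cache settings from the Scrapy docs; adjust for your Scrapy version, since newer releases enable the cache via HTTPCACHE_ENABLED rather than the scrapy.contrib middleware path shown above:

# settings.py (sketch; adapt names/paths to your Scrapy version)
HTTPCACHE_ENABLED = True         # turn the HTTP cache on
HTTPCACHE_DIR = 'httpcache'      # stored under the project's .scrapy data dir
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire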