I have worked with Scrapy a bit and now have my spider ready. I want the spider to skip items that were already scraped in a previous run and scrape only new content, which should reduce its runtime.
While reading about this I came across DeltaFetch, which I think will meet my requirement, but I have not been able to import it. I would be glad if anybody could explain clearly how to use it.
I would also be interested to know of any other middleware that serves a similar purpose.
Using standard tools:
pip install scrapylib
Then add this to your project's settings.py:
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
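To make the idea behind this kind of middleware concrete, here is a minimal standalone sketch of "skip what an earlier run already handled". It is not DeltaFetch's actual code: the names `fingerprint`, `SeenStore`, `filter_new`, and `demo` are hypothetical, it uses only the standard library, and it simplifies by keying on the URL alone and marking a URL as seen as soon as it is selected.

```python
import dbm
import hashlib
import os
import tempfile


def fingerprint(url):
    """Return a stable bytes key for a URL (hypothetical helper)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest().encode("ascii")


class SeenStore:
    """On-disk set of fingerprints that survives between runs."""

    def __init__(self, path):
        # "c" opens the database for read/write, creating it if missing.
        self.db = dbm.open(path, "c")

    def is_seen(self, url):
        return fingerprint(url) in self.db

    def mark_seen(self, url):
        self.db[fingerprint(url)] = b"1"

    def close(self):
        self.db.close()


def filter_new(urls, store):
    """Return only the URLs not recorded by an earlier run, and record them."""
    new_urls = []
    for url in urls:
        if not store.is_seen(url):
            new_urls.append(url)
            store.mark_seen(url)
    return new_urls


def demo():
    """Simulate two runs that share one on-disk store."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "seen")

        # First run: nothing has been recorded yet, so both URLs are new.
        store = SeenStore(path)
        first = filter_new(["http://example.com/a", "http://example.com/b"], store)
        store.close()

        # Second run: reopen the same store; only the unseen URL remains.
        store = SeenStore(path)
        second = filter_new(
            ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
            store,
        )
        store.close()
        return first, second
```

Calling `demo()` simulates two consecutive runs against the same store. In a real spider the store would live in a fixed location per spider rather than a temporary directory, so that it persists between crawls.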