This is a question about scrapy.
When storing items in a database, why is it conventional to implement this via an item pipeline rather than the Feed Export mechanism?
The Scrapy docs introduce both mechanisms:

Feed Exports - "Output your scraped data using different formats and storages. One of the most frequently required features when implementing scrapers is being able to store the scraped data properly."

Item Pipeline - "Post-process and store your scraped data. Typical uses for item pipelines are ... storing the scraped item in a database."
What's the difference between the two (pros/cons), and why is the pipeline more suitable?
Thx
This is a very late answer. But I just spent a whole afternoon and an evening trying to understand the difference between item pipelines and feed exports, which is poorly documented, and I think it might help someone who is still confused.
TL;DR: feed export is designed for exporting items as files. It is simply not suitable for database storage.
Feed export is implemented as a Scrapy extension in scrapy.extensions.feedexport. Like other Scrapy extensions, it works by registering callbacks to Scrapy signals (spider_opened, spider_closed and item_scraped) so that it can take the necessary steps to store items.
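As a rough illustration, this is the registration pattern any such extension follows. The class MyExtension here is a hypothetical stand-in for FeedExporter, but from_crawler and crawler.signals.connect are the standard Scrapy hooks:

```python
from scrapy import signals

class MyExtension:
    """Hypothetical extension showing the same signal-callback
    pattern that FeedExporter uses internally."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Register callbacks for the three signals mentioned above.
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        pass  # e.g. open a temporary file and initialize an item exporter

    def item_scraped(self, item, response, spider):
        pass  # e.g. hand the item to the exporter

    def spider_closed(self, spider):
        pass  # e.g. finish exporting and store the file
```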
When the spider opens, FeedExporter (the actual extension class) initializes the feed storage and item exporter. The concrete steps involve getting a file-like object, usually a temporary file, from a FeedStorage and passing it to an ItemExporter. On each item_scraped signal, FeedExporter simply calls export_item on the pre-initialized ItemExporter object. When the spider closes, FeedExporter calls the store method on the FeedStorage object, which writes the file to the filesystem, uploads it to a remote FTP server, uploads it to S3 storage, etc.
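To make the exporter step concrete, here is a small sketch of the ItemExporter interface that FeedExporter drives, using the built-in JsonItemExporter with an ordinary local file (the file name and the item itself are just examples):

```python
from scrapy.exporters import JsonItemExporter

# Roughly what FeedExporter does with the file-like object
# it gets from a FeedStorage:
with open("items.json", "wb") as f:             # exporters expect a binary file
    exporter = JsonItemExporter(f)
    exporter.start_exporting()                  # when the spider opens
    exporter.export_item({"title": "example"})  # on each scraped item
    exporter.finish_exporting()                 # when the spider closes
```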
There is a collection of built-in item exporters and storages. But as you may notice from the above, FeedExporter is by design tightly coupled with file storage. When using a database, the usual way to store items is to insert them as soon as they are scraped (or possibly with some buffering).
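That file orientation also shows in how feed exports are configured: you declare target URIs and formats, e.g. with the FEEDS setting (available since Scrapy 2.1; older versions used FEED_URI and FEED_FORMAT instead):

```python
# settings.py -- feed export targets are files/URIs, not database tables
FEEDS = {
    "items.json": {"format": "json"},                       # local file
    "s3://mybucket/items/%(time)s.csv": {"format": "csv"},  # S3 (needs botocore)
}
```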
Therefore, the proper way to use database storage through this mechanism would seem to be writing your own FeedExporter. You could achieve that by registering callbacks to Scrapy signals yourself, but it is not necessary: using an item pipeline is more straightforward and does not require awareness of such implementation details, as the sketch below shows.
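For example, a minimal pipeline that inserts every scraped item into a SQLite database might look like this (the database name, table and item fields are hypothetical):

```python
import sqlite3

class SQLitePipeline:
    """Minimal sketch: store each item in SQLite as soon as it is scraped."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Insert immediately; add buffering here if you need batches.
        self.conn.execute(
            "INSERT INTO items (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item  # let later pipelines see the item too

    def close_spider(self, spider):
        self.conn.close()
```

Enable it in settings.py with ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}, adjusting the module path to your project.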