Database storage: Why is Pipeline better than Feed Export?

This is a question about scrapy.

When storing items in a database, why is it conventional to implement this via an item pipeline rather than the Feed Export mechanism?

Feed Exports - Output your scraped data using different formats and storages

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly

Item Pipeline - Post-process and store your scraped data

Typical use for item pipelines are... storing the scraped item in a database

What's the difference, pros/cons between the two, and (why) is the pipeline more suitable?

Thanks.

John Mee asked Apr 18 '12

1 Answer

This is a very late answer. But I just spent a whole afternoon and an evening trying to understand the difference between item pipelines and feed exports, which is poorly documented, and I think it may help someone who is still confused.

TL;DR: Feed export is designed for exporting items as files. It is simply not suitable for database storage.

Feed export is implemented as a scrapy extension, in scrapy.extensions.feedexport. Like other scrapy extensions, it works by registering callback functions for certain scrapy signals (spider_opened, spider_closed, and item_scraped) so that it can take the necessary steps to store items.

When the spider opens, FeedExporter (the actual extension class) initializes the feed storages and item exporters. Concretely, it obtains a file-like object, usually a temporary file, from a FeedStorage and passes it to an ItemExporter. When an item is scraped, FeedExporter simply calls export_item on the pre-initialized ItemExporter. When the spider closes, FeedExporter calls the store method on the earlier FeedStorage object to write the file to the filesystem, upload it to a remote FTP server, upload it to S3 storage, etc.
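
To make that flow concrete, here is a minimal sketch of the same signal-driven pattern. The signals and the from_crawler hook are real scrapy APIs, but the class itself and its file handling are simplified stand-ins for the real FeedExporter:

import tempfile

from scrapy import signals
from scrapy.exporters import JsonLinesItemExporter


class MiniFeedExport:
    """Simplified sketch of the FeedExporter pattern, not scrapy's real code."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Register callbacks for the same signals FeedExporter uses.
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # A FeedStorage hands back a file-like object (often a temp file)...
        self.file = tempfile.NamedTemporaryFile(delete=False)
        # ...which is passed to an ItemExporter.
        self.exporter = JsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    def item_scraped(self, item, spider):
        # Each scraped item is appended to the file.
        self.exporter.export_item(item)

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
        # The real FeedStorage.store() would now move or upload this
        # file to its final destination (local FS, FTP, S3, ...).

Note how everything funnels into a single file that only reaches its destination when the spider closes; that is the file coupling described above.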

There is a collection of built-in item exporters and storages. But as you may notice from the description above, FeedExporter is by design tightly coupled with file storage. When using a database, the usual way to store items is to insert each one as soon as it is scraped (possibly with some buffering).

Therefore, the proper way to use feed export for database storage would be to write your own FeedExporter, registering callbacks to scrapy signals yourself. But that is unnecessary: using an item pipeline is more straightforward and does not require awareness of such implementation details, as the sketch below shows.
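
For comparison, here is what the pipeline route looks like: a plain class whose open_spider, process_item, and close_spider methods scrapy calls for you, with no signal wiring. This is a minimal sketch using sqlite3; the database file, table, and item fields are made-up assumptions:

import sqlite3


class SQLitePipeline:
    """Minimal sketch of a database pipeline; table and fields are assumptions."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Insert each item as soon as it is scraped.
        self.conn.execute(
            "INSERT INTO quotes VALUES (?, ?)",
            (item.get("text"), item.get("author")),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        self.conn.close()

You would enable it in settings.py with ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300} (the module path here is hypothetical).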

Dummmy answered Sep 23 '22