I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item
per site that summarizes the information I found across that site, regardless of which page(s) I found it on.
I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That is a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests due to a link cycle.
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
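(For what it's worth, the external post-processing route is only a few lines. A sketch, assuming each json-lines record carries a hypothetical "site" field identifying which site it came from:)

```python
import json
from collections import defaultdict

def merge_per_site(in_path, out_path, key="site"):
    """Merge all JSON-lines records that share the same site key into one record each."""
    merged = defaultdict(dict)
    with open(in_path) as f:
        for line in f:
            rec = json.loads(line)
            # Later pages' values overwrite earlier ones for the same field.
            merged[rec[key]].update(rec)
    with open(out_path, "w") as f:
        for rec in merged.values():
            f.write(json.dumps(rec) + "\n")
```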
How about this ugly solution?
Define a dictionary (defaultdict(list)) in a pipeline for storing per-site data. In process_item you can append dict(item) to the list for that item's site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want.
It should work in theory, but I'm not sure this solution is the best one.
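A minimal sketch of that pipeline, assuming each item carries a "site" field identifying which site it came from (that field name, and the sites.jl output path, are made up for illustration). The try/except fallback is only there so the sketch runs even without Scrapy installed:

```python
import json
from collections import defaultdict

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so this sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class SiteAggregationPipeline:
    """Collect partial items per site; emit one merged record per site at close."""

    def __init__(self):
        self.sites = defaultdict(list)

    def process_item(self, item, spider):
        d = dict(item)
        self.sites[d["site"]].append(d)
        # Drop the partial item so it never reaches the feed exporter.
        raise DropItem("stored for per-site aggregation")

    def close_spider(self, spider):
        # Merge each site's partial dicts and dump them wherever you want,
        # e.g. as JSON lines.
        with open("sites.jl", "w") as f:
            for site, parts in self.sites.items():
                merged = {}
                for part in parts:
                    merged.update(part)
                f.write(json.dumps(merged) + "\n")
```

Enable it in settings.py via ITEM_PIPELINES as with any other pipeline; because process_item always raises DropItem, the merged output comes only from close_spider.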