
Scraping data without having to explicitly define each field to be scraped

Tags: python, scrapy

I want to scrape a page of data (using the Python Scrapy library) without having to define each individual field on the page. Instead I want to dynamically generate fields using the id of the element as the field name.

At first I was thinking the best way to do this would be to have a pipeline that collects all the data, and outputs it once it has it all.

Then I realised that I need to pass the data to the pipeline in an item, but I can't define an item as I don't know what fields it will need!

What's the best way for me to tackle this problem?

Acorn asked Feb 21 '11 17:02


2 Answers

Update:

The old method didn't work with item loaders and was complicating things unnecessarily. Here's a better way of achieving a flexible item:

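from scrapy.item import BaseItem      # deprecated in newer Scrapy releases
from scrapy.loader import ItemLoader  # was scrapy.contrib.loader before Scrapy 1.0

class FlexibleItem(dict, BaseItem):
    """A dict-backed item that accepts any field name."""

if __name__ == '__main__':
    item = FlexibleItem()
    loader = ItemLoader(item)

    # Repeated add_value() calls on one field accumulate into a list.
    loader.add_value('foo', 'bar')
    loader.add_value('baz', 123)
    loader.add_value('baz', 'test')
    # A field name of None takes a dict and adds each key as its own field.
    loader.add_value(None, {'abc': 'xyz', 'foo': 555})

    print(loader.load_item())

    if 'meow' not in item:
        print("it's not a cat!")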

Result:

{'foo': ['bar', 555], 'baz': [123, 'test'], 'abc': ['xyz']}
it's not a cat!

Old solution:

Okay, I've found a solution. It's a bit of a hack, but it works.

A Scrapy Item stores its field names in a dict called fields. When you add data to an Item, it checks whether the field exists, and if it doesn't, it raises an error:

def __setitem__(self, key, value):
    if key in self.fields:
        self._values[key] = value
    else:
        raise KeyError("%s does not support field: %s" %\
              (self.__class__.__name__, key))

What you can do is override the __setitem__ method to be less strict:

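from scrapy.item import Item, Field

class FlexItem(Item):
    def __setitem__(self, key, value):
        # Register unknown fields on the fly instead of raising KeyError.
        # Note: fields lives on the class, so every FlexItem sees the new field.
        if key not in self.fields:
            self.fields[key] = Field()

        self._values[key] = value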

And there you go.

Now when you add data to an Item, if the item doesn't have that field defined, it will be added, and then the data will be added as normal.
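
To make the difference concrete without installing Scrapy, here is a minimal sketch that imitates the mechanism. StrictItem and this Field are stand-ins written for illustration only (they are not Scrapy's real classes); they just copy the fields/_values behaviour quoted above, so treat it as a model of the idea rather than of the library.

```python
class Field(dict):
    """Stand-in for scrapy.Field, which is a plain dict of field metadata."""


class StrictItem:
    """Stand-in that imitates the strict check quoted above (not Scrapy's class)."""
    fields = {}  # class-level registry of allowed field names

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key in self.fields:
            self._values[key] = value
        else:
            raise KeyError("%s does not support field: %s"
                           % (self.__class__.__name__, key))

    def __getitem__(self, key):
        return self._values[key]


class FlexItem(StrictItem):
    fields = {}  # separate registry, so StrictItem.fields is left untouched

    def __setitem__(self, key, value):
        # Declare the field on first use, then store the value as normal.
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value


strict = StrictItem()
try:
    strict["price"] = "9.99"
except KeyError as exc:
    print("rejected:", exc)

flex = FlexItem()
flex["price"] = "9.99"
print(flex["price"], sorted(FlexItem.fields))

# The registry is shared by the class, so a fresh instance already knows "price".
print("price" in FlexItem().fields)
```

One consequence worth keeping in mind: because the registry belongs to the class, fields discovered on one page stay declared for every later item of that class.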

Acorn answered Oct 27 '22 23:10


This solution works with the exporters (scrapy crawl -t json -o output.json):

import scrapy

class FlexibleItem(scrapy.Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        super(FlexibleItem, self).__setitem__(key, value)

EDIT: updated to work with latest Scrapy
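
With a flexible item in place, what remains from the question is building the id-to-value mapping itself. The sketch below does that with only the standard library's html.parser, so it runs without Scrapy; IdTextCollector and SAMPLE_HTML are hypothetical names made up for this example. In a real spider you would more likely select the elements with an XPath such as //*[@id] and then copy the resulting dict into a FlexibleItem, or hand it to loader.add_value(None, data) as in the first answer.

```python
from html.parser import HTMLParser


class IdTextCollector(HTMLParser):
    """Collect the text of every element that carries an id attribute."""

    def __init__(self):
        super().__init__()
        self.data = {}    # element id -> accumulated text
        self._open = []   # stack of (tag, id or None) for currently open elements

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, dict(attrs).get("id")))

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this tolerates unclosed children.
        while self._open:
            if self._open.pop()[0] == tag:
                break

    def handle_data(self, text):
        text = text.strip()
        if not text:
            return
        # Credit the text to the nearest enclosing element that has an id.
        for _, element_id in reversed(self._open):
            if element_id:
                joined = self.data.get(element_id, "") + " " + text
                self.data[element_id] = joined.strip()
                break


# Hypothetical page fragment, used only to exercise the collector.
SAMPLE_HTML = """
<div id="name">Widget</div>
<div id="price"><b>9.99</b> USD</div>
<p>no id here</p>
"""

collector = IdTextCollector()
collector.feed(SAMPLE_HTML)
print(collector.data)
```

Text inside a child element without its own id (the <b> above) is attributed to the closest ancestor that has one, and text with no id anywhere above it is dropped.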

Elias Dorneles answered Oct 27 '22 22:10