
Scrapy: how to ignore items with blank fields using Loader

I would like to know how to ignore (drop) items that don't have all of their fields filled in. In the scrapyd output I'm getting items from pages that don't populate every field.

I have this code:

class Product(scrapy.Item):
    source_url = scrapy.Field(
        output_processor = TakeFirst()
    )
    name = scrapy.Field(
        input_processor = MapCompose(remove_entities),
        output_processor = TakeFirst()
    )
    initial_price = scrapy.Field(
        input_processor = MapCompose(remove_entities, clear_price),
        output_processor = TakeFirst()
    )
    main_image_url = scrapy.Field(
        output_processor = TakeFirst()
    )

Parser:

def parse_page(self, response):
    try:
        l = ItemLoader(item=Product(), response=response)
        l.add_value('source_url', response.url)
        l.add_css('name', 'h1.title-product::text')
        l.add_css('main_image_url', 'div.pics a img.zoom::attr(src)')

        l.add_css('initial_price', 'ul.precos li.preco_normal::text')
        l.add_css('initial_price', 'ul.promocao li.preco_promocao::text')

        return l.load_item()

    except Exception as e:
        # self.log() returns None, so log the message and URL together instead of printing its result
        self.log("#1 ERROR: %s (%s)" % (e, response.url))

I want to do this with the Loader, without having to create my own Selector (to avoid processing items twice). I guess I could drop them in a pipeline, but that is probably not the best way because these items aren't valid.

Asked by Rafael Capucho on May 22 '14 at 15:05


1 Answer

Validating data is one of the typical use cases for pipelines. In your case you only need to write a small amount of code to check for the required fields, something along the lines of:

from scrapy.exceptions import DropItem

class YourPersonalPipeline(object):
    def process_item(self, item, spider):
        required_fields = [] # your list of required fields
        if all(field in item for field in required_fields):
            return item
        else:
            raise DropItem("your reason")
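As a slightly fuller sketch of the same idea, the variant below also treats fields that are set but hold only whitespace as missing, and names the missing fields in the drop reason. The class name RequiredFieldsPipeline and the helper _is_blank are hypothetical, the field list is assumed from the Product item in the question, and the fallback DropItem class exists only so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Assumption: stand-in so this sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline(object):
    # Hypothetical list; field names taken from the Product item in the question.
    required_fields = ('source_url', 'name', 'initial_price', 'main_image_url')

    def process_item(self, item, spider):
        # A field counts as missing if it was never set or holds a blank string.
        missing = [f for f in self.required_fields
                   if self._is_blank(item.get(f))]
        if missing:
            raise DropItem("Missing required fields: %s" % ", ".join(missing))
        return item

    @staticmethod
    def _is_blank(value):
        # None means the loader never populated the field.
        if value is None:
            return True
        # Whitespace-only strings are treated as blank; other values (e.g. 0) are kept.
        if isinstance(value, str):
            return not value.strip()
        return False
```

Because the check only uses item.get(), it should work the same way on a scrapy.Item and on a plain dict.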

You need to enable the pipeline in settings.py. Read more in the Scrapy docs.
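For illustration, enabling it might look like the fragment below. The module path myproject.pipelines is an assumption (use wherever the class actually lives in your project), and 300 is just an arbitrary order value; pipelines with lower numbers run first.

```python
# settings.py
ITEM_PIPELINES = {
    # 'myproject.pipelines' is a placeholder module path (assumption).
    'myproject.pipelines.YourPersonalPipeline': 300,
}
```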

Answered by Pawel Miech on Oct 24 '22 at 01:10