I would like to know how to ignore (drop) items that don't have all of their fields filled in, because in the scrapyd output I'm getting items from pages where some fields are missing.
I have this code:
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_entities  # renamed replace_entities in newer w3lib versions

class Product(scrapy.Item):
    source_url = scrapy.Field(
        output_processor=TakeFirst()
    )
    name = scrapy.Field(
        input_processor=MapCompose(remove_entities),
        output_processor=TakeFirst()
    )
    initial_price = scrapy.Field(
        input_processor=MapCompose(remove_entities, clear_price),
        output_processor=TakeFirst()
    )
    main_image_url = scrapy.Field(
        output_processor=TakeFirst()
    )
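(For reference, clear_price is a small custom input processor of mine that cleans up the raw price text; a rough sketch of that kind of helper, assuming prices come in the "1.234,56" format:)

import re

def clear_price(value):
    # keep only digits and separators, then normalize "1.234,56" -> "1234.56"
    cleaned = re.sub(r'[^\d,.]', '', value)
    return cleaned.replace('.', '').replace(',', '.')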
Parser:
from scrapy.loader import ItemLoader  # at the top of the spider module

def parse_page(self, response):
    try:
        l = ItemLoader(item=Product(), response=response)
        l.add_value('source_url', response.url)
        l.add_css('name', 'h1.title-product::text')
        l.add_css('main_image_url', 'div.pics a img.zoom::attr(src)')
        l.add_css('initial_price', 'ul.precos li.preco_normal::text')
        l.add_css('initial_price', 'ul.promocao li.preco_promocao::text')
        return l.load_item()
    except Exception as e:
        self.log("#1 ERROR: %s on %s" % (e, response.url))
I want to do this in the Loader itself, without having to create my own Selector (to avoid processing items twice). I guess I could drop them in a pipeline, but that's probably not the best way, since those items aren't valid.
Data validation is one of the typical use cases for pipelines. In your case, you only need a small amount of code to check for the required fields, something along the lines of:
from scrapy.exceptions import DropItem

class YourPersonalPipeline(object):
    def process_item(self, item, spider):
        required_fields = []  # your list of required fields, e.g. ['name', 'initial_price']
        if all(field in item for field in required_fields):
            return item
        else:
            raise DropItem("your reason")
You also need to enable the pipeline in settings.py. Read more in the Scrapy docs.
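For example, assuming the pipeline class lives in yourproject/pipelines.py (adjust the module path and the priority number for your project):

ITEM_PIPELINES = {
    'yourproject.pipelines.YourPersonalPipeline': 300,
}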