
Scrapy Custom ImagePipeline Settings.py

Tags:

python

scrapy

I have written my own ImagePipeline for my Scrapy project. From searching around, I am finding conflicting information about how to set up the pipeline in settings.py.

Let's say the pipeline is MyImagesPipeline and it exists in pipelines.py which contains:

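import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request one download per URL listed in the item
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # some processing...
        return item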

in my settings.py:

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
    'myproject.pipelines.MyImagesPipeline': 100,
}

I have both pipelines in there because if I list MyImagesPipeline alone, item_completed gets called but without any images, and I get a KeyError because the field 'images' is not there. However, with both pipelines in the settings I am getting multiple copies of the same image.

Can someone please enlighten me on this?

EDIT:

The spider code is quite long because I am doing a lot of information processing in it, but here is what I think might be the relevant part (the callback of parse):

def parse_data(self, response):
    img_urls = response.css('.product-image').xpath('.//img/@src').extract()
    img_url = img_urls[0]
    item['image_urls'] = [img_url,]
    yield item
Nancy Poekert asked Nov 09 '22 13:11



1 Answer

Both image pipelines are processing the image_urls field in your items, which is why you're getting each image twice.

I'd stick with a single pipeline and fix the errors you hit in it, so that one self-contained component handles all of the image processing. In particular, your subclass has to cooperate with the ImagesPipeline methods it overrides.
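
As a rough sketch, settings.py with only the custom pipeline could look like the following. The module path is taken from your question; the IMAGES_STORE value is a placeholder I'm assuming, since ImagesPipeline and its subclasses are disabled unless that setting points to a storage location.

```python
# settings.py (sketch): enable only the custom subclass, not the stock pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 100,
}

# Placeholder path (assumption): directory where downloaded images are written
IMAGES_STORE = '/path/to/images'
```

With only one entry, each URL in image_urls is requested by a single pipeline, so the duplicate downloads should go away.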

Regarding the KeyError: the ImagesPipeline.item_completed method is in charge of filling in the images field of each item. If you override it without calling the parent implementation, that field is never set, so it isn't available when you need it.

To fix that in your pipeline you can update it like this:

class MyImagesPipeline(ImagesPipeline):
    ...

    def item_completed(self, results, item, info):
        # Let ImagesPipeline populate item['images'] first
        item = super(MyImagesPipeline, self).item_completed(results, item, info)

        # some processing...
        return item
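
To see why the super() call matters without running a crawl, here is a minimal sketch that uses a stand-in base class instead of Scrapy itself. StandInImagesPipeline, WithoutSuper, WithSuper and the image_count field are all hypothetical names for illustration; the stand-in only imitates the one behavior discussed here (copying successful results into item['images']) and is not Scrapy's actual code.

```python
class StandInImagesPipeline:
    """Hypothetical stand-in for ImagesPipeline; not Scrapy's real code."""

    def item_completed(self, results, item, info):
        # results is a list of (success, file_info) tuples, one per request
        item['images'] = [file_info for ok, file_info in results if ok]
        return item


class WithoutSuper(StandInImagesPipeline):
    def item_completed(self, results, item, info):
        # The parent never ran, so 'images' is missing and this raises KeyError
        item['image_count'] = len(item['images'])
        return item


class WithSuper(StandInImagesPipeline):
    def item_completed(self, results, item, info):
        # Run the parent first so 'images' exists before the custom processing
        item = super().item_completed(results, item, info)
        item['image_count'] = len(item['images'])
        return item


# One successful download, in the (success, file_info) shape described above
results = [(True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg'})]

try:
    WithoutSuper().item_completed(results, {'image_urls': []}, None)
    missing_field = False
except KeyError:
    missing_field = True

fixed_item = WithSuper().item_completed(results, {'image_urls': []}, None)
```

The override that skips the parent hits the same KeyError you describe, while the one that calls super() first finds the images field already filled in.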

I recommend reading ImagesPipeline's code to fully understand what's going on inside it. It lives in scrapy/pipelines/images.py in Scrapy 1.0 and in scrapy/contrib/pipeline/images.py in earlier versions, but the code is practically the same.

Julia Medina answered Nov 15 '22 13:11
