I have written my own ImagePipeline for my Scrapy project. From my Googling I am getting conflicting information about how to set the pipeline in settings.py.
Let's say the pipeline is MyImagesPipeline and it exists in pipelines.py which contains:
    class MyImagesPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # some processing...
            return item
in my settings.py:
    ITEM_PIPELINES = {
        'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
        'myproject.pipelines.MyImagesPipeline': 100,
    }
I have two pipelines in there because if I put in MyImagesPipeline alone, item_completed gets called but without any images, and I get a KeyError because the field 'images' is not there. However, with both pipelines in the settings I am getting multiple copies of the same image.
Can someone please enlighten me on this?
EDIT:
The spider code is quite long because I am doing a lot of information processing in it, but here is what I think might be the relevant part (the callback of parse):
    def parse_data(self, response):
        img_urls = response.css('.product-image').xpath('.//img/@src').extract()
        img_url = img_urls[0]
        item['image_urls'] = [img_url,]
        yield item
Both image pipelines are processing the image_urls field in your items, which is why you're getting each image twice.
I'd stick with a single pipeline and fix whatever errors come up in it, so that one self-contained component handles all the image processing. In particular, your class has to cooperate better with the ImagesPipeline it inherits from.
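With a single pipeline, the settings from the question would shrink to something like the sketch below. The module path is the one used in the question. IMAGES_STORE is the Scrapy setting that tells the images pipeline where to save files; the directory shown is only a placeholder, not a value from the original post.

```python
# settings.py (sketch): register only the custom pipeline, so each
# entry in image_urls is downloaded once.
ITEM_PIPELINES = {
    'myproject.pipelines.MyImagesPipeline': 100,
}

# Placeholder path (an assumption): point this at a real, writable directory.
IMAGES_STORE = '/path/to/images'
```

The number is just the ordering priority among enabled pipelines, so with a single entry its exact value matters little.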
Regarding the KeyError: the ImagesPipeline.item_completed method is in charge of populating the images field in the items. If you override it without calling the parent implementation, that field is not going to be available when you need it.
To fix that in your pipeline you can update it like this:
    class MyImagesPipeline(ImagesPipeline):
        ...

        def item_completed(self, results, item, info):
            # Let the parent populate item['images'] first
            item = super(MyImagesPipeline, self).item_completed(results, item, info)
            # some processing...
            return item
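To see why the super call matters without running a crawl, here is a rough stand-in in plain Python. BaseImagesPipeline, WithoutSuper, WithSuper and the image_count key are hypothetical names made up for this illustration; the base class only mimics the general shape of ImagesPipeline.item_completed (copy the successful results into the images field) and is not Scrapy's actual code.

```python
# Stand-in classes only: these mimic the shape of ImagesPipeline,
# they are not Scrapy's implementation.

class BaseImagesPipeline:
    """Simplified stand-in for ImagesPipeline."""

    def item_completed(self, results, item, info):
        # results is a list of (success, data) tuples; keep the data
        # of every successful download in the 'images' field.
        item['images'] = [data for ok, data in results if ok]
        return item


class WithoutSuper(BaseImagesPipeline):
    def item_completed(self, results, item, info):
        # The parent never runs, so 'images' was never set.
        item['image_count'] = len(item['images'])  # raises KeyError
        return item


class WithSuper(BaseImagesPipeline):
    def item_completed(self, results, item, info):
        # Parent fills in 'images' before our own processing.
        item = super().item_completed(results, item, info)
        item['image_count'] = len(item['images'])
        return item


# Made-up download results: one success, one failure.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg'}),
    (False, Exception('download failed')),
]

try:
    WithoutSuper().item_completed(results, {'image_urls': []}, None)
    key_error_raised = False
except KeyError:
    key_error_raised = True

fixed_item = WithSuper().item_completed(results, {'image_urls': []}, None)
```

The first subclass reproduces the KeyError from the question, while the second ends up with both the images field and its own extra data on the item.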
I recommend checking ImagesPipeline's code (it lives in scrapy/pipelines/images.py in Scrapy 1.0, or scrapy/contrib/pipeline/images.py in previous versions, but the code is practically the same) to fully understand what's going on inside it.