For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.
How can I store the files using my own custom file names instead?
What if my custom file name needs to contain another scraped field from the same item, e.g. use item['desc'] in the file name for the image downloaded from item['image_url']? If I understand correctly, that would involve somehow accessing the other item fields from the images pipeline.
Any help will be appreciated.
This is just an update of the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated:
from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline  # scrapy.pipelines.images in Scrapy 1.x+

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download
    def file_path(self, request, response=None, info=None):
        # item = request.meta['item']  # this way you can use anything from the item, not just the URL
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % image_guid

    # Name the thumbnail
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + response.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        for image in item['images']:
            # yield Request(image, meta={'item': item})  # add meta like this if file_path() needs the item
            yield Request(image)
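To make Scrapy actually pick up the custom pipeline, it has to be enabled in settings.py. A minimal sketch, assuming the class lives in myproject/pipelines.py (the module path is an assumption, adjust it to your project):

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.MyImagesPipeline': 1,  # hypothetical module path
    }
    IMAGES_STORE = '/path/to/images'  # base directory; file_path()/thumb_path() return paths relative to it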
In Scrapy 0.12 I solved it with something like this:
from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    # Name the full-size download
    def image_key(self, url):
        image_guid = url.split('/')[-1]
        return 'full/%s.jpg' % image_guid

    # Name the thumbnail
    def thumb_key(self, url, thumb_id):
        image_guid = thumb_id + url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        yield Request(item['images'])
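For completeness, both snippets above assume the item carries its image URLs in an images field; a minimal item definition under that assumption might look like this (the class and extra field names are illustrative, not part of the original answers):

    from scrapy.item import Item, Field

    class MyItem(Item):
        images = Field()  # image URL(s) consumed by get_media_requests()
        desc = Field()    # other scraped fields you may want to reuse in the file name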
I found my way in 2017, with Scrapy 1.1.3:
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {'filename': item['name']}
        yield Request(url=img_url, meta=meta)
As in the code above, you can add whatever name you want to the Request meta in get_media_requests() and read it back in file_path() with request.meta.get('yourname', '').
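One caveat with this approach: item['name'] may contain slashes, spaces or other characters that are awkward in file paths, so it can be worth sanitizing the value before returning it. A minimal sketch of a file_path() variant inside the same pipeline class (the regex and the hard-coded .jpg extension are assumptions, not part of the original answer):

    import re

    def file_path(self, request, response=None, info=None):
        # replace anything that is not alphanumeric, dash or underscore
        name = request.meta.get('filename', 'image')
        safe_name = re.sub(r'[^\w-]+', '_', name)
        return 'full/%s.jpg' % safe_name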
This was the way I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory: the filename of the downloaded image is key.
import os
# ImagesPipeline, FSImagesStore, settings and NotConfigured are imported
# from the corresponding Scrapy 0.10 modules

class FSImagesStoreChangeableDirectory(FSImagesStore):

    def persist_image(self, key, image, buf, info, append_path):
        absolute_path = self._get_filesystem_path(append_path + '/' + key)
        self._mkdir(os.path.dirname(absolute_path), info)
        image.save(absolute_path)

class ProjectPipeline(ImagesPipeline):

    def __init__(self):
        # deliberately skip ImagesPipeline.__init__ and set up the store manually below
        super(ImagesPipeline, self).__init__()
        store_uri = settings.IMAGES_STORE
        if not store_uri:
            raise NotConfigured
        self.store = FSImagesStoreChangeableDirectory(store_uri)