For my scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with a SHA1 hash of their URLs as the file names.
[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
How can I store the files using my custom file names instead?
In the example above, I would want the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so I keep uniqueness but make the file name human-readable.
As a starting point, I explored the pipelines.py of my project without much success.
import scrapy
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)
with the inclusion of this setting in my settings.py:
ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 300,
    'io_spider.pipelines.MyFilesPipeline': 200,
}
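For reference, the items fed into this pipeline need a file_url and a name field (those names are taken from the get_media_requests code above); a minimal item sketch, with a made-up class name, would be:

import scrapy

class ProductFileItem(scrapy.Item):
    # 'file_url' is read by get_media_requests above;
    # 'name' becomes the human-readable part of the file name.
    file_url = scrapy.Field()
    name = scrapy.Field()
    files = scrapy.Field()  # filled in by the pipeline with the download results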
A similar question has already been asked, but it targets images rather than files.
Any help will be appreciated.
Initially, we used Scrapy's default pipeline to download the files; however, the files were being saved under the SHA1 hashes of their URLs instead of human-readable file names. So we need to create a custom pipeline that keeps the original filename and uses it when saving the downloaded files.
Scrapy provides an item pipeline for downloading files attached to a particular item, for example, when you scrape products and also want to download their files locally.
Each item pipeline component (sometimes referred to as just an "Item Pipeline") is a Python class that implements a single method, process_item. It receives an item and performs an action over it, also deciding whether the item should continue through the pipeline or be dropped and no longer processed.
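To make that concrete, here is a bare-bones pipeline component (a sketch, not from the question; only the file_url field name is borrowed from the question's code):

from scrapy.exceptions import DropItem

class RequireFileUrlPipeline:
    """Minimal item pipeline sketch: drop items that have no file URL."""

    def process_item(self, item, spider):
        if not item.get('file_url'):
            raise DropItem("missing file_url in %s" % item)
        return item  # returning the item lets it continue down the pipeline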
file_path should return the path to your file. In your code, file_path returns item['name'], and that will be your file's path. Note that by default file_path calculates a SHA1 hash to build the path. So your method should be something like this:
def file_path(self, request, response=None, info=None):
    original_path = super(MyFilesPipeline, self).file_path(request, response=None, info=None)
    sha1_and_extension = original_path.split('/')[1]  # drop the 'full/' prefix from the path
    return request.meta.get('filename', '') + "_" + sha1_and_extension
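Putting it together, the complete pipeline could look roughly like this (a sketch that keeps the question's assumption that each item carries file_url and name fields):

import scrapy
from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):

    def get_media_requests(self, item, info):
        # pass the human-readable name along with the download request
        yield scrapy.Request(url=item['file_url'], meta={'filename': item['name']})

    def file_path(self, request, response=None, info=None):
        # default path is 'full/<sha1>.<ext>'; keep the hash part for uniqueness
        original_path = super(MyFilesPipeline, self).file_path(request, response=None, info=None)
        sha1_and_extension = original_path.split('/')[1]
        return request.meta.get('filename', '') + "_" + sha1_and_extension

With that in place, the example URL above would be saved under FILES_STORE as something like product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf, assuming item['name'] is "product1".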