Scrapy file download how to use custom filename

Tags:

python

scrapy

scrapy-spider

scrapy-pipeline

For my Scrapy project I'm currently using the FilesPipeline. Each downloaded file is stored with the SHA1 hash of its URL as the file name.

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

How can I store the files using my custom file names instead?

In the example above, I would like the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so that I keep uniqueness but make the original file name visible.

As a starting point, I explored the pipelines.py of my project without much success.

import scrapy
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem

class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        return request.meta.get('filename','')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield Request(url=file_url, meta=meta)

together with this setting in my settings.py:

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 300
    'io_spider.pipelines.MyFilesPipeline': 200
}

A similar question has been asked, but it targets images, not files.

Any help will be appreciated.

579

asked Oct 31 '17 08:10

Michael

1 Answer

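file_path should return the path to your file, relative to the storage directory. In your code, file_path returns only item['name'] (passed along in request.meta), so that string becomes the whole path and nothing else is added. The default file_path is what calculates the SHA1 hash, so call it and reuse its result. Your method should be something like this: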

def file_path(self, request, response=None, info=None):
    # Let the default implementation build 'full/<sha1><extension>', forwarding the arguments as received.
    original_path = super(MyFilesPipeline, self).file_path(request, response=response, info=info)
    sha1_and_extension = original_path.split('/')[1] # delete 'full/' from the path
    return request.meta.get('filename','') + "_" + sha1_and_extension
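To check what the resulting name looks like without running a crawl, here is a minimal standalone sketch of the same naming scheme. It assumes, as described in the question, that the default name is the SHA1 hex digest of the URL followed by the URL's extension. The helper custom_file_name is hypothetical and not part of Scrapy; it only mimics the string that file_path would return.

```python
import hashlib
import os
from urllib.parse import urlparse


def custom_file_name(url, name):
    """Build '<name>_<sha1 of url><extension>'.

    Hypothetical helper for illustration only; it is not a Scrapy API.
    """
    # Assumption: the hash is the SHA1 hex digest of the full URL.
    sha1 = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path so a query string does not leak into it.
    extension = os.path.splitext(urlparse(url).path)[1]
    return f"{name}_{sha1}{extension}"


if __name__ == "__main__":
    # Prints 'product1_' followed by a 40-character hash and '.pdf'.
    print(custom_file_name("http://www.example.com/files/product1.pdf", "product1"))
```

The hash keeps names unique even when two items share the same name, and the readable prefix makes the files easy to identify on disk.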

170

answered Sep 26 '22 19:09

Djunzu
