Scrapy file download how to use custom filename

For my Scrapy project I'm currently using the FilesPipeline. Each downloaded file is stored under the SHA1 hash of its URL as the file name.

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
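
For readers wondering where that hash comes from, here is a rough sketch of the default naming scheme using only the standard library. It is an approximation, not Scrapy's actual code: the function name default_style_path is hypothetical, and taking the extension from the URL path is an assumption about how the extension is chosen.

```python
import hashlib
import os
from urllib.parse import urlparse


def default_style_path(url: str) -> str:
    """Approximate the default 'full/<sha1><ext>' layout for a URL (sketch only)."""
    # SHA1 hex digest of the URL bytes: always 40 hexadecimal characters.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Extension taken from the path part of the URL, e.g. '.pdf' (assumption).
    extension = os.path.splitext(urlparse(url).path)[1]
    return f"full/{digest}{extension}"


if __name__ == "__main__":
    print(default_style_path("http://www.example.com/files/product1.pdf"))
```

The point is only that the stored name is derived from the URL, so the original file name is lost unless the pipeline is told to keep it.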

How can I store the files using my custom file names instead?

In the example above, I would like the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so that I keep uniqueness but make the original file name visible.

As a starting point, I explored the pipelines.py of my project without much success.

import scrapy
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        # Use the name passed along in the request meta as the stored path.
        return request.meta.get('filename', '')

    def get_media_requests(self, item, info):
        # Attach the desired file name to the download request.
        file_url = item['file_url']
        meta = {'filename': item['name']}
        yield scrapy.Request(url=file_url, meta=meta)

together with this setting in my settings.py:

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 300
    'io_spider.pipelines.MyFilesPipeline': 200
}

A similar question has been asked, but it targets images, not files.

Any help will be appreciated.

Michael asked Oct 31 '17 08:10




1 Answer

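file_path should return the path under which the file is stored. In your code, file_path returns the 'filename' value from request.meta (that is, item['name']), so that value alone becomes the file's path. The SHA1 hash is computed by the default file_path, so call the parent implementation and combine its result with your name. Your method should be something like this: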

def file_path(self, request, response=None, info=None):
    # Let the default implementation build 'full/<sha1>.<ext>'.
    original_path = super(MyFilesPipeline, self).file_path(request, response=response, info=info)
    sha1_and_extension = original_path.split('/')[1]  # drop 'full/' from the path
    return request.meta.get('filename', '') + "_" + sha1_and_extension
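
The string handling in that method can be tried out on its own, without running a spider. Below is a minimal sketch; custom_file_path is a hypothetical helper, not part of Scrapy, and falling back to the bare hash when no name is given is my own choice (the method above would produce a leading underscore instead).

```python
import posixpath


def custom_file_path(default_path: str, filename: str) -> str:
    """Combine a readable name with the '<sha1>.<ext>' part of the default path."""
    # Keep only '<sha1>.<ext>', dropping any directory prefix such as 'full/'.
    sha1_and_extension = posixpath.basename(default_path)
    if not filename:
        # No readable name available: fall back to the hash-based name.
        return sha1_and_extension
    return f"{filename}_{sha1_and_extension}"


if __name__ == "__main__":
    default_path = "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf"
    print(custom_file_path(default_path, "product1"))
```

Inside the pipeline, default_path would be the value returned by the parent file_path and filename the value stored in request.meta.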

Djunzu answered Sep 26 '22 19:09