Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently saving a file by hash in Django

I am working on a Django project. What I want the user to be able to do is upload a file (through a Form) and then save the file locally to a custom path and with a custom filename - its hash. The only solution I can think of is by using the "upload_to" argument of the FileField I'm using. What this translates to (I think):

1) Write the file to disk

2) Calculate hash

3) Return path + hash as filename

The problem is that there are two write operations: one when saving the file from memory to disk to calculate the hash, and another one when actually saving the file to specified location.

Is there a way to override FileField's save to disk method (or where can I find exactly what's going on behind the scenes) so that I can basically save the file using a temporary name and then rename it to hash, instead of having it be saved twice.

Thanks.

like image 989
Puscasu Emanuel Avatar asked Jul 30 '15 18:07

Puscasu Emanuel


People also ask

What is the use of save method in Django?

Overriding the save method – Django Models Last Updated: 11-11-2019. The save method is an inherited method from models.Model which is executed to save an instance into a particular Model. Whenever one tries to create an instance of a model either from admin interface or django shell, save() function is run.

What is hashing in Django and how does it work?

Hashing is the process where data is inputted to a hash function which uses mathematical formula to come up with a fixed length output. In a Django application, hashes of passwords are stored and not the clear text password. To know more about hashing in detail read this. How to encrypt data in databases?

How does the hash function work in Python?

The hash function only uses the contents of the file, not the name. Getting the same hash of two separating files means that there is a high probability the contents of the files are identical, even though they have different names. The code is made to work with Python 2.7 and higher (including Python 3.x).

What is the best algorithm to hash a file?

Hashing Algorithms. The most used algorithms to hash a file are MD5 and SHA-1. They are used because they are fast and they provide a good way to identify different files. The hash function only uses the contents of the file, not the name. Getting the same hash of two separating files means that there is a high probability the contents...


1 Answers

The upload_to parameter of FileField accepts a callable, and the string returned from that is joined to your MEDIA_ROOT setting to get the final filename (from the documentation):

This may also be a callable, such as a function, which will be called to obtain the upload path, including the filename. This callable must be able to accept two arguments, and return a Unix-style path (with forward slashes) to be passed along to the storage system. The two arguments that will be passed are:

  • instance: An instance of the model where the FileField is defined. More specifically, this is the particular instance where the current file is being attached. In most cases, this object will not have been saved to the database yet, so if it uses the default AutoField, it might not yet have a value for its primary key field.
  • filename: The filename that was originally given to the file. This may or may not be taken into account when determining the final destination path.

Additionally, when you access model.my_file_field, it resolves to an instance of FieldFile, which acts like a file. So, you should be able to write an upload_to like the following:

def hash_upload(instance, filename):
    instance.my_file.open() # make sure we're at the beginning of the file
    contents = instance.my_file.read() # get the contents
    fname, ext = os.path.splitext(filename)
    return "{0}_{1}{2}".format(fname, hash_function(contents), ext) # assemble the filename

Substitute the appropriate hash function you'd like to use. Saving to the disk isn't necessary at all (in fact, the file is often already uploaded to temporary storage, or in the case of smaller files just kept in memory).

You'd use this like:

class MyModel(models.Model):
    my_file = models.FileField(upload_to=hash_upload,...)

I haven't tested this yet, so you might have to poke at the line that reads the whole file (and you may want to just hash the first chunk of the file to prevent malicious users from uploading massive files and causing DoS attacks). You can get the first chunk with
instance.my_file.read(instance.my_file.DEFAULT_CHUNK_SIZE).

like image 104
Alex Van Liew Avatar answered Sep 29 '22 08:09

Alex Van Liew