For an open source project, I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed, moved, or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1 GB+) and tiny files (5 KB).
EDIT
I need this algorithm to work on NTFS and all SMB shares (Linux or Windows based). I would also like it to support situations where a file is copied from one spot to another (the two physical copies are treated as one identity). I may even want this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per file type).
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing with multiple layers of comparison should be the fastest approach and should scale across the range of file sizes you're discussing.
First level of indexing is just the length of the file.
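A minimal sketch of that first level, assuming Python and a hypothetical helper named `bucket_by_size` (neither comes from the original post): group paths by byte length, so that only files sharing a size need any hashing at all.

```python
import os
from collections import defaultdict


def bucket_by_size(paths):
    """Group file paths by their length in bytes (hypothetical helper).

    Files that land alone in a bucket cannot match anything else,
    so they never need to be read.
    """
    buckets = defaultdict(list)
    for path in paths:
        buckets[os.path.getsize(path)].append(path)
    return buckets
```

Only buckets holding two or more paths would then move on to the more expensive comparison.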
Second level is hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:
Hash the first 128 KB, another 128 KB at the 1 MB mark, another 128 KB at the 10 MB mark, another 128 KB at the 100 MB mark, another 128 KB at the 1000 MB mark, and so on. As file sizes get larger, it becomes more likely that you can distinguish two files by size alone, so you hash a smaller and smaller fraction of the data. Any file under 128 KB is hashed in full.
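The sampling scheme above might look roughly like this in Python. This is only a sketch under some assumptions: the names `sample_offsets` and `file_identity` are made up for illustration, SHA-1 is used because the question mentions it, and mixing the file size into the digest is my own addition so that size and sample together form the identity.

```python
import hashlib
import os

CHUNK = 128 * 1024   # 128 KB per sample, as suggested above
MB = 1024 * 1024


def sample_offsets(size):
    """Return offsets 0, 1 MB, 10 MB, 100 MB, ... that lie inside the file."""
    offsets = [0]
    mark = MB
    while mark < size:
        offsets.append(mark)
        mark *= 10
    return offsets


def file_identity(path, chunk=CHUNK):
    """Cheap, approximate identity: file size plus a hash of sampled chunks.

    Assumption: the size is fed into the hash so that two files with the
    same sampled bytes but different lengths still get different identities.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    digest.update(str(size).encode("ascii"))
    with open(path, "rb") as handle:
        for offset in sample_offsets(size):
            handle.seek(offset)
            # A short read near the end of the file is fine; we hash what exists.
            digest.update(handle.read(chunk))
    return digest.hexdigest()
```

A file of up to 128 KB is read in full by the single sample at offset 0, while a 1 GB file is touched at only five offsets. Two files that differ only outside the sampled regions will collide, which is the accuracy tradeoff the question already accepts.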