
Algorithm for determining a file's identity

For an open source project of mine, I am writing an abstraction layer on top of the filesystem.

This layer allows me to attach metadata and relationships to each file.

I would like the layer to handle renames gracefully and to maintain the metadata if a file is renamed, moved, or copied.

To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.

So, I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.

One such algorithm could be to use the file size and a sample of the file's bytes to calculate the hash.

Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.

I need this algorithm to work for very large files (1 GB+) as well as tiny files (5 KB).
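
For illustration, here is a minimal Python sketch of the size-plus-sampled-bytes idea; the sample size, the small-file cutoff, and the choice of head/middle/tail offsets are all assumptions, not something settled in the question:

```python
import hashlib
import os

SAMPLE_SIZE = 4096             # bytes read at each sample point (assumption)
SMALL_FILE_LIMIT = 128 * 1024  # below this, just hash the whole file (assumption)

def quick_identity(path):
    """Cheap identity: file size plus a hash of a few sampled blocks."""
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        if size <= SMALL_FILE_LIMIT:
            h.update(f.read())
        else:
            # Sample the head, the middle, and the tail of the file.
            for offset in (0, size // 2, max(0, size - SAMPLE_SIZE)):
                f.seek(offset)
                h.update(f.read(SAMPLE_SIZE))
    return (size, h.hexdigest())
```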

EDIT

I need this algorithm to work on NTFS and on all SMB shares (Linux- or Windows-based), and I would like it to support situations where a file is copied from one spot to another (two physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed), so I may have an identity provider per filetype.
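
As a loose illustration of what a per-filetype identity provider might look like for MP3s, here is a Python sketch that hashes only the audio bytes and skips the ID3 tag regions, so re-tagging does not change the identity (the SHA1 choice and the simplified tag handling are my assumptions):

```python
import hashlib

def mp3_audio_identity(path):
    """Hash an MP3's audio bytes only, skipping the ID3v2 header and ID3v1 trailer.

    Sketch only: ignores less common cases such as APEv2 tags or ID3v2 footers.
    """
    with open(path, "rb") as f:
        data = f.read()

    start = 0
    if data[:3] == b"ID3" and len(data) >= 10:
        # ID3v2 tag size is a 4-byte synchsafe integer at offset 6.
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        start = 10 + size

    end = len(data)
    if end - start >= 128 and data[-128:-125] == b"TAG":
        end -= 128  # drop the ID3v1 trailer

    return hashlib.sha1(data[start:end]).hexdigest()
```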

EDIT 2

Related question: Algorithm for determining a file’s identity (Optimisation)

asked Jan 19 '09 by Sam Saffron


2 Answers

Bucketing with multiple layers of comparison should be the fastest and most scalable approach across the range of files you're discussing.

The first level of indexing is just the length of the file.

The second level is a hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:

  1. To avoid hitting regularly spaced headers, which may be highly similar or identical, you need to step through the file by a non-conforming amount, e.g. multiples of a prime, or successive primes (see the sketch after this list).
  2. Avoid steps that might keep landing on regular record headers; if you are getting the same value from your sample bytes despite different locations, try adjusting the step by another prime.
  3. Cope with anomalous files that contain large stretches of identical values, either because they are unencoded images or are just filled with nulls.
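
A minimal Python sketch of this layered scheme, assuming SHA1, a 64-byte sample per step, a prime stride of 65537, and a 256 KB whole-file cutoff (all of these constants are my assumptions, not part of the answer):

```python
import hashlib
import os

PRIME_STEP = 65537             # prime stride between sample offsets (assumption)
SAMPLE_BYTES = 64              # bytes read at each sample offset (assumption)
WHOLE_FILE_LIMIT = 256 * 1024  # below this size, hash the entire file (assumption)

def layered_identity(path):
    """First level: file length. Second level: whole-file or sampled hash."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= WHOLE_FILE_LIMIT:
            h.update(f.read())
        else:
            # Step through the file at prime-spaced offsets so the samples
            # do not line up with regularly spaced headers or record boundaries.
            offset = 0
            while offset < size:
                f.seek(offset)
                h.update(f.read(SAMPLE_BYTES))
                offset += PRIME_STEP
    return (size, h.hexdigest())
```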
answered by Andy Dent


Hash the first 128 KB, another 128 KB at the 1 MB mark, another 128 KB at the 10 MB mark, another 128 KB at the 100 MB mark, another 128 KB at the 1000 MB mark, and so on. As file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128 KB is hashed in its entirety.
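
A small Python sketch of this scheme; the use of SHA1 and the mixing in of the file size are my assumptions about details the answer leaves open:

```python
import hashlib
import os

CHUNK = 128 * 1024  # 128 KB sampled at each mark

def exponential_sample_hash(path):
    """Hash 128 KB at offset 0, then at the 1 MB, 10 MB, 100 MB, ... marks."""
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())  # include the length itself (assumption)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            f.seek(offset)
            h.update(f.read(CHUNK))
            offset = 1024 * 1024 if offset == 0 else offset * 10
    return h.hexdigest()
```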

answered by Jay Kominek