Algorithm for determining a file’s identity (Optimisation)

Further to this question: Algorithm for determining a file’s identity

Recap: I'm looking for a cheap algorithm for determining a file's identity that works the vast majority of the time.

I went ahead and implemented an algorithm that gives me a "pretty unique" hash per file.

The way my algorithm works is:

  • For files smaller than a certain threshold I use the full file's content for the identity hash.

  • For files larger than the threshold I take N random samples of X bytes each.

  • I include the file size in the hashed data, so files with different sizes always result in different hashes. (A rough sketch of the scheme follows this list.)
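
Roughly, a minimal Python sketch of the scheme above (this is not the linked implementation; the 64 KB threshold, the MD5 digest, and deriving the sample offsets deterministically from the file size so the hash stays reproducible are all assumptions here):

    import hashlib
    import os
    import random

    SMALL_FILE_THRESHOLD = 64 * 1024   # assumed threshold for "small" files
    NUM_SAMPLES = 4                    # N: number of samples (value mentioned below)
    SAMPLE_SIZE = 8 * 1024             # X: bytes per sample (value mentioned below)

    def quick_identity_hash(path):
        """Cheap identity hash: full content for small files, otherwise
        the file size plus a few fixed-size samples."""
        size = os.path.getsize(path)
        h = hashlib.md5(str(size).encode())     # the file size is always hashed
        with open(path, "rb") as f:
            if size <= SMALL_FILE_THRESHOLD:
                h.update(f.read())              # small file: hash everything
            else:
                # Derive the sample offsets deterministically from the size
                # so the same file always yields the same hash (assumption).
                rng = random.Random(size)
                for _ in range(NUM_SAMPLES):
                    f.seek(rng.randrange(0, size - SAMPLE_SIZE))
                    h.update(f.read(SAMPLE_SIZE))
        return h.hexdigest()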

Questions:

  • What values should I choose for N and X (how many random samples should I take, and of what size)? I went with 4 samples of 8K each and have not been able to stump the algorithm. I found that increasing the number of samples quickly decreases the speed of the algorithm (because seeks are pretty expensive).

  • The maths one: how similar do my files need to be for this algorithm to blow up, i.e. for two different files with the same length to end up with the same hash?

  • The optimization one: are there any ways I can optimize my concrete implementation to improve throughput? (I seem to be able to do about 100 files a second on my system.)

  • Does this implementation look sane? Can you think of any real-world examples where this will fail? (My focus is on media files.)

Relevant information:

The algorithm I implemented

Thanks for your help!

asked Apr 25 '09 by Sam Saffron


2 Answers

  • Always include the first and last block of the file in the hash.

This is because they're the most likely to differ from file to file. A BMP, for instance, may have a fairly standard header (like an 800x600 image, 24-bit, the rest null), so you may want to overshoot the header a bit to get to the differentiating data. The problem is that headers vary wildly in size.

The last block is for file formats that append data to the original.

  • Read in blocks of a size that is native to the filesystem you use, or at least divisible by 512.
  • Always read blocks at offsets that are divisible by the block size.
  • If you get the same hash for same-sized files, do a deep scan (hash all the data) and memorize the file path so you don't scan it again; a sketch of all of this follows below.

Even then, unless you're lucky, you will misidentify some files as the same (for example a SQL Server database file and its 1:1 backup copy after only a few insertions; except that SQL Server does write a timestamp...).
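
Putting these points together, a rough Python sketch (the 4 KB block size, MD5, and the evenly spaced middle samples are arbitrary choices here, not part of the advice above):

    import hashlib
    import os

    BLOCK_SIZE = 4096   # assumed: use your filesystem's native block size (a multiple of 512)

    def sample_hash(path):
        """Hash the file size, the first block, the last block, and a few
        block-aligned samples in between."""
        size = os.path.getsize(path)
        h = hashlib.md5(str(size).encode())
        with open(path, "rb") as f:
            h.update(f.read(BLOCK_SIZE))                 # first block (header and a bit more)
            if size > BLOCK_SIZE:
                last = (size - 1) // BLOCK_SIZE * BLOCK_SIZE
                f.seek(last)
                h.update(f.read(BLOCK_SIZE))             # last block (catches appended data)
            for i in range(1, 4):                        # a few aligned samples from the middle
                f.seek(size * i // 4 // BLOCK_SIZE * BLOCK_SIZE)
                h.update(f.read(BLOCK_SIZE))
        return h.hexdigest()

    def full_hash(path):
        """Deep scan: hash the whole file in block-sized reads."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(BLOCK_SIZE), b""):
                h.update(chunk)
        return h.hexdigest()

    # Cache of deep-scan results so a file is never fully hashed twice.
    deep_hashes = {}

    def definitely_same(path_a, path_b):
        """Fallback when two same-sized files share a sample hash."""
        for p in (path_a, path_b):
            deep_hashes.setdefault(p, full_hash(p))
        return deep_hashes[path_a] == deep_hashes[path_b]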

answered by Pasi Savolainen


I would avoid a solution like this. In practice it might be close to impossible for two media files in compressed formats to have the same size and the same data at corresponding positions. But if you have to deal with uncompressed images or wave files, the chances that small local changes go undetected grow.

So I think you should really hash the whole file. While this seems expensive, it might not be if you have access to all the files, for example if you build a file server or something like that. You could build the hash incrementally.

If you see a new file with a unique file length, just store the file length. If another file with the same length is added, calculate the hashes of both files block by block until they differ. Store the file length, the hash, and how many blocks of the file are included in the hash. Whenever you detect matching file lengths and hashes and you have not yet hashed the whole file, extend the hash by adding more blocks.
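
A minimal Python sketch of this lazy, block-by-block comparison (the 64 KB block size and MD5 are assumptions, and the running digest is simply kept in memory so it can be extended later):

    import hashlib

    BLOCK = 64 * 1024   # assumed block size

    class LazyHash:
        """Running digest of a file's prefix that can be extended block by
        block, only as far as needed."""
        def __init__(self, path, size):
            self.path = path
            self.size = size
            self.blocks = 0            # blocks fed into the digest so far
            self._md5 = hashlib.md5()

        def extend_to(self, blocks):
            """Feed blocks into the digest until `blocks` are covered, then
            return the current prefix hash."""
            with open(self.path, "rb") as f:
                f.seek(self.blocks * BLOCK)
                while self.blocks < blocks:
                    chunk = f.read(BLOCK)
                    if not chunk:
                        break
                    self._md5.update(chunk)
                    self.blocks += 1
            return self._md5.hexdigest()

    def same_content(a, b):
        """Extend both prefix hashes block by block until they differ or the
        whole file has been hashed."""
        if a.size != b.size:
            return False
        total_blocks = (a.size + BLOCK - 1) // BLOCK
        blocks = 1
        while a.extend_to(blocks) == b.extend_to(blocks):
            if blocks >= total_blocks:
                return True            # never diverged: identical content
            blocks += 1
        return False

The per-file state you would persist is then the length, the prefix hash, and the block count, as described above.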

Some thoughts about performance. For small files, the chance of equal file lengths is quite high, since there are not that many different small file lengths. But it is not expensive to hash small files.

For larger files, the chance of file length collisions goes down, as there are more and more possible file lengths. For different media files, the chances are very good that they differ directly beyond the header, so you will need to hash only a short part of the beginning of the file.

Finally, you will be sure to detect different files (except for hash collisions) because you will hash the whole file if required.

UPDATE

For movies I would consider the file length practically unique, but files recoded to fit on a given medium probably render this idea void: (S)VCD movies will all fall into a small range of file lengths around CD-ROM capacity.

But for movie files in general, I would just hash one block (maybe 512 bytes) from the middle of the file. Two different movies with the same image and sound at the same position? Practically impossible, unless you manipulate files specifically to fail this test. But you could easily generate files to fail all deterministic sampling strategies, so this should not really matter.
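
A tiny sketch of that middle-block check (the 512-byte block and MD5 are just the assumed values here):

    import hashlib
    import os

    def middle_block_hash(path, block=512):
        """Hash the file size plus one block taken from the middle of the file."""
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(max(0, size // 2 - block // 2))
            return hashlib.md5(str(size).encode() + f.read(block)).hexdigest()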

answered by Daniel Brückner