How do I assess the hash collision probability?

Tags: language-agnostic, md5, probability, estimation

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down; that's a watchdog-like security measure. Processing a file is likely to take a long time, so I need to design the application to handle this scenario. If my application gets shut down, then the next time the search system wants to index the same file it will likely give it a different temporary name.

The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue the request to the backend and wait for the result to arrive. If the request times out in the intermediate layer, that's no problem: the backend will continue working, only the intermediate layer is restarted, and it can retrieve the result from the backend when the search system later repeats the request.
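A minimal sketch of that layering, under stated assumptions: the `Backend` class, the `request` function and the use of a thread pool are all hypothetical stand-ins (a real backend would be a separate long-lived process), and `key` is whatever stable identifier ends up being used for the file contents.

```python
import concurrent.futures as cf
import threading


class Backend:
    """Hypothetical long-lived backend: at most one job per content key."""

    def __init__(self, process_file):
        self._process_file = process_file
        self._pool = cf.ThreadPoolExecutor(max_workers=2)
        self._jobs = {}  # content key -> Future holding the result
        self._lock = threading.Lock()

    def submit(self, key, path):
        # Start processing only the first time a key is seen; later
        # requests for the same key share the same Future.
        with self._lock:
            if key not in self._jobs:
                self._jobs[key] = self._pool.submit(self._process_file, path)
            return self._jobs[key]


def request(backend, key, path, timeout):
    """Intermediate layer: wait up to `timeout` seconds for the result.

    Returns None on timeout; the backend job keeps running, so a repeated
    request with the same key can pick the result up later.
    """
    future = backend.submit(key, path)
    try:
        return future.result(timeout=timeout)
    except cf.TimeoutError:
        return None
```

Note that the temporary file may be gone by the time a slow job reads it; the sketch ignores that and only illustrates the keyed hand-off.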

The problem is how to identify the files, since their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used the birthday-bound approximation to compute the probability. If I assume I have no more than 100 000 files, the probability of two files having the same MD5 (128 bit) is about 1.47x10^-29.
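For reference, the estimate can be reproduced with the usual birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)), where n is the number of files and N = 2^128 is the number of possible hash values. The function name below is only illustrative, and the formula assumes hash values are uniformly distributed.

```python
import math


def collision_probability(n_items, space_size):
    """Birthday approximation: chance that any two of n_items collide
    when each gets a uniformly random value out of space_size."""
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-pairs / space_size)


# 100 000 files hashed with a 128-bit function such as MD5.
p = collision_probability(100_000, 2.0 ** 128)
print(p)
```

For 100 000 files and 128 bits this comes out at roughly 1.47e-29, in line with the figure quoted above.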

Should I care about such a collision probability, or just assume that equal hash values mean equal file contents?


asked May 14 '09 09:05

sharptooth

1 Answer

Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions (this could be the case if the files are downloaded from the internet). If that is a concern, go for a SHA-2 based function.

There are no accidental MD5 collisions; 1.47x10^-29 is a really, really small number.

To overcome the issue of rehashing big files, I would use a three-phase identity scheme:

1. File size alone
2. File size plus a hash of four 64K chunks taken at different positions in the file
3. A full hash

So if you see a file with a new size, you know for certain you do not have a duplicate. If the size matches a known file, compare the sampled hash, and compute the full hash only when that matches too.
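The three phases could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: `FileIndex`, `sample_hash` and `full_hash` are hypothetical names, the sample offsets are simply spread evenly over the file, and SHA-256 is used as one SHA-2 based option.

```python
import hashlib
import os

CHUNK = 64 * 1024  # 64K per sample
SAMPLES = 4        # number of sampled positions


def sample_hash(path, size):
    """Hash four 64K chunks at evenly spread offsets (an assumed layout)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for i in range(SAMPLES):
            offset = max(0, (size - CHUNK) * i // (SAMPLES - 1))
            f.seek(offset)
            h.update(f.read(CHUNK))
    return h.hexdigest()


def full_hash(path):
    """Hash the whole file, reading it in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


class FileIndex:
    """Maps file contents to a stored value using the three phases."""

    def __init__(self):
        self._by_size = {}  # size -> list of {"sample", "full", "value"}

    def add(self, path, value):
        size = os.path.getsize(path)
        entry = {"sample": sample_hash(path, size),
                 "full": full_hash(path), "value": value}
        self._by_size.setdefault(size, []).append(entry)

    def find(self, path):
        """Return the value stored for identical contents, else None."""
        size = os.path.getsize(path)
        candidates = self._by_size.get(size, [])
        if not candidates:            # phase 1: unseen size, no duplicate
            return None
        sample = sample_hash(path, size)
        candidates = [c for c in candidates if c["sample"] == sample]
        if not candidates:            # phase 2: sampled chunks differ
            return None
        digest = full_hash(path)      # phase 3: full hash decides
        for c in candidates:
            if c["full"] == digest:
                return c["value"]
        return None
```

In this sketch `add` computes the full hash up front, on the assumption that a file is registered at a point where it is being read anyway; `find` only reads the whole file when both cheaper checks have already matched.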

answered Sep 23 '22 01:09

Sam Saffron
