I want to generate unique filenames per image so I'm using MD5 to make filenames.Since two of the same image could come from different locations, I'd like to actually base the hash on the image contents. What caveats does this present?
(doing this with PHP5 for what it's worth)
It's a good approach. There is an extremely small possibility that two different images might hash to the same value, but in reality your data center has a greater probability of suffering a direct hit by an asteroid.
One caveat is that you should be careful when deleting images. If you delete an image record that points to some file and you delete the file too, then you may be deleting a file that has a different record pointing to the same image (that belongs to a different user, say).
Given completely random file contents and a good cryptographic hash, the probability that there will be two files with the same hash value reaches 50% when the number of files is roughly 2 to (number of bits in the hash function / 2). That is, for a 128 bit hash there will be a 50% chance of at least one collision when the number of files reaches 2^64.
Your file contents are decidedly not random, but I have no idea how strongly that influences the probability of collision. This is called the birthday attack, if you want to google for more.
It is a probabilistic game. If the number of images will be substantially less than 2^64, you're probably fine. If you're still concerned, using a combination of SHA-1 plus MD5 (as another answer suggested) gets you to a total of 288 high-quality hash bits, which means you'll have a 50% chance of a collision once there are 2^144 files. 2^144 is a mighty big number. Mighty big. One might even say huge.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With