I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
where e4a42d992025b928a586b8bdc36ad38d ais the md5 hash of http://stackoverflow/.
Does anybody have views on the pros and cons of this approach?
For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.
When compared to other algorithms like the SHA algorithm, MD5 is comparatively slow. It is possible to construct the same hash function for two distinct inputs using MD5. MD5 is less secure when compared to the SHA algorithm since MD5 is more vulnerable to collision attacks.
Although originally designed as a cryptographic message authentication code algorithm for use on the internet, MD5 hashing is no longer considered reliable for use as a cryptographic checksum because security experts have demonstrated techniques capable of easily producing MD5 collisions on commercial off-the-shelf ...
Weaknesses in the MD5 algorithm allow for collisions in output. As a result, attackers can generate cryptographic tokens or other data that illegitimately appear to be authentic.
Unfortunately, MD5 has been cryptographically broken and considered insecure. For this reason, it should not be used for anything. Instead, developers should switch to the Secure Hash Algorithm or a Symmetric Cryptographic Algorithm.
It's perfectly fine.
Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).
It's such an improbable chance that you're trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.
Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.
The type of collision you'd need to intentionally collide with a hash of another URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.
The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.
After browsing stackoverfow a little more I found an earlier question Advantages and disadvantages of GUID / UUID database keys which covers much of this ground.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With