Pros and cons of using md5 hash of URI as the primary key in a database

I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or by an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea bookmark. For example, the bookmark for http://stackoverflow/ is

http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d 

where e4a42d992025b928a586b8bdc36ad38d is the md5 hash of http://stackoverflow/.
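For illustration, here is a minimal Python sketch of deriving such a key with the standard `hashlib` module. The helper name `guid_key` and the choice of UTF-8 encoding are assumptions made for this example, not anything delicious or Connotea document; the digest depends on the exact bytes hashed, so any normalisation of the GUID has to be applied consistently.

```python
import hashlib


def guid_key(guid: str) -> str:
    """Return the MD5 hex digest of a GUID (URL, DOI, ...) for use as a key.

    Assumes the GUID is hashed as UTF-8 bytes with no normalisation.
    """
    return hashlib.md5(guid.encode("utf-8")).hexdigest()


key = guid_key("http://stackoverflow/")
print(key)  # a string of 32 lowercase hexadecimal characters
```

The same input always yields the same key, which is what makes it possible to compute keys outside the database.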

Does anybody have views on the pros and cons of this approach?

For me, an advantage of this approach (as opposed to using an auto-incrementing primary key generated by the database itself) is that I have to create a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
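A rough sketch of that workflow, using Python's built-in `sqlite3` and an in-memory database. The table names (`object`, `link`), the column names, the sample GUIDs, and the `guid_key` helper are all hypothetical, invented for this example; the point is only that link rows can be keyed before the database exists.

```python
import csv
import hashlib
import io
import sqlite3


def guid_key(guid: str) -> str:
    # Hypothetical helper: MD5 hex digest of the GUID's UTF-8 bytes.
    return hashlib.md5(guid.encode("utf-8")).hexdigest()


# Stand-in for a file produced by a scraping run: one "source,target" GUID pair per line.
scraped = io.StringIO(
    "doi:10.1000/example.1,http://example.org/specimen/42\n"
    "doi:10.1000/example.1,http://example.org/sequence/ABC123\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object (id TEXT PRIMARY KEY, guid TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE link ("
    "source_id TEXT NOT NULL, target_id TEXT NOT NULL, "
    "PRIMARY KEY (source_id, target_id))"
)

objects = {}  # hash -> GUID, deduplicated across all rows
links = []    # (source hash, target hash) pairs
for source, target in csv.reader(scraped):
    objects[guid_key(source)] = source
    objects[guid_key(target)] = target
    links.append((guid_key(source), guid_key(target)))

# Keys are computed outside the database, so no lookup of generated ids is needed.
conn.executemany("INSERT OR IGNORE INTO object (id, guid) VALUES (?, ?)", objects.items())
conn.executemany("INSERT OR IGNORE INTO link (source_id, target_id) VALUES (?, ?)", links)
conn.commit()

object_count = conn.execute("SELECT COUNT(*) FROM object").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
```

With an auto-incrementing key, the import would first have to insert each object and read back its generated id before any link row could be written.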

I'd welcome any thoughts on whether this sounds sensible, or whether there are other (better?) ways of doing this.

rdmpage asked Oct 21 '08 08:10



2 Answers

It's perfectly fine.

Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).

The chance is so small that you're a trillion times more likely to get your data messed up by an undetected hardware failure than by an actual collision.
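The figure above can be sanity-checked with the standard birthday-bound approximation p ≈ 1 - exp(-n² / 2N), where N = 2^128 is the number of possible MD5 outputs. The sketch below assumes MD5 behaves like a uniformly random function on non-adversarial input; the function and variable names are made up for this example.

```python
import math

HASH_SPACE = 2 ** 128  # number of distinct 128-bit MD5 outputs


def collision_probability(n: float) -> float:
    """Approximate chance of at least one accidental collision among n hashes."""
    return -math.expm1(-(n * n) / (2 * HASH_SPACE))


# Number of hashes needed for a 50% chance: n = sqrt(2 * N * ln 2)
n_half = math.sqrt(2 * HASH_SPACE * math.log(2))

# 6 billion URLs per second, every second, for 100 years
seconds_per_century = 100 * 365.25 * 24 * 3600
n_century = 6e9 * seconds_per_century

print(f"hashes for a 50% chance: {n_half:.3e}")
print(f"hashes in 100 years at 6e9/s: {n_century:.3e}")
print(f"collision probability after that: {collision_probability(n_century):.2f}")
```

Both counts come out on the order of 10^19, so the quoted rate lands in the same ballpark as the 50% mark. For a database of, say, a billion URLs the same formula gives a probability of roughly 10^-21.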

Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.

  • To intentionally collide with the hash of an existing URL, an attacker would need a pre-image attack (strictly, a second pre-image attack). There are no known practical pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined, well-funded attacker can't compute a URL that would hash to the same value as any existing URL in your database.

  • The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.

Kornel answered Oct 04 '22 00:10


After browsing Stack Overflow a little more, I found an earlier question, "Advantages and disadvantages of GUID / UUID database keys", which covers much of this ground.

rdmpage answered Oct 04 '22 01:10