Pros and cons of using md5 hash of URI as the primary key in a database

I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or by an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons of a delicious or Connotea bookmark. For example, the bookmark for http://stackoverflow/ is

http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d

where e4a42d992025b928a586b8bdc36ad38d is the md5 hash of http://stackoverflow/.
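
As a minimal sketch of how such a key could be derived, assuming Python's standard hashlib module and a UTF-8 encoding of the URL (the helper name url_key is hypothetical, and delicious may normalise URLs differently before hashing):

```python
import hashlib


def url_key(url: str) -> str:
    """Return the 32-character hex MD5 digest of a URL, for use as a key."""
    # MD5 operates on bytes, so the URL is encoded first (UTF-8 is an assumption).
    return hashlib.md5(url.encode("utf-8")).hexdigest()


key = url_key("http://stackoverflow/")
print(key)  # a 32-character lowercase hex string
```

The same input always produces the same digest, which is what makes the hash usable as a stable identifier.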

Does anybody have views on the pros and cons of this approach?

For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
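
To illustrate the bulk-import idea, here is a rough sketch using an in-memory SQLite database. The table names, column names, and example URLs are all hypothetical; the point is only that the keys can be computed outside the database, before any row exists:

```python
import hashlib
import sqlite3


def url_key(url: str) -> str:
    """Hex MD5 digest of a URL (UTF-8 encoding is an assumption)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical links, as they might come out of a scraping run.
links = [
    ("http://example.org/paper/1", "http://example.org/specimen/7"),
    ("http://example.org/paper/1", "http://example.org/sequence/42"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE object (id TEXT PRIMARY KEY, uri TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE link ("
    " source_id TEXT NOT NULL,"
    " target_id TEXT NOT NULL,"
    " PRIMARY KEY (source_id, target_id))"
)

# Keys are derived from the URLs alone, so no lookup of
# database-generated ids is needed before inserting the links.
uris = {uri for pair in links for uri in pair}
conn.executemany(
    "INSERT OR IGNORE INTO object (id, uri) VALUES (?, ?)",
    [(url_key(uri), uri) for uri in uris],
)
conn.executemany(
    "INSERT OR IGNORE INTO link (source_id, target_id) VALUES (?, ?)",
    [(url_key(source), url_key(target)) for source, target in links],
)
conn.commit()

object_count = conn.execute("SELECT COUNT(*) FROM object").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(object_count, link_count)
```

Rebuilding the database from the same URLs would yield the same ids, which an auto-incrementing key does not guarantee.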

I'd welcome any thoughts on whether this sounds sensible, or whether there are other (better?) ways of doing this.

asked Oct 21 '08 08:10

rdmpage

2 Answers

It's perfectly fine.

Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).

It's such an improbable chance that you're a trillion times more likely to get your data messed up by an undetected hardware failure than by an actual collision.
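
The order of magnitude can be checked with the usual birthday-bound approximation, n ≈ sqrt(2 · 2^128 · ln(1/(1 − p))). This is only a back-of-the-envelope sketch: it assumes MD5 output behaves like a uniformly random 128-bit value, and the function name is hypothetical.

```python
import math

HASH_BITS = 128  # MD5 digest size in bits


def hashes_for_collision_probability(p: float, bits: int = HASH_BITS) -> float:
    """Approximate number of random hashes needed for collision probability p."""
    # Birthday-bound approximation: n ~ sqrt(2 * N * ln(1 / (1 - p))), N = 2^bits.
    return math.sqrt(2 * (2.0 ** bits) * math.log(1 / (1 - p)))


n_half = hashes_for_collision_probability(0.5)
seconds_per_century = 100 * 365.25 * 24 * 3600
rate = n_half / seconds_per_century

print(f"hashes needed for a 50% chance: {n_half:.3g}")
print(f"hashes per second over 100 years: {rate:.3g}")
```

Under these assumptions the result is about 2 × 10^19 hashes, or several billion per second sustained for a century, in line with the figure quoted above.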

Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.

The type of attack you'd need to intentionally collide with the hash of another URL is called a pre-image attack. There are no known practical pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined, well-funded attacker can't compute a URL that would hash to the hash of any existing URL in your database.

The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs are relatively long and contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.

answered Oct 04 '22 00:10

Kornel

After browsing Stack Overflow a little more, I found an earlier question, "Advantages and disadvantages of GUID / UUID database keys", which covers much of this ground.

answered Oct 04 '22 01:10

rdmpage
