I have an application to deal with a file and fragment it to multiple segments, then save the result into sql server database. There are many duplicated file (maybe with different file path), so first I go through all these files and compute the Md5 hash for each file, and mark duplicated file by using the [Duplicated] column.
Then everyday, I'll run this application and save the results into the [Result] table. The db schema is as below:
CREATE TABLE [dbo].[FilePath]
(
[FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
[FileMd5Hash] binay(16) NOT NULL,
[Duplicated] BIT NOT NULL DEFAULT 0,
[LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
)
CREATE TABLE [dbo].[Result]
(
[Build] NVARCHAR(30) NOT NULL,
[FileMd5Hash] binay(16) NOT NULL ,
[SegmentId] INT NOT NULL,
[SegmentContent] text NOT NULL
PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
)
And I have a requirement to join these 2 table on FileMd5Hash.
Since the number of rows of [Result] is very large, I'd like to add an int Identity column to join these to tables as below:
CREATE TABLE [dbo].[FilePath]
(
[FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
[FileMd5Hash] binay(16) NOT NULL,
**[Id] INT NOT NULL IDENTITY,**
[Duplicated] BIT NOT NULL DEFAULT 0,
[LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
)
CREATE TABLE [dbo].[Result]
(
[Build] NVARCHAR(30) NOT NULL,
**[Id] INT NOT NULL,**
[SegmentId] INT NOT NULL,
[SegmentContent] text NOT NULL
PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
)
So What's the Pros and cons of these 2 ways?
Using MD5 hash will be like using a GUID for your primary key. Hash collisions are rare but do happen, you may want to handle it. I will personally go with INT IDENTITY but it may differ based on your implementation.
Its main purpose is to verify that a file has been unaltered. Instead of confirming that two sets of data are identical by comparing the raw data, MD5 does this by producing a checksum on both sets and then comparing the checksums to verify that they're the same.
An int key is simpler to implement and easier to use and understand. It's also smaller (4 bytes vs 16 bytes), so indexes will fit about double the number of entries per IO page, meaning better performance. The table rows too will be smaller (OK, not much smaller), so again you'll fit more rows per page = less IO.
Hash can always produce collisions. Although exceedingly rare, nevertheless, as the birthday problem shows, collisions become more and more likely as record count increases. The number of items needed for a 50% chance of a collision with various bit-length hashes is as follows:
Hash length (bits) Item count for 50% chance of collision
32 77000
64 5.1 billion
128 22 billion billion
256 400 billion billion billion billion
There's also the issue of having to pass around non-ascii bytes - harder to debug, send over wire, etc.
Use int
sequential primary keys for your tables. Everybody else does.
Use ints for primary keys, not hashes. Everyone warns about hash collisions, but in practice they are not a big problem; it's easy to check for collisions and re-hash. Sequential IDs can collide as well if you merge databases.
The big problem with hashes as keys is that you cannot change your data. If you try, your hash will change and all foreign keys become invalid. You have to create a “no, this is the real hash” column in your database and your old hash just becomes a big nonsequential integer.
I bet your business analyst will say “we implement WORM so our records will never change”. They will be proven wrong.
Here is a very nice article explaining Pros and Cons of using both:
https://web.archive.org/web/20140618031501/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html
Using MD5 hash will be like using a GUID for your primary key. Hash collisions are rare but do happen, you may want to handle it.
I will personally go with INT IDENTITY but it may differ based on your implementation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With