Pros and cons of using an MD5 hash as the primary key vs. using an int identity as the primary key in SQL Server

I have an application that takes a file, fragments it into multiple segments, and saves the result into a SQL Server database. There are many duplicate files (possibly with different file paths), so I first go through all the files, compute the MD5 hash of each one, and mark duplicates using the [Duplicated] column.

Then, every day, I run this application and save the results into the [Result] table. The database schema is as follows:

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash] BINARY(16) NOT NULL,
        [Duplicated] BIT NOT NULL DEFAULT 0,
        [LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
    );

    CREATE TABLE [dbo].[Result]
    (
        [Build] NVARCHAR(30) NOT NULL,
        [FileMd5Hash] BINARY(16) NOT NULL,
        [SegmentId] INT NOT NULL,
        [SegmentContent] TEXT NOT NULL,
        PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
    );

I need to join these two tables on FileMd5Hash.

Since the [Result] table has a very large number of rows, I'd like to add an int identity column and join the two tables on it instead, as below:

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash] BINARY(16) NOT NULL,
        [Id] INT NOT NULL IDENTITY,  -- new column
        [Duplicated] BIT NOT NULL DEFAULT 0,
        [LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
    );

    CREATE TABLE [dbo].[Result]
    (
        [Build] NVARCHAR(30) NOT NULL,
        [Id] INT NOT NULL,  -- new column, replaces [FileMd5Hash]
        [SegmentId] INT NOT NULL,
        [SegmentContent] TEXT NOT NULL,
        -- [FileMd5Hash] no longer exists in this table, so the key uses [Id]
        PRIMARY KEY ([Id], [Build], [SegmentId])
    );

So what are the pros and cons of these two approaches?

asked May 20 '14 04:05

ricky

3 Answers

An int key is simpler to implement and easier to use and understand. It's also smaller (4 bytes vs. 16 bytes), so index pages will fit roughly twice as many entries, which means less I/O and better performance. The table rows will be smaller too (though not by much), so more rows fit per page, again reducing I/O.
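As a rough back-of-the-envelope check of the "about double" figure, here is a small Python sketch. SQL Server pages are 8 KB with a 96-byte header; the per-entry overhead of 11 bytes is purely an assumption for illustration, since the real figure depends on the index type, the other key columns, and the row locator.

```python
PAGE_BYTES = 8192          # SQL Server page size
PAGE_HEADER_BYTES = 96     # fixed page header
ASSUMED_ROW_OVERHEAD = 11  # hypothetical per-entry overhead, not a measured value


def entries_per_page(key_bytes, overhead=ASSUMED_ROW_OVERHEAD):
    """Estimate how many index entries of a given key size fit on one page."""
    usable = PAGE_BYTES - PAGE_HEADER_BYTES
    return usable // (key_bytes + overhead)


int_entries = entries_per_page(4)    # INT key
md5_entries = entries_per_page(16)   # BINARY(16) key
print(int_entries, md5_entries, round(int_entries / md5_entries, 2))
```

Under that assumed overhead the ratio comes out a bit under 2, and it shrinks further when the key has other wide columns (such as [Build] in the question's composite key).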

Hashes can always produce collisions. Collisions are exceedingly rare, but as the birthday problem shows, they become more and more likely as the record count increases. The approximate number of items needed for a 50% chance of a collision at various hash lengths is as follows:

Hash length (bits)   Item count for 50% chance of collision
                32   77000
                64   5.1 billion
               128   22 billion billion
               256   400 billion billion billion billion

There's also the issue of having to pass around non-ASCII bytes, which are harder to debug, send over the wire, and so on.
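To illustrate the point about raw bytes, here is a minimal Python sketch of the usual workaround: hex-encoding the 16-byte digest whenever it has to be logged or transmitted. The sample content is made up.

```python
import hashlib

# Raw MD5 digest: 16 arbitrary bytes, typically not printable
digest = hashlib.md5(b"example file content").digest()

# Hex form: 32 ASCII characters, safe to log or send as text
hex_text = digest.hex()

# The encoding is lossless, so the raw key can be recovered when needed
assert bytes.fromhex(hex_text) == digest

print(len(digest), len(hex_text), hex_text)
```

The hex form doubles the size to 32 characters, so every hop between application and database needs an encode or decode step that an int key avoids entirely.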

Use int sequential primary keys for your tables. Everybody else does.

answered Oct 16 '22 08:10

Bohemian

Use ints for primary keys, not hashes. Everyone warns about hash collisions, but in practice they are not a big problem; it's easy to check for collisions and re-hash. Sequential IDs can collide as well if you merge databases.
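A minimal sketch of what "checking for collisions" could look like, in Python: when a digest is already known, compare the actual bytes before treating the item as a duplicate. The `HashIndex` class is hypothetical, and the re-hash step is left out.

```python
import hashlib


class HashIndex:
    """Tracks contents by MD5 digest and tells duplicates from true collisions."""

    def __init__(self):
        self._by_hash = {}  # digest -> list of distinct contents with that digest

    def add(self, content: bytes) -> str:
        digest = hashlib.md5(content).digest()
        bucket = self._by_hash.setdefault(digest, [])
        if content in bucket:
            return "duplicate"   # same digest and same bytes
        status = "collision" if bucket else "new"
        bucket.append(content)   # same digest, different bytes would land here
        return status


index = HashIndex()
print(index.add(b"file A"), index.add(b"file A"), index.add(b"file B"))
```

In a real system the comparison would go against stored content rather than an in-memory list, but the idea is the same: the hash narrows the search and the byte comparison decides.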

The big problem with hashes as keys is that you cannot change your data. If you try, your hash will change and all foreign keys become invalid. You have to create a “no, this is the real hash” column in your database and your old hash just becomes a big nonsequential integer.
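The effect is easy to see in a small Python sketch. The `files` dictionary stands in for a table keyed by a surrogate int id, and the sample contents are made up.

```python
import hashlib


def content_key(content: bytes) -> str:
    """Key derived from the content itself (hex MD5)."""
    return hashlib.md5(content).hexdigest()


files = {1: b"segment data, version 1"}  # surrogate id -> content
surrogate_id = 1

hash_before = content_key(files[surrogate_id])
files[surrogate_id] = b"segment data, version 2"  # the data changes
hash_after = content_key(files[surrogate_id])

# The surrogate id still identifies the row; the content-derived key does not
print(surrogate_id in files, hash_before == hash_after)
```

Any row that referenced `hash_before` as a foreign key would now point at nothing, whereas references to the int id are unaffected by the edit.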

I bet your business analyst will say “we implement WORM so our records will never change”. They will be proven wrong.

answered Oct 16 '22 10:10

Dour High Arch

Here is a very nice article explaining the pros and cons of both:

https://web.archive.org/web/20140618031501/http://databases.aspfaq.com/database/what-should-i-choose-for-my-primary-key.html

Using an MD5 hash is much like using a GUID for your primary key. Hash collisions are rare but do happen, so you may want to handle them.

Personally, I would go with INT IDENTITY, but the right choice may depend on your implementation.

answered Oct 16 '22 09:10

Virat Singh

Tags:

sql

database

sql-server

hash