Binary_Checksum Vs HashBytes function

I have a complex query that uses the BINARY_CHECKSUM function heavily. When I tested it with some sample data, two distinct records returned the same checksum value. Here is the test data I used:

SELECT BINARY_CHECKSUM(16   ,'EP30461105',1) AS BinaryCheckSumEx UNION ALL
SELECT BINARY_CHECKSUM(21   ,'EP30461155',1) AS BinaryCheckSumEx
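-- Both rows return the same BinaryCheckSumEx value even though the inputs differ.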

Now I am trying to use the HASHBYTES function with the 'MD5' algorithm, which should make collisions far less likely. What concerns me is that the current query uses the checksum value in the join conditions of my MERGE statements to look for new records. Since HASHBYTES returns a VARBINARY data type, how much of a performance overhead can I expect when I replace the join conditions with the HASHBYTES field?

SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))

Moreover, I need to hash multiple columns, which means adding a CONCAT call as well. Will this add further overhead? (The MERGE join I would be changing is sketched below.)
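For reference, the join I would be replacing looks roughly like this (table and column names are simplified placeholders):

MERGE dbo.TargetTable AS tgt                  -- placeholder names
USING dbo.SourceTable AS src
    ON tgt.CheckSumValue = src.CheckSumValue  -- currently an INT from BINARY_CHECKSUM
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Col1, Col2, Col3, Col4, CheckSumValue)
    VALUES (src.Col1, src.Col2, src.Col3, src.Col4, src.CheckSumValue);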

asked Apr 03 '17 by Prakazz

1 Answer

Here are the options:

  1. Using an index on the hash as VARBINARY

    • It’s not a good idea. Check these links:
    • https://social.msdn.microsoft.com/Forums/sqlserver/en-US/35b61bb0-1fa8-4a2f-a9fb-729a1874dcf8/clustered-index-on-a-varbinary-column?forum=sqldatabaseengine
    • SQL Server indexing - varchar(100) vs varbinary(100)? [convert data]
  2. Using BINARY_CHECKSUM or CHECKSUM

    • It works, but the problem is that there is a high chance of collisions, and a quick search shows that many people have run into them.

However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx
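To see how easily CHECKSUM collides, here is a small illustration. It assumes a case-insensitive collation (the default on many installations), under which strings that compare as equal hash to the same value:

SELECT CHECKSUM('McCavity') AS ChecksumA, -- same value as ChecksumB
       CHECKSUM('Mccavity') AS ChecksumB  -- under a case-insensitive collation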

  3. Casting HASHBYTES to BIGINT and having an index on that
    • It’s not a good idea

I would also be careful about converting the hashed value to BIGINT, given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5 -- produce more than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). Converting binary values larger than 8 bytes to BIGINT silently truncates them, so you lose accuracy and increase the occurrence of false positives. The following query shows this behavior:

SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF),      --  7 bytes = 72057594037927935
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF),    --  8 bytes = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF),  --  9 bytes = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF) -- 10 bytes = -1

Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary

  4. Casting HASHBYTES to VARCHAR and having an index on that
    • This is the best choice
    • You have two options:

a) If you're using SQL 2008 or above

SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', CONTENT),2)
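In that CONVERT, style 2 renders the binary value as a plain hex string without the 0x prefix, so MD5’s 16 bytes come out as exactly 32 characters.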

b) If you're using SQL 2005

SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)

PS: If you’re wondering which hash algorithm you should use, here are the output sizes:

MD5 = 16 bytes
SHA1 = 20 bytes
SHA2_256 = 32 bytes
SHA2_512 = 64 bytes


Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/
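You can confirm these sizes with DATALENGTH; note that the SHA2 algorithms require SQL Server 2012 or later:

SELECT DATALENGTH(HASHBYTES('MD5',      'test')) AS MD5_Len,      -- 16
       DATALENGTH(HASHBYTES('SHA1',     'test')) AS SHA1_Len,     -- 20
       DATALENGTH(HASHBYTES('SHA2_256', 'test')) AS SHA2_256_Len, -- 32
       DATALENGTH(HASHBYTES('SHA2_512', 'test')) AS SHA2_512_Len  -- 64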

For your second question: make the hash a PERSISTED computed column, so it is calculated once when the row is written rather than on every query.
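A minimal sketch of that, assuming a target table dbo.TargetTable with columns Col1 through Col4 (names are placeholders; adjust for your schema):

ALTER TABLE dbo.TargetTable
    ADD HashKey AS CONVERT(NVARCHAR(32),
                           HASHBYTES('MD5', CONCAT(Col1, Col2, Col3, Col4)),
                           2) PERSISTED;

-- Index the persisted column so MERGE joins can seek on it instead of scanning.
CREATE NONCLUSTERED INDEX IX_TargetTable_HashKey
    ON dbo.TargetTable (HashKey);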

answered Oct 14 '22 by DaNeSh