I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.
The C# app that saves these records uses the following method to do the hashing:
public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}
In order to call this, I first convert the string to byte[] using the UTF-8 encoding:
// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72
Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:
select hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315
This is because SQL computes the MD5 of the UTF-16 representation of the string.
I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
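Just to illustrate the difference in the raw bytes being hashed (a quick check of my own, using select so the varbinary shows up as hex):
select cast(N'abc' as varbinary(max))
-- result: 0x610062006300 (UTF-16LE, which is what HASHBYTES receives from an nvarchar)
select cast('abc' as varbinary(max))
-- result: 0x616263 (the same bytes UTF-8 produces for ASCII text)
select hashbytes('md5', 0x616263)
-- result: 0x900150983CD24FB0D6963F7D28E17F72, i.e. the app's hash
-- (the app's "X" format simply drops the leading zero of the 0x01 byte)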
I cannot change the way hashing is done in the application.
Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?
I looked up similar questions and tried using collations, but had no luck so far.
You need to create a UDF that converts the NVARCHAR data to its UTF-8 byte representation. Say it is called dbo.NCharToUTF8Binary; then you can do:
hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 0))
The second parameter selects standard UTF-8 when 0 and modified UTF-8 (which encodes U+0000 as two bytes) when 1; Encoding.UTF8 produces standard UTF-8, so pass 0.
Here is a UDF which will do that:
create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
-- Note: This is not the fastest possible routine.
-- If you want a fast routine, use SQLCLR
set @modified = isnull(@modified, 0)
-- First shred into a table.
declare @chars table (
ix int identity primary key,
codepoint int,
utf8 varbinary(6)
)
declare @ix int
set @ix = 0
while @ix < datalength(@txt)/2 -- datalength/2 rather than len() so trailing spaces are counted
begin
set @ix = @ix + 1
insert @chars(codepoint)
select unicode(substring(@txt, @ix, 1))
end
-- Now look for surrogate pairs.
-- If we find a pair (lead followed by trail) we will pair them
-- High surrogate is \uD800 to \uDBFF
-- Low surrogate is \uDC00 to \uDFFF
-- Look for a high surrogate followed by a low surrogate and combine them into one codepoint:
-- codepoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
update c1 set codepoint = ((c1.codepoint - 0xD800) * 0x0400) + (c2.codepoint - 0xDC00) + 0x10000
from @chars c1 inner join @chars c2 on c1.ix = c2.ix - 1
where c1.codepoint >= 0xD800 and c1.codepoint <= 0xDBFF
and c2.codepoint >= 0xDC00 and c2.codepoint <= 0xDFFF
-- Get rid of the trailing half of the pair where found
delete c2
from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
where c1.codepoint >= 0x10000
-- Now we utf-8 encode each codepoint.
-- Lone surrogate halves will still be here
-- so they will be encoded as if they were not surrogate pairs.
update c
set utf8 =
case
-- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding)
when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
-- Two-byte encodings
when codepoint <= 0x07ff
then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
+ substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
-- Three-byte encodings
when codepoint <= 0x0ffff
then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
+ substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
+ substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
-- Four-byte encodings
when codepoint <= 0x1FFFFF
then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
+ substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
+ substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
+ substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
end
from @chars c
-- Finally concatenate them all and return.
declare @ret varbinary(max)
set @ret = cast('' as varbinary(max))
select @ret = @ret + utf8 from @chars c order by ix
return @ret
end
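As a sanity check (my own suggestion, not part of the function itself), the well-known MD5 test vector for 'abc' lets you verify the whole chain, and convert with binary style 2 gives you a hex string on the SQL side if you need one:
select dbo.NCharToUTF8Binary(N'abc', 0)
-- 0x616263
select hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 0))
-- 0x900150983CD24FB0D6963F7D28E17F72
select convert(varchar(32), hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 0)), 2)
-- 900150983CD24FB0D6963F7D28E17F72 (style 2 = hex digits without the 0x prefix)
Keep in mind that this hex string is zero-padded per byte, while the application's ToString("X") drops leading zeros, so the stored strings will not always be character-for-character identical even when the underlying hashes match.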