MD5 value mismatch between SQL server and PostgreSQL

Question

In order to write some code to do consistency check of data stored in both SQL server and PostgreSQL, I plan to calculate the MD5 on table data for both the databases, and verify if they are equal. This works fine as long as data is plain text ( ANSI ) as below:

sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', 'a'));
0x0cc175b9c0f1b6a831c399e269772661


postgres=# select MD5('a');
0cc175b9c0f1b6a831c399e269772661

Now, If I try to use some Hangul(Korean) characters, MD5 match fails:

sql-server> SELECT master.dbo.fn_varbintohexstr(HashBytes('MD5', '무'));
0x7827b52f65d9f7777d37071cbbbf7f2d


postgres=# select MD5('무');
cb3e9be1a3a28b355eabae1fa1e291b3

As per my understanding, reason of mismatch is that unicode characters are stored as UCS-2 encoding (fixed 16 bits encoding) in SQL server and UTF-8 encoding in PostgreSQL. And as MD5 works on character bits, the character bits sequence would be different in both SQL server and PostgreSQL.

AS I have been dealing mostly with Hangul character-set, the workaround I used in PostgreSQL is to convert the encoding from UTF-8 to UHC ( Universal Hangul Character-set) before calculating hash as below:

postgres=# select MD5(CONVERT('무'::bytea,'UTF8','UHC'));
7827b52f65d9f7777d37071cbbbf7f2d

As you can see, the above hash value is same as that for SQL server.

All is fine as long as I am dealing with Hangul characters. But some tables contains mix of Hangul and Chinese characters, and the conversion fails in that case:

postgres=# select MD5(CONVERT('무么'::bytea,'UTF8','UHC'));
ERROR:  character 0xe4b988 of encoding "UTF8" has no equivalent in "UHC"
postgres=#

The error makes sense as there are no equivalent of Chinese characters in UHC character-set.

How can I make it work? Basically, I need to find way to convert UCS-2 to UTF-8 in SQL server, or to convert UTF-8 to UCS-2 in PostgreSQL before calculating MD5. I want to perform all these operations within database engine, and not load data in external application to calculate MD5, as some tables have huge data set.

SQL server version 2005 PostgreSQL version 9.1

pozs · Accepted Answer

Unfortunately, PostgreSQL does not support UTF-16 / UCS-2 either.

But, you can write a function, to convert utf8 text to ucs2 binary data (bytea):

create or replace function text_to_ucs2be(input_in_utf8 text)
  returns bytea
  immutable
  strict
  language sql
as $$
  select decode(string_agg(case
           when code_point < 65536
           then lpad(to_hex(code_point), 4, '0')
         end, ''), 'hex')
  from   regexp_split_to_table(input_in_utf8, '') chr,
         ascii(chr) code_point
$$;

create or replace function text_to_ucs2le(input_in_utf8 text)
  returns bytea
  immutable
  strict
  language sql
as $$
  select decode(string_agg(case
           when code_point < 65536
           then lpad(to_hex(code_point & 255), 2, '0')
             || lpad(to_hex(code_point >> 8), 2, '0')
         end, ''), 'hex')
  from   regexp_split_to_table(input_in_utf8, '') chr,
         ascii(chr) code_point
$$;

Note: these functions above will strip out any non-BMP code-points (therefore the name ucs2 in them).

The following statements should give you the same results:

-- on PostgreSQL
select md5(text_to_ucs2le('무'));

-- on SQL server
select master.dbo.fn_varbintohexstr(HashBytes('MD5', N'무'));

MD5 value mismatch between SQL server and PostgreSQL

Tags:

sql-server

postgresql

unicode

utf-8

utf-16

LKB

1 Answers

pozs

Recent Activity

Donate For Us

MD5 value mismatch between SQL server and PostgreSQL

Tags:

sql-server

postgresql

unicode

utf-8

utf-16

LKB

1 Answers

pozs

Related questions

Recent Activity

Donate For Us