
Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (with combining grave, acute, circumflex and diaereses) which I need to update to their Unicode-normalized form, e.g. the precomposed code point U+00E1 (\xe1) for á instead of the decomposed sequence U+0061 + U+0301 (a followed by a combining acute).

I found a solution on an old Postgres mailing list thread from 2009, using PL/Python:

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
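Note that PLPYTHONU means Python 2, where the argument arrives as bytes and needs decoding. On installations using plpython3u (the only PL/Python available on recent Postgres versions), the argument is already a Python 3 str, so a sketch of the equivalent would be:

```sql
-- Assumes the plpython3u extension is installed (CREATE EXTENSION plpython3u)
create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  -- under Python 3 the value is already text, so no decode step is needed
  return unicodedata.normalize('NFC', str)
$$ language plpython3u;
```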

This works as expected, but it made me wonder whether there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
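For what it's worth, later PostgreSQL releases did add a built-in for exactly this: version 13 introduced a normalize() function and an IS NORMALIZED predicate, so on 13+ no PL/Python is needed:

```sql
-- PostgreSQL 13 and later only
SELECT normalize(E'\u0061\u0301', NFC) = E'\u00E1';  -- true: composes a + U+0301 into á
SELECT E'\u0061\u0301' IS NFC NORMALIZED;            -- false
```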

EDIT: As Craig has pointed out, this was one of the things I tried:

SELECT convert_to(E'\u00E1', 'iso-8859-1');

returns \xe1, whereas

SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');

fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"

asked Jul 21 '14 by John Powell



1 Answer

I think this is a Pg bug.

In my opinion, PostgreSQL should be normalizing UTF-8 into precomposed form before performing encoding conversions. The results of the conversions shown are wrong.

I'll raise it on pgsql-bugs ... done.

http://www.postgresql.org/message-id/[email protected]

You should be able to follow the thread there.

Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
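One way to follow that advice, sketched in Python with the standard library's unicodedata module (the helper name is just illustrative):

```python
import unicodedata

def normalize_input(value: str) -> str:
    """Normalize incoming text to NFC before it ever reaches the database."""
    return unicodedata.normalize("NFC", value)

# A decomposed form of the Welsh 'ŷ': y (U+0079) + combining circumflex (U+0302)
raw = "Ll\u0079\u0302n"
clean = normalize_input(raw)
print(clean == "Ll\u0177n")  # True: NFC composes y + U+0302 into ŷ (U+0177)
```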

BTW, this can be simplified down to:

regress=> SELECT 'á' = 'á';
 ?column? 
----------
 f
(1 row)

which looks crazy, but is permitted: the first 'á' is precomposed, the second is not. (To reproduce this result you'll have to copy and paste, and it will only work if your browser or terminal doesn't normalize the UTF-8.)
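The same comparison sketched in Python makes the two code-point sequences explicit:

```python
import unicodedata

precomposed = "\u00e1"        # á as one code point (U+00E1)
decomposed = "\u0061\u0301"   # a (U+0061) + combining acute (U+0301)

print(precomposed == decomposed)  # False, matching Postgres's 'f' above
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```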

If you're using Firefox you might not see the above as intended; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode properly:

[Image: decomposed vs precomposed Unicode 'á' showing false for equality]

answered Sep 21 '22 by Craig Ringer