
Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (with combining grave, acute, circumflex and diaereses) which I need to update to their Unicode-normalized form, e.g. the precomposed code point U+00E1 (\xe1) for á instead of the decomposed sequence U+0061 + U+0301 (a followed by a combining acute).

I found a solution on an old Postgres mailing list thread from 2009, using PL/Python:

create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
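Note that PLPYTHONU means Python 2, where the argument arrives as bytes and needs decoding. On installations using plpython3u (the only PL/Python available on recent Postgres versions), the argument is already a Python 3 str, so a sketch of the equivalent would be:

```sql
-- Assumes the plpython3u extension is installed (CREATE EXTENSION plpython3u)
create or replace function unicode_normalize(str text) returns text as $$
  import unicodedata
  -- under Python 3 the value is already text, so no decode step is needed
  return unicodedata.normalize('NFC', str)
$$ language plpython3u;
```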

This works as expected, but it made me wonder whether there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
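For what it's worth, later PostgreSQL releases did add a built-in for exactly this: version 13 introduced a normalize() function and an IS NORMALIZED predicate, so on 13+ no PL/Python is needed:

```sql
-- PostgreSQL 13 and later only
SELECT normalize(E'\u0061\u0301', NFC) = E'\u00E1';  -- true: composes a + U+0301 into á
SELECT E'\u0061\u0301' IS NFC NORMALIZED;            -- false
```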

EDIT: As Craig has pointed out, this was one of the things I tried:

SELECT convert_to(E'\u00E1', 'iso-8859-1');

returns \xe1, whereas

SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');

fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"

asked Jul 21 '14 by John Powell



1 Answer

I think this is a Pg bug.

In my opinion, PostgreSQL should be normalizing UTF-8 into precomposed form before performing encoding conversions. The results of the conversions shown are wrong.

I'll raise it on pgsql-bugs ... done.

http://www.postgresql.org/message-id/[email protected]

You should be able to follow the thread there.

Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
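One way to follow that advice, sketched in Python with the standard library's unicodedata module (the helper name is just illustrative):

```python
import unicodedata

def normalize_input(value: str) -> str:
    """Normalize incoming text to NFC before it ever reaches the database."""
    return unicodedata.normalize("NFC", value)

# A decomposed form of the Welsh 'ŷ': y (U+0079) + combining circumflex (U+0302)
raw = "Ll\u0079\u0302n"
clean = normalize_input(raw)
print(clean == "Ll\u0177n")  # True: NFC composes y + U+0302 into ŷ (U+0177)
```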

BTW, this can be simplified down to:

regress=> SELECT 'á' = 'á';
 ?column? 
----------
 f
(1 row)

which looks crazy, but is permitted: the first 'á' is precomposed, the second is not. (To reproduce this result you'll have to copy and paste, and it will only work if your browser or terminal doesn't normalize the UTF-8.)
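The same comparison sketched in Python makes the two code-point sequences explicit:

```python
import unicodedata

precomposed = "\u00e1"        # á as one code point (U+00E1)
decomposed = "\u0061\u0301"   # a (U+0061) + combining acute (U+0301)

print(precomposed == decomposed)  # False, matching Postgres's 'f' above
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```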

If you're using Firefox you might not see the above as intended; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode properly:

[Image: decomposed vs precomposed Unicode 'á' showing false for equality]

answered Sep 21 '22 by Craig Ringer