
Does PostgreSQL support "accent insensitive" collations?

Tags:

sql

pattern-matching

indexing

postgresql

localization

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like

SELECT * FROM users WHERE name LIKE 'João'

to find a row with a Joao name.

I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.

997

asked Oct 09 '22 01:10

Daniel Serodio

1 Answer

Update for Postgres 12 or later

Postgres 12 adds nondeterministic ICU collations, enabling case-insensitive and accent-insensitive grouping and ordering. The manual:

ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

If so, this works for you:

CREATE COLLATION ignore_accent (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);

CREATE INDEX users_name_ignore_accent_idx ON users(name COLLATE ignore_accent);

SELECT * FROM users WHERE name = 'João' COLLATE ignore_accent;

fiddle

Read the manual for details. This blog post by Laurenz Albe may help to understand.
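As a minimal sketch of how the collation can be attached to a column rather than to each query (the table users_demo and its rows are hypothetical, and the ignore_accent collation created above is assumed to exist):

```sql
-- Hypothetical demo table; assumes the ignore_accent collation from above.
CREATE TABLE users_demo (
   name text COLLATE ignore_accent   -- column-level default collation
);

INSERT INTO users_demo (name) VALUES ('Joao'), ('João'), ('joao');

-- No COLLATE clause needed here: the column's collation drives the comparison.
SELECT name
FROM   users_demo
WHERE  name = 'João';
```

With this locale string, the query should find both 'Joao' and 'João' but not 'joao', since the kc-true part keeps the comparison case-sensitive.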

But ICU collations also have drawbacks. The manual:

[...] they also have some drawbacks. Foremost, their use leads to a performance penalty. Note, in particular, that B-tree cannot use deduplication with indexes that use a nondeterministic collation. Also, certain operations are not possible with nondeterministic collations, such as pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted.
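To illustrate the pattern-matching restriction quoted above, here is a small sketch, assuming the ignore_accent collation and the users table from the earlier example:

```sql
-- Pattern matching combined with a nondeterministic collation.
SELECT *
FROM   users
WHERE  name LIKE 'Jo%' COLLATE ignore_accent;
-- On Postgres 12 this is expected to fail with an error like:
-- ERROR:  nondeterministic collations are not supported for LIKE
```

Behavior may differ in later releases, so check the manual for your version before relying on either outcome.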

My "legacy" solution may still be superior:

For all versions

Use the unaccent module for that - which is completely different from what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

If you get an error like:

ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory

Install the contrib package on your database server as instructed in this related answer:

Error when creating unaccent extension on PostgreSQL

Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).

SELECT *
FROM   users
WHERE  unaccent(name) = unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.
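A quick way to see the restriction in practice is to try indexing the plain function. This is only a sketch; the index name users_unaccent_plain_idx is made up for illustration:

```sql
-- Attempt an expression index on the STABLE function shipped with the module.
CREATE INDEX users_unaccent_plain_idx ON users (unaccent(name));
-- Expected to be rejected with:
-- ERROR:  functions in index expression must be marked IMMUTABLE
```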

unaccent() only STABLE, not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

It depends on the behavior of a dictionary.
There is no hard-wired connection to this dictionary.
It therefore also depends on the current search_path, which can change easily.
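The volatility and the dictionary location can be inspected in the system catalogs. A minimal sketch, assuming the extension is already installed in the current database (the regnamespace cast requires Postgres 9.5 or later):

```sql
-- Volatility of every unaccent() variant:
-- provolatile is 'i' (IMMUTABLE), 's' (STABLE) or 'v' (VOLATILE).
SELECT p.oid::regprocedure AS function_signature, p.provolatile
FROM   pg_catalog.pg_proc p
WHERE  p.proname = 'unaccent';

-- Schema of the unaccent dictionary (relevant for the search_path dependency).
SELECT d.dictnamespace::regnamespace AS dict_schema, d.dictname
FROM   pg_catalog.pg_ts_dict d
WHERE  d.dictname = 'unaccent';
```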

Some tutorials on the web instruct you to simply alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.

Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).

There is an ongoing debate about whether to make the two-parameter variant, which declares the dictionary explicitly, IMMUTABLE. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by MusicBrainz, provided on GitHub. I haven't tested it myself. I think I have come up with a better idea:

Best for now

This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired, schema-qualified function and dictionary.

Since nesting a non-immutable function would disable function inlining, base it on a copy of the C function that is (falsely) declared IMMUTABLE as well. Its only purpose is to be used in the SQL wrapper function; it is not meant to be used on its own.

The sophistication is needed because there is no way to hard-wire the dictionary in the declaration of the C function. (That would require hacking the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text
  LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;

Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.

Here, public is the schema where you installed the extension (public is the default).

The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
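Once both functions exist, a short sanity check along these lines can confirm the wrapper behaves as intended (a sketch, assuming the functions were created in schema public as shown above):

```sql
-- The wrapper should strip the accent, so the comparison yields true.
SELECT public.f_unaccent('João') = 'Joao' AS matches;

-- Both helper functions should report provolatile = 'i' (IMMUTABLE).
SELECT p.oid::regprocedure AS function_signature, p.provolatile
FROM   pg_catalog.pg_proc p
WHERE  p.proname IN ('f_unaccent', 'immutable_unaccent');
```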

Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.

And that was already twice as fast as the first version which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.

If you lack the necessary privileges to create C functions, you are back to the second-best implementation: an IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:

CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1)  -- schema-qualify function and dictionary
$func$;

Finally, the expression index to make queries fast:

CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));

Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
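Rebuilding can be done in place. A minimal sketch, using the index name from the example above:

```sql
-- Rebuild the expression index after the function or dictionary has changed.
REINDEX INDEX users_unaccent_name_idx;

-- Postgres 12 or later: rebuild without blocking concurrent writes.
REINDEX INDEX CONCURRENTLY users_unaccent_name_idx;
```

Run only one of the two statements; the CONCURRENTLY form takes longer but avoids locking out writers.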

Adapt queries to match the index (so the query planner will use it):

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

We don't need the function in the expression to the right of the operator. There we can also supply unaccented strings like 'Joao' directly.
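To check whether the planner actually picks up the expression index, inspecting the plan is usually enough. A sketch, assuming the users table and the index users_unaccent_name_idx from above:

```sql
-- The right-hand side is already unaccented, so no function call is needed there.
EXPLAIN
SELECT *
FROM   users
WHERE  f_unaccent(name) = 'Joao';
```

Look for an index scan or bitmap index scan on users_unaccent_name_idx in the output. On very small tables the planner may still prefer a sequential scan, which does not indicate a problem with the index.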

The faster function does not translate to much faster queries using the expression index. Index look-ups operate on pre-computed values and are very fast either way. But index maintenance and queries not using the index benefit. And access methods like bitmap index scans may have to recheck values in the heap (the main relation), which involves executing the underlying function. See:

"Recheck Cond:" line in query plans with a bitmap index scan

Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify the function and dictionary names, as demonstrated, when they are used in any index. See:

'text search dictionary “unaccent” does not exist' entries in postgres log, supposedly during automatic analyze

Ligatures

In Postgres 9.5 or older, ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
E A e a S
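One way to do the manual expansion on such older versions is to replace the ligatures before calling unaccent(). This is only a sketch covering the five characters from the example; extend the list as needed:

```sql
-- Expand ligatures first, then let unaccent() handle the remaining diacritics.
SELECT unaccent(
          replace(replace(replace(replace(replace(
             'Œ Æ œ æ ß'
           , 'Œ', 'OE'), 'Æ', 'AE'), 'œ', 'oe'), 'æ', 'ae'), 'ß', 'ss')
       ) AS expanded;
```

The same nested replace() calls could be folded into the f_unaccent() wrapper if indexed expressions need to agree with the expanded form.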

You will love this update to unaccent in Postgres 9.6:

Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)

Bold emphasis mine. Now we get:

SELECT unaccent('Œ Æ œ æ ß');

unaccent
----------
OE AE oe ae ss

Pattern matching

For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GiST expression index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Can be used for queries like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

GIN and GiST indexes are more expensive (to maintain) than plain B-tree:

Difference between GiST and GIN index

There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
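For left-anchored patterns, one simpler option (an assumption about which solution is meant here, not something spelled out above) is a B-tree expression index with the text_pattern_ops operator class. The index name is made up for illustration:

```sql
-- B-tree index usable for left-anchored LIKE patterns regardless of locale.
CREATE INDEX users_unaccent_name_pattern_idx
ON     users (f_unaccent(name) text_pattern_ops);

-- The pattern has no leading wildcard, so the B-tree index can be used.
SELECT *
FROM   users
WHERE  f_unaccent(name) LIKE f_unaccent('João') || '%';
```

This avoids the maintenance cost of a trigram index when only prefix searches are needed.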

pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
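A sketch of how these two operators might be combined with f_unaccent(), assuming the users table from above and that you are allowed to install pg_trgm:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- % filters by trigram similarity (threshold set by pg_trgm.similarity_threshold),
-- <-> returns the distance, i.e. 1 minus the similarity.
SELECT name
     , f_unaccent(name) <-> f_unaccent('João') AS distance
FROM   users
WHERE  f_unaccent(name) % f_unaccent('João')
ORDER  BY distance
LIMIT  10;
```

The GIN index shown earlier can support the % filter. Index-assisted ordering by <-> needs a GiST trigram index instead.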

Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:

PostgreSQL accent + case insensitive search
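Combining both ideas, an accent-insensitive and case-insensitive substring search could look like the following sketch, assuming the trigram GIN index users_unaccent_name_trgm_idx from above exists:

```sql
-- ILIKE ignores case; f_unaccent() on both sides ignores accents.
SELECT *
FROM   users
WHERE  f_unaccent(name) ILIKE '%' || f_unaccent('joão') || '%';
```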

184

answered Nov 08 '22 07:11

Erwin Brandstetter


