MySQL REGEXP query - accent insensitive search

Question

I'm looking to query a database of wine names, many of which contain accents (but not in a uniform way, so similar wines may be entered with or without accents).

The basic query looks like this:

SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugères[[:>:]]'

which will return entries with 'Faugères' in the title, but not 'Faugeres'

SELECT * FROM `table` WHERE `wine_name` REGEXP '[[:<:]]Faugeres[[:>:]]'

does the opposite.

I had thought something like:

SELECT * 
FROM `table` 
WHERE `wine_name` REGEXP '[[:<:]]Faug[eèêéë]r[eèêéë]s[[:>:]]'

might do the trick, but this only returns the results without the accents.

The field is collated as utf8_unicode_ci, which from what I've read is how it should be.

Any suggestions?!

Mark Manning · Accepted Answer

Because REGEXP and RLIKE are byte oriented, have you tried:

SELECT 'Faugères' REGEXP 'Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s';

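Each group says that exactly one of the listed alternatives has to appear at that position. This works where the bracket expression [eèêéë] does not because a byte-wise engine treats the bracket expression as a set of single bytes, while an accented letter stored as UTF-8 is two bytes long; each alternative in (e|è|ê|é|ë) is matched as a whole byte sequence. Notice that I haven't used the plus (+), because that means ONE OR MORE. Since you only want one, you should not use it.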
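To see the byte-level difference directly, something like the following can be run. It is only a sketch: it assumes a UTF-8 connection character set, precomposed accented letters (not a base letter plus a combining accent), and a MySQL version older than 8.0.4, where REGEXP is still byte-wise and the [[:<:]] / [[:>:]] markers are available. The table and column names are the ones from the question.

```sql
-- Assumption: the connection character set is UTF-8 (e.g. after SET NAMES utf8).
-- An unaccented 'e' is one byte, a precomposed 'è' is two.
SELECT HEX('e')         AS plain_bytes,     -- 65
       HEX('è')         AS accented_bytes,  -- C3A8
       LENGTH('è')      AS byte_length,     -- 2
       CHAR_LENGTH('è') AS char_length;     -- 1

-- The question's query rewritten with alternation instead of a bracket
-- expression, keeping the original word-boundary markers.
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '[[:<:]]Faug(e|è|ê|é|ë)r(e|è|ê|é|ë)s[[:>:]]';
```

The first statement shows why a bracket expression, which consumes a single byte per position, cannot match a two-byte letter in a byte-wise engine.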

Álvaro González · Answer

You're out of luck:

Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

The [[:<:]] and [[:>:]] regexp operators are markers for word boundaries. The closest you can get with the LIKE operator is something along these lines:

SELECT *
FROM `table`
WHERE wine_name = 'Faugères'
   OR wine_name LIKE 'Faugères %'
   OR wine_name LIKE '% Faugères'
   OR wine_name LIKE '% Faugères %'

As you can see it's not fully equivalent because I've restricted the concept of word boundary to spaces. Adding more clauses for other boundaries would be a mess.
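A more compact variant of the same idea, still treating only spaces as word boundaries, is to pad the column with a space on each side so that one pattern covers a match at the start, in the middle, at the end, or as the whole name. This is a sketch, not a tested drop-in: it assumes the column keeps the utf8_unicode_ci collation mentioned in the question, so that LIKE treats accented and unaccented letters as equal.

```sql
-- Sketch: pad the value with spaces so a single pattern matches the word
-- wherever it appears in the name.
-- LIKE compares using the column's collation, so with utf8_unicode_ci
-- 'Faugeres' is expected to match 'Faugères' as well.
SELECT *
FROM `table`
WHERE CONCAT(' ', wine_name, ' ') LIKE '% Faugeres %';
```

Like any pattern with a leading wildcard, this cannot use an ordinary index on wine_name, so it scans the table.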

You could also use full-text searches (although they aren't the same thing), but InnoDB tables only support full-text indexes from MySQL 5.6 onward; before that they are limited to MyISAM.

You're certainly out of luck :)

Addendum: this has changed as of MySQL 8.0:

MySQL implements regular expression support using International Components for Unicode (ICU), which provides full Unicode support and is multibyte safe. (Prior to MySQL 8.0.4, MySQL used Henry Spencer's implementation of regular expressions, which operates in byte-wise fashion and is not multibyte safe.)
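On MySQL 8.0.4 or later, the bracket-expression attempt from the question can be expected to work once the word-boundary markers are changed: ICU does not support [[:<:]] and [[:>:]], and uses \b instead. A minimal sketch, assuming the default SQL mode (backslash is the string escape character, so it has to be doubled) and precomposed accented letters in the stored data:

```sql
-- Assumes MySQL 8.0.4+ (ICU-based REGEXP).
-- \b is the ICU word-boundary marker; it is written \\b because the
-- backslash is also the escape character inside string literals.
-- Each bracket expression now matches one whole character, not one byte.
SELECT *
FROM `table`
WHERE `wine_name` REGEXP '\\bFaug[eèêéë]r[eèêéë]s\\b';
```

Names stored with a base letter followed by a combining accent would not be matched by these bracket expressions.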

Alexander Taver · Answer

utf8_general_ci sees no difference between accented and unaccented characters when sorting. Maybe this is true for searches as well. Also, change REGEXP to LIKE; REGEXP makes a binary comparison.

Tags:

regex

mysql

diacritics

accent-insensitive

freestate
