
mysql regex utf-8 characters

Tags: regex, mysql, utf-8

I am trying to fetch rows from a MySQL database with REGEXP so that a search matches words with or without accented UTF-8 characters.

Let me explain with an example:

If the user enters a word like sirena, the query should return rows that contain words like sirena, siréna, šíreňá, and so on. It should also work the other way round: when the user enters siréná, it should return the same results.

I am trying to do this with REGEXP; my query looks like this:

SELECT * FROM `content` WHERE `text` REGEXP '[sšŠ][iíÍ][rŕŔřŘ][eéÉěĚ][nňŇ][AaáÁäÄ0]'

It works only when the database contains the word sirena, but not when it contains the word siréňa.

Is this a problem with UTF-8 in MySQL? (The column collation is utf8_general_ci.)

Thank you!

Maarty asked Nov 04 '13 18:11



2 Answers

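MySQL's regular expression library does not support UTF-8. REGEXP works byte by byte, so a multi-byte character inside a bracket expression such as [eé] is treated as a set of separate bytes rather than as one character.

See Bug #30241 Regular expression problems, which has been open since 2007. They will have to change the regular expression library they use before that can be fixed, and I haven't found any announcement of when or if they will do this. (Note for readers on newer versions: MySQL 8.0.4 and later implement REGEXP with the ICU library, which is multibyte safe, so this limitation applies to older releases.)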
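To illustrate why the bracket expression from the question fails, here is a small Python sketch that emulates a byte-oriented engine by running the match on raw UTF-8 bytes. It is only an approximation: Python's re module is not the library MySQL uses, and the helper name matches_bytewise and the shortened pattern are made up for this example.

```python
import re


def matches_bytewise(pattern: str, text: str) -> bool:
    """Match on raw UTF-8 bytes, roughly how a byte-oriented engine sees the data."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


# Shortened version of the pattern from the question (assumption: lower case only).
# Once encoded, [eé] becomes a class of three single bytes: "e", 0xC3 and 0xA9.
PATTERN = "[sš][ií]r[eé][nň][aá]"

# Plain ASCII word: every class consumes exactly one byte, so this matches.
print(matches_bytewise(PATTERN, "sirena"))

# "é" is two bytes (C3 A9). The class [eé] consumes only C3, and the next
# class [nň] then sees A9, which it does not contain, so the match fails.
print(matches_bytewise(PATTERN, "siréňa"))
```

The second call fails for the same reason the original query does: each bracket expression consumes exactly one byte, while é and ň occupy two bytes each in UTF-8.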

The only workaround I've seen is to search for specific HEX strings:

mysql> SELECT * FROM `content` WHERE HEX(`text`) REGEXP 'C3A9C588';
+----------+
| text     |
+----------+
| siréňa   |
+----------+
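The hex string in that query is just the UTF-8 encoding of the characters being searched for. A minimal Python sketch for producing such strings follows; the function name hex_pattern is hypothetical, and it assumes the column stores UTF-8.

```python
def hex_pattern(fragment: str) -> str:
    """Return the upper-case hex form of the fragment's UTF-8 bytes,
    in the same style as the output of MySQL's HEX()."""
    return fragment.encode("utf-8").hex().upper()


# "é" encodes to C3 A9 and "ň" to C5 88.
print(hex_pattern("éň"))  # C3A9C588
```

One caveat with this workaround: a hex pattern can also match at an odd digit offset, that is, across the boundary between two bytes, so results may need a second check.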

Re your comment:

No, I don't know of any solution with MySQL.

You might have to switch to PostgreSQL, because that RDBMS supports \u codes for Unicode characters in its regular expression syntax.

Bill Karwin answered Sep 19 '22 14:09


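Try listing each variant as an alternative instead of relying only on a bracket expression, so that a multi-byte character is matched as a whole byte sequence. Something like ... REGEXP '(a|b|[ab])':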

SELECT * FROM `content` WHERE `text` REGEXP '(s|š|Š|[sšŠ])(i|í|Í|[iíÍ])(r|ŕ|Ŕ|ř|Ř|[rŕŔřŘ])(e|é|É|ě|Ě|[eéÉěĚ])(n|ň|Ň|[nňŇ])(A|a|á|Á|ä|Ä|0|[AaáÁäÄ0])'

It works for me!
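Writing such a pattern by hand for every search term gets tedious, so it could be generated from the user's input. The following Python sketch is one possible way to do that, not part of the original answer: the names VARIANTS, strip_accents and build_pattern are hypothetical, the variant lists are copied from the query in the question, and the redundant bracket expressions are left out.

```python
import unicodedata

# Variant lists taken from the query in the question (assumption: extend as needed).
VARIANTS = {
    "s": "sšŠ",
    "i": "iíÍ",
    "r": "rŕŔřŘ",
    "e": "eéÉěĚ",
    "n": "nňŇ",
    "a": "AaáÁäÄ0",
}


def strip_accents(word: str) -> str:
    """Reduce 'siréná' to 'sirena' by dropping combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def build_pattern(word: str) -> str:
    """Build an alternation-only pattern such as (s|š|Š)(i|í|Í)... for the word."""
    parts = []
    for ch in strip_accents(word):
        if not ch.isalnum():
            # Keep the sketch simple: refuse anything that could be a regex metacharacter.
            raise ValueError(f"unsupported character: {ch!r}")
        variants = VARIANTS.get(ch, ch)
        parts.append("(" + "|".join(variants) + ")")
    return "".join(parts)


# Both spellings produce the same pattern, so the search works in both directions.
print(build_pattern("sirena"))
print(build_pattern("siréná") == build_pattern("sirena"))
```

The resulting string is best passed to the query as a bound parameter (for example WHERE `text` REGEXP ?) rather than concatenated into the SQL.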

felinux answered Sep 20 '22 14:09