How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

Question

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Except for ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIED LETTER VERTICAL LINE) and U+02D0 (MODIFIED LETTER TRIANGULAR COLON).

 word   | ipa
--------+----------
 Hallo  | haˈloː
 IPA    | ˌiːpeːˈʔaː

I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.

SELECT * FROM pronunciation WHERE ipa LIKE '%ha?lo%';  -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha??lo%'; -- 1 row

SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo';  -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row

I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.

How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.

Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.

Bohemian · Accepted Answer

EDITED to incorporate fix to valid critisism

Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:

select * from mytable
where hex(ipa) rlike concat('(..)*', hex('needle'), '(..)*'); -- looking for 'needle' in haystack, but maintaining hex-pair alignment.

The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.

This works for "normal" columns too, you just don't need it.

p.s. @Kieren's (valid) point addressed using rlike to enforce char pairs

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

Tags:

sql

mysql

character-encoding

unicode

utf-8

Tim

1 Answers

Bohemian

Recent Activity

Donate For Us

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

Tags:

sql

mysql

character-encoding

unicode

utf-8

Tim

1 Answers

Bohemian

Related questions

Recent Activity

Donate For Us