I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Except for ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIED LETTER VERTICAL LINE) and U+02D0 (MODIFIED LETTER TRIANGULAR COLON).
word | ipa
--------+----------
Hallo | haˈloː
IPA | ˌiːpeːˈʔaː
I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.
SELECT * FROM pronunciation WHERE ipa LIKE '%ha?lo%'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha??lo%'; -- 1 row
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row
I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.
How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.
Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.
EDITED to incorporate fix to valid critisism
Use the HEX()
function to render your bytes to hexadecimal and then use RLIKE
on that, for example:
select * from mytable
where hex(ipa) rlike concat('(..)*', hex('needle'), '(..)*'); -- looking for 'needle' in haystack, but maintaining hex-pair alignment.
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
p.s. @Kieren's (valid) point addressed using rlike
to enforce char pairs
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With