Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?
In english it can be done by:
SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'
or even a "dirty" solution like:
SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"
Is there a similar solution for eastern languages / CJK characters?
I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.
The words are stored in a utf-8 string field.
If this cannot be done in mysql, can it be done in PHP?
Thanks! :)
edit 1: The data does not include in which language the string is therefore I cannot filter by another field. edit 2: using a translator api like bing's (google is closing their translator api) is an interesting idea but i was hoping for a faster regex-style solution.
Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the mySQL reference for regexp where it states:
Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.
Fortunately in PHP you can build such a regexp e.g. with
/[\x{1234}-\x{5678}]*/u
(note the u
at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With