Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting Chinese only, Japanese only and Korean only records in mysql/php

Tags:

php

mysql

utf-8

cjk

Is there a way to select in mysql words that are only Chinese, only Japanese and only Korean?

In english it can be done by:

SELECT * FROM table WHERE field REGEXP '[a-zA-Z0-9]'

or even a "dirty" solution like:

SELECT * FROM table WHERE field > "0" AND field <"ZZZZZZZZ"

Is there a similar solution for eastern languages / CJK characters?

I understand that Chinese and Japanese share characters so there is a chance that Japanese words using these characters will be mistaken for Chinese words. I guess those words would not be filtered.

The words are stored in a utf-8 string field.

If this cannot be done in mysql, can it be done in PHP?

Thanks! :)

edit 1: The data does not include in which language the string is therefore I cannot filter by another field. edit 2: using a translator api like bing's (google is closing their translator api) is an interesting idea but i was hoping for a faster regex-style solution.

like image 678
3seconds Avatar asked Oct 11 '22 14:10

3seconds


1 Answers

Searching for a UTF-8 range of characters is not directly supported in MySQL regexp. See the mySQL reference for regexp where it states:

Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets.

Fortunately in PHP you can build such a regexp e.g. with

/[\x{1234}-\x{5678}]*/u

(note the u at the end of the regexp). You therefore need to find the appropriate ranges for your different languages. Using the unicode code charts will enable you to pick the appropriate script for the language (although not directly the language itself).

like image 162
borrible Avatar answered Oct 13 '22 10:10

borrible