Differentiating CJK languages (Chinese, Japanese, Korean) in Android

I want to be able to recognize Chinese, Japanese, and Korean written characters, both as a general group and as subdivided languages. These are the reasons:

Recognize CJK as a general group: I am making a vertical-script Mongolian TextView. To do that I need to rotate the line of text 90 degrees because the glyphs are stored horizontally in the font. However, CJK characters need to be rotated back again so that they keep their correct orientation and are simply stacked on top of each other down the line.
Differentiate CJK into specific languages: I'm also making a Mongolian dictionary, and when users enter a CJK character to look up I would like to recognize the language automatically. Because Chinese characters are also used in Japanese and Korean, I'm guessing that I won't be able to fully accomplish this, but I want to do it to the maximum extent that the encoding allows.

On the linguistic side, the subcategories that I am aware of are

Chinese traditional characters
Chinese simplified characters
Japanese Kanji (Chinese characters)
Japanese Hiragana (native alphabet)
Japanese Katakana (alphabet for writing foreign words)
Korean Hangul (phonetic)
Korean Hanja (Chinese Characters)

For the sake of completeness, Chinese characters are also used in Vietnamese (so CJK is also called CJKV). For my current purposes I don't need to worry about it, but it could be a future consideration. I am also ignoring romanized scripts like Chinese pinyin or Japanese romaji. They will be handled the same as English and Mongolian in the TextView (i.e., rotated 90 degrees with the rest of the line). Bopomofo, used in Taiwan, could also be a future consideration, but I will ignore it for now. See also here and here for language examples.

I've seen a number of related questions that usually deal with one specific language in Java or Android but no overarching question with a canonical answer. Other questions are more general for Unicode but don't tell how to do it in Java and Android. Here are some of the specific ones.

How to check whether given text is english or chinese in android?
How can I detect japanese text in a Java string?
Check if string contains CJK (chinese) characters
Use regular expression to match ANY Chinese character in utf-8 encoding
Testing for Japanese/Chinese Characters in a string
Different representation of unicode code points in Japanese and chinese
Check if a character is Traditional Chinese in Big-5 (Java)?
Unicode characters necessary for Japanese, Korean, and Chinese
Does same chinese characters shared by cjk share same unicode value?
What's the complete range for Chinese characters in Unicode?

So my question is: how far can I differentiate the CJK languages using Unicode codepoints, and how can I test for them in Android? I've seen some newer tests in Java and Android, and while these are useful to know, I also need to support older Android devices.

649

asked Feb 01 '17 14:02

Suragch

1 Answer

Unicode

CJK (and CJKV) in Unicode refers to Han Ideographs, that is, the Chinese characters (汉字) used in Chinese, Japanese, Korean, and Vietnamese. In Unicode's script naming, it does not cover phonetic scripts such as Japanese Katakana and Hiragana or Korean Hangul. The Han Ideographs are said to be unified, meaning that there is only one Unicode codepoint for each ideograph, no matter which language it is used in.

This means that Unicode (and by extension Android/Java) provides no way to determine the language from a single ideograph alone. Even Chinese Simplified and Traditional characters are not readily differentiated by the encoding. This is the same idea as not being able to know whether the character "a" belongs to English, French, or Spanish. More context is needed to determine that.

However, you can use the Unicode encoding to determine Japanese Hiragana/Katakana and Korean Hangul. And the presence of such characters would be a good indication that nearby Han Ideographs belong to the same language.

Android

You can find the codepoint at some index with

int codepoint = Character.codePointAt(myString, offset);

And if you wanted to iterate through the codepoints in a string:

final int length = myString.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = Character.codePointAt(myString, offset);

    // use codepoint here

    offset += Character.charCount(codepoint);
}
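
To see why the loop advances by Character.charCount rather than by one, here is a small sketch. The class name CodePointWalk and the sample string are made up for illustration. The second character, 𠀀 (U+20000), sits in CJK Extension B outside the Basic Multilingual Plane, so it occupies two Java chars (a surrogate pair).

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {

    // Collects the code points of a string using the same loop as above.
    static List<Integer> codePoints(String text) {
        List<Integer> result = new ArrayList<>();
        final int length = text.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = Character.codePointAt(text, offset);
            result.add(codepoint);
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // \u4E2D is one char; \uD840\uDC00 is the surrogate pair for U+20000.
        String sample = "\u4E2D\uD840\uDC00";
        System.out.println(sample.length());            // 3 chars
        System.out.println(codePoints(sample).size());  // 2 code points
        for (int cp : codePoints(sample)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Iterating with charAt instead would hand the two surrogate halves to the block lookup separately, and neither half is in a CJK block.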

Once you have the codepoint you can look up which code block it is in with

Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);

And then you can use the codeblock to test for the ideograph or language.
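
As a quick way to get a feel for the block lookup, the sketch below prints the block for a few sample codepoints. The class name BlockLookup and the "NO_BLOCK" placeholder are assumptions for illustration; the only API used is Character.UnicodeBlock.of, which returns null when the codepoint is not in any block the runtime knows about.

```java
final class BlockLookup {

    // Returns the name of the block containing the code point, or the
    // placeholder "NO_BLOCK" (an illustrative name) when of() returns null.
    static String blockName(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return block == null ? "NO_BLOCK" : block.toString();
    }

    public static void main(String[] args) {
        // U+4E2D (Han), U+3042 (Hiragana A), U+30A2 (Katakana A),
        // U+D55C (a Hangul syllable), U+0061 (Latin a)
        int[] samples = {0x4E2D, 0x3042, 0x30A2, 0xD55C, 0x0061};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + blockName(cp));
        }
    }
}
```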

CJK

Scanning the Unicode code blocks, I think these cover all the CJK ideograms. If I missed any, then feel free to edit my answer or leave a comment.

private boolean isCJK(int codepoint) {
    Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
    return (
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS.equals(block) ||
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A.equals(block) ||
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B.equals(block) ||
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C.equals(block) || // api 19
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D.equals(block) || // api 19
            Character.UnicodeBlock.CJK_COMPATIBILITY.equals(block) ||
            Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS.equals(block) ||
            Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS.equals(block) ||
            Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT.equals(block) ||
            Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT.equals(block) ||
            Character.UnicodeBlock.CJK_STROKES.equals(block) ||                        // api 19
            Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION.equals(block) ||
            Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS.equals(block) ||
            Character.UnicodeBlock.ENCLOSED_IDEOGRAPHIC_SUPPLEMENT.equals(block) ||    // api 19
            Character.UnicodeBlock.KANGXI_RADICALS.equals(block) ||
            Character.UnicodeBlock.IDEOGRAPHIC_DESCRIPTION_CHARACTERS.equals(block));
}

The ones with comments (scroll right) are only available from API level 19. However, these could probably be safely removed if you need to support earlier versions since they are only rarely used. Also, Unicode defines a CJK Extension E, but at the time of this writing it is not supported in Android/Java. If you definitely need to include everything, then you can compare the codepoints to the Unicode block ranges directly. This site is a convenient place to browse them. You can also see them at the Unicode site.
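
Comparing against the block ranges directly could look something like the sketch below. The class and method names are hypothetical; the ranges are the published Unicode block boundaries for Extensions C, D, and E. Because it only does integer comparisons, it should behave the same on any API level.

```java
final class CjkRanges {

    static boolean inRange(int codepoint, int first, int last) {
        return codepoint >= first && codepoint <= last;
    }

    // CJK Unified Ideographs Extension C: U+2A700..U+2B73F
    static boolean isExtensionC(int codepoint) {
        return inRange(codepoint, 0x2A700, 0x2B73F);
    }

    // CJK Unified Ideographs Extension D: U+2B740..U+2B81F
    static boolean isExtensionD(int codepoint) {
        return inRange(codepoint, 0x2B740, 0x2B81F);
    }

    // CJK Unified Ideographs Extension E: U+2B820..U+2CEAF
    static boolean isExtensionE(int codepoint) {
        return inRange(codepoint, 0x2B820, 0x2CEAF);
    }

    // True for any of the extensions that older runtimes may not expose
    // as Character.UnicodeBlock constants.
    static boolean isLateCjkExtension(int codepoint) {
        return isExtensionC(codepoint) || isExtensionD(codepoint) || isExtensionE(codepoint);
    }
}
```

One way to use it would be to drop the API 19 lines from isCJK and return isCJK(codepoint) || CjkRanges.isLateCjkExtension(codepoint) instead.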

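If you don't need to support below API 19, then isIdeographic makes the test very easy. Note that it does not return exactly the same matches as the method above: it follows the Unicode Ideographic property, so it skips the non-ideograph members of those blocks, such as the punctuation mark 。 (U+3002).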

private boolean isCJK(int codepoint) {
    return Character.isIdeographic(codepoint);
}

Or this one for API 24+:

private boolean isCJK(int codepoint) {
    return (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN);
}
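
The three tests do not agree on every codepoint, so it may help to compare them side by side before choosing one. The sketch below is only an illustration: CjkTestComparison is a made-up class, and inCjkBlock is a trimmed stand-in for the full block list (three blocks only). It needs Java 7+ or Android API 24+ because of UnicodeScript.

```java
final class CjkTestComparison {

    // Trimmed block test: only three of the blocks from the full isCJK list.
    static boolean inCjkBlock(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS.equals(block)
                || Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION.equals(block)
                || Character.UnicodeBlock.KANGXI_RADICALS.equals(block);
    }

    static boolean isHanScript(int codepoint) {
        return Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        // U+4E2D: an ordinary ideograph
        // U+3002: ideographic full stop (punctuation)
        // U+2F00: Kangxi radical one
        int[] samples = {0x4E2D, 0x3002, 0x2F00};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " block=" + inCjkBlock(cp)
                    + " ideographic=" + Character.isIdeographic(cp)
                    + " han=" + isHanScript(cp));
        }
    }
}
```

On a current JDK I would expect all three to accept U+4E2D, only the block test to accept U+3002, and the block and script tests (but not isIdeographic) to accept U+2F00. The results depend on the Unicode data the runtime ships, so it is worth checking on the devices you target. For the TextView rotation case the block test is probably the closest fit, since CJK punctuation should be rotated along with the ideographs.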

Japanese

For testing Hiragana or Katakana this should work fine:

private boolean isJapaneseKana(int codepoint) {
    Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
    return (
            Character.UnicodeBlock.HIRAGANA.equals(block) ||
            Character.UnicodeBlock.KATAKANA.equals(block) ||
            Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS.equals(block));
}

Or this if you are supporting API 24+:

(This needs more testing. See comment below.)

private boolean isJapaneseKana(int codepoint) {
    Character.UnicodeScript script = Character.UnicodeScript.of(codepoint);
    return (script == Character.UnicodeScript.HIRAGANA ||
            script == Character.UnicodeScript.KATAKANA);
}
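
As a starting point for that testing, here is a sketch comparing the block-based and script-based kana checks on two edge cases. The class and method names are invented for this example. The prolonged sound mark ー (U+30FC) sits in the Katakana block but is assigned the Common script because it is shared between Hiragana and Katakana, while halfwidth katakana such as ｱ (U+FF71) has the Katakana script but lives in the Halfwidth and Fullwidth Forms block.

```java
final class KanaEdgeCases {

    static boolean isKanaByBlock(int codepoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
        return Character.UnicodeBlock.HIRAGANA.equals(block)
                || Character.UnicodeBlock.KATAKANA.equals(block)
                || Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS.equals(block);
    }

    static boolean isKanaByScript(int codepoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codepoint);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    public static void main(String[] args) {
        // U+3042: Hiragana A, U+30FC: prolonged sound mark,
        // U+FF71: halfwidth Katakana A
        int[] samples = {0x3042, 0x30FC, 0xFF71};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " byBlock=" + isKanaByBlock(cp)
                    + " byScript=" + isKanaByScript(cp));
        }
    }
}
```

If both edge cases matter, one option is to accept a codepoint when either check passes (on API 24+), or to add HALFWIDTH_AND_FULLWIDTH_FORMS handling to the block version by hand.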

Korean

To test for Hangul on lower APIs you can use

private boolean isKoreanHangul(int codepoint) {
    Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
    return (Character.UnicodeBlock.HANGUL_JAMO.equals(block) ||
            Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_A.equals(block) || // api 19
            Character.UnicodeBlock.HANGUL_JAMO_EXTENDED_B.equals(block) || // api 19
            Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO.equals(block) ||
            Character.UnicodeBlock.HANGUL_SYLLABLES.equals(block));
}

Remove the lines marked API 19 if necessary.

Or for API 24+:

private boolean isKoreanHangul(int codepoint) {
    return (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HANGUL);
}
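
Putting the pieces together, here is one possible sketch of the context heuristic mentioned in the Unicode section: kana in a string suggests Japanese, Hangul suggests Korean, and Han ideographs on their own stay ambiguous. Everything named here (CjkLanguageGuess, the Guess enum, the tie-breaking rule) is an assumption made for illustration, not an established API. It only uses block constants that do not need API 19.

```java
final class CjkLanguageGuess {

    enum Guess { JAPANESE, KOREAN, HAN_ONLY, NOT_CJK }

    static boolean isKana(Character.UnicodeBlock block) {
        return Character.UnicodeBlock.HIRAGANA.equals(block)
                || Character.UnicodeBlock.KATAKANA.equals(block)
                || Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS.equals(block);
    }

    static boolean isHangul(Character.UnicodeBlock block) {
        return Character.UnicodeBlock.HANGUL_JAMO.equals(block)
                || Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO.equals(block)
                || Character.UnicodeBlock.HANGUL_SYLLABLES.equals(block);
    }

    static boolean isHan(Character.UnicodeBlock block) {
        return Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS.equals(block)
                || Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A.equals(block)
                || Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B.equals(block)
                || Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS.equals(block);
    }

    // Counts kana, Hangul, and Han code points and guesses from the counts.
    static Guess guess(String text) {
        int kana = 0, hangul = 0, han = 0;
        final int length = text.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = Character.codePointAt(text, offset);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(codepoint);
            if (isKana(block)) {
                kana++;
            } else if (isHangul(block)) {
                hangul++;
            } else if (isHan(block)) {
                han++;
            }
            offset += Character.charCount(codepoint);
        }
        if (kana == 0 && hangul == 0) {
            // Ideographs alone could be Chinese, Japanese Kanji, or Korean Hanja.
            return han > 0 ? Guess.HAN_ONLY : Guess.NOT_CJK;
        }
        // Arbitrary tie-break: equal counts of kana and Hangul go to Japanese.
        return kana >= hangul ? Guess.JAPANESE : Guess.KOREAN;
    }
}
```

For a dictionary lookup of a single ideograph this will return HAN_ONLY, which is the honest answer: at that point the language has to come from somewhere else, such as the user's locale or a setting.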

Further study

Unicode East Asian scripts
Unicode CJK FAQs
Unicode Korean FAQs
Some source code that shows how Character.UnicodeScript works
CJK Unified Ideographs

139

answered Sep 18 '22 00:09

Suragch

Tags: java, android, unicode, cjk