A UTF-8 string contains mixed Japanese and English characters. How to identify which characters are Japanese and which are English?

Tags: java, c++, c

I have a UTF-8 encoded string which contains Japanese and Roman characters. I want to identify which characters are Japanese and which are Roman. How can I do that?

asked Nov 17 '11 by Hospeti

2 Answers

You are looking for the Unicode "Script" property. I recommend the ICU library.

From: http://icu-project.org/apiref/icu4c/uscript_8h.html

UScriptCode     uscript_getScript (UChar32 codepoint, UErrorCode *err)
Gets the script code associated with the given codepoint. 

The result will tell you the script of the character. Here are some of the possible constants returned:

  • USCRIPT_JAPANESE (Not sure what's in this category...)
  • USCRIPT_HIRAGANA (Japanese kana)
  • USCRIPT_KATAKANA (Japanese kana)
  • USCRIPT_HAN (Japanese kanji)
  • USCRIPT_LATIN
  • USCRIPT_COMMON (spaces and punctuation that are common to all scripts)

LibICU is available for Java, C, and C++. You will need to parse the Unicode code points out to use the function.
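For example, here is a minimal sketch using ICU4J, ICU's Java binding, where UScript.getScript() is the Java counterpart of the C function above (the class name and sample string are just illustrative):

    // Requires the ICU4J library (com.ibm.icu).
    import com.ibm.icu.lang.UScript;

    public class ScriptDetector {
        public static void main(String[] args) {
            String text = "東京Tokyoへようこそ";

            // Iterate over code points (not chars) so characters outside
            // the Basic Multilingual Plane are handled correctly.
            text.codePoints().forEach(cp -> {
                int script = UScript.getScript(cp);
                String label;
                if (script == UScript.HIRAGANA || script == UScript.KATAKANA) {
                    label = "Japanese (kana)";
                } else if (script == UScript.HAN) {
                    label = "Han (kanji in Japanese text)";
                } else if (script == UScript.LATIN) {
                    label = "Roman";
                } else if (script == UScript.COMMON) {
                    label = "common (digits, spaces, punctuation)";
                } else {
                    label = UScript.getName(script);
                }
                System.out.printf("U+%04X -> %s%n", cp, label);
            });
        }
    }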

Alternative: You can also use a Unicode regular expression, although very few engines support this syntax (Perl does...). This PCRE pattern will match runs of text that are definitely Japanese, but it will not catch everything:

/[\p{Katakana}\p{Hiragana}\p{Han}]+/

You have to be careful when you parse these things out because Japanese text will often include romaji or numerals inline. A glance at ja.wikipedia.org will quickly confirm this.
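If you are working in Java rather than Perl, java.util.regex (Java 7 and later) supports the same script properties via the \p{Is...} syntax; a minimal sketch (the sample sentence is just illustrative):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class JapaneseRuns {
        public static void main(String[] args) {
            // Java 7+ understands Unicode script properties such as \p{IsHiragana}.
            Pattern japanese = Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+");

            Matcher m = japanese.matcher("私はTokyoに住んでいます");
            while (m.find()) {
                System.out.println("Japanese run: " + m.group());
            }
        }
    }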

answered Sep 24 '22 by Dietrich Epp


You can determine the Unicode general category in Java with Character.getType(). For Japanese characters it will be OTHER_LETTER (Lo); for Latin characters it will be LOWERCASE_LETTER (Ll) or UPPERCASE_LETTER (Lu).
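A minimal sketch of that check (the sample string is just illustrative):

    public class CategoryCheck {
        public static void main(String[] args) {
            String text = "東京Tokyo";

            text.codePoints().forEach(cp -> {
                int type = Character.getType(cp);
                String category;
                if (type == Character.OTHER_LETTER) {
                    category = "Lo (Japanese in this string)";
                } else if (type == Character.LOWERCASE_LETTER) {
                    category = "Ll (Latin lowercase)";
                } else if (type == Character.UPPERCASE_LETTER) {
                    category = "Lu (Latin uppercase)";
                } else {
                    category = "other category (" + type + ")";
                }
                System.out.printf("U+%04X -> %s%n", cp, category);
            });
        }
    }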

answered Sep 25 '22 by mrembisz