A UTF-8 string contains mixed Japanese and English characters. How to identify which characters are Japanese and which are English?

Tags: java, c++, c

I have a UTF-8 encoded string which contains Japanese and Roman characters. I want to identify which characters are Japanese and which are Roman. How can I do that?

asked Nov 17 '11 by Hospeti

2 Answers

You are looking for the Unicode "Script" property. I recommend the ICU library.

From: http://icu-project.org/apiref/icu4c/uscript_8h.html

UScriptCode     uscript_getScript (UChar32 codepoint, UErrorCode *err)
Gets the script code associated with the given codepoint. 

The result will tell you the script of the character. Here are some of the possible constants returned:

  • USCRIPT_JAPANESE (Not sure what's in this category...)
  • USCRIPT_HIRAGANA (Japanese kana)
  • USCRIPT_KATAKANA (Japanese kana)
  • USCRIPT_HAN (Japanese kanji)
  • USCRIPT_LATIN
  • USCRIPT_COMMON (spaces and punctuation that are common to all scripts)

LibICU is available for Java, C, and C++. You will need to parse the Unicode code points out to use the function.
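For example, here is a minimal sketch using ICU4J, ICU's Java binding, where UScript.getScript() is the Java counterpart of the C function above (the class name and sample string are just illustrative):

    // Requires the ICU4J library (com.ibm.icu).
    import com.ibm.icu.lang.UScript;

    public class ScriptDetector {
        public static void main(String[] args) {
            String text = "東京Tokyoへようこそ";

            // Iterate over code points (not chars) so characters outside
            // the Basic Multilingual Plane are handled correctly.
            text.codePoints().forEach(cp -> {
                int script = UScript.getScript(cp);
                String label;
                if (script == UScript.HIRAGANA || script == UScript.KATAKANA) {
                    label = "Japanese (kana)";
                } else if (script == UScript.HAN) {
                    label = "Han (kanji in Japanese text)";
                } else if (script == UScript.LATIN) {
                    label = "Roman";
                } else if (script == UScript.COMMON) {
                    label = "common (digits, spaces, punctuation)";
                } else {
                    label = UScript.getName(script);
                }
                System.out.printf("U+%04X -> %s%n", cp, label);
            });
        }
    }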

Alternative: You can also use a Unicode regular expression, although very few engines support this syntax (Perl does...). This PCRE pattern will match runs of text that are definitely Japanese, but it will not catch everything:

/[\p{Katakana}\p{Hiragana}\p{Han}]+/

You have to be careful when you parse these things out because Japanese text will often include romaji or numerals inline. A glance at ja.wikipedia.org will quickly confirm this.
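If you are working in Java rather than Perl, java.util.regex (Java 7 and later) supports the same script properties via the \p{Is...} syntax; a minimal sketch (the sample sentence is just illustrative):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class JapaneseRuns {
        public static void main(String[] args) {
            // Java 7+ understands Unicode script properties such as \p{IsHiragana}.
            Pattern japanese = Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+");

            Matcher m = japanese.matcher("私はTokyoに住んでいます");
            while (m.find()) {
                System.out.println("Japanese run: " + m.group());
            }
        }
    }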

answered Sep 24 '22 by Dietrich Epp


You can determine the Unicode general category in Java with Character.getType(). For Japanese characters it will be OTHER_LETTER (Lo); for Latin characters it will be LOWERCASE_LETTER (Ll) or UPPERCASE_LETTER (Lu).
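A minimal sketch of that check (the sample string is just illustrative):

    public class CategoryCheck {
        public static void main(String[] args) {
            String text = "東京Tokyo";

            text.codePoints().forEach(cp -> {
                int type = Character.getType(cp);
                String category;
                if (type == Character.OTHER_LETTER) {
                    category = "Lo (Japanese in this string)";
                } else if (type == Character.LOWERCASE_LETTER) {
                    category = "Ll (Latin lowercase)";
                } else if (type == Character.UPPERCASE_LETTER) {
                    category = "Lu (Latin uppercase)";
                } else {
                    category = "other category (" + type + ")";
                }
                System.out.printf("U+%04X -> %s%n", cp, category);
            });
        }
    }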

answered Sep 25 '22 by mrembisz