I have a UTF-8 encoded string that contains both Japanese and Roman characters. How can I identify which characters are Japanese and which are Roman?
You are looking for the Unicode "Script" property. I recommend the ICU library.
From: http://icu-project.org/apiref/icu4c/uscript_8h.html
UScriptCode uscript_getScript (UChar32 codepoint, UErrorCode *err)
Gets the script code associated with the given codepoint.
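The result tells you the script of the character. Constants relevant here include USCRIPT_LATIN, USCRIPT_HAN, USCRIPT_HIRAGANA, USCRIPT_KATAKANA, and USCRIPT_COMMON (shared punctuation and digits).
ICU is available for C and C++ (ICU4C) and for Java (ICU4J). The function takes a single code point, so you first need to decode the UTF-8 bytes into Unicode code points.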
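As a rough illustration of the decode-then-classify idea, here is a minimal Java sketch. Note that it uses the JDK's built-in Character.UnicodeScript (Java 7+) rather than ICU's uscript_getScript, so no extra library is needed; the class and method names (ScriptClassifier, isJapaneseScript, classify) are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ScriptClassifier {
    /** True for the scripts used to write Japanese: Han (kanji), Hiragana, Katakana. */
    public static boolean isJapaneseScript(Character.UnicodeScript script) {
        return script == Character.UnicodeScript.HAN
            || script == Character.UnicodeScript.HIRAGANA
            || script == Character.UnicodeScript.KATAKANA;
    }

    /** Labels each code point of the text with the name of its script. */
    public static List<String> classify(String text) {
        List<String> labels = new ArrayList<>();
        // Iterate by code point, not by char, so supplementary characters stay intact.
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            labels.add(new String(Character.toChars(cp)) + "=" + script);
        });
        return labels;
    }

    public static void main(String[] args) {
        // "Tokyo " followed by the two kanji U+6771 U+4EAC, round-tripped through UTF-8 bytes.
        byte[] utf8 = "Tokyo \u6771\u4eac".getBytes(StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);
        for (String label : classify(text)) {
            System.out.println(label);
        }
    }
}
```

Spaces, digits, and most punctuation come back as COMMON, so they belong to neither group and have to be handled separately. Han alone also cannot distinguish Japanese from Chinese text.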
Alternative: You can also use a regular expression with Unicode script properties, although not every engine supports this syntax (Perl does). The following pattern matches runs of text that are definitely Japanese, but it will not catch everything.
/[\p{Katakana}\p{Hiragana}\p{Han}]+/
You have to be careful when you parse these things out because Japanese text will often include romaji or numerals inline. A glance at ja.wikipedia.org will quickly confirm this.
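For comparison, the same kind of pattern can be sketched in Java, whose regex engine spells script properties as \p{IsHan}, \p{IsHiragana}, and \p{IsKatakana}. The class and method names below (JapaneseRuns, findRuns) are hypothetical, and the sample in the usage note is only there to show how inline romaji and numerals split the matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JapaneseRuns {
    // One or more consecutive characters from the Han, Hiragana, or Katakana scripts.
    private static final Pattern JAPANESE_RUN =
        Pattern.compile("[\\p{IsHan}\\p{IsHiragana}\\p{IsKatakana}]+");

    /** Returns every maximal run of Japanese-script characters found in the text. */
    public static List<String> findRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = JAPANESE_RUN.matcher(text);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }
}
```

Calling findRuns on "Java" + U+3067 U+66F8 U+304F + "2024" + U+5E74 yields two runs, because the Latin letters and the digits are not part of the character class and break the match.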
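You can also determine the Unicode general category, in Java with Character.getType(). Kana and kanji are category Lo (Character.OTHER_LETTER), while Latin letters are Ll (LOWERCASE_LETTER) or Lu (UPPERCASE_LETTER). Note that Lo also covers letters of many other scripts, such as Hebrew and Arabic, so the category separates cased Latin letters from Japanese but does not identify Japanese on its own.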
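A small sketch of the category check, assuming a hypothetical helper categoryOf that maps the constants returned by Character.getType() to their two-letter abbreviations:

```java
public class CategoryDemo {
    /** Maps a code point's general category to its two-letter abbreviation. */
    public static String categoryOf(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER: return "Lu";
            case Character.LOWERCASE_LETTER: return "Ll";
            case Character.OTHER_LETTER:     return "Lo";
            case Character.MODIFIER_LETTER:  return "Lm";
            default:                         return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(categoryOf('A'));     // Latin capital letter
        System.out.println(categoryOf('a'));     // Latin small letter
        System.out.println(categoryOf(0x3042));  // Hiragana letter A
        System.out.println(categoryOf(0x6771));  // a kanji
        System.out.println(categoryOf(0x05D0));  // Hebrew alef, same category as kana and kanji
    }
}
```

The Hebrew letter falls into the same category as the kana and the kanji, which is why a script-based check is usually the more reliable way to pick out Japanese text.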