I need to be able to detect Japanese characters in a Java string.
Currently I'm getting the UnicodeBlock of each character and checking whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that's going to cover everything.
Any suggestions?
Press "Change System Locale" and select "Japanese (Japan)" from the drop-down menu. Make sure the box that says "Beta: Use Unicode UTF-8 for worldwide language support" is UNCHECKED; enabling it causes a heap of problems, with programs and files appearing strange or displaying incorrectly.
I use the following Java method. It might not completely address your requirement, though.
<!-- language: lang-java -->
    /**
     * Returns whether a character is a Chinese-Japanese-Korean (CJK) character.
     *
     * @param c the character to be tested
     * @return true if CJK, false otherwise
     */
    private boolean isCharCJK(final char c) {
        // Look the block up once instead of once per comparison.
        final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        // Note: EXTENSION_B lies outside the Basic Multilingual Plane, so it can
        // never match a single char; it only takes effect if you switch to
        // Character.UnicodeBlock.of(int codePoint).
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
                || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
    }
Furthermore, these should work for Hiragana and Katakana characters:
    private boolean isHiragana(final char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HIRAGANA;
    }

    private boolean isKatakana(final char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA;
    }
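The block checks above work on a single char and on one block at a time, so they miss supplementary characters and the half-width katakana that sit in HALFWIDTH_AND_FULLWIDTH_FORMS. One possible alternative is to test the Unicode script of each code point instead. This is only a minimal sketch, assuming Java 8 or later (for String.codePoints()); the names JapaneseScriptCheck, isJapaneseCodePoint and containsJapanese are made up for illustration. Keep in mind that Han characters are shared with Chinese, so this finds characters used in Japanese writing, not the Japanese language as such.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not from the original answers.
final class JapaneseScriptCheck {

    /** True if the code point is Hiragana, Katakana or Han (Latin is deliberately excluded). */
    static boolean isJapaneseCodePoint(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA
                || script == UnicodeScript.HAN;
    }

    /** True if at least one code point in the text passes isJapaneseCodePoint. */
    static boolean containsJapanese(String text) {
        // codePoints() joins surrogate pairs, so supplementary ideographs are seen whole.
        return text.codePoints().anyMatch(JapaneseScriptCheck::isJapaneseCodePoint);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(containsJapanese("Hello"));                    // false
        System.out.println(containsJapanese("Hello \u4E16\u754C"));       // true ("sekai" in kanji)
        System.out.println(containsJapanese("\uFF76\uFF80\uFF76\uFF85")); // true (half-width "katakana")
    }
}
```

Excluding Latin here is a design choice: Japanese text often contains Latin letters, but counting them would make every English string look Japanese.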
According to regular-expressions.info, Japanese isn't written in a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."
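In which case, this regex should do the trick (in Java's regex flavor, Unicode script names take the Is prefix, which requires Java 7 or later):

    yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")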
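Note that matches() tests the whole string, so a pattern like this answers "does the string consist only of these scripts?" rather than "does it contain any Japanese character?". Spaces, digits and most punctuation belong to the Common script and make a whole-string match fail. A rough sketch of both variants follows, assuming Java 7 or later and the Is-prefixed script names that java.util.regex.Pattern documents; the names JapaneseRegexCheck, containsJapanese and isOnlyJapaneseScripts are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answers.
final class JapaneseRegexCheck {

    // One character from any of the three scripts; Latin is left out on purpose.
    private static final Pattern JAPANESE_CHAR =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    /** True if the text contains at least one Hiragana, Katakana or Han character. */
    static boolean containsJapanese(String text) {
        return JAPANESE_CHAR.matcher(text).find();
    }

    /** True if every character is Hiragana, Katakana, Han or Latin (also true for ""). */
    static boolean isOnlyJapaneseScripts(String text) {
        return text.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+");
    }
}
```

For example, containsJapanese("Hello \u4E16\u754C") returns true, while isOnlyJapaneseScripts on the same string returns false because of the space.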