Java: how to check if character belongs to a specific unicode block?

I need to identify which natural language my input belongs to. The goal is to distinguish between Arabic and English words in mixed input, where the input is Unicode text extracted from XML text nodes. I have noticed the class Character.UnicodeBlock. Is it relevant to my problem? How can I get it to work?

Edit: The Character.UnicodeBlock approach was useful for Arabic, but apparently doesn't work for English (or other European languages), because the BASIC_LATIN Unicode block covers symbols and non-printable characters as well as letters. So now I am using the matches() method of the String object with the regular expression "[A-Za-z]+" instead. I can live with it, but perhaps someone can suggest a nicer/faster way.

IddoG asked Jan 01 '09 08:01


2 Answers

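Yes, you can simply use Character.UnicodeBlock.of(char). It returns the Unicode block that the character belongs to, so you can compare the result against a constant such as Character.UnicodeBlock.ARABIC.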
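
For illustration, here is a minimal sketch of how that check might look for the Arabic/English case in the question. The class and method names (UnicodeBlockDemo, isArabicLetter, isBasicLatinLetter) are made up for this example, and combining the block test with Character.isLetter is my own assumption about how to skip the symbols and control characters that BASIC_LATIN also contains.

```java
// Hypothetical helper class; the names are illustrative, not from any library.
class UnicodeBlockDemo {

    // True if c is a letter in the main ARABIC block (U+0600..U+06FF).
    // Other Arabic blocks (e.g. the presentation forms) are not covered here.
    static boolean isArabicLetter(char c) {
        return Character.isLetter(c)
                && Character.UnicodeBlock.of(c) == Character.UnicodeBlock.ARABIC;
    }

    // True if c is a letter in BASIC_LATIN, i.e. A-Z or a-z.
    // The isLetter test filters out digits, punctuation and control characters.
    static boolean isBasicLatinLetter(char c) {
        return Character.isLetter(c)
                && Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    public static void main(String[] args) {
        // "hello", a space, then the Arabic word for "hello" written with escapes.
        String mixed = "hello \u0645\u0631\u062D\u0628\u0627";
        for (char c : mixed.toCharArray()) {
            String kind = isArabicLetter(c) ? "Arabic"
                    : isBasicLatinLetter(c) ? "Latin"
                    : "other";
            System.out.println(Integer.toHexString(c) + " -> " + kind);
        }
    }
}
```

Working on char values is enough for both of these blocks, since they lie entirely in the Basic Multilingual Plane. For text that may contain supplementary characters, the of(int codePoint) overload would be the one to use.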

Dennis C answered Sep 20 '22 05:09


If [A-Za-z]+ meets your requirement, you aren't going to find anything faster or prettier. However, if you want to match all letters in the Latin-1 range (U+0000 to U+00FF, which includes accented letters and ligatures), you can use this:

Pattern p = Pattern.compile("[\\pL&&\\p{L1}]+");

That's the intersection of the set of all Unicode letters and the set of all Latin1 characters.
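
As a rough sketch of how such an intersection could be used to test whole words, something like the following should work. Note two assumptions: the class and method names (Latin1LetterDemo, isLatin1Word) are invented for this example, and the Latin-1 side is written as an explicit range [\x00-\xFF] in place of \p{L1}, which I am treating as equivalent.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class Latin1LetterDemo {

    // Intersection of "any Unicode letter" with the code points U+0000..U+00FF.
    // The explicit range stands in for \p{L1} (an assumed equivalence).
    // Compiled once and reused, rather than recompiled on every call.
    static final Pattern LATIN1_WORD =
            Pattern.compile("[\\p{L}&&[\\x00-\\xFF]]+");

    // True if s is non-empty and consists only of Latin-1 letters.
    static boolean isLatin1Word(String s) {
        return LATIN1_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Word("hello"));                 // true
        System.out.println(isLatin1Word("caf\u00E9"));             // true
        System.out.println(isLatin1Word("abc1"));                  // false
        System.out.println(isLatin1Word("\u0645\u0631\u062D"));    // false
    }
}
```

Keeping the compiled Pattern in a static field also helps with the "faster" part of the question: String.matches() compiles its regex on every call, so reusing one Pattern avoids that repeated work when checking many words.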

Alan Moore answered Sep 19 '22 05:09