
Does Character.isLetter need normalized text?

Tags: java, unicode

I have a program that filters strings by removing any character that isn't a letter or a digit. The program supports a large number of languages, including Chinese, Russian, and Arabic. The code is below:

StringBuilder strBuilder = new StringBuilder();

for (int i = 0; i < src.length(); ) {
    int ch = src.codePointAt(i);
    if (Character.isLetterOrDigit(ch)) {
        strBuilder.appendCodePoint(ch);
    }
    // Advance by 1 or 2 chars so a surrogate pair is read only once.
    i += Character.charCount(ch);
}

I use the codePointAt method to support characters outside the Basic Multilingual Plane, which Java strings encode as a pair of high and low surrogates. Does each string need to be normalized before filtering, that is, should I call Normalizer.normalize before running the loop? If so, which Normalizer.Form should I use?

Thanks.

user2144762 asked Mar 07 '13 14:03


1 Answer

It all depends on how you really want your algorithm to behave.

As an example, let us consider the string "a\u0308" (U+0061 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ followed by U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇsɪs), which is canonically equivalent to "ä" or "\u00e4" (U+00E4 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇsɪs). Being canonically equivalent means that your algorithm should not make a distinction between these two. One simple way to get canonically equivalent strings to behave the same is to normalize the two to the same canonical normalization form: either NFC or NFD.
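As a rough illustration of the point above, the following sketch compares the two spellings before and after normalizing to NFC. The class and method names (CanonicalEquivalenceDemo, canonicallyEqual) are made up for this example; only java.text.Normalizer is a real JDK API.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    // Hypothetical helper: treats two strings as equal when their NFC forms match.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0308";  // U+0061 followed by U+0308
        String precomposed = "\u00e4";  // U+00E4

        // A plain comparison sees two different char sequences.
        System.out.println(decomposed.equals(precomposed));            // false
        // After normalizing both sides, they compare equal.
        System.out.println(canonicallyEqual(decomposed, precomposed)); // true
    }
}
```

Using NFD on both sides instead of NFC would give the same answer; what matters is that both strings go through the same form.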

Depending on what these strings represent, you may want to use compatibility equivalence (NFKC or NFKD) instead. That is generally recommended for identifiers, for example. These two forms convert compatibility characters to their recommended equivalents (like U+00B5 ᴍɪᴄʀᴏ sɪɢɴ to U+03BC ɢʀᴇᴇᴋ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴍᴜ, or ligature characters to the sequences of characters they are made of).
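To make the difference between the canonical and compatibility forms concrete, here is a small sketch using a ligature character. The class and helper names (CompatibilityDemo, nfc, nfkc) are invented for this example; the behaviour shown is that of java.text.Normalizer.

```java
import java.text.Normalizer;

public class CompatibilityDemo {
    // Hypothetical shorthand wrappers around Normalizer.normalize.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // U+FB01 LATIN SMALL LIGATURE FI

        // Canonical normalization leaves the ligature untouched.
        System.out.println(nfc(ligature).equals(ligature)); // true
        // Compatibility normalization replaces it with the two letters.
        System.out.println(nfkc(ligature));                 // fi
    }
}
```

Note that compatibility normalization loses information (the ligature cannot be recovered afterwards), so it suits matching and filtering better than text that will be displayed again.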

Regardless of which kind of equivalence you want, the principle remains the same: if you want to treat equivalent strings equally, normalizing both is the simplest way.

Once you have the same behaviour for all equivalent strings, you need to consider another issue: if you are discarding all "character[s] that [are]n't a letter or a digit", what happens with strings with letters and combining marks, like "\u092C\u093F" (U+092C ᴅᴇᴠᴀɴᴀɢᴀʀɪ ʟᴇᴛᴛᴇʀ ʙᴀ followed by U+093F ᴅᴇᴠᴀɴᴀɢᴀʀɪ ᴠᴏᴡᴇʟ sɪɢɴ ɪ, looks like बि)? These are two separate codepoints, and U+093F is not a letter. These two do not compose in any normalization form. Do you want the combining marks to be dropped (leaving you with ब), or not?
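The effect described above can be checked with a short sketch. The class and method names (FilterDemo, keepLettersAndDigits) are hypothetical, and the stream-based loop is just one way to write the same per-code-point filter as in the question.

```java
public class FilterDemo {
    // Hypothetical helper: keeps only code points accepted by Character.isLetterOrDigit.
    static String keepLettersAndDigits(String src) {
        StringBuilder out = new StringBuilder();
        src.codePoints()
           .filter(Character::isLetterOrDigit)
           .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String bi = "\u092C\u093F"; // U+092C followed by U+093F

        System.out.println(Character.isLetterOrDigit(0x092C)); // true: a letter
        System.out.println(Character.isLetterOrDigit(0x093F)); // false: a combining mark
        // The vowel sign is dropped, leaving only the base letter.
        System.out.println(keepLettersAndDigits(bi).equals("\u092C")); // true
    }
}
```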

If dropping them is fine, you can use your current algorithm. Otherwise, you probably want to iterate over grapheme clusters, each of which, put roughly, is a base character followed by the combining marks applied to it. Both Java and ICU provide APIs for finding grapheme clusters (Java calls the boundaries between them "character breaks").
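One possible way to filter by grapheme cluster with the JDK is sketched below, using java.text.BreakIterator.getCharacterInstance(). The class and method names (GraphemeFilterDemo, keepClusters) are hypothetical, and the rule "keep the whole cluster if its first code point is a letter or digit" is an assumption about what you want, not the only reasonable policy.

```java
import java.text.BreakIterator;

public class GraphemeFilterDemo {
    // Hypothetical helper: keeps a whole cluster when it starts with a letter or digit.
    static String keepClusters(String src) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(src);
        StringBuilder out = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Decide based on the base character, then copy the marks along with it.
            if (Character.isLetterOrDigit(src.codePointAt(start))) {
                out.append(src, start, end);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The combining diaeresis stays attached to its base letter; '!' is dropped.
        System.out.println(keepClusters("a\u0308!").equals("a\u0308")); // true
    }
}
```

Exactly where the boundaries fall for a given script depends on the break rules shipped with your Java runtime, so it is worth testing with samples from the languages you support, or using ICU4J's BreakIterator if you need behaviour tied to a specific Unicode version.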

R. Martinho Fernandes answered Sep 20 '22 01:09