I have a program that filters strings by removing any character that isn't a letter or a digit. The program supports a large number of languages, including Chinese, Russian, Arabic, etc. The program is as below:
StringBuilder strBuilder = new StringBuilder();
for (int i = 0; i < src.length(); ) {
    int ch = src.codePointAt(i);
    if (Character.isLetterOrDigit(ch)) {
        strBuilder.appendCodePoint(ch);
    }
    // Advance by the number of chars in this code point (2 for a surrogate pair).
    i += Character.charCount(ch);
}
I use the codePointAt method to support supplementary characters, which are encoded in UTF-16 as high and low surrogate pairs. Do I need to normalize each string before performing the filtering? I'm referring to calling the Normalizer.normalize method before executing the loop. If so, which Normalizer.Form should I use?
Thanks.
As a side note on the API involved: Character.isLetter(int codePoint) returns true if the specified code point is a letter and false otherwise. Character.isAlphabetic is broader. The character '\u2164', for example, returns false when passed to isLetter() but true when passed to isAlphabetic(). For the English language the distinction makes no difference, since all letters of the English alphabet are accepted by both methods.
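For illustration, a small sketch of that distinction using nothing but the Character methods discussed here:

// U+2164 ROMAN NUMERAL FIVE is in category Nl ("letter number"),
// which isLetter() excludes but isAlphabetic() includes.
System.out.println(Character.isLetter(0x2164));        // false
System.out.println(Character.isAlphabetic(0x2164));    // true
System.out.println(Character.isLetterOrDigit(0x2164)); // false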
It all depends on how you really want your algorithm to behave.
As an example, let us consider the string "a\u0308" (U+0061 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ followed by U+0308 ᴄᴏᴍʙɪɴɪɴɢ ᴅɪᴀᴇʀᴇsɪs), which is canonically equivalent to "ä" or "\u00e4" (U+00E4 ʟᴀᴛɪɴ sᴍᴀʟʟ ʟᴇᴛᴛᴇʀ ᴀ ᴡɪᴛʜ ᴅɪᴀᴇʀᴇsɪs). Being canonically equivalent means that your algorithm should not make a distinction between these two. One simple way to get canonically equivalent strings to behave the same is to normalize both to the same canonical normalization form: either NFC or NFD.
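As a minimal sketch of what that buys you (plain java.text.Normalizer, nothing else assumed):

import java.text.Normalizer;

String decomposed  = "a\u0308";  // U+0061 followed by U+0308
String precomposed = "\u00e4";   // U+00E4

// Canonically equivalent, but not equal as raw char sequences.
System.out.println(decomposed.equals(precomposed));  // false

// After normalizing both to the same canonical form, they compare equal.
String a = Normalizer.normalize(decomposed,  Normalizer.Form.NFC);
String b = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
System.out.println(a.equals(b));                     // true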
Depending on what these strings represent, you may want to use compatibility equivalence (NFKC or NFKD) instead. That is generally recommended for identifiers, for example. These two forms convert compatibility characters to their recommended equivalents (like U+2126 ᴏʜᴍ sɪɢɴ to U+03A9 ɢʀᴇᴇᴋ ᴄᴀᴘɪᴛᴀʟ ʟᴇᴛᴛᴇʀ ᴏᴍᴇɢᴀ, or ligature characters to the sequences of characters they are made of).
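A rough sketch of that conversion, using the two examples just mentioned (U+FB01 is the "fi" ligature):

import java.text.Normalizer;

// OHM SIGN becomes GREEK CAPITAL LETTER OMEGA, and the "fi" ligature
// becomes the two-character sequence "fi".
System.out.println(Normalizer.normalize("\u2126", Normalizer.Form.NFKC)); // Ω
System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC)); // fi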
Regardless of which kind of equivalence you want, the principle remains the same: if you want to treat equivalent strings equally, normalizing both is the simplest way.
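Applied to your loop, a minimal sketch (assuming you settle on NFC; the helper name is just for illustration) is to normalize before filtering, so that canonically equivalent inputs produce the same output:

import java.text.Normalizer;

static String filterLettersAndDigits(String src) {
    // Normalize first so that "a\u0308" and "\u00e4" behave identically.
    String normalized = Normalizer.normalize(src, Normalizer.Form.NFC);
    StringBuilder strBuilder = new StringBuilder();
    for (int i = 0; i < normalized.length(); ) {
        int ch = normalized.codePointAt(i);
        if (Character.isLetterOrDigit(ch)) {
            strBuilder.appendCodePoint(ch);
        }
        i += Character.charCount(ch); // step over surrogate pairs
    }
    return strBuilder.toString();
}

With this, both "a\u0308" and "\u00e4" filter to "ä"; without the normalization, the first would filter to "a".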
Once you have the same behaviour for all equivalent strings, you need to consider another issue: if you are discarding all "character[s] that [are]n't a letter or a digit", what happens with strings with letters and combining marks, like "\u092C\u093F" (U+092C ᴅᴇᴠᴀɴᴀɢᴀʀɪ ʟᴇᴛᴛᴇʀ ʙᴀ followed by U+093F ᴅᴇᴠᴀɴᴀɢᴀʀɪ ᴠᴏᴡᴇʟ sɪɢɴ ɪ, which looks like बि)? These are two separate code points, and U+093F is not a letter. The two do not compose in any normalization form. Do you want the combining marks to be dropped (leaving you with ब), or not?
If dropping them is fine, you can use your current algorithm. Otherwise, you probably want to iterate over grapheme clusters, which, roughly put, are base characters followed by the combining marks attached to them. Both Java and ICU provide APIs for finding grapheme clusters (Java calls these "character breaks").
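If you go that route, a rough sketch with java.text.BreakIterator (the built-in character-break API; the helper name and the keep-or-drop rule are just one possible choice) could look like this: keep every grapheme cluster whose first code point is a letter or digit, marks and all.

import java.text.BreakIterator;

static String filterByGraphemeCluster(String src) {
    BreakIterator boundaries = BreakIterator.getCharacterInstance();
    boundaries.setText(src);
    StringBuilder strBuilder = new StringBuilder();
    int start = boundaries.first();
    for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
        String cluster = src.substring(start, end);
        // Keep the whole cluster (base character plus its combining marks)
        // when the base code point is a letter or digit.
        if (Character.isLetterOrDigit(cluster.codePointAt(0))) {
            strBuilder.append(cluster);
        }
    }
    return strBuilder.toString();
}

With this, "\u092C\u093F" keeps the vowel sign and comes out as बि rather than ब.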