I want to match the lowercase of the English "I" (i) with the lowercase of the Turkish "İ" (i). They are the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase());
the character i followed by a dot is printed (this site does not display it properly).
Is there a way to match them, preferably without hard-coding it? I want the program to match identical glyphs regardless of the language and the UTF code. Is this possible?
I've tested normalization with no success.
import java.text.Normalizer;

public class Main {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints the string, its NFD form, its lower case, and the NFD form of its lower case.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" - Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" - Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}
The result is not displayed properly on this site, but the first line (iTurkish) still has the ̇ (dot) next to the lowercase i.
Purpose and Problem
This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lowercase. İFEL becomes i(dot)fel, and "if" is not recognized as a prefix of it.
Dotted İ i and dotless I ı are distinct letters in the Latin alphabets of a number of Turkic languages, including Turkish, Azerbaijani, and Kazakh. This is unlike English and most other languages using the Latin script, where the capital I is dotless (I) while the lowercase i has a dot (i). Dotless I, I ı, usually denotes the close back unrounded vowel sound (/ɯ/).
If you print out the hex values of the characters you're seeing, the difference is clear:
İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69
Normalizing the Turkish İ doesn't give you an English I; instead it gives you an English I followed by a diacritic, 0x307. This is correct, and to be expected from the normalization process. Normalization is not a "Convert to ASCII" operation. As the documentation for Normalizer mentions, the process it follows is a very rigorously defined standard, Unicode Standard Annex #15: Unicode Normalization Forms.
There are numerous ways to strip diacritics, either before or after normalizing. The right one depends on the specifics of your use case, but here I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:
String asciiString = CharMatcher.ascii().retainFrom(normalizedString);
This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". This is both closer to "correct" and faster than the Pattern-based approach.
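Applied to the "İFEL" starts with "if" case from the question, the idea could be sketched as follows. This version uses only the JDK: a plain loop stands in for Guava's CharMatcher.ascii().retainFrom(...), so it runs without extra dependencies. The names FoldedPrefixMatch, fold and startsWithFolded are hypothetical, and lowercasing with Locale.ROOT is an assumption made so that "I" never becomes the Turkish dotless ı on a machine with a Turkish default locale.

```java
import java.text.Normalizer;
import java.util.Locale;

class FoldedPrefixMatch {
    // Hypothetical helper: decompose (NFD), keep only ASCII characters, then lowercase.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder ascii = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 0x80) {          // drops combining marks and any other non-ASCII char
                ascii.append(c);
            }
        }
        return ascii.toString().toLowerCase(Locale.ROOT);
    }

    // Prefix test on the folded forms of both strings.
    static boolean startsWithFolded(String word, String prefix) {
        return fold(word).startsWith(fold(prefix));
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130FEL"));                   // İFEL folded
        System.out.println(startsWithFolded("\u0130FEL", "if"));
    }
}
```

The same "good enough" caveat applies: characters with no ASCII base letter, such as the dotless ı (U+0131), are removed entirely rather than mapped to "i", so a dictionary that must distinguish or preserve such letters would need one of the more robust approaches.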