
How do I match "i" with Turkish i in Java?

I want to match the lower case of the English "I" (i) with the lower case of the Turkish "İ" (i̇). They look like the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); an i followed by a dot is printed (this site does not display it properly).

Is there a way to match those (preferably without hard-coding it)? I want the program to match identical glyphs regardless of the language and the Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

// Class name chosen for illustration only.
public class TurkishI {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ", LATIN CAPITAL LETTER I WITH DOT ABOVE
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints s, its NFD form, its lower case, and the NFD form of its lower case.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}

The output does not display properly on this site, but in the first line (iTurkish) the lowercase i is still followed by the combining dot ( ̇).

Purpose and Problem

This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both strings to lower case. "İFEL" becomes "i(dot)fel", so "if" is not recognized as its prefix.

WVrock asked Jun 09 '15 06:06


1 Answer

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69
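One way to produce hex values like the ones above is to walk the string's code points. This is only a sketch: the class name CodePointDump and the helper toHex are hypothetical names introduced here, not part of the original program.

```java
import java.util.stream.Collectors;

// Hypothetical helper class, named here for illustration only.
class CodePointDump {
    // Formats each Unicode code point of s as lowercase hex, separated by spaces.
    static String toHex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u0130"));  // prints 0x130
        System.out.println(toHex("I\u0307")); // prints 0x49 0x307
    }
}
```

Using codePoints() rather than chars() keeps characters outside the Basic Multilingual Plane as single values instead of surrogate pairs.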

Normalizing the Turkish İ doesn't give you an English I. Instead, it gives you an English I followed by a combining diacritic, 0x307 (U+0307, COMBINING DOT ABOVE). This is correct, and exactly what the normalization process is expected to do. Normalization is not a "convert to ASCII" operation. As the documentation for Normalizer mentions, the process it follows is a rigorously defined standard: Unicode Standard Annex #15, Unicode Normalization Forms.
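The decomposition can be checked directly. The following is a minimal sketch; the class name DottedIDemo and its helper methods are hypothetical, and Locale.ROOT is passed explicitly so the result does not depend on the machine's default locale (the original code uses the default).

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class, named here for illustration only.
class DottedIDemo {
    // Canonical decomposition (NFD) of s.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Locale-independent lower-casing of s.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String dottedI = "\u0130"; // Turkish capital dotted I

        // NFD splits the letter into a base "I" plus U+0307; it does not drop the dot.
        System.out.println(nfd(dottedI).equals("I\u0307"));       // prints true
        System.out.println(nfd(dottedI).equals("I"));             // prints false

        // Lower-casing keeps the combining dot as well, so "if" is not a prefix.
        System.out.println(lowerRoot(dottedI).equals("i\u0307")); // prints true
        System.out.println(lowerRoot("\u0130FEL").startsWith("if")); // prints false
    }
}
```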

There are numerous ways to strip diacritics, either before or after normalizing. The right one depends on the specifics of your application, but for your use case I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

Another answer goes into more depth about what the \p{InCombiningDiacriticalMarks} pattern does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". It is both closer to "correct" and faster than the Pattern-based approach.
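For readers without Guava on the classpath, the same idea can be sketched with the standard library alone. This is not the CharMatcher code itself: it is a stand-in that keeps code points below 128 after NFD normalization, and the names AsciiFoldSketch, foldToAscii and startsWithFolded are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch class, named here for illustration only.
class AsciiFoldSketch {
    // Decomposes s with NFD, then keeps only ASCII code points (below 128).
    static String foldToAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> cp < 128)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // Case-insensitive prefix test on the folded forms of both strings.
    static boolean startsWithFolded(String word, String prefix) {
        String w = foldToAscii(word).toLowerCase(Locale.ROOT);
        String p = foldToAscii(prefix).toLowerCase(Locale.ROOT);
        return w.startsWith(p);
    }

    public static void main(String[] args) {
        System.out.println(foldToAscii("\u0130FEL"));            // prints IFEL
        System.out.println(startsWithFolded("\u0130FEL", "if")); // prints true
    }
}
```

One limitation to keep in mind: letters with no canonical decomposition, such as the Turkish dotless ı (U+0131), are simply dropped by this filter rather than mapped to a lookalike, which is part of why retaining ASCII is only a quick fix.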

dimo414 answered Oct 18 '22 21:10