How do I match "i" with Turkish i in Java?

I want to match the lowercase form of the English "I" (i) with the lowercase form of the Turkish "İ" (i̇). They look like the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); the character i followed by a dot is printed (this site does not display it properly).

Is there a way to match those (preferably without hard-coding it)? I want the program to match identical glyphs regardless of the language and the Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

public class TurkishI {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints s alongside its NFD-normalized and lower-cased forms.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}

The result is not displayed properly on this site, but the first line (iTurkish) still has the combining dot (̇) next to the lowercase i.

Purpose and Problem

This will be a multilingual dictionary. I want the program to be able to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lower case. İFEL becomes i(dot)fel, and "if" is not recognized as a prefix of it.

265

asked Jun 09 '15 06:06

WVrock

1 Answer

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69

Normalizing the Turkish İ doesn't give you an English I; it gives you an English I followed by a combining diacritic, 0x307. This is correct and is exactly what the normalization process is expected to do. Normalization is not a "convert to ASCII" operation. As the documentation for Normalizer mentions, the process follows a rigorously defined standard: Unicode Standard Annex #15, Unicode Normalization Forms.
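To reproduce the hex values above yourself, something like the following sketch should work. The class name HexDump and the helper hex are made up for illustration, and Locale.ROOT is passed to toLowerCase as an assumption so that the output does not depend on the JVM's default locale (under a Turkish default locale the lower-casing result would differ).

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

class HexDump {
    // Hypothetical helper: formats every code point of s as hex, e.g. "0x49 0x307".
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String iTurkish = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        String decomposed = Normalizer.normalize(iTurkish, Normalizer.Form.NFD);
        // Assumption: Locale.ROOT, so the result is independent of the default locale.
        String lower = iTurkish.toLowerCase(Locale.ROOT);

        System.out.println(hex(iTurkish));   // 0x130
        System.out.println(hex(decomposed)); // 0x49 0x307
        System.out.println(hex(lower));      // 0x69 0x307
    }
}
```

The second and third lines show the same thing as the table: the dot survives as a separate combining character (0x307) after both normalization and lower-casing.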

There are numerous ways to strip diacritics, either before or after normalizing. The right one depends on the specifics of your use case, but here I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

A linked answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". It is both closer to "correct" and faster than the Pattern-based approach.
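For the dictionary case in the question, the whole pipeline might look roughly like the sketch below. It is a JDK-only stand-in for the Guava call (a plain code point filter instead of CharMatcher), and the names AsciiFold, fold and startsWithFolded are hypothetical. It assumes NFD normalization and Locale.ROOT lower-casing, and it shares the limitation mentioned above: anything without an ASCII decomposition is simply dropped.

```java
import java.text.Normalizer;
import java.util.Locale;

class AsciiFold {
    // Decomposes s (NFD), lower-cases it, and keeps only ASCII code points.
    // This is lossy: letters with no ASCII decomposition, such as the
    // Turkish dotless i (U+0131), disappear entirely.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD)
                .toLowerCase(Locale.ROOT);
        StringBuilder ascii = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> cp < 0x80) // same range as "ASCII only"
                .forEach(ascii::appendCodePoint);
        return ascii.toString();
    }

    // Hypothetical helper: case- and diacritic-insensitive prefix test.
    static boolean startsWithFolded(String word, String prefix) {
        return fold(word).startsWith(fold(prefix));
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130FEL"));                   // ifel
        System.out.println(startsWithFolded("\u0130FEL", "if")); // true
    }
}
```

Folding both the dictionary entry and the search term with the same function is what makes "İFEL" and "if" comparable; folding only one side would reintroduce the mismatch.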

169

answered Oct 18 '22 21:10

dimo414

Tags:

java

unicode

normalization

unicode-normalization
