I'm using Java 8.
I've been struggling for a few days to understand a bug related to string comparison. Have a look at this test. The two strings are different (the "i" is not the same character in each, and neither is the upper/lower case version of the other).
I would expect this test to pass. The first assert succeeds but the second one fails (for some reason compareToIgnoreCase returns 0).
Any idea what is going on?
Thanks
String str1 = "vırus";
String str2 = "virus";
Assert.assertNotEquals(0, str1.compareTo(str2));
Assert.assertNotEquals(0, str1.compareToIgnoreCase(str2));
The Javadoc of compareToIgnoreCase says:
Compares two strings lexicographically, ignoring case differences. This method returns an integer whose sign is that of calling compareTo with normalized versions of the strings where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character.
The ı character does not have a corresponding uppercase letter, so toUpperCase returns I and then toLowerCase returns i.
Similarly, the İ character does not have a corresponding lowercase letter, so toLowerCase returns i.
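The normalization quoted from the Javadoc can be reproduced by hand. The following is a minimal sketch, not the JDK's actual implementation: the class name CaseNormalizationSketch and the helper normalize are made up for illustration, and the non-ASCII letters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
public class CaseNormalizationSketch {

    // Hypothetical helper: applies Character.toLowerCase(Character.toUpperCase(c))
    // to every char, as described in the Javadoc of compareToIgnoreCase.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(Character.toLowerCase(Character.toUpperCase(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str1 = "v\u0131rus"; // "vırus", with U+0131 (dotless i)
        String str2 = "virus";

        // U+0131 is uppercased to 'I', which is then lowercased to 'i'
        System.out.println(Character.toUpperCase('\u0131') == 'I');
        // U+0130 (capital I with dot above) is lowercased to 'i'
        System.out.println(Character.toLowerCase('\u0130') == 'i');
        // After normalization both strings have the same content
        System.out.println(normalize(str1).equals(normalize(str2)));
    }
}
```

All three lines should print true, which is why the second assert in the question fails.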
Which means that compareToIgnoreCase considers these 4 letters to be the same:
ı - 'LATIN SMALL LETTER DOTLESS I' (U+0131)
i - 'LATIN SMALL LETTER I' (U+0069)
I - 'LATIN CAPITAL LETTER I' (U+0049)
İ - 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)
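As a quick check of the list above, the four letters can be compared pairwise. This is only an illustrative sketch; the class name DottedIEquivalence and the method allEqualIgnoringCase are hypothetical names introduced here.

```java
public class DottedIEquivalence {

    // The four letters from the list: U+0131, U+0069, U+0049, U+0130
    static final String[] LETTERS = {"\u0131", "i", "I", "\u0130"};

    // Returns true if every pair of letters compares as equal when case is ignored
    static boolean allEqualIgnoringCase() {
        for (String a : LETTERS) {
            for (String b : LETTERS) {
                if (a.compareToIgnoreCase(b) != 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Case-insensitive comparison treats all four letters as the same
        System.out.println(allEqualIgnoringCase());
        // The case-sensitive compareTo still tells dotless i and i apart
        System.out.println("\u0131".compareTo("i") != 0);
    }
}
```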
The upper-/title-/lower-case conversions are defined by Unicode, and can be seen in the links above. The uppercase I even has a comment:
Turkish and Azerbaijani use U+0131 for lowercase
And the lowercase i has the comment:
Turkish and Azerbaijani use U+0130 for uppercase
As mentioned in a comment by shmosel:
It's because character comparison is locale-insensitive. In a Turkish locale, the uppercase of i is İ and the lowercase of I is ı.
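The locale-sensitive behaviour described in that comment can be observed with the String.toUpperCase(Locale) and String.toLowerCase(Locale) overloads. The sketch below assumes the Turkish locale is obtained with Locale.forLanguageTag("tr"); the class and method names are hypothetical and exist only for this example.

```java
import java.util.Locale;

public class TurkishCaseSketch {

    // Assumption: "tr" is the language tag used for Turkish here
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // In a Turkish locale, i uppercases to U+0130 and I lowercases to U+0131
        System.out.println(upperTurkish("i").equals("\u0130"));
        System.out.println(lowerTurkish("I").equals("\u0131"));
        // With the root locale, the usual i <-> I mapping applies
        System.out.println("i".toUpperCase(Locale.ROOT).equals("I"));
    }
}
```

compareToIgnoreCase has no Locale parameter, so it always uses the locale-insensitive per-character mappings regardless of the default locale.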