Where can I find a specific set of collation rules for equality comparison of strings?

We all know that using String's equals() method for locale-sensitive equality comparison will fail miserably. Instead, one should use Collator, like this:

import java.text.Collator;
import java.util.Locale;

// the user interface locale has to be detected somehow
Locale uiLocale = Locale.forLanguageTag("da-DK");
// set up the collator: SECONDARY strength ignores case differences
Collator collator = Collator.getInstance(uiLocale);
collator.setStrength(Collator.SECONDARY);
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
// strings for equality testing
String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover graekenland støtte";
boolean result = collator.equals(test1, test2);

Now, this code works, that is, result is true, unless uiLocale is set to Danish; in that case it yields false. I understand why this happens: the equals method is implemented like this:

return compare(s1, s2) == Collator.EQUAL;

This method calls the one used for sorting and checks whether the strings compare as equal. They do not, because the Danish-specific collation rules require æ to be sorted after ae (if I understand the result of the compare method correctly). However, these strings really are the same: at this strength, both case differences and such compatibility characters (that is what they are called) should be treated as equal.
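To make the relationship between strength, compare() and equals() concrete, here is a minimal sketch. It uses Locale.ENGLISH and a pair of strings that differ only in case, so it does not depend on any Danish tailoring; the class and helper names are made up for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthDemo {
    // Hypothetical helper: builds a collator with the given locale and strength.
    static Collator collatorFor(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    public static void main(String[] args) {
        Collator secondary = collatorFor(Locale.ENGLISH, Collator.SECONDARY);
        Collator tertiary = collatorFor(Locale.ENGLISH, Collator.TERTIARY);
        String a = "Greece";
        String b = "greece";

        // Case is a tertiary difference, so SECONDARY ignores it and TERTIARY does not.
        System.out.println("SECONDARY equal: " + secondary.equals(a, b));
        System.out.println("TERTIARY equal:  " + tertiary.equals(a, b));

        // equals(String, String) agrees with compare(...) == 0 by definition.
        System.out.println("consistent: "
                + (secondary.equals(a, b) == (secondary.compare(a, b) == 0)));
    }
}
```

Because equality is defined through compare(), any locale tailoring that gives two spellings distinct sort positions also makes them unequal, which is what happens with æ and ae under the Danish rules.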

To fix this, one would use RuleBasedCollator with a specific set of rules that works for the equality case.
Finally, the question is: does anyone know where I can get such rules (not only for Danish, but for other languages as well), so that compatibility characters, ligatures, etc. are treated as equal? The CLDR charts do not seem to contain them, or I failed to find them.

Or maybe I want to do something stupid here, and I should simply use the UCA (Unicode Collation Algorithm) for equality comparison (any code sample, please)?

asked Dec 05 '11 18:12 by Paweł Dyda


1 Answer

I can't find any other existing Collator for Danish; the built-in one for the Danish locale is supposed to be correct. I am not sure that your assumption that ae should be sorted with æ holds, specifically because of certain foreign words in Danish (for example "aerofobi"). (I am not a Danish speaker, though I do speak Swedish.)

But if you want to sort them together, it seems you have two ways to do this, depending on the context you're in. In certain contexts, just replacing the characters might be appropriate:

String str = "USA lover graekenland støtte";
String sortStr = str.replace("ae", "æ");
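As a rough sketch of how the replacement approach could be turned into an equality check, and of where it goes wrong, consider the following. The class and method names are invented for illustration, and lower-casing with Locale.ROOT is an assumption about how case should be ignored.

```java
import java.util.Locale;

class DigraphFoldingDemo {
    // Hypothetical helper: lower-cases the text and folds "ae" into U+00E6 (æ).
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT).replace("ae", "\u00E6");
    }

    // Two strings are "loosely equal" if their folded forms match exactly.
    static boolean looselyEqual(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        String test1 = "USA lover Gr\u00E6kenland st\u00F8tte";
        String test2 = "USA lover graekenland st\u00F8tte";
        System.out.println(looselyEqual(test1, test2));

        // Pitfall: every "ae" is folded, including the one in foreign words.
        System.out.println(fold("aerofobi"));
    }
}
```

The second call shows the limitation: a blind replacement cannot tell a transliterated æ from a genuine a followed by e, so it is only suitable where such false matches are acceptable.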

The other, perhaps better, option is the one you mentioned: using RuleBasedCollator. Adapting the example from the Javadoc, this is pretty trivial:

// requires: import java.text.RuleBasedCollator;
String danish = "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I" +
                "< j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R" +
                "< s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z" +
                "< \u00E6 = ae," +       // Latin letter ae, made equal to the sequence "ae"
                "  \u00C6 = AE " +       // Latin letter AE, made equal to the sequence "AE"
                "< \u00F8, \u00D8" +     // Latin letter o & O with stroke
                "< \u00E5 = a\u030A," +  // Latin letter a with ring above
                "  \u00C5 = A\u030A;" +  // Latin letter A with ring above
                "  aa, AA";
// the constructor throws java.text.ParseException if the rules are malformed
RuleBasedCollator danishCollator = new RuleBasedCollator(danish);

You can then use it like this:

String test1 = "USA lover Grækenland støtte";
String test2 = "USA lover Graekenland støtte";         // note capital 'G'
boolean result = danishCollator.equals(test1, test2);  // true
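If you would rather start from the rules the JDK already ships than write a rule string from scratch, they can usually be read back from the built-in collator. The sketch below assumes that Collator.getInstance returns a RuleBasedCollator, which is what the OpenJDK implementations I have seen do, although the API only promises a Collator; the class and method names are made up for illustration.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Optional;

class CollatorRulesDump {
    // Hypothetical helper: returns the rule string behind the locale's collator,
    // or an empty Optional if the collator is not rule-based.
    static Optional<String> rulesFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        if (collator instanceof RuleBasedCollator) {
            return Optional.of(((RuleBasedCollator) collator).getRules());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> rules = rulesFor(Locale.forLanguageTag("da-DK"));
        if (rules.isPresent()) {
            // The string is long: it holds the default rules plus the locale tailoring.
            System.out.println("rule string length: " + rules.get().length());
        } else {
            System.out.println("collator is not rule-based");
        }
    }
}
```

The returned string can be edited or extended and passed to the RuleBasedCollator constructor, so it can serve as a base for a custom equality-oriented collator.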

If you believe that the default collator is incorrect, you may wish to report a bug. (There have previously been similar bugs).

Update: I checked this with a printed Danish-language encyclopedia. There are indeed words that begin with 'ae' (primarily words from foreign languages; "aerobics", for example) which are not sorted with (and therefore not equal to) words beginning with 'æ'. So although I see why you would want to treat them as equal in many circumstances, they are not strictly so.

answered Oct 04 '22 09:10 by beerbajay