The following test passes correctly with Java 8.
Comparator<String> stringComparator = Collator.getInstance(new Locale("sv", "SE"));
Assert.assertTrue(stringComparator.compare("aaaa", "bbbb") < 0);
Assert.assertTrue(stringComparator.compare("waaa", "vbbb") < 0);
Assert.assertTrue(stringComparator.compare("vaaa", "wbbb") < 0);
This orders waaa before vbbb and vaaa before wbbb. Apparently it treats v and w as the same letter.
In fact, according to Wikipedia, in the Swedish language:
By 2006, 'W' had grown in usage because of new loanwords, so 'W' officially became a letter, and the 'V' = 'W' sorting rule was deprecated. Pre-2006 books and software generally use the rule. After the rule was deprecated, some books and software continued to apply it.
Does anyone have a general workaround for this, so that v and w are treated as separate letters within the Swedish locale?
Create your own RuleBasedCollator.
Check the value of the string returned by
((RuleBasedCollator)Collator.getInstance(new Locale("sv", "SE"))).getRules()
and modify it to suit your needs and then create a new collator with your modified rules.
And probably submit a JDK bug report too, for good measure.
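A minimal sketch of that approach, for illustration only. The class name SwedishVwRules and the appended rule "& V < w , W" are my own assumptions, not something taken from the JDK's Swedish rule set. It relies on the assumption that a rule appended to the end of the string repositions characters that earlier rules already placed, so w and W would land after V with a primary difference. Inspect the output of getRules() on your own JDK before relying on it.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

final class SwedishVwRules {

    /**
     * Hypothetical helper: takes the Swedish rules as they are and appends
     * one extra rule that asks for w to sort after v as a separate letter.
     */
    static RuleBasedCollator create() {
        RuleBasedCollator base =
                (RuleBasedCollator) Collator.getInstance(Locale.forLanguageTag("sv-SE"));
        // Assumption: the appended rule overrides the earlier placement of w and W.
        String rules = base.getRules() + " & V < w , W";
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalStateException("Modified rules could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        Collator collator = create();
        // With v and w as separate letters, both lines should print true.
        System.out.println(collator.compare("wa", "vb") > 0);
        System.out.println(collator.compare("va", "wb") < 0);
    }
}
```

Appending a rule rather than editing the string in place avoids depending on the exact text of the built-in rules, which may differ between JDK versions.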
This orders waaa before vbbb and vaaa before wbbb. Apparently it treats v and w as the same letter.
Strictly speaking, the JDK does not treat 'w' and 'v' as the same character, even in the Swedish locale: 'v' sorts before 'w'.
Assert.assertEquals(1, stringComparator.compare("w", "v"));//TRUE
However, based on the Swedish collation rules, JDK orders 'wa' ahead of 'vb'.
Assert.assertEquals(1, stringComparator.compare("wa", "vb"));//FALSE
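One plausible reading of these two results is that the Swedish rules give w only a secondary difference from v, so the difference only matters when nothing else in the strings differs at the primary level. The sketch below reproduces the same pattern with a toy rule set; the rule string "< a < b < v ; w" and the class name are made up for the demonstration and are not the JDK's actual Swedish rules.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class SecondaryDifferenceDemo {

    /** Toy collator: a, b and v are distinct letters, w is a secondary variant of v. */
    static RuleBasedCollator toyCollator() {
        try {
            return new RuleBasedCollator("< a < b < v ; w");
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = toyCollator();
        // Only the secondary difference separates the two strings, so w sorts after v.
        System.out.println(collator.compare("w", "v"));   // 1
        // The primary difference between a and b wins over the v/w difference.
        System.out.println(collator.compare("wa", "vb")); // -1
    }
}
```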
You could create a custom comparator, which wraps the collator and manually handles v and w the way you want.
I have made two implementations of this.
The first one is short and elegant: it uses Guava's lexicographical comparator together with the tricky regex that Holger provided in a comment.
private static final Pattern VW_BOUNDARY = Pattern.compile("(?=[vw])|(?<=[vw])", Pattern.CASE_INSENSITIVE);
public static Comparator<String> smallCorrectVwWrapper(Comparator<Object> original) {
    return Comparator.comparing(
        s -> Arrays.asList(VW_BOUNDARY.split((String) s)),
        Comparators.lexicographical(original));
}
The second implementation is longer and more involved; it does the same thing by hand, without libraries or regexes.
public static Comparator<String> correctVwWrapper(Comparator<Object> original) {
    return (s1, s2) -> compareSplittedVw(original, s1, s2);
}
/**
 * Compares the two strings by first splitting them into segments separated by W
 * and V, then comparing the segments one by one.
 */
private static int compareSplittedVw(Comparator<Object> original, String s1, String s2) {
    List<String> l1 = splitVw(s1);
    List<String> l2 = splitVw(s2);
    int minSize = Math.min(l1.size(), l2.size());
    for (int ix = 0; ix < minSize; ix++) {
        int comp = original.compare(l1.get(ix), l2.get(ix));
        if (comp != 0) {
            return comp;
        }
    }
    // All shared segments are equal, so the string with fewer segments sorts first.
    return Integer.compare(l1.size(), l2.size());
}
private static boolean isVw(int ch) {
    return ch == 'V' || ch == 'v' || ch == 'W' || ch == 'w';
}
/**
 * Splits the string into segments separated by V and W.
 * Each V or W becomes a segment of its own.
 */
public static List<String> splitVw(String s) {
    // Explicit types instead of var, so that this also compiles on Java 8.
    StringBuilder b = new StringBuilder();
    List<String> result = new ArrayList<>();
    for (int offset = 0; offset < s.length();) {
        int ch = s.codePointAt(offset);
        if (isVw(ch)) {
            if (b.length() > 0) {
                result.add(b.toString());
                b.setLength(0);
            }
            result.add(Character.toString((char) ch));
        } else {
            b.appendCodePoint(ch);
        }
        offset += Character.charCount(ch);
    }
    if (b.length() > 0) {
        result.add(b.toString());
    }
    return result;
}
Usage:
public static void main(String[] args) throws Exception {
    Comparator<String> stringComparator = correctVwWrapper(Collator.getInstance(new Locale("sv", "SE")));
    System.out.println(stringComparator.compare("a", "z") < 0); // true
    System.out.println(stringComparator.compare("wa", "vz") < 0); // false
    System.out.println(stringComparator.compare("wwa", "vvz") < 0); // false
    System.out.println(stringComparator.compare("va", "wz") < 0); // true
    System.out.println(stringComparator.compare("v", "w") < 0); // true
}
It is a little more work to implement a wrapping Collator, but it should not be too complicated.
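A rough sketch of what such a wrapping Collator might look like. The class name VwSeparatingCollator is hypothetical, and the sketch takes a shortcut: it does not support collation keys, so it only suits code that calls compare (for example sorting). It reuses the boundary regex from the first implementation and compares the segments with the wrapped collator.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical wrapper that compares V/W-separated segments with a delegate collator. */
final class VwSeparatingCollator extends Collator {

    // Zero-width boundaries before and after every v or w, in either case.
    private static final Pattern VW_BOUNDARY =
            Pattern.compile("(?=[vw])|(?<=[vw])", Pattern.CASE_INSENSITIVE);

    private final Collator delegate;

    VwSeparatingCollator(Collator delegate) {
        this.delegate = delegate;
    }

    @Override
    public int compare(String source, String target) {
        String[] a = VW_BOUNDARY.split(source);
        String[] b = VW_BOUNDARY.split(target);
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int result = delegate.compare(a[i], b[i]);
            if (result != 0) {
                return result;
            }
        }
        // All shared segments are equal: the string with fewer segments sorts first.
        return Integer.compare(a.length, b.length);
    }

    @Override
    public CollationKey getCollationKey(String source) {
        // Simplification of this sketch: keys would need their own segment-aware encoding.
        throw new UnsupportedOperationException("Collation keys are not supported");
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof VwSeparatingCollator
                && delegate.equals(((VwSeparatingCollator) other).delegate);
    }

    @Override
    public int hashCode() {
        return delegate.hashCode();
    }

    public static void main(String[] args) {
        Collator collator =
                new VwSeparatingCollator(Collator.getInstance(Locale.forLanguageTag("sv-SE")));
        System.out.println(collator.compare("wa", "vz") < 0);
        System.out.println(collator.compare("va", "wz") < 0);
    }
}
```

Because Collator implements Comparator<Object>, an instance can be passed anywhere the original collator was used for sorting, as long as nothing asks it for a CollationKey.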