java.text.Collator treats "v" and "w" as the same letter for Swedish language/locale

Tags: java, java-8

The following test passes with Java 8:

Comparator<String> stringComparator = Collator.getInstance(new Locale("sv", "SE"));

Assert.assertTrue(stringComparator.compare("aaaa", "bbbb") < 0);
Assert.assertTrue(stringComparator.compare("waaa", "vbbb") < 0);
Assert.assertTrue(stringComparator.compare("vaaa", "wbbb") < 0);

This orders waaa before vbbb and vaaa before wbbb. Apparently it treats v and w as the same letter.

In fact, according to Wikipedia, in the Swedish language:

By 2006, 'W' had grown in usage because of new loanwords, so 'W' officially became a letter, and the 'V' = 'W' sorting rule was deprecated. Pre-2006 books and software generally use the rule. After the rule was deprecated, some books and software continued to apply it.

Does anyone have a general workaround for this, so that v and w are treated as separate letters within the Swedish locale?

Mateva asked Feb 20 '19 15:02


3 Answers

Create your own RuleBasedCollator.

Check the value of the string returned by

((RuleBasedCollator)Collator.getInstance(new Locale("sv", "SE"))).getRules()

and modify it to suit your needs and then create a new collator with your modified rules.

And probably submit a JDK bug report too, for good measure.
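
A minimal sketch of that approach, assuming the JDK's Swedish tailoring ranks w relative to v at a weaker-than-primary level (which is what the test in the question suggests). The class name SwedishVwCollator and the appended rule "& V < w , W" are my own assumptions, not something taken from the JDK; it relies on java.text.RuleBasedCollator letting a later rule override an earlier placement of the same character. Inspect the output of getRules() on your own JDK before relying on it.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class SwedishVwCollator {

    /**
     * Rebuilds the JDK's Swedish collator with one extra rule appended.
     * The rule "& V < w , W" is an assumption made for this sketch: it asks
     * for w to sort after V with a primary ("<") difference, and for W to
     * differ from w only by case (",").
     */
    public static RuleBasedCollator create() {
        RuleBasedCollator swedish =
                (RuleBasedCollator) Collator.getInstance(new Locale("sv", "SE"));
        String rules = swedish.getRules() + " & V < w , W";
        try {
            RuleBasedCollator fixed = new RuleBasedCollator(rules);
            // Carry over the settings of the original collator.
            fixed.setStrength(swedish.getStrength());
            fixed.setDecomposition(swedish.getDecomposition());
            return fixed;
        } catch (ParseException e) {
            throw new IllegalStateException("Modified rules did not parse", e);
        }
    }

    public static void main(String[] args) {
        Collator fixed = create();
        // With v and w as separate letters, the first letter decides both comparisons.
        System.out.println(fixed.compare("waaa", "vbbb") > 0);
        System.out.println(fixed.compare("vaaa", "wbbb") < 0);
    }
}
```

Appending a rule rather than editing the rule string in place avoids depending on the exact text of the JDK's rules, which may differ between versions.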

DodgyCodeException answered Nov 20 '22 08:11


"This orders waaa before vbbb and vaaa before wbbb. Apparently it treats v and w as the same letter."

The JDK does not in fact treat 'w' and 'v' as the same character, even in the Swedish locale. Compared on their own, 'v' comes before 'w':

Assert.assertEquals(1, stringComparator.compare("w", "v")); // passes

However, under its Swedish collation rules the JDK still orders 'wa' ahead of 'vb', which suggests that the difference between v and w is only used as a tie-breaker once all other letters compare equal:

Assert.assertEquals(1, stringComparator.compare("wa", "vb")); // fails
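
The same pattern can be reproduced with a toy RuleBasedCollator, which may help to see why both results are consistent. This is only an illustration: the two rule strings below are invented for the example and are not the JDK's Swedish rules, and the class name StrengthDemo is hypothetical. In java.text.RuleBasedCollator rule syntax, "<" declares a primary difference and ";" a secondary one.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class StrengthDemo {

    // Toy rules, invented for this illustration (NOT the JDK's Swedish rules).
    // w differs from v only at the secondary level:
    static final String W_SECONDARY = "< a < b < v ; w < z";
    // w is a separate letter that sorts after v:
    static final String W_PRIMARY = "< a < b < v < w < z";

    static Collator collator(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        Collator secondary = collator(W_SECONDARY);
        Collator primary = collator(W_PRIMARY);

        // On their own, v and w are distinguishable even with the secondary rule.
        System.out.println(secondary.compare("w", "v") > 0);   // true
        // But a primary difference later in the string (a vs. b) wins over it.
        System.out.println(secondary.compare("wa", "vb") < 0); // true
        // With a primary difference between v and w, the first letter decides.
        System.out.println(primary.compare("wa", "vb") > 0);   // true
    }
}
```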

VHS answered Nov 20 '22 09:11


You could create a custom comparator, which wraps the collator and manually handles v and w the way you want.

I have made two implementations of this.

The first one is short and elegant; it uses Guava's lexicographical comparator together with the tricky regex that Holger provided in a comment.

private static final Pattern VW_BOUNDARY = Pattern.compile("(?=[vw])|(?<=[vw])", Pattern.CASE_INSENSITIVE);

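public static Comparator<String> smallCorrectVwWrapper(Comparator<Object> original) {
    // Split each string at every v/w boundary, then compare the segment lists
    // element by element using the original comparator.
    return Comparator.comparing(
        s -> Arrays.asList(VW_BOUNDARY.split((String) s)),
        Comparators.lexicographical(original));
}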
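
To see what the regex does on its own, here is a small sketch without Guava that only prints the segments it produces. The class name VwSplitDemo and the helper segments are made up for this illustration. It assumes Java 8 or later, where String splitting does not produce an empty leading segment for a zero-width match at the start of the input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class VwSplitDemo {

    // Matches the empty position directly before or directly after a v or w.
    static final Pattern VW_BOUNDARY =
            Pattern.compile("(?=[vw])|(?<=[vw])", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: splits a string so every v/w becomes its own segment.
    static List<String> segments(String s) {
        return Arrays.asList(VW_BOUNDARY.split(s));
    }

    public static void main(String[] args) {
        System.out.println(segments("waaa")); // [w, aaa]
        System.out.println(segments("avwb")); // [a, v, w, b]
    }
}
```

Because each v or w ends up alone in its segment, the wrapped collator only ever compares a single v against a single w, where it already gives the desired order.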

The second implementation is longer and more complex. It does the same thing, but implemented by hand, without libraries or regexes.

public static Comparator<String> correctVwWrapper(Comparator<Object> original) {
    return (s1, s2) -> compareSplittedVw(original, s1, s2);
}

/**
 * Compares the two strings by first splitting them into segments separated by W
 * and V, then comparing the segments one by one.
 */
private static int compareSplittedVw(Comparator<Object> original, String s1, String s2) {
    List<String> l1 = splitVw(s1);
    List<String> l2 = splitVw(s2);

    int minSize = Math.min(l1.size(), l2.size());

    for (int ix = 0; ix < minSize; ix++) {
        int comp = original.compare(l1.get(ix), l2.get(ix));
        if (comp != 0) {
            return comp; 
        }
    }

    return Integer.compare(l1.size(), l2.size());
}

private static boolean isVw(int ch) {
    return ch == 'V' || ch == 'v' || ch == 'W' || ch == 'w';
}


/**
 * Splits the string into segments separated by V and W.
 */
public static List<String> splitVw(String s) {
    // Explicit types instead of var, so that this also compiles on Java 8.
    StringBuilder b = new StringBuilder();

    List<String> result = new ArrayList<>();

    for (int offset = 0; offset < s.length();) {
        int ch = s.codePointAt(offset);

        if (isVw(ch)) {
            if (b.length() > 0) {
                result.add(b.toString());
                b.setLength(0);
            }

            result.add(Character.toString((char) ch));
        } else {
            b.appendCodePoint(ch);
        }

        offset += Character.charCount(ch);
    }

    if (b.length() > 0) {
        result.add(b.toString());
    }

    return result;
}

Usage:

public static void main(String[] args) throws Exception {
    Comparator<String> stringComparator = correctVwWrapper(Collator.getInstance(new Locale("sv", "SE")));

    System.out.println(stringComparator.compare("a", "z") < 0);     // true
    System.out.println(stringComparator.compare("wa", "vz") < 0);   // false
    System.out.println(stringComparator.compare("wwa", "vvz") < 0); // false
    System.out.println(stringComparator.compare("va", "wz") < 0);   // true
    System.out.println(stringComparator.compare("v", "w") < 0);     // true
}

It is a little more work to implement a wrapping Collator, but it should not be too complicated.

Lii answered Nov 20 '22 07:11