 

Wrong sorting with Collator using Locale.SIMPLIFIED_CHINESE

I'm trying to sort a list of country names in Chinese using Locale.SIMPLIFIED_CHINESE, which appears to order by pinyin (the phonetic alphabet: characters are ordered by their Latin transcription, from A to Z).

But I've found some cases where the resulting order is wrong. For example:

  • '中' character is zhong1
  • '梵' character is fan4

The correct order should be 梵 < 中, but the collator sorts them the other way round.

import java.text.Collator;
import java.util.*;

String[] characters = new String[] {"梵", "中"};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
collator.setStrength(Collator.PRIMARY);
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

Results of this snippet are:

Before sorting...
[梵, 中]
After sorting...
[中, 梵]

Digging deeper, I found the rules that Java applies for Locale.SIMPLIFIED_CHINESE. You can see them in this image: https://postimg.cc/image/4t915a7gp/full/ (notice that 梵 comes after 中).
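One way to inspect those rules without a screenshot is to ask the collator for its rule string. The sketch below assumes the JDK returns a java.text.RuleBasedCollator for this locale (hence the instanceof check); the class name CollationRulesProbe and the helper positionInRules are made up for illustration. The position in the rule string is only a rough indicator of sort order, since reset rules ("&") can relocate characters.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

class CollationRulesProbe {
    // Returns the index of `ch` in the rule string of the locale's collator,
    // or -1 if the collator is not rule-based or the character is not listed.
    static int positionInRules(Locale locale, String ch) {
        Collator collator = Collator.getInstance(locale);
        if (!(collator instanceof RuleBasedCollator)) {
            return -1; // assumption failed: no rule string available
        }
        String rules = ((RuleBasedCollator) collator).getRules();
        return rules.indexOf(ch);
    }

    public static void main(String[] args) {
        // U+4E2D is zhong1, U+68B5 is fan4 (the two characters from the question)
        System.out.println("zhong1: " + positionInRules(Locale.SIMPLIFIED_CHINESE, "\u4E2D"));
        System.out.println("fan4:   " + positionInRules(Locale.SIMPLIFIED_CHINESE, "\u68B5"));
    }
}
```

If both positions are non-negative and the one for zhong1 is smaller, that matches what the image shows.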

I noticed that before the <口<口<口<口<口 marker, which I highlighted in red, all characters are ordered by their Latin transcription, from A to Z. After that marker, however, characters are ordered by their composition: characters that share a component (usually the left part of the character) are grouped together instead of following the A-to-Z rule.

Also, all the characters after the <口<口<口<口<口 marker are less common Chinese characters. 梵 is less common than 中, so it is placed after the marker.

I wonder why this was decided, and whether it is intentional, because it produces wrong sort orders. I don't know how to work around it.
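For a short, fixed list such as country names, one possible workaround is to skip the collator and sort on pinyin keys supplied by hand. This is only a sketch under that assumption: the class PinyinKeySort, the method sortByPinyin, and the contents of the lookup map are hypothetical, and plain String comparison of keys such as "fan4" and "zhong1" is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PinyinKeySort {
    // Returns a new list sorted by the caller-supplied pinyin keys.
    // Entries without a key sort last, ordered by their raw string value.
    static List<String> sortByPinyin(List<String> names, Map<String, String> pinyin) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator
                .comparing((String s) -> !pinyin.containsKey(s))
                .thenComparing((String s) -> pinyin.getOrDefault(s, s)));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical hand-maintained table: character -> pinyin with tone digit
        Map<String, String> pinyin = new HashMap<>();
        pinyin.put("\u4E2D", "zhong1"); // U+4E2D
        pinyin.put("\u68B5", "fan4");   // U+68B5

        List<String> input = Arrays.asList("\u4E2D", "\u68B5");
        // fan4 sorts before zhong1, so U+68B5 comes first
        System.out.println(sortByPinyin(input, pinyin));
    }
}
```

The trade-off is that the table has to be maintained manually, which is only practical when the set of strings is small and known in advance.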

elegarpes asked Nov 12 '15 13:11


1 Answer

The sort order produced by Java's collator is based on the number of strokes needed to write each character.

The small snippet below demonstrates this with Locale.TRADITIONAL_CHINESE. The stroke counts are taken from Wiktionary.

// the unicode character and the number of strokes
String[] characters = new String[]{
    "\u68B5 (11)", "\u4E2D (4)", 
    "\u5207 (4)", "\u5973 (3)", "\u898B (7)"
};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.TRADITIONAL_CHINESE);
collator.setStrength(Collator.PRIMARY);
System.out.println();
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

Output:

Before sorting...
[梵 (11), 中 (4), 切 (4), 女 (3), 見 (7)]

After sorting...
[女 (3), 中 (4), 切 (4), 見 (7), 梵 (11)]
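Since the question uses Locale.SIMPLIFIED_CHINESE while the snippet above uses Locale.TRADITIONAL_CHINESE, it can help to run the same list through both collators side by side. The sketch below does that; the class LocaleOrderComparison and the helper sortWith are illustrative names, and no particular output is claimed here because it depends on the collation data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LocaleOrderComparison {
    // Sorts a copy of the input with the given locale's collator at PRIMARY strength.
    static List<String> sortWith(Locale locale, List<String> input) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        // Same five characters as in the snippet above, without the stroke counts
        List<String> input = Arrays.asList("\u68B5", "\u4E2D", "\u5207", "\u5973", "\u898B");
        System.out.println("SIMPLIFIED_CHINESE:  " + sortWith(Locale.SIMPLIFIED_CHINESE, input));
        System.out.println("TRADITIONAL_CHINESE: " + sortWith(Locale.TRADITIONAL_CHINESE, input));
    }
}
```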

There is an enhancement request, JDK-6415666, to implement sorting according to the Unicode collation order. However, according to the list of locales supported by Java 8, it has not been implemented in Java 8.

Edit: with the collator from ICU4J, the sort order is

[梵 (11), 見 (7), 女 (3), 切 (4), 中 (4)]

ICU4J code snippet

import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
...
Locale locale = new Locale("zh", "", "PINYIN");
Collator collator = (RuleBasedCollator) Collator.getInstance(locale);
SubOptimal answered Oct 24 '22 01:10