
Java collation ignores space

Tags: java, collation

I recently became aware that Java collation seems to ignore spaces.

I have a list of the following terms:

Amman Jost 
Ammann Heinrich 
Ammanner Josef 
Bär Walter 
Bare Werner 
Barr Burt 
Barraud Maurice

The order above is the desired ordering for Germany, i.e. taking spaces into account. However, Java collation using

Collator collator = Collator.getInstance(Locale.GERMANY);
Collections.sort(values, collator);

gives me the following order:

Amman Jost
Ammanner Josef
Ammann Heinrich
Bare Werner
Barraud Maurice
Barr Burt
Bär Walter

The result above is not what I expected, since spaces are not taken into account (it looks like the case described here: Wikipedia Alphabetical order).
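A minimal, self-contained reproduction of the behaviour described above might look like the following sketch. The class name `DefaultCollationDemo` and the helper `sortWithDefaultCollator` are made up for illustration; only `Collator.getInstance(Locale.GERMANY)` and `Collections.sort` come from the snippet in the question.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class DefaultCollationDemo {

    // Hypothetical helper: sorts a copy of the list with the stock German collator.
    static List<String> sortWithDefaultCollator(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collator collator = Collator.getInstance(Locale.GERMANY);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Amman Jost", "Ammann Heinrich", "Ammanner Josef",
                "Bär Walter", "Bare Werner", "Barr Burt", "Barraud Maurice");
        // Prints one name per line in the order produced by the default collator.
        sortWithDefaultCollator(values).forEach(System.out::println);
    }
}
```

Running `main` prints the list in the collator's order, which makes it easy to compare against the desired order listed earlier.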

Does this mean that Java collation is not usable for this use case, or am I doing something wrong? Is there a way to make Java collation space-aware?

I would be grateful for any comments or recommendations.

jhasenbe asked May 15 '13 14:05




1 Answer

You can customize the collation. Try looking at the source code to see how the Collator for the German locale is built, as described in this answer.

Then adapt it to your needs. The tutorial gives a starting point. But there is no need to do all the work yourself, since someone else has already done it: see this blog post, which deals with exactly the same problem for Czech.

The essence of the solution linked above is:

String rules = ((RuleBasedCollator) Collator.getInstance(Locale.GERMANY)).getRules();
RuleBasedCollator correctedCollator 
    = new RuleBasedCollator(rules.replaceAll("<'\u005f'", "<' '<'\u005f'"));

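This adds a rule for the space character just before the rule for underscore. In RuleBasedCollator rule syntax, `<` introduces a primary difference, so the space becomes a character in its own right that sorts before the underscore instead of being skipped. Note that the RuleBasedCollator constructor throws a checked ParseException.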

I confess I haven't tested this personally.
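For anyone who wants to try it, here is a self-contained sketch of the same idea applied to the list from the question. The class name `SpaceAwareCollation` and the helpers `spaceAwareCollator` and `sortSpaceAware` are invented for illustration. The sketch assumes that the default collator is a `RuleBasedCollator` and that its rule string contains the underscore rule `<'_'` (the underscore is U+005F); if either assumption does not hold on a given JDK, it fails with an exception rather than silently returning an unmodified collator. It also uses the literal `String.replace` instead of `replaceAll`, since no regular expression is needed.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SpaceAwareCollation {

    // Hypothetical helper: takes the locale's default rules and inserts a
    // primary-level rule for the space character just before the underscore rule.
    static RuleBasedCollator spaceAwareCollator(Locale locale) {
        Collator base = Collator.getInstance(locale);
        if (!(base instanceof RuleBasedCollator)) {
            throw new IllegalStateException("Default collator is not rule based");
        }
        String rules = ((RuleBasedCollator) base).getRules();
        String underscoreRule = "<'_'";
        // Assumption: the rule text contains this exact fragment.
        if (!rules.contains(underscoreRule)) {
            throw new IllegalStateException("Underscore rule not found in collation rules");
        }
        try {
            return new RuleBasedCollator(rules.replace(underscoreRule, "<' '" + underscoreRule));
        } catch (ParseException e) {
            throw new IllegalStateException("Modified rules could not be parsed", e);
        }
    }

    // Hypothetical helper: returns a sorted copy, leaving the input untouched.
    static List<String> sortSpaceAware(List<String> names, Locale locale) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(spaceAwareCollator(locale));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Barraud Maurice", "Bare Werner", "Ammanner Josef", "Barr Burt",
                "Amman Jost", "Bär Walter", "Ammann Heinrich");
        sortSpaceAware(values, Locale.GERMANY).forEach(System.out::println);
    }
}
```

Building the collator re-parses the whole rule string, so in real code it is probably worth creating it once and reusing it rather than rebuilding it for every sort.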

Andrew Spencer answered Oct 02 '22 03:10
