Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort by collation

Having a collection:

{"name": "a"},
{"name": "B"},    
{"name": "b"},    
{"name": "c"},    
{"name": "á"},    
{"name": "A"}

ex. how to sort it in spanish case insensitive?

I've tried this:

var abc = [{"name": "a"}, {"name": "B"}, {"name": "b"}, {"name": "c"}, {"name": "á"}, {"name": "A"}];
for (i in abc) db.abc.save(abc[i]);

db.abc.find({},{"_id":0}).sort({"name":1});

Output is:

[
    { "name" : "A" },
    { "name" : "B" },
    { "name" : "a" },
    { "name" : "b" },
    { "name" : "c" },
    { "name" : "á" },
]

Desired result:

[
    { "name" : "a" },
    { "name" : "á" },
    { "name" : "A" },
    { "name" : "b" },
    { "name" : "B" },
    { "name" : "c" }
]
like image 818
Joan Vega Avatar asked Sep 19 '12 15:09

Joan Vega


4 Answers

Although the other answers here are correct for MongoDB versions 3.2.x and previous, starting in 3.4.0 you can "specify collations for a collection or a view, an index, or specific operations that support collation".

Full documentation for the feature is here.

like image 98
IanWhalen Avatar answered Nov 09 '22 02:11

IanWhalen


I know this is an old thread, but I think it would be useful to answer anyways.

You definitely do not want to do the sorting in your app, because that means you have to get all documents in the collection into memory to sort them and return the window that you want. If your collection is huge, then this is extremely inefficient. The database should be doing the sorting and returning the window to you.

But, MongoDB doesn't support locale-sensitive sorting, you say. How do you solve the problem? The magic is the concept of "sort keys".

Basically, let's say you had the regular English/Latin alphabet from "a" to "z". What you would do is create a sort key mapping from "a" to "01" and from "b" to "02", etc., through to "z" to "26". That is, map every letter to a number in the sort order for that language and then encode that number as a string. Then, map the string you want to sort on to this type of sort key. For example, "abc" would become "010203". Then add a property to your document with the sort key for a property, and append the name of the property with the name of the the locale:

{
    name: "abc",
    name_en: "010203"
}

Now you can sort in the language "en" just by indexing on the property "name_en" and use plain old English-based MongoDB sorting for selectors and ranges instead of "name" property.

Now, let's say you have another crazy language "xx" where the order of the alphabet is "acb" instead of "abc". (Yes, there are languages that mess with the order of the Latin alphabet in that fashion!) The sort key would be like this:

{
    name: "abc",
    name_en: "010203",
    name_xx: "010302"
}

Now, all you have to do is create indexes on name_en and name_xx and use the regular MongoDB sort in order to sort correctly on those locales. Basically, the extra properties are proxies for sorting in different locales.

So where do you get these mappings, you ask? After all, you're no globalization expert, right?

Well, if you're using Java, C, or C++, there are ready-made classes that do this mapping for you. In Java, use the standard Collator class, or use the icu4j Collator class. If you are using C/C++, use C/C++ version of the ICU Collator functions/class. For other languages, you are sort of out-of-luck unless you can find a library that does it already.

Here are some links to help you find them:

The standard Java library Collator: http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html#getCollationKey(java.lang.String)

The C++ Collator class: http://icu-project.org/apiref/icu4c/classicu_1_1Collator.html#ae0bc68d37c4a88d1cb731adaa5a85e95

You can also make different sort keys that allow you to sort case-insensitively per locale (yes, case mapping is locale sensitive!) and accent-insensitively, and Unicode variant insensitive, or any combination of the above. The only problem is that now you have many properties that parallel each sortable property, and you have to keep them all in sync when you update the base "name" property. It is a pain in the you-know-what, but still, it is better than doing sorting in your app or business logic layer.

Also be careful of cursors with ranges. In English, for example, we just ignore accents on characters. So, an "Ö" sorts the same way as "O" and it will appear in the range "M" to "Z". But, in Swedish, accented characters sort after "Z". So, if you do a range "M" - "Z", you will include a bunch of records starting with "Ö" that should be there in English, but not in Swedish.

This also has implications in sharding if you split on a text property of a document. Be careful about what ranges go into which shard. It would be better to shard on things that are not locale-sensitive, like hashes.

like image 39
Edwin Hoogerbeets Avatar answered Nov 09 '22 01:11

Edwin Hoogerbeets


Right now, MongoDB does not implement collation.

Implementing the Unicode collation standard is the best way for solving that.

But this would make sorting slower, and indexes larger. So for now, it's best to sort in your application.

like image 30
saeed Avatar answered Nov 09 '22 02:11

saeed


An easy workaround is to create a new field with text converted to plain ascii characters.

{ "name": "Ánfora", "name_sort": "anfora" }
{ "name": "Óscar", "name_sort": "oscar" }
{ "name": "Barça", "name_sort": "barc~a" }
{ "name": "Niño", "name_sort": "nin~o" }
{ "name": "¡Hola!", "name_sort": "hola!" }
{ "name": "¿qué?", "name_sort": "que?" }

Then simply sort by 'name_sort'

like image 27
Pedro L. Avatar answered Nov 09 '22 02:11

Pedro L.