Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Writing a better natural sort (than mine)

I added an answer to this question here: Sorting List<String> in C# which calls for a natural sort order, one that handles embedded numbers.

My implementation, however, is naive, and in lieu of all the posts out there about how applications doesn't handle Unicode correctly by assuming things (Turkey test anyone?), I thought I'd ask for help writing a better implementation. Or, if there is a built-in method of .NET, please tell me :)

My implementation for the answer in that question just goes through the strings, comparing character by character, until it encounters a digit in both. Then it extracts consecutive digits from both strings, which can result in varying lengths, pads the shortest with leading zeroes, and then compares.

However, there's problems with it.

For instance, what if you in string x have two codepoints which together make the character È, but in the other string you have just one codepoint, the one that is that character.

My algorithm would fail on those, since it would treat the diacritic codepoint as a single character, and compare it to the È from the other string.

Can anyone guide me towards how to handle this properly? I want support for specifying a CultureInfo object to handle language problems, like comparing "ss" with "ß" in Germany, and similar things.

I think I need to get my code to enumerate over "real characters" (I don't know the real term here) instead of individual codepoints.

What's the right approach to this?

Also, if "natural" means "the way humans expect it to work", I would add the following things to ponder:

  • What about dates and times?
  • What about floating point values?
  • Are there other sequences which are considered "natural"?
    • How far should this be stretched? (Eeny, meeny, miny, moe)
like image 763
Lasse V. Karlsen Avatar asked Sep 15 '10 11:09

Lasse V. Karlsen


People also ask

What is meant by natural sorting?

In computing, natural sort order (or natural sorting) is the ordering of strings in alphabetical order, except that multi-digit numbers are treated atomically, i.e., as if they were a single character.

Is natural order ascending or descending?

Quite an old question, but very simply put, the Natural Order is an ascending order of the enumerable collection of the comparable elements: For the numbers: 1, 2, 3...

What is natural order in statistics?

The "natural order" is what is broadly understood as the default ordering for a particular data type. Humans decide the convention - it isn't ordained. Alphabetical strings are always "naturally" sorted in lexicographical order. Dates by date order. Numerics are sorted by magnitude.


2 Answers

This is already available in Windows, the shell uses natural sort order when arranging the files in an Explorer window. The comparison function it uses is exported and available to any program, at least since Windows 2000. While P/Invoke isn't the greatest solution, it does have the considerable advantage of having been tested billions of times in the past 10 odd years. And sorting strings in a way that the user is already well familiar with.

Handling diacritics is already part of .NET, the string.Normalize() method takes care of it.

Here's a sample program that uses it, it properly sorts the strings as requested in the original thread:

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class Program {
    static void Main(string[] args) {
        string[] arr = new string[] { "1", "5", "3", "6", "11", "9", "NUM1", "NUM0" };
        Array.Sort(arr, new LogicalComparer());
        foreach (string s in arr) Console.WriteLine(s);
        Console.ReadLine();
    }
}
class LogicalComparer : IComparer<string> {
    public int Compare(string x, string y) {
        return StrCmpLogicalW(x.Normalize(), y.Normalize());
    }
    [DllImport("shlwapi.dll", CharSet = CharSet.Unicode, ExactSpelling = true)]
    private static extern int StrCmpLogicalW(string s1, string s2);
}
like image 144
Hans Passant Avatar answered Oct 06 '22 21:10

Hans Passant


I don't know much about .NET, but since it's also an algorithmic question, here are my two cents:

I'd try to split the string into tokens, probably using regular expressions. Then you can compare the strings token by token, using an appropriate comparison function depending on the type of token.

More specifically:

  1. Define regular expressions for dates, numbers, words, ... The last of those should be a fallback expression which matches any character.
  2. Try each expression, most specific first, until one matches at the beginning of both strings
  3. Extract the part that matches and compare it using the appropriate comparison function.
  4. In case of equality, remove the match from the beginning of both strings and repeat from step 2.

Using regular expressions, it should also be possible to support unicode, if you do not use [a-zA-Z] but proper character classes like [:alpha:].

As for the comparison of the different forms of È, you can try to normalize the string first.

like image 43
Jonas Wagner Avatar answered Oct 06 '22 21:10

Jonas Wagner