
Why are spaces ignored in natsort / strnatcmp / strnatcasecmp?

I'm using strnatcmp in my comparison function for sorting person names in a table. For our Belgian client, we get some strange results. They have names like 'Van der Broecke' and 'Vander Veere', and strnatcasecmp("Van der", "Vander") returns 0!

As natural comparison aims to sort as a human would, I don't understand why the spaces are completely disregarded.

E.g.:

$names = array("Van de broecke", "Vander Veere", "Vande Muizen", "Vander Zoeker", "Van der Programma", "vande Huizen", "vande Kluizen", "vander Muizen", "Van der Luizen");
natcasesort($names);

print_r($names);

Gives:

Array ( 
[0] => Van de broecke 
[5] => vande Huizen 
[6] => vande Kluizen 
[2] => Vande Muizen 
[8] => Van der Luizen 
[7] => vander Muizen 
[4] => Van der Programma 
[1] => Vander Veere 
[3] => Vander Zoeker 
)

But a human would say:

Array ( 
[0] => Van de broecke 
[8] => Van der Luizen 
[4] => Van der Programma 
[5] => vande Huizen 
[6] => vande Kluizen 
[2] => Vande Muizen 
[7] => vander Muizen 
[1] => Vander Veere 
[3] => Vander Zoeker 
)

My solution now is to replace all spaces with underscores, which are handled properly. Two questions: Why does natsort work like this? Is there a better solution?

Spork asked Aug 16 '13 15:08


2 Answers

If you look in the source code you can actually see this, which definitely seems like a bug: http://gcov.php.net/PHP_5_3/lcov_html/ext/standard/strnatcmp.c.gcov.php (scroll down to line 130):

 //inside a while loop...

 /* Skip consecutive whitespace */
 while (isspace((int)(unsigned char)ca)) {
         ca = *++ap;
 }

 while (isspace((int)(unsigned char)cb)) {
         cb = *++bp;
 }

Note that this links to the 5.3 source, but the same code still exists in 5.5 (http://gcov.php.net/PHP_5_5/lcov_html/ext/standard/strnatcmp.c.gcov.php).

Admittedly my knowledge of C is limited, but this appears to advance the pointer on each string whenever the current character is whitespace, which effectively drops that character from the comparison. The comment implies that it only does this for consecutive whitespace; however, nothing checks that the previous character was also whitespace. That would need something like:

//declare these outside the loop
int prevAIsSpace = 0;
int prevBIsSpace = 0;

//....in the loop
while (prevAIsSpace && isspace((int)(unsigned char)ca)) {
    //won't get here the first time since prevAIsSpace == 0
    ca = *++ap;
}
//now if the character is a space, flag it for the next iteration
prevAIsSpace = isspace((int)(unsigned char)ca);
//repeat with string b
while (prevBIsSpace && isspace((int)(unsigned char)cb)) {
    cb = *++bp;
}
prevBIsSpace = isspace((int)(unsigned char)cb);

Someone who actually knows C could probably write this better, but that's the general idea.
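
To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is not PHP's strnatcmp implementation: cmp_skip_all_spaces and cmp_collapse_spaces are hypothetical names made up for this illustration, the natural handling of digit runs is left out entirely, and the comparison is simply case-insensitive byte by byte. The first function skips whitespace on every pass, as the quoted loop appears to do; the second treats a run of whitespace as one space, which is what the comment seems to promise.

```c
#include <ctype.h>

/* Hypothetical helper: skips ALL whitespace on every pass of the loop,
 * mirroring the effect of the snippet quoted from strnatcmp.c.
 * Digit runs get no special treatment in this sketch. */
int cmp_skip_all_spaces(const char *a, const char *b)
{
    for (;;) {
        while (isspace((unsigned char)*a)) a++;
        while (isspace((unsigned char)*b)) b++;

        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);

        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == '\0') return 0;   /* both strings ended together */
        a++;
        b++;
    }
}

/* Hypothetical alternative: a run of whitespace counts as ONE space,
 * so a space still separates words but its length does not matter. */
int cmp_collapse_spaces(const char *a, const char *b)
{
    for (;;) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);

        /* Normalise any whitespace character to a plain space. */
        if (isspace(ca)) ca = ' ';
        if (isspace(cb)) cb = ' ';

        if (ca != cb) return ca < cb ? -1 : 1;
        if (ca == '\0') return 0;   /* both strings ended together */

        if (ca == ' ') {
            /* Both sides are at whitespace: consume the whole run. */
            while (isspace((unsigned char)*a)) a++;
            while (isspace((unsigned char)*b)) b++;
        } else {
            a++;
            b++;
        }
    }
}
```

With these definitions, cmp_skip_all_spaces("Van der", "Vander") returns 0, matching the result reported in the question, while cmp_collapse_spaces("Van der", "Vander") returns a negative value, so "Van der ..." names sort ahead of "Vander ..." names.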

On another potentially interesting note, for your example, if you're using PHP >= 5.4, this gives the same result as the usort mentioned by Aaron Saray (it does lose the key/value associations as well):

sort($names, SORT_FLAG_CASE | SORT_STRING);

print_r($names);

Gives:

Array ( 
    [0] => Van de broecke 
    [1] => Van der Luizen 
    [2] => Van der Programma 
    [3] => vande Huizen 
    [4] => vande Kluizen 
    [5] => Vande Muizen 
    [6] => vander Muizen 
    [7] => Vander Veere 
    [8] => Vander Zoeker
) 

ChicagoRedSox answered Oct 20 '22 01:10


Take a look at bugs.php.net #26412 (natsort() was compressing multiple spaces to 1 space). Apparently, this behavior exists so that "aa", "a a", and "a  a" (note the 2 spaces) do not sort as identical strings.

Mark Leighton Fisher answered Oct 20 '22 02:10