Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

unix sort -n -t"," gives unexpected result

unix numeric sort gives strange results, even when I specify the delimiter.

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

For this example, sort gives the same result regardless if you specify the delimiter. I know if I set LC_ALL=C then sort starts to give expected behavior again. But I do not understand why the default environment settings, as shown below, would make this happen.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

I've read from many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but still, this behavior is incredibly weird and unpredictable and has caused me a week of heartache. Can someone explain why sort with default environment settings on Mac OS X (10.8.5) would behave this way? In other words: what is sort doing (with local variables set to en_US.UTF-8) to get that result?

I'm using

 sort 5.93                        November 2005

 $ type sort
 sort is /usr/bin/sort

UPDATE

I've discussed this on the gnu-coreutils list and now understand why sort with english unicode default locale settings gave the output it did. Because in English unicode, the comma character "," is considered a numeric (so as to allow for comma's as thousand's (or e.g. hundreds) separators), and sort defaults to "being greedy" when it interprets a line, it read the example numbers as approximately

581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...

Although this was not what I had intended and chepner is right that to get the actual result I want, I need to specify that I want sort to key on only the first field. sort defaults to interpreting more of the line as a key rather than just the first field as a key.

This behavior of sort has been discussed in gnu-coreutil's FAQ, and is further specified in the POSIX description of sort.

So that, as Eric Blake on the gnu-coreutil's list put it, if the field-separator is also a numeric (which a comma is) then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."

like image 250
gabe Avatar asked Oct 07 '13 15:10

gabe


1 Answers

I'm not sure this is entirely correct, but it's close.

sort -n -t, will try to sort numerically by the given key(s). In this case, the key is a tuple consisting of an integer and a float. Such tuples cannot be sorted numerically.

If you explicitly specify which single keys to sort on with

sort -k1,1n -k2,2n -t,

it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).

I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.

like image 105
chepner Avatar answered Sep 28 '22 12:09

chepner