I have a file with two characters, each on its own line:
$ cat roman
Ⅱ
Ⅲ
But when I sort this file with sort -u, only one line is displayed:
$ sort -u roman
Ⅱ
Ⅱ is code point U+2161 and Ⅲ is code point U+2162. Why is only one line displayed?
EDIT
$ xxd -g 1 roman
0000000: e2 85 a1 0a e2 85 a2 0a ........
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
My sort is from GNU coreutils:
$ sort --version
sort (GNU coreutils) 8.15
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and Paul Eggert.
Try setting LC_COLLATE=C; does that fix it? This works for me:
$ export LANG=en_US.UTF-8
$ export LANGUAGE=en_US:en
$ export LC_CTYPE="en_US.UTF-8"
$ export LC_NUMERIC=en_US.UTF-8
$ export LC_TIME=en_US.UTF-8
$ export LC_COLLATE="en_US.UTF-8"
$ export LC_MONETARY=en_US.UTF-8
$ export LC_MESSAGES="en_US.UTF-8"
$ export LC_PAPER=en_US.UTF-8
$ export LC_NAME=en_US.UTF-8
$ export LC_ADDRESS=en_US.UTF-8
$ export LC_TELEPHONE=en_US.UTF-8
$ export LC_MEASUREMENT=en_US.UTF-8
$ export LC_IDENTIFICATION=en_US.UTF-8
$ export LC_ALL=
$ sort -u foo.txt |wc -l # <-- with your env variables
1
$ export LC_COLLATE=C
$ sort -u foo.txt |wc -l # <-- with LC_COLLATE changed to C
2
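If you'd rather not change LC_COLLATE for the whole shell session, a per-command override should behave the same way. This is only a sketch: it assumes the file holds the same bytes as the xxd dump in the question, and the temporary file is purely for illustration.

```shell
# Recreate the two-line file from the question.
# Octal escapes are the UTF-8 bytes e2 85 a1 (U+2161) and e2 85 a2 (U+2162).
tmpfile=$(mktemp)
printf '\342\205\241\n\342\205\242\n' > "$tmpfile"

# LC_ALL=C applies to this one command only and takes precedence over
# LC_COLLATE, so sort compares raw bytes and keeps both lines.
count=$(LC_ALL=C sort -u "$tmpfile" | wc -l)
echo "$count"   # prints 2

rm -f "$tmpfile"
```

Prefixing the variable to the command leaves the rest of the session in en_US.UTF-8, which matters if other tools rely on locale-aware behavior.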
Looking at my copy of /usr/share/i18n/locales/en_US, I see:
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
This is presumably where the behavior comes from, though I'm not sure why it makes these two characters collate as equal.