
Why does sort -u treat U+2161 and U+2162 as the same character?

I have a file with two characters each on its own line:

$ cat roman
Ⅱ
Ⅲ

But when I sort this file with sort -u, only one line is displayed:

$ sort -u roman
Ⅱ

Ⅱ is code point U+2161 and Ⅲ is code point U+2162. Why is only one line displayed?

EDIT

$ xxd -g 1 roman
0000000: e2 85 a1 0a e2 85 a2 0a                          ........


$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

My sort is the one from GNU coreutils:

$ sort --version
sort (GNU coreutils) 8.15
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.
Yishu Fang, asked Dec 27 '12 17:12


1 Answer

Try setting LC_COLLATE=C; does that fix it? This works for me:

$ export LANG=en_US.UTF-8
$ export LANGUAGE=en_US:en
$ export LC_CTYPE="en_US.UTF-8"
$ export LC_NUMERIC=en_US.UTF-8
$ export LC_TIME=en_US.UTF-8
$ export LC_COLLATE="en_US.UTF-8"
$ export LC_MONETARY=en_US.UTF-8
$ export LC_MESSAGES="en_US.UTF-8"
$ export LC_PAPER=en_US.UTF-8
$ export LC_NAME=en_US.UTF-8
$ export LC_ADDRESS=en_US.UTF-8
$ export LC_TELEPHONE=en_US.UTF-8
$ export LC_MEASUREMENT=en_US.UTF-8
$ export LC_IDENTIFICATION=en_US.UTF-8
$ export LC_ALL=
$ sort -u foo.txt |wc -l         # <-- with your env variables
1
$ export LC_COLLATE=C
$ sort -u foo.txt |wc -l         # <-- with LC_COLLATE changed to C
2
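
If exporting LC_COLLATE for the whole session is more than you want, the override can be scoped to a single command. A minimal sketch, assuming the file is named roman as in the question and rebuilding it from the bytes in the xxd dump; LC_ALL is used here because it takes precedence over both LC_COLLATE and LANG:

```shell
# Recreate the two-line file from the question (octal escapes for e2 85 a1 / e2 85 a2).
printf '\342\205\241\n\342\205\242\n' > roman

# Byte-wise collation for this one command only; the shell's own locale is untouched.
line_count=$(LC_ALL=C sort -u roman | wc -l | tr -d ' ')
echo "unique lines in C locale: $line_count"
```

In the C locale sort compares raw bytes, so the two lines differ in their last byte and both survive.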

Looking at my copy of /usr/share/i18n/locales/en_US, I see:

LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE

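That is presumably where the behavior comes from, since sort -u drops lines that compare as equal under the active collation rules. I'm not sure why the template makes these two characters collate as equal, though.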
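
To check that the tie comes from the locale's collation rather than from sort itself, the two characters can be compared directly. This is only a sketch, assuming an expr that compares strings with the locale's collation sequence as POSIX specifies (GNU coreutils does; other implementations may use a plain byte comparison). The variable names are made up for illustration:

```shell
# UTF-8 bytes of the two characters, written as octal escapes.
two=$(printf '\342\205\241')    # U+2161
three=$(printf '\342\205\242')  # U+2162

# expr prints 1 when the comparison holds and 0 when it does not.
# "|| true" absorbs the nonzero exit status that accompanies a 0 result.
same_c=$(LC_ALL=C expr "$two" = "$three" || true)
echo "C locale: equal=$same_c"

# Same comparison under whatever locale the shell is currently using.
same_cur=$(expr "$two" = "$three" || true)
echo "current locale: equal=$same_cur"
```

In the C locale this reports equal=0. If the second line reports equal=1, the active locale gives both characters the same collation weight, which would explain why sort -u keeps only one of them.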

Edward Loper, answered Oct 16 '22 05:10