Can someone explain the behavior of the sort command with the character œ in a French locale?
$ file file.txt
file.txt: UTF-8 Unicode text, with CRLF line terminators
$ wc -l file.txt
4 file.txt
$ cat file.txt
cœz
coez
coe
cœ
$ sort file.txt
coe
cœ
cœz
coez
$ sort -d file.txt
cœ
coe
coez
cœz
$ env | grep -P "(LC|FR)"
LANG=fr_FR.UTF-8
Whether "œ" sorts before or after "oe" seems random with a regular sort, whereas the character seems to be simply ignored with a dictionary sort (sort -d).
I guess it has something to do with collation, but I'd like to have some insight here.
Dictionary sort may be ignoring the œ ligature because it is not in the range a-zA-Z in ASCII. (This is a guess.)
Then, in the French locale, "œ" and "oe" appear to compare as equal, so such lines should come out in whatever order they went in, which seems to be what you are seeing. If that's correct, then if you put this in:
cœz
coez
cœm
coem
coep
cœp
coe
cœ
You should get this:
coe
cœ
cœm
coem
coep
cœp
cœz
coez
You can use the -c (check whether the file is already sorted) or -r (reverse order) options to investigate further.
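One way to test the guess above is to sort the same eight lines under different locales and compare. The following is only a sketch: the scratch file and the variable names are made up for illustration, the French part runs only if fr_FR.UTF-8 is installed, and -s (stable sort, which disables sort's last-resort byte comparison) is assumed to be available, as it is in GNU sort. The C-locale run is a baseline, since that locale compares raw bytes.

```shell
# Hypothetical scratch file; the name is illustrative, not from the post.
f=$(mktemp)
printf '%s\n' cœz coez cœm coem coep cœp coe cœ > "$f"

# Baseline: the C locale compares raw bytes, so every "co..." line sorts
# before every "cœ..." line ("o" is 0x6F, "œ" starts with byte 0xC5).
c_sorted=$(LC_ALL=C sort "$f")
printf '%s\n' "$c_sorted"

# French collation, only if that locale exists on this machine.
if locale -a 2>/dev/null | grep -qiE '^fr_FR\.utf-?8$'; then
    echo '--- fr_FR.UTF-8 ---'
    LC_ALL=fr_FR.UTF-8 sort "$f"
    # With -s, lines that collate as equal keep their input order.
    echo '--- fr_FR.UTF-8, stable (-s) ---'
    LC_ALL=fr_FR.UTF-8 sort -s "$f"
fi

rm -f "$f"
```

If the two French-locale runs differ, some lines collate as equal and the plain run is ordering them by its byte-level fallback rather than by input order. If they are identical, the locale itself is probably distinguishing "œ" from "oe".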