I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:
use Unicode::Collate with $Collator = Unicode::Collate->new();
use Unicode::Collate::Locale with $Collator = Unicode::Collate->new(locale => $ENV{'LANG'});
use locale
Each one of them failed with the following errors (from the Perl side):
The only method that worked for me involved setting LC_ALL=C for sort and using 8-bit characters in Perl. However, this way Unicode strings are not properly ordered.
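A minimal sketch of the shell side of that byte-order workaround, assuming a POSIX or GNU sort. The name sort_bytes is a hypothetical helper, not part of any tool. With LC_ALL=C, sort compares raw byte values, so uppercase ASCII letters come before lowercase ones and UTF-8 strings come out in code point order rather than in a linguistically meaningful order.

```shell
# Hypothetical helper: sort stdin (or the given files) by raw byte value,
# bypassing the collation rules of the current locale.
sort_bytes() {
  LC_ALL=C sort "$@"
}

# Usage: printf 'b\nA\na\nB\n' | sort_bytes
```

This is the same order that a plain cmp in Perl produces on byte strings, which is why the two sides agree in this setup.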
Using Unicode::Collate or Unicode::Collate::Locale makes no sense. You're not trying to sort based on Unicode definitions, you're trying to sort based on your locale. That's what use locale; is for.
I don't know why you didn't get the desired order out of cmp under use locale;.
You could process the decompressed files, that is, expand each count back into repeated lines, then sort and count again.
for q in file1.uniqc file2.uniqc ; do
    perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c
It'll require more temporary storage, of course, but you'll get exactly the order you want.
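The expand, sort, and recount idea can be sketched as a pair of shell functions. This is only an illustration: expand_counts and merge_counts are hypothetical names, and awk is swapped in for the Perl one-liner purely to keep the sketch free of a Perl dependency. It assumes each input line has the usual uniq -c shape of optional padding, a count, one space, then the original text.

```shell
# Hypothetical helper: turn "uniq -c" output back into repeated lines.
expand_counts() {
  awk '{
    n = $1                          # leading count
    sub(/^[ \t]*[0-9]+ /, "")       # strip padding, count, and one space
    for (i = 0; i < n; i++) print   # repeat the original line n times
  }' "$@"
}

# Hypothetical helper: merge several "uniq -c" files into one.
# sort runs under the current locale, so the order is exactly sort's own.
merge_counts() {
  expand_counts "$@" | sort | uniq -c
}

# Usage: merge_counts file1.uniqc file2.uniqc
```

Because the sort utility does all the ordering here, Perl's collation never enters the picture; the cost is the extra temporary storage for the expanded lines.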
I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.
$ export LC_COLLATE=en_US.UTF-8
$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1
$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1
$ sort data
(
1
($1
Truth be told, it's the sort utility that's weird.
In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.