
Linux Sort vs Perl String Comparison

Because I was dealing with very large files, I sorted my base and candidate files before comparing them, so that I could find the lines each was missing from the other without keeping the records in memory. I did the sorting with the Linux command-line tool sort.

In my Perl script, I check whether the current line of one file is lt, gt, or eq to the current line of the other file, advancing the file pointers where necessary. However, I hit a problem: my string comparison considered strings in the base file to be lt a string in the candidate file that contained special characters, which did not match the order sort had produced.

Is there a surefire way to make sure that Linux sort and my Perl string comparisons use the same string comparator?

syker asked Feb 26 '23 08:02


1 Answer

The sort command uses the current locale to determine the sort order for characters. The collation locale is taken from the environment variable LC_ALL if it is set, otherwise from LC_COLLATE, otherwise from LANG. Usually the easiest way to fix sorting issues is to set LC_ALL to the C locale, which treats each 8-bit byte as a single character and compares by simple numeric value. In most shells this can be done as a one-off for a single command by prefixing it like so:

LC_ALL=C sort < infile > outfile

This will also solve similar problems for some other text-processing programs. (E.g. I recall problems working with CSV files on a German person's computer -- this was traced back to the fact that Germans use a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that issue too.)
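To make the effect on sort concrete, here is a minimal sketch. The helper name sort_bytes is hypothetical, introduced only for illustration; it wraps the same LC_ALL=C prefix shown above so that the ordering does not depend on the caller's environment.

```shell
# Hypothetical helper: sort stdin (or the named files) in plain byte order,
# regardless of the locale set in the caller's environment.
sort_bytes() {
    LC_ALL=C sort "$@"
}

# In the C locale, uppercase ASCII letters (bytes 65-90) sort before
# lowercase ASCII letters (bytes 97-122).
printf 'b\nA\na\nB\n' | sort_bytes
```

This prints A, B, a, b on separate lines. A locale such as en_US.UTF-8 typically interleaves upper and lower case instead, and that kind of difference is what makes a byte-wise comparison disagree with the sorted file.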

[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as streams of 8-bit bytes, so the above approach should produce the same order as Perl's sort() function and its lt, gt, eq, and cmp operators, provided the script does not enable use locale. (Thanks to Ven'Tatsu for this nugget.)
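Before feeding the files to the Perl script, it may be worth checking that they really are in byte order. The sketch below uses sort's -c option, which checks the order without sorting; the helper name is_byte_sorted and the file names base.sorted and candidate.sorted are placeholders, not anything from the question.

```shell
# Hypothetical helper: succeed only if the file is already in byte order.
# sort -c checks the order without sorting; its diagnostic is discarded.
is_byte_sorted() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}

# base.sorted and candidate.sorted are placeholder file names.
for f in base.sorted candidate.sorted; do
    [ -f "$f" ] || continue
    is_byte_sorted "$f" || echo "$f is not in byte order; re-sort it with LC_ALL=C" >&2
done
```

If either file fails the check, it was probably sorted under a different locale, and a byte-wise lt/gt walk over it can skip or misreport lines.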

j_random_hacker answered Feb 27 '23 21:02