I have two files:
file1 has the format:
field1;field2;field3;field4
(file1 is initially unsorted)
file2 has the format:
field1
(file2 is sorted)
I run the 2 following commands:
sort -t\; -k1 file1 -o file1 # to sort file 1
join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2
I get the following message:
join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order
Why is this happening?
(I also tried sorting file1 on the entire line rather than only the first field of the line, but with no success.)
sort -t\; -c file1
doesn't output anything. Around line 27497 the situation is indeed strange, which suggests that sort doesn't do its job correctly:
XYZ113017;...
line 27497--> XYZ11301;...
XYZ11301;...
To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post while researching a slightly different problem):
With join, the input files must be sorted by the join field ONLY; otherwise you may see the warning reported by the OP.
There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:
If you do specify a field, it's easy to forget that you must also specify a stop field - even if you target only 1 field - because sort uses the remainder of the line if only a start field is specified; e.g.:
sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
sort -t, -k1,1 ... # Field 1 only
If your sort field is the FIRST field in the input, it's tempting to not specify any field selector at all.
sort ... # NOT always the same as 'sort -k1,1'! see below for example
Pitfall example:
#!/usr/bin/env bash
# Input data: fields separated by '^'.
# Note that, when properly sorting by field 1, the order should
# be "nameA" before "nameAA" (followed by "nameZ").
# Note how "nameA" is a substring of "nameAA".
read -r -d '' input <<EOF
nameA^other1
nameAA^other2
nameZ^other3
EOF
# NOTE: "WRONG" below refers to deviation from the expected outcome
# of sorting by field 1 only, based on mistaken assumptions.
# The commands do work correctly in a technical sense.
echo '--- just sort'
sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort FROM field 1'
sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first
echo '--- sort with field 1 ONLY'
sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first
Explanation:
When NOT limiting sorting to the first field, it is the relative sort order of the chars. ^ and A (column index 6) that matters in this example. In other words: the field separator is compared to data, which is the source of the problem: ^ has a HIGHER ASCII value than A, and therefore sorts after 'A', resulting in the line starting with nameAA^ sorting BEFORE the one with nameA^.
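Applied to the situation in the question, the same effect can be sketched as follows. This is only an illustration under assumptions: the two sample lines and the scratch directory are made up, and LC_ALL=C is pinned for both sort and join so that they agree on collation. With the whole line as the key, '7' is compared against the separator ';' and XYZ113017 ends up first; with field 1 only, XYZ11301 comes first and join accepts the input.

```shell
# Hypothetical sample data in the question's ';'-separated layout.
tmpdir=$(mktemp -d)
printf '%s\n' 'XYZ11301;d;e;f' 'XYZ113017;a;b;c' > "$tmpdir/file1"
printf '%s\n' 'XYZ11301' 'XYZ113017' > "$tmpdir/file2"

# Key runs from field 1 to the end of the line: ';' is compared with '7'.
whole_first=$(LC_ALL=C sort -t';' -k1 "$tmpdir/file1" | head -n 1)

# Key is field 1 only: sort both inputs in place on the join field.
LC_ALL=C sort -t';' -k1,1 -o "$tmpdir/file1" "$tmpdir/file1"
LC_ALL=C sort -t';' -k1,1 -o "$tmpdir/file2" "$tmpdir/file2"
key_first=$(head -n 1 "$tmpdir/file1")

# Join under the same locale that was used for sorting.
joined=$(LC_ALL=C join -t';' -1 1 -2 1 -o 1.1,1.2,1.3,1.4 \
  "$tmpdir/file1" "$tmpdir/file2")

echo "first line, key from field 1 to end of line: $whole_first"
echo "first line, key is field 1 only:             $key_first"
printf '%s\n' "$joined"

rm -rf "$tmpdir"
```

Both sort and join consult the locale's collation order, so running them under the same LC_ALL value avoids a second source of disagreement.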
Note: It is possible for problems to surface on one platform, but be masked on another, based on locale and character-set settings and/or the sort implementation used; e.g., with a locale of en_US.UTF-8 in effect, with , as the separator and - permissible inside fields:
sort as used on OSX 10.10.2 (which is an old GNU sort version, 5.93) sorts , before - (in line with ASCII values)
sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: sorts - before , [1]
[1] I don't know why - if somebody knows, please tell me. Test with sort <<<$'-\n,'
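One way to take locale-dependent collation out of the picture is to force the C locale, which compares bytes. The sketch below is a small check, not a diagnosis of the difference described above: it compares the C locale with whatever locale is currently in effect, and the variable names are made up for illustration.

```shell
# ',' is byte 0x2C and '-' is byte 0x2D, so the C locale puts ',' first.
c_first=$(printf '%s\n' '-' ',' | LC_ALL=C sort | head -n 1)
echo "C locale sorts first: $c_first"

# The result under the current locale may differ by platform and sort version.
cur_first=$(printf '%s\n' '-' ',' | sort | head -n 1)
echo "current locale sorts first: $cur_first"
```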