Why does the Unix utility join yield different results on different Linux distributions?

Tags: linux, join, unix, gnu

I have two sorted files:

cat file1
1
3

cat file2
C 1 D
B 2 E
A 3 F

I run this command:

join -1 1 -2 2 -v2 file1 file2

With GNU coreutils 6.9.92.4-f088d-dirt January 2008 on Debian 4.3.2-1.1 I get:

B 2 E

With GNU coreutils 8.12.197-032bb September 2011 on Ubuntu 4.4.3-4ubuntu5.1 (Ubuntu precise (12.04.2 LTS)) I get:

2 B E

Why do I get different results, and why can't I find this change documented anywhere? The relevant part of the man page reads the same on both systems:

   -a FILENUM
          print  unpairable  lines coming from file FILENUM, where FILENUM
          is 1 or 2, corresponding to FILE1 or FILE2

   -v FILENUM
          like -a FILENUM, but suppress joined output lines

Here is what I had to do to get identical output on both distributions:

join -1 1 -2 2 -v2 -o 2.1,2.2,2.3 file1 file2
tommy.carstensen asked Nov 17 '13 02:11

1 Answer

I found the answer. Here is part of the info join output for the newer version:

 `-o auto'

 If the keyword `auto' is specified, infer the output format from
 the first line in each file.  This is the same as the default
 output format but also ensures the same number of fields are
 output for each line.  Missing fields are replaced with the `-e'
 option and extra fields are discarded.

 Otherwise, construct each output line according to the format in
 FIELD-LIST.  Each element in FIELD-LIST is either the single
 character `0' or has the form M.N where the file number, M, is `1'
 or `2' and N is a positive field number.

 A field specification of `0' denotes the join field.  In most
 cases, the functionality of the `0' field spec may be reproduced
 using the explicit M.N that corresponds to the join field.
 However, when printing unpairable lines (using either of the `-a'
 or `-v' options), there is no way to specify the join field using
 M.N in FIELD-LIST if there are unpairable lines in both files.  To
 give `join' that functionality, POSIX invented the `0' field
 specification notation.

 All output lines--including those printed because of any -a or -v
 option--are subject to the specified FIELD-LIST.
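
To make the excerpt concrete, here is a minimal sketch that recreates the two files from the question and spells out both output orders with an explicit FIELD-LIST. The scratch directory and the variable names (tmpdir, original_order, join_field_first) are my own choices for illustration, and I have only reasoned about this against GNU join, so treat it as a sketch rather than a guarantee for every implementation.

```shell
# Recreate the two input files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '1\n3\n' > "$tmpdir/file1"
printf 'C 1 D\nB 2 E\nA 3 F\n' > "$tmpdir/file2"

# Explicit field list: print file2's fields in their original order.
original_order=$(join -1 1 -2 2 -v 2 -o 2.1,2.2,2.3 "$tmpdir/file1" "$tmpdir/file2")

# Explicit field list: join field first (0), then the other file2 fields.
join_field_first=$(join -1 1 -2 2 -v 2 -o 0,2.1,2.3 "$tmpdir/file1" "$tmpdir/file2")

# Show both results, one per line, then clean up.
printf '%s\n' "$original_order" "$join_field_first"
rm -rf "$tmpdir"
```

With the files above I would expect this to print B 2 E followed by 2 B E. Because the order is stated with -o in both commands, the result should no longer depend on how a given version formats unpairable lines by default.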

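Mystery solved. In the newer version, unpairable lines printed by -a or -v go through the same output formatting as joined lines, and the default format puts the join field first, so B 2 E comes out as 2 B E. The older version evidently printed such lines unchanged. I still think it is unfortunate that a core Unix utility changed behavior between versions. This has taught me to re-read the documentation when working on a new Linux distribution.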

tommy.carstensen answered Nov 04 '22 03:11