How to compare two files on line by line basis regardless of order?

Question

I have two files and I want to check that every line in one file also exists in the other. However, the order of the words after the second word in a line sometimes differs. That's OK, because I am only interested in missing or additional words after the first two words/columns.

file_A:

    foobar A a ab c bd hd
    bar B a c jd sm sldkjn
    baz C boo abd

file_B:

    foobar A a c bd hd ab
    baz C abd boo
    bar B c a jd sm sldkjn

In the example above, the two files match according to my criteria.

At first I tried

    $ sort -u file_A > outA
    $ sort -u file_B > outB
    $ diff outA outB

This way, line order is not taken into account, but the word order within each line still is.

How can I disregard the order of words on each line after the second column?

Ed Morton · Accepted Answer

With GNU awk for "sorted_in":

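    $ cat tst.awk
    # Traverse arrays in ascending order of their string values (GNU awk only)
    BEGIN { PROCINFO["sorted_in"] = "@val_str_asc" }
    {
        # Build a normalized key: the first two fields, then the remaining words sorted
        key = $1 FS $2
        $1 = $2 = ""
        split($0,f)
        for (i in f) {
            key = key FS f[i]
        }
        keys[key]    # record every key seen in either file
    }
    NR==FNR { a[key]++; next }    # first file: count occurrences per key
    { b[key]++ }                  # second file: count occurrences per key
    END {
        diff = 0

        for (key in keys) {
            if (a[key] > b[key]) {
                print "<", key
                diff = 1
            }
            else if (b[key] > a[key]) {
                print ">", key
                diff = 1
            }
        }

        # Exit status is 1 if the files differ, 0 otherwise
        exit diff
    }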

The per-key count and the later numeric comparison are necessary to identify cases where, for example, file_A lists a given key twice but file_B lists it only once, so the files should presumably be reported as different. For example:

    $ cat file_A
    foobar A a ab c bd hd
    bar B a c jd sm sldkjn
    baz C boo abd
    baz C boo abd

    $ cat file_B
    foobar A a c bd hd ab
    baz C abd boo
    bar B c a jd sm sldkjn

    $ awk -f tst.awk file_A file_B
    < baz C abd boo
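
If GNU awk is not available, the same idea can probably be approximated with POSIX tools by normalizing each line first and then reusing the sort/diff approach from the question. The following is only a sketch under some assumptions: the function names `normalize` and `compare_unordered` are made up for illustration, `mktemp` is assumed to be installed (it is common but not specified by POSIX), and plain `sort` is used instead of `sort -u` so that duplicated lines still count as a difference.

```shell
# normalize FILE
# Print every line of FILE with the words after the second column sorted,
# then sort the lines themselves so that line order does not matter either.
normalize() {
    awk '{
        $1 = $1    # rebuild the record so all output uses single spaces
        # insertion sort of fields 3..NF, forcing string comparison with ""
        for (i = 4; i <= NF; i++) {
            v = $i
            for (j = i - 1; j >= 3 && ($j "") > (v ""); j--)
                $(j + 1) = $j
            $(j + 1) = v
        }
        print
    }' "$1" | sort
}

# compare_unordered FILE_A FILE_B
# Returns the exit status of diff: 0 if the normalized files match, 1 if not.
compare_unordered() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    normalize "$1" > "$tmp_a"
    normalize "$2" > "$tmp_b"
    status=0
    diff "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

Calling `compare_unordered file_A file_B` prints ordinary diff output for the normalized lines. Unlike tst.awk it reads each file twice over (awk, then sort) and needs temporary files, so it is a fallback rather than a replacement.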

Tags: python, shell, ksh, diff, awk
