
AWK compare two columns in two separate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(

The files are tab-separated, and I tried something like this:

zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'

Has anybody tried to do a similar thing? :)

Thanks in advance for help!

Monika asked Mar 14 '23 19:03



1 Answer

Your script is fine, but you need to give awk each file as a separate input rather than one concatenated stream, and in reverse order (file2 first, then file1).

$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400

$ cat file2.txt
. . . . 200
. . . . 400

$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
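To see why the piped version from the question prints nothing, here is a small sketch comparing the two invocations. The scratch directory and the sample data are my own assumptions (tab-separated, to match the question), not the OP's real files.

```shell
# Hypothetical scratch directory and sample data, tab-separated like the question's files.
tmp=$(mktemp -d)
printf 'a\tb\tc\td\t100\nx\ty\tz\tw\t200\n' > "$tmp/file1.txt"
printf '.\t.\t.\t.\t200\n' > "$tmp/file2.txt"

# One concatenated stream: FNR never resets, so NR==FNR holds for every line,
# every line hits "next", and nothing is printed.
piped=$(cat "$tmp/file2.txt" "$tmp/file1.txt" | awk -F'\t' 'NR==FNR{a[$5];next} $5 in a')

# Two separate file arguments: FNR resets at file1.txt, so the lookup runs there.
separate=$(awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' "$tmp/file2.txt" "$tmp/file1.txt")

printf 'piped:    [%s]\n' "$piped"
printf 'separate: [%s]\n' "$separate"
rm -rf "$tmp"
```

The piped result is empty, while the two-argument form prints the matching line from file1.txt.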

EDIT:

As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:

$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt.gz) <(zcat file1.txt.gz)
x y z w 200
1 2 3 4 400
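Note that `<( ... )` is process substitution, which works in bash, zsh, and ksh but not in plain POSIX sh. If that is a concern, one possible workaround (a sketch, not the only way) is to load the lookup file inside a BEGIN block with `getline` and stream the other file on stdin. The temp directory, the sample data, and the variable names `cmd`, `line`, and `f` are assumptions for illustration.

```shell
# Hypothetical compressed, tab-separated sample files.
tmp=$(mktemp -d)
printf 'a\tb\tc\td\t100\nx\ty\tz\tw\t200\n' | gzip > "$tmp/file1.txt.gz"
printf '.\t.\t.\t.\t200\n' | gzip > "$tmp/file2.txt.gz"

# BEGIN reads the lookup file (file2) through a command pipe and stores column 5;
# the main input (file1) then arrives on stdin and is filtered by "$5 in a".
matches=$(gzip -dc "$tmp/file1.txt.gz" |
  awk -F'\t' -v cmd="gzip -dc '$tmp/file2.txt.gz'" '
    BEGIN {
      while ((cmd | getline line) > 0) { split(line, f, "\t"); a[f[5]] }
      close(cmd)
    }
    $5 in a')

printf '%s\n' "$matches"
rm -rf "$tmp"
```

This no longer depends on NR==FNR at all, at the cost of a slightly longer program.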

Explanation:

NR is the number of records read so far across all input, and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
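A quick way to watch the two counters diverge is to print them next to the file name. The file names first.txt and second.txt are made up for this sketch.

```shell
# Hypothetical two-line and three-line files in a scratch directory.
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/first.txt"
printf 'three\nfour\nfive\n' > "$tmp/second.txt"

# Print the overall record number (NR) next to the per-file one (FNR).
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' first.txt second.txt)
printf '%s\n' "$counters"
rm -rf "$tmp"
```

This prints:

```
first.txt 1 1
first.txt 2 2
second.txt 3 1
second.txt 4 2
second.txt 5 3
```

NR and FNR agree only on the first.txt lines.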

a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to make a nice collection of things. This is a pithy way to make a collection of all the values we've seen in the 5th column of the first file. The next statement, which follows, says to immediately get the next available record without running any more statements in the awk program.
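As a small illustration of the array acting as a set, the sketch below feeds in made-up lines where 200 appears twice in column 5 and then lists the stored indices.

```shell
# Hypothetical sample input; note that 200 appears twice in column 5.
# Merely referencing a[$5] creates the index with an empty value, so duplicates collapse.
# The order of a for-in loop is unspecified in awk, hence the sort.
keys=$(printf '. . . . 200\n. . . . 400\n. . . . 200\n' |
  awk '{ a[$5] } END { for (k in a) print k }' | sort)
printf '%s\n' "$keys"
```

Only two indices come out, 200 and 400, even though three lines went in.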

Summarizing the above, this line says "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."

NR == FNR { a[$5]; next }

Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
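One caveat that is my addition rather than part of the original answer: if the first file is empty, awk reads no records from it, so NR == FNR also holds for every line of the second file. A possible guard is to compare FILENAME with ARGV[1] instead. The sketch below counts how many lines get past the first rule in each variant; the file contents are made up.

```shell
tmp=$(mktemp -d)
: > "$tmp/file2.txt"                        # empty lookup file (hypothetical)
printf 'x y z w 200\n' > "$tmp/file1.txt"

# NR==FNR: file1.txt is mistaken for the first file, so no line gets past "next".
reached_nr=$(awk 'NR==FNR{next} {n++} END{print n+0}' "$tmp/file2.txt" "$tmp/file1.txt")

# FILENAME==ARGV[1]: the first file is identified by name, so file1.txt gets through.
reached_fn=$(awk 'FILENAME==ARGV[1]{next} {n++} END{print n+0}' "$tmp/file2.txt" "$tmp/file1.txt")

printf 'NR==FNR: %s  FILENAME==ARGV[1]: %s\n' "$reached_nr" "$reached_fn"
rm -rf "$tmp"
```

For the intersection asked about here, the final output is empty either way when file2 is empty, so this mostly matters if you adapt the idiom, for example to print the lines that do not match.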

$5 in a evaluates to true if the value of the 5th column occurs as an index in the a array. In other words, it is true for every record in file1.txt whose 5th column we saw as a value in the 5th column of file2.txt.

In awk, when the pattern portion evaluates to true, the accompanying action is invoked. When there's no action given, as below, the default action is triggered instead, which is to simply print the current record. Thus, by just saying $5 in a, we are telling awk to print all the records in file1.txt whose 5th column also occurs in file2.txt, which of course was the given requirement.

$5 in a
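To confirm that the bare pattern and the explicit {print $0} from the question behave the same, here is a sketch that runs both on the sample data from this answer, written to a scratch directory.

```shell
tmp=$(mktemp -d)
printf 'a b c d 100\nx y z w 200\np q r s 300\n1 2 3 4 400\n' > "$tmp/file1.txt"
printf '. . . . 200\n. . . . 400\n' > "$tmp/file2.txt"

# Pattern only: the default action prints the record.
implicit=$(awk 'NR==FNR{a[$5];next} $5 in a' "$tmp/file2.txt" "$tmp/file1.txt")
# Same pattern with the action written out.
explicit=$(awk 'NR==FNR{a[$5];next} $5 in a {print $0}' "$tmp/file2.txt" "$tmp/file1.txt")

printf '%s\n' "$implicit"
[ "$implicit" = "$explicit" ] && echo "identical output"
rm -rf "$tmp"
```

Both variants print the x y z w 200 and 1 2 3 4 400 lines.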
jas answered Mar 19 '23 07:03
