Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing 2 files using AWK with multiple parameters

Tags:

bash

awk

I have a problem while comparing 2 text files using awk. Here is what I want to do.

File1 contains a name in the first column which has to match the name in the first column of file2. That's easy - so far so good. Then if this matches, I need to check whether the number in the 2nd column of file1 lays within the numeric range of column 2 and 3 in file2 (see example). If that's the case print both matching lines as one line to a new file. I wrote something in awk and it gives me an output with correct assignments but it misses the majority. Am I missing some kind of loop function? The files are both sorted according to the first column.

File1:

scaffold10|   300   T   C   0.9695   0.0000
scaffold10|   456   T   A   1.0000   0.0000
scaffold10|   470   C   A   0.9906   0.0000
scaffold10|   600   T   C   0.8423   0.0000
scaffold56|   5     A   C   0.8423   0.0000
scaffold56|   1000  C   T   0.8423   0.0000
scaffold56|   6000  C   C   0.7518   0.0000
scaffold7|    2     T   T   0.9046   0.0000
scaffold9|    300   T   T   0.9034   0.0000
scaffold9|    10900 T   G   0.9044   0.0000

File2:

scaffold10|   400   550   
scaffold10|   700   800    
scaffold56|   3     5000  
scaffold7|    55    200  
scaffold7|    214   567   
scaffold7|    656   800  
scaffold9|    234   675  
scaffold9|    699   1254 
scaffold9|    10887 11000   

Output:

scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675 
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000 

My awk try:

awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt

How can I get the output I want using awk? Any suggestions are very welcome...

like image 278
bex Avatar asked Jun 07 '26 00:06

bex


1 Answers

Use join to do a database style join of the two files and then use AWK to filter out the incorrect matches:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000

Or if you want the output formatted the same the way it is in the example you gave:

$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10|  456   T   A   1.0000   0.0000   scaffold10|  400   550
scaffold10|  470   C   A   0.9906   0.0000   scaffold10|  400   550
scaffold56|  5     A   C   0.8423   0.0000   scaffold56|  3     5000
scaffold56|  1000  C   T   0.8423   0.0000   scaffold56|  3     5000
scaffold9|   300   T   T   0.9034   0.0000   scaffold9|   234   675
scaffold9|   10900 T   G   0.9044   0.0000   scaffold9|   10887 11000
like image 113
Ross Ridge Avatar answered Jun 10 '26 06:06

Ross Ridge



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!