I have a problem while comparing 2 text files using awk. Here is what I want to do.
File1 contains a name in the first column which has to match the name in the first column of File2. That's easy, so far so good. If the names match, I need to check whether the number in the 2nd column of File1 lies within the numeric range given by columns 2 and 3 of File2 (see example). If it does, both matching lines should be printed as one line to a new file. I wrote something in awk and it gives me an output with correct assignments, but it misses the majority of them. Am I missing some kind of loop? Both files are sorted on the first column.
File1:
scaffold10| 300 T C 0.9695 0.0000
scaffold10| 456 T A 1.0000 0.0000
scaffold10| 470 C A 0.9906 0.0000
scaffold10| 600 T C 0.8423 0.0000
scaffold56| 5 A C 0.8423 0.0000
scaffold56| 1000 C T 0.8423 0.0000
scaffold56| 6000 C C 0.7518 0.0000
scaffold7| 2 T T 0.9046 0.0000
scaffold9| 300 T T 0.9034 0.0000
scaffold9| 10900 T G 0.9044 0.0000
File2:
scaffold10| 400 550
scaffold10| 700 800
scaffold56| 3 5000
scaffold7| 55 200
scaffold7| 214 567
scaffold7| 656 800
scaffold9| 234 675
scaffold9| 699 1254
scaffold9| 10887 11000
Desired output:
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
awk -F "\t" ' FNR==NR {b[$1]=$0; c[$1]=$1; d[$1]=$2; e[$1]=$3; next} for {if (c[$1]==$1 && d[$1]<=$2 && e[$1]>=$2) {print b[$1]"\t"$0}}' File1 File2 > out.txt
How can I get the output I want using awk? Any suggestions are very welcome...
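Use join to do a database-style join of the two files on the first column (join expects both inputs to be sorted on that column, which you say they are), and then use awk to filter out the incorrect matches: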
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8'
scaffold10| 456 T A 1.0000 0.0000 400 550
scaffold10| 470 C A 0.9906 0.0000 400 550
scaffold56| 5 A C 0.8423 0.0000 3 5000
scaffold56| 1000 C T 0.8423 0.0000 3 5000
scaffold9| 300 T T 0.9034 0.0000 234 675
scaffold9| 10900 T G 0.9044 0.0000 10887 11000
Or, if you want the output formatted the same way as in the example you gave:
$ join file1 file2 | awk '$2 >= $7 && $2 <= $8 { printf("%-12s %-5s %-3s %-3s %-8s %-8s %-12s %-5s %-5s\n", $1, $2, $3, $4, $5, $6, $1, $7, $8); }'
scaffold10| 456 T A 1.0000 0.0000 scaffold10| 400 550
scaffold10| 470 C A 0.9906 0.0000 scaffold10| 400 550
scaffold56| 5 A C 0.8423 0.0000 scaffold56| 3 5000
scaffold56| 1000 C T 0.8423 0.0000 scaffold56| 3 5000
scaffold9| 300 T T 0.9034 0.0000 scaffold9| 234 675
scaffold9| 10900 T G 0.9044 0.0000 scaffold9| 10887 11000
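Since the question asks for awk specifically, here is a minimal awk-only sketch of the same idea. One thing that is certain about the attempt in the question is that arrays keyed on the name alone (b[$1]=$0 and so on) keep only the last line seen for each name, which would explain missing matches. The sketch below reads the ranges file first and stores every range per name, then loops over those ranges for each position line. The function name match_ranges and the array names are made up for illustration, and it assumes whitespace-separated fields (add -F "\t" if your files are strictly tab-separated and fields can contain spaces).

```shell
# Hypothetical helper: match_ranges RANGES_FILE POSITIONS_FILE
# RANGES_FILE is File2 (name, start, end); POSITIONS_FILE is File1 (name, position, ...).
match_ranges() {
  awk '
    # First file (ranges): store every range for a name, not only the last one.
    FNR == NR {
      n = ++count[$1]
      lo[$1, n] = $2
      hi[$1, n] = $3
      line[$1, n] = $0
      next
    }
    # Second file (positions): compare the position with each stored range.
    # The "+ 0" forces a numeric comparison.
    {
      for (i = 1; i <= count[$1]; i++)
        if ($2 + 0 >= lo[$1, i] + 0 && $2 + 0 <= hi[$1, i] + 0)
          print $0, line[$1, i]
    }
  ' "$1" "$2"
}
```

It would be called as match_ranges File2 File1 > out.txt. Note the argument order: the ranges file goes first so that it is the one held in memory. Unlike join, this does not need the files to be sorted.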