
awk code to filter lines in one file according to matching conditions in another file

Tags:

r

awk

I have two files that look like this:

file1 with 511 lines:

chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651

file2 with 12594903 lines:

CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 6702159 G T 378749 0.00305772 0.00604916 0.613223 0.0345562 rs116801199
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
1 23364524 C G 377155 0.00505093 0.00588132 0.390447 0.0368034 rs2977670

I am trying to remove all lines in file2 where POS (i.e. position) in file2 falls between start and end in file1 and where CHR in file2 is equal to chr in file1. I can do it in R with the following code, but it takes approx. 1 hour to run this way:

# note: single-bracket row filtering without a comma implies file2 is a data.table
for (row in 1:nrow(file1)) {
  file2 <- file2[!(file2$CHR == file1$chr[row] &
                      file2$POS >= file1$start[row] &
                      file2$POS <= file1$end[row])]
}

The output is meant to look like this, with the two lines that meet the conditions (data rows 7 and 9, rs116801199 and rs2977670) removed:

CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286

There must be a neat and quick awk one-liner to do this. I can't figure out how, though. Any help would be greatly appreciated!

asked Oct 15 '22 by euffelmann
1 Answer

Here is a very quick and ugly way to do it. It just creates an array indexed by every possible (chromosome, position) pair:

awk '(NR==FNR) {for(i=$2;i<=$3;++i) a[$1,i]; next} (FNR==1){print;next} !(($1,$2) in a)' file1 file2

The problem is that, even though it is quickly written down, it is very slow and memory-hungry: every single position inside every range becomes an array key ;-)
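To see the indexing idea in isolation, here is a miniature run on made-up inputs f1 and f2 (synthetic illustration data, not the question's files), assuming a POSIX-compatible awk:

```shell
# Hypothetical miniature inputs: one range (10-20) on chromosome 1.
cat > f1 <<'EOF'
chr start end
1 10 20
EOF
cat > f2 <<'EOF'
CHR POS X
1 5 keep
1 15 drop
1 25 keep
EOF
# Every position 10..20 on chr 1 becomes a key; records of f2 whose
# (CHR,POS) pair is one of those keys are filtered out.
awk '(NR==FNR) {for(i=$2;i<=$3;++i) a[$1,i]; next}
     (FNR==1) {print; next}
     !(($1,$2) in a)' f1 f2
```

This prints the header and the two keep lines; "1 15 drop" falls inside the 10-20 range and is removed.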

A quicker way might be:

awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next}
     { t=$0;c=$1;p=$2; $0=a[c] }
     { for(i=1;i<=NF;i+=2) if( $i<=p && p<=$(i+1) ) next; print t }' file1 file2

In the latter, we create lists like:

a[1] = "1227897 2779043 6644723 8832944 11067792 11372913 17287414 17342924 23254576 23590651"

Every time a record of file2 is read, we store the full record in t and keep track of the CHR value (c=$1) and the position (p=$2). Assigning a[c] to $0 recomputes all the fields, so $1=1227897, $2=2779043, etc. All we have to do then is jump over the fields in steps of two and check whether the current position lies in the respective range. If it does, we skip to the next record; otherwise we print the saved record t.
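Putting the question's sample data into file1 and file2, the quicker version reproduces the expected output (assuming a POSIX-compatible awk):

```shell
cat > file1 <<'EOF'
chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651
EOF
cat > file2 <<'EOF'
CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 6702159 G T 378749 0.00305772 0.00604916 0.613223 0.0345562 rs116801199
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
1 23364524 C G 377155 0.00505093 0.00588132 0.390447 0.0368034 rs2977670
EOF
# Per-chromosome list of start/end pairs; any record of file2 whose
# POS falls inside one of the pairs for its CHR is skipped.
awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next}
     { t=$0;c=$1;p=$2; $0=a[c] }
     { for(i=1;i<=NF;i+=2) if( $i<=p && p<=$(i+1) ) next; print t }' file1 file2
```

This drops rs116801199 and rs2977670 and keeps the other eight lines (header included).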

As mentioned in SiegeX's comment, this can be sped up if file2 is sorted by position: a range that ends before the current position can never match any later record, so we can prune it from the list as we go:

awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next }
     { t=$0; c=$1; p=$2; $0=a[c] }
     { for(i=1;i<=NF;i+=2) {
         if ( $(i+1) < p )              { $i=$(i+1)="" }  # range ends before POS: prune it
         else if ( $i<=p && p<=$(i+1) ) { next }          # POS falls in a range: drop record
       }
       $0=$0; $1=$1; a[c]=$0                              # re-split to drop blanked fields, store pruned list
       print t }' file1 file2
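A corrected sketch of this pruning variant can be checked on miniature sorted inputs (f1 and f2 below are made-up illustration data, assuming a POSIX-compatible awk):

```shell
# Hypothetical inputs: two ranges on chr 1; f2 is sorted by POS.
cat > f1 <<'EOF'
chr start end
1 10 20
1 30 40
EOF
cat > f2 <<'EOF'
CHR POS X
1 5 keep
1 15 drop
1 25 keep
1 35 drop
1 45 keep
EOF
# Since f2 is sorted by POS, a range whose end is already below the
# current POS can never match again; it is blanked out of the list.
awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next }
     { t=$0; c=$1; p=$2; $0=a[c] }
     { for(i=1;i<=NF;i+=2) {
         if ( $(i+1) < p )              { $i=$(i+1)="" }
         else if ( $i<=p && p<=$(i+1) ) { next }
       }
       $0=$0; $1=$1; a[c]=$0
       print t }' f1 f2
```

This prints the header and the three keep lines; both drop lines fall inside a range, and by the last record the list for chromosome 1 has been pruned down to nothing.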
answered Oct 19 '22 by kvantour