I have two files that look like this:
file1 with 511 lines:
chr start end
1 1227897 2779043
1 6644723 8832944
1 11067792 11372913
1 17287414 17342924
1 23254576 23590651
file2 with 12594903 lines:
CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 6702159 G T 378749 0.00305772 0.00604916 0.613223 0.0345562 rs116801199
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
1 23364524 C G 377155 0.00505093 0.00588132 0.390447 0.0368034 rs2977670
I am trying to remove all lines in file2 where POS (i.e. position) in file2 falls between start and end of file1 and where CHR in file2 is equal to chr in file1. I can do it in R with the following code, but it takes approx. 1 hour to run this way:
for (row in 1:nrow(file1)) {
  file2 <- file2[!(file2$CHR == file1$chr[row] &
                   file2$POS >= file1$start[row] &
                   file2$POS <= file1$end[row]), ]
}
The output is meant to look like this, with the two lines (data lines 7 and 9 of file2) that meet the conditions removed:
CHR POS REF A1 OBS_CT BETA SE P ALT_FREQS RSID_UKB
1 58396 T C 382851 0.0882097 0.0677502 0.192923 0.000249012 rs570371753
1 91588 G A 382852 0.265908 0.0879796 0.00250811 0.000148375 rs554639997
1 713979 C G 382837 0.00630607 0.0925289 0.945664 0.000138059 rs117217250
1 715265 C T 377557 0.00260829 0.00617561 0.672768 0.0331599 rs12184267
1 715367 A G 377954 0.00212642 0.00615857 0.729886 0.0333038 rs12184277
1 717485 C A 377980 0.00449142 0.00615965 0.465899 0.0332908 rs12184279
1 9902231 G C 378573 0.00216983 0.00607117 0.720793 0.0342995 rs12565286
There must be a neat and quick awk one-liner to do this. I can't figure out how, though. Any help would be greatly appreciated!
Here is a very quick and ugly way to do it. It just creates an array indexed by every possible chromosome/position pair:
awk '(NR==FNR) {for(i=$2;i<=$3;++i) a[$1,i]; next}(FNR==1);!(($1,$2) in a)' file1 file2
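As an aside, the ($1,$2) in a test relies on awk's multidimensional array keys (the indices are joined with SUBSEP). A minimal standalone sketch of that membership test, with toy values:

awk 'BEGIN { a["1",58396]; if ( ("1",58396) in a ) print "found" }'

Merely referencing a["1",58396] creates the key, while the parenthesized (i,j) in a form checks for it without creating anything.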
The problem is that, even though it is quickly written down, it is very slow: it creates one array entry for every single position inside every range, which here means millions of entries ;-)
A quicker way might be:
awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next}
{ t=$0;c=$1;p=$2; $0=a[c] }
{ for(i=1;i<=NF;i+=2) if( $i<=p && p<=$(i+1) ) next; print t }' file1 file2
In the latter, we create lists like:
a[1] = "1227897 2779043 6644723 8832944 11067792 11372913"
Every time a new record of file2 is read, we store the record under t and keep track of the CHR value (c=$1) and the position (p=$2). By assigning a[c] to $0, we recompute all the fields, so $1=1227897, $2=2779043, etc. So all we have to do is jump in groups of two over the fields and check whether the current position lies in the respective range; if it does, we skip to the next record. Note that the header line of file2 survives automatically: a["CHR"] is empty, so the loop body never runs and print t is reached.
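The $0 reassignment trick can be seen in isolation with a minimal standalone sketch (toy values):

awk 'BEGIN { $0="1227897 2779043 6644723 8832944"; print NF, $1, $4 }'

which prints 4 1227897 8832944: assigning a string to $0 makes awk re-split it into fields.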
As SiegeX mentioned in a comment, this can be sped up if file2 is sorted by position: once a range ends before the current position, no later record on that chromosome can fall inside it, so the range can be dropped from the list for good:
awk '(NR==FNR) { a[$1]=a[$1] FS $2 FS $3; next}
     { t=$0; c=$1; p=$2; $0=a[c] }
     { for(i=NF-1;i>0;i-=2) {
         # this range ends before p: it can never match again, drop it
         if ( $(i+1) < p ) { $i=$(i+1)=""; $0=$0; $1=$1; a[c]=$0 }
         # p falls inside this range: suppress the record
         else if ( $i<=p && p<=$(i+1) ) next
       }
       print t
     }' file1 file2
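The pruning relies on a small awk idiom: blank the two fields of a spent range, then force a re-split so the empty fields vanish and NF shrinks. A minimal sketch of just that idiom, with toy values:

awk 'BEGIN {
  $0 = "10 20 30 40"   # two start/end pairs
  $1 = $2 = ""         # blank the first pair
  $0 = $0; $1 = $1     # re-split and rebuild: the empty fields vanish, NF drops to 2
  print NF, $0         # prints: 2 30 40
}'

As with the other one-liners, the surviving records go to standard output, so redirect to keep them (e.g. > file2.filtered, just a placeholder name).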