Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

awk: how to extract from file A the columns with indexes specified in file B?

Tags:

bash

unix

awk

My file A has the format below, with hundreds of thousands of columns and thousands of rows:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000 ....
1001->100101    DOSE    1.933   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000 ....

While my file B (thousands of entries) has the following format:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,803428,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,803442,29649737,12
20:29649765,T,G,0.99963,0.00037,0.99963,0.13918,,803443,29649765,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,803456,29650462,12

What I want to do is extract from file A the first two columns plus the columns specified in the "key" column from file B. I spent some time trying to figure out how to do this using awk, but it's been taking way too long for something that theoretically should not be too difficult.

My specific question is: once I extract the "key" entries from file B, how do I pipe these values into the awk command for file A?

Manual command for the first 3 columns in file B:

awk '{print $1, $2, $803428, $803442, $803442}' fileA > output.txt 

EDITS:

There is no common column between file A and file B. The values in column 803428 in file A would represent results for the SNP "20:29649365" from file B.

If the command to be run were:

awk '{print $1, $2, $3, $4, $5}' fileA

The result would be:

1000->100001 DOSE 2.000 2.000 2.000 2.000 2.000
1001->100101 DOSE 1.988 1.988 2.000 2.000 2.000
1002->100201 DOSE 1.933 1.999 2.000 2.000 2.000
1003->100301 DOSE 2.000 2.000 2.000 2.000 2.000
1004->100401 DOSE 2.000 2.000 2.000 2.000 2.000

Specific examples I am using to test the awk commands given as solutions.

fileA:

1000->100001    DOSE    2.000   2.000   2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000   2.000   2.000
1001->100101    DOSE    1.933   2.000   1.500   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000   2.000   1.622
1002->100201    DOSE    2.000   2.000   2.000   2.000   2.000

fileB:

SNP,Al1,Al2,Freq1,MAF,AvgCall,Rsq,Genotyped,key,pos,gene_key
20:29649365,C,T,0.93021,0.06979,0.93021,0.10115,,3,29649365,12
20:29649737,A,G,0.93914,0.06086,0.93914,0.14303,,4,29649737,12
20:29650462,A,T,0.89387,0.10613,0.89388,0.12864,,6,29650462,12

Desired output (columns 1, 2, 3, 4, 6 - last 3 from the key column of fileB):

1000->100001    DOSE    2.000   2.000   2.000
1001->100101    DOSE    1.988   1.988   2.000
1001->100101    DOSE    1.933   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000
1002->100201    DOSE    2.000   2.000   2.000
like image 980
AndraD Avatar asked Nov 22 '25 00:11

AndraD


2 Answers

Ok, here's an updated version which should reproduce your output.

awk 'ARGIND==2&&!/SNP/{cols[++i]=$9}ARGIND==4{printf("%s %s",$1,$2);
      for(j=1;j<=i;j++)printf(" %s%s",$cols[j],j<i?"":"\n");}' FS=',' B.txt FS='[ \t]+' A.txt 

It's complicated a bit by the different field separators in the two files and the header in the B-file that must be ignored. But the key here is that $ accepts a variable, not just a constant integer. This approach stores the list of columns in the cols array, and iterates through them for each row in A.

Switching field separators between files is done via dummy files FS=',' and 'FS='[ \t]+'`, which means that our actual files have arginds of 2 and 4.

like image 77
amaurea Avatar answered Nov 23 '25 15:11

amaurea


if you want to read the column index from fileB (the key column, not the pos) and print those columns from fileA, also keep the order of the col index in fileB, you can try:

awk 'NR==FNR{c[NR]=$(NF-2);n=NR;next}
{printf "%s %s",$1,$2;
for(i=2;i<=n;i++)printf " %s",$c[i];print ""}' FS=',' fileB FS=' ' fileA
like image 28
Kent Avatar answered Nov 23 '25 16:11

Kent



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!