Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWK to filter CSV files

Tags:

csv

awk

I want to filter out the lines that has "synonymous" in the 3rd column. The command is like below

awk '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv

But the fileCSV2.csv still contains the word "synonymous" in the 3rd column. I wonder what might be wrong?

Two lines from the fileCSV.csv:

"exonic","LINC00115","synonymous SNV","uc010nxx.2:c.C299T:p.P100L",,"0.99",,0.56,rs3115849,,,,,,,,,,,,,chr1,762273,762273,G,A,"chr1","762273",".","G","A","30483.62","PASS","AC=24;AF=1.00;AN=24;DP=2972;FS=0.000;MLEAC=8;MLEAF=1.00;MQ0=0;VQSLOD=19.50;culprit=FS;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,2:2:6:66,6,0","1/1:0,297:297:99:10476,951,0","1/1:0,304:304:99:10098,950,0","1/1:0,295:295:99:9869,929,0","1/1:0,292:292:99:8655,895,0","1/1:0,304:304:99:10006,965,0","1/1:0,179:179:99:5862,568,0","1/1:0,273:273:99:9328,851,0","1/1:0,279:279:99:7946,850,0","1/1:0,283:283:99:9214,866,0","1/1:0,8:8:21:229,21,0","1/1:0,456:456:99:16385,1285,0"    
"exonic","SAMD11","synonymous SNV","uc001abw.1:c.T1027C:p.W343R","559;Name=lod=249",,,1.00,rs6672356,1,0.916445,N,0.0,T,0.0,B,0.998605,N,4.19E-4,N,3.17,chr1,877831,877831,T,C,"chr1","877831",".","T","C","3594.56","PASS","AC=24;AF=1.00;AN=24;DP=387;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=60.00;MQ0=0;VQSLOD=15.00;culprit=DP;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,3:3:9:97,9,0","1/1:0,3:3:12:113,12,0","1/1:0,64:64:99:1805,189,0","1/1:0,57:57:99:1605,168,0","1/1:0,30:30:90:768,90,0","1/1:0,69:69:99:2026,216,0","1/1:0,15:15:45:428,45,0","1/1:0,23:23:81:809,81,0","1/1:0,22:22:69:562,69,0","1/1:0,40:40:99:1142,117,0","1/1:0,3:3:9:94,9,0","1/1:0,58:58:99:14,7,0"
like image 430
TonyGW Avatar asked Oct 31 '13 16:10

TonyGW


People also ask

What is awk filtering?

Filtering. awk provides filtering capabilities like those supported by grep and sed plus some nifty features of its own. And similar to many command line utilities, awk can accept input from both stdin and files.

How do I extract a column from a CSV file?

Steps. Make a list of columns that have to be extracted. Use read_csv() method to extract the csv file into data frame. Print the exracted data.


2 Answers

If your fileCSV.csv has columns separated by , than you need to

awk -F, '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv

If -F does not work with your version of awk try

awk 'BEGIN{FS=","} $3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv

EDIT: you also need to take " into account, so use /^"synonymous/

like image 91
rzymek Avatar answered Oct 18 '22 13:10

rzymek


To process csv file using awk I would prefer the following method to automatically account for quotation marks, namely preprocess with sed.

For your concrete question I would use

sed -e 's/^"//;s/"$//' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'

If you also want to correctly process files with string fields containing quotation marks (which will be represented by double quotation marks in csv files), you need to change the sed expression the following way,

sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'

This method has the advantage that it allows you to correctly print or process some fields using awk. For example if you want to print the first and fifth field from the filtered lines, seperated by a : you can now use

sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/ { print $1,":",$5}'

(If the difference between the methods is not clear to you, you can try the last awk command without the sed preprocessing)

like image 21
alphanum Avatar answered Oct 18 '22 15:10

alphanum