I want to filter out the lines that has "synonymous" in the 3rd column. The command is like below
awk '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
But the fileCSV2.csv still contains the word "synonymous" in the 3rd column. I wonder what might be wrong?
Two lines from the fileCSV.csv:
"exonic","LINC00115","synonymous SNV","uc010nxx.2:c.C299T:p.P100L",,"0.99",,0.56,rs3115849,,,,,,,,,,,,,chr1,762273,762273,G,A,"chr1","762273",".","G","A","30483.62","PASS","AC=24;AF=1.00;AN=24;DP=2972;FS=0.000;MLEAC=8;MLEAF=1.00;MQ0=0;VQSLOD=19.50;culprit=FS;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,2:2:6:66,6,0","1/1:0,297:297:99:10476,951,0","1/1:0,304:304:99:10098,950,0","1/1:0,295:295:99:9869,929,0","1/1:0,292:292:99:8655,895,0","1/1:0,304:304:99:10006,965,0","1/1:0,179:179:99:5862,568,0","1/1:0,273:273:99:9328,851,0","1/1:0,279:279:99:7946,850,0","1/1:0,283:283:99:9214,866,0","1/1:0,8:8:21:229,21,0","1/1:0,456:456:99:16385,1285,0"
"exonic","SAMD11","synonymous SNV","uc001abw.1:c.T1027C:p.W343R","559;Name=lod=249",,,1.00,rs6672356,1,0.916445,N,0.0,T,0.0,B,0.998605,N,4.19E-4,N,3.17,chr1,877831,877831,T,C,"chr1","877831",".","T","C","3594.56","PASS","AC=24;AF=1.00;AN=24;DP=387;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=60.00;MQ0=0;VQSLOD=15.00;culprit=DP;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,3:3:9:97,9,0","1/1:0,3:3:12:113,12,0","1/1:0,64:64:99:1805,189,0","1/1:0,57:57:99:1605,168,0","1/1:0,30:30:90:768,90,0","1/1:0,69:69:99:2026,216,0","1/1:0,15:15:45:428,45,0","1/1:0,23:23:81:809,81,0","1/1:0,22:22:69:562,69,0","1/1:0,40:40:99:1142,117,0","1/1:0,3:3:9:94,9,0","1/1:0,58:58:99:14,7,0"
Filtering. awk provides filtering capabilities like those supported by grep and sed plus some nifty features of its own. And similar to many command line utilities, awk can accept input from both stdin and files.
Steps. Make a list of columns that have to be extracted. Use read_csv() method to extract the csv file into data frame. Print the exracted data.
If your fileCSV.csv has columns separated by ,
than you need to
awk -F, '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
If -F
does not work with your version of awk
try
awk 'BEGIN{FS=","} $3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
EDIT: you also need to take "
into account, so use /^"synonymous/
To process csv file using awk I would prefer the following method to automatically account for quotation marks, namely preprocess with sed.
For your concrete question I would use
sed -e 's/^"//;s/"$//' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
If you also want to correctly process files with string fields containing quotation marks (which will be represented by double quotation marks in csv files), you need to change the sed expression the following way,
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
This method has the advantage that it allows you to correctly print or process some fields using awk. For example if you want to print the first and fifth field from the filtered lines, seperated by a :
you can now use
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/ { print $1,":",$5}'
(If the difference between the methods is not clear to you, you can try the last awk command without the sed preprocessing)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With