Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word count for sequence length is wrong

Tags:

bash

unix

awk

wc

I have a fasta file that looks like this :

>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA

and I want the length of the sequence (without the header). I tried this:

tail -n +2 my.file | wc -c

which gives me this output:

1349

which is wrong, the real size is 1330.

I'm not sure what's going on. I'm thinking there's probably some sort of hidden characters but I don't know how to explore this.

like image 932
EvenStar69 Avatar asked Apr 03 '18 14:04

EvenStar69


2 Answers

It is because wc is counting all the line breaks as well.

You may use awk to get this done:

awk 'NR>1{s+=length()} END{print s}' my.file

1330

You may also use tail | tr | wc:

tail -n +2 my.file | tr -d '\n' | wc -c
1330
like image 200
anubhava Avatar answered Nov 12 '22 21:11

anubhava


EDIT: Adding 1 more solution of awk here too.

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}'  Input_file

OR

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS=""  Input_file

OR

awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}'  Input_file

Following awk may help you on same.

awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}'  Input_file
like image 3
RavinderSingh13 Avatar answered Nov 12 '22 20:11

RavinderSingh13