I have a fasta file that looks like this :
>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA
and I want the length of the sequence (without the header). I tried this:
tail -n +2 my.file | wc -c
which gives me this output:
1349
which is wrong, the real size is 1330.
I'm not sure what's going on. I'm thinking there's probably some sort of hidden characters but I don't know how to explore this.
It is because wc
is counting all the line breaks as well.
You may use awk
to get this done:
awk 'NR>1{s+=length()} END{print s}' my.file
1330
You may also use tail | tr | wc
:
tail -n +2 my.file | tr -d '\n' | wc -c
1330
EDIT: Adding 1 more solution of awk
here too.
awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}' Input_file
OR
awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS="" Input_file
OR
awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}' Input_file
Following awk
may help you on same.
awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}' Input_file
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With