I'm trying to do the following:
$ grep ">" file.fasta > output.txt
But it takes a very long time when the input FASTA file is large.
The input file looks like this:
>seq1
ATCGGTTA
>seq2
ATGGGGGG
Is there a faster alternative?
Run each of these under the time command and compare:
$ time grep ">" file.fasta > output.txt
$ time egrep ">" file.fasta > output.txt
$ time awk '/^>/{print $0}' file.fasta > output.txt
(The awk version only matches when ">" is the first character of the line.)
If you look at the output, the timings are almost the same.
In my opinion, if the data is in columnar format, use awk to search.
Hand-built state machine. If you only want '>' to be accepted at the beginning of the line, you'll need one more state. If you need to recognise '\r' too, you will need a few more states.
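#include <stdio.h>

int main(void)
{
    int state, ch;

    for (state = 0; (ch = getc(stdin)) != EOF; ) {
        switch (state) {
        case 0: /* start: skip characters until a '>' is seen */
            if (ch == '>') state = 1;
            else break;
            /* fall through so the '>' itself is echoed */
        case 1: /* echo: copy characters up to and including the newline */
            fputc(ch, stdout);
            if (ch == '\n') state = 0;
            break;
        }
    }
    /* terminate the last line if the input ended without a newline */
    if (state == 1) fputc('\n', stdout);
    return 0;
}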
If you want real speed, you could replace fputc() with its macro equivalent putc() (the loop already reads with getc() rather than fgetc()). But I think trivial programs like this will be I/O bound anyway.