
Faster Alternative to Unix Grep

Tags: grep, unix
I'm trying to do the following:

$ grep ">" file.fasta > output.txt

But it takes a very long time when the input FASTA file is large.

The input file looks like this:

>seq1
ATCGGTTA
>seq2
ATGGGGGG

Is there a faster alternative?

neversaint asked Dec 12 '22 01:12


2 Answers

Use the time command to measure each of these:

$ time grep ">" file.fasta > output.txt

$ time egrep ">" file.fasta > output.txt

$ time awk '/^>/{print $0}' file.fasta > output.txt

Note that the awk pattern only matches when ">" is the first character of the line.

If you compare the timings, they are almost the same.

In my opinion, if the data is in columnar format, then use awk to search.

Debaditya answered Dec 31 '22 15:12


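You could use a hand-built state machine. As written, it starts echoing at the first '>' it finds on a line and copies everything up to and including the newline. If you only want '>' to be accepted at the beginning of the line, you'll need one more state. If you need to recognise '\r' too, you will need a few more states.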

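#include <stdio.h>

int main(void)
{
    int state, ch;

    for (state = 0; (ch = getc(stdin)) != EOF; ) {
        switch (state) {
        case 0: /* start: skip input until a '>' is seen */
            if (ch == '>') state = 1;
            else break;
            /* fall through so that the '>' itself is echoed */
        case 1: /* echo: copy characters through the end of the line */
            fputc(ch, stdout);
            if (ch == '\n') state = 0;
            break;
        }
    }
    /* input ended inside a header line that had no trailing newline */
    if (state == 1) fputc('\n', stdout);
    return 0;
}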

If you want real speed, you could replace fputc() with its macro equivalent putc() (the loop already uses getc() rather than fgetc()). But I think trivial programs like this will be I/O bound anyway.
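To illustrate the "one more state" variant mentioned above, here is a minimal sketch that only accepts '>' as the first character of a line. It is not part of the original program: the function name copy_header_lines and the state names LINE_START, SKIP and ECHO are chosen here purely for illustration, and '\r' handling is still left out.

```c
#include <stdio.h>

/* Hypothetical state names, chosen for this sketch only. */
enum state {
    LINE_START, /* at the first character of a line */
    SKIP,       /* inside a line that does not start with '>' */
    ECHO        /* inside a line that starts with '>' */
};

/* Copy only the lines whose first character is '>' from in to out. */
void copy_header_lines(FILE *in, FILE *out)
{
    enum state state = LINE_START;
    int ch;

    while ((ch = getc(in)) != EOF) {
        switch (state) {
        case LINE_START:
            if (ch == '>') {
                putc(ch, out);
                state = ECHO;
            } else if (ch != '\n') {
                state = SKIP;
            }
            /* an empty line leaves us in LINE_START */
            break;
        case SKIP:
            if (ch == '\n') state = LINE_START;
            break;
        case ECHO:
            putc(ch, out);
            if (ch == '\n') state = LINE_START;
            break;
        }
    }
    /* terminate a final header line that had no trailing newline */
    if (state == ECHO) putc('\n', out);
}
```

Calling copy_header_lines(stdin, stdout) from main() should give a filter that behaves like grep "^>" on input of this shape, except that it always ends the output with a newline.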

wildplasser answered Dec 31 '22 15:12