"average length of the sequences in a fasta file": Can you improve this Erlang code?

Question

I'm trying to get the mean length of fasta sequences using Erlang. A fasta file looks like this

>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)

I tried to answser this question using the following Erlang code:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case io:get_line(S,'') of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p
",[Total/(1.0*Sequences)]),
    halt().

Compilation/Execution:

erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16

this code seems to work fine for a small fasta file but it takes hours to parse a larger one (>100Mo). Why ? I'm an Erlang newbie, can you please improve this code ?

Hynek -Pichi- Vychodil · Accepted Answer

If you need really fast IO then you have to do little bit more trickery than usual.

-module(g).
-export([s/0]).
s()->
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().
r(P, C, L) ->
  receive
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C])
  end.

It is fastest IO as I know but note -noshell -noinput. Compile just like erlc +native +"{hipe, [o3]}" g.erl but with -smp disable

erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl

and run:

time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s

With -smp enable but native it takes:

$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s

Byte code but with -smp disable (almost in par with native because most of work is done in port!):

$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s

Just for completeness byte code with smp:

$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s

For comparison sarnold version gives me wrong answer and takes more on same HW:

$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s

EDIT: I have looked at characteristics of uniprot_sprot.fasta and I'm little bit surprised. It is 3824397 rows and 232MB. It means that -smp disabled version can handle 1.18 million text lines per second (71MB/s in line oriented IO).

sarnold · Answer

I too am learning Erlang, thanks for the fun question.

I understand working with Erlang strings as lists of characters can be very slow; if you can work with binaries instead you should see some performance gains. I don't know how you would use arbitrary-length strings with binaries, but if you can sort it out, it should help.

Also, if you don't mind working with a file directly rather than standard_io, perhaps you could speed things along by using file:open(..., [raw, read_ahead]). raw means the file must be on the local node's filesystem, and read_ahead specifies that Erlang should perform file IO with a buffer. (Think of using C's stdio facilities with and without buffering.)

I'd expect the read_ahead to make the most difference, but everything with Erlang includes the phrase "benchmark before guessing".

EDIT

Using file:open("uniprot_sprot.fasta", [read, read_ahead]) gets 1m31s on the full uniprot_sprot.fasta dataset. (Average 359.04679841439776.)

Using file:open(.., [read, read_ahead]) and file:read_line(S), I get 0m34s.

Using file:open(.., [read, read_ahead, raw]) and file:read_line(S), I get 0m9s. Yes, nine seconds.

Here's where I stand now; if you can figure out how to use binaries instead of lists, it might see still more improvement:

-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case file:read_line(S) of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            {ok, Line} -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    F = file:open("/home/sarnold/tmp/uniprot_sprot.fasta", [read, read_ahead, raw]),
    case F of
    { ok, File } -> 
        {Sequences,Total}=scanLines(File,0,0),
        io:format("~p
",[Total/(1.0*Sequences)]);
    { error, Reason } ->
            io:format("~s", Reason)
    end,
    halt().

"average length of the sequences in a fasta file": Can you improve this Erlang code?

Tags:

string

erlang

sequence

mean

bioinformatics

Pierre

2 Answers

Hynek -Pichi- Vychodil

sarnold

Recent Activity

Donate For Us

"average length of the sequences in a fasta file": Can you improve this Erlang code?

Tags:

string

erlang

sequence

mean

bioinformatics

Pierre

2 Answers

Hynek -Pichi- Vychodil

sarnold

Related questions

Recent Activity

Donate For Us