Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count coincidences on each character of two large strings without triggering the Out of Local Stack exception?

I need a clause that counts char coincidences between two large strings but omitting '_' coincidences. I have this code:

fit(GEN1, GEN2, N, N) :-
   length(GEN1, L1),
   length(GEN2, L2),
   0 is L1*L2.

fit([P1|R1], [P2|R2], N, TOTAL) :-
   member(P1, ['_',a,c,t,g]),
   member(P2, ['_',a,c,t,g]),
   append([P1],[P2],T),
   (  member(T,[[a,a],[c,c],[t,t],[g,g]])
   -> X is N+1
   ;  X is N
   ),
   fit(R1,R2,X,TOTAL).

Where GEN1 and GEN2 are lists containing all characters large strings.

I've tried increasing the stack limit to avoid Out of Local Stack exception with little success.

The issue is that, is called often and in deep recursive clauses. Is there any better way to do this?

EDIT

The clause needs to stop when one or both lists are empty.

EDIT 2

Is worth saying that testings on all answers below were done using 64bit prolog, with the --stack-limit=32g option as my code isn't well optimized and the fit clause is a small part of a larger process, but was the main problem with my code.

EDIT 3

CapelliC code worked using the less resources.

false code using the library(reif) v2 worked the faster.

See Complexity of counting matching elements in two sequences using library(aggregate) for more proposed solutions.

like image 609
RezzDeWitt Avatar asked Mar 01 '23 15:03

RezzDeWitt


2 Answers

It seems that there is no point to insist that you have letters out of "_actg" all the time. A generalized definition seems to be sufficient. Using library(reif):

fit([], _, N,N).
fit([_|_], [], N,N).
fit([P1|R1], [P2|R2], N,TOTAL) :-
   if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
   fit(R1, R2, X,TOTAL).

Update: please make sure to use v2 of library(reif). The original version did not compile dif/3.

And here a version for systems that can only index on one argument simultaneously:

fit([], _, N,N).
fit([P1|R1], L2, N,TOTAL) :-
    ifit(L2, [P1|R1], N,TOTAL).

ifit([], _, N,N).
ifit([P2|R2], [P1|R1], N,TOTAL) :-
   if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
   fit(R1, R2, X,TOTAL).
like image 187
false Avatar answered Apr 28 '23 03:04

false


if your Prolog has library(aggregate) you can do

fit(GEN1, GEN2, N) :-
  aggregate_all(count, (nth1(P,GEN1,S),nth1(P,GEN2,S),memberchk(S,[a,c,g,t])), N).

edit

Depending on the statistic of data, a noticeable improvement can be obtained just swapping the last two calls, i.e. ...(nth1(P,GEN1,S),memberchk(S,[a,c,g,t]),nth1(P,GEN2,S))...

edit

Of course a tight loop it's better that a double indexed scan. For performance, I would write it like

fit_cc(GEN1, GEN2, N) :-
  fit_cc(GEN1, GEN2, 0, N).

fit_cc([X|GEN1], [Y|GEN2], C, N) :-
  (  X\='_' /*memberchk(X, [a,c,g,t])*/, X=Y
  -> D is C+1 ; D=C
  ),
  fit_cc(GEN1, GEN2, D, N).
fit_cc(_, _, N, N).

but the generality and correctness allowed by library(reif) v2, as seen in @false' answer and comments, seems to be well worth the (pretty small) overhead.

like image 33
CapelliC Avatar answered Apr 28 '23 02:04

CapelliC