I need a clause that counts char coincidences between two large strings but omitting '_'
coincidences. I have this code:
fit(GEN1, GEN2, N, N) :-
length(GEN1, L1),
length(GEN2, L2),
0 is L1*L2.
fit([P1|R1], [P2|R2], N, TOTAL) :-
member(P1, ['_',a,c,t,g]),
member(P2, ['_',a,c,t,g]),
append([P1],[P2],T),
( member(T,[[a,a],[c,c],[t,t],[g,g]])
-> X is N+1
; X is N
),
fit(R1,R2,X,TOTAL).
Where GEN1
and GEN2
are lists containing all characters large strings.
I've tried increasing the stack limit to avoid Out of Local Stack
exception with little success.
The issue is that, is called often and in deep recursive clauses. Is there any better way to do this?
EDIT
The clause needs to stop when one or both lists are empty.
EDIT 2
Is worth saying that testings on all answers below were done using 64bit prolog, with the --stack-limit=32g
option as my code isn't well optimized and the fit
clause is a small part of a larger process, but was the main problem with my code.
EDIT 3
CapelliC code worked using the less resources.
false code using the library(reif)
v2 worked the faster.
See Complexity of counting matching elements in two sequences using library(aggregate)
for more proposed solutions.
It seems that there is no point to insist that you have letters out of "_actg"
all the time. A generalized definition seems to be sufficient. Using library(reif)
:
fit([], _, N,N).
fit([_|_], [], N,N).
fit([P1|R1], [P2|R2], N,TOTAL) :-
if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
fit(R1, R2, X,TOTAL).
Update: please make sure to use v2 of library(reif)
. The original version did not compile dif/3.
And here a version for systems that can only index on one argument simultaneously:
fit([], _, N,N).
fit([P1|R1], L2, N,TOTAL) :-
ifit(L2, [P1|R1], N,TOTAL).
ifit([], _, N,N).
ifit([P2|R2], [P1|R1], N,TOTAL) :-
if_( ( P1 = P2, dif(P1, '_') ), X is N+1, X = N ),
fit(R1, R2, X,TOTAL).
if your Prolog has library(aggregate) you can do
fit(GEN1, GEN2, N) :-
aggregate_all(count, (nth1(P,GEN1,S),nth1(P,GEN2,S),memberchk(S,[a,c,g,t])), N).
edit
Depending on the statistic of data, a noticeable improvement can be obtained just swapping the last two calls, i.e. ...(nth1(P,GEN1,S),memberchk(S,[a,c,g,t]),nth1(P,GEN2,S))...
edit
Of course a tight loop it's better that a double indexed scan. For performance, I would write it like
fit_cc(GEN1, GEN2, N) :-
fit_cc(GEN1, GEN2, 0, N).
fit_cc([X|GEN1], [Y|GEN2], C, N) :-
( X\='_' /*memberchk(X, [a,c,g,t])*/, X=Y
-> D is C+1 ; D=C
),
fit_cc(GEN1, GEN2, D, N).
fit_cc(_, _, N, N).
but the generality and correctness allowed by library(reif) v2, as seen in @false' answer and comments, seems to be well worth the (pretty small) overhead.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With