Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression Matching Prolog

I am trying to do Regular Expression matching. I have all the functions written out, but they are not working as they should. From what I can tell, it has a problem when I try to compare a list.
For instance, "re_contains(a,a)." gives true (obviously), as does "re_contains(union(a,b),a)."

But as soon as I make it a list, it fails. "re_contains(seq(a,b), [a,b])." returns false. Append should be going through all the possible combinations to find the match, but none of these functions work correctly. This makes me thing that perhaps I am missing a base case. But I think "re_contains(X, L) :- X == L." should take care of it. I must be over looking something important here.

Here is my code:

re_contains(empty, []).

re_contains(X, L) :-
    X == L.

re_contains(seq(X, Y), L) :-
    append(L1, L2, L),
    re_contains(X, L1),
    re_contains(Y, L2). 

re_contains(union(X, _), L) :-
    re_contains(X, L).

re_contains(union(_, Y), L) :- 
    re_contains(Y, L).  

re_contains(kleene(X), L) :- 
    append([Car|L1], L2, L),
    re_contains(X, [Car|L1]),
    re_contains(kleene(X), L2).

re_contains(kleene(_),[]).
like image 389
Ryan C. Stacy Avatar asked Sep 25 '12 02:09

Ryan C. Stacy


2 Answers

append/3 will split L, and both L1 and L2 will be lists.

I would try to replace re_contains(X, L) :- X == L. with re_contains(X, [X]).

After the change, re_contains(a,a). will fail.

You are representing the sequence in different ways, and your matcher doesn't provide for both. Actually, the only cases 'working' are not sequences.

like image 24
CapelliC Avatar answered Oct 22 '22 03:10

CapelliC


There are several issues. Here are the most obvious:

Typing. Your predicate re_contains/2 expects a list as the second argument. That re_contains(a,a). succeeds is rather coincidence than intention.

Monotonicity. Another problem is that re_contains([a],[a]) succeeds, but re_contains([X],[a]) fails. Or, to see it from another angle: re_contains([X],[a]) fails, but X = a, re_contains([X],[a]) succeeds. By adding the goal X = a we have specialized the query thus the originally failing query should fail again.

The testing for identity (==/2) should be replaced by equality (=/2) and ensuring that we have some list. So:

re_contains(X, L) :-
    % X == L.
    X = L,
    append(X,_,_).

Note, that append/3 is used here only to ensure that X is a list - the actual appending functionality is not used.

Efficiency. The third problem concerns the actual way how you are representing concatenation. Let's just look at the following rule:

re_contains(seq(X, Y), L) :-
    append(L1, L2, L),
    re_contains(X, L1),
    re_contains(Y, L2).

Now, assume that we have here a list of length N. How many answers are possible for the goal append(L1, L2, L)? Actually N + 1. And this, regardless of the actual values involved. Now consider:

?- length(L,1000000), time(re_contains(seq([a],[]),[b|L])).
% 2,000,005 inferences, 0.886 CPU in 0.890 seconds (100% CPU, 2258604 Lips)
false.

re_contains/2 needs here time proportional to the length of the list. But it would suffice to look at the first element to realize that this is impossible.

So the problem behind is the use of append/3. There is a simple rule of thumb for Prolog: If you are using too much append/3 consider to use dcgs — Definite Clause Grammars. Please look into the tag for more details — and consult an introductory Prolog text. To give you a start, here is an subset of your definition:

re_contains(RE, L) :-
    phrase(re(RE), L).

re([]) --> [].
re([E]) --> [E].
re(seq(X,Y)) -->
    re(X),
    re(Y).

Which does no longer investigate the entire list:

?- length(L,1000000), time(phrase(re(seq([a],[])),[b|L])).
% 6 inferences, 0.000 CPU in 0.000 seconds (88% CPU, 127313 Lips)
false.

BTW, here is a complete definition.

Termination/non-termination. Related to efficiency is the property of termination. Ideally, a query terminates, if the set of solutions can be finitely represented. That is, by a finite number of answers. OK, that is the ideal we are striving for. Prolog's simple but very efficient execution algorithm will sometimes loop, when a finite number of answers would have been possible. Understanding the very reason of non-termination is sometimes very tricky. The usual debugging strategies – like tracing or stepping through with a debugger – do not work since they show you much too much detail. Fortunately, there is a better technique:

Instead of looking at your entire program, I will again look only at a very tiny fragment of it. This fragment is a failure slice (see the link for more details). It is much smaller but tells quite a lot about the original program — provided that was a pure, monotonic program:

If a failure slice does not terminate then the original program does not terminate.

So if we find such a failure slice we can make immediately conclusions about the entire program. Without even reading the rest!

Here is such an interesting failure slice:

...
re_contains(X, L) :- false,
    X = L
re_contains(seq(X, Y), L) :-
    append(L1, L2, L), false,
    re_contains(X, L1),
    re_contains(Y, L2).
re_contains(union(X, _), L) :- false,
    re_contains(X, L).
...

Consider now the goal re_contains(seq([],[]),L). Ideally, there should be exactly one answer L = [] and the goal should terminate. However, it loops, since append(L1, L2, L) does not terminate. Contrast this to the DCG-solution above which terminates for phrase(re(seq([],[])),L).

like image 200
false Avatar answered Oct 22 '22 04:10

false