Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing in Prolog without cut?

I found this nice snippet for parsing lisp in Prolog (from here):

ws --> [W], { code_type(W, space) }, ws.
ws --> [].

parse(String, Expr) :- phrase(expressions(Expr), String).

expressions([E|Es]) -->
    ws, expression(E), ws,
    !, % single solution: longest input match
    expressions(Es).
expressions([]) --> [].

% A number N is represented as n(N), a symbol S as s(S).

expression(s(A))         --> symbol(Cs), { atom_codes(A, Cs) }.
expression(n(N))         --> number(Cs), { number_codes(N, Cs) }.
expression(List)         --> "(", expressions(List), ")".
expression([s(quote),Q]) --> "'", expression(Q).

number([D|Ds]) --> digit(D), number(Ds).
number([D])    --> digit(D).

digit(D) --> [D], { code_type(D, digit) }.

symbol([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alpha) },
    symbolr(As).

symbolr([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alnum) },
    symbolr(As).
symbolr([]) --> [].

However expressions uses a cut. I'm assuming this is for efficiency. Is it possible to write this code so that it works efficiently without cut?

Would also be in interested answers that involve Mercury's soft-cut / committed choice.

like image 953
dnolen Avatar asked Jun 15 '11 13:06

dnolen


2 Answers

The cut is not used for efficiency, but to commit to the first solution (see the comment next to the !/0: "single solution: longest input match"). If you comment out the !/0, you get for example:

?- parse("abc", E).
E = [s(abc)] ;
E = [s(ab), s(c)] ;
E = [s(a), s(bc)] ;
E = [s(a), s(b), s(c)] ;
false.

It is clear that only the first solution, consisting of the longest sequence of characters that form a token, is desired in such cases. Given the example above, I therefore disagree with "false": expression//1 is ambiguous, because number//1 and symbolr//1 are. In Mercury, you could use the determinism declaration cc_nondet to commit to a solution, if any.

like image 124
mat Avatar answered Oct 01 '22 17:10

mat


You are touching a quite deep problem here. At the place of the cut you have added the comment "longest input match". But what you actually did was to commit to the first solution which will produce the "longest input match" for the non-terminal ws//0 but not necessarily for expression//1.

Many programming languages define their tokens based on the longest input match. This often leads to very strange effects. For example, a number may be immediately followed by a letter in many programming languages. That's the case for Pascal, Haskell, Prolog and many other languages. E.g. if a>2then 1 else 2 is valid Haskell. Valid Prolog: X is 2mod 3.

Given that, it might be a good idea to define a programming language such that it does not depend on such features at all.

Of course, you would then like to optimize the grammar. But I can only recommend to start with a definition that is unambiguous in the first place.

As for efficiency (and purity):

eos([],[]).

nows --> call(eos).
nows, [W] --> [W], { code_type(W, nospace) }.

ws --> nows.
ws --> [W], {code_type(W, space)}, ws.
like image 26
false Avatar answered Oct 01 '22 18:10

false