I found this nice snippet for parsing lisp in Prolog (from here):
ws --> [W], { code_type(W, space) }, ws.
ws --> [].
parse(String, Expr) :- phrase(expressions(Expr), String).
expressions([E|Es]) -->
ws, expression(E), ws,
!, % single solution: longest input match
expressions(Es).
expressions([]) --> [].
% A number N is represented as n(N), a symbol S as s(S).
expression(s(A)) --> symbol(Cs), { atom_codes(A, Cs) }.
expression(n(N)) --> number(Cs), { number_codes(N, Cs) }.
expression(List) --> "(", expressions(List), ")".
expression([s(quote),Q]) --> "'", expression(Q).
number([D|Ds]) --> digit(D), number(Ds).
number([D]) --> digit(D).
digit(D) --> [D], { code_type(D, digit) }.
symbol([A|As]) -->
[A],
{ memberchk(A, "+/-*><=") ; code_type(A, alpha) },
symbolr(As).
symbolr([A|As]) -->
[A],
{ memberchk(A, "+/-*><=") ; code_type(A, alnum) },
symbolr(As).
symbolr([]) --> [].
However expressions uses a cut. I'm assuming this is for efficiency. Is it possible to write this code so that it works efficiently without cut?
Would also be in interested answers that involve Mercury's soft-cut / committed choice.
The cut is not used for efficiency, but to commit to the first solution (see the comment next to the !/0: "single solution: longest input match"). If you comment out the !/0, you get for example:
?- parse("abc", E).
E = [s(abc)] ;
E = [s(ab), s(c)] ;
E = [s(a), s(bc)] ;
E = [s(a), s(b), s(c)] ;
false.
It is clear that only the first solution, consisting of the longest sequence of characters that form a token, is desired in such cases. Given the example above, I therefore disagree with "false": expression//1 is ambiguous, because number//1 and symbolr//1 are. In Mercury, you could use the determinism declaration cc_nondet to commit to a solution, if any.
You are touching a quite deep problem here. At the place of the cut you have
added the comment "longest input match". But what you actually did was to commit
to the first solution which will produce the "longest input match" for the non-terminal ws//0
but not necessarily for expression//1
.
Many programming languages define their tokens based on the longest input match. This often leads to very strange effects. For example, a number may be immediately
followed by a letter in many programming languages. That's the case for Pascal, Haskell,
Prolog and many other languages. E.g. if a>2then 1 else 2
is valid Haskell.
Valid Prolog: X is 2mod 3.
Given that, it might be a good idea to define a programming language such that it does not depend on such features at all.
Of course, you would then like to optimize the grammar. But I can only recommend to start with a definition that is unambiguous in the first place.
As for efficiency (and purity):
eos([],[]).
nows --> call(eos).
nows, [W] --> [W], { code_type(W, nospace) }.
ws --> nows.
ws --> [W], {code_type(W, space)}, ws.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With