Prolog DCG: Writing programming language lexer

I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.

All my little token parsers seem to be working alright; at the moment this is a snippet of my code:

:- use_module(library(dcg/basics)).

operator('(')  --> "(".      operator(')')  --> ")".
operator('[')  --> "[".      operator(']')  --> "]".
% ... etc.

keyword(array)    --> "array".
keyword(break)    --> "break".
% ... etc.

It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but which does seem to work:

id(id(Id)) -->
    [C],
    {
        char_type(C, alpha)
    },
    idRest(Rest),
    {
        atom_chars(Id, [C|Rest])
    }.
idRest([C|Rest]) -->
    [C],
    {
        char_type(C, alpha) ; char_type(C, digit) ; C = '_'
    },
    idRest(Rest).
idRest([]) --> [].

int(int(Int)) --> integer(Int).

string(str(String)) -->
    "\"",
    stringContent(Codes),
    "\"",
    {
        string_chars(String, Codes)
    }.
stringContent([C|Chars]) -->
    stringChar(C), stringContent(Chars).
stringContent([]) --> [].

stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\\) --> "\\\\".
stringChar(C) --> [C].

The main rule for my tokenizer is this:

token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).

It's not perfect; I will see int get parsed into in,id(t) because keyword(X) comes before id(X). So I guess that's the first question.
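For example, assuming in is among the keywords in the elided list above, the misparse can be reproduced directly (a sketch of what I observe):

```prolog
% with keyword(in) --> "in". among the elided keyword rules,
% token//1 commits to the keyword branch before id//1 is tried:
?- phrase(token(T), `int`, Rest).
% T is bound to in, with Rest = `t`, instead of T = id(int).
```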

The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:

skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.

comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.

token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).

This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous that whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work; it's just the integration with the comment skipping that doesn't.

As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?

The whole code can be found in this haste.

asked Dec 14 '15 by Daniel Lyons


2 Answers

It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?), but at its core it is quite solid, so I have only a few comments about some aspects of the code and the questions:

  1. Separating lexing from parsing makes perfect sense. It is also a perfectly acceptable solution to store line and column information along with each token, leaving tokens (for example) of the form l_c_t(Line,Column,Token) or Token-lc(Line,Column) for the parser to process.
  2. Comments are always nasty, or should I say, often not-nesty? A useful pattern in DCGs is often to go for the longest match, which you are already using in some cases, but not yet for anything//0. So, reordering the two rules may help you to skip everything that is meant to be commented away.
  3. Regarding the determinism: It is OK to commit to the first parse that matches, but do it only once, and resist the temptation to mess up the declarative grammar.
  4. In DCGs, it is elegant to use | instead of ;.
  5. tokenize//1? Come on! That's just tokens//1. It makes sense in all directions.
answered Sep 29 '22 by mat


I have this code to support error reporting, which itself must be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out-of-the-box with specialized parsing engines, which are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...

:- dynamic text_length/1.

parse_conf_cs(Cs, AST)   :-
    length(Cs, TL),
    retractall(text_length(_)),
    assert(text_length(TL)),
    phrase(cfg(AST), Cs).
....
%%  tag(?T, -X, -Y)// is det.
%
%   Start/Stop tokens for XML like entries.
%   Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
    pos(X), unquoted(T), pos(Y).
....

%%  pos(-C, +P, -P) is det.
%
%   capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.

tag//3 is just an example usage; in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
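For instance, with pos//3 above, any token rule (here a hypothetical token//1) can be wrapped to carry its offset along:

```prolog
% pair each token with the offset at which it starts
token_at(T-P) --> pos(P), token(T).
```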

edit

a small enhancement for id//1: SWI-Prolog has specialized code_type/2 for that:

1 ?- code_type(0'a, csymf).
true.

2 ?- code_type(0'1, csymf).
false.

so (glossing over literal transformation)

id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).

id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].

depending on how inclined you are to generalize small snippets, and on the actual grammar details, id_rest//1 could be written in a reusable fashion and made deterministic:

id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).

% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].

this stricter definition of id//1 would also allow you to remove some ambiguity wrt identifiers with a keyword prefix: recoding keyword//1 like

keyword(K) --> id(id(K)), {memberchk(K, [
    array,
    break,
...
])}.

will correctly identify

?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;

Your string//1 (OT: what unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':

stringChar(0'\\) --> "\\\\".
stringChar(0'\") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.

It's an example of 'report error and insert missing token', so the parsing can go on...

answered Sep 29 '22 by CapelliC