Prolog DCG: Writing programming language lexer

I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.

All my little token parsers seem to be working alright; at the moment this is a snippet of my code:

:- use_module(library(dcg/basics)).

operator('(')  --> "(".      operator(')')  --> ")".
operator('[')  --> "[".      operator(']')  --> "]".
% ... etc.

keyword(array)    --> "array".
keyword(break)    --> "break".
% ... etc.

It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but which does seem to work:

id(id(Id)) -->
    [C],
    {
        char_type(C, alpha)
    },
    idRest(Rest),
    {
        atom_chars(Id, [C|Rest])
    }.
idRest([C|Rest]) -->
    [C],
    {
        char_type(C, alpha) ; char_type(C, digit) ; C = '_'
    },
    idRest(Rest).
idRest([]) --> [].

int(int(Int)) --> integer(Int).

string(str(String)) -->
    "\"",
    stringContent(Codes),
    "\"",
    {
        string_chars(String, Codes)
    }.
stringContent([C|Chars]) -->
    stringChar(C), stringContent(Chars).
stringContent([]) --> [].

stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\\) --> "\\\\".
stringChar(C) --> [C].

The main rule for my tokenizer is this:

token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).

It's not perfect; I will see int get parsed into in,id(t) because keyword(X) comes before id(X). So I guess that's the first question.
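For example, assuming in is among the keywords in the elided list above, the misparse can be reproduced directly (a sketch of what I observe):

```prolog
% with keyword(in) --> "in". among the elided keyword rules,
% token//1 commits to the keyword branch before id//1 is tried:
?- phrase(token(T), `int`, Rest).
% T is bound to in, with Rest = `t`, instead of T = id(int).
```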

The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:

skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.

comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.

token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).

This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous that whites//0 from dcg/basics is deterministic; however, that part of the equation seems to work; it's just the integration with the comment skipping that doesn't.

As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?

The whole code can be found in this haste.

asked Dec 14 '15 by Daniel Lyons


2 Answers

It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?), but at its core it is quite solid, so I have only a few comments about some aspects of the code and the questions:

  1. Separating lexing from parsing makes perfect sense. It is also a perfectly acceptable solution to store line and column information along with each token, leaving tokens (for example) of the form l_c_t(Line,Column,Token) or Token-lc(Line,Column) for the parser to process.
  2. Comments are always nasty, or should I say, often not-nesty? A useful pattern in DCGs is often to go for the longest match, which you are already using in some cases, but not yet for anything//0. So, reordering the two rules may help you to skip everything that is meant to be commented away.
  3. Regarding the determinism: It is OK to commit to the first parse that matches, but do it only once, and resist the temptation to mess up the declarative grammar.
  4. In DCGs, it is elegant to use | instead of ;.
  5. tokenize//1? Come on! That's just tokens//1. It makes sense in all directions.
answered Sep 29 '22 by mat


I have this code to support error reporting, which itself must be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is no ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out-of-the-box with specialized parsing engines, which are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...

:- dynamic text_length/1.

parse_conf_cs(Cs, AST)   :-
    length(Cs, TL),
    retractall(text_length(_)),
    assert(text_length(TL)),
    phrase(cfg(AST), Cs).
....
%%  tag(?T, -X, -Y)// is det.
%
%   Start/Stop tokens for XML like entries.
%   Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
    pos(X), unquoted(T), pos(Y).
....

%%  pos(-C, +P, -P) is det.
%
%   capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.

tag//3 is just an example usage; in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
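For instance, with pos//3 above, any token rule (here a hypothetical token//1) can be wrapped to carry its offset along:

```prolog
% pair each token with the offset at which it starts
token_at(T-P) --> pos(P), token(T).
```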

edit

a small enhancement for id//1: SWI-Prolog has specialized code_type/2 for that:

1 ?- code_type(0'a, csymf).
true.

2 ?- code_type(0'1, csymf).
false.

so (glossing over literal transformation)

id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).

id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].

depending on how inclined you are to generalize small snippets, and on the actual grammar details, id_rest//1 could be written in a reusable fashion and made deterministic:

id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).

% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].

this stricter definition of id//1 would also allow you to remove some ambiguity wrt identifiers with a keyword prefix: recoding keyword//1 like

keyword(K) --> id(id(K)), {memberchk(K, [
    array,
    break,
...
])}.

will correctly identify

?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;

Your string//1 (OT: what unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':

stringChar(0'\\) --> "\\\\".
stringChar(0'\") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.

It's an example of 'report error and insert missing token', so the parsing can go on...

answered Sep 29 '22 by CapelliC