I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several little issues that indicate to me that there is something obvious I'm missing.
All my little token parsers seem to be working alright; at the moment this is a snippet of my code:
:- use_module(library(dcg/basics)).
operator('(') --> "(". operator(')') --> ")".
operator('[') --> "[". operator(']') --> "]".
% ... etc.
keyword(array) --> "array".
keyword(break) --> "break".
% ... etc.
It's a bit repetitive but it seems to work. Then I have some stuff I don't completely love and would welcome suggestions on, but does seem to work:
id(id(Id)) -->
[C],
{
char_type(C, alpha)
},
idRest(Rest),
{
atom_chars(Id, [C|Rest])
}.
idRest([C|Rest]) -->
[C],
{
char_type(C, alpha) ; char_type(C, digit) ; C = '_'
},
idRest(Rest).
idRest([]) --> [].
int(int(Int)) --> integer(Int).
string(str(String)) -->
"\"",
stringContent(Codes),
"\"",
{
string_chars(String, Codes)
}.
stringContent([C|Chars]) -->
stringChar(C), stringContent(Chars).
stringContent([]) --> [].
stringChar(0'\n) --> "\\n".
stringChar(0'\t) --> "\\t".
stringChar(0'\") --> "\\\"".
stringChar(0'\") --> "\\\\".
stringChar(C) --> [C].
The main rule for my tokenizer is this:
token(X) --> whites, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
It's not perfect; I will see int
get parsed into in,id(t)
because keyword(X)
comes before id(X)
. So I guess that's the first question.
The bigger question I have is that I do not see how to properly integrate comments into this situation. I have tried the following:
skipAhead --> [].
skipAhead --> (comment ; whites), skipAhead.
comment --> "/*", anything, "*/".
anything --> [].
anything --> [_], anything.
token(X) --> skipAhead, (keyword(X) ; operator(X) ; id(X) ; int(X) ; string(X)).
This does not seem to work; the parses that return (and I get many parses) do not seem to have the comment removed. I'm nervous that my comment rule is needlessly inefficient and probably induces a lot of unnecessary backtracking. I'm also nervous that whites//0
from dcg/basics is deterministic; however, that part of the equation seems to work, it's just integrating it with the comment skipping that doesn't seem to.
As a final note, I don't see how to handle propagating parse errors back to the user with line/column information from here. It feels like I'd have to track and thread through some kind of current line/column info and write it into the tokens and then maybe try to rebuild the line if I wanted to do something similar to how llvm does it. Is that fair or is there a "recommended practice" there?
The whole code can be found in this haste.
An interpreter is a complex program, so there are multiple stages to it: A lexer is the part of an interpreter that turns a sequence of characters (plain text) into a sequence of tokens. A parser, in turn, takes a sequence of tokens and produces an abstract syntax tree (AST) of a language.
A lexer is a software program that performs lexical analysis. ... A parser goes one level further than thelexer and takes the tokens produced by the lexer and tries to determine if proper sentences have been formed. Parsers work at the grammatical level, lexerswork at the word level.
Structure of a ParserA complete parser is usually composed of two parts: a lexer, also known as scanner or tokenizer, and the proper parser. The parser needs the lexer because it does not work directly on the text, but on the output produced by the lexer.
It currently still looks a bit strange (unreadableNamesLikeInJavaAnyone?
), but in its core it is quite solid, so I have only a few comments about some aspects of the code and the questions:
l_c_t(Line,Column,Token)
or Token-lc(Line,Column)
for the parser to process.anything//0
. So, reordering the two rules may help you to skip everything that is meant to be commented away.|
instead of ;
.tokenize//1
? Come on! That's just tokens//1
. It makes sense in all directions.I have this code to support error reporting, that itself must be handled with care, sprinkling meaningful messages and 'skip rules' around the code. But there is not ready-to-use alternative: a DCG is a nice computation engine, but it cannot compete out-of-the-box with specialized parsing engines, that are able to emit error messages automatically, exploiting the theoretical properties of the targeted grammars...
:- dynamic text_length/1.
parse_conf_cs(Cs, AST) :-
length(Cs, TL),
retractall(text_length(_)),
assert(text_length(TL)),
phrase(cfg(AST), Cs).
....
%% tag(?T, -X, -Y)// is det.
%
% Start/Stop tokens for XML like entries.
% Maybe this should restrict somewhat the allowed text.
%
tag(T, X, Y) -->
pos(X), unquoted(T), pos(Y).
....
%% pos(-C, +P, -P) is det.
%
% capture offset from end of stream
%
pos(C, P, P) :- text_length(L), length(P, Q), C is L - Q.
tag//3 is just an example usage, in this parser I'm building an editable AST, so I store the positions to be able to properly attribute each nested part in an editor...
edit
a small enhancement for id//1: SWI-Prolog has specialized code_type/2 for that:
1 ?- code_type(0'a, csymf).
true.
2 ?- code_type(0'1, csymf).
false.
so (glossing over literal transformation)
id([C|Cs]) --> [C], {code_type(C, csymf)}, id_rest(Cs).
id_rest([C|Cs]) --> [C], {code_type(C, csym)}, id_rest(Cs).
id_rest([]) --> [].
depending on your attitude to generalize small snippets, and the actual grammar details, id_rest//1 could be written in reusable fashion, and made deterministic
id([C|Cs]) --> [C], {code_type(C, csymf)}, codes(csym, Cs).
% greedy and deterministic
codes(Kind, [C|Cs]) --> [C], {code_type(C, Kind)}, !, codes(Kind, Cs).
codes(Kind, []), [C] --> [C], {\+code_type(C, Kind)}, !.
codes(_, []) --> [].
this stricter definition of id//1 would also allow to remove some ambiguity wrt identifiers with keyword prefix: recoding keyword//1 like
keyword(K) --> id(id(K)), {memberchk(K, [
array,
break,
...
]}.
will correctly identify
?- phrase(tokenize(Ts), `if1*2`).
Ts = [id(if1), *, int(2)] ;
Your string//1 (OT: what unfortunate clash with library(dcg/basics):string//1) is an easy candidate for implementing a simple 'error recovery strategy':
stringChar(0'\") --> "\\\\".
stringChar(0'") --> pos(X), "\n", {format('unclosed string at ~d~n', [X])}.
It's an example of 'report error and insert missing token', so the parsing can go on...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With