
LALR grammar, trailing comma and multiline list assignment

I'm trying to write an LALR grammar for a very simple language composed of assignments. For example:

foo = "bar"
bar = 42

The language should also handle lists of values, for example:

foo = 1, 2, 3

But I also want to handle lists that span multiple lines:

foo = 1, 2
      3, 4

Trailing commas (for singletons and for flexibility):

foo = 1,
foo = 1, 2,

And obviously, both at the same time:

foo = 1,
      2,
      3,

I'm able to write a grammar that handles either trailing commas or multi-line lists, but not both at the same time.

My grammar looks like this:

content : content '\n'
        | content assignment
        | <empty>

assignment : NAME '=' value
           | NAME '=' list

value : TEXT
      | NUMBER

list : ???

Note: I need the '\n' in the grammar to forbid this kind of code:

foo
=
"bar"

Thanks in advance,

Antoine.

Asked by Antoine, Jan 21 '26 23:01


1 Answer

It looks like your configuration language is essentially free-form. I would forget about making newline a token in the grammar. If you want the newline restrictions, you can implement them as lexical tie-ins: the parser calls a small API added to the lexer to tell it where it is in the grammar, and the lexer decides whether to accept a newline or reject it with an error.

Try this grammar.

%token NAME NUMBER TEXT

%%

config_file : assignments
            | /* empty */
            ;

assignments : assignment
            | assignments assignment
            ;

assignment : NAME '=' values comma_opt
           ;

comma_opt : ','
          | /* empty */
          ;

values : value
       | values ',' value
       ;

value : NUMBER | TEXT ;

It builds for me with no conflicts. I didn't run it, but a casual reading of y.output looks like the transitions are sane.

This grammar, of course, allows

foo = 1, 2, 3, bar = 4, 5, 6 xyzzy = 7 answer = 42

without additional communication with the lexer.
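To make that concrete, here is a minimal Python sketch of a hand-written recognizer that follows the same rules as the grammar above (values separated by commas, then an optional trailing comma). It is not the yacc-generated parser; the names `tokenize` and `parse` and the token patterns are assumptions made for illustration. Because whitespace, including newlines, is simply skipped, the one-line input is accepted as four assignments.

```python
import re

# Hypothetical token patterns; the original post does not define them.
TOKEN_RE = re.compile(
    r'(?P<NUMBER>\d+)|(?P<TEXT>"[^"]*")|(?P<NAME>[A-Za-z_]\w*)'
    r'|(?P<PUNCT>[=,])|(?P<SKIP>\s+)|(?P<BAD>.)'
)

def tokenize(text):
    """Return (kind, text) pairs; all whitespace, newlines included, is dropped."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "BAD":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        tokens.append((m.group() if kind == "PUNCT" else kind, m.group()))
    tokens.append(("EOF", ""))
    return tokens

def parse(text):
    """assignment : NAME '=' values comma_opt, repeated until end of input."""
    toks = tokenize(text)
    i = 0
    assignments = []
    while toks[i][0] != "EOF":
        if toks[i][0] != "NAME" or toks[i + 1][0] != "=":
            raise SyntaxError("expected NAME '='")
        name = toks[i][1]
        i += 2
        values = []
        while True:
            if toks[i][0] not in ("NUMBER", "TEXT"):
                raise SyntaxError("expected a value")
            values.append(toks[i][1])
            i += 1
            if toks[i][0] != ",":
                break                  # comma_opt is empty
            i += 1                     # consume ','
            if toks[i][0] not in ("NUMBER", "TEXT"):
                break                  # the ',' was a trailing comma
        assignments.append((name, values))
    return assignments

result = parse("foo = 1, 2, 3, bar = 4, 5, 6 xyzzy = 7 answer = 42")
print([name for name, _ in result])    # ['foo', 'bar', 'xyzzy', 'answer']
```

As in the LALR parser, a single token of lookahead after a comma is enough to tell a trailing comma (the next token is a NAME or end of input) from a separator (the next token is a value).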

Your restrictions mean that newlines are only allowed within the list of values. Two NAME tokens must never appear on the same line, and the = must appear on the same line as the preceding NAME (and probably the first value must too).

Basically when the parser scans the first value, it can tell the lexer "values are being scanned now, turn on the admission of newlines". And then when the comma_opt is reduced, this can be turned off again. When comma_opt is reduced, the lexer may have already read the NAME token of the next assignment, but it can check that this occurs on a different line from the previous NAME. You will want your lexer to keep track of an accurate line count anyway.
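The tie-in described above can be sketched in Python with a hand-written parser standing in for the yacc-generated one. Everything named here (`Lexer`, `allow_newlines`, `next_token`, `parse`, the token patterns) is hypothetical and only illustrates the idea: the parser toggles a flag on the lexer, the lexer rejects newlines while the flag is off, and each NAME is checked against the line of the previous NAME.

```python
import re

# Hypothetical token patterns; the original post does not define them.
TOKEN_RE = re.compile(
    r'(?P<NUMBER>\d+)|(?P<TEXT>"[^"\n]*")|(?P<NAME>[A-Za-z_]\w*)|(?P<PUNCT>[=,])'
)

class Lexer:
    def __init__(self, text):
        self.text, self.pos, self.line = text, 0, 1
        self.allow_newlines = True  # the tie-in flag, set by the parser

    def next_token(self):
        # Skip whitespace, counting lines and enforcing the flag.
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            if self.text[self.pos] == "\n":
                if not self.allow_newlines:
                    raise SyntaxError(f"line {self.line}: newline not allowed here")
                self.line += 1
            self.pos += 1
        if self.pos == len(self.text):
            return ("EOF", "", self.line)
        m = TOKEN_RE.match(self.text, self.pos)
        if m is None:
            raise SyntaxError(f"line {self.line}: unexpected character")
        self.pos = m.end()
        kind = m.group() if m.lastgroup == "PUNCT" else m.lastgroup
        return (kind, m.group(), self.line)

def parse(text):
    lex = Lexer(text)
    assignments, prev_name_line = [], 0
    tok = lex.next_token()
    while tok[0] != "EOF":
        kind, name, line = tok
        if kind != "NAME":
            raise SyntaxError(f"line {line}: expected NAME")
        if line == prev_name_line:
            raise SyntaxError(f"line {line}: second assignment on the same line")
        prev_name_line = line
        lex.allow_newlines = False     # NAME, '=' and first value share a line
        if lex.next_token()[0] != "=":
            raise SyntaxError(f"line {line}: expected '='")
        tok = lex.next_token()
        values = []
        while tok[0] in ("NUMBER", "TEXT"):
            values.append(tok[1])
            lex.allow_newlines = True  # values are being scanned: admit newlines
            tok = lex.next_token()
            if tok[0] != ",":
                break
            tok = lex.next_token()     # a value, or the next NAME after a trailing comma
        if not values:
            raise SyntaxError(f"line {line}: expected a value")
        assignments.append((name, values))
    return assignments

print(parse("foo = 1,\n      2,\n      3,\nbar = 42\n"))
```

With this sketch, the three-line `foo` / `=` / `"bar"` input is rejected at the first newline, and `foo = 1, 2, 3, bar = 4` is rejected because `bar` sits on the same line as `foo`. Note that comparing only against the previous NAME's line still accepts a NAME that shares a line with a continuation value (for example `bar` right after `2,` on the second line of a list); a stricter variant could compare against the line of the last value token instead.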

Answered by Kaz, Jan 23 '26 11:01