Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ebnf – Is this an LL(1) grammar?

Tags:

parsing

ll

ebnf

I found the following EBNF on wikipedia, describing EBNF:

letter = "A" | "B" | "C" | "D" | "E" | "F" | "G"
   | "H" | "I" | "J" | "K" | "L" | "M" | "N"
   | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
   | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
symbol = "[" | "]" | "{" | "}" | "(" | ")" | "<" | ">"
   | "'" | '"' | "=" | "|" | "." | "," | ";" ;
character = letter | digit | symbol | "_" ;

identifier = letter , { letter | digit | "_" } ;
terminal = "'" , character , { character } , "'" 
     | '"' , character , { character } , '"' ;

lhs = identifier ;
rhs = identifier
 | terminal
 | "[" , rhs , "]"
 | "{" , rhs , "}"
 | "(" , rhs , ")"
 | rhs , "|" , rhs
 | rhs , "," , rhs ;

rule = lhs , "=" , rhs , ";" ;
grammar = { rule } ;

Now, because of my limited knowledge on parsers and grammars, I don't know if this is an LL(1) grammar. I tried to write a parser for it, but it fails when trying to read rhs, which reads itself again, which reads itself again, which oh, you got it...

  • Is it an LL(1) grammar?
  • If not, how to turn it into one (possible?)?
like image 577
tilpner Avatar asked Nov 24 '13 13:11

tilpner


People also ask

How do you know if grammar is LL 1?

Recognizing LL(1): Here are two properties we know must be true of a grammar if it is to be LL(1): the grammar must not be left recursive. the rule which should be chosen when developing a nonterminal must be determined by that nonterminal and the (at most) next token on the input.

Is every unambiguous grammar is LL 1?

Ambiguous grammars are not LL(1) but unambiguous grammars are not necessarily LL(1) Having a non-LL(1) unambiguous grammar for a language does not mean that this language is not LL(1). But there are languages for which there exist unambiguous context-free grammars but no LL(1) grammar.

How do you prove grammar is not LL 1?

If the grammar is ambiguous (at least one sentence has more than one parse tree), then the grammar is not in LL(1).

What are the rules of EBNF?

The EBNF defines production rules where sequences of symbols are respectively assigned to a nonterminal: digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; digit = "0" | digit excluding zero ; This production rule defines the nonterminal digit which is on the left side of the assignment.


2 Answers

The quoted Wikipedia extract is not a correct EBNF grammar for EBNF. It's also not left-parseable: indeed, it is ambiguous, so it's not unambiguously parseable at all.

In general, the terms LL(k) and LR(k) (and many other such terms) apply to Context-Free Grammars (CFGs) (and, by extension, the languages generated by those grammars). EBNF is not a formalism for describing CFGs. It is designed to be a formalism to describe context-free languages and therefore it should be possible to create a CFG from a given EBNF grammar (but see Note 1), but there is not a direct correspondence between an EBNF syntax rule and a single production in a CFG.

That said, you can usually directly create a CFG by using some standard transformations. For example:

{ ... }

can be substituted with the generated non-terminal M'', with the addition of the following productions: (ε is the empty string)

M'  → ...
M'' → ε
M'' → M' M''

The above transformation does not introduce left-recursion, so it does not artificially make the original grammar non-LL(1).

The most important error in the cited grammar [Note 2] is the ambigous EBNF rule:

rhs = identifier
    | terminal
    | "[" , rhs , "]"
    | "{" , rhs , "}"
    | "(" , rhs , ")"
    | rhs , "|" , rhs
    | rhs , "," , rhs
    ;

It's also left-recursive, so it will not correspond to an LL(1) CFG. But more importantly, it does not indicate either the associativity or the precedence of the | and , operators. (Semantically, these operators do not have a defined associativity, but the syntax should still specify one; otherwise, it is impossible to unambiguously create a parse tree. The precedence between the two operators is important semantically.)

A better set of rules would be:

primary = identifier
        | terminal
        | "[" , rhs , "]"
        | "{" , rhs , "}"
        | "(" , rhs , ")"
        ;
factor  = primary , { "|" , primary } ;
rhs     = factor ,  { "," , factor } ;

This is still an oversimplification, but it covers a large number of use cases. It's neither ambiguous nor left-recursive. [3]


Notes

  1. Syntactic constraints specified in comments may not be easy to translate into CFGs, though. For example, the ISO standard EBNF for EBNF defines the non-terminal "syntactic exception" as follows:

    syntactic exception =
       ? a syntactic-factor that could be replaced
         by a syntactic-factor containing no
         meta-identifiers
       ?


    The intention of the above text is to restrict an exception to be a regular language. That's important, since the set difference between two context-free languages is not necessarily context-free, while the difference between a context-free language and a regular language is provably context-free. Ironically, the "special sequence" describing this restriction cannot be expressed as a context-free grammar because it depends on the definition of meta-identifiers. (If it had said "a syntactic-factor containing no meta-identifiers" it would be easy to write without resorting to a special sequence, but clearly that was not the intent.)

  2. There is another important error in the Wikipedia excerpt. It defines both types of quoted strings as having the same body, but that's not correct; a double-quoted string cannot include double-quote characters, and a single-quoted string cannot contain single-quote characters. So the use of the identifier character in both of those definitions is incorrect.

  3. The formal EBNF grammar allows primary to be empty. I left that out, because it's not usually needed.

like image 129
rici Avatar answered Sep 20 '22 19:09

rici


In short, no, your grammar is not LL(1).

The first reason is the left-recursion of rhs you already discovered. I assume, you wrote a recursive descent parser (or something else that bases on LL(1) grammars). Such a parser is not able to handle left-recursive rules since they cause a special case of a so-called FIRST/FIRST conflict (cf. 1).

To tackle this problem and answer the second part of your question, you can left-factor your grammar and replace left-recursion as shown in 2.

like image 44
ojlr Avatar answered Sep 21 '22 19:09

ojlr