
Compiler compiler that consumes BNF

Is there any reason why there are no parser generators that consume straight BNF?

I'm familiar with JavaCC and Antlr, and recently came across Parse2. It seems that each has its own notation. BNF is really easy to read and the other notations aren't. BNF is unambiguous. Is there some inherent reason why I can't feed BNF to a compiler compiler and get a parse tree out?

ccleve asked Oct 23 '14

People also ask

What is BNF in compiler design?

BNF stands for Backus-Naur Form. It is a formal notation for describing the syntax of a programming language, introduced by John Backus and Peter Naur around 1960.

What is BNF give an example?

BNF is a notation for Chomsky's context-free grammars. Backus was familiar with Chomsky's work. As proposed by Backus, each formula defined a "class" whose name is enclosed in angle brackets, for example <ab>. Each of these names denotes a class of basic symbols.

What is the difference between BNF and EBNF?

BNF syntax can only represent a rule in one line, whereas in EBNF a terminating character, the semicolon, marks the end of a rule. Furthermore, EBNF includes mechanisms for enhancements, defining the number of repetitions, excluding alternatives, comments, etc.

Is Yacc a compiler?

Yacc (yet another compiler compiler) is a grammar parser and parser generator. That is, it is a program that reads a grammar specification and generates code that is able to organize input tokens in a syntactic tree in accordance with the grammar.


2 Answers

Marpa::R2, the Perl interface to Marpa, a general BNF parser, accepts straight BNF as a grammar description and generates a parser for it in Perl. Here is an example taken almost literally from a BNF grammar tutorial.

<tree>  ::= '(' <list> ')'
<list>  ::= <thing> | <list> ',' <thing>
<thing> ::= <tree> | <name>             
<name>  ::= 'ant' | 'bat' | 'cow' | 'dog' | 'cat'

The full code example.
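To illustrate what a parser for this four-rule grammar does, here is a minimal recursive-descent sketch in Python. This is my own illustration, not Marpa: Marpa handles the left-recursive <list> rule directly, whereas this sketch rewrites it as a repetition, list ::= thing (',' thing)*.

```python
import re

# The five allowed <name> terminals from the grammar above.
NAMES = {'ant', 'bat', 'cow', 'dog', 'cat'}

def tokenize(s):
    # Split e.g. "(ant,(bat,cow))" into tokens: '(', ')', ',' and names.
    return re.findall(r"[(),]|[a-z]+", s)

def parse(s):
    toks = tokenize(s)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def expect(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t!r} at token {pos}")
        pos += 1

    def tree():                      # <tree> ::= '(' <list> ')'
        expect('(')
        node = lst()
        expect(')')
        return node

    def lst():                       # <list> ::= <thing> | <list> ',' <thing>
        items = [thing()]            # (left recursion rewritten as a loop)
        while peek() == ',':
            expect(',')
            items.append(thing())
        return items

    def thing():                     # <thing> ::= <tree> | <name>
        if peek() == '(':
            return tree()
        name = peek()
        if name in NAMES:
            expect(name)
            return name
        raise SyntaxError(f"unexpected token {peek()!r}")

    result = tree()
    if pos != len(toks):
        raise SyntaxError("trailing input")
    return result
```

For example, parse("(ant,(bat,cow))") returns the nested parse tree ['ant', ['bat', 'cow']].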

rns answered Sep 28 '22


Is there any reason why there are no parser generators that consume straight BNF?

No, but most parser generators started out trying to do everything. It took the community a long time to figure out that sometimes less is more.

Tools that consume (some variation) of BNF have been around for a long time.

A paper describing MetaII was published back in 1963. MetaII's baseline BNF is what you expect (left-hand side plus right-hand side) plus a number of now relatively standard EBNF extensions (Kleene star and plus, grouping, alternatives) and non-standard embedded actions (used to generate literally syntax-directed translation output). It assumes built-in token lexers. MetaII can meta-compile its own grammar to reproduce itself exactly [the paper shows this, and it is a mind-blowing moment when you grok why this works], thus enabling bootstrapping to more sophisticated versions, and compilation of other simple languages. A common easy extension is defining grammar rules for lexical tokens. I built a variety of implementations, and tools that used this, back in the 1970s because it was so cool and so useful.

Tree Meta added actions for building trees, and an auxiliary grammar for pattern matching trees to generate output.

When you add in all the extra EBNF and generation machinery, the resulting "BNF" for both MetaII and Tree Meta can be fairly hard for the uninitiated to read, mostly because it is dense and you have to be familiar with the context. Otherwise it looks like the output of a chain printer (for those of you old enough to know what that is) that broke.

Most modern compiler-compilers aren't much different. YACC has extended BNF with embedded (your favorite language) code to do actions, usually used to build trees. JavaCC uses an extended BNF; with JJTree you can build ASTs. ANTLR 1/2/3 also has an extended BNF, with embedded actions for building trees. That makes them just as, er, ugly as MetaII... no progress in 40 years IMHO.

Our DMS Software Reengineering Toolkit (see my bio) uses the barest BNF you can imagine, for truly complex grammars such as IBM Enterprise COBOL, Fortran 2005 and C++14. It looks like:

  LHS = RHS1 RHS2 ...  RHSn ;

for various tokens LHS, RHS1 ... RHSn. Lists are easy: you use two rules, one for the base case, one for list extension. Alternatives are easy: code another rule. It is technically straightforward to code grammar rules for tokens simply as grammar rules whose terminals are actual characters. (DMS offers, and we typically use, a real lexer for parsing speed reasons).
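To make the "lists via two rules" point concrete, consider a pure-BNF pair of rules such as number_list = NUMBER ; and number_list = number_list ',' NUMBER ;. A bottom-up parser like DMS consumes the left recursion directly; a hand-written top-down parser typically turns it into a loop. Here is a hypothetical Python sketch (names are my own) of the top-down reading:

```python
# Pure BNF:
#   number_list = NUMBER ;                      -- base case
#   number_list = number_list ',' NUMBER ;      -- list extension
# i.e. one or more NUMBERs separated by commas. A top-down parser
# realizes the left recursion as iteration: parse the base case,
# then keep applying the extension rule while a ',' follows.

def parse_number_list(tokens):
    # tokens: e.g. ['1', ',', '2', ',', '3']
    pos = 0

    def number():
        nonlocal pos
        tok = tokens[pos] if pos < len(tokens) else None
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected NUMBER, got {tok!r}")
        pos += 1
        return int(tok)

    values = [number()]                   # base rule
    while pos < len(tokens) and tokens[pos] == ',':
        pos += 1                          # extension rule
        values.append(number())
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return values
```

For example, parse_number_list(['1', ',', '2', ',', '3']) yields [1, 2, 3].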

Here is a DMS BNF for high school algebra (and a bit of calculus; some liberties in notation):

equation = sum ;
sum = product ;
sum = sum '+' product ;
sum = sum '-' product ;
product = term ;
product = product '*' term ;
product = product '/' term ;
term = primary ;
term = primary '^' term ;
primary = NUMBER ;
primary = NAME ;
primary = '(' sum ')' ;
primary = '-' primary ;
primary = NAME '(' sum ')' ;
primary = 'D' NAME ':' primary ; -- differentiate primary
primary = 'I' sum 'D' NAME ; -- indefinite integral
primary = 'I' sum ',' sum ':' sum 'D' NAME ; -- definite integral

A real DMS grammar file has other things in it for describing prettyprinting, etc. Here you can see a bigger example of a DMS grammar for M6809 assembly code.

What is interesting is that DMS builds ASTs using only the grammar as a guide; no additional tree-building actions. By avoiding the need to specify actions-while-parsing (to build tree nodes), the resulting grammars are pretty simple to read.
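As a loose sketch of what "trees from the grammar alone" means (my own construction in Python, not DMS internals): the only tree-building code is one uniform constructor keyed by the rule name, so the grammar author never writes per-rule actions, and revising the grammar never requires revising tree-building code. This parses a subset of the algebra grammar above (sum, product, primary over NUMBERs):

```python
# Every rule reduction uniformly becomes a node (rule_name, children...).
# No rule-specific action code exists anywhere; the tree shape follows
# the grammar directly, as with DMS's concrete trees.

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def node(self, rule, *children):
        # The single, uniform "action": build a node named after the rule.
        return (rule, *children)

    def primary(self):                    # primary = NUMBER ;
        tok = self.take()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected NUMBER, got {tok!r}")
        return self.node('primary', tok)

    def product(self):                    # product = primary | product '*' primary ;
        n = self.node('product', self.primary())
        while self.peek() == '*':
            n = self.node('product', n, self.take(), self.primary())
        return n

    def sum(self):                        # sum = product | sum '+'/'-' product ;
        n = self.node('sum', self.product())
        while self.peek() in ('+', '-'):
            n = self.node('sum', n, self.take(), self.product())
        return n
```

For example, Parser(['1', '+', '2']).sum() produces ('sum', ('sum', ('product', ('primary', '1'))), '+', ('product', ('primary', '2'))), a concrete tree that mirrors the grammar rule by rule.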

DMS has been doing this since about 1998; it is the first tool I know of that took this approach. The ANTLR guys finally figured out this was a good idea, and now ANTLR4 as of 2013 will build a tree without any explicit embedded actions, although it still has actions for other purposes. DMS's trees can be either concrete (following the grammar directly) or "abstract" (many tree nodes are dropped, as they can be reconstructed by DMS on demand given the abstract tree and the grammar). The abstract trees are actually pretty decent. I don't know what ANTLR4 does here.

The really nice thing about this approach is one can write and revise really complicated, big, grammars by simply revising the rules; the construction of ASTs is "free" and it is always "right" with respect to the grammar. That allows DMS to provide a variety of other related tools (prettyprinting, source-to-source level transformation) using that as baseline. (DMS is technically a "meta" compiler in the sense that it can parse its own grammars using its own grammar; we use this capability of DMS to generate the parsers implied by those grammars).

You can see a full example of this at "algebra as a dms domain", also via my bio. (The BNF was taken from there.) I'd provide links, but many SO people dislike that.

Ira Baxter answered Sep 28 '22