Composable Grammars

Many programming languages support the inclusion of mini-languages: PHP is embedded within HTML, XML can be embedded within JavaScript (via E4X), Linq can be embedded within C#, and regular expressions can be embedded in Perl.

// JavaScript (E4X) example
var a = <node><child/></node>;

Come to think of it, most programming languages can be modeled as collections of mini-languages. Java, for example, could be broken into at least four distinct mini-languages:

  • A type-declaration language (package directive, import directives, class declaration)
  • A member-declaration language (access modifiers, method declarations, member vars)
  • A statement language (control flow, sequential execution)
  • An expression language (literals, assignments, comparisons, arithmetic)

Being able to implement those four conceptual languages as four distinct grammars would certainly cut down on a lot of the spaghettiism that I usually see in complex parser and compiler implementations.

I've implemented parsers for various kinds of languages before (using ANTLR, JavaCC, and custom recursive-descent parsers), and when the language gets really big and complex, you usually end up with one huge grammar, and the parser implementation gets ugly really fast.

Ideally, when writing a parser for one of those languages, it'd be nice to implement it as a collection of composable parsers, passing control back and forth between them.
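
Here's roughly what I mean, as a minimal Python sketch (the Cursor class and the toy bracketed sub-language are made up for illustration, not any particular parser framework):

# Two hand-written recursive-descent parsers sharing one cursor.
# The parent parses statements; on '[' it hands the cursor to the
# child parser, which consumes its sub-language and hands it back.

class Cursor:
    def __init__(self, text):
        self.text, self.pos = text, 0
    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""
    def advance(self):
        ch = self.peek()
        self.pos += 1
        return ch

def parse_statement(cur):
    parts = []
    while cur.peek() and cur.peek() != ";":
        if cur.peek() == "[":            # start of the embedded language
            cur.advance()
            parts.append(parse_query(cur))
        else:
            parts.append(cur.advance())
    return "".join(parts)

def parse_query(cur):
    body = []
    while cur.peek() and cur.peek() != "]":
        body.append(cur.advance())
    cur.advance()                        # consume ']' and return control
    return "<query: " + "".join(body) + ">"

print(parse_statement(Cursor("x = [from n in ns];")))
# x = <query: from n in ns>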

The tricky thing is that the containing language (e.g., Perl) often defines its own terminus sentinel for the contained language (e.g., regular expressions). Here's a good example:

my $result = $text =~ m|abc.*xyz|i;

In this code, the main Perl code defines a nonstandard terminus "|" for the regular expression. Implementing the regex parser as completely distinct from the Perl parser would be really hard, because the regex parser wouldn't know how to find the expression terminus without consulting the parent parser.
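
One workaround I can picture: have the parent parser read the delimiter and pass it down, so the regex parser only ever needs to know its one terminus character. A rough Python sketch (the function names and the index-based cursor style are mine, not real Perl internals):

# The Perl-side parser reads whatever delimiter follows 'm' and passes
# it to the regex parser, which knows nothing else about Perl.

def parse_match_operator(src, i):
    assert src[i] == "m"
    delim = src[i + 1]                   # e.g. '|' in  m|abc.*xyz|i
    pattern, i = parse_regex(src, i + 2, terminus=delim)
    flags = ""
    while i < len(src) and src[i].isalpha():
        flags += src[i]
        i += 1
    return pattern, flags, i

def parse_regex(src, i, terminus):
    body = ""
    while src[i] != terminus:
        if src[i] == "\\":               # keep escapes, including \|
            body += src[i]
            i += 1
        body += src[i]
        i += 1
    return body, i + 1                   # skip the terminus itself

print(parse_match_operator("m|abc.*xyz|i", 0))  # ('abc.*xyz', 'i', 12)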

Or, let's say I had a language which allowed the inclusion of Linq expressions, but instead of terminating them with a semicolon (as C# does), I wanted to mandate that Linq expressions appear within square brackets:

var linq_expression = [from n in numbers where n < 5 select n]

If I defined the Linq grammar within the parent language grammar, I could easily write an unambiguous production for a "LinqExpression" using syntactic lookahead to find the bracket enclosures. But then my parent grammar would have to absorb the whole Linq specification. And that's a drag. On the other hand, a separate child Linq parser would have a very difficult time figuring out where to stop, because it would need to implement lookahead for foreign token-types.

And that would pretty much rule out using separate lexing/parsing phases, since the Linq parser would define a whole different set of tokenization rules than the parent parser. If you're scanning for one token at a time, how do you know when to pass control back to the lexical analyzer of the parent language?
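
One partial answer here is mode-based lexing, which is roughly what ANTLR 4's lexer modes provide: the scanner keeps a stack of rule sets and switches on the delimiter tokens. A toy Python sketch, with invented rule tables for the parent language and the bracketed mini-language:

import re

# A mode-switching tokenizer: '[' pushes the embedded language's rules
# onto a stack, ']' pops back to the parent's rules.

PARENT_RULES = [("NAME", r"[A-Za-z_]\w*"), ("EQ", r"="), ("LBRACK", r"\[")]
CHILD_RULES = [("KEYWORD", r"(?:from|in|where|select)\b"),
               ("NAME", r"[A-Za-z_]\w*"), ("OP", r"[<>]=?"),
               ("NUM", r"\d+"), ("RBRACK", r"\]")]

def tokenize(src):
    modes, i, tokens = [PARENT_RULES], 0, []
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        for name, pattern in modes[-1]:
            m = re.match(pattern, src[i:])
            if m:
                tokens.append((name, m.group()))
                if name == "LBRACK":
                    modes.append(CHILD_RULES)   # enter the mini-language
                elif name == "RBRACK":
                    modes.pop()                 # back to the parent
                i += m.end()
                break
        else:
            raise SyntaxError("no rule matches %r at %d" % (src[i], i))
    return tokens

print(tokenize("q = [from n in ns where n < 5 select n]"))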

What do you guys think? What are the best techniques available today for implementing distinct, decoupled, and composable language grammars for the inclusion of mini-languages within larger parent languages?

asked Jun 04 '09 by benjismith

5 Answers

You might want to listen to this podcast. Scanner-less parsing was "invented" to help solve the problem of composing different grammars (the problem being that you quickly find that you can't write a "universal" tokenizer/scanner).
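
To give a feel for the idea: in a scannerless parser the grammar rules consume raw characters, so an embedded language never has to share the host's tokenizer. A toy PEG-style sketch in Python (the combinator names are invented for illustration):

# Each rule matches raw characters and returns (value, next_position),
# or None on failure. No shared token stream exists at all.

def literal(s):
    return lambda src, i: (s, i + len(s)) if src.startswith(s, i) else None

def many_until(stop):
    def rule(src, i):
        j = i
        while j < len(src) and src[j] != stop:
            j += 1
        return src[i:j], j
    return rule

def seq(*rules):
    def rule(src, i):
        out = []
        for r in rules:
            res = r(src, i)
            if res is None:
                return None
            val, i = res
            out.append(val)
        return out, i
    return rule

# A host-language rule embedding a regex rule, character for character.
match_op = seq(literal("m|"), many_until("|"), literal("|"))
print(match_op("m|abc.*xyz|", 0))   # (['m|', 'abc.*xyz', '|'], 11)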

answered by Paul Hollingsworth


I am working on this exact problem. I'll share my thoughts:

Grammars are difficult to debug. I've debugged a few in Bison and ANTLR and it wasn't pretty. If you want the user to insert DSLs as grammar into your parser, then you've got to find some way to make it so it doesn't blow up. My approach is to not allow arbitrary DSLs, but to only allow those that follow two rules:

  • The token types (identifiers, strings, numbers) are the same between all DSLs in a file.
  • Unbalanced parentheses, braces, or brackets are not permitted.

The reason for the first restriction is that modern parsers split the work into a lexical stage followed by traditional grammar rules. Fortunately, I believe that a single universal tokenizer is good enough for 90% of the DSLs you want to create, even if it doesn't accommodate the DSLs you already created that you want to embed.

The second restriction allows the grammars to be more separated from each other. You can parse in two stages by grouping parentheses (braces, brackets) and then recursively parsing each group. The grammar of your embedded DSL can't escape through the brackets it's contained in.
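
Something like this hypothetical Python sketch, which groups brackets in a first pass so that each grammar can later be handed its own group (the details are illustrative, not an actual implementation):

# Stage one: turn a flat token list into a tree of bracket groups.
# Stage two (not shown): hand each group to its own grammar.

BRACKETS = {"(": ")", "[": "]", "{": "}"}

def group(tokens):
    def walk(i, closer):
        items = []
        while i < len(tokens):
            tok = tokens[i]
            if tok in BRACKETS:
                sub, i = walk(i + 1, BRACKETS[tok])   # recurse into group
                items.append((tok, sub))
            elif tok == closer:
                return items, i + 1
            elif tok in BRACKETS.values():
                raise SyntaxError("unbalanced " + tok)
            else:
                items.append(tok)
                i += 1
        if closer is not None:
            raise SyntaxError("missing " + closer)
        return items, i
    tree, _ = walk(0, None)
    return tree

print(group(["var", "x", "=", "[", "from", "n", "in", "ns", "]"]))
# ['var', 'x', '=', ('[', ['from', 'n', 'in', 'ns'])]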

Another part of the solution is to allow macros. For example, regex("abc*/[^.]") looks fine to me. This way the macro "regex" can parse the regex instead of building a regex grammar into the main language. You can't use different delimiters for your regex, sure, but you do gain a measure of consistency in my mind.
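
A toy Python sketch of that macro idea (MACRO_HANDLERS and expand_macro are invented names): the host parser only ever sees a call with one string argument, and the macro's handler acts as the child parser.

import re

# The host language treats regex("...") as an ordinary call whose string
# argument is handed, unparsed, to the macro's own handler.

MACRO_HANDLERS = {
    "regex": re.compile,        # the "child parser" for regex bodies
}

def expand_macro(call_text):
    name, _, rest = call_text.partition("(")
    body = rest.rstrip(")").strip('"')
    return MACRO_HANDLERS[name](body)

pattern = expand_macro('regex("abc*")')
print(bool(pattern.match("abcc")))   # True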

answered by Dietrich Epp


Have a look at SGLR, Scannerless Generalized LR parsing. Here are some references and URLs. This parsing technique makes composition of parse tables very simple, especially in combination with SDF (the Syntax Definition Formalism).

Martin Bravenboer and Eelco Visser. Designing Syntax Embeddings and Assimilations for Language Libraries. In Models in Software Engineering: Workshops and Symposia at MoDELS 2007, volume 5002 of LNCS, 2008.

MetaBorg and MetaBorg in action

answered by Zef Hemel


Parsing is one aspect of the problem, but I suspect that the interoperation between the various executable interpreters behind each mini-language is probably significantly harder to solve. In order to be useful, each independent syntax block has to work consistently with the overall context (or the final behavior will be unpredictable, and thus unusable).

Not that I understand what they are really doing, but a very interesting place to look for more inspiration is FoNC. They seem (I am guessing) to be headed in a direction that allows all sorts of different computational engines to interact seamlessly.

answered by Paul W Homer


Perl 6 can be seen as a set of DSLs specifically made for writing programs.

In fact the Rakudo implementation is built in exactly this fashion.

Even strings are a DSL with options you can enable or disable.

Q
:closure
:backslash
:scalar
:array
:hash
"{ 1 + 3 } \n $a @a<> %a<>"

qq"{1+2}" eq 「3」

qq:!closure"{1+2}" eq 「{1+2}」

It basically has to be made from composable grammars for this to work:

sub circumfix:«:-) :-)» (@_) { say @_ }

:-) 1,2,3 :-)

In Perl 6 grammars are just a type of class, and tokens are a type of method.

role General-tokens {
  token start-of-line { ^^ }
  token end-of-line { $$ }
}
grammar Example does General-tokens {   # tokens composed in from the role
  token TOP {
    <start-of-line> <stuff> <end-of-line>
  }
  token stuff { \N+ }
}

role Other {
  token start-of-line { <alpha> ** 5 }  # overrides the composed token
}
grammar Composed-in is Example does Other {  # inherits and composes
  token alpha { .. }                    # any two characters
}

say Composed-in.parse: 'abcdefghijklmnopqrstuvwxyz';
「abcdefghijklmnopqrstuvwxyz」
 start-of-line => 「abcdefghij」
  alpha => 「ab」
  alpha => 「cd」
  alpha => 「ef」
  alpha => 「gh」
  alpha => 「ij」
 stuff => 「klmnopqrstuvwxyz」
 end-of-line => 「」

Note that I didn't show an actions class, which is handy for transforming the parse-tree as it is built.

answered by Brad Gilbert