I am new to parser generators and I am wondering what the ANTLR grammar for an embedded language like JSP/ASP/PHP might look like, but unfortunately the ANTLR site doesn't provide any such grammar files.
More precisely, I don't know how to define an AnyText token which matches everything (including keywords, which have no meaning outside the code blocks) while still being able to recognize those keywords correctly inside the blocks.
For example, the following snippet should be tokenized as something like: AnyText, BlockBegin, Keyword, BlockEnd, AnyText.
lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet
Maybe there is another parser generator which is better suited to my needs. I have only tried ANTLR up to now, because of its huge popularity here at Stack Overflow :)
Many thanks in advance!
I can't speak for ANTLR, as I use a different lexer/parser (the DMS Software Reengineering Toolkit), for which I have developed precisely such JSP and PHP lexer/parsers. (ASP isn't different, as you have observed in your question.)
But the basic idea is that the lexer needs lexical modes to recognize when you are picking up "anytext" and when you are processing "real" programming language text. So you need a starting lexical mode, say HTML, whose job is to absorb the HTML text and, when it encounters a transition into PHP, switch modes. You also need a PHP mode which picks up all the PHP tokens, and switches back to HTML mode when the transition-out characters are encountered. Here's a sketch:
%%HTML -- mode
#token HTMLText "~[]* \< \% "
<< (GotoPHPMode) >>
%%PHP -- mode
#token KEYWORD "KEYWORD"
...
#token '%>' "\%\>"
<< (GotoHTMLMode) >>
Your lexer generator is likely to have some kind of mode-switching capability that you'll have to use instead of this. And you'll likely find that lexing the HTML stuff is more complicated than it looks (you have to worry about <SCRIPT tags and lots of other crazy HTML stuff), but those are details I presume you can handle.
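If it helps to see the mode-switching idea independent of any particular generator, here is a minimal hand-written sketch in Python. It is not DMS or ANTLR code; the names `tokenize`, `KEYWORDS`, and `CODE_TOKEN`, and the `Identifier`/`Other` token kinds, are assumptions made up for illustration, and it only handles the `<% ... %>` delimiters from the question.

```python
import re

# Hypothetical keyword set; a real language would list all of its keywords.
KEYWORDS = {"KEYWORD"}

# Tokens recognised only while in CODE mode. Whitespace has no named group,
# so a whitespace match leaves m.lastgroup as None and is skipped.
CODE_TOKEN = re.compile(
    r"\s+|(?P<BlockEnd>%>)|(?P<Word>[A-Za-z_]\w*)|(?P<Other>.)",
    re.DOTALL,
)

def tokenize(source):
    """Split source into (kind, text) pairs using two lexical modes."""
    tokens = []
    pos = 0
    mode = "TEXT"  # start outside any code block
    while pos < len(source):
        if mode == "TEXT":
            # Absorb everything up to the next "<%" as one AnyText token.
            start = source.find("<%", pos)
            end = len(source) if start == -1 else start
            if end > pos:
                tokens.append(("AnyText", source[pos:end]))
            if start == -1:
                break
            tokens.append(("BlockBegin", "<%"))
            pos = start + 2
            mode = "CODE"  # transition into the embedded language
        else:
            m = CODE_TOKEN.match(source, pos)
            pos = m.end()
            kind = m.lastgroup
            if kind is None:
                continue  # whitespace inside the block
            text = m.group()
            if kind == "BlockEnd":
                tokens.append(("BlockEnd", text))
                mode = "TEXT"  # transition back out
            elif kind == "Word":
                tokens.append(("Keyword" if text in KEYWORDS else "Identifier", text))
            else:
                tokens.append(("Other", text))
    return tokens
```

Calling `tokenize("lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet")` yields the AnyText, BlockBegin, Keyword, BlockEnd, AnyText sequence from the question, because the first KEYWORD is only ever seen in TEXT mode, where keyword rules don't apply.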
I've come across this project http://code.google.com/p/phpparser/ which also contains an ANTLR grammar file for parsing PHP: http://code.google.com/p/phpparser/source/browse/grammar/Php.g
Hope this helps.