
How to write an ANTLR parser for JSP/ASP/PHP like languages?

Tags: parsing, antlr

I am new to parser generators and I am wondering what an ANTLR grammar for an embedded language like JSP/ASP/PHP might look like. Unfortunately, the ANTLR site doesn't provide any such grammar files.

More precisely, I don't know how to define an AnyText token that matches everything (including keywords, which have no meaning outside the code blocks) while still recognizing those keywords correctly inside the blocks.

For example, the following snippet should be tokenized as something like: AnyText, BlockBegin, Keyword, BlockEnd, AnyText.

lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet

Maybe another parser generator is better suited to my needs. I have only tried ANTLR so far, because of its huge popularity here at Stack Overflow :)

Many thanks in advance!

tux21b asked Sep 17 '09 18:09


2 Answers

I can't speak for ANTLR, as I use a different lexer/parser: the DMS Software Reengineering Toolkit, for which I have developed precisely such JSP and PHP lexers/parsers. (ASP isn't different, as you have observed in your question.)

But the basic idea is that the lexer needs lexical modes to recognize when you are picking up "anytext" and when you are processing "real" programming language text. So you need a starting lexical mode, say HTML, whose job is to absorb the HTML text and to switch modes when it encounters a transition into PHP. You also need a PHP mode which picks up all the PHP tokens, and switches back to HTML mode when the transition-out characters are encountered. Here's a sketch:

%%HTML -- mode
#token HTMLText "~[]* \< \% "
   << (GotoPHPMode) >>

%%PHP -- mode
#token KEYWORD "KEYWORD"
...
#token '%>'  "\%\>"
   << (GotoHTMLMode) >>

Your lexer generator is likely to have some kind of mode-switching capability that you'll have to use instead of this. You'll also likely find that lexing the HTML is more complicated than it looks: you have to worry about <SCRIPT tags and lots of other crazy HTML stuff, but those are details I presume you can handle.
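To make the mode-switching idea concrete independently of any particular generator, here is a minimal hand-written sketch in Python. It is not DMS or ANTLR code. The token names (AnyText, BlockBegin, Keyword, BlockEnd) are taken from the question, while the Ident token, the mode names TEXT and CODE, and the function name tokenize are assumptions made up for illustration.

```python
import re

# Delimiters from the question's example; a real JSP/PHP lexer has more of them.
BLOCK_BEGIN = "<%"
BLOCK_END = "%>"

# Rules that apply only in CODE mode. Whitespace is matched but has no group,
# so it can be skipped. "Ident" is a hypothetical extra token for illustration.
CODE_TOKEN = re.compile(
    r"\s+|(?P<BlockEnd>%>)|(?P<Keyword>KEYWORD\b)|(?P<Ident>[A-Za-z_]\w*)"
)

def tokenize(source):
    """Yield (type, text) pairs, switching between TEXT mode and CODE mode."""
    pos = 0
    mode = "TEXT"
    while pos < len(source):
        if mode == "TEXT":
            # In TEXT mode everything up to the next "<%" is one AnyText token,
            # so keywords appearing here are never looked at.
            start = source.find(BLOCK_BEGIN, pos)
            end = len(source) if start == -1 else start
            if end > pos:
                yield ("AnyText", source[pos:end])
            if start == -1:
                return
            yield ("BlockBegin", BLOCK_BEGIN)
            pos = start + len(BLOCK_BEGIN)
            mode = "CODE"
        else:
            # In CODE mode the "real" language tokens are recognized.
            m = CODE_TOKEN.match(source, pos)
            if m is None:
                raise SyntaxError(
                    f"unexpected character {source[pos]!r} at offset {pos}"
                )
            pos = m.end()
            if m.lastgroup is None:
                continue  # whitespace inside the block
            yield (m.lastgroup, m.group())
            if m.lastgroup == "BlockEnd":
                mode = "TEXT"  # transition-out characters seen

if __name__ == "__main__":
    for tok in tokenize("lorem ipsum KEYWORD dolor sit <% KEYWORD %> amet"):
        print(tok)
```

The point of the sketch is that the KEYWORD rule simply does not exist in TEXT mode, so the first KEYWORD in the sample input ends up inside an AnyText token. An unterminated block (no closing %>) is silently accepted here; a real lexer would presumably report it.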

Ira Baxter answered Oct 10 '22 19:10


I've come across this project, http://code.google.com/p/phpparser/, which also contains an ANTLR grammar file for parsing PHP: http://code.google.com/p/phpparser/source/browse/grammar/Php.g

Hope this helps.

mpobrien answered Oct 10 '22 20:10