As a pet-project, I'd like to attempt to implement a basic language of my own design that can be used as a web-scripting language. It's trivial to run a C++ program as an Apache CGI, so the real work lies in how to parse an input file containing non-code (HTML/CSS markup) and server-side code.
In my undergrad compiler course, we used Flex and Bison to generate a scanner and a parser for a simple language. We were given a copy of the grammar and wrote a parser that translated the simple language to a simple assembly for a virtual machine. The flex scanner tokenizes the input, and passes the tokens to the Bison parser.
The difference between that and what I'd like to do is that like PHP, this language could have plain HTML markup and the scripting language interspersed like the following:
<p>Hello,
<? echo "World ?>
</p>
Am I incorrect in assuming that it would be efficient to parse the input file as follows:
Basically, the first scanner only differentiates between Markup (which is returned directly to the browser unmodified) and code, which is passed to the second scanner, which in turn tokenizes the code and passes the tokens to the parser.
If this is not a solid design pattern, how do languages such as PHP handle scanning input and parsing code efficiently?
You want to look at start conditions. For example:
"<?" { BEGIN (PHP); }
<PHP>[a-zA-Z]* { return PHP_TOKEN; }
<PHP>">?" { BEGIN (0); }
[a-zA-Z]* { return HTML_TOKEN; }
You start off in state 0, use the BEGIN macro to change states. To match a RE only while in a particular state, prefix the RE with the state name surrounded by angle-brackets.
In the example above, "PHP" is state. "PHP_TOKEN" and "HTML_TOKEN" are _%token_s defined by your yacc file.
PHP doesn't differentiate between the scanning and the Markup. It simply outputs to buffer when in Markup mode, and then switches to parsing when in code mode. You don't need a two pass scanner, and you can do this with just a single flex lexer.
If you are interested in how PHP itself works, download the source (try the PHP4 source it is a lot easier to understand). What you want to look at is in the Zend Directory, zend_language_scanner.l
.
Having written something similar myself, I would really recommend rethinking going the Flex and Bison route, and go with something modern like Antlr. It is a lot easier, easier to understand (the macros employed in a lex grammar get very confusing and hard to read) and it has a built in debugger (AntlrWorks) so you don't have to spend hours looking at 3 Meg debug files. It also supports many languages (Java, c#, C, Python, Actionscript) and has an excellent book and a very good website that should be able to get you up and running in no time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With