Creating custom PHP Syntax Parser

I am thinking about how one would go about creating a PHP equivalent for a couple of libraries I found for CSS and JS.

One is Less CSS, which is a dynamic stylesheet language. The basic idea behind Less CSS is that it lets you write more dynamic CSS rules using constructs that "regular" CSS does not support, such as mixins and functions, and then compiles that syntax into regular CSS.

Another interesting JS library which behaves in a (kind of) similar way is CoffeeScript, where you write "tidier & simpler" code which then gets compiled into regular JavaScript.

How would one go about creating a simple, similar interface for PHP? Just as a proof of concept; I am only trying to learn stuff. Let's take a simple use case of extending classes.

class a
{
    function a_test()
    {
        echo "This is test in a ";
    }
}

class b extends a
{
    function b_test()
    {
        parent::a_test();
        echo "This is test in b";
    }
}

$b = new b();
$b->b_test();

Suppose I want to let the user write class b as (just for the example):

class b[a] //would mean b extends a
{
    function b_test()
    {
        [a_test] //would mean parent::a_test()
        echo "This is test in b";
    }
}

And let them later have that code "resolve" to regular PHP (usually by running a separate command/process, I believe). My question is how I would go about creating something like this. Can it be done in PHP, or would I need to use something like C/C++? How should I approach this problem if I were to go at it? Are there any resources online? Any pointers are deeply appreciated!

Undefined Variable asked Oct 03 '12 21:10



1 Answer

Language transcoders are not as easy as one might think.

The example you gave can be implemented very easily with a preg_replace that looks for class definitions and replaces [a] with extends a.
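
As a rough illustration of that quick approach, here is a minimal sketch that rewrites the two constructs from the question with preg_replace. The patterns are my own assumptions about what the custom syntax looks like (a class header of the form class Name[Parent], and a line starting with [method]); they are not a robust parser and would misfire on code that uses similar-looking text inside strings or comments.

```php
<?php
// Source written in the hypothetical custom syntax from the question.
$source = <<<'SRC'
class b[a] //would mean b extends a
{
    function b_test()
    {
        [a_test] //would mean parent::a_test()
        echo "This is test in b";
    }
}
SRC;

// "class Name[Parent]" becomes "class Name extends Parent".
$php = preg_replace(
    '/\bclass\s+(\w+)\s*\[\s*(\w+)\s*\]/',
    'class ${1} extends ${2}',
    $source
);

// A line that starts with "[method]" becomes "parent::method();".
// The leading indentation is captured and kept.
$php = preg_replace(
    '/^([ \t]*)\[\s*(\w+)\s*\]/m',
    '${1}parent::${2}();',
    $php
);

echo $php;
```

This works for the toy example only because both constructs can be recognized from a single line of text, which is exactly what stops being true for more complex features.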

But more complex features need a real transcoder, which is built from several smaller logical components.

A note on jargon: what I call a transcoder here is more commonly called a transpiler or source-to-source compiler, and many programmers simply call it a compiler. The practical difference is that a traditional compiler reads source code and outputs machine code (or bytecode), while a transcoder reads source code and outputs source code in a different language.

The PHP (or JavaScript) runtime, for example, is neither of the two in this sense: it is an interpreter. It may compile the code to an internal representation first, but it executes the program itself rather than handing you translated output.

But enough about jargon, let's talk about transcoders:

To build a transcoder you must first build a tokenizer. It breaks the source code apart into tokens: when it sees a whole word such as 'class', the name of a class, 'function' or the name of a function, it captures that word and treats it as one token. When it encounters a symbol such as an opening round bracket, an opening brace or a square bracket, it treats that as another token.

Luckily, all of the tokens PHP recognizes can already be scanned with token_get_all, a function from the tokenizer extension that PHP is bundled with. You may have some trouble because PHP assumes some things about how you use symbols, but all in all you can make use of this function.
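
To get a feel for what token_get_all returns, here is a small sketch that tokenizes a class header written in the custom syntax. The helper name list_tokens and the 'CHAR' label are made up for this example. Note that token_get_all only tokenizes code that follows an opening <?php tag, and that by default it only lexes the input, so the non-standard b[a] does not stop it.

```php
<?php
// Hypothetical helper (not a PHP built-in): normalizes token_get_all() output
// into a flat list of ['type' => ..., 'text' => ...] entries.
function list_tokens(string $code): array
{
    $tokens = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Named tokens arrive as [id, text, line number].
            if ($token[0] === T_WHITESPACE) {
                continue; // whitespace carries no meaning for this sketch
            }
            $tokens[] = ['type' => token_name($token[0]), 'text' => $token[1]];
        } else {
            // Single-character symbols such as "[" or "{" arrive as plain strings.
            $tokens[] = ['type' => 'CHAR', 'text' => $token];
        }
    }
    return $tokens;
}

$tokens = list_tokens('<?php class b[a] { }');
foreach ($tokens as $t) {
    echo $t['type'], ' ', $t['text'], "\n";
}
```

The square brackets come back as plain single-character tokens, so it is up to your own parser to decide that a bracket right after a class name means "extends".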

The tokenizer creates a flat list of all the tokens it finds and gives it to the parser.

The parser is the second phase of your transcoder. It reads the list of tokens and makes decisions such as "if token[0] is 'class' and token[1] is a name, then we have a class". After running through the entire list of tokens we should have an abstract syntax tree.

The abstract syntax tree is a structure that symbolically retains only the relevant information about the source code.
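
A very reduced sketch of that decision logic is shown below. It only recognizes class headers, including the custom class Name[Parent] form from the question, and ignores everything else. The function name parse_class_headers and the shape of the returned array are assumptions made for this example; a real parser would also handle class bodies, nesting and error reporting.

```php
<?php
// Hypothetical helper (not a PHP built-in): finds class headers in a token
// list and records which parent, if any, the custom [Parent] syntax names.
function parse_class_headers(string $code): array
{
    // Drop whitespace and comments so only significant tokens remain.
    $tokens = array_values(array_filter(
        token_get_all($code),
        function ($t) {
            $ignored = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
            return !is_array($t) || !in_array($t[0], $ignored, true);
        }
    ));

    $ast = [];
    $count = count($tokens);
    for ($i = 0; $i < $count; $i++) {
        // "If this token is 'class' and the next one is a name, we have a class."
        $isClass = is_array($tokens[$i]) && $tokens[$i][0] === T_CLASS;
        $next = $tokens[$i + 1] ?? null;
        if (!$isClass || !is_array($next) || $next[0] !== T_STRING) {
            continue;
        }

        $node = ['extends' => null];
        // Custom syntax: class Name [ Parent ]
        $parent = $tokens[$i + 3] ?? null;
        if (($tokens[$i + 2] ?? null) === '['
            && is_array($parent) && $parent[0] === T_STRING
            && ($tokens[$i + 4] ?? null) === ']') {
            $node['extends'] = $parent[1];
        }
        $ast[$next[1]] = $node;
    }
    return $ast;
}

$ast = parse_class_headers('<?php class a { } class b[a] { }');
print_r($ast);
```

Here the "tree" is only one level deep; the point is that the parser turns a flat token list into a structure that records meaning (b extends a) rather than spelling (b, [, a, ]).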

$ast = array(
    'my_derived_class' => array(
        'implements' => array(
            'my_interface_1',
            'my_interface_2',
            'my_interface_3'),
        'extends' => 'my_base_class',
        'members' => array(
            'my_property_name' => 'my_default_value',
            'my_method_name' => array( /* ... */ )
        )
    )
);

After you get an abstract syntax tree you need to walk through it and output the destination source code.
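
For the array layout used in the example above, that walk could look roughly like the sketch below. The function name emit_classes is made up, and the rule "an array value is a method, anything else is a property default" is my own assumption about how to read that layout; method bodies are not modeled at all.

```php
<?php
// Hypothetical helper (not a PHP built-in): walks an AST shaped like the
// example array and emits plain PHP class declarations.
function emit_classes(array $ast): string
{
    $out = '';
    foreach ($ast as $className => $node) {
        $header = 'class ' . $className;
        if (!empty($node['extends'])) {
            $header .= ' extends ' . $node['extends'];
        }
        if (!empty($node['implements'])) {
            $header .= ' implements ' . implode(', ', $node['implements']);
        }
        $out .= $header . "\n{\n";

        foreach ($node['members'] ?? [] as $memberName => $member) {
            if (is_array($member)) {
                // Assumption: an array value marks a method; emit an empty body.
                $out .= "    function {$memberName}()\n    {\n    }\n";
            } else {
                // Anything else is treated as a property with a default value.
                $out .= '    public $' . $memberName . ' = '
                    . var_export($member, true) . ";\n";
            }
        }
        $out .= "}\n";
    }
    return $out;
}

$ast = array(
    'my_derived_class' => array(
        'implements' => array('my_interface_1', 'my_interface_2', 'my_interface_3'),
        'extends' => 'my_base_class',
        'members' => array(
            'my_property_name' => 'my_default_value',
            'my_method_name' => array( /* ... */ )
        )
    )
);

$generated = emit_classes($ast);
echo $generated;
```

Because the output is built from the tree and not from the original text, the same walker works no matter which surface syntax the user wrote.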

The really tricky part is the parser, which (depending on the complexity of the language you are parsing) may need a backtracking algorithm or some other form of pattern matching to tell similar cases apart.

I recommend reading about this in Terence Parr's book Language Implementation Patterns (http://pragprog.com/book/tpdsl/language-implementation-patterns), which describes in detail the design patterns needed to write a transcoder.

In Terence's book you'll find out why some languages such as HTML or CSS are much simpler (structurally) than PHP or JavaScript, and how that relates to the complexity of the language parser.

Mihai Stancu answered Sep 22 '22 07:09
