
Automatically parsing PHP to separate PHP code from HTML

I'm working on a large PHP code base and would like to separate the PHP code from the HTML and JavaScript. (I need to run several automatic search-and-replaces on the PHP code, different ones on the HTML, and different ones again on the JS.) Is there a good parser engine that could separate out the PHP for me? I could do this with regular expressions, but they're not perfect. I could perhaps build something in ANTLR, but a good existing solution would be best.

I should make clear: I don't want or need a full PHP parser. Just need to know if a given token is:

- PHP code
- PHP single quote string
- PHP double quote string
- PHP comment
- Not PHP, but rather HTML/JavaScript

SRobertJames asked Nov 07 '10 17:11



2 Answers

How about the tokenizer built right into PHP itself?

The tokenizer functions provide an interface to the PHP tokenizer embedded in the Zend Engine. Using these functions you may write your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.

You ask in the comments whether you can regenerate the code from the tokenized output. Yes, you can: all whitespace is preserved as T_WHITESPACE tokens. Here's how you might turn the tokenized output back into code:

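// $code is assumed to hold the PHP source text to tokenize
$regenerated = '';

$tokens = token_get_all($code);
foreach ($tokens as $t)
{
    if (is_array($t))
    {
        // Array tokens are [token id, text, line number].
        // Do something with strings and comments here?
        switch ($t[0])
        {
            case T_CONSTANT_ENCAPSED_STRING:
                // a quoted string with no interpolated variables
                break;
            case T_COMMENT:
            case T_DOC_COMMENT:
                break;
        }
        $regenerated .= $t[1];
    }
    else
    {
        // Single-character tokens such as ; or { arrive as plain strings.
        $regenerated .= $t;
    }
}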
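
The same loop can be pushed a little further to label each token with one of the five categories from the question. What follows is only a rough sketch: classify_tokens() and the category names are made up for illustration, not part of PHP. It assumes that a bare `"` token opens or closes a double-quoted string containing variables, and it does not handle heredocs or interpolated strings nested inside one another.

```php
<?php
// Hypothetical helper (not a PHP built-in): labels every token from
// token_get_all() as html, comment, single_quoted, double_quoted or php.
function classify_tokens(string $code): array
{
    $out = [];
    $inDoubleQuotes = false; // true while inside an interpolated "..." string

    foreach (token_get_all($code) as $t) {
        if (!is_array($t)) {
            // Single-character tokens are plain strings. A bare '"' marks the
            // start or end of a double-quoted string that contains variables.
            if ($t === '"') {
                $inDoubleQuotes = !$inDoubleQuotes;
                $out[] = ['double_quoted', $t];
            } else {
                $out[] = [$inDoubleQuotes ? 'double_quoted' : 'php', $t];
            }
            continue;
        }

        [$id, $text] = $t;
        if ($id === T_INLINE_HTML) {
            $kind = 'html'; // everything outside <?php ... ?>
        } elseif ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            $kind = 'comment';
        } elseif ($inDoubleQuotes) {
            $kind = 'double_quoted';
        } elseif ($id === T_CONSTANT_ENCAPSED_STRING) {
            // The closing quote tells single- and double-quoted strings apart.
            $kind = substr($text, -1) === "'" ? 'single_quoted' : 'double_quoted';
        } else {
            $kind = 'php';
        }
        $out[] = [$kind, $text];
    }
    return $out;
}

$sample = '<p>Hi</p><?php $a = \'x\'; /* note */ $b = "y $a"; ?><b>end</b>';
foreach (classify_tokens($sample) as [$kind, $text]) {
    printf("%-14s %s\n", $kind, json_encode($text));
}
```

Because every token's text is kept, concatenating the second element of each pair gives back the original source.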
Paul Dixon answered Oct 21 '22 22:10


To separate the PHP from the rest, PHP's built-in tokenizer is your best choice: see token_get_all().
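
As a rough illustration of that split, the sketch below runs one replacement on the PHP tokens and a different one on everything outside the PHP tags, which the tokenizer reports as T_INLINE_HTML. rewrite_source() and the sample replacements are hypothetical names invented for this example. It assumes PHP 7.4+ for the arrow functions, and that each replacement fits inside a single token, since the callbacks see one token at a time.

```php
<?php
// Hypothetical helper (not a PHP built-in): applies $onHtml to text outside
// the PHP tags and $onPhp to every PHP token, then glues the result together.
function rewrite_source(string $code, callable $onPhp, callable $onHtml): string
{
    $result = '';
    foreach (token_get_all($code) as $t) {
        if (is_array($t) && $t[0] === T_INLINE_HTML) {
            $result .= $onHtml($t[1]);
        } else {
            // Array tokens carry their text in index 1; others are plain strings.
            $result .= $onPhp(is_array($t) ? $t[1] : $t);
        }
    }
    return $result;
}

$source = '<div class="old"><?php echo old_name("old"); ?></div>';
$rewritten = rewrite_source(
    $source,
    fn (string $php): string => str_replace('old_name', 'new_name', $php),
    fn (string $html): string => str_replace('class="old"', 'class="new"', $html)
);
echo $rewritten, "\n";
```

The HTML callback never sees PHP code and vice versa, so the string "old" inside the PHP call is left alone by both replacements.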

For the rest, you might be best off with a DOM parser. Isolating the <script> parts (and external scripts, and even onXXXX events) is easy that way.
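
A minimal sketch of that idea using PHP's DOM extension follows. extract_javascript() and its result keys are hypothetical, chosen only for this example. It assumes the input is HTML with the PHP blocks already taken out, and that ext-dom is available.

```php
<?php
// Hypothetical helper (not a PHP built-in): collects inline scripts, external
// script URLs and on* event-handler attributes from an HTML string.
function extract_javascript(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['inline' => [], 'external' => [], 'handlers' => []];

    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->hasAttribute('src')) {
            $found['external'][] = $script->getAttribute('src');
        } else {
            $found['inline'][] = $script->textContent;
        }
    }

    // Event handlers are attributes whose names start with "on" (onclick, onload, ...).
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $found['handlers'][] = [$attr->ownerElement->nodeName, $attr->name, $attr->value];
    }
    return $found;
}

$page = '<html><body><script src="app.js"></script>'
      . '<script>var a = 1;</script>'
      . '<button onclick="go()">Go</button></body></html>';
print_r(extract_javascript($page));
```

This only reads the JavaScript out of the tree; it does not write the document back.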

It might be tough to re-build the identical document from a parsed DOM tree, though - I guess it depends on what you need to do with the results and how clean the original HTML is. A regular expression (yuck!) could work better for that part.

Pekka answered Oct 21 '22 22:10