I'm working on a large PHP code base; I'd like to separate the PHP code from the HTML and JavaScript. (I need to do several automatic search-and-replaces on the PHP code, different ones on the HTML, and different ones again on the JS.) Is there a good parser engine that could separate out the PHP for me? I could do this using regular expressions, but they're not perfect. I could build something in ANTLR, perhaps, but a good existing solution would be best.
I should make clear: I don't want or need a full PHP parser. Just need to know if a given token is:
- PHP code
- PHP single quote string
- PHP double quote string
- PHP Comment
- Not PHP, but rather HTML/JavaScript
In PHP there are two major types of XML parsers: tree-based parsers and event-based parsers.
For parsing HTML, use the DOMDocument::loadHTML() method. Parameters: $source: the string containing the HTML you want to parse; $options: optional additional libxml parameters.
PHP Parser is a library that takes source code written in PHP, passes it through a lexical analyzer, and builds the corresponding syntax tree. This is very useful for static code analysis, where we want to check our own code not only for syntax errors but also against certain quality criteria.
How about the tokenizer built right into PHP itself?
The tokenizer functions provide an interface to the PHP tokenizer embedded in the Zend Engine. Using these functions you may write your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.
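As a rough illustration of how the tokenizer could answer the five-way question above, here is a minimal sketch. The function name classify_tokens() and the category labels are made up for this example; only token_get_all() and the T_* constants come from PHP itself. It assumes that a bare `"` token only ever delimits a double-quoted string containing interpolation, and it lumps heredocs, nowdocs and backtick strings in with plain PHP code.

```php
<?php
/**
 * Classify each token of a mixed PHP/HTML source as one of:
 * 'html', 'comment', 'single_quoted', 'double_quoted' or 'php'.
 * Returns a list of [category, text] pairs in source order.
 * (Hypothetical helper for illustration, not part of PHP.)
 */
function classify_tokens(string $source): array
{
    $result = [];
    $inDoubleQuotes = false; // true while inside an interpolated "..." string

    foreach (token_get_all($source) as $token) {
        // Single-character tokens come back as plain strings.
        if (!is_array($token)) {
            if ($token === '"') {
                // Opening or closing quote of an interpolated string.
                $result[] = ['double_quoted', $token];
                $inDoubleQuotes = !$inDoubleQuotes;
            } else {
                $result[] = [$inDoubleQuotes ? 'double_quoted' : 'php', $token];
            }
            continue;
        }

        [$id, $text] = $token;
        if ($inDoubleQuotes) {
            // Variables and literal text between the bare quotes.
            $category = 'double_quoted';
        } elseif ($id === T_INLINE_HTML) {
            // Everything outside <?php ... ?>: HTML and JavaScript.
            $category = 'html';
        } elseif ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            $category = 'comment';
        } elseif ($id === T_CONSTANT_ENCAPSED_STRING) {
            // Strings without interpolation; the closing quote tells the kind.
            $category = substr($text, -1) === "'" ? 'single_quoted' : 'double_quoted';
        } else {
            $category = 'php';
        }
        $result[] = [$category, $text];
    }

    return $result;
}
```

Concatenating the text of every pair gives back the original source, so the categories can be used to decide which search-and-replace to apply to which piece.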
You ask in the comments whether you can regenerate the code from the tokenized output. Yes, you can: all whitespace is preserved as T_WHITESPACE tokens. Here's how you might turn the tokenized output back into code:
$regenerated = '';
$tokens = token_get_all($code);
foreach ($tokens as $idx => $t)
{
    if (is_array($t))
    {
        // Array tokens have the form [token id, text, line number].
        // Do something with strings and comments here?
        switch ($t[0])
        {
            case T_CONSTANT_ENCAPSED_STRING:
                break;
            case T_COMMENT:
            case T_DOC_COMMENT:
                break;
        }
        $regenerated .= $t[1];
    }
    else
    {
        // Single-character tokens such as ';' or '{' are plain strings.
        $regenerated .= $t;
    }
}
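One way the "do something here" step might be filled in is sketched below: the text of selected token types is passed through a callback while everything else is copied verbatim. The helper name transform_tokens() and its parameters are invented for this example, not part of PHP.

```php
<?php
/**
 * Rebuild $code from its tokens, passing the text of the token types
 * listed in $tokenIds through $transform and copying the rest unchanged.
 * (Hypothetical helper for illustration, not part of PHP.)
 */
function transform_tokens(string $code, array $tokenIds, callable $transform): string
{
    $out = '';
    foreach (token_get_all($code) as $token) {
        if (!is_array($token)) {
            // Single-character token: copy as-is.
            $out .= $token;
        } elseif (in_array($token[0], $tokenIds, true)) {
            // Selected token type: apply the caller's replacement.
            $out .= $transform($token[1]);
        } else {
            $out .= $token[1];
        }
    }
    return $out;
}

// Usage: rename "foo" to "bar" inside comments only.
$code = "<b>foo</b><?php \$foo = 'foo'; /* foo */";
$commentsOnly = transform_tokens(
    $code,
    [T_COMMENT, T_DOC_COMMENT],
    function (string $text): string {
        return str_replace('foo', 'bar', $text);
    }
);
```

Passing [T_INLINE_HTML] instead would restrict the same replacement to the HTML/JavaScript parts, which seems to be what the question is after.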
To separate the PHP from the rest, PHP's built-in tokenizer is your best choice: see token_get_all().
For the rest, you might be best off with a DOM parser. Isolating the <script> parts (and external scripts, and even onXXXX events) is easy that way.
It might be tough to re-build the identical document from a parsed DOM tree, though - I guess it depends on what you need to do with the results and how clean the original HTML is. A regular expression (yuck!) could work better for that part.
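For the DOM route, something along these lines might work. It is only a sketch using PHP's bundled DOM extension: the function name extract_javascript() and the shape of the returned array are invented for this example, and matching attributes by an "on" name prefix is a crude assumption that would also catch any custom attribute whose name starts with "on".

```php
<?php
/**
 * Collect the JavaScript found in an HTML string: inline <script> bodies,
 * external script URLs, and on* event-handler attributes.
 * (Hypothetical helper for illustration, not part of PHP.)
 */
function extract_javascript(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = ['inline' => [], 'external' => [], 'handlers' => []];

    foreach ($doc->getElementsByTagName('script') as $script) {
        if ($script->hasAttribute('src')) {
            $found['external'][] = $script->getAttribute('src');
        } else {
            $found['inline'][] = $script->textContent;
        }
    }

    // Every attribute whose name starts with "on" (onclick, onload, ...).
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $found['handlers'][] = [
            $attr->ownerElement->nodeName,
            $attr->nodeName,
            $attr->nodeValue,
        ];
    }

    return $found;
}
```

This only reads the JavaScript out; writing the modified document back with saveHTML() runs into the rebuilding problem mentioned above, since the serialized output is not guaranteed to match the original markup byte for byte.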