Update: html5lib (bottom of question) seems to get close, I just need to improve my understanding of how it's used.
I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:
<script type="text/x-jquery-tmpl" id="foo"> <table><tr><td>${name}</td></tr></table> </script>
Most parsers will end parsing prematurely because HTML 4.01 ends script tag parsing when it finds ETAGO (</) inside a <script> tag. However, HTML5 allows for </ before </script>. All of the parsers I have tried so far have either failed, or they are so poorly documented that I haven't figured out if they work or not.
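To make the difference between the two rules concrete, here is a rough sketch in plain PHP. The helper names are made up for illustration, and the HTML5 rule is simplified: it ignores the comment-like escape states the real tokenizer has. `$rest` is assumed to be the text that follows the opening `<script ...>` tag.

```php
<?php
// Illustrative helpers only (hypothetical names, not part of any library).

// HTML 4.01 rule: script content ends at the first "</" followed by a letter.
function scriptBodyHtml4($rest)
{
    if (preg_match('~</[A-Za-z]~', $rest, $m, PREG_OFFSET_CAPTURE)) {
        return substr($rest, 0, $m[0][1]);
    }
    return $rest;
}

// HTML5 rule (simplified): content ends only at "</script" followed by
// whitespace, "/" or ">", matched case-insensitively.
function scriptBodyHtml5($rest)
{
    if (preg_match('~</script[\s/>]~i', $rest, $m, PREG_OFFSET_CAPTURE)) {
        return substr($rest, 0, $m[0][1]);
    }
    return $rest;
}

$rest = '<td>bar</td></script>';
var_dump(scriptBodyHtml4($rest)); // content is cut off before </td>
var_dump(scriptBodyHtml5($rest)); // content keeps the closing </td>
```

The first call shows why an HTML 4.01-style parser loses the closing tag, while the second keeps the whole template body.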
My requirements:
Input:
<script id="foo"><td>bar</td></script>
Example of failing output (no closing </td>):
<script id="foo"><td>bar</script>
Some parsers and their results:
DOMDocument (failing)
Source:
<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();
Output:
Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>
FluentDOM (failing)
Source:
<?php
header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>
phpQuery (failing)
Source:
<?php
header('Content-type: text/plain');
require_once 'phpQuery.php';
phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);
echo (string)pq('#foo');
Output:
<script type="text/x-jquery-tmpl" id="foo"> <td>test </script>
html5lib (passing)
Possibly promising. Can I get at the contents of the script#foo tag?
Source:
<?php
header('Content-type: text/plain');
include 'HTML5/Parser.php';
$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);
echo $d->saveHTML();
Output:
<html><head></head><body><script id="foo"><td></td></script></body></html>
I had the same problem, and apparently you can hack your way through this by loading the document as XML and saving it as HTML :)
$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();
But of course the markup must be error-free for loadXML to work.
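A small sketch of how that caveat could be handled in practice, assuming you want the inner markup of the root element back as a string. The helper name is made up for illustration; `libxml_use_internal_errors()`, `loadXML()` and `saveXML()` are the standard DOM/libxml calls. With `loadXML`, the `<td>` becomes a child element rather than text, so the sketch serializes each child node.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: returns the inner markup of the root element,
// or false when the input is not well-formed XML.
function innerMarkupViaXml($markup)
{
    $d = new DOMDocument;
    if (!$d->loadXML($markup)) {
        libxml_clear_errors(); // discard the collected errors
        return false;
    }
    $inner = '';
    foreach ($d->documentElement->childNodes as $child) {
        $inner .= $d->saveXML($child); // serialize each child node
    }
    return $inner;
}

var_dump(innerMarkupViaXml('<script id="foo"><td>bar</td></script>')); // well-formed input
var_dump(innerMarkupViaXml('<script id="foo"><td>bar</script>'));      // mismatched tags, returns false
```

This only helps when the template markup happens to be valid XML; something like an unclosed `<br>` inside the script would make it return false.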
Re: html5lib
You click on the download tab and download the PHP version of the parser.
You untar the archive in a local folder
tar -zxvf html5lib-php-0.1.tar.gz
x html5lib-php-0.1/
x html5lib-php-0.1/VERSION
x html5lib-php-0.1/docs/
... etc
You change directories and create a file named hello.php
cd html5lib-php-0.1
touch hello.php
You place the following PHP code in hello.php
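<?php
// Path assumed relative to the unpacked html5lib-php-0.1 directory; adjust it if your layout differs.
require_once 'library/HTML5/Parser.php';

$html = '<html><head></head><body>
<script type="text/x-jquery-tmpl" id="foo">
<table><tr><td>${name}</td></tr></table>
</script>
</body></html>';
$dom = HTML5_Parser::parse($html); // returns a DOMDocument
var_dump($dom->saveXml());
echo "\nDone\n";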
You run hello.php from the command line
php hello.php
The parser will parse the document tree, and return a DOMDocument object, which can be manipulated as any other DOMDocument object.
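As a rough sketch of what "manipulated as any other DOMDocument" can look like for the script#foo case: the document below is built by hand as a stand-in for what `HTML5_Parser::parse()` returns, so the snippet runs without html5lib. It assumes the parser stores the script body as a plain text node and creates elements without a namespace; I have not verified either against html5lib itself. The helper name is hypothetical.

```php
<?php
// Stand-in for the DOMDocument returned by HTML5_Parser::parse();
// built manually so this sketch has no dependency on html5lib.
$dom = new DOMDocument;
$html = $dom->appendChild($dom->createElement('html'));
$body = $html->appendChild($dom->createElement('body'));
$script = $body->appendChild($dom->createElement('script'));
$script->setAttribute('type', 'text/x-jquery-tmpl');
$script->setAttribute('id', 'foo');
$script->appendChild(
    $dom->createTextNode('<table><tr><td>${name}</td></tr></table>')
);

// Hypothetical helper: returns the text content of <script id="...">,
// or null if no such element exists. $id is assumed to contain no quotes.
function scriptContentById(DOMDocument $dom, $id)
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//script[@id="' . $id . '"]');
    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

echo scriptContentById($dom, 'foo'), "\n";
```

If the parser turns out to put elements in the XHTML namespace, the XPath expression would need a registered namespace prefix instead of the bare `script` name.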