
DOM parser that allows HTML5-style </ in <script> tag

Tags: html, dom, php

Update: html5lib (bottom of question) seems to get close; I just need to improve my understanding of how it's used.

I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

Most parsers stop too early because HTML 4.01 ends script content at the first ETAGO (</) inside a <script> tag. HTML5, however, allows </ to appear before </script>. Every parser I have tried so far has either failed or is so poorly documented that I can't tell whether it works.
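As a rough illustration of the two rules, the sketch below uses plain string matching to find where each rule would end the script content. It is not a parser, it ignores HTML5's comment-like escape states inside scripts, and the function names are made up for this example.

```php
<?php
// Illustration only: where does script content end under each rule?
// Function names are hypothetical; this is string matching, not parsing.

// HTML 4.01: content ends at the first "</" followed by a letter (ETAGO).
function scriptEndHtml4($content)
{
    if (preg_match('~</[a-zA-Z]~', $content, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return strlen($content);
}

// HTML5 (simplified): content ends only at "</script" followed by
// whitespace, "/" or ">".
function scriptEndHtml5($content)
{
    if (preg_match('~</script[\s/>]~i', $content, $m, PREG_OFFSET_CAPTURE)) {
        return $m[0][1];
    }
    return strlen($content);
}

// Everything that follows the opening <script id="foo"> tag.
$afterOpenTag = '<td>bar</td></script>';

$html4Content = substr($afterOpenTag, 0, scriptEndHtml4($afterOpenTag));
$html5Content = substr($afterOpenTag, 0, scriptEndHtml5($afterOpenTag));

echo $html4Content, "\n"; // <td>bar
echo $html5Content, "\n"; // <td>bar</td>
```

The HTML 4.01 cut-off is the same truncation that shows up in the failing outputs: the closing </td> is treated as the end of the script content and then discarded.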

My requirements:

1. Real parser, not regex hacks.
2. Ability to load full pages or HTML fragments.
3. Ability to pull script contents back out, selecting by the tag's id attribute.

Input:

<script id="foo"><td>bar</td></script>

Example of failing output (no closing </td>):

<script id="foo"><td>bar</script>

Some parsers and their results:

DOMDocument (fails)

Source:

<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

Output:

Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>

FluentDOM (fails)

Source:

<?php

header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>

phpQuery (fails)

Source:

<?php

header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);

echo (string)pq('#foo');

Output:

<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>

html5lib (passes)

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

<?php

header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);

echo $d->saveHTML();

Output:

<html><head></head><body><script id="foo"><td></td></script></body></html>


asked Oct 27 '10 01:10

Annika Backstrom

2 Answers

I had the same problem, and apparently you can hack your way through this by loading the document as XML and saving it as HTML :)

$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

But of course the markup must be error-free for loadXML to work.
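A minimal sketch of guarding against that failure, using only PHP's bundled DOM and libxml functions. The helper names loadXmlOrNull and innerMarkup are made up for this example. Because loadXML treats the <td> as a real child element rather than as script text, the contents are rebuilt by serializing the child nodes.

```php
<?php
// Hypothetical helper: load markup as XML, returning null instead of
// emitting warnings when the markup is not well-formed.
function loadXmlOrNull($markup)
{
    $previous = libxml_use_internal_errors(true); // collect errors quietly
    $dom = new DOMDocument;
    $ok = $dom->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore earlier setting
    return $ok ? $dom : null;
}

// Hypothetical helper: serialize the children of a node back to markup.
function innerMarkup(DOMNode $node)
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$good = loadXmlOrNull('<script id="foo"><td>bar</td></script>');
$bad  = loadXmlOrNull('<script id="foo"><td>bar</script>'); // mismatched tag

$contents = innerMarkup($good->documentElement);
echo $contents, "\n";                               // <td>bar</td>
echo $bad === null ? "rejected\n" : "loaded\n";     // rejected
```

This only works when the template body itself is well-formed XML, which is the limitation noted above.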


answered Sep 24 '22 22:09

Alex

Re: html5lib

You click on the download tab and download the PHP version of the parser.

You untar the archive in a local folder

tar -zxvf html5lib-php-0.1.tar.gz
x html5lib-php-0.1/
x html5lib-php-0.1/VERSION
x html5lib-php-0.1/docs/
... etc

You change directories and create a file named hello.php

cd html5lib-php-0.1
touch hello.php

You place the following PHP code in hello.php

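<?php
// Assumed location of the parser inside the extracted archive;
// adjust the path if Parser.php lives elsewhere.
require_once 'library/HTML5/Parser.php';

$html = '<html><head></head><body>
<script type="text/x-jquery-tmpl" id="foo">
<table><tr><td>${name}</td></tr></table>
</script>
</body></html>';
$dom = HTML5_Parser::parse($html);
var_dump($dom->saveXML());
echo "\nDone\n";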

You run hello.php from the command line

php hello.php

The parser builds the document tree and returns a DOMDocument object, which can be manipulated like any other DOMDocument object.
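To pull the script contents back out by id, one option is DOMXPath on the returned document. The sketch below is self-contained: it builds a small DOMDocument by hand as a stand-in for the parser's result, on the assumption that an HTML5 parser stores the script body as a single text node. The helper name contentsById is made up for this example.

```php
<?php
// Hypothetical helper: return the text inside the element with the given
// id, or null if there is no such element.
function contentsById(DOMDocument $dom, $id)
{
    $xpath = new DOMXPath($dom);
    // Match on the attribute itself, since getElementById() needs the
    // attribute to be declared as an ID. "*" avoids depending on whether
    // the parser put elements in a namespace.
    // Assumes $id contains no double quotes.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->textContent;
}

// Stand-in for the parser output: a script element whose body is plain text.
$dom = new DOMDocument;
$script = $dom->createElement('script');
$script->setAttribute('id', 'foo');
$script->appendChild($dom->createTextNode('<td>bar</td>'));
$dom->appendChild($script);

$contents = contentsById($dom, 'foo');
echo $contents, "\n"; // <td>bar</td>
```

With html5lib, the same call would presumably be made on the object returned by HTML5_Parser::parse($html) in place of the hand-built $dom.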

answered Sep 22 '22 22:09

Alan Storm
