How do you format DOM structures in PHP?

Tags:

html

dom

php

My first guess was the PHP DOM classes (with the formatOutput parameter). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

401

asked Nov 03 '11 15:11

Xeoncross

2 Answers

Here are some improvements over @hijarian's answer:

LibXML Errors

If you don't call libxml_use_internal_errors(true), PHP will output all HTML errors found. If you do call it, the errors are not discarded; they accumulate in a buffer that you can inspect by calling libxml_get_errors(). The problem is that this buffer consumes memory, and DOMDocument is known to be very picky. If you're processing lots of files in batch, you will eventually run out of memory. There are two solutions for this:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).
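As a rough sketch of the same idea for batch work, the buffer can be read and emptied after each document. The helper name load_html_collecting_errors() is hypothetical and not part of the original answer; it only uses the standard libxml_* functions.

```php
<?php
// Hypothetical helper: parse one HTML string and return the libxml errors
// it produced, leaving the global error buffer empty afterwards.
function load_html_collecting_errors(string $html): array
{
    // Remember the previous setting so it can be restored later.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Copy the errors out, then free the buffer so it cannot grow
    // across many documents.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

$errors = load_html_collecting_errors('<p>Unclosed <b>bold</p><madeup>tag</madeup>');

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with line and message properties.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Which errors are reported (if any) depends on the libxml version, so the loop may print nothing on some installations.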

~~The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.~~

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
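One way to sidestep this is to check the input before calling the load*() methods at all. A minimal sketch, assuming that "blankish" means a string containing only whitespace; the function name is_blankish() is made up for illustration:

```php
<?php
// Hypothetical guard: true when the string is empty or whitespace-only.
function is_blankish(string $html): bool
{
    return trim($html) === '';
}

$input = " \n\t ";

if (is_blankish($input)) {
    // Skip the parser entirely instead of letting it complain.
    echo "nothing to parse\n";
} else {
    $dom = new DOMDocument();
    $dom->loadHTML($input);
}
```

Depending on the PHP version, a completely empty string may raise an exception rather than a warning, which is one more reason to guard up front.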

Regex

The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces if whitespace actually exists (+ instead of *), does not give back (++, which is faster), and drops the case-insensitive overhead (since whitespace has no case).

You may also want to normalize newlines to \n and other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.

DOMDocument::$preserveWhiteSpace is unnecessary after running the above regex.

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.
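The cleanup described in this section can be tried in isolation. This is only a sketch: the function name collapse_inter_tag_whitespace() is an assumption, and the two patterns are the ones discussed above.

```php
<?php
// Hypothetical helper combining newline normalization with the
// possessive whitespace regex.
function collapse_inter_tag_whitespace(string $html): string
{
    // \R matches any newline sequence (\r\n, \r, \n, ...); the u flag
    // requires valid UTF-8, so fall back to the input if it fails.
    $html = preg_replace('~\R~u', "\n", $html) ?? $html;

    // Remove whitespace that sits directly between two tags.
    return preg_replace('~>[[:space:]]++<~', '><', $html) ?? $html;
}

$in = "<div>\r\n    <p>Hello   world</p>\r\n</div>";

echo collapse_inter_tag_whitespace($in), "\n";
// <div><p>Hello   world</p></div>
```

Whitespace inside text nodes is left alone; only runs between a closing > and an opening < are removed. That includes such runs inside pre blocks, so preformatted markup made of nested tags can lose its line breaks.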

Additional Flags for `loadHTML()`

LIBXML_COMPACT - "this may speed up your application without needing to change the code"
LIBXML_NOBLANKS - need to run more tests on this one
LIBXML_NOCDATA - need to run more tests on this one
LIBXML_NOXMLDECL - documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On `saveXML()`

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.
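For the regex route, a minimal sketch could look like this. The function name strip_xml_declaration() is hypothetical, and the pattern assumes the declaration is the very first thing in the string:

```php
<?php
// Hypothetical helper: drop a leading <?xml ... ?> declaration and the
// whitespace that follows it. Strings without a declaration pass through.
function strip_xml_declaration(string $xml): string
{
    return preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml, 1) ?? $xml;
}

$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<html><body/></html>";

echo strip_xml_declaration($xml), "\n";
// <html><body/></html>
```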

Another option, which seems to have an added benefit, is simply doing:

$dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option will convert empty (aka self-closing) tags to inline tags and allow empty inline tags as well.
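The effect of the flag is easiest to see on a single empty element. In this small sketch the element is built by hand rather than parsed, so the result does not depend on how the HTML parser treats the input:

```php
<?php
// Build a document containing one empty <i> element.
$dom = new DOMDocument();
$icon = $dom->createElement('i');
$icon->setAttribute('class', 'icon-dashboard');
$dom->appendChild($icon);

// Default serialization collapses the empty element.
$collapsed = $dom->saveXML($dom->documentElement);

// LIBXML_NOEMPTYTAG writes an explicit closing tag instead.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n"; // <i class="icon-dashboard"/>
echo $expanded, "\n";  // <i class="icon-dashboard"></i>
```

A browser reading the collapsed form as HTML treats it as an opening tag that is never closed, which is where the mangling described above comes from.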

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

no DOCTYPE (it was stripped when we used $dom->documentElement)
empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

area
base
basefont (deprecated in HTML5)
br
col
command
embed
frame (deprecated in HTML5)
hr
img
input
keygen
link
meta
param
source
track
wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    // Turn "<br></br>" back into "<br />"
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven't benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn't sure whether the case-insensitive version is needed, since I was under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.

On `<script>` and `<style>` Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.
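Since the CDATA markers are fixed strings, a plain str_replace() is enough to remove them, which is the approach this sketch takes instead of a regular expression. The function name strip_cdata_markers() is made up for illustration:

```php
<?php
// Hypothetical helper: remove the CDATA wrappers that saveXML() adds
// around <script> and <style> content.
function strip_cdata_markers(string $html): string
{
    return str_replace(array('<![CDATA[', ']]>'), '', $html);
}

$xml = '<script><![CDATA[var ok = 1 < 2;]]></script>';

echo strip_cdata_markers($xml), "\n";
// <script>var ok = 1 < 2;</script>
```

Note that this removes every occurrence of the two tokens, including any that appear legitimately inside the script or style content itself.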

Implementation

function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    // Collect libxml errors internally; clear the buffer on repeated calls
    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    // Note: the HTML-ENTITIES encoding is deprecated in mbstring as of PHP 8.2
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize newlines, then drop whitespace between tags
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        // Serialize from the root element to skip the XML declaration
        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                // Strip the CDATA wrappers added around <script> and <style> content
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                // Restore self-closing syntax for void elements
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}

answered Sep 22 '22 16:09

Alix Axel

Here's the comment at php.net: http://ru2.php.net/manual/en/domdocument.save.php#88630

It looks like when you load HTML from the string (like you did) DOMDocument becomes lazy and does not format anything in it.

Here's a working solution to your problem:

// Clean your HTML by hand first
$html = preg_replace('/>\s*</im', '><', $html);
$dom = new DOMDocument;
// The property is spelled preserveWhiteSpace; set it before loading
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$dom->formatOutput = true;
// Use saveXML(), not saveHTML()
print $dom->saveXML();

Basically, you throw out the spaces between tags and use saveXML() instead of saveHTML(). saveHTML() just does not work in this situation. However, you get an XML declaration on the first line of the output.

answered Sep 22 '22 16:09

hijarian
