
How do you format DOM structures in PHP?

Tags:

html

dom

php

My first guess was the PHP DOM classes (with the formatOutput parameter). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

Xeoncross asked Nov 03 '11 15:11



2 Answers

Here are some improvements over @hijarian's answer:

LibXML Errors

If you don't call libxml_use_internal_errors(true), PHP will output every HTML error it finds. If you do call it, the errors aren't discarded; they are collected in an internal buffer that you can inspect by calling libxml_get_errors(). The problem is that this buffer eats memory, and DOMDocument is known to be very picky. If you're processing lots of files in a batch, you will eventually run out of memory. There are two solutions for this:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).

The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.
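To tie the points above together, here is a minimal sketch of a quiet loader. The function name load_html_quietly() is made up for illustration; it only uses libxml_use_internal_errors(), libxml_clear_errors() and the LIBXML_NOERROR | LIBXML_NOWARNING flags mentioned above, and it guards against blank input before DOMDocument ever sees it.

```php
<?php
// Hypothetical helper: load HTML without emitting or accumulating libxml errors.
function load_html_quietly(string $html): ?DOMDocument
{
    // Blank input makes the load*() methods complain no matter what, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    // Route errors to the internal buffer and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $ok = $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

    // Drop whatever still ended up in the buffer so batch runs don't leak memory.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}
```

Restoring the previous setting is a design choice of this sketch, so the helper doesn't change error handling for the rest of the script.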

Regex

The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), does so without giving back (++), which is faster, and drops the case-insensitive overhead (since whitespace has no case).

You may also want to normalize newlines to \n, and deal with other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.

DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the above regex.

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.
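As a rough sketch of the two steps described in this section (newline normalization, then stripping whitespace between tags), something like the following should do. The function name strip_inter_tag_whitespace() is just an illustrative choice.

```php
<?php
// Hypothetical helper: normalize line endings, then drop whitespace between tags.
function strip_inter_tag_whitespace(string $html): string
{
    // \R matches any newline sequence (\r\n, \r, \n, ...); fall back to the input if PCRE fails.
    $html = preg_replace('~\R~u', "\n", $html) ?? $html;

    // Possessive quantifier (++): match one or more whitespace characters without backtracking.
    return preg_replace('~>[[:space:]]++<~', '><', $html) ?? $html;
}
```

Note that this also collapses whitespace between tags inside <pre> blocks; whitespace next to actual text is left alone.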

Additional Flags for loadHTML()

  • LIBXML_COMPACT - "this may speed up your application without needing to change the code"
  • LIBXML_NOBLANKS - need to run more tests on this one
  • LIBXML_NOCDATA - need to run more tests on this one
  • LIBXML_NOXMLDECL - documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On saveXML()

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.
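For the regex route, a minimal sketch could look like this. The helper name strip_xml_declaration() is hypothetical, and it assumes the declaration, if present, sits at the very start of the string.

```php
<?php
// Hypothetical helper: remove a leading <?xml ... ?> declaration and the whitespace after it.
function strip_xml_declaration(string $xml): string
{
    // Anchored at the start of the string; at most one replacement.
    return preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml, 1) ?? $xml;
}
```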

Another option, which seems to have an added benefit, is simply doing:

$dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option will convert empty (aka self-closing) tags to inline tags and allow empty inline tags as well.
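To see the difference, here is a small comparison of saveXML() with and without the flag. The markup is a made-up sample in the spirit of the snippet above.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><i class="icon"></i> Dashboard</div></body></html>');

// Without the flag, the empty <i> is serialized in self-closing form.
$collapsed = $dom->saveXML($dom->documentElement);

// With LIBXML_NOEMPTYTAG, empty elements keep an explicit closing tag.
$expanded = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

libxml_clear_errors();
```

A browser reading the self-closing form as HTML treats it as an unclosed <i>, which is why the following content ends up inside it.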

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

  1. no DOCTYPE (it was stripped when we used $dom->documentElement)
  2. empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

  • area
  • base
  • basefont (deprecated in HTML5)
  • br
  • col
  • command
  • embed
  • frame (deprecated in HTML5)
  • hr
  • img
  • input
  • keygen
  • link
  • meta
  • param
  • source
  • track
  • wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven't benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn't sure whether the case-insensitive version is needed, since I was under the impression that XML tags are always lowercased. UPDATE: Tags are always lowercased.
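Wrapped up as a function, the regex variant might look like the sketch below. The name collapse_void_elements() is made up here; the tag list is the one given above, and the pattern is case-sensitive on the assumption that the serialized tags are lowercase.

```php
<?php
// Hypothetical helper: turn <br></br>, <img ...></img>, etc. back into self-closing tags.
function collapse_void_elements(string $xml): string
{
    $void = 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen'
          . '|link|meta|param|source|track|wbr';

    // Match the end of an opening tag immediately followed by the closing tag of a void element.
    return preg_replace('~></(?:' . $void . ')>~', ' />', $xml) ?? $xml;
}
```

Non-void elements such as <b></b> are left untouched, because the closing tag name has to match one of the listed names in full.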

On <script> and <style> Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.
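If a regular expression feels like overkill for fixed tokens, a plain str_replace() is one possible alternative. This is only a blunt sketch with a made-up function name: it removes every CDATA marker in the string, including any that legitimately appear inside the script itself.

```php
<?php
// Hypothetical helper: strip the CDATA wrappers that saveXML() puts around script/style content.
function strip_cdata_markers(string $xml): string
{
    return str_replace(['<![CDATA[', ']]>'], '', $xml);
}
```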

Implementation

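function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    // Collect libxml errors internally; clear the buffer if it was already enabled (batch runs)
    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    // Encode non-ASCII characters as entities (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize newlines to \n, then drop whitespace between tags
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        // Serialize from the root element (no XML declaration), keeping empty tags expanded
        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                // Strip the CDATA markers added around <script> and <style> content
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                // Collapse void elements back to their self-closing form
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}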

Alix Axel answered Sep 22 '22 16:09


Here's a comment on php.net: http://ru2.php.net/manual/en/domdocument.save.php#88630

It looks like when you load HTML from a string (like you did), DOMDocument becomes lazy and does not format anything in it.

Here's a working solution to your problem:

// Clean your HTML by hand first
$html = preg_replace('/>\s*</im', '><', $html);
$dom = new DOMDocument;
// Property names are case-sensitive (preserveWhiteSpace); set it before loading so it takes effect
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$dom->formatOutput = true;
// Use saveXML(), not saveHTML()
print $dom->saveXML();

Basically, you throw out the whitespace between tags and use saveXML() instead of saveHTML(). saveHTML() just does not work in this situation. However, you get an XML declaration on the first line of the output.

hijarian answered Sep 22 '22 16:09