Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Traverse the DOM tree

Tags:

dom

php

traversal

As most (all?) PHP libraries that do HTML sanitization such as HTML Purifier are heavily dependant on regex, I thought trying to write a HTML sanitizer that uses the DOMDocument and related classes would be a worthwhile experiment. While I'm at a very early stage with this, the project so far shows some promise.

My idea revolves around a class that uses the DOMDocument to traverse all nodes in the supplied markup, compare them to a white list, and remove anything not on the white list. (first implementation is very basic, only removing nodes based on their type but I hope to get more sophisticated and analyse the node's attributes, whether links address items in a different domain, etc in the future).

My question is how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes attribute, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown you need to be very careful about the order you remove things otherwise you might leave items behind or trigger exceptions.

If anyone has experience with manipulating a DOM tree in PHP I'd appreciate any feedback you may have on the topic.

EDIT: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether the found elements are on the whitelist. If they aren't, they are removed.

The problem I was hitting was that if you delete a node, the indexes of all subsequent nodes in the DOMNodeList changes. Simply working from bottom to top avoids this problem. It's still a very basic approach currently, but I think it shows promise. It certainly works a lot faster than HTMLPurifier, though admittedly Purifier does a lot more stuff.

/**
 * Recursivly remove elements from the DOM that aren't whitelisted
 * @param DOMNode $elem
 * @return array List of elements removed from the DOM
 * @throws Exception If removal of a node failed than an exception is thrown
 */
private function cleanNodes (DOMNode $elem)
{
    $removed    = array ();
    if (in_array ($elem -> nodeName, $this -> whiteList))
    {
        if ($elem -> hasChildNodes ())
        {
            /*
             * Iterate over the element's children. The reason we go backwards is because
             * going forwards will cause indexes to change when elements get removed
             */
            $children   = $elem -> childNodes;
            $index      = $children -> length;
            while (--$index >= 0)
            {
                $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
            }
        }
    }
    else
    {
        // The element is not on the whitelist, so remove it
        if ($elem -> parentNode -> removeChild ($elem))
        {
            $removed [] = $elem;
        }
        else
        {
            throw new Exception ('Failed to remove node from DOM');
        }
    }
    return ($removed);
}
like image 254
GordonM Avatar asked Jun 15 '11 10:06

GordonM


1 Answers

For a start, you can have a look at this custom RecursiveDomIterator:

  • https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator

Code:

class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current Position in DOMNodeList
     * @var Integer
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     * @return void
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator for the current iterator entry
     * @return RecursiveDOMIterator
     */
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns if an iterator can be created for the current entry.
     * @return Boolean
     */
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return Integer
     */
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element.
     * @return void
     */
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewind the Iterator to the first element
     * @return void
     */
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks if current position is valid
     * @return Boolean
     */
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}

You can use that in combination with a RecursiveIteratorIterator. Usage examples are on the page.

In general though, it would be easier to use XPath to search for blacklisted nodes instead of traversing the DOM Tree. Also keep in mind that DOM is already quite good at preventing XSS by automatically escaping xml entities in nodeValues.

The other thing you have to be aware of is that any manipulation of a DOMDocument will immediately affect any DOMNodeList you might have from XPath queries and that might lead to skipped nodes when manipulating them. See DOMNode replacement with PHP's DOM classes for an example.

like image 129
Gordon Avatar answered Sep 19 '22 16:09

Gordon