
How can I find out the namespace of an element in PHP DOM?

This sounds like a pretty easy question to answer but I haven't been able to get it to work. I'm running PHP 5.2.6.

I have a DOM element (the root element) which, when I call $element->saveXML(), outputs an xmlns attribute:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
...

However, I cannot find any way programmatically within PHP to see that namespace. I want to be able to check whether it exists and what it's set to.

Checking $document->documentElement->namespaceURI would be the obvious answer but that is empty (I've never actually been able to get that to be non-empty). What is generating that xmlns value in the output and how can I read it?

The only practical way I've been able to do this so far is a complete hack - by saving it as XML to a string using saveXML() then reading through that using regular expressions.

Edit:

This may be a peculiarity of loading the markup with loadHTML() rather than loadXML() and then printing it with saveXML(). When you do that, saveXML() appears to add an xmlns attribute even though there is no way to detect, using DOM methods, that this xmlns value is part of the document. Which I guess means that if I had a way of detecting whether the document had been loaded with loadHTML(), I could solve this a different way.

thomasrutter asked Aug 25 '10 13:08


3 Answers

As edorian already showed, getting the namespace works fine when the markup is loaded with loadXML(). But you are right that this won't work for markup loaded with loadHTML():

$html = <<< XML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="foo" lang="en">
    <body xmlns="foo">Bar</body>
</html>
XML;

$dom = new DOMDocument;
$dom->loadHTML($html);

var_dump($dom->documentElement->getAttribute("xmlns"));
var_dump($dom->documentElement->lookupNamespaceURI(NULL));
var_dump($dom->documentElement->namespaceURI);

will produce empty results. But you can use XPath:

$xp = new DOMXPath($dom);
echo $xp->evaluate('string(@xmlns)');
// http://www.w3.org/1999/xhtml

and for body

echo $xp->evaluate('string(body/@xmlns)'); // foo

or with context node

$body = $dom->documentElement->childNodes->item(0);
echo $xp->evaluate('string(@xmlns)', $body);
// foo

My uneducated assumption is that internally, an HTML document is different from a real XML document. libxml uses a different module to parse HTML, and the DOMDocument itself will have a different nodeType, as you can verify by doing

var_dump($dom->nodeType); // 13 with loadHTML, 9 with loadXML

with 13 being XML_HTML_DOCUMENT_NODE.
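
Building on the node type check, here is a minimal sketch of a helper that picks a lookup strategy depending on how the document was loaded. The function name getRootXmlns is hypothetical, and the XPath branch relies on the behaviour described above (loadHTML() keeping xmlns as a plain attribute), which I have not verified across PHP versions.

```php
<?php
// Hypothetical helper (not part of the DOM extension): returns the root
// element's default namespace, or null if none can be found.
function getRootXmlns(DOMDocument $dom)
{
    $root = $dom->documentElement;
    if ($root === null) {
        return null; // empty document
    }

    if ($dom->nodeType === XML_HTML_DOCUMENT_NODE) {
        // Loaded via loadHTML(): read xmlns as an ordinary attribute via XPath.
        $xp = new DOMXPath($dom);
        $value = $xp->evaluate('string(@xmlns)', $root);
        return $value !== '' ? $value : null;
    }

    // Loaded via loadXML(): the XML parser has already resolved the namespace.
    return $root->namespaceURI;
}

$xmlDoc = new DOMDocument;
$xmlDoc->loadXML('<html xmlns="http://www.w3.org/1999/xhtml" lang="en"/>');
var_dump(getRootXmlns($xmlDoc));
```

The caller gets a single answer either way and can compare it against the namespace it expects, without falling back to regular expressions on the saveXML() output.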

Gordon answered Oct 12 '22 13:10


With PHP 5.2.6 I've found two ways to do this:

<?php
$xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?'.
       '><html xmlns="http://www.w3.org/1999/xhtml" lang="en"></html>';
$x = new DOMDocument;
$x->loadXML($xml); // loadXML() is an instance method; a static call is an error on PHP 8
var_dump($x->documentElement->getAttribute("xmlns"));
var_dump($x->documentElement->lookupNamespaceURI(NULL));

prints

string(28) "http://www.w3.org/1999/xhtml"
string(28) "http://www.w3.org/1999/xhtml"

Hope that's what you asked for :)
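
For the "check whether it exists and what it's set to" part of the question, a short sketch along the same lines might look like this. It assumes the markup is loaded with loadXML(); the prefix foo and the URI http://example.com/bar are example values only.

```php
<?php
$xml = '<html xmlns="http://www.w3.org/1999/xhtml" xmlns:foo="http://example.com/bar">'
     . '<body><foo:h2>bar</foo:h2></body></html>';

$doc = new DOMDocument;
$doc->loadXML($xml);
$root = $doc->documentElement;

// Namespace of the root element itself, as resolved by the XML parser.
$defaultNs = $root->namespaceURI;

// URI bound to the (example) prefix "foo" in the scope of the root element.
$fooNs = $root->lookupNamespaceURI('foo');

// Check for a specific namespace before relying on it.
$isXhtml = ($defaultNs === 'http://www.w3.org/1999/xhtml');

var_dump($defaultNs, $fooNs, $isXhtml);
```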

edorian answered Oct 12 '22 12:10


Well, you can do so with a function like this:

function getNamespaces(DomNode $node, $recurse = false) {
    $namespaces = array();
    if ($node->namespaceURI) {
        $namespaces[] = $node->namespaceURI;
    }
    if ($node instanceof DomElement && $node->hasAttribute('xmlns')) {
        $namespaces[] = $xmlns = $node->getAttribute('xmlns');
        foreach ($node->attributes as $attr) {
            if ($attr->namespaceURI == $xmlns) {
                $namespaces[] = $attr->value;
            }
        }
    }
    if ($recurse && $node instanceof DomElement) {
        foreach ($node->childNodes as $child) {
            $namespaces = array_merge($namespaces, getNamespaces($child, true));
        }
    }
    return array_unique($namespaces);
}

So, you feed it a DOMElement, and then it finds all related namespaces:

$xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <html xmlns="http://www.w3.org/1999/xhtml" 
         lang="en" 
         xmlns:foo="http://example.com/bar">
           <body>
                <h1>foo</h1>
                <foo:h2>bar</foo:h2>
           </body>
 </html>';
$dom = new DOMDocument;
$dom->loadXML($xml); // build the document that getNamespaces() will walk
var_dump(getNamespaces($dom->documentElement, true));

Prints out:

array(2) {
  [0]=>
  string(28) "http://www.w3.org/1999/xhtml"
  [3]=>
  string(22) "http://example.com/bar"
}

Note that DomDocument will automatically strip out all unused namespaces...

As for why $dom->documentElement->namespaceURI is always null, it's because the document element doesn't have a namespace. The xmlns attribute provides a default namespace for the document, but it doesn't endow the html tag with a namespace (for purposes of DOM interaction). You can try doing a $dom->documentElement->removeAttribute('xmlns'), but I'm not 100% sure if it will work...
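
As an alternative to walking the tree by hand, the XPath namespace axis can list the namespaces in scope on a given element. This is only a sketch for documents loaded with loadXML(); the helper name getInScopeNamespaces is hypothetical, and the result will normally also contain the implicit xml namespace.

```php
<?php
// Hypothetical helper: collects the namespace URIs in scope on $node
// by querying the XPath namespace axis.
function getInScopeNamespaces(DOMElement $node)
{
    $xp = new DOMXPath($node->ownerDocument);
    $uris = array();
    foreach ($xp->query('namespace::*', $node) as $ns) {
        // Each result is a namespace node; nodeValue holds its URI.
        $uris[] = $ns->nodeValue;
    }
    return array_values(array_unique($uris));
}

$doc = new DOMDocument;
$doc->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml" xmlns:foo="http://example.com/bar">'
    . '<body><foo:h2>bar</foo:h2></body></html>'
);
var_dump(getInScopeNamespaces($doc->documentElement));
```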

ircmaxell answered Oct 12 '22 13:10