
PHP DOMDocument nodeName property returning '#text' with the nodeName

I want to extract the content of the body of an HTML page along with the tag names of its children. I am using this example HTML:

<html>
<head></head>
<body>
<h1>This is H1 tag</h1>
<h2>This is H2 tag</h2>
<h3>This is H3 tag</h3>
</body>
</html>

I have implemented the PHP code below, and it works fine.

$d=new DOMDocument();
$d->loadHTMLFile('file.html');
$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
for($i=0;$i<$l->length;$i++)
{
echo "<".$l->item($i)->nodeName.">".$l->item($i)->nodeValue."</".$l->item($i)->nodeName.">";
}

This code works fine, but when I tried the same thing with a foreach loop instead of a for loop, the nodeName property returned '#text' alongside every actual nodeName. Here is that code:

$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
foreach ($l as $li) {
    echo $li->childNodes->item(0)->nodeName."<br/>";
}

Why so?

Sourabh asked Mar 06 '12 19:03


1 Answer

When I had this problem, turning off whitespace preservation before loading the document fixed it:

$xmlDoc = new DOMDocument();
$xmlDoc->preserveWhiteSpace = false; // important!

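You can print each $node->nodeType to see the difference. I got 3, 1, 3 (XML_TEXT_NODE, XML_ELEMENT_NODE, XML_TEXT_NODE) even though there was only one child element; the two text nodes are the whitespace around it. With whitespace preservation turned off, I get just 1.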
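Here is a minimal sketch of that trace, in case it helps. It uses loadXML on a made-up XML string rather than the question's file.html, and the helper childNodeTypes() and the variable names are my own, not part of the DOM extension. The last loop shows an alternative that should not depend on how the document was loaded: keep the whitespace nodes and skip anything that is not an element.

```php
<?php
// Hypothetical sample document; the whitespace around <child> becomes text nodes.
$xml = "<root>\n  <child>value</child>\n</root>";

// Illustrative helper: collect the nodeType of every direct child of the root element.
function childNodeTypes(DOMDocument $doc): array
{
    $types = [];
    foreach ($doc->documentElement->childNodes as $node) {
        $types[] = $node->nodeType;
    }
    return $types;
}

// Default behaviour: whitespace-only text nodes are kept.
$withWhitespace = new DOMDocument();
$withWhitespace->loadXML($xml);

// Whitespace preservation off; the property has to be set before loading.
$withoutWhitespace = new DOMDocument();
$withoutWhitespace->preserveWhiteSpace = false;
$withoutWhitespace->loadXML($xml);

echo implode(', ', childNodeTypes($withWhitespace)), "\n";    // 3, 1, 3
echo implode(', ', childNodeTypes($withoutWhitespace)), "\n"; // 1

// Alternative: leave the document as it is and filter by node type while iterating.
$names = [];
foreach ($withWhitespace->documentElement->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
}
echo implode(', ', $names), "\n"; // child
```

The nodeType filter is the more defensive of the two, since it also skips comments and other non-element nodes, not just whitespace.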

GL.

Mark answered Oct 21 '22 15:10