I want to extract the contents of the body of an HTML page along with the tag names of its children. I am using this example HTML:
<html>
<head></head>
<body>
<h1>This is H1 tag</h1>
<h2>This is H2 tag</h2>
<h3>This is H3 tag</h3>
</body>
</html>
I implemented the PHP code below, and it works fine.
$d=new DOMDocument();
$d->loadHTMLFile('file.html');
$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
for($i=0;$i<$l->length;$i++)
{
echo "<".$l->item($i)->nodeName.">".$l->item($i)->nodeValue."</".$l->item($i)->nodeName.">";
}
This code works perfectly, but when I tried a foreach loop instead of the for loop, the nodeName property returned '#text' along with every actual node name. Here is that code:
$l=$d->childNodes->item(1)->childNodes->item(1)->childNodes;
foreach ($l as $li) {
echo $li->childNodes->item(0)->nodeName."<br/>";
}
Why so?
When I had this problem, the following fixed it:
$xmlDoc = new DOMDocument();
$xmlDoc->preserveWhiteSpace = false; // important!
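To make the effect of that setting concrete, here is a minimal sketch using a small XML string. The sample document and the helper name childNodeTypes are made up for illustration and are not from the original post. It assumes the property is set before the document is loaded, since it is applied at parse time.

```php
<?php
// Hypothetical sample input: one element child surrounded by whitespace.
$xml = "<root>\n  <child>value</child>\n</root>";

// Illustrative helper (not part of any library): returns the nodeType
// of every direct child of the document element.
function childNodeTypes(string $xml, bool $preserve): array
{
    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = $preserve; // set before loading
    $doc->loadXML($xml);

    $types = [];
    foreach ($doc->documentElement->childNodes as $node) {
        $types[] = $node->nodeType;
    }
    return $types;
}

$with = childNodeTypes($xml, true);     // whitespace kept
$without = childNodeTypes($xml, false); // whitespace-only text nodes dropped

echo implode(', ', $with), "\n";
echo implode(', ', $without), "\n";
```

This sketch uses loadXML; how much the setting changes the result for HTML input loaded with loadHTMLFile may differ, so it is worth checking the node types on your own document.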
You can trace out your $node->nodeType to see the difference. I get 3, 1, 3 (text, element, text) even though there was only one child element. With white space preservation turned off, I get just 1.
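An alternative that does not depend on the white space setting is to check nodeType inside the loop and skip everything that is not an element. The sketch below is not from the original answer. It assumes the question's HTML is supplied as a string (instead of file.html) and looks up the body with getElementsByTagName rather than by child index. Note also that the question's foreach reads $li->childNodes->item(0), which is the first child of each body child; for an h1 that is its text node, so '#text' is expected there. Reading $node->nodeName directly, as below, gives the tag name.

```php
<?php
// The question's example HTML, inlined so the sketch is self-contained.
$html = <<<'HTML'
<html>
<head></head>
<body>
<h1>This is H1 tag</h1>
<h2>This is H2 tag</h2>
<h3>This is H3 tag</h3>
</body>
</html>
HTML;

$d = new DOMDocument();
$d->loadHTML($html);

// Locate <body> by tag name instead of chained childNodes->item() calls.
$body = $d->getElementsByTagName('body')->item(0);

$tags = [];
foreach ($body->childNodes as $node) {
    // Skip text nodes (nodeType 3), comments, and anything else
    // that is not an element (nodeType 1).
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    $tags[] = $node->nodeName;
    echo $node->nodeName . ': ' . $node->nodeValue . "\n";
}
```

Filtering this way keeps the loop correct whether or not the parser produced whitespace text nodes between the elements.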
GL.