DOMDocument in php

Tags: php, xml-parsing, html-parsing, domdocument

I have just started reading documentation and examples about DOM, in order to crawl and parse the document.

For example, I have the part of a document shown below:

    <div id="showContent">
    <table>
    <tr>
        <td>
         Crap
        </td>
    </tr>
    <tr>
        <td width="172" valign="top"><a href="link"><img height="91" border="0" width="172" class="" src="img"></a></td>
        <td width="10">&nbsp;</td>
        <td valign="top"><table cellspacing="0" cellpadding="0" border="0">
            <tbody><tr>
                <td height="30"><a class="px11" href="link">title</a><a><br>
                    <span class="px10"></span>
                </a></td>
            </tr>
            <tr>
                <td><img height="1" width="580" src="crap"></td>
            </tr>
            <tr>
                <td align="right">
                    <a href="link"><img height="16" border="0" width="65" src="/buy"></a>
                </td>
            </tr>
            <tr>
                <td valign="top" class="px10">
                    <p style="width: 500px;">description.</p>
                </td>
            </tr>
        </tbody></table></td>
    </tr>
    <tr>
        <td> Crap
        </td>
    </tr>
    <tr>
        <td>
         Crap
        </td>
    </tr>
    </table>
    </div>

I'm trying to use the following code to get all the tr tags and analyze whether there is crap or information inside them:

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

    $tags = $xpath->query('.//div[@id="showContent"]');
    foreach ($tags as $tag) {
        $string = "";
        $string = trim($tag->nodeValue);
        if (strlen($string) > 3) {
            echo $string;
            echo '<br>';
        }
    }

However, I'm getting just a stripped string without the tags, for example:

Crap  Crap Title Description

But I would like to get:

    <tr>
       <td>Crap</td>
    </tr>
    <tr>
       <a href="link">title</a>
    </tr>

How can I keep the HTML nodes (tags)?

asked Feb 12 '11 18:02

Saikios

2 Answers

If you want to work with DOM you have to understand the concept. Everything in a DOM Document, including the DOMDocument, is a Node.

The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes and all these child nodes can have child nodes on their own. Basically everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.

              HTML                               Legend:
             /    \                              UPPERCASE = DOMElement
           HEAD  BODY                            lowercase = DOMAttr
          /          \                           "Quoted"  = DOMText
        TITLE        DIV - class - "header"
         |             \
    "The Title"        H1
                        |
               "Welcome to Nodeville"

The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:

<title>The Title</title>

is not one, but two nodes. A DOMElement with a DOMText child. Likewise, this

<div class="header">

is really three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
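The node breakdown described above can be checked directly. The following is a minimal sketch, assuming markup that mirrors the diagram (the HTML string is made up for illustration); it prints the class of each node involved.

```php
<?php
// Illustrative markup only; it mirrors the diagram above.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>'
);

// <title>The Title</title> is an element node with a text node child.
$title = $dom->getElementsByTagName('title')->item(0);
$text  = $title->firstChild;

// <div class="header"> is an element node with an attribute node,
// and the attribute node in turn holds a text node.
$div   = $dom->getElementsByTagName('div')->item(0);
$class = $div->getAttributeNode('class');

echo get_class($title), "\n";             // DOMElement
echo get_class($text), "\n";              // DOMText
echo get_class($class), "\n";             // DOMAttr
echo get_class($class->firstChild), "\n"; // DOMText

// All of them share the DOMNode base class.
var_dump($title instanceof DOMNode, $text instanceof DOMNode, $class instanceof DOMNode);
```

Since every one of these objects is a DOMNode, properties such as `nodeValue`, `parentNode` and `childNodes` are available on all of them.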

In practice, this means the DIV you fetched is linked to all the other nodes in the document. You could go all the way to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the wanted information.
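As a rough sketch of that traversal, the snippet below starts at the DIV, walks up to the root via `parentNode`, and then walks down to the text leaves via `childNodes`. The shortened HTML string and the helper name `collectText` are assumptions made for this example only.

```php
<?php
// Shortened, made-up version of the markup from the question.
$html = '<html><body><div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '</table></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$div = $dom->getElementById('showContent');

// Upwards: follow parentNode until we leave the element tree
// (the parent of the root element is the DOMDocument itself).
$path = [];
for ($node = $div; $node instanceof DOMElement; $node = $node->parentNode) {
    array_unshift($path, $node->nodeName);
}
echo implode(' > ', $path), "\n";

// Downwards: recurse through childNodes and collect non-empty text leaves.
// collectText() is a hypothetical helper, not part of the DOM extension.
function collectText(DOMNode $node): array
{
    if ($node instanceof DOMText) {
        $value = trim($node->nodeValue);
        return $value === '' ? [] : [$value];
    }
    $found = [];
    foreach ($node->childNodes as $child) {
        $found = array_merge($found, collectText($child));
    }
    return $found;
}

print_r(collectText($div));
```

The same walk works from any node, because each node keeps references to its parent and its children.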

Whether you do that by iterating the childNodes of the DIV or use getElementsByTagName() or XPath is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing that entire HTML document.
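To get markup back instead of the stripped text that `nodeValue` returns, a node can be serialized again. This is only a sketch under assumptions: the HTML string is a cut-down version of the question's table, and passing a node to `saveHTML()` requires PHP 5.3.6 or later (`saveXML($node)` is an alternative).

```php
<?php
// Cut-down, made-up version of the table from the question.
$html = '<div id="showContent"><table>'
      . '<tr><td>Crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath variant: every TR below the DIV, serialized with its tags intact.
$rows = [];
foreach ($xpath->query('//div[@id="showContent"]//tr') as $tr) {
    $rows[] = $dom->saveHTML($tr);
}

// getElementsByTagName variant: selects the same TR elements here.
$div = $dom->getElementById('showContent');
$sameRows = $div->getElementsByTagName('tr');

foreach ($rows as $row) {
    echo $row, "\n";
}
echo count($rows), ' rows via XPath, ', $sameRows->length, " via getElementsByTagName\n";
```

Both variants also return the rows of nested tables. If only the outer rows matter, a narrower XPath expression or a loop over `childNodes` would be needed.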

If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table and then we could answer something like:

    $div = $dom->getElementById('showContent');
    foreach ($div->getElementsByTagName('a') as $link) {
        echo $dom->saveXML($link);
    }

But unless you are more specific, we can only guess which nodes might be relevant.

If you need more examples and code snippets on how to work with DOM, browse through my previous answers to related questions:

https://stackoverflow.com/search?q=user%3A208809+DOM

By now, there should be a snippet for every basic to medium use case you might have with DOM.

answered Oct 15 '22 07:10

Gordon

To create a parser you can use htmlDOM.

It is a very simple, easy-to-use DOM parser written in PHP. With it you can easily fetch the contents of a div tag.

For example, find all div tags which have attribute id with a value of text.

$ret = $html->find('div[id=text]');

answered Oct 15 '22 06:10

lokeshsk
