
PHP's DOMXPath is stripping out my tags inside the matched text

I asked a related question yesterday (Parse HTML with PHP's HTML DOMDocument), and at the time the answer was just what I needed, but while working with some live data I discovered that it wasn't quite doing what I expected.

It gets the data from the HTML page, but it also strips out all the HTML tags inside the captured block of text, which isn't what I want. (I might want to take some of the tags out, but not all, and this can be done later.)

Mint asked Apr 04 '10 14:04


2 Answers

That's a common problem with DOM: you have to do a bit more work if you want to get the content of a tag together with the markup of all its children.

Basically, you have to loop over the child nodes of the node you've matched with your XPath query, and serialize each of them.

There is a solution proposed in one of the user notes on the manual page of the DOMElement class (http://fr.php.net/manual/en/class.domelement.php#86803).


To integrate this solution into the code you already have, start with the declaration of the HTML string, this time with sub-tags:

$html = <<<HTML
<div class="main">
    <div class="text">
        <p>
            Capture this <strong>text</strong> <em>1</em>
        </p>
        <p>
            And some other <strong>text</strong>
        </p>
    </div>
</div>
HTML;


And, to extract the data from that HTML string, you can use something like this:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    // see http://fr.php.net/manual/en/class.domelement.php#86803
    $children = $tag->childNodes;
    foreach ($children as $child) {
        $tmp_doc = new DOMDocument();
        $tmp_doc->appendChild($tmp_doc->importNode($child,true));       
        $innerHTML .= $tmp_doc->saveHTML();
    }

    var_dump(trim($innerHTML));
}

The only thing that has changed is the content of the foreach loop: instead of just using $tag->nodeValue, you have to iterate over the child nodes and serialize each one.


This gives me the following output:

string '<p>
            Capture this <strong>text</strong> <em>1</em>
        </p>


<p>
            And some other <strong>text</strong>
        </p>' (length=150)

This is the full content of the <div> tag that was matched, including all its children and their tags.
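
To make the contrast with nodeValue concrete, here is a minimal sketch that runs both approaches on the same node. The shortened HTML string and the variable names $textOnly and $withTags are made up for illustration. It also assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument, rather than the temporary-document technique used above.

```php
<?php
// Shortened, made-up input; not the asker's real data.
$html = '<div class="text"><p>Capture this <strong>text</strong></p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="text"]')->item(0);

// nodeValue concatenates the text of all descendants, so the markup is lost.
$textOnly = $tag->nodeValue;

// Serializing each child node keeps the <p> and <strong> tags.
$withTags = '';
foreach ($tag->childNodes as $child) {
    $withTags .= $dom->saveHTML($child);
}

var_dump($textOnly);
var_dump($withTags);
```

The first dump contains only text, while the second still contains the child tags.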


Note: there are often interesting ideas and solutions in the user notes of the manual ;-)

Pascal MARTIN answered Oct 22 '22 18:10


Pascal MARTIN's answer is great, but I found it can be simplified:

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    $innerHTML = '';

    $children = $tag->childNodes;
    foreach ($children as $child) {     
        $innerHTML .= $dom->saveHTML($child);
    }

    var_dump(trim($innerHTML));
}

This way appears to produce the same result, but doesn't require new DOMDocument objects to be created inside the foreach loop.
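
If the same loop is needed in several places, it could be wrapped in a small function. The sketch below is only one possible shape: the name innerHtml() is hypothetical, the return type declaration assumes PHP 7 or later, and passing a node to saveHTML() assumes PHP 5.3.6 or later. It also assumes the node belongs to a document, so that ownerDocument is not null.

```php
<?php
/**
 * Hypothetical helper: returns the markup of a node's children,
 * without the node's own opening and closing tag.
 */
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // Serialize each child through the document that owns it.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Usage with a made-up fragment.
$dom = new DOMDocument();
$dom->loadHTML('<div class="text"><p>And some other <strong>text</strong></p></div>');

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@class="text"]') as $tag) {
    echo trim(innerHtml($tag)), "\n";
}
```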

EDIT:

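So, after further experimentation, you can actually reduce the above to this, with one difference: saveHTML($tag) serializes the matched element itself, so the output includes the wrapping <div class="text"> tag as well as its children: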

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($dom->saveHTML($tag)));
}
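
Whether the shorter form is a drop-in replacement depends on whether the matched element's own tag is wanted in the result. A minimal sketch comparing the two calls on a made-up fragment (the variable names $outer and $inner are illustrative):

```php
<?php
$html = '<div class="main"><div class="text"><p>Some <strong>text</strong></p></div></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$tag = $xpath->query('//div[@class="main"]/div[@class="text"]')->item(0);

// Serializing the matched node includes its own <div class="text"> tag.
$outer = $dom->saveHTML($tag);

// Serializing only the children leaves the wrapping <div> out.
$inner = '';
foreach ($tag->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

var_dump($outer);
var_dump($inner);
```

If the wrapper is not wanted, the loop over childNodes remains the safer choice.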

Nate answered Oct 22 '22 19:10