
PHP DOMNode: how to extract not only the text but also the HTML tags

I'm trying to make a script that scrapes a website to retrieve the latest news updates. Unfortunately I've run into a small issue that I can't seem to fix with my limited knowledge of DOM.

The page I'm trying to scrape is built as follows:

<table>
<tr class="color1">
<td>Author</td>
<td>Content <a href="#">in HTML</a></td>
<td>Date</td>
</tr>
</table>

I can retrieve the fields I need just fine, except for the content. With $td->nodeValue I get the content as plain text, whereas I want it as HTML (there are 'a' tags in there, 'blockquote', etc.).

Here's the code I have:

try {
    $html = @file_get_contents("test.php");
    checkIfFileExists($html);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $trNodes = $dom->getElementsByTagName("tr");
    foreach ($trNodes as $tr) {

        if ($tr->getAttribute("class") == "color1" || $tr->getAttribute("class") == "color2") {

            $tdNodes = $tr->childNodes;
            foreach ($tdNodes as $td) {

                // nodeValue returns only the text content, with all tags stripped
                echo $td->nodeValue . "<br />\n";

            }
            echo "<br /><br /><br /><br /><br />\n";
        }
    }
} catch (Exception $e) {
    echo $e->getMessage();
}

I would prefer not to have to resort to any third party library, but obviously any answer is most appreciated, library or not.

Thanks in advance.

Steven asked Jun 07 '11 07:06


1 Answer

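nodeValue only returns the text of a node. To get the markup as well, serialize the node with saveXML(). Replace

echo $td->nodeValue . "<br />\n";

with

echo $dom->saveXML($td) . "<br />\n";

Note that this prints the node itself too, so each cell comes out wrapped in its own <td>...</td> tags.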
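If only the inner markup of each cell is wanted, without the surrounding td tags, one possible approach is to serialize the cell's child nodes one by one. The following is a minimal sketch, not something from the original code: the sample markup is copied from the question, `innerHtml()` is a hypothetical helper name, and it assumes PHP 7 or later (for the type declarations) and relies on `DOMDocument::saveHTML()` accepting a node argument, which it does since PHP 5.3.6.

```php
<?php
// Sample markup mirroring the table in the question (an assumption for the demo).
$html = '<table><tr class="color1"><td>Author</td>'
      . '<td>Content <a href="#">in HTML</a></td><td>Date</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Hypothetical helper: concatenates the serialized children of a node,
// i.e. its "inner HTML", leaving out the node's own opening and closing tags.
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

// Collect the inner HTML of every cell, then print one cell per line.
$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = innerHtml($td);
}
echo implode("\n", $cells), "\n";
```

Using saveHTML() rather than saveXML() here is a design choice: it keeps HTML-style serialization (for example, empty elements are not written in self-closing XML form), which is usually what you want when the result is echoed back into a page.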
Frederic Bazin answered Nov 14 '22 23:11