I'm trying to write a script that scrapes a website to retrieve the latest news updates. Unfortunately I've run into a small issue that I can't seem to fix with my limited knowledge of PHP's DOM extension.
The page I'm trying to scrape is built as follows:
<table>
  <tr class="color1">
    <td>Author</td>
    <td>Content <a href="#">in HTML</a></td>
    <td>Date</td>
  </tr>
</table>
I can retrieve the fields I need just fine, except for the content. With $td->nodeValue I retrieve the content as plain text, whereas I want it as HTML (there are <a> tags in there, <blockquote>s, etc.).
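For the sample row above, the difference looks like this:

echo $td->nodeValue;
// prints: Content in HTML                   (tags stripped)
// wanted: Content <a href="#">in HTML</a>   (markup intact)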
Here's the code I have:
try {
    // @ suppresses warnings from the fetch and from DOMDocument's HTML parser
    $html = @ file_get_contents("test.php");
    checkIfFileExists($html);

    $dom = new DOMDocument();
    @ $dom->loadHTML($html);

    // Only rows with class color1 or color2 hold news entries
    $trNodes = $dom->getElementsByTagName("tr");
    foreach ($trNodes as $tr) {
        if ($tr->getAttribute("class") == "color1" || $tr->getAttribute("class") == "color2") {
            $tdNodes = $tr->childNodes;
            foreach ($tdNodes as $td) {
                echo $td->nodeValue . "<br />\n";
            }
            echo "<br /><br /><br /><br /><br />\n";
        }
    }
} catch (Exception $e) {
    echo $e->getMessage();
}
I would prefer not to resort to a third-party library, but obviously any answer is appreciated, library or not.
Thanks in advance.
Replace

echo $td->nodeValue . "<br />\n";

with

echo $dom->saveXML($td) . "<br />\n";

nodeValue collapses the node to plain text, while saveXML() with a node argument serializes that node and all of its children, markup included.
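Note that saveXML($td) serializes the wrapping <td>...</td> tags too. If you only want the markup inside the cell, a small helper does the trick (a sketch; "innerHTML" is just an illustrative name here, not part of PHP's DOM API):

// Sketch: serialize only a node's children, skipping the node's own tag.
// innerHTML is an illustrative name, not a method of PHP's DOM extension.
function innerHTML(DOMNode $node) {
    $html = "";
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveXML($child);
    }
    return $html;
}

echo innerHTML($td) . "<br />\n";

On PHP 5.3.6 and later, $dom->saveHTML($td) also accepts a node argument and serializes it as HTML rather than XML, which avoids XML-style self-closing tags in the output.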