
Parsing HTML tags from inside XML in PHP

I'm trying to build my own RSS feed (for learning purposes) in PHP, using simplexml_load_string to parse http://uk.news.yahoo.com/rss. I'm stuck on reading the HTML tags inside the <description> tag.

My code so far looks like this:

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);

//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

             //how to read the href from the a tag???

             //this does not work at all
             $tags = $item->xpath('//a');
             foreach ($tags as $tag) {
                 echo $tag['href'];
             }
       }
}

Any ideas how to extract each HTML tag?

Thanks

Adrian asked Jan 22 '26 14:01


2 Answers

The description content has its special characters encoded, so it is not treated as nodes within the XML; it is just a string. You can decode the special characters, then load the HTML into DOMDocument and do whatever you need with it. For example:

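foreach ($rss->channel->item as $item) {
    echo '<h3>' . $item->title . '</h3>';

    foreach ($item->description as $desc) {
        // Parse the description's HTML with a separate HTML parser
        $dom = new DOMDocument();
        $dom->loadHTML(htmlspecialchars_decode((string)$desc));

        // Print the href of the first <a>, if the description contains one
        $anchors = $dom->getElementsByTagName('a');
        if ($anchors->length > 0) {
            echo $anchors->item(0)->getAttribute('href');
        }
    }
}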

XPath is also available for use with DOMDocument; see DOMXPath.
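As a rough sketch of the DOMXPath route, the snippet below collects every href from one description's HTML. The $html string and the example.com URLs are made-up stand-ins for a decoded <description> value, not real feed data, and the libxml error handling is an optional extra rather than something the feed requires.

```php
<?php
// Hypothetical sample standing in for one decoded <description> value.
$html = '<p><a href="http://example.com/story-1.html"><img src="http://example.com/thumb.jpg" alt=""></a>'
      . 'Summary text. <a href="http://example.com/story-2.html">More</a></p>';

// Collect parser warnings internally instead of emitting them for sloppy HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$hrefs = [];
// Select every <a> element that carries an href attribute.
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

echo implode("\n", $hrefs), "\n";
```

Unlike item(0), the query returns all matching anchors, so a description with no links simply produces an empty list.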

MrCode answered Jan 24 '26 07:01


The <description> element of the RSS feed contains HTML. As outlined in How to parse CDATA HTML-content of XML using SimpleXML?, you need to get the node value of that element (the HTML) and parse it with an additional parser.

The accepted answer to the linked question already shows this in detail. For SimpleXML, it makes little difference here whether the RSS feed uses CDATA or, as in your case, just entities.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss  = simplexml_load_string($feed);
$dom  = new DOMDocument(); // the HTML parser used for descriptions' HTML

foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";

    foreach ($item->description as $desc)
    {
        $dom->loadHTML((string) $desc); // cast the SimpleXMLElement to its string value (the HTML)

        $html = simplexml_import_dom($dom)->body;

        echo $html->p->a['href'], "\n";
    }
}

Example output:

...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...

Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode; in fact, I'm pretty sure it breaks things. My example also shows how to stay with the SimpleXML way of accessing further children, by turning the DOMNode back into a SimpleXMLElement once the HTML has been parsed.
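To illustrate the point about decoding, here is a small sketch with a made-up two-item feed (the XML and the example.com URL are assumptions, not the Yahoo feed). The same HTML is stored once as entities and once as CDATA; casting either <description> to a string should already give the decoded HTML, and a further htmlspecialchars_decode alters it.

```php
<?php
// Hypothetical feed: item 0 stores the HTML as entities, item 1 as CDATA.
$xml = <<<'XML'
<rss version="2.0"><channel>
<item><description>&lt;p&gt;&lt;a href="http://example.com/a"&gt;AT&amp;amp;T news&lt;/a&gt;&lt;/p&gt;</description></item>
<item><description><![CDATA[<p><a href="http://example.com/a">AT&amp;T news</a></p>]]></description></item>
</channel></rss>
XML;

$rss = simplexml_load_string($xml);

// The XML parser resolves the entities; the string cast returns plain HTML.
$fromEntities = (string) $rss->channel->item[0]->description;
$fromCdata    = (string) $rss->channel->item[1]->description;

var_dump($fromEntities === $fromCdata);    // bool(true)

// Decoding a second time rewrites the HTML itself: &amp; becomes a bare &.
$decodedAgain = htmlspecialchars_decode($fromEntities);

var_dump($decodedAgain === $fromEntities); // bool(false)
echo $decodedAgain, "\n";
```

In this sample the extra decode only produces an unescaped ampersand, but the same step would turn escaped markup that was meant as visible text into real tags.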

hakre answered Jan 24 '26 09:01


