Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to parse RSS/Atom feeds with PHP [closed]

People also ask

Is Atom compatible with RSS?

An Atom feed is very similar to an RSS feed in that it is a lightweight XML format allowing for easy syndication of web content. In fact, most RSS readers and news aggregators will be able to read Atom feeds just fine, as it is becoming a widely-used alternative to RSS feeds.

How do you read Atom feeds?

To "subscribe" to a site feed, whether RSS or ATOM, you need a feed reader. Simply point the reader to the URL (address) of the site feed, and it will do the rest: it will display the contents of the feed in a window or panel for you.

Is RSS feed push or pull?

RSS is just a data format. Nothing more. It neither pushes nor pulls. It is typically accessed by polling (and the format includes the ability to specify how often it should be polled as metadata).


I've always used the SimpleXML functions built in to PHP to parse XML documents. It's one of the few generic parsers out there that has an intuitive structure to it, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and attempt it again.

Consider this very rough, simple class using SimpleXML:

class BlogPost
{
    var $date;
    var $ts;
    var $link;

    var $title;
    var $text;
}

class BlogFeed
{
    var $posts = array();

    function __construct($file_or_url)
    {
        $file_or_url = $this->resolveFile($file_or_url);
        if (!($x = simplexml_load_file($file_or_url)))
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            $post->ts    = strtotime($item->pubDate);
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images, 
            // extraneous line breaks, etc.
            $post->summary = $this->summarizeText($post->text);

            $this->posts[] = $post;
        }
    }

    private function resolveFile($file_or_url) {
        if (!preg_match('|^https?:|', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] .'/shared/xml/'. $file_or_url;
        else
            $feed_uri = $file_or_url;

        return $feed_uri;
    }

    private function summarizeText($summary) {
        $summary = strip_tags($summary);

        // Truncate summary line to 100 characters
        $max_len = 100;
        if (strlen($summary) > $max_len)
            $summary = substr($summary, 0, $max_len) . '...';

        return $summary;
    }
}

With 4 lines, I import a rss to an array.

$feed = implode(file('http://yourdomains.com/feed.rss'));
$xml = simplexml_load_string($feed);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

For a more complex solution

$feed = new DOMDocument();
 $feed->load('file.rss');
 $json = array();
 $json['title'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
 $json['description'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
 $json['link'] = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
 $items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');

 $json['item'] = array();
 $i = 0;

 foreach($items as $key => $item) {
 $title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
 $description = $item->getElementsByTagName('description')->item(0)->firstChild->nodeValue;
 $pubDate = $item->getElementsByTagName('pubDate')->item(0)->firstChild->nodeValue;
 $guid = $item->getElementsByTagName('guid')->item(0)->firstChild->nodeValue;

 $json['item'][$key]['title'] = $title;
 $json['item'][$key]['description'] = $description;
 $json['item'][$key]['pubdate'] = $pubDate;
 $json['item'][$key]['guid'] = $guid; 
 }

echo json_encode($json);

Your other options include:

  • SimplePie
  • Last RSS
  • PHP Universal Feed Parser

I would like introduce simple script to parse RSS:

$i = 0; // counter
$url = "http://www.banki.ru/xml/news.rss"; // url to parse
$rss = simplexml_load_file($url); // XML parser

// RSS items loop

print '<h2><img style="vertical-align: middle;" src="'.$rss->channel->image->url.'" /> '.$rss->channel->title.'</h2>'; // channel title + img with src

foreach($rss->channel->item as $item) {
if ($i < 10) { // parse only 10 items
    print '<a href="'.$item->link.'">'.$item->title.'</a><br />';
}

$i++;
}

If feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call feed creator a bozo.

Otherwise you're paving way to mess that HTML ended up in.


The HTML Tidy library is able to fix some malformed XML files. Running your feeds through that before passing them on to the parser may help.


I use SimplePie to parse a Google Reader feed and it works pretty well and has a decent feature set.

Of course, I haven't tested it with non-well-formed RSS / Atom feeds so I don't know how it copes with those, I'm assuming Google's are fairly standards compliant! :)