
Parsing an RSS item that has a colon in the tag with Ruby?

Tags: parsing, ruby, rss

I'm trying to parse the info from an RSS feed that has this tag structure:

<dc:subject>foo bar</dc:subject>

using the built-in Ruby RSS library. Obviously, item.dc:subject throws a syntax error, but I can't figure out any way to pull out that info. Is there a way to get this to work, or is it possible with a different RSS library?

Gordon Fontenot asked Mar 23 '11 21:03


2 Answers

Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.

Nokogiri can use XPath accessors or CSS accessors, and both support namespaces. The last two lines below are equivalent:

require 'nokogiri'
require 'open-uri'
# URI.open replaces the bare open(url) call, which no longer works on Ruby 3.0+
doc = Nokogiri::XML(URI.open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text
doc.at('dc|subject').text

If the namespace isn't declared on the document's root element, you'll need to pass the declaration to the accessor yourself:

doc.at('//dc:subject', 'dc' => 'link to dc declaration') 

See the "Namespaces" section of the Nokogiri documentation for more info.

Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.

A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and Atom. Ruby's RSS library wasn't up to the task, but Nokogiri was awesome.

If you don't want to roll your own, Paul Dix's Feedzirra (since renamed Feedjira) is a good gem for processing feeds.

the Tin Man answered Sep 19 '22 15:09


The RSS module seems to be able to handle namespaced elements such as <dc:date>. The colon becomes an underscore in the accessor name, like this:

feed.items.each do |item|
  puts "Date: #{item.dc_date}"
end
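Applied to the `<dc:subject>` tag from the question, the same underscore pattern would look roughly like the sketch below. The feed is a made-up inline sample, and it assumes the `rss` library is available (on Ruby 3.0+ it ships as a bundled gem rather than part of the standard library) and that the feed is well-formed enough for the parser to accept it.

```ruby
require 'rss'

# Hypothetical sample feed using the Dublin Core namespace.
xml = <<~XML
  <?xml version="1.0"?>
  <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
      <title>Example</title>
      <link>http://example.com/</link>
      <description>Sample feed</description>
      <item>
        <title>First item</title>
        <dc:subject>foo bar</dc:subject>
      </item>
    </channel>
  </rss>
XML

feed = RSS::Parser.parse(xml)

# dc:subject is exposed as dc_subject on each item.
subjects = feed.items.map(&:dc_subject)

subjects.each { |subject| puts "Subject: #{subject}" }
```

If the real feed doesn't meet the spec, the parser may raise or return nil, which is where the Nokogiri approach from the other answer tends to be more forgiving.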

CamelBlues answered Sep 19 '22 15:09