
How do you extract feed urls from an OPML file exported from Google Reader?

I have a piece of software called Rss-Aware that I'm trying to use. It's basically a desktop feed-checker that watches RSS feeds for updates and sends a notification through Ubuntu's Notify-OSD system.

However, to tell it which feeds to check, you have to list the feed URLs in a text file at ~/.rss-aware/rssfeeds.txt, one URL per line. Something like:

http://example.com/feed.xml
http://othersite.org/feed.xml
http://othergreatsite.net/rss.xml

...Seems pretty simple, right? Well, the list of feeds I'd like to use is exported from Google Reader as an OPML file (a type of XML), and I have no clue how to parse it to output just the feed URLs. It seems like it should be pretty straightforward, yet I'm stumped.

I'd love it if anyone could give an implementation in Python or Ruby or something I could run quickly from a prompt. A bash script would be awesome too.

Thank you so much for the help; I'm a really weak programmer and would love to learn how to do this basic parsing.

EDIT: Also, here is the OPML file I'm trying to extract the feed URLs from.

Sergei R. asked Apr 23 '11

3 Answers

I wrote a subscription list parser for this very purpose. It's called listparser, and it's written in Python. I just tested your OPML file, and it appears to parse the file perfectly. It will also make your feeds' labels available.

If you've ever used feedparser, the interface should be familiar:

>>> import listparser as lp
>>> d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
>>> len(d.feeds)
112
>>> d.feeds[100].url
u'http://longreads.com/rss'
>>> d.feeds[100].tags
[u'reading']

It's possible to create the file with feed URLs using a script similar to:

import listparser as lp

d = lp.parse('https://dl.dropbox.com/u/670189/google-reader-subscriptions.xml')
# Write one feed URL per line, the format rss-aware expects
with open('/home/USERNAME/.rss-aware/rssfeeds.txt', 'w') as f:
    for feed in d.feeds:
        f.write(feed.url + '\n')

Just replace USERNAME with your actual username. Done!

Kurt McKee answered Oct 12 '22


XML parsing was so easy to implement and worked great for me.

from xml.etree import ElementTree
def extract_rss_urls_from_opml(filename):
    urls = []
    with open(filename, 'rt') as f:
        tree = ElementTree.parse(f)
    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:
            urls.append(url)
    return urls
urls = extract_rss_urls_from_opml('your_file')
Ash answered Oct 13 '22


Since it's an XML file, you can use an XPath query to extract the URLs. In the OPML file, the RSS feed URLs are stored in xmlUrl attributes, so the XPath expression //@xmlUrl will select all values of that attribute.

If you want to test this out in your web browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.
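For reference, here is a minimal sketch of the same //@xmlUrl idea using only the standard library. ElementTree's limited XPath support can't return attribute values directly, so the sketch selects the outline elements that carry the attribute instead; the two-feed OPML snippet is made up for illustration:

```python
from xml.etree import ElementTree

# A hypothetical OPML snippet standing in for the Google Reader export.
opml = """<opml version="1.0">
  <body>
    <outline text="Folder">
      <outline text="Example" type="rss" xmlUrl="http://example.com/feed.xml"/>
      <outline text="Other" type="rss" xmlUrl="http://othersite.org/feed.xml"/>
    </outline>
  </body>
</opml>"""

root = ElementTree.fromstring(opml)
# Equivalent of //@xmlUrl: match every <outline> that has an xmlUrl
# attribute, anywhere in the tree, then read the attribute off each match.
urls = [node.get('xmlUrl') for node in root.findall('.//outline[@xmlUrl]')]
print(urls)
```

With lxml installed, `tree.xpath('//@xmlUrl')` would return the attribute values in one call, since lxml supports full XPath 1.0.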

Josh Rosen answered Oct 13 '22