Optimising RSS parsing on App Engine to avoid high CPU warnings

I'm pulling some RSS feeds into a datastore in App Engine to serve up to an iPhone app. I use cron to schedule updating the RSS every x minutes. Each task only parses one RSS feed (which has 15-20 items). I frequently get warnings about high CPU usage in the App Engine dashboard, so I'm looking for ways to optimise my code.

Currently, I use minidom (since it's already there on App Engine), but I suspect it's not very efficient!

Here's the code:

    dom = minidom.parseString(urlfetch.fetch(url).content)
    if dom:
        items = []
        for node in dom.getElementsByTagName('item'):
            item = RssItem(
                key_name = self.getText(node.getElementsByTagName('guid')[0].childNodes),
                title = self.getText(node.getElementsByTagName('title')[0].childNodes),
                description = self.getText(node.getElementsByTagName('description')[0].childNodes),
                modified = datetime.now(),
                link = self.getText(node.getElementsByTagName('link')[0].childNodes),
                categories = [self.getText(category.childNodes) for category in node.getElementsByTagName('category')]
            )
            items.append(item)
        db.put(items)

def getText(self, nodelist):
    rc = ''
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

There isn't much going on, but the scripts often take 2-6 seconds of CPU time, which seems a bit excessive for looping through 20 or so items and reading a few attributes.

What can I do to make this faster? Is there anything particularly bad in the above code, or should I change to another way of parsing? Are there any libraries (that work on App Engine) that would be better, or would I be better off parsing the RSS myself?

Danny Tuppeny asked Apr 01 '10 20:04


1 Answer

Outsource feed parsing, for example via Superfeedr

You could also look into superfeedr.com. They have a reasonable free quota as well as paid plans. They do the polling for you (you get updates within 15 minutes). If the feeds also support PubSubHubbub, you will receive updates in real time! This video will explain what PubSubHubbub is if you don't know yet.
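
For context, the subscriber side of PubSubHubbub is small. After you ask a hub to subscribe you to a feed, the hub sends a GET request to your callback URL with `hub.mode`, `hub.topic` and `hub.challenge` parameters, and you confirm by answering with a 2xx status and the challenge as the response body (or a 404 if you did not ask for that subscription). New entries are then POSTed to the same URL. Here is a rough sketch of just the verification step, written as a plain function so it is independent of any web framework. The names `handle_verification` and `pending_topics` are made up for this example and are not part of any library.

```python
def handle_verification(params, pending_topics):
    """Answer a hub's verification GET request.

    params: dict of query-string parameters sent by the hub.
    pending_topics: collection of feed URLs we actually asked to (un)subscribe to.
    Returns a (status_code, response_body) tuple (hypothetical helper).
    """
    mode = params.get('hub.mode')
    topic = params.get('hub.topic')
    challenge = params.get('hub.challenge')

    # Only confirm requests that match something we asked the hub for.
    if mode in ('subscribe', 'unsubscribe') and topic in pending_topics and challenge:
        # Echoing the challenge back tells the hub the request was genuine.
        return 200, challenge

    # Anything else is rejected so nobody can subscribe us to arbitrary feeds.
    return 404, ''
```

In an App Engine request handler you would pass in the request's query parameters and write the returned status and body to the response.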

Improved feed parser written by Brett Slatkin

I would also advise you to watch this awesome video from Brett Slatkin explaining PubSubHubbub. I remember that somewhere in the presentation he says that he does not use Universal Feed Parser because it does too much work for his problem. He wrote his own SAX parser (he talks about it briefly at 14:10 in the video), which is lightning fast. I guess you should check out the PubSubHubbub code to find out how he accomplished this.
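
To give an idea of what a SAX-based approach looks like, here is a minimal sketch using the standard library's `xml.sax`. To be clear, this is not Brett's parser, just an illustration of the technique. The names `RssItemHandler` and `parse_rss_items` are invented for the example, and it only collects the same fields the question reads (`guid`, `title`, `description`, `link` and `category`). A SAX handler receives events as the document is read instead of building a whole DOM tree first, which is where the savings would be expected to come from. You would have to measure whether it helps enough in your case.

```python
import xml.sax


class RssItemHandler(xml.sax.ContentHandler):
    """Collects selected child elements of each <item> into plain dicts."""

    FIELDS = ('guid', 'title', 'description', 'link')

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.items = []
        self._item = None    # dict for the <item> being read, or None
        self._field = None   # name of the field being read, or None
        self._chunks = []    # text pieces collected for the current field

    def startElement(self, name, attrs):
        if name == 'item':
            self._item = {'categories': []}
        elif self._item is not None and (name in self.FIELDS or name == 'category'):
            self._field = name
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several pieces, so buffer them.
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == 'item':
            self.items.append(self._item)
            self._item = None
        elif name == self._field:
            text = ''.join(self._chunks)
            if name == 'category':
                self._item['categories'].append(text)
            else:
                self._item[name] = text
            self._field = None


def parse_rss_items(xml_bytes):
    """Parse raw RSS bytes and return a list of dicts, one per <item>."""
    handler = RssItemHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.items
```

You would call `parse_rss_items(urlfetch.fetch(url).content)` and build one `RssItem` per dict (with `key_name` taken from `guid`, as in the question) before the single `db.put(items)`.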

Alfred answered Sep 29 '22 20:09