Determining whether a feed is Atom or RSS

Question

I'm trying to determine whether a given feed is Atom based or RSS based.

Here's my code:

public boolean isRSS(String URL) throws ParserConfigurationException, SAXException, IOException{
        DocumentBuilder builder = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder();
        Document doc = builder
                .parse(URL);
        return doc.getDocumentElement().getNodeName() == "rss";
    }

Is there a better way to do it? Would it be better to use a SAX parser instead?

Chris · Accepted Answer

The root element is the easiest way to determine the type of a feed.

RSS feeds have the root element rss (see specification)
Atom feeds have the root element feed (see specification)
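
A minimal sketch of that root-element check, assuming the feed XML is already in memory as a String. FeedTypeSniffer, FeedType and detect are hypothetical names made up for illustration, and wrapping parser errors in an IllegalArgumentException is just one possible choice:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class FeedTypeSniffer {
    enum FeedType { RSS, ATOM, UNKNOWN }

    // Classifies a feed by the name of its root element.
    static FeedType detect(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            // The parser here is not namespace-aware, so a prefixed root
            // such as atom:feed would not match "feed".
            String name = root.getNodeName();
            if ("rss".equals(name)) {
                return FeedType.RSS;
            }
            if ("feed".equals(name)) {
                return FeedType.ATOM;
            }
            return FeedType.UNKNOWN;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: callers prefer an unchecked exception for unparseable input.
            throw new IllegalArgumentException("Not a parseable XML document", e);
        }
    }
}
```

Returning a three-valued result instead of a boolean keeps "not RSS" from silently meaning "Atom" when the document is something else entirely.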

Each parser has its own way of getting at the root element, and none is inherently worse than the others. Plenty has been written about StAX vs. SAX vs. DOM, which you can use as the basis for a specific decision.

There is nothing wrong with your first two lines of code:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(URL);

Your return statement makes a mistake in Java String comparison.

When you use the comparison operator == with Strings, it compares references, not values (i.e. it checks whether both are exactly the same object). You should use the equals() method here. To be on the safe side, I would recommend equalsIgnoreCase():

return doc.getDocumentElement().getNodeName().equalsIgnoreCase("rss");
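
To see the difference between == and equals() in isolation, here is a small sketch. StringComparisonDemo and its method names are hypothetical, chosen only for this illustration:

```java
// Hypothetical demo class, not taken from the question's code.
final class StringComparisonDemo {
    // == on objects: true only if both variables refer to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // equals(): true if both strings contain the same characters.
    static boolean sameValue(String a, String b) {
        return a.equals(b);
    }

    // equalsIgnoreCase(): like equals(), but ignores upper/lower case.
    static boolean sameValueIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }
}
```

With a string built at runtime, such as new String("rss"), sameReference(..., "rss") is false even though sameValue(..., "rss") is true. A node name returned by a parser is in the same situation: it is generally not the same object as the literal "rss" in your source code.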

Hint: If you check for "rss" instead of "feed" (as you would for Atom) in your isRSS() method, you don't need the ternary operator.

forty-two · Answer

Sniffing the content is one method. But note that Atom uses namespaces, and you are creating a parser that is not namespace-aware.

public boolean isAtom(String URL) throws ParserConfigurationException, SAXException, IOException{
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true);
    // Build from the configured factory; calling newInstance() again would discard the setting
    DocumentBuilder builder = f.newDocumentBuilder();
    Document doc = builder.parse(URL);
    Element e = doc.getDocumentElement();
    // Constant-first equals() stays null-safe when the root element has no namespace
    return "feed".equals(e.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(e.getNamespaceURI());
}

Note also that you cannot compare using equalsIgnoreCase(), since XML element names are case-sensitive.

Another method is to act on the Content-Type header, if it is available in the response to an HTTP GET request. The Content-Type for Atom would be application/atom+xml and for RSS application/rss+xml. I suspect, though, that not all RSS feeds can be trusted to set this header correctly.
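
A rough sketch of the header-based check, assuming you only want to look at the media type and ignore parameters such as charset. ContentTypeSniffer and its methods are hypothetical names. The plain strings "atom", "rss" and "unknown" are just a convenient return convention for this example:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeSniffer {
    // Maps a Content-Type header value to "atom", "rss" or "unknown".
    static String feedTypeFromContentType(String contentType) {
        if (contentType == null) {
            return "unknown"; // header missing
        }
        // Drop parameters such as "; charset=utf-8" and normalise case,
        // since media type names are case-insensitive.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "application/atom+xml":
                return "atom";
            case "application/rss+xml":
                return "rss";
            default:
                return "unknown";
        }
    }

    // Opens a connection to the URL and classifies whatever Content-Type the server reports.
    static String feedTypeFromUrl(String url) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        return feedTypeFromContentType(connection.getContentType());
    }
}
```

Many feeds are served with a generic type such as text/xml, so an "unknown" result here probably means you should fall back to looking at the root element rather than give up.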

A third option is to look at the URL suffix, e.g. .atom and .rss.
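
The suffix check could look roughly like this. UrlSuffixSniffer and feedTypeFromSuffix are hypothetical names, and the sketch assumes the suffix sits at the end of the URL path, so a query string is ignored:

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class UrlSuffixSniffer {
    // Guesses the feed type from the end of the URL path: "atom", "rss" or "unknown".
    static String feedTypeFromSuffix(String url) {
        // getPath() leaves out the query string and fragment.
        String path = URI.create(url).getPath();
        if (path == null) {
            return "unknown"; // opaque URIs such as mailto: have no path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".atom")) {
            return "atom";
        }
        if (lower.endsWith(".rss")) {
            return "rss";
        }
        return "unknown";
    }
}
```

This is only a hint, since nothing forces a server to name its feeds this way.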

The last two methods are easily configurable if you are using Spring or JAX-RS.

bdoughan · Answer

You could use a StAX parser to avoid parsing the entire XML document into memory:

public boolean isAtom(String url) throws XMLStreamException, IOException {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // createXMLStreamReader needs an InputStream, so open the URL's stream
    XMLStreamReader xsr = xif.createXMLStreamReader(new URL(url).openStream());
    xsr.nextTag();  // Advance to root element
    return "feed".equals(xsr.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(xsr.getNamespaceURI());
}
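
Here is a variant sketch of the same StAX idea that works on XML already held in a String, which makes it easy to try without a network connection. StaxFeedSniffer is a hypothetical name. It loops with next() until the first start element instead of calling nextTag(), on the assumption that the prolog may contain events other than whitespace, comments and processing instructions, and it closes the reader when done:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class StaxFeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Reads only as far as the root element and reports whether it is an Atom <feed>.
    static boolean isAtom(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return "feed".equals(reader.getLocalName())
                                && ATOM_NS.equals(reader.getNamespaceURI());
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Assumption: callers prefer an unchecked exception for unparseable input.
            throw new IllegalArgumentException("Not a parseable XML document", e);
        }
    }
}
```

Note that XMLStreamReader.close() does not close the underlying stream, so when reading from a URL you would still need to close that InputStream yourself.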

Tags:

java

xml

rss

atom-feed

Mahmoud Hanafy
