Determining whether a feed is Atom or RSS

I'm trying to determine whether a given feed is Atom-based or RSS-based.

Here's my code:

public boolean isRSS(String URL) throws ParserConfigurationException, SAXException, IOException{
        DocumentBuilder builder = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder();
        Document doc = builder
                .parse(URL);
        return doc.getDocumentElement().getNodeName().equalsIgnoreCase() == "rss";
    }

Is there a better way to do it? Would it be better to use a SAX parser instead?

Mahmoud Hanafy asked Sep 29 '11 00:09


3 Answers

The root element is the easiest way to determine the type of a feed.

  • RSS feeds have the root element rss (see specification)
  • Atom feeds have the root element feed (see specification)
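
As a minimal sketch of that mapping (the `FeedType` enum and its `fromRootElement` method are hypothetical names, not part of any library), the decision can be isolated from the parsing so it works with whichever parser supplies the root element name:

```java
// Hypothetical helper: maps a root element's local name to a feed type.
enum FeedType {
    RSS, ATOM, UNKNOWN;

    static FeedType fromRootElement(String localName) {
        // Constant-first equals() also copes with a null name.
        if ("rss".equals(localName)) {
            return RSS;
        }
        if ("feed".equals(localName)) {
            return ATOM;
        }
        return UNKNOWN;
    }
}
```

For example, `FeedType.fromRootElement("feed")` yields `FeedType.ATOM`, and any other root element name yields `FeedType.UNKNOWN`.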

Different parsers offer different ways to get the root element, and none is inherently inferior to the others. Plenty has been written about StAX vs. SAX vs. DOM, which you can use as a basis for choosing one.

There is nothing wrong with your first two lines of code:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(URL);

In your return statement there is a mistake in the String comparison.

When you use the comparison operator == with Strings, it compares references, not values (i.e. it checks whether both are exactly the same object). You should use the equals() method here. Just to be sure, I would recommend using equalsIgnoreCase():

return doc.getDocumentElement().getNodeName().equalsIgnoreCase("rss");
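
To make the difference concrete, here is a small sketch (the class and method names are made up for illustration) that contrasts the two kinds of comparison:

```java
// Hypothetical demo class contrasting reference and value comparison.
final class StringComparisonDemo {
    // Reference comparison: true only if both variables point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Value comparison: true if both strings contain the same characters.
    static boolean sameValue(String a, String b) {
        return a.equals(b);
    }
}
```

With `String built = new String("rss")`, `sameReference(built, "rss")` is false because `built` is a separate object, while `sameValue(built, "rss")` is true. A node name returned by a parser is in the same position: it is not guaranteed to be the same object as your literal.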

Hint: If your isRSS() method checks for "rss" (rather than "feed", as you would for Atom), you can return the result of the comparison directly and don't need the ternary operator.

Chris answered Nov 07 '22 14:11


Sniffing the content is one method. Note, however, that Atom uses namespaces, and you are creating a parser that is not namespace-aware.

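public boolean isAtom(String URL) throws ParserConfigurationException, SAXException, IOException{
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true);  // needed for getLocalName() and getNamespaceURI()
    // Build from the configured factory; calling newInstance() again would discard the setting
    DocumentBuilder builder = f.newDocumentBuilder();
    Document doc = builder.parse(URL);
    Element e = doc.getDocumentElement();
    // Constant-first equals() avoids a NullPointerException when the root has no namespace
    return "feed".equals(e.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(e.getNamespaceURI());
}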

Note also that you should not compare using equalsIgnoreCase(), since XML element names are case-sensitive.

Another method is to react to the Content-Type header, if it is available, in the response to an HTTP GET request. The Content-Type for Atom would be application/atom+xml, and for RSS application/rss+xml. I suspect, though, that not all RSS feeds can be trusted to set this header correctly.
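
A rough sketch of the header check, assuming you already have the raw header value as a String (the `ContentTypeSniffer` class and the `"atom"`/`"rss"`/`"unknown"` return values are invented for this example):

```java
import java.util.Locale;

// Hypothetical helper: classifies a feed by its Content-Type header value.
final class ContentTypeSniffer {
    static String feedKind(String contentType) {
        if (contentType == null) {
            return "unknown";  // header missing
        }
        // Drop parameters such as "; charset=utf-8"; media types are case-insensitive.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "application/atom+xml":
                return "atom";
            case "application/rss+xml":
                return "rss";
            default:
                return "unknown";  // e.g. text/xml or application/xml
        }
    }
}
```

The value could come from `URLConnection.getContentType()`, which returns null when the header is not known. Since many servers send a generic XML type, an "unknown" result is probably best treated as a cue to fall back to inspecting the root element.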

A third option is to look at the URL suffix, e.g. .atom and .rss.
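
The suffix check might look roughly like this (again a sketch with invented names; it assumes the URL is syntactically valid, because `URI.create` throws `IllegalArgumentException` otherwise):

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: guesses the feed type from the URL's path suffix.
final class UrlSuffixSniffer {
    static String feedKind(String url) {
        // getPath() leaves out the query string and fragment; it is null for opaque URIs.
        String path = URI.create(url).getPath();
        if (path == null) {
            return "unknown";
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".atom")) {
            return "atom";
        }
        if (lower.endsWith(".rss")) {
            return "rss";
        }
        return "unknown";
    }
}
```

Many feed URLs carry no such suffix, so this is only a hint, not a reliable test.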

The last two methods are easily configurable if you are using Spring or JAX-RS.

forty-two answered Nov 07 '22 14:11


You could use a StAX parser to avoid parsing the entire XML document into memory:

public boolean isAtom(String url) throws XMLStreamException, IOException {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // createXMLStreamReader() needs an InputStream, not a URLConnection;
    // try-with-resources closes the stream once the root element has been read
    try (InputStream in = new URL(url).openStream()) {
        XMLStreamReader xsr = xif.createXMLStreamReader(in);
        xsr.nextTag();  // Advance to root element
        return "feed".equals(xsr.getLocalName()) &&
                "http://www.w3.org/2005/Atom".equals(xsr.getNamespaceURI());
    }
}
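
If you want to try the idea without a network connection, a variant that reads from any InputStream could look like the sketch below. The `StaxFeedSniffer` class is a made-up name, and the explicit event loop is a substitution for `nextTag()` so that any event preceding the root element (comments, a DOCTYPE declaration) is simply skipped:

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: checks the root element of a feed read from a stream.
final class StaxFeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    static boolean isAtom(InputStream in) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newFactory().createXMLStreamReader(in);
        try {
            // Pull events until the first start element, which is the root.
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    return "feed".equals(xsr.getLocalName())
                            && ATOM_NS.equals(xsr.getNamespaceURI());
                }
            }
            return false;  // no element found at all
        } finally {
            xsr.close();  // releases the reader; the caller still owns the stream
        }
    }
}
```

Only the events up to the root element are read, so the rest of the document is never parsed. In a test you can pass a `ByteArrayInputStream` wrapped around a small XML string.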
bdoughan answered Nov 07 '22 15:11