I'm trying to determine whether a given feed is Atom based or RSS based.
Here's my code:
public boolean isRSS(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder();
    Document doc = builder
            .parse(URL);
    return doc.getDocumentElement().getNodeName().equalsIgnoreCase() == "rss";
}
Is there a better way to do it? Would it be better if I used a SAX parser instead?
The root element is the easiest way to determine the type of a feed.
rss for RSS (see specification)
feed for Atom (see specification)
Different parsers offer different ways to get the root element, and none is inferior to the others. Enough has been written about StAX vs. SAX vs. DOM etc. to serve as a basis for a specific decision.
There is nothing wrong with your first two lines of code:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(URL);
Your return statement, however, makes a mistake in Java String comparison.
When you use the comparison operator == with Strings, it compares references, not values (i.e. you check whether both are exactly the same object). You should use the equals() method here. Just to be sure, I would recommend using equalsIgnoreCase():
return doc.getDocumentElement().getNodeName().equalsIgnoreCase("rss");
Hint: If you check for "rss" instead of "feed" (as you would for Atom) in your isRSS() method, you don't have to use the ternary operator.
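To illustrate the difference between == and equals(), here is a minimal sketch. The class StringComparisonDemo and the helper isRssName() are hypothetical names made up for this example; new String(...) simply stands in for a node name that a parser builds at runtime.

```java
class StringComparisonDemo {
    // Hypothetical helper: true when the given root element name is "rss".
    // Putting the constant first keeps the check safe for a null argument.
    static boolean isRssName(String rootName) {
        return "rss".equals(rootName);
    }

    public static void main(String[] args) {
        // new String(...) forces a distinct object, much like a name created by a parser
        String nodeName = new String("rss");

        System.out.println(nodeName == "rss");      // false: different objects
        System.out.println(nodeName.equals("rss")); // true: same characters
        System.out.println(isRssName(nodeName));    // true
    }
}
```

Since the comparison already yields a boolean, it can be returned directly, with no ternary operator.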
Sniffing content is one method. But note that Atom uses namespaces, and you are creating a parser that is not namespace-aware.
public boolean isAtom(String URL) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
    f.setNamespaceAware(true); // required for getLocalName() and getNamespaceURI()
    DocumentBuilder builder = f.newDocumentBuilder(); // reuse the configured factory
    Document doc = builder.parse(URL);
    Element e = doc.getDocumentElement();
    // Constant-first equals() stays safe if the root element has no namespace (null URI)
    return "feed".equals(e.getLocalName()) &&
            "http://www.w3.org/2005/Atom".equals(e.getNamespaceURI());
}
Note also that you cannot compare using equalsIgnoreCase(), since XML element names are case-sensitive.
Another method is to react to the Content-Type header, if it is available in an HTTP GET request. The Content-Type for Atom would be application/atom+xml and for RSS application/rss+xml. I suspect, though, that not all RSS feeds can be trusted to set this header correctly.
A third option is to look at the URL suffix, e.g. .atom and .rss.
The last two methods are easily configurable if you are using Spring or JAX-RS.
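Without any framework, the header and suffix checks can be sketched as two plain string classifiers. This is only a rough sketch: the class FeedHints, the enum FeedType and both method names are made up for this example, and the result should be treated as a hint rather than proof, since headers and suffixes can be wrong or missing.

```java
import java.util.Locale;

final class FeedHints {
    enum FeedType { RSS, ATOM, UNKNOWN }

    // Classify a raw Content-Type value such as "application/atom+xml; charset=utf-8".
    static FeedType fromContentType(String contentType) {
        if (contentType == null) {
            return FeedType.UNKNOWN;
        }
        // Drop parameters like ";charset=..." and normalize case before comparing
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (mediaType) {
            case "application/atom+xml": return FeedType.ATOM;
            case "application/rss+xml":  return FeedType.RSS;
            default:                     return FeedType.UNKNOWN;
        }
    }

    // Classify a URL by its suffix (.atom or .rss), ignoring query string and fragment.
    static FeedType fromUrlSuffix(String url) {
        if (url == null) {
            return FeedType.UNKNOWN;
        }
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        if (path.endsWith(".atom")) {
            return FeedType.ATOM;
        }
        if (path.endsWith(".rss")) {
            return FeedType.RSS;
        }
        return FeedType.UNKNOWN;
    }
}
```

With a plain URLConnection, the header value could be obtained via getContentType() and passed to fromContentType(). When both hints come back UNKNOWN, falling back to inspecting the root element is probably the safer choice.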
You could use a StAX parser to avoid parsing the entire XML document into memory:
public boolean isAtom(String url) throws XMLStreamException, IOException {
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // createXMLStreamReader() needs an InputStream, so open the URL's stream
    try (InputStream in = new URL(url).openStream()) {
        XMLStreamReader xsr = xif.createXMLStreamReader(in);
        xsr.nextTag(); // Advance to root element
        return "feed".equals(xsr.getLocalName()) &&
                "http://www.w3.org/2005/Atom".equals(xsr.getNamespaceURI());
    }
}
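The same StAX idea can be extended to tell both formats apart in one pass. The following is a sketch under a few assumptions: the class FeedSniffer, the enum FeedType and the method detect() are names invented for this example, the caller supplies (and closes) the InputStream, and only the two root elements mentioned above (rss and the namespaced feed) are recognized.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class FeedSniffer {
    enum FeedType { RSS, ATOM, UNKNOWN }

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Reads only as far as the root element and classifies the feed by it.
    static FeedType detect(InputStream in) throws XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        // Only the root element is needed, so DTD support and external entities are switched off
        xif.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        xif.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader xsr = xif.createXMLStreamReader(in);
        try {
            while (xsr.hasNext()) {
                // The first START_ELEMENT event is the root element
                if (xsr.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = xsr.getLocalName();
                    if ("feed".equals(name) && ATOM_NS.equals(xsr.getNamespaceURI())) {
                        return FeedType.ATOM;
                    }
                    if ("rss".equals(name)) {
                        return FeedType.RSS;
                    }
                    return FeedType.UNKNOWN;
                }
            }
            return FeedType.UNKNOWN;
        } finally {
            xsr.close(); // releases the reader; the underlying stream stays the caller's job
        }
    }
}
```

A caller could pass new URL(url).openStream() inside a try-with-resources block, so that the connection is closed regardless of the outcome.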