I need to process some HTML pages in my Android app, and I would prefer to use XPath to extract the relevant information. For regular J2SE there are many implementations that parse real-world HTML into an org.w3c.dom.Document:
(List may be incomplete - it has been extracted from https://stackoverflow.com/questions/2009897/recommend-an-alternative-to-jtidy)
But it is hard to estimate whether and how well those libraries work on Android (library size, CPU and memory consumption).
Based on your experience - what is the library of your choice for Android?
An Android DOM (Document Object Model) parser parses an XML document and extracts the required information from it. It uses an object-based approach for creating and parsing XML files. In general, a DOM parser loads the whole XML file into memory as a tree of objects before the document can be queried.
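To make the DOM approach concrete for the XPath use case from the question, here is a minimal sketch that uses only the javax.xml and org.w3c.dom classes. It assumes the input is already well-formed markup (which real-world HTML usually is not, hence the need for one of the cleaning libraries). The class name XPathSketch and the sample page are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library mentioned in this thread.
final class XPathSketch {

    // Parses well-formed markup into a DOM tree that is held entirely in memory.
    static Document parse(String wellFormedMarkup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    // Evaluates an XPath expression and returns the text content of every matching node.
    static List<String> select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XPath expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Sample input, assumed to be already cleaned up into well-formed markup.
        String page = "<html><body><a href=\"/a\">First</a>"
                + "<p><a href=\"/b\">Second</a></p></body></html>";
        Document doc = parse(page);
        System.out.println(select(doc, "//a/@href")); // prints [/a, /b]
        System.out.println(select(doc, "//a"));       // prints [First, Second]
    }
}
```

Any library that produces an org.w3c.dom.Document could feed the select method in place of parse, which is why the question asks for that output type.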
Documents created by DOMParser have scripting disabled; scripts are parsed but not run, so it should be safe against XSS.
The DOMParser interface provides the ability to parse XML or HTML source code from a string into a DOM Document. You can perform the opposite operation, converting a DOM tree into XML or HTML source, using the XMLSerializer interface.
OK, it looks like no one can answer this question, so I had to check it myself.
jTidy
I downloaded the latest jTidy sources, compiled them, and added the resulting jar file as a library to my Android app. There were no problems using jTidy in my app (emulator and real phone). At runtime jTidy also works correctly, but it does not seem to be a good fit for the limited Android environment: it is really slow. Judging by the Logcat output, even parsing a ~10 KB HTML file makes the garbage collector work heavily.
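For anyone who wants numbers rather than Logcat impressions, a rough measurement harness could look like the sketch below. It is not what I used for the observation above. The names ParseBenchmarkSketch, Parser and Result are made up, the JDK DocumentBuilder only serves as a stand-in for the library under test, and the heap figure derived from Runtime is approximate because the garbage collector may run at any time.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical harness; swap the Parser implementation for the library being evaluated.
final class ParseBenchmarkSketch {

    // Minimal abstraction over "parse these bytes"; an assumption made for this sketch.
    interface Parser {
        void parse(byte[] input) throws Exception;
    }

    // Average time per parse and the rough change in used heap over the whole run.
    static final class Result {
        final long averageNanos;
        final long heapDeltaBytes;

        Result(long averageNanos, long heapDeltaBytes) {
            this.averageNanos = averageNanos;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    static Result measure(Parser parser, byte[] input, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        try {
            for (int i = 0; i < iterations; i++) {
                parser.parse(input);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Parser failed", e);
        }
        long elapsed = System.nanoTime() - start;
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        return new Result(elapsed / iterations, heapAfter - heapBefore);
    }

    // Builds a well-formed test document of at least approxBytes characters.
    static byte[] sampleDocument(int approxBytes) {
        StringBuilder sb = new StringBuilder("<html><body>");
        while (sb.length() < approxBytes) {
            sb.append("<p>row</p>");
        }
        sb.append("</body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in parser based on the JDK's own DocumentBuilder.
    static Parser jdkParser() {
        return input -> DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(input));
    }

    public static void main(String[] args) {
        Result r = measure(jdkParser(), sampleDocument(10_000), 20);
        System.out.println("avg ns per parse: " + r.averageNanos);
        System.out.println("approx heap delta in bytes: " + r.heapDeltaBytes);
    }
}
```

On a device, the same loop can be run while watching Logcat, so the garbage collector messages can be lined up with the measured times.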
HTMLCleaner
In my experience HTMLCleaner also works well on Android; the library is relatively small (106 KB for v2.2). However, the DOM it produces is not what I expected: HTMLCleaner inserts, for example, additional <span> elements into the DOM. This may be OK if you want to display the result as an HTML file, but for my use case (extracting information via XPath expressions) it is a no-go!
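To illustrate why an unexpected wrapper element matters for XPath, the following sketch compares a path expression against two hand-written documents. The second document only imitates the kind of change described above; it is not actual HTMLCleaner output, and the class name ExtraSpanSketch is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; both documents below are invented for illustration.
final class ExtraSpanSketch {

    // Returns how many nodes the XPath expression selects in the given well-formed markup.
    static int count(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Structure as written in the source page.
        String expected = "<html><body><div><p>price</p></div></body></html>";
        // Same page with an extra <span> wrapper, as a cleaner might add (assumed placement).
        String withSpan = "<html><body><div><span><p>price</p></span></div></body></html>";

        String strictPath = "/html/body/div/p";
        System.out.println(count(expected, strictPath));          // prints 1
        System.out.println(count(withSpan, strictPath));          // prints 0
        // The descendant axis tolerates the wrapper, at the cost of a less precise match.
        System.out.println(count(withSpan, "/html/body/div//p")); // prints 1
    }
}
```

Expressions written against the original page structure silently stop matching, which is hard to notice when the pages are not under your control.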
TagSoup
Not tested
Jericho
Not tested
NekoHTML
Not tested
JSoup
Not tested