
Which HTML DOM parser works best on Android?

I need to process some HTML pages in my Android app, and I would prefer to use XPath to extract the relevant information. For regular J2SE there are many libraries that can parse regular HTML into an org.w3c.dom.Document:

  • jTidy
  • TagSoup
  • Jericho
  • NekoHTML
  • HTMLCleaner

(The list may be incomplete; it was taken from https://stackoverflow.com/questions/2009897/recommend-an-alternative-to-jtidy)

However, it is hard to judge whether and how well these libraries work on Android (library size, CPU and memory consumption).

Based on your experience, which library would you choose for Android?
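For context, the XPath step itself needs nothing beyond the standard javax.xml.xpath API once an org.w3c.dom.Document exists. The following is only a minimal sketch of that step: the class and method names are made up for illustration, and the JDK's XML parser stands in for the HTML parser, so it only accepts well-formed markup (which is exactly why one of the libraries listed above is needed for real pages).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any of the libraries discussed here.
final class XPathExtractionSketch {

    // Stand-in for the HTML parser (assumption): the JDK XML parser,
    // which rejects anything that is not well-formed.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed", e);
        }
    }

    // Evaluates an XPath expression and returns the trimmed text of each matching node.
    static List<String> extractText(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><ul><li>one</li><li>two</li></ul></body></html>";
        Document doc = parseWellFormed(page);
        System.out.println(extractText(doc, "/html/body/ul/li")); // prints [one, two]
    }
}
```

Whichever HTML library is chosen would replace parseWellFormed here; extractText only depends on the org.w3c.dom.Document it is given.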

asked Sep 25 '11 14:09 by Robert



1 Answer

OK, it looks like no one can answer this question, so I had to check it myself.

jTidy

I downloaded the latest jTidy sources, compiled them, and added the resulting jar file as a library to my Android app. There were no problems using jTidy in the app (on both the emulator and a real phone). At runtime jTidy works correctly, but it does not seem to be a good fit for the limited Android environment: it is very slow. According to the Logcat output, even parsing a ~10 KB HTML file makes the garbage collector work heavily.
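A rough way to put numbers on an observation like this is to time repeated parses and compare the used heap before and after. This is only a sketch under assumptions: ParseCostSketch and its members are hypothetical names, the JDK XML parser stands in for the library under test, and the heap delta is a coarse indicator (it can even be negative if the garbage collector runs in between). On a device, the GC lines in Logcat remain the more direct signal.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical measurement helper; not taken from jTidy or Android.
final class ParseCostSketch {

    // Result of one measurement; both values are rough indicators only.
    static final class Cost {
        final long nanosPerRun;
        final long heapDeltaBytes;

        Cost(long nanosPerRun, long heapDeltaBytes) {
            this.nanosPerRun = nanosPerRun;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    // Runs the given parse task several times and reports the average
    // wall-clock time plus the change in used heap across all runs.
    static Cost measure(Runnable parse, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            parse.run();
        }
        long elapsed = System.nanoTime() - start;
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        return new Cost(elapsed / runs, heapAfter - heapBefore);
    }

    public static void main(String[] args) {
        // Build a well-formed test page of roughly 10 KB.
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < 500; i++) {
            sb.append("<p>paragraph ").append(i).append("</p>");
        }
        String page = sb.append("</body></html>").toString();

        // Stand-in workload (assumption): swap in the HTML library under test here.
        Cost cost = measure(() -> {
            try {
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .parse(new InputSource(new StringReader(page)));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }, 20);

        System.out.println(cost.nanosPerRun / 1_000 + " us per parse, heap delta "
                + cost.heapDeltaBytes / 1024 + " KiB");
    }
}
```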

HTMLCleaner

In my experience HTMLCleaner also works well on Android, and the library is relatively small (106 KB for v2.2). However, the DOM it produces is not what I expected: for example, HTMLCleaner inserts additional <span> elements into the DOM. That may be fine if you want to display the result as an HTML file, but for my use case (extracting information via XPath expressions) it is a no-go!
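To illustrate why extra elements matter for XPath, here is a small sketch using only the JDK. The two markup strings are invented for the example and do not reproduce HTMLCleaner's actual output; the second one simply wraps the paragraph in a <span> to mimic an element added by a parser. The class name WrapperEffectSketch is likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the markup below is made up for illustration.
final class WrapperEffectSketch {

    // Parses well-formed markup and counts the nodes matched by an XPath expression.
    static int countMatches(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String asWritten = "<html><body><p>price</p></body></html>";
        // Same content with an extra wrapper element, as a parser might add one.
        String withWrapper = "<html><body><span><p>price</p></span></body></html>";

        // A path written against the page source matches the first DOM only.
        System.out.println(countMatches(asWritten, "/html/body/p"));   // prints 1
        System.out.println(countMatches(withWrapper, "/html/body/p")); // prints 0
        // A descendant-axis expression does not depend on the direct parent.
        System.out.println(countMatches(withWrapper, "//p"));          // prints 1
    }
}
```

Any expression that relies on exact parent-child steps silently returns nothing once such a wrapper appears, so paths written against the page source can no longer be trusted.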

TagSoup

Not tested

Jericho

Not tested

NekoHTML

Not tested

JSoup

Not tested

answered Oct 12 '22 23:10 by Robert