I am writing a little screen-scraping app that consumes some XHTML - it goes without saying that the XHTML is invalid: ampersands aren't escaped as &amp;.
I am using Android's XmlPullParser and it spews out the following error upon the incorrectly encoded value:
org.xmlpull.v1.XmlPullParserException: unterminated entity ref
(position:START_TAG <a href='/Fahrinfo/bin/query.bin/dox?ld=0.1&n=3&i=9c.0323581.1266265347&rt=0&vcra'>
@55:134 in java.io.InputStreamReader@43b1ef70)
How do I get around this? I have thought about the following solutions:
1. Wrapping the InputStream in another one that replaces the ampersands with entity refs
2. Configuring the parser so that it accepts the incorrect markup
Which one is likely to be more successful?
Android recommends using XmlPullParser rather than SAX or DOM to parse XML because it is fast. The org.xmlpull.v1.XmlPullParser interface provides the functionality for parsing an XML document this way.
XmlPullParser is an interface that defines the parsing functionality provided by the XMLPULL V1 API (see the XMLPULL website to learn more about the API and its implementations).
XmlPullParser exposes an XML document as a series of events, such as START_DOCUMENT, START_TAG, TEXT, END_TAG, and END_DOCUMENT, which the caller steps through to parse the document.
I was stuck on this for about an hour before figuring out that, in my case, it was the "&" that the XmlPullParser couldn't resolve. Here is the snippet of code that fixed it for me.
void ParsingActivity(String r) {
    try {
        parserCreator = XmlPullParserFactory.newInstance();
        parser = parserCreator.newPullParser();
        // Escape the ampersands, then hand the XML text to the parser
        // through a Reader.
        parser.setInput(new StringReader(r.replaceAll("&", "&amp;")));
        // Unlike a SAX parser, which raises callbacks, a pull parser
        // hands back one event each time we ask for the next one.
        int parserEvent = parser.getEventType();
        // Loop over the events until we reach the end of the document.
        while (parserEvent != XmlPullParser.END_DOCUMENT) {
            switch (parserEvent) {
            // we have reached the start of a tag
            case XmlPullParser.START_TAG:
                // get the name of the tag
                String tag = parser.getName();
                // handle the tag here
                break;
            }
            // advance to the next event
            parserEvent = parser.next();
        }
    } catch (XmlPullParserException | IOException e) {
        e.printStackTrace();
    }
}
That is pretty much all I'm doing: replacing each & with &amp;, since I was dealing with parsing a URL.
Hope this helps.
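One caveat with a plain replaceAll("&", "&amp;"): if part of the input is already escaped correctly, an existing &amp; becomes &amp;amp;. Here is a minimal sketch of a more careful replacement, assuming it is enough to leave alone anything that already looks like a named or numeric entity reference. The class and method names (AmpersandFixer, escapeBareAmpersands) are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class AmpersandFixer {
    // Matches '&' only when it is NOT the start of something shaped like
    // an entity reference: &name; or &#123; or &#x1F;
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Escapes bare ampersands and leaves existing references untouched.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }
}
```

You would then pass the result to the parser the same way as above, e.g. parser.setInput(new StringReader(AmpersandFixer.escapeBareAmpersands(r))). Note that this only checks the shape of a reference; a query parameter that happens to look like one (say &copy;) would slip through unescaped.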
I would go with your first option: replacing the ampersands seems a better fit than the other. The second option seems more like a hack that gets things working by accepting incorrect markup.
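For what it's worth, the first option could look roughly like the sketch below: a Reader that wraps the original one and rewrites bare ampersands on the fly, so the whole page never has to be held in a String. AmpersandEscapingReader is a made-up name, and the 12-character lookahead limit is an assumption about how long an entity name can reasonably be. Treat it as a starting point rather than a finished implementation.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical wrapper: passes characters through, but turns any '&' that
// does not start something shaped like an entity reference into "&amp;".
class AmpersandEscapingReader extends Reader {
    private static final int MAX_LOOKAHEAD = 12; // assumed cap, see above
    private final PushbackReader in;
    private String pending = "";   // characters still owed to the caller
    private int pendingPos = 0;

    AmpersandEscapingReader(Reader source) {
        this.in = new PushbackReader(source, MAX_LOOKAHEAD);
    }

    @Override
    public int read() throws IOException {
        if (pendingPos < pending.length()) {
            return pending.charAt(pendingPos++);
        }
        int c = in.read();
        if (c == '&' && !startsEntity()) {
            // Bare ampersand: emit '&' now and queue the rest of "&amp;".
            pending = "amp;";
            pendingPos = 0;
        }
        return c;
    }

    // Peeks ahead for "name;" or "#digits;" and pushes everything back.
    private boolean startsEntity() throws IOException {
        char[] buf = new char[MAX_LOOKAHEAD];
        int n = 0;
        boolean isEntity = false;
        while (n < MAX_LOOKAHEAD) {
            int c = in.read();
            if (c == -1) break;
            buf[n++] = (char) c;
            if (c == ';') { isEntity = n > 1; break; }
            boolean ok = Character.isLetterOrDigit(c) || (c == '#' && n == 1);
            if (!ok) break;
        }
        in.unread(buf, 0, n);
        return isEntity;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + count++] = (char) c;
        }
        return (count == 0 && len > 0) ? -1 : count;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Since XmlPullParser has a setInput(Reader) overload, you would wrap your InputStreamReader in this class and hand the result to the parser. The bulk read here simply loops over the single-character read, which keeps the logic easy to follow but is not tuned for speed.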