
Can I ignore invalid XML character using Scala's builtin xml handlers?

Tags:

xml

scala

I have an XML file (from the federal government's data.gov) that I'm trying to read with Scala's XML handlers.

val loadnode = scala.xml.XML.loadFile(filename) 

Apparently, the file contains an invalid XML character. Is there an option to just ignore invalid characters, or is my only option to clean the file up first?

org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x12) was found in the element content of the document.

Ruby's Nokogiri was able to parse the file despite the invalid character.

tommy chheng asked Dec 01 '22 06:12



1 Answer

To expand on @huynhjl's answer: the InputStream filter is dangerous if you have multi-byte characters, for example in UTF-8 encoded text, because it operates on raw bytes rather than decoded characters. Instead, use a character-oriented filter: FilterReader. Or, if the file is small enough, load it into a String and replace the characters there.

scala> val origXml = "<?xml version='1.1'?><root>\u0012</root>"                                          
origXml: java.lang.String = <?xml version='1.1'?><root></root>

scala> val cleanXml = origXml flatMap { 
   case x if Character.isISOControl(x) => "&#x" + Integer.toHexString(x) + ";"
   case x => x.toString
}
cleanXml: String = <?xml version='1.1'?><root>&#x12;</root>

scala> scala.xml.XML.loadString(cleanXml) 
res14: scala.xml.Elem = <root></root>
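
For files too large to load into a String, the FilterReader approach mentioned above could look roughly like the sketch below. XmlCharFilterReader and its clean helper are hypothetical names, not part of scala.xml, and this version drops disallowed characters instead of escaping them. It assumes XML 1.0 character rules and, as a simplification, lets surrogate code units through unchecked.

```scala
import java.io.{FilterReader, Reader, StringReader}

// Hypothetical helper: a character-oriented filter that silently drops
// characters XML 1.0 does not allow in document content.
class XmlCharFilterReader(underlying: Reader) extends FilterReader(underlying) {

  // XML 1.0 allows tab, LF, CR and U+0020..U+FFFD, minus lone surrogates.
  // Simplification: surrogate code units are passed through unchecked so
  // that supplementary characters survive.
  private def allowed(c: Char): Boolean =
    c == '\t' || c == '\n' || c == '\r' || (c >= 0x20 && c <= 0xFFFD)

  // Single-character read: skip ahead until an allowed character or EOF.
  override def read(): Int = {
    var c = super.read()
    while (c != -1 && !allowed(c.toChar)) c = super.read()
    c
  }

  // Bulk read: fill the buffer from the wrapped reader, then compact it in
  // place so that only allowed characters remain.
  override def read(buf: Array[Char], off: Int, len: Int): Int = {
    var kept = 0
    var n = 0
    // Retry when a whole chunk was filtered out, since returning 0 for a
    // non-empty request would look like "no progress" to the caller.
    while (kept == 0 && n != -1 && len > 0) {
      n = super.read(buf, off, len)
      var i = 0
      while (i < n) {
        val c = buf(off + i)
        if (allowed(c)) { buf(off + kept) = c; kept += 1 }
        i += 1
      }
    }
    if (kept == 0 && n == -1) -1 else kept
  }
}

object XmlCharFilterReader {
  // Convenience for trying the filter on an in-memory string.
  def clean(text: String): String = {
    val reader = new XmlCharFilterReader(new StringReader(text))
    val out = new StringBuilder
    val buf = new Array[Char](1024)
    var n = reader.read(buf, 0, buf.length)
    while (n != -1) {
      out.append(new String(buf, 0, n))
      n = reader.read(buf, 0, buf.length)
    }
    out.toString
  }
}
```

To use it on a file, wrap the file in an InputStreamReader with the file's actual encoding (UTF-8 is only a guess here) and pass the filtered reader to scala.xml.XML.load, which accepts a java.io.Reader. Because the filter runs after decoding, multi-byte characters are never split.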
retronym answered Dec 05 '22 04:12
