I have an XML file (from the federal government's data.gov) that I'm trying to read with Scala's XML handlers.
val loadnode = scala.xml.XML.loadFile(filename)
Apparently, there is an invalid XML character. Is there an option to just ignore invalid characters, or is my only option to clean the file up first?
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x12) was found in the element content of the document.
Ruby's Nokogiri was able to parse the file despite the invalid character.
To expand on @huynhjl's answer: the InputStream filter is dangerous if you have multi-byte characters, for example in UTF-8 encoded text. Instead, use a character-oriented filter: FilterReader. Or, if the file is small enough, load it into a String and replace the characters there.
scala> val origXml = "<?xml version='1.1'?><root>\u0012</root>"
origXml: java.lang.String = <?xml version='1.1'?><root></root>
scala> val cleanXml = origXml flatMap {
  case x if Character.isISOControl(x) => "&#x" + Integer.toHexString(x) + ";"
  case x => x.toString
}
cleanXml: String = <?xml version='1.1'?><root>&#x12;</root>
scala> scala.xml.XML.loadString(cleanXml)
res14: scala.xml.Elem = <root></root>
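For files too large to hold in a String, the FilterReader route could look roughly like the sketch below. XmlCharFilterReader is a hypothetical name, not something scala.xml provides, and replacing each forbidden character with a space (rather than dropping or escaping it) is just one possible choice; it keeps the character count unchanged, which keeps the bulk read simple. The set of allowed characters follows the XML 1.0 Char production, and surrogates are passed through on the assumption that they are correctly paired.

```scala
import java.io.{FilterReader, Reader, StringReader}

// Hypothetical helper (not part of scala.xml): replaces characters that
// XML 1.0 forbids with a space, so the number of characters is unchanged.
class XmlCharFilterReader(in: Reader) extends FilterReader(in) {

  // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF and #xE000-#xFFFD.
  // Surrogates are let through, assuming they form valid pairs.
  private def allowed(c: Char): Boolean = {
    val i = c.toInt
    i == 0x9 || i == 0xA || i == 0xD ||
      (i >= 0x20 && i <= 0xD7FF) ||
      Character.isSurrogate(c) ||
      (i >= 0xE000 && i <= 0xFFFD)
  }

  // Single-character read: pass through EOF (-1) and allowed characters.
  override def read(): Int = {
    val c = super.read()
    if (c == -1 || allowed(c.toChar)) c else ' '.toInt
  }

  // Bulk read: patch forbidden characters in place after the real read.
  override def read(buf: Array[Char], off: Int, len: Int): Int = {
    val n = super.read(buf, off, len)
    var i = off
    while (i < off + n) { // skipped entirely when n == -1 (end of stream)
      if (!allowed(buf(i))) buf(i) = ' '
      i += 1
    }
    n
  }
}
```

With a file, this might be used as scala.xml.XML.load(new XmlCharFilterReader(new java.io.InputStreamReader(new java.io.FileInputStream(filename), "UTF-8"))), assuming the file really is UTF-8. Because the Reader decodes bytes into characters before the filter sees them, multi-byte sequences are never split.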