
parsing/scanning/tokenizing "raw XML"

Tags: java, parsing, xml

I have a Java application that needs to parse or tokenize XML while preserving the raw text exactly (e.g. don't expand entities, don't normalize whitespace in attributes, keep attribute order, etc.).

I've spent several hours today trying StAX, SAX, XSLT, TagSoup, etc. before realizing that none of them do this. I can't afford to spend much more time on this problem, and parsing the text manually seems highly nontrivial. Is there any Java library that can help me tokenize the XML?

Edit: why am I doing this? I have a large XML file to which I want to make a small number of localized changes programmatically, and those changes need to be reviewed. Being able to use a diff tool is highly valuable. If the parser/filter normalizes the XML, then all I see is "red ink" in the diff tool. I can't easily get the application that produces the XML changed so that it emits "canonical XML", if there is such a thing.

Jason S asked Sep 08 '09 22:09


1 Answer

I think you might have to generate your own grammar.

Some links:

  • Parsing XML with ANTLR Tutorial
  • ANTXR
  • XPA
  • http://www.google.com/search?q=antlr+xml
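
For a sense of what a hand-written grammar involves at the lexical level, here is a minimal sketch of a lossless tokenizer built on `java.util.regex`. It is not taken from any of the links above, and the class name `RawXmlTokenizer` and its `tokenize` method are hypothetical. It assumes the whole document fits in memory as a `String`, and it only splits the input into raw chunks (comments, CDATA sections, processing instructions, tags, character data) without checking well-formedness. A DOCTYPE with an internal subset would be split at the wrong place, although the concatenated tokens still reproduce the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits XML into raw tokens without altering any character.
final class RawXmlTokenizer {

    // Alternatives are tried left to right, so the specific forms
    // (comment, CDATA, PI) must come before the generic tag pattern.
    private static final Pattern TOKEN = Pattern.compile(
          "<!--.*?-->"                            // comment
        + "|<!\\[CDATA\\[.*?\\]\\]>"              // CDATA section
        + "|<\\?.*?\\?>"                          // processing instruction / XML declaration
        + "|<(?:[^>\"']|\"[^\"]*\"|'[^']*')*>"    // tag; quoted values may contain '>'
        + "|[^<]+",                               // character data, entities left untouched
        Pattern.DOTALL);

    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        int pos = 0;
        while (pos < xml.length()) {
            m.region(pos, xml.length());
            // lookingAt() anchors the match at pos, so no input is ever skipped.
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unrecognized markup at offset " + pos);
            }
            tokens.add(m.group());
            pos = m.end();
        }
        return tokens;
    }
}
```

Because every token is an exact substring of the input, `String.join("", RawXmlTokenizer.tokenize(xml))` gives back the original text. A localized change can then be made by replacing individual tokens and writing the joined result, which keeps the diff limited to the edited spots. Very long tags may hit regex recursion limits, so a hand-written scanner or a generated lexer is likely the more robust route for production use.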

ykaganovich answered Sep 30 '22 13:09