I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately to the author of the XML document (via a web interface) where the document is invalid.
Ultimately I want to set the caret in a to be at the invalid tag or just inside the open quote of the invalid attribute. (I’m not using XML Schema at this point because the exact format of the attributes matters in a way that cannot be validated by schema alone. I may even want report some attributes as being invalid part-way through the attribute’s value. Or similarly, part-way through the text between a start and end tag.)
I’ve tried using SAX (org.xml.sax) and the Locator interface. This works up to a point but isn’t nearly good enough. It will only report the read position after an event; for example, the character immediately after an open tag ends, for startElement(). I can’t just subtract back the length of the tag name because attributes, self-closing tags and/or newlines within the open tag will throw this out. (And Locator provides no information about the position of attributes at all.)
Ideally I was looking to use an event-based approach, as I already have a SAX handler that is building an in-house DOM-like representation or further processing. However, I would be interested in knowing about any DOM or DOM-like library that includes exact position information for the model’s elements.
Has any one solved this issue, or any like it, with the required level of precision?
XML parsers will (and should) smooth over certain things like additional whitespace, so exact mapping back to the character stream is not feasible.
You should rather look into getting a lexer or 'token stream generator' for increased detail, in other words go to the detail level below XML parsers.
There is a few general frameworks for writing lexers in java. This ANTLR 3-based page has a nice overview of lexer vs parser and section one some rudimentory XML Lexer examples.
I'd also like to comment that for a user with a web interface, maybe you should consider a pure client-side (i.e. javascript) solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With