
Java XML Parsing and original byte offsets

I'd like to parse some well-formed XML into a DOM, but I'd also like to know the offset of each node's tag in the original media.

For example, if I had an XML document with the content something like:

<html>
<body>
<div>text</div>
</body>
</html>

I'd like to know that the <div> node starts at offset 13 in the original media, and (more importantly) that "text" starts at offset 18.

Is this possible with standard Java XML parsers? JAXB? If no solution is easily available, what type of changes are necessary along the parsing path to make this possible?

Bill Dwyer asked Aug 17 '10 22:08


2 Answers

The SAX API provides a rather obscure mechanism for this: the org.xml.sax.Locator interface. When you use the SAX API, you subclass DefaultHandler and pass it to the SAX parse methods, and the SAX parser implementation is supposed to inject a Locator into your handler via setDocumentLocator(). As parsing proceeds and the callback methods on your ContentHandler are invoked (e.g. startElement()), you can consult the Locator to find the current parsing position via getLineNumber() and getColumnNumber(). Note that this gives you a line and column rather than a byte offset, and that the reported position is where the current event ends, not where it begins.

Technically, this is optional functionality, but the javadoc says that implementations are "strongly encouraged" to provide it, so you can likely assume the SAX parser built into JavaSE will do it.

Of course, this does mean using the SAX API, which is no one's idea of fun, but I can't see a way of accessing this information using a higher-level API.

edit: Found this example.
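
Separately from the linked example, here is a minimal sketch of the Locator mechanism described above. The class name LocatorSketch, the PositionHandler class and the locate() helper are hypothetical names chosen for illustration; only DefaultHandler, Locator and the JAXP SAXParser calls come from the standard API. It assumes the parser supplies a Locator, which is optional, so the handler tolerates a missing one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorSketch {

    /** Records "name@line:column" for every start tag (hypothetical helper class). */
    static class PositionHandler extends DefaultHandler {
        private Locator locator; // injected by the parser; stays null if unsupported
        final List<String> positions = new ArrayList<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (locator == null) {
                positions.add(qName + "@?"); // parser did not provide location data
                return;
            }
            // Line and column are 1-based and describe where the start tag ends.
            positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
        }
    }

    /** Parses the given XML text and returns the recorded positions. */
    public static List<String> locate(String xml) throws Exception {
        PositionHandler handler = new PositionHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html>\n<body>\n<div>text</div>\n</body>\n</html>";
        locate(xml).forEach(System.out::println);
    }
}
```

Turning a line and column pair into an offset would still require a separate pass over the input to record where each line begins, and for byte offsets the encoding has to be taken into account as well.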

skaffman answered Nov 04 '22 08:11


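Use an XMLStreamReader (StAX) and call its getLocation() method to obtain a Location object; location.getCharacterOffset() then gives the offset of the current position. According to the Location Javadoc, this is a byte offset when the input is a file or byte stream and a character offset when the input is a Reader, so the example below reads from an InputStream. Implementations may differ in what they actually report (the method may also return -1 if no offset is available), so check the numbers against your parser.

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class Runner {

    public static void main(String[] args) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Read from a byte stream (not a Reader) so the offset is meant to be in bytes.
        try (InputStream in = new FileInputStream("D:\\BigFile.xml")) {
            XMLStreamReader streamReader = factory.createXMLStreamReader(in);

            while (streamReader.hasNext()) {
                // next() advances the reader and returns the type of the new event.
                if (streamReader.next() == XMLStreamReader.START_ELEMENT) {
                    Location location = streamReader.getLocation();
                    System.out.println(streamReader.getLocalName()
                            + " offset: " + location.getCharacterOffset());
                }
            }
            streamReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}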
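
To check what a particular StAX implementation actually reports, it can help to run it on a small in-memory document such as the one from the question. The following is only a sketch under that assumption: the class name OffsetSketch and the helper startTagOffsets() are hypothetical, and the numbers printed depend on the parser in use (for instance, whether the offset points at the start or the end of the tag), so no expected output is given.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class OffsetSketch {

    /** Returns "name=offset" for every start tag (hypothetical helper, not part of StAX). */
    public static List<String> startTagOffsets(String xml) throws Exception {
        // Feed the parser bytes rather than characters, as in the file-based example.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(bytes));
        List<String> result = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // The Location is only valid until the next call to next().
                    int offset = reader.getLocation().getCharacterOffset();
                    result.add(reader.getLocalName() + "=" + offset);
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html>\n<body>\n<div>text</div>\n</body>\n</html>";
        startTagOffsets(xml).forEach(System.out::println);
    }
}
```

Comparing the printed offsets with the known positions in the string shows how the parser at hand interprets "current location" before relying on it for a large file.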
Lucasvw answered Nov 04 '22 07:11