Is there a good XML parser for light-scanning an XML file to get the byte offsets of elements?

Question

We have a system where we're processing XML files where the file itself is too large to fit in memory.

As part of processing, we want to quickly scan through to record the offset of relevant elements, so that later on, we can seek immediately to those elements and parse just the piece we want (since the smaller slice of the file would fit in memory, we can afford to use a DOM or whatever for that part.)
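To make the second phase concrete, here is a minimal sketch of the "seek and parse just the slice" step, assuming the start and end byte offsets of an element are already known. The class name `SliceParser`, the method names, and the sample offsets are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class SliceParser {
    /** Reads bytes [start, end) of the file and parses them as a standalone XML document. */
    static Document parseSlice(File file, long start, long end) throws Exception {
        byte[] slice = new byte[(int) (end - start)]; // assumes the slice fits in memory
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);      // jump straight to the recorded offset
            raf.readFully(slice); // read only the element's bytes
        }
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(slice));
    }

    /** Writes a small sample file, then parses only its <leaf> element. */
    static String demo() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                     "<root>\n" +
                     " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                     "</root>\n";
        try {
            Path tmp = Files.createTempFile("slice", ".xml");
            try {
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
                // <leaf> starts at byte 47; </leaf> ends just before byte 84.
                Document doc = parseSlice(tmp.toFile(), 47, 84);
                return doc.getDocumentElement().getTextContent();
            } finally {
                Files.delete(tmp);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A slice parsed this way has no XML declaration, so the parser falls back to detecting UTF-8 on its own, and any namespace or entity declarations made on ancestor elements would be missing. This only works when the recorded offsets are true byte offsets.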

Obviously we could just write our own XML parser from scratch, but before making yet another XML parser, I wanted to see if there were any other options available.

What follows is a list of the things we already know about.

Using StAX should work, but doesn't. Here's a demonstration of that. I made an XML example where there are characters longer than one byte to demonstrate that the returned byte offset is not correct once you start passing these characters. Note that even though the method in the API is called getCharacterOffset(), the documentation says that it returns the byte offset if you passed in a byte stream - which is what this code is doing.

@Test
public void testByteOffsetsFromStreamParser() throws Exception {
    // byte counts are size required for UTF-8, I checked using Ishida's tool.
    String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                 "<root>\n" +
                 " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                 " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                 " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n" +
                 "</root>\n";
    byte[] xmlBytes = xml.getBytes("UTF-8");
    assertThat(xmlBytes.length, is(equalTo(171)));  // = 171 from above

    String implToTest = "com.sun.xml.internal.stream.XMLInputFactoryImpl";
    //String implToTest = "com.ctc.wstx.stax.WstxInputFactory";
    XMLInputFactory factory =
        Class.forName(implToTest).asSubclass(XMLInputFactory.class).newInstance();
    factory.setProperty("javax.xml.stream.isCoalescing", false);
    factory.setProperty("javax.xml.stream.supportDTD", false);
    XMLEventReader reader = factory.createXMLEventReader(
        new ByteArrayInputStream(xmlBytes));
    try {
        XMLEvent event;

        event = reader.nextTag(); // <root>
        checkByteOffset(event, 39);

        event = reader.nextTag(); // <leaf>
        checkByteOffset(event, 47);

        event = reader.nextEvent(); // (text)
        checkByteOffset(event, 53);

        event = reader.nextTag(); // </leaf>
        checkByteOffset(event, 77);

        event = reader.nextTag(); // <leaf>
        checkByteOffset(event, 86);

        event = reader.nextEvent(); // (text)
        checkByteOffset(event, 92);

        event = reader.nextTag(); // </leaf>
        checkByteOffset(event, 116);

        event = reader.nextTag(); // <leaf>
        checkByteOffset(event, 125);

        event = reader.nextEvent(); // (text)
        checkByteOffset(event, 131);

        event = reader.nextTag(); // </leaf>
        checkByteOffset(event, 155);

        event = reader.nextTag(); // </root>
        checkByteOffset(event, 163);
    } finally {
        reader.close(); // no auto-close :(
    }
}

private void checkByteOffset(XMLEvent event, int expectedOffset) {
    System.out.println("Expected Offset: " + expectedOffset +
        "    - Actual Offset: " + event.getLocation().getCharacterOffset());
}

Results for the factory which you get by default in Java 7:

Expected Offset: 39    - Actual Offset: 45
Expected Offset: 47    - Actual Offset: 53
Expected Offset: 53    - Actual Offset: 63
Expected Offset: 77    - Actual Offset: 68
Expected Offset: 86    - Actual Offset: 76
Expected Offset: 92    - Actual Offset: 86
Expected Offset: 116    - Actual Offset: 91
Expected Offset: 125    - Actual Offset: 99
Expected Offset: 131    - Actual Offset: 109
Expected Offset: 155    - Actual Offset: 114
Expected Offset: 163    - Actual Offset: 122

Results for Woodstox, which we tried based on a suggestion in another Stack Overflow post. Note that although it starts out correct, after a few lines it's even more incorrect than the default parser:

Expected Offset: 39    - Actual Offset: 39
Expected Offset: 47    - Actual Offset: 47
Expected Offset: 53    - Actual Offset: 53
Expected Offset: 77    - Actual Offset: 61
Expected Offset: 86    - Actual Offset: 70
Expected Offset: 92    - Actual Offset: 76
Expected Offset: 116    - Actual Offset: 84
Expected Offset: 125    - Actual Offset: 93
Expected Offset: 131    - Actual Offset: 99
Expected Offset: 155    - Actual Offset: 107
Expected Offset: 163    - Actual Offset: 115
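One observation about the Woodstox numbers: the gap between expected and actual grows by 16 after each text node (77 vs 61, 116 vs 84, 155 vs 107). That is consistent with the parser counting characters rather than bytes, although this is inferred from the output above and not from the Woodstox documentation. A small sketch of the arithmetic, where the class name `OffsetDrift` is hypothetical:

```java
import java.nio.charset.StandardCharsets;

class OffsetDrift {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The text content of each <leaf> in the test document.
        String text = "\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044";
        int chars = text.length();    // UTF-16 code units
        int bytes = utf8Length(text); // UTF-8 bytes
        System.out.println("chars=" + chars + " bytes=" + bytes
                + " drift per text node=" + (bytes - chars));
    }
}
```

Each of these characters takes three bytes in UTF-8 but counts as one Java `char`, so every text node adds 24 bytes but only 8 characters.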

We're aware of a library called VTD-XML which does almost exactly what we're after, but it has two problems. The first problem is that it reads the whole file into memory and the file itself won't fit. The second problem is that the licence is GPL and not compatible with the rest of our stuff.

jschnasse · Accepted Answer

Some time ago I created this approach for fun. Maybe it will help you. It basically does the following.

1) Create a self-generated XML parser with ANTLR.
2) Hook into the parsing routine to emit byte offsets.
3) Use random access to stream from each byte offset into a prepared POJO using Jackson.

For a complete example, see "Using StAX to create index for XML for quick access".

keshlam · Answer

Possible approach:

1) Open the file as a byte stream.

2) Wrap an input stream/reader around that which (a) converts from UTF-8 to UTF-16, but (b) in the process, tracks which Java characters are in the basic ASCII range and which are 2-byte UTF-16. (I can think of several ways to keep the memory requirements of that tracking down to something reasonable.)

3) When you need a file offset, use that tracking table to back-convert from Java UTF-16 character count to byte count.

Can't think of any reason why it wouldn't work...
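One way the tracking table could look, as a rough sketch rather than a tested design: feed every raw byte to a small table that records a new entry only when the number of UTF-8 bytes per UTF-16 code unit changes, then look up a character offset by binary search. The class `Utf8RunTable` and its methods are hypothetical names; it assumes well-formed UTF-8 and that the parser reports offsets in Java `char` units.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8RunTable {
    // Each entry: {first char offset of run, first byte offset of run, bytes per char}.
    private final List<long[]> runs = new ArrayList<>();
    private long chars; // UTF-16 code units seen so far
    private long bytes; // UTF-8 bytes accounted for so far

    /** Feed one raw byte (0..255) of the file, in order. */
    void accept(int b) {
        if ((b & 0xC0) == 0x80) return; // continuation byte, counted with its lead byte
        int len, units;
        if ((b & 0x80) == 0)         { len = 1; units = 1; }
        else if ((b & 0xE0) == 0xC0) { len = 2; units = 1; }
        else if ((b & 0xF0) == 0xE0) { len = 3; units = 1; }
        else if ((b & 0xF8) == 0xF0) { len = 4; units = 2; } // surrogate pair in UTF-16
        else throw new IllegalArgumentException("invalid UTF-8 lead byte: " + b);
        long width = len / units;
        if (runs.isEmpty() || runs.get(runs.size() - 1)[2] != width) {
            runs.add(new long[] {chars, bytes, width}); // width changed: start a new run
        }
        chars += units;
        bytes += len;
    }

    /** Converts a UTF-16 char offset back to the byte offset in the file. */
    long byteOffset(long charOffset) {
        if (runs.isEmpty()) throw new IllegalStateException("no bytes seen yet");
        int lo = 0, hi = runs.size() - 1;
        while (lo < hi) { // find the last run starting at or before charOffset
            int mid = (lo + hi + 1) >>> 1;
            if (runs.get(mid)[0] <= charOffset) lo = mid; else hi = mid - 1;
        }
        long[] run = runs.get(lo);
        return run[1] + (charOffset - run[0]) * run[2];
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\n"
                + " <leaf>\u305A\u308C\u306A\u3044\u3067\u307B\u3057\u3044</leaf>\n</root>\n";
        Utf8RunTable table = new Utf8RunTable();
        for (byte b : xml.getBytes(StandardCharsets.UTF_8)) table.accept(b & 0xFF);
        System.out.println(table.byteOffset(61)); // char offset of </leaf> -> 77
    }
}
```

The table stays small for text that is mostly ASCII with occasional long non-ASCII runs, but it grows with every switch between widths, so heavily mixed text would need a coarser scheme. In real use, `accept` would be called from a wrapper around the input stream rather than from a loop over an in-memory array.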

Tags: java, parsing, xml, stax

Hakanai