 

Can JAXB parse large XML files in chunks?


I need to parse potentially large XML files whose schema is already provided to me in several XSD files, so XML binding is highly favored. I'd like to know whether I can use JAXB to parse the file in chunks and, if so, how.

John F. asked Jul 15 '09 21:07




2 Answers

Because code matters, here is a PartialUnmarshaller that reads a big file in chunks. It can be used like this: new PartialUnmarshaller<YourClass>(stream, YourClass.class).

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;
import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import static javax.xml.stream.XMLStreamConstants.*;

public class PartialUnmarshaller<T> {
    XMLStreamReader reader;
    Class<T> clazz;
    Unmarshaller unmarshaller;

    public PartialUnmarshaller(InputStream stream, Class<T> clazz)
            throws XMLStreamException, FactoryConfigurationError, JAXBException {
        this.clazz = clazz;
        this.unmarshaller = JAXBContext.newInstance(clazz).createUnmarshaller();
        this.reader = XMLInputFactory.newInstance().createXMLStreamReader(stream);

        /* ignore headers */
        skipElements(START_DOCUMENT, DTD);
        /* ignore root element */
        reader.nextTag();
        /* if there's no tag, ignore root element's end */
        skipElements(END_ELEMENT);
    }

    public T next() throws XMLStreamException, JAXBException {
        if (!hasNext())
            throw new NoSuchElementException();

        T value = unmarshaller.unmarshal(reader, clazz).getValue();

        skipElements(CHARACTERS, END_ELEMENT);
        return value;
    }

    public boolean hasNext() throws XMLStreamException {
        return reader.hasNext();
    }

    public void close() throws XMLStreamException {
        reader.close();
    }

    void skipElements(int... elements) throws XMLStreamException {
        int eventType = reader.getEventType();

        /* box the int event codes so List.contains works */
        List<Integer> types = IntStream.of(elements).boxed().collect(Collectors.toList());
        while (types.contains(eventType))
            eventType = reader.next();
    }
}
yves amsellem answered Sep 20 '22 20:09


This is detailed in the user guide. The JAXB download from http://jaxb.java.net/ includes an example of how to parse one chunk at a time.

When a document is large, it's usually because there are repetitive parts in it. Perhaps it's a purchase order with a long list of line items, or perhaps it's an XML log file with a large number of log entries.

This kind of XML is suitable for chunk processing; the main idea is to use the StAX API, run a loop, and unmarshal individual chunks separately. Your program acts on a single chunk and then throws it away. This way, you keep at most one chunk in memory, which allows you to process large documents.
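A minimal, self-contained sketch of that loop, assuming a javax.xml.bind (JAXB) implementation on the classpath; the LineItem class, the sku attribute, and the sample XML are invented here for illustration:

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedParseDemo {

    /* Hypothetical repeated element: <item sku="..."/> */
    @XmlRootElement(name = "item")
    public static class LineItem {
        @XmlAttribute
        public String sku;
    }

    /** Unmarshal one <item> at a time and collect the sku attributes. */
    public static List<String> parseSkus(String xml) throws Exception {
        List<String> skus = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Unmarshaller u = JAXBContext.newInstance(LineItem.class).createUnmarshaller();

        reader.nextTag();  // move to the root element
        reader.nextTag();  // move to the first chunk
        while (reader.getEventType() == XMLStreamConstants.START_ELEMENT) {
            // unmarshal exactly one chunk; the rest of the document stays unread
            LineItem item = u.unmarshal(reader, LineItem.class).getValue();
            skus.add(item.sku);  // act on the chunk, then let it be garbage-collected

            // unmarshal leaves the cursor just past the chunk;
            // skip any whitespace until the next chunk or the root's end tag
            while (reader.hasNext()
                    && reader.getEventType() != XMLStreamConstants.START_ELEMENT
                    && reader.getEventType() != XMLStreamConstants.END_ELEMENT) {
                reader.next();
            }
        }
        reader.close();
        return skus;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseSkus("<order><item sku='a'/><item sku='b'/></order>"));
    }
}
```

On a real file you would pass an InputStream instead of a StringReader; the memory profile stays flat because only the current LineItem is materialized.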

See the streaming-unmarshalling example and the partial-unmarshalling example in the JAXB RI distribution for more about how to do this. The streaming-unmarshalling example has the advantage that it can handle chunks at arbitrary nesting levels, yet it requires you to deal with the push model: the JAXB unmarshaller will "push" new chunks to you, and you'll need to process them right there.

In contrast, the partial-unmarshalling example works in a pull model (which usually makes the processing easier), but it has some limitations when databinding the portions of the document other than the repeated part.
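One way the push model can be wired up, loosely following the idea of the RI's streaming-unmarshalling sample: a SAX filter redirects the events of each chunk element into a fresh UnmarshallerHandler and hands each finished object to a callback. The StreamingSplitter name, the Item class, and the sku attribute are invented here; only the javax.xml.bind and SAX APIs are real.

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.UnmarshallerHandler;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StreamingSplitter<T> extends XMLFilterImpl {
    private final JAXBContext ctx;
    private final String chunkElement;
    private final Consumer<T> callback;
    private UnmarshallerHandler handler;  // non-null while inside a chunk
    private int depth;

    public StreamingSplitter(JAXBContext ctx, String chunkElement, Consumer<T> callback) {
        this.ctx = ctx;
        this.chunkElement = chunkElement;
        this.callback = callback;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (handler == null && local.equals(chunkElement)) {
            try {
                handler = ctx.createUnmarshaller().getUnmarshallerHandler();
            } catch (JAXBException e) {
                throw new SAXException(e);
            }
            handler.startDocument();  // each chunk is fed to JAXB as a tiny document
            depth = 0;
        }
        if (handler != null) {
            depth++;
            handler.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (handler == null)
            return;
        handler.endElement(uri, local, qName);
        if (--depth == 0) {
            handler.endDocument();
            try {
                @SuppressWarnings("unchecked")
                T chunk = (T) handler.getResult();
                callback.accept(chunk);  // push the finished chunk to the caller
            } catch (JAXBException e) {
                throw new SAXException(e);
            }
            handler = null;  // ready for the next chunk
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (handler != null)
            handler.characters(ch, start, len);
    }

    /* Hypothetical repeated element used by the demo below: <item sku="..."/> */
    @XmlRootElement(name = "item")
    public static class Item {
        @XmlAttribute
        public String sku;
    }

    /** Demo: collect the sku attribute of every <item> element. */
    public static List<String> collectSkus(String xml) throws Exception {
        List<String> skus = new ArrayList<>();
        StreamingSplitter<Item> splitter = new StreamingSplitter<>(
                JAXBContext.newInstance(Item.class), "item", item -> skus.add(item.sku));
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(true);
        splitter.setParent(f.newSAXParser().getXMLReader());
        splitter.parse(new InputSource(new StringReader(xml)));
        return skus;
    }
}
```

Because the filter matches the chunk element by name at any depth, this sketch also handles the "arbitrary nesting level" case mentioned above; the price is that your logic lives inside the callback rather than in an ordinary loop.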

skaffman answered Sep 18 '22 20:09