
How to parse large, complex XML

I need to parse a large, complex XML file and write it to a flat file. Can you give me some advice?

File size: 500 MB
Record count: 100K
XML structure:

<Msg>

    <MsgHeader>
        <!--Some of the fields in the MsgHeader need to be mapped to a Java object-->
    </MsgHeader>

    <GroupA> 
        <GroupAHeader/>
        <!--Some of the fields in the GroupAHeader need to be mapped to a Java object-->
        <GroupAMsg/>
        <!--50K records--> 
        <GroupAMsg/> 
        <GroupAMsg/> 
        <GroupAMsg/> 
    </GroupA>

    <GroupB> 
        <GroupBHeader/> 
        <GroupBMsg/>
        <!--50K records--> 
        <GroupBMsg/> 
        <GroupBMsg/> 
        <GroupBMsg/> 
    </GroupB>

</Msg>
Weber asked Nov 12 '22 14:11



1 Answer

Within Spring Batch, I've written my own StAX event item reader that is a bit more specialized than what was previously mentioned. Basically, as it scans forward to the next fragment, it stuffs the text of selected elements into a map and then passes that map, together with the fragment, to the ItemProcessor. From there, you're free to transform the "GatheredElementItem" into a single object (see CompositeItemProcessor). Apologies for the bit of copy/paste from StaxEventItemReader, but I don't think it's avoidable.

From here, you're free to use whatever OXM marshaller you'd like; I happen to use JAXB as well.

public class ElementGatheringStaxEventItemReader<T> extends StaxEventItemReader<T> {
    private Map<String, String> gatheredElements;
    private Set<String> elementsToGather;
    ...
    @Override
    protected boolean moveCursorToNextFragment(XMLEventReader reader) throws NonTransientResourceException {
        try { 
            while (true) {
                while (reader.peek() != null && !reader.peek().isStartElement()) {
                    reader.nextEvent();
                }
                if (reader.peek() == null) {
                    return false;
                }
                QName startElementName = ((StartElement) reader.peek()).getName();
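                if (elementsToGather.contains(startElementName.getLocalPart())) {
                    reader.nextEvent(); // move past the actual start element
                    XMLEvent dataEvent = reader.nextEvent();
                    // An empty or nested element yields no character event here; skip it rather than fail in asCharacters()
                    if (dataEvent.isCharacters()) {
                        gatheredElements.put(startElementName.getLocalPart(), dataEvent.asCharacters().getData());
                    }
                    continue;
                }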
                if (startElementName.getLocalPart().equals(fragmentRootElementName)) {
                    if (fragmentRootElementNameSpace == null || startElementName.getNamespaceURI().equals(fragmentRootElementNameSpace)) {
                        return true;
                    }
                }
                reader.nextEvent();

            }
        } catch (XMLStreamException e) {
            throw new NonTransientResourceException("Error while reading from event reader", e);
        }
    }

    @SuppressWarnings("unchecked")
    @Override
    protected T doRead() throws Exception {
        T item = super.doRead();
        if(null == item)
            return null;
        T result = (T) new GatheredElementItem<T>(item, new HashedMap(gatheredElements));
        if(log.isDebugEnabled())
            log.debug("Read GatheredElementItem: " + result);
        return result; 
    }
}

The gathered element class is pretty basic:

public class GatheredElementItem<T> {
    private final T item;
    private final Map<String, String> gatheredElements;
    ...
}
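
To show the gathering idea outside Spring Batch, here is a minimal sketch using only the JDK's StAX API. It is not the reader above: the class name GatheringStaxDemo, the method toFlatLines, and the element names BatchId and Region are all hypothetical, and it assumes that both the gathered elements and the record elements contain plain text only. It streams the document once, remembers the latest text of each gathered element, and emits one pipe-delimited line per record.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class; not part of Spring Batch or of the answer's reader.
final class GatheringStaxDemo {

    /**
     * Streams the XML once. The text of every element named in elementsToGather
     * is remembered (latest value wins); every element named recordName produces
     * one line: the gathered values in first-seen order, then the record text.
     * Assumption: gathered and record elements are text-only.
     */
    static List<String> toFlatLines(Reader xml, String recordName, Set<String> elementsToGather) {
        List<String> lines = new ArrayList<>();
        Map<String, String> gathered = new LinkedHashMap<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTD processing
            XMLEventReader reader = factory.createXMLEventReader(xml);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // skip whitespace, comments, end tags
                }
                String name = event.asStartElement().getName().getLocalPart();
                if (elementsToGather.contains(name)) {
                    // getElementText() reads up to the matching end tag
                    gathered.put(name, reader.getElementText());
                } else if (name.equals(recordName)) {
                    List<String> fields = new ArrayList<>(gathered.values());
                    fields.add(reader.getElementText());
                    lines.add(String.join("|", fields));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Error while reading from event reader", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<Msg><MsgHeader><BatchId>B1</BatchId></MsgHeader>"
                + "<GroupA><GroupAHeader><Region>EU</Region></GroupAHeader>"
                + "<GroupAMsg>a1</GroupAMsg><GroupAMsg>a2</GroupAMsg></GroupA></Msg>";
        for (String line : toFlatLines(new StringReader(xml), "GroupAMsg", Set.of("BatchId", "Region"))) {
            System.out.println(line);
        }
    }
}
```

Because only the gathered values and the current record are held in memory, the same loop should cope with a 500 MB input as long as each line is written out as it is produced instead of being collected in a list. For records with nested fields, you would unmarshal each fragment (as the Spring Batch reader does) rather than call getElementText().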
Jason Griebeler answered Nov 15 '22 05:11
