I am receiving an XML file as input, whose size can vary from a few KB to much larger. The file arrives over a network. I only need to extract a small number of nodes, so most of the document is useless to me. I have no memory constraints; I just need speed.
Considering all this, I concluded:
DOM is out (the document may be huge, there is no CRUD requirement, and the source is a network stream).
SAX is out, as I only need a small subset of the data.
StAX could be the way to go, but I am not sure it is the fastest option.
JAXB came up as another option, but what sort of parser does it use? I read that it uses Xerces by default (which type is that, push or pull?), although I can configure it to use StAX or Woodstox, as per this link.
I am reading a lot, but I am still confused by so many options! Any help would be appreciated.
Thanks!
Edit: I want to add one more question here: what is wrong with using JAXB here?
The fastest solution is most likely a StAX parser, especially since you only need a specific subset of the XML file. With StAX you pull events on demand and can easily skip whatever isn't necessary, whereas a SAX parser pushes every event to your handler whether you want it or not.
But it's also a little more complicated than using SAX or DOM. Recently I had to write a StAX parser for the following XML:
<?xml version="1.0"?>
<table>
<row>
<column>1</column>
<column>Nome</column>
<column>Sobrenome</column>
<column>[email protected]</column>
<column></column>
<column>2011-06-22 03:02:14.915</column>
<column>2011-06-22 03:02:25.953</column>
<column></column>
<column></column>
</row>
</table>
Here's what the final parser code looks like:
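// Assumes Apache Commons IO and Commons Text on the classpath (the equivalent
// Commons Lang classes work too). Inscrito is the author's own class (not shown)
// and must implement Comparable<Inscrito> for the sort below.
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.io.FileUtils;
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.WordUtils;

public class Parser {

    private final String[] files;

    public Parser(String... files) {
        this.files = files;
    }

    private List<Inscrito> process() {
        List<Inscrito> inscritos = new ArrayList<Inscrito>();
        // The factory is reusable, so create it once for all files
        XMLInputFactory factory = XMLInputFactory.newFactory();
        for (String file : files) {
            try {
                // Read the whole file and unescape XML entities before parsing.
                // An explicit charset avoids depending on the platform default.
                String content = StringEscapeUtils.unescapeXml(
                        FileUtils.readFileToString(new File(file), StandardCharsets.UTF_8));
                XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(content));
                String currentTag = null;  // name of the last tag opened or closed
                int columnCount = 0;       // index of the current <column> within its <row>
                Inscrito inscrito = null;
                while (parser.hasNext()) {
                    int currentEvent = parser.next();
                    switch (currentEvent) {
                        case XMLStreamReader.START_ELEMENT:
                            currentTag = parser.getLocalName();
                            if ("row".equals(currentTag)) {
                                // A new row starts a new record
                                columnCount = 0;
                                inscrito = new Inscrito();
                            }
                            break;
                        case XMLStreamReader.END_ELEMENT:
                            currentTag = parser.getLocalName();
                            if ("row".equals(currentTag)) {
                                inscritos.add(inscrito);
                            }
                            if ("column".equals(currentTag)) {
                                columnCount++;
                            }
                            break;
                        case XMLStreamReader.CHARACTERS:
                            // currentTag is still "column" right after </column>, so the
                            // whitespace between columns also lands here as an empty
                            // string; the real text of the next column overwrites it.
                            if ("column".equals(currentTag)) {
                                String text = parser.getText().trim().replaceAll("\n", " ");
                                switch (columnCount) {
                                    case 0:
                                        inscrito.setId(Integer.valueOf(text));
                                        break;
                                    case 1:
                                        inscrito.setFirstName(WordUtils.capitalizeFully(text));
                                        break;
                                    case 2:
                                        inscrito.setLastName(WordUtils.capitalizeFully(text));
                                        break;
                                    case 3:
                                        inscrito.setEmail(text);
                                        break;
                                    // Remaining columns are ignored
                                }
                            }
                            break;
                    }
                }
                parser.close();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        Collections.sort(inscritos);
        return inscritos;
    }

    // Groups the parsed records by initial, keeping the sorted order
    public Map<String, List<Inscrito>> parse() {
        List<Inscrito> inscritos = this.process();
        Map<String, List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();
        for (Inscrito i : inscritos) {
            List<Inscrito> lista = resultado.get(i.getInicial());
            if (lista == null) {
                lista = new ArrayList<Inscrito>();
                resultado.put(i.getInicial(), lista);
            }
            lista.add(i);
        }
        return resultado;
    }
}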
The identifiers are in Portuguese, but the code should be straightforward to follow. Here's the repo on GitHub.
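For the case in the question (a few nodes pulled from a network stream), a more compact variant is possible. The following is only a minimal sketch using the JDK's built-in StAX API: the class name `StaxSubsetExtractor`, the method `extract`, and the sample XML are made up for illustration, and it assumes the wanted elements contain text only (no child elements), since `getElementText()` requires that.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the original answer
public class StaxSubsetExtractor {

    /** Returns the text of every element named wantedName and skips everything else. */
    public static List<String> extract(InputStream in, String wantedName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // The reader parses straight from the stream; nothing is buffered into a tree
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> values = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && wantedName.equals(reader.getLocalName())) {
                    // Reads the element's text and advances to its end tag
                    values.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Sample document invented for the example
        String xml = "<orders><order><id>7</id><note>skip me</note></order>"
                   + "<order><id>9</id><note>skip me too</note></order></orders>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in, "id")); // prints [7, 9]
    }
}
```

In a real setup the `InputStream` would come from the network connection instead of a byte array, so parsing can start before the whole document has arrived.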
If you're only extracting a small number of nodes, consider XPath, which is somewhat simpler than walking the whole document yourself.
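As a rough illustration of the XPath route, here is a minimal sketch using the JDK's `javax.xml.xpath` API. The class name `XPathSubsetExtractor`, the method `extract`, and the expression are assumptions made for the example; the sample XML reuses the table layout shown earlier. Note that the standard implementation evaluates the expression against an in-memory tree, so for very large documents a streaming parser is likely to be lighter.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer
public class XPathSubsetExtractor {

    /** Returns the text content of every node matched by the XPath expression. */
    public static List<String> extract(InputStream in, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The document is parsed from the InputSource and the expression is run against it
        NodeList nodes = (NodeList) xpath.evaluate(expression, new InputSource(in), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        String xml = "<table><row><column>1</column><column>Nome</column></row></table>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Selects the second <column> of every <row>
        System.out.println(extract(in, "/table/row/column[2]")); // prints [Nome]
    }
}
```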