 

Memory-efficient XSLT Processor

Tags: xslt

I need a tool to execute XSLTs against very large XML files. To be clear, I don't need anything to design, edit, or debug the XSLTs, just execute them. The transforms that I am using are already well optimized, but the large files are causing the tool I have tried (Saxon v9.1) to run out of memory.

asked Oct 23 '08 by Justin R.


5 Answers

I found a good solution: Apache's Xalan C++. It provides a pluggable memory manager, allowing me to tune allocation based on the input and transform.

In multiple cases it is consuming ~60% less memory (I'm looking at private bytes) than the others I have tried.

answered by Justin R.


You may want to look into STX for streaming-based XSLT-like transformations. Alternatively, I believe StAX can integrate with XSLT nicely through the Transformer interface.
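If you go the StAX route, a minimal Java sketch of wiring an XMLStreamReader into a JAXP Transformer via StAXSource might look like the following (the file names input.xml, transform.xsl, and output.xml are placeholders, not anything from the question):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class StaxXsltTransform {
        public static void main(String[] args) throws Exception {
            // Pull-parse the input with StAX instead of handing the processor a DOM.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("input.xml"));

            // Compile the stylesheet once, then feed the StAX reader in as the Source.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("transform.xsl"));
            transformer.transform(new StAXSource(reader),
                    new StreamResult(new FileOutputStream("output.xml")));
        }
    }

Keep in mind that most XSLT processors still build their own in-memory tree from whatever Source they are given, so on its own this does not remove the memory ceiling; it mostly pays off once the input is split into independent pieces, as the other answers describe.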

answered by ykaganovich


It sounds like you're sorted - but often, another potential approach is to split the data first. Obviously this only works with some transformations (i.e. where different chunks of data can be treated in isolation from the whole) - but then you can use a simple parser (rather than a DOM) to do the splitting into manageable pieces, then process each chunk separately and reassemble.

Since I'm a .NET bod, things like XmlReader can do the chunking without a DOM; I'm sure there are equivalents for every language.

Again - just for completeness.

[edit re question] I'm not aware of any specific name; maybe Divide and Conquer. For example, if your data is actually a flat list of like objects, then you could simply split the first-level children - i.e. rather than having 2M rows, you split it into 10 lots of 200K rows, or 100 lots of 20K rows. I've done this many times when working with bulk data (for example, uploading in chunks of data [all valid] and re-assembling at the server so that each individual upload is small enough to be robust).
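Since the answer mentions there are equivalents in every language: a rough Java/StAX sketch of that splitting step could look like the code below. The assumptions here are not from the answer itself - records are the first-level children of the root, the wrapper element name "rows", the chunk size, and the file names are all placeholders, and namespace declarations on the original root element are not carried over.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import javax.xml.stream.XMLEventFactory;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.XMLEvent;

    public class XmlSplitter {
        public static void main(String[] args) throws Exception {
            final int rowsPerChunk = 20_000;   // chunk size: tune to whatever is "manageable"

            XMLInputFactory inFactory = XMLInputFactory.newInstance();
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            XMLEventFactory events = XMLEventFactory.newInstance();

            XMLEventReader in = inFactory.createXMLEventReader(new FileInputStream("big.xml"));
            XMLEventWriter out = null;
            int depth = 0, rowsInChunk = 0, chunkNo = 0;

            while (in.hasNext()) {
                XMLEvent e = in.nextEvent();

                if (e.isStartElement()) {
                    depth++;
                    // depth 2 == a first-level child of the root, i.e. one "row"
                    if (depth == 2 && (out == null || rowsInChunk == rowsPerChunk)) {
                        if (out != null) {
                            closeChunk(out, events);
                        }
                        out = outFactory.createXMLEventWriter(
                                new FileOutputStream("chunk-" + (chunkNo++) + ".xml"), "UTF-8");
                        out.add(events.createStartDocument("UTF-8", "1.0"));
                        out.add(events.createStartElement("", "", "rows")); // placeholder wrapper
                        rowsInChunk = 0;
                    }
                    if (depth == 2) {
                        rowsInChunk++;
                    }
                }

                // Copy row content verbatim; the original root element itself is skipped.
                if (depth >= 2 && out != null) {
                    out.add(e);
                }

                if (e.isEndElement()) {
                    depth--;
                }
            }

            if (out != null) {
                closeChunk(out, events);
            }
            in.close();
        }

        private static void closeChunk(XMLEventWriter out, XMLEventFactory events) throws Exception {
            out.add(events.createEndElement("", "", "rows"));
            out.add(events.createEndDocument());
            out.close();
        }
    }

Each chunk can then be run through the existing stylesheet with a small footprint, and the outputs re-assembled, as described above.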

answered by Marc Gravell


For what it's worth, I suspect that for Java, Saxon is as good as it gets, if you need to use XSLT. It is quite efficient (in both CPU and memory) for larger documents, but XSLT itself essentially forces a full in-memory tree of the contents to be created and retained, except in limited cases. Saxon-SA (the for-fee version) supposedly has extensions for taking advantage of such "streaming" cases, so that might be worth checking out.

But the advice to split up the contents is the best one: if you are dealing with independent records, just split the input using other techniques (like, use StAX! :-) )
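As a hedged illustration of that per-record idea in Java (the element name "record", the stylesheet, and the output file names are placeholders, and exactly where a given JAXP implementation leaves the stream reader after each transform can vary, so treat this as a sketch rather than a guaranteed recipe):

    import java.io.File;
    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class PerRecordTransform {
        public static void main(String[] args) throws Exception {
            // Compile the stylesheet once and reuse it for every record.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("record.xsl"));

            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("big.xml"));

            int n = 0;
            int event = r.getEventType();
            while (r.hasNext()) {
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(r.getLocalName())) {
                    // Hand only this record's subtree to the transform, so only one
                    // record is materialized at a time. The transform advances the
                    // reader, so re-read the current event type instead of calling next().
                    t.transform(new StAXSource(r),
                            new StreamResult(new File("out-" + (n++) + ".xml")));
                    event = r.getEventType();
                } else {
                    event = r.next();
                }
            }
            r.close();
        }
    }

How the per-record outputs get merged back together is left to whatever the final output format needs.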

answered by StaxMan


I have found that a custom tool built to run the XSLT using earlier versions of MSXML makes it very fast, but it also consumes incredible amounts of memory and will not actually complete if the input is too large. You also lose out on some advanced XSLT functionality, as the earlier versions of MSXML don't have full XPath support.

It is worth a try if your other options take too long.

answered by hova