What is the Most Efficient Java-Based streaming XSLT Processor? [closed]

Tags:

java

processor

xslt

I have a very large XML file which I need to transform into another XML file, and I would like to do this with XSLT. I am more interested in optimisation for memory, rather than optimisation for speed (though, speed would be good too!).

Which Java-based XSLT processor would you recommend for this task?

Would you recommend any other way of doing it (non-XSLT?, non-Java?), and if so, why?

The XML files in question are very large, but not very deep: they have millions of rows (elements), but are only about 3 levels deep.

560

asked Jan 20 '09 11:01

Vihung

1 Answer

At present there are only three known XSLT 2.0 processors, and of these Saxon 9.x is probably the most efficient (at least in my experience), both in speed and in memory utilisation. Saxon-SA (the schema-aware version of Saxon, which, unlike the basic B version, is not free) has special extensions for streamed processing.

Among the various existing XSLT 1.0 processors, .NET XslCompiledTransform (C#-based, not Java!) seems to be the champion.

In the Java-based world of XSLT 1.0 processors, Saxon 6.x is again pretty good.

UPDATE:

Now, more than 3 years after this question was originally answered, there isn't any evidence that the efficiency differences between the XSLT processors mentioned have changed.

As for streaming:

An XML document with "millions of nodes" may well be processed even without any streaming. I conducted an experiment in which Saxon 9.1.0.7 processed an XML document that contains around one million third-level elements with integer values. The transformation simply calculates their sum. The total time for the transformation on my computer is less than 1.5 seconds. Memory use was about 500MB, an amount that PCs could have had even 10 years ago.

Here are Saxon's informational messages that show details about the transformation:

Saxon 9.1.0.7J from Saxonica
Java version 1.6.0_17
Stylesheet compilation time: 190 milliseconds
Processing file:/C:\temp\delete\MRowst.xml
Building tree for file:/C:\temp\delete\MRowst.xml using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 1053 milliseconds
Tree size: 3075004 nodes, 1800000 characters, 0 attributes
Loading net.sf.saxon.event.MessageEmitter
Execution time: 1448 milliseconds
Memory used: 506661648
NamePool contents: 14 entries in 14 chains. 6 prefixes, 6 URIs
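For anyone who wants to try a similar non-streamed experiment, here is a minimal sketch. It is not the stylesheet or the input I used above: it assumes the XSLT 1.0 processor bundled with the JDK (through the standard JAXP API) instead of Saxon, and the document shape (`rows`/`row`/`v`), the class name and the method names are all hypothetical. Timings and memory use will differ from the Saxon figures.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SumTransform {
    // XSLT 1.0 stylesheet that outputs the sum of all third-level elements as text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='sum(/*/*/*)'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Builds a hypothetical flat document: <rows><row><v>1</v></row>...</rows>
    static String sampleDocument(int rows) {
        StringBuilder sb = new StringBuilder("<rows>");
        for (int i = 1; i <= rows; i++) {
            sb.append("<row><v>").append(i).append("</v></row>");
        }
        return sb.append("</rows>").toString();
    }

    // Runs the stylesheet over the whole document; the full tree is built in memory.
    static String sum(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = sampleDocument(1000);
        long start = System.nanoTime();
        String total = sum(xml);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Sum: " + total + " (" + elapsedMs + " ms)");
    }
}
```

Increasing the argument to `sampleDocument` gives a rough feel for how time and heap use grow with document size when no streaming is involved.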

Saxon 9.4 has a saxon:stream() extension function that can be used for processing huge XML documents.

Here is an excerpt from the documentation:

There are basically two ways of doing streaming in Saxon:

Burst-mode streaming: with this approach, the transformation of a large file is broken up into a sequence of transformations of small pieces of the file. Each piece in turn is read from the input, turned into a small tree in memory, transformed, and written to the output file.

This approach works well for files that are fairly flat in structure, for example a log file holding millions of log records, where the processing of each log record is independent of the ones that went before.

A variant of this technique uses the new XSLT 3.0 xsl:iterate instruction to iterate over the records, in place of xsl:for-each. This allows working data to be maintained as the records are processed: this makes it possible, for example, to output totals or averages at the end of the run, or to make the processing of one record dependent on what came before it in the file. The xsl:iterate instruction also allows early exit from the loop, which makes it possible for a transformation to process data from the beginning of a large file without actually reading the whole file.

Burst-mode streaming is available in both XSLT and XQuery, but there is no equivalent in XQuery to the xsl:iterate construct.

Streaming templates: this approach follows the traditional XSLT processing pattern of performing a recursive descent of the input XML hierarchy by matching template rules to the nodes at each level, but does so one element at a time, without building the tree in memory.

Every template belongs to a mode (perhaps the default, unnamed mode), and streaming is a property of the mode that can be specified using the new xsl:mode declaration. If the mode is declared to be streamable, then every template rule within that mode must obey the rules for streamable processing.

The rules for what is allowed in streamed processing are quite complicated, but the essential principle is that the template rule for a given node can only read the descendants of that node once, in order. There are further rules imposed by limitations in the current Saxon implementation: for example, although grouping using <xsl:for-each-group group-adjacent="xxx"> is theoretically consistent with a streamed implementation, it is not currently implemented in Saxon.
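The burst-mode idea described in the excerpt can also be approximated without any Saxon extensions. The following is only a sketch under assumptions: it uses the JDK's StAX and JAXP APIs rather than saxon:stream(), and the record element name, the per-record stylesheet, the `items` wrapper element and the class name are all hypothetical. Each record is copied into a small in-memory tree, transformed on its own and written out, so memory use depends on the size of a record rather than the size of the file.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BurstTransform {
    // Hypothetical per-record stylesheet: each <row> becomes an <item> holding twice its <v>.
    static final String RECORD_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='row'><item><xsl:value-of select='v * 2'/></item></xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms every element named recordName independently; returns the record count.
    static int transformRecords(Reader input, Writer output, String recordName) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer copy = factory.newTransformer(); // identity transform: copies one record into a DOM
        Transformer perRecord = factory.newTransformer(new StreamSource(new StringReader(RECORD_XSLT)));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
        int count = 0;
        output.write("<items>");
        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && recordName.equals(reader.getLocalName())) {
                // Build a small tree for this record only, then transform that tree.
                DOMResult tree = new DOMResult();
                copy.transform(new StAXSource(reader), tree);
                perRecord.transform(new DOMSource(tree.getNode()), new StreamResult(output));
                count++;
                // The copy has already advanced the reader past this record,
                // so the loop re-checks the current event instead of calling next().
            } else {
                reader.next();
            }
        }
        output.write("</items>");
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        String xml = "<rows><row><v>1</v></row><row><v>2</v></row></rows>";
        int records = transformRecords(new StringReader(xml), out, "row");
        System.out.println(records + " records: " + out);
    }
}
```

For a real file, the `Reader` and `Writer` would be file-based. Like burst mode in general, this only suits transformations in which each record can be handled independently of the others; the returned count is the only state carried across records here.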

XSLT 3.0 will have a standard streaming feature. However, the W3C document still has "working draft" status, and the streaming specification is likely to change in subsequent draft versions. Because of this, no implementations of the current draft (streaming) specification exist.

Warning: Not every transformation can be performed in streaming mode, regardless of the XSLT processor. One example of a transformation that cannot be performed in streaming mode (with a limited amount of RAM) on huge documents is sorting their elements (say, by a common attribute).

101

answered Oct 18 '22 01:10

Dimitre Novatchev
