This question is related to a recent answer by michael.hor257k, which is in turn related to an answer by Dimitre Novatchev.
When I used the stylesheet from the above-mentioned answer (by michael.hor257k) on a large XML file (around 60 MB; a sample is shown below), the transformation was carried out successfully.
Then I tried another stylesheet, slightly different from michael.hor257k's, which is intended to group elements (those with a child sectPr) together with their following siblings (up to the next following sibling that has a child sectPr), recursively (i.e., grouping elements at every depth of the input XML).
The sample input XML:
<body>
    <p/>
    <p>
        <sectPr/>
    </p>
    <p/>
    <p/>
    <tbl/>
    <p>
        <sectPr/>
    </p>
    <p/>
</body>
The stylesheet I tried:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>
</xsl:stylesheet>
To my surprise, I encountered an OutOfMemoryError when transforming the same XML file of around 60 MB with this stylesheet.
I think I do not understand the trick behind the XSLTs provided by michael.hor257k and Dimitre Novatchev, which do not cause memory exceptions.
What is the big difference between my stylesheet and the ones in the above-mentioned answers that makes mine run out of memory? And how can I update my stylesheet to be memory-efficient?
Lingamurthy CS,
Please add the <xsl:strip-space elements="*"/> declaration, which you removed from the original solution. It strips any whitespace-only text nodes from the source XML document.
Not stripping these nodes may significantly increase the number of nodes and the memory needed to hold them -- in your case, the memory required to hold the XML document will be almost twice as much as the memory needed to hold it with these nodes stripped.
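For reference, here is the stylesheet from your question with only that declaration added at the top level -- everything else is unchanged:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes"/>
    <!-- Discard whitespace-only text nodes from the source document -->
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>
</xsl:stylesheet>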
I ran your transformation OK, but with these nodes stripped it runs 20% faster -- on MS XslCompiledTransform.
Then I ran your transformation with Saxon 9.1J -- once as published in the question, and a second time with <xsl:strip-space elements="*"/> added -- because it also shows the memory consumption of the transformation. Both runs were successful. In the first case the number of nodes processed was 9525004 and 340 MB of RAM was used; the transformation took 5.3 sec. In the second case the number of nodes was 4336366 and 215 MB of RAM was used; the transformation ran in 5.06 sec.
In my experience, XSLT is very easy to make memory-inefficient. It works really well for smaller transforms (even smaller transforms of lots of files), but when you start doing complex grouping or axis traversal it becomes inefficient for large (15 MB+) XML files. Would it be possible to split your large files into smaller ones? I've used that technique to resolve issues like this before.
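For example, here is a minimal splitting sketch -- not your grouping transformation itself, just the splitting step -- assuming a processor that supports the EXSLT exsl:document extension (e.g. xsltproc) and the top-level structure from your sample; the chunkN.xml file names are made up:

<!-- split.xsl: rough sketch that writes each run of top-level elements,
     starting at an element with a sectPr child, to its own output file.
     Assumes EXSLT exsl:document support (e.g. xsltproc); file names are arbitrary.
     Note: top-level siblings before the first element containing sectPr
     are not written anywhere by this sketch. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    extension-element-prefixes="exsl">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="/body">
        <!-- One output file per top-level element that has a sectPr child -->
        <xsl:for-each select="*[sectPr]">
            <exsl:document href="chunk{position()}.xml" method="xml" indent="yes">
                <body>
                    <!-- Copy this element plus its following siblings up to
                         (but not including) the next element with a sectPr child -->
                    <xsl:copy-of select=". | following-sibling::*[not(sectPr)]
                        [generate-id(preceding-sibling::*[sectPr][1]) = generate-id(current())]"/>
                </body>
            </exsl:document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

Each chunk could then be run through your grouping stylesheet separately and the results combined afterwards.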
Since you're using Windows, you have a few other options as well (especially since you're only using XSLT 1.0). One that might work is to try the .NET XslCompiledTransform class, which compiles your XSLT to IL. This might not fix the memory issues, but it might perform better on your platform.
The other option would be to make use of the .NET XmlReader and XmlWriter classes, which, given your requirements, probably wouldn't be very difficult to implement. These are forward-only XML reading and writing classes, and making use of streaming allows for much greater memory efficiency.