
Memory efficient XSLT for transforming large XML files

This question is related to a recent answer by michael.hor257k, which is in turn related to an answer by Dimitre Novatchev.

When I used the stylesheet from michael.hor257k's answer on a large XML file (around 60MB; a sample is shown below), the transformation completed successfully.

I then tried another stylesheet, slightly different from michael.hor257k's. It is intended to group each element that has a child sectPr together with its following siblings (up to the next following sibling that has a child sectPr), and to do so recursively, i.e., at every depth of the input XML.

The sample input XML:

<body>
    <p/>
    <p>
        <sectPr/>
    </p>
    <p/>
    <p/>
    <tbl/>
    <p>
        <sectPr/>
    </p>
    <p/>
</body>

The stylesheet I tried:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>

</xsl:stylesheet>

To my surprise, I got an OutOfMemoryError when transforming an XML file of around 60MB.

I think I do not understand the trick behind the stylesheets provided by michael.hor257k and Dimitre Novatchev that keeps them from causing memory exceptions.

What is the big difference between my stylesheet and the ones in the above-mentioned answers that makes mine fail with an OutOfMemoryError? And how can I update my stylesheet to be memory efficient?

asked Feb 11 '23 06:02 by Lingamurthy CS


2 Answers

Lingamurthy CS,

Please add the <xsl:strip-space elements="*"/> declaration, which you removed from the original solution. It strips every whitespace-only text node from the source XML document.

Not stripping these nodes may significantly increase the number of nodes and the memory needed to hold them. In your case, holding the XML document in memory takes almost twice as much memory as holding the same document with these nodes stripped.
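As a minimal sketch of the change described above, assuming the stylesheet from the question is otherwise left unchanged, the only addition is the top-level xsl:strip-space declaration. The two templates are copied from the question as they are.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Added: remove whitespace-only text nodes from the source tree,
         so the processor does not have to keep them in memory. -->
    <xsl:strip-space elements="*"/>

    <!-- Unchanged from the question: copy an element, descend into its
         first child and its sectPr-bearing children, then move on to the
         next sibling as long as that sibling has no sectPr child. -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*[1] | *[sectPr]"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
    </xsl:template>

    <!-- Unchanged from the question: an element with a sectPr child
         starts a new group, wrapped in myTag together with the siblings
         that follow it. -->
    <xsl:template match="*[sectPr]">
        <myTag>
            <xsl:copy>
                <xsl:apply-templates select="*[1] | *[sectPr]"/>
            </xsl:copy>
            <xsl:apply-templates select="following-sibling::*[1][not(sectPr)]"/>
        </myTag>
    </xsl:template>

</xsl:stylesheet>
```

Because both templates select only element nodes (`*[1]`, `*[sectPr]`, `following-sibling::*[1]`), the whitespace-only text nodes were never processed in the first place, so the output should be the same as before. What changes is the size of the source tree the processor has to build.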

I ran your transformation successfully on MS XslCompiledTransform; with the nodes stripped it runs 20% faster.

Then I ran your transformation twice with Saxon 9.1J, because it also reports the memory consumption of the transformation: once as published in the question, and once with <xsl:strip-space elements="*"/> added. Both runs were successful. In the first case the number of nodes processed was 9525004 and 340MB RAM was used; the transformation took 5.3 sec. In the second case the number of nodes was 4336366 and 215MB RAM was used; the transformation took 5.06 sec.

answered Feb 13 '23 00:02 by Dimitre Novatchev


In my experience, it is very easy to write memory-inefficient XSLT. It works really well for smaller transforms (even smaller transforms of lots of files), but once you start doing complex grouping or axis traversal it becomes inefficient for large (15MB+) XML files. Would it be possible to split your large files into smaller ones? I've used that technique to resolve issues like this before.

Since you're using Windows, you have a few other options as well (especially since you're only using XSLT 1.0). One that might work is to try using the .NET XslCompiledTransform class, which compiles your XSLT to IL. This might not fix the memory issues, but it might perform better on your platform.

The other option would be to use the .NET XmlReader and XmlWriter classes, which, given your requirements, probably wouldn't be very difficult to implement. These are forward-only XML reading and writing classes, and streaming the document this way allows for much greater memory efficiency than loading it whole.

answered Feb 13 '23 02:02 by Dan Field