
Using XSL to make a hash of XML file

Tags: xml, hash, xslt

I am trying to find a way to "hash" the contents of an XML file. At the root of this is a need to compare text nodes that are passed in against the text nodes I am expecting, by checking that their checksums match. The passed-in text nodes come back from a form submission, and I need to ensure that they were not changed (within reason, ruling out collisions).

The architecture is horrible, so please don't ask about it! I am locked into a given implementation of SharePoint with some very bad custom code that I need to work around.

Is there a well-performing checksum/hash function that can be implemented? I would need to check about 100 text nodes.

asked Jul 19 '11 20:07 by electrichead




1 Answer

Sounds like you need a position-dependent checksum. Are you asking for an XSLT implementation, or just the algorithm?

Here is an implementation of Fletcher's checksum in C, which should not be very hard to port to XSLT.

Update: Below is an XSLT 2.0 adaptation of Fletcher's checksum. Whether it's fast enough depends on the size of your data and the amount of time you have. I'd be interested to hear how your tests go. To optimize, I would attempt to change xs:integer to xs:int.

Note that I have substituted plain addition for the bitwise OR (|) of the implementation I linked to above. I'm not really qualified to analyze the ramifications of this change in regard to uniformity or non-invertibility, but it seems OK as long as you don't have a smart hacker trying to maliciously bypass your checksum checks.

Do note that because of the above change, this implementation will not give the same results as true implementations of Fletcher's checksum (@MDBiker). So you can't compare the output of this function with that of Java's Fletcher16, for example. However it will always return the same result for the same input (it's deterministic), so you can compare the output of this function on two text strings.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:foo="my.foo.org">

    <xsl:variable name="str1">The quick brown fox jumps over the lazy dog.</xsl:variable>
    <xsl:variable name="str2">The quick frown box jumps over the hazy frog.</xsl:variable>

    <xsl:template match="/">
        Checksum 1: <xsl:value-of select="foo:checksum($str1)"/>    
        Checksum 2: <xsl:value-of select="foo:checksum($str2)"/>    
    </xsl:template>

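    <!-- Convert the string to Unicode codepoints and checksum them. -->
    <xsl:function name="foo:checksum" as="xs:integer">
        <xsl:param name="str" as="xs:string"/>
        <xsl:variable name="codepoints" select="string-to-codepoints($str)"/>
        <!-- xsl:sequence returns the integer itself; xsl:value-of would build a text node -->
        <xsl:sequence select="foo:fletcher16($codepoints, count($codepoints), 1, 0, 0)"/>
    </xsl:function>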

    <!-- Tail-recursive loop over the codepoints; $index is 1-based.
         Can I change some xs:integers to xs:int and help performance? -->
    <xsl:function name="foo:fletcher16" as="xs:integer">
        <xsl:param name="str" as="xs:integer*"/>
        <xsl:param name="len" as="xs:integer" />
        <xsl:param name="index" as="xs:integer" />
        <xsl:param name="sum1" as="xs:integer" />
        <xsl:param name="sum2" as="xs:integer"/>
        <xsl:choose>
            <xsl:when test="$index gt $len">
                <xsl:sequence select="$sum2 * 256 + $sum1"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:variable name="newSum1" as="xs:integer"
                    select="($sum1 + $str[$index]) mod 255"/>
                <xsl:sequence select="foo:fletcher16($str, $len, $index + 1, $newSum1,
                        ($sum2 + $newSum1) mod 255)" />
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>
</xsl:stylesheet>

The output:

    Checksum 1: 65256    
    Checksum 2: 25689

A note on usage: You said you needed to run checksum on "the contents of an XML file. At the root of this is a need to compare some text nodes". If you pass a text node to foo:checksum(), it will work fine: its string value will be extracted.
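To make the usage concrete, here is a minimal sketch of checking each submitted text node against a checksum recorded earlier. The input shape (a fields element containing field elements with an expected attribute) is purely an assumption for illustration and is not taken from the question. The two functions are a compact restatement of the ones above, written with xsl:sequence and an xs:integer return type so that the stylesheet is self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:foo="my.foo.org">

    <xsl:output method="text"/>

    <!-- Hypothetical input (an assumption):
         <fields><field expected="12345">some text</field> ... </fields> -->
    <xsl:template match="/">
        <xsl:for-each select="/fields/field">
            <!-- Checksum of the field's string value as it came back -->
            <xsl:variable name="actual" as="xs:integer"
                select="foo:checksum(string(.))"/>
            <!-- Compare with the checksum stored before the round trip -->
            <xsl:value-of select="concat('field ', position(), ': ',
                if ($actual eq xs:integer(@expected)) then 'unchanged' else 'CHANGED',
                '&#10;')"/>
        </xsl:for-each>
    </xsl:template>

    <!-- Restatement of the answer's functions -->
    <xsl:function name="foo:checksum" as="xs:integer">
        <xsl:param name="str" as="xs:string"/>
        <xsl:variable name="cps" as="xs:integer*" select="string-to-codepoints($str)"/>
        <xsl:sequence select="foo:fletcher16($cps, count($cps), 1, 0, 0)"/>
    </xsl:function>

    <xsl:function name="foo:fletcher16" as="xs:integer">
        <xsl:param name="str" as="xs:integer*"/>
        <xsl:param name="len" as="xs:integer"/>
        <xsl:param name="index" as="xs:integer"/>
        <xsl:param name="sum1" as="xs:integer"/>
        <xsl:param name="sum2" as="xs:integer"/>
        <xsl:choose>
            <xsl:when test="$index gt $len">
                <xsl:sequence select="$sum2 * 256 + $sum1"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:variable name="newSum1" as="xs:integer"
                    select="($sum1 + $str[$index]) mod 255"/>
                <xsl:sequence select="foo:fletcher16($str, $len, $index + 1, $newSum1,
                        ($sum2 + $newSum1) mod 255)"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>
</xsl:stylesheet>
```

A field with no expected attribute is reported as CHANGED, because the comparison against an empty sequence is false. A non-numeric expected value raises a dynamic error, so you may want to validate it first.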

FYI, I ran a performance test to calculate the checksum of the text nodes in a 535 KB XML input file. Here was the initial template I used:

<xsl:template match="/">
    Checksum of input: <xsl:value-of
      select="sum(for $t in //text() return foo:checksum($t)) mod 65536"/>    
</xsl:template>

It finished in 0.8s, using Saxon PE.

Alternatively:

If the amount of text is not very large, it would probably be faster and more accurate to simply compare the strings themselves (instead of checksums) to each other. But maybe you can't get access to both text nodes at the same time, due to your architecture restrictions... I'm not clear on that from your description.
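If both versions of the text are available in the same document, the direct comparison could look something like the sketch below. The input shape (a pairs element containing pair elements, each with one expected and one submitted child) is a made-up assumption, since the question doesn't say how the data arrives.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:output method="text"/>

    <!-- Hypothetical input (an assumption):
         <pairs><pair><expected>a</expected><submitted>a</submitted></pair> ... </pairs> -->
    <xsl:template match="/">
        <xsl:for-each select="/pairs/pair">
            <!-- Exact string comparison of the two children; no checksum involved -->
            <xsl:value-of select="concat('pair ', position(), ': ',
                if (string(expected) eq string(submitted)) then 'same' else 'different',
                '&#10;')"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

This is an exact comparison, so whitespace differences count as changes. If the form round trip is known to alter whitespace, wrapping both sides in normalize-space() is one option, at the cost of ignoring that kind of edit.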

answered Sep 18 '22 11:09 by LarsH