Good choice for a lightweight checksum algorithm?

Tags:

java

javascript

hash

checksum

I find myself needing to generate a checksum for a string of data, for consistency purposes. The broad idea is that the client can regenerate the checksum based on the payload it receives and thus detect any corruption that took place in transit. I am vaguely aware that there are all kinds of mathematical principles behind this kind of thing, and that it's very easy for subtle errors to make the whole algorithm ineffective if you try to roll it yourself.

So I'm looking for advice on a hashing/checksum algorithm with the following criteria:

It will be generated by Javascript, so needs to be relatively light computationally.
The validation will be done by Java (though I cannot see this actually being an issue).
It will take textual input (URL-encoded Unicode, which I believe is ASCII) of a moderate length; typically around 200-300 characters and in all cases below 2000.
The output should be ASCII text as well, and the shorter it can be the better.

I'm primarily interested in something lightweight rather than getting the absolute smallest potential for collisions possible. Would I be naive to imagine that an eight-character hash would be suitable for this? I should also clarify that it's not the end of the world if corruption isn't picked up at the validation stage (and I do realise that this will not be 100% reliable), though the rest of my code is markedly less efficient for every corrupt entry that slips through.

Edit - thanks to all that contributed. I went with the Adler32 option; given that it is natively supported in Java, extremely easy to implement in Javascript, fast to calculate at both ends and has an 8-byte output (eight hex characters), it was exactly right for my requirements.

(Note that I realise that the network transport is unlikely to be responsible for any corruption errors and won't be folding my arms on this issue just yet; however adding the checksum validation removes one point of failure and means we can focus on other areas should this reoccur.)
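On the eight-character question: a 32-bit checksum such as Adler32 or CRC32 fits exactly into eight hexadecimal digits. The sketch below shows one way to render such a value as ASCII; the helper name `toHex8` is made up for illustration and is not part of any answer here.

```javascript
// Hypothetical helper: render a 32-bit checksum as 8 lowercase hex characters.
function toHex8(checksum) {
  // >>> 0 reinterprets a possibly negative 32-bit result as unsigned,
  // and padStart keeps leading zeros so the length is always 8.
  return (checksum >>> 0).toString(16).padStart(8, "0");
}

console.log(toHex8(300286872)); // "11e60398"
console.log(toHex8(-1));        // "ffffffff"
console.log(toHex8(1));         // "00000001"
```

Whatever produces the string on the Java side would need to use the same zero-padded, lowercase form for a plain string comparison to work.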

522

asked Jan 07 '09 18:01

Andrzej Doyle

8 Answers

CRC32 is not too hard to implement in any language, it is good enough to detect simple data corruption, and when implemented well it is very fast. However, you can also try Adler32, which is almost as good as CRC32 but even easier to implement (and about equally fast).

Adler32 in the Wikipedia

CRC32 JavaScript implementation sample

Either of these two (or maybe even both) are available in Java right out of the box.
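For reference, here is a minimal sketch of the common table-driven CRC-32 (reflected polynomial 0xEDB88320, the variant used by PNG and zip). It assumes the input should be hashed as UTF-8 bytes and that `TextEncoder` is available; the names `CRC_TABLE` and `crc32` are chosen for illustration.

```javascript
// Build the 256-entry lookup table once.
const CRC_TABLE = (() => {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      // Shift right one bit; XOR in the polynomial if the dropped bit was set.
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c >>> 0;
  }
  return table;
})();

function crc32(text) {
  // Assumption: hash the UTF-8 encoding of the string.
  const bytes = new TextEncoder().encode(text);
  let crc = 0xFFFFFFFF;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xFF] ^ (crc >>> 8);
  }
  // Final XOR, then force the result to an unsigned 32-bit number.
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

console.log(crc32("123456789").toString(16)); // "cbf43926"
```

The string "123456789" is the usual check input for CRC-32, so it is a convenient way to confirm that both ends agree.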

105

answered Oct 05 '22 05:10

Mecki

Are you aware that both TCP and UDP (and IP, and Ethernet, and...) already provide checksum protection to data in transit?

Unless you're doing something really weird, if you're seeing corruption, something is very wrong. I suggest starting with a memory tester.

Also, you receive strong data integrity protection if you use SSL/TLS.

answered Oct 05 '22 04:10

derobert

Javascript implementation of MD4, MD5 and SHA1. BSD license.

answered Oct 05 '22 04:10

sastanin

Other people have mentioned CRC32 already, but here's a link to the W3C implementation of CRC-32 for PNG, as one of the few well-known, reputable sites with a reference CRC implementation.

(A few years back I tried to find a well-known site with a CRC algorithm or at least one that cited the source for its algorithm, & was almost tearing my hair out until I found the PNG page.)

answered Oct 05 '22 03:10

Jason S

[UPDATE 30/5/2013: The link to the old JS CRC32 implementation died, so I've now linked to a different one.]

Google CRC32: fast, and much lighter weight than MD5 et al. There is a Javascript implementation here.

answered Oct 05 '22 04:10

j_random_hacker

In my search for a JavaScript implementation of a good checksum algorithm I came across this question. Andrzej Doyle rightfully chose Adler32 as the checksum, as it is indeed easy to implement and has some excellent properties. DroidOS then provided an actual implementation in JavaScript, which demonstrated the simplicity.

However, the algorithm can be further improved upon as detailed in the Wikipedia page and as implemented below. The trick is that you need not determine the modulo in each step. Rather, you can defer this to the end. This considerably increases the speed of the implementation, up to 6x faster on Chrome and Safari. In addition, this optimisation does not affect the readability of the code, making it a win-win. As such, it fits in well with the original question, which asked for an algorithm / implementation that is computationally light.

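function adler32(data) {
  var MOD_ADLER = 65521; // largest prime below 2^16
  var a = 1, b = 0;

  var len = data.length;

  // Accumulate without reducing; safe for short inputs (see edit2 below).
  // charCodeAt matches the byte value only for ASCII input.
  for (var i = 0; i < len; i++) {
    a += data.charCodeAt(i);
    b += a;
  }

  a %= MOD_ADLER;
  b %= MOD_ADLER;

  // >>> 0 keeps the result unsigned; (b << 16) | a alone goes negative when b >= 0x8000.
  return ((b << 16) | a) >>> 0;
}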

edit: imaya created a jsperf comparison a while back showing the difference in speed when running the simple version, as detailed by DroidOS, compared to an optimised version that defers the modulo operation. I have added the above implementation under the name full-length to the jsperf page showing that the above implementation is about 25% faster than the one from imaya and about 570% faster than the simple implementation (tests run on Chrome 30): http://jsperf.com/adler-32-simple-vs-optimized/6

edit2: please don't forget that, when working on large files, the a and b variables will eventually exceed the largest integer that a JavaScript number can store exactly. As such, when working with a large data source, you should perform intermediate modulo operations to ensure that you stay below that limit.
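A minimal sketch of the intermediate-modulo variant follows. It assumes the input is hashed as UTF-8 bytes via `TextEncoder`, and it borrows the block length 5552 that zlib uses; the names `NMAX` and `adler32Bytes` are illustrative and do not come from the jsperf page.

```javascript
const MOD_ADLER = 65521;
// Block length used by zlib: with byte values up to 255, b stays below 2^32
// for this many steps between reductions.
const NMAX = 5552;

function adler32Bytes(bytes) {
  let a = 1;
  let b = 0;
  for (let start = 0; start < bytes.length; start += NMAX) {
    const end = Math.min(start + NMAX, bytes.length);
    for (let i = start; i < end; i++) {
      a += bytes[i];
      b += a;
    }
    // Reduce once per block instead of once per byte.
    a %= MOD_ADLER;
    b %= MOD_ADLER;
  }
  // Multiply instead of shifting so the result is never negative.
  return b * 65536 + a;
}

function adler32(text) {
  // Assumption: hash the UTF-8 encoding of the string.
  return adler32Bytes(new TextEncoder().encode(text));
}

console.log(adler32("Wikipedia").toString(16)); // "11e60398"
```

For the 200-2000 character inputs in the question a single block is enough, so this behaves like the fully deferred version while remaining safe for arbitrarily long data.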

answered Oct 05 '22 03:10

kvaruni

Use SHA-1 JS implementation. It's not as slow as you think (Firefox 3.0 on Core 2 Duo 2.4Ghz hashes over 100KB per second).

answered Oct 05 '22 04:10

Kornel

Here's a relatively simple one I've 'invented' - there's no mathematical research behind it but it's extremely fast and works in practice. I've also included the Java equivalent that tests the algorithm and shows that there's less than 1 in 10,000,000 chance of failure (it takes a minute or two to run).

JavaScript

function getCrc(s) {
    var result = 0;
    for(var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        result = (result << 1) ^ c;
    }
    return result;
}

Java

package test;

import java.util.*;

public class SimpleCrc {

    public static void main(String[] args) {
        final Random randomGenerator = new Random();
        int lastCrc = -1;
        int dupes = 0;
        for(int i = 0; i < 10000000; i++) {
            final StringBuilder sb = new StringBuilder();
            for(int j = 0; j < 1000; j++) {
                final char c = (char)(randomGenerator.nextInt(128 - 32) + 32);
                sb.append(c);
            }
            final int crc = crc(sb.toString());
            if(lastCrc == crc) {
                dupes++;
            }
            lastCrc = crc;
        }
        System.out.println("Dupes: " + dupes);
    }

    public static int crc(String string) {
        int result = 0;
        for(final char c : string.toCharArray()) {
            result = (result << 1) ^ c;
        }
        return result;
    }
}
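One caveat is worth checking before relying on this function for strings longer than about 32 characters. Each step shifts the 32-bit result left by one bit, so the contribution of an early character is eventually shifted out entirely. The snippet below is a small check of that behaviour; the function is copied from above so the snippet runs on its own.

```javascript
// Same shift-and-XOR function as above.
function getCrc(s) {
  let result = 0;
  for (let i = 0; i < s.length; i++) {
    result = (result << 1) ^ s.charCodeAt(i);
  }
  return result;
}

// Two 40-character strings that differ only in their first character.
const tail = "x".repeat(39);
const original = "A" + tail;
const corrupted = "B" + tail;

// The first character has been shifted left 39 times, past bit 31,
// so it no longer influences the result.
console.log(getCrc(original) === getCrc(corrupted)); // true
```

The Java test above compares each random string's value only with the previous one, so it would not reveal this kind of undetected change near the start of a string.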

answered Oct 05 '22 05:10

Keith Whittingham
