Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is this the right way to find a checksum?

Tags:

c

checksum

I am trying to calculate checksum for some data. This is the code:

#include <stdio.h>
#include <string.h>

int main()
{
    char MyArray[] = "my secret data";
    char checksum = 0;
    int SizeOfArray = strlen(MyArray);

    for(int x = 0; x < SizeOfArray; x++)
    {
          checksum += MyArray[x];
    }
    printf("Sum of the bytes for MyArray is: %d\n", checksum);

    printf("The checksum: \n");
    checksum = (checksum ^ 0xFF);
    printf("%d\n",checksum);
}

Output:

Sum of the bytes for MyArray is: 70
The checksum:
-71

Modification in the code:

#include <stdio.h>
#include <string.h>

int main()
{
    char MyArray[] = "my secret data";
    char checksum = 0; // could be an int if preferred
    int SizeOfArray = strlen(MyArray);

    for(int x = 0; x < SizeOfArray; x++)
    {
          checksum += MyArray[x];
    }
    printf("Sum of the bytes for MyArray is: %d\n", checksum);

    //Perform bitwise inversion
    checksum=~checksum;
    //Increment
    checksum++;
    printf("Checksum for MyArray is: %d\n", checksum);
    }

Output:

Sum of the bytes for MyArray is: 70
Checksum for MyArray is: -70

Why modification of checksum value? Will different algorithms provide different checksums?

How will the end value be useful? Well actually I am not clear about the checksum and its use in data validation. I searched the net, found lot of articles, but still not clear. Hope I will understand about checksum here today.

like image 814
highlander141 Avatar asked Dec 25 '15 06:12

highlander141


3 Answers

You need to understand what is a checksum before thinking how you generate it. Assume the issue of sending data through an unreliable communication channel, for example a network connection. You need to make sure that there was not interference that affected your message.

One approach to do this is to send the message twice, and check for differences (indeed, there is quite a small chance for the exact same error to happen during the transmission of both messages). However, this requires to use quite a lot of bandwidth (sending the message twice).

A more efficient approach is to compute a value based on the message and attach it to the message. The recipient then applies the same function and checks if the value is the same.

To get a more intuitive example, the checksum of a book may be the number of pages. You buy a book from the library and count its pages. If the number of pages is not what you expected, there is a problem.

You implement a specific checksum function (LSB of sum) which is fine. All checksum functions have some properties that you should be aware of, but the point is that there is not a right way to compute a checksum. There are many functions that can be used for this purpose.

like image 188
Paul92 Avatar answered Oct 20 '22 18:10

Paul92


This is the beauty of checksum algorithm: that the way you produce the checksum and the way you check it is somehow symmetric!

  1. About checksum

Checksum is typically used to verify the integrity of the data, especially over noisy/unrealiable communication channel. Thus, it is primarily used for error detection. That is, to know if the data received is correct.

This is quite different with, for example, error correction. Since its use is not only to check if there is error, but also to correct it.

Both error detection and error correction have overhead (or augemented) data. That is, data that is not part of the original data, but is added to the original data for the purpose of checking if the original data is changed (such as in the case or error detection) or to correct it (such as in the case or error correction).

Unlike in error detection algorithms, overhead data size of error correction algorithms tend to grow in proportional with the original data (since the more data you have, the more overhead you need to restore it).

We usually don't want overhead data size to be big, since big data means more resources we need to handle (i.e. time to process, time to transfer, etc). Thus, in that sense a good checksum algorithm is typically one that use the smallest amount of overhead data to detect the error but with great robustness (extremely rarely to produce false result).

And with that understanding the problem lies, since the robustness of the checksum, in reality, not only depends on the algorithm but also depends on the channel characteristics. Some channels may prone to certain errors while other channels to different ones.

In general, however, there are some checksums which are known to be more popular and more robust than the others. Of the latter, partly because it has inherent property to ensure that it will be robust in most, if nor all, of the channels (one of my favourites error detection algorithm is CRC - Cyclic Redundancy Check because it has that inherent property). But there is no perfect checksum for every scenario, it really depends on the use and scenario.

But still, you can measure robustness of a checksum algorithm. And there is a mathematical way to do it, which I think is beyond the scope of this discussion. Thus, some checksums, in those senses, can be said to be weaker than the others. Checksums which you showed in your question are also the weak ones, since they may produce false results more easily than strong ones such as, say, CRC.

  1. About the code

XOR with 0xFF for 8-bit is perfectly equivalent with binary-inverting a value, and it is not too hard to see it.

XOR with 0xFF

1110 0010
1111 1111
--------- XOR
0001 1101 //notice that this is exactly the same as binary inverting!

Thus, when you do XOR with 0xFF and ~checksum, you get the same result -71 (and since your data type is char, it has negative number). Then you increment it by 1, thus you get -70.

  1. About 2' Complement

Two's complement is a mathematical operation on binary numbers, as well as a binary signed number representation based on this operation. Its wide use in computing makes it the most important example of a radix complement. (wikipedia)

In other words, 2' complement is to find the negative representation of a value (in Computer Binary) and its method is, as you correctly did, by inverting all its bits and then add one to it. That's why you get -70 by 2' complementing 70. But this does not mean that 2' complement and XOR by 0xFF is the same, and as you can see by example, it is really not the same.

What XOR by 0xFF does in 8-bit data is simply equivalent to reversing all its bits. It doesn't add one to it.

  1. About the way to read add/read checksum

Since the checksum is used to know the integrity of the data (whether it is altered or not), people try to find best practice to do it. What you do is actually to get the checksum by 2' complement or by XOR with 0xFF.

And these are what they do:

  • For 2' complement checksum. Let say your message length is N. Since what you get by summation of N numbers are, let say, 70. Then by adding 2'complement checksum (that is -70), In the receiver side, you simply need to sum all the N+1 messages including the checksum and you should get 0 if the message is unaltered. This is how to use 2' complement checksum correctly.
  • For XOR with 0xFF Again, with the same example as the previous, you should get -1 if you sum up all the N+1 messages, including the checksum. And since the hex representation of -1 is 0xFF in 8-bit signed, thus by XOR ing the result (-1) with 0xFF, you should get 0xFF ^ 0xFF = 0 if the message contains no error

Hence in both cases, you simply need to check if the message contains error or not by checking if the final result is 0 (no error) or not!! And this is typically true for checksum algorithms!

This is the beauty of checksum algorithm: that the way you produce the checksum and the way you check is it somehow symmetric!

like image 33
Ian Avatar answered Oct 20 '22 17:10

Ian


A checksum is usually utilized to detect a change in data. Communications, encryption/signature, etc... checksums are used everywhere.

How a checksum can be useful?

  • it detects a change on 1 bit for instance
  • it even detects changes when more than 1 bits are changed

That may seem paradoxical, but when only 1 bit changes, your checksum will work. However, take

(A) checksum += 0x11 instead of 0x10

and later

(B) checksum += 0x30 instead of 0x31

In (A) the checksum will be -1... and in (B) it will be +1. Plus and minus 1 == 0. The two errors will not be detected by your checksum.

Basically the quality of a checksum depends

  • on the length of the checksum (the bigger the checksum, the more it will embrace bigger data, without "looping" (one byte has only 256 checksums possible, 2 bytes has 65536 ; note that in the case above with your algorithm that wouldn't change the result)

  • the quality of the checksum calculation, in order to prevent as much as possible that two differences cancel mutually.

There are many algorithms available. This answer on SO is a good start.

like image 45
Déjà vu Avatar answered Oct 20 '22 18:10

Déjà vu