
Can this checksum algorithm be improved?

We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except for a single '1' changed to a '2', return the same checksum. Here's the algorithm:

unsigned long GetFileCheckSum(CString PathFilename)
{
        FILE* File;
        unsigned long CheckSum = 0;
        unsigned long Data = 0;
        unsigned long Count = 0;

        if ((File = fopen(PathFilename, "rb")) != NULL)
        {
                while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
                {
                        CheckSum ^= Data + ++Count;
                        Data = 0;
                }
                fclose(File);
        }
        return CheckSum;
}

I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)

How could it be improved without a huge performance hit?

Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn? Edit: okay, so it's reading the file into unsigned long (let's assume a 32-bit OS here) chunks. What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')

markdrayton asked Jun 25 '09 17:06


1 Answer

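MD5 is commonly used to verify the integrity of transferred files. Source code is readily available in C++. It is widely considered fast and reliable for detecting accidental changes, although it is no longer considered secure against deliberately crafted collisions.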
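
A full MD5 implementation is too long to reproduce here, so as a rough illustration of a checksum that does react to a single-character change, here is a minimal sketch of CRC-32. Note that this is a different, non-cryptographic algorithm swapped in for brevity, not MD5. The names Crc32Update and GetFileCrc32 are hypothetical, and the sketch assumes a plain const char* path rather than the CString used in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (the variant
// used by zlib and PNG). Pass 0 as `crc` for the first chunk, then feed
// the previous return value back in for each following chunk.
std::uint32_t Crc32Update(std::uint32_t crc, const unsigned char* data, std::size_t len)
{
    crc ^= 0xFFFFFFFFu;                      // undo the final inversion of the previous call
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // If the low bit is set, shift and XOR in the polynomial; otherwise just shift.
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;                // final inversion
}

// Hypothetical replacement for GetFileCheckSum: returns false if the file
// cannot be opened or a read error occurs, so that failure is not mistaken
// for a checksum of 0.
bool GetFileCrc32(const char* path, std::uint32_t& out)
{
    std::FILE* file = std::fopen(path, "rb");
    if (file == NULL) {
        return false;
    }

    unsigned char buffer[65536];             // read in large blocks, not 4 bytes at a time
    std::uint32_t crc = 0;
    std::size_t got = 0;
    while ((got = std::fread(buffer, 1, sizeof(buffer), file)) > 0) {
        crc = Crc32Update(crc, buffer, got);
    }

    const bool ok = (std::ferror(file) == 0);
    std::fclose(file);
    if (ok) {
        out = crc;
    }
    return ok;
}
```

The bitwise inner loop keeps the sketch short; a table-driven version of the same CRC is the usual way to speed it up if the per-byte cost matters.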

See also: "Robust and fast checksum algorithm?"
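
On the fread() part of the question: fread(&Data, 1, sizeof(unsigned long), File) copies up to four bytes (assuming a 32-bit unsigned long, as the question does) from the file into the memory occupied by Data, in file order. The numeric value of Data therefore depends on the machine's byte order, and the bytes are combined with | rather than &. The sketch below mimics that copy with memcpy; LoadChunk and IsLittleEndian are hypothetical helper names, and std::uint32_t stands in for the 32-bit unsigned long.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Mimics one fread(&Data, 1, 4, File) call from the question: up to four
// bytes are copied into the integer's storage in file order. Bytes that
// are not read (a short final chunk) stay zero, matching the `Data = 0`
// reset in the original loop.
std::uint32_t LoadChunk(const unsigned char* bytes, std::size_t available)
{
    std::uint32_t data = 0;
    std::memcpy(&data, bytes, available < 4 ? available : 4);
    return data;
}

// True on little-endian machines such as x86, where the first byte in
// memory is the least significant byte of the integer.
bool IsLittleEndian()
{
    const std::uint32_t one = 1;
    unsigned char first = 0;
    std::memcpy(&first, &one, 1);
    return first == 1;
}
```

For a file containing abcd, LoadChunk yields 0x64636261 on a little-endian machine, i.e. ord('a') | (ord('b') << 8) | (ord('c') << 16) | (ord('d') << 24), and 0x61626364 on a big-endian one.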

Robert Harvey answered Oct 12 '22 13:10