We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine whether the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except for a single '1' changed to a '2', return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
    FILE* File;
    unsigned long CheckSum = 0;
    unsigned long Data = 0;
    unsigned long Count = 0;

    if ((File = fopen(PathFilename, "rb")) != NULL)
    {
        while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
        {
            CheckSum ^= Data + ++Count;
            Data = 0;
        }
        fclose(File);
    }
    return CheckSum;
}
I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)
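To show how little it takes to collide, here is a small self-contained sketch (CheckSumWords is a helper I made up for illustration; it applies the same per-word step as the loop above to an in-memory buffer):

#include <cstdio>

// Same per-word step as GetFileCheckSum, applied to an in-memory buffer.
unsigned long CheckSumWords(const unsigned long* words, unsigned long count)
{
    unsigned long checkSum = 0;
    for (unsigned long i = 0; i < count; ++i)
        checkSum ^= words[i] + (i + 1);   // equivalent to Data + ++Count
    return checkSum;
}

int main()
{
    // Two "files" of two words each that differ in every word, yet
    // collide: swapping the words and adjusting each by the position
    // offset leaves the XOR of (word + position) unchanged.
    unsigned long fileA[2] = { 0x31, 0x32 };
    unsigned long fileB[2] = { 0x33, 0x30 };
    printf("A: %08lx  B: %08lx\n",
           CheckSumWords(fileA, 2), CheckSumWords(fileB, 2));
    // Both print 00000006.
    return 0;
}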
How could it be improved without a huge performance hit?
Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn?

Edit: okay, so it's reading the file in unsigned long-sized chunks (let's assume a 32-bit OS here). What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):

(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')
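A quick way to check (my own sketch, assuming a little-endian x86 machine where the copy of the first four bytes matches what fread() does into Data):

#include <cstdio>
#include <cstring>

int main()
{
    // Emulate one pass of the fread(): copy the first four file bytes
    // into a word, just as fread(&Data, 1, 4, File) would.
    unsigned long data = 0;
    memcpy(&data, "abcd", 4);

    // On little-endian x86 this prints 64636261: 'a' (0x61) lands in
    // the LOW byte, so the Perl equivalent is
    //   ord('a') | (ord('b') << 8) | (ord('c') << 16) | (ord('d') << 24)
    // i.e. OR rather than AND, with the byte order reversed.
    printf("%08lx\n", data);
    return 0;
}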
MD5 is commonly used to verify the integrity of transferred files, and source code is readily available in C++. It is widely considered to be fast, and it is effectively immune to the kind of accidental collision described above.
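For instance, a sketch of a replacement using OpenSSL's EVP interface (assuming OpenSSL is available to link against with -lcrypto; any MD5 implementation would do) might look like this:

#include <cstdio>
#include <string>
#include <openssl/evp.h>

// Returns the MD5 digest of a file as a hex string, or "" on error.
// Sketch only; error handling on the EVP calls is omitted for brevity.
std::string GetFileMD5(const char* pathFilename)
{
    FILE* file = fopen(pathFilename, "rb");
    if (file == NULL)
        return "";

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_md5(), NULL);

    // Hash the file in 64 KB blocks.
    unsigned char buffer[65536];
    size_t bytesRead;
    while ((bytesRead = fread(buffer, 1, sizeof buffer, file)) > 0)
        EVP_DigestUpdate(ctx, buffer, bytesRead);
    fclose(file);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digestLen = 0;
    EVP_DigestFinal_ex(ctx, digest, &digestLen);
    EVP_MD_CTX_free(ctx);

    // Render the 16-byte digest as a lowercase hex string.
    char hex[2 * EVP_MAX_MD_SIZE + 1];
    for (unsigned int i = 0; i < digestLen; ++i)
        sprintf(hex + 2 * i, "%02x", digest[i]);
    return std::string(hex, 2 * digestLen);
}

Reading in large blocks keeps the I/O pattern close to the original loop, so the overall cost should still be dominated by the SMB transfer rather than the hashing.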
See also Robust and fast checksum algorithm?