Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing large text files - Is comparing hashes faster than using subsets of the file?

Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...

If the full text's string representation is too large to be held in memory at once, is it going to be faster to:

  • a) Hash both files in their entirety and then test to see if the hashes are identical

or

  • b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch

In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?

I'm expecting a couple of "it depends" answers, so if you want some assumtions to work with:

  • Language is C# in .NET
  • Text files are 3GB each
  • Hash function is MD5
  • Maximum 'spare' RAM is 1GB
like image 761
Widor Avatar asked Oct 06 '11 13:10

Widor


2 Answers

  1. The MD5 Checksum will be slower since you need to process the two files to get the outcome. You say you have 3GB files and only 1GB of memory spare you do the math.

  2. Checking them in byte chunks will actually determine any difference earlier, also by checking the file size, file length etc...

I would go with option 2.

like image 179
Michael D. Irizarry Avatar answered Sep 28 '22 17:09

Michael D. Irizarry


Option A is only useful if you reuse the hash (i.e. have other files to compare) so that the cost of calculating the hash isn't a factor...

Otherwise Option B is what i would go for...

To get the maximum speed I would use MemoryMappedFile instances and XOR the content - the comparison can stop at the first encounter of a difference (i.e. the XOR operation returns something != 0). Regarding memory consumption you can use a "moving window" (i.e. via the call to CreateViewAccessor) which would allow for literally processing files of TB-size...

It could even be worth to test performance of XOR against some LINQ based comparison methods... and always start by comparing the file sizes, this way you avoid doing unnecessary calculations...

like image 32
Yahia Avatar answered Sep 28 '22 15:09

Yahia