Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speeding up file comparisons (with `cmp`) on Cygwin?

Tags:

bash

cygwin

I've written a bash script on Cygwin which is rather like rsync, although different enough that I believe I can't actually use rsync for what I need. It iterates over about a thousand pairs of files in corresponding directories, comparing them with cmp.

Unfortunately, this seems to run abysmally slowly -- taking about ten (Edit: actually 25!) times as long as it takes to generate one of the sets of files using a Python program.

Am I right in thinking that this is surprisingly slow? Are there any simple alternatives that would go faster?

(To elaborate a bit on my use-case: I am autogenerating a bunch of .c files in a temporary directory, and when I re-generate them, I'd like to copy only the ones that have changed into the actual source directory, leaving the unchanged ones untouched (with their old creation times) so that make will know that it doesn't need to recompile them. Not all the generated files are .c files, though, so I need to do binary comparisons rather than text comparisons.)

like image 366
Brooks Moses Avatar asked Jan 24 '12 03:01

Brooks Moses


1 Answers

Maybe you should use Python to do some - or even all - of the comparison work too?

One improvement would be to only bother running cmp if the file sizes are the same; if they're different, clearly the file has changed. Instead of running cmp, you could think about generating a hash for each file, using MD5 or SHA1 or SHA-256 or whatever takes your fancy (using Python modules or extensions, if that's the correct term). If you don't think you'll be dealing with malicious intent, then MD5 is probably sufficient to identify differences.

Even in a shell script, you could run an external hashing command, and give it the names of all the files in one directory, then give it the names of all the files in the other directory. Then you can read the two sets of hash values plus file names and decide which have changed.

Yes, it does sound like it is taking too long. But the trouble includes having to launch 1000 copies of cmp, plus the other processing. Both the Python and the shell script suggestions above have in common that they avoid running a program 1000 times; they try to minimize the number of programs executed. This reduction in the number of processes executed will give you a pretty big bang for you buck, I expect.


If you can keep the hashes from 'the current set of files' around and simply generate new hashes for the new set of files, and then compare them, you will do well. Clearly, if the file containing the 'old hashes' (current set of files) is missing, you'll have to regenerate it from the existing files. This is slightly fleshing out information in the comments.

One other possibility: can you track changes in the data that you use to generate these files and use that to tell you which files will have changed (or, at least, limit the set of files that may have changed and that therefore need to be compared, as your comments indicate that most files are the same each time).

like image 182
Jonathan Leffler Avatar answered Nov 15 '22 07:11

Jonathan Leffler