
recursive diff is extremely slow - checking contents of directories

Tags: bash, diff, rsync

I am running a recursive diff on two directories with a few options. The directories are fairly large, but I only want to see differences in the contents of the folders (which files exist), not differences between the files themselves, so I am using the -q option. Am I using it correctly?

I have also tried an rsync dry run, which takes about as long. The output goes through sed; I have tried without it, and it doesn't seem to affect anything. I also ignore hidden files. I think I may be misusing diff -q to compare just the contents of two directories.

I used a code block from another tip to time the comparison of just ONE of these directory pairs (1 directory, 14 subdirectories), and it took 88 minutes. However, every file is a 30-minute TV show, so if diff is comparing the files themselves, that makes sense. I thought -q would prevent that?

Also, one directory is mounted over AFP and the other is on a FireWire-connected external drive. That doesn't seem to matter, because I copied both directories locally and the diff took the same amount of time.

I have a workaround: I ran ls -1 over both directories and diffed the output. But why does diff take so long to run?

Here is the code; any suggestions?

#!/bin/bash

before="$(date +%s)"

# The first command creates the report; the rest append (>>) so earlier results are not overwritten.
diff -r -x '.*' /Volumes/directory1/ /Volumes/directory2/ | sed 's/^.\{24\}//g' > /Volumes/stuff.txt
diff -r -x '.*' /Volumes/directory3/ /Volumes/directory4/ | sed 's/^.\{24\}//g' >> /Volumes/stuff.txt
diff -r -x '.*' /Volumes/directory5/ /Volumes/directory6/ | sed 's/^.\{24\}//g' >> /Volumes/stuff.txt
diff -r -x '.*' /Volumes/directory7/ /Volumes/directory8/ | sed 's/^.\{24\}//g' >> /Volumes/stuff.txt
diff -r -x '.*' /Volumes/directory9/ /Volumes/directory10/ | sed 's/^.\{24\}//g' >> /Volumes/stuff.txt
diff -r -x '.*' /Volumes/directory11/ /Volumes/directory12/ | sed 's/^.\{24\}//g' >> /Volumes/stuff.txt

after="$(date +%s)"
elapsed_seconds="$((after - before))"
echo "Elapsed time for code block: $elapsed_seconds"
rick asked Mar 17 '11 20:03


1 Answer

When files are different, diff can figure that out fairly quickly. When they're the same, though, it has to scan the files in full to verify that they are indeed byte-for-byte identical. The -q option only shortens the report to whether files differ; the contents still have to be compared.

If all you care about is differences in file names and don't want to inspect the contents of the files, try something like:

diff <(find /Volumes/directory1/ -printf '%P\n') \
     <(find /Volumes/directory2/ -printf '%P\n')

This assumes you have GNU find with the -printf action. If you don't, use some subshell magic per Gordon's comment:

diff <(cd /Volumes/directory1; find .) \
     <(cd /Volumes/directory2; find .)
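
One caveat with comparing raw find output is that find does not promise any particular ordering, so two trees with the same names could still produce a noisy diff. The following is a minimal sketch of a sorted variant that also skips hidden entries, as the question does. The function names list_names and compare_names are hypothetical, and temporary files stand in for process substitution so the sketch also runs under a plain POSIX sh.

```shell
#!/bin/bash

# Hypothetical helper: print every path under directory $1, relative to it,
# skipping hidden files and directories, in a stable byte-wise sorted order.
list_names() {
    (cd "$1" && find . -path '*/.*' -prune -o -print | LC_ALL=C sort)
}

# Hypothetical helper: diff the name listings of two directory trees.
# Only names are compared; no file contents are read.
# Exit status follows diff: 0 if the listings match, 1 if they differ.
compare_names() {
    cn_left=$(mktemp) || return 2
    cn_right=$(mktemp) || { rm -f "$cn_left"; return 2; }
    list_names "$1" > "$cn_left"
    list_names "$2" > "$cn_right"
    diff "$cn_left" "$cn_right"
    cn_status=$?
    rm -f "$cn_left" "$cn_right"
    return "$cn_status"
}
```

Calling compare_names /Volumes/directory1 /Volumes/directory2 would print names found only in the first tree prefixed with < and names found only in the second prefixed with >. Because only directory entries are listed, the run time should depend on the number of files rather than their size.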
John Kugelman answered Sep 29 '22 20:09