Run md5sum on every file in that list. Build a string that contains the file paths along with their hashes. Finally, run md5sum on that string to obtain a single hash value.
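The three steps above can be sketched roughly as follows. This is only an illustration: the function name hash_file_list and the idea of keeping the list in a text file (one path per line) are assumptions, not something the original answer specifies.

```shell
# hash_file_list LISTFILE  (hypothetical helper name)
# LISTFILE is assumed to hold one file path per line; paths that
# contain newlines are not supported by this sketch.
hash_file_list() {
    combined=""
    while IFS= read -r path; do
        # md5sum prints "<hash>  <path>"; append that line to the string.
        combined="${combined}$(md5sum "$path")
"
    done < "$1"
    # Hash the combined string itself and keep only the hash field.
    printf '%s' "$combined" | md5sum | awk '{print $1}'
}
```

Because the paths are part of the string, the result changes when a file is renamed or moved, not only when its content changes.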
Checksums are calculated for files. Calculating the checksum for a directory requires recursively calculating the checksums for all the files in the directory. The -r option allows md5deep to recurse into sub-directories. The -l option enables displaying the relative path, instead of the default absolute path.
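A possible way to turn the md5deep listing into a single value is sketched below. It assumes md5deep is installed; the function name md5deep_dir_hash, the cd into the directory and the extra sort step are my additions, there only to make the result independent of where the directory lives and of the order in which files are visited.

```shell
# md5deep_dir_hash DIR  (hypothetical helper name; requires md5deep)
md5deep_dir_hash() {
    # -r recurses into sub-directories, -l prints relative paths.
    # Sorting on the path field gives a stable order before the final hash.
    (cd "$1" >/dev/null && md5deep -r -l . | LC_ALL=C sort -k 2 | md5sum | awk '{print $1}')
}
```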
Create a tar archive on the fly and pipe it to md5sum:
tar c dir | md5sum
This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.
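One thing to keep in mind is that a tar archive stores metadata such as modification times, so the hash reflects more than file contents. A small sketch that wraps the command in a function (tar_dir_hash is a made-up name) and writes the archive to standard output explicitly with -f -, which avoids depending on tar's compiled-in default archive:

```shell
# tar_dir_hash DIR  (hypothetical helper name)
tar_dir_hash() {
    # cd first so the archive holds "./..." paths rather than the path to DIR.
    (cd "$1" >/dev/null && tar cf - . | md5sum | awk '{print $1}')
}

# Usage, with an example path:
#   tar_dir_hash ~/pybin
#   touch ~/pybin/some_file     # content unchanged, mtime changed
#   tar_dir_hash ~/pybin        # a different hash is expected
```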
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum
The find command lists all the files that end in .py. The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique). The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.
I've tested this by copying a test directory:
rsync -a ~/pybin/ ~/pybin2/
I renamed some of the files in ~/pybin2.
The find...md5sum command returns the same output for both directories:
2bcf49a4d19ef9abd284311108d626f1 -
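For repeated use, the pipeline can be wrapped in a small function. This is a sketch only: the name content_hash and the glob parameter are my own, and LC_ALL=C is added to the sort so the order does not depend on the locale.

```shell
# content_hash DIR GLOB  (hypothetical helper name)
# Hashes the contents of all files under DIR whose names match GLOB,
# ignoring the file names themselves.
content_hash() {
    find "$1" -type f -name "$2" -exec md5sum {} + \
        | awk '{print $1}' \
        | LC_ALL=C sort \
        | md5sum \
        | awk '{print $1}'
}

# Usage, mirroring the test above:
#   content_hash ~/pybin  '*.py'
#   content_hash ~/pybin2 '*.py'
```

Since only the per-file hashes are kept, renaming a file leaves the result unchanged, while editing a file changes it.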
ire_and_curses's suggestion of using tar c <dir> has some issues:
rsync -a --delete does: it synchronizes virtually everything (minus xattrs and ACLs), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar.
As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should take empty directories into account.
Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.
This is the solution I came up with:
dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum
Notes about this solution:
LC_ALL=C is there to ensure a reliable sorting order across systems.
File names that contain newlines are not handled. That is usually fixed with the -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
PS: one of my systems uses a limited busybox find which does not support the -exec or -print0 flags, and it also appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:
dir=<mydir>; (find "$dir" -type f | while IFS= read -r f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum
Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.
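One caveat with the commands above is that the directory path given to find ends up in the hashed text, so two copies of the same tree in different locations would produce different values. A sketch of a variant that changes into the directory first, so only relative paths are hashed (the function name dir_hash is made up for this example):

```shell
# dir_hash DIR  (hypothetical helper name)
# Hashes file contents, relative file paths and directory names under DIR.
dir_hash() {
    (
        cd "$1" >/dev/null || exit 1
        {
            find . -type f -exec md5sum {} +   # "<hash>  ./relative/path"
            find . -type d                     # directories, including empty ones
        } | LC_ALL=C sort | md5sum | awk '{print $1}'
    )
}

# Usage, with example paths:
#   [ "$(dir_hash ~/pybin)" = "$(dir_hash ~/pybin2)" ] && echo same
```

Like the original, this assumes no file or directory names contain newlines.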
If you only care about files and not empty directories, this works nicely:
find /path -type f | sort -u | xargs cat | md5sum
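The one-liner splits file names on whitespace when they reach xargs, so names containing spaces break it. A null-delimited sketch is shown below; it assumes find, sort and xargs support -print0, -z and -0 (GNU and BSD versions do, a limited busybox build may not), and the function name concat_hash is made up.

```shell
# concat_hash DIR  (hypothetical helper name)
# Hashes the concatenated contents of all files under DIR, in sorted path order.
concat_hash() {
    find "$1" -type f -print0 \
        | LC_ALL=C sort -z \
        | xargs -0 cat \
        | md5sum \
        | awk '{print $1}'
}
```

Note that only the concatenated bytes are hashed, so file boundaries are invisible: two files containing "ab" and "c" give the same result as two files containing "a" and "bc".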