
Bash: parallelize md5sum checksum on many files

Let's say I have a 64-core server and I need to compute the md5sum of every file in /mnt/data, storing the results in a text file:

find /mnt/data -type f -exec md5sum {} \; > md5.txt 

The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that 64 md5sum processes are running in parallel at any given time (but no more than 64).

Also, I need the output from all the processes to be stored in one file.

NOTE: I am not looking for a way to compute md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, as long as there are any files coming from find.

asked May 27 '13 11:05 by user1968963



2 Answers

Use GNU parallel; the -j option sets the maximum number of jobs running at once. Its documentation has more usage examples.

find /mnt/data -type f | parallel -j 64 md5sum > md5.txt 
answered Sep 23 '22 04:09 by Steve


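You can use xargs as well. It may be more widely available than GNU parallel on some distros.

-P controls the number of processes spawned. -print0 and -0 pass file names NUL-delimited, so names containing spaces or quotes are not split into several arguments (which plain -L1 would do).

find /mnt/data -type f -print0 | xargs -0 -n1 -P24 md5sum > /tmp/result.txt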
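As a rough illustration of the xargs approach, here is a minimal sketch wrapped in a function and run against a throwaway directory. It assumes GNU findutils and coreutils (find -print0, xargs -0 and -P, md5sum, nproc); the function name parallel_md5, the demo file names, and the job count of 4 are made up for the example and are not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: hash every file under a directory with N parallel md5sum processes.
# Assumes GNU findutils (find -print0, xargs -0 -P) and coreutils (md5sum, nproc).

# parallel_md5 DIR OUTFILE [JOBS]   (hypothetical helper, not a standard tool)
parallel_md5() {
    local dir="$1" out="$2" jobs="${3:-$(nproc)}"
    # -print0 / -0 pass file names NUL-delimited, so spaces and newlines are safe.
    # -n 1 gives each md5sum one file; -P caps the number of concurrent processes.
    find "$dir" -type f -print0 | xargs -0 -n 1 -P "$jobs" md5sum > "$out"
}

# Demo on a temporary directory with made-up file names.
demo_dir=$(mktemp -d)
printf 'alpha' > "$demo_dir/a.txt"
printf 'beta'  > "$demo_dir/b with space.txt"
mkdir "$demo_dir/sub"
printf 'gamma' > "$demo_dir/sub/c.txt"

# Keep the output file outside the scanned directory so it is not hashed itself.
demo_out=$(mktemp)
parallel_md5 "$demo_dir" "$demo_out" 4

# Completion order varies between runs, so sort by path before comparing results.
sort -k 2 "$demo_out"
```

For the question's case the call would be something like parallel_md5 /mnt/data md5.txt 64. With very many small files, a larger -n value lets each md5sum process handle a batch of files and reduces process start-up overhead. The xargs documentation warns that output from parallel processes sharing one stdout can interleave, so the result is worth checking afterwards with md5sum -c.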
answered Sep 20 '22 04:09 by Tony