I have a Python script that recursively walks a specified directory and checksums each file it finds. It then writes a log file listing all file paths and their MD5 checksums.
Sequentially, this takes a long time for 50,000 files at 15 MB each. However, my computer has much more resources available than it's actually using. How can I adjust my approach so that the script uses more resources to execute faster?
For example, could I split my file list into thirds and run a thread for each, giving me a 3x speedup?
I'm not very comfortable with threading, and I hope someone wouldn't mind whipping up an example for my case.
Here's the code for my sequential md5 loop:
for (root, dirs, files) in os.walk(root_path):
    for filename in files:
        file_path = os.path.join(root, filename)
        md5_pairs.append([file_path, md5file(file_path, 128)])
Thanks for your help in advance!
For this kind of work, I think multiprocessing.Pool would give you fewer surprises than threads; check the examples and docs at http://docs.python.org/library/multiprocessing.html
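Here's a minimal sketch of how that could look for your case. It assumes a stand-in md5_for_file() helper in place of your md5file() (which you didn't post), and a hypothetical "/path/to/root" where your root_path goes; adjust the pool size and chunksize for your machine.

import hashlib
import os
from multiprocessing import Pool

def md5_for_file(file_path, block_size=128 * 1024):
    """Hash one file in blocks so large files aren't read into memory at once."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return file_path, md5.hexdigest()

def collect_file_paths(root_path):
    """Walk the tree up front; hashing is the slow part, so only that is parallelized."""
    paths = []
    for root, dirs, files in os.walk(root_path):
        for filename in files:
            paths.append(os.path.join(root, filename))
    return paths

if __name__ == "__main__":
    file_paths = collect_file_paths("/path/to/root")  # your root_path here
    with Pool() as pool:  # defaults to one worker process per CPU core
        # imap_unordered yields results as workers finish them, in any order
        md5_pairs = list(pool.imap_unordered(md5_for_file, file_paths, chunksize=64))

One caveat: hashing 15 MB files is often disk-bound rather than CPU-bound, so the speedup may be capped by your storage's read throughput rather than by the number of workers.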