
Multithreaded MD5 Checksum in Python

I have a python script that recursively walks a specified directory, and checksums each file it finds. It then writes a log file which lists all file paths and their md5 checksums.

Run sequentially, this takes a long time for 50,000 files at 15 MB each. However, my computer has far more resources available than the script actually uses. How can I adjust my approach so that the script uses more resources and runs faster?

For example, could I split my file list into thirds and run a thread for each, giving me roughly a 3x speedup?

I'm not very comfortable with threading, and I hope someone wouldn't mind whipping up an example for my case.

Here's the code for my sequential md5 loop:

# md5file() is my helper that reads a file and returns its MD5 digest
for root, dirs, files in os.walk(root_path):
    for filename in files:
        file_path = os.path.join(root, filename)
        md5_pairs.append([file_path, md5file(file_path, 128)])

Thanks for your help in advance!

asked Apr 12 '12 by Jamie


1 Answer

For this kind of work, I think multiprocessing.Pool would give you fewer surprises - check the examples and docs at http://docs.python.org/library/multiprocessing.html
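
A minimal sketch of that idea, adapted to the question's directory walk. The function names (md5file, checksum_tree) and the worker count are my own choices, not anything from the original post; Pool.map handles splitting the file list across worker processes for you, so there is no need to chunk it into thirds by hand:

```python
import hashlib
import os
from multiprocessing import Pool

def md5file(file_path, chunk_size=128 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum(file_path):
    # Worker function: must be a top-level function so it can be
    # pickled and sent to the pool's worker processes.
    return (file_path, md5file(file_path))

def checksum_tree(root_path, workers=4):
    """Walk root_path and checksum every file using a process pool."""
    paths = [
        os.path.join(root, filename)
        for root, dirs, files in os.walk(root_path)
        for filename in files
    ]
    with Pool(workers) as pool:
        # pool.map distributes the paths across the workers and
        # collects the (path, digest) pairs in order.
        return pool.map(checksum, paths)
```

Because hashing is mostly disk- and CPU-bound, processes (rather than threads) sidestep the GIL; with 15 MB files the per-task overhead of pickling a path string is negligible.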

answered Oct 11 '22 by jsbueno