I'd like to recurse a directory of images and generate thumbnails for each image. I have 12 usable cores on my machine. What's a good way to utilize them? I don't have much experience writing multithreaded applications so any simple sample code is appreciated. Thanks in advance.
Use processes, not threads, because Python is inefficient with CPU-intensive threads due to the GIL. Two possible solutions for multiprocessing are:
multiprocessing module
This is preferred if you're using an internal (in-process) thumbnail maker (e.g., PIL). Simply write a thumbnail-maker function and launch 12 in parallel. When one of the processes finishes, start another in its slot.
Adapted from the Python documentation, here's a script that should utilize 12 cores:
from multiprocessing import Process
import os

def info(title): # For learning purposes; remove once you get the PID/PPID idea
    print title
    print 'module:', __name__
    print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name): # Worker function
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    info('main line')
    processes = [Process(target=f, args=('bob-%d' % i,)) for i in range(12)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
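The "run another in its slot" idea can be sketched roughly as follows (shown in Python 3 syntax; `make_thumbnail` and `run_with_slots` are hypothetical names, and the worker body is just a stub):

```python
import multiprocessing as mp
import time

def make_thumbnail(path):
    # Hypothetical worker: replace the body with real thumbnail code (e.g. PIL)
    pass

def run_with_slots(tasks, max_workers=12):
    """Run one Process per task, keeping at most max_workers alive at once."""
    pending = list(tasks)
    running = []
    launched = 0
    while pending or running:
        # Reap finished workers, freeing their slots
        alive = []
        for p in running:
            if p.is_alive():
                alive.append(p)
            else:
                p.join()
        running = alive
        # Start new workers in the free slots
        while pending and len(running) < max_workers:
            p = mp.Process(target=make_thumbnail, args=(pending.pop(),))
            p.start()
            running.append(p)
            launched += 1
        time.sleep(0.05)  # avoid a tight busy-wait
    return launched

if __name__ == '__main__':
    run_with_slots(['img-%d.jpg' % i for i in range(30)])
```

In practice you rarely need to write this loop yourself, because `multiprocessing.Pool` (next section) manages the slots for you.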
multiprocessing.Pool
Following soulman's comment, you can use the provided process pool.
I've adapted some code from the multiprocessing manual. Note that you should probably use multiprocessing.cpu_count() instead of 4 to automatically determine the number of CPUs.
from multiprocessing import Pool
import datetime

def f(x): # Your thumbnail-maker function, probably using some module like PIL
    print '%-4d: Started at %s' % (x, datetime.datetime.now())
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4) # start 4 worker processes
    print pool.map(f, range(25)) # prints "[0, 1, 4,..., 576]"
Which gives (note that the printouts are not strictly ordered!):
0 : Started at 2011-04-28 17:25:58.992560
1 : Started at 2011-04-28 17:25:58.992749
4 : Started at 2011-04-28 17:25:58.992829
5 : Started at 2011-04-28 17:25:58.992848
2 : Started at 2011-04-28 17:25:58.992741
3 : Started at 2011-04-28 17:25:58.992877
6 : Started at 2011-04-28 17:25:58.992884
7 : Started at 2011-04-28 17:25:58.992902
10 : Started at 2011-04-28 17:25:58.992998
11 : Started at 2011-04-28 17:25:58.993019
12 : Started at 2011-04-28 17:25:58.993056
13 : Started at 2011-04-28 17:25:58.993074
14 : Started at 2011-04-28 17:25:58.993109
15 : Started at 2011-04-28 17:25:58.993127
8 : Started at 2011-04-28 17:25:58.993025
9 : Started at 2011-04-28 17:25:58.993158
16 : Started at 2011-04-28 17:25:58.993161
17 : Started at 2011-04-28 17:25:58.993179
18 : Started at 2011-04-28 17:25:58.993230
20 : Started at 2011-04-28 17:25:58.993233
19 : Started at 2011-04-28 17:25:58.993249
21 : Started at 2011-04-28 17:25:58.993252
22 : Started at 2011-04-28 17:25:58.993288
24 : Started at 2011-04-28 17:25:58.993297
23 : Started at 2011-04-28 17:25:58.993307
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256,
289, 324, 361, 400, 441, 484, 529, 576]
subprocess module
The subprocess module is useful for running external processes, and is thus preferred if you plan on using an external thumbnail maker like imagemagick's convert. Code example:
import subprocess as sp

processes = [sp.Popen('your-command-here', shell=True,
                      stdout=sp.PIPE, stderr=sp.PIPE) for i in range(12)]
Now, iterate over the processes. If any process has finished (check its Popen.poll() method, which returns None while the process is still running), remove it from the list and start a new process in its place.
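That poll-and-refill loop could be sketched like this (Python 3 syntax; `run_commands` is a hypothetical helper, and the actual commands would be your convert invocations, e.g. something like `convert in.jpg -thumbnail 128x128 out.jpg`):

```python
import subprocess as sp
import time

def run_commands(commands, max_workers=12):
    """Run shell commands with at most max_workers running at any time."""
    pending = list(commands)
    running, exit_codes = [], []
    while pending or running:
        still_running = []
        for p in running:
            code = p.poll()  # None while running, exit status once finished
            if code is None:
                still_running.append(p)
            else:
                exit_codes.append(code)
        running = still_running
        # Refill the free slots from the pending queue
        while pending and len(running) < max_workers:
            running.append(sp.Popen(pending.pop(0), shell=True))
        time.sleep(0.05)  # avoid a tight busy-wait
    return exit_codes
```

A more robust version would also capture stdout/stderr and handle nonzero exit codes, but this shows the shape of the loop.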
As others have answered, using subprocesses is usually preferable to threads. multiprocessing.Pool makes it easy to use exactly as many worker processes as you want, for instance like this:
import os
from multiprocessing import Pool

def process_file(filepath):
    pass # if filepath is an image file, resize it

def enumerate_files(folder):
    for dirpath, dirnames, filenames in os.walk(folder):
        for fname in filenames:
            yield os.path.join(dirpath, fname)

if __name__ == '__main__':
    pool = Pool(12) # or omit the parameter to use the CPU count
    # use pool.map() only for the side effects, ignore the return value
    pool.map(process_file, enumerate_files('.'), chunksize=1)
The chunksize=1 parameter makes sense if each file operation is relatively slow compared to communicating with each subprocess.
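The "if filepath is an image file, resize it" placeholder might be filled in along these lines (Python 3 syntax; IMAGE_EXTS and is_image_file are illustrative names, and the Pillow-based resize is left commented out since it depends on having PIL/Pillow installed):

```python
import os

# Hypothetical extension whitelist; adjust to taste
IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}

def is_image_file(filepath):
    """Cheap extension check; a stricter tool might sniff file headers."""
    return os.path.splitext(filepath)[1].lower() in IMAGE_EXTS

def process_file(filepath):
    if not is_image_file(filepath):
        return
    # A Pillow-based resize might look like this:
    # from PIL import Image
    # im = Image.open(filepath)
    # im.thumbnail((128, 128))
    # im.save(filepath + '.thumb.jpg')
```

Filtering by extension inside the worker keeps enumerate_files simple; alternatively you could filter before handing paths to pool.map().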