
Multithreaded thumbnail generation in Python

I'd like to recurse a directory of images and generate thumbnails for each image. I have 12 usable cores on my machine. What's a good way to utilize them? I don't have much experience writing multithreaded applications so any simple sample code is appreciated. Thanks in advance.

asked by ensnare
2 Answers

Abstract

Use processes, not threads: because of the Global Interpreter Lock (GIL), Python threads give no speedup for CPU-bound work such as image resizing. Two possible ways to use multiple processes are:

The multiprocessing module

This is preferred if you generate the thumbnails in-process with a Python library (e.g., PIL). Simply write a thumbnail-maker function and launch 12 of them in parallel; when one of the processes finishes, start another in its slot.
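
As a concrete example, a thumbnail-maker function might look like the sketch below. It assumes Pillow (the maintained PIL fork) is installed; the 128x128 size and the .thumb.jpg output name are arbitrary choices, not something from the question:

import os
from PIL import Image  # assumption: Pillow is installed

def make_thumbnail(filepath, size=(128, 128)):
    # Sketch: write a <name>.thumb.jpg next to the original image
    try:
        img = Image.open(filepath)
    except OSError:        # not an image file, skip it
        return
    img.thumbnail(size)    # resizes in place, keeping the aspect ratio
    base, _ = os.path.splitext(filepath)
    img.convert('RGB').save(base + '.thumb.jpg', 'JPEG')

Calling make_thumbnail('photos/cat.jpg') would then write photos/cat.thumb.jpg (the file name is only illustrative).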

Adapted from the Python documentation, here's a script that should utilize all 12 cores:

from multiprocessing import Process
import os

def info(title):  # For demonstration only; remove once the PID/PPID idea is clear
    print(title)
    print('module:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())

def f(name):      # Worker function -- replace with your thumbnail maker
    info('function f')
    print('hello', name)

if __name__ == '__main__':
    info('main line')
    processes = [Process(target=f, args=('bob-%d' % i,)) for i in range(12)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
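
To actually "run another in its slot" with plain Process objects, one possibility is to poll the worker list and start a new process for every finished one. This is only a sketch; it assumes the make_thumbnail function from above and a list of image paths:

import time
from multiprocessing import Process

def run_in_slots(files, worker, max_workers=12):
    # Keep at most max_workers processes busy until every file is handled
    pending = list(files)
    running = []
    while pending or running:
        still_running = []
        for p in running:
            if p.is_alive():
                still_running.append(p)
            else:
                p.join()                      # reap the finished worker
        running = still_running
        while pending and len(running) < max_workers:
            p = Process(target=worker, args=(pending.pop(),))
            p.start()
            running.append(p)
        time.sleep(0.1)                       # avoid busy-waiting

You would call it as run_in_slots(list_of_image_paths, make_thumbnail). In practice, multiprocessing.Pool (see the Addendum below) does exactly this bookkeeping for you.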

Addendum: Using multiprocessing.Pool

Following Soulman's comment, you can use the provided process pool instead of managing the processes yourself.

I've adapted some code from the multiprocessing manual. Note that you probably should use multiprocessing.cpu_count() instead of 4 to automatically determine the number of CPUs.

from multiprocessing import Pool
import datetime

def f(x):  # Your thumbnail-maker function, probably using a module like PIL
    print('%-4d: Started at %s' % (x, datetime.datetime.now()))
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    print(pool.map(f, range(25)))         # prints "[0, 1, 4, ..., 576]"

Which gives (note that the printouts are not strictly ordered!):

0   : Started at 2011-04-28 17:25:58.992560
1   : Started at 2011-04-28 17:25:58.992749
4   : Started at 2011-04-28 17:25:58.992829
5   : Started at 2011-04-28 17:25:58.992848
2   : Started at 2011-04-28 17:25:58.992741
3   : Started at 2011-04-28 17:25:58.992877
6   : Started at 2011-04-28 17:25:58.992884
7   : Started at 2011-04-28 17:25:58.992902
10  : Started at 2011-04-28 17:25:58.992998
11  : Started at 2011-04-28 17:25:58.993019
12  : Started at 2011-04-28 17:25:58.993056
13  : Started at 2011-04-28 17:25:58.993074
14  : Started at 2011-04-28 17:25:58.993109
15  : Started at 2011-04-28 17:25:58.993127
8   : Started at 2011-04-28 17:25:58.993025
9   : Started at 2011-04-28 17:25:58.993158
16  : Started at 2011-04-28 17:25:58.993161
17  : Started at 2011-04-28 17:25:58.993179
18  : Started at 2011-04-28 17:25:58.993230
20  : Started at 2011-04-28 17:25:58.993233
19  : Started at 2011-04-28 17:25:58.993249
21  : Started at 2011-04-28 17:25:58.993252
22  : Started at 2011-04-28 17:25:58.993288
24  : Started at 2011-04-28 17:25:58.993297
23  : Started at 2011-04-28 17:25:58.993307
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 
 289, 324, 361, 400, 441, 484, 529, 576]

The subprocess module

The subprocess module is useful for running external programs, and is thus preferred if you plan on using an external thumbnail maker such as ImageMagick's convert. Code example:

import subprocess as sp

processes=[sp.Popen('your-command-here', shell=True, 
                    stdout=sp.PIPE, stderr=sp.PIPE) for i in range(12)]

Now, iterate over processes. When a process has finished (its Popen.poll() returns a non-None exit code), remove it from the list and add a new process in its place.
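
A rough sketch of that polling loop, assuming you have a list of pre-built convert command lines (the commands themselves are placeholders):

import subprocess as sp
import time

def run_commands(commands, max_procs=12):
    # Keep up to max_procs external commands running until all are done
    pending = list(commands)
    running = []
    while pending or running:
        # poll() returns None while a process is still running
        running = [p for p in running if p.poll() is None]
        while pending and len(running) < max_procs:
            running.append(sp.Popen(pending.pop(), shell=True))
        time.sleep(0.1)   # avoid busy-waiting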

answered by Adam Matan

As others have answered, subprocesses are usually preferable to threads here. multiprocessing.Pool makes it easy to use exactly as many subprocesses as you want, for instance like this:

import os
from multiprocessing import Pool
from PIL import Image  # assumption: Pillow is used for the resizing itself

def process_file(filepath):
    # Sketch: if filepath is an image file, resize it; skip everything else
    if filepath.endswith('.thumb.jpg'):
        return  # don't thumbnail our own output
    try:
        img = Image.open(filepath)
    except OSError:
        return  # not an image file
    img.thumbnail((128, 128))
    base, _ = os.path.splitext(filepath)
    img.convert('RGB').save(base + '.thumb.jpg', 'JPEG')

def enumerate_files(folder):
    for dirpath, dirnames, filenames in os.walk(folder):
        for fname in filenames:
            yield os.path.join(dirpath, fname)

if __name__ == '__main__':
    pool = Pool(12) # or omit the parameter to use CPU count
    # use pool.map() only for the side effects, ignore the return value
    pool.map(process_file, enumerate_files('.'), chunksize=1)

The chunksize=1 parameter makes sense if each file operation is relatively slow compared to communicating with each subprocess.

answered by Soulman