
Fastest way to process large files in Python

We have about 500GB of images in various directories that we need to process. Each image is about 4MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process, depending on its size.

We have at our disposal a 2.2GHz quad-core processor and 16GB of RAM on a GNU/Linux OS. The current script uses only one core. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.

asked Apr 04 '12 by CoolGravatar


People also ask

How do you process large files in Python?

Common techniques to reduce data processing time include multiprocessing, joblib, and tqdm.contrib.concurrent. For parallel processing, the task is divided into sub-units that can run at the same time, which increases the number of jobs processed and reduces overall processing time.
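For example, a minimal sketch with joblib (process_image and the list of paths are hypothetical stand-ins for your own code):

from joblib import Parallel, delayed

def process_image(path):
    # stand-in for the real work: read the image's metadata and write it to the database
    return path

paths = ["img001.jpg", "img002.jpg"]  # hypothetical list of files

# n_jobs=-1 uses every available core; each call runs in a separate worker process
results = Parallel(n_jobs=-1)(delayed(process_image)(p) for p in paths)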

How do I read a 100gb file in Python?

We can use the file object as an iterator: it returns the file one line at a time, and each line can be processed as it is read. This does not load the whole file into memory, so it is suitable for reading large files in Python.
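A minimal sketch of that pattern (handle and the file name are placeholders):

def handle(line):
    pass  # stand-in for the real per-line processing

# iterating over the file object reads lazily: only one line is held in memory at a time
with open("huge_log.txt") as f:
    for line in f:
        handle(line)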

How do I read a 10gb file in Python?

Python's readlines() method returns a list in which each item is a line of the file, but it reads the entire file into memory, so it is only practical when the file is small. For a 10GB file, iterate over the file object line by line or read it in fixed-size chunks instead.
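A rough sketch of chunked reading (the file name and chunk size are arbitrary):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per read; tune to your memory budget

with open("ten_gigabytes.bin", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:  # an empty bytes object means end of file
            break
        # process the chunk here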

How do you make Python files load faster?

If you know that your file contains 10^6 rows, you could preallocate the list. It should be faster than appending to it in each iteration. Just use features = [None] * (10 ** 6) to initialize your list.
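A short sketch of that idea (the file name is a placeholder, and it assumes the file has exactly 10^6 lines):

N = 10 ** 6
features = [None] * N  # preallocate instead of growing the list with append()

with open("data.txt") as f:
    for i, line in enumerate(f):
        features[i] = line.rstrip("\n")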


1 Answer

Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.

That might work. Also, have a look at the Python binding for ZeroMQ; it makes distributed processing pretty easy.
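As a rough illustration (not the author's code), a PUSH/PULL pipeline with pyzmq might look like this; the host name, port, and file names are made up:

import zmq

# --- distributor side (run on one machine) ---
ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.bind("tcp://*:5557")                      # port number is arbitrary
for path in ["img001.jpg", "img002.jpg"]:        # hypothetical work items
    sender.send_string(path)

# --- worker side (run on each processing machine) ---
ctx = zmq.Context()
receiver = ctx.socket(zmq.PULL)
receiver.connect("tcp://distributor-host:5557")  # hypothetical host name
while True:
    path = receiver.recv_string()
    # read the image's metadata and store it in the database here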

I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.

Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then

import multiprocessing

# one worker process per core; process() is applied to each directory in parallel
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))

will process all the directories in parallel. You can also do the parallelism at the file level if you want; that just needs a bit more tinkering.
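A sketch of the file-level variant, assuming a hypothetical process_file function and made-up directory paths:

import multiprocessing
import os

def process_file(path):
    # stand-in: read one image's metadata, store it, return True on success
    return True

def all_files(directories):
    for d in directories:
        for name in os.listdir(d):
            yield os.path.join(d, name)

if __name__ == "__main__":
    directories = ["/data/images/a", "/data/images/b"]  # hypothetical paths
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # a chunksize keeps per-task overhead low when there are many small files
    success = all(pool.imap_unordered(process_file, all_files(directories), chunksize=64))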

Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
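One way to keep going and report the failures at the end (a sketch; it assumes process catches its own exceptions and returns False instead of raising):

# imap preserves order, unlike imap_unordered, so results line up with directories
results = pool.imap(process, directories)
failed = [d for d, ok in zip(directories, results) if not ok]
if failed:
    print("Failed directories:", failed)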

answered Sep 27 '22 by Fred Foo