
Fastest way to process large files in Python

We have about 500GB of images in various directories that we need to process. Each image is about 4MB in size, and we have a Python script to process each image one at a time (it reads metadata and stores it in a database). Each directory can take 1-4 hours to process, depending on its size.

We have at our disposal a 2.2GHz quad-core processor and 16GB of RAM on a GNU/Linux OS. The current script uses only one core. What's the best way to take advantage of the other cores and RAM to process images faster? Will starting multiple Python processes to run the script take advantage of the other cores?

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines. I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.

asked Apr 04 '12 by CoolGravatar


People also ask

How do you process large files in Python?

Common techniques to reduce data processing time include multiprocessing, joblib, and tqdm.contrib.concurrent. For parallel processing, the task is divided into sub-units that can run at the same time, which increases the number of jobs processed and reduces overall processing time.
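For example, a minimal sketch with joblib (process_image and the list of paths are hypothetical stand-ins for your own code):

from joblib import Parallel, delayed

def process_image(path):
    # stand-in for the real work: read the image's metadata and write it to the database
    return path

paths = ["img001.jpg", "img002.jpg"]  # hypothetical list of files

# n_jobs=-1 uses every available core; each call runs in a separate worker process
results = Parallel(n_jobs=-1)(delayed(process_image)(p) for p in paths)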

How do I read a 100gb file in Python?

We can use the file object as an iterator: it returns the file one line at a time, and each line can be processed as it is read. This does not load the whole file into memory, so it is suitable for reading large files in Python.
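A minimal sketch of that pattern (handle and the file name are placeholders):

def handle(line):
    pass  # stand-in for the real per-line processing

# iterating over the file object reads lazily: only one line is held in memory at a time
with open("huge_log.txt") as f:
    for line in f:
        handle(line)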

How do I read a 10gb file in Python?

Python's readlines() method returns a list in which each item is a line of the file, but it reads the entire file into memory, so it is only practical when the file is small. For a 10GB file, iterate over the file object line by line or read it in fixed-size chunks instead.
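A rough sketch of chunked reading (the file name and chunk size are arbitrary):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per read; tune to your memory budget

with open("ten_gigabytes.bin", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:  # an empty bytes object means end of file
            break
        # process the chunk here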

How do you make Python files load faster?

If you know that your file contains 10^6 rows, you could preallocate the list. It should be faster than appending to it in each iteration. Just use features = [None] * (10 ** 6) to initialize your list.
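A short sketch of that idea (the file name is a placeholder, and it assumes the file has exactly 10^6 lines):

N = 10 ** 6
features = [None] * N  # preallocate instead of growing the list with append()

with open("data.txt") as f:
    for i, line in enumerate(f):
        features[i] = line.rstrip("\n")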


1 Answer

Will starting multiple Python processes to run the script take advantage of the other cores?

Yes, it will, if the task is CPU-bound. This is probably the easiest option. However, don't spawn a single process per file or per directory; consider using a tool such as parallel(1) and let it spawn something like two processes per core.

Another option is to use something like Gearman or Beanstalk to farm out the work to other machines.

That might work. Also, have a look at the Python binding for ZeroMQ; it makes distributed processing pretty easy.
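As a rough illustration (not the author's code), a PUSH/PULL pipeline with pyzmq might look like this; the host name, port, and file names are made up:

import zmq

# --- distributor side (run on one machine) ---
ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.bind("tcp://*:5557")                      # port number is arbitrary
for path in ["img001.jpg", "img002.jpg"]:        # hypothetical work items
    sender.send_string(path)

# --- worker side (run on each processing machine) ---
ctx = zmq.Context()
receiver = ctx.socket(zmq.PULL)
receiver.connect("tcp://distributor-host:5557")  # hypothetical host name
while True:
    path = receiver.recv_string()
    # read the image's metadata and store it in the database here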

I've taken a look at the multiprocessing library but I'm not sure how I can utilize it.

Define a function, say process, that reads the images in a single directory, connects to the database and stores the metadata. Let it return a boolean indicating success or failure. Let directories be the list of directories to process. Then

import multiprocessing

# one worker process per core; process() is applied to each directory in parallel
pool = multiprocessing.Pool(multiprocessing.cpu_count())
success = all(pool.imap_unordered(process, directories))

will process all the directories in parallel. You can also do the parallelism at the file level if you want; that just needs a bit more tinkering.
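A sketch of the file-level variant, assuming a hypothetical process_file function and made-up directory paths:

import multiprocessing
import os

def process_file(path):
    # stand-in: read one image's metadata, store it, return True on success
    return True

def all_files(directories):
    for d in directories:
        for name in os.listdir(d):
            yield os.path.join(d, name)

if __name__ == "__main__":
    directories = ["/data/images/a", "/data/images/b"]  # hypothetical paths
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    # a chunksize keeps per-task overhead low when there are many small files
    success = all(pool.imap_unordered(process_file, all_files(directories), chunksize=64))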

Note that this will stop at the first failure; making it fault-tolerant takes a bit more work.
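One way to keep going and report the failures at the end (a sketch; it assumes process catches its own exceptions and returns False instead of raising):

# imap preserves order, unlike imap_unordered, so results line up with directories
results = pool.imap(process, directories)
failed = [d for d, ok in zip(directories, results) if not ok]
if failed:
    print("Failed directories:", failed)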

answered Sep 27 '22 by Fred Foo