
Batch processing on multiple cores

I want to do batch processing of files on multiple cores. I have the following scenario:

  1. I have 20 files.
  2. I have a function that takes a filename, processes it, and produces an integer result. I want to apply the function to all 20 files, compute the integer output for each of them, and finally sum the individual outputs and print the total.
  3. Since I have 4 cores, I can only process 4 files at a time. Thus I want to run 5 rounds of processing 4 files at a time (4*5 = 20).
  4. That is, I want to create 4 processes, each processing 5 files one after another (the 1st process handles files 0, 4, 8, 12, 16, the 2nd process handles files 1, 5, 9, 13, 17, etc.), as sketched below.

How do I achieve this? I am confused by multiprocessing.Pool(), multiprocessing.Process() and various other options.
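For concreteness, here is a minimal sketch of the exact layout in step 4, using bare multiprocessing.Process, with a placeholder process_file function and a Queue to collect the per-worker partial sums:

from multiprocessing import Process, Queue

def process_file(filename):
    return len(filename)  # placeholder for the real per-file computation

def worker(filenames, out_queue):
    # Each worker processes its files one after another and
    # reports a single partial sum.
    out_queue.put(sum(process_file(f) for f in filenames))

if __name__ == '__main__':
    files = ['%d.txt' % n for n in range(20)]
    q = Queue()
    # Worker i gets files i, i+4, i+8, ... (round-robin, as in step 4).
    workers = [Process(target=worker, args=(files[i::4], q)) for i in range(4)]
    for p in workers:
        p.start()
    total = sum(q.get() for _ in workers)
    for p in workers:
        p.join()
    print(total)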

Thanks.

asked Apr 12 '13 by abhinavkulkarni



3 Answers

In order to demonstrate Pool, I'm going to suppose your worker function, which consumes a filename and produces a number, is named work, and that the 20 files are named 1.txt, ..., 20.txt. One way to set this up is as follows:

from multiprocessing import Pool

if __name__ == '__main__':
    # work is your per-file function; guard the Pool creation so that
    # worker processes don't re-execute this block on import.
    pool = Pool(processes=4)
    result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)))
    print(sum(result.get()))

This method will do the work of steps 3 and 4 for you.
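Note that Pool hands files to workers one at a time as they become free, which already gives you the "4 at a time" behavior of step 3. If you would rather have each worker take a fixed block of 5 files, map_async accepts a chunksize argument (a sketch, reusing pool and work from above):

result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)), chunksize=5)
print(sum(result.get()))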

answered by Jared


It's pretty simple.

from multiprocessing import Pool

def process_file(filename):
    # Placeholder: do the real per-file work here and return its result.
    return filename

if __name__ == '__main__':
    pool = Pool()
    files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # imap yields results lazily, in the order the inputs were given.
    results = pool.imap(process_file, files)

    for result in results:
        print(result)

Pool automatically defaults to the number of processor cores on your machine. Also, make sure that your processing function is importable from the top level of the module and that your multiprocessing code sits inside the if __name__ == '__main__': block. Otherwise, every worker process will re-import the module and spawn workers of its own, creating a fork bomb that locks up your computer.
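To get the single total the original question asks for, the lazy iterator can be summed directly; imap_unordered is a drop-in alternative when the order of results doesn't matter (a sketch, reusing the names above):

total = sum(pool.imap_unordered(process_file, files))
print(total)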

answered by Blender


Although Jared's answer is great, I personally would use a ProcessPoolExecutor from the concurrent.futures module and not even worry about multiprocessing:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    result = sum(executor.map(process_file, files))

When things get a little more complicated, the Future object, or futures.as_completed, can be really nifty compared to the multiprocessing equivalents. When things get a lot more complicated, multiprocessing is a whole lot more flexible and powerful. But when it's this trivial, it really is almost hard to tell the difference.
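For instance, summing results as soon as each file finishes, in whatever order they complete (a sketch, assuming the same process_file and files as in the previous answer):

from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=4) as executor:
    # submit returns one Future per file; as_completed yields each
    # future as soon as its result is ready.
    futures = [executor.submit(process_file, f) for f in files]
    total = sum(f.result() for f in as_completed(futures))
print(total)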

answered by abarnert