I want to do batch processing of files on multiple cores. I have the following scenario:
How do I achieve this? I am confused by multiprocessing.Pool(), multiprocessing.Process(), and various other options.
Thanks.
In order to demonstrate Pool, I'm going to suppose your working function, which consumes a filename and produces a number, is named work, and that the 20 files are labeled 1.txt, ..., 20.txt. One way to set this up would be as follows:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)
    # Distribute the 20 filenames across the 4 worker processes
    result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)))
    print(sum(result.get()))  # get() blocks until every file has been processed
This method will do the work of steps 3 and 4 for you.
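The work function itself is assumed rather than defined; as a sketch, it could be anything that takes a filename and returns a number. The line-counting body here is just a placeholder:

def work(filename):
    # Placeholder: count the lines in the file.
    # Replace with whatever per-file computation you actually need.
    with open(filename) as f:
        return sum(1 for _ in f)

Note that since the result is fetched immediately with get(), plain pool.map(work, ...) would behave the same here; map_async only pays off if you have other work to do while the pool runs.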
It's pretty simple.
from multiprocessing import Pool

def process_file(filename):
    # Placeholder: replace with the real per-file work
    return filename

if __name__ == '__main__':
    pool = Pool()
    files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    # imap yields results one at a time, in order, as workers finish
    results = pool.imap(process_file, files)
    for result in results:
        print(result)
Pool automatically defaults to the number of processor cores that you have. Also, make sure that your processing function is importable from the file and that your multiprocessing code is inside the if __name__ == '__main__': block. If not, you'll create a fork bomb and lock up your computer.
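As a small illustrative sketch of that default (square is just a stand-in function, not from the original answer): Pool() with no argument starts os.cpu_count() workers.

import os
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == '__main__':
    print("cores:", os.cpu_count())  # Pool() starts this many workers
    with Pool() as pool:             # same as Pool(processes=os.cpu_count())
        print(pool.map(square, range(10)))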
Although Jared's answer is great, I personally would use a ProcessPoolExecutor from the concurrent.futures module and not even worry about multiprocessing:
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    result = sum(executor.map(process_file, files))
When it gets a little more complicated, the Future object, or concurrent.futures.as_completed, can be really nifty compared to the multiprocessing equivalents. When it gets a lot more complicated, multiprocessing is a whole lot more flexible and powerful. But when it's this trivial, really, it's almost hard to tell the difference.
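For instance, here's a minimal sketch of as_completed, which yields each result as soon as its worker finishes rather than in submission order (process_file and files are assumed to be defined as in Jared's answer):

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_file(filename):
    return filename

if __name__ == '__main__':
    files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    with ProcessPoolExecutor(max_workers=4) as executor:
        # submit() returns one Future per task; as_completed yields
        # them in completion order, not submission order
        futures = {executor.submit(process_file, f): f for f in files}
        for future in as_completed(futures):
            print(futures[future], "->", future.result())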