I'm looking for a simple process-based parallel map for python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
Parallel processing increases the number of tasks your program can run at the same time, which reduces the overall processing time and helps with large-scale problems.
To parallelize a loop, we can use the multiprocessing package in Python, which supports creating a child process at the request of another, ongoing process. The multiprocessing module can be used instead of a for loop to execute an operation on every element of an iterable.
In the sketch below, we first import the Process class and create a Process object that targets a display() function. The process is started with the start() method and waited on with the join() method. Arguments can be passed to the target function using the args keyword.
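A minimal, self-contained sketch of that pattern (display() is just an illustrative worker function, not something from the standard library):
from multiprocessing import Process

def display(name):
    # Worker function that runs in a separate child process.
    print("Hello,", name)

if __name__ == "__main__":
    # Create the child process and pass an argument via the args keyword.
    p = Process(target=display, args=("world",))
    p.start()   # launch the child process
    p.join()    # wait for it to finish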
It seems like what you need is the map method of multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready. This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x): return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing

pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
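One caveat: on platforms that start worker processes with the "spawn" method (Windows, and macOS on recent Python versions), the pool must be created under an if __name__ == "__main__": guard, and using the pool as a context manager closes it cleanly. A minimal sketch of the same example with those details added:
import multiprocessing

def f(x):
    return x**2

if __name__ == "__main__":
    # The guard is required where child processes are started via "spawn".
    with multiprocessing.Pool() as pool:
        print(pool.map(f, range(10)))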
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This ensures that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x * x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note: f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results.
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e., in approximately 3 seconds.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
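For example, to run the same script against an existing cluster, you would, to my understanding, only need to change how ray.init() is called (assuming a cluster head node has already been started, e.g. with ray start --head):
# Connect to an already running Ray cluster instead of starting a local one.
ray.init(address="auto")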