I have a short piece of code that uses the multiprocessing package and works fine on my local machine.
When I uploaded it to AWS Lambda and ran it there, I got the following error (stack trace trimmed):
[Errno 38] Function not implemented: OSError
Traceback (most recent call last):
  File "/var/task/recorder.py", line 41, in record
    pool = multiprocessing.Pool(10)
  File "/usr/lib64/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 138, in __init__
    self._setup_queues()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 234, in _setup_queues
    self._inqueue = SimpleQueue()
  File "/usr/lib64/python2.7/multiprocessing/queues.py", line 354, in __init__
    self._rlock = Lock()
  File "/usr/lib64/python2.7/multiprocessing/synchronize.py", line 147, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/usr/lib64/python2.7/multiprocessing/synchronize.py", line 75, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 38] Function not implemented
Can it be that part of Python's core packages is not implemented? I have no idea what I am running on underneath, so I can't log in there and debug.
Any ideas how I can run multiprocessing on Lambda?
If you deploy Python code to an AWS Lambda function, the multiprocessing functions in the standard library such as multiprocessing.
Lambda supports Python 2.7 and Python 3.6, both of which have multiprocessing and threading modules. The multiprocessing module can use multiple cores, so it is the better choice, especially for CPU-intensive workloads.
As far as I can tell, multiprocessing won't work on AWS Lambda because the execution environment/container is missing /dev/shm; see https://forums.aws.amazon.com/thread.jspa?threadID=219962 (login may be required).
No word (that I can find) on if/when Amazon will change this. I also looked at other libraries, e.g. joblib (https://pythonhosted.org/joblib/parallel.html), which will fall back to /tmp (which we know DOES exist) if it can't find /dev/shm, but that doesn't actually solve the problem.
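To confirm this from inside a function without shell access, something like the following probe might help. It is only a sketch: the helper names semaphores_available and describe_environment are made up for illustration. It checks whether /dev/shm exists and whether a multiprocessing lock can be created, which is the step that fails in the traceback from the question.

```python
import multiprocessing
import os


def semaphores_available():
    """Return True if multiprocessing can create a lock in this environment."""
    try:
        multiprocessing.Lock()
    except (OSError, ImportError):
        # OSError covers "[Errno 38] Function not implemented";
        # ImportError covers Python builds without semaphore support.
        return False
    return True


def describe_environment():
    """Collect the two facts relevant to the error in the question."""
    return {
        "dev_shm_exists": os.path.isdir("/dev/shm"),
        "semaphores_available": semaphores_available(),
    }
```

Returning describe_environment() from the Lambda handler (or logging it) should show whether Pool and Queue have any chance of working there.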
You can run routines in parallel on AWS Lambda using Python's multiprocessing module, but you can't use Pools or Queues, as noted in other answers. A workable solution is to use Process and Pipe, as outlined in this article: https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/
While the article definitely helped me get to a solution (shared below), there are a few things to be aware of. First, the Process and Pipe based solution is not as fast as the built-in map function in Pool, though I did see nearly linear speed-ups as I increased the available memory/CPU resources in my Lambda function. Second, there is a fair bit of management that has to be undertaken when developing multiprocessing functions in this way. I suspect this is at least partly why my solution is slower than the built-in methods. If anyone has suggestions to speed it up, I'd love to hear them! Finally, while the article notes that multiprocessing is useful for offloading asynchronous processes, there are other reasons to use multiprocessing, such as lots of intensive math ops, which is what I was trying to do. In the end I was happy enough with the performance improvement, as it was much better than sequential execution!
The code:
# Python 3.6
import multiprocessing
from multiprocessing import Pipe, Process
from multiprocessing.connection import wait


def myWorkFunc(data, connection):
    result = None
    # Do some work and store it in result
    if result:
        connection.send([result])
    else:
        connection.send([None])


def myPipedMultiProcessFunc(iterable):
    # Get number of available logical cores
    plimit = multiprocessing.cpu_count()

    # Setup management variables
    results = []
    parent_conns = []
    processes = []
    pcount = 0

    for data in iterable:
        # Create the pipe for parent-child process communication
        parent_conn, child_conn = Pipe()
        # Create the process, pass data to be operated on and connection
        process = Process(target=myWorkFunc, args=(data, child_conn))
        parent_conns.append(parent_conn)
        processes.append(process)
        process.start()
        pcount += 1

        if pcount == plimit:
            # There is not currently room for another process
            # Wait until there are results in the Pipes
            finishedConns = wait(parent_conns)
            # Collect the results and remove the connection as processing
            # the connection again will lead to errors
            for conn in finishedConns:
                results.append(conn.recv()[0])
                parent_conns.remove(conn)
                # Decrement pcount so we can add a new process
                pcount -= 1

    # Ensure all remaining active processes have their results collected
    for conn in parent_conns:
        results.append(conn.recv()[0])
        conn.close()

    # Reap the child processes now that all results are collected
    for process in processes:
        process.join()

    # Process results as needed
    return results