 

Python multiprocessing: sharing memory vs passing arguments

I'm trying to get my head around the most efficient and least memory-consuming way to share the same data source between different processes.

Imagine the following code, which simplifies my problem.

import pandas as pd
import numpy as np
from multiprocessing import Pool

# method #1
def foo(i): return data[i]
if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    print(pool.map(foo, [10, 134, 8, 1]))

# method #2
def foo(args):
    data, i = args
    return data[i]
if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    print(pool.map(foo, [(data, 10), (data, 134), (data, 8), (data, 1)]))

The first method uses a global variable (this works only on Linux/OSX, not on Windows), which is then accessed by the worker function. In the second method I'm passing "data" as part of the arguments.

In terms of memory used during processing, will there be a difference between the two methods?

# method #3
def foo(args):
    data, i = args
    return data[i]
if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2)
    # reduce the size of the argument passed
    data1 = data[:1000]
    print(pool.map(foo, [(data1, 10), (data1, 134), (data1, 8), (data1, 1)]))

In the third method, rather than passing all of "data", since we know we'll only be using the first records, I pass only the first 1000 records. Will this make any difference?

Background: the problem I'm facing is a big dataset of about 2 million rows (4 GB in memory) which is then handed to four subprocesses for some elaboration. Each elaboration affects only a small portion of the data (20,000 rows), and I'd like to minimize the memory used by each concurrent process.

Alessandro Mariani asked Mar 04 '15




1 Answer

I'm going to start with the second and third methods, because they're easier to explain.

When you pass arguments to pool.map or pool.apply, the arguments are pickled, sent to the child process over a pipe, and then unpickled in the child. This of course requires two completely distinct copies of the data structures you're passing. It can also lead to slow performance with large data structures, since pickling/unpickling large objects can take quite a while.

With the third method, you're just passing smaller data structures than in method two. This should perform better, since you don't need to pickle/unpickle as much data.
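
To get a feel for how much is actually pushed through the pipe per task, you can measure the pickled size of the argument tuples yourself. This is just a rough sketch added for illustration (the exact byte counts depend on your pandas/NumPy versions):

import pickle
import numpy as np
import pandas as pd

data = pd.Series(np.array(range(100000)))
data1 = data[:1000]

# payload of a single task under method #2 vs method #3
print(len(pickle.dumps((data, 10))))   # pickles the full 100000-element Series
print(len(pickle.dumps((data1, 10))))  # pickles only the 1000-element slice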

One other note: passing data multiple times is definitely a bad idea, because each copy gets pickled and unpickled over and over. You want to pass it to each child once. Method 1 is a good way to do that, or alternatively you can use the initializer keyword argument to explicitly pass data to the children. This uses fork on Linux and pickling on Windows to pass the data to the child process:

import pandas as pd
import numpy as np
from multiprocessing import Pool

data = None

def init(_data):
    global data
    data = _data  # data is now accessible in all children, even on Windows

# method #1
def foo(i): return data[i]
if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    pool = Pool(2, initializer=init, initargs=(data,))
    print(pool.map(foo, [10, 134, 8, 1]))

Using the first method, you're leveraging the behavior of fork to let the child process inherit the data object. fork has copy-on-write semantics, which means the memory is actually shared between the parent and its children until you try to write to it in the child. When you do write, the memory page containing the data you're writing must be copied, to keep it separate from the parent's version.
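
If you rely on this fork-based inheritance, it can be worth requesting the fork start method explicitly, since newer Python versions default to spawn on macOS. Here's a minimal sketch of that idea (my addition, using multiprocessing.get_context, available since Python 3.4; fork is not available on Windows):

import pandas as pd
import numpy as np
import multiprocessing as mp

def foo(i):
    return data[i]  # read-only access to the inherited global

if __name__ == '__main__':
    data = pd.Series(np.array(range(100000)))
    ctx = mp.get_context('fork')        # force fork so the globals are inherited
    with ctx.Pool(2) as pool:
        # the workers are forked here and share `data` copy-on-write
        print(pool.map(foo, [10, 134, 8, 1]))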

Now, this sounds like a slam dunk: no need to copy anything as long as we don't write to it, which is surely faster than the pickle/unpickle approach. And that's usually the case. However, in practice Python is internally writing to its objects even when you wouldn't really expect it to. Because Python uses reference counting for memory management, it needs to increment the internal reference counter on each object every time it's passed to a method, assigned to a variable, etc. That means the memory page containing the reference count of each object passed to your child process will end up getting copied. This will definitely be faster and use less memory than pickling the data multiple times, but it isn't quite completely shared, either.
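
You can observe the reference-count churn described above with sys.getrefcount. This little sketch (added for illustration) just shows that even "read-only" use of an object writes to its header:

import sys

data = list(range(10))
print(sys.getrefcount(data))   # e.g. 2: the name `data` plus getrefcount's own argument

def reader(obj):
    # simply receiving `obj` as a parameter bumped its reference count,
    # i.e. wrote to the memory page holding the object's header
    return sys.getrefcount(obj)

print(reader(data))            # one higher than above while the call is active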

dano answered Sep 18 '22