multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes

Tags: python, pandas, multiprocessing

I am using Python multiprocessing, more precisely

    from multiprocessing import Pool

    p = Pool(15)

    args = [(df, config1), (df, config2), ...]  # list of args - df is the same object in each tuple
    res = p.map_async(func, args)  # func is some arbitrary function
    p.close()
    p.join()

This approach consumes a huge amount of memory, eating up pretty much all my RAM (at which point it gets extremely slow, making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying:

    shared_df = multiprocessing.Value(pandas.DataFrame, df)
    args = [(shared_df, config1), (shared_df, config2), ...]

(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (same as Sharing a complex object between Python processes?, to which I unfortunately don't understand the answer).

I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually even the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?

640

asked Mar 18 '14 17:03

Anne

2 Answers

The first argument to Value is typecode_or_type. That is defined as:

typecode_or_type determines the type of the returned object: it is either a ctypes type or a one character typecode of the kind used by the array module. *args is passed on to the constructor for the type.

The key phrase is "it is either a ctypes type or a one character typecode". So you simply cannot put a pandas dataframe in a Value; it has to be a ctypes type.
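To make the quoted definition concrete, here is a minimal sketch of what Value does and does not accept. The variable names are only illustrative, and dict stands in for any non-ctypes Python type such as a dataframe:

```python
import ctypes
from multiprocessing import Value

# A one-character typecode, as used by the array module ("i" = signed int).
counter = Value("i", 0)

# An explicit ctypes type works as well.
ratio = Value(ctypes.c_double, 0.5)

# Updates go through .value; the lock guards the read-modify-write.
with counter.get_lock():
    counter.value += 1

# An arbitrary Python type is rejected, because ctypes cannot size it.
try:
    Value(dict, {})
    error = None
except TypeError as exc:
    error = str(exc)

print(counter.value, ratio.value, error)
```

So Value is meant for single C-level scalars (counters, flags and the like), not for composite Python objects.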

You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There are a few different ways to end up in the same place; probably the easiest is to just plop your dataframe into the manager's Namespace.

    from multiprocessing import Manager

    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = my_dataframe

    # now just give your processes access to ns, i.e. most simply
    # p = Process(target=worker, args=(ns, work_unit))

Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace, which is cleaner.

One thing I didn't/won't cover is events and signaling: if your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples that also cover, in a bit more detail, how to use the manager's Namespace.

(Note that none of this addresses whether multiprocessing is going to result in tangible performance benefits; this is just giving you the tools to explore that question.)
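For the Pool-based setup in the question, the Namespace approach might look roughly like the sketch below. The worker function, the task tuples and the dict of lists standing in for the dataframe are all assumptions made so the example runs without pandas; a real dataframe would be assigned to ns.df in the same way. One caveat, as far as I understand the proxy mechanics: every access to ns.df fetches a pickled copy from the manager process, so this gives one authoritative instance rather than true zero-copy sharing.

```python
from multiprocessing import Manager, Pool


def worker(task):
    """Fetch the shared object from the namespace and compute on one column."""
    ns, column = task
    table = ns.df  # fetched through the proxy; arrives as a local copy
    return column, sum(table[column])


def main():
    with Manager() as mgr:
        ns = mgr.Namespace()
        # Hypothetical stand-in for the dataframe: a dict of columns.
        ns.df = {"a": [1, 2, 3], "b": [10, 20, 30]}
        tasks = [(ns, "a"), (ns, "b")]
        with Pool(2) as pool:
            # The namespace proxy is picklable, so it can travel with each task.
            return dict(pool.map(worker, tasks))


if __name__ == "__main__":
    print(main())
```

Reading ns.df once per task and holding it in a local variable, as worker does here, avoids paying the transfer cost on every attribute access.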

173

answered Sep 22 '22 21:09

roippi

You can use Array instead of Value for storing your dataframe.

The solution below converts a pandas dataframe to an object that stores its data in shared memory:

    import numpy as np
    import pandas as pd
    import multiprocessing as mp
    import ctypes

    # the original dataframe is df; store the column/dtype pairs
    df_dtypes_dict = dict(zip(df.columns, df.dtypes))

    # declare a shared Array with data from df
    mparr = mp.Array(ctypes.c_double, df.values.reshape(-1))

    # create a new df based on the shared array
    df_shared = pd.DataFrame(np.frombuffer(mparr.get_obj()).reshape(df.shape),
                             columns=df.columns).astype(df_dtypes_dict)

If you now share df_shared across processes, no additional copies will be made. For your case:

    def fun(config):
        # df_shared is global to the script
        df_shared.apply(config)  # whatever compute you do with df/config

    # create the pool after fun is defined so that forked workers can see it
    pool = mp.Pool(15)

    config_list = [config1, config2]
    res = pool.map_async(fun, config_list)
    pool.close()
    pool.join()
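The snippet above relies on the shared object being a module-level global that forked workers inherit. Where processes are spawned rather than forked (Windows, and macOS by default), that global is not carried over. A hedged alternative is to hand the shared Array to each worker through the Pool initializer. In this sketch the names init_worker, weighted_sum and _shared are hypothetical, and a short list of floats stands in for the dataframe's values:

```python
import multiprocessing as mp

_shared = None  # filled in once per worker process by init_worker


def init_worker(shared_array):
    """Keep a reference to the shared Array in a module-level global."""
    global _shared
    _shared = shared_array


def weighted_sum(weight):
    """Stand-in for real work: read the shared data and apply a config value."""
    return weight * sum(_shared[:])


def main():
    # Stand-in for the dataframe's values, stored once in shared memory.
    data = mp.Array("d", [1.0, 2.0, 3.0])
    with mp.Pool(2, initializer=init_worker, initargs=(data,)) as pool:
        return pool.map(weighted_sum, [1.0, 10.0])


if __name__ == "__main__":
    print(main())
```

Only the small config values travel with each task; the Array itself is passed once, when each worker starts.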

This is also particularly useful if you use pandarallel, for example:

    # this will not explode in memory
    from pandarallel import pandarallel

    pandarallel.initialize()
    df_shared.parallel_apply(your_fun, axis=1)

Note: with this solution you end up with two dataframes (df and df_shared), which together consume twice the memory and take a long time to initialise. It might be possible to read the data directly into shared memory.
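One possible direction for that last point, sketched under assumptions: on Python 3.8+ the standard library's multiprocessing.shared_memory module can back a NumPy array with a named block that other processes attach to without copying. This is a different technique from the mp.Array approach above, and the helper names create_shared_array and attach_shared_array are hypothetical. Filling the block still copies the source once, but the original can be dropped afterwards.

```python
import numpy as np
from multiprocessing import shared_memory


def create_shared_array(source):
    """Copy `source` into a new shared-memory block; return (block, view)."""
    shm = shared_memory.SharedMemory(create=True, size=source.nbytes)
    view = np.ndarray(source.shape, dtype=source.dtype, buffer=shm.buf)
    view[:] = source  # the one and only copy
    return shm, view


def attach_shared_array(name, shape, dtype):
    """Attach to an existing block by name; the contents are not copied."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


if __name__ == "__main__":
    data = np.arange(6, dtype=np.float64).reshape(2, 3)
    shm, shared = create_shared_array(data)
    # A worker would do this part, given only the name, shape and dtype.
    peer, same = attach_shared_array(shm.name, data.shape, data.dtype)
    same[0, 0] = 42.0
    print(shared[0, 0])  # the write is visible through the first view
    # Drop the array views before closing, then release the block.
    del same, shared
    peer.close()
    shm.close()
    shm.unlink()
```

A worker only needs the block's name, shape and dtype, all of which are cheap to send. Whether a dataframe built on top of such an array stays copy-free depends on the pandas version and on later calls such as astype, so that part is worth checking with np.shares_memory.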

answered Sep 22 '22 21:09

toine
