
Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.

Eg.

from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner appears to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).

asked Nov 09 '15 by autodidacticon

People also ask

Does Pandas support parallel processing?

The multiprocessing module lets you perform parallel processing on datasets with pandas: by splitting the work across processes you can reach 100% core utilization and speed things up considerably.
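A minimal sketch of that idea, using the standard multiprocessing module to split a DataFrame into chunks and process them in parallel (the column name and the per-chunk function are made up for illustration):

import multiprocessing as mp

import numpy as np
import pandas as pd

def chunk_sum(chunk):
    # hypothetical per-chunk work: sum a numeric column
    return chunk["value"].sum()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    chunks = np.array_split(df, mp.cpu_count())    # one chunk per core
    with mp.Pool() as pool:
        partial = pool.map(chunk_sum, chunks)      # each chunk is pickled to a worker
    total = sum(partial)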

Is Pandas efficient for large data sets?

Use efficient datatypes. The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
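For example, a low-cardinality text column can be converted to the category dtype, which typically cuts its memory use dramatically (the column and values here are illustrative):

import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "FR", "US"] * 250_000})
print(df["country"].memory_usage(deep=True))    # object dtype: tens of MB

df["country"] = df["country"].astype("category")
print(df["country"].memory_usage(deep=True))    # category dtype: a fraction of that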

How big is too big for a Pandas DataFrame?

The short answer is yes, there is a size limit for pandas DataFrames, but it's so large you will likely never have to worry about it. The long answer is the size limit for pandas DataFrames is 100 gigabytes (GB) of memory instead of a set number of cells.


2 Answers

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and it also requires many times the memory of the original DataFrame.

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to pull out subsets of the data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
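A rough sketch of that pattern, reusing df and another_function from the question; the file names and the per-id work are placeholders:

import pandas as pd
from joblib import Parallel, delayed

# write once, in table format, indexing the id column so it can be queried
df.to_hdf("data.h5", key="df", format="table", data_columns=["id"])

def process(one_id):
    # each worker reads only the rows it needs instead of receiving the whole frame
    subset = pd.read_hdf("data.h5", key="df", where=f"id == {one_id}")
    result = subset.apply(another_function)
    # persist per-worker output so it can be consolidated in a final step
    result.to_hdf(f"result_{one_id}.h5", key="result")

Parallel(n_jobs=8)(delayed(process)(i) for i in df["id"].unique())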

An alternative would be to explore numba.vectorize with target='parallel'. This requires working with NumPy arrays rather than pandas objects, so it also carries some complexity costs.
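A hedged sketch of what that looks like; the element-wise computation is made up, and the point is that the inputs are plain NumPy arrays extracted from the DataFrame:

import numpy as np
import pandas as pd
from numba import float64, vectorize

@vectorize([float64(float64)], target="parallel")
def transform(x):
    # placeholder element-wise computation, compiled and run across cores
    return x * 2.0 + 1.0

df = pd.DataFrame({"a_lot_of_data": np.random.rand(10_000_000)})
out = transform(df["a_lot_of_data"].to_numpy())   # no pickling, no per-process copies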

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
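For what it's worth, a minimal sketch of what a dask.dataframe version of the question's loop could look like, again reusing df and another_function from the question and treating the partitioning and per-partition work as placeholders:

import dask.dataframe as dd

# split the pandas DataFrame into partitions that can be processed in parallel
ddf = dd.from_pandas(df, npartitions=8)

# run the placeholder per-row work on every partition; compute() triggers execution
result = ddf.map_partitions(
    lambda part: part.apply(another_function, axis=1)
).compute(scheduler="processes")    # separate processes rather than threads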

answered by Kevin S


Python multiprocessing is typically done with separate processes, as you noted, which means the processes don't share memory. A potential workaround is to get things working with np.memmap, as mentioned a little farther down in the joblib docs, though dumping to disk obviously adds some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
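A small sketch of the memmap route, again borrowing df from the question: the numeric column is dumped to disk once and every worker maps the same file read-only instead of receiving a pickled copy (the file name and the per-element work are placeholders):

import numpy as np
from joblib import Parallel, delayed, dump, load

# dump the numeric data once; workers will memory-map it rather than copy it
data = df["a_lot_of_data"].to_numpy()
dump(data, "data.mmap")
shared = load("data.mmap", mmap_mode="r")    # read-only memory map

def process(i, arr):
    # placeholder per-element work on the shared, memory-mapped array
    return arr[i] * 2.0

results = Parallel(n_jobs=8)(delayed(process)(i, shared) for i in range(len(shared)))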

answered by Randy