
Large Pandas Dataframe parallel processing

I am accessing a very large Pandas dataframe as a global variable. This variable is accessed in parallel via joblib.

Eg.

from joblib import Parallel, delayed

df = db.query("select id, a_lot_of_data from table")

def process(id):
    temp_df = df.loc[id]
    temp_df.apply(another_function)

Parallel(n_jobs=8)(delayed(process)(id) for id in df['id'].to_list())

Accessing the original df in this manner appears to copy the data across processes. This is unexpected, since the original df isn't being altered in any of the subprocesses (or is it?).

asked Nov 09 '15 by autodidacticon

People also ask

Does Pandas support parallel processing?

The multiprocessing module lets you perform parallel processing on datasets with pandas: by splitting the work across processes you can reach 100% core utilization and speed things up considerably.
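A minimal sketch of that idea, using the standard multiprocessing module to split a DataFrame into chunks and process them in parallel (the column name and the per-chunk function are made up for illustration):

import multiprocessing as mp

import numpy as np
import pandas as pd

def chunk_sum(chunk):
    # hypothetical per-chunk work: sum a numeric column
    return chunk["value"].sum()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    chunks = np.array_split(df, mp.cpu_count())    # one chunk per core
    with mp.Pool() as pool:
        partial = pool.map(chunk_sum, chunks)      # each chunk is pickled to a worker
    total = sum(partial)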

Is Pandas efficient for large data sets?

Use efficient datatypes. The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
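For example, a low-cardinality text column can be converted to the category dtype, which typically cuts its memory use dramatically (the column and values here are illustrative):

import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "FR", "US"] * 250_000})
print(df["country"].memory_usage(deep=True))    # object dtype: tens of MB

df["country"] = df["country"].astype("category")
print(df["country"].memory_usage(deep=True))    # category dtype: a fraction of that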

How big is too big for a Pandas DataFrame?

The short answer is yes, there is a size limit for pandas DataFrames, but it's so large you will likely never have to worry about it. The long answer is the size limit for pandas DataFrames is 100 gigabytes (GB) of memory instead of a set number of cells.


2 Answers

The entire DataFrame needs to be pickled and unpickled for each process created by joblib. In practice, this is very slow, and it also requires many times the memory of the original DataFrame.

One solution is to store your data in HDF (df.to_hdf) using the table format. You can then use select to pull out subsets of the data for further processing. In practice this will be too slow for interactive use. It is also very complex, and your workers will need to store their work so that it can be consolidated in the final step.
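A rough sketch of that pattern, reusing df and another_function from the question; the file names and the per-id work are placeholders:

import pandas as pd
from joblib import Parallel, delayed

# write once, in table format, indexing the id column so it can be queried
df.to_hdf("data.h5", key="df", format="table", data_columns=["id"])

def process(one_id):
    # each worker reads only the rows it needs instead of receiving the whole frame
    subset = pd.read_hdf("data.h5", key="df", where=f"id == {one_id}")
    result = subset.apply(another_function)
    # persist per-worker output so it can be consolidated in a final step
    result.to_hdf(f"result_{one_id}.h5", key="result")

Parallel(n_jobs=8)(delayed(process)(i) for i in df["id"].unique())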

An alternative would be to explore numba.vectorize with target='parallel'. This requires working with NumPy arrays rather than pandas objects, so it also carries some complexity costs.
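A hedged sketch of what that looks like; the element-wise computation is made up, and the point is that the inputs are plain NumPy arrays extracted from the DataFrame:

import numpy as np
import pandas as pd
from numba import float64, vectorize

@vectorize([float64(float64)], target="parallel")
def transform(x):
    # placeholder element-wise computation, compiled and run across cores
    return x * 2.0 + 1.0

df = pd.DataFrame({"a_lot_of_data": np.random.rand(10_000_000)})
out = transform(df["a_lot_of_data"].to_numpy())   # no pickling, no per-process copies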

In the long run, dask is hoped to bring parallel execution to Pandas, but this is not something to expect soon.
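For what it's worth, a minimal sketch of what a dask.dataframe version of the question's loop could look like, again reusing df and another_function from the question and treating the partitioning and per-partition work as placeholders:

import dask.dataframe as dd

# split the pandas DataFrame into partitions that can be processed in parallel
ddf = dd.from_pandas(df, npartitions=8)

# run the placeholder per-row work on every partition; compute() triggers execution
result = ddf.map_partitions(
    lambda part: part.apply(another_function, axis=1)
).compute(scheduler="processes")    # separate processes rather than threads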

answered by Kevin S


Python multiprocessing is typically done with separate processes, as you noted, which means the processes don't share memory. A potential workaround is to get things working with np.memmap, as mentioned a little farther down in the joblib docs, though dumping to disk obviously adds some overhead of its own: https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
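A small sketch of the memmap route, again borrowing df from the question: the numeric column is dumped to disk once and every worker maps the same file read-only instead of receiving a pickled copy (the file name and the per-element work are placeholders):

import numpy as np
from joblib import Parallel, delayed, dump, load

# dump the numeric data once; workers will memory-map it rather than copy it
data = df["a_lot_of_data"].to_numpy()
dump(data, "data.mmap")
shared = load("data.mmap", mmap_mode="r")    # read-only memory map

def process(i, arr):
    # placeholder per-element work on the shared, memory-mapped array
    return arr[i] * 2.0

results = Parallel(n_jobs=8)(delayed(process)(i, shared) for i in range(len(shared)))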

answered by Randy