
Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Update:

The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()

Original:

I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:

result = df.groupby('id').max().reset_index()

Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
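For illustration, a row-chunked version of that operation might look like the sketch below (the block size is arbitrary, and it assumes the encoded frame itself already fits in memory; only the groupby intermediates are bounded):

import pandas as pd

block_size = 500_000
partials = []
for start in range(0, len(df), block_size):
    block = df.iloc[start:start + block_size]
    partials.append(block.groupby('id').max())

# the max of the per-block maxima equals the global per-group max
result = pd.concat(partials).groupby(level=0).max().reset_index()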

What strategy could mitigate the memory issue?

asked Apr 26 '18 by OverflowingTheGlass


People also ask

How do I reduce panda memory usage?

Ways to optimize memory in pandas: instead of keeping the default data types, downcast them. For example, convert int64 columns to int8 and float64 columns to float32 or float16 where the value range allows. Downcasting without losing information can cut memory usage roughly in half.
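A rough sketch of downcasting (hypothetical column names and data; always check that the values fit the smaller type):

import numpy as np
import pandas as pd

df = pd.DataFrame({'flag': np.random.randint(0, 2, 1_000_000),   # int64 by default
                   'amount': np.random.rand(1_000_000)})          # float64 by default

df['flag'] = pd.to_numeric(df['flag'], downcast='unsigned')   # -> uint8
df['amount'] = df['amount'].astype('float32')                 # halves the float storage

print(df.memory_usage(deep=True))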

How big is too big for a Pandas DataFrame?

The short answer is yes, there is a practical size limit for pandas DataFrames, but it is set by the memory available on your machine rather than by a fixed number of cells. Once a DataFrame approaches the size of your RAM, operations that build intermediate copies, such as groupby aggregations, will start failing with memory errors.
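A quick way to gauge how close a frame is to that limit is to measure its actual footprint; for the frame in the question, roughly 2.7 million rows times ~4,000 uint8 columns is already on the order of 10 GB before any intermediate copies:

# actual footprint of the frame, including object-dtype columns
print(df.memory_usage(deep=True).sum() / 1e9, "GB")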

Is Pandas efficient for large data sets?

Use efficient data types. The default pandas data types are not the most memory-efficient. This is especially true for text columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
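For example, a low-cardinality text column can be stored as a categorical; a minimal sketch with made-up data:

import pandas as pd

s = pd.Series(['debit', 'credit'] * 500_000)   # object dtype, tens of MB
print(s.memory_usage(deep=True))

s_cat = s.astype('category')                   # integer codes plus a tiny lookup table
print(s_cat.memory_usage(deep=True))           # roughly 1 MB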

Is Groupby faster on index Pandas?

GroupBy with built-in aggregations such as mean and sum is much faster than GroupBy.apply and GroupBy.transform with user-defined functions, because the built-in aggregations are implemented in Cython rather than looping over groups in Python.
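The difference is easy to see by comparing a built-in aggregation with the equivalent user-defined function; a small sketch with synthetic data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 1000, 1_000_000),
                   'val': np.random.rand(1_000_000)})

fast = df.groupby('key')['val'].mean()                     # Cython-backed aggregation
slow = df.groupby('key')['val'].apply(lambda g: g.mean())  # Python-level loop over groups

assert np.allclose(fast, slow)   # same result, very different speed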


2 Answers

If you have any categorical columns in your data (rather than categories stored as object or string columns), make sure you use the observed=True option in your groupby call. This ensures it only creates rows for combinations that actually appear in the data, e.g. one row per observed customer_id, order_id pair, rather than creating n_custs * n_orders rows!

I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
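A minimal sketch of that approach, using the customer_id and order_id columns mentioned above (they must be converted to the category dtype first for observed=True to matter):

import pandas as pd

df['customer_id'] = df['customer_id'].astype('category')
df['order_id'] = df['order_id'].astype('category')

# observed=True keeps only key combinations that actually occur,
# instead of materializing the full cartesian product of categories
result = df.groupby(['customer_id', 'order_id'], observed=True).sum().reset_index()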

answered Oct 23 '22 by Heather Walker


You could use dask.dataframe for this task:

import dask.dataframe as dd

# partition the frame so each piece fits comfortably in memory;
# one partition per CPU core is a reasonable starting point
ddf = dd.from_pandas(df, npartitions=32)
result = ddf.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed over data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end in order for any computation to actually take place. The library is easy to install because it is written in pure Python for the most part.
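To see whether the computation is actually advancing (the update above mentions no visible progress), dask's local diagnostics can wrap the compute() call; a minimal sketch reusing the names from the update:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.from_pandas(encoded, npartitions=50)

# ProgressBar prints completion of the task graph as it executes
with ProgressBar():
    result = ddf.groupby('journal_entry').max().reset_index().compute()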

answered Oct 23 '22 by tobsecret