
Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Update:

The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()

Original:

I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:

result = df.groupby('id').max().reset_index()

Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
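For illustration, a row-chunked version of that operation might look like the sketch below (the block size is arbitrary, and it assumes the encoded frame itself already fits in memory; only the groupby intermediates are bounded):

import pandas as pd

block_size = 500_000
partials = []
for start in range(0, len(df), block_size):
    block = df.iloc[start:start + block_size]
    partials.append(block.groupby('id').max())

# the max of the per-block maxima equals the global per-group max
result = pd.concat(partials).groupby(level=0).max().reset_index()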

What strategy could mitigate the memory issue?

asked Apr 26 '18 by OverflowingTheGlass


People also ask

How do I reduce panda memory usage?

Ways to optimize memory in pandas: instead of keeping the default data types, downcast them. For example, convert int64 columns to int8 and float64 columns to float32 or float16 where the value range allows. Downcasting without losing information can cut memory usage roughly in half.
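A rough sketch of downcasting (hypothetical column names and data; always check that the values fit the smaller type):

import numpy as np
import pandas as pd

df = pd.DataFrame({'flag': np.random.randint(0, 2, 1_000_000),   # int64 by default
                   'amount': np.random.rand(1_000_000)})          # float64 by default

df['flag'] = pd.to_numeric(df['flag'], downcast='unsigned')   # -> uint8
df['amount'] = df['amount'].astype('float32')                 # halves the float storage

print(df.memory_usage(deep=True))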

How big is too big for a Pandas DataFrame?

The short answer is yes, there is a practical size limit for pandas DataFrames, but it is set by the memory available on your machine rather than by a fixed number of cells. Once a DataFrame approaches the size of your RAM, operations that build intermediate copies, such as groupby aggregations, will start failing with memory errors.
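A quick way to gauge how close a frame is to that limit is to measure its actual footprint; for the frame in the question, roughly 2.7 million rows times ~4,000 uint8 columns is already on the order of 10 GB before any intermediate copies:

# actual footprint of the frame, including object-dtype columns
print(df.memory_usage(deep=True).sum() / 1e9, "GB")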

Is Pandas efficient for large data sets?

Use efficient data types. The default pandas data types are not the most memory-efficient. This is especially true for text columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
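For example, a low-cardinality text column can be stored as a categorical; a minimal sketch with made-up data:

import pandas as pd

s = pd.Series(['debit', 'credit'] * 500_000)   # object dtype, tens of MB
print(s.memory_usage(deep=True))

s_cat = s.astype('category')                   # integer codes plus a tiny lookup table
print(s_cat.memory_usage(deep=True))           # roughly 1 MB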

Is Groupby faster on index Pandas?

GroupBy with built-in aggregations such as mean and sum is much faster than GroupBy.apply and GroupBy.transform with user-defined functions, because the built-in aggregations are implemented in Cython rather than looping over groups in Python.
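The difference is easy to see by comparing a built-in aggregation with the equivalent user-defined function; a small sketch with synthetic data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.random.randint(0, 1000, 1_000_000),
                   'val': np.random.rand(1_000_000)})

fast = df.groupby('key')['val'].mean()                     # Cython-backed aggregation
slow = df.groupby('key')['val'].apply(lambda g: g.mean())  # Python-level loop over groups

assert np.allclose(fast, slow)   # same result, very different speed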


2 Answers

If you have any categorical columns in your data (rather than categories stored as object or string columns), make sure you use the observed=True option in your groupby call. This ensures it only creates rows for combinations that actually appear in the data, e.g. one row per observed customer_id, order_id pair, rather than creating n_custs * n_orders rows!

I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
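A minimal sketch of that approach, using the customer_id and order_id columns mentioned above (they must be converted to the category dtype first for observed=True to matter):

import pandas as pd

df['customer_id'] = df['customer_id'].astype('category')
df['order_id'] = df['order_id'].astype('category')

# observed=True keeps only key combinations that actually occur,
# instead of materializing the full cartesian product of categories
result = df.groupby(['customer_id', 'order_id'], observed=True).sum().reset_index()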

answered Oct 23 '22 by Heather Walker


You could use dask.dataframe for this task:

import dask.dataframe as dd

# partition the frame so each piece fits comfortably in memory;
# one partition per CPU core is a reasonable starting point
ddf = dd.from_pandas(df, npartitions=32)
result = ddf.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed over data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end in order for any computation to actually take place. The library is easy to install because it is written in pure Python for the most part.
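To see whether the computation is actually advancing (the update above mentions no visible progress), dask's local diagnostics can wrap the compute() call; a minimal sketch reusing the names from the update:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.from_pandas(encoded, npartitions=50)

# ProgressBar prints completion of the task graph as it executes
with ProgressBar():
    result = ddf.groupby('journal_entry').max().reset_index().compute()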

answered Oct 23 '22 by tobsecret