
Pandas groupby with categories with redundant nan

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.

I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.
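To see the scale of the problem, here is a small sketch (the column names and category counts are made up for illustration; `observed=False` is passed explicitly to match the historical default on a recent pandas):

```python
import pandas as pd
import numpy as np

# Hypothetical illustration: 3 categorical columns with 10 categories each,
# but only 5 observed rows.
rng = np.random.default_rng(0)
cats = [f"cat_{i}" for i in range(10)]
df = pd.DataFrame({col: pd.Categorical(rng.choice(cats, 5), categories=cats)
                   for col in ["Group1", "Group2", "Group3"]})
df["Value"] = 1.0

# With observed=False (the historical default), the result holds one row per
# combination of categories: 10 ** 3 = 1000 rows, although only 5 rows exist.
out = df.groupby(["Group1", "Group2", "Group3"], observed=False)["Value"].sum()
print(len(out))  # 1000
```

With thousands of categories per column, that product term, not the data size, dominates the cost of the groupby.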

My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:

  • Recreating all the functionality via numpy.
  • Continually converting to strings/codes before groupby, reverting to categories later.
  • Making a tuple column from group columns, then group by the tuple column.

I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below. Instead of the 4 category combinations I want in the output, I end up with 12.

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False).sum()

#   Group1  Group2  Group3   Value
#   A       A       A          NaN
#   A       A       C          NaN
#   A       A       D          NaN
#   A       B       A          NaN
#   A       B       C        54.34
#   A       B       D       826.74
#   B       A       A       765.40
#   B       A       C       514.50
#   B       A       D          NaN
#   B       B       A          NaN
#   B       B       C          NaN
#   B       B       D          NaN

Bounty update

The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:

  1. Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
  2. Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
  3. Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 no-go routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is "pandas-compliant" to minimise / avoid loss of other pandas categorical functionality.
  4. A response from pandas development team to support and clarify existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?

Bounty update #2

To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.

asked Jan 27 '18 by jpp



2 Answers

Since pandas 0.23.0, the groupby method can take an observed parameter, which fixes this issue when set to True (the default is False). Below is the exact same code as in the question, with just observed=True added:

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()

# Output (only the observed combinations are kept):
#   Group1  Group2  Group3   Value
#   A       B       C        54.34
#   A       B       D       826.74
#   B       A       A       765.40
#   B       A       C       514.50

answered Oct 04 '22 by Ismael EL ATIFI


I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

# Group on the underlying integer codes rather than the categorical columns.
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()

# The code-based index levels come back as level_0, level_1, ...; restore names.
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)

# Turn the integer codes back into categorical columns.
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values,
                                            categories=df[col].values.categories)
result

So the answer to this felt more like a proper programming question than a typical pandas question. Under the hood, a categorical series is just a bunch of integers that index into an array of category names. I did the groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the integer codes back into a categorical column.
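The mechanics this answer relies on can be seen in isolation (a minimal sketch, separate from the answer's exact code):

```python
import pandas as pd

# A categorical series is stored as integer codes plus a categories index.
s = pd.Series(["A", "B", "A", "C"], dtype="category")
codes = s.cat.codes        # integer labels: 0, 1, 0, 2
cats = s.cat.categories    # Index(['A', 'B', 'C'])

# from_codes rebuilds the categorical from those integers,
# without re-hashing the strings.
restored = pd.Categorical.from_codes(codes, categories=cats)
print(list(restored))  # ['A', 'B', 'A', 'C']
```

Grouping on `codes` behaves like grouping on any plain integer column, so only the groups actually present in the data are produced.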

Group1  Group2  Group3   Value
A       B       C        54.34
A       B       D       826.74
B       A       A       765.40
B       A       C       514.50

I understand that this isn't exactly the answer you were looking for, but I've made my solution into a little function for people who have this problem in the future.

def categorical_groupby(df, group_cols, agg_function="sum"):
    """Do a groupby on a number of categorical columns via their integer codes."""
    result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    # Restore the original column names on the code-based index levels.
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # Convert the integer codes back into categorical columns.
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values,
                                                categories=df[col].values.categories)
    return result

call it like this:

df.pipe(categorical_groupby, group_cols)
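As a sanity check (assuming pandas >= 0.23, so that observed exists), grouping on the codes yields the same totals as observed=True on the question's data:

```python
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])
for col in group_cols:
    df[col] = df[col].astype('category')

# Group on the raw integer codes, and on the categories via observed=True.
by_codes = df.groupby([df[c].cat.codes for c in group_cols])['Value'].sum()
by_observed = df.groupby(group_cols, observed=True)['Value'].sum()

# Both keep only the 4 combinations that actually occur in the data.
print(len(by_codes))  # 4
print(all(abs(a - b) < 1e-9
          for a, b in zip(sorted(by_codes.tolist()),
                          sorted(by_observed.tolist()))))  # True
```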
answered Oct 04 '22 by Gabriel A