I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.
I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.
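The blow-up is easy to reproduce; below is a minimal sketch (the column names c1/c2/c3/v and the data are invented for illustration):

```python
import pandas as pd

# Invented illustration: three categorical columns with 10 category levels
# each, but only 3 rows of actual data.
cats = [str(i) for i in range(10)]
df = pd.DataFrame({
    "c1": pd.Categorical(["0", "1", "2"], categories=cats),
    "c2": pd.Categorical(["3", "4", "5"], categories=cats),
    "c3": pd.Categorical(["6", "7", "8"], categories=cats),
    "v": [1.0, 2.0, 3.0],
})

# The grouped result holds one row per *combination* of category levels
# (10 * 10 * 10 = 1000), even though only 3 combinations occur in the data.
grouped = df.groupby(["c1", "c2", "c3"], observed=False).sum()
print(len(grouped))  # 1000
```

With thousands of categories per column, the same arithmetic gives the 1000^3 figure above.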
My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these workarounds: dropping down to numpy, or converting away from categories before the .groupby and reverting to categories later. I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below: instead of the 4 groups I want in the output, I end up with 12.
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False).sum()

# Group1 Group2 Group3   Value
#      A      A      A     NaN
#      A      A      C     NaN
#      A      A      D     NaN
#      A      B      A     NaN
#      A      B      C   54.34
#      A      B      D  826.74
#      B      A      A  765.40
#      B      A      C  514.50
#      B      A      D     NaN
#      B      B      A     NaN
#      B      B      C     NaN
#      B      B      D     NaN
Bounty update
The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:
Bounty update #2
To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.
Since pandas 0.23.0, the groupby method can take the parameter observed, which fixes this issue if it is set to True (False by default). Below is the exact same code as in the question, with just observed=True added:
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()
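On the question's data, observed=True collapses the result to the four combinations that actually occur. A quick check (the variable name out is mine):

```python
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])
for col in group_cols:
    df[col] = df[col].astype('category')

# Only the 4 combinations present in the data survive; unobserved
# category combinations are no longer materialised.
out = df.groupby(group_cols, as_index=False, observed=True).sum()
print(len(out))  # 4
```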
I was able to get a solution that should work really well. I'll edit my post with a better explanation, but in the meantime, does this work well for you?
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

# Group by the underlying integer codes. Drop the categorical columns first
# so only the value columns are aggregated (recent pandas no longer skips
# non-numeric columns silently).
result = df.drop(columns=group_cols).groupby(
    [df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values,
                                            categories=df[col].values.categories)
result
So the answer to this felt more like a proper programming question than a normal pandas question. Under the hood, every categorical series is just a bunch of integer codes that index into an array of category names. I did a groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the integers back into a categorical column.
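The codes/from_codes round trip described here can be seen in isolation (a toy example, names mine):

```python
import pandas as pd

# A categorical is just integer codes plus a categories index.
s = pd.Series(pd.Categorical(['x', 'y', 'x', 'z'], categories=['x', 'y', 'z']))
codes = s.cat.codes        # integer codes: 0, 1, 0, 2
cats = s.cat.categories    # Index(['x', 'y', 'z'])

# from_codes rebuilds the categorical losslessly from its codes.
rebuilt = pd.Categorical.from_codes(codes.to_numpy(), categories=cats)
print(list(rebuilt))  # ['x', 'y', 'x', 'z']
```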
# Group1 Group2 Group3   Value
#      A      B      C   54.34
#      A      B      D  826.74
#      B      A      A  765.40
#      B      A      C  514.50
So I understand that this isn't exactly the answer you were after, but I've made my solution into a little function for people who have this problem in the future.
def categorical_groupby(df, group_cols, agg_function="sum"):
    """Do a groupby on a number of categorical columns."""
    # Group on the integer codes; drop the categorical columns first so
    # only the value columns are aggregated.
    result = df.drop(columns=group_cols).groupby(
        [df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values,
                                                categories=df[col].values.categories)
    return result
call it like this:
df.pipe(categorical_groupby,group_cols)
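Putting it together, here is a self-contained sketch of such a helper run on the question's data. One tweak is mine: I drop the group columns before aggregating, since recent pandas raises on summing categorical columns rather than silently skipping them.

```python
import pandas as pd

def categorical_groupby(df, group_cols, agg_function="sum"):
    """Group on the categoricals' integer codes, then restore the categories."""
    # Drop the group columns so only value columns reach the aggregation.
    result = (df.drop(columns=group_cols)
                .groupby([df[col].values.codes for col in group_cols])
                .agg(agg_function)
                .reset_index())
    # Unnamed grouping arrays come back as level_0, level_1, ...
    result = result.rename(
        columns={f"level_{i}": col for i, col in enumerate(group_cols)})
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(
            result[col].values, categories=df[col].values.categories)
    return result

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])
for col in group_cols:
    df[col] = df[col].astype('category')

out = df.pipe(categorical_groupby, group_cols)
print(len(out))  # 4 observed groups, categorical dtypes preserved
```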