
Pandas groupby with categories with redundant nan

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.

I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.
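To see the scale of the problem, here is a small sketch (the column names and category counts are made up for illustration; `observed=False` is passed explicitly to match the historical default on a recent pandas):

```python
import pandas as pd
import numpy as np

# Hypothetical illustration: 3 categorical columns with 10 categories each,
# but only 5 observed rows.
rng = np.random.default_rng(0)
cats = [f"cat_{i}" for i in range(10)]
df = pd.DataFrame({col: pd.Categorical(rng.choice(cats, 5), categories=cats)
                   for col in ["Group1", "Group2", "Group3"]})
df["Value"] = 1.0

# With observed=False (the historical default), the result holds one row per
# combination of categories: 10 ** 3 = 1000 rows, although only 5 rows exist.
out = df.groupby(["Group1", "Group2", "Group3"], observed=False)["Value"].sum()
print(len(out))  # 1000
```

With thousands of categories per column, that product term, not the data size, dominates the cost of the groupby.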

My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:

  • Recreating all the functionality via numpy.
  • Continually converting to strings/codes before groupby, reverting to categories later.
  • Making a tuple column from group columns, then group by the tuple column.

I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below. Instead of the 4 category combinations I want in the output, I end up with 12.

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False).sum()

#   Group1  Group2  Group3   Value
#   A       A       A          NaN
#   A       A       C          NaN
#   A       A       D          NaN
#   A       B       A          NaN
#   A       B       C        54.34
#   A       B       D       826.74
#   B       A       A       765.40
#   B       A       C       514.50
#   B       A       D          NaN
#   B       B       A          NaN
#   B       B       C          NaN
#   B       B       D          NaN

Bounty update

The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:

  1. Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
  2. Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
  3. Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 no-go routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is "pandas-compliant" to minimise / avoid loss of other pandas categorical functionality.
  4. A response from pandas development team to support and clarify existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?

Bounty update #2

To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.

asked Jan 27 '18 by jpp



2 Answers

Since pandas 0.23.0, the groupby method can take an observed parameter, which fixes this issue when set to True (the default is False). Below is the exact same code as in the question, with just observed=True added:

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()

# Output (only the observed combinations are kept):
#   Group1  Group2  Group3   Value
#   A       B       C        54.34
#   A       B       D       826.74
#   B       A       A       765.40
#   B       A       C       514.50

answered Oct 04 '22 by Ismael EL ATIFI


I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

# Group on the underlying integer codes rather than the categorical columns.
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()

# The code-based index levels come back as level_0, level_1, ...; restore names.
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)

# Turn the integer codes back into categorical columns.
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values,
                                            categories=df[col].values.categories)
result

So the answer to this felt more like a proper programming question than a typical pandas question. Under the hood, a categorical series is just a bunch of integers that index into an array of category names. I did the groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the integer codes back into a categorical column.
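The mechanics this answer relies on can be seen in isolation (a minimal sketch, separate from the answer's exact code):

```python
import pandas as pd

# A categorical series is stored as integer codes plus a categories index.
s = pd.Series(["A", "B", "A", "C"], dtype="category")
codes = s.cat.codes        # integer labels: 0, 1, 0, 2
cats = s.cat.categories    # Index(['A', 'B', 'C'])

# from_codes rebuilds the categorical from those integers,
# without re-hashing the strings.
restored = pd.Categorical.from_codes(codes, categories=cats)
print(list(restored))  # ['A', 'B', 'A', 'C']
```

Grouping on `codes` behaves like grouping on any plain integer column, so only the groups actually present in the data are produced.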

Group1  Group2  Group3   Value
A       B       C        54.34
A       B       D       826.74
B       A       A       765.40
B       A       C       514.50

I understand that this isn't exactly the answer you were looking for, but I've made my solution into a little function for people who have this problem in the future.

def categorical_groupby(df, group_cols, agg_function="sum"):
    """Do a groupby on a number of categorical columns via their integer codes."""
    result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    # Restore the original column names on the code-based index levels.
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # Convert the integer codes back into categorical columns.
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values,
                                                categories=df[col].values.categories)
    return result

call it like this:

df.pipe(categorical_groupby, group_cols)
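As a sanity check (assuming pandas >= 0.23, so that observed exists), grouping on the codes yields the same totals as observed=True on the question's data:

```python
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])
for col in group_cols:
    df[col] = df[col].astype('category')

# Group on the raw integer codes, and on the categories via observed=True.
by_codes = df.groupby([df[c].cat.codes for c in group_cols])['Value'].sum()
by_observed = df.groupby(group_cols, observed=True)['Value'].sum()

# Both keep only the 4 combinations that actually occur in the data.
print(len(by_codes))  # 4
print(all(abs(a - b) < 1e-9
          for a, b in zip(sorted(by_codes.tolist()),
                          sorted(by_observed.tolist()))))  # True
```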
answered Oct 04 '22 by Gabriel A