
Pandas Groupby: 'observed' parameter with multiple categoricals

Consider the following DataFrame with two categorical columns:

import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

In df.groupby(), the default is observed=False. The pandas 0.25.0 documentation describes the observed parameter as follows:

When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed keyword controls whether to return a cartesian product of all possible groupers values (observed=False) or only those that are observed groupers (observed=True).
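For a single categorical grouper, the documented behaviour can be sketched as below. This is only an illustration: the small frame and the unobserved "AZ" category are made up here and are not part of the question's data.

```python
import pandas as pd

# Hypothetical data: "AZ" is a declared category that never appears in a row.
demo = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK"], categories=["AK", "AL", "AZ"]),
    "name": list("abc"),
})

# observed=False: one group per declared category, including the empty "AZ".
all_categories = demo.groupby("state", observed=False).size()

# observed=True: only the categories that actually occur in the data.
seen_only = demo.groupby("state", observed=True).size()

print(all_categories.to_dict())  # {'AK': 2, 'AL': 1, 'AZ': 0}
print(seen_only.to_dict())       # {'AK': 2, 'AL': 1}
```

Passing observed explicitly in both calls keeps the sketch independent of whatever default the installed pandas version uses.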

Accordingly, this is the result I would expect:

>>> # Expected result
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

This is the actual result:

>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

Am I misinterpreting the description here?

The following workaround is cumbersome, and it builds exactly what observed=False should already return. Am I missing an alternative?

>>> idx = pd.MultiIndex.from_product(
...     (
...         df["state"].cat.categories,
...         df["gender"].cat.categories,
...     ),
...     names=["state", "gender"]
... )
>>> df.groupby(["state", "gender"])["name"].count().reindex(idx).fillna(0.).astype(int)
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64
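The same workaround can be wrapped in a small helper. This is a sketch, not a pandas API: the function name count_all_combinations and its arguments are hypothetical, and it assumes every key column has a categorical dtype. Passing fill_value=0 to reindex avoids the detour through float and back to int.

```python
import pandas as pd


def count_all_combinations(df, keys, value):
    """Count non-null `value` entries for every combination of category levels.

    Hypothetical helper; assumes each column in `keys` is categorical.
    """
    # Count only the combinations that actually occur in the data.
    observed_counts = df.groupby(keys, observed=True)[value].count()
    # Build the full cartesian product of the declared categories.
    full_index = pd.MultiIndex.from_product(
        [df[key].cat.categories for key in keys], names=keys
    )
    # fill_value=0 fills the missing combinations without introducing NaN,
    # so the integer dtype of the counts is preserved.
    return observed_counts.reindex(full_index, fill_value=0)


df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

result = count_all_combinations(df, ["state", "gender"], "name")
print(result)
```

Because the helper always groups with observed=True and fills the gaps itself, its result should not depend on how a given pandas version handles observed=False for this call.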
asked Aug 06 '19 22:08 by Brad Solomon





1 Answer

It seems that the position of the ["name"] selection is what throws it off. Selecting the column after count(), rather than before it, appears to work:

df.groupby(["state", "gender"]).count().fillna(0)["name"]
state  gender
AK     F         0.0
       M         2.0
AL     F         1.0
       M         1.0
Name: name, dtype: float64

Here are two useful variations, one with observed=False cast back to integers and one with observed=True:

In [16]: df.groupby(["state", "gender"], observed=False).count().fillna(0)["name"].astype(int)
Out[16]:
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

In [17]: df.groupby(["state", "gender"], observed=True).count()["name"]
Out[17]:
state  gender
AK     M         2
AL     M         1
       F         1
Name: name, dtype: int64
answered Oct 06 '22 00:10 by ajp619
