Consider the following DataFrame with two categorical columns:
import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})
In df.groupby(), the default is observed=False. The description for observed (Pandas 0.25.0) is:
When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed keyword controls whether to return a cartesian product of all possible groupers values (observed=False) or only those that are observed groupers (observed=True).
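To make that description concrete, here is a minimal sketch with a single categorical grouper. The names s and cat are made up for illustration; category "b" is declared but never occurs in the data.

```python
import pandas as pd

# Hypothetical data: "b" is a declared category with no observations.
cat = pd.Categorical(["a", "a", "a"], categories=["a", "b"])
s = pd.Series([1, 1, 1])

# observed=False keeps every declared category, including the empty one.
all_cats = s.groupby(cat, observed=False).count()

# observed=True keeps only the categories that actually appear.
seen = s.groupby(cat, observed=True).count()

print(all_cats)
print(seen)
```

With one grouper, the unobserved category shows up with a count of 0 under observed=False and is dropped under observed=True.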
Accordingly, this is the result I would expect:
>>> # Expected result
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
       F         0
AL     F         1
       M         1
Name: name, dtype: int64
This is the actual result:
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64
Am I misinterpreting the description here?
The workaround below seems like a huge pain, and it produces exactly what observed=False should produce on its own. Am I missing an alternative?
>>> idx = pd.MultiIndex.from_product(
...     (
...         df["state"].cat.categories,
...         df["gender"].cat.categories,
...     ),
...     names=["state", "gender"]
... )
>>> df.groupby(["state", "gender"])["name"].count().reindex(idx).fillna(0.).astype(int)
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64
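A slightly shorter variant of the same workaround, as a sketch: reindex accepts a fill_value argument, so the missing combinations can be filled with 0 directly instead of going through NaN, fillna, and astype. Passing observed=True explicitly here is my own choice, so that the groupby result holds only the observed pairs regardless of pandas version; the name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

# Full cartesian product of the declared categories.
idx = pd.MultiIndex.from_product(
    [df["state"].cat.categories, df["gender"].cat.categories],
    names=["state", "gender"],
)

# Count only the observed pairs, then reindex onto the full product.
# fill_value=0 keeps the integer dtype, so no fillna/astype round trip.
counts = (
    df.groupby(["state", "gender"], observed=True)["name"]
    .count()
    .reindex(idx, fill_value=0)
)
print(counts)
```

This is still a workaround rather than an explanation of the observed=False behaviour, but it avoids the float conversion.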
It seems like where you put ["name"] is throwing it off. I think this works:
df.groupby(["state", "gender"]).count().fillna(0)["name"]
state  gender
AK     F         0.0
       M         2.0
AL     F         1.0
       M         1.0
Name: name, dtype: float64
Here are some useful variations:
In [16]: df.groupby(["state", "gender"], observed=False).count().fillna(0)["name"].astype(int)
Out[16]:
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

In [17]: df.groupby(["state", "gender"], observed=True).count()["name"]
Out[17]:
state  gender
AK     M         2
AL     M         1
       F         1
Name: name, dtype: int64