Pandas Groupby: 'observed' parameter with multiple categoricals

Tags: python, python-3.x, pandas

Consider the following DataFrame with two categorical columns:

import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

In df.groupby(), the default is observed=False. The description for observed (Pandas 0.25.0) is:

When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed keyword controls whether to return a cartesian product of all possible groupers values (observed=False) or only those that are observed groupers (observed=True).
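With a single categorical grouper, the behaviour in that description is easy to reproduce. This is a minimal sketch, assuming a made-up frame `toy` whose `grade` column declares a category ("c") that never occurs in the data. The frame and its column names are illustrative only, and `observed` is passed explicitly so the snippet does not depend on the default:

```python
import pandas as pd

# Hypothetical toy frame: "c" is a declared category that never occurs.
toy = pd.DataFrame({
    "grade": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "item": list("xyz"),
})

# observed=False: one row per declared category, including the unused "c".
all_categories = toy.groupby("grade", observed=False)["item"].count()

# observed=True: only the categories that actually appear in the data.
seen_only = toy.groupby("grade", observed=True)["item"].count()

print(all_categories)
print(seen_only)
```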

Accordingly, this is the result I would expect:

>>> # Expected result
>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

This is the actual result:

>>> df.groupby(["state", "gender"])["name"].count()
state  gender
AK     M         2
AL     F         1
       M         1
Name: name, dtype: int64

Am I misinterpreting the description here?

The workaround below is cumbersome, and it produces exactly what observed=False should return on its own. Am I missing an alternative?

>>> idx = pd.MultiIndex.from_product(
...     (
...         df["state"].cat.categories,
...         df["gender"].cat.categories,
...     ),
...     names=["state", "gender"]
... )
>>> df.groupby(["state", "gender"])["name"].count().reindex(idx).fillna(0.).astype(int)
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64
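A slightly shorter variant of the same workaround, sketched on the assumption that `Series.reindex` accepts `fill_value` in your pandas version: filling the missing combinations with 0 during the reindex avoids the detour through NaN and float. The name `counts` is only illustrative, and `observed=True` is passed explicitly so that the reindex alone is responsible for adding the missing rows:

```python
import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

# Every (state, gender) combination, built from the declared categories.
idx = pd.MultiIndex.from_product(
    [df["state"].cat.categories, df["gender"].cat.categories],
    names=["state", "gender"],
)

# Count only the observed groups, then fill unobserved ones with 0.
counts = (
    df.groupby(["state", "gender"], observed=True)["name"]
    .count()
    .reindex(idx, fill_value=0)
)

print(counts)
```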


asked Aug 06 '19 22:08

Brad Solomon

1 Answer

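It seems the placement of ["name"] is what throws it off: selecting the column before calling count() drops the unobserved combinations, whereas counting the whole grouped frame and selecting the column afterwards keeps them (as NaN, hence the fillna(0) and the float dtype). I think this works: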

df.groupby(["state", "gender"]).count().fillna(0)["name"]
state  gender
AK     F         0.0
       M         2.0
AL     F         1.0
       M         1.0
Name: name, dtype: float64

Here are some useful variations:

In [16]: df.groupby(["state", "gender"], observed=False).count().fillna(0)["name"].astype(int)
Out[16]:
state  gender
AK     F         0
       M         2
AL     F         1
       M         1
Name: name, dtype: int64

In [17]: df.groupby(["state", "gender"], observed=True).count()["name"]
Out[17]:
state  gender
AK     M         2
AL     M         1
       F         1
Name: name, dtype: int64
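To check whether the placement of ["name"] still matters on whatever pandas version you have installed, a small sketch like the following compares the two spellings side by side. It passes observed=False explicitly rather than relying on the default, and the variable names are mine. On recent releases I would expect both to agree and to include the zero row for ("AK", "F"), but treat that as an assumption to verify, not a guarantee:

```python
import pandas as pd

df = pd.DataFrame({
    "state": pd.Categorical(["AK", "AL", "AK", "AL"]),
    "gender": pd.Categorical(["M", "M", "M", "F"]),
    "name": list("abcd"),
})

grouped = df.groupby(["state", "gender"], observed=False)

# Column selected before counting (the spelling from the question).
select_then_count = grouped["name"].count()

# Whole frame counted first, column selected afterwards (this answer).
# fillna(0) covers versions that return NaN for unobserved combinations.
count_then_select = grouped.count()["name"].fillna(0).astype("int64")

print(select_then_count)
print(count_then_select)
print("same result:", select_then_count.equals(count_then_select))
```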


answered Oct 06 '22 00:10

ajp619
