I'll start with 3 simple examples:
pd.DataFrame([[True]]).sum()
0 1
dtype: int64
pd.DataFrame([True]).sum()
0 1
dtype: int64
pd.Series([True]).sum()
1
All of these are as expected. Here is a more complicated example.
df = pd.DataFrame([
['a', 'A', True],
['a', 'B', False],
['a', 'C', True],
['b', 'A', True],
['b', 'B', True],
['b', 'C', False],
], columns=list('XYZ'))
df.Z.sum()
4
Also as expected. However, if I groupby(['X', 'Y']).sum()
I expected it to look like:
I'm thinking bug. Is there another explanation?
Per @unutbu's answer
pandas is trying to recast as original dtypes. I had thought that maybe the group by I'd performed didn't really groupby anything. So I tried this example to test out the idea.
df = pd.DataFrame([
['a', 'A', False],
['a', 'B', False],
['a', 'C', True],
['b', 'A', False],
['b', 'B', False],
['b', 'C', False],
], columns=list('XYZ'))
I'll groupby('X')
and sum
. If @unutbu is correct, these sums should be 1
and 0
and are castable to bool
, therefore we should see bool
df.groupby('X').sum()
Sure enough... bool
But if the process is the same but the values are slightly different.
df = pd.DataFrame([
['a', 'A', True],
['a', 'B', False],
['a', 'C', True],
['b', 'A', False],
['b', 'B', False],
['b', 'C', False],
], columns=list('XYZ'))
df.groupby('X').sum()
lesson learned. Always use astype(int)
or something similar when doing this.
df.groupby('X').sum().astype(int)
gives consistent results for either scenario.
This occurs because _cython_agg_blocks
calls _try_coerce_and_cast_result
which calls _try_cast_result
which tries to return a result of the same dtype as the original values (in this case, bool
).
This returns something a little peculiar when Z
has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since _try_cast_result
does not convert 2.0 back to a boolean.
_try_cast_result
does something more useful when Z
has dtype int
: Internally, the Cython aggregator used by
df.groupby(['X', 'Y']).sum()
returns a result
of dtype float
. Here then, _try_cast_result
returns the result to dtype int
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With