Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't groupby sum convert boolean to int or float?

Tags:

python

pandas

I'll start with 3 simple examples:

pd.DataFrame([[True]]).sum()

0    1
dtype: int64

pd.DataFrame([True]).sum()

0    1
dtype: int64

pd.Series([True]).sum()

1

All of these are as expected. Here is a more complicated example.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', True],
        ['b', 'B', True],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.Z.sum()

4

Also as expected. However, if I groupby(['X', 'Y']).sum()

enter image description here

I expected it to look like:

enter image description here

I'm thinking bug. Is there another explanation?


Per @unutbu's answer

pandas is trying to recast as original dtypes. I had thought that maybe the group by I'd performed didn't really groupby anything. So I tried this example to test out the idea.

df = pd.DataFrame([
        ['a', 'A', False],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

I'll groupby('X') and sum. If @unutbu is correct, these sums should be 1 and 0 and are castable to bool, therefore we should see bool

df.groupby('X').sum()

enter image description here

Sure enough... bool

But if the process is the same but the values are slightly different.

df = pd.DataFrame([
        ['a', 'A', True],
        ['a', 'B', False],
        ['a', 'C', True],
        ['b', 'A', False],
        ['b', 'B', False],
        ['b', 'C', False],
    ], columns=list('XYZ'))

df.groupby('X').sum()

enter image description here

lesson learned. Always use astype(int) or something similar when doing this.

df.groupby('X').sum().astype(int)

gives consistent results for either scenario.

like image 459
piRSquared Avatar asked Jul 26 '16 00:07

piRSquared


1 Answers

This occurs because _cython_agg_blocks calls _try_coerce_and_cast_result which calls _try_cast_result which tries to return a result of the same dtype as the original values (in this case, bool).

This returns something a little peculiar when Z has dtype bool (and all the groups have no more than one True value). If any of the groups have 2 or more True values, then the resulting values are floats since _try_cast_result does not convert 2.0 back to a boolean.

_try_cast_result does something more useful when Z has dtype int: Internally, the Cython aggregator used by df.groupby(['X', 'Y']).sum() returns a result of dtype float. Here then, _try_cast_result returns the result to dtype int.

like image 73
unutbu Avatar answered Sep 27 '22 20:09

unutbu