I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})
Let's identify those ids which have a nonzero entry in the cat column.
>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1 True
2 True
3 False
4 True
Name: cat, dtype: bool
Great. If we wanted to create an indicator column, however, we could do the following.
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.
When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object, apparently to match the dtype of the original mixed-type column of integers and booleans.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
'cat': [True,True,0,0,True,0,0,0,0,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: object
However, when I use all booleans, the transform function returns a boolean column.
df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
'cat': [True,True,False,False,True,False,False,False,False,True]})
>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
Name: cat, dtype: bool
Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.
It looks like SeriesGroupBy.transform() tries to cast the result back to the dtype of the original column, but DataFrameGroupBy.transform() doesn't seem to do that:
In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 0
8 0
9 1
Name: cat, dtype: int64
# Note the double brackets: [['cat']] selects a DataFrame rather than a Series
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
cat
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 False
8 False
9 True
In [141]: df.dtypes
Out[141]:
cat int64
id int64
dtype: object
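If you need a boolean indicator column and don't want to depend on whether transform casts back to the original dtype (this may well differ between pandas versions, and I haven't checked which ones), here is a minimal sketch of two ways around it. The variable names flag_cast and flag_bool are only for illustration. The first casts the result explicitly; the second does the comparison before grouping, so the column passed to transform is already boolean and a cast back to the input dtype would still give bool.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
                   'cat': [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# Option 1: keep the original call, then cast explicitly.
# Works whether transform returned 1/0 integers or booleans.
flag_cast = (
    df.groupby('id')['cat']
      .transform(lambda x: (x == 1).any())
      .astype(bool)
)

# Option 2: compare first, then group the boolean Series by the id column.
# The input to transform is bool, so the output dtype is bool as well.
flag_bool = df['cat'].eq(1).groupby(df['id']).transform('any')

print(flag_cast.dtype, flag_bool.dtype)
```

Both give one value per original row, aligned on the original index, so either can be assigned straight back as a new column of df.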