When trying to debug groupby function applications, someone suggested that I use a dummy function to "see what is being passed" into the function for each group. Sure, I'm game:
import numpy as np
import pandas as pd
np.random.seed(0) # so we can all play along at home
categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))
df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

def f(x):
    # Dummy function: report the type of whatever groupby passes in.
    print(type(x))
    return x

print('single column transform')
df.groupby(['category'])['data_1'].transform(f)
print('\n')
print('single column (nested) transform')
df.groupby(['category'])[['data_1']].transform(f)
print('\n')
print('multiple column transform')
df.groupby(['category'])[['data_1', 'data_2']].transform(f)
print('\n')
print('\n')
print('single column apply')
df.groupby(['category'])['data_1'].apply(f)
print('\n')
print('single column (nested) apply')
df.groupby(['category'])[['data_1']].apply(f)
print('\n')
print('multiple column apply')
df.groupby(['category'])[['data_1', 'data_2']].apply(f)

This puts the following into my standard output:
single column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
single column (nested) transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
single column apply
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
single column (nested) apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
multiple column apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
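So it seems like:
transform, single column: 3 Series
transform, single column (nested): 2 Series and 3 DataFrame
transform, multiple columns: 3 Series and 3 DataFrame
apply, single column: 3 Series
apply, single column (nested): 4 DataFrame
apply, multiple columns: 4 DataFrame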
What's going on here? Can anyone explain why each of these 6 calls is leading to the series of objects described above being passed to the function specified?
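A variation on the same dummy-function idea, as a rough sketch: instead of printing, the function can append each argument's type name to a list, which makes the sequence easier to compare between calls. The names `make_recorder` and `calls` are made up for illustration, and the exact number of calls may differ between pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "category": list("abc") * 4,
    "data_1": np.random.randn(12),
    "data_2": np.random.randn(12),
})

def make_recorder():
    """Return a pass-through function plus the list it records type names into."""
    calls = []

    def f(x):
        # Record the type name of whatever groupby hands us, then pass it through.
        calls.append(type(x).__name__)
        return x

    return f, calls

f, calls = make_recorder()
df.groupby("category")["data_1"].transform(f)
print(calls)
```

Calling `make_recorder()` again before each groupby call gives a fresh list, so the six cases can be compared side by side.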
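GroupBy.transform will try both a fast_path and a slow_path for your function. The slow path calls your function through DataFrame.apply, so it receives one column (a Series) at a time; the fast path calls your function with the group's DataFrame directly.
When the result of fast_path is the same as that of slow_path, it will choose the fast_path for the remaining groups.
The following output means that it finally selected the fast_path: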
multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
Here is the code link:
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2277
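To make the selection logic concrete, here is a simplified sketch of the idea, not the actual pandas source. The helper name `choose_path` and the small `group` frame are made up for illustration; the real implementation performs additional checks.

```python
import pandas as pd

def choose_path(func, group):
    """Hypothetical helper mimicking the fast/slow path choice (not pandas code)."""
    # Slow path: func is called once per column, each time with a Series.
    slow_result = group.apply(func, axis=0)
    try:
        # Fast path: func is called once with the whole group DataFrame.
        fast_result = func(group)
    except Exception:
        return "slow", slow_result
    # Pick the fast path only if it reproduces the slow path's result.
    if isinstance(fast_result, pd.DataFrame) and fast_result.equals(slow_result):
        return "fast", fast_result
    return "slow", slow_result

group = pd.DataFrame({"data_1": [1.0, 2.0, 3.0], "data_2": [4.0, 5.0, 6.0]})

# The identity function gives the same result either way.
path, _ = choose_path(lambda x: x, group)
print(path)

# Subtracting the overall maximum differs per column vs. per frame.
path_max, _ = choose_path(lambda x: x - x.values.max(), group)
print(path_max)
```

The probe on the first group is why the dummy function sees Series objects first and a DataFrame afterwards: both paths are run once before one is chosen.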
Edit: to inspect the calling stack:
import numpy as np
import pandas as pd
np.random.seed(0) # so we can all play along at home
categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))
df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})
import traceback
import inspect
import itertools

def f(x):
    # Keep only the stack frames from the line tagged "#stop here" onward.
    stack = itertools.dropwhile(lambda line: "#stop here" not in line,
                                traceback.format_stack(inspect.currentframe().f_back))
    print("*" * 20)
    print(x)
    print(type(x))
    print()
    print("\n".join(stack))
    return x

df.groupby(['category'])[['data_1', 'data_2']].transform(f) #stop here