
What Pandas data type is passed to transform or apply in a groupby

Tags: python, pandas

When trying to debug groupby function applications, someone suggested that I use a dummy function to "see what is being passed" into the function for each group. Sure, I'm game:

import numpy as np
import pandas as pd

np.random.seed(0) # so we can all play along at home

categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))

df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

def f(x):
    print(type(x))  # show what kind of object groupby hands to the function
    return x

print('single column transform')
df.groupby(['category'])['data_1'].transform(f)
print('\n')

print('single column (nested) transform')
df.groupby(['category'])[['data_1']].transform(f)
print('\n')

print('multiple column transform')
df.groupby(['category'])[['data_1', 'data_2']].transform(f)

print('\n')
print('\n')

print('single column apply')
df.groupby(['category'])['data_1'].apply(f)
print('\n')

print('single column (nested) apply')
df.groupby(['category'])[['data_1']].apply(f)
print('\n')

print('multiple column apply')
df.groupby(['category'])[['data_1', 'data_2']].apply(f)

This puts the following into my standard output:

single column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


single column (nested) transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>




single column apply
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


single column (nested) apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


multiple column apply
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

So it seems like:

  • Transform
    • Single column: 3 Series
    • Single column (nested): 2 Series and 3 DataFrame
    • Multiple columns: 3 Series and 3 DataFrame
  • Apply
    • Single column: 3 Series
    • Single column (nested): 4 DataFrame
    • Multiple columns: 4 DataFrame

What's going on here? Can anyone explain why each of these six calls passes the sequence of objects listed above to the function?

8one6 asked Dec 19 '13 01:12


1 Answer

GroupBy.transform tries both a fast_path and a slow_path for your function:

  • fast_path: calls your function once with the group's whole DataFrame
  • slow_path: calls your function through DataFrame.apply, so it receives the group's columns one at a time as Series

If the fast_path result is the same as the slow_path result, it chooses the fast_path.
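
The selection described above can be sketched roughly as follows. This is only an illustration of the idea, not pandas' actual implementation: the helper name choose_path, the sample data, and the use of DataFrame.equals for the comparison are all assumptions made for the example.

```python
import pandas as pd

def choose_path(func, group):
    """Illustrative stand-in for the fast/slow trial; not pandas' actual code."""
    # slow path: DataFrame.apply hands func one column (a Series) at a time
    slow_result = group.apply(func)
    try:
        # fast path: func receives the whole group as a single DataFrame
        fast_result = func(group)
    except Exception:
        return "slow", slow_result
    # keep the fast path only if it reproduces the slow-path result
    if isinstance(fast_result, pd.DataFrame) and fast_result.equals(slow_result):
        return "fast", fast_result
    return "slow", slow_result

seen = []

def f(x):
    seen.append(type(x).__name__)  # record what kind of object was passed in
    return x

# hypothetical stand-in for a single group's data
group = pd.DataFrame({"data_1": [1.0, 2.0], "data_2": [3.0, 4.0]})

path, _ = choose_path(f, group)
print(path, seen)

# a function whose whole-frame result differs from its per-column result
path2, _ = choose_path(lambda x: x - x.values.max(), group)
print(path2)
```

With a function like f that returns its input unchanged, both paths give the same result, so the sketch settles on the fast path after f has seen Series objects first and a DataFrame last. How many Series entries get recorded may depend on how often your pandas version's DataFrame.apply calls the function.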

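The following output means that it finally selected the fast_path: the Series lines come from the slow_path trial, and the DataFrame lines come from the fast_path calls, one per group: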

multiple column transform
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>

Here is a link to the relevant code:

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2277

Edit

To inspect the call stack:

import numpy as np
import pandas as pd

np.random.seed(0) # so we can all play along at home

categories = list('abc')
categories = categories * 4
data_1 = np.random.randn(len(categories))
data_2 = np.random.randn(len(categories))

df = pd.DataFrame({'category': categories, 'data_1': data_1, 'data_2': data_2})

import traceback
import inspect
import itertools

def f(x):
    # keep only the stack entries from the line tagged "#stop here" onward
    stack = itertools.dropwhile(lambda line: "#stop here" not in line,
                                traceback.format_stack(inspect.currentframe().f_back))
    print("*" * 20)
    print(x)
    print(type(x))
    print("")
    print("\n".join(stack))
    return x

df.groupby(['category'])[['data_1', 'data_2']].transform(f) #stop here

HYRY answered Nov 07 '22 10:11