
pandas' transform doesn't work sorting groupby output


Another pandas question.

While reading Wes McKinney's excellent book about data analysis and pandas, I ran into something that I thought should work:

Suppose I have some info about tips.

    In [119]: tips.head()
    Out[119]:
       total_bill   tip     sex  smoker  day    time  size   tip_pct
    0       16.99  1.01  Female   False  Sun  Dinner     2  0.059447
    1       10.34  1.66    Male   False  Sun  Dinner     3  0.160542
    2       21.01  3.50    Male   False  Sun  Dinner     3  0.166587
    3       23.68  3.31    Male   False  Sun  Dinner     2  0.139780
    4       24.59  3.61  Female   False  Sun  Dinner     4  0.146808

and I want to know the five largest tips in relation to the total bill, that is, tip_pct for smokers and non-smokers separately. So this works:

    def top(df, n=5, column='tip_pct'):
        return df.sort_index(by=column)[-n:]

    In [101]: tips.groupby('smoker').apply(top)
    Out[101]:
                total_bill   tip     sex  smoker   day    time  size   tip_pct
    smoker
    False  88        24.71  5.85    Male   False  Thur   Lunch     2  0.236746
           185       20.69  5.00    Male   False   Sun  Dinner     5  0.241663
           51        10.29  2.60  Female   False   Sun  Dinner     2  0.252672
           149        7.51  2.00    Male   False  Thur   Lunch     2  0.266312
           232       11.61  3.39    Male   False   Sat  Dinner     2  0.291990
    True   109       14.31  4.00  Female    True   Sat  Dinner     2  0.279525
           183       23.17  6.50    Male    True   Sun  Dinner     4  0.280535
           67         3.07  1.00  Female    True   Sat  Dinner     1  0.325733
           178        9.60  4.00  Female    True   Sun  Dinner     2  0.416667
           172        7.25  5.15    Male    True   Sun  Dinner     2  0.710345
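For readers on a recent pandas, here is a minimal sketch of the same idea. It assumes `sort_values(by=...)` as the replacement for the older `sort_index(by=...)` spelling used above, and it runs on a tiny made-up frame (not the real tips dataset) so that it is self-contained. The value columns are selected explicitly so the grouping column is not passed into `top`.

```python
import pandas as pd

# Made-up miniature stand-in for the tips data (values are illustrative only).
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 10.0, 50.0, 20.0],
    "tip":        [1.0, 4.0, 2.0, 3.0, 5.0, 1.0],
    "smoker":     [False, False, False, True, True, True],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]


def top(df, n=2, column="tip_pct"):
    # Sort ascending by the column and keep the last n rows (the n largest).
    return df.sort_values(by=column)[-n:]


# apply() hands each group to top() as a whole DataFrame.
value_cols = ["total_bill", "tip", "tip_pct"]
result = tips.groupby("smoker")[value_cols].apply(top)
print(result)
```

The result is indexed by `smoker` and then by the original row label, as in the output above.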

Good enough, but then I wanted to use pandas' transform to do the same like this:

    def top_all(df):
        return df.sort_index(by='tip_pct')

    tips.groupby('smoker').transform(top_all)

but instead I get this:

TypeError: Transform function invalid for data types 

Why? I know that transform requires the function to return an array with the same dimensions as its input, so I thought I was complying with that requirement by just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?

asked Dec 13 '12 06:12 by Robert Smith



1 Answer

transform is not that well documented, but it seems that the function you give it is passed not the entire group as a DataFrame, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine.

So suppose you call tips.groupby('smoker').transform(func). There will be two groups; call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']), and so on. Here's an example:

    >>> print d
       A  B  C
    0 -2  5  4
    1  1 -1  2
    2  0  2  1
    3 -3  1  2
    4  5  0  2
    >>> def foo(df):
    ...     print ">>>"
    ...     print df
    ...     print "<<<"
    ...     return df
    >>> print d.groupby('C').transform(foo)
    >>>
    2    0
    Name: A
    <<<
    >>>
    2    2
    Name: B
    <<<
    >>>
    1    1
    3   -3
    4    5
    Name: A
    <<<
    >>>
    1   -1
    3    1
    4    0
    Name: B
    # etc.

You can see that foo is first called with just the A column of the C=1 group of the original data frame, then the B column of that group, then the A column of the C=2 group, etc.
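The same experiment can be repeated on Python 3 with a sketch like the one below. Instead of printing each argument, it records what `foo` receives, so the exact call sequence is not assumed. As far as I can tell, newer pandas versions may additionally try passing the whole group as a DataFrame, so the list can contain both kinds of entries depending on the version.

```python
import pandas as pd

d = pd.DataFrame({
    "A": [-2, 1, 0, -3, 5],
    "B": [5, -1, 2, 1, 0],
    "C": [4, 2, 1, 2, 2],
})

seen = []  # one (type name, column name or None) entry per call to foo


def foo(x):
    # Record whether we were handed a Series (one column) or something else.
    label = x.name if isinstance(x, pd.Series) else None
    seen.append((type(x).__name__, label))
    return x  # identity transform, so the shape is unchanged


result = d.groupby("C").transform(foo)

for call in seen:
    print(call)
```

At least some of the recorded calls should be `Series` objects named `A` and `B`, which is the per-column behaviour described above. The grouping column `C` is never passed in.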

This makes sense if you think about what transform is for. It's meant for applying transformation functions within each group, producing a result with the same shape as the input. But in general, these functions won't make sense when applied to the entire group, only to a given column. For instance, the example in the pandas docs is about z-standardizing using transform. If you have a DataFrame with columns for age and weight, it wouldn't make sense to z-standardize with respect to the overall mean of both these variables. It doesn't even mean anything to take the overall mean of a bunch of numbers, some of which are ages and some of which are weights. You have to z-standardize the age with respect to the mean age and the weight with respect to the mean weight, which means you want to transform separately for each column.
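A minimal sketch of that z-standardizing case follows. The frame, the column names (`group`, `age`, `weight`), and the helper `zscore` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups, with ages and weights on different scales.
people = pd.DataFrame({
    "group":  ["a", "a", "a", "b", "b", "b"],
    "age":    [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "weight": [60.0, 70.0, 80.0, 65.0, 75.0, 85.0],
})


def zscore(col):
    # Standardize against this column's own mean and standard deviation
    # within the current group.
    return (col - col.mean()) / col.std()


z = people.groupby("group").transform(zscore)
print(z)
```

Each age is standardized against the mean age of its group and each weight against the mean weight of its group, and `z` has one row per original row.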

So basically, you don't need to use transform here. apply is the appropriate function, because apply really does operate on each group as a single DataFrame, while transform operates on each column of each group.
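As a rough illustration of the apply route for `top_all`, the sketch below sorts every group by `tip_pct`. It again uses a small made-up frame and assumes the newer `sort_values` spelling. It also shows that, for this particular goal, sorting the whole frame by both keys gives the same row order without any groupby.

```python
import pandas as pd

# Made-up miniature stand-in for the tips data (values are illustrative only).
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 10.0, 50.0, 20.0],
    "tip":        [1.0, 4.0, 2.0, 3.0, 5.0, 1.0],
    "smoker":     [False, False, False, True, True, True],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]


def top_all(df):
    # Receives one whole group as a DataFrame, so sorting by a column works.
    return df.sort_values(by="tip_pct")


value_cols = ["total_bill", "tip", "tip_pct"]
per_group = tips.groupby("smoker")[value_cols].apply(top_all)

# Alternative without groupby: sort by the group key first, then by tip_pct.
flat = tips.sort_values(["smoker", "tip_pct"])

print(per_group)
print(flat)
```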

answered Oct 05 '22 22:10 by BrenBarn