pandas' transform doesn't work sorting groupby output

Another pandas question.

Reading Wes McKinney's excellent book about data analysis and pandas, I came across something that I thought should work:

Suppose I have some info about tips.

In [119]: tips.head()
Out[119]:
   total_bill   tip     sex  smoker  day    time  size   tip_pct
0       16.99  1.01  Female   False  Sun  Dinner     2  0.059447
1       10.34  1.66    Male   False  Sun  Dinner     3  0.160542
2       21.01  3.50    Male   False  Sun  Dinner     3  0.166587
3       23.68  3.31    Male   False  Sun  Dinner     2  0.139780
4       24.59  3.61  Female   False  Sun  Dinner     4  0.146808

and I want to know the five largest tips in relation to the total bill, that is, tip_pct for smokers and non-smokers separately. So this works:

def top(df, n=5, column='tip_pct'):
    return df.sort_index(by=column)[-n:]

In [101]: tips.groupby('smoker').apply(top)
Out[101]:
            total_bill   tip     sex  smoker   day    time  size   tip_pct
smoker
False  88        24.71  5.85    Male   False  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male   False   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female   False   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male   False  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male   False   Sat  Dinner     2  0.291990
True   109       14.31  4.00  Female    True   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    True   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    True   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    True   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    True   Sun  Dinner     2  0.710345

Good enough, but then I wanted to use pandas' transform to do the same like this:

def top_all(df):
    return df.sort_index(by='tip_pct')

tips.groupby('smoker').transform(top_all)

but instead I get this:

TypeError: Transform function invalid for data types

Why? I know that transform requires the function to return an array with the same dimensions as its input, so I thought I would satisfy that requirement by just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?

asked Dec 13 '12 06:12

Robert Smith

1 Answer

transform is not that well documented, but it seems that the function you give it is passed not the entire group as a DataFrame, but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine.

So suppose tips.groupby('smoker').transform(func). There will be two groups, call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']). Here's an example:

>>> print d
   A  B  C
0 -2  5  4
1  1 -1  2
2  0  2  1
3 -3  1  2
4  5  0  2
>>> def foo(df):
...     print ">>>"
...     print df
...     print "<<<"
...     return df
>>> print d.groupby('C').transform(foo)
>>>
2    0
Name: A
<<<
>>>
2    2
Name: B
<<<
>>>
1    1
3   -3
4    5
Name: A
<<<
>>>
1   -1
3    1
4    0
Name: B
# etc.

You can see that foo is first called with just the A column of the C=1 group of the original data frame, then the B column of that group, then the A column of the C=2 group, etc.
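If you want to check this on your own installation, here is a minimal sketch in Python 3 syntax that records the type of every object handed to the function instead of printing it. The names seen and record are only illustrative. Treat the result as something to inspect rather than a guarantee: newer pandas releases may also try the function on a whole group as a DataFrame, so the recorded types can differ between versions.

```python
import pandas as pd

# Same small frame as in the example above.
d = pd.DataFrame({
    "A": [-2, 1, 0, -3, 5],
    "B": [5, -1, 2, 1, 0],
    "C": [4, 2, 1, 2, 2],
})

seen = []  # type names of every object pandas passes to the function


def record(obj):
    # Remember what kind of object arrived, then return it unchanged.
    seen.append(type(obj).__name__)
    return obj


out = d.groupby("C").transform(record)

print(sorted(set(seen)))  # which kinds of objects the function received
print(out)                # columns A and B, unchanged, in the original row order
```

A function that only makes sense for a whole DataFrame, such as sorting by a named column, fails as soon as it is handed a single column.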

This makes sense if you think about what transform is for. It's meant for applying transform functions on the groups. But in general, these functions won't make sense when applied to the entire group, only to a given column. For instance, the example in the pandas docs is about z-standardizing using transform. If you have a DataFrame with columns for age and weight, it wouldn't make sense to z-standardize with respect to the overall mean of both these variables. It doesn't even mean anything to take the overall mean of a bunch of numbers, some of which are ages and some of which are weights. You have to z-standardize the age with respect to the mean age and the weight with respect to the mean weight, which means you want to transform separately for each column.
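As a rough illustration of that use case, the sketch below z-standardizes two columns within each group. The people frame, its values, and the zscore helper are made up for this example and are not taken from the pandas docs.

```python
import pandas as pd

# Hypothetical data: two groups, with age and weight on different scales.
people = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "weight": [60.0, 70.0, 80.0, 65.0, 75.0, 85.0],
})


def zscore(x):
    # Subtract the mean and divide by the standard deviation, column by column.
    return (x - x.mean()) / x.std()


# Each column is standardized against its own mean within its own group.
z = people.groupby("group").transform(zscore)
print(z)
```

The result has the same shape and row order as the age and weight columns of the input, which is the contract transform is built around.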

So basically, you don't need transform here. apply is the appropriate function, because apply really does operate on each group as a single DataFrame, while transform operates on each column of each group.
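For reference, here is a small sketch of the apply approach written for a more recent pandas, where sort_values(by=...) is used in place of sort_index(by=...). The six-row tips frame is a cut-down stand-in for the real dataset, and selecting the columns before apply is just one way to keep the grouping column out of the frames passed to top.

```python
import pandas as pd

# Cut-down, hypothetical stand-in for the tips data from the question.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": [False, True, False, True, False, True],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]


def top(df, n=2, column="tip_pct"):
    # Sort ascending and keep the last n rows, i.e. the n largest values.
    return df.sort_values(by=column).tail(n)


# apply hands each group to top as one DataFrame, so sorting by a column works.
result = tips.groupby("smoker")[["total_bill", "tip", "tip_pct"]].apply(top)
print(result)
```

The result is indexed by smoker and then by the original row label, like the output in the question.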

answered Oct 05 '22 22:10

BrenBarn
