Another pandas question.
Reading Wes McKinney's excellent book about data analysis and pandas, I encountered the following thing that I thought should work:
Suppose I have some info about tips.
    In [119]: tips.head()
    Out[119]:
       total_bill   tip     sex  smoker  day    time  size   tip_pct
    0       16.99  1.01  Female   False  Sun  Dinner     2  0.059447
    1       10.34  1.66    Male   False  Sun  Dinner     3  0.160542
    2       21.01  3.50    Male   False  Sun  Dinner     3  0.166587
    3       23.68  3.31    Male   False  Sun  Dinner     2  0.139780
    4       24.59  3.61  Female   False  Sun  Dinner     4  0.146808
and I want to know the five largest tips in relation to the total bill, that is, tip_pct
for smokers and non-smokers separately. So this works:
    def top(df, n=5, column='tip_pct'):
        return df.sort_index(by=column)[-n:]

    In [101]: tips.groupby('smoker').apply(top)
    Out[101]:
                total_bill   tip     sex  smoker   day    time  size   tip_pct
    smoker
    False  88        24.71  5.85    Male   False  Thur   Lunch     2  0.236746
           185       20.69  5.00    Male   False   Sun  Dinner     5  0.241663
           51        10.29  2.60  Female   False   Sun  Dinner     2  0.252672
           149        7.51  2.00    Male   False  Thur   Lunch     2  0.266312
           232       11.61  3.39    Male   False   Sat  Dinner     2  0.291990
    True   109       14.31  4.00  Female    True   Sat  Dinner     2  0.279525
           183       23.17  6.50    Male    True   Sun  Dinner     4  0.280535
           67         3.07  1.00  Female    True   Sat  Dinner     1  0.325733
           178        9.60  4.00  Female    True   Sun  Dinner     2  0.416667
           172        7.25  5.15    Male    True   Sun  Dinner     2  0.710345
Good enough, but then I wanted to use pandas' transform to do the same like this:
    def top_all(df):
        return df.sort_index(by='tip_pct')

    tips.groupby('smoker').transform(top_all)
but instead I get this:
TypeError: Transform function invalid for data types
Why? I know that transform requires the function to return an array with the same dimensions as its input, so I thought I'd be complying with that requirement by just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?
To group a pandas DataFrame, use groupby(). To sort a grouped DataFrame in ascending or descending order, use sort_values().
Sort Values in Descending Order with Groupby
You can sort values in descending order by passing ascending=False to the sort_values() method. The head() function returns the first n rows; it is useful for quickly checking that your object holds the right kind of data.
transform() can take a function, a string function name, a list of functions, or a dict. apply(), by contrast, accepts only a function. apply() can work with several Series (the whole DataFrame) at a time, whereas transform() works on a single Series at a time.
DataFrame.transform calls func on self, producing a DataFrame with the same axis shape as self. The func argument is the function to use for transforming the data; if it is a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.
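As a minimal sketch of the sort-then-head idea described above (the small `tips` frame and its values are invented for illustration, not taken from the book's dataset):

```python
import pandas as pd

# Toy data, made up for this example only.
tips = pd.DataFrame({
    "smoker": [False, False, False, True, True, True],
    "tip_pct": [0.10, 0.25, 0.15, 0.30, 0.05, 0.20],
})

# Sort the whole frame in descending order first, then keep the
# first two rows that appear for each group.
top2 = (
    tips.sort_values("tip_pct", ascending=False)
        .groupby("smoker")
        .head(2)
)
print(top2)
```

Here head(2) is the groupby version of head, so it keeps the first two rows of every group rather than the first two rows of the frame.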
transform is not that well documented, but it seems that the function you pass to transform receives not the entire group as a DataFrame but a single column of a single group. I don't think it's really meant for what you're trying to do, and your solution with apply is fine.
So suppose you call tips.groupby('smoker').transform(func). There will be two groups; call them group1 and group2. The transform does not call func(group1) and func(group2). Instead, it calls func(group1['total_bill']), then func(group1['tip']), etc., and then func(group2['total_bill']), func(group2['tip']), and so on. Here's an example:
    >>> print d
       A  B  C
    0 -2  5  4
    1  1 -1  2
    2  0  2  1
    3 -3  1  2
    4  5  0  2
    >>> def foo(df):
    ...     print ">>>"
    ...     print df
    ...     print "<<<"
    ...     return df
    >>> print d.groupby('C').transform(foo)
    >>>
    2    0
    Name: A
    <<<
    >>>
    2    2
    Name: B
    <<<
    >>>
    1    1
    3   -3
    4    5
    Name: A
    <<<
    >>>
    1   -1
    3    1
    4    0
    Name: B
    # etc.
You can see that foo is first called with just the A column of the C=1 group of the original data frame, then the B column of that group, then the A column of the C=2 group, etc.
This makes sense if you think about what transform is for. It's meant for applying transform functions on the groups. But in general, these functions won't make sense when applied to the entire group, only to a given column. For instance, the example in the pandas docs is about z-standardizing using transform. If you have a DataFrame with columns for age and weight, it wouldn't make sense to z-standardize with respect to the overall mean of both these variables. It doesn't even mean anything to take the overall mean of a bunch of numbers, some of which are ages and some of which are weights. You have to z-standardize the age with respect to the mean age and the weight with respect to the mean weight, which means you want to transform separately for each column.
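To make the age/weight point concrete, here is a small sketch of per-group z-standardization with transform. The `people` frame, its `group` column, and the `zscore` helper are hypothetical names chosen for this illustration:

```python
import pandas as pd

# Toy data, made up for this example only.
people = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "weight": [55.0, 65.0, 75.0, 60.0, 80.0, 100.0],
})

def zscore(x):
    # Subtract the mean and divide by the standard deviation; for a
    # single column this uses that column's own statistics.
    return (x - x.mean()) / x.std()

# Each column is standardized within each group; the result has the
# same index as the input, and the grouping column is left out.
standardized = people.groupby("group").transform(zscore)
print(standardized)
```

Ages are scaled against the mean age of their group and weights against the mean weight of their group, so the two variables are never mixed.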
So basically, you don't need to use transform here. apply is the appropriate function, because apply really does operate on each group as a single DataFrame, while transform operates on each column of each group.