I have a dataframe like this:
1 2
0 P 214233
1 P 130435
2 P 258824
3 P 75488
4 C 101215
5 C 105793
6 C 101591
I want to perform a Wilcoxon rank-sum test, for instance. Why doesn't the following command work?
import scipy.stats as ss
df.groupby(1).apply(ss.ranksums)
I think it doesn't work because scipy doesn't recognize the groups:
TypeError: ranksums() takes exactly 2 arguments (1 given)
How can one achieve this without selecting the groups manually, like this:
ss.ranksums(df[df[1]=="C"][2], df[df[1]=="P"][2])
I run into a similar problem with ANOVA when the dataframe looks like this:
1 2
0 P 214233
1 P 130435
2 A 258824
3 A 75488
4 A 101215
5 C 105793
6 C 101591
But here the error is:
TypeError: can't multiply sequence by non-int of type 'str'
Thanks
This works.
values_per_group = [col for col_name, col in df.groupby(1)[2]]
ss.ranksums(*values_per_group)
The explanation for @innohead's method is that scipy.stats tests expect only the value columns, each passed as a separate argument, whereas groupby splits a DataFrame into (group_name, DataFrame) tuples. Given a group column 1 and a value column 2, you can take the groupby object df.groupby(1), select only the value column with df.groupby(1)[2] (so each tuple now holds a Series instead of a DataFrame), and then iterate over the tuples in a list comprehension, keeping the values (col) and discarding the names (col_name).
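For reference, here is a minimal self-contained sketch of that approach, using the two-group data from the question. The DataFrame construction is an assumption about how the data was loaded (integer column labels 1 and 2). Note that ranksums accepts exactly two samples, so unpacking only works when the group column has two distinct values.

```python
import pandas as pd
import scipy.stats as ss

# Sample data copied from the question; the integer column labels
# 1 and 2 are assumed to match the original dataframe.
df = pd.DataFrame({
    1: ["P", "P", "P", "P", "C", "C", "C"],
    2: [214233, 130435, 258824, 75488, 101215, 105793, 101591],
})

# Iterating over df.groupby(1)[2] yields (group_name, Series) pairs;
# keep only the Series of values for each group.
values_per_group = [col for col_name, col in df.groupby(1)[2]]

# ranksums takes exactly two samples, so this unpacking relies on
# there being exactly two groups in column 1.
statistic, pvalue = ss.ranksums(*values_per_group)
print(statistic, pvalue)
```

Groups come out in sorted key order by default, so the first sample here is "C" and the second is "P".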
Instead of using a list comprehension, you could also keep the group names attached to the values by using a dict comprehension:
values_per_group = {col_name:col for col_name, col in df.groupby(1)[2]}
ss.ranksums(*values_per_group.values())
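The same unpacking pattern should carry over to the ANOVA case with three groups. The question does not show which ANOVA call was used, so the sketch below assumes scipy.stats.f_oneway, which accepts any number of samples as separate arguments. The name df3 and the DataFrame construction are only for illustration.

```python
import pandas as pd
import scipy.stats as ss

# Three-group data copied from the question; df3 is a hypothetical name
# and the integer column labels 1 and 2 are assumed.
df3 = pd.DataFrame({
    1: ["P", "P", "A", "A", "A", "C", "C"],
    2: [214233, 130435, 258824, 75488, 101215, 105793, 101591],
})

# Map each group name to its Series of values.
values_per_group = {name: col for name, col in df3.groupby(1)[2]}

# Assumption: one-way ANOVA via f_oneway, one positional argument per group.
f_stat, p_value = ss.f_oneway(*values_per_group.values())
print(list(values_per_group), f_stat, p_value)
```

Unlike ranksums, f_oneway is not limited to two samples, so the number of groups does not need to be known in advance.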