
pandas: how to apply a scipy.stats test to a groupby object?

I have a dataframe like this:

   1       2
0  P  214233
1  P  130435
2  P  258824
3  P   75488
4  C  101215
5  C  105793
6  C  101591

I want to perform a Wilcoxon rank-sum test, for instance. Why doesn't the following command work?

import scipy.stats as ss
df.groupby(1).apply(ss.ranksums)

I think it doesn't work because scipy doesn't recognize the groups:

TypeError: ranksums() takes exactly 2 arguments (1 given)

How can I achieve this without selecting each group manually, like this:

ss.ranksums(df[df[1]=="C"][2], df[df[1]=="P"][2])

I have a similar problem with ANOVA, if the dataframe looks like this:

   1       2
0  P  214233
1  P  130435
2  A  258824
3  A   75488
4  A  101215
5  C  105793
6  C  101591

But here the error is:

TypeError: can't multiply sequence by non-int of type 'str'

Thanks

jrjc asked Mar 18 '23 07:03


2 Answers

This works.

values_per_group = [col for col_name, col in df.groupby(1)[2]]
ss.ranksums(*values_per_group)
innohead answered Mar 20 '23 19:03


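The explanation for @innohead's method is that scipy.stats tests expect one array of values per sample as separate arguments, whereas apply passes each group to the function as a single argument. Iterating over a groupby object yields (group_name, group) tuples. Given a group column 1 and a value column 2, you can build the groupby object df.groupby(1), select only the value column with df.groupby(1)[2] so that each group is a Series of values, and then use a list comprehension to iterate through the tuples, keeping the values (col) and discarding the names (col_name). Note that ranksums compares exactly two samples, so unpacking with * only works when the group column has two distinct values.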
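To make that concrete, here is a minimal self-contained sketch using the data from the question. It assumes the column labels are the integers 1 and 2, as the printed dataframe suggests; the names groups and group_names are introduced only for illustration.

```python
import pandas as pd
import scipy.stats as ss

# Data copied from the question; integer column labels 1 and 2 are assumed.
df = pd.DataFrame({
    1: ["P", "P", "P", "P", "C", "C", "C"],
    2: [214233, 130435, 258824, 75488, 101215, 105793, 101591],
})

# Iterating over the grouped value column yields (group_name, Series) pairs,
# sorted by group name by default.
groups = list(df.groupby(1)[2])
group_names = [name for name, _ in groups]
values_per_group = [col for _, col in groups]

# ranksums takes exactly two samples, so the unpacking relies on there being
# exactly two groups in column 1.
statistic, pvalue = ss.ranksums(*values_per_group)
print(group_names, statistic, pvalue)
```

This should give the same result as the manual call in the question, ss.ranksums(df[df[1]=="C"][2], df[df[1]=="P"][2]), since the groups come out in sorted order (C before P).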

Instead of using a list comprehension, you could also keep the group names attached to the values by using a dict comprehension:

values_per_group = {col_name:col for col_name, col in df.groupby(1)[2]}
ss.ranksums(*values_per_group.values())
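
The same pattern should carry over to the ANOVA case from the question. The question does not show which ANOVA call failed, so the sketch below assumes scipy.stats.f_oneway, which accepts any number of samples as separate arguments. The data is the three-group dataframe from the question, again assuming integer column labels 1 and 2.

```python
import pandas as pd
import scipy.stats as ss

# Three-group data copied from the question; integer column labels assumed.
df = pd.DataFrame({
    1: ["P", "P", "A", "A", "A", "C", "C"],
    2: [214233, 130435, 258824, 75488, 101215, 105793, 101591],
})

# Map each group name to its Series of values.
values_per_group = {name: col for name, col in df.groupby(1)[2]}

# f_oneway (assumed here to be the intended ANOVA function) takes one
# argument per sample, so unpacking works for two or more groups.
f_stat, p_value = ss.f_oneway(*values_per_group.values())
print(list(values_per_group), f_stat, p_value)
```

Keeping the dict means you can still look up which values belong to which group after running the test.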
r3robertson answered Mar 20 '23 20:03