
Apply function to each row in Pandas dataframe by group

I built a Pandas dataframe (example below) indexed by gene name that has sample names for columns and integers as cell values. What I want to do is run an ANOVA (f_oneway(), from scipy.stats) for lists of row values as defined by lists of the columns corresponding to groups of samples. Those results would then be stored in a new Pandas dataframe with group names as columns and the same genes for index.

An example of the dataframe (it's returned from another function):

import pandas as pd
counts = {'sample1' : [0, 1, 5, 0, 10],
        'sample2' : [2, 0, 10, 0, 0],
        'sample3' : [0, 0, 0, 1, 0],
        'sample4' : [10, 0, 1, 4, 0]}
data = pd.DataFrame(counts, columns = ['sample1', 'sample2', 'sample3', 'sample4'],
        index = ['gene1', 'gene2', 'gene3', 'gene4', 'gene5'])

Groups are imported as arguments by main(), so in this function I have:

def compare(out_prefix, pops, data):
    import scipy.stats as stats
    sig = pd.DataFrame(index=data.index)

#groups will look like:
#groups = [['sample1', 'sample2'],['sample3', 'sample4']]

    for pop in pops:
        with open(pop) as infile:
            group_s = []
            for spl in infile:
                group_s.append(spl.replace("\n",""))

        mean_col = pop.split(".")[0]+"_mean"
        std_col = pop.split(".")[0]+"_std"
        stat_col = pop.split(".")[0]+"_stat"
        p_col = pop.split(".")[0]+"_sig"

        sig[mean_col] = data[group_s].mean(axis=1)
        sig[std_col] = data[group_s].std(axis=1)

        sig[[stat_col, p_col]] = data.apply(lambda row : stats.f_oneway(data.loc[group_s].values.tolist()))

This last line doesn't work, and after a few days of searching I can't see how to do it. Could someone point me in the right direction? Ideally, it would write the ANOVA results (F statistic and p-value) per row, for the samples in each group, to the columns stat_col and p_col in sig. For gene1 it would feed stats.f_oneway a list of lists of the values for the samples in each group, e.g. [[0, 2], [0, 10]].

Thanks in advance!

André Soares asked Sep 25 '20 13:09



1 Answer

Try this:

group = ['sample1', 'sample2']

On your sample:

data[group].T

looks like:

         gene1  gene2  gene3  gene4  gene5
sample1      0      1      5      0     10
sample2      2      0     10      0      0

and finally:

import scipy.stats as stats

# Each row of data[group].T is one sample, passed as a separate group
anova = stats.f_oneway(*data[group].T.values)
print(anova.statistic, anova.pvalue)

The anova object contains the F statistic and the p-value:

0.0853333333333 0.777628169862
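
Note that the call above runs a single test comparing sample1 with sample2 across all genes. If one test per gene is wanted, as the question describes, a minimal sketch could look like the following. It assumes groups is already a list of lists of column names (as in the question's comment); anova_row and the column names 'stat' and 'sig' are hypothetical, chosen only for illustration.

```python
import pandas as pd
from scipy import stats

counts = {'sample1': [0, 1, 5, 0, 10],
          'sample2': [2, 0, 10, 0, 0],
          'sample3': [0, 0, 0, 1, 0],
          'sample4': [10, 0, 1, 4, 0]}
data = pd.DataFrame(counts, index=['gene1', 'gene2', 'gene3', 'gene4', 'gene5'])

# Assumed layout: each inner list holds the sample columns of one group.
groups = [['sample1', 'sample2'], ['sample3', 'sample4']]

def anova_row(row):
    # Hypothetical helper: one ANOVA per gene, with one array of values per group.
    f_stat, p_value = stats.f_oneway(*[row[g].to_numpy() for g in groups])
    return pd.Series({'stat': f_stat, 'sig': p_value})

# axis=1 passes each row (gene) to anova_row; the result has one row per gene.
sig = data.apply(anova_row, axis=1)
print(sig)
```

For gene1 this passes [0, 2] and [0, 10] to stats.f_oneway, which is the input the question asks for. With only two samples per group the resulting p-values carry little weight, so treat this as a sketch of the mechanics only.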
dokteurwho answered Sep 29 '22 15:09
