
What is the correct way of passing parameters to stats.friedmanchisquare based on a DataFrame?

I am trying to pass values to stats.friedmanchisquare from a DataFrame df that has shape (11, 17).

This is what works for me (only for three rows in this example):

df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])

which yields

(16.714285714285694, 0.00023471398805908193)

However, this call becomes far too long when I want to pass all 11 rows of df.

First, I tried to pass the values in the following manner:

df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])

but I get:

ValueError: 
Less than 3 levels.  Friedman test not appropriate.

Second, I also tried leaving df as a DataFrame instead of converting it to a matrix (which would be ideal for me), but I guess this is either not supported yet or I am doing it wrong:

print stats.friedmanchisquare([row for index, row in df.iterrows()])

which also gives me the error:

ValueError: 
Less than 3 levels.  Friedman test not appropriate.

So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)

You can download my dataframe in csv format here and read it using:

df = pd.read_csv('df.csv', header=0, index_col=0)

Thank you for your help :)

Solution:

Based on @Ami Tavory and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:

df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])

And the same is true if you want to work with the original dataframe, which is what I ideally wanted:

print stats.friedmanchisquare(*[row for index, row in df.iterrows()])

In this manner you iterate over the DataFrame in its native format.
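For readers on newer library versions, here is a rough Python 3 sketch of the same two calls. It assumes a pandas release where DataFrame.as_matrix() is no longer available and uses DataFrame.to_numpy() instead. The DataFrame below is synthetic random data standing in for df.csv (an assumption, since the original file is not reproduced here), so the printed numbers will not match the ones above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the original df.csv (assumption: 11 rows, 17 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(11, 17)))

# Unpacking a 2-D array with * yields one 1-D array per row.
res_array = stats.friedmanchisquare(*df.to_numpy())

# Row-wise iteration over the DataFrame, converting each row to an array.
res_frame = stats.friedmanchisquare(
    *(row.to_numpy() for _, row in df.iterrows())
)

print(res_array.statistic, res_array.pvalue)
print(res_frame.statistic, res_frame.pvalue)
```

Both calls receive the same 11 samples, so they should report the same statistic and p-value.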

Note that I ran some timeit tests to see which way is faster. As it turns out, converting df to a NumPy array first is nearly twice as fast as using it in its original DataFrame format.

This was my experimental setup:

import timeit

setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''

theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''

print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))

theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''

print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))

which yields the following results:

4.97029900551
8.7627799511
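A comparable measurement can be sketched in Python 3 as follows. This is only an illustration under stated assumptions: the data is synthetic (same 11 x 17 shape as df, not the original file), the function names from_array and from_frame are made up for this example, the repetition counts are much smaller than above, and the array conversion happens once before timing rather than inside the timed statement. Absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df.csv (assumption: same 11 x 17 shape).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(11, 17)))
arr = df.to_numpy()  # converted once, outside the timed code


def from_array():
    # Rows of the pre-converted NumPy array, unpacked as separate arguments.
    return stats.friedmanchisquare(*arr)


def from_frame():
    # Rows taken from the DataFrame on every call via iterrows().
    return stats.friedmanchisquare(
        *(row.to_numpy() for _, row in df.iterrows())
    )


t_array = min(timeit.repeat(from_array, repeat=3, number=100))
t_frame = min(timeit.repeat(from_frame, repeat=3, number=100))
print(t_array, t_frame)
```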
Pablo Rivas asked Jul 02 '15 22:07


2 Answers

The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it.

stats.friedmanchisquare needs multiple array_like arguments, not one list, so the list is treated as a single sample and the "Less than 3 levels" check fails.

Try using the * (star/unpack) operator to unpack the list.

Like this:

df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
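To see why the star matters, here is a minimal pure-Python sketch. The function count_samples is a hypothetical stand-in for any function that accepts a variable number of positional arguments; it is not part of SciPy.

```python
def count_samples(*args):
    """Hypothetical stand-in for a variadic function such as friedmanchisquare."""
    return len(args)


rows = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]

# Without the star, the whole list arrives as one argument.
print(count_samples(rows))   # 1

# With the star, each row arrives as its own argument.
print(count_samples(*rows))  # 3
```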
vicg answered Nov 15 '22 18:11


You could pass it using the "star operator", similarly to this:

import numpy as np
from scipy.stats import friedmanchisquare
a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
Ami Tavory answered Nov 15 '22 18:11