
Pandas Correlation Groupby

Assuming I have a dataframe similar to the below, how would I get the correlation between 2 specific columns and then group by the 'ID' column? I believe the Pandas 'corr' method finds the correlation between all columns. If possible I would also like to know how I could find the 'groupby' correlation using the .agg function (i.e. np.correlate).

What I have:

ID  Val1  Val2  OtherData  OtherData
A   5     4     x          x
A   4     5     x          x
A   6     6     x          x
B   4     1     x          x
B   8     2     x          x
B   7     9     x          x
C   4     8     x          x
C   5     5     x          x
C   2     1     x          x

What I need:

ID  Correlation_Val1_Val2
A   0.12
B   0.22
C   0.05

Thanks!

bsheehy asked Mar 11 '15 14:03


1 Answer

You've pretty much figured out all the pieces; you just need to combine them:

>>> df.groupby('ID')[['Val1','Val2']].corr()
             Val1      Val2
ID
A  Val1  1.000000  0.500000
   Val2  0.500000  1.000000
B  Val1  1.000000  0.385727
   Val2  0.385727  1.000000

In your case, printing out a 2x2 for each ID is excessively verbose. I don't see an option to print a scalar correlation instead of the whole matrix, but you can do something simple like this if you only have two variables:

>>> df.groupby('ID')[['Val1','Val2']].corr().iloc[0::2,-1]
ID
A   Val1    0.500000
B   Val1    0.385727
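The question also asked about getting a single number per group. One possible sketch, assuming the only columns of interest are Val1 and Val2, is to call Series.corr inside groupby(...).apply. The helper name pair_corr is made up for illustration, and the DataFrame is rebuilt from the question's table with the OtherData columns left out.

```python
import pandas as pd

# Data copied from the question (OtherData columns omitted).
df = pd.DataFrame({
    "ID":   ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Val1": [5, 4, 6, 4, 8, 7, 4, 5, 2],
    "Val2": [4, 5, 6, 1, 2, 9, 8, 5, 1],
})

def pair_corr(group):
    """Pearson correlation between Val1 and Val2 within one group.

    pair_corr is a hypothetical helper name, not part of pandas.
    """
    return group["Val1"].corr(group["Val2"])

# Select the two columns first so apply only sees the value columns,
# then name the resulting Series as in the desired output.
result = (
    df.groupby("ID")[["Val1", "Val2"]]
      .apply(pair_corr)
      .rename("Correlation_Val1_Val2")
)
print(result)
```

This yields one value per ID rather than a 2x2 matrix. Note that np.correlate computes a cross-correlation of two sequences, not the Pearson coefficient, so it is probably not what is wanted here.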

For the more general case of 3+ variables

For 3 or more variables, it is not straightforward to create concise output, but you could do something like this:

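import pandas as pd

# Column names must be a list literal; list('Val1', ...) raises a TypeError
groups = ['Val1', 'Val2', 'Val3', 'Val4']

# Stack the per-ID correlation matrices into a Series indexed by (ID, v1, v2)
corr = df.groupby('ID')[groups].corr().stack()

# For each variable, keep only its pairs with the variables that follow it.
# DataFrame.append was removed in pandas 2.0, so the pieces go through pd.concat.
df2 = pd.concat([corr.loc[:, groups[i], groups[i+1]:].reset_index()
                 for i in range(len(groups) - 1)])

df2.columns = ['ID', 'v1', 'v2', 'corr']
df2.set_index(['ID', 'v1', 'v2']).sort_index()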

Note that if we didn't have the groupby element, it would be straightforward to use an upper or lower triangle function from numpy. But since that element is present, it is not so easy to produce concise output in a more elegant manner as far as I can tell.
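To illustrate the ungrouped case mentioned above, here is a minimal sketch that uses np.triu_indices to pull the upper triangle out of a single correlation matrix. The four-column data is randomly generated purely for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical four-variable data, invented only for this illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 4)),
                  columns=["Val1", "Val2", "Val3", "Val4"])

corr = df.corr()

# Row/column positions strictly above the diagonal (k=1 skips the 1.0 entries).
rows, cols = np.triu_indices(len(corr.columns), k=1)

# One row per unordered pair of variables.
pairs = pd.DataFrame({
    "v1": corr.columns[rows],
    "v2": corr.columns[cols],
    "corr": corr.to_numpy()[rows, cols],
})
print(pairs)
```

With four variables this gives six rows, one for each distinct pair.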

JohnE answered Sep 19 '22 19:09