pd.corrwith on pandas dataframes with different column names

Tags: python, pandas

I would like to compute the Pearson r between x1 and each of the three columns in y, in an efficient manner.

It appears that DataFrame.corrwith() can only calculate this for columns that share exactly the same labels, e.g. between x and y.

This seems a bit impractical, as I presume computing correlations between different variables would be a common problem.

In [1]: import pandas as pd; import numpy as np

In [2]: x = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])

In [3]: y = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])

In [4]: x1 = pd.DataFrame(x.iloc[:,0])  # .ix was removed from pandas; .iloc selects by position

In [5]: x.corrwith(y)
Out[5]:
A   -0.752631
B   -0.525705
C    0.516071
dtype: float64

In [6]: x1.corrwith(y)
Out[6]:
A   -0.752631
B         NaN
C         NaN
dtype: float64
asked Nov 22 '14 15:11 by themachinist



1 Answer

You can accomplish what you want using DataFrame.corrwith(Series) rather than DataFrame.corrwith(DataFrame):

In [203]: x1 = x['A']

In [204]: y.corrwith(x1)
Out[204]:
A    0.347629
B   -0.480474
C   -0.729303
dtype: float64
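
As a self-contained check of the Series approach, here is a minimal sketch. The seeded random data and the names by_series, reference and by_frame are assumptions for illustration and are not part of the original post. It compares DataFrame.corrwith(Series) against Series.corr applied column by column, and reproduces the NaN behaviour seen when a one-column DataFrame is passed instead.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the run is reproducible (assumption, not from the post).
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

x1 = x["A"]  # a Series, not a one-column DataFrame

# Passing a Series: every column of y is correlated with x1.
by_series = y.corrwith(x1)

# Reference computation: Series.corr, one column of y at a time.
reference = pd.Series({col: y[col].corr(x1) for col in y.columns})

# Passing a one-column DataFrame: columns are matched by label,
# so only "A" has a partner and the other labels come back as NaN.
by_frame = x[["A"]].corrwith(y)

print(by_series)
print(by_frame)
```

The difference between the two calls comes from label alignment: with a DataFrame argument, corrwith pairs columns by name, whereas with a Series argument it correlates that one Series against every column.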

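Alternatively, you can form the matrix of correlations between each column of x and each column of y as follows (note that pd.expanding_corr is a legacy function that has been removed from recent pandas releases):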

In [214]: pd.expanding_corr(x, y, pairwise=True).iloc[-1, :, :]
Out[214]:
          A         B         C
A  0.347629 -0.480474 -0.729303
B -0.334814  0.778019  0.654583
C -0.453273  0.212057  0.149544

Alas, DataFrame.corrwith() doesn't have a pairwise=True option.
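
One possible way to build the same kind of cross-correlation matrix using only DataFrame.corrwith(Series) is sketched below. The helper name cross_corr, the seeded data, and the column labels D, E, F are hypothetical choices for illustration. The sketch assumes both frames share the same row index and contain only numeric columns.

```python
import numpy as np
import pandas as pd

# Seeded example data; y deliberately uses different column labels from x.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])
y = pd.DataFrame(rng.standard_normal((5, 3)), columns=["D", "E", "F"])


def cross_corr(left, right):
    """Hypothetical helper: Pearson r of each column of `left` with each column of `right`.

    Rows of the result are labelled by left's columns, columns by right's columns.
    """
    # left.corrwith(Series) returns one correlation per column of `left`;
    # collecting those Series in a dict keyed by right's labels yields the full matrix.
    return pd.DataFrame({rcol: left.corrwith(right[rcol]) for rcol in right.columns})


cc = cross_corr(x, y)
print(cc)
```

Because each call passes a Series, the column labels of the two frames never need to match. For frames with many columns, a vectorised approach may be faster, but this version stays close to the corrwith call shown in the answer.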

answered Oct 10 '22 14:10 by seth-p