Similar questions have been asked, but I've not seen a lucid answer. Forgive me for asking again. I have two dataframes, and I simply want the correlation of the first data frame with each column in the second. Here is code which does exactly what I want:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Y': np.random.randn(10)})
df2 = pd.DataFrame({'X1': np.random.randn(10), 'X2': np.random.randn(10), 'X3': np.random.randn(10)})
for col in df2:
    print(df1['Y'].corr(df2[col]))
but it doesn't seem like I should be looping through the dataframe. I was hoping that something as simple as
df1.corr(df2)
ought to get the job done. Is there a clear way to perform this function without looping?
The corr() method gives the correlation between two columns of a DataFrame.
Called with no arguments on a DataFrame, corr() calculates a matrix of the correlations between each pair of columns in the dataset. The result is a symmetric matrix, called a correlation matrix, with a value of 1.0 along the diagonal because each column always correlates perfectly with itself.
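As a rough illustration of the correlation matrix described above, here is a minimal sketch. The column names, the seed, and the random data are made up for the example; the only pandas call it relies on is DataFrame.corr() with its default Pearson method.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three columns of random numbers, seeded for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["X1", "X2", "X3"])

# Pairwise correlation of every column with every other column (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix)

# The diagonal is 1.0 and the matrix equals its own transpose.
print(np.allclose(np.diag(corr_matrix.to_numpy()), 1.0))
print(np.allclose(corr_matrix.to_numpy(), corr_matrix.to_numpy().T))
```

Note that this works within a single DataFrame. It does not compare one DataFrame against another.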
Initialize two variables, col1 and col2, and assign them the names of the columns whose correlation you want. Compute the correlation with df[col1].corr(df[col2]), save the result in a variable, corr, and print it.
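The steps above can be sketched as follows. The DataFrame and its column names "a" and "b" are placeholders invented for the example; Series.corr() is the actual pandas method being shown.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame, seeded so the run is repeatable.
rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.standard_normal(10), "b": rng.standard_normal(10)})

# Names of the two columns to compare.
col1, col2 = "a", "b"

# Series.corr returns a single Pearson correlation coefficient by default.
corr = df[col1].corr(df[col2])
print(corr)
```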
Two columns are correlated if the value of one column is related to the value of the other column. For example, city name and state name columns are strongly correlated because the city name usually, but perhaps not always, identifies the state name.
You can use corrwith:
>>> df2.corrwith(df1.Y)
X1 0.051002
X2 -0.339775
X3 0.076935
dtype: float64
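To check that corrwith really replaces the loop from the question, a small sketch along these lines should do. The seed is an arbitrary choice so the run is repeatable, so the numbers will differ from the output above. As far as I can tell, df1.corr(df2) cannot work because the first parameter of DataFrame.corr is the correlation method, not a second DataFrame.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question, with a fixed (arbitrary) seed.
rng = np.random.default_rng(42)
df1 = pd.DataFrame({"Y": rng.standard_normal(10)})
df2 = pd.DataFrame({
    "X1": rng.standard_normal(10),
    "X2": rng.standard_normal(10),
    "X3": rng.standard_normal(10),
})

# The explicit loop from the question, collected into a Series.
looped = pd.Series({col: df1["Y"].corr(df2[col]) for col in df2})

# The loop-free version: correlate every column of df2 with the Series df1["Y"].
vectorized = df2.corrwith(df1["Y"])
print(vectorized)

# Both approaches give the same values.
print(np.allclose(looped.to_numpy(), vectorized.to_numpy()))
```

corrwith aligns on the index before computing, so both frames need matching row labels for the result to be meaningful.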