Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas.DataFrame corrwith() method

I recently start working with pandas. Can anyone explain me difference in behaviour of function .corrwith() with Series and DataFrame?

Suppose i have one DataFrame:

frame = pd.DataFrame(data={'a':[1,2,3], 'b':[-1,-2,-3], 'c':[10, -10, 10]})

And i want calculate correlation between features 'a' and all other features. I can do it in the following way:

frame.drop(labels='a', axis=1).corrwith(frame['a'])

And result will be:

b   -1.0
c    0.0

But very similar code:

frame.drop(labels='a', axis=1).corrwith(frame[['a']])

Generate absolutely different and unacceptable table:

a   NaN
b   NaN
c   NaN

So, my question is: why in case of DataFrame as second argument we get such strange output?

like image 992
Nikita Sivukhin Avatar asked Jul 17 '16 13:07

Nikita Sivukhin


People also ask

How do you find the correlation between two rows in pandas?

Initialize two variables, col1 and col2, and assign them the columns that you want to find the correlation of. Find the correlation between col1 and col2 by using df[col1]. corr(df[col2]) and save the correlation value in a variable, corr. Print the correlation value, corr.

What does Corr () do in Python?

corr() is used to find the pairwise correlation of all columns in the Pandas Dataframe in Python. Any NaN values are automatically excluded. Any non-numeric data type or columns in the Dataframe, it is ignored.

How can you find a correlation matrix of a PD DataFrame?

Pandas dataframe. corr() method is used for creating the correlation matrix. It is used to find the pairwise correlation of all columns in the dataframe. Any na values are automatically excluded.

How do I merge two Dataframes in pandas?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.


1 Answers

What I think you're looking for:

Let's say your frame is:

frame = pd.DataFrame(np.random.rand(10, 6), columns=['cost', 'amount', 'day', 'month', 'is_sale', 'hour'])

You want the 'cost' and 'amount' columns to be correlated with all other columns in every combination.

focus_cols = ['cost', 'amount']
frame.corr().filter(focus_cols).drop(focus_cols)

enter image description here

Answering what you asked:

Compute pairwise correlation between rows or columns of two DataFrame objects.

Parameters:

other : DataFrame

axis : {0 or ‘index’, 1 or ‘columns’},

default 0 0 or ‘index’ to compute column-wise, 1 or ‘columns’ for row-wise drop : boolean, default False Drop missing indices from result, default returns union of all Returns: correls : Series

corrwith is behaving similarly to add, sub, mul, div in that it expects to find a DataFrame or a Series being passed in other despite the documentation saying just DataFrame.

When other is a Series it broadcast that series and matches along the axis specified by axis, default is 0. This is why the following worked:

frame.drop(labels='a', axis=1).corrwith(frame.a)

b   -1.0
c    0.0
dtype: float64

When other is a DataFrame it will match the axis specified by axis and correlate each pair identified by the other axis. If we did:

frame.drop('a', axis=1).corrwith(frame.drop('b', axis=1))

a    NaN
b    NaN
c    1.0
dtype: float64

Only c was in common and only c had its correlation calculated.

In the case you specified:

frame.drop(labels='a', axis=1).corrwith(frame[['a']])

frame[['a']] is a DataFrame because of the [['a']] and now plays by the DataFrame rules in which its columns must match up with what its being correlated with. But you explicitly drop a from the first frame then correlate with a DataFrame with nothing but a. The result is NaN for every column.

like image 130
piRSquared Avatar answered Sep 22 '22 13:09

piRSquared