
Pandas - Creating Difference Matrix from Data Frame

Tags: python, pandas

I'm trying to create a matrix to show the differences between the rows in a Pandas data frame.

import pandas as pd

data = {'Country':['GB','JP','US'],'Values':[20.2,-10.5,5.7]}
df = pd.DataFrame(data)

I would like this:

  Country  Values
0      GB    20.2
1      JP   -10.5
2      US     5.7

To become something like this (differences going vertically):

  Country     GB     JP     US
0      GB    0.0  -30.7  -14.5
1      JP   30.7    0.0   16.2
2      US   14.5  -16.2    0.0

Is this achievable with a built-in function, or would I need to write a loop to get the desired output? Thanks for your help!

alpacafondue asked Sep 17 '17 17:09



2 Answers

This is a standard use case for numpy's broadcasting:

df['Values'].values - df['Values'].values[:, None]
Out: 
array([[  0. , -30.7, -14.5],
       [ 30.7,   0. ,  16.2],
       [ 14.5, -16.2,   0. ]])

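The values attribute gives the underlying NumPy array, which has shape (3,). Indexing with [:, None] adds a new axis and turns it into a column of shape (3, 1). Subtracting the column from the flat array broadcasts both to shape (3, 3), so the entry in row i, column j is Values[j] - Values[i].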
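To make the broadcasting step concrete, here is a minimal sketch that spells out the shapes involved. The names `vals`, `col`, and `diff` are illustrative and do not come from the answer, and `to_numpy()` is used in place of `.values` as an assumption about a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['GB', 'JP', 'US'], 'Values': [20.2, -10.5, 5.7]})

vals = df['Values'].to_numpy()   # 1-D array, shape (3,)
col = vals[:, None]              # same values as a column, shape (3, 1)
diff = vals - col                # broadcast to shape (3, 3)

# diff[i, j] holds Values[j] - Values[i]
print(vals.shape, col.shape, diff.shape)
print(np.round(diff, 1))
```

Swapping the operands (`col - vals`) would flip the sign of every entry, which is the same as transposing the result.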

You can concatenate this with the original Country column:

arr = df['Values'].values - df['Values'].values[:, None]
pd.concat((df['Country'], pd.DataFrame(arr, columns=df['Country'])), axis=1)
Out: 
  Country    GB    JP    US
0      GB   0.0 -30.7 -14.5
1      JP  30.7   0.0  16.2
2      US  14.5 -16.2   0.0
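If looking up individual pairs matters more than reproducing the exact layout, one possible variation is to label both the rows and the columns by country. This is only a sketch: the names `labels` and `diff_df` are made up for illustration, and it assumes the country codes are unique.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['GB', 'JP', 'US'], 'Values': [20.2, -10.5, 5.7]})

vals = df['Values'].to_numpy()
labels = df['Country'].tolist()

# Rows and columns are both labelled by country; the entry at
# [row, col] is Values[col] - Values[row], as in the output above.
diff_df = pd.DataFrame(vals - vals[:, None], index=labels, columns=labels)

print(diff_df.round(1))
print(diff_df.loc['JP', 'GB'])   # value for GB minus value for JP
```

Calling `diff_df.reset_index()` would turn the row labels back into an ordinary column if the original layout is needed.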

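The array can also be generated with the following, thanks to @Divakar (this needs numpy imported as np):

arr = np.subtract.outer(*[df['Values'].to_numpy()]*2).T  # pass a NumPy array: recent pandas versions do not support ufunc.outer on a Series

Here we call .outer on the subtract ufunc, which applies the subtraction to every pair of elements from its two inputs. The result has Values[i] - Values[j] in row i, column j, so the transpose (.T) is needed to match the orientation above.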
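As a quick check that the two approaches agree, the sketch below builds the matrix both ways and compares them. It passes plain NumPy arrays to `np.subtract.outer` rather than a Series, which is an assumption made here to stay compatible with current pandas; the names `by_outer` and `by_broadcast` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Country': ['GB', 'JP', 'US'], 'Values': [20.2, -10.5, 5.7]})
vals = df['Values'].to_numpy()

# outer(vals, vals)[i, j] is vals[i] - vals[j]; the transpose flips it
by_outer = np.subtract.outer(vals, vals).T

# broadcasting gives vals[j] - vals[i] directly
by_broadcast = vals - vals[:, None]

print(np.allclose(by_outer, by_broadcast))
```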

ayhan answered Oct 25 '22 01:10


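This tries to improve on Divakar's comment. Negating the values before taking the outer difference gives the same orientation without a transpose:

a = np.column_stack([df['Country'], np.subtract.outer(*[-df['Values'].to_numpy()]*2)])

df = pd.DataFrame(a, columns=['Country'] + df['Country'].tolist())
print(df)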
  Country    GB    JP    US
0      GB     0 -30.7 -14.5
1      JP  30.7     0  16.2
2      US  14.5 -16.2     0

jezrael answered Oct 25 '22 01:10