
Subtracting columns based on key column in pandas dataframe

I have two dataframes looking like

df1:

   ID    A   B   C   D 
0 'ID1' 0.5 2.1 3.5 6.6
1 'ID2' 1.2 5.5 4.3 2.2
2 'ID1' 0.7 1.2 5.6 6.0 
3 'ID3' 1.1 7.2 10. 3.2

df2:

   ID    A   B   C   D 
0 'ID1' 1.0 2.0 3.3 4.4
1 'ID2' 1.5 5.0 4.0 2.2
2 'ID3' 0.6 1.2 5.9 6.2 
3 'ID4' 1.1 7.2 8.5 3.0

df1 can have multiple entries with the same ID, whereas each ID occurs only once in df2. Also, not every ID in df2 is necessarily present in df1. I don't think I can solve this with set_index(), because multiple rows in df1 can share the same ID and the IDs in df1 and df2 are not aligned.

I want to create a new dataframe where I subtract the values in df2[['A','B','C','D']] from df1[['A','B','C','D']] based on matching the ID.

The resulting dataframe would look like:

df_new:

   ID     A    B   C   D 
0 'ID1' -0.5  0.1 0.2 2.2
1 'ID2' -0.3  0.5 0.3 0.0
2 'ID1' -0.3 -0.8 2.3 1.6
3 'ID3'  0.5  6.0 4.1 -3.0

I know how to do this with a loop, but since I'm dealing with huge data quantities this is not practical at all. What is the best way of approaching this with Pandas?

AstroAT asked May 03 '18 14:05




3 Answers

You just need set_index and a subtraction:

(df1.set_index('ID')-df2.set_index('ID')).dropna(axis=0)
Out[174]: 
         A    B    C    D
ID                       
'ID1' -0.5  0.1  0.2  2.2
'ID1' -0.3 -0.8  2.3  1.6
'ID2' -0.3  0.5  0.3  0.0
'ID3'  0.5  6.0  4.1 -3.0

If the order matters, reindex df2 by the IDs of df1 first:

(df1.set_index('ID')-df2.set_index('ID').reindex(df1.ID)).dropna(axis=0).reset_index()
Out[211]: 
      ID    A    B    C    D
0  'ID1' -0.5  0.1  0.2  2.2
1  'ID2' -0.3  0.5  0.3  0.0
2  'ID1' -0.3 -0.8  2.3  1.6
3  'ID3'  0.5  6.0  4.1 -3.0
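
For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the reindex variant. It rebuilds the two sample frames from the question (with the quotes around the IDs dropped, which is an assumption about the real data) and relies on each ID occurring only once in df2, as the question states. The names `cols` and `ref` are only for illustration.

```python
import pandas as pd

cols = ["A", "B", "C", "D"]

# Sample data copied from the question.
df1 = pd.DataFrame(
    [["ID1", 0.5, 2.1, 3.5, 6.6],
     ["ID2", 1.2, 5.5, 4.3, 2.2],
     ["ID1", 0.7, 1.2, 5.6, 6.0],
     ["ID3", 1.1, 7.2, 10.0, 3.2]],
    columns=["ID"] + cols,
)
df2 = pd.DataFrame(
    [["ID1", 1.0, 2.0, 3.3, 4.4],
     ["ID2", 1.5, 5.0, 4.0, 2.2],
     ["ID3", 0.6, 1.2, 5.9, 6.2],
     ["ID4", 1.1, 7.2, 8.5, 3.0]],
    columns=["ID"] + cols,
)

# Look up the df2 row for every df1 row, in df1's order.
# This needs the IDs in df2 to be unique.
ref = df2.set_index("ID").reindex(df1["ID"])

# Both frames now carry the same index, so the subtraction is row by row.
# dropna() removes rows whose ID had no match in df2.
df_new = (df1.set_index("ID") - ref).dropna(axis=0).reset_index()
print(df_new)
```

Because df2 is reindexed to df1's IDs before subtracting, the result keeps df1's row order and IDs that exist only in df2 (here ID4) never enter the calculation.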
BENY answered Oct 11 '22 08:10


Similarly to what Wen (who beat me to it) proposed, you can use pd.DataFrame.subtract:

df1.set_index('ID').subtract(df2.set_index('ID')).dropna()

         A    B    C    D
ID                       
'ID1' -0.5  0.1  0.2  2.2
'ID1' -0.3 -0.8  2.3  1.6
'ID2' -0.3  0.5  0.3  0.0
'ID3'  0.5  6.0  4.1 -3.0
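
A small sketch of why dropping missing rows matters with index-aligned subtraction: an ID that appears in only one frame still gets a row in the result, filled with NaN. To keep it short, the frames below hold only columns A and B from the question, and the names `diff`, `unmatched` and `cleaned` are made up for this example.

```python
import pandas as pd

# Columns A and B of the question's sample frames.
df1 = pd.DataFrame({"ID": ["ID1", "ID2", "ID1", "ID3"],
                    "A": [0.5, 1.2, 0.7, 1.1],
                    "B": [2.1, 5.5, 1.2, 7.2]})
df2 = pd.DataFrame({"ID": ["ID1", "ID2", "ID3", "ID4"],
                    "A": [1.0, 1.5, 0.6, 1.1],
                    "B": [2.0, 5.0, 1.2, 7.2]})

# Subtraction aligns on the ID index, keeping IDs from both frames.
diff = df1.set_index("ID").subtract(df2.set_index("ID"))

# IDs without a partner in the other frame end up as all-NaN rows.
unmatched = diff.index[diff.isna().all(axis=1)].tolist()
print(unmatched)

# Dropping those rows leaves only the matched IDs.
cleaned = diff.dropna()
print(cleaned)
```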
Ami Tavory answered Oct 11 '22 10:10


One method is to use NumPy. We can look up the position in df2 of each ID in df1 with numpy.searchsorted. Note that this assumes df2['ID'] is sorted and that every ID in df1 also occurs in df2.

Then feed these positions into the construction of a new dataframe.

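import numpy as np
import pandas as pd

# position in df2 of each ID in df1 (df2['ID'] must be sorted)
idx = np.searchsorted(df2['ID'], df1['ID'])

# subtract the matched df2 rows from the df1 values
res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID']).reset_index()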

print(res)

      ID    0    1    2    3
0  'ID1' -0.5  0.1  0.2  2.2
1  'ID2' -0.3  0.5  0.3  0.0
2  'ID1' -0.3 -0.8  2.3  1.6
3  'ID3'  0.5  6.0  4.1 -3.0
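
If df2 is not sorted by ID, or df1 might contain IDs that df2 lacks, the positional lookup can be swapped for pandas' Index.get_indexer, which returns -1 for labels it cannot find. This is a different technique from searchsorted, shown here only as a sketch. The shuffled df2 and the extra 'ID9' row are hypothetical, added to exercise both cases, and only columns A and B are used.

```python
import numpy as np
import pandas as pd

cols = ["A", "B"]

# df1 from the question (columns A and B) plus a hypothetical ID9 missing from df2.
df1 = pd.DataFrame({"ID": ["ID1", "ID2", "ID1", "ID3", "ID9"],
                    "A": [0.5, 1.2, 0.7, 1.1, 9.9],
                    "B": [2.1, 5.5, 1.2, 7.2, 9.9]})

# df2 from the question, deliberately shuffled so it is not sorted by ID.
df2 = pd.DataFrame({"ID": ["ID3", "ID1", "ID4", "ID2"],
                    "A": [0.6, 1.0, 1.1, 1.5],
                    "B": [1.2, 2.0, 7.2, 5.0]})

# Position of each df1 ID within df2; -1 marks IDs that df2 does not contain.
# get_indexer needs the IDs in df2 to be unique.
pos = pd.Index(df2["ID"]).get_indexer(df1["ID"].to_numpy())
found = pos >= 0

# Keep only matched rows, then subtract the looked-up df2 values as plain arrays.
res = df1.loc[found].reset_index(drop=True)
res[cols] = df1.loc[found, cols].to_numpy() - df2[cols].to_numpy()[pos[found]]
print(res)
```

As with the searchsorted version, the arithmetic happens on plain NumPy arrays, so the row order of df1 is preserved.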
jpp answered Oct 11 '22 10:10