Subtracting columns based on key column in pandas dataframe

Tags: python, python-3.x, pandas, dataframe

I have two dataframes looking like

df1:

   ID    A   B   C   D 
0 'ID1' 0.5 2.1 3.5 6.6
1 'ID2' 1.2 5.5 4.3 2.2
2 'ID1' 0.7 1.2 5.6 6.0 
3 'ID3' 1.1 7.2 10. 3.2

df2:

   ID    A   B   C   D 
0 'ID1' 1.0 2.0 3.3 4.4
1 'ID2' 1.5 5.0 4.0 2.2
2 'ID3' 0.6 1.2 5.9 6.2 
3 'ID4' 1.1 7.2 8.5 3.0

df1 can have multiple entries with the same ID, whereas each ID occurs only once in df2. Also, not every ID in df2 is necessarily present in df1. I can't see how to solve this with set_index(), because multiple rows in df1 can share the same ID and the IDs in df1 and df2 are not aligned.

I want to create a new dataframe where I subtract the values in df2[['A','B','C','D']] from df1[['A','B','C','D']] based on matching the ID.

The resulting dataframe would look like:

df_new:

   ID     A    B   C    D 
0 'ID1' -0.5  0.1 0.2  2.2
1 'ID2' -0.3  0.5 0.3  0.0
2 'ID1' -0.3 -0.8 2.3  1.6
3 'ID3'  0.5  6.0 4.1 -3.0

I know how to do this with a loop, but since I'm dealing with huge data quantities this is not practical at all. What is the best way of approaching this with Pandas?

944

asked May 03 '18 14:05

AstroAT

3 Answers

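You just need set_index and subtract. The subtraction aligns rows on ID, and dropna removes the IDs that occur in only one of the dataframes (here ID4):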

(df1.set_index('ID')-df2.set_index('ID')).dropna(axis=0)
Out[174]: 
         A    B    C    D
ID                       
'ID1' -0.5  0.1  0.2  2.2
'ID1' -0.3 -0.8  2.3  1.6
'ID2' -0.3  0.5  0.3  0.0
'ID3'  0.5  6.0  4.1 -3.0

If the row order of df1 matters, reindex df2 by df1.ID before subtracting:

(df1.set_index('ID')-df2.set_index('ID').reindex(df1.ID)).dropna(axis=0).reset_index()
Out[211]: 
      ID    A    B    C    D
0  'ID1' -0.5  0.1  0.2  2.2
1  'ID2' -0.3  0.5  0.3  0.0
2  'ID1' -0.3 -0.8  2.3  1.6
3  'ID3'  0.5  6.0  4.1 -3.0
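For anyone who wants to reproduce this, here is a minimal self-contained sketch of the reindex variant. It rebuilds the sample frames from the question, assuming the IDs are plain strings without the literal quote characters shown in the printout; the names lookup and df_new are only illustrative.

```python
import pandas as pd

# Sample data from the question (IDs assumed to be plain strings).
df1 = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID1", "ID3"],
    "A": [0.5, 1.2, 0.7, 1.1],
    "B": [2.1, 5.5, 1.2, 7.2],
    "C": [3.5, 4.3, 5.6, 10.0],
    "D": [6.6, 2.2, 6.0, 3.2],
})
df2 = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4"],
    "A": [1.0, 1.5, 0.6, 1.1],
    "B": [2.0, 5.0, 1.2, 7.2],
    "C": [3.3, 4.0, 5.9, 8.5],
    "D": [4.4, 2.2, 6.2, 3.0],
})

# Pick the df2 row for each df1 ID, in df1's row order.
# This relies on ID being unique in df2.
lookup = df2.set_index("ID").reindex(df1["ID"])

# Both operands now carry the same index, so the subtraction is row by row.
# dropna would remove df1 rows whose ID is missing from df2 (none here).
df_new = (df1.set_index("ID") - lookup).dropna(axis=0).reset_index()

# Round only for display; the raw differences carry floating-point noise.
print(df_new.round(1))
```

Because reindex looks up each ID once per df1 row, repeated IDs in df1 are handled without any explicit loop.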

199

answered Oct 11 '22 08:10

BENY

Similarly to what Wen (who beat me to it) proposed, you can use pd.DataFrame.subtract:

df1.set_index('ID').subtract(df2.set_index('ID')).dropna().reset_index()

      ID    A    B    C    D
0  'ID1' -0.5  0.1  0.2  2.2
1  'ID1' -0.3 -0.8  2.3  1.6
2  'ID2' -0.3  0.5  0.3  0.0
3  'ID3'  0.5  6.0  4.1 -3.0
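One thing worth knowing about this approach: the subtraction aligns on the union of both indexes, so an ID that exists in only one frame comes back as a row of NaN. The sketch below shows how to spot and drop such rows. It uses a cut-down copy of the question's data (columns A and B only, IDs assumed to be plain strings), and the names diff, unmatched and matched are just for illustration.

```python
import pandas as pd

# Cut-down version of the question's frames (columns A and B only).
df1 = pd.DataFrame({"ID": ["ID1", "ID2", "ID1", "ID3"],
                    "A": [0.5, 1.2, 0.7, 1.1],
                    "B": [2.1, 5.5, 1.2, 7.2]})
df2 = pd.DataFrame({"ID": ["ID1", "ID2", "ID3", "ID4"],
                    "A": [1.0, 1.5, 0.6, 1.1],
                    "B": [2.0, 5.0, 1.2, 7.2]})

diff = df1.set_index("ID").subtract(df2.set_index("ID"))

# ID4 exists only in df2, so its row is all NaN after alignment.
unmatched = diff[diff.isna().all(axis=1)].index.tolist()
print(unmatched)

# Keep only the rows that found a partner in both frames.
matched = diff.dropna(how="all").reset_index()
print(matched.round(1))
```

Using how="all" keeps rows that are only partly NaN, which matters if the input data itself contains missing values.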

answered Oct 11 '22 10:10

Ami Tavory

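One method is to use numpy. We can extract the required row positions in df2 using numpy.searchsorted. This assumes that df2['ID'] is sorted and that every ID in df1 also occurs in df2; both hold for the sample data.

Then feed these positions into the construction of a new dataframe, passing the column names along so they are not lost.

import numpy as np
import pandas as pd

# position of each df1 ID within the (sorted) df2 IDs
idx = np.searchsorted(df2['ID'], df1['ID'])

res = pd.DataFrame(df1.iloc[:, 1:].values - df2.iloc[:, 1:].values[idx],
                   index=df1['ID'], columns=df1.columns[1:]).reset_index()

print(res)

      ID    A    B    C    D
0  'ID1' -0.5  0.1  0.2  2.2
1  'ID2' -0.3  0.5  0.3  0.0
2  'ID1' -0.3 -0.8  2.3  1.6
3  'ID3'  0.5  6.0  4.1 -3.0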
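Note that searchsorted returns an insertion point even when a key is absent, and it only gives meaningful positions on sorted input. If neither can be guaranteed, a slightly more defensive variant might look like the sketch below. The sample data is rebuilt from the question with IDs assumed to be plain strings, and df2_sorted, keys and found are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": ["ID1", "ID2", "ID1", "ID3"],
                    "A": [0.5, 1.2, 0.7, 1.1], "B": [2.1, 5.5, 1.2, 7.2],
                    "C": [3.5, 4.3, 5.6, 10.0], "D": [6.6, 2.2, 6.0, 3.2]})
df2 = pd.DataFrame({"ID": ["ID1", "ID2", "ID3", "ID4"],
                    "A": [1.0, 1.5, 0.6, 1.1], "B": [2.0, 5.0, 1.2, 7.2],
                    "C": [3.3, 4.0, 5.9, 8.5], "D": [4.4, 2.2, 6.2, 3.0]})
cols = ["A", "B", "C", "D"]

# searchsorted needs a sorted lookup array, so sort df2 by ID first.
df2_sorted = df2.sort_values("ID").reset_index(drop=True)
keys = df2_sorted["ID"].to_numpy(dtype=str)
wanted = df1["ID"].to_numpy(dtype=str)

# Clip so that a key larger than every entry cannot index out of bounds.
idx = np.clip(np.searchsorted(keys, wanted), 0, len(keys) - 1)

# An absent key still gets a position, so check that the ID really matches.
found = keys[idx] == wanted

res = pd.DataFrame(
    df1.loc[found, cols].to_numpy() - df2_sorted[cols].to_numpy()[idx[found]],
    columns=cols,
)
res.insert(0, "ID", wanted[found])
print(res.round(1))
```

Rows of df1 whose ID has no match in df2 are dropped here, which mirrors what dropna does in the set_index answers.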