I want to compare two DataFrames and find where they differ.
Here are df and df2, which differ in a single element:
import pandas as pd

df = pd.DataFrame({'a': range(3),
                   'b': range(3)})
df2 = df.copy()
df2.iloc[1, 1] = 100
Comparing them yields a 2D boolean df of the same shape:
df != df2
Out[28]:
       a      b
0  False  False
1  False   True
2  False  False
I tried to extract the elements corresponding to the True values, but the other elements (which I don't want) still show up as NaN:
df[df != df2]
Out[29]:
     a    b
0  NaN  NaN
1  NaN  1.0
2  NaN  NaN
How can I extract only the elements corresponding to the True values, along with their indices (so I know where they are in df)? Ideally something like:
df[df != df2] # somehow?
Out[30]:
     b
1  1.0
Update: the example above has only one True. In the general situation with multiple True values, I think there are two cases:
Case 1: df is small and one may want to see:
df = pd.DataFrame({'a': range(3),
                   'b': range(3)})
df2 = df.copy()
df2.iloc[0,0] = 100
df2.iloc[1,1] = 100
df[df!=df2].dropna(how='all',axis=(0,1)) # U9-Forward's answer
Out[39]:
     a    b
0  0.0  NaN
1  NaN  1.0
Case 2: df is large and one may want to see:
index  column  df_value  df2_value
    0       a       0.0        100
    1       b       1.0        100
@U9-Forward's answer works nicely for case 1, and when there's only one True.
@coldspeed provided a comprehensive solution. Thanks!
Mask on the values:
df.values[df != df2]
# array([1])
How should this case be handled?
df2.at[0, 'a'] = 100
df
   a  b
0  0  0
1  1  1
2  2  2
df2
     a    b
0  100    0
1    1  100
2    2    2
df != df2
       a      b
0   True  False
1  False   True
2  False  False
df.values[df != df2]
# array([0, 1])
# in the other answer
df[df!=df2].dropna(how='all',axis=(0,1))
     a    b
0  0.0  NaN
1  NaN  1.0
Which is the required output?
If you just want the values in each column of df that differ, something simple like agg and dropna will do.
df[df != df2].agg(lambda x: x.dropna().tolist())
a    [0.0]
b    [1.0]
dtype: object
If you want the indices and columns, use melt:
u = df2.reset_index().melt('index')
v = df.reset_index().melt('index')
u[u['value'] != v['value']]
   index variable  value
0      0        a    100
4      1        b    100
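The melt output above shows only the df2 values. As a rough sketch of how the same idea could produce the side-by-side layout from case 2 of the question, both frames can be melted and their value columns placed next to each other. The names both and diff, and the column labels df_value and df2_value, are chosen here for illustration; the sketch assumes df and df2 share the same index and columns, so the melted rows line up.

```python
import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3)})
df2 = df.copy()
df2.iloc[0, 0] = 100
df2.iloc[1, 1] = 100

# Melt both frames to long format; rows line up because index and columns match.
u = df.reset_index().melt('index', var_name='column', value_name='df_value')
v = df2.reset_index().melt('index', var_name='column', value_name='df2_value')

# Put the two value columns side by side and keep only the rows that differ.
both = u.assign(df2_value=v['df2_value'])
diff = both[both['df_value'] != both['df2_value']].reset_index(drop=True)
print(diff)
```

If the two frames might have different shapes or labels, aligning them first (for example with a merge on index and column) would be safer than relying on row order.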
Or use np.nonzero to do this with NumPy: True values count as nonzero, so it returns the row and column positions of the True entries.
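import numpy as np

m = (df != df2).values       # 2D boolean mask as a NumPy array
idx, cols = np.nonzero(m)    # row and column positions of the True entries
pd.DataFrame({
    'index': df.index.values[idx],
    'column': df.columns.values[cols],
    'value_1': df.values[m],
    'value_2': df2.values[m]
})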
   index column  value_1  value_2
0      0      a        0      100
1      1      b        1      100
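As a different technique from the ones above, pandas 1.1 and later also ship DataFrame.compare, which may cover this use case directly. The sketch below assumes both frames have identical index and column labels, which compare requires. It returns only the rows and columns that contain a difference, with the values from each frame under the sub-columns self and other.

```python
import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3)})
df2 = df.copy()
df2.iloc[0, 0] = 100
df2.iloc[1, 1] = 100

# Requires pandas >= 1.1 and identically labeled frames.
# Equal cells are shown as NaN; rows and columns with no differences are dropped.
diff = df.compare(df2)
print(diff)
```

The result has two-level columns such as ('a', 'self') and ('a', 'other'), so it is wide rather than long; the melt or np.nonzero approaches are a better fit if one row per differing cell is wanted.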
Use the condition as a mask, then dropna:
print(df[df!=df2].dropna(how='all',axis=(0,1)))
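Note that passing a tuple to axis in dropna was deprecated and then removed in pandas 1.0, so the call above raises an error on current versions. A minimal sketch of an equivalent for newer pandas chains two dropna calls, one per axis (the names masked and result are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3)})
df2 = df.copy()
df2.iloc[0, 0] = 100
df2.iloc[1, 1] = 100

# Keep differing cells, NaN elsewhere.
masked = df[df != df2]

# Drop rows that are entirely NaN, then columns that are entirely NaN.
result = masked.dropna(how='all').dropna(how='all', axis=1)
print(result)
```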