
Report difference/change in values between two dataFrames of identical shape

I want to compare two DataFrames and find where they differ.

Here's df and df2 with a small difference:

import pandas as pd

df = pd.DataFrame({'a': range(3),
                   'b': range(3)})

df2 = df.copy()
df2.iloc[1,1] = 100

Comparing them yields a 2D boolean df of the same shape:

df != df2
Out[28]: 
       a      b
0  False  False
1  False   True
2  False  False

I tried to extract the elements corresponding to the True values, but the other elements (which I don't want) still appear as NaN:

df[df != df2]
Out[29]: 
    a    b
0 NaN  NaN
1 NaN  1.0
2 NaN  NaN

How can I extract only the elements corresponding to the True values, along with their indices (so I know where they are in the df)?

df[df != df2] # somehow?
Out[30]: 
    b
1 1.0

Update: the above example has only one True. In the general situation with multiple True values, I think there are two cases:

  1. df is small and one may want to see:

    df = pd.DataFrame({'a': range(3),
                       'b': range(3)})
    
    df2 = df.copy()
    df2.iloc[0,0] = 100
    df2.iloc[1,1] = 100
    
    df[df != df2].dropna(how='all').dropna(how='all', axis=1) # U9-Forward's answer
    Out[39]: 
         a    b
    0  0.0  NaN
    1  NaN  1.0
    
  2. df is large and one may want to see:

    index    column   df_value     df2_value
        0         a        0.0           100
        1         b        1.0           100
    

@U9-Forward's answer works nicely for case 1, and when there's only one True.

@coldspeed provided a comprehensive solution. Thanks!

YJZ asked Feb 05 '26 03:02


2 Answers

Mask on the values:

df.values[df != df2]
# array([1])

How should this case be handled?

df2.at[0, 'a'] = 100

df
   a  b
0  0  0
1  1  1
2  2  2

df2
     a    b
0  100    0
1    1  100
2    2    2

df != df2 
       a      b
0   True  False
1  False   True
2  False  False

df.values[df != df2]
# array([0, 1])

# in the other answer
df[df != df2].dropna(how='all').dropna(how='all', axis=1)
     a    b
0  0.0  NaN
1  NaN  1.0

Which is the required output?


If you just want the values in each column of df that differ, something simple like agg and dropna will do.

df[df != df2].agg(lambda x: x.dropna().tolist())

a    [0.0]
b    [1.0]
dtype: object

If you want the indices and columns, use melt:

u = df2.reset_index().melt('index')
v = df.reset_index().melt('index')

u[u['value'] != v['value']]
   index variable  value
0      0        a    100
4      1        b    100
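
The melt output above shows only the df2 values. As a rough sketch, assuming both frames share the same index and columns, the same idea can be extended to put both values side by side, as in case 2 of the question. The names `both` and `changed` and the column labels `df_value` / `df2_value` are illustrative choices taken from the question's desired layout, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3)})
df2 = df.copy()
df2.iloc[0, 0] = 100
df2.iloc[1, 1] = 100

# Melt both frames to long form; the rows line up because the frames share one shape.
v = df.reset_index().melt('index', var_name='column', value_name='df_value')
u = df2.reset_index().melt('index', var_name='column', value_name='df2_value')

# Put the two value columns side by side, then keep only the rows that differ.
both = v.assign(df2_value=u['df2_value'])
changed = both[both['df_value'] != both['df2_value']].reset_index(drop=True)
print(changed)
```

This relies on the two melted frames having identical row order, which holds only when df and df2 have identical labels.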

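Or do it with NumPy: np.nonzero returns the row and column positions of the True (nonzero) entries of the mask.

import numpy as np

m = (df != df2).values       # 2D boolean mask of differing cells
idx, cols = np.nonzero(m)    # row and column positions of the True entries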

pd.DataFrame({
    'index': df.index.values[idx],
    'column': df.columns.values[cols],
    'value_1': df.values[m],
    'value_2': df2.values[m]
})

   index column  value_1  value_2
0      0      a        0      100
1      1      b        1      100
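
As a possible alternative not used in this answer, newer pandas versions (1.1 and later, which is an assumption about your environment) provide DataFrame.compare, which reports differing cells for two identically labeled frames. A minimal sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3)})
df2 = df.copy()
df2.iloc[0, 0] = 100
df2.iloc[1, 1] = 100

# compare() keeps only the rows and columns that contain a difference.
# The columns become a MultiIndex: (original column, 'self' or 'other'),
# where 'self' holds values from df and 'other' holds values from df2.
diff = df.compare(df2)
print(diff)
```

Cells that are equal in both frames show up as NaN, so the result is wide rather than long; the melt or np.nonzero approaches are closer to the index/column/value layout asked for in case 2.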
cs95 answered Feb 07 '26 17:02


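Apply the condition as a mask, then use dropna to remove the rows and columns that are entirely NaN:

# one dropna per axis (pandas 1.0+ rejects a tuple axis such as axis=(0,1))
print(df[df != df2].dropna(how='all').dropna(how='all', axis=1))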
U12-Forward answered Feb 07 '26 16:02


