Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python, Pandas: Compare two dataframes and return combined

Good evening,

I would like to know, which is the best way to compare two dataframes and return a combination of them? Or if there's even a build-in function inside pandas?

For example, these are my two dataframes:

Dataframe 01:

first_name | age | id | value_a | value_b | value_c
peter      | 37  | 19 | 4562    | 78      | 21.5
jane       | 32  | 5  | 3832    | 85      | 17.0
michael    | 43  | 41 | 2195    | 63      | 44.4

Dataframe 02:

first_name | age | id | value_a | value_b | value_c
sarah      | 51  | 2  | 63      | 81      | 4.1
peter      | 37  | 19 | 4562    | 81      | 21.5
tom        | 22  | 89 | 107     | 14      | 0.0
michael    | 43  | 41 | 1838    | 63      | 44.4

As you can see, there are some new entrys troughout the whole dataframe (Dataframe 02) and some of the already existing ones are also listed --> some changes were made in these rows! What I want to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated! In this case:

Dataframe New

first_name | age | id | value_a | value_b | value_c
peter      | 37  | 19 | 4562    | 81      | 21.5
jane       | 32  | 5  | 3832    | 85      | 17.0
michael    | 43  | 41 | 1838    | 63      | 44.4
sarah      | 51  | 2  | 63      | 81      | 4.1
tom        | 22  | 89 | 107     | 14      | 0.0

Notes:

  • there is always a column (here: 'id') that can be seen as a non changing key
  • the amount of rows may differ
  • the amount and names of the colums are always staying the same
  • the order of the rows is not important

Thanks for all your help and a great evening!

like image 920
finethen Avatar asked Jun 13 '26 10:06

finethen


2 Answers

Since you asked if there's even a built-in function inside pandas?. The answer is yes, there is a built in function in pandas that allows you to compare identically labelled (with same index and columns) dataframe's.

There is a DataFrame.compare function which is available in pandas version >= 1.1.0 and allows you to compare first dataframe to second DataFrame and show the differences:

So, now lets look at your example as you said

  • there is always a column (here: 'id') that can be seen as a non changing key
  • the amount and names of the colums are always staying the same

Therefore in order to compare the two dataframes you first need to align both the dataframe this can be done using DataFrame.align function after setting the column id as common index in both the dataframes:

d1, d2 = df1.set_index('id').align(df2.set_index('id'))

Now you can use DataFrame.compare on aligned dataframes:

d1.compare(d2, keep_equal=True)

which results:

         first_name     age           value_a         value_b       value_c      
         self    other  self other    self   other    self other    self other
id                                                                            
2         NaN    sarah   NaN  51.0     NaN    63.0     NaN  81.0     NaN   4.1
5        jane      NaN  32.0   NaN  3832.0     NaN    85.0   NaN    17.0   NaN
19      peter    peter  37.0  37.0  4562.0  4562.0    78.0  81.0    21.5  21.5
41    michael  michael  43.0  43.0  2195.0  1838.0    63.0  63.0    44.4  44.4
89        NaN      tom   NaN  22.0     NaN   107.0     NaN  14.0     NaN   0.0

Now coming to your second question:

How to achieve is a new(?) dataframe that contains all the new rows, the already existing ones and those who got updated!

You can use DataFrame.comine_first on the aligned dataframes d1 and d2:

d2.combine_first(d1)

Or, on the non aligned ones as follows:

df2.set_index('id').combine_first(df1.set_index('id'))

which results:

   first_name   age  value_a  value_b  value_c
id                                            
2       sarah  51.0     63.0     81.0      4.1
5        jane  32.0   3832.0     85.0     17.0
19      peter  37.0   4562.0     81.0     21.5
41    michael  43.0   1838.0     63.0     44.4
89        tom  22.0    107.0     14.0      0.0
like image 161
Shubham Sharma Avatar answered Jun 16 '26 07:06

Shubham Sharma


Here is one way of doing it

>>> (pd.concat([df1, df2])
       .drop_duplicates(subset=['id','first_name'], keep='last')
       .reset_index(drop=True)
       .set_index('first_name')
     )

first_name  |  age   |  id   |   value_a  |  value_b   |   value_c
jane        |   32   |   5   |     3832   |       85   |     17.0
sarah       |   51   |   2   |       63   |       81   |      4.1
peter       |   37   |  19   |     4562   |       81   |     21.5
tom         |   22   |  89   |      107   |       14   |      0.0
michael     |   43   |  41   |     1838   |       63   |     44.4
like image 43
Danail Petrov Avatar answered Jun 16 '26 08:06

Danail Petrov



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!