Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Update a dataframe by dataframes with NaN values

I try to update a DataFrame

df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})

by another DataFrame

df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]}).

Now, my aim is to update df1 by df2 and overwrite all values (NaN values too) using

df1.update(df2)

In contrast with the common usage it's important to me to get the NaN values finally in df1. But as far as I see the update returns

>>> df1
      A   B
0     1   9
1     2   6
2     3   11
3     4   8

Is there a way to get

>>> df1
    A    B
0   1    9
1   2    NaN
2   3    11
3   4    NaN

without building df1 manually?

like image 371
Danny Abstemio Avatar asked Oct 09 '18 20:10

Danny Abstemio


1 Answers

I am late to the party but I was recently confronted to the same issue, i.e. trying to update a dataframe without ignoring NaN values like the Pandas built-in update method does. For two dataframes sharing the same column names, a workaround would be to concatenate both dataframes and then remove duplicates, only keeping the last entry:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [9, np.nan, 11, np.nan]})

frames = [df1, df2]
df_concatenated = pd.concat(frames)
df1=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]

Depending on indexing, it might be necessary to sort the indices of the output dataframe:

df1=df1.sort_index()

To address you very specific example for which df2 does not have a column A, you could run:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data = {'A' : [1,2,3,4], 'B' : [5,6,7,8]})
df2 = pd.DataFrame(data = {'B' : [9, np.nan, 11, np.nan]})

frames = [df1, df2]
df_concatenated = pd.concat(frames)

df1['B']=df_concatenated.loc[~df_concatenated.index.duplicated(keep='last')]['B']
like image 137
Sheldon Avatar answered Nov 05 '22 02:11

Sheldon