Pandas merge dataframes with shared column, fillna in left with right

I am trying to merge two dataframes and fill the NaN values in the left df with values from the right df. I can do it with three lines of code, as below, but is there a better or shorter way?

import numpy as np
import pandas as pd

# Example data (my actual df is ~500k rows x 11 cols)
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [0,1,np.nan, 1], 'e': ['a', 1, 2,'b']})
df2 = pd.DataFrame({'a': [1,2,3,4], 'b': [np.nan, 1, 0, 1]})

# Merge the dataframes...
df = df1.merge(df2, on='a', how='left')

# Fillna in 'b' column of left df with right df...
df['b'] = df['b_x'].fillna(df['b_y'])

# Drop the columns no longer needed
df = df.drop(['b_x', 'b_y'], axis=1)
Kenan asked Jul 01 '19 20:07


2 Answers

Both dataframes have a 'b' column that is not a merge key, so merge keeps both copies and suffixes them as 'b_x' and 'b_y'. On top of that, the left and right versions have NaNs in mismatched places. You can avoid getting the two 'b' columns from merge in the first place:

  • slice the non-shared columns 'a','e' from df1
  • do merge(df2, 'left'), this will pick up 'b' from the right dataframe (since it only exists in the right df)
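  • finally do df1.update(...), which overwrites df1['b'] in place with the non-NaN values of 'b' taken from df2. Note that update also replaces non-NaN values in df1 wherever df2 holds a different non-NaN value, so it matches fillna only when the two frames agree wherever both are filled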

Solution:

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b
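
To see what each step contributes, here is the one-liner unpacked into separate statements. This is only a sketch: the intermediate names (left_only, right_values, updated) and the second pair of frames (left, right) are made up for illustration. The second half shows the one case where update and fillna disagree, namely when both frames have a non-NaN value for the same key.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# Step 1: keep only the key and the columns that df2 does not also supply.
left_only = df1[['a', 'e']]

# Step 2: left merge on the only shared column, 'a'; 'b' now comes from df2.
right_values = left_only.merge(df2, how='left', on='a')

# Step 3: update a copy of df1 with the non-NaN values of right_values.
updated = df1.copy()
updated.update(right_values)

# Hypothetical conflict: both frames have a non-NaN 'b' for a == 1.
left = pd.DataFrame({'a': [1, 2], 'b': [5.0, np.nan]})
right = pd.DataFrame({'a': [1, 2], 'b': [9.0, 7.0]})

# update lets the right-hand value win wherever it is not NaN ...
via_update = left.copy()
via_update.update(left[['a']].merge(right, how='left', on='a'))

# ... whereas fillna only touches the cells that were NaN on the left.
via_fillna = left['b'].fillna(left['a'].map(right.set_index('a')['b']))
```

With the question's data both approaches give the same 'b' column. With the conflicting frames, via_update keeps 9.0 for a == 1 while via_fillna keeps 5.0.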

Note: Because I used merge(..., how='left'), the row order of the calling dataframe is preserved. If df1 had values of 'a' that were not in order:

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  NaN  2

The result would be

df1.update(df1[['a', 'e']].merge(df2, 'left'))

df1

   a    b  e
0  1  0.0  a
1  2  1.0  1
2  4  1.0  b
3  3  0.0  2

Which is as expected.


Further...

If you want to be more explicit when there may be more columns involved

df1.update(df1.drop(columns='b').merge(df2, how='left', on='a'))

Even Further...

If you don't want to modify df1 in place, use combine_first, which returns a new dataframe and only fills the cells that are NaN in df1:

Quick

df1.combine_first(df1[['a', 'e']].merge(df2, 'left'))

Explicit

df1.combine_first(df1.drop(columns='b').merge(df2, how='left', on='a'))

EVEN FURTHER!...

The 'left' merge preserves row order but NOT the index: the merged result gets a fresh 0..n-1 index. This is the ultra conservative approach, which puts df1's index back before combining:

df3 = df1.drop(columns='b').merge(df2, how='left', on='a').set_index(df1.index)
df1.combine_first(df3)
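
A small sketch of why the index matters. It uses the question's data, except that df1 is given a non-default index (10 to 13), which is an assumption made purely for illustration. The names merged, misaligned and aligned are also made up.

```python
import numpy as np
import pandas as pd

# Same data as the question, but df1 carries a non-default index (illustrative assumption).
df1 = pd.DataFrame(
    {'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

merged = df1.drop(columns='b').merge(df2, how='left', on='a')

# merge returns a fresh 0..3 index, so no row label matches df1's 10..13
# and combine_first produces the union of both indexes instead of filling.
misaligned = df1.combine_first(merged)

# Restoring df1's index first makes combine_first align row for row.
aligned = df1.combine_first(merged.set_index(df1.index))
```

Note that set_index(df1.index) relies on the merge returning exactly one row per row of df1, which holds only if 'a' is unique in df2.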
piRSquared answered Sep 18 '22 14:09


Short version

df1['b'] = df1['b'].fillna(df1['a'].map(df2.set_index('a')['b']))
df1
Out[173]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b
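
One caveat with the map approach, as far as I know: Series.map needs a unique index on the lookup Series, so it only works when 'a' is unique in df2. A hedged sketch with an explicit check follows. The name lookup is made up, and raising a ValueError is just one possible way to handle duplicates.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0, 1, np.nan, 1], 'e': ['a', 1, 2, 'b']})
df2 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [np.nan, 1, 0, 1]})

# Build the key -> value lookup and fail early if df2 repeats a key.
lookup = df2.set_index('a')['b']
if not lookup.index.is_unique:
    raise ValueError("df2 has duplicate values in 'a'; deduplicate before mapping")

# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which is not guaranteed to write through to df1 in recent pandas versions.
df1['b'] = df1['b'].fillna(df1['a'].map(lookup))
```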

Since you mentioned there will be multiple columns

df = df1.combine_first(df1[['a']].merge(df2, on='a', how='left'))
df
Out[184]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b

We can also pass a dataframe directly to fillna, which aligns it with df1 on index and column labels:

df1.fillna(df1[['a']].merge(df2, on='a', how='left'))
Out[185]: 
   a    b  e
0  1  0.0  a
1  2  1.0  1
2  3  0.0  2
3  4  1.0  b
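
If there are several shared columns, the fillna variant can be wrapped in a small function. This is only a sketch: fill_from_right is a hypothetical helper, not a pandas API, and the example frames (with an extra shared column 'c') are made up. It assumes the key is unique in the right frame and raises otherwise.

```python
import numpy as np
import pandas as pd


def fill_from_right(left, right, key):
    """Fill NaNs in `left` with values from `right`, matched on `key` (hypothetical helper)."""
    # Look up right's row for each key in left; merge returns a fresh 0..n-1 index.
    matched = left[[key]].merge(right, how='left', on=key)
    # The rows only line up one to one if `right` has at most one row per key.
    if len(matched) != len(left):
        raise ValueError(f"duplicate values of {key!r} in right frame")
    matched.index = left.index
    # fillna aligns on index and column labels; columns absent from `matched` are untouched.
    return left.fillna(matched)


left = pd.DataFrame({'a': [1, 2, 3],
                     'b': [np.nan, 1.0, np.nan],
                     'c': [np.nan, 5.0, 6.0],
                     'e': ['x', 'y', 'z']})
right = pd.DataFrame({'a': [1, 2, 3],
                      'b': [7.0, 8.0, 9.0],
                      'c': [4.0, np.nan, np.nan]})

result = fill_from_right(left, right, 'a')
```

Here 'b' and 'c' are both filled from the right frame, the existing values 1.0, 5.0 and 6.0 are kept, and 'e' passes through unchanged.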
BENY answered Sep 19 '22 14:09