I have two datasets like this
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'id': [1, 2,3,4,5], 'first': [np.nan,np.nan,1,0,np.nan], 'second': [1,np.nan,np.nan,np.nan,0]})
df2 = pd.DataFrame({'id': [1, 2,3,4,5, 6], 'first': [np.nan,1,np.nan,np.nan,0, 1], 'third': [1,0,np.nan,1,1, 0]})
And I want to get
result = pd.merge(df1, df2, left_index=True, right_index=True,on='id', how= 'outer')
result['first']= result[["first_x", "first_y"]].sum(axis=1)
result.loc[(result['first_x'].isnull()) & (result['first_y'].isnull()), 'first'] = np.nan
result.drop(['first_x','first_y'] , 1)
id second third first
0 1 1.0 1.0 NaN
1 2 NaN 0.0 1.0
2 3 NaN NaN 1.0
3 4 NaN 1.0 0.0
4 5 0.0 1.0 0.0
5 6 NaN 0.0 1.0
The problem is that the real dataset includes about 200 variables and my way is very long. How to make it easier? Thanks
Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.
Let's say you want to create a single Full Name column by combining two other columns, First Name and Last Name. To combine first and last names, use the CONCATENATE function or the ampersand (&) operator.
To merge two pandas DataFrames on multiple columns use pandas. merge() method. merge() is considered more versatile and flexible and we also have the same method in DataFrame.
You should be able to use combine_first
:
>>> df1.set_index('id').combine_first(df2.set_index('id'))
first second third
id
1 NaN 1 1
2 1 NaN 0
3 1 NaN NaN
4 0 NaN 1
5 0 0 1
6 1 NaN 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With