I have two dataframes <code>df1</code> and <code>df2</code>. <code>df1</code> contains the information of the age of people, while <code>df2</code> contains the information of the sex of people. Not all the people are in <code>df1</code> nor in <code>df2</code> <pre class="prettyprint"><code>df1 Name Age 0 Tom 34 1 Sara 18 2 Eva 44 3 Jack 27 4 Laura 30 df2 Name Sex 0 Tom M 1 Paul M 2 Eva F 3 Jack M 4 Michelle F </code></pre> I want to have the information of the sex of the people in <code>df1</code> and setting <code>NaN</code> if I do not have this information in <code>df2</code>. I tried to do <code>df1 = pd.merge(df1, df2, on = 'Name', how = 'outer')</code> but I keep the information of some people in <code>df2</code> that I don't want. <pre class="prettyprint"><code>df1 Name Age Sex 0 Tom 34 M 1 Sara 18 NaN 2 Eva 44 F 3 Jack 27 M 4 Laura 30 NaN </code></pre>

You can also use the <code>join</code> method: <pre class="prettyprint"><code>df1.set_index("Name").join(df2.set_index("Name"), how="left") </code></pre> edit: added <code>set_index("Name")</code>

Pandas: how to merge two dataframes on a column by keeping the information of the first one?

Tags:

python

pandas

dataframe

I have two dataframes df1 and df2. df1 contains the information of the age of people, while df2 contains the information of the sex of people. Not all the people are in df1 nor in df2

df1
     Name   Age 
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30

df2
     Name      Sex 
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

I want to have the information of the sex of the people in df1 and setting NaN if I do not have this information in df2. I tried to do df1 = pd.merge(df1, df2, on = 'Name', how = 'outer') but I keep the information of some people in df2 that I don't want.

df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN

417

asked Oct 26 '18 13:10

emax

2 Answers

Sample:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

#print (df1)
df3 = df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F']})
#print (df2)

Use map by Series created by set_index:

df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
print (df1)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

Alternative solution with merge with left join:

df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
print (df)
    Name  Age  Sex
0    Tom   34    M
1   Sara   18  NaN
2    Eva   44    F
3   Jack   27    M
4  Laura   30  NaN

If need map by multiple columns (e.g. Year and Code) need merge with left join:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Year':[2000,2003,2003,2004,2007],
                    'Code':[1,2,3,4,4],
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Year  Code  Age
0    Tom  2000     1   34
1   Sara  2003     2   18
2    Eva  2003     3   44
3   Jack  2004     4   27
4  Laura  2007     4   30

df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year':[2001,2003,2003,2004,2007],
                    'Code':[1,2,3,5,3],
                    'Val':[21,34,23,44,67]})
print (df2)
       Name Sex  Year  Code  Val
0       Tom   M  2001     1   21
1      Paul   M  2003     2   34
2       Eva   F  2003     3   23
3      Jack   M  2004     5   44
4  Michelle   F  2007     3   67

#merge by all columns
df = df1.merge(df2, on=['Year','Code'], how='left')
print (df)
  Name_x  Year  Code  Age Name_y  Sex   Val
0    Tom  2000     1   34    NaN  NaN   NaN
1   Sara  2003     2   18   Paul    M  34.0
2    Eva  2003     3   44    Eva    F  23.0
3   Jack  2004     4   27    NaN  NaN   NaN
4  Laura  2007     4   30    NaN  NaN   NaN

#specified columns - columns for join (Year, Code) need always + appended columns (Val)
df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
print (df)
    Name  Year  Code  Age   Val
0    Tom  2000     1   34   NaN
1   Sara  2003     2   18  34.0
2    Eva  2003     3   44  23.0
3   Jack  2004     4   27   NaN
4  Laura  2007     4   30   NaN

If get error with map it means duplicates by columns of join, here Name:

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                    'Age': [34, 18, 44, 27, 30]})

print (df1)
    Name  Age
0    Tom   34
1   Sara   18
2    Eva   44
3   Jack   27
4  Laura   30

df3, df4 = df1.copy(), df1.copy()

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                    'Val': [1,2,3,4,5]})
print (df2)
       Name  Val
0       Tom    1 <-duplicated name Tom
1       Tom    2 <-duplicated name Tom
2       Eva    3
3      Jack    4
4  Michelle    5

s = df2.set_index('Name')['Val']
df1['New'] = df1['Name'].map(s)
print (df1)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Solutions are removed duplicates by DataFrame.drop_duplicates, or use map by dict for last dupe match:

#default keep first value
s = df2.drop_duplicates('Name').set_index('Name')['Val']
print (s)
Name
Tom         1
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df1['New'] = df1['Name'].map(s)
print (df1)
    Name  Age  New
0    Tom   34  1.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

#add parameter for keep last value 
s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
print (s)
Name
Tom         2
Eva         3
Jack        4
Michelle    5
Name: Val, dtype: int64

df3['New'] = df3['Name'].map(s)
print (df3)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

#map by dictionary
d = dict(zip(df2['Name'], df2['Val']))
print (d)
{'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}

df4['New'] = df4['Name'].map(d)
print (df4)
    Name  Age  New
0    Tom   34  2.0
1   Sara   18  NaN
2    Eva   44  3.0
3   Jack   27  4.0
4  Laura   30  NaN

103

answered Sep 22 '22 12:09

jezrael

You can also use the join method:

df1.set_index("Name").join(df2.set_index("Name"), how="left")

edit: added set_index("Name")

answered Sep 18 '22 12:09

Xiaoyu Lu

Related questions
                            
                                How to reverse a dictionary (whose values are lists) in Python?
                            
                                How can I separate runs of my TensorFlow code in TensorBoard?
                            
                                what is the correct way to check for False? [duplicate]
                            
                                why does pandas rolling use single dimension ndarray
                            
                                Object has no attribute '.__dict__' in python3
                            
                                Why can itertools.groupby group the NaNs in lists but not in numpy arrays
                            
                                keras flow_from_directory over or undersample a class
                            
                                Representation of all values in Flag enum
                            
                                How to set a window icon with PyQt5?
                            
                                Tensorflow equivalent to numpy.diff
                            
                                What is the type of traceback objects in Python?
                            
                                generating word cloud for items in a list in python
                            
                                Python aiohttp/asyncio - how to process returned data
                            
                                is seaborn confidence interval computed correctly?
                            
                                How to install Selenium in a conda environment?
                            
                                Matplotlib plots not showing on Jupyter Notebook when I 'run all'
                            
                                python retry with tenacity, disable `wait` for unittest
                            
                                Use Sphinx autosummary recursively to generate API documentation
                            
                                Python type hinting for async function as function argument
                            
                                How to run MATLAB code from within Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With