Consider the following DataFrames:
df1 = pd.DataFrame({'a': [0, 1, 2, 3], 'b': list('abcd')})
df2 = pd.DataFrame({'c': list('abcd'), 'd': 'Alex'})
In this instance, df1['b']
and df2['c']
are the key columns. So when merging:
df1.merge(df2, left_on='b', right_on='c')
a b c d
0 0 a a Alex
1 1 b b Alex
2 2 c c Alex
3 3 d d Alex
I end up with both key columns in the resultant DataFrame when I only need one. I've been using:
df1.merge(df2, left_on='b', right_on='c').drop('c', axis='columns')
Is there a way to only keep one key column?
In this example, we are using the pd. merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'.
We can merge two Pandas DataFrames on certain columns using the merge function by simply specifying the certain columns for merge. Example1: Let's create a Dataframe and then merge them into a single dataframe. Creating a Dataframe: Python3.
We can join columns from two Dataframes using the merge() function. This is similar to the SQL 'join' functionality. A detailed discussion of different join types is given in the SQL lesson. You specify the type of join you want using the how parameter.
Different column names are specified for merges in Pandas using the “left_on” and “right_on” parameters, instead of using only the “on” parameter. Merging dataframes with different names for the joining variable is achieved using the left_on and right_on arguments to the pandas merge function.
One way is to set b
and c
as the index of your frames respectively, and use join
followed by reset_index
:
df1.set_index('b').join(df2.set_index('c')).reset_index()
b a d
0 a 0 Alex
1 b 1 Alex
2 c 2 Alex
3 d 3 Alex
This will be faster than the merge/drop
method on large dataframes, mostly because drop
is slow. @Bill's method is faster than my suggestion, and @W-B & @PiRsquared easily outspeed the other suggestions:
import timeit
df1 = pd.concat((df1 for _ in range(1000)))
df2 = pd.concat((df2 for _ in range(1000)))
def index_method(df1 = df1, df2 = df2):
return df1.set_index('b').join(df2.set_index('c')).reset_index()
def merge_method(df1 = df1, df2=df2):
return df1.merge(df2, left_on='b', right_on='c').drop('c', axis='columns')
def rename_method(df1 = df1, df2 = df2):
return df1.rename({'b': 'c'}, axis=1).merge(df2)
def index_method2(df1 = df1, df2 = df2):
return df1.join(df2.set_index('c'), on='b')
def assign_method(df1 = df1, df2 = df2):
return df1.set_index('b').assign(c=df2.set_index('c').d).reset_index()
def map_method(df1 = df1, df2 = df2):
return df1.assign(d=df1.b.map(dict(df2.values)))
>>> timeit.timeit(index_method, number=10) / 10
0.7853091600998596
>>> timeit.timeit(merge_method, number=10) / 10
1.1696729859002517
>>> timeit.timeit(rename_method, number=10) / 10
0.4291436871004407
>>> timeit.timeit(index_method2, number=10) / 10
0.5037374985004135
>>> timeit.timeit(assign_method, number=10) / 10
0.0038641377999738325
>>> timeit.timeit(map_method, number=10) / 10
0.006620216699957382
Another way is to give b and c the same name. At least for the merge operation.
df1.rename({'b': 'c'}, axis=1).merge(df2)
a c d
0 0 a Alex
1 1 b Alex
2 2 c Alex
3 3 d Alex
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With