The Fastest Ways As it turns out, join always tends to perform well, and merge will perform almost exactly the same given the syntax is optimal.
The merge() method updates the content of two DataFrame by merging them together, using the specified method(s). Use the parameters to control which values to keep and which to replace.
Different column names are specified for merges in Pandas using the “left_on” and “right_on” parameters, instead of using only the “on” parameter. Merging dataframes with different names for the joining variable is achieved using the left_on and right_on arguments to the pandas merge function.
The join() functions from dplyr preserve the original order of rows in the data frames while the merge() function automatically sorts the rows alphabetically based on the column you used to perform the join.
pandas.merge()
is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge()
and pandas.DataFrame.join()
methods as a convenient way to access the capabilities of pandas.merge()
. For example, df1.merge(right=df2, ...)
is equivalent to pandas.merge(left=df1, right=df2, ...)
.
These are the main differences between df.join()
and df.merge()
:
df1.join(df2)
always joins via the index of df2
, but df1.merge(df2)
can join to one or more columns of df2
(default) or to the index of df2
(with right_index=True
). df1.join(df2)
uses the index of df1
and df1.merge(df2)
uses column(s) of df1
. That can be overridden by specifying df1.join(df2, on=key_or_keys)
or df1.merge(df2, left_index=True)
. df1.join(df2)
does a left join by default (keeps all rows of df1
), but df.merge
does an inner join by default (returns only matching rows of df1
and df2
).So, the generic approach is to use pandas.merge(df1, df2)
or df1.merge(df2)
. But for a number of common situations (keeping all rows of df1
and joining to an index in df2
), you can save some typing by using df1.join(df2)
instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge
is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.The related
DataFrame.join
method, usesmerge
internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior formerge
). If you are joining on index, you may wish to useDataFrame.join
to save yourself some typing.
...
These two function calls are completely equivalent:
left.join(right, on=key_or_keys) pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
I always use join
on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
val_l val_r
key
foo 1 4
bar 2 5
The same functionality can be had by using merge
on the columns follows:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))
key val_l val_r
0 foo 1 4
1 bar 2 5
From this documentation
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
And :
DataFrame.join
is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:result = pd.merge(left, right, left_index=True, right_index=True, how='outer')
One of the difference is that merge
is creating a new index, and join
is keeping the left side index. It can have a big consequence on your later transformations if you wrongly assume that your index isn't changed with merge
.
For example:
import pandas as pd
df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
'date': [201801, 201801, 201802, 201802],
'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1
date org_index val
101 201801 101 1
102 201801 102 2
103 201802 103 3
104 201802 104 4
-
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2
dateval
date
201801 A
201802 B
-
df1.merge(df2, on='date')
date org_index val dateval
0 201801 101 1 A
1 201801 102 2 A
2 201802 103 3 B
3 201802 104 4 B
-
df1.join(df2, on='date')
date org_index val dateval
101 201801 101 1 A
102 201801 102 2 A
103 201802 103 3 B
104 201802 104 4 B
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With