I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various setting and tried to understand but I am still confused. Consider this example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': [1,2,3,4]})
Now, when I run the following command:
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)
I get:
key1_x key2_x A B key1_y key2_y C D E _merge
0 K0 K0 A0 B0 K0 K0 C0 D0 1.0 both
1 K0 K1 A1 B1 K1 K0 C1 D1 2.0 both
2 K0 K1 A1 B1 K1 K0 C2 D2 3.0 both
3 K1 K0 A2 B2 NaN NaN NaN NaN NaN left_only
3 K2 K1 A3 B3 NaN NaN NaN NaN NaN left_only
3 NaN NaN NaN NaN K2 K0 C3 D3 4.0 right_only
However, running the same with right_index=True
gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result
pd.merge(left, right, on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)
Result is:
key1 key2 A B C D E _merge
0 K0 K0 A0 B0 C0 D0 1 both
1 K0 K1 A1 B1 C1 D1 2 both
2 K1 K0 A2 B2 C2 D2 3 both
3 K2 K1 A3 B3 C3 D3 4 both
As you can see, all information for right frame for key1
and key2
is completely lost.
Please help me understand the purpose and function of these arguments. Thank you.
Using merge indicator to track merges To assist with the identification of where rows originate from, Pandas provides an “indicator” parameter that can be used with the merge function which creates an additional column called “_merge” in the output that labels the original source for each row.
left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame. right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
INNER Merge Pandas uses “inner” merge by default. This keeps only the common values in both the left and right dataframes for the merged data. In our case, only the rows that contain use_id values that are common between user_usage and user_device remain in the merged data — inner_merge.
Column-Column Merge: Use left_on, right_on and how.
Example:
# Gives same answer
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)
Index-Index Merge: Set left_index and right_index to True or use on and use how.
Example:
pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, you're keys duplicate values so you can't do this
# In general, a column with duplicate values does not make a good key
Column-Index Merge: Use left_on + right_index or left_index + right_on and how.
Note: Both the values in index and left_on must match. If you're index is a integer and you're left_on is a string, you get error. Also, number of indexing levels must match.
Example:
# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')
# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')
# This gave you error because left_on has indexing level of 2 but right_index only has indexing level of 1.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)
You kind of mix up the different types of merges which gave weird results. If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.
If I understand the behavior of merge
correctly, you should pick only one option for left
and right
respectively (i.e. You should not pick left_on=['x']
and left_index=True
at the same time). Otherwise, strange thing can happen in arbitrary way since it confuses merge
as to which key
should be actually used as you have shown in current implementation of merge
(I have not checked the pandas source in detail, but the behavior can change for different implementations in each version). Here is a small experiment.
>>> left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
>>> right
key1 key2 C D E
0 K0 K0 C0 D0 1
1 K1 K0 C1 D1 2
2 K1 K0 C2 D2 3
3 K2 K0 C3 D3 4
(1) merge
using ['key1', 'key2']
>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K1 A3 B3 NaN NaN NaN
5 K2 K0 NaN NaN C3 D3 4.0
(2) Set ['key1', 'key2']
as left
index and merge
it using the index and keys
>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)
A B key1 key2 C D E
0 A0 B0 K0 K0 C0 D0 1.0
1 A1 B1 K0 K1 NaN NaN NaN
2 A2 B2 K1 K0 C1 D1 2.0
3 A2 B2 K1 K0 C2 D2 3.0
4 A3 B3 K2 K1 NaN NaN NaN
5 NaN NaN K2 K0 C3 D3 4.0
(3) Further set ['key1', 'key2']
as right
index and merge
it using the index
>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()
key1 key2 A B C D E
0 K0 K0 A0 B0 C0 D0 1.0
1 K0 K1 A1 B1 NaN NaN NaN
2 K1 K0 A2 B2 C1 D1 2.0
3 K1 K0 A2 B2 C2 D2 3.0
4 K2 K0 NaN NaN C3 D3 4.0
5 K2 K1 A3 B3 NaN NaN NaN
Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2']
are set as index, you can still use left_on = ['key1', 'key2']
instead of left_index=True
.
Now, if you really want to merge using both ['key1', 'key2']
with index
, one way to achieve this is:
>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')
index key1 key2 A B C D E
0 0 K0 K0 A0 B0 C0 D0 1.0
1 1 K0 K1 A1 B1 NaN NaN NaN
2 2 K1 K0 A2 B2 C2 D2 3.0
3 3 K2 K1 A3 B3 NaN NaN NaN
4 1 K1 K0 NaN NaN C1 D1 2.0
5 3 K2 K0 NaN NaN C3 D3 4.0
If you read down to here, I'm pretty sure now you know how to achieve above using multiple different ways. Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With