Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding the "left_index" and "right_index" arguments in pandas merge

I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various setting and tried to understand but I am still confused. Consider this example:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 
                 'key2': ['K0', 'K1', 'K0', 'K1'],
                 'A': ['A0', 'A1', 'A2', 'A3'],
                 'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1,2,3,4]})

Now, when I run the following command:

pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)

I get:

  key1_x key2_x    A    B key1_y key2_y    C    D    E      _merge
0     K0     K0   A0   B0     K0     K0   C0   D0  1.0        both
1     K0     K1   A1   B1     K1     K0   C1   D1  2.0        both
2     K0     K1   A1   B1     K1     K0   C2   D2  3.0        both
3     K1     K0   A2   B2    NaN    NaN  NaN  NaN  NaN   left_only
3     K2     K1   A3   B3    NaN    NaN  NaN  NaN  NaN   left_only
3    NaN    NaN  NaN  NaN     K2     K0   C3   D3  4.0  right_only

However, running the same with right_index=True gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result

pd.merge(left, right,  on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)

Result is:

  key1 key2   A   B   C   D  E _merge
0   K0   K0  A0  B0  C0  D0  1   both
1   K0   K1  A1  B1  C1  D1  2   both
2   K1   K0  A2  B2  C2  D2  3   both
3   K2   K1  A3  B3  C3  D3  4   both

As you can see, all information for right frame for key1 and key2 is completely lost.

Please help me understand the purpose and function of these arguments. Thank you.

like image 395
ste_kwr Avatar asked Aug 13 '18 02:08

ste_kwr


People also ask

What does the argument indicator in Pandas merge () function do?

Using merge indicator to track merges To assist with the identification of where rows originate from, Pandas provides an “indicator” parameter that can be used with the merge function which creates an additional column called “_merge” in the output that labels the original source for each row.

What is left on and right on in Pandas merge?

left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame. right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

How does merging work in Pandas?

INNER Merge Pandas uses “inner” merge by default. This keeps only the common values in both the left and right dataframes for the merged data. In our case, only the rows that contain use_id values that are common between user_usage and user_device remain in the merged data — inner_merge.


2 Answers

Merging happens in a couple of ways:

Column-Column Merge: Use left_on, right_on and how.

Example:

# Gives same answer
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)

Index-Index Merge: Set left_index and right_index to True or use on and use how.

Example:

pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, you're keys duplicate values so you can't do this
# In general, a column with duplicate values does not make a good key

Column-Index Merge: Use left_on + right_index or left_index + right_on and how.

Note: Both the values in index and left_on must match. If you're index is a integer and you're left_on is a string, you get error. Also, number of indexing levels must match.

Example:

# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')  

# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')

# This gave you error because left_on has indexing level of 2 but right_index only has indexing level of 1.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)

You kind of mix up the different types of merges which gave weird results. If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.

like image 147
Tam Le Avatar answered Oct 22 '22 08:10

Tam Le


If I understand the behavior of merge correctly, you should pick only one option for left and right respectively (i.e. You should not pick left_on=['x'] and left_index=True at the same time). Otherwise, strange thing can happen in arbitrary way since it confuses merge as to which key should be actually used as you have shown in current implementation of merge (I have not checked the pandas source in detail, but the behavior can change for different implementations in each version). Here is a small experiment.

>>> left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

>>> right
  key1 key2   C   D  E
0   K0   K0  C0  D0  1
1   K1   K0  C1  D1  2
2   K1   K0  C2  D2  3
3   K2   K0  C3  D3  4

(1) merge using ['key1', 'key2']

>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K1   A3   B3  NaN  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3  4.0

(2) Set ['key1', 'key2'] as left index and merge it using the index and keys

>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)

     A    B key1 key2    C    D    E
0   A0   B0   K0   K0   C0   D0  1.0
1   A1   B1   K0   K1  NaN  NaN  NaN
2   A2   B2   K1   K0   C1   D1  2.0
3   A2   B2   K1   K0   C2   D2  3.0
4   A3   B3   K2   K1  NaN  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3  4.0

(3) Further set ['key1', 'key2'] as right index and merge it using the index

>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K0  NaN  NaN   C3   D3  4.0
5   K2   K1   A3   B3  NaN  NaN  NaN

Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2'] are set as index, you can still use left_on = ['key1', 'key2'] instead of left_index=True.

Now, if you really want to merge using both ['key1', 'key2'] with index, one way to achieve this is:

>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')

   index key1 key2    A    B    C    D    E
0      0   K0   K0   A0   B0   C0   D0  1.0
1      1   K0   K1   A1   B1  NaN  NaN  NaN
2      2   K1   K0   A2   B2   C2   D2  3.0
3      3   K2   K1   A3   B3  NaN  NaN  NaN
4      1   K1   K0  NaN  NaN   C1   D1  2.0
5      3   K2   K0  NaN  NaN   C3   D3  4.0

If you read down to here, I'm pretty sure now you know how to achieve above using multiple different ways. Hope this helps.

like image 40
gyoza Avatar answered Oct 22 '22 09:10

gyoza