Understanding the "left_index" and "right_index" arguments in pandas merge

Tags:

I am really struggling to understand the "left_index" and "right_index" arguments in pandas.merge. I read the documentation, searched around, experimented with various setting and tried to understand but I am still confused. Consider this example:

Click to copy

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 
                 'key2': ['K0', 'K1', 'K0', 'K1'],
                 'A': ['A0', 'A1', 'A2', 'A3'],
                 'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'E': [1,2,3,4]})

Now, when I run the following command:

Click to copy

pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, left_index=True)

I get:

Click to copy

  key1_x key2_x    A    B key1_y key2_y    C    D    E      _merge
0     K0     K0   A0   B0     K0     K0   C0   D0  1.0        both
1     K0     K1   A1   B1     K1     K0   C1   D1  2.0        both
2     K0     K1   A1   B1     K1     K0   C2   D2  3.0        both
3     K1     K0   A2   B2    NaN    NaN  NaN  NaN  NaN   left_only
3     K2     K1   A3   B3    NaN    NaN  NaN  NaN  NaN   left_only
3    NaN    NaN  NaN  NaN     K2     K0   C3   D3  4.0  right_only

However, running the same with right_index=True gives an error. Same if I introduce both. More interestingly, running the following merge gives a very unexpected result

Click to copy

pd.merge(left, right,  on=['key1', 'key2'],how='outer', validate = 'one_to_many', indicator=True, left_index = True, right_index = True)

Result is:

Click to copy

  key1 key2   A   B   C   D  E _merge
0   K0   K0  A0  B0  C0  D0  1   both
1   K0   K1  A1  B1  C1  D1  2   both
2   K1   K0  A2  B2  C2  D2  3   both
3   K2   K1  A3  B3  C3  D3  4   both

As you can see, all information for right frame for key1 and key2 is completely lost.

Please help me understand the purpose and function of these arguments. Thank you.

395

asked Aug 13 '18 02:08

ste_kwr

2 Answers

Merging happens in a couple of ways:

Column-Column Merge: Use left_on, right_on and how.

Example:

Click to copy

# Gives same answer
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how = 'outer')
pd.merge(left, right, on=['key1', 'key2'], how='outer', indicator=True)

Index-Index Merge: Set left_index and right_index to True or use on and use how.

Example:

Click to copy

pd.merge(left, right, how = 'inner', right_index = True, left_index = True)
# If you make matching unique multi-indexes for both data frames you can do
# pd.merge(left, right, how = 'inner', on = ['indexname1', 'indexname2'])
# In your data frames, you're keys duplicate values so you can't do this
# In general, a column with duplicate values does not make a good key

Column-Index Merge: Use left_on + right_index or left_index + right_on and how.

Note: Both the values in index and left_on must match. If you're index is a integer and you're left_on is a string, you get error. Also, number of indexing levels must match.

Example:

Click to copy

# If how not specified, inner join is used
pd.merge(left, right, right_on=['E'], left_index = True, how = 'outer')  

# Gives error because left_on is string and right_index is integer
pd.merge(left, right, left_on=['key1'], right_index = True, how = 'outer')

# This gave you error because left_on has indexing level of 2 but right_index only has indexing level of 1.
pd.merge(left, right, left_on=['key2', 'key1'], right_on=['key1', 'key2'], how='outer', indicator=True, right_index=True)

You kind of mix up the different types of merges which gave weird results. If you can't see how the merging is going to happen conceptually, chances are a computer isn't going to do any better.

147

answered Oct 22 '22 08:10

Tam Le

If I understand the behavior of merge correctly, you should pick only one option for left and right respectively (i.e. You should not pick left_on=['x'] and left_index=True at the same time). Otherwise, strange thing can happen in arbitrary way since it confuses merge as to which key should be actually used as you have shown in current implementation of merge (I have not checked the pandas source in detail, but the behavior can change for different implementations in each version). Here is a small experiment.

Click to copy

>>> left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

>>> right
  key1 key2   C   D  E
0   K0   K0  C0  D0  1
1   K1   K0  C1  D1  2
2   K1   K0  C2  D2  3
3   K2   K0  C3  D3  4

(1) merge using ['key1', 'key2']

Click to copy

>>> pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K1   A3   B3  NaN  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3  4.0

(2) Set ['key1', 'key2'] as left index and merge it using the index and keys

Click to copy

>>> left = left.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_on=['key1', 'key2'], how='outer').reset_index(drop=True)

     A    B key1 key2    C    D    E
0   A0   B0   K0   K0   C0   D0  1.0
1   A1   B1   K0   K1  NaN  NaN  NaN
2   A2   B2   K1   K0   C1   D1  2.0
3   A2   B2   K1   K0   C2   D2  3.0
4   A3   B3   K2   K1  NaN  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3  4.0

(3) Further set ['key1', 'key2'] as right index and merge it using the index

Click to copy

>>> right = right.set_index(['key1', 'key2'])
>>> pd.merge(left, right, left_index=True, right_index=True, how='outer').reset_index()

  key1 key2    A    B    C    D    E
0   K0   K0   A0   B0   C0   D0  1.0
1   K0   K1   A1   B1  NaN  NaN  NaN
2   K1   K0   A2   B2   C1   D1  2.0
3   K1   K0   A2   B2   C2   D2  3.0
4   K2   K0  NaN  NaN   C3   D3  4.0
5   K2   K1   A3   B3  NaN  NaN  NaN

Please note that (1)(2)(3) above are showing the same results, and even if ['key1', 'key2'] are set as index, you can still use left_on = ['key1', 'key2'] instead of left_index=True.

Now, if you really want to merge using both ['key1', 'key2'] with index, one way to achieve this is:

Click to copy

>>> pd.merge(left.reset_index(), right.reset_index(), on=['index', 'key1', 'key2'], how='outer')

   index key1 key2    A    B    C    D    E
0      0   K0   K0   A0   B0   C0   D0  1.0
1      1   K0   K1   A1   B1  NaN  NaN  NaN
2      2   K1   K0   A2   B2   C2   D2  3.0
3      3   K2   K1   A3   B3  NaN  NaN  NaN
4      1   K1   K0  NaN  NaN   C1   D1  2.0
5      3   K2   K0  NaN  NaN   C3   D3  4.0

If you read down to here, I'm pretty sure now you know how to achieve above using multiple different ways. Hope this helps.

answered Oct 22 '22 09:10

gyoza

Related questions
                            
                                Create a custom sklearn TransformerMixin that transforms categorical variables consistently
                            
                                Browser loads old bundle.js from Webpack
                            
                                How to centre a colorbar at a defined value for seaborn heatmap?
                            
                                Vector assembler in Pyspark is creating tuple of multiple vectors instead of a single vector, how to solve the issue? [duplicate]
                            
                                How to overcome - pip install ansible on windows failing with filename or extension too long on windows
                            
                                Remove labels but keep legend in a pie chart
                            
                                How do I make a Python function with mutually exclusive arguments?
                            
                                UserWarning: Discrepancy between trainable weights and collected trainable weights error
                            
                                Run Flask using python3 not python
                            
                                How to merge two pandas time series objects with different date time indices?
                            
                                Serve React production App (local server) through Flask-Python
                            
                                %precision 2 on Python
                            
                                How do I disable geckodriver's log on Selenium (Python 3)?
                            
                                Why is a NumPy int not an instance of a Python int, but a NumPy float is an instance of a Python float?
                            
                                werkzeug.routing.BuildError: Could not build url for endpoint
                            
                                How to extract metadata of video files using Python 3.7? [closed]
                            
                                how to remove a row which has empty column in a dataframe using pandas
                            
                                How to plot SVM decision boundary in sklearn Python?
                            
                                Is distributing python source code in Docker secure?
                            
                                Error "Unable to open Jupyter Notebook: Port 8888 is already in use"

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Understanding the "left_index" and "right_index" arguments in pandas merge

Tags:

python

merge

join

pandas

ste_kwr

People also ask

2 Answers

Merging happens in a couple of ways:

Tam Le

gyoza

Recent Activity

Donate For Us