How can a pandas merge preserve order?

Tags: python, pandas

I have two DataFrames in pandas and am trying to merge them, but pandas keeps changing the row order. I've tried setting indexes and resetting them, but no matter what I do, I can't get the output to keep the rows in their original order. Is there a trick? Note that we start out with the loans in the order 'a, b, c', but after the merge the order is 'a, c, b'.

import pandas

loans = ['a', 'b', 'c']
states = ['OR', 'CA', 'OR']
x = pandas.DataFrame({'loan': loans, 'state': states})
y = pandas.DataFrame({'state': ['CA', 'OR'], 'value': [1, 2]})
z = x.merge(y, how='left', on='state')

But now the order is no longer the original 'a, b, c'. Any ideas? I'm using pandas version 0.11.

asked Nov 26 '13 00:11

user2543623

2 Answers

Hopefully someone will provide a better answer, but in case no one does, this will definitely work, so…

Zeroth, I'm assuming you don't want to just end up sorted on loan, but to preserve whatever original order was in x, which may or may not have anything to do with the order of the loan column. (Otherwise, the problem is easier, and less interesting.)

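First, you're asking it to sort based on the join keys. As the docs explain, that's the default when you don't pass a sort argument. (That describes the pandas release in question; current releases document sort=False as the default.)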

Second, if you don't sort based on the join keys, the rows will end up grouped together, such that two rows that merged from the same source row end up next to each other, which means you're still going to get a, c, b.

You can work around this by getting the rows grouped together in the order they appear in the original x by just merging again with x (on either side, it doesn't really matter), or by reindexing based on x if you prefer. Like this:

x.merge(x.merge(y, how='left', on='state', sort=False))
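To see the re-merge restore the original order, here is a minimal sketch using the frames from the question. It deliberately passes sort=True to scramble the first merge (an assumption made purely for illustration, so the effect is visible whatever the installed pandas does by default). It also relies on the rows of x being unique across the shared columns; duplicate rows would be multiplied by the second merge.

```python
import pandas as pd

x = pd.DataFrame({'loan': ['a', 'b', 'c'], 'state': ['OR', 'CA', 'OR']})
y = pd.DataFrame({'state': ['CA', 'OR'], 'value': [1, 2]})

# Force a key-sorted result for illustration: rows come back ordered by state.
scrambled = x.merge(y, how='left', on='state', sort=True)

# Merging x with that result joins on the shared columns (loan, state).
# The default inner merge is documented to preserve the order of the left keys,
# so the rows line up with x again.
restored = x.merge(scrambled, sort=False)

print(list(scrambled['state']))
print(list(restored['loan']))
```

The second merge costs another full join, which is the wastefulness this answer mentions.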

Alternatively, you can cram an x-index in there with reset_index, then just sort on that, like this:

# sort_values replaces the DataFrame.sort method, which later pandas releases removed
x.reset_index().merge(y, how='left', on='state', sort=False).sort_values('index')
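The same idea can be wrapped in a small helper. This is only a sketch: the function name merge_keep_left_order and the temporary column name _left_order are hypothetical, chosen for illustration, and the helper assumes the left frame has no column with that name already. It tags each left row with its position, merges, sorts on the tag, and drops it again.

```python
import pandas as pd


def merge_keep_left_order(left, right, **merge_kwargs):
    """Merge two frames and return rows in the left frame's original order.

    Hypothetical helper for illustration; not part of pandas.
    """
    order_col = '_left_order'  # assumed not to clash with an existing column
    tagged = left.assign(**{order_col: range(len(left))})
    merged = tagged.merge(right, **merge_kwargs)
    # mergesort is stable, so rows sharing a left position keep their relative order
    merged = merged.sort_values(order_col, kind='mergesort')
    return merged.drop(columns=order_col).reset_index(drop=True)


x = pd.DataFrame({'loan': ['a', 'b', 'c'], 'state': ['OR', 'CA', 'OR']})
y = pd.DataFrame({'state': ['CA', 'OR'], 'value': [1, 2]})

z = merge_keep_left_order(x, y, how='left', on='state')
print(z)
```

Because the order is restored by an explicit sort, the result does not depend on what order the merge itself happens to return.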

Either way obviously seems a bit wasteful, and clumsy… so, as I said, hopefully there's a better answer that I'm just not seeing at the moment. But if not, that works.

answered Oct 17 '22 02:10

abarnert

I might have a much simpler solution:

df_z = df_x.join(df_y.set_index('state'), on='state')

Hope it helps
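Applied to the frames from the question, the join looks roughly like the sketch below. The uniqueness check is an addition for illustration and not part of the answer: join performs a left join by default, and if the lookup frame had repeated state values, rows of x would be repeated in the output.

```python
import pandas as pd

x = pd.DataFrame({'loan': ['a', 'b', 'c'], 'state': ['OR', 'CA', 'OR']})
y = pd.DataFrame({'state': ['CA', 'OR'], 'value': [1, 2]})

# The lookup frame must be indexed by the key that x's 'state' column refers to.
lookup = y.set_index('state')

# Added guard (an assumption about what the caller wants): one value per state.
if not lookup.index.is_unique:
    raise ValueError("duplicate 'state' keys in y would repeat rows of x")

# Left join by default: each row of x picks up the matching value.
z = x.join(lookup, on='state')
print(z)
```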

answered Oct 17 '22 04:10

Laurent T
