 

How can a pandas merge preserve order?

Tags: python, pandas

I have two DataFrames in pandas and I'm trying to merge them, but pandas keeps changing the row order. I've tried setting indexes and resetting them, but no matter what I do, I can't get the output rows back in the original order. Is there a trick? Note that the loans start out in the order 'a, b, c', but after the merge the order is 'a, c, b'.

import pandas
loans = [  'a',  'b', 'c' ]
states = [  'OR',  'CA', 'OR' ]
x = pandas.DataFrame({ 'loan' : loans, 'state' : states })
y = pandas.DataFrame({ 'state' : [ 'CA', 'OR' ], 'value' : [ 1, 2]})
z = x.merge(y, how='left', on='state')

But now the order is no longer the original 'a, b, c'. Any ideas? I'm using pandas version 0.11.

asked Nov 26 '13 at 00:11 by user2543623




2 Answers

Hopefully someone will provide a better answer, but in case no one does, this will definitely work, so…

Zeroth, I'm assuming you don't want to just end up sorted on loan, but to preserve whatever original order was in x, which may or may not have anything to do with the order of the loan column. (Otherwise, the problem is easier, and less interesting.)

First, you're asking it to sort based on the join keys. As the docs explain, that's the default when you don't pass a sort argument.


Second, if you don't sort based on the join keys, the rows will end up grouped together, such that two rows that merged from the same source row end up next to each other, which means you're still going to get a, c, b.

You can work around this by getting the rows grouped together in the order they appear in the original x by just merging again with x (on either side, it doesn't really matter), or by reindexing based on x if you prefer. Like this:

x.merge(x.merge(y, how='left', on='state', sort=False))
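A runnable sketch of the merge-again workaround, using the x and y frames from the question. It assumes each (loan, state) pair is unique in x; with duplicates, the second merge would multiply rows. The names merged and restored are only for illustration.

```python
import pandas as pd

x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})

# First merge: depending on the pandas version, rows may come back
# grouped by join key rather than in x's order.
merged = x.merge(y, how="left", on="state", sort=False)

# Second merge: with no `on`, pandas joins on the shared columns
# (loan and state), and the rows follow x, the left frame.
restored = x.merge(merged)

loan_order = restored["loan"].tolist()
value_order = restored["value"].tolist()
print(restored)
```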

Alternatively, you can cram an x-index in there with reset_index, then just sort on that, like this:

x.reset_index().merge(y, how='left', on='state', sort=False).sort_values('index')
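A fuller sketch of the index-based workaround, assuming a recent pandas (where sorting by a column is spelled sort_values) and the x and y frames from the question. It also assumes x has a default, unnamed index, so reset_index produces a column called 'index'. The last two steps, which put that column back as the index, are an addition not in the one-liner above.

```python
import pandas as pd

x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})

z = (
    x.reset_index()                                # original position becomes column 'index'
     .merge(y, how="left", on="state", sort=False)
     .sort_values("index")                         # restore x's row order
     .set_index("index")                           # move the position back into the index
)
z.index.name = None                                # drop the leftover 'index' label

loan_order = z["loan"].tolist()
print(z)
```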

Either way obviously seems a bit wasteful, and clumsy… so, as I said, hopefully there's a better answer that I'm just not seeing at the moment. But if not, that works.

answered Oct 17 '22 at 02:10 by abarnert


I might have a much simpler solution:

df_z = df_x.join(df_y.set_index('state'), on = 'state')
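For reference, a self-contained sketch of the join approach with the question's data, where df_x and df_y stand in for x and y. It relies on state being unique in df_y, since set_index('state') turns it into the lookup index; join defaults to a left join and keeps df_x's index and row order.

```python
import pandas as pd

df_x = pd.DataFrame({"loan": ["a", "b", "c"], "state": ["OR", "CA", "OR"]})
df_y = pd.DataFrame({"state": ["CA", "OR"], "value": [1, 2]})

# Look up each row's state in df_y's index; rows stay in df_x's order.
df_z = df_x.join(df_y.set_index("state"), on="state")

loan_order = df_z["loan"].tolist()
print(df_z)
```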

Hope it helps

answered Oct 17 '22 at 04:10 by Laurent T