Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quick way to find all permutations of a pandas DataFrame that preserves a sort?

I have a DataFrame, and I'd like find all the permutations of it that fulfill a simple ascending sort on one of the columns. (There are many ties.) For example, in the following DataFrame

df = pd.DataFrame({'name': ["Abe", "Bob", "Chris", "David", "Evan"], 
                   'age': [28, 20, 21, 22, 21]})

I'd be looking to sort by age and obtain the orders ["Bob", "Chris", "Evan", "David", "Abe"] and ["Bob", "Evan", "Chris", "David", "Abe"].

I'm new to python (and to pandas) and curious if there is a simple way to do this that I don't see.

Thanks!

like image 226
Reason Avatar asked Nov 10 '22 14:11

Reason


1 Answers

Since you're grouping by age, let's do that and return all the permutations for each group and then take the product (using itertools' product and permutation functions):

In [11]: age = df.groupby("age")

If we look at the permutations of a single group:

In [12]: age.get_group(21)
Out[12]:
   age   name
2   21  Chris
4   21   Evan

In [13]: list(permutations(age.get_group(21).index))
Out[13]: [(2, 4), (4, 2)]

In [14]: [df.loc[list(p)] for p in permutations(age.get_group(21).index)]
Out[14]:
[   age   name
 2   21  Chris
 4   21   Evan,    age   name
 4   21   Evan
 2   21  Chris]

We can do this on the entire DataFrame by returning just the index for each group (this assumes that the index is unique, if it's not reset_index prior to doing this... you may be able to do something slightly more lower level):

In [21]: [list(permutations(grp.index)) for (name, grp) in age]
Out[21]: [[(1,)], [(2, 4), (4, 2)], [(3,)], [(0,)]]

In [22]: list(product(*[(permutations(grp.index)) for (name, grp) in age]))
Out[22]: [((1,), (2, 4), (3,), (0,)), ((1,), (4, 2), (3,), (0,))]

We can glue these up with sum:

In [23]: [sum(tups, ()) for tups in product(*[(permutations(grp.index)) for (name, grp) in age])]
Out[23]: [(1, 2, 4, 3, 0), (1, 4, 2, 3, 0)]

If you make these a list you can apply loc (which gets you the desired result):

In [24]: [df.loc[list(sum(tups, ()))] for tups in product(*[list(permutations(grp.index)) for (name, grp) in age])]
Out[24]:
[   age   name
 1   20    Bob
 2   21  Chris
 4   21   Evan
 3   22  David
 0   28    Abe,    age   name
 1   20    Bob
 4   21   Evan
 2   21  Chris
 3   22  David
 0   28    Abe]

And the (list of) the name column:

In [25]: [list(df.loc[list(sum(tups, ())), "name"]) for tups in product(*[(permutations(grp.index)) for (name, grp) in age])]
Out[25]:
[['Bob', 'Chris', 'Evan', 'David', 'Abe'],
 ['Bob', 'Evan', 'Chris', 'David', 'Abe']]

Note: It may be faster to use a numpy permutation matrix and pd.tools.util.cartesian_product. I suspect it's much of a muchness and wouldn't explore this unless this was unusably slow (it's potentially going to be slow anyway because there could be many many permutations)...

like image 112
Andy Hayden Avatar answered Nov 14 '22 22:11

Andy Hayden