
In pandas, how do I sort one level of a multi-index by the values of a column while keeping the other level grouped?

I'm taking a Data Mining course at university right now, but I'm a wee bit stuck on a multi-index sorting problem.

The actual data involves about 1 million movie reviews, which I'm trying to analyze by American zip code. To test how to do what I want, though, I've been using a much smaller data set of 250 randomly generated ratings for 10 movies, with age groups in place of zip codes.

This is what I have right now: a multi-indexed DataFrame in pandas with two levels, 'group' and 'title'.

                        rating
group       title   
            Alien       4.000000
            Argo        2.166667
Adults      Ben-Hur     3.666667
            Gandhi      3.200000
            ...         ...

            Alien       3.000000
            Argo        3.750000
Coeds       Ben-Hur     3.000000
            Gandhi      2.833333
            ...         ...

            Alien       2.500000
            Argo        2.750000
Kids        Ben-Hur     3.000000
            Gandhi      3.200000
            ...         ...

What I'm aiming for is to sort the titles by rating within each group (and show only the 5 or so most popular titles in each group).

So something like this (but I'm only going to show two titles in each group):

                        rating
group       title   
            Alien       4.000000
Adults      Ben-Hur     3.666667

            Argo        3.750000
Coeds       Alien       3.000000

            Gandhi      3.200000
Kids        Ben-Hur     3.000000

Does anyone know how to do this? I've tried sort_order, sort_index, etc., and swapping the levels, but they all mix up the groups too, so it ends up looking like:

                          rating
group         title 
Adults        Alien      4.000000
Coeds         Argo       3.750000
Adults        Ben-Hur    3.666667
Kids          Gandhi     3.200000
Coeds         Alien      3.000000
Kids          Ben-Hur    3.000000

I'm looking for something like the question "Multi-Index Sorting in Pandas", but instead of sorting on another level, I want to sort on the values, as if that person had wanted to sort on their sales column.

Thanks!

Nadamir asked Dec 05 '13 23:12


1 Answer

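You're looking for sort_values (when this was first written the method was Series.sort; it was replaced by sort_values in pandas 0.17):

In [11]: s = pd.Series([3, 1, 2], [[1, 1, 2], [1, 3, 1]])

In [12]: s.sort_values(inplace=True)

In [13]: s
Out[13]: 
1  3    1
2  1    2
1  1    3
dtype: int64

Note: with inplace=True this modifies s. To return a sorted copy instead, call it without inplace (this was Series.order in older versions):

In [14]: s.sort_values()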
Out[14]: 
1  3    1
2  1    2
1  1    3
dtype: int64

Update: I realised what you were actually asking. I think this ought to be an option in sortlevel, but for now I think you have to reset_index, groupby and apply (using sort_values, which replaced DataFrame.sort in pandas 0.17):

In [21]: s.reset_index(name='s').groupby('level_0').apply(lambda g: g.sort_values('s')).set_index(['level_0', 'level_1'])['s']
Out[21]: 
level_0  level_1
1        3          1
         1          3
2        1          2
Name: s, dtype: int64
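
If the goal is only to order the values within each outer-level group, a possible sketch that does not go through groupby and apply is to sort on the outer level and the values together. This assumes a pandas version that has sort_values (0.17 or later); the names lvl0, lvl1 and val are made up for the example and are not part of the original data.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=[[1, 1, 2], [1, 3, 1]])

# Name the index levels (hypothetical names) so they become ordinary
# columns after reset_index.
s.index.names = ["lvl0", "lvl1"]
frame = s.reset_index(name="val")

# Sort by the outer level first, then by the values within each outer level.
result = (
    frame.sort_values(["lvl0", "val"])
         .set_index(["lvl0", "lvl1"])["val"]
)

# Remove the illustrative names again so the index is unnamed, as in s.
result.index.names = [None, None]
result.name = None
print(result)
```

The outer level stays grouped because it is the first sort key; the values only decide the order inside each group.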

Note: you can set the level names to [None, None] afterwards.
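
For the DataFrame in the question, one way this might look on a current pandas (again assuming sort_values is available) is to sort on the group level and the rating column together, then keep the first few rows of each group. The ratings below are only the rows visible in the question; the variable names, the title tie-breaker and n = 2 are choices made for this sketch.

```python
import pandas as pd

# Ratings copied from the rows shown in the question (elided rows omitted).
ratings = {
    ("Adults", "Alien"): 4.000000,
    ("Adults", "Argo"): 2.166667,
    ("Adults", "Ben-Hur"): 3.666667,
    ("Adults", "Gandhi"): 3.200000,
    ("Coeds", "Alien"): 3.000000,
    ("Coeds", "Argo"): 3.750000,
    ("Coeds", "Ben-Hur"): 3.000000,
    ("Coeds", "Gandhi"): 2.833333,
    ("Kids", "Alien"): 2.500000,
    ("Kids", "Argo"): 2.750000,
    ("Kids", "Ben-Hur"): 3.000000,
    ("Kids", "Gandhi"): 3.200000,
}
index = pd.MultiIndex.from_tuples(list(ratings), names=["group", "title"])
df = pd.DataFrame({"rating": list(ratings.values())}, index=index)

# Sort groups alphabetically, ratings from high to low within each group,
# and break rating ties by title so the result is deterministic.
ranked = (
    df.reset_index()
      .sort_values(["group", "rating", "title"], ascending=[True, False, True])
      .set_index(["group", "title"])
)

# Keep the top n titles of each group; head() keeps the sorted row order.
n = 2
top = ranked.groupby(level="group").head(n)
print(top)
```

Because head() does not reorder rows, the per-group ranking from sort_values carries through, and changing n to 5 gives the "top 5 or so" from the question.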

Andy Hayden answered Sep 28 '22 10:09
