Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to rank rows in subgroups in pandas dataframe

Tags:

python

pandas

I have a pandas data frame that has is composed of different subgroups.

    df = pd.DataFrame({     'id':[1, 2, 3, 4, 5, 6, 7, 8],      'group':['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],      'value':[.01, .4, .2, .3, .11, .21, .4, .01]     }) 

I want to find the rank of each id in its group with say, lower values being better. In the example above, in group A, Id 1 would have a rank of 1, Id 2 would have a rank of 4. In group B, Id 5 would have a rank of 2, Id 8 would have a rank of 1 and so on.

Right now I assess the ranks by:

  1. Sorting by value.

    df.sort('value', ascending = True, inplace=True)

  2. Create a ranker function (it assumes variables already sorted)

    def ranker(df): df['rank'] = np.arange(len(df)) + 1 return df

  3. Apply the ranker function on each group separately:

    df = df.groupby(['group']).apply(ranker)

This process works but it is really slow when I run it on millions of rows of data. Does anyone have any ideas on how to make a faster ranker function.

like image 241
captain ahab Avatar asked Nov 03 '14 18:11

captain ahab


People also ask

How do you get top 5 rows in Pandas?

pandas.DataFrame.head() In Python's Pandas module, the Dataframe class provides a head() function to fetch top rows from a Dataframe i.e. It returns the first n rows from a dataframe. If n is not provided then default value is 5.

Is Pandas query faster than LOC?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.


1 Answers

rank is cythonized so should be very fast. And you can pass the same options as df.rank() here are the docs for rank. As you can see, tie-breaks can be done in one of five different ways via the method argument.

Its also possible you simply want the .cumcount() of the group.

In [12]: df.groupby('group')['value'].rank(ascending=False) Out[12]:  0    4 1    1 2    3 3    2 4    3 5    2 6    1 7    4 dtype: float64 
like image 190
Jeff Avatar answered Oct 11 '22 11:10

Jeff