I have a pandas DataFrame that is composed of different subgroups.
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01],
})
I want to find the rank of each id within its group, with, say, lower values being better. In the example above, in group a, id 1 would have a rank of 1 and id 2 a rank of 4; in group b, id 5 would have a rank of 2, id 8 a rank of 1, and so on.
Right now I compute the ranks by:

1. Sorting by value:

df.sort_values('value', ascending=True, inplace=True)

2. Creating a ranker function (it assumes the frame is already sorted):

import numpy as np

def ranker(df):
    df['rank'] = np.arange(len(df)) + 1
    return df

3. Applying the ranker function to each group separately:

df = df.groupby(['group']).apply(ranker)
This process works, but it is really slow when I run it on millions of rows of data. Does anyone have any ideas for a faster ranker function?
rank is cythonized, so it should be very fast, and you can pass the same options as df.rank(). Here are the docs for rank. As you can see, tie-breaks can be handled in one of five different ways via the method argument ('average', 'min', 'max', 'first', and 'dense').
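For illustration, a minimal sketch of the five methods on a small Series with a tie (the Series and its values here are hypothetical, not from the question):

import pandas as pd

s = pd.Series([0.1, 0.4, 0.4, 0.7])

# 'average' (the default): tied values share the mean of their ranks
s.rank(method='average')  # 1.0, 2.5, 2.5, 4.0
# 'min': ties all receive the lowest rank of the tied block
s.rank(method='min')      # 1.0, 2.0, 2.0, 4.0
# 'max': ties all receive the highest rank of the tied block
s.rank(method='max')      # 1.0, 3.0, 3.0, 4.0
# 'first': ties are broken by order of appearance
s.rank(method='first')    # 1.0, 2.0, 3.0, 4.0
# 'dense': like 'min', but ranks increase by 1 between distinct values
s.rank(method='dense')    # 1.0, 2.0, 2.0, 3.0

The same method option works unchanged on the grouped rank shown below.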
It's also possible you simply want the .cumcount() of the group.
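A minimal sketch of that alternative, assuming the df built at the top of the question: cumcount() numbers rows 0-based within each group in their current row order, so sorting by value first and adding 1 reproduces the 1-based ranks of the original ranker:

out = df.sort_values('value')
# cumcount() gives each row its 0-based position within its group,
# taken in the current (value-sorted) row order
out['rank'] = out.groupby('group').cumcount() + 1
out = out.sort_index()  # restore the original row order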
And rank applied to your frame, using the default ascending=True so that, as you described, lower values get better ranks (ascending=False would rank higher values first):

In [12]: df.groupby('group')['value'].rank()
Out[12]:
0    1
1    4
2    2
3    3
4    2
5    3
6    4
7    1
dtype: float64
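For the question's pipeline specifically, a sketch of a one-line replacement; method='first' breaks ties by order of appearance, which matches what the np.arange-based ranker does when the preceding sort is stable, and the 'rank' column name is the one used in the question:

# one vectorized line replaces sort + ranker + apply;
# .astype(int) is optional, for integer ranks like the original
df['rank'] = df.groupby('group')['value'].rank(method='first').astype(int)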