I have a pandas DataFrame that is composed of different subgroups.
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'value': [.01, .4, .2, .3, .11, .21, .4, .01],
})
I want to find the rank of each id within its group, with, say, lower values being better. In the example above, in group a, id 1 would have a rank of 1 and id 2 a rank of 4; in group b, id 5 would have a rank of 2, id 8 a rank of 1, and so on.
Right now I compute the ranks by:

1. Sorting by value:

df.sort_values('value', ascending=True, inplace=True)

2. Creating a ranker function (it assumes the frame is already sorted):

import numpy as np

def ranker(df):
    df['rank'] = np.arange(len(df)) + 1
    return df

3. Applying the ranker function to each group separately:

df = df.groupby(['group']).apply(ranker)
This process works, but it is really slow when I run it on millions of rows of data. Does anyone have any ideas for a faster ranker function?
rank is cythonized, so it should be very fast, and you can pass the same options as df.rank(). Here are the docs for rank. As you can see, tie-breaks can be handled in one of five different ways via the method argument ('average', 'min', 'max', 'first', and 'dense').
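For illustration, a minimal sketch of the five methods on a small Series with a tie (the Series and its values here are hypothetical, not from the question):

import pandas as pd

s = pd.Series([0.1, 0.4, 0.4, 0.7])

# 'average' (the default): tied values share the mean of their ranks
s.rank(method='average')  # 1.0, 2.5, 2.5, 4.0
# 'min': ties all receive the lowest rank of the tied block
s.rank(method='min')      # 1.0, 2.0, 2.0, 4.0
# 'max': ties all receive the highest rank of the tied block
s.rank(method='max')      # 1.0, 3.0, 3.0, 4.0
# 'first': ties are broken by order of appearance
s.rank(method='first')    # 1.0, 2.0, 3.0, 4.0
# 'dense': like 'min', but ranks increase by 1 between distinct values
s.rank(method='dense')    # 1.0, 2.0, 2.0, 3.0

The same method option works unchanged on the grouped rank shown below.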
It's also possible you simply want the .cumcount() of the group.
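A minimal sketch of that alternative, assuming the df built at the top of the question: cumcount() numbers rows 0-based within each group in their current row order, so sorting by value first and adding 1 reproduces the 1-based ranks of the original ranker:

out = df.sort_values('value')
# cumcount() gives each row its 0-based position within its group,
# taken in the current (value-sorted) row order
out['rank'] = out.groupby('group').cumcount() + 1
out = out.sort_index()  # restore the original row order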
And rank applied to your frame, using the default ascending=True so that, as you described, lower values get better ranks (ascending=False would rank higher values first):

In [12]: df.groupby('group')['value'].rank()
Out[12]:
0    1
1    4
2    2
3    3
4    2
5    3
6    4
7    1
dtype: float64
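For the question's pipeline specifically, a sketch of a one-line replacement; method='first' breaks ties by order of appearance, which matches what the np.arange-based ranker does when the preceding sort is stable, and the 'rank' column name is the one used in the question:

# one vectorized line replaces sort + ranker + apply;
# .astype(int) is optional, for integer ranks like the original
df['rank'] = df.groupby('group')['value'].rank(method='first').astype(int)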