Sorting columns and selecting top n rows in each group pandas dataframe

Tags:

python

pandas

I have a dataframe like this:

mainid  pidx    pidy   score   1      a        b      2   1      a        c      5   1      c        a      7   1      c        b      2   1      a        e      8   2      x        y      1   2      y        z      3   2      z        y      5   2      x        w      12   2      x        v      1   2      y        x      6

I want to groupby on column 'pidx' and then sort score in descending order in each group i.e for each pidx

and then select head(2) i.e top 2 from each group.

The result I am looking for is like this:

mainid   pidx    pidy    score   1        a      e        8   1        a      c        5   1        c      a        7   1        c      b        2   2        x      w        12   2        x      y        1   2        y      x        6   2        y      z        3   2        z      y        5

What I tried was:

df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)

and this seems to work, but I dont know if it's the right approach if working on a huge dataset. What other best method can I use to get such result?

726

asked Jan 24 '17 10:01

Shubham R

1 Answers

There are 2 solutions:

1.sort_values and aggregate head:

df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2) print (df1)      mainid pidx pidy  score 8        2    x    w     12 4        1    a    e      8 2        1    c    a      7 10       2    y    x      6 1        1    a    c      5 7        2    z    y      5 6        2    y    z      3 3        1    c    b      2 5        2    x    y      1

2.set_index and aggregate nlargest:

df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()  print (df)   pidx  mainid pidy  score 0    a       1    e      8 1    a       1    c      5 2    c       1    a      7 3    c       1    b      2 4    x       2    w     12 5    x       2    y      1 6    y       2    x      6 7    y       2    z      3 8    z       2    y      5

Timings:

np.random.seed(123) N = 1000000  L1 = list('abcdefghijklmnopqrstu') L2 = list('efghijklmnopqrstuvwxyz') df = pd.DataFrame({'mainid':np.random.randint(1000, size=N),                    'pidx': np.random.randint(10000, size=N),                    'pidy': np.random.choice(L2, N),                    'score':np.random.randint(1000, size=N)}) #print (df)  def epat(df):     grouped = df.groupby('pidx')     new_df = pd.DataFrame([], columns = df.columns)     for key, values in grouped:         new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0)     return (new_df)  print (epat(df))  In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2)) 1 loop, best of 3: 309 ms per loop  In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()) 1 loop, best of 3: 7.11 s per loop  In [147]: %timeit (epat(df)) 1 loop, best of 3: 22 s per loop

133

answered Sep 21 '22 19:09

jezrael

Related questions
                            
                                Python Unit Testing: Automatically Running the Debugger when a test fails
                            
                                Valid JSON giving JSONDecodeError: Expecting , delimiter
                            
                                Long imports in Python
                            
                                How to make an optional value for argument using argparse?
                            
                                python subclasscheck & subclasshook
                            
                                Performing a getattr() style lookup in a django template
                            
                                How to use logging with python's fileConfig and configure the logfile filename
                            
                                How to create a utility class correctly [closed]
                            
                                Correct Type annotation for __init__
                            
                                Accessing the user's request in a post_save signal
                            
                                What makes something iterable in python
                            
                                Add months to a date in Pandas
                            
                                OrderedDict vs defaultdict vs dict [closed]
                            
                                Invoking a constructor in a 'with' statement
                            
                                Python state-machine design
                            
                                When I assign a list to variable why Pycharm give me a prompt that is "this list creation could be rewritten as a list literal"?
                            
                                Error handling when importing modules
                            
                                more than 9 subplots in matplotlib
                            
                                Creating a nested dictionary from a flattened dictionary
                            
                                How to PATCH a single field using Django Rest Framework?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With