I have a dataframe like this:
mainid pidx pidy score 1 a b 2 1 a c 5 1 c a 7 1 c b 2 1 a e 8 2 x y 1 2 y z 3 2 z y 5 2 x w 12 2 x v 1 2 y x 6
I want to groupby on column 'pidx'
and then sort score in descending order in each group
i.e for each pidx
and then select head(2)
i.e top 2 from each group.
The result I am looking for is like this:
mainid pidx pidy score 1 a e 8 1 a c 5 1 c a 7 1 c b 2 2 x w 12 2 x y 1 2 y x 6 2 y z 3 2 z y 5
What I tried was:
df.sort(['pidx','score'],ascending = False).groupby('pidx').head(2)
and this seems to work, but I dont know if it's the right approach if working on a huge dataset. What other best method can I use to get such result?
Use pandas. DataFrame. head(n) to get the first n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the start).
You can use df. head() to get the first N rows in Pandas DataFrame. Alternatively, you can specify a negative number within the brackets to get all the rows, excluding the last N rows.
There are 2 solutions:
1.sort_values
and aggregate head
:
df1 = df.sort_values('score',ascending = False).groupby('pidx').head(2) print (df1) mainid pidx pidy score 8 2 x w 12 4 1 a e 8 2 1 c a 7 10 2 y x 6 1 1 a c 5 7 2 z y 5 6 2 y z 3 3 1 c b 2 5 2 x y 1
2.set_index
and aggregate nlargest
:
df = df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index() print (df) pidx mainid pidy score 0 a 1 e 8 1 a 1 c 5 2 c 1 a 7 3 c 1 b 2 4 x 2 w 12 5 x 2 y 1 6 y 2 x 6 7 y 2 z 3 8 z 2 y 5
Timings:
np.random.seed(123) N = 1000000 L1 = list('abcdefghijklmnopqrstu') L2 = list('efghijklmnopqrstuvwxyz') df = pd.DataFrame({'mainid':np.random.randint(1000, size=N), 'pidx': np.random.randint(10000, size=N), 'pidy': np.random.choice(L2, N), 'score':np.random.randint(1000, size=N)}) #print (df) def epat(df): grouped = df.groupby('pidx') new_df = pd.DataFrame([], columns = df.columns) for key, values in grouped: new_df = pd.concat([new_df, grouped.get_group(key).sort_values('score', ascending=True)[:2]], 0) return (new_df) print (epat(df)) In [133]: %timeit (df.sort_values('score',ascending = False).groupby('pidx').head(2)) 1 loop, best of 3: 309 ms per loop In [134]: %timeit (df.set_index(['mainid','pidy']).groupby('pidx')['score'].nlargest(2).reset_index()) 1 loop, best of 3: 7.11 s per loop In [147]: %timeit (epat(df)) 1 loop, best of 3: 22 s per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With