What are the top two C values for every A based on the value in B?
import pandas as pd

df = pd.DataFrame({
    'A': ["first", "second", "second", "first",
          "second", "first", "third", "fourth",
          "fifth", "second", "fifth", "first",
          "first", "second", "third", "fourth", "fifth"],
    'B': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'C': ["a", "b", "c", "d",
          "e", "f", "g", "h",
          "i", "j", "k", "l",
          "m", "n", "o", "p", "q"]})
I am trying:
x = df.groupby(['A'])['B'].nlargest(2)
A
fifth 16 7
10 4
first 12 6
11 5
fourth 15 7
7 3
second 13 6
9 4
third 14 6
6 3
But this drops column C, and that is the actual value I need.
I want C in the results, not a row index to the original df. Do I have to join back up? I'll even take just a list of C alone...
I need to act on the top 2 C values (based on B) for every A.
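For reference, the kind of join-back I'd like to avoid would look something like this (the variable names are just placeholders): the second index level of the nlargest result holds the original row labels, so they can be used to re-select from df.
# hypothetical join-back: recover the original row labels from the MultiIndex
top_idx = df.groupby(['A'])['B'].nlargest(2).index.get_level_values(1)
top_rows = df.loc[top_idx]        # the top-2 rows per A, with C attached
top_c = top_rows['C'].tolist()    # or just the C values as a flat list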
IIUC:
In [42]: df.groupby(['A'])['B','C'].apply(lambda x: x.nlargest(2, columns=['B']))
Out[42]:
B C
A
fifth 16 7 q
10 4 k
first 12 6 m
11 5 l
fourth 15 7 p
7 3 h
second 13 6 n
9 4 j
third 14 6 o
6 3 g
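Note that newer pandas versions reject the single-bracket multi-column selection used above, so the same idea needs double brackets there; a minimal sketch of the updated call (behaviour otherwise unchanged):
# same approach, written for current pandas: select B and C with a list,
# then keep the two largest-B rows per group
result = df.groupby(['A'])[['B', 'C']].apply(lambda x: x.nlargest(2, columns=['B']))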
I just came across the same issue and used @MaxU's solution (also voted it up). However, it is slow due to apply, which essentially creates many new sub-dataframes and merges them again. Here is a different approach using sort_values in conjunction with tail:
df.sort_values(["A", "B"]).groupby("A").tail(2)
A B C
10 fifth 4 k
16 fifth 7 q
11 first 5 l
12 first 6 m
7 fourth 3 h
15 fourth 7 p
9 second 4 j
13 second 6 n
6 third 3 g
14 third 6 o
This solution yields the same result, except that the row order is different, but I assume that doesn't matter in your example anyway. However, in case it is important, you may add some extra calls to get the exact same output:
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
B C
A
fifth 7 q
fifth 4 k
first 6 m
first 5 l
fourth 7 p
fourth 3 h
second 6 n
second 4 j
third 6 o
third 3 g
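If you only need the C values themselves (the question mentions that even a plain list of C per A would do), the same sorted-and-grouped frame can be collapsed into per-group lists; a small sketch on top of the result above:
# collapse the top-2 rows per A into a list of their C values
top2 = df.sort_values(["A", "B"]).groupby("A").tail(2)
c_per_group = top2.groupby("A")["C"].agg(list)
# c_per_group is a Series indexed by A, e.g. 'fifth' -> ['k', 'q']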
Here are the benchmarks:
%%timeit
df.sort_values(["A", "B"]).groupby("A").tail(2)
1.9 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
2.4 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(['A'])['B','C'].apply(lambda x: x.nlargest(2, columns=['B']))
10.1 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The solution using sort_values is roughly 5 times faster. I expect this gap to widen for real (larger) datasets.
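If you want to verify that the gap widens, the same timings can be rerun on a larger synthetic frame; a rough sketch (the sizes below are arbitrary):
import numpy as np
import pandas as pd

# build a frame with ~1,000 groups and 100,000 rows for re-timing
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "A": rng.integers(0, 1_000, size=100_000),
    "B": rng.integers(0, 1_000_000, size=100_000),
    "C": rng.integers(0, 1_000_000, size=100_000),
})

# %timeit big.sort_values(["A", "B"]).groupby("A").tail(2)
# %timeit big.groupby("A")[["B", "C"]].apply(lambda x: x.nlargest(2, columns=["B"]))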