 

Pandas dataframe : Group by A, take nlargest by B, output C

What are the top two C values for every A based on the value in B?

    df = pd.DataFrame({
        'A': ["first", "second", "second", "first",
              "second", "first", "third", "fourth",
              "fifth", "second", "fifth", "first",
              "first", "second", "third", "fourth", "fifth"],
        'B': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
        'C': ["a", "b", "c", "d",
              "e", "f", "g", "h",
              "i", "j", "k", "l",
              "m", "n", "o", "p", "q"]})

I am trying:

    x = df.groupby(['A'])['B'].nlargest(2)

    A
    fifth   16    7
            10    4
    first   12    6
            11    5
    fourth  15    7
            7     3
    second  13    6
            9     4
    third   14    6
            6     3

But this drops column C, and that is the actual value I need.

I want C in the results, not a row index to the original df. Do I have to join back up? I'll even take just a list of C alone...

I need to act on the top 2 C values (based on B) for every A.
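
For reference, here is a rough sketch of the "join back up" idea using the df above: the second index level returned by nlargest holds the original row labels, so they can be fed back into .loc (assuming the default integer index):

    x = df.groupby(['A'])['B'].nlargest(2)

    # Level 1 of the MultiIndex holds the original row labels of df
    top_rows = df.loc[x.index.get_level_values(1), ['A', 'B', 'C']]

    # Or just the C values as a plain list
    top_c = top_rows['C'].tolist()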

asked May 03 '17 by user4445586

2 Answers

IIUC:

In [42]: df.groupby(['A'])['B','C'].apply(lambda x: x.nlargest(2, columns=['B']))
Out[42]:
           B  C
A
fifth  16  7  q
       10  4  k
first  12  6  m
       11  5  l
fourth 15  7  p
       7   3  h
second 13  6  n
       9   4  j
third  14  6  o
       6   3  g
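
Note that on more recent pandas versions (roughly 2.0 and later, as an assumption) the ['B','C'] column selection after groupby is no longer allowed, so the same idea would need double brackets:

# Double brackets select B and C as a DataFrame per group (newer-pandas equivalent)
df.groupby('A')[['B', 'C']].apply(lambda g: g.nlargest(2, columns='B'))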
answered Nov 17 '22 by MaxU - stop WAR against UA


I just came across the same issue and used @MaxU's solution (also voted it up). However, it is slow because apply essentially creates many new sub-dataframes and merges them back together. Here is a different approach using sort_values in conjunction with tail:

df.sort_values(["A", "B"]).groupby("A").tail(2)

    A       B   C
10  fifth   4   k
16  fifth   7   q
11  first   5   l
12  first   6   m
7   fourth  3   h
15  fourth  7   p
9   second  4   j
13  second  6   n
6   third   3   g
14  third   6   o

This solution yields the same result, except that the row order differs, which I assume doesn't matter in your example anyway. However, in case it is important, you can add a few extra calls to get the exact same output:

df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")

        B   C
A       
fifth   7   q
fifth   4   k
first   6   m
first   5   l
fourth  7   p
fourth  3   h
second  6   n
second  4   j
third   6   o
third   3   g

Here are the benchmarks:

%%timeit
df.sort_values(["A", "B"]).groupby("A").tail(2)
1.9 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
2.4 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.groupby(['A'])['B','C'].apply(lambda x: x.nlargest(2, columns=['B']))
10.1 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The solution using sort_values is roughly 5 times faster than the apply-based one. I expect the gap to widen on real (larger) datasets.
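
For anyone who wants to verify that on a bigger frame, here is a small sketch with a synthetic dataset (the group count and sizes are arbitrary assumptions):

import numpy as np
import pandas as pd

# Synthetic frame: 10,000 groups with 100 rows each (arbitrary sizes for illustration)
n_groups, rows_per_group = 10_000, 100
big = pd.DataFrame({
    'A': np.repeat(np.arange(n_groups), rows_per_group),
    'B': np.random.rand(n_groups * rows_per_group),
    'C': np.random.rand(n_groups * rows_per_group),
})

# Then compare, e.g. in IPython:
# %timeit big.sort_values(['A', 'B']).groupby('A').tail(2)
# %timeit big.groupby('A')[['B', 'C']].apply(lambda g: g.nlargest(2, columns='B'))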

answered Nov 17 '22 by pansen