Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas: How to find the max n values for each category in a column

Tags:

python

pandas

I have a huge municipal library catalog dataset with book title, the library it's in, the library's borough, and the number of times it was loaned out.

I want to find the top 3 most loaned books for each neighbourhood.

Ideally, I'd get something like this:

Borough    Title    Total_loans
A          Book1    35615 
A          Book2    34895
A          Book3    2548
B          Book1    6541
B          Book2    5425

etc.

This is the closest I was able to get, but the resulting data frame is not grouped by borough and hard to read.

import pandas as pd

df = pd.DataFrame({"borough":["A", "B", "B", "A", "A"], "title":["Book2", "Book1", "Book2", "Book2", "Book1"], "total_loans":[4, 48, 46, 78, 15]})

top_boroughs = df.groupby(['borough','title'])
top_boroughs.aggregate(sum).sort(['total_loans','title'], ascending=False)

Thanks for your help.

like image 999
robroc Avatar asked Jul 30 '14 17:07

robroc


2 Answers

In short:

df.groupby(level=[0,1]).sum().reset_index().sort_values(['borough', 'total_loans'], ascending=[1,0]).groupby('borough').head(3)

The steps:

  • Do the correct grouping and sum
  • Sort by borough and maximum values
  • group by borough and take 3 first

This is superior to the accepted answer due to both

  • readability (yes, one long line, but you could equally split it up): all standard operations
  • performance (standard optimized operations compared to enlarging a dataframe iteratively with concat, wasting memory

My output (using head(1) since test data has only 2 rows per group:

Out[484]: 
  borough  title  total_loans
1       A  Book2           82
2       B  Book1           48
like image 137
FooBar Avatar answered Nov 16 '22 21:11

FooBar


something like this:

t = df.groupby(['borough', 'title']).sum()
t.sort('total_loans', ascending=True)
t = t.groupby(level=[0,1]).head(3).reset_index()
t.sort(['borough', 'title'], ascending=(True, False)) #not sure if this is necessary, tough to tell with limited data, but just in case...
like image 2
acushner Avatar answered Nov 16 '22 22:11

acushner