How can I apply groupby twice on a pandas DataFrame?

I have a pandas DataFrame with the columns 'year', 'month' and 'transaction id'. I want to get the transaction count for every month of every year. For example, my data looks like this:

year: {2015,2015,2015,2016,2016,2017}
month: {1,  1,   2,   2,   2,    1}
tid: {123,  343, 453, 675, 786, 332}

I want output that gives, for every year, the number of transactions per month. For example, for the year 2015 I would get:

month: [1,2]
count: [2,1]

I used groupby('year'), but how can I then get the per-month transaction count?

neha asked Aug 08 '17 06:08




2 Answers

You need to group by both columns, year and month, and then aggregate with size:

import pandas as pd

year = [2015, 2015, 2015, 2016, 2016, 2017]
month = [1, 1, 2, 2, 2, 1]
tid = [123, 343, 453, 675, 786, 332]

df = pd.DataFrame({'year': year, 'month': month, 'tid': tid})
print(df)
   month  tid  year
0      1  123  2015
1      1  343  2015
2      2  453  2015
3      2  675  2016
4      2  786  2016
5      1  332  2017

df1 = df.groupby(['year','month'])['tid'].size().reset_index(name='count')
print (df1)
   year  month  count
0  2015      1      2
1  2015      2      1
2  2016      2      2
3  2017      1      1
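
To get the per-year view the question asks for, the grouped result can be filtered by year. This is a minimal sketch assuming the same sample data as above; the helper name `monthly_counts` is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'year':  [2015, 2015, 2015, 2016, 2016, 2017],
    'month': [1, 1, 2, 2, 2, 1],
    'tid':   [123, 343, 453, 675, 786, 332],
})

# Count transactions per (year, month) pair.
counts = df.groupby(['year', 'month'])['tid'].size().reset_index(name='count')


def monthly_counts(counts, year):
    """Hypothetical helper: return (months, counts) as lists for one year."""
    one_year = counts[counts['year'] == year]
    return one_year['month'].tolist(), one_year['count'].tolist()


months, n = monthly_counts(counts, 2015)
print(months, n)  # [1, 2] [2, 1]
```

Filtering the already aggregated frame keeps the counting in a single groupby call, so no second groupby is needed.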

jezrael answered Nov 15 '22 04:11


Another option, for more complex tasks: suppose you want to group by "year" and by a function applied to "tid", e.g. a bucket categorization:

def tidBucket(x):
   if x < 300:           return "low"
   if 300 <= x < 700:    return "medium"
   if x >= 700:          return "high"

Then the above solution would not work directly, because the bucket is not a column of the data frame. You could solve the problem by first grouping by year and then iterating over the groupby object, applying a second groupby to each group:

gb = df.groupby(by='year')
grouped = {}
for year, df1 in gb:
    # a function passed to groupby is applied to the index labels, so index by tid first
    grouped[year] = df1.set_index('tid').groupby(by=tidBucket)

Then aggregate each inner groupby as desired (for example with size()). Alternatively, you could create an additional "bucket" column:

df["bucket"] = df["tid"].map(tidBucket)

and then follow @jezrael's solution, grouping by ['year', 'bucket'] instead of ['year', 'month'].
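
Put together, the bucket-column route might look like the sketch below. It assumes the sample data from the question and the 300 and 700 thresholds from tidBucket; the snake_case name `tid_bucket` is only a rename for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'year':  [2015, 2015, 2015, 2016, 2016, 2017],
    'month': [1, 1, 2, 2, 2, 1],
    'tid':   [123, 343, 453, 675, 786, 332],
})


def tid_bucket(x):
    """Classify a transaction id as 'low' (<300), 'medium' (<700) or 'high'."""
    if x < 300:
        return "low"
    if x < 700:
        return "medium"
    return "high"


# Materialize the bucket as a regular column so it can be used as a group key.
df['bucket'] = df['tid'].map(tid_bucket)

# Count transactions per (year, bucket) pair in a single groupby.
bucket_counts = (
    df.groupby(['year', 'bucket'])['tid']
      .size()
      .reset_index(name='count')
)
print(bucket_counts)
```

Compared with the nested loop, this keeps everything in one data frame, which tends to be easier to sort, filter or merge afterwards.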

Owen answered Nov 15 '22 04:11