
Pandas dataframe group: sum one column, take first element from others

I have a pandas DataFrame:

import pandas as pd
x = pd.DataFrame.from_dict({'row':[1, 1, 2, 2, 3, 3, 3], 'add': [1, 2, 3, 4, 5, 6, 7], 'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'], 'take2': ['11', '22', '33', '44', '55', '66', '77'], 'range': [100, 200, 300, 400, 500, 600, 700]})


   add  range  row take1 take2
0    1    100    1     a    11
1    2    200    1     b    22
2    3    300    2     c    33
3    4    400    2     d    44
4    5    500    3     e    55
5    6    600    3     f    66
6    7    700    3     g    77

I want to group it by the row column, sum the entries in the add column, take the first entry from take1 and take2, and select the min and max of range:

   add    row take1 take2  min_range   max_range
0    3      1     a    11    100        200
1    7      2     c    33    300        400
2    18     3     e    55    500        700
asked Sep 19 '17 15:09 by Baron Yugovich


1 Answer

Use DataFrameGroupBy.agg with a dict. Some cleanup is needed afterwards, because passing a list of functions for one column produces a MultiIndex in the columns:

# Map each column name to the aggregation(s) to apply to it
d = {'add':'sum', 'take1':'first', 'take2':'first', 'range':['min','max']}

# Group by the row column and apply the aggregations specified in d
df = x.groupby('row', as_index=False).agg(d)

# Blank out the 'first' and 'sum' labels in the second column level,
# then join the two levels with '_' and strip the leftover underscore
df = df.rename(columns={'first':'', 'sum':''})
df.columns = ['{0[0]}_{0[1]}'.format(col).strip('_') for col in df.columns]
print(df)
   row take1  range_min  range_max take2  add
0    1     a        100        200    11    3
1    2     c        300        400    33    7
2    3     e        500        700    55   18
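
As an alternative to the dict plus cleanup, pandas 0.25 and later support named aggregation, where each keyword names an output column and maps to a (source column, function) pair. This is a minimal sketch and not part of the original answer; it assumes a recent pandas version, and the output names min_range and max_range are taken from the desired output in the question.

```python
import pandas as pd

x = pd.DataFrame({
    'row': [1, 1, 2, 2, 3, 3, 3],
    'add': [1, 2, 3, 4, 5, 6, 7],
    'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'take2': ['11', '22', '33', '44', '55', '66', '77'],
    'range': [100, 200, 300, 400, 500, 600, 700],
})

# Each keyword is an output column name; each value is
# (source column, aggregation function).
result = x.groupby('row', as_index=False).agg(
    add=('add', 'sum'),
    take1=('take1', 'first'),
    take2=('take2', 'first'),
    min_range=('range', 'min'),
    max_range=('range', 'max'),
)
print(result)
```

Because the output names are given directly, the columns come back flat and no renaming step is needed.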

Details: aggregate each column with the functions specified in the dictionary (the column order may differ depending on the Python and pandas versions):

df = x.groupby('row', as_index=False).agg(d)
row range      take2 take1 add
        min  max first first sum
0   1   100  200    11     a   3
1   2   300  400    33     c   7
2   3   500  700    55     e  18

Replacing the column labels sum and first with '' leads to:

 row range      take2 take1 add
        min  max                
0   1   100  200    11     a   3
1   2   300  400    33     c   7
2   3   500  700    55     e  18
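
To see what the rename does, it can help to look at the column labels as tuples before and after. This is a small sketch under the assumption that it runs on a reasonably recent pandas; the names agg, before and after are only illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'row': [1, 1, 2, 2, 3, 3, 3],
    'add': [1, 2, 3, 4, 5, 6, 7],
    'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'take2': ['11', '22', '33', '44', '55', '66', '77'],
    'range': [100, 200, 300, 400, 500, 600, 700],
})
d = {'add': 'sum', 'take1': 'first', 'take2': 'first', 'range': ['min', 'max']}

agg = x.groupby('row', as_index=False).agg(d)

# Each MultiIndex column is a (column, function) tuple.
before = agg.columns.tolist()

# rename maps matching labels on every level, so 'sum' and 'first'
# in the second level become empty strings.
after = agg.rename(columns={'first': '', 'sum': ''}).columns.tolist()

print(before)
print(after)
```

Only the labels min and max survive in the second level, which is why those are the only ones still shown in the printed header.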

A list comprehension over the columns then joins the two levels of each label with a string formatter to build the desired column names. Assigning the result to df.columns gives the desired output.
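
The joining step works on plain tuples, so it can be tried without pandas. The sketch below uses a hand-written list of tuples (cols is an illustrative stand-in for df.columns after the rename).

```python
# Stand-in for the MultiIndex column labels after the rename step.
cols = [
    ('row', ''), ('add', ''), ('take1', ''), ('take2', ''),
    ('range', 'min'), ('range', 'max'),
]

# Join both levels with '_'; labels with an empty second level end in
# a trailing '_', which strip('_') removes.
flat = ['{0[0]}_{0[1]}'.format(col).strip('_') for col in cols]
print(flat)
# ['row', 'add', 'take1', 'take2', 'range_min', 'range_max']
```

Note that strip('_') also removes any leading or trailing underscores that belong to the original column names, so this shortcut is only safe when no column starts or ends with an underscore.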

answered Sep 28 '22 08:09 by jezrael