Preserve the non-numerical columns when doing pandas.DataFrame.groupby().sum()

Tags:

python

pandas

Can I preserve the non-numerical columns (keeping the first value that appears in each group) when calling pandas.DataFrame.groupby().sum()?

For example, I have a DataFrame like this:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['aa1', 'aa2', 'aa1', 'aa2'],
...                    'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
...                    'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
...                    'D': [1, 2, 3, 4],
...                    'E': [1, 2, 3, 4]})
>>> df
     A     B     C  D  E
0  aa1   bb1   cc1  1  1
1  aa2  bbb1  ccc2  2  2
2  aa1   bb2  ccc3  3  3
3  aa2  bbb2  ccc4  4  4
>>> df.groupby(["A"]).sum()
     D  E
A        
aa1  4  4
aa2  6  6

Following is the result I want to obtain:

        B     C  D  E
A
aa1   bb1   cc1  4  4
aa2  bbb1  ccc2  6  6

Note that the values in columns B and C are the first B and C values encountered for each group key.

Aaron Wang asked Dec 19 '15 at 07:12



3 Answers

Just use 'first':

df.groupby(["A"]).agg({'B': 'first',
                       'C': 'first',
                       'D': 'sum',
                       'E': 'sum'})
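
If there are many columns, the aggregation dictionary does not have to be typed out by hand. Below is a minimal sketch, assuming the rule is "sum every numeric column, keep the first value of everything else"; the names `key`, `agg_map` and `result` are only illustrative and are not part of the answer above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The example frame from the question
df = pd.DataFrame({
    'A': ['aa1', 'aa2', 'aa1', 'aa2'],
    'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
    'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
    'D': [1, 2, 3, 4],
    'E': [1, 2, 3, 4],
})

key = 'A'  # hypothetical name for the grouping column

# Assumed rule: 'sum' for numeric columns, 'first' for all other columns
agg_map = {
    col: 'sum' if is_numeric_dtype(df[col]) else 'first'
    for col in df.columns
    if col != key
}

result = df.groupby(key).agg(agg_map)
print(result)
```

The result keeps the columns in their original order (B, C, D, E) because the dictionary is built by walking `df.columns`.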
Phillip Homer answered Oct 24 '22 at 07:10



For each key in the groupby-sum dataframe, look up the key in the original dataframe and put the first associated value of column B into a new column.

# Group by A and sum the numeric columns D and E
# (numeric_only=True keeps pandas 2.x from also "summing" the string columns)
df_1 = df.groupby(['A']).sum(numeric_only=True)

# Find the first value in column B associated with each groupby key
col_b = []
# Iterate through the keys and take the first value in df['B'] whose row has that key in column A
for i in df_1.index:
    col_b.append(df.loc[df['A'] == i, 'B'].iloc[0])

# Insert the list of values as the first column of the summed dataframe
df_1.insert(0, 'B', col_b)
>>> df_1
        B  D  E
A
aa1   bb1  4  4
aa2  bbb1  6  6
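
The same lookup can probably be done without an explicit loop. Here is a rough sketch, assuming the question's frame and that B and C are the text columns while D and E are the numeric ones; `sums`, `firsts` and `result` are made-up names for illustration.

```python
import pandas as pd

# The example frame from the question
df = pd.DataFrame({
    'A': ['aa1', 'aa2', 'aa1', 'aa2'],
    'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
    'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
    'D': [1, 2, 3, 4],
    'E': [1, 2, 3, 4],
})

grouped = df.groupby('A')

# Sum only the numeric columns, selected explicitly
sums = grouped[['D', 'E']].sum()

# Take the first value per group for the text columns
firsts = grouped[['B', 'C']].first()

# Both results are indexed by the group key A, so they can be joined on the index
result = firsts.join(sums)
print(result)
```

One caveat: `first()` returns the first non-null value in each group, so if a group starts with NaN in B or C the result is not literally the first row's value.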
ilyas patanam answered Oct 24 '22 at 06:10



Grouping only on column 'A' gives:

df.groupby(['A']).sum(numeric_only=True)  # numeric_only skips the string column B on pandas 2.x

        C     D
A              
bar  1.26  0.88
foo  0.92 -4.19

Grouping on column 'A' and 'B' gives:

df.groupby(['A','B']).sum()

            C     D
A   B                
bar one    1.38 -0.73
    three  0.26  0.80
    two   -0.38  0.81
foo one    1.96 -2.72
    three -0.42 -0.18
    two   -0.62 -1.29

If you want only the rows where column 'B' is 'one', you can do:

d = df.groupby(['A','B'], as_index=False).sum()
d[d.B=='one'].set_index('A')

       B     C     D
A
bar  one  1.38 -0.73
foo  one  1.96 -2.72

I'm not sure I understand, but is this what you want to do? Note: I increased the output precision to get the same numbers shown in the post.

d = df.groupby('A').sum()
d['B'] = 'one'
d.sort_index(axis=1)

       B         C         D
A                           
bar  one  1.259069  0.876959
foo  one  0.921510 -4.193397

If you want to use the first value of column 'B' in sorted order instead, you can use:

d['B'] = df.B.sort_values().iloc[0]

Here I replaced 'one', 'two', 'three' with 'a', 'b', 'c' to see whether this is what you are trying to do, and used the insert() method as suggested in another answer:

df

     A  B         C         D
0  foo  a  0.638362 -0.931817
1  bar  a  1.380706 -0.733307
2  foo  b -0.324514  0.203515
3  bar  c  0.258534  0.803298
4  foo  b -0.299485 -1.495979
5  bar  b -0.380171  0.806968
6  foo  a  1.324810 -1.792996
7  foo  c -0.417663 -0.176120

d = df.groupby('A').sum(numeric_only=True)
d.insert(0, 'B', df.B.sort_values().iloc[0])
d

     B         C         D
A                         
bar  a  1.259069  0.876959
foo  a  0.921510 -4.193397
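
A note on taking "the first sorted value" of B: in current pandas the sort is spelled `sort_values()`, and a plain `[0]` on the sorted Series looks up the index label 0 rather than the first position. A small sketch of the difference, using a made-up three-element Series (not the data above):

```python
import pandas as pd

# Hypothetical Series with the default integer index 0, 1, 2
s = pd.Series(['c', 'a', 'b'])

# Sorting reorders the values but each value keeps its original index label
sorted_s = s.sort_values()

# Position-based access: the first element in sorted order
by_position = sorted_s.iloc[0]

# Label-based access: the element whose index label is 0,
# which is whatever was in the first row before sorting
by_label = sorted_s.loc[0]

print(by_position, by_label)
```

Here `by_position` is 'a' and `by_label` is 'c'. In the foo/bar example the two happen to agree because row 0 already holds the smallest value.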
Steve Misuta answered Oct 24 '22 at 07:10
