Can I preserve the non-numerical columns (keeping the first value that appears) when doing pandas.DataFrame.groupby().sum()?
For example, I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'A': ['aa1', 'aa2', 'aa1', 'aa2'],
                   'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
                   'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
                   'D': [1, 2, 3, 4],
                   'E': [1, 2, 3, 4]})
>>> df
     A     B     C  D  E
0  aa1   bb1   cc1  1  1
1  aa2  bbb1  ccc2  2  2
2  aa1   bb2  ccc3  3  3
3  aa2  bbb2  ccc4  4  4
>>> df.groupby(["A"]).sum()
     D  E
A
aa1  4  4
aa2  6  6
Following is the result I want to obtain:
        B     C  D  E
A
aa1   bb1   cc1  4  4
aa2  bbb1  ccc2  6  6
Notice that the values in columns B and C are the first B and C values associated with each group key.
Just use 'first':
df.groupby(["A"]).agg({'B': 'first',
                       'C': 'first',
                       'D': 'sum',
                       'E': 'sum'})
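When there are many columns, the mapping passed to agg() does not have to be typed out by hand. The following is a minimal sketch, assuming the example DataFrame from the question. The name agg_map is introduced here for illustration only; it picks 'sum' for numeric columns and 'first' for everything else.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Example data copied from the question.
df = pd.DataFrame({
    'A': ['aa1', 'aa2', 'aa1', 'aa2'],
    'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
    'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
    'D': [1, 2, 3, 4],
    'E': [1, 2, 3, 4],
})

# agg_map is a hypothetical helper name: sum the numeric columns and
# keep the first value of every other column.
agg_map = {
    col: 'sum' if is_numeric_dtype(df[col]) else 'first'
    for col in df.columns
    if col != 'A'  # 'A' is the grouping key, so it is not aggregated
}

result = df.groupby('A').agg(agg_map)
print(result)
```

The column order of the result follows the order of the keys in the mapping, so B and C stay in front of D and E.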
For each key in the groupby-sum dataframe, look up the key in the original dataframe and put the associated value of column B into a new column.
# groupby and sum over the numeric columns D and E
# (numeric_only=True keeps recent pandas versions from concatenating the string columns)
df_1 = df.groupby(['A']).sum(numeric_only=True)
Find the first value in column B associated with each groupby key:
col_b = []
# iterate through the keys and find the first value in df['B'] with that key in column A
for i in df_1.index:
    col_b.append(df['B'][df['A'] == i].iloc[0])
# insert the list of values into the new dataframe
df_1.insert(0, 'B', col_b)
>>> df_1
        B  D  E
A
aa1   bb1  4  4
aa2  bbb1  6  6
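The per-key lookup can also be done without an explicit Python loop. This is a sketch of one possible alternative, not part of the original answer, and it assumes the example DataFrame from the question: drop_duplicates('A') keeps the first row for each key, and insert() aligns the resulting Series on the index.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    'A': ['aa1', 'aa2', 'aa1', 'aa2'],
    'B': ['bb1', 'bbb1', 'bb2', 'bbb2'],
    'C': ['cc1', 'ccc2', 'ccc3', 'ccc4'],
    'D': [1, 2, 3, 4],
    'E': [1, 2, 3, 4],
})

# Sum only the numeric columns for each key in 'A'.
df_1 = df.groupby('A')[['D', 'E']].sum()

# drop_duplicates keeps the first row seen for each value of 'A';
# indexing by 'A' lets pandas align the values with df_1's index.
first_b = df.drop_duplicates('A').set_index('A')['B']

# Insert the aligned Series as the first column.
df_1.insert(0, 'B', first_b)
print(df_1)
```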
Grouping only on column 'A' gives:
df.groupby(['A']).sum(numeric_only=True)
        C     D
A
bar  1.26  0.88
foo  0.92 -4.19
Grouping on column 'A' and 'B' gives:
df.groupby(['A','B']).sum()
              C     D
A   B
bar one    1.38 -0.73
    three  0.26  0.80
    two   -0.38  0.81
foo one    1.96 -2.72
    three -0.42 -0.18
    two   -0.62 -1.29
If you want only the rows where column 'B' is 'one', you can do:
d = df.groupby(['A','B'], as_index=False).sum()
d[d.B=='one'].set_index('A')
       B     C     D
A
bar  one  1.38 -0.73
foo  one  1.96 -2.72
I'm not sure I understand, but is this what you want to do? Note: I increased the output precision just to get the same numbers shown in the post.
d = df.groupby('A').sum()
d['B'] = 'one'
d.sort_index(axis=1)
       B         C         D
A
bar  one  1.259069  0.876959
foo  one  0.921510 -4.193397
If you instead want to use the first value of column 'B' in sorted order, you can use:
d['B'] = df.B.sort_values().iloc[0]
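One detail worth checking when taking the "first sorted value": on a Series with the default integer index, [0] looks up the label 0, not the first position after sorting. A small sketch with made-up values (the Series s is hypothetical and not taken from the post) shows the difference:

```python
import pandas as pd

# Hypothetical values with the default integer labels 0, 1, 2.
s = pd.Series(['c', 'a', 'b'])

sorted_s = s.sort_values()      # order is now a, b, c with labels 1, 2, 0

by_label = sorted_s[0]          # label 0: the value from the original first row
by_position = sorted_s.iloc[0]  # position 0: the smallest value after sorting

print(by_label, by_position)
```

In the data shown in this answer the first row of column 'B' already holds the smallest value, so both lookups happen to agree there.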
Here I replaced 'one', 'two', 'three' with 'a', 'b', 'c' to check whether this is what you are trying to do, and used the insert() method as suggested in another post:
df
     A  B         C         D
0  foo  a  0.638362 -0.931817
1  bar  a  1.380706 -0.733307
2  foo  b -0.324514  0.203515
3  bar  c  0.258534  0.803298
4  foo  b -0.299485 -1.495979
5  bar  b -0.380171  0.806968
6  foo  a  1.324810 -1.792996
7  foo  c -0.417663 -0.176120
d = df.groupby('A').sum(numeric_only=True)
d.insert(0, 'B', df.B.sort_values().iloc[0])
d
     B         C         D
A
bar  a  1.259069  0.876959
foo  a  0.921510 -4.193397