I have a dataframe as follows:
import pandas
import numpy
df = pandas.DataFrame(data={'s1': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=20),
                            's2': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=20),
                            'val': numpy.random.randint(low=-1, high=3, size=20)})
I want to generate two result columns that provide a cumulative sum of a value (val) based on the categories in 's1' and/or 's2'.
A category ('A', 'B', 'C', etc.) can appear in either s1 or s2. The first time a category appears in either column, its result value starts at zero; each subsequent time it appears, its result value is the sum of its previous vals.
Dataframe example could look as follows:
s1 s2 val ans1 ans2
0 E B 1 0.0 0.0
1 E C 1 1.0 0.0
2 E A 2 2.0 0.0
3 B A 0 1.0 2.0
4 E B 1 4.0 1.0
5 B C 1 2.0 1.0
I can generate the correct answer columns (ans1 and ans2, corresponding to the s1 and s2 columns) as follows:
temp = {}
df['ans1'] = numpy.nan
df['ans2'] = numpy.nan

for idx, row in df.iterrows():
    if row['s1'] in temp:
        df.loc[idx, 'ans1'] = temp[row['s1']]
        temp[row['s1']] = temp[row['s1']] + row['val']
    else:
        temp[row['s1']] = row['val']
        df.loc[idx, 'ans1'] = 0
    if row['s2'] in temp:
        df.loc[idx, 'ans2'] = temp[row['s2']]
        temp[row['s2']] = temp[row['s2']] + row['val']
    else:
        temp[row['s2']] = row['val']
        df.loc[idx, 'ans2'] = 0
Using 'temp' as a dictionary to hold the running total of each category (A-E), I can get the two answer columns. What I can't do is find a solution that avoids iterating over each row of the dataframe. I don't have an issue in the single-column case with only s1, where .groupby().cumsum().shift(1) gives the correct values in the correct rows, but I cannot find a solution when there are two sets, s1 and s2 (or more, as I have multiple sensors to track). Is there a more general, vectorised solution that will work?
What you want is a shifted cumulative sum after flattening the input dataset. Use melt, then groupby.transform with shift + cumsum, then restore the original shape with pivot:
df[['ans1', 'ans2']] = (df
    .melt('val', ['s1', 's2'], ignore_index=False).sort_index(kind='stable')
    .assign(S=lambda x: x.groupby('value')['val']
                         .transform(lambda s: s.shift(fill_value=0).cumsum()))
    .pivot(columns='variable', values='S')
)
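To see what each step contributes, here is a minimal trace on a tiny, hypothetical three-row frame (the example data is mine, chosen only to make the intermediates small):

```python
import pandas as pd

# hypothetical three-row frame to trace the intermediate steps
small = pd.DataFrame({'s1': ['E', 'E', 'B'],
                      's2': ['B', 'C', 'E'],
                      'val': [1, 1, 2]})

# step 1: flatten -- one melted row per (original row, column) pair, keeping
# the original index so each row's s1 entry comes right before its s2 entry
melted = small.melt('val', ['s1', 's2'], ignore_index=False).sort_index(kind='stable')

# step 2: running total per category, shifted so each occurrence
# sees only the vals of *previous* occurrences of that category
melted['S'] = melted.groupby('value')['val'].transform(
    lambda s: s.shift(fill_value=0).cumsum())

# step 3: restore the original wide shape
wide = melted.pivot(columns='variable', values='S')
print(wide)
```

Here 'E' appears three times in the flattened order (row 0 s1, row 1 s1, row 2 s2), so its running totals are 0, 1, 2; 'B' appears twice (row 0 s2, row 2 s1), giving 0, 1.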
NB: the operation is applied in the lexicographic order of the column names (here s1 before s2), not in the original order of the columns. If you need a custom order, use an ordered categorical:
order = ['s1', 's2']

df[['ans1', 'ans2']] = (df
    .melt('val', ['s1', 's2'], ignore_index=False)
    .assign(variable=lambda x: pandas.Categorical(x['variable'],
                                                  categories=order, ordered=True))
    .sort_values(by='variable', kind='stable').sort_index(kind='stable')
    .assign(S=lambda x: x.groupby('value')['val']
                         .transform(lambda s: s.shift(fill_value=0).cumsum()))
    .pivot(columns='variable', values='S')
)
Output:
s1 s2 val ans1 ans2
0 E A 2 0 0
1 A E -1 2 2
2 D C 2 0 0
3 D B 0 2 0
4 D A 1 2 1
5 B B 2 0 2
6 D B 2 3 4
7 C A -1 2 2
8 E B 1 1 6
9 A E 2 1 2
Used input:
numpy.random.seed(0)
N = 10
df = pandas.DataFrame(data={'s1': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=N),
                            's2': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=N),
                            'val': numpy.random.randint(low=-1, high=3, size=N)})
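Since you mention tracking more than two sensor columns, the same recipe generalizes to any number of them. A sketch (the helper name `running_totals` and its signature are mine, not part of pandas):

```python
import pandas as pd


def running_totals(df, cols, val='val'):
    """Shifted cumulative sum of `val` per category, across any number of
    category columns, processed per row in the order given by `cols`."""
    m = (df
         .melt(val, list(cols), ignore_index=False)
         # ordered categorical so the columns are processed in the order of
         # `cols` within each row, not in lexicographic order
         .assign(variable=lambda x: pd.Categorical(x['variable'],
                                                   categories=cols, ordered=True))
         .sort_values(by='variable', kind='stable').sort_index(kind='stable'))
    # running total per category, shifted so each row sees only previous vals
    m['S'] = (m.groupby('value')[val]
               .transform(lambda s: s.shift(fill_value=0).cumsum()))
    # restore the original wide shape: one answer column per category column
    return m.pivot(columns='variable', values='S')
```

Usage would be e.g. `df[['ans1', 'ans2', 'ans3']] = running_totals(df, ['s1', 's2', 's3'])`, matching the dict-based loop from the question extended to three columns.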