I have a dataframe as follows:
import pandas
import numpy
df = pandas.DataFrame(data={'s1': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=20),
                            's2': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=20),
                            'val': numpy.random.randint(low=-1, high=3, size=20)})
I want to generate two result columns that provide a cumulative sum of a value (val) based on the categories in 's1' and/or 's2'.
A category ('A', 'B', 'C', etc.) can appear in either s1 or s2. The first time a category appears in either column, its result value starts at zero; each subsequent time it appears, its result value is the sum of its previous vals.
Dataframe example could look as follows:
s1 s2 val ans1 ans2
0 E B 1 0.0 0.0
1 E C 1 1.0 0.0
2 E A 2 2.0 0.0
3 B A 0 1.0 2.0
4 E B 1 4.0 1.0
5 B C 1 2.0 1.0
I can generate the correct answer columns (ans1 and ans2, corresponding to the s1 and s2 columns) as follows:
temp = {}
df['ans1'] = numpy.nan
df['ans2'] = numpy.nan

for idx, row in df.iterrows():
    if row['s1'] in temp:
        df.loc[idx, 'ans1'] = temp[row['s1']]
        temp[row['s1']] = temp[row['s1']] + row['val']
    else:
        temp[row['s1']] = row['val']
        df.loc[idx, 'ans1'] = 0
    if row['s2'] in temp:
        df.loc[idx, 'ans2'] = temp[row['s2']]
        temp[row['s2']] = temp[row['s2']] + row['val']
    else:
        temp[row['s2']] = row['val']
        df.loc[idx, 'ans2'] = 0
Using 'temp' as a dictionary to hold the running total of each category (A-E), I can get the two answer columns. What I can't do is find a solution that avoids iterating over each row of the dataframe. I don't have an issue in the single-column case with only s1, where .groupby().cumsum().shift(1) gives the correct values in the correct rows, but I cannot find a solution when there are two sets, s1 and s2 (or more, as I have multiple sensors to track). Is there a more general, vectorised solution that will work?
What you want is a shifted cumulative sum after flattening the input dataset. Use melt, then groupby.transform with shift + cumsum, then restore the original shape with pivot:
df[['ans1', 'ans2']] = (df
    .melt('val', ['s1', 's2'], ignore_index=False).sort_index(kind='stable')
    .assign(S=lambda x: x.groupby('value')['val']
                         .transform(lambda s: s.shift(fill_value=0).cumsum()))
    .pivot(columns='variable', values='S')
)
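To see what each step contributes, here is a minimal trace on a tiny, hypothetical three-row frame (the example data is mine, chosen only to make the intermediates small):

```python
import pandas as pd

# hypothetical three-row frame to trace the intermediate steps
small = pd.DataFrame({'s1': ['E', 'E', 'B'],
                      's2': ['B', 'C', 'E'],
                      'val': [1, 1, 2]})

# step 1: flatten -- one melted row per (original row, column) pair, keeping
# the original index so each row's s1 entry comes right before its s2 entry
melted = small.melt('val', ['s1', 's2'], ignore_index=False).sort_index(kind='stable')

# step 2: running total per category, shifted so each occurrence
# sees only the vals of *previous* occurrences of that category
melted['S'] = melted.groupby('value')['val'].transform(
    lambda s: s.shift(fill_value=0).cumsum())

# step 3: restore the original wide shape
wide = melted.pivot(columns='variable', values='S')
print(wide)
```

Here 'E' appears three times in the flattened order (row 0 s1, row 1 s1, row 2 s2), so its running totals are 0, 1, 2; 'B' appears twice (row 0 s2, row 2 s1), giving 0, 1.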
NB: the operation is applied in the lexicographic order of the column names (here s1 before s2), not in the original order of the columns. If you need a custom order, use an ordered categorical:
order = ['s1', 's2']

df[['ans1', 'ans2']] = (df
    .melt('val', ['s1', 's2'], ignore_index=False)
    .assign(variable=lambda x: pandas.Categorical(x['variable'],
                                                  categories=order, ordered=True))
    .sort_values(by='variable', kind='stable').sort_index(kind='stable')
    .assign(S=lambda x: x.groupby('value')['val']
                         .transform(lambda s: s.shift(fill_value=0).cumsum()))
    .pivot(columns='variable', values='S')
)
Output:
s1 s2 val ans1 ans2
0 E A 2 0 0
1 A E -1 2 2
2 D C 2 0 0
3 D B 0 2 0
4 D A 1 2 1
5 B B 2 0 2
6 D B 2 3 4
7 C A -1 2 2
8 E B 1 1 6
9 A E 2 1 2
Used input:
numpy.random.seed(0)
N = 10
df = pandas.DataFrame(data={'s1': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=N),
                            's2': numpy.random.choice(['A', 'B', 'C', 'D', 'E'], size=N),
                            'val': numpy.random.randint(low=-1, high=3, size=N)})
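Since you mention tracking more than two sensor columns, the same recipe generalizes to any number of them. A sketch (the helper name `running_totals` and its signature are mine, not part of pandas):

```python
import pandas as pd


def running_totals(df, cols, val='val'):
    """Shifted cumulative sum of `val` per category, across any number of
    category columns, processed per row in the order given by `cols`."""
    m = (df
         .melt(val, list(cols), ignore_index=False)
         # ordered categorical so the columns are processed in the order of
         # `cols` within each row, not in lexicographic order
         .assign(variable=lambda x: pd.Categorical(x['variable'],
                                                   categories=cols, ordered=True))
         .sort_values(by='variable', kind='stable').sort_index(kind='stable'))
    # running total per category, shifted so each row sees only previous vals
    m['S'] = (m.groupby('value')[val]
               .transform(lambda s: s.shift(fill_value=0).cumsum()))
    # restore the original wide shape: one answer column per category column
    return m.pivot(columns='variable', values='S')
```

Usage would be e.g. `df[['ans1', 'ans2', 'ans3']] = running_totals(df, ['s1', 's2', 's3'])`, matching the dict-based loop from the question extended to three columns.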