Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sum of dynamic columns based on certain condition

Tags:

I am trying to get sum of dynamic columns based on certain condition.

cols = ['ID','A','B','C','D','E','F','G']

dataframe df has all the columns listed above.

If ID = 2, I need sum of first two columns A, B IF ID = 3, I need sum of first three columns A,B,C

df.loc['SUM'] = df.loc[df['ID'] > 0,cols[0:df['ID']]].sum(axis=1)

above line of code is giving an error that :

TypeError: slice indices must be integers or None or have an __index__ method

Note : ID can be any number but it would be always less than or equal to total number of columns

There are no nan and missing values in dataframe.

like image 715
Kshitij Agrawal Avatar asked Jun 25 '19 17:06

Kshitij Agrawal


2 Answers

You can vectorise this thanks to the power of NumPy:

df.set_index('ID', inplace=True)
df.wherenp.arange(df.shape[1]) < df.index.values[:,None]).sum(axis=1)    

Minimal Reproducible Example

import pandas as pd

pd.__version__
# '0.24.2'

np.random.seed(0) 
df = pd.DataFrame(np.random.randint(1, 10, (5, 8))) 
df.insert(0, 'ID', np.random.randint(1, 8, 5))  
df                                           

   ID  0  1  2  3  4  5  6  7
0   4  6  1  4  4  8  4  6  3
1   7  5  8  7  9  9  2  7  8
2   4  8  9  2  6  9  5  4  1
3   3  4  6  1  3  4  9  2  4
4   5  4  4  8  1  2  1  5  8

df.set_index('ID', inplace=True)
df.assign(
    SUM=df.where(np.arange(df.shape[1]) < df.index.values[:,None]).sum(axis=1))

    0  1  2  3  4  5  6  7   SUM
ID                              
4   6  1  4  4  8  4  6  3  15.0
7   5  8  7  9  9  2  7  8  47.0
4   8  9  2  6  9  5  4  1  25.0
3   4  6  1  3  4  9  2  4  11.0
5   4  4  8  1  2  1  5  8  19.0

The where function will hide values that we don't want to sum over:

df.where(np.arange(df.shape[1]) < df.index[:,None])   

    0  1  2    3    4    5    6   7
ID                                 
4   6  1  4  4.0  NaN  NaN  NaN NaN
7   5  8  7  9.0  9.0  2.0  7.0 NaN
4   8  9  2  6.0  NaN  NaN  NaN NaN
3   4  6  1  NaN  NaN  NaN  NaN NaN
5   4  4  8  1.0  2.0  NaN  NaN NaN

Thanks to @Quang Hoang for pointing out a potential bug in the solution.

like image 100
cs95 Avatar answered Nov 15 '22 05:11

cs95


Another way :

with df :

    0  1  2  3  4  5  6  7   
ID                              
4   6  1  4  4  8  4  6  3  
7   5  8  7  9  9  2  7  8  
4   8  9  2  6  9  5  4  1 
3   4  6  1  3  4  9  2  4  
5   4  4  8  1  2  1  5  8  

sums are computed by df.cumsum(axis=1).values[range(len(df)),df.index-1]

it's 5x faster than df.where(np.arange(df.shape[1]) <df.index.values[:,None]).sum(axis=1).

like image 25
B. M. Avatar answered Nov 15 '22 07:11

B. M.