Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Take the sum of every N rows in a pandas series

Tags:

python

pandas

Suppose

s = pd.Series(range(50))

0      0
1      1
2      2
3      3
...
48     48
49     49

How can I get the new series that consists of sum of every n rows?

Expected result is like below, when n = 5;

0      10
1      35
2      60
3      85
...
8      210
9      235

If using loc or iloc and loop by python, of course it can be accomplished, however I believe it could be done simply in Pandas way.

Also, this is a very simplified example, I don't expect the explanation of the sequences:). Actual data series I'm trying has the time index and the the number of events occurred in every second as the values.

like image 799
HirofumiTamori Avatar asked Nov 11 '17 15:11

HirofumiTamori


People also ask

How do you get the sum of rows in pandas?

To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.

How do you sum all values in a pandas series?

sum() method is used to get the sum of the values for the requested axis. level[int or level name, default None] : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.

How do you get the last n rows of a DataFrame with row sum 100?

Use pandas. DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end).


2 Answers

GroupBy.sum

N = 5
s.groupby(s.index // N).sum()
     
0     10
1     35
2     60
3     85
4    110
5    135
6    160
7    185
8    210
9    235
dtype: int64

Chunk the index into groups of 5 and group accordingly.


numpy.reshape + sum

If the size is a multiple of N (or 5), you can reshape and add:

s.values.reshape(-1, N).sum(1)
# array([ 10,  35,  60,  85, 110, 135, 160, 185, 210, 235])

numpy.add.at

b = np.zeros(len(s) // N)
np.add.at(b, s.index // N, s.values)
b
# array([ 10.,  35.,  60.,  85., 110., 135., 160., 185., 210., 235.])
like image 64
cs95 Avatar answered Oct 20 '22 08:10

cs95


The most efficient solution I can think of is f1() in my example below. It is orders of magnitude faster than using the groupby in the other answer. Note that f1() doesn't work when the length of the array is not an exact multiple, e.g. if you want to sum a 3-item array every 2 items. For those cases, you can use f1v2():

f1v2( [0,1,2,3,4] ,2 ) = [1,5,4]

My code is below. I have used timeit for the comparisons:

import timeit
import numpy as np
import pandas as pd


def f1(a,x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
    return a.reshape((int(a.shape[0]/x), int(x) )).sum(1)

def f2(myarray, x):
  return [sum(myarray[n: n+x]) for n in range(0, len(myarray), x)]

def f3(myarray, x):
    s = pd.Series(myarray)
    out = s.groupby(s.index // 2).sum()
    return out

def f1v2(a,x):
    if isinstance(a, pd.Series):
        a = a.to_numpy()
        
    mod = a.shape[0] % x
    if  mod != 0:
        excl = a[-mod:]
        keep = a[: len(a) - mod]
        out = keep.reshape((int(keep.shape[0]/x), int(x) )).sum(1)
        out = np.hstack( (excl.sum() , out) ) 
    else:       
        out = a.reshape((int(a.shape[0]/x), int(x) )).sum(1)
    
    return out
    

a = np.arange(0,1e6)

out1 = f1(a,2)
out2 = f2(a,2)
out3 = f2(a,2)

t1 = timeit.Timer( "f1(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t1v2 = timeit.Timer( "f1v2(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t2 = timeit.Timer( "f2(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t3 = timeit.Timer( "f3(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)

resdf = pd.DataFrame(index = ['min time'])
resdf['f1'] = [min(t1)]
resdf['f1v2'] = [min(t1v2)]
resdf['f2'] = [min(t2)]
resdf['f3'] = [min(t3)]
#the docs explain why it makes more sense to take the min than the avg
resdf = resdf.transpose()
resdf['% difference vs fastes'] = (resdf /resdf.min() - 1) * 100

b = np.array( [0,1,2,4,5,6,7] )

out1v2 = f1v2(b,2)
like image 32
Pythonista anonymous Avatar answered Oct 20 '22 07:10

Pythonista anonymous