Suppose
s = pd.Series(range(50))
0 0
1 1
2 2
3 3
...
48 48
49 49
How can I get the new series that consists of sum of every n rows?
Expected result is like below, when n = 5;
0 10
1 35
2 60
3 85
...
8 210
9 235
If using loc or iloc and loop by python, of course it can be accomplished, however I believe it could be done simply in Pandas way.
Also, this is a very simplified example, I don't expect the explanation of the sequences:). Actual data series I'm trying has the time index and the the number of events occurred in every second as the values.
To sum all the rows of a DataFrame, use the sum() function and set the axis value as 1. The value axis 1 will add the row values.
sum() method is used to get the sum of the values for the requested axis. level[int or level name, default None] : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.
Use pandas. DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end).
GroupBy.sum
N = 5
s.groupby(s.index // N).sum()
0 10
1 35
2 60
3 85
4 110
5 135
6 160
7 185
8 210
9 235
dtype: int64
Chunk the index into groups of 5 and group accordingly.
numpy.reshape
+ sum
If the size is a multiple of N (or 5), you can reshape and add:
s.values.reshape(-1, N).sum(1)
# array([ 10, 35, 60, 85, 110, 135, 160, 185, 210, 235])
numpy.add.at
b = np.zeros(len(s) // N)
np.add.at(b, s.index // N, s.values)
b
# array([ 10., 35., 60., 85., 110., 135., 160., 185., 210., 235.])
The most efficient solution I can think of is f1()
in my example below. It is orders of magnitude faster than using the groupby
in the other answer.
Note that f1() doesn't work when the length of the array is not an exact multiple, e.g. if you want to sum a 3-item array every 2 items.
For those cases, you can use f1v2():
f1v2( [0,1,2,3,4] ,2 ) = [1,5,4]
My code is below. I have used timeit
for the comparisons:
import timeit
import numpy as np
import pandas as pd
def f1(a,x):
if isinstance(a, pd.Series):
a = a.to_numpy()
return a.reshape((int(a.shape[0]/x), int(x) )).sum(1)
def f2(myarray, x):
return [sum(myarray[n: n+x]) for n in range(0, len(myarray), x)]
def f3(myarray, x):
s = pd.Series(myarray)
out = s.groupby(s.index // 2).sum()
return out
def f1v2(a,x):
if isinstance(a, pd.Series):
a = a.to_numpy()
mod = a.shape[0] % x
if mod != 0:
excl = a[-mod:]
keep = a[: len(a) - mod]
out = keep.reshape((int(keep.shape[0]/x), int(x) )).sum(1)
out = np.hstack( (excl.sum() , out) )
else:
out = a.reshape((int(a.shape[0]/x), int(x) )).sum(1)
return out
a = np.arange(0,1e6)
out1 = f1(a,2)
out2 = f2(a,2)
out3 = f2(a,2)
t1 = timeit.Timer( "f1(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t1v2 = timeit.Timer( "f1v2(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t2 = timeit.Timer( "f2(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
t3 = timeit.Timer( "f3(a,2)" , globals = globals() ).repeat(repeat = 5, number = 2)
resdf = pd.DataFrame(index = ['min time'])
resdf['f1'] = [min(t1)]
resdf['f1v2'] = [min(t1v2)]
resdf['f2'] = [min(t2)]
resdf['f3'] = [min(t3)]
#the docs explain why it makes more sense to take the min than the avg
resdf = resdf.transpose()
resdf['% difference vs fastes'] = (resdf /resdf.min() - 1) * 100
b = np.array( [0,1,2,4,5,6,7] )
out1v2 = f1v2(b,2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With