Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

More Pythonic/Pandaic approach to looping over a pandas Series

This is most likely something very basic, but I can't figure it out. Suppose that I have a Series like this:

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

How can I do operations on sub-series of this Series without having to revert to using a for-loop?

Suppose, for example, that I want to turn it into a new Series that contains four elements. The first element in this new Series is the sum of the first three elements in the original Series (1, 1, 1), the second the sum of the second three (2, 2, 2), etc.:

s2 = pd.Series([3, 6, 9, 12])

How can I do this?

like image 212
rdv Avatar asked Jan 05 '17 12:01

rdv


4 Answers

You could also use np.add.reduceat by specifying the slices to be reduced at every 3rd element and compute their running sum:

>>> pd.Series(np.add.reduceat(s1.values, np.arange(0, s1.shape[0], 3)))
0     3
1     6
2     9
3    12
dtype: int64

Timing Constraints:

arr = np.repeat(np.arange(10**5), 3)
s = pd.Series(arr)
s.shape
(300000,)

# @IanS soln
%timeit s.rolling(3).sum()[2::3]        
100 loops, best of 3: 15.6 ms per loop

# @Divakar soln
%timeit pd.Series(np.bincount(np.arange(s.size)//3, s))  
100 loops, best of 3: 5.44 ms per loop

# @Nikolas Rieble soln
%timeit pd.Series(np.sum(np.array(s).reshape(len(s)/3,3), axis = 1))  
100 loops, best of 3: 2.17 ms per loop

# @Nikolas Rieble modified soln
%timeit pd.Series(np.sum(np.array(s).reshape(-1, 3), axis=1))  
100 loops, best of 3: 2.15 ms per loop

# @Divakar modified soln
%timeit pd.Series(s.values.reshape(-1,3).sum(1))
1000 loops, best of 3: 1.62 ms per loop

# Proposed solution in post
%timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
1000 loops, best of 3: 1.45 ms per loop
like image 157
Nickil Maveli Avatar answered Nov 19 '22 15:11

Nickil Maveli


Here's a NumPy approach using np.bincount to handle generic number of elements -

pd.Series(np.bincount(np.arange(s1.size)//3, s1))

Sample run -

In [42]: s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])

In [43]: pd.Series(np.bincount(np.arange(s1.size)//3, s1))
Out[43]: 
0     3.0
1     6.0
2     9.0
3    12.0
4    14.0
dtype: float64

If we are really craving for performance and for case when the length of the series is divisible by the window length, we can get the view into the series with s1.values, then reshape and finally use np.einsum for summation, like so -

pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))

Timings with the same benchmark dataset as used in @Nickil Maveli's post -

In [140]: s = pd.Series(np.repeat(np.arange(10**5), 3))

# @Nickil Maveli's soln
In [141]: %timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
100 loops, best of 3: 2.07 ms per loop

# Using views+sum
In [142]: %timeit pd.Series(s.values.reshape(-1,3).sum(1))
100 loops, best of 3: 2.03 ms per loop

# Using views+einsum
In [143]: %timeit pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))
1000 loops, best of 3: 1.04 ms per loop
like image 26
Divakar Avatar answered Nov 19 '22 13:11

Divakar


You could reshape the series s1 using numpy and then sum over the rows such as:

np.sum(np.array(s1).reshape(len(s1)/3,3), axis = 1)

which results in

array([ 3,  6,  9, 12], dtype=int64)

EDIT: as MSeifert mentioned in his comment, you can also let numpy compute the length such as:

np.sum(np.array(s1).reshape(-1, 3), axis=1)
like image 5
Nikolas Rieble Avatar answered Nov 19 '22 13:11

Nikolas Rieble


This computes the rolling sum:

s1.rolling(3).sum()

You simply need to select every third element:

s1.rolling(3).sum()[2::3]

Output:

2      3.0
5      6.0
8      9.0
11    12.0
like image 3
IanS Avatar answered Nov 19 '22 15:11

IanS