 

Efficiently counting runs of non-zero values

Tags:

python

numpy

I am working with time series of rainfall volumes, for which I want to compute the length and total volume of each individual rainfall event, where an "event" is a run of consecutive non-zero timesteps. I have multiple series of ~60k timesteps each, and my current approach is quite slow.

Currently I have the following:

import numpy as np

def count_events(timeseries):
    start = 0
    end = 0
    lengths = []
    volumes = []
    # pad a 0 at each edge so events touching the edges are closed properly
    for i, val in enumerate(np.pad(timeseries, pad_width=1, mode='constant')):
        if val > 0 and start == 0:
            start = i
        if val == 0 and start > 0:
            end = i
            # i indexes the padded array, so shift back by one
            # to slice the original series
            volumes.append(np.sum(timeseries[start-1:end-1]))
            lengths.append(end - start)
            start = 0

    return np.asarray(lengths), np.asarray(volumes)

Expected output:

testrain = np.array([1, 0, 1, 0, 2, 2, 8, 2, 0, 0, 0.1, 0, 0, 1])
lengths, volumes = count_events(testrain)
print(lengths)
[1 1 4 1 1]
print(volumes)
[ 1.   1.  14.   0.1  1. ]

I imagine there's a far better way to do this, leveraging numpy's efficiency, but nothing comes to mind...

EDIT:

Comparing the different solutions:

testrain = np.random.normal(10, 5, 60000)
testrain[testrain < 0] = 0

My solution:

%timeit count_events(testrain)
#10 loops, best of 3: 129 ms per loop

@dawg's:

%timeit dawg(testrain) # using itertools
#10 loops, best of 3: 113 ms per loop
%timeit dawg2(testrain) # using pure numpy
#10 loops, best of 3: 156 ms per loop

@DSM's:

%timeit DSM(testrain)
#10 loops, best of 3: 28.4 ms per loop

@DanielLenz's:

%timeit DanielLenz(testrain)
#10 loops, best of 3: 316 ms per loop
asked Dec 24 '22 by areuexperienced

2 Answers

While you can do this in pure numpy, you're basically applying numpy to a pandas problem. Your volume is the result of a groupby operation, which you can fake in numpy but is native to pandas.

For example:

>>> import pandas as pd
>>> tr = pd.Series(testrain)
>>> nonzero = (tr != 0)
>>> group_ids = (nonzero & (nonzero != nonzero.shift())).cumsum()
>>> events = tr[nonzero].groupby(group_ids).agg([sum, len])
>>> events
    sum  len
1   1.0    1
2   1.0    1
3  14.0    4
4   0.1    1
5   1.0    1
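
For completeness, the pure-numpy route alluded to above can use the same run-boundary idea: diff a zero-padded boolean mask to find where each event starts and ends, then take slice sums from a cumulative sum. The following is a sketch added for illustration rather than part of the original answer, and count_events_np is a hypothetical name:

import numpy as np

def count_events_np(timeseries):
    # wet/dry mask, padded with False so edge events get both boundaries
    mask = np.concatenate(([False], timeseries > 0, [False]))
    # the diff is +1 at each event start and -1 one step past each end
    edges = np.flatnonzero(np.diff(mask.astype(np.int8)))
    starts, ends = edges[::2], edges[1::2]
    # prepending 0 to the cumulative sum lets csum[j] - csum[i]
    # recover sum(timeseries[i:j]) for every event at once
    csum = np.concatenate(([0.0], np.cumsum(timeseries)))
    return ends - starts, csum[ends] - csum[starts]

On the testrain example this should give the same lengths [1 1 4 1 1] and volumes [ 1.   1.  14.   0.1  1. ] as the pandas version.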
answered Jan 14 '23 by DSM

Here is a groupby solution:

import numpy as np
from itertools import groupby

testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])

lengths = []
volumes = []
for k, l in groupby(testrain, key=lambda v: v > 0):
    if k:
        li = list(l)
        lengths.append(len(li))
        volumes.append(sum(li))

print(lengths)
print(volumes)

Prints

[1, 1, 4, 1, 1]
[1.0, 1.0, 14.0, 0.1, 1.0]

If you want something purely in numpy:

def find_runs(arr):
    # split at each zero, then drop the zeros from every chunk
    subs = np.split(arr, np.where(arr == 0.)[0])
    arrs = [np.delete(sub, np.where(sub == 0.)) for sub in subs]
    return [(len(e), sum(e)) for e in arrs if len(e)]

>>> find_runs(testrain)
[(1, 1.0), (1, 1.0), (4, 14.0), (1, 0.1), (1, 1.0)]
>>> length, volume = zip(*find_runs(testrain))
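
Since zip returns tuples, one extra step converts them to arrays like those returned by count_events in the question (a small usage sketch, not in the original answer):

>>> lengths, volumes = map(np.asarray, zip(*find_runs(testrain)))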
answered Jan 14 '23 by dawg