 

Find groups of values that are !=0 in a list

Tags:

python

numpy

I'm looking for an easy way to find "plateaus" or groups in Python lists. As input, I have something like this:

mydata = [0.0, 0.0, 0.0, 0.0, 0.0, 0.143, 0.0, 0.22, 0.135, 0.44, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.33, 0.65, 0.22, 0.0, 0.0, 0.0, 0.0, 0.0]

I want to extract the middle position of every "group". A group is defined in this case as a run of data that is != 0 and, for example, at least 3 positions long. Enclaved single zeros (like the one at position 6) should be ignored.

Basically, I want to get the following output:

myoutput = [8, 20]

For my use case, it is not really important to get very precise output data. [10,21] would still be fine.

To summarize: the first group is [0.143, 0.0, 0.22, 0.135, 0.44, 0.1] and the second group is [0.33, 0.65, 0.22]. I then want the position of the middle element of each group (or the element to its left or right, if there is no true middle value). So in the output, 8 would be the middle of the first group and 20 the middle of the second group.

I've already tried some approaches. But they are not as stable as I wanted them to be (for example: more enclaved zeros can cause problems). So before investing more time in this idea, I wanted to ask if there is a better way to implement this feature. I even think that this could be a generic problem. Is there maybe already standard code that solves it?

There are other questions that describe roughly the same problem, but I also need to "smooth" the data before processing.

  1. smooth the data - get rid of enclaved zeros

     import numpy as np
     def smooth(y, box_pts):
         box = np.ones(box_pts)/box_pts
         y_smooth = np.convolve(y, box, mode='same')
         return y_smooth
    
     y_smooth = smooth(mydata, 20)
    
  2. find start points in the smoothed list (if a value is !=0 and the value before it was 0, it is a start point). When an endpoint is found, use the last start point that was found and the current endpoint to get the middle position of the group, and write it to a deque.

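     from collections import deque

     laststart = 0
     lastend = 0
     myoutput = deque()

     for i in range(1, len(y_smooth) - 1):
         # detect start: current value is non-zero, previous value is zero
         if y_smooth[i] != 0 and y_smooth[i-1] == 0:
             laststart = i
         # detect end: current value is non-zero, next value is zero,
         # and the group is long enough
         elif y_smooth[i] != 0 and y_smooth[i+1] == 0 and laststart + 2 < i:
             lastend = i
             # integer division keeps the result usable as a list position
             myoutput.appendleft(laststart + (lastend - laststart) // 2)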
    

EDIT: To keep things simple, I gave only a short example of my input data at the beginning. With this short list, the smoothing output is actually problematic: the whole list gets smoothed and no zeros are left. Linked plots: actual input data; actual input data after smoothing
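
The effect described in the EDIT can be reproduced with a small sketch like the one below. It reuses the `smooth` function from step 1. The comparison width of 3 is only an assumption for illustration and is not from the original post.

```python
import numpy as np

mydata = [0.0, 0.0, 0.0, 0.0, 0.0, 0.143, 0.0, 0.22, 0.135, 0.44, 0.1,
          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.33, 0.65, 0.22,
          0.0, 0.0, 0.0, 0.0, 0.0]

def smooth(y, box_pts):
    # moving average: every output value is the mean over a window of box_pts inputs
    box = np.ones(box_pts) / box_pts
    return np.convolve(y, box, mode="same")

wide = smooth(mydata, 20)    # window almost as long as the list itself
narrow = smooth(mydata, 3)   # hypothetical narrower window, for comparison only

# With the wide window, every window overlaps at least one non-zero value,
# so no zeros remain and the start/end detection has nothing to find.
print("zeros left with box_pts=20:", np.count_nonzero(wide == 0))
print("zeros left with box_pts=3: ", np.count_nonzero(narrow == 0))
```

A window that is short relative to the gaps between groups still fills in single enclaved zeros while leaving the longer zero stretches intact.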

chrisg asked Mar 06 '23 22:03


1 Answer

A fairly simple way to find groups as you described is to convert the data to a mask array with 1 for values inside a group and 0 for values outside, and then compute the difference between consecutive values. This gives you 1 at the start of each group and -1 at the first position after its end.

Here's an example:

import numpy as np

mydata = [0.0, 0.0, 0.0, 0.0, 0.0, 0.143, 0.0, 0.22, 0.135, 0.44, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.33, 0.65, 0.22, 0.0, 0.0, 0.0, 0.0, 0.0]
arr = np.array(mydata)

mask = (arr != 0).astype(int)  # 1 for every non-zero value, 0 otherwise (np.int was removed in NumPy 1.24)
padded_mask = np.pad(mask, (1,), "constant")  # add a zero at the start and at the end to handle edge cases
edge_mask = padded_mask[1:] - padded_mask[:-1]  # difference between each value and the one before it
# a 1 in edge_mask marks the first index of a group
# a -1 marks the first index after a group (exclusive stop)

# np.where gives us the indices of those starts and stops
starts = np.where(edge_mask == 1)[0]
stops = np.where(edge_mask == -1)[0]
print(starts, stops)

# pair up starts and stops (as plain Python ints) and drop groups that are too small
groups = [group for group in zip(starts.tolist(), stops.tolist()) if group[0] + 2 < group[1]]

for group in groups:
    # stop is exclusive, so this lands half a position to the right of the exact centre
    print("start,stop : {}  middle : {}".format(group, (group[0] + group[1]) / 2))

And the output:

[ 5  7 19] [ 6 11 22]
start,stop : (7, 11)  middle : 9.0
start,stop : (19, 22)  middle : 20.5
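
Note that the output above treats the lone value at position 5 as its own (too short) group, so the enclaved zero at position 6 still splits the first plateau. One possible way to handle that, as a rough sketch rather than a definitive solution, is to merge groups separated by only a few zeros before applying the length filter. The function name `group_middles` and the parameters `min_len` and `max_gap` are made up for this illustration.

```python
import numpy as np

def group_middles(data, min_len=3, max_gap=1):
    """Middle index of each non-zero group, ignoring zero gaps up to max_gap long.

    min_len and max_gap are illustrative parameters, not part of the question.
    """
    mask = (np.asarray(data) != 0).astype(int)
    padded = np.pad(mask, 1, mode="constant")      # zero on both sides for edge cases
    edges = np.diff(padded)                        # +1 at a start, -1 just past an end
    starts = np.where(edges == 1)[0].tolist()
    stops = np.where(edges == -1)[0].tolist()      # exclusive stop indices

    merged = []
    for start, stop in zip(starts, stops):
        # number of zeros between the previous group and this one
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = stop                   # extend the previous group
        else:
            merged.append([start, stop])

    # keep groups that are long enough; for an even length this picks
    # the element just right of the centre
    return [(start + stop) // 2 for start, stop in merged if stop - start >= min_len]

mydata = [0.0, 0.0, 0.0, 0.0, 0.0, 0.143, 0.0, 0.22, 0.135, 0.44, 0.1,
          0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.33, 0.65, 0.22,
          0.0, 0.0, 0.0, 0.0, 0.0]

print(group_middles(mydata))  # [8, 20]
```

Raising `max_gap` tolerates longer runs of enclaved zeros, which may remove the need for the smoothing step, at the risk of joining groups that should stay separate.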
jadsq answered Apr 28 '23 11:04