Arrange consecutive zeros in a pandas Series by a specific rule

I have a pandas Series like the following:

    1   1
    2   2
    3   3 
    4   4
    5   0
    6   0
    7   1
    8   2
    9   3
   10   0
   11   0
   12   0
   13   0
   14   1
   15   2

I need to transform it into the following format:

    1   1
    2   2
    3   3 
    4   4
    5   0
    6   0
    7   3  ---> 4-2+1 (previous non zero value - amount of previous zeroes + current value)
    8   4  ---> 4-2+2 (previous non zero value - amount of previous zeroes + current value)
    9   5  ---> 4-2+3 (previous non zero value - amount of previous zeroes + current value)
   10   0
   11   0
   12   0
   13   0
   14   2 ---> 5-4+1 (previous non zero value - amount of previous zeroes + current value)
   15   3 ---> 5-4+2 (previous non zero value - amount of previous zeroes + current value)

I am stuck on this. So far I have only been able to produce a Series that counts consecutive zeros:

zero = ser.eq(0).groupby(ser.ne(0).cumsum()).cumsum()

which gave me:

    1   0
    2   0
    3   0 
    4   0
    5   1
    6   2
    7   0
    8   0
    9   0
   10   1
   11   2
   12   3
   13   4
   14   0
   15   0

If anyone is willing to help, here is a minimal snippet that creates the series above:

import pandas as pd

d = {'1': 1, '2': 2, '3': 3, '4':4, '5':0, '6':0, '7':1, '8':2, '9':3, '10':0, '11':0, '12':0, '13':0, '14':1, '15':2}
ser = pd.Series(data=d)

prem asked Sep 18 '25 22:09


2 Answers

In pandas this can only be done in a rather convoluted way, IMHO. Here is a straightforward implementation using Numba instead (which should also be faster than any pure-pandas solution):

import numba as nb
import numpy as np

@nb.njit(['(int32[:],)', '(int64[:],)'])
def compute(arr):
    res = np.empty(arr.size, dtype=arr.dtype)
    z_count = 0
    last_nnz_val = 0
    nnz_count = 0
    for i in range(arr.size):
        if arr[i] == 0:
            if i > 0 and arr[i-1] != 0:   # If there is a switch from nnz to zero
                last_nnz_val += nnz_count - z_count   # Save the last nnz result
                z_count = 0
            z_count += 1
            res[i] = 0
        else:
            if i > 0 and arr[i-1] == 0:   # If there is a switch from zero to nnz
                nnz_count = 0
            nnz_count += 1
            res[i] = last_nnz_val - z_count + nnz_count
    return res

# [...]
compute(ser.to_numpy())

Note that the result is a plain NumPy array, but you can easily build a Series or DataFrame from it.
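
For instance, the array can be wrapped in a Series that reuses the original index. This is only a minimal sketch: `res` is a hard-coded stand-in for the array returned by `compute(ser.to_numpy())` (so the snippet runs without Numba), and the names `out` and `adjusted` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample series as in the question.
d = {'1': 1, '2': 2, '3': 3, '4': 4, '5': 0, '6': 0, '7': 1, '8': 2, '9': 3,
     '10': 0, '11': 0, '12': 0, '13': 0, '14': 1, '15': 2}
ser = pd.Series(data=d)

# Stand-in for `compute(ser.to_numpy())`; the values are copied from the
# expected output given in the question.
res = np.array([1, 2, 3, 4, 0, 0, 3, 4, 5, 0, 0, 0, 0, 2, 3])

# Reattach the original index so the result lines up with `ser`.
out = pd.Series(res, index=ser.index, name='adjusted')
print(out.to_frame())
```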


Benchmark

Here are performance results on my machine (i5-9600KF CPU) on the tiny example dataset:

MichaelCao's answer:    886 µs
This answer:              2 µs   <-----

On a 1000x larger dataset (repeated), I get:

MichaelCao's answer:   1240 µs
This answer:             20 µs   <-----

This is much faster than the other answer. I also get different outputs from the two, so one of the implementations must be wrong.
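
One way to narrow down such a discrepancy is to run both approaches on the question's sample and on a repeated copy of it. The sketch below is not part of either answer: `loop_version` is a plain-Python port of `compute` (an assumption made so it runs without Numba), `pandas_version` wraps the other answer's expressions in a function, and doubling the sample is only a guess at how a "repeated" dataset might be built.

```python
import numpy as np
import pandas as pd

def loop_version(arr):
    # Plain-Python port of the Numba `compute` loop, same logic.
    res = np.empty(arr.size, dtype=arr.dtype)
    z_count = 0
    last_nnz_val = 0
    nnz_count = 0
    for i in range(arr.size):
        if arr[i] == 0:
            if i > 0 and arr[i - 1] != 0:   # switch from non-zero to zero
                last_nnz_val += nnz_count - z_count
                z_count = 0
            z_count += 1
            res[i] = 0
        else:
            if i > 0 and arr[i - 1] == 0:   # switch from zero to non-zero
                nnz_count = 0
            nnz_count += 1
            res[i] = last_nnz_val - z_count + nnz_count
    return res

def pandas_version(ser):
    # Vectorized approach from the other answer, wrapped in a function.
    cum_zero = (ser == 0).cumsum() * (ser != 0)
    right_before_zero = ser * (ser.shift(-1) == 0)
    previous = right_before_zero.shift(1).cumsum().fillna(0) * (ser != 0)
    return (previous - cum_zero + ser).astype(ser.dtype).to_numpy()

sample = [1, 2, 3, 4, 0, 0, 1, 2, 3, 0, 0, 0, 0, 1, 2]
expected = [1, 2, 3, 4, 0, 0, 3, 4, 5, 0, 0, 0, 0, 2, 3]

single = pd.Series(sample)
repeated = pd.Series(sample * 2)   # assumed way of "repeating" the data

a1, b1 = loop_version(single.to_numpy()), pandas_version(single)
a2, b2 = loop_version(repeated.to_numpy()), pandas_version(repeated)

print("sample, loop matches expected:  ", a1.tolist() == expected)
print("sample, pandas matches expected:", b1.tolist() == expected)
print("repeated, positions that differ:", np.flatnonzero(a2 != b2))
```

On the 15-element sample both versions reproduce the expected output. On the doubled input they diverge where a non-zero run does not restart after zeros (the trailing `1, 2` followed directly by `1, 2, 3, 4`): the loop adds the position within the run, while the pandas version adds the value itself. The question does not say which behavior is intended in that case.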

Jérôme Richard answered Sep 20 '25 13:09


Let's look at row 14:

It's 5 - 4 + 1, but that 5 comes from the previous calculation 4 - 2 + 3, so the full calculation is really 4 - 2 + 3 - 4 + 1. What that really involves is the cumulative sum of the non-zero values that sit right before a run of zeros and the cumulative count of zeros, i.e. (4 + 3) (non-zeros) - (2 + 4) (zeros) + 1 (current value) = 2.

With that in mind:

cum_zero = (ser == 0).cumsum()
cum_zero *= (ser != 0)

right_before_zero = ser * (ser.shift(-1) == 0)
previous = right_before_zero.shift(1).cumsum().fillna(0) * (ser != 0)

ser2 = previous - cum_zero + ser
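
To see how the pieces fit together, the intermediate Series can be put side by side on the sample data from the question. This is just an illustrative sketch: the `steps` DataFrame and the cast of `ser2` to integers are additions made here (the `shift` introduces a NaN, so `ser2` comes out as floats).

```python
import pandas as pd

# Sample series from the question.
d = {'1': 1, '2': 2, '3': 3, '4': 4, '5': 0, '6': 0, '7': 1, '8': 2, '9': 3,
     '10': 0, '11': 0, '12': 0, '13': 0, '14': 1, '15': 2}
ser = pd.Series(data=d)

# Running count of zeros, kept only on non-zero rows.
cum_zero = (ser == 0).cumsum()
cum_zero *= (ser != 0)

# Value of each row that is immediately followed by a zero (0 elsewhere).
right_before_zero = ser * (ser.shift(-1) == 0)
# Running total of those values, shifted one row down, kept on non-zero rows.
previous = right_before_zero.shift(1).cumsum().fillna(0) * (ser != 0)

ser2 = previous - cum_zero + ser

# Hypothetical helper table for inspection only.
steps = pd.DataFrame({
    'ser': ser,
    'cum_zero': cum_zero,
    'right_before_zero': right_before_zero,
    'previous': previous,
    'ser2': ser2.astype(int),
})
print(steps)
```

Row 14 of the table shows `previous` = 7 and `cum_zero` = 6, which gives 7 - 6 + 1 = 2.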

Michael Cao answered Sep 20 '25 11:09