Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count the number of time intervals that meet a boolean condition within a pandas dataframe?

Tags:

python

pandas

I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.

Timestamp   Boolean_condition
1           1
2           1
3           0
4           1
5           1
6           1
7           0
8           0
9           1
10          0

How to count the total number of time intervals within the whole series that meet this condition?

The desired output should look like this:

Timestamp   Boolean_condition   Event_number
1           1                   1
2           1                   1
3           0                   NaN
4           1                   2
5           1                   2
6           1                   2
7           0                   NaN
8           0                   NaN
9           1                   3
10          0                   NaN
like image 861
sudonym Avatar asked Mar 10 '23 03:03

sudonym


2 Answers

You can create Series with cumsum of two masks and then create NaN by function Series.mask:

mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    3
8    3
9    4
Name: Boolean_condition, dtype: int32

df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
   Timestamp  Boolean_condition  Event_number
0          1                  1           1.0
1          2                  1           1.0
2          3                  0           NaN
3          4                  1           2.0
4          5                  1           2.0
5          6                  1           2.0
6          7                  0           NaN
7          8                  0           NaN
8          9                  1           3.0
9         10                  0           NaN

Timings:

#[100000 rows x 2 columns
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)

def jez1(df):
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return (df)

In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop

In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop

In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
like image 118
jezrael Avatar answered Mar 12 '23 15:03

jezrael


You could try the following:

1) Get all values of True instance (here, 1) which comprises of isone

2) Take it's corresponding set of indices and convert this to a series representation so that the new series has both it's index and values as the earlier computed indices. Perform the difference between successive rows and check if they are equal to 1. This becomes our boolean mask.

3) Compare isone with the obtained boolean mask and whenever they do not become equal, we take their cumulative sum (also known as adjacency check between elements). These help us in grouping purposes.

4) Using loc for the indices of isone, we assign the codes computed after changing the grp array to Categorical format to a new column created, Event_number.


isone = df.Bolean_condition[df.Bolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1

enter image description here


Faster approach:

Using only numpy:

1) Get it's array representation.

2) Compute the non-zero, here (1's) indices.

3) Insert NaN at the beginning of this array which would act as a starting point for us to perform difference taking successive rows into consideration.

4) Initialize a new array filled with Nan's of the same shape as that of the original array.

5) Whenever the difference between successive rows is not equal to 1, we take their cumulative sum, else they fall in the same group. These values get imputed at the indices where there were 1's before.

6) Assign these back to the new column.


def nick(df):
    b = df.Bolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df

Timings:

For a DF of 10,000 rows:

np.random.seed(42)
df1 = pd.DataFrame(dict(
        Timestamp=np.arange(10000),
        Bolean_condition=np.random.choice(np.array([0,1]), 10000, p=[0.4, 0.6]))
                  )

df1.shape
# (10000, 2)

def jez(df):
    mask0 = df.Bolean_condition.eq(0)
    mask2 = df.Bolean_condition.ne(df.Bolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return (df)

nick(df1).equals(jez(df1))
# True

%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop

%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop

For a DF containing 1 million rows:

np.random.seed(42)
df1 = pd.DataFrame(dict(
        Timestamp=np.arange(1000000),
        Bolean_condition=np.random.choice(np.array([0,1]), 1000000, p=[0.4, 0.6]))
                  )

df1.shape
# (1000000, 2)

nick(df1).equals(jez(df1))
# True

%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop

%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
like image 28
Nickil Maveli Avatar answered Mar 12 '23 16:03

Nickil Maveli