
python pandas get index boundaries from a series of Booleans




I am trying to cut videos based on some characteristics. My current strategy produces a pandas Series of booleans, one per frame, indexed by timestamp: True to keep the frame, False to drop it.

As I plan to cut videos, I need to extract boundaries from this list, so that I can tell ffmpeg the beginning and end of the parts I want to extract from the main video.

To sum up:

I have a pandas Series which looks like this:

0.577331     False
0.611298     False
0.645255     False
0.679218     False
0.716538     False
0.784453      True
0.784453      True
0.818417      True
0.852379      True
0.886336      True
0.920301      True
0.954259     False
83.393376    False
83.427345    False
dtype: bool

(truncated for presentation purposes; the timestamps usually begin at 0)

and I need to get the boundaries of the True sequences, so in this example I should get [[t_0,t_1],[t_2,t_3], ..., [t_2n-2,t_2n-1]], with t_0 = 0.784453 and t_1 = 0.920301, if I have n different sequences of True in my pandas Series.

Now, that problem seems very simple: you can just shift the sequence by one and XOR it with the original to get a list of booleans in which True marks a boundary:

e = df.shift(periods=1, freq=None, axis=0)^df

(with df being a pandas Series). There is still some work to do, like figuring out whether the first element is a rising edge or a falling edge, but this hack works.

However, that doesn't seem very pythonic. In fact, the problem is so simple that I believe there must be a prebuilt function for it somewhere in pandas, numpy or even Python, which would fit nicely in a single function call instead of a hack like the one above. The groupby function seems promising, but I have never used it before.

What would be the best way of doing this?

Clément Pinard asked Aug 12 '16 11:08

1 Answer

You could use scipy.ndimage.label to identify the clusters of Trues:

In [102]: ts
0.069347    False
0.131956    False
0.143948    False
0.224864    False
0.242640     True
0.372599    False
0.451989    False
0.462090    False
0.579956     True
0.588791     True
0.603638    False
0.625107    False
0.642565    False
0.708547    False
0.730239    False
0.741652    False
0.747126     True
0.783276     True
0.896705     True
0.942829     True
Name: keep, dtype: bool

In [103]: groups, nobs = ndimage.label(ts); groups
Out[103]: array([0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3], dtype=int32)
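
As a minimal sketch of what label returns, here is the same call on a small hand-made mask (the array keep and the name n_runs are illustrative, not taken from the data above). Each run of consecutive True values gets its own positive integer, False entries get 0, and the second return value is the number of runs found:

```python
import numpy as np
import scipy.ndimage as ndimage

# Hypothetical toy mask, not the series shown above.
keep = np.array([False, True, True, False, False, True, False, True, True, True])

# Runs of consecutive True values are numbered 1, 2, 3, ...; False entries get 0.
groups, n_runs = ndimage.label(keep)

print(groups.tolist())  # [0, 1, 1, 0, 0, 2, 0, 3, 3, 3]
print(n_runs)           # 3
```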

Once you have the groups array, you can find the associated times using groupby/agg:

    result = (df.loc[df['group'] != 0]
                .groupby('group')['times']
                .agg(start='first', end='last'))

For example,

import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

def make_ts(N, ngroups):
    times = np.random.random(N)
    times = np.sort(times)
    idx = np.sort(np.random.randint(N, size=(ngroups,)))
    arr = np.zeros(N)
    arr[idx] = 1
    arr = arr.cumsum()
    arr = (arr % 2).astype(bool)
    ts = pd.Series(arr, index=times, name='keep')
    return ts

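def find_groups(ts):
    # Label each run of consecutive True values 1, 2, ...; False entries get 0.
    groups, nobs = ndimage.label(ts)
    df = pd.DataFrame({'times': ts.index, 'group': groups})
    # Drop the False rows, then take the first and last time of each run.
    result = (df.loc[df['group'] != 0]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result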

ts = make_ts(20, 5)
result = find_groups(ts)


yields

          start       end
group
1      0.242640  0.242640
2      0.579956  0.588791
3      0.747126  0.942829

To obtain the start and end times as a list of lists you could use:

In [125]: result.values.tolist()
[[0.24264034406127022, 0.24264034406127022],
 [0.5799564094638113, 0.5887908182432907],
 [0.7471260123697537, 0.9428288694956402]]

Using ndimage.label is convenient, but note that it is also possible to compute this without scipy:

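def find_groups_without_scipy(ts):
    # ts.diff() == True marks every change of value; cumsum() numbers the runs.
    df = pd.DataFrame({'times': ts.index, 'group': (ts.diff() == True).cumsum()})
    # With a series that starts with False, the runs of True have odd labels.
    result = (df.loc[df['group'] % 2 == 1]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result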

The main idea here is to find labels for the clusters of Trues using (ts.diff() == True).cumsum(). ts.diff() == True gives the same result as ts.shift() ^ ts, but is a bit faster: it is True wherever a value differs from the one before it. Taking the cumulative sum (i.e. calling cumsum) treats True as equal to 1 and False as equal to 0, so the sum increases by 1 at every change of value. Thus each run of identical values gets its own number, and as long as the series starts with False, the runs of True are exactly those with odd labels, which is why the function keeps group % 2 == 1:

In [111]: (ts.diff() == True).cumsum()
0.069347    0
0.131956    0
0.143948    0
0.224864    0
0.242640    1
0.372599    2
0.451989    2
0.462090    2
0.579956    3
0.588791    3
0.603638    4
0.625107    4
0.642565    4
0.708547    4
0.730239    4
0.741652    4
0.747126    5
0.783276    5
0.896705    5
0.942829    5
Name: keep, dtype: int64
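
One caveat: the odd-label filter (group % 2 == 1) relies on the series starting with False. If the first value is True, that first run is labeled 0, the runs of True get even labels, and the filter picks out the runs of False instead. A possible variant, sketched here with pandas only, numbers every run by comparing each value with its predecessor and then filters on the value itself instead of on parity. The function name true_run_bounds and the toy series are hypothetical, not part of the answer above:

```python
import pandas as pd

def true_run_bounds(ts):
    """Return [[start, end], ...] for each run of consecutive True values."""
    df = pd.DataFrame({'times': ts.index.to_numpy(),
                       'keep': ts.to_numpy(dtype=bool)})
    # A new run starts wherever a value differs from the previous one.
    # The first row always starts a run, since shift() leaves a NaN there.
    df['run'] = (df['keep'] != df['keep'].shift()).cumsum()
    # Keep only the True rows, then take the first and last time of each run.
    bounds = (df[df['keep']]
                .groupby('run')['times']
                .agg(start='first', end='last'))
    return bounds.to_numpy().tolist()

# Toy series that starts with True (illustrative data only).
ts = pd.Series([True, True, False, True, False, False, True],
               index=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(true_run_bounds(ts))  # [[0.0, 0.1], [0.3, 0.3], [0.6, 0.6]]
```

Filtering on the boolean column means the result does not depend on whether the first frame is kept or dropped.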
unutbu answered Sep 19 '22 06:09
