python pandas get index boundaries from a series of Booleans

Tags:

pandas

I am trying to cut videos based on some characteristics. My current strategy produces a pandas Series of booleans, one per frame, indexed by timestamp: True to keep the frame, False to drop it.

As I plan to cut videos, I need to extract boundaries from this Series so that I can tell ffmpeg the beginning and end of each part I want to extract from the main video.

To sum up:

I have a pandas Series which looks like this:

acquisitionTs
0.577331     False
0.611298     False
0.645255     False
0.679218     False
0.716538     False
0.784453      True
0.784453      True
0.818417      True
0.852379      True
0.886336      True
0.920301      True
0.954259     False
             ...  
83.393376    False
83.427345    False
dtype: bool

(truncated for presentation; the timestamps usually begin at 0)

and I need to get the boundaries of the True sequences. If there are n distinct sequences of True in the Series, I should get [[t_0, t_1], [t_2, t_3], ..., [t_2n-2, t_2n-1]], with t_0 = 0.784453 and t_1 = 0.920301 in this example.

Now, this problem seems very simple. In fact, you can just shift the sequence by one and XOR it with the original to get a Series of booleans in which True marks a boundary:

e = df.shift(periods=1, freq=None, axis=0)^df
print(e[e].index)

(with df being a pandas Series). There is still some work to do, like figuring out whether the first element is a rising edge or a falling edge, but this hack works.

However, that doesn't seem very pythonic. The problem is so simple that I believe there must be a prebuilt function somewhere in pandas, numpy or even plain Python that does this in a single call instead of a hack like the one above. The groupby function seems promising, but I have never used it before.

What would be the best way of doing this?

339

asked Aug 12 '16 11:08

Clément Pinard

1 Answer

You could use scipy.ndimage.label to identify the clusters of Trues:

In [102]: ts
Out[102]: 
0.069347    False
0.131956    False
0.143948    False
0.224864    False
0.242640     True
0.372599    False
0.451989    False
0.462090    False
0.579956     True
0.588791     True
0.603638    False
0.625107    False
0.642565    False
0.708547    False
0.730239    False
0.741652    False
0.747126     True
0.783276     True
0.896705     True
0.942829     True
Name: keep, dtype: bool

In [103]: groups, nobs = ndimage.label(ts); groups
Out[103]: array([0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3], dtype=int32)

Once you have the groups array, you can find the associated times using groupby/agg:

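    # Named aggregation (pandas 0.25+); the older dict-renaming form of agg
    # on a grouped Series raises an error in pandas 1.0 and later.
    result = (df.loc[df['group'] != 0]
                .groupby('group')['times']
                .agg(start='first', end='last'))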

For example,

import numpy as np
import pandas as pd
import scipy.ndimage as ndimage
np.random.seed(2016)

def make_ts(N, ngroups):
    times = np.random.random(N)
    times = np.sort(times)
    idx = np.sort(np.random.randint(N, size=(ngroups,)))
    arr = np.zeros(N)
    arr[idx] = 1
    arr = arr.cumsum()
    arr = (arr % 2).astype(bool)
    ts = pd.Series(arr, index=times, name='keep')
    return ts

def find_groups(ts):
    groups, nobs = ndimage.label(ts)
    df = pd.DataFrame({'times': ts.index, 'group': groups})
    result = (df.loc[df['group'] != 0]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result

ts = make_ts(20, 5)
result = find_groups(ts)

yields

          start       end
group                    
1      0.242640  0.242640
2      0.579956  0.588791
3      0.747126  0.942829

To obtain the start and end times as a list of lists you could use:

In [125]: result.values.tolist()
Out[125]: 
[[0.24264034406127022, 0.24264034406127022],
 [0.5799564094638113, 0.5887908182432907],
 [0.7471260123697537, 0.9428288694956402]]
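
If only the list of [start, end] pairs is needed, the groupby step could perhaps be skipped. The sketch below is one possible variant, not part of the approach above: it assumes scipy.ndimage.find_objects, which returns one tuple of slices per label, and reads the first and last timestamp of each slice. The function name find_boundaries is made up for illustration.

```python
import pandas as pd
import scipy.ndimage as ndimage

def find_boundaries(ts):
    """Return [start, end] index pairs for each cluster of True values.

    find_boundaries is a hypothetical helper name, not a pandas or scipy function.
    """
    keep = ts.to_numpy(dtype=bool)
    times = ts.index.to_numpy()
    # Label consecutive True values: 0 is background, 1..nobs are clusters
    groups, nobs = ndimage.label(keep)
    # For a 1-D array, find_objects yields a 1-tuple holding one slice per label
    return [[float(times[sl.start]), float(times[sl.stop - 1])]
            for (sl,) in ndimage.find_objects(groups)]

ts = pd.Series([False, False, True, True, False, True],
               index=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
print(find_boundaries(ts))  # [[0.2, 0.3], [0.5, 0.5]]
```

Because the slices are positional, this variant does not depend on the index values being unique.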

Using ndimage.label is convenient, but note that it is also possible to compute this without scipy:

def find_groups_without_scipy(ts):
    df = pd.DataFrame({'times': ts.index, 'group': (ts.diff() == True).cumsum()})
    result = (df.loc[df['group'] % 2 == 1]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result

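The main idea here is to find labels for the clusters of Trues using (ts.diff() == True).cumsum(). ts.diff() == True gives the same result as ts.shift() ^ ts, but is a bit faster: it is True wherever the value differs from the previous one, i.e. at every boundary. Taking the cumulative sum (i.e. calling cumsum) treats True as equal to 1 and False as equal to 0, so the sum increases by 1 at each boundary and every run of equal values gets its own number. When the series starts with False, as it does here, the clusters of Trues are exactly the runs with odd labels, which is why the function keeps only the rows where group % 2 == 1: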

In [111]: (ts.diff() == True).cumsum()
Out[111]: 
0.069347    0
0.131956    0
0.143948    0
0.224864    0
0.242640    1
0.372599    2
0.451989    2
0.462090    2
0.579956    3
0.588791    3
0.603638    4
0.625107    4
0.642565    4
0.708547    4
0.730239    4
0.741652    4
0.747126    5
0.783276    5
0.896705    5
0.942829    5
Name: keep, dtype: int64
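
The odd-label filter in find_groups_without_scipy relies on the series starting with False; if the first value is True, the first cluster gets label 0 and is dropped. A minimal sketch of one way around this is to label every run and then keep the runs whose value is True, instead of relying on parity. The name find_true_runs and the 'run' and 'keep' column names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def find_true_runs(ts):
    """Return [start, end] index pairs for each run of consecutive True values.

    find_true_runs is a hypothetical helper name, not a pandas function.
    """
    keep = ts.to_numpy(dtype=bool)
    if not keep.any():
        # Empty or all-False input: nothing to keep
        return []
    # A new run starts wherever the value differs from the previous element
    changed = keep[1:] != keep[:-1]
    run_id = np.concatenate(([0], changed.cumsum()))
    df = pd.DataFrame({'times': ts.index.to_numpy(), 'run': run_id, 'keep': keep})
    # Filter on the value itself, so the first element may be True or False
    result = (df.loc[df['keep']]
                .groupby('run')['times']
                .agg(start='first', end='last'))
    return result.to_numpy().tolist()

ts = pd.Series([True, True, False, True, False, False, True],
               index=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(find_true_runs(ts))  # [[0.0, 0.1], [0.3, 0.3], [0.6, 0.6]]
```

The run labels here are only used as grouping keys, so their actual values do not matter.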

180

answered Sep 19 '22 06:09

unutbu
