I am trying to cut videos based on some characteristics.
My current strategy produces a pandas Series of booleans, one per frame, indexed by timestamp: True to keep the frame, False to drop it.
Since I plan to cut videos, I need to extract boundaries from this Series, so that I can tell ffmpeg the beginning and end of the parts I want to extract from the main video.
To sum up:
I have a pandas Series which looks like this:
acquisitionTs
0.577331 False
0.611298 False
0.645255 False
0.679218 False
0.716538 False
0.784453 True
0.784453 True
0.818417 True
0.852379 True
0.886336 True
0.920301 True
0.954259 False
...
83.393376 False
83.427345 False
dtype: bool
(truncated for presentation; the timestamps usually begin at 0)
and I need to get the boundaries of the True sequences. In this example I should get [[t_0, t_1], [t_2, t_3], ..., [t_2n-2, t_2n-1]], with t_0 = 0.784453 and t_1 = 0.920301, if I have n different sequences of True in my pandas Series.
Now, that problem seems very simple. In fact, you can just shift the sequence by one and XOR the two to get a Series of booleans in which True marks the boundaries:
e = df.shift(periods=1, freq=None, axis=0)^df
print(e[e].index)
(with df being a pandas Series). There is still some work to do, like figuring out whether the first element is a rising edge or a falling edge, but this hack works.
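The remaining work mentioned here (pairing rising and falling edges, and coping with data that starts or ends on True) could be sketched as follows. This is only an illustration in plain Python: the function name true_runs and its two arguments are hypothetical, and it assumes times and keep are sequences of equal length.

```python
def true_runs(times, keep):
    """Return [[start, end], ...] for each run of consecutive True flags."""
    # Pad with False on both sides so that a run touching either end of
    # the data still produces both a rising and a falling edge.
    padded = [False] + [bool(k) for k in keep] + [False]
    # edges[i] is True where padded[i] != padded[i + 1] (the shift-and-xor step).
    edges = [a ^ b for a, b in zip(padded, padded[1:])]
    positions = [i for i, is_edge in enumerate(edges) if is_edge]
    # Edges alternate rising, falling, rising, ... A rising edge at i means
    # keep[i] is the first True of a run; a falling edge at j means
    # keep[j - 1] is the last True of that run.
    return [[times[i], times[j - 1]]
            for i, j in zip(positions[0::2], positions[1::2])]


times = [0.716538, 0.784453, 0.818417, 0.920301, 0.954259]
keep = [False, True, True, True, False]
print(true_runs(times, keep))  # [[0.784453, 0.920301]]
```

With a pandas Series, something like true_runs(df.index, df.to_numpy()) should work the same way, since only positional indexing is used.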
However, that doesn't seem very Pythonic. In fact, the problem is so simple that I believe there must be a prebuilt function somewhere in pandas, numpy or even Python that would do this in a single call instead of a hack like the one above. The groupby function seems promising, but I have never used it before.
What would be the best way of doing this?
You could use scipy.ndimage.label to identify the clusters of Trues:
In [102]: ts
Out[102]:
0.069347 False
0.131956 False
0.143948 False
0.224864 False
0.242640 True
0.372599 False
0.451989 False
0.462090 False
0.579956 True
0.588791 True
0.603638 False
0.625107 False
0.642565 False
0.708547 False
0.730239 False
0.741652 False
0.747126 True
0.783276 True
0.896705 True
0.942829 True
Name: keep, dtype: bool
In [103]: groups, nobs = ndimage.label(ts); groups
Out[103]: array([0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3], dtype=int32)
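Once you have the groups array, you can find the associated times using groupby/agg:
# Named aggregation (pandas >= 0.25); the older dict form
# .agg({'start': 'first', 'end': 'last'}) is rejected by pandas >= 1.0.
result = (df.loc[df['group'] != 0]
            .groupby('group')['times']
            .agg(start='first', end='last'))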
For example,
import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

np.random.seed(2016)

def make_ts(N, ngroups):
    # Build a random boolean Series indexed by N sorted random times.
    times = np.random.random(N)
    times = np.sort(times)
    # Pick random positions where the value toggles between False and True.
    idx = np.sort(np.random.randint(N, size=(ngroups,)))
    arr = np.zeros(N)
    arr[idx] = 1
    arr = arr.cumsum()
    arr = (arr % 2).astype(bool)
    ts = pd.Series(arr, index=times, name='keep')
    return ts

def find_groups(ts):
    # Label each cluster of Trues 1, 2, 3, ...; Falses are labeled 0.
    groups, nobs = ndimage.label(ts)
    df = pd.DataFrame({'times': ts.index, 'group': groups})
    # Named aggregation (pandas >= 0.25) replaces the removed dict form.
    result = (df.loc[df['group'] != 0]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result

ts = make_ts(20, 5)
result = find_groups(ts)
yields
          start       end
group
1      0.242640  0.242640
2      0.579956  0.588791
3      0.747126  0.942829
To obtain the start and end times as a list of lists you could use:
In [125]: result.values.tolist()
Out[125]:
[[0.24264034406127022, 0.24264034406127022],
[0.5799564094638113, 0.5887908182432907],
[0.7471260123697537, 0.9428288694956402]]
Using ndimage.label is convenient, but note that it is also possible to compute this without scipy:
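def find_groups_without_scipy(ts):
    # Label every run of identical values; the label increases at each change.
    df = pd.DataFrame({'times': ts.index, 'group': (ts.diff() == True).cumsum()})
    # Odd labels are the True runs, provided ts starts with False.
    result = (df.loc[df['group'] % 2 == 1]
                .groupby('group')['times']
                .agg(start='first', end='last'))
    return result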
The main idea here is to find labels for the clusters of Trues using (ts.diff() == True).cumsum(). ts.diff() == True is True wherever a value differs from the previous one, which gives the same result as ts.shift() ^ ts, but is a bit faster. Taking the cumulative sum (i.e. calling cumsum) treats True as equal to 1 and False as equal to 0, so the cumulative sum increases by 1 at every change. Thus each run of identical values gets its own label, and as long as the Series starts with False, the clusters of Trues are the ones with odd labels, which is what the group % 2 == 1 filter selects:
In [111]: (ts.diff() == True).cumsum()
Out[111]:
0.069347 0
0.131956 0
0.143948 0
0.224864 0
0.242640 1
0.372599 2
0.451989 2
0.462090 2
0.579956 3
0.588791 3
0.603638 4
0.625107 4
0.642565 4
0.708547 4
0.730239 4
0.741652 4
0.747126 5
0.783276 5
0.896705 5
0.942829 5
Name: keep, dtype: int64
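Note that the odd-label filter relies on the Series starting with False; if the first frame is True, the first cluster of Trues gets label 0 and is dropped. A possible variant that does not depend on the parity of the labels is sketched below. The function name find_true_runs is hypothetical, and the sketch assumes ts is a boolean Series without missing values; it filters on the boolean values themselves rather than on the labels.

```python
import numpy as np
import pandas as pd


def find_true_runs(ts):
    """Return a DataFrame of start/end times, one row per run of True."""
    keep = ts.to_numpy(dtype=bool)
    # Give every run of identical values its own id: the id increases
    # each time the value differs from the previous one.
    run_id = (np.diff(keep.astype(np.int8), prepend=0) != 0).cumsum()
    df = pd.DataFrame({"times": ts.index.to_numpy(), "run": run_id})
    # Keep only the rows that are True, whatever id their run received.
    result = (df.loc[keep]
                .groupby("run")["times"]
                .agg(start="first", end="last"))
    return result.reset_index(drop=True)


ts = pd.Series([True, True, False, True, False],
               index=[0.0, 0.1, 0.2, 0.3, 0.4], name="keep")
print(find_true_runs(ts).values.tolist())  # [[0.0, 0.1], [0.3, 0.3]]
```

Because the frame is built from plain arrays, this variant should also tolerate repeated timestamps in the index.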