I have a pandas DataFrame of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5), so that I end up with something like this:
[(150, 185), (632, 680), (1500,1870)]
Here the first tuple describes a region that starts at index 150, has 35 values in a row that are all above 0.5, and ends at index 185 (non-inclusive).
I started by filtering for only values above 0.5, like so:
df = df[df['values'] >= 0.5]
And now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
I can't show my actual dataset, but the following one should be a good representation:
import numpy as np
from pandas import DataFrame
np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
yielding:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
Here the region (2, 4) has two values above 0.5, but that is too short. On the other hand, the region (25, 44), with 19 values above 0.5 in a row, would be added to the list.
To select rows by label, the syntax is df.loc[start:stop:step], where start is the label of the first row to take, stop is the label of the last row to take (it is included, unlike in ordinary Python slices), and step is the number of rows to advance after each extraction; for example, a step of 2 selects alternate rows.
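As a small illustration of the label-based slicing described above, here is a sketch on a made-up frame. The frame, its labels 10..19, and the variable names are assumptions for the example only; they are not part of the dataset in the question.

```python
import pandas as pd

# Hypothetical toy frame; labels run from 10 to 19 so that
# labels and positions are visibly different.
df = pd.DataFrame({"values": [i / 10 for i in range(10)]},
                  index=range(10, 20))

# Label-based slice: start at label 12, stop at label 18 (included),
# taking every second row.
every_other = df.loc[12:18:2]
labels = list(every_other.index)
```

Note that the stop label 18 is part of the result, which is the main difference from positional slicing with iloc.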
In Pandas, data is typically arranged in rows and columns. A DataFrame is an indexed and typed two-dimensional data structure. In Pandas, you can use a technique called DataFrame slicing to extract just the data you need from large or small datasets.
A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index.
You can find the first and last element of each consecutive region by comparing the series with copies of itself shifted by one row, and then keep only the pairs that are far enough apart:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
# (fill_value=False keeps the shifted series boolean, so ~ is a logical NOT)
fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]
# keep regions of more than 5 values; both i and j are inside the region
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
So, for example, the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
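The pairs above include the last row of each region, whereas the question asked for an end point that is non-inclusive. A minimal sketch of the same shift-based idea wrapped in a function that returns half-open tuples could look like this. The function name runs_above, its parameters, and the toy series are assumptions made for illustration, and the sketch assumes the index holds consecutive integers.

```python
import pandas as pd


def runs_above(values, threshold=0.5, min_len=6):
    """Return half-open (start, end) label pairs for runs of at least
    `min_len` consecutive values strictly above `threshold`.

    Assumes `values` is a Series whose index holds consecutive integers,
    so that the exclusive end is simply the last label plus one.
    """
    tag = values > threshold
    # a run starts at a True preceded by a False (or by nothing)
    starts = values.index[tag & ~tag.shift(1, fill_value=False)]
    # a run ends at a True followed by a False (or by nothing)
    lasts = values.index[tag & ~tag.shift(-1, fill_value=False)]
    return [(int(i), int(j) + 1)
            for i, j in zip(starts, lasts)
            if j - i + 1 >= min_len]


# Hypothetical toy data: a run of 2 (labels 1-2) and a run of 6 (labels 4-9).
s = pd.Series([0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.7, 0.6, 0.9, 0.8, 0.3])
result = runs_above(s)
short_too = runs_above(s, min_len=2)
```

With the frame from the question this would be called as runs_above(df['values']); "more than 5" values corresponds to the default min_len=6.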