
Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

I have a pandas DataFrame of indices and values between 0 and 1, something like this:

 6  0.047033
 7  0.047650
 8  0.054067
 9  0.064767
10  0.073183
11  0.077950

I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all above a certain threshold (e.g. 0.5), so that I end up with something like this:

 [(150, 185), (632, 680), (1500, 1870)]

Here the first tuple describes a region that starts at index 150, has 35 values in a row that are all above 0.5, and ends at index 185 (non-inclusive).

I started by keeping only the values at or above 0.5, like so:

 df = df[df['values'] >= 0.5]

And now I have values like this:

632  0.545700
633  0.574983
634  0.572083
635  0.595500
636  0.632033
637  0.657617
638  0.643300
639  0.646283

I can't show my actual dataset, but the following one should be a good representation:

import numpy as np
import pandas as pd

# fixed seed so the example is reproducible
np.random.seed(seed=901212)

# 500 rows, with values drawn uniformly from [0.35, 0.85)
df = pd.DataFrame(range(1, 501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35

yielding:

 1  0.491233
 2  0.538596
 3  0.516740
 4  0.381134
 5  0.670157
 6  0.846366
 7  0.495554
 8  0.436044
 9  0.695597
10  0.826591
...

Here the region (2,4) has two values above 0.5, but it would be too short. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.

asked Jun 18 '14 09:06 by tlnagy


1 Answer

You can find the first and last element of each consecutive region by comparing the series with copies of itself shifted by one row, and then keep only the pairs that are far enough apart:

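# tag rows based on the threshold
df['tag'] = df['values'] > .5

# first row of a region: a True preceded by a False
# (fill_value=False keeps the shifted column boolean, so ~ negates it correctly)
fst = df.index[df['tag'] & ~df['tag'].shift(1, fill_value=False)]

# last row of a region: a True followed by a False
lst = df.index[df['tag'] & ~df['tag'].shift(-1, fill_value=False)]

# keep regions of more than 5 values; j is the last row (inclusive),
# and the test assumes the default consecutive integer index
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]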

For example, the first region would be:

>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
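
Note that the pairs above use an inclusive last index, while the question asks for end points that are non-inclusive. As a minimal sketch of the same shift-based idea, the steps could be wrapped in a helper that returns half-open tuples. The function name `regions_above` and its parameters `threshold` and `min_len` are hypothetical, chosen only for illustration, and the sketch assumes a Series with a consecutive integer index:

```python
import pandas as pd


def regions_above(values, threshold=0.5, min_len=6):
    """Return half-open (start, end) index pairs for runs above `threshold`.

    Hypothetical helper; assumes `values` is a Series whose index is made
    of consecutive integers, so run length can be computed from labels.
    """
    tag = values > threshold
    # A run starts where a True is preceded by a False (or by nothing).
    is_start = tag & ~tag.shift(1, fill_value=False)
    # A run ends where a True is followed by a False (or by nothing).
    is_last = tag & ~tag.shift(-1, fill_value=False)
    starts = tag.index[is_start.to_numpy()]
    lasts = tag.index[is_last.to_numpy()]
    # last + 1 makes the end non-inclusive; the run length is last - start + 1.
    return [(int(i), int(j) + 1)
            for i, j in zip(starts, lasts)
            if j - i + 1 >= min_len]


# Small hand-made series: a run of 2 (labels 101-102) and a run of 6 (104-109).
s = pd.Series([0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.7, 0.6, 0.55, 0.95, 0.3],
              index=range(100, 111))
print(regions_above(s))  # [(104, 110)]
```

With `min_len=6` only runs of more than 5 values are kept, matching the `j > i + 4` test above. If the index is not made of consecutive integers, the length check would have to count rows by position instead of subtracting labels.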
answered Oct 15 '22 16:10 by behzad.nouri