Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to identify consecutive repeating values in data frame column?

so I am trying to figure out how I can identify consecutive repeating values in a data frame column in python, and then be able to set a number for how many consecutive repeating values I am looking for. I will explain further here.

I have the following data frame:

DateTime                 Value         
-------------------------------
2015-03-11 06:00:00          1               
2015-03-11 07:00:00          1               
2015-03-11 08:00:00          1               
2015-03-11 09:00:00          1               
2015-03-11 10:00:00          0               
2015-03-11 11:00:00          0               
2015-03-11 12:00:00          0               
2015-03-11 13:00:00          0               
2015-03-11 14:00:00          0               
2015-03-11 15:00:00          0               
...

Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".

Now I have this data frame:

DateTime                 Value         
-------------------------------
2015-03-11 06:00:00          1               
2015-03-11 07:00:00          1               
2015-03-11 08:00:00          0               
2015-03-11 09:00:00          0               
2015-03-11 10:00:00          1               
2015-03-11 11:00:00          0               
2015-03-11 12:00:00          0               
2015-03-11 13:00:00          0               
2015-03-11 14:00:00          1               
2015-03-11 15:00:00          1               
...

Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".

And of course, if the answer is "No", then I would want to return a "False"

How can this be done in python? What is this process even called? How can you set this so that you can change the number of consecutive values being looked for?

like image 505
LostinSpatialAnalysis Avatar asked Oct 18 '25 08:10

LostinSpatialAnalysis


2 Answers

First, you can use .shift() to create a new column that has the same values as your column Value. Than

df["Value_shif"] = df["Value"].shift()

output:

              DateTime  Value  Value_shif 
0  2015-03-11 06:00:00      1         NaN    
1  2015-03-11 07:00:00      1         1.0  
2  2015-03-11 08:00:00      0         1.0   
3  2015-03-11 09:00:00      1         0.0 

than you can compare them and get True/False:

df["Value"] == df["Value_shif"]

output:

0    False
1     True
2    False
3    False

than Sum the number of repeating values:

df["count"] = (df["Value"] == df["Value_shif"]).cumsum()

cumsum() will treat True as 1 and False as 0

output:

              DateTime  Value  Value_shif  count
0  2015-03-11 06:00:00      1         NaN      0
1  2015-03-11 07:00:00      1         1.0      1
2  2015-03-11 08:00:00      0         1.0      1
3  2015-03-11 09:00:00      1         0.0      1

if Sum is larger than 1 then you have consecutive repeating values.

Once you have this info you can filter the dataframe under specific conditions, check for specific values if the number_of_times it occurs is larger than a certain amount.

def check(dataframe, value, number_of_times):
    """
    Check for condition
    """
    df = dataframe.copy()
    df = df[df['Value'] == value]
    if df["count"].max() >= number_of_times:
        return True
    else:
        return False

print(check(df, 1, 1))
True

print(check(df, 0, 3))
False

You'll need to check for specific boundary conditions to make sure everything works as intended. The problem with shift() is that it creates NaN as the first value and removes the last value from the column...

like image 165
Geom Avatar answered Oct 20 '25 21:10

Geom


To detect consecutive runs in the series, we first detect the turning points by looking at the locations where difference with previous entry isn't 0. Then cumulative sum of this marks the groups:

# for the second frame
>>> consecutives = df["Value"].diff().ne(0).cumsum()
>>> consecutives

0    1
1    1
2    2
3    2
4    3
5    4
6    4
7    4
8    5
9    5

But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:

>>> masked_consecs = consecutives.mask(df["Value"].ne(0))
>>> masked_consecs

0    NaN
1    NaN
2    2.0
3    2.0
4    NaN
5    4.0
6    4.0
7    4.0
8    NaN
9    NaN

Now we can group by this series and look at the groups' sizes:

>>> consec_sizes = df["Value"].groupby(masked_consecs).size().to_numpy()
>>> consec_sizes

array([2, 3])

The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:

>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True

Now we can wrap this procedure in a function for reusability:

def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()

    # disregard those groups that are not of `value`
    masked_consecs = consecs.mask(series.ne(value))

    # get size of each
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()

    # check sizes agains the threshold
    is_okay = (consec_sizes >= threshold).any()

    # whether a suitable sequence is found or not
    return is_okay

and we can run it as:

# these are all for the second dataframe you posted
>>> is_consec_found(df["Value"], value=0, threshold=2)
True

>>> is_consec_found(df["Value"], value=0, threshold=5)
False

>>> is_consec_found(df["Value"], value=1, threshold=2)
True

>>> is_consec_found(df["Value"], value=1, threshold=3)
False
like image 34
Mustafa Aydın Avatar answered Oct 20 '25 21:10

Mustafa Aydın



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!