So I am trying to figure out how to identify consecutive repeating values in a DataFrame column in Python, and then be able to set a number for how many consecutive repeating values I am looking for. I will explain further here.
I have the following data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 1
2015-03-11 09:00:00 1
2015-03-11 10:00:00 0
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 0
2015-03-11 15:00:00 0
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".
Now I have this data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 0
2015-03-11 09:00:00 0
2015-03-11 10:00:00 1
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 1
2015-03-11 15:00:00 1
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".
And of course, if the answer is "No", then I would want to return a "False".
How can this be done in python? What is this process even called? How can you set this so that you can change the number of consecutive values being looked for?
First, you can use .shift() to create a new column that has the same values as your column Value, shifted down by one row. Then:
df["Value_shif"] = df["Value"].shift()
output:
DateTime Value Value_shif
0 2015-03-11 06:00:00 1 NaN
1 2015-03-11 07:00:00 1 1.0
2 2015-03-11 08:00:00 0 1.0
3 2015-03-11 09:00:00 1 0.0
then you can compare them and get True/False:
df["Value"] == df["Value_shif"]
output:
0 False
1 True
2 False
3 False
then sum the number of repeating values:
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()
cumsum() will treat True as 1 and False as 0.
output:
DateTime Value Value_shif count
0 2015-03-11 06:00:00 1 NaN 0
1 2015-03-11 07:00:00 1 1.0 1
2 2015-03-11 08:00:00 0 1.0 1
3 2015-03-11 09:00:00 1 0.0 1
If the running count reaches 1 or more, then at least two consecutive repeating values exist somewhere in the column.
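The steps so far can be run as a self-contained sketch, using the four sample values from the output table above:

```python
import pandas as pd

# The four sample values from the output above
df = pd.DataFrame({"Value": [1, 1, 0, 1]})

# Shift down by one row, compare with the original,
# then take the running total of the True values
df["Value_shif"] = df["Value"].shift()
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()

print(df["count"].tolist())  # [0, 1, 1, 1]
```

The first comparison is always False because the shifted column starts with NaN.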
Once you have this info you can filter the dataframe under specific conditions and check, for a specific value, whether its count reaches number_of_times.
def check(dataframe, value, number_of_times):
    """
    Check for condition
    """
    df = dataframe.copy()
    df = df[df['Value'] == value]
    if df["count"].max() >= number_of_times:
        return True
    else:
        return False
print(check(df, 1, 1))
True
print(check(df, 0, 3))
False
You'll need to check for specific boundary conditions to make sure everything works as intended. One problem with shift() is that it creates NaN as the first value and drops the last value from the column. Another is that the cumulative count never resets between runs, so separate runs of the same value add up and can overstate the longest consecutive streak.
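A quick illustration of that second pitfall, using a hypothetical frame with two separate runs of 0, neither of which is three long:

```python
import pandas as pd

# Hypothetical example: two separate runs of 0, each only 2 long
df = pd.DataFrame({"Value": [0, 0, 1, 1, 0, 0]})
df["Value_shif"] = df["Value"].shift()
df["count"] = (df["Value"] == df["Value_shif"]).cumsum()

# The cumulative count keeps growing across both runs of 0,
# so filtering on Value == 0 reports a max count of 3 ...
print(df[df["Value"] == 0]["count"].max())  # 3

# ... even though the longest consecutive run of 0 is only 2.
```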
To detect consecutive runs in the series (measuring the lengths of such runs is essentially run-length encoding), we first detect the turning points by looking at the locations where the difference with the previous entry isn't 0. Then the cumulative sum of this marks the groups:
# for the second frame
>>> consecutives = df["Value"].diff().ne(0).cumsum()
>>> consecutives
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 5
But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:
>>> masked_consecs = consecutives.mask(df["Value"].ne(0))
>>> masked_consecs
0 NaN
1 NaN
2 2.0
3 2.0
4 NaN
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df["Value"].groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard those groups that are not of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get size of each group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable sequence is found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df["Value"], value=0, threshold=2)
True
>>> is_consec_found(df["Value"], value=0, threshold=5)
False
>>> is_consec_found(df["Value"], value=1, threshold=2)
True
>>> is_consec_found(df["Value"], value=1, threshold=3)
False
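For comparison, the same check can be written without pandas using itertools.groupby from the standard library (a plain-Python sketch; the helper name here is hypothetical):

```python
from itertools import groupby

def has_consec_run(values, value=0, threshold=2):
    """Return True if `value` appears at least `threshold` times in a row."""
    # groupby collapses the sequence into runs of equal elements;
    # we only measure the runs whose key matches `value`
    return any(
        key == value and len(list(group)) >= threshold
        for key, group in groupby(values)
    )

# The "Value" column of the second frame
values = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
print(has_consec_run(values, value=0, threshold=2))  # True
print(has_consec_run(values, value=0, threshold=5))  # False
```

This avoids the NaN boundary issues of shift() entirely, at the cost of leaving the pandas world.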