I have noisy data at roughly 1 minute intervals across a day.
Here is a simple version:
How can I identify the start and end index values of the less noisy and lower valued period marked in yellow?
Here is the test data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
arr = np.array([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
plt.plot(arr)
plt.show()
You could try to detect less noisy points by measuring the variance of the values in their neighborhood.
For example, for each point you can look at the last N values before it and calculate their standard deviation, then flag the point if the std is lower than some threshold.
The following code applies this procedure using the rolling method of a pandas Series.
std_thresh = 1
window_len = 5
s = pd.Series([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
# Create a boolean mask which marks the less noisy points
marked = s.rolling(window=window_len).std() < std_thresh
# Whenever a new point is marked, mark also the other points of the window (see discussion below)
for i in range(window_len - 1, len(marked)):
    if marked.iloc[i] and not marked.iloc[i - 1]:
        # back-fill the earlier points of the window that triggered the flag
        marked.iloc[i - (window_len - 1) : i] = True
plt.plot(s)
plt.scatter(s[marked].index, s[marked], c='orange')
You can try changing the values of window_len (the length of the window over which the std is calculated) and std_thresh (points whose window has a std below it are flagged) and tune them according to your needs.
Note that rolling considers a window which ends at each point, so whenever you encounter a segment of less noisy points, the first window_len-1 of them will not be marked. This is why I included the for loop in the code after defining marked.
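To get from the boolean mask to the start and end index values the question asks for, one option is to look for the places where the mask flips between False and True. The following is only a sketch: mask_to_runs is a hypothetical helper name (not a pandas function), and the mask here is rebuilt without the back-filling loop to keep the example short.

```python
import numpy as np
import pandas as pd

def mask_to_runs(mask):
    """Return (start, end) index labels, end inclusive, for each run of True values.

    Hypothetical helper for illustration; `mask` is assumed to be a boolean Series.
    """
    values = mask.to_numpy(dtype=bool)
    # Pad with False on both sides so runs touching either edge are detected too
    padded = np.concatenate(([False], values, [False]))
    changes = np.diff(padded.astype(int))
    starts = np.flatnonzero(changes == 1)      # False -> True transitions
    ends = np.flatnonzero(changes == -1) - 1   # last True before a True -> False transition
    return [(mask.index[a], mask.index[b]) for a, b in zip(starts, ends)]

s = pd.Series([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
window_len = 5
std_thresh = 1
marked = s.rolling(window=window_len).std() < std_thresh

runs = mask_to_runs(marked)
for start, end in runs:
    print(f"less noisy period from index {start} to {end}")
```

If several short runs come back, you may want to keep only the longest one or merge runs separated by a small gap, depending on how noisy the real data is.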
For a given point, we can decide to keep/mask it based on certain criteria:
Note: Since you tagged and imported pandas, I'll use pandas for convenience, but the same ideas can be implemented with pure numpy/matplotlib.
If the low periods all sit at roughly the same level, a simple approach is to combine a neighbor delta with a threshold above the minimum (though be careful of outliers in the real data):
s = pd.Series(np.hstack([arr, arr]))
delta = 2
threshold = s.std()
# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)
# check if each point is within `threshold` of the minimum
mask_threshold = s < s.min() + threshold
s.plot(label='raw')
s.where(mask_threshold & mask_delta).plot(marker='*', label='delta & threshold')
If the low periods sit at different levels, a global minimum threshold won't work since some periods will be too high. In this case, try a neighbor delta with contiguous blocks:
# shift the second period by 5
s = pd.Series(np.hstack([arr, arr + 5]))
delta = 2
blocksize = 10
# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)
# check if each point is in a contiguous block of at least `blocksize`
masked = s.where(mask_delta)
groups = masked.isnull().cumsum()
blocksizes = masked.groupby(groups).transform('count').mask(masked.isnull())
mask_contiguous = blocksizes >= blocksize
s.plot(label='raw')
s.where(mask_contiguous).plot(marker='*', label='delta & contiguous')
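As mentioned above, the same idea can be written with pure numpy. Below is a minimal sketch of the neighbor delta plus contiguous block check that also returns the start and end index of each block. The variable names (x, step_ok, periods) are my own, and the end index is inclusive.

```python
import numpy as np

arr = np.array([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
x = np.hstack([arr, arr + 5])   # second period shifted by 5, as above
delta = 2
blocksize = 10

# True where the step between consecutive points is at most `delta`
step_ok = np.abs(np.diff(x)) <= delta

# A point passes only if both its left and right steps are small;
# the two endpoints have only one neighbor and are left as False
mask_delta = np.zeros(len(x), dtype=bool)
mask_delta[1:-1] = step_ok[:-1] & step_ok[1:]

# Find runs of True by locating where the padded mask flips
padded = np.concatenate(([False], mask_delta, [False]))
changes = np.diff(padded.astype(int))
starts = np.flatnonzero(changes == 1)    # first index of each run
stops = np.flatnonzero(changes == -1)    # one past the last index of each run

# Keep only runs of at least `blocksize` points
long_enough = (stops - starts) >= blocksize
periods = [(int(a), int(b) - 1) for a, b in zip(starts[long_enough], stops[long_enough])]
print(periods)  # [(6, 16), (31, 41)]
```

Note that the delta criterion drops the point on each edge of a quiet stretch (its outer neighbor is too far away), so you may want to widen each reported period by one index on either side.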