
How can I identify the start and end of the lower period in noisy data?

I have noisy data at roughly 1 minute intervals across a day.

Here is a simple version:

[Figure: plot of the sample data, with the lower, less noisy period highlighted in yellow]

How can I identify the start and end index values of the less noisy and lower valued period marked in yellow?

Here is the test data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

arr = np.array([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])

plt.plot(arr)
plt.show()

Asked by ManInMoon, Oct 01 '21 15:10


2 Answers

You could try to detect less noisy points by measuring the variance of the values in their neighborhood.

For example, for each point you can look at the last N values before it and calculate their standard deviation, then flag the point if the std is lower than some threshold.

The following code applies this procedure using the rolling method of a pandas series.

std_thresh = 1
window_len = 5

s = pd.Series([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])

# Create a boolean mask which marks the less noisy points
marked = s.rolling(window=window_len).std() < std_thresh

# Whenever a new point is marked, also mark the other points of its window (see discussion below).
# Start at window_len - 1, the first index that has a full window behind it.
for i in range(window_len - 1, len(marked)):
    if marked.iloc[i] and not marked.iloc[i - 1]:
        marked.iloc[i - (window_len - 1) : i] = True

plt.plot(s)
plt.scatter(s[marked].index, s[marked], c='orange')
plt.show()

[Figure: the series with the flagged, less noisy points drawn in orange]

You can try to change the values of window_len (the length of the window where you calculate the std) and std_thresh (points whose window has std less than it are flagged) and tune them according to your needs.

Note that rolling considers a window which ends at each point, so whenever you encounter a segment of less noisy points, the first window_len-1 of them will not be marked: their windows still contain noisy values from before the segment. This is why I included the for loop in the code after defining marked.
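The question asks for start and end index values, so here is a minimal sketch of one way to turn the boolean mask into index pairs. The helper name true_runs is made up for this illustration, and the widening step mirrors the for loop above without modifying the mask in place; treat it as a starting point rather than a finished implementation.

```python
import numpy as np
import pandas as pd

def true_runs(mask):
    """Return (start, end) positions, end inclusive, of each run of True values."""
    flags = np.asarray(mask, dtype=bool).astype(int)
    # Pad with zeros so runs touching either edge still produce a transition
    edges = np.diff(np.concatenate(([0], flags, [0])))
    starts = np.flatnonzero(edges == 1)      # 0 -> 1 transitions
    ends = np.flatnonzero(edges == -1) - 1   # 1 -> 0 transitions, shifted to the last True
    return list(zip(starts.tolist(), ends.tolist()))

s = pd.Series([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
window_len = 5
std_thresh = 1

marked = s.rolling(window=window_len).std() < std_thresh

# Each flagged window covers the window_len - 1 points before it,
# so move every start back by that amount (clipped at 0)
runs = [(max(start - (window_len - 1), 0), end) for start, end in true_runs(marked)]
print(runs)  # [(5, 17)]
```

If two quiet stretches lie closer together than window_len points, the widened pairs can overlap and would need merging; that case does not occur in the test data.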

Answered by Simone, Oct 16 '22 15:10


For a given point, we can decide to keep/mask it based on certain criteria:

  1. Are its neighbors within some delta?
  2. Is it within some threshold of the minimum?
  3. Is it in a contiguous block?

Note: Since you tagged and imported pandas, I'll use pandas for convenience, but the same ideas can be implemented with pure numpy/matplotlib.


If all lower periods are around the same level

Then a simple approach is to use a neighbor delta with minimum threshold (though be careful of outliers in the real data):

[Figure: neighbor delta with minimum threshold]

# repeat the test data so there are two lower periods at the same level
s = pd.Series(np.hstack([arr, arr]))

delta = 2
threshold = s.std()

# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)

# check if each point is within `threshold` of the minimum
mask_threshold = s < s.min() + threshold

s.plot(label='raw')
s.where(mask_threshold & mask_delta).plot(marker='*', label='delta & threshold')
plt.legend()
plt.show()
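As noted earlier, the same idea can be written without pandas. The following is a rough NumPy-only sketch of the neighbor delta with minimum threshold; the variable names prev_step and next_step are just illustrative, and ddof=1 is passed so that the threshold matches the default of pandas' Series.std().

```python
import numpy as np

arr = np.array([8,9,7,3,6,3,2,1,2,3,1,2,3,2,2,3,2,2,5,7,8,9,15,20,21])
x = np.hstack([arr, arr]).astype(float)

delta = 2
threshold = x.std(ddof=1)  # sample standard deviation, like Series.std()

# Absolute step to the previous and to the next sample.
# The first point has no previous neighbor and the last has no next one, so use NaN there.
step = np.abs(np.diff(x))
prev_step = np.concatenate(([np.nan], step))
next_step = np.concatenate((step, [np.nan]))

# Comparisons against NaN are False, so the two edge points are never kept
mask_delta = (prev_step <= delta) & (next_step <= delta)

# Keep only points within `threshold` of the global minimum
mask_threshold = x < x.min() + threshold

mask = mask_delta & mask_threshold
low_idx = np.flatnonzero(mask)  # positions of the points that pass both checks
print(low_idx)
```

Because both neighbors must be within delta, the points right at the edge of a lower period (where the series steps down or back up by more than delta) are excluded.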

If the lower periods are at different levels

Then a global minimum threshold won't work since some periods will be too high. In this case try a neighbor delta with contiguous blocks:

[Figure: neighbor delta with contiguous blocks]

# shift the second period by 5
s = pd.Series(np.hstack([arr, arr + 5]))

delta = 2
blocksize = 10

# check if each point's neighbors are within `delta`
mask_delta = s.diff().abs().le(delta) & s.diff(-1).abs().le(delta)

# check if each point is in a contiguous block of at least `blocksize`
masked = s.where(mask_delta)
groups = masked.isnull().cumsum()
blocksizes = masked.groupby(groups).transform('count').mask(masked.isnull())
mask_contiguous = blocksizes >= blocksize

s.plot(label='raw')
s.where(mask_contiguous).plot(marker='*', label='delta & contiguous')
plt.legend()
plt.show()
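The block-size step is the least obvious part, so here is a small worked sketch on a toy mask (made-up data, not the test series) showing what each intermediate holds. It also shows one possible way to read off the first and last index of each kept block; the names min_size and spans are assumptions for this illustration.

```python
import numpy as np
import pandas as pd

# Toy mask with two runs of True, of lengths 3 and 2
mask = pd.Series([False, True, True, True, False, False, True, True])
values = pd.Series(np.arange(len(mask), dtype=float))

masked = values.where(mask)        # NaN wherever the mask is False
# The counter only advances on NaN, so every run of kept points shares one label
groups = masked.isnull().cumsum()  # 1, 1, 1, 1, 2, 3, 3, 3
# 'count' ignores NaN, so each label gets the number of kept points in its run;
# the NaN positions themselves are blanked out again afterwards
blocksizes = masked.groupby(groups).transform('count').mask(masked.isnull())

print(pd.DataFrame({'masked': masked, 'group': groups, 'blocksize': blocksizes}))

# Keep blocks of at least `min_size` points and report their first/last index
min_size = 3
keep = blocksizes >= min_size      # NaN compares as False
idx = pd.Series(masked.index, index=masked.index)[keep]
spans = idx.groupby(groups[keep]).agg(['first', 'last'])
print(spans)
```

With min_size = 3 only the first run survives, spanning indices 1 to 3. Applied to the real data, the same last two steps would take mask_contiguous in place of keep.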

Answered by tdy, Oct 16 '22 15:10