
Functions to smooth a time series with known dips

I have results of an Internet measurement experiment over time, as shown in the figure below. I am doing time series analysis in pandas. There are certain drops in the data that are due to server outages. I am looking for good ways of smoothing the data.

Among the simpler built-in smoothing functions, pd.rolling_max() (ts.rolling(window).max() in current pandas) provides a reasonably good estimate, although it overestimates a little. I have also experimented with writing my own smoothing function, which carries the previous value forward whenever there is a drop of more than 20%. This provides a reasonably good estimate too, but the threshold is set arbitrarily.

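import pandas as pd

def my_smooth(win, thresh=0.80):
    win = win.copy()
    for i, val in enumerate(win):
        # Start at i > 0 (the original i > 1 never compared index 1 with index 0).
        if i > 0 and val < win[i - 1] * thresh:
            win[i] = win[i - 1]  # carry the previous value forward over the drop
    return win[-1]

# pd.rolling_apply() was removed from pandas; Series.rolling().apply() replaces it.
ts = ts.rolling(6).apply(my_smooth, raw=True)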
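
As a minimal sketch of how the carry-forward rule behaves, the snippet below runs it on a small made-up series. The values, the window of 3, and the helper name `carry_forward_smooth` are illustrative assumptions, not the real measurements.

```python
import numpy as np
import pandas as pd

def carry_forward_smooth(win, thresh=0.80):
    # win is a NumPy array holding one rolling window (raw=True below)
    win = win.copy()
    for i in range(1, len(win)):
        # a value below thresh * previous value is treated as an outage dip
        if win[i] < win[i - 1] * thresh:
            win[i] = win[i - 1]
    return win[-1]

# Made-up counts with two outage-like dips (40 and 45)
demo = pd.Series([100.0, 102.0, 40.0, 105.0, 107.0, 45.0, 110.0, 112.0])
smoothed = demo.rolling(3).apply(carry_forward_smooth, raw=True)
print(smoothed.tolist())
```

In this sketch the two dips are replaced by the value just before them (102 and 107), the first two positions are NaN because the window is not yet full, and all other points pass through unchanged. A dip that sits at the very start of a window has no earlier value to compare against, which is one reason the result depends on both the window length and the threshold.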

My question is: what are better smoothing functions for this type of time series, given its specific characteristics (i.e., it is a count of events, and the major measurement errors are large undercounts at specific times)? Also, can my suggested smoothing function be made less ad hoc, or optimized?

[Figure: the measurement time series]

asked Dec 02 '25 17:12 by Hadi


1 Answer

I would like to add how I eventually solved this issue, for anyone else interested. First, after looking at a number of smoothing techniques, I decided against smoothing because it changes the data. I opted instead to filter out 10% of the points as outliers, a common technique in machine learning and signal processing.

Outliers in our case are low measurements caused by failures in measurement logging. There are a number of techniques for detecting outliers, the most popular of which are named in NIST's Engineering Statistics Handbook. Given the clear trend in my data, I opted for a variation on "Median Absolute Deviation": compare each point in the measurement series with the rolling median, compute the absolute differences, and select a cutoff appropriately (here, the value that marks off the largest 10% of differences).

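import numpy as np
import pandas as pd

# 'data' holds the weekly measurements in a pandas Series
filtered = data.copy()
# Centered 9-point rolling median; replaces the removed pd.rolling_median()
dm = data.rolling(9, center=True).median()
# Absolute deviations from the rolling median, sorted largest first
dev = np.abs(data - dm)
ranked = sorted(dev.dropna(), reverse=True)
cutoff = ranked[len(ranked) // 10]  # deviation at the 10% mark
filtered[dev > cutoff] = np.nan  # mask points deviating more than the cutoff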
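
For readers who want to try the idea without the original data, here is a self-contained sketch of the same rolling-median filter wrapped in a function and run on made-up numbers. The function name `filter_outliers`, its parameters, and the synthetic series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def filter_outliers(data, window=9, frac=0.10):
    """Mask roughly the `frac` largest deviations from a centered rolling median."""
    dm = data.rolling(window, center=True).median()
    dev = (data - dm).abs()
    # Deviations sorted largest first; edge points without a full window are dropped
    ranked = np.sort(dev.dropna().to_numpy())[::-1]
    cutoff = ranked[int(len(ranked) * frac)]
    # Points whose deviation exceeds the cutoff become NaN
    return data.mask(dev > cutoff)

# Made-up weekly counts: a rising trend with two outage-like undercounts
values = np.arange(100.0, 140.0, 1.0)  # 40 points
values[12] = 30.0
values[27] = 25.0
demo = pd.Series(values)

filtered = filter_outliers(demo)
print(filtered[filtered.isna()].index.tolist())
```

On this synthetic series only the two injected undercounts are masked. Because the cutoff is a strict inequality, ties at the cutoff value are kept, so fewer than 10% of the points may be removed. Points near the edges, where the centered window is incomplete, have no rolling median and are left unfiltered.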

answered Dec 05 '25 06:12 by Hadi


