
How to detect anomalies in time series data with trend and seasonality present?

I want to detect outliers in time series data that contains trend and seasonality components. I want to ignore the peaks that are seasonal, consider only the other peaks, and label those as outliers. As I am new to time series analysis, please help me approach this problem.

I am using Python.

Attempt 1 : Using ARIMA model

I trained my model and forecasted the test data. I can then compute the difference between the forecasted and actual values of the test data, and find the outliers based on the variance of that difference.

Implementation of Auto Arima

# Note: pyramid-arima has since been renamed pmdarima (pip install pmdarima; from pmdarima import auto_arima)
!pip install pyramid-arima
from pyramid.arima import auto_arima
stepwise_model = auto_arima(train_log, start_p=1, start_q=1, max_p=3, max_q=3, m=7,
                            start_P=0, seasonal=True, d=1, D=1, trace=True,
                            error_action='ignore', suppress_warnings=True, stepwise=True)

import math
import numpy as np
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from sklearn.metrics import mean_squared_error

Split data into train and test-sets

train, test = actual_vals[0:-70], actual_vals[-70:]

Log Transformation

train_log, test_log = np.log10(train), np.log10(test)

Converting to list

history = [x for x in train_log]
predictions = list()
predict_log=list()

Fitting Stepwise ARIMA model

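for t in range(len(test_log)):
    # Refit on all data seen so far, then forecast one step ahead
    stepwise_model.fit(history)
    output = stepwise_model.predict(n_periods=1)
    predict_log.append(output[0])
    yhat = 10**output[0]  # undo the log10 transform
    predictions.append(yhat)
    obs = test_log[t]
    history.append(obs)  # add the true observation before the next step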

Plotting

from matplotlib import pyplot

figsize=(12, 7)
pyplot.figure(figsize=figsize)
pyplot.plot(test,label='Actuals')
pyplot.plot(predictions, color='red',label='Predicted')
pyplot.legend(loc='upper right')
pyplot.show()
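
The thresholding step described for Attempt 1 (comparing forecasts with actual values) could look roughly like the sketch below. It is only an illustration: the function name, the cutoff k=3 and the toy numbers are assumptions, and it uses the median absolute deviation (MAD) instead of the plain variance so that a large outlier does not inflate its own threshold.

```python
import numpy as np

def flag_forecast_outliers(actual, predicted, k=3.0):
    """Return indices where the forecast error is unusually large.

    Hypothetical helper: k is an assumed cutoff, and the scale is a
    robust MAD estimate rather than the sample standard deviation.
    """
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    center = np.median(errors)
    mad = np.median(np.abs(errors - center))
    scale = 1.4826 * mad  # rescale MAD to be comparable to a normal std
    if scale == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(np.abs(errors - center) > k * scale)

# Toy values standing in for `test` and `predictions` (made up for illustration)
actual = [10.0, 11.0, 10.5, 30.0, 10.8, 11.2, 10.9]
predicted = [10.2, 10.8, 10.6, 10.9, 10.7, 11.0, 11.1]
outlier_idx = flag_forecast_outliers(actual, predicted)
print(outlier_idx)  # positions whose error exceeds the robust threshold
```

With the real data, `actual` would be the test values and `predicted` the one-step forecasts, both on the original (not log) scale.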

But this way I can detect outliers only in the test data. I actually need to detect outliers in the whole time series, including the training data.

Attempt 2 : Using Seasonal Decomposition

I used the code below to split the original data into seasonal, trend, and residual components, as shown in the image below.

from statsmodels.tsa.seasonal import seasonal_decompose

decomposed = seasonal_decompose()

[Image: seasonal decomposition plot with trend, seasonal, and residual panels]

I then use the residuals to find the outliers with a boxplot, since the seasonal and trend components have been removed. Does this make sense?

Or is there a simpler or better approach?

Asked Jul 17 '19 at 06:07 by Raja Sahe S


People also ask

How do you identify an anomaly in a time series?

The procedure for detecting anomalies with ARIMA is: predict each new point from past data and compute the difference between the prediction and the observed value. Then choose a threshold and flag as anomalies the points whose difference exceeds it.

What are the three 3 basic approaches to anomaly detection?

There are three main classes of anomaly detection techniques: unsupervised, semi-supervised, and supervised. Essentially, the correct anomaly detection method depends on the available labels in the dataset.

How do you investigate that a certain trend in the distribution is due to anomaly?

The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of distribution, including mean, median, mode, and quartiles. One of the most popular ways is the Interquartile Range (IQR).

How can you detect seasonality outliers and Changepoints in your time series?

Outlier Detection and Remover: Decompose the time series using seasonal decomposition. Remove the trend and seasonality to generate a residual time series. Detect points in the residual that lie outside 3 times the interquartile range.
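
The decompose-then-IQR procedure above can be sketched with NumPy alone. This is a simplified classical additive decomposition (centered moving-average trend plus per-position seasonal means), not the statsmodels implementation; the function name, the synthetic series, and reading "outside 3 times the interquartile range" as fences at Q1 - 3*IQR and Q3 + 3*IQR are all assumptions.

```python
import numpy as np

def residual_outliers(y, period, k=3.0):
    """Flag points whose decomposition residual falls outside k*IQR fences."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Trend: centered moving average spanning one full seasonal period.
    if period % 2 == 0:
        kernel = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        kernel = np.ones(period) / period
    half = len(kernel) // 2
    trend = np.full(n, np.nan)  # the edges stay NaN, as in classical decomposition
    trend[half:n - half] = np.convolve(y, kernel, mode="valid")
    detrended = y - trend
    # Seasonality: mean detrended value at each position within the cycle.
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]
    resid = detrended - seasonal
    # IQR fences on the residual; NaN edges compare False and are never flagged.
    q1, q3 = np.nanpercentile(resid, [25, 75])
    iqr = q3 - q1
    mask = (resid < q1 - k * iqr) | (resid > q3 + k * iqr)
    return np.flatnonzero(mask), resid

# Synthetic weekly-seasonal series with a linear trend and one injected spike
rng = np.random.default_rng(0)
t = np.arange(70)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=t.size)
y[40] += 25  # injected anomaly
idx, resid = residual_outliers(y, period=7)
print(idx)
```

Because the moving average is not robust, a large spike also leaks into the trend and seasonal estimates around it, so neighbouring points may occasionally be flagged too. The seasonal peaks themselves are absorbed by the seasonal component and are not flagged.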


1 Answer

You can:

  • In the 4th graph (the residual plot) under "Attempt 2 : Using Seasonal Decomposition", check for extreme points; they may lead you to some anomalies in the seasonal series.
  • Supervised (if you have some labeled data): train a classifier.
  • Unsupervised: predict the next value and build a confidence interval around the prediction, then check whether the actual value lies inside it.
  • Calculate the relative extrema of the data using argrelextrema, for example:
import numpy as np
from scipy.signal import argrelextrema
x = np.array([2, 1, 2, 3, 2, 0, 1, 0])
argrelextrema(x, np.greater)

output:

(array([3, 6]),)

Some random data (my implementation of the above argrelextrema): [image]
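
For the "Unsupervised" suggestion in the list above, here is a rough sketch of the predict-then-check-the-interval idea. It swaps in a seasonal-naive forecast (the value one season earlier) as a stand-in for a real model, and builds the interval from a robust spread of recent forecast errors. The function name, window=28, z=4 and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def seasonal_naive_outliers(y, period, window=28, z=4.0):
    """Flag points that fall outside an interval around a seasonal-naive forecast."""
    clean = np.asarray(y, dtype=float).copy()
    errors = []   # forecast errors of points judged normal so far
    flagged = []
    for t in range(period, len(clean)):
        forecast = clean[t - period]  # same position in the previous season
        err = clean[t] - forecast
        recent = np.array(errors[-window:])
        if len(recent) >= window:
            center = np.median(recent)  # absorbs the per-season drift from the trend
            spread = 1.4826 * np.median(np.abs(recent - center))
            if spread > 0 and abs(err - center) > z * spread:
                flagged.append(t)
                # Replace the flagged value so it does not distort the forecast
                # for the same position one season later.
                clean[t] = forecast + center
                continue
        errors.append(err)
    return flagged

# Synthetic series: linear trend + weekly seasonality + noise + one spike
rng = np.random.default_rng(1)
t = np.arange(105)
y = 0.2 * t + 5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=t.size)
y[60] += 10  # injected anomaly
flagged = seasonal_naive_outliers(y, period=7)
print(flagged)
```

The first period + window points are never judged, because the interval needs some error history first. A fitted model (for example the ARIMA from the question) could replace the seasonal-naive forecast without changing the rest of the logic.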

Answered Sep 28 '22 at 00:09 by Dor