Handle Continuous Missing values in time-series data

Tags: r, na, time-series

I have time-series data as shown below.

2015-04-26 23:00:00  5704.27388916015661380
2015-04-27 00:00:00  4470.30868326822928793
2015-04-27 01:00:00  4552.57241617838553793
2015-04-27 02:00:00  4570.22250032825650123
2015-04-27 03:00:00  NA
2015-04-27 04:00:00  NA
2015-04-27 05:00:00  NA
2015-04-27 06:00:00 12697.37724086216439900
2015-04-27 07:00:00  5538.71119009653739340
2015-04-27 08:00:00    81.95060647328695325
2015-04-27 09:00:00  8550.65816895300667966
2015-04-27 10:00:00  2925.76573206583680076

How should I handle continuous (consecutive) NA values? When there is only a single NA, I usually replace it with the average of the two values on either side of it. Are there any standard approaches for dealing with runs of continuous missing values?

Haroon Rashid asked Sep 17 '25 10:09



1 Answer

The zoo package has several functions for dealing with NA values. One of the following functions might suit your needs:

  • na.locf: Last observation carried forward. Using the parameter fromLast = TRUE corresponds to next observation carried backward (NOCB).
  • na.aggregate: Replace the NA's with some aggregated value. The default aggregation function is the mean, but you can specify other functions as well. See ?na.aggregate for more info.
  • na.approx: NA's are replaced with linearly interpolated values.
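The options mentioned in the list above can be tried on a small vector. This is only a sketch: the vector `x` is the question's data rounded to two decimals, and the `maxgap` argument of `na.approx` is not part of the original answer. It is included on the assumption that long runs of NAs are sometimes better left unfilled.

```r
library(zoo)

# The question's values, rounded to two decimals (positions 5-7 are the gap)
x <- c(5704.27, 4470.31, 4552.57, 4570.22, NA, NA, NA,
       12697.38, 5538.71, 81.95, 8550.66, 2925.77)

# Next observation carried backward: the gap takes the value that follows it.
# na.rm = FALSE keeps the result the same length as x.
nocb <- na.locf(x, fromLast = TRUE, na.rm = FALSE)

# Aggregate with the median of the observed values instead of the default mean
agg_median <- na.aggregate(x, FUN = median)

# Only interpolate gaps of at most two NAs; the run of three stays NA
short_gaps <- na.approx(x, maxgap = 2, na.rm = FALSE)

print(cbind(x, nocb, agg_median, short_gaps))
```

Whether a long gap should be filled at all depends on the data; leaving it as NA and handling it downstream is sometimes the safer choice.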

You can compare the outcomes to see what these functions do:

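library(zoo)
# na.rm = FALSE keeps any leading/trailing NAs, so each result has the same length as df$V2
df$V.loc <- na.locf(df$V2, na.rm = FALSE)    # last observation carried forward
df$V.agg <- na.aggregate(df$V2)              # replace NAs with the mean of the observed values
df$V.app <- na.approx(df$V2, na.rm = FALSE)  # linear interpolation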

This results in:

> df
                    V1          V2       V.loc       V.agg       V.app
1  2015-04-26 23:00:00  5704.27389  5704.27389  5704.27389  5704.27389
2  2015-04-27 00:00:00  4470.30868  4470.30868  4470.30868  4470.30868
3  2015-04-27 01:00:00  4552.57242  4552.57242  4552.57242  4552.57242
4  2015-04-27 02:00:00  4570.22250  4570.22250  4570.22250  4570.22250
5  2015-04-27 03:00:00          NA  4570.22250  5454.64894  6602.01119
6  2015-04-27 04:00:00          NA  4570.22250  5454.64894  8633.79987
7  2015-04-27 05:00:00          NA  4570.22250  5454.64894 10665.58856
8  2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9  2015-04-27 07:00:00  5538.71119  5538.71119  5538.71119  5538.71119
10 2015-04-27 08:00:00    81.95061    81.95061    81.95061    81.95061
11 2015-04-27 09:00:00  8550.65817  8550.65817  8550.65817  8550.65817
12 2015-04-27 10:00:00  2925.76573  2925.76573  2925.76573  2925.76573
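The V.app column can be checked by hand, which may help to see what linear interpolation does with a run of NAs. The sketch below uses only base R; the variable names are illustrative, and the two known values are copied from rows 4 and 8 of the table above.

```r
# Known values on either side of the gap (rows 4 and 8 of the table)
left  <- 4570.22250
right <- 12697.37724

# Three NAs between them means four equal steps from left to right
step    <- (right - left) / 4
by_hand <- left + step * 1:3

# Base R's approx() performs the same linear interpolation on the row index
via_approx <- approx(x = c(4, 8), y = c(left, right), xout = 5:7)$y

print(cbind(by_hand, via_approx))
```

Both columns should agree with rows 5 to 7 of V.app up to rounding. For a single NA the same formula reduces to the mean of the two neighbouring values, which is the approach described in the question.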

Used data:

df <- structure(list(V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800, 1430096400, 1430100000, 1430103600, 1430107200, 1430110800, 1430114400, 1430118000, 1430121600), class = c("POSIXct", "POSIXt"), tzone = ""), V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839, 4570.22250032826, NA, NA, NA, 12697.3772408622, 5538.71119009654, 81.950606473287, 8550.65816895301, 2925.76573206584)), .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")

Addition:

The imputeTS and forecast packages provide additional time-series functions for dealing with NAs, including some more advanced methods.

For example:

library("imputeTS")

# Moving average imputation
na_ma(df$V2)

# Imputation via Kalman smoothing on structural time series models
na_kalman(df$V2)

# Plain interpolation with a choice of method (linear, spline, stine)
na_interpolation(df$V2)

or

library("forecast")

# Interpolation via seasonal decomposition (plain linear interpolation for non-seasonal series)
na.interp(df$V2)
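
The imputeTS functions above take arguments that select the method. The following is a minimal sketch, assuming a recent imputeTS version (3.0 or later, where the functions use the `na_` prefix); the vector `x` is the question's data rounded to two decimals, and the choices `k = 2` and `weighting = "simple"` are only illustrative.

```r
library(imputeTS)

# The question's values, rounded to two decimals (positions 5-7 are the gap)
x <- c(5704.27, 4470.31, 4552.57, 4570.22, NA, NA, NA,
       12697.38, 5538.71, 81.95, 8550.66, 2925.77)

# Interpolation method is chosen with the `option` argument
lin <- na_interpolation(x, option = "linear")
spl <- na_interpolation(x, option = "spline")

# Moving average over up to 2 observations on each side, equally weighted
ma <- na_ma(x, k = 2, weighting = "simple")

# Compare the imputed values for the gap only
print(round(cbind(lin, spl, ma)[5:7, ], 2))
```

With a gap this wide relative to the series length, the methods can give quite different values, so it is worth plotting the candidates against the observed data before picking one.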
Jaap answered Sep 20 '25 02:09
