Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dealing with NA in seasonal cycle analysis R

Tags:

r

time-series

xts

I have a timeseries of monthly data with lots of missing datapoints, set to NA. I want to simply subtract the annual cycle from the data, ignoring the missing entries. It seems that the decompose function can't handle missing data points, but I have seen elsewhere that the seasonal package is suggested instead. However I am also running into problems there too with the NA.

Here is a minimum reproducible example of the problem using a built in dataset...

library(seasonal)

# set range to missing NA in Co2 dataset
c2<-co2
c2[c2>330 & c2<350]=NA
seas(c2,na.action=na.omit)

Error in na.omit.ts(x) : time series contains internal NAs

Yes, I know! that's why I asked you to omit them! Let's try this:

seas(c2,na.action=na.x13)

Error: X-13 run failed

Errors:
- Adding MV1981.Apr exceeds the number of regression effects
  allowed in the model (80).

Hmmm, interesting, no idea what that means, okay, please just exclude the NA:

seas(c2,na.action=na.exclude)

Error in na.omit.ts(x) : time series contains internal NAs

that didn't help much! and for good measure

decompose(c2)

Error in na.omit.ts(x) : time series contains internal NAs

I'm on the following:

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

Why is leaving out NA such a problem? I'm obviously being completely stupid, but I can't see what I'm doing wrong with the seas function. Happy to consider an alternative solution using xts.

like image 667
Adrian Tompkins Avatar asked Sep 13 '18 10:09

Adrian Tompkins


2 Answers

My first solution, simply manually calculating the seasonal cycle, converting to a dataframe to subtract the vector and then transforming back.

# seasonal cycle
scycle=tapply(c2,cycle(c2),mean,na.rm=T) 
# converting to df
df=tapply(c2, list(year=floor(time(c2)), month = cycle(c2)), c)
# subtract seasonal cycle
for (i in 1:nrow(df)){df[i,]=df[i,]-scycle}
# convert back to timeseries
anomco2=ts(c(t(df)),start=start(c2),freq=12)

Not very pretty, and not very efficient either.

The comment of missuse lead me to another Seasonal decompose of monthly data including NA in r I missed with a near duplicate question and this suggested the package zoo, which seems to work really well for additive series

library(zoo)
c2=co2
c2[c2>330&c2<350]=NA
d=decompose(na.StructTS(c2)) 
plot(co2)
lines(d$x,col="red")

shows that the series is very well reconstructed through the missing period.

black lines shows Co2 series with missing chunk and the red line is the reconstructed series

The output of deconstruct has the trend and seasonal cycle available. I wish I could transfer my bounty to user https://stackoverflow.com/users/516548/g-grothendieck for this helpful response. Thanks to user missuse too.

However, if the missing portion is at the end of the series, the software has to extrapolate the trend and has more difficulties. The original series (in black) maintains the trend, while the trend is smaller in the reconstructed series (red):

c2=co2
c2[c2>350]=NA
d=decompose(na.StructTS(c2)) 
plot(co2)
lines(d$x,col="red")

extrapolation of data using zoo

Lastly, if instead the missing portion is at the start of the series, the software is unable to extrapolate backwards in time and throws an error... I feel another SO question coming on...

c2=co2
c2[c2<330]=NA
d=decompose(na.StructTS(c2)) 

Error in StructTS(y) :  
the first value of the time series must not be missing
like image 54
Adrian Tompkins Avatar answered Nov 11 '22 18:11

Adrian Tompkins


You could just use some algorithm that fills the missing data before. (e.g. from package imputeTS or zoo)

imputeTS for example has extra imputation algorithms for seasonal time series e.g.:

x <- na_seadec(co2)

Another good option for seasonal data:

x <- na_kalman(co2)

And now just go on without the missing data.

An important hint from Adrian Tompkins (see also comment below): This will work best, when the missing data is somewhere in the middle. For a lot of leading NAs the method is no good choice. In this case it fills the NAs, but it is not able to extrapolate the trend backwards:

c2<-co2
c2[c2<330]<-NA
c3<-na_kalman(c2)
c4<-na_seadec(c2)
plot(co2)
lines(c3,col="blue")
lines(c4,col="red")

enter image description here

like image 2
Steffen Moritz Avatar answered Nov 11 '22 19:11

Steffen Moritz