Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to include the product of two smooth terms in a mgcv gam model

I have had great success using gam to model seasonality for time series data. My latest model clearly shows a weekly pattern in addition to seasonal changes. While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons. So ideally I would like to model my data as:

y ~ f(day in year) + g(day in year) * h(day in week)

where f, g and h are cyclic smooth functions in mgcv

gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=52, bs='cc'):s(day_in_week, k=5, bs='cc')
  , knots=list(
    day_in_year=c(0, 356)
    , day_in_week=c(0,7)
  )
  , data = data
)

Unfortunately this doesn't work and throws back the error NA/NaN argument. I tried using te(day_in_year, day_in_week, k=c(52, 5), bs='cc') which works, but introduces too many degrees of freedom as the model overfits holidays which fall on specific weekdays within the short number of available years.

Is it possible to specify a model the way I am trying to do?

like image 589
unique2 Avatar asked May 12 '14 22:05

unique2


People also ask

What is smooth function in GAM?

In the account that will be given here, Generalized Additive Models (GAMs) extend linear and. generalized linear models to include smooth functions of explanatory variables with the smoothness. determined by either. a parameter that directly controls the smoothness of the curve, or estimated predictive accuracy.

What is a tensor product smooth?

Tensor product smooths provide the natural way of representing smooth interaction terms in regression models because they are invariant to the units in which the covariates are measured, hence avoiding the need for arbitrary decisions about relative scaling of variables.

What is MGCV?

mgcv is an R package for estimating penalized Generalized Linear models including Generalized Additive Models and Generalized Additive Mixed Models. mgcv includes an implementation of 'gam', based on penalized regression splines with automatic smoothness estimation.


2 Answers

Wow, quite an old question!

Regarding interaction

While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons.

The use of tensor product spline basis te is the right way for interaction, although a more decent constructor is ti. You said that te returns you many parameters. Of course. You have k = 52 for the first margin, and k = 5 for the second, then you end up with 52 * 5 - 1 coefficients for this tensor term. But this is just the way an interaction is created.

Note that in mgcv GAM formula, : or * is only valid for interaction between parametric terms. Interaction between smooths are handled by te or ti.

If this is not what you hope for, then what do you expect the "product" to be? A Hadamard product of two marginal design matrices? Then what's the sense of that? By the way, a Hadamard product requires two matrices of the same dimension. However, your two margins do not have the same number of columns.

If you don't understand why I keep talking about matrices, then you need to read Simon's book in 2006. Though GAM estimation explained there is rather outdated now, the construction / setup of GAM (like design matrices and penalty matrices) explained in Chapter 4 does not change at all even after a decade.

OK , let me give you one more hint. The row wise Kronecker product used for construction of te / ti design matrix is not a new invention.

A smooth term s(x) is quite like a factor variable g, as though they appear to be a single variable, they are constructed to be a design matrix with many columns. For g it is a dummy matrix, while for f(x) it is a basis matrix. So, interaction between two smooth functions is constructed in the same way that interaction between two factors is constructed.

If you have a factor g1 of 5 levels, and another factor g2 of 10 levels, their marginal design matrix (after contrasts) have 4 columns and 9 columns, then the interaction g1:g2 would have 36 columns. Such design matrix, is just the row wise Kronecker product of the design matrix of g1 and g2.


Regarding overfitting

As you said, you only have few years of data, maybe 2 or 3? In that case, your model have been over-parametrized by using k = 52 for day_in_year. Try reduces it to for example k = 30.

If overfitting is still evident, here are a few ways to address it.

Way 1: You are using GCV for smoothness selection. Try method = "REML". GCV always inclines to overfit data.

Way 2: Staying with GCV, manually inflate smoothing parameters for a heavier penalization. gamma parameter of gam is useful here. Try gamma = 1.5 for example.


A typo in your code?

The knots location, should it be day_in_year = c(0, 365)?

like image 151
Zheyuan Li Avatar answered Sep 29 '22 05:09

Zheyuan Li


Your model doesn't make much sense as you state that there is:

  1. a seasonal effect,
  2. a day of week effect,
  3. an interaction such that the day of week effect varies with the day of year

This can be fully represented as a tensor product smooth. The model you mention in the comment on the other answer here

y ~ f(day in year) + g(day in year) * h(day in week)

is just a decomposition of the full tensor product if you mean * as main effect + interaction. In which case the model as you have it isn't identifiable - you ave a function of day of year twice. And if you meant the equivalent of :, then your model doesn't have the main effect of day of week, which seems undesirable.

I fit models of this form all the time (just for year and day of year). I would approach this via:

gam(y ~ te(day_of_year, day_of_week, k = c(20, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You could also tweak the knots definitions. I tend to use the following:

knots <- list(day_of_year = c(0.5, 366.5),
              day_of_week = c(0.5, 7.5)

This won't make much difference but you're just putting boundary knots just a little closer to the data.

If you want to decompose the effects, you could fit the model with ti() smooths

gam(y ~ ti(day_of_year, bs = "cc", k = 12) + ti(day_of_week, bs = "cc", k = 6) + 
      te(day_of_year, day_of_week, k = c(12, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You can tweak the values of k to determine suitable values in conjunction with your data and gam.check().

You will also need to add a term to the model to handle holidays. This would be parametric term which applies an adjustment if the day is a holiday - so, create a factor holiday and add it to the model + holiday. You could think of more complex models; perhaps a factor indexing if a week has a holiday coupled with a by factor smooth for the day_of_week component, such that you estimate one weekly pattern if the week is a normal week and a second weekly pattern if the week contains a holiday.

If you show us an example/plot of the data I could expand or comment less generally.

It isn't surprising that the te() smooth you fitted doesn't adapt well to the holidays; the model assumes that the week effect is smooth and smoothly varying with a smooth day of year effect. Holidays are not smooth departures from the weekly pattern in the previous or subsequent week. The holiday effect is not well modelled by smooth relationships and something else is need to account for that effect.

like image 28
Gavin Simpson Avatar answered Sep 29 '22 04:09

Gavin Simpson