I have had great success using gam to model seasonality for time series data. My latest model clearly shows a weekly pattern in addition to seasonal changes. While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons. So ideally I would like to model my data as: <pre class="prettyprint"><code>y ~ f(day in year) + g(day in year) * h(day in week) </code></pre> where <code>f</code>, <code>g</code> and <code>h</code> are cyclic smooth functions in <code>mgcv</code> <pre class="prettyprint"><code>gam( y ~ s(day_in_year, k=52, bs='cc') + s(day_in_year, k=52, bs='cc'):s(day_in_week, k=5, bs='cc') , knots=list( day_in_year=c(0, 356) , day_in_week=c(0,7) ) , data = data ) </code></pre> Unfortunately this doesn't work and throws back the error <code>NA/NaN argument</code>. I tried using <code>te(day_in_year, day_in_week, k=c(52, 5), bs='cc')</code> which works, but introduces too many degrees of freedom as the model overfits holidays which fall on specific weekdays within the short number of available years. Is it possible to specify a model the way I am trying to do?

Wow, quite an old question! <h3>Regarding interaction</h3> <blockquote> While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons. </blockquote> The use of tensor product spline basis <code>te</code> is the right way for interaction, although a more decent constructor is <code>ti</code>. You said that <code>te</code> returns you many parameters. Of course. You have <code>k = 52</code> for the first margin, and <code>k = 5</code> for the second, then you end up with <code>52 * 5 - 1</code> coefficients for this tensor term. But this is just the way an interaction is created. Note that in <code>mgcv</code> GAM formula, <code>:</code> or <code>*</code> is only valid for interaction between parametric terms. Interaction between smooths are handled by <code>te</code> or <code>ti</code>. If this is not what you hope for, then what do you expect the "product" to be? A Hadamard product of two marginal design matrices? Then what's the sense of that? By the way, a Hadamard product requires two matrices of the same dimension. However, your two margins do not have the same number of columns. If you don't understand why I keep talking about matrices, then you need to read Simon's book in 2006. Though GAM estimation explained there is rather outdated now, the construction / setup of GAM (like design matrices and penalty matrices) explained in Chapter 4 does not change at all even after a decade. OK , let me give you one more hint. The row wise Kronecker product used for construction of <code>te</code> / <code>ti</code> design matrix is not a new invention. A smooth term <code>s(x)</code> is quite like a factor variable <code>g</code>, as though they appear to be a single variable, they are constructed to be a design matrix with many columns. For <code>g</code> it is a dummy matrix, while for <code>f(x)</code> it is a basis matrix. So, interaction between two smooth functions is constructed in the same way that interaction between two factors is constructed. If you have a factor <code>g1</code> of 5 levels, and another factor <code>g2</code> of 10 levels, their marginal design matrix (after contrasts) have 4 columns and 9 columns, then the interaction <code>g1:g2</code> would have 36 columns. Such design matrix, is just the row wise Kronecker product of the design matrix of <code>g1</code> and <code>g2</code>. <hr> <h3>Regarding overfitting</h3> As you said, you only have few years of data, maybe 2 or 3? In that case, your model have been over-parametrized by using <code>k = 52</code> for <code>day_in_year</code>. Try reduces it to for example <code>k = 30</code>. If overfitting is still evident, here are a few ways to address it. Way 1: You are using GCV for smoothness selection. Try <code>method = "REML"</code>. GCV always inclines to overfit data. Way 2: Staying with GCV, manually inflate smoothing parameters for a heavier penalization. <code>gamma</code> parameter of <code>gam</code> is useful here. Try <code>gamma = 1.5</code> for example. <hr> <h3>A typo in your code?</h3> The knots location, should it be <code>day_in_year = c(0, 365)</code>?

Your model doesn't make much sense as you state that there is: <ol> <li>a seasonal effect,</li> <li>a day of week effect,</li> <li>an interaction such that the day of week effect varies with the day of year</li> </ol> This can be fully represented as a tensor product smooth. The model you mention in the comment on the other answer here <blockquote> y ~ f(day in year) + g(day in year) * h(day in week) </blockquote> is just a decomposition of the full tensor product if you mean <code>*</code> as main effect + interaction. In which case the model as you have it isn't identifiable - you ave a function of day of year twice. And if you meant the equivalent of <code>:</code>, then your model doesn't have the main effect of day of week, which seems undesirable. I fit models of this form all the time (just for year and day of year). I would approach this via: <pre class="prettyprint"><code>gam(y ~ te(day_of_year, day_of_week, k = c(20, 6), bs = c("cc", "cc")), data = foo, method = "REML", knots = knots) </code></pre> You could also tweak the knots definitions. I tend to use the following: <pre class="prettyprint"><code>knots <- list(day_of_year = c(0.5, 366.5), day_of_week = c(0.5, 7.5) </code></pre> This won't make much difference but you're just putting boundary knots just a little closer to the data. If you want to decompose the effects, you could fit the model with <code>ti()</code> smooths <pre class="prettyprint"><code>gam(y ~ ti(day_of_year, bs = "cc", k = 12) + ti(day_of_week, bs = "cc", k = 6) + te(day_of_year, day_of_week, k = c(12, 6), bs = c("cc", "cc")), data = foo, method = "REML", knots = knots) </code></pre> You can tweak the values of <code>k</code> to determine suitable values in conjunction with your data and <code>gam.check()</code>. You will also need to add a term to the model to handle holidays. This would be parametric term which applies an adjustment if the day is a holiday - so, create a factor <code>holiday</code> and add it to the model <code>+ holiday</code>. You could think of more complex models; perhaps a factor indexing if a week has a holiday coupled with a by factor smooth for the <code>day_of_week</code> component, such that you estimate one weekly pattern if the week is a normal week and a second weekly pattern if the week contains a holiday. If you show us an example/plot of the data I could expand or comment less generally. It isn't surprising that the <code>te()</code> smooth you fitted doesn't adapt well to the holidays; the model assumes that the week effect is smooth and smoothly varying with a smooth day of year effect. Holidays are not smooth departures from the weekly pattern in the previous or subsequent week. The holiday effect is not well modelled by smooth relationships and something else is need to account for that effect.

Is it possible to include the product of two smooth terms in a mgcv gam model

Tags:

r

regression

mgcv

gam

I have had great success using gam to model seasonality for time series data. My latest model clearly shows a weekly pattern in addition to seasonal changes. While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons. So ideally I would like to model my data as:

y ~ f(day in year) + g(day in year) * h(day in week)

where f, g and h are cyclic smooth functions in mgcv

gam(
  y ~ s(day_in_year, k=52, bs='cc') 
  + s(day_in_year, k=52, bs='cc'):s(day_in_week, k=5, bs='cc')
  , knots=list(
    day_in_year=c(0, 356)
    , day_in_week=c(0,7)
  )
  , data = data
)

Unfortunately this doesn't work and throws back the error NA/NaN argument. I tried using te(day_in_year, day_in_week, k=c(52, 5), bs='cc') which works, but introduces too many degrees of freedom as the model overfits holidays which fall on specific weekdays within the short number of available years.

Is it possible to specify a model the way I am trying to do?

589

asked May 12 '14 22:05

unique2

2 Answers

Wow, quite an old question!

Regarding interaction

While the weekly pattern itself is very stable over the year, its amplitude also varies with the seasons.

The use of tensor product spline basis te is the right way for interaction, although a more decent constructor is ti. You said that te returns you many parameters. Of course. You have k = 52 for the first margin, and k = 5 for the second, then you end up with 52 * 5 - 1 coefficients for this tensor term. But this is just the way an interaction is created.

Note that in mgcv GAM formula, : or * is only valid for interaction between parametric terms. Interaction between smooths are handled by te or ti.

If this is not what you hope for, then what do you expect the "product" to be? A Hadamard product of two marginal design matrices? Then what's the sense of that? By the way, a Hadamard product requires two matrices of the same dimension. However, your two margins do not have the same number of columns.

If you don't understand why I keep talking about matrices, then you need to read Simon's book in 2006. Though GAM estimation explained there is rather outdated now, the construction / setup of GAM (like design matrices and penalty matrices) explained in Chapter 4 does not change at all even after a decade.

OK , let me give you one more hint. The row wise Kronecker product used for construction of te / ti design matrix is not a new invention.

A smooth term s(x) is quite like a factor variable g, as though they appear to be a single variable, they are constructed to be a design matrix with many columns. For g it is a dummy matrix, while for f(x) it is a basis matrix. So, interaction between two smooth functions is constructed in the same way that interaction between two factors is constructed.

If you have a factor g1 of 5 levels, and another factor g2 of 10 levels, their marginal design matrix (after contrasts) have 4 columns and 9 columns, then the interaction g1:g2 would have 36 columns. Such design matrix, is just the row wise Kronecker product of the design matrix of g1 and g2.

Regarding overfitting

As you said, you only have few years of data, maybe 2 or 3? In that case, your model have been over-parametrized by using k = 52 for day_in_year. Try reduces it to for example k = 30.

If overfitting is still evident, here are a few ways to address it.

Way 1: You are using GCV for smoothness selection. Try method = "REML". GCV always inclines to overfit data.

Way 2: Staying with GCV, manually inflate smoothing parameters for a heavier penalization. gamma parameter of gam is useful here. Try gamma = 1.5 for example.

A typo in your code?

The knots location, should it be day_in_year = c(0, 365)?

151

answered Sep 29 '22 05:09

Zheyuan Li

Your model doesn't make much sense as you state that there is:

a seasonal effect,
a day of week effect,
an interaction such that the day of week effect varies with the day of year

This can be fully represented as a tensor product smooth. The model you mention in the comment on the other answer here

y ~ f(day in year) + g(day in year) * h(day in week)

is just a decomposition of the full tensor product if you mean * as main effect + interaction. In which case the model as you have it isn't identifiable - you ave a function of day of year twice. And if you meant the equivalent of :, then your model doesn't have the main effect of day of week, which seems undesirable.

I fit models of this form all the time (just for year and day of year). I would approach this via:

gam(y ~ te(day_of_year, day_of_week, k = c(20, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You could also tweak the knots definitions. I tend to use the following:

knots <- list(day_of_year = c(0.5, 366.5),
              day_of_week = c(0.5, 7.5)

This won't make much difference but you're just putting boundary knots just a little closer to the data.

If you want to decompose the effects, you could fit the model with ti() smooths

gam(y ~ ti(day_of_year, bs = "cc", k = 12) + ti(day_of_week, bs = "cc", k = 6) + 
      te(day_of_year, day_of_week, k = c(12, 6), bs = c("cc", "cc")),
    data = foo, method = "REML", knots = knots)

You can tweak the values of k to determine suitable values in conjunction with your data and gam.check().

You will also need to add a term to the model to handle holidays. This would be parametric term which applies an adjustment if the day is a holiday - so, create a factor holiday and add it to the model + holiday. You could think of more complex models; perhaps a factor indexing if a week has a holiday coupled with a by factor smooth for the day_of_week component, such that you estimate one weekly pattern if the week is a normal week and a second weekly pattern if the week contains a holiday.

If you show us an example/plot of the data I could expand or comment less generally.

It isn't surprising that the te() smooth you fitted doesn't adapt well to the holidays; the model assumes that the week effect is smooth and smoothly varying with a smooth day of year effect. Holidays are not smooth departures from the weekly pattern in the previous or subsequent week. The holiday effect is not well modelled by smooth relationships and something else is need to account for that effect.

answered Sep 29 '22 04:09

Gavin Simpson

Related questions
                            
                                R plot title encoding in Pdf
                            
                                Overdraw mean points in grouped boxplot with ggplot2
                            
                                Vastly different results for SVM model using e1071 and caret
                            
                                Grabbing object names from within a function
                            
                                How to plot NA bar with ggplot2
                            
                                Efficiently selecting top number of rows for each unique value of a column in a data.frame
                            
                                Sort by absolute value
                            
                                Using parallelisation to scrape web pages with R
                            
                                qqline in ggplot2 with facets
                            
                                export all the content of r script into pdf
                            
                                testthat: handling both warning and value
                            
                                Ignore safety check when using setnames
                            
                                Generating a consistent dynamic color palette in ggplot within a loop?
                            
                                How to plot overlapping ranges with ggplot2
                            
                                How to remove selected R variables without having to type their names
                            
                                Dictionary() is not supported anymore in tm package. How to emend code?
                            
                                R - aggregate data for 1 column by another column, based on statistices on a 3rd column
                            
                                varType usage in RODBC
                            
                                Creating month-sized bins with ggplot2 in r
                            
                                "Permission denied" when playing WAV file

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With