ordisurf vs mgcv:gam model

Question

I wanted to update the GAM model behind ordisurf function in R vegan (add random factor). To do that I first constructed the model that should be identical to the one behind ordisurf as explained here: https://www.fromthebottomoftheheap.net/2011/06/10/what-is-ordisurf-doing/

However, the ordisurf and GAM model I made in mgcv give really different results. Example code with dune data below. Can anyone explain? What should the GAM model look like to perform similarly to ordisurf? (Important to know before one starts improving it... :))

ord <-metaMDS(dune, k=3) 

surf <- ordisurf(ord, dune.env$A1, choices = c(2,3))

scrs <- data.frame(scores(ord, display = "sites", choices = c(1,2,3)))

dat <- with(dune, cbind(scrs, dune.env$A1))

mod_23 <- gam(dune.env$A1 ~ s(NMDS2, NMDS3, k = 10), data = dat)

plot.gam(mod_12, se=FALSE, cex = 1, pch = 1, col="blue")

Gavin Simpson · Accepted Answer

The best answer to questions of this sort is to read the source code to see what is actually being done. Since I wrote that blog post, we decided to tweak the default settings for ordisurf() to reflect good practice in estimating GAMs and to be stricter about model selection.

The first of these means we fit the model with mgcv::gam() using method = 'REML', to take advantage of REML smoothness selection. GCV smoothness selection is (currently) the default in mgcv::gam(), but it is not recommended and Simon Wood, the developer of mgcv, hints that some future version of mgcv may change the default from method = 'GCV.Cp' to method = 'REML'. The reason for this is that GCV smoothness selection has been observed to undersmooth in many applications. This happens where the profile of the GCV function gets very flat around the global optimum GCV score. Slight differences in the data can result in very different smoothness parameters being selected, some of which result in considerable undersmoothing. REML and ML smoothness selection has been shown to be less affected by this issue; the profile of the REML score tends to be more curved and have a pronounced minimum.

The second point, and why we now use select = TRUE, results from the observation that the smoothness penalty in the GAM which is what chooses the degree of wiggliness of the estimated surface, only affects the wiggly basis functions. The basis expansion contains a couple of basis functions that, from the point of view of the penalty, are perfectly smooth; there are two linear, 2-D basis functions representing two linear planes, which have, by definition, zero curvature (zero second derivative) and hence don't contribute at all to the wiggliness penalty (the default anyway, which is to use a penalty on the curvature of the estimated spline). The end result is that the smoothness penalty can penalize all the way back to an estimated planar surface, but no further. What this means in practice is that users could interpret the result of the ordisurf() model as being a linear surface, even if that linear surface was not statistically significant (and many users just look at the plot and not the underlying GAM which would tell them if their plane was significant or not). What select = TRUE does is to add a second penalty to the smooth, which only affects the perfectly smooth basis functions. This has the effect of the GAM being able to penalise/shrink out of the model entirely the surface we are estimating. In other words, it can shrink the surface to a flat, horizontal surface of zero effect === zero relationship between the ordination configuration and the response variable.

Together, I felt these options best guarded against users getting false positives.

If you change your gam() call to be:

mod_23 <- gam(dune.env$A1 ~ s(NMDS2, NMDS3, k = 10), data = dat,
              method = 'REML', select = TRUE)

then you should get the same output as that produced by ordisurf():

r$> mod_23                                                                       

Family: gaussian 
Link function: identity 

Formula:
dune.env$A1 ~ s(NMDS2, NMDS3, k = 10)

Estimated degrees of freedom:
0.285  total = 1.28 

REML score: 43.25057     

r$> surf                                                                         

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1, x2, k = 10, bs = "tp", fx = FALSE)

Estimated degrees of freedom:
0.285  total = 1.28 

REML score: 43.25057

ordisurf vs mgcv:gam model

Tags:

r

gam

vegan

k_ru

1 Answers

Gavin Simpson

Recent Activity

Donate For Us

ordisurf vs mgcv:gam model

Tags:

r

gam

vegan

k_ru

1 Answers

Gavin Simpson

Related questions

Recent Activity

Donate For Us