I have some data, a sample of which is below. My goal is to fit a GAM
to each Year and to add a column holding the predicted values from that model.
fertility <- structure(list(AGE = c(15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L,
36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L
), Year = c(1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930,
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930,
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931,
1931, 1931, 1931, 1931, 1931, 1931, 1931), fertility = c(5.170284269,
14.18135114, 27.69795144, 44.61216712, 59.08896308, 89.66036496,
105.4563852, 120.1754041, 137.4074262, 148.7159407, 161.5645606,
157.200515, 143.6340251, 127.8855125, 117.7343628, 159.2909484,
126.6158821, 109.0681613, 86.98223678, 70.64470361, 111.0070633,
86.15051988, 68.9204159, 55.92722274, 42.93402958, 56.84376018,
39.35337243, 26.72142573, 18.46207596, 9.231037978, 4.769704534,
13.08261815, 25.55198857, 41.15573626, 54.51090896, 81.99522459,
96.44082973, 109.9015072, 125.6603492, 136.0020892, 148.679958,
144.6639404, 132.1793638, 117.6867783, 108.345172, 144.2820726,
114.68575, 98.79142865, 78.7865069, 63.9883456, 100.217918, 77.77726461,
62.22181169, 50.49147014, 38.76112859, 52.48807067, 36.33789508,
24.67387938, 17.04740757, 8.523703784)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("AGE",
"Year", "fertility"))
So, the non-dplyr, "dumb" way to do it would be
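library(dplyr)
pred <- numeric(nrow(fertility))  # preallocate the vector of predictions
count <- 0
for (i in 1930:1931) {
  count <- count + 1
  temp <- filter(fertility, Year == i)
  mod <- mgcv::gam(fertility ~ s(AGE), data = temp)
  # fill the block of 30 rows (ages 15 to 44) belonging to this Year
  pred[length(15:44) * (count - 1) + 1:30] <- predict(mod, newdata = data.frame(AGE = 15:44))
}
fertility1 <- mutate(fertility, pred = pred)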
But I'd like a method in dplyr. My thought was to use do to create a model for each Year, then use predict to obtain the values. The first step I can do, but I'm struggling to implement the second part in dplyr:
library(mgcv)
library(dplyr)
fertility %>%
#filter(!is.na(fertility)) %>% # not sure if this is necessary
group_by(Year) %>%
dplyr::do(model = mgcv::gam(fertility ~ s(AGE), data = .)) %>%
left_join(fertility, .) %>%
mutate(smoothed = predict(model, newdata = AGE))
I get the error message
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "list"
which presumably means dplyr doesn't remember that model is a model, rather than just a list element.
The smart way to do this would be to use factor-smooth interactions, which have been available in mgcv for ages, either via by terms in s() or via the newer bs = "fs" basis type. Here is an example with your data:
library("mgcv")
## Make Year a factor
fertility <- transform(fertility, Year = factor(Year))
## Fit model using by terms - include factor as fixed effect too!
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = fertility)
## Plot to see what form this model takes
plot(mod, pages = 1)
## Some prediction data
ages <- with(fertility, seq(min(AGE), max(AGE)))
## Need to replicate this once per Year
pdat <- with(fertility,
             data.frame(AGE = rep(ages, nlevels(Year)),
                        ## keep Year a factor with the same levels as in the fit
                        Year = factor(rep(levels(Year), each = length(ages)),
                                      levels = levels(Year))))
## Add the fitted values to the prediction data
pdat <- transform(pdat, fitted = predict(mod, newdata = pdat))
head(pdat)
> head(pdat)
AGE Year fitted
1 15 1930 -0.8496705
2 16 1930 15.9568574
3 17 1930 33.0754019
4 18 1930 50.7419122
5 19 1930 68.9116594
6 20 1930 87.1306489
However, you can just ask for the fitted values if all you want is predictions at the observed values of AGE:
fertility <- transform(fertility, fitted = predict(mod))
head(fertility)
> head(fertility)
AGE Year fertility fitted
1 15 1930 5.170284 -0.8496705
2 16 1930 14.181351 15.9568574
3 17 1930 27.697951 33.0754019
4 18 1930 44.612167 50.7419122
5 19 1930 59.088963 68.9116594
6 20 1930 89.660365 87.1306489
You might also look at the specific factor-smooth basis type bs = "fs"; see ?smooth.terms and ?factor.smooth.interaction for details. Basically, these are efficient if you have a lot of levels but want each level's smoother to share the same value of the smoothing parameter.
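As a rough illustration of the bs = "fs" form, here is a minimal sketch. It uses simulated toy data (the object toy and its values are hypothetical, standing in for fertility), and method = "REML" is my assumption rather than something required by the basis:

```r
library(mgcv)

## Toy data standing in for `fertility` (simulated values, not the real data)
set.seed(1)
toy <- expand.grid(AGE = 15:44, Year = factor(c(1930, 1931)))
toy$fertility <- 150 * exp(-((toy$AGE - 26)^2) / 100) +
  rnorm(nrow(toy), sd = 5)

## One smooth of AGE per level of Year; all levels share one smoothing parameter
mod_fs <- gam(fertility ~ s(AGE, Year, bs = "fs"), data = toy, method = "REML")

## Fitted values at the observed ages, one per row of the data
toy$fitted <- as.numeric(predict(mod_fs))
head(toy)
```

With only two years this is unlikely to gain much over by terms; the shared smoothing parameter pays off mainly when there are many levels.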
The main advantage here is that you use all your data and fit a single model, which you can then interrogate in a number of ways not easily open to you if you fit m separate models, such as being able to investigate differences in the smoothers per year.
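If you nevertheless want separate per-Year fits inside a dplyr pipeline, as in the question, one possible sketch is below. It assumes a dplyr version that provides group_modify(), and it again uses simulated toy data (toy is hypothetical) in place of fertility. The idea is to fit and predict inside each group, so the model never has to be stored in a list column:

```r
library(dplyr)
library(mgcv)

## Toy data standing in for `fertility` (simulated values, not the real data)
set.seed(1)
toy <- expand.grid(AGE = 15:44, Year = c(1930, 1931))
toy$fertility <- 150 * exp(-((toy$AGE - 26)^2) / 100) +
  rnorm(nrow(toy), sd = 5)

## Fit one GAM per Year and attach its fitted values to that Year's rows
smoothed <- toy %>%
  group_by(Year) %>%
  group_modify(~ mutate(.x, smoothed = as.numeric(
    predict(gam(fertility ~ s(AGE), data = .x))
  ))) %>%
  ungroup()

head(smoothed)
```

Here predict() is called on a single gam object rather than on a list, which avoids the UseMethod error shown in the question.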