I am using the mgcv
package to model the ozone pollution concentration according to some environmental covariates. The model takes the form :
model1 <- gam(O3 ~ s(X, Y, bs = "tp", k = 10) + wd + s(date, bs = "cc", k = 100) + district,
data = mydata, family = gaussian(link ="log"),
na.action = "na.omit", method = "REML")
And here is the structure of covariates:
> str(mydata)
'data.frame': 7100 obs. of 286 variables:
$ date : Date, format: "2016-01-01" "2016-01-01" "2016-01-01" ...
$ O3 : num 0.0141 0.0149 0.0102 0.0159 0.0186 ...
$ district : Factor w/ 10 levels "bc","bh","dl",..: 1 8 7 8 2 6 4 4 10 2 ...
$ wd : Factor w/ 16 levels "E","ENE","ESE",..: 13 13 13 13 13 2 9 9 11 13 ...
$ X : num 0.389 0.365 1 0.44 0.892 ...
$ Y : num 0.311 0.204 0.426 0.223 0.162 ...
I am stuck on an
error in R: 'names' attribute [1] must be the same length as the vector [0].
I try to find where the problem is by delete the term of s(date, bs = "cc", k = 100)
from the fomular and it could work well. It seems like there is something wrong with date field.
I'm not exactly sure how to fix this problem. Any advice would be greatly appreciated!
The date
variable won't be automatically converted to a numeric variable; you need to do this yourself. I normally process such information as follows
mydata <- transform(mydata, ndate = as.numeric(date),
nyear = as.numeric(format(date, '%Y')),
nmonth = as.numeric(format(date, '%m')),
doy = as.numeric(format(date, '%j')))
Then I can choose to model the time component in a number of ways:
ndate
of nyear
with a non-cyclic spline, ornmonth
or doy
(for day of year), orIt's unclear from your question if your data are restricted to a single year. If the data span multiple years then you can't just use the cyclic spline on the ndate
variable. You will need either a very complex standard spline (option 1) or include two splines, one for the between year part and one for the within year part (option 3.)
If your data is over multiple years then I would set the model up as
O3 ~ s(X, Y, bs = "tp", k = 10) + wd + s(doy, bs = 'cc', k = 20) +
s(ndate, bs = "tp", k = 50) + district
or perhaps s(nyear, .... )
will be sufficient instead of s(ndate, .... )
.
This kind of decomposition of the time component is useful as you can often do a better job of fitting the series via two simple, well-estimated smooths than a single more complex smooth. It also allows you to test for within and between year effects.
If you need the seasonal cycle to vary with the trend, then a tensor product is helpful:
O3 ~ s(X, Y, bs = "tp", k = 10) + wd +
te(doy, ndate, bs = c('cc','tp'), k = c(20,50)) + district
For cyclic splines you may also want to set the knots
argument, especially if your data don't quite span the full range of days of year etc. For doy
I would use knots = list(doy = c(0.5, 366.5))
as this allows Dec 31st and Jan 1st to have slightly different estimated values. For nmonth
this is more important as otherwise Dec and Jan would get the same fitted value. I use: knots = list(nmonth = c(0.5, 12.5))
.
The idea here is that 1
and 12
reflect the middle of the respective month and 0.5
and 12.5
the beginning and end of the first and last months, which we might expect to be the same.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With