I'm wondering if there is a cleaner way than just dummy-coding months (e.g., isJan, isFeb...) to have more meaningful independent variable names (under intercept). My data set is rather large, so I've simulated a simple one here.
#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)
#regression with months
model <- lm(sales ~ months(dates))
summary(model)
I would like the intercept labels to show the actual month they refer to...currently my output looks like this:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 999.1934 1.2673 788.432 <2e-16 ***
months(dates).L -4.9537 4.5689 -1.084 0.2785
months(dates).Q -6.4931 4.4211 -1.469 0.1422
months(dates).C -5.5078 4.4180 -1.247 0.2128
months(dates)^4 2.3713 4.4864 0.529 0.5972
months(dates)^5 -1.7749 4.4605 -0.398 0.6908
months(dates)^6 1.5774 4.4555 0.354 0.7234
months(dates)^7 -10.9954 4.4511 -2.470 0.0137 *
months(dates)^8 -0.9627 4.4032 -0.219 0.8270
months(dates)^9 1.8847 4.2996 0.438 0.6612
months(dates)^10 -8.5990 4.1776 -2.058 0.0398 *
months(dates)^11 7.8436 4.1292 1.900 0.0578 .
Thanks in advance, --JT
The problem is that R has created an ordered factor, and the default contrasts for an ordered factor are polynomial contrasts (.L is linear, .Q is quadratic, .C is cubic, and ^n is the n-th order polynomial). It may be better to define the month as an unordered factor, set the first level to January, and then fit the model.
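As a minimal sketch of why the labels differ, the following compares the default contrasts of an ordered and an unordered factor built from the same four month names. The object names m_ord and m_unord are made up for illustration, and the sketch assumes R's default contrasts option (contr.treatment for unordered factors, contr.poly for ordered ones).

```r
# Four months are enough to show the naming pattern.
lv <- month.name[1:4]

m_ord   <- factor(lv, levels = lv, ordered = TRUE)  # ordered factor
m_unord <- factor(lv, levels = lv)                  # plain (unordered) factor

# Ordered factors get polynomial contrasts, hence the .L/.Q/.C labels.
colnames(contrasts(m_ord))    # ".L" ".Q" ".C"

# Unordered factors get treatment contrasts, labelled by level name.
colnames(contrasts(m_unord))  # "February" "March" "April"
```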
If you are in an English locale, you can use the month.name or month.abb constants (month.abb matches the abbreviated names produced by the "%b" format) as follows:
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
                                     levels = month.name))
This gives
> head(dat)
sales dates month
1 1054.8383 2010-01-01 January
2 977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6 995.7550 2010-01-06 January
> with(dat, levels(month))
[1] "January" "February" "March" "April" "May"
[6] "June" "July" "August" "September" "October"
[11] "November" "December"
Note that the levels are in calendar order rather than alphabetical order. If you are in a non-English locale, the output of "%B" will be the month names in your local language or convention, and you will need to supply the correct levels as a character vector to the levels argument in the code above.
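One possible way to get the levels for a non-English locale, sketched here as an assumption rather than as part of the original answer, is to format one date from each calendar month with the same "%B" format. The name month_levels is made up for illustration.

```r
# Same date range as above; sales are not needed for this sketch.
dat <- data.frame(dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))

# One date per calendar month, January to December, formatted with "%B":
# the result is in calendar order and uses the current locale's spelling.
month_levels <- format(ISOdate(2000, 1:12, 1), "%B")

# The labels and the levels come from the same format, so they match.
dat$month <- factor(format(dat$dates, "%B"), levels = month_levels)
levels(dat$month)
```

Because both the labels and the levels are generated by format(), the factor should contain no NA values whichever locale is active.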
This data set can then be used to fit the model and we get more meaningful coefficient names
> mod <- lm(sales ~ month, data = dat)
> summary(mod)
Call:
lm(formula = sales ~ month, data = dat)
Residuals:
Min 1Q Median 3Q Max
-140.333 -24.551 0.108 28.102 134.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1001.7034 4.1567 240.983 <2e-16 ***
monthFebruary -8.3618 6.0153 -1.390 0.165
monthMarch -0.5347 5.8785 -0.091 0.928
monthApril -7.5618 5.9273 -1.276 0.202
monthMay -2.2961 5.8785 -0.391 0.696
monthJune 3.5091 5.9273 0.592 0.554
monthJuly -4.9975 5.8785 -0.850 0.395
monthAugust -0.3558 5.8785 -0.061 0.952
monthSeptember 3.7597 5.9970 0.627 0.531
monthOctober -2.5948 6.5724 -0.395 0.693
monthNovember -10.5670 6.6378 -1.592 0.112
monthDecember -6.9064 6.5724 -1.051 0.294
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173, Adjusted R-squared: 0.0007317
F-statistic: 1.066 on 11 and 988 DF, p-value: 0.3854
In the above, note that January is the first level, so its mean is the (Intercept) estimate and the other estimates are deviations from the January mean. An alternative parameterisation of the model is to suppress the intercept:
> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)
Call:
lm(formula = sales ~ month - 1, data = dat)
Residuals:
Min 1Q Median 3Q Max
-140.333 -24.551 0.108 28.102 134.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
monthJanuary 1001.703 4.157 241.0 <2e-16 ***
monthFebruary 993.342 4.348 228.5 <2e-16 ***
monthMarch 1001.169 4.157 240.9 <2e-16 ***
monthApril 994.142 4.225 235.3 <2e-16 ***
monthMay 999.407 4.157 240.4 <2e-16 ***
monthJune 1005.213 4.225 237.9 <2e-16 ***
monthJuly 996.706 4.157 239.8 <2e-16 ***
monthAugust 1001.348 4.157 240.9 <2e-16 ***
monthSeptember 1005.463 4.323 232.6 <2e-16 ***
monthOctober 999.109 5.091 196.3 <2e-16 ***
monthNovember 991.136 5.175 191.5 <2e-16 ***
monthDecember 994.797 5.091 195.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 5.175e+04 on 12 and 988 DF, p-value: < 2.2e-16
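Now the estimates are the monthly means themselves, and the t-tests are of the hypothesis that each individual monthly mean is zero. Note that the R-squared of 0.9984 is not comparable with the earlier 0.01173: without an intercept, R computes it relative to zero rather than relative to the overall mean.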
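The link between the two parameterisations can be checked with a short sketch. It refits both models on the simulated data and assumes an English locale, so that "%B" matches month.name. The helper names b, from_mod and raw_means are made up for illustration.

```r
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
# Assumes an English locale so the formatted names match month.name.
dat$month <- factor(format(dat$dates, "%B"), levels = month.name)

mod  <- lm(sales ~ month, data = dat)      # January as baseline
mod2 <- lm(sales ~ month - 1, data = dat)  # no intercept

# Intercept model: January mean, then January mean + each deviation.
b <- unname(coef(mod))
from_mod <- c(b[1], b[1] + b[-1])

# Raw monthly means computed directly from the data.
raw_means <- as.vector(tapply(dat$sales, dat$month, mean))

# Both comparisons should hold up to numerical tolerance.
all.equal(from_mod, unname(coef(mod2)))
all.equal(raw_means, unname(coef(mod2)))
```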
Create a month variable that is a factor, and R will automatically create pretty names.
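# simulate sales and convert the day counts to Date objects
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")
data <- data.frame(sales, dates)
# month names as an unordered factor (levels are sorted alphabetically)
data$months <- as.factor(months(dates))
model <- lm(sales ~ months, data = data)
summary(model)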
Because as.factor() sorts the levels alphabetically, April becomes the reference (baseline) level, but you can change this with relevel() or contrasts().
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1001.3989 4.2880 233.535 <2e-16 ***
monthsAugust 6.8982 6.0150 1.147 0.2517
monthsDecember -6.0561 6.7140 -0.902 0.3673
monthsFebruary -1.3977 6.1527 -0.227 0.8203
monthsJanuary -3.2086 6.0150 -0.533 0.5939
monthsJuly -10.0742 6.0150 -1.675 0.0943 .
monthsJune -3.3393 6.0641 -0.551 0.5820
monthsMarch 0.3159 6.0150 0.053 0.9581
monthsMay -0.1448 6.0150 -0.024 0.9808
monthsNovember 3.4901 6.7799 0.515 0.6068
monthsOctober 3.2082 6.7140 0.478 0.6329
monthsSeptember -7.3039 6.1343 -1.191 0.2341
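As a rough sketch of changing the reference level, relevel() can move January to the front before fitting. This assumes an English locale, so that months() returns "January", and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")
data <- data.frame(sales, dates)
data$months <- as.factor(months(dates))

# Make January the baseline; the remaining levels keep their
# alphabetical order behind it.
data$months <- relevel(data$months, ref = "January")

model <- lm(sales ~ months, data = data)
names(coef(model))  # no "monthsJanuary": it is absorbed into the intercept
```

If full calendar order is wanted rather than just a different baseline, passing levels = month.name to factor() does both at once.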