Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R regression with months as independent variables (labels)

Tags:

r

I'm wondering if there is a cleaner way than just dummy-coding months (e.g., isJan, isFeb...) to have more meaningful independent variable names (under intercept). My data set is rather large, so I've simulated a simple one here.

#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)

#regression with months 
model <- lm(sales ~ months(dates))
summary(model) 

I would like the intercept labels to show the actual month they refer to...currently my output looks like this:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      999.1934     1.2673 788.432   <2e-16 ***
months(dates).L   -4.9537     4.5689  -1.084   0.2785    
months(dates).Q   -6.4931     4.4211  -1.469   0.1422    
months(dates).C   -5.5078     4.4180  -1.247   0.2128    
months(dates)^4    2.3713     4.4864   0.529   0.5972    
months(dates)^5   -1.7749     4.4605  -0.398   0.6908    
months(dates)^6    1.5774     4.4555   0.354   0.7234    
months(dates)^7  -10.9954     4.4511  -2.470   0.0137 *  
months(dates)^8   -0.9627     4.4032  -0.219   0.8270    
months(dates)^9    1.8847     4.2996   0.438   0.6612    
months(dates)^10  -8.5990     4.1776  -2.058   0.0398 *  
months(dates)^11   7.8436     4.1292   1.900   0.0578 . 

Thanks in advance, --JT

like image 879
JimmyT Avatar asked Jun 08 '12 18:06

JimmyT


2 Answers

The problem you have is that R has created an ordered factor and the contrasts produced for an ordered factor a polynomial contrasts (.L is linear, .Q is quadratic, .C cubic and .^n is the n-th order polynomial. It may be better to define the month as a factor, set the first level to January and then fit the model.

If in an English locale, then we can use the month.name or month.abb constants as follows

set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
                                     levels = month.name))

This gives

> head(dat)
      sales      dates   month
1 1054.8383 2010-01-01 January
2  977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6  995.7550 2010-01-06 January
> with(dat, levels(month))
 [1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December"

Note the order of the levels is in a logical rather than alphabetical order. If you are in a none English locale then the output of "%B" will be the month names in your local language or convention. You will then need to provide the correct levels as a character vector to the levels argument in the code above.

This data set can then be used to fit the model and we get more meaningful coefficient names

> mod <- lm(sales ~ month, data = dat)
> summary(mod)

Call:
lm(formula = sales ~ month, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1001.7034     4.1567 240.983   <2e-16 ***
monthFebruary    -8.3618     6.0153  -1.390    0.165    
monthMarch       -0.5347     5.8785  -0.091    0.928    
monthApril       -7.5618     5.9273  -1.276    0.202    
monthMay         -2.2961     5.8785  -0.391    0.696    
monthJune         3.5091     5.9273   0.592    0.554    
monthJuly        -4.9975     5.8785  -0.850    0.395    
monthAugust      -0.3558     5.8785  -0.061    0.952    
monthSeptember    3.7597     5.9970   0.627    0.531    
monthOctober     -2.5948     6.5724  -0.395    0.693    
monthNovember   -10.5670     6.6378  -1.592    0.112    
monthDecember    -6.9064     6.5724  -1.051    0.294    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173,    Adjusted R-squared: 0.0007317 
F-statistic: 1.066 on 11 and 988 DF,  p-value: 0.3854

In the above, note that January is the first level so its mean is the (Intercept) estimate and the other estimates are deviations from the January mean. An alternative parameterisation of the model is to suppress the intercept:

> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)

Call:
lm(formula = sales ~ month - 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
monthJanuary   1001.703      4.157   241.0   <2e-16 ***
monthFebruary   993.342      4.348   228.5   <2e-16 ***
monthMarch     1001.169      4.157   240.9   <2e-16 ***
monthApril      994.142      4.225   235.3   <2e-16 ***
monthMay        999.407      4.157   240.4   <2e-16 ***
monthJune      1005.213      4.225   237.9   <2e-16 ***
monthJuly       996.706      4.157   239.8   <2e-16 ***
monthAugust    1001.348      4.157   240.9   <2e-16 ***
monthSeptember 1005.463      4.323   232.6   <2e-16 ***
monthOctober    999.109      5.091   196.3   <2e-16 ***
monthNovember   991.136      5.175   191.5   <2e-16 ***
monthDecember   994.797      5.091   195.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984 
F-statistic: 5.175e+04 on 12 and 988 DF,  p-value: < 2.2e-16

Now the Estimates are of the monthly means and the t-tests are of the hypothesis that the individual monthly means are zero (0).

like image 176
Gavin Simpson Avatar answered Nov 03 '22 01:11

Gavin Simpson


Create a month variable that is a factor, and R will automatically create pretty names.

sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609),origin='1970-01-01')
data <- data.frame(sales, dates)
data$months=as.factor(months(dates))

model <- lm(sales ~ months,data=data)
summary(model) 

It automatically picks April as the contrast month, but you can change this with contrasts.

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1001.3989     4.2880 233.535   <2e-16 ***
monthsAugust       6.8982     6.0150   1.147   0.2517    
monthsDecember    -6.0561     6.7140  -0.902   0.3673    
monthsFebruary    -1.3977     6.1527  -0.227   0.8203    
monthsJanuary     -3.2086     6.0150  -0.533   0.5939    
monthsJuly       -10.0742     6.0150  -1.675   0.0943 .  
monthsJune        -3.3393     6.0641  -0.551   0.5820    
monthsMarch        0.3159     6.0150   0.053   0.9581    
monthsMay         -0.1448     6.0150  -0.024   0.9808    
monthsNovember     3.4901     6.7799   0.515   0.6068    
monthsOctober      3.2082     6.7140   0.478   0.6329    
monthsSeptember   -7.3039     6.1343  -1.191   0.2341    
like image 28
nograpes Avatar answered Nov 03 '22 00:11

nograpes