I am running summary(lm(...)) in R. When I print the coefficients I get estimates for every variable except the last one, which comes back as "NA".
I tried switching the last column of the data with another column and, again, whatever ended up in the last column got "NA" while everything else got estimates.
A little bit about the data: I have about 5 variables with data in every row, plus 12 seasonal dummy variables: the January variable is 1 for every day in January and 0 otherwise, the February variable is 1 in February and 0 otherwise, and so on. Does anyone know what would produce "NA" for the last column in the coefficient estimates? The first time I ran it, it was the coefficient for the December dummy variable. Is it because of my monthly dummy variables? Thanks
This is my reproducible example.
    dat <- data.frame(one <- c(sample(1000:1239)),
                      two <- c(sample(200:439)),
                      three <- c(sample(600:839)),
                      Jan <- c(rep(1,20), rep(0,220)),
                      Feb <- c(rep(0,20), rep(1,20), rep(0,200)),
                      Mar <- c(rep(0,40), rep(1,20), rep(0,180)),
                      Apr <- c(rep(0,60), rep(1,20), rep(0,160)),
                      May <- c(rep(0,80), rep(1,20), rep(0,140)),
                      Jun <- c(rep(0,100), rep(1,20), rep(0,120)),
                      Jul <- c(rep(0,120), rep(1,20), rep(0,100)),
                      Aug <- c(rep(0,140), rep(1,20), rep(0,80)),
                      Sep <- c(rep(0,160), rep(1,20), rep(0,60)),
                      Oct <- c(rep(0,180), rep(1,20), rep(0,40)),
                      Nov <- c(rep(0,200), rep(1,20), rep(0,20)),
                      Dec <- c(rep(0,220), rep(1,20)))
    attach(dat)
    summary(lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun +
               Jul + Aug + Sep + Oct + Nov + Dec))
An NA coefficient in a regression indicates that the variable in question is linearly dependent on the other variables, so no separate effect can be estimated for it.
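You can see that dependence concretely with month dummies: once an intercept is in the model, a full set of twelve indicators is redundant, because in every row they sum to exactly 1, which duplicates the intercept's column of ones. Here is a minimal, self-contained sketch (toy data, not taken from the question):

    # Minimal sketch of the dummy-variable trap (hypothetical toy data):
    # the twelve month indicators always sum to the intercept's column of ones.
    m <- factor(rep(month.abb, each = 2), levels = month.abb)
    X <- model.matrix(~ m - 1)    # 12 indicator columns, one per month
    all(rowSums(X) == 1)          # TRUE: Jan + Feb + ... + Dec = 1 in every row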
You have to think a bit more about how your model is defined.
Here's your approach (edited for readability):
    > set.seed(101)
    > dat <- data.frame(one=c(sample(1000:1239)),
                        two=c(sample(200:439)),
                        three=c(sample(600:839)),
                        Jan=c(rep(1,20), rep(0,220)),
                        Feb=c(rep(0,20), rep(1,20), rep(0,200)),
                        Mar=c(rep(0,40), rep(1,20), rep(0,180)),
                        Apr=c(rep(0,60), rep(1,20), rep(0,160)),
                        May=c(rep(0,80), rep(1,20), rep(0,140)),
                        Jun=c(rep(0,100), rep(1,20), rep(0,120)),
                        Jul=c(rep(0,120), rep(1,20), rep(0,100)),
                        Aug=c(rep(0,140), rep(1,20), rep(0,80)),
                        Sep=c(rep(0,160), rep(1,20), rep(0,60)),
                        Oct=c(rep(0,180), rep(1,20), rep(0,40)),
                        Nov=c(rep(0,200), rep(1,20), rep(0,20)),
                        Dec=c(rep(0,220), rep(1,20)))
    > summary(lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun +
                 Jul + Aug + Sep + Oct + Nov + Dec, data=dat))
And the answers:
    [snip]
    Coefficients: (1 not defined because of singularities)
Note this line: it indicates that R (and any other statistical package you might choose to use) can't estimate all of the parameters, because the predictor variables are not all linearly independent.
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1149.55556   53.52499  21.477   <2e-16 ***
The intercept here represents the predicted value when all predictor variables are zero. In any particular case the interpretation of the intercept depends on how you have parameterized your model. The dummy variables you have defined for month are not all linearly independent; lm is smart enough to detect this and drop some of the unidentifiable (linearly dependent) predictor variables. The details of which particular predictor(s) get discarded are obscure and technical (you would probably have to look inside the lm.fit function, but you probably don't want to do this). In this case, R decides to throw away the December predictor. Therefore, if we set all the predictors (two, three, and the month dummies Jan-Nov) to zero, we end up with the expected value when two = 0 and three = 0 and the month is not any of Jan-Nov -- i.e., the expected value for December.
    two           -0.09670    0.06621  -1.460   0.1455
    three          0.02446    0.06666   0.367   0.7141
    Jan          -19.49744   22.17404  -0.879   0.3802
    Feb          -28.22652   22.27438  -1.267   0.2064
    Mar           -6.05246   22.25468  -0.272   0.7859
    Apr           -5.60192   22.41204  -0.250   0.8029
    May          -13.19127   22.34289  -0.590   0.5555
    Jun          -19.69547   22.14274  -0.889   0.3747
    Jul          -44.45511   22.20837  -2.002   0.0465 *
    Aug           -2.08404   22.26202  -0.094   0.9255
    Sep          -10.13351   22.10252  -0.458   0.6470
    Oct          -31.80482   22.33335  -1.424   0.1558
    Nov          -20.35348   22.09953  -0.921   0.3580
    Dec                 NA         NA      NA       NA
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 69.81 on 226 degrees of freedom
    Multiple R-squared:  0.04381,   Adjusted R-squared:  -0.01119
    F-statistic: 0.7966 on 13 and 226 DF,  p-value: 0.6635
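As an aside, you can ask R which coefficient was dropped and why, rather than scanning the table by eye. The sketch below assumes the model above has been saved in an object called fit (a name introduced here for illustration):

    fit <- lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun +
              Jul + Aug + Sep + Oct + Nov + Dec, data=dat)
    names(which(is.na(coef(fit))))  # coefficient(s) lm could not estimate: "Dec"
    alias(fit)                      # shows Dec = (Intercept) - Jan - Feb - ... - Nov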
Now do it again, this time collapsing the twelve dummy columns into a single month factor and using -1 in the model formula to discard the intercept term (we reset the random seed for reproducibility):
    > set.seed(101)
    > dat1 <- data.frame(one=c(sample(1000:1239)),
                         two=c(sample(200:439)),
                         three=c(sample(600:839)),
                         month=factor(rep(month.abb, each=20), levels=month.abb))
    > summary(lm(one ~ two + three + month - 1, data=dat1))

    Coefficients:
            Estimate Std. Error t value Pr(>|t|)
    two     -0.09670    0.06621  -1.460    0.146
    three    0.02446    0.06666   0.367    0.714
The estimates for two and three are the same as before.
    monthJan 1130.05812   52.79625  21.404   <2e-16 ***
    monthFeb 1121.32904   55.18864  20.318   <2e-16 ***
    monthMar 1143.50310   53.59603  21.336   <2e-16 ***
    monthApr 1143.95365   54.99724  20.800   <2e-16 ***
    monthMay 1136.36429   53.38218  21.287   <2e-16 ***
    monthJun 1129.86010   53.85865  20.978   <2e-16 ***
    monthJul 1105.10045   54.94940  20.111   <2e-16 ***
    monthAug 1147.47152   54.57201  21.027   <2e-16 ***
    monthSep 1139.42205   53.58611  21.263   <2e-16 ***
    monthOct 1117.75075   55.35703  20.192   <2e-16 ***
    monthNov 1129.20208   53.54934  21.087   <2e-16 ***
    monthDec 1149.55556   53.52499  21.477   <2e-16 ***
The estimate for December is the same as the intercept estimate above. Each of the other months' parameter estimates equals the first fit's intercept plus that month's coefficient. The p-values are different because their meaning has changed: previously they tested each month's difference from December; now they test each month's difference from a baseline value of zero.
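You can verify this correspondence directly from the two fits. The sketch below assumes the first model (intercept plus dummies) was saved as fit1 and the second (month factor, no intercept) as fit2; both names are introduced here for illustration:

    b1 <- coef(fit1)   # (Intercept), two, three, Jan..Nov (Dec is NA)
    b2 <- coef(fit2)   # two, three, monthJan..monthDec, no intercept

    # December's coefficient in fit2 equals fit1's intercept
    all.equal(unname(b2["monthDec"]), unname(b1["(Intercept)"]))

    # every other month equals fit1's intercept plus that month's dummy coefficient
    all.equal(unname(b2[paste0("month", month.abb[1:11])]),
              unname(b1["(Intercept)"] + b1[month.abb[1:11]]))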