
How does lm() know which predictors are categorical?

Tags: r, regression

Normally, you and I (assuming you're not a bot) can easily tell whether a predictor is categorical or quantitative. Gender, for example, is obviously categorical, and so is your last vote.
But what happens when we load some data into R and its lm function creates dummy variables for a predictor? How does it decide which predictors need them?

Somewhat related Question on StackOverflow.

Mooncrater asked Jul 17 '17 17:07




2 Answers

Look up R's factor function. Here is a small demo: the first model uses the number of cylinders as a numeric variable, and the second uses it as a categorical variable.

> summary(lm(mpg~cyl,mtcars))

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

> summary(lm(mpg~factor(cyl),mtcars))

Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   26.6636     0.9718  27.437  < 2e-16 ***
factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
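
To see where the two extra coefficients come from, it can help to look at the design matrix directly. The following is a minimal sketch, assuming the default treatment contrasts are in effect; it uses model.matrix on the same two formulas to show the columns that lm would build in each case.

```r
# cyl is stored as a plain number in mtcars, so lm() fits a single slope.
class(mtcars$cyl)

# Numeric predictor: one column for cyl next to the intercept.
X_num <- model.matrix(mpg ~ cyl, data = mtcars)
colnames(X_num)

# Factor predictor: one 0/1 dummy column per non-reference level.
# With default treatment contrasts the first level ("4") is the reference,
# so its mean is absorbed into the intercept.
X_fac <- model.matrix(mpg ~ factor(cyl), data = mtcars)
colnames(X_fac)
head(X_fac, 3)
```

The dummy columns are created only because cyl was wrapped in factor; lm does not guess from the values themselves.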
hxd1011 answered Sep 30 '22 08:09



hxd1011 addressed the more difficult case, where a categorical variable is stored as a number. R then treats it as numeric by default, and if that is not the desired behaviour we must wrap it in the factor function.

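Your example with the predictor ShelveLoc in the Carseats dataset is easier: it is stored as a factor (note the Levels line in the output below), so lm treats it as categorical automatically. A plain text (character) variable is handled the same way, because lm converts it to a factor when it builds the model frame.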

> head(Carseats$ShelveLoc)
[1] Bad    Good   Medium Medium Bad    Bad   
Levels: Bad Good Medium
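
The rule can be checked on a small made-up data frame. This is only a sketch: the data frame toy and its columns y, price, shelf and shelf_f are hypothetical and are not taken from Carseats. It compares a character column with the same values stored as a factor.

```r
# Hypothetical toy data: a numeric predictor and a text predictor.
set.seed(1)
toy <- data.frame(
  y     = rnorm(9),
  price = c(10, 12, 9, 15, 11, 13, 8, 14, 10),
  shelf = rep(c("Bad", "Good", "Medium"), times = 3),
  stringsAsFactors = FALSE   # keep shelf as character
)
toy$shelf_f <- factor(toy$shelf)  # the same values, stored as a factor

# The storage class is what lm() looks at.
sapply(toy, class)

# The character column is converted to a factor and dummy coded.
fit_chr <- lm(y ~ price + shelf, data = toy)
names(coef(fit_chr))
fit_chr$xlevels   # levels lm() recorded for the categorical predictor

# An explicit factor gives the same fit.
fit_fac <- lm(y ~ price + shelf_f, data = toy)
all.equal(unname(coef(fit_chr)), unname(coef(fit_fac)))
```

In both fits price gets a single slope, while the shelf variable gets one coefficient per non-reference level.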
Pere answered Sep 30 '22 09:09
