How does lm() know which predictors are categorical?

Tags:

r

regression

Normally, you and I (assuming you're not a bot) can easily tell whether a predictor is categorical or quantitative. For example, gender is obviously categorical, and so is your last vote.
But what happens when we load some data into R and its lm function creates dummy variables for a predictor? How does it decide to do that?

Somewhat related question on Stack Overflow.


asked Jul 17 '17 17:07

Mooncrater

2 Answers

Look up R's factor function. Here is a small demo: the first model uses the number of cylinders as a numerical variable, and the second model uses it as a categorical variable.

> summary(lm(mpg~cyl,mtcars))

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

> summary(lm(mpg~factor(cyl),mtcars))

Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2636 -1.8357  0.0286  1.3893  7.2364 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   26.6636     0.9718  27.437  < 2e-16 ***
factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.714 
F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
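
To see what actually changes between the two fits, it can help to look at the design matrix that gets built from each formula. The following is a minimal sketch, assuming only base R and the built-in mtcars data; the variable names X_num and X_fac are just illustrative.

```r
# mtcars ships with base R (datasets package), so nothing needs installing.
data(mtcars)

# cyl is stored as a plain number, so lm() fits a single slope for it.
class(mtcars$cyl)

# Numeric predictor: the design matrix has an intercept and one cyl column.
X_num <- model.matrix(mpg ~ cyl, data = mtcars)
colnames(X_num)

# Wrapped in factor(): one 0/1 dummy column per level except the baseline (4).
X_fac <- model.matrix(mpg ~ factor(cyl), data = mtcars)
colnames(X_fac)
head(X_fac)
```

The column names of X_fac match the coefficient names in the second summary above, which is why that model reports factor(cyl)6 and factor(cyl)8 rather than a single cyl slope.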

answered Sep 30 '22 08:09

hxd1011

Hxd1011 addressed the more difficult case, where a categorical variable is stored as a number, so R treats it as numerical by default. If that is not the desired behaviour, we must use the factor function.

Your example with the predictor ShelveLoc in the Carseats dataset is easier because it holds text labels rather than numbers (it is stored as a factor, as the Levels line below shows), so it can only be treated as a categorical variable.

> head(Carseats$ShelveLoc)
[1] Bad    Good   Medium Medium Bad    Bad   
Levels: Bad Good Medium
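
The same thing happens when the column is plain character text rather than a factor. Here is a rough sketch using a made-up data frame (toy, sales and shelf are hypothetical names standing in for Carseats, which needs the ISLR package):

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  sales = c(4.1, 9.8, 7.2, 6.9, 3.5, 4.4),
  shelf = c("Bad", "Good", "Medium", "Medium", "Bad", "Bad"),
  stringsAsFactors = FALSE   # keep shelf as plain character text
)

is.character(toy$shelf)      # stored as text, not as a factor

# lm() converts the character column to a factor while building the model
# frame, then creates one dummy column per non-baseline level.
fit_chr <- lm(sales ~ shelf, data = toy)
names(coef(fit_chr))

# Converting explicitly gives the same coefficients and lets you inspect
# the levels; the first level is used as the baseline.
toy$shelf_f <- factor(toy$shelf)
levels(toy$shelf_f)
fit_fac <- lm(sales ~ shelf_f, data = toy)
```

If a different baseline is wanted, relevel() on the factor is one way to change it before fitting.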

answered Sep 30 '22 09:09

Pere
