I have a <code>data.frame</code> consisting of numeric and factor variables as seen below. <pre class="prettyprint"><code>testFrame <- data.frame(First=sample(1:10, 20, replace=T), Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T), Fourth=rep(c("Alice","Bob","Charlie","David"), 5), Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4)) </code></pre> I want to build out a <code>matrix</code> that assigns dummy variables to the factor and leaves the numeric variables alone. <pre class="prettyprint"><code>model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame) </code></pre> As expected when running <code>lm</code> this leaves out one level of each factor as the reference level. However, I want to build out a <code>matrix</code> with a dummy/indicator variable for every level of all the factors. I am building this matrix for <code>glmnet</code> so I am not worried about multicollinearity. Is there a way to have <code>model.matrix</code> create the dummy for every level of the factor?

(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. <code>contrasts()</code> takes a vector/factor and produces the contrasts matrix from it. For this then we can use <code>lapply()</code> to run <code>contrasts()</code> on each factor in our data set, e.g. for the <code>testFrame</code> example provided: <pre class="prettyprint"><code>> lapply(testFrame[,4:5], contrasts, contrasts = FALSE) $Fourth Alice Bob Charlie David Alice 1 0 0 0 Bob 0 1 0 0 Charlie 0 0 1 0 David 0 0 0 1 $Fifth Edward Frank Georgia Hank Isaac Edward 1 0 0 0 0 Frank 0 1 0 0 0 Georgia 0 0 1 0 0 Hank 0 0 0 1 0 Isaac 0 0 0 0 1 </code></pre> Which slots nicely into @fabians answer: <pre class="prettyprint"><code>model.matrix(~ ., data=testFrame, contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE)) </code></pre>

You need to reset the <code>contrasts</code> for the factor variables: <pre class="prettyprint"><code>model.matrix(~ Fourth + Fifth, data=testFrame, contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F), Fifth=contrasts(testFrame$Fifth, contrasts=F))) </code></pre> or, with a little less typing and without the proper names: <pre class="prettyprint"><code>model.matrix(~ Fourth + Fifth, data=testFrame, contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), Fifth=diag(nlevels(testFrame$Fifth)))) </code></pre>

<code>dummyVars</code> from <code>caret</code> could also be used. http://caret.r-forge.r-project.org/preprocess.html

Ok. Just reading the above and putting it all together. Suppose you wanted the matrix e.g. 'X.factors' that multiplies by your coefficient vector to get your linear predictor. There are still a couple extra steps: <pre class="prettyprint"><code>X.factors = model.matrix( ~ ., data=X, contrasts.arg = lapply(data.frame(X[,sapply(data.frame(X), is.factor)]), contrasts, contrasts = FALSE)) </code></pre> (Note that you need to turn X[*] back into a data frame in case you have only one factor column.) Then say you get something like this: <pre class="prettyprint"><code>attr(X.factors,"assign") [1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added </code></pre> We want to get rid of the **'d reference levels of each factor <pre class="prettyprint"><code>att = attr(X.factors,"assign") factor.columns = unique(att[duplicated(att)]) unwanted.columns = match(factor.columns,att) X.factors = X.factors[,-unwanted.columns] X.factors = (data.matrix(X.factors)) </code></pre>

A <code>tidyverse</code> answer: <pre class="prettyprint"><code>library(dplyr) library(tidyr) result <- testFrame %>% mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "") </code></pre> yields the desired result (same as @Gavin Simpson's answer): <pre class="prettyprint"><code>> head(result, 6) First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac 1 1 5 4 0 0 1 0 0 1 0 0 0 2 1 14 10 0 0 0 1 0 0 1 0 0 3 2 2 9 0 1 0 0 1 0 0 0 0 4 2 5 4 0 0 0 1 0 1 0 0 0 5 2 13 5 0 0 1 0 1 0 0 0 0 6 2 15 7 1 0 0 0 1 0 0 0 0 </code></pre>

I am currently learning Lasso model and <code>glmnet::cv.glmnet()</code>, <code>model.matrix()</code> and <code>Matrix::sparse.model.matrix()</code>(for high dimensions matrix, using <code>model.matrix</code> will killing our time as suggested by the author of <code>glmnet</code>.). Just sharing there has a tidy coding to get the same answer as @fabians and @Gavin's answer. Meanwhile, @asdf123 introduced another package <code>library('CatEncoders')</code> as well. <pre class="prettyprint"><code>> require('useful') > # always use all levels > build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE) > > # just use all levels for Fourth > build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE)) </code></pre> Source : R for Everyone: Advanced Analytics and Graphics (page273)

You can use <code>tidyverse</code> to achieve this without specifying each column manually. The trick is to make a "long" dataframe. Then, munge a few things, and spread it back to wide to create the indicators/dummy variables. Code: <pre class="prettyprint"><code>library(tidyverse) ## add index variable for pivoting testFrame$id <- 1:nrow(testFrame) testFrame %>% ## pivot to "long" format gather(feature, value, -id) %>% ## add indicator value mutate(indicator=1) %>% ## create feature name that unites a feature and its value unite(feature, value, col="feature_value", sep="_") %>% ## convert to wide format, filling missing values with zero spread(feature_value, indicator, fill=0) </code></pre> The output: <pre class="prettyprint"><code> id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ... 1 1 1 0 0 0 0 0 0 0 2 2 0 1 0 0 0 0 0 0 3 3 0 0 1 0 0 0 0 0 4 4 0 0 0 1 0 0 0 0 5 5 0 0 0 0 1 0 0 0 6 6 1 0 0 0 0 0 0 0 7 7 0 1 0 0 0 0 1 0 8 8 0 0 1 0 0 1 0 0 9 9 0 0 0 1 0 0 0 0 10 10 0 0 0 0 1 0 0 0 11 11 1 0 0 0 0 0 0 0 12 12 0 1 0 0 0 0 0 0 ... </code></pre>

All Levels of a Factor in a Model Matrix in R

Tags:

r

matrix

model

indicator

I have a data.frame consisting of numeric and factor variables as seen below.

testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

I want to build out a matrix that assigns dummy variables to the factor and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

As expected when running lm this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet so I am not worried about multicollinearity.

Is there a way to have model.matrix create the dummy for every level of the factor?

476

asked Dec 30 '10 06:12

Jared

10 Answers

(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. For this then we can use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

Which slots nicely into @fabians answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))

195

answered Oct 14 '22 22:10

Gavin Simpson

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F), 
                Fifth=contrasts(testFrame$Fifth, contrasts=F)))

or, with a little less typing and without the proper names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))

answered Oct 14 '22 23:10

fabians

caret implemented a nice function dummyVars to achieve this with 2 lines:

library(caret) dmy <- dummyVars(" ~ .", data = testFrame) testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Checking the final columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"

The nicest point here is you get the original data frame, plus the dummy variables having excluded the original ones used for the transformation.

More info: http://amunategui.github.io/dummyVar-Walkthrough/

answered Oct 14 '22 23:10

Pablo Casas

dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

answered Oct 14 '22 22:10

Sagar Jauhari

Ok. Just reading the above and putting it all together. Suppose you wanted the matrix e.g. 'X.factors' that multiplies by your coefficient vector to get your linear predictor. There are still a couple extra steps:

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the **'d reference levels of each factor

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))

answered Oct 14 '22 23:10

user36302

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

yields the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0

answered Oct 14 '22 22:10

shosaco

Using the R package 'CatEncoders'

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output

answered Oct 14 '22 22:10

asdf123

I am currently learning Lasso model and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix()(for high dimensions matrix, using model.matrix will killing our time as suggested by the author of glmnet.).

Just sharing there has a tidy coding to get the same answer as @fabians and @Gavin's answer. Meanwhile, @asdf123 introduced another package library('CatEncoders') as well.

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source : R for Everyone: Advanced Analytics and Graphics (page273)

answered Oct 15 '22 00:10

Rγσ ξηg Lιαη Ημ

I write a package called ModelMatrixModel to improve the functionality of model.matrix(). The ModelMatrixModel() function in the package in default return a class containing a sparse matrix with all levels of dummy variables which is suitable for input in cv.glmnet() in glmnet package. Importantly, returned class also stores transforming parameters such as the factor level information, which can then be applied to new data. The function can hand most items in r formula like poly() and interaction. It also gives several other options like handle invalid factor levels , and scale output.

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                        Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
                   Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data     
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0

answered Oct 14 '22 23:10

Ben2018

You can use tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" dataframe.

Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.

Code:

library(tidyverse)

## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)

testFrame %>%
    ## pivot to "long" format
    gather(feature, value, -id) %>%
    ## add indicator value
    mutate(indicator=1) %>%
    ## create feature name that unites a feature and its value
    unite(feature, value, col="feature_value", sep="_") %>%
    ## convert to wide format, filling missing values with zero
    spread(feature_value, indicator, fill=0)

The output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...

answered Oct 14 '22 23:10

Paul

Related questions
                            
                                How to crash R?
                            
                                Delete rows with blank values in one particular column
                            
                                Compare two character vectors in R
                            
                                dplyr mutate rowSums calculations or custom functions
                            
                                groupby weighted average and sum in pandas dataframe
                            
                                Subset data to contain only columns whose names match a condition
                            
                                Replace multiple letters with accents with gsub
                            
                                Wrap long axis labels via labeller=label_wrap in ggplot2
                            
                                Plot a function with ggplot, equivalent of curve()
                            
                                Confidence intervals for predictions from logistic regression
                            
                                ending "+" prompt in R
                            
                                What specifically are the dangers of eval(parse(...))?
                            
                                Why am I getting X. in my column names when reading a data frame?
                            
                                Merge unequal dataframes and replace missing rows with 0
                            
                                dplyr summarise_each with na.rm
                            
                                How to get row index number in R?
                            
                                How to find the highest value of a column in a data frame in R?
                            
                                Calculate difference between values in consecutive rows by group
                            
                                Getting R plots into LaTeX?
                            
                                How do I clean up R memory without restarting my PC?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

All Levels of a Factor in a Model Matrix in R

Tags:

r

matrix

model

indicator

Jared

People also ask

10 Answers

Gavin Simpson

fabians

Pablo Casas

Sagar Jauhari

user36302

shosaco

asdf123

Rγσ ξηg Lιαη Ημ

Ben2018

Paul

Recent Activity

Donate For Us