I have a formula and a data frame, and I want to extract the model.matrix(). However, I need the resulting matrix to include the NAs that were found in the original dataset. If I were to use model.frame() to do this, I would simply pass it na.action=NULL. However, the output I need is of the model.matrix() format. Specifically, I need only the right-hand side variables, I need the output to be a matrix (not a data frame), and I need factors to be converted to a series of dummy variables.
I'm sure I could hack something together using loops or something, but I was wondering if anyone could suggest a cleaner and more efficient workaround. Thanks a lot for your time!
And here's an example:
dat <- data.frame(matrix(rnorm(20),5,4), gl(5,2))
dat[3,5] <- NA
names(dat) <- c(letters[1:4], 'fact')
ff <- a ~ b + fact
# This omits the row with a missing observation on the factor
model.matrix(ff, dat)
# This keeps the NA, but it gives me a data frame and does not dichotomize the factor
model.frame(ff, dat, na.action=NULL)
Here is what I would like to obtain:
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.7266086 0 0 0 0
2 1 -0.6088697 0 0 0 0
3 NA 0.4643360 NA NA NA NA
4 1 -1.1666248 1 0 0 0
5 1 -0.7577394 0 1 0 0
6 1 0.7266086 0 1 0 0
7 1 -0.6088697 0 0 1 0
8 1 0.4643360 0 0 1 0
9 1 -1.1666248 0 0 0 1
10 1 -0.7577394 0 0 0 1
Joris's suggestion works, but a quicker and cleaner way to do this is via the global na.action option. Setting it to 'na.pass' preserves the NAs from the original dataset.
Resulting matrix will contain NA's in rows corresponding to the original dataset.
options(na.action='na.pass')
model.matrix(ff, dat)
Resulting matrix will skip rows containing NA's.
options(na.action='na.omit')
model.matrix(ff, dat)
An error will occur if the original data contains NA's.
options(na.action='na.fail')
model.matrix(ff, dat)
Of course, always be careful when changing global options because they can alter the behavior of other parts of your code. A cautious person might store the original setting with something like current.na.action <- options('na.action'), and then change it back after making the model.matrix.
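As a rough sketch of that save-and-restore idea, the option change can be wrapped in a small function so the global setting is put back automatically. The name model_matrix_na_pass is made up for illustration, and set.seed() is added only so the example is reproducible; neither comes from the answer above.

```r
# Hypothetical helper (not from the answer above): builds the model matrix
# with na.action = "na.pass" and restores the previous option on exit.
model_matrix_na_pass <- function(formula, data) {
  old <- options(na.action = "na.pass")  # options() returns the old values
  on.exit(options(old), add = TRUE)      # restore even if model.matrix() fails
  model.matrix(formula, data)
}

# Example data from the question (set.seed() added here for reproducibility)
set.seed(1)
dat <- data.frame(matrix(rnorm(20), 5, 4), gl(5, 2))
dat[3, 5] <- NA
names(dat) <- c(letters[1:4], "fact")
ff <- a ~ b + fact

before <- getOption("na.action")
mm <- model_matrix_na_pass(ff, dat)
nrow(mm)                                   # all 10 rows are kept
identical(getOption("na.action"), before)  # the global option is unchanged
```

Using on.exit() means the option is restored even when model.matrix() throws an error, which a manual "change it back" line would miss.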
Another way is to use the model.frame function with argument na.action=na.pass as your second argument to model.matrix:
> model.matrix(ff, model.frame(~ ., dat, na.action=na.pass))
(Intercept) b fact2 fact3 fact4 fact5
1 1 -1.3560754 0 0 0 0
2 1 2.5476965 0 0 0 0
3 1 0.4635628 NA NA NA NA
4 1 -0.2871379 1 0 0 0
5 1 2.2684958 0 1 0 0
6 1 -1.3560754 0 1 0 0
7 1 2.5476965 0 0 1 0
8 1 0.4635628 0 0 1 0
9 1 -0.2871379 0 0 0 1
10 1 2.2684958 0 0 0 1
model.frame allows you to set the appropriate action for na.action, which is maintained when model.matrix is called.
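A small variation, offered only as a sketch: if you would rather build the model frame from the same formula instead of ~ ., something like the following should behave the same way for this example. The set.seed() call and the mf/mm names are my own additions.

```r
# Example data from the question (set.seed() added here for reproducibility)
set.seed(1)
dat <- data.frame(matrix(rnorm(20), 5, 4), gl(5, 2))
dat[3, 5] <- NA
names(dat) <- c(letters[1:4], "fact")
ff <- a ~ b + fact

# Build the model frame first, keeping rows with missing values
mf <- model.frame(ff, dat, na.action = na.pass)

# model.matrix() uses the supplied model frame as-is, so the NA row survives
mm <- model.matrix(ff, mf)
dim(mm)   # 10 rows, 6 columns
mm[3, ]   # the dummy columns are NA; the intercept and b are not
```

Note that the intercept in row 3 is 1 rather than NA, as in the output above, so this does not exactly match the layout asked for in the question.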
I half-stumbled across a simpler solution after looking at mattdevlin and Nathan Gould's answers:
model.matrix.lm(ff, dat, na.action = "na.pass")
model.matrix.default may not support the na.action argument, but model.matrix.lm does!
(I found model.matrix.lm from RStudio's auto-complete suggestions; it appears to be the only non-default method for model.matrix if you haven't loaded any libraries that add others. Then I just guessed it might support the na.action argument.)
You can mess around a little with the model.matrix object, based on the rownames:
MM <- model.matrix(ff,dat)
MM <- MM[match(rownames(dat),rownames(MM)),]
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)
which gives :
> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1
Alternatively, you can use contrasts() to do the work for you. Constructing the matrix by hand would be:
cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)
which gives :
> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1
In any case, both methods can be incorporated into a function that can deal with more complex formulae. I leave the exercise to the reader (how I loathe that sentence when I meet it in a paper ;-) )
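For what it's worth, here is one possible sketch of the rownames-based approach as a function. The name model_matrix_with_na is hypothetical, and it makes a simplifying assumption that differs from the output above: any row with a missing value in any model variable becomes NA in every column, including b.

```r
# Hypothetical helper: build the model matrix on complete rows, then
# re-expand it to the original rows, filling dropped rows with NA.
model_matrix_with_na <- function(formula, data) {
  # Drop incomplete rows explicitly, independent of the global option
  mf <- model.frame(formula, data, na.action = na.omit)
  mm <- model.matrix(formula, mf)
  # Rows missing from mm give NA in match(), which indexes to all-NA rows
  out <- mm[match(rownames(data), rownames(mm)), , drop = FALSE]
  rownames(out) <- rownames(data)
  out
}

# Example data from the question (set.seed() added here for reproducibility)
set.seed(1)
dat <- data.frame(matrix(rnorm(20), 5, 4), gl(5, 2))
dat[3, 5] <- NA
names(dat) <- c(letters[1:4], "fact")

res <- model_matrix_with_na(a ~ b + fact, dat)
nrow(res)   # same number of rows as dat
res[3, ]    # every column is NA for the incomplete row
```

Subsetting with [ drops the "assign" and "contrasts" attributes of the model matrix, so this is only suitable when you need the plain numeric matrix.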