
R missing levels in a model.matrix

Tags:

r

model.matrix

I am trying to convert a data frame with categorical variables to a model.matrix but am losing levels of variables.

Here's my code:

df1 <- data.frame(id = 1:200, y = rbinom(200, 1, .5), var1 = factor(rep(c('abc','def','ghi','jkl'),50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'),50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'),50))

df1$var2 <- as.character(df1$var2)
df1$var2 <- gsub('\\s','',df1$var2)
df1$var2 <- factor(df1$var2)
sapply(df1, levels)

mm1 <- model.matrix(~ 0+.,df1)
head(mm1)

Any suggestions? Is this a matrix non-invertibility issue?

screechOwl asked May 14 '26 16:05


1 Answer

The model matrix is correct. For each factor, the model matrix contains one column fewer than the factor has levels: the information for the omitted level is already contained in the (Intercept) column. You are missing the (Intercept) column because you have specified 0 + in your model formula. Try this:

mm2 <- model.matrix(~., df1)
head(mm2)

You will now see the (Intercept) column, and the first level of var1 (var1abc) no longer appears among the column names. The (Intercept) represents an observation at the "reference level", which is the combination of the first level of every categorical variable. Any deviation from this reference level is encoded in the var*??? columns. Since your model assumes no interactions between the variables, you get (4 - 1) * 3 = 9 var*??? columns plus the (Intercept) column, alongside the numeric id and y columns. In your initial model matrix, var1abc takes the place of the (Intercept) column.
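To make the swap between var1abc and (Intercept) concrete, here is a minimal sketch that builds both matrices and compares their column names. It rebuilds df1 from the question, writing var2 directly without the space (the result of the gsub() step); the set.seed() call is an assumption added only so that y is reproducible.

```r
# Rebuild the data frame from the question.
set.seed(1)  # assumption: not in the original, only makes y reproducible
df1 <- data.frame(
  id   = 1:200,
  y    = rbinom(200, 1, 0.5),
  var1 = factor(rep(c("abc", "def", "ghi", "jkl"), 50)),
  var2 = factor(rep(c("abc", "ghi", "jkl", "def"), 50)),
  var3 = factor(rep(c("abc", "ghi", "nop", "xyz"), 50))
)

mm1 <- model.matrix(~ 0 + ., df1)  # no intercept, as in the question
mm2 <- model.matrix(~ ., df1)      # with intercept, as in the answer

# Columns that appear in only one of the two matrices
only_in_mm1 <- setdiff(colnames(mm1), colnames(mm2))
only_in_mm2 <- setdiff(colnames(mm2), colnames(mm1))
print(only_in_mm1)
print(only_in_mm2)

# Both matrices have the same number of columns
print(c(ncol(mm1), ncol(mm2)))
```

The two matrices should differ in exactly one column each, which is why dropping the intercept does not bring back the first levels of var2 and var3.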

Unfortunately I lack the precise terms to describe this. Anyone help me out?
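For what it is worth, the coding described above is what R calls treatment contrasts (contr.treatment), the default for unordered factors. If the goal is an indicator column for every level of every factor, one possible approach is to pass full identity contrasts through the contrasts.arg argument of model.matrix. This is a sketch of an alternative, not part of the original answer; the names is_fac, full_contrasts and mm_full are made up for illustration.

```r
# Rebuild the data frame from the question.
set.seed(1)  # assumption: not in the original, only makes y reproducible
df1 <- data.frame(
  id   = 1:200,
  y    = rbinom(200, 1, 0.5),
  var1 = factor(rep(c("abc", "def", "ghi", "jkl"), 50)),
  var2 = factor(rep(c("abc", "ghi", "jkl", "def"), 50)),
  var3 = factor(rep(c("abc", "ghi", "nop", "xyz"), 50))
)

# For each factor column, request a contrast matrix that keeps all levels
# (contrasts = FALSE yields an identity matrix instead of dropping a level).
is_fac <- sapply(df1, is.factor)
full_contrasts <- lapply(df1[is_fac], contrasts, contrasts = FALSE)

mm_full <- model.matrix(~ 0 + ., df1, contrasts.arg = full_contrasts)
print(colnames(mm_full))
```

Note that the resulting matrix is rank deficient, because the indicator columns of each factor sum to one. That is fine for inspection or for methods that tolerate redundant columns, but an unpenalized regression cannot estimate a separate coefficient for every column.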

krlmlr answered May 17 '26 12:05


