Model Matrices Incompatible - Error in update in Biglm package in R

I'm working through a large dataset chunk by chunk, updating a list of linear models as I go with biglm. The problem occurs when a chunk does not contain all the levels of a factor used in my linear model; I then get this error:

Error in update.biglm(model, new) : model matrices incompatible

The documentation for update.biglm says that factor levels must be the same across all chunks. I could probably come up with a workaround, but there must be a better way. A PDF linked from the 'biglm' page states that "Factors must have their full set of levels specified (not necessarily present in the data chunk)". So I think there is a way to specify all the possible levels, so that I can update a model with a chunk in which not every level is present, but I can't figure out how to do it.

Here's an example piece of code to illustrate my problem:

library(biglm)

df = data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
model = biglm(a ~ b + c, data = df)

# b has only levels 1 and 2 in this chunk, so the update below fails
df.new = data.frame(a = rnorm(6), b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))
model.new = update(model, df.new)

Thanks for any advice you have.

Ore M asked Nov 01 '22 07:11


1 Answer

I came across this problem too. Are the variables in your large data frame converted to factors before you break it into chunks? And is the data set stored as a data frame?

large_df <- as.data.frame(large_data_set) # just to make sure it's a df.
large_df$factor.vars <- as.factor(large_df$factor.vars)

If so, all of the factor levels are preserved in the factor variables even after the data frame is broken into chunks by row indexing, because subsetting a factor keeps its full set of levels. biglm then builds the complete design matrix on the first call, and all subsequent updates are compatible.
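To check this on your own data, a small base R sketch (no biglm needed) can compare the levels and design-matrix columns of a chunk against the full data frame. The data frame mirrors the one in the question; the names chunk, cols_full and cols_chunk are only illustrative.

```r
# Base R only: check that row-subsetting keeps the full level set.
set.seed(1)
df <- data.frame(a = rnorm(12),
                 b = as.factor(rep(1:4, each = 3)),
                 c = rep(0:1, 6))

chunk <- df[1:6, ]            # only b = 1 and b = 2 occur in these rows
levels(chunk$b)               # still "1" "2" "3" "4"

# Design matrices built from the full data and from the chunk
# should have the same column names.
cols_full  <- colnames(model.matrix(a ~ b + c, data = df))
cols_chunk <- colnames(model.matrix(a ~ b + c, data = chunk))
identical(cols_full, cols_chunk)

# Dropping unused levels (or re-creating the factor from the chunk's values)
# loses levels 3 and 4, which is the situation that breaks a later update().
levels(droplevels(chunk)$b)   # "1" "2"
```

If the column names differ between chunks, the factor was probably created (or re-created) separately for each chunk rather than once on the whole data frame.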

If you have separate data frames from the start (as in your example), you could merge them into one before breaking it into chunks; rbind combines the factor levels of both. Continuing from your example:

df.large <- rbind(df,df.new)
chunk1 <- df.large[1:12,]
chunk2 <- df.large[13:18,]

model <- biglm(a~b+c,data = chunk1)
model.new <- update(model,chunk2)   # this is now compatible  
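
If merging the data frames first is not practical (for example, when chunks are read from disk one at a time), an alternative is to declare the full set of levels on every chunk, in line with the "full set of levels specified" requirement quoted in the question. The sketch below assumes the complete level set is known in advance; all_levels and set_levels are hypothetical names introduced here, not part of biglm.

```r
# Assumption: the complete set of levels for b is known up front.
all_levels <- c("1", "2", "3", "4")   # hypothetical; replace with your own levels

# Hypothetical helper: give b the full level set, whatever the chunk contains.
set_levels <- function(chunk, lev = all_levels) {
  chunk$b <- factor(as.character(chunk$b), levels = lev)
  chunk
}

set.seed(1)
df     <- data.frame(a = rnorm(12), b = as.factor(rep(1:4, each = 3)), c = rep(0:1, 6))
df.new <- data.frame(a = rnorm(6),  b = as.factor(rep(1:2, each = 3)), c = rep(0:1, 3))

df     <- set_levels(df)
df.new <- set_levels(df.new)          # levels 3 and 4 are declared but unused

# Fit and update only if biglm is installed.
if (requireNamespace("biglm", quietly = TRUE)) {
  model     <- biglm::biglm(a ~ b + c, data = df)
  model.new <- update(model, df.new)
}
```

Applying the same helper to every chunk before biglm or update keeps the design-matrix columns identical across chunks, even when some levels never occur in a given chunk.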

Snadhelta answered Nov 15 '22 05:11