I would like to build separate models for the different segments of my data. I have built the models like so:
log1 <- glm(y ~ ., family = "binomial", data = train, subset = x1==0)
log2 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2<10)
log3 <- glm(y ~ ., family = "binomial", data = train, subset = x1==1 & x2>=10)
If I run the predictions on the training data, R remembers the subsets and the prediction vectors are with the length of the respective subset.
However, if I run the predictions on the testing data, the prediction vectors are with the length of the whole dataset, not that of the subsets.
My question is whether there is a simpler way to achieve what I would by first subsetting the testing data, then running the predictions on each dataset, concatenating the predictions, rbinding the subset data, and appending the concatenated predictions like this:
T1 <- subset(Test, x1==0)
T2 <- subset(Test, x1==1 & x2<10)
T3 <- subset(Test, x1==1 & x2>=10)
log1pred <- predict(log1, newdata = T1, type = "response")
log2pred <- predict(log2, newdata = T2, type = "response")
log3pred <- predict(log3, newdata = T3, type = "response")
allpred <- c(log1pred, log2pred, log3pred)
TAll <- rbind(T1, T2, T3)
TAll$allpred <- as.data.frame(allpred)
I'd like to think I am being stupid and there is an easier way to accomplish this - many models on small subsets of the data. How to combine them to get the predictions on the full testing data?
First, here's some sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T),
x2=rpois(100,10),
y=sample(0:1, 100, replace=T))
test <- data.frame(x1=sample(0:1, 10, replace=T),
x2=rpois(10,10))
Now we can fit the models. Here I place them in a list to make it easier to keep them together, and I also remove x1 from the model since it will be fixed for each subset
fits<-list(
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==0),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2<10),
glm(y ~ .-x1, family = "binomial", data = train, subset = x1==1 & x2>=10)
)
Now, for the training data, I create an indicator which specifies which group the observation falls into. I do this by looking at the subset= parameter of each of the calls and evaluating those conditions in the test data.
whichsubset <- as.vector(sapply(fits, function(x) {
subsetparam<-x$call$subset
eval(subsetparam, test)
})%*% matrix(1:length(fits), ncol=1))
You'll want to make sure your groups are mutually exclusive because this code does not check. Then you can use factor with a split/unsplit strategy for making your predictions
unsplit(
Map(function(a,b) predict(a,b),
fits, split(test, whichsubset)
),
whichsubset
)
And even easier strategy would have been just to create the segregating factor in the first place. This would make the model fitting easier as well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With