Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Manually build logistic regression model for prediction in R

I'm attempting to test a logistic regression model (e.g. 3 coefficients for 3 predictor variables, X1,X2,X3), on a dataset. I'm aware of how to test a model after i created the model object using, for example,

mymodel <- glm( Outcome ~  X1 + X2 + X3 , family = binomial,data=trainDat)

and then test the data

prob <- predict(mymodel,type="response",newdata=test)

But i want to, now, create a logistic model using coefficients and intercept that I have, and then test this model on data.

Basically I'm not clear on how to create "mymodel" without running glm.

Context for the question: I've run a logistic regression using offsets eg:

mymodel <- glm(Outcome ~ offset(C1 * X1) + offset(C2 * X2) + X3, 
               family = binomial, data = trainDat)

Thus, the mymodel object generates a model with only an intercept (I) and C3 coefficients (for feature X3).
I now need to test the full model (i.e. I + C1*X1 + C2*X2 + C3*X3), on a test dataset, but I don't know how to get the full model, since the output of mymodel has only intercept and C3. So I thought my more general question was: "how do you manually build a logisitic regression model object?"

Thank you for your patience.

like image 319
nerdlyfe Avatar asked Jun 09 '14 02:06

nerdlyfe


1 Answers

I could not find a simple function to do this. There is some code in the predict function that depends on having a fitted model (like determining the rank of the model). However, we can create a function to make a fake glm object that you can use with predict. Here's my first attempt at such a function

makeglm <- function(formula, family, data=NULL, ...) {
    dots <- list(...)
    out<-list()
    tt <- terms(formula, data=data)
    if(!is.null(data)) {
        mf <- model.frame(tt, data)
        vn <- sapply(attr(tt, "variables")[-1], deparse)

        if((yvar <- attr(tt, "response"))>0)
            vn <- vn[-yvar]
            xlvl <- lapply(data[vn], function(x) if (is.factor(x))
           levels(x)
        else if (is.character(x))
           levels(as.factor(x))
        else
            NULL)
        attr(out, "xlevels") <- xlvl[!vapply(xlvl,is.null,NA)]
        attr(tt, "dataClasses") <- sapply(data[vn], stats:::.MFclass)
    }
    out$terms <- tt
    coef <- numeric(0)
    stopifnot(length(dots)>1 & !is.null(names(dots)))
    for(i in seq_along(dots)) {
        if((n<-names(dots)[i]) != "") {
            v <- dots[[i]]
            if(!is.null(names(v))) {
                coef[paste0(n, names(v))] <- v
            } else {
                stopifnot(length(v)==1)
                coef[n] <- v
            }
        } else {
            coef["(Intercept)"] <- dots[[i]]
        }   
    }
    out$coefficients <- coef
    out$rank <- length(coef)
    out$qr <- list(pivot=seq_len(out$rank))
    out$family <- if (class(family) == "family") {
        family
    } else if (class(family) == "function") {
        family()
    } else {
        stop(paste("invalid family class:", class(family)))
    }
    out$deviance <- 1
    out$null.deviance <- 1
    out$aic <- 1
    class(out) <- c("glm","lm")
    out
}

So this function creates an object and passes all the values that predict and print expect to find on such an object. Now we can test it out. First, here's some test data

set.seed(15)
dd <- data.frame(
    X1=runif(50),
    X2=factor(sample(letters[1:4], 50, replace=T)),
    X3=rpois(50, 5),
    Outcome = sample(0:1, 50, replace=T)
)

And we can fit a standard binomial model with

mymodel<-glm(Outcome~X1+X2+X3, data=dd, family=binomial)

Which gives

Call:  glm(formula = Outcome ~ X1 + X2 + X3, family = binomial, data = dd)

Coefficients:
(Intercept)           X1          X2b          X2c          X2d           X3  
    -0.4411       0.8853       1.8384       0.9455       1.5059      -0.1818  

Degrees of Freedom: 49 Total (i.e. Null);  44 Residual
Null Deviance:      68.03 
Residual Deviance: 62.67    AIC: 74.67

Now let's say we wanted to try out model that we read in a publication on the same data. Here's how we use the makeglm function

newmodel <- makeglm(Outcome~X1+X2+X3, binomial, data=dd, 
    -.5, X1=1, X2=c(b=1.5, c=1, d=1.5), X3=-.15)

The first parameter is the formula of the model. This defines the response and the covariates just like you would when running glm. Next you specify the family like you would with glm(). And you need to pass a data frame so R can sniff the correct data types for each of the variables involved. This will also identify all of the factor variables and their levels using the data.frame. So this can be new data that's coded just like the fitted data.frame or it can be the original one.

Now we start to specify the coefficients to use in our model. The coefficients will be filled using the names of the parameters. The unnamed parameter will be used as the intercept. If you have a factor, you need to give coefficients to all the levels via named parameters. Here I just decided to round off the fitted estimates to "nice" numbers.

Now I can use our newmodel with predict.

predict(mymodel, type="response")
#         1         2         3         4         5
# 0.4866398 0.3553439 0.6564668 0.7819917 0.3008108

predict(newmodel, newdata=dd, type="response")

#         1         2         3         4         5
# 0.5503572 0.4121811 0.7143200 0.7942776 0.3245525

Here I call predict on the original model and on the new model using the old data with my specified coefficients. We can see the estimate of probability have changed a bit.

Now I haven't thoroughly tested this function so use at your own risk. I don't do as much error checking as I probably should. Maybe someone else does know of a better way.

like image 199
MrFlick Avatar answered Oct 02 '22 20:10

MrFlick