R predict glm fit on each column in data frame using column index number

I am trying to fit a binary logistic regression (BLR) model to each column of a data frame and then predict on new data points. There are a lot of columns, so I cannot refer to them by name, only by column number. I have reviewed several similar examples on this site but cannot figure out why this does not work.

df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y = rbinom(1000, size = 1, prob = 0.40))

for (i in 1:length(df)-1)
{
        fit <- glm (y ~ df[,i], data = df, family = binomial, na.action = na.exclude)

        new_pts <- data.frame(seq(min(df[,i], na.rm = TRUE), max(df[,i], na.rm = TRUE), len = 200))
        names(new_pts) <- names(df[, i])

        new_pred <- predict(fit, newdata = new_pts, type = "response")

}

The predict() function raises a warning and returns a vector 1000 elements long, whereas the new data has only 200 rows.

Warning message: 'newdata' had 200 rows but variables found have 1000 rows

asked Aug 11 '18 03:08

bici-sancta

1 Answer

For repeated modelling I use an approach like the one shown below. I have implemented it with data.table, but it could be rewritten for a base data.frame (the code would then probably be more verbose). In this approach I store all the models in a separate object. Below are two versions of the code: a more explanatory one, and a more advanced one that aims for clean output.

Of course, you could also write a loop or function that fits only one model per iteration without storing it. From my perspective, it's a good idea to save the models, since you will probably want to check them for robustness and so on, not only predict new values.
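As a rough base-R sketch of that loop variant (not part of the code I tested for this answer): the loop in the question seems to fail for three reasons. `1:length(df)-1` is parsed as `(1:length(df)) - 1`, so it starts at 0 and includes the `y` column. `names(df[, i])` is NULL because `df[, i]` drops to a plain vector. And the formula `y ~ df[,i]` makes the model term literally `df[, i]`, so predict() looks up the original 1000-row `df` instead of the column in `newdata`. The sketch below builds the formula from the column name with `reformulate()`; the names `pred_idx`, `x_name`, and `results` are my own, and `set.seed()` is only there to make the example reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y  = rbinom(1000, size = 1, prob = 0.40))

# predictor columns by position (all but the last column, y);
# seq_len() avoids the precedence problem of 1:length(df)-1
pred_idx <- seq_len(ncol(df) - 1)

results <- lapply(pred_idx, function(i) {
  x_name <- names(df)[i]  # names(df[, i]) would return NULL

  # build y ~ <column name> so the model term is a real column name
  fit <- glm(reformulate(x_name, response = "y"),
             data = df, family = binomial, na.action = na.exclude)

  # 200 evenly spaced points over the observed range of this column
  new_pts <- data.frame(seq(min(df[[i]], na.rm = TRUE),
                            max(df[[i]], na.rm = TRUE),
                            length.out = 200))
  names(new_pts) <- x_name  # must match the variable name in the formula

  # keep the model as well, so it can be inspected later
  list(fit = fit,
       new_pts = new_pts,
       new_pred = predict(fit, newdata = new_pts, type = "response"))
})
names(results) <- names(df)[pred_idx]
```

With this, `results$x1$new_pred` should have one prediction per row of `new_pts` (200 values), and `results$x1$fit` is still available for diagnostics.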

HINT: Please also have a look at the answer of @AndS., which provides a tidyverse approach. Together with this answer, I think it makes a nice side-by-side comparison for learning and understanding the data.table and tidyverse approaches.

# I used simpler data to show that the output is correct (see the plots)
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))
library(data.table)
setDT(df)
# prepare the data by melting it
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
# I also used a simpler model (glm without a family is gaussian, so lm would also do)
# create one model for each variable (formerly the columns)
models = setnames(DT[, data.table(list(glm(y ~ x))), by = "variable"], "V1", "model")
# create a new set of data to be predicted
# NOTE: this could, of course, also be added to the models data.table
# as new column via `:=list(...)`
new_pts = setnames(DT[, seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), len = 200), by = variable], "V1", "x")
# add the predicted values
new_pts[, predicted:= predict(models[variable == unlist(.BY), model][[1]], newdata = as.data.frame(x),  type = "response")
        , by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(new_pts[variable == "x1", .(x, predicted)])
points(df$x2, df$y)
lines(new_pts[variable == "x2", .(x, predicted)])

# The following version of the code above is also possible.
# It creates only one new object in the environment,
# but may look more complicated at first sight.
# I am not sure this is the best way to do it;
# data.table experts might know some shortcuts.
setDT(df)
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
DT = data.table(variable = unique(DT$variable), dat = split(DT, DT$variable))
DT[, models:= list(list(glm(y ~ x, data = dat[[1]]))), by = variable]
DT[, new_pts:= list(list(data.frame(x = dat[[1]][
                                                 ,seq(min(x, na.rm = TRUE)
                                                 , max(x, na.rm = TRUE), len = 200)]
                                    )))
       , by = variable]
# predict per variable, using the model and new points stored in DT itself
DT[, predicted:= list(list(data.frame(pred = predict(models[[1]]
                                          , newdata = new_pts[[1]]
                                          ,  type = "response")))),
       by = variable]
plot(df$x1, df$y)
lines(DT[variable == "x1", .(unlist(new_pts), unlist(predicted))])
points(df$x2, df$y)
lines(DT[variable == "x2", .(unlist(new_pts), unlist(predicted))])

answered Sep 29 '22 15:09

Manuel Bickel

Tags: r, glm, predict
