Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Printing regression coefficients from multiple models to a shared data frame

Tags:

r

This is a little rudimentary, I know. Basically, I want to use the save data from the coef function to a shared data frame for models that all pull limited possible variables from a larger shared data set.

I have 3 sets of 14 models. Each set uses 15-25 variables from a 100 variable data-set, and each of the models uses a mix of about 12 variables, which change from model to model. What I would like to do is save the coefficients for each of the 14 models into one data-frame.

Coefs=data.frame(col.names = names(EST))

The coefficients look something like this:

Coefficients:
                    Estimate   Std. Error  t value           Pr(>|t|)    
RT_SCORE_USER       0.2427506  0.0310486   7.818 0.0000000000000836 ***
VOD.Window..weeks.  0.0092641  0.0009985   9.278            < 2e-16 ***
PX_WK3              0.0300395  0.0098943   3.036           0.002600 ** 

For a good 10-15 variables. For instance, PX has 14 weeks (WK1, 2, etc). I want to save the Estimate values into this grid, where for each row, there are 100 columns listing all possible variables. The majority of which will be 0. This table will be imported into excel where I can simply cross multiply for each week's model.

My struggle is figuring out how to record all the varying coefficients from the various weeks into ONE data.frame, where each model has a separate row:

       PX_WK1     PXWK_2   RT_SCORE_USER  IMAVARIABLE etc.
ESTWK1   .030     0         .24            0
ESTWK2   0        .023      .44            etc
ESTWK3   0        0         etc etc etc

I understand how to use coef(ESTWK1), but when I try to paste that into a row, I naturally get an error confusing the lengths of the two vectors, say 15 in this model out of a potential 100.

I want to automate this process so when new data is processed and the regressions are run, I can run my code saving the new coefficients' data, and then I can output that to a CSV (that part I've got). Thoughts?

like image 830
Jason O Avatar asked Apr 16 '15 20:04

Jason O


2 Answers

The first step is to combine your coefficients into a data frame with one row per combination of model and term. Then you'll be able to spread it into a table with one row per model and one column per term.

My broom package has a useful function, tidy for turning a linear fit into a data frame of coefficients:

fit <- lm(mpg ~ wt + disp + qsec, mtcars)
library(broom)
tidy(fit)
#          term  estimate std.error statistic p.value
# 1 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 2          wt -5.034410    1.2241   -4.1127 0.00031
# 3        disp -0.000128    0.0106   -0.0121 0.99042
# 4        qsec  0.926649    0.3421    2.7087 0.01139

(Note that unlike coef, this returns a data frame rather than a matrix, and incorporates the terms as a column rather than rownames). You can apply this function to each of your models and then recombine, for example with plyr's ldply. We generate an example using 20 of the same model as your "models":

models <- replicate(20, lm(mpg ~ wt + disp + qsec, mtcars), simplify = FALSE)
names(models) <- paste0("MODEL", 1:20)

Then our "tidy and recombine" code will be:

all_coefs <- plyr::ldply(models, tidy, .id = "model")
head(all_coefs)
#    model        term  estimate std.error statistic p.value
# 1 MODEL1 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 2 MODEL1          wt -5.034410    1.2241   -4.1127 0.00031
# 3 MODEL1        disp -0.000128    0.0106   -0.0121 0.99042
# 4 MODEL1        qsec  0.926649    0.3421    2.7087 0.01139
# 5 MODEL2 (Intercept) 19.777558    5.9383    3.3305 0.00244
# 6 MODEL2          wt -5.034410    1.2241   -4.1127 0.00031

You then need to remove the std.error, statistic, and p.value columns and spread the estimate term out. This can be done with the dplyr and tidyr packages:

library(dplyr)
library(tidyr)
results <- all_coefs %>% select(-(std.error:p.value)) %>%
    spread(term, estimate)

This produces:

     model (Intercept)      disp  qsec    wt
1   MODEL1        19.8 -0.000128 0.927 -5.03
2   MODEL2        19.8 -0.000128 0.927 -5.03
3   MODEL3        19.8 -0.000128 0.927 -5.03
4   MODEL4        19.8 -0.000128 0.927 -5.03
5   MODEL5        19.8 -0.000128 0.927 -5.03

Which is your desired output. (This output is boring since all the models were the same, but presumably yours are different). If some models have coefficients others don't, the missing values will be filled in with NA.

like image 146
David Robinson Avatar answered Nov 03 '22 04:11

David Robinson


I'd approach this by doing something like this:

x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
m1 <- lm(y ~ x1 + x2)
m2 <- lm(y ~ x1 + x3) 
m3 <- lm(y ~ x2 + x3)

variables <- data.frame(variable = c("(Intercept)", "x1", "x2", "x3"),
                        model = rep(c("m1", "m2", "m3"), each = 4))
data <- data.frame(variable = c(names(coef(m1)), names(coef(m2)), 
                                names(coef(m3))),
                   estimate = c(coef(m1), coef(m2), coef(m3)), 
                   model = c(rep("m1", length(coef(m1))), 
                             rep("m2", length(coef(m2))),
                             rep("m3", length(coef(m3)))))
data2 <- left_join(variables, data)
data2$estimate[is.na(data2$estimate)] <- 0
data2
reshape(data2, timevar = "variable", v.names = "estimate", 
        idvar = "model", direction = "wide")

Basically, fit the models and then extract the estimates and the row names. Then make a data frame variables that includes all possible variable names for each model. Use left_join from dplyr to do the join and then reshape it into the format you want.

like image 32
iacobus Avatar answered Nov 03 '22 06:11

iacobus