This is a little rudimentary, I know. Basically, I want to use the save data from the coef function to a shared data frame for models that all pull limited possible variables from a larger shared data set.
I have 3 sets of 14 models. Each set uses 15-25 variables from a 100 variable data-set, and each of the models uses a mix of about 12 variables, which change from model to model. What I would like to do is save the coefficients for each of the 14 models into one data-frame.
Coefs=data.frame(col.names = names(EST))
The coefficients look something like this:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
RT_SCORE_USER 0.2427506 0.0310486 7.818 0.0000000000000836 ***
VOD.Window..weeks. 0.0092641 0.0009985 9.278 < 2e-16 ***
PX_WK3 0.0300395 0.0098943 3.036 0.002600 **
For a good 10-15 variables. For instance, PX has 14 weeks (WK1, 2, etc). I want to save the Estimate values into this grid, where for each row, there are 100 columns listing all possible variables. The majority of which will be 0. This table will be imported into excel where I can simply cross multiply for each week's model.
My struggle is figuring out how to record all the varying coefficients from the various weeks into ONE data.frame, where each model has a separate row:
PX_WK1 PXWK_2 RT_SCORE_USER IMAVARIABLE etc.
ESTWK1 .030 0 .24 0
ESTWK2 0 .023 .44 etc
ESTWK3 0 0 etc etc etc
I understand how to use coef(ESTWK1), but when I try to paste that into a row, I naturally get an error confusing the lengths of the two vectors, say 15 in this model out of a potential 100.
I want to automate this process so when new data is processed and the regressions are run, I can run my code saving the new coefficients' data, and then I can output that to a CSV (that part I've got). Thoughts?
The first step is to combine your coefficients into a data frame with one row per combination of model and term. Then you'll be able to spread it into a table with one row per model and one column per term.
My broom package has a useful function, tidy
for turning a linear fit into a data frame of coefficients:
fit <- lm(mpg ~ wt + disp + qsec, mtcars)
library(broom)
tidy(fit)
# term estimate std.error statistic p.value
# 1 (Intercept) 19.777558 5.9383 3.3305 0.00244
# 2 wt -5.034410 1.2241 -4.1127 0.00031
# 3 disp -0.000128 0.0106 -0.0121 0.99042
# 4 qsec 0.926649 0.3421 2.7087 0.01139
(Note that unlike coef
, this returns a data frame rather than a matrix, and incorporates the terms as a column rather than rownames). You can apply this function to each of your models and then recombine, for example with plyr's ldply. We generate an example using 20 of the same model as your "models":
models <- replicate(20, lm(mpg ~ wt + disp + qsec, mtcars), simplify = FALSE)
names(models) <- paste0("MODEL", 1:20)
Then our "tidy and recombine" code will be:
all_coefs <- plyr::ldply(models, tidy, .id = "model")
head(all_coefs)
# model term estimate std.error statistic p.value
# 1 MODEL1 (Intercept) 19.777558 5.9383 3.3305 0.00244
# 2 MODEL1 wt -5.034410 1.2241 -4.1127 0.00031
# 3 MODEL1 disp -0.000128 0.0106 -0.0121 0.99042
# 4 MODEL1 qsec 0.926649 0.3421 2.7087 0.01139
# 5 MODEL2 (Intercept) 19.777558 5.9383 3.3305 0.00244
# 6 MODEL2 wt -5.034410 1.2241 -4.1127 0.00031
You then need to remove the std.error, statistic, and p.value columns and spread the estimate
term out. This can be done with the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
results <- all_coefs %>% select(-(std.error:p.value)) %>%
spread(term, estimate)
This produces:
model (Intercept) disp qsec wt
1 MODEL1 19.8 -0.000128 0.927 -5.03
2 MODEL2 19.8 -0.000128 0.927 -5.03
3 MODEL3 19.8 -0.000128 0.927 -5.03
4 MODEL4 19.8 -0.000128 0.927 -5.03
5 MODEL5 19.8 -0.000128 0.927 -5.03
Which is your desired output. (This output is boring since all the models were the same, but presumably yours are different). If some models have coefficients others don't, the missing values will be filled in with NA.
I'd approach this by doing something like this:
x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
m1 <- lm(y ~ x1 + x2)
m2 <- lm(y ~ x1 + x3)
m3 <- lm(y ~ x2 + x3)
variables <- data.frame(variable = c("(Intercept)", "x1", "x2", "x3"),
model = rep(c("m1", "m2", "m3"), each = 4))
data <- data.frame(variable = c(names(coef(m1)), names(coef(m2)),
names(coef(m3))),
estimate = c(coef(m1), coef(m2), coef(m3)),
model = c(rep("m1", length(coef(m1))),
rep("m2", length(coef(m2))),
rep("m3", length(coef(m3)))))
data2 <- left_join(variables, data)
data2$estimate[is.na(data2$estimate)] <- 0
data2
reshape(data2, timevar = "variable", v.names = "estimate",
idvar = "model", direction = "wide")
Basically, fit the models and then extract the estimates and the row names. Then make a data frame variables
that includes all possible variable names for each model. Use left_join
from dplyr
to do the join and then reshape it into the format you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With