Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Passing list of data frames into lm() and viewing the results

I've got three data frames, dfLON, dfMOS, and dfATA. Each features the same variables: y is a continuous variable, and a, b, and c are binary categorical, and there is also some NA.

I'd like to build separate linear regression models, one for each data set.

With my current code I've managed to make a list of data frames and pass it into lm(). But is there are more concise way to view the results than eg fitdfLON <- DfList[[1]]? I've provided three data frames in this example but I actually have ~25 so I'd have to type it 25 times!

Any help would be much appreciated.

Starting point (dfs):

dfLON <- data.frame(y=c(1.23,2.32,3.21,2.43),a=c(1,NA,1,2),b=c(1,1,2,2),c=c(2,1,2,1))
dfMOS <- data.frame(y=c(4.56,6.54,4.43,5.78),a=c(2,1,2,1),b=c(2,1,1,2),c=c(1,2,1,2))
dfATA <- data.frame(y=c(1.22,6.54,3.23,4.23),a=c(2,2,2,1),b=c(1,2,1,2),c=c(1,NA,1,2))

Current code:

Mylm <- function(df){
 fit <- lm(y ~ a + b + c, data=df)
  return(fit)
}
DfList <- lapply(list(dfLON, dfMOS, dfATA), Mylm)

fitdfLON <- DfList[[1]]
fitdfMOS <- DfList[[2]]
fitdfATA <- DfList[[3]]
like image 279
LLL Avatar asked Nov 26 '25 17:11

LLL


2 Answers

Whenever you're running models on many different datasets, it makes sense to tidy them using the broom library. This produces a clean data frame for each model, which you can then output or use in downstream analyses.

Simplest example:

library(broom)

Mylm <- function(df){
  fit <- lm(y ~ a + b + c, data=df)
  tidy(fit) # tidy the fit object
}

list(dfLON, dfMOS, dfATA) %>% lapply(Mylm)

#[[1]]
#         term estimate std.error statistic p.value
#1 (Intercept)     0.03       NaN       NaN     NaN
#2           a    -0.78       NaN       NaN     NaN
#3           b     1.98       NaN       NaN     NaN
#
#[[2]]
#         term estimate std.error  statistic    p.value
#1 (Intercept)   8.2975  0.969855  8.5554025 0.07407531
#2           a  -1.6650  0.445000 -3.7415730 0.16626155
#3           b  -0.3150  0.445000 -0.7078652 0.60785169
#
#[[3]]
#         term estimate std.error statistic   p.value
#1 (Intercept)    6.235  3.015000  2.067993 0.2867398
#2           a   -2.005  1.740711 -1.151828 0.4551559

And you can now combine this with the map_dfr() function from purrr to combine everything into one combined data frame:

library(purrr)

# note the named list entries; these will go into the "model" column
# without them, you'd just get a model number
list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>% 
  map_dfr(Mylm, .id = "model")

#  model        term estimate std.error  statistic    p.value
#1   LON (Intercept)   0.0300       NaN        NaN        NaN
#2   LON           a  -0.7800       NaN        NaN        NaN
#3   LON           b   1.9800       NaN        NaN        NaN
#4   MOS (Intercept)   8.2975  0.969855  8.5554025 0.07407531
#5   MOS           a  -1.6650  0.445000 -3.7415730 0.16626155
#6   MOS           b  -0.3150  0.445000 -0.7078652 0.60785169
#7   ATA (Intercept)   6.2350  3.015000  2.0679934 0.28673976
#8   ATA           a  -2.0050  1.740711 -1.1518281 0.45515586

And to make things more compact, you can define the function on the fly inside map_dfr. Seems appropriate when all you're doing is fit a linear model.

list("LON" = dfLON, "MOS" = dfMOS, "ATA" = dfATA) %>% 
  map_dfr(~ tidy(lm(y ~ a + b + c, data = .)),
          .id = "model")

#  model        term estimate std.error  statistic    p.value
#1   LON (Intercept)   0.0300       NaN        NaN        NaN
#2   LON           a  -0.7800       NaN        NaN        NaN
#3   LON           b   1.9800       NaN        NaN        NaN
#4   MOS (Intercept)   8.2975  0.969855  8.5554025 0.07407531
#5   MOS           a  -1.6650  0.445000 -3.7415730 0.16626155
#6   MOS           b  -0.3150  0.445000 -0.7078652 0.60785169
#7   ATA (Intercept)   6.2350  3.015000  2.0679934 0.28673976
#8   ATA           a  -2.0050  1.740711 -1.1518281 0.45515586
like image 92
Claus Wilke Avatar answered Nov 28 '25 16:11

Claus Wilke


If the names of the data.frame have a common pattern, you can use a combination of mget and ls to extract them and run lm using lapply

fit = lapply(mget(ls(pattern = "^df[A-Z]{3}")), function(x) lm(y ~ a + b + c, data = x))
fit$dfATA

#Call:
#lm(formula = y ~ a + b + c, data = x)

#Coefficients:
#(Intercept)            a            b            c  
#      6.235       -2.005           NA           NA  

If you just want coefficients for all, you could do

do.call(rbind,
        lapply(X = mget(ls(pattern = "^df[A-Z]{3}")),
               FUN = function(x) lm(formula = y ~ a + b + c, data = x)[[1]]))
#      (Intercept)      a      b  c
#dfATA      6.2350 -2.005     NA NA
#dfLON      0.0300 -0.780  1.980 NA
#dfMOS      8.2975 -1.665 -0.315 NA

Instead of ls(pattern = "df[A-Z]{3}"), you could also just provide a vector that has the names of all the data.frame

like image 26
d.b Avatar answered Nov 28 '25 16:11

d.b