Running a regression model over 30 specific set.seed automatically using R

Tags:

I am working with several linear regression models.

I want to run a linear regression model with different 30 set.seed

For clarification, I only share the code with two regression models and 10 set.seed (In my project I have 12 regression models and each one should be run with 30 different set.seeds)

I need a solution that I can run a 30 set.seed for one linear regression model so I can go away from my laptop during the running period (30 set.seeds). Then I did the same for the second regression model.

Is there a way to run the code over the 30 different set.seed automatically. So I got a result for each set.seed.

I hope everything is clear and I am happy to clarify more.

NOTE

Bear in mind that I have four related Blocks with each regression model. So any change with set.seed or creatFolds may affect the other blocks.

EDIT1

The dataset used

wdbc <- read.delim("airfoil_self_noise.dat",header=F)
wdbcc=as.data.frame(scale(wdbc))
#set.seed(1)
#set.seed(2)
#set.seed(3)
#set.seed(4)
...
k = 30
folds <- createFolds(wdbcc$V6, k = k, list = TRUE, returnTrain = TRUE)

## Ordinary Least Square regression ##
#Block A
lm = list()
for (i in 1:k) {
  lm[[i]] = lm(V6~ ., data = wdbcc[folds[[i]],])
}

#Block B
lm_coef = list()
lm_coef_var = list()
for(j in 1:(lm[[1]]$coefficients %>% length())){
  for(i in 1:k){
    lm_coef[[i]] = lm[[i]]$coefficients[j] 
    lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
  }
}

#Block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients%>% names() %>% as.data.frame()
              , variance = lm_var %>% as.data.frame()) 
colnames(lm_df) = c("coefficients", "variance_lm")

#Block D
lm_var_sum = sum(lm_var)

PQSQ-Regression

X=list()
Y=list()
for (i in 1:k) {
  n=wdbcc[folds[[i]],-6]
  m=wdbcc[folds[[i]],6]
  X[[i]]=n
  Y[[i]]=m
}

#Block A
lmPQSQ1 = list()
for (i in 1:k) {
  lmPQSQ1[[i]] = PQSQRegression(X[[i]],Y[[i]],0.01,data = wdbcc[folds[[i]],])
}

lmmPQSQ1=list()
for (i in 1:k) {
  L=list(coefficients = c(lmPQSQ1[[i]][[2]],lmPQSQ1[[i]][[1]]))
  lmmPQSQ1[[i]]=L
}
#Block B
lm_coefPQSQ1 = list()
lm_coef_varPQSQ1 = list()
for(j in 1:(lmmPQSQ1[[1]]$coefficients %>% length())){
  for(i in 1:k){
    lm_coefPQSQ1[[i]] = lmmPQSQ1[[i]]$coefficients[j] 
    lm_coef_varPQSQ1[[j]] = lm_coefPQSQ1 %>% unlist() %>% var()
  }
}

#Block C
lm_varPQSQ1 = unlist(lm_coef_varPQSQ1)
lm_dfPQSQ1 = variance = lm_varPQSQ1 %>% as.data.frame()
#Block D
PQSQ1_var_sum = sum(lm_varPQSQ1)

236

asked Mar 27 '20 23:03

jeza

1 Answers

If I understand you correctly you want to regress V6 on all the other variables using both OLS and a LAD model. You want to select k=30 random "folds" using createFolds and repeat the process also n=30 times. As result you want the variances for each repetition and each coefficient.

I would wrap the fitting part into a function FX. Generate n=30 seeds with sample, loop over it with an lapply to repeat FX n=30 times.

FX <- function(seed, data, k=30) {
  set.seed(seed)  ## sets seed for each iteration
  folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## folds
  ## OLS
  lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
  lm.coefs <- t(sapply(lm1, coef))  ## lm coefficients
  ## LAD
  lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
  lad.coefs <- t(sapply(lad1, coef))  ## lad coefficients
  ## calculate column variances for both coef matrices
  ## use `attr<-` to add the seed as an attribute if you want
  return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), lad=apply(lad.coefs, 2, var)),
                  "seed", seed))
}

seeds <- 1:30  ## specific seeds 1, 2, ... 30
## if you want non-consecutive specific seeds, do:
# set.seed(42)                ## set some initial seed
# n <- 30                     ## n. o. seeds
# seeds <- sample(1:1e6, n)   ## sample seeds for `FX`


res <- lapply(seeds, FX, data=wdbcc)  ## lapply loop

Result

This results in a list of length 30 with variance matrices for each repetition, each model, and each coefficient.

res[1:2]  ## first two lists
# [[1]]
#                       lm          lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1          2.609623e-05 6.992753e-05
# V2          7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.209651e-05
# V4          7.986000e-06 9.273005e-06
# V5          5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
# 
# [[2]]
#                       lm          lad
# (Intercept) 4.558722e-06 2.031707e-05
# V1          2.256583e-05 9.291900e-05
# V2          6.519648e-05 2.768443e-05
# V3          1.800889e-05 9.983524e-06
# V4          1.131813e-05 1.174496e-05
# V5          3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2

length(res)
# [1] 30

To calculate the sum of variances for each seed you may use colSums in an sapply.

# sum of variances
sov <- t(sapply(res, colSums))

dim(sov)
# [1] 30  2

head(sov)
#                lm          lad
# [1,] 1.829835e-04 0.0001401535
# [2,] 1.603091e-04 0.0001728735
# [3,] 1.003093e-04 0.0001972869
# [4,] 1.460591e-04 0.0001508251
# [5,] 9.915082e-05 0.0001262106
# [6,] 1.425996e-04 0.0001478449

To understand what one iteration of the lapply does, consider this:

## provide the values of first iteration for arguments of function `FX`
seed <- 1
data <- wdbcc
k <- 30

## first iteration of `lapply`
set.seed(seed)
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef))  ## lm coefficients

dim(lm.coefs)
# [1] 30  6
head(lm.coefs)
#          (Intercept)         V1         V2         V3        V4         V5
# Fold01 -0.0039130125 -0.5806272 -0.3564769 -0.4804492 0.2271908 -0.2805472
# Fold02  0.0013260444 -0.5863764 -0.3533327 -0.4759213 0.2253128 -0.2874691
# Fold03  0.0006791787 -0.5890787 -0.3678586 -0.4832066 0.2220979 -0.2739124
# Fold04 -0.0010721593 -0.5868079 -0.3722466 -0.4895328 0.2227811 -0.2749657
# Fold05  0.0021856620 -0.5850165 -0.3495360 -0.4810657 0.2235410 -0.2936287
# Fold06  0.0001486909 -0.5872607 -0.3677774 -0.4848523 0.2275780 -0.2823764

## LAD (same as OLS)
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef))  ## lad coefficients

## return, throws variances for each coefficient of each model in a matrix
## the seed is added as an attribute, to be able to identify it later
res.1 <- `attr<-`(cbind(var.lm=apply(lm.coefs, 2, var), 
                        var.lad=apply(lad.coefs, 2, var)),
                  "seed", seed)
res.1
#                   var.lm      var.lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1          2.609623e-05 6.992753e-05
# V2          7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.209651e-05
# V4          7.986000e-06 9.273005e-06
# V5          5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1

Compare res.1 with the first element of list res above.

sov.1 <- colSums(res.1)
sov.1
#       var.lm      var.lad 
# 0.0001829835 0.0001401535

Compare sov.1 with the first row of matrix sov above.

Edit

For regression functions with matrix notation, such as lm.fit, we may use model.matrix and do the subsetting beforehand, see line lm2.coefs in the function; compare lm and lm2 columns in res2 below, they're equal. (lm.fit is also faster than lm, because it omits unnecessary calculations, and you just need the coefficients; hence you may actually replace lm with lm.fit line. There might also be a way with lad using lsfit in the code, but honestly I'm too unfamiliar with lad to provide you this solution.)

Also notice, that, for sake of brevity I merged the two lines per model into one using sapply directly on the $coefficients. sapply works as lapply but throws a matrix; note that we need to transpose.

FX2 <- function(seed, data, k=30) {
  set.seed(seed)  ## sets seed for each iteration
  folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## draw folds
  lm.coefs <- t(sapply(folds, function(f) lm(V6 ~ ., data=data[f, ])$coef))
  lm2.coefs <- t(sapply(folds, function(f) {
    data2 <- data[f, ]
    lm.fit(x=model.matrix(V6 ~ ., data2), y=data2[,"V6"])$coef
    }))
  lad.coefs <- t(sapply(folds, function(f) lad(V6 ~ ., data=data[f, ], method="BR")$coef))
  return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), 
                        lm2=apply(lm2.coefs, 2, var), 
                        lad=apply(lad.coefs, 2, var)),
                  "seed", seed))
}

seeds <- 1:30 
res.2 <- lapply(seeds, FX2, data=wdbcc)  ## lapply loop
res.2[1:2]
# [[1]]
#                       lm          lm2          lad
# (Intercept) 9.104280e-06 9.104280e-06 1.273920e-05
# V1          2.609623e-05 2.609623e-05 6.992753e-05
# V2          7.082099e-05 7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.352299e-05 1.209651e-05
# V4          7.986000e-06 7.986000e-06 9.273005e-06
# V5          5.545298e-05 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
# 
# [[2]]
#                       lm          lm2          lad
# (Intercept) 4.558722e-06 4.558722e-06 2.031707e-05
# V1          2.256583e-05 2.256583e-05 9.291900e-05
# V2          6.519648e-05 6.519648e-05 2.768443e-05
# V3          1.800889e-05 1.800889e-05 9.983524e-06
# V4          1.131813e-05 1.131813e-05 1.174496e-05
# V5          3.866105e-05 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2

Data and libraries:

invisible(lapply(c("caret", "L1pack"), library, character.only=TRUE))
wdbcc <- read.delim("airfoil_self_noise.dat", header=F)
wdbcc[] <- lapply(wdbcc, scale)

121

answered Oct 21 '22 08:10

jay.sf

Related questions
                            
                                Plotting posterior parameter estimates from multiple models with bayesplot
                            
                                R implementation of Genetic Algorithm for Bass Model
                            
                                ambiguity of `<<-` when defining it for `x < y <- z`
                            
                                In R, how do I split timestamp interval data into regular slots?
                            
                                Exception handling RSelenium switchToFrame() Error: ElementNotVisible
                            
                                Wait for a user input for 5 seconds and use a default value otherwise
                            
                                Why do logical vectors in RStudio View mode not show length?
                            
                                How to group_by variable and cut time into 10s bins starting at 13:24:00 exactly and average for group_by variable
                            
                                Equivalent Python code for mutate_if from tidyverse
                            
                                Find most repeated sequence of values
                            
                                Shiny Apps: is it possible to render HTML in validation messages?
                            
                                Rmarkdown not working on Win 10 machine: openBinaryFile: does not exist (No such file or directory)
                            
                                Is there a way of making beautiful tables for Word in R?
                            
                                Function that composes functions with existing sql translations in dbplyr
                            
                                Install R packages using Dockerfile
                            
                                Apply different functions to different columns programmatically in data.table R
                            
                                Combining/summing two positions in a vector of integers in R
                            
                                Is R still single threaded? [closed]
                            
                                path error in pkgdown article with multiple bibliographies
                            
                                R Shiny – Make wellPanel pop-up follow as you scroll with CSS

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Running a regression model over 30 specific set.seed automatically using R

Tags:

r

machine-learning

regression