I am working with several linear regression models.
I want to run a linear regression model with different 30 set.seed
For clarification, I only share the code with two regression models and 10 set.seed (In my project I have 12 regression models and each one should be run with 30 different set.seeds)
I need a solution that I can run a 30 set.seed for one linear regression model so I can go away from my laptop during the running period (30 set.seeds). Then I did the same for the second regression model.
Is there a way to run the code over the 30 different set.seed automatically. So I got a result for each set.seed.
I hope everything is clear and I am happy to clarify more.
NOTE
Bear in mind that I have four related Blocks with each regression model. So any change with set.seed or creatFolds may affect the other blocks.
EDIT1
The dataset used
wdbc <- read.delim("airfoil_self_noise.dat",header=F)
wdbcc=as.data.frame(scale(wdbc))
#set.seed(1)
#set.seed(2)
#set.seed(3)
#set.seed(4)
...
k = 30
folds <- createFolds(wdbcc$V6, k = k, list = TRUE, returnTrain = TRUE)
## Ordinary Least Square regression ##
#Block A
lm = list()
for (i in 1:k) {
lm[[i]] = lm(V6~ ., data = wdbcc[folds[[i]],])
}
#Block B
lm_coef = list()
lm_coef_var = list()
for(j in 1:(lm[[1]]$coefficients %>% length())){
for(i in 1:k){
lm_coef[[i]] = lm[[i]]$coefficients[j]
lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
}
}
#Block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients%>% names() %>% as.data.frame()
, variance = lm_var %>% as.data.frame())
colnames(lm_df) = c("coefficients", "variance_lm")
#Block D
lm_var_sum = sum(lm_var)
PQSQ-Regression
X=list()
Y=list()
for (i in 1:k) {
n=wdbcc[folds[[i]],-6]
m=wdbcc[folds[[i]],6]
X[[i]]=n
Y[[i]]=m
}
#Block A
lmPQSQ1 = list()
for (i in 1:k) {
lmPQSQ1[[i]] = PQSQRegression(X[[i]],Y[[i]],0.01,data = wdbcc[folds[[i]],])
}
lmmPQSQ1=list()
for (i in 1:k) {
L=list(coefficients = c(lmPQSQ1[[i]][[2]],lmPQSQ1[[i]][[1]]))
lmmPQSQ1[[i]]=L
}
#Block B
lm_coefPQSQ1 = list()
lm_coef_varPQSQ1 = list()
for(j in 1:(lmmPQSQ1[[1]]$coefficients %>% length())){
for(i in 1:k){
lm_coefPQSQ1[[i]] = lmmPQSQ1[[i]]$coefficients[j]
lm_coef_varPQSQ1[[j]] = lm_coefPQSQ1 %>% unlist() %>% var()
}
}
#Block C
lm_varPQSQ1 = unlist(lm_coef_varPQSQ1)
lm_dfPQSQ1 = variance = lm_varPQSQ1 %>% as.data.frame()
#Block D
PQSQ1_var_sum = sum(lm_varPQSQ1)
Syntax: set.seed(123) The main point of using the seed is to be able to reproduce a particular sequence of 'random' numbers. and sed(n) reproduces random numbers results by seed.
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Whether to calculate the intercept for this model.
The use of set. seed is to make sure that we get the same results for randomization. If we randomly select some observations for any task in R or in any statistical software it results in different values all the time and this happens because of randomization.
Lasso regression is a regularized regression algorithm that performs L1 regularization which adds penalty equal to the absolute value of the magnitude of coefficients. “LASSO” stands for Least Absolute Shrinkage and Selection Operator.
If I understand you correctly you want to regress V6
on all the other variables using both OLS and a LAD model. You want to select k=30
random "folds" using createFolds
and repeat the process also n=30
times. As result you want the variances for each repetition and each coefficient.
I would wrap the fitting part into a function FX
. Generate n=30
seeds with sample
, loop over it with an lapply
to repeat FX
n=30
times.
FX <- function(seed, data, k=30) {
set.seed(seed) ## sets seed for each iteration
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef)) ## lm coefficients
## LAD
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef)) ## lad coefficients
## calculate column variances for both coef matrices
## use `attr<-` to add the seed as an attribute if you want
return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), lad=apply(lad.coefs, 2, var)),
"seed", seed))
}
seeds <- 1:30 ## specific seeds 1, 2, ... 30
## if you want non-consecutive specific seeds, do:
# set.seed(42) ## set some initial seed
# n <- 30 ## n. o. seeds
# seeds <- sample(1:1e6, n) ## sample seeds for `FX`
res <- lapply(seeds, FX, data=wdbcc) ## lapply loop
This results in a list of length 30 with variance matrices for each repetition, each model, and each coefficient.
res[1:2] ## first two lists
# [[1]]
# lm lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1 2.609623e-05 6.992753e-05
# V2 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.209651e-05
# V4 7.986000e-06 9.273005e-06
# V5 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
#
# [[2]]
# lm lad
# (Intercept) 4.558722e-06 2.031707e-05
# V1 2.256583e-05 9.291900e-05
# V2 6.519648e-05 2.768443e-05
# V3 1.800889e-05 9.983524e-06
# V4 1.131813e-05 1.174496e-05
# V5 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2
length(res)
# [1] 30
To calculate the sum of variances for each seed you may use colSums
in an sapply
.
# sum of variances
sov <- t(sapply(res, colSums))
dim(sov)
# [1] 30 2
head(sov)
# lm lad
# [1,] 1.829835e-04 0.0001401535
# [2,] 1.603091e-04 0.0001728735
# [3,] 1.003093e-04 0.0001972869
# [4,] 1.460591e-04 0.0001508251
# [5,] 9.915082e-05 0.0001262106
# [6,] 1.425996e-04 0.0001478449
To understand what one iteration of the lapply
does, consider this:
## provide the values of first iteration for arguments of function `FX`
seed <- 1
data <- wdbcc
k <- 30
## first iteration of `lapply`
set.seed(seed)
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef)) ## lm coefficients
dim(lm.coefs)
# [1] 30 6
head(lm.coefs)
# (Intercept) V1 V2 V3 V4 V5
# Fold01 -0.0039130125 -0.5806272 -0.3564769 -0.4804492 0.2271908 -0.2805472
# Fold02 0.0013260444 -0.5863764 -0.3533327 -0.4759213 0.2253128 -0.2874691
# Fold03 0.0006791787 -0.5890787 -0.3678586 -0.4832066 0.2220979 -0.2739124
# Fold04 -0.0010721593 -0.5868079 -0.3722466 -0.4895328 0.2227811 -0.2749657
# Fold05 0.0021856620 -0.5850165 -0.3495360 -0.4810657 0.2235410 -0.2936287
# Fold06 0.0001486909 -0.5872607 -0.3677774 -0.4848523 0.2275780 -0.2823764
## LAD (same as OLS)
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef)) ## lad coefficients
## return, throws variances for each coefficient of each model in a matrix
## the seed is added as an attribute, to be able to identify it later
res.1 <- `attr<-`(cbind(var.lm=apply(lm.coefs, 2, var),
var.lad=apply(lad.coefs, 2, var)),
"seed", seed)
res.1
# var.lm var.lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1 2.609623e-05 6.992753e-05
# V2 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.209651e-05
# V4 7.986000e-06 9.273005e-06
# V5 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
Compare res.1
with the first element of list res
above.
sov.1 <- colSums(res.1)
sov.1
# var.lm var.lad
# 0.0001829835 0.0001401535
Compare sov.1
with the first row of matrix sov
above.
For regression functions with matrix notation, such as lm.fit
, we may use model.matrix
and do the subsetting beforehand, see line lm2.coefs
in the function; compare lm
and lm2
columns in res2
below, they're equal. (lm.fit
is also faster than lm
, because it omits unnecessary calculations, and you just need the coefficients; hence you may actually replace lm
with lm.fit
line. There might also be a way with lad
using lsfit
in the code, but honestly I'm too unfamiliar with lad
to provide you this solution.)
Also notice, that, for sake of brevity I merged the two lines per model into one using sapply
directly on the $coefficients
. sapply
works as lapply
but throws a matrix; note that we need to t
ranspose.
FX2 <- function(seed, data, k=30) {
set.seed(seed) ## sets seed for each iteration
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## draw folds
lm.coefs <- t(sapply(folds, function(f) lm(V6 ~ ., data=data[f, ])$coef))
lm2.coefs <- t(sapply(folds, function(f) {
data2 <- data[f, ]
lm.fit(x=model.matrix(V6 ~ ., data2), y=data2[,"V6"])$coef
}))
lad.coefs <- t(sapply(folds, function(f) lad(V6 ~ ., data=data[f, ], method="BR")$coef))
return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var),
lm2=apply(lm2.coefs, 2, var),
lad=apply(lad.coefs, 2, var)),
"seed", seed))
}
seeds <- 1:30
res.2 <- lapply(seeds, FX2, data=wdbcc) ## lapply loop
res.2[1:2]
# [[1]]
# lm lm2 lad
# (Intercept) 9.104280e-06 9.104280e-06 1.273920e-05
# V1 2.609623e-05 2.609623e-05 6.992753e-05
# V2 7.082099e-05 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.352299e-05 1.209651e-05
# V4 7.986000e-06 7.986000e-06 9.273005e-06
# V5 5.545298e-05 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
#
# [[2]]
# lm lm2 lad
# (Intercept) 4.558722e-06 4.558722e-06 2.031707e-05
# V1 2.256583e-05 2.256583e-05 9.291900e-05
# V2 6.519648e-05 6.519648e-05 2.768443e-05
# V3 1.800889e-05 1.800889e-05 9.983524e-06
# V4 1.131813e-05 1.131813e-05 1.174496e-05
# V5 3.866105e-05 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2
Data and libraries:
invisible(lapply(c("caret", "L1pack"), library, character.only=TRUE))
wdbcc <- read.delim("airfoil_self_noise.dat", header=F)
wdbcc[] <- lapply(wdbcc, scale)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With