Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Running a regression model over 30 specific set.seed automatically using R

I am working with several linear regression models.

I want to run a linear regression model with different 30 set.seed

For clarification, I only share the code with two regression models and 10 set.seed (In my project I have 12 regression models and each one should be run with 30 different set.seeds)

I need a solution that I can run a 30 set.seed for one linear regression model so I can go away from my laptop during the running period (30 set.seeds). Then I did the same for the second regression model.

Is there a way to run the code over the 30 different set.seed automatically. So I got a result for each set.seed.

I hope everything is clear and I am happy to clarify more.

NOTE

Bear in mind that I have four related Blocks with each regression model. So any change with set.seed or creatFolds may affect the other blocks.

EDIT1

The dataset used

wdbc <- read.delim("airfoil_self_noise.dat",header=F)
wdbcc=as.data.frame(scale(wdbc))
#set.seed(1)
#set.seed(2)
#set.seed(3)
#set.seed(4)
...
k = 30
folds <- createFolds(wdbcc$V6, k = k, list = TRUE, returnTrain = TRUE)

## Ordinary Least Square regression ##
#Block A
lm = list()
for (i in 1:k) {
  lm[[i]] = lm(V6~ ., data = wdbcc[folds[[i]],])
}

#Block B
lm_coef = list()
lm_coef_var = list()
for(j in 1:(lm[[1]]$coefficients %>% length())){
  for(i in 1:k){
    lm_coef[[i]] = lm[[i]]$coefficients[j] 
    lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
  }
}

#Block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients%>% names() %>% as.data.frame()
              , variance = lm_var %>% as.data.frame()) 
colnames(lm_df) = c("coefficients", "variance_lm")

#Block D
lm_var_sum = sum(lm_var)

PQSQ-Regression

X=list()
Y=list()
for (i in 1:k) {
  n=wdbcc[folds[[i]],-6]
  m=wdbcc[folds[[i]],6]
  X[[i]]=n
  Y[[i]]=m
}

#Block A
lmPQSQ1 = list()
for (i in 1:k) {
  lmPQSQ1[[i]] = PQSQRegression(X[[i]],Y[[i]],0.01,data = wdbcc[folds[[i]],])
}

lmmPQSQ1=list()
for (i in 1:k) {
  L=list(coefficients = c(lmPQSQ1[[i]][[2]],lmPQSQ1[[i]][[1]]))
  lmmPQSQ1[[i]]=L
}
#Block B
lm_coefPQSQ1 = list()
lm_coef_varPQSQ1 = list()
for(j in 1:(lmmPQSQ1[[1]]$coefficients %>% length())){
  for(i in 1:k){
    lm_coefPQSQ1[[i]] = lmmPQSQ1[[i]]$coefficients[j] 
    lm_coef_varPQSQ1[[j]] = lm_coefPQSQ1 %>% unlist() %>% var()
  }
}

#Block C
lm_varPQSQ1 = unlist(lm_coef_varPQSQ1)
lm_dfPQSQ1 = variance = lm_varPQSQ1 %>% as.data.frame()
#Block D
PQSQ1_var_sum = sum(lm_varPQSQ1)
like image 236
jeza Avatar asked Mar 27 '20 23:03

jeza


People also ask

What does set seed 123 do in R?

Syntax: set.seed(123) The main point of using the seed is to be able to reproduce a particular sequence of 'random' numbers. and sed(n) reproduces random numbers results by seed.

What does LinearRegression () fit () do?

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. Whether to calculate the intercept for this model.

When should you set seed in R?

The use of set. seed is to make sure that we get the same results for randomization. If we randomly select some observations for any task in R or in any statistical software it results in different values all the time and this happens because of randomization.

What does Lasso do in R?

Lasso regression is a regularized regression algorithm that performs L1 regularization which adds penalty equal to the absolute value of the magnitude of coefficients. “LASSO” stands for Least Absolute Shrinkage and Selection Operator.


1 Answers

If I understand you correctly you want to regress V6 on all the other variables using both OLS and a LAD model. You want to select k=30 random "folds" using createFolds and repeat the process also n=30 times. As result you want the variances for each repetition and each coefficient.

I would wrap the fitting part into a function FX. Generate n=30 seeds with sample, loop over it with an lapply to repeat FX n=30 times.

FX <- function(seed, data, k=30) {
  set.seed(seed)  ## sets seed for each iteration
  folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## folds
  ## OLS
  lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
  lm.coefs <- t(sapply(lm1, coef))  ## lm coefficients
  ## LAD
  lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
  lad.coefs <- t(sapply(lad1, coef))  ## lad coefficients
  ## calculate column variances for both coef matrices
  ## use `attr<-` to add the seed as an attribute if you want
  return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), lad=apply(lad.coefs, 2, var)),
                  "seed", seed))
}

seeds <- 1:30  ## specific seeds 1, 2, ... 30
## if you want non-consecutive specific seeds, do:
# set.seed(42)                ## set some initial seed
# n <- 30                     ## n. o. seeds
# seeds <- sample(1:1e6, n)   ## sample seeds for `FX`


res <- lapply(seeds, FX, data=wdbcc)  ## lapply loop

Result

This results in a list of length 30 with variance matrices for each repetition, each model, and each coefficient.

res[1:2]  ## first two lists
# [[1]]
#                       lm          lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1          2.609623e-05 6.992753e-05
# V2          7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.209651e-05
# V4          7.986000e-06 9.273005e-06
# V5          5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
# 
# [[2]]
#                       lm          lad
# (Intercept) 4.558722e-06 2.031707e-05
# V1          2.256583e-05 9.291900e-05
# V2          6.519648e-05 2.768443e-05
# V3          1.800889e-05 9.983524e-06
# V4          1.131813e-05 1.174496e-05
# V5          3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2

length(res)
# [1] 30

To calculate the sum of variances for each seed you may use colSums in an sapply.

# sum of variances
sov <- t(sapply(res, colSums))

dim(sov)
# [1] 30  2

head(sov)
#                lm          lad
# [1,] 1.829835e-04 0.0001401535
# [2,] 1.603091e-04 0.0001728735
# [3,] 1.003093e-04 0.0001972869
# [4,] 1.460591e-04 0.0001508251
# [5,] 9.915082e-05 0.0001262106
# [6,] 1.425996e-04 0.0001478449

To understand what one iteration of the lapply does, consider this:

## provide the values of first iteration for arguments of function `FX`
seed <- 1
data <- wdbcc
k <- 30
## first iteration of `lapply`
set.seed(seed)
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef))  ## lm coefficients

dim(lm.coefs)
# [1] 30  6
head(lm.coefs)
#          (Intercept)         V1         V2         V3        V4         V5
# Fold01 -0.0039130125 -0.5806272 -0.3564769 -0.4804492 0.2271908 -0.2805472
# Fold02  0.0013260444 -0.5863764 -0.3533327 -0.4759213 0.2253128 -0.2874691
# Fold03  0.0006791787 -0.5890787 -0.3678586 -0.4832066 0.2220979 -0.2739124
# Fold04 -0.0010721593 -0.5868079 -0.3722466 -0.4895328 0.2227811 -0.2749657
# Fold05  0.0021856620 -0.5850165 -0.3495360 -0.4810657 0.2235410 -0.2936287
# Fold06  0.0001486909 -0.5872607 -0.3677774 -0.4848523 0.2275780 -0.2823764
## LAD (same as OLS)
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef))  ## lad coefficients
## return, throws variances for each coefficient of each model in a matrix
## the seed is added as an attribute, to be able to identify it later
res.1 <- `attr<-`(cbind(var.lm=apply(lm.coefs, 2, var), 
                        var.lad=apply(lad.coefs, 2, var)),
                  "seed", seed)
res.1
#                   var.lm      var.lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1          2.609623e-05 6.992753e-05
# V2          7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.209651e-05
# V4          7.986000e-06 9.273005e-06
# V5          5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1

Compare res.1 with the first element of list res above.

sov.1 <- colSums(res.1)
sov.1
#       var.lm      var.lad 
# 0.0001829835 0.0001401535 

Compare sov.1 with the first row of matrix sov above.


Edit

For regression functions with matrix notation, such as lm.fit, we may use model.matrix and do the subsetting beforehand, see line lm2.coefs in the function; compare lm and lm2 columns in res2 below, they're equal. (lm.fit is also faster than lm, because it omits unnecessary calculations, and you just need the coefficients; hence you may actually replace lm with lm.fit line. There might also be a way with lad using lsfit in the code, but honestly I'm too unfamiliar with lad to provide you this solution.)

Also notice, that, for sake of brevity I merged the two lines per model into one using sapply directly on the $coefficients. sapply works as lapply but throws a matrix; note that we need to transpose.

FX2 <- function(seed, data, k=30) {
  set.seed(seed)  ## sets seed for each iteration
  folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE)  ## draw folds
  lm.coefs <- t(sapply(folds, function(f) lm(V6 ~ ., data=data[f, ])$coef))
  lm2.coefs <- t(sapply(folds, function(f) {
    data2 <- data[f, ]
    lm.fit(x=model.matrix(V6 ~ ., data2), y=data2[,"V6"])$coef
    }))
  lad.coefs <- t(sapply(folds, function(f) lad(V6 ~ ., data=data[f, ], method="BR")$coef))
  return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), 
                        lm2=apply(lm2.coefs, 2, var), 
                        lad=apply(lad.coefs, 2, var)),
                  "seed", seed))
}

seeds <- 1:30 
res.2 <- lapply(seeds, FX2, data=wdbcc)  ## lapply loop
res.2[1:2]
# [[1]]
#                       lm          lm2          lad
# (Intercept) 9.104280e-06 9.104280e-06 1.273920e-05
# V1          2.609623e-05 2.609623e-05 6.992753e-05
# V2          7.082099e-05 7.082099e-05 2.075875e-05
# V3          1.352299e-05 1.352299e-05 1.209651e-05
# V4          7.986000e-06 7.986000e-06 9.273005e-06
# V5          5.545298e-05 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
# 
# [[2]]
#                       lm          lm2          lad
# (Intercept) 4.558722e-06 4.558722e-06 2.031707e-05
# V1          2.256583e-05 2.256583e-05 9.291900e-05
# V2          6.519648e-05 6.519648e-05 2.768443e-05
# V3          1.800889e-05 1.800889e-05 9.983524e-06
# V4          1.131813e-05 1.131813e-05 1.174496e-05
# V5          3.866105e-05 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2


Data and libraries:

invisible(lapply(c("caret", "L1pack"), library, character.only=TRUE))
wdbcc <- read.delim("airfoil_self_noise.dat", header=F)
wdbcc[] <- lapply(wdbcc, scale)
like image 121
jay.sf Avatar answered Oct 21 '22 08:10

jay.sf