Set seed with cv.glmnet paralleled gives different results in R

Question

I'm running cv.glmnet from the glmnet package in parallel on over 1000 data sets. In each run I set the seed so that the results are reproducible. However, the results still differ: when I rerun the code on the same day the results are identical, but when I run it the next day they are different.

Here is my code:

model <- function(path, file, wyniki, faktor = 0.75) {

  set.seed(2)

  dane <- read.csv(file)

  n <- nrow(dane)
  podzial <- 1:floor(faktor*n)


  ########## GLMNET ############
  nFolds <- 3

  train_sparse <- dane[podzial,]
  test_sparse  <- dane[-podzial,]

  # fit with cross-validation
  tryCatch({
    wart <- c(rep(0,6), "nie")
    model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds=nFolds, standardize=FALSE)

    pred <- predict(model, test_sparse[,-1], type = "response",s=model$lambda.min)

    # fetch of AUC value
    aucp1 <- roc(test_sparse[,1],pred)$auc

  }, error = function(e) print("error"))

  results <- data.frame(auc = aucp1, n = nrow(dane))
  write.table(results, wyniki, sep=',', append=TRUE,row.names =FALSE,col.names=FALSE)


}

path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"

library('doSNOW')
library('parallel')

# number of threads
threads <- 5

# register the cluster with that number of threads
cl <- makeCluster(threads, outfile="")
registerDoSNOW(cl)

message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))

message("Modelling...")
foreach(i=1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}

Does anyone know what the cause is? I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16.

potockan · Accepted Answer

Found the answer in the documentation of the cv.glmnet function:

Note also that the results of cv.glmnet are random, since the folds are selected at random.

The solution is to set the folds manually so that they are not chosen at random:

nFolds <- 3
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[,-1]),
                   y = train_sparse[,1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
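
As a rough illustration of why a fixed foldid helps, here is a minimal sketch on synthetic data. The helper name make_foldid, the seed values, and the data dimensions are assumptions made for this example and are not part of the original code. The idea is that the fold assignment depends only on an explicit seed and the number of rows, so two calls produce the same folds, and cv.glmnet given the same folds should select the same lambda. The glmnet part only runs if the package is installed.

```r
# Hypothetical helper (not from the original post): build a fold assignment
# that depends only on the seed and the number of rows.
# Note: set.seed() resets the global RNG state as a side effect.
make_foldid <- function(n, n_folds, seed) {
  set.seed(seed)
  # Balanced fold labels 1..n_folds, shuffled reproducibly
  sample(rep(seq_len(n_folds), length.out = n))
}

n_obs <- 60
foldid_a <- make_foldid(n_obs, n_folds = 3, seed = 2)
foldid_b <- make_foldid(n_obs, n_folds = 3, seed = 2)

# Same seed and same n give the same fold assignment
folds_match <- identical(foldid_a, foldid_b)

# Optional check with glmnet on synthetic data (assumed, not the asker's data)
same_lambda <- NA
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(n_obs * 5), nrow = n_obs)
  y <- rnorm(n_obs)
  fit_a <- glmnet::cv.glmnet(x, y, foldid = foldid_a, standardize = FALSE)
  fit_b <- glmnet::cv.glmnet(x, y, foldid = foldid_b, standardize = FALSE)
  # With identical folds, both fits should pick the same lambda.min
  same_lambda <- isTRUE(all.equal(fit_a$lambda.min, fit_b$lambda.min))
}
```

In a parallel loop like the one in the question, a helper of this kind would be called inside each task, so the folds do not depend on which worker happens to run the task.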

Zelazny7 · Answer

According to Writing R Extensions, a C wrapper is needed to call R's random number generators from FORTRAN. I don't see any C code in the glmnet source, so I'm afraid this does not appear to be implemented:

6.6 Calling C from FORTRAN and vice versa

Tags: r, parallel-processing, random-seed, glmnet
