
Setting the seed with parallel cv.glmnet gives different results in R

I'm running cv.glmnet from the glmnet package in parallel on over 1000 data sets. In each run I set the seed so that the results are reproducible. However, I've noticed that the results still differ: when I run the code on the same day, the results are identical, but the next day they are different.

Here is my code:

model <- function(path, file, wyniki, faktor = 0.75) {

  set.seed(2)

  dane <- read.csv(file)

  n <- nrow(dane)
  podzial <- 1:floor(faktor*n)


  ########## GLMNET ############
  nFolds <- 3

  train_sparse <- dane[podzial,]
  test_sparse  <- dane[-podzial,]

  # fit with cross-validation
  tryCatch({
    wart <- c(rep(0,6), "nie")
    model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds=nFolds, standardize=FALSE)

    pred <- predict(model, test_sparse[,-1], type = "response",s=model$lambda.min)

    # extract the AUC value
    aucp1 <- roc(test_sparse[,1],pred)$auc

  }, error = function(e) print("error"))

  results <- data.frame(auc = aucp1, n = nrow(dane))
  write.table(results, wyniki, sep=',', append=TRUE,row.names =FALSE,col.names=FALSE)


}

path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"

library('doSNOW')
library('parallel')

# number of threads
threads <- 5

# register the number of threads
cl <- makeCluster(threads, outfile="")
registerDoSNOW(cl)

message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))

message("Modelling...")
foreach(i=1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}

Does anyone know what the cause is? I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16.

asked Jan 08 '16 12:01 by potockan


2 Answers

Found the answer in the documentation of the cv.glmnet function:

Note also that the results of cv.glmnet are random, since the folds are selected at random.

The solution is to set the folds manually through the foldid argument, so that cv.glmnet does not choose them at random itself:

nFolds <- 3
# one fold id per training row, shuffled (reproducible after set.seed)
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[,-1]),
                   y = train_sparse[,1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
answered Sep 23 '22 03:09 by potockan
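
A minimal sketch of the idea in this answer, not the author's exact code: it builds the fold ids once from a fixed seed and, if glmnet happens to be installed, passes the same ids to cv.glmnet twice to see whether both fits pick the same lambda.min. The helper name make_foldid and the simulated data are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of glmnet): assign each of n rows to one of
# nfolds folds, shuffled with a fixed seed so the assignment is repeatable.
# Note that it resets R's global RNG state as a side effect.
make_foldid <- function(n, nfolds, seed = 2) {
  set.seed(seed)
  sample(rep(seq_len(nfolds), length.out = n))
}

n <- 90
nFolds <- 3
foldid_a <- make_foldid(n, nFolds)
foldid_b <- make_foldid(n, nFolds)

# Same seed, same assignment; with n divisible by nFolds the folds are equal in size.
same_folds <- identical(foldid_a, foldid_b)
fold_sizes <- as.integer(table(foldid_a))

# Optional check with glmnet on simulated data (assumed, not from the question).
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(n * 5), nrow = n)
  y <- x[, 1] + rnorm(n)
  fit1 <- glmnet::cv.glmnet(x, y, foldid = foldid_a, standardize = FALSE)
  fit2 <- glmnet::cv.glmnet(x, y, foldid = foldid_a, standardize = FALSE)
  same_lambda <- isTRUE(all.equal(fit1$lambda.min, fit2$lambda.min))
}
```

Storing the fold ids (for example next to the results file) would also make it possible to compare runs made on different days with exactly the same split.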


According to Writing R Extensions, a C wrapper is needed to call R's normal random number generator from FORTRAN. I don't see any C code in the glmnet source, so I'm afraid this does not appear to be implemented:

6.6 Calling C from FORTRAN and vice versa

answered Sep 19 '22 03:09 by Zelazny7
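
One way to check empirically whether a call draws from R's own generator (and therefore responds to set.seed) is to compare .Random.seed before and after it. The sketch below uses only base R; the helper name touches_r_rng is hypothetical, and the commented cv.glmnet call at the end is only an assumption about how one might apply it.

```r
# Hypothetical helper: TRUE if calling f() advances R's RNG state.
touches_r_rng <- function(f, seed = 2) {
  set.seed(seed)                                    # also creates .Random.seed
  before <- get(".Random.seed", envir = globalenv())
  f()
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

uses_rng    <- touches_r_rng(function() sample(10))  # sample() draws from R's RNG
no_rng_used <- touches_r_rng(function() sum(1:10))   # pure arithmetic does not

# Possible use with glmnet (x and y must be defined by the caller):
# touches_r_rng(function() glmnet::cv.glmnet(x, y, nfolds = 3))
```

A TRUE result only says that something in the call consumed R's RNG; a FALSE result would suggest that any randomness comes from elsewhere and is not controlled by set.seed.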