I'm running cv.glmnet from the glmnet package in parallel on over 1000 data sets. In each run I set the seed to make the results reproducible, yet I've noticed that the results differ. When I run the code on the same day, the results are identical, but the next day they differ.
Here is my code:
model <- function(path, file, wyniki, faktor = 0.75) {
  set.seed(2)
  dane <- read.csv(file)
  n <- nrow(dane)
  podzial <- 1:floor(faktor*n)
  ########## GLMNET ############
  nFolds <- 3
  train_sparse <- dane[podzial,]
  test_sparse <- dane[-podzial,]
  # fit with cross-validation
  tryCatch({
    wart <- c(rep(0,6), "nie")
    model <- cv.glmnet(train_sparse[,-1], train_sparse[,1], nfolds=nFolds, standardize=FALSE)
    pred <- predict(model, test_sparse[,-1], type = "response", s=model$lambda.min)
    # compute the AUC value
    aucp1 <- roc(test_sparse[,1], pred)$auc
  }, error = function(e) print("error"))
  results <- data.frame(auc = aucp1, n = nrow(dane))
  write.table(results, wyniki, sep=',', append=TRUE, row.names=FALSE, col.names=FALSE)
}
path <- path_to_files
files <- list.files(path, full.names = TRUE, recursive = TRUE)
wyniki <- "wyniki_adex__samplingfalse_decl_201512.csv"
library('doSNOW')
library('parallel')
# number of threads
threads <- 5
# register the cluster with that number of threads
cl <- makeCluster(threads, outfile="")
registerDoSNOW(cl)
message("Loading packages on threads...")
clusterEvalQ(cl,library(pROC))
clusterEvalQ(cl,library(ROCR))
clusterEvalQ(cl,library(glmnet))
clusterEvalQ(cl,library(stringi))
message("Modelling...")
foreach(i=1:length(files)) %dopar% {
  print(i)
  model(path, files[i], wyniki)
}
Does anyone know what the cause is? I'm running CentOS Linux release 7.0.1406 (Core) / Red Hat 4.8.2-16
Found the answer in the documentation of the cv.glmnet function:
Note also that the results of cv.glmnet are random, since the folds are selected at random.
The solution is to set the folds manually so that they are not chosen at random:
nFolds <- 3
foldid <- sample(rep(seq(nFolds), length.out = nrow(train_sparse)))
model <- cv.glmnet(x = as.matrix(train_sparse[,-1]),
                   y = train_sparse[,1],
                   nfolds = nFolds,
                   foldid = foldid,
                   standardize = FALSE)
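Note that sample() is itself random, so the fold assignment above is only repeatable if the seed is fixed before it is drawn. A minimal base-R sketch of that idea follows; make_foldid is a hypothetical helper name, not part of glmnet, and the sizes and seed are illustrative assumptions.

```r
# Hypothetical helper (not part of glmnet): build a balanced fold assignment
# for n observations that is repeatable for a given seed.
make_foldid <- function(n, n_folds, seed) {
  set.seed(seed)  # fix the RNG state so sample() returns the same permutation
  sample(rep(seq_len(n_folds), length.out = n))
}

# Two independent calls with the same seed (illustrative values)
foldid_a <- make_foldid(n = 10, n_folds = 3, seed = 2)
foldid_b <- make_foldid(n = 10, n_folds = 3, seed = 2)

# TRUE when both calls produced exactly the same fold assignment
folds_match <- identical(foldid_a, foldid_b)

# Number of observations in each fold; sizes differ by at most one
fold_sizes <- as.integer(table(foldid_a))
```

The resulting vector would then be passed to cv.glmnet through its foldid argument, as in the call above. The match should hold within the same R version and RNG settings; it is not guaranteed across different ones.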
According to Writing R Extensions, a C wrapper is needed to call R's normal random number generator from FORTRAN. I don't see any C code in the glmnet source, so I'm afraid this doesn't look implemented. The relevant section is:
6.6 Calling C from FORTRAN and vice versa