Why are my results still not reproducible?

I want to get reproducible results for a CNN. I use Keras and Google Colab with GPU.

Besides inserting the code snippets that are commonly recommended for reproducibility, I also added seeds to the layers.

###### This is the first code snippet to run #####

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
###### This is the second code snippet to run #####

from __future__ import print_function  
import numpy as np 

import tensorflow as tf

import random as rn 
import os 
os.environ['PYTHONHASHSEED'] = '0'
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1) 

###### This is the third code snippet to run #####

from keras import backend as K

sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)  
###### This is the fourth code snippet to run #####

def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=1), input_shape=(28,28,1)))

  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(Dropout(0.25, seed=1))  


  model.add(Dense(512, kernel_initializer=initializers.glorot_uniform(seed=2)))
  model.add(Dropout(0.5, seed=1))
  model.add(Dense(10, kernel_initializer=initializers.glorot_uniform(seed=2)))

  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001), metrics=['accuracy'])
  return model

def split_data(X,y):
  X_train_val, X_val, y_train_val, y_val = train_test_split(X, y, random_state=42, test_size=1/5, stratify=y) 
  return(X_train_val, X_val, y_train_val, y_val) 

def train_model_with_EarlyStopping(model, X, y):
  # make train and validation data
  X_tr, X_val, y_tr, y_val = split_data(X,y)

  es = EarlyStopping(monitor='val_loss', patience=20, mode='min', restore_best_weights=True)

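  # NOTE: the original call is cut off after `y_tr,`. The arguments below are
  # assumptions: batch_size and epochs follow the figures quoted in the answer,
  # and the validation data and callback are the ones defined above.
  history = model.fit(X_tr, y_tr,
                      batch_size=64,
                      epochs=20,
                      validation_data=(X_val, y_val),
                      callbacks=[es])

  return history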
###### This is the fifth code snippet to run #####

train_model_with_EarlyStopping(model_cnn(), X, y)

Whenever I run the above code, I get different results. Does the reason lie in the code, or is it simply not possible to obtain reproducible results in Google Colab with GPU support?

The complete code (it contains unnecessary parts, such as unused libraries):

1 Answer

The problem isn't limited to Colab, and is reproducible locally. The behavior, however, may be inevitable.

The code at the bottom is a minimal reproducible version of your code, with fit parameters tweaked for faster testing. I observed a maximum loss difference of only 0.0144% over 468 iterations per run, across 5 runs. This is pretty good. With batch_size=64, 60000 samples, and 20 epochs, you'll have 18750 iterations, which will amplify this figure substantially.

Regardless, GPU parallelism is the most likely culprit driving the randomness, and the small differences do accumulate over time to yield a substantial difference (demo below). If 1e-8 seems small, try adding random noise to half your weights, with magnitude clipped at 1e-8, and witness its life philosophy change.

The role of the seeds becomes dramatically pronounced if you don't use them: try it, and all your metrics will run wild within the first 10 iterations. Also, loss is better for measuring run-to-run differences, as accuracy is a lot more sensitive to numeric precision errors: the difference between 60% accuracy and 70% accuracy on a 10-sample batch can be a single prediction that differs by 0.000001 w.r.t. 0.5, while loss will barely budge.
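
To make the loss-versus-accuracy point concrete, here is a minimal sketch in plain Python. The two "runs" and their probabilities are made-up illustrative values, not output from the model above; they differ only in one prediction that sits 1e-6 on either side of the 0.5 threshold.

```python
import math

def accuracy(y_true, y_prob):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p > 0.5) == (t == 1) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy over the batch."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob))
    return -total / len(y_true)

y_true = [1] * 10
shared = [0.9] * 6 + [0.2] * 3      # nine predictions identical in both runs
run_a = shared + [0.5 - 1e-6]       # tenth prediction lands just below 0.5
run_b = shared + [0.5 + 1e-6]       # tenth prediction lands just above 0.5

print(accuracy(y_true, run_a), accuracy(y_true, run_b))  # 0.6 0.7
loss_gap = abs(binary_crossentropy(y_true, run_a) - binary_crossentropy(y_true, run_b))
print(loss_gap)                     # roughly 4e-07
```

A perturbation of 2e-6 in one probability moves accuracy by ten percentage points, while the loss changes only around the seventh decimal place.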

Lastly, note that your hyperparameter choice will have a far greater impact upon model performance than randomness; no matter how many seeds you throw, they won't magic a model into SOTA. -- I recommend this fine clip.

Your code is fine. You've taken all practical steps to ensure reproducibility, with one exception: PYTHONHASHSEED must be set before your Python kernel starts.
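
A minimal sketch of why the timing matters, assuming a standard CPython 3.7+ interpreter: string hashing is configured once at interpreter start-up, so the variable only has an effect in a process that is launched with it already set. The helper name hash_in_fresh_interpreter is made up for this illustration.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    """Start a new Python process with PYTHONHASHSEED=seed and return hash('keras') from it."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('keras'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

# Three separate interpreters, all started with the seed already in their environment.
hashes_seed_0 = {hash_in_fresh_interpreter(0) for _ in range(3)}
print(len(hashes_seed_0))  # 1: the hash is identical across all three processes
```

Assigning os.environ['PYTHONHASHSEED'] inside a notebook cell happens after the kernel's interpreter has already started, so it comes too late for that kernel.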

What can you do to reduce randomness?

  1. Repeat runs, average results. Understandably that's expensive, but note that even a perfectly reproducible run isn't perfectly informative, as model variance w.r.t. train & validation sets is likely to be much greater than noise-induced randomness

  2. K-Fold Cross-Validation: can mitigate both data & noise variance significantly

  3. Larger validation set: extracted features can differ only so much due to noise; the larger the validation set, the less small perturbations in the weights should show up in the metrics
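
Points 1 and 2 can be combined into one loop. Below is a rough sketch in plain Python; k_fold_indices and dummy_score are hypothetical names, and dummy_score is only a placeholder for building, fitting, and evaluating the model on one fold.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx); every sample lands in exactly one validation fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)        # fixed seed: same folds on every run
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def dummy_score(train_idx, val_idx):
    """Placeholder: train on train_idx and return the validation loss on val_idx."""
    return 1.0 + 0.001 * (val_idx[0] % 5)   # fake number, for illustration only

scores = [dummy_score(tr, va) for tr, va in k_fold_indices(n_samples=100, k=5)]
print("mean: %.4f  stdev: %.4f" % (statistics.mean(scores), statistics.stdev(scores)))
```

Reporting the mean together with the standard deviation across folds (or across repeated runs) shows how much of an observed difference between two configurations is just noise.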

GPU Parallelism: amplifying float error

print(2. * 11. / 9.)  # 2.4444444444444446
print(2. / 9. * 11.)  # 2.444444444444444

Order of operations matters, and by exploiting multithreading, GPU parallelism gives no guarantee whatsoever of operations being executed in the same order. At first glance, the difference may look innocent - but give it enough iterations ...

one = 1
for _ in range(int(1e8)):
    one *= (2. / 9. * 11.) / (2. * 11. / 9.)
print(one)     # 0.9999999777955395
print(1 - one) # ~2.22e-08

... and a "one" ends up about 1e-08, the size of a typical small weight value, away from being its original self. If 100 million iterations seems like a stretch, consider that the operation completed in about half a minute, whereas your model can train for over an hour, and the former runs entirely on CPU.
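
The same effect shows up in a plain sum, which is closer to what a parallel reduction does: partial results get combined in whatever order the threads finish. Here is a small sketch that uses shuffling as a stand-in for that ordering. This is an assumption made purely for illustration, not a model of how a GPU actually schedules work, and naive_sum is a made-up helper name.

```python
import math
import random

# Floating-point addition is not associative:
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

def naive_sum(xs):
    """Left-to-right accumulation, one rounding step per addition."""
    total = 0.0
    for x in xs:
        total += x
    return total

rng = random.Random(0)
values = [rng.uniform(-1, 1) for _ in range(100000)]

# Sum the same numbers in five different orders.
totals = set()
for seed in range(5):
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)
    totals.add(naive_sum(shuffled))

exact = math.fsum(values)  # correctly rounded, hence independent of order
print(len(totals), max(abs(t - exact) for t in totals))
```

The individual discrepancies are tiny, but a training run performs such reductions at every layer and every step, and each result feeds into the next.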

Minimal reproducible experiment:

import tensorflow as tf
import random as rn 
import numpy as np

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization
from keras.layers import MaxPooling2D, Conv2D
from keras.optimizers import Adam

def model_cnn():
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3,3), 
                   kernel_initializer='he_uniform', input_shape=(28,28,1)))
  model.add(Conv2D(32, kernel_size=(3,3), kernel_initializer='he_uniform'))
  model.add(Flatten())  # needed so the Dense layers output shape (N, 10), matching y
  model.add(Dense(512, kernel_initializer='he_uniform'))
  model.add(Dense(10, kernel_initializer='he_uniform'))
  model.compile(loss="categorical_crossentropy", optimizer=Adam(lr=0.001),
                metrics=['accuracy'])
  return model


X_train = np.random.randn(30000, 28, 28, 1)
y_train = np.random.randint(0, 2, (30000, 10))
X_val   = np.random.randn(30000, 28, 28, 1)
y_val   = np.random.randint(0, 2, (30000, 10))
model = model_cnn()


history = model.fit(X_train, y_train, batch_size=64, shuffle=True,
                    epochs=1, verbose=1, validation_data=(X_val, y_val))

Run differences:

loss: 12.5044 - acc: 0.0971 - val_loss: 11.5389 - val_acc: 0.1051
loss: 12.5047 - acc: 0.0958 - val_loss: 11.5369 - val_acc: 0.1018
loss: 12.5055 - acc: 0.0955 - val_loss: 11.5382 - val_acc: 0.0980
loss: 12.5042 - acc: 0.0961 - val_loss: 11.5382 - val_acc: 0.1179
loss: 12.5062 - acc: 0.0960 - val_loss: 11.5366 - val_acc: 0.1082
