If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?

Question

UPDATE: This question was for Tensorflow 1.x. I upgraded to 2.0 and (at least on the simple code below) the reproducibility issue seems fixed on 2.0. So that solves my problem; but I'm still curious about what "best practices" were used for this issue on 1.x.

Training the exact same model/parameters/data on keras/tensorflow does not give reproducible results, and the loss is significantly different each time you train the model. There are many Stack Overflow questions about that (e.g., How to get reproducible results in keras) but the recommended workarounds don't seem to work for me or for many other people on Stack Overflow. OK, it is what it is.

But given that limitation of non-reproducibility with keras on tensorflow -- what's the best practice for comparing models and choosing hyper parameters? I'm testing different architectures and activations, but since the loss estimate is different each time, I'm never sure if one model is better than the other. Is there any best practice for dealing with this?

I don't think the issue has anything to do with my code, but just in case it helps; here's a sample program:

import os
#stackoverflow says turning off the GPU helps reproducibility, but it doesn't help for me
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ['PYTHONHASHSEED']=str(1)

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers 
import random
import pandas as pd
import numpy as np

#StackOverflow says this is needed for reproducibility but it doesn't help for me
from tensorflow.keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=1,inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)

#make some random data
NUM_ROWS = 1000
NUM_FEATURES = 10
random_data = np.random.normal(size=(NUM_ROWS, NUM_FEATURES))
df = pd.DataFrame(data=random_data, columns=['x_' + str(ii) for ii in range(NUM_FEATURES)])
y = df.sum(axis=1) + np.random.normal(size=(NUM_ROWS))

def run(x, y):
    #StackOverflow says you have to set the seeds but it doesn't help for me
    tf.set_random_seed(1)
    np.random.seed(1)
    random.seed(1)
    os.environ['PYTHONHASHSEED']=str(1)

    model = keras.Sequential([
            keras.layers.Dense(40, input_dim=df.shape[1], activation='relu'),
            keras.layers.Dense(20, activation='relu'),
            keras.layers.Dense(10, activation='relu'),
            keras.layers.Dense(1, activation='linear')
        ])
    NUM_EPOCHS = 500
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(x, y, epochs=NUM_EPOCHS, verbose=0)
    predictions = model.predict(x).flatten()
    loss = model.evaluate(x,  y) #This prints out the loss by side-effect

#Each time we run it gives a wildly different loss. :-(
run(df, y)
run(df, y)
run(df, y)

Given the non-reproducibility, how can I evaluate whether changes in my hyper-parameters and architecture are helping or not?

OverLordGoldDragon · Accepted Answer

It's sneaky, but your code does, in fact, lack a step for better reproducibility: resetting the Keras & TensorFlow graphs before each run. Without this, tf.set_random_seed() won't work properly - see correct approach below.

I'd exhaust all the options before throwing in the towel on reproducibility; currently I'm aware of only one non-reproducible instance, and it's likely a bug. Nonetheless, it's possible you'll get notably differing results even if you follow all the steps. In that case, see "If nothing works", but neither option there is very productive, so it's best to focus on attaining reproducibility first:

Definitive improvements:

Use reset_seeds(K) below
Increase numeric precision: K.set_floatx('float64')
Set PYTHONHASHSEED before the Python kernel starts - e.g. from terminal
Upgrade to TF 2, which includes some reproducibility bug fixes, but check the performance impact
Run TF on a single CPU thread (painfully slow)
Do not import from tf.python.keras - see here
Ensure all imports are consistent (i.e. don't do from keras.layers import ... and from tensorflow.keras.optimizers import ...)
Use a superior CPU - for example, Google Colab, even if using GPU, is much more robust against numeric imprecision - see this SO

Also see related SO on reproducibility
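
On the PYTHONHASHSEED point: Python fixes its string-hash randomization when the interpreter starts, so assigning os.environ['PYTHONHASHSEED'] inside an already running script (as in the question) does not change hashing in that process. Below is a minimal sketch that starts fresh interpreters with the variable already set. The helper name hash_in_fresh_interpreter is my own, purely for illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, hash_seed):
    """Start a new Python process with PYTHONHASHSEED set; return hash(text) there."""
    # The variable must be in the child's environment before it starts.
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreter launches with the same seed agree on the hash.
first = hash_in_fresh_interpreter("keras", 1)
second = hash_in_fresh_interpreter("keras", 1)
print(first == second)
```

From a terminal the equivalent is to prefix the command, e.g. PYTHONHASHSEED=1 python train.py, where train.py is a placeholder for your own script.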

If nothing works:

Rerun X times w/ exact same hyperparameters & seeds, average results
K-Fold Cross-Validation w/ exact same hyperparameters & seeds, average results - superior option, but more work involved
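
A rough sketch of both fallback options combined: repeat K-fold cross-validation with fixed seeds, then compare models by the mean and spread of their validation losses rather than by a single run. To keep it runnable without TensorFlow, fake_fit_and_evaluate is a made-up stand-in for model.fit(...) plus model.evaluate(...), and the base_loss values are invented for illustration, not real results.

```python
import random
import statistics


def kfold_indices(n_rows, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k folds of near-equal size."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # fixed seed: every model sees the same folds
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def fake_fit_and_evaluate(base_loss, train_idx, val_idx, seed):
    """Hypothetical stand-in for model.fit(...) followed by model.evaluate(...).

    Returns base_loss plus seeded noise so the sketch runs without TensorFlow.
    """
    rng = random.Random(seed)
    return base_loss + 0.2 * rng.random()


def cross_validated_loss(base_loss, n_rows=1000, k=5, repeats=3):
    """Collect one validation loss per (repeat, fold); return (mean, stdev)."""
    losses = []
    for repeat in range(repeats):
        folds = kfold_indices(n_rows, k, seed=repeat)
        for fold, (train_idx, val_idx) in enumerate(folds):
            run_seed = repeat * k + fold  # same seeds for every candidate model
            losses.append(
                fake_fit_and_evaluate(base_loss, train_idx, val_idx, run_seed)
            )
    return statistics.mean(losses), statistics.stdev(losses)


mean_a, std_a = cross_validated_loss(base_loss=1.0)  # pretend architecture A
mean_b, std_b = cross_validated_loss(base_loss=1.5)  # pretend architecture B
print(f"A: {mean_a:.3f} +/- {std_a:.3f}")
print(f"B: {mean_b:.3f} +/- {std_b:.3f}")
```

With real models, I'd only call one architecture better when the gap between the means is clearly larger than the spread across runs; otherwise the difference may just be run-to-run noise.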

Correct reset method:

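import random
import numpy as np
import tensorflow as tf

def reset_seeds(reset_graph_with_backend=None):
    """Reseed Python, NumPy and TF; pass the Keras backend K to also reset the graphs."""
    if reset_graph_with_backend is not None:
        K = reset_graph_with_backend
        K.clear_session()
        tf.compat.v1.reset_default_graph()
        print("KERAS AND TENSORFLOW GRAPHS RESET")  # optional

    # Seed after the reset so the TF seed is set on the fresh default graph
    np.random.seed(1)
    random.seed(2)
    tf.compat.v1.set_random_seed(3)
    print("RANDOM SEEDS RESET")  # optional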
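
To give some intuition for why reseeding alone may not be enough, here is a toy model in plain Python. It is not TensorFlow code. It assumes (my understanding of TF1 graph mode, so treat it as an assumption) that an op without an explicit seed derives one from the graph-level seed plus a per-graph op counter. ToyGraph and build_model are invented names.

```python
import random


class ToyGraph:
    """Toy stand-in for a TF1 graph: a graph-level seed plus an op counter."""

    def __init__(self):
        self.seed = None
        self.op_count = 0

    def random_weights(self, n):
        # Assumed behavior: each new "op" mixes the graph seed with the op counter.
        self.op_count += 1
        rng = random.Random(self.seed * 1000 + self.op_count)
        return [rng.random() for _ in range(n)]


def build_model(graph):
    """Pretend to build a two-layer model; return its initial 'weights'."""
    return graph.random_weights(3) + graph.random_weights(3)


graph = ToyGraph()
graph.seed = 3
first = build_model(graph)

graph.seed = 3              # reseed only: the op counter keeps growing
second = build_model(graph)

graph = ToyGraph()          # analogous to clear_session() + reset_default_graph()
graph.seed = 3
third = build_model(graph)

print(first == second)      # reseeding without a graph reset
print(first == third)       # reseeding after a graph reset
```

In this toy, only the run that starts from a fresh graph reproduces the first run's initial weights, which mirrors why reset_seeds(K) resets the graphs before reseeding.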

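Running TF on a single CPU thread (TF1-only code):

from tensorflow.keras import backend as K

session_conf = tf.ConfigProto(
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
K.set_session(sess)  # make Keras use this single-threaded session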

Tags: python, tensorflow, keras, reproducible-research

user2543623
