
Multiple sessions and graphs in Tensorflow (in the same process)

I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()) in the same process.

There is a substantial amount of code and abstraction, so I'm just pasting the relevant sections here.

The following is the restoring code:

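# Keep only the variables that belong to this model's scope.
self.variables = [var for var in all_vars if var.name.startswith(self.name)]
self.saver = tf.train.Saver(self.variables, max_to_keep=3)
self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))

if should_restore:
    # Load the checkpointed values into this model's session.
    self.saver.restore(self.sess, self.save_path)
else:
    # Initialize only this model's variables from scratch.
    self.sess.run(tf.initialize_variables(self.variables))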

Each model is scoped within its own graph and session, like this:

self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)

with self.sess.graph.as_default():
    # Create variables and ops.

All the variables within each model are created within the variable_scope context manager.

The feeding works as follows:

  • A background thread calls sess.run(inference_op) on input = scipy.misc.imread(X) and puts the result in a blocking thread-safe queue.
  • The main training loop reads from the queue and calls sess.run(train_op) on the second model.
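The feeding scheme above can be sketched with the standard library alone. This is a minimal illustration, not the original code: `run_inference` and `run_train_step` are hypothetical stand-ins for the two `sess.run` calls, and the queue capacity and the `None` sentinel are assumptions.

```python
import queue
import threading


def run_inference(x):
    # Hypothetical stand-in for sess.run(inference_op) on the first model.
    return [2.0 * v for v in x]


def run_train_step(features):
    # Hypothetical stand-in for sess.run(train_op) on the second model.
    return sum(features)


def producer(inputs, q):
    # Background thread: run the first model and enqueue its outputs.
    for x in inputs:
        q.put(run_inference(x))  # blocks while the queue is full
    q.put(None)  # sentinel (an assumption): signals the end of the data


def train(inputs, capacity=2):
    q = queue.Queue(maxsize=capacity)  # blocking, thread-safe queue
    worker = threading.Thread(target=producer, args=(inputs, q), daemon=True)
    worker.start()
    losses = []
    while True:
        features = q.get()  # main loop blocks until a result is available
        if features is None:
            break
        losses.append(run_train_step(features))
    worker.join()
    return losses


losses = train([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Note that in this arrangement the producer and the training loop run concurrently, which is the situation the question describes.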

PROBLEM:
The loss values of the second model change drastically across runs, even in the very first training iteration, and become NaN within a few iterations. I confirmed that the output of the first model is exactly the same every time. If I comment out the sess.run call of the first model and feed identical input from a pickled file instead, the behaviour disappears.

This is the train_op:

    # `labels` stands for the target tensor, which this excerpt does not show.
    loss_op = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=network.feedforward(), labels=labels)
    # Apply gradients.
    with tf.control_dependencies([loss_op]):
        opt = tf.train.GradientDescentOptimizer(lr)
        grads = opt.compute_gradients(loss_op)
        apply_gradient_op = opt.apply_gradients(grads)

    return apply_gradient_op

I know this is vague, but I'm happy to provide more details. Any help is appreciated!

Asked by Vikesh on Aug 07 '16 at 23:08



1 Answer

The issue is almost certainly caused by running the two session objects concurrently. I moved the first model's sess.run call from the background thread to the main thread and repeated the controlled experiment several times (each run lasting over 24 hours and reaching convergence) without ever observing NaN. With concurrent execution, by contrast, the model diverges within a few minutes.

I've restructured my code to use a common session object for all models.
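The controlled experiment described above amounts to running both models one after the other on a single thread. A minimal sketch of that serialized loop follows; `run_inference` and `run_train_step` are hypothetical stand-ins for the two `sess.run` calls, and the `state` dictionary is only an illustrative substitute for the second model's variables.

```python
def run_inference(x):
    # Hypothetical stand-in for running the first (restored) model.
    return [v / 255.0 for v in x]


def run_train_step(features, state):
    # Hypothetical stand-in for one training step of the second model.
    state["steps"] += 1
    state["total"] += sum(features)
    return state["total"]


def train_serial(inputs):
    # Both calls happen on the same thread, strictly in sequence,
    # so the two models never execute at the same time.
    state = {"steps": 0, "total": 0.0}
    history = []
    for x in inputs:
        features = run_inference(x)
        history.append(run_train_step(features, state))
    return state, history


state, history = train_serial([[255, 0], [255, 255]])
```

The trade-off is that inference no longer overlaps with training, so each iteration pays the cost of both steps.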

Answered by Vikesh on Oct 07 '22 at 01:10