MobileNet is not usable when is_training is set to false

Tags:

tensorflow

A more accurate description of this issue is that MobileNet behaves badly when is_training is not explicitly set to true. I'm referring to the MobileNet provided by TensorFlow in their models repository: https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.py

This is how I create the net (phase_train=True):

with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=features_layer_size,
        dropout_keep_prob=dropout_keep_prob, is_training=phase_train)

I'm training a recognition network, and during training I test on LFW. The results I get improve over time and reach good accuracy.

Before deployment I freeze the graph. If I freeze it with is_training=True, the results I get on LFW are the same as during training. But if I set is_training=False, I get results as if the network hadn't trained at all...

This behavior also happens with other networks, such as Inception.

I tend to believe that I'm missing something very fundamental here and that this is not a bug in TensorFlow...

Any help would be appreciated.

Adding more code...

This is how I prepare for training:

images_placeholder = tf.placeholder(tf.float32, shape=(None, image_size, image_size, 1), name='input')
labels_placeholder = tf.placeholder(tf.int32, shape=(None))
dropout_placeholder = tf.placeholder_with_default(1.0, shape=(), name='dropout_keep_prob')
phase_train_placeholder = tf.Variable(True, name='phase_train')

global_step = tf.Variable(0, name='global_step', trainable=False)

# build graph

with slim.arg_scope(mobilenet_v1.mobilenet_v1_arg_scope(is_training=phase_train_placeholder)):
    features, endpoints = mobilenet_v1.mobilenet_v1(
        inputs=images_placeholder, features_layer_size=512, dropout_keep_prob=1.0,
        is_training=phase_train_placeholder)

# loss

logits = slim.fully_connected(inputs=features, num_outputs=train_data.get_class_count(), activation_fn=None,
                              weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
                              weights_regularizer=slim.l2_regularizer(scale=0.00005),
                              scope='Logits', reuse=False)

tf.losses.sparse_softmax_cross_entropy(labels=labels_placeholder, logits=logits,
                                       reduction=tf.losses.Reduction.MEAN)

loss = tf.losses.get_total_loss()

# normalize output for inference

embeddings = tf.nn.l2_normalize(features, 1, 1e-10, name='embeddings')

# optimizer

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(loss, global_step=global_step)

This is my train step:

batch_data, batch_labels = train_data.next_batch()
feed_dict = {
    images_placeholder: batch_data,
    labels_placeholder: batch_labels,
    dropout_placeholder: dropout_keep_prob
}
_, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)

I could add the code for how I freeze the graph, but it's not really necessary: it's enough to build the graph with is_training=False, load the latest checkpoint, and run the evaluation on LFW to reproduce the problem.
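
For reference, here is a minimal sketch of the freezing step I mean (not my exact script; the checkpoint directory, output node name and output file are just placeholders):

import tensorflow as tf

saver = tf.train.Saver()
with tf.Session() as sess:
    # restore the latest training checkpoint (placeholder directory)
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints/'))
    # fold the trained variables into constants so the graph is self-contained
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), output_node_names=['embeddings'])
    with tf.gfile.GFile('frozen_model.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())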

Update...

I found that the problem is in the batch normalization layer. It's enough to set just this layer to is_training=False to reproduce the problem.

References that I found after discovering this:

http://ruishu.io/2016/12/27/batchnorm/

https://github.com/tensorflow/tensorflow/issues/10118

Batch Normalization - Tensorflow
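
To illustrate what is going on (my own minimal sketch, not code from the model repository): slim's batch_norm only registers the moving_mean/moving_variance updates in the tf.GraphKeys.UPDATE_OPS collection, and nothing runs them unless they are wired into the training step:

import tensorflow as tf
slim = tf.contrib.slim

x = tf.placeholder(tf.float32, shape=(None, 8))
y = slim.batch_norm(x, is_training=True)

# These assign ops refresh the moving statistics that are used when
# is_training=False. If they never run during training, inference falls back
# on the initial values (mean 0, variance 1) and the network looks untrained.
print(tf.get_collection(tf.GraphKeys.UPDATE_OPS))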

Will update with a solution once I have a tested one.

asked Aug 24 '17 by Ohad Meir


1 Answer

So I found a solution, mainly using this reference: http://ruishu.io/2016/12/27/batchnorm/

From the link:

Note: When is_training is True the moving_mean and moving_variance need to be updated, by default the update_ops are placed in tf.GraphKeys.UPDATE_OPS so they need to be added as a dependency to the train_op, example:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if update_ops:
    updates = tf.group(*update_ops)
    total_loss = control_flow_ops.with_dependencies([updates], total_loss)

And to get straight to the point: instead of creating the optimizer like this:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(total_loss, global_step=global_step)

Do it like this:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(total_loss, global_step=global_step)

That solves the issue: with the control dependency in place, the moving_mean and moving_variance are updated on every training step, so the statistics used at inference time (is_training=False) actually match the trained model.
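
Alternatively, a sketch assuming you build the training op with slim anyway: slim.learning.create_train_op wires in the tf.GraphKeys.UPDATE_OPS dependency for you, so the moving averages are refreshed on every training step.

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = slim.learning.create_train_op(total_loss, optimizer,
                                         global_step=global_step)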

answered Oct 05 '22 by Ohad Meir