 

Training a dense layer from bottleneck features vs. freezing all layers but the last - should be the same, but they behave differently

As a "sanity check" I tried two ways to use transfer learning that I expected to behave the same, if not in running time than at least in the results.

The first method uses bottleneck features (as explained here: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html ), that is, running the existing network to generate the activations just before the last dense layer, saving them, and then training a new dense layer with these features as input.

The second method replaces the last dense layer of the model with a new one and then freezes all other layers in the model.

I expected the second method to be as effective as the first one, but it was not.

The output of the first method was:

Epoch 1/50
16/16 [==============================] - 0s - loss: 1.3095 - acc: 0.4375 - val_loss: 0.4533 - val_acc: 0.7500
Epoch 2/50
16/16 [==============================] - 0s - loss: 0.3555 - acc: 0.8125 - val_loss: 0.2305 - val_acc: 1.0000
Epoch 3/50
16/16 [==============================] - 0s - loss: 0.1365 - acc: 1.0000 - val_loss: 0.1603 - val_acc: 1.0000
Epoch 4/50
16/16 [==============================] - 0s - loss: 0.0600 - acc: 1.0000 - val_loss: 0.1012 - val_acc: 1.0000
Epoch 5/50
16/16 [==============================] - 0s - loss: 0.0296 - acc: 1.0000 - val_loss: 0.0681 - val_acc: 1.0000
Epoch 6/50
16/16 [==============================] - 0s - loss: 0.0165 - acc: 1.0000 - val_loss: 0.0521 - val_acc: 1.0000
Epoch 7/50
16/16 [==============================] - 0s - loss: 0.0082 - acc: 1.0000 - val_loss: 0.0321 - val_acc: 1.0000
Epoch 8/50
16/16 [==============================] - 0s - loss: 0.0036 - acc: 1.0000 - val_loss: 0.0222 - val_acc: 1.0000
Epoch 9/50
16/16 [==============================] - 0s - loss: 0.0023 - acc: 1.0000 - val_loss: 0.0185 - val_acc: 1.0000
Epoch 10/50
16/16 [==============================] - 0s - loss: 0.0011 - acc: 1.0000 - val_loss: 0.0108 - val_acc: 1.0000
Epoch 11/50
16/16 [==============================] - 0s - loss: 5.6636e-04 - acc: 1.0000 - val_loss: 0.0087 - val_acc: 1.0000
Epoch 12/50
16/16 [==============================] - 0s - loss: 2.9463e-04 - acc: 1.0000 - val_loss: 0.0094 - val_acc: 1.0000
Epoch 13/50
16/16 [==============================] - 0s - loss: 1.5169e-04 - acc: 1.0000 - val_loss: 0.0072 - val_acc: 1.0000
Epoch 14/50
16/16 [==============================] - 0s - loss: 7.4001e-05 - acc: 1.0000 - val_loss: 0.0039 - val_acc: 1.0000
Epoch 15/50
16/16 [==============================] - 0s - loss: 3.9956e-05 - acc: 1.0000 - val_loss: 0.0034 - val_acc: 1.0000
Epoch 16/50
16/16 [==============================] - 0s - loss: 2.0384e-05 - acc: 1.0000 - val_loss: 0.0024 - val_acc: 1.0000
Epoch 17/50
16/16 [==============================] - 0s - loss: 1.0036e-05 - acc: 1.0000 - val_loss: 0.0026 - val_acc: 1.0000
Epoch 18/50
16/16 [==============================] - 0s - loss: 5.0962e-06 - acc: 1.0000 - val_loss: 0.0010 - val_acc: 1.0000
Epoch 19/50
16/16 [==============================] - 0s - loss: 2.7791e-06 - acc: 1.0000 - val_loss: 0.0011 - val_acc: 1.0000
Epoch 20/50
16/16 [==============================] - 0s - loss: 1.5646e-06 - acc: 1.0000 - val_loss: 0.0015 - val_acc: 1.0000
Epoch 21/50
16/16 [==============================] - 0s - loss: 8.6427e-07 - acc: 1.0000 - val_loss: 9.0825e-04 - val_acc: 1.0000
Epoch 22/50
16/16 [==============================] - 0s - loss: 4.3958e-07 - acc: 1.0000 - val_loss: 5.6370e-04 - val_acc: 1.0000
Epoch 23/50
16/16 [==============================] - 0s - loss: 2.5332e-07 - acc: 1.0000 - val_loss: 5.1226e-04 - val_acc: 1.0000
Epoch 24/50
16/16 [==============================] - 0s - loss: 1.6391e-07 - acc: 1.0000 - val_loss: 6.6560e-04 - val_acc: 1.0000
Epoch 25/50
16/16 [==============================] - 0s - loss: 1.3411e-07 - acc: 1.0000 - val_loss: 6.5456e-04 - val_acc: 1.0000
Epoch 26/50
16/16 [==============================] - 0s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 3.4316e-04 - val_acc: 1.0000
Epoch 27/50
16/16 [==============================] - 0s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 3.4316e-04 - val_acc: 1.0000
Epoch 28/50
16/16 [==============================] - 0s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 3.4316e-04 - val_acc: 1.0000
Epoch 29/50
16/16 [==============================] - 0s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 3.4316e-04 - val_acc: 1.0000
Epoch 30/50
16/16 [==============================] - 0s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 3.4316e-04 - val_acc: 1.0000

It converges quickly and yields good results.

The second method, on the other hand, gives this:

Epoch 1/50
24/24 [==============================] - 63s - loss: 0.7375 - acc: 0.7500 - val_loss: 0.7575 - val_acc: 0.6667
Epoch 2/50
24/24 [==============================] - 61s - loss: 0.6763 - acc: 0.7500 - val_loss: 1.5228 - val_acc: 0.5000
Epoch 3/50
24/24 [==============================] - 61s - loss: 0.7149 - acc: 0.7500 - val_loss: 3.5805 - val_acc: 0.3333
Epoch 4/50
24/24 [==============================] - 61s - loss: 0.6363 - acc: 0.7500 - val_loss: 1.5066 - val_acc: 0.5000
Epoch 5/50
24/24 [==============================] - 61s - loss: 0.6542 - acc: 0.7500 - val_loss: 1.8745 - val_acc: 0.6667
Epoch 6/50
24/24 [==============================] - 61s - loss: 0.7007 - acc: 0.7500 - val_loss: 1.5328 - val_acc: 0.5000
Epoch 7/50
24/24 [==============================] - 61s - loss: 0.6900 - acc: 0.7500 - val_loss: 3.6004 - val_acc: 0.3333
Epoch 8/50
24/24 [==============================] - 61s - loss: 0.6615 - acc: 0.7500 - val_loss: 1.5734 - val_acc: 0.5000
Epoch 9/50
24/24 [==============================] - 61s - loss: 0.6571 - acc: 0.7500 - val_loss: 3.0078 - val_acc: 0.6667
Epoch 10/50
24/24 [==============================] - 61s - loss: 0.5762 - acc: 0.7083 - val_loss: 3.6029 - val_acc: 0.5000
Epoch 11/50
24/24 [==============================] - 61s - loss: 0.6515 - acc: 0.7500 - val_loss: 5.8610 - val_acc: 0.3333
Epoch 12/50
24/24 [==============================] - 61s - loss: 0.6541 - acc: 0.7083 - val_loss: 2.4551 - val_acc: 0.5000
Epoch 13/50
24/24 [==============================] - 61s - loss: 0.6700 - acc: 0.7500 - val_loss: 2.9983 - val_acc: 0.6667
Epoch 14/50
24/24 [==============================] - 61s - loss: 0.6486 - acc: 0.7500 - val_loss: 3.6179 - val_acc: 0.5000
Epoch 15/50
24/24 [==============================] - 61s - loss: 0.6985 - acc: 0.6667 - val_loss: 5.8419 - val_acc: 0.3333
Epoch 16/50
24/24 [==============================] - 62s - loss: 0.6465 - acc: 0.7083 - val_loss: 2.5201 - val_acc: 0.5000
Epoch 17/50
24/24 [==============================] - 62s - loss: 0.6246 - acc: 0.7500 - val_loss: 2.9912 - val_acc: 0.6667
Epoch 18/50
24/24 [==============================] - 62s - loss: 0.6768 - acc: 0.7500 - val_loss: 3.6320 - val_acc: 0.5000
Epoch 19/50
24/24 [==============================] - 62s - loss: 0.5774 - acc: 0.7083 - val_loss: 5.8575 - val_acc: 0.3333
Epoch 20/50
24/24 [==============================] - 62s - loss: 0.6642 - acc: 0.7500 - val_loss: 2.5865 - val_acc: 0.5000
Epoch 21/50
24/24 [==============================] - 63s - loss: 0.6553 - acc: 0.7083 - val_loss: 2.9967 - val_acc: 0.6667
Epoch 22/50
24/24 [==============================] - 62s - loss: 0.6469 - acc: 0.7083 - val_loss: 3.6233 - val_acc: 0.5000
Epoch 23/50
24/24 [==============================] - 64s - loss: 0.6029 - acc: 0.7500 - val_loss: 5.8225 - val_acc: 0.3333
Epoch 24/50
24/24 [==============================] - 63s - loss: 0.6183 - acc: 0.7083 - val_loss: 2.5325 - val_acc: 0.5000
Epoch 25/50
24/24 [==============================] - 62s - loss: 0.6631 - acc: 0.7500 - val_loss: 2.9879 - val_acc: 0.6667
Epoch 26/50
24/24 [==============================] - 63s - loss: 0.6082 - acc: 0.7500 - val_loss: 3.6206 - val_acc: 0.5000
Epoch 27/50
24/24 [==============================] - 62s - loss: 0.6536 - acc: 0.7500 - val_loss: 5.7937 - val_acc: 0.3333
Epoch 28/50
24/24 [==============================] - 63s - loss: 0.5853 - acc: 0.7500 - val_loss: 2.6138 - val_acc: 0.5000
Epoch 29/50
24/24 [==============================] - 62s - loss: 0.5523 - acc: 0.7500 - val_loss: 3.0126 - val_acc: 0.6667
Epoch 30/50
24/24 [==============================] - 62s - loss: 0.7112 - acc: 0.7500 - val_loss: 3.7054 - val_acc: 0.5000

The same model (Inception V4) was used for both methods. My code is as follows:

First method (Bottleneck Features):

from keras import backend as K
import inception_v4
import numpy as np
import cv2
import os

from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.layers import Activation, Dropout, Flatten, Dense, Input

from keras.models import Model
os.environ['CUDA_VISIBLE_DEVICES'] = ''



v4 = inception_v4.create_model(weights='imagenet')


#v4.summary()
my_batch_size=1
train_data_dir ='//shared_directory/projects/try_CDxx/data/train/'
validation_data_dir ='//shared_directory/projects/try_CDxx/data/validation/'
top_model_weights_path= 'bottleneck_fc_model.h5'
class_num=2

img_width, img_height = 299, 299
#nb_train_samples=16
#nb_validation_samples=8
nb_epoch=50

main_input = v4.layers[1].input
main_output = v4.layers[-1].output
flatten_output = v4.layers[-2].output


model = Model(input=[main_input], output=[main_output, flatten_output])


def save_BN(model):
    datagen = ImageDataGenerator(rescale=1./255)
#   
    generator = datagen.flow_from_directory(
            train_data_dir,
            target_size=(img_width, img_height),
            batch_size=my_batch_size,
            class_mode='categorical',
            shuffle=False)
    nb_train_samples = generator.classes.size       
    bottleneck_features_train = model.predict_generator(generator, nb_train_samples)
#
    np.save(open('bottleneck_flat_features_train.npy', 'wb'), bottleneck_features_train[1])

    np.save(open('bottleneck_train_labels.npy', 'wb'), generator.classes)

    generator = datagen.flow_from_directory(
            validation_data_dir,
            target_size=(img_width, img_height),
            batch_size=my_batch_size,
            class_mode='categorical',
            shuffle=False)

    nb_validation_samples = generator.classes.size
    bottleneck_features_validation = model.predict_generator(generator, nb_validation_samples)

    np.save(open('bottleneck_flat_features_validation.npy', 'wb'), bottleneck_features_validation[1])

    np.save(open('bottleneck_validation_labels.npy', 'wb'), generator.classes)


def train_top_model():
    # The .npy files were written in binary mode ('wb'), so open them
    # in binary mode here as well.
    train_data = np.load(open('bottleneck_flat_features_train.npy', 'rb'))
    train_labels = np.load(open('bottleneck_train_labels.npy', 'rb'))
#
    validation_data = np.load(open('bottleneck_flat_features_validation.npy', 'rb'))
    validation_labels = np.load(open('bottleneck_validation_labels.npy', 'rb'))
    #
    top_m  = Sequential()
    top_m.add(Dense(class_num, input_shape=train_data.shape[1:], activation='softmax', name='top_dense1'))
    top_m.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#
    top_m.fit(train_data, train_labels,
              nb_epoch=nb_epoch, batch_size=my_batch_size,
              validation_data=(validation_data, validation_labels))

    dense_layer = top_m.layers[-1]
    my_weights = dense_layer.get_weights()
    np.save(open('retrained_top_layer_weight.npy', 'wb'), my_weights)




save_BN(model)
train_top_model()

Second method (freezing all but the last layer):

from keras import backend as K
import inception_v4
import numpy as np
import cv2
import os

from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, ZeroPadding2D
from keras.layers import Activation, Dropout, Flatten, Dense, Input

from keras.models import Model
os.environ['CUDA_VISIBLE_DEVICES'] = ''


my_batch_size=1


train_data_dir ='//shared_directory/projects/try_CDxx/data/train/'
validation_data_dir ='//shared_directory/projects/try_CDxx/data/validation/'
top_model_path= 'tm_trained_model.h5'

img_width, img_height = 299, 299
num_classes=2
#nb_epoch=50
nb_epoch=50
nbr_train_samples = 24
nbr_validation_samples = 12


def train_top_model(num_classes):

    v4 = inception_v4.create_model(weights='imagenet')
    predictions = Dense(output_dim=num_classes, activation='softmax', name="newDense")(v4.layers[-2].output) # replacing the 1001 categories dense layer with my own 
    main_input= v4.layers[1].input
    main_output=predictions
    t_model = Model(input=[main_input], output=[main_output])


    val_datagen = ImageDataGenerator(rescale=1./255)
    train_datagen  = ImageDataGenerator(rescale=1./255)  


    train_generator = train_datagen.flow_from_directory(
            train_data_dir,
            target_size = (img_width, img_height),
            batch_size = my_batch_size,
            shuffle = False,
            class_mode = 'categorical')

    validation_generator = val_datagen.flow_from_directory(
            validation_data_dir,
            target_size=(img_width, img_height),
            batch_size=my_batch_size,
            shuffle = False,
            class_mode = 'categorical') 
#
    for layer in t_model.layers:
        layer.trainable = False
    t_model.layers[-1].trainable = True
    t_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])


#
    t_model.fit_generator(
            train_generator,
            samples_per_epoch = nbr_train_samples,
            nb_epoch = nb_epoch,
            validation_data = validation_generator,
            nb_val_samples = nbr_validation_samples)
    t_model.save(top_model_path)    

#   print (t_model.trainable_weights)

train_top_model(num_classes)

I think these should be equivalent: freezing the whole net except the top and training only the top should be identical to using the frozen net to generate the features that exist just before the top, and then training a new dense layer on those saved features.
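
As an additional sanity check, here is a sketch (reusing the names, variables, and saved .npy files from the first script; treat it as illustrative) that tests this assumption directly: since predict_generator runs in inference mode, the frozen network's penultimate activations should match the saved bottleneck features.

import numpy as np
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

# Rebuild the feature extractor exactly as in the first script.
feature_model = Model(input=[main_input], output=[flatten_output])

datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=my_batch_size,
        class_mode='categorical',
        shuffle=False)

# Inference mode: Dropout is a no-op and BatchNorm uses its moving
# averages, so these features should equal the saved bottleneck ones.
fresh = feature_model.predict_generator(generator, generator.classes.size)
saved = np.load(open('bottleneck_flat_features_train.npy', 'rb'))
print(np.allclose(fresh, saved))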

So either my code or my thinking about the problem is incorrect (or both...).

What am I doing wrong?

Thank you for your time.


1 Answer

This was a really neat problem. It's because of the Dropout layers in your second approach. Even though a layer is set to be not trainable, Dropout still operates during training: it randomly perturbs the activations flowing into your new dense layer, so the frozen network no longer computes the deterministic features you saved as bottleneck features. The trainable flag only stops weight updates; Dropout has no weights, and whether it fires is controlled by the learning phase, which is 1 during fit/fit_generator.
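
You can see this with a tiny standalone example (a sketch, not your model): push the same input through a "frozen" Dropout layer with the Keras learning phase toggled between training and inference.

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dropout

model = Sequential([Dropout(0.5, input_shape=(10,))])
model.layers[0].trainable = False  # "frozen", but only for weight updates

f = K.function([model.input, K.learning_phase()], [model.output])
x = np.ones((1, 10))

print(f([x, 1])[0])  # training: ~half the entries zeroed, the rest scaled by 2
print(f([x, 0])[0])  # inference: identity, all ones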

Try to change your code to:

v4 = inception_v4.create_model(weights='imagenet')
# Tapping layers[-4].output takes the network just below its original
# top block, so the pretrained Dropout layer is left out of the new
# graph entirely.
predictions = Flatten()(v4.layers[-4].output)
predictions = Dense(output_dim=num_classes, activation='softmax', name="newDense")(predictions)

Also, because of the BatchNormalization layers (which use per-batch statistics during training), change the batch_size to 24.
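
The BatchNormalization effect is easy to demonstrate in isolation (again a standalone sketch): in training mode it normalizes each feature with the statistics of the current batch, so with batch_size=1 every sample is normalized against itself and the activations collapse.

import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import BatchNormalization

bn = Sequential([BatchNormalization(input_shape=(4,))])
f = K.function([bn.input, K.learning_phase()], [bn.output])

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(f([x, 1])[0])  # training: single-sample batch stats -> ~all zeros
print(f([x, 0])[0])  # inference: moving-average stats (initially mean 0, var 1)

With a larger batch the batch statistics become meaningful, which is why increasing batch_size helps here.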

This should work.
