I'm testing out my new NVIDIA Titan V, which supports float16 operations. I noticed that during training, float16 is much slower (~800 ms/step) than float32 (~500 ms/step).
To do float16 operations, I changed my keras.json file to:
{
"backend": "tensorflow",
"floatx": "float16",
"image_data_format": "channels_last",
"epsilon": 1e-07
}
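For reference, the same switch can be made from Python by rewriting the config file before Keras is imported (a sketch using only the standard library; `set_floatx_in_config` is a hypothetical helper, and Keras itself reads the file from `~/.keras/keras.json` by default):

```python
import json

# Hypothetical helper: flip the "floatx" field in a keras.json-style config.
def set_floatx_in_config(config_path, dtype):
    with open(config_path) as f:
        config = json.load(f)
    config["floatx"] = dtype  # e.g. "float16" or "float32"
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

# Demonstrated against a local copy of the file shown above.
path = "keras.json"
with open(path, "w") as f:
    json.dump({"backend": "tensorflow", "floatx": "float32",
               "image_data_format": "channels_last", "epsilon": 1e-07}, f)

set_floatx_in_config(path, "float16")
with open(path) as f:
    print(json.load(f)["floatx"])  # float16
```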
Why are the float16 operations so much slower? Do I need to make modifications to my code and not just the keras.json file?
I am using CUDA 9.0, cuDNN 7.0, TensorFlow 1.7.0, and Keras 2.1.5 on Windows 10. My Python 3.5 code is below:
import keras
from keras.models import Sequential
from keras.layers import (Activation, AveragePooling2D, Conv2D, Dense,
                          Flatten, MaxPooling2D)
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 336, 224
train_data_dir = 'C:\\my_dir\\train'
test_data_dir = 'C:\\my_dir\\test'
batch_size = 128

datagen = ImageDataGenerator(rescale=1. / 255,
                             horizontal_flip=True,  # randomly flip the images
                             vertical_flip=True)

train_generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')

test_generator = datagen.flow_from_directory(
    test_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')
# Architecture of NN
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(img_height, img_width, 3),
                 padding='same', kernel_initializer='lecun_normal'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(AveragePooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1))
model.add(Activation('sigmoid'))

my_rmsprop = keras.optimizers.RMSprop(lr=0.0001, rho=0.9, epsilon=1e-04, decay=0.0)
model.compile(loss='binary_crossentropy',
              optimizer=my_rmsprop,
              metrics=['accuracy'])
# Training
nb_epoch = 32
nb_train_samples = 512
nb_test_samples = 512
model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=nb_epoch,
    verbose=1,
    validation_data=test_generator,
    validation_steps=nb_test_samples // batch_size)

# Evaluating on the testing set (evaluate_generator takes a step count,
# not a sample count, in Keras 2)
model.evaluate_generator(test_generator, steps=nb_test_samples // batch_size)
Efficient training of modern neural networks often relies on lower-precision data types: on an A100 GPU, for example, peak float16 matrix-multiplication and convolution throughput is 16x peak float32 throughput. So, in principle, float16 should be faster, not slower.
From the cuDNN documentation (section 2.7, subsection Type Conversion) you can see:

Note: Accumulators are 32-bit integers which wrap on overflow.

and that this holds for the standard INT8 data type of the data input, the filter input, and the output. Under those assumptions, @jiandercy is right that there is a float16-to-float32 conversion and then a back-conversion before returning the result, which is why float16 ends up slower.
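The same up-and-down conversion can be illustrated on the CPU with NumPy (a sketch of the idea, not the cuDNN code path): NumPy likewise has no native half-precision arithmetic, so float16 operands are converted to float32, computed on, and converted back to float16. The example also shows why a wider accumulator matters: float16 silently drops small addends once values grow.

```python
import numpy as np

# float16 matmul: NumPy emulates it by converting through float32,
# so the result dtype is float16 even though the arithmetic is wider.
a = np.ones((64, 64), dtype=np.float16)
b = np.ones((64, 64), dtype=np.float16)
c = a @ b
print(c.dtype)   # float16
print(c[0, 0])   # 64.0

# float16 has ~3 decimal digits of precision: above 2048 the spacing
# between representable values exceeds 1, so adding 1 is a no-op.
x = np.float16(2048) + np.float16(1)
print(x == np.float16(2048))  # True: the +1 is lost
```

This mirrors the cuDNN behavior described above: the conversions cost time, and accumulating in float32 avoids exactly the kind of lost-addend error shown in the last line.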