I am doing research on semantic segmentation architectures. I need to speed up my training, but I don't know where to look next.
I have tried different approaches to data loading, but every time the bottleneck seems to be the CPU rather than the GPU. I run nvidia-smi and htop to monitor utilization.
Keras + custom DataGenerator with 8 workers and 1 GPU
model.fit_generator(generator=training_generator, use_multiprocessing=True, workers=8)
Keras + tf.data.Dataset with data loaded from raw images
model.fit(training_dataset.make_one_shot_iterator(), ...)
I tried both ways of prefetching:
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
dataset = dataset.apply(tf.contrib.data.prefetch_to_device('/gpu:0'))
Keras + tf.data.Dataset with data loaded from TFRecords
=> This option is next up.
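For reference, here is roughly what I expect the TFRecord variant to look like (an untested sketch; the file name train.tfrecords and the make_example helper are just placeholders):

```python
import tensorflow as tf

# Placeholder helper: store the already-encoded PNG bytes, so decoding still
# happens in the input pipeline but per-file open/listing overhead goes away.
def make_example(image_path, label_path):
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    with open(label_path, 'rb') as f:
        label_bytes = f.read()
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_bytes])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# One-off conversion pass over my filename lists.
with tf.python_io.TFRecordWriter('train.tfrecords') as writer:
    for img, lbl in zip(train_image_filenames, train_label_filenames):
        writer.write(make_example(img, lbl).SerializeToString())

# Training-time parsing inside the tf.data pipeline.
def parse_fn(serialized):
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.string),
    })
    return tf.image.decode_png(features['image']), tf.image.decode_png(features['label'])

dataset = tf.data.TFRecordDataset('train.tfrecords').map(parse_fn, num_parallel_calls=4)
```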
I feel like right now, my processing chain looks like this:
data on disk -> CPU loads data in RAM -> CPU does data preprocessing -> CPU moves data to GPU -> GPU does training step
Thus the only way I see to speed up the training is to do all the preprocessing up front and save the files to disk (which will be huge with data augmentation), and then use TFRecords to load those files efficiently.
Do you have any other ideas on how to improve the training speed?
I have tested my pipeline with two models, training each for 3 epochs with 140 steps per epoch (batch size = 3). Here are the results.
Raw image data => Keras.DataGenerator
simple model: 126s
complex model: 154s
Raw image data => tf.data.Dataset
simple model: 208s
complex model: 215s
Helper function
def load_image(self, path):
    image = cv2.cvtColor(cv2.imread(path, -1), cv2.COLOR_BGR2RGB)
    return image
Main part
# Collect a batch of images on the CPU step by step (probably the bottleneck of the whole computation)
for i in range(len(image_filenames_tmp)):
    # print(image_filenames_tmp[i])
    # print(label_filenames_tmp[i])
    input_image = self.load_image(image_filenames_tmp[i])[: self.shape[0], : self.shape[1]]
    output_image = self.load_image(label_filenames_tmp[i])[: self.shape[0], : self.shape[1]]

    # Prep the data. Make sure the labels are in one-hot format
    input_image = np.float32(input_image) / 255.0
    output_image = np.float32(self.one_hot_it(label=output_image, label_values=label_values))

    input_image_batch.append(np.expand_dims(input_image, axis=0))
    output_image_batch.append(np.expand_dims(output_image, axis=0))

input_image_batch = np.squeeze(np.stack(input_image_batch, axis=1))
output_image_batch = np.squeeze(np.stack(output_image_batch, axis=1))

return input_image_batch, output_image_batch
Helper function
def preprocess_fn(train_image_filename, train_label_filename):
    '''A transformation function to preprocess raw data
    into trainable input.'''
    x = tf.image.decode_png(tf.read_file(train_image_filename))
    x = tf.image.convert_image_dtype(x, tf.float32, saturate=False, name=None)
    x = tf.image.resize_image_with_crop_or_pad(x, 512, 512)

    y = tf.image.decode_png(tf.read_file(train_label_filename))
    y = tf.image.resize_image_with_crop_or_pad(y, 512, 512)

    class_names, label_values = get_label_info(csv_path)

    semantic_map = []
    for colour in label_values:
        class_map = tf.reduce_all(tf.equal(y, colour), axis=-1)
        semantic_map.append(class_map)
    semantic_map = tf.stack(semantic_map, axis=-1)
    # NOTE: cast to tf.float32 because most neural networks operate in float32.
    semantic_map = tf.cast(semantic_map, tf.float32)

    return x, semantic_map
Main part
dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    preprocess_fn, batch_size,
    num_parallel_batches=4,  # cpu cores
    drop_remainder=True if is_training else False))
dataset = dataset.repeat()
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE) # automatically picks best buffer_size
I am dealing with similar issues, and trying to optimize the pipeline is an uphill battle. Using Horovod instead of Keras multi-GPU gives me an almost linear speed-up, whereas Keras multi-GPU didn't: https://medium.com/omnius/keras-horovod-distributed-deep-learning-on-steroids-94666e16673d
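A minimal sketch of the Keras integration (the learning rate, loss and the rest of the compile call are placeholders; model and training_generator are assumed to be yours, and you launch it with horovodrun/mpirun):

```python
import keras
import tensorflow as tf
import horovod.keras as hvd

hvd.init()

# Pin each worker process to one GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = keras.optimizers.Adam(lr=1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='categorical_crossentropy', optimizer=opt)

# Make sure all workers start from the same weights.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit_generator(training_generator,
                    steps_per_epoch=140 // hvd.size(), epochs=3,
                    workers=8, use_multiprocessing=True,
                    callbacks=callbacks)
```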
tf.data is definitely the way to go. You might also want to add a shuffle operation for better generalization.
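Shuffling the filename pairs before the expensive map keeps the shuffle buffer cheap, since it only holds strings instead of decoded images. A sketch reusing the names from your snippet:

```python
import tensorflow as tf

# Shuffle the (cheap) filename pairs first, then decode/batch as before.
dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))
dataset = dataset.shuffle(buffer_size=len(train_image_filenames))
dataset = dataset.apply(tf.contrib.data.map_and_batch(
    preprocess_fn, batch_size, num_parallel_batches=4, drop_remainder=True))
```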
Another thing that improved things a lot for me was resizing the images beforehand and saving them with np.save() as .npy files. They take more space to store, but reading them is an order of magnitude faster. I used tf.py_func() to wrap my numpy operations into tensors (these can't be parallelized because of the Python GIL).
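A sketch of how that can look inside tf.data (assuming both the images and the already one-hot encoded labels were dumped as .npy files and the filename lists point to them; the 512x512 shape is a placeholder, and py_func loses shape information, so it has to be set manually):

```python
import numpy as np
import tensorflow as tf

def load_npy_pair(image_path, label_path):
    # Plain numpy loading; the .npy files are already resized and normalised.
    x = np.load(image_path.decode())
    y = np.load(label_path.decode())
    return x.astype(np.float32), y.astype(np.float32)

def tf_load(image_path, label_path):
    x, y = tf.py_func(load_npy_pair, [image_path, label_path], [tf.float32, tf.float32])
    # Restore the static shapes that py_func drops (placeholder sizes).
    x.set_shape([512, 512, 3])
    y.set_shape([512, 512, None])
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((train_image_filenames, train_label_filenames))
dataset = dataset.map(tf_load).batch(batch_size).prefetch(1)
```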
Nvidia recently released DALI. It does augmentation on the GPU, which is definitely the way to go in the future. For a simple classification task it might already have all the functionality you need.
What does your data processing pipeline look like exactly? Have you considered omitting some steps that might be too expensive? How is your data stored? Is it plain image files loaded on demand, or do you preload them into memory? Loading JPG/PNG images is usually very expensive.
Can you see any improvements if you increase max_queue_size in model.fit_generator()?
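For example (32 is just a value to experiment with; the Keras default is 10):

```python
model.fit_generator(generator=training_generator,
                    use_multiprocessing=True, workers=8,
                    max_queue_size=32)
```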
And finally, could you benchmark how fast your data processing pipeline actually is, for example by generating a few thousand batches and measuring the time per batch?
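Something along these lines (a rough sketch; the number of batches is arbitrary, and it assumes your DataGenerator is an indexable keras.utils.Sequence; use next() if it is a plain generator):

```python
import time

# Pull batches from the generator alone (no model involved) and measure the
# average time per batch. If this is close to your training step time, the
# input pipeline is the bottleneck.
n_batches = 1000
start = time.time()
for i in range(n_batches):
    x_batch, y_batch = training_generator[i % len(training_generator)]
print('seconds per batch:', (time.time() - start) / n_batches)
```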
Apart from this, my own experience is that low GPU utilization can occur when your model is relatively small / not computationally expensive. Since new data has to be fed to the GPU between batches, there is an overhead you can't really avoid. When the ratio between this overhead and the actual computation time for a single pass is high, the overall GPU utilization will be relatively low, and you may even see frequent 0% readings.
Edit: Could you give us more information on the model you use, especially what kinds of layers it mostly consists of? The computation time for a single pass of a relatively small CNN, for example, might be so short that more time is spent re-feeding the GPU between batches than on the actual computations.
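One quick way to check this is to train for a while on a single synthetic batch that already lives in memory, so the GPU never waits for data, and compare the time per step with your real pipeline (a sketch; it assumes model is your compiled model, and the number of classes is a placeholder that must match your output shape; random labels are fine for timing only):

```python
import time
import numpy as np

# One synthetic batch: batch_size=3, 512x512 inputs, num_classes is a guess.
num_classes = 32
x = np.random.rand(3, 512, 512, 3).astype(np.float32)
y = np.random.rand(3, 512, 512, num_classes).astype(np.float32)

start = time.time()
model.fit(x, y, batch_size=3, epochs=100, verbose=0)  # 100 steps, no input pipeline
print('seconds per step (GPU only):', (time.time() - start) / 100)
```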
Update: After you added more information about your processing pipeline, I would say that your main bottleneck is the loading and decoding of the PNG images. PNG decompression (and compression even more so) is usually very expensive (according to this source, about 5 times more than JPEG). To check this assumption, you could profile your processing pipeline by measuring how much time each processing step (decoding, resizing, cropping, etc.) needs and which one is the main contributor.
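A crude way to check the decoding cost in isolation (a sketch; re-encode one of your images as JPEG first, and the file names are placeholders):

```python
import time
import cv2

def time_decode(path, n=100):
    # Average wall-clock time to read and decode one image, over n repetitions.
    start = time.perf_counter()
    for _ in range(n):
        cv2.imread(path, -1)
    return (time.perf_counter() - start) / n

print('PNG :', time_decode('sample_image.png'))
print('JPEG:', time_decode('sample_image.jpg'))
```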
Now there are many ways to optimize your processing pipeline: