I am running a convnet on a Colab Pro GPU. I have selected GPU in my runtime and can confirm that a GPU is available. I am running exactly the same network as yesterday evening, but it is now taking about 2 hours per epoch, whereas last night it took about 3 minutes per epoch, and nothing has changed at all. I have a feeling Colab may have restricted my GPU usage, but I can't work out how to tell whether this is the issue. Does GPU speed fluctuate much depending on time of day, etc.? Here are some diagnostics I have printed; does anyone know how I can investigate the root cause of this slow behaviour more deeply?
I also tried changing the accelerator in Colab to 'None', and my network ran at the same speed as with 'GPU' selected, implying that for some reason I am no longer training on the GPU, or that resources have been severely limited. I am using TensorFlow 2.1.
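One additional check worth printing (a minimal sketch, assuming TensorFlow 2.1's tf.config API) is whether TensorFlow itself can see the GPU, independent of nvidia-smi:

import tensorflow as tf

# An empty list here means TensorFlow cannot see the GPU at all, which would
# explain identical speeds with 'GPU' and 'None' selected as the accelerator.
print(tf.config.list_physical_devices('GPU'))

# Returns something like '/device:GPU:0' when a GPU is usable, or '' otherwise.
print(tf.test.gpu_device_name())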
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Sun Mar 22 11:33:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    32W / 250W |   8747MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
import humanize
import psutil
import GPUtil

def mem_report():
    print("CPU RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available))
    GPUs = GPUtil.getGPUs()
    for i, gpu in enumerate(GPUs):
        print('GPU {:d} ... Mem Free: {:.0f}MB / {:.0f}MB | Utilization {:3.0f}%'.format(
            i, gpu.memoryFree, gpu.memoryTotal, gpu.memoryUtil * 100))

mem_report()
CPU RAM Free: 24.5 GB
GPU 0 ... Mem Free: 7533MB / 16280MB | Utilization 54%
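The nvidia-smi output above also shows 0% GPU-Util while roughly 8.7 GB of memory is allocated, which is consistent with the GPU sitting idle while it waits on something else (for example, the input pipeline). One way to dig deeper (a sketch, assuming TensorFlow 2.x) is to enable device placement logging before the model is built, so every op reports whether it runs on GPU:0 or CPU:0:

import tensorflow as tf

# Call this before creating any ops or models; each op then logs the device
# it was placed on, so you can confirm the convolutions land on GPU:0.
tf.debugging.set_log_device_placement(True)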
Still no luck speeding things up. Here is my code; maybe I have overlooked something. The images are from an old Kaggle competition, and the data can be found at https://www.kaggle.com/c/datasciencebowl. The training images are saved on my Google Drive.
import os
import zipfile
import pathlib
import numpy as np
import tensorflow as tf
from PIL import Image
from IPython import display

# loading images from the Kaggle API
#os.environ['KAGGLE_USERNAME'] = ""
#os.environ['KAGGLE_KEY'] = ""
#!kaggle competitions download -c datasciencebowl

# unpacking zip files
#zipfile.ZipFile('./sampleSubmission.csv.zip', 'r').extractall('./')
#zipfile.ZipFile('./test.zip', 'r').extractall('./')
#zipfile.ZipFile('./train.zip', 'r').extractall('./')

data_dir = pathlib.Path('train')
image_count = len(list(data_dir.glob('*/*.jpg')))
CLASS_NAMES = np.array([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"])

shrimp_zoea = list(data_dir.glob('shrimp_zoea/*'))
for image_path in shrimp_zoea[:5]:
    display.display(Image.open(str(image_path)))
image_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)
    # rotation_range=40,
    # width_shift_range=0.2,
    # height_shift_range=0.2,
    # shear_range=0.2,
    # zoom_range=0.2,
    # horizontal_flip=True,
    # fill_mode='nearest')
validation_split = 0.2
BATCH_SIZE = 32
BATCH_SIZE_VALID = 10
IMG_HEIGHT = 224
IMG_WIDTH = 224
STEPS_PER_EPOCH = np.ceil(image_count*(1-(validation_split))/BATCH_SIZE)
VALIDATION_STEPS = np.ceil((image_count*(validation_split)/BATCH_SIZE))
train_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                      subset='training',
                                                      batch_size=BATCH_SIZE,
                                                      class_mode='categorical',
                                                      shuffle=True,
                                                      target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                      classes=list(CLASS_NAMES))

validation_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                           subset='validation',
                                                           batch_size=BATCH_SIZE_VALID,
                                                           class_mode='categorical',
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           classes=list(CLASS_NAMES))
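One detail worth double-checking in the step counts above: VALIDATION_STEPS is computed with BATCH_SIZE (32), but the validation generator uses BATCH_SIZE_VALID (10). A safer sketch, once the generators exist, is to derive both counts from the sample counts Keras reports:

# Derive step counts from the generators' own sample counts so they always
# match the batch sizes actually in use.
STEPS_PER_EPOCH = int(np.ceil(train_data_gen.samples / BATCH_SIZE))
VALIDATION_STEPS = int(np.ceil(validation_data_gen.samples / BATCH_SIZE_VALID))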
model_basic = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(121, activation='softmax')
])
model_basic.summary()
model_basic.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

history = model_basic.fit(
    train_data_gen,
    epochs=10,
    verbose=1,
    validation_data=validation_data_gen,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_steps=VALIDATION_STEPS,
    initial_epoch=0
)
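To pin down whether the model or the input pipeline is the slow part, a quick test (a hypothetical timing sketch, not part of the original notebook) is to time the generator on its own, with no model in the loop:

import time

# If the generator alone is slow, the bottleneck is data loading (I/O),
# not the GPU or the model.
n_batches = 10
start = time.time()
for i, (x_batch, y_batch) in enumerate(train_data_gen):
    if i + 1 == n_batches:
        break
print('~{:.1f} s per batch of {}'.format((time.time() - start) / n_batches, BATCH_SIZE))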
It's likely that Drive network rate limits are reducing the speed of your training loop.
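If the training images sit on Drive as a single archive, one way around this (a sketch; the Drive path below is a placeholder for wherever your train.zip actually lives) is to copy the archive onto the Colab VM's local disk once and unzip it there, so each batch reads from local disk rather than over the Drive mount:

from google.colab import drive
drive.mount('/content/drive')

# '/content/drive/My Drive/train.zip' is a placeholder path; adjust it to your Drive layout.
# The VM's local disk is wiped when the runtime is recycled, so repeat this each session.
!cp '/content/drive/My Drive/train.zip' /content/
!unzip -q /content/train.zip -d /content/train_local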
Setting up the hardware accelerator on Colab: Colab notebooks use CPUs by default; to change the runtime type to a GPU or TPU, select "Change runtime type" under "Runtime" in Colab's menu bar.
Colab Pro and Pro+ limit GPUs to the NVIDIA P100 or T4. Colab Pro limits RAM to 32 GB, while Pro+ limits RAM to 52 GB. Both Pro and Pro+ limit sessions to 24 hours.
Cloud GPUs are only good for the compute itself; that is something you should think about. To summarize, even a mid-range local GPU can dramatically outperform the free Google Colab environment. Keep in mind that I was assigned a Tesla K80 with 12 GB, which might not be the case for you.
In the end, the bottleneck seems to be loading the images from Google Drive into Colab on each batch. Copying the images to the Colab VM's local disk reduced the time per epoch to about 30 seconds. After uploading my train.zip file to Colab, here is the code I used to extract it to local disk:
!mkdir train_local
!unzip train.zip -d train_local