I am running a convnet on a Colab Pro GPU. I have selected GPU in my runtime and can confirm that a GPU is available. I am running exactly the same network as yesterday evening, but it is now taking about 2 hours per epoch, whereas last night it took about 3 minutes per epoch, and nothing has changed at all. I have a feeling Colab may have restricted my GPU usage, but I can't work out how to tell whether this is the issue. Does GPU speed fluctuate much depending on time of day, etc.? Here are some diagnostics I have printed; does anyone know how I can investigate the root cause of this slow behaviour more deeply?
I also tried changing the accelerator in Colab to 'None', and my network ran at the same speed as with 'GPU' selected, implying that for some reason I am no longer training on the GPU, or that resources have been severely limited. I am using TensorFlow 2.1.
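One additional check worth printing (a minimal sketch, assuming TensorFlow 2.1's tf.config API) is whether TensorFlow itself can see the GPU, independent of nvidia-smi:

import tensorflow as tf

# An empty list here means TensorFlow cannot see the GPU at all, which would
# explain identical speeds with 'GPU' and 'None' selected as the accelerator.
print(tf.config.list_physical_devices('GPU'))

# Returns something like '/device:GPU:0' when a GPU is usable, or '' otherwise.
print(tf.test.gpu_device_name())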
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Sun Mar 22 11:33:14 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    32W / 250W |   8747MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
import humanize
import psutil
import GPUtil

def mem_report():
    print("CPU RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available))
    GPUs = GPUtil.getGPUs()
    for i, gpu in enumerate(GPUs):
        print('GPU {:d} ... Mem Free: {:.0f}MB / {:.0f}MB | Utilization {:3.0f}%'.format(
            i, gpu.memoryFree, gpu.memoryTotal, gpu.memoryUtil * 100))

mem_report()
CPU RAM Free: 24.5 GB
GPU 0 ... Mem Free: 7533MB / 16280MB | Utilization 54%
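The nvidia-smi output above also shows 0% GPU-Util while roughly 8.7 GB of memory is allocated, which is consistent with the GPU sitting idle while it waits on something else (for example, the input pipeline). One way to dig deeper (a sketch, assuming TensorFlow 2.x) is to enable device placement logging before the model is built, so every op reports whether it runs on GPU:0 or CPU:0:

import tensorflow as tf

# Call this before creating any ops or models; each op then logs the device
# it was placed on, so you can confirm the convolutions land on GPU:0.
tf.debugging.set_log_device_placement(True)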
Still no luck speeding things up. Here is my code; maybe I have overlooked something. The images are from an old Kaggle competition, and the data can be found at https://www.kaggle.com/c/datasciencebowl. The training images are saved on my Google Drive.
import os
import zipfile
import pathlib
import numpy as np
import tensorflow as tf
from PIL import Image
from IPython import display

# loading images from the Kaggle API
#os.environ['KAGGLE_USERNAME'] = ""
#os.environ['KAGGLE_KEY'] = ""
#!kaggle competitions download -c datasciencebowl

# unpacking zip files
#zipfile.ZipFile('./sampleSubmission.csv.zip', 'r').extractall('./')
#zipfile.ZipFile('./test.zip', 'r').extractall('./')
#zipfile.ZipFile('./train.zip', 'r').extractall('./')

data_dir = pathlib.Path('train')
image_count = len(list(data_dir.glob('*/*.jpg')))
CLASS_NAMES = np.array([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"])

shrimp_zoea = list(data_dir.glob('shrimp_zoea/*'))
for image_path in shrimp_zoea[:5]:
    display.display(Image.open(str(image_path)))
image_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)
    # rotation_range=40,
    # width_shift_range=0.2,
    # height_shift_range=0.2,
    # shear_range=0.2,
    # zoom_range=0.2,
    # horizontal_flip=True,
    # fill_mode='nearest')
validation_split = 0.2
BATCH_SIZE = 32
BATCH_SIZE_VALID = 10
IMG_HEIGHT = 224
IMG_WIDTH = 224
STEPS_PER_EPOCH = np.ceil(image_count*(1-(validation_split))/BATCH_SIZE)
VALIDATION_STEPS = np.ceil((image_count*(validation_split)/BATCH_SIZE))
train_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                      subset='training',
                                                      batch_size=BATCH_SIZE,
                                                      class_mode='categorical',
                                                      shuffle=True,
                                                      target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                      classes=list(CLASS_NAMES))

validation_data_gen = image_generator.flow_from_directory(directory=str(data_dir),
                                                           subset='validation',
                                                           batch_size=BATCH_SIZE_VALID,
                                                           class_mode='categorical',
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                           classes=list(CLASS_NAMES))
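One detail worth double-checking in the step counts above: VALIDATION_STEPS is computed with BATCH_SIZE (32), but the validation generator uses BATCH_SIZE_VALID (10). A safer sketch, once the generators exist, is to derive both counts from the sample counts Keras reports:

# Derive step counts from the generators' own sample counts so they always
# match the batch sizes actually in use.
STEPS_PER_EPOCH = int(np.ceil(train_data_gen.samples / BATCH_SIZE))
VALIDATION_STEPS = int(np.ceil(validation_data_gen.samples / BATCH_SIZE_VALID))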
model_basic = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(121, activation='softmax')
])
model_basic.summary()
model_basic.compile(optimizer='adam',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])

history = model_basic.fit(
    train_data_gen,
    epochs=10,
    verbose=1,
    validation_data=validation_data_gen,
    steps_per_epoch=STEPS_PER_EPOCH,
    validation_steps=VALIDATION_STEPS,
    initial_epoch=0
)
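To pin down whether the model or the input pipeline is the slow part, a quick test (a hypothetical timing sketch, not part of the original notebook) is to time the generator on its own, with no model in the loop:

import time

# If the generator alone is slow, the bottleneck is data loading (I/O),
# not the GPU or the model.
n_batches = 10
start = time.time()
for i, (x_batch, y_batch) in enumerate(train_data_gen):
    if i + 1 == n_batches:
        break
print('~{:.1f} s per batch of {}'.format((time.time() - start) / n_batches, BATCH_SIZE))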
It's likely that Drive network rate limits are reducing the speed of your training loop.
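If the training images sit on Drive as a single archive, one way around this (a sketch; the Drive path below is a placeholder for wherever your train.zip actually lives) is to copy the archive onto the Colab VM's local disk once and unzip it there, so each batch reads from local disk rather than over the Drive mount:

from google.colab import drive
drive.mount('/content/drive')

# '/content/drive/My Drive/train.zip' is a placeholder path; adjust it to your Drive layout.
# The VM's local disk is wiped when the runtime is recycled, so repeat this each session.
!cp '/content/drive/My Drive/train.zip' /content/
!unzip -q /content/train.zip -d /content/train_local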
Setting up the hardware accelerator on Colab: Colab notebooks use CPUs by default; to change the runtime type to a GPU or TPU, select "Change runtime type" under "Runtime" in Colab's menu bar.
Colab Pro and Pro+ limit GPUs to the NVIDIA P100 or T4. Colab Pro limits RAM to 32 GB, while Pro+ limits RAM to 52 GB. Both Pro and Pro+ limit sessions to 24 hours.
Cloud GPUs are only good for the compute itself; that is something you should think about. To summarize, even a mid-range local GPU can dramatically outperform the free Google Colab environment. Keep in mind that I was assigned a Tesla K80 with 12 GB, which might not be the case for you.
In the end, the bottleneck seems to be loading the images from Google Drive into Colab on each batch. Copying the images to the Colab VM's local disk reduced the time per epoch to about 30 seconds. After uploading my train.zip file to Colab, here is the code I used to extract it to local disk:
!mkdir train_local
!unzip train.zip -d train_local