
I got NaN for all losses while training YOLOv8 model

I am training a YOLOv8 model on CUDA using this code:

from ultralytics import YOLO
import torch
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12)  # train
results = model.val()       # evaluate on the validation split
model.export(format="onnx") # export the trained model to ONNX

and I am getting NaN for all losses:

Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
1/15      1.74G        nan        nan        nan         51        640:   4%

I have tried training the model on CPU and it worked fine. The problem appeared when I installed CUDA and started training on the GPU.

I suspected an error reading the data or something similar, but everything there works fine.

I think it has something to do with memory: when I decreased the image size, training worked fine, but when I increased the batch size for the same decreased image size it showed NaN again. So it seems to be a trade-off between image size, batch size, and memory. I am not 100% sure that is right, but that is what I figured out by experiment. If you have a better answer for this problem, please share it.
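That experiment is consistent with activation memory growing roughly linearly with batch size and quadratically with image side length. A toy sketch of that scaling (the proportionality model is my assumption, not something measured in this thread):

```python
def relative_activation_memory(batch: int, imgsz: int) -> float:
    """Rough proxy: activation memory grows about linearly with batch size
    and quadratically with image side length (pixels = imgsz ** 2).
    The returned value is relative, not bytes."""
    return batch * imgsz ** 2

# Halving the image side frees enough memory to roughly quadruple the batch:
base = relative_activation_memory(batch=12, imgsz=640)
small = relative_activation_memory(batch=48, imgsz=320)
print(small <= base)  # True: same memory budget
```

Under this model, raising the batch size at a reduced image size can push memory right back to the limit, which matches what the asker observed.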

Mohamed Abu ElNasr asked Oct 24 '25 08:10


1 Answer

I had the same issue. Even after upgrading ultralytics to its latest version (8.0.94) and lowering the batch size, it did not help. When I set the device to CPU (device=cpu), it works perfectly fine.

So the problem was mainly with the GPU. As suggested in the related GitHub issue, setting amp=False (disabling automatic mixed precision) fixed it, and I was able to run training on the GPU:
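A likely mechanism (my assumption; the answer only reports the fix): with amp=True, parts of the forward pass run in float16, whose largest finite value is 65504, so large activations or gradients overflow to inf, which then propagates as NaN through the losses. Python's standard struct module can demonstrate the float16 ceiling without any GPU:

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_in_fp16(x: float) -> bool:
    """Return True if x can be packed as a finite float16 ('e' format)."""
    try:
        struct.pack("e", x)  # 'e' = IEEE 754 binary16 (half precision)
        return True
    except OverflowError:
        return False

print(fits_in_fp16(FP16_MAX))  # True
print(fits_in_fp16(70000.0))   # False: overflows half precision
```

Disabling AMP keeps these values in float32 (max ~3.4e38), at the cost of more memory and slower training.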

yolo task=detect mode=train model=yolov8s.pt data="data.yaml" epochs=20 batch=2 imgsz=640 device=0 workers=8 optimizer=Adam pretrained=true val=true plots=true save=True show=true optimize=true lr0=0.001 lrf=0.01 fliplr=0.0 amp=False
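For anyone using the Python API from the question rather than the CLI, the equivalent fix should be passing amp=False to model.train() (a sketch of the training configuration; I have only verified the CLI form above):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# amp=False disables automatic mixed precision, mirroring amp=False in the CLI
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12, amp=False)
```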
Keerthitheja S.C. answered Oct 27 '25 01:10