I'm attempting to get this PyTorch person detection example running:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I'm using Ubuntu 18.04. Here is a summary of the steps I've performed:
1) Stock Ubuntu 18.04 install on a Lenovo ThinkPad X1 Extreme Gen 2 with a GTX 1650 GPU.
2) Performed a standard CUDA 10.0 / cuDNN 7.4 install. I'd rather not restate all the steps, as this post is going to be more than long enough already. This is a standard procedure; pretty much any guide found via googling covers what I followed.
3) Installed torch and torchvision.
4) From this link on the PyTorch site:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
I saved the source available from the link at the bottom:
https://pytorch.org/tutorials/_static/tv-training-code.py
to a directory I made, PennFudanExample.
5) I did the following (found at the top of the above-linked notebook):
Install the COCO API into Python:
cd ~
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
open Makefile in gedit, change the two instances of "python" to "python3", then:
python3 setup.py build_ext --inplace
sudo python3 setup.py install
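As a quick sanity check that the build worked (this snippet is my own, not part of the tutorial), the two modules the helper files rely on should import cleanly:

import pycocotools
# these imports should succeed if pycocotools built and installed correctly
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval
print("pycocotools imports OK")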
Get the helper files that tv-training-code.py needs to run:
cd ~
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0
then, from ~/vision/references/detection, copy coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py to the directory PennFudanExample.
6) Downloaded the Penn-Fudan pedestrian dataset from the link on the above page:
https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
then unzipped it and put it in the directory PennFudanExample.
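To confirm the extraction, here is a quick check of my own (not part of the tutorial); the tutorial describes the dataset as 170 images with 345 pedestrian instances:

import os
# expect matching counts of images and masks (170 each, per the tutorial's description)
root = "PennFudanPed"
imgs = sorted(os.listdir(os.path.join(root, "PNGImages")))
masks = sorted(os.listdir(os.path.join(root, "PedMasks")))
print(len(imgs), "images,", len(masks), "masks")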
7) The only change I made to tv-training-code.py was to change the training batch size from 2 to 1 to prevent a GPU out-of-memory crash; see this other post I made here:
PyTorch Object Detection with GPU on Ubuntu 18.04 - RuntimeError: CUDA out of memory. Tried to allocate xx.xx MiB
Here is tv-training-code.py as I'm running it, with the slight batch-size edit I mentioned:
# Sample code from the TorchVision 0.3 Object Detection Finetuning Tutorial
# http://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

import os
import numpy as np
import torch
from PIL import Image

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

from engine import train_one_epoch, evaluate
import utils
import transforms as T


class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model


def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)


def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

    # define training and validation data loaders
    # !!!! CHANGE HERE !!!! For this function call, I changed the batch_size param value from 2 to 1, otherwise this file is exactly as provided from the PyTorch website !!!!
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=1, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)

    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)

    # move model to the right device
    model.to(device)

    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)

    # let's train it for 10 epochs
    num_epochs = 10

    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)

    print("That's it!")


if __name__ == "__main__":
    main()
Here is the full text output, including the error I'm currently getting:
Epoch: [0] [ 0/120] eta: 0:01:41 lr: 0.000047 loss: 7.3028 (7.3028) loss_classifier: 1.0316 (1.0316) loss_box_reg: 0.0827 (0.0827) loss_mask: 6.1742 (6.1742) loss_objectness: 0.0097 (0.0097) loss_rpn_box_reg: 0.0046 (0.0046) time: 0.8468 data: 0.0803 max mem: 1067
Epoch: [0] [ 10/120] eta: 0:01:02 lr: 0.000467 loss: 2.0995 (3.5058) loss_classifier: 0.6684 (0.6453) loss_box_reg: 0.0999 (0.1244) loss_mask: 1.2471 (2.7069) loss_objectness: 0.0187 (0.0235) loss_rpn_box_reg: 0.0060 (0.0057) time: 0.5645 data: 0.0089 max mem: 1499
Epoch: [0] [ 20/120] eta: 0:00:56 lr: 0.000886 loss: 1.0166 (2.1789) loss_classifier: 0.2844 (0.4347) loss_box_reg: 0.1631 (0.1540) loss_mask: 0.4710 (1.5562) loss_objectness: 0.0187 (0.0242) loss_rpn_box_reg: 0.0082 (0.0099) time: 0.5524 data: 0.0020 max mem: 1704
Epoch: [0] [ 30/120] eta: 0:00:50 lr: 0.001306 loss: 0.5554 (1.6488) loss_classifier: 0.1258 (0.3350) loss_box_reg: 0.1356 (0.1488) loss_mask: 0.2355 (1.1285) loss_objectness: 0.0142 (0.0224) loss_rpn_box_reg: 0.0127 (0.0142) time: 0.5653 data: 0.0023 max mem: 1756
Epoch: [0] [ 40/120] eta: 0:00:45 lr: 0.001726 loss: 0.4520 (1.3614) loss_classifier: 0.1055 (0.2773) loss_box_reg: 0.1101 (0.1530) loss_mask: 0.1984 (0.8981) loss_objectness: 0.0063 (0.0189) loss_rpn_box_reg: 0.0139 (0.0140) time: 0.5621 data: 0.0023 max mem: 1776
Epoch: [0] [ 50/120] eta: 0:00:39 lr: 0.002146 loss: 0.3448 (1.1635) loss_classifier: 0.0622 (0.2346) loss_box_reg: 0.1004 (0.1438) loss_mask: 0.1650 (0.7547) loss_objectness: 0.0033 (0.0172) loss_rpn_box_reg: 0.0069 (0.0131) time: 0.5535 data: 0.0022 max mem: 1776
Epoch: [0] [ 60/120] eta: 0:00:33 lr: 0.002565 loss: 0.3292 (1.0543) loss_classifier: 0.0549 (0.2101) loss_box_reg: 0.1113 (0.1486) loss_mask: 0.1596 (0.6668) loss_objectness: 0.0017 (0.0148) loss_rpn_box_reg: 0.0082 (0.0140) time: 0.5590 data: 0.0022 max mem: 1776
Epoch: [0] [ 70/120] eta: 0:00:28 lr: 0.002985 loss: 0.4105 (0.9581) loss_classifier: 0.0534 (0.1877) loss_box_reg: 0.1049 (0.1438) loss_mask: 0.1709 (0.5995) loss_objectness: 0.0015 (0.0132) loss_rpn_box_reg: 0.0133 (0.0138) time: 0.5884 data: 0.0023 max mem: 1783
Epoch: [0] [ 80/120] eta: 0:00:22 lr: 0.003405 loss: 0.3080 (0.8817) loss_classifier: 0.0441 (0.1706) loss_box_reg: 0.0875 (0.1343) loss_mask: 0.1960 (0.5510) loss_objectness: 0.0015 (0.0122) loss_rpn_box_reg: 0.0071 (0.0137) time: 0.5812 data: 0.0023 max mem: 1783
Epoch: [0] [ 90/120] eta: 0:00:17 lr: 0.003825 loss: 0.2817 (0.8171) loss_classifier: 0.0397 (0.1570) loss_box_reg: 0.0499 (0.1257) loss_mask: 0.1777 (0.5098) loss_objectness: 0.0008 (0.0111) loss_rpn_box_reg: 0.0068 (0.0136) time: 0.5644 data: 0.0022 max mem: 1794
Epoch: [0] [100/120] eta: 0:00:11 lr: 0.004244 loss: 0.2139 (0.7569) loss_classifier: 0.0310 (0.1446) loss_box_reg: 0.0327 (0.1163) loss_mask: 0.1573 (0.4731) loss_objectness: 0.0003 (0.0101) loss_rpn_box_reg: 0.0050 (0.0128) time: 0.5685 data: 0.0022 max mem: 1794
Epoch: [0] [110/120] eta: 0:00:05 lr: 0.004664 loss: 0.2139 (0.7160) loss_classifier: 0.0325 (0.1358) loss_box_reg: 0.0327 (0.1105) loss_mask: 0.1572 (0.4477) loss_objectness: 0.0003 (0.0093) loss_rpn_box_reg: 0.0047 (0.0128) time: 0.5775 data: 0.0022 max mem: 1794
Epoch: [0] [119/120] eta: 0:00:00 lr: 0.005000 loss: 0.2486 (0.6830) loss_classifier: 0.0330 (0.1282) loss_box_reg: 0.0360 (0.1051) loss_mask: 0.1686 (0.4284) loss_objectness: 0.0003 (0.0086) loss_rpn_box_reg: 0.0074 (0.0125) time: 0.5655 data: 0.0022 max mem: 1794
Epoch: [0] Total time: 0:01:08 (0.5676 s / it)
creating index...
index created!
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 166, in <module>
main()
File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 161, in main
evaluate(model, data_loader_test, device=device)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/home/cdahms/workspace-apps/PennFudanExample/engine.py", line 80, in evaluate
coco_evaluator = CocoEvaluator(coco, iou_types)
File "/home/cdahms/workspace-apps/PennFudanExample/coco_eval.py", line 28, in __init__
self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 75, in __init__
self.params = Params(iouType=iouType) # parameters
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 506, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Process finished with exit code 1
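For reference, the failing call can be reproduced in isolation; this is the exact expression from pycocotools' setDetParams shown in the traceback above (a minimal sketch, and the behavior depends on which numpy version is installed):

import numpy as np
# np.round returns a numpy.float64, so the `num` argument here is a float;
# numpy 1.18 raises a TypeError for this, where older versions truncated it with a warning
np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)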
The really strange thing is that after I resolved the above-mentioned GPU error, this was working for about half a day; now I'm getting this error, and I could swear I didn't change anything.
I've tried uninstalling and reinstalling torch, torchvision, and pycocotools, and for the copied files coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py, I've tried checking out torchvision v0.5.0, v0.4.2, and the latest commit; all produce the same error.
Also, I was working from home yesterday (Christmas) and this error does not happen on my home computer, which is also Ubuntu 18.04 with an NVIDIA GPU.
In googling for this error, one relatively common suggestion is to backdate numpy to 1.11.0, but that version is really old now, and therefore it would likely cause problems with other packages.
Also in googling for this error, it seems the general fix is to add a cast to int somewhere, or to change a divide from / to //, but I'm really hesitant to make changes inside pycocotools, or worse yet inside numpy. Also, since the error was not occurring previously and does not occur on another computer, I don't suspect this is a good idea anyway.
Fortunately, for now I can comment out the line
evaluate(model, data_loader_test, device=device)
and the training will complete, although I don't get the evaluation data (mean average precision, etc.).
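A slightly gentler stopgap than deleting the call (my own workaround, not from the tutorial) is to let training continue even if evaluation blows up:

# stopgap: attempt evaluation, but don't let a crash kill the training run
try:
    evaluate(model, data_loader_test, device=device)
except TypeError as e:
    print("skipping evaluation, COCO eval failed with:", e)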
About the only thing left I can think of at this point is to format the hard drive and reinstall Ubuntu 18.04 and everything else, but that would take at least a day, and if this ever happens again I'd really like to know what may be causing it.
Ideas? Suggestions? Additional stuff I should check?
-- EDIT --
After re-testing on the same computer experiencing the problem, I found this same error occurs in the evaluation step when using the TensorFlow object detection API.
!@#$%^&
I finally figured this out after about 15 hours on it. As it turns out, numpy 1.18.0, which was released 5 days ago as of this writing, breaks the evaluation process for both TensorFlow and PyTorch object detection. To make a long story short, the fix is:
sudo -H pip3 install numpy==1.17.4
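After the downgrade, a quick check of my own that the pinned version is the one actually in use:

import numpy as np
print(np.__version__)  # should print 1.17.4 after the downgrade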
A few things I can also mention:
- numpy 1.17.4 was released on November 10th, 2019, and therefore should still be good for quite some time
- There is now a pip package for pycocotools, so instead of the above procedure (cloning and building), you can now simply do:
sudo -H pip3 install pycocotools
--- Update ---
This has now been fixed in pycocotools by this pull request:
https://github.com/cocodataset/cocoapi/pull/354
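Roughly, the patch casts the element count handed to np.linspace to an int (a sketch of the idea, not the verbatim commit):

import numpy as np
# casting the count to int satisfies numpy >= 1.18, which no longer accepts a float `num`
iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)
recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)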
Also see this (closed) numpy issue for more background:
https://github.com/numpy/numpy/issues/15192
When the updated version of pycocotools will make it into the pycocotools pip3 package, I'm not sure.