 

PyTorch and TensorFlow object detection - evaluate - object of type <class 'numpy.float64'> cannot be safely interpreted as an integer

Tags:

python

pytorch

I'm attempting to get this PyTorch person detection example running:

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

I'm using Ubuntu 18.04. Here is a summary of the steps I've performed:

1) Stock Ubuntu 18.04 install on a Lenovo ThinkPad X1 Extreme Gen 2 with a GTX 1650 GPU.

2) Perform a standard CUDA 10.0 / cuDNN 7.4 install. I'd rather not restate all the steps, as this post is going to be more than long enough already. This is a standard procedure; I followed the steps that pretty much any link found via Googling describes.

3) Install torch and torchvision

4) From this link on the PyTorch site:

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

I saved the source available from the link at the bottom:

https://pytorch.org/tutorials/_static/tv-training-code.py

To a directory I made, PennFudanExample

5) I did the following (found at the top of the above linked notebook):

Install the COCO API into Python:

cd ~
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI

open Makefile in gedit, change the two instances of "python" to "python3", then:

python3 setup.py build_ext --inplace
sudo python3 setup.py install

Get the necessary files the above linked files need to run:

cd ~
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0

From ~/vision/references/detection, copy coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py to the PennFudanExample directory.

6) Download the Penn Fudan Pedestrian dataset from the link on the above page:

https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip

Then unzip it and put it in the PennFudanExample directory.

7) The only change I made to tv-training-code.py was to change the training batch size from 2 to 1 to prevent a GPU out-of-memory crash. See this other post I made:

PyTorch Object Detection with GPU on Ubuntu 18.04 - RuntimeError: CUDA out of memory. Tried to allocate xx.xx MiB

Here is tv-training-code.py as I'm running it with the slight batch size edit I mentioned:

# Sample code from the TorchVision 0.3 Object Detection Finetuning Tutorial
# http://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

import os
import numpy as np
import torch
from PIL import Image

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

from engine import train_one_epoch, evaluate
import utils
import transforms as T


class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)

        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model


def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)


def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

    # define training and validation data loaders
    # !!!! CHANGE HERE !!!! For this function call, I changed the batch_size param value from 2 to 1, otherwise this file is exactly as provided from the PyTorch website !!!!
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=1, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)

    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)

    # move model to the right device
    model.to(device)

    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)

    # let's train it for 10 epochs
    num_epochs = 10

    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)

    print("That's it!")

if __name__ == "__main__":
    main()

Here is the full output, including the error I'm currently getting:

Epoch: [0]  [  0/120]  eta: 0:01:41  lr: 0.000047  loss: 7.3028 (7.3028)  loss_classifier: 1.0316 (1.0316)  loss_box_reg: 0.0827 (0.0827)  loss_mask: 6.1742 (6.1742)  loss_objectness: 0.0097 (0.0097)  loss_rpn_box_reg: 0.0046 (0.0046)  time: 0.8468  data: 0.0803  max mem: 1067
Epoch: [0]  [ 10/120]  eta: 0:01:02  lr: 0.000467  loss: 2.0995 (3.5058)  loss_classifier: 0.6684 (0.6453)  loss_box_reg: 0.0999 (0.1244)  loss_mask: 1.2471 (2.7069)  loss_objectness: 0.0187 (0.0235)  loss_rpn_box_reg: 0.0060 (0.0057)  time: 0.5645  data: 0.0089  max mem: 1499
Epoch: [0]  [ 20/120]  eta: 0:00:56  lr: 0.000886  loss: 1.0166 (2.1789)  loss_classifier: 0.2844 (0.4347)  loss_box_reg: 0.1631 (0.1540)  loss_mask: 0.4710 (1.5562)  loss_objectness: 0.0187 (0.0242)  loss_rpn_box_reg: 0.0082 (0.0099)  time: 0.5524  data: 0.0020  max mem: 1704
Epoch: [0]  [ 30/120]  eta: 0:00:50  lr: 0.001306  loss: 0.5554 (1.6488)  loss_classifier: 0.1258 (0.3350)  loss_box_reg: 0.1356 (0.1488)  loss_mask: 0.2355 (1.1285)  loss_objectness: 0.0142 (0.0224)  loss_rpn_box_reg: 0.0127 (0.0142)  time: 0.5653  data: 0.0023  max mem: 1756
Epoch: [0]  [ 40/120]  eta: 0:00:45  lr: 0.001726  loss: 0.4520 (1.3614)  loss_classifier: 0.1055 (0.2773)  loss_box_reg: 0.1101 (0.1530)  loss_mask: 0.1984 (0.8981)  loss_objectness: 0.0063 (0.0189)  loss_rpn_box_reg: 0.0139 (0.0140)  time: 0.5621  data: 0.0023  max mem: 1776
Epoch: [0]  [ 50/120]  eta: 0:00:39  lr: 0.002146  loss: 0.3448 (1.1635)  loss_classifier: 0.0622 (0.2346)  loss_box_reg: 0.1004 (0.1438)  loss_mask: 0.1650 (0.7547)  loss_objectness: 0.0033 (0.0172)  loss_rpn_box_reg: 0.0069 (0.0131)  time: 0.5535  data: 0.0022  max mem: 1776
Epoch: [0]  [ 60/120]  eta: 0:00:33  lr: 0.002565  loss: 0.3292 (1.0543)  loss_classifier: 0.0549 (0.2101)  loss_box_reg: 0.1113 (0.1486)  loss_mask: 0.1596 (0.6668)  loss_objectness: 0.0017 (0.0148)  loss_rpn_box_reg: 0.0082 (0.0140)  time: 0.5590  data: 0.0022  max mem: 1776
Epoch: [0]  [ 70/120]  eta: 0:00:28  lr: 0.002985  loss: 0.4105 (0.9581)  loss_classifier: 0.0534 (0.1877)  loss_box_reg: 0.1049 (0.1438)  loss_mask: 0.1709 (0.5995)  loss_objectness: 0.0015 (0.0132)  loss_rpn_box_reg: 0.0133 (0.0138)  time: 0.5884  data: 0.0023  max mem: 1783
Epoch: [0]  [ 80/120]  eta: 0:00:22  lr: 0.003405  loss: 0.3080 (0.8817)  loss_classifier: 0.0441 (0.1706)  loss_box_reg: 0.0875 (0.1343)  loss_mask: 0.1960 (0.5510)  loss_objectness: 0.0015 (0.0122)  loss_rpn_box_reg: 0.0071 (0.0137)  time: 0.5812  data: 0.0023  max mem: 1783
Epoch: [0]  [ 90/120]  eta: 0:00:17  lr: 0.003825  loss: 0.2817 (0.8171)  loss_classifier: 0.0397 (0.1570)  loss_box_reg: 0.0499 (0.1257)  loss_mask: 0.1777 (0.5098)  loss_objectness: 0.0008 (0.0111)  loss_rpn_box_reg: 0.0068 (0.0136)  time: 0.5644  data: 0.0022  max mem: 1794
Epoch: [0]  [100/120]  eta: 0:00:11  lr: 0.004244  loss: 0.2139 (0.7569)  loss_classifier: 0.0310 (0.1446)  loss_box_reg: 0.0327 (0.1163)  loss_mask: 0.1573 (0.4731)  loss_objectness: 0.0003 (0.0101)  loss_rpn_box_reg: 0.0050 (0.0128)  time: 0.5685  data: 0.0022  max mem: 1794
Epoch: [0]  [110/120]  eta: 0:00:05  lr: 0.004664  loss: 0.2139 (0.7160)  loss_classifier: 0.0325 (0.1358)  loss_box_reg: 0.0327 (0.1105)  loss_mask: 0.1572 (0.4477)  loss_objectness: 0.0003 (0.0093)  loss_rpn_box_reg: 0.0047 (0.0128)  time: 0.5775  data: 0.0022  max mem: 1794
Epoch: [0]  [119/120]  eta: 0:00:00  lr: 0.005000  loss: 0.2486 (0.6830)  loss_classifier: 0.0330 (0.1282)  loss_box_reg: 0.0360 (0.1051)  loss_mask: 0.1686 (0.4284)  loss_objectness: 0.0003 (0.0086)  loss_rpn_box_reg: 0.0074 (0.0125)  time: 0.5655  data: 0.0022  max mem: 1794
Epoch: [0] Total time: 0:01:08 (0.5676 s / it)
creating index...
index created!
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 166, in <module>
    main()
  File "/home/cdahms/workspace-apps/PennFudanExample/tv-training-code.py", line 161, in main
    evaluate(model, data_loader_test, device=device)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/cdahms/workspace-apps/PennFudanExample/engine.py", line 80, in evaluate
    coco_evaluator = CocoEvaluator(coco, iou_types)
  File "/home/cdahms/workspace-apps/PennFudanExample/coco_eval.py", line 28, in __init__
    self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
  File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 75, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/home/cdahms/models/research/pycocotools/cocoeval.py", line 506, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 6, in linspace
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
    .format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.

Process finished with exit code 1

The really strange thing is that after I resolved the above-mentioned GPU error, this was working for about half a day. Now I'm getting this error, and I could swear I didn't change anything.

I've tried uninstalling and reinstalling torch, torchvision, and pycocotools. For the copied files (coco_eval.py, coco_utils.py, engine.py, transforms.py, and utils.py), I've tried checking out torchvision v0.5.0, v0.4.2, and the latest commit. All produce the same error.

Also, I was working from home yesterday (Christmas) and this error does not happen on my home computer, which is also Ubuntu 18.04 with an NVIDIA GPU.

When Googling for this error, one relatively common suggestion is to downgrade numpy to 1.11.0, but that version is really old now, so this would likely cause problems with other packages.

Also, from Googling for this error, it seems the general fix is to add a cast to int somewhere or to change a division from / to //, but I'm really hesitant to make changes inside pycocotools or, worse yet, inside numpy. Since the error was not occurring previously and is not occurring on another computer, I don't think this is a good idea anyway.

Fortunately I can comment out the line

evaluate(model, data_loader_test, device=device)

for now, and the training will complete, although I don't get the evaluation metrics (mean average precision, etc.).

About the only thing left I can think of at this point is to format the HD and reinstall Ubuntu 18.04 and everything else, but this will take at least a day, and if this ever happens again I'd really like to know what may be causing it.

Ideas? Suggestions? Additional stuff I should check?

-- EDIT --

After re-testing on the same computer that has the problem, I found that the same error occurs in the evaluation step when using the TensorFlow object detection API.

cdahms asked Dec 26 '19 21:12


1 Answer

!@#$%^&

I finally figured this out after about 15 hours on it. As it turns out, numpy 1.18.0, which was released 5 days ago as of this writing, breaks the evaluation step for both TensorFlow and PyTorch object detection: pycocotools passes a float as the num argument of np.linspace (see the traceback above), and numpy 1.18.0 no longer accepts that. To make a long story short, the fix is:

sudo -H pip3 install numpy==1.17.4
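
To tell whether a given environment is affected, it may help to check the installed numpy version from Python. This is only a sketch that assumes the cutoff is "1.18 or newer"; `major_minor` and `linspace_requires_int_num` are hypothetical helper names, not part of numpy.

```python
import numpy as np


def major_minor(version):
    """Return (major, minor) from a version string such as '1.17.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def linspace_requires_int_num(version=np.__version__):
    """True if this numpy version rejects a float `num` in np.linspace.

    Assumption: every release from 1.18 onward behaves this way.
    """
    return major_minor(version) >= (1, 18)


print("numpy", np.__version__, "affected:", linspace_requires_int_num())
```

If this reports that the environment is affected and an unpatched pycocotools is in use, the downgrade above is one way out.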

A few things I can also mention:

- numpy 1.17.4 was released on November 10th, 2019 and therefore should still be good for quite some time.

- There is now a pip package for pycocotools, so instead of the above procedure (cloning and building) you can now simply do:

sudo -H pip3 install pycocotools
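
Note that the traceback in the question shows pycocotools being imported from /home/cdahms/models/research/pycocotools rather than from a system-wide install, so it can be worth confirming which copy Python actually picks up after reinstalling. A minimal sketch using only the standard library follows; `locate_module` is a hypothetical helper name.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    return spec.origin


# Packages involved in the evaluation step from the question.
for name in ("numpy", "pycocotools", "torch", "torchvision"):
    print(name, "->", locate_module(name) or "not installed")
```

If the reported path is not the one you just installed, another copy earlier on sys.path (for example via PYTHONPATH) is probably shadowing it.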

--- Update ---

This has now been fixed in pycocotools with this pull request:

https://github.com/cocodataset/cocoapi/pull/354

Also see this (closed) issue for more background:

https://github.com/numpy/numpy/issues/15192
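
For illustration, the failing line in the traceback computes the number of IoU thresholds with np.round, which returns a float, and passes it to np.linspace as num. The sketch below shows the general shape of a fix, casting to int before the call. It is not necessarily the exact change made in the pull request, and both function names are hypothetical.

```python
import numpy as np


def iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 with an integer sample count."""
    num = int(np.round((0.95 - 0.5) / 0.05)) + 1  # cast the float to int
    return np.linspace(0.5, 0.95, num, endpoint=True)


def float_num_is_rejected():
    """Report whether the installed numpy refuses a float `num`."""
    try:
        # Same expression as in the traceback: `num` is a numpy.float64.
        np.linspace(0.5, 0.95, np.round((0.95 - 0.5) / 0.05) + 1, endpoint=True)
    except TypeError:
        return True
    return False


print(len(iou_thresholds()), "thresholds")
print("float num rejected by this numpy:", float_num_is_rejected())
```

With the cast in place, the call works regardless of whether numpy accepts a float num.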

I'm not sure when the updated version will make it into the pycocotools pip package.

cdahms answered Nov 10 '22 00:11