TensorFlow Object Detection API killed - OOM. How to reduce shuffle buffer size?

Question

System information

OS Platform and Distribution: CentOS 7.5.1804
TensorFlow installed from: pip install tensorflow-gpu
TensorFlow version: tensorflow-gpu 1.8.0
CUDA/cuDNN version: 9.0/7.1.2
GPU model and memory: GeForce GTX 1080 Ti, 11264MB
Exact command to reproduce:

python train.py --logtostderr --train_dir=./models/train --pipeline_config_path=mask_rcnn_inception_v2_coco.config

Describe the problem

I am attempting to train a Mask-RCNN model on my own dataset (fine tuning from a model trained on COCO), but the process is killed as soon as the shuffle buffer is filled.

Before this happens, nvidia-smi shows memory usage of around 10669MB/11175MB but only 1% GPU utilisation.

I have tried adjusting the following train_config settings:

batch_size: 1    
batch_queue_capacity: 10    
num_batch_queue_threads: 4    
prefetch_queue_capacity: 5

And for train_input_reader:

num_readers: 1
queue_capacity: 10
min_after_dequeue: 5

I believe my problem is similar to "TensorFlow Object Detection API - Out of Memory", but I am using a GPU rather than CPU-only.

The images I am training on are comparatively large (2048*2048); however, I would like to avoid downsizing them, as the objects to be detected are quite small. My training set consists of 400 images (in a .tfrecord file).

Is there a way to reduce the size of the shuffle buffer to see if this reduces the memory requirement?

Traceback

INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
2018-06-19 12:21:33.487840: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 97 of 2048
2018-06-19 12:21:43.547326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 231 of 2048
2018-06-19 12:21:53.470634: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 381 of 2048
2018-06-19 12:21:57.030494: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
Killed

liuchangf · Accepted Answer

You can try the following steps:

1. Set batch_size=1 (or a value that suits your setup).

2. Change the default value of shuffle_buffer_size in models/research/object_detection/protos/input_reader.proto (line 40 at commit ce03903). The original line is:

optional uint32 shuffle_buffer_size = 11 [default = 2048];

Change it to, for example:

optional uint32 shuffle_buffer_size = 11 [default = 256];

The default of 2048 is too large for batch_size=1 and, in my opinion, consumes a lot of RAM, so adjust it to suit your setup.

3. Recompile the Protobuf libraries

From tensorflow/models/research/, run:

protoc object_detection/protos/*.proto --python_out=.
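For a rough sense of how much host RAM the buffer could hold, here is a back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement. estimate_shuffle_buffer_bytes is a hypothetical helper, and BYTES_PER_IMAGE assumes each buffered element costs about as much as one uncompressed 2048*2048 RGB image. Encoded records would usually be smaller, while records that also carry instance masks could be larger. It also assumes the buffer cannot hold more elements than the dataset has, which appears consistent with the log in the question, where the buffer reports "filled" long before reaching 2048.

```python
def estimate_shuffle_buffer_bytes(buffer_size, num_records, bytes_per_element):
    """Rough upper bound on the bytes held by a shuffle buffer.

    Assumes the buffer holds at most min(buffer_size, num_records) elements
    and that every element costs bytes_per_element bytes.
    """
    held = min(buffer_size, num_records)
    return held * bytes_per_element


# Assumption: one element costs about one uncompressed 2048*2048 RGB uint8 image.
BYTES_PER_IMAGE = 2048 * 2048 * 3

if __name__ == "__main__":
    # 400 is the number of training images stated in the question.
    for size in (2048, 256, 200):
        total = estimate_shuffle_buffer_bytes(size, 400, BYTES_PER_IMAGE)
        print("shuffle_buffer_size=%d -> about %.2f GiB" % (size, total / 2.0**30))
```

Under these assumptions, the footprint only shrinks once shuffle_buffer_size drops below the number of records in the dataset.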

CodePerfectPlus · Answer

In your pipeline.config, add the line

shuffle_buffer_size: 200

(or a value that suits your system) to train_input_reader:

train_input_reader {
  shuffle_buffer_size: 200
  label_map_path: "tfrecords/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "tfrecords/train.record"
  }
}

This works for me; I have tested it on both TF1 and TF2.
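If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch that uses plain string handling rather than the API's own config tooling. set_shuffle_buffer_size is a hypothetical helper, and it assumes the config has balanced braces with no brace characters inside quoted strings.

```python
import re


def set_shuffle_buffer_size(config_text, size, block="train_input_reader"):
    """Return config_text with shuffle_buffer_size set inside the named block.

    Hypothetical helper: replaces the first existing shuffle_buffer_size field
    found in the block, or adds one right after the opening brace if absent.
    """
    opening = re.search(
        r"^([ \t]*)" + re.escape(block) + r"\s*:?\s*\{", config_text, re.MULTILINE
    )
    if opening is None:
        raise ValueError("block not found: " + block)

    # Walk forward from the opening brace to its matching closing brace.
    depth = 0
    end = None
    for i in range(opening.end() - 1, len(config_text)):
        if config_text[i] == "{":
            depth += 1
        elif config_text[i] == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
    if end is None:
        raise ValueError("unbalanced braces in block: " + block)

    body = config_text[opening.end():end]
    field = re.compile(r"^([ \t]*)shuffle_buffer_size[ \t]*:[ \t]*\d+", re.MULTILINE)
    new_line = "shuffle_buffer_size: %d" % size
    if field.search(body):
        # Keep the existing indentation and swap only the value.
        new_body = field.sub(lambda m: m.group(1) + new_line, body, count=1)
    else:
        # Insert as the first field, indented two spaces past the block header.
        new_body = "\n" + opening.group(1) + "  " + new_line + body
    return config_text[:opening.end()] + new_body + config_text[end:]


if __name__ == "__main__":
    example = (
        "train_input_reader {\n"
        '  label_map_path: "tfrecords/label_map.pbtxt"\n'
        "  tf_record_input_reader {\n"
        '    input_path: "tfrecords/train.record"\n'
        "  }\n"
        "}\n"
    )
    print(set_shuffle_buffer_size(example, 200))
```

Run on the config above, this adds the shuffle_buffer_size line as the first field of train_input_reader and leaves the rest of the text unchanged.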

zhai · Answer

I changed from the flow_from_directory function to flow_from_dataframe, because it does not load the pixel data of all images into memory.


Tags: python, tensorflow, object-detection-api, tfrecord

Asked by dpaddon
