
What does DeepLab's --train_crop_size actually do?

Following the instructions included with the model, --train_crop_size is set to a value much smaller than the size of the training images. For instance:

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=90000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size="769,769" \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
    --train_logdir=${PATH_TO_TRAIN_DIR} \
    --dataset_dir=${PATH_TO_DATASET}

But what does this option actually do? Does it take a random crop of each training image? If so, wouldn't the network's input dimensions be smaller, e.g. 769x769 (WxH) as in the example? Per the instructions, the eval crop size is set to 2049x1025. How can a network with 769x769 input dimensions take a 2049x1025 input when nothing suggests the image is resized? Wouldn't that cause a shape mismatch?

Are the instructions conflicting?

John M. asked May 12 '19 04:05


1 Answer

Yes, it seems that in your case the images are cropped during training. Cropping allows a larger batch size within the computational limits of your system. With a larger batch, each optimization (= training) step is based on multiple instances instead of only one or very few, which often leads to better results. Normally a random crop is used so that, over many steps, the network is trained on all parts of the image.
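A minimal sketch of what such a random crop looks like, in plain Python. This is not DeepLab's actual preprocessing code: the helper names `sample_crop_window` and `crop` are hypothetical, and the 2048x1024 dummy image is an assumption chosen to be larger than the 769x769 crop from the question. The point is only that the image and its label map are cut with the same randomly placed window.

```python
import random


def sample_crop_window(height, width, crop_h, crop_w, rng):
    """Pick the top-left corner of a random crop that fits inside the image."""
    if crop_h > height or crop_w > width:
        raise ValueError("crop must fit inside the image")
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return top, left


def crop(rows, top, left, crop_h, crop_w):
    """Cut a crop_h x crop_w window out of a 2-D list of pixels."""
    return [row[left:left + crop_w] for row in rows[top:top + crop_h]]


rng = random.Random(0)                      # seeded only to make the sketch repeatable
height, width = 1024, 2048                  # assumed dummy image size (H, W)
image = [[0] * width for _ in range(height)]
label = [[0] * width for _ in range(height)]

# The same window must be applied to the image and to its label map.
top, left = sample_crop_window(height, width, 769, 769, rng)
image_crop = crop(image, top, left, 769, 769)
label_crop = crop(label, top, left, 769, 769)
print(len(image_crop), len(image_crop[0]))
```

Because a new window is drawn for every training step, different regions of each image are seen over the course of training.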

Training or deploying a "fully convolutional" CNN does not require a fixed input size. With padding at the input edges, the spatial reduction is usually a factor of 2^n (caused by striding or pooling). Example: your encoder reduces each spatial dimension by a factor of 2^4 before the decoder upsamples it again. So you only have to make sure that your input dimensions are a multiple of 2^4. The exact input size does not matter; it only determines the spatial dimensions of the network's hidden layers during training. In the case of DeepLab, the framework automatically adapts the given input dimensions to the required multiple of 2^x to make it even easier to use.
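The size arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions, not DeepLab code: `feature_map_size` is a hypothetical helper, and it assumes each stride-2 step uses "same"-style padding, so a dimension n becomes ceil(n / 2). The same function works for the training crop and for the evaluation size from the question, which is why no shape mismatch arises.

```python
import math


def feature_map_size(input_size, output_stride=16):
    """Spatial size after repeated stride-2 steps with 'same'-style padding.

    Assumes output_stride is a power of two; each step maps n to ceil(n / 2).
    """
    size = input_size
    for _ in range(int(math.log2(output_stride))):
        size = math.ceil(size / 2)
    return size


# (height, width) pairs: the training crop and the eval size from the question.
for h, w in [(769, 769), (1025, 2049)]:
    print((h, w), "->", (feature_map_size(h), feature_map_size(w)))
# prints:
# (769, 769) -> (49, 49)
# (1025, 2049) -> (65, 129)
```

The weights of the convolutions are the same in both cases; only the spatial extent of the intermediate feature maps changes with the input.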

The evaluation instances should never be randomly cropped since only a deterministic evaluation process guarantees meaningful evaluation results. During the evaluation, there is no optimization and a batch size of one is fine.
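One deterministic way to bring every evaluation image to a fixed size is to pad it instead of cropping it. The sketch below illustrates that idea in plain Python; whether DeepLab does exactly this internally is an assumption here, and `pad_to_size` and the fill value are hypothetical. For a label map, the fill would typically be an "ignore" value so that padded pixels do not count toward the metric.

```python
def pad_to_size(rows, target_h, target_w, fill=0):
    """Pad a 2-D list on the bottom and right up to target_h x target_w."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    if height > target_h or width > target_w:
        raise ValueError("target size must be at least the image size")
    # Extend every existing row to the target width, then append full fill rows.
    padded = [list(row) + [fill] * (target_w - width) for row in rows]
    padded += [[fill] * target_w for _ in range(target_h - height)]
    return padded


image = [[1] * 2048 for _ in range(1024)]   # assumed dummy image (H=1024, W=2048)
padded = pad_to_size(image, 1025, 2049)
print(len(padded), len(padded[0]))

# No randomness is involved, so repeating the call gives the same result.
assert padded == pad_to_size(image, 1025, 2049)
```

Every evaluation run then sees exactly the same pixels, which keeps scores comparable between runs.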

FranklynJey answered Oct 10 '22 18:10