The input image size of U-Net is 572×572, but the output mask size is 388×388. How can the image be masked with a smaller mask?
You are probably referring to the paper by Ronneberger et al. in which the U-Net architecture was introduced; the architecture diagram there shows these numbers.

The explanation is a bit hidden in section "3. Training" of the paper:
Due to the unpadded convolutions, the output image is smaller than the input by a constant border width.
This means that each convolution "crops" the image: the kernel is only placed at positions where it fully overlaps the input image / input blob of the layer. For 3x3 convolutions, this removes one pixel at each side. For a visual explanation of kernels/convolutions, see e.g. here. The output is smaller because, due to the cropping that occurs with unpadded convolutions, only the inner part of the image gets a result.
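You can verify the 572 → 388 arithmetic yourself. The following sketch traces the spatial size through the layer sequence of the paper's architecture (each unpadded 3x3 convolution trims one pixel per side, each 2x2 max-pool halves the size, each 2x2 up-convolution doubles it):

```python
def unet_output_size(size: int) -> int:
    """Trace the spatial size through U-Net's layers, assuming the
    layer sequence from Ronneberger et al.: valid 3x3 convs,
    2x2 max-pools, and 2x2 up-convolutions."""
    conv = lambda s: s - 2          # unpadded 3x3 conv trims 1 px per side
    # contracting path: two convs, then a 2x2 max-pool, four times
    for _ in range(4):
        size = conv(conv(size)) // 2
    size = conv(conv(size))         # bottleneck: two more convs
    # expanding path: 2x2 up-conv doubles the size, then two convs, four times
    for _ in range(4):
        size = conv(conv(size * 2))
    return size

print(unet_output_size(572))  # 388
```

Running this with an input of 572 reproduces the 388 output size from the paper's figure.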
It is not a general characteristic of the architecture but something inherent to (unpadded) convolutions, and it can be avoided with padding. Probably the most common strategy is mirroring at the image borders, so that each convolution can start at the very edge of the image (and sees mirrored pixels wherever its kernel overlaps the border). Then the input size is preserved and the full image will be segmented.
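As a minimal illustration of the mirroring idea (this is my own sketch, not the paper's code; note that U-Net itself keeps its convolutions unpadded and uses mirroring only for its overlap-tile strategy at image borders):

```python
import numpy as np

def conv3x3_mirrored(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 3x3 convolution after mirror-padding the borders by one
    pixel, so the output keeps the input's spatial size."""
    padded = np.pad(img, 1, mode="reflect")   # mirror one pixel at each side
    h, w = img.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
blur = np.ones((3, 3)) / 9.0                  # simple averaging kernel
print(conv3x3_mirrored(img, blur).shape)      # (5, 5) - size is preserved
```

Without the `np.pad(..., mode="reflect")` line, the valid convolution of a 5×5 image would yield only a 3×3 result, which is exactly the shrinkage discussed above.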