I have a user-space application that generates big SCSI writes (details below). However, when I'm looking at the SCSI commands that reach the SCSI target (i.e. the storage, connected by the FC) something is splitting these writes into 512K chunks.
The application basically does 1M-sized direct writes directly into the device:
fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);
This code causes two SCSI WRITEs to be sent, 512K each.
However, if I issue a direct SCSI command, without the block layer, the write is not split. I issue the following command from the command line:
sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct
I can see one single 1M-sized SCSI WRITE.
The question is, what is splitting the write and, more importantly, is it configurable? The Linux block layer seems to be guilty (because SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of a configurable parameter.
As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):
max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
max_sectors_kb: the maximum size the block layer will send
max_segment_size and max_segments: the DMA engine limitations for scatter gather (SG) I/O (maximum size of each segment and the maximum number of segments for a single I/O)
The segment restrictions matter a lot when the buffer the I/O is coming from is not contiguous, and in the worst case each segment can be as small as a page (which is 4096 bytes on x86 platforms). This means SG I/O for one I/O can be limited to a size of 4096 * max_segments.
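To see which of these limits applies to a given disk, the files can simply be read from sysfs. The following is a minimal sketch, not a definitive tool: show_queue_limits is a hypothetical helper name, and the queue directory is passed in as an argument so that nothing is assumed about the device name.

```shell
# show_queue_limits is a hypothetical helper (not part of any package).
# It prints the request-size limits found in one queue directory,
# for example /sys/block/sdab/queue.
show_queue_limits() {
    queue_dir=$1
    for name in max_hw_sectors_kb max_sectors_kb max_segments max_segment_size; do
        if [ -r "$queue_dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$queue_dir/$name")"
        else
            # Older kernels or unusual drivers may not expose every file.
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Example: show_queue_limits /sys/block/sdab/queue
```

Reading these files does not require root. Comparing max_sectors_kb with 4096 * max_segments / 1024 should show which limit is the tighter one.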
The question is, what is splitting the write
As you guessed, the Linux block layer.
and, more importantly, is it configurable?
You can fiddle with max_sectors_kb, but the rest is fixed and comes from device/driver restrictions (so I'm going to guess in your case probably not, but you might see bigger I/O directly after a reboot due to less memory fragmentation).
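A rough sketch of adjusting max_sectors_kb follows. set_max_sectors_kb is a hypothetical helper name, and the clamp is there on the assumption that the kernel refuses values above max_hw_sectors_kb. Writing to the real sysfs file needs root, and the setting does not survive a reboot.

```shell
# set_max_sectors_kb is a hypothetical helper (not part of any package).
# It requests a new max_sectors_kb for the given queue directory,
# capped at that queue's max_hw_sectors_kb, and prints the resulting value.
set_max_sectors_kb() {
    queue_dir=$1
    wanted_kb=$2
    hw_limit_kb=$(cat "$queue_dir/max_hw_sectors_kb") || return 1
    # Assumption: values above the hardware limit are rejected, so clamp first.
    if [ "$wanted_kb" -gt "$hw_limit_kb" ]; then
        wanted_kb=$hw_limit_kb
    fi
    echo "$wanted_kb" > "$queue_dir/max_sectors_kb" || return 1
    cat "$queue_dir/max_sectors_kb"
}

# Example (as root): set_max_sectors_kb /sys/block/sdab/queue 1024
```

Even when the write succeeds, the segment limits can still cap the request size, so a larger max_sectors_kb alone may not produce 1M writes.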
512K seems too arbitrary a number not to be some sort of a configurable parameter
The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:
4096 * 128 / 1024 = 512
and that's where 512K could come from.
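The same arithmetic can be run against live values. This is only a sketch: worst_case_io_kb is a hypothetical helper name, and it models the pessimistic case described above, where every segment holds a single page.

```shell
# worst_case_io_kb is a hypothetical helper (not part of any package).
# It prints the largest I/O size in KiB that still fits when each
# scatter-gather segment covers only one page.
worst_case_io_kb() {
    page_size_bytes=$1
    max_segments=$2
    echo $(( page_size_bytes * max_segments / 1024 ))
}

# The numbers assumed in the answer above:
worst_case_io_kb 4096 128

# With live values (device name taken from the question):
# worst_case_io_kb "$(getconf PAGESIZE)" "$(cat /sys/block/sdab/queue/max_segments)"
```

If the result for your device is 512, the segment count rather than max_sectors_kb is the likely source of the split.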
Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...
The blame is indeed on the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are indeed able to pass your request, especially with regard to direct I/O, since the buffer may be split into many small pages and require a scatter-gather list that is longer than what the hardware or even just the drivers can support (libata is/was somewhat limited).
You should look at and tune the files under /sys/class/block/$DEV/queue. There are assorted files there, and the one most likely to match what you need is max_sectors_kb, but you can just try it out and see what works for you. You may also need to tune the partitions' variables as well.