Linux: writes are split into 512K chunks

I have a user-space application that generates large SCSI writes (details below). However, when I look at the SCSI commands that reach the SCSI target (i.e. the storage, connected over Fibre Channel), something is splitting these writes into 512K chunks.

The application basically does 1M-sized direct writes to the device:

fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);

This code causes two SCSI WRITEs to be sent, 512K each.

However, if I issue a direct SCSI command, bypassing the block layer, the write is not split. I issue the following command from the command line:

sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct

I can see one single 1M-sized SCSI WRITE.

The question is: what is splitting the write and, more importantly, is it configurable? The Linux block layer seems to be the culprit (because SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of configurable parameter.

Leonid99 asked May 08 '12 07:05



2 Answers

As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):

  • max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
  • max_sectors_kb: the maximum size the block layer will send
  • max_segment_size and max_segments: the DMA engine limitations for scatter-gather (SG) I/O (the maximum size of each segment and the maximum number of segments for a single I/O)

The segment restrictions matter a lot when the buffer the I/O is coming from is not contiguous: in the worst case each segment can be as small as a page (which is 4096 bytes on x86 platforms). This means a single scatter-gather I/O can be limited to a size of 4096 * max_segments bytes.
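
As a rough illustration, the limits listed above can be read from sysfs and combined into a worst-case estimate. This is a minimal sketch and not part of the original answer: the function names read_queue_limits and worst_case_sg_bytes are hypothetical, the sysfs_root parameter exists only so the code can be pointed at a different directory, and the page size defaults to whatever the running system reports.

```python
import os
from pathlib import Path

# Queue attributes named in the answer; all live under /sys/block/<disk>/queue/.
LIMIT_FILES = ("max_hw_sectors_kb", "max_sectors_kb", "max_segment_size", "max_segments")


def read_queue_limits(disk, sysfs_root="/sys/block"):
    """Return the queue limits for `disk` (e.g. "sdab") as a dict of ints."""
    queue_dir = Path(sysfs_root) / disk / "queue"
    return {name: int((queue_dir / name).read_text().strip()) for name in LIMIT_FILES}


def worst_case_sg_bytes(max_segments, page_size=None):
    """Largest I/O that still fits when every SG segment is a single page."""
    if page_size is None:
        # Assumption: use the page size of the machine running this script.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return page_size * max_segments
```

On the machine from the question, read_queue_limits("sdab") would return the four values for that disk, and passing its max_segments entry to worst_case_sg_bytes gives the size below which a fully fragmented buffer should not need splitting for segment reasons.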

The question is, what is splitting the write

As you guessed, the Linux block layer.

and, more importantly, is it configurable?

You can adjust max_sectors_kb, but the other limits are fixed and come from device/driver restrictions. My guess is that in your case the answer is probably not, though you might see bigger I/O directly after a reboot because memory is less fragmented.

512K seems too arbitrary a number not to be some sort of a configurable parameter

The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:

4096 * 128 / 1024 = 512

and that's where 512K could come from.
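
The same arithmetic can be written out as a small calculation, which also shows why a 1M write would arrive as two commands. This is only a sketch under the answer's assumptions (4096-byte pages, a max_segments of 128, and every segment being a single page long); the function names are made up for illustration.

```python
import math


def worst_case_limit_kib(page_size, max_segments):
    """Worst-case size of one request, in KiB, when each SG segment holds one page."""
    return page_size * max_segments // 1024


def requests_needed(io_bytes, limit_kib):
    """Number of requests an I/O of io_bytes is split into under a limit_kib cap."""
    return math.ceil(io_bytes / (limit_kib * 1024))


limit = worst_case_limit_kib(4096, 128)      # values assumed in the answer
print(limit)                                 # 512
print(requests_needed(1024 * 1024, limit))   # 2
```

With those assumed values, the 1M write from the question needs two 512K requests, which matches the two SCSI WRITEs observed at the target.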

Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...

Anon answered Sep 22 '22 22:09



The blame does indeed lie with the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are actually able to pass your request, especially with direct I/O, since the buffer may be split into many small pages and require a scatter-gather list longer than what the hardware, or even just the driver, can support (libata is/was somewhat limited).

You should look at and tune the files under /sys/class/block/$DEV/queue. There are assorted files there; the one most likely to match what you need is max_sectors_kb, but you can just try things out and see what works for you. You may also need to tune the partitions' variables.
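
For illustration, tuning max_sectors_kb amounts to writing a new value to its sysfs file. The sketch below is an example under stated assumptions rather than a recommended tool: the helper name set_max_sectors_kb and the sysfs_root parameter are hypothetical, it needs root privileges on a real system, and it checks the request against max_hw_sectors_kb first because the kernel rejects values above that limit.

```python
from pathlib import Path


def set_max_sectors_kb(dev, new_kb, sysfs_root="/sys/class/block"):
    """Set max_sectors_kb for `dev`, refusing values above max_hw_sectors_kb.

    Returns the value read back from the file after the write.
    """
    queue_dir = Path(sysfs_root) / dev / "queue"
    hw_limit = int((queue_dir / "max_hw_sectors_kb").read_text().strip())
    if new_kb > hw_limit:
        raise ValueError(f"{new_kb} KiB exceeds max_hw_sectors_kb ({hw_limit} KiB)")
    target = queue_dir / "max_sectors_kb"
    target.write_text(f"{new_kb}\n")
    # Read the value back so the caller sees what is actually in effect.
    return int(target.read_text().strip())
```

Reading the value back is a deliberate choice: it confirms what the kernel accepted instead of assuming the write took effect. Raising this limit does not lift the segment-based restrictions, so a 1M write may still be split afterwards.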

Baruch Even answered Sep 23 '22 22:09
