Bypassing 4KB block size limitation on block layer/device

We are developing an SSD-type storage hardware device that can take read/write requests for big block sizes >4KB at a time (even in the MB range). My understanding is that Linux and its filesystems will chop files into 4KB blocks, which are then passed to the block device driver, which has to physically transfer each block to or from the device (e.g., write it out).

I am also aware that the kernel page size plays a role in this limitation, since it is set to 4KB.
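
For context, the sizes involved can be queried from user space. Here is a minimal sketch (the /dev/sda path is a placeholder; the program needs read permission on the device):

    /* Sketch: query the kernel page size and a block device's logical,
     * physical, and soft block sizes. /dev/sda is a placeholder. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET, BLKBSZGET */

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int logical = 0, physical = 0, soft = 0;
        ioctl(fd, BLKSSZGET, &logical);    /* logical sector size */
        ioctl(fd, BLKPBSZGET, &physical);  /* physical sector size */
        ioctl(fd, BLKBSZGET, &soft);       /* block size used by the block layer */

        printf("page size: %ld\n", sysconf(_SC_PAGESIZE));
        printf("logical/physical/soft block size: %d/%d/%d\n",
               logical, physical, soft);
        close(fd);
        return 0;
    }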

As an experiment, I want to find out whether there is a way to actually increase this block size, so that we can save some time (instead of doing multiple 4KB writes, we could do one write with a bigger block size).

Is there any filesystem, or any existing project, that I can look at for this? If not, what would this experiment require, i.e., which parts of Linux would need to be modified? I am trying to gauge the level of difficulty and the resources needed. Or, is it simply impossible, and/or is there a reason why we do not even need to do so? Any comment is appreciated.

Thanks.

asked Jun 02 '15 by electro



1 Answer

The 4k limitation is due to the page cache. The main issue is this: suppose you have a 4k page size but a 32k block size, and a file that is only 2000 bytes long, so you only allocate a 4k page to cover the first 4k of the block. Now someone seeks to offset 20000 and writes a single byte. Then suppose the system comes under a lot of memory pressure, and the 4k page for the first 2000 bytes, which is clean, gets pushed out of memory. How do you track which parts of the 32k block contain valid data, and what happens when the system needs to write out the dirty page at offset 20000?
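
For concreteness, that access pattern can be reproduced from user space with something like the following sketch (the file name is a placeholder; it just creates a 2000-byte file and then dirties one byte at offset 20000):

    /* Illustration of the scenario above: a short file, then a one-byte
     * write at offset 20000, landing in a different 4k page of the same
     * hypothetical 32k block. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char buf[2000] = {0};

        pwrite(fd, buf, sizeof(buf), 0);   /* file is 2000 bytes long      */
        pwrite(fd, "x", 1, 20000);         /* dirty a single byte at 20000 */
        /* With a 32k block size, both writes fall in block 0, but in
         * different 4k pages; the page cache must track them separately. */
        close(fd);
        return 0;
    }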

Also, let's assume the system is under a huge amount of memory pressure and we need to write out that last page; what if there isn't enough memory available to instantiate the other 28k of the 32k block, so that we can do the read-modify-write cycle needed just to update that one dirty 4k page at offset 20000?
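
That read-modify-write cycle would look roughly like this (a sketch, assuming a device opened with O_DIRECT and a hypothetical 32k block size):

    /* Sketch of the read-modify-write cycle: to update one dirty 4k page,
     * the whole 32k block must first be read in, patched, and written back.
     * Assumes an O_DIRECT fd and a hypothetical 32k physical block size. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 32768
    #define PAGE  4096

    int rmw_page(int fd, off_t block_off, int page_idx, const char *page)
    {
        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK))        /* O_DIRECT alignment */
            return -1;

        if (pread(fd, buf, BLOCK, block_off) != BLOCK)     /* read 32k  */
            goto fail;
        memcpy((char *)buf + page_idx * PAGE, page, PAGE); /* modify 4k */
        if (pwrite(fd, buf, BLOCK, block_off) != BLOCK)    /* write 32k */
            goto fail;

        free(buf);
        return 0;
    fail:
        free(buf);
        return -1;
    }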

These problems can all be solved, but it would require a lot of surgery in the VM layer. The VM layer would need to know that for this file system, pages must be instantiated in chunks of 8 pages at a time, and that if memory pressure forces out a particular page, all 8 pages must be written out at the same time if any are dirty, and then all 8 pages dropped from the page cache together. All of this implies tracking page usage and dirty state not at the 4k page level, but at the compound 32k page/"block" level. It would basically involve changes to almost every single part of the VM subsystem: the page cleaner, the page fault handler, the page scanner, the writeback algorithms, and so on.
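
As a rough illustration of the bookkeeping this implies, here is a hypothetical structure (not actual kernel code) showing the per-page state a 32k compound block would have to carry, so that eviction and writeback can be decided for all 8 pages as a unit:

    /* Hypothetical bookkeeping for a 32k "compound block" made of 8 pages.
     * Not actual kernel code; it just shows the state the VM would have to
     * track and keep consistent across eviction and writeback. */
    #include <stdint.h>
    #include <stdbool.h>

    #define PAGES_PER_BLOCK 8

    struct compound_block {
        uint8_t valid_mask;   /* bit i set: page i holds valid data      */
        uint8_t dirty_mask;   /* bit i set: page i has unwritten changes */
    };

    /* The block may be evicted only as a unit, and only once clean. */
    static bool can_evict(const struct compound_block *b)
    {
        return b->dirty_mask == 0;
    }

    /* Writeback must instantiate every missing page first (the
     * read-modify-write problem), then flush all 8 pages at once. */
    static bool needs_rmw_before_writeback(const struct compound_block *b)
    {
        return b->dirty_mask != 0 &&
               b->valid_mask != (1u << PAGES_PER_BLOCK) - 1;
    }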

Also consider that even if you did hire a Linux VM expert to do this work (the HDD vendors would deeply love you for it, since they also want to be able to ship HDDs with a 32k or 64k physical sector size), it would be 5-7 years before such a modified VM layer made its appearance in a Red Hat Enterprise Linux kernel, or the equivalent enterprise or LTS kernel for SuSE or Ubuntu. So if you are working at a startup that is hoping to sell its SSD product into the enterprise market, you might as well give up on this approach now; it's just not going to work before you run out of money.

Now, if you happen to work for a large cloud company that makes its own hardware (a la Facebook, Amazon, Google, etc.), maybe you could go down this particular path, since such companies don't use enterprise kernels that add new features at a glacial pace. But for that same reason, they want to stick relatively close to the upstream kernel to minimize their maintenance cost.

If you do work for one of these large cloud companies, I'd strongly recommend that you contact other companies in this same space and see whether you could collaborate on this kind of development work and together try to get the change upstream. It really, really is not a trivial change, though, especially since the upstream Linux kernel developers will demand that it not negatively impact performance in the common case, which will not involve >4k block devices any time in the near future. And if you work at a Facebook, Google, Amazon, etc., this is not the sort of change you would want to maintain privately in your own kernel; you would want to get it upstream, since otherwise it would be such a massive, invasive change that supporting it as an out-of-tree patch would be a huge headache.

answered Nov 12 '22 by Theodore Ts'o