I am using ByteBuffer.allocateDirect() to allocate buffer memory for reading a file into memory and then hashing that file's bytes to produce a file hash (SHA). The input files range greatly in size, anywhere from a few KB to several GB.
I have read several threads and pages (even some on SO) about selecting a buffer size. Some advised trying to match the block size the native file system uses, to minimize the chance of a read operation fetching a partial block, etc. For example, with a 4100-byte buffer and NTFS defaulting to 4096-byte blocks, the extra 4 bytes would require a separate read operation, which is extremely wasteful.
So, sticking with powers of 2: 1024, 2048, 4096, 8192, etc. I have seen some recommend 32 KB buffers, and others recommend making the buffer the size of the input file (probably fine for small files, but what about large files?).
How important is it to stick to buffers sized to the native block size? Speaking in modern terms (assuming a modern SATA drive or better with at least 8 MB of on-drive cache, and other modern OS "magic" to optimize I/O), how critical is the buffer size, and how should I best determine what size to set mine to? Should I set it statically, or determine it dynamically? Thanks for any insight.
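For context, the read-and-hash loop I have in mind is roughly the sketch below. It is simplified, and the 64 KB size and SHA-256 algorithm are just placeholders, since the buffer size is exactly what I'm asking about:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class DirectBufferHash {
        // The value in question: 64 KB is only a placeholder.
        private static final int BUFFER_SIZE = 64 * 1024;

        public static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            ByteBuffer buffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                while (channel.read(buffer) != -1) {
                    buffer.flip();          // switch from filling to draining
                    digest.update(buffer);  // consumes everything between position and limit
                    buffer.clear();         // ready for the next read
                }
            }
            return digest.digest();
        }
    }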
To answer your direct question: (1) filesystems tend to use powers of 2, so you want to do the same. (2) the larger your working buffer, the less effect any mis-sizing will have.
As you say, if you allocate 4100 and the actual block size is 4096, you'll need two reads to fill the buffer. If, instead, you have a 1,000,000 byte buffer, then being one block high or low doesn't matter (because it takes 245 4096-byte blocks to fill that buffer). Moreover, the larger buffer means that the OS has a better chance to order the reads.
That said, I wouldn't use NIO for this. Instead, I'd use a simple BufferedInputStream, with maybe a 1k buffer for my read()s.
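A minimal sketch of that approach (SHA-256 and the 1k chunk size are illustrative choices, not requirements):

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class StreamHash {
        public static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] chunk = new byte[1024];   // the "1k buffer" handed to each read()
            try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    digest.update(chunk, 0, n);  // hash only the bytes actually read
                }
            }
            return digest.digest();
        }
    }

The BufferedInputStream's internal buffer (8 KB by default) does the bulk reads from the OS; the 1k array is just the unit you hand to the digest.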
The main benefit of NIO is keeping data out of the Java heap. If you're reading and writing a file, for example, using an InputStream means that the OS reads the data into a JVM-managed buffer, the JVM copies that into an on-heap buffer, then copies it again to an off-heap buffer, and then the OS reads that off-heap buffer to write the actual disk blocks (and typically adds its own buffers along the way). In this case, NIO eliminates those native-heap copies.
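As an aside, for a pure file-to-file copy the off-heap benefit is clearest with a channel-to-channel transfer, which never pulls the bytes into the Java heap at all. This is just a sketch of that idea, not something the hashing task needs:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class NioCopy {
        // Copies src to dst without moving the bytes through the Java heap.
        public static void copy(Path src, Path dst) throws IOException {
            try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
                 FileChannel out = FileChannel.open(dst,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                         StandardOpenOption.TRUNCATE_EXISTING)) {
                long position = 0;
                long size = in.size();
                while (position < size) {
                    // transferTo may transfer fewer bytes than requested, so loop
                    position += in.transferTo(position, size - position, out);
                }
            }
        }
    }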
However, to compute a hash you need the data in the Java heap, and the MessageDigest SPI will move it there. So you don't get the benefit of NIO keeping the data off-heap, and IMO the "old IO" is easier to write.
Just don't forget that InputStream.read() is not guaranteed to read all the bytes you ask for.
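If you ever do need a completely filled buffer, a loop along these lines handles short reads (on Java 9+, InputStream.readNBytes does the same job); with the digest.update(chunk, 0, n) pattern above it isn't necessary:

    import java.io.IOException;
    import java.io.InputStream;

    final class ReadUtil {
        // read(byte[], off, len) may return fewer bytes than requested; keep
        // reading until the buffer is full or the stream ends.
        static int readFully(InputStream in, byte[] buf) throws IOException {
            int total = 0;
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n == -1) {
                    break;      // end of stream before the buffer was filled
                }
                total += n;
            }
            return total;       // bytes actually read; may be less than buf.length
        }
    }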