
What's the benefit of Async File NIO in Java?

According to the documentation of AsynchronousFileChannel and AsynchronousChannelGroup, async NIO uses a dedicated thread pool where "IO events are handled". I couldn't find any clear statement of what "handling" means in this context, but according to this, I'm pretty sure that at the end of the day, blocking occurs on those dedicated threads. To narrow things down, I'm using Linux, and based on Alex Yursha's answer, there is no such thing as non-blocking file IO on it; only Windows supports it to some degree.
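For concreteness, this is roughly the usage pattern I have in mind (a minimal sketch; the file name, buffer size and pool size are arbitrary): the channel is opened against a caller-supplied ExecutorService, and the CompletionHandler callbacks run on that pool's threads.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncReadSketch {
    public static void main(String[] args) throws Exception {
        // the dedicated thread pool where "IO events are handled"
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Paths.get("data.bin"), Set.of(StandardOpenOption.READ), pool);

        ByteBuffer buffer = ByteBuffer.allocate(8192);
        channel.read(buffer, 0, buffer, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer buf) {
                // runs on one of the pool's threads once the read finishes
                System.out.println("read " + bytesRead + " bytes");
            }
            @Override
            public void failed(Throwable exc, ByteBuffer buf) {
                exc.printStackTrace();
            }
        });
        // the non-daemon pool keeps the JVM alive; shut it down once the callback has fired
    }
}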

My question is: what is the benefit of using async NIO versus synchronous IO running on a dedicated thread pool I create myself? Considering the introduced complexity, in what scenario would it still be worth implementing?

asked Jul 10 '20 by Peter


1 Answer

It's mostly about handrolling your buffer sizes. In that way, you can save a lot of memory, but only if you're trying to handle a lot (many thousands) of simultaneous connections.

First some simplifications and caveats:

  • I'm going to assume a non-boneheaded scheduler. Some OSes just do a really poor job of juggling thousands of threads. There is no inherent reason an OS must fall down when a user process fires up 1000 full threads, but some do anyway. NIO can help there, but that's a bit of an unfair comparison - usually you should just upgrade your OS. Pretty much any modern Linux, and I believe Windows 10, have no trouble with this many threads; some old Linux port on an ARM hack, or something like Windows 7, might cause problems.

  • I'm going to assume you're using NIO to deal with incoming TCP/IP connections (e.g. a web server, or an IRC server, something like that). The same principles apply if you're trying to read 1000 files simultaneously, but note that you do need to think about where the bottleneck lies. For example, reading 1000 files simultaneously from a single disk is a pointless exercise - it just slows things down because you're making life harder for the disk (this counts double if it's a spinning disk). For networking, especially on a fast pipe, the bottleneck is not the pipe or your network card, which makes 'handle 1000s of connections simultaneously' a good example. In fact, I'm going to use as an example a chat server where 1000 people all connect to one giant chatroom. The job is to receive text messages from anybody connected and send them out to everybody.

The synchronous model

In the synchronous model, life is relatively simple: We'll make 2001 threads:

  • 1 thread to listen for new incoming TCP connections on a socket. This thread will create the two 'handler' threads for that connection and go back to listening for new connections.
  • Per user, a thread that reads from the socket until it sees an enter symbol. When it does, it takes all text received so far and notifies all 1000 'sender' threads of this new string that needs to be sent out.
  • Per user, a thread that sends out the strings in a buffer of 'text messages to send out'. If there's nothing left to send, it waits until a new message is delivered to it.

Each individual moving piece is easily programmed. Some tactical use of a single java.util.concurrent datatype, or even some basic synchronized() blocks will ensure we don't run into any race conditions. I envision maybe 1 page of code for each piece.
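As an illustration only (class and method names here are invented, not taken from any real codebase), a skeleton of the design from the list above might look like this, with one BlockingQueue per user acting as the 'text messages to send out' buffer:

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.*;

public class SyncChatServer {
    // one outgoing-message queue per connected user
    private final List<BlockingQueue<String>> sendQueues = new CopyOnWriteArrayList<>();

    public void run() throws IOException {
        try (ServerSocket server = new ServerSocket(5000)) {
            while (true) {                                     // the 1 listener thread
                Socket socket = server.accept();
                BlockingQueue<String> queue = new LinkedBlockingQueue<>();
                sendQueues.add(queue);
                new Thread(() -> receive(socket)).start();       // per-user reader thread
                new Thread(() -> send(socket, queue)).start();   // per-user sender thread
            }
        }
    }

    private void receive(Socket socket) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {           // blocks until an enter symbol arrives
                for (BlockingQueue<String> q : sendQueues) q.put(line); // notify every sender
            }
        } catch (IOException | InterruptedException ignored) { }
    }

    private void send(Socket socket, BlockingQueue<String> queue) {
        try (Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            while (true) {
                String msg = queue.take();                     // blocks until a message is delivered
                out.write(msg + "\n");
                out.flush();
            }
        } catch (IOException | InterruptedException ignored) { }
    }
}

Each piece really is about a page of straightforward, blocking code; the cost is one thread, and therefore one stack, per connection.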

But, we do have 2001 threads. Each thread has a stack. In the JVM, every thread gets the same size stack (in practice you can't give an individual thread a differently sized stack), and you configure how large it is with the -Xss parameter. You can make them as small as, say, 128k, but even then that's still 128k * 2001 = ~256MB just for the stacks, and we haven't covered any of the heap (all those strings that people are sending back and forth, stuck in send queues), or the app itself, or the JVM basics.

Under the hood, what's going to happen on a CPU with, say, 16 cores, is that there are 2001 threads, and each thread has its own set of conditions that would result in it waking up. For the receivers it's data coming in over the pipe; for the senders it's either the network card indicating it is ready to send another packet (in case it's waiting to push data down the line), or an obj.wait() call getting notified (the threads that receive text from the users add that string to the queues of each of the 1000 senders and then notify them all).

That's a lot of context switching: A thread wakes up, sees Joe: Hello, everybody, good morning! in the buffer, turns that into a packet, blits it to the memory buffer of the network card (this is all extremely fast, it's just CPU and memory interacting), and will fall back asleep, for example. The CPU core will then move on and find another thread that is ready to do some work.

CPU cores have on-core caches; in fact, there's a hierarchy: main RAM, then L3 cache, then L2, then the on-core cache. On modern architectures a CPU can't really operate directly on RAM anymore; when it needs to read or write memory on a page that isn't in one of these caches, the CPU just freezes for a while until the infrastructure around the chip can copy that page of RAM into one of the caches.

Every time a core switches threads, it is highly likely that it needs to load a new page, and that can take many hundreds of cycles during which the CPU is twiddling its thumbs. A badly written scheduler causes a lot more of this than is needed. If you read about the advantages of NIO, 'those context switches are expensive!' often comes up - this is more or less what they are talking about (but, spoiler alert: the async model suffers from this too!).

The async model

In the synchronous model, the job of figuring out which of the 1000 connected users is ready for stuff to happen is 'stuck' in threads waiting on events; the OS is juggling those 1000 threads and will wake up threads when there's stuff to do.

In the async model we switch it up: We still have threads, but far fewer (one to two for each core is a good idea). That's far fewer threads than connected users: Each thread is responsible for ALL the connections, instead of only for 1 connection. That means each thread will do the job of checking which of the connected users have stuff to do (their network pipe has data to read, or is ready for us to push more data down the wire to them).

The difference is in what the thread asks the OS:

  • [synchronous] Okay, I want to go to sleep until this one connection sends data to me.
  • [async] Okay, I want to go to sleep until one of these thousand connections sends data to me, or a connection for which I registered that I have more data to send has a clear network buffer again, or the socket listener has a new user connecting. (This is roughly the question the selector sketch below asks.)
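In Java terms, that second question is roughly what a java.nio Selector answers. A minimal sketch only (port number and structure are illustrative, not a complete server):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorLoopSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(5000));
        listener.configureBlocking(false);
        listener.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();   // "wake me when ANY registered channel has something to do"
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = listener.accept();   // the new user connecting
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // this connection sent data; read whatever happens to be there
                } else if (key.isWritable()) {
                    // the network buffer cleared; push more queued data down the wire
                    // (only fires if OP_WRITE interest was registered when output was queued)
                }
            }
        }
    }
}

One or two of these loops per core is the 'far fewer threads' setup described above.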

There is no inherent speed or design advantage to either model - we're just shifting the job around between app and OS.

One advantage often touted for NIO is that you don't need to 'worry' about race conditions, synchronization, or concurrency-safe data structures. This is a commonly repeated falsehood: CPUs have many cores, so if your non-blocking app only ever makes one thread, the vast majority of your CPU will just sit there idle doing nothing, which is highly inefficient.

The great upside here is: Hey, only 16 threads. That's 128k * 16 = 2MB of stack space. That's in stark contrast to the 256MB that the sync model took! However, a different thing now happens: In the synchronous model, a lot of state info about a connection is 'stuck' in that stack. For example, if I write this:

Let's assume the protocol is: the client sends one int, which is the number of bytes in the message, followed by that many bytes, which are the message, UTF-8 encoded.

// synchronous code
// 'in' is assumed to be a DataInputStream wrapping the socket's InputStream;
// sendMessage() hands the finished line to every sender's queue.
int size = in.readInt();                      // blocks until the 4-byte length arrives
byte[] buffer = new byte[size];
int pos = 0;
while (pos < size) {                          // keep reading until the whole message is in
    int r = in.read(buffer, pos, size - pos); // blocks; returns how many bytes arrived
    if (r == -1) throw new IOException("Client hung up");
    pos += r;
}
sendMessage(username + ": " + new String(buffer, StandardCharsets.UTF_8));

When running this, the thread is most likely going to end up blocking on that read call into the InputStream, as that involves talking to the network card and moving some bytes from its memory buffers into this process's buffers to get the job done. Whilst it's frozen, the pointer to that byte array, the size variable, r, etcetera are all on the stack.

In the async model, it doesn't work that way. In the async model, data is handed to you: you get whatever happens to be there, and you must handle it then and there, because if you don't, that data is gone.

So, in the async model you get, say, half of the Hello everybody, good morning! message. You get the bytes that represent Hello eve and that's it. For that matter, you already received the total byte length of this message and need to remember that, as well as the half you received so far. You need to explicitly make an object and store this stuff somewhere.

Here's the key point: With the synchronous model, a lot of your state info is in stacks. In the async model, you make the data structures to store this state yourself.

And because you make these yourself, they can be dynamically sized, and generally far smaller: You just need ~4 bytes to store size, another 8 or so for a pointer to the byte array, a handful for the username pointer and that's about it. That's orders of magnitude less than the 128k that stack is taking to store that stuff.
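As a sketch (all names here are invented for illustration, and it assumes the 4-byte length prefix arrives in one chunk), such a handrolled per-connection state object might look like this; one instance of it replaces the 128k stack of the per-user reader thread in the synchronous design:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ConnectionState {
    final String username;
    int expectedSize = -1;        // -1: we haven't even received the length prefix yet
    ByteBuffer partialMessage;    // allocated once the length is known, sized exactly

    ConnectionState(String username) {
        this.username = username;
    }

    // Feed whatever bytes happened to arrive; returns the full line once we have it all.
    String accept(ByteBuffer incoming) {
        if (expectedSize == -1 && incoming.remaining() >= 4) {
            expectedSize = incoming.getInt();
            partialMessage = ByteBuffer.allocate(expectedSize);
        }
        if (partialMessage != null) {
            while (incoming.hasRemaining() && partialMessage.hasRemaining()) {
                partialMessage.put(incoming.get());
            }
            if (!partialMessage.hasRemaining()) {   // got the whole message
                String text = new String(partialMessage.array(), StandardCharsets.UTF_8);
                expectedSize = -1;
                partialMessage = null;
                return username + ": " + text;
            }
        }
        return null;   // still waiting for the rest; the state lives here, not on a stack
    }
}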

Now, another theoretical benefit is that you don't get the context switch - instead of the CPU and OS having to swap to another thread when a read() call has no data left to give you because the network card is waiting for data, it's now the thread's job to go: Okay, no problem - I shall move on to another context object.

But that's a red herring - it doesn't matter if the OS is juggling 1000 context concepts (1000 threads), or if your application is juggling 1000 context concepts (these 'tracker' objects). It's still 1000 connections and everybody chatting away, so every time your thread moves on to check another context object and fill its byte array with more data, most likely it's still a cache miss and the CPU is still going to twiddle its thumbs for hundreds of cycles whilst the hardware infrastructure pulls the appropriate page from main RAM into the caches. So that part is not nearly as relevant, though the fact that the context objects are smaller is going to reduce cache misses somewhat.

That gets us back to: The primary benefit is that you get to handroll those buffers, and in so doing, you can both make them far smaller, and size them dynamically.

The downsides of async

There's a reason we have garbage collected languages. There is a reason we don't write all our code in assembler. Carefully managing all these finicky details by hand is usually not worth it. And so it is here: Often that benefit is not worth it. But just like GFX drivers and kernel cores have a ton of machine code, and drivers tend to be written in hand-managed memory environments, there are cases where careful management of those buffers is very much worth it.

The cost is high, though.

Imagine a theoretical programming language with the following properties:

  • Each function is either red or blue.
  • A red function can call blue or red functions, no problem.
  • A blue function can also call both, but if a blue function directly calls a red function, you have a bug that is almost impossible to test for but that will kill your performance on realistic loads. A blue function may only call a red function by going out of its way to define the call and the handling of its result separately and injecting that pair into a queue.
  • Functions tend not to document their colour.
  • Some system functions are red.
  • Your function must be blue.

This seems like an utterly boneheaded disaster of a language, no? But that's exactly the world you live in when writing async code!

The problem is: within async code, you cannot call a blocking function, because if it blocks, hey, that's one of only 16 threads that is now blocked, and that immediately means 1/16th of your CPU is doing nothing. If all 16 threads end up in that blocking part, the CPU is literally doing nothing at all and everything is frozen. You just can't do it.

There is a ton of stuff that blocks: opening files, even touching a class never touched before (that class needs to be loaded from the jar on disk, verified, and linked), so much as looking at a database, doing a quick network check; sometimes even asking for the current time will do it. Even logging at debug level might do it (if that ends up writing to disk, voila - blocking operation).
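The only workable escape hatch is exactly the queue-and-callback dance from the colour analogy: hand the blocking ('red') work to a separate pool that is allowed to block, and register what should happen with the result. A hedged sketch (pool size and method names are arbitrary):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BlockingEscapeHatch {
    // a pool that is allowed to block, kept away from the 16 event-loop threads
    private static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(32);

    // called from an event-loop ('blue') thread; must not block it
    static void handleRequest(String query) {
        CompletableFuture
            .supplyAsync(() -> runDatabaseQuery(query), BLOCKING_POOL) // the 'red' call
            .thenAccept(result -> {
                // the 'response' half of the pair: runs once the result is ready,
                // typically by handing it back to the event loop's own queue
                System.out.println("result ready: " + result);
            });
    }

    private static String runDatabaseQuery(String query) {
        // stand-in for any blocking operation: JDBC, file IO, DNS lookup, ...
        try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        return "rows for " + query;
    }
}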

Do you know of any logging framework that either promises to fire up a separate thread to process logs onto disk, or goes out of its way to document if it blocks or not? I don't know of any, either.

So, methods that block are red, your async handlers are blue. Tada - that's why async is so incredibly difficult to truly get right.

The executive summary

Writing async code well is a real pain due to the coloured functions issue. It's also not on its face faster - in fact, it's usually slower. Async can win big if you want to run many thousands of operations simultaneously and the amount of storage required to track the relevant state data for each individual operation is small, because you get to handroll that buffer instead of being forced into relying on 1 stack per thread.

If you have some money left over, well, a developer salary buys you a lot of sticks of RAM, so usually the right option is to go with threads and just opt for a box with a lot of RAM if you want to handle many simultaneous connections.

Note that sites like YouTube, Facebook, etc. effectively take the 'toss money at RAM' solution - they shard their product so that many simple and cheap computers work together to serve up a website. Don't knock it.

An example where async can really shine is the chat app I've described in this answer. Another is, say, receiving a short message where all you do is hash it, encrypt the hash, and respond with it (to hash, you don't need to remember all the bytes flowing in; you can just toss each byte into the hasher, which has a constant memory load, and when the bytes are all sent, voila, you have your hash). You're looking for little state per operation, and not much CPU power either, relative to the speed at which the data is provided.
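A sketch of that hashing case (names invented; the encrypt-and-respond part is left out), showing why the per-connection state stays constant no matter how long the message is:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Per-connection state for the hash-and-respond example: constant memory,
// however large the incoming message is.
class HashingConnection {
    private final MessageDigest digest;

    HashingConnection() throws NoSuchAlgorithmException {
        this.digest = MessageDigest.getInstance("SHA-256");
    }

    // called with whatever chunk of bytes happened to arrive
    void onBytes(byte[] chunk, int offset, int length) {
        digest.update(chunk, offset, length);   // toss the bytes into the hasher, then forget them
    }

    // called once the sender is done
    byte[] onComplete() {
        return digest.digest();
    }
}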

Some bad examples: a system where you need to do a bunch of DB queries (you'd need an async way to talk to your DB, and in general DBs are bad at trying to run 1000 queries simultaneously), or a bitcoin mining operation (the mining itself is the bottleneck; there's no point trying to handle thousands of connections simultaneously on one machine).

answered Oct 11 '22 by rzwitserloot