When an application has a huge amount of data (400M) to write to a non-blocking socket, write() fails with EWOULDBLOCK or EAGAIN once the send buffer becomes full.
When the socket is (e)polled, I sometimes see a write-ready notification happening when there's 7M space in the send buffer, sometimes 20M and at other times 1M. The variation in the delay between write-ready callbacks is huge: from milliseconds to tens of seconds!
So my question is when exactly does the kernel trigger a write-ready for a socket? What affects triggering of write-ready? Obviously it's not triggered as soon as 1B is written to the wire.
Any help would be appreciated!
I'm using:
Ubuntu 12.04 LTS
Kernel 3.8.0-39-generic
Arch: x86_64
EDIT: Sockets in this context are TCP/IP sockets.
So my question is when exactly does the kernel trigger a write-ready for a socket?
tl;dr: As long as your socket has enough buffer space, writes succeed and epoll_wait will return events to say so in the default level-triggered mode. If the socket runs out of space, blocking writers will sleep. The kernel will wake processes (or deliver epoll events to say the socket is writable) when data is acknowledged, freeing up space, but only if the socket had run out of space. As before, as long as the socket stays writable the level-triggered events will keep pouring in, even if no new notifications come from TCP.
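From userspace, the usual pattern that goes with this is: write until you get EAGAIN/EWOULDBLOCK, then wait for a level-triggered EPOLLOUT and continue. Here is a minimal sketch of that, not taken from any particular codebase. The helper names (set_nonblocking, write_until_full, wait_writable, send_all) are made up for illustration, and creating a fresh epoll instance per wait is only done to keep the example short; a real event loop would register the fd once.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);
    if (fl < 0)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Hypothetical helper: write until everything is written or the buffer is
 * full. Returns the number of bytes written (may be < len), or -1 on a
 * real error. */
static ssize_t write_until_full(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n > 0)
            off += (size_t)n;
        else if (n == 0)
            break;                      /* nothing accepted, let caller wait */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* buffer full: wait for EPOLLOUT */
        else if (errno != EINTR)
            return -1;                  /* real error */
    }
    return (ssize_t)off;
}

/* Hypothetical helper: wait for a level-triggered EPOLLOUT on fd.
 * Returns 1 if an event arrived, 0 on timeout, -1 on error. */
static int wait_writable(int fd, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;
    struct epoll_event ev = {0}, out;
    ev.events = EPOLLOUT;               /* level-triggered by default */
    ev.data.fd = fd;
    int r = epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    if (r == 0) {
        do {
            r = epoll_wait(ep, &out, 1, timeout_ms);
        } while (r < 0 && errno == EINTR);
    }
    int saved = errno;
    close(ep);
    errno = saved;
    return r;
}

/* Hypothetical helper: send the whole buffer, sleeping whenever it fills. */
static int send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = write_until_full(fd, buf + off, len - off);
        if (n < 0)
            return -1;
        off += (size_t)n;
        if (off < len && wait_writable(fd, -1) < 0)
            return -1;
    }
    return 0;
}
```

With this shape, the long gaps between wakeups described in the question show up as time spent inside wait_writable: nothing is reported until the peer has acknowledged data and space has been freed.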
The function that performs the actual notification is sk_write_space. This is a member of struct sock and for TCP the relevant implementation is sk_stream_write_space in stream.c.
    ...
    if (skwq_has_sleeper(wq))
        wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
                                   EPOLLWRNORM | EPOLLWRBAND);
    if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
        sock_wake_async(wq, SOCK_WAKE_SPACE, POLL_OUT);
    ...
This function wakes up any callers that might be waiting for memory. (Compare this with sock_def_write_space.)
But when is sk_write_space called? There are a few call sites but the most prominent is tcp_new_space, which is called by tcp_check_space, which is called by tcp_data_snd_check, which is called from a bunch of places on the receive path. The function has a descriptive comment:
When incoming ACK allowed to free some skb from write_queue, we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket on the exit from tcp input handler.
tcp_check_space is interesting:
    if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
        sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
        /* pairs with tcp_poll() */
        smp_mb();
        if (sk->sk_socket &&
            test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
            tcp_new_space(sk);
            ...
    }
Some relevant bits here:
SOCK_QUEUE_SHRUNK is defined as "write queue has been shrunk recently". Per the comment quoted above, it is set when an incoming ACK frees skbs from the write queue. tcp_check_space checks and clears it.
SOCK_NOSPACE is set on the transmit path when we run out of buffer space.
The conclusion from all this is that tcp_check_space avoids sending events unless the socket was out of space.
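To make the two-flag gating easier to see, here is a toy userspace model of it. This is not kernel code: the struct and every function name are invented for illustration, and the model assumes that enough space has been freed whenever a wakeup is issued (the real code re-checks writability before clearing the no-space flag).

```c
#include <stdbool.h>

/* Toy model only: two booleans standing in for the kernel flags. */
struct toy_sock {
    bool nospace;       /* stands in for SOCK_NOSPACE */
    bool queue_shrunk;  /* stands in for SOCK_QUEUE_SHRUNK */
};

/* A writer found the send buffer full. */
static void toy_writer_hit_full_buffer(struct toy_sock *s)
{
    s->nospace = true;
}

/* An ACK freed some queued data. */
static void toy_ack_freed_data(struct toy_sock *s)
{
    s->queue_shrunk = true;
}

/* Mirrors the shape of the tcp_check_space excerpt.
 * Returns true when a write-space wakeup would be issued. */
static bool toy_check_space(struct toy_sock *s)
{
    if (!s->queue_shrunk)
        return false;           /* nothing was freed since the last check */
    s->queue_shrunk = false;
    if (!s->nospace)
        return false;           /* nobody ran out of space: stay quiet */
    s->nospace = false;         /* assumption: enough space is free now */
    return true;
}
```

In this model an ACK alone never produces a wakeup. A wakeup only happens when a writer previously hit a full buffer and an ACK has freed data since then, which is the behaviour described above.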
What about tcp_data_snd_check? During the steady state the most relevant calls are in tcp_rcv_established:
The fast-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5575
The almost-fast path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5618
The slow-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5658
All of these signal data was successfully ACKd.
There are other callers of sk_write_space in TCP. do_tcp_sendpages and tcp_sendmsg_locked call it on error paths to make sure callers are woken up. do_tcp_setsockopt calls it when setting TCP_NOTSENT_LOWAT.
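For reference, setting TCP_NOTSENT_LOWAT from userspace is a plain setsockopt call at the IPPROTO_TCP level. A minimal sketch follows; the helper names are made up, and the fallback #define assumes the value 25 from linux/tcp.h for libc headers that predate the option. As far as I know the option was added in Linux 3.12, so it would not be available on the 3.8 kernel from the question.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25    /* assumed value from linux/tcp.h */
#endif

/* Hypothetical helper: set the unsent-bytes low-water mark on a TCP socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int set_notsent_lowat(int fd, unsigned int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof bytes);
}

/* Hypothetical helper: read the current value back. */
static int get_notsent_lowat(int fd, unsigned int *bytes)
{
    socklen_t len = sizeof *bytes;
    return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

Usage would be something like set_notsent_lowat(fd, 16384) right after creating the socket; on kernels without the option the call fails and the socket keeps its default behaviour.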