
When *exactly* is a socket ready to write?

When an application has a huge amount of data (400M) to write to a non-blocking socket, write() returns EWOULDBLOCK or EAGAIN when the send buffer becomes full.

When the socket is (e)polled, I sometimes see a write-ready notification happening when there's 7M space in the send buffer, sometimes 20M and at other times 1M. The variation in the delay between write-ready callbacks is huge: from milliseconds to tens of seconds!

So my question is when exactly does the kernel trigger a write-ready for a socket? What affects triggering of write-ready? Obviously it's not triggered as soon as 1B is written to the wire.

Any help would be appreciated!

I'm using:

Ubuntu 12.04 LTS

Kernel 3.8.0-39-generic

Arch: x86_64

EDIT: Sockets in this context are TCP/IP sockets.

themoondothshine asked May 08 '14 16:05


1 Answer

So my question is when exactly does the kernel trigger a write-ready for a socket?

tl;dr: As long as your socket has free buffer space, writes succeed, and in the default level-triggered mode epoll_wait keeps reporting the socket as writable. If the socket runs out of space, blocking writers sleep. When data is acknowledged and space is freed, the kernel wakes those processes (or delivers an epoll event saying the socket is writable), but only if the socket had actually run out of space. After that, just as before, level-triggered events keep pouring in for as long as the socket stays writable, even if no new notifications come from TCP.
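From user space, that behaviour translates into the usual pattern: write until send() fails with EAGAIN/EWOULDBLOCK, then wait for EPOLLOUT before trying again. Here is a rough sketch of it. The helper names (set_nonblocking, write_until_full, send_all) are made up for illustration, and it assumes the epoll instance is watching only this one socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: send until everything is queued or the send buffer
 * is full (EAGAIN/EWOULDBLOCK). Returns bytes queued, or -1 on a real error. */
static ssize_t write_until_full(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL);
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);
        if (n > 0) {
            off += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;               /* interrupted, just retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                  /* buffer full: caller waits for EPOLLOUT */
        } else {
            return -1;
        }
    }
    return (ssize_t)off;
}

/* Hypothetical helper: send all of buf, sleeping in epoll_wait whenever the
 * send buffer fills up. Assumes epfd watches nothing but fd.
 * Returns 0 on success, -1 on error. */
static int send_all(int epfd, int fd, const char *buf, size_t len)
{
    struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fd };
    struct epoll_event out;
    size_t off = 0;
    int registered = 0, rc = 0;

    while (off < len) {
        ssize_t n = write_until_full(fd, buf + off, len - off);
        if (n < 0) { rc = -1; break; }
        off += (size_t)n;
        if (off == len)
            break;
        /* Register for EPOLLOUT only once a write has hit a full buffer. */
        if (!registered) {
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) { rc = -1; break; }
            registered = 1;
        }
        int r;
        do {
            r = epoll_wait(epfd, &out, 1, -1);
        } while (r < 0 && errno == EINTR);
        if (r < 0 || (out.events & (EPOLLERR | EPOLLHUP))) { rc = -1; break; }
    }
    if (registered)
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
    return rc;
}
```

Since the wakeup follows an ACK that frees buffer space, the time spent inside epoll_wait here depends on the peer and the network, not only on the size of the local send buffer.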

The function that performs the actual notification is sk_write_space. This is a member of struct sock and for TCP the relevant implementation is sk_stream_write_space in stream.c.

    ...
    if (skwq_has_sleeper(wq))
        wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
                    EPOLLWRNORM | EPOLLWRBAND);
    if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
        sock_wake_async(wq, SOCK_WAKE_SPACE, POLL_OUT);
    ...

This function wakes up any callers that might be waiting for memory. (Compare this with sock_def_write_space.)

But when is sk_write_space called? There are a few call sites, but the most prominent is tcp_new_space, which is called by tcp_check_space, which is called by tcp_data_snd_check, which in turn is called from a bunch of places on the receive path. tcp_new_space has a descriptive comment:

 When incoming ACK allowed to free some skb from write_queue,
 we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket
 on the exit from tcp input handler.

tcp_check_space is interesting:

    if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
        sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
        /* pairs with tcp_poll() */
        smp_mb();
        if (sk->sk_socket &&
            test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
            tcp_new_space(sk);
            ...
        }

Some relevant bits here:

  1. SOCK_QUEUE_SHRUNK is defined as "write queue has been shrunk recently". Going by the comment quoted above, it is set when an incoming ACK frees an skb from the write queue. tcp_check_space checks and clears it.
  2. SOCK_NOSPACE is set on the transmit path when we run out of buffer space.

The conclusion from all this is that tcp_check_space avoids sending events unless the socket was out of space.
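To make the gating concrete, here is a toy user-space model of the two flags. It is not kernel code: the struct, its field names and the toy_* functions are invented for illustration, and clearing the "nospace" flag at wakeup time is a simplifying assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; these are not kernel structures. */
struct toy_sock {
    bool queue_shrunk;  /* stands in for SOCK_QUEUE_SHRUNK */
    bool nospace;       /* stands in for SOCK_NOSPACE */
    int  wakeups;       /* times the model would have notified writers */
};

/* A writer found the send buffer full. */
static void toy_writer_hit_full_buffer(struct toy_sock *sk)
{
    sk->nospace = true;
}

/* An incoming ACK freed data from the write queue. */
static void toy_ack_freed_data(struct toy_sock *sk)
{
    sk->queue_shrunk = true;
}

/* Mirrors the shape of the tcp_check_space excerpt: notify only if the
 * queue shrank AND somebody previously ran out of space. */
static void toy_check_space(struct toy_sock *sk)
{
    assert(sk != NULL);
    if (!sk->queue_shrunk)
        return;
    sk->queue_shrunk = false;
    if (sk->nospace) {
        sk->nospace = false;    /* simplification made by this model */
        sk->wakeups++;
    }
}
```

In this model an ACK that arrives while no writer has hit a full buffer produces no wakeup at all, which matches the conclusion above.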

What about tcp_data_snd_check? During the steady state the most relevant calls are in tcp_rcv_established:

  1. The fast-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5575

  2. The almost-fast path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5618

  3. The slow-path: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_input.c#L5658

All of these signal that data was successfully ACKed.


There are other callers of sk_write_space in TCP. do_tcp_sendpages and tcp_sendmsg_locked call it on error paths to make sure callers are woken up. do_tcp_setsockopt calls it when setting TCP_NOTSENT_LOWAT.
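For completeness, TCP_NOTSENT_LOWAT is set with an ordinary setsockopt call. This is a minimal sketch, assuming a kernel and libc that know the option. The helper names are hypothetical, and the fallback #define is there only in case older libc headers lack the constant (25 is the value in linux/tcp.h as far as I know).

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25    /* assumed value, taken from linux/tcp.h */
#endif

/* Hypothetical helper: set the unsent-bytes low-water mark on a TCP socket.
 * Returns 0 on success, -1 with errno set if the kernel rejects it. */
static int set_notsent_lowat(int fd, int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
}

/* Hypothetical helper: read the current value back. Returns 0 or -1. */
static int get_notsent_lowat(int fd, int *bytes)
{
    socklen_t len = sizeof(*bytes);

    assert(bytes != NULL);
    return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

As I understand tcp(7), with this option set the socket is only reported writable while the amount of unsent data in the write queue is below the threshold, so it is worth checking the return value on older kernels that may not support it.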

cnicutar answered Oct 02 '22 07:10