I have been profiling TCP latency (in particular, the write from user space to kernel space of a small message) in order to get some intuition for the latency of a write (acknowledging that this can be context-specific). I have noticed substantial inconsistency between tests that seem similar to me, and I'm very curious to figure out where the difference comes from. I understand that microbenchmarks can be problematic, but I still feel like I am missing some fundamental understanding, since the latency differences are roughly 10x.
The setup is that I have a C++ TCP server that accepts one client connection (from another process on the same CPU) and, upon connecting with the client, makes 20 system calls to write to the socket, sending one byte at a time. The full code of the server is copied at the end of this post. Here's the output, which times each write using boost/timer (which itself adds noise of ~1 mic):
$ clang++ -std=c++11 -stdlib=libc++ tcpServerStove.cpp -O3; ./a.out
18 mics
3 mics
3 mics
4 mics
3 mics
3 mics
4 mics
3 mics
5 mics
3 mics
...
I reliably find that the first write is significantly slower than the others. If I wrap 10,000 write calls in a timer, the average is 2 microseconds per write, yet the first call is always 15+ mics. Why is there this "warming up" phenomenon?
Relatedly, I ran an experiment where in between each write call I do some blocking CPU work (calculating a large prime number). This causes all the write calls to be slow:
$ clang++ -std=c++11 -stdlib=libc++ tcpServerStove.cpp -O3; ./a.out
20 mics
23 mics
23 mics
30 mics
23 mics
21 mics
21 mics
22 mics
22 mics
...
Given these results, I'm wondering if there is some kind of batching that happens during the process of copying bytes from the user buffer to the kernel buffer. If multiple write calls happen in quick succession, do they get coalesced into a single copy into the kernel?
In particular, I am looking for some notion of how long write takes to copy a buffer from user space to kernel space. If there is some coalescing effect that allows the average write to take only 2 mics when I do 10,000 in succession, then it would be unfairly optimistic to conclude that the write latency is 2 mics; it seems my intuition should be that each write takes 20 microseconds. This seems surprisingly slow for the lowest latency you can get without kernel bypass (a raw write call on one byte).
A final piece of data: when I set up a ping-pong test between two processes on my computer (a TCP server and a TCP client), I average 6 mics per round trip (which includes a read, a write, and moving through the localhost network). This seems at odds with the 20 mic latencies for a single write seen above.
Full code for the TCP server:
// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>
// Set up some blocking work.
bool isPrime(int n) {
if (n < 2) {
return false;
}
for (int i = 2; i < n; i++) {
if (n % i == 0) {
return false;
}
}
return true;
}
// Compute the nth largest prime. Takes ~1 sec for n = 10,000
int getPrime(int n) {
int numPrimes = 0;
int i = 0;
while (true) {
if (isPrime(i)) {
numPrimes++;
if (numPrimes >= n) {
return i;
}
}
i++;
}
}
int main(int argc, char const *argv[])
{
int server_fd, new_socket, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
// Create socket for TCP server
server_fd = socket(AF_INET, SOCK_STREAM, 0);
// Try to prevent writes from being batched. TCP_NODELAY and TCP_NOPUSH
// are IPPROTO_TCP-level options; SO_SNDBUF and SO_SNDLOWAT are
// SOL_SOCKET-level.
setsockopt(server_fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
setsockopt(server_fd, IPPROTO_TCP, TCP_NOPUSH, &opt, sizeof(opt));
setsockopt(server_fd, SOL_SOCKET, SO_SNDBUF, &opt, sizeof(opt));
setsockopt(server_fd, SOL_SOCKET, SO_SNDLOWAT, &opt, sizeof(opt));
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
bind(server_fd, (struct sockaddr *)&address, sizeof(address));
listen(server_fd, 3);
// Accept one client connection
new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);
char sendBuffer[1] = {0};
int primes[20] = {0};
// Make 20 sequential writes to kernel buffer.
for (int i = 0; i < 20; i++) {
sendBuffer[0] = i;
boost::timer t;
write(new_socket, sendBuffer, 1);
printf("%d mics\n", int(1e6 * t.elapsed()));
// Doing blocking work between the writes slows every write down by a
// factor of ~10 (uncomment the next line to reproduce):
// primes[i] = getPrime(10000 + i);
}
// Print a prime to make sure the compiler doesn't optimize
// away the computations.
printf("prime: %d\n", primes[8]);
}
TCP client code:
// Client side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char const *argv[])
{
int sock, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
// We'll be passing uint32's back and forth
unsigned char recv_buffer[1024] = {0};
// Create socket for the TCP client
sock = socket(AF_INET, SOCK_STREAM, 0);
// TCP_NODELAY is an IPPROTO_TCP-level option
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
// Connect to the server
if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
throw("connect failed");
}
read(sock, recv_buffer, sizeof(recv_buffer));
for (int i = 0; i < 10; i++) {
printf("%d\n", recv_buffer[i]);
}
}
I tried with and without the flags TCP_NODELAY, TCP_NOPUSH, SO_SNDBUF, and SO_SNDLOWAT, with the idea that these might prevent batching (but my understanding is that this batching occurs between the kernel buffer and the network, not between the user buffer and the kernel buffer).
Here is server code for the ping pong test:
// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>
__inline__ uint64_t rdtsc(void)
{
uint32_t lo, hi;
__asm__ __volatile__ (
"xorl %%eax,%%eax \n cpuid"
::: "%rax", "%rbx", "%rcx", "%rdx");
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
// Big Endian (network order)
unsigned int fromBytes(unsigned char b[4]) {
return b[3] | b[2]<<8 | b[1]<<16 | b[0]<<24;
}
void toBytes(unsigned int x, unsigned char (&b)[4]) {
b[3] = x;
b[2] = x>>8;
b[1] = x>>16;
b[0] = x>>24;
}
int main(int argc, char const *argv[])
{
int server_fd, new_socket, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
unsigned char recv_buffer[4] = {0};
unsigned char send_buffer[4] = {0};
// Create socket for TCP server
server_fd = socket(AF_INET, SOCK_STREAM, 0);
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
bind(server_fd, (struct sockaddr *)&address, sizeof(address));
listen(server_fd, 3);
// Accept one client connection
new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);
printf("Connected with client!\n");
int counter = 0;
unsigned int x = 0;
auto start = rdtsc();
boost::timer t;
int n = 10000;
while (counter < n) {
valread = read(new_socket, recv_buffer, 4);
x = fromBytes(recv_buffer);
toBytes(x+1, send_buffer);
write(new_socket, send_buffer, 4);
++counter;
}
printf("%f clock cycles per round trip (rdtsc)\n", (rdtsc() - start) / double(n));
printf("%f mics per round trip (boost timer)\n", 1e6 * t.elapsed() / n);
}
Here is client code for the ping pong test:
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <boost/timer.hpp>
#include <unistd.h>
// Big Endian (network order)
unsigned int fromBytes(unsigned char b[4]) {
return b[3] | b[2]<<8 | b[1]<<16 | b[0]<<24;
}
void toBytes(unsigned int x, unsigned char (&b)[4]) {
b[3] = x;
b[2] = x>>8;
b[1] = x>>16;
b[0] = x>>24;
}
int main(int argc, char const *argv[])
{
int sock, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
// We'll be passing uint32's back and forth
unsigned char recv_buffer[4] = {0};
unsigned char send_buffer[4] = {0};
// Create socket for the TCP client
sock = socket(AF_INET, SOCK_STREAM, 0);
// Set TCP_NODELAY (an IPPROTO_TCP-level option) so writes won't be delayed
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
// Connect to the server
if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
throw("connect failed");
}
unsigned int lastReceived = 0;
while (true) {
toBytes(++lastReceived, send_buffer);
write(sock, send_buffer, 4);
valread = read(sock, recv_buffer, 4);
lastReceived = fromBytes(recv_buffer);
}
}
(Not quite an answer, but needed a bit more room than a comment...)
This does sound like Nagle's algorithm, or a variant of it, controlling when TCP packets are actually sent.
For the first write, when there is no unconfirmed data in the 'pipe', it will be sent immediately, which takes a moment. For subsequent writes soon after that, there will still be unconfirmed data in the pipe, so a small amount of data can be queued in the send buffer, which is quicker.
After a break in transmissions, when all the sends have had a chance to catch up, the pipe will be ready to send immediately again.
You can confirm this using something like Wireshark to look at the actual TCP packets - this will show how the write() requests are being grouped together.
To be fair, I would expect the TCP_NODELAY flag to bypass this, though - resulting in a more even spread of timings, as you say. If you can check the TCP packets, it would also be worth looking at whether they show the PSH flag set, which forces an immediate send.
There are a few issues going on here.
To get closer to the answer, you need to have your client side do two things: 1. receive all the data. 2. keep track of how big each read was. I did this by:
int loc[N+1];
int nloc, curloc;
for (nloc = curloc = 0; curloc < N; nloc++) {
int n = read(sock, recv_buffer + curloc, sizeof recv_buffer-curloc);
if (n <= 0) {
break;
}
curloc += n;
loc[nloc] = curloc;
}
int last = 0;
for (int i = 0; i < nloc; i++) {
printf("%*.*s ", loc[i] - last, loc[i] - last, recv_buffer + last);
last = loc[i];
}
printf("\n");
and defining N to 20 (sorry, upbringing), and changing your server to write a-z one byte at a time. Now, when this prints out something like:
a b c d e f g h i j k l m n o p q r s
we know the server is sending 1 byte packets; however when it prints something like:
a bcde fghi jklm nop qrs
we suspect the server is sending mainly 4 byte packets.
The root problem is that TCP_NODELAY doesn't do what you suspect. Nagle's algorithm accumulates output while there is an unacknowledged sent packet; TCP_NODELAY controls whether this accumulation is applied.
Regardless of TCP_NODELAY, you are still using a STREAM socket, which means that N writes can be combined into one. The socket is feeding the device, but simultaneously you are feeding the socket. Once a packet [mbuf, skbuff, ...] has been committed to the device, the socket needs to create a new packet on the next write(). As soon as the device is ready for a new packet, the socket can provide it, but until then the packet serves as a buffer. In buffering mode, the write is very fast, since all the necessary data structures are already available [as mentioned in comments and other answers].
You can control this buffering by adjusting the SO_SNDBUF and SO_SNDLOWAT socket options. Note, however, that the socket returned by accept does not inherit the buffer sizes of the listening socket. By reducing SO_SNDBUF to 1, you can force each one-byte write to be committed toward the device individually, which flattens the timings.
The output below:
abcdefghijklmnopqrst
a bcdefgh ijkl mno pqrst
a b cdefg hij klm nop qrst
a b c d e f g h i j k l m n o p q r s t
corresponds to starting at the defaults, then successively adding TCP_NODELAY, TCP_NOPUSH, SO_SNDBUF (=1), and SO_SNDLOWAT (=1) on the server side on subsequent connections. Each iteration has a flatter time delta than the previous one.
Your mileage will likely vary; this was on macOS 10.12, and I changed your programs to plain C, timed with rdtsc(), because I have trust issues.
/* srv.c */
// Server side C/C++ program to demonstrate Socket programming
// #include <iostream>
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <stdbool.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
#ifndef N
#define N 20
#endif
int nap = 0;
int step = 0;
extern long rdtsc(void);
void xerror(char *f) {
perror(f);
exit(1);
}
#define Z(x) if ((x) == -1) { xerror(#x); }
// TCP_* options live at the IPPROTO_TCP level; SO_* options at SOL_SOCKET.
int level_of(int opt) {
return (opt == TCP_NODELAY || opt == TCP_NOPUSH) ? IPPROTO_TCP : SOL_SOCKET;
}
void sopt(int fd, int opt, int val) {
Z(setsockopt(fd, level_of(opt), opt, &val, sizeof(val)));
}
int gopt(int fd, int opt) {
int val;
socklen_t r = sizeof(val);
Z(getsockopt(fd, level_of(opt), opt, &val, &r));
return val;
}
#define POPT(fd, x) printf("%s %d ", #x, gopt(fd, x))
void popts(char *tag, int fd) {
printf("%s: ", tag);
POPT(fd, SO_SNDBUF);
POPT(fd, SO_SNDLOWAT);
POPT(fd, TCP_NODELAY);
POPT(fd, TCP_NOPUSH);
printf("\n");
}
void stepsock(int fd) {
switch (step++) {
case 7:
step = 2;
case 6:
sopt(fd, SO_SNDLOWAT, 1);
case 5:
sopt(fd, SO_SNDBUF, 1);
case 4:
sopt(fd, TCP_NOPUSH, 1);
case 3:
sopt(fd, TCP_NODELAY, 1);
case 2:
break;
}
}
int main(int argc, char const *argv[])
{
int server_fd, new_socket, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
// Create socket for TCP server
server_fd = socket(AF_INET, SOCK_STREAM, 0);
popts("original", server_fd);
// Parse flags: -s steps through option combinations, -n sleeps between
// writes, -o sets individual socket options.
while ((opt = getopt(argc, argv, "sn:o:")) != -1) {
switch (opt) {
case 's': step = ! step; break;
case 'n': nap = strtol(optarg, NULL, 0); break;
case 'o':
for (int i = 0; optarg[i]; i++) {
switch (optarg[i]) {
case 't': sopt(server_fd, TCP_NODELAY, 1); break;
case 'p': sopt(server_fd, TCP_NOPUSH, 1); break;
case 's': sopt(server_fd, SO_SNDBUF, 1); break;
case 'l': sopt(server_fd, SO_SNDLOWAT, 1); break;
default:
exit(1);
}
}
}
}
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) == -1) {
xerror("bind");
}
popts("ready", server_fd);
while (1) {
if (listen(server_fd, 3) == -1) {
xerror("listen");
}
// Accept one client connection
new_socket = accept(server_fd, (struct sockaddr *)&address, (socklen_t*)&addrlen);
if (new_socket == -1) {
xerror("accept");
}
popts("accepted: ", new_socket);
sopt(new_socket, SO_SNDBUF, gopt(server_fd, SO_SNDBUF));
sopt(new_socket, SO_SNDLOWAT, gopt(server_fd, SO_SNDLOWAT));
if (step) {
stepsock(new_socket);
}
long tick[N+1];
tick[0] = rdtsc();
// Make N sequential writes to kernel buffer.
for (int i = 0; i < N; i++) {
char ch = 'a' + i;
write(new_socket, &ch, 1);
tick[i+1] = rdtsc();
// Optionally sleep between the writes; pausing lets the sends drain
// and changes how the one-byte writes are grouped into packets.
if (nap) {
sleep(nap);
}
}
for (int i = 1; i < N+1; i++) {
printf("%ld\n", tick[i] - tick[i-1]);
}
printf("_\n");
close(new_socket);
}
}
clnt.c:
#include <stdio.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <unistd.h>
#ifndef N
#define N 20
#endif
int nap = 0;
int main(int argc, char const *argv[])
{
int sock, valread;
struct sockaddr_in address;
int opt = 1;
int addrlen = sizeof(address);
// We'll be passing uint32's back and forth
unsigned char recv_buffer[1024] = {0};
// Create socket for the TCP client
sock = socket(AF_INET, SOCK_STREAM, 0);
// Set TCP_NODELAY (an IPPROTO_TCP-level option) so writes won't be delayed
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt));
while ((opt = getopt(argc,argv,"n:")) != -1) {
switch (opt) {
case 'n': nap = strtol(optarg, NULL, 0); break;
default:
exit(1);
}
}
opt = 1;
address.sin_family = AF_INET;
address.sin_addr.s_addr = INADDR_ANY;
address.sin_port = htons(8080);
// Connect to the server
if (connect(sock, (struct sockaddr *)&address, (socklen_t)addrlen) != 0) {
perror("connect failed");
exit(1);
}
if (nap) {
sleep(nap);
}
int loc[N+1];
int nloc, curloc;
for (nloc = curloc = 0; curloc < N; nloc++) {
int n = read(sock, recv_buffer + curloc, sizeof recv_buffer-curloc);
if (n <= 0) {
perror("read");
break;
}
curloc += n;
loc[nloc] = curloc;
}
int last = 0;
for (int i = 0; i < nloc; i++) {
int t = loc[i] - last;
printf("%*.*s ", t, t, recv_buffer + last);
last = loc[i];
}
printf("\n");
return 0;
}
rdtsc.s:
.globl _rdtsc
_rdtsc:
rdtsc
shl $32, %rdx
or %rdx,%rax
ret
(not sure if this may help, but I don't have enough reputation to post a comment)
Microbenchmarking is tricky, especially with OS calls - in my experience, a few factors have to be considered and filtered out or measured before taking numbers conclusively.
Some of these factors are:
1. cache hits/misses
2. multitasking preemption
3. the OS allocating memory at certain moments of API calls (memory allocation can easily lead to microseconds of delay)
4. lazy loading (some APIs may not do much during the connect call, e.g., until real data comes in)
5. the actual clock speed of the CPU at the moment (dynamic clock scaling happens all the time)
6. recently executed instructions on this or adjacent cores (e.g., heavy AVX-512 instructions may switch the CPU into L2 (license 2) mode, which slows down the clock to avoid overheating)
7. with virtualization, whatever else may be running on the same physical CPU.
You can try to mitigate the influence of factors 1, 2, 6, and 7 by running the same command repeatedly in a loop. But in your case, it may mean that you need to open several sockets at once and measure the first write to each of them in a loop. That way the caches involved in entering the kernel will be preheated by the first call, and further calls will have "cleaner" times that you can average.
To help with factor 5, you can try to "preheat" the CPU clock - run a long blocking loop right before your test and inside your test loop, but don't do anything fancy in that loop, to avoid overheating - the safest is to call __asm("nop") inside that loop.
At first, I didn't notice that you send only 1 byte, and thought that this might be due to TCP slow start. But your second test with the prime numbers doesn't support this either. So it sounds more like factors 1, 5, or 6 from my list.