I am trying to understand how the receiver window affects throughput over a high-latency connection.
I have a simple client-server pair of apps on two machines, far apart, with a 250 ms RTT between them. I ran this test with both Windows (XP, 7) and Linux (Ubuntu 10.x), with the same results, so for simplicity let's assume this case:

Client receiving data: WinXP Pro
Server sending data: Win7 Pro

Again, the latency is 250 ms RTT.
I run my TCP test without changing the receive buffer size on the client (the default is 8 KB), and this is what I see on the wire (using Wireshark):
Looking at the trace, I see bursts of 3-4 packets (each with a 1460-byte payload) immediately followed by the ACK sent from the client machine to the server, then nothing for approximately 250 ms, then a new burst of packets from the server to the client.
So, in conclusion, it appears that the server stops sending new data well before it has filled up the receiver's window.
To do more tests, I ran the same test again, this time changing the receive buffer size on the client machine (on Windows, changing the receive buffer size ends up affecting the RWIN advertised by the machine). I would expect to see larger bursts of packets before blocking for an ACK... and at least a higher throughput.
In this case I set the receive buffer size to 100,000,000. The packets from the client to the server now carry RWIN=99,999,744 (well, that's nice), but unfortunately the pattern of the data sent FROM the server to the client is still the same: a short burst followed by a long wait. To confirm what I see on the wire, I also measured the time needed to send a chunk of data from the server to the client; I see NO difference between using the large RWIN and using the default.
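For reference, I set the buffer with a plain setsockopt() call before connecting. A simplified sketch of the client setup (error handling and the filling-in of server_addr omitted):

    #include <netinet/in.h>
    #include <sys/socket.h>   /* on Windows: winsock2.h, same call shape */

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int rcvbuf = 100000000;   /* the 100,000,000 bytes mentioned above */
    /* set before connect(); this is what ends up advertised as the RWIN */
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (const char *)&rcvbuf, sizeof(rcvbuf));
    connect(sock, (const struct sockaddr *)&server_addr, sizeof(server_addr));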
Can anybody help me understand why changing the RWIN doesn't really affect the throughput?
A few notes:

- The server sends data as fast as possible, using write() in chunks of 8 KB (a simplified sketch of the send loop is below).
- As I said before, I see similar effects using Linux as well: changing the receive buffer size affects the RWIN used by a node, but the throughput remains the same.
- I analyze the trace after several hundred packets, to give the TCP slow start mechanism enough time to enlarge the CWIN.
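The send loop itself is nothing special; a simplified sketch (the real code also checks for errors and partial writes):

    #include <string.h>
    #include <unistd.h>

    void send_forever(int sock)
    {
        char buf[8192];                 /* one 8 KB chunk */
        memset(buf, 'x', sizeof(buf));  /* payload content doesn't matter */
        for (;;) {
            /* blocking write(): returns once the chunk has been copied into
               the socket's send buffer, blocking while that buffer is full */
            if (write(sock, buf, sizeof(buf)) < 0)
                break;
        }
    }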
As suggested, I'm adding a small snapshot of the wire trace here:
No. Time Source Destination Protocol Length Info
21 2.005080 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=11681 Win=99999744 Len=0
22 2.005109 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=19305 Ack=1 Win=65536 Len=1460
23 2.005116 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=20765 Ack=1 Win=65536 Len=1460
24 2.005121 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=22225 Ack=1 Win=65536 Len=1460
25 2.005128 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 946 21500 > 57353 [PSH, ACK] Seq=23685 Ack=1 Win=65536 Len=892
26 2.005154 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=14601 Win=99999744 Len=0
27 2.007106 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=16385 Win=99999744 Len=0
28 2.007398 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=24577 Ack=1 Win=65536 Len=1460
29 2.007401 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=26037 Ack=1 Win=65536 Len=1460
30 2.007403 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=27497 Ack=1 Win=65536 Len=1460
31 2.007404 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=28957 Ack=1 Win=65536 Len=1460
32 2.007406 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=30417 Ack=1 Win=65536 Len=1460
33 2.007408 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 946 21500 > 57353 [PSH, ACK] Seq=31877 Ack=1 Win=65536 Len=892
34 2.007883 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=19305 Win=99999744 Len=0
35 2.257143 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=22225 Win=99999744 Len=0
36 2.257160 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=24577 Win=99999744 Len=0
37 2.257358 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=32769 Ack=1 Win=65536 Len=1460
38 2.257362 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=34229 Ack=1 Win=65536 Len=1460
39 2.257364 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=35689 Ack=1 Win=65536 Len=1460
40 2.257365 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=37149 Ack=1 Win=65536 Len=1460
As you can see, the server stops sending data at packet #33.
The client sends an ACK at packet #34 for old data (Ack=19305, acknowledging data sent in packet #20, not shown here). With an RWIN of ~100 MB, I would expect the server NOT to block at all.
After 20-30 packets, the congestion window on the server side should be large enough to send more packets than I see. I assume the congestion window eventually grows up to the RWIN... but still, even after hundreds of packets, the pattern is the same: a burst of data, then a block of 250 ms...
I can guess two things from the sample you have provided:

1. For the window of a TCP connection to scale to a certain size, both the send buffer on the sender and the receive buffer on the receiver must be big enough.
2. The actual window used is the minimum of the receive window offered by the receiver and the send buffer size configured on the sender.

Long story short: you need to configure the send buffer size on the server.
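A minimal sketch of that fix (POSIX-style; WinSock has the same setsockopt() call, and srv_sock here is a placeholder for your server's socket):

    #include <sys/socket.h>

    int sndbuf = 4 * 1024 * 1024;  /* example size: aim for at least bandwidth x RTT */
    /* a larger send buffer lets the stack keep more unACKed data in flight */
    setsockopt(srv_sock, SOL_SOCKET, SO_SNDBUF, (const char *)&sndbuf, sizeof(sndbuf));

With the default send buffer, the stack simply has nowhere to keep more than a few segments of unacknowledged data, which is what produces the stalls analysed below.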
To clear things up, let's analyse your sample packet by packet.
The server sends another bunch of data:
22 2.005109 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=19305 Ack=1 Win=65536 Len=1460
23 2.005116 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=20765 Ack=1 Win=65536 Len=1460
24 2.005121 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=22225 Ack=1 Win=65536 Len=1460
25 2.005128 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 946 21500 > 57353 [PSH, ACK] Seq=23685 Ack=1 Win=65536 Len=892
Notice the PSH. That's a flag telling the receiving TCP stack that a complete chunk of data has been sent, and that it should be pushed up to the application without waiting for more. (A "complete" chunk being your 8 KB block in this case.)
While the server is still sending, it gets 2 ACKs:
26 2.005154 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=14601 Win=99999744 Len=0
27 2.007106 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=16385 Win=99999744 Len=0
Note in particular the numbers Ack=14601 and Ack=16385. These are acknowledgement numbers: Ack=14601 means "I have received everything up to sequence number 14601". For instance, the later Ack=24577 (packet #36) covers everything through packet #25, whose payload ends at 23685 + 892 = 24577. Note also that these two ACKs are for older data, not for packets in the sample above.
So the server processes those ACKs and continues sending data:
28 2.007398 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=24577 Ack=1 Win=65536 Len=1460
29 2.007401 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=26037 Ack=1 Win=65536 Len=1460
30 2.007403 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=27497 Ack=1 Win=65536 Len=1460
31 2.007404 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=28957 Ack=1 Win=65536 Len=1460
32 2.007406 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=30417 Ack=1 Win=65536 Len=1460
33 2.007408 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 946 21500 > 57353 [PSH, ACK] Seq=31877 Ack=1 Win=65536 Len=892
Here we have a complete block of data: 1460*5+892 == 8192.
Then, 0.475 ms after sending that last packet (2.007883 - 2.007408), the server gets one more ACK:
34 2.007883 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=19305 Win=99999744 Len=0
And then there is a delay of almost exactly 250 ms, during which the server sends nothing, before it receives these:
35 2.257143 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=22225 Win=99999744 Len=0
36 2.257160 CCC.CCC.CCC.CCC sss.sss.sss.sss TCP 60 57353 > 21500 [ACK] Seq=1 Ack=24577 Win=99999744 Len=0
And then continues sending:
37 2.257358 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=32769 Ack=1 Win=65536 Len=1460
38 2.257362 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=34229 Ack=1 Win=65536 Len=1460
39 2.257364 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=35689 Ack=1 Win=65536 Len=1460
40 2.257365 sss.sss.sss.sss CCC.CCC.CCC.CCC TCP 1514 21500 > 57353 [ACK] Seq=37149 Ack=1 Win=65536 Len=1460
There are two very interesting things to notice here.
First, how many bytes the server sent without waiting for an ACK. The last ACK the server received before the pause was Ack=19305, and the last data it had sent by then ends at sequence number 32769 (packet #33: Seq=31877 plus Len=892; packet #37 indeed resumes at Seq=32769). So during that pause there are 32769 - 19305 = 13464 bytes that the server has sent but that have not yet been ACKed by the client.
Second, the server received one ACK (packet #34) an instant after it sent that bunch of data, and yet that ACK didn't trigger it to send more. It's as if that ACK wasn't good enough.
The ACK received before that one was Ack=16385, so after finishing the 8 KB block the server had 32769 - 16385 = 16384 bytes outstanding: exactly 16 KB. Ack=19305 reduced that to 13464, which still leaves less than 8 KB of room. Only after receiving the ACKs up to sequence number 24577, bringing the outstanding count down to 32769 - 24577 = 8192, did the server start sending again. So the effective window here is 16 KB, and the fact that the 8 KB write size is large compared to that window reduces throughput further, because the server will not send any part of an 8 KB block until there is room for all of it.
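To put a number on the ceiling this imposes (my arithmetic, using the figures above): a TCP connection can carry at most one window of data per round trip, so throughput <= window / RTT = 16384 bytes / 0.25 s = 65536 bytes/s, roughly 64 KB/s, no matter how large the client's advertised RWIN is. That is why your timing measurements do not change.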
Lastly, for those who are wondering: there is a TCP option called window scaling which allows one end of a connection to declare that the window size is actually some multiple of the value in the TCP header (see RFC 1323). The option is carried in the SYN packets, so it isn't visible mid-connection; the only hint that window scaling is in effect is that the window size field in the TCP header is smaller than the window actually being used.
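As a worked example (the shift count here is a guess, since it is only visible in the SYN exchange): the client's Win=99999744 above cannot fit in the 16-bit header field directly, but with a scale shift of 13 the header would carry 12207, because 12207 * 2^13 = 99,999,744.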
You can't set a receive buffer size of >= 64 KB once the socket is connected; you have to do it first. In the case of a server that means setting the receive buffer size on the listening socket: accepted sockets inherit it from the socket they are accepted from. If you don't do this, the TCP window scaling option cannot be negotiated, so the peers have no way of telling each other about a window size over 64 KB.
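A minimal sketch of the right order on the server side (POSIX-style; error handling and the filling-in of addr omitted):

    #include <netinet/in.h>
    #include <sys/socket.h>

    int ls = socket(AF_INET, SOCK_STREAM, 0);
    int rcvbuf = 256 * 1024;   /* any size >= 64 KB has to be set here... */
    /* ...on the LISTENING socket, before listen() and accept() */
    setsockopt(ls, SOL_SOCKET, SO_RCVBUF, (const char *)&rcvbuf, sizeof(rcvbuf));
    bind(ls, (const struct sockaddr *)&addr, sizeof(addr));
    listen(ls, 16);
    int conn = accept(ls, NULL, NULL);  /* conn inherits the 256 KB buffer, and
                                           window scaling can be offered in the SYN-ACK */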