October 29, 2009

Nagle's algorithm and TCP throughput

Anyone who talks about TCP throughput unfortunately cannot avoid the congestion problem that often occurs on TCP connections.

There are many TCP flow and congestion control mechanisms, from sliding windows to fast recovery. In this post I will focus only on Nagle’s algorithm and how applications can be tuned either to make use of the delay it induces or to minimize latency for the sake of real-time applications.

Nagle’s algorithm

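Nagle’s algorithm was developed to prevent congestion caused by tinygrams (small packets), that is to say, packets in which the IP and TCP headers take up considerably more space than the data they carry. The largest amount of data a single segment can carry is the maximum segment size (MSS).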


MSS = MTU - 40 bytes (20 bytes IP header and 20 bytes TCP header)

The problem is that an application which generates only small amounts of data at a time (small writes), such as a remote login or an X server session, would send a packet with 40 bytes of headers (IP/TCP) plus a few bytes of data for every write. This overhead, together with the sheer number of packets generated, would start clogging the link, especially on links with limited bandwidth.

If I connect remotely to an X server and move the mouse, each movement generates very little data, and therefore a large number of small packets.

Nagle’s algorithm delays the sending of tinygrams by buffering them until an ACK has been received for the outstanding data sent earlier.
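To put a number on that overhead, here is a minimal sketch in Python. It assumes plain IPv4 and TCP headers of 20 bytes each with no options; the function name header_overhead is only illustrative.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def header_overhead(payload_bytes: int) -> float:
    """Return the fraction of a packet that is IP/TCP header rather than data."""
    headers = IP_HEADER + TCP_HEADER
    return headers / (headers + payload_bytes)


# Compare a few tinygram sizes against a full 1460-byte segment.
for payload in (1, 10, 100, 1460):
    print(f"{payload:>5} byte payload: {header_overhead(payload):.1%} headers")
```

A 1-byte write ends up as a 41-byte packet that is almost entirely header, whereas a full segment spends under 3% of its bytes on headers.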

The algorithm is laid out as follows:

    if there is new data to send
        if the window size >= MSS and available data is >= MSS
            send complete MSS segment now
        else
            if there is unconfirmed data still in the pipe
                enqueue data in the buffer until an acknowledge is received
            else
                send data immediately
            end if
        end if
    end if
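The same decision can be sketched as a small Python function. This is only an illustration of the pseudocode above, not how any real TCP stack is written; the function name, parameters and return strings are all made up for the example.

```python
def nagle_decide(window_size: int, pending_bytes: int, mss: int,
                 unacked_in_flight: bool) -> str:
    """Decide what a sender following Nagle's algorithm does with pending data."""
    if pending_bytes <= 0:
        return "idle"                # no new data to send
    if window_size >= mss and pending_bytes >= mss:
        return "send_full_segment"   # a complete MSS segment can go out now
    if unacked_in_flight:
        return "buffer"              # hold the small write until an ACK arrives
    return "send_now"                # nothing outstanding: send immediately


# A 10-byte write while earlier data is still unacknowledged gets buffered.
print(nagle_decide(window_size=65535, pending_bytes=10, mss=1460,
                   unacked_in_flight=True))
```

Note that the first small write on an idle connection goes out immediately; only the following small writes are held back while data is outstanding.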

Delayed ACK

Delayed ACK simply means that the receiver does not have to acknowledge each segment as soon as it arrives. Without it, the receiver would send an ACK for every segment and then, at some point later on (once its TCP buffer is full), an ACK advertising a window of 0 followed by a window update. By delaying the ACK, the receiver can acknowledge the data and inform the sender of the window size in a single segment.

It is important to note that the ACK delay is bounded: RFC 1122 requires it to stay below 0.5 seconds, and implementations commonly use a timer of up to about 0.2 seconds (200 ms). A longer ACK delay inflates the measured round-trip time, just as having no ACK delay at all puts a large number of extra ACK packets on the network.

Small Scenario

If I were to send 88000 bytes to a remote host, I would technically be sending 88000 / 1460 ≈ 60 full packets + 400 bytes; this is of course excluding the TCP/IP headers.
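The arithmetic is easy to check with a few lines of Python. The 1460-byte MSS is the value used in this scenario (a 1500-byte Ethernet MTU minus 40 bytes of headers); the function name split_into_segments is just for illustration.

```python
def split_into_segments(total_bytes: int, mss: int) -> tuple[int, int]:
    """Return (number of full MSS-sized segments, bytes left for a last short one)."""
    return divmod(total_bytes, mss)


full, remainder = split_into_segments(88000, 1460)
print(full, remainder)  # 60 400
```

So the transfer consists of 60 full segments followed by one short segment of 400 bytes.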

With delayed ACK, an ACK is sent for every second packet received. So with Nagle’s algorithm, once the ACK for the 60th packet comes back, the 61st packet (400 bytes) is sent. Now imagine we had a 62nd packet. The receiver would still be waiting for that 62nd packet before sending its ACK… Nagle on the other side would not send the 62nd packet until it receives the ACK for the 61st…

As you can imagine, both sides then stall until the delayed ACK timeout kicks in. It is not a permanent deadlock, since the timer eventually fires, but the degradation is noticeable, especially for real-time applications.

The issue

Now what happens in such a case if I use a remote X server and move the mouse? Each movement is a small write, so Nagle holds it back until either a full segment has accumulated or the previous packet has been acknowledged, and because of delayed ACK that acknowledgement can take up to 200 ms to arrive… do the math, the lag will be highly noticeable :-)

To disable Nagle’s algorithm, set the TCP_NODELAY option on your application’s socket.
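As a minimal sketch of what that looks like in practice, here is how the option can be set with Python's standard socket module. The helper name make_low_latency_socket is hypothetical; an application would connect the returned socket to its own peer.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value disables Nagle's algorithm for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Reading the option back should give a non-zero value once it is set.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

The option is set per socket, so bulk-transfer connections in the same application can keep Nagle enabled while interactive ones turn it off.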