Category: TCP/IP

If you have been following the release of the 2.6.35 kernel, you have probably noticed two major network stack improvements that created some buzz in the geek community. OK, not really, since the RPS and RFS projects have been under way for a while now, driven by the crowd at Google Inc.

Since those two amazing features are now fully implemented and supported in the kernel, I thought it was a good opportunity to finally get down to the beast and shed some light on RPS and RFS.

First of all, RPS stands for “Receive Packet Steering” and RFS for “Receive Flow Steering”. As the names suggest, both deal with incoming traffic in the network stack, as this is where the big bottleneck sits when packets are handled and processed by the kernel. Since most network cards out there are single-queue NICs, packets reach the kernel through a single queue, and the kernel must try to spread the packet-processing load over all CPU cores.

Mono-queue Network Cards

As I said earlier, RPS and RFS address the limitations of mono-queue network cards (most cards used in data centers or in corporate IT environments are likely to be mono-queue cards).

With a mono-queue network card, incoming packets are spread across the CPU cores without regard to the flow they belong to. In a nutshell, one packet is handled by core 1 while the next packet from the same TCP/UDP stream is handled by core 2, so the cores must keep pulling the connection's state out of each other's caches, which hurts overall network performance (this is referred to as cacheline bouncing).

There are many cards out there (I like to refer to them as exotic NICs) that implement multiple queues, which, all in all, is what RPS mostly tries to emulate.

RPS – Receive Packet Steering

So, following on from what I said earlier, RPS is basically an emulation of a multi-queue card. In a nutshell, it calculates a hash from the header of the incoming IP packet (the IP addresses and ports) and assigns this hash to one CPU core. Once the hash is calculated, it is used to steer every new incoming packet that matches it to the same CPU core, a bit like a sticky session in load-balancing terms.

So now all packets of one TCP stream/connection are handled by one CPU core, which avoids the performance hit created by the cacheline bouncing effect.
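The steering step described above can be sketched in a few lines of Python. This is only an illustration: `rps_pick_cpu` is a made-up name, and `zlib.crc32` stands in for whatever flow hash the kernel or the NIC actually computes.

```python
import zlib

def rps_pick_cpu(src_ip, dst_ip, src_port, dst_port, cpus):
    """Map a flow 4-tuple to one CPU out of the allowed list."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(key)  # stand-in hash, NOT the kernel's algorithm
    return cpus[flow_hash % len(cpus)]

cpus = [0, 1, 2, 3]
# Two packets of the same connection always land on the same core.
first = rps_pick_cpu("10.0.0.5", "10.0.0.9", 40000, 80, cpus)
second = rps_pick_cpu("10.0.0.5", "10.0.0.9", 40000, 80, cpus)
print(first == second)  # True
```

The point is only the property: the mapping is deterministic per flow, so a connection's state stays warm in one core's cache.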

You can specify, per network card receive queue, which CPUs RPS may use. The setting is a hexadecimal CPU bitmask and can be found in

/sys/class/net/ethX/queues/rx-N/rps_cpus
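The value written to rps_cpus is a hexadecimal bitmask with one bit per CPU (in mainline kernels the file sits under queues/rx-N/ of the interface). A small helper, whose name `cpus_to_rps_mask` is my own, shows how such a mask is built. It is a sketch only and does not produce the comma-separated groups used on machines with more than 32 CPUs.

```python
def cpus_to_rps_mask(cpus):
    """Return the hex bitmask that enables the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N set means CPU N may process packets
    return format(mask, "x")

print(cpus_to_rps_mask([0, 1, 2, 3]))  # f
print(cpus_to_rps_mask([1, 3]))        # a
```

The resulting string is what you would echo into the rps_cpus file of the queue, for example `echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus` to allow the first four cores.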

You may wonder whether RPS always calculates the hash of an incoming packet. In most cases, if your mono-queue card is able to compute the hash itself, RPS simply fetches it from the card and skips the cost of doing the calculation in software.

RFS – Receive Flow Steering

While RPS obviously offers a huge performance gain, RFS was introduced to help userland applications run faster by improving CPU locality between the application and the packets handed to it by the kernel. In other words, when an application issues system calls that cause packets to be sent and received, the CPU currently executing it is recorded, and incoming packets destined for this application are handed over to that CPU by RPS.

So as you can see, RFS is just a sort of add-on to RPS, but instead of doing an IP/port match it does an application match, to minimize the performance penalty of poor CPU locality.
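To make the difference with plain RPS concrete, here is a toy model of the idea. The class and method names are invented for illustration and the real kernel tables are considerably more involved: the flow is steered to the CPU where the application was last seen, and falls back to the RPS-style hash choice when nothing is known about it.

```python
class ToyFlowSteering:
    """Toy model of RFS on top of RPS; not the kernel's data structures."""

    def __init__(self, cpus):
        self.cpus = cpus
        self.sock_flow_table = {}  # flow hash -> CPU where the app last ran

    def record_app_cpu(self, flow_hash, cpu):
        # Called when the application does a send/receive system call.
        self.sock_flow_table[flow_hash] = cpu

    def steer(self, flow_hash):
        # RFS: prefer the CPU the application is running on.
        if flow_hash in self.sock_flow_table:
            return self.sock_flow_table[flow_hash]
        # RPS fallback: pick a CPU from the hash alone.
        return self.cpus[flow_hash % len(self.cpus)]

steering = ToyFlowSteering([0, 1, 2, 3])
before = steering.steer(42)      # no app info yet, 42 % 4 selects CPU 2
steering.record_app_cpu(42, 3)   # the application was seen on CPU 3
after = steering.steer(42)       # now steered to CPU 3
print(before, after)  # 2 3
```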

In short, this is basically what RPS and RFS do. In another post I will try to get more technical and offer an analysis of the new code in the kernel network stack, such as an overview of the rps_sock_flow_table, the rps_dev_flow_table and the rxhash variable, and above all how out-of-order packets are handled by this new system ;-)

Prerequisite: Understanding of the 802.1Q Protocol

The purpose of this post is to shed some light on how QinQ VLAN tagging works in a bridged network environment.

Before continuing, it is important to keep in mind that QinQ, or 802.1ad, isn’t a protocol in itself but merely an amendment to the existing 802.1Q protocol.

Having said that, in a nutshell: where a single 802.1Q tag can identify at most 4096 VLANs, QinQ stacks a second tag on the frame and extends the number of possible combinations to 4096 * 4096 = 16777216, thus allowing provider switches to freely manipulate the outer tag without touching the customer’s. A typical example where QinQ is used is in bridged provider networks, where each customer’s frames can be forwarded across a different network topology while the whole thing appears to the customer as a simple bridge with no frame modification.

That is to say, if a corporation has offices across a region and wishes to build a single logical LAN, it can use QinQ to bridge all its sites through its network provider, without having to alter its existing VLAN infrastructure.

This, as said earlier, is achievable through QinQ and S-VLANs. To keep it simple, an S-VLAN is just the VLAN tag that a customer’s frames get when entering the VLAN space of the service provider, and on which forwarding occurs.

For example

Office A is on vlan 1 —- Provider  — Office B is on vlan 1

For this to work, so that a packet from customer A tagged with VLAN 1 is tunneled through the ISP’s bridged network, the ISP must work in a different VLAN space and assign a specific S-VLAN ID to the corporation subscribed to its services.

Office A (vlan1) —- Provider (vlan20) — Office B (vlan 1)

When entering the provider’s bridge, the frame from Office A is tagged with S-VLAN 20 and forwarded towards Office B. Once the frame reaches the edge bridge at the other end, the S-VLAN tag is stripped off and the frame enters Office B’s network.
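The push and pop of the S-VLAN tag can be illustrated with raw bytes. This is a minimal sketch with invented helper names. It assumes the standard 802.1ad TPID of 0x88A8 for the outer tag (some equipment is configured with other values), and it uses all-zero MAC addresses as placeholders.

```python
import struct

TPID_CTAG = 0x8100  # customer tag (802.1Q)
TPID_STAG = 0x88A8  # service tag (802.1ad), assumed here

def vlan_tag(tpid, vid, pcp=0):
    """Build a 4-byte VLAN tag: TPID followed by priority + 12-bit VLAN ID."""
    return struct.pack("!HH", tpid, (pcp << 13) | (vid & 0x0FFF))

def push_stag(frame, svid):
    # The tag goes right after the destination and source MAC (12 bytes).
    return frame[:12] + vlan_tag(TPID_STAG, svid) + frame[12:]

def pop_stag(frame):
    if frame[12:14] != struct.pack("!H", TPID_STAG):
        raise ValueError("frame carries no S-VLAN tag")
    return frame[:12] + frame[16:]

macs = bytes(12)  # placeholder destination + source MAC
customer_frame = macs + vlan_tag(TPID_CTAG, 1) + struct.pack("!H", 0x0800) + b"payload"
in_provider = push_stag(customer_frame, 20)  # entering the provider bridge
at_office_b = pop_stag(in_provider)          # leaving at the far edge
print(in_provider[12:16].hex())       # 88a80014
print(at_office_b == customer_frame)  # True
```

Note that the customer’s own tag (VLAN 1) is carried through untouched, which is why the provider network looks like a plain bridge from the customer’s side.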

Now what if I have many VLANs? Remember, within the provider’s bridged network the customer VLANs aren’t looked at, only the S-VLAN is. Based on the S-VLAN, the provider’s switch decides which edge switch, and thus which customer-facing port, to forward to. Only once the frame arrives at that endpoint is the S-VLAN tag stripped off, and the customer’s own switch then does the next forwarding step (based on the customer VLAN tags).

I hope that was informative and will clear up a lot of common misunderstandings about QinQ and S-VLANs.

Cheers

Anyone who talks about TCP throughput unfortunately can’t step away from the congestion problems that often occur in TCP connections.

There are many TCP congestion-related algorithms, from sliding windows to fast recovery. In this post I will focus only on Nagle’s algorithm and on how applications can be tweaked either to make use of the delay it induces or to minimize latency for the sake of real-time applications.

Nagle’s algorithm

Nagle’s algorithm was developed to prevent congestion due to tinygrams (small packets), that is to say packets in which the IP and TCP headers take up considerably more room than the data they carry (which can be as large as the MSS).

Remember

MSS = MTU – 40 bytes (20 bytes IP header and 20 bytes TCP header)

The problem is that an application which generates only small amounts of data at a time (small writes), such as a remote login or a remote X server session, would each time generate a packet with 40 bytes of headers (IP/TCP) plus x bytes of data. This overhead, together with the sheer number of packets generated, would start clogging the link, especially on links with limited bandwidth.
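To put a number on that overhead, here is a quick calculation assuming plain 20-byte IP and 20-byte TCP headers with no options (the function name is mine):

```python
HEADER_BYTES = 40  # 20 bytes IP + 20 bytes TCP, no options

def header_overhead(payload_bytes):
    """Fraction of the packet that is headers rather than data."""
    return HEADER_BYTES / (HEADER_BYTES + payload_bytes)

for payload in (1, 100, 1460):
    print(payload, f"{header_overhead(payload):.1%}")
# 1 97.6%
# 100 28.6%
# 1460 2.7%
```

A one-byte keystroke is almost pure header, while a full 1460-byte segment wastes under 3%.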

If I connect remotely to an X server and move the mouse, the amount of information generated per event will obviously be quite small, and it will therefore produce a substantial number of small packets.

Nagle’s algorithm delays the sending of tinygrams by buffering them until an ACK has been received for the outstanding data sent earlier.

The algorithm is laid out as follows

if there is new data to send
    if the window size >= MSS and available data is >= MSS
        send complete MSS segment now
    else
        if there is unconfirmed data still in the pipe
            enqueue data in the buffer until an acknowledge is received
        else
            send data immediately
        end if
    end if
end if
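The same decision logic, written as a small Python function for readers who prefer something they can run. The function name and the returned labels are my own, and a real TCP stack tracks far more state than these three numbers.

```python
def nagle_decision(window, pending_bytes, unacked_bytes, mss=1460):
    """Decide what to do with data waiting to be sent, per the pseudocode above."""
    if pending_bytes == 0:
        return "idle"                    # nothing new to send
    if window >= mss and pending_bytes >= mss:
        return "send_full_segment"       # a complete MSS can go out now
    if unacked_bytes > 0:
        return "buffer_until_ack"        # small data, something still in flight
    return "send_now"                    # small data, pipe is empty

print(nagle_decision(8192, 3000, 0))   # send_full_segment
print(nagle_decision(8192, 10, 500))   # buffer_until_ack
print(nagle_decision(8192, 10, 0))     # send_now
```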

Delayed ACK

Delayed ACK simply means that the receiver does not need to acknowledge each segment right after receiving it. Instead of sending an ACK for every segment, and then later (once its receive buffer is full) an ACK advertising a zero window followed by a window update, the receiver can hold the ACK back briefly and report both the acknowledgment and its current window size in a single segment.

It is important to note that the ACK delay should not exceed 0.2 seconds (200 ms). A longer ACK delay badly inflates the round-trip time seen by the sender, just as having no ACK delay at all adds a lot of extra acknowledgment traffic to the network.

Small Scenario

If I were to send 88000 bytes to a remote host, I would technically be sending (88000/1460) ~= 60 full packets + 400 bytes, excluding of course the TCP/IP headers.
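The segment count is easy to check, assuming an MSS of 1460 bytes:

```python
MSS = 1460
full_segments, tail_bytes = divmod(88000, MSS)
total_segments = full_segments + (1 if tail_bytes else 0)
print(full_segments, tail_bytes, total_segments)  # 60 400 61
```

So the transfer ends with a 61st, short segment of 400 bytes, and that short segment is exactly the kind of data Nagle’s algorithm holds back.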

With Delayed ACK, an ACK is sent for every 2nd packet received. So with Nagle’s algorithm, once the ACK for the 60th packet comes back, the 61st packet (400 bytes) is sent. Now imagine we had a 62nd packet. The receiver would still be waiting for the 62nd packet before sending its ACK, while Nagle on the other side would not send the 62nd packet until it receives the ACK for the 61st.

Now as you can imagine, we hit a temporary deadlock until the Delayed ACK timeout kicks in. Performance degradation is to be expected, especially with real-time applications.

The issue

Now what would happen in such a case if I were to use a remote X server and move the mouse? Well, Nagle would hold your small writes back until either a full segment has accumulated or the previous packet has been acknowledged, and with Delayed ACK on the other side that acknowledgment can take up to 200 ms to arrive. Do the calculation: with up to 200 ms between each packet sent, the lag will be highly noticeable :-)

To disable Nagle’s algorithm, make sure your application sets the TCP_NODELAY option on its socket.
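In Python, for instance, this is a one-line socket option. The snippet below only creates a socket and sets the option, it does not connect anywhere:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes are sent immediately.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)  # True
sock.close()
```

The trade-off is the one described above: lower latency for interactive traffic, at the price of more small packets on the wire.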

It is important to note that TCP flow control and Ethernet flow control are two completely different mechanisms. They strive for the same goal, but when in use they are completely unaware of each other.

As a matter of fact, Ethernet flow control can cripple your whole network if it is not planned and used carefully :)

So what is TCP flow control?

Flow control is a mechanism implemented in the TCP stack which enables the receiving endpoint to notify the sender that it can no longer accept data into its buffer. The free buffer space is what is referred to as the TCP window size, and it is transmitted in each ACK. The receiver can therefore let the sender know how many bytes it is able to accept at once.

[ let's assume the receiver machine can only hold 4K in its buffer ]

(sender) <-------- ACK 1022 WIN 4096 <-------- (receiver)
(sender) --------> 4K | SEQ 1022 --------> (receiver)

[ assuming that the buffer of the receiver is now full with the first 4K ]

(sender) <-------- ACK 5118 WIN 0 <-------- (receiver)

[ the sender is now "blocked" from sending more data till the receiver sends a second acknowledgment ]

(sender) <-------- ACK 5118 WIN 4096 <-------- (receiver)
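The exchange can be replayed with a toy receiver. The class below is purely illustrative and its names are invented. It only tracks the two values that matter here: the ACK number, which advances by the number of bytes accepted (1022 + 4096 = 5118), and the advertised window, which is the free buffer space.

```python
class ToyReceiver:
    """Tracks only the ACK number and the advertised window."""

    def __init__(self, buffer_size, next_seq):
        self.buffer_size = buffer_size
        self.buffered = 0          # bytes waiting for the application
        self.next_seq = next_seq   # next byte expected from the sender

    def advertise(self):
        return self.next_seq, self.buffer_size - self.buffered  # (ACK, WIN)

    def receive(self, nbytes):
        accepted = min(nbytes, self.buffer_size - self.buffered)
        self.buffered += accepted
        self.next_seq += accepted
        return self.advertise()

    def app_read(self, nbytes):
        self.buffered -= min(nbytes, self.buffered)
        return self.advertise()

rx = ToyReceiver(buffer_size=4096, next_seq=1022)
opening = rx.advertise()       # (1022, 4096)
full = rx.receive(4096)        # (5118, 0)    -> sender must stop
reopened = rx.app_read(4096)   # (5118, 4096) -> window update, sender resumes
print(opening, full, reopened)
```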

OK, so now, what is Ethernet flow control?

From layer 4 (TCP flow control), we jump now to layer 2 (Ethernet flow control).

Ethernet flow control differs from TCP flow control in that it uses a MAC control frame, the “pause frame”, to tell the device at the other end of the link to stop sending frames. It is important to keep in mind that the sender of the pause frame sets a 2-byte pause time, expressed in quanta of 512 bit times, which defines how long the link partner must wait before it resumes transmitting. Also keep in mind that pause frames are not forwarded: a MAC control frame is consumed by the adjacent device and is never passed on through a trunk port or to any device beyond it.
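For the curious, here is what such a frame looks like on the wire. This sketch uses the values defined for 802.3x pause frames (reserved destination 01:80:C2:00:00:01, MAC control EtherType 0x8808, opcode 0x0001, pause time in quanta of 512 bit times). The helper names and the source MAC are made up, and the FCS is left out.

```python
import struct

PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved multicast, never forwarded
ETHERTYPE_MAC_CONTROL = 0x8808
OPCODE_PAUSE = 0x0001

def build_pause_frame(src_mac, quanta):
    """Build a pause frame (without FCS), padded to the 60-byte minimum."""
    frame = PAUSE_DST + src_mac + struct.pack(
        "!HHH", ETHERTYPE_MAC_CONTROL, OPCODE_PAUSE, quanta)
    return frame.ljust(60, b"\x00")

def pause_seconds(quanta, link_bps):
    """How long the link partner stays silent: one quantum is 512 bit times."""
    return quanta * 512 / link_bps

frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
max_pause_1g = pause_seconds(0xFFFF, 1_000_000_000)
print(len(frame), frame[12:18].hex())  # 60 88080001ffff
print(f"{max_pause_1g * 1000:.1f} ms")  # 33.6 ms
```

On a 1 Gbps link the longest pause a single frame can request is therefore about 33.6 ms, which is why a congested switch keeps re-sending pause frames.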

What is the problem when using Ethernet flow control?

If you have read this far, you can start guessing what may occur if you have Ethernet flow control enabled on your switch. Instead of dropping packets when its buffers are exhausted, the switch generates its own pause frame and sends it to the sending host. Now keep in mind that a pause frame halts all transmission at the data link layer. That is to say, if PCX was meanwhile getting a file from PCB through the same port, that transfer would be “paused” as well. Because pause frames only work at layer 2 (data link), all communication through the targeted switch port ceases completely for the pause period.

But what happens meanwhile with the TCP flow control?

As said earlier, TCP flow control isn’t aware of Ethernet flow control. TCP throttles the amount of data it sends based on the feedback it gets, and because the switch no longer drops packets thanks to Ethernet flow control, TCP never learns that it is sending more than the path can absorb and keeps increasing the amount of data it sends. The result is an overloaded receiver and a switch that keeps generating pause frames, until TCP finally detects congestion and readjusts its sending window.

And what happens when you have IGMP snooping off?

Imagine a multicast scenario where you have a server and a workstation on two 1 Gb ports and another workstation on a 100 Mb port. If the server starts sending multicast packets at 1 Gbps (in the absolute ;-) ), Ethernet flow control will immediately throttle the server down to the lowest port speed involved. Remember we are talking multicast here, and because the packets also have to be delivered to the 100 Mb port, Ethernet flow control on the switch forces the server to send at only 100 Mbps. While this is fine in itself, remember that without IGMP snooping the switch sends all multicast packets to all switch ports, including endpoints that never joined the multicast group, and any slow port among them will trigger Ethernet flow control and drag performance down.

Conclusion

IGMP snooping has always been a problem in VRRP setups (e.g. Checkpoint HA), causing fluctuations of the interface state (referred to as flapping interfaces).

While it is possible to disable IGMP snooping per VLAN, I would recommend disabling it per multicast MAC address (i.e. 01:00:5e:xx:yy:zz)
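To find the multicast MAC address that corresponds to a given group, recall that an IPv4 multicast address maps to the fixed prefix 01:00:5e followed by the low 23 bits of the IP address. A small helper (the name is mine) does the conversion. The first example uses 224.0.0.18, the group used by VRRP advertisements.

```python
import ipaddress

def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    mac = (0x01005E << 24) | (int(addr) & 0x7FFFFF)  # prefix + low 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))

print(multicast_mac("224.0.0.18"))   # 01:00:5e:00:00:12
print(multicast_mac("239.255.0.1"))  # 01:00:5e:7f:00:01
```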

Laid out as simply as possible, the Mathis equation goes as follows

Rate <= (MSS/RTT)*(1 / sqrt(p))

MSS

This is the Maximum Segment Size, which is the MTU excluding the TCP/IP headers.

MSS = MTU – TCP/IP headers, for example 1460 with an MTU of 1500 (20 bytes IP and 20 bytes TCP headers)

RTT

RTT is the Round Trip Time as measured by TCP. The round trip is the time it would take a packet to travel from endpoint A to B and from endpoint B to A.

On average, RTT = (physical distance in km * 20 ms) / 1609, that is to say, for every 1609 km you should expect an RTT of about 20 ms

p

p is the probability of packet loss on the path. A fiber link would typically have a bit error rate (BER) on the order of 10⁻¹³.

Before we go on, it is important to understand how TCP evaluates packet loss. Put simply, loss is inferred from acknowledgments: duplicate ACKs, or ACKs that fail to arrive before the retransmission timer expires. The more often that happens, the higher the packet loss TCP sees.

Let’s get more serious

As explained earlier, the Mathis equation lets us estimate the rate, or throughput, we can achieve based on the MSS, the RTT and the probability of packet loss on the link.

Imagine we have an E3 link. For those new to WAN technology, an E3 link uses an M3 signaling type as opposed to an E1 which uses a ZM signaling type. Getting back to line speed, an E3 is the equivalent of 16*E1 ~= 34.368 Mbps (including management overhead)

1. Line is E3 with a bw of 34.368 Mbps
2. Our endpoint is roughly 3000 km from us
3. We are using a default MSS of 1460
4. An E3 would have a typical packet loss probability of 10⁻³ = 0.1 % (1 packet lost per 1000 packets)

Based on 3000 km, we can assume that the average RTT would be 37.29 ms = 0.03729 s

Mathis Eq : (1460 / 0.03729) * (1 / sqrt(0.001)) ~= 1.24 MB/s ~= 9.9 Mbps

Now if we had no packet loss, our throughput would be limited only by the TCP window

Throughput = TCPWindow / RTT

(65535 / 0.03729) * 8 ~= 14 Mbps

So a connection that could reach about 14 Mbps without loss manages only about 9.9 Mbps (1.24 MB/s) over 3000 km once one packet in every 1000 is lost.
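The numbers can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the example (20 ms per 1609 km, MSS of 1460, one packet lost per 1000, a 65535-byte window). The function names are mine, and the constant factor of the original paper is left out, as in the text. Note that the Mathis formula divides by the square root of p.

```python
from math import sqrt

def rtt_estimate_s(distance_km):
    """Rule of thumb from the text: 20 ms of RTT per 1609 km."""
    return distance_km * 0.020 / 1609

def mathis_rate_bps(mss_bytes, rtt_s, loss_probability):
    """Loss-limited upper bound: (MSS / RTT) * (1 / sqrt(p)), in bits/s."""
    return (mss_bytes / rtt_s) * (1 / sqrt(loss_probability)) * 8

def window_limited_bps(window_bytes, rtt_s):
    """Loss-free upper bound: one full window per round trip, in bits/s."""
    return window_bytes / rtt_s * 8

rtt = rtt_estimate_s(3000)
loss_limited = mathis_rate_bps(1460, rtt, 0.001)
window_limited = window_limited_bps(65535, rtt)
print(f"RTT            {rtt * 1000:.2f} ms")                # 37.29 ms
print(f"loss-limited   {loss_limited / 1e6:.1f} Mbit/s")    # 9.9 Mbit/s
print(f"window-limited {window_limited / 1e6:.1f} Mbit/s")  # 14.1 Mbit/s
```

Playing with the loss probability shows how quickly things degrade: because of the square root, a loss rate 100 times higher cuts the bound by a factor of 10.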

How do you increase the rate?

In a perfect world, you would of course tune each variable of the equation: decrease the RTT, decrease the loss probability and increase the MSS (which, by the way, you cannot do on the Internet, where the path MTU is in practice limited to 1500)

I hope that was informative on how packet loss can affect throughput.

Reference

The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm (1997) http://citeseer.ist.psu.edu/old/mathis97macroscopic.html