August 24, 2010

RPS and RFS - Kernel Network Stack

If you have been following the latest improvements in the release of the 2.6.35 kernel, you have probably noticed two major network stack improvements which have created some buzz in the geek community. Ok, not really! The RPS and RFS project has been going on for a while now, driven by the crowd at Google Inc.

Since those two amazing features are now fully implemented and supported in the kernel, I thought it was a good opportunity to finally get down to the beast and try to shed some light on RPS and RFS.

First of all… RPS stands for “Receive Packet Steering” and RFS for “Receive Flow Steering”… As you can tell, both of them deal with incoming traffic in the network stack, as this is where the big bottleneck takes place when packets are handled and processed by the kernel. Since most of the network cards out there are single-queue NICs, packets reach the kernel through a single queue and the kernel must try to spread the load of packet processing over all CPU cores.

###Mono-queue Network Cards###

As I said earlier, RPS and RFS address the limitations of mono-queue network cards (most cards used in data centers or in corporate IT environments are likely to be mono-queue cards).

With a mono-queue network card, incoming packets are spread across the CPU cores. In a nutshell, one packet is handled by core 1 while another packet from the same TCP/UDP stream is handled by core 2, so the cores keep pulling the same cachelines back and forth between their caches… which hurts overall network performance (this is referred to as cacheline bouncing).

There are many cards out there (I like to refer to them as exotic NICs) that implement multiple hardware receive queues, which is essentially what RPS tries to emulate in software.

###RPS - Receive Packet Steering###

So following what I said earlier… RPS is basically an emulation of a multi-queue card. In a nutshell, it calculates a hash over the header of the incoming IP packet, using the source/destination IP addresses and ports, and assigns this hash to one CPU core. Once the hash is calculated, it is used to steer all new incoming packets that match this hash to the same CPU core - a bit like sticky sessions in load-balancing terms.

So now all packets of one TCP stream/connection will be handled by one CPU core, thus avoiding the performance hit created by the cacheline bouncing effect.
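To make the idea more concrete, here is a minimal standalone C sketch of the kind of mapping RPS performs. This is not kernel code: the struct and function names are mine and the hash is a toy one (the kernel uses a proper hash such as jhash over the addresses and ports), but it shows how a connection's 4-tuple is reduced to a hash and how the hash picks a CPU.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 4-tuple of an incoming packet (names are mine, not the kernel's). */
struct flow_tuple {
    uint32_t saddr;   /* source IPv4 address      */
    uint32_t daddr;   /* destination IPv4 address */
    uint16_t sport;   /* source port              */
    uint16_t dport;   /* destination port         */
};

/* Toy hash over the 4-tuple; the kernel uses a stronger hash,
 * but the principle is the same: same flow -> same hash. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
    uint32_t h = t->saddr ^ t->daddr;
    h ^= ((uint32_t)t->sport << 16) | t->dport;
    h ^= h >> 16;
    h *= 0x45d9f3b;          /* cheap integer mix */
    h ^= h >> 16;
    return h;
}

/* Pick a CPU out of the set enabled for this receive queue.
 * 'ncpus' stands in for the rps_cpus mask configured in sysfs. */
static unsigned int steer_to_cpu(const struct flow_tuple *t, unsigned int ncpus)
{
    return flow_hash(t) % ncpus;
}

int main(void)
{
    struct flow_tuple t = { 0x0a000001, 0x0a000002, 40000, 80 };

    /* Every packet of this TCP connection yields the same hash,
     * so every packet lands on the same core. */
    printf("flow steered to CPU %u\n", steer_to_cpu(&t, 4));
    return 0;
}
```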

You can specify, per receive queue of each network card, a bitmask of the CPUs allowed to process its packets; this is configured in

_/sys/class/net/ethX/queues/rx-N/rps_cpus_
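For example, to enable RPS on the first receive queue of eth0 and let CPUs 0-3 process its packets, you can write the hexadecimal CPU bitmask "f" to that file. A minimal sketch in C (eth0 and rx-0 are just example names, adjust them to your setup; this needs root):

```c
#include <stdio.h>
#include <stdlib.h>

/* Enable RPS on the first receive queue of eth0 by writing a CPU bitmask.
 * "f" = CPUs 0-3; change the interface, queue and mask for your machine. */
int main(void)
{
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    fputs("f\n", f);
    fclose(f);
    return EXIT_SUCCESS;
}
```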

You may wonder if RPS always calculates the hash of an incoming packet… in most cases, if your mono-queue card is able to compute it, RPS will simply fetch this information from the card and will not carry the extra load of performing the hash calculation in software.
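In code, that decision looks roughly like the sketch below. The packet structure and field names are invented for illustration (in the real stack the hardware hash lives on the sk_buff), but the logic is the same: reuse the hash the NIC already computed when it is valid, and only hash in software as a fallback.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical packet descriptor; in the real stack this information
 * would live on the sk_buff. */
struct rx_pkt {
    uint32_t hw_hash;       /* flow hash reported by the NIC, if any */
    bool     hw_hash_valid; /* did the card give us one?             */
    uint32_t saddr, daddr;  /* enough of the headers to hash in software */
    uint16_t sport, dport;
};

/* Fallback: hash the headers in software (toy hash, same idea as before). */
static uint32_t software_flow_hash(const struct rx_pkt *p)
{
    uint32_t h = p->saddr ^ p->daddr ^ (((uint32_t)p->sport << 16) | p->dport);
    h ^= h >> 16;
    return h;
}

/* Reuse the hash the card already computed when available; only
 * fall back to computing one in software otherwise. */
static uint32_t get_flow_hash(const struct rx_pkt *p)
{
    return p->hw_hash_valid ? p->hw_hash : software_flow_hash(p);
}

int main(void)
{
    struct rx_pkt with_hw    = { .hw_hash = 0xdeadbeef, .hw_hash_valid = true };
    struct rx_pkt without_hw = { .saddr = 0x0a000001, .daddr = 0x0a000002,
                                 .sport = 40000, .dport = 80 };

    printf("hw hash: %#x\n", (unsigned)get_flow_hash(&with_hw));
    printf("sw hash: %#x\n", (unsigned)get_flow_hash(&without_hw));
    return 0;
}
```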

###RFS - Receive Flow Steering###

While RPS obviously offers a huge performance gain, RFS has been introduced to help userland applications process data faster by improving CPU locality between the application and the packets handed to it by the kernel. In other words, if an application issues system calls that cause packets to be sent and received, the CPU currently executing it is recorded, and incoming packets targeted at this application will be handed over to that same CPU by the RPS machinery.

So as you can see, RFS is just a sort of add-on to RPS, but instead of doing an IP/port match, it does an application match to minimize the CPU-locality performance penalty.
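Here is a minimal userspace sketch of the table RFS conceptually maintains, with names and sizes made up for illustration: when an application receives data for a flow, the CPU it is running on is recorded against that flow's hash, and later packets carrying the same hash are steered to that recorded CPU (in the real kernel this lives in the rps_sock_flow_table mentioned at the end of this post, sized through rps_sock_flow_entries).

```c
#include <stdint.h>
#include <stdio.h>

/* Toy flow table: flow hash -> CPU the consuming application last ran on.
 * The size and names are made up for the sketch. */
#define FLOW_TABLE_SIZE 4096

static int flow_table[FLOW_TABLE_SIZE];

/* Called from the "application side": remember where the app runs. */
static void record_app_cpu(uint32_t flow_hash, int cpu)
{
    flow_table[flow_hash % FLOW_TABLE_SIZE] = cpu;
}

/* Called from the "receive side": steer the packet toward that CPU. */
static int steer_flow(uint32_t flow_hash)
{
    return flow_table[flow_hash % FLOW_TABLE_SIZE];
}

int main(void)
{
    uint32_t hash = 0x1234abcd;   /* hash of some TCP connection */

    /* The application read from this socket while running on CPU 2 ... */
    record_app_cpu(hash, 2);

    /* ... so later packets of the same flow are handed to CPU 2. */
    printf("packet steered to CPU %d\n", steer_flow(hash));
    return 0;
}
```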

In a small summary, this is basically what RPS and RFS do. I will try in another post to get more technical and offer an analysis of the new code changes in the kernel network stack, such as an overview of the rps_sock_flow_table, the rps_dev_flow_table and the rxhash variable of the stack - and mostly how out-of-order packets are handled by this new system ;-)