Author Archive

systemd is a new replacement for the System V init system long used in all flavors of UNIX and Linux systems. OK, OK, not all flavors of Linux/Unix: Solaris switched to SMF a long time ago, while Ubuntu and Fedora (and the distributions targeted at end users) use Upstart. And yes, if you consider Darwin (OS X) a true Unix system :-X, then those guys use launchd.

You can read about the purpose and goals of systemd at http://0pointer.de/blog/projects/systemd.html. I don’t want to reiterate the excellent documentation provided by the author, but one thing to add is that systemd promises a lot, while its ideas and concepts are not necessarily new ;-)

To try systemd, you may use this forked Fedora 13 qemu image: http://surfsite.org/f13-systemd-livecd.torrent (keep in mind that this version still uses SysV and thus does not offer the full socket- and bus-based parallelization described on the project’s information page http://0pointer.de/blog/projects/systemd.html).

Fedora 14 is also planned to use systemd as a replacement for Upstart.

Have fun!

Ali

Projects: A few announcements

Hello there!

A few announcements before spinning off into a well-deserved weekend :-D

I – End of Maintenance of NumExt

NumExt will no longer be maintained (sad… sho sho shad :=) …)! OK, it is not that much of a tragedy, since I am happy to have seen the add-on grow and inspire many other add-ons that have implemented some of its features (yes, you “code rippers”, you know who you are ;-) )

II – quaggOS

quaggOS stable version 1 has been significantly delayed because Quagga upstream has delayed the release of its next stable branch. The current 0.99.17 branch is not fully stable, so we will have to wait a bit longer. Meanwhile, I am actively working on making quaggOS a bit more user friendly.

III – pwgen

Development of the pwgen Firefox add-on will continue as is. Thanks to the many contributors and to the strong user base.

That’s all, folks!

Cheers,

Ali

If you have been following the latest improvements in the 2.6.35 kernel release, you have probably noticed 2 major network stack improvements which have created some buzz in the geek community. OK, not really news, since the RPS and RFS projects have been in the works for a while now by the crowd at Google Inc.

Since those 2 amazing features have been fully implemented in the kernel and are now supported, I thought it was a good opportunity to finally get down to the beast and try to shed some light on RPS and RFS.

First of all, RPS stands for “Receive Packet Steering” and RFS for “Receive Flow Steering”. As you can see, both deal with incoming traffic in the network stack, as this is where the big bottleneck occurs when packets are handled and processed by the kernel. Since most network cards out there are single-queue NICs, packets arrive through a single queue and the kernel must try to spread the packet-processing load over all CPU cores.

Mono-queue Network Cards

As I said earlier, RPS and RFS address the limitations of mono-queue network cards (most cards used in data centers or in corporate IT environments are likely to be mono-queue cards).

With a mono-queue network card, incoming packets are spread across the CPU cores without regard to the stream they belong to. In a nutshell, one packet may be handled by core 1 while another packet from the same TCP/UDP stream is handled by core 2, so the CPU cores must share or query each other's caches, which hurts overall network performance (this is referred to as cacheline bouncing).

There are many cards out there (I like to refer to them as exotic NICs) that implement multiple queues, which is essentially what RPS tries to emulate.

RPS – Receive Packet Steering

So, following what I said earlier, RPS is basically an emulation of a multi-queue card. In a nutshell, it calculates a hash from the header of the incoming IP packet (the IP addresses and ports) and assigns this hash to one CPU core. Once the hash is calculated, it is used to send all new incoming packets that match this hash to the same CPU core, a bit like a sticky session in load-balancing terms.

So now all packets of one TCP stream/connection will be handled by one CPU core, thus avoiding the performance hit created by the cacheline bouncing effect.
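To make the idea concrete, here is a minimal sketch in Python of hash-based steering. It is an illustration only, not the kernel code: the function name steer_cpu is made up, and CRC32 is just a stand-in for whatever hash the kernel really computes.

```python
import zlib

def steer_cpu(src_ip, src_port, dst_ip, dst_port, cpus):
    """Pick a CPU for a flow by hashing its addresses and ports.

    Hypothetical helper: CRC32 is only a stand-in for the kernel's hash.
    """
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(key)
    # Same flow -> same hash -> same CPU, like a sticky session.
    return cpus[flow_hash % len(cpus)]

cpus = [0, 1, 2, 3]
first = steer_cpu("10.0.0.1", 40000, "10.0.0.2", 80, cpus)
second = steer_cpu("10.0.0.1", 40000, "10.0.0.2", 80, cpus)
```

Two packets of the same connection always land on the same core, which is the whole point; packets of different connections may or may not share a core, depending on how their hashes fall.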

You can specify, per network card and per receive queue, which CPUs may be used, as a hexadecimal CPU bitmask. It is set in

/sys/class/net/ethX/queues/rx-N/rps_cpus
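As a small helper sketch, assuming rps_cpus takes a hexadecimal bitmask in which bit n stands for CPU n, the value to write for a given set of CPUs can be computed like this (cpu_mask is a made-up name, and the sketch only covers small CPU counts):

```python
def cpu_mask(cpus):
    """Return the hex bitmask for a list of CPU numbers (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

all_four = cpu_mask([0, 1, 2, 3])   # "f"
odd_only = cpu_mask([1, 3])         # "a"
```

The resulting string is what you would echo into the rps_cpus file of the interface.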

You may wonder whether RPS always calculates the hash of an incoming packet. If your mono-queue card is able to compute the hash itself, RPS will simply fetch it from the card and skip the cost of performing the calculation.

RFS – Receive Flow Steering

While RPS obviously offers a huge performance gain, RFS has been introduced to help userland applications process packets faster by improving CPU locality between the application and the packets handed to it by the kernel. In other words, when an application issues system calls that trigger packets to be sent and received, the CPU currently executing it is recorded, and incoming packets targeted at this application are handed over to that CPU by RPS.

So as you can see, RFS is a sort of add-on to RPS, but instead of doing an IP/port match, it does an application match to minimize the performance penalty of poor CPU locality.

In short, this is basically what RPS and RFS do. I will try in another post to get more technical and offer an analysis of the new code changes in the kernel network stack, such as an overview of the rps_sock_flow_table, the rps_dev_flow_table and the rxhash variable of the stack, and mostly how out-of-order packets are handled by this new system ;-)
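Here is a rough Python sketch of that application match. It only models the idea of a table mapping a flow to the CPU where the application last ran; the class and method names are hypothetical and have nothing to do with the actual kernel structures.

```python
class FlowTable:
    """Toy model of RFS: remember which CPU last consumed each flow."""

    def __init__(self):
        self.flow_to_cpu = {}

    def record_syscall(self, flow_hash, cpu):
        # Called when the application reads/writes the socket on `cpu`.
        self.flow_to_cpu[flow_hash] = cpu

    def steer(self, flow_hash, rps_cpu):
        # Prefer the application's CPU; fall back to the plain RPS choice.
        return self.flow_to_cpu.get(flow_hash, rps_cpu)

table = FlowTable()
before = table.steer(flow_hash=42, rps_cpu=1)   # no footprint yet
table.record_syscall(flow_hash=42, cpu=3)
after = table.steer(flow_hash=42, rps_cpu=1)    # follows the application
```

Until the application has touched the flow, the packet goes wherever the RPS hash sends it; afterwards it follows the application.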

-ENOMEM Error – OOM state

When the kernel needs to allocate memory to a process, its first action is to check whether or not the requested memory is available.

To achieve this, the kernel makes sure that the requested memory is globally available on the system, excluding 3% reserved for root processes, and that the committed amount of RAM does not exceed the allowed threshold.

The threshold is determined using this formula

The check is performed through vm_enough_memory() which evaluates the following

(sum total page cache + total free swap pages + total free pages + slab pages) – 3%

If there is enough memory available to satisfy the request, true is returned to the caller; if not, the -ENOMEM error code is returned to the caller.

When -ENOMEM is returned, out_of_memory() is triggered to evaluate whether the system is in OOM (Out of Memory). This is done through a series of checks
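As a back-of-the-envelope sketch of that check (not the kernel implementation), assuming all values are page counts and that the 3% is taken off the summed total, the logic looks like this in Python. The function name and parameters are made up for the illustration.

```python
import errno

def enough_memory(requested, page_cache, free_swap, free_pages, slab):
    """Return True if `requested` pages fit in the estimated free memory."""
    available = page_cache + free_swap + free_pages + slab
    available -= available * 3 // 100   # 3% held back for root processes
    return requested <= available

def allocate(requested, **counters):
    # The caller sees 0 on success and -ENOMEM on failure.
    return 0 if enough_memory(requested, **counters) else -errno.ENOMEM

counters = dict(page_cache=50, free_swap=50, free_pages=50, slab=50)
ok = allocate(100, **counters)
failed = allocate(195, **counters)
```

With 200 pages counted, 6 are held back, so a request for 195 pages fails while a request for 100 goes through.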

1. Is there enough swap space left? If yes, it is not OOM
2. Has it been more than 5 seconds since the last failure? If yes, it is not OOM
3. Have we failed within the last second? If not, it is not OOM
4. If there have not been at least 10 failures in the last 5 seconds, it is not OOM
5. Has a process been killed within the last 5 seconds? If yes, it is not OOM
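The checks above can be sketched as a small state machine in Python. This is my own reading of the list, not the kernel code: in particular I assume that step 3 means "the current run of failures must have lasted at least one second", and that the failure counter restarts whenever the run is broken or a kill is decided. Times are in seconds and all names are hypothetical.

```python
class OomDetector:
    """Toy model of the OOM heuristics (times in seconds)."""

    def __init__(self):
        self.first = 0.0                  # start of the current run of failures
        self.last = 0.0                   # time of the previous failure
        self.count = 0                    # failures counted in this run
        self.last_kill = float("-inf")    # time of the last OOM kill

    def _reset(self, now):
        self.first = now
        self.count = 0

    def out_of_memory(self, now, free_swap_pages):
        if free_swap_pages > 0:           # 1. swap space left
            return False
        since_last = now - self.last
        self.last = now
        if since_last > 5:                # 2. last failure was long ago
            self._reset(now)
            return False
        if now - self.first < 1:          # 3. (assumed) run shorter than 1s
            return False
        self.count += 1
        if self.count < 10:               # 4. fewer than 10 failures
            return False
        if now - self.last_kill < 5:      # 5. something was killed recently
            return False
        self.last_kill = now
        self._reset(now)
        return True

detector = OomDetector()
# One failure every 250 ms, starting at t=100s, with no swap left.
results = [detector.out_of_memory(100 + 0.25 * i, 0) for i in range(14)]
```

In this run only the last call reports OOM: the first second is ignored, then ten more failures have to pile up before the system is declared out of memory.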

OOM Killer

When the system is finally tagged as being OOM, oom_kill() is triggered to select a process to terminate and thus free memory. The selection is done by select_bad_process(), which calls the badness() function to obtain a score (in points) for each process. The badness of a task is calculated from the following values

* totalVMTask is initially the amount of memory allocated to the process, to which the memory allocated to child processes can be added if the parent and child do not share the same memory (i.e. forked processes).

* cputime is the sum of utime and stime: (utime + stime)

* runtime is the difference between the uptime and the start time: (uptime – start time)

All in all, this function selects a process that uses a lot of memory and has been running for a short amount of time, as processes that have been running for a long time are unlikely to be the cause of the issue.
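As a sketch of how those three values can combine, here is a Python version that assumes the base score is totalVMTask divided by the square root of cputime and by the fourth root of runtime, using integer arithmetic as 2.6-era kernels did. The unit scaling the kernel applies to the time values is left out, so treat the numbers as illustrative only.

```python
import math

def badness(total_vm, utime, stime, uptime, start_time):
    """Base badness score: big and young processes score highest."""
    cpu_time = utime + stime
    run_time = uptime - start_time
    points = total_vm
    s = math.isqrt(cpu_time)
    if s:
        points //= s                     # divide by sqrt(cputime)
    s = math.isqrt(math.isqrt(run_time))
    if s:
        points //= s                     # divide by fourth root of runtime
    return points

young = badness(total_vm=1000, utime=2, stime=2, uptime=100, start_time=19)
old = badness(total_vm=1000, utime=2, stime=2, uptime=6600, start_time=39)
```

With the same memory footprint and CPU time, the process that has been up for a long time ends up with a much lower score than the young one.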

Once the badness score of the process is determined, it is adjusted according to the /proc/<pid>/oom_adj value as follows:

if (p->oomkilladj > 0)
        points <<= p->oomkilladj;
else
        points >>= -(p->oomkilladj);

Rules of Exceptions

Now there are 3 exception rules
1. if the process is a root process, meaning it has CAP_SYS_ADMIN capabilities, then the points are divided by four
2. if it has CAP_SYS_RAWIO capabilities, then the points are divided by four as well (it has access to hardware, so it is not a good one to kill)
3. the points of a niced process are automatically doubled
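Put together, the adjustments can be sketched like this in Python. The order in which the rules are applied here (nice, capabilities, then the oom_adj shift) is an assumption on my side, and the function and parameter names are made up.

```python
def adjust_points(points, oom_adj=0, is_root=False, has_rawio=False, niced=False):
    """Apply the exception rules and the oom_adj shift to a badness score."""
    if niced:
        points *= 2            # niced processes are doubled
    if is_root:
        points //= 4           # CAP_SYS_ADMIN
    if has_rawio:
        points //= 4           # CAP_SYS_RAWIO
    if oom_adj > 0:
        points <<= oom_adj     # positive value: more likely to be killed
    elif oom_adj < 0:
        points >>= -oom_adj    # negative value: less likely to be killed
    return points

plain = adjust_points(160)
root_hw = adjust_points(160, is_root=True, has_rawio=True)
boosted = adjust_points(160, oom_adj=2)
```

A score of 160 drops to 10 for a root process with raw I/O access, and climbs to 640 with an oom_adj of 2.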

Once a process is selected, the process list is walked again and every process sharing the same mm_struct is signalled as well. A SIGKILL is sent, unless the process has CAP_SYS_RAWIO capabilities, in which case it receives a SIGTERM to give it a chance to exit cleanly.

An administrator can always set the oom_adj value of a process to the OOM_DISABLE constant (defined as -17) in /proc/<pid>/oom_adj to prevent the process from being killed. However, recent patches and contributions from developers allow for a more sophisticated approach to this issue.

OOM_KILLER Controller

Nikanth Karthikesan from SUSE contributed a patch which introduces an OOM control group. Each group can be given a preference through the oom.priority field (also introduced by this patch). The group with the highest oom.priority is selected first to be killed when the system runs out of memory; the default value of oom.priority is 1.

You can review the patch at http://lkml.org/lkml/2009/1/29/220.

However, as a simple illustration, here is a small example of how the controller is used.

1. You need to mount the cgroup OOM pseudo filesystem

mount -t cgroup -o oom oom /mnt/oom-controller

2. Create a folder per group; each will contain 2 files, tasks and oom.priority. The folder names can be anything, but as an example, we will create 2 folders, one for the processes to kill first and one for the processes to never kill

mkdir /mnt/oom-controller/proc_{kill1st,notkill}

touch /mnt/oom-controller/proc_kill1st/tasks && touch /mnt/oom-controller/proc_kill1st/oom.priority

touch /mnt/oom-controller/proc_notkill/tasks && touch /mnt/oom-controller/proc_notkill/oom.priority

3. We can then add the pids of the processes to kill first to the tasks file of the proc_kill1st group

echo 1456 > /mnt/oom-controller/proc_kill1st/tasks

then set a priority of "50"

echo 50 > /mnt/oom-controller/proc_kill1st/oom.priority

Now if another group (e.g. proc_kill2nd) has a lower value set in oom.priority, then the processes belonging to proc_kill1st will be killed first.

4. Now let’s set a group of processes to not be killed in case of OOM

echo 0 > /mnt/oom-controller/proc_notkill/oom.priority

Of course, you need to add the pids of your processes to /mnt/oom-controller/proc_notkill/tasks, but as you can guess, since the oom.priority is 0, those processes will never be killed by the oom_killer.
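To summarize the selection rule, here is a small Python model of it. It only mimics the behaviour described above (highest oom.priority first, priority 0 never killed) and is not taken from the patch; the function name, the proc_kill2nd group contents and the pids other than 1456 are made up.

```python
def pick_victim_group(groups):
    """Return the name of the group whose processes get killed first."""
    candidates = {
        name: group for name, group in groups.items()
        if group["priority"] > 0 and group["tasks"]   # 0 means never kill
    }
    if not candidates:
        return None
    return max(candidates, key=lambda name: candidates[name]["priority"])

groups = {
    "proc_kill1st": {"priority": 50, "tasks": [1456]},
    "proc_kill2nd": {"priority": 10, "tasks": [2001]},
    "proc_notkill": {"priority": 0, "tasks": [999]},
}
victim = pick_victim_group(groups)
```

Here proc_kill1st wins with its priority of 50, and proc_notkill is never even considered.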

More can be found at http://lkml.org/lkml/2009/1/29/220, but that’s the general overview of this OOM_KILLER controller.

Is this an optimum solution?

No. It makes more sense in the long run for userspace applications to be notified by the kernel that memory should be freed, and to handle that themselves by freeing their caches. This can be quite troublesome with userspace programs that do not properly free their caches, or cannot do so because of a change in the kernel memory manager, but I still think the OOM mechanism should be acted upon in userspace first, with the kernel handling it as a last resort (as is already done).

Cheers,

Ali

One of the most interesting features of DMVPN, in my personal opinion, is its extended support for VRF on MPLS networks.

Remember, VRF allows multiple instances of routing tables to co-exist on the same router at the same time.

Having said that, DMVPN helps scale out the traditional IPsec hub-and-spoke VPN configuration by setting up permanent connections from the spoke routers to the hub router and temporary connections between the spoke routers as needed. The result is to offload traffic from the hub router, which improves network performance, scalability and traffic control.

DMVPN relies on the following protocols

- IPsec: secures the traffic, here using pre-shared keys

- mGRE: mGRE allows us to encapsulate multicast packets (e.g. OSPF packets) and to set up a pseudo-virtual tunnel interface to link our sites

- NHRP: Without NHRP, our GRE tunnels cannot be established. NHRP stands for “Next Hop Resolution Protocol” and allows the hub to know what the peer sites’ IPs are. The NHRP server (the hub) answers NHRP requests for the IP discovery of peers so that tunnels can be formed.

- A routing protocol: OSPF, RIP, BGP etc…

Important things to keep in mind

- IPSEC in “Transport Mode”

When setting up the tunnel, make sure to use “transport” mode with IPsec: GRE already encapsulates the original IP packet, so ESP does not need to add another outer IP header as it would in tunnel mode. This allows you to save 20 bytes on the MTU ;-)

- Use RIP for default routes

I know, you are probably ready to pull out your hair, but in a large DMVPN network, using RIP could scale better than a routing protocol such as OSPF… calculating adjacencies is CPU intensive ;-)
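Coming back to the 20 bytes saved by transport mode: here is the arithmetic as a tiny Python sketch, assuming IPv4 headers without options and counting only IP headers (the GRE and ESP overhead is the same in both modes, so it is left out). The names are made up for the illustration.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options

def ip_headers(mode):
    """List the IP headers carried by a GRE-over-IPsec packet."""
    headers = ["GRE delivery IP header", "original IP header"]
    if mode == "tunnel":
        # Tunnel mode wraps everything in one more IP header.
        headers.insert(0, "IPsec outer IP header")
    return headers

extra = len(ip_headers("tunnel")) - len(ip_headers("transport"))
saved_bytes = extra * IPV4_HEADER_BYTES
```

Tunnel mode carries one IP header more than transport mode, hence the 20 bytes.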