July 11, 2010

Linux Kernel Out of Memory Management

-ENOMEM Error - OOM state

When the kernel needs to allocate memory to a process, its first action is to check whether the requested memory is available.

To do this, the kernel makes sure that the requested memory is globally available on the system, excluding the 3% reserved for root processes, and that the committed amount of RAM does not exceed the allowed threshold.

The threshold is determined using this formula
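Assuming the threshold meant here is the commit limit used in strict overcommit mode (vm.overcommit_memory = 2), the kernel's proc documentation describes it as swap plus a configurable percentage of physical RAM (vm.overcommit_ratio). Written out, with the symbol names being only shorthand for those quantities:

```latex
\text{CommitLimit} = \text{Swap} + \text{RAM} \times \frac{\texttt{overcommit\_ratio}}{100}
```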

The check is performed by vm_enough_memory(), which evaluates the following:

(total page cache + total free swap pages + total free pages + slab pages) - 3%

If there is enough memory available to satisfy the request, true is returned to the caller; if not, the -ENOMEM error code is returned.
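The check described above can be sketched in plain C. This is a simplified userspace model and not the kernel's own code: struct mem_counters, its field names and enough_memory() are hypothetical names made up for this illustration, and the return values simply follow the description above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the counters named in the text, all in pages. */
struct mem_counters {
    unsigned long page_cache;
    unsigned long free_swap;
    unsigned long free_pages;
    unsigned long slab_pages;
};

/*
 * Returns 1 if `requested` pages fit into what is considered available,
 * -ENOMEM otherwise. Non-root callers cannot dip into the last 3%.
 */
int enough_memory(const struct mem_counters *m, unsigned long requested,
                  bool is_root)
{
    unsigned long available = m->page_cache + m->free_swap +
                              m->free_pages + m->slab_pages;

    if (!is_root)
        available -= available * 3 / 100;   /* keep 3% back for root */

    return requested <= available ? 1 : -ENOMEM;
}
```

With 1000 pages counted as available, a non-root request for 970 pages passes and a request for 971 pages gets -ENOMEM, while root can still ask for the full 1000.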

When -ENOMEM is returned, out_of_memory() is triggered to evaluate whether the system is really in an OOM (Out of Memory) state. This is done through a series of checks:

  1. Is there enough swap space left? If yes, it is not OOM.
  2. Has it been more than 5 seconds since the last failure? If yes, it is not OOM.
  3. Have allocations been failing for at least one second? If not, it is not OOM.
  4. Have there been at least 10 failures in the last 5 seconds? If not, it is not OOM.
  5. Has a process been killed within the last 5 seconds? If yes, it is not OOM.
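The five checks can be sketched as a small state machine. This is a userspace model under assumptions: struct oom_state and really_oom() are hypothetical names, time is passed in as milliseconds by the caller, and the way the failure counter is reset is one plausible reading of the rules rather than a copy of the kernel source.

```c
#include <stdbool.h>

/* Hypothetical state carried from one allocation failure to the next. */
struct oom_state {
    long first;     /* start of the current burst of failures, in ms */
    long last;      /* most recent failure, in ms */
    long lastkill;  /* most recent kill, in ms */
    int  count;     /* failures counted in the current burst */
};

/* Called on every allocation failure; true means "really out of memory". */
bool really_oom(struct oom_state *s, long now, long free_swap_pages)
{
    long since_last;

    /* 1. Swap space left: not OOM. */
    if (free_swap_pages > 0)
        return false;

    since_last = now - s->last;
    s->last = now;

    /* 2. More than 5 s since the last failure: start a new burst. */
    if (since_last > 5000) {
        s->first = now;
        s->count = 0;
        return false;
    }

    /* 3. The burst is younger than 1 s: not OOM yet. */
    if (now - s->first < 1000)
        return false;

    /* 4. Fewer than 10 failures counted so far: not OOM yet. */
    if (++s->count < 10)
        return false;

    /* 5. A process was killed less than 5 s ago: give it time to exit. */
    if (now - s->lastkill < 5000)
        return false;

    /* All checks passed: the caller would now run the OOM killer. */
    s->lastkill = now;
    s->first = now;
    s->count = 0;
    return true;
}
```

The point of the design is that a single failed allocation never triggers a kill: failures have to persist for a while, and a kill is followed by a quiet period so that the victim has time to release its memory.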

OOM Killer

When the system is finally tagged as being OOM, oom_kill() is called to select a process to terminate and thus free memory. The selection relies on the badness() function, which returns a score (in points) for each process to select_bad_process(). The badness of a task is calculated using this formula:

badness = totalVMTask / (sqrt(cputime) * sqrt(sqrt(runtime)))

  • totalVMTask is initially the amount of memory allocated to the process, to which the memory allocated to its child processes can be added when parent and child do not share the same memory (i.e. forked processes).

  • cputime is the sum of utime and stime (utime + stime).

  • runtime is the difference between the uptime and the start time (uptime - start time).

In short, this function favors processes that use a lot of memory and have been running for a short amount of time, as processes that have been running for a long time are unlikely to be the cause of the issue.
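To make the scoring concrete, here is a minimal sketch in C with integer arithmetic. The names badness_points() and isqrt_nonzero() are hypothetical, the units of cputime and runtime are left to the caller (an assumption of this sketch), and the square root helper is deliberately simple rather than fast.

```c
/* Integer square root that never returns 0, so it is safe as a divisor. */
static unsigned long isqrt_nonzero(unsigned long x)
{
    unsigned long r = 0;

    /* (r + 1) <= x / (r + 1) is an overflow-safe form of (r + 1)^2 <= x */
    while (r + 1 <= x / (r + 1))
        r++;
    return r ? r : 1;
}

/*
 * total_vm: memory of the task (plus children, if counted)
 * cputime:  utime + stime
 * runtime:  uptime - start time
 */
unsigned long badness_points(unsigned long total_vm,
                             unsigned long cputime,
                             unsigned long runtime)
{
    unsigned long points = total_vm;

    points /= isqrt_nonzero(cputime);                  /* / sqrt(cputime)       */
    points /= isqrt_nonzero(isqrt_nonzero(runtime));   /* / sqrt(sqrt(runtime)) */
    return points;
}
```

For example, a task with total_vm = 10000, cputime = 100 and runtime = 256 scores 250, while the same memory footprint with cputime = 4 and runtime = 16 scores 2500: the younger, less CPU-hungry task is the more likely victim.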

Once the badness score of the process is determined, it is adjusted according to the /proc/<pid>/oom_adj value as follows:

if (p->oomkilladj > 0)
        points <<= p->oomkilladj;
else
        points >>= -(p->oomkilladj);

Rules of Exceptions

There are three exceptions:

  1. If the process is a root process, meaning it has CAP_SYS_ADMIN capabilities, the points are divided by four.
  2. If it has CAP_SYS_RAWIO capabilities, the points are also divided by four (it has direct access to hardware, so killing it is risky).
  3. If the process is niced, its points are automatically doubled.
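The adjustment and the three exceptions can be put together in a short sketch. adjust_points() and its boolean parameters are hypothetical names, and the order in which the steps are applied (shift first, then the exceptions) is an assumption that simply follows the order of the text; with integer arithmetic a different order can change the result slightly.

```c
#include <stdbool.h>

unsigned long adjust_points(unsigned long points, int oomkilladj,
                            bool cap_sys_admin, bool cap_sys_rawio,
                            bool niced)
{
    /* The adjustment value shifts the score left (more likely to be
     * killed) or right (less likely to be killed). */
    if (oomkilladj > 0)
        points <<= oomkilladj;
    else
        points >>= -oomkilladj;

    if (cap_sys_admin)
        points /= 4;    /* root process */
    if (cap_sys_rawio)
        points /= 4;    /* direct hardware access */
    if (niced)
        points *= 2;    /* niced process */

    return points;
}
```

Starting from 1000 points, an adjustment of 2 gives 4000 and an adjustment of -3 gives 125; a root process with no adjustment drops to 250, and a niced process doubles to 2000.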

A SIGKILL is then sent to the selected process; the process list is walked again and every process sharing the same mm_struct is killed as well. If the process has CAP_SYS_RAWIO capabilities, it receives a SIGTERM instead, which gives it a chance to exit cleanly.

An administrator can always write the **OOM_DISABLE** constant (defined as -17) to /proc/<pid>/oom_adj to prevent the process with that pid from being killed. However, recent patches and contributions from developers allow a more sophisticated approach to this issue.
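From a program, setting that value comes down to writing a number into the per-process file. The sketch below assumes the /proc/<pid>/oom_adj path discussed above; oom_adj_path() and write_oom_adj() are hypothetical helper names, and the write typically requires root privileges.

```c
#include <stdio.h>

#define OOM_DISABLE (-17)

/* Builds "/proc/<pid>/oom_adj" into buf. Returns 0 on success, -1 if buf is too small. */
int oom_adj_path(char *buf, size_t len, long pid)
{
    int n = snprintf(buf, len, "/proc/%ld/oom_adj", pid);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Writes `value` to the file at `path`. Returns 0 on success, -1 on failure. */
int write_oom_adj(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)     /* the write may only fail when flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

To protect pid 1456, a caller would build the path with oom_adj_path(buf, sizeof buf, 1456) and then call write_oom_adj(buf, OOM_DISABLE), checking both return values.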

OOM_KILLER Controller

Nikanth Karthikesan from SUSE contributed a patch which introduces an OOM control group. Each group can be given a preference through the oom.priority field (also introduced by this patch). The group with the highest oom.priority is selected first to be killed when the system runs out of memory; the default value of oom.priority is 1.

You can review the patch at http://lkml.org/lkml/2009/1/29/220.

However, as a simple illustration, here is a small example of how the controller is used.

  1. Mount the cgroup OOM pseudo filesystem:

mount -t cgroup -o oom oom /mnt/oom-controller

  2. Create a folder that will contain 2 files, the tasks and oom.priority files. The folders can be named anything; as an example, we will create 2 folders, one for the processes to kill first and one for the processes to never kill:

mkdir /mnt/oom-controller/proc_{kill1st,notkill}

touch /mnt/oom-controller/proc_kill1st/tasks && touch /mnt/oom-controller/proc_kill1st/oom.priority

touch /mnt/oom-controller/proc_notkill/tasks && touch /mnt/oom-controller/proc_notkill/oom.priority

  3. Add the pids of the processes to kill first to the tasks file of the proc_kill1st group:

echo 1456 > /mnt/oom-controller/proc_kill1st/tasks

then set a priority of “50”:

echo 50 > /mnt/oom-controller/proc_kill1st/oom.priority

Now if another group (e.g. proc_kill2nd) has a lower number set in oom.priority, the processes belonging to proc_kill1st will be killed first.

  4. Now let’s set a group of processes to not be killed in case of OOM:

echo 0 > /mnt/oom-controller/proc_notkill/oom.priority

Of course, you need to add the pids of your processes to /mnt/oom-controller/proc_notkill/tasks but, as you can guess, since oom.priority is 0, those processes will never be killed by the oom_killer.
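The selection rule described in these steps (highest oom.priority first, 0 meaning never) can be modelled in a few lines of C. This is only an illustration of the rule as stated here, not code from the patch; struct oom_group and pick_victim_group() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical in-memory view of one control group. */
struct oom_group {
    const char  *name;
    unsigned int priority;   /* value of oom.priority; 0 means never kill */
};

/* Returns the group to kill first, or NULL if every group is protected. */
const struct oom_group *pick_victim_group(const struct oom_group *groups,
                                          size_t n)
{
    const struct oom_group *victim = NULL;

    for (size_t i = 0; i < n; i++) {
        if (groups[i].priority == 0)
            continue;                      /* protected group */
        if (!victim || groups[i].priority > victim->priority)
            victim = &groups[i];           /* highest priority so far */
    }
    return victim;
}
```

With proc_kill1st at 50, a proc_kill2nd group at 10 and proc_notkill at 0, the function returns proc_kill1st; if only proc_notkill exists, it returns NULL.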

More can be found at http://lkml.org/lkml/2009/1/29/220, but that’s the general overview of this OOM_KILLER controller.

Is this an optimum solution?

No. In the long run it makes more sense for userspace applications to be notified by the kernel that memory should be freed, and to respond by freeing their caches. This can be troublesome when a userspace program does not free its caches properly, or cannot because of a change in the kernel memory manager, but I still think the OOM mechanism should be acted upon in userspace and handled in kernelspace (as is already done).