Linux Kernel Out of Memory Management

July 11th, 2010 Ali Abbas No comments

-ENOMEM Error – OOM state

When the kernel needs to allocate memory to a process, its first action is to check whether the requested memory is available.

To achieve this, the kernel makes sure that the requested memory is globally available on the system, excluding the 3% reserved for root processes, and that the committed amount of RAM does not exceed the allowed threshold.

The check is performed by vm_enough_memory(), which determines the threshold by evaluating the following:

(total page cache + total free swap pages + total free pages + slab pages) - 3%

If there is enough memory available to satisfy the request, true is returned to the caller. If not, the -ENOMEM error code is returned to the caller.

When -ENOMEM is returned, out_of_memory() is triggered to evaluate whether the system is in OOM (Out of Memory). This is done through a series of checks:

1. Is there enough swap space left? If yes, it is not OOM
2. Has it been more than 5 seconds since the last failure? If yes, it is not OOM
3. Have we failed within the last second? If not, it is not OOM
4. If there have not been 10 failures at least in the last 5 seconds, it is not OOM
5. Has a process been killed within the last 5 seconds? If yes, it is not OOM
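The five checks above can be sketched as a small state machine. This is only an illustrative model in Python, not the kernel code: the OomState class, its field names and the use of seconds instead of jiffies are assumptions made for the example, and check 3 is read here as "the failures must have been going on for at least one second".

```python
from dataclasses import dataclass


@dataclass
class OomState:
    """Hypothetical bookkeeping kept between allocation failures (seconds)."""
    first_failure: float = 0.0          # start of the current burst of failures
    last_failure: float = 0.0           # time of the most recent failure
    failure_count: int = 0              # failures counted in the current burst
    last_kill: float = float("-inf")    # last time a process was killed


def is_oom(state: OomState, now: float, free_swap_pages: int) -> bool:
    """Return True only when none of the five checks rules out OOM."""
    since_last = now - state.last_failure
    state.last_failure = now

    # 1. Swap space is left: not OOM.
    if free_swap_pages > 0:
        return False

    # 2. More than 5 seconds since the last failure: start a new burst.
    if since_last > 5:
        state.first_failure = now
        state.failure_count = 0
        return False

    # 3. The burst has lasted less than one second: not OOM yet
    #    (assumed reading of check 3).
    if now - state.first_failure < 1:
        return False

    # 4. Fewer than 10 failures counted in this burst: not OOM.
    state.failure_count += 1
    if state.failure_count < 10:
        return False

    # 5. A process was killed less than 5 seconds ago: not OOM.
    if now - state.last_kill < 5:
        return False

    # Every check passed: remember the kill and start a fresh burst.
    state.last_kill = now
    state.first_failure = now
    state.failure_count = 0
    return True
```

Calling is_oom() once per failed allocation with a steadily increasing timestamp shows the intent of the checks: a single failure, or a short burst of failures, never declares the system OOM.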

OOM Killer

When the system is finally tagged as being OOM, oom_kill() is triggered to select a process to terminate and thus free used memory. The selection relies on the badness() function, which returns a score (in points) for each process to select_bad_process(). The badness of a task is calculated from the following quantities:

* totalVMTask is initially the amount of RAM allocated to the process, to which the RAM allocated to its child processes can be added if the parent and child do not share the same allocated memory (i.e. forked processes).

* cputime is the sum of utime and stime: (utime + stime)

* runtime is the difference between the uptime and the start time: (uptime - start time)

All in all, this function selects a process that uses a lot of memory and has been running for a short amount of time, since processes that have been running for a long time are unlikely to be the cause of the issue.
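As a rough sketch of how these three quantities combine, the Python function below divides the memory size by the square root of the CPU time and by the fourth root of the run time, which is how the 2.6-era heuristic is commonly described. The formula itself, the function name, the parameter names and the units (seconds for CPU time, minutes for run time) are assumptions for illustration and are not taken from this post.

```python
import math


def badness(total_vm_pages: int, utime_s: float, stime_s: float,
            uptime_s: float, start_time_s: float) -> int:
    """Illustrative badness score: large and young processes score highest."""
    cputime = int(utime_s + stime_s)                # CPU time in seconds
    runtime = int((uptime_s - start_time_s) / 60)   # run time in minutes (assumed unit)

    points = total_vm_pages

    # Divide by sqrt(cputime); skip the division when the root is zero.
    s = math.isqrt(cputime)
    if s:
        points //= s

    # Divide by sqrt(sqrt(runtime)), again guarding against zero.
    s = math.isqrt(math.isqrt(runtime))
    if s:
        points //= s

    return points
```

With these assumptions, a process holding 100000 pages that has used 100 seconds of CPU over 256 minutes scores 2500, while the same process freshly started keeps the full 100000 points.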

Once the badness score of the process is determined, it is adjusted according to the /proc/<pid>/oom_adj value as follows:

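if (p->oomkilladj > 0)
    points <<= p->oomkilladj;       /* positive value: score multiplied by 2^oomkilladj */
else
    points >>= -(p->oomkilladj);    /* negative value: score divided by 2^|oomkilladj| */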

Rules of Exceptions

There are three exceptions to these rules:
1. If the process is a root process, meaning it has CAP_SYS_ADMIN capabilities, the points are divided by four
2. If it has CAP_SYS_RAWIO capabilities, the points are also divided by four (it has access to hardware, so killing it is not a good idea)
3. The points of a niced process are automatically doubled
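The adjustments described so far (the oom_adj shift and the three exceptions) can be put together in a short Python sketch. The function name, the boolean parameters and the order in which the adjustments are applied are assumptions for illustration only.

```python
def adjust_badness(points: int, oom_adj: int = 0,
                   has_cap_sys_admin: bool = False,
                   has_cap_sys_rawio: bool = False,
                   niced: bool = False) -> int:
    """Apply the exceptions and the oom_adj shift to a raw badness score."""
    # Niced processes have their points doubled.
    if niced:
        points *= 2

    # Root processes (CAP_SYS_ADMIN) are considered more valuable.
    if has_cap_sys_admin:
        points //= 4

    # Processes with direct hardware access (CAP_SYS_RAWIO) as well.
    if has_cap_sys_rawio:
        points //= 4

    # Same logic as the C fragment: shift left for a positive oom_adj,
    # shift right for a negative one.
    if oom_adj > 0:
        points <<= oom_adj
    elif oom_adj < 0:
        points >>= -oom_adj

    return points
```

For example, a raw score of 1000 becomes 250 for a root process, 2000 for a niced process, and 4000 when oom_adj is set to 2.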

Once a process is selected, the process list is reviewed again and all processes sharing the same mm_struct are killed as well. If the process has CAP_SYS_RAWIO capabilities, it is sent a SIGTERM so that it gets a chance to exit cleanly; otherwise it is sent a SIGKILL.

An administrator can always set the oom_adj value of a process to the OOM_DISABLE constant (defined as -17) in /proc/<pid>/oom_adj to prevent the process from being killed. However, recent patches and contributions from developers allow for a more sophisticated approach to this issue.

OOM_KILLER Controller

Nikanth Karthikesan from SUSE contributed a patch which introduces an OOM control group. Each group can be assigned a preference through the oom.priority field (also introduced by this patch). The group with the highest oom.priority is selected first to be killed when the system runs out of memory; the default value of oom.priority is 1.

You can review the patch at http://lkml.org/lkml/2009/1/29/220.

As a simple illustration, here is a small example of how the controller is used.

1. You need to mount the cgroup OOM pseudo filesystem

mount -t cgroup -o oom oom /mnt/oom-controller

2. Create a folder that will contain 2 files, the tasks and oom.priority files. The folder can be named anything, but as an example we will create 2 folders, one for the processes to kill first and one for the processes to never kill

mkdir /mnt/oom-controller/proc_{kill1st,notkill}

touch /mnt/oom-controller/proc_kill1st/tasks && touch /mnt/oom-controller/proc_kill1st/oom.priority

touch /mnt/oom-controller/proc_notkill/tasks && touch /mnt/oom-controller/proc_notkill/oom.priority

3. We can then add the pids of the processes to kill first to the tasks file of the proc_kill1st group

echo 1456 > /mnt/oom-controller/proc_kill1st/tasks

then set a priority of “50”

echo 50 > /mnt/oom-controller/proc_kill1st/oom.priority

Now if another group (e.g. proc_kill2nd) has a lower value set in oom.priority, the processes belonging to proc_kill1st will be killed first.

4. Now let’s set a group of processes to not be killed in case of OOM

echo 0 > /mnt/oom-controller/proc_notkill/oom.priority

Of course, you need to add the pids of your processes to /mnt/oom-controller/proc_notkill/tasks, but as you can guess, since the oom.priority is 0, those processes will never be killed by the oom_killer.
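The selection rule described in these steps (highest oom.priority goes first, priority 0 is never killed) can be modelled in a few lines of Python. This is only a toy model of the behaviour as described in this post, not the patch's code; the function name and the dictionary of group names are hypothetical.

```python
from typing import Dict, Optional


def pick_victim_group(priorities: Dict[str, int]) -> Optional[str]:
    """Return the group whose processes would be killed first, if any."""
    # Groups with oom.priority 0 are never candidates.
    killable = {group: prio for group, prio in priorities.items() if prio > 0}
    if not killable:
        return None
    # The highest oom.priority is selected first.
    return max(killable, key=killable.get)


groups = {"proc_kill1st": 50, "proc_kill2nd": 10, "proc_notkill": 0}
print(pick_victim_group(groups))
```

With the example groups above, proc_kill1st is selected; if only proc_notkill existed, no group would be selected at all.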

More can be found at http://lkml.org/lkml/2009/1/29/220, but that is the general overview of this OOM_KILLER Controller.

Is this an optimum solution?

No. It makes more sense in the long run for userspace applications to be notified by the kernel that memory should be freed, and to respond by freeing their caches. While this can be quite troublesome when a userspace program does not properly free its memory caches, or is not able to because of a change in the kernel memory manager, I still think the OOM mechanism should be acted upon in userspace and handled in kernelspace (as already done).

Cheers,

Ali

Categories: Unix / Linux

Dynamic Multipoint VPN – DMVPN

July 8th, 2010 Ali Abbas No comments

One of the most interesting features of DMVPN, in my personal opinion, is its extended support for VRF on MPLS networks.

Remember, VRF allows multiple instances of routing tables to co-exist on the same router at the same time.

Having said that, DMVPN helps scale out a traditional IPSEC hub-and-spoke VPN configuration by setting up permanent connections from the spoke routers to the hub router and temporary connections between the spoke routers as needed. This takes traffic off the hub router and therefore improves network performance, scalability and traffic control.

DMVPN relies on the following protocols:

- IPSEC: pre-shared keys used to secure the traffic

- mGRE: mGRE allows us to encapsulate multicast packets (e.g. OSPF packets) and to set up a pseudo-virtual tunnel interface to link our sites

- NHRP: Without NHRP, our GRE tunnel cannot be established. NHRP stands for “Next Hop Resolution Protocol” and allows our server to know what the peer sites' IPs are. The NHRP server (HUB) answers NHRP requests for IP discovery of peers so that tunnels can be formed.

- A routing protocol: OSPF, RIP, BGP etc…
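To illustrate the role NHRP plays among these protocols, here is a toy Python model of the hub's mapping table: spokes register their tunnel address together with their public address, and other spokes ask the hub to resolve a peer before building a direct tunnel. It is only a sketch of the idea, not the NHRP wire protocol; the class name, method names and addresses (taken from documentation ranges) are hypothetical.

```python
from typing import Dict, Optional


class NhrpHub:
    """Toy next-hop server: maps tunnel IPs to public (NBMA) IPs."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    def register(self, tunnel_ip: str, nbma_ip: str) -> None:
        # A spoke announces which public address its tunnel IP lives behind.
        self._cache[tunnel_ip] = nbma_ip

    def resolve(self, tunnel_ip: str) -> Optional[str]:
        # Another spoke asks where to send traffic for this tunnel IP.
        return self._cache.get(tunnel_ip)


hub = NhrpHub()
hub.register("10.0.0.2", "192.0.2.10")     # spoke A
hub.register("10.0.0.3", "198.51.100.20")  # spoke B

# Spoke A wants a direct tunnel to spoke B and asks the hub first.
print(hub.resolve("10.0.0.3"))
```

Once spoke A knows the public address of spoke B, the spoke-to-spoke tunnel can be built without sending the data through the hub.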

Important things to keep in mind

- IPSEC in “Transport Mode”

When setting up the tunnel, make sure to use “transport” mode with IPSEC: GRE already encapsulates the original IP packet, so the extra outer IP header added by tunnel mode is not needed. This allows you to save 20 bytes on the MTU ;-)

- Use RIP for default routes

I know, you are probably ready to pull out your hair, but in a large DMVPN network, RIP could scale better than another routing protocol such as OSPF… calculating adjacencies is CPU intensive ;-)

Categories: DMVPN, MPLS, Networking, WAN

Catalyst 6500 and ASIC issues

June 29th, 2010 Ali Abbas No comments

The related news can be found at http://www.networkworld.com/community/blog/asic-issues-delaying-cisco-switch

Now keep in mind, I have not read the bulletin published by Rodman & Renshaw, LLC, nor can I attest that this is the fundamental reason why the switches have been delayed. As for the Cat 6500 being fully replaced by the Nexus 7000, remember that Cisco’s Supervisor Engines for Modular Switches have a lifespan of 10 to 12 years, and a new 720 Supervisor Engine was released roughly a year and a half ago. You do the math ;-)

Cheers,

Ali

Categories: Cisco, Networking

firefox-pwgen 0.4.5 released

June 20th, 2010 Ali Abbas 1 comment

Hello there!

I know… it has been a long time since the last release, and I know many of you were awaiting the fix for the bug identified by Armin Juhlke @juhlke.de. Today I was finally able to put some time aside and look at it, so here is the new release.

This release includes some bug fixes and has undergone a medium code cleanup from the 0.4 branch, including some XUL improvements and a new feature.

Here is the raw changeLog

1. Bug reported by Armin Juhlke @juhlke.de
“The digit 0 is not excluded from generated password when specified in the list of excluded characters”

2. Feature added: Password History for the current session. The user is now able to select whether or not they want to keep a history of the generated passwords. Those passwords are not “saved” and only exist in memory; the user can clear the buffer at any time. The logged passwords do not survive a Firefox restart.

3. JS Preference Code Cleanup – Optimization in metadata table

4. XUL Interface + CSS major improvements

5. Added support for Firefox 3.7a6pre

The addon was just uploaded and is available at https://addons.mozilla.org/en-US/firefox/downloads/file/92313/pwgen-0.4.5-fx.xpi. Once it has been reviewed by the Firefox AMO Editors, it will be available on the addon’s main page.

Cheers,

Ali

Categories: General, pwgen-firefox

Cisco IOS Security: Quiet Period Login

June 17th, 2010 Ali Abbas No comments

Cisco’s IOS Quiet Period refers to the period during which telnet/ssh/http access is disabled for X seconds after Y failed login attempts.

While it is quite unusual to have router virtual access allowed from the WAN link, it does not hurt to go further and enable this Cisco feature to prevent a potential DoS or dictionary attack from the WAN link, or possibly from the LAN link as well.

The command used to enable the “Quiet Period” is “login block-for” in Global Configuration mode.

edge(config)#login block-for 600 attempts 5 within 2

In other words, block virtual logins for 10 minutes (600 seconds) after 5 failed attempts within 2 seconds.
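The behaviour of the command above can be modelled as a sliding window over failed attempts. The Python sketch below is only an illustration of the idea, not how IOS implements it; the class and method names are hypothetical, and the defaults mirror the example values (600, 5, 2).

```python
from collections import deque


class LoginBlocker:
    """Toy model of 'login block-for 600 attempts 5 within 2'."""

    def __init__(self, block_for: float = 600, attempts: int = 5,
                 within: float = 2) -> None:
        self.block_for = block_for
        self.attempts = attempts
        self.within = within
        self._failures = deque()            # timestamps of recent failures
        self._quiet_until = float("-inf")   # end of the current quiet period

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        # Forget failures that fall outside the observation window.
        while self._failures and now - self._failures[0] > self.within:
            self._failures.popleft()
        # Too many failures inside the window: start the quiet period.
        if len(self._failures) >= self.attempts:
            self._quiet_until = now + self.block_for
            self._failures.clear()

    def is_blocked(self, now: float) -> bool:
        return now < self._quiet_until
```

Five failures spread over four seconds do not trigger the quiet period in this model, while five failures within about a second block logins for the next 600 seconds.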

Further Options

While this command should be enough to get us where we want to be, it is important to consider the following:

1. Log failed login attempts

edge(config)# login on-failure log

You can view the login logs by issuing “show login failures”

2. Prevent administrative hosts from being locked out during the Quiet Period

login quiet-mode access-class {acl-name |acl-number}

edge(config)#login quiet-mode access-class adminIPs

By defining an access list named adminIPs that contains the range of IPs representing our administrative hosts, we can avoid being subject to the “Quiet Period” ourselves while it is in effect.

I hope that was informative,

Cheers,

Ali

Categories: Cisco, Networking