February 4, 2012

Linux CFS Algorithm and Virtual Runtime

Since the 2.6.23 kernel, the Linux kernel process scheduler previously O(1) was replaced by CFS - a Completely Fair Scheduler.

CFS uses a red-black tree as data-structure and unlike previous Unix process scheduler does not account a traditional time slice of process execution but accounts what is referred as the process virtual runtime, expressed in nanoseconds (as opposed to Hz or jiffies). The usage of a self-balanced tree as the red-black tree allows for a lookup of $$O(\log\ n)$$ time per the height of the tree, but more on this later.

###Virtual Runtime###

The Virtual Runtime, declaration vruntime is a variable part of the process inherited C structure as defined in linux/sched.h which accounts for the time a process run in relation to the number of running processes. Each process holds a process descriptor “task_struct” dynamically allocated via the slab allocator.

struct task_struct {
             volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
             void *stack;
             atomic_t usage;
             unsigned int flags;     /* per process flags, defined below */
             unsigned int ptrace;
    (...)
    #endif
             int on_rq;
             int prio, static_prio, normal_prio;
             unsigned int rt_priority;
             const struct sched_class *sched_class;
             struct sched_entity se;

The sched_entity structure linked in “task_struct” is defined as

    struct sched_entity {
             struct load_weight      load;           /* for load-balancing */
             struct rb_node          run_node;
             struct list_head        group_node;
             unsigned int            on_rq;
    
             u64                     exec_start;
             u64                     sum_exec_runtime;
             u64                     vruntime;
             u64                     prev_sum_exec_runtime;
    
             u64                     nr_migrations;
    
    #ifdef CONFIG_SCHEDSTATS
             struct sched_statistics statistics;
    #endif
    
    #ifdef CONFIG_FAIR_GROUP_SCHED
             struct sched_entity     *parent;
             /* rq on which this entity is (to be) queued: */
             struct cfs_rq           *cfs_rq;
             /* rq "owned" by this entity/group: */
             struct cfs_rq           *my_q;
    #endif
    };

While the other variables also play a role in CFQ decision’s algorithm, vruntime is by far the core variable which needs more attention as to understand the scheduling decision process. As stated earlier, CFQ does not account time slice as did previous schedulers, the vruntime is evaluated

The vruntime for each process is calculated as followed:

  1. Compute the time spent by the process on the CPU
  2. Weight the computed running time against the number of runnable processes

The update_curr function is responsible to calculate the running time spent by the process, which stores the value into an unsigned long variable delta_exec which is computed as followed:

    
     delta_exec = (unsigned long)(now - curr->exec_start);
with
     struct sched_entity *curr = cfs_rq->curr;
     u64 now = rq_of(cfs_rq)->clock_task;

delta_exec is passed unto __update_curr

    __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
                   unsigned long delta_exec)
    {
             unsigned long delta_exec_weighted;
    
             schedstat_set(curr->statistics.exec_max,
                           max((u64)delta_exec, curr->statistics.exec_max));
    
             curr->sum_exec_runtime += delta_exec;
             schedstat_add(cfs_rq, exec_clock, delta_exec);
             delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    
             curr->vruntime += delta_exec_weighted;
             update_min_vruntime(cfs_rq);
    
    #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
             cfs_rq->load_unacc_exec_time += delta_exec;
    #endif
    }

calc_delta_fair will return the weighted value of the process’s calculated runtime delta_exec in relation to number of runnable processes. The sub function used for that calculation is calc_delta_mine.

To keep in mind:

  • unsigned long delta_exec, is the computed running time of the process
  • unsigned long weight, is the nice value of the process
  • struct load_weight *lw, composed of 2 unsigned long “weight” and  “inv_weigh” (lw->weight and lw->inv_weight)
    calc_delta_mine(unsigned long delta_exec, unsigned long weight,
                     struct load_weight *lw)
    {
             u64 tmp;
    
      /*
      * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched
      * entities since MIN_SHARES = 2. Treat weight as 1 if less than
      * 2^SCHED_LOAD_RESOLUTION.
      */
             if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
                     tmp = (u64)delta_exec * scale_load_down(weight);
             else
                     tmp = (u64)delta_exec;
    
             if (!lw->inv_weight) {
                     unsigned long w = scale_load_down(lw->weight);
    
                     if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
                             lw->inv_weight = 1;
                     else if (unlikely(!w))
                             lw->inv_weight = WMULT_CONST;
                     else
                             lw->inv_weight = WMULT_CONST / w;
             }
    
    
             if (unlikely(tmp > WMULT_CONST))
                    tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
                             WMULT_SHIFT/2);
             else
                     tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);
    
             return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
    }

Once the vruntime is computed, it is stored into the inherited sched_entity structure of the process by calling

curr->vruntime += delta_exec_weighted;

in the previously seen __update_curr function, we also notice the call

update_min_vruntime(cfs_rq);

What this does is to compute the smallest vruntime of all runnable processes and store it at the leftmode node of the red-black tree.

The CFS selection algorithm for process to be run is extremely simple, it will run its red black tree and search for the smallest vruntime value pointing to the next runnable process. In other words, “run the process which is represented by the leftmost node in the tree”, since the leftmost node constains the smallest vruntime, referred in the source code as min_vruntime.

To conclude this post, it is important to note that CFS does not walk the whole tree since min_vruntime is referenced by rb_leftmost in the cfs_rq struct (kernel/sched.c)

    struct cfs_rq {
             struct load_weight load;
            unsigned long nr_running, h_nr_running;
    
             u64 exec_clock;
             u64 min_vruntime;
    #ifndef CONFIG_64BIT
             u64 min_vruntime_copy;
    #endif
    
             struct rb_root tasks_timeline;
             struct rb_node *rb_leftmost

as seen in this construct (kernel/sched_fair.c)

    static struct sched_entity *__pick_next_entity(struct sched_entity *se)
    {
             struct rb_node *next = rb_next(&se->run_node);
    
             if (!next)
                     return NULL;
    
             return rb_entry(next, struct sched_entity, run_node);
    }