Linux CFS Algorithm and Virtual Runtime
Since the 2.6.23 kernel, the Linux kernel process scheduler previously O(1) was replaced by CFS – a Completely Fair Scheduler.
CFS uses a red-black tree as data-structure and unlike previous Unix process scheduler does not account a traditional time slice of process execution but accounts what is referred as the process virtual runtime, expressed in nanoseconds (as opposed to Hz or jiffies). The usage of a self-balanced tree as the red-black tree allows for a lookup of O(log n) time per the height of the tree, but more on this later.
Virtual Runtime
The Virtual Runtime, declaration “vruntime” is a variable part of the process inherited C structure as defined in linux/sched.h which accounts for the time a process run in relation to the number of running processes. Each process holds a process descriptor “task_struct” dynamically allocated via the slab allocator.
struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ void *stack; atomic_t usage; unsigned int flags; /* per process flags, defined below */ unsigned int ptrace; (...) #endif int on_rq; int prio, static_prio, normal_prio; unsigned int rt_priority; const struct sched_class *sched_class; struct sched_entity se;
The “sched_entity” structure linked in “task_struct” is defined as
struct sched_entity { struct load_weight load; /* for load-balancing */ struct rb_node run_node; struct list_head group_node; unsigned int on_rq; u64 exec_start; u64 sum_exec_runtime; u64 vruntime; u64 prev_sum_exec_runtime; u64 nr_migrations; #ifdef CONFIG_SCHEDSTATS struct sched_statistics statistics; #endif #ifdef CONFIG_FAIR_GROUP_SCHED struct sched_entity *parent; /* rq on which this entity is (to be) queued: */ struct cfs_rq *cfs_rq; /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; #endif };
While the other variables also play a role in CFQ decision’s algorithm, vruntime is by far the core variable which needs more attention as to understand the scheduling decision process. As stated earlier, CFQ does not account time slice as did previous schedulers, the vruntime is evaluated
The vruntime for each process is calculated as followed:
1- Compute the time spent by the process on the CPU
2- Weight the computed running time against the number of runnable processes
The “update_curr” function is responsible to calculate the running time spent by the process, which stores the value into an unsigned long variable “delta_exec”. “delta_exec” is computed as followed:
delta_exec = (unsigned long)(now - curr->exec_start);
with
struct sched_entity *curr = cfs_rq->curr; u64 now = rq_of(cfs_rq)->clock_task;
“delta_exec” is passed unto __update_curr
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, unsigned long delta_exec) { unsigned long delta_exec_weighted; schedstat_set(curr->statistics.exec_max, max((u64)delta_exec, curr->statistics.exec_max)); curr->sum_exec_runtime += delta_exec; schedstat_add(cfs_rq, exec_clock, delta_exec); delta_exec_weighted = calc_delta_fair(delta_exec, curr); curr->vruntime += delta_exec_weighted; update_min_vruntime(cfs_rq); #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED cfs_rq->load_unacc_exec_time += delta_exec; #endif }
“calc_delta_fair” will return the weighted value of the process’s calculated runtime “delta_exec” in relation to number of runnable processes. The sub function used for that calculation is “calc_delta_mine”.
To keep in mind:
- unsigned long delta_exec, is the computed running time of the process
- unsigned long weight, is the nice value of the process
- struct load_weight *lw, composed of 2 unsigned long “weight” and “inv_weigh” (lw->weight and lw->inv_weight)
calc_delta_mine(unsigned long delta_exec, unsigned long weight, struct load_weight *lw) { u64 tmp; /* * weight can be less than 2^SCHED_LOAD_RESOLUTION for task group sched * entities since MIN_SHARES = 2. Treat weight as 1 if less than * 2^SCHED_LOAD_RESOLUTION. */ if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION))) tmp = (u64)delta_exec * scale_load_down(weight); else tmp = (u64)delta_exec; if (!lw->inv_weight) { unsigned long w = scale_load_down(lw->weight); if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST)) lw->inv_weight = 1; else if (unlikely(!w)) lw->inv_weight = WMULT_CONST; else lw->inv_weight = WMULT_CONST / w; } if (unlikely(tmp > WMULT_CONST)) tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight, WMULT_SHIFT/2); else tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT); return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX); }
Once the vruntime is computed, it is stored into the inherited sched_entity structure of the process by calling
curr->vruntime += delta_exec_weighted;
in the previously seen __update_curr function, we also notice the call
update_min_vruntime(cfs_rq);
What this does is to compute the smallest vruntime of all runnable processes and store it at the leftmode node of the red-black tree.
The CFS selection algorithm for process to be run is extremely simple, it will run its red black tree and search for the smallest vruntime value pointing to the next runnable process. In other words, “run the process which is represented by the leftmost node in the tree”, since the leftmost node constains the smallest vruntime, referred in the source code as min_vruntime.
To conclude this post, it is important to note that CFS does not walk the whole tree since min_vruntime is referenced by rb_leftmost in the cfs_rq struct (kernel/sched.c)
struct cfs_rq { struct load_weight load; unsigned long nr_running, h_nr_running; u64 exec_clock; u64 min_vruntime; #ifndef CONFIG_64BIT u64 min_vruntime_copy; #endif struct rb_root tasks_timeline; struct rb_node *rb_leftmost
as seen in this construct (kernel/sched_fair.c)
static struct sched_entity *__pick_next_entity(struct sched_entity *se) { struct rb_node *next = rb_next(&se->run_node); if (!next) return NULL; return rb_entry(next, struct sched_entity, run_node); }