Since the 2.6.23 kernel, the Linux kernel process scheduler previously O(1) was replaced by CFS - a Completely Fair Scheduler.
CFS uses a red-black tree as data-structure and unlike previous Unix process scheduler does not account a traditional time slice of process execution but accounts what is referred as the process virtual runtime, expressed in nanoseconds (as opposed to Hz or jiffies). The usage of a self-balanced tree as the red-black tree allows for a lookup of time per the height of the tree, but more on this later.
Virtual Runtime
The Virtual Runtime, declaration vruntime is a variable part of the process inherited C structure as defined in linux/sched.h which accounts for the time a process run in relation to the number of running processes. Each process holds a process descriptor “task_struct” dynamically allocated via the slab allocator.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
The sched_entity structure linked in “task_struct” is defined as
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
While the other variables also play a role in CFQ decision’s algorithm, vruntime is by far the core variable which needs more attention as to understand the scheduling decision process. As stated earlier, CFQ does not account time slice as did previous schedulers, the vruntime is evaluated
The vruntime for each process is calculated as followed:
- Compute the time spent by the process on the CPU
- Weight the computed running time against the number of runnable processes
The update_curr function is responsible to calculate the running time spent by the process, which stores the value into an unsigned long variable delta_exec which is computed as followed:
1 2 3 4 |
|
delta_exec is passed unto __update_curr
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
calc_delta_fair will return the weighted value of the process’s calculated runtime delta_exec in relation to number of runnable processes. The sub function used for that calculation is calc_delta_mine.
To keep in mind:
- unsigned long delta_exec, is the computed running time of the process
- unsigned long weight, is the nice value of the process
- struct load_weight *lw, composed of 2 unsigned long “weight” and “inv_weigh” (lw->weight and lw->inv_weight)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
|
Once the vruntime is computed, it is stored into the inherited sched_entity structure of the process by calling
1
|
|
in the previously seen __update_curr function, we also notice the call
1
|
|
What this does is to compute the smallest vruntime of all runnable processes and store it at the leftmode node of the red-black tree.
The CFS selection algorithm for process to be run is extremely simple, it will run its red black tree and search for the smallest vruntime value pointing to the next runnable process. In other words, “run the process which is represented by the leftmost node in the tree”, since the leftmost node constains the smallest vruntime, referred in the source code as min_vruntime.
To conclude this post, it is important to note that CFS does not walk the whole tree since min_vruntime is referenced by rb_leftmost in the cfs_rq struct (kernel/sched.c)
1 2 3 4 5 6 7 8 9 10 11 12 |
|
as seen in this construct (kernel/sched_fair.c)
1 2 3 4 5 6 7 8 9 |
|