Simple questions often have not so simple answers. One example is the question, What priority does this process have? This question normally comes up when a customer asks us to explain choices made by the CPU scheduler.
To answer it we need to discuss further sub-questions.
Like most operating systems Linux supports threads, which can be seen as a distinct execution flow. In the simplest case we have a process with just one thread, the main thread. While a thread models the execution flow, a process models shared resources between threads. The most prominent resource is memory. Within a process all threads share the same memory. Hence, many threads that run in parallel can work on the same memory to solve a problem.
The Linux CPU scheduler does not operate on processes, in fact it schedules tasks. Within the kernel the smallest scheduling entity is a task, it can be anything that can run on a CPU. It might be a kernel thread, a userspace thread, a threaded interrupt, etc. So don’t get confused, for simplicity we can assume that threads and tasks are the same for now.
Since the scheduler works on tasks, it ensures that each task (and therefore,
thread) has its own priority.
To learn about the priority one must inspect all threads of a process.
By default most Linux tools just show processes, the priorities shown there
are the priorities of the main thread. Many offer a way to show threads
too. For example in the top utility by pressing H
.
Now we are able to answer the first sub-question. We now know that process priorities actually apply per thread, so we need to inspect all threads of a process. If the process in question has just one thread or we’re only interested in a specific thread, things become easy again.
POSIX requires that an operating system offers different scheduling policies, also
sometimes called scheduling classes.
By default all tasks in Linux use the default schedulung policy, called SCHED_NORMAL
or SCHED_OTHER
.
The scheduling algorithm behind SCHED_OTHER
is not defined by POSIX and is implementation
specific. Linux makes sure that all tasks within this scheduling policy are treated
fairly. This means no task will starve and from the user’s point of view it seems like
all programs are running all time, nothing will feel sluggish but heavy workloads
still get stuff done in a timely manner.
In this policy every task has a priority, it is called nice value.
The nice value denotes how nice a task is to other tasks. The value ranges from 19
to -20
.
Where -20
means that a task is not nice at all and wants to run as often as possible,
19
signifies it is very nice to other tasks and is fine when the scheduler picks
other tasks much more often.
Default is 0
, meaning the task is neither selfish nor too devoted.
The current algorithm behind SCHED_OTHER
is called Completely Fair Scheduling (CFS).
CFS offers two more policies: SCHED_BATCH
and SCHED_IDLE
. They make sure that a
thread either runs when possible in one batch, thus
interactivity does not matter, or when SCHED_IDLE
is selected that the thread shall
run when no other task can run, so the very least priority.
Please note that SCHED_IDLE
has nothing to do with the idle thread.
The idle thread is scheduled when absolutely nothing runable was found by the CPU
scheduler.
When SCHED_BATCH
or SCHED_IDLE
are selected CFS’s priority have no meanining,
although some tools report a number.
Linux also implements the POSIX Realtime Extensions. Using these extensions time
critical programs can run on Linux that have to meet strict timing constraints.
It offers two policies: SCHED_FIFO
and SCHED_RR
.
These policies allow deterministic scheduling between threads.
Both policies have a priority between 1
and 99
, where 1
is the lowest.
The scheduler will pick a runnable task depending on the priority.
Linux offers another scheduling policy, SCHED_DEADLINE
. This policy does not
allow the setting of a priority, so it’s irrelevant here.
As opposed to a priority it allows setting a deadline by which a given work
would be completed. So the CPU scheduler has to make sure that the task
runs for long enough before the deadline.
One might wonder how all these policies relate to each other since they can be mixed on the same system. The Linux CPU scheduler consists of five decision algorithms that select tasks, they get executed in the following order.
Stop-Task: This algorithm can only be used by one internal kernel thread to shut down a given CPU. The sole purpose is to preempt the tasks of a given CPU before it can be shut down.
Deadline:
It implements the user visible SCHED_DEADLINE
policy.
If a SCHED_DEADLINE
thread is present on the system it will run before
all other user visible threads, always.
Realtime:
SCHED_FIFO
and SCHED_RR
policies are implemented here. So realtime threads
always run before regular tasks and right after SCHED_DEADLINE
threads,
if present.
Fair (best known as CFS):
CFS implements SCHED_OTHER
, SCHED_BATCH
and SCHED_IDLE
. Most threads
present on a system are handled by this algorithm.
Idle: If no algorithm finds a runnable thread the idle algorithm schedules the idle task on the CPU. The idle task is always runnable and puts the CPU in idle mode.
With this knowledge of scheduling policies we can answer the second sub-question.
SCHED_OTHER
has a priority between 19
and -20
, where 19
is the lowest and -20
the highest.
SCHED_FIFO
and SCHED_RR
threads use a priority between 1
and 99
, where 1
is the lowest.
Internally SCHED_IDLE
and SCHED_BATCH
tasks have a nice value too, but the scheduling algorithm does
not honour it. Don’t get confused when a tool reports a nice value for SCHED_BATCH
or SCHED_IDLE
.
The ps tool is powerful and allows us to query this information.
For example: ps axHo pid,lwp,comm,policy,nice,rtprio
.
So we ask ps for a full list of all threads on the system, as well as their process id, thread id, short name, scheduling policy, nice value and realtime-priority.
ps reports SCHED_DEADLINE
as DLN, SCHED_OTHER
as TS, SCHED_BATCH
as B, SCHED_IDLE
as IDL, SCHED_FIFO
as FF and SCHED_RR
as RR.
So far we saw that priorities of type nice range from 19
to -20
and realtime from 1
to 99
.
These are the ranges you can always expect from process tools like ps.
Things get complicated when raw kernel interfaces are used or kernel internal data is analyzed. You’ll face such situations when developing your own tools or using Linux’s kernel instrumentation facilities.
A prominent example is the getpriority()
system call. It returns the nice value of a thread.
If used in C, it will return an integer between -20
and 19
which is the expected nice value,
but only if the C library performs a fixup.
The raw kernel system call will return a shifted value within the range 1
to 40
.
The reason behind the shift is Linux’s system call ABI. Negative return values
represent error numbers.
So if your application uses system calls directly without a libc, this might be confusing.
shifted value | nice value |
---|---|
1 | -20 |
.. | .. |
21 | 0 |
.. | .. |
40 | 19 |
Using this system call makes only sense if the current scheduling policy is SCHED_OTHER
.
For all other policies the returned priority is meaningless.
When you work with kernel instrumentation tools such as ftrace you will encounter the
deep internals of Linux.
Developers commonly use Linux’s sched_switch
trace point to see which task is picked
by the CPU scheduler.
ftrace will report a line like this one:
[002] d... 565635.368831: sched_switch: prev_comm=kworker/u8:2 prev_pid=891 prev_prio=120 prev_state=R+ ==> next_comm=kworker/u9:4 next_pid=983 next_prio=100
We observe that the scheduler on CPU 2 schedules from PID 891 to PID 983.
Please note that the PID in this context is not a process id as userspace would expect - it is in fact a thread id. Within the kernel, PID is the thread id.
What userspace gets via the getpid()
system call is actually the thread group id.
The alert reader will notice that this trace point has more than just the PID, it also shows the priorities of both tasks. However, they don’t make sense: prev_prio
is 120
and next_prio
is 100
. Neither are in the range of a nice level or a real time priority. Additionally, the trace point does not indicate what scheduling policy is being used.
To decipher these values we need to know that Linux internally uses a unified priority among all scheduling policies.
It ranges from -1
to 139
, where -1
is the highest and 139
the lowest priority.
The unified priority is calcualted as table shows:
policy | specific priority | unified priority |
---|---|---|
DEADLINE | - | -1 |
RR/FIFO | 99 | 0 |
RR/FIFO | 98 | 1 |
.. | .. | .. |
RR/FIFO | 1 | 98 |
OTHER | -20 | 100 |
.. | .. | .. |
OTHER | 0 | 120 |
.. | .. | .. |
OTHER | 19 | 139 |
With this information we can translate prev_prio
to a nice value of 0
,
and next_prio
to a nice value of -20
.
A common pitfall when working with real time applications is to confuse real time priorities from kernel internals and reported by user space tools. Both are in the same range but reversed.
Attentive readers will notice that unified priority 99
is not claimed by
any policy. This is an inconsistency within the Linux scheduling code and as of
writing (v5.17-rc3) still present.
The proc filesystem is the source of all process information tools such as ps display. For each thread it offers a stat file which contains, beside many other statisticts, all stats about the scheduling policy.
To learn the priority of a task, the 18th and 19th field of /proc/<pid>/task/<tid>/stat
is
of interest.
If a task has the scheduling policy SCHED_OTHER
the 19th field is non-zero and exhibits
the nice value.
The 18th field is more generic and exposes the unified priority of a task,
but shifted by -100
.
shifted value | unified priority |
---|---|
-101 | -1 |
-100 | 0 |
.. | .. |
-51 | 49 |
.. | .. |
39 | 139 |
SCHED_DEADLINE
, SCHED_OTHER
, SCHED_BATCH
, SCHED_IDLE
, SCHED_FIFO
or SCHED_RR
.SCHED_OTHER
, SCHED_FIFO
and SCHED_RR
allow priorities.SCHED_OTHER
priority is the so-called nice value, ranging from -20
to 19
.SCHED_FIFO
and SCHED_RR
support priorities between 1
and 99
.Publish date
02.02.2022
Category
linux
Authors
Richard Weinberger
+43 5 9980 400 00 (email preferred)
sigma star gmbh
Eduard-Bodem-Gasse 6, 1st floor
6020 Innsbruck | Austria