Overview

A computer has a small number of CPU cores and a large number of threads competing for them. The kernel's scheduler decides, many times per second, which threads run on which cores and for how long. The decision rule depends on the scheduling class each thread belongs to. Classes are a small set of named policies that express different goals: throughput with fairness, real-time responsiveness, deferrable but CPU-hungry batch work, and so on.

This page covers what classes Peios offers, what each is for, and what authority is needed to change them.

The classes at a glance

Every thread belongs to exactly one scheduling class at any time. The kernel evaluates classes in priority order: any thread in a higher-priority class runs before any thread in a lower-priority one is even considered.

Class	Policy names	Use case
Stop	(kernel only)	Highest priority. Reserved for kernel-internal stop-machine operations. Not user-accessible.
Deadline	`SCHED_DEADLINE`	Threads with strict timing requirements expressed as runtime within a deadline within a period. The scheduler uses earliest-deadline-first ordering.
Real-time	`SCHED_FIFO`, `SCHED_RR`	Threads that must run when ready, regardless of fairness. Priorities 1–99.
Normal	`SCHED_OTHER` (default), `SCHED_BATCH`, `SCHED_IDLE`	The default time-sharing class — fairness across competing workloads using virtual-runtime accounting.
Idle	(kernel-internal)	The per-CPU idle thread. Runs when nothing else is runnable.

A thread in SCHED_FIFO will preempt any SCHED_OTHER thread regardless of nice values; a SCHED_DEADLINE thread will preempt any SCHED_FIFO thread. The class is the first dimension of priority; values within a class are the second.

The Normal class

SCHED_OTHER (also spelled SCHED_NORMAL) is the default class. Every newly-created thread starts here unless something explicitly says otherwise. The Normal class implements fair time-sharing — over time, runnable threads get CPU time roughly in proportion to their relative weights, where weight is derived from the nice value.

The implementation is the kernel's EEVDF scheduler (since kernel 6.6) — Earliest Eligible Virtual Deadline First. EEVDF replaced the earlier CFS (Completely Fair Scheduler). User-visible behaviour is similar: nice values control relative share, and the scheduler approximates "every runnable thread should get its fair slice over time." The internals are different but applications written for CFS continue to work without changes under EEVDF.

Nice values

Nice is a per-thread integer in the range −20 to +19 that scales the share of CPU a Normal-class thread receives:

Nice	Effect
−20	Maximum share. Roughly 87× the share of a nice-0 thread.
0	Default.
+19	Minimum share. Roughly 1/68 the share of a nice-0 thread.

Each step of one nice level changes a thread's weight by about 1.25×, so the extremes are far apart. These ratios describe the relative CPU split between two otherwise-equal, CPU-bound threads competing alone; with more threads, or threads that sleep, the actual division differs.
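As a rough illustration of these ratios, the sketch below models weight as 1024 / 1.25^nice. That formula is an assumption made for the example: the kernel works from a precomputed table, so exact figures differ slightly at the extremes. `weight` and `cpu_split` are hypothetical helper names, not part of any API.

```python
def weight(nice: int) -> float:
    """Approximate scheduler weight: nice 0 maps to 1024, each step scales by 1.25."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in the range -20..19")
    return 1024 / (1.25 ** nice)


def cpu_split(nice_a: int, nice_b: int) -> tuple:
    """Share of one CPU for two CPU-bound threads competing alone."""
    wa, wb = weight(nice_a), weight(nice_b)
    total = wa + wb
    return wa / total, wb / total


if __name__ == "__main__":
    print(round(weight(-20) / weight(0), 1))  # ratio of nice -20 to nice 0
    print(cpu_split(0, 1))                    # two threads one nice step apart
```

Under this model a single nice step between two competing threads gives a split of roughly 56% to 44%, which is why a difference of only a few levels is already clearly visible.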

A thread can lower its own priority freely (set nice to a higher value than current); there is no privilege check for being polite. Raising priority (setting nice lower than current, which includes returning to an earlier, lower nice value after having raised it) requires SeIncreaseBasePriorityPrivilege. The same privilege gates real-time elevation; see below.

The relevant syscalls are nice(), setpriority(), and getpriority(). setpriority() accepts a thread, a process group, or a user as the target.
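A minimal sketch of the asymmetry, using Python's `os` wrappers around the same syscalls. It assumes a refused elevation surfaces as `PermissionError`, as it does on Linux; `be_polite` and `try_raise_priority` are hypothetical helper names.

```python
import os


def be_polite(step: int = 1) -> int:
    """Raise the caller's nice value by `step` and return the new value."""
    # os.nice() adds the increment to the current value; no privilege is needed
    return os.nice(step)


def try_raise_priority(target_nice: int) -> bool:
    """Try to set a more favourable nice value; report whether it was allowed."""
    try:
        os.setpriority(os.PRIO_PROCESS, 0, target_nice)
    except PermissionError:
        return False
    return True


if __name__ == "__main__":
    before = os.getpriority(os.PRIO_PROCESS, 0)
    after = be_polite(1)
    print(before, "->", after)
    print("restore allowed:", try_raise_priority(before))
```

Without the privilege, the final call is expected to report False: the step up in nice is one-way.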

`SCHED_BATCH`

SCHED_BATCH is a Normal-class variant that hints to the scheduler "I am CPU-bound and don't care about responsiveness." The scheduler then treats the thread as CPU-intensive and mildly disfavours it on wake-up — it won't preempt other runnable work the moment it wakes — which trades a little latency for fewer context switches and slightly better throughput. Useful for compute jobs that don't interact with users.

SCHED_BATCH does not change the nice-based share; it only changes preemption behaviour. Setting it requires no special privilege — it is a self-imposed quality-of-service hint.
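Because no privilege is involved, opting in is a single call. The sketch below uses Python's `os.sched_setscheduler()` wrapper and assumes the platform exposes `os.SCHED_BATCH`; `enter_batch` is a hypothetical helper name.

```python
import os


def enter_batch() -> bool:
    """Switch the calling thread to SCHED_BATCH; return True on success."""
    try:
        # Normal-class policies always take a static priority of 0
        os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    except OSError:
        return False
    return os.sched_getscheduler(0) == os.SCHED_BATCH


if __name__ == "__main__":
    print("running as batch:", enter_batch())
```

The nice value is left as it was, so a batch job that should also take a smaller share still needs to raise its nice value separately.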

`SCHED_IDLE`

SCHED_IDLE is the lowest-priority Normal-class variant. It runs only when nothing in SCHED_OTHER or SCHED_BATCH is runnable. Use it for background work that should never compete with anything important — file indexing, opportunistic cache warming, optional housekeeping.

SCHED_IDLE does not require privilege to enter. The nice value has no effect on a SCHED_IDLE thread: its share is fixed below the entire Normal-class nice range.

Real-time classes

When a workload has hard timing requirements — audio processing, control loops, network packet handling at sub-millisecond latency — fairness is the wrong goal. Real-time scheduling abandons fairness in favour of "this thread runs the moment it is ready."

`SCHED_FIFO`

SCHED_FIFO runs at a fixed real-time priority (1–99) and continues running until it voluntarily yields, blocks, or is preempted by a higher-priority thread. There is no time-slicing within a priority level. If two SCHED_FIFO threads at priority 50 are runnable on the same CPU, whichever got the CPU first keeps it until it blocks, yields, or exits; the other waits until then.

This is the scheduling model audio interrupt handlers, sensor pollers, and packet-pumping threads usually want.

`SCHED_RR`

SCHED_RR is SCHED_FIFO with round-robin time-slicing within a priority level. When two SCHED_RR threads at the same priority are both runnable, the scheduler rotates them on a fixed quantum (sched_rr_get_interval() returns the quantum). Higher-priority RR or FIFO threads still preempt them; lower-priority threads still wait.

SCHED_RR is useful when several equally-urgent threads share a priority level and no one of them should monopolise the CPU.
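The difference between the two policies at a single priority level can be seen in a toy single-CPU model. This is an illustrative simulation, not the kernel's algorithm: `run_same_priority`, its tick-based time, and the burst lengths are all assumptions made for the example.

```python
from collections import deque


def run_same_priority(bursts, quantum=None):
    """Toy single-CPU model of equal-priority real-time threads.

    bursts  -- mapping of thread name to CPU ticks needed before it blocks
    quantum -- None models SCHED_FIFO (no slicing); an int models SCHED_RR
    Returns the order in which ticks are executed, as a list of names.
    """
    queue = deque(bursts.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        slice_len = remaining if quantum is None else min(quantum, remaining)
        timeline.extend([name] * slice_len)
        remaining -= slice_len
        if remaining:
            queue.append((name, remaining))  # RR: rejoin at the back of the queue
    return timeline


if __name__ == "__main__":
    work = {"A": 3, "B": 2}
    print(run_same_priority(work))             # FIFO: A runs until it is done
    print(run_same_priority(work, quantum=1))  # RR: A and B alternate
```

With FIFO, thread B gets no CPU time until A has finished its burst; with RR, both make progress from the first quantum onwards.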

`SCHED_DEADLINE`

SCHED_DEADLINE is the most expressive real-time policy. A thread declares three numbers:

runtime — how much CPU time it needs, per period
deadline — by when, relative to period start, it must finish
period — the cadence at which it runs

The kernel admits the thread only if it can guarantee these numbers will be met given the existing deadline workload. Once admitted, the thread is scheduled by earliest-deadline-first ordering so that it receives its runtime before each deadline. If the kernel cannot make that guarantee, sched_setattr() fails; the thread is not silently degraded into a best-effort job.

Use SCHED_DEADLINE when you have well-characterised periodic work and need analytical timing guarantees, not just "high priority."
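The admission decision can be pictured as a utilisation check. The sketch below is a simplified model, not the kernel's exact test: it assumes the three parameters must nest (runtime ≤ deadline ≤ period) and that total runtime/period across admitted threads must fit within a bandwidth budget. The 0.95 default is an assumption borrowed from the usual Linux real-time bandwidth setting, and `DeadlineTask` and `admits` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadlineTask:
    runtime_ns: int
    deadline_ns: int
    period_ns: int

    def utilisation(self) -> float:
        """Fraction of one CPU this task reserves."""
        return self.runtime_ns / self.period_ns


def admits(existing, candidate, cpus=1, rt_bandwidth=0.95):
    """Simplified admission test: sane parameters and total utilisation fits."""
    # The three numbers must nest: runtime <= deadline <= period
    if not 0 < candidate.runtime_ns <= candidate.deadline_ns <= candidate.period_ns:
        return False
    total = sum(t.utilisation() for t in existing) + candidate.utilisation()
    return total <= cpus * rt_bandwidth


if __name__ == "__main__":
    audio = DeadlineTask(10_000_000, 30_000_000, 100_000_000)  # 10 ms every 100 ms
    video = DeadlineTask(50_000_000, 100_000_000, 100_000_000)
    print(admits([audio], video))
```

A rejected candidate in this model corresponds to the failing sched_setattr() call described above: the caller learns immediately that the guarantee cannot be given.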

Privilege

Switching a thread to any real-time class — including SCHED_DEADLINE — requires SeIncreaseBasePriorityPrivilege. The same privilege gates raising nice into negative values, raising real-time priority within the class, and switching between real-time policies. There is no separate "real-time privilege"; real-time is conceptually "elevating priority" and the spec maps it to the same gate as nice elevation, the same way Windows uses SeIncreaseBasePriorityPrivilege for REALTIME_PRIORITY_CLASS and Linux uses CAP_SYS_NICE for both.

Layered defences

A misbehaving real-time thread can starve everything else on its CPU, including kernel threads. Peios honours the substrate's rate-limiting defences, both per-process and system-wide:

Mechanism	What it caps
`RLIMIT_RTPRIO`	The maximum real-time priority a process can request without holding `SeIncreaseBasePriorityPrivilege`. Raising this rlimit grants narrow real-time capability without granting full priority elevation.
`RLIMIT_RTTIME`	The CPU time a real-time thread can consume without making a blocking system call. A thread that exceeds the soft limit is sent `SIGXCPU`; at the hard limit it is killed. Defends against runaway tight loops.
`kernel.sched_rt_runtime_us` / `kernel.sched_rt_period_us`	System-wide cap on aggregate real-time CPU consumption per period. The kernel reserves the remainder for non-RT work.

The system-wide values are sysctls, configured under \System\Scheduling\ in the registry and applied by ksyncd. The per-process rlimits are set with setrlimit() or inherited from a supervisor.
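For inspection, the limits can be read back without privilege. This sketch uses Python's `resource` module for the rlimits and assumes the sysctls are visible under `/proc/sys` as on Linux; if they are not, `read_sysctl` returns None. All three function names are hypothetical.

```python
import resource
from pathlib import Path


def rt_limits():
    """Return the calling process's (soft, hard) real-time rlimits."""
    return {
        "RLIMIT_RTPRIO": resource.getrlimit(resource.RLIMIT_RTPRIO),
        "RLIMIT_RTTIME": resource.getrlimit(resource.RLIMIT_RTTIME),
    }


def read_sysctl(name: str):
    """Read an integer sysctl from /proc/sys, or None if it is not exposed."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return int(path.read_text())
    except (OSError, ValueError):
        return None


def rt_bandwidth_fraction(runtime_us: int, period_us: int) -> float:
    """Fraction of each period available to real-time work (negative = uncapped)."""
    if runtime_us < 0:
        return 1.0
    return runtime_us / period_us


if __name__ == "__main__":
    print(rt_limits())
    runtime = read_sysctl("kernel.sched_rt_runtime_us")
    period = read_sysctl("kernel.sched_rt_period_us")
    if runtime is not None and period:
        print(rt_bandwidth_fraction(runtime, period))
```

With a runtime of 950000 µs in a period of 1000000 µs, real-time work is capped at 95% of each period and the remaining 5% is left for everything else.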

Setting scheduling parameters

Three syscall pairs cover scheduling-parameter manipulation:

Syscall	Purpose
`sched_setscheduler()` / `sched_getscheduler()`	Set or get the scheduling class.
`sched_setparam()` / `sched_getparam()`	Set or get the priority within the current class (real-time priority; ignored for normal classes).
`sched_setattr()` / `sched_getattr()`	Extended interface taking a `struct sched_attr` covering class, priority, deadline parameters, latency-nice, and util-clamp values in one call. Required for `SCHED_DEADLINE`.

sched_setattr() is the modern interface; new code should prefer it. The older calls remain functional for compatibility.

sched_yield() is unchanged from substrate — a thread voluntarily relinquishes the CPU. For real-time threads it has well-defined semantics (yield to others at the same priority); for normal threads it's a hint to the scheduler that other work might be runnable.

sched_get_priority_min() and sched_get_priority_max() return the legal priority range for a given class.
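A short sketch of querying those ranges through Python's `os` wrappers, which call the same syscalls. The set of policies listed and the helper name `priority_ranges` are choices made for the example.

```python
import os

# Policies to query; the names are only labels for the output
POLICIES = {
    "SCHED_OTHER": os.SCHED_OTHER,
    "SCHED_FIFO": os.SCHED_FIFO,
    "SCHED_RR": os.SCHED_RR,
}


def priority_ranges():
    """Map each policy name to its (min, max) legal priority."""
    return {
        name: (os.sched_get_priority_min(policy), os.sched_get_priority_max(policy))
        for name, policy in POLICIES.items()
    }


if __name__ == "__main__":
    for name, (low, high) in priority_ranges().items():
        print(f"{name}: {low}..{high}")
```

Querying the range at run time, rather than hard-coding 1 and 99, keeps a program correct if it is ever moved to a system with a different range.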

Inheritance and `SCHED_RESET_ON_FORK`

Scheduling parameters are part of a thread's state. Like other thread state, they survive clone() (inherited by both threads and children) and survive exec() (the new program image runs under whatever scheduling parameters the thread had).

For real-time work this can be a hazard: a privileged service running SCHED_FIFO that forks children will have those children running at SCHED_FIFO too, even if the children don't need it. SCHED_RESET_ON_FORK is a flag attached to the parent's scheduling attributes that, when set, causes children created by clone() (without CLONE_THREAD) to revert to SCHED_OTHER at default nice. Real-time supervisors should set it on themselves so non-RT children don't inadvertently inherit RT scheduling.

SCHED_RESET_ON_FORK does not affect the parent itself, only its future children. clone() with CLONE_THREAD (creating threads in the same process) is unaffected — threads always share the process's primary token and scheduling attributes within their thread group.
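The effect can be observed with a fork. This sketch assumes the caller is allowed to enter SCHED_FIFO; if not, it returns None instead of demonstrating anything. `child_policy_with_reset` is a hypothetical helper name, and note that the caller remains in SCHED_FIFO afterwards.

```python
import os


def child_policy_with_reset(priority: int = 1):
    """Enter SCHED_FIFO with SCHED_RESET_ON_FORK, fork, and report the child's policy.

    Returns the child's policy number, or None if real-time elevation was refused.
    """
    policy = os.SCHED_FIFO | os.SCHED_RESET_ON_FORK
    try:
        os.sched_setscheduler(0, policy, os.sched_param(priority))
    except OSError:
        return None  # caller lacks the authority to enter a real-time class

    pid = os.fork()
    if pid == 0:
        # Child: exit with its own policy number so the parent can read it
        os._exit(os.sched_getscheduler(0) & 0xFF)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    result = child_policy_with_reset()
    if result is None:
        print("real-time elevation refused")
    else:
        print("child reverted to SCHED_OTHER:", result == os.SCHED_OTHER)
```

Passing the flag in the same call that selects the real-time policy avoids a window in which a child could be forked with real-time scheduling still attached.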

Latency hints and utilisation clamping

Two newer per-thread hints exist on top of the class system:

sched_latency_nice — a hint to the Normal-class scheduler about preference for low latency vs throughput. Lower values prefer low latency (faster preemption on wake-up); higher values prefer throughput. Applies only to Normal-class threads.
sched_util_min / sched_util_max — utilisation clamping. Tells the cpufreq governor "treat this thread as needing at least X% / at most Y% of CPU capacity." Used to coax the governor into scaling frequency more aggressively for latency-sensitive work, or to cap a hot loop.

Both are set via sched_setattr(). Setting either to a non-default value requires SeIncreaseBasePriorityPrivilege.

Core scheduling and SMT co-scheduling

Modern x86 CPUs implement simultaneous multithreading (SMT) — two logical CPUs share the execution resources of a single physical core. Spectre-class side-channel attacks have shown that an untrusted thread on one SMT sibling can extract information from a sensitive thread on the other. The mainstream mitigation is core scheduling — the kernel guarantees that only threads belonging to the same trust domain run simultaneously on SMT-sibling CPUs.

Core scheduling is requested via prctl(PR_SCHED_CORE, ...). A thread or process is assigned a cookie, an opaque identifier, and the kernel ensures that no SMT-sibling CPU runs a thread with a different cookie at the same time. Threads with the same cookie are co-schedulable; threads with different cookies are mutually exclusive on SMT siblings.

Operation	Effect
`PR_SCHED_CORE_CREATE`	Create a new cookie for the calling thread or process.
`PR_SCHED_CORE_SHARE_FROM`	Adopt another thread's cookie (the calling thread joins the target's co-scheduling group).
`PR_SCHED_CORE_SHARE_TO`	Push the calling thread's cookie onto another thread.
`PR_SCHED_CORE_GET`	Read a thread's cookie.

Setting a cookie on the calling thread requires no privilege. Setting a cookie on another process requires PROCESS_QUERY_INFORMATION on the target's process SD and is also subject to PIP dominance — a non-dominant caller cannot manipulate a Protected process's core-scheduling configuration. This is the same authority pattern used elsewhere when one process wants to manage another's properties: process SD for "may I touch this process at all," PIP for "may I cross this protection level."

Core scheduling is the recommended mitigation when running mutually-untrusting workloads on the same machine without disabling SMT entirely.
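The cookie rule can be pictured with a toy selection function for one SMT core. It is an illustration only: the real scheduler works per runqueue and is considerably more involved, and `pick_for_core` and its tuple layout are assumptions made for the example.

```python
def pick_for_core(runnable, siblings=2):
    """Toy model: choose which threads may run together on one SMT core.

    runnable -- list of (name, priority, cookie); a larger priority wins
    Returns one entry per sibling CPU; None means that sibling stays idle.
    """
    if not runnable:
        return [None] * siblings
    ordered = sorted(runnable, key=lambda t: t[1], reverse=True)
    leader_cookie = ordered[0][2]
    # Only threads carrying the leader's cookie may occupy the other siblings
    chosen = [name for name, _prio, cookie in ordered if cookie == leader_cookie]
    chosen = chosen[:siblings]
    return chosen + [None] * (siblings - len(chosen))


if __name__ == "__main__":
    threads = [("tenant-a-1", 10, 1), ("tenant-b-1", 9, 2), ("tenant-a-2", 5, 1)]
    print(pick_for_core(threads))
```

In the example, the second tenant's thread is passed over despite its higher priority, and a sibling is left idle whenever no matching thread exists. That idle time is the throughput cost paid for the isolation.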

Preemption model

The kernel's preemption model is the rule that decides when a running task can be involuntarily switched off the CPU. The model is selected at kernel-build time:

Model	Behaviour
`PREEMPT_NONE`	No forced preemption of kernel code: the kernel switches tasks only at explicit `schedule()` points. Highest throughput; worst latency. Server-style.
`PREEMPT_VOLUNTARY`	Same as `PREEMPT_NONE` plus added preemption checkpoints. Better latency at modest throughput cost.
`PREEMPT_LAZY`	Sits between voluntary and full. The scheduler defers preemption of `SCHED_OTHER` tasks until the next return-to-userspace point, giving them more uninterrupted runtime than `PREEMPT_FULL` while still preempting real-time tasks immediately. On modern architectures this replaces `PREEMPT_NONE` and `PREEMPT_VOLUNTARY` as the throughput-preferring option.
`PREEMPT_FULL`	Preempt anywhere not in a critical section. Desktop / interactive default.
`PREEMPT_RT`	Full real-time preemption. Almost everything is preemptible, including most kernel critical sections (their spinlocks are replaced by sleeping mutexes). Required for hard real-time workloads. Mainline since 6.12.

There are two layers to choosing a model, and they are not the same kind of choice:

PREEMPT_RT versus everything else is a build-time choice. A real-time image (audio, control loops, networking gateways) is compiled with PREEMPT_RT; you cannot switch a running kernel into full real-time preemption. Peios provides reference images at the RT and non-RT levels, and downstream image builders pick whichever matches their workload.
Among the non-real-time models (PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT_LAZY, PREEMPT_FULL) the selection is not baked into the build on the architectures that support dynamic preemption (x86 among them). The kernel patches its preemption checkpoints in place, so the model can be chosen at boot and even changed on a running system. PREEMPT_LAZY is the throughput-preferring default in this set; PREEMPT_FULL the latency-preferring one. On architectures without dynamic-preemption support the non-RT model is fixed at build time.

So preemption is unlike CPU isolation: isolation (isolcpus= and friends, covered in CPU Affinity and Isolation) is boot-cmdline-and-reboot only, whereas the non-RT preemption model is genuinely runtime-tunable on supporting hardware.

Open Peios decision. Whether Peios surfaces the runtime non-RT preemption selection as a registry knob under \System\Scheduling\ (managed by ksyncd, alongside the RT bandwidth caps) or leaves it to the boot command line is not yet settled — flagged for design.

SMP-only

Peios kernels are built unconditionally with SMP support. Uniprocessor (UP-only) kernel code paths are not retained; single-CPU machines run the SMP scheduler with one CPU. Peios does not target hardware where the negligible SMP overhead on a single CPU would matter, so carrying a separate UP configuration buys nothing. The practical consequence is that any kernel-config knob conditional on CONFIG_SMP is unconditional on Peios.

`rseq` time slice extension

Restartable sequences (rseq) provide a per-thread shared region the kernel updates with information the thread can read without a syscall. Beyond their original use (lock-free per-CPU data structures), rseq carries an opportunistic time-slice extension hint: a thread can write to its rseq area to indicate "I am inside a critical section, please don't preempt me right now." The scheduler honours this best-effort — preemption is briefly deferred so the critical section can complete, after which normal preemption resumes.

The mechanism is bounded — the kernel will not defer preemption indefinitely, and high-priority tasks override the hint — so it acts as a low-overhead priority ceiling for short critical sections without requiring real-time scheduling or PI mutexes. Userspace runtimes (language runtimes, lock-free data structure libraries) use it transparently; nothing in user code changes.

Pressure stall information

Modern Linux exposes per-resource Pressure Stall Information (PSI) — files at /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io that report what fraction of recent time tasks were stalled waiting for the resource. Two metrics per file:

some — fraction of time at least one runnable task was stalled
full — fraction of time all runnable tasks were stalled (memory and io only)

Each metric is reported over three time windows: 10-second, 60-second, and 300-second moving averages.
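A small parser makes the file layout concrete. It assumes the Linux text format, in which each line starts with the metric name followed by `key=value` fields, the averages are percentages, and `total` is cumulative stall time in microseconds. `parse_psi` and the `SAMPLE` text are made up for the example.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        pairs = dict(field.split("=", 1) for field in fields)
        result[kind] = {key: float(value) for key, value in pairs.items()}
    return result


# Illustrative contents, not captured from a real system
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

if __name__ == "__main__":
    pressure = parse_psi(SAMPLE)
    print(pressure["some"]["avg10"])
    try:
        with open("/proc/pressure/cpu") as handle:
            print(parse_psi(handle.read()))
    except OSError:
        print("PSI not available here")
```

An autoscaler would typically watch `avg10` for fast reaction and `avg300` for trend, and act only when both agree.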

PSI is the recommended observability surface for capacity planning and autoscaling decisions on Peios. It replaces older indicators like load average: load average folds runnable tasks and tasks in uninterruptible sleep into a single number, whereas PSI reports stall time per resource and so reflects actual contention. Per-cgroup PSI is also available under each cgroup's cpu.pressure, memory.pressure, io.pressure files, gated by the cgroup hierarchy's normal access controls.

Reading the global PSI files is unprivileged. Reading a cgroup's PSI is gated by the cgroup membership-management SD.

Scheduler observability and tuning internals

A handful of additional substrate features are listed here for completeness:

Autogroups (CONFIG_SCHED_AUTOGROUP). Per-session automatic process grouping for the Normal class. When enabled, each interactive session is treated as a group and gets a fair share of CPU, preventing one user's heavy compute from starving another's interactive work. A kernel-build option Peios images may enable or disable depending on workload.
Energy-aware scheduling (EAS). On heterogeneous-CPU hardware (big.LITTLE, P-core/E-core hybrids), EAS factors energy efficiency into placement decisions. Activates automatically when running on appropriate hardware; no user-facing API.
Load balancing. The scheduler periodically rebalances runnable tasks across CPUs to keep them all busy. Substrate behaviour with no Peios-specific knobs beyond affinity and isolation (both covered in CPU Affinity and Isolation).
/proc/schedstat. System-wide scheduler statistics — wakeup counts, balance counts, runtime distributions. Useful for tuning when investigating scheduler behaviour. Read access is unprivileged.
/proc/[pid]/sched. Per-task scheduler debug information — virtual runtime, runqueue placement, and other internal counters. Read access is gated by the process SD (PROCESS_QUERY_INFORMATION) and by the PIP /proc default-deny rule for protected processes.

For day-to-day capacity decisions on Peios the recommended surface is PSI plus per-cgroup cpu.stat. The legacy /proc/schedstat and per-task /proc/[pid]/sched interfaces are retained for compatibility with existing Linux tooling.

sched_ext — pluggable schedulers

For workloads with very specific scheduling requirements that the standard classes don't address well — large datacenter fleets tuning for a single workload type, latency-critical systems with unusual placement constraints, gaming-style runtimes wanting custom interactivity heuristics — the kernel provides sched_ext, a framework that lets BPF programs implement complete scheduler classes.

A BPF scheduler is a userspace project (typically written in C with libbpf or in Rust) that compiles to BPF and is loaded into the kernel at runtime. Once loaded, it owns scheduling decisions for tasks placed in its class: enqueue, dequeue, pick-next-task, runqueue migration, and idle-CPU selection are all callbacks into the BPF program. If the BPF program crashes, errors out, or fails to make progress within a timeout, the kernel automatically reverts to the default scheduler — sched_ext cannot brick the system.

The framework includes substrate niceties beyond the basic dispatch: LLC and NUMA-aware idle-CPU selection so BPF schedulers get sensible cache behaviour by default, and a deadline-server backstop that bounds how much CPU bandwidth a misbehaving BPF scheduler can monopolise before kernel intervention.

Privilege model

Loading a BPF scheduler is one of the most consequential operations on a system. The scheduler runs at every dispatch decision on every CPU; it sees timing information about every task; a poorly-written or malicious scheduler can degrade the entire host's performance even with the kernel's safety nets in place.

Peios gates BPF scheduler loading on a dedicated SeLoadSchedulerPrivilege. The privilege is not held by default by any standard service or user role — it must be granted explicitly to the operator or daemon responsible for managing custom schedulers. Holding CAP_BPF or CAP_SYS_ADMIN is not sufficient on Peios; the dedicated privilege is the gate.

Loading, swapping, or unloading a BPF scheduler is unconditionally audit-loud regardless of the success/failure quartet. The audit record includes the scheduler's BPF program identity, the loading principal's token, and the previous scheduler (if any).

Operational notes

BPF schedulers compose with the standard real-time classes (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) — those continue to take precedence. sched_ext effectively replaces the Normal class for tasks that opt into it.
A BPF scheduler is process-state in the same sense the standard scheduler is: it survives across fork and exec, but is unloaded when the loading principal explicitly detaches or when the kernel revokes it under a fault.
Native Peios applications do not interact with sched_ext directly — they continue to use the standard sched_setattr() API. sched_ext is an operator-tier facility for site-wide scheduler customisation, not an application-level API.

For most Peios deployments the default scheduler (EEVDF + the standard real-time classes) is the right answer. sched_ext is available for advanced operators with a specific workload-tuning case, gated behind a privilege that makes accidental enabling impossible.

See also