This section defines the exact sequence of operations between
"peinit decides to start service X" and "service X's binary is
running." Every step is numbered. Failure at any step is handled
explicitly.
§4.1.1 Cgroup tree structure
Every service runs in its own cgroup tree:
/sys/fs/cgroup/peinit/<cgroup-id>/ (service root)
/sys/fs/cgroup/peinit/<cgroup-id>/main/ (main process)
/sys/fs/cgroup/peinit/<cgroup-id>/hooks/ (pre/post hooks)
/sys/fs/cgroup/peinit/<cgroup-id>/health/ (health checks)
<cgroup-id> is the service name with / replaced by - (e.g.,
mount:/data becomes mount:-data). This escaping is internal --
the user-facing name is unchanged.
The sub-cgroup structure satisfies cgroups v2's "no internal
processes" constraint (required when controllers are active) and
provides clean containment for hooks and health checks.
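The naming and layout rules above can be sketched as a pair of pure functions. This is a minimal illustration, not peinit's implementation: the names `cgroup_id` and `cgroup_paths` and the constants are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List

# Path and sub-cgroup names are taken from the layout listed above.
CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup/peinit")
SUB_CGROUPS = ("main", "hooks", "health")


def cgroup_id(service_name: str) -> str:
    """Escape a user-facing service name into its internal cgroup-id."""
    return service_name.replace("/", "-")


def cgroup_paths(service_name: str) -> List[PurePosixPath]:
    """Return the service root followed by its three sub-cgroups."""
    root = CGROUP_ROOT / cgroup_id(service_name)
    return [root] + [root / sub for sub in SUB_CGROUPS]
```

For the service name mount:/data, the first entry is the service root under mount:-data and the remaining entries are its main/, hooks/, and health/ children.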
§4.1.1.1 Cgroup generations
If a service's previous cgroup tree has leaked sub-cgroups (D-state
processes that survived SIGKILL -- see the Health Checks section),
rmdir on the old tree will fail with EBUSY. In this case, peinit
MUST create a generational cgroup tree:
/sys/fs/cgroup/peinit/<cgroup-id>.gen<N>/ where N increments on
each restart that requires a new tree. Old leaked trees persist
until reboot.
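One way to pick the next generational name is to scan the names already present under the peinit cgroup root. The sketch below assumes that N starts at 1 and that the next value is one above the highest leaked generation; the text above only says that N increments, so both points are assumptions, as is the function name.

```python
import re
from typing import Iterable


def next_generation_name(cgroup_id: str, existing: Iterable[str]) -> str:
    """Return '<cgroup-id>.gen<N>' with N one above the highest existing N.

    Assumption: numbering starts at 1 when no generational tree exists yet.
    """
    pattern = re.compile(re.escape(cgroup_id) + r"\.gen(\d+)")
    highest = 0
    for name in existing:
        match = pattern.fullmatch(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return f"{cgroup_id}.gen{highest + 1}"
```

A caller would invoke this only after rmdir on the previous tree has failed with EBUSY; the directory listing of /sys/fs/cgroup/peinit/ is the natural source for `existing`.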
§4.1.2 Pre-start evaluation
Before entering the pre-exec sequence, peinit MUST evaluate
conditions and asserts while the service is still in Inactive
state. This evaluation gates the Inactive → Starting transition.
- Read the service definition from the in-memory cache (see the
Configuration Generations section). This read MUST NOT block
on the registry.
- If the service has Conditions, evaluate all of them. If any
condition fails, the service transitions to Skipped and the
start is abandoned. Skipped services satisfy their dependents.
- If all conditions pass and the service has Asserts, evaluate
all of them. If any assert fails, the service transitions to
Failed with cause AssertionError and the start is abandoned.
Only after conditions and asserts pass does the service transition
to Starting and the pre-exec sequence below begins.
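The gating logic can be sketched as a single function that maps the two check lists to the resulting state. The representation of a check as a zero-argument callable returning a bool is an assumption made for illustration; the state and cause names come from the text above.

```python
from typing import Callable, Optional, Sequence, Tuple

Check = Callable[[], bool]  # hypothetical: True means the check passed


def pre_start_evaluate(
    conditions: Sequence[Check], asserts: Sequence[Check]
) -> Tuple[str, Optional[str]]:
    """Return (next_state, failure_cause) for a service that is Inactive."""
    # Evaluate every condition before deciding, rather than short-circuiting.
    condition_results = [check() for check in conditions]
    if not all(condition_results):
        return ("Skipped", None)
    # Asserts are only evaluated once all conditions have passed.
    assert_results = [check() for check in asserts]
    if not all(assert_results):
        return ("Failed", "AssertionError")
    return ("Starting", None)
```

Note that a failing condition wins over a failing assert: the asserts are never evaluated when the service is Skipped.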
§4.1.3 The sequence
The service is in Starting state for the duration of this sequence.
§4.1.3.1 Step 1: Start timeout
peinit MUST start the StartTimeout timer. This timer covers the
entire remaining sequence: pre-hooks, fork/exec, and readiness
wait. If StartTimeout expires at any point during steps 2-10,
peinit MUST abort the start, kill the service's entire cgroup
tree, and transition the service to Failed with cause
ReadinessTimeout.
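Because a single timer spans several blocking phases, one common approach is to record an absolute deadline once and derive the remaining budget for each wait from it. The class below is a sketch under that assumption; `StartDeadline` and its methods are hypothetical names, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable


class StartDeadline:
    """Absolute deadline derived from StartTimeout (in seconds)."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # A monotonic clock avoids jumps when the wall clock is adjusted.
        self._expires_at = clock() + timeout_s

    def remaining(self) -> float:
        """Seconds left; never negative, so it can be passed to a wait call."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._expires_at
```

Each later phase (hook wait, error-pipe read, readiness wait) would pass `remaining()` as its own timeout, so the phases share one budget rather than each receiving a full StartTimeout.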
§4.1.3.2 Step 2: Create cgroup tree
peinit MUST create the service's cgroup tree (root, main/,
hooks/, health/ sub-cgroups).
If cgroup creation fails, no child process exists. peinit MUST
transition the service to Failed with cause ParentSetupFailure
and return the error (including errno) to the control socket
caller.
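A sketch of this step, including the cleanup of a partially created tree described in the parent-side failure summary. The function name and the shape of the return value are hypothetical; on a real system `root` would be a path under /sys/fs/cgroup/peinit/, but any directory works for experimentation.

```python
import errno
import os
from typing import Optional, Tuple

SUB_CGROUPS = ("main", "hooks", "health")


def create_cgroup_tree(root: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Create root plus sub-cgroups; return (state, cause, errno_name)."""
    created = []
    try:
        for path in [root] + [os.path.join(root, sub) for sub in SUB_CGROUPS]:
            os.mkdir(path)
            created.append(path)
    except OSError as exc:
        # Remove whatever was created, children first, ignoring errors.
        for path in reversed(created):
            try:
                os.rmdir(path)
            except OSError:
                pass
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return ("Failed", "ParentSetupFailure", name)
    return ("Starting", None, None)
```

The errno name in the third element stands in for the error that is returned to the control socket caller.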
§4.1.3.3 Step 3: Run pre-exec hooks
If ExecStartPre is configured, peinit MUST run each hook command
sequentially. Each hook is forked into the hooks/ sub-cgroup.
For each hook, peinit MUST materialise a token at the point of
use: if HookIdentity is set, materialise a token for that identity;
otherwise, materialise a token for the service's Identity. Token
materialisation follows the rules in the Service Identity section.
If token materialisation fails for a hook, the hook fails and the
service transitions to Failed with cause PreHookFailure.
If any hook exits non-zero, peinit MUST:
- Kill the entire service cgroup tree (cleaning up any hook
grandchildren).
- Transition the service to Failed with cause PreHookFailure.
On success of all hooks, peinit MUST kill the hooks/ sub-cgroup
to clean up any lingering hook descendants before the main process
starts.
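The sequential, fail-fast behaviour of the hook loop can be sketched as follows. This illustration deliberately omits token materialisation and cgroup placement, which have no portable equivalent here; `run_pre_hooks` is a hypothetical name and each hook is assumed to be an argv list.

```python
import subprocess
from typing import Optional, Sequence, Tuple


def run_pre_hooks(hooks: Sequence[Sequence[str]]) -> Tuple[str, Optional[str], int]:
    """Run hooks in order; return (state, cause, number_of_hooks_run)."""
    for index, argv in enumerate(hooks):
        # Each hook runs to completion before the next one starts.
        result = subprocess.run(list(argv))
        if result.returncode != 0:
            # A real implementation would kill the service cgroup tree here.
            return ("Failed", "PreHookFailure", index + 1)
    # A real implementation would kill the hooks/ sub-cgroup here.
    return ("Starting", None, len(hooks))
```

Hooks after the first failing one are never started, which is why the third element reports how many hooks actually ran.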
§4.1.3.4 Step 4: Materialise service token
peinit MUST materialise the service's main process token as
defined in the Service Identity section. For SYSTEM services,
clone peinit's token. For all other identities, request a token
from authd. Apply RequiredPrivileges restriction if configured.
If token materialisation fails (authd unreachable, identity not
found, KACS syscall error), no child process exists. peinit MUST
transition the service to Failed with cause ParentSetupFailure.
§4.1.3.5 Step 5: Create error pipe
peinit MUST create a cloexec pipe (pipe2(O_CLOEXEC)). The parent
holds the read end; the child will hold the write end. This pipe
communicates pre-exec setup errors from the child back to the
parent.
If exec succeeds, the write end auto-closes (CLOEXEC) and the
parent reads EOF -- meaning setup succeeded. If any setup step
fails before exec, the child writes a structured error (step
identifier + errno) over the pipe before exiting.
If pipe2 fails, no child process exists. peinit MUST transition
the service to Failed with cause ParentSetupFailure.
§4.1.3.6 Step 6: Fork
peinit MUST fork via clone3(CLONE_PIDFD) to atomically obtain
a pidfd for the child. There MUST be no window where the child
exists without a pidfd.
If clone3 fails, no child process exists. peinit MUST transition
the service to Failed with cause ParentSetupFailure. Common causes:
EMFILE/ENFILE (fd exhaustion), EAGAIN (PID limit), ENOMEM.
§4.1.3.7 Step 7: Parent post-fork
Immediately after fork, in the parent:
- Move the child into the main/ sub-cgroup by writing the child
PID to main/cgroup.procs. The parent does this -- not the
child -- because cgroup writes require SYSTEM privileges that
the child's token will not carry after token installation.
- Close the write end of the error pipe.
- Read from the error pipe:
  - EOF: exec succeeded. Record the child pidfd as the
  service's main process.
  - Data: pre-exec setup failed. Parse the step identifier
  and errno. Log the specific failure. Transition the service
  to Failed with cause PreExecFailure.
§4.1.3.8 Step 8: Child pre-exec
In the child process. This path MUST be minimal -- no heap
allocation, no complex library calls, no logging. Straight-line
setup then exec.
- Close the read end of the error pipe.
- Install the service's KACS token.
- Set RLIMIT values (LimitNOFILE, LimitCORE) if configured.
- Set oom_score_adj: -1000 (OOM-immune) for ErrorControl=Critical
services, 0 (default) for all others.
- Set working directory.
- Set environment variables (base environment + Environment
values from the definition).
- Set NOTIFY_SOCKET to the notify socket path. This is set
unconditionally regardless of the Readiness field -- services
use sd_notify for watchdog, STOPPING=1, FDSTORE, and
EXTEND_TIMEOUT_USEC in addition to readiness signalling.
- Inject stored file descriptors if the service has an fd store
with entries from a previous run.
- Exec the binary (ImagePath + Arguments).
- If exec fails: write the error to the pipe, then _exit(127).
- If any earlier setup step in this list fails (token installation
through fd injection): write the error to the pipe, then _exit(126).
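The error-pipe protocol that spans steps 5 through 8 can be sketched end to end. This is an illustration of the protocol only: Python cannot honour the no-heap-allocation rule for the child path, plain fork is used instead of clone3(CLONE_PIDFD), and only the working-directory and exec steps are modelled. The wire format (`<Hi`) and the step identifiers are hypothetical.

```python
import errno
import os
import struct
from typing import Optional, Sequence, Tuple

# Hypothetical wire format: unsigned 16-bit step id, signed 32-bit errno.
ERROR_RECORD = struct.Struct("<Hi")
STEP_CLOSE_PIPE, STEP_CHDIR, STEP_EXEC = 1, 5, 9  # hypothetical step ids


def spawn(argv: Sequence[str], cwd: Optional[str] = None):
    """Fork and exec argv; return (pid, None) or (pid, (step, errno))."""
    read_fd, write_fd = os.pipe2(os.O_CLOEXEC)
    pid = os.fork()
    if pid == 0:
        step = STEP_CLOSE_PIPE
        try:
            os.close(read_fd)
            step = STEP_CHDIR
            if cwd is not None:
                os.chdir(cwd)
            step = STEP_EXEC
            os.execv(argv[0], list(argv))  # does not return on success
        except OSError as exc:
            os.write(write_fd, ERROR_RECORD.pack(step, exc.errno))
        finally:
            # Reached only if setup or exec failed; never fall through
            # into the parent's code path.
            os._exit(127 if step == STEP_EXEC else 126)
    os.close(write_fd)
    data = b""
    while len(data) < ERROR_RECORD.size:
        chunk = os.read(read_fd, ERROR_RECORD.size - len(data))
        if not chunk:  # EOF: the write end closed on exec (or child exit)
            break
        data += chunk
    os.close(read_fd)
    if len(data) < ERROR_RECORD.size:
        return pid, None
    return pid, ERROR_RECORD.unpack(data)


def describe(error: Tuple[int, int]) -> str:
    """Render a decoded error record for logging."""
    step, err = error
    return f"pre-exec step {step} failed: {errno.errorcode.get(err, err)}"
```

A `None` error corresponds to the EOF branch in step 7. A decoded record corresponds to the Data branch, and the child's exit status (126 or 127) tells the parent whether setup or exec was the failing phase.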
§4.1.3.9 Step 9: Wait for readiness
After successful fork and exec:
- Simple, Readiness=Notify: peinit waits for READY=1 via
sd_notify. On receipt, the service transitions to Active.
- Simple, Readiness=Alive: the service transitions to Active
immediately (the process exists).
- Oneshot: peinit waits for the process to exit. Exit code 0
(or a code in SuccessExitCodes) transitions to Completed. With
RemainAfterExit=1 the service remains in Completed; without
RemainAfterExit it transitions Completed → Inactive after
dependents are released. Any other exit code transitions to Failed.
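The Readiness=Notify case can be sketched with a Unix datagram socket, which is the transport the sd_notify protocol uses. How peinit actually binds and reads its notify socket is not specified here, so the listener side, the function names, and the temporary socket path are all assumptions.

```python
import os
import socket
import tempfile
from typing import Dict


def parse_notify(datagram: bytes) -> Dict[str, str]:
    """Split a newline-separated KEY=VALUE datagram into a dict."""
    fields = {}
    for line in datagram.decode().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key] = value
    return fields


def state_after_notify(state: str, datagram: bytes) -> str:
    """Starting becomes Active on READY=1; other messages leave state alone."""
    if state == "Starting" and parse_notify(datagram).get("READY") == "1":
        return "Active"
    return state


def demo_ready_roundtrip() -> str:
    """Bind a notify socket, send READY=1 as a service would, and react."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notify.sock")  # stand-in for NOTIFY_SOCKET
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as listener:
            listener.bind(path)
            with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as service:
                service.sendto(b"READY=1\nSTATUS=up", path)
            datagram = listener.recv(4096)
    return state_after_notify("Starting", datagram)
```

Messages without READY=1, such as a watchdog ping, leave a Starting service in Starting, which is why the readiness wait remains subject to StartTimeout.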
§4.1.3.10 Step 10: Post-readiness
On readiness (Simple) or successful exit (Oneshot):
- Run ExecStartPost commands. Each hook is forked into the
hooks/ sub-cgroup. Hook failure is logged but MUST NOT fail
the service.
- Release dependent services (they become eligible to start).
- Start the watchdog timer if WatchdogTimeout > 0 (Simple only).
- Start the health check timer if HealthCheck is set (Simple
only).
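The post-readiness rules can be summarised as a function that produces an ordered action list. The sketch assumes that the list order above is the execution order and uses hypothetical action names; it plans the actions rather than performing them.

```python
from typing import List, Optional, Sequence, Tuple

Action = Tuple[str, object]


def post_readiness_actions(
    service_type: str,
    watchdog_timeout: int = 0,
    health_check: Optional[str] = None,
    start_post: Sequence[str] = (),
) -> List[Action]:
    """Return the actions to take once a service is ready or has completed."""
    # Post hooks run first; their failure is logged and otherwise ignored.
    actions: List[Action] = [("run_post_hook", cmd) for cmd in start_post]
    actions.append(("release_dependents", None))
    # Watchdog and health-check timers apply to Simple services only.
    if service_type == "Simple":
        if watchdog_timeout > 0:
            actions.append(("start_watchdog", watchdog_timeout))
        if health_check is not None:
            actions.append(("start_health_check", health_check))
    return actions
```

For a Oneshot service the timer settings are ignored even when configured, so only the hooks and the dependent release remain.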
§4.1.4 Parent-side failure summary
Steps 2, 4, 5, and 6 can fail before any child process exists.
In all four cases, peinit handles the failure entirely in the
parent:
- Clean up any partially created cgroup tree.
- Transition the service to Failed with cause ParentSetupFailure.
- Return the error (including errno) to the control socket caller.
These failures occur in peinit's own setup path and mostly reflect
system-level resource exhaustion (fd limits, PID limits, memory,
cgroup filesystem errors); none of them originates in the service's
binary.