Pre-Exec Sequence
This section defines the exact sequence of operations between "peinit decides to start service X" and "service X's binary is running." Every step is numbered. Failure at any step is handled explicitly.
§4.1.1 Cgroup tree structure
Every service runs in its own cgroup tree:
/sys/fs/cgroup/peinit/<cgroup-id>/ (service root)
/sys/fs/cgroup/peinit/<cgroup-id>/main/ (main process)
/sys/fs/cgroup/peinit/<cgroup-id>/hooks/ (pre/post hooks)
/sys/fs/cgroup/peinit/<cgroup-id>/health/ (health checks)
<cgroup-id> is the service name with / replaced by - (e.g.,
mount:/data becomes mount:-data). This escaping is internal --
the user-facing name is unchanged.
The sub-cgroup structure satisfies cgroups v2's "no internal processes" constraint (required when controllers are active) and provides clean containment for hooks and health checks.
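The naming rule and layout above can be sketched as follows. This is a minimal illustration rather than peinit's actual code; the names cgroup_id, cgroup_paths, CGROUP_ROOT, and SUB_CGROUPS are hypothetical.

```python
from pathlib import PurePosixPath

# Root of peinit's cgroup hierarchy, as given in the layout above.
CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup/peinit")
SUB_CGROUPS = ("main", "hooks", "health")


def cgroup_id(service_name: str) -> str:
    """Derive the internal cgroup id: every '/' becomes '-'."""
    return service_name.replace("/", "-")


def cgroup_paths(service_name: str) -> dict:
    """Return the service root plus its three sub-cgroups."""
    root = CGROUP_ROOT / cgroup_id(service_name)
    paths = {"root": root}
    for sub in SUB_CGROUPS:
        paths[sub] = root / sub
    return paths
```

For the service mount:/data, cgroup_paths returns a main entry of /sys/fs/cgroup/peinit/mount:-data/main.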
§4.1.1.1 Cgroup generations
If a service's previous cgroup tree has leaked sub-cgroups (D-state
processes that survived SIGKILL -- see the Health Checks section),
rmdir on the old tree will fail with EBUSY. In this case, peinit
MUST create a generational cgroup tree:
/sys/fs/cgroup/peinit/<cgroup-id>.gen<N>/ where N increments on
each restart that requires a new tree. Old leaked trees persist
until reboot.
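One way to pick the next generational name is sketched below. It assumes N starts at 1 and is one higher than the largest generation already present; the text above only says that N increments, so both choices are assumptions. The function name next_generation_name is hypothetical, and the trigger (rmdir failing with EBUSY) is not shown.

```python
import re
from typing import Iterable


def next_generation_name(cgroup_id: str, existing: Iterable[str]) -> str:
    """Return '<cgroup-id>.gen<N>' with N one above the highest leaked generation.

    `existing` is the list of directory names under the peinit cgroup root.
    Assumption: the first generational tree is gen1.
    """
    pattern = re.compile(re.escape(cgroup_id) + r"\.gen(\d+)")
    generations = []
    for name in existing:
        match = pattern.fullmatch(name)
        if match:
            generations.append(int(match.group(1)))
    return f"{cgroup_id}.gen{max(generations, default=0) + 1}"
```

Matching on the escaped, full name keeps trees of other services (for example other.gen7) from influencing the counter.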
§4.1.2 Pre-start evaluation
Before entering the pre-exec sequence, peinit MUST evaluate conditions and asserts while the service is still in Inactive state. This evaluation gates the Inactive → Starting transition.
- Read the service definition from the in-memory cache (see the Configuration Generations section). This read MUST NOT block on the registry.
- If the service has Conditions, evaluate all of them. If any condition fails, the service transitions to Skipped and the start is abandoned. Skipped services satisfy their dependents.
- If all conditions pass and the service has Asserts, evaluate all of them. If any assert fails, the service transitions to Failed with cause AssertionError and the start is abandoned.
Only after conditions and asserts pass does the service transition to Starting and the pre-exec sequence below begins.
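The gating logic above can be sketched as a small function. This is an illustration under assumptions: conditions and asserts are modelled as zero-argument callables returning a boolean, and the State enum and evaluate_pre_start name are hypothetical.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple

Check = Callable[[], bool]


class State(Enum):
    STARTING = "Starting"
    SKIPPED = "Skipped"
    FAILED = "Failed"


def evaluate_pre_start(
    conditions: Sequence[Check], asserts: Sequence[Check]
) -> Tuple[State, Optional[str]]:
    """Return the next state and, for Failed, the failure cause."""
    # Evaluate every condition; a single failure skips the service.
    condition_results = [check() for check in conditions]
    if not all(condition_results):
        return State.SKIPPED, None
    # Asserts are only evaluated once all conditions have passed.
    assert_results = [check() for check in asserts]
    if not all(assert_results):
        return State.FAILED, "AssertionError"
    return State.STARTING, None
```

A service with no Conditions and no Asserts falls straight through to Starting, since all() of an empty list is true.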
§4.1.3 The sequence
The service is in Starting state for the duration of this sequence.
§4.1.3.1 Step 1: Start timeout
peinit MUST start the StartTimeout timer. This timer covers the entire remaining sequence: pre-hooks, fork/exec, and readiness wait. If StartTimeout expires at any point during steps 2-10, peinit MUST abort the start, kill the service's entire cgroup tree, and transition the service to Failed with cause ReadinessTimeout.
§4.1.3.2 Step 2: Create cgroup tree
peinit MUST create the service's cgroup tree (root, main/,
hooks/, health/ sub-cgroups).
If cgroup creation fails, no child process exists. peinit MUST transition the service to Failed with cause ParentSetupFailure and return the error (including errno) to the control socket caller.
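A minimal sketch of this step, assuming the tree is created with plain mkdir calls and that a partial tree is removed on failure (as the parent-side failure summary requires). The function name create_cgroup_tree and the returned tuple shape are hypothetical.

```python
import os

SUB_CGROUPS = ("main", "hooks", "health")


def create_cgroup_tree(service_root: str):
    """Create the service root and its sub-cgroups.

    Returns None on success, or ("ParentSetupFailure", errno) on failure
    after removing whatever part of the tree was already created.
    """
    created = []
    try:
        os.mkdir(service_root)
        created.append(service_root)
        for sub in SUB_CGROUPS:
            path = os.path.join(service_root, sub)
            os.mkdir(path)
            created.append(path)
    except OSError as exc:
        # Remove children before the root; ignore secondary errors.
        for path in reversed(created):
            try:
                os.rmdir(path)
            except OSError:
                pass
        return ("ParentSetupFailure", exc.errno)
    return None
```

The errno travels with the cause so that it can be returned to the control socket caller.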
§4.1.3.3 Step 3: Run pre-exec hooks
If ExecStartPre is configured, peinit MUST run each hook command
sequentially. Each hook is forked into the hooks/ sub-cgroup.
For each hook, peinit MUST materialise a token at the point of use: if HookIdentity is set, materialise a token for that identity; otherwise, materialise a token for the service's Identity. Token materialisation follows the rules in the Service Identity section. If token materialisation fails for a hook, the hook fails and the service transitions to Failed with cause PreHookFailure.
If any hook exits non-zero, peinit MUST:
- Kill the entire service cgroup tree (cleaning up any hook grandchildren).
- Transition the service to Failed with cause PreHookFailure.
On success of all hooks, peinit MUST kill the hooks/ sub-cgroup
to clean up any lingering hook descendants before the main process
starts.
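The control flow of this step can be sketched as below. The sketch leaves out cgroup placement and token materialisation, and it takes the two cleanup actions as callbacks; run_pre_hooks, kill_service_tree, and kill_hooks_cgroup are hypothetical names, as are the two example hook commands.

```python
import subprocess
import sys

# Example hook commands for illustration only.
OK_HOOK = [sys.executable, "-c", "pass"]
BAD_HOOK = [sys.executable, "-c", "import sys; sys.exit(3)"]


def run_pre_hooks(hooks, kill_service_tree, kill_hooks_cgroup):
    """Run hooks in order; return None on success or the failure cause.

    The first non-zero exit stops the sequence and kills the whole
    service tree. If every hook succeeds, only the hooks/ sub-cgroup is
    killed, to remove lingering hook descendants.
    """
    for argv in hooks:
        if subprocess.run(argv).returncode != 0:
            kill_service_tree()
            return "PreHookFailure"
    kill_hooks_cgroup()
    return None
```

Hooks after the failing one never run, which matches the sequential requirement above.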
§4.1.3.4 Step 4: Materialise service token
peinit MUST materialise the service's main process token as defined in the Service Identity section. For SYSTEM services, clone peinit's token. For all other identities, request a token from authd. Apply RequiredPrivileges restriction if configured.
If token materialisation fails (authd unreachable, identity not found, KACS syscall error), no child process exists. peinit MUST transition the service to Failed with cause ParentSetupFailure.
§4.1.3.5 Step 5: Create error pipe
peinit MUST create a cloexec pipe (pipe2(O_CLOEXEC)). The parent
holds the read end; the child will hold the write end. This pipe
communicates pre-exec setup errors from the child back to the
parent.
If exec succeeds, the write end auto-closes (CLOEXEC) and the parent reads EOF -- meaning setup succeeded. If any setup step fails before exec, the child writes a structured error (step identifier + errno) over the pipe before exiting.
If pipe2 fails, no child process exists. peinit MUST transition
the service to Failed with cause ParentSetupFailure.
§4.1.3.6 Step 6: Fork
peinit MUST fork via clone3(CLONE_PIDFD) to atomically obtain
a pidfd for the child. There MUST be no window where the child
exists without a pidfd.
If clone3 fails, no child process exists. peinit MUST transition
the service to Failed with cause ParentSetupFailure. Common causes:
EMFILE/ENFILE (fd exhaustion), EAGAIN (PID limit), ENOMEM.
§4.1.3.7 Step 7: Parent post-fork
Immediately after fork, in the parent:
- Move the child into the main/ sub-cgroup by writing the child PID to main/cgroup.procs. The parent does this -- not the child -- because cgroup writes require SYSTEM privileges that the child's token will not carry after token installation.
- Close the write end of the error pipe.
- Read from the error pipe:
  - EOF: exec succeeded. Record the child pidfd as the service's main process.
  - Data: pre-exec setup failed. Parse the step identifier and errno. Log the specific failure. Transition the service to Failed with cause PreExecFailure.
§4.1.3.8 Step 8: Child pre-exec
The following steps run in the child process. This path MUST be minimal: no heap allocation, no complex library calls, no logging. It is straight-line setup followed by exec.
- Close the read end of the error pipe.
- Install the service's KACS token.
- Set RLIMIT values (LimitNOFILE, LimitCORE) if configured.
- Set oom_score_adj: -1000 (OOM-immune) for ErrorControl=Critical services, 0 (default) for all others.
- Set working directory.
- Set environment variables (base environment + Environment values from the definition).
- Set NOTIFY_SOCKET to the notify socket path. This is set unconditionally regardless of the Readiness field -- services use sd_notify for watchdog, STOPPING=1, FDSTORE, and EXTEND_TIMEOUT_USEC in addition to readiness signalling.
- Inject stored file descriptors if the service has an fd store with entries from a previous run.
- Exec the binary (ImagePath + Arguments).
- If exec fails: write error to the pipe, _exit(127).
- If any of setup items 2-8 in this list (token installation through fd injection) fails: write error to the pipe, _exit(126).
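The error-pipe protocol of steps 5 through 8 can be sketched end to end. This is only an illustration of the protocol, under several assumptions: it uses os.fork instead of clone3(CLONE_PIDFD), so it does not obtain a pidfd; Python cannot honour the no-heap-allocation rule for the child path; chdir stands in for all pre-exec setup; and the wire format (one byte step identifier plus a 32-bit errno) and the step numbers are hypothetical.

```python
import errno
import os
import struct
import sys

# Hypothetical wire format: unsigned byte step id + signed 32-bit errno.
ERROR_FORMAT = "=Bi"
STEP_CHDIR = 5  # hypothetical step identifiers
STEP_EXEC = 9


def spawn(argv, cwd=None):
    """Fork and exec argv, reporting pre-exec failures over a CLOEXEC pipe.

    Returns (pid, None) if exec succeeded, or (None, (step, errno)) if the
    child wrote an error before exiting.
    """
    read_fd, write_fd = os.pipe2(os.O_CLOEXEC)
    pid = os.fork()
    if pid == 0:
        # Child: setup, then exec. Never return into the caller's code.
        step = STEP_CHDIR
        try:
            os.close(read_fd)
            if cwd is not None:
                os.chdir(cwd)
            step = STEP_EXEC
            os.execv(argv[0], argv)
        except OSError as exc:
            os.write(write_fd, struct.pack(ERROR_FORMAT, step, exc.errno))
        finally:
            # 127 for exec failure, 126 for an earlier setup failure.
            os._exit(127 if step == STEP_EXEC else 126)

    # Parent: close our copy of the write end, then read until EOF.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    if not data:
        return pid, None  # EOF: CLOEXEC closed the write end at exec
    os.waitpid(pid, 0)  # reap the failed child
    return None, struct.unpack(ERROR_FORMAT, data)
```

The parent must close its own write end before reading; otherwise the read never sees EOF, because a write end would still be open in the parent itself.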
§4.1.3.9 Step 9: Wait for readiness
After successful fork and exec:
- Simple, Readiness=Notify: peinit waits for READY=1 via sd_notify. On receipt, the service transitions to Active.
- Simple, Readiness=Alive: the service transitions to Active immediately (the process exists).
- Oneshot: peinit waits for the process to exit. Exit code 0 (or a code in SuccessExitCodes) transitions to Completed. With RemainAfterExit=1 the service remains in Completed; without RemainAfterExit it transitions Completed -> Inactive after dependents are released. Any other exit code transitions to Failed.
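The two decisions in this step can be sketched as small helpers. The sketch assumes the notify socket carries the usual sd_notify datagram format (newline-separated KEY=VALUE assignments); the function names parse_notify, is_ready, and oneshot_outcome are hypothetical.

```python
def parse_notify(datagram: bytes) -> dict:
    """Parse an sd_notify datagram into a dict of KEY -> VALUE."""
    fields = {}
    for line in datagram.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            fields[key] = value
    return fields


def is_ready(datagram: bytes) -> bool:
    """True if the datagram signals readiness (READY=1)."""
    return parse_notify(datagram).get("READY") == "1"


def oneshot_outcome(exit_code: int, success_exit_codes=()) -> str:
    """Map a Oneshot exit code to Completed or Failed."""
    if exit_code == 0 or exit_code in success_exit_codes:
        return "Completed"
    return "Failed"
```

A datagram such as WATCHDOG=1 is parsed the same way but does not count as readiness, so a Readiness=Notify service stays in Starting until READY=1 arrives or StartTimeout expires.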
§4.1.3.10 Step 10: Post-readiness
On readiness (Simple) or successful exit (Oneshot):
- Run ExecStartPost commands. Each hook is forked into the hooks/ sub-cgroup. Hook failure is logged but MUST NOT fail the service.
- Release dependent services (they become eligible to start).
- Start the watchdog timer if WatchdogTimeout > 0 (Simple only).
- Start the health check timer if HealthCheck is set (Simple only).
§4.1.4 Parent-side failure summary
Steps 2, 4, 5, and 6 can fail before any child process exists. In all four cases, peinit handles the failure entirely in the parent:
- Clean up any partially created cgroup tree.
- Transition the service to Failed with cause ParentSetupFailure.
- Return the error (including errno) to the control socket caller.
These failures stem from system-level resource exhaustion (fd limits, PID limits, memory, cgroup filesystem errors), not from the service itself.