Health Checks
Active health checks run a command periodically to verify service health beyond "process is alive." This addresses the gap where a service's process is running but functionally broken (lost database connection, stuck in a bad state, returning errors).
§4.2.1 Execution model
The health check command runs with the service's own token. It checks the service's health from the service's perspective.
Each health check invocation runs in an ephemeral health/
sub-cgroup under the service's cgroup tree. The check process is a child of
peinit, not of the service process. When the check completes or
times out, peinit MUST kill the entire health/ sub-cgroup --
this cleans up any grandchildren the health check may have spawned.
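One way the sub-cgroup kill could be done is sketched below. It assumes cgroup v2 on a kernel that provides the `cgroup.kill` interface file (Linux 5.14 or later). The function name and the directory layout (`health` directly under the service's cgroup directory) are illustrative assumptions, not part of this specification.

```python
from pathlib import Path


def kill_health_subcgroup(service_cgroup: Path) -> None:
    """Kill every process in <service_cgroup>/health.

    Hypothetical helper. Writing "1" to cgroup.kill (cgroup v2) sends
    SIGKILL to all processes in that cgroup and its descendants, so
    grandchildren spawned by the health check are covered as well.
    """
    (service_cgroup / "health" / "cgroup.kill").write_text("1")
```

Killing by cgroup rather than by PID means the cleanup does not depend on the health check tracking its own children.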
§4.2.2 Overlap prevention
If a previous health check is still running when the next interval
fires, the new check MUST be skipped. A health check that exceeds
HealthCheckTimeout triggers a kill of the health/ sub-cgroup and
is counted as a failure.
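The skip and timeout rules above can be sketched as a small state holder. This is a minimal sketch, not peinit's actual scheduler: the class name, method names, and the use of plain floats for timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckScheduler:
    timeout: float                      # HealthCheckTimeout, in seconds
    started_at: Optional[float] = None  # None means no check is in flight

    def on_interval(self, now: float) -> str:
        """Called each time the health check interval timer fires."""
        if self.started_at is not None:
            return "skip"               # previous check is still running
        self.started_at = now
        return "start"

    def on_exit(self) -> None:
        """Called when the check process exits on its own."""
        self.started_at = None

    def poll_timeout(self, now: float) -> bool:
        """Return True if the running check has exceeded the timeout."""
        if self.started_at is None or now - self.started_at <= self.timeout:
            return False
        # The caller kills the health/ sub-cgroup and counts one failure.
        self.started_at = None
        return True
```

Skipped intervals are not counted as failures in this sketch; only a timeout or a failed check would be.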
§4.2.3 Failure semantics
HealthCheckRetries consecutive failures mark the service as unhealthy. An unhealthy service is restarted using the same restart policy as a crashed service (RestartPolicy, exponential backoff, and throttling all apply). The health check failure count MUST reset when a check succeeds.
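The consecutive-failure rule can be illustrated with a short counter. The class and method names below are hypothetical; only the threshold and the reset-on-success behavior come from the text above.

```python
class FailureCounter:
    """Tracks consecutive health check failures (illustrative sketch)."""

    def __init__(self, retries: int) -> None:
        self.retries = retries      # HealthCheckRetries
        self.consecutive = 0

    def record(self, ok: bool) -> bool:
        """Record one result; return True once the service is unhealthy."""
        if ok:
            self.consecutive = 0    # any success resets the count
            return False
        self.consecutive += 1
        return self.consecutive >= self.retries
```

With `retries=3`, the sequence fail, fail, pass, fail, fail does not mark the service unhealthy, because the pass resets the count; a third failure in a row would.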
§4.2.4 Flap protection
The restart throttling mechanism (RestartMaxRetries within RestartWindow) protects against health check flapping. A service that repeatedly fails health checks, restarts, passes initial checks, then fails again will eventually exhaust its restart budget and transition to Failed.
This only works if the health check failure cycle is shorter than RestartWindow. The graph validation check (see the Dependencies section) MUST enforce:
HealthCheckRetries * HealthCheckInterval < RestartWindow
Configurations that violate this constraint allow the restart counter to reset between failures, meaning RestartMaxRetries is never reached and the service restarts indefinitely. This is a validation error, not a warning.
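As a worked example, assume all three values are in seconds: with HealthCheckRetries = 3 and HealthCheckInterval = 30, the failure cycle is 90 seconds, which passes against a RestartWindow of 300 and fails against a RestartWindow of 60. A minimal sketch of such a check follows; the function name and the choice of `ValueError` are assumptions, not peinit's actual validator.

```python
def validate_flap_protection(retries: int, interval: float,
                             restart_window: float) -> float:
    """Enforce HealthCheckRetries * HealthCheckInterval < RestartWindow.

    Returns the failure cycle length on success and raises ValueError
    otherwise, since a violation is an error rather than a warning.
    """
    cycle = retries * interval
    if cycle >= restart_window:
        raise ValueError(
            f"health check failure cycle ({cycle}s) must be shorter "
            f"than RestartWindow ({restart_window}s)"
        )
    return cycle
```

Note that the comparison is strict: a cycle exactly equal to RestartWindow is rejected.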
§4.2.5 D-state sub-cgroup leak
If a health check process is stuck in uninterruptible kernel sleep
(D-state -- typically hung NFS, broken disk controller), SIGKILL
on the health/ sub-cgroup will not terminate it. peinit detects
this via cgroup.events: the populated field is still 1 after
SIGKILL plus a post-kill timeout (default 5 seconds).
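The detection step can be sketched as follows. The cgroup v2 `cgroup.events` file contains one `key value` pair per line, including `populated`. The function names, the polling loop, and the injectable `read_events`, `clock`, and `sleep` parameters are assumptions for illustration; a real implementation might instead wait for file-modified notifications on `cgroup.events`.

```python
import time
from typing import Callable


def is_populated(events_text: str) -> bool:
    """Parse the contents of cgroup.events and return the populated flag."""
    for line in events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "populated":
            return value.strip() == "1"
    raise ValueError("cgroup.events has no 'populated' key")


def leaked_after_kill(read_events: Callable[[], str],
                      post_kill_timeout: float = 5.0,
                      poll_interval: float = 0.1,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the sub-cgroup is still populated after the timeout.

    Call this right after SIGKILL has been sent to the sub-cgroup.
    """
    deadline = clock() + post_kill_timeout
    while True:
        if not is_populated(read_events()):
            return False            # all processes are gone
        if clock() >= deadline:
            return True             # still populated: treat as leaked
        sleep(poll_interval)
```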
Unlike the main process, a stuck health check MUST NOT cause the service to enter the Abandoned state. The health check is a diagnostic probe -- it holds no service resources (ports, file locks, database connections).
Instead, peinit MUST orphan the leaked sub-cgroup:
- Mark it internally as leaked.
- Log a warning: "health check sub-cgroup for service X has D-state processes -- likely dead I/O. Sub-cgroup leaked. Underlying I/O problem requires investigation."
- Continue normal service supervision.
The leaked sub-cgroup remains in the hierarchy until reboot. If
the service is later stopped and restarted, peinit creates a
generational cgroup tree (see the Pre-Exec Sequence section)
because the old tree's rmdir will fail with EBUSY.
The same leak handling applies to hooks/ sub-cgroups -- a
pre-exec hook stuck in D-state is orphaned, not elevated to
service-level Abandoned.
§4.2.6 Leaked sub-cgroup observability
Leaked sub-cgroups MUST NOT be silent. peinit MUST track leaked sub-cgroups per service and expose them:
- Status queries MUST include a warnings array listing each leak (sub-cgroup path, type, timestamp of detection).
- Start commands on a service with leaked sub-cgroups MUST return a warning in the response: "service has leaked sub-cgroups from a previous generation -- indicates underlying I/O problem requiring investigation."
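A possible shape for the status response is sketched below. Only the `warnings` array and its three pieces of information (path, type, detection timestamp) come from the requirements above; the field names `path`, `type`, `detected_at`, `service`, and `state`, the example values, and the JSON encoding are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LeakRecord:
    path: str          # sub-cgroup path
    type: str          # for example "health" or "hooks"
    detected_at: str   # timestamp of detection


def status_response(service: str, state: str,
                    leaks: List[LeakRecord]) -> str:
    """Build a status reply that always carries a warnings array."""
    body = {
        "service": service,
        "state": state,
        "warnings": [asdict(leak) for leak in leaks],
    }
    return json.dumps(body, indent=2)


# Example values are made up for illustration.
print(status_response(
    "db", "Running",
    [LeakRecord("db/health", "health", "2024-01-01T00:00:00Z")],
))
```

In this sketch the `warnings` key is emitted even when there are no leaks, so clients can rely on it being present.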
§4.2.7 Critical service guidance
peinit MUST apply the same health check semantics to Critical services as to any other service. There is no special-casing.