psd-003 — Specification

KMES

Kernel Mediated Event Subsystem — unified event emission, buffering, and delivery for the Peios kernel.

v0.20 Draft 2026-05-21

1 Introduction

§1.1 Scope

This specification defines the Kernel Mediated Event Subsystem (KMES) for the Peios operating system. KMES is the event subsystem within the Peios Kernel Module (PKM), peer to KACS and LCS. It provides a unified event emission, buffering, and delivery mechanism for the Peios kernel.

KMES is the sole event emission path in Peios. All events -- whether originating from kernel subsystems or userspace processes -- are emitted through KMES. There is no alternative event path.

This specification covers:

The event model -- event structure, header format, msgpack payload, and the header/payload boundary
The emission API -- the internal kernel interface used by KACS and LCS to emit events
The syscall interface -- the mechanism for userspace event emission and consumer attachment
Event buffering and ordering -- per-CPU ring buffers, per-CPU sequence numbering, and wall clock timestamps
The delivery mechanism -- per-CPU shared memory ring buffers, double virtual mapping, lock-free read/write protocols, and futex-based consumer notification
Stamping -- KMES-intrinsic stamp fields (timestamp, sequence number, cpu_id, origin class) and identity stamp fields (effective token GUID, true token GUID, process GUID) captured from KACS at emission time

This specification does not cover:

Event persistence, indexing, or querying (eventd)
Event type schemas or naming conventions (eventd)
Boot identity and cross-boot sequencing (peinit / eventd)
KACS (PSD-004)
LCS (PSD-005)
Authentication or principal management (authd)

§1.2 Terminology

The following terms are used throughout this specification with the precise meanings defined here.

PKM (Peios Kernel Module): The single loadable kernel module (pkm.ko) containing all Peios kernel extensions. KMES, KACS, and LCS are peer subsystems within PKM.

KMES (Kernel Mediated Event Subsystem): The event subsystem within PKM. Provides the sole event emission path in Peios -- kernel subsystems and userspace processes emit events exclusively through KMES. KMES buffers events, stamps them with metadata, assigns per-CPU sequence numbers, and delivers them to userspace consumers via per-CPU shared memory ring buffers.

Event: An indivisible record consisting of a header and a payload. The header is a packed binary structure containing KMES-intrinsic metadata. The payload is a msgpack-encoded blob of arbitrary structured data defined by the emitter. Header and payload are always produced and consumed together -- neither is meaningful alone.

Header: The packed binary prefix of every event. Contains: event size, header size, wall clock timestamp, per-CPU sequence number, CPU identifier, origin class, three identity GUIDs (effective token, true token, process), and a length-prefixed event type string. All fields before the event type string are at fixed offsets.

Payload: The msgpack-encoded body of an event. Its structure is defined by the emitting subsystem or process. KMES treats the payload as opaque -- it buffers and delivers payloads without interpreting them.

Stamp: The set of metadata fields in the event header that KMES populates at emission time. Stamp fields are: timestamp, sequence number, cpu_id, origin class (KMES-intrinsic), and effective token GUID, true token GUID, process GUID (identity, captured from KACS).

Effective token GUID: The GUID of the token governing the current thread's access rights at emission time. If the thread is impersonating, this is the impersonation token GUID. If the thread is not impersonating, this equals the true token GUID. For kernel emission without a process context or before KACS initialisation, this is the null GUID (all zero bytes).

True token GUID: The GUID of the process's primary token at emission time. For kernel emission without a process context or before KACS initialisation, this is the null GUID.

Process GUID: The GUID of the emitting process, assigned by KACS at process creation (fork). Immutable for the lifetime of the process -- exec does not change it. For kernel emission without a process context or before KACS initialisation, this is the null GUID.

Null GUID: A GUID consisting of 16 zero bytes. Indicates that the identity field is not applicable or not available. KMES stamps the null GUID when KACS is not initialised or when emission occurs in a context with no associated process (kernel worker threads, interrupt-deferred work).

Sequence number: A per-CPU, monotonically increasing 64-bit unsigned integer assigned by KMES to each event at emission time. Each CPU maintains its own independent counter, reset to zero when the PKM module loads. The counter is incremented before the value is taken, so the first event on each CPU receives sequence number 1. Sequence 0 is never assigned to an event. Gaps in the sequence on a given CPU indicate lost events (overwritten or dropped). The sequence number is not a global ordering primitive -- events are ordered primarily by timestamp.

Origin class: A header field identifying the subsystem or emission path that produced the event: a specific kernel subsystem (KMES, KACS, LCS) or userspace (via syscall).

Event type: An arbitrary, length-prefixed UTF-8 string in the event header identifying the kind of event. KMES imposes no structure or naming convention on event types -- schema and naming are consumer concerns.

Ring buffer: A per-CPU shared memory region created and managed by KMES, mapped into the address space of authorized userspace consumers. Each buffer consists of a producer metadata page (mapped read-only), a consumer metadata page (mapped read-write for consumer notification state), and a data region (mapped read-only, double virtual mapped). KMES maintains one ring buffer per CPU. Each buffer is independent, with its own write position, sequence counter, and futex notification. Ring buffers are the sole delivery mechanism from KMES to userspace.

Boot-time ring buffer: The ordinary per-CPU KMES ring buffers created during KMES initialisation using compiled-in defaults. These buffers are the live consumer-facing buffers once consumers attach; there is no separate private boot-buffer copy phase in v0.20.

Consumer: A userspace process that maps one or more KMES ring buffers and reads events from them. Consumers typically dedicate one thread per CPU buffer. eventd is the primary consumer.

§1.3 Conventions

§1.3.1 Normative keywords

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this specification are to be interpreted as described in RFC 2119.

§1.3.2 Byte order

All multi-byte integers in the event header and ring buffer metadata page are little-endian. The ring buffer magic field is a fixed byte sequence compared byte-by-byte, not an integer.

§1.3.3 String encoding

Event type strings are UTF-8 encoded. No case folding or normalization is applied -- event types are compared as raw byte sequences.

§1.3.4 GUID encoding

Identity GUIDs in the event header use the Microsoft GUID binary format (16 bytes): a 4-byte little-endian Data1, a 2-byte little-endian Data2, a 2-byte little-endian Data3, and an 8-byte Data4 array. The format is defined in PSD-002. KMES treats GUIDs as opaque 16-byte values -- it copies them from KACS accessors without interpreting the internal structure.
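
As an illustration of the byte layout described above, the following minimal sketch serialises a GUID into its 16-byte binary form. The struct guid type and both function names are hypothetical; only the field widths and byte order come from this section.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory GUID; field names follow Data1..Data4 above. */
struct guid {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
};

/* Serialise to the 16-byte form: Data1..Data3 little-endian, Data4 copied as-is. */
void guid_to_bytes(const struct guid *g, uint8_t out[16])
{
    assert(g != NULL && out != NULL);
    out[0] = (uint8_t)(g->data1);
    out[1] = (uint8_t)(g->data1 >> 8);
    out[2] = (uint8_t)(g->data1 >> 16);
    out[3] = (uint8_t)(g->data1 >> 24);
    out[4] = (uint8_t)(g->data2);
    out[5] = (uint8_t)(g->data2 >> 8);
    out[6] = (uint8_t)(g->data3);
    out[7] = (uint8_t)(g->data3 >> 8);
    memcpy(out + 8, g->data4, 8);
}

/* The null GUID is 16 zero bytes. */
int guid_is_null(const uint8_t b[16])
{
    static const uint8_t zero[16];
    return memcmp(b, zero, 16) == 0;
}
```

KMES itself never performs this conversion; it copies the 16 bytes opaquely. The sketch is only relevant to consumers that want to display or compare GUIDs field by field.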

§1.3.5 Payload encoding

Event payloads are encoded using MessagePack (msgpack) as defined by the MessagePack specification. KMES does not interpret payload contents.

§1.4 Compatibility

KMES is not a port or reimplementation of any existing event subsystem. The design -- structured events with a fixed binary header and msgpack payload, a shared memory ring buffer for delivery, and a single emission path for both kernel and userspace -- was chosen to meet the specific requirements of Peios: unified observability, trusted kernel-stamped metadata, and a simple delivery mechanism.

KMES serves a similar role to ETW (Event Tracing for Windows) in the Windows kernel and the audit subsystem (auditd) in Linux. It is not compatible with either at the wire level, format level, or API level.

§1.4.1 Features handled by other subsystems

Feature	Subsystem
Event persistence and querying	eventd
Event type schemas and naming conventions	eventd
Boot identity and cross-boot sequencing	peinit / eventd

2 Event model

§2.1 Event Model

§2.1.1 Structure

An event is an indivisible record consisting of a header followed by a payload. The header is a packed binary structure with no padding between fields. The payload is a msgpack-encoded blob. Header and payload are always stored, transmitted, and consumed as a single contiguous byte sequence.

§2.1.2 Header layout

The header fields are laid out sequentially with no padding or alignment gaps.

Offset	Size	Field	Description
0	4	`event_size`	Total size of the event (header + payload) in bytes. `u32`, little-endian.
4	4	`header_size`	Size of the header in bytes. `u32`, little-endian.
8	8	`timestamp`	Wall clock time at emission. Nanoseconds since Unix epoch. `u64`, little-endian.
16	8	`sequence`	Per-CPU, per-boot monotonic sequence number. `u64`, little-endian. Used for gap detection: a gap in the sequence on a given CPU indicates lost events.
24	2	`cpu_id`	The CPU on which the event was emitted. `u16`, little-endian. Identifies which per-CPU ring buffer contains this event.
26	1	`origin_class`	Origin of the event. `u8`.
27	16	`effective_token_guid`	GUID of the effective token for the emitting thread. Microsoft GUID binary format. Null GUID if not available.
43	16	`true_token_guid`	GUID of the process's primary token. Microsoft GUID binary format. Null GUID if not available.
59	16	`process_guid`	GUID of the emitting process. Microsoft GUID binary format. Null GUID if not available.
75	2	`type_len`	Length of the event type string in bytes. `u16`, little-endian.
77	`type_len`	`type`	Event type string. UTF-8 encoded. Not null-terminated.

The payload begins at offset header_size from the start of the event. The next event in the ring buffer begins at offset event_size from the start of the current event.

header_size is exactly 77 + type_len. Consumers MUST use header_size to locate the payload.

There is no separate limit on event type string length beyond the 65,535-byte maximum representable in the u16 type_len field. Below that bound, the event type length is constrained only by the total event size limits (MaxEventSize for syscall emitters, 50% of ring buffer capacity for all emitters).
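
A consumer-side decoder for the layout above might look like the following sketch. The struct kmes_header_view type and the function names are hypothetical; the offsets, widths, and the header_size = 77 + type_len invariant are taken from the table and text in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KMES_FIXED_HEADER 77u  /* bytes before the type string */

/* Hypothetical decoded view; pointers alias the input buffer. */
struct kmes_header_view {
    uint32_t event_size, header_size;
    uint64_t timestamp, sequence;
    uint16_t cpu_id;
    uint8_t  origin_class;
    const uint8_t *effective_token_guid, *true_token_guid, *process_guid;
    uint16_t type_len;
    const uint8_t *type;      /* UTF-8, not null-terminated */
    const uint8_t *payload;
    uint32_t payload_len;
};

/* Read an n-byte little-endian unsigned integer. */
static uint64_t rd_le(const uint8_t *p, unsigned n)
{
    uint64_t v = 0;
    while (n--)
        v = (v << 8) | p[n];
    return v;
}

/* Returns 0 on success, -1 if the bytes cannot be a well-formed event. */
int kmes_parse_header(const uint8_t *ev, size_t avail, struct kmes_header_view *h)
{
    assert(ev != NULL && h != NULL);
    if (avail < KMES_FIXED_HEADER)
        return -1;
    h->event_size   = (uint32_t)rd_le(ev + 0, 4);
    h->header_size  = (uint32_t)rd_le(ev + 4, 4);
    h->timestamp    = rd_le(ev + 8, 8);
    h->sequence     = rd_le(ev + 16, 8);
    h->cpu_id       = (uint16_t)rd_le(ev + 24, 2);
    h->origin_class = ev[26];
    h->effective_token_guid = ev + 27;
    h->true_token_guid      = ev + 43;
    h->process_guid         = ev + 59;
    h->type_len     = (uint16_t)rd_le(ev + 75, 2);
    if (h->header_size != KMES_FIXED_HEADER + h->type_len)
        return -1;
    if (h->event_size < h->header_size || h->event_size > avail)
        return -1;
    h->type        = ev + KMES_FIXED_HEADER;
    h->payload     = ev + h->header_size;   /* located via header_size, as required */
    h->payload_len = h->event_size - h->header_size;
    return 0;
}
```

The decoder reads fields byte by byte rather than casting to a packed struct, which avoids unaligned access and keeps the code independent of host byte order.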

§2.1.3 Stamp fields

KMES populates the following header fields at emission time. The emitter does not provide these -- they are set by KMES unconditionally.

§2.1.3.1 KMES-intrinsic stamps

timestamp -- captured from the wall clock (CLOCK_REALTIME) at the moment KMES accepts the event.
sequence -- assigned by incrementing the emitting CPU's per-boot monotonic counter (initialised to zero) and taking the new value. The first event on each CPU receives sequence number 1; sequence 0 is never assigned.
cpu_id -- the CPU on which the event was emitted.
origin_class -- for syscall emission, set unconditionally to 0 (userspace) by KMES. For kernel emission, set to the value provided by the calling subsystem.

§2.1.3.2 Identity stamps

effective_token_guid -- the GUID of the token governing the current thread's effective access rights. If the thread is impersonating, this is the impersonation token's GUID. If the thread is not impersonating, this equals true_token_guid. Captured by calling kacs_effective_token_guid().
true_token_guid -- the GUID of the process's primary token. Always the process token regardless of impersonation state. Captured by calling kacs_primary_token_guid().
process_guid -- the GUID of the emitting process, assigned at fork and immutable across exec. Captured by calling kacs_process_guid().

All three identity stamp accessors return the null GUID (16 zero bytes) when KACS is not initialised or when the current execution context has no associated process (kernel worker threads, interrupt-deferred work). KMES does not distinguish between these cases -- a null GUID means "identity not available."

§2.1.3.3 Structural fields

event_size, header_size, and type_len are structural fields set by KMES during event construction.

The emitter provides the event type string and the msgpack payload. KMES does not modify either.

§2.1.4 Ordering

For cross-CPU ordering, events are ordered by timestamp (wall clock). Events with identical timestamps from different CPUs were genuinely concurrent and have no defined relative order. Within a single CPU, the sequence number provides reliable monotonic ordering even across clock discontinuities. Events with identical timestamps from the same CPU are ordered by sequence.

There is no global sequence number. Each CPU maintains its own independent sequence counter. The pair (cpu_id, sequence) uniquely identifies an event within a single boot.
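
The ordering rules above can be expressed as a pairwise comparison. The sketch below is one possible consumer-side reading of those rules; struct ev_key and kmes_order are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three header fields that participate in ordering. */
struct ev_key {
    uint64_t timestamp;
    uint64_t sequence;
    uint16_t cpu_id;
};

/* Returns <0 if a precedes b, >0 if a follows b, and 0 if the pair has no
 * defined relative order (or is the same event). */
int kmes_order(const struct ev_key *a, const struct ev_key *b)
{
    assert(a != NULL && b != NULL);
    if (a->cpu_id == b->cpu_id) {
        /* Same CPU: sequence is authoritative, even across clock steps. */
        if (a->sequence != b->sequence)
            return a->sequence < b->sequence ? -1 : 1;
        return 0;
    }
    /* Different CPUs: wall clock timestamp only. */
    if (a->timestamp != b->timestamp)
        return a->timestamp < b->timestamp ? -1 : 1;
    return 0;   /* genuinely concurrent */
}
```

Because same-CPU comparisons ignore the timestamp, this relation is not guaranteed to be transitive when the wall clock steps backwards. A consumer that needs a merged total order would typically merge the per-CPU streams (each already in sequence order) rather than pass this function to a general-purpose sort.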

§2.1.4.1 Identity stamp timing

Identity stamps are captured during the preemption-disabled ring buffer write phase, alongside the KMES-intrinsic stamps. The three KACS accessor calls are guaranteed to be safe with preemption disabled (no sleeping, no allocation). The identity stamps reflect the thread's identity at the moment of the ring buffer write, not at syscall entry time.

For batch emission, identity stamps are captured once and shared across all events in the batch, since the batch executes with preemption disabled on a single CPU and the emitting thread's identity cannot change mid-batch.

§2.1.5 Origin class values

Value	Origin
0	Userspace (syscall)
1	KMES
2	KACS
3	LCS

Values 4--255 are reserved for future kernel subsystems.

§2.1.6 Payload

The payload is a single msgpack-encoded value occupying the bytes from offset header_size to offset event_size. KMES does not interpret or modify the payload. The payload's structure is defined by the emitter and understood by consumers.

The payload MUST be valid msgpack. A zero-length payload (event_size == header_size) is valid for kernel emitters -- the event consists of a header with no payload data. For syscall emitters, a zero-length payload fails msgpack validation (an empty byte sequence is not a valid msgpack value) and is rejected with EINVAL.

KMES does not validate payloads from kernel emitters. For events emitted via the syscall interface, KMES MUST validate that the payload is well-formed msgpack before accepting the event. Validation is iterative with a bounded maximum nesting depth. Events with invalid payloads MUST be rejected and the syscall MUST return an error to the caller.
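
To illustrate what iterative validation with a bounded nesting depth can look like, the following sketch walks a buffer with an explicit stack instead of recursion. It is not the KMES implementation: the function name, the DEPTH_CAP constant, and the treatment of empty containers (they do not count towards depth) are assumptions, and the sketch checks structure only, not UTF-8 validity of str values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEPTH_CAP 32   /* hypothetical hard cap on the explicit stack */

/* msgpack length fields are big-endian. */
static uint64_t rd_be(const uint8_t *p, unsigned n)
{
    uint64_t v = 0;
    while (n--)
        v = (v << 8) | *p++;
    return v;
}

/* True if buf[0..len) holds exactly one well-formed msgpack value. */
bool msgpack_well_formed(const uint8_t *buf, size_t len, unsigned max_depth)
{
    uint64_t todo[DEPTH_CAP];  /* values still expected per open container */
    unsigned depth = 0;
    size_t pos = 0;

    assert(buf != NULL || len == 0);
    if (max_depth > DEPTH_CAP)
        max_depth = DEPTH_CAP;

    for (;;) {
        if (pos >= len)
            return false;                  /* truncated or empty input */
        uint8_t t = buf[pos++];
        unsigned lenf = 0;   /* width of a following length field */
        uint64_t body = 0;   /* fixed body bytes to skip */
        unsigned kids = 0;   /* 0 = leaf, 1 = array, 2 = map (values per entry) */
        uint64_t n = 0;      /* decoded length or element count */

        if (t <= 0x7f || t >= 0xe0) { /* positive or negative fixint */ }
        else if (t <= 0x8f) { kids = 2; n = t & 0x0f; }      /* fixmap */
        else if (t <= 0x9f) { kids = 1; n = t & 0x0f; }      /* fixarray */
        else if (t <= 0xbf) { body = t & 0x1f; }             /* fixstr */
        else switch (t) {
        case 0xc0: case 0xc2: case 0xc3: break;              /* nil, false, true */
        case 0xc1: return false;                             /* never used */
        case 0xc4: case 0xd9: lenf = 1; break;               /* bin8, str8 */
        case 0xc5: case 0xda: lenf = 2; break;               /* bin16, str16 */
        case 0xc6: case 0xdb: lenf = 4; break;               /* bin32, str32 */
        case 0xc7: lenf = 1; body = 1; break;                /* ext8: + type byte */
        case 0xc8: lenf = 2; body = 1; break;                /* ext16 */
        case 0xc9: lenf = 4; body = 1; break;                /* ext32 */
        case 0xcc: case 0xd0: body = 1; break;               /* uint8, int8 */
        case 0xcd: case 0xd1: body = 2; break;               /* uint16, int16 */
        case 0xca: case 0xce: case 0xd2: body = 4; break;    /* float32, uint32, int32 */
        case 0xcb: case 0xcf: case 0xd3: body = 8; break;    /* float64, uint64, int64 */
        case 0xd4: body = 2; break;                          /* fixext1: type + 1 */
        case 0xd5: body = 3; break;                          /* fixext2 */
        case 0xd6: body = 5; break;                          /* fixext4 */
        case 0xd7: body = 9; break;                          /* fixext8 */
        case 0xd8: body = 17; break;                         /* fixext16 */
        case 0xdc: lenf = 2; kids = 1; break;                /* array16 */
        case 0xdd: lenf = 4; kids = 1; break;                /* array32 */
        case 0xde: lenf = 2; kids = 2; break;                /* map16 */
        case 0xdf: lenf = 4; kids = 2; break;                /* map32 */
        }

        if (lenf) {
            if (len - pos < lenf)
                return false;
            n = rd_be(buf + pos, lenf);
            pos += lenf;
        }
        if (!kids)
            body += n;                     /* str/bin/ext data bytes */
        if (body > len - pos)
            return false;
        pos += (size_t)body;

        if (kids && n > 0) {               /* open a non-empty container */
            if (depth >= max_depth)
                return false;
            todo[depth++] = n * kids;
            continue;
        }
        /* A complete value was consumed: close every container it finishes. */
        while (depth > 0 && --todo[depth - 1] == 0)
            depth--;
        if (depth == 0)
            return pos == len;             /* no trailing bytes allowed */
    }
}
```

Note that an empty buffer is rejected by the first check in the loop, which matches the rule that a zero-length payload from a syscall emitter fails validation.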

§2.1.7 Size limits

For events emitted via syscall, the maximum permitted event size is runtime-configurable via the registry (MaxEventSize). KMES uses an internal default until the registry is reachable. Events exceeding the limit MUST be rejected and the syscall MUST return an error. Kernel emitters are not subject to the configurable size limit -- they are subject only to the 50% ring buffer capacity structural limit defined in §3.1.6.

3 Emission API

§3.1 Emission API

§3.1.1 Purpose

The emission API is the internal kernel interface through which PKM subsystems emit events into KMES. It is not a syscall -- it is a function call within the kernel module. Userspace event emission is handled by the syscall interface defined in §4.

§3.1.2 Interface

A kernel emitter calls KMES with the following parameters:

origin_class (u8) -- the origin class value identifying the emitting subsystem.
event_type (byte pointer + length) -- the event type string. UTF-8 encoded.
payload (byte pointer + length) -- the msgpack-encoded payload.

The emitter does not specify a CPU or buffer. KMES writes the event to the ring buffer of the CPU on which the calling code is currently executing.

§3.1.3 Preemption

The entire emission path MUST execute with preemption disabled. Preemption is disabled before determining the current CPU and re-enabled after the ring buffer write is complete. This guarantees that the emitting thread cannot be migrated to a different CPU mid-write, which would violate the single-writer-per-buffer invariant.

For kernel emitters, preemption is disabled for the full emission path (timestamp capture through ring buffer write). This is acceptable because kernel emitters produce small, trusted payloads and the total non-preemptible window is a few hundred nanoseconds.

§3.1.4 Event construction

KMES constructs the event by:

1. Capturing the wall clock timestamp.
2. Incrementing the current CPU's per-boot sequence counter and taking the new value. The counter starts at 0; the first event on each CPU receives sequence number 1. Sequence 0 is never assigned to an event. This is a CPU-local operation with no cross-CPU contention.
3. Capturing the three identity GUIDs by calling kacs_effective_token_guid(), kacs_primary_token_guid(), and kacs_process_guid(). These calls return the null GUID if KACS is not initialised or no process context exists.
4. Building the packed header from the KMES-intrinsic stamp fields, identity GUIDs, cpu_id, origin class, and event type.
5. Writing the header and payload contiguously into the current CPU's ring buffer.

The timestamp is captured before the sequence number is assigned. Two events with the same timestamp on the same CPU are ordered by sequence number. Identity GUIDs are captured after the sequence number but before the header is built -- the exact ordering between steps 2 and 3 is not observable to consumers.
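
The header-building and write steps can be sketched as a single serialisation routine. This is an illustration in ordinary C rather than kernel code: struct kmes_stamp and kmes_build_event are hypothetical names, the stamp values are assumed to have been captured already, and the destination is a plain byte buffer standing in for the ring buffer slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KMES_FIXED_HEADER 77u

/* Hypothetical bundle of already-captured stamp fields. */
struct kmes_stamp {
    uint64_t timestamp, sequence;
    uint16_t cpu_id;
    uint8_t  origin_class;
    uint8_t  effective_token_guid[16], true_token_guid[16], process_guid[16];
};

/* Write the low n bytes of v in little-endian order. */
static void wr_le(uint8_t *p, uint64_t v, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Writes header and payload contiguously into dst and returns event_size.
 * The caller guarantees that dst has room and that the sizes have already
 * passed the structural checks. */
uint32_t kmes_build_event(uint8_t *dst, const struct kmes_stamp *s,
                          const void *type, uint16_t type_len,
                          const void *payload, uint32_t payload_len)
{
    assert(dst != NULL && s != NULL && type != NULL && type_len > 0);
    uint32_t header_size = KMES_FIXED_HEADER + type_len;
    uint32_t event_size  = header_size + payload_len;

    wr_le(dst + 0,  event_size, 4);
    wr_le(dst + 4,  header_size, 4);
    wr_le(dst + 8,  s->timestamp, 8);
    wr_le(dst + 16, s->sequence, 8);
    wr_le(dst + 24, s->cpu_id, 2);
    dst[26] = s->origin_class;
    memcpy(dst + 27, s->effective_token_guid, 16);
    memcpy(dst + 43, s->true_token_guid, 16);
    memcpy(dst + 59, s->process_guid, 16);
    wr_le(dst + 75, type_len, 2);
    memcpy(dst + KMES_FIXED_HEADER, type, type_len);
    if (payload_len)                       /* zero-length payloads are legal for kernel emitters */
        memcpy(dst + header_size, payload, payload_len);
    return event_size;
}
```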

§3.1.5 Caller contract

The emitter MUST provide a valid origin class value as defined in §2.1.5. The emitter MUST provide a valid UTF-8 event type string. The emitter MUST provide the payload as a contiguous byte buffer.

KMES does not validate the origin class, event type encoding, or payload contents from kernel emitters. These are trusted callers within PKM.

§3.1.6 Structural checks

The emission API performs the following structural checks on every call:

The event type string MUST have nonzero length.
The total event size (header + payload) MUST fit in a u32.
The total event size MUST NOT exceed 50% of the per-CPU ring buffer capacity.

The ring buffer capacity check protects against kernel bugs that would emit an event large enough to overwrite most of a CPU's event history. This threshold is a fixed ratio, not a configurable parameter.

If a structural check fails, the event is not written to the ring buffer. The sequence number still advances, making the drop visible as a gap in the sequence. KMES increments an internal dropped-event counter.
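
The three structural checks reduce to a few overflow-safe comparisons. The sketch below shows one way to write them; the function name and the choice of a 64-bit payload length parameter are assumptions, since this section does not fix the C types of the kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KMES_FIXED_HEADER 77u

/* True if an event with the given sizes passes all three structural checks. */
bool kmes_structural_ok(uint16_t type_len, uint64_t payload_len,
                        uint64_t ring_capacity)
{
    assert(ring_capacity > 0);
    if (type_len == 0)
        return false;                      /* check 1: nonzero type length */
    if (payload_len > UINT32_MAX)
        return false;                      /* cannot fit in a u32 total */
    /* Safe in 64 bits: at most 77 + 65535 + (2^32 - 1). */
    uint64_t total = (uint64_t)KMES_FIXED_HEADER + type_len + payload_len;
    if (total > UINT32_MAX)
        return false;                      /* check 2: total fits in a u32 */
    if (total * 2 > ring_capacity)
        return false;                      /* check 3: at most 50% of capacity */
    return true;
}
```

Comparing total * 2 against the capacity, rather than total against capacity / 2, keeps the 50% boundary exact when the capacity is odd. On failure, the caller would still advance the sequence counter and bump the dropped-event counter, as described above.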

§3.1.7 Ring buffer full

Each per-CPU ring buffer is circular. When a buffer is full, KMES overwrites the oldest events in that buffer to make space for the new event. The write pointer advances unconditionally -- emission never blocks and never fails due to buffer pressure.

Consumers detect overwritten events as gaps in the sequence number. If a consumer's read position has been overwritten, the consumer is advanced to the oldest surviving event.

§3.1.8 Batch emission

The batch emission API allows a kernel emitter to emit multiple events as a single operation, reducing per-event overhead.

§3.1.8.1 Interface

A kernel emitter calls the batch API with the following parameters:

origin_class (u8) -- the origin class value identifying the emitting subsystem. Applied to all events in the batch.
events (array of event descriptors) -- each descriptor contains an event type (byte pointer + length) and a payload (byte pointer + length).
count (u32) -- the number of events in the array.

§3.1.8.2 Behavior

KMES processes the batch as follows:

1. Disable preemption.
2. Capture a single wall clock timestamp. All events in the batch share this timestamp.
3. Capture the three identity GUIDs once. All events in the batch share these identity stamps.
4. For each event in order: perform structural checks, assign a sequence number, build the header, and write the event to the ring buffer. If a structural check fails on any event, the failing event is dropped (sequence number consumed, gap visible) but subsequent events in the batch continue to be processed.
5. Store write_pos with a single release barrier after all events are written.
6. Check need_wake once. If need_wake is 1, increment futex_counter with a release store and issue futex_wake to wake all waiting consumer threads.
7. Re-enable preemption.

The shared timestamp and identity GUIDs reflect the logical instant of the batch. Events within a batch are ordered by their sequence numbers. The single write_pos update and single need_wake check are the primary performance benefits over individual emission.

§3.1.8.3 Failure semantics

The kernel batch API continues processing after a single event failure. If a structural check fails on event N, event N is dropped but events N+1 through the end of the batch are still processed. This is deliberately different from the syscall batch API (kmes_emit_batch), which stops processing at the first failure. Kernel emitters are trusted and individual structural failures are expected to be rare (indicating a kernel bug). Syscall emitters are untrusted and receive an error indicating which entry failed so the caller can diagnose and fix the issue.

§3.1.8.4 Caller contract

The same caller contract as single emission applies to each event in the batch. KMES does not validate payload contents from kernel emitters.

§3.1.9 Atomicity

Individual event writes to the ring buffer MUST be atomic from the consumer's perspective. A consumer MUST NOT observe a partially written event. For batch emission, the write_pos update is deferred until all events are written, so consumers observe the entire batch atomically -- no events from the batch are visible until all have been written. The consumer then processes the events in the batch individually; each is independently valid. The mechanism used to guarantee write atomicity is defined in §5.1.8.

Section

4 Syscall interface

§4.1 4 Syscall interface

Syscall Interface

§4.1.1 Overview

KMES exposes three syscalls in the PKM range (1090--1099):

kmes_emit (1090) -- emit a single event from userspace.
kmes_attach (1091) -- attach as a consumer of a single per-CPU ring buffer.
kmes_emit_batch (1092) -- emit multiple events from userspace as a single operation.

All three syscalls use standard Linux error conventions: return -1 and set errno on failure.

§4.1.2 kmes_emit (1090)

Emits a single event into KMES from userspace. The origin class is set to 0 (userspace) unconditionally -- the caller cannot specify it. The event is written to the ring buffer of the CPU on which the calling thread is currently executing.

§4.1.2.1 Privilege requirement

The caller's effective token MUST hold SeAuditPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.2.2 Parameters

Parameter	Type	Description
`event_type`	`const char *`	Pointer to the event type string.
`event_type_len`	`u16`	Length of the event type string in bytes.
`payload`	`const void *`	Pointer to the msgpack-encoded payload.
`payload_len`	`u32`	Length of the payload in bytes.

§4.1.2.3 Validation

KMES performs the following validation on every kmes_emit call, in order:

1. The caller MUST hold SeAuditPrivilege. Fails with EPERM.
2. If the caller does not hold SeTcbPrivilege, the per-process rate limit is checked using a token bucket algorithm. The bucket refills at MaxEmitRatePerProcess tokens per second and has a burst capacity equal to MaxEmitRatePerProcess. The bucket is created on the process's first emit call, initialised to full capacity. The bucket is destroyed when the process exits. If MaxEmitRatePerProcess changes at runtime, the bucket capacity and refill rate are updated immediately; the current token count is clamped to the new capacity if it exceeds it. If the bucket is empty, the syscall fails with EAGAIN. One token is consumed after the event is successfully written to the ring buffer. Validation failures do not consume a token. Callers holding SeTcbPrivilege are exempt from rate limiting. If KACS is not yet initialised and privilege checks cannot be performed, the syscall fails with EPERM (fail-closed).
3. event_type_len MUST be nonzero. Fails with EINVAL.
4. The declared total event size (header_size + payload_len, where header_size = 77 + event_type_len) is calculated from the userspace length fields without dereferencing the userspace pointers. If the arithmetic overflows, the syscall fails with EINVAL.
5. The declared total event size (header + payload) MUST NOT exceed the configured maximum event size. Fails with ENOSPC.
6. The declared total event size MUST NOT exceed 50% of the per-CPU ring buffer capacity. Fails with ENOSPC.
7. The event type and payload are copied from userspace into a kernel buffer. The event type MUST be valid UTF-8. Fails with EINVAL if the event type contains invalid UTF-8 byte sequences. If either pointer is inaccessible, fails with EFAULT. All subsequent validation and the ring buffer write operate on the kernel copy, not the original userspace memory. This prevents TOCTOU (time-of-check-time-of-use) attacks where userspace modifies the payload between validation and the ring buffer write.
8. The payload MUST be valid msgpack with nesting depth not exceeding the configured maximum. Fails with EINVAL.

Validation stops at the first failure. The error reflects the first check that failed.
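
The token bucket behaviour described in the rate-limit step can be sketched with integer arithmetic only. Everything below is illustrative: the names are hypothetical, the token level is kept in units of one billionth of a token so that refill needs no division, the rate is assumed to fit in 32 bits, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000ull

struct token_bucket {
    uint64_t rate;     /* tokens per second; also the burst capacity */
    uint64_t level;    /* current tokens, scaled by NS_PER_SEC */
    uint64_t last_ns;  /* time of the last refill */
};

/* Created on the first emit call, full. */
void tb_init(struct token_bucket *b, uint64_t rate, uint64_t now_ns)
{
    assert(rate <= UINT32_MAX);            /* keeps rate * NS_PER_SEC in range */
    b->rate = rate;
    b->level = rate * NS_PER_SEC;
    b->last_ns = now_ns;
}

static void tb_refill(struct token_bucket *b, uint64_t now_ns)
{
    uint64_t cap = b->rate * NS_PER_SEC;
    uint64_t dt = now_ns - b->last_ns;
    if (dt > NS_PER_SEC)
        dt = NS_PER_SEC;                   /* one second always fills the bucket */
    b->level += dt * b->rate;
    if (b->level > cap)
        b->level = cap;
    b->last_ns = now_ns;
}

/* Rate check, performed before the remaining validation. */
bool tb_has_token(struct token_bucket *b, uint64_t now_ns)
{
    tb_refill(b, now_ns);
    return b->level >= NS_PER_SEC;
}

/* Called only after the event has been written to the ring buffer. */
void tb_consume(struct token_bucket *b)
{
    if (b->level >= NS_PER_SEC)
        b->level -= NS_PER_SEC;
}

/* Runtime change of the rate: new capacity applies at once, level is clamped. */
void tb_set_rate(struct token_bucket *b, uint64_t rate, uint64_t now_ns)
{
    assert(rate <= UINT32_MAX);
    tb_refill(b, now_ns);
    b->rate = rate;
    if (b->level > rate * NS_PER_SEC)
        b->level = rate * NS_PER_SEC;
}
```

Separating the check from the consume step mirrors the rule that validation failures do not consume a token. A real implementation would also need to make check and consume safe against concurrent emitters in the same process.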

§4.1.2.4 Preemption

Validation (steps 1--8) runs with preemption enabled. The userspace copy at step 7 may trigger page faults, and msgpack validation at step 8 may take microseconds for large payloads. Neither requires CPU affinity.

Preemption is disabled only for the ring buffer write: determining the current CPU, constructing the header (stamping timestamp, sequence number, cpu_id, and capturing identity GUIDs from KACS), writing the event, advancing write_pos, and checking need_wake. The three KACS GUID accessor calls add negligible overhead (each reads a field from an in-memory kernel structure with no allocation or sleeping). This keeps the non-preemptible window to a few hundred nanoseconds regardless of payload size. The cpu_id and identity GUIDs in the event header reflect the state at write time, not at syscall entry time.

§4.1.2.5 Behavior

On success, KMES writes the event to the current CPU's ring buffer and returns 0. The event is visible to consumers immediately.

If the ring buffer is full, KMES overwrites the oldest events. The syscall never blocks due to buffer pressure.

§4.1.2.6 Return

Returns 0 on success. Returns -1 and sets errno on failure.

§4.1.2.7 Errors

Errno	Meaning
EPERM	Caller does not hold SeAuditPrivilege, or KACS is not yet initialised and privilege checks cannot be performed.
EAGAIN	Per-process rate limit exceeded. Caller SHOULD back off and retry.
EINVAL	Event type length is zero, or the declared total event size overflows, or event type is not valid UTF-8, or payload is invalid msgpack, or payload nesting depth exceeds MaxNestingDepth.
EFAULT	Event type or payload pointer is inaccessible.
ENOSPC	Event exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation for the staging buffer failed.
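
From userspace, a call to kmes_emit might be wrapped as in the sketch below. No userspace library is defined by this specification, so the wrapper, the helper names, and the example event type string are all hypothetical; only the syscall number, the parameter order, and the errno values come from this section.

```c
#define _GNU_SOURCE            /* for syscall() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SYS_KMES_EMIT 1090     /* syscall number from this section */

/* Hypothetical thin wrapper; argument order follows the parameter table. */
long kmes_emit(const char *type, uint16_t type_len,
               const void *payload, uint32_t payload_len)
{
    assert(type != NULL);
    return syscall(SYS_KMES_EMIT, type, type_len, payload, payload_len);
}

/* EAGAIN is the only errno the table marks for back-off and retry. */
bool kmes_emit_retryable(int err)
{
    return err == EAGAIN;
}

/* Emit one event whose payload is the single msgpack value nil (0xc0).
 * Returns 0 on success or the errno reported by the kernel. */
int emit_example(void)
{
    static const char type[] = "example.started";   /* hypothetical type */
    static const uint8_t payload[] = { 0xc0 };

    if (kmes_emit(type, (uint16_t)strlen(type), payload, sizeof payload) == 0)
        return 0;
    return errno;
}
```

The event type length is passed without a terminating null byte, since the type string in the header is length-prefixed rather than null-terminated.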

§4.1.3 kmes_emit_batch (1092)

Emits multiple events into KMES from userspace as a single operation. All events share a single timestamp and the overhead of privilege checking, notification, and the write_pos release barrier is incurred once for the batch rather than per event.

§4.1.3.1 Privilege requirement

The caller's effective token MUST hold SeAuditPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.3.2 Parameters

Parameter	Type	Description
`entries`	`struct kmes_emit_entry __user *`	Pointer to an array of event descriptors.
`count`	`u32`	Number of entries in the array. MUST be at least 1 and at most 256. The limit of 256 bounds the worst-case preemption-disabled window during the ring buffer write phase to approximately 50--100 microseconds for typical event sizes, while providing strong syscall overhead amortization.
`emitted_out`	`u32 __user *`	Pointer to a u32 written by KMES with the number of events actually emitted.

Each kmes_emit_entry uses C ABI natural alignment on the target architecture. On x86-64:

Offset	Size	Type	Field	Description
0	8	`pointer`	`event_type`	Pointer to the event type string.
8	2	`u16`	`event_type_len`	Length of the event type string in bytes.
10	6	--	padding
16	8	`pointer`	`payload`	Pointer to the msgpack-encoded payload.
24	4	`u32`	`payload_len`	Length of the payload in bytes.
28	4	--	padding

Total struct size: 32 bytes.
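
In C, the layout above falls out of natural alignment without explicit padding members. The following sketch declares the struct and checks the documented x86-64 offsets at compile time; the checks are guarded so that the code still compiles on targets with a different pointer width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kmes_emit_entry {
    const char *event_type;       /* offset 0 */
    uint16_t    event_type_len;   /* offset 8, followed by 6 bytes of padding */
    const void *payload;          /* offset 16 */
    uint32_t    payload_len;      /* offset 24, followed by 4 bytes of padding */
};

#if UINTPTR_MAX == 0xffffffffffffffffu
/* Layout checks for 64-bit targets such as x86-64. */
_Static_assert(offsetof(struct kmes_emit_entry, event_type) == 0, "event_type");
_Static_assert(offsetof(struct kmes_emit_entry, event_type_len) == 8, "event_type_len");
_Static_assert(offsetof(struct kmes_emit_entry, payload) == 16, "payload");
_Static_assert(offsetof(struct kmes_emit_entry, payload_len) == 24, "payload_len");
_Static_assert(sizeof(struct kmes_emit_entry) == 32, "total size");
#endif
```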

§4.1.3.3 Validation

1. The caller MUST hold SeAuditPrivilege. Fails with EPERM.
2. count MUST be between 1 and 256 inclusive. Fails with EINVAL.
3. If the caller does not hold SeTcbPrivilege, the per-process rate limit is checked using the same token bucket as kmes_emit. The bucket MUST have at least count tokens available. If not, the syscall fails with EAGAIN and no events are emitted. Step 3 MUST atomically reserve count tokens from the bucket to prevent concurrent threads from passing the rate check simultaneously. After all processing is complete, unused tokens for events that were not emitted (due to validation failure) are returned to the bucket. Only events actually emitted consume tokens (see Return). Callers holding SeTcbPrivilege are exempt from rate limiting. If KACS is not yet initialised, the syscall fails with EPERM.
4. emitted_out MUST be writable. If the pointer is inaccessible, the syscall fails with EFAULT and no events are emitted. If the pointer is writable, KMES stores 0 before any per-entry processing begins.
5. The entry descriptor array is copied from userspace. Fails with EFAULT if inaccessible.
6. For each entry in order, starting from index 0: the declared total event size (77 + event_type_len + payload_len) is calculated from the copied length fields without dereferencing the per-entry userspace pointers. If the arithmetic overflows, processing stops and the failing entry is rejected with EINVAL.
7. The declared total event size for the entry MUST be within MaxEventSize and within 50% of the ring buffer capacity. If either limit is exceeded, processing stops and the failing entry is rejected with ENOSPC.
8. The event type and payload are copied from userspace into kernel memory. An entry's event type and payload pointers are dereferenced only after steps 6 and 7 pass for that entry. If a per-entry pointer is inaccessible, processing stops and the failing entry is rejected with EFAULT.
9. The staged entry is validated using the same remaining rules as kmes_emit (nonzero event type length, valid UTF-8 event type, valid msgpack within MaxNestingDepth). If any entry fails validation, processing stops. Events before the failing entry that passed validation are emitted. The failing entry and all subsequent entries are not processed.

§4.1.3.4 Preemption

The userspace copies and msgpack validation run with preemption enabled. Preemption is disabled only for the ring buffer writes, the single write_pos release barrier, and the need_wake check.

§4.1.3.5 Behavior

All successfully validated events share a single wall clock timestamp and a single set of identity GUIDs, captured once at the start of the ring buffer write phase. Each event receives its own sequence number. The origin class is set to 0 (userspace) for all events.

If the ring buffer is full, KMES overwrites the oldest events. The syscall never blocks due to buffer pressure.

§4.1.3.6 Return

Returns 0 on full success and writes count to *emitted_out.

Returns -1 and sets errno on any failure. Whether *emitted_out is written, and what it holds, depends on where the failure occurs:

For failures before step 4 (EPERM, EAGAIN, EINVAL on count), *emitted_out is not written.
For EFAULT at step 4, or for failures after step 4 but before any event is emitted, *emitted_out remains 0.
If validation fails on entry N, KMES emits events 0 through N-1, returns -1, sets errno to indicate why entry N failed, and writes N to *emitted_out.

Events that fail validation (entry N and all subsequent entries) do not consume sequence numbers -- they never enter the ring buffer write phase. Only the N events actually emitted are charged against the rate limit, not the full count.
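
The return contract for a batch that has reached per-entry processing can be modeled in a few lines. The function name and the list-of-errno input are illustrative assumptions; the model only captures which value is returned and what is written to *emitted_out.

```python
import errno


def batch_outcome(entry_errnos):
    """Model the return contract once per-entry processing has started.

    entry_errnos[i] is 0 if entry i validates, else the errno for its failure.
    Returns (return_value, errno_value, emitted_out).
    """
    for index, err in enumerate(entry_errnos):
        if err != 0:
            # Entries 0..index-1 are emitted; entry index and later are not.
            return (-1, err, index)
    return (0, 0, len(entry_errnos))
```

A caller that receives -1 with *emitted_out equal to N knows that entry N is the one described by errno and can decide whether to repair it, skip it, or resubmit from N.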

§4.1.3.7 Errors

Errno	Meaning
EPERM	Caller does not hold SeAuditPrivilege.
EAGAIN	Per-process rate limit exceeded. Caller SHOULD back off and retry.
EINVAL	`count` is 0 or exceeds 256, or the failing entry has a zero-length event type, or the failing entry's event type is not valid UTF-8, or the failing entry's payload is invalid msgpack or exceeds MaxNestingDepth.
EFAULT	`emitted_out`, entry array, event type, or payload pointer is inaccessible.
ENOSPC	The failing entry exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation failed.

§4.1.4 kmes_attach (1091)

Attaches the caller as a consumer of a single per-CPU KMES ring buffer. Returns one file descriptor for the specified CPU.

§4.1.4.1 Privilege requirement

The caller's effective token MUST hold SeSecurityPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.4.2 Parameters

Parameter	Type	Description
`cpu_id`	`u32`	The CPU index to attach to.
`capacity`	`u64 __user *`	Pointer to a u64. On return, the per-CPU ring buffer capacity in bytes. The consumer uses this to compute the mmap size: `8192 + 2 * capacity`.

§4.1.4.3 Behavior

cpu_id is a logical CPU index in the range [0, num_cpus), where num_cpus is the number of CPUs that were online when KMES initialised (at PKM load time). This is the same numbering used in the ring buffer's cpu_id metadata field and in event headers.

KMES creates a single file descriptor for the ring buffer of CPU cpu_id and returns it. The current per-CPU ring buffer capacity is written to *capacity. The returned file descriptor maps exactly one CPU's ring buffer.

If cpu_id is greater than or equal to num_cpus, the syscall fails with EINVAL. This includes CPUs brought online after KMES initialisation -- CPU hotplug is not supported in v0.20 (see §7.1). Consumers discover the CPU count by calling kmes_attach with incrementing cpu_id values starting from 0 until EINVAL is returned.
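
The discovery loop described above might look like the following sketch. The `attach` callable is a hypothetical wrapper around the kmes_attach syscall (no such Python binding is defined by this specification); it is assumed to return the file descriptor and capacity, and to raise OSError carrying the syscall's errno.

```python
import errno


def discover_cpu_count(attach):
    """Probe cpu_id 0, 1, 2, ... until the attach call reports EINVAL.

    `attach` is a hypothetical wrapper: attach(cpu_id) -> (fd, capacity),
    raising OSError with the syscall's errno on failure.
    """
    attached = []
    cpu_id = 0
    while True:
        try:
            fd, capacity = attach(cpu_id)
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                return cpu_id, attached     # cpu_id now equals num_cpus
            raise                           # EPERM, EFAULT, ENOMEM are real errors
        attached.append((fd, capacity))
        cpu_id += 1
```

Only EINVAL is treated as the end of the range; any other error is propagated so that a privilege or memory failure is not mistaken for the CPU count.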

Repeated calls with the same cpu_id are permitted and return a new file descriptor each time. Each file descriptor maps that CPU's ring buffer. The producer metadata page, consumer metadata page, and data region are shared for all file descriptors attached to the same per-CPU buffer, as defined in §5.1. This allows multiple direct consumers to attach to the same CPU's ring buffer concurrently while sharing the buffer's advisory notification state.

The returned file descriptor supports:

mmap() -- maps that CPU's ring buffer into the caller's address space. The mapping size is 8192 + 2 * capacity bytes. The mapped region layout is defined in §5.1.5.
close() -- releases the file descriptor. The mapping becomes invalid.

The mapped region is split into read-only and read-write sections. The producer metadata page and the data region are mapped read-only -- no privilege, capability, or token grants write access to these regions from userspace. Only KMES writes event data and producer metadata. The consumer metadata page is mapped read-write for consumer notification state (need_wake). KMES treats consumer metadata as advisory and validates all values read from it.

Multiple consumers MAY attach to the same CPU's ring buffer simultaneously. Each consumer maintains its own read position independently in userspace; the mapped consumer metadata page is shared by all consumers of a given per-CPU buffer and is not a per-consumer read-position store.

§4.1.4.4 Notification

Each per-CPU ring buffer has its own futex counter (u32) in its producer metadata page (§5.1.6.3). When KMES writes an event to a CPU's ring buffer and need_wake is set, it increments that buffer's futex counter and issues a futex_wake that wakes all waiting threads. This ensures that multiple consumers attached to the same buffer are all woken when events arrive.

This allows consumers to dedicate one thread per CPU buffer, each sleeping independently on its own futex. Under sustained load, consumer threads remain in the drain loop, need_wake stays 0, and KMES skips all notification overhead.

§4.1.4.5 Return

Returns the file descriptor (non-negative) on success. Returns -1 and sets errno on failure.

§4.1.4.6 Errors

Errno	Meaning
EPERM	Caller does not hold SeSecurityPrivilege.
EINVAL	`cpu_id` is greater than or equal to the number of CPUs.
EFAULT	`capacity` points to inaccessible memory.
ENOMEM	Kernel memory allocation failed.

Section

5 Ring buffer

§5.1 5 Ring buffer

Ring Buffer

§5.1.1 Overview

The ring buffer is the sole delivery mechanism from KMES to userspace consumers. KMES maintains one ring buffer per CPU. Each per-CPU buffer is an independent shared memory region, independently mappable, with its own metadata, write position, and futex counter. There is no shared state between per-CPU buffers on the write path.

This per-CPU design eliminates all contention on the event emission path. Each CPU writes to its own buffer using its own counters. No atomic operations contend across CPUs. This is critical for workloads where KMES traces every syscall across many cores.

Consumers read from per-CPU buffers independently. Each buffer is a complete, self-contained ring buffer -- the same structure, the same read protocol, the same overwrite semantics. The per-CPU design does not change the ring buffer contract; it replicates it.

§5.1.2 Boot buffer

KMES begins buffering events as soon as PKM loads, before the registry is available. During this early boot window, events are stored in per-CPU ring buffers created at the compiled-in default BufferCapacity.

These boot-time buffers are ordinary KMES ring buffers. They use the same mapped layout, overwrite semantics, metadata contract, and generation model as later buffers. They MAY be attached and mapped by consumers immediately; there is no separate private boot-only buffer class.

When LCS becomes available, KMES reads the configured ring buffer size from the registry. If the configured BufferCapacity differs from the compiled-in default, KMES creates new per-CPU ring buffers at that size, copies all surviving events from the boot-time buffers into them, increments generation, and switches writers to the new buffers. If the configured BufferCapacity matches the default (or the key does not exist), the existing boot-time buffers remain the live buffers and no swap occurs.

If LCS is not available, KMES continues using the existing boot-time buffers indefinitely at the compiled-in default size.

Boot-time buffers use the same circular overwrite semantics as all other ring buffers. If a boot-time buffer fills before LCS appears, the oldest events are overwritten.

§5.1.3 Capacity

All per-CPU ring buffers share the same capacity. The capacity MUST be a power of two. This allows the wrap-around offset calculation to use a bitwise AND (position & (capacity - 1)) instead of a modulo operation. Every event read and write performs this calculation.
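
As a small illustration of why the power-of-two requirement matters, the sketch below checks the constraint and applies the mask. The helper names are hypothetical.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; a power of two has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def data_offset(position, capacity):
    """Map a monotonically increasing byte position to an offset in the buffer."""
    if not is_power_of_two(capacity):
        raise ValueError("capacity must be a power of two")
    # For a power-of-two capacity this equals position % capacity.
    return position & (capacity - 1)
```

The mask gives the same result as a modulo only when capacity is a power of two, which is why other capacities are rejected rather than rounded.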

The capacity is configurable via the registry. The compiled-in default is used when the registry is not yet available. The minimum and maximum permitted capacities are implementation-defined but MUST both be powers of two.

§5.1.4 Double virtual mapping

Each ring buffer's physical pages are mapped twice consecutively in virtual memory. If the ring buffer occupies N physical pages, the data region spans 2N pages of virtual address space, with the second N pages mapping the same physical memory as the first N.

Physical pages:  [0][1][2][3]
Virtual mapping: [0][1][2][3][0][1][2][3]

This eliminates all wrap-around handling. When KMES writes an event that crosses the end of the buffer, the write continues into virtual addresses that map back to the beginning of the physical buffer. No branch, no split write, no padding. A single contiguous memcpy handles every write regardless of position.

Consumers benefit identically -- an event that wraps around the physical boundary is read as a single contiguous byte sequence from the consumer's perspective.
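
The aliasing can be illustrated with a toy model. In the real mapping the MMU provides the aliasing, so a plain contiguous copy works; the class below (a hypothetical name) emulates that aliasing with an explicit modulo purely to show which physical bytes a wrapping access touches.

```python
class DoubleMapped:
    """Toy model: 2N virtual bytes backed by N physical bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.phys = bytearray(capacity)

    def write(self, virt_offset, data):
        # Virtual byte v aliases physical byte v mod N.
        for i, byte in enumerate(data):
            self.phys[(virt_offset + i) % self.capacity] = byte

    def read(self, virt_offset, length):
        # The caller sees one contiguous run even when it crosses offset N.
        return bytes(self.phys[(virt_offset + i) % self.capacity]
                     for i in range(length))
```

Writing four bytes at virtual offset 6 of an 8-byte buffer places two bytes at the end of the physical buffer and two at the start, yet the read at the same virtual offset returns them as a single sequence.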

§5.1.5 Mapped region layout

When a consumer calls mmap() on a per-CPU file descriptor returned by kmes_attach, the mapped region has the following layout:

Region	Size	Description
Producer metadata page	4096 bytes	KMES-written control fields. Mapped read-only to consumers.
Consumer metadata page	4096 bytes	Consumer-written fields. Mapped read-write to consumers.
Data region	2 × capacity	The double-mapped ring buffer containing events. Mapped read-only to consumers.

The total mapping size is 8192 + (2 × capacity) bytes. Every per-CPU buffer has the same layout and the same capacity.

The consumer maps the entire region with a single mmap() call on the per-CPU file descriptor: mmap(NULL, 8192 + 2 * capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0). The kernel's mmap handler enforces per-page permissions internally: the producer metadata page and data region pages are mapped read-only regardless of the requested PROT flags, and the consumer metadata page is mapped read-write. The consumer does not need to issue separate mmap calls for each region. The consumer discovers capacity from the kmes_attach syscall (§4.1.4).
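
The layout arithmetic is simple enough to state as code. The function name is hypothetical; the sizes come from the layout table above.

```python
PAGE_SIZE = 4096  # size of each metadata page, from the layout table


def mapped_layout(capacity):
    """Return (total_size, {region: (offset, size)}) for one per-CPU mapping."""
    regions = {
        "producer_metadata": (0, PAGE_SIZE),
        "consumer_metadata": (PAGE_SIZE, PAGE_SIZE),
        "data": (2 * PAGE_SIZE, 2 * capacity),
    }
    total = 2 * PAGE_SIZE + 2 * capacity
    return total, regions
```

For the default capacity of 4194304 bytes this gives a mapping of 8396800 bytes, with the data region starting at offset 8192.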

The mapping is split into read-only and read-write regions to enforce a trust boundary. KMES writes to the producer metadata page and the data region; consumers can only read these. Consumers write to the consumer metadata page; KMES reads from it but treats all values as advisory -- a corrupted consumer page cannot affect KMES correctness or the producer metadata.

The consumer metadata page is shared between all consumers attached to the same per-CPU buffer. A malicious consumer with SeSecurityPrivilege could overwrite need_wake to suppress notification to other consumers on the same buffer. This is accepted because SeSecurityPrivilege is a very high-trust privilege -- normal event consumers use eventd (which enforces per-event SD-based access control), not direct ring buffer access.

§5.1.6 Producer metadata page (offset 0, read-only)

The producer metadata page is laid out to prevent false sharing. Fields that are updated at different frequencies are placed on separate 64-byte cache lines.

False sharing occurs when two independent fields share a cache line. Updating one field invalidates that line in every other core's cache, forcing those cores to re-fetch it even to read the unchanged field. In the per-CPU design, false sharing between CPUs is eliminated by using separate buffers. Cache line separation within a buffer prevents false sharing between the producing CPU and consuming threads.

§5.1.6.1 Cache line 0 -- static fields (bytes 0--63)

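Written when the ring buffer is created. All fields except generation are never modified after initialisation, and consumers MAY cache them for the lifetime of the mapping. generation is the exception: KMES increments it once in a buffer's metadata when that buffer is retired by a swap (§5.1.9.4), so consumers re-read it as described in §5.1.9.3.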

Offset	Size	Type	Field	Description
0	8	`[u8; 8]`	`magic`	Magic byte sequence identifying this as a KMES ring buffer. Value: `4B 4D 45 53 52 49 4E 47` (`KMESRING` in ASCII). Compared byte-by-byte, not as an integer.
8	4	`u32`	`version`	Ring buffer format version. v0.20 uses version 1.
12	2	`u16`	`cpu_id`	The CPU this buffer belongs to.
14	2	`u16`	`reserved0`	Reserved. Must be zero.
16	8	`u64`	`capacity`	Data region capacity in bytes. Power of two.
24	8	`u64`	`data_offset`	Byte offset from the start of the mapping to the data region. Equal to the combined metadata size (8192).
32	8	`u64`	`generation`	Buffer generation counter. Starts at 1 for the first ring buffer created on each CPU. Monotonically increasing across buffer swaps -- the new buffer's generation is the old buffer's incremented value.
40	24	--	`reserved1`	Reserved. Must be zero. Pads to cache line boundary.

§5.1.6.2 Cache line 1 -- producer fields (bytes 64--127)

Written by KMES on every event write to this CPU's buffer. This is the hottest cache line in the ring buffer. In the per-CPU design, only one CPU ever writes to this cache line, eliminating cross-core invalidation.

Offset	Size	Type	Field	Description
64	8	`u64`	`write_pos`	Monotonically increasing byte offset of the next write position. Never wraps -- at 1 GB/s sustained throughput, a `u64` byte offset would take over 500 years to overflow. The actual data region offset is `write_pos & (capacity - 1)`. Initialized to 0 on a fresh buffer; when a buffer is created by a swap, set to the total byte size of the events copied into it (§5.1.9.4).
72	8	`u64`	`tail_pos`	Byte offset of the oldest surviving event. Advanced by KMES when events are overwritten. Consumers whose read position is behind `tail_pos` have been lapped. Initialized to 0 on a fresh buffer.
80	48	--	`reserved2`	Reserved. Must be zero. Pads to cache line boundary.

§5.1.6.3 Cache line 2 -- notification fields (bytes 128--191)

Written by KMES when waking consumers. Read by consumers for futex-based sleep.

Offset	Size	Type	Field	Description
128	4	`u32`	`futex_counter`	Counter incremented by KMES when waking sleeping consumers. Consumers use `futex_wait` on this address. `u32` because Linux `futex(2)` operates on 32-bit integers. Only incremented when `need_wake` is set.
132	60	--	`reserved3`	Reserved. Must be zero. Pads to cache line boundary.
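
A consumer could decode the three cache lines with fixed-layout structs, as in the sketch below. Little-endian byte order is an assumption (this section does not state the byte order), and the function name is hypothetical; offsets and sizes are taken from the tables above.

```python
import struct

# Assumption: fields are little-endian; the section does not state byte order.
STATIC_LINE = struct.Struct("<8sIHHQQQ24x")   # bytes 0-63
PRODUCER_LINE = struct.Struct("<QQ48x")       # bytes 64-127
NOTIFY_LINE = struct.Struct("<I60x")          # bytes 128-191


def parse_producer_page(page):
    """Decode the three cache lines at the start of the producer metadata page."""
    magic, version, cpu_id, _reserved0, capacity, data_offset, generation = \
        STATIC_LINE.unpack_from(page, 0)
    if magic != b"KMESRING":                  # compared as bytes, not an integer
        raise ValueError("not a KMES ring buffer")
    write_pos, tail_pos = PRODUCER_LINE.unpack_from(page, 64)
    (futex_counter,) = NOTIFY_LINE.unpack_from(page, 128)
    return {
        "version": version, "cpu_id": cpu_id, "capacity": capacity,
        "data_offset": data_offset, "generation": generation,
        "write_pos": write_pos, "tail_pos": tail_pos,
        "futex_counter": futex_counter,
    }
```

This decodes a one-off snapshot; a live consumer would instead load write_pos and tail_pos individually with the barriers required by the read protocol.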

§5.1.7 Consumer metadata page (offset 4096, read-write)

The consumer metadata page is mapped read-write to consumers. KMES reads from this page but treats all values as advisory. A malicious or buggy direct consumer that corrupts this page can affect advisory notification behavior for direct consumers attached to the same per-CPU buffer, but it cannot affect KMES correctness, the producer metadata, the data region, or other consumers' view of producer metadata.

Offset	Size	Type	Field	Description
4096	1	`u8`	`need_wake`	Consumer-managed flag. Set to 1 by the consumer before sleeping. Read by KMES after writing an event -- KMES treats any nonzero value as 1. If 0, KMES skips the futex_counter increment and futex_wake entirely. Cleared by the consumer after waking.
4097	4095	--	`reserved4`	Reserved. Must be zero. Pads to page boundary.

Under sustained load, the consumer is always draining and need_wake remains 0. KMES reads need_wake, sees 0, and skips all notification overhead -- no futex_counter increment, no futex_wake call. The entire notification path then costs a single memory read per event (on the order of 1 ns). Under low load, the consumer sets need_wake before sleeping, and KMES performs the full wake sequence when the next event arrives.

§5.1.8 Write protocol

Each per-CPU buffer has exactly one writer: the CPU it belongs to. There is no cross-CPU contention on any write operation. The write protocol uses no locks and no cross-CPU atomic operations.

For each event on a given CPU:

Capture the wall clock timestamp (CLOCK_REALTIME).
Increment the CPU's per-boot sequence counter and take the new value. This is a CPU-local operation with no contention.
Build the packed event header (timestamp, sequence number, cpu_id, origin class, event type).
Compute the total event size (header + payload).
If the total event size exceeds 50% of the ring buffer capacity, drop the event. The sequence number is consumed, creating a visible gap. Increment the internal dropped-event counter. Stop.
If write_pos + event_size - tail_pos > capacity, the write would overwrite surviving events. Advance tail_pos past overwritten events by reading each overwritten event's event_size field and adding it to tail_pos until sufficient space is available. Store tail_pos with a release memory barrier.
Write the event (header + payload) into the data region at offset write_pos & (capacity - 1). The double virtual mapping ensures this is a single contiguous write even if it crosses the physical buffer boundary.
Store the new write_pos (old value + event size) with a release memory barrier. This barrier ensures the event data is fully visible to consumers before write_pos advances.
Read need_wake. If need_wake is 0, stop -- no consumer is sleeping. If need_wake is nonzero, increment futex_counter with a release store and issue futex_wake to wake all waiting consumer threads.

During batch writes, steps 8--9 are deferred until all events in the batch have been written. A single release store of write_pos and a single need_wake check cover the entire batch. The overwrite check (step 6) uses an internal running write offset that tracks the current write frontier within the batch, not the consumer-visible write_pos.

The release barriers in steps 6 and 8 establish the ordering guarantee: a consumer that observes the new write_pos is guaranteed to observe the fully written event data and the correct tail_pos.
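
The position bookkeeping in steps 2 and 4 through 8 can be exercised with a single-threaded model. It is a sketch under stated assumptions: the 12-byte header (event_size plus sequence number) is a toy stand-in for the real event header, the class name is hypothetical, and timestamps, memory barriers, and the need_wake check are omitted because Python cannot express them.

```python
import struct

TOY_HEADER = struct.Struct("<IQ")   # hypothetical: event_size, sequence number


class RingModel:
    """Single-producer model of one per-CPU buffer (no real shared memory)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray(capacity)
        self.write_pos = 0
        self.tail_pos = 0
        self.seq = 0
        self.dropped = 0

    def _put(self, pos, blob):
        for i, byte in enumerate(blob):
            self.data[(pos + i) & (self.capacity - 1)] = byte

    def _size_at(self, pos):
        raw = bytes(self.data[(pos + i) & (self.capacity - 1)] for i in range(4))
        return struct.unpack("<I", raw)[0]

    def write(self, payload):
        self.seq += 1                                    # step 2
        size = TOY_HEADER.size + len(payload)            # step 4
        if size > self.capacity // 2:                    # step 5: drop, seq consumed
            self.dropped += 1
            return False
        # Step 6: advance tail_pos one whole event at a time until the write fits.
        while self.write_pos + size - self.tail_pos > self.capacity:
            self.tail_pos += self._size_at(self.tail_pos)
        self._put(self.write_pos, TOY_HEADER.pack(size, self.seq) + payload)  # step 7
        self.write_pos += size                           # step 8
        return True
```

Because tail_pos only ever moves by whole events, it always lands on an event boundary, which is what lets a lapped consumer resume parsing from tail_pos.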

§5.1.9 Read protocol

Consumers read events directly from the mapped data region. The read protocol uses no locks and no syscalls during the event drain loop.

Each consumer maintains its own read position per buffer in process-local memory. KMES does not track consumer read positions and is not aware of how many consumers exist or how far behind they are. A consumer SHOULD initialize its read_pos to tail_pos on first attachment, starting from the oldest surviving event.

Events are packed contiguously in the data region with no alignment padding between them. event_size is the exact byte count of the event (header + payload) with no trailing padding. The next event begins immediately at the byte following the previous event.

A consumer typically dedicates one thread per CPU buffer. Each thread independently drains its buffer using the following protocol.

§5.1.9.1 Drain loop

Load write_pos with an acquire memory barrier. If write_pos == read_pos, no new events are available. Proceed to notification wait.
Load tail_pos with an acquire memory barrier. If read_pos < tail_pos, the consumer has been lapped -- events at read_pos have been overwritten. Set read_pos = tail_pos. The gap between the old read_pos and tail_pos represents lost events, detectable as a sequence number gap.
Save the current tail_pos as saved_tail.
Read the event at data region offset read_pos & (capacity - 1). The double virtual mapping ensures this is a contiguous read.
Re-read tail_pos. If tail_pos > saved_tail AND read_pos < tail_pos, the event was overwritten during the read (torn read). Discard the event and go to step 2.
Validate event_size > 0 and event_size >= header_size. If either check fails, the event data is corrupt -- the consumer SHOULD advance to tail_pos and continue from step 2. An event_size of 0 would cause an infinite loop; an event_size smaller than header_size indicates a malformed header.
The event is valid. Process it. Advance read_pos by the event's event_size. Consumers MUST NOT read beyond the event's event_size boundary -- stale data from previously overwritten events may be present in the data region. Go to step 1.
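
The drain loop above can be sketched against a plain in-memory stand-in for the mapped region. The toy header (event_size plus sequence number), the RingView container, and the function names are assumptions for illustration. A real consumer would use acquire loads where the steps require them; the torn-read branch cannot trigger in this single-threaded model and is shown only for structure.

```python
import struct

HEADER = struct.Struct("<IQ")   # hypothetical toy header: event_size, sequence


class RingView:
    """Plain container standing in for the mapped metadata and data region."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray(capacity)
        self.write_pos = 0
        self.tail_pos = 0


def read_bytes(buf, pos, length):
    """Copy `length` bytes starting at monotonic position `pos`."""
    mask = buf.capacity - 1
    return bytes(buf.data[(pos + i) & mask] for i in range(length))


def drain(buf, read_pos):
    """Run the drain loop once; return (events, new_read_pos)."""
    events = []
    while read_pos < buf.write_pos:                         # step 1
        tail = buf.tail_pos                                 # step 2
        if read_pos < tail:
            read_pos = tail                                 # lapped: jump forward
            continue
        saved_tail = tail                                   # step 3
        size, seq = HEADER.unpack(read_bytes(buf, read_pos, HEADER.size))
        body = max(0, min(size, buf.capacity) - HEADER.size)
        payload = read_bytes(buf, read_pos + HEADER.size, body)    # step 4
        tail = buf.tail_pos                                 # step 5
        if tail > saved_tail and read_pos < tail:
            continue                                        # torn read: retry
        if size < HEADER.size:                              # step 6 (rejects 0 too)
            if buf.tail_pos <= read_pos:
                break                                       # model cannot progress
            read_pos = buf.tail_pos
            continue
        events.append((seq, payload))                       # step 7
        read_pos += size
    return events, read_pos
```

The consumer keeps the returned read position and passes it to the next call; nothing about that position is ever written to shared memory.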

§5.1.9.2 Notification wait

When no events are available on a given buffer:

Store 1 to need_wake with a release memory barrier. This signals KMES that the consumer is about to sleep.
Re-read write_pos with an acquire barrier. If new events have arrived since the drain loop exited (KMES wrote between the drain loop's check and the need_wake store), clear need_wake to 0 and return to the drain loop.
Read the current futex_counter value.
Optionally spin briefly, re-checking write_pos for new events. If events arrive during the spin window, clear need_wake to 0 and return to the drain loop. The spin duration is a consumer implementation choice.
Call futex_wait(futex_counter_address, last_seen_value), where last_seen_value is the value read in step 3. The kernel puts the thread to sleep only if futex_counter still equals that value. This is a genuine kernel sleep -- the thread is descheduled and consumes no CPU.
On wake (KMES incremented futex_counter and called futex_wake), clear need_wake to 0 and return to the drain loop.

Clearing need_wake to 0 (steps 2, 4, and 6) is a plain (relaxed) store -- no memory barrier is required. If KMES observes a stale need_wake of 1 after the consumer has already cleared it, KMES issues a spurious futex_wake while that consumer is already awake. This is harmless -- futex_wake on an address with no sleeping waiters is a no-op.

The re-check at step 2 closes the race window between the drain loop finding no events and the need_wake store. If KMES writes an event and reads need_wake as 0 (because the consumer hasn't stored it yet), the consumer will see the new write_pos at step 2 and never enter futex_wait.

The adaptive spin in step 4 is optional. Without it, the consumer sleeps immediately when the buffer is empty and is woken by KMES. With it, the consumer catches closely-spaced events without a kernel round-trip. Under sustained load, the consumer never reaches the notification wait -- it stays in the drain loop and need_wake remains 0.
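
The control flow of the six wait steps can be sketched with the blocking call injected as a parameter. This shows sequencing only: Python offers no control over memory ordering, `futex_wait` is a hypothetical callable standing in for futex(2) FUTEX_WAIT, and the WaitState container and return strings are illustrative names.

```python
class WaitState:
    """Stand-in for the fields a consumer touches while waiting."""

    def __init__(self):
        self.write_pos = 0
        self.need_wake = 0
        self.futex_counter = 0


def notification_wait(buf, read_pos, futex_wait, spin_iterations=0):
    """Follow the wait steps once; return when the drain loop should resume.

    `futex_wait(expected)` is a hypothetical callable that blocks while
    buf.futex_counter == expected.
    """
    buf.need_wake = 1                       # step 1 (release store in real code)
    if buf.write_pos != read_pos:           # step 2 (acquire load in real code)
        buf.need_wake = 0
        return "events_arrived"
    seen = buf.futex_counter                # step 3
    for _ in range(spin_iterations):        # step 4, optional
        if buf.write_pos != read_pos:
            buf.need_wake = 0
            return "events_arrived"
    futex_wait(seen)                        # step 5
    buf.need_wake = 0                       # step 6
    return "woken"
```

Either return value sends the consumer back to the drain loop; the distinction exists only so that a caller can count how often the sleep was avoided.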

§5.1.9.3 Generation check

After completing a drain cycle (buffer fully drained or batch limit reached), the consumer SHOULD check the generation field in the producer metadata page.

If generation has changed since the consumer last checked:

Record the sequence number of the last successfully processed event from this buffer.
Call kmes_attach(cpu_id, &capacity) to obtain a new file descriptor for this CPU's resized ring buffer, together with its new capacity.
mmap the new file descriptor.
Read the new buffer's metadata (capacity, write_pos, tail_pos).
Scan events in the new buffer to find the first event with a sequence number greater than the recorded sequence number. Set read_pos to that event's position.
Close the old file descriptor and unmap the old buffer.
Continue draining from the new buffer.
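
The scan for the resume position (the fifth step above) reduces to a linear search by sequence number. In this sketch the caller is assumed to have already walked the new buffer from tail_pos to write_pos and collected (position, sequence) pairs; the function name is hypothetical.

```python
def find_resume_pos(events, last_seq):
    """Return the position of the first event whose sequence exceeds last_seq.

    `events` is an iterable of (position, sequence) pairs in buffer order,
    from tail_pos up to write_pos. Returns None if every event was already seen.
    """
    for position, sequence in events:
        if sequence > last_seq:
            return position
    return None
```

When the result is None, every event in the new buffer has already been processed, and the consumer can set read_pos to the new buffer's write_pos.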

If the new capacity is large enough to hold the old buffer's full surviving byte range (write_pos - tail_pos), no events are lost during the generation change. KMES copies those surviving events from the old buffer into the new buffer before incrementing generation, and sequence numbers are continuous across the swap.

If the new capacity is smaller than the old surviving byte range, KMES discards the oldest surviving events from the old buffer until the remaining suffix fits in the new capacity, then copies that suffix into the new buffer. This is equivalent to applying the ordinary overwrite semantics against the smaller capacity during the swap. Any loss is therefore bounded to the oldest surviving prefix; newer events are preserved.

Events copied into the new buffer are re-compacted contiguously starting from position 0. The consumer's old read_pos is not valid in the new buffer; it MUST scan by sequence number to find its position.

The consumer MUST finish draining the old buffer up to its frozen write_pos before switching to the new buffer. After the swap, KMES stops writing to the old buffer; its write_pos is frozen.

The old buffer's physical pages remain valid for as long as any consumer has them mapped. KMES's internal release of the old buffer does not affect existing consumer mappings -- standard kernel mmap reference counting ensures the pages persist until all consumers have unmapped them.

§5.1.9.4 Buffer swap serialization

The buffer swap MUST be atomic per CPU: no events may be lost or duplicated during the transition from the old buffer to the new buffer on a given CPU.

An implementation MAY achieve this with the following per-CPU algorithm:

Disable preemption on the target CPU.
Determine the suffix of surviving events that fits in the new capacity. If the old surviving byte range is larger than the new capacity, advance through the oldest surviving events until the remaining suffix fits. Copy that suffix from this CPU's old buffer to the new buffer, re-compacted contiguously starting from position 0. Events are copied in sequence order. Set the new buffer's tail_pos = 0 and write_pos to the total byte size of the copied events.
Set the new buffer's generation to the old buffer's generation + 1. The new buffer MUST be fully initialized before consumers can see it.
Switch the per-CPU buffer pointer from the old buffer to the new buffer. New events are now written to the new buffer.
Increment generation in the old buffer's metadata. This signals consumers still reading the old buffer that a swap has occurred and they should re-attach.
If the old buffer's need_wake flag is set, increment the old buffer's futex_counter and issue futex_wake using the old buffer's futex address. This wakes consumers sleeping on the old generation so they can observe the generation change.
Re-enable preemption.

The new buffer's generation is set before it becomes visible (step 3 before step 4). The old buffer's generation is incremented after the switch (step 5 after step 4). This ensures consumers see a consistent generation value regardless of whether they read from the old or new buffer.

The wake in step 6 is conditional on the old buffer's need_wake flag for the same reason as ordinary event emission: under sustained load, consumers are already draining and no wake is needed. When consumers are asleep on the old generation, the wake ensures they do not remain blocked indefinitely after writers have moved to the new buffer.

With preemption disabled, no events can be emitted on this CPU between the copy and the switchover. Events emitted on other CPUs are unaffected -- each CPU's swap is independent.
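
The suffix selection in step 2 of the algorithm can be sketched as below. The function name is hypothetical, and the input is simplified to a list of event sizes, oldest first; the real implementation would walk the old buffer's events directly.

```python
def surviving_suffix(event_sizes, new_capacity):
    """Pick the newest events that fit in new_capacity.

    event_sizes lists surviving events oldest first. Returns (first_index,
    new_write_pos): events[first_index:] are copied, packed from position 0.
    """
    total = sum(event_sizes)
    first = 0
    while total > new_capacity:
        total -= event_sizes[first]     # discard the oldest surviving event
        first += 1
    return first, total
```

The second element of the result is the new buffer's write_pos, and its tail_pos is 0, matching the re-compaction rule in step 2.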

The generation check adds one u64 read per drain cycle. This cost is negligible relative to the event processing work.

§5.1.10 Overwrite semantics

Each per-CPU ring buffer is circular. When a buffer is full, KMES overwrites the oldest events in that buffer to make space for new events. The write pointer advances unconditionally -- emission never blocks.

Consumers detect overwritten events in two ways:

Lapping detection. If read_pos < tail_pos, events at the consumer's read position have been overwritten. The consumer advances to tail_pos.
Sequence gaps. The consumer tracks the last sequence number it processed from each CPU. A gap in the sequence for a given CPU indicates events were lost -- either overwritten in the ring buffer or dropped due to size limits.
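
Sequence-gap tracking on the consumer side needs only the last sequence number per CPU. The sketch below assumes that sequence numbers increase by exactly one per event on a given CPU, as the write protocol describes; the class name is hypothetical.

```python
class GapTracker:
    """Track the last sequence number seen per CPU and count missing events."""

    def __init__(self):
        self.last_seq = {}

    def observe(self, cpu_id, sequence):
        """Return how many events were skipped before this one on cpu_id."""
        previous = self.last_seq.get(cpu_id)
        self.last_seq[cpu_id] = sequence
        if previous is None:
            return 0            # first event seen; no baseline to compare against
        return sequence - previous - 1
```

A nonzero result does not say why events are missing; overwritten events and events dropped for exceeding the size limit look the same to the consumer.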

KMES maintains tail_pos per buffer to enable lapping detection. When KMES overwrites events, it advances tail_pos past the overwritten events by reading each event's event_size field. This allows consumers to jump directly to the oldest valid event without scanning.

Advancing tail_pos requires walking overwritten events sequentially, reading each event's event_size to determine the next event boundary. This has variable latency proportional to the number of overwritten events and may involve cache-cold reads (the tail region may be megabytes away from the current write position). This cost is accepted as a tradeoff -- the alternative (maintaining a secondary index of event offsets) would add per-event overhead on the write path to optimize the uncommon case where the write pointer overtakes surviving events.

§5.1.11 Memory ordering summary

Operation	Barrier	Purpose
KMES stores `tail_pos`	release	Consumers see the advanced tail before they see new data at old positions.
KMES stores `write_pos`	release	Consumers see complete event data before they see the advanced write position.
KMES stores `futex_counter`	release	Consumers waking from futex see all prior writes.
Consumer stores `need_wake = 1`	release	KMES sees the flag before the consumer enters futex_wait.
Consumer stores `need_wake = 0`	relaxed	Spurious futex_wake from stale read is harmless.
Consumer loads `write_pos` (after setting `need_wake`)	acquire	Closes the race window: if KMES wrote before seeing `need_wake`, the consumer sees the write.
Consumer loads `write_pos` (drain loop)	acquire	Pairs with KMES release on `write_pos`.
Consumer loads `tail_pos`	acquire	Pairs with KMES release on `tail_pos`.

In the per-CPU design, the producer (KMES on CPU N) and the consumers attached to that buffer (userspace threads, potentially running on other CPUs) are the only parties accessing a given buffer's metadata. There is no multi-producer contention. The memory barriers ensure correct visibility between the single producer and its consumers.

On x86-64, stores are not reordered with other stores, so the release barriers on the producer side compile to plain stores and need no fence instruction; only compiler reordering has to be prevented. The specification mandates them for architectural correctness on all platforms.

Section

6 Configuration

§6.1 6 Configuration

Self-Configuration

KMES reads its operational parameters from the registry under Machine\System\KMES\. Compiled-in defaults are used at boot. When LCS becomes available, KMES reads the configuration keys, validates them, and applies valid values. A persistent watch on the configuration subtree ensures ongoing changes are picked up for the lifetime of operation.

§6.1.1 Configuration keys

All keys live under Machine\System\KMES\. Each has a defined type, compiled-in default, and valid range. KMES ignores unknown keys in this subtree.

Key	Type	Default	Valid range	Description
BufferCapacity	REG_QWORD	4194304	65536--268435456	Per-CPU ring buffer capacity in bytes. MUST be a power of two. Values that are not powers of two are treated as invalid. Default is 4 MB. Maximum is 256 MB.
MaxEventSize	REG_DWORD	65536	1024--4194304	Maximum permitted total event size (header + payload) in bytes for events emitted via the `kmes_emit` and `kmes_emit_batch` syscalls. Does not apply to kernel emitters, which are subject only to the 50% structural limit. Default is 64 KB. Maximum is 4 MB.
MaxNestingDepth	REG_DWORD	32	4--256	Maximum permitted msgpack nesting depth for payloads emitted via the `kmes_emit` and `kmes_emit_batch` syscalls. Payloads exceeding this depth are rejected. Does not apply to kernel emitters.
MaxEmitRatePerProcess	REG_DWORD	10000	100--1000000	Maximum events per second that a single process may emit via the `kmes_emit` and `kmes_emit_batch` syscalls. Implemented as a token bucket: refill rate equals this value, burst capacity equals this value. Processes holding SeTcbPrivilege are exempt. Prevents a compromised application from flooding the ring buffer and overwriting legitimate kernel events. Does not apply to kernel emitters. Rate state is per-process, not per-principal -- per-SID rate limiting would unfairly penalise unrelated services sharing a SID (e.g., LocalService). A process that forks to reset its rate limit is bounded by RLIMIT_NPROC. For batch emission, only events actually emitted are charged, not the full requested count.
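
The token bucket described for MaxEmitRatePerProcess can be modeled as follows. This is a sketch, not the kernel's accounting: the class name is hypothetical, time is passed in explicitly as seconds, and fractional tokens are an implementation choice made here for simplicity.

```python
class TokenBucket:
    """Token bucket with refill rate and burst capacity both equal to `rate`."""

    def __init__(self, rate, now=0.0):
        self.rate = rate
        self.tokens = float(rate)       # starts full: burst capacity == rate
        self.last = now

    def _refill(self, now):
        elapsed = now - self.last
        self.tokens = min(float(self.rate), self.tokens + elapsed * self.rate)
        self.last = now

    def try_take(self, now, count=1):
        """Charge `count` events at time `now` (seconds); False models EAGAIN."""
        self._refill(now)
        if self.tokens < count:
            return False
        self.tokens -= count
        return True
```

With rate 100, a process can emit 100 events at once, is refused on the next attempt, and regains 50 tokens after half a second; the bucket never holds more than 100.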

§6.1.2 Validation

When KMES reads a configuration value, it validates against the defined type, range, and constraints:

Valid value: Applied to the in-memory configuration. For MaxEventSize and MaxNestingDepth, the new value takes effect for subsequent syscalls. For BufferCapacity, KMES triggers a ring buffer swap -- creating new per-CPU ring buffers at the configured size, copying surviving events from the old buffers, incrementing the generation counter, and discarding the old buffers. The swap protocol is defined in §5.1.9.4.
Invalid value (out of range, wrong type, not a power of two for BufferCapacity, missing): Ignored. KMES retains the previously active value (compiled-in default or last known-good). An event is emitted via KMES itself identifying the key, the invalid value, and the value being retained.
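
The accept-or-retain rule can be sketched for BufferCapacity. The function name is hypothetical, and a missing or wrongly typed registry value is represented here as any non-integer (for example None); the range limits come from the configuration key table.

```python
BUFFER_CAPACITY_MIN = 65536
BUFFER_CAPACITY_MAX = 268435456


def apply_buffer_capacity(candidate, active):
    """Return (value_in_use, accepted) for a BufferCapacity read.

    `candidate` is the registry value, or None if missing or of the wrong type.
    Invalid values are never clamped; the active value is retained.
    """
    valid = (
        isinstance(candidate, int)
        and BUFFER_CAPACITY_MIN <= candidate <= BUFFER_CAPACITY_MAX
        and candidate & (candidate - 1) == 0     # power of two
    )
    return (candidate, True) if valid else (active, False)
```

A rejected value leaves the active capacity unchanged, so no ring buffer swap is triggered; only an accepted value that differs from the active one would lead to a swap.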

Values are never clamped or silently corrected. The write to the registry succeeds (the source does not enforce kernel semantics), but KMES refuses to use it. The registry shows what was written; the event log shows what KMES is actually using.

§6.1.3 Bootstrap sequence

PKM loads. KMES initialises with compiled-in defaults and creates per-CPU ring buffers at the compiled-in default BufferCapacity. These are ordinary consumer-facing KMES ring buffers and receive events immediately.
LCS becomes available (first source registers). KMES reads all keys under Machine\System\KMES\.
If keys exist and are valid, KMES applies them. If BufferCapacity differs from the compiled-in default, KMES performs a normal ring-buffer swap: it creates new per-CPU ring buffers at the configured size, copies surviving events from the existing buffers into them, increments generation, and switches writers to the new buffers. If BufferCapacity matches the default (or the key does not exist), no capacity change is needed and the existing buffers remain in place.
KMES arms a persistent subtree watch on Machine\System\KMES\ via LCS's internal watch mechanism. This is a kernel-internal registration, not a userspace fd-based watch.
If Machine\System\KMES\ does not exist (first boot, empty database), KMES arms a watch on a parent key to detect when the subtree is created. When the key appears, KMES reads and validates its contents and re-arms a targeted watch.
On subsequent changes (administrator modification, Group Policy push at a higher-precedence layer), the watch fires, KMES re-reads the changed key, validates, and applies or rejects.

At no point does KMES enter a "waiting for configuration" state. Compiled-in defaults are always sufficient for operation.

§6.1.4 Security

KMES configuration keys live under Machine\System\KMES\, which inherits the Machine hive root SD (SYSTEM and Administrators: KEY_ALL_ACCESS, Authenticated Users: KEY_READ). Unprivileged processes cannot modify operational parameters.

Domain policy enforcement via Group Policy at a higher-precedence layer provides defence against compromised local administrators -- SeTcbPrivilege is required for layer creation at precedence > 0.

KMES configuration keys are candidates for Superlock protection (a future registry feature that prevents modification outside of Safe or Recovery mode). Event system configuration is critical enough that runtime modification by even a privileged administrator warrants additional gating.

§6.1.5 Boot-time initial capacity

The capacity of the initial boot-time ring buffers is the compiled-in default BufferCapacity. It is not independently configurable before LCS is available. Making the pre-LCS capacity separately configurable would require a mechanism to deliver the value to the kernel before the registry exists, which adds complexity for negligible benefit. Once LCS is available, BufferCapacity changes are applied through the normal generation-bumping ring-buffer swap protocol.

Section

7 Failure modes

§7.1 7 Failure modes

Failure Modes

KMES is a kernel subsystem with no external trust boundary on the write path. Kernel emitters are trusted; userspace emitters are validated at the syscall boundary. Its failure semantics are simpler than those of subsystems such as LCS, which span kernel-userspace trust boundaries, but they MUST still be explicit.

§7.1.1 Ring buffer overrun

When events are emitted faster than consumers drain them, the ring buffer fills and KMES overwrites the oldest events to make space.

The write path is never blocked. Emission never fails due to buffer pressure -- buffer-full conditions are handled by overwriting, not blocking.
Consumers detect lost events as gaps in the per-CPU sequence number.
Consumers whose read position has been overwritten are advanced to tail_pos (the oldest surviving event).
KMES maintains an internal per-CPU dropped-event counter. This counter is not exposed in the ring buffer metadata in v0.20 but MAY be exposed in a future version.

Overrun is a normal operating condition under heavy load, not an error. The system degrades gracefully: recent events are preserved, old events are lost, consumers are aware of the loss.
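
A consumer can detect loss by comparing consecutive per-CPU sequence numbers. The sketch below assumes that sequence numbers on one CPU increase by exactly one per event; the function name `find_gaps` is hypothetical.

```python
def find_gaps(sequences, last_seen=None):
    """Return (gaps, last_seen) for one CPU's stream of sequence numbers.

    Each gap is (first_missing, last_missing), inclusive. Pass the last
    sequence number processed before a restart as `last_seen` to detect
    events lost while the consumer was not attached.
    """
    gaps = []
    for seq in sequences:
        if last_seen is not None and seq > last_seen + 1:
            gaps.append((last_seen + 1, seq - 1))
        last_seen = seq
    return gaps, last_seen
```

The same check covers every source of loss on a CPU: overwritten events, events dropped by a kernel emitter, and events missed across a consumer restart.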

§7.1.2 Event drop

An event is dropped without being written to the ring buffer when:

Structural limit exceeded. The event exceeds 50% of the per-CPU ring buffer capacity. Applies to both kernel and syscall emitters.
Policy limit exceeded (syscall only). The event exceeds MaxEventSize.
Validation failure (syscall only). The payload is not valid msgpack or exceeds MaxNestingDepth.

For kernel emitters, the per-CPU sequence number advances even when an event is dropped, making the drop visible to consumers as a gap in the sequence. The emitting subsystem is not notified, as the emission API is fire-and-forget.

For syscall emitters, validation failures occur before the ring buffer write phase, so no sequence number is consumed and no gap is visible to consumers. The drop is visible only to the caller via the syscall error return.
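
The two size limits can be sketched as a single check. This is illustrative only: the function name `drop_reason` is hypothetical, `event_size` is assumed to be the total event size, and the order in which the two limits are tested is an assumption (the normative order is given by the syscall validation rules).

```python
def drop_reason(event_size, capacity, max_event_size, from_syscall):
    """Return why an event would be dropped for its size, or None if it fits."""
    if event_size > capacity // 2:
        return "structural"   # applies to kernel and syscall emitters
    if from_syscall and event_size > max_event_size:
        return "policy"       # MaxEventSize applies to syscall emitters only
    return None
```

With the default 4 MB capacity and 64 KB MaxEventSize, a 100 KB event is accepted from a kernel emitter but rejected from a syscall emitter.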

§7.1.3 Consumer crash

If a consumer process (e.g., eventd) crashes:

The ring buffers that the consumer had mmap'd remain valid in kernel memory. The kernel tears down the crashed process's mappings and closes its file descriptors as part of normal process exit cleanup.
KMES is unaffected. It continues writing events to the per-CPU ring buffers regardless of whether any consumers are attached.
Events emitted while no consumer is attached accumulate in the ring buffers. If the buffers fill, oldest events are overwritten.
When a consumer restarts and re-attaches (calls kmes_attach for each CPU and mmaps the buffers), it sees all surviving events. Events overwritten during the outage are visible as a sequence gap starting from whatever sequence number the consumer last processed.

KMES has no dependency on consumers. A system with no consumers attached operates identically to a system with consumers -- events are emitted, stamped, buffered, and eventually overwritten.

§7.1.4 Buffer swap failure

When KMES attempts to create new ring buffers (due to a BufferCapacity configuration change, including the first LCS-driven resize away from the boot-time default), memory allocation may fail.

If allocation fails, KMES retains the existing ring buffers at their current size. The configuration change is not applied.
An event is emitted via KMES itself recording the allocation failure and the retained buffer size.
The generation counter is not incremented. Consumers are unaffected.
KMES does not retry automatically. A subsequent configuration write (or system reboot) triggers another attempt.
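
The failure behaviour above can be sketched as a state transition. The `state` dictionary, the function name `try_swap`, and the `allocate` callback are hypothetical; copying of surviving events and the emission of the failure event are omitted.

```python
def try_swap(state, new_capacity, allocate):
    """Attempt a BufferCapacity swap and return the resulting state.

    `state` has the keys "capacity", "generation", and "buffers" (one
    entry per CPU). `allocate(num_cpus, capacity)` is a hypothetical
    allocator that raises MemoryError on failure.
    """
    try:
        new_buffers = allocate(len(state["buffers"]), new_capacity)
    except MemoryError:
        # Existing buffers and generation are retained; there is no
        # automatic retry.
        return state
    return {
        "capacity": new_capacity,
        "generation": state["generation"] + 1,
        "buffers": new_buffers,
    }
```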

§7.1.5 LCS unavailable

If LCS never becomes available (no source registers), KMES operates indefinitely with compiled-in defaults. The initial boot-time ring buffers remain the live buffers at the default BufferCapacity. The self-configuration watch is never armed because there is no registry to watch.

This is not a failure -- it is a valid operating mode. KMES has no hard dependency on LCS. The only consequence is that operational parameters cannot be tuned.

§7.1.6 Clock discontinuity

KMES timestamps use CLOCK_REALTIME (wall clock). NTP adjustments can cause the clock to jump forward or backward. When this occurs:

Events emitted after a backward jump have timestamps earlier than events emitted before the jump. Consumers that sort by timestamp will see an apparent reordering.
Per-CPU sequence numbers are unaffected (they are monotonic counters, not derived from the clock). Sequence numbers remain the reliable ordering primitive within a single CPU.
KMES does not detect or compensate for clock discontinuities. Consumers that require monotonic ordering within a CPU SHOULD use the sequence number, not the timestamp.

Cross-CPU ordering during a clock discontinuity is best-effort. Events from different CPUs near a clock jump may have misleading relative timestamps. This is an inherent limitation of wall clock timestamps and is accepted as a trade-off for human-readable, cross-boot-comparable timestamps.
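
The difference between timestamp order and sequence order can be shown with a small example. The event dictionaries and the function name `per_cpu_order` are assumptions for illustration; the second event on CPU 0 is emitted after a backward clock jump.

```python
from collections import defaultdict

events = [
    {"cpu_id": 0, "sequence": 41, "timestamp": 1_000_000_500},
    {"cpu_id": 0, "sequence": 42, "timestamp": 1_000_000_100},  # after a backward jump
    {"cpu_id": 1, "sequence": 7, "timestamp": 1_000_000_300},
]


def per_cpu_order(events):
    """Group events by cpu_id and order each group by sequence number."""
    streams = defaultdict(list)
    for ev in events:
        streams[ev["cpu_id"]].append(ev)
    for stream in streams.values():
        stream.sort(key=lambda ev: ev["sequence"])
    return dict(streams)
```

Sorting the same events by timestamp would place sequence 42 before sequence 41 on CPU 0, which is the apparent reordering described above.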

§7.1.7 CPU hotplug

CPU hotplug (adding or removing CPUs at runtime) is not supported in v0.20. The number of per-CPU ring buffers is fixed at KMES initialisation time based on the number of online CPUs when PKM loads. If a CPU is brought online after KMES initialisation, events emitted on that CPU are dropped.

A future version MAY support dynamic per-CPU buffer creation for hotplugged CPUs.

§7.1.8 Memory bounding

KMES kernel memory usage is bounded by:

Per-CPU ring buffers: In steady state, num_cpus × BufferCapacity. At boot, these begin at the compiled-in default size. During a live BufferCapacity swap, old and new per-CPU ring buffers MAY temporarily coexist until existing mappings to the old generation are released. Capacity values remain bounded by the configured limits.
Event construction: Temporary allocations during event construction are bounded by the maximum event size and freed immediately after the event is written to the ring buffer.
Consumer file descriptors: Each kmes_attach call creates one file descriptor. A consumer attaching to all CPUs creates num_cpus file descriptors. Bounded by RLIMIT_NOFILE and the SeSecurityPrivilege requirement.

No KMES-specific global memory cap is required. The BufferCapacity configuration and standard Linux resource limits provide sufficient protection.
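
As a worked example of the ring buffer bound, the sketch below computes steady-state memory and the transient peak during a swap. It assumes the worst case in which every old buffer is still mapped, and it ignores the two metadata pages per CPU; the function name is hypothetical.

```python
MIB = 1024 * 1024


def ring_buffer_bytes(num_cpus, capacity, swapping_to=None):
    """Steady-state ring buffer memory, or the transient peak during a swap."""
    total = num_cpus * capacity
    if swapping_to is not None:
        total += num_cpus * swapping_to   # old and new generations coexist
    return total
```

For 128 logical CPUs at the default 4 MB capacity, `ring_buffer_bytes(128, 4 * MIB)` is 512 MB; while swapping to 8 MB buffers the peak is 1536 MB.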

Section

8 Appendix a

§8.1 8 Appendix a

Constants

All numeric constants used in the KMES interface. An independent implementer can derive all magic numbers from this page.

§8.1.1 Syscall numbers

Syscall	Number	Description
kmes_emit	1090	Emit a single event from userspace.
kmes_attach	1091	Attach as a consumer of a single per-CPU ring buffer.
kmes_emit_batch	1092	Emit multiple events from userspace as a single operation. Maximum 256 events per call.

§8.1.2 Origin class values

Value	Origin
0	Userspace (syscall)
1	KMES
2	KACS
3	LCS

Values 4--255 are reserved for future kernel subsystems.

§8.1.3 Event header layout

Packed, no padding. All multi-byte integers little-endian. GUIDs in Microsoft GUID binary format.

Offset	Size	Type	Field
0	4	`u32`	`event_size`
4	4	`u32`	`header_size`
8	8	`u64`	`timestamp`
16	8	`u64`	`sequence`
24	2	`u16`	`cpu_id`
26	1	`u8`	`origin_class`
27	16	`GUID`	`effective_token_guid`
43	16	`GUID`	`true_token_guid`
59	16	`GUID`	`process_guid`
75	2	`u16`	`type_len`
77	var	`[u8]`	`type`

Header size: 77 + type_len bytes. All fields before type_len are at fixed offsets. Payload begins at header_size from event start.
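
The layout above maps directly onto a packed little-endian structure. The following consumer-side sketch decodes one event; the function name `parse_header` is hypothetical, and it assumes that `event_size` counts the whole event (header plus payload) from the event start.

```python
import struct
import uuid

# Fixed-offset portion of the header: little-endian, no padding (77 bytes).
FIXED = struct.Struct("<IIQQHB16s16s16sH")
assert FIXED.size == 77


def parse_header(buf, offset=0):
    """Decode one event starting at `offset` in `buf`."""
    (event_size, header_size, timestamp, sequence, cpu_id, origin_class,
     eff_tok, true_tok, proc, type_len) = FIXED.unpack_from(buf, offset)
    type_start = offset + FIXED.size
    event_type = bytes(buf[type_start:type_start + type_len]).decode("utf-8")
    return {
        "event_size": event_size,
        "header_size": header_size,
        "timestamp": timestamp,
        "sequence": sequence,
        "cpu_id": cpu_id,
        "origin_class": origin_class,
        # bytes_le matches the Microsoft GUID binary layout.
        "effective_token_guid": uuid.UUID(bytes_le=eff_tok),
        "true_token_guid": uuid.UUID(bytes_le=true_tok),
        "process_guid": uuid.UUID(bytes_le=proc),
        "type": event_type,
        # The payload is located via header_size, not via 77 + type_len.
        "payload": bytes(buf[offset + header_size:offset + event_size]),
    }
```

Locating the payload through `header_size` rather than a computed offset keeps the parser compatible with future header growth.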

§8.1.4 Producer metadata page layout (offset 0, read-only)

One producer metadata page (4096 bytes) per CPU. Cache-line-aligned fields.

§8.1.4.1 Cache line 0 -- static fields (bytes 0--63)

Offset	Size	Type	Field
0	8	`[u8; 8]`	`magic`
8	4	`u32`	`version`
12	2	`u16`	`cpu_id`
14	2	`u16`	`reserved0`
16	8	`u64`	`capacity`
24	8	`u64`	`data_offset`
32	8	`u64`	`generation`
40	24	--	`reserved1`

§8.1.4.2 Cache line 1 -- producer fields (bytes 64--127)

Offset	Size	Type	Field
64	8	`u64`	`write_pos`
72	8	`u64`	`tail_pos`
80	48	--	`reserved2`

§8.1.4.3 Cache line 2 -- notification fields (bytes 128--191)

Offset	Size	Type	Field
128	4	`u32`	`futex_counter`
132	60	--	`reserved3`
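
The three cache lines can be decoded from a snapshot of the producer metadata page as sketched below. The function name `read_producer_page` is hypothetical, the fields are assumed to be little-endian like the event header, and the magic bytes are those given in §8.1.6. A real consumer reads `write_pos` and `tail_pos` with the memory ordering required by the read protocol, which a plain snapshot does not model.

```python
import struct

MAGIC = bytes([0x4B, 0x4D, 0x45, 0x53, 0x52, 0x49, 0x4E, 0x47])  # "KMESRING"

STATIC = struct.Struct("<8sIHHQQQ")   # bytes 0..39 of cache line 0
PRODUCER = struct.Struct("<QQ")       # write_pos, tail_pos at byte 64
NOTIFY = struct.Struct("<I")          # futex_counter at byte 128


def read_producer_page(page):
    """Decode the documented fields of one producer metadata page."""
    (magic, version, cpu_id, _reserved0,
     capacity, data_offset, generation) = STATIC.unpack_from(page, 0)
    if magic != MAGIC:                # compared byte-by-byte
        raise ValueError("not a KMES ring buffer")
    write_pos, tail_pos = PRODUCER.unpack_from(page, 64)
    (futex_counter,) = NOTIFY.unpack_from(page, 128)
    return {
        "version": version,
        "cpu_id": cpu_id,
        "capacity": capacity,
        "data_offset": data_offset,
        "generation": generation,
        "write_pos": write_pos,
        "tail_pos": tail_pos,
        "futex_counter": futex_counter,
    }
```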

§8.1.5 Consumer metadata page layout (offset 4096, read-write)

Offset	Size	Type	Field
4096	1	`u8`	`need_wake`
4097	4095	--	`reserved4`

§8.1.6 Ring buffer magic

0x4B 0x4D 0x45 0x53 0x52 0x49 0x4E 0x47
 K    M    E    S    R    I    N    G

Compared byte-by-byte, not as an integer. Endianness-independent.

§8.1.7 Ring buffer version

v0.20 uses ring buffer format version 1. Events are self-describing: the payload begins at header_size bytes from the event start, so consumers that use header_size to locate the payload remain compatible with any future header growth.

§8.1.8 Mapped region layout

Per-CPU mapping returned by mmap() on a kmes_attach file descriptor:

Offset	Size	Description
0	4096	Producer metadata page (read-only)
4096	4096	Consumer metadata page (read-write)
8192	2 × capacity	Double-mapped data region (read-only)

Total mapping size: 8192 + (2 × capacity) bytes.
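
The mapping arithmetic can be sketched as follows. The helper names are hypothetical, and `data_region_offset` assumes that ring positions are monotonically increasing byte counters and that capacity is a power of two, so masking is equivalent to taking the position modulo capacity.

```python
PAGE = 4096
DATA_BASE = 2 * PAGE   # 8192: the data region follows the two metadata pages


def mapping_size(capacity):
    """Total bytes to request from mmap() for one per-CPU buffer."""
    return DATA_BASE + 2 * capacity


def data_region_offset(pos, capacity):
    """Offset within the mapping at which ring position `pos` can be read."""
    return DATA_BASE + (pos & (capacity - 1))
```

Because the data region is mapped twice back to back, an event that starts near the end of the first copy continues into the second copy, so a consumer can read it as one contiguous span without handling wrap-around.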

§8.1.9 Syscall error codes

§8.1.9.1 kmes_emit errors

Errno	Condition
EPERM	Caller does not hold SeAuditPrivilege.
EAGAIN	Per-process rate limit exceeded.
EINVAL	Event type length is zero, or event type is not valid UTF-8, or payload is invalid msgpack, or payload nesting depth exceeds MaxNestingDepth.
EFAULT	Event type or payload pointer is inaccessible.
ENOSPC	Event exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation for staging buffer failed.

§8.1.9.2 kmes_emit_batch errors

Errno	Condition
EPERM	Caller does not hold SeAuditPrivilege.
EAGAIN	Per-process rate limit exceeded.
EINVAL	Count is 0 or exceeds 256, or failing entry has zero-length event type, or failing entry's event type is not valid UTF-8, or failing entry's payload is invalid msgpack or exceeds MaxNestingDepth.
EFAULT	`emitted_out`, entry array, event type, or payload pointer is inaccessible.
ENOSPC	Failing entry exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation failed.

§8.1.9.3 kmes_attach errors

Errno	Condition
EPERM	Caller does not hold SeSecurityPrivilege.
EINVAL	`cpu_id` is greater than or equal to the number of CPUs.
EFAULT	`capacity` pointer is inaccessible.
ENOMEM	Kernel memory allocation failed.

§8.1.10 kmes_emit_entry struct layout (x86-64)

C ABI natural alignment. Total size: 32 bytes.

Offset	Size	Type	Field
0	8	`pointer`	`event_type`
8	2	`u16`	`event_type_len`
10	6	--	padding
16	8	`pointer`	`payload`
24	4	`u32`	`payload_len`
28	4	--	padding
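
The padding in this layout can be checked with a packed format string that spells out the pad bytes explicitly. The helper name `pack_entry` is hypothetical, and pointers are represented as plain 64-bit integers.

```python
import struct

# pointer, u16, 6 pad bytes, pointer, u32, 4 pad bytes (x86-64, little-endian)
ENTRY = struct.Struct("<QH6xQI4x")


def pack_entry(event_type_addr, event_type_len, payload_addr, payload_len):
    """Serialise one kmes_emit_entry; pad bytes are written as zero."""
    return ENTRY.pack(event_type_addr, event_type_len, payload_addr, payload_len)
```

`ENTRY.size` evaluates to 32, matching the total size stated above.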

§8.1.11 Configuration keys

Registry path: Machine\System\KMES\

Key	Type	Default	Valid range
BufferCapacity	REG_QWORD	4194304 (4 MB)	65536--268435456 (64 KB--256 MB), power of two
MaxEventSize	REG_DWORD	65536 (64 KB)	1024--4194304 (1 KB--4 MB)
MaxNestingDepth	REG_DWORD	32	4--256
MaxEmitRatePerProcess	REG_DWORD	10000	100--1000000

§8.1.12 Privilege requirements

Operation	Required privilege
Emit event from userspace (`kmes_emit`, `kmes_emit_batch`)	SeAuditPrivilege
Attach as consumer (`kmes_attach`)	SeSecurityPrivilege

Section

9 Appendix b

§9.1 9 Appendix b

Recommended Implementation Optimisations

The following optimisations are not normative. They do not affect the ring buffer format, the event header layout, or the consumer protocol. An implementation that omits all of them is fully conformant. However, each one provides measurable throughput or latency improvement with no behavioural trade-offs, and implementers are encouraged to adopt them.

§9.1.1 Timestamp capture

Timestamp capture (CLOCK_REALTIME) is the single most expensive per-event operation on the write path, at approximately 15--25 ns per call. Implementations SHOULD use ktime_get_real_fast_ns() -- the kernel-internal fast path that avoids the full timekeeper seqlock dance. On architectures with an invariant TSC (Time Stamp Counter), this reduces to an rdtsc instruction plus a multiply and add. The trade-off is that in the rare case where a timer interrupt is updating the timekeeper concurrently, the timestamp may be off by one tick. For nanosecond-precision event timestamps, this is acceptable.

§9.1.2 Hugepages

The ring buffer data region benefits from 2 MB hugepages rather than 4 KB standard pages. A 4 MB ring buffer requires 1024 standard pages but only 2 hugepages. Fewer pages means fewer TLB (Translation Lookaside Buffer -- the CPU's cache of virtual-to-physical address mappings) entries are needed to cover the buffer. TLB misses during event reads and writes add ~10--30 ns each, and a large buffer with standard pages can cause frequent misses during sequential traversal.

The double virtual mapping doubles the virtual address range, so the TLB benefit of hugepages is even more pronounced: 4 MB of physical memory mapped as 8 MB of virtual space is covered by 4 hugepage mappings versus 2048 standard page mappings.
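
The page counts quoted in this section follow from a ceiling division, shown here as a small worked example. The function name `pages_to_cover` is hypothetical.

```python
MIB = 1024 * 1024


def pages_to_cover(mapped_bytes, page_size):
    """Number of page mappings needed to cover `mapped_bytes`."""
    return -(-mapped_bytes // page_size)   # ceiling division


standard = pages_to_cover(8 * MIB, 4096)      # double-mapped 4 MB buffer, 4 KB pages
huge = pages_to_cover(8 * MIB, 2 * MIB)       # the same range with 2 MB hugepages
```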

Hugepages are transparent to consumers -- the mmap'd region behaves identically regardless of the underlying page size.

§9.1.3 NUMA-local allocation

On NUMA (Non-Uniform Memory Access) systems, physical memory is divided into nodes. Each CPU has a local node with fast access (~70 ns) and remote nodes with slower access (~100--150 ns). Ring buffer pages SHOULD be allocated on the same NUMA node as the CPU that writes to them.

Since each ring buffer is written by exactly one CPU, NUMA-local allocation ensures all producer writes are fast. Consumer reads may cross NUMA boundaries (the consumer thread may run on a different node), but under the per-CPU design, the consumer thread can be affinity-bound to the same node as a secondary optimisation.

§9.1.4 Precomputed header templates

Part of the event header is constant for a given CPU: cpu_id never changes, and the layout of the fixed 77-byte portion (the field offsets) is identical for every event. A per-CPU header template can be precomputed at initialisation time. At emit time, KMES copies the template and fills in only the variable fields (event_size, header_size, timestamp, sequence, origin_class, type_len, type, and the three identity GUIDs). header_size is variable because it equals 77 + type_len. This reduces per-event header construction to a small memcpy plus a few stores.

For kernel emitters with a fixed origin class, the template can include the origin class as well, reducing the per-event work further. The identity GUIDs are per-thread and must be captured at emit time -- they cannot be templated.
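
The template approach can be sketched as follows, using the offsets from §8.1.3. The function names are hypothetical, the template here is built per CPU and per kernel emitter so that origin_class is included, and header_size is filled in per event because it depends on type_len.

```python
import struct

FIXED_SIZE = 77   # bytes before the variable-length type field


def make_template(cpu_id, origin_class):
    """Precompute the fixed header bytes for one CPU and one kernel emitter."""
    tmpl = bytearray(FIXED_SIZE)
    struct.pack_into("<H", tmpl, 24, cpu_id)         # cpu_id at offset 24
    struct.pack_into("<B", tmpl, 26, origin_class)   # origin_class at offset 26
    return bytes(tmpl)


def build_header(tmpl, timestamp, sequence, guids, event_type, payload_len):
    """Copy the template and store only the per-event fields."""
    hdr = bytearray(tmpl)                             # the small memcpy
    header_size = FIXED_SIZE + len(event_type)
    struct.pack_into("<II", hdr, 0, header_size + payload_len, header_size)
    struct.pack_into("<QQ", hdr, 8, timestamp, sequence)
    hdr[27:75] = b"".join(guids)                      # three 16-byte identity GUIDs
    struct.pack_into("<H", hdr, 75, len(event_type))
    return bytes(hdr) + event_type
```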

§9.1.5 Software prefetch

After reading an event's event_size field, the consumer knows where the next event starts. Issuing a software prefetch instruction for the next event's header address (e.g., __builtin_prefetch in C, prefetch intrinsic in Rust) allows the CPU to begin fetching the next event's cache lines while the current event is being processed. This hides memory latency during sequential buffer traversal.

This is most effective when event processing involves non-trivial work (msgpack decoding, SQLite insertion) that gives the prefetch time to complete.

§9.1.6 Msgpack validation with SIMD

For the kmes_emit syscall path, msgpack payload validation can be accelerated using SIMD (Single Instruction, Multiple Data) instructions. The initial type-byte scan -- determining whether each byte is a fixint, a container header, or a data byte -- is amenable to vectorised classification using SSE4.2 or AVX2 byte-shuffle instructions. This reduces validation overhead for large payloads.

This optimisation is only relevant to the syscall path. Kernel emitters bypass payload validation entirely.

§9.1.7 Per-CPU staging buffer

The kmes_emit and kmes_emit_batch syscalls copy event data from userspace into a kernel buffer before validation and ring buffer write. A per-CPU pre-allocated staging buffer (e.g., one page / 4 KB) eliminates dynamic allocation from the common-case syscall path. Events exceeding the pre-allocated size fall back to kmalloc.

For batch emission, the staging buffer can be reused sequentially: copy entry 0, validate, hold the kernel copy; copy entry 1 into the same staging buffer if entry 0 has already been written to the ring buffer, or allocate a second buffer if entries must be held simultaneously. The goal is to avoid 256 separate kmalloc calls for a full batch.

§9.1.8 Consumer thread affinity

Consumer threads that drain per-CPU ring buffers benefit from being pinned to CPUs on the same NUMA node as the buffer they read. While not strictly necessary (the per-CPU design eliminates write contention regardless of consumer placement), NUMA-local reads avoid cross-node memory traffic during the drain loop.

For eventd specifically, pinning each drain goroutine's underlying OS thread to the same NUMA node as its buffer is a simple configuration that reduces read latency.
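
As an illustration of node-local pinning on Linux, the sketch below reads a node's CPU list from sysfs and restricts the caller to it. The function names are hypothetical, and it is written in Python only for illustration; it does not reflect how eventd itself is implemented.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_node(node):
    """Restrict the caller to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)   # 0 selects the caller
    return cpus
```

A drain thread would call `pin_to_node` once, with the node that holds its buffer, before entering its drain loop.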

Section

10 Appendix c

§10.1 10 Appendix c

Known Omissions

This appendix lists capabilities that are intentionally omitted from v0.20 but are expected to be addressed in future versions. All items listed here are additive -- they can be implemented without restructuring the core ring buffer design, the syscall interface, or the event format.

§10.1.1 CPU topology changes

v0.20 fixes the set of per-CPU ring buffers at KMES initialisation time. The following topology changes are not handled:

CPU hot-add. A CPU brought online after KMES initialisation has no ring buffer. Events emitted on that CPU are dropped silently. The kmes_attach(cpu_id) API returns EINVAL for the new CPU.

CPU offline. A CPU taken offline via cpu_online still has a ring buffer. The drain thread for that CPU sleeps on futex_wait indefinitely since no new events are emitted. This is harmless (one sleeping thread) but not clean.

CPU hot-remove. A CPU that had a ring buffer is physically removed. The drain thread sleeps forever on a buffer that will never receive new events. Without a notification mechanism, the consumer cannot distinguish a quiet CPU from a removed one.

The fix for all three is a topology change notification mechanism. Options include a new generation bump that signals "re-enumerate CPUs", a dedicated topology-change file descriptor, or a field in the producer metadata page indicating CPU status. The kmes_attach(cpu_id) design accommodates all of these -- consumers discover new CPUs by extending their attach loop, and detect removed CPUs via a status field or error code. No changes to the ring buffer format, event format, or emission API are required.

CPU hot-add is the most likely real-world scenario (hypervisors adding vCPUs to a running guest). CPU hot-remove is rare outside mainframes.

§10.1.2 NUMA-aware buffer allocation

v0.20 does not specify which NUMA node ring buffer pages are allocated on. If a ring buffer's physical pages are allocated on a remote NUMA node, every event write on that CPU crosses the interconnect. At KMES throughput targets (millions of events per second), remote NUMA writes add measurable latency.

The fix is to allocate each per-CPU ring buffer's physical pages on the NUMA node local to that CPU. This is a kernel allocation policy change (alloc_pages_node or equivalent) with no consumer-visible effect. The ring buffer format, mapping layout, and syscall interface are unchanged.

§10.1.3 Suspend/resume

v0.20 does not explicitly address system suspend (S3 sleep) or hibernate (S4). Ring buffer contents survive S3 (memory is preserved). On resume, CLOCK_REALTIME jumps forward by the suspend duration. This is a special case of the clock discontinuity described in §7.1.6 -- consumers see a wall clock gap in timestamps but no sequence number gap.

S4 (hibernate) writes memory to disk. On resume, ring buffers are restored from the hibernate image. The same clock discontinuity applies. If the kernel re-initialises KMES on resume (implementation-dependent), the generation counter signals consumers to re-attach.

No architectural change is needed. A future version MAY add an explicit note to the clock discontinuity section covering suspend/resume.

§10.1.4 SMT capacity planning

Each logical CPU (hardware thread) receives its own ring buffer. On a system with simultaneous multithreading (e.g., 2 threads per core), the total ring buffer memory is doubled compared to a physical-core-only count. At the default 4 MB capacity on a 64-core / 128-thread system, total ring buffer memory is 512 MB.

This is by design -- events are emitted per logical CPU and the cpu_id in event headers reflects the logical CPU. The per-logical-CPU design is correct. A future version MAY add guidance on capacity tuning for SMT-heavy systems.
