Specification

KMES

Kernel Mediated Event Subsystem — unified event emission, buffering, and delivery for the Peios kernel.

v0.20 Draft 2026-04-23

Section

1 Introduction

§1.1 Scope

This specification defines the Kernel Mediated Event Subsystem (KMES) for the Peios operating system. KMES is the event subsystem within the Peios Kernel Module (PKM), peer to KACS and LCS. It provides a unified event emission, buffering, and delivery mechanism for the Peios kernel.

KMES is the sole event emission path in Peios. All events -- whether originating from kernel subsystems or userspace processes -- are emitted through KMES. There is no alternative event path.

This specification covers:

The event model -- event structure, header format, msgpack payload, and the header/payload boundary
The emission API -- the internal kernel interface used by KACS and LCS to emit events
The syscall interface -- the mechanism for userspace event emission and consumer attachment
Event buffering and ordering -- per-CPU ring buffers, per-CPU sequence numbering, and wall clock timestamps
The delivery mechanism -- per-CPU shared memory ring buffers, double virtual mapping, lock-free read/write protocols, and futex-based consumer notification
Stamping -- KMES-intrinsic stamp fields (timestamp, sequence number, cpu_id, origin class); process and identity stamp fields are reserved for a future version pending KACS coordination

This specification does not cover:

Event persistence, indexing, or querying (eventd)
Event type schemas or naming conventions (eventd)
Boot identity and cross-boot sequencing (peinit / eventd)
KACS (covered by the KACS v0.20 specification)
LCS (covered by the LCS v0.21 specification)
Authentication or principal management (authd)

§1.2 Terminology

The following terms are used throughout this specification with the precise meanings defined here.

PKM (Peios Kernel Module): The single loadable kernel module (pkm.ko) containing all Peios kernel extensions. KMES, KACS, and LCS are peer subsystems within PKM.

KMES (Kernel Mediated Event Subsystem): The event subsystem within PKM. Provides the sole event emission path in Peios -- kernel subsystems and userspace processes emit events exclusively through KMES. KMES buffers events, stamps them with metadata, assigns per-CPU sequence numbers, and delivers them to userspace consumers via per-CPU shared memory ring buffers.

Event: An indivisible record consisting of a header and a payload. The header is a packed binary structure containing KMES-intrinsic metadata. The payload is a msgpack-encoded blob of arbitrary structured data defined by the emitter. Header and payload are always produced and consumed together -- neither is meaningful alone.

Header: The packed binary prefix of every event. Contains: event size, header size, wall clock timestamp, per-CPU sequence number, CPU identifier, origin class, and a length-prefixed event type string. The header also reserves space for future process and identity stamp fields.

Payload: The msgpack-encoded body of an event. Its structure is defined by the emitting subsystem or process. KMES treats the payload as opaque -- it buffers and delivers payloads without interpreting them.

Stamp: The set of metadata fields in the event header that KMES populates at emission time. v0.20 stamp fields are KMES-intrinsic: timestamp, sequence number, cpu_id, and origin class. Future versions will add process and identity fields in coordination with the KACS specification.

Sequence number: A per-CPU, monotonically increasing 64-bit unsigned integer assigned by KMES to each event at emission time. Each CPU maintains its own independent counter, reset to zero when the PKM module loads. Gaps in the sequence on a given CPU indicate lost events (overwritten or dropped). The sequence number is not a global ordering primitive -- events are ordered primarily by timestamp.

Origin class: A header field identifying the subsystem or emission path that produced the event: a specific kernel subsystem (KMES, KACS, LCS) or userspace (via syscall).

Event type: An arbitrary, length-prefixed UTF-8 string in the event header identifying the kind of event. KMES imposes no structure or naming convention on event types -- schema and naming are consumer concerns.

Ring buffer: A per-CPU shared memory region created and managed by KMES, mapped read-only into the address space of authorized userspace consumers. KMES maintains one ring buffer per CPU. Each buffer is independent, with its own write position, sequence counter, and futex notification. Ring buffers are the sole delivery mechanism from KMES to userspace.

Boot buffer: An internal per-CPU kernel buffer used by KMES to capture events during early boot, before the registry is available and the consumer-facing ring buffers are created. One boot buffer exists per CPU. Boot buffers are not visible to consumers. When the ring buffers are created, surviving boot buffer events are copied into them.

Consumer: A userspace process that maps one or more KMES ring buffers and reads events from them. Consumers typically dedicate one thread per CPU buffer. eventd is the primary consumer.

§1.3 Conventions

§1.3.1 Normative keywords

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this specification are to be interpreted as described in RFC 2119.

§1.3.2 Byte order

All multi-byte integers in the event header and ring buffer metadata page are little-endian. The ring buffer magic field is a fixed byte sequence compared byte-by-byte, not an integer.

§1.3.3 String encoding

Event type strings are UTF-8 encoded. No case folding or normalization is applied -- event types are compared as raw byte sequences.

§1.3.4 Payload encoding

Event payloads are encoded using MessagePack (msgpack) as defined by the MessagePack specification. KMES does not interpret payload contents.

§1.4 Compatibility

KMES is not a port or reimplementation of any existing event subsystem. The design -- structured events with a fixed binary header and msgpack payload, a shared memory ring buffer for delivery, and a single emission path for both kernel and userspace -- was chosen to meet the specific requirements of Peios: unified observability, trusted kernel-stamped metadata, and a simple delivery mechanism.

KMES serves a similar role to ETW (Event Tracing for Windows) in the Windows kernel and the audit subsystem in Linux (the kernel audit framework together with its userspace daemon, auditd). It is not compatible with either at the wire level, format level, or API level.

§1.4.1 Features handled by other subsystems

Feature	Subsystem
Event persistence and querying	eventd
Event type schemas and naming conventions	eventd
Boot identity and cross-boot sequencing	peinit / eventd

Section

2 Event model

§2.1 Event Model

§2.1.1 Structure

An event is an indivisible record consisting of a header followed by a payload. The header is a packed binary structure with no padding between fields. The payload is a msgpack-encoded blob. Header and payload are always stored, transmitted, and consumed as a single contiguous byte sequence.

§2.1.2 Header layout

The header fields are laid out sequentially with no padding or alignment gaps.

Offset	Size	Field	Description
0	4	`event_size`	Total size of the event (header + payload) in bytes. `u32`, little-endian.
4	4	`header_size`	Size of the header in bytes. `u32`, little-endian.
8	8	`timestamp`	Wall clock time at emission. Nanoseconds since Unix epoch. `u64`, little-endian.
16	8	`sequence`	Per-CPU, per-boot monotonic sequence number. `u64`, little-endian. Used for gap detection: a gap in the sequence on a given CPU indicates lost events.
24	2	`cpu_id`	The CPU on which the event was emitted. `u16`, little-endian. Identifies which per-CPU ring buffer contains this event.
26	1	`origin_class`	Origin of the event. `u8`.
27	2	`type_len`	Length of the event type string in bytes. `u16`, little-endian.
29	`type_len`	`type`	Event type string. UTF-8 encoded. Not null-terminated.

The payload begins at offset header_size from the start of the event. The next event in the ring buffer begins at offset event_size from the start of the current event.

header_size MAY be larger than the minimum size required by the defined fields. Bytes between the end of the type string and header_size are reserved for future stamp fields. Consumers MUST use header_size to locate the payload, not the end of the event type string.
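The layout and the header_size rule above can be sketched from the consumer side as follows. This is a minimal illustration, not a normative parser: the function name `parse_event`, the returned dictionary keys, and the consistency checks are assumptions made for the example.

```python
import struct

# Fixed-size prefix of the header: event_size, header_size, timestamp,
# sequence, cpu_id, origin_class, type_len. "<" selects little-endian
# with no padding, matching the packed layout (29 bytes).
FIXED_HEADER = struct.Struct("<IIQQHBH")

def parse_event(buf, offset=0):
    """Parse one event at `offset`; return (fields, offset_of_next_event)."""
    (event_size, header_size, timestamp, sequence,
     cpu_id, origin_class, type_len) = FIXED_HEADER.unpack_from(buf, offset)
    # Hypothetical sanity checks: reject headers that contradict themselves.
    if header_size < FIXED_HEADER.size + type_len or event_size < header_size:
        raise ValueError("inconsistent event header")
    if offset + event_size > len(buf):
        raise ValueError("event extends past the end of the buffer")
    type_start = offset + FIXED_HEADER.size
    event_type = bytes(buf[type_start:type_start + type_len]).decode("utf-8")
    # The payload starts at header_size, not at the end of the type string:
    # reserved bytes for future stamp fields may sit in between.
    payload = bytes(buf[offset + header_size:offset + event_size])
    fields = {
        "timestamp": timestamp, "sequence": sequence, "cpu_id": cpu_id,
        "origin_class": origin_class, "type": event_type, "payload": payload,
    }
    return fields, offset + event_size
```

Because the payload is located through header_size, the same code keeps working if a later version inserts stamp fields after the type string.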

§2.1.3 Stamp fields

KMES populates the following header fields at emission time. Userspace emitters cannot supply or override any of them. Kernel emitters supply only origin_class; KMES sets the remaining fields unconditionally.

timestamp -- captured from the wall clock (CLOCK_REALTIME) at the moment KMES accepts the event.
sequence -- the next value from the emitting CPU's per-boot monotonic counter, starting at zero when the PKM module loads.
cpu_id -- the CPU on which the event was emitted.
origin_class -- for syscall emission, set unconditionally to 0 (userspace) by KMES. For kernel emission, set to the value provided by the calling subsystem.

event_size, header_size, and type_len are structural fields set by KMES during event construction.

The emitter provides the event type string and the msgpack payload. KMES does not modify either.

§2.1.4 Ordering

For cross-CPU ordering, events are ordered by timestamp (wall clock). Events with identical timestamps from different CPUs were genuinely concurrent and have no defined relative order. Within a single CPU, the sequence number provides reliable monotonic ordering even across clock discontinuities. Events with identical timestamps from the same CPU are ordered by sequence.

There is no global sequence number. Each CPU maintains its own independent sequence counter. The pair (cpu_id, sequence) uniquely identifies an event within a single boot.
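One way a consumer might combine the two ordering rules is a k-way merge that only ever compares the head of each CPU's stream. The sketch below is an assumption about consumer design, not part of KMES: the function name, the input shape (cpu_id mapped to a list of `(timestamp, sequence)` pairs already in sequence order), and the tie-break by cpu_id are all illustrative. The tie-break is arbitrary because events with identical timestamps on different CPUs have no defined relative order.

```python
import heapq

def merge_per_cpu(streams):
    """Merge per-CPU event lists into one timeline.

    Cross-CPU order follows the timestamp of each stream's next event.
    Per-CPU order is never changed, so sequence order survives even when
    the wall clock steps backwards on a CPU.
    """
    heap = []
    for cpu_id, events in streams.items():
        if events:
            ts, seq = events[0]
            heap.append((ts, cpu_id, seq, 0))
    heapq.heapify(heap)
    merged = []
    while heap:
        ts, cpu_id, seq, idx = heapq.heappop(heap)
        merged.append((cpu_id, seq, ts))
        nxt = idx + 1
        if nxt < len(streams[cpu_id]):
            next_ts, next_seq = streams[cpu_id][nxt]
            heapq.heappush(heap, (next_ts, cpu_id, next_seq, nxt))
    return merged
```

A plain sort on timestamp would reorder a CPU's own events across a clock discontinuity; merging by stream head avoids that at the cost of a slightly less precise cross-CPU interleaving.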

§2.1.4.1 Future identity stamp fields

A future version of this specification will define additional stamp fields carrying process and identity information about the emitter. These fields will be added to the header between the event type string and the payload boundary, extending header_size. The specific fields are deferred to avoid specification conflicts with KACS, which owns process identity primitives.

Consumers MUST NOT assume that header_size equals the minimum header size. Consumers MUST use header_size to locate the payload.

§2.1.5 Origin class values

Value	Origin
0	Userspace (syscall)
1	KMES
2	KACS
3	LCS

Values 4--255 are reserved for future kernel subsystems.

§2.1.6 Payload

The payload is a single msgpack-encoded value occupying the bytes from offset header_size to offset event_size. KMES does not interpret or modify the payload. The payload's structure is defined by the emitter and understood by consumers.

The payload MUST be valid msgpack. KMES does not validate payloads from kernel emitters. For events emitted via the syscall interface, KMES MUST validate that the payload is well-formed msgpack before accepting the event. Validation is iterative with a bounded maximum nesting depth. Events with invalid payloads MUST be rejected and the syscall MUST return an error to the caller.
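The iterative, depth-bounded validation described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the KMES validator: the function name, the exact meaning of "depth" (number of enclosing arrays or maps), and the decision not to check that str bodies are valid UTF-8 are choices made for the example. The tag values come from the MessagePack format.

```python
# Scalars with a fixed-size body after the tag byte (floats, ints, fixext).
_FIXED = {0xCA: 4, 0xCB: 8, 0xCC: 1, 0xCD: 2, 0xCE: 4, 0xCF: 8,
          0xD0: 1, 0xD1: 2, 0xD2: 4, 0xD3: 8,
          0xD4: 2, 0xD5: 3, 0xD6: 5, 0xD7: 9, 0xD8: 17}
# tag -> (length-field width, extra bytes): bin, ext (one type byte), str.
_SIZED = {0xC4: (1, 0), 0xC5: (2, 0), 0xC6: (4, 0),
          0xC7: (1, 1), 0xC8: (2, 1), 0xC9: (4, 1),
          0xD9: (1, 0), 0xDA: (2, 0), 0xDB: (4, 0)}
# tag -> (count-field width, values per entry): array16/32, map16/32.
_CONTAINERS = {0xDC: (2, 1), 0xDD: (4, 1), 0xDE: (2, 2), 0xDF: (4, 2)}

def is_valid_msgpack(data, max_depth):
    """True if `data` is exactly one well-formed msgpack value."""
    pos, end = 0, len(data)
    pending = [1]      # values still expected at each open nesting level
    while pending:
        if pending[-1] == 0:
            pending.pop()              # this level is complete
            continue
        pending[-1] -= 1
        if pos >= end:
            return False               # truncated input
        tag = data[pos]
        pos += 1
        skip, children = 0, None
        if tag <= 0x7F or tag >= 0xE0 or tag in (0xC0, 0xC2, 0xC3):
            pass                       # fixint, nil, false, true
        elif tag <= 0x8F:
            children = 2 * (tag & 0x0F)    # fixmap: key and value per entry
        elif tag <= 0x9F:
            children = tag & 0x0F          # fixarray
        elif tag <= 0xBF:
            skip = tag & 0x1F              # fixstr
        elif tag == 0xC1:
            return False                   # tag never used by msgpack
        elif tag in _FIXED:
            skip = _FIXED[tag]
        else:
            width, factor = _SIZED.get(tag) or _CONTAINERS[tag]
            if pos + width > end:
                return False
            n = int.from_bytes(data[pos:pos + width], "big")
            pos += width
            if tag in _CONTAINERS:
                children = n * factor
            else:
                skip = n + factor
        pos += skip
        if pos > end:
            return False
        if children is not None:
            pending.append(children)
            if len(pending) - 1 > max_depth:
                return False           # nesting too deep
    return pos == end                  # no trailing bytes allowed
```

The explicit `pending` stack replaces recursion, so memory use is bounded by `max_depth` rather than by attacker-controlled nesting.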

§2.1.7 Size limits

For events emitted via syscall, the maximum permitted event size is runtime-configurable via the registry (MaxEventSize). KMES uses an internal default until the registry is reachable. Events exceeding the limit MUST be rejected and the syscall MUST return an error. Kernel emitters are not subject to the configurable size limit -- they are subject only to the 50% ring buffer capacity structural limit defined in the Emission API section.

Section

3 Emission API

§3.1 Emission API

§3.1.1 Purpose

The emission API is the internal kernel interface through which PKM subsystems emit events into KMES. It is not a syscall -- it is a function call within the kernel module. Userspace event emission is handled by the syscall interface defined in a later section.

§3.1.2 Interface

A kernel emitter calls KMES with the following parameters:

origin_class (u8) -- the origin class value identifying the emitting subsystem.
event_type (byte pointer + length) -- the event type string. UTF-8 encoded.
payload (byte pointer + length) -- the msgpack-encoded payload.

The emitter does not specify a CPU or buffer. KMES writes the event to the ring buffer of the CPU on which the calling code is currently executing.

§3.1.3 Preemption

The entire emission path MUST execute with preemption disabled. Preemption is disabled before determining the current CPU and re-enabled after the ring buffer write is complete. This guarantees that the emitting thread cannot be migrated to a different CPU mid-write, which would violate the single-writer-per-buffer invariant.

For kernel emitters, preemption is disabled for the full emission path (timestamp capture through ring buffer write). This is acceptable because kernel emitters produce small, trusted payloads and the total non-preemptible window is a few hundred nanoseconds.

§3.1.4 Event construction

KMES constructs the event by:

Capturing the wall clock timestamp.
Incrementing the current CPU's per-boot sequence counter and taking the new value. This is a CPU-local operation with no cross-CPU contention.
Building the packed header from the stamp fields, cpu_id, origin class, and event type.
Writing the header and payload contiguously into the current CPU's ring buffer.

The timestamp is captured before the sequence number is assigned. Two events with the same timestamp on the same CPU are ordered by sequence number.

§3.1.5 Caller contract

The emitter MUST provide a valid origin class value as defined in the Event Model section. The emitter MUST provide a valid UTF-8 event type string. The emitter MUST provide the payload as a contiguous byte buffer.

KMES does not validate the origin class, event type encoding, or payload contents from kernel emitters. These are trusted callers within PKM.

§3.1.6 Structural checks

The emission API performs the following structural checks on every call:

The event type string MUST have nonzero length.
The total event size (header + payload) MUST fit in a u32.
The total event size MUST NOT exceed 50% of the per-CPU ring buffer capacity.

The ring buffer capacity check protects against kernel bugs that would emit an event large enough to overwrite most of a CPU's event history. This threshold is a fixed ratio, not a configurable parameter.

If a structural check fails, the event is not written to the ring buffer. The sequence number still advances, making the drop visible as a gap in the sequence. KMES increments an internal dropped-event counter.
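The three checks can be expressed compactly as below. This is only a sketch of the arithmetic: the function name, the `reserved` parameter (bytes between the type string and the payload), and the returned reason strings are assumptions for illustration. The 29-byte figure is the fixed part of the header from the header layout table.

```python
U32_MAX = 0xFFFF_FFFF
FIXED_HEADER_SIZE = 29   # bytes before the type string

def structural_check(type_len, payload_len, capacity, reserved=0):
    """Return None if the event may be written, else a reason string."""
    if type_len == 0:
        return "empty event type"
    event_size = FIXED_HEADER_SIZE + type_len + reserved + payload_len
    if event_size > U32_MAX:
        return "event size does not fit in u32"
    # capacity is a power of two, so capacity // 2 is exactly 50%.
    if event_size > capacity // 2:
        return "event exceeds 50% of ring buffer capacity"
    return None
```

With a 4096-byte buffer, an event of exactly 2048 bytes passes and one of 2049 bytes is dropped.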

§3.1.7 Ring buffer full

Each per-CPU ring buffer is circular. When a buffer is full, KMES overwrites the oldest events in that buffer to make space for the new event. The write pointer advances unconditionally -- emission never blocks and never fails due to buffer pressure.

Consumers detect overwritten events as gaps in the sequence number. If the event at a consumer's read position has been overwritten (the read position is behind tail_pos), the consumer resumes reading at the oldest surviving event.
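Gap detection on a single CPU's stream reduces to comparing each sequence number with the one expected next. The helper below is a hypothetical consumer-side sketch; its name and its `(first_missing, count)` output shape are not defined by this specification.

```python
def detect_gaps(sequences):
    """Yield (first_missing, count) for each gap in one CPU's sequences."""
    expected = None
    for seq in sequences:
        if expected is not None and seq > expected:
            yield (expected, seq - expected)
        expected = seq + 1
```

The first event seen establishes the baseline, so events lost before the consumer started reading are not reported by this helper.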

§3.1.8 Batch emission

The batch emission API allows a kernel emitter to emit multiple events as a single operation, reducing per-event overhead.

§3.1.8.1 Interface

A kernel emitter calls the batch API with the following parameters:

origin_class (u8) -- the origin class value identifying the emitting subsystem. Applied to all events in the batch.
events (array of event descriptors) -- each descriptor contains an event type (byte pointer + length) and a payload (byte pointer + length).
count (u32) -- the number of events in the array.

§3.1.8.2 Behavior

KMES processes the batch as follows:

Disable preemption.
Capture a single wall clock timestamp. All events in the batch share this timestamp.
For each event in order: perform structural checks, assign a sequence number, build the header, and write the event to the ring buffer. If a structural check fails on any event, the failing event is dropped (sequence number consumed, gap visible) but subsequent events in the batch continue to be processed.
Store write_pos with a single release barrier after all events are written.
Check need_wake once. If need_wake is 1, increment futex_counter with a release store and issue futex_wake to wake all waiting consumer threads.
Re-enable preemption.

The shared timestamp reflects the logical instant of the batch. Events within a batch are ordered by their sequence numbers. The single write_pos update and single need_wake check are the primary performance benefits over individual emission.

§3.1.8.3 Failure semantics

The kernel batch API continues processing after a single event failure. If a structural check fails on event N, event N is dropped but events N+1 through the end of the batch are still processed. This is deliberately different from the syscall batch API (kmes_emit_batch), which stops processing at the first failure. Kernel emitters are trusted and individual structural failures are expected to be rare (indicating a kernel bug). Syscall emitters are untrusted and receive an error indicating which entry failed so the caller can diagnose and fix the issue.

§3.1.8.4 Caller contract

The same caller contract as single emission applies to each event in the batch. KMES does not validate payload contents from kernel emitters.

§3.1.9 Atomicity

Individual event writes to the ring buffer MUST be atomic from the consumer's perspective. A consumer MUST NOT observe a partially written event. For batch emission, the write_pos update is deferred until all events are written, so consumers observe the entire batch atomically -- no events from the batch are visible until all have been written. The consumer processes individual events within the batch, each of which is independently valid. The mechanism used to guarantee write atomicity is defined in the Ring Buffer section.

Section

4 Syscall interface

§4.1 Syscall Interface

§4.1.1 Overview

KMES exposes three syscalls in the PKM range (1090--1099):

kmes_emit (1090) -- emit a single event from userspace.
kmes_attach (1091) -- attach as a consumer and obtain per-CPU ring buffer file descriptors.
kmes_emit_batch (1092) -- emit multiple events from userspace as a single operation.

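All three syscalls use standard Linux error conventions: return -1 and set errno on failure. kmes_emit_batch additionally reports partial success by returning the number of events emitted, as described in its Return section.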

§4.1.2 kmes_emit (1090)

Emits a single event into KMES from userspace. The origin class is set to 0 (userspace) unconditionally -- the caller cannot specify it. The event is written to the ring buffer of the CPU on which the calling thread is currently executing.

§4.1.2.1 Privilege requirement

The caller's effective token MUST hold SeAuditPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.2.2 Parameters

Parameter	Type	Description
`event_type`	`const char *`	Pointer to the event type string.
`event_type_len`	`u16`	Length of the event type string in bytes.
`payload`	`const void *`	Pointer to the msgpack-encoded payload.
`payload_len`	`u32`	Length of the payload in bytes.

§4.1.2.3 Validation

KMES performs the following validation on every kmes_emit call, in order:

1. The caller MUST hold SeAuditPrivilege. Fails with EPERM.
2. event_type_len MUST be nonzero. Fails with EINVAL.
3. The event type and payload are copied from userspace into a kernel buffer. If either pointer is inaccessible, fails with EFAULT. All subsequent validation and the ring buffer write operate on the kernel copy, not the original userspace memory. This prevents TOCTOU (time-of-check-time-of-use) attacks where userspace modifies the payload between validation and the ring buffer write.
4. The total event size (header + payload) MUST NOT exceed the configured maximum event size. Fails with ENOSPC.
5. The total event size MUST NOT exceed 50% of the per-CPU ring buffer capacity. Fails with ENOSPC.
6. The payload MUST be valid msgpack with nesting depth not exceeding the configured maximum. Fails with EINVAL.

Validation stops at the first failure. The error reflects the first check that failed.

§4.1.2.4 Preemption

Validation (steps 1--6) runs with preemption enabled. The userspace copy at step 3 may trigger page faults, and msgpack validation at step 6 may take microseconds for large payloads. Neither requires CPU affinity.

Preemption is disabled only for the ring buffer write: determining the current CPU, constructing the header (stamping timestamp, sequence number, cpu_id), writing the event, advancing write_pos, and checking need_wake. This keeps the non-preemptible window to a few hundred nanoseconds regardless of payload size. The cpu_id in the event header reflects the CPU at write time, not at syscall entry time.

§4.1.2.5 Behavior

On success, KMES writes the event to the current CPU's ring buffer and returns 0. The event is visible to consumers immediately.

If the ring buffer is full, KMES overwrites the oldest events. The syscall never blocks due to buffer pressure.

§4.1.2.6 Return

Returns 0 on success. Returns -1 and sets errno on failure.

§4.1.2.7 Errors

Errno	Meaning
EPERM	Caller does not hold SeAuditPrivilege.
EINVAL	Event type length is zero, or payload is invalid msgpack, or payload nesting depth exceeds MaxNestingDepth.
EFAULT	Event type or payload pointer is inaccessible.
ENOSPC	Event exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation for the staging buffer failed.

§4.1.3 kmes_emit_batch (1092)

Emits multiple events into KMES from userspace as a single operation. All events share a single timestamp, and the overhead of privilege checking, notification, and the write_pos release barrier is incurred once for the batch rather than per event.

§4.1.3.1 Privilege requirement

The caller's effective token MUST hold SeAuditPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.3.2 Parameters

Parameter	Type	Description
`entries`	`struct kmes_emit_entry __user *`	Pointer to an array of event descriptors.
`count`	`u32`	Number of entries in the array. MUST be at least 1 and at most 256. The limit of 256 bounds the worst-case preemption-disabled window during the ring buffer write phase to approximately 50--100 microseconds for typical event sizes, while providing strong syscall overhead amortization.

Each kmes_emit_entry contains:

Field	Type	Description
`event_type`	`const char *`	Pointer to the event type string.
`event_type_len`	`u16`	Length of the event type string in bytes.
`payload`	`const void *`	Pointer to the msgpack-encoded payload.
`payload_len`	`u32`	Length of the payload in bytes.

§4.1.3.3 Validation

The caller MUST hold SeAuditPrivilege. Fails with EPERM.
count MUST be between 1 and 256 inclusive. Fails with EINVAL.
The entry descriptor array is copied from userspace. Fails with EFAULT if inaccessible.
For each entry in order, starting from index 0: the event type and payload are copied from userspace into kernel memory. Each entry is validated using the same rules as kmes_emit (nonzero event type length, total size within MaxEventSize, total size within 50% of buffer capacity, valid msgpack within MaxNestingDepth). If any entry fails validation, processing stops. Events before the failing entry that passed validation are emitted. The failing entry and all subsequent entries are not processed.

§4.1.3.4 Preemption

The userspace copies and msgpack validation run with preemption enabled. Preemption is disabled only for the ring buffer writes, the single write_pos release barrier, and the need_wake check.

§4.1.3.5 Behavior

All successfully validated events share a single wall clock timestamp, captured once at the start of the ring buffer write phase. Each event receives its own sequence number. The origin class is set to 0 (userspace) for all events.

If the ring buffer is full, KMES overwrites the oldest events. The syscall never blocks due to buffer pressure.

§4.1.3.6 Return

Returns the number of events successfully emitted (0 to count). If all events are emitted, the return value equals count. If validation fails on entry N, the return value is N (events 0 through N-1 were emitted) and errno is set to indicate why entry N failed. Events that fail validation (entry N and all subsequent entries) do not consume sequence numbers -- they never enter the ring buffer write phase.

Returns -1 and sets errno for errors that prevent any processing (EPERM, EINVAL on count, EFAULT on the entry array itself, ENOMEM).
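The return rules can be modelled in a few lines. The function below is a simplified model for reasoning about caller behaviour, not kernel code: its name and input shape (one entry per event, `None` if the entry validates, otherwise the errno it would fail with) are assumptions, and it covers only the count check and per-entry validation.

```python
import errno

def batch_outcome(entry_errors):
    """Model kmes_emit_batch: return (return_value, errno_or_None)."""
    if not 1 <= len(entry_errors) <= 256:
        return -1, errno.EINVAL        # nothing is processed
    for index, err in enumerate(entry_errors):
        if err is not None:
            # Entries before `index` were emitted; the rest are skipped.
            return index, err
    return len(entry_errors), None
```

A return value of 0 with errno set therefore means that entry 0 failed validation, which is distinct from -1, where the batch was never examined.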

§4.1.3.7 Errors

Errno	Meaning
EPERM	Caller does not hold SeAuditPrivilege.
EINVAL	`count` is 0 or exceeds 256, or the failing entry has a zero-length event type, or the failing entry's payload is invalid msgpack or exceeds MaxNestingDepth.
EFAULT	Entry array, event type, or payload pointer is inaccessible.
ENOSPC	The failing entry exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation failed.

§4.1.4 kmes_attach (1091)

Attaches the caller as a consumer of the KMES ring buffers. Returns one file descriptor per CPU, each independently mappable.

§4.1.4.1 Privilege requirement

The caller's effective token MUST hold SeSecurityPrivilege. If the privilege is not held or not enabled, the syscall fails with EPERM.

§4.1.4.2 Parameters

Parameter	Type	Description
`fds`	`int __user *`	Pointer to a caller-provided buffer for the returned file descriptors.
`count`	`int __user *`	Pointer to an integer. On entry, the size of the `fds` buffer (number of int-sized slots). On return, the number of CPUs (and thus the number of file descriptors written).

§4.1.4.3 Behavior

KMES writes one file descriptor per CPU into the fds buffer, in CPU order (index 0 = CPU 0, index 1 = CPU 1, etc.). The number of CPUs is written to *count.

If the buffer is too small (*count on entry is less than the number of CPUs), no file descriptors are created. *count is set to the required number and the syscall fails with ERANGE. The caller SHOULD retry with a sufficiently large buffer.

File descriptor creation is all-or-nothing. If allocation of any file descriptor fails (e.g., ENOMEM), all previously created file descriptors from this call are closed internally before the error is returned. The caller receives either all N file descriptors or none.
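The ERANGE retry can be sketched as a small loop. Everything about the calling convention here is hypothetical: `attach(slots)` stands in for a userspace wrapper around kmes_attach and is assumed to return `(fds, count, err)`, where `err` is 0 on success or an errno value, and `count` is the value the kernel left in *count. No such wrapper is defined by this specification.

```python
import errno

def attach_all(attach, slots=1, max_tries=4):
    """Return the per-CPU fd list, growing the buffer on ERANGE."""
    for _ in range(max_tries):
        fds, count, err = attach(slots)
        if err == 0:
            return fds[:count]         # index i is the fd for CPU i
        if err != errno.ERANGE:
            raise OSError(err, "kmes_attach failed")
        slots = count                  # kernel reported the required size
    raise OSError(errno.ERANGE, "required slot count kept changing")
```

The bounded loop is a defensive choice in case the reported count differs between calls; a single retry is enough when it is stable.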

Each file descriptor independently supports:

mmap() -- maps that CPU's ring buffer into the caller's address space. The mapped region layout is defined in the Ring Buffer section.
close() -- releases the file descriptor. The mapping becomes invalid.

The mapping is unconditionally read-only. No privilege, capability, or token grants write access to the ring buffer from userspace. Only KMES writes to the ring buffer. Any attempt to write to the mapped region results in a segmentation fault.

Multiple consumers MAY attach simultaneously. Each consumer maintains its own read position per buffer independently.

§4.1.4.4 Notification

Each per-CPU ring buffer has its own futex counter (u32) in its metadata page. When KMES writes an event to a CPU's ring buffer and need_wake is set, it increments that buffer's futex counter and issues a futex_wake that wakes all waiting threads. This ensures that multiple consumers attached to the same buffer are all woken when events arrive.

This allows consumers to dedicate one thread per CPU buffer, each sleeping independently on its own futex. Under sustained load, consumer threads remain in the drain loop, need_wake stays 0, and KMES skips all notification overhead.

§4.1.4.5 Return

Returns 0 on success. Returns -1 and sets errno on failure.

§4.1.4.6 Errors

Errno	Meaning
EPERM	Caller does not hold SeSecurityPrivilege.
ERANGE	The `fds` buffer is too small. `*count` is set to the required number of slots.
EFAULT	`fds` or `count` points to inaccessible memory.
ENOMEM	Kernel memory allocation failed.

Section

5 Ring buffer

§5.1 Ring Buffer

§5.1.1 Overview

The ring buffer is the sole delivery mechanism from KMES to userspace consumers. KMES maintains one ring buffer per CPU. Each per-CPU buffer is an independent shared memory region, independently mappable, with its own metadata, write position, and futex counter. There is no shared state between per-CPU buffers on the write path.

This per-CPU design eliminates all contention on the event emission path. Each CPU writes to its own buffer using its own counters. No atomic operations contend across CPUs. This is critical for workloads where KMES traces every syscall across many cores.

Consumers read from per-CPU buffers independently. Each buffer is a complete, self-contained ring buffer -- the same structure, the same read protocol, the same overwrite semantics. The per-CPU design does not change the ring buffer contract; it replicates it.

§5.1.2 Boot buffer

KMES begins buffering events the instant PKM loads, before the registry is available. During this early boot window, events are stored in internal kernel boot buffers (one per CPU). Boot buffers are not visible to consumers and cannot be mapped.

When LCS becomes available, KMES reads the configured ring buffer size from the registry, creates the consumer-facing per-CPU ring buffers at that size, and copies all surviving boot buffer events into them. The boot buffers are then discarded.

If LCS is not available (module loaded without registry), KMES creates the ring buffers at a compiled-in default size.

Boot buffers use the same circular overwrite semantics as ring buffers. If a boot buffer fills before the ring buffer is created, the oldest events are overwritten. The boot buffer size is a compiled-in constant.

§5.1.3 Capacity

All per-CPU ring buffers share the same capacity. The capacity MUST be a power of two. This allows the wrap-around offset calculation to use a bitwise AND (position & (capacity - 1)) instead of a modulo operation. Every event read and write hits this calculation.

The capacity is configurable via the registry. The compiled-in default is used when the registry is not yet available. The minimum and maximum permitted capacities are implementation-defined but MUST both be powers of two.
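The mask trick relies on the power-of-two requirement, as the short sketch below shows. The helper names are illustrative only.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for 0 and everything else."""
    return n > 0 and (n & (n - 1)) == 0

def data_offset(position, capacity):
    """Map a monotonically increasing byte position to a data-region offset."""
    if not is_power_of_two(capacity):
        raise ValueError("capacity must be a power of two")
    return position & (capacity - 1)
```

For any power-of-two capacity, `position & (capacity - 1)` gives the same result as `position % capacity`; for other capacities the two differ, which is why the requirement is a MUST.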

§5.1.4 Double virtual mapping

Each ring buffer's physical pages are mapped twice consecutively in virtual memory. If the ring buffer occupies N physical pages, the data region spans 2N pages of virtual address space, with the second N pages mapping the same physical memory as the first N.

Physical pages:  [0][1][2][3]
Virtual mapping: [0][1][2][3][0][1][2][3]

This eliminates all wrap-around handling. When KMES writes an event that crosses the end of the buffer, the write continues into virtual addresses that map back to the beginning of the physical buffer. No branch, no split write, no padding. A single contiguous memcpy handles every write regardless of position.

Consumers benefit identically -- an event that wraps around the physical boundary is read as a single contiguous byte sequence from the consumer's perspective.
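The effect on a reader can be modelled without real page tables. In the sketch below, `view = phys + phys` is a copy that merely stands in for the second virtual mapping; in a real mapping both halves alias the same physical pages. The function names are hypothetical, and reads are assumed to be no longer than the capacity.

```python
def read_split(phys, start, length):
    """Read from a plain ring: needs two slices when crossing the end."""
    first = phys[start:min(start + length, len(phys))]
    return first + phys[:length - len(first)]

def read_double_mapped(view, start, length):
    """Read from a double-mapped view: always one contiguous slice."""
    return view[start:start + length]

phys = bytearray(b"ABCDEFGH")   # capacity of 8 bytes
view = phys + phys              # copy standing in for the second mapping
```

Reading 4 bytes at offset 6 yields `GHAB` either way, but the double-mapped read has no wrap-around branch.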

§5.1.5 Mapped region layout

When a consumer calls mmap() on a per-CPU file descriptor returned by kmes_attach, the mapped region has the following layout:

Region	Size	Description
Metadata page	4096 bytes	Control fields.
Data region	2 × capacity	The double-mapped ring buffer containing events.

The total mapping size is 4096 + (2 × capacity) bytes. Every per-CPU buffer has the same layout and the same capacity.

§5.1.6 Metadata page

The metadata page is laid out to prevent false sharing. Fields that are updated at different frequencies are placed on separate 64-byte cache lines.

False sharing occurs when two independent fields share a cache line. Updating one field invalidates the line in every other core that has it cached, forcing those cores to re-fetch it even if they only read the unchanged field. In the per-CPU design, false sharing between CPUs is eliminated by using separate buffers. Cache line separation within a buffer prevents false sharing between the producing CPU and consuming threads.

§5.1.6.1 Cache line 0 -- static fields (bytes 0--63)

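Written once when the ring buffer is created. With one exception, these fields are never modified after initialisation, and consumers MAY cache them for the lifetime of the mapping. The exception is generation, which KMES increments when the buffer is replaced during a buffer swap; consumers re-read it as described in the generation check.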

Offset	Size	Type	Field	Description
0	8	`[u8; 8]`	`magic`	Magic byte sequence identifying this as a KMES ring buffer. Value: `4B 4D 45 53 52 49 4E 47` (`KMESRING` in ASCII). Compared byte-by-byte, not as an integer.
8	4	`u32`	`version`	Ring buffer format version. v0.20 uses version 1.
12	2	`u16`	`cpu_id`	The CPU this buffer belongs to.
14	2	`u16`	`reserved0`	Reserved. Must be zero.
16	8	`u64`	`capacity`	Data region capacity in bytes. Power of two.
24	8	`u64`	`data_offset`	Byte offset from the start of the mapping to the data region. Equal to the metadata page size (4096).
32	8	`u64`	`generation`	Buffer generation counter. Starts at 1 for the first ring buffer created on each CPU. Monotonically increasing across buffer swaps -- the new buffer's generation is the old buffer's incremented value.
40	24	--	`reserved1`	Reserved. Must be zero. Pads to cache line boundary.

§5.1.6.2 Cache line 1 -- producer fields (bytes 64--127)

Written by KMES on every event write to this CPU's buffer. This is the hottest cache line in the ring buffer. In the per-CPU design, only one CPU ever writes to this cache line, eliminating cross-core invalidation.

Offset	Size	Type	Field	Description
64	8	`u64`	`write_pos`	Monotonically increasing byte offset of the next write position. Never wraps. The actual data region offset is `write_pos & (capacity - 1)`.
72	8	`u64`	`tail_pos`	Byte offset of the oldest surviving event. Advanced by KMES when events are overwritten. Consumers whose read position is behind `tail_pos` have been lapped.
80	48	--	`reserved2`	Reserved. Must be zero. Pads to cache line boundary.

§5.1.6.3 Cache line 2 -- notification fields (bytes 128--191)

Used for futex-based sleep/wake coordination between KMES and consumers.

Offset	Size	Type	Field	Description
128	4	`u32`	`futex_counter`	Counter incremented by KMES when waking sleeping consumers. Consumers use `futex_wait` on this address. `u32` because Linux `futex(2)` operates on 32-bit integers. Only incremented when `need_wake` is set.
132	1	`u8`	`need_wake`	Consumer-managed flag. Set to 1 by the consumer before sleeping. Read by KMES after writing an event. If 0, KMES skips the futex_counter increment and futex_wake entirely. Cleared by the consumer after waking.
133	59	--	`reserved3`	Reserved. Must be zero. Pads to cache line boundary.

Under sustained load, the consumer is always draining and need_wake remains 0. KMES reads need_wake, sees 0, and skips all notification overhead -- no futex_counter increment, no futex_wake syscall. The entire notification path costs a single memory read per event (~1ns). Under low load, the consumer sets need_wake before sleeping, and KMES performs the full wake sequence when the next event arrives.
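A consumer can decode the three cache lines with fixed-offset unpacking. The sketch below is illustrative only. It assumes little-endian integers, as stated for the event header, and it uses plain reads, so it does not provide the acquire semantics the read protocol requires for write_pos and tail_pos. The names `Metadata` and `parse_metadata` are hypothetical.

```python
import struct
from typing import NamedTuple

MAGIC = b"KMESRING"
STATIC = struct.Struct("<8sIHHQQQ")  # cache line 0, bytes 0..39
PRODUCER = struct.Struct("<QQ")      # cache line 1, bytes 64..79
NOTIFY = struct.Struct("<IB")        # cache line 2, bytes 128..132


class Metadata(NamedTuple):
    version: int
    cpu_id: int
    capacity: int
    data_offset: int
    generation: int
    write_pos: int
    tail_pos: int
    futex_counter: int
    need_wake: int


def parse_metadata(page) -> Metadata:
    """Decode a snapshot of the metadata page (bytes, bytearray or mmap)."""
    if len(page) < 192:
        raise ValueError("metadata page is truncated")
    magic, version, cpu_id, _reserved0, capacity, data_offset, generation = \
        STATIC.unpack_from(page, 0)
    if magic != MAGIC:               # compared as bytes, never as an integer
        raise ValueError("not a KMES ring buffer")
    if version != 1:
        raise ValueError("unsupported ring buffer format version")
    if capacity == 0 or capacity & (capacity - 1):
        raise ValueError("capacity is not a power of two")
    write_pos, tail_pos = PRODUCER.unpack_from(page, 64)
    futex_counter, need_wake = NOTIFY.unpack_from(page, 128)
    return Metadata(version, cpu_id, capacity, data_offset, generation,
                    write_pos, tail_pos, futex_counter, need_wake)
```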

§5.1.7 Write protocol

Each per-CPU buffer has exactly one writer: the CPU it belongs to. There is no cross-CPU contention on any write operation. The write protocol uses no locks and no cross-CPU atomic operations.

For each event on a given CPU:

1. Capture the wall clock timestamp (CLOCK_REALTIME).
2. Increment the CPU's per-boot sequence counter and take the new value. This is a CPU-local operation with no contention.
3. Build the packed event header (timestamp, sequence number, cpu_id, origin class, event type).
4. Compute the total event size (header + payload).
5. If the total event size exceeds 50% of the ring buffer capacity, drop the event. The sequence number is consumed, creating a visible gap. Increment the internal dropped-event counter. Stop.
6. If write_pos + event_size - tail_pos > capacity, the write would overwrite surviving events. Advance tail_pos past overwritten events by reading each overwritten event's event_size field and adding it to tail_pos until sufficient space is available. Store tail_pos with a release memory barrier.
7. Write the event (header + payload) into the data region at offset write_pos & (capacity - 1). The double virtual mapping ensures this is a single contiguous write even if it crosses the physical buffer boundary.
8. Store the new write_pos (old value + event size) with a release memory barrier. This barrier ensures the event data is fully visible to consumers before write_pos advances.
9. Read need_wake. If need_wake is 0, stop -- no consumer is sleeping. If need_wake is 1, increment futex_counter with a release store and issue futex_wake to wake all waiting consumer threads.

During batch writes, steps 8--9 are deferred until all events in the batch have been written. A single release store of write_pos and a single need_wake check cover the entire batch. The overwrite check (step 6) uses an internal running write offset that tracks the current write frontier within the batch, not the consumer-visible write_pos.

The release barriers in steps 6 and 8 establish the ordering guarantee: a consumer that observes the new write_pos is guaranteed to observe the fully written event data and the correct tail_pos.
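The single-event path can be modelled in a few lines. The sketch below is a userspace model under stated assumptions, not kernel code: the timestamp is passed in rather than read from CLOCK_REALTIME so that runs are deterministic, plain assignments stand in for the release stores, futex_wake is omitted, and the wrap-aware copy emulates what the double mapping provides for free. The class `RingModel` and its members are hypothetical names.

```python
import struct

HEADER = struct.Struct("<IIQQHBH")  # fixed 29-byte prefix of the event header


class RingModel:
    """Single-producer model of one per-CPU ring (no real memory barriers)."""

    def __init__(self, capacity: int, cpu_id: int = 0):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.capacity, self.cpu_id = capacity, cpu_id
        self.data = bytearray(capacity)
        self.write_pos = self.tail_pos = 0
        self.sequence = self.dropped = 0
        self.need_wake = self.futex_counter = 0

    def _copy_in(self, pos: int, blob: bytes) -> None:
        start = pos & (self.capacity - 1)
        first = min(len(blob), self.capacity - start)
        self.data[start:start + first] = blob[:first]
        self.data[0:len(blob) - first] = blob[first:]  # wrapped remainder

    def _event_size_at(self, pos: int) -> int:
        mask = self.capacity - 1
        raw = bytes(self.data[(pos + i) & mask] for i in range(4))
        return struct.unpack("<I", raw)[0]

    def emit(self, timestamp: int, origin_class: int,
             event_type: bytes, payload: bytes) -> bool:
        self.sequence += 1                              # step 2
        header_size = HEADER.size + len(event_type)
        event_size = header_size + len(payload)         # step 4
        if event_size * 2 > self.capacity:              # step 5: over 50%
            self.dropped += 1
            return False                                # sequence stays consumed
        header = HEADER.pack(event_size, header_size, timestamp, self.sequence,
                             self.cpu_id, origin_class,
                             len(event_type)) + event_type  # step 3
        while self.write_pos + event_size - self.tail_pos > self.capacity:
            self.tail_pos += self._event_size_at(self.tail_pos)  # step 6
        self._copy_in(self.write_pos, header + payload)  # step 7
        self.write_pos += event_size                     # step 8
        if self.need_wake:                               # step 9
            self.futex_counter += 1                      # futex_wake would follow
        return True
```

With a 128-byte ring and 40-byte events, the fourth emit advances tail_pos from 0 to 40 before writing, so the oldest event is retired rather than corrupted in place.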

§5.1.8 Read protocol

Consumers read events directly from the mapped data region. The read protocol uses no locks and no syscalls during the event drain loop.

Each consumer maintains its own read position per buffer in process-local memory. KMES does not track consumer read positions and is not aware of how many consumers exist or how far behind they are.

A consumer typically dedicates one thread per CPU buffer. Each thread independently drains its buffer using the following protocol.

§5.1.8.1 Drain loop

1. Load write_pos with an acquire memory barrier. If write_pos == read_pos, no new events are available. Proceed to notification wait.
2. Load tail_pos with an acquire memory barrier. If read_pos < tail_pos, the consumer has been lapped -- events at read_pos have been overwritten. Set read_pos = tail_pos. The gap between the old read_pos and tail_pos represents lost events, detectable as a sequence number gap.
3. Save the current tail_pos as saved_tail.
4. Read the event at data region offset read_pos & (capacity - 1). The double virtual mapping ensures this is a contiguous read.
5. Re-read tail_pos. If tail_pos > saved_tail AND read_pos < tail_pos, the event was overwritten during the read (torn read). Discard the event and go to step 2.
6. The event is valid. Process it. Advance read_pos by the event's event_size. Go to step 1.
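The drain loop translates almost line for line into code. The sketch below is a model under stated assumptions: `buf` is any object exposing data, capacity, write_pos and tail_pos (a hypothetical interface), plain attribute reads stand in for the acquire loads, and `read_bytes` emulates the contiguous read that the double mapping provides.

```python
import struct


def read_bytes(data, capacity: int, pos: int, length: int) -> bytes:
    """Wrap-aware read standing in for a read from the double mapping."""
    start = pos & (capacity - 1)
    first = min(length, capacity - start)
    return bytes(data[start:start + first]) + bytes(data[:length - first])


def drain(buf, read_pos: int, handle) -> int:
    """Deliver every available event to `handle`; return the new read_pos."""
    while True:
        write_pos = buf.write_pos                 # step 1 (acquire load)
        if write_pos == read_pos:
            return read_pos                       # caller moves to notification wait
        tail = buf.tail_pos                       # step 2 (acquire load)
        if read_pos < tail:
            read_pos = tail                       # lapped: jump to oldest survivor
        saved_tail = tail                         # step 3
        (event_size,) = struct.unpack(
            "<I", read_bytes(buf.data, buf.capacity, read_pos, 4))
        event = read_bytes(buf.data, buf.capacity, read_pos, event_size)  # step 4
        tail = buf.tail_pos                       # step 5
        if tail > saved_tail and read_pos < tail:
            continue                              # torn read: discard and retry
        handle(event)                             # step 6
        read_pos += event_size
```

On a torn read this sketch restarts from step 1 rather than step 2; re-loading write_pos first is harmless. A production consumer would also bound event_size before copying, since a torn header can contain an arbitrary length.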

§5.1.8.2 Notification wait

When no events are available on a given buffer:

1. Store 1 to need_wake with a release memory barrier. This signals KMES that the consumer is about to sleep.
2. Re-read write_pos with an acquire barrier. If new events have arrived since the drain loop exited (KMES wrote between the drain loop's check and the need_wake store), clear need_wake to 0 and return to the drain loop.
3. Read the current futex_counter value.
4. Optionally spin briefly, re-checking write_pos for new events. If events arrive during the spin window, clear need_wake to 0 and return to the drain loop. The spin duration is a consumer implementation choice.
5. Call futex_wait(futex_counter_address, last_seen_value). The kernel puts the thread to sleep if futex_counter has not changed since it was read. This is a genuine kernel sleep -- the thread is descheduled and consumes no CPU.
6. On wake (KMES incremented futex_counter and called futex_wake), clear need_wake to 0 and return to the drain loop.

Clearing need_wake to 0 (steps 2, 4, and 6) is a plain (relaxed) store -- no memory barrier is required. If KMES observes a stale need_wake of 1 after the consumer has already cleared it, KMES performs a spurious futex_wake on a thread that is already awake. This is harmless -- futex_wake on a non-sleeping thread is a no-op.

The re-check at step 2 closes the race window between the drain loop finding no events and the need_wake store. If KMES writes an event and reads need_wake as 0 (because the consumer hasn't stored it yet), the consumer will see the new write_pos at step 2 and never enter futex_wait.

The adaptive spin in step 4 is optional. Without it, the consumer sleeps immediately when the buffer is empty and is woken by KMES. With it, the consumer catches closely-spaced events without a kernel round-trip. Under sustained load, the consumer never reaches the notification wait -- it stays in the drain loop and need_wake remains 0.
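The control flow of the notification wait can be sketched as follows. This is a model only: Python offers no control over memory ordering, so the barriers in steps 1 and 2 appear as comments, and `futex_wait` is a caller-supplied stand-in for the real futex call rather than an existing Python API. The function name and its return strings are hypothetical.

```python
def notification_wait(buf, read_pos: int, futex_wait, spin_iterations: int = 0) -> str:
    """Block until the drain loop should run again; report why it returned.

    `buf` exposes write_pos, futex_counter and need_wake.
    `futex_wait(expected)` must sleep only while futex_counter == expected.
    """
    buf.need_wake = 1                       # step 1 (release store in real code)
    if buf.write_pos != read_pos:           # step 2 (acquire load in real code)
        buf.need_wake = 0                   # relaxed store is sufficient
        return "raced"
    last_seen = buf.futex_counter           # step 3
    for _ in range(spin_iterations):        # step 4 (optional adaptive spin)
        if buf.write_pos != read_pos:
            buf.need_wake = 0
            return "spun"
    futex_wait(last_seen)                   # step 5
    buf.need_wake = 0                       # step 6
    return "woken"
```

Whatever the return value, the caller re-enters the drain loop; the strings exist only so the three exit paths can be told apart when testing.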

§5.1.8.3 Generation check

After completing a drain cycle (buffer fully drained or batch limit reached), the consumer SHOULD check the generation field in the metadata page.

If generation has changed since the consumer last checked:

Record the sequence number of the last successfully processed event from this buffer.
Call kmes_attach to obtain new file descriptors for the resized ring buffers.
mmap the new file descriptor for this CPU.
Read the new buffer's metadata (capacity, write_pos, tail_pos).
Scan events in the new buffer to find the first event with a sequence number greater than the recorded sequence number. Set read_pos to that event's position.
Close the old file descriptor and unmap the old buffer.
Continue draining from the new buffer.

Events MUST NOT be lost during a generation change. KMES copies surviving events from the old buffer into the new buffer before incrementing generation, and sequence numbers are continuous across the swap.

The old buffer's physical pages remain valid for as long as any consumer has them mapped. KMES's internal release of the old buffer does not affect existing consumer mappings -- standard kernel mmap reference counting ensures the pages persist until all consumers have unmapped them. The consumer safely finishes draining the old buffer before switching to the new one.
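The scan that locates the resume point in the new buffer can be sketched as below. It assumes the 29-byte fixed header prefix from the event header layout and walks events from tail_pos to write_pos. The names `find_resume_pos` and `_read` are hypothetical, and `_read` emulates the contiguous read the double mapping provides.

```python
import struct

FIXED = struct.Struct("<IIQQHBH")  # event_size, header_size, timestamp,
                                   # sequence, cpu_id, origin_class, type_len


def _read(data, capacity: int, pos: int, length: int) -> bytes:
    start = pos & (capacity - 1)
    first = min(length, capacity - start)
    return bytes(data[start:start + first]) + bytes(data[:length - first])


def find_resume_pos(data, capacity: int, tail_pos: int,
                    write_pos: int, last_seq: int) -> int:
    """Return the position of the first event with sequence > last_seq."""
    pos = tail_pos
    while pos < write_pos:
        event_size, _hs, _ts, sequence, *_rest = FIXED.unpack(
            _read(data, capacity, pos, FIXED.size))
        if sequence > last_seq:
            return pos
        pos += event_size
    return write_pos  # every surviving event was already processed
```

If the first surviving event already has a sequence number more than one above the recorded value, the difference is an ordinary sequence gap caused by overwriting, not by the swap.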

§5.1.8.4 Buffer swap serialization

The buffer swap MUST be atomic per CPU: no events may be lost or duplicated during the transition from the old buffer to the new buffer on a given CPU.

An implementation MAY achieve this with the following per-CPU algorithm:

Disable preemption on the target CPU.
Copy surviving events from this CPU's old buffer to the new buffer.
Switch the per-CPU buffer pointer from the old buffer to the new buffer.
Increment generation in the old buffer's metadata.
Re-enable preemption.

With preemption disabled, no events can be emitted on this CPU between the copy and the switchover. Events emitted on other CPUs are unaffected -- each CPU's swap is independent.

The generation check adds one u64 read per drain cycle. This cost is negligible relative to the event processing work.

§5.1.9 Overwrite semantics

Each per-CPU ring buffer is circular. When a buffer is full, KMES overwrites the oldest events in that buffer to make space for new events. The write pointer advances unconditionally -- emission never blocks.

Consumers detect overwritten events in two ways:

Lapping detection. If read_pos < tail_pos, events at the consumer's read position have been overwritten. The consumer advances to tail_pos.
Sequence gaps. The consumer tracks the last sequence number it processed from each CPU. A gap in the sequence for a given CPU indicates events were lost -- either overwritten in the ring buffer or dropped due to size limits.

KMES maintains tail_pos per buffer to enable lapping detection. When KMES overwrites events, it advances tail_pos past the overwritten events by reading each event's event_size field. This allows consumers to jump directly to the oldest valid event without scanning.

Advancing tail_pos requires walking overwritten events sequentially, reading each event's event_size to determine the next event boundary. This has variable latency proportional to the number of overwritten events and may involve cache-cold reads (the tail region may be megabytes away from the current write position). This cost is accepted as a tradeoff -- the alternative (maintaining a secondary index of event offsets) would add per-event overhead on the write path to optimize the uncommon case where the write pointer overtakes surviving events.
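On the consumer side, sequence-gap tracking needs only the last sequence number seen per CPU. The sketch below assumes sequence numbers advance by exactly one per emitted event on a given CPU, as the write protocol describes, and treats the first event observed on each CPU as a baseline. The class name `GapTracker` is hypothetical.

```python
class GapTracker:
    """Track per-CPU sequence numbers and count events known to be missing."""

    def __init__(self):
        self.last_seq = {}  # cpu_id -> last sequence number processed
        self.lost = {}      # cpu_id -> total events detected as missing

    def observe(self, cpu_id: int, sequence: int) -> int:
        """Record one event; return how many events were skipped before it."""
        last = self.last_seq.get(cpu_id)
        gap = 0 if last is None else sequence - last - 1
        self.last_seq[cpu_id] = sequence
        self.lost[cpu_id] = self.lost.get(cpu_id, 0) + gap
        return gap
```

A gap does not say why the events are missing: overwritten events and events dropped for exceeding the size limit look the same to the consumer.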

§5.1.10 Memory ordering summary

Operation	Barrier	Purpose
KMES stores `tail_pos`	release	Consumers see the advanced tail before they see new data at old positions.
KMES stores `write_pos`	release	Consumers see complete event data before they see the advanced write position.
KMES stores `futex_counter`	release	Consumers waking from futex see all prior writes.
Consumer stores `need_wake = 1`	release	KMES sees the flag before the consumer enters futex_wait.
Consumer stores `need_wake = 0`	relaxed	Spurious futex_wake from stale read is harmless.
Consumer loads `write_pos` (after setting `need_wake`)	acquire	Closes the race window: if KMES wrote before seeing `need_wake`, the consumer sees the write.
Consumer loads `write_pos` (drain loop)	acquire	Pairs with KMES release on `write_pos`.
Consumer loads `tail_pos`	acquire	Pairs with KMES release on `tail_pos`.

In the per-CPU design, the producer (KMES on CPU N) and the consumer (a userspace thread, potentially on a different CPU) are the only two parties accessing a given buffer's metadata. There is no multi-producer contention. The memory barriers ensure correct visibility between the single producer and its consumers.

On x86-64, stores are not reordered with other stores, so the producer-side release barriers require no fence instruction in practice; they still prevent the compiler from reordering the stores. The specification mandates them for architectural correctness on all platforms.

Section

6 Configuration

§6.1 6 Configuration

Self-Configuration

KMES reads its operational parameters from the registry under Machine\System\KMES\. Compiled-in defaults are used at boot. When LCS becomes available, KMES reads the configuration keys, validates them, and applies valid values. A persistent watch on the configuration subtree ensures ongoing changes are picked up for the lifetime of operation.

§6.1.1 Configuration keys

All keys live under Machine\System\KMES\. Each has a defined type, compiled-in default, and valid range. KMES ignores unknown keys in this subtree.

Key	Type	Default	Valid range	Description
BufferCapacity	REG_QWORD	4194304	65536--268435456	Per-CPU ring buffer capacity in bytes. MUST be a power of two. Values that are not powers of two are treated as invalid. Default is 4 MB. Maximum is 256 MB.
MaxEventSize	REG_DWORD	65536	1024--4194304	Maximum permitted total event size (header + payload) in bytes for events emitted via the `kmes_emit` and `kmes_emit_batch` syscalls. Does not apply to kernel emitters, which are subject only to the 50% structural limit. Default is 64 KB. Maximum is 4 MB.
MaxNestingDepth	REG_DWORD	32	4--256	Maximum permitted msgpack nesting depth for payloads emitted via the `kmes_emit` and `kmes_emit_batch` syscalls. Payloads exceeding this depth are rejected. Does not apply to kernel emitters.

§6.1.2 Validation

When KMES reads a configuration value, it validates against the defined type, range, and constraints:

Valid value: Applied to the in-memory configuration. For MaxEventSize and MaxNestingDepth, the new value takes effect for subsequent syscalls. For BufferCapacity, KMES triggers a ring buffer swap -- creating new per-CPU ring buffers at the configured size, copying surviving events from the old buffers, incrementing the generation counter, and discarding the old buffers. The swap protocol is defined in the Ring Buffer section.
Invalid value (out of range, wrong type, not a power of two for BufferCapacity, missing): Ignored. KMES retains the previously active value (compiled-in default or last known-good). An event is emitted via KMES itself identifying the key, the invalid value, and the value being retained.

Values are never clamped or silently corrected. The write to the registry succeeds (the source does not enforce kernel semantics), but KMES refuses to use it. The registry shows what was written; the event log shows what KMES is actually using.
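The validation rule can be sketched as a pure function over the key table in §6.1.1. This is an illustration of the accept-or-retain behaviour only; the names `KEYS`, `is_valid` and `effective` are hypothetical, and the event KMES emits on rejection is not modelled.

```python
# key -> (registry type, minimum, maximum, must be a power of two)
KEYS = {
    "BufferCapacity": ("REG_QWORD", 65536, 268435456, True),
    "MaxEventSize": ("REG_DWORD", 1024, 4194304, False),
    "MaxNestingDepth": ("REG_DWORD", 4, 256, False),
}


def is_valid(key: str, reg_type: str, value) -> bool:
    """True if KMES would apply the value, False if it would ignore it."""
    spec = KEYS.get(key)
    if spec is None:
        return False                       # unknown keys are ignored
    want_type, lo, hi, power_of_two = spec
    if reg_type != want_type:
        return False                       # wrong registry type
    if not isinstance(value, int) or isinstance(value, bool):
        return False
    if not lo <= value <= hi:
        return False                       # out of range: never clamped
    if power_of_two and value & (value - 1):
        return False
    return True


def effective(active, key: str, reg_type: str, value):
    """Return the value KMES uses: the new one if valid, else the active one."""
    return value if is_valid(key, reg_type, value) else active
```

For example, `effective(4194304, "BufferCapacity", "REG_QWORD", 5000000)` returns 4194304: the out-of-spec write is visible in the registry, but the previously active capacity stays in force.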

§6.1.3 Bootstrap sequence

PKM loads. KMES initialises with compiled-in defaults. Per-CPU boot buffers are created at a compiled-in size (not configurable via the registry). Events begin flowing immediately.
LCS becomes available (first source registers). KMES reads all keys under Machine\System\KMES\.
If keys exist and are valid, KMES applies them. If BufferCapacity differs from the compiled-in default, KMES creates the consumer-facing ring buffers at the configured size and copies boot buffer events into them. If BufferCapacity matches the default (or the key does not exist), KMES creates the ring buffers at the default size.
KMES arms a persistent subtree watch on Machine\System\KMES\ via LCS's internal watch mechanism. This is a kernel-internal registration, not a userspace fd-based watch.
If Machine\System\KMES\ does not exist (first boot, empty database), KMES arms a watch on a parent key to detect when the subtree is created. When the key appears, KMES reads and validates its contents and re-arms a targeted watch.
On subsequent changes (administrator modification, Group Policy push at a higher-precedence layer), the watch fires, KMES re-reads the changed key, validates, and applies or rejects.

At no point does KMES enter a "waiting for configuration" state. Compiled-in defaults are always sufficient for operation.

§6.1.4 Security

KMES configuration keys live under Machine\System\KMES\, which inherits the Machine hive root SD (SYSTEM and Administrators: KEY_ALL_ACCESS, Authenticated Users: KEY_READ). Unprivileged processes cannot modify operational parameters.

Domain policy enforcement via Group Policy at a higher-precedence layer provides defence against compromised local administrators -- SeTcbPrivilege is required for layer creation at precedence > 0.

§6.1.5 Boot buffer size

The boot buffer size is a compiled-in constant, not configurable via the registry. The boot buffer exists only during the window between PKM load and ring buffer creation. Making it configurable would require a mechanism to deliver the value to the kernel before LCS is available, which adds complexity for negligible benefit. The compiled-in size is chosen to be large enough to hold all events generated during a typical boot sequence without loss.

Section

7 Failure modes

§7.1 7 Failure modes

Failure Modes

KMES is a kernel subsystem with no external trust boundary on the write path. Kernel emitters are trusted; userspace emitters are validated at the syscall boundary. Failure semantics are simpler than subsystems like LCS that span kernel-userspace trust boundaries, but MUST still be explicit.

§7.1.1 Ring buffer overrun

When events are emitted faster than consumers drain them, the ring buffer fills and KMES overwrites the oldest events to make space.

The write path is never blocked. Emission never fails due to buffer pressure -- buffer-full conditions are handled by overwriting, not blocking.
Consumers detect lost events as gaps in the per-CPU sequence number.
Consumers whose read position has been overwritten are advanced to tail_pos (the oldest surviving event).
KMES maintains an internal per-CPU dropped-event counter. This counter is not exposed in the ring buffer metadata in v0.20 but MAY be exposed in a future version.

Overrun is a normal operating condition under heavy load, not an error. The system degrades gracefully: recent events are preserved, old events are lost, consumers are aware of the loss.

§7.1.2 Event drop

An event is dropped without being written to the ring buffer when:

Structural limit exceeded. The event exceeds 50% of the per-CPU ring buffer capacity. Applies to both kernel and syscall emitters.
Policy limit exceeded (syscall only). The event exceeds MaxEventSize.
Validation failure (syscall only). The payload is not valid msgpack or exceeds MaxNestingDepth.

For kernel emitters, the per-CPU sequence number advances even when an event is dropped, making the drop visible to consumers as a gap in the sequence. The emitting subsystem is not notified, as the emission API is fire-and-forget.

For syscall emitters, validation failures occur before the ring buffer write phase, so no sequence number is consumed and no gap is visible to consumers. The drop is visible only to the caller via the syscall error return.

§7.1.3 Consumer crash

If a consumer process (e.g., eventd) crashes:

The ring buffers themselves remain valid in kernel memory. The kernel tears down the crashed process's mappings and closes its file descriptors as part of normal process exit cleanup.
KMES is unaffected. It continues writing events to the per-CPU ring buffers regardless of whether any consumers are attached.
Events emitted while no consumer is attached accumulate in the ring buffers. If the buffers fill, oldest events are overwritten.
When a consumer restarts and re-attaches (calls kmes_attach and mmaps the buffers), it sees all surviving events. Events overwritten during the outage are visible as a sequence gap starting from whatever sequence number the consumer last processed.

KMES has no dependency on consumers. A system with no consumers attached operates identically to a system with consumers -- events are emitted, stamped, buffered, and eventually overwritten.

§7.1.4 Buffer swap failure

When KMES attempts to create new ring buffers (due to a BufferCapacity configuration change or the boot-to-registry transition), memory allocation may fail.

If allocation fails, KMES retains the existing ring buffers at their current size. The configuration change is not applied.
An event is emitted via KMES itself recording the allocation failure and the retained buffer size.
The generation counter is not incremented. Consumers are unaffected.
KMES does not retry automatically. A subsequent configuration write (or system reboot) triggers another attempt.

§7.1.5 LCS unavailable

If LCS never becomes available (no source registers), KMES operates indefinitely with compiled-in defaults. The ring buffers are created at the default BufferCapacity. The self-configuration watch is never armed because there is no registry to watch.

This is not a failure -- it is a valid operating mode. KMES has no hard dependency on LCS. The only consequence is that operational parameters cannot be tuned.

§7.1.6 Clock discontinuity

KMES timestamps use CLOCK_REALTIME (wall clock). NTP adjustments can cause the clock to jump forward or backward. When this occurs:

Events emitted after a backward jump have timestamps earlier than events emitted before the jump. Consumers that sort by timestamp will see an apparent reordering.
Per-CPU sequence numbers are unaffected (they are monotonic counters, not derived from the clock). Sequence numbers remain the reliable ordering primitive within a single CPU.
KMES does not detect or compensate for clock discontinuities. Consumers that require monotonic ordering within a CPU SHOULD use the sequence number, not the timestamp.

Cross-CPU ordering during a clock discontinuity is best-effort. Events from different CPUs near a clock jump may have misleading relative timestamps. This is an inherent limitation of wall clock timestamps and is accepted as a trade-off for human-readable, cross-boot-comparable timestamps.

§7.1.7 CPU hotplug

CPU hotplug (adding or removing CPUs at runtime) is not supported in v0.20. The number of per-CPU ring buffers is fixed at KMES initialisation time based on the number of online CPUs when PKM loads. If a CPU is brought online after KMES initialisation, events emitted on that CPU are dropped.

A future version MAY support dynamic per-CPU buffer creation for hotplugged CPUs.

§7.1.8 Memory bounding

KMES kernel memory usage is bounded by:

Per-CPU ring buffers: num_cpus × BufferCapacity. For example, a 4-CPU system at the default 4 MB capacity uses 16 MB. The ceiling is num_cpus × 256 MB, and BufferCapacity is configurable only by administrators.
Per-CPU boot buffers: num_cpus × boot_buffer_size. Compiled-in constant. Freed after ring buffers are created.
Event construction: Temporary allocations during event construction are bounded by the maximum event size and freed immediately after the event is written to the ring buffer.
Consumer file descriptors: Each kmes_attach call creates num_cpus file descriptors. Bounded by RLIMIT_NOFILE and the SeSecurityPrivilege requirement.

No KMES-specific global memory cap is required. The BufferCapacity configuration and standard Linux resource limits provide sufficient protection.
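The ring buffer bound is simple enough to check by hand, but a helper makes it easy to size a specific machine. The sketch below covers only the num_cpus × BufferCapacity term; metadata pages, boot buffers and transient allocations are excluded. The function name is hypothetical.

```python
MIB = 1024 * 1024


def ring_buffer_bytes(num_cpus: int, buffer_capacity: int = 4 * MIB) -> int:
    """Physical memory held by the per-CPU ring buffer data regions."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    return num_cpus * buffer_capacity
```

Four CPUs at the default capacity give 16 MiB. A 64-CPU system at the 256 MB maximum would hold 16 GiB, which is why the setting is restricted to administrators.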

Section

8 Appendix a

§8.1 8 Appendix a

Constants

All numeric constants used in the KMES interface. An independent implementer can derive all magic numbers from this page.

§8.1.1 Syscall numbers

Syscall	Number	Description
kmes_emit	1090	Emit a single event from userspace.
kmes_attach	1091	Attach as a consumer and obtain per-CPU ring buffer file descriptors.
kmes_emit_batch	1092	Emit multiple events from userspace as a single operation. Maximum 256 events per call.

§8.1.2 Origin class values

Value	Origin
0	Userspace (syscall)
1	KMES
2	KACS
3	LCS

Values 4--255 are reserved for future kernel subsystems.

§8.1.3 Event header layout

Packed, no padding. All multi-byte integers little-endian.

Offset	Size	Type	Field
0	4	`u32`	`event_size`
4	4	`u32`	`header_size`
8	8	`u64`	`timestamp`
16	8	`u64`	`sequence`
24	2	`u16`	`cpu_id`
26	1	`u8`	`origin_class`
27	2	`u16`	`type_len`
29	var	`[u8]`	`type`

Minimum header size: 29 + type_len bytes. Actual header_size MAY be larger (reserved space for future identity stamp fields). Payload begins at header_size from event start.
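The layout above maps directly onto a fixed-size unpack followed by two slices. The sketch below is a minimal consumer-side decoder, assuming the caller has a contiguous view of the event; the type field is returned as raw bytes because its string encoding is defined elsewhere in the specification. The names `Event` and `parse_event` are hypothetical.

```python
import struct
from typing import NamedTuple

FIXED = struct.Struct("<IIQQHBH")  # little-endian, packed: 29 bytes


class Event(NamedTuple):
    event_size: int
    header_size: int
    timestamp: int
    sequence: int
    cpu_id: int
    origin_class: int
    type: bytes
    payload: bytes


def parse_event(buf, offset: int = 0) -> Event:
    """Decode one event starting at `offset` in a contiguous buffer."""
    (event_size, header_size, timestamp, sequence,
     cpu_id, origin_class, type_len) = FIXED.unpack_from(buf, offset)
    if header_size < FIXED.size + type_len or event_size < header_size:
        raise ValueError("inconsistent header sizes")
    if len(buf) - offset < event_size:
        raise ValueError("buffer shorter than event_size")
    type_start = offset + FIXED.size
    event_type = bytes(buf[type_start:type_start + type_len])
    # The payload starts at header_size, not at the end of the type field:
    # any reserved header space in between is skipped.
    payload = bytes(buf[offset + header_size:offset + event_size])
    return Event(event_size, header_size, timestamp, sequence,
                 cpu_id, origin_class, event_type, payload)
```

Locating the payload through header_size rather than through type_len is what keeps a decoder like this working if a later version adds identity stamp fields to the header.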

§8.1.4 Ring buffer metadata page layout

One metadata page (4096 bytes) per CPU. Cache-line-aligned fields.

§8.1.4.1 Cache line 0 -- static fields (bytes 0--63)

Offset	Size	Type	Field
0	8	`[u8; 8]`	`magic`
8	4	`u32`	`version`
12	2	`u16`	`cpu_id`
14	2	`u16`	`reserved0`
16	8	`u64`	`capacity`
24	8	`u64`	`data_offset`
32	8	`u64`	`generation`
40	24	--	`reserved1`

§8.1.4.2 Cache line 1 -- producer fields (bytes 64--127)

Offset	Size	Type	Field
64	8	`u64`	`write_pos`
72	8	`u64`	`tail_pos`
80	48	--	`reserved2`

§8.1.4.3 Cache line 2 -- notification fields (bytes 128--191)

Offset	Size	Type	Field
128	4	`u32`	`futex_counter`
132	1	`u8`	`need_wake`
133	59	--	`reserved3`

§8.1.5 Ring buffer magic

0x4B 0x4D 0x45 0x53 0x52 0x49 0x4E 0x47
 K    M    E    S    R    I    N    G

Compared byte-by-byte, not as an integer. Endianness-independent.

§8.1.6 Ring buffer version

v0.20 uses ring buffer format version 1.

§8.1.7 Mapped region layout

Per-CPU mapping returned by mmap() on a kmes_attach file descriptor:

Offset	Size	Description
0	4096	Metadata page
4096	2 × capacity	Double-mapped data region

Total mapping size: 4096 + (2 × capacity) bytes.

§8.1.8 Syscall error codes

§8.1.8.1 kmes_emit errors

Errno	Condition
EPERM	Caller does not hold SeAuditPrivilege.
EINVAL	Event type length is zero, or payload is invalid msgpack, or payload nesting depth exceeds MaxNestingDepth.
EFAULT	Event type or payload pointer is inaccessible.
ENOSPC	Event exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation for staging buffer failed.

§8.1.8.2 kmes_emit_batch errors

Errno	Condition
EPERM	Caller does not hold SeAuditPrivilege.
EINVAL	Count is 0 or exceeds 256, or failing entry has zero-length event type, or failing entry's payload is invalid msgpack or exceeds MaxNestingDepth.
EFAULT	Entry array, event type, or payload pointer is inaccessible.
ENOSPC	Failing entry exceeds MaxEventSize or 50% of per-CPU ring buffer capacity.
ENOMEM	Kernel memory allocation failed.

§8.1.8.3 kmes_attach errors

Errno	Condition
EPERM	Caller does not hold SeSecurityPrivilege.
ERANGE	Provided buffer is too small. `*count` set to required number.
EFAULT	`fds` or `count` pointer is inaccessible.
ENOMEM	Kernel memory allocation failed.

§8.1.9 Configuration keys

Registry path: Machine\System\KMES\

Key	Type	Default	Valid range
BufferCapacity	REG_QWORD	4194304 (4 MB)	65536--268435456 (64 KB--256 MB), power of two
MaxEventSize	REG_DWORD	65536 (64 KB)	1024--4194304 (1 KB--4 MB)
MaxNestingDepth	REG_DWORD	32	4--256

§8.1.10 Privilege requirements

Operation	Required privilege
Emit event from userspace (`kmes_emit`, `kmes_emit_batch`)	SeAuditPrivilege
Attach as consumer (`kmes_attach`)	SeSecurityPrivilege

Section

9 Appendix b

§9.1 9 Appendix b

Recommended Implementation Optimisations

The following optimisations are not normative. They do not affect the ring buffer format, the event header layout, or the consumer protocol. An implementation that omits all of them is fully conformant. However, each one provides measurable throughput or latency improvement with no behavioural trade-offs, and implementers are encouraged to adopt them.

§9.1.1 Timestamp capture

Timestamp capture (CLOCK_REALTIME) is the single most expensive per-event operation on the write path, at approximately 15--25 ns per call. Implementations SHOULD use ktime_get_real_fast_ns() -- the kernel-internal fast path that avoids the full timekeeper seqlock dance. On architectures with an invariant TSC (Time Stamp Counter), this reduces to an rdtsc instruction plus a multiply and add. The trade-off is that in the rare case where a timer interrupt is updating the timekeeper concurrently, the timestamp may be off by one tick. For nanosecond-precision event timestamps, this is acceptable.

§9.1.2 Hugepages

The ring buffer data region benefits from 2 MB hugepages rather than 4 KB standard pages. A 4 MB ring buffer requires 1024 standard pages but only 2 hugepages. Fewer pages means fewer TLB (Translation Lookaside Buffer -- the CPU's cache of virtual-to-physical address mappings) entries are needed to cover the buffer. TLB misses during event reads and writes add ~10-30 ns each, and a large buffer with standard pages can cause frequent misses during sequential traversal.

The double virtual mapping doubles the virtual address range, so the TLB benefit of hugepages is even more pronounced: 4 MB of physical memory mapped as 8 MB of virtual space needs 4 hugepage TLB entries versus 2048 standard-page entries.

Hugepages are transparent to consumers -- the mmap'd region behaves identically regardless of the underlying page size.
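The page counts quoted in this section follow from a single ceiling division. The sketch below reproduces them; the function name is hypothetical, and the result counts page-sized translations for the data region only, not the metadata page.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_entries(capacity: int, page_size: int, double_mapped: bool = True) -> int:
    """Translations needed to cover the data region at a given page size."""
    virtual = capacity * (2 if double_mapped else 1)
    return -(-virtual // page_size)  # ceiling division
```

For a 4 MB buffer, `tlb_entries(4 * MIB, 4 * KIB)` is 2048 and `tlb_entries(4 * MIB, 2 * MIB)` is 4.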

§9.1.3 NUMA-local allocation

On NUMA (Non-Uniform Memory Access) systems, physical memory is divided into nodes. Each CPU has a local node with fast access (~70 ns) and remote nodes with slower access (~100-150 ns). Ring buffer pages SHOULD be allocated on the same NUMA node as the CPU that writes to them.

Since each ring buffer is written by exactly one CPU, NUMA-local allocation ensures all producer writes are fast. Consumer reads may cross NUMA boundaries (the consumer thread may run on a different node), but under the per-CPU design, the consumer thread can be affinity-bound to the same node as a secondary optimisation.

§9.1.4 Precomputed header templates

Several event header fields are constant for a given CPU: cpu_id and the header structure bytes (header_size, field offsets). A per-CPU header template can be precomputed at initialisation time. At emit time, KMES copies the template and fills in only the variable fields (event_size, timestamp, sequence, origin_class, type_len, type). This reduces per-event header construction to a small memcpy plus a few stores.

For kernel emitters with a fixed origin class, the template can include the origin class as well, reducing the per-event work further.

§9.1.5 Software prefetch

After reading an event's event_size field, the consumer knows where the next event starts. Issuing a software prefetch instruction for the next event's header address (e.g., __builtin_prefetch in C, prefetch intrinsic in Rust) allows the CPU to begin fetching the next event's cache lines while the current event is being processed. This hides memory latency during sequential buffer traversal.

This is most effective when event processing involves non-trivial work (msgpack decoding, SQLite insertion) that gives the prefetch time to complete.

§9.1.6 Msgpack validation with SIMD

For the kmes_emit syscall path, msgpack payload validation can be accelerated using SIMD (Single Instruction, Multiple Data) instructions. The initial type-byte scan -- determining whether each byte is a fixint, a container header, or a data byte -- is amenable to vectorised classification using SSE4.2 or AVX2 byte-shuffle instructions. This reduces validation overhead for large payloads.

This optimisation is only relevant to the syscall path. Kernel emitters bypass payload validation entirely.
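As a deliberately reduced illustration of vectorised byte classification, the sketch below handles only one class: msgpack positive fixints (0x00-0x7f), which are identified by a clear top bit. It uses the SSE2 move-mask instruction rather than the shuffle-based table lookup described above, and falls back to scalar code where SSE2 is unavailable. The function names are hypothetical, and a full validator would still need to track data bytes that follow str, bin, and ext headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Bit i of the result is set when chunk[i] has its top bit set, i.e.
 * when that byte is NOT a msgpack positive fixint. Reads 16 bytes. */
static unsigned mp_high_bits16(const uint8_t *chunk)
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)chunk);
    return (unsigned)_mm_movemask_epi8(v);
#else
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(chunk[i] >> 7) << i;
    return mask;
#endif
}

/* Length of the leading run of positive fixints in p[0..len). */
static size_t mp_fixint_run(const uint8_t *p, size_t len)
{
    size_t i = 0;

    assert(p != NULL || len == 0);
    while (len - i >= 16) {              /* 16 bytes per step */
        unsigned mask = mp_high_bits16(p + i);
        if (mask != 0)                   /* first non-fixint in this chunk */
            return i + (size_t)__builtin_ctz(mask);
        i += 16;
    }
    while (i < len && p[i] < 0x80)       /* scalar tail */
        i++;
    return i;
}
```

When the validator is positioned at an element boundary inside an array with N elements remaining, a run of k positive fixints lets it consume min(k, N) elements in one step, since each positive fixint is a complete one-byte element.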

§9.1.7 Per-CPU staging buffer

The kmes_emit and kmes_emit_batch syscalls copy event data from userspace into a kernel buffer before validation and ring buffer write. A per-CPU pre-allocated staging buffer (e.g., one page / 4 KB) eliminates dynamic allocation from the common-case syscall path. Events exceeding the pre-allocated size fall back to kmalloc.

For batch emission, the staging buffer can be reused sequentially: copy entry 0 into the staging buffer, validate it, and write it to the ring buffer; then copy entry 1 into the same buffer, and so on. If entries must be held simultaneously (that is, entry 0 has not yet been written to the ring buffer when entry 1 is copied), a second buffer is allocated instead. The goal is to avoid 256 separate kmalloc calls for a full batch.
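The acquire-with-fallback pattern can be modelled in userspace as below. This is a sketch under stated assumptions: malloc and free stand in for kmalloc and kfree, a single struct stands in for the per-CPU instance (a kernel implementation would also need to keep the task on its CPU while the buffer is held), and staging_acquire and staging_release are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define STAGING_SIZE 4096u  /* one page, matching the example size above */

struct staging {
    uint8_t buf[STAGING_SIZE];
    int     in_use;  /* non-zero while buf holds an event not yet written */
};

/* Returns a buffer of at least `size` bytes, or NULL if the fallback
 * allocation fails. The pre-allocated buffer is used when it is free and
 * large enough; otherwise the heap is used. */
static uint8_t *staging_acquire(struct staging *st, size_t size)
{
    if (size <= STAGING_SIZE && !st->in_use) {
        st->in_use = 1;
        return st->buf;
    }
    return malloc(size);  /* stand-in for kmalloc in this sketch */
}

/* Releases a buffer obtained from staging_acquire. */
static void staging_release(struct staging *st, uint8_t *p)
{
    if (p == st->buf) {
        assert(st->in_use);
        st->in_use = 0;
    } else {
        free(p);          /* stand-in for kfree */
    }
}
```

In a batch loop that writes each entry to the ring buffer before copying the next, every acquire after a release hits the pre-allocated buffer, so no allocation occurs. The heap path is taken only for oversized events or when a previous entry is still being held.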

§9.1.8 Consumer thread affinity

Consumer threads that drain per-CPU ring buffers benefit from being pinned to CPUs on the same NUMA node as the buffer they read. While not strictly necessary (the per-CPU design eliminates write contention regardless of consumer placement), NUMA-local reads avoid cross-node memory traffic during the drain loop.

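For eventd specifically, pinning each drain goroutine's underlying OS thread to the same NUMA node as its buffer is a simple configuration that reduces read latency. The goroutine must first be locked to its OS thread (runtime.LockOSThread); otherwise the Go scheduler may move it to a different, unpinned thread.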
