psd-008 — Specification

eventd

Event daemon — unified observability sink and query engine for events, logs, and metrics.

v0.23 Draft 2026-04-25

Section

1 Introduction

§1.1 Scope

This specification defines eventd, the observability daemon for the Peios operating system. eventd is one of the five Peios system daemons managed by peinit. It is the sole persistent sink for observability data in Peios -- all events, logs, and metrics flow through eventd for storage, indexing, and querying.

eventd handles three distinct data types, each with its own ingestion path, storage engine, and query semantics:

Events -- structured records emitted through KMES (Kernel Mediated Event Subsystem). eventd consumes events from KMES per-CPU ring buffers and persists them with full header metadata including identity stamps. Events are the primary audit and security telemetry.
Logs -- unstructured or semi-structured text output from services and system components. eventd ingests logs through a dedicated mechanism independent of KMES.
Metrics -- numeric time-series data representing system and service measurements. eventd ingests metrics through a dedicated mechanism independent of KMES.

This specification covers:

The event ingestion pipeline -- KMES ring buffer consumption, event processing, and persistence
The log ingestion pipeline -- transport mechanism, log record format, and persistence
The metric ingestion pipeline -- transport mechanism, metric data model, and persistence
Storage -- the storage engine for each data type, retention policies, and lifecycle management
Querying -- the interface through which other daemons and tools retrieve stored observability data
Access control -- Security Descriptor-based authorization for reading and writing observability data
Configuration -- registry-based operational parameters
Startup and shutdown -- bootstrap sequence, crash recovery, and graceful termination
Failure modes -- behavior under resource exhaustion, KMES overrun, storage failure, and daemon restart

This specification does not cover:

KMES (covered by PSD-003)
Event type schemas or naming conventions for kernel-emitted events (defined by the emitting subsystem's specification: PSD-004 for KACS events, PSD-005 for LCS events)
KACS (covered by PSD-004)
LCS (covered by PSD-005)
peinit service management (covered by PSD-007)
Authentication or principal management (authd)
Log collection from remote hosts (future scope)
Metric collection or scraping from external sources (future scope)

§1.2 Terminology

Terms defined in PSD-003 (KMES, event, header, payload, stamp, ring buffer, boot buffer, consumer, origin class, event type, sequence number) are used here with the same meaning and are not redefined.

Terms defined in PSD-004 (token, GUID, SID, Security Descriptor, ACL, ACE, privilege, SeAuditPrivilege, SeSecurityPrivilege) are used here with the same meaning and are not redefined.

Terms defined in PSD-005 (registry, hive, key, value, layer) are used here with the same meaning and are not redefined.

The following terms are specific to this specification.

eventd: The Peios observability daemon. A userspace process managed by peinit that consumes events from KMES, ingests logs and metrics from service processes, and provides persistent storage and querying for all three data types.

Event store: The persistent storage engine for KMES events. Receives structured event records with full header metadata (timestamps, sequence numbers, identity GUIDs) and makes them queryable.

Log store: The persistent storage engine for log records. Receives log entries from service processes and makes them queryable.

Metric store: The persistent storage engine for metric data. Receives numeric time-series measurements and makes them queryable.

Log record: A single log entry as stored by eventd. Contains the log message, a severity level, a timestamp, and metadata identifying the emitting service.

Boot ID: A GUID assigned by peinit at each boot, used by eventd to partition data across boots. Events, logs, and metrics from different boots are never interleaved in storage.

Retention policy: A set of rules governing how long stored data is kept before deletion. Retention policies are configured per data type via the registry and enforced by eventd.

§1.3 Conventions

§1.3.1 Normative keywords

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY in this specification are to be interpreted as described in RFC 2119.

§1.3.2 Section references

Section references within this specification use the § addressing scheme defined in PSD-001. References to other PSDs use the PSD-NNN §x.y.z(n) citation format.

§1.3.3 Byte order

All multi-byte integers in wire formats and storage formats defined by this specification are little-endian, consistent with PSD-003.

§1.3.4 String encoding

All strings in wire formats and storage formats defined by this specification are UTF-8 encoded.

§1.3.5 Payload encoding

Structured data in the query interface and internal storage formats uses MessagePack (msgpack) as defined by the MessagePack specification, consistent with PSD-003.

§1.4 Prior Art

eventd is not a port or reimplementation of any single existing system. Its design draws on several established approaches while making different trade-offs.

§1.4.1 Windows Event Log / ETW

Windows provides Event Tracing for Windows (ETW) for kernel and application event delivery, and the Windows Event Log service for persistence and querying. KMES (PSD-003) fills the ETW role; eventd fills the Event Log service role. Key similarities:

Kernel-mediated event delivery with trusted metadata (ETW providers / KMES emitters)
A userspace service responsible for persistence (Event Log service / eventd)
Channel-based access control using Security Descriptors (Windows Event Log channels / eventd event access control)

Key differences:

eventd unifies events, logs, and metrics in a single daemon. Windows separates these across Event Log, ETL trace files, and Performance Counters.
eventd uses SQLite for persistent storage. Windows Event Log uses a proprietary binary format (EVTX).
KMES uses shared memory ring buffers with lock-free protocols. ETW uses kernel-managed trace sessions with different buffering semantics.

§1.4.2 journald (systemd)

journald is the system journal daemon in systemd-based Linux distributions. It captures structured log messages from services (via stdout/stderr and the journal socket) and stores them in a binary journal format.

Key similarities:

Captures service stdout/stderr as structured log records with metadata
Single daemon for system-wide log ingestion
Binary storage format with indexing

Key differences:

eventd handles events and metrics in addition to logs. journald is log-only (metrics require a separate stack like Prometheus/node_exporter).
eventd enforces access control via KACS Security Descriptors. journald relies on Unix file permissions and ACLs on the journal files, typically granted through membership of the systemd-journal group.
eventd receives kernel events through KMES ring buffers. journald receives kernel messages through /dev/kmsg.

§1.4.3 Prometheus / OpenTelemetry

Prometheus is a pull-based metrics system. OpenTelemetry defines a vendor-neutral telemetry collection framework spanning traces, metrics, and logs.

eventd's metric store serves a similar role to a local Prometheus TSDB, but:

eventd is push-based (services push metrics to eventd), not pull-based (Prometheus scrapes endpoints).
eventd does not implement distributed tracing. Traces are out of scope.
eventd integrates metrics with events and logs under a single access control model and query interface, rather than requiring separate systems.

§1.4.4 Features handled by other subsystems

Feature	Subsystem
Event emission, buffering, and delivery	KMES (PSD-003)
Event type schemas for KACS events	KACS (PSD-004)
Event type schemas for LCS events	LCS (PSD-005)
Daemon lifecycle management	peinit (PSD-007)
Access control primitives (tokens, SDs, access checks)	KACS (PSD-004)
Registry storage and configuration	LCS (PSD-005) / loregd (PSD-006)

Section

2 Event ingestion

§2.1 Overview

eventd is the primary consumer of KMES ring buffers. It reads events from the per-CPU shared memory ring buffers provided by KMES, processes them, and writes them to persistent storage.

The event ingestion pipeline has four stages:

Drain -- one thread per CPU reads events from the CPU's ring buffer using the lock-free read protocol defined in PSD-003 §5.1.
Gap detection -- the drain thread compares each event's sequence number against the last seen sequence for that CPU. Gaps indicate lost events and are recorded as synthetic gap records.
Handoff -- the drain thread passes processed events to the writer thread responsible for that CPU's storage shard.
Write -- the writer thread batches events and commits them to the shard's SQLite database. Batch sizing is adaptive, balancing throughput against power-loss resilience.

The pipeline is designed around two principles:

KMES ring buffers are the only buffer. eventd does not maintain a large intermediate buffer between KMES and SQLite. Events move from the ring buffer through a small, bounded handoff directly into a database transaction. The ring buffer absorbs events that arrive during SQLite commits.
Linear scaling through sharding. eventd distributes write work across multiple independent SQLite databases. Each database has its own writer thread and its own WAL. Shards share no write-path state, so throughput scales linearly with shard count.

§2.2 KMES Consumption

§2.2.1 Attachment

On startup, eventd MUST discover the CPU count and attach to each per-CPU ring buffer by calling kmes_attach(cpu_id) (PSD-003 §4.1) with incrementing cpu_id values starting from 0 until EINVAL is returned. Each call returns a single file descriptor for that CPU's ring buffer. eventd MUST map each file descriptor to obtain the per-CPU ring buffer. The caller's effective token MUST hold SeSecurityPrivilege.

eventd MUST read the capacity value returned by kmes_attach and use it to compute the mapping size (8192 + 2 * capacity).
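
ⓘ Informative

The attachment loop and the mapping-size rule can be sketched as follows. This is a minimal illustration, not the eventd implementation: the `attach` callable is a hypothetical stand-in for kmes_attach that returns an (fd, capacity) pair and raises OSError with EINVAL once cpu_id runs past the last CPU. Only the formula 8192 + 2 * capacity comes from this section.

```python
import errno

FIXED_BYTES = 8192  # constant term of the mapping size given in this section


def mapping_size(capacity: int) -> int:
    """Mapping size for one per-CPU ring buffer: 8192 + 2 * capacity."""
    return FIXED_BYTES + 2 * capacity


def discover_rings(attach):
    """Attach to CPU 0, 1, 2, ... until the attach call fails with EINVAL.

    `attach` is a hypothetical stand-in for kmes_attach: it takes a cpu_id and
    returns (fd, capacity), or raises OSError(errno.EINVAL) past the last CPU.
    """
    rings = []
    cpu_id = 0
    while True:
        try:
            fd, capacity = attach(cpu_id)
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                break  # no such CPU: discovery is complete
            raise  # any other error is a genuine failure
        rings.append({"cpu_id": cpu_id, "fd": fd, "map_len": mapping_size(capacity)})
        cpu_id += 1
    return rings
```

The number of entries returned is the CPU count that later sections use for drain-thread creation and for StorageShards = 0.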

§2.2.2 Drain threads

eventd MUST create one drain thread per CPU. Each drain thread is responsible for reading events from exactly one per-CPU ring buffer. A drain thread MUST NOT read from more than one ring buffer.

Each drain thread MUST follow the read protocol defined in PSD-003 §5.1, including:

Initialising read_pos to tail_pos on first attachment, starting from the oldest surviving event.
Following the drain loop: load write_pos with acquire, check for lapping via tail_pos, validate event structural integrity, advance read_pos by event_size.
Performing the torn read check (re-reading tail_pos after reading an event to detect concurrent overwrite).
Using the notification wait protocol (need_wake, futex_wait) when no events are available, to avoid spinning.

§2.2.3 Event copying

When a drain thread reads an event from the ring buffer, it MUST copy the event data (header and payload) into process-local memory before advancing read_pos. The drain thread MUST NOT pass pointers into the mapped ring buffer region to the writer thread, as the ring buffer memory may be overwritten by KMES at any time after read_pos advances past the event.

The copy is bounded by the event's event_size field. The drain thread MUST NOT read beyond event_size bytes from the event's position in the ring buffer.

§2.2.4 Generation changes

After completing a drain cycle, the drain thread MUST check the ring buffer's generation field as specified in PSD-003 §5.1. If the generation has changed (due to a ring buffer resize triggered by a BufferCapacity configuration change), the drain thread MUST:

1. Record the sequence number of the last successfully processed event.
2. Call kmes_attach(cpu_id) to obtain a new file descriptor for this CPU's resized ring buffer.
3. Map the new ring buffer.
4. Unmap the old ring buffer and close the old file descriptor.
5. Scan events in the new buffer to find the first event with a sequence number greater than the recorded sequence number.
6. Resume draining from that position.

Each drain thread handles generation changes independently. There is no coordination between drain threads during reattach. Each drain thread attaches only to its own CPU's ring buffer.

Events MUST NOT be lost or duplicated during a generation change.
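
ⓘ Informative

The scan for the resume position after a reattach can be pictured with a small helper. This sketch assumes the drain thread has already walked the new buffer and collected the sequence numbers of the events it contains, in buffer order; the function name and that representation are illustrative only.

```python
import bisect


def resume_index(sequences, last_processed):
    """Index of the first event with sequence > last_processed, or None.

    `sequences` holds the per-CPU sequence numbers found in the new buffer,
    oldest first. Per-CPU sequence numbers are increasing, so a binary
    search is enough.
    """
    idx = bisect.bisect_right(sequences, last_processed)
    return idx if idx < len(sequences) else None
```

Skipping everything at or below the recorded sequence number avoids duplicates. If the first surviving event is more than one ahead of the recorded value, the difference surfaces as an ordinary sequence gap rather than as silent loss.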

§2.2.5 Sequence tracking

Each drain thread MUST maintain the last seen sequence number for its CPU. This value is used for gap detection (§2.5) and for resumption after generation changes or restarts.

On startup, if eventd has previously persisted data for this boot (identified by boot ID), it MUST read the last persisted sequence number per CPU from the event store and initialise the drain thread's sequence tracking from that value. This allows eventd to detect gaps that span a restart.

If no prior data exists for the current boot, the drain thread initialises its sequence tracker to 0 (no events seen). The first event on each CPU has sequence number 1 (PSD-003 §2.1).
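
ⓘ Informative

One way to recover the per-CPU resume points at startup is an aggregate query over every shard database, since a shard may hold events from several CPUs. The sketch below uses Python's sqlite3 module and the column names from §3.1.1. The function names are illustrative, and taking the maximum across all shards is an assumption about how the per-shard results are combined.

```python
import sqlite3

RESUME_QUERY = (
    "SELECT cpu_id, MAX(sequence) FROM events "
    "WHERE boot_id = ? AND sequence IS NOT NULL AND cpu_id IS NOT NULL "
    "GROUP BY cpu_id"
)


def load_resume_points(shard_conns, boot_id):
    """Return {cpu_id: last persisted sequence} for the given boot."""
    points = {}
    for conn in shard_conns:
        for cpu_id, last_seq in conn.execute(RESUME_QUERY, (boot_id,)):
            # Keep the highest sequence seen for this CPU in any shard.
            points[cpu_id] = max(points.get(cpu_id, 0), last_seq)
    return points


def initial_sequence(points, cpu_id):
    """Tracker start value: the persisted sequence, or 0 if nothing was seen."""
    return points.get(cpu_id, 0)
```

Synthetic records carry a NULL sequence and are excluded by the WHERE clause, so they never influence the resume point.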

§2.3 Storage Sharding

§2.3.1 Shard model

eventd distributes event writes across one or more independent SQLite databases called shards. Each shard is a self-contained database with its own file, WAL, and writer thread. Shards share no write-path state.

The number of shards is configured via the StorageShards registry key under Machine\System\eventd\. Valid values are 0 and 1 to 256. A value of 0 means the shard count equals the CPU count (as reported by kmes_attach). The default is 0.

ⓘ Informative

For best performance, the shard count should be both a power of two and a multiple of the CPU count. A power-of-two shard count enables the implementation to use bitwise AND instead of modulo for event routing. A shard count that is a multiple of the CPU count ensures even distribution of shards across CPUs.

§2.3.2 Shard-to-CPU assignment

Shards are assigned to drain threads (and thus CPUs) at startup. The assignment is static for the lifetime of the eventd process.

The assignment maps shard j to CPU j % cpu_count. When the shard count equals the CPU count, each CPU has exactly one shard (the 1:1 case). When the shard count is less than the CPU count, multiple CPUs share a shard. When the shard count exceeds the CPU count, a CPU is assigned multiple shards.

When a drain thread is assigned multiple shards, it distributes events across its shards using round-robin assignment. Each event is routed to the next shard in sequence.

When the shard count exceeds the CPU count but is not an exact multiple of it, some CPUs are assigned one more shard than others. The resulting write throughput imbalance is at most one shard's worth of throughput and is negligible in practice.

The shard-to-CPU assignment is not persistent across restarts. A shard database may contain events from different sets of CPUs across different eventd lifetimes. Shards are a write-path optimisation only -- the query path MUST NOT assume any relationship between a shard and a specific CPU. Queries that filter by CPU ID MUST scan all shards.
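
ⓘ Informative

The assignment rule and the round-robin routing can be sketched for the case where the shard count is at least the CPU count. The helper names are illustrative. The sketch deliberately leaves out the case of fewer shards than CPUs, because the formula "shard j maps to CPU j % cpu_count" does not by itself say which shard the higher-numbered CPUs share.

```python
from itertools import cycle


def shards_for_cpu(cpu_id, shard_count, cpu_count):
    """Shards assigned to cpu_id under the rule: shard j -> CPU j % cpu_count."""
    return [j for j in range(shard_count) if j % cpu_count == cpu_id]


def make_router(shards):
    """Round-robin router: each call returns the next shard in sequence."""
    it = cycle(shards)
    return lambda: next(it)
```

With 6 shards on 4 CPUs, CPUs 0 and 1 each receive two shards and CPUs 2 and 3 receive one, which is the "one more shard than others" imbalance described above.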

§2.3.3 Writer threads

eventd MUST create one writer thread per shard. The writer thread is the sole writer to its shard's database. No other thread or connection writes to that database.

When multiple drain threads are assigned to the same shard (shard count < CPU count), they hand off events to the shard's writer thread concurrently. The handoff channel MUST support multiple concurrent producers (drain threads) and a single consumer (the writer thread). When a drain thread is assigned multiple shards (shard count > CPU count), it hands off events to the appropriate writer thread based on the round-robin assignment. The drain threads MUST NOT write to SQLite directly.

§2.3.4 Handoff mechanism

Each writer thread has a bounded handoff channel through which drain threads submit events. The channel capacity MUST NOT exceed MaxBatchSize events. The channel is the only buffer between the drain thread and the writer thread.

When the channel is full, the drain thread MUST stop reading from the KMES ring buffer and wait for the writer thread to drain the channel. The drain thread MUST NOT drop events to relieve backpressure. While the drain thread is paused, new events accumulate in the KMES ring buffer -- this is the designed backpressure path. If the ring buffer fills during the pause, KMES overwrites the oldest events, and the drain thread detects this as a sequence gap when it resumes reading.

This preserves the invariant that KMES ring buffers are the only event buffer. The handoff channel is a staging area for the current batch, not a secondary buffer. Backpressure propagates: writer thread slow → channel fills → drain thread pauses → ring buffer absorbs → KMES overwrites oldest if full → gap detected on resume.

When the writer thread commits a batch and the channel has capacity again, the drain thread resumes reading from the ring buffer immediately.
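
ⓘ Informative

The blocking behaviour of the handoff channel can be illustrated with a bounded queue. This is a sketch only: Python's queue.Queue stands in for whatever multi-producer, single-consumer channel an implementation uses, and the constant and function names are illustrative.

```python
import queue
import threading

MAX_BATCH_SIZE = 4  # illustrative; the real bound comes from the MaxBatchSize key


def make_channel(capacity=MAX_BATCH_SIZE):
    """Bounded channel; capacity must not exceed MaxBatchSize."""
    return queue.Queue(maxsize=capacity)


def submit(channel, event):
    """Called by a drain thread. Blocks (never drops) while the channel is full."""
    channel.put(event)


def take_batch(channel, limit=MAX_BATCH_SIZE):
    """Called by the writer thread. Waits for one event, then takes what is ready."""
    batch = [channel.get()]
    while len(batch) < limit:
        try:
            batch.append(channel.get_nowait())
        except queue.Empty:
            break
    return batch
```

While submit() is blocked the drain thread is not reading the ring buffer, so any overflow happens in KMES and is later observed as a sequence gap.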

§2.3.5 Shard lifecycle

Shard databases are created in the event store directory on first use. eventd MUST NOT delete or overwrite existing shard databases from previous configurations. If eventd starts with a different shard count than the previous run, the previously written databases remain in the directory and are available to the query path.

Shard database filenames MUST encode sufficient information to identify the shard and distinguish active shards from historical ones. The naming convention is defined in §3.2.2.

§2.3.6 Reconfiguration

Changing the StorageShards value requires an eventd restart to take effect. eventd MUST NOT dynamically reassign CPUs to shards or create new shards while running.

ⓘ Informative

Shard count changes are expected to be rare -- typically set once based on hardware profile (1 for a Raspberry Pi, CPU count or a multiple of CPU count for a server) and left unchanged. The registry watch mechanism detects the change, but eventd defers application to the next restart rather than attempting a live migration.

§2.4 Batch Writer

§2.4.1 Transaction model

Each writer thread writes events to its shard's SQLite database using explicit transactions. A transaction consists of a BEGIN, one or more INSERT statements (one per event), and a COMMIT. The COMMIT is the durability boundary -- events in a committed transaction are guaranteed to survive process crashes and power loss.

The database MUST be opened in WAL (Write-Ahead Logging) mode. The synchronous pragma MUST be set to FULL. This ensures that every COMMIT fsyncs the WAL, providing per-transaction durability.
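
ⓘ Informative

A minimal sketch of the transaction model, using Python's sqlite3 module. The function names are illustrative, and the INSERT touches only the three NOT NULL columns from §3.1.1 to keep the example short. The connection is opened in autocommit mode so that BEGIN and COMMIT are issued explicitly.

```python
import sqlite3


def open_writer(path):
    """Open the shard's single read-write connection."""
    # isolation_level=None: no implicit transactions; we issue BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def write_batch(conn, rows):
    """Write one batch: BEGIN, one INSERT per event, COMMIT.

    rows: iterable of (boot_id, timestamp, event_type) tuples.
    """
    conn.execute("BEGIN")
    try:
        for row in rows:
            conn.execute(
                "INSERT INTO events (boot_id, timestamp, event_type) VALUES (?, ?, ?)",
                row,
            )
        conn.execute("COMMIT")  # durability boundary
    except Exception:
        conn.execute("ROLLBACK")
        raise
```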

§2.4.2 Adaptive batch sizing

The writer thread MUST adapt its batch size to balance throughput against power-loss resilience. The goal is to commit as frequently as throughput allows, minimising the number of events in an uncommitted transaction at any given time.

Throughput is always the top priority. eventd MUST NOT fall behind the ingestion rate -- if it does, KMES ring buffers fill and kernel events are overwritten, which is irrecoverable data loss. Power-loss resilience is maximised within the constraint that throughput is maintained.

The adaptive algorithm operates as follows:

1. The writer thread begins a transaction.
2. The writer thread reads available events from its assigned drain threads.
3. If no events are available and the current batch is non-empty, the writer SHOULD commit immediately. There is no throughput pressure, so committing minimises the power-loss window.
4. If no events are available and the current batch is empty, the writer thread waits for events (the drain threads will wake it via the handoff mechanism when events arrive).
5. If events are available, they are added to the current transaction (INSERT statements executed).
6. After each INSERT (or group of INSERTs), the writer evaluates whether to commit now or continue batching. The decision is based on the observed ratio between event arrival rate and commit throughput: if arrivals are slow relative to commit cost, commit now (resilience). If arrivals are fast relative to commit cost, continue batching (throughput).
7. The batch MUST NOT exceed MaxBatchSize events. When the limit is reached, the writer MUST commit regardless of throughput conditions.

The specific heuristics for step 6 are implementation-defined. The normative requirements are:

Under low load, the writer MUST commit within MaxBatchLatencyMs of the first event in the batch.
Under high load, the writer MUST NOT produce batches larger than MaxBatchSize.
The writer MUST NOT hold an open transaction indefinitely.
The adaptive algorithm SHOULD NOT oscillate between extreme batch sizes under bursty workloads. Rapid alternation between very small commits (high fsync overhead) and very large commits (high latency) degrades both throughput and power-loss resilience. The implementation SHOULD apply smoothing or hysteresis to the arrival rate estimate.
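
ⓘ Informative

Because the commit heuristic is implementation-defined, the following is only one possible shape for it, not the algorithm eventd uses. It assumes an exponentially weighted moving average for smoothing and a hypothetical rule: commit once the batch holds at least as many events as are expected to arrive while one commit runs. The class name, method names, and the alpha parameter are all illustrative. The hard caps correspond to MaxBatchSize and MaxBatchLatencyMs.

```python
class CommitPolicy:
    """Hypothetical commit decision with a smoothed arrival-rate estimate."""

    def __init__(self, max_batch_size=10000, max_latency_ms=100, alpha=0.2):
        self.max_batch_size = max_batch_size
        self.max_latency_s = max_latency_ms / 1000.0
        self.alpha = alpha  # smoothing factor, 0 < alpha <= 1
        self.rate = 0.0     # smoothed arrival rate in events per second

    def observe(self, events, elapsed_s):
        """Fold one arrival-rate sample into the moving average."""
        if elapsed_s > 0:
            sample = events / elapsed_s
            self.rate += self.alpha * (sample - self.rate)

    def should_commit(self, batch_len, batch_age_s, pending, commit_cost_s):
        if batch_len == 0:
            return False  # nothing to commit
        if batch_len >= self.max_batch_size:
            return True   # hard cap on batch size
        if batch_age_s >= self.max_latency_s:
            return True   # hard cap on batch latency
        if pending == 0:
            return True   # idle: shrink the power-loss window
        # Events expected to arrive while a single commit is in flight.
        arriving_during_commit = self.rate * commit_cost_s
        return batch_len >= arriving_during_commit
```

Under slow arrivals the expected count stays below one and every event is committed on its own. Under fast arrivals the batch grows until it amortises the commit cost or reaches a cap.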

§2.4.3 Configuration

Key	Type	Default	Valid range	Description
MaxBatchSize	REG_DWORD	10000	100--100000	Maximum number of events in a single transaction.
MaxBatchLatencyMs	REG_DWORD	100	10--5000	Maximum time in milliseconds between the first event entering a batch and the batch being committed.

These parameters bound the adaptive algorithm. MaxBatchSize caps the throughput-optimised case. MaxBatchLatencyMs caps the latency in the resilience-optimised case. The adaptive algorithm operates freely within these bounds.

§2.4.4 WAL checkpointing

WAL mode accumulates write-ahead log data until a checkpoint copies it back to the main database file. Under sustained write load, the WAL can grow large.

Each writer thread MUST trigger a WAL checkpoint when the WAL exceeds a size threshold. The checkpoint SHOULD use SQLITE_CHECKPOINT_PASSIVE mode, which checkpoints as much as possible without blocking readers. If a passive checkpoint cannot make progress (active readers hold pages), the writer MUST NOT block -- it continues writing and retries the checkpoint later.

The checkpoint threshold is implementation-defined. A reasonable default is 1000 pages (4 MB with the default 4 KB page size).
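
ⓘ Informative

A passive checkpoint can be issued through the wal_checkpoint pragma, which returns a single row of three integers. The sketch below wraps that call with Python's sqlite3 module. The names are illustrative, and how the writer measures WAL growth against the threshold is left out because it is implementation-defined.

```python
import sqlite3
from collections import namedtuple

CheckpointResult = namedtuple("CheckpointResult", "busy wal_frames checkpointed")


def wal_threshold_bytes(pages=1000, page_size=4096):
    """Threshold in bytes for a given page count and page size."""
    return pages * page_size


def passive_checkpoint(conn):
    """Run a PASSIVE checkpoint and report how far it got.

    PASSIVE never waits for readers: if a reader pins part of the WAL,
    fewer frames are checkpointed and the caller simply retries later.
    """
    row = conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    return CheckpointResult(*row)


def checkpoint_complete(result):
    """True if every frame in the WAL was copied into the database file."""
    return result.checkpointed == result.wal_frames
```

A writer following this section would continue inserting after an incomplete checkpoint and call passive_checkpoint again on a later pass.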

ⓘ Informative

PASSIVE checkpointing runs on the writer thread and briefly serialises with INSERT work. This is inherent to SQLite's architecture -- checkpointing and writing cannot run concurrently on the same database. PASSIVE mode is the lightest option (it yields immediately if readers hold pages) and the per-checkpoint cost is bounded by the threshold size. No alternative design avoids this cost within SQLite's concurrency model.

§2.4.5 Prepared statements

Writer threads MUST use prepared statements for INSERT operations. The prepared statement is created once per writer thread at startup and reused for every INSERT. This eliminates SQL parsing overhead from the hot path.

§2.5 Gap Detection

§2.5.1 Sequence gap detection

Each drain thread MUST track the last seen sequence number for its CPU. When an event is read from the ring buffer, the drain thread MUST compare the event's sequence number against the expected next sequence number (last seen + 1).

If the event's sequence number is greater than expected, the intervening sequence numbers represent lost events. Events may be lost due to:

Ring buffer overrun (KMES overwrote events before eventd read them).
Event drops (KMES dropped events due to structural size limits).
Events lost during an eventd restart (events emitted while eventd was not running).

§2.5.2 Gap records

When a sequence gap is detected, the drain thread MUST generate a synthetic gap record containing:

The CPU ID.
The first missing sequence number (last seen + 1).
The last missing sequence number (current event's sequence number - 1).
The count of missing events (last missing - first missing + 1).
The timestamp of the last successfully processed event on this CPU (if available).
The timestamp of the event that revealed the gap.

The gap record is written directly to the shard database as part of the normal write path. It is not emitted through KMES. Gap records are stored alongside regular events and are queryable through the same interface.
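
ⓘ Informative

The gap arithmetic above can be written out directly. The function name and dictionary keys in this sketch are illustrative; the actual payload field names are defined by eventd and are not given in this section.

```python
def detect_gap(cpu_id, last_seen, seq, last_ts, ts):
    """Return a gap record if events were lost before `seq`, else None.

    last_seen: last sequence number processed on this CPU (0 if none yet).
    last_ts:   timestamp of that event, or None if unavailable.
    ts:        timestamp of the event that revealed the gap.
    """
    expected = last_seen + 1
    if seq <= expected:
        return None  # in order: nothing missing
    first_missing = expected
    last_missing = seq - 1
    return {
        "cpu_id": cpu_id,
        "first_missing": first_missing,
        "last_missing": last_missing,
        "count": last_missing - first_missing + 1,
        "last_event_ts": last_ts,
        "revealing_event_ts": ts,
    }
```

For example, with last seen sequence 10 and a current event of sequence 15, the record covers sequences 11 through 14, a count of 4.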

§2.5.3 Lapping

If the drain thread's read_pos falls behind the ring buffer's tail_pos (the consumer has been lapped), the drain thread advances to tail_pos as specified by the PSD-003 read protocol. The sequence gap between the last processed event and the first event at tail_pos is detected and recorded as a gap record through the normal gap detection mechanism.

§2.5.4 Gap records in the event table

Gap records are stored in the events table with event_type = 'synthetic.gap'. The cpu_id column MUST be populated with the CPU ID on which the gap was detected. Other KMES header columns (sequence, origin_class, identity GUIDs) are NULL. The gap details (missing sequence range, surrounding timestamps) are stored as a msgpack map in the payload column. Gap records are queryable through the same interface as all other events. Populating cpu_id ensures that EVENTS WHERE cpu_id == N correctly returns gap records for CPU N.

§2.6 Synthetic Events

§2.6.1 Definition

Synthetic events are records generated by eventd itself, not received from KMES. They are written directly to the event store database, bypassing KMES ring buffers entirely.

Synthetic events do not have KMES headers. They do not carry identity stamps, sequence numbers, or origin class values. They have an eventd-assigned timestamp (wall clock at generation time) and an event type string identifying the kind of synthetic event.

§2.6.2 When synthetic events are generated

eventd MUST generate synthetic events for the following conditions:

Sequence gaps -- when lost events are detected on any CPU (see §2.5).
eventd startup -- when eventd starts and successfully attaches to KMES ring buffers.
eventd shutdown -- when eventd begins a graceful shutdown, recording the last persisted sequence number per CPU.
Storage errors -- when a write to any store (event, log, or metric) fails (disk full, SQLite error).
Configuration changes -- when eventd reads a changed configuration value from the registry.

Additional synthetic event types MAY be defined in future versions.

§2.6.3 Shard assignment

CPU-specific synthetic events (gap records) MUST be written to the shard assigned to the CPU that generated them. They are handed off to the writer thread alongside regular events from that CPU.

Daemon-wide synthetic events (startup, shutdown, configuration changes, storage errors) MUST be written to shard 0. A storage error describes a failure on a specific shard but is itself a daemon-wide notification -- it MUST NOT be written to the failing shard. These events are infrequent and the minor write imbalance is negligible.

§2.6.4 Storage

Synthetic events are written to the same shard databases as KMES events. They participate in the same batching, retention, and query mechanisms. They are distinguishable from KMES events by the synthetic. prefix of their event_type (§3.1.1).

§2.6.5 Ordering

Synthetic events are ordered by their eventd-assigned timestamp. They do not participate in per-CPU sequence numbering. A synthetic event's timestamp reflects when eventd generated it, not when the condition it describes occurred (e.g., a gap record's timestamp is when the gap was detected, not when the lost events were emitted).

Section

3 Event storage

§3.1 Schema

§3.1.1 Event table

Each shard database MUST contain an events table with the following schema:

Column	Type	Description
`id`	INTEGER PRIMARY KEY	SQLite rowid. Auto-assigned, monotonically increasing within the shard.
`boot_id`	BLOB NOT NULL	16-byte boot ID GUID identifying which boot this event belongs to.
`timestamp`	INTEGER NOT NULL	Wall clock time. Nanoseconds since Unix epoch. For KMES events, copied from the event header `timestamp` field. For synthetic events, the wall clock time when eventd generated the record.
`cpu_id`	INTEGER	CPU on which the event was emitted. Copied from the KMES event header. NULL for daemon-wide synthetic events (startup, shutdown, config_change, storage_error). Populated for CPU-specific synthetic events (gap records).
`sequence`	INTEGER	Per-CPU, per-boot monotonic sequence number. Copied from the KMES event header. NULL for synthetic events.
`origin_class`	INTEGER	Origin of the event (0 = userspace, 1 = KMES, 2 = KACS, 3 = LCS). Copied from the KMES event header. NULL for synthetic events.
`event_type`	TEXT NOT NULL	Event type string. For KMES events, copied from the event header. For synthetic events, a `synthetic.` prefixed type string (e.g., `synthetic.startup`, `synthetic.shutdown`, `synthetic.gap`, `synthetic.config_change`, `synthetic.storage_error`).
`effective_token_guid`	BLOB	16-byte GUID of the effective token at emission time. NULL for synthetic events. Null GUID (16 zero bytes) if identity was not available at emission time.
`true_token_guid`	BLOB	16-byte GUID of the process's primary token at emission time. NULL for synthetic events.
`process_guid`	BLOB	16-byte GUID of the emitting process. NULL for synthetic events.
`payload`	BLOB	Msgpack-encoded event payload. For KMES events, the raw payload bytes from the event -- eventd MUST NOT interpret, modify, or re-encode them. For synthetic events, a msgpack-encoded map containing event-specific details. NULL if the event carries no payload data.

All KMES header fields are extracted into individual columns to enable direct SQL filtering without parsing event data. The event_type column serves as the sole discriminator between KMES events and synthetic events -- no separate record type column is needed.

§3.1.2 Synthetic event types

The following synthetic event types are defined:

Event type	Payload contents
`synthetic.startup`	Boot ID, restart flag (whether this is a fresh boot or a restart within the same boot), shard count, per-CPU sequence resume points.
`synthetic.shutdown`	Per-CPU last persisted sequence numbers.
`synthetic.gap`	CPU ID, first missing sequence number, last missing sequence number, count of missing events, timestamp of last event before the gap, timestamp of event that revealed the gap.
`synthetic.config_change`	Key name, old value, new value.
`synthetic.storage_error`	Store type (event, log, or metric), shard index (for event store errors, NULL for log/metric), error description.

The payload schema for each synthetic event type is defined by eventd. Additional synthetic event types MAY be defined in future versions.

§3.1.3 Write-time indexes

At database creation, eventd MUST create the following index:

idx_events_timestamp on events(timestamp) -- required for time-range queries, which are the most common access pattern.

No other indexes are created at database creation time. Additional indexes are managed by the adaptive indexing mechanism (§3.3).

§3.1.4 Schema versioning

Each shard database MUST store a schema version number in a metadata table:

Column	Type	Description
`key`	TEXT PRIMARY KEY	Metadata key name.
`value`	TEXT NOT NULL	Metadata value.

Required metadata entries:

Key	Value
`schema_version`	`1` (for this version of the specification).
`created_at`	ISO 8601 timestamp of database creation.

eventd MUST check the schema version on startup and MUST NOT write to databases with an unrecognised schema version. Migration is a separate administrative operation, not an automatic startup behavior.

§3.2 Database Lifecycle

§3.2.1 Event store directory

All event shard databases MUST reside in a single directory, the event store directory. The path is configured via the EventStorePath registry key under Machine\System\eventd\. There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

eventd MUST create the directory if it does not exist. eventd MUST NOT write event databases to any other location.

§3.2.2 Shard database naming

Active shard databases MUST be named shard-NNNN.db where NNNN is the zero-padded shard index (0000, 0001, ...). The shard index is the shard number assigned at startup, not the CPU number.

When eventd starts with a shard count that requires new databases, it creates them. When eventd starts with a shard count smaller than the number of existing shard databases, the excess databases are not deleted -- they remain in the directory and are available to the query path.

§3.2.3 Database creation

When a shard database file does not exist at startup, eventd MUST create it with:

WAL mode enabled (PRAGMA journal_mode=WAL).
Synchronous mode set to FULL (PRAGMA synchronous=FULL).
The events and metadata tables created as defined in §3.1.
The idx_events_timestamp index created.
The schema_version and created_at metadata entries populated.
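
The creation steps above can be sketched with Python's standard sqlite3 module. This is a minimal illustration, not the eventd implementation: the function name create_shard is hypothetical, and the events table is a reduced stand-in for the full schema defined in §3.1.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA_VERSION = "1"

# ASSUMPTION: reduced stand-in for the events table; the real columns are in §3.1.
EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    id         INTEGER PRIMARY KEY,
    boot_id    BLOB    NOT NULL,
    cpu_id     INTEGER NOT NULL,
    sequence   INTEGER NOT NULL,
    timestamp  INTEGER NOT NULL,
    event_type TEXT    NOT NULL,
    payload    BLOB
)
"""

def create_shard(path: str) -> sqlite3.Connection:
    """Create one shard database with the pragmas, tables, index and metadata."""
    # isolation_level=None puts the connection in autocommit mode so that the
    # pragmas run outside a transaction and BEGIN/COMMIT are issued explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute("BEGIN")
    conn.execute(EVENTS_DDL)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_timestamp ON events(timestamp)")
    created_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR IGNORE INTO metadata(key, value) VALUES (?, ?)",
        [("schema_version", SCHEMA_VERSION), ("created_at", created_at)],
    )
    conn.execute("COMMIT")
    return conn
```

Creating the tables, index, and metadata rows in one transaction means a crash during creation leaves either an empty database or a complete one.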

§3.2.4 Database opening

When a shard database file exists at startup, eventd MUST:

Open the database in WAL mode.
Set synchronous mode to FULL.
Read and verify the schema_version metadata entry. If the version is unrecognised, eventd MUST log an error and MUST NOT write to that database. The database remains available for read-only queries.
Verify structural integrity (the required tables exist). If verification fails, eventd MUST log an error and MUST NOT write to that database.
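
A sketch of the opening checks follows, again using Python's sqlite3 module. The function name open_shard and the (connection, writable) return shape are assumptions made for illustration. The sketch checks table presence before reading schema_version only because the version lives in the metadata table.

```python
import sqlite3

RECOGNISED_VERSIONS = {"1"}
REQUIRED_TABLES = {"events", "metadata"}

def open_shard(path: str):
    """Open an existing shard; return (connection, writable).

    writable is False when the schema version is unrecognised or a required
    table is missing. The caller is expected to log the error and keep the
    connection for read-only use.
    """
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")

    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    if not REQUIRED_TABLES <= tables:
        return conn, False

    row = conn.execute(
        "SELECT value FROM metadata WHERE key='schema_version'"
    ).fetchone()
    writable = row is not None and row[0] in RECOGNISED_VERSIONS
    return conn, writable
```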

§3.2.5 Query path discovery

The query path MUST discover all .db files in the event store directory and open them for reading. This includes active shard databases, historical shard databases from previous configurations, and any archive databases created by the retention mechanism. The query path MUST NOT assume a fixed number of databases.

Each database is opened with a read-only connection. Read-only connections do not contend with the writer thread's connection.
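
Discovery and read-only opening might look like the following sketch. The function name open_for_query is hypothetical. The read-only connection uses SQLite's URI filename syntax with mode=ro.

```python
import sqlite3
from pathlib import Path

def open_for_query(store_dir: str) -> dict:
    """Open every .db file found in store_dir with a read-only connection."""
    connections = {}
    for db_path in sorted(Path(store_dir).glob("*.db")):
        # mode=ro makes SQLite reject any write attempted through this handle.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        connections[db_path.name] = sqlite3.connect(uri, uri=True)
    return connections
```

A plain *.db glob also matches eventd-meta.db (§3.3.9.1), which resides in the same directory, so an implementation would presumably exclude it from event queries.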

§3.2.6 Concurrency

Each shard database has exactly one read-write connection (owned by the shard's writer thread) and zero or more read-only connections (owned by query handlers). WAL mode permits concurrent reads alongside a single writer without blocking.

Writer threads MUST NOT share SQLite connections. Each writer thread creates and owns its connection for the lifetime of the eventd process.

§3.3 3 Event storage

Adaptive Indexing

§3.3.1 Purpose

Secondary indexes accelerate queries but slow writes. The optimal set of indexes depends on the actual query patterns of the deployment, which vary between systems and over time. Adaptive indexing allows eventd to maintain the right indexes for the workload without manual tuning.

§3.3.2 Global desired index set

eventd MUST maintain a global desired index set -- an ordered list of columns that should be indexed across all shard databases. The list is ordered by priority: the most frequently queried column has the highest priority.

The adaptive indexing system has three decoupled components:

Query frequency counters. Query handlers increment per-field counters when a field appears in a WHERE predicate. This is the sole write-heavy path. Counters are stored in memory and periodically persisted to a dedicated metadata database in the event store directory (not in the shard databases, since the counters are global state independent of shard configuration).
Index policy logic. A periodic process reads the query frequency counters, applies the creation and removal thresholds, and computes the desired index set. This runs at the interval configured by AdaptiveIndexPolicyIntervalMinutes (default 60 minutes). The policy logic is the sole writer to the desired set. The desired set is stored in memory and persisted to the same metadata database. Future versions MAY extend the policy logic with manual rules (e.g., "always index event_type regardless of query frequency") or administrative overrides.
Shard convergence. Writer threads read the desired index set and converge their material indexes toward it during quiet periods. Writer threads never read the counters and never write to the desired set.

This separation ensures that the high-frequency counter updates (one per query) do not contend with the writer threads' index convergence checks (one per drain cycle). The policy logic is the bridge between the two and runs infrequently enough to never be a contention point.

The desired index set is global -- it applies to all shards uniformly. Individual shards do not make independent indexing decisions.

§3.3.3 Shard convergence

Each shard independently converges its material indexes toward the global desired set during periods of low write activity. When a shard's writer thread has no pending events and the shard's material indexes do not match the desired set, the writer thread SHOULD create or drop indexes to converge.

Index creation uses CREATE INDEX IF NOT EXISTS. Index removal uses DROP INDEX IF EXISTS. Both operations run on the shard's writer thread.
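
One way to picture convergence is as a set difference between the indexes a shard has and the indexes it should have. The sketch below is illustrative only: the helper names and the shape of the desired mapping (index name to indexed expression, in priority order) are assumptions, and index names are interpolated into SQL on the assumption that they come from trusted policy state.

```python
import sqlite3

# The timestamp index is never dropped by adaptive management.
PROTECTED = {"idx_events_timestamp"}

def material_indexes(conn: sqlite3.Connection) -> set:
    """Names of the indexes that currently exist on the events table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='index' AND tbl_name='events' AND name NOT LIKE 'sqlite_%'"
    )
    return {row[0] for row in rows}

def convergence_plan(conn: sqlite3.Connection, desired: dict):
    """Return (to_create, to_drop) for one shard.

    desired maps index name -> indexed expression, in priority order.
    """
    have = material_indexes(conn)
    to_create = [(name, expr) for name, expr in desired.items() if name not in have]
    to_drop = sorted(have - set(desired) - PROTECTED)
    return to_create, to_drop

def converge(conn: sqlite3.Connection, desired: dict) -> None:
    """Apply the plan; a real writer thread would do this only when idle."""
    to_create, to_drop = convergence_plan(conn, desired)
    for name in to_drop:
        conn.execute(f'DROP INDEX IF EXISTS "{name}"')
    for name, expr in to_create:
        conn.execute(f'CREATE INDEX IF NOT EXISTS "{name}" ON events({expr})')
```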

Index creation MUST be cancellable. If the drain threads detect rising write pressure while an index build is in progress, the drain threads MUST signal the writer thread to abort the build. The writer thread MUST cancel the in-progress CREATE INDEX, causing SQLite to roll back the partial index cleanly. The writer thread then resumes normal event batch writing immediately. The aborted index build is reattempted during the next quiet period.

Throughput MUST always take priority over index maintenance.

ⓘ Informative

sqlite3_interrupt() sets a flag that SQLite checks at SQL VM opcode boundaries. During B-tree construction for a large index, the gap between checks can be tens of milliseconds -- long enough to cause ring buffer overrun at high event rates. Implementations SHOULD use sqlite3_progress_handler() to register a callback invoked every N VM opcodes (e.g., every 1000 opcodes). The callback checks a cancellation flag and returns non-zero to abort the operation. This provides much more responsive cancellation than sqlite3_interrupt() alone.
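
Python's sqlite3 module exposes the same mechanism as Connection.set_progress_handler, which makes the cancellation pattern easy to sketch. The function name build_index_cancellable and the use of a threading.Event as the cancellation flag are assumptions made for illustration.

```python
import sqlite3
import threading

def build_index_cancellable(
    conn: sqlite3.Connection,
    ddl: str,
    cancel: threading.Event,
    every_n_ops: int = 1000,
) -> bool:
    """Run a CREATE INDEX statement; return False if it was cancelled."""
    # SQLite calls the handler roughly every `every_n_ops` VM instructions.
    # A non-zero return value aborts the statement that is running.
    conn.set_progress_handler(lambda: 1 if cancel.is_set() else 0, every_n_ops)
    try:
        conn.execute(ddl)
        return True
    except sqlite3.DatabaseError:
        if cancel.is_set():
            return False  # aborted on request; SQLite rolled the statement back
        raise
    finally:
        # Remove the handler so later statements on this connection are unaffected.
        conn.set_progress_handler(None, 1)
```

The drain thread would set the event when it detects rising write pressure; the writer thread then sees False and returns to batch writing.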

Shards converge at their own pace. A shard under sustained write pressure may lag behind the desired set indefinitely. This is acceptable -- the shard is prioritising throughput over query performance.

§3.3.4 Pressure-based index shedding

When a shard is under sustained write pressure, it MUST shed indexes to reduce per-insert overhead and protect throughput.

Graduated shedding. If more than SheddingBatchPercent% of a shard's batches exceed 75% of MaxBatchSize within a sliding window of SheddingWindowSeconds seconds, the shard MUST drop its lowest-priority secondary index (the index whose corresponding column has the lowest query frequency in the global desired set). If pressure remains after dropping the lowest-priority index, the next-lowest is dropped, and so on. The shedding check runs once per batch commit.

Emergency shedding. If a shard is at maximum batch size and the drain thread signals rising ring buffer pressure (see below), the shard MUST drop all secondary indexes immediately. DROP INDEX is a fast metadata operation (milliseconds, not seconds) and is safe to execute under pressure.

§3.3.4.1 Ring buffer pressure signaling

The drain thread monitors the gap between write_pos and its read_pos in the KMES ring buffer. If this gap exceeds EmergencySheddingBufferPercent% of the ring buffer capacity, the drain thread MUST signal the writer thread that emergency shedding is required. This signal is distinct from the index-build cancellation signal -- it triggers immediate shedding of all secondary indexes regardless of whether an index build is in progress.

§3.3.4.2 Shedding configuration

Key	Type	Default	Valid range	Description
SheddingWindowSeconds	REG_DWORD	30	10--300	Sliding window for graduated shedding evaluation.
SheddingBatchPercent	REG_DWORD	75	50--100	Percentage of batches within the window that must exceed 75% of MaxBatchSize to trigger graduated shedding.
EmergencySheddingBufferPercent	REG_DWORD	75	50--95	Ring buffer fill percentage that triggers emergency shedding.

The idx_events_timestamp index is exempt from shedding. It is always maintained regardless of write pressure. Time-range queries are the foundational access pattern and cannot function without a timestamp index.
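
The two shedding triggers can be sketched as below. The class and function names are hypothetical. The sketch assumes that write_pos and read_pos are monotonically increasing counters rather than wrapped offsets, and that a batch counts as large when its size is strictly greater than 75% of MaxBatchSize.

```python
from collections import deque

class SheddingMonitor:
    """Sliding-window check for graduated shedding on one shard."""

    def __init__(self, max_batch_size: int, window_seconds: float = 30,
                 batch_percent: int = 75):
        self.high_mark = 0.75 * max_batch_size
        self.window_seconds = window_seconds
        self.batch_percent = batch_percent
        self.samples = deque()  # (commit_time, was_large)

    def record_commit(self, now: float, batch_size: int) -> bool:
        """Record one batch commit; return True if an index should be shed."""
        self.samples.append((now, batch_size > self.high_mark))
        # Discard commits that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()
        large = sum(1 for _, was_large in self.samples if was_large)
        # "More than SheddingBatchPercent% of batches", in integer arithmetic.
        return 100 * large > self.batch_percent * len(self.samples)

def emergency_pressure(write_pos: int, read_pos: int, capacity: int,
                       buffer_percent: int = 75) -> bool:
    """True when the unread gap exceeds buffer_percent% of ring capacity."""
    return 100 * (write_pos - read_pos) > buffer_percent * capacity
```

With the defaults, three large batches out of four (exactly 75%) do not trigger shedding, while four out of five (80%) do.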

§3.3.5 Recovery

When write pressure subsides after index shedding, the shard MUST rebuild dropped indexes to converge back toward the global desired set. Rebuilding follows the same low-write-activity scheduling and cancellability rules as initial index creation.

The rebuild order follows the priority order of the desired set: highest-priority (most frequently queried) indexes are rebuilt first.

§3.3.6 Candidate fields

All fields that can appear in a WHERE predicate are candidates for adaptive indexing. This includes both header columns and payload fields.

§3.3.6.1 Header column indexes

The following events table columns are candidates for standard column indexes:

event_type
origin_class
cpu_id
effective_token_guid
true_token_guid
process_guid
boot_id

The timestamp column is always indexed and is not subject to adaptive management.

§3.3.6.2 Payload field indexes

Any payload field path that appears in a WHERE predicate is a candidate for an expression index. Expression indexes use SQLite's expression index feature with a registered msgpack_extract function:

CREATE INDEX idx_events_payload_granted_access ON events(msgpack_extract(payload, '$.granted_access'))

The expression index extracts the specified field from the msgpack payload on every INSERT and indexes the result. SQLite's query optimiser uses the expression index automatically when the same extraction expression appears in a WHERE clause.

Payload field indexes follow the same priority ordering, pressure-based shedding, and convergence rules as header column indexes. They are part of the same global desired index set.

The raw payload column MUST NOT receive a plain column index -- indexing an opaque blob is meaningless. Only expression indexes on specific payload field paths are created.
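
The following sketch shows the expression-index mechanics with Python's sqlite3 module. To stay within the standard library, the stand-in extraction function decodes JSON rather than msgpack; that substitution is purely an assumption of this example, as is the reduced events table. SQLite only accepts application-defined functions in index expressions when they are registered as deterministic.

```python
import json
import sqlite3

def extract(payload, path):
    """Stand-in for msgpack_extract: walk a '$.a.b' path through the payload.

    ASSUMPTION: the payload is JSON here, not msgpack, to keep the sketch
    free of third-party dependencies.
    """
    try:
        value = json.loads(payload)
        for key in path[2:].split("."):  # drop the leading "$."
            value = value[key]
    except (ValueError, KeyError, TypeError):
        return None
    # Only scalar values can be stored in an index entry.
    return value if isinstance(value, (int, float, str)) else None

conn = sqlite3.connect(":memory:")
# deterministic=True is required for use in an index expression.
conn.create_function("msgpack_extract", 2, extract, deterministic=True)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute(
    "CREATE INDEX idx_events_payload_granted_access "
    "ON events(msgpack_extract(payload, '$.granted_access'))"
)
conn.executemany(
    "INSERT INTO events(payload) VALUES (?)",
    [(json.dumps({"granted_access": n}).encode(),) for n in (1, 2, 2)],
)
# The WHERE clause repeats the indexed expression verbatim.
rows = conn.execute(
    "SELECT id FROM events "
    "WHERE msgpack_extract(payload, '$.granted_access') = 2 ORDER BY id"
).fetchall()
```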

§3.3.7 Index naming

Index names MUST follow the convention idx_events_<column> for header column indexes (e.g., idx_events_event_type, idx_events_process_guid). For payload expression indexes, dots in the field path are replaced with underscores: idx_events_payload_<path> (e.g., idx_events_payload_granted_access, idx_events_payload_source_name for the path source.name).
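
The naming convention reduces to a small helper. The function name index_name is hypothetical.

```python
def index_name(field_path: str, is_expression: bool) -> str:
    """Derive the index name for a header column or a payload field path."""
    if is_expression:
        # Dots in the payload path become underscores.
        return "idx_events_payload_" + field_path.replace(".", "_")
    return "idx_events_" + field_path
```

Under this mapping the payload paths source.name and source_name would produce the same index name, so an implementation may need to handle that collision.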

§3.3.8 Configuration

Key	Type	Default	Valid range	Description
AdaptiveIndexWindowHours	REG_DWORD	24	1--168	Rolling time window in hours over which query frequency is measured.
AdaptiveIndexPolicyIntervalMinutes	REG_DWORD	60	60--1440	How often the index policy logic recomputes the desired index set from the query frequency counters. Minimum 60 minutes to prevent index churn.
AdaptiveIndexCreateThreshold	REG_DWORD	100	10--10000	Number of queries filtering on a column within the window required to add it to the desired set.
AdaptiveIndexDropThreshold	REG_DWORD	10	1--1000	Number of queries filtering on a column within the window below which it is removed from the desired set. The drop threshold MUST be less than the create threshold to provide hysteresis.
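
The hysteresis between the two thresholds can be illustrated with a short sketch of the policy computation. The function name and data shapes are assumptions. The sketch reads "required to add" as count >= create threshold and "below which it is removed" as count < drop threshold.

```python
def compute_desired_set(counts: dict, current: set,
                        create_threshold: int = 100,
                        drop_threshold: int = 10) -> list:
    """Return the desired index set as a list ordered by priority.

    counts maps field path -> query count within the rolling window.
    current is the set of field paths already in the desired set.
    """
    if drop_threshold >= create_threshold:
        raise ValueError("drop threshold must be less than create threshold")
    keep = []
    for field, n in counts.items():
        newly_hot = n >= create_threshold
        still_warm = field in current and n >= drop_threshold
        if newly_hot or still_warm:
            keep.append(field)
    # Most frequently queried first; the name breaks ties deterministically.
    return sorted(keep, key=lambda f: (-counts[f], f))
```

A field with 50 queries in the window stays in the set if it is already indexed, but is not added if it is not. That gap is what prevents churn.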

§3.3.9 Persistence

The global desired index set, query frequency counters, and adaptive rollup state (§7.4) MUST be persisted to a dedicated metadata database in the event store directory. This database is independent of the shard databases and survives shard reconfiguration.

§3.3.9.1 Metadata database

The metadata database MUST be named eventd-meta.db and reside in the event store directory (alongside the shard databases). It is created on first startup if it does not exist.

The database MUST be opened in WAL mode with synchronous=NORMAL. It is low-volume (written once per policy interval, read at startup) and does not require per-transaction durability.

The database MUST contain the following tables:

index_counters table:

Column	Type	Description
`field_path`	TEXT PRIMARY KEY	The field name or payload path (e.g., `"event_type"`, `"granted_access"`, `"source.name"`).
`query_count`	INTEGER NOT NULL	Number of queries filtering on this field within the current rolling window.
`window_start`	INTEGER NOT NULL	Timestamp (nanoseconds since epoch) when the current window started.

desired_indexes table:

Column	Type	Description
`field_path`	TEXT PRIMARY KEY	The field name or payload path.
`priority`	INTEGER NOT NULL	Priority rank (lower number = higher priority = more frequently queried).
`is_expression`	INTEGER NOT NULL	1 if this is a payload expression index, 0 if a column index.

rollup_counters table:

Column	Type	Description
`function_window`	TEXT PRIMARY KEY	Composite key of function name and window size (e.g., `"avg_3600"` for AVG over 1 hour).
`query_count`	INTEGER NOT NULL	Number of queries using this function/window pair within the current rolling window.
`window_start`	INTEGER NOT NULL	Timestamp when the current window started.

desired_rollups table:

Column	Type	Description
`function_window`	TEXT PRIMARY KEY	Composite key matching `rollup_counters`.
`priority`	INTEGER NOT NULL	Priority rank.

meta table:

Column	Type	Description
`key`	TEXT PRIMARY KEY	Metadata key.
`value`	TEXT NOT NULL	Metadata value.

Required meta entries: schema_version (value 1), created_at (ISO 8601), admin_sd (self-relative Security Descriptor in binary, controlling who can execute the INDEX command and other administrative operations on the indexing policy).

The default admin_sd grants access to SYSTEM and Administrators. When processing an INDEX command, eventd MUST check the caller's token against this SD by calling kacs_access_check with EVENTD_READ as the desired access right.

§3.3.9.2 Concurrency

The metadata database has a single writer: the index/rollup policy logic thread. Query handlers write to in-memory counters only; the policy logic flushes counters to the database at each policy interval. Writer threads and query handlers read the desired index/rollup sets from memory, not from the database.

The metadata database is opened read-write by the policy logic thread and is not accessed by any other thread at the database level. No concurrency control beyond SQLite's built-in WAL mode is required.

§3.3.9.3 Startup

On startup, eventd MUST:

Open eventd-meta.db in the event store directory. Create it if it does not exist.
Verify the schema version. If unrecognised, log an error and recreate the database (adaptive state is lost but not critical).
Load index_counters and rollup_counters into the in-memory counter structures.
Load desired_indexes and desired_rollups into the in-memory desired sets.
Discover material indexes from each shard database's schema and compare against the desired sets.

The set of material indexes in each shard is discovered from the database schema on startup. eventd resumes convergence from whatever state each shard is in -- it does not drop or rebuild indexes on startup.

§3.4 3 Event storage

Retention

§3.4.1 Retention model

ⓘ Informative

The retention model in v0.23 is a deliberate early simplification. Future versions will introduce a significantly more sophisticated retention engine supporting precise query-like rules (e.g., "retain KACS events for 90 days, retain synthetic events for 7 days, retain events where origin_class == userspace for 14 days") and hot pruning of events during ingestion based on policy. The v0.23 model provides the minimum viable retention needed to prevent unbounded disk growth.

eventd MUST enforce retention policies that limit how long event data is stored. Retention prevents unbounded disk growth and ensures that old data is removed in a predictable, configurable manner.

Retention operates on a per-boot granularity for the primary data boundary and a time-based granularity within the current boot. Data from old boots is the first candidate for removal. Within the current boot, data older than the retention window is removed.

§3.4.2 Configuration

Key	Type	Default	Valid range	Description
EventRetentionDays	REG_DWORD	30	1--3650	Maximum age of events in days. Events older than this are eligible for deletion.
EventRetentionMaxBytes	REG_QWORD	0	0--18446744073709551615	Maximum total size of all event shard databases in bytes. 0 means no size limit. When exceeded, the oldest events are deleted until the total size is within the limit.
RetentionCheckIntervalMinutes	REG_DWORD	60	1--1440	How often the retention process runs, in minutes.

Both time-based and size-based retention limits are enforced. If both are configured, the more aggressive limit wins -- an event is deleted if it exceeds either threshold.

§3.4.3 Retention process

The retention process runs periodically on a background thread. It MUST NOT run on the writer threads or drain threads. It operates on one shard database at a time, using a separate read-write connection.

For each shard database:

Delete all rows from the events table whose timestamp is more than EventRetentionDays days before the current wall clock time. This covers KMES events, synthetic events, and gap records uniformly.
If EventRetentionMaxBytes is nonzero and the total size of all shard databases exceeds the limit, delete the oldest events (by timestamp) across all shards until the total size is within the limit.

Deletion MUST be performed in batches to avoid holding a long-running transaction that blocks the writer thread. Each batch deletes a bounded number of rows (implementation-defined) and commits before starting the next batch.
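
A minimal sketch of batched time-based deletion follows. The function name, the default batch size, and the now_ns parameter (included so the cutoff can be fixed in tests) are assumptions. The sketch relies on the events table having an ordinary SQLite rowid.

```python
import sqlite3
import time

NS_PER_DAY = 86_400 * 1_000_000_000

def delete_expired(conn: sqlite3.Connection, retention_days: int,
                   batch_rows: int = 1000, now_ns=None) -> int:
    """Delete events older than the retention window in bounded batches."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    cutoff = now_ns - retention_days * NS_PER_DAY
    total = 0
    while True:
        # Each pass through the loop is one short transaction: the `with`
        # block commits on exit, so no batch holds the write lock for long.
        with conn:
            cur = conn.execute(
                "DELETE FROM events WHERE rowid IN ("
                "  SELECT rowid FROM events WHERE timestamp < ? LIMIT ?)",
                (cutoff, batch_rows),
            )
        total += cur.rowcount
        if cur.rowcount < batch_rows:
            return total
```

The inner SELECT is served by idx_events_timestamp, so each batch touches only the rows it deletes.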

§3.4.4 Impact on writers

The retention process opens a separate read-write connection to the shard database. In WAL mode, a reader does not block the writer. However, a second writer would block. The retention process MUST coordinate with the shard's writer thread to avoid concurrent write transactions.

The simplest coordination mechanism is for the retention process to acquire a shard-level mutex before writing, and for the writer thread to briefly yield when the retention process needs to run. The retention process performs small, bounded delete batches and releases the mutex between batches, minimising writer thread stall time.

§3.4.5 Disk reclamation

Deleting rows from SQLite does not shrink the database file. Freed pages are reused for future inserts. To reclaim disk space, VACUUM must be run, but this rewrites the entire database and is expensive.

eventd SHOULD NOT run VACUUM automatically. Disk reclamation is an administrative operation triggered explicitly. The freed pages from retention deletes are recycled by subsequent inserts, which is sufficient for steady-state operation.

ⓘ Informative

On a system in steady state -- where events are ingested at roughly the same rate they are deleted by retention -- the database file size stabilises at approximately the retention window's worth of data. The freed pages from old events are reused by new inserts without the database file growing.

§3.5 3 Event storage

Boot Partitioning

§3.5.1 Boot ID

Every event and log record stored by eventd carries a boot_id: a 16-byte GUID that uniquely identifies the boot during which the record was produced. Metric samples do not carry one (§3.5.2). The boot ID is assigned by peinit at each boot and communicated to eventd at startup.

The boot ID serves two purposes:

Partitioning. Records from different boots are never interleaved in a meaningful sequence. KMES per-CPU sequence numbers reset to zero on each boot. Without a boot ID, sequence number 42 from boot A would be indistinguishable from sequence number 42 from boot B.
Lifecycle. Retention can use boot ID to efficiently delete all data from old boots as a single operation, rather than scanning by timestamp.

§3.5.2 Scope

Boot ID is written to the boot_id column in the event and log stores, but not in the metric store:

Event store: every row in the events table.
Log store: every row in the logs table.
Metric store: metrics do not carry boot_id per sample. The boot boundary is less meaningful for metrics because time series are continuous across restarts -- a gauge value is valid regardless of which boot produced it.

ⓘ Informative

The log store includes boot_id because log output may have different meaning across boots (a service may emit different logs depending on boot-time configuration). The metric store omits per-sample boot_id because metric time series are inherently continuous -- a CPU usage reading at 42% is equally valid regardless of boot context. If boot-scoped metric queries are needed, the timestamp can be correlated with the boot ID from the event store's synthetic startup/shutdown events.

§3.5.3 Sequence uniqueness (events only)

Within a single boot, an event is uniquely identified by the pair (cpu_id, sequence). Across boots, boot_id provides the disambiguating dimension, so the tuple (boot_id, cpu_id, sequence) is globally unique.

§3.5.4 Boot boundary detection

When eventd starts, it reads the current boot ID from peinit. If the boot ID differs from the boot ID of the most recently stored events (read from the shard databases), eventd has started in a new boot.

On a new boot, eventd MUST:

Reset all per-CPU sequence trackers to 0.
Record the new boot ID for all subsequent writes to the event and log stores.
Emit a synthetic.startup event with the new boot ID.

On a restart within the same boot (eventd crashed and was restarted by peinit), the boot ID matches and eventd MUST:

Restore per-CPU sequence trackers from the last persisted sequence numbers.
Continue writing with the existing boot ID.
Emit a synthetic.startup event noting the restart.
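
The two startup cases can be sketched as a single decision function. All names here are hypothetical, and the sketch assumes that a CPU with no persisted sequence number resumes from 0.

```python
from dataclasses import dataclass

@dataclass
class StartupPlan:
    restart: bool            # True for a restart within the same boot
    sequence_trackers: dict  # cpu_id -> sequence number to resume from

def plan_startup(current_boot_id: bytes, last_stored_boot_id,
                 last_persisted: dict, cpu_ids) -> StartupPlan:
    """Decide between a new-boot start and a same-boot restart.

    last_stored_boot_id is the boot ID of the most recently stored events,
    or None when the store is empty.
    """
    if last_stored_boot_id == current_boot_id:
        # Same boot: restore trackers from the last persisted sequences.
        trackers = {cpu: last_persisted.get(cpu, 0) for cpu in cpu_ids}
        return StartupPlan(restart=True, sequence_trackers=trackers)
    # New boot: KMES sequence numbers restart from zero on every CPU.
    return StartupPlan(restart=False,
                       sequence_trackers={cpu: 0 for cpu in cpu_ids})
```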

Section

4 Log ingestion

§4.1 4 Log ingestion

Overview

Logs are fundamentally text. A log entry is a line of text output by a service or system component, with light metadata attached by the ingestion layer. eventd does not parse or interpret log text -- if the text happens to be JSON or structured in some other way, that is the service's concern, not eventd's.

Structured observability data belongs in events (via KMES), not logs. The boundary is clear: events are typed, schema'd, kernel-stamped records with identity. Logs are human-readable text output. The "peiosification" of Linux software includes a translation layer that converts relevant log output into proper events where structured data is needed.

§4.1.1 Loss tolerance

Log loss is tolerable. Unlike events, where a lost record may represent a missed security audit entry, a lost log line is an inconvenience, not a failure. This tolerance shapes the entire log ingestion design:

The ingestion path does not track sequence numbers or detect gaps.
If the transport buffer fills, log records are dropped without notification.
No synthetic gap records are generated for lost logs.
eventd MAY maintain a dropped-log counter for observability, but this is not a normative requirement.

§4.1.2 Ingestion path

Logs are ingested over a Unix domain socket. The socket accepts datagrams from any process with the appropriate credentials. The primary log producer is peinit, which captures stdout and stderr from the services it manages and forwards them to eventd. Native Peios services MAY also write directly to the log socket.

peinit is not a privileged log source -- it uses the same socket and protocol as any other log producer. Its role is to bridge the gap between services that write to stdout/stderr (the standard Unix model) and eventd's log ingestion interface.

§4.2 4 Log ingestion

Transport

§4.2.1 Log socket

eventd MUST expose a Unix domain datagram socket for log ingestion. The socket path is configured via the LogSocketPath registry key under Machine\System\eventd\. There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

Datagram sockets are used rather than stream sockets because each log record is an independent message with no framing concerns. A datagram either arrives completely or not at all -- no partial reads, no length-prefix parsing, no connection state.

§4.2.2 Log record format

Each log record is a msgpack-encoded map. A datagram carries either one such map or, when batched (§4.2.4), an array of them. The fields of a record are:

Field	Type	Required	Description
`origin`	string	Yes	Name of the service or component that produced the log line. For peinit-forwarded logs, this is the service name. For direct logging, this is the identity of the sending process.
`is_error`	bool	Yes	True if the log line was captured from stderr or explicitly marked as error by the sender. False for stdout / normal output.
`message`	string	Yes	The log text. A single line of output.
`timestamp`	u64	No	Wall clock timestamp in nanoseconds since Unix epoch, as captured by the sender. If omitted, eventd uses its own wall clock at receipt time.
`job_id`	binary (16 bytes)	No	Correlation GUID identifying the supervised job that produced this line. peinit sets it when forwarding a service's stdout/stderr so log lines can be correlated to a specific job execution (PSD-007 §7.1). Omitted for direct logging and for output with no associated job.

peinit SHOULD include the timestamp captured at the time the line was read from the service's pipe, not the time it was forwarded to eventd. This preserves timing accuracy when peinit batches log records.

§4.2.3 Malformed input handling

If a datagram contains invalid msgpack (not decodable), the entire datagram MUST be dropped silently.

If a datagram contains a valid msgpack value that is not a map and not an array of maps, it MUST be dropped silently.

For a single record (map), if a required field is missing or has the wrong type (e.g., origin is an integer instead of a string), the record MUST be dropped silently.

For a batched datagram (array of maps), each record is validated independently. Invalid records MUST be dropped. Valid records in the same batch MUST still be processed. One malformed record MUST NOT cause the entire batch to be discarded.

The optional job_id field, when present, MUST be 16 bytes of binary. A present job_id of the wrong type or length MUST be ignored (treated as absent) rather than causing the record to be dropped -- a malformed correlation key must not cost the log line itself.

eventd MUST NOT emit synthetic events, log errors, or increment visible counters in response to malformed input. These are untrusted inputs from arbitrary senders -- reacting to invalid data would be a denial-of-service vector.
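
The validation rules above can be sketched as operating on the already-decoded msgpack value, with maps as dict, arrays as list, and binary as bytes. The function names are hypothetical. The rules do not say what to do with an optional timestamp of the wrong type; treating it as absent is an assumption of this sketch.

```python
def validate_record(obj):
    """Return a normalised record dict, or None if the record must be dropped."""
    if not isinstance(obj, dict):
        return None
    origin = obj.get("origin")
    is_error = obj.get("is_error")
    message = obj.get("message")
    if not (isinstance(origin, str) and isinstance(is_error, bool)
            and isinstance(message, str)):
        return None  # required field missing or of the wrong type
    record = {"origin": origin, "is_error": is_error, "message": message}

    # ASSUMPTION: an invalid optional timestamp is treated as absent.
    ts = obj.get("timestamp")
    if isinstance(ts, int) and not isinstance(ts, bool) and 0 <= ts < 2**64:
        record["timestamp"] = ts

    # A malformed job_id is ignored rather than costing the log line.
    job_id = obj.get("job_id")
    if isinstance(job_id, (bytes, bytearray)) and len(job_id) == 16:
        record["job_id"] = bytes(job_id)
    return record

def accept_datagram(decoded) -> list:
    """Return the valid records carried by one decoded datagram."""
    if isinstance(decoded, dict):
        candidates = [decoded]
    elif isinstance(decoded, list) and all(isinstance(x, dict) for x in decoded):
        candidates = decoded
    else:
        return []  # neither a map nor an array of maps: drop silently
    return [r for r in map(validate_record, candidates) if r is not None]
```

Nothing is logged or counted on the failure paths, in keeping with the rule that malformed input must not produce any visible reaction.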

ⓘ Informative

The is_error field is deliberately a boolean, not a severity level. peinit can distinguish stdout from stderr and nothing more. Richer severity (debug, info, warn, error, fatal) is a log-framework concern -- services that encode severity in their output can do so textually. Services that need structured severity levels should emit events, not logs.

§4.2.4 Batching

Senders MAY batch multiple log records into a single datagram by sending a msgpack array of log record maps instead of a single map. eventd MUST accept both forms -- a single map (one record) or an array of maps (multiple records).

Batching amortises syscall overhead. peinit SHOULD batch log records when forwarding under sustained load. The batch size is bounded by the maximum datagram size supported by the Unix socket (implementation-defined, typically 212992 bytes on Linux).

§4.2.5 Dropped records

If the socket receive buffer is full when a sender transmits a datagram, the kernel drops the datagram silently (standard Unix datagram socket behavior). Neither the sender nor eventd is notified.

eventd SHOULD NOT increase the socket receive buffer beyond a reasonable size. If eventd cannot drain the socket fast enough, logs are dropped. This is by design -- log ingestion MUST NOT exert backpressure on senders. A service MUST NOT stall because eventd is slow.

ⓘ Informative

The default Linux SO_RCVBUF for Unix datagram sockets is approximately 212 KB, which holds roughly 1000 typical log records. During a batch commit (1-10ms), this buffer is the only cushion. Under burst conditions (e.g., a service dumping a stack trace), some datagrams will be dropped. This is the intended degradation mode for a loss-tolerant data path -- the alternative (backpressure or unbounded buffering) would violate the design constraint that log ingestion must never slow senders.

§4.2.6 Direct service logging

A native Peios service that wants to log directly to eventd (bypassing peinit) uses the same socket and the same record format. The service sets the origin field to its own service name. No additional setup or negotiation is required.

Direct logging is a convenience for services that want more control over log metadata than stdout/stderr provides. It is not required -- most services log via stdout/stderr and peinit handles the rest.

§4.3 4 Log ingestion

Log Writer

§4.3.1 Ingestion thread

eventd MUST run a dedicated log ingestion thread that reads datagrams from the log socket and writes log records to the log store. The log ingestion thread is independent of the event drain threads and event writer threads -- log ingestion does not contend with event ingestion.

ⓘ Informative

The log ingestion thread performs both socket reads and SQLite writes on a single thread. During a batch commit, the socket is not being drained and datagrams may be dropped. Splitting into separate reader and writer threads (with a bounded handoff channel, as the event path uses) would decouple these operations. The single-thread model is a deliberate simplification for v0.23: log loss is tolerable, log volumes are typically lower than event volumes, and the single-thread model avoids handoff channel complexity. If log throughput becomes a bottleneck, log store sharding (analogous to event store sharding) is a more impactful improvement than thread splitting.

§4.3.2 Batched writes

The log writer uses the same adaptive batch sizing approach as the event writer (§2.4). Log records are accumulated into a transaction and committed when either the batch size or latency threshold is reached.

The log writer's batch parameters are configured independently from the event writer:

Key	Type	Default	Valid range	Description
LogMaxBatchSize	REG_DWORD	5000	100--100000	Maximum number of log records in a single transaction.
LogMaxBatchLatencyMs	REG_DWORD	500	10--5000	Maximum time in milliseconds between the first log record entering a batch and the batch being committed.
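
The two commit triggers can be sketched as a small accumulator. The class name and method names are hypothetical, and the adaptive sizing of §2.4 is not reproduced here; the sketch shows only the size and latency thresholds.

```python
class LogBatch:
    """Accumulates log records until a size or latency threshold is reached."""

    def __init__(self, max_size: int = 5000, max_latency_ms: int = 500):
        self.max_size = max_size
        self.max_latency_ms = max_latency_ms
        self.records = []
        self.first_at_ms = None  # arrival time of the first record in the batch

    def add(self, record, now_ms: int) -> None:
        if not self.records:
            self.first_at_ms = now_ms
        self.records.append(record)

    def should_commit(self, now_ms: int) -> bool:
        if not self.records:
            return False
        full = len(self.records) >= self.max_size
        stale = now_ms - self.first_at_ms >= self.max_latency_ms
        return full or stale

    def take(self) -> list:
        """Hand the accumulated records to the writer and start a new batch."""
        records, self.records, self.first_at_ms = self.records, [], None
        return records
```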

ⓘ Informative

The default log batch latency (500ms) is higher than the event batch latency (100ms) because log loss is tolerable and power-loss resilience is less critical for logs than for events. Larger, less frequent batches improve throughput efficiency for the common case where log volume is moderate.

§4.3.3 SQLite configuration

The log store database MUST be opened in WAL mode. The synchronous pragma SHOULD be set to NORMAL rather than FULL. Per-transaction fsync is not required for logs because log loss on power failure is acceptable. NORMAL mode syncs at checkpoint time, providing durability against process crashes without the per-commit fsync overhead.

This is a deliberate divergence from the event store, which uses synchronous = FULL for per-transaction durability. The different durability requirements of events and logs justify different SQLite configurations.

Section

5 Log storage

§5.1 5 Log storage

Schema

§5.1.1 Log table

The log store is a single SQLite database. It MUST contain a logs table with the following schema:

Column	Type	Description
`id`	INTEGER PRIMARY KEY	SQLite rowid. Auto-assigned, monotonically increasing.
`boot_id`	BLOB NOT NULL	16-byte boot ID GUID identifying which boot this log entry belongs to.
`timestamp`	INTEGER NOT NULL	Wall clock time in nanoseconds since Unix epoch. If the sender provided a timestamp, that value is used. Otherwise, eventd's receipt time is used.
`origin`	TEXT NOT NULL	Name of the service or component that produced the log line.
`is_error`	INTEGER NOT NULL	1 if the log line came from stderr or was explicitly marked as error. 0 otherwise.
`message`	TEXT NOT NULL	The log text.
`job_id`	BLOB	16-byte GUID of the job that produced this line, when peinit forwarded it with a job correlation. NULL for direct logging or output with no associated job.

The log schema is deliberately minimal. Logs are text with light metadata: the originating service, an error flag, a timestamp, and an optional job-correlation GUID. There are no payload blobs and no origin classes; the only identity-like field is the optional job_id correlation key. Services that need richer structure should emit events.

§5.1.2 Write-time indexes

At database creation, eventd MUST create the following indexes:

idx_logs_timestamp on logs(timestamp) -- required for time-range queries.
idx_logs_origin on logs(origin) -- required for service-filtered queries, the most common log access pattern ("show me logs from service X").
idx_logs_job_id on logs(job_id) WHERE job_id IS NOT NULL -- a partial index supporting per-job log queries ("show me logs for job X"). Partial because direct-logged lines carry no job_id, so only correlated lines are indexed.

The log store does not use adaptive indexing. The schema is narrow and the three write-time indexes cover the dominant query patterns. Additional indexes are not expected to provide meaningful benefit.

ⓘ Informative

The idx_logs_origin index adds write amplification (each INSERT updates two B-trees instead of one). In practice the overhead is modest: the origin column has low cardinality (tens of distinct service names), so the index pages stay in SQLite's page cache and insertions are cheap. This is a deliberate trade-off: "show me logs from service X" is the most common log query pattern and must be fast without a full table scan.

§5.1.3 Schema versioning

The log store database MUST contain a metadata table with the same structure as the event store (§3.1). The schema_version for the log store is 1.

eventd MUST check the schema version on startup and MUST NOT write to the database if the version is unrecognised.

§5.2 5 Log storage

Database Lifecycle

§5.2.1 Log store path

The log store database resides at a path configured via the LogStorePath registry key under Machine\System\eventd\. The value MUST be a file path (not a directory, unlike the event store which uses a directory of shards). There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

eventd MUST create the database file and its parent directories if they do not exist.

§5.2.2 Database creation

When the log store database does not exist at startup, eventd MUST create it with:

WAL mode enabled (PRAGMA journal_mode=WAL).
Synchronous mode set to NORMAL (PRAGMA synchronous=NORMAL).
The logs and metadata tables created as defined in §5.1.
The idx_logs_timestamp, idx_logs_origin, and idx_logs_job_id indexes created as defined in §5.1.2.
The schema_version and created_at metadata entries populated.
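The creation steps above can be sketched with Python's standard sqlite3 module. This is a minimal illustration, not the eventd implementation: the function name create_log_store is hypothetical, and the key/value layout of the metadata table is an assumption (the actual structure is defined in §3.1). The sketch creates all three write-time indexes listed in §5.1.2.

```python
import sqlite3

LOG_SCHEMA = """
CREATE TABLE IF NOT EXISTS logs (
    id        INTEGER PRIMARY KEY,
    boot_id   BLOB    NOT NULL,
    timestamp INTEGER NOT NULL,
    origin    TEXT    NOT NULL,
    is_error  INTEGER NOT NULL,
    message   TEXT    NOT NULL,
    job_id    BLOB
);
CREATE INDEX IF NOT EXISTS idx_logs_timestamp ON logs(timestamp);
CREATE INDEX IF NOT EXISTS idx_logs_origin    ON logs(origin);
CREATE INDEX IF NOT EXISTS idx_logs_job_id    ON logs(job_id) WHERE job_id IS NOT NULL;
-- Assumed key/value layout; the real metadata structure is defined in section 3.1.
CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, value TEXT NOT NULL);
"""


def create_log_store(path: str, now_ns: int) -> sqlite3.Connection:
    """Create (or open) the log store with WAL journaling and synchronous=NORMAL."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.executescript(LOG_SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO metadata(key, value) VALUES (?, ?)",
        [("schema_version", "1"), ("created_at", str(now_ns))],
    )
    conn.commit()
    return conn
```

WAL mode requires a file-backed database; an in-memory database reports a different journal mode.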

§5.2.3 Database opening

When the log store database exists at startup, eventd MUST:

Open the database in WAL mode.
Set synchronous mode to NORMAL.
Verify the schema version. If unrecognised, eventd MUST log an error and MUST NOT write to the database. The database remains available for read-only queries.
Verify structural integrity (required tables and indexes exist).

§5.2.4 Concurrency

The log store has one read-write connection (owned by the log writer thread) and zero or more read-only connections (owned by query handlers). WAL mode permits concurrent reads alongside the single writer.

§5.2.5 WAL checkpointing

The log writer thread MUST trigger WAL checkpoints when the WAL exceeds a size threshold, using SQLITE_CHECKPOINT_PASSIVE mode. The threshold is implementation-defined.
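One possible shape for the size-triggered checkpoint is sketched below. The 4 MiB threshold and the function name maybe_checkpoint are hypothetical (the spec leaves the threshold implementation-defined), and the sketch assumes the WAL file sits next to the database with SQLite's usual "-wal" suffix.

```python
import os
import sqlite3

# Hypothetical threshold; the spec leaves the value implementation-defined.
WAL_CHECKPOINT_BYTES = 4 * 1024 * 1024


def maybe_checkpoint(conn: sqlite3.Connection, db_path: str,
                     threshold: int = WAL_CHECKPOINT_BYTES):
    """Run a PASSIVE checkpoint if the WAL file is larger than `threshold`.

    Returns the (busy, wal_frames, checkpointed_frames) row, or None if skipped.
    """
    try:
        wal_size = os.path.getsize(db_path + "-wal")
    except FileNotFoundError:
        return None  # no WAL file yet, nothing to checkpoint
    if wal_size <= threshold:
        return None
    return conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
```

A PASSIVE checkpoint does not wait for readers, so it may leave some frames in the WAL; the writer simply tries again after a later commit.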

§5.3 5 Log storage

Retention

ⓘ Informative

As with event retention (§3.4), the log retention model in v0.23 is a deliberate early simplification. Future versions will introduce more sophisticated retention rules.

§5.3.1 Configuration

Key	Type	Default	Valid range	Description
LogRetentionDays	REG_DWORD	14	1--3650	Maximum age of log entries in days. Entries older than this are eligible for deletion.
LogRetentionMaxBytes	REG_QWORD	0	0--18446744073709551615	Maximum size of the log store database in bytes. 0 means no size limit.

Both limits are enforced. The more aggressive limit wins.

ⓘ Informative

The default log retention (14 days) is shorter than the default event retention (30 days), reflecting the lower importance of historical log data relative to audit events.

§5.3.2 Retention process

The retention process runs on the same background thread as event retention (§3.4), operating on the log store database after completing event retention.

Delete all rows from the logs table whose timestamp is more than LogRetentionDays days before the current wall clock time.
If LogRetentionMaxBytes is nonzero and the log store database exceeds the limit, delete the oldest log entries (by timestamp) until the size is within the limit.

Deletion MUST be performed in batches to avoid holding a long-running transaction that blocks the log writer thread.
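A minimal sketch of batched retention follows. The function names and the batch size are hypothetical. It also makes one assumption the spec does not state: since VACUUM is not run, the file does not shrink when rows are deleted, so "database size" is measured here as pages in use (page_count minus freelist_count, times page_size).

```python
import sqlite3

NS_PER_DAY = 86_400 * 1_000_000_000


def live_bytes(conn):
    # Assumption: size means pages in use, because the file itself
    # does not shrink on DELETE without VACUUM.
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return (page_count - free) * page_size


def delete_oldest(conn, batch, cutoff_ns=None):
    """Delete up to `batch` of the oldest rows; returns the number deleted."""
    where = "WHERE timestamp < ?" if cutoff_ns is not None else ""
    params = (cutoff_ns, batch) if cutoff_ns is not None else (batch,)
    cur = conn.execute(
        f"DELETE FROM logs WHERE id IN "
        f"(SELECT id FROM logs {where} ORDER BY timestamp LIMIT ?)",
        params,
    )
    conn.commit()  # one short transaction per batch
    return cur.rowcount


def run_log_retention(conn, now_ns, retention_days, max_bytes, batch=1000):
    # Step 1: age-based deletion.
    cutoff = now_ns - retention_days * NS_PER_DAY
    while delete_oldest(conn, batch, cutoff) > 0:
        pass
    # Step 2: size-based deletion; 0 means no size limit.
    if max_bytes:
        while live_bytes(conn) > max_bytes and delete_oldest(conn, batch) > 0:
            pass
```

Committing after every batch keeps each write transaction short, so the log writer is never blocked for longer than one batch.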

§5.3.3 Disk reclamation

As with the event store, VACUUM is not run automatically. Freed pages are recycled by subsequent inserts.

Section

6 Metric ingestion

§6.1 6 Metric ingestion

Overview

Metrics are numeric measurements over time. CPU usage, memory consumption, request counts, queue depths, error rates -- any quantity that varies and is worth tracking. Metrics are fundamentally different from events and logs: they are dense time-series data (many samples of the same measurement) rather than discrete occurrences or text output.

eventd is a metric sink, not a metric collector. Services and collection agents push metrics to eventd. eventd does not scrape endpoints, read from /proc, or poll for data. The collection mechanism (e.g., a collectord daemon that gathers system metrics) is outside eventd's scope.

§6.1.1 Data model

A metric data point consists of:

Name -- a dot-separated string identifying the measurement (e.g., cpu.usage, http.requests.total, disk.read.bytes).
Labels -- a set of key-value string pairs providing dimensions (e.g., core="0", service="loregd", method="GET"). Labels distinguish different instances of the same measurement. Label keys and values MUST be non-empty UTF-8 strings. Label keys and values MUST NOT contain the characters = (0x3D) or , (0x2C), as these are used as delimiters in the canonical label representation (§7.1). Records with invalid label keys or values MUST be dropped silently.
Type -- one of counter, gauge, or histogram. The type determines how the metric is interpreted by the query engine.
Timestamp -- wall clock time of the measurement.
Value -- the numeric measurement. The encoding depends on the type.

The combination of name and labels uniquely identifies a time series. cpu.usage{core="0"} and cpu.usage{core="1"} are distinct time series.

ⓘ Informative

Labels with unbounded cardinality (per-request IDs, user-provided strings, timestamps as label values) cause series table growth proportional to the number of unique label combinations. Each unique combination creates a new series row and a new entry in the bounded series cache. When the number of active series exceeds the cache size, every collection cycle evicts and reloads the overflow, causing sustained SQLite lookups on the single metric ingestion thread. This is a well-known anti-pattern in metrics systems. Emitters SHOULD use labels with bounded, low-cardinality values (e.g., CPU core ID, service name, HTTP method). High-cardinality dimensions belong in event payloads, not metric labels.

§6.1.2 Metric types

§6.1.2.1 Counter

A monotonically increasing value that resets to zero when the emitting process restarts. Used for cumulative quantities: total requests served, total bytes transmitted, total errors encountered.

The raw counter value is rarely queried directly. The query engine derives rate (change per unit time) from counter samples, handling resets correctly.

A counter value MUST be a non-negative 64-bit floating-point number.
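To illustrate how a rate can be derived from counter samples across a reset, here is a small sketch. The reset rule is an assumption, not something this section specifies: a drop in value is taken to mean the emitter restarted from zero, so the whole post-reset value counts as new increase. The function names are hypothetical.

```python
def counter_increase(samples):
    """Reset-adjusted increase across (timestamp_ns, value) samples sorted by time."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # Assumption: a drop means the emitter restarted from zero,
        # so the current value is the increase since the reset.
        total += cur - prev if cur >= prev else cur
    return total


def counter_rate(samples):
    """Average per-second rate over the span covered by the samples."""
    if len(samples) < 2:
        return None
    elapsed_s = (samples[-1][0] - samples[0][0]) / 1e9
    if elapsed_s <= 0:
        return None
    return counter_increase(samples) / elapsed_s
```

For samples 100, 150, 30, 60 taken 10 seconds apart, the increase is 50 + 30 + 30 = 110 over 30 seconds.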

§6.1.2.2 Gauge

A point-in-time value that can increase or decrease arbitrarily. Used for current state: CPU usage percentage, memory in use, queue depth, temperature.

The raw gauge value is the measurement. Aggregation over time uses min, max, and average -- not rate.

A gauge value MUST be a 64-bit floating-point number (may be negative).

§6.1.2.3 Histogram

A distribution of observed values across predefined buckets. Used for latency, request sizes, and other quantities where the distribution matters more than the average.

A histogram value consists of:

An array of bucket boundaries (upper bounds), monotonically increasing.
An array of cumulative counts, one per bucket. Each count represents the number of observations less than or equal to the corresponding boundary.
A total count of all observations.
A sum of all observed values.

Bucket boundaries are defined by the emitter and MUST be consistent across all samples of the same time series. If bucket boundaries change, it is treated as a new time series.
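The following sketch shows how an emitter might build a histogram value of this shape from raw observations. The function name build_histogram is hypothetical. It assumes that observations above the last boundary are reflected only in the total count, since there is one count per declared boundary.

```python
import bisect


def build_histogram(boundaries, observations):
    """Build cumulative bucket counts for sorted upper-bound `boundaries`."""
    per_bucket = [0] * len(boundaries)
    for x in observations:
        i = bisect.bisect_left(boundaries, x)  # first boundary >= x
        if i < len(boundaries):
            per_bucket[i] += 1
        # Assumption: values above the last boundary only affect total_count.
    counts, running = [], 0
    for n in per_bucket:
        running += n
        counts.append(running)  # cumulative: observations <= this boundary
    return {
        "boundaries": list(boundaries),
        "counts": counts,
        "total_count": len(observations),
        "sum": float(sum(observations)),
    }
```

With boundaries 0.1, 0.5, 1.0 and observations 0.05, 0.1, 0.3, 0.7, 2.0, the cumulative counts are 2, 3, 4 and the total count is 5.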

§6.1.3 Loss tolerance

Metric loss has similar tolerance characteristics to log loss. A missed data point creates a gap in the time series -- the gap is visible but not catastrophic. The query engine interpolates or indicates the gap. Services MUST NOT assume lossless metric delivery.

§6.2 6 Metric ingestion

Transport

§6.2.1 Metric socket

eventd MUST expose a Unix domain datagram socket for metric ingestion. The socket path is configured via the MetricSocketPath registry key under Machine\System\eventd\. There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

Datagram sockets are used for the same reasons as log ingestion (§4.2): each metric submission is an independent message with no framing concerns, and dropped datagrams under backpressure are acceptable.

§6.2.2 Metric record format

Each datagram is a single msgpack-encoded map or an array of maps (batched submission). Each map represents one data point for a single time series:

Field	Type	Required	Description
`name`	string	Yes	Metric name. Dot-separated hierarchical string (e.g., `cpu.usage`, `http.requests.total`).
`labels`	map	No	Key-value string pairs providing dimensions. If omitted, the time series has no labels.
`type`	string	Yes	One of `"counter"`, `"gauge"`, or `"histogram"`.
`timestamp`	u64	No	Wall clock timestamp in nanoseconds since Unix epoch. If omitted, eventd uses its receipt time.
`value`	varies	Yes	The measurement. For counter and gauge: a single f64. For histogram: a map (see below).

§6.2.2.1 Histogram value format

Field	Type	Description
`boundaries`	array of f64	Bucket upper bounds, monotonically increasing.
`counts`	array of u64	Cumulative count per bucket. Length MUST equal the length of `boundaries`.
`total_count`	u64	Total number of observations.
`sum`	f64	Sum of all observed values.

§6.2.2.2 Batching

Senders MAY batch multiple metric records into a single datagram by sending a msgpack array of maps. eventd MUST accept both a single map and an array of maps.

Batching is especially valuable for metrics because collection agents typically gather many metrics simultaneously (e.g., collectord reading all CPU cores, all disk devices, and all network interfaces in one pass).

§6.2.3 Metric naming conventions

eventd does not enforce naming conventions. The following conventions are recommended but not normative:

Dot-separated hierarchy: system.cpu.usage, service.http.requests
Units as the final component: disk.read.bytes, request.duration.seconds
Counter names should reflect the cumulative nature: http.requests.total, errors.total

§6.2.4 Malformed input handling

The same rules as log ingestion (§4.2) apply:

Invalid msgpack: drop the entire datagram silently.
Valid msgpack but not a map or array of maps: drop silently.
Map missing required fields or with wrong types: drop the record silently.
Histogram value with mismatched boundaries and counts array lengths, or non-monotonic boundaries: drop the record silently.
Batched datagrams: invalid records are dropped individually; valid records in the same batch are still processed.
eventd MUST NOT emit events or log errors in response to malformed metric input.
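The per-record drop rules can be sketched as a validation pass over an already-decoded datagram. This sketch assumes msgpack decoding has happened elsewhere and produced plain Python dicts and lists; the function names are hypothetical, and accepting integers where f64 is expected is a leniency chosen for the example.

```python
VALID_TYPES = ("counter", "gauge", "histogram")


def _is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def _valid_labels(labels):
    if not isinstance(labels, dict):
        return False
    for key, value in labels.items():
        for s in (key, value):
            if not isinstance(s, str) or not s or "=" in s or "," in s:
                return False
    return True


def _valid_histogram(value):
    if not isinstance(value, dict):
        return False
    b, c = value.get("boundaries"), value.get("counts")
    if not isinstance(b, list) or not isinstance(c, list) or len(b) != len(c):
        return False
    if not all(_is_number(x) for x in b) or not all(isinstance(n, int) for n in c):
        return False
    if any(lo >= hi for lo, hi in zip(b, b[1:])):
        return False  # boundaries must be strictly increasing
    return isinstance(value.get("total_count"), int) and _is_number(value.get("sum"))


def is_valid_record(rec):
    if not isinstance(rec, dict) or not isinstance(rec.get("name"), str):
        return False
    if rec.get("type") not in VALID_TYPES:
        return False
    if "labels" in rec and not _valid_labels(rec["labels"]):
        return False
    if rec["type"] == "histogram":
        return _valid_histogram(rec.get("value"))
    value = rec.get("value")
    if not _is_number(value):
        return False
    return rec["type"] != "counter" or value >= 0


def accept(decoded):
    """Return the valid records; invalid ones are dropped individually and silently."""
    records = decoded if isinstance(decoded, list) else [decoded]
    return [r for r in records if is_valid_record(r)]
```

A batch holding one good gauge record, one record with an "=" in a label key, and one histogram with decreasing boundaries yields only the gauge record.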

§6.2.5 Dropped records

As with log ingestion, if the socket receive buffer is full, datagrams are dropped silently. Metric ingestion MUST NOT exert backpressure on senders.

§6.3 6 Metric ingestion

Metric Writer

§6.3.1 Ingestion thread

eventd MUST run a dedicated metric ingestion thread that reads datagrams from the metric socket and writes data points to the metric store. The metric ingestion thread is independent of the event and log ingestion paths.

§6.3.2 Processing

For each received metric record, the ingestion thread:

Resolves the time series by name and labels using the in-memory series cache (§7.1). If the series does not exist, a new row is inserted into the series table with the type from the record, and the cache is updated.
Validates that the record's type matches the series type. If the type does not match (e.g., a gauge sample for a series registered as a counter), the record MUST be dropped silently. The series type is set at creation time and is immutable -- a series cannot change type.
Inserts a row into the samples table with the resolved series_id, timestamp, and value. For histogram-type metrics, the histogram data is encoded as msgpack and stored in the histogram_data column.

§6.3.3 Batched writes

The metric writer uses the same adaptive batch sizing approach as the event and log writers. Metric data points are accumulated into a transaction and committed when either the batch size or latency threshold is reached.

The metric writer's batch parameters are configured independently:

Key	Type	Default	Valid range	Description
MetricMaxBatchSize	REG_DWORD	5000	100--100000	Maximum metric samples per transaction.
MetricMaxBatchLatencyMs	REG_DWORD	1000	10--5000	Maximum ms before a metric batch is committed.

ⓘ Informative

The default metric batch latency (1000ms) is higher than both the event and log defaults. Metrics are typically sampled at 15-second intervals, so a 1-second commit window accumulates multiple samples efficiently without meaningful latency impact. Under burst conditions (many metrics arriving simultaneously from a collection sweep), the batch size limit ensures timely commits.
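The two commit triggers (batch size and latency since the first record in the batch) can be sketched as below. The class name BatchAccumulator is hypothetical, and the sketch shows only the two thresholds, not the adaptive sizing shared with the event and log writers. The clock is injectable so the behaviour can be exercised without sleeping.

```python
import time


class BatchAccumulator:
    """Collects items and reports when the batch should be committed."""

    def __init__(self, max_size, max_latency_ms, clock=time.monotonic):
        self.max_size = max_size
        self.max_latency_s = max_latency_ms / 1000.0
        self.clock = clock
        self.items = []
        self.first_at = None  # time the first item entered the current batch

    def add(self, item):
        if not self.items:
            self.first_at = self.clock()
        self.items.append(item)

    def should_commit(self):
        if not self.items:
            return False
        too_big = len(self.items) >= self.max_size
        too_old = self.clock() - self.first_at >= self.max_latency_s
        return too_big or too_old

    def drain(self):
        """Hand the batch to the caller (who commits it) and start a new one."""
        batch, self.items, self.first_at = self.items, [], None
        return batch
```

With the defaults from the table above, a writer would construct it as BatchAccumulator(5000, 1000) and commit whenever should_commit() returns true.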

§6.3.4 SQLite configuration

The metric store database MUST be opened in WAL mode. The synchronous pragma SHOULD be set to NORMAL. Per-transaction fsync is not required for metrics because metric loss on power failure is acceptable.

§6.3.5 Series cache

The in-memory series cache maps (name, canonical labels, boundaries hash) to series_id. For counter and gauge series, the boundaries hash component is absent. The cache is bounded by MetricSeriesCacheSize (default 50,000 entries) and uses LRU eviction. Cache hits resolve series in a hash table lookup. Cache misses fall back to a SQLite SELECT and insert the result into the cache, evicting the least recently used entry if full. See §7.1 for details.

Section

7 Metric storage

§7.1 7 Metric storage

Schema

§7.1.1 Storage model

The metric store uses a single SQLite database. Unlike event and log storage, which store individual records as rows, the metric store is organised around time series. A time series is identified by its metric name and labels (plus the bucket boundaries for histograms); its type is fixed when the series is created. Individual data points (samples) are appended to their time series over time.

§7.1.2 Series table

The series table maintains the registry of known time series:

Column	Type	Description
`id`	INTEGER PRIMARY KEY	Series identifier. Auto-assigned. Used as a foreign key in the samples table.
`name`	TEXT NOT NULL	Metric name (e.g., `cpu.usage`).
`labels`	TEXT NOT NULL	Canonical label representation. Labels are sorted by key and encoded as a comma-separated `key=value` string (e.g., `core=0,host=server1`). The empty string represents no labels.
`type`	INTEGER NOT NULL	Metric type: 0 = counter, 1 = gauge, 2 = histogram.
`label_hash`	INTEGER NOT NULL	Hash of the canonical label string. Used for fast lookup.
`boundaries_hash`	INTEGER	Hash of the canonical bucket boundary representation. NULL for counter and gauge series. Non-NULL for histogram series.

The table MUST have a UNIQUE(name, labels, boundaries_hash) constraint. For histogram series, the constraint enforces uniqueness directly. For counter and gauge series, boundaries_hash is NULL, and SQLite treats NULLs as distinct in UNIQUE constraints, so the constraint alone does not reject a duplicate (name, labels) pair. For those series, uniqueness is guaranteed by the single-writer design and by the series resolution logic (§7.1.5), which looks up an existing row before inserting. The label_hash and boundaries_hash columns accelerate lookups but are not uniqueness constraints -- collisions are resolved by comparing the full labels string. The hash algorithm is implementation-defined. Lookups MUST always verify against the full string after hash narrowing. Hash values are not portable across implementations -- a database restored from backup on a different implementation will produce correct results (hash misses fall through to full string comparison) but may have degraded lookup performance until series are re-inserted.
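The NULL behaviour of the UNIQUE constraint can be observed directly with Python's sqlite3 module. The snippet below is only a demonstration; the helper try_insert and the literal values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE series (
        id              INTEGER PRIMARY KEY,
        name            TEXT    NOT NULL,
        labels          TEXT    NOT NULL,
        type            INTEGER NOT NULL,
        label_hash      INTEGER NOT NULL,
        boundaries_hash INTEGER,
        UNIQUE(name, labels, boundaries_hash)
    )""")


def try_insert(metric_type, boundaries_hash):
    """Insert the same (name, labels) pair; report whether SQLite accepted it."""
    try:
        conn.execute(
            "INSERT INTO series(name, labels, type, label_hash, boundaries_hash) "
            "VALUES ('m', 'core=0', ?, 1, ?)",
            (metric_type, boundaries_hash),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Gauge rows (boundaries_hash NULL): the second insert is NOT rejected.
gauge_results = [try_insert(1, None), try_insert(1, None)]
# Histogram rows (boundaries_hash non-NULL): the duplicate IS rejected.
histogram_results = [try_insert(2, 42), try_insert(2, 42)]
```

Both gauge inserts succeed, while the second histogram insert raises an integrity error. This is why counter and gauge uniqueness has to come from the resolution logic rather than from the constraint.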

ⓘ Informative

FNV-1a (32-bit or 64-bit) is a good default choice for label_hash and boundaries_hash. It is simple, fast, well-distributed for short strings, and has no external dependencies. The UNIQUE constraint is enforced by SQLite as a defensive measure; the single-writer design prevents duplicates at the application level, but the constraint protects against future changes that introduce additional write paths.
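A 64-bit FNV-1a over the canonical label string might look as follows. The helper names are hypothetical. The conversion to a signed value is included because SQLite INTEGER columns hold signed 64-bit integers, while FNV-1a produces an unsigned result.

```python
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def to_sqlite_int(h: int) -> int:
    """Reinterpret an unsigned 64-bit hash as a signed 64-bit integer."""
    return h - (1 << 64) if h >= (1 << 63) else h


def canonical_labels(labels: dict) -> str:
    """Sort labels by key and join them as comma-separated key=value pairs."""
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


label_hash = to_sqlite_int(
    fnv1a_64(canonical_labels({"host": "server1", "core": "0"}).encode("utf-8"))
)
```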

§7.1.3 Samples table

The samples table stores individual data points:

Column	Type	Description
`series_id`	INTEGER NOT NULL	Foreign key referencing `series(id)`.
`timestamp`	INTEGER NOT NULL	Wall clock time in nanoseconds since Unix epoch.
`value`	REAL NOT NULL	The numeric value. For counters and gauges, this is the raw value. For histograms, this column stores 0 and the histogram data is stored in `histogram_data`.
`histogram_data`	BLOB	Msgpack-encoded histogram value (boundaries, counts, total_count, sum). NULL for counter and gauge samples.

§7.1.4 Write-time indexes

At database creation, eventd MUST create the following indexes:

idx_samples_series_timestamp on samples(series_id, timestamp) -- the primary query pattern is "give me samples for series X in time range Y." This composite index supports both series lookup and time range filtering in a single index scan.
idx_series_name on series(name) -- required for metric name lookups.
idx_series_label_hash on series(label_hash) -- required for fast series resolution when ingesting data points.

§7.1.5 Series resolution

When a data point arrives, eventd MUST resolve it to a series ID:

Compute the canonical label string (sort labels by key, encode as key=value pairs).
Hash the canonical label string.
For histogram samples, compute the boundaries hash from the canonical boundary representation (sorted array of boundary values). For counter and gauge samples, the boundaries hash is NULL.
Look up the series table by name, label_hash, and (for histograms) boundaries_hash.
If a match is found, verify the full labels string matches (hash collision check). Use the existing series_id.
If no match is found, insert a new row into the series table and use the new series_id.

For histogram series, a change in bucket boundaries results in a new series row, as required by §6.1. The old series remains in the database with its historical samples. The new series begins accumulating samples with the new boundaries.
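The resolution steps can be sketched against SQLite as shown below, without the in-memory cache. Everything named here is illustrative: resolve_series is a hypothetical function, zlib.crc32 stands in for the implementation-defined hash, and the caller is assumed to supply the boundaries hash for histograms. The type check from §6.3.2 is omitted.

```python
import sqlite3
import zlib

SERIES_DDL = """
CREATE TABLE series (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, labels TEXT NOT NULL,
    type INTEGER NOT NULL, label_hash INTEGER NOT NULL, boundaries_hash INTEGER,
    UNIQUE(name, labels, boundaries_hash)
);
CREATE INDEX idx_series_label_hash ON series(label_hash);
"""


def resolve_series(conn, name, labels, metric_type, boundaries_hash=None):
    """Return the series id for (name, labels, boundaries), inserting if needed."""
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    label_hash = zlib.crc32(canonical.encode("utf-8"))  # stand-in hash (assumption)
    rows = conn.execute(
        # IS (rather than =) so that a NULL boundaries_hash matches NULL.
        "SELECT id, labels FROM series "
        "WHERE name = ? AND label_hash = ? AND boundaries_hash IS ?",
        (name, label_hash, boundaries_hash),
    )
    for series_id, stored_labels in rows:
        if stored_labels == canonical:  # hash collision check
            return series_id
    cur = conn.execute(
        "INSERT INTO series(name, labels, type, label_hash, boundaries_hash) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, canonical, metric_type, label_hash, boundaries_hash),
    )
    return cur.lastrowid
```

Resolving the same name and labels twice returns the same id, while a histogram with a different boundaries hash gets a new row.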

Series resolution MUST be cached in a bounded in-memory cache. The cache maps (name, canonical labels, boundaries hash) to series_id. For counter and gauge series, the boundaries hash component is absent. Cache hits resolve in a hash table lookup with no SQLite query. Cache misses fall back to a SQLite SELECT by name and label_hash, then the result is inserted into the cache (evicting the least recently used entry if the cache is full).

Key	Type	Default	Valid range	Description
MetricSeriesCacheSize	REG_DWORD	50000	1000--1000000	Maximum number of entries in the series resolution cache.

The cache uses a least-recently-used (LRU) eviction policy. Series that receive frequent samples stay cached. Series that are rarely updated are evicted and resolved via SQLite on their next sample. A cache miss costs one SQLite SELECT -- fast with the existing idx_series_name and idx_series_label_hash indexes, but slower than a hash table hit.

The cache bounds memory usage regardless of how many distinct series exist in the database. A system with 1 million series but a 50K cache uses memory proportional to the cache size, not the series count. At typical entry sizes (~200-300 bytes), the default 50K cache uses approximately 10-15 MB.
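A bounded LRU cache of this kind can be sketched with collections.OrderedDict. The class name SeriesCache and the load callback (standing in for the SQLite SELECT on a miss) are hypothetical.

```python
from collections import OrderedDict


class SeriesCache:
    """Bounded LRU map from a series key to its series_id."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load  # fallback lookup on a miss, e.g. the SQLite SELECT
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Strict LRU degrades sharply once the hot set exceeds the capacity: if keys are touched in the same cyclic order on every pass, each lookup can miss, because an entry is evicted shortly before it is needed again.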

ⓘ Informative

Histogram bucket boundary changes create new series rows. An emitter that drifts its boundaries frequently (e.g., an auto-tuning histogram that adjusts buckets every collection cycle) will create a new series per boundary set, causing series table growth and cache churn. This is a misconfiguration -- emitters SHOULD use fixed boundaries for a given metric. eventd does not defend against this; the series table and cache behave correctly, but the proliferation of near-identical series degrades query performance and wastes storage.

The total number of distinct series in the series table is not capped. New series are always created in the database. The cache only bounds how many are held in memory simultaneously.

ⓘ Informative

MetricSeriesCacheSize SHOULD be configured above the number of actively reporting series. If the cache is smaller than the active set, every collection cycle evicts and reloads entries, causing SQLite SELECTs on every cycle permanently. For example, 55K active series with a 50K cache incurs at least ~5K cache misses every 15 seconds indefinitely, and up to 55K if series arrive in the same order each cycle, because LRU then evicts each entry shortly before it is needed again. LRU cannot help when all series are equally hot.

§7.1.6 Schema versioning

The metric store database MUST contain a metadata table with the same structure as the event and log stores. The schema_version for the metric store is 1.

§7.2 7 Metric storage

Database Lifecycle

§7.2.1 Metric store path

The metric store database resides at a path configured via the MetricStorePath registry key under Machine\System\eventd\. The value MUST be a file path. There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

eventd MUST create the database file and its parent directories if they do not exist.

§7.2.2 Database creation

When the metric store database does not exist at startup, eventd MUST create it with:

WAL mode enabled.
Synchronous mode set to NORMAL (same rationale as the log store -- metric loss on power failure is tolerable).
The series, samples, and metadata tables created as defined in §7.1.
All write-time indexes created.
The schema_version and created_at metadata entries populated.

§7.2.3 Database opening

When the metric store database exists at startup, eventd MUST:

Open the database in WAL mode with synchronous NORMAL.
Verify the schema version. If unrecognised, eventd MUST log an error and MUST NOT write to the database. The database remains available for read-only queries.
Verify structural integrity (required tables and indexes exist).
The series cache starts empty. It is populated on demand as metric samples arrive -- each cache miss triggers a SQLite SELECT and inserts the result into the cache. Within one collection cycle (typically 15 seconds), all active series are cached. No pre-warming is required.

§7.2.4 Concurrency

The metric store has one read-write connection (owned by the metric writer thread) and zero or more read-only connections (owned by query handlers). WAL mode permits concurrent reads alongside the single writer.

§7.2.5 WAL checkpointing

The metric writer thread MUST trigger WAL checkpoints when the WAL exceeds a size threshold, using SQLITE_CHECKPOINT_PASSIVE mode.

§7.3 7 Metric storage

Retention

ⓘ Informative

As with event and log retention, the metric retention model in v0.23 is an early simplification. Future versions will introduce downsampling (automatically aggregating high-resolution data into lower-resolution rollups over time -- e.g., per-second samples become 5-minute averages after one week, and 1-hour averages after one month). Downsampling dramatically reduces storage requirements for long-term metric data while preserving trend visibility. This is part of the more sophisticated retention engine referenced in §3.4.

§7.3.1 Configuration

Key	Type	Default	Valid range	Description
MetricRetentionDays	REG_DWORD	90	1--3650	Maximum age of metric samples in days. Samples older than this are eligible for deletion.
MetricRetentionMaxBytes	REG_QWORD	0	0--18446744073709551615	Maximum size of the metric store database in bytes. 0 means no size limit.

Both limits are enforced. The more aggressive limit wins.

ⓘ Informative

The default metric retention (90 days) is longer than event and log retention because metric data is small per sample and trend data is valuable over longer periods. A system with 1000 time series sampled every 15 seconds produces approximately 5.76 million samples per day (1000 × 86,400 / 15) -- a few hundred megabytes in SQLite.

§7.3.2 Retention process

The retention process runs on the same background thread as event and log retention, operating on the metric store database after completing log retention.

Delete all rows from the samples table whose timestamp is more than MetricRetentionDays days before the current wall clock time.
If MetricRetentionMaxBytes is nonzero and the metric store database exceeds the limit, delete the oldest samples (by timestamp) until the size is within the limit.
Delete any rows from the series table that have no remaining samples in the samples table. This cleans up time series definitions for metrics that are no longer being produced.

Deletion MUST be performed in batches to avoid holding long-running transactions.
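The third step, removing series that no longer have samples, could be batched along these lines. The function name and batch size are hypothetical, and the sketch assumes the series and samples tables from §7.1.

```python
import sqlite3


def delete_orphan_series(conn, batch=500):
    """Delete series rows with no remaining samples, `batch` rows per transaction.

    Returns the total number of series rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            """DELETE FROM series WHERE id IN (
                   SELECT s.id FROM series AS s
                   WHERE NOT EXISTS (
                       SELECT 1 FROM samples WHERE samples.series_id = s.id)
                   LIMIT ?)""",
            (batch,),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

The NOT EXISTS probe is served by the idx_samples_series_timestamp index, so each candidate series costs one index lookup rather than a table scan.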

§7.3.3 Disk reclamation

As with the event and log stores, VACUUM is not run automatically. Freed pages are recycled by subsequent inserts.

§7.4 7 Metric storage

Adaptive Rollups

§7.4.1 Purpose

Aggregate metric queries (AVG_OVER, MIN_OVER, MAX_OVER, RATE, P95 over time ranges) require reading and processing raw samples. For large time ranges, this means scanning thousands or millions of rows. Pre-computing the results of common aggregate queries and storing them as rollups dramatically reduces query time.

Adaptive rollups apply the same principle as adaptive indexing for events: monitor which query patterns occur frequently, pre-compute the results in the background, and serve queries from the rollups when available.

§7.4.2 Rollup model

A rollup is a pre-computed aggregate stored in a dedicated table. Each rollup is defined by:

Series -- which time series the rollup applies to.
Function -- the aggregation function (AVG, MIN, MAX, SUM, RATE, DELTA). Percentile functions are not rolled up (§7.4.4).
Window -- the time window size (e.g., 5m, 1h, 1d).

A rollup for cpu.usage[core="0"] AVG_OVER 1h stores one pre-computed average value per hour for that specific series.

§7.4.3 Rollup table

The metric store database MUST contain a rollups table:

Column	Type	Description
`id`	INTEGER PRIMARY KEY	SQLite rowid.
`series_id`	INTEGER NOT NULL	Foreign key referencing `series(id)`.
`function`	INTEGER NOT NULL	Aggregation function identifier.
`window_seconds`	INTEGER NOT NULL	Window size in seconds.
`window_start`	INTEGER NOT NULL	Start timestamp of the window. Nanoseconds since Unix epoch.
`value`	REAL NOT NULL	The pre-computed aggregate value.
`sample_count`	INTEGER NOT NULL	Number of raw samples that contributed to this value. Allows the query engine to assess rollup quality (a window with 1 sample is less reliable than one with 60).

A unique constraint MUST exist on (series_id, function, window_seconds, window_start).

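An index MUST exist on (series_id, function, window_seconds, window_start) to support efficient rollup lookups. In SQLite, the index that backs the unique constraint satisfies this requirement; a separate index is not needed.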

§7.4.4 Function identifiers

Value	Function
0	AVG
1	MIN
2	MAX
3	SUM
4	RATE
5	DELTA

Percentile functions (P50, P95, P99) are excluded from rollups. Percentiles are not composable -- the P95 of twelve 5-minute P95 values is not the P95 of the full hour. Percentile queries always compute from raw histogram samples.
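A small numeric example makes the non-composability concrete. It uses raw values and a nearest-rank percentile for simplicity (the spec computes percentiles from histogram samples); the data is invented for the illustration.

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Twelve 5-minute windows of 100 samples each. Eleven are quiet; one has
# ten slow observations.
quiet = [10.0] * 100
spiky = [10.0] * 90 + [1000.0] * 10
windows = [quiet] * 11 + [spiky]

per_window = [p95(w) for w in windows]               # eleven 10.0s and one 1000.0
p95_of_p95s = p95(per_window)                        # 1000.0
true_p95 = p95([v for w in windows for v in w])      # 10.0
```

Only 10 of the 1,200 observations are slow, so the true hourly P95 is 10.0, yet the P95 of the twelve per-window values is 1000.0.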

§7.4.5 Rollup registry

eventd MUST maintain a global rollup registry -- a set of (function, window) pairs that should be pre-computed across all series. The registry is computed from query frequency tracking, analogous to the global desired index set for events.

For each metric query that includes a value function and/or window function, eventd records the function and window size. When a (function, window) pair's query frequency exceeds the creation threshold over the rolling time window, it is added to the rollup registry. When it falls below the removal threshold, it is removed.

The rollup registry is global -- it applies to all series. If hourly averages are frequently queried for any metric, hourly averages are pre-computed for all active series.
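The create/drop thresholds form a hysteresis band, which can be sketched as follows. The function name is hypothetical. Reading "exceeds" as strictly greater than, and "falls below" as strictly less than, is an interpretation; the default values mirror the configuration in §7.4.9.

```python
def update_registry(registry, counts, create_threshold=50, drop_threshold=5):
    """Return the new registry given query counts per (function, window) pair.

    `counts` holds the number of queries seen in the rolling window;
    pairs absent from `counts` are treated as having zero queries.
    """
    updated = set(registry)
    for pair, n in counts.items():
        if n > create_threshold:  # assumption: "exceeds" means strictly greater
            updated.add(pair)
    for pair in registry:
        if counts.get(pair, 0) < drop_threshold:
            updated.discard(pair)
    return updated
```

A pair whose frequency sits between the two thresholds keeps whatever state it already has, which prevents rollups from being created and dropped repeatedly around a single cut-off.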

§7.4.6 Rollup computation

Rollup computation runs on a background thread during periods of low write activity. For each (function, window) pair in the registry, the background thread:

Identifies time windows that have raw samples but no rollup entry.
Reads the raw samples for those windows.
Computes the aggregate value.
Inserts the rollup entry.

Only completed windows are rolled up. The current (still-accumulating) window is not pre-computed -- it is always computed from raw samples at query time.

Rollup computation MUST be cancellable. If write pressure rises, the computation is aborted and resumed later. The same cancellation mechanism as adaptive index creation (§3.3) applies.
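As an illustration of computing AVG rollups for completed windows only, consider the sketch below. It assumes windows are aligned to multiples of the window size since the Unix epoch, which this section does not state, and it does not check for existing rollup entries or support cancellation. The function name is hypothetical.

```python
NS = 1_000_000_000


def avg_rollups(samples, window_seconds, now_ns):
    """Return [(window_start_ns, avg, sample_count)] for completed windows.

    `samples` is an iterable of (timestamp_ns, value) pairs.
    """
    width = window_seconds * NS
    # Assumption: windows are aligned to multiples of the window size.
    current_start = (now_ns // width) * width  # window still accumulating
    buckets = {}
    for ts, value in samples:
        start = (ts // width) * width
        if start < current_start:  # skip the incomplete current window
            total, n = buckets.get(start, (0.0, 0))
            buckets[start] = (total + value, n + 1)
    return [(start, total / n, n) for start, (total, n) in sorted(buckets.items())]
```

With 60-second windows and the current time at 125 seconds, samples at 0, 30, and 60 seconds produce two rollup rows, and a sample at 130 seconds is left for a later pass.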

§7.4.7 Query integration

When executing a metric query with a value function and/or window function, the query engine MUST check whether a matching rollup exists. If a rollup covers the requested function, window size, and time range:

The query reads from the rollups table instead of the samples table.
The query reads one row per time window instead of hundreds of raw samples per window.

If the rollup partially covers the time range (e.g., rollups exist for all but the most recent incomplete window), the query reads rollups for completed windows and computes the aggregate from raw samples for the incomplete window.

If no matching rollup exists, the query falls back to raw sample computation. The result is identical -- rollups are a transparent optimisation.

Rollup window sizes do not need to exactly match the query window for composable functions. A query for AVG_OVER 1h can be served from 5-minute AVG rollups by computing a weighted average from twelve 5-minute values (weighted by sample_count). MIN and MAX compose directly (take the min/max across sub-windows). SUM composes by addition. DELTA composes by summing the per-window deltas. RATE composes by recovering each sub-window's delta (rate multiplied by window length), summing the deltas, and dividing by the total time; for equal-sized sub-windows this is the mean of the per-window rates. Counter resets are handled during rollup computation -- each stored RATE/DELTA value already reflects reset-adjusted deltas. Composition operates on these adjusted values and does not need to handle resets again. Smaller rollup windows can serve larger query windows, but not the reverse.
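Composition of sub-window rollups for AVG, MIN, MAX, and SUM can be sketched as below. The function name and the (value, sample_count) tuple shape are hypothetical; RATE and DELTA are left out of the sketch.

```python
def compose(function, rollups):
    """Combine sub-window rollups [(value, sample_count), ...] into one value."""
    values = [v for v, _ in rollups]
    if function == "MIN":
        return min(values)
    if function == "MAX":
        return max(values)
    if function == "SUM":
        return sum(values)
    if function == "AVG":
        # Weight each sub-window average by how many samples produced it.
        total = sum(n for _, n in rollups)
        return sum(v * n for v, n in rollups) / total
    raise ValueError(f"unsupported function: {function}")
```

The weighting matters for AVG: a sub-window averaging 10.0 from one sample and another averaging 20.0 from three samples compose to 17.5, not to the unweighted 15.0.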

§7.4.8 Rollup retention

Rollup entries follow the same retention policy as raw samples (§7.3). When raw samples are deleted by the retention process, their corresponding rollup entries MUST also be deleted.

When a (function, window) pair is removed from the rollup registry (query frequency dropped below threshold), existing rollup entries for that pair are not deleted immediately. They remain available for queries until they age out through normal retention. New rollup entries are simply no longer computed.

§7.4.9 Configuration

Key	Type	Default	Valid range	Description
AdaptiveRollupWindowHours	REG_DWORD	48	1--168	Rolling time window in hours over which query frequency is measured.
AdaptiveRollupCreateThreshold	REG_DWORD	50	10--10000	Number of queries with a specific function/window pair required to trigger rollup computation.
AdaptiveRollupDropThreshold	REG_DWORD	5	1--1000	Frequency below which the function/window pair is removed from the registry. MUST be less than the create threshold.
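
The two thresholds form a hysteresis band: a pair is registered once its query count reaches the create threshold and stays registered until the count falls below the drop threshold. A minimal sketch of that decision, assuming a hypothetical `update_registration` helper that is called once per (function, window) pair with the count measured over the rolling window:

```python
def update_registration(registered: bool, query_count: int,
                        create_threshold: int = 50,
                        drop_threshold: int = 5) -> bool:
    """Return whether the (function, window) pair should be in the registry.

    Hypothetical helper; defaults mirror the table above.
    """
    if drop_threshold >= create_threshold:
        raise ValueError("drop threshold MUST be less than the create threshold")
    if not registered:
        # Not yet registered: register only once the create threshold is met.
        return query_count >= create_threshold
    # Already registered: keep it unless frequency fell below the drop threshold.
    return query_count >= drop_threshold
```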

ⓘ Informative

The default rollup thresholds are lower than the adaptive indexing thresholds because rollup computation is cheaper than index creation (it processes data incrementally, window by window) and the query speedup is more dramatic (reading 24 rows instead of 86400 for a daily query at 1-second resolution).

§7.4.10 Persistence

The rollup registry and query frequency counters MUST be persisted across eventd restarts. Existing rollup entries in the rollups table are discovered on startup. eventd resumes rollup computation from whatever state the table is in.

Section

8 Querying

§8.1 8 Querying

Overview

eventd exposes a unified query language for retrieving events, logs, and metrics. The language has three modes -- one per data type -- sharing common syntax for time ranges, filtering, limiting, streaming, and cross-type correlation. Each mode has type-specific syntax that reflects the natural access patterns of that data type.

The three modes:

EVENTS -- search through structured event records. Primary access by event type. Returns collections of records.
LOGS -- search through service log output. Primary access by service name. Returns collections of records.
METRIC -- evaluate numeric measurements. Primary access by metric name and labels. Returns values or time series.

Events and logs are record-oriented (collections you search). Metrics are value-oriented (measurements you evaluate). The syntax reflects this distinction rather than forcing all three through a single pattern.

§8.1.1 Common elements

The following constructs work identically across all three modes:

Element	Syntax	Description
Time start	`SINCE 1h ago`	Filter to records/samples after a point in time.
Time end	`UNTIL 30m ago`	Optional upper bound. Defaults to now.
Additional filter	`WHERE field == value`	Filter by any field.
Cross-type filter	`WHERE METRIC name[labels] > N`	Filter by a condition on another data type.
Limit	`TAKE N`	Limit result count.
Offset	`SKIP N`	Skip first N results after sorting (pagination). Applies to both raw and aggregated results.
Streaming	`STREAM`	Live tail. May appear anywhere.

All keywords are case-insensitive. Documentation uses uppercase by convention.

Clauses after the primary selector may appear in any order. Execution semantics are fixed regardless of clause order.

§8.1.2 Time literals

Literal	Meaning
`Ns ago`	N seconds before now.
`Nm ago`	N minutes ago.
`Nh ago`	N hours ago.
`Nd ago`	N days ago.
`Ns hence`	N seconds after now.
`Nm hence`	N minutes from now.
`Nh hence`	N hours from now.
`Nd hence`	N days from now.
`today`	Midnight of the current day (UTC).
`yesterday`	Midnight of the previous day (UTC).
`YYYY-MM-DD`	Midnight of the specified date (UTC).
`YYYY-MM-DDTHH:MM:SS`	Specific time (UTC).
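
The literals above can be resolved to absolute UTC timestamps along these lines. This is a sketch under stated assumptions: `resolve_time_literal` is a hypothetical name, `now` is expected to be a timezone-aware UTC datetime, and the `ago`/`hence` suffixes are assumed to be case-insensitive like other keywords.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}
_RELATIVE = re.compile(r"(\d+)([smhd])\s+(ago|hence)", re.IGNORECASE)


def resolve_time_literal(text: str, now: datetime) -> datetime:
    """Resolve a time literal relative to `now` (assumed to be UTC)."""
    text = text.strip()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    lowered = text.lower()
    if lowered == "today":
        return midnight
    if lowered == "yesterday":
        return midnight - timedelta(days=1)
    match = _RELATIVE.fullmatch(text)
    if match:
        amount, unit, direction = match.groups()
        offset = timedelta(**{_UNITS[unit.lower()]: int(amount)})
        return now - offset if direction.lower() == "ago" else now + offset
    # Absolute forms: full timestamp first, then date-only (midnight UTC).
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised time literal: {text!r}")
```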

§8.1.3 GUID literals

Standard GUID string format, with or without braces:

WHERE process_guid == "550e8400-e29b-41d4-a716-446655440000"
WHERE process_guid == "{550e8400-e29b-41d4-a716-446655440000}"
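
Both forms denote the same value. A client or parser could normalise them before comparison, for example with Python's standard `uuid` module, which accepts the braced form; `parse_guid_literal` is a hypothetical helper name.

```python
import uuid


def parse_guid_literal(text: str) -> uuid.UUID:
    """Parse a GUID literal written with or without surrounding braces."""
    return uuid.UUID(text)
```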

§8.1.4 Integer literals

Decimal and hexadecimal:

WHERE origin_class == 2
WHERE granted_access == 0x1F01FF

§8.1.5 Comparison operators

Operator	Meaning	Applicable types
`==`	Equals	All
`!=`	Not equals	All
`>`	Greater than	Integer, float, timestamp
`>=`	Greater than or equal	Integer, float, timestamp
`<`	Less than	Integer, float, timestamp
`<=`	Less than or equal	Integer, float, timestamp
`STARTS_WITH`	String prefix match	String
`ENDS_WITH`	String suffix match	String
`CONTAINS`	Substring match	String
`IN`	Value is in a set	All
`NOT_IN`	Value is not in a set	All
`IS NULL`	Value is NULL	All
`IS NOT NULL`	Value is not NULL	All

All string comparisons (==, !=, STARTS_WITH, ENDS_WITH, CONTAINS, IN, NOT_IN) are case-insensitive by default. This applies to event fields, payload fields, log messages, and metric labels. Integer, float, GUID, and timestamp comparisons are unaffected.

§8.1.6 Logical operators

Predicates within a single WHERE clause may be combined with AND and OR. Parentheses control precedence. AND binds tighter than OR.

Multiple WHERE clauses are logically ANDed. Each WHERE clause is treated as a parenthesized group: WHERE a == 1 OR b == 2 followed by WHERE c == 3 is equivalent to WHERE (a == 1 OR b == 2) AND c == 3.
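
The grouping rule can be illustrated with predicates modelled as plain functions. This is only a sketch of the semantics; `combine_where` and the record shape are hypothetical.

```python
from typing import Callable, Dict, List

Predicate = Callable[[Dict], bool]


def combine_where(clauses: List[Predicate]) -> Predicate:
    """AND whole WHERE clauses together; each clause acts as one group."""
    return lambda record: all(clause(record) for clause in clauses)


# WHERE a == 1 OR b == 2
first = lambda r: r.get("a") == 1 or r.get("b") == 2
# WHERE c == 3
second = lambda r: r.get("c") == 3

# Equivalent to WHERE (a == 1 OR b == 2) AND c == 3
matches = combine_where([first, second])
```

A record with a == 1 and c == 4 does not match, which it would if the clauses were concatenated without grouping (a == 1 OR b == 2 AND c == 3).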

§8.1.7 Field resolution

For events: known header column names (timestamp, cpu_id, sequence, origin_class, event_type, effective_token_guid, true_token_guid, process_guid, boot_id) resolve to columns. All other names resolve to payload field lookups via msgpack extraction.

For logs: all field names resolve to log columns (timestamp, origin, is_error, message, boot_id). There are no payload fields.

For metrics: known metric fields (timestamp, name, type, value) resolve to columns. All other names resolve to label lookups.
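
A sketch of these resolution rules follows. The `resolve_field` name and the tuple return shape are hypothetical, and rejecting unknown names in LOGS mode is an assumption drawn from the statement that logs have no payload fields.

```python
EVENT_HEADER_COLUMNS = {
    "timestamp", "cpu_id", "sequence", "origin_class", "event_type",
    "effective_token_guid", "true_token_guid", "process_guid", "boot_id",
}
LOG_COLUMNS = {"timestamp", "origin", "is_error", "message", "boot_id"}
METRIC_COLUMNS = {"timestamp", "name", "type", "value"}


def resolve_field(mode: str, name: str):
    """Classify a field name as a column, payload lookup, or label lookup."""
    if mode == "EVENTS":
        kind = "column" if name in EVENT_HEADER_COLUMNS else "payload"
        return (kind, name)
    if mode == "LOGS":
        if name not in LOG_COLUMNS:
            # Assumption: unknown log fields are an error.
            raise ValueError(f"unknown log field: {name}")
        return ("column", name)
    if mode == "METRIC":
        kind = "column" if name in METRIC_COLUMNS else "label"
        return (kind, name)
    raise ValueError(f"unknown mode: {mode}")
```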

§8.1.8 Result format

All three modes return results as arrays of flat msgpack maps. Each record is a self-describing map with field names as keys.

Event records contain header fields and extracted payload fields merged as top-level keys. Log records contain log columns as top-level keys. Metric records contain the metric name, labels as top-level keys, timestamp, and value.

If SELECT is present, only the named fields appear in each record.

§8.2 8 Querying

Event Queries

§8.2.1 Syntax

EVENTS [type_pattern] [clauses...]

The primary selector is an optional event type pattern, placed immediately after EVENTS. All other clauses may appear in any order.

§8.2.2 Type pattern

If present, the type pattern filters events by their event_type field. Exact match by default. The * wildcard matches zero or more of any character, including dots:

EVENTS kacs.access_denied           -- exact match
EVENTS kacs.*                       -- all event types starting with "kacs."
EVENTS synthetic.*                  -- all eventd synthetic events
EVENTS *.denied                     -- all event types ending with ".denied"
EVENTS kacs.*.denied                -- e.g., matches kacs.access.denied, kacs.token.denied
EVENTS                              -- all events (no type filter)

The type pattern is syntactic sugar for WHERE event_type == "..." (exact) or WHERE event_type STARTS_WITH "..." (trailing *). A pattern with * in other positions is a glob match. The only wildcard character is *. There are no other glob metacharacters (?, [, { have no special meaning). Matching is case-insensitive, consistent with all string comparisons in the query language.
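
One way to implement these matching rules is to translate the pattern into a regular expression in which `*` is the only special character. This is a sketch, not the eventd matcher; the function names are hypothetical. A general-purpose glob library is deliberately avoided here because such libraries typically also treat `?` and `[` as metacharacters.

```python
import re


def type_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a type pattern into a case-insensitive regex."""
    # Escape everything between wildcards so '.', '?', '[' and '{' stay literal.
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    # '*' matches zero or more of any character, including dots.
    return re.compile(".*".join(literal_parts), re.IGNORECASE | re.DOTALL)


def matches_type(pattern: str, event_type: str) -> bool:
    """True if the whole event_type matches the pattern."""
    return type_pattern_to_regex(pattern).fullmatch(event_type) is not None
```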

§8.2.3 Origin class aliases

The origin class field accepts named aliases in WHERE clauses:

Alias	Value
`userspace`	0
`kmes`	1
`kacs`	2
`lcs`	3

EVENTS WHERE origin_class == kacs SINCE 1h ago

§8.2.4 Projection

SELECT controls which fields appear in result records. Multiple SELECT clauses are additive.

EVENTS kacs.* SINCE 1h ago SELECT timestamp, event_type, granted_access
EVENTS SELECT timestamp SELECT event_type    -- same as SELECT timestamp, event_type

If no SELECT is present, each record includes all header fields, and all payload fields are extracted and included as top-level keys.

§8.2.5 Aggregation

§8.2.5.1 COUNT BY

Counts records grouped by a field. Results are sorted by count descending.

EVENTS SINCE 24h ago COUNT BY event_type
-- returns: [{event_type: "kacs.access_check", count: 4523}, {event_type: "lcs.key_set", count: 891}, ...]

§8.2.5.2 TOP N BY

Shorthand for COUNT BY with a limit. Returns the N most frequent values.

EVENTS SINCE 1h ago TOP 10 BY process_guid
-- returns: [{process_guid: "...", count: 892}, {process_guid: "...", count: 445}, ...] (10 records)

§8.2.5.3 DISTINCT

Returns the distinct values of a field.

EVENTS SINCE 24h ago DISTINCT event_type
-- returns: [{event_type: "kacs.access_check"}, {event_type: "kacs.token_create"}, ...]

§8.2.5.4 GROUP with aggregation functions

For more complex aggregations, GROUP groups records by one or more fields, followed by an aggregation function:

EVENTS SINCE 1h ago GROUP origin_class COUNT
EVENTS SINCE 1h ago GROUP origin_class, event_type COUNT

Aggregation functions: COUNT, SUM, AVG, MIN, MAX. SUM, AVG, MIN, and MAX take a field argument:

EVENTS SINCE 1h ago GROUP event_type AVG some_numeric_field

For SUM, AVG, MIN, and MAX, records where the field is NULL or non-numeric are excluded from the aggregate. COUNT counts all records regardless of field values. If all records in a group have NULL or non-numeric values for the aggregated field, the group's aggregate value is NULL.
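
The NULL and non-numeric handling can be sketched as follows for a single group. The `aggregate` helper is hypothetical, records are modelled as dicts with `None` standing in for NULL, and treating booleans as non-numeric is an assumption.

```python
from numbers import Real


def numeric_values(records, field):
    """Collect the field's values, skipping NULL and non-numeric entries."""
    values = []
    for record in records:
        value = record.get(field)
        # Assumption: booleans are not treated as numbers.
        if isinstance(value, Real) and not isinstance(value, bool):
            values.append(value)
    return values


def aggregate(records, func, field=None):
    """Apply one aggregation function to the records of a single group."""
    if func == "COUNT":
        return len(records)  # counts all records regardless of field values
    values = numeric_values(records, field)
    if not values:
        return None  # every value was NULL or non-numeric
    if func == "SUM":
        return sum(values)
    if func == "AVG":
        return sum(values) / len(values)
    if func == "MIN":
        return min(values)
    if func == "MAX":
        return max(values)
    raise ValueError(f"unknown aggregation function: {func}")
```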

§8.2.6 Sorting

SORT orders results by one or more fields. Default direction is ascending. DESC reverses.

EVENTS kacs.* SINCE 1h ago SORT timestamp DESC TAKE 100

If no SORT is present, results are ordered by timestamp descending (most recent first).

§8.2.7 Manual indexing

The INDEX command adds a field to the global desired index set immediately, bypassing the adaptive frequency threshold and the policy recomputation interval. This is intended for incident response and ad-hoc investigation where waiting for the adaptive system is not acceptable.

EVENTS INDEX target_sid
EVENTS INDEX granted_access

INDEX triggers an immediate policy recomputation with the named field added at highest priority. The shard writer threads begin converging toward the updated desired set at their next quiet period. The index is subject to the same pressure-based shedding rules as any other adaptive index -- if the system is under heavy write pressure, the index may be shed.

Manually indexed fields remain in the desired set until they fall below the adaptive drop threshold over the rolling window (same as adaptively created indexes). There is no manual "unindex" command -- if the field stops being queried, the adaptive system removes it.

INDEX is an administrative operation. The caller's token is checked against an SD stored in the eventd-meta.db metadata database (§3.3), separate from the read-path SDs in the registry security subtree (§9.1). The default SD grants INDEX access to SYSTEM and Administrators only. This prevents unprivileged users from manipulating the indexing policy to degrade write throughput for other users.

§8.2.8 Examples

Last 50 events:

EVENTS TAKE 50

KACS access denied events from the last hour with specific fields:

EVENTS kacs.access_denied SINCE 1h ago SELECT timestamp, event_type, granted_access, target_sid

Events from a specific process:

EVENTS WHERE process_guid == "550e8400-e29b-41d4-a716-446655440000" SINCE 1d ago

Event type breakdown for the last 24 hours:

EVENTS SINCE 24h ago COUNT BY event_type

Top 10 noisiest processes in the last hour:

EVENTS SINCE 1h ago TOP 10 BY process_guid

Events during high CPU:

EVENTS kacs.* SINCE 1h ago WHERE METRIC cpu.usage > 80

Live tail of all KACS events:

EVENTS kacs.* STREAM

§8.3 8 Querying

Log Queries

§8.3.1 Syntax

LOGS [FROM origin[, origin...]] [ERROR ONLY] [CONTAINING "text"] [clauses...]

The primary selectors are FROM (service name), ERROR ONLY (stderr filter), and CONTAINING (text search). All are optional. All other clauses may appear in any order.

§8.3.2 FROM

Filters logs by origin (service name). Multiple origins may be comma-separated.

LOGS FROM loregd                    -- logs from loregd
LOGS FROM loregd, peinit            -- logs from loregd or peinit
LOGS                                -- all logs

FROM is syntactic sugar for WHERE origin == "..." (single) or WHERE origin IN ("...", "...") (multiple).

§8.3.3 ERROR ONLY

Filters to log lines from stderr (is_error == true).

LOGS ERROR ONLY                     -- all stderr output
LOGS FROM loregd ERROR ONLY         -- stderr from loregd

ERROR ONLY may appear anywhere after LOGS:

LOGS FROM loregd SINCE 1h ago ERROR ONLY    -- same result regardless of position

§8.3.4 CONTAINING

Filters to log lines whose message contains the specified text. Substring match, case-insensitive (consistent with all string comparisons in the query language).

LOGS CONTAINING "connection refused"
LOGS FROM loregd CONTAINING "failed to open"

CONTAINING is a log-specific keyword because text search is the primary operation on log data. It is syntactic sugar for WHERE message CONTAINS "...":

LOGS WHERE message CONTAINS "error"         -- equivalent to CONTAINING "error"

ⓘ Informative

CONTAINING performs a substring scan, not a full-text search. When combined with SINCE, the scan is limited to the matching time range (the timestamp index narrows the scan). Full-text indexing (FTS) is a candidate for future versions.

§8.3.5 Aggregation

COUNT BY and TOP N BY work the same as for events:

LOGS SINCE 1h ago COUNT BY origin           -- log volume per service
LOGS SINCE 1h ago TOP 5 BY origin           -- most verbose services

§8.3.6 Sorting

If no SORT is present, results are ordered by timestamp descending (most recent first).

§8.3.7 Examples

Recent logs from loregd:

LOGS FROM loregd TAKE 100

Errors from any service in the last 30 minutes:

LOGS ERROR ONLY SINCE 30m ago

Search for a string in loregd logs:

LOGS FROM loregd CONTAINING "connection refused" SINCE 1d ago

Most verbose services in the last hour:

LOGS SINCE 1h ago TOP 5 BY origin

Live tail of loregd logs:

LOGS FROM loregd STREAM

Loregd logs during high memory usage:

LOGS FROM loregd SINCE 1h ago WHERE METRIC mem.usage[service="loregd"] > 90

§8.4 8 Querying

Metric Queries

§8.4.1 Syntax

METRIC name[label_selector] [SINCE time] [UNTIL time] [function] [window_function] [clauses...]

The primary selector is the metric name with an optional label selector in brackets, placed immediately after METRIC. Functions and window functions transform the values. All other clauses may appear in any order.

§8.4.2 Metric name

The metric name selects which measurement to query:

METRIC cpu.usage                    -- the cpu.usage metric
METRIC http.requests.total          -- the http.requests.total metric

Glob patterns with * are supported. The * wildcard matches zero or more of any character, including dots. The same glob semantics as event type patterns (§8.2) apply:

METRIC cpu.*                        -- all metrics starting with "cpu."
METRIC disk.usage.*                 -- all disk usage metrics

§8.4.3 Label selector

The label selector in brackets controls how multiple time series for the same metric are handled.

§8.4.3.1 No brackets -- aggregate

When no brackets are present, all matching series are aggregated into a single result using the implicit or explicit aggregation function:

METRIC cpu.usage                    -- average across all cores (implicit AVG)
METRIC cpu.usage MAX                -- maximum across all cores
METRIC cpu.usage SINCE 1h ago       -- averaged time series across all cores

The default aggregation function for no-bracket queries is AVG. To use a different aggregation, specify it explicitly.

§8.4.3.2 Empty brackets -- break out

Empty brackets return each label combination independently:

METRIC cpu.usage[]                  -- latest value per core
METRIC cpu.usage[] SINCE 1h ago     -- time series per core

§8.4.3.3 Label filter -- select specific series

Label filters inside brackets select specific series:

METRIC cpu.usage[core="0"]              -- only core 0
METRIC cpu.usage[core="0", host="srv1"] -- core 0 on srv1

Label values support the same comparison operators as WHERE:

METRIC http.requests.total[method="GET"]
METRIC disk.usage[device STARTS_WITH "sd"]

§8.4.4 Value functions

Value functions transform metric values. They appear after the metric selector. At most one value function may be specified per query.

§8.4.4.1 RATE

Computes the per-second rate of change. Handles counter resets (a decrease in value is treated as a reset, not a negative delta). MUST only be applied to counter-type metrics. Applying RATE to a gauge or histogram series is an error -- the query MUST be rejected at execution time when the series type is resolved.

METRIC http.requests.total SINCE 1h ago RATE

§8.4.4.2 DELTA

Computes the absolute change between consecutive samples. For each pair of consecutive samples (s1, s2), the delta is s2.value - s1.value. If the value decreases (counter reset), the delta is s2.value (the counter restarted from zero). MUST only be applied to counter-type metrics. Applying DELTA to a gauge or histogram series is an error -- the query MUST be rejected at execution time when the series type is resolved. DELTA is the unnormalized form of RATE -- RATE divides by elapsed time to produce a per-second value, DELTA returns the raw difference.

METRIC http.requests.total SINCE 1h ago DELTA
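
The reset handling shared by DELTA and RATE can be sketched as follows. This is an illustration under stated assumptions: samples are modelled as (timestamp, value) pairs sorted by time, timestamps are in nanoseconds, and each result is attached to the later sample of its pair. The function names are hypothetical.

```python
def deltas(samples):
    """Return (timestamp_ns, delta, elapsed_seconds) per consecutive pair."""
    out = []
    for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
        # A decrease means the counter restarted from zero, so the
        # delta is the new value itself rather than a negative number.
        change = v2 - v1 if v2 >= v1 else v2
        out.append((t2, change, (t2 - t1) / 1e9))
    return out


def rates(samples):
    """RATE is DELTA normalised by elapsed time (per-second)."""
    return [(t, change / elapsed)
            for t, change, elapsed in deltas(samples) if elapsed > 0]
```

With fewer than two samples both functions return an empty list, matching the rule that such a series produces no result.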

§8.4.4.3 P50, P95, P99

Computes percentiles from histogram-type metrics. Each histogram sample produces one percentile value. MUST only be applied to histogram-type metrics. Applying a percentile function to a counter or gauge series is an error -- the query MUST be rejected at execution time when the series type is resolved.

METRIC request.duration P95
METRIC request.duration[origin="loregd"] SINCE 1h ago P99

§8.4.4.4 AVG, MIN, MAX, SUM

Aggregates values. For no-bracket queries, aggregates across all matching series. For bracket queries, aggregates over time within each series. For queries with SINCE, operates over the time range.

METRIC cpu.usage AVG                            -- average across all cores, latest window
METRIC cpu.usage[] SINCE 1d ago AVG             -- average over the day, per core
METRIC cpu.usage[core="0"] SINCE 1h ago MIN     -- minimum value in the last hour for core 0

§8.4.5 Window functions

Window functions aggregate values into fixed time windows. They appear after a value function (or alone) and take a duration argument. A query may have at most one value function and at most one window function. Value functions (RATE, DELTA, P50, P95, P99, AVG, MIN, MAX, SUM) and window functions (AVG_OVER, MIN_OVER, MAX_OVER) are distinct categories -- AVG and AVG_OVER are different keywords. When both are present, the value function is applied first, then the window function aggregates the results into time windows.

§8.4.5.1 AVG_OVER, MIN_OVER, MAX_OVER

Divides the time range into fixed windows of the specified duration and computes the aggregate per window:

METRIC cpu.usage SINCE 1d ago AVG_OVER 1h       -- hourly averages over the last day
METRIC cpu.usage[] SINCE 1d ago AVG_OVER 5m     -- 5-minute averages per core
METRIC request.duration P95 SINCE 1h ago AVG_OVER 5m  -- 5-min averaged P95

Window functions produce one data point per window, reducing data density for trend visualisation.
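
A minimal sketch of AVG_OVER bucketing follows. The `avg_over` name is hypothetical, timestamps and durations are integers in the same unit, and aligning windows to the SINCE bound (rather than to a fixed epoch boundary) is an assumption; this section does not specify the alignment.

```python
from collections import defaultdict


def avg_over(samples, since, window):
    """Average (timestamp, value) samples into fixed windows.

    Returns (window_start, average) pairs in time order. Windows with
    no samples are omitted.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        if ts < since:
            continue  # outside the query range
        # Assumption: windows are aligned to the start of the range.
        start = since + ((ts - since) // window) * window
        buckets[start].append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]
```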

Window functions require a SINCE clause to define the time range. A query with a window function but no SINCE MUST be rejected with a parse error.

If adaptive rollups (§7.4) exist for the requested function and window size, the query engine serves results from the pre-computed rollup table instead of scanning raw samples. This is transparent -- the result is identical, only the performance differs.

§8.4.6 Without SINCE

When no SINCE clause is present, the query returns the latest value:

METRIC cpu.usage                    -- latest average across cores
METRIC cpu.usage[]                  -- latest value per core
METRIC cpu.usage[core="0"]          -- latest value for core 0

"Latest" means the most recent sample in the metric store.

For value functions that require multiple samples (RATE, DELTA), a query without SINCE uses the two most recent samples to compute a single instantaneous value. For example, METRIC http.requests.total RATE computes the per-second rate between the last two samples. If fewer than two samples exist for a series, the query returns no result for that series.

§8.4.7 Result format

Metric results are flat maps, consistent with event and log results. Labels appear as top-level keys:

{timestamp: 1714000000000000000, name: "cpu.usage", core: "0", value: 42.7}
{timestamp: 1714000015000000000, name: "cpu.usage", core: "0", value: 38.2}

For time series results, one record per sample. For aggregated results (e.g., METRIC cpu.usage AVG), one record with the aggregated value.

Windowed results include the window start timestamp:

{timestamp: 1714000000000000000, name: "cpu.usage", core: "0", value: 41.2}  -- 5min avg
{timestamp: 1714000300000000000, name: "cpu.usage", core: "0", value: 39.8}  -- next window

§8.4.8 Examples

Current CPU usage per core:

METRIC cpu.usage[]

Average CPU over the last day with hourly windows:

METRIC cpu.usage SINCE 1d ago AVG_OVER 1h

Request rate for loregd in the last hour:

METRIC http.requests.total[service="loregd"] SINCE 1h ago RATE

P95 request latency, 5-minute windows:

METRIC request.duration[origin="loregd"] SINCE 1h ago P95 AVG_OVER 5m

All disk usage metrics:

METRIC disk.usage.*[]

CPU usage during access denied events:

METRIC cpu.usage[] SINCE 1h ago WHERE EVENT kacs.access_denied EXISTS

§8.5 8 Querying

Cross-Type Filtering

§8.5.1 Overview

Cross-type filtering allows a query on one data type to be filtered by conditions on another data type. This is the primary mechanism for correlating events, logs, and metrics without explicit JOINs.

Cross-type filters appear as WHERE clauses with a type keyword (METRIC, EVENT, LOG) indicating the data source for the condition.

§8.5.2 WHERE METRIC

Filters records to time periods when a metric condition holds. Available in EVENTS and LOGS queries.

EVENTS kacs.* SINCE 1h ago WHERE METRIC cpu.usage > 80
LOGS FROM loregd SINCE 1h ago WHERE METRIC mem.usage[service="loregd"] > 90

§8.5.2.1 Semantics

The query engine pre-computes the time ranges where the metric condition is true by scanning the metric store for matching samples. These time ranges are then applied as additional timestamp filters on the primary data source.

For each matching sample where the condition holds, a time range is constructed from that sample's timestamp to the next sample's timestamp (or the end of the query window, whichever is earlier). This interpolation assumes the metric condition holds between samples.
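
The range construction can be sketched as follows. The `condition_ranges` name is hypothetical, samples are (timestamp, value) pairs sorted by time, ranges are half-open [start, end), and merging adjacent ranges into one is an optimisation assumed here rather than stated above.

```python
def condition_ranges(samples, predicate, query_end):
    """Return time ranges during which the metric condition is assumed true."""
    ranges = []
    for i, (ts, value) in enumerate(samples):
        if ts >= query_end or not predicate(value):
            continue
        # The range runs to the next sample, capped at the query window end.
        next_ts = samples[i + 1][0] if i + 1 < len(samples) else query_end
        end = min(next_ts, query_end)
        if ranges and ranges[-1][1] == ts:
            # Contiguous with the previous range: extend it.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((ts, end))
    return ranges
```

For samples at 0, 15, 30, 45, and 60 seconds with values 50, 85, 90, 70, and 95, the condition `> 80` with a query end of 70 yields the ranges [15, 45) and [60, 70).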

The metric condition uses the same comparison operators as standard WHERE clauses. The metric selector uses the same syntax as METRIC queries (name, brackets for labels):

WHERE METRIC cpu.usage > 80                     -- aggregated across all labels
WHERE METRIC cpu.usage[core="0"] > 90           -- specific series
WHERE METRIC disk.io.utilisation[device="sda"] > 95

§8.5.3 WHERE EVENT

Filters records to time periods when events of a specified type exist. Available in LOGS and METRIC queries.

LOGS FROM loregd SINCE 1h ago WHERE EVENT kacs.access_denied EXISTS
METRIC cpu.usage[] SINCE 1h ago WHERE EVENT synthetic.storage_error EXISTS

The EXISTS keyword indicates that the condition is satisfied when at least one event of the specified type exists within the relevant time window. The event type supports glob patterns:

WHERE EVENT kacs.* EXISTS                       -- any KACS event
WHERE EVENT synthetic.gap EXISTS                -- gap records

§8.5.4 WHERE LOG

Filters records to time periods when matching log entries exist. Available in EVENTS and METRIC queries.

EVENTS kacs.* SINCE 1h ago WHERE LOG loregd CONTAINING "error" EXISTS

The LOG condition specifies an origin and optionally a CONTAINING text filter.

§8.5.5 Time window resolution

Cross-type conditions are evaluated against the metric sample interval or event density, not per-row of the primary data source. The time ranges where the condition holds are computed once and applied as a filter.

For metric conditions, the resolution is the metric sample interval (typically 15 seconds). A metric condition like WHERE METRIC cpu.usage > 80 means "during periods where the most recent cpu.usage sample exceeded 80."

For event conditions, EXISTS means "at least one matching event occurred within the sample interval surrounding the primary record's timestamp." The sample interval for event existence checks is configurable:

Key	Type	Default	Valid range	Description
CrossTypeWindowMs	REG_DWORD	15000	1000--300000	Time window in milliseconds for cross-type event and log existence checks.

§8.5.6 Lookback limit

Cross-type filter pre-computation scans the referenced store for the query's time range. Scanning large time ranges (weeks or months of metric samples) is expensive. eventd MUST enforce a maximum lookback period for cross-type conditions:

Key	Type	Default	Valid range	Description
CrossTypeMaxLookbackSeconds	REG_DWORD	604800	3600--2592000	Maximum time range in seconds that a cross-type filter may scan. Default is 7 days.

If the query's effective time range (from SINCE to UNTIL, or SINCE to now) exceeds CrossTypeMaxLookbackSeconds, the cross-type filter MUST be rejected with an error indicating the time range is too large. The error SHOULD suggest narrowing the range with SINCE/UNTIL.

A query with no SINCE clause and a cross-type filter MUST be rejected -- unbounded cross-type scans are never permitted.
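
The two rejection rules can be sketched as a single validation step. `QueryRejected` and `check_cross_type_range` are hypothetical names, all times are in seconds, and the default mirrors CrossTypeMaxLookbackSeconds.

```python
class QueryRejected(Exception):
    """Raised when a query violates a cross-type filter constraint."""


def check_cross_type_range(since, until, now, max_lookback=604800):
    """Validate the effective time range of a query with a cross-type filter.

    `since` and `until` are None when the clause is absent.
    Returns the effective (start, end) range on success.
    """
    if since is None:
        raise QueryRejected("cross-type filter requires a SINCE clause")
    end = now if until is None else until
    if end - since > max_lookback:
        raise QueryRejected(
            "time range too large for cross-type filter; "
            "narrow it with SINCE/UNTIL")
    return since, end
```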

§8.5.7 Performance

Cross-type filters require reading from multiple stores. The cross-type condition is evaluated first to produce time ranges, then the primary query is executed with additional timestamp filters. This is efficient when the cross-type condition is selective (narrow time ranges), and expensive when the condition is broadly true (e.g., CPU above 10% for the entire query window).

For metric conditions, the pre-computation SHOULD use adaptive rollups (§7.4) when a matching rollup exists for the referenced metric. Reading rollup entries instead of raw samples reduces the scan from hundreds of thousands of rows to hundreds, making large lookback windows practical.

Cross-type WHERE predicates are tracked by the adaptive indexing system, same as standard WHERE predicates.

§8.6 8 Querying

Execution

§8.6.1 Query parsing

A query string is parsed into an abstract syntax tree (AST). The parser:

Identifies the mode (EVENTS, LOGS, METRIC) from the first token.
Extracts the primary selector (type pattern, FROM, metric name/labels).
Collects all clauses regardless of order.
Validates that the clauses are compatible with the mode (e.g., CONTAINING is only valid in LOGS mode, RATE is only valid in METRIC mode).

Parse errors are returned immediately without executing anything.

§8.6.2 Execution order

Regardless of clause order in the query string, execution follows a fixed sequence:

1. Cross-type conditions -- WHERE METRIC / WHERE EVENT / WHERE LOG conditions are evaluated first to produce time range filters.
2. Primary selector -- type pattern (events), FROM (logs), or metric name/labels (metrics) narrows the data source.
3. SINCE / UNTIL -- time range filter applied.
4. WHERE -- all WHERE predicates are ANDed and evaluated. Cross-type time ranges from step 1 are included as additional timestamp filters.
5. ERROR ONLY / CONTAINING -- log-specific filters (evaluated as WHERE predicates internally).
6. Value functions -- RATE, DELTA, P95, etc. for metrics.
7. GROUP -- grouping for aggregation.
8. Aggregation -- COUNT BY, TOP N BY, COUNT, SUM, AVG, MIN, MAX, DISTINCT.
9. Window functions -- AVG_OVER, MIN_OVER, MAX_OVER for metrics.
10. SORT -- ordering.
11. SKIP / TAKE -- pagination and limiting.
12. SELECT -- result records narrowed to specified fields.

SELECT is applied last -- it controls the shape of output, not the visibility of fields to other clauses.
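
One way to obtain a fixed execution sequence from clauses written in any order is to tag each parsed clause with its stage and sort by stage rank. This is a sketch only; the stage names and the `plan` function are hypothetical labels for the sequence listed above.

```python
STAGES = [
    "cross_type", "primary_selector", "time_range", "where", "log_filter",
    "value_function", "group", "aggregation", "window_function",
    "sort", "skip_take", "select",
]
_RANK = {name: index for index, name in enumerate(STAGES)}


def plan(clauses):
    """Order (stage, text) pairs by execution stage.

    `clauses` is in the order written in the query. sorted() is stable,
    so clauses of the same stage keep their written order.
    """
    return sorted(clauses, key=lambda clause: _RANK[clause[0]])
```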

§8.6.3 SQL translation

For events and logs, the query is translated to SQL internally. Header field predicates translate to SQL WHERE clauses. Payload field predicates (events) translate to msgpack_extract function calls. Log fields translate directly to column references.

For metrics, the query is translated to SQL against the metric store's series and samples tables. The series table is used for name and label resolution (cached in memory). The samples table is queried for the time range.

The SQL translation is an implementation detail. Clients never see SQL.

§8.6.4 Cross-shard fan-out (events only)

Event queries execute against all databases in the event store directory. Results from individual shards are merged depending on the query type:

Non-aggregation queries (with SORT and TAKE): each shard returns up to SKIP + TAKE rows sorted by the sort key (or TAKE rows if no SKIP is present). The merge is an N-way merge of sorted streams. SKIP and TAKE are applied after the merge by the coordinator. Total rows read: at most (SKIP + TAKE) × shard_count.
COUNT: each shard returns its local count. The final result is the sum across all shards.
COUNT BY / TOP N BY / GROUP with COUNT: each shard returns per-group counts. The merge sums counts for the same group key across shards, then sorts by count descending and applies TAKE if present.
GROUP with SUM: each shard returns per-group sums. The merge sums per-group values across shards.
GROUP with AVG: each shard returns per-group sum and count. The merge computes the average from the combined sum and count across shards.
GROUP with MIN / MAX: each shard returns per-group min/max. The merge takes the min/max across shards.
DISTINCT: each shard returns its local distinct values. The merge computes the distinct union across all shards.

Aggregation is pushed down to individual shards wherever possible. The merge operates on partial aggregates, not full row sets. This bounds memory usage to the cardinality of the group key × shard count, not the total row count.
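
The coordinator-side merges can be sketched as follows. This is an illustration, not the eventd implementation: the function names and the shapes of the per-shard partial results are hypothetical, and each shard is assumed to return rows already sorted in the requested direction.

```python
import heapq
from collections import Counter
from itertools import islice


def merge_sorted(shard_rows, key, skip=0, take=None, reverse=False):
    """N-way merge of per-shard sorted rows, then SKIP/TAKE."""
    merged = heapq.merge(*shard_rows, key=key, reverse=reverse)
    stop = None if take is None else skip + take
    return list(islice(merged, skip, stop))


def merge_counts(shard_counts, take=None):
    """COUNT BY / TOP N BY: sum per-group counts, sort by count descending."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return total.most_common(take)


def merge_avg(shard_partials):
    """GROUP with AVG: combine per-group (sum, count) pairs, then divide."""
    combined = {}
    for partial in shard_partials:
        for group, (part_sum, part_count) in partial.items():
            cur_sum, cur_count = combined.get(group, (0.0, 0))
            combined[group] = (cur_sum + part_sum, cur_count + part_count)
    return {group: (s / n if n else None)
            for group, (s, n) in combined.items()}
```

Averaging per-shard averages directly would weight every shard equally regardless of how many rows it holds, which is why AVG is merged from sums and counts.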

Log and metric queries operate on single databases (one log store, one metric store) and do not require fan-out.

ⓘ Informative

Non-aggregation queries without TAKE have no implicit row limit. A broad query such as EVENTS SINCE 7d ago may return millions of rows, consuming significant memory during the cross-shard merge. The query timeout (QueryTimeoutMs) is the primary backstop against runaway queries. Implementations SHOULD stream merged results to the client incrementally rather than materialising the full result set in memory.

§8.6.5 Adaptive indexing integration

Every event query MUST be recorded by the adaptive indexing system (§3.3). For each WHERE predicate:

Header column references increment that column's query frequency counter.
Payload field references increment that field path's query frequency counter.

This applies to event queries only. Log and metric stores have fixed indexes.

§8.6.6 Payload extraction

ⓘ Informative

Constructing flat-map results from event records requires decoding the msgpack payload for each returned row. At high result counts (thousands of events), this becomes the dominant query-path cost. Implementations SHOULD use partial/lazy extraction: when SELECT is present, only decode the named payload fields rather than the entire payload. When no SELECT is present, a streaming msgpack decoder that emits key-value pairs without building a full in-memory representation reduces allocation pressure.

§8.6.7 Read connections

Query execution uses read-only SQLite connections. Read-only connections in WAL mode do not block writer threads. eventd SHOULD support multiple concurrent queries.

§8.6.8 Concurrency limits

eventd MUST enforce a maximum number of concurrent queries (streaming and non-streaming combined) to prevent resource exhaustion.

Key	Type	Default	Valid range	Description
MaxConcurrentQueries	REG_DWORD	128	1--4096	Maximum number of concurrent queries across all clients. Includes both streaming and non-streaming queries.

When the limit is reached, new queries MUST be rejected with an error. The per-query resource cost includes read-only SQLite connections (one per shard for event queries), memory for result merging, and CPU for query execution. The MaxStreamingQueries limit (§8.7) is enforced separately and is typically lower because streaming queries hold resources indefinitely.

§8.6.9 Timeouts

Queries MUST have a maximum execution time.

Key	Type	Default	Valid range	Description
QueryTimeoutMs	REG_DWORD	30000	1000--300000	Maximum query execution time in milliseconds.

§8.7 8 Querying

Streaming

§8.7.1 Overview

The STREAM keyword marks a query as a streaming query. It may appear anywhere in the query string and is treated as a boolean flag. Streaming is supported for EVENTS and LOGS queries. METRIC queries do not support streaming.

ⓘ Informative

Streaming queries are a convenience for interactive tailing and dashboards. Streaming behaviour can feel unintuitive under unusual conditions (cross-type filters evaluated per-batch rather than per-event, high-frequency metric thresholds, backpressure disconnects). Where more predictable results are needed, repeated non-streaming queries with a sliding SINCE window are a more reliable approach. Where minimal latency is critical (security monitoring, anti-virus), direct KMES ring buffer attachment via a dedicated tool (e.g., revstr) bypasses eventd entirely and provides sub-millisecond event access.

ⓘ Informative

Metric streaming is omitted from v0.23 because metric data is sampled at regular intervals (typically every 15 seconds) and the primary metric consumer is a dashboard that polls. Live streaming of individual metric samples adds complexity with limited value for the typical use case. Future versions may add metric streaming if demand warrants it.

§8.7.2 Behavior

eventd executes the query normally and delivers the initial result set.
Instead of closing the query, eventd enters a watch state.
When new records are committed to the relevant store(s), eventd evaluates them against the query's filters.
Matching records are delivered to the client.
The loop continues until the client disconnects or the query is cancelled.

§8.7.3 Notification

Writer threads signal when a batch commit completes. Streaming query handlers wait for this signal rather than polling. The signal mechanism is implementation-defined.
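Because the signal mechanism is implementation-defined, the following is only one possible shape: a condition variable paired with a commit generation counter, so that a handler that was busy during a commit does not miss it. The class and method names are hypothetical.

```python
import threading


class CommitSignal:
    """Wakes streaming handlers when a writer thread commits a batch."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._generation = 0  # incremented once per batch commit

    def notify_commit(self) -> None:
        """Called by a writer thread after a batch commit completes."""
        with self._cond:
            self._generation += 1
            self._cond.notify_all()

    def wait_for_commit(self, last_seen: int, timeout: float = None) -> int:
        """Block until a commit newer than last_seen, or until timeout.

        Returns the current generation; equal to last_seen on timeout.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._generation > last_seen, timeout)
            return self._generation
```

A streaming handler keeps the generation it last processed and passes it back on the next wait, which removes the need for polling.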

§8.7.4 Latency

Streaming latency is bounded by the batch commit interval of the relevant store. For events, this is approximately MaxBatchLatencyMs (default 100ms). For logs, approximately LogMaxBatchLatencyMs (default 500ms). The actual latency may be lower under light load when the adaptive batcher commits more frequently.

§8.7.5 Connection limits

eventd MUST enforce a maximum number of concurrent streaming queries to prevent resource exhaustion from malicious or excessive streaming connections.

Key	Type	Default	Valid range	Description
MaxStreamingQueries	REG_DWORD	64	1--1024	Maximum number of concurrent streaming queries across all clients.

When the limit is reached, new streaming queries MUST be rejected with an error. Non-streaming queries are unaffected by this limit.
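The interaction between MaxStreamingQueries and MaxConcurrentQueries (§8.6.8) can be sketched as a single admission counter in which a streaming query counts against both limits. The class below is illustrative only; its name and methods are assumptions.

```python
import threading


class QueryAdmission:
    """Admission control for MaxConcurrentQueries and MaxStreamingQueries."""

    def __init__(self, max_concurrent: int = 128, max_streaming: int = 64):
        self._lock = threading.Lock()
        self._max_concurrent = max_concurrent
        self._max_streaming = max_streaming
        self._active = 0      # streaming and non-streaming combined
        self._streaming = 0   # streaming only

    def try_admit(self, streaming: bool) -> bool:
        """Return True if the query may run, False if it must be rejected."""
        with self._lock:
            if self._active >= self._max_concurrent:
                return False
            if streaming and self._streaming >= self._max_streaming:
                return False
            self._active += 1
            if streaming:
                self._streaming += 1
            return True

    def release(self, streaming: bool) -> None:
        """Called when an admitted query finishes or is terminated."""
        with self._lock:
            self._active -= 1
            if streaming:
                self._streaming -= 1
```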

ⓘ Informative

The global limit prevents total resource exhaustion but does not prevent a single caller from consuming all available slots. Per-caller limits (e.g., maximum streaming queries per process GUID) would provide fairer allocation, but require a KACS primitive for identifying the caller's process GUID on the query socket connection. This is the same datagram peer identity gap noted in §9.2 -- until KACS provides the necessary primitives, per-caller streaming limits are not possible. The global limit is the interim protection.

§8.7.6 Backpressure

If a streaming client cannot keep up with the rate of matching records, eventd MUST drop the streaming query and notify the client with an error rather than buffering unboundedly. Streaming queries MUST NOT block or slow the write path.

Backpressure is detected via the socket send buffer. When eventd attempts to send a result message to a streaming client and the socket send buffer is full, the streaming query MUST be terminated immediately. eventd MUST NOT block on the send. The client receives an error message if the socket can still accept it; otherwise the connection is closed.
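A minimal sketch of this check, assuming the client socket has been placed in non-blocking mode. The function name try_send is hypothetical. A partial send is treated as backpressure as well, since it would leave a truncated frame on the stream.

```python
import socket


def try_send(sock: socket.socket, message: bytes) -> bool:
    """Attempt to queue one result message without blocking.

    Returns True if the whole message was accepted by the send buffer.
    Returns False if the streaming query should be terminated.
    """
    try:
        sent = sock.send(message)
    except (BlockingIOError, BrokenPipeError, ConnectionResetError):
        # Send buffer full, or the peer has already gone away.
        return False
    return sent == len(message)
```

On a False result the handler would make one best-effort attempt to send the error message and then close the connection.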

§8.7.7 Cross-type conditions during streaming

The initial result set uses pre-computed time ranges for cross-type conditions (§8.5). During the streaming phase, pre-computed ranges are stale and MUST NOT be reused.

For each committed batch, eventd MUST re-evaluate cross-type conditions against the current state of the referenced store. For metric conditions (WHERE METRIC), eventd queries the most recent sample for the referenced series using the batch's latest event timestamp and evaluates the condition against that sample. This is a single index seek per cross-type condition per batch.

If the condition is not met, the entire batch is filtered out (no records delivered for that batch). If the condition is met, the batch's records are filtered by the remaining WHERE predicates as normal.
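The per-batch evaluation can be illustrated as follows. The sketch stands in for the index seek with a binary search over an in-memory list of sample timestamps. The comparison operators, the function name, and the choice to treat a missing sample as "condition not met" are assumptions made for the example.

```python
import bisect
import operator

# Hypothetical operator table; the actual condition grammar is defined in §8.5.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def batch_passes(sample_ts, sample_values, batch_latest_ts, op, threshold):
    """Evaluate a metric condition once for a whole committed batch.

    sample_ts must be sorted ascending; sample_values is parallel to it.
    """
    # Most recent sample at or before the batch's latest event timestamp.
    i = bisect.bisect_right(sample_ts, batch_latest_ts) - 1
    if i < 0:
        return False  # assumption: no sample yet means the condition fails
    return OPS[op](sample_values[i], threshold)
```

If batch_passes returns False the whole batch is skipped; otherwise the remaining WHERE predicates are applied record by record.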

ⓘ Informative

Evaluating the metric condition once per batch (using the batch's latest timestamp) rather than once per event is an optimisation that produces identical results at typical metric sample intervals (15 seconds). The batch commit interval (default 100ms) is far shorter than the metric sample interval, so all events in a batch map to the same metric sample. At abnormally high metric resolutions (sub-second sampling), this optimisation can produce slightly coarser filtering than per-event evaluation -- events near a metric threshold crossing may be included or excluded as a group rather than individually.

§8.7.8 Filter restrictions

During the streaming phase, only WHERE predicates (including cross-type conditions) are evaluated against new records. SORT, TAKE, SKIP, COUNT BY, TOP N BY, DISTINCT, and GROUP do not apply to streamed records -- they apply only to the initial result set. Streamed records are delivered in commit order.

SELECT applies to streamed records -- only the specified fields are included.

§8.8 8 Querying

Transport

§8.8.1 Socket interface

eventd MUST expose a Unix domain socket for query access. The socket path is configured via the QuerySocketPath registry key under Machine\System\eventd\. There is no compiled-in default -- if the key does not exist or is invalid, eventd MUST fail to start.

The query socket is shared across all three data types. The query mode (EVENTS, LOGS, METRIC) is determined by parsing the query string, not by the transport.

§8.8.2 Wire protocol

The query protocol is request-response over the Unix socket. Each message is a length-prefixed msgpack-encoded value:

Field	Type	Size	Description
`length`	`u32`	4 bytes	Total length of the msgpack payload in bytes. Little-endian.
`payload`	msgpack	`length` bytes	The request or response body.

eventd MUST reject messages whose length exceeds MaxQueryMessageBytes. This prevents a malicious or buggy client from forcing a large memory allocation before the query is even parsed.

Key	Type	Default	Valid range	Description
MaxQueryMessageBytes	REG_DWORD	65536	1024--16777216	Maximum permitted query message size in bytes. Messages exceeding this limit are rejected without reading the payload.
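The framing rule can be sketched as below. The payload is treated as opaque bytes that have already been msgpack-encoded, so no msgpack library is needed for the example. The function names are hypothetical, and the stream argument is any object with a read(n) method (for a socket, the object returned by makefile("rb") would do).

```python
import io
import struct

MAX_QUERY_MESSAGE_BYTES = 65536  # default from the table above


def encode_frame(payload: bytes) -> bytes:
    """Prefix an already-encoded payload with its u32 little-endian length."""
    return struct.pack("<I", len(payload)) + payload


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("peer closed the connection mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream, max_bytes: int = MAX_QUERY_MESSAGE_BYTES) -> bytes:
    """Read one frame, rejecting oversized messages before reading the payload."""
    (length,) = struct.unpack("<I", read_exact(stream, 4))
    if length > max_bytes:
        raise ValueError("message exceeds MaxQueryMessageBytes")
    return read_exact(stream, length)


# Round trip through an in-memory stream.
decoded = read_frame(io.BytesIO(encode_frame(b"payload-bytes")))
```

Checking the length before reading the body is what keeps an oversized request from forcing an allocation.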

§8.8.2.1 Request format

A query request is a msgpack map:

Field	Type	Required	Description
`query`	string	Yes	The query string.

§8.8.2.2 Response format

Result message:

Field	Type	Description
`status`	string	`"ok"`.
`records`	array of map	Result records. Each record is a flat msgpack map.

Each record is a self-describing map. Different records in the same response MAY have different sets of keys (event payload fields vary by event type, metric labels vary by series).

End message:

Field	Type	Description
`status`	string	`"end"`.

Sent after the last result message for non-streaming queries.

Error message:

Field	Type	Description
`status`	string	`"error"`.
`error`	string	Error description (parse error, timeout, type mismatch, etc.).

§8.8.2.3 Value encoding

Value type	Msgpack encoding
Integer	msgpack integer
Float	msgpack float64
String	msgpack string
GUID	msgpack string in standard GUID format
Boolean	msgpack boolean
Nested map (payload)	msgpack map
Array (payload)	msgpack array
NULL	msgpack nil

§8.8.2.4 Streaming responses

For streaming queries, eventd sends the initial result set, then continues sending result messages as new matching records are committed. There is no end message for streaming queries.

§8.8.3 Connection lifecycle

One query per connection. Multiple concurrent queries require multiple connections. The connection is closed after the end message (non-streaming) or on client disconnect (streaming).

§8.8.4 Access control

Query access control is defined in the access control chapter. eventd checks the connecting process's credentials before executing the query.

Section

9 Access control

§9.1 9 Access control

Access Control Model

§9.1.1 Overview

eventd enforces access control on the read path using KACS Security Descriptors and the KACS AccessCheck API (PSD-004). Every query is evaluated against SDs that determine which data -- and which fields within that data -- the caller is authorized to see. Unauthorized records and fields are silently filtered from the result set.

Access control on the write path is handled by other subsystems: KMES controls event emission (SeAuditPrivilege), and the log and metric ingestion sockets use filesystem permissions.

eventd does not implement its own access check logic. All access decisions are delegated to the KACS kacs_access_check (syscall 1023) and kacs_access_check_list (syscall 1024) syscalls, which run the full KACS AccessCheck pipeline including integrity checks, restricted token evaluation, confinement, conditional ACE evaluation, and SACL audit emission.

§9.1.2 Security objects

Access control is defined on named patterns. Each pattern represents a category of observability data and has an associated SD. The three data types have independent pattern namespaces:

Event patterns control access to events by event type.
Log patterns control access to logs by origin (service name).
Metric patterns control access to metrics by metric name.

A pattern matches using dot-delimited prefix semantics. The pattern kacs matches the exact string "kacs" and any string that begins with "kacs." (note the trailing dot). It does NOT match kacs_extended or kacsfoo -- the dot is the hierarchy separator. The pattern * is the wildcard default that matches everything.
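The matching rule is small enough to state as code. The function name is hypothetical.

```python
def pattern_matches(pattern: str, name: str) -> bool:
    """Dot-delimited prefix match; "*" matches everything."""
    if pattern == "*":
        return True
    # Exact match, or the pattern followed by the hierarchy separator.
    return name == pattern or name.startswith(pattern + ".")
```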

§9.1.3 Access rights

eventd defines the following object-specific access rights (bits 0-15 of the access mask):

Right	Bit	Value	Description
`EVENTD_READ`	0	0x0001	Read records matching this pattern.
`EVENTD_CLEAR`	1	0x0002	Delete records matching this pattern (for future administrative operations).

Generic mapping for eventd objects:

Generic right	Maps to
GENERIC_READ	EVENTD_READ | READ_CONTROL
GENERIC_WRITE	EVENTD_CLEAR | READ_CONTROL
GENERIC_EXECUTE	EVENTD_READ | READ_CONTROL
GENERIC_ALL	EVENTD_READ | EVENTD_CLEAR | DELETE | READ_CONTROL | WRITE_DAC | WRITE_OWNER

The generic mapping is passed to kacs_access_check via the generic_read, generic_write, generic_execute, and generic_all fields.

§9.1.4 Per-field access control

eventd supports per-field access control using KACS object ACEs and object type lists. Each queryable field is assigned a GUID. An SD can grant EVENTD_READ on specific field GUIDs, allowing fine-grained control over which fields a caller can see.

§9.1.4.1 Object type list

When performing an access check, eventd constructs an object type list as defined by PSD-004 §10.5. The list is a tree with the security pattern as the root and individual fields as children:

Level 0: Root (pattern GUID -- e.g., GUID for "kacs")
  Level 1: timestamp field GUID
  Level 1: event_type field GUID
  Level 1: cpu_id field GUID
  Level 1: origin_class field GUID
  Level 1: effective_token_guid field GUID
  Level 1: true_token_guid field GUID
  Level 1: process_guid field GUID
  Level 1: payload field GUID (covers all payload fields)

The access check is performed using kacs_access_check_list (syscall 1024), which returns separate verdicts for each node in the tree. eventd uses the per-node results to include or exclude fields from the result record.

§9.1.4.2 SD construction for per-field control

An SD with per-field restrictions uses object ACEs:

An object ACE with no object type GUID applies to all fields (the root).
An object ACE with a field GUID applies to that specific field.

Example: grant SecurityAdmins full read access, but grant MonitoringTeam read access to only timestamp, event_type, and cpu_id:

Allow ACE: SecurityAdmins, EVENTD_READ (no object GUID -- applies to root, propagates to all fields)
Allow ACE: MonitoringTeam, EVENTD_READ, object GUID = timestamp
Allow ACE: MonitoringTeam, EVENTD_READ, object GUID = event_type
Allow ACE: MonitoringTeam, EVENTD_READ, object GUID = cpu_id

MonitoringTeam members querying KACS events see records with only timestamp, event_type, and cpu_id. Payload fields, identity GUIDs, and other header fields are excluded.

§9.1.4.3 Field GUIDs

Field GUIDs are generated deterministically using UUID v5 (RFC 4122). A fixed namespace UUID is defined for eventd field GUIDs:

EVENTD_FIELD_NAMESPACE = {e7d3a1b0-5c2f-4e8a-9b1d-0a6f3c8e2d4b}

A field's GUID is computed as:

field_guid = uuid_v5(EVENTD_FIELD_NAMESPACE, field_name)

Where field_name is the field's query-language name as a UTF-8 string. For header fields, this is the column name (e.g., "timestamp", "event_type", "cpu_id"). For payload fields, this is the dot-separated path (e.g., "granted_access", "target_sid", "source.name"). For log fields, this is the column name (e.g., "origin", "message", "is_error"). For metric labels, this is the label key (e.g., "core", "device").

The computation is deterministic: the same field name always produces the same GUID. No central registry is required. An SD author computes the GUID from the field name using the same algorithm when constructing object ACEs.

The field GUID does not encode the event type, log origin, or metric name. The SD hierarchy provides that scoping. An object ACE referencing the granted_access GUID in the SD for pattern kacs means "the granted_access field of KACS events."
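An SD author could compute field GUIDs with any standard UUID v5 implementation. The sketch below uses Python's uuid module with the namespace defined above; the helper name field_guid is hypothetical.

```python
import uuid

# Namespace UUID defined in this section.
EVENTD_FIELD_NAMESPACE = uuid.UUID("e7d3a1b0-5c2f-4e8a-9b1d-0a6f3c8e2d4b")


def field_guid(field_name: str) -> uuid.UUID:
    """Compute the deterministic GUID for a query-language field name."""
    return uuid.uuid5(EVENTD_FIELD_NAMESPACE, field_name)


timestamp_guid = field_guid("timestamp")
nested_guid = field_guid("source.name")  # dot-separated payload path
```

Because the result depends only on the namespace and the name, two tools that never communicate produce the same GUID for the same field.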

§9.1.4.4 Object type list construction

When performing an access check, eventd constructs the object type list dynamically based on the fields present in the record being checked:

The root node (level 0) uses a fixed GUID for the security pattern's data type (one GUID for events, one for logs, one for metrics).
For each field in the record, a level-1 node is added with the field's deterministically computed GUID.

For event records, the object type list includes nodes for all header fields plus all payload fields present in that specific event's payload. Different event types produce different object type lists because they have different payload fields. The access check result is cached per (token, pattern, field set) tuple.

For log records, the field set is fixed (timestamp, origin, is_error, message, boot_id) and the object type list is the same for all log records.

For metric records, the field set includes the fixed metric fields (timestamp, name, type, value) plus label keys, which vary per series.

§9.1.5 Pattern resolution

When eventd evaluates access for a specific event type, log origin, or metric name, it resolves the applicable SD using hierarchical matching:

Look for an exact match on the full identifier (e.g., kacs.access_denied).
Walk up the hierarchy by removing the last dot-separated component (e.g., kacs).
Fall back to the wildcard default (*).

The first match wins. More specific patterns override less specific ones.
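A sketch of the resolution walk, assuming the walk-up step repeats at every level of a multi-component identifier. The function name is hypothetical, and configured stands for the set of pattern names that currently have an SD. A None result corresponds to the fail-closed case in which not even the wildcard default exists.

```python
def resolve_pattern(identifier: str, configured: set):
    """Return the most specific configured pattern for identifier, or None."""
    candidate = identifier
    while True:
        if candidate in configured:
            return candidate
        if "." not in candidate:
            break
        # Remove the last dot-separated component and try again.
        candidate = candidate.rsplit(".", 1)[0]
    return "*" if "*" in configured else None
```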

§9.1.6 SD storage

SDs are stored as registry values under the eventd security subtree:

Machine\System\eventd\Security\Events\*                     → SD (default for all events)
Machine\System\eventd\Security\Events\kacs                  → SD (all KACS events)
Machine\System\eventd\Security\Events\kacs.access_denied    → SD (specific override)
Machine\System\eventd\Security\Logs\*                        → SD (default for all logs)
Machine\System\eventd\Security\Logs\loregd                   → SD (loregd logs)
Machine\System\eventd\Security\Metrics\*                     → SD (default for all metrics)
Machine\System\eventd\Security\Metrics\cpu                   → SD (all cpu.* metrics)

The wildcard default keys (*) MUST exist. If a default SD does not exist, eventd MUST deny access to all data of that type (fail-closed).

§9.1.7 Default SDs

On first boot, eventd MUST create the default security keys if they do not exist:

Key	Default SD
`Machine\System\eventd\Security\Events\*`	SYSTEM and Administrators: EVENTD_READ on all fields.
`Machine\System\eventd\Security\Logs\*`	SYSTEM, Administrators, and Authenticated Users: EVENTD_READ on all fields.
`Machine\System\eventd\Security\Metrics\*`	SYSTEM, Administrators, and Authenticated Users: EVENTD_READ on all fields.

ⓘ Informative

The defaults reflect the sensitivity hierarchy: events (which include security audit data) are restricted to administrators by default. Logs and metrics are readable by all authenticated users because they primarily contain operational data. Administrators can tighten these defaults by modifying the SDs.

§9.1.8 Conditional ACEs

SDs on eventd security objects MAY contain conditional ACEs (PSD-004 §3.8). eventd SHOULD pass relevant contextual information as local claims via the local_claims_ptr parameter of the access check syscall. This enables attribute-based policies such as "allow read if the caller's department claim equals 'security'".

The specific local claims passed by eventd are implementation-defined in v0.23.

§9.2 9 Access control

Enforcement

§9.2.1 Caller identification

When a client connects to the query socket, eventd MUST obtain the caller's token by calling kacs_open_peer_token (PSD-004 syscall 1010) on the connected socket file descriptor. This returns a token fd representing the peer's identity, captured at connection time.

If kacs_open_peer_token fails, eventd MUST deny the query entirely (fail-closed).

§9.2.2 Query-time enforcement

Access control is enforced at query time, not at storage time. All events, logs, and metrics are stored regardless of who will eventually query them. Different callers querying the same data see different results based on their token.

This design is correct because:

Audit events MUST be stored regardless of who can read them. Filtering at storage time would violate audit integrity.
SDs can change over time. An administrator can grant or revoke access retroactively.
Multiple users with different access levels query the same event store.

§9.2.3 Access check flow

For each query:

Obtain the caller's token via kacs_open_peer_token.
Parse the query to determine the data source and filters.
Execute the query against the database(s).
For each unique pattern in the result set (distinct event type, log origin, or metric name):
  a. Resolve the SD for that pattern using hierarchical matching (§9.1).
  b. Construct the object type list with field GUIDs.
  c. Call kacs_access_check_list (PSD-004 syscall 1024) with the caller's token, the resolved SD, EVENTD_READ, the object type list, and an audit context identifying the pattern.
  d. Cache the per-field results for this (token, pattern) pair.
If the query contains GROUP, COUNT BY, TOP N BY, SORT, or DISTINCT referencing a specific field, verify that the caller has EVENTD_READ on that field's GUID for all patterns that could appear in the result. If any pattern denies access to the referenced field, the query MUST be rejected with an error. Aggregating or sorting by a field the caller cannot see is not permitted -- returning a NULL bucket or silently excluding records would produce misleading results.
For each record in the result set:
  a. Look up the cached per-field results for the record's pattern.
  b. If the root node was denied, exclude the entire record.
  c. If the root node was granted, include the record. For each field, include it only if the corresponding field node was granted.
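The final per-record step can be sketched as a pure function over the cached verdicts. The function name and the shape of the verdict map (field name to boolean) are assumptions for illustration; the sketch follows the steps as written and treats a field with no verdict as denied.

```python
def filter_record(record: dict, root_granted: bool, field_granted: dict):
    """Apply cached access check verdicts to one result record.

    Returns None when the whole record must be excluded, otherwise a copy
    of the record containing only the fields whose node was granted.
    """
    if not root_granted:
        return None
    return {
        name: value
        for name, value in record.items()
        if field_granted.get(name, False)  # assumption: no verdict means deny
    }
```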

§9.2.4 Caching

Access check results MUST be cached to avoid redundant syscalls. The caching strategy has two levels:

Record-level caching. If the SD for a pattern contains no object ACEs (no per-field restrictions), the access check result is a simple grant/deny on the root. This result is cached per (token, pattern) pair. A query returning 10,000 events across 20 distinct event types performs at most 20 access checks.

Field-level caching. If the SD contains object ACEs, the result depends on which fields are present in the record (different event types have different payload fields and thus different object type lists). The result is cached per (token, pattern, field set) tuple. Events of the same type have the same field set, so in practice this means one access check per (token, event type) pair.

SD caching. The SD resolution (pattern to SD lookup) SHOULD be cached across queries. eventd SHOULD watch the security registry subtree for changes and invalidate the SD cache when SDs are modified.

§9.2.5 Filtered results

Records and fields filtered by access control are silently excluded. The query response does not indicate that records or fields were filtered. The caller sees a result set that appears complete for their access level.

COUNT, COUNT BY, TOP N BY, and other aggregation queries MUST reflect only the records the caller is authorized to see.

§9.2.6 Streaming enforcement

For streaming queries, per-field access check results are cached from the initial query. If an SD changes during a streaming query, the SD cache is invalidated and subsequent batches are re-checked. Token changes (e.g., group membership changes) are not reflected during a streaming query -- the token is a snapshot captured at connection time.

§9.2.7 Cross-type filter enforcement

Cross-type filters (WHERE METRIC, WHERE EVENT, WHERE LOG) are subject to access control on both the primary data source and the cross-referenced data source. If the caller does not have EVENTD_READ on the cross-referenced pattern, the cross-type condition evaluates as if no matching data exists.

This ensures that cross-type filtering cannot be used to infer data the caller is not authorized to see.

§9.2.8 Audit trail

Every access check performed by eventd produces a KACS audit event via the SACL audit walk in the AccessCheck pipeline. eventd MUST pass an audit_context blob identifying the security pattern being accessed (e.g., "events:kacs.access_denied" or "logs:loregd"). This context appears in the emitted audit events, allowing the audit trail to identify exactly which observability data was accessed and by whom.

§9.2.9 Write-path access control

§9.2.9.1 Events

Event emission is controlled by KMES (PSD-003). The kmes_emit and kmes_emit_batch syscalls require SeAuditPrivilege. eventd is not involved in write-path access control for events.

§9.2.9.2 Logs and metrics

The log and metric ingestion sockets use filesystem permissions. The socket files SHOULD be created with permissions that allow all services managed by peinit to write.

ⓘ Informative

Origin spoofing is a known gap in v0.23. The origin field in log records and the name field in metric records are self-reported by the sender. Any process with filesystem write access to the datagram sockets can claim any origin or metric name. A compromised service can inject logs or metrics under another service's identity, creating false operational narratives or masking real incidents.

The correct fix is SD-based write access control that validates the sender's KACS token against an SD governing which origins/metric names the sender is authorized to write. This requires a KACS primitive for identifying the peer token on datagram socket messages (analogous to kacs_open_peer_token for stream sockets, which is not currently defined for datagrams). Until this primitive exists, write-path identity validation for logs and metrics is not possible. This is a priority item for KACS and eventd coordination in a future version.

In the interim, filesystem permissions limit which processes can reach the sockets, and the read-path access control model (SDs on origins and metric names) prevents unauthorized users from querying spoofed data if the SDs are configured to restrict access to the legitimate origin.

Section

10 Startup and shutdown

§10.1 10 Startup and shutdown

Startup

§10.1.1 Dependencies

eventd requires the following subsystems to be available before it can operate:

KMES (PSD-003) -- for event ingestion. KMES is available as soon as PKM is loaded.
LCS / loregd (PSD-005, PSD-006) -- for registry configuration. eventd reads all its configuration from the registry.
KACS (PSD-004) -- for access control. eventd uses the KACS AccessCheck API for query authorization and kacs_open_peer_token for caller identification.
peinit (PSD-007) -- for boot ID and service lifecycle management.

eventd is a peinit-managed service. peinit starts eventd after loregd is available (eventd cannot read its configuration without the registry).

§10.1.2 Bootstrap sequence

eventd startup proceeds in the following order:

§10.1.2.1 Phase 1: Configuration

Read all configuration keys from the registry under Machine\System\eventd\. Required keys are EventStorePath, LogStorePath, MetricStorePath, QuerySocketPath, LogSocketPath, and MetricSocketPath. If any required key is missing or invalid, eventd MUST fail to start.
Read optional configuration keys (StorageShards, MaxBatchSize, MaxBatchLatencyMs, LogMaxBatchSize, LogMaxBatchLatencyMs, and all other tuning parameters). Apply compiled-in defaults for missing keys.
Arm a persistent watch on Machine\System\eventd\ to detect configuration changes at runtime.

§10.1.2.2 Phase 2: Storage initialization

Open or create the event shard databases in the event store directory. For each shard: verify schema version, open in WAL mode with synchronous=FULL, create tables and indexes if new. Log errors for databases with unrecognised schema versions (open read-only for queries, do not write).
Open or create the log store database. Verify schema, open in WAL mode with synchronous=NORMAL.
Open or create the metric store database. Verify schema, open in WAL mode with synchronous=NORMAL. The series cache starts empty and is populated on demand as metrics arrive.
Open or create the eventd-meta.db metadata database in the event store directory. Load adaptive indexing state: read query frequency counters and desired index set. Load adaptive rollup state: read rollup counters and desired rollup set. Discover material indexes from each shard's schema.

§10.1.2.3 Phase 3: Boot boundary

Read the current boot ID from peinit.
Compare the boot ID against the most recently stored boot ID in the event shard databases.
If the boot ID differs (new boot): reset all per-CPU sequence trackers to 0, record the new boot ID.
If the boot ID matches (restart within same boot): read the last persisted sequence number per CPU from the event store, resume sequence tracking from those values.
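The boot boundary decision reduces to a comparison of boot IDs. In the sketch below the function name and argument shapes are hypothetical, and a CPU with no persisted sequence number is assumed to resume from 0.

```python
def initial_sequences(current_boot_id: str, stored_boot_id, persisted: dict,
                      cpu_count: int) -> dict:
    """Return the starting sequence tracker value for each CPU."""
    if current_boot_id != stored_boot_id:
        # New boot: every per-CPU tracker restarts from 0.
        return {cpu: 0 for cpu in range(cpu_count)}
    # Restart within the same boot: resume from the persisted values.
    return {cpu: persisted.get(cpu, 0) for cpu in range(cpu_count)}
```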

§10.1.2.4 Phase 4: KMES attachment

Discover the CPU count and attach to each per-CPU ring buffer by calling kmes_attach(cpu_id) (PSD-003 syscall 1091) with incrementing cpu_id values starting from 0 until EINVAL is returned. The caller's token MUST hold SeSecurityPrivilege.
Map each per-CPU ring buffer.
Compute shard-to-CPU assignments based on the configured StorageShards value and the CPU count.

§10.1.2.5 Phase 5: Socket creation

Create the query socket at QuerySocketPath. If a stale socket file exists from a previous crash, unlink it before creating the new socket.
Create the log ingestion socket at LogSocketPath. Unlink stale socket files if present.
Create the metric ingestion socket at MetricSocketPath. Unlink stale socket files if present.
Set filesystem permissions on the log and metric sockets to allow service writes.

§10.1.2.6 Phase 6: Thread startup

Start one drain thread per CPU. Each drain thread begins reading from its assigned ring buffer.
Start one writer thread per event shard.
Start the log ingestion thread.
Start the metric ingestion thread.
Start the retention background thread.
Start the adaptive indexing/rollup background thread.

§10.1.2.7 Phase 7: Ready

Emit a synthetic synthetic.startup event recording the boot ID, shard count, and per-CPU sequence resume points.
Signal readiness to peinit.

§10.1.3 Failure during startup

If any phase fails, eventd MUST NOT signal readiness to peinit. eventd SHOULD log the failure and exit with a nonzero status. peinit's restart policy determines whether eventd is retried.

Partial startup is not permitted. eventd either completes the full bootstrap sequence and signals readiness, or it fails entirely. There is no degraded mode where eventd operates without one of its stores or without KMES attachment.

ⓘ Informative

The "no degraded mode" rule is a v0.23 simplification. Future versions may allow eventd to operate with partial functionality (e.g., log and metric ingestion without event ingestion if KMES is unavailable). For v0.23, the all-or-nothing model is simpler and avoids complex partial-failure state management.

§10.1.4 Configuration changes at runtime

Configuration change notifications MUST be deferred until after the bootstrap sequence completes (phase 7). Changes that arrive during startup are queued and processed after readiness is signalled. This prevents configuration reloads from interacting with partially initialised state.

When a change is detected:

Tuning parameters (batch sizes, latencies, retention periods, adaptive thresholds, query timeout): applied immediately to the running instance.
Socket paths: ignored until restart. Changing a socket path requires an eventd restart.
Store paths: ignored until restart. Changing a store path requires an eventd restart.
StorageShards: ignored until restart.

eventd MUST emit a synthetic synthetic.config_change event for each configuration change applied at runtime.

§10.2 10 Startup and shutdown

Shutdown

§10.2.1 Graceful shutdown

When peinit signals eventd to stop, eventd MUST perform a graceful shutdown. The goal is to persist as much in-flight data as possible without blocking indefinitely.

§10.2.1.1 Shutdown sequence

Stop accepting new connections. Close the query, log, and metric ingestion sockets. Existing streaming queries are terminated with an error.
Drain remaining log and metric data. Read any pending datagrams from the log and metric socket buffers and process them. This is bounded by the socket buffer size and completes quickly.
Final event drain. Each drain thread performs one final drain cycle from its KMES ring buffer, reading all available events.
Final batch commit. Each writer thread commits its current batch immediately, regardless of batch size. The log and metric writers do the same.
Persist sequence state. Write the last persisted sequence number per CPU to the event store metadata. This enables correct sequence resumption on restart.
Emit shutdown event. Write a synthetic synthetic.shutdown event recording the per-CPU last persisted sequence numbers. This event is written directly to shard 0's database. If shard 0 is unavailable (corrupted or excluded during this session), the event is written to the lowest-numbered available shard. If no shard is available, the shutdown event is skipped.
Close databases. Close all SQLite connections (writer and reader connections). SQLite checkpoints the WAL automatically when the last connection to a database closes.
Unmap ring buffers. Unmap all KMES ring buffer mappings and close the per-CPU file descriptors.
Exit.

§10.2.1.2 Shutdown timeout

eventd MUST complete the shutdown sequence within a bounded time. If the sequence has not completed within the timeout, eventd MUST abort and exit immediately. The timeout is determined by peinit's service stop timeout (PSD-007).

If shutdown is aborted:

In-flight event batches that have not been committed are lost. These events remain in the KMES ring buffers and will be available when eventd restarts (if they have not been overwritten).
In-flight log and metric batches are lost (acceptable given log/metric loss tolerance).
Per-CPU sequence numbers may not be persisted. On restart, eventd will detect a gap between the last persisted sequence and the current ring buffer state.

§10.2.2 Crash recovery

If eventd crashes (SIGSEGV, SIGKILL, OOM, or any ungraceful termination):

KMES ring buffers are unaffected. KMES continues writing events regardless of consumer state. Events emitted while eventd is down accumulate in the ring buffers.
SQLite databases are consistent. WAL mode guarantees that committed transactions survive a crash. Uncommitted transactions (the in-flight batch at crash time) are rolled back automatically by SQLite on the next open.
Sequence gap. Events emitted between the last committed batch and the crash are not persisted. On restart, eventd detects this as a sequence gap and records it as a synthetic gap record.
Log and metric data in socket buffers is lost. The kernel discards the socket receive buffer on process exit. This is acceptable given log/metric loss tolerance.

No manual recovery action is required. eventd restarts, re-attaches to KMES, resumes draining from the ring buffers, and continues normal operation. The gap between the last persisted event and the first event available in the ring buffer is recorded as a gap.

§10.2.3 Signal handling

eventd MUST handle the following signals:

Signal	Behavior
SIGTERM	Initiate graceful shutdown.
SIGINT	Initiate graceful shutdown.
SIGQUIT	Initiate graceful shutdown with a diagnostic dump (implementation-defined).
SIGHUP	Re-read configuration from the registry (equivalent to a configuration watch notification).

All other signals use default behavior.

Section

11 Failure modes

§11.1 11 Failure modes

Failure Modes

§11.1.1 KMES ring buffer overrun

When events are emitted faster than eventd can drain them, the per-CPU ring buffers fill and KMES overwrites the oldest events.

eventd detects the loss as a sequence gap on the affected CPU.
A synthetic synthetic.gap record is written to the event store.
eventd continues draining from the oldest surviving event (tail_pos).
This is the most serious data loss scenario for events. Ring buffer overrun means audit events were lost irrecoverably.

Mitigations:

Adaptive batch sizing maximises commit throughput.
Adaptive index shedding reduces per-insert overhead under pressure.
Configurable shard count provides linear write scaling.
KMES ring buffer size is configurable (BufferCapacity) to increase the absorption window.

§11.1.2 Disk full

When the filesystem containing the event, log, or metric store reaches capacity, SQLite write operations fail.

§11.1.2.1 Event store

SQLite returns an error on INSERT or COMMIT. The writer thread MUST NOT crash.
The current batch is lost. Events in the failed batch were already consumed from the KMES ring buffer and cannot be recovered from KMES.
The writer thread MUST record the per-CPU sequence ranges of events in the failed batch in an in-memory lost-batch list. On the next successful commit, the writer MUST emit synthetic synthetic.gap records for all accumulated lost-batch ranges before writing new events. This ensures that disk-full data loss is recorded in the event store once disk recovers.
If eventd crashes before disk recovers, the in-memory lost-batch list is lost. On restart, the gap between the last persisted sequence and the current ring buffer position is detected as a normal restart gap (§3.5), so the loss is still recorded.
The writer thread MUST log the commit failure to stderr immediately (peinit captures this), including the CPU IDs and sequence ranges of the lost events. This provides immediate visibility even if the gap record cannot be written yet.
Events continue accumulating in the KMES ring buffers. If the disk remains full long enough for the ring buffers to overrun, additional event loss occurs (detected by the drain thread's normal sequence gap mechanism).
The retention process SHOULD be triggered immediately to attempt to free space by deleting old data.
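The lost-batch bookkeeping described above might look like the following sketch. The class name, the method names, and the gap record fields (first_seq, last_seq) are hypothetical; the actual gap record layout is defined elsewhere in this specification. The list is cleared only after the commit that carries the gap records succeeds, so a repeated failure does not discard earlier ranges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

SeqRange = Tuple[int, int]  # inclusive (first_seq, last_seq) on one CPU


@dataclass
class LostBatchList:
    """In-memory record of per-CPU sequence ranges lost to failed commits."""

    ranges: Dict[int, List[SeqRange]] = field(default_factory=dict)

    def record_failure(self, batch_ranges: Dict[int, SeqRange]) -> None:
        # Called when INSERT or COMMIT fails; maps cpu_id -> range in the batch.
        for cpu_id, (first, last) in batch_ranges.items():
            cpu_ranges = self.ranges.setdefault(cpu_id, [])
            if cpu_ranges and cpu_ranges[-1][1] + 1 == first:
                # Back-to-back failed batches on one CPU merge into one range.
                cpu_ranges[-1] = (cpu_ranges[-1][0], last)
            else:
                cpu_ranges.append((first, last))

    def pending_gap_records(self) -> List[dict]:
        # Built inside the next transaction, before new events are written.
        return [
            {"event_type": "synthetic.gap", "cpu_id": cpu,
             "first_seq": first, "last_seq": last}
            for cpu, cpu_ranges in sorted(self.ranges.items())
            for first, last in cpu_ranges
        ]

    def clear(self) -> None:
        # Called only after the transaction carrying the gap records commits.
        self.ranges.clear()
```

A writer would call record_failure from its error path, prepend pending_gap_records() to the next batch, and call clear() once that batch commits.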

§11.1.2.2 Log store

Log batches that fail to commit are lost. Acceptable given log loss tolerance.
The log writer thread retries on the next batch.

§11.1.2.3 Metric store

Metric batches that fail to commit are lost. Acceptable given metric loss tolerance.
The metric writer thread retries on the next batch.

§11.1.3 SQLite corruption

If a SQLite database becomes corrupt (hardware error, filesystem bug, incomplete write due to kernel crash):

Corruption is detected at startup by verifying that the required tables and indexes exist (structural check). eventd MUST NOT run PRAGMA integrity_check at startup -- it scans the entire database and is O(DB size), which is unacceptable for large event stores. Corruption that does not affect the schema structure (e.g., a single corrupt page) is detected at query or write time when SQLite encounters the corrupt page and returns an error.
eventd MUST NOT write to a database that fails the structural check.
A corrupt event shard database is excluded from the write path but remains available for read-only queries on the uncorrupted portions. SQLite can often read rows from uncorrupted pages.
eventd MUST log the corruption and emit a synthetic synthetic.storage_error event (to a healthy shard).
If all event shards are corrupt, eventd creates new shard databases and continues writing. Historical data is accessible only from the corrupt databases on a best-effort basis.
If the log or metric store is corrupt, eventd creates a new database and continues. Historical data from the corrupt database is available for read-only queries on a best-effort basis.

Recovery from corruption is an administrative operation. eventd does not attempt automatic repair.
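A structural check of the kind described above can be done by reading the schema catalogue rather than scanning the data pages. The sketch below assumes Python's sqlite3 module; the function name and the table and index names passed to it are placeholders, since the real schema is defined in §3.1.

```python
import sqlite3
from typing import Iterable


def passes_structural_check(
    conn: sqlite3.Connection,
    required_tables: Iterable[str],
    required_indexes: Iterable[str],
) -> bool:
    """Return True if every required table and index exists in the schema.

    Only sqlite_master is read, so the cost does not grow with the amount
    of stored data (unlike PRAGMA integrity_check).
    """
    try:
        rows = conn.execute(
            "SELECT type, name FROM sqlite_master "
            "WHERE type IN ('table', 'index')"
        ).fetchall()
    except sqlite3.DatabaseError:
        # An unreadable schema (for example a damaged header) fails the check.
        return False
    present = {(kind, name) for kind, name in rows}
    wanted = {("table", name) for name in required_tables}
    wanted |= {("index", name) for name in required_indexes}
    return wanted <= present
```

A database that fails this check would be excluded from the write path; corruption confined to individual data pages is not visible to this check and surfaces later as query or write errors.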

§11.1.4 eventd crash

If eventd crashes:

KMES is unaffected. Events continue accumulating in ring buffers.
SQLite databases are consistent (WAL guarantees). Uncommitted batches are rolled back.
peinit restarts eventd per its service restart policy.
On restart, eventd detects the gap between its last persisted sequence numbers and the current ring buffer state. The gap is recorded as a synthetic gap record.
Log and metric data in socket buffers at crash time is lost.

No manual intervention required. See §10.2 for crash recovery details.

§11.1.5 Registry unavailable

If LCS / loregd becomes unavailable after eventd has started:

eventd retains its last known configuration. Configuration changes are not applied until the registry is available again.
Access control SD lookups fall back to the cached SDs. If the cache is cold for a particular pattern, access is denied (fail-closed).
eventd continues operating with stale configuration indefinitely. This is a degraded state but not a failure.
When the registry becomes available again, the watch fires and eventd re-reads its configuration.

§11.1.6 KACS unavailable

If KACS becomes unavailable after eventd has started:

kacs_open_peer_token fails on new query connections. New queries are denied.
kacs_access_check / kacs_access_check_list fails. Queries in progress that need a fresh access check are denied.
Cached access check results remain valid for their query's duration.
Event ingestion is unaffected -- the drain and write path does not use KACS.
Log and metric ingestion are unaffected.

eventd continues ingesting data but cannot serve queries. When KACS recovers, query service resumes.

§11.1.7 Query timeout

If a query exceeds QueryTimeoutMs:

The query is cancelled. eventd returns an error to the client.
Read-only SQLite connections used by the query are released.
No data loss occurs. The query simply did not complete.

Queries that scan large unindexed datasets are the primary timeout risk. The adaptive indexing system (§3.3) mitigates this over time by creating indexes for frequently queried fields.
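One way to enforce a deadline on a read-only SQLite query is a progress handler that aborts the statement once the deadline has passed. The sketch below uses Python's sqlite3 module for illustration; this specification does not prescribe the cancellation mechanism for queries, and the function name run_with_timeout and the handler interval of 1000 VM instructions are assumptions.

```python
import sqlite3
import time


def run_with_timeout(conn, sql, params=(), timeout_ms=30000):
    """Run a query, raising TimeoutError if it exceeds timeout_ms."""
    deadline = time.monotonic() + timeout_ms / 1000.0

    def past_deadline():
        # A non-zero return value makes SQLite abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Ask SQLite to invoke the handler every 1000 virtual machine instructions.
    conn.set_progress_handler(past_deadline, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    except sqlite3.DatabaseError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {timeout_ms} ms") from exc
        raise  # unrelated SQL errors are passed through unchanged
    finally:
        # Remove the handler so the connection can be reused by other queries.
        conn.set_progress_handler(None, 0)
```

The connection remains usable after a cancelled query, which matches the requirement that read-only connections are released rather than discarded.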

§11.1.8 Log or metric socket backpressure

If the log or metric socket receive buffer is full when a sender transmits a datagram:

The kernel drops the datagram silently. Neither the sender nor eventd is notified.
This is by design. Senders MUST NOT be blocked by eventd's ingestion rate.
eventd MAY track a dropped-datagram estimate (by monitoring socket statistics) but this is not normative.

§11.1.9 Writer thread stall during index creation

If an adaptive index build is in progress when events arrive:

The drain threads detect rising write pressure and signal the writer thread to cancel the index build.
The writer thread cancels the CREATE INDEX via sqlite3_interrupt() and resumes normal event writing.
See §3.3 for the full cancellation mechanism.

§11.1.10 Writer thread stall during retention

If the retention process holds a write lock on a shard database:

The shard's writer thread blocks briefly until the retention process releases the lock.
Retention operates in small, bounded delete batches to minimise lock hold time.
Events accumulate in the KMES ring buffer during the stall.
Under sustained write pressure, the retention process SHOULD yield more frequently (smaller batches, longer pauses between batches).
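The bounded-batch deletion pattern can be sketched as below. The table name events, the column name timestamp, and the default batch size and pause are placeholders for illustration; the real schema and retention policy are defined in earlier sections.

```python
import sqlite3
import time


def delete_expired(conn, cutoff_ts, batch_rows=1000, pause_s=0.01):
    """Delete rows older than cutoff_ts in small transactions.

    Returns the total number of rows deleted. Each batch is its own
    transaction, so the write lock is held only for one bounded delete.
    """
    if batch_rows < 1:
        raise ValueError("batch_rows must be at least 1")
    total = 0
    while True:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "DELETE FROM events WHERE rowid IN ("
                "SELECT rowid FROM events WHERE timestamp < ? LIMIT ?)",
                (cutoff_ts, batch_rows),
            )
            deleted = cur.rowcount
        total += deleted
        if deleted < batch_rows:
            return total  # nothing (or less than a full batch) was left
        # Pause between batches so the shard's writer thread can commit.
        time.sleep(pause_s)
```

Under write pressure a caller could lower batch_rows and raise pause_s, which corresponds to the "smaller batches, longer pauses" guidance above.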

§11.1.11 Power loss

On sudden power loss:

Event store (synchronous=FULL): all committed transactions survive. The in-flight batch (not yet committed) is lost. On restart, this appears as a sequence gap.
Log store (synchronous=NORMAL): transactions committed since the last WAL checkpoint may be lost. On restart, some recent logs may be missing. Acceptable given log loss tolerance.
Metric store (synchronous=NORMAL): same as log store. Recent metrics may be lost.

The different durability guarantees between event and log/metric stores directly reflect the importance hierarchy: events are sacred, logs and metrics are important but not fundamental.
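The two durability levels correspond to SQLite pragmas that are set when a store is opened. The following sketch uses Python's sqlite3 module; the function name open_store and its durable flag are hypothetical.

```python
import sqlite3


def open_store(path, durable):
    """Open a store in WAL mode with the durability level for its data type.

    durable=True  -> synchronous=FULL   (event shards)
    durable=False -> synchronous=NORMAL (log and metric stores)
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    if durable:
        conn.execute("PRAGMA synchronous=FULL")
    else:
        conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Reading the setting back with PRAGMA synchronous returns 2 for FULL and 1 for NORMAL, which gives a cheap startup assertion that each connection has the intended level.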

§11.1.12 Memory exhaustion

If eventd is killed by the OOM killer:

Treated identically to an eventd crash. peinit restarts eventd.
SQLite databases are consistent (WAL guarantees).
KMES ring buffers are unaffected.
To reduce OOM risk, eventd's main memory consumers are each bounded: the in-memory series cache (proportional to the number of distinct metric time series, capped by MetricSeriesCacheSize), the adaptive indexing/rollup counters (proportional to the number of distinct fields queried), and the SQLite page caches (configurable per connection).

Section

12 Appendix a

§12.1 12 Appendix a

Configuration Keys

All configuration keys live under Machine\System\eventd\. eventd ignores unknown keys in this subtree. Invalid values are ignored and the compiled-in default is retained. eventd MUST emit a synthetic synthetic.config_change event when a valid configuration change is applied.
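The handling of unknown keys and invalid values can be illustrated with a small validation routine. This is a sketch only: the TUNABLES table contains a three-key excerpt of the defaults and ranges tabulated in this appendix, and the function name and dictionary representation of the registry subtree are assumptions.

```python
# key -> (compiled-in default, minimum, maximum); excerpt for illustration
TUNABLES = {
    "MaxBatchSize": (10000, 100, 100000),
    "MaxBatchLatencyMs": (100, 10, 5000),
    "QueryTimeoutMs": (30000, 1000, 300000),
}


def effective_config(registry):
    """Compute effective values from a dict of raw registry values."""
    config = {}
    # Iterating over the known keys means unknown registry keys are ignored.
    for key, (default, lo, hi) in TUNABLES.items():
        value = registry.get(key, default)
        valid = (
            isinstance(value, int)
            and not isinstance(value, bool)
            and lo <= value <= hi
        )
        # An invalid or out-of-range value falls back to the default.
        config[key] = value if valid else default
    return config
```

For example, a registry value of 50 for MaxBatchSize lies below the valid range, so the effective value stays at 10000, while a QueryTimeoutMs of 60000 is accepted as written.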

§12.1.1 Required keys

These keys MUST exist for eventd to start. There are no compiled-in defaults.

Key	Type	Description
EventStorePath	REG_SZ	Directory path for event shard databases.
LogStorePath	REG_SZ	File path for the log store database.
MetricStorePath	REG_SZ	File path for the metric store database.
QuerySocketPath	REG_SZ	Unix socket path for query access.
LogSocketPath	REG_SZ	Unix socket path for log ingestion.
MetricSocketPath	REG_SZ	Unix socket path for metric ingestion.

§12.1.2 Event ingestion

Key	Type	Default	Valid range	Description
StorageShards	REG_DWORD	0	0--256	Number of event shard databases. 0 = CPU count.
MaxBatchSize	REG_DWORD	10000	100--100000	Maximum events per event writer transaction.
MaxBatchLatencyMs	REG_DWORD	100	10--5000	Maximum ms before an event batch is committed.

§12.1.3 Log ingestion

Key	Type	Default	Valid range	Description
LogMaxBatchSize	REG_DWORD	5000	100--100000	Maximum log records per transaction.
LogMaxBatchLatencyMs	REG_DWORD	500	10--5000	Maximum ms before a log batch is committed.

§12.1.4 Metric ingestion

Key	Type	Default	Valid range	Description
MetricMaxBatchSize	REG_DWORD	5000	100--100000	Maximum metric samples per transaction.
MetricMaxBatchLatencyMs	REG_DWORD	1000	10--5000	Maximum ms before a metric batch is committed.

§12.1.5 Adaptive indexing (events)

Key	Type	Default	Valid range	Description
AdaptiveIndexWindowHours	REG_DWORD	24	1--168	Rolling window for query frequency measurement.
AdaptiveIndexPolicyIntervalMinutes	REG_DWORD	60	60--1440	How often the desired index set is recomputed. Minimum 60 minutes.
AdaptiveIndexCreateThreshold	REG_DWORD	100	10--10000	Queries on a field required to trigger index creation.
AdaptiveIndexDropThreshold	REG_DWORD	10	1--1000	Queries below which an index is removed from the desired set. MUST be less than create threshold.

§12.1.6 Index shedding

Key	Type	Default	Valid range	Description
SheddingWindowSeconds	REG_DWORD	30	10--300	Sliding window for graduated shedding evaluation.
SheddingBatchPercent	REG_DWORD	75	50--100	Percentage of batches within the shedding window that must exceed 75% of MaxBatchSize to trigger graduated shedding.
EmergencySheddingBufferPercent	REG_DWORD	75	50--95	Ring buffer fill percentage that triggers emergency shedding.

§12.1.7 Adaptive rollups (metrics)

Key	Type	Default	Valid range	Description
AdaptiveRollupWindowHours	REG_DWORD	48	1--168	Rolling window for rollup query frequency measurement.
AdaptiveRollupCreateThreshold	REG_DWORD	50	10--10000	Queries required to trigger rollup computation.
AdaptiveRollupDropThreshold	REG_DWORD	5	1--1000	Queries below which a rollup pair is removed. MUST be less than create threshold.

§12.1.8 Metric series cache

Key	Type	Default	Valid range	Description
MetricSeriesCacheSize	REG_DWORD	50000	1000--1000000	Maximum entries in the in-memory series resolution cache (LRU).

§12.1.9 Retention

Key	Type	Default	Valid range	Description
EventRetentionDays	REG_DWORD	30	1--3650	Maximum age of events in days.
EventRetentionMaxBytes	REG_QWORD	0	0--2^64-1	Maximum total size of event shard databases. 0 = no limit.
LogRetentionDays	REG_DWORD	14	1--3650	Maximum age of log entries in days.
LogRetentionMaxBytes	REG_QWORD	0	0--2^64-1	Maximum size of log store database. 0 = no limit.
MetricRetentionDays	REG_DWORD	90	1--3650	Maximum age of metric samples in days.
MetricRetentionMaxBytes	REG_QWORD	0	0--2^64-1	Maximum size of metric store database. 0 = no limit.
RetentionCheckIntervalMinutes	REG_DWORD	60	1--1440	How often the retention process runs.

§12.1.10 Querying

Key	Type	Default	Valid range	Description
QueryTimeoutMs	REG_DWORD	30000	1000--300000	Maximum query execution time in ms.
MaxConcurrentQueries	REG_DWORD	128	1--4096	Maximum concurrent queries (streaming and non-streaming) globally.
MaxStreamingQueries	REG_DWORD	64	1--1024	Maximum concurrent streaming queries globally.
MaxQueryMessageBytes	REG_DWORD	65536	1024--16777216	Maximum query message size in bytes.

§12.1.11 Cross-type filtering

Key	Type	Default	Valid range	Description
CrossTypeWindowMs	REG_DWORD	15000	1000--300000	Time window for cross-type event/log existence checks.
CrossTypeMaxLookbackSeconds	REG_DWORD	604800	3600--2592000	Maximum time range a cross-type filter may scan. Default 7 days.

§12.1.12 Security subtree

SDs for read-path access control are stored under Machine\System\eventd\Security\:

Machine\System\eventd\Security\Events\*
Machine\System\eventd\Security\Events\<pattern>
Machine\System\eventd\Security\Logs\*
Machine\System\eventd\Security\Logs\<pattern>
Machine\System\eventd\Security\Metrics\*
Machine\System\eventd\Security\Metrics\<pattern>

See §9.1 for SD structure and default values.

§12.1.13 Runtime vs restart

Change	Effect
Tuning parameters (batch sizes, latencies, retention, adaptive thresholds, query timeout, cross-type window)	Applied immediately.
Socket paths	Requires restart.
Store paths	Requires restart.
StorageShards	Requires restart.
Security SDs	Applied on next query (SD cache invalidated by registry watch).

Section

13 Appendix b

§13.1 13 Appendix b

Constants

§13.1.1 Access rights

Right	Value	Description
EVENTD_READ	0x0001	Read records matching the security pattern.
EVENTD_CLEAR	0x0002	Delete records matching the security pattern.

§13.1.2 Generic mapping

Generic right	Maps to
GENERIC_READ	0x00020001 (EVENTD_READ | READ_CONTROL)
GENERIC_WRITE	0x00020002 (EVENTD_CLEAR | READ_CONTROL)
GENERIC_EXECUTE	0x00020001 (EVENTD_READ | READ_CONTROL)
GENERIC_ALL	0x000F0003 (EVENTD_READ | EVENTD_CLEAR | DELETE | READ_CONTROL | WRITE_DAC | WRITE_OWNER)
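The mapped values in the table can be reproduced by OR-ing the individual rights. In the sketch below the eventd-specific rights come from §13.1.1; the values of the standard rights (DELETE, READ_CONTROL, WRITE_DAC, WRITE_OWNER) are assumed to follow the conventional layout in bits 16 to 19, which is consistent with the mapped values listed above.

```python
# eventd-specific rights (from the access rights table)
EVENTD_READ = 0x0001
EVENTD_CLEAR = 0x0002

# Standard rights: values assumed here, not defined in this appendix
DELETE = 0x00010000
READ_CONTROL = 0x00020000
WRITE_DAC = 0x00040000
WRITE_OWNER = 0x00080000

GENERIC_MAPPING = {
    "GENERIC_READ": EVENTD_READ | READ_CONTROL,
    "GENERIC_WRITE": EVENTD_CLEAR | READ_CONTROL,
    "GENERIC_EXECUTE": EVENTD_READ | READ_CONTROL,
    "GENERIC_ALL": (EVENTD_READ | EVENTD_CLEAR | DELETE
                    | READ_CONTROL | WRITE_DAC | WRITE_OWNER),
}

for name, mask in GENERIC_MAPPING.items():
    # Print each mask in the same 8-digit hexadecimal form as the table.
    print(f"{name}\t0x{mask:08X}")
```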

§13.1.3 Field GUID namespace

All field GUIDs are generated using UUID v5 (RFC 4122) with the following namespace UUID:

EVENTD_FIELD_NAMESPACE = {e7d3a1b0-5c2f-4e8a-9b1d-0a6f3c8e2d4b}

Field GUIDs are computed as uuid_v5(EVENTD_FIELD_NAMESPACE, field_name) where field_name is the UTF-8 field name string.
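As an illustration, the computation maps directly onto a standard UUID v5 routine. The sketch below uses Python's uuid module; the helper name field_guid is hypothetical.

```python
import uuid

# Namespace UUID from this section.
EVENTD_FIELD_NAMESPACE = uuid.UUID("e7d3a1b0-5c2f-4e8a-9b1d-0a6f3c8e2d4b")


def field_guid(field_name: str) -> uuid.UUID:
    """Return the field GUID for a field name (name is hashed as UTF-8)."""
    return uuid.uuid5(EVENTD_FIELD_NAMESPACE, field_name)


if __name__ == "__main__":
    for name in ("timestamp", "event_type", "core"):
        print(name, "{%s}" % field_guid(name))
```

Because UUID v5 is deterministic, every implementation that uses the same namespace and the same UTF-8 field name derives the same GUID without any shared registry of values.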

§13.1.4 Data type root GUIDs

Used as the level-0 node in object type lists for access checks.

Data type	GUID
Events	`{a1b2c3d4-0001-4000-8000-000000000001}`
Logs	`{a1b2c3d4-0001-4000-8000-000000000002}`
Metrics	`{a1b2c3d4-0001-4000-8000-000000000003}`

§13.1.5 Well-known field GUIDs

Computed from uuid_v5(EVENTD_FIELD_NAMESPACE, field_name) for reference. Implementations MUST compute these from the algorithm, not hardcode them.

§13.1.5.1 Event header fields

Field name	Description
`timestamp`	Wall clock time.
`cpu_id`	CPU identifier.
`sequence`	Per-CPU sequence number.
`origin_class`	Origin class (userspace, KMES, KACS, LCS).
`event_type`	Event type string.
`effective_token_guid`	Effective token GUID.
`true_token_guid`	Process primary token GUID.
`process_guid`	Process GUID.
`boot_id`	Boot ID GUID.

§13.1.5.2 Log fields

Field name	Description
`timestamp`	Wall clock time.
`origin`	Service name.
`is_error`	stderr flag.
`message`	Log text.
`boot_id`	Boot ID GUID.

§13.1.5.3 Metric fields

Field name	Description
`timestamp`	Sample time.
`name`	Metric name.
`type`	Metric type (counter, gauge, histogram).
`value`	Numeric value.

Metric label keys produce field GUIDs using the same algorithm. Label key "core" produces uuid_v5(EVENTD_FIELD_NAMESPACE, "core").

§13.1.6 Synthetic event types

Event type	Emitted when
`synthetic.startup`	eventd starts and attaches to KMES.
`synthetic.shutdown`	eventd begins graceful shutdown.
`synthetic.gap`	Sequence gap detected on a CPU.
`synthetic.config_change`	Configuration value applied at runtime.
`synthetic.storage_error`	Write failure or detected corruption on any store.

§13.1.7 Metric type identifiers

Value	Type
0	Counter
1	Gauge
2	Histogram

§13.1.8 Rollup function identifiers

Value	Function
0	AVG
1	MIN
2	MAX
3	SUM
4	RATE
5	DELTA

Percentile functions (P50, P95, P99) are not rollup-eligible. They are computed from raw samples only.

§13.1.9 Log severity

Value	Meaning
0	Normal (stdout).
1	Error (stderr).

ⓘ Informative

The is_error column stores this as an integer. The query language exposes it as a boolean. The query engine MUST accept both boolean (WHERE is_error == true) and integer (WHERE is_error == 1) comparisons -- true is equivalent to 1 and false is equivalent to 0. The ERROR ONLY clause is syntactic sugar for WHERE is_error == true.

§13.1.10 Schema versions

Store	Schema version
Event shard databases	1
Log store database	1
Metric store database	1

§13.1.11 Wire protocol

Field	Size	Type	Description
Message length	4 bytes	u32 LE	Length of the msgpack payload.
Message payload	variable	msgpack	Request or response body.
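The length-prefixed framing can be sketched as follows. The payload is treated as opaque bytes here because msgpack encoding is outside the scope of the example; the function names are hypothetical, and applying the MaxQueryMessageBytes default (65536) to the payload length is an assumption about how that limit is measured.

```python
import struct


def encode_frame(payload: bytes) -> bytes:
    """Prefix a msgpack payload with its length as a little-endian u32."""
    return struct.pack("<I", len(payload)) + payload


def decode_frames(buffer: bytes, max_payload: int = 65536):
    """Split a receive buffer into complete payloads.

    Returns (payloads, remainder), where remainder holds the bytes of a
    trailing incomplete frame that should be kept until more data arrives.
    """
    payloads = []
    offset = 0
    while len(buffer) - offset >= 4:
        (length,) = struct.unpack_from("<I", buffer, offset)
        if length > max_payload:
            raise ValueError("frame exceeds maximum message size")
        if len(buffer) - offset - 4 < length:
            break  # payload not fully received yet
        start = offset + 4
        payloads.append(buffer[start:start + length])
        offset = start + length
    return payloads, buffer[offset:]
```

A stream reader would append newly received bytes to the remainder and call decode_frames again, so message boundaries never depend on how the socket happens to chunk the data.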

§13.1.11.1 Query response status values

Status	Meaning
`"ok"`	Result records follow in the `records` field.
`"end"`	No more results. Query complete.
`"error"`	Error occurred. Description in the `error` field.

Section

14 Appendix c

§14.1 14 Appendix c

Recommended Implementation Optimisations

The following optimisations are not normative. They do not affect the storage format, wire protocol, query language, or any observable behavior. An implementation that omits all of them is fully conformant. However, each one provides measurable throughput or latency improvement with no behavioural trade-offs, and implementers are encouraged to adopt them.

§14.1.1 Arena allocation for event copies

Drain threads copy events from the KMES ring buffer into process-local memory at rates up to hundreds of thousands of events per second. Using the system allocator (malloc/Box::new) for each variable-sized event creates allocator pressure: freelist bookkeeping, potential lock contention in multi-threaded allocators, and occasional page faults when the allocator requests new pages from the OS.

Implementations SHOULD use a per-drain-cycle arena allocator (e.g., bumpalo in Rust). Pre-allocate a block of memory at the start of each drain cycle, hand out sequential chunks for each event copy (a pointer bump, no bookkeeping), and free the entire block after the batch is handed off to the writer thread. This reduces per-event allocation cost from ~20-50ns to ~1-2ns and eliminates allocation-related latency spikes.

§14.1.2 Consumer thread affinity

Drain threads that read from per-CPU ring buffers benefit from being pinned to the same NUMA node as the CPU whose buffer they read. While not strictly necessary (the per-CPU design eliminates write contention regardless of consumer placement), NUMA-local reads avoid cross-node memory traffic during the drain loop.

For the 1:1 case (one shard per CPU), pinning the drain thread to the same CPU as its ring buffer gives optimal cache locality: the ring buffer pages are likely already in that CPU's L3 cache from the kernel write.

§14.1.3 Partial msgpack extraction for query results

When constructing flat-map query results from event records, implementations SHOULD use partial/lazy msgpack extraction rather than decoding the entire payload. When a SELECT clause names specific payload fields, only those fields need to be extracted. A streaming msgpack decoder that scans for specific keys and skips unneeded values avoids the cost of building a full in-memory representation of the payload.

For payloads with many fields where only one or two are selected, partial extraction reduces per-row CPU cost by an order of magnitude.

§14.1.4 Prepared statement pooling

Writer threads use one prepared INSERT statement each (§2.4). Query handlers executing translated SQL SHOULD maintain a pool of prepared statements for common query patterns. SQLite prepared statements cache the query plan; re-preparing the same SQL on every query wastes CPU on parsing and planning.

A small LRU pool of ~50-100 prepared statements per read connection covers the common case where operators repeat similar queries.

§14.1.5 Batch socket reads for logs and metrics

The log and metric ingestion threads read datagrams from Unix sockets. The recvmmsg syscall reads multiple datagrams in a single kernel round-trip, reducing per-message syscall overhead. On Linux, recvmmsg can read up to vlen messages at once (typically up to 1024).

Under sustained log or metric load, calling recvmmsg with a batch size of 64-256 messages cuts the number of receive syscalls by up to that factor, improving ingestion throughput without any protocol or format changes.

§14.1.6 SQLite page cache tuning

Each SQLite connection has a page cache (default ~2MB). For event shard writer connections, the page cache holds B-tree pages for the events table and its indexes. Under sustained writes with adaptive indexes, the working set of hot pages can exceed the default cache size, causing frequent page evictions and re-reads.

Implementations SHOULD tune the page cache size per connection based on the number of active indexes. A reasonable heuristic: 2MB base + 1MB per active secondary index. For read-only query connections, a smaller cache (512KB-1MB) is sufficient since queries are typically short-lived.
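The heuristic translates into a single pragma per connection. The sketch below uses Python's sqlite3 module and reads "MB" as MiB; the function names are hypothetical. SQLite interprets a negative cache_size as a size in KiB rather than a page count.

```python
import sqlite3


def writer_cache_kib(active_secondary_indexes: int) -> int:
    """2 MiB base plus 1 MiB per active secondary index, in KiB."""
    return 2048 + 1024 * active_secondary_indexes


def apply_cache_size(conn: sqlite3.Connection, kib: int) -> None:
    # A negative value tells SQLite the figure is in KiB, not pages.
    conn.execute(f"PRAGMA cache_size = -{int(kib)}")
```

A writer connection would recompute and reapply the value whenever its set of active indexes changes, for example after an adaptive index is created or shed.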

§14.1.7 Shard file descriptor management

With many active and historical shard databases, the number of open file descriptors can be significant (each SQLite connection in WAL mode holds descriptors for the database file, the WAL, and the WAL's shared-memory index). For the query path, which opens read-only connections to every database in the event store directory, implementations SHOULD use a connection pool with a bounded number of open connections, opening and closing historical shard connections on demand rather than holding them all open permanently.

Active shard writer connections MUST remain open for the lifetime of the process. Historical shard read connections can be opened lazily when a query touches them and closed after a period of inactivity.
