Phase 6 Implementation Plan: Production Hardening

Overview

Phase 6 closes out the refactor by picking up every "deferred to production hardening" item from Phases 1–5 and delivering a Loki build that's safe to run as a multi-tenant service. The preceding phases made Loki functionally a server — Phase 6 makes it operationally a server. That means real rate limiting instead of a stub, per-subject session ownership instead of flat visibility, Prometheus metrics instead of in-memory counters, structured JSON logging, deployment manifests, security headers, config validation, and operational runbooks.

This is the final phase. After it lands, Loki v1 is production-ready: you can run loki --serve in a container behind a reverse proxy, scrape its metrics from Prometheus, route requests through a rate limiter, and have multiple tenants share the same instance without seeing each other's data.

Estimated effort: ~1 week.

Risk: Low. Most of the work is applying well-known patterns (sliding-window rate limiting, row-level authz, Prometheus, structured logging) on top of the architecture the previous phases already built. No new core types, no new pipelines.

Depends on: Phases 1–5 complete. The API server runs, the MCP pool works, sessions are UUID-keyed.


Why Phase 6 Exists

Phases 4 and 5 got the API server running with correct semantics, but several explicit gaps were called out as "stubs" or "follow-ups." A Phase 4 deployment is usable for a trusted single-tenant context (an internal tool, a personal server) but unsafe for anything else:

  • Anyone with a valid API key can see every session. Phase 4 flagged this as "single-tenant-per-key." In a multi-tenant deployment where Alice and Bob both have keys, Alice can list Bob's sessions and read their messages. This is a security issue, not a feature gap.
  • No real rate limiting. Phase 4's max_concurrent_requests semaphore caps parallelism but doesn't throttle per-subject request rates. A single runaway client can exhaust the whole concurrency budget.
  • No metrics for external observability. Phase 5 added in-memory counters, but they're only reachable via the .info mcp dot-command or a one-shot JSON endpoint. Production needs Prometheus scraping so alerting and dashboards work.
  • Logs aren't structured. The tracing spans from Phase 4 middleware emit human-readable text. Aggregators like Loki (the other one), Datadog, or CloudWatch want JSON with correlation IDs.
  • No deployment story. There's no Dockerfile, no systemd unit, no documented way to actually run the thing in production. Every deploying team has to reinvent this.
  • Security headers missing. Phase 4's CORS handles cross-origin; it doesn't set X-Content-Type-Options, X-Frame-Options, or similar defaults that a browser-facing endpoint should have.
  • No config validation at startup. Mistyped config values produce runtime errors hours after deployment instead of failing fast at startup.
  • Operational procedures are undocumented. How do you rotate auth keys? How do you reload MCP credentials? What's the runbook when the MCP hit rate drops? None of this is written down.

Phase 6 delivers answers to all of the above. It's the "you can actually deploy this" phase.


What Phase 6 Delivers

Grouped by theme rather than by dependency order. Each item is independently valuable and can be worked on in parallel.

Security and isolation

  1. Per-subject session ownership — every session records the authenticated subject that created it; reads/writes are authz-checked against the caller's subject.
  2. Scope-based authorization — AuthContext.scopes are enforced per endpoint (e.g., read:sessions, write:sessions, admin:mcp). Phase 4's middleware already populates scopes; Phase 6 adds the enforcement.
  3. JWT support — extends AuthConfig with a Jwt { issuer, audience, jwks_url } variant that validates tokens against a JWKS endpoint and extracts subject + scopes from claims.
  4. Security headers middleware — X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy: strict-origin, optional HSTS when behind HTTPS.
  5. Audit logging — structured audit events for every authenticated request (subject, action, target, result), written to a dedicated sink so they survive log rotation.

Throughput and fairness

  1. Per-subject rate limiting — sliding-window limiter keyed by subject. Enforces rate_limit_per_minute and related config. Returns 429 Too Many Requests with a Retry-After header.
  2. Per-subject concurrency limit — subject-scoped semaphore so one noisy neighbor can't exhaust the global concurrency budget.
  3. Backpressure signal — expose a /healthz/ready endpoint that returns 503 when the server is saturated, so upstream load balancers can drain traffic.

Observability

  1. Structured JSON logging — every log line is JSON with timestamp, level, target, request_id, subject, session_id, and fields. Routes through tracing_subscriber with fmt::layer().json().
  2. Prometheus metrics endpoint — /metrics exposing the existing Phase 5 counters plus new HTTP metrics (http_requests_total, http_request_duration_seconds, http_requests_in_flight), MCP metrics (mcp_pool_size, mcp_acquire_latency_seconds histogram), and session metrics (sessions_active_total, sessions_created_total).
  3. Liveness and readiness probes — /healthz/live for process liveness (always 200 unless shutting down), /healthz/ready for dependency readiness (config loaded, MCP pool initialized, storage writable).

Operability

  1. Config validation at startup — a dedicated ApiConfig::validate() that checks every field against a schema and fails fast with a readable error message listing all problems, not just the first one.
  2. SIGHUP config reload — reloads auth keys, log level, and rate limit settings without restarting the server. Does NOT reload MCP pool config (requires restart because the pool holds live subprocesses).
  3. Dockerfile + multi-stage build — minimal runtime image based on debian:bookworm-slim with the compiled binary, config directory, and non-root user.
  4. systemd service unit — with Type=notify, sandboxing directives, and resource limits.
  5. docker-compose example — for local development with nginx-as-TLS-terminator in front.
  6. Kubernetes manifests — Deployment, Service, ConfigMap, Secret, HorizontalPodAutoscaler.

Documentation

  1. Operational runbook (docs/RUNBOOK.md) — documented procedures for common scenarios.
  2. Deployment guide (docs/DEPLOYMENT.md) — end-to-end instructions for each deployment target.
  3. Security guide (docs/SECURITY.md) — threat model, hardening checklist, key rotation procedures.

Core Type Additions

Most of Phase 6 hangs off existing types. A few new concepts need introducing.

AuthContext enrichment

Phase 4 defined AuthContext { subject: String, scopes: Vec<String> }. Phase 6 extends it:

pub struct AuthContext {
    pub subject: String,
    pub scopes: Scopes,
    pub key_id: Option<String>,        // for audit log correlation
    pub claims: Option<JwtClaims>,     // present when auth mode is Jwt
}

pub struct Scopes(HashSet<String>);

impl Scopes {
    pub fn has(&self, scope: &str) -> bool;
    pub fn has_any(&self, required: &[&str]) -> bool;
    pub fn has_all(&self, required: &[&str]) -> bool;
}

pub enum Scope {
    ReadSessions,      // "read:sessions"
    WriteSessions,     // "write:sessions"
    ReadAgents,        // "read:agents"
    RunAgents,         // "run:agents"
    ReadModels,        // "read:models"
    AdminMcp,          // "admin:mcp"
    AdminSessions,     // "admin:sessions" — can see all users' sessions
}

The Scope enum provides typed constants for the well-known scope strings used in the handlers. Custom scopes (for callers to define their own access tiers) continue to work as raw strings.
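
A minimal sketch of that mapping and a handler-side check, assuming an as_str accessor and a helper name that are illustrative rather than part of the plan's type list:

impl Scope {
    /// Map the typed constant to its wire string (sketch).
    pub fn as_str(&self) -> &'static str {
        match self {
            Scope::ReadSessions => "read:sessions",
            Scope::WriteSessions => "write:sessions",
            Scope::ReadAgents => "read:agents",
            Scope::RunAgents => "run:agents",
            Scope::ReadModels => "read:models",
            Scope::AdminMcp => "admin:mcp",
            Scope::AdminSessions => "admin:sessions",
        }
    }
}

// Usage in a handler, combining the typed constant with the string-based set:
fn can_list_all_sessions(ctx: &AuthContext) -> bool {
    ctx.scopes.has(Scope::AdminSessions.as_str())
}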

SessionOwnership in the session store

The session metadata needs to record who owns each session so reads/writes can be authorized:

pub struct SessionMeta {
    pub id: SessionId,
    pub alias: Option<SessionAlias>,
    pub owner: Option<String>,         // subject that created it; None = legacy
    pub last_modified: SystemTime,
    pub is_autoname: bool,
}

On disk, the ownership field goes into the session's YAML file under a reserved _meta block:

_meta:
  owner: "alice"
  created_at: "2026-04-10T15:32:11Z"
  created_by_key_id: "key_3f2a..."
# ... rest of session fields unchanged

The SessionStore trait gets one new method (set_owner) plus enriched open and list signatures:

#[async_trait]
pub trait SessionStore: Send + Sync {
    // existing methods unchanged except:
    async fn open(
        &self,
        agent: Option<&str>,
        id: SessionId,
        caller: Option<&AuthContext>,  // NEW: for authz check
    ) -> Result<SessionHandle, StoreError>;

    async fn list(
        &self,
        agent: Option<&str>,
        caller: Option<&AuthContext>,  // NEW: for filtering
    ) -> Result<Vec<SessionMeta>, StoreError>;

    // NEW: transfer ownership (e.g., admin reassignment)
    async fn set_owner(
        &self,
        id: SessionId,
        new_owner: Option<String>,
    ) -> Result<(), StoreError>;
}

caller: None means internal or legacy access (CLI/REPL) — skip authz entirely. caller: Some(...) means an API call — enforce ownership.

Authz rules:

  • Own session: full access.
  • Other subject's session: denied unless caller has admin:sessions scope.
  • Legacy sessions with owner: None: accessible to anyone (grandfathered); every mutation attempts to set the owner to the current caller, so unowned sessions are gradually claimed.
  • list: returns only sessions owned by the caller (or all if they have admin:sessions).
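
A sketch of the check these rules reduce to, assuming a private helper inside the store implementation (the Forbidden variant is illustrative):

fn authorize(meta: &SessionMeta, caller: Option<&AuthContext>) -> Result<(), StoreError> {
    let Some(ctx) = caller else {
        return Ok(()); // internal/CLI access: skip authz entirely
    };
    match &meta.owner {
        None => Ok(()),                                  // legacy session: grandfathered
        Some(owner) if owner == &ctx.subject => Ok(()),  // own session: full access
        Some(_) if ctx.scopes.has("admin:sessions") => Ok(()),
        Some(_) => Err(StoreError::Forbidden {           // illustrative error variant
            session: meta.id.clone(),
        }),
    }
}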

RateLimiter and ConcurrencyLimiter

pub struct RateLimiter {
    windows: DashMap<String, SlidingWindow>,
    config: RateLimitConfig,
}

struct SlidingWindow {
    bucket_a: AtomicU64,
    bucket_b: AtomicU64,
    last_reset: AtomicU64,
}

pub struct RateLimitConfig {
    pub per_minute: u32,
    pub burst: u32,
}

impl RateLimiter {
    pub fn check(&self, subject: &str) -> Result<(), RateLimitError>;
}

pub struct RateLimitError {
    pub retry_after: Duration,
    pub limit: u32,
    pub remaining: u32,
}

pub struct SubjectConcurrencyLimiter {
    semaphores: DashMap<String, Arc<Semaphore>>,
    per_subject: usize,
}

impl SubjectConcurrencyLimiter {
    pub async fn acquire(&self, subject: &str) -> OwnedSemaphorePermit;
}

Both live in ApiState and are applied via middleware. Rate limiting runs first (cheap atomic operations), then concurrency acquisition (may block briefly).
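
A sketch of how the two-bucket weighted check could look, given the atomics above (rotation details and memory orderings are assumptions; Step 4 covers the real implementation):

use std::{sync::atomic::Ordering, time::Duration};

impl SlidingWindow {
    /// Returns Ok(()) and records the request, or Err(retry_after) when over the limit.
    /// Rotation is not strictly race-free, which is acceptable for rate limiting:
    /// a request near the window edge may land in either bucket.
    fn check(&self, limit: u32, now_secs: u64) -> Result<(), Duration> {
        let minute = now_secs / 60;
        let last = self.last_reset.load(Ordering::Acquire);
        if minute > last
            && self
                .last_reset
                .compare_exchange(last, minute, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
        {
            // New window: the current bucket becomes the previous one
            // (or zero if more than one full window has passed).
            let old = self.bucket_a.swap(0, Ordering::AcqRel);
            self.bucket_b
                .store(if minute == last + 1 { old } else { 0 }, Ordering::Release);
        }
        // Weight the previous bucket by how much of it still overlaps
        // the trailing one-minute window.
        let elapsed = (now_secs % 60) as f64 / 60.0;
        let current = self.bucket_a.load(Ordering::Acquire) as f64;
        let previous = self.bucket_b.load(Ordering::Acquire) as f64;
        let effective = current + previous * (1.0 - elapsed);
        if effective + 1.0 > limit as f64 {
            Err(Duration::from_secs(60 - now_secs % 60))
        } else {
            self.bucket_a.fetch_add(1, Ordering::AcqRel);
            Ok(())
        }
    }
}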

MetricsRegistry

pub struct MetricsRegistry {
    pub http_requests_total: IntCounterVec,
    pub http_request_duration: HistogramVec,
    pub http_requests_in_flight: IntGaugeVec,
    pub sessions_active: IntGauge,
    pub sessions_created_total: IntCounter,
    pub mcp_pool_size: IntGaugeVec,
    pub mcp_acquire_latency: HistogramVec,
    pub mcp_spawns_total: IntCounter,
    pub mcp_idle_evictions_total: IntCounter,
    pub auth_failures_total: IntCounterVec,
    pub rate_limit_rejections_total: IntCounterVec,
}

Built on top of the prometheus crate. Exposed via GET /metrics with the Prometheus text exposition format. The registry bridges Phase 5's atomic counters into the Prometheus types without requiring Phase 5's code to change — Phase 5 keeps its simple counters, and Phase 6 reads them on each scrape to populate the Prometheus gauges.
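
A sketch of the scrape path, assuming a prometheus_registry field on ApiState and a pool_size() accessor on the Phase 5 stats (both illustrative):

use prometheus::{Encoder, TextEncoder};

/// Handler body for GET /metrics: bridge Phase 5's in-memory counters into the
/// Prometheus gauges, then render the text exposition format.
fn render_metrics(state: &ApiState) -> Result<String, prometheus::Error> {
    // Bridge on scrape: Phase 5 keeps its simple counters; we only read them here.
    state.metrics.sessions_active.set(state.session_store.count() as i64);
    state
        .metrics
        .mcp_pool_size
        .with_label_values(&["warm"])
        .set(state.mcp_stats.pool_size() as i64); // assumed accessor

    // Render every metric family registered with the registry.
    let families = state.prometheus_registry.gather();
    let mut buf = Vec::new();
    TextEncoder::new().encode(&families, &mut buf)?;
    Ok(String::from_utf8(buf).expect("text exposition format is valid UTF-8"))
}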

AuditLogger

pub struct AuditLogger {
    sink: AuditSink,
}

pub enum AuditSink {
    Stderr,                                 // default
    File { path: PathBuf, rotation: Rotation },
    Syslog { facility: String },
}

pub struct AuditEvent<'a> {
    pub timestamp: OffsetDateTime,
    pub request_id: Uuid,
    pub subject: Option<&'a str>,
    pub action: AuditAction,
    pub target: Option<&'a str>,
    pub result: AuditResult,
    pub details: Option<serde_json::Value>,
}

pub enum AuditAction {
    SessionCreate,
    SessionRead,
    SessionUpdate,
    SessionDelete,
    AgentActivate,
    ToolExecute,
    McpReload,
    ConfigReload,
    AuthFailure,
    RateLimitRejection,
}

pub enum AuditResult {
    Success,
    Denied { reason: String },
    Error { message: String },
}

impl AuditLogger {
    pub fn log(&self, event: AuditEvent<'_>);
}

Audit events are emitted from handler middleware after request completion. The audit stream is deliberately separate from the regular tracing logs because audit logs have stricter retention/integrity requirements in regulated environments — you want to be able to pipe them to WORM storage or a SIEM without mixing in debug logs.
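
A sketch of emitting one such event from the post-response hook (the helper and its parameters are illustrative):

use time::OffsetDateTime;
use uuid::Uuid;

/// Sketch: called by the middleware once the session-read handler has finished.
fn audit_session_read(
    logger: &AuditLogger,
    request_id: Uuid,
    auth: Option<&AuthContext>,
    session_id: &str,
    success: bool,
) {
    logger.log(AuditEvent {
        timestamp: OffsetDateTime::now_utc(),
        request_id,
        subject: auth.map(|a| a.subject.as_str()),
        action: AuditAction::SessionRead,
        target: Some(session_id),
        result: if success {
            AuditResult::Success
        } else {
            AuditResult::Denied { reason: "ownership check failed".into() }
        },
        details: None,
    });
}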


Migration Strategy

Step 1: Per-subject session ownership

The highest-impact security fix. No new deps, no new config — just enriching existing types.

  1. Add owner: Option<String> and created_by_key_id: Option<String> to the session YAML _meta block. Serde skip if absent (backward compat for legacy files).
  2. Update SessionStore::create to record the caller's subject.
  3. Update SessionStore::open to take caller: Option<&AuthContext> and enforce ownership.
  4. Update SessionStore::list to filter by caller subject (unless caller has admin:sessions scope).
  5. Add SessionStore::set_owner for admin reassignment.
  6. Implement the "claim on first mutation" behavior for legacy sessions.
  7. Update all API handlers to pass the AuthContext through to store calls.
  8. Add integration tests: Alice creates a session, Bob tries to read it (403), admin Claire can read it (200), Alice's list returns only her own, Claire's list with admin:sessions scope returns everything.

Verification: all new authz tests pass. CLI/REPL tests still pass because they pass caller: None.

Step 2: Scope-based authorization for endpoints

Phase 4's middleware attaches AuthContext with a scopes: Vec<String> field but handlers don't check it. Phase 6 adds the enforcement.

  1. Change AuthContext.scopes from Vec<String> to a Scopes(HashSet<String>) newtype with has/has_any/has_all methods.
  2. Define the Scope enum with well-known constants.
  3. Add a require_scope helper and a #[require_scope("read:sessions")] proc macro (or a handler-side check if proc macros add too much complexity; a sketch of the handler-side variant follows this list).
  4. Annotate every handler with the required scope(s):
    • GET /v1/sessions — read:sessions
    • POST /v1/sessions — write:sessions
    • GET /v1/sessions/:id — read:sessions
    • DELETE /v1/sessions/:id — write:sessions
    • POST /v1/sessions/:id/completions — write:sessions + run:agents (if the session has an agent)
    • POST /v1/rags/:name/rebuild — admin:mcp
    • GET /v1/agents, /v1/roles, /v1/rags, /v1/models — read:agents, read:roles, etc.
    • /metrics — admin:metrics (or unauthenticated if the endpoint is bound to a private network)
  5. Document the scope model in docs/SECURITY.md.
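
A sketch of the handler-side variant from step 3, reusing the as_str mapping sketched earlier (ApiError::Forbidden is an assumed error variant):

/// Guard helper each handler calls before doing any work.
fn require_scope(ctx: &AuthContext, scope: Scope) -> Result<(), ApiError> {
    if ctx.scopes.has(scope.as_str()) {
        Ok(())
    } else {
        Err(ApiError::Forbidden {
            missing_scope: scope.as_str().to_string(),
        })
    }
}

// In the GET /v1/sessions handler:
// require_scope(&ctx, Scope::ReadSessions)?;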

Verification: per-endpoint authz tests. A key with only read:sessions can list and read but not write.

Step 3: JWT support in AuthConfig

Extend the auth mode enum:

pub enum AuthConfig {
    Disabled,
    StaticKeys { keys: Vec<AuthKeyEntry> },
    Jwt(JwtConfig),
}

pub struct JwtConfig {
    pub issuer: String,
    pub audience: String,
    pub jwks_url: String,
    pub jwks_refresh_interval: Duration,
    pub subject_claim: String,        // e.g., "sub"
    pub scopes_claim: String,         // e.g., "scope" or "permissions"
    pub leeway_seconds: u64,
}

  1. Add jsonwebtoken and reqwest (already present) to dependencies.
  2. Implement a JwksCache that fetches jwks_url on startup and refreshes every jwks_refresh_interval. Uses reqwest with a short timeout. Refreshes in the background via tokio::spawn.
  3. The auth middleware branches on AuthConfig: StaticKeys continues to work, Jwt calls jsonwebtoken::decode with the cached JWKS.
  4. Extract subject from the configured claim name. Extract scopes from either a space-separated string (scope claim) or an array claim (permissions).
  5. Handle key rotation gracefully: if decoding fails with "unknown key ID," trigger an immediate JWKS refresh (debounced to once per minute) and retry once.
  6. Integration tests with a fake JWKS endpoint (use mockito or wiremock).
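
A sketch of the verification path from step 3, assuming a key_for lookup on JwksCache and illustrative AuthError variants; constructing Scopes directly assumes this code lives in the module that defines it:

use jsonwebtoken::{decode, decode_header, Algorithm, DecodingKey, Validation};
use serde::Deserialize;

#[derive(Deserialize)]
struct Claims {
    sub: String,
    #[serde(default)]
    scope: String, // space-separated scopes; an array claim would use Vec<String>
}

fn verify_jwt(cfg: &JwtConfig, jwks: &JwksCache, token: &str) -> Result<AuthContext, AuthError> {
    // Pick the signing key by kid; an unknown kid triggers the debounced JWKS refresh (step 5).
    let header = decode_header(token).map_err(|_| AuthError::Malformed)?;
    let kid = header.kid.ok_or(AuthError::Malformed)?;
    let key: DecodingKey = jwks.key_for(&kid).ok_or(AuthError::UnknownKey)?;

    let mut validation = Validation::new(Algorithm::RS256);
    validation.set_issuer(&[&cfg.issuer]);
    validation.set_audience(&[&cfg.audience]);
    validation.leeway = cfg.leeway_seconds;

    let data = decode::<Claims>(token, &key, &validation).map_err(|_| AuthError::Invalid)?;
    Ok(AuthContext {
        subject: data.claims.sub,
        scopes: Scopes(data.claims.scope.split_whitespace().map(str::to_string).collect()),
        key_id: Some(kid),
        claims: None, // full claims propagation omitted in this sketch
    })
}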

Verification: valid JWT authenticates; expired JWT rejected; invalid signature rejected; JWKS refresh handles key rotation.

Step 4: Real rate limiting

Replace the Phase 4 stub with a working sliding-window implementation.

  1. Add dashmap dependency for the per-subject map (lock-free reads/writes).
  2. Implement SlidingWindow with two adjacent one-minute buckets; the effective rate is the weighted sum of the current bucket plus the tail of the previous bucket based on how far into the current window we are.
  3. Add RateLimiter::check(subject) -> Result<(), RateLimitError>.
  4. Write middleware that calls check before dispatching to handlers. On Err, return 429 with Retry-After header.
  5. Add rate_limit_per_minute and rate_limit_burst config fields. Reasonable defaults: 60/min, burst 10.
  6. Expose per-subject current rate as a gauge in the Prometheus registry.
  7. Integration test: fire N+1 requests as the same subject within a minute, assert the N+1th gets 429.

Verification: rate limiting works correctly across subjects; non-limited subjects aren't affected; burst allowance works.

Step 5: Per-subject concurrency limiter

Complements rate limiting — rate limits the count of requests over time, concurrency limits the simultaneous count.

  1. Implement SubjectConcurrencyLimiter with a DashMap<String, Arc<Semaphore>>.
  2. Lazy-init semaphores per subject with per_subject_concurrency slots (default 8).
  3. Middleware acquires a permit per request. If the subject's semaphore is full, queue briefly (acquire_owned wrapped in a short timeout), then 503 if still full.
  4. Garbage-collect unused semaphores periodically (an entry with no waiters and all permits available hasn't been used recently).
  5. Integration test: fire 10 concurrent requests as one subject with per_subject_concurrency: 5, assert at least 5 serialize.
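
A sketch of the acquisition path from step 3, returning Option so the middleware can map a timeout to 503 (the plan's acquire signature above elides that detail):

use std::{sync::Arc, time::Duration};
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

impl SubjectConcurrencyLimiter {
    /// Lazily create the subject's semaphore, wait briefly for a permit,
    /// and signal saturation to the middleware if none frees up in time.
    pub async fn try_acquire(&self, subject: &str) -> Option<OwnedSemaphorePermit> {
        let sem = self
            .semaphores
            .entry(subject.to_string())
            .or_insert_with(|| Arc::new(Semaphore::new(self.per_subject)))
            .clone();
        // Queue briefly instead of failing immediately; None maps to a 503.
        tokio::time::timeout(Duration::from_millis(250), sem.acquire_owned())
            .await
            .ok()
            .and_then(|res| res.ok())
    }
}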

Verification: no subject can exceed its concurrency budget; other subjects unaffected.

Step 6: Prometheus metrics endpoint

  1. Add prometheus crate.
  2. Implement MetricsRegistry with the metrics listed in the types section.
  3. Wire metric updates into existing code:
    • HTTP middleware: http_requests_total.inc() on response, http_request_duration.observe(elapsed), http_requests_in_flight.inc()/dec()
    • Session creation: sessions_created_total.inc(), sessions_active.set(store.count())
    • MCP factory: read the Phase 5 atomic counters on scrape and populate the Prometheus types
  4. Add GET /metrics handler that writes the Prometheus text exposition format.
  5. Auth policy for /metrics: configurable — either require the admin:metrics scope, or serve the endpoint unauthenticated on a separate listener (metrics_listen_addr: "127.0.0.1:9090") reachable only from a private network (recommended).
  6. Integration test: scrape /metrics, parse the response, assert expected metrics are present with sensible values.

Verification: Prometheus scraping works; metrics increment correctly.

Step 7: Structured JSON logging

Replace the default tracing_subscriber format with JSON output.

  1. Add a log_format: Text | Json config field, default Text for CLI/REPL, Json for --serve mode.
  2. Configure tracing_subscriber::fmt::layer().json() conditionally.
  3. Ensure every span has a request_id field (already present from Phase 4 middleware).
  4. Add subject and session_id as span fields when present, so they get included in every child log line automatically.
  5. Add a log_level config field that SIGHUP reloads at runtime (see Step 12).
  6. Integration test: capture stdout during a request, parse as JSON, assert the fields are present and correctly scoped.
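
A sketch of the conditional subscriber setup (LogFormat is the config field from step 1; the json feature of tracing_subscriber is assumed to be enabled):

use tracing_subscriber::{fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

/// Pick the log format at startup based on the configured mode and level.
fn init_logging(format: LogFormat, level: &str) {
    let filter = EnvFilter::new(level);
    match format {
        LogFormat::Json => tracing_subscriber::registry()
            .with(filter)
            .with(fmt::layer().json().flatten_event(true)) // one JSON object per event
            .init(),
        LogFormat::Text => tracing_subscriber::registry()
            .with(filter)
            .with(fmt::layer()) // human-readable output for CLI/REPL
            .init(),
    }
}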

Verification: loki --serve produces one-line-per-event JSON output suitable for log aggregators.

Step 8: Audit logging

Dedicated sink for security-relevant events.

  1. Implement AuditLogger with Stderr, File, and Syslog sinks. Start with just Stderr and File; Syslog via the syslog crate can follow.
  2. Emit audit events from:
    • Auth middleware: AuditAction::AuthFailure on any auth rejection
    • Rate limiter: AuditAction::RateLimitRejection on 429
    • Session handlers: AuditAction::SessionCreate/Read/Update/Delete
    • Agent handlers: AuditAction::AgentActivate
    • MCP reload endpoint: AuditAction::McpReload
  3. Audit events are JSON lines with a schema documented in docs/SECURITY.md.
  4. Audit events don't interfere with the main tracing stream — they go to the configured audit sink independently.
  5. File rotation via tracing-appender or manual rotation with size + date cap.

Verification: every security-relevant action produces an audit event; failures include a reason.

Step 9: Security headers and misc middleware

  1. Add a security_headers middleware layer that attaches:
    • X-Content-Type-Options: nosniff
    • X-Frame-Options: DENY
    • Referrer-Policy: strict-origin-when-cross-origin
    • Strict-Transport-Security: max-age=31536000; includeSubDomains (only when api.force_https: true)
    • Do NOT set CSP — this is an API, not a browser app; CSP would confuse clients.
  2. Remove Server: ... and other fingerprinting headers.
  3. Handle OPTIONS preflight correctly (Phase 4's CORS layer does this; verify).

Verification: curl -I inspects headers; automated test asserts each required header is present.

Step 10: Config validation at startup

A single ApiConfig::validate() method that checks every field and aggregates ALL errors before failing.

  1. Implement validation for:
    • listen_addr is parseable and bindable
    • auth.mode has a valid configuration (e.g., StaticKeys with non-empty key list, Jwt with reachable JWKS URL)
    • auth.keys[].key_hash starts with $argon2id$ (catches plaintext keys)
    • rate_limit_per_minute > 0 and burst > 0
    • max_body_bytes > 0 and < 100 MiB (sanity)
    • request_timeout_seconds > 0 and < 3600
    • shutdown_grace_seconds >= 0
    • cors.allowed_origins entries are valid URLs or "*"
  2. Return a ConfigValidationError that lists every problem, not just the first.
  3. Call validate() in serve() before binding the listener.
  4. Test: a deliberately-broken config produces an error listing all problems.
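
A sketch of the aggregate-everything shape (a subset of the rules above; exact field types are assumptions):

impl ApiConfig {
    /// Collect every problem before failing, so one startup attempt reports all mistakes.
    pub fn validate(&self) -> Result<(), ConfigValidationError> {
        let mut problems = Vec::new();

        if self.listen_addr.parse::<std::net::SocketAddr>().is_err() {
            problems.push(format!("listen_addr '{}' is not a valid socket address", self.listen_addr));
        }
        if self.rate_limit_per_minute == 0 {
            problems.push("rate_limit_per_minute must be > 0".to_string());
        }
        if self.max_body_bytes == 0 || self.max_body_bytes > 100 * 1024 * 1024 {
            problems.push("max_body_bytes must be between 1 and 100 MiB".to_string());
        }
        if let AuthConfig::StaticKeys { keys } = &self.auth {
            for (i, key) in keys.iter().enumerate() {
                if !key.key_hash.starts_with("$argon2id$") {
                    problems.push(format!("auth.keys[{i}].key_hash is not an argon2id hash"));
                }
            }
        }

        if problems.is_empty() {
            Ok(())
        } else {
            Err(ConfigValidationError { problems })
        }
    }
}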

Verification: startup validation catches common mistakes; error message is actionable.

Step 11: Health check endpoints

  1. GET /healthz/live — always returns 200 OK unless the process is in graceful shutdown. Body: {"status":"ok"}. No auth required.
  2. GET /healthz/ready — returns 200 OK when fully initialized and not saturated, otherwise 503 Service Unavailable. Readiness criteria:
    • AppState fully initialized
    • Session store writable (attempt a probe write to a reserved path)
    • MCP pool initialized (at least the factory is alive)
    • Concurrency semaphore has at least 10% available (not saturated)
  3. Both endpoints are unauthenticated and unmetered — load balancers hit them constantly.
  4. Document in docs/DEPLOYMENT.md how Kubernetes, systemd, and other supervisors should use these.
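
A sketch of the readiness decision (probe_write, is_alive, and global_semaphore are assumed accessors on the state):

/// Return the status code and body for GET /healthz/ready.
async fn ready(state: &ApiState) -> (u16, &'static str) {
    let store_ok = state.session_store.probe_write().await.is_ok();
    let pool_ok = state.mcp_factory.is_alive();
    // "Not saturated": at least 10% of the global concurrency budget is free.
    let capacity_ok =
        state.global_semaphore.available_permits() * 10 >= state.max_concurrent_requests;
    if store_ok && pool_ok && capacity_ok {
        (200, r#"{"status":"ok"}"#)
    } else {
        (503, r#"{"status":"not_ready"}"#)
    }
}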

Verification: endpoints return correct status under various load conditions.

Step 12: SIGHUP config reload

Reload a subset of config without restarting.

  1. Reloadable fields:
    • Auth keys (StaticKeys mode)
    • JWT config (including JWKS URL)
    • Log level
    • Rate limit config
    • Per-subject concurrency limits
    • Audit logger sink
  2. NOT reloadable (requires full restart):
    • Listen address
    • MCP pool config (pool holds live subprocesses)
    • Session storage paths
    • TLS certs (use a reverse proxy)
  3. Implementation: SIGHUP handler that re-reads config.yaml, validates it, and atomically swaps the affected fields in ApiState. Uses arc-swap crate for lock-free swaps.
  4. Audit every reload: AuditAction::ConfigReload with before/after diff summary.
  5. Document: rotation procedures for auth keys, logging level adjustments, etc.
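
A sketch of the reload loop from item 3, assuming the reloadable subset sits behind an ArcSwap on ApiState and a load_config() helper that re-reads and validates config.yaml:

use std::sync::Arc;
use tokio::signal::unix::{signal, SignalKind};

async fn sighup_reload_loop(state: Arc<ApiState>) {
    let mut hups = signal(SignalKind::hangup()).expect("install SIGHUP handler");
    while hups.recv().await.is_some() {
        match load_config() {
            Ok(cfg) => {
                // Lock-free swap: in-flight requests keep the Arc they already loaded.
                state.reloadable.store(Arc::new(cfg.reloadable_subset()));
                tracing::info!("config reloaded via SIGHUP");
            }
            Err(e) => {
                // Invalid config: keep the old one running, just report the problem.
                tracing::error!(error = %e, "SIGHUP reload rejected; keeping previous config");
            }
        }
    }
}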

Verification: start server, modify config.yaml, send SIGHUP, assert new config is in effect without dropped requests.

Step 13: Deployment manifests

13a. Dockerfile

Multi-stage build for a minimal runtime image:

# Build stage
FROM rust:1.82-slim AS builder
WORKDIR /build
COPY Cargo.toml Cargo.lock ./
COPY src ./src
COPY assets ./assets
RUN cargo build --release --bin loki

# Runtime stage
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    tini \
    && rm -rf /var/lib/apt/lists/*
RUN useradd --system --home /loki --shell /bin/false loki
COPY --from=builder /build/target/release/loki /usr/local/bin/loki
COPY --from=builder /build/assets /opt/loki/assets
USER loki
WORKDIR /loki
ENV LOKI_CONFIG_DIR=/loki/config
EXPOSE 3400
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/usr/local/bin/loki", "--serve"]

Build args for targeting specific architectures. Result is a ~100 MB image.

13b. systemd unit

[Unit]
Description=Loki AI Server
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/loki --serve
Restart=on-failure
RestartSec=5
User=loki
Group=loki

# Sandboxing
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/loki
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictSUIDSGID=true
RestrictRealtime=true
LockPersonality=true

# Resource limits
LimitNOFILE=65536
LimitNPROC=512
MemoryMax=4G

# Reload
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

Type=notify requires Loki to call sd_notify(READY=1) after successful startup — add this with the sd-notify crate.
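
A sketch of that readiness notification using the sd-notify crate (call it right after the listener binds; the helper name is illustrative):

/// Tell systemd the service is ready; intended to be a harmless no-op outside systemd.
fn notify_systemd_ready() {
    if let Err(e) = sd_notify::notify(false, &[sd_notify::NotifyState::Ready]) {
        tracing::warn!(error = %e, "failed to notify systemd of readiness");
    }
}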

13c. docker-compose example

For local development with TLS via nginx:

version: "3.9"
services:
  loki:
    build: .
    environment:
      LOKI_CONFIG_DIR: /loki/config
    volumes:
      - ./config:/loki/config:ro
      - loki_data:/loki/data
    ports:
      - "127.0.0.1:3400:3400"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3400/healthz/live"]
      interval: 30s
      timeout: 5s
      retries: 3

  nginx:
    image: nginx:alpine
    volumes:
      - ./deploy/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./deploy/certs:/etc/nginx/certs:ro
    ports:
      - "443:443"
    depends_on:
      - loki

volumes:
  loki_data:

Include a sample nginx.conf that terminates TLS and forwards to loki:3400.

13d. Kubernetes manifests

Provide deploy/k8s/ with:

  • namespace.yaml
  • deployment.yaml (3 replicas, resource requests/limits, liveness/readiness probes)
  • service.yaml (ClusterIP)
  • configmap.yaml (non-secret config)
  • secret.yaml (API keys, JWT config)
  • hpa.yaml (HorizontalPodAutoscaler based on CPU + custom metric for requests/sec)
  • ingress.yaml (optional example using nginx-ingress)

Document storage strategy: sessions use a PVC mounted at /loki/data; RAG embeddings use a read-only ConfigMap or a separate PVC.

Verification: each deployment target produces a running Loki that passes health checks.

Step 14: Operational runbook

Write docs/RUNBOOK.md with sections for:

  • Starting and stopping the server
  • Rotating auth keys (StaticKeys mode) — edit config, SIGHUP, verify in audit log
  • Rotating auth keys (Jwt mode) — update JWKS at issuer, Loki auto-refreshes
  • Rotating MCP credentials — update env vars, POST /v1/mcp/reload (new endpoint in this phase) or restart
  • Diagnosing high latency — check MCP hit rate, check LLM provider latency, check concurrency saturation
  • Diagnosing auth failures — audit log AuthFailure events, check key hash, check JWKS reachability
  • Diagnosing rate limit rejections — check per-subject counter, adjust limit or identify runaway client
  • Diagnosing orphaned MCP subprocesses — ps aux | grep loki, check logs for McpFactory shutdown complete
  • Diagnosing session corruption — check .yaml.tmp files (should not exist when server is idle), inspect session YAML for validity
  • Backup and restore — tar the sessions/ and agents/ directories
  • Scaling horizontally — each replica has its own MCP pool and session store; share sessions via a shared filesystem (NFS/EFS) or defer to a database-backed SessionStore (not in this phase)
  • Incident response — what logs to collect, what metrics to snapshot, how to reach a minimal reproducing state

Verification: walk through each procedure on a test deployment; fix any unclear steps.

Step 15: Deployment and security guides

docs/DEPLOYMENT.md — step-by-step for Docker, systemd, docker-compose, Kubernetes. Pre-flight checklist, first-time setup, upgrade procedure.

docs/SECURITY.md — threat model, hardening checklist, scope model, audit event schema, key rotation, reverse proxy configuration, network security recommendations, CVE reporting contact.

Cross-reference from README.md and add a "Production Deployment" section to the README that points to both docs.

Verification: a developer unfamiliar with Loki can deploy it successfully using only the docs.


Risks and Watch Items

Each risk is listed with its severity and mitigation:

  • Session ownership migration breaks legacy users (Medium) — Legacy sessions with owner: None stay readable by anyone; they get claimed forward on first mutation. Document this in RUNBOOK.md. Add a one-shot migration CLI command (loki migrate sessions --claim-to <subject>) that assigns ownership of all unowned sessions to a specific subject.
  • JWT JWKS fetch failures block startup (Medium) — The JWKS URL must be reachable at startup; if it's not, log an error and fall back to "reject all" mode until the fetch succeeds. A retry loop with exponential backoff runs in the background. Do NOT crash on JWKS failure.
  • Rate limiter DashMap growth (Low) — Per-subject windows accumulate forever without cleanup. Add a background reaper that removes entries with zero recent activity every few minutes. Cap total entries at 100k as a safety valve.
  • Prometheus metric cardinality explosion (Low) — http_requests_total with per-path labels could explode if routes have dynamic segments (/v1/sessions/:id). Use route templates as labels, not concrete paths. Validate label sets at registration.
  • Audit log retention compliance (Low) — Audit logs might need to be retained for regulatory reasons. Phase 6 provides the emission; retention is the operator's responsibility. Document this in SECURITY.md.
  • SIGHUP reload partial failure (Medium) — If the new config is invalid, don't swap it in — keep the old config running. Log the validation error. The operator can fix the file and SIGHUP again. Never leave the server in an inconsistent state.
  • Docker image size (Low) — debian:bookworm-slim is ~80 MB; final image ~100 MB. If smaller is needed, use distroless/cc-debian12 for a ~35 MB image at the cost of not having tini or debugging tools. Document both options.
  • systemd Type=notify missing implementation (Medium) — Adding sd_notify requires the sd-notify crate AND calling it after listener bind. Missing this call makes systemd think the service failed. Add an integration test that fakes systemd and asserts the notification is sent.
  • Kubernetes pod disruption (Low) — HPA scales down during low traffic, but in-flight requests on the terminating pod must complete gracefully. Set terminationGracePeriodSeconds to at least shutdown_grace_seconds + 10. Document in DEPLOYMENT.md.
  • Running under a reverse proxy (Low) — CORS, Host header handling, X-Forwarded-For for rate limiter subject identification. Document the expected proxy config (trust X-Forwarded-* headers only from trusted proxies).

What Phase 6 Does NOT Do

  • No multi-region replication. Loki is a single-instance service; scale out by running multiple instances behind a load balancer, each with its own pool. Cross-instance state sharing is not in scope.
  • No database-backed session store. FileSessionStore is still the only implementation. A PostgresSessionStore is a clean extension point (SessionStore trait is already there) but belongs to a follow-up.
  • No cluster coordination. Each Loki instance is independent. Running Loki in a "cluster" mode where instances share work is a separate project.
  • No advanced ML observability. LLM call costs, token usage trends, provider error rates — these are tracked as counters but not aggregated into dashboards. Follow-up work.
  • No built-in TLS termination. Use a reverse proxy (nginx, Caddy, Traefik, a cloud load balancer). Supporting TLS in-process adds complexity and key management concerns that reverse proxies solve better.
  • No SAML or LDAP. Only StaticKeys and JWT. SAML/LDAP integration can extend AuthConfig later.
  • No plugin system. Extensions to auth, storage, or middleware require forking and rebuilding. A dynamic plugin loader is explicitly out of scope.
  • No multi-tenancy beyond session ownership. Tenants share the same process, same MCP pool, same RAG cache, same resources. Strict tenant isolation (separate processes per tenant) requires orchestration outside Loki.
  • No cost accounting per tenant. LLM API calls are tracked per-subject in audit logs but not aggregated into billing-grade cost reports.

Entry Criteria (from Phase 5)

  • McpFactory pooling works and has metrics
  • Graceful shutdown drains the MCP pool
  • Phase 5 load test passes (hit rate >0.8, no orphaned subprocesses)
  • Phase 4 API integration test suite passes
  • cargo check, cargo test, cargo clippy all clean

Exit Criteria (Phase 6 complete — v1 ready)

  • Per-subject session ownership enforced; integration tests prove Alice can't read Bob's sessions
  • Scope-based authorization enforced on every endpoint
  • JWT authentication works with a real JWKS endpoint
  • Real rate limiting replaces the Phase 4 stub; 429 responses include Retry-After
  • Per-subject concurrency limiter prevents noisy-neighbor saturation
  • Prometheus /metrics endpoint scrapes cleanly
  • Structured JSON logs emitted in --serve mode
  • Audit events written for all security-relevant actions
  • Security headers set on all responses
  • Config validation fails fast at startup with readable errors
  • /healthz/live and /healthz/ready endpoints work
  • SIGHUP reloads auth keys, log level, and rate limits without restart
  • Dockerfile produces a minimal runtime image
  • systemd unit with Type=notify works correctly
  • docker-compose example runs end-to-end with TLS via nginx
  • Kubernetes manifests deploy successfully
  • docs/RUNBOOK.md covers all common operational scenarios
  • docs/DEPLOYMENT.md guides a first-time deployer to success
  • docs/SECURITY.md documents threat model, scopes, and hardening
  • cargo check, cargo test, cargo clippy all clean
  • End-to-end production smoke test: deploy to Kubernetes, send real traffic, scrape metrics, rotate a key, induce a failure, observe recovery

v1 Release Summary

After Phase 6 lands, Loki v1 has transformed from a single-user CLI tool into a production-ready multi-tenant AI service. Here's what the v1 release notes should say:

New in Loki v1:

  • REST API — full HTTP surface for completions, sessions, agents, roles, RAGs, and metadata. Streaming via Server-Sent Events, synchronous via JSON.
  • Multi-tenant sessions — UUID-primary identity with optional human-readable aliases. Per-subject ownership with scope-based access control.
  • Concurrent safety — per-session mutex serialization, per-MCP-server Arc sharing, per-agent runtime isolation. Run dozens of concurrent requests without corruption.
  • MCP pooling — recently-used MCP subprocesses stay warm across requests. Near-zero warm-path latency. Configurable idle timeout and LRU cap.
  • Authentication — static API keys or JWT with JWKS. Argon2-hashed credentials. Scope-based authorization per endpoint.
  • Observability — Prometheus metrics, structured JSON logging with correlation IDs, dedicated audit log stream.
  • Rate limiting — sliding-window per subject with configurable limits and burst allowance.
  • Graceful shutdown — in-flight requests complete within a grace period; MCP subprocesses terminate cleanly; session state is persisted.
  • Deployment manifests — Dockerfile, systemd unit, docker-compose example, Kubernetes manifests.
  • Full documentation — runbook, deployment guide, security guide, API reference.

Backward compatibility:

CLI and REPL continue to work identically to pre-v1 builds. Existing config.yaml, roles/, sessions/, agents/, rags/, and functions/ directories are read-compatible. The legacy session layout is migrated lazily on first access without destroying the old files.

What's next (v2+):

  • Database-backed session store for cross-instance sharing
  • Native TLS termination option
  • SAML / LDAP authentication extensions
  • Per-tenant cost accounting and quotas
  • Dynamic plugin system for custom auth, storage, and middleware
  • Multi-region replication
  • WebSocket transport alongside SSE