Monitoring and forensics: the vitamins everyone remembers to take after getting sick. Let's make it first-class instead of a panic purchase after the postmortem. Here's the full, doc-ready epic.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

# Epic 15: Observability & Forensics

**Short name:** Observability & Forensics
**Primary components:** Web Services API, Orchestrator, Task Runner, Findings Ledger, Conseiller (Feedser), Excitator (VEXer), Policy Engine, Export Center, Console, CLI
**Surfaces:** logs, metrics, traces, decision audits, event timeline, evidence locker, provenance attestations, dashboards, alerts
**Dependencies:** Authority-Backed Scopes & Tenancy, AOC Enforcement, Policy Studio, Export Center

**AOC ground rule reminder:** Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Observability and forensic capture must reflect sources as they were seen, not "cleaned up."

---

## 1) What it is

A platform-wide, tenant-aware telemetry and evidence system that answers three questions fast:

1. What is happening right now, and is it healthy?
2. What exactly happened earlier, across services and jobs?
3. Can we prove it, independently and later?

It combines:

* **Unified telemetry:** logs, metrics, and traces for all services and jobs using OpenTelemetry conventions.
* **SLOs and dashboards** for golden signals and user-journey health.
* **Forensic readiness:** evidence snapshots, immutable audit trails, provenance attestations, chain of custody, and a tenant timeline that reconstructs evaluations and decisions.
* **UI and CLI** to explore incidents, verify attestations, and export legally defensible bundles.

Everything is tenant-scoped by default and tamper-evident where it matters.

---

## 2) Why

Because "we think it worked" is not proof. You need fast feedback during normal ops and a trustworthy record when something breaks or lawyers get involved. If StellaOps decides risk, then StellaOps must show its work and preserve it.

---

## 3) How it should work

### 3.1 Instrumentation blueprint

* **Tracing:** every request and job carries a `trace_id` with W3C trace context. Background workers continue the same trace or link as a child. The CLI adds the header when talking to the API.
* **Metrics:** standardized counters, histograms, and gauges for latency, throughput, errors, and resource use. Exemplars embed `trace_id` for jump-to-trace from graphs.
* **Logs:** structured JSON with fixed fields. No printf archaeology.
* **Sampling:** head sampling for traces, with tail sampling on high latency and errors. Error logs are never sampled.

**Common fields for logs and spans:**

```
ts, level, service, version, env, region, instance_id,
tenant_id, project_id, actor, route, method, resource, action,
trace_id, span_id, parent_span_id, request_id, job_id, run_id,
source_id, sbom_id, policy_version, decision_effect, decision_reason,
error_code, error_message
```

**Sensitive data policy:**

* Scrub secrets, tokens, emails, and file paths beyond the repo root.
* Redact by default and allow per-tenant opt-in for deep debug.
* PII filters are enforced in the logging library, not at call sites.

### 3.2 Metrics taxonomy and SLOs

* **Golden signals per service:** request latency (P50/P95/P99), error rate, throughput, saturation (CPU, queue depth).
* **User journeys:** SBOM ingest, policy evaluate, advisory link, VEX reconcile, export bundle, notification send.
* **SLO examples:**
  * SBOM ingest start-to-first-component P95 < 5s.
  * Policy evaluation P95 < 2s for 5k components.
  * Advisory link to exposure delta P95 < 30s.
  * Export bundle availability P95 < 90s from request.

SLOs emit events on burn rate breaches, not just threshold spikes; a minimal burn-rate check is sketched below.
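Since burn-rate breaches are the alerting trigger, here is a minimal sketch of a multi-window burn-rate check. The window ratios, the 14.4x threshold, and the `Slo`/`should_alert` names are illustrative assumptions; the spec only requires that breaches emit events.

```python
# Hypothetical sketch of an SLO burn-rate check (thresholds and helper
# names are illustrative; the spec only requires burn-rate events).
from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    objective: float  # e.g. 0.999 means 99.9% of requests meet the target

def burn_rate(error_ratio: float, slo: Slo) -> float:
    """How fast the error budget is being consumed: 1.0 = exactly on budget."""
    budget = 1.0 - slo.objective
    return error_ratio / budget if budget > 0 else float("inf")

def should_alert(short_window_ratio: float, long_window_ratio: float, slo: Slo,
                 threshold: float = 14.4) -> bool:
    """Multi-window check: both the short and the long window must burn fast.
    14.4x over a 1h/5m pair is a common starting point for paging alerts."""
    return (burn_rate(short_window_ratio, slo) >= threshold and
            burn_rate(long_window_ratio, slo) >= threshold)

if __name__ == "__main__":
    ingest = Slo(name="sbom_ingest_latency", objective=0.999)
    # error_ratio = fraction of requests that missed the P95 < 5s target
    print(should_alert(short_window_ratio=0.02, long_window_ratio=0.018, slo=ingest))
```

The multi-window form guards against paging on short blips while still catching sustained budget burn.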
### 3.3 Tracing model

* **Ingress span:** `api.<route>` includes authN/Z decision attributes without PII.
* **Work spans:** `orchestrator.schedule`, `runner.execute`, `conseiller.ingest`, `excitator.link`, `policy.evaluate`.
* **Async linking:** the parent trace context is serialized into job payloads. Workers restore it and annotate spans with job metadata.
* **Cross-boundary:** the CLI generates `traceparent`; the Console mirrors it into the browser dev tools for correlation during support cases.

### 3.4 Logs that don't lie

* **Decision logs:** every authZ decision and policy evaluation emits a structured record.
* **Consistency:** one logger library used across all services.
* **Tenant isolation:** logs carry `tenant_id`. The central log store partitions by tenant.
* **Retention:** hot logs short (3-7 days), warm 30-90 days, cold based on tenant policy. For forensic artifacts see 3.6.

### 3.5 Tenant Timeline

* Stream rule: every state transition emits a `timeline_event` with a canonical schema:

```
event_id, ts, tenant_id, project_id, kind, actor,
resource_type, resource_id, trace_id, details{}
```

* The aggregator stores an index per tenant for 180 days by default.
* The UI visualizes events with filters: "show all policy changes," "show all job failures that touched component X," "expand decisions for this trace."

### 3.6 Forensic Evidence Locker

* Append-only, WORM-capable object storage under `tenants/<tenant_id>/evidence/<case_id>/`.
* **Snapshot types:**
  * Evaluation bundle: SBOM slice, linked advisories, VEX inputs, policy set, engine version, config, and decision trace.
  * Job capsule: input digests, outputs, runner image digest, env digest (vars hashed), command transcript.
  * Export manifest: files, checksums, recipients.
* **Integrity:**
  * Merkle index per bundle. The root hash is stored in the DB and optionally timestamped.
  * DSSE-signed manifest with the platform key. Key rotation is logged and validated.

### 3.7 Provenance and attestations

* Jobs emit SLSA-style provenance attestations:
  * subjects: produced artifacts (e.g., exposure report)
  * predicate: sources, builder image, inputs, policies, advisory versions
  * builder id: Task Runner identity
* Verify on read. The UI shows a green check with signer and time.

### 3.8 Chain of custody

* The chain ties together:
  * source fetch → SBOM compute → policy eval → advisory link → export
* Each step has immutable IDs and cryptographic digests inside the timeline and evidence.

### 3.9 Incident mode

* A feature flag that increases trace sampling, captures additional breadcrumbs, and extends evidence retention for the next N hours.
* Per-tenant activation to avoid surprise bills.
* Automatically enabled on SLO burn rate breaches above a threshold.

### 3.10 Multi-tenant guarantees

* Telemetry, timeline, evidence, and attestations are tenant-scoped by design.
* RLS applies to the timeline DB.
* The evidence locker uses a tenant prefix and an optional per-tenant KMS key.
* Exporting evidence requires the `stella:evidence:export#tenant/<tenant_id>` scope.
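As a bridge between the timeline schema (3.5), trace correlation (3.3), and the tenant-scoped topics (4.4), here is a minimal sketch of a `timeline_event` emitter. The `publish` callable, the topic suffix, and the use of the OpenTelemetry Python API are assumptions for illustration, not the platform's actual emitter.

```python
# Hypothetical timeline_event emitter following the schema in 3.5.
# `publish` is a stand-in for the platform message bus (see topics in 4.4).
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Callable

from opentelemetry import trace

def emit_timeline_event(
    publish: Callable[[str, bytes], None],
    tenant_id: str,
    project_id: str,
    kind: str,                 # e.g. "policy.updated", "job.failed"
    actor: str,
    resource_type: str,
    resource_id: str,
    details: dict[str, Any],
) -> dict[str, Any]:
    span_ctx = trace.get_current_span().get_span_context()
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "project_id": project_id,
        "kind": kind,
        "actor": actor,
        "resource_type": resource_type,
        "resource_id": resource_id,
        # Correlate with tracing; an all-zero trace_id means "no active span".
        "trace_id": trace.format_trace_id(span_ctx.trace_id),
        "details": details,
    }
    # Topic naming is an assumption modelled on `stella.<tenant_id>.timeline.*`.
    publish(f"stella.{tenant_id}.timeline.{kind}", json.dumps(event).encode())
    return event
```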
### 3.11 Console features

* **Observability Hub:** health at a glance, SLO widgets, top failing routes, top noisy tenants (for operators), trace search, log search with guardrails.
* **Forensics Explorer:** timeline with filters, evidence bundle viewer, attestation verifier, "Create Snapshot" wizard, comparison between two evaluations.
* **"Why is this red?"** click-through from an error to the exact span, log lines, and policy decision that caused it.

### 3.12 CLI features

* `stella obs top` live stats for APIs and jobs.
* `stella obs trace <trace_id>` dump correlated events.
* `stella forensic snapshot create --case <case_id> --scope <scope> --id <resource_id>`
* `stella forensic verify <bundle_id>` validate checksums and signatures.
* `stella forensic attest show <subject_id>` print provenance.

### 3.13 Integrations

* OpenTelemetry Collector ships by default.
* Prometheus scrape config and dashboards included.
* Webhooks for SLO breaches and incident mode start/stop.
* Optional RFC 3161 timestamping for Merkle roots where a time authority is configured.

---

## 4) Architecture

### 4.1 New modules

* `telemetry/core` logging, metrics, and tracing libraries with scrubbing and context.
* `telemetry/collector-config` ships the default collector config.
* `timeline/indexer` consumes events and builds tenant timeline indices.
* `evidence/locker` API to create, read, and sign bundles.
* `provenance/attest` generates DSSE statements and verification helpers.
* `console/obs` dashboards and trace viewer.
* `console/forensics` timeline and evidence UIs.
* `cli/obs`, `cli/forensics`.

### 4.2 Data model

New tables with RLS:

* `timeline_events(event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, digest, details_jsonb)`
* `evidence_bundles(id, tenant_id, case_id, kind, root_hash, path, signer_key_id, created_at, labels)`
* `attestations(id, tenant_id, subject_id, subject_digest, statement_path, signer_key_id, created_at)`

Indices: `(tenant_id, ts)`, `(tenant_id, resource_type, resource_id)`, `(tenant_id, case_id)`.

### 4.3 Storage layout

* Object store:
  * `tenants/<tenant_id>/evidence/<case_id>/<bundle_id>/manifest.json`
  * `tenants/<tenant_id>/evidence/<case_id>/<bundle_id>/root.merkle`
  * `tenants/<tenant_id>/attestations/<subject_id>/`
* Optional S3 Object Lock for WORM with a per-tenant retention policy.

### 4.4 Message topics

* `stella.<tenant_id>.timeline.*` emitted by all services.
* `stella.<tenant_id>.slo.breach` from the SLO evaluator.
* `stella.<tenant_id>.incident.mode` start and stop events.
* `stella.global.kb.*` remains for public advisories and has no tenant data.

---

## 5) APIs and contracts

### 5.1 Observability read APIs

* `GET /obs/health` summary per service.
* `GET /obs/slo` list current SLOs and burn rates.
* `GET /obs/trace/:id` metadata plus a deep link to the trace backend.
* `GET /obs/logs` query interface with guardrails for time window and tenant.
* `GET /obs/metrics` small set of computed aggregates for the Console.

### 5.2 Timeline APIs

* `GET /timeline?from=&to=&kind=&resource=&project=` paginated events.
* `GET /timeline/:event_id` detailed view, with links to trace and evidence.

### 5.3 Evidence APIs

* `POST /evidence/snapshot` with `{kind, resource_id, case_id}` creates a bundle.
* `GET /evidence/:bundle_id` returns the manifest and signed hashes.
* `POST /evidence/verify` upload or reference a bundle path to verify.
* `POST /evidence/hold/:case_id` place or release a legal hold.

### 5.4 Attestation APIs

* `GET /attestations?subject_id=` list and filter.
* `POST /attestations/verify` verify a statement against a subject digest.

### 5.5 Structured log contract

Logs are newline-delimited JSON with the common fields in 3.1. `error_message` must be non-PII and concise. Multi-line errors are folded into `details.stack` with length limits.
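To illustrate the 5.5 contract together with the 3.1 rule that PII filters live in the logging library, here is a minimal sketch of a redacting NDJSON formatter. The regexes, the reduced field set, and the `RedactingJsonFormatter` name are assumptions; a real implementation would cover the full common-fields list.

```python
# Hypothetical sketch: NDJSON log formatter that redacts secrets and emails
# before emission, per the sensitive-data policy in 3.1 and the contract in 5.5.
import json
import logging
import re

SECRET_RE = re.compile(r"(?i)(token|secret|password|api[_-]?key)=\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(text: str) -> str:
    text = SECRET_RE.sub(r"\1=[REDACTED]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

class RedactingJsonFormatter(logging.Formatter):
    """One JSON object per line; scrubbing happens here, in the library,
    so call sites cannot forget it."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "tenant_id": getattr(record, "tenant_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "error_message": scrub(record.getMessage()),
        }
        return json.dumps(entry)

if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(RedactingJsonFormatter())
    log = logging.getLogger("stella.demo")
    log.addHandler(handler)
    log.propagate = False
    log.error("fetch failed for user@example.com with token=abc123",
              extra={"service": "conseiller", "tenant_id": "t-123"})
```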
All of these endpoints require Authority scopes such as:

* `stella:obs:read#tenant/<tenant_id>`
* `stella:timeline:read#tenant/<tenant_id>`
* `stella:evidence:create#tenant/<tenant_id>`
* `stella:evidence:read#tenant/<tenant_id>`
* `stella:attest:read#tenant/<tenant_id>`

---

## 6) Documentation changes

Create or update:

1. `/docs/observability/overview.md` what, why, scope, data flow.
2. `/docs/observability/telemetry-standards.md` fields, naming, sampling, scrubbing rules, examples.
3. `/docs/observability/metrics-and-slos.md` catalog of metrics, SLO definitions, alert policies, burn rates.
4. `/docs/observability/tracing.md` context propagation, async linking, CLI and Console behavior.
5. `/docs/observability/logging.md` structured logging guide, dos and don'ts, PII policy.
6. `/docs/forensics/evidence-locker.md` bundle formats, WORM, retention, legal hold.
7. `/docs/forensics/provenance-attestation.md` statement schema, signing, verification.
8. `/docs/forensics/timeline.md` schema, event kinds, queries, examples.
9. `/docs/console/observability.md` dashboards, trace viewer, log search.
10. `/docs/console/forensics.md` timeline, snapshot UI, verification flow.
11. `/docs/cli/observability.md` commands and examples.
12. `/docs/cli/forensics.md` snapshot and verify commands.
13. `/docs/security/redaction-and-privacy.md` telemetry privacy and tenant isolation.
14. `/docs/install/telemetry-stack.md` default collector, exporter options, dashboards.
15. `/docs/runbooks/incidents.md` incident mode, SLO breaches, escalation.

At the top of each page include:

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Implementation plan

### Phase 1 - Baseline telemetry

* Integrate the OpenTelemetry SDK in all services.
* Replace ad-hoc logs with the structured logger.
* Emit minimal metrics and traces on hot paths.
* Ship the default collector and dashboards.

### Phase 2 - SLOs and dashboards

* Define SLOs per journey.
* Implement burn rate alerts.
* Wire Console Observability Hub widgets.

### Phase 3 - Timeline and decision logs

* Emit timeline events everywhere.
* Build the timeline indexer and APIs.
* Console Timeline viewer.

### Phase 4 - Evidence locker

* Implement snapshot builders for evaluation, job, and export.
* Add Merkle manifests and DSSE signing.
* Add legal hold and retention checks.

### Phase 5 - Provenance and verification

* Generate attestations for jobs and exports.
* Verify on read.
* CLI verify commands.

### Phase 6 - Incident mode

* Feature flag, higher sampling, and retention bump.
* Webhook and Notifications integration.

---

## 8) Engineering tasks

**Telemetry core**

* [ ] Build `telemetry/core` with log scrubbing, context propagation, and OpenTelemetry exporters.
* [ ] Introduce a single `Logger` facade; deprecate old logging.
* [ ] Add request and job middleware to attach tenant/actor/trace context.

**Metrics and SLOs**

* [ ] Instrument golden paths with histograms and exemplars.
* [ ] Implement the SLO evaluator and burn rate alerts.
* [ ] Provide Prometheus rules and Grafana dashboards.

**Tracing**

* [ ] Propagate `traceparent` in HTTP, gRPC, and job payloads.
* [ ] Link CLI requests via headers.
* [ ] Add span attributes for policy decisions and authZ outcomes.

**Timeline**

* [ ] Define the `timeline_event` schema and emitters in all services.
* [ ] Build the indexer, APIs, and RLS policies.
* [ ] Console visualization with filters and deep links to traces.

**Evidence locker**

* [ ] Implement bundle builders, Merkle construction, and DSSE signing (see the sketch after this task group).
* [ ] Object store layout with optional WORM mode.
* [ ] APIs for create, get, verify, and legal hold.
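A minimal sketch of what the Merkle construction and DSSE signing tasks above involve, assuming SHA-256 leaf digests and a caller-supplied `sign` function; the helper names and the payload type string are illustrative, not the `evidence/locker` module's real API.

```python
# Hypothetical sketch of the evidence-bundle integrity steps (3.6):
# a Merkle root over file digests plus a DSSE envelope around the manifest.
import base64
import hashlib
import json
from pathlib import Path
from typing import Callable

def file_digest(path: Path) -> bytes:
    return hashlib.sha256(path.read_bytes()).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Pair-wise SHA-256 tree; the last node is paired with itself on odd levels."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else level[i])
            nxt.append(hashlib.sha256(pair).digest())
        level = nxt
    return level[0]

def dsse_envelope(manifest: dict, sign: Callable[[bytes], bytes], key_id: str,
                  payload_type: str = "application/vnd.example.manifest+json") -> dict:
    """DSSE v1 pre-authentication encoding:
    'DSSEv1' SP LEN(type) SP type SP LEN(payload) SP payload.
    The payload_type default above is an assumption, not a defined media type."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    pae = b"DSSEv1 %d %s %d %s" % (len(payload_type), payload_type.encode(),
                                   len(payload), payload)
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode(),
        "signatures": [{"keyid": key_id, "sig": base64.b64encode(sign(pae)).decode()}],
    }
```

Verification replays the same PAE construction over the decoded payload and checks each signature against the registered platform key.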
**Provenance**

* [ ] Define the statement schema and signers.
* [ ] Add a verification library and server hooks.
* [ ] CLI support for show and verify.

**Privacy and redaction**

* [ ] Implement scrubbers for secrets and PII in the logger.
* [ ] Add config for per-tenant deep debug opt-in with a time-boxed TTL.
* [ ] Redaction tests.

**Console**

* [ ] Observability Hub: health, SLO widgets, trace search.
* [ ] Forensics Explorer: timeline, evidence viewer, verification UX.
* [ ] "Why is this red" drill-down from errors to spans/logs/decisions.

**CLI**

* [ ] `stella obs` commands and pretty printers.
* [ ] `stella forensic snapshot`, `verify`, `attest show`.
* [ ] Respect `--tenant` and print `trace_id` for copy-paste.

**Docs**

* [ ] Author all pages in section 6, with examples and diagrams.
* [ ] Insert the imposed rule banner at the top of each page.
* [ ] Add runbooks for SLO breaches and incident mode.

**Testing**

* [ ] Unit tests: scrubbing, logger fields, timeline emitter.
* [ ] E2E: create an SBOM, run policy, link advisories, export, then snapshot and verify.
* [ ] Load tests: ensure tracing overhead stays under 5 percent CPU.
* [ ] Failure injection: drop the collector, ensure backpressure and fallbacks.

---

## 9) Feature changes required in other components

* **Authority & Tenancy:** add scopes `obs:read`, `timeline:read`, `evidence:*`, `attest:read`. Enforce tenant constraints everywhere.
* **Orchestrator & Runner:** stamp job spans and emit `job.started|finished|failed` timeline events with root cause.
* **Findings Ledger:** record evaluation IDs and link to evidence bundles.
* **Policy Engine:** include policy version and rule IDs in spans and timeline events; log decision summaries consistently.
* **Export Center:** produce attestations for every generated artifact and optionally embed a copy of the evaluation bundle.
* **Notifications:** send SLO breach and incident mode start/stop messages with links to traces and timeline searches.
* **Conseiller & Excitator:** emit ingest and linking events for the timeline and capture source digests in evidence bundles.

> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 10) Acceptance criteria

* All services emit structured logs with the common fields.
* At least 90 percent of top routes and job paths have tracing with exemplars.
* SLOs are defined and visible in the Console; burn rate alerts work.
* The Tenant Timeline shows a full chain for a real evaluation.
* An evidence snapshot for an evaluation verifies successfully.
* An attestation exists and verifies for at least one export and one job.
* Incident mode increases trace volume and extends evidence retention.
* RLS prevents cross-tenant reads of timeline and evidence.
* 403 payloads continue to explain policy or scope denials without leaking PII.

---

## 11) Risks and mitigations

* **Telemetry cost blow-up.** Use sampling, cardinality limits, and per-tenant caps.
* **PII leakage.** Redaction enforced in the logger library, with security review on field additions.
* **Broken traces in async work.** Serialize context into job payloads and test it (see the propagation sketch after this list).
* **False sense of immutability.** Use real WORM where available and sign evidence; document guarantees honestly.
* **Operational complexity.** Ship the default collector and dashboards so teams start from something working, not from a blank page.
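To make "serialize context into job payloads and test it" concrete, here is a minimal sketch of W3C trace-context propagation using the OpenTelemetry Python propagation API, per the async linking rule in 3.3. The payload shape and the tracer name are assumptions.

```python
# Hypothetical sketch of trace-context propagation into an async job payload
# (producer) and back out in the worker, so both sides share one trace_id.
import json

from opentelemetry import propagate, trace

tracer = trace.get_tracer("stella.sketch")

def enqueue_job(kind: str, payload: dict) -> bytes:
    """Producer side: serialize the current trace context into the payload."""
    carrier: dict[str, str] = {}
    propagate.inject(carrier)  # writes `traceparent` (and `tracestate`)
    payload = {**payload, "otel": carrier, "kind": kind}
    return json.dumps(payload).encode()

def run_job(raw: bytes) -> None:
    """Worker side: restore the parent context and continue the same trace."""
    payload = json.loads(raw)
    parent_ctx = propagate.extract(payload.get("otel", {}))
    with tracer.start_as_current_span("runner.execute", context=parent_ctx) as span:
        span.set_attribute("job.kind", payload["kind"])
        # ... do the work; child spans created here share the original trace_id
```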
---

## 12) Philosophy

* **If it isn't measured, it isn't managed.**
* **If it isn't preserved, it didn't happen.**
* **If it isn't explainable, it isn't trustworthy.**

StellaOps makes risk decisions. This epic is how those decisions become observable, traceable, and defensible.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.