git.stella-ops.org/docs/implplan/EPIC_15.md

Monitoring and forensics: the vitamins everyone remembers to take after getting sick. Let’s make it first‑class instead of a panic purchase after the postmortem.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

# Epic 15: Observability & Forensics

**Short name:** Observability & Forensics
**Primary components:** Web Services API, Orchestrator, Task Runner, Findings Ledger, Conseiller (Feedser), Excitator (VEXer), Policy Engine, Export Center, Console, CLI
**Surfaces:** logs, metrics, traces, decision audits, event timeline, evidence locker, provenance attestations, dashboards, alerts
**Dependencies:** Authority‑Backed Scopes & Tenancy, AOC Enforcement, Policy Studio, Export Center

**AOC ground rule reminder:** Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Observability and forensic capture must reflect sources as they were seen, not “cleaned up.”

---

## 1) What it is

A platform‑wide, tenant‑aware telemetry and evidence system that answers three questions fast:

1. What is happening right now, and is it healthy?
2. What exactly happened earlier, across services and jobs?
3. Can we prove it, independently and later?

It combines:

* **Unified telemetry:** logs, metrics, and traces for all services and jobs using OpenTelemetry conventions.
* **SLOs and dashboards** for golden signals and user‑journey health.
* **Forensic readiness:** evidence snapshots, immutable audit trails, provenance attestations, chain‑of‑custody, and a tenant timeline that reconstructs evaluations and decisions.
* **UI and CLI** to explore incidents, verify attestations, and export legally defensible bundles.

Everything is tenant‑scoped by default and tamper‑evident where it matters.

---

## 2) Why

Because “we think it worked” is not proof. You need fast feedback during normal ops and a trustworthy record when something breaks or lawyers get involved. If StellaOps decides risk, then StellaOps must show its work and preserve it.

---

## 3) How it should work

### 3.1 Instrumentation blueprint

* **Tracing:** every request and job carries a `trace_id` with W3C trace context. Background workers continue the same trace or link as a child. CLI adds the header when talking to the API.
* **Metrics:** standardized counters, histograms, and gauges for latency, throughput, errors, and resource use. Exemplars embed `trace_id` for jump‑to‑trace from graphs.
* **Logs:** structured JSON with fixed fields. No printf archaeology.
* **Sampling:** head sampling for traces with tail sampling on high latency and errors. Error logs never sampled.
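
The sampling rule above can be sketched as a two-stage decision. This is a minimal illustration, not the collector's actual configuration: the 10% head rate, the 2000 ms latency cutoff, and the function names are all assumptions.

```python
import hashlib

HEAD_SAMPLE_RATE = 0.10    # assumed: fraction of traces kept up front
SLOW_THRESHOLD_MS = 2000   # assumed: what counts as "high latency"


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head decision: the same trace_id always gets the same answer."""
    # Map the trace_id onto [0, 1) so every service reaches the same verdict.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32
    return bucket < rate


def keep_trace(trace_id: str, duration_ms: float, had_error: bool) -> bool:
    """Tail decision: errors and slow traces are kept regardless of the head decision."""
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True
    return head_sampled(trace_id)
```

Hashing the `trace_id` rather than rolling a random number keeps the head decision consistent across services that see the same trace.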

**Common fields for logs and spans:**

```
ts, level, service, version, env, region, instance_id,
tenant_id, project_id, actor, route, method, resource, action,
trace_id, span_id, parent_span_id, request_id,
job_id, run_id, source_id, sbom_id, policy_version,
decision_effect, decision_reason, error_code, error_message
```

**Sensitive data policy:**

* Scrub secrets, tokens, emails, and file paths beyond repo root.
* Redact by default and allow per‑tenant opt‑in for deep debug.
* PII filters enforced in the logging library, not at call sites.
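
A rough sketch of what "enforced in the logging library" could look like: one scrub function that every log record passes through before serialization. The key list, the placeholder strings, and the choice to keep secrets redacted even in deep debug are assumptions; path scrubbing is left out for brevity.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed list of field names that always carry secrets.
SECRET_KEYS = {"password", "token", "secret", "authorization", "api_key"}


def scrub(fields: dict, deep_debug: bool = False) -> dict:
    """Return a redacted copy of a log record; call sites never see this step."""
    out = {}
    for key, value in fields.items():
        if key.lower() in SECRET_KEYS:
            # Assumption: secrets stay redacted even when a tenant opts in to deep debug.
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = scrub(value, deep_debug)
        elif isinstance(value, str) and not deep_debug:
            out[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            out[key] = value
    return out
```

Because the logger calls `scrub` itself, a new call site cannot forget to redact.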

### 3.2 Metrics taxonomy and SLOs

* **Golden signals per service:** request latency (P50/P95/P99), error rate, throughput, saturation (CPU, queue depth).
* **User journeys:** SBOM ingest, policy evaluate, advisory link, VEX reconcile, export bundle, notification send.
* **SLO examples:**

  * SBOM ingest start‑to‑first‑component P95 < 5s.
  * Policy evaluation P95 < 2s for 5k components.
  * Advisory link to exposure delta P95 < 30s.
  * Export bundle availability P95 < 90s from request.

SLOs emit events on burn rate breaches, not just threshold spikes.
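
To make "burn rate, not threshold spike" concrete: burn rate is the observed bad-event ratio divided by the error budget. The sketch below assumes a P95 latency SLO is treated as a 0.95 target (a "bad" event is a request slower than the limit) and uses a two-window check; the 14.4 threshold and the function names are illustrative assumptions.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed bad-event ratio divided by the error budget (1 - target)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def breach(short: tuple[int, int], long: tuple[int, int],
           slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when both the short and the long window burn too fast.

    Requiring both windows filters out brief spikes that a plain
    threshold alert would page on.
    """
    return (burn_rate(*short, slo_target) >= threshold
            and burn_rate(*long, slo_target) >= threshold)
```

A burn rate of 1.0 means the budget is being spent exactly as fast as the SLO allows; anything above that exhausts it early.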

### 3.3 Tracing model

* **Ingress span:** `api.<route>` includes authN/Z decision attributes without PII.
* **Work spans:** `orchestrator.schedule`, `runner.execute`, `conseiller.ingest`, `excitator.link`, `policy.evaluate`.
* **Async linking:** parent trace context serialized into job payloads. Workers restore it and annotate spans with job metadata.
* **Cross‑boundary:** CLI generates `traceparent`; Console mirrors it into the browser dev tools for correlation during support cases.
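
Async linking might look like the sketch below: the producer writes a W3C `traceparent` value into the job payload, and the worker restores it as the parent of a new span. The payload key name and the helper functions are assumptions; a real service would normally use the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """What a CLI or ingress point would generate when no context exists yet."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def enqueue(job: dict, traceparent: str) -> dict:
    """Producer side: carry the context inside the job payload (assumed key name)."""
    return {**job, "traceparent": traceparent}


def restore(payload: dict) -> dict:
    """Worker side: continue the trace as a child span, or start fresh if missing."""
    match = TRACEPARENT_RE.match(payload.get("traceparent", ""))
    if match is None:
        match = TRACEPARENT_RE.match(new_traceparent())
        parent = None
    else:
        parent = match.group(2)
    return {
        "trace_id": match.group(1),
        "parent_span_id": parent,
        "span_id": secrets.token_hex(8),   # new span for the worker's own work
        "job_id": payload.get("job_id"),   # job metadata annotated onto the span
    }
```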

### 3.4 Logs that don’t lie

* **Decision logs:** every authZ decision and policy evaluation emits a structured record.
* **Consistency:** one logger library used across all services.
* **Tenant isolation:** logs carry `tenant_id`. Central log store partitions by tenant.
* **Retention:** hot logs short (3‑7 days), warm 30‑90 days, cold based on tenant policy. For forensic artifacts see 3.6.

### 3.5 Tenant Timeline

* Stream rule: every state transition emits `timeline_event` with a canonical schema:

```
event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, details{}
```

* Aggregator stores an index per tenant for 180 days by default.
* UI visualizes events with filters: “show all policy changes,” “show all job failures that touched component X,” “expand decisions for this trace.”
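
As an illustration of the schema and the filter queries above, here is a small in-memory sketch. The field names come from the canonical schema; the event kinds in the usage example and the `query` helper are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def timeline_event(tenant_id, project_id, kind, actor,
                   resource_type, resource_id, trace_id, details=None) -> dict:
    """Build one event with the canonical timeline_event fields."""
    return {
        "event_id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "project_id": project_id,
        "kind": kind,
        "actor": actor,
        "resource_type": resource_type,
        "resource_id": resource_id,
        "trace_id": trace_id,
        "details": details or {},
    }


def query(events, tenant_id, kind_prefix="", resource_id=None):
    """Tenant-scoped filter: the tenant is mandatory, everything else narrows."""
    return [
        e for e in events
        if e["tenant_id"] == tenant_id
        and e["kind"].startswith(kind_prefix)
        and (resource_id is None or e["resource_id"] == resource_id)
    ]
```

For example, `query(events, "t1", kind_prefix="job.failed", resource_id="component-x")` corresponds to "show all job failures that touched component X".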

### 3.6 Forensic Evidence Locker

* Append‑only, WORM‑capable object storage path `tenants/<tenant>/evidence/<case>/<artifact>`.
* **Snapshot types:**

  * Evaluation bundle: SBOM slice, linked advisories, VEX inputs, policy set, engine version, config, and decision trace.
  * Job capsule: input digests, outputs, runner image digest, env digest (vars hashed), command transcript.
  * Export manifest: files, checksums, recipients.
* **Integrity:**

  * Merkle index per bundle. Root hash stored in DB and optionally timestamped.
  * DSSE signed manifest with platform key. Key rotation logged and validated.
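
A minimal sketch of the Merkle index over a bundle's files. The epic does not specify the tree construction, so everything here is an assumption: SHA-256, leaves sorted by path, distinct prefixes for leaves and inner nodes, and an unpaired node promoted to the next level.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(files: dict[str, bytes]) -> str:
    """Compute a root hash over {path: content}, independent of insertion order."""
    if not files:
        raise ValueError("a bundle needs at least one file")
    # Leaf = hash of path plus content digest, so renames change the root too.
    level = [
        _h(b"\x00" + path.encode() + b"\x00" + _h(content))
        for path, content in sorted(files.items())
    ]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # odd node out is promoted unchanged
        level = nxt
    return level[0].hex()
```

The returned hex string is what would be stored as `root_hash` and, where configured, timestamped.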

### 3.7 Provenance and attestations

* Jobs emit SLSA‑style provenance attestations:

  * subjects: produced artifacts (e.g., exposure report)
  * predicate: sources, builder image, inputs, policies, advisory versions
  * builder id: Task Runner identity
* Verify on read. UI shows green check with signer and time.
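
The shape of such an attestation can be sketched with an in-toto v1 statement and the DSSE pre-authentication encoding (PAE), which is the byte string that actually gets signed. The predicate body and the helper names are hypothetical, and signing itself is omitted because it depends on the platform key setup.

```python
import hashlib
import json


def provenance_statement(name: str, artifact: bytes,
                         predicate_type: str, predicate: dict) -> dict:
    """in-toto v1 statement binding a predicate to the artifact's digest."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": name,
                     "digest": {"sha256": hashlib.sha256(artifact).hexdigest()}}],
        "predicateType": predicate_type,
        "predicate": predicate,  # hypothetical shape: sources, builder, inputs...
    }


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes a signer signs."""
    t = payload_type.encode()
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)


def signing_input(statement: dict) -> bytes:
    payload = json.dumps(statement, sort_keys=True).encode()
    return pae("application/vnd.in-toto+json", payload)


def subject_matches(statement: dict, artifact: bytes) -> bool:
    """The verify-on-read half that needs no key: does the digest still match?"""
    digest = hashlib.sha256(artifact).hexdigest()
    return any(s["digest"].get("sha256") == digest for s in statement["subject"])
```

If the predicate follows the SLSA provenance schema, `predicate_type` would be `https://slsa.dev/provenance/v1`; a custom predicate should use its own URI.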

### 3.8 Chain of custody

* Chain ties together:

  * source fetch → SBOM compute → policy eval → advisory link → export
* Each step has immutable IDs and cryptographic digests inside the timeline and evidence.
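
One way to tie the steps together is a hash chain in which each link commits to the previous link's digest. This is a sketch under assumptions: the step names, the all-zero genesis value, and the JSON canonicalization are illustrative choices, not the specified format.

```python
import hashlib
import json

STEPS = ["source_fetch", "sbom_compute", "policy_eval", "advisory_link", "export"]
GENESIS = "0" * 64  # assumed placeholder for "no previous step"


def link(prev_digest: str, step: str, artifact_digest: str) -> dict:
    """One custody record; its digest covers the previous record's digest."""
    body = {"prev": prev_digest, "step": step, "artifact": artifact_digest}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "digest": digest}


def build_chain(artifacts: list[tuple[str, bytes]]) -> list[dict]:
    chain, prev = [], GENESIS
    for step, content in artifacts:
        entry = link(prev, step, hashlib.sha256(content).hexdigest())
        chain.append(entry)
        prev = entry["digest"]
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited step breaks it and all later ones."""
    prev = GENESIS
    for entry in chain:
        expected = link(prev, entry["step"], entry["artifact"])
        if entry["prev"] != prev or entry["digest"] != expected["digest"]:
            return False
        prev = entry["digest"]
    return True
```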

### 3.9 Incident mode

* Feature flag that increases trace sampling, captures additional breadcrumbs, and extends evidence retention for the next N hours.
* Per‑tenant activation to avoid surprise bills.
* Automatically enabled on SLO burn rate breaches above a threshold.
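
A sketch of how the flag could behave: activation is per tenant, time-boxed, and driven by burn rate. The 4-hour window, the threshold, and the sampling rates are placeholder assumptions, as are the class and function names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IncidentMode:
    tenant_id: str
    expires_at: datetime

    def active(self, now: datetime) -> bool:
        return now < self.expires_at


def maybe_activate(tenant_id: str, burn_rate: float, now: datetime,
                   threshold: float = 14.4, hours: int = 4) -> Optional[IncidentMode]:
    """Turn incident mode on for one tenant when its burn rate crosses the threshold."""
    if burn_rate < threshold:
        return None
    return IncidentMode(tenant_id, now + timedelta(hours=hours))


def sampling_rate(mode: Optional[IncidentMode], now: datetime,
                  base: float = 0.1, incident: float = 1.0) -> float:
    """Full sampling while the flag is live, base rate once it expires."""
    if mode is not None and mode.active(now):
        return incident
    return base
```

Because the flag carries its own expiry, a forgotten incident falls back to the base rate without anyone switching it off.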

### 3.10 Multi‑tenant guarantees

* Telemetry, timeline, evidence, and attestations are tenant‑scoped by design.
* RLS applies to timeline DB.
* Evidence locker uses tenant prefix and optional per‑tenant KMS key.
* Exporting evidence requires `stella:evidence:export#tenant/<id>` scope.
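
A scope check in this format might be sketched as follows. The parsing rules are inferred from the scope strings in this document; treating `*` as an action wildcard is an assumption.

```python
def parse_scope(scope: str) -> tuple[str, str, str]:
    """Split 'stella:<area>:<action>#tenant/<id>' into (area, action, tenant)."""
    name, _, target = scope.partition("#")
    prefix, area, action = name.split(":")  # ValueError if not exactly three parts
    kind, _, tenant = target.partition("/")
    if prefix != "stella" or kind != "tenant" or not tenant:
        raise ValueError(f"malformed scope: {scope!r}")
    return area, action, tenant


def allows(granted: list[str], area: str, action: str, tenant: str) -> bool:
    """True if any granted scope covers this area and action for this tenant."""
    for scope in granted:
        try:
            g_area, g_action, g_tenant = parse_scope(scope)
        except ValueError:
            continue  # malformed scopes grant nothing
        if g_area == area and g_tenant == tenant and g_action in (action, "*"):
            return True
    return False
```

The tenant is part of the match, so a valid scope for one tenant never authorizes an export from another.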

### 3.11 Console features

* **Observability Hub:** health at a glance, SLO widgets, Top failing routes, Top noisy tenants (for operators), trace search, log search with guardrails.
* **Forensics Explorer:** timeline with filters, evidence bundle viewer, attestation verifier, “Create Snapshot” wizard, comparison between two evaluations.
* **“Why is this red?”** click-through from an error to the exact span, log lines, and policy decision that caused it.

### 3.12 CLI features

* `stella obs top` live stats for APIs and jobs.
* `stella obs trace <trace_id>` dump correlated events.
* `stella forensic snapshot create --case <id> --scope <eval|job|export> --id <resource>`
* `stella forensic verify <bundle.tgz>` validate checksums and signatures.
* `stella forensic attest show <artifact>` print provenance.

### 3.13 Integrations

* OpenTelemetry Collector ships by default.
* Prometheus scrape config and dashboards included.
* Webhooks for SLO breaches and incident mode start/stop.
* Optional RFC 3161 timestamping for Merkle roots where a time authority is configured.

---

## 4) Architecture

### 4.1 New modules

* `telemetry/core` logging, metrics, tracing libraries with scrubbing and context.
* `telemetry/collector-config` ship default collector config.
* `timeline/indexer` consumes events and builds tenant timeline indices.
* `evidence/locker` API to create, read, and sign bundles.
* `provenance/attest` generates DSSE statements and verification helpers.
* `console/obs` dashboards and trace viewer.
* `console/forensics` timeline and evidence UIs.
* `cli/obs`, `cli/forensics`.

### 4.2 Data model

New tables with RLS:

* `timeline_events(event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, digest, details_jsonb)`
* `evidence_bundles(id, tenant_id, case_id, kind, root_hash, path, signer_key_id, created_at, labels)`
* `attestations(id, tenant_id, subject_id, subject_digest, statement_path, signer_key_id, created_at)`

Indices: `(tenant_id, ts)`, `(tenant_id, resource_type, resource_id)`, `(tenant_id, case_id)`.

### 4.3 Storage layout

* Object store:

  * `tenants/<t>/evidence/<case>/<bundle_id>/manifest.json`
  * `tenants/<t>/evidence/<case>/<bundle_id>/root.merkle`
  * `tenants/<t>/attestations/<subject_id>/<statement.json>`
* Optional S3 Object Lock for WORM with per‑tenant retention policy.

### 4.4 Message topics

* `stella.<tenant>.timeline.*` emitted by all services.
* `stella.<tenant>.slo.breach` from SLO evaluator.
* `stella.<tenant>.incident.mode` start and stop events.
* `stella.global.kb.*` remains for public advisories and has no tenant data.

---

## 5) APIs and contracts

### 5.1 Observability read APIs

* `GET /obs/health` summary per service.
* `GET /obs/slo` list current SLOs and burn rates.
* `GET /obs/trace/:id` metadata plus deep link to trace backend.
* `GET /obs/logs` query interface with guardrails for time window and tenant.
* `GET /obs/metrics` small set of computed aggregates for Console.

### 5.2 Timeline APIs

* `GET /timeline?from=&to=&kind=&resource=&project=` paginated events.
* `GET /timeline/:event_id` detailed view, links to trace and evidence.

### 5.3 Evidence APIs

* `POST /evidence/snapshot` with `{kind, resource_id, case_id}` creates a bundle.
* `GET /evidence/:bundle_id` returns manifest and signed hashes.
* `POST /evidence/verify` upload or reference a bundle path to verify.
* `POST /evidence/hold/:case_id` place or release legal hold.

### 5.4 Attestation APIs

* `GET /attestations?subject_id=` list and filter.
* `POST /attestations/verify` verify a statement against subject digest.

### 5.5 Structured log contract

Logs are newline‑delimited JSON with the common fields in 3.1. `error_message` must be non‑PII and concise. Multi‑line errors are folded into `details.stack` with length limits.
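
A sketch of an emitter that follows this contract: one JSON object per line, a single-line `error_message`, and the stack folded into `details.stack` with a cap. The 4000-character limit and the function signature are assumptions; only a few of the common fields are shown.

```python
import json
import traceback
from datetime import datetime, timezone

MAX_STACK_CHARS = 4000  # assumed length limit for details.stack


def log_line(level, service, tenant_id, trace_id,
             error_code=None, error_message=None, exc=None, **fields) -> str:
    """Serialize one record as a single NDJSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
        **fields,
    }
    if error_code is not None:
        record["error_code"] = error_code
    if error_message is not None:
        # Keep only the first line; the rest belongs in details.stack.
        record["error_message"] = (error_message.splitlines() or [""])[0]
    if exc is not None:
        stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
        record["details"] = {"stack": stack[:MAX_STACK_CHARS]}
    # json.dumps escapes embedded newlines, so the output stays on one line.
    return json.dumps(record, separators=(",", ":"))
```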

All endpoints in 5.1 through 5.4 require Authority scopes such as:

* `stella:obs:read#tenant/<id>`
* `stella:timeline:read#tenant/<id>`
* `stella:evidence:create#tenant/<id>`
* `stella:evidence:read#tenant/<id>`
* `stella:attest:read#tenant/<id>`

---

## 6) Documentation changes

Create or update:

1. `/docs/observability/overview.md` what, why, scope, data flow.
2. `/docs/observability/telemetry-standards.md` fields, naming, sampling, scrubbing rules, examples.
3. `/docs/observability/metrics-and-slos.md` catalog of metrics, SLO definitions, alert policies, burn rates.
4. `/docs/observability/tracing.md` context propagation, async linking, CLI and Console behavior.
5. `/docs/observability/logging.md` structured logging guide, dos and don’ts, PII policy.
6. `/docs/forensics/evidence-locker.md` bundle formats, WORM, retention, legal hold.
7. `/docs/forensics/provenance-attestation.md` statement schema, signing, verification.
8. `/docs/forensics/timeline.md` schema, event kinds, queries, examples.
9. `/docs/console/observability.md` dashboards, trace viewer, log search.
10. `/docs/console/forensics.md` timeline, snapshot UI, verification flow.
11. `/docs/cli/observability.md` commands and examples.
12. `/docs/cli/forensics.md` snapshot and verify commands.
13. `/docs/security/redaction-and-privacy.md` telemetry privacy and tenant isolation.
14. `/docs/install/telemetry-stack.md` default collector, exporter options, dashboards.
15. `/docs/runbooks/incidents.md` incident mode, SLO breaches, escalation.

At the top of each page include:

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Implementation plan

### Phase 1 - Baseline telemetry

* Integrate OpenTelemetry SDK in all services.
* Replace ad‑hoc logs with structured logger.
* Emit minimal metrics and traces on hot paths.
* Ship default collector and dashboards.

### Phase 2 - SLOs and dashboards

* Define SLOs per journey.
* Implement burn rate alerts.
* Wire Console Observability Hub widgets.

### Phase 3 - Timeline and decision logs

* Emit timeline events everywhere.
* Build timeline indexer and APIs.
* Console Timeline viewer.

### Phase 4 - Evidence locker

* Implement snapshot builders for evaluation, job, export.
* Add Merkle manifests and DSSE signing.
* Add legal hold and retention checks.

### Phase 5 - Provenance and verification

* Generate attestations for jobs and exports.
* Verify on read.
* CLI verify commands.

### Phase 6 - Incident mode

* Feature flag, higher sampling, and retention bump.
* Webhook and Notifications integration.

---

## 8) Engineering tasks

**Telemetry core**

* [ ] Build `telemetry/core` with log scrubbing, context propagation, and OpenTelemetry exporters.
* [ ] Introduce a single `Logger` facade, deprecate old logging.
* [ ] Add request and job middleware to attach tenant/actor/trace context.

**Metrics and SLOs**

* [ ] Instrument golden paths with histograms and exemplars.
* [ ] Implement SLO evaluator and burn rate alerts.
* [ ] Provide Prometheus rules and Grafana dashboards.

**Tracing**

* [ ] Propagate `traceparent` in HTTP, gRPC, and job payloads.
* [ ] Link CLI requests via headers.
* [ ] Add span attributes for policy decisions and authZ outcomes.

**Timeline**

* [ ] Define `timeline_event` schema and emitters in all services.
* [ ] Build indexer, APIs, and RLS policies.
* [ ] Console visualization with filters and deep links to traces.

**Evidence locker**

* [ ] Implement bundle builders, Merkle construction, and DSSE signing.
* [ ] Object store layout with optional WORM mode.
* [ ] APIs for create, get, verify, and legal hold.

**Provenance**

* [ ] Define statement schema and signers.
* [ ] Add verification library and server hooks.
* [ ] CLI support for show and verify.

**Privacy and redaction**

* [ ] Implement scrubbers for secrets and PII in logger.
* [ ] Add config for per‑tenant deep debug opt‑in with time‑boxed TTL.
* [ ] Redaction tests.

**Console**

* [ ] Observability Hub: health, SLO widgets, trace search.
* [ ] Forensics Explorer: timeline, evidence viewer, verification UX.
* [ ] “Why is this red” drill‑down from errors to spans/logs/decisions.

**CLI**

* [ ] `stella obs` commands and pretty printers.
* [ ] `stella forensic snapshot`, `verify`, `attest show`.
* [ ] Respect `--tenant` and print `trace_id` for copy‑paste.

**Docs**

* [ ] Author all pages in section 6, with examples and diagrams.
* [ ] Insert the imposed rule banner at the top of each page.
* [ ] Add runbooks for SLO breaches and incident mode.

**Testing**

* [ ] Unit tests: scrubbing, logger fields, timeline emitter.
* [ ] E2E: create SBOM, run policy, link advisories, export, then snapshot and verify.
* [ ] Load tests: ensure tracing overhead < 5 percent CPU.
* [ ] Failure injection: drop collector, ensure backpressure and fallbacks.

---

## 9) Feature changes required in other components

* **Authority & Tenancy:** add scopes `stella:obs:read`, `stella:timeline:read`, `stella:evidence:*`, `stella:attest:read`. Enforce tenant constraints everywhere.
* **Orchestrator & Runner:** stamp job spans and emit `job.started|finished|failed` timeline events with root cause.
* **Findings Ledger:** record evaluation IDs and link to evidence bundles.
* **Policy Engine:** include policy version and rule IDs in spans and timeline events; log decision summaries consistently.
* **Export Center:** produce attestations for every generated artifact and optionally embed a copy of the evaluation bundle.
* **Notifications:** send SLO breach and incident mode start/stop messages with links to traces and timeline searches.
* **Conseiller & Excitator:** emit ingest and linking events for the timeline and capture source digests in evidence bundles.

> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 10) Acceptance criteria

* All services emit structured logs with the common fields.
* At least 90 percent of top routes and job paths have tracing with exemplars.
* SLOs defined and visible in Console; burn rate alerts work.
* Tenant Timeline shows a full chain for a real evaluation.
* Evidence snapshot for an evaluation verifies successfully.
* Attestation exists and verifies for at least one export and one job.
* Incident mode increases trace volume and extends evidence retention.
* RLS prevents cross‑tenant reads of timeline and evidence.
* 403 payloads continue to explain policy or scope denials without leaking PII.

---

## 11) Risks and mitigations

* **Telemetry cost blow‑up.** Use sampling, cardinality limits, and per‑tenant caps.
* **PII leakage.** Redaction enforced in the logger library, security review on field additions.
* **Broken traces in async work.** Serialize context into job payloads and test it.
* **False sense of immutability.** Use real WORM where available and sign evidence; document guarantees honestly.
* **Operational complexity.** Ship default collector and dashboards so teams start from a working setup, not a blank page.

---

## 12) Philosophy

* **If it isn’t measured, it isn’t managed.**
* **If it isn’t preserved, it didn’t happen.**
* **If it isn’t explainable, it isn’t trustworthy.**

StellaOps makes risk decisions. This epic is how those decisions become observable, traceable, and defensible.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.