Monitoring and forensics: the vitamins everyone remembers to take after getting sick. Let’s make it first‑class instead of a panic purchase after the postmortem. Here’s the full, doc‑ready epic.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

# Epic 15: Observability & Forensics

**Short name:** Observability & Forensics
**Primary components:** Web Services API, Orchestrator, Task Runner, Findings Ledger, Conseiller (Feedser), Excitator (VEXer), Policy Engine, Export Center, Console, CLI
**Surfaces:** logs, metrics, traces, decision audits, event timeline, evidence locker, provenance attestations, dashboards, alerts
**Dependencies:** Authority‑Backed Scopes & Tenancy, AOC Enforcement, Policy Studio, Export Center

**AOC ground rule reminder:** Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Observability and forensic capture must reflect sources as they were seen, not “cleaned up.”

---

## 1) What it is

A platform‑wide, tenant‑aware telemetry and evidence system that answers three questions fast:

1. What is happening right now, and is it healthy?
2. What exactly happened earlier, across services and jobs?
3. Can we prove it, independently and later?

It combines:

* **Unified telemetry:** logs, metrics, and traces for all services and jobs, using OpenTelemetry conventions.
* **SLOs and dashboards** for golden signals and user‑journey health.
* **Forensic readiness:** evidence snapshots, immutable audit trails, provenance attestations, chain‑of‑custody, and a tenant timeline that reconstructs evaluations and decisions.
* **UI and CLI** to explore incidents, verify attestations, and export legally defensible bundles.

Everything is tenant‑scoped by default and tamper‑evident where it matters.

---

## 2) Why

Because “we think it worked” is not proof. You need fast feedback during normal ops and a trustworthy record when something breaks or lawyers get involved. If StellaOps decides risk, then StellaOps must show its work and preserve it.

---

## 3) How it should work

### 3.1 Instrumentation blueprint

* **Tracing:** every request and job carries a `trace_id` using W3C Trace Context. Background workers continue the same trace or link to it as a child. The CLI adds the `traceparent` header when talking to the API.
* **Metrics:** standardized counters, histograms, and gauges for latency, throughput, errors, and resource use. Exemplars embed `trace_id` for jump‑to‑trace from graphs.
* **Logs:** structured JSON with fixed fields. No printf archaeology.
* **Sampling:** head sampling for traces, with tail sampling on high latency and errors. Error logs are never sampled.

**Common fields for logs and spans:**

```
ts, level, service, version, env, region, instance_id,
tenant_id, project_id, actor, route, method, resource, action,
trace_id, span_id, parent_span_id, request_id,
job_id, run_id, source_id, sbom_id, policy_version,
decision_effect, decision_reason, error_code, error_message
```

**Sensitive data policy:**

* Scrub secrets, tokens, emails, and file paths outside the repo root.
* Redact by default and allow per‑tenant opt‑in for deep debug.
* PII filters are enforced in the logging library, not at call sites.

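As a rough illustration of "PII filters enforced in the logging library, not at call sites", the sketch below builds one JSON log line and scrubs every field centrally. The function names, the regex patterns, and the list of secret key names are all assumptions made for this example; it covers emails, bearer tokens, and secret-looking keys only, and leaves out file-path scrubbing.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical scrub patterns; a real deployment would maintain a reviewed list.
_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[email]"),
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"), "[token]"),
]
# Hypothetical key names whose values are always redacted.
_SECRET_KEYS = {"password", "secret", "token", "authorization", "api_key"}


def scrub(value):
    """Recursively redact secrets and emails from a log field value."""
    if isinstance(value, str):
        for pattern, replacement in _PATTERNS:
            value = pattern.sub(replacement, value)
        return value
    if isinstance(value, dict):
        return {
            k: "[redacted]" if k.lower() in _SECRET_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value


def log_record(level, service, tenant_id, trace_id, **fields):
    """Build one JSON log line; callers never scrub, the library always does."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
    }
    record.update(scrub(fields))
    return json.dumps(record, sort_keys=True)
```

Because scrubbing happens inside `log_record`, a call site that passes an email address as `actor` still produces a redacted line.
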
### 3.2 Metrics taxonomy and SLOs

* **Golden signals per service:** request latency (P50/P95/P99), error rate, throughput, saturation (CPU, queue depth).
* **User journeys:** SBOM ingest, policy evaluate, advisory link, VEX reconcile, export bundle, notification send.
* **SLO examples:**

  * SBOM ingest start‑to‑first‑component P95 < 5s.
  * Policy evaluation P95 < 2s for 5k components.
  * Advisory link to exposure delta P95 < 30s.
  * Export bundle availability P95 < 90s from request.

SLOs emit events on burn rate breaches, not just threshold spikes.

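To make "burn rate breaches, not just threshold spikes" concrete, here is a minimal sketch of the usual burn-rate arithmetic: the observed bad-event ratio divided by the error budget, checked over a short and a long window. The window sizes and the threshold of 2.0 are illustrative assumptions, not values defined by this epic.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is consumed (1.0 means exactly on budget)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_alert(short_window_rate: float, long_window_rate: float, threshold: float) -> bool:
    """Fire only when both windows exceed the threshold, so brief spikes are ignored."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# "Policy evaluation P95 < 2s" allows at most 5% of evaluations to exceed 2s.
# The counts below are made-up sample data.
short = burn_rate(bad_events=120, total_events=1000, slo_target=0.95)   # e.g. last 5 minutes
long = burn_rate(bad_events=900, total_events=12000, slo_target=0.95)   # e.g. last hour
print(round(short, 2), round(long, 2), should_alert(short, long, threshold=2.0))
```

In this sample the short window burns budget faster than the threshold but the long window does not, so no alert fires. That is the difference between a spike and a sustained breach.
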
### 3.3 Tracing model

* **Ingress span:** `api.<route>` includes authN/Z decision attributes without PII.
* **Work spans:** `orchestrator.schedule`, `runner.execute`, `conseiller.ingest`, `excitor.link`, `policy.evaluate`.
* **Async linking:** parent trace context serialized into job payloads. Workers restore it and annotate spans with job metadata.
* **Cross‑boundary:** CLI generates `traceparent`; Console mirrors it into the browser dev tools for correlation during support cases.

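The async-linking rule can be sketched as follows, using the W3C `traceparent` layout (`00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`). The helper names and the `traceparent` key inside the job payload are assumptions for illustration; a real service would normally rely on the OpenTelemetry SDK propagators rather than hand-rolled parsing.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Create a version-00 traceparent value with random trace and span ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return the parsed context, or None when the header is malformed."""
    match = _TRACEPARENT.match(header)
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": flags == "01"}


def enqueue_job(payload: dict, traceparent: str) -> dict:
    """Producer side: serialize the trace context into the job payload."""
    return {**payload, "traceparent": traceparent}


def start_job_span(job: dict) -> dict:
    """Worker side: restore the context and open a child span with job metadata."""
    ctx = parse_traceparent(job.get("traceparent", ""))
    if ctx is None:
        # No usable context: start a fresh trace instead of failing the job.
        ctx = parse_traceparent(new_traceparent())
        parent_span_id = None
    else:
        parent_span_id = ctx["parent_id"]
    return {
        "trace_id": ctx["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent_span_id,
        "job_id": job.get("job_id"),
    }
```

The worker falls back to a new trace when the payload carries no valid context, so a broken producer degrades correlation without breaking job execution.
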
### 3.4 Logs that don’t lie

* **Decision logs:** every authZ decision and policy evaluation emits a structured record.
* **Consistency:** one logger library used across all services.
* **Tenant isolation:** logs carry `tenant_id`. The central log store partitions by tenant.
* **Retention:** hot logs are kept short (3‑7 days), warm logs 30‑90 days, and cold logs according to tenant policy. For forensic artifacts, see 3.6.

### 3.5 Tenant Timeline

* Stream rule: every state transition emits `timeline_event` with a canonical schema:

```
event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, details{}
```

* Aggregator stores an index per tenant for 180 days by default.
* UI visualizes events with filters: “show all policy changes,” “show all job failures that touched component X,” “expand decisions for this trace.”

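A minimal sketch of the canonical `timeline_event` shape and a tenant-scoped filter, assuming an in-memory list in place of the real index. The class, the `query` helper, and the example kind name `policy.changed` are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class TimelineEvent:
    tenant_id: str
    project_id: str
    kind: str            # e.g. "job.failed" or "policy.changed" (illustrative names)
    actor: str
    resource_type: str
    resource_id: str
    trace_id: str
    details: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts: str = field(default_factory=_now)


def query(events, tenant_id, kind_prefix="", trace_id=None):
    """Filter events for one tenant; tenant_id is mandatory so results never mix tenants."""
    return [
        e for e in events
        if e.tenant_id == tenant_id
        and e.kind.startswith(kind_prefix)
        and (trace_id is None or e.trace_id == trace_id)
    ]
```

A filter such as “show all job failures” then becomes `query(events, tenant_id, kind_prefix="job.failed")`, and “expand decisions for this trace” becomes a lookup by `trace_id`.
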
### 3.6 Forensic Evidence Locker

* Append‑only, WORM‑capable object storage at path `tenants/<tenant>/evidence/<case>/<artifact>`.
* **Snapshot types:**

  * Evaluation bundle: SBOM slice, linked advisories, VEX inputs, policy set, engine version, config, and decision trace.
  * Job capsule: input digests, outputs, runner image digest, env digest (vars hashed), command transcript.
  * Export manifest: files, checksums, recipients.
* **Integrity:**

  * Merkle index per bundle. Root hash stored in DB and optionally timestamped.
  * DSSE signed manifest with platform key. Key rotation logged and validated.

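One possible way to compute the per-bundle Merkle root is sketched below. The epic does not fix a tree construction, so everything here is an assumption: SHA-256, leaves ordered by path, a `0x00`/`0x01` prefix to separate leaves from interior nodes, and an unpaired node promoted unchanged. It is not compatible with RFC 6962 trees.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(files: dict[str, bytes]) -> str:
    """Return the hex Merkle root over bundle files, ordered by path for determinism."""
    if not files:
        raise ValueError("a bundle needs at least one file")
    # Leaf = H(0x00 || path || 0x00 || H(content)); the prefix separates leaves from nodes.
    level = [
        _sha256(b"\x00" + path.encode("utf-8") + b"\x00" + _sha256(content))
        for path, content in sorted(files.items())
    ]
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                paired.append(_sha256(b"\x01" + level[i] + level[i + 1]))
            else:
                paired.append(level[i])  # odd node is promoted unchanged
        level = paired
    return level[0].hex()
```

Any change to a file's content or path changes the root, which is the value the locker would store in the DB and optionally timestamp.
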
### 3.7 Provenance and attestations

* Jobs emit SLSA‑style provenance attestations:

  * subjects: produced artifacts (e.g., exposure report)
  * predicate: sources, builder image, inputs, policies, advisory versions
  * builder id: Task Runner identity
* Verify on read. UI shows green check with signer and time.

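The attestation flow can be illustrated with an in-toto style statement wrapped in a DSSE envelope. DSSE signs the pre-authentication encoding (PAE) of the payload rather than the raw bytes. In this sketch an HMAC stands in for the platform's real signature scheme so that it runs with the standard library only; the predicate type URI, the predicate fields, and the helper names are hypothetical.

```python
import base64
import hashlib
import hmac
import json

PAYLOAD_TYPE = "application/vnd.in-toto+json"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(t), t, len(payload), payload)


def make_statement(artifact_name: str, artifact: bytes, predicate: dict) -> bytes:
    """Build a statement whose subject is the produced artifact's digest."""
    statement = {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{
            "name": artifact_name,
            "digest": {"sha256": hashlib.sha256(artifact).hexdigest()},
        }],
        "predicateType": "https://example.invalid/stella/provenance/v0",  # hypothetical
        "predicate": predicate,
    }
    return json.dumps(statement, sort_keys=True).encode("utf-8")


def sign(payload: bytes, key: bytes, key_id: str) -> dict:
    """Wrap the payload in a DSSE-shaped envelope (HMAC used as a stand-in signer)."""
    sig = hmac.new(key, pae(PAYLOAD_TYPE, payload), hashlib.sha256).digest()
    return {
        "payload": base64.b64encode(payload).decode("ascii"),
        "payloadType": PAYLOAD_TYPE,
        "signatures": [{"keyid": key_id, "sig": base64.b64encode(sig).decode("ascii")}],
    }


def verify(envelope: dict, key: bytes) -> bool:
    """Recompute the PAE over the envelope payload and compare signatures."""
    payload = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], payload), hashlib.sha256).digest()
    return any(
        hmac.compare_digest(expected, base64.b64decode(s["sig"]))
        for s in envelope["signatures"]
    )
```

"Verify on read" then means calling `verify` before the statement is shown, and separately checking that the subject digest matches the artifact actually being served.
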
### 3.8 Chain of custody

* Chain ties together:

  * source fetch → SBOM compute → policy eval → advisory link → export
* Each step has immutable IDs and cryptographic digests inside the timeline and evidence.

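One simple way to tie the steps together is a hash chain in which each step's record includes the digest of the previous one. This is a sketch under assumptions: the record fields, the step identifiers, and the canonical-JSON hashing are chosen for illustration and are not the epic's defined format.

```python
import hashlib
import json

# Step names mirror the chain above; the identifiers themselves are hypothetical.
STEPS = ["source_fetch", "sbom_compute", "policy_eval", "advisory_link", "export"]


def _digest(record: dict) -> str:
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_step(chain: list, step: str, artifact_digest: str) -> list:
    """Return a new chain with one more entry bound to the previous entry's digest."""
    prev = chain[-1]["digest"] if chain else None
    body = {"step": step, "artifact_digest": artifact_digest, "prev": prev}
    return chain + [{**body, "digest": _digest(body)}]


def verify_chain(chain: list) -> bool:
    """Check every link: the prev pointer and the entry's own digest must both match."""
    prev = None
    for entry in chain:
        body = {k: entry[k] for k in ("step", "artifact_digest", "prev")}
        if entry["prev"] != prev or entry["digest"] != _digest(body):
            return False
        prev = entry["digest"]
    return True
```

Altering any earlier step invalidates its own digest, so tampering is detectable from that point in the chain onward.
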
### 3.9 Incident mode

* Feature flag that increases trace sampling, captures additional breadcrumbs, and extends evidence retention for the next N hours.
* Per‑tenant activation to avoid surprise bills.
* Automatically enabled on SLO burn rate breaches above a threshold.

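A rough sketch of per-tenant, time-boxed incident mode, assuming an in-memory registry. The class name and both sampling ratios are invented for this example; the epic only says sampling increases for the next N hours.

```python
import time
from typing import Optional

BASE_SAMPLE_RATE = 0.05      # hypothetical default head-sampling ratio
INCIDENT_SAMPLE_RATE = 1.0   # hypothetical ratio while incident mode is active


class IncidentModeRegistry:
    """Tracks per-tenant incident mode with an expiry, so it cannot stay on forever."""

    def __init__(self) -> None:
        self._expires_at: dict[str, float] = {}

    def activate(self, tenant_id: str, hours: float, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._expires_at[tenant_id] = now + hours * 3600

    def is_active(self, tenant_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return self._expires_at.get(tenant_id, 0.0) > now

    def sample_rate(self, tenant_id: str, now: Optional[float] = None) -> float:
        """Only the affected tenant pays for the extra trace volume."""
        return INCIDENT_SAMPLE_RATE if self.is_active(tenant_id, now) else BASE_SAMPLE_RATE
```

The SLO evaluator could call `activate` on a burn rate breach, and the sampler would consult `sample_rate` per request.
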
### 3.10 Multi‑tenant guarantees

* Telemetry, timeline, evidence, and attestations are tenant‑scoped by design.
* Row‑level security (RLS) applies to the timeline DB.
* Evidence locker uses a tenant prefix and an optional per‑tenant KMS key.
* Exporting evidence requires the `stella:evidence:export#tenant/<id>` scope.

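The scope string format can be checked with a small parser like the one below. The matching rules are assumptions: the sketch treats `*` as matching any action (suggested by `evidence:*` in section 9) and requires an exact tenant match. The real Authority service may define scope semantics differently.

```python
from typing import NamedTuple, Optional


class Scope(NamedTuple):
    area: str
    action: str
    tenant_id: str


def parse_scope(raw: str) -> Optional[Scope]:
    """Parse 'stella:<area>:<action>#tenant/<id>'; return None if malformed."""
    name, sep, target = raw.partition("#")
    parts = name.split(":")
    if not sep or len(parts) != 3 or parts[0] != "stella":
        return None
    kind, slash, tenant_id = target.partition("/")
    if kind != "tenant" or not slash or not tenant_id:
        return None
    return Scope(parts[1], parts[2], tenant_id)


def allows(granted: list[str], required: str) -> bool:
    """True when any granted scope covers the required area, action, and tenant."""
    need = parse_scope(required)
    if need is None:
        return False
    for raw in granted:
        have = parse_scope(raw)
        if have is None:
            continue  # ignore malformed grants instead of failing open
        if (have.area == need.area
                and have.tenant_id == need.tenant_id
                and have.action in ("*", need.action)):
            return True
    return False
```

Malformed grants are skipped and a malformed requirement is denied, so parsing errors never widen access.
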
### 3.11 Console features

* **Observability Hub:** health at a glance, SLO widgets, Top failing routes, Top noisy tenants (for operators), trace search, log search with guardrails.
* **Forensics Explorer:** timeline with filters, evidence bundle viewer, attestation verifier, “Create Snapshot” wizard, comparison between two evaluations.
* **“Why is this red?”** click-through from an error to the exact span, log lines, and policy decision that caused it.

### 3.12 CLI features

* `stella obs top`: live stats for APIs and jobs.
* `stella obs trace <trace_id>`: dump correlated events.
* `stella forensic snapshot create --case <id> --scope <eval|job|export> --id <resource>`
* `stella forensic verify <bundle.tgz>`: validate checksums and signatures.
* `stella forensic attest show <artifact>`: print provenance.

### 3.13 Integrations

* OpenTelemetry Collector ships by default.
* Prometheus scrape config and dashboards included.
* Webhooks for SLO breaches and incident mode start/stop.
* Optional RFC 3161 timestamping for Merkle roots where a time authority is configured.

---

## 4) Architecture

### 4.1 New modules

* `telemetry/core`: logging, metrics, and tracing libraries with scrubbing and context.
* `telemetry/collector-config`: ships the default collector config.
* `timeline/indexer`: consumes events and builds tenant timeline indices.
* `evidence/locker`: API to create, read, and sign bundles.
* `provenance/attest`: generates DSSE statements and verification helpers.
* `console/obs`: dashboards and trace viewer.
* `console/forensics`: timeline and evidence UIs.
* `cli/obs`, `cli/forensics`.

### 4.2 Data model

New tables with RLS:

* `timeline_events(event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, digest, details_jsonb)`
* `evidence_bundles(id, tenant_id, case_id, kind, root_hash, path, signer_key_id, created_at, labels)`
* `attestations(id, tenant_id, subject_id, subject_digest, statement_path, signer_key_id, created_at)`

Indices: `(tenant_id, ts)`, `(tenant_id, resource_type, resource_id)`, `(tenant_id, case_id)`.

### 4.3 Storage layout

* Object store:

  * `tenants/<t>/evidence/<case>/<bundle_id>/manifest.json`
  * `tenants/<t>/evidence/<case>/<bundle_id>/root.merkle`
  * `tenants/<t>/attestations/<subject_id>/<statement.json>`
* Optional S3 Object Lock for WORM with per‑tenant retention policy.

### 4.4 Message topics

* `stella.<tenant>.timeline.*`: emitted by all services.
* `stella.<tenant>.slo.breach`: from the SLO evaluator.
* `stella.<tenant>.incident.mode`: start and stop events.
* `stella.global.kb.*`: remains for public advisories and carries no tenant data.

---

## 5) APIs and contracts

### 5.1 Observability read APIs

* `GET /obs/health`: summary per service.
* `GET /obs/slo`: list current SLOs and burn rates.
* `GET /obs/trace/:id`: metadata plus deep link to the trace backend.
* `GET /obs/logs`: query interface with guardrails for time window and tenant.
* `GET /obs/metrics`: small set of computed aggregates for Console.

### 5.2 Timeline APIs

* `GET /timeline?from=&to=&kind=&resource=&project=`: paginated events.
* `GET /timeline/:event_id`: detailed view, with links to trace and evidence.

### 5.3 Evidence APIs

* `POST /evidence/snapshot` with `{kind, resource_id, case_id}`: creates a bundle.
* `GET /evidence/:bundle_id`: returns the manifest and signed hashes.
* `POST /evidence/verify`: upload or reference a bundle path to verify.
* `POST /evidence/hold/:case_id`: place or release a legal hold.

### 5.4 Attestation APIs

* `GET /attestations?subject_id=`: list and filter.
* `POST /attestations/verify`: verify a statement against the subject digest.

### 5.5 Structured log contract

Logs are newline‑delimited JSON with the common fields in 3.1. `error_message` must be non‑PII and concise. Multi‑line errors are folded into `details.stack` with length limits.

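Folding a multi-line error into `details.stack` might look like the sketch below. The two length limits and the helper names are assumptions, since the contract only says "length limits", and PII scrubbing is omitted here because it belongs in the shared logger.

```python
import json
import traceback

MAX_MESSAGE_CHARS = 200   # hypothetical limit for error_message
MAX_STACK_CHARS = 4000    # hypothetical limit for details.stack


def error_fields(exc: BaseException) -> dict:
    """Return a single-line error_message plus the folded stack under details.stack."""
    first_line = (str(exc).splitlines() or [type(exc).__name__])[0]
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return {
        "error_message": first_line[:MAX_MESSAGE_CHARS],
        "details": {"stack": stack[:MAX_STACK_CHARS]},
    }


def to_ndjson_line(record: dict) -> str:
    """json.dumps escapes embedded newlines, so each record stays on one physical line."""
    return json.dumps(record, sort_keys=True)
```

Keeping the stack inside a JSON string means a log shipper that splits on newlines never breaks one record into several.
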
All endpoints in 5.1 through 5.4 require Authority scopes such as:

* `stella:obs:read#tenant/<id>`
* `stella:timeline:read#tenant/<id>`
* `stella:evidence:create#tenant/<id>`
* `stella:evidence:read#tenant/<id>`
* `stella:attest:read#tenant/<id>`

---

## 6) Documentation changes

Create or update:

1. `/docs/observability/overview.md`: what, why, scope, data flow.
2. `/docs/observability/telemetry-standards.md`: fields, naming, sampling, scrubbing rules, examples.
3. `/docs/observability/metrics-and-slos.md`: catalog of metrics, SLO definitions, alert policies, burn rates.
4. `/docs/observability/tracing.md`: context propagation, async linking, CLI and Console behavior.
5. `/docs/observability/logging.md`: structured logging guide, dos and don’ts, PII policy.
6. `/docs/forensics/evidence-locker.md`: bundle formats, WORM, retention, legal hold.
7. `/docs/forensics/provenance-attestation.md`: statement schema, signing, verification.
8. `/docs/forensics/timeline.md`: schema, event kinds, queries, examples.
9. `/docs/console/observability.md`: dashboards, trace viewer, log search.
10. `/docs/console/forensics.md`: timeline, snapshot UI, verification flow.
11. `/docs/cli/observability.md`: commands and examples.
12. `/docs/cli/forensics.md`: snapshot and verify commands.
13. `/docs/security/redaction-and-privacy.md`: telemetry privacy and tenant isolation.
14. `/docs/install/telemetry-stack.md`: default collector, exporter options, dashboards.
15. `/docs/runbooks/incidents.md`: incident mode, SLO breaches, escalation.

At the top of each page include:

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Implementation plan

### Phase 1 - Baseline telemetry

* Integrate OpenTelemetry SDK in all services.
* Replace ad‑hoc logs with structured logger.
* Emit minimal metrics and traces on hot paths.
* Ship default collector and dashboards.

### Phase 2 - SLOs and dashboards

* Define SLOs per journey.
* Implement burn rate alerts.
* Wire Console Observability Hub widgets.

### Phase 3 - Timeline and decision logs

* Emit timeline events everywhere.
* Build timeline indexer and APIs.
* Console Timeline viewer.

### Phase 4 - Evidence locker

* Implement snapshot builders for evaluation, job, export.
* Add Merkle manifests and DSSE signing.
* Add legal hold and retention checks.

### Phase 5 - Provenance and verification

* Generate attestations for jobs and exports.
* Verify on read.
* CLI verify commands.

### Phase 6 - Incident mode

* Feature flag, higher sampling, and retention bump.
* Webhook and Notifications integration.

---

## 8) Engineering tasks

**Telemetry core**

* [ ] Build `telemetry/core` with log scrubbing, context propagation, and OpenTelemetry exporters.
* [ ] Introduce a single `Logger` facade, deprecate old logging.
* [ ] Add request and job middleware to attach tenant/actor/trace context.

**Metrics and SLOs**

* [ ] Instrument golden paths with histograms and exemplars.
* [ ] Implement SLO evaluator and burn rate alerts.
* [ ] Provide Prometheus rules and Grafana dashboards.

**Tracing**

* [ ] Propagate `traceparent` in HTTP, gRPC, and job payloads.
* [ ] Link CLI requests via headers.
* [ ] Add span attributes for policy decisions and authZ outcomes.

**Timeline**

* [ ] Define `timeline_event` schema and emitters in all services.
* [ ] Build indexer, APIs, and RLS policies.
* [ ] Console visualization with filters and deep links to traces.

**Evidence locker**

* [ ] Implement bundle builders, Merkle construction, and DSSE signing.
* [ ] Object store layout with optional WORM mode.
* [ ] APIs for create, get, verify, and legal hold.

**Provenance**

* [ ] Define statement schema and signers.
* [ ] Add verification library and server hooks.
* [ ] CLI support for show and verify.

**Privacy and redaction**

* [ ] Implement scrubbers for secrets and PII in logger.
* [ ] Add config for per‑tenant deep debug opt‑in with time‑boxed TTL.
* [ ] Redaction tests.

**Console**

* [ ] Observability Hub: health, SLO widgets, trace search.
* [ ] Forensics Explorer: timeline, evidence viewer, verification UX.
* [ ] “Why is this red” drill‑down from errors to spans/logs/decisions.

**CLI**

* [ ] `stella obs` commands and pretty printers.
* [ ] `stella forensic snapshot`, `verify`, `attest show`.
* [ ] Respect `--tenant` and print `trace_id` for copy‑paste.

**Docs**

* [ ] Author all pages in section 6, with examples and diagrams.
* [ ] Insert the imposed rule banner at the top of each page.
* [ ] Add runbooks for SLO breaches and incident mode.

**Testing**

* [ ] Unit tests: scrubbing, logger fields, timeline emitter.
* [ ] E2E: create SBOM, run policy, link advisories, export, then snapshot and verify.
* [ ] Load tests: ensure tracing overhead < 5 percent CPU.
* [ ] Failure injection: drop collector, ensure backpressure and fallbacks.

---

## 9) Feature changes required in other components

* **Authority & Tenancy:** add scopes `obs:read`, `timeline:read`, `evidence:*`, `attest:read`. Enforce tenant constraints everywhere.
* **Orchestrator & Runner:** stamp job spans and emit `job.started|finished|failed` timeline events with root cause.
* **Findings Ledger:** record evaluation IDs and link to evidence bundles.
* **Policy Engine:** include policy version and rule IDs in spans and timeline events; log decision summaries consistently.
* **Export Center:** produce attestations for every generated artifact and optionally embed a copy of the evaluation bundle.
* **Notifications:** send SLO breach and incident mode start/stop messages with links to traces and timeline searches.
* **Conseiller & Excitator:** emit ingest and linking events for the timeline and capture source digests in evidence bundles.

> **Imposed rule reminder:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 10) Acceptance criteria

* All services emit structured logs with the common fields.
* At least 90 percent of top routes and job paths have tracing with exemplars.
* SLOs defined and visible in Console; burn rate alerts work.
* Tenant Timeline shows a full chain for a real evaluation.
* Evidence snapshot for an evaluation verifies successfully.
* Attestation exists and verifies for at least one export and one job.
* Incident mode increases trace volume and extends evidence retention.
* RLS prevents cross‑tenant reads of timeline and evidence.
* 403 payloads continue to explain policy or scope denials without leaking PII.

---

## 11) Risks and mitigations

* **Telemetry cost blow‑up.** Use sampling, cardinality limits, and per‑tenant caps.
* **PII leakage.** Redaction enforced in the logger library, security review on field additions.
* **Broken traces in async work.** Serialize context into job payloads and test it.
* **False sense of immutability.** Use real WORM where available and sign evidence; document guarantees honestly.
* **Operational complexity.** Ship default collector and dashboards so teams start from a working setup, not a blank page.

---

## 12) Philosophy

* **If it isn’t measured, it isn’t managed.**
* **If it isn’t preserved, it didn’t happen.**
* **If it isn’t explainable, it isn’t trustworthy.**

StellaOps makes risk decisions. This epic is how those decisions become observable, traceable, and defensible.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.