Monitoring and forensics: the vitamins everyone remembers to take after getting sick. Let’s make it first‑class instead of a panic purchase after the postmortem. Here’s the full, doc‑ready epic.
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
Epic 15: Observability & Forensics
Short name: Observability & Forensics
Primary components: Web Services API, Orchestrator, Task Runner, Findings Ledger, Conseiller (Feedser), Excitator (VEXer), Policy Engine, Export Center, Console, CLI
Surfaces: logs, metrics, traces, decision audits, event timeline, evidence locker, provenance attestations, dashboards, alerts
Dependencies: Authority‑Backed Scopes & Tenancy, AOC Enforcement, Policy Studio, Export Center
AOC ground rule reminder: Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Observability and forensic capture must reflect sources as they were seen, not “cleaned up.”
1) What it is
A platform‑wide, tenant‑aware telemetry and evidence system that answers three questions fast:
- What is happening right now, and is it healthy?
- What exactly happened earlier, across services and jobs?
- Can we prove it, independently and later?
It combines:
- Unified telemetry: logs, metrics, and traces for all services and jobs using OpenTelemetry conventions.
- SLOs and dashboards for golden signals and user‑journey health.
- Forensic readiness: evidence snapshots, immutable audit trails, provenance attestations, chain‑of‑custody, and a tenant timeline that reconstructs evaluations and decisions.
- UI and CLI to explore incidents, verify attestations, and export legally defensible bundles.
Everything is tenant‑scoped by default and tamper‑evident where it matters.
2) Why
Because “we think it worked” is not proof. You need fast feedback during normal ops and a trustworthy record when something breaks or lawyers get involved. If StellaOps decides risk, then StellaOps must show its work and preserve it.
3) How it should work
3.1 Instrumentation blueprint
- Tracing: every request and job carries a `trace_id` with W3C trace context. Background workers continue the same trace or link as a child. CLI adds the header when talking to the API.
- Metrics: standardized counters, histograms, and gauges for latency, throughput, errors, and resource use. Exemplars embed `trace_id` for jump‑to‑trace from graphs.
- Logs: structured JSON with fixed fields. No printf archaeology.
- Sampling: head sampling for traces with tail sampling on high latency and errors. Error logs never sampled.
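The sampling rule above could be sketched roughly as follows. This is an illustration, not the platform's sampler: the rate, the latency threshold, the function names, and the choice to let non-error logs follow the trace decision are all assumptions made for the example.

```python
# Hypothetical thresholds; the epic does not fix these values.
HEAD_SAMPLE_RATE = 0.10   # keep roughly 10% of ordinary traces
SLOW_TRACE_MS = 2000      # traces at or above this latency are always kept


def head_sampled(trace_id: str, rate: float = HEAD_SAMPLE_RATE) -> bool:
    """Deterministic head decision: the same trace_id always gets the same answer."""
    bucket = int(trace_id[:8], 16) / 0xFFFFFFFF  # map the first 32 bits to [0, 1]
    return bucket < rate


def keep_trace(trace_id: str, duration_ms: float, had_error: bool) -> bool:
    """Tail decision, made once the trace is complete."""
    if had_error or duration_ms >= SLOW_TRACE_MS:
        return True
    return head_sampled(trace_id)


def keep_log(level: str, trace_kept: bool) -> bool:
    """Error logs are never sampled out; other levels follow the trace (assumption)."""
    return level.lower() in {"error", "fatal"} or trace_kept
```

Deriving the head decision from the trace ID, rather than from a random draw per service, means every service reaches the same verdict for a given trace without coordination.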
Common fields for logs and spans:
ts, level, service, version, env, region, instance_id,
tenant_id, project_id, actor, route, method, resource, action,
trace_id, span_id, parent_span_id, request_id,
job_id, run_id, source_id, sbom_id, policy_version,
decision_effect, decision_reason, error_code, error_message
Sensitive data policy:
- Scrub secrets, tokens, emails, and file paths beyond repo root.
- Redact by default and allow per‑tenant opt‑in for deep debug.
- PII filters enforced in the logging library, not at call sites.
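A minimal sketch of library-level scrubbing, assuming a simple regex pass over message text. The patterns and the `scrub` name are illustrative only; they cover bearer tokens, `key=value` secrets, and email addresses, and do not attempt the file-path rule.

```python
import re

# Illustrative patterns only; a production scrubber would need a reviewed, tested set.
_PATTERNS = [
    (re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+"), r"\1 [REDACTED]"),
    (re.compile(r"(?i)\b(password|token|secret|api_key)=([^\s&]+)"), r"\1=[REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
]


def scrub(text: str) -> str:
    """Apply every redaction pattern in order and return the cleaned text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(scrub("login failed for alice@example.com token=abc123"))
```

Because the logger calls `scrub` itself, call sites cannot forget it, which is the point of enforcing the filter in the library.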
3.2 Metrics taxonomy and SLOs
- Golden signals per service: request latency (P50/P95/P99), error rate, throughput, saturation (CPU, queue depth).
- User journeys: SBOM ingest, policy evaluate, advisory link, VEX reconcile, export bundle, notification send.
- SLO examples:
  - SBOM ingest start‑to‑first‑component P95 < 5s.
  - Policy evaluation P95 < 2s for 5k components.
  - Advisory link to exposure delta P95 < 30s.
  - Export bundle availability P95 < 90s from request.
SLOs emit events on burn rate breaches, not just threshold spikes.
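To make "burn rate" concrete: it is the observed bad-event ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a two-window check and a threshold of 14.4, a commonly cited fast-burn value for a 30-day SLO; the epic does not specify windows or thresholds, so treat both as placeholders. For the latency SLOs above, a "bad" event would be a request slower than the stated limit, with a target of 0.95.

```python
from typing import Tuple

Window = Tuple[int, int]  # (bad_events, total_events) observed in one time window


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad / total) / error_budget


def is_breach(long_window: Window, short_window: Window,
              slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when both windows burn fast, so a brief spike alone does not alert."""
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

Requiring both windows to exceed the threshold is what separates a burn-rate breach from a plain threshold spike.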
3.3 Tracing model
- Ingress span: `api.<route>` includes authN/Z decision attributes without PII.
- Work spans: `orchestrator.schedule`, `runner.execute`, `conseiller.ingest`, `excitor.link`, `policy.evaluate`.
- Async linking: parent trace context serialized into job payloads. Workers restore it and annotate spans with job metadata.
- Cross‑boundary: CLI generates `traceparent`; Console mirrors it into the browser dev tools for correlation during support cases.
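A rough sketch of the async-linking idea, using the W3C `traceparent` layout (`version-traceid-parentid-flags`). The helper names and the job payload shape are assumptions for illustration; real services would use the OpenTelemetry propagators rather than hand-rolled parsing.

```python
import re
import secrets
from typing import Dict

# Version 00 layout: 32 hex chars of trace id, 16 of parent span id, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """What a CLI might generate before calling the API."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Dict[str, str]:
    match = _TRACEPARENT.match(header)
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("all-zero ids are invalid")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "flags": flags}


def enqueue_job(payload: Dict, traceparent: str) -> Dict:
    """Serialize the parent context into the job payload (hypothetical shape)."""
    return {**payload, "traceparent": traceparent}


def start_worker_span(job: Dict, name: str = "runner.execute") -> Dict:
    """Worker side: restore the context and open a child span with job metadata."""
    ctx = parse_traceparent(job["traceparent"])
    return {
        "name": name,
        "trace_id": ctx["trace_id"],
        "parent_span_id": ctx["parent_span_id"],
        "span_id": secrets.token_hex(8),
        "job_id": job.get("job_id"),
    }
```

The worker span keeps the caller's `trace_id` and records the caller's span as its parent, so the job shows up inside the original request's trace.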
3.4 Logs that don’t lie
- Decision logs: every authZ decision and policy evaluation emits a structured record.
- Consistency: one logger library used across all services.
- Tenant isolation: logs carry `tenant_id`. Central log store partitions by tenant.
- Retention: hot logs short (3‑7 days), warm 30‑90 days, cold based on tenant policy. For forensic artifacts see 3.6.
3.5 Tenant Timeline
- Stream rule: every state transition emits a `timeline_event` with a canonical schema:
event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, details{}
- Aggregator stores an index per tenant for 180 days by default.
- UI visualizes events with filters: “show all policy changes,” “show all job failures that touched component X,” “expand decisions for this trace.”
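The canonical schema and the filter-style queries might look roughly like this in code. The field names come from the schema above; the class, the `filter_events` helper, and the sample `kind` values are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class TimelineEvent:
    tenant_id: str
    project_id: str
    kind: str
    actor: str
    resource_type: str
    resource_id: str
    trace_id: str
    details: Dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def filter_events(events: Iterable[TimelineEvent], *, tenant_id: str,
                  kind: Optional[str] = None,
                  trace_id: Optional[str] = None) -> List[TimelineEvent]:
    """Tenant is mandatory; the other filters narrow the result when given."""
    return [
        e for e in events
        if e.tenant_id == tenant_id
        and (kind is None or e.kind == kind)
        and (trace_id is None or e.trace_id == trace_id)
    ]
```

Making `tenant_id` a required keyword mirrors the tenant-scoped-by-default rule: there is no way to query the timeline without naming a tenant.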
3.6 Forensic Evidence Locker
- Append‑only, WORM‑capable object storage path `tenants/<tenant>/evidence/<case>/<artifact>`.
- Snapshot types:
  - Evaluation bundle: SBOM slice, linked advisories, VEX inputs, policy set, engine version, config, and decision trace.
  - Job capsule: input digests, outputs, runner image digest, env digest (vars hashed), command transcript.
  - Export manifest: files, checksums, recipients.
- Integrity:
  - Merkle index per bundle. Root hash stored in DB and optionally timestamped.
  - DSSE signed manifest with platform key. Key rotation logged and validated.
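One way the per-bundle Merkle root could be computed is sketched below. The leaf layout (path, a NUL byte, then the content hash), the `0x00`/`0x01` domain-separation prefixes, and duplicating the last node on odd levels are assumptions chosen for the example; the epic does not define the tree construction.

```python
import hashlib
from typing import Dict, List


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: List[bytes]) -> str:
    """Fold leaf hashes pairwise until one root remains; returns hex."""
    if not leaves:
        raise ValueError("a bundle needs at least one artifact")
    level = [_sha256(b"\x00" + leaf) for leaf in leaves]   # leaf prefix
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                        # duplicate last node on odd levels
        level = [_sha256(b"\x01" + level[i] + level[i + 1])  # interior-node prefix
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def bundle_root(files: Dict[str, bytes]) -> str:
    """Leaves are ordered by path so the root does not depend on insertion order."""
    leaves = [path.encode() + b"\x00" + _sha256(content)
              for path, content in sorted(files.items())]
    return merkle_root(leaves)
```

Changing any file's bytes or path changes the root, so the single hash stored in the database commits to the whole bundle.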
3.7 Provenance and attestations
- Jobs emit SLSA‑style provenance attestations:
  - subjects: produced artifacts (e.g., exposure report)
  - predicate: sources, builder image, inputs, policies, advisory versions
  - builder id: Task Runner identity
- Verify on read. UI shows green check with signer and time.
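As a sketch of what such an attestation might look like on the wire, the example below builds an in-toto style statement and wraps it in a DSSE envelope. Two loud caveats: the `predicateType` URL and predicate fields are hypothetical, and HMAC-SHA256 stands in for the platform's real signing key purely so the example runs on the standard library. A real deployment would use an asymmetric signature.

```python
import base64
import hashlib
import hmac
import json
from typing import Dict, List

PAYLOAD_TYPE = "application/vnd.in-toto+json"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    t = payload_type.encode()
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)


def build_statement(name: str, sha256_hex: str, builder_id: str, inputs: List[str]) -> Dict:
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": name, "digest": {"sha256": sha256_hex}}],
        # Hypothetical predicate type and fields, for illustration only.
        "predicateType": "https://example.invalid/stellaops/provenance/v0",
        "predicate": {"builder": {"id": builder_id}, "inputs": inputs},
    }


def sign(statement: Dict, key: bytes, key_id: str) -> Dict:
    payload = json.dumps(statement, sort_keys=True).encode()
    sig = hmac.new(key, pae(PAYLOAD_TYPE, payload), hashlib.sha256).digest()  # stand-in signer
    return {
        "payload": base64.b64encode(payload).decode(),
        "payloadType": PAYLOAD_TYPE,
        "signatures": [{"keyid": key_id, "sig": base64.b64encode(sig).decode()}],
    }


def verify(envelope: Dict, key: bytes) -> bool:
    payload = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], payload), hashlib.sha256).digest()
    return any(hmac.compare_digest(expected, base64.b64decode(s["sig"]))
               for s in envelope["signatures"])
```

"Verify on read" then amounts to calling `verify` before the statement is shown, and surfacing the `keyid` as the signer.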
3.8 Chain of custody
- Chain ties together:
  - source fetch → SBOM compute → policy eval → advisory link → export
- Each step has immutable IDs and cryptographic digests inside the timeline and evidence.
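The chain can be pictured as a hash chain in which each step's record commits to the previous step's digest. The record layout and function names below are assumptions for illustration; the epic only requires immutable IDs and digests per step.

```python
import hashlib
import json
from typing import Dict, List, Optional, Tuple


def _digest(body: Dict) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def build_chain(steps: List[Tuple[str, str]]) -> List[Dict]:
    """steps is a list of (step_name, artifact_digest) in execution order."""
    chain: List[Dict] = []
    prev: Optional[str] = None
    for step, artifact_digest in steps:
        body = {"prev": prev, "step": step, "artifact": artifact_digest}
        record = {**body, "digest": _digest(body)}
        chain.append(record)
        prev = record["digest"]
    return chain


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every digest and check each record points at its predecessor."""
    prev: Optional[str] = None
    for record in chain:
        body = {k: record[k] for k in ("prev", "step", "artifact")}
        if record["prev"] != prev or record["digest"] != _digest(body):
            return False
        prev = record["digest"]
    return True
```

Altering any earlier step invalidates every later record, which is what makes the custody trail tamper-evident rather than merely ordered.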
3.9 Incident mode
- Feature flag that increases trace sampling, captures additional breadcrumbs, and extends evidence retention for the next N hours.
- Per‑tenant activation to avoid surprise bills.
- Automatically enabled on SLO burn rate breaches above a threshold.
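A small sketch of a per-tenant, time-boxed incident switch. The class name, the 4-hour default, the 14.4 burn-rate threshold, and the sample rates are placeholders, not values from this epic.

```python
import time
from typing import Callable, Dict


class IncidentMode:
    """Per-tenant switch that expires on its own after N hours."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._active_until: Dict[str, float] = {}

    def activate(self, tenant_id: str, hours: float) -> None:
        self._active_until[tenant_id] = self._clock() + hours * 3600

    def is_active(self, tenant_id: str) -> bool:
        return self._clock() < self._active_until.get(tenant_id, 0.0)

    def on_burn_rate(self, tenant_id: str, burn_rate: float,
                     threshold: float = 14.4, hours: float = 4.0) -> None:
        """Automatic activation when an SLO burn rate crosses the threshold."""
        if burn_rate >= threshold:
            self.activate(tenant_id, hours)

    def trace_sample_rate(self, tenant_id: str, base: float = 0.10,
                          boosted: float = 1.0) -> float:
        return boosted if self.is_active(tenant_id) else base
```

Storing an expiry time instead of a boolean means the mode cannot be left on by accident, which addresses the surprise-bill concern directly.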
3.10 Multi‑tenant guarantees
- Telemetry, timeline, evidence, and attestations are tenant‑scoped by design.
- RLS applies to timeline DB.
- Evidence locker uses tenant prefix and optional per‑tenant KMS key.
- Exporting evidence requires `stella:evidence:export#tenant/<id>` scope.
3.11 Console features
- Observability Hub: health at a glance, SLO widgets, Top failing routes, Top noisy tenants (for operators), trace search, log search with guardrails.
- Forensics Explorer: timeline with filters, evidence bundle viewer, attestation verifier, “Create Snapshot” wizard, comparison between two evaluations.
- “Why is this red?” click-through from an error to the exact span, log lines, and policy decision that caused it.
3.12 CLI features
- `stella obs top` live stats for APIs and jobs.
- `stella obs trace <trace_id>` dump correlated events.
- `stella forensic snapshot create --case <id> --scope <eval|job|export> --id <resource>`
- `stella forensic verify <bundle.tgz>` validate checksums and signatures.
- `stella forensic attest show <artifact>` print provenance.
3.13 Integrations
- OpenTelemetry Collector ships by default.
- Prometheus scrape config and dashboards included.
- Webhooks for SLO breaches and incident mode start/stop.
- Optional RFC 3161 timestamping for Merkle roots where a time authority is configured.
4) Architecture
4.1 New modules
- `telemetry/core` logging, metrics, tracing libraries with scrubbing and context.
- `telemetry/collector-config` ship default collector config.
- `timeline/indexer` consumes events and builds tenant timeline indices.
- `evidence/locker` API to create, read, and sign bundles.
- `provenance/attest` generates DSSE statements and verification helpers.
- `console/obs` dashboards and trace viewer.
- `console/forensics` timeline and evidence UIs.
- `cli/obs`, `cli/forensics`.
4.2 Data model
New tables with RLS:
- `timeline_events(event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, digest, details_jsonb)`
- `evidence_bundles(id, tenant_id, case_id, kind, root_hash, path, signer_key_id, created_at, labels)`
- `attestations(id, tenant_id, subject_id, subject_digest, statement_path, signer_key_id, created_at)`
Indices: (tenant_id, ts), (tenant_id, resource_type, resource_id), (tenant_id, case_id).
4.3 Storage layout
- Object store:
  - `tenants/<t>/evidence/<case>/<bundle_id>/manifest.json`
  - `tenants/<t>/evidence/<case>/<bundle_id>/root.merkle`
  - `tenants/<t>/attestations/<subject_id>/<statement.json>`
- Optional S3 Object Lock for WORM with per‑tenant retention policy.
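Because the tenant prefix is the isolation boundary in this layout, key construction is worth guarding. The sketch below builds the two bundle keys listed above and rejects segments that could escape the prefix; the allowed character set and the function names are assumptions.

```python
import re

# Assumed segment alphabet; the epic does not define allowed characters.
_SEGMENT = re.compile(r"^[A-Za-z0-9._-]+$")


def _segment(value: str) -> str:
    """Reject empty segments, slashes, and dot-only segments such as '..'."""
    if not _SEGMENT.match(value) or set(value) == {"."}:
        raise ValueError(f"unsafe path segment: {value!r}")
    return value


def bundle_prefix(tenant: str, case: str, bundle_id: str) -> str:
    return f"tenants/{_segment(tenant)}/evidence/{_segment(case)}/{_segment(bundle_id)}"


def manifest_key(tenant: str, case: str, bundle_id: str) -> str:
    return bundle_prefix(tenant, case, bundle_id) + "/manifest.json"


def merkle_key(tenant: str, case: str, bundle_id: str) -> str:
    return bundle_prefix(tenant, case, bundle_id) + "/root.merkle"
```

Validating each segment before interpolation keeps a crafted case or bundle ID from producing a key under another tenant's prefix.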
4.4 Message topics
- `stella.<tenant>.timeline.*` emitted by all services.
- `stella.<tenant>.slo.breach` from SLO evaluator.
- `stella.<tenant>.incident.mode` start and stop events.
- `stella.global.kb.*` remains for public advisories and has no tenant data.
5) APIs and contracts
5.1 Observability read APIs
- `GET /obs/health` summary per service.
- `GET /obs/slo` list current SLOs and burn rates.
- `GET /obs/trace/:id` metadata plus deep link to trace backend.
- `GET /obs/logs` query interface with guardrails for time window and tenant.
- `GET /obs/metrics` small set of computed aggregates for Console.
5.2 Timeline APIs
- `GET /timeline?from=&to=&kind=&resource=&project=` paginated events.
- `GET /timeline/:event_id` detailed view, links to trace and evidence.
5.3 Evidence APIs
- `POST /evidence/snapshot` with `{kind, resource_id, case_id}` creates a bundle.
- `GET /evidence/:bundle_id` returns manifest and signed hashes.
- `POST /evidence/verify` upload or reference a bundle path to verify.
- `POST /evidence/hold/:case_id` place or release legal hold.
5.4 Attestation APIs
- `GET /attestations?subject_id=` list and filter.
- `POST /attestations/verify` verify a statement against subject digest.
5.5 Structured log contract
Logs are newline‑delimited JSON with the common fields in 3.1. error_message must be non‑PII and concise. Multi‑line errors are folded into details.stack with length limits.
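A minimal sketch of a record that follows this contract. The length limits, the `log_line` name, and the subset of common fields shown are assumptions; a real logger would also run the scrubber and fill in the remaining fields from 3.1.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical limits; the contract only says "concise" and "length limits".
MAX_MESSAGE_CHARS = 200
MAX_STACK_CHARS = 4000


def log_line(level: str, service: str, tenant_id: str, trace_id: str,
             error_message: str = "", stack: Optional[str] = None,
             **fields: object) -> str:
    """Return one JSON object on one line, with multi-line errors folded away."""
    lines = error_message.splitlines()
    record = {
        **fields,
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
        "error_message": (lines[0] if lines else "")[:MAX_MESSAGE_CHARS],
    }
    if stack:
        record["details"] = {"stack": stack[:MAX_STACK_CHARS]}
    # json.dumps escapes embedded newlines, so the output stays newline-delimited.
    return json.dumps(record, separators=(",", ":"))
```

Keeping only the first line in `error_message` and moving the rest into `details.stack` keeps the top-level field short and greppable while preserving the full trace up to the cap.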
All these endpoints require Authority scopes like:
- `stella:obs:read#tenant/<id>`
- `stella:timeline:read#tenant/<id>`
- `stella:evidence:create#tenant/<id>`
- `stella:evidence:read#tenant/<id>`
- `stella:attest:read#tenant/<id>`
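The scope strings above follow the shape `stella:<area>:<action>#tenant/<id>`. A sketch of an exact-match check is shown below; the allowed characters for each part and the helper names are assumptions, and wildcard scopes are deliberately not handled.

```python
import re
from typing import Iterable, Tuple

# Assumed grammar for scope strings; the character classes are illustrative.
_SCOPE = re.compile(
    r"^stella:(?P<area>[a-z]+):(?P<action>[a-z]+)#tenant/(?P<tenant>[A-Za-z0-9_-]+)$"
)


def parse_scope(scope: str) -> Tuple[str, str, str]:
    match = _SCOPE.match(scope)
    if not match:
        raise ValueError(f"malformed scope: {scope!r}")
    return match.group("area"), match.group("action"), match.group("tenant")


def is_allowed(granted: Iterable[str], area: str, action: str, tenant_id: str) -> bool:
    """True only if some granted scope matches area, action, and tenant exactly."""
    wanted = (area, action, tenant_id)
    for scope in granted:
        try:
            if parse_scope(scope) == wanted:
                return True
        except ValueError:
            continue  # ignore scopes this sketch does not understand
    return False
```

Matching on the tenant as part of the scope, rather than checking it separately, means a token for one tenant can never satisfy a request against another.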
6) Documentation changes
Create or update:
- `/docs/observability/overview.md` what, why, scope, data flow.
- `/docs/observability/telemetry-standards.md` fields, naming, sampling, scrubbing rules, examples.
- `/docs/observability/metrics-and-slos.md` catalog of metrics, SLO definitions, alert policies, burn rates.
- `/docs/observability/tracing.md` context propagation, async linking, CLI and Console behavior.
- `/docs/observability/logging.md` structured logging guide, dos and don’ts, PII policy.
- `/docs/forensics/evidence-locker.md` bundle formats, WORM, retention, legal hold.
- `/docs/forensics/provenance-attestation.md` statement schema, signing, verification.
- `/docs/forensics/timeline.md` schema, event kinds, queries, examples.
- `/docs/console/observability.md` dashboards, trace viewer, log search.
- `/docs/console/forensics.md` timeline, snapshot UI, verification flow.
- `/docs/cli/observability.md` commands and examples.
- `/docs/cli/forensics.md` snapshot and verify commands.
- `/docs/security/redaction-and-privacy.md` telemetry privacy and tenant isolation.
- `/docs/install/telemetry-stack.md` default collector, exporter options, dashboards.
- `/docs/runbooks/incidents.md` incident mode, SLO breaches, escalation.
At the top of each page include:
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
7) Implementation plan
Phase 1 - Baseline telemetry
- Integrate OpenTelemetry SDK in all services.
- Replace ad‑hoc logs with structured logger.
- Emit minimal metrics and traces on hot paths.
- Ship default collector and dashboards.
Phase 2 - SLOs and dashboards
- Define SLOs per journey.
- Implement burn rate alerts.
- Wire Console Observability Hub widgets.
Phase 3 - Timeline and decision logs
- Emit timeline events everywhere.
- Build timeline indexer and APIs.
- Console Timeline viewer.
Phase 4 - Evidence locker
- Implement snapshot builders for evaluation, job, export.
- Add Merkle manifests and DSSE signing.
- Add legal hold and retention checks.
Phase 5 - Provenance and verification
- Generate attestations for jobs and exports.
- Verify on read.
- CLI verify commands.
Phase 6 - Incident mode
- Feature flag, higher sampling, and retention bump.
- Webhook and Notifications integration.
8) Engineering tasks
Telemetry core
- Build `telemetry/core` with log scrubbing, context propagation, and OpenTelemetry exporters.
- Introduce a single `Logger` facade, deprecate old logging.
- Add request and job middleware to attach tenant/actor/trace context.
Metrics and SLOs
- Instrument golden paths with histograms and exemplars.
- Implement SLO evaluator and burn rate alerts.
- Provide Prometheus rules and Grafana dashboards.
Tracing
- Propagate `traceparent` in HTTP, gRPC, and job payloads.
- Link CLI requests via headers.
- Add span attributes for policy decisions and authZ outcomes.
Timeline
- Define `timeline_event` schema and emitters in all services.
- Build indexer, APIs, and RLS policies.
- Console visualization with filters and deep links to traces.
Evidence locker
- Implement bundle builders, Merkle construction, and DSSE signing.
- Object store layout with optional WORM mode.
- APIs for create, get, verify, and legal hold.
Provenance
- Define statement schema and signers.
- Add verification library and server hooks.
- CLI support for show and verify.
Privacy and redaction
- Implement scrubbers for secrets and PII in logger.
- Add config for per‑tenant deep debug opt‑in with time‑boxed TTL.
- Redaction tests.
Console
- Observability Hub: health, SLO widgets, trace search.
- Forensics Explorer: timeline, evidence viewer, verification UX.
- “Why is this red” drill‑down from errors to spans/logs/decisions.
CLI
- `stella obs` commands and pretty printers.
- `stella forensic snapshot`, `verify`, `attest show`.
- Respect `--tenant` and print `trace_id` for copy‑paste.
Docs
- Author all pages in section 6, with examples and diagrams.
- Insert the imposed rule banner at the top of each page.
- Add runbooks for SLO breaches and incident mode.
Testing
- Unit tests: scrubbing, logger fields, timeline emitter.
- E2E: create SBOM, run policy, link advisories, export, then snapshot and verify.
- Load tests: ensure tracing overhead < 5 percent CPU.
- Failure injection: drop collector, ensure backpressure and fallbacks.
9) Feature changes required in other components
- Authority & Tenancy: add scopes `obs:read`, `timeline:read`, `evidence:*`, `attest:read`. Enforce tenant constraints everywhere.
- Orchestrator & Runner: stamp job spans and emit `job.started|finished|failed` timeline events with root cause.
- Findings Ledger: record evaluation IDs and link to evidence bundles.
- Policy Engine: include policy version and rule IDs in spans and timeline events; log decision summaries consistently.
- Export Center: produce attestations for every generated artifact and optionally embed a copy of the evaluation bundle.
- Notifications: send SLO breach and incident mode start/stop messages with links to traces and timeline searches.
- Conseiller & Excitator: emit ingest and linking events for the timeline and capture source digests in evidence bundles.
Imposed rule reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
10) Acceptance criteria
- All services emit structured logs with the common fields.
- At least 90 percent of top routes and job paths have tracing with exemplars.
- SLOs defined and visible in Console; burn rate alerts work.
- Tenant Timeline shows a full chain for a real evaluation.
- Evidence snapshot for an evaluation verifies successfully.
- Attestation exists and verifies for at least one export and one job.
- Incident mode increases trace volume and extends evidence retention.
- RLS prevents cross‑tenant reads of timeline and evidence.
- 403 payloads continue to explain policy or scope denials without leaking PII.
11) Risks and mitigations
- Telemetry cost blow‑up. Use sampling, cardinality limits, and per‑tenant caps.
- PII leakage. Redaction enforced in the logger library, security review on field additions.
- Broken traces in async work. Serialize context into job payloads and test it.
- False sense of immutability. Use real WORM where available and sign evidence; document guarantees honestly.
- Operational complexity. Ship default collector and dashboards so teams start from working, not from blank.
12) Philosophy
- If it isn’t measured, it isn’t managed.
- If it isn’t preserved, it didn’t happen.
- If it isn’t explainable, it isn’t trustworthy.
StellaOps makes risk decisions. This epic is how those decisions become observable, traceable, and defensible.
Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.