master 651b8e0fa3 feat: Add new projects to solution and implement contract testing documentation

- Added "StellaOps.Policy.Engine", "StellaOps.Cartographer", and "StellaOps.SbomService" projects to the StellaOps solution.
- Created AGENTS.md to outline the Contract Testing Guild Charter, detailing mission, scope, and definition of done.
- Established TASKS.md for the Contract Testing Task Board, outlining tasks for Sprint 62 and Sprint 63 related to mock servers and replay testing.

2025-10-27 07:57:55 +02:00

36 KiB

Monitoring and forensics: the vitamins everyone remembers to take after getting sick. Let’s make it first‑class instead of a panic purchase after the postmortem. Here’s the full, doc‑ready epic.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

Epic 15: Observability & Forensics

Short name: Observability & Forensics
Primary components: Web Services API, Orchestrator, Task Runner, Findings Ledger, Conseiller (Feedser), Excitator (VEXer), Policy Engine, Export Center, Console, CLI
Surfaces: logs, metrics, traces, decision audits, event timeline, evidence locker, provenance attestations, dashboards, alerts
Dependencies: Authority‑Backed Scopes & Tenancy, AOC Enforcement, Policy Studio, Export Center

AOC ground rule reminder: Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Observability and forensic capture must reflect sources as they were seen, not “cleaned up.”

1) What it is

A platform‑wide, tenant‑aware telemetry and evidence system that answers three questions fast:

What is happening right now, and is it healthy?
What exactly happened earlier, across services and jobs?
Can we prove it, independently and later?

It combines:

Unified telemetry: logs, metrics, and traces for all services and jobs, using OpenTelemetry conventions.
SLOs and dashboards for golden signals and user‑journey health.
Forensic readiness: evidence snapshots, immutable audit trails, provenance attestations, chain‑of‑custody, and a tenant timeline that reconstructs evaluations and decisions.
UI and CLI to explore incidents, verify attestations, and export legally defensible bundles.

Everything is tenant‑scoped by default and tamper‑evident where it matters.

2) Why

Because “we think it worked” is not proof. You need fast feedback during normal ops and a trustworthy record when something breaks or lawyers get involved. If StellaOps decides risk, then StellaOps must show its work and preserve it.

3) How it should work

3.1 Instrumentation blueprint

Tracing: every request and job carries a trace_id with W3C trace context. Background workers continue the same trace or link to it as a child. The CLI adds the traceparent header when talking to the API.
Metrics: standardized counters, histograms, and gauges for latency, throughput, errors, and resource use. Exemplars embed trace_id for jump‑to‑trace from graphs.
Logs: structured JSON with fixed fields. No printf archaeology.
Sampling: head sampling for traces, with tail sampling on high latency and errors. Error logs are never sampled.
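
The sampling rule above can be sketched as a two-stage decision. This is a minimal illustration, not the platform's sampler: the function names, the 10 percent head ratio, and the 2-second latency threshold are all assumptions chosen for the example.

```python
import hashlib

HEAD_RATIO = 0.10          # assumed fraction of traces kept up front
SLOW_THRESHOLD_MS = 2000   # assumed "high latency" cut-off


def head_sample(trace_id: str, ratio: float = HEAD_RATIO) -> bool:
    """Deterministic head decision: hash the trace_id into [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < ratio


def tail_sample(duration_ms: float, had_error: bool) -> bool:
    """Tail decision, made once the whole trace has been observed."""
    return had_error or duration_ms >= SLOW_THRESHOLD_MS


def keep_trace(trace_id: str, duration_ms: float, had_error: bool) -> bool:
    """Keep a trace if either stage selects it."""
    return head_sample(trace_id) or tail_sample(duration_ms, had_error)
```

Hashing the trace_id rather than drawing a random number means every service reaches the same head decision for the same trace without coordination.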

Common fields for logs and spans:

ts, level, service, version, env, region, instance_id,
tenant_id, project_id, actor, route, method, resource, action,
trace_id, span_id, parent_span_id, request_id,
job_id, run_id, source_id, sbom_id, policy_version,
decision_effect, decision_reason, error_code, error_message

Sensitive data policy:

Scrub secrets, tokens, emails, and file paths beyond repo root.
Redact by default and allow per‑tenant opt‑in for deep debug.
PII filters enforced in the logging library, not at call sites.
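
A minimal sketch of what "enforced in the logging library" could look like, assuming a hypothetical `log_line` helper and a deliberately small set of redaction patterns. The regular expressions and the sensitive key list are illustrative assumptions; a real scrubber would need a reviewed and much broader set.

```python
import json
import re
from datetime import datetime, timezone

# Illustrative patterns only (assumptions, not the platform's rule set).
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_BEARER = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")
_SENSITIVE_KEYS = {"password", "token", "secret", "authorization"}


def scrub(value):
    """Recursively redact sensitive keys and patterns from a log payload."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in _SENSITIVE_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        value = _BEARER.sub("[REDACTED]", value)
        return _EMAIL.sub("[REDACTED]", value)
    return value


def log_line(level: str, service: str, tenant_id: str, trace_id: str, **fields) -> str:
    """Render one structured record as a single JSON line.

    Call sites pass raw fields; scrubbing happens here, so no caller can skip it.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
    }
    record.update(scrub(fields))
    return json.dumps(record, separators=(",", ":"))
```

Because redaction lives inside `log_line`, adding a new call site cannot introduce an unscrubbed path; only a change to the library can.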

3.2 Metrics taxonomy and SLOs

Golden signals per service: request latency (P50/P95/P99), error rate, throughput, saturation (CPU, queue depth).
User journeys: SBOM ingest, policy evaluate, advisory link, VEX reconcile, export bundle, notification send.
SLO examples:
- SBOM ingest start‑to‑first‑component P95 < 5s.
- Policy evaluation P95 < 2s for 5k components.
- Advisory link to exposure delta P95 < 30s.
- Export bundle availability P95 < 90s from request.

SLOs emit events on burn rate breaches, not just threshold spikes.
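
To make "burn rate, not threshold spikes" concrete, here is a minimal sketch. Burn rate is the observed bad-event ratio divided by the error budget (1 minus the SLO target). The two-window check and the 14.4 threshold are assumptions borrowed from common multi-window alerting practice, not values defined by this epic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    total: int   # requests observed in the window
    bad: int     # requests that violated the SLO (error or too slow)


def burn_rate(window: Window, slo_target: float) -> float:
    """Ratio of the observed bad-event rate to the error budget (1 - target)."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (window.bad / window.total) / error_budget


def breached(long_w: Window, short_w: Window, slo_target: float,
             threshold: float = 14.4) -> bool:
    """Fire only when both the long and the short window burn too fast."""
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

Requiring the short window as well means the event stops firing soon after the problem ends, while the long window keeps a single brief spike from triggering it.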

3.3 Tracing model

Ingress span: api.<route> includes authN/Z decision attributes without PII.
Work spans: orchestrator.schedule, runner.execute, conseiller.ingest, excitor.link, policy.evaluate.
Async linking: parent trace context serialized into job payloads. Workers restore it and annotate spans with job metadata.
Cross‑boundary: CLI generates traceparent; Console mirrors it into the browser dev tools for correlation during support cases.
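
The async linking described above can be sketched with the W3C `traceparent` header format (`00-<trace-id>-<parent-id>-<flags>`). The helper names and the job payload shape are hypothetical; only the header layout comes from the W3C Trace Context specification.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, as the CLI would before calling the API."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)     # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Validate a version-00 traceparent header and split it into parts."""
    match = _TRACEPARENT.fullmatch(header)
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"trace_id": match.group(1), "span_id": match.group(2),
            "flags": match.group(3)}


def enqueue_job(payload: dict, traceparent: str) -> dict:
    """Serialize the parent context into the job payload."""
    return {**payload, "traceparent": traceparent}


def start_worker_span(job: dict) -> dict:
    """Restore the parent context in the worker and open a child span."""
    parent = parse_traceparent(job["traceparent"])
    return {
        "trace_id": parent["trace_id"],        # same trace as the request
        "parent_span_id": parent["span_id"],   # links back to the enqueuer
        "span_id": secrets.token_hex(8),
        "job_id": job.get("job_id"),           # job metadata on the span
    }
```

In a real service the OpenTelemetry SDK's propagators would do this work; the sketch only shows what has to survive the trip through the queue.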

3.4 Logs that don’t lie

Decision logs: every authZ decision and policy evaluation emits a structured record.
Consistency: one logger library used across all services.
Tenant isolation: logs carry tenant_id. Central log store partitions by tenant.
Retention: hot logs short (3‑7 days), warm 30‑90 days, cold based on tenant policy. For forensic artifacts see 3.6.

3.5 Tenant Timeline

Stream rule: every state transition emits timeline_event with a canonical schema:

event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, details{}

Aggregator stores an index per tenant for 180 days by default.
UI visualizes events with filters: “show all policy changes,” “show all job failures that touched component X,” “expand decisions for this trace.”
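
A minimal sketch of the canonical timeline_event schema and a tenant-scoped filter. The field names follow the schema above; the event kind strings ("policy.changed", "job.failed") and the `filter_events` helper are assumptions for illustration.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    tenant_id: str
    project_id: str
    kind: str               # e.g. "policy.changed" (assumed naming)
    actor: str
    resource_type: str
    resource_id: str
    trace_id: str
    details: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        """Serialize with sorted keys so the encoding is stable."""
        return json.dumps(asdict(self), sort_keys=True)


def filter_events(events, tenant_id, kind_prefix=None, resource_id=None):
    """Tenant-scoped filter; the tenant check is never optional."""
    return [
        e for e in events
        if e.tenant_id == tenant_id
        and (kind_prefix is None or e.kind.startswith(kind_prefix))
        and (resource_id is None or e.resource_id == resource_id)
    ]
```

A query such as "show all policy changes" then becomes `filter_events(events, tenant_id, kind_prefix="policy.")`.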

3.6 Forensic Evidence Locker

Append‑only, WORM‑capable object storage under the path tenants/<tenant>/evidence/<case>/<artifact>.
Snapshot types:
- Evaluation bundle: SBOM slice, linked advisories, VEX inputs, policy set, engine version, config, and decision trace.
- Job capsule: inputs digests, outputs, runner image digest, env digest (vars hashed), command transcript.
- Export manifest: files, checksums, recipients.
Integrity:
- Merkle index per bundle. Root hash stored in DB and optionally timestamped.
- DSSE signed manifest with platform key. Key rotation logged and validated.
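
A minimal sketch of a per-bundle Merkle index, assuming SHA-256 and a simple "promote the odd node" tree shape. The exact tree construction is not specified in this epic; the domain-separation prefixes and the function name are assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Compute a hex Merkle root over artifact contents.

    Leaves and interior nodes use distinct prefixes (0x00 and 0x01) so a leaf
    can never be mistaken for an interior node. When a level has an odd number
    of nodes, the last one is promoted unchanged to the next level.
    """
    if not leaves:
        raise ValueError("a bundle needs at least one artifact")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(b"\x01" + level[i] + level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

Any change to any artifact, or to the order of artifacts, changes the root, which is why storing only the root in the database is enough to detect tampering with the bundle.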

3.7 Provenance and attestations

Jobs emit SLSA‑style provenance attestations:
- subjects: produced artifacts (e.g., exposure report)
- predicate: sources, builder image, inputs, policies, advisory versions
- builder id: Task Runner identity
Verify on read. UI shows green check with signer and time.
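
A minimal sketch of an attestation wrapped in a DSSE envelope and verified on read. The pre-authentication encoding (`pae`) and envelope fields follow the DSSE specification, and the statement layout follows the in-toto statement format. Two loud assumptions: HMAC-SHA256 stands in for the platform's real asymmetric signature so the example runs with the standard library alone, and the predicate body is a placeholder that mirrors the fields listed above rather than the full SLSA provenance schema.

```python
import base64
import hashlib
import hmac
import json

PAYLOAD_TYPE = "application/vnd.in-toto+json"


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the bytes that actually get signed."""
    t = payload_type.encode("utf-8")
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)


def build_statement(name: str, artifact: bytes, predicate: dict) -> dict:
    """in-toto statement whose subject is the produced artifact."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": name,
                     "digest": {"sha256": hashlib.sha256(artifact).hexdigest()}}],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": predicate,   # placeholder body, not the full SLSA schema
    }


def sign(statement: dict, key: bytes, keyid: str) -> dict:
    """Wrap the statement in a DSSE envelope (HMAC is a stand-in signer)."""
    payload = json.dumps(statement, sort_keys=True).encode("utf-8")
    sig = hmac.new(key, pae(PAYLOAD_TYPE, payload), hashlib.sha256).digest()
    return {
        "payloadType": PAYLOAD_TYPE,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [{"keyid": keyid,
                        "sig": base64.b64encode(sig).decode("ascii")}],
    }


def verify(envelope: dict, key: bytes, artifact: bytes) -> bool:
    """Check the signature first, then the subject digest against the artifact."""
    payload = base64.b64decode(envelope["payload"])
    expected = hmac.new(key, pae(envelope["payloadType"], payload),
                        hashlib.sha256).digest()
    sig = base64.b64decode(envelope["signatures"][0]["sig"])
    if not hmac.compare_digest(sig, expected):
        return False
    subject = json.loads(payload)["subject"][0]
    return subject["digest"]["sha256"] == hashlib.sha256(artifact).hexdigest()
```

Verification covers two separate failure modes: a forged or re-signed statement fails the signature check, and a genuine statement presented alongside a modified artifact fails the digest check.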

3.8 Chain of custody

Chain ties together:
- source fetch → SBOM compute → policy eval → advisory link → export
Each step has immutable IDs and cryptographic digests inside the timeline and evidence.
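
One way to picture the chain is as a hash chain in which every custody record commits to the one before it. This is an illustrative sketch only: the step names, the record layout, and the all-zero starting value are assumptions, not the platform's format.

```python
import hashlib
import json

# Assumed step identifiers, mirroring the chain described above.
STEPS = ["source.fetch", "sbom.compute", "policy.eval", "advisory.link", "export"]
GENESIS = "0" * 64   # assumed starting value for the first record


def link(prev_hash: str, step: str, artifact_digest: str) -> dict:
    """Create one custody record bound to its predecessor."""
    body = {"prev": prev_hash, "step": step, "artifact": artifact_digest}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return {**body, "hash": hashlib.sha256(encoded).hexdigest()}


def build_chain(artifact_digests: list[str]) -> list[dict]:
    """Chain one record per step, in order."""
    chain, prev = [], GENESIS
    for step, digest in zip(STEPS, artifact_digests):
        record = link(prev, step, digest)
        chain.append(record)
        prev = record["hash"]
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash and check each record points at the previous one."""
    prev = GENESIS
    for record in chain:
        if record["prev"] != prev:
            return False
        recomputed = link(record["prev"], record["step"], record["artifact"])
        if recomputed["hash"] != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Altering any step's digest invalidates that record's hash, so tampering is detected unless the record and every later one are recomputed.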

3.9 Incident mode

Feature flag that increases trace sampling, captures additional breadcrumbs, and extends evidence retention for the next N hours.
Per‑tenant activation to avoid surprise bills.
Automatically enabled on SLO burn rate breaches above a threshold.
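
A minimal sketch of per-tenant incident mode with automatic expiry. The class name, the sampling ratios, the 4-hour default, and the burn-rate threshold are all assumptions for illustration; retention extension is omitted to keep the example short.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentMode:
    normal_ratio: float = 0.05     # assumed baseline trace sampling ratio
    incident_ratio: float = 1.0    # assumed ratio while incident mode is on
    burn_threshold: float = 14.4   # assumed auto-activation threshold
    default_hours: float = 4.0     # assumed value of "N hours"
    _expires: dict = field(default_factory=dict)   # tenant_id -> expiry, epoch seconds

    def activate(self, tenant_id: str, hours: Optional[float] = None,
                 now: Optional[float] = None) -> None:
        """Turn incident mode on for one tenant only."""
        now = time.time() if now is None else now
        hours = self.default_hours if hours is None else hours
        self._expires[tenant_id] = now + hours * 3600

    def on_burn_rate(self, tenant_id: str, rate: float,
                     now: Optional[float] = None) -> None:
        """Auto-enable when a tenant's SLO burn rate crosses the threshold."""
        if rate >= self.burn_threshold:
            self.activate(tenant_id, now=now)

    def is_active(self, tenant_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return self._expires.get(tenant_id, 0.0) > now

    def sampling_ratio(self, tenant_id: str, now: Optional[float] = None) -> float:
        """Sampling ratio the tracer should use for this tenant right now."""
        active = self.is_active(tenant_id, now)
        return self.incident_ratio if active else self.normal_ratio
```

Storing an expiry time rather than a boolean means the mode switches itself off, which is what keeps a forgotten incident from turning into a surprise bill.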

3.10 Multi‑tenant guarantees

Telemetry, timeline, evidence, and attestations are tenant‑scoped by design.
RLS applies to timeline DB.
Evidence locker uses tenant prefix and optional per‑tenant KMS key.
Exporting evidence requires stella:evidence:export#tenant/<id> scope.
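
A minimal sketch of checking a tenant-qualified scope such as the one above. The parsing rules and the `*` wildcard handling are assumptions inferred from the scope strings in this document, not Authority's actual grammar.

```python
def parse_scope(scope: str) -> tuple[str, str, str]:
    """Split 'stella:<area>:<action>#tenant/<id>' into (area, action, tenant)."""
    permission, sep, target = scope.partition("#")
    parts = permission.split(":")
    if (not sep or len(parts) != 3 or parts[0] != "stella"
            or not target.startswith("tenant/")):
        raise ValueError(f"malformed scope: {scope!r}")
    tenant = target[len("tenant/"):]
    if not tenant or not parts[1] or not parts[2]:
        raise ValueError(f"malformed scope: {scope!r}")
    return parts[1], parts[2], tenant


def is_allowed(granted: list[str], area: str, action: str, tenant_id: str) -> bool:
    """True if any granted scope covers this area and action for this tenant."""
    for scope in granted:
        try:
            g_area, g_action, g_tenant = parse_scope(scope)
        except ValueError:
            continue   # skip malformed grants instead of failing open
        if g_tenant == tenant_id and g_area == area and g_action in (action, "*"):
            return True
    return False
```

For example, `is_allowed(token_scopes, "evidence", "export", "acme")` would gate an evidence export for tenant `acme`, where `token_scopes` is a hypothetical list taken from the caller's token.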

3.11 Console features

Observability Hub: health at a glance, SLO widgets, Top failing routes, Top noisy tenants (for operators), trace search, log search with guardrails.
Forensics Explorer: timeline with filters, evidence bundle viewer, attestation verifier, “Create Snapshot” wizard, comparison between two evaluations.
“Why is this red?” click-through from an error to the exact span, log lines, and policy decision that caused it.

3.12 CLI features

stella obs top - live stats for APIs and jobs.
stella obs trace <trace_id> - dump correlated events.
stella forensic snapshot create --case <id> --scope <eval|job|export> --id <resource>
stella forensic verify <bundle.tgz> - validate checksums and signatures.
stella forensic attest show <artifact> - print provenance.

3.13 Integrations

OpenTelemetry Collector ships by default.
Prometheus scrape config and dashboards included.
Webhooks for SLO breaches and incident mode start/stop.
Optional RFC 3161 timestamping for Merkle roots where a time authority is configured.

4) Architecture

4.1 New modules

telemetry/core - logging, metrics, tracing libraries with scrubbing and context.
telemetry/collector-config - ships the default collector config.
timeline/indexer - consumes events and builds tenant timeline indices.
evidence/locker - API to create, read, and sign bundles.
provenance/attest - generates DSSE statements and verification helpers.
console/obs - dashboards and trace viewer.
console/forensics - timeline and evidence UIs.
cli/obs, cli/forensics.

4.2 Data model

New tables with RLS:

timeline_events(event_id, ts, tenant_id, project_id, kind, actor, resource_type, resource_id, trace_id, digest, details_jsonb)
evidence_bundles(id, tenant_id, case_id, kind, root_hash, path, signer_key_id, created_at, labels)
attestations(id, tenant_id, subject_id, subject_digest, statement_path, signer_key_id, created_at)

Indices: (tenant_id, ts), (tenant_id, resource_type, resource_id), (tenant_id, case_id).

4.3 Storage layout

Object store:
- tenants/<t>/evidence/<case>/<bundle_id>/manifest.json
- tenants/<t>/evidence/<case>/<bundle_id>/root.merkle
- tenants/<t>/attestations/<subject_id>/<statement.json>
Optional S3 Object Lock for WORM with per‑tenant retention policy.
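
A small sketch of building object keys for this layout while rejecting path segments that could escape the tenant prefix. The helper names and the allowed character set are assumptions; the key shapes are taken from the layout above.

```python
import re

# Assumed rule: a segment starts with an alphanumeric character and contains
# only alphanumerics, dots, underscores, and hyphens. This rejects "..", "",
# and anything containing "/".
_SEGMENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def _segment(value: str) -> str:
    if not _SEGMENT.fullmatch(value):
        raise ValueError(f"unsafe path segment: {value!r}")
    return value


def bundle_prefix(tenant: str, case: str, bundle_id: str) -> str:
    return (f"tenants/{_segment(tenant)}/evidence/"
            f"{_segment(case)}/{_segment(bundle_id)}")


def manifest_key(tenant: str, case: str, bundle_id: str) -> str:
    return bundle_prefix(tenant, case, bundle_id) + "/manifest.json"


def merkle_key(tenant: str, case: str, bundle_id: str) -> str:
    return bundle_prefix(tenant, case, bundle_id) + "/root.merkle"


def attestation_key(tenant: str, subject_id: str, statement_name: str) -> str:
    return (f"tenants/{_segment(tenant)}/attestations/"
            f"{_segment(subject_id)}/{_segment(statement_name)}")
```

Validating each segment before it is interpolated keeps a crafted case or bundle identifier from producing a key under another tenant's prefix.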

4.4 Message topics

stella.<tenant>.timeline.* - emitted by all services.
stella.<tenant>.slo.breach - emitted by the SLO evaluator.
stella.<tenant>.incident.mode - start and stop events.
stella.global.kb.* - remains for public advisories and has no tenant data.

5) APIs and contracts

5.1 Observability read APIs

GET /obs/health - summary per service.
GET /obs/slo - list current SLOs and burn rates.
GET /obs/trace/:id - metadata plus deep link to trace backend.
GET /obs/logs - query interface with guardrails for time window and tenant.
GET /obs/metrics - small set of computed aggregates for Console.

5.2 Timeline APIs

GET /timeline?from=&to=&kind=&resource=&project= - paginated events.
GET /timeline/:event_id - detailed view, links to trace and evidence.

5.3 Evidence APIs

POST /evidence/snapshot - with {kind, resource_id, case_id}, creates a bundle.
GET /evidence/:bundle_id - returns manifest and signed hashes.
POST /evidence/verify - upload or reference a bundle path to verify.
POST /evidence/hold/:case_id - place or release legal hold.

5.4 Attestation APIs

GET /attestations?subject_id= - list and filter.
POST /attestations/verify - verify a statement against subject digest.

5.5 Structured log contract

Logs are newline‑delimited JSON with the common fields in 3.1. error_message must be non‑PII and concise. Multi‑line errors are folded into details.stack with length limits.
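
A minimal sketch of the folding rule in this contract. It assumes the first line of a multi-line error is the concise summary and the full text belongs in details.stack; the 4096-character stack limit, the 200-character message limit, and the helper names are assumptions. PII scrubbing is omitted here.

```python
import json

MAX_STACK_CHARS = 4096    # assumed length limit for details.stack
MAX_MESSAGE_CHARS = 200   # assumed length limit for error_message


def fold_error(error_text: str, limit: int = MAX_STACK_CHARS) -> dict:
    """Split a multi-line error into a one-line message and a bounded stack."""
    lines = [line.rstrip() for line in error_text.strip().splitlines()]
    message = lines[0].strip() if lines else ""
    stack = "\n".join(lines)
    if len(stack) > limit:
        stack = stack[:limit] + "...[truncated]"
    return {"error_message": message[:MAX_MESSAGE_CHARS],
            "details": {"stack": stack}}


def to_ndjson(records) -> str:
    """Encode records as newline-delimited JSON, one record per line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)
```

Because `json.dumps` escapes embedded newlines, a folded stack never breaks the one-record-per-line framing that log shippers rely on.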

All endpoints in this section require Authority scopes such as:

stella:obs:read#tenant/<id>
stella:timeline:read#tenant/<id>
stella:evidence:create#tenant/<id>
stella:evidence:read#tenant/<id>
stella:attest:read#tenant/<id>

6) Documentation changes

Create or update:

/docs/observability/overview.md - what, why, scope, data flow.
/docs/observability/telemetry-standards.md - fields, naming, sampling, scrubbing rules, examples.
/docs/observability/metrics-and-slos.md - catalog of metrics, SLO definitions, alert policies, burn rates.
/docs/observability/tracing.md - context propagation, async linking, CLI and Console behavior.
/docs/observability/logging.md - structured logging guide, dos and don’ts, PII policy.
/docs/forensics/evidence-locker.md - bundle formats, WORM, retention, legal hold.
/docs/forensics/provenance-attestation.md - statement schema, signing, verification.
/docs/forensics/timeline.md - schema, event kinds, queries, examples.
/docs/console/observability.md - dashboards, trace viewer, log search.
/docs/console/forensics.md - timeline, snapshot UI, verification flow.
/docs/cli/observability.md - commands and examples.
/docs/cli/forensics.md - snapshot and verify commands.
/docs/security/redaction-and-privacy.md - telemetry privacy and tenant isolation.
/docs/install/telemetry-stack.md - default collector, exporter options, dashboards.
/docs/runbooks/incidents.md - incident mode, SLO breaches, escalation.

At the top of each page include:

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

7) Implementation plan

Phase 1 - Baseline telemetry

Integrate OpenTelemetry SDK in all services.
Replace ad‑hoc logs with structured logger.
Emit minimal metrics and traces on hot paths.
Ship default collector and dashboards.

Phase 2 - SLOs and dashboards

Define SLOs per journey.
Implement burn rate alerts.
Wire Console Observability Hub widgets.

Phase 3 - Timeline and decision logs

Emit timeline events everywhere.
Build timeline indexer and APIs.
Console Timeline viewer.

Phase 4 - Evidence locker

Implement snapshot builders for evaluation, job, export.
Add Merkle manifests and DSSE signing.
Add legal hold and retention checks.

Phase 5 - Provenance and verification

Generate attestations for jobs and exports.
Verify on read.
CLI verify commands.

Phase 6 - Incident mode

Feature flag, higher sampling, and retention bump.
Webhook and Notifications integration.

8) Engineering tasks

Telemetry core

Build telemetry/core with log scrubbing, context propagation, and OpenTelemetry exporters.
Introduce a single Logger facade, deprecate old logging.
Add request and job middleware to attach tenant/actor/trace context.

Metrics and SLOs

Instrument golden paths with histograms and exemplars.
Implement SLO evaluator and burn rate alerts.
Provide Prometheus rules and Grafana dashboards.

Tracing

Propagate traceparent in HTTP, gRPC, and job payloads.
Link CLI requests via headers.
Add span attributes for policy decisions and authZ outcomes.

Timeline

Define timeline_event schema and emitters in all services.
Build indexer, APIs, and RLS policies.
Console visualization with filters and deep links to traces.

Evidence locker

Implement bundle builders, Merkle construction, and DSSE signing.
Object store layout with optional WORM mode.
APIs for create, get, verify, and legal hold.

Provenance

Define statement schema and signers.
Add verification library and server hooks.
CLI support for show and verify.

Privacy and redaction

Implement scrubbers for secrets and PII in logger.
Add config for per‑tenant deep debug opt‑in with time‑boxed TTL.
Redaction tests.

Console

Observability Hub: health, SLO widgets, trace search.
Forensics Explorer: timeline, evidence viewer, verification UX.
“Why is this red” drill‑down from errors to spans/logs/decisions.

CLI

stella obs commands and pretty printers.
stella forensic snapshot, verify, attest show.
Respect --tenant and print trace_id for copy‑paste.

Docs

Author all pages in section 6, with examples and diagrams.
Insert the imposed rule banner at the top of each page.
Add runbooks for SLO breaches and incident mode.

Testing

Unit tests: scrubbing, logger fields, timeline emitter.
E2E: create SBOM, run policy, link advisories, export, then snapshot and verify.
Load tests: ensure tracing overhead < 5 percent CPU.
Failure injection: drop collector, ensure backpressure and fallbacks.

9) Feature changes required in other components

Authority & Tenancy: add scopes obs:read, timeline:read, evidence:*, attest:read. Enforce tenant constraints everywhere.
Orchestrator & Runner: stamp job spans and emit job.started|finished|failed timeline events with root cause.
Findings Ledger: record evaluation IDs and link to evidence bundles.
Policy Engine: include policy version and rule IDs in spans and timeline events; log decision summaries consistently.
Export Center: produce attestations for every generated artifact and optionally embed a copy of the evaluation bundle.
Notifications: send SLO breach and incident mode start/stop messages with links to traces and timeline searches.
Conseiller & Excitator: emit ingest and linking events for the timeline and capture source digests in evidence bundles.

Imposed rule reminder: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

10) Acceptance criteria

All services emit structured logs with the common fields.
At least 90 percent of top routes and job paths have tracing with exemplars.
SLOs defined and visible in Console; burn rate alerts work.
Tenant Timeline shows a full chain for a real evaluation.
Evidence snapshot for an evaluation verifies successfully.
Attestation exists and verifies for at least one export and one job.
Incident mode increases trace volume and extends evidence retention.
RLS prevents cross‑tenant reads of timeline and evidence.
403 payloads continue to explain policy or scope denials without leaking PII.

11) Risks and mitigations

Telemetry cost blow‑up. Use sampling, cardinality limits, and per‑tenant caps.
PII leakage. Redaction enforced in the logger library, security review on field additions.
Broken traces in async work. Serialize context into job payloads and test it.
False sense of immutability. Use real WORM where available and sign evidence; document guarantees honestly.
Operational complexity. Ship default collector and dashboards so teams start from a working setup, not a blank page.

12) Philosophy

If it isn’t measured, it isn’t managed.
If it isn’t preserved, it didn’t happen.
If it isn’t explainable, it isn’t trustworthy.

StellaOps makes risk decisions. This epic is how those decisions become observable, traceable, and defensible.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.
