Implement ledger metrics for observability and add tests for Ruby packages endpoints

- Added `LedgerMetrics` class to record write latency and total events for ledger operations.
- Created comprehensive tests for Ruby packages endpoints, covering scenarios for missing inventory, successful retrieval, and identifier handling.
- Introduced `TestSurfaceSecretsScope` for managing environment variables during tests.
- Developed `ProvenanceMongoExtensions` for attaching DSSE provenance and trust information to event documents.
- Implemented `EventProvenanceWriter` and `EventWriter` classes for managing event provenance in MongoDB.
- Established MongoDB indexes for efficient querying of events based on provenance and trust.
- Added models and JSON parsing logic for DSSE provenance and trust information.
This commit is contained in:
master
2025-11-13 09:29:09 +02:00
parent 151f6b35cc
commit 61f963fd52
101 changed files with 5881 additions and 1776 deletions


@@ -1,6 +1,7 @@
# Advisory AI architecture
> Captures the retrieval, guardrail, and inference packaging requirements defined in the Advisory AI implementation plan and related module guides.
> Configuration knobs (inference modes, guardrails, cache/queue budgets) now live in [`docs/policy/assistant-parameters.md`](../../policy/assistant-parameters.md) per DOCS-AIAI-31-006.
## 1) Goals


@@ -4,6 +4,7 @@ Excititor converts heterogeneous VEX feeds into raw observations and linksets th
## Latest updates (2025-11-05)
- Link-Not-Merge readiness: release note [Excitor consensus beta](../../updates/2025-11-05-excitor-consensus-beta.md) captures how Excititor feeds power the Excititor consensus beta (sample payload in [consensus JSON](../../vex/consensus-json.md)).
- Added [observability guide](operations/observability.md) describing the evidence metrics emitted by `EXCITITOR-AIAI-31-003` (request counters, statement histogram, signature status, guard violations) so Ops/Lens can alert on misuse.
- README now points policy/UI teams to the upcoming consensus integration work.
- DSSE packaging for consensus bundles and Export Center hooks are documented in the [beta release note](../../updates/2025-11-05-excitor-consensus-beta.md); operators mirroring Excititor exports must verify detached JWS artefacts (`bundle.json.jws`) alongside each bundle.
- Follow-ups called out in the release note (Policy weighting knobs `POLICY-ENGINE-30-101`, CLI verb `CLI-VEX-30-002`) remain in-flight and are tracked in `/docs/implplan/SPRINT_200_documentation_process.md`.


@@ -2,7 +2,7 @@
> Consolidates the VEX ingestion guardrails from Epic 1 with consensus and AI-facing requirements from Epics 7 and 8. This is the authoritative architecture record for Excititor.
> **Scope.** This document specifies the **Excititor** service: its purpose, trust model, data structures, observation/linkset pipelines, APIs, plug-in contracts, storage schema, performance budgets, testing matrix, and how it integrates with Concelier, Policy Engine, and evidence surfaces. It is implementation-ready. The immutable observation store schema lives in [`vex_observations.md`](./vex_observations.md).
---


@@ -0,0 +1,41 @@
# Excititor Observability Guide
> Added 2025-11-14 alongside Sprint 119 (`EXCITITOR-AIAI-31-003`). Complements the AirGap/mirror runbooks under the same folder.
Excititor's evidence APIs now emit first-class OpenTelemetry metrics so Lens, Advisory AI, and Ops can detect misuse or missing provenance without paging through logs. This document lists the counters/histograms shipped by the WebService (`src/Excititor/StellaOps.Excititor.WebService`) and how to hook them into your exporters/dashboards.
## Telemetry prerequisites
- Enable `Excititor:Telemetry` in the service configuration (`appsettings.*`), ensuring **metrics** export is on. The WebService automatically adds the evidence meter (`StellaOps.Excititor.WebService.Evidence`) alongside the ingestion meter.
- Deploy at least one OTLP or console exporter (see `TelemetryExtensions.ConfigureExcititorTelemetry`). If your region lacks OTLP transport, fall back to scraping the console exporter for smoke tests.
- Coordinate with the Ops/Signals guild to provision the span/metric sinks referenced in `docs/modules/platform/architecture-overview.md#observability`.
## Metrics reference
| Metric | Type | Description | Key dimensions |
| --- | --- | --- | --- |
| `excititor.vex.observation.requests` | Counter | Number of `/v1/vex/observations/{vulnerabilityId}/{productKey}` requests handled. | `tenant`, `outcome` (`success`, `error`, `cancelled`), `truncated` (`true/false`) |
| `excititor.vex.observation.statement_count` | Histogram | Distribution of statements returned per observation projection request. | `tenant`, `outcome` |
| `excititor.vex.signature.status` | Counter | Signature status per statement (missing vs. unverified). | `tenant`, `status` (`missing`, `unverified`) |
| `excititor.vex.aoc.guard_violations` | Counter | Aggregated count of Aggregation-Only Contract violations detected by the WebService (ingest + `/vex/aoc/verify`). | `tenant`, `surface` (`ingest`, `aoc_verify`, etc.), `code` (AOC error code) |
> All metrics originate from the `EvidenceTelemetry` helper (`src/Excititor/StellaOps.Excititor.WebService/Telemetry/EvidenceTelemetry.cs`). When disabled (telemetry off), the helper is inert.
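For orientation, the sketch below shows how a service could register the request counter and statement histogram from the table with `System.Diagnostics.Metrics`. The metric and meter names come from this guide; the class and method shapes are illustrative assumptions, not the actual `EvidenceTelemetry` implementation.
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch only: names match the metrics table above; the real helper
// lives in src/Excititor/StellaOps.Excititor.WebService/Telemetry/EvidenceTelemetry.cs.
public sealed class EvidenceMetricsSketch
{
    private readonly Counter<long> _observationRequests;
    private readonly Histogram<long> _statementCount;

    public EvidenceMetricsSketch(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("StellaOps.Excititor.WebService.Evidence");
        _observationRequests = meter.CreateCounter<long>("excititor.vex.observation.requests");
        _statementCount = meter.CreateHistogram<long>("excititor.vex.observation.statement_count");
    }

    public void RecordObservationRequest(string tenant, string outcome, bool truncated, int statements)
    {
        _observationRequests.Add(1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("outcome", outcome),
            new KeyValuePair<string, object?>("truncated", truncated ? "true" : "false"));

        _statementCount.Record(statements,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("outcome", outcome));
    }
}
```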
### Dashboard hints
- **Advisory-AI readiness**: alert when `excititor.vex.signature.status{status="missing"}` spikes for a tenant, indicating connectors aren't supplying signatures.
- **Guardrail monitoring**: graph `excititor.vex.aoc.guard_violations` per `code` to catch upstream feed regressions before they pollute Evidence Locker or Lens caches.
- **Capacity planning**: histogram percentiles of `excititor.vex.observation.statement_count` feed API sizing (higher counts mean Advisory AI is requesting broad scopes).
## Operational steps
1. **Enable telemetry**: set `Excititor:Telemetry:EnableMetrics=true`, configure OTLP endpoints/headers as described in `TelemetryExtensions`.
2. **Add dashboards**: import panels referencing the metrics above (see Grafana JSON snippets in Ops repo once merged).
3. **Alerting**: add rules for high guard violation rates and missing signatures. Tie alerts back to connectors via tenant metadata.
4. **Post-deploy checks**: after each release, verify metrics emit by curling `/v1/vex/observations/...`, watching the console exporter (dev) or OTLP (prod).
## Related documents
- `docs/modules/excititor/architecture.md`: API contract, AOC guardrails, connector responsibilities.
- `docs/modules/excititor/mirrors.md`: AirGap/mirror ingestion checklist (feeds into `EXCITITOR-AIRGAP-56/57`).
- `docs/modules/platform/architecture-overview.md#observability`: platform-wide telemetry guidance.


@@ -0,0 +1,131 @@
# VEX Observation Model (`vex_observations`)
> Authored 2025-11-14 for Sprint 120 (`EXCITITOR-LNM-21-001`). This document is the canonical schema description for Excititor's immutable observation records. It unblocks downstream documentation tasks (`DOCS-LNM-22-002`) and aligns the WebService/Worker data structures with Mongo persistence.
Excititor ingests heterogeneous VEX statements, normalizes them under the Aggregation-Only Contract (AOC), and persists each normalized statement as a **VEX observation**. These observations are the source of truth for:
- Advisory AI citation APIs (`/v1/vex/observations/{vulnerabilityId}/{productKey}`)
- Graph/Vuln Explorer overlays (batch observation APIs)
- Evidence Locker + portable bundle manifests
- Policy Engine materialization and audit trails
All observation documents are immutable. New information creates a new observation record linked by `observationId`; supersedence happens through Graph/Lens layers, not by mutating this collection.
## Storage & routing
| Aspect | Value |
| --- | --- |
| Collection | `vex_observations` (Mongo) |
| Upstream generator | `VexObservationProjectionService` (WebService) and Worker normalization pipeline |
| Primary key | `{tenant, observationId}` |
| Required indexes | `{tenant, vulnerabilityId}`, `{tenant, productKey}`, `{tenant, document.digest}`, `{tenant, providerId, status}` |
| Source of truth for | `/v1/vex/observations`, Graph batch APIs, Excititor → Evidence Locker replication |
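A hedged sketch of bootstrapping the required indexes listed in the table above with the MongoDB .NET driver; the class/method names are illustrative, and the real bootstrap location inside Excititor may differ.
```csharp
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class VexObservationIndexSketch
{
    public static Task EnsureIndexesAsync(IMongoDatabase database)
    {
        var collection = database.GetCollection<BsonDocument>("vex_observations");
        var keys = Builders<BsonDocument>.IndexKeys;

        var models = new[]
        {
            // Primary key {tenant, observationId} enforced as a unique index.
            new CreateIndexModel<BsonDocument>(
                keys.Ascending("tenant").Ascending("observationId"),
                new CreateIndexOptions { Unique = true }),
            new CreateIndexModel<BsonDocument>(keys.Ascending("tenant").Ascending("vulnerabilityId")),
            new CreateIndexModel<BsonDocument>(keys.Ascending("tenant").Ascending("productKey")),
            new CreateIndexModel<BsonDocument>(keys.Ascending("tenant").Ascending("document.digest")),
            new CreateIndexModel<BsonDocument>(keys.Ascending("tenant").Ascending("providerId").Ascending("status"))
        };

        return collection.Indexes.CreateManyAsync(models);
    }
}
```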
## Canonical document shape
```jsonc
{
"tenant": "default",
"observationId": "vex:obs:sha256:...",
"vulnerabilityId": "CVE-2024-12345",
"productKey": "pkg:maven/org.example/app@1.2.3",
"providerId": "ubuntu-csaf",
"status": "affected", // matches VexClaimStatus enum
"justification": {
"type": "component_not_present",
"reason": "Package not shipped in this profile",
"detail": "Binary not in base image"
},
"detail": "Free-form vendor detail",
"confidence": {
"score": 0.9,
"level": "high",
"method": "vendor"
},
"signals": {
"severity": {
"scheme": "cvss3.1",
"score": 7.8,
"label": "High",
"vector": "CVSS:3.1/..."
},
"kev": true,
"epss": 0.77
},
"scope": {
"key": "pkg:deb/ubuntu/apache2@2.4.58-1",
"purls": [
"pkg:deb/ubuntu/apache2@2.4.58-1",
"pkg:docker/example/app@sha256:..."
],
"cpes": ["cpe:2.3:a:apache:http_server:2.4.58:*:*:*:*:*:*:*"]
},
"anchors": [
"#/statements/0/justification",
"#/statements/0/detail"
],
"document": {
"format": "csaf",
"digest": "sha256:abc123...",
"revision": "2024-10-22T09:00:00Z",
"sourceUri": "https://ubuntu.com/security/notices/USN-0000-1",
"signature": {
"type": "cosign",
"issuer": "https://token.actions.githubusercontent.com",
"keyId": "ubuntu-vex-prod",
"verifiedAt": "2024-10-22T09:01:00Z",
"transparencyLogReference": "rekor://UUID",
"trust": {
"tenantId": "default",
"issuerId": "ubuntu",
"effectiveWeight": 0.9,
"tenantOverrideApplied": false,
"retrievedAtUtc": "2024-10-22T09:00:30Z"
}
}
},
"aoc": {
"guardVersion": "2024.10.0",
"violations": [], // non-empty -> stored + surfaced
"ingestedAt": "2024-10-22T09:00:05Z",
"retrievedAt": "2024-10-22T08:59:59Z"
},
"metadata": {
"provider-hint": "Mainline feed",
"source-channel": "mirror"
}
}
```
### Field notes
- **`tenant`**: logical tenant resolved by the WebService from headers or default configuration.
- **`observationId`**: deterministic hash (sha256) over `{tenant, vulnerabilityId, productKey, providerId, statementDigest}`; never reused (see the sketch after these notes).
- **`status` + `justification`**: follow the OpenVEX semantics enforced by `StellaOps.Excititor.Core.VexClaim`.
- **`scope`**: includes the canonical `key` plus normalized PURLs/CPEs; deterministic ordering.
- **`anchors`**: optional JSON-pointer hints pointing to the source document sections; stored as trimmed strings.
- **`document.signature`**: mirrors `VexSignatureMetadata`; empty if the upstream feed lacks signatures.
- **`aoc.violations`**: stored if the guard detected non-fatal issues; fatal issues never create an observation.
- **`metadata`**: reserved for deterministic provider hints; keys follow `vex.*` prefix guidance.
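To make the `observationId` derivation concrete, here is a minimal sketch; the field order, delimiter, and reuse of the `vex:obs:sha256:` prefix are assumptions drawn from the notes above, not the canonical Worker implementation.
```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ObservationIdSketch
{
    // Assumption: fields joined with '|' in the order listed above, then hashed with SHA-256.
    public static string Compute(
        string tenant, string vulnerabilityId, string productKey, string providerId, string statementDigest)
    {
        var material = string.Join("|", tenant, vulnerabilityId, productKey, providerId, statementDigest);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return "vex:obs:sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```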
## Determinism & AOC guarantees
1. **Write-once**: once inserted, observation documents never change. New evidence creates a new `observationId`.
2. **Sorted collections**: arrays (`anchors`, `purls`, `cpes`) are sorted lexicographically before persistence.
3. **Guard metadata**: `aoc.guardVersion` records the guard library version (`docs/aoc/guard-library.md`), enabling audits.
4. **Signatures**: only verification metadata proven by the Worker is stored; the WebService never recomputes trust.
5. **Time normalization**: all timestamps are stored as UTC ISO-8601 strings (Mongo `DateTime`).
## API mapping
| API | Source fields | Notes |
| --- | --- | --- |
| `/v1/vex/observations/{vuln}/{product}` | `tenant`, `vulnerabilityId`, `productKey`, `scope`, `statements[]` | Response uses `VexObservationProjectionService` to render `statements`, `document`, and `signature` fields. |
| `/vex/aoc/verify` | `document.digest`, `providerId`, `aoc` | Replays guard validation for recent digests; guard violations here align with `aoc.violations`. |
| Evidence batch API (Graph) | `statements[]`, `scope`, `signals`, `anchors` | Format optimized for overlays; reduces `document` to digest/URI. |
## Related work
- `EXCITITOR-GRAPH-24-*` relies on this schema to build overlays.
- `DOCS-LNM-22-002` (Link-Not-Merge documentation) references this file.
- `EXCITITOR-ATTEST-73-*` uses `document.digest` + `signature` to embed provenance in attestation payloads.


@@ -0,0 +1,61 @@
# Findings Ledger — Air-Gap Provenance Extensions (LEDGER-AIRGAP-56/57/58)
> **Scope:** How ledger events capture mirror bundle provenance, staleness metrics, evidence snapshots, and sealed-mode timeline events for air-gapped deployments.
## 1. Requirements recap
- **LEDGER-AIRGAP-56-001:** Record mirror bundle metadata (`bundle_id`, `merkle_root`, `time_anchor`, `source_region`) whenever advisories/VEX/policies are imported offline. Tie import provenance to each affected ledger event.
- **LEDGER-AIRGAP-56-002:** Surface staleness metrics and enforce risk-critical export blocks when imported data exceeds freshness SLAs; emit remediation guidance.
- **LEDGER-AIRGAP-57-001:** Link findings evidence snapshots (portable bundles) so cross-enclave verification can attest to the same ledger hash.
- **LEDGER-AIRGAP-58-001:** Emit sealed-mode timeline events describing bundle impacts (new findings, remediation deltas) for Console and Notify.
## 2. Schema additions
| Entity | Field | Type | Notes |
| --- | --- | --- | --- |
| `ledger_events.event_body` | `airgap.bundle` | object | `{ "bundleId", "merkleRoot", "timeAnchor", "sourceRegion", "importedAt", "importOperator" }` recorded on import events. |
| `ledger_events.event_body` | `airgap.evidenceSnapshot` | object | `{ "bundleUri", "dsseDigest", "expiresAt" }` for findings evidence bundles. |
| `ledger_projection` | `airgap.stalenessSeconds` | integer | Age of newest data feeding the finding projection. |
| `ledger_projection` | `airgap.bundleId` | string | Last bundle influencing the projection row. |
| `timeline_events` (new view) | `airgapImpact` | object | Materials needed for LEDGER-AIRGAP-58-001 timeline feed (finding counts, severity deltas). |
Canonical JSON must sort object keys (`bundleId`, `importOperator`, …) to keep hashes deterministic.
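As a reference point, a minimal sketch of key-sorted serialization using `System.Text.Json.Nodes`; the ledger's actual canonicalizer may follow a stricter profile (e.g. RFC 8785 JCS), but the sorting requirement is the same.
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

public static class CanonicalJsonSketch
{
    // Recursively rebuilds the tree with object keys sorted ordinally; arrays keep their order.
    public static JsonNode? Canonicalize(JsonNode? node) => node switch
    {
        JsonObject obj => new JsonObject(obj
            .OrderBy(property => property.Key, StringComparer.Ordinal)
            .Select(property => KeyValuePair.Create(property.Key, Canonicalize(property.Value)))),
        JsonArray array => new JsonArray(array.Select(Canonicalize).ToArray()),
        _ => node?.DeepClone()
    };
}
// Example: Canonicalize(JsonNode.Parse("{\"importOperator\":\"ops\",\"bundleId\":\"b1\"}"))!.ToJsonString()
// yields {"bundleId":"b1","importOperator":"ops"} regardless of the input key order.
```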
## 3. Import workflow
1. **Mirror bundle validation:** AirGap controller verifies bundle signature/manifest before ingest; saves metadata for ledger enrichment.
2. **Event enrichment:** The importer populates `airgap.bundle` fields on each event produced from the bundle. `bundleId` equals the manifest digest (SHA-256). `merkleRoot` is the bundle's manifest Merkle root; `timeAnchor` is the authoritative timestamp from the bundle.
3. **Anchoring:** Merkle batching includes bundle metadata; anchor references in `ledger_merkle_roots.anchor_reference` use format `airgap::<bundleId>` when not externally anchored.
4. **Projection staleness:** Projector updates `airgap.stalenessSeconds` comparing current time with `bundle.timeAnchor` per artifact scope; CLI + Console read the value to display freshness indicators.
## 4. Staleness enforcement
- Config option `AirGapPolicies:FreshnessThresholdSeconds` (default 604800 = 7 days) sets allowable age.
- Export workflows check `airgap.stalenessSeconds`; when over threshold the service raises `ERR_AIRGAP_STALE` and supplies remediation message referencing the last bundle (`bundleId`, `timeAnchor`, `importOperator`).
- Metrics (`ledger_airgap_staleness_seconds`) track distribution per tenant for dashboards.
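A minimal sketch of the staleness gate described above, assuming the threshold and projection fields are already resolved; apart from `ERR_AIRGAP_STALE` and the configured default, the names below are hypothetical.
```csharp
using System;

public static class AirGapFreshnessSketch
{
    // Hypothetical gate: block risk-critical exports when imported data is older than the threshold.
    public static void EnsureFresh(
        long stalenessSeconds,
        string bundleId,
        DateTimeOffset timeAnchor,
        string importOperator,
        long thresholdSeconds = 604_800) // AirGapPolicies:FreshnessThresholdSeconds default (7 days)
    {
        if (stalenessSeconds <= thresholdSeconds)
        {
            return;
        }

        throw new InvalidOperationException(
            $"ERR_AIRGAP_STALE: data from bundle '{bundleId}' (timeAnchor {timeAnchor:O}, " +
            $"importOperator '{importOperator}') is {stalenessSeconds}s old; " +
            "import a fresher mirror bundle before exporting risk-critical data.");
    }
}
```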
## 5. Evidence snapshots
- Evidence bundles (`airgap.evidenceSnapshot`) reference portable DSSE packages stored in Evidence Locker (`bundleUri` like `file://offline/evidence/<bundleId>.tar`).
- CLI command `stella ledger evidence link` attaches evidence snapshots to findings after bundle generation; ledger event records both DSSE digest and expiration.
- Timeline entries and Console detail views display “Evidence snapshot available” with download instructions suited for sealed environments.
## 6. Timeline events (LEDGER-AIRGAP-58-001)
- New derived view `timeline_airgap_impacts` emits JSON objects such as:
```json
{
"tenant": "tenant-a",
"bundleId": "bundle-sha256:…",
"newFindings": 42,
"resolvedFindings": 18,
"criticalDelta": +5,
"timeAnchor": "2025-10-30T11:00:00Z",
"sealedMode": true
}
```
- Console + Notify subscribe to `ledger.airgap.timeline` events to show sealed-mode summaries.
## 7. Offline kit considerations
- Include bundle provenance schema, staleness policy config, CLI scripts (`stella airgap bundle import`, `stella ledger evidence link`), and sample manifests.
- Provide validation script `scripts/ledger/validate-airgap-bundle.sh` verifying manifest signatures, timestamps, and ledger enrichment before ingest.
- Document sealed-mode toggles ensuring no external egress occurs when importing bundles.
---
*Draft 2025-11-13 for LEDGER-AIRGAP-56/57/58 planning.*


@@ -0,0 +1,129 @@
# Findings Ledger Deployment & Operations Guide
> **Applies to:** `StellaOps.Findings.Ledger` writer + projector services (Sprint 120).
> **Audience:** Platform/DevOps engineers bringing up Findings Ledger across dev/stage/prod and air-gapped sites.
## 1. Prerequisites
| Component | Requirement |
| --- | --- |
| Database | PostgreSQL 14+ with `citext`, `uuid-ossp`, `pgcrypto`, and `pg_partman`. Provision dedicated database/user per environment. |
| Storage | Minimum 200GB SSD per production environment (ledger + projection + Merkle tables). |
| TLS & identity | Authority reachable for service-to-service JWTs; mTLS optional but recommended. |
| Secrets | Store DB connection string, encryption keys (`LEDGER__ATTACHMENTS__ENCRYPTIONKEY`), signing credentials for Merkle anchoring in secrets manager. |
| Observability | OTLP collector endpoint (or Loki/Prometheus endpoints) configured; see `docs/modules/findings-ledger/observability.md`. |
## 2. Docker Compose deployment
1. **Create env files**
```bash
cp deploy/compose/env/ledger.env.example ledger.env
cp etc/secrets/ledger.postgres.secret.example ledger.postgres.env
# Populate LEDGER__DB__CONNECTIONSTRING, LEDGER__ATTACHMENTS__ENCRYPTIONKEY, etc.
```
2. **Add ledger service overlay** (append to the Compose file in use, e.g. `docker-compose.prod.yaml`):
```yaml
services:
findings-ledger:
image: stellaops/findings-ledger:${STELLA_VERSION:-2025.11.0}
restart: unless-stopped
env_file:
- ledger.env
- ledger.postgres.env
environment:
ASPNETCORE_URLS: http://0.0.0.0:8080
LEDGER__DB__CONNECTIONSTRING: ${LEDGER__DB__CONNECTIONSTRING}
LEDGER__OBSERVABILITY__ENABLED: "true"
LEDGER__MERKLE__ANCHORINTERVAL: "00:05:00"
ports:
- "8188:8080"
depends_on:
- postgres
volumes:
- ./etc/ledger/appsettings.json:/app/appsettings.json:ro
```
3. **Run migrations then start services**
```bash
dotnet run --project src/Findings/StellaOps.Findings.Ledger.Migrations \
-- --connection "$LEDGER__DB__CONNECTIONSTRING"
docker compose --env-file ledger.env --env-file ledger.postgres.env \
-f deploy/compose/docker-compose.prod.yaml up -d findings-ledger
```
4. **Smoke test**
```bash
curl -sf http://localhost:8188/health/ready
curl -sf http://localhost:8188/metrics | grep ledger_write_latency_seconds
```
## 3. Helm deployment
1. **Create secret**
```bash
kubectl create secret generic findings-ledger-secrets \
--from-literal=LEDGER__DB__CONNECTIONSTRING="$CONN_STRING" \
--from-literal=LEDGER__ATTACHMENTS__ENCRYPTIONKEY="$ENC_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
```
2. **Helm values excerpt**
```yaml
services:
findingsLedger:
enabled: true
image:
repository: stellaops/findings-ledger
tag: 2025.11.0
envFromSecrets:
- name: findings-ledger-secrets
env:
LEDGER__OBSERVABILITY__ENABLED: "true"
LEDGER__MERKLE__ANCHORINTERVAL: "00:05:00"
resources:
requests: { cpu: "500m", memory: "1Gi" }
limits: { cpu: "2", memory: "4Gi" }
probes:
readinessPath: /health/ready
livenessPath: /health/live
```
3. **Install/upgrade**
```bash
helm upgrade --install stellaops deploy/helm/stellaops \
-f deploy/helm/stellaops/values-prod.yaml
```
4. **Verify**
```bash
kubectl logs deploy/stellaops-findings-ledger | grep "Ledger started"
kubectl port-forward svc/stellaops-findings-ledger 8080 &
curl -sf http://127.0.0.1:8080/metrics | head
```
## 4. Backups & restores
| Task | Command / guidance |
| --- | --- |
| Online backup | `pg_dump -Fc --dbname="$LEDGER_DB" --file ledger-$(date -u +%Y%m%d).dump` (run hourly for WAL + daily full dumps). |
| Point-in-time recovery | Enable WAL archiving; document target `recovery_target_time`. |
| Projection rebuild | After restore, run `dotnet run --project tools/LedgerReplayHarness -- --connection "$LEDGER_DB" --tenant all` to regenerate projections and verify hashes. |
| Evidence bundles | Store Merkle root anchors + replay DSSE bundles alongside DB backups for audit parity. |
## 5. Offline / air-gapped workflow
- Use `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz` before exporting Offline Kits. Include:
- `ledger_write_latency_seconds` summaries
- `ledger_merkle_anchor_duration_seconds` histogram
- Latest `ledger_merkle_roots` rows (export via `psql \copy`)
- Package ledger service binaries + migrations using `ops/offline-kit/build_offline_kit.py --include ledger`.
- Document sealed-mode restrictions: disable outbound attachments unless egress policy allows Evidence Locker endpoints; set `LEDGER__ATTACHMENTS__ALLOWEGRESS=false`.
## 6. Post-deploy checklist
- [ ] Health + metrics endpoints respond.
- [ ] Merkle anchors writing to `ledger_merkle_roots`.
- [ ] Projection lag < 30s (`ledger_projection_lag_seconds`).
- [ ] Grafana dashboards imported under “Findings Ledger”.
- [ ] Backups scheduled + restore playbook tested.
- [ ] Offline snapshot taken (air-gapped sites).
---
*Draft prepared 2025-11-13 for LEDGER-29-009/LEDGER-AIRGAP-56-001 planning. Update once Compose/Helm overlays are merged.*


@@ -0,0 +1,45 @@
# Implementation Plan — Findings Ledger (Sprint 120)
## Phase 1: Observability baselines (LEDGER-29-007)
- Instrument writer/projector with metrics listed in `observability.md` (`ledger_write_latency_seconds`, `ledger_events_total`, `ledger_projection_lag_seconds`, etc.).
- Emit structured logs (Serilog JSON) including chain/sequence/hash metadata.
- Wire OTLP exporters, ensure `/metrics` endpoint exposes histogram buckets with exemplars.
- Publish Grafana dashboards + alert rules (Policy SLO pack).
- Deliver doc updates + sample Grafana JSON in repo (`docs/observability/dashboards/findings-ledger/`).
## Phase 2: Determinism harness (LEDGER-29-008)
- Finalize NDJSON fixtures for ≥5M findings/tenant (per tenant/test scenario).
- Implement `tools/LedgerReplayHarness` CLI as specified in `replay-harness.md`.
- Add GitHub/Gitea pipeline job(s) running nightly (1M) + weekly (5M) harness plus DSSE signing.
- Capture CPU/memory/latency metrics and commit signed reports for validation.
- Provide runbook for QA + Ops to rerun harness in their environments.
## Phase 3: Deployment & backup collateral (LEDGER-29-009)
- Integrate ledger service into Compose (`docker-compose.prod.yaml`) and Helm values.
- Automate PostgreSQL migrations (DatabaseMigrator invocation pre-start).
- Document backup cadence (pg_dump + WAL archiving) and projection rebuild process (call harness).
- Ensure Offline Kit packaging pulls binaries, migrations, harness, and default dashboards.
## Phase 4: Provenance & air-gap extensions
- LEDGER-34-101: ingest orchestrator run export metadata, index by artifact hash, expose audit endpoint.
- LEDGER-AIRGAP-56/57/58: extend ledger events to capture bundle provenance, staleness metrics, timeline events.
- LEDGER-ATTEST-73-001: store attestation pointers (DSSE IDs, Rekor metadata) for explainability.
- For each extension, update schema doc + workflow inference doc to describe newly recorded fields and tenant-safe defaults.
## Dependencies & sequencing
1. Advisory AI Sprint 110.A completion (raw findings parity).
2. Observability schema approval (Nov 15) to unblock Phase 1 instrumentation.
3. QA lab capacity for 5M replay (Nov 18 checkpoint).
4. DevOps review of Compose/Helm overlays (Nov 20).
5. Orchestrator export schema freeze (Nov 25) for provenance linkage.
## Deliverables checklist
- [ ] Metrics/logging/tracing implementation merged, dashboards exported.
- [ ] Harness CLI + fixtures + signed reports committed.
- [ ] Compose/Helm overlays + backup/restore runbooks validated.
- [ ] Air-gap provenance fields documented + implemented.
- [ ] Sprint tracker and release notes updated after each phase.
---
*Draft: 2025-11-13. Update when sequencing or dependencies change.*


@@ -0,0 +1,65 @@
# Findings Ledger Observability Profile (Sprint 120)
> **Audience:** Findings Ledger Guild · Observability Guild · DevOps · AirGap Controller Guild
> **Scope:** Metrics, logs, traces, dashboards, and alert contracts required by LEDGER-29-007/008/009. Complements the schema spec and workflow docs.
## 1. Telemetry stack & conventions
- **Export path:** .NET OpenTelemetry SDK → OTLP → shared collector → Prometheus/Tempo/Loki. Enable via `observability.enabled=true` in `appsettings`.
- **Namespace prefix:** `ledger.*` for metrics, `Ledger.*` for logs/traces. Labels follow `tenant`, `chain`, `policy`, `status`, `reason`, `anchor`.
- **Time provenance:** All timestamps emitted in UTC ISO-8601. When metrics/logs include monotonic durations they must derive from `TimeProvider`.
## 2. Metrics
| Metric | Type | Labels | Description / target |
| --- | --- | --- | --- |
| `ledger_write_latency_seconds` | Histogram | `tenant`, `event_type` | End-to-end append latency (API ingress → persisted). P95 ≤120ms. |
| `ledger_events_total` | Counter | `tenant`, `event_type`, `source` (`policy`, `workflow`, `orchestrator`) | Incremented per committed event. Mirrors Merkle leaf count. |
| `ledger_ingest_backlog_events` | Gauge | `tenant` | Number of events buffered in the writer queue. Alert when >5000 for 5min. |
| `ledger_projection_lag_seconds` | Gauge | `tenant` | Wall-clock difference between latest ledger event and projection tail. Target <30s. |
| `ledger_projection_rebuild_seconds` | Histogram | `tenant` | Duration of replay/rebuild operations triggered by LEDGER-29-008 harness. |
| `ledger_merkle_anchor_duration_seconds` | Histogram | `tenant` | Time to batch + anchor events. Target <60s per 10k events. |
| `ledger_merkle_anchor_failures_total` | Counter | `tenant`, `reason` (`db`, `signing`, `network`) | Alerts at >0 within 15min. |
| `ledger_attachments_encryption_failures_total` | Counter | `tenant`, `stage` (`encrypt`, `sign`, `upload`) | Ensures secure attachment pipeline stays healthy. |
| `ledger_db_connections_active` | Gauge | `role` (`writer`, `projector`) | Helps tune pool size. |
| `ledger_app_version_info` | Gauge | `version`, `git_sha` | Static metric for fleet observability. |
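As a rough illustration of the writer-side instrumentation, the first two metrics could be recorded as in the sketch below; the meter name and class shape are assumptions, not the exact `LedgerMetrics` implementation shipped with the writer.
```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

public sealed class LedgerMetricsSketch
{
    private readonly Histogram<double> _writeLatencySeconds;
    private readonly Counter<long> _eventsTotal;

    public LedgerMetricsSketch(IMeterFactory meterFactory)
    {
        // Meter name is an assumption; the metric names match the table above.
        var meter = meterFactory.Create("StellaOps.Findings.Ledger");
        _writeLatencySeconds = meter.CreateHistogram<double>("ledger_write_latency_seconds", unit: "s");
        _eventsTotal = meter.CreateCounter<long>("ledger_events_total");
    }

    public void RecordWrite(string tenant, string eventType, string source, TimeSpan latency)
    {
        var tags = new TagList { { "tenant", tenant }, { "event_type", eventType } };
        _writeLatencySeconds.Record(latency.TotalSeconds, tags);

        tags.Add("source", source);
        _eventsTotal.Add(1, tags);
    }
}
```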
### Derived dashboards
- **Writer health:** `ledger_write_latency_seconds` (P50/P95/P99), backlog gauge, event throughput.
- **Projection health:** `ledger_projection_lag_seconds`, rebuild durations, conflict counts (from logs).
- **Anchoring:** Anchor duration histogram, failure counter, root hash timeline.
## 3. Logs & traces
- **Log structure:** Serilog JSON with fields `tenant`, `chainId`, `sequence`, `eventId`, `eventType`, `actorId`, `policyVersion`, `hash`, `merkleRoot`.
- **Log levels:** `Information` for success summaries (sampled), `Warning` for retried operations, `Error` for failed writes/anchors.
- **Correlation:** Each API request includes `requestId` + `traceId` logged with events. Projector logs capture `replayId` and `rebuildReason`.
- **Secrets:** Ensure `event_body` is never logged; log only metadata/hashes.
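A hedged Serilog sketch of the log shape above; the property names mirror the bullet list, and the surrounding helper method is purely illustrative.
```csharp
using Serilog;

public static class LedgerLogSketch
{
    public static void LogCommittedEvent(
        ILogger logger, string tenant, string chainId, long sequence, string eventId,
        string eventType, string actorId, string policyVersion, string eventHash, string merkleRoot)
    {
        // event_body is intentionally omitted: only metadata and hashes are logged.
        logger
            .ForContext("tenant", tenant)
            .ForContext("chainId", chainId)
            .ForContext("actorId", actorId)
            .ForContext("policyVersion", policyVersion)
            .Information(
                "Ledger event {EventId} ({EventType}) committed at sequence {Sequence} with hash {Hash} under Merkle root {MerkleRoot}",
                eventId, eventType, sequence, eventHash, merkleRoot);
    }
}
```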
## 4. Alerts
| Alert | Condition | Response |
| --- | --- | --- |
| **LedgerWriteSLA** | `ledger_write_latency_seconds` P95 > 0.12s for 3 intervals | Check DB contention, review queue backlog, scale writer. |
| **LedgerBacklogGrowing** | `ledger_ingest_backlog_events` > 5000 for 5min | Inspect upstream policy runs, ensure projector keeping up. |
| **ProjectionLag** | `ledger_projection_lag_seconds` > 60s | Trigger rebuild, verify change streams. |
| **AnchorFailure** | `ledger_merkle_anchor_failures_total` increase > 0 | Collect logs, rerun anchor, verify signing service. |
| **AttachmentSecurityError** | `ledger_attachments_encryption_failures_total` increase > 0 | Audit attachments pipeline; check key material and storage endpoints. |
Alerts integrate with Notifier channel `ledger.alerts`. For air-gapped deployments emit to local syslog + CLI incident scripts.
## 5. Testing & determinism harness
- **Replay harness:** CLI `dotnet run --project tools/LedgerReplayHarness` executes deterministic replays at 5M findings/tenant. Metrics emitted: `ledger_projection_rebuild_seconds` with `scenario` label.
- **Property tests:** Seeded tests ensure `ledger_events_total` and Merkle leaf counts match after replay.
- **CI gating:** `LEDGER-29-008` requires harness output uploaded as signed JSON (`harness-report.json` + DSSE) and referenced in sprint notes.
## 6. Offline & air-gap guidance
- Collect metrics/log snapshots via `stella ledger observability snapshot --out offline/ledger/metrics.tar.gz`. Include `ledger_write_latency_seconds` summary, anchor root history, and projection lag samples.
- Include default Grafana JSON under `offline/telemetry/dashboards/ledger/*.json`. Dashboards use the metrics above; filter by `tenant`.
- Ensure sealed-mode doc (`docs/modules/findings-ledger/schema.md` §3.3) references `ledger_attachments_encryption_failures_total` so Ops can confirm encryption pipeline health without remote telemetry.
## 7. Runbook pointers
- **Anchoring issues:** Refer to `docs/modules/findings-ledger/schema.md` §3 for root structure, `ops/devops/telemetry/package_offline_bundle.py` for diagnostics.
- **Projection rebuilds:** `docs/modules/findings-ledger/workflow-inference.md` for chain rules; `scripts/ledger/replay.sh` (LEDGER-29-008 deliverable) for deterministic replays.
---
*Draft compiled 2025-11-13 for LEDGER-29-007/008 planning. Update when metrics or alerts change.*


@@ -0,0 +1,86 @@
# Findings Ledger Replay & Determinism Harness (LEDGER-29-008)
> **Audience:** Findings Ledger Guild · QA Guild · Policy Guild
> **Purpose:** Define the reproducible harness for 5M findings/tenant replay tests and determinism validation required by LEDGER-29-008.
## 1. Goals
- Reproduce ledger + projection state from canonical event fixtures with byte-for-byte determinism.
- Stress test writer/projector throughput at ≥5M findings per tenant, capturing CPU/memory/latency profiles.
- Produce signed reports (DSSE) that CI and auditors can review before shipping.
## 2. Architecture
```
Fixtures (.ndjson) → Harness Runner → Ledger Writer API → Postgres Ledger DB
↘ Projector (same DB) ↘ Metrics snapshot
```
- **Fixtures:** `fixtures/ledger/*.ndjson`, sorted by `sequence_no`, containing canonical JSON envelopes with precomputed hashes.
- **Runner:** `tools/LedgerReplayHarness` (console app) feeds events, waits for projector catch-up, and verifies projection hashes.
- **Validation:** After replay, the runner re-reads ledger/projection tables, recomputes hashes, and compares to fixture expectations.
- **Reporting:** Generates `harness-report.json` with metrics (latency histogram, insertion throughput, projection lag) plus a DSSE signature.
## 3. CLI usage
```bash
dotnet run --project tools/LedgerReplayHarness \
-- --fixture fixtures/ledger/tenant-a.ndjson \
--connection "Host=postgres;Username=stellaops;Password=***;Database=findings_ledger" \
--tenant tenant-a \
--maxParallel 8 \
--report out/harness/tenant-a-report.json
```
Options:
| Option | Description |
| --- | --- |
| `--fixture` | Path to NDJSON file (supports multiple). |
| `--connection` | Postgres connection string (writer + projector share). |
| `--tenant` | Tenant identifier; harness ensures partitions exist. |
| `--maxParallel` | Batch concurrency (default 4). |
| `--report` | Output path for report JSON; `.sig` generated alongside. |
| `--metrics-endpoint` | Optional Prometheus scrape URI for live metrics snapshot. |
## 4. Verification steps
1. **Hash validation:** Recompute `event_hash` for each appended event and ensure it matches the fixture value (see the sketch after this list).
2. **Sequence integrity:** Confirm gapless sequences per chain; harness aborts on mismatch.
3. **Projection determinism:** Compare projector-derived `cycle_hash` with expected value from fixture metadata.
4. **Performance:** Capture P50/P95 latencies for `ledger_write_latency_seconds` and ensure targets (<120 ms P95) are met.
5. **Resource usage:** Sample CPU/memory via `dotnet-counters` or `kubectl top` and store in report.
6. **Merkle root check:** Rebuild Merkle tree from events and ensure root equals database `ledger_merkle_roots` entry.
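For step 1, a minimal hash-validation sketch; the canonical JSON encoding of the event envelope and the `sha256:` prefix are assumptions, since the harness reuses the ledger's own canonicalizer.
```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class EventHashCheckSketch
{
    // Recomputes the digest of the canonical event JSON and compares it to the fixture value.
    public static bool Matches(string canonicalEventJson, string expectedEventHash)
    {
        var computed = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(canonicalEventJson))).ToLowerInvariant();
        return string.Equals("sha256:" + computed, expectedEventHash, StringComparison.Ordinal);
    }
}
```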
## 5. Output report schema
```json
{
"tenant": "tenant-a",
"fixtures": ["fixtures/ledger/tenant-a.ndjson"],
"eventsWritten": 5123456,
"durationSeconds": 1422.4,
"latencyP95Ms": 108.3,
"projectionLagMaxSeconds": 18.2,
"cpuPercentMax": 72.5,
"memoryMbMax": 3580,
"merkleRoot": "3f1a…",
"status": "pass",
"timestamp": "2025-11-13T11:45:00Z"
}
```
The harness writes `harness-report.json` plus `harness-report.json.sig` (DSSE) and `metrics-snapshot.prom` for archival.
## 6. CI integration
- New pipeline job `ledger-replay-harness` runs nightly with reduced dataset (1M findings) to detect regressions quickly.
- Full 5M run executes weekly and before releases; artifacts uploaded to `out/qa/findings-ledger/`.
- Gates: merge blocked if harness `status != pass` or latencies exceed thresholds.
## 7. Air-gapped execution
- Include fixtures + harness binaries inside Offline Kit under `offline/ledger/replay/`.
- Provide `run-harness.sh` script that sets env vars, executes runner, and exports reports.
- Operators attach signed reports to audit trails, verifying hashed fixtures before import.
---
*Draft prepared 2025-11-13 for LEDGER-29-008. Update when CLI options or thresholds change.*


@@ -3,3 +3,4 @@
| Task ID | State | Notes |
| --- | --- | --- |
| `SCANNER-POLICY-0001` | DONE (2025-11-10) | Ruby component predicates implemented in engine/tests, DSL docs updated, offline kit verifies `seed-data/analyzers/ruby/git-sources`. |
| `DOCS-AIAI-31-006` | DONE (2025-11-13) | Published `docs/policy/assistant-parameters.md` capturing Advisory AI configuration knobs (inference/guardrails/cache/queue) and linked it from the module architecture dossier. |


@@ -263,9 +263,10 @@ The emitted `buildId` metadata is preserved in component hashes, diff payloads,
### 5.6 DSSE attestation (via Signer/Attestor)
* WebService constructs **predicate** with `image_digest`, `stellaops_version`, `license_id`, `policy_digest?` (when emitting **final reports**), timestamps.
* Calls **Signer** (requires **OpTok + PoE**); Signer verifies **entitlement + scanner image integrity** and returns **DSSE bundle**.
* **Attestor** logs to **Rekor v2**; returns `{uuid,index,proof}` → stored in `artifacts.rekor`.
* Operator enablement runbooks (toggles, env-var map, rollout guidance) live in [`operations/dsse-rekor-operator-guide.md`](operations/dsse-rekor-operator-guide.md) per SCANNER-ENG-0015.
---


@@ -40,35 +40,49 @@ Surface.Env exposes `ISurfaceEnvironment` which returns an immutable `SurfaceEnv
| Variable | Description | Default | Notes |
|----------|-------------|---------|-------|
| `SCANNER_SURFACE_FS_ENDPOINT` | Base URI for Surface.FS / RustFS / S3-compatible store. | _required_ | Throws `SurfaceEnvironmentException` when `RequireSurfaceEndpoint = true`. When disabled (tests), builder falls back to `https://surface.invalid` so validation can fail fast. Also binds `Surface:Fs:Endpoint` from `IConfiguration`. |
| `SCANNER_SURFACE_FS_BUCKET` | Bucket/container used for manifests and artefacts. | `surface-cache` | Must be unique per tenant; validators enforce non-empty value. |
| `SCANNER_SURFACE_FS_REGION` | Optional region for S3-compatible stores. | `null` | Needed only when the backing store requires it (AWS/GCS). |
| `SCANNER_SURFACE_CACHE_ROOT` | Local directory for warm caches. | `<temp>/stellaops/surface` | Directory is created if missing. Override to `/var/lib/stellaops/surface` (or another fast SSD) in production. |
| `SCANNER_SURFACE_CACHE_QUOTA_MB` | Soft limit for on-disk cache usage. | `4096` | Enforced range 64–262144 MB; validation emits `SURFACE_ENV_CACHE_QUOTA_INVALID` outside the range. |
| `SCANNER_SURFACE_PREFETCH_ENABLED` | Enables manifest prefetch threads. | `false` | Workers honour this before analyzer execution. |
| `SCANNER_SURFACE_TENANT` | Tenant namespace used by cache + secret resolvers. | `TenantResolver(...)` or `"default"` | Default resolver may pull from Authority claims; you can override via env for multi-tenant pools. |
| `SCANNER_SURFACE_FEATURES` | Comma-separated feature switches. | `""` | Compared against `SurfaceEnvironmentOptions.KnownFeatureFlags`; unknown flags raise warnings. |
| `SCANNER_SURFACE_TLS_CERT_PATH` | Path to PEM/PKCS#12 file for client auth. | `null` | When present, `SurfaceEnvironmentBuilder` loads the certificate into `SurfaceTlsConfiguration`. |
| `SCANNER_SURFACE_TLS_KEY_PATH` | Optional private-key path when cert/key are stored separately. | `null` | Stored in `SurfaceTlsConfiguration` for hosts that need to hydrate the key themselves. |
### 3.2 Secrets provider keys
| Variable | Description | Notes |
|----------|-------------|-------|
| `SCANNER_SURFACE_SECRETS_PROVIDER` | Provider ID (`kubernetes`, `file`, `inline`, future back-ends). | Defaults to `kubernetes`; validators reject unknown values via `SURFACE_SECRET_PROVIDER_UNKNOWN`. |
| `SCANNER_SURFACE_SECRETS_ROOT` | Path or base namespace for the provider. | Required for the `file` provider (e.g., `/etc/stellaops/secrets`). |
| `SCANNER_SURFACE_SECRETS_NAMESPACE` | Kubernetes namespace used by the secrets provider. | Mandatory when `provider = kubernetes`. |
| `SCANNER_SURFACE_SECRETS_FALLBACK_PROVIDER` | Optional secondary provider ID. | Enables tiered lookups (e.g., `kubernetes` → `inline`) without changing code. |
| `SCANNER_SURFACE_SECRETS_ALLOW_INLINE` | Allows returning inline secrets (useful for tests). | Defaults to `false`; production deployments should keep this disabled. |
| `SCANNER_SURFACE_SECRETS_TENANT` | Tenant override for secret lookups. | Defaults to `SCANNER_SURFACE_TENANT` or the tenant resolver result. |
### 3.3 Component-specific prefixes
Zastava containers read the same primary variables but may override names under the `ZASTAVA_` prefix (e.g., `ZASTAVA_SURFACE_CACHE_ROOT`, `ZASTAVA_SURFACE_FEATURES`). Surface.Env automatically checks component-specific prefixes before falling back to the scanner defaults.
`SurfaceEnvironmentOptions.Prefixes` controls the order in which suffixes are probed. Every suffix listed above is combined with each prefix (e.g., `SCANNER_SURFACE_FS_ENDPOINT`, `ZASTAVA_SURFACE_FS_ENDPOINT`) and finally the bare suffix (`SURFACE_FS_ENDPOINT`). Configure prefixes per host so local overrides win but global scanner defaults remain available:
| Component | Suggested prefixes (first match wins) | Notes |
|-----------|---------------------------------------|-------|
| Scanner.Worker / WebService | `SCANNER` | Default already added by `AddSurfaceEnvironment`. |
| Zastava Observer/Webhook (planned) | `ZASTAVA`, `SCANNER` | Call `options.AddPrefix("ZASTAVA")` before relying on `ZASTAVA_*` overrides. |
| Future CLI / BuildX plug-ins | `CLI`, `SCANNER` | Allows per-user overrides without breaking shared env files. |
This approach means operators can define a single env file (SCANNER_*) and only override the handful of settings that diverge for a specific component by introducing an additional prefix.
### 3.4 Configuration precedence
The builder resolves every suffix using the following precedence:
1. Environment variables using the configured prefixes (e.g., `ZASTAVA_SURFACE_FS_ENDPOINT`, then `SCANNER_SURFACE_FS_ENDPOINT`, then the bare `SURFACE_FS_ENDPOINT`).
2. Configuration values under the `Surface:*` section (for example `Surface:Fs:Endpoint`, `Surface:Cache:Root` in `appsettings.json` or Helm values).
3. Hard-coded defaults baked into `SurfaceEnvironmentBuilder` (temporary directory, `surface-cache` bucket, etc.).
`SurfaceEnvironmentOptions.RequireSurfaceEndpoint` controls whether a missing endpoint results in an exception (default: `true`). Other values fall back to the default listed in §3.1/3.2 and are further validated by the Surface.Validation pipeline.
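To illustrate the precedence order, a simplified resolver sketch follows; the real `SurfaceEnvironmentBuilder` is structured differently but probes sources in the same order, and the parameter names here are hypothetical.
```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

public static class SurfacePrecedenceSketch
{
    public static string? Resolve(
        IConfiguration configuration,
        IReadOnlyList<string> prefixes,   // e.g. ["ZASTAVA", "SCANNER"]
        string suffix,                    // e.g. "SURFACE_FS_ENDPOINT"
        string configurationKey,          // e.g. "Surface:Fs:Endpoint"
        string? builtInDefault = null)
    {
        // Prefixed environment variables, in configured order, then the bare suffix.
        foreach (var prefix in prefixes)
        {
            var value = Environment.GetEnvironmentVariable($"{prefix}_{suffix}");
            if (!string.IsNullOrWhiteSpace(value)) return value;
        }

        var bare = Environment.GetEnvironmentVariable(suffix);
        if (!string.IsNullOrWhiteSpace(bare)) return bare;

        // Surface:* configuration, then the hard-coded default.
        return configuration[configurationKey] ?? builtInDefault;
    }
}
```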
## 4. API Surface
@@ -79,65 +93,99 @@ public interface ISurfaceEnvironment
IReadOnlyDictionary<string, string> RawVariables { get; }
}
public sealed record SurfaceEnvironmentSettings(
    Uri SurfaceFsEndpoint,
    string SurfaceFsBucket,
    string? SurfaceFsRegion,
    DirectoryInfo CacheRoot,
    int CacheQuotaMegabytes,
    bool PrefetchEnabled,
    IReadOnlyCollection<string> FeatureFlags,
    SurfaceSecretsConfiguration Secrets,
    string Tenant,
    SurfaceTlsConfiguration Tls)
{
    public DateTimeOffset CreatedAtUtc { get; init; }
}

public sealed record SurfaceSecretsConfiguration(
    string Provider,
    string Tenant,
    string? Root,
    string? Namespace,
    string? FallbackProvider,
    bool AllowInline);

public sealed record SurfaceTlsConfiguration(
    string? CertificatePath,
    string? PrivateKeyPath,
    X509Certificate2Collection? ClientCertificates);
```
`ISurfaceEnvironment.RawVariables` captures the exact env/config keys that produced the snapshot so operators can export them in diagnostics bundles.
`SurfaceEnvironmentOptions` configures how the snapshot is built:
* `ComponentName`: used in logs/validation output.
* `Prefixes`: ordered list of env prefixes (see §3.3). Defaults to `["SCANNER"]`.
* `RequireSurfaceEndpoint`: throws when no endpoint is provided (default `true`).
* `TenantResolver`: delegate invoked when `SCANNER_SURFACE_TENANT` is absent.
* `KnownFeatureFlags`: recognised feature switches; unexpected values raise warnings.
Example registration:
```csharp
builder.Services.AddSurfaceEnvironment(options =>
{
options.ComponentName = "Scanner.Worker";
options.AddPrefix("ZASTAVA"); // optional future override
options.KnownFeatureFlags.Add("validation");
options.TenantResolver = sp => sp.GetRequiredService<ITenantContext>().TenantId;
});
```
Consumers access `ISurfaceEnvironment.Settings` and pass the record into Surface.FS, Surface.Secrets, cache, and validation helpers. The interface memoises results so repeated access is cheap.
## 5. Validation
`SurfaceEnvironmentBuilder` only throws `SurfaceEnvironmentException` for malformed inputs (non-integer quota, invalid URI, missing required variable when `RequireSurfaceEndpoint = true`). The richer validation pipeline lives in `StellaOps.Scanner.Surface.Validation` and runs via `services.AddSurfaceValidation()`:
1. **SurfaceEndpointValidator** checks for a non-placeholder endpoint and bucket (`SURFACE_ENV_MISSING_ENDPOINT`, `SURFACE_FS_BUCKET_MISSING`).
2. **SurfaceCacheValidator** verifies the cache directory exists/is writable and that the quota is positive (`SURFACE_ENV_CACHE_DIR_UNWRITABLE`, `SURFACE_ENV_CACHE_QUOTA_INVALID`).
3. **SurfaceSecretsValidator** validates provider names, required namespace/root fields, and tenant presence (`SURFACE_SECRET_PROVIDER_UNKNOWN`, `SURFACE_SECRET_CONFIGURATION_MISSING`, `SURFACE_ENV_TENANT_MISSING`).
Validators emit `SurfaceValidationIssue` instances with codes defined in `SurfaceValidationIssueCodes`. `LoggingSurfaceValidationReporter` writes structured log entries (Info/Warning/Error) using the component name, issue code, and remediation hint. Hosts fail startup if any issue has `Error` severity; warnings allow startup but surface actionable hints.
## 6. Integration Guidance
- **Scanner Worker**: register `AddSurfaceEnvironment`, `AddSurfaceValidation`, `AddSurfaceFileCache`, and `AddSurfaceSecrets` before analyzer/services (see `src/Scanner/StellaOps.Scanner.Worker/Program.cs`). `SurfaceCacheOptionsConfigurator` already binds the cache root from `ISurfaceEnvironment`.
- **Scanner WebService**: identical wiring, plus `SurfacePointerService`/`ScannerSurfaceSecretConfigurator` reuse the resolved settings (`Program.cs` demonstrates the pattern).
- **Zastava Observer/Webhook**: will reuse the same helper once the service adds `AddSurfaceEnvironment(options => options.AddPrefix("ZASTAVA"))` so per-component overrides function without diverging defaults.
- **Scheduler / CLI / BuildX (future)**: treat `ISurfaceEnvironment` as read-only input; secret lookup, cache plumbing, and validation happen before any queue/enqueue work.
Readiness probes should invoke `ISurfaceValidatorRunner` (registered by `AddSurfaceValidation`) and fail the endpoint when any issue is returned. The Scanner Worker/WebService hosted services already run the validators on startup; other consumers should follow the same pattern.
### 6.1 Validation output
`LoggingSurfaceValidationReporter` produces log entries that include:
```
Surface validation issue for component Scanner.Worker: SURFACE_ENV_MISSING_ENDPOINT - Surface FS endpoint is missing or invalid. Hint: Set SCANNER_SURFACE_FS_ENDPOINT to the RustFS/S3 endpoint.
```
Treat `SurfaceValidationIssueCodes.*` with severity `Error` as hard blockers (readiness must fail). `Warning` entries flag configuration drift (for example, missing namespaces) but allow startup so staging/offline runs can proceed. The codes appear in both the structured log state and the reporter payload, making it easy to alert on them.
## 7. Security & Observability
- Surface.Env never logs raw values; only suffix names and issue codes appear in logs. `RawVariables` is intended for diagnostics bundles and should be treated as sensitive metadata.
- TLS certificates are loaded into memory and not re-serialised; only the configured paths are exposed to downstream services.
- To emit metrics, register a custom `ISurfaceValidationReporter` (e.g., wrapping Prometheus counters) in addition to the logging reporter.
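A hypothetical metrics-emitting reporter sketch; the actual `ISurfaceValidationReporter` contract lives in `StellaOps.Scanner.Surface.Validation` and its member signatures may differ, so treat the `Report` shape below as an assumption.
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Interface left commented out because its exact members are defined in
// StellaOps.Scanner.Surface.Validation; this Report shape is an assumption.
public sealed class MetricsSurfaceValidationReporter // : ISurfaceValidationReporter
{
    private static readonly Meter Meter = new("StellaOps.Scanner.Surface.Validation");
    private static readonly Counter<long> Issues =
        Meter.CreateCounter<long>("surface_validation_issues_total");

    public void Report(string componentName, string issueCode, string severity)
    {
        Issues.Add(1,
            new KeyValuePair<string, object?>("component", componentName),
            new KeyValuePair<string, object?>("code", issueCode),
            new KeyValuePair<string, object?>("severity", severity));
    }
}
```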
## 8. Offline & Air-Gap Support
- Defaults assume no public network access; point `SCANNER_SURFACE_FS_ENDPOINT` at an internal RustFS/S3 mirror.
- Offline bundles must capture an env file (Ops track this under the Offline Kit tasks) so operators can seed `SCANNER_*` values before first boot.
- Keep `docs/modules/devops/runbooks/zastava-deployment.md` in sync so Zastava deployments reuse the same env contract.
## 9. Testing Strategy


@@ -46,6 +46,17 @@
- Export Center profile with `attestations.bundle=true`.
- Rekor log snapshots mirrored (ORAS bundle or rsync of `/var/log/rekor`) for disconnected verification.
### 3.1 Configuration & env-var map
| Service | Key(s) | Env override | Notes |
|---------|--------|--------------|-------|
| Scanner WebService / Worker | `scanner.attestation.signerEndpoint`<br>`scanner.attestation.attestorEndpoint`<br>`scanner.attestation.requireDsse`<br>`scanner.attestation.uploadArtifacts` | `SCANNER__ATTESTATION__SIGNERENDPOINT`<br>`SCANNER__ATTESTATION__ATTESTORENDPOINT`<br>`SCANNER__ATTESTATION__REQUIREDSSE`<br>`SCANNER__ATTESTATION__UPLOADARTIFACTS` | Worker/WebService share the same config. Set `requireDsse=false` while observing, then flip to `true`. `uploadArtifacts=true` stores DSSE+Rekor bundles next to SBOM artefacts. |
| Signer | `signer.attestorEndpoint`<br>`signer.keyProvider`<br>`signer.fulcio.endpoint` | `SIGNER__ATTESTORENDPOINT` etc. | `attestorEndpoint` lets Signer push DSSE payloads downstream; key provider controls BYO KMS/HSM vs Fulcio. |
| Attestor | `attestor.rekor.api`<br>`attestor.rekor.publicKeyPath`<br>`attestor.rekor.offlineMirrorPath`<br>`attestor.retry.maxAttempts` | `ATTESTOR__REKOR__API`<br>`ATTESTOR__REKOR__PUBLICKEYPATH`<br>`ATTESTOR__REKOR__OFFLINEMIRRORPATH`<br>`ATTESTOR__RETRY__MAXATTEMPTS` | Mirror path points at the local snapshot directory used in sealed/air-gapped modes. |
| Export Center | `exportProfiles.<name>.includeAttestations`<br>`exportProfiles.<name>.includeRekorProofs` | `EXPORTCENTER__EXPORTPROFILES__SECURE-DEFAULT__INCLUDEATTESTATIONS` etc. | Use profiles to gate which bundles include DSSE/Rekor data; keep a “secure-default” profile enabled across tiers. |
> **Tip:** Every key above follows the ASP.NET Core double-underscore pattern. For Compose/Helm, add environment variables directly; for Offline Kit overrides, drop `appsettings.Offline.json` with the same sections.
---
## 4. Enablement workflow
@@ -161,6 +172,38 @@ Roll forward per environment; keep the previous phases toggles for hot rollba
---
## 8. Operational runbook & SLO guardrails
| Step | Owner | Target / Notes |
|------|-------|----------------|
| Health gate | Ops/SRE | `attestor_rekor_success_total` ≥ 99.5% rolling hour, `rekor_inclusion_latency_p95` ≤ 30s. Alert when retries spike or queue depth > 50. |
| Cutover dry-run | Scanner team | Set `SCANNER__ATTESTATION__REQUIREDSSE=false`, watch metrics + Attestor queue for 24h, capture Rekor proofs per environment. |
| Enforce | Platform | Flip `requireDsse=true`, promote Policy rule from `warn` → `deny`, notify AppSec + release managers. |
| Audit proof pack | Export Center | Run secure profile nightly; confirm `attestations/` + `rekor/` trees attached to Offline Kit. Store bundle hash in Evidence Locker. |
| Verification spot-check | AppSec | Weekly `stellaops-cli attest verify --bundle latest.tar --rekor-key rekor.pub --json` saved to ticket for auditors. |
| Rollback | Ops/SRE | If Rekor outage exceeds 15 min: set `requireDsse=false`, keep policy in `warn`, purge Attestor queue once log recovers, then re-enable. Document the waiver in the sprint log. |
**Dashboards & alerts**
- Grafana panel: Rekor inclusion latency (p50/p95) + Attestor retry rate.
- Alert when `attestationPending=true` events exceed 5 per minute for >5 minutes.
- Logs must include `rekorUuid`, `rekorLogIndex`, `attestationDigest` for SIEM correlation.
**Runbook snippets**
```bash
# test Rekor health + key mismatch
rekor-cli loginfo --rekor_server "${ATTESTOR__REKOR__API}" --format json | jq .rootHash
# replay stranded payloads after outage
stellaops-attestor replay --since "2025-11-13T00:00:00Z" \
--rekor ${ATTESTOR__REKOR__API} --rekor-key /etc/rekor/rekor.pub
# verify a single DSSE file against Rekor proof bundle
stellaops-cli attest verify --envelope artifacts/scan123/attest/sbom.dsse.json \
--rekor-proof artifacts/scan123/rekor/entry.json --rekor-key rekor.pub
```
---
## References
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#dsse-rekor-operator-enablement-trivy-grype-snyk`
@@ -168,4 +211,3 @@ Roll forward per environment; keep the previous phases toggles for hot rollba
- Export Center profiles: `docs/modules/export-center/architecture.md`
- Policy Engine predicates: `docs/modules/policy/architecture.md`
- CLI reference: `docs/09_API_CLI_REFERENCE.md`