Here's a tight, drop-in acceptance-test pack for Stella Ops that turns common failure modes into concrete guardrails you can ship this sprint.
---
# 1) Feed outages & integrity drift (e.g., Grype DB / CDN hiccups)
**Lesson:** Never couple scans to a single live feed; pin, verify, and cache.
**Add to acceptance tests**
* **Rollback-safe updaters**
* If a feed update fails checksum or signature, the system keeps using the last “good” bundle.
* On restart, the updater falls back to the last verified bundle without network access.
* **Signed offline bundles**
* Every feed bundle (SBOM catalogs, CVE DB shards, rules) must be DSSE-signed; verification blocks ingestion on mismatch.
* Bundle manifest lists the SHA-256 for each file; any deviation = reject (see the sketch after the test cases below).
**Test cases (CI)**
* Simulate 404/timeout from feed URL → scanner still produces results from cached bundle.
* Serve a tampered bundle (wrong hash) → updater logs failure; no swap; previous bundle remains active.
* Air-gap mode: no network → scanner loads from `/var/lib/stellaops/offline-bundles/*` and passes verification.
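A minimal sketch of the manifest/hash check the bullets above rely on; the `BundleManifest` shape and file layout here are assumptions, not the shipped bundle format:

```csharp
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical manifest shape: bundle.json lists each file with its expected SHA-256.
public sealed record ManifestEntry(string Path, string Sha256);
public sealed record BundleManifest(string BundleId, IReadOnlyList<ManifestEntry> Files);

public static class BundleHashCheck
{
    // Returns the files whose on-disk hash deviates from the manifest (or are missing);
    // an empty list means the bundle content matches. DSSE verification of bundle.json
    // itself would happen before this step.
    public static async Task<IReadOnlyList<string>> FindDeviationsAsync(string bundleRoot)
    {
        var manifestJson = await File.ReadAllTextAsync(Path.Combine(bundleRoot, "bundle.json"));
        var manifest = JsonSerializer.Deserialize<BundleManifest>(manifestJson,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true })
            ?? throw new InvalidDataException("bundle.json could not be parsed");

        var deviations = new List<string>();
        foreach (var entry in manifest.Files)
        {
            var fullPath = Path.Combine(bundleRoot, entry.Path);
            if (!File.Exists(fullPath)) { deviations.Add(entry.Path); continue; }

            await using var stream = File.OpenRead(fullPath);
            var actual = Convert.ToHexString(await SHA256.HashDataAsync(stream));
            if (!actual.Equals(entry.Sha256, StringComparison.OrdinalIgnoreCase))
                deviations.Add(entry.Path);
        }
        return deviations;
    }
}
```

Any non-empty result would map to `BUNDLE_FILE_HASH_MISMATCH` / `BUNDLE_FILE_MISSING` and block the swap.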
---
# 2) SBOM quality & schema drift
**Lesson:** Garbage in = garbage VEX. Gate on schema, completeness, and provenance.
**Add to acceptance tests**
* **SBOM schema gating**
* Reject SBOMs not valid CycloneDX 1.6 / SPDX 2.3 (your chosen set).
* Require: component `bom-ref`, supplier, version, and hashes; validate build provenance (SLSA/in-toto attestation) whenever it is provided.
* **Minimum completeness**
* Thresholds: ≥95% of components carry cryptographic hashes; no unknown package-ecosystem fields for the top 20 dependencies (see the sketch after the test cases below).
**Test cases**
* Submit a malformed CycloneDX SBOM → `400 SBOM_VALIDATION_FAILED` with a pointer to the failing JSON path.
* SBOM missing hashes for >5% of components → blocked from graph ingestion; actionable error.
* SBOM with unsigned provenance when policy="RequireAttestation" → rejected.
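A minimal sketch of the hash-coverage gate referenced above; the component shape and the `MinHashCoverage` constant are illustrative assumptions:

```csharp
// Hypothetical, minimal component shape for the completeness gate.
public sealed record SbomComponent(string BomRef, string? Supplier, IReadOnlyList<string> Hashes);

public static class SbomCompletenessGate
{
    public const double MinHashCoverage = 0.95;

    // Returns an error code when hash coverage falls below the policy threshold,
    // otherwise null. A real policy would also check supplier, version, and provenance.
    public static string? CheckHashCoverage(IReadOnlyList<SbomComponent> components)
    {
        if (components.Count == 0)
            return "SBOM_EMPTY";

        var withHashes = components.Count(c => c.Hashes.Count > 0);
        var coverage = (double)withHashes / components.Count;

        return coverage < MinHashCoverage
            ? $"SBOM_HASH_COVERAGE_BELOW_THRESHOLD ({coverage:P1} < {MinHashCoverage:P0})"
            : null;
    }
}
```

The same pattern extends to the supplier and provenance checks.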
---
# 3) DB/data corruption or operator error
**Lesson:** Snapshots save releases.
**Add to acceptance tests**
* **DB snapshot cadence**
* Postgres: base backup nightly + WAL archiving; RPO ≤ 15 min; automated restore rehearsals.
* Mongo (while still in use): per-collection dumps until conversion completes; checksum each artifact.
* **Deterministic replay**
* Any graph view must be reproducible from snapshot + bundle manifest (same revision hash).
**Test cases**
* Run a chaos test that drops the last 24 hours of data → PITR restore to T−15m succeeds; graph revision IDs match pre-failure.
* Restore rehearsal produces identical VEX verdict counts for a pinned revision.
---
# 4) Reachability engines & graph evaluation flakiness
**Lesson:** When reachability is uncertain, degrade gracefully and be explicit.
**Add to acceptance tests**
* **Reachability fallbacks**
* If call-graph build fails or language analyzer missing, verdict moves to “Potentially Affected (Unproven Reach)” with a reason code.
* Policies must allow a “conservative mode” (assume reachable) vs a “lenient mode” (assume not reachable), toggled per environment (see the sketch after the test cases below).
* **Stable graph IDs**
* Graph revision ID is a content hash of inputs (SBOM set + rules + feed versions); identical inputs → identical ID.
**Test cases**
* Remove a language analyzer container at runtime → status flips to fallback code; no 500s; policy evaluation still completes.
* Re-ingest same inputs → same graph revision ID and same verdict distribution.
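One way the conservative/lenient toggle could be expressed; the enum, method, and the lenient mapping are interpretations of the bullets above, not a confirmed API:

```csharp
public enum ReachabilityMode { Conservative, Lenient }

public static class ReachabilityFallbackPolicy
{
    // When reachability cannot be computed (failed call-graph build, missing analyzer,
    // timeout), the verdict degrades explicitly instead of failing the evaluation.
    public static (string Status, string ReasonCode) Degrade(ReachabilityMode mode, string reasonCode) =>
        mode switch
        {
            // Conservative: treat unknown reachability as reachable.
            ReachabilityMode.Conservative => ("PotentiallyAffected", reasonCode),
            // Lenient (assumed mapping): treat unknown as not reachable, but keep the
            // reason code so the downgrade stays auditable rather than silent.
            ReachabilityMode.Lenient => ("NotAffected", reasonCode),
            _ => throw new ArgumentOutOfRangeException(nameof(mode))
        };
}
```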
---
# 5) Update pipelines & job routing
**Lesson:** No single point of truth; isolate, audit, and prove swaps.
**Add to acceptance tests**
* **Two-phase bundle swaps**
* Stage → verify → atomic symlink/label swap; all scanners pick up the new label within 1 minute, or the swap rolls back.
* **Authority-gated policy changes**
* Any policy change (severity threshold, allowlist) is a signed request via Authority; audit trail must include signer and DSSE envelope hash.
**Test cases**
* Introduce a new CVE ruleset; verification passes → atomic swap; running scans continue; new scans use N+1 bundle.
* Attempt policy change with invalid signature → rejected; audit log entry created; unchanged policy in effect.
---
## How to wire this in Stella Ops (quick pointers)
* **Offline bundle format**
* `bundle.json` (manifest: file list + SHA-256 + DSSE signature), `/sboms/*.json`, `/feeds/cve/*.sqlite` (or shards), `/rules/*.yaml`, `/provenance/*.intoto.jsonl`.
* Verification entrypoint in .NET 10: `StellaOps.Bundle.VerifyAsync(manifest, keyring)` before any ingestion.
* **Authority integration**
* Define `PolicyChangeRequest` (subject, diff, reason, expiry, DSSE envelope).
* Gate `PUT /policies/*` behind `Authority.Verify(envelope) == true` and `envelope.subject == computed_diff_hash`.
* **Graph determinism**
* `GraphRevisionId = SHA256(Sort(JSON([SBOMRefs, RulesetVersion, FeedBundleIds, LatticeConfig, NormalizationVersion])))` (see the sketch below).
* **Postgres snapshots (until full conversion)**
* Use `pg_basebackup` nightly + `wal-g` for WAL; GitLab job runs restore rehearsal weekly into `stellaops-restore` namespace and asserts revision parity against prod.
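A minimal sketch of the revision-hash computation above; the exact serialization and sort rules are assumptions that would be pinned down by `NormalizationVersion`:

```csharp
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class GraphRevision
{
    // Content-addressed revision ID: identical inputs (regardless of input order)
    // must yield an identical ID, so every list is sorted before serialization.
    public static string Compute(
        IEnumerable<string> sbomRefs,
        string rulesetVersion,
        IEnumerable<string> feedBundleIds,
        string latticeConfigVersion,
        string normalizationVersion)
    {
        var payload = JsonSerializer.Serialize(new
        {
            SbomRefs = sbomRefs.OrderBy(x => x, StringComparer.Ordinal).ToArray(),
            RulesetVersion = rulesetVersion,
            FeedBundleIds = feedBundleIds.OrderBy(x => x, StringComparer.Ordinal).ToArray(),
            LatticeConfigVersion = latticeConfigVersion,
            NormalizationVersion = normalizationVersion
        });

        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(payload)));
    }
}
```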
---
## Minimal developer checklist (copy to your sprint board)
* [ ] Add `BundleVerifier` to scanner startup; block if verification fails.
* [ ] Implement `CacheLastGoodBundle()` and atomic label swap (`/current -> /v-YYYYMMDDHHmm`).
* [ ] Add `SbomGate` with JSON-Schema validation + completeness thresholds.
* [ ] Emit reasoned fallbacks: `REACH_FALLBACK_NO_ANALYZER`, `REACH_FALLBACK_TIMEOUT`.
* [ ] Compute and display `GraphRevisionId` everywhere (API + UI + logs).
* [ ] Configure nightly PG backups + weekly restore rehearsal that asserts revision parity.
* [ ] Route all policy mutations through Authority DSSE verification + auditable ledger entry.
If you want, I can turn this into ready-to-merge .NET test fixtures (xUnit) and a GitLab CI job that runs the feed-tamper/air-gap simulations automatically.
I'll take the 5 “miss” areas and turn them into concrete, implementable test plans, with suggested projects, fixtures, and key cases your team can start coding.
I'll keep names aligned to .NET 10/xUnit and your Stella Ops modules.
---
## 0. Test layout proposal
**Solution structure (tests)**
```text
/tests
  /StellaOps.Bundle.Tests
    BundleVerificationTests.cs
    CachedBundleFallbackTests.cs
  /StellaOps.SbomGate.Tests
    SbomSchemaValidationTests.cs
    SbomCompletenessTests.cs
  /StellaOps.Scanner.Tests
    ScannerOfflineBundleTests.cs
    ReachabilityFallbackTests.cs
    GraphRevisionDeterminismTests.cs
  /StellaOps.DataRecoverability.Tests
    PostgresSnapshotRestoreTests.cs
    GraphReplayParityTests.cs
  /StellaOps.Authority.Tests
    PolicyChangeSignatureTests.cs
  /StellaOps.System.Acceptance
    FeedOutageEndToEndTests.cs
    AirGapModeEndToEndTests.cs
    BundleSwapEndToEndTests.cs
  /testdata
    /bundles
    /sboms
    /graphs
    /db
```
Use xUnit + FluentAssertions, plus Testcontainers for Postgres.
---
## 1) Feed outages & integrity drift
### Objectives
1. Scanner never “goes dark” because the CDN/feed is down.
2. Only **verified** bundles are used; tampered bundles are never ingested.
3. Offline/air-gap mode is a first-class, tested behavior.
### Components under test
* `StellaOps.BundleVerifier` (core library)
* `StellaOps.Scanner.Webservice` (scanner, bundle loader)
* Bundle filesystem layout:
`/opt/stellaops/bundles/v-<timestamp>/*` + `/opt/stellaops/bundles/current` symlink
### Test dimensions
* Network: OK / timeout / 404 / TLS failure / DNS failure.
* Remote bundle: correct / tampered (hash mismatch) / wrong signature / truncated.
* Local cache: last-good present / absent / corrupted.
* Mode: online / offline (air-gap).
### Detailed test suites
#### 1.1 Bundle verification unit tests
**Project:** `StellaOps.Bundle.Tests`
**Fixtures:**
* `testdata/bundles/good-bundle/`
* `testdata/bundles/hash-mismatch-bundle/`
* `testdata/bundles/bad-signature-bundle/`
* `testdata/bundles/missing-file-bundle/`
**Key tests:**
1. `VerifyAsync_ValidBundle_ReturnsSuccess`
* Arrange: Load `good-bundle` manifest + DSSE signature.
* Act: `BundleVerifier.VerifyAsync(manifest, keyring)`
* Assert:
* `result.IsValid == true`
* `result.Files.All(f => f.Status == Verified)`
2. `VerifyAsync_HashMismatch_FailsFast`
* Use `hash-mismatch-bundle`, where one file's SHA-256 differs.
* Assert:
* `IsValid == false`
* `Errors` contains `BUNDLE_FILE_HASH_MISMATCH` and the offending path.
3. `VerifyAsync_InvalidSignature_RejectsBundle`
* DSSE envelope signed with unknown key.
* Assert:
* `IsValid == false`
* `Errors` contains `BUNDLE_SIGNATURE_INVALID`.
4. `VerifyAsync_MissingFile_RejectsBundle`
* Manifest lists file that does not exist on disk.
* Assert:
* `IsValid == false`
* `Errors` contains `BUNDLE_FILE_MISSING`.
#### 1.2 Cached bundle fallback logic
**Class under test:** `BundleManager`
Simplified interface:
```csharp
public interface IBundleManager {
    Task<BundleRef> GetActiveBundleAsync();
    Task<BundleRef> UpdateFromRemoteAsync(CancellationToken ct);
}
```
**Key tests:**
1. `UpdateFromRemoteAsync_RemoteUnavailable_KeepsLastGoodBundle`
* Arrange:
* `lastGood` bundle exists and is marked verified.
* Remote HTTP client always throws `TaskCanceledException` (simulated timeout).
* Act: `UpdateFromRemoteAsync`.
* Assert:
* Returned bundle ID equals `lastGood.Id`.
* No changes to `current` symlink.
2. `UpdateFromRemoteAsync_RemoteTampered_DoesNotReplaceCurrent`
* Remote returns bundle `temp-bundle` which fails `BundleVerifier`.
* Assert:
* `current` still points to `lastGood`.
* An error metric is emitted (e.g. `stellaops_bundle_update_failures_total++`).
3. `GetActiveBundle_NoVerifiedBundle_ThrowsDomainError`
* No bundle is verified on disk.
* `GetActiveBundleAsync` throws a domain exception with code `NO_VERIFIED_BUNDLE_AVAILABLE`.
* Consumption pattern in Scanner: the scanner fails fast on startup with a clear log message.
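A sketch of the first fallback test; `FakeBundleStore`, `FakeRemoteBundleSource`, and the `BundleManager` constructor are assumed test doubles, not existing fixtures:

```csharp
using FluentAssertions;
using Xunit;

public class CachedBundleFallbackTests
{
    [Fact]
    public async Task UpdateFromRemoteAsync_RemoteUnavailable_KeepsLastGoodBundle()
    {
        // Arrange: a verified last-good bundle on "disk", and a remote that always times out.
        var store = new FakeBundleStore(lastGoodId: "v-20251101-0300", verified: true);
        var remote = new FakeRemoteBundleSource(throwOnFetch: new TaskCanceledException("simulated timeout"));
        var manager = new BundleManager(store, remote);

        // Act
        var active = await manager.UpdateFromRemoteAsync(CancellationToken.None);

        // Assert: the failed update must not disturb the active bundle.
        active.Id.Should().Be("v-20251101-0300");
        store.CurrentSymlinkTarget.Should().Be("v-20251101-0300");
    }
}
```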
#### 1.3 Scanner behavior with outages (integration)
**Project:** `StellaOps.Scanner.Tests`
Use in-memory host (`WebApplicationFactory<ScannerProgram>`).
**Scenarios:**
* F1: CDN timeout, last-good present.
* F2: CDN 404, last-good present.
* F3: CDN returns tampered bundle; verification fails.
* F4: Air-gap: network disabled, last-good present.
* F5: Air-gap + no last-good: scanner must refuse to start.
Example test:
```csharp
[Fact]
public async Task Scanner_UsesLastGoodBundle_WhenCdnTimesOut() {
    // Arrange: put good bundle under /bundles/v-1, symlink /bundles/current -> v-1
    using var host = TestScannerHost.WithBundle("v-1", good: true, simulateCdnTimeout: true);
    var scanRequest = TestFixtures.SmallImageScanRequest(); // hypothetical fixture helper

    // Act: call /api/scan with small fixture image
    var response = await host.Client.PostAsJsonAsync("/api/scan", scanRequest);

    // Assert:
    response.StatusCode.Should().Be(HttpStatusCode.OK);
    var content = await response.Content.ReadFromJsonAsync<ScanResult>();
    content!.BundleId.Should().Be("v-1");
    host.Logs.Should().Contain("Falling back to last verified bundle");
}
```
#### 1.4 System acceptance (GitLab CI)
**Job idea:** `acceptance:feed-resilience`
Steps:
1. Spin up `scanner` + stub `feedser` container.
2. Phase A: feed OK → run baseline scan; capture `bundleId` and `graphRevisionId`.
3. Phase B: re-run with feed stub configured to:
* timeout,
* 404,
* return tampered bundle.
4. For each phase:
* Assert `bundleId` remains the baseline one.
* Assert `graphRevisionId` unchanged.
Failure of any assertion should break the pipeline.
---
## 2) SBOM quality & schema drift
### Objectives
1. Only syntactically valid SBOMs are ingested into the graph.
2. Enforce minimum completeness (hash coverage, supplier, etc.).
3. Clear, machine-readable error responses from SBOM ingestion API.
### Components
* `StellaOps.SbomGate` (validation service)
* SBOM ingestion endpoint in Scanner/Concelier: `POST /api/sboms`
### Schema validation tests
**Project:** `StellaOps.SbomGate.Tests`
**Fixtures:**
* `sbom-cdx-1.6-valid.json`
* `sbom-cdx-1.6-malformed.json`
* `sbom-spdx-2.3-valid.json`
* `sbom-unsupported-schema.json`
* `sbom-missing-hashes-10percent.json`
* `sbom-no-supplier.json`
**Key tests:**
1. `Validate_ValidCycloneDx16_Succeeds`
* Assert type `SbomValidationResult.Success`.
* Ensure `DetectedSchema == CycloneDx16`.
2. `Validate_MalformedJson_FailsWithSyntaxError`
* Malformed JSON.
* Assert:
* `IsValid == false`
* `Errors` contains `SBOM_JSON_SYNTAX_ERROR` with path info.
3. `Validate_UnsupportedSchemaVersion_Fails`
* SPDX 2.1 (if you only allow 2.3).
* Expect `SBOM_SCHEMA_UNSUPPORTED` with `schemaUri` echo.
4. `Validate_MissingHashesOverThreshold_Fails`
* SBOM where >5% components lack hashes.
* Policy: `MinHashCoverage = 0.95`.
* Assert:
* `IsValid == false`
* `Errors` contains `SBOM_HASH_COVERAGE_BELOW_THRESHOLD` with actual ratio.
5. `Validate_MissingSupplier_Fails`
* Critical components missing supplier info.
* Expect `SBOM_REQUIRED_FIELD_MISSING` with `component.supplier`.
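An xUnit sketch of test 4; `SbomValidator` and `SbomPolicy` are assumed names for the gate's entry point:

```csharp
using FluentAssertions;
using Xunit;

public class SbomCompletenessTests
{
    [Fact]
    public async Task Validate_MissingHashesOverThreshold_Fails()
    {
        // Arrange: fixture where >5% of components lack hashes; policy requires 95% coverage.
        var sbomJson = await File.ReadAllTextAsync("testdata/sboms/sbom-missing-hashes-10percent.json");
        var validator = new SbomValidator(new SbomPolicy { MinHashCoverage = 0.95 });

        // Act
        var result = await validator.ValidateAsync(sbomJson);

        // Assert: rejection must carry the specific error code (and ideally the measured ratio).
        result.IsValid.Should().BeFalse();
        result.Errors.Should().Contain(e => e.Code == "SBOM_HASH_COVERAGE_BELOW_THRESHOLD");
    }
}
```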
### API-level tests
**Project:** `StellaOps.Scanner.Tests` (or `StellaOps.Concelier.Tests` depending where SBOM ingestion lives).
Key scenarios:
1. `POST /api/sboms` with malformed JSON
* Request body: `sbom-cdx-1.6-malformed.json`.
* Expected:
* HTTP 400.
* Body: `{ "code": "SBOM_VALIDATION_FAILED", "details": [ ... ], "correlationId": "..." }`.
* At least one detail contains `SBOM_JSON_SYNTAX_ERROR`.
2. `POST /api/sboms` with missing hashes
* Body: `sbom-missing-hashes-10percent.json`.
* HTTP 400 with `SBOM_HASH_COVERAGE_BELOW_THRESHOLD`.
3. `POST /api/sboms` with unsupported schema
* Body: `sbom-unsupported-schema.json`.
* HTTP 400 with `SBOM_SCHEMA_UNSUPPORTED`.
4. `POST /api/sboms` valid
* Body: `sbom-cdx-1.6-valid.json`.
* HTTP 202 or 201 (depending on design).
* Response contains SBOM ID; subsequent graph build sees that SBOM.
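A sketch of scenario 1 against the ingestion endpoint, following the `WebApplicationFactory<ScannerProgram>` pattern used earlier; the `SbomErrorResponse` contract below is a hypothetical mirror of the expected body:

```csharp
using System.Net;
using System.Net.Http.Json;
using FluentAssertions;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

public class SbomIngestionApiTests : IClassFixture<WebApplicationFactory<ScannerProgram>>
{
    private readonly HttpClient _client;

    public SbomIngestionApiTests(WebApplicationFactory<ScannerProgram> factory)
        => _client = factory.CreateClient();

    [Fact]
    public async Task PostSboms_MalformedJson_Returns400WithMachineReadableError()
    {
        var malformed = await File.ReadAllTextAsync("testdata/sboms/sbom-cdx-1.6-malformed.json");

        var response = await _client.PostAsync("/api/sboms",
            new StringContent(malformed, System.Text.Encoding.UTF8, "application/json"));

        response.StatusCode.Should().Be(HttpStatusCode.BadRequest);
        var body = await response.Content.ReadFromJsonAsync<SbomErrorResponse>();
        body!.Code.Should().Be("SBOM_VALIDATION_FAILED");
        body.Details.Should().Contain(d => d.Contains("SBOM_JSON_SYNTAX_ERROR"));
    }
}

// Hypothetical error contract mirroring the expected response body above.
public sealed record SbomErrorResponse(string Code, string[] Details, string CorrelationId);
```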
---
## 3) DB/data corruption & operator error
### Objectives
1. You can restore Postgres to a point in time and reproduce previous graph results.
2. Graphs are deterministic given bundle + SBOM + rules.
3. Obvious corruptions are detected and surfaced, not silently masked.
### Components
* Postgres cluster (new canonical store)
* `StellaOps.Scanner.Webservice` (graph builder, persistence)
* `GraphRevisionId` computation
### 3.1 Postgres snapshot / WAL tests
**Project:** `StellaOps.DataRecoverability.Tests`
Use Testcontainers to spin up Postgres.
Scenarios:
1. `PITR_Restore_ReplaysGraphsWithSameRevisionIds`
* Arrange:
* Spin DB container with WAL archiving enabled.
* Apply schema migrations.
* Ingest fixed set of SBOMs + bundle refs + rules.
* Trigger graph build → record `graphRevisionIds` from API.
* Take base backup snapshot (simulate daily snapshot).
* Act:
* Destroy container.
* Start new container from base backup + replay WAL up to a specific LSN.
* Start Scanner against restored DB.
* Query graphs again.
* Assert:
* For each known graph: `revisionId_restored == revisionId_original`.
* Number of nodes/edges is identical.
2. `PartialDataLoss_DetectedByHealthCheck`
* After initial load, deliberately delete some rows (e.g. all edges for a given graph).
* Run health check endpoint, e.g. `/health/graph`.
* Expect:
* HTTP 503.
* Body indicates `GRAPH_INTEGRITY_FAILED` with details of missing edges.
This test forces the discipline of implementing a basic graph integrity check (e.g. counts by state vs. expected).
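A minimal Testcontainers wiring sketch for these scenarios; the `TestDb`/`TestGraphs` helpers are placeholders marking where the real migration, ingestion, snapshot, and restore steps plug in:

```csharp
using FluentAssertions;
using Testcontainers.PostgreSql;
using Xunit;

public class PostgresSnapshotRestoreTests : IAsyncLifetime
{
    private readonly PostgreSqlContainer _postgres = new PostgreSqlBuilder()
        .WithImage("postgres:16")
        .Build();

    public Task InitializeAsync() => _postgres.StartAsync();
    public Task DisposeAsync() => _postgres.DisposeAsync().AsTask();

    [Fact]
    public async Task PITR_Restore_ReplaysGraphsWithSameRevisionIds()
    {
        var connectionString = _postgres.GetConnectionString();

        // Placeholder steps: these helpers do not exist yet and mark where the
        // real migration, ingestion, snapshot, and restore logic plugs in.
        await TestDb.ApplyMigrationsAsync(connectionString);
        var originalRevisions = await TestGraphs.BuildAndRecordRevisionsAsync(connectionString);

        var restoredConnection = await TestDb.SnapshotAndRestoreAsync(_postgres);
        var restoredRevisions = await TestGraphs.BuildAndRecordRevisionsAsync(restoredConnection);

        restoredRevisions.Should().BeEquivalentTo(originalRevisions);
    }
}
```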
### 3.2 Deterministic replay tests
**Project:** `StellaOps.Scanner.Tests` → `GraphRevisionDeterminismTests.cs`
**Precondition:** Graph revision ID computed as:
```csharp
GraphRevisionId = SHA256(
    Normalize([
        BundleId,
        OrderedSbomIds,
        RulesetVersion,
        FeedBundleIds,
        LatticeConfigVersion,
        NormalizationVersion
    ])
);
```
**Scenarios:**
1. `SameInputs_SameRevisionId`
* Run graph build twice for same inputs.
* Assert identical `GraphRevisionId`.
2. `DifferentBundle_DifferentRevisionId`
* Same SBOMs & rules; change vulnerability bundle ID.
* Assert `GraphRevisionId` changes.
3. `DifferentRuleset_DifferentRevisionId`
* Same SBOM & bundle; change ruleset version.
* Assert `GraphRevisionId` changes.
4. `OrderingIrrelevant_StableRevision`
* Provide SBOMs in different order.
* Assert `GraphRevisionId` is unchanged (because inputs are sorted internally).
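A sketch covering scenarios 1 and 4 together; `GraphBuilder.BuildAsync` and `TestInputs.Fixed()` are illustrative names, not existing APIs:

```csharp
using FluentAssertions;
using Xunit;

public class GraphRevisionDeterminismTests
{
    [Fact]
    public async Task SameInputs_AnyOrder_SameRevisionId()
    {
        var inputs = TestInputs.Fixed();   // pinned SBOMs, ruleset, feed bundles
        var shuffled = inputs with { SbomIds = inputs.SbomIds.Reverse().ToArray() };

        var first  = await GraphBuilder.BuildAsync(inputs);
        var second = await GraphBuilder.BuildAsync(inputs);
        var third  = await GraphBuilder.BuildAsync(shuffled);   // input order must not matter

        second.RevisionId.Should().Be(first.RevisionId);
        third.RevisionId.Should().Be(first.RevisionId);
    }
}
```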
---
## 4) Reachability engine & graph evaluation flakiness
### Objectives
1. If reachability cannot be computed, you do not break; you downgrade verdicts with explicit reason codes.
2. Deterministic reachability for “golden fixtures”.
3. Graph evaluation remains stable even when analyzers come and go.
### Components
* `StellaOps.Scanner.Webservice` (lattice / reachability engine)
* Language analyzers (sidecar or gRPC microservices)
* Verdict representation, e.g.:
```csharp
public sealed record VulnerabilityVerdict(
    string Status,      // "NotAffected", "Affected", "PotentiallyAffected"
    string ReasonCode,  // "REACH_CONFIRMED", "REACH_FALLBACK_NO_ANALYZER", ...
    string? AnalyzerId
);
```
### 4.1 Golden reachability fixtures
**Project:** `StellaOps.Scanner.Tests` → `GoldenReachabilityTests.cs`
**Fixtures directory:** `/testdata/reachability/fixture-*/`
Each fixture:
```text
/testdata/reachability/fixture-01-log4j/
  sbom.json
  code-snippets/...
  expected-vex.json
  config.json   # language, entrypoints, etc.
```
**Test pattern:**
For each fixture:
1. Load SBOM + configuration.
2. Trigger reachability analysis.
3. Collect raw reachability graph + final VEX verdicts.
4. Compare to `expected-vex.json` (status + reason codes).
5. Store the `GraphRevisionId` and set it as golden as well.
Key cases:
* R1: simple direct call → reachability confirmed → `Status = "Affected", ReasonCode = "REACH_CONFIRMED"`.
* R2: library present but not called → `Status = "NotAffected", ReasonCode = "REACH_ANALYZED_UNREACHABLE"`.
* R3: language analyzer missing → `Status = "PotentiallyAffected", ReasonCode = "REACH_FALLBACK_NO_ANALYZER"`.
* R4: analysis timeout → `Status = "PotentiallyAffected", ReasonCode = "REACH_FALLBACK_TIMEOUT"`.
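A data-driven sketch of the fixture loop; the directory scan is concrete, while `ReachabilityHarness.AnalyzeAsync` stands in for the real analysis entry point:

```csharp
using System.Text.Json;
using FluentAssertions;
using Xunit;

public class GoldenReachabilityTests
{
    public static IEnumerable<object[]> Fixtures() =>
        Directory.EnumerateDirectories("testdata/reachability", "fixture-*")
                 .Select(dir => new object[] { dir });

    [Theory]
    [MemberData(nameof(Fixtures))]
    public async Task Fixture_ProducesExpectedVerdicts(string fixtureDir)
    {
        // Run the analysis over the fixture's SBOM + config and compare status/reason codes.
        var expectedJson = await File.ReadAllTextAsync(Path.Combine(fixtureDir, "expected-vex.json"));
        var expected = JsonSerializer.Deserialize<VulnerabilityVerdict[]>(expectedJson)!;

        var actual = await ReachabilityHarness.AnalyzeAsync(fixtureDir);

        actual.Should().BeEquivalentTo(expected,
            options => options.Including(v => v.Status).Including(v => v.ReasonCode));
    }
}
```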
### 4.2 Analyzer unavailability / fallback behavior
**Project:** `StellaOps.Scanner.Tests` → `ReachabilityFallbackTests.cs`
Scenarios:
1. `NoAnalyzerRegistered_ForLanguage_UsesFallback`
* Scanner config lists a component in language “go” but no analyzer registered.
* Expect:
* No 500 error from `/api/graphs/...`.
* All applicable vulnerabilities for that component have `Status = "PotentiallyAffected"` and `ReasonCode = "REACH_FALLBACK_NO_ANALYZER"`.
2. `AnalyzerRpcFailure_UsesFallback`
* Analyzer responds with gRPC error or HTTP 500.
* Scanner logs error and keeps going.
* Same semantics as missing analyzer, but with `AnalyzerId` populated and optional `ReasonDetails` (e.g. `RPC_UNAVAILABLE`).
3. `AnalyzerTimeout_UsesTimeoutFallback`
* Force analyzer calls to time out.
* `ReasonCode = "REACH_FALLBACK_TIMEOUT"`.
### 4.3 Concurrency & determinism
Add a test that:
1. Triggers N parallel graph builds for the same inputs.
2. Asserts that:
* All builds succeed.
* All `GraphRevisionId` are identical.
* All reachability reason codes are identical.
This matters for concurrent scanners and guards against race conditions in graph construction.
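A sketch of the parallel-build check, reusing the illustrative `GraphBuilder.BuildAsync` entry point from the determinism tests:

```csharp
using FluentAssertions;
using Xunit;

public class GraphBuildConcurrencyTests
{
    [Fact]
    public async Task ParallelBuilds_SameInputs_ProduceIdenticalRevisionsAndReasonCodes()
    {
        const int parallelism = 8;
        var inputs = TestInputs.Fixed();   // pinned SBOMs, ruleset, feed bundles

        // N concurrent builds of the same inputs must all succeed and agree.
        var builds = await Task.WhenAll(
            Enumerable.Range(0, parallelism).Select(_ => GraphBuilder.BuildAsync(inputs)));

        builds.Select(b => b.RevisionId).Distinct().Should().ContainSingle();

        // Reason codes (as a sorted multiset) must also be identical across builds.
        builds.Select(b => string.Join(",", b.Verdicts.Select(v => v.ReasonCode).OrderBy(r => r)))
              .Distinct().Should().ContainSingle();
    }
}
```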
---
## 5) Update pipelines & job routing
### Objectives
1. Bundle swaps are atomic: scanners see either old or new, never partially written bundles.
2. Policy changes are always signed via Authority; unsigned/invalid changes never apply.
3. Job routing changes (if/when you move to direct microservice pools) remain stateless and testable.
### 5.1 Two-phase bundle swap tests
**Bundle layout:**
* `/opt/stellaops/bundles/current` → symlink to `v-YYYYMMDDHHmmss`
* New bundle:
* Download to `/opt/stellaops/bundles/staging/<temp-id>`
* Verify
* Atomic `ln -s v-new current.tmp && mv -T current.tmp current`
**Project:** `StellaOps.Bundle.Tests` → `BundleSwapTests.cs`
Scenarios:
1. `Swap_Success_IsAtomic`
* Simulate swap in a temp directory.
* During swap, spawn parallel tasks that repeatedly read `current` and open `manifest.json`.
* Assert:
* Readers never fail with “file not found” / partial manifest.
* Readers only see either `v-old` or `v-new`, no mixed state.
2. `Swap_VerificationFails_NoChangeToCurrent`
* Stage bundle which fails `BundleVerifier`.
* After attempted swap:
* `current` still points to `v-old`.
* No new directory with the name expected for `v-new` is referenced by `current`.
3. `Swap_CrashBetweenVerifyAndMv_LeavesSystemConsistent`
* Simulate crash after creating `current.tmp` but before `mv -T`.
* On “restart”:
* Cleanup code must detect `current.tmp` and remove it.
* Ensure `current` still points to last good.
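A sketch of scenario 1; `TempBundleDirectory` and `BundleSwapper` are assumed helpers implementing the stage → verify → rename flow above, and the reader loop is what actually proves atomicity:

```csharp
using System.Collections.Concurrent;
using System.Text.Json;
using FluentAssertions;
using Xunit;

public class BundleSwapTests
{
    [Fact]
    public async Task Swap_Success_IsAtomic()
    {
        using var temp = new TempBundleDirectory(initial: "v-old");   // assumed fixture helper
        var swapper = new BundleSwapper(temp.Root);                   // assumed swap implementation
        using var cts = new CancellationTokenSource();
        var observed = new ConcurrentBag<string>();

        // Hammer "current" while the swap runs; a failed read (missing or partial
        // manifest) throws and fails the test via Task.WhenAll below.
        var readers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
        {
            while (!cts.IsCancellationRequested)
            {
                var manifest = File.ReadAllText(Path.Combine(temp.Root, "current", "manifest.json"));
                observed.Add(JsonDocument.Parse(manifest).RootElement.GetProperty("bundleId").GetString()!);
            }
        })).ToArray();

        await swapper.SwapAsync("v-new");
        cts.Cancel();
        await Task.WhenAll(readers);

        // Readers only ever saw the old or the new bundle, never a mixed state.
        observed.Distinct().Should().BeSubsetOf(new[] { "v-old", "v-new" });
    }
}
```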
### 5.2 Authority-gated policy changes
**Component:** `StellaOps.Authority` + any service that exposes `/policies`.
Policy change flow:
1. Client sends DSSE-signed `PolicyChangeRequest` to `/authority/verify`.
2. Authority validates signature, subject hash.
3. Service applies change only if Authority approves.
**Project:** `StellaOps.Authority.Tests` + `StellaOps.Scanner.Tests` (or wherever policies live).
Key tests:
1. `PolicyChange_WithValidSignature_Applies`
* Signed request's `subject` hash matches the computed diff of the old→new policy.
* Authority returns `Approved`.
* Policy service updates policy; audit log entry recorded.
2. `PolicyChange_InvalidSignature_Rejected`
* Signature is not verifiable with any trusted key, or the payload is corrupted.
* Expect:
* HTTP 403 or 400 from policy endpoint.
* No policy change in DB.
* Audit log entry with reason `SIGNATURE_INVALID`.
3. `PolicyChange_SubjectHashMismatch_Rejected`
* Attacker changes policy body but not DSSE subject.
* On verification, the recomputed diff doesn't match the subject hash.
* Authority rejects with `SUBJECT_MISMATCH`.
4. `PolicyChange_ExpiredEnvelope_Rejected`
* Envelope contains `expiry` in past.
* Authority rejects with `ENVELOPE_EXPIRED`.
5. `PolicyChange_AuditTrail_Complete`
* After valid change:
* Audit log contains: `policyName`, `oldHash`, `newHash`, `signerId`, `envelopeId`, `timestamp`.
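A sketch of test 3 (subject-hash mismatch); `TestPolicies`, `TestSigning`, and `AuthorityVerifier` are assumed helpers standing in for the real Authority flow described above:

```csharp
using System.Security.Cryptography;
using System.Text;
using FluentAssertions;
using Xunit;

public class PolicyChangeSignatureTests
{
    [Fact]
    public async Task PolicyChange_SubjectHashMismatch_Rejected()
    {
        // Arrange: a validly signed envelope whose subject was computed over the
        // original diff, then a tampered policy body that no longer matches it.
        var originalDiff = TestPolicies.Diff(oldPolicy: "sev>=high", newPolicy: "sev>=critical");
        var envelope = TestSigning.SignedEnvelope(
            subject: Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(originalDiff))));

        var tamperedDiff = TestPolicies.Diff(oldPolicy: "sev>=high", newPolicy: "sev>=low");

        // Act: Authority recomputes the diff hash and compares it to the envelope subject.
        var result = await AuthorityVerifier.VerifyAsync(envelope, computedDiff: tamperedDiff);

        // Assert
        result.Approved.Should().BeFalse();
        result.ReasonCode.Should().Be("SUBJECT_MISMATCH");
    }
}
```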
### 5.3 Job routing (if/when you use DB-backed routing tables)
You discussed a `routing` table:
```sql
CREATE TABLE routing (
    domain         text,
    instance_id    uuid,
    last_heartbeat timestamptz,
    table_name     text
);
```
Key tests (once implemented):
1. `HeartbeatExpired_DropsRoutingEntry`
* Insert entry with `last_heartbeat` older than 1 minute.
* Routing GC job should remove it.
* API gateway must not route new jobs to that instance.
2. `RoundRobinAcrossAliveInstances`
* Multiple routing rows for same domain with fresh heartbeats.
* Issue M requests via gateway.
* Assert approximately round-robin distribution across `instance_id`.
3. `NoDurabilityRequired_JobsNotReplayedAfterRestart`
* Confirm that in-memory or temp tables are used appropriately where you do not want durable queues.
If you decide to go with “N gateways × M microservices behind a Docker load balancer only”, the main tests here move to health-check-based routing in the load balancer and become more infrastructure than application tests.
---
## 6) CI wiring summary
To make this actually enforceable:
1. **Unit test job** (`test:unit`)
* Runs `StellaOps.Bundle.Tests`, `StellaOps.SbomGate.Tests`, `StellaOps.Authority.Tests`, `StellaOps.Scanner.Tests`.
2. **DB recoverability job** (`test:db-recoverability`)
* Uses Testcontainers to run `StellaOps.DataRecoverability.Tests`.
* Marked as “required” for `main` branch merges.
3. **Acceptance job** (`test:acceptance-system`)
* Spins up a minimal stack via Docker Compose.
* Executes `StellaOps.System.Acceptance` tests:
* Feed outages & fallback.
* Air-gap modes.
* Bundle swap.
* Can be slower; run on main and release branches.
4. **Nightly chaos job** (`test:nightly-chaos`)
* Optional: run more expensive tests (simulated DB corruption, analyzer outages, etc.).
---
If you want, next step I can generate skeleton xUnit test classes and a `/testdata` layout you can paste directly into your repo (with TODOs where real fixtures are needed).