Here’s a compact, practical plan to harden Stella Ops around **offline‑ready security evidence and deterministic verdicts**, with just enough background so it all clicks.

---

# Why this matters (quick primer)

* **Air‑gapped/offline**: Many customers can’t reach public feeds or registries. Your scanners, SBOM tooling, and attestations must work with **pre‑synced bundles** and prove what data they used.
* **Interoperability**: Teams mix tools (Syft/Grype/Trivy, cosign, CycloneDX/SPDX). Your CI should **round‑trip** SBOMs and attestations end‑to‑end and prove that downstream consumers (e.g., Grype) can load them.
* **Determinism**: Auditors expect **“same inputs → same verdict.”** Capture inputs, policies, and feed hashes so a verdict is exactly reproducible later.
* **Operational guardrails**: Shipping gates should fail early on **unknowns** and apply **backpressure** gracefully when load spikes.

---

# E2E test themes to add (what to build)

1. **Air‑gapped operation e2e**
   * Package an “offline bundle” (vuln feeds, package catalogs, policy/lattice rules, certs, keys).
   * Run scans (containers, OS, language deps, binaries) **without network**.
   * Assert: SBOMs generated, attestations signed/verified, verdicts emitted.
   * Evidence: manifest of bundle contents + hashes in the run log.
2. **Interop round‑trips (SBOM ⇄ attestation ⇄ scanner)**
   * Produce SBOMs (CycloneDX 1.6 and SPDX 3.0.1) with Syft.
   * Create a **DSSE/cosign** attestation for each SBOM.
   * Verify consumer tools:
     * **Grype** scans **from the SBOM** (no image pull) and respects attestations.
     * The verdict references the exact SBOM digest and attestation chain.
   * Assert: consumers load, validate, and produce identical findings vs a direct scan.
3. **Replayability (delta‑verdicts + strict replay)**
   * Store the input set: artifact digest(s), SBOM digests, policy version, feed digests, lattice rules, tool versions.
   * Re‑run later; assert a **byte‑identical verdict** and the same “delta‑verdict” when inputs are unchanged.
4. **Unknowns‑budget policy gates**
   * Inject controlled “unknown” conditions (missing CPE mapping, unresolved package source, unparsed distro).
   * Gate: **fail the build if unknowns > budget** (e.g., prod = 0, staging ≤ N).
   * Assert: UI, CLI, and attestation all record unknown counts and the gate decision.
5. **Attestation round‑trip & validation**
   * Produce: build provenance (in‑toto/DSSE), SBOM attestation, VEX attestation, final **verdict attestation**.
   * Verify: signature (cosign), certificate chain, time‑stamping, Rekor‑style (or mirror) inclusion when online; cached proofs when offline.
   * Assert: each attestation is linked in the verdict’s evidence index.
6. **Router backpressure chaos (HTTP 429/503 + Retry‑After)**
   * Load tests that trigger per‑instance and per‑environment limits.
   * Assert: clients back off per **Retry‑After**, queues drain, no data loss, latencies bounded; the UI shows the throttling reason.
7. **UI reducer tests for reachability & VEX chips**
   * Component tests: large SBOM graphs, focused **reachability subgraphs**, and VEX status chips (affected/not‑affected/under‑investigation).
   * Assert: stable rendering under 50k+ nodes; interactions remain <200 ms.

---

# Next‑week checklist (do these now)

1. **Delta‑verdict replay tests**: golden corpus; lock tool + feed versions; assert bit‑for‑bit verdicts.
2. **Unknowns‑budget gates in CI**: policy + failing examples; surface in PR checks and the UI.
3. **SBOM attestation round‑trip**: Syft → cosign attest → Grype consume‑from‑SBOM; verify signatures & digests (a test sketch follows this list).
4. **Router backpressure chaos**: scripted spike; verify 429/503 + Retry‑After handling and metrics.
5. **UI reducer tests**: reachability graph snapshots; VEX chip states; regression suite.
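To make checklist item 3 concrete, here is a minimal xUnit sketch of the Syft → Grype consume‑from‑SBOM leg (the cosign attest step is omitted for brevity). It assumes `syft` and `grype` are on the PATH with a pre‑synced vulnerability DB from the offline bundle; the corpus path is hypothetical, and the exact CLI flags should be pinned and re‑verified against the tool versions you ship.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.Json;
using Xunit;

public class SbomInteropRoundTripTests
{
    // Runs an external tool, asserts it exited cleanly, and returns its stdout.
    private static string Run(string fileName, string arguments)
    {
        var psi = new ProcessStartInfo(fileName, arguments)
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
        };
        using var process = Process.Start(psi)!;
        var stderrTask = process.StandardError.ReadToEndAsync(); // drain stderr concurrently to avoid pipe deadlock
        var stdout = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        Assert.True(process.ExitCode == 0, $"`{fileName} {arguments}` failed: {stderrTask.Result}");
        return stdout;
    }

    [Fact]
    public void Grype_from_sbom_matches_direct_image_scan()
    {
        // Hypothetical corpus path; in CI the tarball and vuln DB come from the offline bundle.
        const string image = "oci-archive:./corpus/sample-image.tar";

        // 1. Produce a CycloneDX SBOM with Syft.
        Run("syft", $"{image} -o cyclonedx-json=sbom.cdx.json");

        // 2. Scan from the SBOM (no image pull) and directly from the image.
        using var fromSbom  = JsonDocument.Parse(Run("grype", "sbom:./sbom.cdx.json -o json"));
        using var fromImage = JsonDocument.Parse(Run("grype", $"{image} -o json"));

        static HashSet<string> VulnIds(JsonDocument report) =>
            report.RootElement.GetProperty("matches").EnumerateArray()
                  .Select(m => m.GetProperty("vulnerability").GetProperty("id").GetString()!)
                  .ToHashSet();

        // 3. Findings parity (exact set equality here; the real suite allows a defined tolerance).
        Assert.True(VulnIds(fromImage).SetEquals(VulnIds(fromSbom)));
    }
}
```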
---

# Minimal artifacts to standardize (so tests are boring—good!)

* **Offline bundle spec**: `bundle.json` with content digests (feeds, policies, keys).
* **Evidence manifest**: machine‑readable index linking verdict → SBOM digest → attestation IDs → tool versions.
* **Delta‑verdict schema**: captures before/after graph deltas, rule evaluations, and the final gate result.
* **Unknowns taxonomy**: codes (e.g., `PKG_SOURCE_UNKNOWN`, `CPE_AMBIG`) with severities and budgets.

---

# CI wiring (quick sketch)

* **Jobs**: `offline-e2e`, `interop-e2e`, `replayable-verdicts`, `unknowns-gate`, `router-chaos`, `ui-reducers`.
* **Matrix**: {Debian/Alpine/RHEL‑like} × {amd64/arm64} × {CycloneDX/SPDX}.
* **Cache discipline**: pin tool versions; vendor feeds into a content‑addressed store.

---

# Fast success criteria (green = done)

* Can run **full scan + attest + verify** with **no network**.
* Re‑running a fixed input set yields an **identical verdict**.
* Grype (from SBOM) matches image scan results within tolerance.
* Builds auto‑fail when the **unknowns budget is exceeded**.
* Router under burst emits a **correct Retry‑After** and recovers cleanly.
* UI handles huge graphs; VEX chips never desync from evidence.

If you want, I’ll turn this into GitLab/Gitea pipeline YAML + a tiny sample repo (image, SBOM, policies, and goldens) so your team can plug‑and‑play.

---

Below is a complete, end-to-end testing strategy for Stella Ops that turns your moats (offline readiness, deterministic replayable verdicts, lattice/policy decisioning, attestation provenance, unknowns budgets, router backpressure, UI reachability evidence) into continuously verified guarantees.

---

## 1) Non-negotiable test principles

### 1.1 Determinism as a testable contract

A scan/verdict is *deterministic* iff **same inputs → byte-identical outputs** across time and machines (with defined tolerances, e.g. timestamps captured as evidence rather than embedded in the hashed payload).

**Determinism controls (must be enforced by tests):**

* Canonical JSON: stable key order, stable array ordering where arrays are semantically unordered (a canonicalization sketch follows §1.3).
* Stable sorting for:
  * packages/components
  * vulnerabilities
  * edges in graphs
  * evidence lists
* Time is an *input*, never implicit:
  * stamp times in a dedicated evidence field; they must never affect hashing or verdict evaluation.
* PRNG uses an explicit seed; the seed is stored in the run manifest.
* Tool versions + feed digests + policy versions are inputs.
* Locale/encoding invariants: UTF-8 everywhere; invariant culture in .NET.

### 1.2 Offline by default

Every CI job (except those explicitly tagged “online”) runs with **no egress**.

* The offline bundle is a mandatory input for scanning.
* Any attempted network call fails the test (proving air-gap compliance).

### 1.3 Evidence-first validation

No assertion is “verdict == pass” without verifying the chain of evidence:

* the verdict references SBOM digest(s)
* the SBOM references artifact digest(s)
* VEX claims reference vulnerabilities + components + reachability evidence
* attestations verify cryptographically and chain to configured roots
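As a concrete reference for §1.1’s canonicalization control (and the digest checks that §1.3’s evidence chain relies on), here is a minimal C# sketch built on System.Text.Json. It assumes .NET 8+ (for `JsonNode.DeepClone`), and the names `CanonicalJson` and `VerdictHash` are illustrative, not an existing Stella Ops API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

public static class CanonicalJson
{
    // Rebuilds a JSON tree with object keys sorted ordinally. Arrays that are
    // semantically unordered (components, findings, evidence lists) must be
    // sorted by the producer before hashing; canonicalization keeps array order.
    public static JsonNode? Canonicalize(JsonNode? node) => node switch
    {
        JsonObject obj => new JsonObject(
            obj.OrderBy(kv => kv.Key, StringComparer.Ordinal)
               .Select(kv => KeyValuePair.Create(kv.Key, Canonicalize(kv.Value)))),
        JsonArray arr => new JsonArray(arr.Select(Canonicalize).ToArray()),
        _ => node?.DeepClone(),
    };

    // The verdict hash is computed over the canonical form, so key order and
    // formatting differences can never change the digest.
    public static string VerdictHash(string json)
    {
        var canonical = Canonicalize(JsonNode.Parse(json))!.ToJsonString();
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
    }
}
```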
### 1.4 Interop is required, not “nice to have”

Stella Ops must round-trip with:

* SBOM: CycloneDX 1.6 and SPDX 3.0.1
* Attestation: DSSE / in-toto style envelopes, cosign-compatible flows
* Consumer scanners: at least Grype from SBOM; ideally Trivy as a cross-check

Interop tests are treated as “compatibility contracts” and block releases.

### 1.5 Architectural boundary enforcement (your standing rule)

* Lattice/policy merge algorithms run **in `scanner.webservice`**.
* `Concelier` and `Excitors` must “preserve prune source”.

This is enforced with tests that detect forbidden behavior (see §6.2).

---

## 2) The test portfolio (what kinds of tests exist)

Think “coverage by risk”, not “coverage by lines”.

### 2.1 Test layers and what they prove

1. **Unit tests** (fast, deterministic)
   * Canonicalization, hashing, semantic version range ops
   * Graph delta algorithms
   * Policy rule evaluation primitives
   * Unknowns taxonomy + budgeting math
   * Evidence index assembly
2. **Property-based tests** (FsCheck; see the sketch after §3.3)
   * “Reordering inputs does not change the verdict hash”
   * “Graph merge is associative/commutative where policy declares it”
   * “Unknown counts are monotonic in missing evidence” (more missing evidence never lowers the count)
   * Parser robustness: arbitrary JSON for SBOM/VEX envelopes never crashes
3. **Component tests** (service + Postgres; optional Valkey)
   * `scanner.webservice` lattice merge and replay
   * Feed loader and cache behavior (offline feeds)
   * Router backpressure decision logic
   * Attestation verification modules
4. **Contract tests** (API compatibility)
   * OpenAPI/JSON schema compatibility for public endpoints
   * Evidence manifest schema backward compatibility
   * OCI artifact layout compatibility (attestation attachments)
5. **Integration tests** (multi-service)
   * Router → scanner.webservice → attestor → storage
   * Offline bundle import/export
   * Knowledge snapshot “time travel” replay pipeline
6. **End-to-end tests** (realistic flows)
   * scan an image → generate SBOM → produce attestations → verdict decision → UI evidence extraction
   * interop consumers load the SBOM and confirm findings parity
7. **Non-functional tests**
   * Performance & scale (throughput, memory, large SBOM graphs)
   * Chaos/fault injection (DB restarts, queue spikes, 429/503 backpressure)
   * Security tests (fuzzers, decompression bomb defense, signature bypass resistance)

---

## 3) Hermetic test harness (how tests run)

### 3.1 Standard test profiles

You already decided: **Postgres is the system of record**, **Valkey is ephemeral**. Define two mandatory execution profiles in CI:

1. **Default**: Postgres + Valkey
2. **Air-gapped minimal**: Postgres only

Both must pass.

### 3.2 Environment isolation

* Containers are started with **no network** unless a test explicitly declares “online”.
* For Kubernetes e2e: apply a default-deny egress NetworkPolicy.

### 3.3 Golden corpora repository (your “truth set”)

Create a versioned `stellaops-test-corpus/` containing:

* container images (or image tarballs) pinned by digest
* expected SBOM outputs (CycloneDX + SPDX)
* VEX examples (vendor/distro/internal)
* vulnerability feed snapshots (pinned digests)
* policies + lattice rules + unknowns budgets
* expected verdicts + delta verdicts
* reachability subgraphs as evidence
* negative fixtures: malformed SPDX, corrupted DSSE, missing digests, unsupported distros

Every corpus item includes a **Run Manifest** (see §4).
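To illustrate the first property in §2.1’s property-based layer (“reordering inputs does not change the verdict hash”), here is a minimal sketch using the FsCheck.Xunit `[Property]` attribute. `BuildVerdictHash` is a stand-in for the real verdict builder, and it reuses the hypothetical `CanonicalJson` helper sketched after §1.3.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using FsCheck.Xunit;

public class VerdictDeterminismProperties
{
    // Stand-in for the real verdict builder: ordering is imposed internally,
    // so the resulting hash must not depend on the order components arrive in.
    private static string BuildVerdictHash(IEnumerable<string> componentPurls)
    {
        var payload = new { components = componentPurls.OrderBy(p => p, StringComparer.Ordinal) };
        return CanonicalJson.VerdictHash(JsonSerializer.Serialize(payload));
    }

    [Property]
    public bool Component_order_does_not_change_the_verdict_hash(string[] purls, int seed)
    {
        purls ??= Array.Empty<string>();

        // Shuffle deterministically from the generated seed (the seed itself is an input).
        var rng = new Random(seed);
        var shuffled = purls.OrderBy(_ => rng.Next()).ToArray();

        return BuildVerdictHash(purls) == BuildVerdictHash(shuffled);
    }
}
```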
### 3.4 Artifact retention in CI

Every failing integration/e2e test uploads:

* run manifest
* offline bundle manifest + hashes
* logs (structured)
* produced SBOMs
* attestations
* verdict + delta verdict
* evidence index

This turns failures into audit-grade reproductions.

---

## 4) Core artifacts that tests must validate

### 4.1 Run Manifest (replay key)

A scan run is defined by:

* artifact digests (image/config/layers, or binary hash)
* SBOM digests produced/consumed
* vuln feed snapshot digest(s)
* policy version + lattice rules digest
* tool versions (scanner, parsers, reachability engine)
* crypto profile (roots, key IDs, algorithm set)
* environment profile (postgres-only vs postgres+valkey)
* seed + canonicalization version

**Test invariant:** re-running the same manifest produces a **byte-identical verdict** and the **same evidence references**.

### 4.2 Offline Bundle Manifest

The bundle includes:

* feeds + indexes
* policies + lattice rule sets
* trust roots, intermediate CAs, timestamp roots (as needed)
* crypto provider modules (for sovereign readiness)
* optional: Rekor mirror snapshot / inclusion-proof cache

**Test invariant:** an offline scan is blocked if the bundle is missing required parts; the error is explicit and counts as “unknown” only where policy says so.

### 4.3 Evidence Index

The verdict is not the product; the product is verdict + evidence graph:

* pointers to SBOM, VEX, reachability proofs, attestations
* their digests and verification status
* unknowns list with codes + remediation hints

**Test invariant:** every “not affected” claim has the required evidence hooks per policy (“because feature flag off”, etc.); otherwise it becomes unknown/fail.

---

## 5) Required E2E flows (minimum set)

These are your release blockers.

### Flow A: Air-gapped scan and verdict

* Inputs: image tarball + offline bundle
* Network: disabled
* Output: SBOM (CycloneDX + SPDX), attestations, verdict
* Assertions:
  * no network calls occurred
  * verdict references bundle digest + feed snapshot digest
  * unknowns within budget
  * evidence index complete

### Flow B: SBOM interop round-trip

* Produce SBOM via your pipeline
* Attach SBOM attestation (DSSE/cosign format)
* Consumer (Grype-from-SBOM) reads the SBOM and produces findings
* Assertions:
  * consumer can parse the SBOM
  * findings parity within defined tolerance
  * verdict references the exact SBOM digest used by the consumer

### Flow C: Deterministic replay

* Run scan → store run manifest + outputs
* Run again from the same manifest
* Assertions:
  * verdict bytes identical
  * evidence index identical (except the allowed “execution metadata” section)
  * delta verdict is an “empty delta”

### Flow D: Diff-aware delta verdict (smart-diff)

* Two versions of the same image with a controlled change (one dependency bump)
* Assertions:
  * delta verdict contains only changed nodes/edges
  * risk budget computation based on the delta matches expectations
  * signed delta verdict validates and is OCI-attached

### Flow E: Unknowns budget gates

* Inject unknowns (unmapped package, missing distro metadata, ambiguous CPE)
* Policy:
  * prod budget = 0
  * staging budget = N
* Assertions (a minimal gate sketch follows Flow G):
  * prod fails, staging passes
  * unknowns appear in attestation and UI evidence

### Flow F: Router backpressure under burst

* Spike requests to a single router instance + environment bucket
* Assertions:
  * 429/503 with Retry-After emitted correctly
  * clients back off; no request loss
  * metrics expose throttling reasons

### Flow G: Evidence export (“audit pack”)

* Run scan
* Export a sealed audit pack (bundle + run manifest + evidence + verdict)
* Import elsewhere (clean environment)
* Assertions:
  * replay produces an identical verdict
  * signatures verify under imported trust roots
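Here is a minimal sketch of the gate logic Flow E exercises. The unknown codes, record shapes, and `UnknownsGate` name are illustrative assumptions; the real taxonomy and budget schema live in the shared artifacts from “Minimal artifacts to standardize”.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative codes only; the real taxonomy is defined centrally.
public enum UnknownCode { PkgSourceUnknown, CpeAmbiguous, DistroUnparsed }

public sealed record Unknown(UnknownCode Code, string Subject, string RemediationHint);

public sealed record UnknownsBudget(
    string Environment,
    int MaxTotal,
    IReadOnlyDictionary<UnknownCode, int>? PerCode = null);

public static class UnknownsGate
{
    // Returns the gate decision plus a human-readable reason that can be
    // copied into the attestation and surfaced in PR checks and the UI.
    public static (bool Pass, string Reason) Evaluate(
        IReadOnlyCollection<Unknown> unknowns, UnknownsBudget budget)
    {
        if (unknowns.Count > budget.MaxTotal)
            return (false,
                $"{unknowns.Count} unknowns exceed the {budget.Environment} budget of {budget.MaxTotal}");

        foreach (var group in unknowns.GroupBy(u => u.Code))
        {
            if (budget.PerCode is not null &&
                budget.PerCode.TryGetValue(group.Key, out var max) &&
                group.Count() > max)
            {
                return (false,
                    $"{group.Count()}x {group.Key} exceeds its per-code budget of {max} ({budget.Environment})");
            }
        }

        return (true, $"unknowns within {budget.Environment} budget");
    }
}

// Example: the same injected unknown fails prod (budget 0) but passes staging (budget 3).
//   var unknowns = new[] { new Unknown(UnknownCode.CpeAmbiguous, "pkg:npm/left-pad@1.3.0", "pin CPE mapping") };
//   UnknownsGate.Evaluate(unknowns, new UnknownsBudget("prod", 0));     // (false, ...)
//   UnknownsGate.Evaluate(unknowns, new UnknownsBudget("staging", 3));  // (true, ...)
```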
---

## 6) Module-specific test requirements

### 6.1 `scanner.webservice` (lattice + policy decisioning)

Must have:

* unit tests for lattice merge algebra
* property tests: declared commutativity/associativity/idempotency
* integration tests that merge vendor/distro/internal VEX and confirm precedence rules are policy-driven

**Critical invariant tests:**

* “Vendor > distro > internal” must be demonstrably *configurable*, and wrong merges must fail deterministically.

### 6.2 Boundary enforcement: Concelier & Excitors preserve prune source

Add a “behavioral boundary suite”:

* instrument events/telemetry that record where merges happened
* feed in conflicting VEX claims and assert:
  * Concelier/Excitors do not resolve conflicts; they retain provenance and “prune source”
  * only `scanner.webservice` produces the final merged semantics

If Concelier/Excitors output a resolved claim, the test fails.

### 6.3 `Router` backpressure and DPoP/nonce rate limiting

* deterministic unit tests for token bucket math
* time-controlled tests (virtual clock)
* integration tests with Valkey + Postgres-only fallbacks
* chaos tests: Valkey down → router degrades gracefully (local per-instance limiter still works)

### 6.4 Storage (Postgres) + Valkey accelerator

* migration tests: schema upgrades forward/backward in CI
* replay tests: Postgres-only profile yields the same verdict bytes
* consistency tests: Valkey cache misses never change decision outcomes, only latency

### 6.5 UI evidence rendering

* reducer snapshot tests for:
  * reachability subgraph rendering (large graphs)
  * VEX chip states: affected/not-affected/under-investigation/unknown
* performance budgets:
  * large graph render under threshold (define and enforce)
* contract tests against the evidence index schema

---

## 7) Non-functional test program

### 7.1 Performance and scale tests

Define standard workloads:

* small image (200 packages)
* medium (2k packages)
* large (20k+ packages)
* “monorepo container” worst case (50k+ node graph)

Metrics collected:

* p50/p95/p99 scan time
* memory peak
* DB write volume
* evidence pack size
* router throughput + throttle rate

Add regression gates:

* no more than X% slowdown in p95 vs baseline
* no more than Y% growth in evidence pack size for unchanged inputs

### 7.2 Chaos and reliability

Run chaos suites weekly/nightly:

* kill scanner during a run → resume/retry semantics are deterministic
* restart Postgres mid-run → job fails with an explicit retryable state
* corrupt an offline bundle file → fails with a typed error, not a crash
* burst router + slow downstream → confirms backpressure, not meltdown

### 7.3 Security robustness tests

* fuzz parsers: SPDX, CycloneDX, VEX, DSSE envelopes
* zip/tar bomb defenses (artifact ingestion)
* signature bypass attempts:
  * mismatched digest
  * altered payload with a valid signature on different content
  * wrong root chain
* SSRF defense: any URL fields in SBOM/VEX are treated as data, never fetched in offline mode

---

## 8) CI/CD gating rules (what blocks a release)

A release candidate is blocked if any of these fail:

1. All mandatory E2E flows (§5) pass in both profiles:
   * Postgres-only
   * Postgres+Valkey
2. Deterministic replay suite:
   * zero non-deterministic diffs in verdict bytes
   * the allowed-diff list is explicit and reviewed
3. Interop suite:
   * CycloneDX 1.6 and SPDX 3.0.1 round-trips succeed
   * consumer scanner compatibility tests pass
4. Risk budgets + unknowns budgets:
   * must pass on the corpus, with no regressions against baseline
5. Backpressure correctness:
   * Retry-After compliance and throttle metrics validated (see the client sketch after this section)
6. Performance regression budgets:
   * no breach of p95/memory budgets on standard workloads
7. Flakiness threshold:
   * if a test flakes more than N times per week, it is quarantined *and* release is blocked until a deterministic root cause is established (quarantine is allowed only for non-blocking suites, never for §5 flows)
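For the backpressure correctness gate (and Flow F), this is the client-side behavior the tests assert on: honor Retry-After on 429/503 and otherwise fall back to bounded exponential backoff. A minimal C# sketch; the `BackpressureClient` name and retry limits are illustrative.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class BackpressureClient
{
    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient client,
        Func<HttpRequestMessage> requestFactory,   // a factory, because an HttpRequestMessage cannot be re-sent
        int maxAttempts = 5,
        CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            var response = await client.SendAsync(requestFactory(), ct);

            var throttled = response.StatusCode is HttpStatusCode.TooManyRequests
                                                or HttpStatusCode.ServiceUnavailable;
            if (!throttled || attempt >= maxAttempts)
                return response;

            // Prefer the server's Retry-After hint (delta-seconds or HTTP-date);
            // fall back to exponential backoff if the header is absent.
            var delay = response.Headers.RetryAfter switch
            {
                { Delta: { } d }  => d,
                { Date:  { } at } => at - DateTimeOffset.UtcNow,
                _                 => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)),
            };
            if (delay < TimeSpan.Zero) delay = TimeSpan.Zero;

            response.Dispose();
            await Task.Delay(delay, ct);
        }
    }
}
```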
---

## 9) Implementation blueprint (how to build this test program)

### Phase 0: Harness and corpus

* Stand up the test harness: docker compose + Testcontainers (.NET xUnit)
* Create the corpus repo with 10–20 curated artifacts
* Implement run manifest + evidence index capture in all tests

### Phase 1: Determinism and replay

* canonicalization utilities + golden verdict bytes
* replay runner that loads a manifest and replays end-to-end
* add property-based tests for ordering and merge invariants

### Phase 2: Offline e2e + interop

* offline bundle builder + strict “no egress” enforcement
* SBOM attestation round-trip + consumer parsing suite

### Phase 3: Unknowns budgets + delta verdict

* unknowns taxonomy everywhere (UI + attestations)
* delta verdict generation and signing
* diff-aware release gates

### Phase 4: Backpressure + chaos + performance

* router throttle chaos suite
* scale tests with standard workloads and baselines

### Phase 5: Audit packs + time-travel snapshots

* sealed export/import
* one-command replay for auditors

---

## 10) What you should standardize immediately

If you do only three things, do these:

1. **Run Manifest** as a first-class test artifact (a minimal sketch follows at the end of this document)
2. **Golden corpus** that pins all digests (feeds, policies, images, expected outputs)
3. **“No egress” default** in CI with explicit opt-in for online tests

Everything else becomes far easier once these are in place.

---

If you want, I can also produce a concrete repository layout and CI job matrix (xUnit categories, docker compose profiles, artifact retention conventions, and baseline benchmark scripts) that matches .NET 10 conventions and your Postgres/Valkey profiles.
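As a starting point for item 1 in §10, here is a minimal sketch of the Run Manifest (§4.1) as a typed, content-addressed artifact. The field names are illustrative, not the actual Stella Ops schema, and the replay key reuses the hypothetical `CanonicalJson` helper sketched after §1.3.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Minimal sketch of a Run Manifest as a first-class, content-addressed artifact.
public sealed record RunManifest(
    string CanonicalizationVersion,
    int Seed,
    string EnvironmentProfile,                     // "postgres-only" | "postgres+valkey"
    IReadOnlyList<string> ArtifactDigests,         // image config/layer digests or binary hashes
    IReadOnlyList<string> SbomDigests,
    IReadOnlyList<string> FeedSnapshotDigests,
    string PolicyVersion,
    string LatticeRulesDigest,
    IReadOnlyDictionary<string, string> ToolVersions,
    string CryptoProfile)
{
    // Replay key: hash of the canonical JSON form, so "same manifest" (and therefore
    // "same expected verdict") is checkable by digest alone.
    public string ReplayKey() => CanonicalJson.VerdictHash(JsonSerializer.Serialize(this));
}
```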