Here's a compact, practical plan to harden StellaOps around offline-ready security evidence and deterministic verdicts, with just enough background so it all clicks.


Why this matters (quick primer)

  • Air-gapped/offline: Many customers can't reach public feeds or registries. Your scanners, SBOM tooling, and attestations must work with pre-synced bundles and prove what data they used.
  • Interoperability: Teams mix tools (Syft/Grype/Trivy, cosign, CycloneDX/SPDX). Your CI should round-trip SBOMs and attestations end-to-end and prove that downstream consumers (e.g., Grype) can load them.
  • Determinism: Auditors expect “same inputs → same verdict.” Capture inputs, policies, and feed hashes so a verdict is exactly reproducible later.
  • Operational guardrails: Shipping gates should fail early on unknowns and apply backpressure gracefully when load spikes.

E2E test themes to add (what to build)

  1. Air-gapped operation e2e
  • Package “offline bundle” (vuln feeds, package catalogs, policy/lattice rules, certs, keys).
  • Run scans (containers, OS, language deps, binaries) without network.
  • Assert: SBOMs generated, attestations signed/verified, verdicts emitted.
  • Evidence: manifest of bundle contents + hashes in the run log.
  2. Interop round-trips (SBOM ⇄ attestation ⇄ scanner)
  • Produce SBOM (CycloneDX 1.6 and SPDX 3.0.1) with Syft.

  • Create DSSE/cosign attestation for that SBOM.

  • Verify consumer tools:

    • Grype scans from SBOM (no image pull) and respects attestations.
    • Verdict references the exact SBOM digest and attestation chain.
  • Assert: consumers load, validate, and produce identical findings vs direct scan.

  3. Replayability (delta-verdicts + strict replay)
  • Store input set: artifact digest(s), SBOM digests, policy version, feed digests, lattice rules, tool versions.
  • Rerun later; assert byte-identical verdict and same “delta-verdict” when inputs unchanged.
  4. Unknowns-budget policy gates
  • Inject controlled “unknown” conditions (missing CPE mapping, unresolved package source, unparsed distro).
  • Gate: fail build if unknowns > budget (e.g., prod=0, staging≤N).
  • Assert: UI, CLI, and attestation all record unknown counts and gate decision.
  5. Attestation round-trip & validation
  • Produce: build provenance (in-toto/DSSE), SBOM attest, VEX attest, final verdict attest.
  • Verify: signature (cosign), certificate chain, timestamping, Rekor-style (or mirror) inclusion when online; cached proofs when offline.
  • Assert: each attestation is linked in the verdict's evidence index.
  6. Router backpressure chaos (HTTP 429/503 + Retry-After)
  • Load tests that trigger per-instance and per-environment limits.
  • Assert: clients back off per Retry-After, queues drain, no data loss, latencies bounded; UI shows throttling reason.
  7. UI reducer tests for reachability & VEX chips
  • Component tests: large SBOM graphs, focused reachability subgraphs, and VEX status chips (affected/not-affected/under-investigation).
  • Assert: stable rendering under 50k+ nodes; interactions remain <200ms.

Next-week checklist (do these now)

  1. Delta-verdict replay tests: golden corpus; lock tool+feed versions; assert bit-for-bit verdict.
  2. Unknowns-budget gates in CI: policy + failing examples; surface in PR checks and UI.
  3. SBOM attestation round-trip: Syft → cosign attest → Grype consume-from-SBOM; verify signatures & digests.
  4. Router backpressure chaos: scripted spike; verify 429/503 + Retry-After handling and metrics.
  5. UI reducer tests: reachability graph snapshots; VEX chip states; regression suite.

Minimal artifacts to standardize (so tests are boring—good!)

  • Offline bundle spec: bundle.json with content digests (feeds, policies, keys).
  • Evidence manifest: machine-readable index linking verdict → SBOM digest → attestation IDs → tool versions.
  • Delta-verdict schema: captures before/after graph deltas, rule evals, and final gate result.
  • Unknowns taxonomy: codes (e.g., PKG_SOURCE_UNKNOWN, CPE_AMBIG) with severities and budgets.
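
To make these shapes concrete, here is a minimal sketch of the bundle and evidence manifests as .NET records. Every field name is illustrative rather than a committed schema, and serialization is assumed to go through canonical JSON.

```csharp
using System.Collections.Generic;

// Illustrative shapes only; the real bundle.json / evidence manifest schemas are TBD.
public sealed record OfflineBundleManifest(
    string BundleVersion,
    IReadOnlyDictionary<string, string> FeedDigests,    // feed name -> sha256:...
    IReadOnlyDictionary<string, string> PolicyDigests,   // policy id -> sha256:...
    IReadOnlyList<string> TrustRootFingerprints,
    string CreatedAtUtc);                                 // recorded as evidence, never hashed into verdicts

public sealed record EvidenceManifest(
    string VerdictDigest,
    IReadOnlyList<string> SbomDigests,
    IReadOnlyList<string> AttestationIds,
    IReadOnlyDictionary<string, string> ToolVersions,
    IReadOnlyList<string> UnknownCodes);                  // e.g. PKG_SOURCE_UNKNOWN, CPE_AMBIG
```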

CI wiring (quick sketch)

  • Jobs: offline-e2e, interop-e2e, replayable-verdicts, unknowns-gate, router-chaos, ui-reducers.
  • Matrix: {Debian/Alpine/RHEL-like} × {amd64/arm64} × {CycloneDX/SPDX}.
  • Cache discipline: pin tool versions, vendor feeds to content-addressed store.
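
If the matrix is also fanned out inside the test suite (rather than only in pipeline YAML), an xUnit theory can enumerate the same combinations. InteropFlow below is a hypothetical seam for the interop-e2e job.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Xunit;

public class InteropMatrixTests
{
    // One theory case per {distro, arch, SBOM format} combination, mirroring the CI matrix.
    public static IEnumerable<object[]> Matrix()
    {
        string[] distros = { "debian", "alpine", "rhel-like" };
        string[] arches  = { "amd64", "arm64" };
        string[] formats = { "cyclonedx-1.6", "spdx-3.0.1" };

        foreach (var distro in distros)
            foreach (var arch in arches)
                foreach (var format in formats)
                    yield return new object[] { distro, arch, format };
    }

    [Theory]
    [MemberData(nameof(Matrix))]
    public async Task Interop_round_trip_succeeds(string distro, string arch, string sbomFormat)
    {
        // Hypothetical seam: runs the interop-e2e flow for this combination against the
        // pinned corpus and throws (failing the case) on any parity violation.
        await InteropFlow.RunAsync(distro, arch, sbomFormat);
    }
}
```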

Fast success criteria (green = done)

  • Can run full scan + attest + verify with no network.
  • Rerunning a fixed input set yields identical verdict.
  • Grype (from SBOM) matches image scan results within tolerance.
  • Builds auto-fail when unknowns budget exceeded.
  • Router under burst emits correct Retry-After and recovers cleanly.
  • UI handles huge graphs; VEX chips never desync from evidence.

If you want, I'll turn this into GitLab/Gitea pipeline YAML + a tiny sample repo (image, SBOM, policies, and goldens) so your team can plug and play.

Below is a complete, end-to-end testing strategy for Stella Ops that turns your moats (offline readiness, deterministic replayable verdicts, lattice/policy decisioning, attestation provenance, unknowns budgets, router backpressure, UI reachability evidence) into continuously verified guarantees.


1) Non-negotiable test principles

1.1 Determinism as a testable contract

A scan/verdict is deterministic iff the same inputs → byte-identical outputs across time and machines (within defined tolerances; e.g., timestamps are captured as evidence rather than embedded in the payload or affecting its ordering).

Determinism controls (must be enforced by tests; see the canonicalization sketch after this list):

  • Canonical JSON (stable key order, stable array ordering where semantically unordered).

  • Stable sorting for:

    • packages/components
    • vulnerabilities
    • edges in graphs
    • evidence lists
  • Time is an input, never implicit:

    • stamp times in a dedicated evidence field; they never affect hashing or verdict evaluation.
  • PRNG uses explicit seed; seed stored in run manifest.

  • Tool versions + feed digests + policy versions are inputs.

  • Locale/encoding invariants: UTF-8 everywhere; invariant culture in .NET.
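
A minimal sketch of the canonicalization helper these tests would exercise, assuming System.Text.Json nodes; the type and method names are illustrative, not existing StellaOps code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json.Nodes;

// Illustrative canonicalizer: recursively sorts object keys; semantically unordered
// arrays must already be sorted by the producer before serialization.
public static class CanonicalJson
{
    public static string Canonicalize(JsonNode node) => Normalize(node)!.ToJsonString();

    public static string Sha256Hex(string canonicalJson) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonicalJson))).ToLowerInvariant();

    private static JsonNode? Normalize(JsonNode? node) => node switch
    {
        null => null,
        JsonObject obj => new JsonObject(obj
            .OrderBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => KeyValuePair.Create(kv.Key, Normalize(kv.Value)))),
        JsonArray arr => new JsonArray(arr.Select(Normalize).ToArray()),
        // scalars are cloned because a JsonNode cannot be attached to two parents
        _ => JsonNode.Parse(node.ToJsonString())
    };
}
```

With a helper like this, `CanonicalJson.Sha256Hex(CanonicalJson.Canonicalize(node))` should be invariant under key reordering of the source document, which is exactly what the determinism tests assert.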

1.2 Offline by default

Every CI job (except explicitly tagged “online”) runs with no egress.

  • Offline bundle is mandatory input for scanning.
  • Any attempted network call fails the test (proves air-gap compliance).
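
One way to make "any attempted network call fails the test" concrete in .NET, assuming the code under test accepts an injected HttpMessageHandler; the helper name is hypothetical.

```csharp
using System.Net.Http;

// Hypothetical test helper: wire this handler into any HttpClient used by the code under
// test; the first connection attempt throws, proving the offline path needs no egress.
public static class NoEgressHttp
{
    public static HttpMessageHandler CreateHandler() => new SocketsHttpHandler
    {
        ConnectCallback = (context, _) =>
            throw new HttpRequestException(
                $"Egress attempted to {context.DnsEndPoint} during an offline test.")
    };
}

// Usage in a test fixture (illustrative):
//   using var client = new HttpClient(NoEgressHttp.CreateHandler());
```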

1.3 Evidence-first validation

No test merely asserts "verdict == pass"; it must also verify the chain of evidence:

  • verdict references SBOM digest(s)
  • SBOM references artifact digest(s)
  • VEX claims reference vulnerabilities + components + reachability evidence
  • attestations verify cryptographically and chain to configured roots.

1.4 Interop is required, not “nice to have”

Stella Ops must round-trip with:

  • SBOM: CycloneDX 1.6 and SPDX 3.0.1
  • Attestation: DSSE / in-toto style envelopes, cosign-compatible flows
  • Consumer scanners: at least Grype from SBOM; ideally Trivy as cross-check

Interop tests are treated as “compatibility contracts” and block releases.

1.5 Architectural boundary enforcement (your standing rule)

  • Lattice/policy merge algorithms run in scanner.webservice.
  • Concelier and Excitors must “preserve prune source”. This is enforced with tests that detect forbidden behavior (see §6.2).

2) The test portfolio (what kinds of tests exist)

Think “coverage by risk”, not “coverage by lines”.

2.1 Test layers and what they prove

  1. Unit tests (fast, deterministic)
  • Canonicalization, hashing, semantic version range ops
  • Graph delta algorithms
  • Policy rule evaluation primitives
  • Unknowns taxonomy + budgeting math
  • Evidence index assembly
  2. Property-based tests (FsCheck; see the sketch after this list)
  • “Reordering inputs does not change verdict hash”
  • “Graph merge is associative/commutative where policy declares it”
  • “Unknowns budgets always monotonic with missing evidence”
  • Parser robustness: arbitrary JSON for SBOM/VEX envelopes never crashes
  3. Component tests (service + Postgres; optional Valkey)
  • scanner.webservice lattice merge and replay
  • Feed loader and cache behavior (offline feeds)
  • Router backpressure decision logic
  • Attestation verification modules
  4. Contract tests (API compatibility)
  • OpenAPI/JSON schema compatibility for public endpoints
  • Evidence manifest schema backward compatibility
  • OCI artifact layout compatibility (attestation attachments)
  5. Integration tests (multi-service)
  • Router → scanner.webservice → attestor → storage
  • Offline bundle import/export
  • Knowledge snapshot “time travel” replay pipeline
  6. End-to-end tests (realistic flows)
  • scan an image → generate SBOM → produce attestations → decision verdict → UI evidence extraction
  • interop consumers load SBOM and confirm findings parity
  7. Non-functional tests
  • Performance & scale (throughput, memory, large SBOM graphs)
  • Chaos/fault injection (DB restarts, queue spikes, 429/503 backpressure)
  • Security tests (fuzzers, decompression bomb defense, signature bypass resistance)
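
As promised above, a sketch of the FsCheck property layer (item 2). It assumes FsCheck.Xunit; the hasher is an inlined stand-in for the real verdict hash so the example stands alone.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using FsCheck.Xunit;

public class VerdictOrderingProperties
{
    // Stand-in for the real verdict hasher: canonical order is imposed before hashing.
    private static string HashComponents(string[] components) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(
            string.Join("\n", components.OrderBy(c => c, StringComparer.Ordinal)))));

    [Property]
    public bool Reordering_inputs_does_not_change_verdict_hash(string[] components)
    {
        components ??= Array.Empty<string>();
        var reordered = components.Reverse().ToArray(); // any permutation would do
        return HashComponents(components) == HashComponents(reordered);
    }
}
```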

3) Hermetic test harness (how tests run)

3.1 Standard test profiles

You already decided: Postgres is system-of-record, Valkey is ephemeral.

Define two mandatory execution profiles in CI:

  1. Default: Postgres + Valkey
  2. Air-gapped minimal: Postgres only

Both must pass.

3.2 Environment isolation

  • Containers started with no network unless a test explicitly declares “online”.
  • For Kubernetes e2e: apply a default-deny egress NetworkPolicy.

3.3 Golden corpora repository (your “truth set”)

Create a versioned stellaops-test-corpus/ containing:

  • container images (or image tarballs) pinned by digest
  • SBOM expected outputs (CycloneDX + SPDX)
  • VEX examples (vendor/distro/internal)
  • vulnerability feed snapshots (pinned digests)
  • policies + lattice rules + unknown budgets
  • expected verdicts + delta verdicts
  • reachability subgraphs as evidence
  • negative fixtures: malformed SPDX, corrupted DSSE, missing digests, unsupported distros

Every corpus item includes a Run Manifest (see §4).

3.4 Artifact retention in CI

Every failing integration/e2e test uploads:

  • run manifest
  • offline bundle manifest + hashes
  • logs (structured)
  • produced SBOMs
  • attestations
  • verdict + delta verdict
  • evidence index

This turns failures into audit-grade reproductions.


4) Core artifacts that tests must validate

4.1 Run Manifest (replay key)

A scan run is defined by:

  • artifact digests (image/config/layers, or binary hash)
  • SBOM digests produced/consumed
  • vuln feed snapshot digest(s)
  • policy version + lattice rules digest
  • tool versions (scanner, parsers, reachability engine)
  • crypto profile (roots, key IDs, algorithm set)
  • environment profile (postgres-only vs postgres+valkey)
  • seed + canonicalization version

Test invariant: re-running the same manifest produces byte-identical verdict and same evidence references.
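
A sketch of how the run manifest could be modeled in test code. Field names are illustrative, and the digest shown here uses default serialization where a real implementation must apply the canonicalization rules from §1.1.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text.Json;

// Illustrative run manifest; the real schema lives with the corpus, not in test code.
public sealed record RunManifest(
    IReadOnlyList<string> ArtifactDigests,
    IReadOnlyList<string> SbomDigests,
    IReadOnlyList<string> FeedSnapshotDigests,
    string PolicyVersion,
    string LatticeRulesDigest,
    IReadOnlyDictionary<string, string> ToolVersions,
    string CryptoProfile,
    string EnvironmentProfile,        // "postgres-only" or "postgres+valkey"
    int Seed,
    string CanonicalizationVersion)
{
    // Replay key: a content address of the manifest itself. A real implementation must
    // serialize with stable key order (see §1.1); defaults are used here for brevity.
    public string Digest() =>
        "sha256:" + Convert.ToHexString(
            SHA256.HashData(JsonSerializer.SerializeToUtf8Bytes(this))).ToLowerInvariant();
}
```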

4.2 Offline Bundle Manifest

Bundle includes:

  • feeds + indexes
  • policies + lattice rule sets
  • trust roots, intermediate CAs, timestamp roots (as needed)
  • crypto provider modules (for sovereign readiness)
  • optional: Rekor mirror snapshot / inclusion proofs cache

Test invariant: an offline scan is blocked if the bundle is missing required parts; the error is explicit and counts as "unknown" only where policy says so.

4.3 Evidence Index

The verdict is not the product; the product is verdict + evidence graph:

  • pointers to SBOM, VEX, reachability proofs, attestations
  • their digests and verification status
  • unknowns list with codes + remediation hints

Test invariant: every "not affected" claim carries the evidence hooks required by policy ("because the feature flag is off", etc.); otherwise it becomes unknown/fail.


5) Required E2E flows (minimum set)

These are your release blockers.

Flow A: Air-gapped scan and verdict

  • Inputs: image tarball + offline bundle

  • Network: disabled

  • Output: SBOM (CycloneDX + SPDX), attestations, verdict

  • Assertions:

    • no network calls occurred
    • verdict references bundle digest + feed snapshot digest
    • unknowns within budget
    • evidence index complete

Flow B: SBOM interop round-trip

  • Produce SBOM via your pipeline

  • Attach SBOM attestation (DSSE/cosign format)

  • Consumer (Grype-from-SBOM) reads SBOM and produces findings

  • Assertions:

    • consumer can parse SBOM
    • findings parity within defined tolerance
    • verdict references exact SBOM digest used by consumer

Flow C: Deterministic replay

  • Run scan → store run manifest + outputs

  • Run again from same manifest

  • Assertions:

    • verdict bytes identical
    • evidence index identical (except allowed “execution metadata” section)
    • delta verdict is “empty delta”
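
Flow C boils down to a test like this sketch, where TestCorpus and ReplayRunner are hypothetical seams for loading a pinned manifest and invoking the full pipeline.

```csharp
using System.Threading.Tasks;
using Xunit;

public class DeterministicReplayTests
{
    [Fact]
    public async Task Replaying_the_same_manifest_yields_a_byte_identical_verdict()
    {
        // Hypothetical helpers: load a pinned run manifest from the golden corpus and
        // execute the full pipeline, returning the verdict bytes.
        var manifest = TestCorpus.LoadRunManifest("replay/golden-001/run-manifest.json");

        byte[] first = await ReplayRunner.RunAsync(manifest);
        byte[] second = await ReplayRunner.RunAsync(manifest);

        Assert.Equal(first, second); // byte-identical, not merely "semantically equal"
    }
}
```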

Flow D: Diff-aware delta verdict (smart-diff)

  • Two versions of the same image with a controlled change (one dependency bump)

  • Assertions:

    • delta verdict contains only changed nodes/edges
    • risk budget computation based on delta matches expected
    • signed delta verdict validates and is OCI-attached

Flow E: Unknowns budget gates

  • Inject unknowns (unmapped package, missing distro metadata, ambiguous CPE)

  • Policy:

    • prod budget = 0
    • staging budget = N
  • Assertions:

    • prod fails, staging passes
    • unknowns appear in attestation and UI evidence
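
The gate itself is small enough to sketch; the type names and example codes below are illustrative, assuming unknowns are reported as taxonomy codes.

```csharp
using System.Collections.Generic;

public enum GateDecision { Pass, Fail }

public sealed record UnknownsBudget(string Environment, int MaxUnknowns);

public static class UnknownsGate
{
    // Unknowns arrive as taxonomy codes (e.g. "PKG_SOURCE_UNKNOWN", "CPE_AMBIG").
    public static GateDecision Evaluate(IReadOnlyCollection<string> unknownCodes, UnknownsBudget budget) =>
        unknownCodes.Count <= budget.MaxUnknowns ? GateDecision.Pass : GateDecision.Fail;
}

// With one injected unknown:
//   UnknownsGate.Evaluate(unknowns, new UnknownsBudget("prod", 0))    => Fail
//   UnknownsGate.Evaluate(unknowns, new UnknownsBudget("staging", 5)) => Pass
```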

Flow F: Router backpressure under burst

  • Spike requests to a single router instance + environment bucket

  • Assertions:

    • 429/503 with Retry-After emitted correctly
    • clients back off; no request loss
    • metrics expose throttling reasons
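
A sketch of the client-side contract the chaos test asserts: honor Retry-After on 429/503 with a bounded retry loop, never dropping the request. The helper name, retry ceiling, and fallback delay are assumptions.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class BackpressureAwareClient
{
    public static async Task<HttpResponseMessage> SendWithBackoffAsync(
        HttpClient client, Func<HttpRequestMessage> requestFactory, int maxAttempts = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            var response = await client.SendAsync(requestFactory());
            var throttled = response.StatusCode is HttpStatusCode.TooManyRequests
                                                 or HttpStatusCode.ServiceUnavailable;
            if (!throttled || attempt == maxAttempts)
                return response;

            // Honor the server's Retry-After; fall back to exponential backoff otherwise.
            var delay = response.Headers.RetryAfter?.Delta
                        ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
            response.Dispose();
            await Task.Delay(delay);
        }
    }
}
```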

Flow G: Evidence export (“audit pack”)

  • Run scan

  • Export a sealed audit pack (bundle + run manifest + evidence + verdict)

  • Import elsewhere (clean environment)

  • Assertions:

    • replay produces identical verdict
    • signatures verify under imported trust roots

6) Module-specific test requirements

6.1 scanner.webservice (lattice + policy decisioning)

Must have:

  • unit tests for lattice merge algebra
  • property tests: declared commutativity/associativity/idempotency
  • integration tests that merge vendor/distro/internal VEX and confirm precedence rules are policy-driven

Critical invariant tests:

  • “Vendor > distro > internal” must be demonstrably configurable, and wrong merges must fail deterministically.
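
A deliberately tiny sketch of a policy-driven precedence merge, useful as the target of the property tests above; the real lattice algebra is richer, and all names here are illustrative.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum VexStatus { Affected, NotAffected, UnderInvestigation, Unknown }

public sealed record VexClaim(string Source, string VulnId, string ComponentPurl, VexStatus Status);

public static class LatticeMerge
{
    // Precedence is policy input, e.g. ["vendor", "distro", "internal"]; reordering the
    // policy must change the outcome, and tests assert exactly that.
    public static VexStatus Merge(IReadOnlyList<VexClaim> claims, IReadOnlyList<string> precedence)
    {
        foreach (var source in precedence)
        {
            var claim = claims.FirstOrDefault(c => c.Source == source);
            if (claim is not null)
                return claim.Status;
        }
        return VexStatus.Unknown; // no recognized source made a claim
    }
}
```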

6.2 Boundary enforcement: Concelier & Excitors preserve prune source

Add a “behavioral boundary suite”:

  • instrument events/telemetry that record where merges happened

  • feed in conflicting VEX claims and assert:

    • Concelier/Excitors do not resolve conflicts; they retain provenance and “prune source”
    • only scanner.webservice produces the final merged semantics

If Concelier/Excitors output a resolved claim, the test fails.

6.3 Router backpressure and DPoP/nonce rate limiting

  • deterministic unit tests for token bucket math
  • time-controlled tests (virtual clock)
  • integration tests with Valkey + Postgres-only fallbacks
  • chaos tests: Valkey down → router degrades gracefully (local per-instance limiter still works)
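
A sketch of the deterministic token-bucket core with an injected clock, so the "virtual clock" unit tests can advance time explicitly; names and defaults are illustrative.

```csharp
using System;

public sealed class TokenBucket
{
    private readonly double _capacity;
    private readonly double _refillPerSecond;
    private readonly Func<DateTimeOffset> _now; // injected clock: tests drive time explicitly
    private double _tokens;
    private DateTimeOffset _lastRefill;

    public TokenBucket(double capacity, double refillPerSecond, Func<DateTimeOffset> now)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _now = now;
        _tokens = capacity;
        _lastRefill = now();
    }

    public bool TryTake(double cost = 1.0)
    {
        var current = _now();
        var elapsed = (current - _lastRefill).TotalSeconds;
        _tokens = Math.Min(_capacity, _tokens + elapsed * _refillPerSecond);
        _lastRefill = current;

        if (_tokens < cost) return false;
        _tokens -= cost;
        return true;
    }
}
```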

6.4 Storage (Postgres) + Valkey accelerator

  • migration tests: schema upgrades forward/backward in CI
  • replay tests: Postgres-only profile yields same verdict bytes
  • consistency tests: Valkey cache misses never change decision outcomes, only latency

6.5 UI evidence rendering

  • reducer snapshot tests for:

    • reachability subgraph rendering (large graphs)
    • VEX chip states: affected/not-affected/under-investigation/unknown
  • performance budgets:

    • large graph render under threshold (define and enforce)
  • contract tests against evidence index schema


7) Non-functional test program

7.1 Performance and scale tests

Define standard workloads:

  • small image (200 packages)
  • medium (2k packages)
  • large (20k+ packages)
  • “monorepo container” worst case (50k+ nodes graph)

Metrics collected:

  • p50/p95/p99 scan time
  • memory peak
  • DB write volume
  • evidence pack size
  • router throughput + throttle rate

Add regression gates:

  • no more than X% slowdown in p95 vs baseline
  • no more than Y% growth in evidence pack size for unchanged inputs

7.2 Chaos and reliability

Run chaos suites weekly/nightly:

  • kill scanner during run → resume/retry semantics deterministic
  • restart Postgres mid-run → job fails with explicit retryable state
  • corrupt offline bundle file → fails with typed error, not crash
  • burst router + slow downstream → confirms backpressure, not meltdown

7.3 Security robustness tests

  • fuzz parsers: SPDX, CycloneDX, VEX, DSSE envelopes

  • zip/tar bomb defenses (artifact ingestion; see the bounded-extraction sketch after this list)

  • signature bypass attempts:

    • mismatched digest
    • altered payload with valid signature on different content
    • wrong root chain
  • SSRF defense: any URL fields in SBOM/VEX are treated as data, never fetched in offline mode
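
As referenced above, a sketch of a bounded extractor for artifact ingestion. The limits are illustrative; the key point is capping decompressed bytes while streaming, since declared entry sizes can be forged, plus a zip-slip path guard.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class SafeZipExtractor
{
    private const long MaxTotalBytes = 2L * 1024 * 1024 * 1024; // illustrative 2 GiB cap

    public static void Extract(string archivePath, string destinationDir)
    {
        long remaining = MaxTotalBytes;
        var destRoot = Path.GetFullPath(destinationDir) + Path.DirectorySeparatorChar;

        using var archive = ZipFile.OpenRead(archivePath);
        foreach (var entry in archive.Entries)
        {
            if (string.IsNullOrEmpty(entry.Name)) continue; // directory entry

            var target = Path.GetFullPath(Path.Combine(destRoot, entry.FullName));
            if (!target.StartsWith(destRoot, StringComparison.Ordinal))
                throw new InvalidDataException($"Zip-slip attempt: {entry.FullName}");

            Directory.CreateDirectory(Path.GetDirectoryName(target)!);
            using var source = entry.Open();
            using var output = File.Create(target);

            // Cap *decompressed* bytes while copying; headers alone cannot be trusted.
            var buffer = new byte[81920];
            int read;
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                remaining -= read;
                if (remaining < 0)
                    throw new InvalidDataException("Decompression bomb suspected: size cap exceeded.");
                output.Write(buffer, 0, read);
            }
        }
    }
}
```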


8) CI/CD gating rules (what blocks a release)

Release candidate is blocked if any of these fail:

  1. All mandatory E2E flows (§5) pass in both profiles:

    • Postgres-only
    • Postgres+Valkey
  2. Deterministic replay suite:

    • zero non-deterministic diffs in verdict bytes
    • allowed diff list is explicit and reviewed
  3. Interop suite:

    • CycloneDX 1.6 and SPDX 3.0.1 round-trips succeed
    • consumer scanner compatibility tests pass
  4. Risk budgets + unknowns budgets:

    • must pass on corpus, and no regressions against baseline
  5. Backpressure correctness:

    • Retry-After compliance and throttle metrics validated
  6. Performance regression budgets:

    • no breach of p95/memory budgets on standard workloads
  7. Flakiness threshold:

    • if a test flakes more than N times per week, it is quarantined and release is blocked until a deterministic root cause is established (quarantine is allowed only for non-blocking suites, never for §5 flows)

9) Implementation blueprint (how to build this test program)

Phase 0: Harness and corpus

  • Stand up test harness: docker compose + Testcontainers (.NET xUnit)
  • Create corpus repo with 10–20 curated artifacts
  • Implement run manifest + evidence index capture in all tests

Phase 1: Determinism and replay

  • canonicalization utilities + golden verdict bytes
  • replay runner that loads manifest and replays end-to-end
  • add property-based tests for ordering and merge invariants

Phase 2: Offline e2e + interop

  • offline bundle builder + strict “no egress” enforcement
  • SBOM attestation round-trip + consumer parsing suite

Phase 3: Unknowns budgets + delta verdict

  • unknown taxonomy everywhere (UI + attestations)
  • delta verdict generation and signing
  • diff-aware release gates

Phase 4: Backpressure + chaos + performance

  • router throttle chaos suite
  • scale tests with standard workloads and baselines

Phase 5: Audit packs + time-travel snapshots

  • sealed export/import
  • one-command replay for auditors

10) What you should standardize immediately

If you do only three things, do these:

  1. Run Manifest as first-class test artifact
  2. Golden corpus that pins all digests (feeds, policies, images, expected outputs)
  3. “No egress” default in CI with explicit opt-in for online tests

Everything else becomes far easier once these are in place.


If you want, I can also produce a concrete repository layout and CI job matrix (xUnit categories, docker compose profiles, artifact retention conventions, and baseline benchmark scripts) that matches .NET 10 conventions and your Postgres/Valkey profiles.