I thought you might want a sharper picture of what's out there, and what's credible, for building a compact "golden set" of redistributable fixtures to drop into StellaOps' deterministic reachability benchmark. Here's a quick breakdown of the main external benchmark sources that map closely to your list, and what you'd get from integrating them.

---

## 🔎 Key benchmark suites & what they bring

### **SV-COMP (C reachability / verification tasks)**

* SV-COMP is the canonical benchmark for software verification tools, providing curated C programs plus reachability specs (safety, memory-safety, overflow, termination). ([SV-COMP][1])
* Tasks are published via tagged git releases (e.g., `svcomp25`), so you can reproducibly fetch exact sources and property definitions.
* These tasks give deterministic ground truth about reachability/unreachability.

**Why it matters for StellaOps:** plug in real-world-like C code + formal specs.

### **OSS-Fuzz (reproducer corpus for fuzzed bugs)**

* Each OSS-Fuzz issue includes a deterministic reproducer file. ([Google GitHub][2])
* Replaying the reproducer yields the same crash if the binary & sanitizers are consistent.
* Public corpora can become golden fixtures.

**Why it matters:** covers real-world bugs beyond static properties.

---

## ❌ Not perfectly covered (Tier-2 options)

* Juliet / OWASP / Java/Python suites lack a single authoritative, stable distribution.
* Package snapshots (Debian/Alpine) need manual CVE + harness mapping.
* Curated container images (Vulhub) require licensing vetting and orchestration.
* Call-graph corpora (NYXCorpus, SWARM) have no guaranteed stable labels.

**Implication:** Tier-2 fixtures need frozen versions, harness engineering, and license checks.

---

## ✅ Compact "golden set" candidate

| # | Fixture source | Ground truth |
|---|----------------|--------------|
| 1 | SV-COMP ReachSafety / MemSafety / NoOverflows | C programs + formal reachability specs (no `reach_error()` calls, no overflows). |
| 2 | OSS-Fuzz reproducer corpus (C/C++) | Deterministic crash inputs triggering CVEs. |
| 3 | OSS-Fuzz seed corpus | Known-safe vs bug-triggering inputs for stability comparisons. |
| 4 | Additional SV-COMP categories (Heap, Bitvector, Float) | Broader language-feature coverage. |
| 5 | Placeholder: Debian/Alpine package snapshots | Manual metadata + CVEs; build harnesses. |
| 6 | Placeholder: Java/Python OWASP/Juliet-inspired fixtures | Dynamic-language coverage; custom instrumentation required. |
| 7 | Placeholder: Curated vulnerable container images (Vulhub) | Real-world deployment exposures. |
| 8 | Placeholder: Call-graph corpora (NYXCorpus/SWARM) | Dataflow reachability under complex graphs; requires tooling. |

Start with SV-COMP + OSS-Fuzz as Tier-1; add Tier-2 once harnesses & snapshots are ready.

---

## ⚠️ Watch-outs for Tier-2

* Corpus maintenance (OSS-Fuzz seeds can change) requires freezing sources.
* Label ambiguity: container/Java fixtures must have precise success/failure definitions.
* Licensing/distribution for some corpora/containers must be vetted.
* Call-graph/dynamic cases demand instrumentation to produce deterministic labels.

---

## 🎯 Recommendation

Adopt a Tier-1 core (SV-COMP + OSS-Fuzz) for deterministic ground truth. Expand Tier-2 (packages, Java/Python, containers, call graphs) as separate projects with frozen fixtures, metadata (CWE/CVE, PURLs, CycloneDX), and verified harnesses.

Want me to draft an 8–12 fixture "starter pack" (paths, metadata, PURLs) you can copy into StellaOps? Let me know.

[1]: https://sv-comp.sosy-lab.org/2025/benchmarks.php?utm_source=chatgpt.com "Benchmark Verification Tasks - SV-COMP"
[2]: https://google.github.io/oss-fuzz/advanced-topics/reproducing/?utm_source=chatgpt.com "Reproducing OSS-Fuzz issues"
[3]: https://github.com/google/oss-fuzz/discussions/12278?utm_source=chatgpt.com "Corpus management in OSS-Fuzz"

Good, let's design this as a real "v0.1 Golden Set" spec your team can implement, not just a loose starter pack. I'll give you:

1. Design goals and constraints
2. Scope of v0 (the Core-10)
3. Repository layout for the benchmark
4. Fixture metadata schema (YAML)
5. A concrete Core-10 Golden Fixture set (each with purpose & ground truth)
6. A local vulnerability feed for the golden set
7. How to wire this into StellaOps (Scanner/Sbomer/Vexer/UnknownsRegistry/Authority)
8. A short implementation plan for your team

---

## 1. Design goals & constraints

Non-negotiables:

* **Deterministic**: same input → same graph, same verdicts, same logs; no network, no time, no randomness.
* **Compact**: ~10 tiny fixtures, each buildable and runnable in seconds, small SBOMs.
* **Redistributable**: avoid licensing traps by:
  * Preferring **synthetic code** owned by you (MIT/BSD-style).
  * For "realistic" CVEs, using **fake local IDs** (e.g. `FAKE-CVE-2025-0001`) and local feeds, not NVD data.
* **Complete chain**: every fixture ships with:
  * Source + build recipe
  * Binary/container
  * CycloneDX SBOM
  * Local vulnerability feed entries (OSV-like or your own schema)
  * Reference VEX document (OpenVEX or CycloneDX VEX)
  * Expected graph revision ID + reachability verdicts
* **Coverage of patterns**:
  * Safe vs unsafe variants (Not Affected vs Affected)
  * Intra-procedural, inter-procedural, transitive dep
  * OS package vs app-level deps
  * Multi-language (C, .NET, Java, Python)
  * Containerized vs bare-metal

---

## 2. Scope of v0: Core-10 Golden Fixtures

Two tiers, but all small:

* **Core-10** (what you ship and depend on for regression):
  * 5 native/C fixtures (classic reachability & library shape)
  * 1 .NET fixture
  * 1 Java fixture
  * 1 Python fixture
  * 2 container fixtures (OS package style)
* **Extended-X** (optional later): fuzz-style repros, dynamic imports, concurrency, etc.

Below I'll detail the Core-10 so your team can actually implement them.

---

## 3. Repository layout

Recommended layout inside the `stella-ops` mono-repo:

```text
benchmarks/
  reachability/
    golden-v0/
      fixtures/
        C-REACH-UNSAFE-001/
        C-REACH-SAFE-002/
        C-MEM-BOUNDS-003/
        C-INT-OVERFLOW-004/
        C-LIB-TRANSITIVE-005/
        CONTAINER-OSPKG-SAFE-006/
        CONTAINER-OSPKG-UNSAFE-007/
        JAVA-HTTP-UNSAFE-008/
        DOTNET-LIB-PAIR-009/
        PYTHON-IMPORT-UNSAFE-010/
      feeds/
        osv-golden.json            # "FAKE-CVE-*" style vulnerabilities
      authority/
        vex-reference-index.json   # mapping fixture → reference VEX + expected verdict
      README.md
```

Each fixture folder:

```text
fixtures/<fixture-id>/
  src/                          # source code (C, C#, Java, Python, Dockerfile...)
  build/
    Dockerfile                  # if containerised
    build.sh                    # deterministic build commands
  artifacts/
    binary/                     # final exe/jar/dll/image.tar
    sbom.cdx.json               # CycloneDX SBOM (canonical, normalized)
    vex.openvex.json            # reference VEX verdicts
    manifest.fixture.yaml
    graph.reference.json        # canonical normalized graph
    graph.reference.sha256.txt  # hash = "Graph Revision ID"
  docs/
    explanation.md              # human-readable root-cause + reachability explanation
```

---

## 4. Fixture metadata schema (`manifest.fixture.yaml`)

```yaml
id: "C-REACH-UNSAFE-001"
name: "Simple reachable error in C"
version: "0.1.0"
language: "c"
domain: "native"
category:
  - "reachability"
  - "control-flow"

ground_truth:
  vulnerable: true
  reachable: true
  cwes: ["CWE-754"]   # Improper Check for Unusual or Exceptional Conditions
  fake_cves: ["FAKE-CVE-2025-0001"]
  verdict:
    property_type: "reach_error_unreachable"
    property_holds: false
  explanation_ref: "docs/explanation.md"

build:
  type: "local"
  environment: "debian:12"
  commands:
    - "gcc -O0 -g -o app main.c"
  outputs:
    binary: "artifacts/binary/app"

run:
  command: "./artifacts/binary/app"
  args: ["42"]
  env: {}
  stdin: ""
  expected:
    exit_code: 1
    stdout_contains: ["REACH_ERROR"]
    stderr_contains: []

# Optional coverage or traces you might add later
coverage_file: null

sbom:
  path: "artifacts/sbom.cdx.json"
  format: "cyclonedx-1.5"

vex:
  path: "artifacts/vex.openvex.json"
  format: "openvex-0.2"
  statement_ids:
    - "vex-statement-1"

graph:
  reference_path: "artifacts/graph.reference.json"
  revision_id_sha256_path: "artifacts/graph.reference.sha256.txt"

stellaops_tags:
  difficulty: "easy"
  focus:
    - "control-flow"
    - "single-binary"
  used_by:
    - "scanner.webservice"
    - "sbomer"
    - "vexer"
    - "excititor"
```

Your team can add more, but this is enough to wire the benchmark into the pipeline.

---

## 5. Core-10 Golden Fixtures (concrete proposal)

I'll describe each with: purpose, pattern, what the code does, and ground truth.

### 5.1 `C-REACH-UNSAFE-001` – basic reachable error

* **Purpose**: baseline reachability detection on a tiny C program.
* **Pattern**: single `main`, simple branch, error sink function `reach_error()`.
* **Code shape**:
  * `main(int argc, char** argv)` parses integer `x`.
  * If `x == 42`, it calls `reach_error()`, which prints `REACH_ERROR` and exits 1.
* **Ground truth**:
  * Vulnerable: `true`.
  * Reachable: `true` if run with `x=42` (you fix the input in `run.args`).
  * Property: "reach_error is unreachable" → `false` (a counterexample exists).
* **Why it's valuable**:
  * Exercises simple control flow; the "hello world" of deterministic reachability.

### 5.2 `C-REACH-SAFE-002` – safe twin of 001

* **Purpose**: same SBOM shape, but no reachable error, to test "Not Affected".
* **Pattern**: identical to 001 but with an added guard.
* **Code shape**:
  * For example, check `x != 42` or remove the path to `reach_error()`.
* **Ground truth**:
  * Vulnerable: false (no call to `reach_error` at all) *or* treat it as "patched".
  * Reachable: false.
  * Property "reach_error is unreachable" → `true`.
* **Why**:
  * Used to verify that the graph revision is different and the VEX becomes "Not Affected" for the same fake CVE (if you model it that way).
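To make the 001/002 pair concrete, here is a minimal sketch of what the code could look like. The file name and exact shape are illustrative, not prescriptive; the only things that matter are a single guarded branch into the `reach_error()` sink and the fixed `REACH_ERROR` token and exit code from the manifest:

```c
/* fixtures/C-REACH-UNSAFE-001/src/main.c -- illustrative sketch */
#include <stdio.h>
#include <stdlib.h>

/* Deterministic error sink: fixed token on stdout, fixed exit code. */
static void reach_error(void) {
    puts("REACH_ERROR");
    exit(1);
}

int main(int argc, char **argv) {
    /* Input comes only from argv: no clock, no env, no network. */
    int x = (argc > 1) ? atoi(argv[1]) : 0;
    if (x == 42) {      /* 001: reachable because run.args fixes "42" */
        reach_error();
    }
    /* 002 (safe twin): delete the branch above, or invert the guard,
       so no path reaches the sink and the property holds. */
    puts("OK");
    return 0;
}
```

Running `./app 42` prints `REACH_ERROR` and exits 1, matching `run.expected` in the manifest; any other input exits 0.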
### 5.3 `C-MEM-BOUNDS-003` – out-of-bounds write

* **Purpose**: exercise memory-safety property and CWE mapping.
* **Pattern**: fixed-size buffer + unchecked copy.
* **Code shape**:
  * `char buf[16];`
  * Copies `argv[1]` into `buf` with `strcpy` or a manual loop without a bounds check.
* **Ground truth**:
  * Vulnerable: true.
  * Reachable: true on any input with length > 15 (you fix a triggering arg).
  * CWEs: `["CWE-119", "CWE-120"]`.
* **Expected run**:
  * With ASan or similar, the run exits non-zero and reports the buffer overflow; for determinism, you can standardize on exit code 139 and not rely on sanitizer text in tests.
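A sketch of the 003 code shape, assuming the triggering argument is fixed in `run.args` (e.g. a 32-character string); names are illustrative:

```c
/* fixtures/C-MEM-BOUNDS-003/src/main.c -- illustrative sketch */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    char buf[16];
    if (argc < 2) return 0;   /* no input: trivially safe path */
    strcpy(buf, argv[1]);     /* CWE-120: no bounds check; overflows for
                                 any argv[1] longer than 15 bytes */
    printf("copied: %s\n", buf);
    return 0;
}
```

Built with `-fsanitize=address` the overflow aborts deterministically; without a sanitizer, pin whatever exit code your build reliably produces (the 139 suggested above) rather than matching diagnostic text.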
### 5.4 `C-INT-OVERFLOW-004` – integer overflow

* **Purpose**: test handling of arithmetic / overflow-related vulnerabilities.
* **Pattern**: multiplication or addition with insufficient bounds checking.
* **Code shape**:
  * Function `size_t alloc_size(size_t n)` that does `n * 16` without overflow checks, then allocates and writes.
* **Ground truth**:
  * Vulnerable: true.
  * Reachable: true with a crafted large `n`.
  * CWEs: `["CWE-190", "CWE-680"]`.
* **Why**:
  * Lets you validate that your vulnerability feed (fake CVE) asserts "affected" on this component, and that your reachability engine confirms the path.
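A sketch of the 004 shape, again with illustrative names. On a 64-bit target, fixing `n = 1152921504606846976` (2^60) in `run.args` makes `n * 16` wrap to 0, so the allocation is tiny while the write is huge:

```c
/* fixtures/C-INT-OVERFLOW-004/src/main.c -- illustrative sketch */
#include <stdlib.h>
#include <string.h>

/* CWE-190/CWE-680: n * 16 can wrap around, yielding a tiny allocation. */
static size_t alloc_size(size_t n) {
    return n * 16;            /* no overflow check */
}

int main(int argc, char **argv) {
    if (argc < 2) return 0;
    size_t n = strtoull(argv[1], NULL, 10);
    unsigned char *p = malloc(alloc_size(n));   /* wrapped size for large n */
    if (!p) return 0;
    memset(p, 0xAA, n);       /* writes n bytes into the undersized buffer */
    free(p);
    return 0;
}
```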
### 5.5 `C-LIB-TRANSITIVE-005` – vulnerable library, unreachable in app

* **Purpose**: test the core SBOM→VEX story: the component is vulnerable, but not used.
* **Pattern**:
  * `libvuln.a` with function `void do_unsafe(char* input)` containing the same OOB bug as 003.
  * `app.c` links to `libvuln` but never calls `do_unsafe()`.
* **Code shape**:
  * Build a static library from `libvuln.c`.
  * Build `app` so that it uses only `do_safe()` from `libvuln.c`, or so that it merely links but doesn't call anything from the "unsafe" TU.
* **SBOM**:
  * SBOM lists `component: "pkg:generic/libvuln@1.0.0"` with `fake_cves: ["FAKE-CVE-2025-0003"]`.
* **Ground truth**:
  * Vulnerable component present in SBOM: yes.
  * Reachable vulnerable function: no.
  * Correct VEX: "Not Affected: vulnerable code not in execution path for this product".
* **Why**:
  * Canonical demonstration of correct VEX semantics on a real-world pattern: vulnerable lib, harmless usage.
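A sketch of the 005 split, with illustrative names. The unsafe translation unit reuses the 003 overflow; the app links the library but only ever references `do_safe()`:

```c
/* libvuln.c -- built into libvuln.a (pkg:generic/libvuln@1.0.0) */
#include <stdio.h>
#include <string.h>

void do_unsafe(char *input) {      /* FAKE-CVE-2025-0003: same OOB bug as 003 */
    char buf[16];
    strcpy(buf, input);            /* no bounds check */
    printf("unsafe: %s\n", buf);
}

void do_safe(const char *input) {  /* bounded copy, no vulnerability */
    char buf[16];
    snprintf(buf, sizeof buf, "%s", input);
    printf("safe: %s\n", buf);
}
```

```c
/* app.c -- links libvuln.a but never references do_unsafe() */
void do_safe(const char *input);

int main(void) {
    do_safe("hello");              /* only the safe entry point is reachable */
    return 0;
}
```

One design choice to make explicitly: with a static archive, the linker will typically drop the unused `do_unsafe` object unless you force it in (e.g. `-Wl,--whole-archive`) or keep both functions in one translation unit; "vulnerable code physically present but unreachable" is the case this fixture exists to model, so decide which variant you ship.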
---

### 5.6 `CONTAINER-OSPKG-SAFE-006` – OS package, unused binary

* **Purpose**: simulate a vulnerable OS package installed but unused, to test image scanning vs reachability.
* **Pattern**:
  * Minimal container (e.g. `debian:12-slim` or `alpine:3.x`) with installed package `vuln-tool` that is never invoked by the entrypoint.
  * Your app is a trivial "hello" binary.
* **SBOM**:
  * OS-level components include `pkg:generic/vuln-tool@1.0.0`.
* **Ground truth**:
  * Vulnerable: the package is flagged by the local feed.
  * Reachable: false under the specified `CMD` and test scenario.
  * VEX: "Not Affected – vulnerable code present but not invoked in product's operational context."
* **Why**:
  * Tests that StellaOps does not over-report image-level CVEs when nothing in the product's execution profile uses them.

### 5.7 `CONTAINER-OSPKG-UNSAFE-007` – OS package actually used

* **Purpose**: same as 006 but the positive case: the vulnerability is reachable.
* **Pattern**:
  * Same base image and package.
  * Entrypoint script calls `vuln-tool` with crafted input that triggers the bug.
* **Ground truth**:
  * Vulnerable: true.
  * Reachable: true.
  * This should flip the VEX verdict vs 006.
* **Why**:
  * Verifies that your reachability engine + runtime behaviour correctly distinguish "installed but unused" from "installed and actively exploited."

---

### 5.8 `JAVA-HTTP-UNSAFE-008` – vulnerable route in minimal Java service

* **Purpose**: test JVM + HTTP + transitive dep reachability.
* **Pattern**:
  * Small Spring Boot or JAX-RS service with:
    * `/safe` endpoint using only safe methods.
    * `/unsafe` endpoint calling a method in `vuln-lib` that has a simple bug (e.g. path traversal or unsafe deserialization).
* **SBOM**:
  * Component `pkg:maven/org.stellaops/vuln-lib@1.0.0` linked to `FAKE-CVE-2025-0004`.
* **Ground truth**:
  * For an HTTP call to `/unsafe`, the vulnerability is reachable.
  * For `/safe`, not reachable.
* **Benchmark convention**:
  * Fixture defines `run.unsafe` and `run.safe` commands in the manifest (two separate "scenarios" under one fixture ID, or two sub-cases in `manifest.fixture.yaml`).
* **Why**:
  * Exercises language-level dependency resolution, transitive calls, and HTTP entrypoints.

---

### 5.9 `DOTNET-LIB-PAIR-009` – .NET assembly with safe & unsafe variants

* **Purpose**: cover your home turf: .NET 10 / C# pipeline + SBOM & VEX.
* **Pattern**:
  * `Golden.Banking.Core` library with method:
    * `public void Process(string iban)` → suspicious string parsing / regex or overflow.
  * Two apps:
    * `Golden.Banking.App.Unsafe` that calls `Process()` with unsafe behaviour.
    * `Golden.Banking.App.Safe` that never calls `Process()` or uses a safe wrapper.
* **SBOM**:
  * Component `pkg:nuget/Golden.Banking.Core@1.0.0` tied to `FAKE-CVE-2025-0005`.
* **Ground truth**:
  * For `App.Unsafe`, the vulnerability is reachable.
  * For `App.Safe`, not reachable.
* **Why**:
  * Validates your .NET tooling (Sbomer, scanner.webservice) and that your graphs respect assembly boundaries and call sites.

---

### 5.10 `PYTHON-IMPORT-UNSAFE-010` – Python optional import pattern

* **Purpose**: basic coverage for a dynamic / interpreted language with an optional module.
* **Pattern**:
  * `app.py`:
    * Imports `helper`, which conditionally imports `vuln_mod` when `ENABLE_VULN=1`.
    * When enabled, calling the `/unsafe` function triggers, e.g., `eval(user_input)`.
* **SBOM**:
  * Component `pkg:pypi/vuln-mod@1.0.0` → `FAKE-CVE-2025-0006`.
* **Ground truth**:
  * With `ENABLE_VULN=0`, the vulnerable module is not imported → unreachable.
  * With `ENABLE_VULN=1`, reachable.
* **Why**:
  * Simple but realistic test for environment-dependent reachability and Python support.

---

## 6. Local vulnerability feed for the Golden Set

To keep everything sovereign and deterministic, define a small internal OSV-like JSON feed, e.g. `benchmarks/reachability/golden-v0/feeds/osv-golden.json`:

```json
{
  "vulnerabilities": [
    {
      "id": "FAKE-CVE-2025-0001",
      "summary": "Reachable error in sample C program",
      "aliases": [],
      "affected": [
        {
          "package": { "ecosystem": "generic", "name": "C-REACH-UNSAFE-001" },
          "ranges": [{ "type": "SEMVER", "events": [{ "introduced": "0" }] }]
        }
      ],
      "database_specific": { "stellaops_fixture_id": "C-REACH-UNSAFE-001" }
    }
    // ... more FAKE-CVE defs ...
  ]
}
```

Scanner/Feedser in "golden mode" should:

* Use **only** this feed.
* Produce deterministic, closed-world graphs and VEX decisions.

---

## 7. Integration hooks with StellaOps

Make sure each module has a clear use of the golden set:

* **Scanner.Webservice**
  * Input: SBOM + local feed for a fixture.
  * Output: canonical graph JSON and SHA-256 revision.
  * For each fixture, compare the produced `revision_id` against `graph.reference.sha256.txt` (see the sketch after this section).
* **Sbomer**
  * Rebuilds the SBOM from source/binaries and compares it to `artifacts/sbom.cdx.json`.
  * Fails the test if the SBOMs differ in canonicalized form.
* **Vexer / Excititor**
  * Ingests graph + local feed and produces VEX.
  * Compare the resulting VEX to `artifacts/vex.openvex.json`.
* **UnknownsRegistry**
  * For v0 you can keep unknowns minimal, but:
  * At least one fixture (e.g. Python or container) can contain a "deliberately un-PURL-able" file to confirm it enters Unknowns with the expected half-life.
* **Authority**
  * Signs the reference artifacts:
    * SBOM
    * Graph
    * VEX
  * Ensures deterministic attestation for the golden set (you can later publish these as public reference proofs).
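The revision-ID check referenced in the Scanner.Webservice bullet above is just a byte-level hash comparison. A minimal sketch, assuming OpenSSL's EVP API and that `graph.reference.sha256.txt` holds a lowercase hex digest; the file name `verify_revision.c` is illustrative, and in practice this check would live inside the Scanner test harness rather than a standalone tool:

```c
/* verify_revision.c -- sketch: sha256(canonical graph JSON) vs recorded ID */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

/* Hash a file's bytes into a lowercase hex string (64 chars + NUL). */
static int sha256_file(const char *path, char out[65]) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
    unsigned char buf[4096], md[EVP_MAX_MD_SIZE];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, md, &len);
    EVP_MD_CTX_free(ctx);
    fclose(f);
    for (unsigned int i = 0; i < len; i++)
        sprintf(out + 2 * i, "%02x", md[i]);
    return 0;
}

int main(void) {
    char got[65] = {0}, want[65] = {0};
    if (sha256_file("artifacts/graph.reference.json", got)) return 2;
    FILE *f = fopen("artifacts/graph.reference.sha256.txt", "r");
    if (!f || !fgets(want, sizeof want, f)) return 2;
    fclose(f);
    want[strcspn(want, "\r\n")] = 0;   /* tolerate a trailing newline */
    if (strcmp(got, want) != 0) {
        fprintf(stderr, "revision drift: %s != %s\n", got, want);
        return 1;                      /* CI fails on drift */
    }
    puts("revision OK");
    return 0;
}
```

The important property is that the hash is computed over the *canonicalized* graph bytes (sorted keys, deterministic formatting, per §8 step 5), so any drift in Scanner output flips the comparison.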
---

## 8. Implementation plan for your team

You can drop this straight into a ticket or doc.

1. **Scaffold repo structure**
   * Create the `benchmarks/reachability/golden-v0/...` layout as above.
   * Add a top-level `README.md` describing goals and usage.
2. **Implement the 10 fixtures**
   * Each fixture: write minimal code, build scripts, and `manifest.fixture.yaml`.
   * Keep code tiny (1–3 files) and deterministic (no network, no randomness, no wall time).
3. **Generate SBOMs**
   * Use your Sbomer for each artifact.
   * Normalize / canonicalize SBOMs and commit them as `artifacts/sbom.cdx.json`.
4. **Define the FAKE-CVE feed**
   * Create `feeds/osv-golden.json` with 1–2 entries per fixture.
   * Map each entry to the PURLs used in the SBOMs.
5. **Produce reference graphs**
   * Run Scanner in "golden mode" on each fixture's SBOM + feed.
   * Normalize graphs (sorted JSON, deterministic formatting).
   * Compute SHA-256 → store in `graph.reference.sha256.txt`.
6. **Produce reference VEX documents**
   * Run Vexer / Excititor with graphs + feed.
   * Manually review results, edit as needed.
   * Save the final accepted VEX as `artifacts/vex.openvex.json`.
7. **Write explanations**
   * For each fixture, add `docs/explanation.md`:
     * 5–10 lines explaining root cause, path, and why affected / not affected.
8. **Wire into CI**
   * Add a `GoldenReachabilityTests` job that:
     * Builds all fixtures.
     * Regenerates SBOM, graph, and VEX.
     * Compares against reference artifacts.
   * Fail CI if any fixture drifts.
9. **Expose as a developer command**
   * Add a CLI command, e.g.:
     * `stellaops bench reachability --fixture C-REACH-UNSAFE-001`
   * So developers can locally re-run single fixtures during development.

---

If you want, next step I can:

* Take 2–3 of these fixtures (for example `C-REACH-UNSAFE-001`, `C-LIB-TRANSITIVE-005`, and `DOTNET-LIB-PAIR-009`) and draft **actual code sketches + full `manifest.fixture.yaml`** so your devs can literally copy-paste and start implementing.