I thought you might want a sharper picture of what’s out there — and what’s credible — for building a compact “golden set” of redistributable fixtures to drop into StellaOps’ deterministic reachability benchmark. Here’s a quick breakdown of the main external benchmark sources that map closely to your list, and what you’d get from integrating them.
🔎 Key benchmark suites & what they bring
SV-COMP (C reachability / verification tasks)
- SV-COMP is the canonical benchmark for software verification tools, providing curated C programs + reachability specs (safety, mem-safety, overflow, termination). (SV-COMP)
- Tasks are published via tagged git releases (e.g., svcomp25), so you can reproducibly fetch exact sources and property definitions.
- These tasks give deterministic ground truth about reachability/unreachability.
Why it matters for StellaOps: plug in real-world-like C code + formal specs.
OSS-Fuzz (reproducer corpus for fuzzed bugs)
- Each OSS-Fuzz issue includes a deterministic reproducer file. (Google GitHub)
- Replaying the reproducer yields the same crash if the binary & sanitizers are consistent.
- Public corpora can become golden fixtures.
Why it matters: covers real-world bugs beyond static properties.
❌ Not perfectly covered (Tier-2 options)
- Juliet / OWASP / Java/Python suites lack a single authoritative, stable distribution.
- Package snapshots (Debian/Alpine) need manual CVE + harness mapping.
- Curated container images (Vulhub) require licensing vetting and orchestration.
- Call-graph corpora (NYXCorpus, SWARM) have no guaranteed stable labels.
Implication: Tier-2 fixtures need frozen versions, harness engineering, and license checks.
✅ Compact “golden set” candidate
| # | Fixture source | Ground truth |
|---|---|---|
| 1 | SV-COMP ReachSafety / MemSafety / NoOverflows | C programs + formal reachability specs (no reach_error() calls, no overflows). |
| 2 | OSS-Fuzz reproducer corpus (C/C++) | Deterministic crash inputs triggering CVEs. |
| 3 | OSS-Fuzz seed corpus | Known-safe vs bug-triggering inputs for stability comparisons. |
| 4 | Additional SV-COMP categories (Heap, Bitvector, Float) | Broader token coverage. |
| 5 | Placeholder: Debian/Alpine package snapshots | Manual metadata + CVEs; build harnesses. |
| 6 | Placeholder: Java/Python OWASP/Juliet-inspired fixtures | Dynamic languages coverage; custom instrumentation required. |
| 7 | Placeholder: Curated vulnerable container images (Vulhub) | Real-world deployment exposures. |
| 8 | Placeholder: Call-graph corpora (NYXCorpus/SWARM) | Dataflow reachability under complex graphs; requires tooling. |
Start with SV-COMP + OSS-Fuzz as Tier-1; add Tier-2 once harnesses & snapshots are ready.
⚠️ Watch-outs for Tier-2
- Corpus maintenance (OSS-Fuzz seeds can change) requires freezing sources.
- Label ambiguity: container/Java fixtures must have precise success/failure definitions.
- Licensing/distribution for some corpora/containers must be vetted.
- Call-graph/dynamic cases demand instrumentation to produce deterministic labels.
🎯 Recommendation
Adopt a “Tier-1 Cadre” (SV-COMP + OSS-Fuzz) for deterministic ground truth. Expand Tier-2 (packages, Java/Python, containers, call graphs) as separate projects with frozen fixtures, metadata (CWE/CVE, PURLs, CycloneDX), and verified harnesses.
Want me to draft an 8–12 fixture “starter pack” (paths, metadata, PURLs) you can copy into StellaOps? Let me know.
Good, let’s design this as a real “v0.1 Golden Set” spec your team can implement, not just a loose starter pack.
I’ll give you:
- Design goals and constraints
- Repository layout for the benchmark
- Fixture metadata schema (YAML)
- A concrete Core-10 Golden Fixture set (each with purpose & ground truth)
- How to wire this into StellaOps (Scanner/Sbomer/Vexer/UnknownsRegistry/Authority)
- A short implementation plan for your team
1. Design goals & constraints
Non-negotiables:
- Deterministic: same input → same graph, same verdicts, same logs; no network, no time, no randomness.
- Compact: ~10 tiny fixtures, each buildable and runnable in seconds, small SBOMs.
- Redistributable: avoid licensing traps by:
  - Preferring synthetic code owned by you (MIT/BSD-style).
  - For “realistic” CVEs, use fake local IDs (e.g. `FAKE-CVE-2025-0001`) and local feeds, not NVD data.
- Complete chain: every fixture ships with:
  - Source + build recipe
  - Binary/container
  - CycloneDX SBOM
  - Local vulnerability feed entries (OSV-like or your own schema)
  - Reference VEX document (OpenVEX or CycloneDX VEX)
  - Expected graph revision ID + reachability verdicts
- Coverage of patterns:
  - Safe vs unsafe variants (Not Affected vs Affected)
  - Intra-procedural, inter-procedural, transitive dep
  - OS package vs app-level deps
  - Multi-language (C, .NET, Java, Python)
  - Containerized vs bare-metal
2. Scope of v0: Core-10 Golden Fixtures
Two tiers, but all small:
- Core-10 (what you ship and depend on for regression):
  - 5 native/C fixtures (classic reachability & library shape)
  - 1 .NET fixture
  - 1 Java fixture
  - 1 Python fixture
  - 2 container fixtures (OS package style)
- Extended-X (optional later): fuzz-style repros, dynamic imports, concurrency, etc.
Below I’ll detail the Core-10 so your team can actually implement them.
3. Repository layout
Recommended layout inside stella-ops mono-repo:
benchmarks/
  reachability/
    golden-v0/
      fixtures/
        C-REACH-UNSAFE-001/
        C-REACH-SAFE-002/
        C-MEM-BOUNDS-003/
        C-INT-OVERFLOW-004/
        C-LIB-TRANSITIVE-005/
        CONTAINER-OSPKG-SAFE-006/
        CONTAINER-OSPKG-UNSAFE-007/
        JAVA-HTTP-UNSAFE-008/
        DOTNET-LIB-PAIR-009/
        PYTHON-IMPORT-UNSAFE-010/
      feeds/
        osv-golden.json            # “FAKE-CVE-*” style vulnerabilities
      authority/
        vex-reference-index.json   # mapping fixture → reference VEX + expected verdict
      README.md
Each fixture folder:
fixtures/<ID>/
  src/                           # source code (C, C#, Java, Python, Dockerfile...)
  build/
    Dockerfile                   # if containerised
    build.sh                     # deterministic build commands
  artifacts/
    binary/                      # final exe/jar/dll/image.tar
    sbom.cdx.json                # CycloneDX SBOM (canonical, normalized)
    vex.openvex.json             # reference VEX verdicts
    manifest.fixture.yaml
    graph.reference.json         # canonical normalized graph
    graph.reference.sha256.txt   # hash = "Graph Revision ID"
  docs/
    explanation.md               # human-readable root-cause + reachability explanation
4. Fixture metadata schema (manifest.fixture.yaml)
id: "C-REACH-UNSAFE-001"
name: "Simple reachable error in C"
version: "0.1.0"
language: "c"
domain: "native"
category:
  - "reachability"
  - "control-flow"
ground_truth:
  vulnerable: true
  reachable: true
  cwes: ["CWE-754"]  # Improper Check for Unusual or Exceptional Conditions
  fake_cves: ["FAKE-CVE-2025-0001"]
  verdict:
    property_type: "reach_error_unreachable"
    property_holds: false
    explanation_ref: "docs/explanation.md"
build:
  type: "local"
  environment: "debian:12"
  commands:
    - "gcc -O0 -g -o app main.c"
  outputs:
    binary: "artifacts/binary/app"
run:
  command: "./artifacts/binary/app"
  args: ["42"]  # fixed triggering input (x == 42), see fixture 5.1
  env: {}
  stdin: ""
  expected:
    exit_code: 1
    stdout_contains: ["REACH_ERROR"]
    stderr_contains: []
  # Optional coverage or traces you might add later
  coverage_file: null
sbom:
  path: "artifacts/sbom.cdx.json"
  format: "cyclonedx-1.5"
vex:
  path: "artifacts/vex.openvex.json"
  format: "openvex-0.2"
  statement_ids:
    - "vex-statement-1"
graph:
  reference_path: "artifacts/graph.reference.json"
  revision_id_sha256_path: "artifacts/graph.reference.sha256.txt"
stellaops_tags:
  difficulty: "easy"
  focus:
    - "control-flow"
    - "single-binary"
  used_by:
    - "scanner.webservice"
    - "sbomer"
    - "vexer"
    - "excititor"
Your team can add more, but this is enough to wire the benchmark into the pipeline.
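To make the `run` / `expected` part of the manifest concrete, here is a minimal Python sketch of how a harness might evaluate it. The function name `check_run` is hypothetical, the manifest is assumed to be already parsed into a dict, and the demo uses the Python interpreter as a stand-in for the compiled C binary.

```python
import os
import subprocess
import sys


def check_run(run_spec, cwd=None, timeout=30):
    """Run the fixture command; return a list of mismatches (empty list = pass)."""
    expected = run_spec["expected"]
    proc = subprocess.run(
        [run_spec["command"], *run_spec.get("args", [])],
        input=run_spec.get("stdin", ""),
        # Assumption: manifest env is layered over the current environment.
        # A stricter harness would start from an empty, pinned environment.
        env={**os.environ, **run_spec.get("env", {})},
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    problems = []
    if proc.returncode != expected["exit_code"]:
        problems.append(
            f"exit_code: expected {expected['exit_code']}, got {proc.returncode}"
        )
    for needle in expected.get("stdout_contains", []):
        if needle not in proc.stdout:
            problems.append(f"stdout is missing {needle!r}")
    for needle in expected.get("stderr_contains", []):
        if needle not in proc.stderr:
            problems.append(f"stderr is missing {needle!r}")
    return problems


if __name__ == "__main__":
    # Stand-in for ./artifacts/binary/app: prints REACH_ERROR and exits 1.
    demo = {
        "command": sys.executable,
        "args": ["-c", "import sys; print('REACH_ERROR'); sys.exit(1)"],
        "env": {},
        "stdin": "",
        "expected": {
            "exit_code": 1,
            "stdout_contains": ["REACH_ERROR"],
            "stderr_contains": [],
        },
    }
    print(check_run(demo) or "fixture run matches manifest")
```

Returning a list of mismatches instead of raising keeps the harness output stable and easy to diff across runs.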
5. Core-10 Golden Fixtures (concrete proposal)
I’ll describe each with: purpose, pattern, what the code does, and ground truth.
5.1 C-REACH-UNSAFE-001 – basic reachable error
- Purpose: baseline reachability detection on a tiny C program.
- Pattern: single `main`, simple branch, error sink function `reach_error()`.
- Code shape:
  - `main(int argc, char** argv)` parses integer `x`.
  - If `x == 42`, it calls `reach_error()`, which prints `REACH_ERROR` and exits 1.
- Ground truth:
  - Vulnerable: `true`.
  - Reachable: `true` if run with `x=42` (you fix input in `run.args`).
  - Property: “reach_error is unreachable” → `false` (counterexample exists).
- Why it’s valuable:
  - Exercises simple control flow; used as “hello world” of deterministic reachability.
5.2 C-REACH-SAFE-002 – safe twin of 001
- Purpose: same SBOM shape, but no reachable error, to test “Not Affected”.
- Pattern: identical to 001 but with an added guard.
- Code shape:
  - For example, check `x != 42` or remove path to `reach_error()`.
- Ground truth:
  - Vulnerable: false (no call to `reach_error` at all) or treat it as “patched”.
  - Reachable: false.
  - Property “reach_error is unreachable” → `true`.
- Why:
  - Used to verify that graph revision is different and VEX becomes “Not Affected” for the same fake CVE (if you model it that way).
5.3 C-MEM-BOUNDS-003 – out-of-bounds write
- Purpose: exercise memory-safety property and CWE mapping.
- Pattern: fixed-size buffer + unchecked copy.
- Code shape:
  - `char buf[16];`
  - Copies `argv[1]` into `buf` with `strcpy` or manual loop without bounds check.
- Ground truth:
  - Vulnerable: true.
  - Reachable: true on any input with length > 15 (you fix a triggering arg).
  - CWEs: `["CWE-119", "CWE-120"]`.
- Expected run:
  - With ASAN or similar, the run exits non-zero and reports a buffer overflow. For determinism, standardize on a single exit code and do not rely on sanitizer text in tests. The suggestion here is 139 (128 + SIGSEGV); ASan's default exit code is 1, so pin the build and sanitizer options to whatever produces the code you assert on.
5.4 C-INT-OVERFLOW-004 – integer overflow
- Purpose: test handling of arithmetic / overflow-related vulnerabilities.
- Pattern: multiplication or addition with insufficient bounds checking.
- Code shape:
  - Function `size_t alloc_size(size_t n)` that does `n * 16` without overflow checks, then allocates and writes.
- Ground truth:
  - Vulnerable: true.
  - Reachable: true with crafted large `n`.
  - CWEs: `["CWE-190", "CWE-680"]`.
- Why:
  - Lets you validate that your vulnerability feed (fake CVE) asserts “affected” on this component, and your reachability engine confirms the path.
5.5 C-LIB-TRANSITIVE-005 – vulnerable library, unreachable in app
- Purpose: test the core SBOM→VEX story: component is vulnerable, but not used.
- Pattern:
  - `libvuln.a` with function `void do_unsafe(char* input)` containing the same OOB bug as 003.
  - `app.c` links to `libvuln` but never calls `do_unsafe()`.
- Code shape:
  - Build static library from `libvuln.c`.
  - Build `app` that uses only `do_safe()` from `libvuln.c` or that just links but doesn’t call anything from the “unsafe” TU.
- SBOM:
  - SBOM lists `component: "pkg:generic/libvuln@1.0.0"` with `fake_cves: ["FAKE-CVE-2025-0003"]`.
- Ground truth:
  - Vulnerable component present in SBOM: yes.
  - Reachable vulnerable function: no.
  - Correct VEX: “Not Affected: vulnerable code not in execution path for this product”.
- Why:
  - Canonical demonstration of correct VEX semantics on real-world pattern: vulnerable lib, harmless usage.
5.6 CONTAINER-OSPKG-SAFE-006 – OS package, unused binary
- Purpose: simulate vulnerable OS package installed but unused, to test image scanning vs reachability.
- Pattern:
  - Minimal container (e.g. `debian:12-slim` or `alpine:3.x`) with installed package `vuln-tool` that is never invoked by the entrypoint.
  - Your app is a trivial “hello” binary.
- SBOM:
  - OS-level components include `pkg:generic/vuln-tool@1.0.0`.
- Ground truth:
  - Vulnerable: the package is flagged by local feed.
  - Reachable: false under the specified `CMD` and test scenario.
  - VEX: “Not Affected – vulnerable code present but not invoked in product’s operational context.”
- Why:
  - Tests that StellaOps does not over-report image-level CVEs when nothing in the product’s execution profile uses them.
5.7 CONTAINER-OSPKG-UNSAFE-007 – OS package actually used
- Purpose: same as 006 but positive case: vulnerability is reachable.
- Pattern:
  - Same base image and package.
  - Entrypoint script calls `vuln-tool` with crafted input that triggers the bug.
- Ground truth:
  - Vulnerable: true.
  - Reachable: true.
  - This should flip the VEX verdict vs 006.
- Why:
  - Verifies that your reachability engine + runtime behaviour correctly distinguish “installed but unused” from “installed and actively exploited.”
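The verdict flip between 006 and 007 (and between 005 and a reachable variant) can be sketched as a small mapping from ground truth to an OpenVEX-style statement. This is an illustrative sketch only: `vex_statement` is a hypothetical helper, the PURL and FAKE-CVE ID in the demo are placeholders not defined in this spec, and the statement is trimmed to a few fields rather than a complete OpenVEX document. The `status` and `justification` strings follow the OpenVEX vocabulary.

```python
def vex_statement(vuln_id, product_purl, component_present, reachable):
    """Map fixture ground truth to a minimal OpenVEX-style statement dict."""
    if not component_present:
        status, justification = "not_affected", "component_not_present"
    elif not reachable:
        # Installed or linked, but never on an execution path (fixtures 005, 006).
        status, justification = "not_affected", "vulnerable_code_not_in_execute_path"
    else:
        # Installed and exercised (fixture 007).
        status, justification = "affected", None

    statement = {
        "vulnerability": {"name": vuln_id},
        "products": [{"@id": product_purl}],
        "status": status,
    }
    if justification is not None:
        statement["justification"] = justification
    else:
        # OpenVEX expects an action statement for "affected"; text is a placeholder.
        statement["action_statement"] = "Upgrade or remove the vulnerable package."
    return statement


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    purl = "pkg:generic/golden-container@0.1.0"
    print(vex_statement("FAKE-CVE-EXAMPLE", purl, True, False))
    print(vex_statement("FAKE-CVE-EXAMPLE", purl, True, True))
```

Keeping the mapping a pure function of the ground-truth fields makes the expected VEX for each safe/unsafe pair easy to review by hand.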
5.8 JAVA-HTTP-UNSAFE-008 – vulnerable route in minimal Java service
- Purpose: test JVM + HTTP + transitive dep reachability.
- Pattern:
  - Small Spring Boot or JAX-RS service with:
    - `/safe` endpoint using only safe methods.
    - `/unsafe` endpoint calling a method in `vuln-lib` that has a simple bug (e.g. path traversal or unsafe deserialization).
- SBOM:
  - Component `pkg:maven/org.stellaops/vuln-lib@1.0.0` linked to `FAKE-CVE-2025-0004`.
- Ground truth:
  - For an HTTP call to `/unsafe`, vulnerability reachable.
  - For `/safe`, not reachable.
- Benchmark convention:
  - Fixture defines `run.unsafe` and `run.safe` commands in manifest (two separate “scenarios” under one fixture ID, or two sub-cases in `manifest.fixture.yaml`).
- Why:
  - Exercises language-level dependency resolution, transitive calls, and HTTP entrypoints.
5.9 DOTNET-LIB-PAIR-009 – .NET assembly with safe & unsafe variants
- Purpose: cover your home turf: .NET 10 / C# pipeline + SBOM & VEX.
- Pattern:
  - `Golden.Banking.Core` library with method:
    - `public void Process(string iban)` → suspicious string parsing / regex or overflow.
  - Two apps:
    - `Golden.Banking.App.Unsafe` that calls `Process()` with unsafe behaviour.
    - `Golden.Banking.App.Safe` that never calls `Process()` or uses a safe wrapper.
- SBOM:
  - Component `pkg:nuget/Golden.Banking.Core@1.0.0` tied to `FAKE-CVE-2025-0005`.
- Ground truth:
  - For `App.Unsafe`, vulnerability reachable.
  - For `App.Safe`, not reachable.
- Why:
  - Validates your .NET tooling (Sbomer, scanner.webservice) and that your graphs respect assembly boundaries and call sites.
5.10 PYTHON-IMPORT-UNSAFE-010 – Python optional import pattern
- Purpose: basic coverage for dynamic / interpreted language with optional module.
- Pattern:
  - `app.py`:
    - Imports `helper` which conditionally imports `vuln_mod` when `ENABLE_VULN=1`.
    - When enabled, calling `/unsafe` function triggers, e.g., `eval(user_input)`.
- SBOM:
  - Component `pkg:pypi/vuln-mod@1.0.0` → `FAKE-CVE-2025-0006`.
- Ground truth:
  - With `ENABLE_VULN=0`, vulnerable module not imported → unreachable.
  - With `ENABLE_VULN=1`, reachable.
- Why:
  - Simple but realistic test for environment-dependent reachability and Python support.
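A minimal single-file sketch of this pattern, under stated assumptions: the function names `load_unsafe_handler` and `handle_unsafe` are hypothetical, `helper` and `app.py` are collapsed into one file, and `vuln_mod` is registered in-process as a stand-in for the separately packaged module so the sketch runs on its own. The `eval` call is deliberately unsafe because it is the fixture's vulnerable sink.

```python
import importlib
import sys
import types

# Stand-in for the real vuln_mod package (pkg:pypi/vuln-mod@1.0.0 in the SBOM).
_stub = types.ModuleType("vuln_mod")
_stub.evaluate = lambda expr: eval(expr)  # deliberately unsafe sink
sys.modules["vuln_mod"] = _stub


def load_unsafe_handler(environ):
    """helper: resolve vuln_mod only when ENABLE_VULN=1, else return None."""
    if environ.get("ENABLE_VULN") != "1":
        return None
    vuln_mod = importlib.import_module("vuln_mod")
    return vuln_mod.evaluate


def handle_unsafe(user_input, environ):
    """app: the /unsafe entry point; reaches the sink only when enabled."""
    handler = load_unsafe_handler(environ)
    if handler is None:
        return "disabled"
    return handler(user_input)


if __name__ == "__main__":
    print(handle_unsafe("1 + 1", {"ENABLE_VULN": "0"}))  # sink not reached
    print(handle_unsafe("1 + 1", {"ENABLE_VULN": "1"}))  # sink reached
```

Passing the environment as an argument, rather than reading `os.environ` inside the helper, keeps both scenarios reproducible from the manifest's `run.env`.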
6. Local vulnerability feed for the Golden Set
To keep everything sovereign and deterministic, define a small internal OSV-like JSON feed, e.g. benchmarks/reachability/golden-v0/feeds/osv-golden.json:
{
  "vulnerabilities": [
    {
      "id": "FAKE-CVE-2025-0001",
      "summary": "Reachable error in sample C program",
      "aliases": [],
      "affected": [
        {
          "package": {
            "ecosystem": "generic",
            "name": "C-REACH-UNSAFE-001"
          },
          "ranges": [{ "type": "SEMVER", "events": [{ "introduced": "0" }] }]
        }
      ],
      "database_specific": {
        "stellaops_fixture_id": "C-REACH-UNSAFE-001"
      }
    }
    // ... more FAKE-CVE defs ...
  ]
}
Scanner/Feedser in “golden mode” should:
- Use only this feed.
- Produce deterministic, closed-world graphs and VEX decisions.
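As a rough illustration of "use only this feed", the sketch below indexes the feed by package and looks up components in a closed world: anything absent from the feed yields no findings. The helper names `index_feed` and `findings_for` are hypothetical, components are simplified to `(ecosystem, name)` pairs instead of full PURLs, and version ranges are ignored.

```python
import json

# The single entry from the sample feed above (comment line dropped so it parses).
FEED_JSON = """
{
  "vulnerabilities": [
    {
      "id": "FAKE-CVE-2025-0001",
      "summary": "Reachable error in sample C program",
      "aliases": [],
      "affected": [
        {
          "package": {"ecosystem": "generic", "name": "C-REACH-UNSAFE-001"},
          "ranges": [{"type": "SEMVER", "events": [{"introduced": "0"}]}]
        }
      ],
      "database_specific": {"stellaops_fixture_id": "C-REACH-UNSAFE-001"}
    }
  ]
}
"""


def index_feed(feed):
    """Map (ecosystem, name) -> sorted list of vulnerability IDs."""
    index = {}
    for vuln in feed["vulnerabilities"]:
        for affected in vuln.get("affected", []):
            pkg = affected["package"]
            index.setdefault((pkg["ecosystem"], pkg["name"]), []).append(vuln["id"])
    return {key: sorted(ids) for key, ids in index.items()}


def findings_for(index, components):
    """Closed-world lookup: components not in the feed get an empty list."""
    return {
        f"{ecosystem}/{name}": index.get((ecosystem, name), [])
        for ecosystem, name in sorted(components)
    }


if __name__ == "__main__":
    index = index_feed(json.loads(FEED_JSON))
    components = [("generic", "C-REACH-UNSAFE-001"), ("generic", "not-in-feed")]
    print(findings_for(index, components))
```

Sorting both the IDs and the components keeps the output order independent of feed and SBOM ordering.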
7. Integration hooks with StellaOps
Make sure each module has a clear use of the golden set:
- Scanner.Webservice
  - Input: SBOM + local feed for a fixture.
  - Output: canonical graph JSON and SHA-256 revision.
  - For each fixture, compare produced `revision_id` against `graph.reference.sha256.txt`.
- Sbomer
  - Rebuilds SBOM from source/binaries and compares it to `artifacts/sbom.cdx.json`.
  - Fails test if SBOMs differ in canonicalized form.
- Vexer / Excititor
  - Ingests graph + local feed and produces VEX.
  - Compare resulting VEX to `artifacts/vex.openvex.json`.
- UnknownsRegistry
  - For v0 you can keep unknowns minimal, but:
    - At least one fixture (e.g. Python or container) can contain a “deliberately un-PURL-able” file to confirm it enters Unknowns with expected half-life.
- Authority
  - Signs the reference artifacts:
    - SBOM
    - Graph
    - VEX
  - Ensures deterministic attestation for the golden set (you can later publish these as public reference proofs).
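The "canonical graph JSON and SHA-256 revision" step could look roughly like the sketch below. The graph schema (`nodes` with an `id`, `edges` with `from` / `to`) is an assumption made for illustration, as are the function names; the real Scanner output format is not specified here.

```python
import hashlib
import json


def normalize_graph(graph):
    """Sort nodes and edges so list order cannot change the hash.

    Assumed schema: {"nodes": [{"id": ...}], "edges": [{"from": ..., "to": ...}]}.
    """
    return {
        "nodes": sorted(graph.get("nodes", []), key=lambda n: n["id"]),
        "edges": sorted(graph.get("edges", []), key=lambda e: (e["from"], e["to"])),
    }


def canonical_bytes(doc):
    """Serialize with sorted keys and fixed separators, encoded as UTF-8."""
    return json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def revision_id(graph):
    """Graph Revision ID = SHA-256 over the canonical form of the normalized graph."""
    return hashlib.sha256(canonical_bytes(normalize_graph(graph))).hexdigest()


if __name__ == "__main__":
    a = {
        "nodes": [{"id": "main"}, {"id": "reach_error"}],
        "edges": [{"from": "main", "to": "reach_error"}],
    }
    # Same graph, different key order and list order.
    b = {
        "edges": [{"to": "reach_error", "from": "main"}],
        "nodes": [{"id": "reach_error"}, {"id": "main"}],
    }
    print(revision_id(a) == revision_id(b))
```

Note that `sort_keys` only orders object keys, so arrays such as node and edge lists need the explicit sort in `normalize_graph` to be hash-stable.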
8. Implementation plan for your team
You can drop this straight into a ticket or doc.
- Scaffold repo structure
  - Create `benchmarks/reachability/golden-v0/...` layout as above.
  - Add a top-level `README.md` describing goals and usage.
- Implement the 10 fixtures
  - Each fixture: write minimal code, build scripts, and `manifest.fixture.yaml`.
  - Keep code tiny (1–3 files) and deterministic (no network, no randomness, no wall time).
- Generate SBOMs
  - Use your Sbomer for each artifact.
  - Normalize / canonicalize SBOMs and commit them as `artifacts/sbom.cdx.json`.
- Define FAKE-CVE feed
  - Create `feeds/osv-golden.json` with 1–2 entries per fixture.
  - Map each entry to PURLs used in SBOMs.
- Produce reference graphs
  - Run Scanner in “golden mode” on each fixture’s SBOM + feed.
  - Normalize graphs (sorted JSON, deterministic formatting).
  - Compute SHA-256 → store in `graph.reference.sha256.txt`.
- Produce reference VEX documents
  - Run Vexer / Excititor with graphs + feed.
  - Manually review results, edit as needed.
  - Save final accepted VEX as `artifacts/vex.openvex.json`.
- Write explanations
  - For each fixture, add `docs/explanation.md`:
    - 5–10 lines explaining root cause, path, and why affected / not affected.
- Wire into CI
  - Add a `GoldenReachabilityTests` job that:
    - Builds all fixtures.
    - Regenerates SBOM, graph, and VEX.
    - Compares against reference artifacts.
  - Fail CI if any fixture drifts.
- Expose as a developer command
  - Add a CLI command, e.g.: `stellaops bench reachability --fixture C-REACH-UNSAFE-001`
  - So developers can locally re-run single fixtures during development.
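The "fail CI if any fixture drifts" step might be sketched as below. It assumes the regenerated artifacts are written to a directory tree that mirrors `fixtures/<ID>/artifacts/`; the function name `find_drift` and that mirrored layout are illustrative assumptions, not an existing StellaOps interface.

```python
import json
from pathlib import Path

# JSON artifacts compared per fixture (names taken from the fixture layout).
ARTIFACTS = ("sbom.cdx.json", "graph.reference.json", "vex.openvex.json")


def _canonical(path):
    """Load a JSON file and re-serialize it with sorted keys for comparison."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))


def find_drift(reference_root, regenerated_root):
    """Return sorted (fixture_id, artifact, reason) tuples; empty list = no drift."""
    drift = []
    reference_root = Path(reference_root)
    regenerated_root = Path(regenerated_root)
    for fixture_dir in sorted(p for p in reference_root.iterdir() if p.is_dir()):
        for name in ARTIFACTS:
            ref = fixture_dir / "artifacts" / name
            new = regenerated_root / fixture_dir.name / "artifacts" / name
            if not ref.exists():
                drift.append((fixture_dir.name, name, "no-reference"))
            elif not new.exists():
                drift.append((fixture_dir.name, name, "missing"))
            elif _canonical(ref) != _canonical(new):
                drift.append((fixture_dir.name, name, "changed"))
    return drift


if __name__ == "__main__":
    import sys

    # Usage (hypothetical paths): python drift.py fixtures/ regenerated/
    problems = find_drift(sys.argv[1], sys.argv[2])
    for fixture_id, artifact, reason in problems:
        print(f"{fixture_id}: {artifact} {reason}")
    sys.exit(1 if problems else 0)
```

A non-zero exit code on any drift lets the CI job fail without extra glue, and the sorted report stays identical between runs.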
If you want, next step I can:
- Take 2–3 of these fixtures (for example `C-REACH-UNSAFE-001`, `C-LIB-TRANSITIVE-005`, and `DOTNET-LIB-PAIR-009`) and draft actual code sketches + full `manifest.fixture.yaml` so your devs can literally copy-paste and start implementing.