Designing a Deterministic Reachability Benchmark

Here's a clean, action-ready blueprint for a public reachability benchmark you can stand up quickly and grow over time.

Why this matters (quick)

“Reachability” asks: is a flagged vulnerability actually executable from real entry points in this codebase/container? A public, reproducible benchmark lets you compare tools apples-to-apples, drive research, and keep vendors honest.

What to collect (dataset design)

  • Projects & languages

    • Polyglot mix: C/C++ (ELF/PE/Mach-O), Java/Kotlin, C#/.NET, Python, JavaScript/TypeScript, PHP, Go, Rust.
    • For each project: small (≤5k LOC), medium (5k-100k), large (100k+).
  • Ground-truth artifacts

    • Seed CVEs with known sinks (e.g., deserializers, command exec, SSRF) and neutral projects with no reachable path (negatives).
    • Exploit oracles: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
  • Build outputs (deterministic)

    • Reproducible binaries/bytecode (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
    • SBOM (CycloneDX/SPDX) + PURLs + Build-ID (ELF .note.gnu.build-id / PE Authentihash / Mach-O UUID); see the sketch after this list.
    • Attestations: in-toto/DSSE envelopes recording toolchain versions, flags, hashes.
  • Execution traces (for truth)

    • CI traces: call-graph dumps from compilers/analyzers; unit-test coverage; optional dynamic traces (eBPF/.NET ETW/Java Flight Recorder).
    • Entrypoint manifests: HTTP routes, CLI commands, cron/queue consumers.
  • Metadata

    • Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.
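
For the Build-ID bullet above, here is a minimal sketch of how the ELF variant could be captured during collection, assuming GNU readelf is on PATH (the function name and example path are illustrative):

import re
import subprocess

def elf_build_id(path: str) -> str | None:
    """Return the GNU Build-ID recorded in an ELF file, or None if absent."""
    # readelf -n lists note sections; the .note.gnu.build-id entry prints "Build ID: <hex>".
    out = subprocess.run(
        ["readelf", "-n", path], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"Build ID:\s*([0-9a-f]+)", out)
    return match.group(1) if match else None

# Example (path is illustrative):
# print(elf_build_id("outputs/bin/httpd-like"))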

How to label ground truth

  • Per-vuln case: (component, version, sink_id) with label reachable / unreachable / unknown.
  • Evidence bundle: pointer to (a) static call path, (b) dynamic hit (trace/coverage), or (c) rationale for negative.
  • Confidence: high (static+dynamic agree), medium (one source), low (heuristic only).
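
As a sketch, the per-vuln label, evidence bundle, and confidence could be carried in a record like this (field names are illustrative, not the final schema):

from dataclasses import dataclass, field
from typing import Literal

@dataclass
class GroundTruthCase:
    component: str
    version: str
    sink_id: str
    label: Literal["reachable", "unreachable", "unknown"]
    confidence: Literal["high", "medium", "low"]
    # Evidence pointers: static call path, dynamic trace/coverage hit,
    # or a written rationale for a negative label.
    evidence: list[str] = field(default_factory=list)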

Scoring (simple + fair)

  • Binary classification on cases:

    • Precision, Recall, F1. Report AUPR if you output probabilities.
  • Path quality

    • Explainability score (0-3):

      • 0: “vuln reachable” w/o context
      • 1: names only (entry→…→sink)
      • 2: full interprocedural path w/ locations
      • 3: plus inputs/guards (taint/constraints, env flags)
  • Runtime cost

    • Wall-clock, peak RAM, image size; normalized by KLOC.
  • Determinism

    • Rerun variance (≤1% is “A”, 1-5% “B”, >5% “C”).
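
A rough sketch of how the runtime normalization and determinism grade above could be computed (thresholds follow the bullets; function names are illustrative):

def runtime_per_kloc(wall_clock_s: float, kloc: float) -> float:
    """Analysis seconds per thousand lines of code."""
    return wall_clock_s / max(kloc, 0.001)

def determinism_grade(metric_across_reruns: list[float]) -> str:
    """Grade rerun variance: <=1% is A, 1-5% is B, >5% is C."""
    mean = sum(metric_across_reruns) / len(metric_across_reruns)
    if mean == 0:
        return "A"
    variance = (max(metric_across_reruns) - min(metric_across_reruns)) / mean
    if variance <= 0.01:
        return "A"
    if variance <= 0.05:
        return "B"
    return "C"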

Avoiding overfitting

  • Train/Dev/Test splits per language; hidden test projects rotated quarterly.
  • Case churn: introduce isomorphic variants (rename symbols, reorder files) to punish memorization.
  • Poisoned controls: include decoy sinks and unreachable dead-code traps.
  • Submission rules: require attestations of tool versions & flags; limit per-case hints.

Reference baselines (to run out-of-the-box)

  • Snyk Code/Reachability (JS/Java/Python, SaaS/CLI).
  • Semgrep + Pro Engine (rules + reachability mode).
  • CodeQL (multi-language, LGTM-style queries).
  • Joern (C/C++/JVM code property graphs).
  • angr (binary symbolic exec; selective for native samples).
  • Language-specific: pip-audit w/ import graphs, npm with lock tree + route discovery, Maven + call graph (Soot/WALA).

Submission format (one JSON per tool run)

{
  "tool": {"name": "YourTool", "version": "1.2.3"},
  "run": {
    "commit": "…",
    "platform": "ubuntu:24.04",
    "time_s": 182.4, "peak_mb": 3072
  },
  "cases": [
    {
      "id": "php-shop:fastjson@1.2.68:Sink#deserialize",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/orders",
        "path": [
          "OrdersController::create",
          "Serializer::deserialize",
          "Fastjson::parseObject"
        ],
        "guards": ["feature.flag.json_enabled==true"]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:…", "attestation": "sha256:…"
  }
}

Folder layout (repo)

/benchmark
  /cases/<lang>/<project>/<case_id>/
    case.yaml           # component@version, sink, labels, evidence refs
    entrypoints.yaml    # routes/CLIs/cron
    build/              # Dockerfiles, lockfiles, pinned toolchains
    outputs/            # SBOMs, binaries, traces (checksummed)
  /splits/{train,dev,test}.txt
  /schemas/{case.json,submission.json}
  /scripts/{build.sh, run_tests.sh, score.py}
  /docs/ (how-to, FAQs, T&Cs)

Minimal v1 (4-6 weeks of work)

  1. Languages: JS/TS, Python, Java, C (ELF).
  2. 20-30 cases: mix of reachable/unreachable with PoC unit tests.
  3. Deterministic builds in containers; publish SBOM+attestations.
  4. Scorer: precision/recall/F1 + explainability, runtime, determinism.
  5. Baselines: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
  6. Website: static leaderboard (per-language, per-size), download links, submission guide.

V2+ (quarterly)

  • Add .NET, PHP, Go, Rust; broaden binary focus (PE/Mach-O).
  • Add dynamic traces (eBPF/ETW/JFR) and taint oracles.
  • Introduce config-gated reachability (feature flags, env, k8s secrets).
  • Add dataset cards per case (threat model, CWE, false-positive traps).

Publishing & governance

  • License: CC-BY-SA for metadata, source-compatible OSS for code, binaries under original licenses.
  • Repro packs: benchmark-kit.tgz with container recipes, hashes, and attestations.
  • Disclosure: CVE hygiene, responsible use, optout path for upstreams.
  • Stewards: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.

Immediate next steps (checklist)

  • Lock the schemas (case + submission + attestation fields).
  • Pick 8 seed projects (2 per language tiered by size).
  • Draft 12 sink cases (6 reachable, 6 unreachable) with unit-test oracles.
  • Script deterministic builds and hash-locked SBOMs.
  • Implement the scorer; publish a starter leaderboard with 2 baselines.
  • Ship v1 website/docs and open submissions.

If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can git clone and start adding cases immediately.

Now, let's turn the blueprint into a concrete, developer-friendly implementation plan.

I'll assume v1 scope is:

  • Languages: JavaScript/TypeScript (Node), Python, Java, C (ELF)
  • ~20-30 cases total (reachable/unreachable mix)
  • Baselines: CodeQL, Semgrep, maybe Snyk where licenses allow, and angr for a few native cases

You can expand later, but this plan is enough to get v1 shipped.


0. Overall project structure & ownership

Owners

  • Tech Lead: owns architecture & final decisions
  • Benchmark Core: 2-3 devs building schemas, scorer, infra
  • Language Tracks: 1 dev per language (JS, Python, Java, C)
  • Website/Docs: 1 dev

Repo layout (target)

reachability-benchmark/
  README.md
  LICENSE
  CONTRIBUTING.md
  CODE_OF_CONDUCT.md

  benchmark/
    cases/
      js/
        express-blog/
          case-001/
            case.yaml
            entrypoints.yaml
            build/
              Dockerfile
              build.sh
            src/           # project source (or submodule)
            tests/         # unit tests as oracles
            outputs/
              sbom.cdx.json
              binary.tar.gz
              coverage.json
              traces/      # optional dynamic traces
      py/
        flask-api/...
      java/
        spring-app/...
      c/
        httpd-like/...
    schemas/
      case.schema.yaml
      entrypoints.schema.yaml
      truth.schema.yaml
      submission.schema.json
    tools/
      scorer/
        rb_score/
          __init__.py
          cli.py
          metrics.py
          loader.py
          explainability.py
        pyproject.toml
        tests/
      build/
        build_all.py
        validate_builds.py

  baselines/
    codeql/
      run_case.sh
      config/
    semgrep/
      run_case.sh
      rules/
    snyk/
      run_case.sh
    angr/
      run_case.sh

  ci/
    github/
      benchmark.yml

  website/
    # static site / leaderboard

1. Phase 1: Repo & infra setup

Task 1.1 Create repository

Developer: Tech Lead

Deliverables:

  • Repo created (reachability-benchmark or similar)

  • LICENSE (e.g., Apache-2.0 or MIT)

  • Basic README.md describing:

    • Purpose (public reachability benchmark)
    • High-level design
    • v1 scope (langs, #cases)

Task 1.2 Bootstrap structure

Developer: Benchmark Core

Create directory skeleton as above (without filling everything yet).

Add:

# Makefile (repo root; recipe lines use real tabs)
.PHONY: test lint build
test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py

Task 1.3 Coding standards & tooling

Developer: Benchmark Core

  • Add .editorconfig, .gitignore, and Python tool configs (ruff, black, or flake8).

  • Define minimal PR checklist in CONTRIBUTING.md:

    • Tests pass
    • Lint passes
    • New schemas have JSON schema or YAML schema and tests
    • New cases come with oracles (tests/coverage)

2. Phase 2: Case & submission schemas

Task 2.1 Define case metadata format

Developer: Benchmark Core

Create benchmark/schemas/case.schema.yaml and an example case.yaml.

Example case.yaml

id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small"               # small | medium | large
component:
  name: "express-blog"
  version: "1.0.0-bench"
vulnerability:
  cve: "CVE-XXXX-YYYY"
  cwe: "CWE-502"
  description: "Unsafe deserialization via user-controlled JSON."
  sink_id: "Deserializer::parse"
ground_truth:
  label: "reachable"        # reachable | unreachable | unknown
  confidence: "high"        # high | medium | low
  evidence_files:
    - "truth.yaml"
  notes: >
    Unit test test_reachable_deserialization triggers the sink.
build:
  dockerfile: "build/Dockerfile"
  build_script: "build/build.sh"
  output:
    artifact_path: "outputs/binary.tar.gz"
    sbom_path: "outputs/sbom.cdx.json"
    coverage_path: "outputs/coverage.json"
    traces_dir: "outputs/traces"
environment:
  os_image: "ubuntu:24.04"
  compiler: null
  runtime:
    node: "20.11.0"
  source_date_epoch: 1730000000

Acceptance criteria

  • Schema validates sample case.yaml with a Python script:

    • benchmark/tools/build/validate_schema.py using jsonschema or pykwalify.

Task 2.2 Entry points schema

Developer: Benchmark Core

benchmark/schemas/entrypoints.schema.yaml

Example entrypoints.yaml

entries:
  http:
    - id: "POST /api/posts"
      route: "/api/posts"
      method: "POST"
      handler: "PostsController.create"
  cli:
    - id: "generate-report"
      command: "node cli.js generate-report"
      description: "Generates summary report."
  scheduled:
    - id: "daily-cleanup"
      schedule: "0 3 * * *"
      handler: "CleanupJob.run"

Task 2.3 Ground truth / truth schema

Developer: Benchmark Core + Language Tracks

benchmark/schemas/truth.schema.yaml

Example truth.yaml

id: "js-express-blog:001"
cases:
  - sink_id: "Deserializer::parse"
    label: "reachable"
    dynamic_evidence:
      covered_by_tests:
        - "tests/test_reachable_deserialization.js::should_reach_sink"
      coverage_files:
        - "outputs/coverage.json"
    static_evidence:
      call_path:
        - "POST /api/posts"
        - "PostsController.create"
        - "PostsService.createFromJson"
        - "Deserializer.parse"
    config_conditions:
      - "process.env.FEATURE_JSON_ENABLED == 'true'"
    notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."

Task 2.4 Submission schema

Developer: Benchmark Core

benchmark/schemas/submission.schema.json

Shape

{
  "tool": { "name": "YourTool", "version": "1.2.3" },
  "run": {
    "commit": "abcd1234",
    "platform": "ubuntu:24.04",
    "time_s": 182.4,
    "peak_mb": 3072
  },
  "cases": [
    {
      "id": "js-express-blog:001",
      "prediction": "reachable",
      "confidence": 0.88,
      "explain": {
        "entry": "POST /api/posts",
        "path": [
          "PostsController.create",
          "PostsService.createFromJson",
          "Deserializer.parse"
        ],
        "guards": [
          "process.env.FEATURE_JSON_ENABLED === 'true'"
        ]
      }
    }
  ],
  "artifacts": {
    "sbom": "sha256:...",
    "attestation": "sha256:..."
  }
}

Write Python validation utility:

python benchmark/tools/scorer/validate_submission.py submission.json

Acceptance criteria

  • Validation fails on missing fields / wrong enum values.
  • At least two sample submissions pass validation (e.g., “perfect” and “random baseline”).
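
A minimal sketch of validate_submission.py, assuming the jsonschema package (error formatting and CLI handling are left to the implementer):

#!/usr/bin/env python3
"""Validate a submission JSON against benchmark/schemas/submission.schema.json."""
import json
import sys

from jsonschema import Draft202012Validator  # pip install jsonschema

def validate(submission_path: str,
             schema_path: str = "benchmark/schemas/submission.schema.json") -> int:
    with open(schema_path) as fh:
        schema = json.load(fh)
    with open(submission_path) as fh:
        submission = json.load(fh)

    errors = sorted(Draft202012Validator(schema).iter_errors(submission),
                    key=lambda e: list(e.path))
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"{location}: {err.message}")
    return 1 if errors else 0

if __name__ == "__main__":
    sys.exit(validate(sys.argv[1]))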

3. Phase 3: Reference projects & deterministic builds

Task 3.1 Select and vendor v1 projects

Developer: Tech Lead + Language Tracks

For each language, choose:

  • 1 small toy app (simple web or CLI)
  • 1 medium app (more routes, multiple modules)
  • Optional: 1 large (for performance stress tests)

Add them under benchmark/cases/<lang>/<project>/src/ (or as git submodules if you want to track upstream).


Task 3.2 Deterministic Docker build per project

Developer: Language Tracks

For each project:

  • Create build/Dockerfile

  • Create build/build.sh that:

    • Builds the app
    • Produces artifacts
    • Generates SBOM and attestation

Example build/Dockerfile (Node)

FROM node:20.11-slim

ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000

WORKDIR /app
COPY src/ /app
COPY package.json package-lock.json /app/

RUN npm ci --ignore-scripts && \
    npm run build || true

CMD ["node", "server.js"]

Example build.sh

#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"

IMAGE_TAG="rb-js-express-blog:1"

# Build with the case directory as context so src/ and package files are visible to the Dockerfile
docker build -t "$IMAGE_TAG" -f "$ROOT_DIR/build/Dockerfile" "$ROOT_DIR"

# Export image as tarball (binary artifact)
docker save "$IMAGE_TAG" | gzip > "$OUT_DIR/binary.tar.gz"

# Generate SBOM (e.g. via syft); can be an optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"

# In future: generate in-toto attestations

Task 3.3 Determinism checker

Developer: Benchmark Core

benchmark/tools/build/validate_builds.py:

  • For each case:

    • Run build.sh twice
    • Compare hashes of outputs/binary.tar.gz and outputs/sbom.cdx.json
  • Fail if hashes differ.
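
A minimal sketch of the checker, assuming each case directory follows the layout above (the artifact list and helper names are illustrative):

import hashlib
import subprocess
from pathlib import Path

ARTIFACTS = ["outputs/binary.tar.gz", "outputs/sbom.cdx.json"]

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_case_determinism(case_dir: Path) -> bool:
    """Build the case twice and verify the listed artifacts hash identically."""
    hashes = []
    for _ in range(2):
        subprocess.run(["bash", str(case_dir / "build" / "build.sh")], check=True)
        hashes.append({a: sha256(case_dir / a) for a in ARTIFACTS})
    if hashes[0] != hashes[1]:
        print(f"NON-DETERMINISTIC: {case_dir}")
        return False
    return True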

Acceptance criteria

  • All v1 cases produce identical artifacts across two builds on CI.

4. Phase 4: Ground truth oracles (tests & traces)

Task 4.1 Add unit/integration tests for reachable cases

Developer: Language Tracks

For each reachable case:

  • Add tests/ under the project to:

    • Start the app (if necessary)
    • Send a request/trigger that reaches the vulnerable sink
    • Assert that a sentinel side effect occurs (e.g. log or marker file) instead of real exploitation.

Example for Node using Jest:

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });

  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK"  we check log or variable
  expect(sinkWasReached()).toBe(true);
});
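
For the Python track, the same sentinel-style oracle could look like this with pytest and Flask's test client; the app module, route, and sink_was_reached() helper are placeholders for whatever the case project actually exposes:

# tests/test_reachable_deserialization.py (module and route names are illustrative)
from myapp import app, sink_was_reached  # hypothetical Flask case project

def test_should_reach_deserialization_sink():
    client = app.test_client()
    resp = client.post("/api/posts", json={"title": "x", "body": '{"__proto__": {}}'})
    assert resp.status_code == 200
    # The sink records a sentinel side effect instead of doing anything harmful.
    assert sink_was_reached()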

Task 4.2 Instrument coverage

Developer: Language Tracks

  • For each language, pick a coverage tool:

    • JS: nyc + istanbul
    • Python: coverage.py
    • Java: jacoco
    • C: gcov/llvm-cov (optional for v1)
  • Ensure running tests produces outputs/coverage.json or .xml that we then convert to a simple JSON format:

{
  "files": {
    "src/controllers/posts.js": {
      "lines_covered": [12, 13, 14, 27],
      "lines_total": 40
    }
  }
}

Create a small converter script if needed.
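
For example, a minimal converter for coverage.py could look like the sketch below, assuming `coverage json` has already produced its native report (field names follow coverage.py's JSON format; the output matches the simplified shape above):

import json

def convert_coverage_py(raw_path: str, out_path: str) -> None:
    """Convert a coverage.py JSON report into the benchmark's simplified shape."""
    with open(raw_path) as fh:
        raw = json.load(fh)

    simplified = {"files": {}}
    for filename, data in raw.get("files", {}).items():
        simplified["files"][filename] = {
            "lines_covered": sorted(data.get("executed_lines", [])),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }

    with open(out_path, "w") as fh:
        json.dump(simplified, fh, indent=2)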

Task 4.3 Optional dynamic traces

If you want richer evidence:

  • JS: add middleware that logs (entry_id, handler, sink) triples to outputs/traces/traces.json
  • Python: similar using decorators
  • C/Java: out of scope for v1 unless you want to invest extra time.

5. Phase 5: Scoring tool (CLI)

Task 5.1 Implement rb-score library + CLI

Developer: Benchmark Core

Create benchmark/tools/scorer/rb_score/ with:

  • loader.py

    • Load all case.yaml, truth.yaml into memory.
    • Provide functions: load_cases() -> Dict[case_id, Case].
  • metrics.py

    • Implement:

      • compute_precision_recall(truth, predictions)
      • compute_path_quality_score(explain_block) (0-3)
      • compute_runtime_stats(run_block)
  • cli.py

    • CLI:
rb-score \
  --cases-root benchmark/cases \
  --submission submissions/mytool.json \
  --output results/mytool_results.json

Pseudo-code for core scoring

def score_submission(truth, submission):
    """Score one submission against the ground-truth cases."""
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, gt_case in truth.items():
        gt = gt_case.label  # "reachable" / "unreachable"
        pred_case = find_pred_case(submission.cases, case_id)
        # Missing cases default to "unreachable" so tools cannot skip hard cases.
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)

        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)  # returns (P, R, F1)

    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }

Task 5.2 Explainability scoring rules

Developer: Benchmark Core

Implement explainability(explain):

  • 0: explain missing or path empty

  • 1: path present with at least 2 nodes (sink + one function)

  • 2: path contains:

    • Entry label (HTTP route/CLI id)
    • ≥3 nodes (entry → … → sink)
  • 3: Level 2 plus guards list non-empty

Unit tests for at least 4 scenarios.
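
A direct translation of these rules into code might look like this (the explain dict follows the submission schema; exact edge-case handling is up to the implementer):

def explainability(explain: dict | None) -> int:
    """Score an explanation block on the 0-3 scale defined above."""
    if not explain or not explain.get("path"):
        return 0
    path = explain["path"]
    has_entry = bool(explain.get("entry"))
    has_guards = bool(explain.get("guards"))
    if has_entry and len(path) >= 3:
        return 3 if has_guards else 2
    if len(path) >= 2:
        return 1
    return 0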

Task 5.3 Regression tests for scoring

Add small test fixture:

  • Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.

  • 3 submissions:

    • Perfect
    • All reachable
    • All unreachable

Assertions:

  • Perfect: precision=1, recall=1
  • All reachable: recall=1, precision<1
  • All unreachable: precision=1 (trivially on negatives), recall=0
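
A sketch of how these assertions could be pinned down in pytest, assuming the tiny fixture truth/submissions live next to the tests and score_submission is exposed from rb_score (module paths are illustrative):

# benchmark/tools/scorer/tests/test_regression.py
from rb_score.metrics import score_submission
from .fixtures import TINY_TRUTH, SUB_PERFECT, SUB_ALL_REACHABLE, SUB_ALL_UNREACHABLE

def test_perfect_submission():
    s = score_submission(TINY_TRUTH, SUB_PERFECT)["summary"]
    assert s["precision"] == 1.0 and s["recall"] == 1.0

def test_all_reachable_submission():
    s = score_submission(TINY_TRUTH, SUB_ALL_REACHABLE)["summary"]
    assert s["recall"] == 1.0 and s["precision"] < 1.0

def test_all_unreachable_submission():
    # Precision is trivially perfect (no positive predictions); recall must be 0.
    s = score_submission(TINY_TRUTH, SUB_ALL_UNREACHABLE)["summary"]
    assert s["recall"] == 0.0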

6. Phase 6: Baseline integrations

Task 6.1 Semgrep baseline

Developer: Benchmark Core (with Semgrep experience)

  • baselines/semgrep/run_case.sh:

    • Inputs: case_id, cases_root, output_path

    • Steps:

      • Find src/ for case

      • Run semgrep --config auto or curated rules

      • Convert Semgrep findings into benchmark submission format:

        • Map Semgrep rules → vulnerability types → candidate sinks
        • Heuristically guess reachability (for v1, maybe always “reachable” if sink in code path)
    • Output: output_path JSON conforming to submission.schema.json.
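
A rough sketch of the Semgrep-to-submission conversion, assuming `semgrep --json` output (a top-level `results` list with a `check_id` per finding) and a hand-maintained mapping from rule IDs to benchmark sink IDs; the rule ID, run metadata, and heuristic here are all illustrative:

import json

# Hand-maintained per-case mapping: Semgrep rule id -> benchmark sink id (illustrative).
RULE_TO_SINK = {
    "example.rules.unsafe-deserialization": "Deserializer::parse",
}

def semgrep_to_submission(semgrep_json_path: str, case_id: str, semgrep_version: str) -> dict:
    with open(semgrep_json_path) as fh:
        findings = json.load(fh).get("results", [])

    hit_sinks = sorted({RULE_TO_SINK[f["check_id"]]
                        for f in findings if f.get("check_id") in RULE_TO_SINK})

    return {
        "tool": {"name": "semgrep-baseline", "version": semgrep_version},
        "run": {"commit": "HEAD", "platform": "ubuntu:24.04", "time_s": 0.0, "peak_mb": 0},
        "cases": [{
            # v1 heuristic: any mapped finding => predict "reachable".
            "id": case_id,
            "prediction": "reachable" if hit_sinks else "unreachable",
            "confidence": 0.5,
            "explain": {"entry": "", "path": hit_sinks, "guards": []},
        }],
        "artifacts": {"sbom": "", "attestation": ""},
    }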

Task 6.2 CodeQL baseline

  • Create CodeQL databases for each project (likely via codeql database create).

  • Create queries targeting known sinks (e.g., Deserialization, CommandInjection).

  • baselines/codeql/run_case.sh:

    • Build DB (or reuse)
    • Run queries
    • Translate results into our submission format (again as heuristic reachability).

Task 6.3 Optional Snyk / angr baselines

  • Snyk:

    • Use snyk test on the project
    • Map results to dependencies & known CVEs
    • For v1, just mark as reachable if Snyk reports a reachable path (if available).
  • angr:

    • For 1-2 small C samples, configure a simple analysis script.

Acceptance criteria

  • For at least 5 cases (across languages), the baselines produce valid submission JSON.
  • rb-score runs and yields metrics without errors.

7. Phase 7: CI/CD

Task 7.1 GitHub Actions workflow

Developer: Benchmark Core

ci/github/benchmark.yml:

Jobs:

  1. lint-and-test

    • python -m pip install -e benchmark/tools/scorer[dev]
    • make lint
    • make test
  2. build-cases

    • python benchmark/tools/build/build_all.py
    • Run validate_builds.py
  3. smoke-baselines

    • For 2-3 cases, run Semgrep/CodeQL wrappers and ensure they emit valid submissions.

Task 7.2 Artifact upload

  • Upload outputs/ tarball from build-cases as workflow artifacts.
  • Upload results/*.json from scoring runs.

8. Phase 8: Website & leaderboard

Task 8.1 Define results JSON format

Developer: Benchmark Core + Website dev

results/leaderboard.json:

{
  "tools": [
    {
      "name": "Semgrep",
      "version": "1.60.0",
      "summary": {
        "precision": 0.72,
        "recall": 0.48,
        "f1": 0.58
      },
      "by_language": {
        "javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
        "python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
      }
    }
  ]
}

CLI option to generate this:

rb-score compare \
  --cases-root benchmark/cases \
  --submissions submissions/*.json \
  --output results/leaderboard.json
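
A minimal sketch of the aggregation behind `rb-score compare`, assuming each per-tool results file carries the tool block from its submission plus the summary produced by score_submission (file layout and keys are illustrative):

import glob
import json

def build_leaderboard(results_glob: str, out_path: str) -> None:
    """Merge per-tool rb-score result files into a single leaderboard.json."""
    tools = []
    for path in sorted(glob.glob(results_glob)):
        with open(path) as fh:
            result = json.load(fh)
        tools.append({
            "name": result["tool"]["name"],
            "version": result["tool"]["version"],
            "summary": result["summary"],
            # Per-language breakdown would need case metadata; omitted here.
        })
    with open(out_path, "w") as fh:
        json.dump({"tools": tools}, fh, indent=2)

# build_leaderboard("results/*_results.json", "results/leaderboard.json")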

Task 8.2 Static site

Developer: Website dev

Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML+JS).

Pages:

  • Home

    • What is reachability?
    • Summary of benchmark
  • Leaderboard

    • Renders leaderboard.json
    • Filters: language, case size
  • Docs

    • How to run benchmark locally
    • How to prepare a submission

Add a simple script to copy results/leaderboard.json into website/public/ for publishing.


9. Phase 9: Docs, governance, and contribution flow

Task 9.1 CONTRIBUTING.md

Include:

  • How to add a new case:

    • Step-by-step:

      1. Create project folder under benchmark/cases/<lang>/<project>/case-XXX/

      2. Add case.yaml, entrypoints.yaml, truth.yaml

      3. Add oracles (tests, coverage)

      4. Add deterministic build/ assets

      5. Run local tooling:

        • validate_schema.py
        • validate_builds.py --case <id>
    • Example PR description template.

Task 9.2 Governance doc

  • Define Technical Advisory Committee (TAC) roles:

    • Approve new cases
    • Approve schema changes
    • Manage hidden test sets (future phase)
  • Define release cadence:

    • v1.0 with public cases
    • Quarterly updates with new hidden cases.

10. Suggested milestone breakdown (for planning / sprints)

Milestone 1: Foundation (1-2 sprints)

  • Repo scaffolding (Tasks 1.x)

  • Schemas (Tasks 2.x)

  • Two tiny toy cases (one JS, one Python) with:

    • case.yaml, entrypoints.yaml, truth.yaml
    • Deterministic build
    • Basic unit tests
  • Minimal rb-score with:

    • Case loading
    • Precision/recall only

Exit: You can run rb-score on a dummy submission for 2 cases.


Milestone 2: v1 dataset (2-3 sprints)

  • Add ~20-30 cases across JS, Python, Java, C
  • Ground truth & coverage for each
  • Deterministic builds validated
  • Explainability scoring implemented
  • Regression tests for rb-score

Exit: Full scoring tool stable; dataset repeatably builds on CI.


Milestone 3: Baselines & site (1-2 sprints)

  • Semgrep + CodeQL baselines producing valid submissions
  • CI running smoke baselines
  • leaderboard.json generator
  • Static website with public leaderboard and docs

Exit: Public v1 benchmark you can share with external tool authors.


If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you're on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact pyproject.toml for rb-score).