Opening Up a Reachability Dataset

Here's a concrete, low-lift way to boost StellaOps's visibility and prove your “deterministic, replayable” moat: publish a sanitized subset of reachability graphs as a public benchmark that others can run and score identically.

What this is (plain English)

  • You release a small, carefully scrubbed set of packages + SBOMs + VEX + call graphs (source & binaries) with ground-truth reachability labels for a curated list of CVEs.
  • You also ship a deterministic scoring harness (container + manifest) so anyone can reproduce the exact scores, byte-for-byte.

Why it helps

  • Proof of determinism: identical inputs → identical graphs → identical scores.
  • Research magnet: gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
  • Biz impact: easy demo for buyers; lets you publish leaderboards and whitepapers.

Scope (MVP dataset)

  • Languages: PHP, JS, Python, plus binary (ELF/PE/Mach-O) mini-cases.

  • Units: 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).

  • Artifacts per unit:

    • Package tarball(s) or container image digest
    • SBOM (CycloneDX 1.6 + SPDX 3.0.1)
    • VEX (known-exploited, not-affected, under-investigation)
    • Call graph (normalized JSON)
    • Ground truth: list of vulnerable entrypoints/edges considered reachable
    • Determinism manifest: feed URLs + rule hashes + container digests + tool versions

Data model (keep it simple)

  • dataset.json: index of cases with content-addressed URIs (sha256)

  • sbom/, vex/, graphs/, truth/ folders mirroring the index

  • manifest.lock.json: DSSE-signed record of:

    • feeder rules, lattice policies, normalizers (name + version + hash)
    • container image digests for each step (scanner/cartographer/normalizer)
    • timestamp + signer (StellaOps Authority)

Scoring harness (deterministic)

  • One Docker image: stellaops/benchmark-harness:<tag>

  • Inputs: dataset root + manifest.lock.json

  • Outputs:

    • scores.json (precision/recall/F1, per-case and macro)
    • replay-proof.txt (hashes of every artifact used)
  • No-network mode (offline-first). Fails closed if any hash mismatches.

Metrics (clear + auditable)

  • Per case: TP/FP/FN for reachable functions (or edges), plus optional sink-reach verification.
  • Aggregates: micro/macro F1; “Determinism Index” (stddev of repeated runs must be 0).
  • Repro test: the harness re-runs scoring N=3 times and asserts identical outputs (hash compare).

Sanitization & licensing

  • Strip any proprietary code/data; prefer OSS with permissive licenses.
  • Replace real package registries with local mirrors and pin digests.
  • Publish under CC-BY-4.0 (data) + Apache-2.0 (harness). Add a simple contributor license agreement for external case submissions.

Baselines to include (neutral + useful)

  • “Naïve reachable” (all functions in package)
  • “Imports-only” (entrypoints that match import graph)
  • “Call-depth-2” (bounded traversal)
  • Your graph engine run with frozen rules from the manifest (as a reference, not a claim of SOTA)

Repository layout (public)

stellaops-reachability-benchmark/
  dataset/
    dataset.json
    sbom/...
    vex/...
    graphs/...
    truth/...
    manifest.lock.json  (DSSE-signed)
  harness/
    Dockerfile
    runner.py (CLI)
    schema/ (JSON Schemas for graphs, truth, scores)
  docs/
    HOWTO.md (5-min run)
    CONTRIBUTING.md
    SANITIZATION.md
    LICENSES/

Docs your team can ship in a day

  • HOWTO.md: docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o
  • SCHEMA.md: JSON Schemas for graph and truth (keep fields minimal: nodes, edges, purls, sinks, evidence).
  • REPRODUCIBILITY.md: explains DSSE signatures, lockfile, and offline run.
  • LIMITATIONS.md: clarifies scope (no dynamic runtime traces in v1, etc.).

Governance (lightweight)

  • Versioned releases: v0.1, v0.2 with changelogs.

  • Submission gate: PR template + CI that:

    • validates schemas
    • checks hashes match lockfile
    • re-scores and compares to the contributor's score
  • Leaderboard cadence: monthly markdown table regenerated by CI.

Launch plan (2-week sprint)

  • Day 1–2: pick cases; finalize schemas; write SANITIZATION.md.
  • Day 3–5: build harness image; implement deterministic runner; freeze manifest.lock.json.
  • Day 6–8: produce ground truth; run baselines; generate initial scores.
  • Day 9–10: docs + website README; record a 2-minute demo GIF.
  • Day 11–12: legal review + licenses; create issue labels (“good first case”).
  • Day 13–14: publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS-Fuzz folks to submit cases.

Nice-to-have (but easy)

  • JSON Schema for ground-truth edges so academics can auto-ingest.
  • Small “unknowns” registry example to show how you annotate unresolved symbols without breaking determinism.
  • Binary mini-lab: stripped vs non-stripped ELF pair to show your patch-oracle technique in action (truth labels reflect oracle result).

If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample manifest.lock.json, and a minimal runner.py CLI) so you can drop it straight into GitHub.

Got you — let's turn that high-level idea into something your devs can actually pick up and ship.

Below is a concrete implementation plan for the StellaOps Reachability Benchmark repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.


0. Tech assumptions (adjust if needed)

To be specific, I'll assume:

  • Repo: stellaops-reachability-benchmark
  • Harness language: Python 3.11+
  • Packaging: Docker image for the harness
  • Schemas: JSON Schema (Draft 2020-12)
  • CI: GitHub Actions

If your stack differs, you can still reuse the structure and acceptance criteria.


1. Repo skeleton & project bootstrap

Goal: Create a minimal but fully wired repo.

Tasks

  1. Create skeleton

    • Structure:

      stellaops-reachability-benchmark/
        dataset/
          dataset.json
          sbom/
          vex/
          graphs/
          truth/
          packages/
          manifest.lock.json    # initially stub
        harness/
          reachbench/
            __init__.py
            cli.py
            dataset_loader.py
            schemas/
              graph.schema.json
              truth.schema.json
              dataset.schema.json
              scores.schema.json
          tests/
        docs/
          HOWTO.md
          SCHEMA.md
          REPRODUCIBILITY.md
          LIMITATIONS.md
          SANITIZATION.md
        .github/
          workflows/
            ci.yml
        pyproject.toml
        README.md
        LICENSE
        Dockerfile
      
  2. Bootstrap Python project

    • pyproject.toml with:

      • reachbench package
      • deps: jsonschema, click or typer, pyyaml, pytest
    • harness/tests/ with a dummy test to ensure CI is green.

  3. Dockerfile

    • Minimal, pinned versions:

      FROM python:3.11-slim
      WORKDIR /app
      COPY . .
      RUN pip install --no-cache-dir .
      ENTRYPOINT ["reachbench"]
      
  4. CI basic pipeline (.github/workflows/ci.yml)

    • Jobs:

      • lint (e.g., ruff or flake8 if you want)
      • test (pytest)
      • build-docker (just to ensure Dockerfile stays valid)

Acceptance criteria

  • pip install . works locally.
  • reachbench --help prints CLI help (even if commands are stubs).
  • CI passes on main branch.
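
A minimal sketch of the cli.py stub that would satisfy these criteria, assuming the click option from the dependency list and a [project.scripts] entry reachbench = "reachbench.cli:main" in pyproject.toml; the command bodies stay placeholders until the later epics fill them in:

# harness/reachbench/cli.py (stub sketch)
import click


@click.group()
def main() -> None:
    """StellaOps reachability benchmark harness."""


@main.command("validate-dataset")
@click.option("--dataset-root", required=True, type=click.Path(exists=True))
def validate_dataset(dataset_root: str) -> None:
    """Validate dataset JSON files against the bundled schemas."""
    raise click.ClickException("not implemented yet")


@main.command("score")
@click.option("--dataset-root", required=True, type=click.Path(exists=True))
@click.option("--results-root", required=True, type=click.Path(exists=True))
@click.option("--out", required=True, type=click.Path())
def score(dataset_root: str, results_root: str, out: str) -> None:
    """Score participant results against ground truth."""
    raise click.ClickException("not implemented yet")


if __name__ == "__main__":
    main()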

2. Dataset & schema definitions

Goal: Define all JSON formats and enforce them.

2.1 Define dataset index format (dataset/dataset.json)

File: dataset/dataset.json

Example:

{
  "version": "0.1.0",
  "cases": [
    {
      "id": "php-wordpress-5.8-cve-2023-12345",
      "language": "php",
      "kind": "source",          // "source" | "binary" | "container"
      "cves": ["CVE-2023-12345"],
      "artifacts": {
        "package": {
          "path": "packages/php/wordpress-5.8.tar.gz",
          "sha256": "…"
        },
        "sbom": {
          "path": "sbom/php/wordpress-5.8.cdx.json",
          "format": "cyclonedx-1.6",
          "sha256": "…"
        },
        "vex": {
          "path": "vex/php/wordpress-5.8.vex.json",
          "format": "csaf-2.0",
          "sha256": "…"
        },
        "graph": {
          "path": "graphs/php/wordpress-5.8.graph.json",
          "schema": "graph.schema.json",
          "sha256": "…"
        },
        "truth": {
          "path": "truth/php/wordpress-5.8.truth.json",
          "schema": "truth.schema.json",
          "sha256": "…"
        }
      }
    }
  ]
}

2.2 Define truth schema (harness/reachbench/schemas/truth.schema.json)

Model (conceptual):

{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "vulnerable_components": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",      // "function" | "method" | "binary_symbol"
      "status": "reachable",          // "reachable" | "not_reachable"
      "reachable_from": [
        {
          "entrypoint_id": "web:GET:/foo",
          "notes": "HTTP route /foo"
        }
      ],
      "evidence": "manual-analysis"   // or "unit-test", "patch-oracle"
    }
  ],
  "non_vulnerable_components": [
    {
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable",
      "evidence": "manual-analysis"
    }
  ]
}

Tasks

  • Implement JSON Schema capturing:

    • required fields: case_id, vulnerable_components
    • allowed enums for symbol_kind, status, evidence
  • Add unit tests that:

    • validate a valid truth file
    • fail on various broken ones (missing case_id, unknown status, etc.)
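
As a sketch of those two tasks with jsonschema's Draft 2020-12 validator (module layout, helper names, and the sample payload are assumptions, not the final schema):

# harness/reachbench/validation.py -- schema helpers (sketch)
import json
from importlib.resources import files

from jsonschema import Draft202012Validator


def load_schema(name: str) -> dict:
    """Load a bundled schema such as 'truth.schema.json' from reachbench/schemas/."""
    return json.loads((files("reachbench") / "schemas" / name).read_text(encoding="utf-8"))


def schema_errors(doc: dict, schema_name: str) -> list[str]:
    """Return human-readable violations; an empty list means the document is valid."""
    validator = Draft202012Validator(load_schema(schema_name))
    return [error.message for error in validator.iter_errors(doc)]


# harness/tests/test_truth_schema.py (sketch)
from reachbench.validation import schema_errors

VALID_TRUTH = {
    "case_id": "php-wordpress-5.8-cve-2023-12345",
    "vulnerable_components": [
        {
            "cve": "CVE-2023-12345",
            "symbol": "wp_ajax_nopriv_some_vuln",
            "symbol_kind": "function",
            "status": "reachable",
            "evidence": "manual-analysis",
        }
    ],
}


def test_valid_truth_file_passes():
    assert schema_errors(VALID_TRUTH, "truth.schema.json") == []


def test_missing_case_id_fails():
    broken = {k: v for k, v in VALID_TRUTH.items() if k != "case_id"}
    assert schema_errors(broken, "truth.schema.json") != []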

2.3 Define graph schema (harness/reachbench/schemas/graph.schema.json)

Model (conceptual):

{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "language": "php",
  "nodes": [
    {
      "id": "func:wp_ajax_nopriv_some_vuln",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "kind": "function",
      "purl": "pkg:composer/wordpress/wordpress@5.8"
    }
  ],
  "edges": [
    {
      "from": "func:wp_ajax_nopriv_some_vuln",
      "to": "func:wpdb_query",
      "kind": "call"
    }
  ],
  "entrypoints": [
    {
      "id": "web:GET:/foo",
      "symbol": "some_controller",
      "kind": "http_route"
    }
  ]
}

Tasks

  • JSON Schema with:

    • nodes[] (id, symbol, kind, optional purl)
    • edges[] (from, to, kind)
    • entrypoints[] (id, symbol, kind)
  • Tests: verify a valid graph; invalid ones (missing id, unknown kind) are rejected.

2.4 Dataset index schema (dataset.schema.json)

  • JSON Schema describing dataset.json (version string, cases array).
  • Tests: validate the example dataset file.

Acceptance criteria

  • Running a simple script (will be reachbench validate-dataset) validates all JSON files in dataset/ against schemas without errors.
  • CI fails if any dataset JSON is invalid.

3. Lockfile & determinism manifest

Goal: Implement manifest.lock.json generation and verification.

3.1 Lockfile structure

File: dataset/manifest.lock.json

Example:

{
  "version": "0.1.0",
  "created_at": "2025-01-15T12:00:00Z",
  "dataset": {
    "root": "dataset/",
    "sha256": "…",
    "cases": {
      "php-wordpress-5.8-cve-2023-12345": {
        "sha256": "…"
      }
    }
  },
  "tools": {
    "graph_normalizer": {
      "name": "stellaops-graph-normalizer",
      "version": "1.2.3",
      "sha256": "…"
    }
  },
  "containers": {
    "scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
    "normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
  },
  "signatures": [
    {
      "type": "dsse",
      "key_id": "stellaops-benchmark-key-1",
      "signature": "base64-encoded-blob"
    }
  ]
}

(Signatures can be optional in v1 but structure should be there.)

3.2 lockfile.py module

File: harness/reachbench/lockfile.py

Responsibilities

  • Compute deterministic SHA-256 digest of:

    • each case's artifacts (path → hash from dataset.json)
    • entire dataset/ tree (sorted traversal)
  • Generate new manifest.lock.json:

    • version (hard-coded constant)
    • created_at (UTC ISO8601)
    • dataset section with case hashes
  • Verification:

    • verify_lockfile(dataset_root, lockfile_path):

      • recompute hashes
      • compare to lockfile.dataset
      • return boolean + list of mismatches

Tasks

  1. Implement canonical hashing:

    • For text JSON files: normalize with:

      • sort keys
      • no whitespace
      • UTF-8 encoding
    • For binaries (packages): raw bytes.

  2. Implement compute_dataset_hashes(dataset_root):

    • Returns {"cases": {...}, "root_sha256": "…"}.
  3. Implement write_lockfile(...) and verify_lockfile(...).

  4. Tests:

    • Two calls with the same dataset produce an identical lockfile (order of case keys normalized).
    • Changing any artifact file changes the root hash and causes verify to fail.
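
A sketch of the canonical hashing described in task 1, keyed by relative path for brevity; the real module would also group hashes by case id using dataset.json:

# harness/reachbench/lockfile.py -- canonical hashing helpers (sketch)
import hashlib
import json
from pathlib import Path


def canonical_json_bytes(path: Path) -> bytes:
    """Re-serialize JSON with sorted keys, no whitespace, UTF-8."""
    doc = json.loads(path.read_text(encoding="utf-8"))
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def file_sha256(path: Path) -> str:
    """JSON files are hashed canonically, everything else as raw bytes."""
    data = canonical_json_bytes(path) if path.suffix == ".json" else path.read_bytes()
    return hashlib.sha256(data).hexdigest()


def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Walk dataset/ in sorted order and hash every artifact plus the whole tree."""
    file_hashes: dict[str, str] = {}
    root = hashlib.sha256()
    for path in sorted(p for p in dataset_root.rglob("*") if p.is_file()):
        if path.name == "manifest.lock.json":
            continue  # the lockfile never hashes itself
        rel = str(path.relative_to(dataset_root))
        file_hashes[rel] = file_sha256(path)
        root.update(f"{rel}:{file_hashes[rel]}\n".encode("utf-8"))
    return {"files": file_hashes, "root_sha256": root.hexdigest()}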

3.3 CLI commands

Add to cli.py:

  • reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json
  • reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json

Acceptance criteria

  • reachbench compute-lockfile generates a stable file (byte-for-byte identical across runs).

  • reachbench verify-lockfile exits with:

    • code 0 if matches
    • non-zero if mismatch (plus human-readable diff).
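
To make the exit-code contract concrete, here is a hedged sketch of verify-lockfile wired with click, reusing compute_dataset_hashes from the hashing sketch above:

# harness/reachbench/cli.py -- verify-lockfile command (sketch)
import json
import sys
from pathlib import Path

import click

from reachbench.lockfile import compute_dataset_hashes


@click.command("verify-lockfile")
@click.option("--dataset-root", required=True, type=click.Path(exists=True, path_type=Path))
@click.option("--lockfile", required=True, type=click.Path(exists=True, path_type=Path))
def verify_lockfile(dataset_root: Path, lockfile: Path) -> None:
    """Exit 0 when recomputed hashes match the lockfile, non-zero otherwise."""
    recorded = json.loads(lockfile.read_text(encoding="utf-8"))
    current = compute_dataset_hashes(dataset_root)
    if current["root_sha256"] == recorded["dataset"]["sha256"]:
        click.echo("lockfile OK")
        return
    click.echo("lockfile MISMATCH: dataset root hash differs", err=True)
    sys.exit(1)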

4. Scoring harness CLI

Goal: Deterministically score participant results against ground truth.

4.1 Result format (participant output)

Expectation:

Participants provide results/ with one JSON per case:

results/
  php-wordpress-5.8-cve-2023-12345.json
  js-express-4.17-cve-2022-9999.json

Result file example:

{
  "case_id": "php-wordpress-5.8-cve-2023-12345",
  "tool_name": "my-reachability-analyzer",
  "tool_version": "1.0.0",
  "predictions": [
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_ajax_nopriv_some_vuln",
      "symbol_kind": "function",
      "status": "reachable"
    },
    {
      "cve": "CVE-2023-12345",
      "symbol": "wp_safe_function",
      "symbol_kind": "function",
      "status": "not_reachable"
    }
  ]
}

4.2 Scoring model

  • Treat scoring as classification over (cve, symbol) pairs.

  • For each case:

    • Truth positives: all vulnerable_components with status == "reachable".
    • Truth negatives: everything marked not_reachable (optional in v1).
    • Predictions: all entries with status == "reachable".
  • Compute:

    • TP: predicted reachable & truth reachable.
    • FP: predicted reachable but truth says not reachable / unknown.
    • FN: truth reachable but not predicted reachable.
  • Metrics:

    • Precision, Recall, F1 per case.
    • Macro-averaged metrics across all cases.

4.3 Implementation (scoring.py)

File: harness/reachbench/scoring.py

Functions:

  • load_truth(case_truth_path) -> TruthModel

  • load_predictions(predictions_path) -> PredictionModel

  • compute_case_metrics(truth, preds) -> dict

    • returns:

      {
        "case_id": str,
        "tp": int,
        "fp": int,
        "fn": int,
        "precision": float,
        "recall": float,
        "f1": float
      }
      
  • aggregate_metrics(case_metrics_list) -> dict

    • macro_precision, macro_recall, macro_f1, num_cases.
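
A minimal sketch of the two core functions, following the 4.2 definitions; truth and predictions are shown as plain dicts rather than the TruthModel/PredictionModel types named above:

# harness/reachbench/scoring.py -- core metrics (sketch)
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """TP/FP/FN plus precision/recall/F1 over (cve, symbol) pairs for one case."""
    truth_reachable = {
        (c.get("cve"), c["symbol"])
        for c in truth.get("vulnerable_components", [])
        if c["status"] == "reachable"
    }
    predicted_reachable = {
        (p.get("cve"), p["symbol"])
        for p in preds.get("predictions", [])
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted_reachable)
    fp = len(predicted_reachable - truth_reachable)  # predicted reachable, truth says otherwise/unknown
    fn = len(truth_reachable - predicted_reachable)  # truth reachable, never predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"case_id": truth["case_id"], "tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def aggregate_metrics(case_metrics: list[dict]) -> dict:
    """Macro (unweighted) averages across cases."""
    n = len(case_metrics)

    def macro(key: str) -> float:
        return sum(m[key] for m in case_metrics) / n if n else 0.0

    return {"macro_precision": macro("precision"), "macro_recall": macro("recall"),
            "macro_f1": macro("f1"), "num_cases": n}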

4.4 CLI: score

Signature:

reachbench score \
  --dataset-root ./dataset \
  --results-root ./results \
  --lockfile ./dataset/manifest.lock.json \
  --out ./out/scores.json \
  [--cases php-*] \
  [--repeat 3]

Behavior:

  1. Verify lockfile (fail closed if mismatch).

  2. Load dataset.json, filter cases if --cases is set (glob).

  3. For each case:

    • Load truth file (and validate schema).

    • Locate results file (<case_id>.json) under results-root:

      • If missing, treat as all FN (or mark case as “no submission”).
    • Load and validate predictions (include a JSON Schema: results.schema.json).

    • Compute per-case metrics.

  4. Aggregate metrics.

  5. Write scores.json:

    {
      "version": "0.1.0",
      "dataset_version": "0.1.0",
      "generated_at": "2025-01-15T12:34:56Z",
      "macro_precision": 0.92,
      "macro_recall": 0.88,
      "macro_f1": 0.90,
      "cases": [
        {
          "case_id": "php-wordpress-5.8-cve-2023-12345",
          "tp": 10,
          "fp": 1,
          "fn": 2,
          "precision": 0.91,
          "recall": 0.83,
          "f1": 0.87
        }
      ]
    }
    
  6. Determinism check:

    • If --repeat N given:

      • Re-run scoring in-memory N times.
      • Compare resulting JSON strings (canonicalized via sorted keys).
      • If any differ, exit non-zero with message (“non-deterministic scoring detected”).
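
A minimal sketch of that repeat check, assuming the scoring pass is wrapped in a callable and volatile fields such as generated_at are added after the check so they never differ:

# harness/reachbench/cli.py -- determinism guard for `score --repeat N` (sketch)
import json
from typing import Callable


def assert_deterministic(run_once: Callable[[], dict], repeat: int) -> dict:
    """Run the in-memory scoring pass `repeat` times and fail if any output differs."""
    reference = None
    scores: dict = {}
    for _ in range(repeat):
        scores = run_once()
        canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
        if reference is None:
            reference = canonical
        elif canonical != reference:
            raise SystemExit("non-deterministic scoring detected")
    return scores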

4.5 Offline-only mode

  • In cli.py, early check:

    if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
        # Verify no outbound network: by policy, just ensure we never call any net libs.
        # (In v1, simply avoid adding any such calls.)
        pass
    
  • Document that the harness must not reach out to the internet (see the test sketch below).
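
In v1 the guarantee is simply “the harness never touches the network”, but CI can enforce it too; one hedged option is an autouse pytest fixture that makes any socket connection blow up (illustrative only):

# harness/tests/conftest.py -- fail any test that opens a network connection (sketch)
import socket

import pytest


@pytest.fixture(autouse=True)
def block_network(monkeypatch):
    def guard(*args, **kwargs):
        raise RuntimeError("offline-only: network access attempted")

    # socket.socket is the Python-level class, so its connect method can be patched.
    monkeypatch.setattr(socket.socket, "connect", guard)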

Acceptance criteria

  • Given a small artificial dataset with 2–3 cases and hand-crafted results, reachbench score produces expected metrics (assert via tests).
  • Running reachbench score --repeat 3 produces identical scores.json across runs.
  • Missing results files are handled gracefully (but clearly documented).

5. Baseline implementations

Goal: Provide in-repo baselines that use only the provided graphs (no extra tooling).

5.1 Baseline types

  1. Naïve reachable: all symbols in the vulnerable package are considered reachable.

  2. Imports-only: reachable = any symbol that:

    • appears in the graph AND
    • is reachable from any entrypoint by a single edge OR name match.
  3. Call-depth-2:

    • From each entrypoint, traverse up to depth 2 along call edges.
    • Anything at depth ≤ 2 is considered reachable.

5.2 Implementation

File: harness/reachbench/baselines.py

  • baseline_naive(graph, truth) -> PredictionModel
  • baseline_imports_only(graph, truth) -> PredictionModel
  • baseline_call_depth_2(graph, truth) -> PredictionModel

CLI:

reachbench run-baseline \
  --dataset-root ./dataset \
  --baseline naive|imports|depth2 \
  --out ./results-baseline-<baseline>/

Behavior:

  • For each case:

    • Load graph.
    • Generate predictions per baseline.
    • Write result file results-baseline-<baseline>/<case_id>.json.
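
For illustration, a sketch of the depth-2 baseline against the graph schema from section 2.3; it walks call edges from the entrypoints and emits a prediction for each symbol listed in the truth file (everything beyond the names in this plan is an assumption):

# harness/reachbench/baselines.py -- call-depth-2 baseline (sketch)
from collections import defaultdict, deque


def baseline_call_depth_2(graph: dict, truth: dict, max_depth: int = 2) -> dict:
    """Everything within `max_depth` call edges of an entrypoint counts as reachable."""
    calls: dict[str, list[str]] = defaultdict(list)
    for edge in graph.get("edges", []):
        if edge.get("kind") == "call":
            calls[edge["from"]].append(edge["to"])

    symbol_to_id = {node["symbol"]: node["id"] for node in graph.get("nodes", [])}

    # Bounded BFS from every entrypoint's backing symbol.
    reachable: set[str] = set()
    queue: deque[tuple[str, int]] = deque()
    for entrypoint in graph.get("entrypoints", []):
        start = symbol_to_id.get(entrypoint.get("symbol"))
        if start is not None:
            reachable.add(start)
            queue.append((start, 0))
    while queue:
        node_id, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for callee in calls.get(node_id, []):
            if callee not in reachable:
                reachable.add(callee)
                queue.append((callee, depth + 1))

    # Emit a prediction per symbol the truth file asks about (the real baseline
    # would likely also cover non_vulnerable_components).
    predictions = []
    for component in truth.get("vulnerable_components", []):
        node_id = symbol_to_id.get(component["symbol"])
        status = "reachable" if node_id in reachable else "not_reachable"
        predictions.append({"cve": component.get("cve"), "symbol": component["symbol"],
                            "symbol_kind": component["symbol_kind"], "status": status})
    return {"case_id": graph["case_id"], "tool_name": "baseline-call-depth-2",
            "tool_version": "0.1.0", "predictions": predictions}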

5.3 Tests

  • Tiny synthetic dataset in harness/tests/data/:

    • 1–2 cases with simple graphs.
    • Known expectations for each baseline (TP/FP/FN counts).

Acceptance criteria

  • reachbench run-baseline --baseline naive runs end-to-end and outputs results files.
  • reachbench score on baseline results produces stable scores.
  • Tests validate baseline behavior on synthetic cases.

6. Dataset validation & tooling

Goal: One command to validate everything (schemas, hashes, internal consistency).

CLI: validate-dataset

reachbench validate-dataset \
  --dataset-root ./dataset \
  [--lockfile ./dataset/manifest.lock.json]

Checks:

  1. dataset.json conforms to dataset.schema.json.

  2. For each case:

    • all artifact paths exist
    • graph file passes graph.schema.json
    • truth file passes truth.schema.json
  3. Optional: verify lockfile if provided.

Implementation:

  • dataset_loader.py:

    • load_dataset_index(path) -> DatasetIndex
    • iter_cases(dataset_index) yields case objects.
    • validate_case(case, dataset_root) -> list[str] (list of error messages).
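
A sketch of validate_case, assuming the graph and truth artifacts carry a schema field in dataset.json (as in the 2.1 example) and reusing the load_schema helper from the schema sketch in section 2.2:

# harness/reachbench/dataset_loader.py -- per-case checks (sketch)
import json
from pathlib import Path

from jsonschema import Draft202012Validator

from reachbench.validation import load_schema  # helper sketched in section 2.2


def validate_case(case: dict, dataset_root: Path) -> list[str]:
    """Return human-readable error messages for one dataset.json case entry."""
    errors: list[str] = []
    for name, artifact in case.get("artifacts", {}).items():
        path = dataset_root / artifact["path"]
        if not path.exists():
            errors.append(f"{case['id']}: missing {name} artifact {artifact['path']}")
            continue
        schema_name = artifact.get("schema")  # only graph/truth entries reference a schema
        if schema_name:
            doc = json.loads(path.read_text(encoding="utf-8"))
            validator = Draft202012Validator(load_schema(schema_name))
            errors.extend(f"{case['id']}/{name}: {e.message}" for e in validator.iter_errors(doc))
    return errors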

Acceptance criteria

  • Broken paths / invalid JSON produce a clear error message and non-zero exit code.
  • CI job calls reachbench validate-dataset on every push.

7. Documentation

Goal: Make it trivial for outsiders to use the benchmark.

7.1 README.md

  • Overview:

    • What the benchmark is.
    • What it measures (reachability precision/recall).
  • Quickstart:

    git clone ...
    cd stellaops-reachability-benchmark
    
    # Validate dataset
    reachbench validate-dataset --dataset-root ./dataset
    
    # Run baselines
    reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive
    
    # Score baselines
    reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
    

7.2 docs/HOWTO.md

  • Step-by-step:

    • Installing harness.
    • Running your own tool on the dataset.
    • Formatting your results/.
    • Running reachbench score.
    • Interpreting scores.json.

7.3 docs/SCHEMA.md

  • Human-readable description of:

    • graph JSON
    • truth JSON
    • results JSON
    • scores JSON
  • Link to actual JSON Schemas.

7.4 docs/REPRODUCIBILITY.md

  • Explain:

    • lockfile design
    • hashing rules
    • deterministic scoring and --repeat flag
    • how to verify you're using the exact same dataset.

7.5 docs/SANITIZATION.md

  • Rules for adding new cases:

    • Only use OSS or properly licensed code.
    • Strip secrets / proprietary paths / user data.
    • How to confirm nothing sensitive is in package tarballs.

Acceptance criteria

  • A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following docs only.
  • All example commands work as written.

8. CI/CD details

Goal: Keep repo healthy and ensure determinism.

CI jobs (GitHub Actions)

  1. lint

    • Run ruff / flake8 (your choice).
  2. test

    • Run pytest.
  3. validate-dataset

    • Run reachbench validate-dataset --dataset-root ./dataset.
  4. determinism

    • Small workflow step:

      • Run reachbench score on a tiny test dataset with --repeat 3.
      • Assert success.
  5. docker-build

    • docker build the harness image.

Acceptance criteria

  • All jobs green on main.
  • PRs show failing status if schemas or determinism break.

9. Rough “epics → stories” breakdown

You can paste something like this straight into Jira/Linear:

  1. Epic: Repo bootstrap & CI

    • Story: Create repo skeleton & Python project
    • Story: Add Dockerfile & basic CI (lint + tests)
  2. Epic: Schemas & dataset plumbing

    • Story: Implement truth.schema.json + tests
    • Story: Implement graph.schema.json + tests
    • Story: Implement dataset.schema.json + tests
    • Story: Implement validate-dataset CLI
  3. Epic: Lockfile & determinism

    • Story: Implement lockfile computation + verification
    • Story: Add compute-lockfile & verify-lockfile CLI
    • Story: Add determinism checks in CI
  4. Epic: Scoring harness

    • Story: Define results format + results.schema.json
    • Story: Implement scoring logic (scoring.py)
    • Story: Implement score CLI with --repeat
    • Story: Add unit tests for metrics
  5. Epic: Baselines

    • Story: Implement naive baseline
    • Story: Implement imports-only baseline
    • Story: Implement depth-2 baseline
    • Story: Add run-baseline CLI + tests
  6. Epic: Documentation & polish

    • Story: Write README + HOWTO
    • Story: Write SCHEMA / REPRODUCIBILITY / SANITIZATION docs
    • Story: Final repo cleanup & examples

If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for cli.py and a couple of schemas.