Here’s a clean, action‑ready blueprint for a public reachability benchmark you can stand up quickly and grow over time.
Why this matters (quick)
“Reachability” asks: is a flagged vulnerability actually executable from real entry points in this codebase/container? A public, reproducible benchmark lets you compare tools apples‑to‑apples, drive research, and keep vendors honest.
What to collect (dataset design)
- Projects & languages
  - Polyglot mix: C/C++ (ELF/PE/Mach‑O), Java/Kotlin, C#/.NET, Python, JavaScript/TypeScript, PHP, Go, Rust.
  - For each project: small (≤5k LOC), medium (5–100k), large (100k+).
- Ground‑truth artifacts
  - Seed CVEs with known sinks (e.g., deserializers, command exec, SSRF) and neutral projects with no reachable path (negatives).
  - Exploit oracles: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
- Build outputs (deterministic)
  - Reproducible binaries/bytecode (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
  - SBOM (CycloneDX/SPDX) + PURLs + Build‑ID (ELF .note.gnu.build‑id / PE Authentihash / Mach‑O UUID).
  - Attestations: in‑toto/DSSE envelopes recording toolchain versions, flags, hashes.
- Execution traces (for truth)
  - CI traces: call‑graph dumps from compilers/analyzers; unit‑test coverage; optional dynamic traces (eBPF / .NET ETW / Java Flight Recorder).
  - Entry‑point manifests: HTTP routes, CLI commands, cron/queue consumers.
- Metadata
  - Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.
How to label ground truth
- Per‑vuln case: (component, version, sink_id) with label reachable / unreachable / unknown.
- Evidence bundle: pointer to (a) static call path, (b) dynamic hit (trace/coverage), or (c) rationale for a negative.
- Confidence: high (static + dynamic agree), medium (one source), low (heuristic only).
Scoring (simple + fair)
- Binary classification on cases:
  - Precision, Recall, F1. Report AUC‑PR if you output probabilities.
- Path quality
  - Explainability score (0–3):
    - 0: “vuln reachable” w/o context
    - 1: names only (entry→…→sink)
    - 2: full interprocedural path w/ locations
    - 3: plus inputs/guards (taint/constraints, env flags)
- Runtime cost
  - Wall‑clock, peak RAM, image size; normalized by KLOC.
- Determinism
  - Re‑run variance (≤1% is “A”, 1–5% “B”, >5% “C”).
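A minimal sketch of the determinism grade, assuming “re‑run variance” is interpreted as the relative spread of a headline metric (e.g., F1) across repeated runs of the same tool; the helper name and definition are illustrative:

def determinism_grade(run_scores: list[float]) -> str:
    """Grade re-run variance: <=1% -> A, 1-5% -> B, >5% -> C (spread relative to the mean)."""
    mean = sum(run_scores) / len(run_scores)
    if mean == 0:
        return "A" if max(run_scores) == min(run_scores) else "C"
    spread = (max(run_scores) - min(run_scores)) / mean
    if spread <= 0.01:
        return "A"
    if spread <= 0.05:
        return "B"
    return "C"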
Avoiding overfitting
- Train/Dev/Test splits per language; hidden test projects rotated quarterly.
- Case churn: introduce isomorphic variants (rename symbols, reorder files) to punish memorization.
- Poisoned controls: include decoy sinks and unreachable dead‑code traps.
- Submission rules: require attestations of tool versions & flags; limit per‑case hints.
Reference baselines (to run out‑of‑the‑box)
- Snyk Code/Reachability (JS/Java/Python, SaaS/CLI).
- Semgrep + Pro Engine (rules + reachability mode).
- CodeQL (multi‑lang, LGTM‑style queries).
- Joern (C/C++/JVM code property graphs).
- angr (binary symbolic exec; selective for native samples).
- Language‑specific: pip‑audit w/ import graphs, npm with lock‑tree + route discovery, Maven + call‑graph (Soot/WALA).
Submission format (one JSON per tool run)
{
"tool": {"name": "YourTool", "version": "1.2.3"},
"run": {
"commit": "…",
"platform": "ubuntu:24.04",
"time_s": 182.4, "peak_mb": 3072
},
"cases": [
{
"id": "php-shop:fastjson@1.2.68:Sink#deserialize",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/orders",
"path": [
"OrdersController::create",
"Serializer::deserialize",
"Fastjson::parseObject"
],
"guards": ["feature.flag.json_enabled==true"]
}
}
],
"artifacts": {
"sbom": "sha256:…", "attestation": "sha256:…"
}
}
Folder layout (repo)
/benchmark
/cases/<lang>/<project>/<case_id>/
case.yaml # component@version, sink, labels, evidence refs
entrypoints.yaml # routes/CLIs/cron
build/ # Dockerfiles, lockfiles, pinned toolchains
outputs/ # SBOMs, binaries, traces (checksummed)
/splits/{train,dev,test}.txt
/schemas/{case.json,submission.json}
/scripts/{build.sh, run_tests.sh, score.py}
/docs/ (how-to, FAQs, T&Cs)
Minimal v1 (4–6 weeks of work)
- Languages: JS/TS, Python, Java, C (ELF).
- 20–30 cases: mix of reachable/unreachable with PoC unit tests.
- Deterministic builds in containers; publish SBOM+attestations.
- Scorer: precision/recall/F1 + explainability, runtime, determinism.
- Baselines: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
- Website: static leaderboard (per‑lang, per‑size), download links, submission guide.
V2+ (quarterly)
- Add .NET, PHP, Go, Rust; broaden binary focus (PE/Mach‑O).
- Add dynamic traces (eBPF/ETW/JFR) and taint oracles.
- Introduce config‑gated reachability (feature flags, env, k8s secrets).
- Add dataset cards per case (threat model, CWE, false‑positive traps).
Publishing & governance
- License: CC‑BY‑SA for metadata, source‑compatible OSS for code, binaries under original licenses.
- Repro packs: benchmark-kit.tgz with container recipes, hashes, and attestations.
- Disclosure: CVE hygiene, responsible use, opt‑out path for upstreams.
- Stewards: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.
Immediate next steps (checklist)
- Lock the schemas (case + submission + attestation fields).
- Pick 8 seed projects (2 per language tiered by size).
- Draft 12 sink‑cases (6 reachable, 6 unreachable) with unit‑test oracles.
- Script deterministic builds and hash‑locked SBOMs.
- Implement the scorer; publish a starter leaderboard with 2 baselines.
- Ship v1 website/docs and open submissions.
If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can git clone and start adding cases immediately.
Cool, let’s turn the blueprint into a concrete, developer‑friendly implementation plan.
I’ll assume v1 scope is:
- Languages: JavaScript/TypeScript (Node), Python, Java, C (ELF)
- ~20–30 cases total (reachable/unreachable mix)
- Baselines: CodeQL, Semgrep, maybe Snyk where licenses allow, and angr for a few native cases
You can expand later, but this plan is enough to get v1 shipped.
0. Overall project structure & ownership
Owners
- Tech Lead – owns architecture & final decisions
- Benchmark Core – 2–3 devs building schemas, scorer, infra
- Language Tracks – 1 dev per language (JS, Python, Java, C)
- Website/Docs – 1 dev
Repo layout (target)
reachability-benchmark/
README.md
LICENSE
CONTRIBUTING.md
CODE_OF_CONDUCT.md
benchmark/
cases/
js/
express-blog/
case-001/
case.yaml
entrypoints.yaml
build/
Dockerfile
build.sh
src/ # project source (or submodule)
tests/ # unit tests as oracles
outputs/
sbom.cdx.json
binary.tar.gz
coverage.json
traces/ # optional dynamic traces
py/
flask-api/...
java/
spring-app/...
c/
httpd-like/...
schemas/
case.schema.yaml
entrypoints.schema.yaml
truth.schema.yaml
submission.schema.json
tools/
scorer/
rb_score/
__init__.py
cli.py
metrics.py
loader.py
explainability.py
pyproject.toml
tests/
build/
build_all.py
validate_builds.py
baselines/
codeql/
run_case.sh
config/
semgrep/
run_case.sh
rules/
snyk/
run_case.sh
angr/
run_case.sh
ci/
github/
benchmark.yml
website/
# static site / leaderboard
1. Phase 1 – Repo & infra setup
Task 1.1 – Create repository
Developer: Tech Lead
Deliverables:
- Repo created (reachability-benchmark or similar)
- LICENSE (e.g., Apache-2.0 or MIT)
- Basic README.md describing:
  - Purpose (public reachability benchmark)
  - High‑level design
  - v1 scope (langs, #cases)
Task 1.2 – Bootstrap structure
Developer: Benchmark Core
Create directory skeleton as above (without filling everything yet).
Add:
# benchmark/Makefile
.PHONY: test lint build

test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
Task 1.3 – Coding standards & tooling
Developer: Benchmark Core
- Add .editorconfig, .gitignore, and Python tool configs (ruff, black, or flake8).
- Define a minimal PR checklist in CONTRIBUTING.md:
  - Tests pass
  - Lint passes
  - New schemas have a JSON Schema or YAML schema and tests
  - New cases come with oracles (tests/coverage)
2. Phase 2 – Case & submission schemas
Task 2.1 – Define case metadata format
Developer: Benchmark Core
Create benchmark/schemas/case.schema.yaml and an example case.yaml.
Example case.yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small" # small | medium | large
component:
name: "express-blog"
version: "1.0.0-bench"
vulnerability:
cve: "CVE-XXXX-YYYY"
cwe: "CWE-502"
description: "Unsafe deserialization via user-controlled JSON."
sink_id: "Deserializer::parse"
ground_truth:
label: "reachable" # reachable | unreachable | unknown
confidence: "high" # high | medium | low
evidence_files:
- "truth.yaml"
notes: >
Unit test test_reachable_deserialization triggers the sink.
build:
dockerfile: "build/Dockerfile"
build_script: "build/build.sh"
output:
artifact_path: "outputs/binary.tar.gz"
sbom_path: "outputs/sbom.cdx.json"
coverage_path: "outputs/coverage.json"
traces_dir: "outputs/traces"
environment:
os_image: "ubuntu:24.04"
compiler: null
runtime:
node: "20.11.0"
source_date_epoch: 1730000000
Acceptance criteria
- Schema validates the sample case.yaml with a Python script: benchmark/tools/build/validate_schema.py using jsonschema or pykwalify.
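A minimal sketch of validate_schema.py, assuming the case schema is expressed as JSON Schema stored in YAML and validated with the jsonschema library (pykwalify would work similarly):

#!/usr/bin/env python3
"""Validate a case.yaml against benchmark/schemas/case.schema.yaml (sketch)."""
import sys

import yaml
from jsonschema import Draft202012Validator


def main(schema_path: str, case_path: str) -> int:
    with open(schema_path) as f:
        schema = yaml.safe_load(f)
    with open(case_path) as f:
        case = yaml.safe_load(f)

    errors = list(Draft202012Validator(schema).iter_errors(case))
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"{case_path}: {location}: {err.message}")
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))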
Task 2.2 – Entry points schema
Developer: Benchmark Core
benchmark/schemas/entrypoints.schema.yaml
Example entrypoints.yaml
entries:
http:
- id: "POST /api/posts"
route: "/api/posts"
method: "POST"
handler: "PostsController.create"
cli:
- id: "generate-report"
command: "node cli.js generate-report"
description: "Generates summary report."
scheduled:
- id: "daily-cleanup"
schedule: "0 3 * * *"
handler: "CleanupJob.run"
Task 2.3 – Ground truth / truth schema
Developer: Benchmark Core + Language Tracks
benchmark/schemas/truth.schema.yaml
Example truth.yaml
id: "js-express-blog:001"
cases:
- sink_id: "Deserializer::parse"
label: "reachable"
dynamic_evidence:
covered_by_tests:
- "tests/test_reachable_deserialization.js::should_reach_sink"
coverage_files:
- "outputs/coverage.json"
static_evidence:
call_path:
- "POST /api/posts"
- "PostsController.create"
- "PostsService.createFromJson"
- "Deserializer.parse"
config_conditions:
- "process.env.FEATURE_JSON_ENABLED == 'true'"
notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
Task 2.4 – Submission schema
Developer: Benchmark Core
benchmark/schemas/submission.schema.json
Shape
{
"tool": { "name": "YourTool", "version": "1.2.3" },
"run": {
"commit": "abcd1234",
"platform": "ubuntu:24.04",
"time_s": 182.4,
"peak_mb": 3072
},
"cases": [
{
"id": "js-express-blog:001",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/posts",
"path": [
"PostsController.create",
"PostsService.createFromJson",
"Deserializer.parse"
],
"guards": [
"process.env.FEATURE_JSON_ENABLED === 'true'"
]
}
}
],
"artifacts": {
"sbom": "sha256:...",
"attestation": "sha256:..."
}
}
Write a Python validation utility:
python benchmark/tools/scorer/validate_submission.py submission.json
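A possible shape for validate_submission.py, assuming submission.schema.json is standard JSON Schema (the fixed schema path is an assumption):

#!/usr/bin/env python3
"""Validate a tool submission against benchmark/schemas/submission.schema.json (sketch)."""
import json
import sys
from pathlib import Path

from jsonschema import Draft202012Validator

SCHEMA_PATH = Path("benchmark/schemas/submission.schema.json")


def validate(submission_path: str) -> int:
    schema = json.loads(SCHEMA_PATH.read_text())
    submission = json.loads(Path(submission_path).read_text())

    errors = list(Draft202012Validator(schema).iter_errors(submission))
    for err in errors:
        location = "/".join(str(p) for p in err.absolute_path) or "<root>"
        print(f"{submission_path}: {location}: {err.message}")

    if errors:
        print(f"FAIL: {len(errors)} schema violation(s)")
        return 1
    print("OK: submission is schema-valid")
    return 0


if __name__ == "__main__":
    sys.exit(validate(sys.argv[1]))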
Acceptance criteria
- Validation fails on missing fields / wrong enum values.
- At least two sample submissions pass validation (e.g., “perfect” and “random baseline”).
3. Phase 3 – Reference projects & deterministic builds
Task 3.1 – Select and vendor v1 projects
Developer: Tech Lead + Language Tracks
For each language, choose:
- 1 small toy app (simple web or CLI)
- 1 medium app (more routes, multiple modules)
- Optional: 1 large (for performance stress tests)
Add them under benchmark/cases/<lang>/<project>/src/
(or as git submodules if you want to track upstream).
Task 3.2 – Deterministic Docker build per project
Developer: Language Tracks
For each project:
- Create build/Dockerfile
- Create build/build.sh that:
  - Builds the app
  - Produces artifacts
  - Generates SBOM and attestation
Example build/Dockerfile (Node)
FROM node:20.11-slim
ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000
WORKDIR /app
COPY src/ /app
COPY package.json package-lock.json /app/
RUN npm ci --ignore-scripts && \
npm run build || true
CMD ["node", "server.js"]
Example build.sh
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"
IMAGE_TAG="rb-js-express-blog:1"
docker build -t "$IMAGE_TAG" "$ROOT_DIR/build"
# Export image as tarball (binary artifact)
docker save "$IMAGE_TAG" | gzip > "$OUT_DIR/binary.tar.gz"
# Generate SBOM (e.g. via syft) – can be optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"
# In future: generate in-toto attestations
Task 3.3 – Determinism checker
Developer: Benchmark Core
benchmark/tools/build/validate_builds.py:
- For each case:
  - Run build.sh twice
  - Compare hashes of outputs/binary.tar.gz and outputs/sbom.cdx.json
- Fail if hashes differ.
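A minimal sketch of this determinism check, assuming each case directory contains build/build.sh and writes the two artifacts under outputs/ (the directory glob and helper names are illustrative):

#!/usr/bin/env python3
"""Re-run each case build twice and fail if artifact hashes differ (sketch)."""
import hashlib
import subprocess
import sys
from pathlib import Path

ARTIFACTS = ["outputs/binary.tar.gz", "outputs/sbom.cdx.json"]


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_once(case_dir: Path) -> dict[str, str]:
    subprocess.run(["bash", "build/build.sh"], cwd=case_dir, check=True)
    return {a: sha256(case_dir / a) for a in ARTIFACTS}


def main(cases_root: str) -> int:
    failures = 0
    for case_dir in sorted(Path(cases_root).glob("*/*/*")):
        if not (case_dir / "build" / "build.sh").exists():
            continue
        first, second = build_once(case_dir), build_once(case_dir)
        if first != second:
            print(f"NON-DETERMINISTIC: {case_dir}")
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "benchmark/cases"))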
Acceptance criteria
- All v1 cases produce identical artifacts across two builds on CI.
4. Phase 4 – Ground truth oracles (tests & traces)
Task 4.1 – Add unit/integration tests for reachable cases
Developer: Language Tracks
For each reachable case:
- Add tests/ under the project to:
  - Start the app (if necessary)
  - Send a request/trigger that reaches the vulnerable sink
  - Assert that a sentinel side effect occurs (e.g. a log line or marker file) instead of real exploitation.
Example for Node using Jest with supertest (app and sinkWasReached are exported by the project under test; the import path shown is illustrative):
const request = require("supertest");
const { app, sinkWasReached } = require("../src/app");

test("should reach deserialization sink", async () => {
  const res = await request(app)
    .post("/api/posts")
    .send({ title: "x", body: '{"__proto__":{}}' });
  expect(res.statusCode).toBe(200);
  // Sink logs "REACH_SINK" – we check the sentinel via a helper
  expect(sinkWasReached()).toBe(true);
});
Task 4.2 – Instrument coverage
Developer: Language Tracks
- For each language, pick a coverage tool:
  - JS: nyc + istanbul
  - Python: coverage.py
  - Java: JaCoCo
  - C: gcov / llvm-cov (optional for v1)
- Ensure running the tests produces outputs/coverage.json or .xml, which we then convert to a simple JSON format:
{
"files": {
"src/controllers/posts.js": {
"lines_covered": [12, 13, 14, 27],
"lines_total": 40
}
}
}
Create a small converter script if needed.
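As an example, a converter from coverage.py's JSON report into the simplified format above might look like this (field names follow coverage.py's json report as I understand it; treat them as approximate if your version differs):

#!/usr/bin/env python3
"""Convert a coverage.py JSON report into the benchmark's simplified coverage format (sketch)."""
import json
import sys


def convert(coverage_json_path: str, out_path: str) -> None:
    with open(coverage_json_path) as f:
        report = json.load(f)
    simplified = {"files": {}}
    for path, data in report.get("files", {}).items():
        simplified["files"][path] = {
            # coverage.py lists executed line numbers per file
            "lines_covered": data.get("executed_lines", []),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }
    with open(out_path, "w") as f:
        json.dump(simplified, f, indent=2, sort_keys=True)


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])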
Task 4.3 – Optional dynamic traces
If you want richer evidence:
- JS: add middleware that logs (entry_id, handler, sink) triples to outputs/traces/traces.json
- Python: similar, using decorators (see the sketch below)
- C/Java: out of scope for v1 unless you want to invest extra time.
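A rough sketch of the Python variant, assuming handlers are decorated by hand and the entry id is known at decoration time (record_trace and the file path are illustrative):

import functools
import json
from pathlib import Path

TRACE_FILE = Path("outputs/traces/traces.json")


def record_trace(entry_id: str, sink: str | None = None):
    """Decorator that appends (entry_id, handler, sink) triples to the trace file."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            TRACE_FILE.parent.mkdir(parents=True, exist_ok=True)
            traces = json.loads(TRACE_FILE.read_text()) if TRACE_FILE.exists() else []
            traces.append({"entry": entry_id, "handler": handler.__qualname__, "sink": sink})
            TRACE_FILE.write_text(json.dumps(traces, indent=2))
            return handler(*args, **kwargs)
        return wrapper
    return decorator


# Usage (illustrative Flask-style handler):
# @app.route("/api/posts", methods=["POST"])
# @record_trace(entry_id="POST /api/posts", sink="Deserializer::parse")
# def create_post(): ...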
5. Phase 5 – Scoring tool (CLI)
Task 5.1 – Implement rb-score library + CLI
Developer: Benchmark Core
Create benchmark/tools/scorer/rb_score/ with:
- loader.py
  - Load all case.yaml and truth.yaml files into memory.
  - Provide functions: load_cases() -> Dict[case_id, Case].
- metrics.py
  - Implement:
    - compute_precision_recall(truth, predictions)
    - compute_path_quality_score(explain_block) (0–3)
    - compute_runtime_stats(run_block)
- cli.py
  - CLI:
rb-score \
--cases-root benchmark/cases \
--submission submissions/mytool.json \
--output results/mytool_results.json
Pseudo-code for core scoring
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, truth_case in truth.items():
        gt = truth_case.label  # "reachable" / "unreachable"
        pred_case = find_pred_case(submission.cases, case_id)
        # Missing predictions are treated as "unreachable"
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)
        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)
    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
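compute_prf above can be a small dependency-free helper; a sketch matching the boolean lists built in the pseudo-code (the 0.0 convention for undefined values is an assumption the scorer must document):

def compute_prf(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, F1 for binary reachability labels.

    By convention, values are reported as 0.0 when undefined
    (no predicted or no actual positives).
    """
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1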
Task 5.2 – Explainability scoring rules
Developer: Benchmark Core
Implement explainability(explain):
- 0 – explain missing or path empty
- 1 – path present with at least 2 nodes (sink + one function)
- 2 – path contains:
  - An entry label (HTTP route / CLI id)
  - ≥3 nodes (entry → … → sink)
- 3 – Level 2 plus a non-empty guards list
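A direct implementation of these rules (field names follow the submission schema above; the treatment of a single-node path is a convention, since the rules leave it unspecified):

def explainability(explain: dict | None) -> int:
    """Score an explain block 0-3 per the rules above."""
    if not explain or not explain.get("path"):
        return 0                            # rule 0: explain missing or path empty

    path = explain["path"]
    has_entry = bool(explain.get("entry"))
    has_guards = bool(explain.get("guards"))

    if has_entry and len(path) >= 3:        # rule 2: entry label + >=3 nodes
        return 3 if has_guards else 2       # rule 3: plus non-empty guards
    if len(path) >= 2:                      # rule 1: sink + at least one function
        return 1
    return 0                                # single-node path: no usable chain (convention)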
Unit tests for at least 4 scenarios.
Task 5.3 – Regression tests for scoring
Add a small test fixture:
- Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
- 3 submissions:
  - Perfect
  - All reachable
  - All unreachable
Assertions:
- Perfect: precision=1, recall=1
- All reachable: recall=1, precision<1 (2 of 3 predicted positives are correct)
- All unreachable: recall=0; precision is undefined (no positive predictions), so assert whichever convention the scorer adopts (e.g. 0).
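A pytest sketch of these assertions, assuming the compute_prf sketch above lives in rb_score.metrics and uses the 0.0 convention for undefined precision:

import pytest

from rb_score.metrics import compute_prf

# Ground truth for the synthetic fixture: 2 reachable, 1 unreachable
Y_TRUE = [True, True, False]


def test_perfect_submission():
    precision, recall, f1 = compute_prf(Y_TRUE, [True, True, False])
    assert precision == 1.0 and recall == 1.0 and f1 == 1.0


def test_all_reachable_submission():
    precision, recall, _ = compute_prf(Y_TRUE, [True, True, True])
    assert recall == 1.0
    assert precision == pytest.approx(2 / 3)


def test_all_unreachable_submission():
    precision, recall, _ = compute_prf(Y_TRUE, [False, False, False])
    assert recall == 0.0
    assert precision == 0.0  # scorer convention when there are no positive predictions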
6. Phase 6 – Baseline integrations
Task 6.1 – Semgrep baseline
Developer: Benchmark Core (with Semgrep experience)
- baselines/semgrep/run_case.sh:
  - Inputs: case_id, cases_root, output_path
  - Steps:
    - Find src/ for the case
    - Run semgrep --config auto or curated rules
    - Convert Semgrep findings into the benchmark submission format (see the sketch below):
      - Map Semgrep rules → vulnerability types → candidate sinks
      - Heuristically guess reachability (for v1, maybe always “reachable” if the sink appears in a code path)
  - Output: output_path JSON conforming to submission.schema.json.
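A minimal converter sketch, assuming Semgrep is invoked with --json (the rule-to-sink mapping, confidence value, and run metadata are placeholders you would curate per case):

#!/usr/bin/env python3
"""Turn `semgrep --json` output into a benchmark submission (sketch)."""
import json
import sys

# Curated mapping from Semgrep rule ids to benchmark sink ids (placeholder entry)
RULE_TO_SINK = {
    "example.placeholder.rule-id": "Deserializer::parse",
}


def convert(semgrep_json_path: str, case_id: str, out_path: str) -> None:
    with open(semgrep_json_path) as f:
        findings = json.load(f).get("results", [])

    cases = []
    for result in findings:
        sink = RULE_TO_SINK.get(result.get("check_id"))
        if not sink:
            continue
        cases.append({
            "id": case_id,
            # v1 heuristic: any matched sink is reported as reachable
            "prediction": "reachable",
            "confidence": 0.5,
            "explain": {"entry": "unknown", "path": [f"{result.get('path')}:{sink}"], "guards": []},
        })

    submission = {
        "tool": {"name": "semgrep-baseline", "version": "0.1"},
        "run": {"commit": "unknown", "platform": "local", "time_s": 0, "peak_mb": 0},
        "cases": cases,
        "artifacts": {},
    }
    with open(out_path, "w") as f:
        json.dump(submission, f, indent=2)


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2], sys.argv[3])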
Task 6.2 – CodeQL baseline
- Create CodeQL databases for each project (likely via codeql database create).
- Create queries targeting known sinks (e.g., Deserialization, CommandInjection).
- baselines/codeql/run_case.sh:
  - Build DB (or reuse)
  - Run queries
  - Translate results into our submission format (again as heuristic reachability).
Task 6.3 – Optional Snyk / angr baselines
- Snyk:
  - Use snyk test on the project
  - Map results to dependencies & known CVEs
  - For v1, just mark a case as reachable if Snyk reports a reachable path (where available).
- angr:
  - For 1–2 small C samples, configure a simple analysis script.
Acceptance criteria
- For at least 5 cases (across languages), the baselines produce valid submission JSON.
- rb-score runs and yields metrics without errors.
7. Phase 7 – CI/CD
Task 7.1 – GitHub Actions workflow
Developer: Benchmark Core
ci/github/benchmark.yml:
Jobs:
- lint-and-test
  - python -m pip install -e "benchmark/tools/scorer[dev]"
  - make lint
  - make test
- build-cases
  - python benchmark/tools/build/build_all.py
  - Run validate_builds.py
- smoke-baselines
  - For 2–3 cases, run the Semgrep/CodeQL wrappers and ensure they emit valid submissions.
Task 7.2 – Artifact upload
- Upload the outputs/ tarball from build-cases as workflow artifacts.
- Upload results/*.json from scoring runs.
8. Phase 8 – Website & leaderboard
Task 8.1 – Define results JSON format
Developer: Benchmark Core + Website dev
results/leaderboard.json:
{
"tools": [
{
"name": "Semgrep",
"version": "1.60.0",
"summary": {
"precision": 0.72,
"recall": 0.48,
"f1": 0.58
},
"by_language": {
"javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
"python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
}
}
]
}
CLI option to generate this:
rb-score compare \
--cases-root benchmark/cases \
--submissions submissions/*.json \
--output results/leaderboard.json
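Internally, rb-score compare can be a small aggregation step over per-tool result files; a sketch, assuming each results file records the tool block and a per-language breakdown (both assumptions about the results layout):

#!/usr/bin/env python3
"""Merge per-tool result files into results/leaderboard.json (sketch)."""
import glob
import json
import sys


def build_leaderboard(results_glob: str, out_path: str) -> None:
    tools = []
    for path in sorted(glob.glob(results_glob)):
        with open(path) as f:
            result = json.load(f)
        tools.append({
            "name": result["tool"]["name"],
            "version": result["tool"]["version"],
            "summary": result["summary"],
            # Assumes the scorer already grouped per-case metrics by language
            "by_language": result.get("by_language", {}),
        })
    with open(out_path, "w") as f:
        json.dump({"tools": tools}, f, indent=2)


if __name__ == "__main__":
    build_leaderboard(sys.argv[1], sys.argv[2])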
Task 8.2 – Static site
Developer: Website dev
Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML+JS).
Pages:
- Home
  - What is reachability?
  - Summary of the benchmark
- Leaderboard
  - Renders leaderboard.json
  - Filters: language, case size
- Docs
  - How to run the benchmark locally
  - How to prepare a submission
Add a simple script to copy results/leaderboard.json into website/public/ for publishing.
9. Phase 9 – Docs, governance, and contribution flow
Task 9.1 – CONTRIBUTING.md
Include:
- How to add a new case, step by step:
  1. Create a project folder under benchmark/cases/<lang>/<project>/case-XXX/
  2. Add case.yaml, entrypoints.yaml, truth.yaml
  3. Add oracles (tests, coverage)
  4. Add deterministic build/ assets
  5. Run local tooling:
     - validate_schema.py
     - validate_builds.py --case <id>
- Example PR description template.
Task 9.2 – Governance doc
- Define Technical Advisory Committee (TAC) roles:
  - Approve new cases
  - Approve schema changes
  - Manage hidden test sets (future phase)
- Define the release cadence:
  - v1.0 with public cases
  - Quarterly updates with new hidden cases.
10. Suggested milestone breakdown (for planning / sprints)
Milestone 1 – Foundation (1–2 sprints)
- Repo scaffolding (Tasks 1.x)
- Schemas (Tasks 2.x)
- Two tiny toy cases (one JS, one Python) with:
  - case.yaml, entrypoints.yaml, truth.yaml
  - Deterministic build
  - Basic unit tests
- Minimal rb-score with:
  - Case loading
  - Precision/recall only
Exit: You can run rb-score on a dummy submission for 2 cases.
Milestone 2 – v1 dataset (2–3 sprints)
- Add ~20–30 cases across JS, Python, Java, C
- Ground truth & coverage for each
- Deterministic builds validated
- Explainability scoring implemented
- Regression tests for rb-score
Exit: Full scoring tool stable; dataset repeatably builds on CI.
Milestone 3 – Baselines & site (1–2 sprints)
- Semgrep + CodeQL baselines producing valid submissions
- CI running smoke baselines
- leaderboard.json generator
- Static website with public leaderboard and docs
Exit: Public v1 benchmark you can share with external tool authors.
If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you’re on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact pyproject.toml for rb-score).