Here's a clean, action-ready blueprint for a **public reachability benchmark** you can stand up quickly and grow over time.
# Why this matters (quick)
“Reachability” asks: *is a flagged vulnerability actually executable from real entry points in this codebase/container?* A public, reproducible benchmark lets you compare tools apples-to-apples, drive research, and keep vendors honest.
# What to collect (dataset design)
* **Projects & languages**
* Polyglot mix: **C/C++ (ELF/PE/Mach-O)**, **Java/Kotlin**, **C#/.NET**, **Python**, **JavaScript/TypeScript**, **PHP**, **Go**, **Rust**.
* For each project: small (≤5k LOC), medium (5-100k), large (100k+).
* **Ground-truth artifacts**
* **Seed CVEs** with known sinks (e.g., deserializers, command exec, SSRF) and **neutral projects** with *no* reachable path (negatives).
* **Exploit oracles**: minimal PoCs or unit tests that (1) reach the sink and (2) toggle reachability via feature flags.
* **Build outputs (deterministic)**
* **Reproducible binaries/bytecode** (strip timestamps; fixed seeds; SOURCE_DATE_EPOCH).
* **SBOM** (CycloneDX/SPDX) + **PURLs** + **Build ID** (ELF .note.gnu.build-id / PE Authentihash / Mach-O UUID).
* **Attestations**: in-toto/DSSE envelopes recording toolchain versions, flags, hashes.
* **Execution traces (for truth)**
* **CI traces**: call-graph dumps from compilers/analyzers; unit-test coverage; optional **dynamic traces** (eBPF/.NET ETW/Java Flight Recorder).
* **Entry-point manifests**: HTTP routes, CLI commands, cron/queue consumers.
* **Metadata**
* Language, framework, package manager, compiler versions, OS/container image, optimization level, stripping info, license.
# How to label ground truth
* **Per-vuln case**: `(component, version, sink_id)` with label **reachable / unreachable / unknown**.
* **Evidence bundle**: pointer to (a) static call path, (b) dynamic hit (trace/coverage), or (c) rationale for negative.
* **Confidence**: high (static+dynamic agree), medium (one source), low (heuristic only).
# Scoring (simple + fair)
* **Binary classification** on cases:
* Precision, Recall, F1. Report **AUPR** if you output probabilities.
* **Path quality**
* **Explainability score (0-3)**:
* 0: “vuln reachable” w/o context
* 1: names only (entry→…→sink)
* 2: full interprocedural path w/ locations
* 3: plus **inputs/guards** (taint/constraints, env flags)
* **Runtime cost**
* Wall-clock, peak RAM, image size; normalized by KLOC.
* **Determinism**
* Rerun variance (≤1% is “A”, 1-5% “B”, >5% “C”).
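To make the runtime and determinism rules concrete, here is a minimal Python sketch; the thresholds mirror the bullets above, and the function and field names are illustrative only, not part of the benchmark:

```python
def runtime_per_kloc(time_s: float, kloc: float) -> float:
    """Wall-clock seconds normalized by thousands of lines of code."""
    return time_s / max(kloc, 0.001)


def determinism_grade(run_a: dict, run_b: dict) -> str:
    """Grade rerun variance as the fraction of cases whose prediction changed.

    run_a / run_b map case_id -> predicted label ("reachable"/"unreachable").
    Thresholds follow the list above: <=1% -> "A", 1-5% -> "B", >5% -> "C".
    """
    case_ids = set(run_a) | set(run_b)
    if not case_ids:
        return "A"
    changed = sum(1 for cid in case_ids if run_a.get(cid) != run_b.get(cid))
    variance = changed / len(case_ids)
    if variance <= 0.01:
        return "A"
    if variance <= 0.05:
        return "B"
    return "C"
```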
# Avoiding overfitting
* **Train/Dev/Test** splits per language; **hidden test** projects rotated quarterly.
* **Case churn**: introduce **isomorphic variants** (rename symbols, reorder files) to punish memorization.
* **Poisoned controls**: include decoy sinks and unreachable dead-code traps.
* **Submission rules**: require **attestations** of tool versions & flags; limit per-case hints.
# Reference baselines (to run out-of-the-box)
* **Snyk Code/Reachability** (JS/Java/Python, SaaS/CLI).
* **Semgrep + Pro Engine** (rules + reachability mode).
* **CodeQL** (multi-lang, LGTM-style queries).
* **Joern** (C/C++/JVM code property graphs).
* **angr** (binary symbolic exec; selective for native samples).
* **Language-specific**: pip-audit w/ import graphs, npm with lock-tree + route discovery, Maven + call-graph (Soot/WALA).
# Submission format (one JSON per tool run)
```json
{
"tool": {"name": "YourTool", "version": "1.2.3"},
"run": {
"commit": "…",
"platform": "ubuntu:24.04",
"time_s": 182.4, "peak_mb": 3072
},
"cases": [
{
"id": "php-shop:fastjson@1.2.68:Sink#deserialize",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/orders",
"path": [
"OrdersController::create",
"Serializer::deserialize",
"Fastjson::parseObject"
],
"guards": ["feature.flag.json_enabled==true"]
}
}
],
"artifacts": {
"sbom": "sha256:…", "attestation": "sha256:…"
}
}
```
# Folder layout (repo)
```
/benchmark
/cases/<lang>/<project>/<case_id>/
case.yaml # component@version, sink, labels, evidence refs
entrypoints.yaml # routes/CLIs/cron
build/ # Dockerfiles, lockfiles, pinned toolchains
outputs/ # SBOMs, binaries, traces (checksummed)
/splits/{train,dev,test}.txt
/schemas/{case.json,submission.json}
/scripts/{build.sh, run_tests.sh, score.py}
/docs/ (how-to, FAQs, T&Cs)
```
# Minimal **v1** (4-6 weeks of work)
1. **Languages**: JS/TS, Python, Java, C (ELF).
2. **20-30 cases**: mix of reachable/unreachable with PoC unit tests.
3. **Deterministic builds** in containers; publish SBOM+attestations.
4. **Scorer**: precision/recall/F1 + explainability, runtime, determinism.
5. **Baselines**: run CodeQL + Semgrep across all; Snyk where feasible; angr for 3 native cases.
6. **Website**: static leaderboard (per-language, per-size), download links, submission guide.
# V2+ (quarterly)
* Add **.NET, PHP, Go, Rust**; broaden binary focus (PE/Mach-O).
* Add **dynamic traces** (eBPF/ETW/JFR) and **taint oracles**.
* Introduce **config-gated reachability** (feature flags, env, k8s secrets).
* Add **dataset cards** per case (threat model, CWE, false-positive traps).
# Publishing & governance
* License: **CC-BY-SA** for metadata, **source-compatible OSS** for code, binaries under original licenses.
* **Repro packs**: `benchmark-kit.tgz` with container recipes, hashes, and attestations.
* **Disclosure**: CVE hygiene, responsible use, opt-out path for upstreams.
* **Stewards**: small TAC (you + two external reviewers) to approve new cases and adjudicate disputes.
# Immediate next steps (checklist)
* Lock the **schemas** (case + submission + attestation fields).
* Pick 8 seed projects (2 per language tiered by size).
* Draft 12 sink cases (6 reachable, 6 unreachable) with unit-test oracles.
* Script deterministic builds and **hash-locked SBOMs**.
* Implement the scorer; publish a **starter leaderboard** with 2 baselines.
* Ship **v1 website/docs** and open submissions.
If you want, I can generate the repo scaffold (folders, YAML/JSON schemas, Dockerfiles, scorer script) so your team can `git clone` and start adding cases immediately.
Cool, let's turn the blueprint into a concrete, developer-friendly implementation plan.
I'll assume **v1 scope** is:
* Languages: **JavaScript/TypeScript (Node)**, **Python**, **Java**, **C (ELF)**
* ~**20-30 cases** total (reachable/unreachable mix)
* Baselines: **CodeQL**, **Semgrep**, maybe **Snyk** where licenses allow, and **angr** for a few native cases
You can expand later, but this plan is enough to get v1 shipped.
---
## 0. Overall project structure & ownership
**Owners**
* **Tech Lead** - owns architecture & final decisions
* **Benchmark Core** - 2-3 devs building schemas, scorer, infra
* **Language Tracks** - 1 dev per language (JS, Python, Java, C)
* **Website/Docs** - 1 dev
**Repo layout (target)**
```text
reachability-benchmark/
README.md
LICENSE
CONTRIBUTING.md
CODE_OF_CONDUCT.md
benchmark/
cases/
js/
express-blog/
case-001/
case.yaml
entrypoints.yaml
build/
Dockerfile
build.sh
src/ # project source (or submodule)
tests/ # unit tests as oracles
outputs/
sbom.cdx.json
binary.tar.gz
coverage.json
traces/ # optional dynamic traces
py/
flask-api/...
java/
spring-app/...
c/
httpd-like/...
schemas/
case.schema.yaml
entrypoints.schema.yaml
truth.schema.yaml
submission.schema.json
tools/
scorer/
rb_score/
__init__.py
cli.py
metrics.py
loader.py
explainability.py
pyproject.toml
tests/
build/
build_all.py
validate_builds.py
baselines/
codeql/
run_case.sh
config/
semgrep/
run_case.sh
rules/
snyk/
run_case.sh
angr/
run_case.sh
ci/
github/
benchmark.yml
website/
# static site / leaderboard
```
---
## 1. Phase 1 - Repo & infra setup
### Task 1.1 - Create repository
**Developer:** Tech Lead
**Deliverables:**
* Repo created (`reachability-benchmark` or similar)
* `LICENSE` (e.g., Apache-2.0 or MIT)
* Basic `README.md` describing:
* Purpose (public reachability benchmark)
* High-level design
* v1 scope (langs, #cases)
### Task 1.2 - Bootstrap structure
**Developer:** Benchmark Core
Create directory skeleton as above (without filling everything yet).
Add:
```make
# benchmark/Makefile
.PHONY: test lint build

# Recipe lines must be indented with real tabs (Make requires tabs).
test:
	pytest benchmark/tools/scorer/tests

lint:
	black benchmark/tools/scorer
	flake8 benchmark/tools/scorer

build:
	python benchmark/tools/build/build_all.py
```
### Task 1.3 - Coding standards & tooling
**Developer:** Benchmark Core
* Add `.editorconfig`, `.gitignore`, and Python tool configs (`ruff`, `black`, or `flake8`).
* Define minimal **PR checklist** in `CONTRIBUTING.md`:
* Tests pass
* Lint passes
* New schemas have JSON schema or YAML schema and tests
* New cases come with oracles (tests/coverage)
---
## 2. Phase 2 - Case & submission schemas
### Task 2.1 - Define case metadata format
**Developer:** Benchmark Core
Create `benchmark/schemas/case.schema.yaml` and an example `case.yaml`.
**Example `case.yaml`**
```yaml
id: "js-express-blog:001"
language: "javascript"
framework: "express"
size: "small" # small | medium | large
component:
name: "express-blog"
version: "1.0.0-bench"
vulnerability:
cve: "CVE-XXXX-YYYY"
cwe: "CWE-502"
description: "Unsafe deserialization via user-controlled JSON."
sink_id: "Deserializer::parse"
ground_truth:
label: "reachable" # reachable | unreachable | unknown
confidence: "high" # high | medium | low
evidence_files:
- "truth.yaml"
notes: >
Unit test test_reachable_deserialization triggers the sink.
build:
dockerfile: "build/Dockerfile"
build_script: "build/build.sh"
output:
artifact_path: "outputs/binary.tar.gz"
sbom_path: "outputs/sbom.cdx.json"
coverage_path: "outputs/coverage.json"
traces_dir: "outputs/traces"
environment:
os_image: "ubuntu:24.04"
compiler: null
runtime:
node: "20.11.0"
source_date_epoch: 1730000000
```
**Acceptance criteria**
* Schema validates sample `case.yaml` with a Python script:
* `benchmark/tools/build/validate_schema.py` using `jsonschema` or `pykwalify`.
---
### Task 2.2 - Entry points schema
**Developer:** Benchmark Core
`benchmark/schemas/entrypoints.schema.yaml`
**Example `entrypoints.yaml`**
```yaml
entries:
http:
- id: "POST /api/posts"
route: "/api/posts"
method: "POST"
handler: "PostsController.create"
cli:
- id: "generate-report"
command: "node cli.js generate-report"
description: "Generates summary report."
scheduled:
- id: "daily-cleanup"
schedule: "0 3 * * *"
handler: "CleanupJob.run"
```
---
### Task 2.3 - Ground truth / truth schema
**Developer:** Benchmark Core + Language Tracks
`benchmark/schemas/truth.schema.yaml`
**Example `truth.yaml`**
```yaml
id: "js-express-blog:001"
cases:
- sink_id: "Deserializer::parse"
label: "reachable"
dynamic_evidence:
covered_by_tests:
- "tests/test_reachable_deserialization.js::should_reach_sink"
coverage_files:
- "outputs/coverage.json"
static_evidence:
call_path:
- "POST /api/posts"
- "PostsController.create"
- "PostsService.createFromJson"
- "Deserializer.parse"
config_conditions:
- "process.env.FEATURE_JSON_ENABLED == 'true'"
notes: "If FEATURE_JSON_ENABLED=false, path is unreachable."
```
---
### Task 2.4 - Submission schema
**Developer:** Benchmark Core
`benchmark/schemas/submission.schema.json`
**Shape**
```json
{
"tool": { "name": "YourTool", "version": "1.2.3" },
"run": {
"commit": "abcd1234",
"platform": "ubuntu:24.04",
"time_s": 182.4,
"peak_mb": 3072
},
"cases": [
{
"id": "js-express-blog:001",
"prediction": "reachable",
"confidence": 0.88,
"explain": {
"entry": "POST /api/posts",
"path": [
"PostsController.create",
"PostsService.createFromJson",
"Deserializer.parse"
],
"guards": [
"process.env.FEATURE_JSON_ENABLED === 'true'"
]
}
}
],
"artifacts": {
"sbom": "sha256:...",
"attestation": "sha256:..."
}
}
```
Write Python validation utility:
```bash
python benchmark/tools/scorer/validate_submission.py submission.json
```
**Acceptance criteria**
* Validation fails on missing fields / wrong enum values.
* At least two sample submissions pass validation (e.g., “perfect” and “random baseline”).
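A minimal sketch of `validate_submission.py`, assuming the schema sits at `benchmark/schemas/submission.schema.json` and using the `jsonschema` package; paths and error reporting are up to the implementer:

```python
#!/usr/bin/env python3
"""Validate a tool submission against submission.schema.json (sketch)."""
import json
import sys
from pathlib import Path

from jsonschema import Draft202012Validator

SCHEMA_PATH = Path("benchmark/schemas/submission.schema.json")  # assumed location


def main() -> int:
    submission_path = Path(sys.argv[1])
    schema = json.loads(SCHEMA_PATH.read_text())
    submission = json.loads(submission_path.read_text())

    validator = Draft202012Validator(schema)
    errors = list(validator.iter_errors(submission))
    for err in errors:
        location = "/".join(str(p) for p in err.path) or "<root>"
        print(f"ERROR at {location}: {err.message}")

    if errors:
        return 1
    print(f"{submission_path} is valid.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```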
---
## 3. Phase 3 - Reference projects & deterministic builds
### Task 3.1 - Select and vendor v1 projects
**Developer:** Tech Lead + Language Tracks
For each language, choose:
* 1 small toy app (simple web or CLI)
* 1 medium app (more routes, multiple modules)
* Optional: 1 large (for performance stress tests)
Add them under `benchmark/cases/<lang>/<project>/src/`
(or as git submodules if you want to track upstream).
---
### Task 3.2 - Deterministic Docker build per project
**Developer:** Language Tracks
For each project:
* Create `build/Dockerfile`
* Create `build/build.sh` that:
* Builds the app
* Produces artifacts
* Generates SBOM and attestation
**Example `build/Dockerfile` (Node)**
```dockerfile
FROM node:20.11-slim

ENV NODE_ENV=production
ENV SOURCE_DATE_EPOCH=1730000000

WORKDIR /app

# Install pinned dependencies first (cache-friendly, reproducible via the lockfile).
COPY package.json package-lock.json /app/
RUN npm ci --ignore-scripts

# Then copy the application source.
COPY src/ /app/

# Some sample apps have no build script; tolerate that for now.
RUN npm run build || true

CMD ["node", "server.js"]
```
**Example `build.sh`**
```bash
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(dirname "$(readlink -f "$0")")/.."
OUT_DIR="$ROOT_DIR/outputs"
mkdir -p "$OUT_DIR"

IMAGE_TAG="rb-js-express-blog:1"

# Build from the case root so the Dockerfile can COPY src/ and the lockfiles.
docker build -f "$ROOT_DIR/build/Dockerfile" -t "$IMAGE_TAG" "$ROOT_DIR"

# Export image as tarball (binary artifact); gzip -n drops the timestamp for reproducibility.
docker save "$IMAGE_TAG" | gzip -n > "$OUT_DIR/binary.tar.gz"

# Generate SBOM (e.g. via syft) - can be an optional stub for v1
syft packages "docker:$IMAGE_TAG" -o cyclonedx-json > "$OUT_DIR/sbom.cdx.json"

# In future: generate in-toto attestations
```
---
### Task 3.3 - Determinism checker
**Developer:** Benchmark Core
`benchmark/tools/build/validate_builds.py`:
* For each case:
* Run `build.sh` twice
* Compare hashes of `outputs/binary.tar.gz` and `outputs/sbom.cdx.json`
* Fail if hashes differ.
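A minimal sketch, assuming every case ships a `build/build.sh` and writes the two artifacts above into `outputs/`; argument handling and reporting are illustrative:

```python
#!/usr/bin/env python3
"""Determinism checker (sketch): build each case twice and compare artifact hashes."""
import hashlib
import subprocess
import sys
from pathlib import Path

ARTIFACTS = ["outputs/binary.tar.gz", "outputs/sbom.cdx.json"]


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_once(case_dir: Path) -> dict:
    subprocess.run(["bash", "build/build.sh"], cwd=case_dir, check=True)
    return {name: sha256(case_dir / name) for name in ARTIFACTS}


def check_case(case_dir: Path) -> bool:
    first, second = build_once(case_dir), build_once(case_dir)
    for name in ARTIFACTS:
        if first[name] != second[name]:
            print(f"NON-DETERMINISTIC {case_dir}: {name} hash changed between builds")
            return False
    print(f"OK {case_dir}")
    return True


if __name__ == "__main__":
    results = [check_case(Path(p)) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```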
**Acceptance criteria**
* All v1 cases produce identical artifacts across two builds on CI.
---
## 4. Phase 4 - Ground truth oracles (tests & traces)
### Task 4.1 - Add unit/integration tests for reachable cases
**Developer:** Language Tracks
For each **reachable** case:
* Add `tests/` under the project to:
* Start the app (if necessary)
* Send a request/trigger that reaches the vulnerable sink
* Assert that a sentinel side effect occurs (e.g. log or marker file) instead of real exploitation.
Example for Node using Jest:
```js
test("should reach deserialization sink", async () => {
const res = await request(app)
.post("/api/posts")
.send({ title: "x", body: '{"__proto__":{}}' });
expect(res.statusCode).toBe(200);
// Sink logs "REACH_SINK" we check log or variable
expect(sinkWasReached()).toBe(true);
});
```
### Task 4.2 - Instrument coverage
**Developer:** Language Tracks
* For each language, pick a coverage tool:
* JS: `nyc` + `istanbul`
* Python: `coverage.py`
* Java: `jacoco`
* C: `gcov`/`llvm-cov` (optional for v1)
* Ensure running tests produces `outputs/coverage.json` or `.xml` that we then convert to a simple JSON format:
```json
{
"files": {
"src/controllers/posts.js": {
"lines_covered": [12, 13, 14, 27],
"lines_total": 40
}
}
}
```
Create a small converter script if needed.
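For Python cases, such a converter could look like the sketch below, assuming the report comes from `coverage json` (coverage.py) with its standard `files` / `executed_lines` / `summary.num_statements` fields; other languages would need their own adapters:

```python
#!/usr/bin/env python3
"""Convert a coverage.py JSON report into the benchmark's simple coverage format (sketch)."""
import json
import sys


def convert(coverage_py_report: dict) -> dict:
    """Map coverage.py's `files` section to {file: {lines_covered, lines_total}}.

    `lines_total` is approximated by the number of measured statements.
    """
    out = {"files": {}}
    for path, data in coverage_py_report.get("files", {}).items():
        out["files"][path] = {
            "lines_covered": sorted(data.get("executed_lines", [])),
            "lines_total": data.get("summary", {}).get("num_statements", 0),
        }
    return out


if __name__ == "__main__":
    report = json.load(open(sys.argv[1]))
    json.dump(convert(report), open(sys.argv[2], "w"), indent=2)
```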
### Task 4.3 - Optional dynamic traces
If you want richer evidence:
* JS: add middleware that logs `(entry_id, handler, sink)` triples to `outputs/traces/traces.json`
* Python: similar using decorators
* C/Java: out of scope for v1 unless you want to invest extra time.
---
## 5. Phase 5 - Scoring tool (CLI)
### Task 5.1 - Implement `rb-score` library + CLI
**Developer:** Benchmark Core
Create `benchmark/tools/scorer/rb_score/` with:
* `loader.py`
* Load all `case.yaml`, `truth.yaml` into memory.
* Provide functions: `load_cases() -> Dict[case_id, Case]`.
* `metrics.py`
* Implement:
* `compute_precision_recall(truth, predictions)`
* `compute_path_quality_score(explain_block)` (0-3)
* `compute_runtime_stats(run_block)`
* `cli.py`
* CLI:
```bash
rb-score \
--cases-root benchmark/cases \
--submission submissions/mytool.json \
--output results/mytool_results.json
```
**Pseudo-code for core scoring**
```python
def score_submission(truth, submission):
    y_true = []
    y_pred = []
    per_case_scores = {}

    for case_id, case in truth.items():
        gt = case.label  # reachable / unreachable
        pred_case = find_pred_case(submission.cases, case_id)
        pred_label = pred_case.prediction if pred_case else "unreachable"

        y_true.append(gt == "reachable")
        y_pred.append(pred_label == "reachable")

        explain_score = explainability(pred_case.explain if pred_case else None)
        per_case_scores[case_id] = {
            "gt": gt,
            "pred": pred_label,
            "explainability": explain_score,
        }

    precision, recall, f1 = compute_prf(y_true, y_pred)

    return {
        "summary": {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "num_cases": len(truth),
        },
        "cases": per_case_scores,
    }
```
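`compute_prf` itself can stay dependency-free. A sketch, using the convention that precision is 1.0 when a tool makes no positive predictions (whatever convention you pick, document and test it):

```python
def compute_prf(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, F1 for boolean 'reachable' labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 1.0     # convention: no positive ground truth
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```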
### Task 5.2 - Explainability scoring rules
**Developer:** Benchmark Core
Implement `explainability(explain)`:
* 0 - `explain` missing or `path` empty
* 1 - `path` present with at least 2 nodes (sink + one function)
* 2 - `path` contains:
* Entry label (HTTP route/CLI id)
* ≥3 nodes (entry → … → sink)
* 3 - Level 2 plus a non-empty `guards` list
Unit tests for at least 4 scenarios.
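A sketch of `explainability()` following the rules above; the `explain` shape matches the submission schema, and treating a single-node path as level 0 is a judgment call to settle in the spec:

```python
def explainability(explain) -> int:
    """Score an explain block on the 0-3 scale defined above."""
    if not explain or not explain.get("path"):
        return 0

    path = explain["path"]
    if len(path) < 2:
        return 0  # a lone node names the sink but gives no usable path

    has_entry = bool(explain.get("entry"))
    if not (has_entry and len(path) >= 3):
        return 1

    # Level 2 criteria met; guards promote to level 3.
    return 3 if explain.get("guards") else 2
```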
### Task 5.3 - Regression tests for scoring
Add small test fixture:
* Tiny synthetic benchmark: 3 cases, 2 reachable, 1 unreachable.
* 3 submissions:
* Perfect
* All reachable
* All unreachable
Assertions:
* Perfect: `precision=1, recall=1`
* All reachable: `recall=1, precision<1`
* All unreachable: `recall=0`; precision is undefined with no positive predictions, so assert whichever convention the scorer documents (e.g. `precision=1`)
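One of these regression tests might look like the sketch below, assuming the `rb_score` module layout from Task 5.1 and hypothetical fixture helpers that build the tiny 3-case benchmark:

```python
# benchmark/tools/scorer/tests/test_scoring_regression.py (sketch)
# `score_submission` is assumed to live in rb_score.metrics (see Task 5.1);
# the fixture helpers are hypothetical and would construct the 3-case fixture above.
from rb_score.metrics import score_submission

from .fixtures import tiny_truth, perfect_submission, all_reachable_submission


def test_perfect_submission_scores_one():
    results = score_submission(tiny_truth(), perfect_submission())
    assert results["summary"]["precision"] == 1.0
    assert results["summary"]["recall"] == 1.0


def test_all_reachable_keeps_recall_but_loses_precision():
    results = score_submission(tiny_truth(), all_reachable_submission())
    assert results["summary"]["recall"] == 1.0
    assert results["summary"]["precision"] < 1.0
```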
---
## 6. Phase 6 - Baseline integrations
### Task 6.1 - Semgrep baseline
**Developer:** Benchmark Core (with Semgrep experience)
* `baselines/semgrep/run_case.sh`:
* Inputs: `case_id`, `cases_root`, `output_path`
* Steps:
* Find `src/` for case
* Run `semgrep --config auto` or curated rules
* Convert Semgrep findings into benchmark submission format:
* Map Semgrep rules → vulnerability types → candidate sinks
* Heuristically guess reachability (for v1, maybe always “reachable” if sink in code path)
* Output: `output_path` JSON conforming to `submission.schema.json`.
### Task 6.2 - CodeQL baseline
* Create CodeQL databases for each project (likely via `codeql database create`).
* Create queries targeting known sinks (e.g., `Deserialization`, `CommandInjection`).
* `baselines/codeql/run_case.sh`:
* Build DB (or reuse)
* Run queries
* Translate results into our submission format (again as heuristic reachability).
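A sketch of `baselines/codeql/run_case.sh`, assuming the CodeQL CLI is on `PATH`; `sarif_to_submission.py` is a hypothetical converter you would write for the last step:

```bash
#!/usr/bin/env bash
# Sketch: build a CodeQL DB for one case, run queries, convert SARIF -> submission JSON.
set -euo pipefail

CASE_ID="$1"        # e.g. js-express-blog:001
CASE_DIR="$2"       # path to benchmark/cases/<lang>/<project>/<case_id>
OUTPUT_PATH="$3"    # where the submission JSON should be written

WORK_DIR="$(mktemp -d)"
DB_DIR="$WORK_DIR/codeql-db"
SARIF="$WORK_DIR/results.sarif"
QUERIES="codeql/javascript-queries"   # query pack or .qls suite for the case's language

# 1. Create the database from the case's source tree.
codeql database create "$DB_DIR" \
  --language=javascript \
  --source-root="$CASE_DIR/src"

# 2. Run the queries and emit SARIF.
codeql database analyze "$DB_DIR" "$QUERIES" \
  --format=sarif-latest \
  --output="$SARIF"

# 3. Map SARIF results onto benchmark cases (hypothetical converter script).
python baselines/codeql/sarif_to_submission.py \
  --case-id "$CASE_ID" \
  --sarif "$SARIF" \
  --output "$OUTPUT_PATH"
```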
### Task 6.3 - Optional Snyk / angr baselines
* Snyk:
* Use `snyk test` on the project
* Map results to dependencies & known CVEs
* For v1, just mark as `reachable` if Snyk reports a reachable path (if available).
* angr:
* For 1-2 small C samples, configure a simple analysis script.
**Acceptance criteria**
* For at least 5 cases (across languages), the baselines produce valid submission JSON.
* `rb-score` runs and yields metrics without errors.
---
## 7. Phase 7 - CI/CD
### Task 7.1 - GitHub Actions workflow
**Developer:** Benchmark Core
`ci/github/benchmark.yml`:
Jobs:
1. `lint-and-test`
* `python -m pip install -e benchmark/tools/scorer[dev]`
* `make lint`
* `make test`
2. `build-cases`
* `python benchmark/tools/build/build_all.py`
* Run `validate_builds.py`
3. `smoke-baselines`
* For 2-3 cases, run Semgrep/CodeQL wrappers and ensure they emit valid submissions.
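A minimal sketch of `ci/github/benchmark.yml` along these lines; the runner image, Python version, and smoke-test case ID are placeholders, baseline tool installation is omitted, and the artifact upload step anticipates Task 7.2:

```yaml
name: benchmark

on: [push, pull_request]

jobs:
  lint-and-test:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python -m pip install -e "benchmark/tools/scorer[dev]"
      - run: make -f benchmark/Makefile lint
      - run: make -f benchmark/Makefile test

  build-cases:
    runs-on: ubuntu-24.04
    needs: lint-and-test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: python benchmark/tools/build/build_all.py
      - run: python benchmark/tools/build/validate_builds.py
      - uses: actions/upload-artifact@v4   # anticipates Task 7.2
        with:
          name: case-outputs
          path: benchmark/cases/**/outputs/

  smoke-baselines:
    runs-on: ubuntu-24.04
    needs: build-cases
    steps:
      - uses: actions/checkout@v4
      # Baseline tool installation (CodeQL/Semgrep) is omitted in this sketch.
      - run: ./baselines/codeql/run_case.sh js-express-blog:001 benchmark/cases/js/express-blog/case-001 results/codeql-smoke.json
      - run: python benchmark/tools/scorer/validate_submission.py results/codeql-smoke.json
```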
### Task 7.2 - Artifact upload
* Upload `outputs/` tarball from `build-cases` as workflow artifacts.
* Upload `results/*.json` from scoring runs.
---
## 8. Phase 8 - Website & leaderboard
### Task 8.1 - Define results JSON format
**Developer:** Benchmark Core + Website dev
`results/leaderboard.json`:
```json
{
"tools": [
{
"name": "Semgrep",
"version": "1.60.0",
"summary": {
"precision": 0.72,
"recall": 0.48,
"f1": 0.58
},
"by_language": {
"javascript": {"precision": 0.80, "recall": 0.50, "f1": 0.62},
"python": {"precision": 0.65, "recall": 0.45, "f1": 0.53}
}
}
]
}
```
CLI option to generate this:
```bash
rb-score compare \
--cases-root benchmark/cases \
--submissions submissions/*.json \
--output results/leaderboard.json
```
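A sketch of the `compare` aggregation, assuming each per-tool result file carries the submission's `tool` block plus the `summary` and an optional `by_language` breakdown emitted by the scorer:

```python
#!/usr/bin/env python3
"""Sketch of `rb-score compare`: merge per-tool result files into leaderboard.json."""
import json
import sys
from pathlib import Path


def build_leaderboard(result_paths: list[Path]) -> dict:
    tools = []
    for path in result_paths:
        result = json.loads(path.read_text())
        tools.append({
            "name": result["tool"]["name"],
            "version": result["tool"]["version"],
            "summary": result["summary"],
            # Optional per-language breakdown, if the scorer emitted one.
            "by_language": result.get("by_language", {}),
        })
    # Sort by F1 so the leaderboard reads best-first.
    tools.sort(key=lambda t: t["summary"].get("f1", 0.0), reverse=True)
    return {"tools": tools}


if __name__ == "__main__":
    paths = [Path(p) for p in sys.argv[1:-1]]
    output = Path(sys.argv[-1])
    output.write_text(json.dumps(build_leaderboard(paths), indent=2))
```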
### Task 8.2 - Static site
**Developer:** Website dev
Tech choice: any static framework (Next.js, Astro, Docusaurus, or even pure HTML+JS).
Pages:
* **Home**
* What is reachability?
* Summary of benchmark
* **Leaderboard**
* Renders `leaderboard.json`
* Filters: language, case size
* **Docs**
* How to run benchmark locally
* How to prepare a submission
Add a simple script to copy `results/leaderboard.json` into `website/public/` for publishing.
---
## 9. Phase 9 - Docs, governance, and contribution flow
### Task 9.1 - CONTRIBUTING.md
Include:
* How to add a new case:
* Step-by-step:
1. Create project folder under `benchmark/cases/<lang>/<project>/case-XXX/`
2. Add `case.yaml`, `entrypoints.yaml`, `truth.yaml`
3. Add oracles (tests, coverage)
4. Add deterministic `build/` assets
5. Run local tooling:
* `validate_schema.py`
* `validate_builds.py --case <id>`
* Example PR description template.
### Task 9.2 - Governance doc
* Define **Technical Advisory Committee (TAC)** roles:
* Approve new cases
* Approve schema changes
* Manage hidden test sets (future phase)
* Define **release cadence**:
* v1.0 with public cases
* Quarterly updates with new hidden cases.
---
## 10. Suggested milestone breakdown (for planning / sprints)
### Milestone 1 - Foundation (1-2 sprints)
* Repo scaffolding (Tasks 1.x)
* Schemas (Tasks 2.x)
* Two tiny toy cases (one JS, one Python) with:
* `case.yaml`, `entrypoints.yaml`, `truth.yaml`
* Deterministic build
* Basic unit tests
* Minimal `rb-score` with:
* Case loading
* Precision/recall only
**Exit:** You can run `rb-score` on a dummy submission for 2 cases.
---
### Milestone 2 - v1 dataset (2-3 sprints)
* Add ~20-30 cases across JS, Python, Java, C
* Ground truth & coverage for each
* Deterministic builds validated
* Explainability scoring implemented
* Regression tests for `rb-score`
**Exit:** Full scoring tool stable; dataset repeatably builds on CI.
---
### Milestone 3 - Baselines & site (1-2 sprints)
* Semgrep + CodeQL baselines producing valid submissions
* CI running smoke baselines
* `leaderboard.json` generator
* Static website with public leaderboard and docs
**Exit:** Public v1 benchmark you can share with external tool authors.
---
If you tell me which stack your team prefers for the site (React, plain HTML, SSG, etc.) or which CI you're on, I can adapt this into concrete config files (e.g., a full GitHub Actions workflow, Next.js scaffold, or exact `pyproject.toml` for `rb-score`).