Here’s a concrete, low‑lift way to boost Stella Ops’s visibility and prove your “deterministic, replayable” moat: publish a sanitized subset of reachability graphs as a public benchmark that others can run and score identically.
What this is (plain English)
- You release a small, carefully scrubbed set of packages + SBOMs + VEX + call‑graphs (source & binaries) with ground‑truth reachability labels for a curated list of CVEs.
- You also ship a deterministic scoring harness (container + manifest) so anyone can reproduce the exact scores, byte‑for‑byte.
Why it helps
- Proof of determinism: identical inputs → identical graphs → identical scores.
- Research magnet: gives labs and tool vendors a neutral yardstick; you become “the” benchmark steward.
- Biz impact: easy demo for buyers; lets you publish leaderboards and whitepapers.
Scope (MVP dataset)
- Languages: PHP, JS, Python, plus binary (ELF/PE/Mach‑O) mini-cases.
- Units: 20–30 packages total; 3–6 CVEs per language; 4–6 binary cases (statically & dynamically linked).
- Artifacts per unit:
  - Package tarball(s) or container image digest
  - SBOM (CycloneDX 1.6 + SPDX 3.0.1)
  - VEX (known‑exploited, not‑affected, under‑investigation)
  - Call graph (normalized JSON)
  - Ground truth: list of vulnerable entrypoints/edges considered reachable
  - Determinism manifest: feed URLs + rule hashes + container digests + tool versions
Data model (keep it simple)
- dataset.json: index of cases with content‑addressed URIs (sha256)
- sbom/, vex/, graphs/, truth/ folders mirroring the index
- manifest.lock.json: DSSE‑signed record of:
  - feeder rules, lattice policies, normalizers (name + version + hash)
  - container image digests for each step (scanner/cartographer/normalizer)
  - timestamp + signer (Stella Ops Authority)
Scoring harness (deterministic)
- One Docker image: stellaops/benchmark-harness:&lt;tag&gt;
- Inputs: dataset root + manifest.lock.json
- Outputs:
  - scores.json (precision/recall/F1, per‑case and macro)
  - replay-proof.txt (hashes of every artifact used)
- No-network mode (offline‑first); fails closed if any hash mismatches.
Metrics (clear + auditable)
- Per case: TP/FP/FN for reachable functions (or edges), plus optional sink‑reach verification.
- Aggregates: micro/macro F1; “Determinism Index” (stddev of repeated runs must be 0).
- Repro test: the harness re‑runs N=3 and asserts identical outputs (hash compare).
Sanitization & legal
- Strip any proprietary code/data; prefer OSS with permissive licenses.
- Replace real package registries with local mirrors and pin digests.
- Publish under CC‑BY‑4.0 (data) + Apache‑2.0 (harness). Add a simple contributor license agreement for external case submissions.
Baselines to include (neutral + useful)
- “Naïve reachable” (all functions in package)
- “Imports‑only” (entrypoints that match import graph)
- “Call‑depth‑2” (bounded traversal)
- Your graph engine run with frozen rules from the manifest (as a reference, not a claim of SOTA)
Repository layout (public)
    stellaops-reachability-benchmark/
      dataset/
        dataset.json
        sbom/...
        vex/...
        graphs/...
        truth/...
        manifest.lock.json   (DSSE-signed)
      harness/
        Dockerfile
        runner.py            (CLI)
        schema/              (JSON Schemas for graphs, truth, scores)
      docs/
        HOWTO.md             (5-min run)
        CONTRIBUTING.md
        SANITIZATION.md
      LICENSES/
Docs your team can ship in a day
- HOWTO.md: docker run -v $PWD/dataset:/d -v $PWD/out:/o stellaops/benchmark-harness score /d /o
- SCHEMA.md: JSON Schemas for graph and truth (keep fields minimal: nodes, edges, purls, sinks, evidence).
- REPRODUCIBILITY.md: explains DSSE signatures, lockfile, and offline run.
- LIMITATIONS.md: clarifies scope (no dynamic runtime traces in v1, etc.).
Governance (lightweight)
- Versioned releases: v0.1, v0.2 with changelogs.
- Submission gate: PR template + CI that:
  - validates schemas
  - checks hashes match the lockfile
  - re‑scores and compares to the contributor’s score
- Leaderboard cadence: monthly markdown table regenerated by CI.
Launch plan (2‑week sprint)
- Day 1–2: pick cases; finalize schemas; write SANITIZATION.md.
- Day 3–5: build harness image; implement deterministic runner; freeze manifest.lock.json.
- Day 6–8: produce ground truth; run baselines; generate initial scores.
- Day 9–10: docs + website README; record a 2‑minute demo GIF.
- Day 11–12: legal review + licenses; create issue labels (“good first case”).
- Day 13–14: publish, post on GitHub + LinkedIn; invite Semgrep/Snyk/OSS‑Fuzz folks to submit cases.
Nice‑to‑have (but easy)
- JSON Schema for ground‑truth edges so academics can auto‑ingest.
- Small “unknowns” registry example to show how you annotate unresolved symbols without breaking determinism.
- Binary mini‑lab: stripped vs non‑stripped ELF pair to show your patch‑oracle technique in action (truth labels reflect oracle result).
If you want, I can draft the repo skeleton (folders, placeholder JSON Schemas, a sample manifest.lock.json, and a minimal runner.py CLI) so you can drop it straight into GitHub.
Got you — let’s turn that high‑level idea into something your devs can actually pick up and ship.
Below is a concrete implementation plan for the StellaOps Reachability Benchmark repo: directory structure, components, tasks, and acceptance criteria. You can drop this straight into a ticketing system as epics → stories.
0. Tech assumptions (adjust if needed)
To be specific, I’ll assume:
- Repo: stellaops-reachability-benchmark
- Harness language: Python 3.11+
- Packaging: Docker image for the harness
- Schemas: JSON Schema (Draft 2020-12)
- CI: GitHub Actions
If your stack differs, you can still reuse the structure and acceptance criteria.
1. Repo skeleton & project bootstrap
Goal: Create a minimal but fully wired repo.
Tasks
- Create skeleton
  - Structure:

        stellaops-reachability-benchmark/
          dataset/
            dataset.json
            sbom/
            vex/
            graphs/
            truth/
            packages/
            manifest.lock.json   # initially stub
          harness/
            reachbench/
              __init__.py
              cli.py
              dataset_loader.py
              schemas/
                graph.schema.json
                truth.schema.json
                dataset.schema.json
                scores.schema.json
            tests/
          docs/
            HOWTO.md
            SCHEMA.md
            REPRODUCIBILITY.md
            LIMITATIONS.md
            SANITIZATION.md
          .github/
            workflows/
              ci.yml
          pyproject.toml
          README.md
          LICENSE
          Dockerfile

- Bootstrap Python project
  - pyproject.toml with:
    - reachbench package
    - deps: jsonschema, click or typer, pyyaml, pytest
  - harness/tests/ with a dummy test to ensure CI is green.
- Dockerfile
  - Minimal, pinned versions:

        FROM python:3.11-slim
        WORKDIR /app
        COPY . .
        RUN pip install --no-cache-dir .
        ENTRYPOINT ["reachbench"]

- CI basic pipeline (.github/workflows/ci.yml)
  - Jobs:
    - lint (e.g., ruff or flake8 if you want)
    - test (pytest)
    - build-docker (just to ensure the Dockerfile stays valid)
Acceptance criteria
- pip install . works locally.
- reachbench --help prints CLI help (even if commands are stubs).
- CI passes on the main branch.
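For reference, a minimal cli.py stub along these lines could satisfy the acceptance criteria; this sketch assumes typer (click would look almost identical), and the command bodies stay placeholders until later epics fill them in:

```python
# harness/reachbench/cli.py -- illustrative stub, not the final implementation
import typer

app = typer.Typer(help="StellaOps Reachability Benchmark harness")

@app.command()
def validate_dataset(dataset_root: str = typer.Option(..., "--dataset-root")):
    """Validate all dataset JSON files against their schemas (stub)."""
    typer.echo(f"validate-dataset: {dataset_root} (not implemented yet)")

@app.command()
def score(
    dataset_root: str = typer.Option(..., "--dataset-root"),
    results_root: str = typer.Option(..., "--results-root"),
    out: str = typer.Option(..., "--out"),
):
    """Score participant results against ground truth (stub)."""
    typer.echo("score: not implemented yet")

if __name__ == "__main__":
    app()
```

One way to wire it up: a [project.scripts] entry in pyproject.toml such as reachbench = "reachbench.cli:app" (Typer apps are callable), so reachbench --help works after pip install .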
2. Dataset & schema definitions
Goal: Define all JSON formats and enforce them.
2.1 Define dataset index format (dataset/dataset.json)
File: dataset/dataset.json
Example:
{
"version": "0.1.0",
"cases": [
{
"id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"kind": "source", // "source" | "binary" | "container"
"cves": ["CVE-2023-12345"],
"artifacts": {
"package": {
"path": "packages/php/wordpress-5.8.tar.gz",
"sha256": "…"
},
"sbom": {
"path": "sbom/php/wordpress-5.8.cdx.json",
"format": "cyclonedx-1.6",
"sha256": "…"
},
"vex": {
"path": "vex/php/wordpress-5.8.vex.json",
"format": "csaf-2.0",
"sha256": "…"
},
"graph": {
"path": "graphs/php/wordpress-5.8.graph.json",
"schema": "graph.schema.json",
"sha256": "…"
},
"truth": {
"path": "truth/php/wordpress-5.8.truth.json",
"schema": "truth.schema.json",
"sha256": "…"
}
}
}
]
}
2.2 Define truth schema (harness/reachbench/schemas/truth.schema.json)
Model (conceptual):
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"vulnerable_components": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function", // "function" | "method" | "binary_symbol"
"status": "reachable", // "reachable" | "not_reachable"
"reachable_from": [
{
"entrypoint_id": "web:GET:/foo",
"notes": "HTTP route /foo"
}
],
"evidence": "manual-analysis" // or "unit-test", "patch-oracle"
}
],
"non_vulnerable_components": [
{
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable",
"evidence": "manual-analysis"
}
]
}
Tasks
- Implement JSON Schema capturing:
  - required fields: case_id, vulnerable_components
  - allowed enums for symbol_kind, status, evidence
- Add unit tests that (see the sketch after this list):
  - validate a valid truth file
  - fail on various broken ones (missing case_id, unknown status, etc.)
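A sketch of those tests using jsonschema and pytest; the schema path and the exact failure cases are assumptions, and the real schema must declare the status enums for the second test to trigger:

```python
# harness/tests/test_truth_schema.py -- illustrative sketch
import json
from pathlib import Path

import pytest
from jsonschema import Draft202012Validator, ValidationError

# Assumes tests run from the repo root.
SCHEMA = json.loads(Path("harness/reachbench/schemas/truth.schema.json").read_text())

VALID_TRUTH = {
    "case_id": "php-wordpress-5.8-cve-2023-12345",
    "vulnerable_components": [
        {
            "cve": "CVE-2023-12345",
            "symbol": "wp_ajax_nopriv_some_vuln",
            "symbol_kind": "function",
            "status": "reachable",
            "evidence": "manual-analysis",
        }
    ],
}

def test_valid_truth_passes():
    Draft202012Validator(SCHEMA).validate(VALID_TRUTH)

@pytest.mark.parametrize(
    "mutate",
    [
        lambda d: d.pop("case_id"),  # missing required field
        lambda d: d["vulnerable_components"][0].update(status="maybe"),  # unknown enum
    ],
)
def test_broken_truth_fails(mutate):
    doc = json.loads(json.dumps(VALID_TRUTH))  # cheap deep copy
    mutate(doc)
    with pytest.raises(ValidationError):
        Draft202012Validator(SCHEMA).validate(doc)
```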
2.3 Define graph schema (harness/reachbench/schemas/graph.schema.json)
Model (conceptual):
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"language": "php",
"nodes": [
{
"id": "func:wp_ajax_nopriv_some_vuln",
"symbol": "wp_ajax_nopriv_some_vuln",
"kind": "function",
"purl": "pkg:composer/wordpress/wordpress@5.8"
}
],
"edges": [
{
"from": "func:wp_ajax_nopriv_some_vuln",
"to": "func:wpdb_query",
"kind": "call"
}
],
"entrypoints": [
{
"id": "web:GET:/foo",
"symbol": "some_controller",
"kind": "http_route"
}
]
}
Tasks
- JSON Schema with:
  - nodes[] (id, symbol, kind, optional purl)
  - edges[] (from, to, kind)
  - entrypoints[] (id, symbol, kind)
- Tests: verify a valid graph; invalid ones (missing id, unknown kind) are rejected.
2.4 Dataset index schema (dataset.schema.json)
- JSON Schema describing dataset.json (version string, cases array).
- Tests: validate the example dataset file.
Acceptance criteria
- Running a single command (eventually reachbench validate-dataset) validates all JSON files in dataset/ against the schemas without errors.
- CI fails if any dataset JSON is invalid.
3. Lockfile & determinism manifest
Goal: Implement manifest.lock.json generation and verification.
3.1 Lockfile structure
File: dataset/manifest.lock.json
Example:
{
"version": "0.1.0",
"created_at": "2025-01-15T12:00:00Z",
"dataset": {
"root": "dataset/",
"sha256": "…",
"cases": {
"php-wordpress-5.8-cve-2023-12345": {
"sha256": "…"
}
}
},
"tools": {
"graph_normalizer": {
"name": "stellaops-graph-normalizer",
"version": "1.2.3",
"sha256": "…"
}
},
"containers": {
"scanner_image": "ghcr.io/stellaops/scanner@sha256:…",
"normalizer_image": "ghcr.io/stellaops/normalizer@sha256:…"
},
"signatures": [
{
"type": "dsse",
"key_id": "stellaops-benchmark-key-1",
"signature": "base64-encoded-blob"
}
]
}
(Signatures can be optional in v1, but the structure should be there.)
3.2 lockfile.py module
File: harness/reachbench/lockfile.py
Responsibilities
- Compute deterministic SHA-256 digests of:
  - each case’s artifacts (path → hash from dataset.json)
  - the entire dataset/ tree (sorted traversal)
- Generate a new manifest.lock.json:
  - version (hard-coded constant)
  - created_at (UTC ISO 8601)
  - dataset section with case hashes
- Verification:
  - verify_lockfile(dataset_root, lockfile_path):
    - recompute hashes
    - compare to lockfile.dataset
    - return boolean + list of mismatches
Tasks
- Implement canonical hashing (see the sketch after this list):
  - For text JSON files, normalize with:
    - sorted keys
    - no whitespace
    - UTF‑8 encoding
  - For binaries (packages): hash the raw bytes.
- Implement compute_dataset_hashes(dataset_root):
  - Returns {"cases": {...}, "root_sha256": "…"}.
- Implement write_lockfile(...) and verify_lockfile(...).
- Tests:
  - Two runs over the same dataset produce an identical lockfile (order of cases keys normalized).
  - Changing any artifact file changes the root hash and makes verification fail.
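To make the canonical-hashing rules concrete, a minimal sketch; the flat file map (rather than the per-case grouping from dataset.json) and excluding manifest.lock.json from its own hash are assumptions of this sketch:

```python
# harness/reachbench/lockfile.py -- hashing sketch
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Canonical form for JSON (sorted keys, no whitespace, UTF-8); raw bytes otherwise."""
    if path.suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    else:
        payload = path.read_bytes()
    return hashlib.sha256(payload).hexdigest()

def compute_dataset_hashes(dataset_root: Path) -> dict:
    """Hash every file in sorted traversal order, then hash the canonical listing."""
    files = {
        str(p.relative_to(dataset_root)): file_digest(p)
        for p in sorted(dataset_root.rglob("*"))
        if p.is_file() and p.name != "manifest.lock.json"  # avoid self-reference
    }
    root = hashlib.sha256(
        json.dumps(files, sort_keys=True, separators=(",", ":")).encode("utf-8")
    ).hexdigest()
    return {"files": files, "root_sha256": root}
```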
3.3 CLI commands
Add to cli.py:
- reachbench compute-lockfile --dataset-root ./dataset --out ./dataset/manifest.lock.json
- reachbench verify-lockfile --dataset-root ./dataset --lockfile ./dataset/manifest.lock.json
Acceptance criteria
- reachbench compute-lockfile generates a stable file (byte-for-byte identical across runs).
- reachbench verify-lockfile exits with:
  - code 0 on match
  - non-zero on mismatch (plus a human-readable diff).
4. Scoring harness CLI
Goal: Deterministically score participant results against ground truth.
4.1 Result format (participant output)
Expectation:
Participants provide results/ with one JSON per case:
results/
php-wordpress-5.8-cve-2023-12345.json
js-express-4.17-cve-2022-9999.json
Result file example:
{
"case_id": "php-wordpress-5.8-cve-2023-12345",
"tool_name": "my-reachability-analyzer",
"tool_version": "1.0.0",
"predictions": [
{
"cve": "CVE-2023-12345",
"symbol": "wp_ajax_nopriv_some_vuln",
"symbol_kind": "function",
"status": "reachable"
},
{
"cve": "CVE-2023-12345",
"symbol": "wp_safe_function",
"symbol_kind": "function",
"status": "not_reachable"
}
]
}
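For illustration, a hypothetical helper a participant might use to emit one such file per case (write_result is not part of the harness):

```python
import json
from pathlib import Path

def write_result(out_dir: Path, case_id: str, tool: str, version: str,
                 predictions: list[dict]) -> Path:
    """Write results/<case_id>.json in the expected shape."""
    doc = {
        "case_id": case_id,
        "tool_name": tool,
        "tool_version": version,
        "predictions": predictions,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{case_id}.json"
    path.write_text(json.dumps(doc, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```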
4.2 Scoring model
- Treat scoring as classification over (cve, symbol) pairs.
- For each case:
  - Truth positives: all vulnerable_components with status == "reachable".
  - Truth negatives: everything marked not_reachable (optional in v1).
  - Predictions: all entries with status == "reachable".
- Compute:
  - TP: predicted reachable & truth reachable.
  - FP: predicted reachable but truth says not reachable / unknown.
  - FN: truth reachable but not predicted reachable.
- Metrics:
  - Precision, recall, F1 per case.
  - Macro-averaged metrics across all cases.
4.3 Implementation (scoring.py)
File: harness/reachbench/scoring.py
Functions:
- load_truth(case_truth_path) -> TruthModel
- load_predictions(predictions_path) -> PredictionModel
- compute_case_metrics(truth, preds) -> dict (sketched below)
  - returns: { "case_id": str, "tp": int, "fp": int, "fn": int, "precision": float, "recall": float, "f1": float }
- aggregate_metrics(case_metrics_list) -> dict
  - macro_precision, macro_recall, macro_f1, num_cases.
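A minimal sketch of compute_case_metrics over raw dicts, following the TP/FP/FN rules in 4.2 (a real implementation would take the validated models instead):

```python
def compute_case_metrics(truth: dict, preds: dict) -> dict:
    """Score one case as classification over (cve, symbol) pairs."""
    truth_reachable = {
        (c["cve"], c["symbol"])
        for c in truth["vulnerable_components"]
        if c["status"] == "reachable"
    }
    predicted_reachable = {
        (p["cve"], p["symbol"])
        for p in preds["predictions"]
        if p["status"] == "reachable"
    }
    tp = len(truth_reachable & predicted_reachable)
    fp = len(predicted_reachable - truth_reachable)  # truth says not reachable / unknown
    fn = len(truth_reachable - predicted_reachable)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "case_id": truth["case_id"],
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
    }
```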
4.4 CLI: score
Signature:
reachbench score \
--dataset-root ./dataset \
--results-root ./results \
--lockfile ./dataset/manifest.lock.json \
--out ./out/scores.json \
[--cases php-*] \
[--repeat 3]
Behavior:
- Verify the lockfile (fail closed on mismatch).
- Load dataset.json; filter cases if --cases is set (glob).
- For each case:
  - Load the truth file (and validate its schema).
  - Locate the results file (&lt;case_id&gt;.json) under results-root:
    - If missing, treat as all FN (or mark the case as “no submission”).
  - Load and validate predictions (include a JSON Schema: results.schema.json).
  - Compute per-case metrics.
- Aggregate metrics.
- Write scores.json:

      {
        "version": "0.1.0",
        "dataset_version": "0.1.0",
        "generated_at": "2025-01-15T12:34:56Z",
        "macro_precision": 0.92,
        "macro_recall": 0.88,
        "macro_f1": 0.90,
        "cases": [
          {
            "case_id": "php-wordpress-5.8-cve-2023-12345",
            "tp": 10,
            "fp": 1,
            "fn": 2,
            "precision": 0.91,
            "recall": 0.83,
            "f1": 0.87
          }
        ]
      }

- Determinism check (see the helper sketch below):
  - If --repeat N is given:
    - Re-run scoring in-memory N times.
    - Compare the resulting JSON strings (canonicalized via sorted keys).
    - If any differ, exit non-zero with the message “non-deterministic scoring detected”.
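The repeat check can be a small helper along these lines (assert_deterministic is an illustrative name; note that the scoring callable must hold volatile fields such as generated_at fixed across repeats, or strip them before comparing):

```python
import json

def assert_deterministic(run_scoring, n: int = 3) -> str:
    """Run the in-memory scoring callable n times; fail if canonical output ever differs."""
    outputs = [
        json.dumps(run_scoring(), sort_keys=True, separators=(",", ":"))
        for _ in range(n)
    ]
    if len(set(outputs)) != 1:
        raise SystemExit("non-deterministic scoring detected")
    return outputs[0]
```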
4.5 Offline-only mode
- In cli.py, add an early check:

      if os.getenv("REACHBENCH_OFFLINE_ONLY", "1") == "1":
          # Verify no outbound network: by policy, just ensure we never call any net libs.
          # (In v1, simply avoid adding any such calls.)
          pass

- Document that the harness must not reach out to the internet.
Acceptance criteria
- Given a small artificial dataset with 2–3 cases and handcrafted results, reachbench score produces the expected metrics (asserted via tests).
- Running reachbench score --repeat 3 produces identical scores.json output across runs.
- Missing results files are handled gracefully (and clearly documented).
5. Baseline implementations
Goal: Provide in-repo baselines that use only the provided graphs (no extra tooling).
5.1 Baseline types
- Naïve reachable: all symbols in the vulnerable package are considered reachable.
- Imports-only: reachable = any symbol that:
  - appears in the graph, AND
  - is reachable from any entrypoint by a single edge or name match.
- Call-depth-2 (see the traversal sketch after this list):
  - From each entrypoint, traverse up to depth 2 along call edges.
  - Anything at depth ≤ 2 is considered reachable.
5.2 Implementation
File: harness/reachbench/baselines.py
- baseline_naive(graph, truth) -> PredictionModel
- baseline_imports_only(graph, truth) -> PredictionModel
- baseline_call_depth_2(graph, truth) -> PredictionModel
CLI:
reachbench run-baseline \
--dataset-root ./dataset \
--baseline naive|imports|depth2 \
--out ./results-baseline-<baseline>/
Behavior:
- For each case:
  - Load the graph.
  - Generate predictions for the chosen baseline.
  - Write the result file results-baseline-&lt;baseline&gt;/&lt;case_id&gt;.json.
5.3 Tests
- Tiny synthetic dataset in harness/tests/data/:
  - 1–2 cases with simple graphs.
  - Known expectations for each baseline (TP/FP/FN counts).
Acceptance criteria
- reachbench run-baseline --baseline naive runs end-to-end and outputs results files.
- reachbench score on baseline results produces stable scores.
- Tests validate baseline behavior on the synthetic cases.
6. Dataset validation & tooling
Goal: One command to validate everything (schemas, hashes, internal consistency).
CLI: validate-dataset
reachbench validate-dataset \
--dataset-root ./dataset \
[--lockfile ./dataset/manifest.lock.json]
Checks:
- dataset.json conforms to dataset.schema.json.
- For each case:
  - all artifact paths exist
  - the graph file passes graph.schema.json
  - the truth file passes truth.schema.json
- Optional: verify the lockfile if provided.
Implementation:
- dataset_loader.py:
  - load_dataset_index(path) -> DatasetIndex
  - iter_cases(dataset_index) yields case objects.
  - validate_case(case, dataset_root) -> list[str] (list of error messages; sketched below).
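A sketch of validate_case using jsonschema; the artifact shapes follow the dataset.json example in 2.1 (only graph and truth entries carry a schema key), and the message wording is illustrative:

```python
import json
from pathlib import Path

from jsonschema import Draft202012Validator

def validate_case(case: dict, dataset_root: Path, schema_dir: Path) -> list[str]:
    """Return human-readable error messages for one case (empty list = valid)."""
    errors: list[str] = []
    for artifact in case["artifacts"].values():
        path = dataset_root / artifact["path"]
        if not path.is_file():
            errors.append(f"{case['id']}: missing artifact {artifact['path']}")
            continue
        schema_name = artifact.get("schema")  # present on graph/truth artifacts
        if schema_name:
            schema = json.loads((schema_dir / schema_name).read_text(encoding="utf-8"))
            doc = json.loads(path.read_text(encoding="utf-8"))
            for err in Draft202012Validator(schema).iter_errors(doc):
                errors.append(f"{case['id']}: {artifact['path']}: {err.message}")
    return errors
```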
Acceptance criteria
- Broken paths / invalid JSON produce a clear error message and non-zero exit code.
- A CI job calls reachbench validate-dataset on every push.
7. Documentation
Goal: Make it trivial for outsiders to use the benchmark.
7.1 README.md
- Overview:
  - What the benchmark is.
  - What it measures (reachability precision/recall).
- Quickstart:

      git clone ...
      cd stellaops-reachability-benchmark

      # Validate dataset
      reachbench validate-dataset --dataset-root ./dataset

      # Run baselines
      reachbench run-baseline --baseline naive --dataset-root ./dataset --out ./results-naive

      # Score baselines
      reachbench score --dataset-root ./dataset --results-root ./results-naive --out ./out/naive-scores.json
7.2 docs/HOWTO.md
- Step-by-step:
  - Installing the harness.
  - Running your own tool on the dataset.
  - Formatting your results/ directory.
  - Running reachbench score.
  - Interpreting scores.json.
7.3 docs/SCHEMA.md
- Human-readable description of:
  - graph JSON
  - truth JSON
  - results JSON
  - scores JSON
- Links to the actual JSON Schemas.
7.4 docs/REPRODUCIBILITY.md
- Explain:
  - lockfile design
  - hashing rules
  - deterministic scoring and the --repeat flag
  - how to verify you’re using the exact same dataset.
7.5 docs/SANITIZATION.md
- Rules for adding new cases:
  - Only use OSS or properly licensed code.
  - Strip secrets, proprietary paths, and user data.
- How to confirm nothing sensitive is in package tarballs.
Acceptance criteria
- A new engineer (or external user) can go from zero to “I ran the baseline and got scores” by following docs only.
- All example commands work as written.
8. CI/CD details
Goal: Keep repo healthy and ensure determinism.
CI jobs (GitHub Actions)
- lint
  - Run ruff/flake8 (your choice).
- test
  - Run pytest.
- validate-dataset
  - Run reachbench validate-dataset --dataset-root ./dataset.
- determinism
  - Small workflow step:
    - Run reachbench score on a tiny test dataset with --repeat 3.
    - Assert success.
- docker-build
  - docker build the harness image.
Acceptance criteria
- All jobs green on main.
- PRs show failing status if schemas or determinism break.
9. Rough “epics → stories” breakdown
You can paste roughly like this into Jira/Linear:
- Epic: Repo bootstrap & CI
  - Story: Create repo skeleton & Python project
  - Story: Add Dockerfile & basic CI (lint + tests)
- Epic: Schemas & dataset plumbing
  - Story: Implement truth.schema.json + tests
  - Story: Implement graph.schema.json + tests
  - Story: Implement dataset.schema.json + tests
  - Story: Implement validate-dataset CLI
- Epic: Lockfile & determinism
  - Story: Implement lockfile computation + verification
  - Story: Add compute-lockfile & verify-lockfile CLI
  - Story: Add determinism checks in CI
- Epic: Scoring harness
  - Story: Define results format + results.schema.json
  - Story: Implement scoring logic (scoring.py)
  - Story: Implement score CLI with --repeat
  - Story: Add unit tests for metrics
- Epic: Baselines
  - Story: Implement naive baseline
  - Story: Implement imports-only baseline
  - Story: Implement depth-2 baseline
  - Story: Add run-baseline CLI + tests
- Epic: Documentation & polish
  - Story: Write README + HOWTO
  - Story: Write SCHEMA / REPRODUCIBILITY / SANITIZATION docs
  - Story: Final repo cleanup & examples
If you tell me your preferred language and CI, I can also rewrite this into exact tickets and even starter code for cli.py and a couple of schemas.