Compare commits: 76ecea482e … d785a9095f (3 commits: d785a9095f, 0c9e8d5d18, 790801f329)
@@ -0,0 +1,446 @@

Here’s a crisp, practical way to turn Stella Ops’ “verifiable proof spine” into a moat, and how to measure it.

# Why this matters (in plain terms)

Security tools often say “trust me.” You’ll say “prove it”: every finding and every “not‑affected” claim ships with cryptographic receipts anyone can verify.

---

# Differentiators to build in

**1) Bind every verdict to a graph hash**

* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts. (A hashing sketch follows this list.)
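
A minimal sketch of the hashing step, assuming inputs have already been canonicalized to stable JSON strings (the type and member names here are illustrative, not an existing Stella Ops API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class GraphRevision
{
    // Hash one canonicalized JSON leaf (an SBOM node, edge, policy, feed, or tool version).
    private static byte[] HashLeaf(string canonicalJson) =>
        SHA256.HashData(Encoding.UTF8.GetBytes(canonicalJson));

    // Fold leaf hashes pairwise into a Merkle root. Leaves must already be in a
    // deterministic order (e.g., sorted by node id) or the ID will not be stable.
    public static string ComputeId(IReadOnlyList<string> canonicalLeaves)
    {
        var level = canonicalLeaves.Select(HashLeaf).ToList();
        if (level.Count == 0)
            return Convert.ToHexString(SHA256.HashData(Array.Empty<byte>()));

        while (level.Count > 1)
        {
            var next = new List<byte[]>(capacity: (level.Count + 1) / 2);
            for (var i = 0; i < level.Count; i += 2)
            {
                var right = i + 1 < level.Count ? level[i + 1] : level[i]; // duplicate odd tail
                next.Add(SHA256.HashData(level[i].Concat(right).ToArray()));
            }
            level = next;
        }

        return Convert.ToHexString(level[0]);
    }
}
```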
**2) Attach machine‑verifiable receipts (in‑toto/DSSE)**

* For each verdict, emit a **DSSE‑wrapped in‑toto statement**:

  * predicateType: `stellaops.dev/verdict@v1`
  * includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.

* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor‑compatible ledger when online.
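
As a sketch of the statement shape implied by the fields above (everything beyond those listed fields is illustrative):

```csharp
using System;
using System.Collections.Generic;

// Sketch only: models the DSSE payload fields listed above; names are illustrative.
public sealed record VerdictStatement(
    string PredicateType,                  // "stellaops.dev/verdict@v1"
    string GraphRevisionId,                // binds the verdict to the exact graph state
    IReadOnlyList<string> ArtifactDigests,
    string RuleId,
    string RuleVersion,
    VerdictInputs Inputs,
    DateTimeOffset CreatedAt);             // UTC timestamp

public sealed record VerdictInputs(string? Cpe, string? Cve, double? CvssScore);
```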
**3) Add reachability “call‑stack slices” or binary‑symbol proofs**

* For code‑level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary‑only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with the executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.

**4) Deterministic replay manifests**

* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline. (A sketch follows.)
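
A sketch of what such a manifest might carry, mirroring the inputs named above (all names illustrative):

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Sketch only: a typed replay manifest an auditor can deserialize and re-run.
public sealed record ReplayManifest(
    IReadOnlyList<string> InputArtifacts,              // digests of SBOMs/images scanned
    IReadOnlyDictionary<string, string> Feeds,         // feed name -> snapshot digest
    IReadOnlyDictionary<string, string> RuleVersions,  // rule id -> version
    IReadOnlyDictionary<string, string> ContainerDigests,
    string ExpectedGraphRevisionId)                    // recompute offline and compare
{
    public string ToJson() =>
        JsonSerializer.Serialize(this, new JsonSerializerOptions { WriteIndented = true });
}
```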
---

# Benchmarks to publish (make them your headline KPIs)

**A) False‑positive reduction vs. baseline scanners (%)**

* Method: run a public corpus (e.g., sample images + app stacks) across 3–4 popular scanners; label ground truth once; compare FP rates.
* Report: mean & p95 FP reduction.

**B) Proof coverage (% of findings with signed evidence)**

* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime‑reachable vs. unreachable, and “not‑affected” claims.

**C) Triage time saved (p50/p95)**

* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.

**D) Determinism stability**

* Re‑run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when they differ.

---
# Minimal implementation plan (week‑by‑week)

**Week 1: primitives**

* Add a Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM + edges + policies + tool versions).
* Define the `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.

**Week 2: signing + storage**

* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in a `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirroring.

**Week 3: reachability proofs**

* Add call‑stack slice capture in the reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.

**Week 4: replay + UX**

* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show a **“Verified”** badge, graph hash, signature issuer, and a one‑click “Copy receipt” button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.

**Week 5: benchmarks harness**

* Create `bench/` with golden fixtures and a runner:

  * Baseline scanner adapters
  * Ground‑truth labels
  * Metrics export (FP%, proof coverage, triage‑time capture hooks)
---

# Developer guardrails (make these non‑negotiable)

* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump the rule version and therefore the graph hash.
* **Replay‑first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on gold fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall‑time separately.

---

# What to show buyers/auditors

* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one‑page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with a corpus description.

---

If you want, I’ll draft:

1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.

Here’s a focused “developer guidelines” doc just for **Benchmarks for a Testable Security Moat** in Stella Ops.
---

# Stella Ops Developer Guidelines

## Benchmarks for a Testable Security Moat

> **Goal:** Benchmarks are how we *prove* Stella Ops is better, not just say it is. If a “moat” claim can’t be tied to a benchmark, it doesn’t exist.

Everything here is about how you, as a developer, design, extend, and run those benchmarks.

---

## 1. What our benchmarks must measure

Every core product claim needs at least one benchmark:

1. **Detection quality**

   * Precision / recall vs ground truth.
   * False positives vs popular scanners.
   * False negatives on known‑bad samples.

2. **Proof & evidence quality**

   * % of findings with **valid receipts** (DSSE).
   * % of VEX “not‑affected” claims with attached proofs.
   * Reachability proof quality:

     * call‑stack slice present?
     * symbol proof present for binaries?

3. **Triage & workflow impact**

   * Time‑to‑decision for analysts (p50/p95).
   * Click depth and context switches per decision.
   * “Verified” vs “unverified” verdict triage times.

4. **Determinism & reproducibility**

   * Same inputs → same **Graph Revision ID**.
   * Stable verdict sets across runs/nodes.

> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.

---
## 2. Benchmark assets and layout

**2.1 Repo layout (convention)**

Under `bench/` we maintain everything benchmark‑related:

* `bench/corpus/`

  * `images/` – curated container images / tarballs.
  * `repos/` – sample codebases (with known vulns).
  * `sboms/` – canned SBOMs for edge cases.

* `bench/scenarios/`

  * `*.yaml` – scenario definitions (inputs + expected outputs).

* `bench/golden/`

  * `*.json` – golden results (expected findings, metrics).

* `bench/tools/`

  * adapters for baseline scanners, parsers, helpers.

* `bench/scripts/`

  * `run_benchmarks.[sh/cs]` – single entrypoint.

**2.2 Scenario definition (high‑level)**

Each scenario YAML should minimally specify the following (a typed sketch of this shape follows the list):

* **Inputs**

  * artifact references (image name / path / repo SHA / SBOM file).
  * environment knobs (features enabled/disabled).

* **Ground truth**

  * list of expected vulns (or an explicit “none”).
  * for some: expected reachability (reachable/unreachable).
  * expected VEX entries (affected / not affected).

* **Expectations**

  * required metrics (e.g., “no more than 2 FPs”, “no FNs”).
  * required proof coverage (e.g., “100% of surfaced findings have receipts”).
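
One way to pin that shape down before the YAML schema settles is a typed model the runner deserializes scenarios into; a sketch (all names illustrative):

```csharp
using System.Collections.Generic;

public sealed record Scenario(
    ScenarioInputs Inputs,
    GroundTruth GroundTruth,
    Expectations Expectations);

public sealed record ScenarioInputs(
    string? Image, string? RepoSha, string? SbomPath,
    IReadOnlyDictionary<string, bool>? Features);   // environment knobs

public sealed record GroundTruth(
    IReadOnlyList<ExpectedVuln> Vulns,              // empty list = explicit "none"
    IReadOnlyList<string>? NotAffected);            // expected VEX "not-affected" ids

public sealed record ExpectedVuln(string Id, bool? Reachable);

public sealed record Expectations(
    int MaxFalsePositives,
    int MaxFalseNegatives,
    double MinProofCoverage);                       // e.g. 1.0 = every finding has a receipt
```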
---

## 3. Core benchmark metrics (developer‑facing definitions)

Use these consistently across code and docs.

### 3.1 Detection metrics

* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)

Derived:

* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.

**Developer guideline:**

* When you introduce a filter, deduper, or rule tweak, add/modify scenarios so that:

  * one scenario shows where the change **helps** (reduces FP or FN); and
  * a different scenario guards against regressions.

### 3.2 Moat‑specific metrics

These are the ones that directly support the “testable moat” story (a small helper that fixes the arithmetic in one place follows this list):

1. **False‑positive reduction vs baseline scanners**

   * Run baseline scanners across our corpus (via adapters in `bench/tools`).
   * Compute:

     * `baseline_fp_rate`
     * `stella_fp_rate`
     * `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`

2. **Proof coverage**

   * `proof_coverage_all = findings_with_valid_receipts / total_findings`
   * `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
   * `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`

3. **Triage time improvement**

   * In test harnesses, simulate or record:

     * `time_to_triage_with_receipts`
     * `time_to_triage_without_receipts`

   * Compute median & p95 deltas.

4. **Determinism**

   * Re‑run the same scenario `N` times:

     * `% runs with identical Graph Revision ID`
     * `% runs with identical verdict sets`

   * On mismatch, diff and log the cause (e.g., non‑stable sort, non‑pinned feed).
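
A sketch of that helper, so every report uses identical definitions (names illustrative):

```csharp
public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(double baselineFpRate, double stellaFpRate) =>
        baselineFpRate == 0 ? 0 : (baselineFpRate - stellaFpRate) / baselineFpRate;

    // proof_coverage = items carrying valid DSSE receipts / total surfaced items
    public static double ProofCoverage(int withValidReceipts, int total) =>
        total == 0 ? 1.0 : (double)withValidReceipts / total;

    // precision = TP / (TP + FP); recall = TP / (TP + FN)
    public static double Precision(int tp, int fp) => tp + fp == 0 ? 1.0 : (double)tp / (tp + fp);
    public static double Recall(int tp, int fn) => tp + fn == 0 ? 1.0 : (double)tp / (tp + fn);
}
```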
---

## 4. How developers should work with benchmarks

### 4.1 “No feature without benchmarks”

If you’re adding or changing:

* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,

you **must** do *at least one* of:

1. **Extend an existing scenario**

   * Add expectations that cover your change, or
   * tighten an existing bound (e.g., a lower FP threshold).

2. **Add a new scenario**

   * For new attack classes / edge cases / ecosystems.

**Anti‑patterns:**

* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why the metrics changed.

### 4.2 CI gates

We treat benchmarks as **blocking**:

* Add CI jobs, e.g.:

  * `make bench:quick` on every PR (small subset).
  * `make bench:full` on main / nightly.

* CI fails if:

  * Any scenario marked `strict: true` has:

    * precision or recall below its threshold, or
    * proof coverage below its configured threshold.

  * Global regressions exceed tolerance:

    * e.g., total FP increases > X% without an explicit override.

**Developer rule:**

* If you intentionally change behavior:

  * Update the relevant golden files.
  * Include a short note in the PR (e.g., a `bench-notes.md` snippet) describing:

    * what changed,
    * why the new result is better, and
    * which moat metric it improves (FP, proof coverage, determinism, etc.).

---

## 5. Benchmark implementation guidelines

### 5.1 Make benchmarks deterministic

* **Pin everything**:

  * feed snapshots,
  * tool container digests,
  * rule versions,
  * time windows.

* Use **Replay Manifests** as the source of truth:

  * `replay.manifest.json` should contain:

    * input artifacts,
    * tool versions,
    * feed versions,
    * configuration flags.

* If a benchmark depends on time:

  * Inject a **fake clock** or an explicit “as of” timestamp.

### 5.2 Keep scenarios small but meaningful

* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:

  * “What property of Stella Ops are we testing?”
  * “What moat claim does this support?”

Examples:

* `bench/scenarios/false_pos_kubernetes.yaml`

  * Focus: config noise reduction vs a baseline scanner.

* `bench/scenarios/reachability_java_webapp.yaml`

  * Focus: reachable vs unreachable vuln proofs.

* `bench/scenarios/vex_not_affected_openssl.yaml`

  * Focus: VEX correctness and proof coverage.

### 5.3 Use golden outputs, not ad‑hoc assertions

* The bench harness should:

  * run Stella Ops on scenario inputs,
  * normalize outputs (sorted lists, stable IDs), and
  * compare to `bench/golden/<scenario>.json` (a comparison sketch follows this list).

* The golden file should include:

  * expected findings (id, severity, reachable?, etc.),
  * expected VEX entries,
  * expected metrics (precision, recall, coverage).
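
A minimal sketch of the normalize‑and‑compare step (the `Finding` shape and helper names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public sealed record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCheck
{
    // Serialize findings in a stable order so diffs against the golden file are deterministic.
    public static string Normalize(IEnumerable<Finding> findings) =>
        JsonSerializer.Serialize(findings.OrderBy(f => f.Id, StringComparer.Ordinal).ToList());

    public static bool MatchesGolden(IEnumerable<Finding> actual, string goldenPath) =>
        Normalize(actual).Trim() == File.ReadAllText(goldenPath).Trim();
}
```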
---

## 6. Moat‑critical benchmark types (we must have all of these)

When you’re thinking about gaps, check that we have:

1. **Cross‑tool comparison**

   * Same corpus, multiple scanners.
   * Metrics vs baselines for FP/FN.

2. **Proof density & quality**

   * A corpus where:

     * some vulns are reachable,
     * some are not, and
     * some are not present.

   * Ensure:

     * reachable ones have rich proofs (stack slices / symbol proofs).
     * non‑reachable or absent ones have:

       * the correct disposition, and
       * clear receipts explaining why.

3. **VEX accuracy**

   * Scenarios with a known SBOM + known vulnerability impact.
   * Check:

     * VEX “affected”/“not‑affected” matches ground truth.
     * every VEX entry has a receipt.

4. **Analyst workflow**

   * A small usability corpus for internal testing:

     * Measure time‑to‑triage with/without receipts.
     * Use the same scenarios across releases to track improvement.

5. **Upgrade / drift resistance**

   * Scenarios that are **expected to remain stable** across:

     * rule changes that *shouldn’t* affect outcomes.
     * feed updates (within a given version window).

   * These act as canaries for unintended regressions.

---

## 7. Developer checklist (TL;DR)

Before merging a change that touches security logic, ask yourself:

1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked the diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**

If any answer is “no”, fix that before merging.

---

If you’d like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., “30% fewer FPs than Scanner X on Kubernetes configs”) so your team has a ready‑to‑copy pattern.
@@ -0,0 +1,287 @@

Here’s a condensed **“Stella Ops Developer Guidelines”** based on the official engineering docs and dev guides.

---

## 0. Where to start

* **Dev docs index:** The main entrypoint is `Development Guides & Tooling` (docs/technical/development/README.md). It links to coding standards, test strategy, the performance workbook, the plug‑in SDK, examples, and more. ([Gitea: Git with a cup of tea][1])
* **If a term is unfamiliar:** check the one‑page *Glossary of Terms* first. ([Stella Ops][2])
* **Big picture:** Stella Ops is an SBOM‑first, offline‑ready container security platform; many design decisions (determinism, signatures, policy DSL, SBOM delta scans) flow from that. ([Stella Ops][3])

---

## 1. Core engineering principles

From the **Coding Standards & Contributor Guide**: ([Gitea: Git with a cup of tea][4])

1. **SOLID first** – especially interface segregation & dependency inversion.
2. **100‑line file rule** – if a file grows beyond 100 physical lines, split or refactor.
3. **Contracts vs runtime** – public DTOs and interfaces live in lightweight `*.Contracts` projects; implementations live in sibling runtime projects.
4. **Single composition root** – DI wiring happens in `StellaOps.Web/Program.cs` and each plug‑in’s `IoCConfigurator`. Nothing else creates a service provider.
5. **No service locator** – constructor injection only; no global `ServiceProvider` or static service lookups.
6. **Fail‑fast startup** – validate configuration *before* the web host starts listening.
7. **Hot‑load compatibility** – avoid static singletons that would survive plug‑in unload; don’t manually load assemblies outside the built‑in loader.

These all serve the product goals of **deterministic, offline, explainable security decisions**. ([Stella Ops][3])

---

## 2. Repository layout & layering

From the repo layout section: ([Gitea: Git with a cup of tea][4])

* **Top‑level structure (simplified):**

  ```text
  src/
    backend/
      StellaOps.Web/        # ASP.NET host + composition root
      StellaOps.Common/     # logging, helpers
      StellaOps.Contracts/  # DTO + interface contracts
      …                     # more runtime projects
    plugins-sdk/            # plug‑in templates & abstractions
    frontend/               # Angular workspace
  tests/                    # mirrors src 1‑to‑1
  ```

* **Rules:**

  * No “Module” folders or nested solution hierarchies.
  * Tests mirror the `src/` structure 1:1; **no test code in production projects**.
  * New features follow a *feature folder* layout (e.g., `Scan/ScanService.cs`, `Scan/ScanController.cs`).

---

## 3. Naming, style & language usage

Key conventions: ([Gitea: Git with a cup of tea][4])

* **Namespaces:** file‑scoped, `StellaOps.*`.
* **Interfaces:** `I` prefix (`IScannerRunner`).
* **Classes/records:** PascalCase (`ScanRequest`, `TrivyRunner`).
* **Private fields:** `camelCase` (no leading `_`).
* **Constants:** `SCREAMING_SNAKE_CASE`.
* **Async methods:** end with `Async`.
* **Usings:** outside the namespace, sorted, no wildcard imports.
* **File length:** keep files ≤100 lines including `using` directives and braces (enforced by tooling).

C# feature usage: ([Gitea: Git with a cup of tea][4])

* Nullable reference types **on**.
* Use `record` for immutable DTOs.
* Prefer pattern matching over long `switch` cascades.
* Use `Span`/`Memory` only when you’ve measured that you need them.
* Use `await foreach` instead of manual iterator loops.

Formatting & analysis:

* `dotnet format` must be clean; StyleCop, the security analyzers, and CodeQL run in CI and are treated as gates. ([Gitea: Git with a cup of tea][4])

---

## 4. Dependency injection, async & concurrency

DI policy (core + plug‑ins): ([Gitea: Git with a cup of tea][4])

* Exactly **one composition root** per process (`StellaOps.Web/Program.cs`).
* Plug‑ins contribute through:

  * `[ServiceBinding]` attributes for simple bindings, or
  * an `IoCConfigurator : IDependencyInjectionRoutine` for advanced setups.

* The default lifetime is **scoped**. Use singletons only for truly stateless, thread‑safe helpers.
* Never use a service locator or manually build nested service providers except in tests.

Async & threading (a bounded fan‑out sketch follows this list): ([Gitea: Git with a cup of tea][4])

* All I/O is async; avoid `.Result` / `.Wait()`.
* Library code uses `ConfigureAwait(false)`.
* Control concurrency with channels or `Parallel.ForEachAsync`, not ad‑hoc `Task.Run` loops.
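
For example, a bounded fan‑out over scan targets might look like this (a sketch; the degree of parallelism is an arbitrary placeholder):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ScanFanOut
{
    // Concurrency is bounded explicitly instead of spawning ad-hoc Task.Run loops.
    public static Task RunAsync(
        IEnumerable<string> targets,
        Func<string, CancellationToken, ValueTask> scanOne,
        CancellationToken ct) =>
        Parallel.ForEachAsync(
            targets,
            new ParallelOptions { MaxDegreeOfParallelism = 4, CancellationToken = ct },
            scanOne);
}
```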
---

## 5. Tests, tooling & quality gates

The **Automated Test‑Suite Overview** spells out all CI layers and budgets. ([Gitea: Git with a cup of tea][5])

**Test layers (high‑level):**

* Unit tests: xUnit.
* Property‑based tests: FsCheck.
* Integration:

  * API integration with Testcontainers.
  * DB/merge flows using Mongo + Redis.
  * Contracts: gRPC breakage checks with Buf.

* Frontend:

  * Unit tests with Jest.
  * E2E tests with Playwright.
  * Lighthouse runs for performance & accessibility.

* Non‑functional:

  * Load tests via k6.
  * Chaos experiments (CPU/OOM) using Docker tooling.
  * Dependency & license scanning.
  * SBOM reproducibility/attestation checks.

**Quality gates (examples):** ([Gitea: Git with a cup of tea][5])

* API unit test line coverage ≥ ~85%.
* API P95 latency ≤ ~120 ms in nightly runs.
* Δ‑SBOM warm scan P95 ≤ ~5 s on reference hardware.
* Lighthouse performance score ≥ ~90, accessibility ≥ ~95.

**Local workflows:**

* Use `./scripts/dev-test.sh` for “fast” local runs and `--full` for the entire stack (API, UI, Playwright, Lighthouse, etc.). Needs Docker and a modern Node. ([Gitea: Git with a cup of tea][5])
* Some suites use Mongo2Go + an OpenSSL 1.1 shim; others use a helper script to spin up a local `mongod` for deeper debugging. ([Gitea: Git with a cup of tea][5])

---

## 6. Plug‑ins & connectors

The **Plug‑in SDK Guide** is your bible for schedule jobs, scanner adapters, TLS providers, notification channels, etc. ([Gitea: Git with a cup of tea][6])

**Basics:**

* Use the `.NET` templates to scaffold:

  ```bash
  dotnet new stellaops-plugin-schedule -n MyPlugin.Schedule --output src
  ```

* At publish time, copy **signed** artefacts to:

  ```text
  src/backend/Stella.Ops.Plugin.Binaries/<MyPlugin>/
    MyPlugin.dll
    MyPlugin.dll.sig
  ```

* The backend:

  * verifies the Cosign signature,
  * enforces `[StellaPluginVersion]` compatibility, and
  * loads plug‑ins in isolated `AssemblyLoadContext`s.

**DI entrypoints:**

* For simple cases, mark implementations with `[ServiceBinding(typeof(IMyContract), ServiceLifetime.Scoped, …)]`.
* For more control, implement `IoCConfigurator : IDependencyInjectionRoutine` and configure services/options in `Register(...)`. ([Gitea: Git with a cup of tea][6])

**Examples** (a schedule‑job sketch follows this list):

* **Schedule job:** implement `IJob.ExecuteAsync`, add `[StellaPluginVersion("X.Y.Z")]`, and register the cron schedule with `services.AddCronJob<MyJob>("0 15 * * *")`.
* **Scanner adapter:** implement `IScannerRunner` and register via `services.AddScanner<MyAltScanner>("alt")`; document Docker sidecars if needed. ([Gitea: Git with a cup of tea][6])
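
A sketch of a schedule job using the names above; `IJob`, `[StellaPluginVersion]`, and `AddCronJob` come from the SDK guide, while the exact `IJob` signature and the job body are assumptions:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Assumed IJob shape; check the plug-in SDK for the real contract.
[StellaPluginVersion("1.0.0")]
public sealed class NightlyReportJob : IJob
{
    public async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        // ... do the scheduled work here (keep it idempotent) ...
        await Task.CompletedTask;
    }
}

// Registered from the plug-in's IoCConfigurator (runs daily at 15:00):
//   services.AddCronJob<NightlyReportJob>("0 15 * * *");
```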
**Signing & deployment:**

* Publish, sign with Cosign, optionally zip:

  ```bash
  dotnet publish -c Release -p:PublishSingleFile=true -o out
  cosign sign --key $COSIGN_KEY out/MyPlugin.Schedule.dll
  ```

* Copy into the backend container (e.g., `/opt/plugins/`) and restart.
* Unsigned DLLs are rejected when `StellaOps:Security:DisableUnsigned=false`. ([Gitea: Git with a cup of tea][6])

**Marketplace:**

* Tag releases like `plugin-vX.Y.Z`, attach the signed ZIP, and submit metadata to the community plug‑in index so it shows up in the UI Marketplace. ([Gitea: Git with a cup of tea][6])

---

## 7. Policy DSL & security decisions

For policy authors and tooling engineers, the **Stella Policy DSL (stella‑dsl@1)** doc is key. ([Stella Ops][7])

**Goals:**

* Deterministic: same inputs → same findings on every machine.
* Declarative: no arbitrary loops, network calls, or clocks.
* Explainable: each decision carries its rule, inputs, and rationale.
* Offline‑friendly and reachability‑aware (SBOM + advisories + VEX + reachability). ([Stella Ops][7])

**Structure:**

* One `policy` block per `.stella` file, with:

  * `metadata` (description, tags),
  * `profile` blocks (severity, trust, reachability adjustments),
  * `rule` blocks (`when` / `then` logic), and
  * optional `settings`. ([Stella Ops][7])

**Context & built‑ins:**

* Namespaces like `sbom`, `advisory`, `vex`, `env`, `telemetry`, `secret`, `profile.*`, etc. ([Stella Ops][7])
* Helpers such as `normalize_cvss`, `risk_score`, `vex.any`, `vex.latest`, `sbom.any_component`, `exists`, `coalesce`, and secrets‑specific helpers. ([Stella Ops][7])

**Rules of thumb:**

* Always include a clear `because` when you change `status` or `severity`. ([Stella Ops][7])
* Avoid catch‑all suppressions (`when true` + `status := "suppressed"`); the linter will flag them. ([Stella Ops][7])
* Use `stella policy lint/compile/simulate` in CI and locally; test in sealed (offline) mode to ensure there are no network dependencies. ([Stella Ops][7])

---

## 8. Commits, PRs & docs

From the commit/PR checklist: ([Gitea: Git with a cup of tea][4])

Before opening a PR:

1. Use **Conventional Commit** prefixes (`feat:`, `fix:`, `docs:`, etc.).
2. Run `dotnet format` and `dotnet test`; both must be green.
3. Keep new/changed files within the 100‑line guideline.
4. Update XML‑doc comments for any new public API.
5. If you add/change a public contract:

   * update the relevant markdown docs, and
   * update JSON schemas / API descriptions as needed.

6. Ensure the static analyzers and CI jobs relevant to your change are passing.

For new test layers or jobs, also update the test‑suite overview and metrics docs so the CI configuration stays discoverable. ([Gitea: Git with a cup of tea][5])

---

## 9. Licensing & reciprocity

Stella Ops ships under **AGPL‑3.0‑or‑later** with a strong reciprocity clause: ([Stella Ops][8])

* You may run, study, modify, and redistribute it, including as a hosted service.
* If you run a **modified** version for others over a network, you must make that exact source code available to those users.
* Official containers are signed and include SBOMs and attestations; verify them with Cosign as described on the license/security pages. ([Stella Ops][8])

When you build extensions:

* Keep plug‑ins compatible with AGPL expectations around combined works.
* Don’t embed proprietary logic into the core without checking the license implications.

---

## 10. If you just want a “first contribution” recipe

A practical path that follows the guidelines:

1. Clone the main repo; skim the **Coding Standards**, **Test Suite Overview**, and the **Dev Guides & Tooling** index. ([Gitea: Git with a cup of tea][4])
2. Get `dotnet`, Docker, and Node set up; run `./scripts/dev-test.sh` to make sure your environment is healthy. ([Gitea: Git with a cup of tea][5])
3. Pick a small issue (docs, a small refactor, or a new test) and make changes respecting:

   * 100‑line files,
   * DI patterns,
   * naming & style.

4. Add/adjust tests plus any affected docs or JSON schemas.
5. Run tests + formatting locally, push, and open a PR with a conventional title and a short “how I tested this” note.

If you tell me what you’re planning to work on (plug‑in, policy pack, core feature, or UI), I can turn this into a very concrete checklist tailored to that slice of Stella Ops.

[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/technical/development/README.md "Development Guides & Tooling – git.stella-ops.org (Gitea)"
[2]: https://stella-ops.org/docs/14_glossary_of_terms/ "Glossary of Terms – Stella Ops"
[3]: https://stella-ops.org/docs/05_SYSTEM_REQUIREMENTS_SPEC/ "System Requirements Specification – Stella Ops"
[4]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/18_CODING_STANDARDS.md "18_CODING_STANDARDS.md – git.stella-ops.org (Gitea)"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/19_TEST_SUITE_OVERVIEW.md "19_TEST_SUITE_OVERVIEW.md – git.stella-ops.org (Gitea)"
[6]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/10_PLUGIN_SDK_GUIDE.md "10_PLUGIN_SDK_GUIDE.md – git.stella-ops.org (Gitea)"
[7]: https://stella-ops.org/docs/policy/dsl/index.html "Stella Policy DSL (stella-dsl@1) – Stella Ops"
[8]: https://stella-ops.org/license/ "AGPL-3.0-or-later – Stella Ops"
@@ -0,0 +1,585 @@

Here’s a tight, practical pattern to make your scanner’s vuln‑DB updates rock‑solid even when feeds hiccup:

# Offline, verifiable update bundles (DSSE + Rekor v2)

**Idea:** distribute DB updates as offline tarballs. Each tarball ships with:

* a **DSSE‑signed** statement (e.g., in‑toto style) over the bundle hash
* a **Rekor v2 receipt** proving the signature/statement was logged
* a small **manifest.json** (version, created_at, content hashes)

**Startup flow (happy path):**

1. Load the latest tarball from your local `updates/` cache.
2. Verify the DSSE signature against your trusted public keys.
3. Verify the Rekor v2 receipt (inclusion proof) matches the DSSE payload hash.
4. If both pass, unpack/activate; record the bundle’s **trust_id** (e.g., the statement digest).
5. If anything fails, **keep using the last good bundle**. No service disruption.

**Why this helps**

* **Air‑gap friendly:** no live network needed at activation time.
* **Tamper‑evident:** DSSE + the Rekor receipt prove provenance and transparency.
* **Operational stability:** feed outages become non‑events; the scanner just keeps the last good state.

---

## File layout inside each bundle

```
/bundle-2025-11-29/
  manifest.json          # { version, created_at, entries[], sha256s }
  payload.tar.zst        # the actual DB/indices
  payload.tar.zst.sha256
  statement.dsse.json    # DSSE-wrapped statement over the payload hash
  rekor-receipt.json     # Rekor v2 inclusion/verification material
```

---

## Acceptance/Activation rules

* **Trust root:** pin one (or more) publisher public keys; rotate via a separate, out‑of‑band process.
* **Monotonicity:** only activate if `manifest.version > current.version` (or if the trust policy explicitly allows replay for rollback testing).
* **Atomic switch:** unpack to `db/staging/`, validate, then symlink‑flip to `db/active/`.
* **Quarantine on failure:** move bad bundles to `updates/quarantine/` with a reason code.

---

## Minimal .NET 10 verifier sketch (C#)
```csharp
// Illustrative sketch only: the helper types (Manifest, Hashes, Dsse, RekorV2,
// TarZstd, DirUtil, LocalDbSelfCheck, SymlinkUtil, State, TrustConfig) are placeholders.
public sealed record BundlePaths(string Dir)
{
    public string Manifest => Path.Combine(Dir, "manifest.json");
    public string Payload  => Path.Combine(Dir, "payload.tar.zst");
    public string Dsse     => Path.Combine(Dir, "statement.dsse.json");
    public string Receipt  => Path.Combine(Dir, "rekor-receipt.json");
}

public async Task<bool> ActivateBundleAsync(BundlePaths b, TrustConfig trust, string activeDir)
{
    var manifest = await Manifest.LoadAsync(b.Manifest);
    if (!await Hashes.VerifyAsync(b.Payload, manifest.PayloadSha256)) return false;

    // 1) DSSE verify (publisher keys pinned in trust)
    var (okSig, dssePayloadDigest) = await Dsse.VerifyAsync(b.Dsse, trust.PublisherKeys);
    if (!okSig || dssePayloadDigest != manifest.PayloadSha256) return false;

    // 2) Rekor v2 receipt verify (inclusion + statement digest == dssePayloadDigest)
    if (!await RekorV2.VerifyReceiptAsync(b.Receipt, dssePayloadDigest, trust.RekorPub)) return false;

    // 3) Stage, validate, then atomically flip
    var staging = Path.Combine(activeDir, "..", "staging");
    DirUtil.Empty(staging);
    await TarZstd.ExtractAsync(b.Payload, staging);
    if (!await LocalDbSelfCheck.RunAsync(staging)) return false;

    SymlinkUtil.AtomicSwap(source: staging, target: activeDir);
    State.WriteLastGood(manifest.Version, dssePayloadDigest);
    return true;
}
```

---

## Operational playbook

* **On boot & daily at HH:MM:** try `ActivateBundleAsync()` on the newest bundle; on failure, log and continue.
* **Telemetry (no PII):** reason codes (SIG_FAIL, RECEIPT_FAIL, HASH_MISMATCH, SELFTEST_FAIL), versions, last_good.
* **Keys & rotation:** keep `publisher.pub` and `rekor.pub` in a root‑owned, read‑only path; rotate via a separate signed “trust bundle”.
* **Defense‑in‑depth:** verify both the **payload hash** and each file’s hash listed in `manifest.entries[]`.
* **Rollback:** allow `--force-activate <bundle>` for emergency testing, but mark it as **non‑monotonic** in state.

---

## What to hand your release team

* A Make/CI target that:

  1. builds `payload.tar.zst` and computes hashes,
  2. generates `manifest.json`,
  3. creates and signs the **DSSE statement**,
  4. submits to Rekor (or your mirror) and saves the **v2 receipt**, and
  5. packages the bundle folder and publishes it to your offline repo.

* A checksum file (`*.sha256sum`) for ops to verify out‑of‑band.

---

If you want, I can turn this into a Stella Ops spec page (`docs/modules/scanner/offline-bundles.md`) plus a small reference implementation (C# library + CLI) that drops right into your Scanner service.

Here’s a “drop‑in” Stella Ops dev guide for **DSSE‑signed Offline Scanner Updates**, written in the same spirit as the existing docs and sprint files.

You can treat this as the seed for `docs/modules/scanner/development/dsse-offline-updates.md` (or similar).

---
# DSSE‑Signed Offline Scanner Updates — Developer Guidelines

> **Audience**
> Scanner, Export Center, Attestor, CLI, and DevOps engineers implementing DSSE‑signed offline vulnerability updates and integrating them into the Offline Update Kit (OUK).
>
> **Context**
>
> * OUK already ships **signed, atomic offline update bundles** with merged vulnerability feeds, container images, and an attested manifest.([git.stella-ops.org][1])
> * DSSE + Rekor is already used for **scan evidence** (SBOM attestations, Rekor proofs).([git.stella-ops.org][2])
> * Sprints 160/162 add **attestation bundles** with a manifest, checksums, a DSSE signature, and optional transparency log segments, and integrate them into OUK and CLI flows.([git.stella-ops.org][3])

These guidelines tell you how to **wire all of that together** for “offline scanner updates” (feeds, rules, packs) in a way that matches Stella Ops’ determinism + sovereignty promises.

---

## 0. Mental model

At a high level, you’re building this:

```text
Advisory mirrors / Feeds builders
        │
        ▼
ExportCenter.AttestationBundles
  (creates DSSE + Rekor evidence
   for each offline update snapshot)
        │
        ▼
Offline Update Kit (OUK) builder
  (adds feeds + evidence to kit tarball)
        │
        ▼
stella offline kit import / admin CLI
  (verifies Cosign + DSSE + Rekor segments,
   then atomically swaps scanner feeds)
```

Online, Rekor is live; offline, you rely on **bundled Rekor segments / snapshots** and the existing OUK mechanics (import is atomic; old feeds are kept until the new bundle is fully verified).([git.stella-ops.org][1])

---

## 1. Goals & non‑goals

### Goals

1. **Authentic offline snapshots**
   Every offline scanner update (OUK or delta) must be verifiably tied to:

   * a DSSE envelope,
   * a certificate chain rooted in Stella’s Fulcio/KMS profile or a BYO KMS/HSM,
   * *and* a Rekor v2 inclusion proof or bundled log segment.([Stella Ops][4])

2. **Deterministic replay**
   Given:

   * a specific offline update kit (`stella-ops-offline-kit-<DATE>.tgz` + `offline-manifest-<DATE>.json`)([git.stella-ops.org][1])
   * its DSSE attestation bundle + Rekor segments

   every verifier must reach the *same* verdict on integrity and contents — online or fully air‑gapped.

3. **Separation of concerns**

   * Export Center: builds attestation bundles; no business logic about scanning.([git.stella-ops.org][5])
   * Scanner: imports & applies feeds; verifies but does not generate DSSE.
   * Signer / Attestor: own DSSE & Rekor integration.([git.stella-ops.org][2])

4. **Operational safety**

   * Imports remain **atomic and idempotent**.
   * Old feeds stay live until the new update is **fully verified** (Cosign + DSSE + Rekor).([git.stella-ops.org][1])

### Non‑goals

* Designing new crypto or log formats.
* Per‑feed DSSE envelopes (you can add more later, but the minimum contract is **bundle‑level** attestation).

---

## 2. Bundle contract for DSSE‑signed offline updates

You’re extending the existing OUK contract:

* OUK already packs:

  * merged vuln feeds (OSV, GHSA, optional NVD 2.0, CNNVD/CNVD, ENISA, JVN, BDU),
  * container images (`stella-ops`, Zastava, etc.),
  * provenance (Cosign signature, SPDX SBOM, in‑toto SLSA attestation),
  * `offline-manifest.json` + a detached JWS signed during export.([git.stella-ops.org][1])

For **DSSE‑signed offline scanner updates**, add a new logical layer:

### 2.1. Files to ship

Inside each offline kit (full or delta) you must produce:
```text
/attestations/
  offline-update.dsse.json   # DSSE envelope
  offline-update.rekor.json  # Rekor entry + inclusion proof (or segment descriptor)
/manifest/
  offline-manifest.json      # existing manifest
  offline-manifest.json.jws  # existing detached JWS
/feeds/
  ...                        # existing feed payloads
```

The exact paths can be adjusted, but keep:

* **one DSSE bundle per kit** (minimum spec), and
* **one canonical Rekor proof file** per DSSE envelope.
### 2.2. DSSE payload contents (minimal)

Define (or reuse) a predicate type such as:

```jsonc
{
  "payloadType": "application/vnd.in-toto+json",
  "payload": { /* base64 */ }
}
```

The decoded payload (an in‑toto statement) should **at minimum** contain:

* **Subject**

  * `name`: `stella-ops-offline-kit-<DATE>.tgz`
  * `digest.sha256`: tarball digest

* **Predicate type** (recommendation)

  * `https://stella-ops.org/attestations/offline-update/1`

* **Predicate fields**

  * `offline_manifest_sha256` – SHA‑256 of `offline-manifest.json`
  * `feeds` – an array of feed entries such as `{ name, snapshot_date, archive_digest }` (mirrors the `rules_and_feeds` style used in the moat doc).([Stella Ops][6])
  * `builder` – CI workflow id / git commit / Export Center job id
  * `created_at` – UTC ISO‑8601
  * `oukit_channel` – e.g., `edge`, `stable`, `fips-profile`
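
A typed sketch of that decoded statement (the field names follow the minimal contract above; the C# shapes themselves are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Sketch only: models the decoded in-toto statement for an offline update.
public sealed record OfflineUpdateStatement(
    IReadOnlyList<Subject> Subject,
    string PredicateType,                 // https://stella-ops.org/attestations/offline-update/1
    OfflineUpdatePredicate Predicate);

public sealed record Subject(string Name, IReadOnlyDictionary<string, string> Digest);

public sealed record OfflineUpdatePredicate(
    string OfflineManifestSha256,
    IReadOnlyList<FeedEntry> Feeds,
    string Builder,                       // CI workflow id / git commit / job id
    DateTimeOffset CreatedAt,             // serialized as UTC ISO-8601
    string OukitChannel);                 // edge | stable | fips-profile

public sealed record FeedEntry(string Name, string SnapshotDate, string ArchiveDigest);
```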
**Guideline:** this DSSE payload is the **single canonical description** of “what this offline update snapshot is”.

### 2.3. Rekor material

Attestor must:

* Submit `offline-update.dsse.json` to Rekor v2, obtaining:

  * `uuid`
  * `logIndex`
  * an inclusion proof (`rootHash`, `hashes`, `checkpoint`)

* Serialize that to `offline-update.rekor.json` and store it in object storage + OUK staging, so it ships in the kit.([git.stella-ops.org][2])

For fully offline operation, either:

* embed a **minimal log segment** containing that entry; or
* rely on the daily Rekor snapshot exports included elsewhere in the kit.([git.stella-ops.org][2])
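
A typed sketch of `offline-update.rekor.json` using only the fields listed above; the real schema should be pinned in the module’s JSON schemas:

```csharp
using System.Collections.Generic;

// Sketch only: serialized Rekor proof material shipped next to the DSSE envelope.
public sealed record RekorReceipt(
    string Uuid,
    long LogIndex,
    InclusionProof Proof);

public sealed record InclusionProof(
    string RootHash,
    IReadOnlyList<string> Hashes,
    string Checkpoint);
```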
---

## 3. Implementation by module

### 3.1 Export Center — attestation bundles

**Working directory:** `src/ExportCenter/StellaOps.ExportCenter.AttestationBundles`([git.stella-ops.org][7])

**Responsibilities**

1. **Compose attestation bundle job** (EXPORT‑ATTEST‑74‑001)

   * Input: a snapshot identifier (e.g., an offline kit build id or feed snapshot date).
   * Read manifest and feed metadata from the Export Center’s storage.([git.stella-ops.org][5])
   * Generate the DSSE payload structure described above.
   * Call `StellaOps.Signer` to wrap it in a DSSE envelope.
   * Call `StellaOps.Attestor` to submit the DSSE → Rekor and get the inclusion proof.([git.stella-ops.org][2])
   * Persist:

     * `offline-update.dsse.json`
     * `offline-update.rekor.json`
     * any log segment artifacts.

2. **Integrate into offline kit packaging** (EXPORT‑ATTEST‑74‑002 / 75‑001)

   * The OUK builder (the Python script `ops/offline-kit/build_offline_kit.py`) already assembles artifacts & manifests.([Stella Ops][8])
   * Extend that pipeline (or add an Export Center step) to:

     * fetch the attestation bundle for the snapshot,
     * place it under `/attestations/` in the kit staging dir, and
     * ensure `offline-manifest.json` contains entries for the DSSE and Rekor files (name, sha256, size, capturedAt).([git.stella-ops.org][1])

3. **Contracts & schemas**

   * Define a small JSON schema for `offline-update.rekor.json` (UUID, index, proof fields) and check it into `docs/11_DATA_SCHEMAS.md` or module‑local schemas.
   * Keep all new payload schemas **versioned**; avoid “shape drift”.

**Do / Don’t**

* ✅ **Do** treat the attestation bundle job as *pure aggregation* (AOC guardrail: no modification of evidence).([git.stella-ops.org][5])
* ✅ **Do** rely on Signer + Attestor; don’t hand‑roll DSSE/Rekor logic in Export Center.([git.stella-ops.org][2])
* ❌ **Don’t** reach out to external networks from this job — it must run with the same offline‑ready posture as the rest of the platform.

---

### 3.2 Offline Update Kit builder

**Working area:** `ops/offline-kit/*` + `docs/24_OFFLINE_KIT.md`([git.stella-ops.org][1])

Guidelines:

1. **Preserve current guarantees**

   * Imports must remain **idempotent and atomic**, with **old feeds kept until the new bundle is fully verified**. This now includes DSSE/Rekor checks in addition to Cosign + JWS.([git.stella-ops.org][1])

2. **Staging layout**

   * When staging a kit, ensure the tree looks like:

     ```text
     out/offline-kit/staging/
       feeds/...
       images/...
       manifest/offline-manifest.json
       attestations/offline-update.dsse.json
       attestations/offline-update.rekor.json
     ```

   * Update `offline-manifest.json` so each new file appears with:

     * `name`, `sha256`, `size`, `capturedAt`.([git.stella-ops.org][1])

3. **Deterministic ordering**

   * File lists in manifests must be in a stable order (e.g., lexical paths).
   * Timestamps are UTC ISO‑8601 only; never use local time. (This matches the determinism guidance in AGENTS.md + the policy/runs docs.)([git.stella-ops.org][9])

4. **Delta kits**

   * For deltas (`stella-ouk-YYYY-MM-DD.delta.tgz`), DSSE should still cover:

     * the delta tarball digest, and
     * the **logical state** (feeds & versions) after applying the delta.

   * Don’t shortcut by “attesting only the diff files” — the predicate must describe the resulting snapshot.
---

### 3.3 Scanner — import & activation

**Working directory:** `src/Scanner/StellaOps.Scanner.WebService`, `StellaOps.Scanner.Worker`([git.stella-ops.org][9])

Scanner already exposes admin flows for:

* **Offline kit import**, which:

  * validates the Cosign signature of the kit,
  * uses the attested manifest, and
  * keeps old feeds until verification is done.([git.stella-ops.org][1])

Add DSSE/Rekor awareness as follows (a verification‑sequence sketch follows this list):

1. **Verification sequence (happy path)**

   On `import-offline-usage-kit`:

   1. Validate the **Cosign** signature of the tarball.
   2. Validate `offline-manifest.json` with its JWS signature.
   3. Verify the **file digests** for all entries (including `/attestations/*`).
   4. Verify the **DSSE**:

      * Call `StellaOps.Attestor.Verify` (or the CLI equivalent) with:

        * `offline-update.dsse.json`
        * `offline-update.rekor.json`
        * the local Rekor log snapshot / segment (if configured)([git.stella-ops.org][2])

      * Ensure the payload digest matches the kit tarball + manifest digests.

   5. Only after all checks pass:

      * swap Scanner’s feed pointer to the new snapshot, and
      * emit an audit event noting:

        * the kit filename and tarball digest,
        * the DSSE statement digest,
        * the Rekor UUID + log index.

2. **Config surface**

   Add config keys (names illustrative):

   ```yaml
   scanner:
     offlineKit:
       requireDsse: true        # fail import if DSSE/Rekor verification fails
       rekorOfflineMode: true   # use local snapshots only
       attestationVerifier: https://attestor.internal
   ```

   * Mirror them via ASP.NET Core config + env vars (`SCANNER__OFFLINEKIT__REQUIREDSSE`, etc.), following the same pattern as the DSSE/Rekor operator guide.([git.stella-ops.org][2])

3. **Failure behaviour**

   * **DSSE/Rekor fails, Cosign + manifest OK**

     * Keep old feeds active.
     * Mark the import as failed; surface a `ProblemDetails` error via API/UI.
     * Log structured fields: `rekorUuid`, `attestationDigest`, `offlineKitHash`, `failureReason`.([git.stella-ops.org][2])

   * **Config flag to soften during rollout**

     * When `requireDsse=false`, treat DSSE/Rekor failure as a warning and still allow the import (for an initial observation phase), but emit alerts. This mirrors the “observe → enforce” pattern in the DSSE/Rekor operator guide.([git.stella-ops.org][2])
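
Putting the sequence together, a sketch of the import path (every helper type here, `Cosign`, `Jws`, `Digests`, `Attestor`, `Feeds`, `Audit`, `Alerts`, `KitPaths`, is an illustrative placeholder, not an existing API):

```csharp
using System.Threading.Tasks;

// Sketch only: verification order for import-offline-usage-kit.
public async Task<ImportResult> ImportOfflineKitAsync(KitPaths kit, ImportOptions opts)
{
    if (!await Cosign.VerifyAsync(kit.Tarball)) return ImportResult.Fail("cosign");
    if (!await Jws.VerifyAsync(kit.Manifest, kit.ManifestJws)) return ImportResult.Fail("manifest-jws");
    if (!await Digests.VerifyAllAsync(kit.Manifest)) return ImportResult.Fail("file-digests");

    var dsse = await Attestor.VerifyAsync(kit.DsseEnvelope, kit.RekorReceipt, opts.RekorSnapshot);
    if (!dsse.Ok && opts.RequireDsse)
        return ImportResult.Fail($"dsse:{dsse.Reason}"); // old feeds stay active
    if (!dsse.Ok)
        Alerts.Warn("dsse-soft-fail", dsse.Reason);      // requireDsse=false: warn, continue

    await Feeds.AtomicSwapAsync(kit);                    // only now move the feed pointer
    Audit.Emit(kit.TarballDigest, dsse.StatementDigest, dsse.RekorUuid, dsse.LogIndex);
    return ImportResult.Success();
}
```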
---
|
||||||
|
|
||||||
|
### 3.4 Signer & Attestor
|
||||||
|
|
||||||
|
You mostly **reuse** existing guidance:([git.stella-ops.org][2])
|
||||||
|
|
||||||
|
* Add a new predicate type & schema for offline updates in Signer.
|
||||||
|
|
||||||
|
* Ensure Attestor:
|
||||||
|
|
||||||
|
* can submit offline‑update DSSE envelopes to Rekor,
|
||||||
|
* can emit verification routines (used by CLI and Scanner) that:
|
||||||
|
|
||||||
|
* verify the DSSE signature,
|
||||||
|
* check the certificate chain against the configured root pack (FIPS/eIDAS/GOST/SM, etc.),([Stella Ops][4])
|
||||||
|
* verify Rekor inclusion using either live log or local snapshot.
|
||||||
|
|
||||||
|
* For fully air‑gapped installs:
|
||||||
|
|
||||||
|
* rely on Rekor **snapshots mirrored** into Offline Kit (already recommended in the operator guide’s offline section).([git.stella-ops.org][2])
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3.5 CLI & UI
|
||||||
|
|
||||||
|
Extend CLI with explicit verbs (matching EXPORT‑ATTEST sprints):([git.stella-ops.org][10])
|
||||||
|
|
||||||
|
* `stella attest bundle verify --bundle path/to/offline-kit.tgz --rekor-key rekor.pub`
|
||||||
|
* `stella attest bundle import --bundle ...` (for sites that prefer a two‑step “verify then import” flow)
|
||||||
|
* Wire UI Admin → Offline Kit screen so that:
|
||||||
|
|
||||||
|
* verification status shows both **Cosign/JWS** and **DSSE/Rekor** state,
|
||||||
|
* policy banners display kit generation time, manifest hash, and DSSE/Rekor freshness.([Stella Ops][11])
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Determinism & offline‑safety rules
|
||||||
|
|
||||||
|
When touching any of this code, keep these rules front‑of‑mind (they align with the policy DSL and architecture docs):([Stella Ops][4])
|
||||||
|
|
||||||
|
1. **No hidden network dependencies**
|
||||||
|
|
||||||
|
* All verification **must work offline** given the kit + Rekor snapshots.
|
||||||
|
* Any fallback to live Rekor / Fulcio endpoints must be explicitly toggled and never on by default for “offline mode”.
|
||||||
|
|
||||||
|
2. **Stable serialization**
|
||||||
|
|
||||||
|
* DSSE payload JSON:
|
||||||
|
|
||||||
|
* stable ordering of fields,
|
||||||
|
* no float weirdness,
|
||||||
|
* UTC timestamps.
|
||||||
|
|
||||||
|
3. **Replayable imports**
|
||||||
|
|
||||||
|
* Running `import-offline-usage-kit` twice with the same bundle must be a no‑op after the first time.
|
||||||
|
* The DSSE payload for a given snapshot must not change over time; if it does, bump the predicate or snapshot version.
|
||||||
|
|
||||||
|
4. **Explainability**
|
||||||
|
|
||||||
|
* When verification fails, errors must explain **what** mismatched (kit digest, manifest digest, DSSE envelope hash, Rekor inclusion) so auditors can reason about it.

---

## 5. Testing & CI expectations

Tie this into the existing CI workflows (`scanner-determinism.yml`, `attestation-bundle.yml`, `offline-kit` pipelines, etc.):([git.stella-ops.org][12])

### 5.1 Unit & integration tests

Write tests that cover:

1. **Happy paths**

   * Full kit import with valid:

     * Cosign,
     * manifest JWS,
     * DSSE,
     * Rekor proof (online and offline modes).

2. **Corruption scenarios**

   * Tampered feed file (hash mismatch).
   * Tampered `offline-manifest.json`.
   * Tampered DSSE payload (signature fails).
   * Mismatched Rekor entry (payload digest doesn’t match the DSSE envelope).

3. **Offline scenarios**

   * No network access, only a Rekor snapshot:

     * DSSE verification still passes,
     * Rekor proof validates against the local tree head.

4. **Roll‑back logic**

   * Import fails at the DSSE/Rekor step:

     * scanner DB still points at the previous feeds,
     * metrics/logs show the failure and no partial state.

### 5.2 SLOs & observability

Reuse the metrics suggested by the DSSE/Rekor guide and adapt them to OUK imports (a registration sketch follows):([git.stella-ops.org][2])

* `offlinekit_import_total{status="success|failed_dsse|failed_rekor|failed_cosign"}`
* `offlinekit_attestation_verify_latency_seconds` (histogram)
* `attestor_rekor_success_total`, `attestor_rekor_retry_total`, `rekor_inclusion_latency`
* Dashboards: kit versions per environment, time since last kit, DSSE/Rekor health.
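A registration sketch for the first two metrics, assuming the services use the prometheus-net package (the actual Stella Ops metrics stack may differ):

```csharp
using Prometheus;

public static class OfflineKitMetrics
{
    public static readonly Counter ImportTotal = Metrics.CreateCounter(
        "offlinekit_import_total",
        "Offline kit import attempts by outcome.",
        new CounterConfiguration { LabelNames = new[] { "status" } });

    public static readonly Histogram VerifyLatency = Metrics.CreateHistogram(
        "offlinekit_attestation_verify_latency_seconds",
        "End-to-end DSSE + Rekor verification latency per kit.");
}

// Usage inside the import flow:
//   using (OfflineKitMetrics.VerifyLatency.NewTimer()) { /* ...verify... */ }
//   OfflineKitMetrics.ImportTotal.WithLabels("failed_dsse").Inc();
```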

---

## 6. Developer checklist (TL;DR)

When you pick up a task touching DSSE‑signed offline updates:

1. **Read the background**

   * `docs/modules/scanner/operations/dsse-rekor-operator-guide.md`([git.stella-ops.org][2])
   * `docs/24_OFFLINE_KIT.md` (and the public offline kit guide).([git.stella-ops.org][1])
   * The relevant sprint file (`SPRINT_160_export_evidence`, `SPRINT_162_exportcenter_i`, etc.).([git.stella-ops.org][10])

2. **Implement**

   * Generate DSSE payloads in Export Center only.
   * Call Signer & Attestor; persist DSSE + Rekor JSON next to manifests.
   * Extend the OUK builder to include the attestation bundle and list it in `offline-manifest.json`.
   * Update the Scanner import flow to verify DSSE/Rekor before swapping feeds.

3. **Test**

   * Unit tests for bundle composition & schema.
   * Integration tests for import + rollback.
   * Determinism tests (same inputs → same DSSE payload).

4. **Wire telemetry**

   * Counters + latency histograms.
   * Logs with `offlineKitHash`, `attestationDigest`, `rekorUuid`.

5. **Document**

   * Update `docs/modules/export-center/architecture.md`, `docs/modules/scanner/architecture.md`, and the OUK docs where flows or contracts changed.([git.stella-ops.org][5])

---

If you tell me which module you’re actually coding in next (Scanner, Export Center, CLI, or Attestor), I can turn this into a very concrete “AGENTS.md‑style” section with exact file paths, class names, and a starter test layout for that module.

[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7bf40b8589c94078e8eadb240553c02f097a5127/docs/24_OFFLINE_KIT.md "24_OFFLINE_KIT.md - git.stella-ops.org"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/13e4b53dda1575ba46c6188c794fd465ec6fdeec/docs/modules/scanner/operations/dsse-rekor-operator-guide.md "dsse-rekor-operator-guide.md - git.stella-ops.org"
[3]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/raw/commit/61f963fd52cd4d6bb2f86afc5a82eac04c04b00e/docs/implplan/SPRINT_162_exportcenter_i.md?utm_source=chatgpt.com "SPRINT_162_exportcenter_i.md - git.stella-ops.org"
[4]: https://stella-ops.org/docs/07_high_level_architecture/index.html?utm_source=chatgpt.com "High-Level Architecture - Stella Ops"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d870da18ce194c6a5f1a6d71abea36205d9fb276/docs/export-center/architecture.md?utm_source=chatgpt.com "Export Center Architecture - Stella Ops"
[6]: https://stella-ops.org/docs/moat/?utm_source=chatgpt.com "docs/moat - Stella Ops"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/79b8e53441e92dbc63684f42072434d40b80275f/src/ExportCenter?utm_source=chatgpt.com "src/ExportCenter - git.stella-ops.org"
[8]: https://stella-ops.org/docs/24_offline_kit/?utm_source=chatgpt.com "Offline Update Kit (OUK) — Air‑Gap Bundle - Stella Ops"
[9]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7768555f2d107326050cc5ff7f5cb81b82b7ce5f/AGENTS.md "AGENTS.md - git.stella-ops.org"
[10]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/66cb6c4b8af58a33efa1521b7953dda834431497/docs/implplan/SPRINT_160_export_evidence.md?utm_source=chatgpt.com "SPRINT_160_export_evidence.md - git.stella-ops.org"
[11]: https://stella-ops.org/about/?utm_source=chatgpt.com "Signed Reachability · Deterministic Replay · Sovereign Crypto - Stella Ops"
[12]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/actions/?actor=0&status=0&workflow=sdk-publish.yml&utm_source=chatgpt.com "Actions - git.stella-ops.org"

@@ -0,0 +1,819 @@

Here’s a crisp, opinionated storage blueprint you can hand to your Stella Ops devs right now, plus zero‑downtime conversion tactics so you can keep prototyping fast without painting yourself into a corner.

# Module → store map (deterministic by default)

* **Authority / OAuth / Accounts & Audit**

  * **PostgreSQL** as the primary source of truth.
  * Tables: `users`, `clients`, `oauth_tokens`, `roles`, `grants`, `audit_log`.
  * **Row‑Level Security (RLS)** on `users`, `grants`, `audit_log`; **STRICT FK + CHECK** constraints; **immutable UUID PKs**.
  * **Audit**: `audit_log(actor_id, action, entity, entity_id, at timestamptz default now(), diff jsonb)`.
  * **Why**: ACID + RLS keeps authz decisions and audit trails deterministic and reviewable.

* **VEX & Vulnerability Writes**

  * **PostgreSQL** with **JSONB facts + relational decisions**.
  * Tables: `vuln_fact(jsonb)`, `vex_decision(package_id, vuln_id, status, rationale, proof_ref, updated_at)`.
  * **Materialized views** for triage queues, e.g. `mv_triage_hotset` (refresh on commit or scheduled).
  * **Why**: JSONB lets you ingest vendor‑shaped docs; decisions stay relational for joins, integrity, and explainability.

* **Routing / Feature Flags / Rate‑limits**

  * **PostgreSQL** (truth) + **Redis** (cache).
  * Tables: `feature_flag(key, rules jsonb, version)`, `route(domain, service, instance_id, last_heartbeat)`, `rate_limiter(bucket, quota, interval)`.
  * Redis keys: `flag:{key}:{version}`, `route:{domain}`, `rl:{bucket}` with short TTLs.
  * **Why**: one canonical RDBMS for consistency; Redis for hot‑path latency.

* **Unknowns Registry (ambiguity tracker)**

  * **PostgreSQL** with **temporal tables** (bitemporal pattern via `valid_from/valid_to`, `sys_from/sys_to`).
  * Table: `unknowns(subject_hash, kind, context jsonb, valid_from, valid_to, sys_from default now(), sys_to)`.
  * Views: `unknowns_current` where `valid_to is null`.
  * **Why**: preserves how/when uncertainty changed (critical for proofs and audits).

* **Artifacts / SBOM / VEX files**

  * **OCI‑compatible CAS** (e.g., self‑hosted registry or MinIO bucket as a content‑addressable store).
  * Key by **digest** (`sha256:...`), metadata in a Postgres `artifact_index` table with `digest`, `media_type`, `size`, `signatures`.
  * **Why**: blobs don’t belong in your RDBMS; use CAS for scale + cryptographic addressing.

---

# PostgreSQL implementation essentials (copy/paste starters)

* **RLS scaffold (Authority)**:

```sql
alter table audit_log enable row level security;

create policy p_audit_read_self
  on audit_log for select
  using (actor_id = current_setting('app.user_id')::uuid or
         exists (select 1 from grants g
                 where g.user_id = current_setting('app.user_id')::uuid
                   and g.role = 'auditor'));
```

* **JSONB facts + relational decisions**:

```sql
create table vuln_fact (
  id          uuid primary key default gen_random_uuid(),
  source      text not null,
  payload     jsonb not null,
  received_at timestamptz default now()
);

create table vex_decision (
  package_id uuid not null,
  vuln_id    text not null,
  status     text check (status in ('not_affected','affected','fixed','under_investigation')),
  rationale  text,
  proof_ref  text,
  decided_at timestamptz default now(),
  primary key (package_id, vuln_id)
);
```

* **Materialized view for triage**:

```sql
create materialized view mv_triage_hotset as
select v.id as fact_id, v.payload->>'vuln' as vuln, v.received_at
from vuln_fact v
where (now() - v.received_at) < interval '7 days';
-- refresh concurrently via job
```

* **Temporal pattern (Unknowns)**:

```sql
create table unknowns (
  id           uuid primary key default gen_random_uuid(),
  subject_hash text not null,
  kind         text not null,
  context      jsonb not null,
  valid_from   timestamptz not null default now(),
  valid_to     timestamptz,
  sys_from     timestamptz not null default now(),
  sys_to       timestamptz
);

create view unknowns_current as
select * from unknowns where valid_to is null;
```

---

# Conversion (not migration): zero‑downtime, prototype‑friendly

Even if you’re “not migrating anything yet,” set these rails now so cutting over later is painless.

1. **Encode Mongo‑shaped docs into JSONB with versioned schemas**

   * The ingest pipeline writes to `*_fact(payload jsonb, schema_version int)`.
   * Add a **`validate(schema_version, payload)`** step in your service layer (JSON Schema or SQL checks).
   * Keep a **forward‑compatible view** that projects stable columns from JSONB (e.g., `payload->>'id' as vendor_id`) so downstream code doesn’t break when the payload evolves.

2. **Outbox pattern for exactly‑once side‑effects** (see the sketch after this list)

   * Add `outbox(id, topic, key, payload jsonb, created_at, dispatched bool default false)`.
   * In the same transaction as your write, insert the outbox row.
   * A background dispatcher reads `dispatched=false`, publishes to MQ/webhook, then marks `dispatched=true`.
   * Guarantees: no lost events; duplicates are possible only in the narrow crash window between publish and mark, so external consumers should dedupe by outbox `id`.

3. **Parallel read adapters behind feature flags**

   * Keep old readers (e.g., the Mongo driver) and new Postgres readers in the same service.
   * Gate by `feature_flag('pg_reads')` per tenant or env; flip gradually.
   * Add a **read‑diff monitor** that compares results and logs mismatches to `audit_log(diff)`.

4. **CDC for analytics without coupling**

   * Enable **logical replication** (pgoutput) on your key tables.
   * Stream changes into analyzers (reachability, heuristics) without hitting primaries.
   * This lets you keep OLTP clean and still power dashboards/tests.

5. **Materialized views & job cadence**

   * Refresh `mv_*` on a fixed cadence (e.g., every 2–5 minutes) or post‑commit for hot paths.
   * Keep **“cold path”** analytics in separate schemas (`analytics.*`) sourced from CDC.

6. **Cutover playbook (phased)**

   * Phase A (Dark Read): write Postgres, still serve from Mongo; compare results silently.
   * Phase B (Shadow Serve): 5–10% traffic from Postgres via flag; auto‑rollback switch.
   * Phase C (Authoritative): Postgres becomes the source; the Mongo path is left for emergency read‑only.
   * Phase D (Retire): freeze Mongo, back up, remove writes, delete code paths after 2 stable sprints.
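A minimal sketch of the outbox write referenced in item 2, using Npgsql against the `vex_decision` and `outbox` starters above (connection handling and event payload shape are illustrative):

```csharp
using Npgsql;
using NpgsqlTypes;

public static class VexDecisionWriter
{
    public static async Task RecordDecisionWithEventAsync(
        NpgsqlDataSource db, Guid packageId, string vulnId, string status, string eventJson)
    {
        await using var conn = await db.OpenConnectionAsync();
        await using var tx = await conn.BeginTransactionAsync();

        // 1) Primary write.
        await using (var cmd = new NpgsqlCommand(
            """
            insert into vex_decision (package_id, vuln_id, status)
            values ($1, $2, $3)
            on conflict (package_id, vuln_id) do update set status = excluded.status
            """, conn, tx))
        {
            cmd.Parameters.Add(new() { Value = packageId });
            cmd.Parameters.Add(new() { Value = vulnId });
            cmd.Parameters.Add(new() { Value = status });
            await cmd.ExecuteNonQueryAsync();
        }

        // 2) Outbox row in the SAME transaction: the commit is all-or-nothing.
        await using (var cmd = new NpgsqlCommand(
            "insert into outbox (id, topic, key, payload) values (gen_random_uuid(), $1, $2, $3)",
            conn, tx))
        {
            cmd.Parameters.Add(new() { Value = "vex.decision.changed" });
            cmd.Parameters.Add(new() { Value = $"{packageId}:{vulnId}" });
            cmd.Parameters.Add(new NpgsqlParameter { NpgsqlDbType = NpgsqlDbType.Jsonb, Value = eventJson });
            await cmd.ExecuteNonQueryAsync();
        }

        await tx.CommitAsync();
    }
}
```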

---

# Rate‑limits & flags: single truth, fast edges

* **Truth in Postgres** with versioned flag docs:

```sql
create table feature_flag (
  key        text primary key,
  rules      jsonb not null,
  version    int not null default 1,
  updated_at timestamptz default now()
);
```

* **Edge cache** in Redis:

  * `SETEX flag:{key}:{version} <ttl> <json>`
  * On update, bump `version`; readers compose the cache key with the version (cache‑busting without deletes).

* **Rate limiting**: persist quotas in Postgres; keep counters in Redis (`INCR rl:{bucket}:{window}`), with periodic reconciliation jobs writing summaries back to Postgres for audits. A combined sketch follows.
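A combined sketch, assuming StackExchange.Redis (key shapes from the bullets above; TTLs and the Postgres loader are illustrative):

```csharp
using StackExchange.Redis;

public sealed class EdgeCache(IDatabase redis)
{
    // Versioned flag read: the version in the key does the cache-busting,
    // so updates never need an explicit DEL.
    public async Task<string?> GetFlagJsonAsync(string key, int version)
    {
        var cacheKey = $"flag:{key}:{version}";
        var cached = await redis.StringGetAsync(cacheKey);
        if (cached.HasValue) return cached;

        string? json = await LoadFlagFromPostgresAsync(key, version);
        if (json is not null)
            await redis.StringSetAsync(cacheKey, json, TimeSpan.FromSeconds(30));
        return json;
    }

    // Fixed-window counter: the first INCR in a window sets the expiry.
    public async Task<bool> AllowAsync(string bucket, long quota, TimeSpan window)
    {
        var slot = DateTimeOffset.UtcNow.ToUnixTimeSeconds() / (long)window.TotalSeconds;
        var key = $"rl:{bucket}:{slot}";
        var count = await redis.StringIncrementAsync(key);
        if (count == 1)
            await redis.KeyExpireAsync(key, window);
        return count <= quota;
    }

    private Task<string?> LoadFlagFromPostgresAsync(string key, int version) =>
        Task.FromResult<string?>(null); // placeholder for the Postgres read
}
```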

---

# CAS for SBOM/VEX/attestations

* Push blobs to OCI/MinIO by digest; store only pointers in Postgres:

```sql
create table artifact_index (
  digest         text primary key,
  media_type     text not null,
  size           bigint not null,
  created_at     timestamptz default now(),
  signature_refs jsonb
);
```

* Benefits: immutable, deduped, easy to mirror into offline kits.
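A minimal write-path sketch: hash the blob, push it to the CAS by digest, then record the pointer (the push delegate stands in for whatever OCI/MinIO client you use):

```csharp
using System.Security.Cryptography;
using Npgsql;

public static class ArtifactStore
{
    // Assumes a seekable stream; the digest is computed before the upload.
    public static async Task<string> PutAsync(
        NpgsqlDataSource db, Stream blob, string mediaType, Func<string, Stream, Task> pushToCas)
    {
        // Content address first: the digest is the identity everywhere else.
        var digest = "sha256:" + Convert.ToHexString(await SHA256.HashDataAsync(blob)).ToLowerInvariant();
        blob.Position = 0;
        await pushToCas(digest, blob); // e.g., registry or bucket put, keyed by digest

        // Idempotent pointer insert: re-uploads of the same content are no-ops.
        await using var cmd = db.CreateCommand(
            """
            insert into artifact_index (digest, media_type, size)
            values ($1, $2, $3)
            on conflict (digest) do nothing
            """);
        cmd.Parameters.Add(new() { Value = digest });
        cmd.Parameters.Add(new() { Value = mediaType });
        cmd.Parameters.Add(new() { Value = blob.Length });
        await cmd.ExecuteNonQueryAsync();
        return digest;
    }
}
```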

---

# Guardrails your team should follow

* **Always** wrap multi‑table writes (facts + outbox + decisions) in a single transaction.
* **Prefer** `jsonb_path_query` for targeted reads; **avoid** scanning entire payloads.
* **Enforce** RLS + least‑privilege roles; the application sets `app.user_id` at session start.
* **Version everything**: schemas, flags, materialized views; never “change in place” without bumping the version.
* **Observability**: expose `pg_stat_statements`, refresh latency for `mv_*`, outbox lag, Redis hit ratio, and RLS policy hits.

---

If you want, I can turn this into:

* ready‑to‑run **EF Core 10** migrations,
* a **/docs/architecture/store-map.md** for your repo,
* and a tiny **dev seed** (Docker + sample data) so the team can poke it immediately.

Here’s a focused “PostgreSQL patterns per module” doc you can hand straight to your StellaOps devs.

---

# StellaOps – PostgreSQL Patterns per Module

**Scope:** how each StellaOps module should use PostgreSQL: schema patterns, constraints, RLS, indexing, and transaction rules.

---

## 0. Cross‑cutting PostgreSQL Rules

These apply everywhere unless explicitly overridden.

### 0.1 Core conventions

* **Schemas**

  * Use **one logical schema** per module: `authority`, `routing`, `vex`, `unknowns`, `artifact`.
  * Shared utilities (e.g., `outbox`) live in a `core` schema.

* **Naming**

  * Tables: `snake_case`, singular: `user`, `feature_flag`, `vuln_fact`. (Note: `user` and `grant` are reserved words in PostgreSQL, so those tables must be quoted in DDL, as in the examples below.)
  * PK: `id uuid primary key`.
  * FKs: `<referenced_table>_id` (e.g., `user_id`, `tenant_id`).
  * Timestamps:

    * `created_at timestamptz not null default now()`
    * `updated_at timestamptz not null default now()`

* **Multi‑tenancy**

  * All tenant‑scoped tables must have `tenant_id uuid not null`.
  * Enforce tenant isolation with **RLS** on `tenant_id`.

* **Time & timezones**

  * Always `timestamptz`, always store **UTC**, let the DB default `now()`.

### 0.2 RLS & security

* RLS must be **enabled** on any table reachable from a user‑initiated path.
* Every session must set (see the wiring sketch after this list):

```sql
select set_config('app.user_id', '<uuid>', false);
select set_config('app.tenant_id', '<uuid>', false);
select set_config('app.roles', 'role1,role2', false);
```

* RLS policies:

  * Base policy: `tenant_id = current_setting('app.tenant_id')::uuid`.
  * Extra predicates for per‑user privacy (e.g., only see own tokens, only own API clients).

* DB users:

  * Each module’s service has its **own role** with access only to its schema + `core.outbox`.
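A wiring sketch for those session settings, assuming Npgsql (transaction-scoped `set_config(..., true)` is an alternative when every request runs inside one transaction):

```csharp
using Npgsql;

public static class TenantSession
{
    // Sets the per-session GUCs that RLS policies read via current_setting().
    // Npgsql resets session state when the connection returns to the pool,
    // so these values do not leak between logical requests.
    public static async Task<NpgsqlConnection> OpenForAsync(
        NpgsqlDataSource db, Guid tenantId, Guid userId, string roles)
    {
        var conn = await db.OpenConnectionAsync();
        await using var cmd = new NpgsqlCommand(
            """
            select set_config('app.tenant_id', $1, false),
                   set_config('app.user_id',   $2, false),
                   set_config('app.roles',     $3, false)
            """, conn);
        cmd.Parameters.Add(new() { Value = tenantId.ToString() });
        cmd.Parameters.Add(new() { Value = userId.ToString() });
        cmd.Parameters.Add(new() { Value = roles });
        await cmd.ExecuteNonQueryAsync();
        return conn; // caller disposes; queries on it now run under RLS
    }
}
```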

### 0.3 JSONB & versioning

* Any JSONB column must have:

  * `payload jsonb not null`,
  * `schema_version int not null`.

* Always index:

  * by source (`source` / `origin`),
  * by a small set of projected fields used in WHERE clauses.

### 0.4 Migrations

* All schema changes via migrations, forward‑only.
* Backwards‑compat pattern:

  1. Add new columns / tables.
  2. Backfill.
  3. Flip code to use the new structure (behind a feature flag).
  4. After stability, remove old columns/paths.

---

## 1. Authority Module (auth, accounts, audit)

**Schema:** `authority.*`
**Mission:** identity, OAuth, roles, grants, audit.

### 1.1 Core tables & patterns

* `authority."user"` (quoted: `user` is reserved)

```sql
create table authority."user" (
  id           uuid primary key default gen_random_uuid(),
  tenant_id    uuid not null,
  email        text not null,
  display_name text not null,
  is_disabled  boolean not null default false,
  created_at   timestamptz not null default now(),
  updated_at   timestamptz not null default now(),
  unique (tenant_id, email)
);
```

* Never hard‑delete users: use `is_disabled` (and optionally `disabled_at`).

* `authority.role`

```sql
create table authority.role (
  id          uuid primary key default gen_random_uuid(),
  tenant_id   uuid not null,
  name        text not null,
  description text,
  created_at  timestamptz not null default now(),
  updated_at  timestamptz not null default now(),
  unique (tenant_id, name)
);
```

* `authority."grant"` (quoted: `grant` is reserved)

```sql
create table authority."grant" (
  id         uuid primary key default gen_random_uuid(),
  tenant_id  uuid not null,
  user_id    uuid not null references authority."user"(id),
  role_id    uuid not null references authority.role(id),
  created_at timestamptz not null default now(),
  unique (tenant_id, user_id, role_id)
);
```

* `authority.oauth_client`, `authority.oauth_token`

  * Enforce token uniqueness:

```sql
create table authority.oauth_token (
  id         uuid primary key default gen_random_uuid(),
  tenant_id  uuid not null,
  user_id    uuid not null references authority."user"(id),
  client_id  uuid not null references authority.oauth_client(id),
  token_hash text not null, -- hash, never raw
  expires_at timestamptz not null,
  created_at timestamptz not null default now(),
  revoked_at timestamptz,
  unique (token_hash)
);
```

### 1.2 Audit log pattern

* `authority.audit_log`

```sql
create table authority.audit_log (
  id          uuid primary key default gen_random_uuid(),
  tenant_id   uuid not null,
  actor_id    uuid, -- null for system
  action      text not null,
  entity_type text not null,
  entity_id   uuid,
  at          timestamptz not null default now(),
  diff        jsonb not null
);
```

* Insert audit rows in the **same transaction** as the change.

### 1.3 RLS patterns

* Base RLS:

```sql
alter table authority."user" enable row level security;

create policy p_user_tenant on authority."user"
  for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```

* Extra policies:

  * The audit log is visible only to:

    * the actor themselves, or
    * users with an `auditor` or `admin` role.

---

## 2. Routing & Feature Flags Module

**Schema:** `routing.*`
**Mission:** where instances live, what features are on, rate‑limit configuration.

### 2.1 Feature flags

* `routing.feature_flag`

```sql
create table routing.feature_flag (
  id         uuid primary key default gen_random_uuid(),
  tenant_id  uuid not null,
  key        text not null,
  rules      jsonb not null,
  version    int not null default 1,
  is_enabled boolean not null default true,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  unique (tenant_id, key)
);
```

* **Immutability by version**:

  * On update, **increment `version`**; don’t overwrite historical data.
  * Mirror changes into a history table via trigger:

```sql
create table routing.feature_flag_history (
  id              uuid primary key default gen_random_uuid(),
  feature_flag_id uuid not null references routing.feature_flag(id),
  tenant_id       uuid not null,
  key             text not null,
  rules           jsonb not null,
  version         int not null,
  changed_at      timestamptz not null default now(),
  changed_by      uuid
);
```

### 2.2 Instance registry

* `routing.instance`

```sql
create table routing.instance (
  id             uuid primary key default gen_random_uuid(),
  tenant_id      uuid not null,
  instance_key   text not null,
  domain         text not null,
  last_heartbeat timestamptz not null default now(),
  status         text not null check (status in ('active','draining','offline')),
  created_at     timestamptz not null default now(),
  updated_at     timestamptz not null default now(),
  unique (tenant_id, instance_key),
  unique (tenant_id, domain)
);
```

* Pattern:

  * Heartbeats use `update ... set last_heartbeat = now()` without touching other fields.
  * Routing logic filters by `status='active'` and heartbeat recency.

### 2.3 Rate‑limit configuration

* Config in Postgres, counters in Redis:

```sql
create table routing.rate_limit_config (
  id                 uuid primary key default gen_random_uuid(),
  tenant_id          uuid not null,
  key                text not null,
  limit_per_interval int not null,
  interval_seconds   int not null,
  created_at         timestamptz not null default now(),
  updated_at         timestamptz not null default now(),
  unique (tenant_id, key)
);
```

---

## 3. VEX & Vulnerability Module

**Schema:** `vex.*`
**Mission:** ingest vulnerability facts, keep decisions & triage state.

### 3.1 Facts as JSONB

* `vex.vuln_fact`

```sql
create table vex.vuln_fact (
  id             uuid primary key default gen_random_uuid(),
  tenant_id      uuid not null,
  source         text not null, -- e.g. "nvd", "vendor_x_vex"
  external_id    text,          -- e.g. CVE, advisory id
  payload        jsonb not null,
  schema_version int not null,
  received_at    timestamptz not null default now()
);
```

* Index patterns:

```sql
create index on vex.vuln_fact (tenant_id, source);
create index on vex.vuln_fact (tenant_id, external_id);
create index vuln_fact_payload_gin on vex.vuln_fact using gin (payload);
```

### 3.2 Decisions as relational data

* `vex.package`

```sql
create table vex.package (
  id         uuid primary key default gen_random_uuid(),
  tenant_id  uuid not null,
  name       text not null,
  version    text not null,
  ecosystem  text not null, -- e.g. "pypi", "npm"
  created_at timestamptz not null default now(),
  unique (tenant_id, name, version, ecosystem)
);
```

* `vex.vex_decision`

```sql
create table vex.vex_decision (
  id         uuid primary key default gen_random_uuid(),
  tenant_id  uuid not null,
  package_id uuid not null references vex.package(id),
  vuln_id    text not null,
  status     text not null check (status in (
    'not_affected', 'affected', 'fixed', 'under_investigation'
  )),
  rationale  text,
  proof_ref  text, -- CAS digest or URL
  decided_by uuid,
  decided_at timestamptz not null default now(),
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  unique (tenant_id, package_id, vuln_id)
);
```

* For history:

  * Keep the current state in `vex_decision`.
  * Mirror previous versions into a `vex_decision_history` table (similar to feature flags).

### 3.3 Triage queues with materialized views

* Example triage view:

```sql
create materialized view vex.mv_triage_queue as
select
  d.tenant_id,
  p.name,
  p.version,
  d.vuln_id,
  d.status,
  d.decided_at
from vex.vex_decision d
join vex.package p on p.id = d.package_id
where d.status = 'under_investigation';
```

* Refresh options:

  * Scheduled refresh (cron/worker).
  * Or **incremental** via triggers (more complex; use only when needed).

### 3.4 RLS for VEX

* All tables scoped by `tenant_id`.
* Typical policy:

```sql
alter table vex.vex_decision enable row level security;

create policy p_vex_tenant on vex.vex_decision
  for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```

---

## 4. Unknowns Module

**Schema:** `unknowns.*`
**Mission:** represent uncertainty and how it changes over time.

### 4.1 Bitemporal unknowns table

* `unknowns.unknown`

```sql
create table unknowns.unknown (
  id           uuid primary key default gen_random_uuid(),
  tenant_id    uuid not null,
  subject_hash text not null,  -- stable identifier for the "thing" being reasoned about
  kind         text not null,  -- e.g. "reachability", "version_inferred"
  context      jsonb not null, -- extra info: call graph node, evidence, etc.
  valid_from   timestamptz not null default now(),
  valid_to     timestamptz,
  sys_from     timestamptz not null default now(),
  sys_to       timestamptz,
  created_at   timestamptz not null default now()
);
```

* “Exactly one open unknown per subject/kind” pattern:

```sql
create unique index unknown_one_open_per_subject
  on unknowns.unknown (tenant_id, subject_hash, kind)
  where valid_to is null;
```

### 4.2 Closing an unknown

* Close by setting `valid_to` and `sys_to`:

```sql
update unknowns.unknown
set valid_to = now(), sys_to = now()
where id = :id and valid_to is null;
```

* Never hard‑delete; keep all rows for audit/explanation. A close‑and‑supersede sketch follows.
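Closing an unknown and recording its successor should be atomic so the partial unique index above never sees two open rows; a sketch with Npgsql (parameter plumbing illustrative):

```csharp
using Npgsql;
using NpgsqlTypes;

public static class UnknownsWriter
{
    // Close the open row and open its successor in one transaction.
    public static async Task SupersedeAsync(
        NpgsqlDataSource db, Guid tenantId, string subjectHash, string kind, string newContextJson)
    {
        await using var conn = await db.OpenConnectionAsync();
        await using var tx = await conn.BeginTransactionAsync();

        await using (var close = new NpgsqlCommand(
            """
            update unknowns.unknown
            set valid_to = now(), sys_to = now()
            where tenant_id = $1 and subject_hash = $2 and kind = $3 and valid_to is null
            """, conn, tx))
        {
            close.Parameters.Add(new() { Value = tenantId });
            close.Parameters.Add(new() { Value = subjectHash });
            close.Parameters.Add(new() { Value = kind });
            await close.ExecuteNonQueryAsync();
        }

        await using (var open = new NpgsqlCommand(
            """
            insert into unknowns.unknown (tenant_id, subject_hash, kind, context)
            values ($1, $2, $3, $4)
            """, conn, tx))
        {
            open.Parameters.Add(new() { Value = tenantId });
            open.Parameters.Add(new() { Value = subjectHash });
            open.Parameters.Add(new() { Value = kind });
            open.Parameters.Add(new NpgsqlParameter { NpgsqlDbType = NpgsqlDbType.Jsonb, Value = newContextJson });
            await open.ExecuteNonQueryAsync();
        }

        await tx.CommitAsync();
    }
}
```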

### 4.3 Convenience views

* Current unknowns:

```sql
create view unknowns.current as
select *
from unknowns.unknown
where valid_to is null;
```

### 4.4 RLS

* Same tenant policy as other modules; unknowns are tenant‑scoped.

---

## 5. Artifact Index / CAS Module

**Schema:** `artifact.*`
**Mission:** index of immutable blobs stored in OCI / S3 / MinIO etc.

### 5.1 Artifact index

* `artifact.artifact`

```sql
create table artifact.artifact (
  digest     text primary key, -- e.g. "sha256:..."
  tenant_id  uuid not null,
  media_type text not null,
  size_bytes bigint not null,
  created_at timestamptz not null default now(),
  created_by uuid
);
```

* Validate digest shape with a CHECK:

```sql
alter table artifact.artifact
  add constraint chk_digest_format
  check (digest ~ '^sha[0-9]+:[0-9a-fA-F]{32,}$');
```

### 5.2 Signatures and tags

* `artifact.signature`

```sql
create table artifact.signature (
  id                uuid primary key default gen_random_uuid(),
  tenant_id         uuid not null,
  artifact_digest   text not null references artifact.artifact(digest),
  signer            text not null,
  signature_payload jsonb not null,
  created_at        timestamptz not null default now()
);
```

* `artifact.tag`

```sql
create table artifact.tag (
  id              uuid primary key default gen_random_uuid(),
  tenant_id       uuid not null,
  name            text not null,
  artifact_digest text not null references artifact.artifact(digest),
  created_at      timestamptz not null default now(),
  unique (tenant_id, name)
);
```

### 5.3 RLS

* Ensure that tenants cannot see each other’s digests, even if the CAS backing store is shared:

```sql
alter table artifact.artifact enable row level security;

create policy p_artifact_tenant on artifact.artifact
  for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```

---

## 6. Shared Outbox / Event Pattern

**Schema:** `core.*`
**Mission:** reliable events for external side‑effects.

### 6.1 Outbox table

* `core.outbox`

```sql
create table core.outbox (
  id                uuid primary key default gen_random_uuid(),
  tenant_id         uuid,
  aggregate_type    text not null, -- e.g. "vex_decision", "feature_flag"
  aggregate_id      uuid,
  topic             text not null,
  payload           jsonb not null,
  created_at        timestamptz not null default now(),
  dispatched_at     timestamptz,
  dispatch_attempts int not null default 0,
  error             text
);
```

### 6.2 Usage rule

* For anything that must emit an event (webhook, Kafka, notifications):

  * In the **same transaction** as the change:

    * write the primary data (e.g. `vex.vex_decision`),
    * insert an `outbox` row.

  * A background worker (sketched below):

    * pulls undelivered rows,
    * sends them to the external system,
    * updates `dispatched_at`/`dispatch_attempts`/`error`.
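A dispatcher sketch using `FOR UPDATE SKIP LOCKED` so multiple workers can poll `core.outbox` without contending (the publish delegate and batch size are illustrative):

```csharp
using Npgsql;

public static class OutboxDispatcher
{
    public static async Task DispatchBatchAsync(
        NpgsqlDataSource db, Func<string, string, Task> publish, int batchSize = 50)
    {
        await using var conn = await db.OpenConnectionAsync();
        await using var tx = await conn.BeginTransactionAsync();

        // SKIP LOCKED lets concurrent workers each claim a disjoint batch.
        var rows = new List<(Guid Id, string Topic, string Payload)>();
        await using (var pick = new NpgsqlCommand(
            """
            select id, topic, payload::text
            from core.outbox
            where dispatched_at is null
            order by created_at
            limit $1
            for update skip locked
            """, conn, tx))
        {
            pick.Parameters.Add(new() { Value = batchSize });
            await using var reader = await pick.ExecuteReaderAsync();
            while (await reader.ReadAsync())
                rows.Add((reader.GetGuid(0), reader.GetString(1), reader.GetString(2)));
        }

        foreach (var (id, topic, payload) in rows)
        {
            // At-least-once: consumers dedupe by id. Rows stay locked during
            // publish, so this suits fast MQ/webhook calls; for slow sinks,
            // add a claimed_at column and publish outside the transaction.
            await publish(topic, payload);
            await using var mark = new NpgsqlCommand(
                """
                update core.outbox
                set dispatched_at = now(), dispatch_attempts = dispatch_attempts + 1
                where id = $1
                """, conn, tx);
            mark.Parameters.Add(new() { Value = id });
            await mark.ExecuteNonQueryAsync();
        }

        await tx.CommitAsync();
    }
}
```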

---

## 7. Indexing & Query Patterns per Module

### 7.1 Authority

* Index:

  * `user(tenant_id, email)`
  * `grant(tenant_id, user_id)`
  * `oauth_token(token_hash)`

* Typical query patterns:

  * Look up a user by `tenant_id + email`.
  * All roles/grants for a user; design composite indexes accordingly.

### 7.2 Routing & Flags

* Index:

  * `feature_flag(tenant_id, key)`
  * partial index on enabled flags:

```sql
create index on routing.feature_flag (tenant_id, key)
where is_enabled;
```

  * `instance(tenant_id, status)`, `instance(tenant_id, domain)`.

### 7.3 VEX

* Index:

  * `package(tenant_id, name, version, ecosystem)`
  * `vex_decision(tenant_id, package_id, vuln_id)`
  * GIN on `vuln_fact.payload` for flexible querying.

### 7.4 Unknowns

* Index:

  * the unique open unknown per subject/kind (shown above).
  * `unknown(tenant_id, kind)` for filtering by kind.

### 7.5 Artifact

* Index:

  * PK on `digest`.
  * `signature(tenant_id, artifact_digest)`.
  * `tag(tenant_id, name)`.

---

## 8. Transaction & Isolation Guidelines

* Default isolation: **READ COMMITTED**.
* For critical sequences (e.g., provisioning a tenant, bulk role updates):

  * consider **REPEATABLE READ** or **SERIALIZABLE** and keep transactions short.

* Pattern:

  * One transaction per logical user action (e.g., “set flag”, “record decision”).
  * Never do long‑running external calls inside a database transaction.

---

If you’d like, as a next step I can turn this into:

* concrete `CREATE SCHEMA` + `CREATE TABLE` migration files, and
* a short “How to write queries in each module” cheat‑sheet for devs (with example SELECT/INSERT/UPDATE patterns).
@@ -0,0 +1,585 @@

Here’s a tight, practical pattern to make your scanner’s vuln‑DB updates rock‑solid even when feeds hiccup:

# Offline, verifiable update bundles (DSSE + Rekor v2)

**Idea:** distribute DB updates as offline tarballs. Each tarball ships with:

* a **DSSE‑signed** statement (e.g., in‑toto style) over the bundle hash
* a **Rekor v2 receipt** proving the signature/statement was logged
* a small **manifest.json** (version, created_at, content hashes)

**Startup flow (happy path):**

1. Load the latest tarball from your local `updates/` cache.
2. Verify the DSSE signature against your trusted public keys.
3. Verify the Rekor v2 receipt (inclusion proof) matches the DSSE payload hash.
4. If both pass, unpack/activate; record the bundle’s **trust_id** (e.g., statement digest).
5. If anything fails, **keep using the last good bundle**. No service disruption.

**Why this helps**

* **Air‑gap friendly:** no live network needed at activation time.
* **Tamper‑evident:** DSSE + Rekor receipt proves provenance and transparency.
* **Operational stability:** feed outages become non‑events—the scanner just keeps the last good state.

---

## File layout inside each bundle

```
/bundle-2025-11-29/
  manifest.json          # { version, created_at, entries[], sha256s }
  payload.tar.zst        # the actual DB/indices
  payload.tar.zst.sha256
  statement.dsse.json    # DSSE-wrapped statement over payload hash
  rekor-receipt.json     # Rekor v2 inclusion/verification material
```

---

## Acceptance/Activation rules

* **Trust root:** pin one (or more) publisher public keys; rotate via a separate, out‑of‑band process.
* **Monotonicity:** only activate if `manifest.version > current.version` (or if trust policy explicitly allows replay for rollback testing).
* **Atomic switch:** unpack to `db/staging/`, validate, then symlink‑flip to `db/active/` (sketched below).
* **Quarantine on failure:** move bad bundles to `updates/quarantine/` with a reason code.
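A sketch of the symlink flip, assuming a Unix host where rename(2) atomically replaces an existing link and `db/active` is always a symlink, never a real directory:

```csharp
public static class AtomicSwitch
{
    // Point db/active at the freshly validated staging directory.
    // Creating a temp link and renaming it over the old one means readers
    // always see either the old tree or the new one, never a half state.
    public static void Flip(string stagingDir, string activeLink)
    {
        var tmpLink = activeLink + ".tmp";
        File.Delete(tmpLink); // no-op if absent; removes a stale link, not its target

        File.CreateSymbolicLink(tmpLink, Path.GetFullPath(stagingDir));
        File.Move(tmpLink, activeLink, overwrite: true); // rename(2): atomic on POSIX
    }
}
```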

---

## Minimal .NET 10 verifier sketch (C#)

```csharp
public sealed record BundlePaths(string Dir) {
  public string Manifest => Path.Combine(Dir, "manifest.json");
  public string Payload  => Path.Combine(Dir, "payload.tar.zst");
  public string Dsse     => Path.Combine(Dir, "statement.dsse.json");
  public string Receipt  => Path.Combine(Dir, "rekor-receipt.json");
}

// Helper types (Manifest, Hashes, Dsse, RekorV2, TarZstd, DirUtil,
// LocalDbSelfCheck, SymlinkUtil, State, TrustConfig) are sketches, not shipped APIs.
public static class BundleActivator {
  public static async Task<bool> ActivateBundleAsync(BundlePaths b, TrustConfig trust, string activeDir) {
    var manifest = await Manifest.LoadAsync(b.Manifest);
    if (!await Hashes.VerifyAsync(b.Payload, manifest.PayloadSha256)) return false;

    // 1) DSSE verify (publisher keys pinned in trust)
    var (okSig, dssePayloadDigest) = await Dsse.VerifyAsync(b.Dsse, trust.PublisherKeys);
    if (!okSig || dssePayloadDigest != manifest.PayloadSha256) return false;

    // 2) Rekor v2 receipt verify (inclusion + statement digest == dssePayloadDigest)
    if (!await RekorV2.VerifyReceiptAsync(b.Receipt, dssePayloadDigest, trust.RekorPub)) return false;

    // 3) Stage, validate, then atomically flip
    var staging = Path.Combine(activeDir, "..", "staging");
    DirUtil.Empty(staging);
    await TarZstd.ExtractAsync(b.Payload, staging);
    if (!await LocalDbSelfCheck.RunAsync(staging)) return false;

    SymlinkUtil.AtomicSwap(source: staging, target: activeDir);
    State.WriteLastGood(manifest.Version, dssePayloadDigest);
    return true;
  }
}
```

---

## Operational playbook

* **On boot & daily at HH:MM:** try `ActivateBundleAsync()` on the newest bundle; on failure, log and continue.
* **Telemetry (no PII):** reason codes (SIG_FAIL, RECEIPT_FAIL, HASH_MISMATCH, SELFTEST_FAIL), versions, last_good.
* **Keys & rotation:** keep `publisher.pub` and `rekor.pub` in a root‑owned, read‑only path; rotate via a separate signed “trust bundle”.
* **Defense‑in‑depth:** verify both the **payload hash** and each file’s hash listed in `manifest.entries[]` (see the sketch after this list).
* **Rollback:** allow `--force-activate <bundle>` for emergency testing, but mark it as **non‑monotonic** in state.
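The per-file check referenced under defense-in-depth; a sketch assuming `manifest.entries[]` carries `{ path, sha256 }` pairs (the record shape is illustrative):

```csharp
using System.Security.Cryptography;

public sealed record ManifestEntry(string Path, string Sha256);

public static class ManifestCheck
{
    // Every file listed in manifest.entries[] must hash to its recorded value;
    // a missing file is treated the same as tampering.
    public static async Task<bool> VerifyEntriesAsync(string rootDir, IReadOnlyList<ManifestEntry> entries)
    {
        foreach (var entry in entries)
        {
            var fullPath = Path.Combine(rootDir, entry.Path);
            if (!File.Exists(fullPath)) return false;

            await using var stream = File.OpenRead(fullPath);
            var actual = Convert.ToHexString(await SHA256.HashDataAsync(stream));
            if (!actual.Equals(entry.Sha256, StringComparison.OrdinalIgnoreCase)) return false;
        }
        return true;
    }
}
```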

---

## What to hand your release team

* A Make/CI target that:

  1. Builds `payload.tar.zst` and computes hashes
  2. Generates `manifest.json`
  3. Creates and signs the **DSSE statement**
  4. Submits to Rekor (or your mirror) and saves the **v2 receipt**
  5. Packages the bundle folder and publishes it to your offline repo

* A checksum file (`*.sha256sum`) for ops to verify out‑of‑band.

---

If you want, I can turn this into a Stella Ops spec page (`docs/modules/scanner/offline-bundles.md`) plus a small reference implementation (C# library + CLI) that drops right into your Scanner service.

Here’s a “drop‑in” Stella Ops dev guide for **DSSE‑signed Offline Scanner Updates** — written in the same spirit as the existing docs and sprint files.

You can treat this as the seed for `docs/modules/scanner/development/dsse-offline-updates.md` (or similar).

---

# DSSE‑Signed Offline Scanner Updates — Developer Guidelines

> **Audience**
> Scanner, Export Center, Attestor, CLI, and DevOps engineers implementing DSSE‑signed offline vulnerability updates and integrating them into the Offline Update Kit (OUK).
>
> **Context**
>
> * OUK already ships **signed, atomic offline update bundles** with merged vulnerability feeds, container images, and an attested manifest.([git.stella-ops.org][1])
> * DSSE + Rekor is already used for **scan evidence** (SBOM attestations, Rekor proofs).([git.stella-ops.org][2])
> * Sprints 160/162 add **attestation bundles** with manifest, checksums, DSSE signature, and optional transparency log segments, and integrate them into OUK and CLI flows.([git.stella-ops.org][3])

These guidelines tell you how to **wire all of that together** for “offline scanner updates” (feeds, rules, packs) in a way that matches Stella Ops’ determinism + sovereignty promises.

---

## 0. Mental model

At a high level, you’re building this:

```text
Advisory mirrors / Feeds builders
        │
        ▼
ExportCenter.AttestationBundles
  (creates DSSE + Rekor evidence
   for each offline update snapshot)
        │
        ▼
Offline Update Kit (OUK) builder
  (adds feeds + evidence to kit tarball)
        │
        ▼
stella offline kit import / admin CLI
  (verifies Cosign + DSSE + Rekor segments,
   then atomically swaps scanner feeds)
```

Online, Rekor is live; offline, you rely on **bundled Rekor segments / snapshots** and the existing OUK mechanics (import is atomic; old feeds are kept until the new bundle is fully verified).([git.stella-ops.org][1])

---

## 1. Goals & non‑goals

### Goals

1. **Authentic offline snapshots**
   Every offline scanner update (OUK or delta) must be verifiably tied to:

   * a DSSE envelope,
   * a certificate chain rooted in Stella’s Fulcio/KMS profile or a BYO KMS/HSM,
   * *and* a Rekor v2 inclusion proof or bundled log segment.([Stella Ops][4])

2. **Deterministic replay**
   Given:

   * a specific offline update kit (`stella-ops-offline-kit-<DATE>.tgz` + `offline-manifest-<DATE>.json`)([git.stella-ops.org][1])
   * its DSSE attestation bundle + Rekor segments

   every verifier must reach the *same* verdict on integrity and contents — online or fully air‑gapped.

3. **Separation of concerns**

   * Export Center: builds attestation bundles; no business logic about scanning.([git.stella-ops.org][5])
   * Scanner: imports & applies feeds; verifies but does not generate DSSE.
   * Signer / Attestor: own DSSE & Rekor integration.([git.stella-ops.org][2])

4. **Operational safety**

   * Imports remain **atomic and idempotent**.
   * Old feeds stay live until the new update is **fully verified** (Cosign + DSSE + Rekor).([git.stella-ops.org][1])

### Non‑goals

* Designing new crypto or log formats.
* Per‑feed DSSE envelopes (you can add more later, but the minimum contract is **bundle‑level** attestation).

---

## 2. Bundle contract for DSSE‑signed offline updates

You’re extending the existing OUK contract:

* OUK already packs:

  * merged vuln feeds (OSV, GHSA, optional NVD 2.0, CNNVD/CNVD, ENISA, JVN, BDU),
  * container images (`stella-ops`, Zastava, etc.),
  * provenance (Cosign signature, SPDX SBOM, in‑toto SLSA attestation),
  * `offline-manifest.json` + a detached JWS signed during export.([git.stella-ops.org][1])

For **DSSE‑signed offline scanner updates**, add a new logical layer:

### 2.1. Files to ship

Inside each offline kit (full or delta) you must produce:

```text
/attestations/
  offline-update.dsse.json    # DSSE envelope
  offline-update.rekor.json   # Rekor entry + inclusion proof (or segment descriptor)
/manifest/
  offline-manifest.json       # existing manifest
  offline-manifest.json.jws   # existing detached JWS
/feeds/
  ...                         # existing feed payloads
```

The exact paths can be adjusted, but keep:

* **One DSSE bundle per kit** (minimum spec).
* **One canonical Rekor proof file** per DSSE envelope.

### 2.2. DSSE payload contents (minimal)

Define (or reuse) a predicate type such as:

```jsonc
{
  "payloadType": "application/vnd.in-toto+json",
  "payload": "<base64 of the in-toto statement JSON>",
  "signatures": [ /* keyid + sig */ ]
}
```

The decoded payload (an in‑toto statement) should **at minimum** contain:

* **Subject**

  * `name`: `stella-ops-offline-kit-<DATE>.tgz`
  * `digest.sha256`: tarball digest

* **Predicate type** (recommendation)

  * `https://stella-ops.org/attestations/offline-update/1`

* **Predicate fields**

  * `offline_manifest_sha256` – SHA‑256 of `offline-manifest.json`
  * `feeds` – array of feed entries such as `{ name, snapshot_date, archive_digest }` (mirrors the `rules_and_feeds` style used in the moat doc).([Stella Ops][6])
  * `builder` – CI workflow id / git commit / Export Center job id
  * `created_at` – UTC ISO‑8601
  * `oukit_channel` – e.g., `edge`, `stable`, `fips-profile`

**Guideline:** this DSSE payload is the **single canonical description** of “what this offline update snapshot is”.
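A sketch of building that statement in C# before handing it to Signer; the record shapes mirror the fields above, and serialization should go through a canonical JSON step (type names illustrative):

```csharp
using System.Text.Json.Serialization;

public sealed record Subject(
    [property: JsonPropertyName("name")] string Name,
    [property: JsonPropertyName("digest")] Dictionary<string, string> Digest);

public sealed record FeedEntry(
    [property: JsonPropertyName("name")] string Name,
    [property: JsonPropertyName("snapshot_date")] string SnapshotDate,
    [property: JsonPropertyName("archive_digest")] string ArchiveDigest);

public sealed record OfflineUpdatePredicate(
    [property: JsonPropertyName("offline_manifest_sha256")] string OfflineManifestSha256,
    [property: JsonPropertyName("feeds")] IReadOnlyList<FeedEntry> Feeds,
    [property: JsonPropertyName("builder")] string Builder,
    [property: JsonPropertyName("created_at")] string CreatedAt, // UTC ISO-8601
    [property: JsonPropertyName("oukit_channel")] string OukitChannel);

public sealed record Statement(
    [property: JsonPropertyName("_type")] string Type,
    [property: JsonPropertyName("subject")] IReadOnlyList<Subject> Subject,
    [property: JsonPropertyName("predicateType")] string PredicateType,
    [property: JsonPropertyName("predicate")] OfflineUpdatePredicate Predicate);

// Usage: serialize, canonicalize, then base64 into the envelope's "payload".
// var stmt = new Statement(
//     "https://in-toto.io/Statement/v1",
//     new[] { new Subject("stella-ops-offline-kit-2025-11-29.tgz",
//                         new() { ["sha256"] = tarballDigest }) },
//     "https://stella-ops.org/attestations/offline-update/1",
//     predicate);
```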

### 2.3. Rekor material

Attestor must:

* Submit `offline-update.dsse.json` to Rekor v2, obtaining:

  * `uuid`
  * `logIndex`
  * the inclusion proof (`rootHash`, `hashes`, `checkpoint`)

* Serialize that to `offline-update.rekor.json` and store it in object storage + OUK staging, so it ships in the kit.([git.stella-ops.org][2])

For fully offline operation:

* Either:

  * embed a **minimal log segment** containing that entry; or
  * rely on daily Rekor snapshot exports included elsewhere in the kit.([git.stella-ops.org][2])
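Rekor’s inclusion proofs follow the RFC 6962/9162 Merkle-tree scheme, so the offline check reduces to recomputing the root from the leaf and the proof hashes; a self-contained sketch (mapping JSON fields out of `offline-update.rekor.json` is left to the schema work above):

```csharp
using System.Security.Cryptography;

public static class MerkleInclusion
{
    // RFC 9162 §2.1.3.2: fold the proof hashes from the leaf up to the root.
    public static bool Verify(long leafIndex, long treeSize,
                              byte[] leafHash, byte[][] proof, byte[] rootHash)
    {
        if (leafIndex < 0 || leafIndex >= treeSize) return false;

        long fn = leafIndex, sn = treeSize - 1;
        var r = leafHash;
        foreach (var p in proof)
        {
            if (sn == 0) return false;
            if ((fn & 1) == 1 || fn == sn)
            {
                r = NodeHash(p, r);
                if ((fn & 1) == 0)
                    while ((fn & 1) == 0 && fn != 0) { fn >>= 1; sn >>= 1; }
            }
            else
            {
                r = NodeHash(r, p);
            }
            fn >>= 1; sn >>= 1;
        }
        return sn == 0 && r.AsSpan().SequenceEqual(rootHash);
    }

    // Leaf hash = SHA256(0x00 || entry); interior node = SHA256(0x01 || left || right).
    public static byte[] LeafHash(byte[] entry) => HashWithPrefix(0x00, entry, Array.Empty<byte>());
    private static byte[] NodeHash(byte[] left, byte[] right) => HashWithPrefix(0x01, left, right);

    private static byte[] HashWithPrefix(byte prefix, byte[] a, byte[] b)
    {
        var buf = new byte[1 + a.Length + b.Length];
        buf[0] = prefix;
        a.CopyTo(buf, 1);
        b.CopyTo(buf, 1 + a.Length);
        return SHA256.HashData(buf);
    }
}
```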

---

## 3. Implementation by module

### 3.1 Export Center — attestation bundles

**Working directory:** `src/ExportCenter/StellaOps.ExportCenter.AttestationBundles`([git.stella-ops.org][7])

**Responsibilities**

1. **Compose attestation bundle job** (EXPORT‑ATTEST‑74‑001)

   * Input: a snapshot identifier (e.g., an offline kit build id or feed snapshot date).
   * Read manifest and feed metadata from the Export Center’s storage.([git.stella-ops.org][5])
   * Generate the DSSE payload structure described above.
   * Call `StellaOps.Signer` to wrap it in a DSSE envelope.
   * Call `StellaOps.Attestor` to submit the DSSE → Rekor and get the inclusion proof.([git.stella-ops.org][2])
   * Persist:

     * `offline-update.dsse.json`
     * `offline-update.rekor.json`
     * any log segment artifacts.

2. **Integrate into offline kit packaging** (EXPORT‑ATTEST‑74‑002 / 75‑001)

   * The OUK builder (the Python script `ops/offline-kit/build_offline_kit.py`) already assembles artifacts & manifests.([Stella Ops][8])
   * Extend that pipeline (or add an Export Center step) to:

     * fetch the attestation bundle for the snapshot,
     * place it under `/attestations/` in the kit staging dir,
     * ensure `offline-manifest.json` contains entries for the DSSE and Rekor files (name, sha256, size, capturedAt).([git.stella-ops.org][1])

3. **Contracts & schemas**

   * Define a small JSON schema for `offline-update.rekor.json` (UUID, index, proof fields) and check it into `docs/11_DATA_SCHEMAS.md` or module‑local schemas.
   * Keep all new payload schemas **versioned**; avoid “shape drift”.

**Do / Don’t**

* ✅ **Do** treat the attestation bundle job as *pure aggregation* (AOC guardrail: no modification of evidence).([git.stella-ops.org][5])
* ✅ **Do** rely on Signer + Attestor; don’t hand‑roll DSSE/Rekor logic in Export Center.([git.stella-ops.org][2])
* ❌ **Don’t** reach out to external networks from this job — it must run with the same offline‑ready posture as the rest of the platform.

---

### 3.2 Offline Update Kit builder

**Working area:** `ops/offline-kit/*` + `docs/24_OFFLINE_KIT.md`([git.stella-ops.org][1])

Guidelines:

1. **Preserve current guarantees**

   * Imports must remain **idempotent and atomic**, with **old feeds kept until the new bundle is fully verified**. This now includes DSSE/Rekor checks in addition to Cosign + JWS.([git.stella-ops.org][1])

2. **Staging layout**

   * When staging a kit, ensure the tree looks like:

```text
out/offline-kit/staging/
  feeds/...
  images/...
  manifest/offline-manifest.json
  attestations/offline-update.dsse.json
  attestations/offline-update.rekor.json
```

   * Update `offline-manifest.json` so each new file appears with:

     * `name`, `sha256`, `size`, `capturedAt`.([git.stella-ops.org][1])

3. **Deterministic ordering** (see the sketch after this list)

   * File lists in manifests must be in a stable order (e.g., lexical paths).
   * Timestamps = UTC ISO‑8601 only; never use local time. (Matches the determinism guidance in AGENTS.md + the policy/runs docs.)([git.stella-ops.org][9])

4. **Delta kits**

   * For deltas (`stella-ouk-YYYY-MM-DD.delta.tgz`), the DSSE should still cover:

     * the delta tarball digest,
     * the **logical state** (feeds & versions) after applying the delta.

   * Don’t shortcut by “attesting only the diff files” — the predicate must describe the resulting snapshot.
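The ordering sketch referenced in item 3: an ordinal (culture-independent) path sort plus fixed-format UTC timestamps keeps the manifest byte-stable (the entry shape is illustrative):

```csharp
public sealed record KitFile(string Path, string Sha256, long Size, string CapturedAt);

public static class ManifestOrdering
{
    // The same staging tree always serializes to the same manifest bytes.
    public static IReadOnlyList<KitFile> BuildEntries(string stagingDir, DateTimeOffset capturedAt)
    {
        var captured = capturedAt.ToUniversalTime().ToString("yyyy-MM-dd'T'HH:mm:ss'Z'");
        return Directory.EnumerateFiles(stagingDir, "*", SearchOption.AllDirectories)
            .Select(f => Path.GetRelativePath(stagingDir, f).Replace('\\', '/'))
            .OrderBy(p => p, StringComparer.Ordinal) // lexical, locale-independent
            .Select(p => new KitFile(
                p,
                HashFile(Path.Combine(stagingDir, p)),
                new FileInfo(Path.Combine(stagingDir, p)).Length,
                captured))
            .ToList();
    }

    private static string HashFile(string path)
    {
        using var stream = File.OpenRead(path);
        return Convert.ToHexString(System.Security.Cryptography.SHA256.HashData(stream)).ToLowerInvariant();
    }
}
```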

---

### 3.3 Scanner — import & activation

**Working directory:** `src/Scanner/StellaOps.Scanner.WebService`, `StellaOps.Scanner.Worker`([git.stella-ops.org][9])

Scanner already exposes admin flows for:

* **Offline kit import**, which:

  * validates the Cosign signature of the kit,
  * uses the attested manifest,
  * keeps old feeds until verification is done.([git.stella-ops.org][1])

Add DSSE/Rekor awareness as follows:

1. **Verification sequence (happy path)**

   On `import-offline-usage-kit`:

   1. Validate the **Cosign** signature of the tarball.
   2. Validate `offline-manifest.json` with its JWS signature.
   3. Verify **file digests** for all entries (including `/attestations/*`).
   4. Verify **DSSE**:

      * Call `StellaOps.Attestor.Verify` (or the CLI equivalent) with:

        * `offline-update.dsse.json`
        * `offline-update.rekor.json`
        * the local Rekor log snapshot / segment (if configured)([git.stella-ops.org][2])

      * Ensure the payload digest matches the kit tarball + manifest digests.

   5. Only after all checks pass:

      * swap Scanner’s feed pointer to the new snapshot,
      * emit an audit event noting:

        * kit filename, tarball digest,
        * DSSE statement digest,
        * Rekor UUID + log index.

2. **Config surface**

   Add config keys (names illustrative):

```yaml
scanner:
  offlineKit:
    requireDsse: true        # fail import if DSSE/Rekor verification fails
    rekorOfflineMode: true   # use local snapshots only
    attestationVerifier: https://attestor.internal
```

   * Mirror them via ASP.NET Core config + env vars (`SCANNER__OFFLINEKIT__REQUIREDSSE`, etc.), following the same pattern as the DSSE/Rekor operator guide.([git.stella-ops.org][2])

3. **Failure behaviour**

   * **DSSE/Rekor fail, Cosign + manifest OK**

     * Keep old feeds active.
     * Mark the import as failed; surface a `ProblemDetails` error via API/UI.
     * Log structured fields: `rekorUuid`, `attestationDigest`, `offlineKitHash`, `failureReason`.([git.stella-ops.org][2])

   * **Config flag to soften during rollout**

     * When `requireDsse=false`, treat DSSE/Rekor failure as a warning and still allow the import (for the initial observation phase), but emit alerts. This mirrors the “observe → enforce” pattern in the DSSE/Rekor operator guide.([git.stella-ops.org][2])
---
|
||||||
|
|
||||||
|
### 3.4 Signer & Attestor

You mostly **reuse** existing guidance:([git.stella-ops.org][2])

* Add a new predicate type & schema for offline updates in Signer.
* Ensure Attestor:

  * can submit offline‑update DSSE envelopes to Rekor,
  * exposes verification routines (used by CLI and Scanner) that:

    * verify the DSSE signature,
    * check the certificate chain against the configured root pack (FIPS/eIDAS/GOST/SM, etc.),([Stella Ops][4])
    * verify Rekor inclusion using either the live log or a local snapshot.

* For fully air‑gapped installs:

  * rely on Rekor **snapshots mirrored** into the Offline Kit (already recommended in the operator guide’s offline section).([git.stella-ops.org][2])

---
### 3.5 CLI & UI

Extend the CLI with explicit verbs (matching the EXPORT‑ATTEST sprints):([git.stella-ops.org][10])

* `stella attest bundle verify --bundle path/to/offline-kit.tgz --rekor-key rekor.pub`
* `stella attest bundle import --bundle ...` (for sites that prefer a two‑step “verify then import” flow)
* Wire the UI Admin → Offline Kit screen so that:

  * verification status shows both **Cosign/JWS** and **DSSE/Rekor** state,
  * policy banners display kit generation time, manifest hash, and DSSE/Rekor freshness.([Stella Ops][11])

---
## 4. Determinism & offline‑safety rules

When touching any of this code, keep these rules front‑of‑mind (they align with the policy DSL and architecture docs):([Stella Ops][4])

1. **No hidden network dependencies**

   * All verification **must work offline** given the kit + Rekor snapshots.
   * Any fallback to live Rekor / Fulcio endpoints must be explicitly toggled and never on by default in “offline mode”.

2. **Stable serialization**

   * DSSE payload JSON:

     * stable ordering of fields,
     * no locale‑ or precision‑dependent number formatting,
     * UTC timestamps.

3. **Replayable imports**

   * Running `import-offline-usage-kit` twice with the same bundle must be a no‑op after the first time.
   * The DSSE payload for a given snapshot must not change over time; if it does, bump the predicate or snapshot version.

4. **Explainability**

   * When verification fails, errors must explain **what** mismatched (kit digest, manifest digest, DSSE envelope hash, Rekor inclusion) so auditors can reason about it.
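A minimal sketch of rule 2 with `System.Text.Json` (the helper name is illustrative): recursively sort object keys and emit compact UTF‑8 before signing.

```csharp
using System.Linq;
using System.Text;
using System.Text.Json.Nodes;

public static class CanonicalJson
{
    // Returns canonical bytes: ordinally sorted keys, no insignificant whitespace.
    public static byte[] Serialize(JsonNode node) =>
        Encoding.UTF8.GetBytes(Canonicalize(node)!.ToJsonString());

    private static JsonNode? Canonicalize(JsonNode? node) => node switch
    {
        JsonObject obj => new JsonObject(
            obj.OrderBy(p => p.Key, StringComparer.Ordinal)
               .Select(p => new KeyValuePair<string, JsonNode?>(p.Key, Canonicalize(p.Value)))),
        JsonArray arr => new JsonArray(arr.Select(Canonicalize).ToArray()),
        _ => node?.DeepClone(), // leaf values are cloned so they can be re-parented
    };
}
```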
---

## 5. Testing & CI expectations

Tie this into the existing CI workflows (`scanner-determinism.yml`, `attestation-bundle.yml`, the `offline-kit` pipelines, etc.):([git.stella-ops.org][12])

### 5.1 Unit & integration tests

Write tests that cover:

1. **Happy paths**

   * Full kit import with valid:

     * Cosign,
     * manifest JWS,
     * DSSE,
     * Rekor proof (online and offline modes).

2. **Corruption scenarios**

   * Tampered feed file (hash mismatch).
   * Tampered `offline-manifest.json`.
   * Tampered DSSE payload (signature fails).
   * Mismatched Rekor entry (payload digest doesn’t match the DSSE).

3. **Offline scenarios**

   * No network access, only a Rekor snapshot:

     * DSSE verification still passes,
     * the Rekor proof validates against the local tree head.

4. **Roll‑back logic**

   * Import fails at the DSSE/Rekor step:

     * the scanner DB still points at the previous feeds,
     * metrics/logs show the failure and no partial state.
### 5.2 SLOs & observability

Reuse the metrics suggested by the DSSE/Rekor guide and adapt them to OUK imports:([git.stella-ops.org][2])

* `offlinekit_import_total{status="success|failed_dsse|failed_rekor|failed_cosign"}`
* `offlinekit_attestation_verify_latency_seconds` (histogram)
* `attestor_rekor_success_total`, `attestor_rekor_retry_total`, `rekor_inclusion_latency`
* Dashboards: kit versions per environment, time since last kit, DSSE/Rekor health.
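A sketch of emitting these with `System.Diagnostics.Metrics` (the meter name is an assumption; align it with your OpenTelemetry setup):

```csharp
using System.Diagnostics.Metrics;

public static class OfflineKitMetrics
{
    private static readonly Meter Meter = new("StellaOps.Scanner.OfflineKit");

    private static readonly Counter<long> Imports =
        Meter.CreateCounter<long>("offlinekit_import_total");

    private static readonly Histogram<double> VerifyLatency =
        Meter.CreateHistogram<double>("offlinekit_attestation_verify_latency_seconds");

    // status: "success" | "failed_dsse" | "failed_rekor" | "failed_cosign"
    public static void RecordImport(string status) =>
        Imports.Add(1, new KeyValuePair<string, object?>("status", status));

    public static void RecordVerifyLatency(TimeSpan elapsed) =>
        VerifyLatency.Record(elapsed.TotalSeconds);
}
```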
---
## 6. Developer checklist (TL;DR)

When you pick up a task touching DSSE‑signed offline updates:

1. **Read the background**

   * `docs/modules/scanner/operations/dsse-rekor-operator-guide.md`([git.stella-ops.org][2])
   * `docs/24_OFFLINE_KIT.md` (and the public offline kit guide).([git.stella-ops.org][1])
   * The relevant sprint file (`SPRINT_160_export_evidence`, `SPRINT_162_exportcenter_i`, etc.).([git.stella-ops.org][10])

2. **Implement**

   * Generate DSSE payloads in Export Center only.
   * Call Signer & Attestor; persist the DSSE + Rekor JSON next to the manifests.
   * Extend the OUK builder to include the attestation bundle and list it in `offline-manifest.json`.
   * Update the Scanner import flow to verify DSSE/Rekor before swapping feeds.

3. **Test**

   * Unit tests for bundle composition & schema.
   * Integration tests for import + rollback.
   * Determinism tests (same inputs → same DSSE payload).

4. **Wire telemetry**

   * Counters + latency histograms.
   * Logs with `offlineKitHash`, `attestationDigest`, `rekorUuid`.

5. **Document**

   * Update `docs/modules/export-center/architecture.md`, `docs/modules/scanner/architecture.md`, and the OUK docs where flows or contracts changed.([git.stella-ops.org][5])

---

If you tell me which module you’re actually coding in next (Scanner, Export Center, CLI, or Attestor), I can turn this into a very concrete “AGENTS.md‑style” section with exact file paths, class names, and a starter test layout for that module.
[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7bf40b8589c94078e8eadb240553c02f097a5127/docs/24_OFFLINE_KIT.md "24_OFFLINE_KIT.md - git.stella-ops.org"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/13e4b53dda1575ba46c6188c794fd465ec6fdeec/docs/modules/scanner/operations/dsse-rekor-operator-guide.md "dsse-rekor-operator-guide.md - git.stella-ops.org"
[3]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/raw/commit/61f963fd52cd4d6bb2f86afc5a82eac04c04b00e/docs/implplan/SPRINT_162_exportcenter_i.md "SPRINT_162_exportcenter_i.md - git.stella-ops.org"
[4]: https://stella-ops.org/docs/07_high_level_architecture/index.html "High-level architecture - Stella Ops"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d870da18ce194c6a5f1a6d71abea36205d9fb276/docs/export-center/architecture.md "Export Center Architecture - Stella Ops"
[6]: https://stella-ops.org/docs/moat/ "Moat - Stella Ops"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/79b8e53441e92dbc63684f42072434d40b80275f/src/ExportCenter "src/ExportCenter - git.stella-ops.org"
[8]: https://stella-ops.org/docs/24_offline_kit/ "Offline Update Kit (OUK) - Air-Gap Bundle - Stella Ops"
[9]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7768555f2d107326050cc5ff7f5cb81b82b7ce5f/AGENTS.md "AGENTS.md - git.stella-ops.org"
[10]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/66cb6c4b8af58a33efa1521b7953dda834431497/docs/implplan/SPRINT_160_export_evidence.md "SPRINT_160_export_evidence.md - git.stella-ops.org"
[11]: https://stella-ops.org/about/ "Signed Reachability · Deterministic Replay · Sovereign Crypto - Stella Ops"
[12]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/actions/?actor=0&status=0&workflow=sdk-publish.yml "Actions - git.stella-ops.org"
@@ -0,0 +1,425 @@
Here’s a simple metric that will make your security UI (and teams) radically better: **Time‑to‑Evidence (TTE)** — the time from opening a finding to seeing *raw proof* (a data‑flow edge, an SBOM line, or a VEX note), not a summary.

---

### What it is

* **Definition:** TTE = `t_first_proof_rendered − t_open_finding`.
* **Proof =** the exact artifact or path that justifies the claim (e.g., `package-lock.json: line 214 → openssl@1.1.1`, `reachability: A → B → C sink`, or `VEX: not_affected due to unreachable code`).
* **Target:** **P95 ≤ 15s** (stretch: P99 ≤ 30s). If 95% of findings show proof within 15 seconds, the UI stays honest: evidence before opinion, low noise, fast explainability.

---

### Why it matters

* **Trust:** people accept decisions they can *verify* quickly.
* **Triage speed:** proof‑first UIs cut back‑and‑forth and guesswork.
* **Noise control:** if you can’t surface proof fast, you probably shouldn’t surface the finding yet.

---
### How to measure (engineering‑ready)

* Emit two stamps per finding view:

  * `t_open_finding` (on route enter or modal open).
  * `t_first_proof_rendered` (first DOM paint of the SBOM line / path list / VEX clause).

* Store as `tte_ms` in a lightweight events table (Postgres) with tags: `tenant`, `finding_id`, `proof_kind` (`sbom|reachability|vex`), `source` (`local|remote|cache`).
* Nightly rollup: compute P50/P90/P95/P99 by `proof_kind` and page.
* Alert when **P95 > 15s** for 15 minutes.

---
### UI contract (keeps the UX honest)

* **Above the fold:** always show a compact **Proof panel** first (not hidden behind tabs).
* **Skeletons over spinners:** reserve space; render partial proof as soon as any piece is ready.
* **Plain‑text copy affordance:** a “Copy SBOM line / path” button right next to the proof.
* **Defer non‑proof widgets:** CVSS badges, remediation prose, and charts load *after* the proof.
* **Empty‑state truth:** if no proof exists, say “No proof available yet” and show the loader for *that* proof type only (don’t pretend with summaries).

---

### Backend rules of thumb

* **Pre‑index for first paint:** cache the top N proof items per hot finding (e.g., first SBOM hit + shortest path).
* **Bound queries:** proof queries must be *O(log n)* on indexed columns (package name@version, file hash, graph node ID).
* **Chunked streaming:** send the first proof chunk < 200 ms after the backend hit; don’t hold for the full set.
* **Timeout budget:** a 12s backend budget + 3s UI/render margin = the 15s P95.

---
### Minimal contract to add in your code

```ts
// Frontend: fire on open
metrics.emit('finding_open', { findingId, t: performance.now() });

// When the first real proof node/line hits the DOM:
metrics.emit('proof_rendered', { findingId, proofKind, t: performance.now() });
```

```sql
-- Rollup (hourly)
SELECT
  proof_kind,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY tte_ms) AS p95_ms
FROM tte_events
WHERE ts >= now() - interval '1 hour'
GROUP BY proof_kind;
```

---
### What to put on the team dashboard

* **TTE P95 by page** (findings list, finding details).
* **TTE P95 by proof_kind** (sbom / reachability / vex).
* **Error‑budget burn:** minutes over target per day.
* **Top regressions:** last 7 days vs the prior 7.

---

### Acceptance checklist for any finding view

* [ ] First paint shows a real proof snippet (not a summary).
* [ ] The “Copy proof” button works within 1 click.
* [ ] TTE P95 in staging ≤ 10s; in prod ≤ 15s.
* [ ] If proof is missing, an explicit empty state + retry path.
* [ ] Telemetry sampled in ≥ 50% of sessions (or 100% for internal).

---
If you want, I can turn this into a **`docs/UX/tte.md`** spec plus a **Grafana panel JSON** and a tiny **.NET middleware** to emit the metrics—just say the word and I’ll drop in ready‑to‑use snippets.

Perfect, got it — let’s turn **Time‑to‑Evidence (TTE)** into a proper set of *“how we track UX health”* guidelines.

I’ll treat this like something you can drop into an internal engineering/UX doc.

---
## 1. What exactly is Time‑to‑Evidence?

**Definition**

> **TTE = t_first_proof_rendered − t_open_finding**

* **t_open_finding** – when the user first opens a “finding” / detail view (e.g., vulnerability, alert, ticket, log event).
* **t_first_proof_rendered** – when the UI first paints **actual evidence** that backs the finding, for example:

  * the SBOM row showing `package@version`,
  * the call‑graph/data‑flow path to a sink,
  * a VEX note explaining why something is (not) affected,
  * a raw log snippet that the alert is based on.

**Key principle:**
TTE measures **how long users have to trust you blindly** before they can see proof with their own eyes.

---
## 2. UX health goals & targets

Treat TTE like latency SLOs:

* **Primary SLO:**

  * **P95 TTE ≤ 15s** for all findings in normal conditions.

* **Stretch SLO:**

  * **P99 TTE ≤ 30s** for heavy cases (big graphs, huge SBOMs, cold caches).

* **Guardrail:**

  * P50 TTE should be **< 3s**. If the median creeps up, you’re in trouble even if P95 looks OK.
You can refine by feature:

* “Simple” proof (single SBOM row, small payload):

  * P95 ≤ 5s.

* “Complex” proof (reachability graph, cross‑repo joins):

  * P95 ≤ 15s.

**UX rule of thumb**

* < 2s: feels instant.
* 2–10s: acceptable if clearly loading something heavy.
* \> 10s: needs **strong** feedback (progress, partial results, explanations).
* \> 30s: the system should probably **offer a fallback** (e.g., “download raw evidence” or “retry”).

---
## 3. Instrumentation guidelines

### 3.1 Event model

Emit two core events per finding view:

1. **`finding_open`**

   * When the user opens the finding details (route enter / modal open).
   * Must include:

     * `finding_id`
     * `tenant_id` / `org_id`
     * `user_role` (admin, dev, triager, etc.)
     * `entry_point` (list, search, notification, deep link)
     * `ui_version` / `build_sha`
2. **`proof_rendered`**

   * The first time *any* qualifying proof element is painted.
   * Must include:

     * `finding_id`
     * `proof_kind` (`sbom | reachability | vex | logs | other`)
     * `source` (`local_cache | backend_api | 3rd_party`)
     * `proof_height` (e.g., pixel offset from the top) – to ensure it’s actually above the fold or very close.

**Derived metric**

Your telemetry pipeline should compute:

```text
tte_ms = proof_rendered.timestamp - finding_open.timestamp
```

If there are multiple `proof_rendered` events for the same `finding_open`, use:

* **TTE (first proof)** – the minimum timestamp; the primary SLO.
* Optionally: **TTE (full evidence)** – the last proof in a defined “bundle” (e.g., path + SBOM row).
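A minimal backend correlation sketch in C# (the event shapes are assumed from the fields above; this is pipeline pseudocode, not a prescribed design):

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record TteSample(string FindingId, string ProofKind, double TteMs);

public static class TtePipeline
{
    // Join the two event streams per finding view: earliest open,
    // earliest qualifying proof => "TTE (first proof)".
    public static IEnumerable<TteSample> Compute(
        IEnumerable<(string FindingId, double T)> opens,
        IEnumerable<(string FindingId, string ProofKind, double T)> proofs)
    {
        var openAt = opens
            .GroupBy(e => e.FindingId)
            .ToDictionary(g => g.Key, g => g.Min(e => e.T)); // dedupe repeated opens

        return proofs
            .GroupBy(e => e.FindingId)
            .Select(g => g.OrderBy(e => e.T).First())        // first proof wins
            .Where(p => openAt.ContainsKey(p.FindingId))
            .Select(p => new TteSample(p.FindingId, p.ProofKind, p.T - openAt[p.FindingId]));
    }
}
```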
### 3.2 Implementation notes

**Frontend**

* Emit `finding_open` as soon as:

  * the route is confirmed, and
  * you know which `finding_id` is being displayed.

* Emit `proof_rendered`:

  * **not** when you *fetch* data, but when at least one evidence component is **visibly rendered**.
  * Easiest approach: hook into the component lifecycle / an intersection observer on the evidence container.

Pseudo‑example:
```ts
// On route/mount:
metrics.emit('finding_open', {
  findingId,
  entryPoint,
  userRole,
  uiVersion,
  t: performance.now()
});

// In the EvidencePanel component, after the first render with real data:
if (!hasEmittedProof && hasRealEvidence) {
  metrics.emit('proof_rendered', {
    findingId,
    proofKind: 'sbom',
    source: 'backend_api',
    t: performance.now()
  });
  hasEmittedProof = true;
}
```
**Backend**

* No special requirement beyond:

  * stable IDs (`finding_id`),
  * knowing which API endpoints respond with evidence payloads — you’ll want to correlate backend latency with TTE later.

---
## 4. Data quality & sampling

If you want TTE to drive decisions, the data must be boringly reliable.

**Guidelines**

1. **Sample rate**

   * Start with **100%** in staging.
   * In production, aim for **≥ 25% of sessions** for TTE events at minimum; 100% is ideal if volume is reasonable.

2. **Clock skew**

   * Prefer **frontend timestamps** using `performance.now()` for TTE; they’re monotonic within a tab.
   * Don’t mix backend clocks into the TTE calculation.

3. **Bot / synthetic traffic**

   * Tag synthetic tests (`is_synthetic = true`) and exclude them from UX health dashboards.

4. **Retry behavior**

   * If the proof fails to load and the user hits “retry”:

     * treat it as a separate measurement (`retry = true`), or
     * log an additional `proof_error` event with an error class (timeout, 5xx, network, parse, etc.).

---
## 5. Dashboards: how to watch TTE

You want a small, opinionated set of views that answer:

> “Is UX getting better or worse for people trying to understand findings?”

### 5.1 Core widgets

1. **TTE distribution**

   * P50 / P90 / P95 / P99 per day (or per release).
   * Split by `proof_kind`.

2. **TTE by page / surface**

   * Finding list → detail.
   * Deep links from notifications.
   * Direct URLs / bookmarks.

3. **TTE by user segment**

   * New users vs power users.
   * Different roles (security engineer vs application dev).

4. **Error budget panel**

   * “Minutes over SLO per day” – e.g., the sum of all user‑minutes where TTE > 15s.
   * Use this to prioritize work.

5. **Correlation with engagement**

   * Scatter: TTE vs session length, or TTE vs “user clicked ‘ignore’ / ‘snooze’”.
   * Aim to confirm the obvious: **long TTE → worse engagement/completion**.

### 5.2 Operational details

* Update granularity: **real‑time or ≤ 15 min** for on‑call/ops panels.
* Retention: at least **90 days** to see trends across big releases.
* Breakdowns:

  * `backend_region` (to catch regional issues).
  * `build_version` (to spot regressions quickly).

---
## 6. UX & engineering design rules anchored in TTE

These are the **behavior rules** for the product that keep TTE healthy.

### 6.1 “Evidence first” layout rules

* **Evidence above the fold**

  * At least *one* proof element must be visible **without scrolling** on a typical laptop viewport.

* **Summary second**

  * CVSS scores, severity badges, long descriptions: all secondary. Evidence should come *before* opinion.

* **No fake proof**

  * Don’t use placeholders that *look* like evidence but aren’t (e.g., an “example path” or generic text).
  * If evidence is still loading, show a clear skeleton/loader with “Loading evidence…”.

### 6.2 Loading strategy rules

* Start fetching evidence **as soon as navigation begins**, not after the page is fully mounted.
* Use **lazy loading** for non‑critical widgets until after the proof is shown.
* If a call is known to be heavy:

  * consider **precomputing** and caching the top evidence (shortest path, first SBOM hit),
  * stream results: render the first proof item as soon as it arrives; don’t wait for the full list.

### 6.3 Empty / error state rules

* If there is genuinely no evidence:

  * explicitly say **“No supporting evidence available yet”** and treat TTE as:

    * either “no value” (excluded), or
    * a special bucket `proof_kind = "none"`.

* If loading fails:

  * show a clear error and a **retry** that re‑emits `proof_rendered` when successful.
  * Log `proof_error` with a reason; track the error rate alongside TTE.

---
## 7. How to *use* TTE in practice

### 7.1 For releases

For any change that affects the findings UI or evidence plumbing:

* Add a release checklist item:

  * “No regression on TTE P95 for [pages X, Y].”

* During rollout:

  * Compare **pre‑ vs post‑release** TTE P95 by `ui_version`.
  * If the regression is > 20%:

    * roll back, or
    * add a follow‑up ticket explicitly tagged with the regression.

### 7.2 For experiments / A/B tests

When running UI experiments around findings:

* Always capture TTE per variant.
* Compare:

  * TTE P50/P95.
  * Task completion rate (e.g., “user changed status”).
  * Subjective UX (CSAT) if you have it.

You’re looking for patterns like:

* Variant B: **+5% completion**, **+8% TTE** → maybe OK.
* Variant C: **+2% completion**, **+70% TTE** → probably not acceptable.

### 7.3 For prioritization

Use TTE as a lever in planning:

* If P95 TTE is healthy and stable:

  * more room for new features / experiments.

* If P95 TTE is trending up for 2+ weeks:

  * time to schedule a “TTE debt” story: caching, query optimization, UI re‑layout, etc.

---
## 8. Quick “TTE‑ready” checklist

You’re “tracking UX health with TTE” if you can honestly tick these:

1. **Instrumentation**

   * [ ] `finding_open` + `proof_rendered` events exist and are correlated.
   * [ ] TTE is computed in a stable pipeline (joins, dedupe, etc.).

2. **Targets**

   * [ ] TTE SLOs are defined (P95, P99) and agreed by UX + engineering.

3. **Dashboards**

   * [ ] A dashboard shows TTE by proof kind, page, and release.
   * [ ] On‑call / ops can see TTE in near real time.

4. **UX rules**

   * [ ] Evidence is visible above the fold for all main finding types.
   * [ ] Non‑critical widgets load after evidence.
   * [ ] Empty/error states are explicit about evidence availability.

5. **Process**

   * [ ] Major UI changes check TTE pre vs post as part of release acceptance.
   * [ ] Regressions in TTE create real tickets, not just “we’ll watch it”.

---

If you tell me what stack you’re on (e.g., React + Next.js + OpenTelemetry + X observability tool), I can turn this into concrete code snippets and an example dashboard spec (fields, queries, charts) tailored exactly to your setup.
@@ -0,0 +1,576 @@
Here’s a tight, practical blueprint to turn your SBOM→VEX links into an auditable “proof spine” — using signed DSSE statements and a per‑dependency trust anchor — so every VEX verdict can be traced, verified, and replayed.

# What this gives you

* A **chain of evidence** from each SBOM entry → analysis → VEX verdict.
* **Tamper‑evident**, DSSE‑signed records (offline‑friendly).
* **Deterministic replay**: same inputs → same verdicts (great for audits/regulators).

# Core objects (canonical IDs)

* **ArtifactID**: digest of the package/container (e.g., `sha256:…`).
* **SBOMEntryID**: stable ID for a component in an SBOM (`sbomDigest:package@version[:purl]`).
* **EvidenceID**: hash of the raw evidence (scanner JSON, reachability, exploit intel).
* **ReasoningID**: hash of the normalized reasoning (rules/lattice inputs used).
* **VEXVerdictID**: hash of the final VEX statement body.
* **ProofBundleID**: Merkle root of {SBOMEntryID, EvidenceID[], ReasoningID, VEXVerdictID}.
* **TrustAnchorID**: per‑dependency anchor (public key + policy) used to validate the above.
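A sketch of these IDs as strongly typed values (the hashing helper is an assumption; the canonicalization rules appear later in this doc):

```csharp
using System;
using System.Security.Cryptography;

// Content-addressed IDs from the list above, each a hex SHA-256 of canonical bytes.
public readonly record struct SbomEntryId(string Value);      // sbomDigest:package@version[:purl]
public readonly record struct EvidenceId(string Hash);
public readonly record struct ReasoningId(string Hash);
public readonly record struct VexVerdictId(string Hash);
public readonly record struct ProofBundleId(string MerkleRoot);
public readonly record struct TrustAnchorId(string Value);

public static class ContentHash
{
    public static string Sha256Hex(ReadOnlySpan<byte> canonicalBytes) =>
        Convert.ToHexString(SHA256.HashData(canonicalBytes)).ToLowerInvariant();
}
```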
# Signed DSSE envelopes you’ll produce

1. **Evidence Statement** (per evidence item)

   * `subject`: SBOMEntryID
   * `predicateType`: `evidence.stella/v1`
   * `predicate`: source, tool version, timestamps, EvidenceID
   * **Signers**: scanner/ingestor key

2. **Reasoning Statement**

   * `subject`: SBOMEntryID
   * `predicateType`: `reasoning.stella/v1` (your lattice/policy inputs + ReasoningID)
   * **Signers**: “Policy/Lattice Engine” key (Authority)

3. **VEX Verdict Statement**

   * `subject`: SBOMEntryID
   * `predicateType`: CycloneDX or CSAF VEX body + VEXVerdictID
   * **Signers**: VEXer key (or the vendor key if you have it)

4. **Proof Spine Statement** (the spine itself)

   * `subject`: SBOMEntryID
   * `predicateType`: `proofspine.stella/v1`
   * `predicate`: EvidenceID[], ReasoningID, VEXVerdictID, ProofBundleID
   * **Signers**: Authority key
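For concreteness, a sketch of the spine statement’s payload shape; it follows the in‑toto statement layout, but the exact POCOs here are assumptions, not a published schema:

```csharp
using System.Collections.Generic;

public sealed record InTotoStatement(
    string Type,                        // "https://in-toto.io/Statement/v1"
    IReadOnlyList<Subject> Subject,     // carries the SBOMEntryID
    string PredicateType,               // "proofspine.stella/v1"
    ProofSpinePredicate Predicate);

public sealed record Subject(string Name, IReadOnlyDictionary<string, string> Digest);

public sealed record ProofSpinePredicate(
    IReadOnlyList<string> EvidenceIds,  // sorted
    string ReasoningId,
    string VexVerdictId,
    string ProofBundleId);
```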
# Trust model (per‑dependency anchor)

* **TrustAnchor** (per package/purl): { TrustAnchorID, allowed signers (KMS refs, public keys), accepted predicateTypes, policy version, revocation list }.
* Store anchors in **Authority** and pin them in your graph by SBOMEntryID → TrustAnchorID.
* Optional: a PQC mode (Dilithium/Falcon) for long‑term archives.
# Verification pipeline (deterministic)

1. Resolve SBOMEntryID → TrustAnchorID.
2. Verify every DSSE envelope’s signature **against the anchor’s allowed keys**.
3. Recompute EvidenceID/ReasoningID/VEXVerdictID from the raw content; compare hashes.
4. Recompute ProofBundleID (the Merkle root) and compare it to the spine.
5. Emit a **Receipt**: {ProofBundleID, verification log, tool digests}. Cache it.
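A condensed sketch of steps 1–5 (the store and crypto abstractions are hypothetical; swap in your real services):

```csharp
public async Task<Receipt> VerifyAsync(SbomEntryId entry, CancellationToken ct)
{
    var anchor = await _anchors.ResolveAsync(entry, ct);                  // 1. entry -> anchor

    foreach (var envelope in await _envelopes.ForEntryAsync(entry, ct))  // 2. signatures
        if (!Dsse.Verify(envelope, anchor.AllowedKeys))
            return Receipt.Fail(entry, $"bad signature: {envelope.PredicateType}");

    var recomputed = await _hasher.RecomputeIdsAsync(entry, ct);         // 3. re-hash raw content
    if (!recomputed.Matches(await _spines.StoredIdsAsync(entry, ct)))
        return Receipt.Fail(entry, "content hash mismatch");

    var bundleId = Merkle.Root(recomputed.Leaves);                       // 4. Merkle root
    if (bundleId != recomputed.SpineBundleId)
        return Receipt.Fail(entry, "ProofBundleID mismatch");

    return Receipt.Pass(entry, bundleId, _toolDigests);                  // 5. cacheable receipt
}
```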
# Storage layout (Postgres + blob store)

* `sbom_entries(entry_id PK, bom_digest, purl, version, artifact_digest, trust_anchor_id)`
* `dsse_envelopes(env_id PK, entry_id, predicate_type, signer_keyid, body_hash, envelope_blob_ref, signed_at)`
* `spines(entry_id PK, bundle_id, evidence_ids[], reasoning_id, vex_id, anchor_id, created_at)`
* `trust_anchors(anchor_id PK, purl_pattern, allowed_keyids[], policy_ref, revoked_keys[])`
* Blobs (immutable): raw evidence, normalized reasoning JSON, VEX JSON, DSSE bytes.
# API surface (clean and small)

* `POST /proofs/:entry/spine` → submit or update a spine (idempotent by ProofBundleID)
* `GET /proofs/:entry/receipt` → the full verification receipt (JSON)
* `GET /proofs/:entry/vex` → the verified VEX body
* `GET /anchors/:anchor` → fetch a trust anchor (for offline kits)
# Normalization rules (so hashes are stable)

* Canonical JSON (UTF‑8, sorted keys, no insignificant whitespace).
* Strip volatile fields (timestamps that aren’t part of the semantic claim).
* Version your schemas: `evidence.stella/v1`, `reasoning.stella/v1`, etc.
# Signing keys & rotation

* Keep keys in your **Authority** module (KMS/HSM; offline export for air gap).
* Publish key material via an **attestation feed** (or Rekor mirror) for third‑party audit.
* Rotate by **adding** new `allowed_keyids` to the TrustAnchor; never mutate old envelopes.
# CI/CD hooks

* On SBOM ingest → create/refresh SBOMEntry rows + attach the TrustAnchor.
* On scan completion → produce Evidence Statements (DSSE) immediately.
* On policy evaluation → produce Reasoning + VEX, then assemble the Spine.
* Gate releases on `GET /proofs/:entry/receipt` == PASS.
# UX (auditor‑friendly)

* A **proof timeline** per entry: SBOM → Evidence tiles → Reasoning → VEX → Receipt.
* One‑click “Recompute & Compare” to show that deterministic replay passes.
* Red/amber flags when a signature no longer matches a TrustAnchor or a key is revoked.
# Minimal dev checklist

* [ ] Implement the canonicalizers (Evidence, Reasoning, VEX).
* [ ] Implement DSSE sign/verify (ECDSA + optional PQC).
* [ ] TrustAnchor registry + resolver by purl pattern.
* [ ] Merkle bundling to get the ProofBundleID.
* [ ] Receipt generator + verifier.
* [ ] Postgres schema + blob GC (content‑addressed).
* [ ] CI gates + the API endpoints above.
* [ ] Auditor UI: timeline + diff + receipt download.

If you want, I can drop in a ready‑to‑use JSON schema set (`evidence.stella/v1`, `reasoning.stella/v1`, `proofspine.stella/v1`) and sample DSSE envelopes wired to your .NET 10 stack.
Here’s a focused **Stella Ops Developer Guidelines** doc, specifically for the pipeline that turns **SBOM data into verifiable proofs** (your SBOM → Evidence → Reasoning → VEX → Proof Spine).

Feel free to paste this into your internal handbook and tweak the names to match your repos/services.

---

# Stella Ops Developer Guidelines

## Turning SBOM Data Into Verifiable Proofs

---
## 1. Mental Model: What You’re Actually Building

For every component in an SBOM, Stella must be able to answer: *“Why should anyone trust our VEX verdict for this dependency, today and ten years from now?”*

We do that with a pipeline:

1. **SBOM Ingest**
   Raw SBOM → validated → normalized → `SBOMEntryID`.

2. **Evidence Collection**
   Scans, feeds, configs, reachability, etc. → canonical evidence blobs → `EvidenceID` → DSSE‑signed.

3. **Reasoning / Policy**
   Policy + evidence → deterministic reasoning → `ReasoningID` → DSSE‑signed.

4. **VEX Verdict**
   VEX statement (CycloneDX / CSAF) → canonicalized → `VEXVerdictID` → DSSE‑signed.

5. **Proof Spine**
   `{SBOMEntryID, EvidenceIDs[], ReasoningID, VEXVerdictID}` → Merkle bundle → `ProofBundleID` → DSSE‑signed.

6. **Verification & Receipts**
   Re‑run verification → a `Receipt` that proves everything above is intact and anchored to trusted keys.

Everything you do in this area should keep this spine intact and verifiable.

---
## 2. Non‑Negotiable Invariants

These are the rules you don’t break without an explicit, company‑level decision:

1. **Immutability of Signed Facts**

   * DSSE envelopes (evidence, reasoning, VEX, spines) are append‑only.
   * You never edit or delete content inside a previously signed envelope.
   * Corrections are made by **superseding** (a new statement pointing at the old one).

2. **Determinism**

   * Same `{SBOMEntryID, evidence set, policyVersion}` ⇒ same `{ReasoningID, VEXVerdictID, ProofBundleID}`.
   * No non‑deterministic inputs (e.g., “current time”, random IDs) in anything that affects IDs or verdicts.

3. **Traceability**

   * Every VEX verdict must be traceable back to:

     * the precise SBOM entry,
     * concrete evidence blobs,
     * a specific policy & reasoning snapshot,
     * a trust anchor defining the allowed signers.

4. **Least Trust / Least Privilege**

   * Services only know the keys and data they need.
   * Trust is always explicit: through **TrustAnchors** and signature verification, never “because it’s in our DB”.

5. **Backwards Compatibility**

   * New code must continue to verify **old proofs**.
   * New policies must **not rewrite history**; they produce *new* spines, leaving old ones intact.

---
## 3. SBOM Ingestion Guidelines

**Goal:** Turn arbitrary SBOMs into stable, addressable `SBOMEntryID`s and safe internal models.

### 3.1 Inputs & Formats

* Support at least:

  * CycloneDX (JSON)
  * SPDX (JSON / Tag‑Value)

* For each ingested SBOM, store:

  * the raw SBOM bytes (immutable, content‑addressed),
  * a normalized internal representation (your own model).
### 3.2 IDs

* Generate:

  * `sbomDigest` = hash(raw SBOM, canonical form)
  * `SBOMEntryID` = `sbomDigest + purl + version` (or an equivalent stable tuple)

* `SBOMEntryID` must:

  * not depend on ingestion time or database IDs,
  * be reproducible from the SBOM + deterministic normalization.
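A minimal derivation sketch (helper names illustrative; the canonical-form step is assumed from Section 4’s rules):

```csharp
using System;
using System.Security.Cryptography;

public static class SbomIds
{
    // No DB- or time-dependent input, so ingestion stays replayable.
    public static string SbomDigest(byte[] canonicalSbomBytes) =>
        "sha256:" + Convert.ToHexString(SHA256.HashData(canonicalSbomBytes)).ToLowerInvariant();

    public static string SbomEntryId(string sbomDigest, string purl, string version) =>
        $"{sbomDigest}:{purl}@{version}";
}

// e.g. SbomIds.SbomEntryId(digest, "pkg:npm/lodash", "4.17.21")
```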
### 3.3 Validation & Errors

* Validate:

  * syntax (JSON, schema),
  * core semantics (package identifiers, digests, versions).

* If invalid:

  * Reject the SBOM, **but** record a small DSSE “failure attestation” explaining:

    * why it failed,
    * which file,
    * which system version.

  * This still gives you a proof trail for “we tried, and it failed”.

---
## 4. Evidence Collection Guidelines

**Goal:** Capture all inputs that influence the verdict in a canonical, signed form.

Typical evidence types:

* SCA / vulnerability scanner results
* CVE feeds & advisory data
* Reachability / call‑graph analysis
* Runtime context (where this component is used)
* Manual assessments (e.g., security‑engineer verdicts)
### 4.1 Evidence Canonicalization

For every evidence item:

* Normalize to a schema like `evidence.stella/v1` with fields such as:

  * `source` (scanner name, feed)
  * `sourceVersion` (tool version, DB version)
  * `collectionTime`
  * `sbomEntryId`
  * `vulnerabilityId` (if applicable)
  * `rawFinding` (or a pointer to it)

* Canonical JSON rules:

  * sorted keys,
  * UTF‑8, no extraneous whitespace,
  * no volatile fields beyond what’s semantically needed (e.g., you might include `collectionTime`, but then know it affects the hash and treat that consciously).

Then:

* Compute `EvidenceID = hash(canonicalEvidenceJson)`.
* Wrap in DSSE:

  * `subject`: `SBOMEntryID`
  * `predicateType`: `evidence.stella/v1`
  * `predicate`: canonical evidence + `EvidenceID`.

* Sign with the **evidence‑ingestor key** (per environment).
### 4.2 Ops Rules

* **Idempotency:**
  Re‑running the same scan with the same inputs should produce the same evidence object and `EvidenceID`.
* **Tool changes:**
  When the tool version or configuration changes, that’s a *new* evidence statement with a new `EvidenceID`. Do not overwrite old evidence.
* **Partial failure:**
  If a scan fails, produce a minimal failure evidence record (with error details) instead of “nothing”.

---
## 5. Reasoning & Policy Engine Guidelines

**Goal:** Turn evidence into a defensible, replayable reasoning step with a clear policy version.

### 5.1 Reasoning Object

Define a canonical reasoning schema, e.g. `reasoning.stella/v1`:

* `sbomEntryId`
* `evidenceIds[]` (sorted)
* `policyVersion`
* `inputs`: a normalized form of all policy inputs (severity thresholds, lattice rules, etc.)
* `intermediateFindings`: optional but useful — e.g., “reachable vulns = …”

Then:

* Canonicalize the JSON and compute `ReasoningID = hash(canonicalReasoning)`.
* Wrap in DSSE:

  * `subject`: `SBOMEntryID`
  * `predicateType`: `reasoning.stella/v1`
  * `predicate`: canonical reasoning + `ReasoningID`.

* Sign with the **Policy/Authority key**.
### 5.2 Determinism

* Reasoning functions must be **pure**:

  * Inputs: SBOMEntryID, the evidence set, the policy version, configuration.
  * No hidden calls to external APIs at decision time (fetch feeds earlier and record them as evidence).

* If you need “current time” in policy:

  * treat it as an **explicit input** and record it inside the reasoning under `inputs.currentEvaluationTime`.
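A sketch of that pure evaluation boundary (the interface and type names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// All inputs explicit; no clocks, RNG, or I/O inside. Re-running with the
// same arguments must yield a byte-identical reasoning payload.
public interface IPolicyEvaluator
{
    ReasoningPayload Evaluate(
        string sbomEntryId,
        IReadOnlyList<CanonicalEvidence> evidence,    // sorted by EvidenceID
        PolicySnapshot policy,                        // pins policyVersion + rules
        DateTimeOffset? currentEvaluationTime = null  // explicit, recorded in inputs
    );
}
```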
### 5.3 Policy Evolution

* When policy changes:

  * Bump `policyVersion`.
  * New evaluations produce a new `ReasoningID` and new VEX/spines.
  * Don’t retroactively apply the new policy to old reasoning objects; generate new ones alongside.

---
## 6. VEX Verdict Guidelines

**Goal:** Generate VEX statements that are strongly tied to SBOM entries and your reasoning.

### 6.1 Shape

* Target standard formats:

  * CycloneDX VEX
  * or CSAF

* Required linkages:

  * Component reference = `SBOMEntryID` or a resolvable component identifier from your SBOM normalization layer.
  * Vulnerability IDs (CVE, GHSA, internal IDs).
  * Status (`not_affected`, `affected`, `fixed`, etc.).
  * Justification & impact.
### 6.2 Canonicalization & Signing

* Define a canonical VEX body schema (a subset of the standard + internal metadata):

  * `sbomEntryId`
  * `vulnerabilityId`
  * `status`
  * `justification`
  * `policyVersion`
  * `reasoningId`

* Canonicalize the JSON → `VEXVerdictID = hash(canonicalVexBody)`.
* DSSE‑envelope it:

  * `subject`: `SBOMEntryID`
  * `predicateType`: e.g. `cdx-vex.stella/v1`
  * `predicate`: canonical VEX + `VEXVerdictID`.

* Sign with the **VEXer key** or a vendor key (depending on the trust anchor).
### 6.3 External VEX

* When importing vendor VEX:

  * Verify the signature against the vendor’s TrustAnchor.
  * Canonicalize to your internal schema but preserve:

    * the original document,
    * the original signature material.

  * Record “source = vendor” vs “source = stella” so auditors see the origin.

---
## 7. Proof Spine Guidelines

**Goal:** Build a compact, tamper‑evident “bundle” that ties everything together.

### 7.1 Structure

For each `SBOMEntryID`, gather:

* `EvidenceIDs[]` (sorted lexicographically).
* `ReasoningID`.
* `VEXVerdictID`.

Compute:

* A Merkle tree root (or deterministic hash) over:

  * `sbomEntryId`
  * the sorted `EvidenceIDs[]`
  * `ReasoningID`
  * `VEXVerdictID`

* The result is the `ProofBundleID`.

Create a DSSE “spine”:

* `subject`: `SBOMEntryID`
* `predicateType`: `proofspine.stella/v1`
* `predicate`:

  * `evidenceIds[]`
  * `reasoningId`
  * `vexVerdictId`
  * `policyVersion`
  * `proofBundleId`

* Sign with the **Authority key**.
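A minimal bundling sketch: a binary Merkle root over the ordered leaves, duplicating the last node on odd levels (one common convention; swap in your real tree implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ProofBundle
{
    public static string ComputeId(
        string sbomEntryId, IEnumerable<string> evidenceIds,
        string reasoningId, string vexVerdictId)
    {
        var leaves = new List<byte[]> { Leaf(sbomEntryId) };
        leaves.AddRange(evidenceIds.OrderBy(e => e, StringComparer.Ordinal).Select(Leaf));
        leaves.Add(Leaf(reasoningId));
        leaves.Add(Leaf(vexVerdictId));

        while (leaves.Count > 1)                 // hash pairs up to the root
        {
            var next = new List<byte[]>();
            for (int i = 0; i < leaves.Count; i += 2)
            {
                var right = i + 1 < leaves.Count ? leaves[i + 1] : leaves[i];
                next.Add(SHA256.HashData(leaves[i].Concat(right).ToArray()));
            }
            leaves = next;
        }
        return Convert.ToHexString(leaves[0]).ToLowerInvariant();
    }

    private static byte[] Leaf(string id) => SHA256.HashData(Encoding.UTF8.GetBytes(id));
}
```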
### 7.2 Ops Rules

* Spine generation is idempotent:

  * same inputs → same `ProofBundleID`.

* Never mutate existing spines; new policy or new evidence ⇒ a new spine.
* Keep a clear API contract:

  * `GET /proofs/:entry` returns **all** spines, each labeled with `policyVersion` and timestamps.

---
## 8. Storage & Schema Guidelines

**Goal:** Keep proofs queryable forever without breaking verification.

### 8.1 Tables (conceptual)

* `sbom_entries`: `entry_id`, `bom_digest`, `purl`, `version`, `artifact_digest`, `trust_anchor_id`.
* `dsse_envelopes`: `env_id`, `entry_id`, `predicate_type`, `signer_keyid`, `body_hash`, `envelope_blob_ref`, `signed_at`.
* `spines`: `entry_id`, `proof_bundle_id`, `policy_version`, `evidence_ids[]`, `reasoning_id`, `vex_verdict_id`, `anchor_id`, `created_at`.
* `trust_anchors`: `anchor_id`, `purl_pattern`, `allowed_keyids[]`, `policy_ref`, `revoked_keys[]`.

### 8.2 Schema Changes

Always follow:

1. **Expand**

   * Add new columns/tables.
   * Make new code tolerant of old data.

2. **Backfill**

   * Idempotent jobs that fill in new IDs/fields without touching old DSSE payloads.

3. **Contract**

   * Only after all code uses the new model.
   * Never drop the raw DSSE or raw SBOM blobs.

---
## 9. Verification & Receipts

**Goal:** Make it trivial (for you, customers, and regulators) to recheck everything.

### 9.1 Verification Flow

Given an `SBOMEntryID` or `ProofBundleID`:

1. Fetch the spine and trust anchor.
2. Verify:

   * the spine DSSE signature against the TrustAnchor’s allowed keys,
   * the VEX, reasoning, and evidence DSSE signatures.

3. Recompute:

   * `EvidenceIDs` from the stored canonical evidence.
   * `ReasoningID` from the reasoning.
   * `VEXVerdictID` from the VEX body.
   * `ProofBundleID` from the above.

4. Compare to the stored IDs.

Emit a **Receipt**:

* `proofBundleId`
* `verifiedAt`
* `verifierVersion`
* `anchorId`
* `result` (pass/fail, with reasons)
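A sketch of the receipt shape (fields from the list above; the record names are illustrative):

```csharp
using System;
using System.Collections.Generic;

public sealed record Receipt(
    string ProofBundleId,
    DateTimeOffset VerifiedAt,
    string VerifierVersion,
    string AnchorId,
    VerificationResult Result);

public sealed record VerificationResult(bool Passed, IReadOnlyList<string> Reasons);
```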
### 9.2 Offline Kit

* Provide a minimal CLI (`stella verify`) that:

  * accepts a bundle export (SBOM + DSSE envelopes + anchors),
  * verifies everything without network access.

Developers must ensure:

* the export format is documented and stable,
* all fields required for verification are included.

---
## 10. Security & Key Management (for Devs)

* Keys live in **KMS/HSM**, not in env vars or config files.
* Separate keysets:

  * `dev`, `staging`, `prod`
  * Authority vs VEXer vs Evidence Ingestor.

* TrustAnchors:

  * are edited via the Authority service only;
  * every change:

    * requires a code‑reviewed change,
    * writes an audit‑log entry.

Never:

* log private keys,
* log full DSSE envelopes in plaintext logs (log IDs and hashes instead).

---
## 11. Observability & On‑Call Expectations

### 11.1 Metrics

For the SBOM→Proof pipeline, expose:

* `sboms_ingested_total`
* `sbom_ingest_errors_total{reason}`
* `evidence_statements_created_total`
* `reasoning_statements_created_total`
* `vex_statements_created_total`
* `proof_spines_created_total`
* `proof_verifications_total{result}` (pass/fail reason)
* Latency histograms per stage (`_duration_seconds`)

### 11.2 Logging

Include in structured logs wherever relevant:

* `sbomEntryId`
* `proofBundleId`
* `anchorId`
* `policyVersion`
* `requestId` / `traceId`

### 11.3 Runbooks

You should maintain runbooks for at least:

* “Pipeline is stalled” (a backlog of SBOMs, evidence, or spines).
* “Verification failures increased”.
* “Trust anchor or key issues” (rotation, revocation, misconfiguration).
* “Backfill gone wrong” (how to safely stop, resume, and audit).

---
## 12. Dev Workflow & PR Checklist (SBOM→Proof Changes Only)

When your change touches SBOM ingestion, evidence, reasoning, VEX, or proof spines, check:

* [ ] IDs (`SBOMEntryID`, `EvidenceID`, `ReasoningID`, `VEXVerdictID`, `ProofBundleID`) remain **deterministic** and fully specified.
* [ ] No mutation of existing DSSE envelopes or historical proof data.
* [ ] Schema changes follow **expand → backfill → contract**.
* [ ] New/updated TrustAnchors are reviewed by the Authority owner.
* [ ] Unit tests cover:

  * canonicalization for any new/changed predicate,
  * ID computation.

* [ ] An integration test covers:

  * SBOM → Evidence → Reasoning → VEX → Spine → Verification → Receipt.

* [ ] Observability updated:

  * new paths emit logs & metrics.

* [ ] A rollback plan is documented (especially for migrations & policy changes).

---

If you tell me which microservices/repos map to these stages (e.g. `stella-sbom-ingest`, `stella-proof-authority`, `stella-vexer`), I can turn this into a more concrete, service‑by‑service checklist with example API contracts and class/interface sketches.
docs/router/01-Step.md (new file, 422 lines)
@@ -0,0 +1,422 @@
Goal for this phase: get a clean, compiling skeleton in place that matches the spec and folder conventions, with zero real logic and minimal dependencies. After this, all future work plugs into this structure.

I’ll break it into concrete tasks you can assign to agents.

---
## 1. Define the repository layout

**Owner: “Skeleton” / infra agent**

Target layout (no code yet, just dirs):

```text
/ (repo root)
  StellaOps.Router.sln
  /src
    /StellaOps.Gateway.WebService
    /__Libraries
      /StellaOps.Router.Common
      /StellaOps.Router.Config
      /StellaOps.Microservice
      /StellaOps.Microservice.SourceGen   (empty stub for now)
  /tests
    /StellaOps.Router.Common.Tests
    /StellaOps.Gateway.WebService.Tests
    /StellaOps.Microservice.Tests
  /docs
    /router
      specs.md    (already exists)
      README.md   (placeholder, 2–3 lines)
```

Tasks:

1. Create the `src`, `src/__Libraries`, `tests`, and `docs/router` directories if missing.
2. Move/confirm `docs/router/specs.md` as the canonical spec.
3. Add `docs/router/README.md` with a pointer: “Start with specs.md; this folder will host router‑related docs.”

---
## 2. Create the solution and projects

**Owner: skeleton agent**

### 2.1 Create solution

* At the repo root:

  ```bash
  dotnet new sln -n StellaOps.Router
  ```

* Add projects as they are created in the next step.

### 2.2 Create projects

For each project below:

* `dotnet new` with the appropriate template.
* Set `RootNamespace` / `AssemblyName` to match the folder & spec.

Projects:

1. **Gateway webservice**

   ```bash
   cd src
   dotnet new webapi -n StellaOps.Gateway.WebService
   ```

   * This creates an ASP.NET Core Web API project under `src/StellaOps.Gateway.WebService`; we’ll trim it later.

2. **Common library**

   ```bash
   cd src/__Libraries
   dotnet new classlib -n StellaOps.Router.Common
   ```

3. **Config library**

   ```bash
   dotnet new classlib -n StellaOps.Router.Config
   ```

4. **Microservice SDK**

   ```bash
   dotnet new classlib -n StellaOps.Microservice
   ```

5. **Microservice source generator (stub)**

   ```bash
   dotnet new classlib -n StellaOps.Microservice.SourceGen
   ```

   * This will be converted to an Analyzer/SourceGen project later; for now it can compile as a plain library.

6. **Test projects**

   Under `tests`:

   ```bash
   cd tests
   dotnet new xunit -n StellaOps.Router.Common.Tests
   dotnet new xunit -n StellaOps.Gateway.WebService.Tests
   dotnet new xunit -n StellaOps.Microservice.Tests
   ```

### 2.3 Add projects to the solution

At the repo root:

```bash
dotnet sln StellaOps.Router.sln add \
  src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
  src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj \
  src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj \
  src/__Libraries/StellaOps.Microservice.SourceGen/StellaOps.Microservice.SourceGen.csproj \
  tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj \
  tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj \
  tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj
```

---
## 3. Wire basic project references

**Owner: skeleton agent**

The reference graph should be:

* `StellaOps.Gateway.WebService`

  * references `StellaOps.Router.Common`
  * references `StellaOps.Router.Config`

* `StellaOps.Microservice`

  * references `StellaOps.Router.Common`
  * (later) references `StellaOps.Microservice.SourceGen` as an analyzer; no reference for now.

* `StellaOps.Router.Config`

  * references `StellaOps.Router.Common` (for `EndpointDescriptor`, `InstanceDescriptor`, etc.)

Test projects:

* `StellaOps.Router.Common.Tests` → `StellaOps.Router.Common`
* `StellaOps.Gateway.WebService.Tests` → `StellaOps.Gateway.WebService`
* `StellaOps.Microservice.Tests` → `StellaOps.Microservice`

Use `dotnet add reference`:

```bash
dotnet add src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
  src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj

dotnet add src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

dotnet add tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

dotnet add tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj reference \
  src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj

dotnet add tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj reference \
  src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
```

---
## 4. Set common build settings

**Owner: infra agent**

Add a `Directory.Build.props` at repo root to centralize:

* Target framework (e.g. `net8.0`).
* Nullable context.
* LangVersion.

Example (minimal):

```xml
<Project>
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <LangVersion>preview</LangVersion> <!-- if needed for newer features -->
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
</Project>
```

Then strip redundant `<TargetFramework>` elements from individual `.csproj` files if desired.

---
## 5. Stub namespaces and “empty” entry points

**Owner: each project’s agent**

### 5.1 Common library

Create empty placeholder types that match the spec names (no logic, just shells) so everything compiles and IntelliSense knows the shapes.

Example files:

* `TransportType.cs`
* `FrameType.cs`
* `InstanceHealthStatus.cs`
* `ClaimRequirement.cs`
* `EndpointDescriptor.cs`
* `InstanceDescriptor.cs`
* `ConnectionState.cs`
* `RoutingContext.cs`
* `RoutingDecision.cs`
* `PayloadLimits.cs`
* Interfaces: `IGlobalRoutingState`, `IRoutingPlugin`, `ITransportServer`, `ITransportClient`.

Each type can be an auto-property-only record/class/enum; no methods yet.

Example:

```csharp
namespace StellaOps.Router.Common;

public enum TransportType
{
    Udp,
    Tcp,
    Certificate,
    RabbitMq
}
```

and so on.

### 5.2 Config library

Add minimal `RouterConfig` and `PayloadLimits` classes aligned with the spec; again, just properties.

```csharp
namespace StellaOps.Router.Config;

public sealed class RouterConfig
{
    public IList<ServiceConfig> Services { get; init; } = new List<ServiceConfig>();
    public PayloadLimits PayloadLimits { get; init; } = new();
}

public sealed class ServiceConfig
{
    public string Name { get; init; } = string.Empty;
    public string DefaultVersion { get; init; } = "1.0.0";
}
```

No YAML binding, no logic yet.

### 5.3 Microservice library

Create:

* `StellaMicroserviceOptions` with required properties.
* `RouterEndpointConfig` (host/port/transport).
* Extension method `AddStellaMicroservice(...)` with an empty body that just registers options and placeholder services.

```csharp
namespace StellaOps.Microservice;

public sealed class StellaMicroserviceOptions
{
    public string ServiceName { get; set; } = string.Empty;
    public string Version { get; set; } = string.Empty;
    public string Region { get; set; } = string.Empty;
    public string InstanceId { get; set; } = string.Empty;
    public IList<RouterEndpointConfig> Routers { get; set; } = new List<RouterEndpointConfig>();
    public string? ConfigFilePath { get; set; }
}

public sealed class RouterEndpointConfig
{
    public string Host { get; set; } = string.Empty;
    public int Port { get; set; }
    public TransportType TransportType { get; set; }
}
```

`AddStellaMicroservice`:

```csharp
public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddStellaMicroservice(
        this IServiceCollection services,
        Action<StellaMicroserviceOptions> configure)
    {
        services.Configure(configure);
        // TODO: register internal SDK services in later phases
        return services;
    }
}
```
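For reference, a consumer would call this from a generic host. This is a minimal sketch, not part of the SDK yet; the option values are invented examples, and it assumes the standard `Microsoft.Extensions.Hosting` APIs:

```csharp
// Hypothetical consumer Program.cs, for illustration only.
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using StellaOps.Microservice;
using StellaOps.Router.Common;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddStellaMicroservice(options =>
        {
            options.ServiceName = "billing";   // example values, not from the spec
            options.Version = "1.0.0";
            options.Region = "eu-west";
            options.InstanceId = Guid.NewGuid().ToString();
            options.Routers.Add(new RouterEndpointConfig
            {
                Host = "localhost",
                Port = 5000,
                TransportType = TransportType.Tcp
            });
        });
    })
    .Build();

await host.RunAsync();
```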
### 5.4 Microservice.SourceGen

For now:

* Leave this as an empty classlib with a short `README.md` stating:
  * “This project will host Roslyn source generators for endpoint discovery. No implementation yet.”

Don’t hook it up as an analyzer until there is content.

### 5.5 Gateway webservice

Simplify the scaffolded Web API to the minimum:

* In `Program.cs`, build a barebones `WebApplication` that:
  * Binds `GatewayNodeConfig` from config.
  * Adds controllers or minimal endpoints.
  * Runs; no router logic yet.

Example:

```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.Configure<GatewayNodeConfig>(
    builder.Configuration.GetSection("GatewayNode"));

builder.Services.AddControllers();

var app = builder.Build();

app.MapControllers(); // may be empty for now

app.Run();
```

* Add a `GatewayNodeConfig` class in the `StellaOps.Gateway.WebService` project.

---
## 6. Make tests compile (even if empty)

**Owner: test agent**

For each test project:

* Reference the appropriate main project (already done).
* Add a single dummy test class so CI passes:

```csharp
public class SmokeTests
{
    [Fact]
    public void SolutionCompiles()
    {
        Assert.True(true);
    }
}
```

This is just to ensure the pipeline runs; real tests come later.

---
## 7. Add initial CI/build pipeline

**Owner: infra agent**

Set up minimal CI (GitHub Actions, GitLab, Azure DevOps, whatever you use) with these steps:

* `dotnet restore`
* `dotnet build StellaOps.Router.sln -c Release`
* `dotnet test StellaOps.Router.sln -c Release`

No packaging or deployment yet; just compile + tests. A sketch for one of those options follows.
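If GitHub Actions happens to be the host, a minimal workflow might look like this. The choice of GitHub Actions here is an assumption for illustration (the checkout/setup actions are the standard public ones); the same three steps map directly onto GitLab or Azure DevOps:

```yaml
# .github/workflows/build.yml - hypothetical minimal pipeline
name: build

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'
      - run: dotnet restore
      - run: dotnet build StellaOps.Router.sln -c Release --no-restore
      - run: dotnet test StellaOps.Router.sln -c Release --no-build
```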
---
## 8. Sanity check & readiness criteria

Before you let agents move to “Common model implementation” (the next phase), confirm:

* The solution builds cleanly in a clean checkout (`dotnet restore`, `dotnet build`).
* All test projects run and pass (even with dummy tests).
* Namespaces and project names match the spec:
  * `StellaOps.Gateway.WebService`
  * `StellaOps.Router.Common`
  * `StellaOps.Router.Config`
  * `StellaOps.Microservice`
* There is no real business logic yet:
  * No transport logic.
  * No routing decisions.
  * No reflection or YAML.
* `docs/router/specs.md` is referenced in `docs/router/README.md` as the spec.

At that point, the skeleton is in place and stable. Next phases can then focus on:

* Filling in `Common` contracts properly.
* Implementing the in-memory transport.
* Wiring minimal microservice/gateway flows.

If you want, I can outline the next phase (“implement core model + in-memory transport”) with a similar task breakdown next.
docs/router/02-Step.md (Normal file, 375 lines)
@@ -0,0 +1,375 @@
For this step, the goal is: make `StellaOps.Router.Common` the single, stable contract layer that everything else can depend on, with **no behavior** yet, just shapes. After this, the gateway, microservice SDK, transports, and config can all compile against it.

Think of this as “lock down the domain vocabulary”.

---

## 0. Pre-work

**All devs touching Common:**

1. Read `docs/router/specs.md`, specifically the sections describing:
   * Enums (`TransportType`, `FrameType`, `InstanceHealthStatus`, etc.).
   * Endpoint/instance/routing models.
   * Frames and request/response correlation.
   * Routing state and the routing plugin.
2. Agree that no class/interface will be added to Common if it isn’t in the spec (or discussed with you and then added to the spec).

---
## 1. Inventory and file layout

**Owner: “Common” lead**

1. From `specs.md`, extract a **type inventory** for `StellaOps.Router.Common`:

   Enumerations:

   * `TransportType`
   * `FrameType`
   * `InstanceHealthStatus`

   Core value objects:

   * `ClaimRequirement`
   * `EndpointDescriptor`
   * `InstanceDescriptor`
   * `ConnectionState`
   * `PayloadLimits` (if used from Common; otherwise keep it in Config only)
   * Any small value types you’ve defined (e.g. cancel payload, ping metrics, etc., if present in the specs).

   Routing:

   * `RoutingContext`
   * `RoutingDecision`

   Frames:

   * `Frame` (type + correlation id + payload)
   * Optional payload contracts for HELLO, HEARTBEAT, ENDPOINTS_UPDATE, etc., if you’ve specified them explicitly.

   Abstractions/interfaces:

   * `IGlobalRoutingState`
   * `IRoutingPlugin`
   * `ITransportServer`
   * `ITransportClient`
   * Optional: `IRegionProvider` if you kept it in the spec.

2. Propose a file layout inside `src/__Libraries/StellaOps.Router.Common`:

   Example:

   ```text
   /StellaOps.Router.Common
     /Enums
       TransportType.cs
       FrameType.cs
       InstanceHealthStatus.cs
     /Models
       ClaimRequirement.cs
       EndpointDescriptor.cs
       InstanceDescriptor.cs
       ConnectionState.cs
       RoutingContext.cs
       RoutingDecision.cs
       Frame.cs
     /Abstractions
       IGlobalRoutingState.cs
       IRoutingPlugin.cs
       ITransportClient.cs
       ITransportServer.cs
       IRegionProvider.cs (if used)
   ```

3. Get a quick 👍/👎 from you on the layout (no code yet, just file names and namespaces).

---
## 2. Implement enums and basic models

**Owner: Common dev**

Scope: simple, immutable models, no methods.

1. **Enums**

   Implement:

   * `TransportType` with `[Udp, Tcp, Certificate, RabbitMq]`.
   * `FrameType` with `Hello`, `Heartbeat`, `EndpointsUpdate`, `Request`, `RequestStreamData`, `Response`, `ResponseStreamData`, `Cancel` (and any others in the specs).
   * `InstanceHealthStatus` with `Unknown`, `Healthy`, `Degraded`, `Draining`, `Unhealthy`.

   All enums live under `namespace StellaOps.Router.Common;`.

2. **Value models**

   Implement as plain classes/records with auto-properties:

   * `ClaimRequirement`:
     * `string Type` (required).
     * `string? Value` (optional).
   * `EndpointDescriptor`:
     * `string ServiceName`
     * `string Version`
     * `string Method`
     * `string Path`
     * `TimeSpan DefaultTimeout`
     * `bool SupportsStreaming`
     * `IReadOnlyList<ClaimRequirement> RequiringClaims`
   * `InstanceDescriptor`:
     * `string InstanceId`
     * `string ServiceName`
     * `string Version`
     * `string Region`
   * `ConnectionState`:
     * `string ConnectionId`
     * `InstanceDescriptor Instance`
     * `InstanceHealthStatus Status`
     * `DateTime LastHeartbeatUtc`
     * `double AveragePingMs`
     * `TransportType TransportType`
     * `IReadOnlyDictionary<(string Method, string Path), EndpointDescriptor> Endpoints`

   Design choices:

   * Make constructors minimal (empty constructors are okay for now).
   * Use `init` where reasonable to encourage immutability for descriptors; `ConnectionState` can have mutable health fields.

   One of these shells is sketched below.
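For concreteness, `EndpointDescriptor` might look like this; a minimal sketch under the design choices above (init-only, no behavior), not spec text:

```csharp
namespace StellaOps.Router.Common;

// Illustrative shell only: auto-properties, init-only, no behavior.
public sealed class EndpointDescriptor
{
    public string ServiceName { get; init; } = string.Empty;
    public string Version { get; init; } = string.Empty;
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public TimeSpan DefaultTimeout { get; init; }
    public bool SupportsStreaming { get; init; }
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; } =
        Array.Empty<ClaimRequirement>();
}
```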
3. **PayloadLimits (if in Common)**

   If the spec places `PayloadLimits` in Common (versus Config), implement:

   ```csharp
   public sealed class PayloadLimits
   {
       public long MaxRequestBytesPerCall { get; set; }
       public long MaxRequestBytesPerConnection { get; set; }
       public long MaxAggregateInflightBytes { get; set; }
   }
   ```

   If it’s defined in Config only, leave it there and avoid duplication.

---
## 3. Implement frame & correlation model

**Owner: Common dev**

1. Implement `Frame`:

   ```csharp
   public sealed class Frame
   {
       public FrameType Type { get; init; }
       public Guid CorrelationId { get; init; }
       public byte[] Payload { get; init; } = Array.Empty<byte>();
   }
   ```

2. If `specs.md` defines specific payload DTOs (e.g. `HelloPayload`, `HeartbeatPayload`, `CancelPayload`), define them too:

   * `HelloPayload`: `InstanceDescriptor` and a list of `EndpointDescriptor`s, or the equivalent properties.
   * `HeartbeatPayload`: `InstanceId`, `Status`, metrics.
   * `CancelPayload`: `string Reason` or similar.

   Keep them as simple DTOs with no logic; a sketch follows.
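Assuming the property sets listed above (the authoritative shapes belong to `specs.md`), the DTOs could be as plain as:

```csharp
namespace StellaOps.Router.Common;

// Illustrative shells; defer to specs.md for the authoritative property sets.
public sealed class HelloPayload
{
    public InstanceDescriptor Instance { get; init; } = default!;
    public IReadOnlyList<EndpointDescriptor> Endpoints { get; init; } =
        Array.Empty<EndpointDescriptor>();
}

public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; }
    public double AveragePingMs { get; init; } // example metric, not spec text
}

public sealed class CancelPayload
{
    public string? Reason { get; init; }
}
```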
3. Do **not** implement serialization yet (no JSON/MessagePack references here); Common should only define shapes.

---
## 4. Routing abstractions

**Owner: Common dev**

Implement the routing interface plus the context & decision types.

1. `RoutingContext`:

   * Match the spec. If your `specs.md` version includes `HttpContext`, follow it; if you intentionally kept Common free of ASP.NET types, use a neutral context (e.g. method/path/headers/principal).
   * For now, if `HttpContext` is included in the spec, define:

   ```csharp
   public sealed class RoutingContext
   {
       public object HttpContext { get; init; } = default!; // or Microsoft.AspNetCore.Http.HttpContext if allowed
       public EndpointDescriptor Endpoint { get; init; } = default!;
       public string GatewayRegion { get; init; } = string.Empty;
   }
   ```

   You can refine the type once you finalize whether Common may reference ASP.NET packages. If you want to avoid that now, define your own lightweight context model and let the gateway adapt.

2. `RoutingDecision` must include:

   * `EndpointDescriptor Endpoint`
   * `ConnectionState Connection`
   * `TransportType TransportType`
   * `TimeSpan EffectiveTimeout`

3. `IGlobalRoutingState`:

   Interface only, no implementation:

   ```csharp
   public interface IGlobalRoutingState
   {
       EndpointDescriptor? ResolveEndpoint(string method, string path);

       IReadOnlyList<ConnectionState> GetConnectionsFor(
           string serviceName,
           string version,
           string method,
           string path);
   }
   ```

4. `IRoutingPlugin`:

   A single method:

   ```csharp
   public interface IRoutingPlugin
   {
       Task<RoutingDecision?> ChooseInstanceAsync(
           RoutingContext context,
           CancellationToken cancellationToken);
   }
   ```

   No logic; just the interface.

---
## 5. Transport abstractions

**Owner: Common dev**

Implement the shared transport contracts.

1. `ITransportServer`:

   ```csharp
   public interface ITransportServer
   {
       Task StartAsync(CancellationToken cancellationToken);
       Task StopAsync(CancellationToken cancellationToken);
   }
   ```

2. `ITransportClient`:

   Per the spec, you need:

   * A buffered call (request → response).
   * A streaming call.
   * A cancel call.

   Interfaces only; the content is roughly:

   ```csharp
   public interface ITransportClient
   {
       Task<Frame> SendRequestAsync(
           ConnectionState connection,
           Frame requestFrame,
           TimeSpan timeout,
           CancellationToken cancellationToken);

       Task SendCancelAsync(
           ConnectionState connection,
           Guid correlationId,
           string? reason = null);

       Task SendStreamingAsync(
           ConnectionState connection,
           Frame requestHeader,
           Stream requestBody,
           Func<Stream, Task> readResponseBody,
           PayloadLimits limits,
           CancellationToken cancellationToken);
   }
   ```

   No implementation or transport-specific logic here. No network types beyond `Stream` and `Task`.

3. `IRegionProvider` (if you decided to keep it):

   ```csharp
   public interface IRegionProvider
   {
       string Region { get; }
   }
   ```

---
## 6. Wire Common into tests (sanity checks only)

**Owner: Common tests dev**

Create a few very simple unit tests in `StellaOps.Router.Common.Tests`:

1. **Shape tests** (these are mostly compile-time):

   * That `EndpointDescriptor` has the expected properties and that default values can be set.
   * That `ConnectionState` can be constructed and that its `Endpoints` dictionary handles `(Method, Path)` keys.

2. **Enum completeness tests**:

   * Assert that `Enum.GetValues(typeof(FrameType))` contains all expected values. This catches accidental changes; see the sketch after this list.

3. **No behavior yet**:

   * No routing algorithms or transport behavior tests here; just that the model contracts behave like dumb DTOs (e.g. property assignment, default value semantics).

This is mostly to lock in the shape and catch accidental refactors later.
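As a sketch of the enum completeness idea (assuming xUnit, which the test projects already use, and the `FrameType` members listed in step 2):

```csharp
using System;
using StellaOps.Router.Common;
using Xunit;

public class FrameTypeTests
{
    [Fact]
    public void FrameType_ContainsAllExpectedValues()
    {
        // If someone renames or removes a member, this test fails loudly.
        var expected = new[]
        {
            FrameType.Hello,
            FrameType.Heartbeat,
            FrameType.EndpointsUpdate,
            FrameType.Request,
            FrameType.RequestStreamData,
            FrameType.Response,
            FrameType.ResponseStreamData,
            FrameType.Cancel
        };

        var actual = (FrameType[])Enum.GetValues(typeof(FrameType));

        Assert.Equal(expected.Length, actual.Length);
        foreach (var value in expected)
            Assert.Contains(value, actual);
    }
}
```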
---

## 7. Cleanliness & review checklist

Before you move on to the in-memory transport and gateway/microservice wiring, check:

1. `StellaOps.Router.Common`:
   * Compiles with zero warnings (nullable enabled).
   * Only references the BCL; no ASP.NET or serializer packages unless intentionally agreed in the spec.
2. All types listed in `specs.md` under the Common section exist and match names & property sets.
3. No behavior/logic:
   * No LINQ-heavy methods.
   * No routing algorithm code.
   * No network code.
   * No YAML/JSON or serialization.
4. `StellaOps.Router.Common.Tests` runs and passes.
5. `docs/router/specs.md` is updated if there was any discrepancy (or the code is updated to match the spec, not the other way around).

---

If you want the next step, I can outline “3. Build in-memory transport + minimal HELLO/REQUEST/RESPONSE wiring” in the same style, so agents can move from contracts to a working vertical slice.
docs/router/03-Step.md (Normal file, 144 lines)
@@ -0,0 +1,144 @@
For this step, you’re not writing any real logic yet; you’re just making sure the projects depend on each other in the right direction so future work doesn’t turn into spaghetti.

Think of it as locking in the dependency graph.

---

## 1. Pin the desired dependency graph

First, make explicit what is allowed to depend on what.

Target graph:

* `StellaOps.Router.Common`
  * Lowest layer.
  * **No** project references to any other StellaOps projects.
* `StellaOps.Router.Config`
  * References: `StellaOps.Router.Common`.
* `StellaOps.Microservice`
  * References: `StellaOps.Router.Common`.
* `StellaOps.Microservice.SourceGen`
  * For now: no references, or only Common if needed for types in generated code.
  * Later: will be consumed as an analyzer by `StellaOps.Microservice`, not via a normal project reference.
* `StellaOps.Gateway.WebService`
  * References: `StellaOps.Router.Common`, `StellaOps.Router.Config`.

Test projects:

* `StellaOps.Router.Common.Tests` → `StellaOps.Router.Common`
* `StellaOps.Gateway.WebService.Tests` → `StellaOps.Gateway.WebService`
* `StellaOps.Microservice.Tests` → `StellaOps.Microservice`

Explicitly: there should be **no** circular references, and nothing should reference the Gateway from libraries.

---
## 2. Add the project references

From repo root, for each needed edge:

```bash
# Gateway → Common + Config
dotnet add src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
  src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj

# Microservice → Common
dotnet add src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

# Config → Common
dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

# Tests → main projects
dotnet add tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj reference \
  src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj

dotnet add tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj reference \
  src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj

dotnet add tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj reference \
  src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
```

Do **not** add any references:

* From `Common` → anything.
* From `Config` → Gateway or Microservice.
* From `Microservice` → Gateway.
* From tests → libraries other than their primary target (unless you explicitly want shared test utils later).

---
## 3. Verify the .csproj contents

Have one agent open each `.csproj` and confirm:

* `StellaOps.Router.Common.csproj`
  * No `<ProjectReference>` elements.
* `StellaOps.Router.Config.csproj`
  * Exactly one `<ProjectReference>`: Common.
* `StellaOps.Microservice.csproj`
  * Exactly one `<ProjectReference>`: Common.
* `StellaOps.Microservice.SourceGen.csproj`
  * No project references for now (we’ll convert it to a proper analyzer / source-generator package later).
* `StellaOps.Gateway.WebService.csproj`
  * Exactly two `<ProjectReference>`s: Common + Config.
  * No reference to Microservice.
* Test projects:
  * Each test project references only its corresponding main project (no cross-test coupling).

If anything else is present (e.g. leftover references from templates), remove them. For reference, the expected shape of one of these files is sketched below.
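As an example of what to look for, the Config project should contain a single `<ItemGroup>` reference block like the following (paths as used throughout this guide; the surrounding property groups live in `Directory.Build.props`):

```xml
<!-- src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj -->
<Project Sdk="Microsoft.NET.Sdk">

  <!-- TargetFramework, Nullable, etc. come from Directory.Build.props -->

  <ItemGroup>
    <ProjectReference Include="..\StellaOps.Router.Common\StellaOps.Router.Common.csproj" />
  </ItemGroup>

</Project>
```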
---
## 4. Run a full build & test as a sanity check

From repo root:

```bash
dotnet restore
dotnet build StellaOps.Router.sln -c Debug
dotnet test StellaOps.Router.sln -c Debug
```

Acceptance criteria for this step:

* The solution builds without reference errors.
* All test projects compile and run (even if they only have dummy tests).
* IntelliSense / navigation in the IDE shows:
  * Gateway can see Common & Config types.
  * Microservice can see Common types.
  * Config can see Common types.
  * No library can see the Gateway, except through tests.

Once this is stable, your devs can safely move on to implementing the Common model and know they won’t have to rewrite references later.
docs/router/04-Step.md (Normal file, 520 lines)
@@ -0,0 +1,520 @@
For this step, the goal is: a microservice that can:

* Start up with `AddStellaMicroservice(...)`
* Discover its endpoints from attributes
* Connect to the router (via the InMemory transport)
* Send a HELLO with identity + endpoints
* Receive a REQUEST and return a RESPONSE

No streaming, no cancellation, no heartbeat yet. Pure minimal handshake & dispatch.

---

## 0. Preconditions

Before your agents start this step, you should have:

* `StellaOps.Router.Common` contracts in place (enums, `EndpointDescriptor`, `ConnectionState`, `Frame`, etc.).
* The solution skeleton and project references configured.
* A **stub** InMemory transport “router harness” (at least a place to park the future InMemory transport). Even if it’s not fully implemented, assume it will expose:
  * A way for a microservice to “connect” and register itself.
  * A way to deliver frames from router to microservice and back.

If InMemory isn’t built yet, the microservice code should be written *against abstractions* so you can plug it in later.

---
## 1. Define microservice public surface (SDK contract)

**Project:** `__Libraries/StellaOps.Microservice`
**Owner:** microservice SDK agent

Purpose: give product teams a stable way to define services and endpoints without caring about transports.

### 1.1 Options

Make sure `StellaMicroserviceOptions` matches the spec:

```csharp
public sealed class StellaMicroserviceOptions
{
    public string ServiceName { get; set; } = string.Empty;
    public string Version { get; set; } = string.Empty;
    public string Region { get; set; } = string.Empty;
    public string InstanceId { get; set; } = string.Empty;

    public IList<RouterEndpointConfig> Routers { get; set; } = new List<RouterEndpointConfig>();

    public string? ConfigFilePath { get; set; }
}

public sealed class RouterEndpointConfig
{
    public string Host { get; set; } = string.Empty;
    public int Port { get; set; }
    public TransportType TransportType { get; set; }
}
```

`Routers` is mandatory: without at least one router configured, the SDK should refuse to start later (that policy can be enforced in the handshake stage).

### 1.2 Public endpoint abstractions

Define:

* An attribute for endpoint identity:

```csharp
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class StellaEndpointAttribute : Attribute
{
    public string Method { get; }
    public string Path { get; }

    public StellaEndpointAttribute(string method, string path)
    {
        Method = method;
        Path = path;
    }
}
```

* A raw handler:

```csharp
public sealed class RawRequestContext
{
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public IReadOnlyDictionary<string,string> Headers { get; init; } =
        new Dictionary<string,string>();
    public Stream Body { get; init; } = Stream.Null;
    public CancellationToken CancellationToken { get; init; }
}

public sealed class RawResponse
{
    public int StatusCode { get; set; } = 200;
    public IDictionary<string,string> Headers { get; } =
        new Dictionary<string,string>();
    public Func<Stream,Task>? WriteBodyAsync { get; set; } // may be null
}

public interface IRawStellaEndpoint
{
    Task<RawResponse> HandleAsync(RawRequestContext ctx);
}
```

* Typed convenience interfaces (used later, but define them now):

```csharp
public interface IStellaEndpoint<TRequest,TResponse>
{
    Task<TResponse> HandleAsync(TRequest request, CancellationToken ct);
}

public interface IStellaEndpoint<TResponse>
{
    Task<TResponse> HandleAsync(CancellationToken ct);
}
```

At this step, you don’t need to implement adapters yet, but the signatures must be fixed.

### 1.3 Registration extension

Extend `AddStellaMicroservice` to wire options + a few internal services:

```csharp
public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddStellaMicroservice(
        this IServiceCollection services,
        Action<StellaMicroserviceOptions> configure)
    {
        services.Configure(configure);

        services.AddSingleton<IEndpointCatalog, EndpointCatalog>();       // to be implemented
        services.AddSingleton<IEndpointDispatcher, EndpointDispatcher>(); // to be implemented

        services.AddHostedService<MicroserviceBootstrapHostedService>();  // handshake loop

        return services;
    }
}
```

This still compiles with empty implementations; you fill them in during the next steps.

---
## 2. Endpoint discovery (reflection only for now)

**Project:** `StellaOps.Microservice`
**Owner:** SDK agent

Goal: given the entry assembly, build:

* A list of `EndpointDescriptor` objects (from Common).
* A mapping `(Method, Path) -> handler type` used for dispatch.

### 2.1 Internal types

Define an internal representation:

```csharp
internal sealed class EndpointRegistration
{
    public EndpointDescriptor Descriptor { get; init; } = default!;
    public Type HandlerType { get; init; } = default!;
}
```

Define an interface for discovery:

```csharp
internal interface IEndpointDiscovery
{
    IReadOnlyList<EndpointRegistration> DiscoverEndpoints(StellaMicroserviceOptions options);
}
```

### 2.2 Implement reflection-based discovery

Create `ReflectionEndpointDiscovery`:

* Scan the entry assembly (and optionally referenced assemblies) for classes that:
  * Have `StellaEndpointAttribute`.
  * Implement either `IRawStellaEndpoint`, `IStellaEndpoint<,>`, or `IStellaEndpoint<>`.
* For each `[StellaEndpoint]` usage:
  * Create an `EndpointDescriptor` with:
    * `ServiceName` = `options.ServiceName`.
    * `Version` = `options.Version`.
    * `Method`, `Path` from the attribute.
    * `DefaultTimeout` = some sensible default (e.g. `TimeSpan.FromSeconds(30)`; refine later).
    * `SupportsStreaming` = `false` (for now).
    * `RequiringClaims` = empty array (for now).
  * Create an `EndpointRegistration` with `Descriptor` + `HandlerType`.
* Return the list.

Wire it into DI:

```csharp
services.AddSingleton<IEndpointDiscovery, ReflectionEndpointDiscovery>();
```
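A minimal sketch of that discovery class, following the bullet list above. It assumes an entry-assembly-only scan, the init-only `EndpointDescriptor` shape from step 2 of the previous phase, and the 30-second placeholder timeout suggested above:

```csharp
using System.Linq;
using System.Reflection;

internal sealed class ReflectionEndpointDiscovery : IEndpointDiscovery
{
    public IReadOnlyList<EndpointRegistration> DiscoverEndpoints(StellaMicroserviceOptions options)
    {
        var registrations = new List<EndpointRegistration>();
        var assembly = Assembly.GetEntryAssembly()
                       ?? throw new InvalidOperationException("No entry assembly.");

        foreach (var type in assembly.GetTypes())
        {
            // Only concrete classes that implement one of the handler interfaces.
            if (type.IsAbstract || !IsHandler(type))
                continue;

            // A class may carry several [StellaEndpoint] attributes.
            foreach (var attr in type.GetCustomAttributes<StellaEndpointAttribute>())
            {
                registrations.Add(new EndpointRegistration
                {
                    Descriptor = new EndpointDescriptor
                    {
                        ServiceName = options.ServiceName,
                        Version = options.Version,
                        Method = attr.Method,
                        Path = attr.Path,
                        DefaultTimeout = TimeSpan.FromSeconds(30), // placeholder default
                        SupportsStreaming = false,
                        RequiringClaims = Array.Empty<ClaimRequirement>()
                    },
                    HandlerType = type
                });
            }
        }

        return registrations;
    }

    private static bool IsHandler(Type type) =>
        typeof(IRawStellaEndpoint).IsAssignableFrom(type) ||
        type.GetInterfaces().Any(i => i.IsGenericType &&
            (i.GetGenericTypeDefinition() == typeof(IStellaEndpoint<,>) ||
             i.GetGenericTypeDefinition() == typeof(IStellaEndpoint<>)));
}
```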
---

## 3. Endpoint catalog & dispatcher (microservice internal)

**Project:** `StellaOps.Microservice`
**Owner:** SDK agent

Goal: the presence of:

* A catalog holding endpoints and descriptors.
* A dispatcher that takes frames and calls handlers.

### 3.1 Endpoint catalog

Define:
```csharp
internal interface IEndpointCatalog
{
    IReadOnlyList<EndpointDescriptor> Descriptors { get; }
    bool TryGetHandler(string method, string path, out EndpointRegistration endpoint);
}

internal sealed class EndpointCatalog : IEndpointCatalog
{
    private readonly Dictionary<(string Method, string Path), EndpointRegistration> _map;
    public IReadOnlyList<EndpointDescriptor> Descriptors { get; }

    public EndpointCatalog(IEndpointDiscovery discovery,
                           IOptions<StellaMicroserviceOptions> optionsAccessor)
    {
        var options = optionsAccessor.Value;
        var registrations = discovery.DiscoverEndpoints(options);

        // Tuple keys cannot take a StringComparer directly, so normalize the
        // key casing instead to get case-insensitive lookups.
        _map = registrations.ToDictionary(
            r => (r.Descriptor.Method.ToUpperInvariant(), r.Descriptor.Path.ToUpperInvariant()),
            r => r);

        Descriptors = registrations.Select(r => r.Descriptor).ToArray();
    }

    public bool TryGetHandler(string method, string path, out EndpointRegistration endpoint) =>
        _map.TryGetValue((method.ToUpperInvariant(), path.ToUpperInvariant()), out endpoint!);
}
```
You can refine path normalization later; for now, keep it simple.

### 3.2 Endpoint dispatcher

Define:

```csharp
internal interface IEndpointDispatcher
{
    Task<Frame> HandleRequestAsync(Frame requestFrame, CancellationToken ct);
}
```

Implement `EndpointDispatcher` with minimal behavior:

1. Decode `requestFrame.Payload` into a small DTO carrying:

   * Method
   * Path
   * Headers (if you already have a format; if not, assume no headers in v0)
   * Body bytes

   For this step, you can stub the decoding as:

   * Payload = raw body bytes.
   * Method/Path are carried separately in the frame header or in a simple DTO; decide a minimal interim format and write it down (a candidate is sketched after this list).

2. Use `IEndpointCatalog.TryGetHandler(method, path, ...)`:

   * If not found, build a `RawResponse` with status 404 and an empty body.

3. If the handler implements `IRawStellaEndpoint`:

   * Instantiate it via DI (`IServiceProvider.GetRequiredService(handlerType)`).
   * Build a `RawRequestContext` with:
     * Method, Path, Headers, Body (`new MemoryStream(bodyBytes)` for now).
     * `CancellationToken` = `ct`.
   * Call `HandleAsync`.
   * Convert the `RawResponse` into a response frame payload.

4. If the handler implements `IStellaEndpoint<,>` (typed):

   * For now, **you can skip typed handling**, or wire a very simple JSON-based adapter if you want to unlock it early. The focus in this step is the raw path; typed adapters can come in the next iteration.

Return a `Frame` with:

* `Type = FrameType.Response`
* `CorrelationId` = `requestFrame.CorrelationId`
* `Payload` = the encoded response (status + body bytes).

No streaming, and no cancellation logic beyond passing `ct` through; the router won’t cancel yet.
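One hypothetical interim format (names invented here, not from the spec) would be a pair of DTOs that you serialize however you like for v0:

```csharp
// Hypothetical v0 wire DTOs for REQUEST/RESPONSE payloads; replace these once
// the real payload format is pinned down in specs.md.
internal sealed class InterimRequestPayload
{
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public byte[] Body { get; init; } = Array.Empty<byte>();
}

internal sealed class InterimResponsePayload
{
    public int StatusCode { get; init; }
    public byte[] Body { get; init; } = Array.Empty<byte>();
}
```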
---

## 4. Minimal handshake hosted service (using InMemory)

**Project:** `StellaOps.Microservice`
**Owner:** SDK agent

This is where the microservice actually “talks” to the router.

### 4.1 Define a microservice connection abstraction

Your SDK should not depend directly on InMemory; define an internal abstraction:

```csharp
internal interface IMicroserviceConnection
{
    Task StartAsync(CancellationToken ct);
    Task StopAsync(CancellationToken ct);
}
```

The implementation for this step will target the InMemory transport; later you can add TCP/TLS/RabbitMQ versions.

### 4.2 Implement the InMemory microservice connection

Assuming you have (or will have) an `IInMemoryRouter` (or similar) dev harness, implement:

```csharp
internal sealed class InMemoryMicroserviceConnection : IMicroserviceConnection
{
    private readonly IEndpointCatalog _catalog;
    private readonly IEndpointDispatcher _dispatcher;
    private readonly IOptions<StellaMicroserviceOptions> _options;
    private readonly IInMemoryRouterClient _routerClient; // dev-only abstraction

    public InMemoryMicroserviceConnection(
        IEndpointCatalog catalog,
        IEndpointDispatcher dispatcher,
        IOptions<StellaMicroserviceOptions> options,
        IInMemoryRouterClient routerClient)
    {
        _catalog = catalog;
        _dispatcher = dispatcher;
        _options = options;
        _routerClient = routerClient;
    }

    public async Task StartAsync(CancellationToken ct)
    {
        var opts = _options.Value;

        // Build HELLO payload from options + catalog.Descriptors
        var helloPayload = BuildHelloPayload(opts, _catalog.Descriptors);

        await _routerClient.ConnectAsync(opts, ct);
        await _routerClient.SendHelloAsync(helloPayload, ct);

        // Start background receive loop
        _ = Task.Run(() => ReceiveLoopAsync(ct), ct);
    }

    public Task StopAsync(CancellationToken ct)
    {
        // For now: ask routerClient to disconnect; finer handling later
        return _routerClient.DisconnectAsync(ct);
    }

    private async Task ReceiveLoopAsync(CancellationToken ct)
    {
        await foreach (var frame in _routerClient.GetIncomingFramesAsync(ct))
        {
            if (frame.Type == FrameType.Request)
            {
                var response = await _dispatcher.HandleRequestAsync(frame, ct);
                await _routerClient.SendFrameAsync(response, ct);
            }
            else
            {
                // Ignore other frame types in this minimal step
            }
        }
    }
}
```

`IInMemoryRouterClient` is whatever dev harness you build for the in-memory transport; the exact shape is not important for this step’s planning, only that it provides:

* `ConnectAsync`
* `SendHelloAsync`
* `GetIncomingFramesAsync` (an async stream of frames)
* `SendFrameAsync` for responses
* `DisconnectAsync`

### 4.3 Hosted service to bootstrap the connection

Implement `MicroserviceBootstrapHostedService`:

```csharp
internal sealed class MicroserviceBootstrapHostedService : IHostedService
{
    private readonly IMicroserviceConnection _connection;

    public MicroserviceBootstrapHostedService(IMicroserviceConnection connection)
    {
        _connection = connection;
    }

    public Task StartAsync(CancellationToken cancellationToken) =>
        _connection.StartAsync(cancellationToken);

    public Task StopAsync(CancellationToken cancellationToken) =>
        _connection.StopAsync(cancellationToken);
}
```

Wire `IMicroserviceConnection` to `InMemoryMicroserviceConnection` in DI for now:

```csharp
services.AddSingleton<IMicroserviceConnection, InMemoryMicroserviceConnection>();
```

In a later phase, you’ll swap this for transport-specific connectors.

---
## 5. End-to-end smoke test (InMemory only)

**Project:** `StellaOps.Microservice.Tests` + a minimal InMemory router test harness
**Owner:** test agent

Goal: prove that the minimal handshake & dispatch works in memory.

1. Build a trivial test microservice with a single handler:
```csharp
[StellaEndpoint("GET", "/ping")]
public sealed class PingEndpoint : IRawStellaEndpoint
{
    public Task<RawResponse> HandleAsync(RawRequestContext ctx)
    {
        var resp = new RawResponse { StatusCode = 200 };
        resp.Headers["Content-Type"] = "text/plain";
        // Async lambda so the ValueTask returned by Stream.WriteAsync fits
        // the Func<Stream, Task> signature of WriteBodyAsync.
        resp.WriteBodyAsync = async stream =>
            await stream.WriteAsync(Encoding.UTF8.GetBytes("pong"));
        return Task.FromResult(resp);
    }
}
```
2. Test harness. Spin up:

   * An instance of the microservice host (generic HostBuilder).
   * An in-memory “router” that:
     * Accepts HELLO from the microservice.
     * Sends a single REQUEST frame for `GET /ping`.
     * Receives the RESPONSE frame.

3. Assert:

   * The HELLO includes the `/ping` endpoint.
   * The REQUEST is dispatched to `PingEndpoint`.
   * The RESPONSE has status 200 and body “pong”.

This verifies that:

* `AddStellaMicroservice` wires discovery, catalog, dispatcher, and bootstrap.
* The microservice sends HELLO on connect.
* The microservice can handle at least one request via InMemory.

---

## 6. Done criteria for “minimal handshake & dispatch”

You can consider this step complete when:

* `StellaOps.Microservice` exposes:
  * Options.
  * Attribute & handler interfaces (raw + typed).
  * `AddStellaMicroservice` registering discovery, catalog, dispatcher, and the hosted service.
* The microservice can:
  * Discover endpoints via reflection.
  * Build a `HELLO` payload and send it over InMemory on startup.
  * Receive a `REQUEST` frame over InMemory.
  * Dispatch that request to the correct handler.
  * Return a `RESPONSE` frame.

Not yet required in this step:

* Streaming bodies.
* Heartbeats or health evaluation.
* Cancellation via CANCEL frames.
* Authority overrides for requiringClaims.

Those come in subsequent phases; right now you just want a working minimal vertical slice: an InMemory microservice that says “HELLO” and responds to one simple request.
docs/router/05-Step.md (Normal file, 554 lines)
@@ -0,0 +1,554 @@
For this step, the goal is: the gateway can accept an HTTP request, route it to **one** microservice over the **InMemory** transport, get a response, and return it to the client.

No health/heartbeat yet. No streaming yet. Just: HTTP → InMemory → microservice → InMemory → HTTP.

I’ll assume you’re still in the InMemory world and not touching TCP/UDP/RabbitMQ at this stage.

---

## 0. Preconditions

Before you start:

* `StellaOps.Router.Common` exists and exposes:
  * `EndpointDescriptor`, `ConnectionState`, `Frame`, `FrameType`, `TransportType`, `RoutingDecision`.
  * Interfaces: `IGlobalRoutingState`, `IRoutingPlugin`, `ITransportClient`.
* The `StellaOps.Microservice` minimal handshake & dispatch is in place (from “step 4”); the microservice can:
  * Discover endpoints.
  * Connect to an InMemory router client.
  * Send HELLO.
  * Receive REQUEST and send RESPONSE.
* The gateway project exists (`StellaOps.Gateway.WebService`) and runs as a basic ASP.NET Core app.

If anything in that list is not true, fix it first or adjust the plan accordingly.

---
## 1. Implement an InMemory transport “hub”

You need a simple in-process component that:

* Keeps track of “connections” from microservices.
* Delivers frames from the gateway to the correct microservice and back.

You can host this either:

* In a dedicated **test/support** assembly, or
* In the gateway project, marked as a “dev-only” transport.

For this step, keep it simple and in-memory.

### 1.1 Define an InMemory router hub

Conceptually:

```csharp
public interface IInMemoryRouterHub
{
    // Called by the microservice side to register a new connection
    Task<string> RegisterMicroserviceAsync(
        InstanceDescriptor instance,
        IReadOnlyList<EndpointDescriptor> endpoints,
        Func<Frame, Task> onFrameFromGateway,
        CancellationToken ct);

    // Called by the microservice when it wants to send a frame to the gateway
    Task SendFromMicroserviceAsync(string connectionId, Frame frame, CancellationToken ct);

    // Called by the gateway transport client when sending a frame to a microservice
    Task<Frame> SendFromGatewayAsync(string connectionId, Frame frame, CancellationToken ct);
}
```

Internally, the hub maintains per-connection data:

* `ConnectionId`
* `InstanceDescriptor`
* Endpoints
* The `onFrameFromGateway` delegate (the microservice receiver)

For minimal routing you can start by:

* Only supporting `SendFromGatewayAsync` for REQUEST and returning RESPONSE.
* Ignoring or stubbing heartbeat frames for now.

A sketch of such a hub follows.
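A minimal sketch, assuming the interface above; request/response correlation is done with one `TaskCompletionSource` per `CorrelationId`, which is enough for the single-request flow in this step:

```csharp
using System.Collections.Concurrent;

public sealed class InMemoryRouterHub : IInMemoryRouterHub
{
    private sealed record Connection(
        InstanceDescriptor Instance,
        IReadOnlyList<EndpointDescriptor> Endpoints,
        Func<Frame, Task> OnFrameFromGateway);

    private readonly ConcurrentDictionary<string, Connection> _connections = new();
    private readonly ConcurrentDictionary<Guid, TaskCompletionSource<Frame>> _pending = new();

    public Task<string> RegisterMicroserviceAsync(
        InstanceDescriptor instance,
        IReadOnlyList<EndpointDescriptor> endpoints,
        Func<Frame, Task> onFrameFromGateway,
        CancellationToken ct)
    {
        var connectionId = Guid.NewGuid().ToString();
        _connections[connectionId] = new Connection(instance, endpoints, onFrameFromGateway);
        return Task.FromResult(connectionId);
    }

    public Task SendFromMicroserviceAsync(string connectionId, Frame frame, CancellationToken ct)
    {
        // A RESPONSE completes the pending gateway call with the same CorrelationId.
        if (frame.Type == FrameType.Response &&
            _pending.TryRemove(frame.CorrelationId, out var tcs))
        {
            tcs.TrySetResult(frame);
        }
        return Task.CompletedTask;
    }

    public async Task<Frame> SendFromGatewayAsync(string connectionId, Frame frame, CancellationToken ct)
    {
        if (!_connections.TryGetValue(connectionId, out var connection))
            throw new InvalidOperationException($"Unknown connection '{connectionId}'.");

        var tcs = new TaskCompletionSource<Frame>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[frame.CorrelationId] = tcs;

        await connection.OnFrameFromGateway(frame);

        using var registration = ct.Register(() => tcs.TrySetCanceled(ct));
        return await tcs.Task;
    }
}
```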
### 1.2 Connect the microservice side

Your `InMemoryMicroserviceConnection` (from step 4) should call `RegisterMicroserviceAsync` on the hub when it sends HELLO:

* Get the `connectionId` back.
* Provide a handler `onFrameFromGateway` that:
  * Dispatches REQUEST frames via `IEndpointDispatcher`.
  * Sends RESPONSE frames back via `SendFromMicroserviceAsync`.

This is mostly microservice work; you should already have most of it outlined. A sketch of that hookup follows.
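A minimal sketch of the registration call, as a fragment inside `InMemoryMicroserviceConnection.StartAsync` (it assumes the hub interface above plus the `_catalog`/`_dispatcher` fields from step 4):

```csharp
// Inside InMemoryMicroserviceConnection.StartAsync, illustration only.
// Declared first so the callback can capture the id once registration completes.
string connectionId = string.Empty;

connectionId = await hub.RegisterMicroserviceAsync(
    instanceDescriptor,
    _catalog.Descriptors,
    onFrameFromGateway: async frame =>
    {
        if (frame.Type != FrameType.Request)
            return; // only REQUEST matters in this minimal step

        var response = await _dispatcher.HandleRequestAsync(frame, ct);
        await hub.SendFromMicroserviceAsync(connectionId, response, ct);
    },
    ct);
```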
---

## 2. Implement an InMemory `ITransportClient` in the gateway

Now focus on the gateway side.

**Project:** `StellaOps.Gateway.WebService` (or a small internal infra class in the same project)

### 2.1 `InMemoryTransportClient`

Implement `ITransportClient` using the `IInMemoryRouterHub`:

```csharp
public sealed class InMemoryTransportClient : ITransportClient
{
    private readonly IInMemoryRouterHub _hub;

    public InMemoryTransportClient(IInMemoryRouterHub hub)
    {
        _hub = hub;
    }

    public Task<Frame> SendRequestAsync(
        ConnectionState connection,
        Frame requestFrame,
        TimeSpan timeout,
        CancellationToken ct)
    {
        // connection.ConnectionId must be set when HELLO is processed
        return _hub.SendFromGatewayAsync(connection.ConnectionId, requestFrame, ct);
    }

    public Task SendCancelAsync(ConnectionState connection, Guid correlationId, string? reason = null)
        => Task.CompletedTask; // no-op at this stage

    public Task SendStreamingAsync(
        ConnectionState connection,
        Frame requestHeader,
        Stream requestBody,
        Func<Stream, Task> readResponseBody,
        PayloadLimits limits,
        CancellationToken ct)
        => throw new NotSupportedException("Streaming not implemented for InMemory in this step.");
}
```

For now:

* Ignore streaming.
* Ignore cancel.
* Just call `SendFromGatewayAsync` and get a response frame back.

### 2.2 Register it in DI

In the gateway `Program.cs` or a DI setup class:

```csharp
services.AddSingleton<IInMemoryRouterHub, InMemoryRouterHub>(); // your hub implementation
services.AddSingleton<ITransportClient, InMemoryTransportClient>();
```

You’ll later swap this for real transport clients (TCP, UDP, Rabbit), but for now everything uses InMemory.

---
## 3. Implement minimal `IGlobalRoutingState`
|
||||||
|
|
||||||
|
You now need the gateway’s internal view of:
|
||||||
|
|
||||||
|
* Which endpoints exist.
|
||||||
|
* Which connections serve them.
|
||||||
|
|
||||||
|
**Project:** `StellaOps.Gateway.WebService` or a small internal infra namespace.
|
||||||
|
|
||||||
|
### 3.1 In-memory implementation
|
||||||
|
|
||||||
|
Implement an `InMemoryGlobalRoutingState` something like:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
public sealed class InMemoryGlobalRoutingState : IGlobalRoutingState
|
||||||
|
{
|
||||||
|
private readonly object _lock = new();
|
||||||
|
private readonly Dictionary<(string, string), EndpointDescriptor> _endpoints = new();
|
||||||
|
private readonly List<ConnectionState> _connections = new();
|
||||||
|
|
||||||
|
public EndpointDescriptor? ResolveEndpoint(string method, string path)
|
||||||
|
{
|
||||||
|
lock (_lock)
|
||||||
|
{
|
||||||
|
_endpoints.TryGetValue((method, path), out var endpoint);
|
||||||
|
return endpoint;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
public IReadOnlyList<ConnectionState> GetConnectionsFor(
|
||||||
|
string serviceName,
|
||||||
|
string version,
|
||||||
|
string method,
|
||||||
|
string path)
|
||||||
|
{
|
||||||
|
lock (_lock)
|
||||||
|
{
|
||||||
|
return _connections
|
||||||
|
.Where(c =>
|
||||||
|
c.Instance.ServiceName == serviceName &&
|
||||||
|
c.Instance.Version == version &&
|
||||||
|
c.Endpoints.ContainsKey((method, path)))
|
||||||
|
.ToList();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Called when HELLO arrives from microservice
|
||||||
|
public void RegisterConnection(ConnectionState connection)
|
||||||
|
{
|
||||||
|
lock (_lock)
|
||||||
|
{
|
||||||
|
_connections.Add(connection);
|
||||||
|
foreach (var kvp in connection.Endpoints)
|
||||||
|
{
|
||||||
|
var key = kvp.Key; // (Method, Path)
|
||||||
|
var descriptor = kvp.Value;
|
||||||
|
// global endpoint map: any connection's descriptor is ok as "canonical"
|
||||||
|
_endpoints[(key.Method, key.Path)] = descriptor;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
You will refine this later; for minimal routing it's enough.
|
||||||
|
|
||||||
|
### 3.2 Hook HELLO to `IGlobalRoutingState`
|
||||||
|
|
||||||
|
In your InMemory router hub, when a microservice registers (HELLO):
|
||||||
|
|
||||||
|
* Create a `ConnectionState`:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
var conn = new ConnectionState
|
||||||
|
{
|
||||||
|
ConnectionId = generatedConnectionId,
|
||||||
|
Instance = instanceDescriptor,
|
||||||
|
Status = InstanceHealthStatus.Healthy,
|
||||||
|
LastHeartbeatUtc = DateTime.UtcNow,
|
||||||
|
AveragePingMs = 0,
|
||||||
|
TransportType = TransportType.Udp, // or TransportType.Tcp logically for InMemory
|
||||||
|
Endpoints = endpointDescriptors.ToDictionary(
|
||||||
|
e => (e.Method, e.Path),
|
||||||
|
e => e)
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
* Call `InMemoryGlobalRoutingState.RegisterConnection(conn)`.
|
||||||
|
|
||||||
|
This gives the gateway a routing view as soon as HELLO is processed.
|
||||||
|
|
||||||
|
---

## 4. Implement HTTP pipeline middlewares for routing

Now wire the gateway HTTP pipeline so that an incoming HTTP request is:

1. Resolved to a logical endpoint.
2. Routed to one connection.
3. Dispatched via InMemory transport.

### 4.1 EndpointResolutionMiddleware

This maps `(Method, Path)` to an `EndpointDescriptor`.

Create a middleware:

```csharp
public sealed class EndpointResolutionMiddleware
{
    private readonly RequestDelegate _next;

    public EndpointResolutionMiddleware(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context, IGlobalRoutingState routingState)
    {
        var method = context.Request.Method;
        var path = context.Request.Path.ToString();

        var endpoint = routingState.ResolveEndpoint(method, path);
        if (endpoint is null)
        {
            context.Response.StatusCode = StatusCodes.Status404NotFound;
            await context.Response.WriteAsync("Endpoint not found");
            return;
        }

        context.Items["Stella.EndpointDescriptor"] = endpoint;
        await _next(context);
    }
}
```

Register it in the pipeline:

```csharp
app.UseMiddleware<EndpointResolutionMiddleware>();
```

Place it before or after auth depending on your final pipeline; for minimal routing, the order is not critical.

### 4.2 Minimal routing plugin (pick first connection)

Implement a very naive `IRoutingPlugin` just to get things moving:

```csharp
public sealed class NaiveRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;

    public NaiveRoutingPlugin(IGlobalRoutingState state) => _state = state;

    public Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
    {
        var endpoint = context.Endpoint;

        var connections = _state.GetConnectionsFor(
            endpoint.ServiceName,
            endpoint.Version,
            endpoint.Method,
            endpoint.Path);

        var chosen = connections.FirstOrDefault();
        if (chosen is null)
            return Task.FromResult<RoutingDecision?>(null);

        var decision = new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = chosen,
            TransportType = chosen.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout
        };

        return Task.FromResult<RoutingDecision?>(decision);
    }
}
```

Register it:

```csharp
services.AddSingleton<IGlobalRoutingState, InMemoryGlobalRoutingState>();
services.AddSingleton<IRoutingPlugin, NaiveRoutingPlugin>();
```

### 4.3 RoutingDecisionMiddleware

This middleware grabs the endpoint descriptor and asks the routing plugin for a connection.

```csharp
public sealed class RoutingDecisionMiddleware
{
    private readonly RequestDelegate _next;

    public RoutingDecisionMiddleware(RequestDelegate next) => _next = next;

    public async Task Invoke(HttpContext context, IRoutingPlugin routingPlugin)
    {
        var endpoint = (EndpointDescriptor?)context.Items["Stella.EndpointDescriptor"];
        if (endpoint is null)
        {
            context.Response.StatusCode = 500;
            await context.Response.WriteAsync("Endpoint metadata missing");
            return;
        }

        var routingContext = new RoutingContext
        {
            Endpoint = endpoint,
            GatewayRegion = "not_used_yet", // you’ll fill this from GatewayNodeConfig later
            HttpContext = context
        };

        var decision = await routingPlugin.ChooseInstanceAsync(routingContext, context.RequestAborted);
        if (decision is null)
        {
            context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
            await context.Response.WriteAsync("No instances available");
            return;
        }

        context.Items["Stella.RoutingDecision"] = decision;
        await _next(context);
    }
}
```

Register it after `EndpointResolutionMiddleware`:

```csharp
app.UseMiddleware<RoutingDecisionMiddleware>();
```

### 4.4 TransportDispatchMiddleware

This middleware:

* Builds a REQUEST frame from HTTP.
* Uses `ITransportClient` to send it to the chosen connection.
* Writes the RESPONSE frame back to HTTP.

Minimal version (buffered, no streaming):

```csharp
public sealed class TransportDispatchMiddleware
{
    private readonly RequestDelegate _next;

    public TransportDispatchMiddleware(RequestDelegate next) => _next = next;

    public async Task Invoke(
        HttpContext context,
        ITransportClient transportClient)
    {
        var decision = (RoutingDecision?)context.Items["Stella.RoutingDecision"];
        if (decision is null)
        {
            context.Response.StatusCode = 500;
            await context.Response.WriteAsync("Routing decision missing");
            return;
        }

        // Read request body into memory (safe for minimal tests)
        byte[] bodyBytes;
        using (var ms = new MemoryStream())
        {
            await context.Request.Body.CopyToAsync(ms);
            bodyBytes = ms.ToArray();
        }

        var requestPayload = new MinimalRequestPayload
        {
            Method = context.Request.Method,
            Path = context.Request.Path.ToString(),
            Body = bodyBytes
            // headers can be ignored or added later
        };

        var requestFrame = new Frame
        {
            Type = FrameType.Request,
            CorrelationId = Guid.NewGuid(),
            Payload = SerializeRequestPayload(requestPayload)
        };

        var timeout = decision.EffectiveTimeout;
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
        cts.CancelAfter(timeout);

        Frame responseFrame;
        try
        {
            responseFrame = await transportClient.SendRequestAsync(
                decision.Connection,
                requestFrame,
                timeout,
                cts.Token);
        }
        catch (OperationCanceledException)
        {
            context.Response.StatusCode = StatusCodes.Status504GatewayTimeout;
            await context.Response.WriteAsync("Upstream timeout");
            return;
        }

        var responsePayload = DeserializeResponsePayload(responseFrame.Payload);

        context.Response.StatusCode = responsePayload.StatusCode;
        foreach (var (k, v) in responsePayload.Headers)
        {
            context.Response.Headers[k] = v;
        }
        if (responsePayload.Body is { Length: > 0 })
        {
            await context.Response.Body.WriteAsync(responsePayload.Body);
        }
    }
}
```

You’ll need minimal DTOs and serializers (`MinimalRequestPayload`, `MinimalResponsePayload`) just to move bytes. You can use JSON for now; protocol details will be formalized later.
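
A minimal sketch, assuming System.Text.Json and the exact shapes the middleware above consumes; the static-helper packaging is illustrative, not prescribed:

```csharp
using System.Text.Json;

public sealed class MinimalRequestPayload
{
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public byte[] Body { get; init; } = Array.Empty<byte>();
}

public sealed class MinimalResponsePayload
{
    public int StatusCode { get; init; }
    public Dictionary<string, string> Headers { get; init; } = new();
    public byte[]? Body { get; init; }
}

public static class MinimalPayloadSerializer
{
    public static byte[] SerializeRequestPayload(MinimalRequestPayload payload)
        => JsonSerializer.SerializeToUtf8Bytes(payload);

    public static MinimalResponsePayload DeserializeResponsePayload(byte[] bytes)
        => JsonSerializer.Deserialize<MinimalResponsePayload>(bytes)
           ?? throw new InvalidOperationException("Malformed response payload.");
}
```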

Register it after `RoutingDecisionMiddleware`:

```csharp
app.UseMiddleware<TransportDispatchMiddleware>();
```

At this point, you no longer need ASP.NET controllers for microservice endpoints; you can have a catch-all pipeline.

---

## 5. Minimal end-to-end test

**Owner:** test agent, probably in `StellaOps.Gateway.WebService.Tests` (plus a simple in-test host for the microservice)

Scenario:

1. Start an in-memory microservice host:

   * It uses `AddStellaMicroservice`.
   * It attaches to the same `IInMemoryRouterHub` instance as the gateway (created inside the test).
   * It has a single endpoint:

     * `[StellaEndpoint("GET", "/ping")]`
     * Handler returns “pong”.

2. Start the gateway host:

   * Inject the same `IInMemoryRouterHub`.
   * Use the middlewares: `EndpointResolutionMiddleware`, `RoutingDecisionMiddleware`, `TransportDispatchMiddleware`.

3. Invoke HTTP `GET /ping` against the gateway (using `WebApplicationFactory` or `TestServer`).

Assert:

* HTTP status 200.
* Body “pong”.
* The router hub saw:

  * At least one HELLO frame.
  * One REQUEST frame.
  * One RESPONSE frame.

This proves:

* HELLO → gateway routing state population.
* Endpoint resolution → connection selection.
* InMemory transport client used.
* Minimal dispatch works.
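
A condensed sketch of such a test (xUnit): `TestMicroserviceHost`, `TestGatewayHost`, and the hub's `SeenFrames` list are hypothetical test helpers — adapt them to whatever your hub and hosts actually expose.

```csharp
[Fact]
public async Task Ping_roundtrips_through_inmemory_transport()
{
    var hub = new InMemoryRouterHub();

    // Hypothetical helpers: a host that registers GET /ping -> "pong",
    // and a gateway host wired with the three middlewares above.
    using var microservice = await TestMicroserviceHost.StartAsync(hub);
    using var gateway = TestGatewayHost.Create(hub);

    var client = gateway.GetTestClient();
    var response = await client.GetAsync("/ping");

    Assert.Equal(HttpStatusCode.OK, response.StatusCode);
    Assert.Equal("pong", await response.Content.ReadAsStringAsync());

    // Assumes the hub records the frames it has routed.
    Assert.Contains(hub.SeenFrames, f => f.Type == FrameType.Hello);
    Assert.Contains(hub.SeenFrames, f => f.Type == FrameType.Request);
    Assert.Contains(hub.SeenFrames, f => f.Type == FrameType.Response);
}
```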

---

## 6. Done criteria for “Gateway: minimal routing using InMemory plugin”

You’re done with this step when:

* A microservice can register with the gateway via InMemory.
* The gateway’s `IGlobalRoutingState` knows about endpoints and connections.
* The HTTP pipeline:

  * Resolves an endpoint based on `(Method, Path)`.
  * Asks `IRoutingPlugin` for a connection.
  * Uses `ITransportClient` (InMemory) to send REQUEST and get RESPONSE.
  * Returns the mapped HTTP response to the client.

* You have at least one automated test showing:

  * `GET /ping` through gateway → InMemory → microservice → back to HTTP.

After this, you’re ready to:

* Swap `NaiveRoutingPlugin` with the health/region-sensitive plugin you defined.
* Implement heartbeat and latency.
* Later replace InMemory with TCP/UDP/Rabbit without changing the HTTP pipeline.

---

# docs/router/06-Step.md (new file, 541 lines)

For this step, you’re layering **liveness** and **basic routing intelligence** on top of the minimal handshake/dispatch you already designed.

Target outcome:

* Microservices send **heartbeats** over the existing connection.
* The router tracks **LastHeartbeatUtc**, **health status**, and **AveragePingMs** per connection.
* The router’s `IRoutingPlugin` uses **region + health + latency** to pick an instance.

No need to handle cancellation or streaming yet; just make routing decisions *not* naive.

---

## 0. Preconditions

Before starting, confirm:

* `StellaOps.Router.Common` already has:

  * `InstanceHealthStatus` enum.
  * `ConnectionState` with at least `Instance`, `Status`, `LastHeartbeatUtc`, `AveragePingMs`, `TransportType`.

* Minimal handshake is working:

  * Microservice sends HELLO (instance + endpoints).
  * Router creates `ConnectionState` & populates the global routing view.
  * Router can send REQUEST and receive RESPONSE via InMemory transport.

If any of that is incomplete, shore it up first.

---

## 1. Extend Common with heartbeat payloads

**Project:** `StellaOps.Router.Common`
**Owner:** Common dev

Add DTOs for heartbeat frames.

### 1.1 Heartbeat payload

```csharp
public sealed class HeartbeatPayload
{
    public string InstanceId { get; init; } = string.Empty;
    public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;

    // Optional basic metrics
    public int InFlightRequests { get; init; }
    public double ErrorRate { get; init; } // 0–1 range, optional
}
```

* This is application-level health; `Status` lets the microservice say “Degraded” / “Draining”.
* In-flight count and error rate can be used later for smarter routing; initially, you can ignore them.

### 1.2 Wire into frame model

Ensure `FrameType` includes `Heartbeat`:

```csharp
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    EndpointsUpdate = 3,
    Request = 4,
    RequestStreamData = 5,
    Response = 6,
    ResponseStreamData = 7,
    Cancel = 8
}
```

No behavior in Common; only DTOs and enums.

---

## 2. Microservice SDK: send heartbeats on the same connection

**Project:** `StellaOps.Microservice`
**Owner:** SDK dev

You already have `MicroserviceConnectionHostedService` doing HELLO and request dispatch. Now add heartbeat sending.

### 2.1 Introduce heartbeat options

Extend `StellaMicroserviceOptions` with simple settings:

```csharp
public sealed class StellaMicroserviceOptions
{
    // existing fields...
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
    public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by the router, not here
}
```

### 2.2 Internal heartbeat sender

Create an internal interface and implementation:

```csharp
internal interface IHeartbeatSource
{
    InstanceHealthStatus GetCurrentStatus();
    int GetInFlightRequests();
    double GetErrorRate();
}
```

For now you can implement a trivial `DefaultHeartbeatSource`, as sketched below:

* `GetCurrentStatus()` → `Healthy`.
* `GetInFlightRequests()` → 0.
* `GetErrorRate()` → 0.
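
A minimal sketch of that trivial implementation:

```csharp
internal sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;
    public int GetInFlightRequests() => 0;
    public double GetErrorRate() => 0d;
}
```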

Wire this in DI:

```csharp
services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
```

### 2.3 Add heartbeat loop to MicroserviceConnectionHostedService

In `StartAsync` of `MicroserviceConnectionHostedService`, after sending HELLO and subscribing to requests, start a background heartbeat loop.

Pseudo-plan:

```csharp
private Task? _heartbeatLoop;

public async Task StartAsync(CancellationToken ct)
{
    // existing HELLO logic...
    await _connection.SendHelloAsync(payload, ct);

    _connection.OnRequest(frame => HandleRequestAsync(frame, ct));

    _heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(ct), ct);
}

private async Task HeartbeatLoopAsync(CancellationToken outerCt)
{
    var opt = _options.Value;
    var interval = opt.HeartbeatInterval;
    var instanceId = opt.InstanceId;

    while (!outerCt.IsCancellationRequested)
    {
        var payload = new HeartbeatPayload
        {
            InstanceId = instanceId,
            Status = _heartbeatSource.GetCurrentStatus(),
            InFlightRequests = _heartbeatSource.GetInFlightRequests(),
            ErrorRate = _heartbeatSource.GetErrorRate()
        };

        var frame = new Frame
        {
            Type = FrameType.Heartbeat,
            CorrelationId = Guid.Empty, // or a reserved value
            Payload = SerializeHeartbeatPayload(payload)
        };

        await _connection.SendHeartbeatAsync(frame, outerCt);

        try
        {
            await Task.Delay(interval, outerCt);
        }
        catch (TaskCanceledException)
        {
            break;
        }
    }
}
```

You’ll need to extend `IMicroserviceConnection` with:

```csharp
Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
```

In this step, the mechanics are simple: every N seconds, push a heartbeat.

---

## 3. Router: accept heartbeats and update connection health

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have an InMemory router or similar structure that:

* Handles HELLO frames, creates `ConnectionState`.
* Maintains a global `IGlobalRoutingState`.

Now you need to:

* Handle HEARTBEAT frames.
* Update `ConnectionState.Status` and `LastHeartbeatUtc`.

### 3.1 Frame dispatch on router side

In your router’s InMemory server loop (or equivalent), add a case for `FrameType.Heartbeat`:

* Deserialize `HeartbeatPayload` from `frame.Payload`.
* Find the corresponding `ConnectionState` by `InstanceId` (and/or connection ID).
* Update:

  * `LastHeartbeatUtc` = `DateTime.UtcNow`.
  * `Status` = `payload.Status`.

You can add a method in your routing-state implementation:

```csharp
public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
    if (!_connections.TryGetValue(connectionId, out var conn))
        return;

    conn.LastHeartbeatUtc = DateTime.UtcNow;
    conn.Status = payload.Status;
}
```

The router’s transport server should know which `connectionId` delivered the frame; pass that along.

### 3.2 Detect stale connections (health degradation)

Add a background “health monitor” in the gateway:

* Reads `HeartbeatTimeout` from configuration (it can reuse the same default as the microservice or have separate router-side config).
* Periodically scans all `ConnectionState` entries:

  * If `Now - LastHeartbeatUtc > HeartbeatTimeout`, mark `Status = Unhealthy` (or remove the connection entirely).
  * If the connection drops (transport disconnect), also mark `Unhealthy` or remove.

This can be a simple `IHostedService`:

```csharp
internal sealed class ConnectionHealthMonitor : IHostedService
{
    private readonly IGlobalRoutingState _state;
    private readonly TimeSpan _heartbeatTimeout;
    private Task? _loop;
    private CancellationTokenSource? _cts;

    public ConnectionHealthMonitor(IGlobalRoutingState state, TimeSpan heartbeatTimeout)
    {
        _state = state;
        _heartbeatTimeout = heartbeatTimeout;
    }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        _loop = Task.Run(() => MonitorLoopAsync(_cts.Token), _cts.Token);
        return Task.CompletedTask;
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        _cts?.Cancel();
        if (_loop is not null)
            await _loop;
    }

    private async Task MonitorLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                _state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
                await Task.Delay(TimeSpan.FromSeconds(5), ct);
            }
        }
        catch (OperationCanceledException)
        {
            // normal shutdown
        }
    }
}
```

You’ll add a method like `MarkStaleConnectionsUnhealthy` on your `IGlobalRoutingState` implementation.
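
A sketch, assuming the same connection map and lock used by `UpdateHeartbeat` above:

```csharp
public void MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
{
    lock (_lock)
    {
        foreach (var conn in _connections.Values)
        {
            // No heartbeat within the timeout window -> stop routing to it.
            if (nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout)
                conn.Status = InstanceHealthStatus.Unhealthy;
        }
    }
}
```

Taking `nowUtc` as a parameter (rather than reading the clock inside) also makes the staleness logic trivial to unit-test.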

---

## 4. Track basic latency (AveragePingMs)

**Project:** Gateway + Common
**Owner:** Gateway dev

You want `AveragePingMs` per connection to inform routing decisions.

### 4.1 Decide where to measure

Simplest: measure “request → response” round-trip time in the gateway:

* When you send a `Request` frame to a specific connection, record:

  * `SentAtUtc[CorrelationId] = DateTime.UtcNow`.

* When you receive a `Response` frame with that correlation:

  * Compute `latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds`.
  * Discard the map entry.

Then update `ConnectionState.AveragePingMs`, e.g. with an exponential moving average:

```csharp
conn.AveragePingMs = conn.AveragePingMs <= 0
    ? latencyMs
    : conn.AveragePingMs * 0.8 + latencyMs * 0.2;
```

### 4.2 Where to hook this

In the **gateway-side transport client** (the InMemory implementation for now):

* When sending a `Request` frame, register `SentAtUtc` per correlation ID.
* When receiving a `Response` frame, compute the latency and call `IGlobalRoutingState.UpdateLatency(connectionId, latencyMs)`.

Add a method to the routing state:

```csharp
public void UpdateLatency(string connectionId, double latencyMs)
{
    if (_connections.TryGetValue(connectionId, out var conn))
    {
        if (conn.AveragePingMs <= 0)
            conn.AveragePingMs = latencyMs;
        else
            conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
    }
}
```

You can keep it simple; sophistication can come later.
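
For the InMemory client the hook is especially simple: `SendRequestAsync` awaits the whole round trip in one call, so no correlation map is needed yet. A sketch, assuming the routing state is injected into the client:

```csharp
public async Task<Frame> SendRequestAsync(
    ConnectionState connection,
    Frame requestFrame,
    TimeSpan timeout,
    CancellationToken ct)
{
    var sentAtUtc = DateTime.UtcNow;

    var response = await _hub.SendFromGatewayAsync(connection.ConnectionId, requestFrame, ct);

    // Round trip measured around the awaited call; frame-based transports
    // will need the SentAtUtc-per-correlation map described above instead.
    var latencyMs = (DateTime.UtcNow - sentAtUtc).TotalMilliseconds;
    _routingState.UpdateLatency(connection.ConnectionId, latencyMs);

    return response;
}
```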

---

## 5. Basic routing plugin implementation

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

You already have `IRoutingPlugin` defined. Now implement a concrete `BasicRoutingPlugin` that respects:

* Region (gateway region first, then neighbor tiers).
* Health (`Healthy` / `Degraded` only).
* Latency preference (`AveragePingMs`).

### 5.1 Inputs & data

`RoutingContext` should carry:

* `EndpointDescriptor` (with ServiceName, Version, Method, Path).
* `GatewayRegion` (from `GatewayNodeConfig.Region`).
* The `HttpContext` if you need headers (not needed for routing at this stage).

`IGlobalRoutingState` should provide:

* `GetConnectionsFor(serviceName, version, method, path)` returning all `ConnectionState`s that support that endpoint.

### 5.2 Basic algorithm

Algorithm outline:

```csharp
public sealed class BasicRoutingPlugin : IRoutingPlugin
{
    private readonly IGlobalRoutingState _state;
    private readonly string[] _neighborRegions; // configured, can be empty

    public BasicRoutingPlugin(IGlobalRoutingState state, string[] neighborRegions)
    {
        _state = state;
        _neighborRegions = neighborRegions;
    }

    public Task<RoutingDecision?> ChooseInstanceAsync(
        RoutingContext context,
        CancellationToken cancellationToken)
    {
        var endpoint = context.Endpoint;
        var candidates = _state.GetConnectionsFor(
            endpoint.ServiceName,
            endpoint.Version,
            endpoint.Method,
            endpoint.Path);

        if (candidates.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 1. Filter by health (only Healthy or Degraded)
        var healthy = candidates
            .Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
            .ToList();

        if (healthy.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 2. Partition by region tier
        var gatewayRegion = context.GatewayRegion;

        List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
        List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
        List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();

        var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
        if (chosenTier.Count == 0)
            return Task.FromResult<RoutingDecision?>(null);

        // 3. Sort by latency, then heartbeat freshness
        var ordered = chosenTier
            .OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
            .ThenByDescending(c => c.LastHeartbeatUtc)
            .ToList();

        var winner = ordered[0];

        // 4. Build decision
        return Task.FromResult<RoutingDecision?>(new RoutingDecision
        {
            Endpoint = endpoint,
            Connection = winner,
            TransportType = winner.TransportType,
            EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
        });
    }
}
```

Wire it into DI (the constructor now takes the neighbor-region list, so use a factory):

```csharp
services.AddSingleton<IRoutingPlugin>(sp => new BasicRoutingPlugin(
    sp.GetRequiredService<IGlobalRoutingState>(),
    neighborRegions: Array.Empty<string>())); // read neighbor regions from config later
```

And ensure `RoutingDecisionMiddleware` calls it.

---

## 6. Integrate health-aware routing into the HTTP pipeline

**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev

Update your `RoutingDecisionMiddleware` to:

* Use the final `IRoutingPlugin` instead of picking an arbitrary connection.
* Handle a null decision appropriately: if `ChooseInstanceAsync` returns `null`, respond with `503 Service Unavailable` or `502 Bad Gateway` and a generic error body, and log the incident.

Check that:

* The gateway’s region is injected (via `GatewayNodeConfig.Region`) into `RoutingContext.GatewayRegion`.
* The endpoint descriptor is resolved before you call the plugin.

---

## 7. Testing plan

**Project:** `StellaOps.Gateway.WebService.Tests`, `StellaOps.Microservice.Tests`
**Owner:** test agent

Write basic tests to lock in behavior.

### 7.1 Microservice heartbeat tests

In `StellaOps.Microservice.Tests`:

* Use a fake `IMicroserviceConnection` that records the frames sent (sketched below).
* Configure `HeartbeatInterval` to a small value (e.g. 100 ms).
* Start a Host with `AddStellaMicroservice`.
* Wait some time, then assert:

  * At least one HELLO frame was sent.
  * At least N HEARTBEAT frames were sent.
  * The HEARTBEAT payload has the correct `InstanceId` and `Status`.
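
A sketch of the recording fake; the `IMicroserviceConnection` surface (and the `HelloPayload` name) follows the members used earlier in this document, so adjust the signatures to your actual interface:

```csharp
internal sealed class RecordingConnection : IMicroserviceConnection
{
    public List<Frame> Sent { get; } = new();

    public Task SendHelloAsync(HelloPayload payload, CancellationToken ct)
    {
        Sent.Add(new Frame { Type = FrameType.Hello });
        return Task.CompletedTask;
    }

    public Task SendHeartbeatAsync(Frame frame, CancellationToken ct)
    {
        Sent.Add(frame);
        return Task.CompletedTask;
    }

    public void OnRequest(Func<Frame, Task> handler)
    {
        // No requests are pushed in the heartbeat tests.
    }
}

// After ~350 ms with HeartbeatInterval = 100 ms:
// Assert.True(fake.Sent.Count(f => f.Type == FrameType.Heartbeat) >= 2);
```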

### 7.2 Router health update tests

In `StellaOps.Gateway.WebService.Tests` (or a separate routing-state test project):

* Create an instance of your `IGlobalRoutingState` implementation.
* Add a connection via HELLO simulation.
* Call `UpdateHeartbeat` with a `HeartbeatPayload`.
* Assert:

  * `LastHeartbeatUtc` updated.
  * `Status` set to `Healthy` (or whatever the payload said).

* Advance time (inject a clock, or simply pass a later `nowUtc`) and call `MarkStaleConnectionsUnhealthy`:

  * Assert that `Status` changed to `Unhealthy`.

### 7.3 Routing plugin tests

Write tests for `BasicRoutingPlugin`:

* Case 1: multiple connections, some unhealthy — only Healthy/Degraded are considered.
* Case 2: multiple regions — instances in the gateway region win over others.
* Case 3: same region, different `AveragePingMs` — lower latency chosen.
* Case 4: same latency, different `LastHeartbeatUtc` — more recent heartbeat chosen.

These tests will give you confidence that the routing logic behaves as requested and stays stable as you add complexity later (streaming, cancellation, etc.).
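
An example for Case 3; the `TestConnections`, `TestRoutingState`, and `TestRoutingContexts` builders are hypothetical fixtures — construct `ConnectionState` and `RoutingContext` however your tests already do:

```csharp
[Fact]
public async Task Prefers_lower_latency_within_same_region()
{
    var fast = TestConnections.Healthy(region: "eu", pingMs: 5);
    var slow = TestConnections.Healthy(region: "eu", pingMs: 80);
    var state = TestRoutingState.With(fast, slow);

    var plugin = new BasicRoutingPlugin(state, neighborRegions: Array.Empty<string>());

    var decision = await plugin.ChooseInstanceAsync(
        TestRoutingContexts.For("GET", "/ping", gatewayRegion: "eu"),
        CancellationToken.None);

    Assert.Same(fast, decision!.Connection);
}
```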

---

## 8. Done criteria for “Add heartbeat, health, basic routing rules”

You can declare this step complete when:

* Microservices:

  * Periodically send HEARTBEAT frames on the same connection they use for requests.

* Gateway/router:

  * Updates `LastHeartbeatUtc` and `Status` on receipt of HEARTBEAT.
  * Marks stale or disconnected connections as `Unhealthy` (or removes them).
  * Tracks `AveragePingMs` per connection based on request/response round trips.

* Routing:

  * `IRoutingPlugin` chooses instances based on:

    * Strict `ServiceName` + `Version` + endpoint match.
    * Health (`Healthy`/`Degraded` only).
    * Region preference (gateway region > neighbors > others).
    * Latency (`AveragePingMs`), then heartbeat recency.

* Tests:

  * Validate heartbeats are sent and processed.
  * Validate stale connections are marked unhealthy.
  * Validate the routing plugin picks the expected instance in simple scenarios.

Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add **cancellation** and then **streaming + payload limits** on top of the same structures.

---

# docs/router/07-Step.md (new file, 378 lines)

For this step you’re wiring **request cancellation** end‑to‑end in the InMemory setup:

> Client / gateway gives up → gateway sends CANCEL → microservice cancels handler

No need to mix in streaming or payload limits yet; just enforce cancellation for timeouts and client disconnects.

---

## 0. Preconditions

Have in place:

* `FrameType.Cancel` in `StellaOps.Router.Common.FrameType`.
* `ITransportClient.SendCancelAsync(ConnectionState, Guid, string?)` in Common.
* Minimal InMemory path from HTTP → gateway → microservice (HELLO + REQUEST/RESPONSE) working.

If `FrameType.Cancel` or `SendCancelAsync` aren’t there yet, add them first.

---

## 1. Common: cancel payload (optional, but useful)

If you want reasons attached, add a DTO in Common:

```csharp
public sealed class CancelPayload
{
    public string Reason { get; init; } = string.Empty; // e.g. "ClientDisconnected", "Timeout"
}
```

You’ll serialize this into `Frame.Payload` when sending a CANCEL. If you don’t care about reasons yet, you can skip the payload and just use the correlation id.

No behavior in Common, just the shape.

---

## 2. Gateway: trigger CANCEL on client abort and timeout

### 2.1 Extend `TransportDispatchMiddleware`

You already:

* Generate a `correlationId`.
* Build a `FrameType.Request`.
* Call `ITransportClient.SendRequestAsync(...)` and await it.

Now:

1. Create a linked CTS that combines:

   * `HttpContext.RequestAborted`
   * The endpoint timeout

2. Register a callback on `RequestAborted` that sends a CANCEL with the same correlationId.

3. On `OperationCanceledException` where the HTTP token is not canceled (pure timeout), send a CANCEL once and return 504.

Sketch:

```csharp
public async Task Invoke(HttpContext context, ITransportClient transportClient)
{
    var decision = (RoutingDecision)context.Items[RouterHttpContextKeys.RoutingDecision]!;
    var correlationId = Guid.NewGuid();

    // build requestFrame as before

    var timeout = decision.EffectiveTimeout;
    using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
    linkedCts.CancelAfter(timeout);

    // fire-and-forget cancel on client disconnect
    context.RequestAborted.Register(() =>
    {
        _ = transportClient.SendCancelAsync(
            decision.Connection, correlationId, "ClientDisconnected");
    });

    Frame responseFrame;
    try
    {
        responseFrame = await transportClient.SendRequestAsync(
            decision.Connection,
            requestFrame,
            timeout,
            linkedCts.Token);
    }
    catch (OperationCanceledException) when (!context.RequestAborted.IsCancellationRequested)
    {
        // internal timeout
        await transportClient.SendCancelAsync(
            decision.Connection, correlationId, "Timeout");

        context.Response.StatusCode = StatusCodes.Status504GatewayTimeout;
        await context.Response.WriteAsync("Upstream timeout");
        return;
    }

    // existing response mapping goes here
}
```

Key points:

* The gateway sends CANCEL **as soon as**:

  * The client disconnects (RequestAborted).
  * Or the internal timeout triggers (catch branch).

* We do not need any global correlation registry on the gateway side; the middleware has the `correlationId` and `Connection`.

---

## 3. InMemory transport: propagate CANCEL to microservice

### 3.1 Implement `SendCancelAsync` in `InMemoryTransportClient` (gateway side)

In your gateway InMemory implementation:

```csharp
public Task SendCancelAsync(ConnectionState connection, Guid correlationId, string? reason = null)
{
    var payload = reason is null
        ? Array.Empty<byte>()
        : SerializeCancelPayload(new CancelPayload { Reason = reason });

    var frame = new Frame
    {
        Type = FrameType.Cancel,
        CorrelationId = correlationId,
        Payload = payload
    };

    return _hub.SendFromGatewayAsync(connection.ConnectionId, frame, CancellationToken.None);
}
```

`_hub.SendFromGatewayAsync` must route the frame to the microservice’s receive loop for that connection.

### 3.2 Hub routing

Ensure that when `SendFromGatewayAsync(connectionId, cancelFrame, ct)` is called, your `IInMemoryRouterHub` implementation enqueues that frame onto the microservice’s incoming channel (the `GetFramesForMicroserviceAsync` stream).

No extra logic; just treat CANCEL like REQUEST/HELLO in terms of delivery.
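
A sketch of that plumbing, assuming the hub keeps one unbounded `Channel<Frame>` per registered microservice connection:

```csharp
using System.Collections.Concurrent;
using System.Threading.Channels;

public sealed partial class InMemoryRouterHub
{
    private readonly ConcurrentDictionary<string, Channel<Frame>> _toMicroservice = new();

    // All gateway -> microservice frames (HELLO, REQUEST, CANCEL, ...) go
    // through the same enqueue; CANCEL needs no special treatment.
    private ValueTask DeliverToMicroserviceAsync(string connectionId, Frame frame, CancellationToken ct)
        => _toMicroservice.TryGetValue(connectionId, out var channel)
            ? channel.Writer.WriteAsync(frame, ct)
            : ValueTask.CompletedTask; // connection gone; drop the frame

    // The microservice receive loop drains the same channel.
    public IAsyncEnumerable<Frame> GetFramesForMicroserviceAsync(string connectionId, CancellationToken ct)
        => _toMicroservice[connectionId].Reader.ReadAllAsync(ct);
}
```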

---

## 4. Microservice: track in-flight requests

Now the microservice needs to know **which** request to cancel when a CANCEL arrives.

### 4.1 In-flight registry

In the microservice connection class (the one doing the receive loop):

```csharp
private readonly ConcurrentDictionary<Guid, RequestExecution> _inflight = new();

private sealed class RequestExecution
{
    public CancellationTokenSource Cts { get; init; } = default!;
    public Task ExecutionTask { get; init; } = default!;
}
```

When a `Request` frame arrives:

* Create a `CancellationTokenSource`.
* Start the handler using that token.
* Store both in `_inflight`.

Example pattern in `ReceiveLoopAsync`:

```csharp
private async Task ReceiveLoopAsync(CancellationToken ct)
{
    await foreach (var frame in _routerClient.GetIncomingFramesAsync(ct))
    {
        switch (frame.Type)
        {
            case FrameType.Request:
                HandleRequest(frame);
                break;

            case FrameType.Cancel:
                HandleCancel(frame);
                break;

            // other frame types...
        }
    }
}

private void HandleRequest(Frame frame)
{
    var cts = new CancellationTokenSource();
    var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cts.Token); // later link to global shutdown if needed

    var exec = new RequestExecution
    {
        Cts = cts,
        ExecutionTask = HandleRequestCoreAsync(frame, linkedCts.Token)
    };

    _inflight[frame.CorrelationId] = exec;

    _ = exec.ExecutionTask.ContinueWith(_ =>
    {
        _inflight.TryRemove(frame.CorrelationId, out _);
        cts.Dispose();
        linkedCts.Dispose();
    }, TaskScheduler.Default);
}
```

### 4.2 Wire CancellationToken into dispatcher

`HandleRequestCoreAsync` should:

* Deserialize the request payload.
* Build a `RawRequestContext` with `CancellationToken = token`.
* Pass that token through to:

  * `IRawStellaEndpoint.HandleAsync(context)` (via the context).
  * Or the typed handler adapter (`IStellaEndpoint<,>` / `IStellaEndpoint<TResponse>`), passing it explicitly.

Example pattern:

```csharp
private async Task HandleRequestCoreAsync(Frame frame, CancellationToken ct)
{
    var req = DeserializeRequestPayload(frame.Payload);

    if (!_catalog.TryGetHandler(req.Method, req.Path, out var registration))
    {
        var notFound = BuildNotFoundResponse(frame.CorrelationId);
        await _routerClient.SendFrameAsync(notFound, ct);
        return;
    }

    using var bodyStream = new MemoryStream(req.Body); // minimal case

    var ctx = new RawRequestContext
    {
        Method = req.Method,
        Path = req.Path,
        Headers = req.Headers,
        Body = bodyStream,
        CancellationToken = ct
    };

    var handler = (IRawStellaEndpoint)_serviceProvider.GetRequiredService(registration.HandlerType);

    var response = await handler.HandleAsync(ctx);

    var respFrame = BuildResponseFrame(frame.CorrelationId, response);
    await _routerClient.SendFrameAsync(respFrame, ct);
}
```

Now each handler sees a token that will be canceled when a CANCEL frame arrives.

### 4.3 Handle CANCEL frames

When a `Cancel` frame arrives:

```csharp
private void HandleCancel(Frame frame)
{
    if (_inflight.TryGetValue(frame.CorrelationId, out var exec))
    {
        exec.Cts.Cancel();
    }
    // Ignore if not found (e.g. already completed)
}
```

If you care about the reason, deserialize `CancelPayload` and log it; not required for behavior.

---

## 5. Handler guidance (for your Microservice docs)

In `Stella Ops Router – Microservice.md`, add simple rules devs must follow:

* Any long‑running or IO-heavy code in endpoints MUST:

  * Accept a `CancellationToken` (for typed endpoints).
  * Or use `RawRequestContext.CancellationToken` for raw endpoints.

* Always pass the token into:

  * DB calls.
  * File I/O and stream operations.
  * HTTP/gRPC calls to other services.

* Do not swallow `OperationCanceledException` unless there is a good reason; normally let it bubble up or treat it as a normal cancellation.

Concrete example for devs:

```csharp
[StellaEndpoint("POST", "/billing/slow-operation")]
public sealed class SlowEndpoint : IRawStellaEndpoint
{
    public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
    {
        // Correct: observe the token
        await Task.Delay(TimeSpan.FromMinutes(5), ctx.CancellationToken);

        return new RawResponse { StatusCode = 204 };
    }
}
```

---

## 6. Tests

### 6.1 Client abort → CANCEL

Test outline:

* Setup:

  * Gateway + microservice wired via the InMemory hub.
  * A microservice endpoint that waits on `Task.Delay(TimeSpan.FromMinutes(5), ctx.CancellationToken)`.

* Test:

  1. Start an HTTP request to `/slow`.
  2. After sending the request, cancel the client’s HttpClient token or close the connection.
  3. Assert:

     * The gateway’s InMemory transport sent a `FrameType.Cancel`.
     * The microservice’s handler is canceled (e.g. no longer running after a short time).
     * No response (or a partial one) is written; the HTTP side will produce whatever your test harness expects when the client aborts.

### 6.2 Gateway timeout → CANCEL

* Configure a small endpoint timeout (e.g. 100 ms).
* Have the endpoint sleep for 5 seconds with the token.
* Assert:

  * The gateway returns 504.
  * A Cancel frame was sent.
  * The handler is canceled (its task completes early).

These tests lock in the semantics so later additions (real transports, streaming) don’t regress cancellation.

---

## 7. Done criteria for “Add cancellation semantics (with InMemory)”

You can mark step 7 as complete when:

* For every routed request, the gateway knows its correlationId and connection.
* On client disconnect, the gateway sends a `FrameType.Cancel` with that correlationId.
* On internal timeout, the gateway sends a `FrameType.Cancel` and returns 504 to the client.
* The InMemory hub delivers CANCEL frames to the microservice.
* The microservice:

  * Tracks in‑flight requests by correlationId.
  * Cancels the proper `CancellationTokenSource` when CANCEL arrives.
  * Passes the token into handlers via `RawRequestContext` and typed adapters.

* At least one automated test proves that cancellation propagates from gateway to microservice and stops the handler.

Once this is done, you’ll be in good shape to add streaming & payload limits on top, because the cancel path is already wired end‑to‑end.

---

# docs/router/08-Step.md (new file, 501 lines)

For this step you’re teaching the system to handle **streams** instead of always buffering, and to **enforce payload limits** so the gateway can’t be DoS’d by large uploads. Still only using the InMemory transport.

Goal state:

* Gateway can stream HTTP request/response bodies to/from the microservice without buffering everything.
* Gateway enforces per‑call and global/in‑flight payload limits.
* Microservice sees a `Stream` on `RawRequestContext.Body` and reads from it.
* All of this works over the existing InMemory “connection”.

I’ll break it into concrete tasks.

---

## 0. Preconditions

Make sure you already have:

* Minimal InMemory routing working:

  * HTTP → gateway → InMemory → microservice → InMemory → HTTP.

* Cancellation wired (step 7):

  * `FrameType.Cancel`.
  * `ITransportClient.SendCancelAsync` implemented for InMemory.
  * Microservice uses `CancellationToken` in `RawRequestContext`.

Then layer streaming & limits on top.

---

## 1. Confirm / finalize Common primitives for streaming & limits

**Project:** `StellaOps.Router.Common`

Tasks:

1. Ensure `FrameType` has:

```csharp
public enum FrameType : byte
{
    Hello = 1,
    Heartbeat = 2,
    EndpointsUpdate = 3,
    Request = 4,
    RequestStreamData = 5,
    Response = 6,
    ResponseStreamData = 7,
    Cancel = 8
}
```

You may not *use* `RequestStreamData` / `ResponseStreamData` in the InMemory implementation initially if you choose the bridging approach, but having them defined keeps the model coherent.

2. Ensure `EndpointDescriptor` has:

```csharp
public bool SupportsStreaming { get; init; }
```

3. Ensure the `PayloadLimits` type exists (in Common or Config, but referenced by both):

```csharp
public sealed class PayloadLimits
{
    public long MaxRequestBytesPerCall { get; set; }        // per HTTP request
    public long MaxRequestBytesPerConnection { get; set; }  // per microservice connection
    public long MaxAggregateInflightBytes { get; set; }     // across all requests
}
```

4. Check that `ITransportClient` already contains:

```csharp
Task SendStreamingAsync(
    ConnectionState connection,
    Frame requestHeader,
    Stream requestBody,
    Func<Stream, Task> readResponseBody,
    PayloadLimits limits,
    CancellationToken ct);
```

If not, add it now (the implementation will be InMemory-only for this step).

No logic in Common; just shapes.

---

## 2. Gateway: payload budget tracker

You need a small service in the gateway that tracks in‑flight bytes to enforce limits.

**Project:** `StellaOps.Gateway.WebService`

### 2.1 Define a budget interface

```csharp
public interface IPayloadBudget
{
    bool TryReserve(string connectionId, Guid requestId, long bytes);
    void Release(string connectionId, Guid requestId, long bytes);
}
```

### 2.2 Implement a simple in-memory tracker

Implementation outline:

* Track:

  * `long _globalInflightBytes`.
  * `Dictionary<string, long> _perConnectionInflightBytes`.
  * `Dictionary<Guid, long> _perRequestInflightBytes`.

All updated under a lock, or with `ConcurrentDictionary` + `Interlocked`.

Logic for `TryReserve`:

* Compute the proposed totals:

  * `newGlobal = _globalInflightBytes + bytes`
  * `newConn = perConnection[connectionId] + bytes`
  * `newReq = perRequest[requestId] + bytes`

* If any exceed the configured limits (`PayloadLimits` from config), return `false`.
* Else commit the updates and return `true`.

`Release` subtracts the bytes, never going below zero.
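
A sketch of the tracker, taking the lock-based route for simplicity; reading `PayloadLimits` via `IOptions<RouterConfig>` is an assumption — pull it from wherever your config actually lives:

```csharp
using Microsoft.Extensions.Options;

public sealed class PayloadBudget : IPayloadBudget
{
    private readonly PayloadLimits _limits;
    private readonly object _lock = new();
    private long _globalInflightBytes;
    private readonly Dictionary<string, long> _perConnection = new();
    private readonly Dictionary<Guid, long> _perRequest = new();

    public PayloadBudget(IOptions<RouterConfig> options)
        => _limits = options.Value.PayloadLimits;

    public bool TryReserve(string connectionId, Guid requestId, long bytes)
    {
        lock (_lock)
        {
            var newGlobal = _globalInflightBytes + bytes;
            var newConn = _perConnection.GetValueOrDefault(connectionId) + bytes;
            var newReq = _perRequest.GetValueOrDefault(requestId) + bytes;

            if (newGlobal > _limits.MaxAggregateInflightBytes ||
                newConn > _limits.MaxRequestBytesPerConnection ||
                newReq > _limits.MaxRequestBytesPerCall)
                return false;

            _globalInflightBytes = newGlobal;
            _perConnection[connectionId] = newConn;
            _perRequest[requestId] = newReq;
            return true;
        }
    }

    public void Release(string connectionId, Guid requestId, long bytes)
    {
        lock (_lock)
        {
            _globalInflightBytes = Math.Max(0, _globalInflightBytes - bytes);
            _perConnection[connectionId] = Math.Max(0, _perConnection.GetValueOrDefault(connectionId) - bytes);
            _perRequest[requestId] = Math.Max(0, _perRequest.GetValueOrDefault(requestId) - bytes);
        }
    }
}
```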

Register it in DI:

```csharp
services.AddSingleton<IPayloadBudget, PayloadBudget>();
```

---

## 3. Gateway: choose buffered vs streaming path

Extend `TransportDispatchMiddleware` to branch on mode.

**Project:** `StellaOps.Gateway.WebService`

### 3.1 Decide mode

At the start of the middleware:

```csharp
var decision = (RoutingDecision)context.Items[RouterHttpContextKeys.RoutingDecision]!;
var endpoint = decision.Endpoint;
var limits = _options.Value.PayloadLimits; // from RouterConfig

var supportsStreaming = endpoint.SupportsStreaming;
var hasKnownLength = context.Request.ContentLength.HasValue;
var contentLength = context.Request.ContentLength ?? -1;

// Simple rule for now:
var useStreaming =
    supportsStreaming &&
    (!hasKnownLength || contentLength > limits.MaxRequestBytesPerCall);
```

* If `useStreaming == false`: use the buffered path with hard size checks.
* If `useStreaming == true`: use the streaming path (`ITransportClient.SendStreamingAsync`).

---

## 4. Gateway: buffered path with limits

**Still in `TransportDispatchMiddleware`**

### 4.1 Early 413 check

When `supportsStreaming == false`:

1. If `Content-Length` is known and exceeds the per-call limit, reject immediately:

```csharp
if (hasKnownLength && contentLength > limits.MaxRequestBytesPerCall)
{
    context.Response.StatusCode = StatusCodes.Status413PayloadTooLarge;
    return;
}
```

2. When reading the body into memory (see the sketch after this section):

   * Read in chunks.
   * Track `bytesReadThisCall`.
   * If `bytesReadThisCall > limits.MaxRequestBytesPerCall`, abort and return 413.

You don’t have to call `IPayloadBudget` for buffered mode yet; you can, but the hard per-call limit already protects RAM for this step.
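
A sketch of that chunked read, replacing the plain `CopyToAsync` from the earlier buffered version:

```csharp
using var ms = new MemoryStream();
var buffer = new byte[64 * 1024];
long bytesReadThisCall = 0;
int read;

while ((read = await context.Request.Body.ReadAsync(buffer, context.RequestAborted)) > 0)
{
    bytesReadThisCall += read;
    if (bytesReadThisCall > limits.MaxRequestBytesPerCall)
    {
        // The chunked guard catches bodies with no (or lying) Content-Length.
        context.Response.StatusCode = StatusCodes.Status413PayloadTooLarge;
        return;
    }
    ms.Write(buffer, 0, read);
}

var bodyBytes = ms.ToArray();
```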
|
||||||
|
|
||||||
|
Buffered path then proceeds as before:
|
||||||
|
|
||||||
|
* Build `MinimalRequestPayload` with full body.
|
||||||
|
* Send via `SendRequestAsync`.
|
||||||
|
* Map response.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Gateway: streaming path (InMemory)
|
||||||
|
|
||||||
|
This is the new part.
|
||||||
|
|
||||||
|
### 5.1 Use `ITransportClient.SendStreamingAsync`
|
||||||
|
|
||||||
|
In the `useStreaming == true` branch:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
var correlationId = Guid.NewGuid();
|
||||||
|
|
||||||
|
var headerPayload = new MinimalRequestPayload
|
||||||
|
{
|
||||||
|
Method = context.Request.Method,
|
||||||
|
Path = context.Request.Path.ToString(),
|
||||||
|
Headers = ExtractHeaders(context.Request),
|
||||||
|
Body = Array.Empty<byte>(), // streaming body will follow
|
||||||
|
IsStreaming = true // add this flag to your payload DTO
|
||||||
|
};
|
||||||
|
|
||||||
|
var headerFrame = new Frame
|
||||||
|
{
|
||||||
|
Type = FrameType.Request,
|
||||||
|
CorrelationId = correlationId,
|
||||||
|
Payload = SerializeRequestPayload(headerPayload)
|
||||||
|
};
|
||||||
|
|
||||||
|
using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
|
||||||
|
linkedCts.CancelAfter(decision.EffectiveTimeout);
|
||||||
|
|
||||||
|
// register cancel → SendCancelAsync (already done in step 7)
|
||||||
|
|
||||||
|
await _transportClient.SendStreamingAsync(
|
||||||
|
decision.Connection,
|
||||||
|
headerFrame,
|
||||||
|
context.Request.Body,
|
||||||
|
async responseBodyStream =>
|
||||||
|
{
|
||||||
|
// Copy microservice stream directly to HTTP response
|
||||||
|
await responseBodyStream.CopyToAsync(context.Response.Body, linkedCts.Token);
|
||||||
|
},
|
||||||
|
limits,
|
||||||
|
linkedCts.Token);
|
||||||
|
```
|
||||||
|
|
||||||
|
Key points:
|
||||||
|
|
||||||
|
* Streaming path does not buffer the whole body.
|
||||||
|
* Limits and cancellation are enforced inside `SendStreamingAsync`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. InMemory transport: streaming implementation
|
||||||
|
|
||||||
|
**Project:** gateway side InMemory `ITransportClient` implementation and InMemory router hub; microservice side connection.
|
||||||
|
|
||||||
|
For InMemory, you can model streaming via **bridged streams**: a producer/consumer pair in memory.
|
||||||
|
|
||||||
|
### 6.1 Add streaming call to InMemory client
|
||||||
|
|
||||||
|
In `InMemoryTransportClient`:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
public async Task SendStreamingAsync(
|
||||||
|
ConnectionState connection,
|
||||||
|
Frame requestHeader,
|
||||||
|
Stream httpRequestBody,
|
||||||
|
Func<Stream, Task> readResponseBody,
|
||||||
|
PayloadLimits limits,
|
||||||
|
CancellationToken ct)
|
||||||
|
{
|
||||||
|
await _hub.StreamFromGatewayAsync(
|
||||||
|
connection.ConnectionId,
|
||||||
|
requestHeader,
|
||||||
|
httpRequestBody,
|
||||||
|
readResponseBody,
|
||||||
|
limits,
|
||||||
|
ct);
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Expose `StreamFromGatewayAsync` on `IInMemoryRouterHub`:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
Task StreamFromGatewayAsync(
|
||||||
|
string connectionId,
|
||||||
|
Frame requestHeader,
|
||||||
|
Stream requestBody,
|
||||||
|
Func<Stream, Task> readResponseBody,
|
||||||
|
PayloadLimits limits,
|
||||||
|
CancellationToken ct);
|
||||||
|
```
|
||||||
|
|
||||||
|
### 6.2 InMemory hub streaming strategy (bridging style)
|
||||||
|
|
||||||
|
Inside `StreamFromGatewayAsync`:
|
||||||
|
|
||||||
|
1. Create a **pair of connected streams** for request body:
|
||||||
|
|
||||||
|
* e.g., a custom `ProducerConsumerStream` built on a `Channel<byte[]>` or `System.IO.Pipelines`.
|
||||||
|
* “Producer” side (writer) will be fed from HTTP.
|
||||||
|
* “Consumer” side will be given to the microservice as `RawRequestContext.Body`.
|
||||||
|
|
||||||
|
2. Create a **pair of connected streams** for response body:
|
||||||
|
|
||||||
|
* “Consumer” side will be used in `readResponseBody` to write to HTTP.
|
||||||
|
* “Producer” side will be given to the microservice handler to write response body.
|
||||||
|
|
||||||
|
3. On the microservice side:
|
||||||
|
|
||||||
|
* Build a `RawRequestContext` with `Body = requestBodyConsumerStream` and `CancellationToken = ct`.
|
||||||
|
* Dispatch to the endpoint handler as usual.
|
||||||
|
* Have the handler’s `RawResponse.WriteBodyAsync` pointed at `responseBodyProducerStream`.
|
||||||
|
|
||||||
|
4. Parallel tasks:
|
||||||
|
|
||||||
|
* Task 1: Copy HTTP → `requestBodyProducerStream` in chunks, enforcing `PayloadLimits` (see next section).
|
||||||
|
* Task 2: Execute the handler, which reads from `Body` and writes to `responseBodyProducerStream`.
|
||||||
|
* Task 3: Copy `responseBodyConsumerStream` → HTTP via `readResponseBody`.
|
||||||
|
|
||||||
|
5. Propagate cancellation:
|
||||||
|
|
||||||
|
* If `ct` is canceled (client disconnect/timeout/payload limit breach):
|
||||||
|
|
||||||
|
* Stop HTTP→requestBody copy.
|
||||||
|
* Signal stream completion / cancellation to handler.
|
||||||
|
* Handler should see cancellation via `CancellationToken`.
|
||||||
|
|
||||||
|
Because this is InMemory, you don’t *have* to materialize explicit `RequestStreamData` frames; you only need the behavior. Real transports will implement the same semantics with actual frames.
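
A minimal sketch of one bridged pair, assuming `System.IO.Pipelines` (the `Pipe` type gives you backpressure for free; names are illustrative):

```csharp
using System.IO.Pipelines;

// One bridged pair per direction; pauseWriterThreshold provides natural backpressure.
var requestPipe = new Pipe(new PipeOptions(pauseWriterThreshold: 256 * 1024));

Stream requestBodyProducerStream = requestPipe.Writer.AsStream(); // gateway copies HTTP bytes in
Stream requestBodyConsumerStream = requestPipe.Reader.AsStream(); // handler reads RawRequestContext.Body

// Completing the writer signals EOF to the reader; the response direction
// uses the same pattern with the roles swapped.
```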

---

## 7. Enforce payload limits in the streaming copy

Still in `StreamFromGatewayAsync` / the InMemory side:

### 7.1 HTTP → microservice copy with budget

In Task 1:

```csharp
var buffer = new byte[64 * 1024];
int read;
var requestId = requestHeader.CorrelationId;
var connectionId = connectionIdFromArgs;
var totalBytesReserved = 0L;

while ((read = await httpRequestBody.ReadAsync(buffer, 0, buffer.Length, ct)) > 0)
{
    if (!_budget.TryReserve(connectionId, requestId, read))
    {
        // Limit exceeded: signal failure (or call SendCancelAsync)
        if (_cancelCallback is not null)
            await _cancelCallback(requestId, "PayloadLimitExceeded");
        break;
    }

    totalBytesReserved += read;
    await requestBodyProducerStream.WriteAsync(buffer.AsMemory(0, read), ct);
}

// After the loop, release whatever was reserved
_budget.Release(connectionId, requestId, totalBytesReserved);
await requestBodyProducerStream.FlushAsync(ct);
await requestBodyProducerStream.DisposeAsync();
```

If `TryReserve` fails:

* Stop reading further bytes.
* Trigger cancellation downstream:

  * Either call the existing `SendCancelAsync` path.
  * Or signal completion with an error and let the handler catch the cancellation.

The gateway side should then translate this into a 413 or 503 for the client.
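
A minimal sketch of that translation, assuming the copy loop reports which limit tripped (flag names are illustrative):

```csharp
// Map the budget failure onto an HTTP status, provided nothing was written yet.
if (!context.Response.HasStarted)
{
    context.Response.StatusCode = perCallLimitExceeded
        ? StatusCodes.Status413PayloadTooLarge      // this single request is too big
        : StatusCodes.Status503ServiceUnavailable;  // aggregate in-flight budget exhausted
}
```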

### 7.2 Response copy

The response path doesn’t need budget tracking (the danger is inbound to the gateway); but if you want symmetry, you can also enforce a max outbound size.

For now, just stream microservice → HTTP through `readResponseBody` until EOF or cancellation.

---

## 8. Microservice side: streaming-aware `RawRequestContext.Body`

Your streaming bridging already gives the handler a `Stream` that reads what the gateway sends:

* No changes required in handler interfaces.
* You only need to ensure:

  * `RawRequestContext.Body` **may be non-seekable**.
  * Handlers know they must treat it as a forward-only stream.

Guidance for devs in `Microservice.md`:

* For binary uploads or large files, implement `IRawStellaEndpoint` and read incrementally:

```csharp
[StellaEndpoint("POST", "/billing/invoices/upload")]
public sealed class InvoiceUploadEndpoint : IRawStellaEndpoint
{
    public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
    {
        var buffer = new byte[64 * 1024];
        int read;
        while ((read = await ctx.Body.ReadAsync(buffer.AsMemory(0, buffer.Length), ctx.CancellationToken)) > 0)
        {
            // Process chunk
        }

        return new RawResponse { StatusCode = 204 };
    }
}
```

---

## 9. Tests

**Scope:** still InMemory, but now streaming & limits.

### 9.1 Streaming happy path

* Setup:

  * Endpoint with `SupportsStreaming = true`.
  * An `IRawStellaEndpoint` that:

    * Counts total bytes read from `ctx.Body`.
    * Returns 200.

* Test:

  * Send an HTTP POST with a body larger than `MaxRequestBytesPerCall`, but with streaming enabled.
  * Assert:

    * The gateway does **not** buffer the entire body in one array (you can assert via instrumentation, or at least confirm no 413).
    * The handler sees the full number of bytes.
    * The response is 200.

### 9.2 Per-call limit breach

* Configure:

  * `SupportsStreaming = false` (or use streaming but set a low `MaxRequestBytesPerCall`).

* Test:

  * Send a body larger than the limit.
  * Assert:

    * The gateway responds 413.
    * The handler is not invoked at all.

### 9.3 Global/in-flight limit breach

* Configure:

  * `MaxAggregateInflightBytes` very low (e.g. 1 MB).

* Test:

  * Start multiple concurrent streaming requests that each try to send more than the allowed total.
  * Assert:

    * Some of them get a CANCEL / error (413 or 503).
    * `IPayloadBudget` denies reservations and releases resources correctly.

---

## 10. Done criteria for “Add streaming & payload limits (InMemory)”

You’re done with this step when:

* Gateway:

  * Chooses buffered vs streaming based on `EndpointDescriptor.SupportsStreaming` and size.
  * Enforces `MaxRequestBytesPerCall` for buffered requests (413 on violation).
  * Uses `ITransportClient.SendStreamingAsync` for streaming.
  * Has an `IPayloadBudget` preventing excessive in-flight payload accumulation.

* InMemory transport:

  * Implements `SendStreamingAsync` by bridging HTTP streams to microservice handlers and back.
  * Enforces payload limits while copying.

* Microservice:

  * Receives a functional `Stream` in `RawRequestContext.Body`.
  * Can implement an `IRawStellaEndpoint` that reads incrementally for large payloads.

* Tests:

  * Demonstrate a streaming endpoint works for large payloads.
  * Demonstrate per-call and aggregate limits are respected and cause rejections/cancellations.

After this, you can reuse the same semantics when you implement real transports (TCP/TLS/RabbitMQ), with InMemory as your reference implementation.

docs/router/09-Step.md
@@ -0,0 +1,562 @@

For this step you’re taking the protocol you already proved with InMemory and putting it on real transports:

* TCP (baseline)
* Certificate/TLS (secure TCP)
* UDP (small, non-streaming)
* RabbitMQ

The idea: every plugin implements the same `Frame` semantics (HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL, plus streaming where supported), and the gateway/microservices don’t change their business logic at all.

I’ll structure this as a sequence of sub-steps you can execute in order.

---

## 0. Preconditions

Before you start adding real transports, make sure:

* The frame model is stable in `StellaOps.Router.Common`:

  * `Frame`, `FrameType`, `TransportType`.

* Microservice and gateway code use **only**:

  * `ITransportClient` to send (gateway side).
  * `ITransportServer` / connection abstractions to receive (gateway side).
  * `IMicroserviceConnection` + `ITransportClient` under the hood (microservice side).

* The InMemory transport is working with:

  * HELLO
  * REQUEST / RESPONSE
  * CANCEL
  * Streaming & payload limits (step 8)

If any code still talks directly to `InMemoryRouterHub` from app logic, hide it behind the `ITransportClient` / `ITransportServer` abstractions first.

---

## 1. Freeze the wire protocol and serializer

**Owner:** protocol / infra dev

Before touching sockets or RabbitMQ, lock down **how a `Frame` is encoded** on the wire. This must be consistent across all transports except InMemory (which can cheat a bit internally).

### 1.1 Frame header

Define a simple binary header; for example:

* 1 byte: `FrameType`
* 16 bytes: `CorrelationId` (`Guid`)
* 4 bytes: payload length (`int32`; big- or little-endian, but be consistent)

Total header = 21 bytes. Then `payloadLength` bytes follow.

You can evolve this later, but start with something simple.
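
A minimal sketch of writing that header, assuming little-endian length and the field order above (the helper is illustrative, not part of the spec):

```csharp
using System.Buffers.Binary;

// Writes the 21-byte header described above into `dest` (dest.Length must be >= 21).
static void WriteHeader(Span<byte> dest, FrameType type, Guid correlationId, int payloadLength)
{
    dest[0] = (byte)type;                           // 1 byte:  FrameType
    correlationId.TryWriteBytes(dest.Slice(1, 16)); // 16 bytes: CorrelationId
    BinaryPrimitives.WriteInt32LittleEndian(dest.Slice(17, 4), payloadLength); // 4 bytes: length
}
```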

### 1.2 Frame serializer

In a small shared, **non-ASP.NET** assembly (either Common or a new `StellaOps.Router.Protocol` library), implement:

```csharp
public interface IFrameSerializer
{
    void WriteFrame(Frame frame, Stream stream, CancellationToken ct);
    Task WriteFrameAsync(Frame frame, Stream stream, CancellationToken ct);

    Frame ReadFrame(Stream stream, CancellationToken ct);
    Task<Frame> ReadFrameAsync(Stream stream, CancellationToken ct);
}
```

Implementation:

* Writes the header, then the payload.
* Reads the header, then the payload; throws on EOF.

For payloads (HELLO, HEARTBEAT, etc.), use one encoding consistently (e.g. `System.Text.Json` for now) and **centralize** DTO ⇒ `byte[]` conversions:

```csharp
public static class PayloadCodec
{
    public static byte[] Encode<T>(T payload) => JsonSerializer.SerializeToUtf8Bytes(payload);
    public static T Decode<T>(byte[] bytes) => JsonSerializer.Deserialize<T>(bytes)!;
}
```

All transports use `IFrameSerializer` + `PayloadCodec`.

---

## 2. Introduce a transport registry / resolver

**Projects:** gateway + microservice
**Owner:** infra dev

You need a way to map `TransportType` to a concrete plugin.

### 2.1 Gateway side

Define:

```csharp
public interface ITransportClientResolver
{
    ITransportClient GetClient(TransportType transportType);
}

public interface ITransportServerFactory
{
    ITransportServer CreateServer(TransportType transportType);
}
```

The initial implementation registers the available clients:

```csharp
public sealed class TransportClientResolver : ITransportClientResolver
{
    private readonly IServiceProvider _sp;

    public TransportClientResolver(IServiceProvider sp) => _sp = sp;

    public ITransportClient GetClient(TransportType transportType) =>
        transportType switch
        {
            TransportType.Tcp         => _sp.GetRequiredService<TcpTransportClient>(),
            TransportType.Certificate => _sp.GetRequiredService<TlsTransportClient>(),
            TransportType.Udp         => _sp.GetRequiredService<UdpTransportClient>(),
            TransportType.RabbitMq    => _sp.GetRequiredService<RabbitMqTransportClient>(),
            _ => throw new NotSupportedException($"Transport {transportType} not supported.")
        };
}
```

Then in `TransportDispatchMiddleware`, instead of injecting a single `ITransportClient`, inject `ITransportClientResolver` and choose:

```csharp
var client = clientResolver.GetClient(decision.TransportType);
```

### 2.2 Microservice side

On the microservice, you can do something similar:

```csharp
internal interface IMicroserviceTransportConnector
{
    Task ConnectAsync(StellaMicroserviceOptions options, CancellationToken ct);
}
```

Implement one per transport type; later, `StellaMicroserviceOptions.Routers` will determine which transport to use for each router endpoint.

---

## 3. Implement plugin 1: TCP

Start with TCP; it’s the most straightforward and will largely mirror your InMemory behavior.

### 3.1 Gateway: `TcpTransportServer`

**Project:** `StellaOps.Gateway.WebService` or a transport sub-namespace.

Responsibilities:

* Listen on a configured TCP port (e.g. from `RouterConfig`).
* Accept connections, each mapping to a `ConnectionId`.
* For each connection, start a background receive loop:

  * Use `IFrameSerializer.ReadFrameAsync` on a `NetworkStream`.
  * On `FrameType.Hello`:

    * Deserialize the HELLO payload.
    * Build a `ConnectionState` and register it with `IGlobalRoutingState`.

  * On `FrameType.Heartbeat`:

    * Update the heartbeat for that `ConnectionId`.

  * On `FrameType.Response` or `ResponseStreamData`:

    * Push the frame into the gateway’s correlation / streaming handler (similar to the InMemory path).

  * On `FrameType.Cancel` (rare from a microservice):

    * Optionally implement; can be ignored for now.

* Provide a sending API to the matching `TcpTransportClient` (gateway-side) using `WriteFrameAsync`.

You will likely have:

* A `TcpConnectionContext` per connected microservice:

  * Holds `ConnectionId`, `TcpClient`, `NetworkStream`, and `TaskCompletionSource` maps for correlation IDs.

### 3.2 Gateway: `TcpTransportClient` (gateway-side, to microservices)

Implements `ITransportClient`:

* `SendRequestAsync` (see the sketch after this list):

  * Given a `ConnectionState`:

    * Get the associated `TcpConnectionContext`.
    * Register a `TaskCompletionSource<Frame>` keyed by `CorrelationId`.
    * Call `WriteFrameAsync(requestFrame)` on the connection’s stream.
    * Await the TCS, which the receive loop completes when a `Response` frame arrives.

* `SendStreamingAsync`:

  * Write a `FrameType.Request` header.
  * Read from `BudgetedRequestStream` in chunks. For the TCP plugin you can either:

    * Use `RequestStreamData` frames with chunk payloads, or
    * Keep the simple bridging approach and send a single `Request` with all body bytes.

  * Since you already validated streaming semantics with InMemory, you can decide: for the first version of TCP, **only support buffered data**, then add chunk frames later.

* `SendCancelAsync`:

  * Write a `FrameType.Cancel` frame with the same `CorrelationId`.
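
A minimal sketch of that correlation tracking, assuming a pending-request map and a per-connection stream (names are illustrative):

```csharp
using System.Collections.Concurrent;

private readonly ConcurrentDictionary<Guid, TaskCompletionSource<Frame>> _pending = new();

public async Task<Frame> SendRequestAsync(TcpConnectionContext conn, Frame request, CancellationToken ct)
{
    var tcs = new TaskCompletionSource<Frame>(TaskCreationOptions.RunContinuationsAsynchronously);
    _pending[request.CorrelationId] = tcs;
    try
    {
        await _serializer.WriteFrameAsync(request, conn.Stream, ct);
        using (ct.Register(() => tcs.TrySetCanceled(ct)))
            return await tcs.Task; // completed by the receive loop
    }
    finally
    {
        _pending.TryRemove(request.CorrelationId, out _);
    }
}

// In the receive loop, on FrameType.Response:
// if (_pending.TryGetValue(frame.CorrelationId, out var waiter)) waiter.TrySetResult(frame);
```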

### 3.3 Microservice: `TcpTransportClientConnection`

**Project:** `StellaOps.Microservice`

Responsibilities on the microservice side:

* For each `RouterEndpointConfig` where `TransportType == Tcp`:

  * Open a `TcpClient` to `Host:Port`.
  * Use `IFrameSerializer` to send:

    * a `HELLO` frame (payload = identity + descriptors),
    * periodic `HEARTBEAT` frames,
    * `RESPONSE` frames for incoming `REQUEST`s.

* Receive loop:

  * `ReadFrameAsync` from the `NetworkStream`.
  * On `REQUEST`:

    * Dispatch through `IEndpointDispatcher`.
    * For minimal streaming, treat the payload as buffered; you’ll align with streaming later.

  * On `CANCEL`:

    * Use the correlation ID to cancel the `CancellationTokenSource` you already maintain.

This is conceptually the same as InMemory but using real sockets.

---

## 4. Implement plugin 2: Certificate/TLS

Build TLS on top of the TCP plugin; do not fork logic unnecessarily.

### 4.1 Gateway: `TlsTransportServer`

* Wrap accepted `TcpClient` sockets in `SslStream`.
* Load the server certificate from configuration (for the node/region).
* Authenticate the client if you want mutual TLS.

Structure:

* Reuse almost all of the `TcpTransportServer` logic, but use `SslStream` instead of `NetworkStream` as the underlying stream for `IFrameSerializer`.

### 4.2 Microservice: `TlsTransportClientConnection`

* Instead of plain `TcpClient.GetStream`, wrap the stream in `SslStream`.
* Authenticate the server (hostname & certificate).
* Optional: present a client certificate.

Configuration fields in `RouterEndpointConfig` (or a TLS-specific sub-config):

* `UseTls` / `TransportType.Certificate`.
* Certificate paths / thumbprints / validation parameters.

At the SDK level, you just treat it as a different transport type; the protocol remains identical.
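
A minimal sketch of the client-side handshake on the microservice, assuming `System.Net.Security` (variable names are illustrative):

```csharp
using System.Net.Security;
using System.Net.Sockets;

var tcp = new TcpClient();
await tcp.ConnectAsync(host, port, ct);

// SslStream wraps the NetworkStream; everything above it (IFrameSerializer) is unchanged.
var ssl = new SslStream(tcp.GetStream(), leaveInnerStreamOpen: false);
await ssl.AuthenticateAsClientAsync(new SslClientAuthenticationOptions
{
    TargetHost = host,               // validated against the server certificate
    ClientCertificates = clientCerts // optional: mutual TLS
}, ct);

// Hand `ssl` to the frame serializer exactly as you would a NetworkStream.
```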

---

## 5. Implement plugin 3: UDP (small, non-streaming)

UDP is only for small, bounded payloads. No streaming, best-effort delivery.

### 5.1 Constraints

* Use UDP **only** for buffered, small-payload endpoints.
* No streaming (`SupportsStreaming` must be `false` for UDP endpoints).
* No guarantee of delivery or ordering; the caller must tolerate occasional failures/timeouts.

### 5.2 Gateway: `UdpTransportServer`

Responsibilities:

* Listen on a UDP port.
* Parse each incoming datagram as a full `Frame`:

  * `FrameType.Hello`:

    * Register a “logical connection” keyed by `(remoteEndpoint)` and `InstanceId`.

  * `FrameType.Heartbeat`:

    * Update health for that logical connection.

  * `FrameType.Response`:

    * Use `CorrelationId` and the “connectionId” to complete a `TaskCompletionSource`, as with TCP.

Because UDP is connectionless, your `ConnectionId` can be:

* A composite of microservice identity + remote endpoint, e.g. `"{instanceId}@{ip}:{port}"`.

### 5.3 Gateway: `UdpTransportClient` (gateway-side)

`SendRequestAsync` (sketched below):

* Serialize the `Frame` to a `byte[]`.
* Send via `UdpClient.SendAsync` to the remote endpoint from `ConnectionState`.
* Start a timer:

  * Wait for a `Response` datagram with a matching `CorrelationId`.
  * If none arrives within the timeout → throw `OperationCanceledException`.
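
A minimal sketch of that send-and-wait, reusing the same pending-TCS pattern as TCP (names are illustrative):

```csharp
// Inside SendRequestAsync:
var bytes = SerializeFrame(requestFrame); // same wire format: header + payload

var tcs = new TaskCompletionSource<Frame>(TaskCreationOptions.RunContinuationsAsynchronously);
_pending[requestFrame.CorrelationId] = tcs;

await _udp.SendAsync(bytes, remoteEndPoint, ct);

// Turn "no response" into OperationCanceledException via a linked timeout.
using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
timeoutCts.CancelAfter(timeout);
using (timeoutCts.Token.Register(() => tcs.TrySetCanceled(timeoutCts.Token)))
    return await tcs.Task; // completed by the receive loop on a matching datagram
```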

`SendStreamingAsync`:

* For this first iteration, **throw `NotSupportedException`**.
* The router should not route streaming endpoints over UDP; your routing config should enforce that.

`SendCancelAsync`:

* Optionally send a CANCEL datagram; in practice, if requests are small, this is less useful. You can still implement it for symmetry.

### 5.4 Microservice: UDP connection

On the microservice side:

* A single `UdpClient` bound to a local port.
* For each configured router (host/port):

  * HELLO: send a `FrameType.Hello` datagram.
  * HEARTBEAT: send periodic `FrameType.Heartbeat` datagrams.
  * REQUEST handling: with TCP the microservice dials out and the gateway sends requests back over that connection; with UDP you must choose the direction explicitly. Most likely the microservice listens on its port and the gateway sends request datagrams to it, so invert the roles if needed and document the choice.

Given the complexity and limited utility, you can treat UDP as an “advanced/optional transport” and implement it last.

---

## 6. Implement plugin 4: RabbitMQ

This is conceptually similar to what you had in Serdica.

### 6.1 Exchange/queue design

Decide and document (in `Protocol & Transport Specification.md`) something like:

* Exchange: `stella.router`
* Routing keys:

  * `request.{serviceName}.{version}` — gateway → microservice.
  * Microservice’s reply queue per instance: `reply.{serviceName}.{version}.{instanceId}`.

RabbitMQ usage:

* Gateway:

  * Publishes REQUEST frames to `request.{serviceName}.{version}`.
  * Consumes from `reply.*` for responses.

* Microservice:

  * Consumes from `request.{serviceName}.{version}`.
  * Publishes responses to its own reply queue; sets the `CorrelationId` property.

### 6.2 Gateway: `RabbitMqTransportClient`

Implements `ITransportClient`:

* `SendRequestAsync` (see the sketch after this list):

  * Create a message with:

    * Body = the serialized `Frame` (REQUEST or buffered streaming).
    * Properties:

      * `CorrelationId` = `frame.CorrelationId`.
      * `ReplyTo` = the microservice’s reply queue name for this instance.

  * Publish to `request.{serviceName}.{version}`.
  * Await a response:

    * A consumer on the reply queue completes a `TaskCompletionSource<Frame>` keyed by correlation ID.
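
A minimal sketch of the publish side, assuming the classic `RabbitMQ.Client` `IModel` API (names are illustrative):

```csharp
var props = _channel.CreateBasicProperties();
props.CorrelationId = frame.CorrelationId.ToString();
props.ReplyTo = replyQueueName; // reply.{serviceName}.{version}.{instanceId}

_channel.BasicPublish(
    exchange: "stella.router",
    routingKey: $"request.{serviceName}.{version}",
    basicProperties: props,
    body: SerializeFrame(frame));

// A consumer on the reply queue matches props.CorrelationId back to the pending TCS.
```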

* `SendStreamingAsync`:

  * For v1, you can:

    * Only support buffered endpoints over RabbitMQ (like UDP), or
    * Send chunked messages (`RequestStreamData` frames as separate messages) and reconstruct them on the microservice side.

  * I’d recommend:

    * Start with buffered only over RabbitMQ.
    * Mark Rabbit as “no streaming support yet” in config.

* `SendCancelAsync`:

  * Option 1: send a separate CANCEL message with the same `CorrelationId`.
  * Option 2: rely on the timeout; cancellation doesn’t buy much given the overhead.

### 6.3 Microservice: RabbitMQ listener

* A single `IConnection` and `IModel`.

* Declare and bind:

  * Service request queue: `request.{serviceName}.{version}`.
  * Reply queue: `reply.{serviceName}.{version}.{instanceId}`.

* Consume the request queue. On each message:

  * Deserialize the `Frame`.
  * Dispatch through `IEndpointDispatcher`.
  * Publish a RESPONSE message to the `ReplyTo` queue with the same `CorrelationId`.

If you already have RabbitMQ experience from Serdica, this should feel familiar.

---

## 7. Routing config & transport selection

**Projects:** router config + microservice options
**Owner:** config / platform dev

You need to define which transport is actually used in production.

### 7.1 Gateway config (RouterConfig)

Per service/instance, store:

* The `TransportType` to listen on / expect connections for.
* Ports / Rabbit URLs / TLS settings.

Example shape in `RouterConfig`:

```csharp
public sealed class ServiceInstanceConfig
{
    public string ServiceName { get; set; } = string.Empty;
    public string Version { get; set; } = string.Empty;
    public string Region { get; set; } = string.Empty;
    public TransportType TransportType { get; set; } = TransportType.Udp; // default
    public int Port { get; set; } // for TCP/UDP/TLS
    public string? RabbitConnectionString { get; set; }
    // TLS info, etc.
}
```

`StellaOps.Gateway.WebService` startup:

* Reads these configs.
* Starts the corresponding `ITransportServer` instances.

### 7.2 Microservice options

`StellaMicroserviceOptions.Routers` entries must define:

* `Host`
* `Port`
* `TransportType`
* Any transport-specific settings (TLS, Rabbit URL).

At connect time, for each `RouterEndpointConfig` the microservice instantiates the right connector:

```csharp
// Resolve the matching connector from DI (mirrors the gateway-side resolver).
IMicroserviceTransportConnector connector = config.TransportType switch
{
    TransportType.Tcp         => sp.GetRequiredService<TcpMicroserviceConnector>(),
    TransportType.Certificate => sp.GetRequiredService<TlsMicroserviceConnector>(),
    TransportType.Udp         => sp.GetRequiredService<UdpMicroserviceConnector>(),
    TransportType.RabbitMq    => sp.GetRequiredService<RabbitMqMicroserviceConnector>(),
    _ => throw new NotSupportedException($"Transport {config.TransportType} not supported.")
};
```

---

## 8. Implementation order & testing strategy

**Owner:** tech lead

Do NOT try to implement them all at once. Suggested order:

1. **TCP**:

   * Reuse the InMemory test suite:

     * HELLO + endpoint registration.
     * REQUEST → RESPONSE.
     * CANCEL.
     * Heartbeats.

   * (Optional) streaming as a buffered stub for v1, then add genuine streaming.

2. **Certificate/TLS**:

   * Wrap the TCP logic in TLS.
   * Same tests, plus:

     * Certificate validation.
     * Mutual TLS if required.

3. **RabbitMQ**:

   * Start with buffered-only endpoints.
   * Mirror existing InMemory/TCP tests where payloads are small.
   * Add tests for connection resilience (reconnect, etc.).

4. **UDP**:

   * Implement only for very small buffered requests; no streaming.
   * Add tests that verify:

     * HELLO + basic health.
     * REQUEST → RESPONSE with a small payload.
     * Proper timeouts.

At each stage, tests for that plugin must reuse the **same microservice and gateway** code that worked with InMemory. Only the transport factories change.

---

## 9. Done criteria for “Implement real transport plugins one by one”

You can consider step 9 done when:

* There are **concrete implementations** of `ITransportServer` + `ITransportClient` for:

  * TCP
  * Certificate/TLS
  * UDP (buffered only)
  * RabbitMQ (buffered at minimum)

* Gateway startup:

  * Reads `RouterConfig`.
  * Starts the appropriate transport servers per node/region.

* Microservice SDK:

  * Reads `StellaMicroserviceOptions.Routers`.
  * Connects to router nodes using the configured `TransportType`.
  * Uses the same HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL semantics as InMemory.

* The same functional tests that passed for InMemory:

  * Now pass with the TCP plugin.
  * At least a subset pass with TLS, Rabbit, and UDP, honoring their constraints (no streaming on UDP, etc.).

From there, you can move into hardening each plugin (reconnect, backoff, error handling) and documenting “which transport to use when” in your router docs.

docs/router/10-Step.md
@@ -0,0 +1,586 @@

For this step you’re wiring **configuration** into the system properly:

* The router reads a strongly-typed config model (including payload limits, node region, transports).
* Microservices can optionally load a YAML file to **override** endpoint metadata discovered by reflection.
* No behavior changes to routing or transports, just how they get their settings.

Think “config plumbing and merging rules,” not new business logic.

---

## 0. Preconditions

Before starting, confirm:

* The `__Libraries/StellaOps.Router.Config` project exists and references `StellaOps.Router.Common`.
* `StellaOps.Microservice` has:

  * `StellaMicroserviceOptions` (ServiceName, Version, Region, InstanceId, Routers, ConfigFilePath).
  * Reflection-based endpoint discovery that produces `EndpointDescriptor` instances.

* Gateway and microservices currently use **hardcoded** or stub config; you’re about to replace that with real config.

---

## 1. Define the RouterConfig model and YAML schema

**Project:** `__Libraries/StellaOps.Router.Config`
**Owner:** config / platform dev

### 1.1 C# model

Create clear, minimal models to cover current needs (you can extend them later):

```csharp
namespace StellaOps.Router.Config;

public sealed class RouterConfig
{
    public GatewayNodeConfig Node { get; set; } = new();
    public PayloadLimits PayloadLimits { get; set; } = new();
    public IList<TransportEndpointConfig> Transports { get; set; } = new List<TransportEndpointConfig>();
    public IList<ServiceConfig> Services { get; set; } = new List<ServiceConfig>();
}

public sealed class GatewayNodeConfig
{
    public string NodeId { get; set; } = string.Empty;
    public string Region { get; set; } = string.Empty;
    public string Environment { get; set; } = "prod";
}

public sealed class TransportEndpointConfig
{
    public TransportType TransportType { get; set; }
    public int Port { get; set; } // for TCP/UDP/TLS
    public bool Enabled { get; set; } = true;

    // TLS-specific
    public string? ServerCertificatePath { get; set; }
    public string? ServerCertificatePassword { get; set; }
    public bool RequireClientCertificate { get; set; }

    // Rabbit-specific
    public string? RabbitConnectionString { get; set; }
}

public sealed class ServiceConfig
{
    public string Name { get; set; } = string.Empty;
    public string DefaultVersion { get; set; } = "1.0.0";
    public IList<string> NeighborRegions { get; set; } = new List<string>();
}
```

Use the `PayloadLimits` class from Common (or mirror it here and keep a single definition).

### 1.2 YAML shape

Decide and document a YAML layout, e.g.:

```yaml
node:
  nodeId: "gw-eu1-01"
  region: "eu1"
  environment: "prod"

payloadLimits:
  maxRequestBytesPerCall: 10485760 # 10 MB
  maxRequestBytesPerConnection: 52428800
  maxAggregateInflightBytes: 209715200

transports:
  - transportType: Tcp
    port: 45000
    enabled: true
  - transportType: Certificate
    port: 45001
    enabled: false
    serverCertificatePath: "certs/router.pfx"
    serverCertificatePassword: "secret"
  - transportType: Udp
    port: 45002
    enabled: true
  - transportType: RabbitMq
    enabled: true
    rabbitConnectionString: "amqp://guest:guest@localhost:5672"

services:
  - name: "Billing"
    defaultVersion: "1.0.0"
    neighborRegions: ["eu2", "us1"]
  - name: "Identity"
    defaultVersion: "2.1.0"
    neighborRegions: ["eu2"]
```

This YAML is the canonical config for the router; environment variables and JSON can override individual properties later via `IConfiguration`.

---

## 2. Implement the Router.Config loader and DI extensions

**Project:** `StellaOps.Router.Config`

### 2.1 Choose a YAML library

Add a YAML library (e.g. YamlDotNet) to `StellaOps.Router.Config`:

```bash
dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj package YamlDotNet
```

### 2.2 Implement a simple loader

Provide a helper that can load YAML into `RouterConfig`:

```csharp
public static class RouterConfigLoader
{
    public static RouterConfig LoadFromYaml(string path)
    {
        using var reader = new StreamReader(path);
        var yaml = new YamlStream();
        yaml.Load(reader);

        var root = (YamlMappingNode)yaml.Documents[0].RootNode;
        var json = ConvertYamlToJson(root); // simplest: walk the node tree, serialize to a JSON string
        return JsonSerializer.Deserialize<RouterConfig>(json)!;
    }
}
```

Alternatively, bind YAML directly to `RouterConfig` with YamlDotNet’s object mapping; the detail is implementation-specific.
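
A minimal sketch of that direct binding, assuming camelCase keys as in the YAML above:

```csharp
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

public static RouterConfig LoadFromYaml(string path)
{
    var deserializer = new DeserializerBuilder()
        .WithNamingConvention(CamelCaseNamingConvention.Instance) // nodeId -> NodeId, etc.
        .IgnoreUnmatchedProperties()
        .Build();

    using var reader = new StreamReader(path);
    return deserializer.Deserialize<RouterConfig>(reader);
}
```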

### 2.3 ASP.NET Core integration extension

In the router library, add a DI extension the gateway can call:

```csharp
public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddRouterConfig(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Components resolve IOptionsMonitor<RouterConfig> directly for hot-reload support.
        services.Configure<RouterConfig>(configuration.GetSection("Router"));

        return services;
    }
}
```

The gateway will:

* Add the YAML file to the configuration builder.
* Call `AddRouterConfig` to bind it.

---

## 3. Wire RouterConfig into Gateway startup & components

**Project:** `StellaOps.Gateway.WebService`
**Owner:** gateway dev

### 3.1 Program.cs configuration

Adjust `Program.cs`:

```csharp
var builder = WebApplication.CreateBuilder(args);

// add YAML config
builder.Configuration
    .AddJsonFile("appsettings.json", optional: true)
    .AddYamlFile("router.yaml", optional: false, reloadOnChange: true)
    .AddEnvironmentVariables("STELLAOPS_");

// bind RouterConfig (the extension selects the "Router" section itself)
builder.Services.AddRouterConfig(builder.Configuration);

var app = builder.Build();
```

Key points:

* `AddYamlFile("router.yaml", reloadOnChange: true)` ensures hot-reload from YAML. Note that `AddYamlFile` is not built into `Microsoft.Extensions.Configuration`; it comes from a community configuration provider (e.g. NetEscapades.Configuration.Yaml).
* `AddEnvironmentVariables("STELLAOPS_")` allows env-based overrides (optional, but useful).

### 3.2 Inject config into transport factories and routing

Where you start transports:

* Inject `IOptionsMonitor<RouterConfig>` into your `ITransportServerFactory`, and use `RouterConfig.Transports` to know which servers to create and on which ports.

Where you need node identity:

* Inject `IOptionsMonitor<RouterConfig>` into any service needing `GatewayNodeConfig` (e.g. when building `RoutingContext.GatewayRegion`):

```csharp
var nodeRegion = routerConfig.CurrentValue.Node.Region;
```

Where you need payload limits:

* Inject `IOptionsMonitor<RouterConfig>` into `IPayloadBudget` or `TransportDispatchMiddleware` to fetch the current `PayloadLimits`.

Because you’re using `IOptionsMonitor`, components can react to changes when `router.yaml` is modified.

---
## 4. Microservice YAML: schema & loader

**Project:** `__Libraries/StellaOps.Microservice`
**Owner:** SDK dev

Microservice YAML is optional and used **only** to override endpoint metadata, not to define identity or the router pool.

### 4.1 Define the YAML shape

Keep it focused on endpoints and overrides:

```yaml
service:
  serviceName: "Billing"
  version: "1.0.0"
  region: "eu1"

endpoints:
  - method: "POST"
    path: "/billing/invoices/upload"
    defaultTimeout: "00:02:00"
    supportsStreaming: true
    requiringClaims:
      - type: "role"
        value: "billing-editor"
  - method: "GET"
    path: "/billing/invoices/{id}"
    defaultTimeout: "00:00:10"
    requiringClaims:
      - type: "role"
        value: "billing-reader"
```

Identity (`serviceName`, `version`, `region`) in YAML is **informative**; the authoritative values still come from `StellaMicroserviceOptions`. If they differ, log a warning, but don’t override the options from YAML.

### 4.2 C# model

In `StellaOps.Microservice`:

```csharp
internal sealed class MicroserviceYamlConfig
{
    public MicroserviceYamlService? Service { get; set; }
    public IList<MicroserviceYamlEndpoint> Endpoints { get; set; } = new List<MicroserviceYamlEndpoint>();
}

internal sealed class MicroserviceYamlService
{
    public string? ServiceName { get; set; }
    public string? Version { get; set; }
    public string? Region { get; set; }
}

internal sealed class MicroserviceYamlEndpoint
{
    public string Method { get; set; } = string.Empty;
    public string Path { get; set; } = string.Empty;
    public string? DefaultTimeout { get; set; }
    public bool? SupportsStreaming { get; set; }
    public IList<ClaimRequirement> RequiringClaims { get; set; } = new List<ClaimRequirement>();
}
```

### 4.3 YAML loader

Reuse YamlDotNet (add the package to `StellaOps.Microservice` if needed):

```csharp
internal interface IMicroserviceYamlLoader
{
    MicroserviceYamlConfig? Load(string? path);
}

internal sealed class MicroserviceYamlLoader : IMicroserviceYamlLoader
{
    private readonly ILogger<MicroserviceYamlLoader> _logger;

    public MicroserviceYamlLoader(ILogger<MicroserviceYamlLoader> logger)
    {
        _logger = logger;
    }

    public MicroserviceYamlConfig? Load(string? path)
    {
        if (string.IsNullOrWhiteSpace(path) || !File.Exists(path))
            return null;

        try
        {
            using var reader = new StreamReader(path);
            var deserializer = new DeserializerBuilder().Build();
            return deserializer.Deserialize<MicroserviceYamlConfig>(reader);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to load microservice YAML from {Path}", path);
            return null;
        }
    }
}
```

Register it in DI:

```csharp
services.AddSingleton<IMicroserviceYamlLoader, MicroserviceYamlLoader>();
```

---

## 5. Merge YAML overrides with reflection-discovered endpoints

**Project:** `StellaOps.Microservice`
**Owner:** SDK dev

Extend `EndpointCatalog` to apply YAML overrides.

### 5.1 Extend the constructor to accept YAML config

Adjust `EndpointCatalog`:

```csharp
internal sealed class EndpointCatalog : IEndpointCatalog
{
    public IReadOnlyList<EndpointDescriptor> Descriptors { get; }

    private readonly Dictionary<(string Method, string Path), EndpointRegistration> _map;

    public EndpointCatalog(
        IEndpointDiscovery discovery,
        IMicroserviceYamlLoader yamlLoader,
        IOptions<StellaMicroserviceOptions> optionsAccessor)
    {
        var options = optionsAccessor.Value;

        var registrations = discovery.DiscoverEndpoints(options);
        var yamlConfig = yamlLoader.Load(options.ConfigFilePath);

        registrations = ApplyYamlOverrides(registrations, yamlConfig);

        // Tuple keys can't take a StringComparer; normalize case in the key instead.
        _map = registrations.ToDictionary(
            r => (r.Descriptor.Method.ToUpperInvariant(), r.Descriptor.Path.ToLowerInvariant()),
            r => r);

        Descriptors = registrations.Select(r => r.Descriptor).ToArray();
    }
}
```

### 5.2 Implement `ApplyYamlOverrides`

Key rules:

* Identity (ServiceName, Version, Region) always comes from `StellaMicroserviceOptions`.
* YAML can override:

  * `DefaultTimeout`
  * `SupportsStreaming`
  * `RequiringClaims`

Implementation sketch:

```csharp
private static IReadOnlyList<EndpointRegistration> ApplyYamlOverrides(
    IReadOnlyList<EndpointRegistration> registrations,
    MicroserviceYamlConfig? yaml)
{
    if (yaml is null || yaml.Endpoints.Count == 0)
        return registrations;

    // Same case-normalized tuple keys as in the catalog map.
    var overrideMap = yaml.Endpoints.ToDictionary(
        e => (e.Method.ToUpperInvariant(), e.Path.ToLowerInvariant()),
        e => e);

    var result = new List<EndpointRegistration>(registrations.Count);

    foreach (var reg in registrations)
    {
        if (!overrideMap.TryGetValue(
                (reg.Descriptor.Method.ToUpperInvariant(), reg.Descriptor.Path.ToLowerInvariant()),
                out var ov))
        {
            result.Add(reg);
            continue;
        }

        var desc = reg.Descriptor;

        var timeout = desc.DefaultTimeout;
        if (!string.IsNullOrWhiteSpace(ov.DefaultTimeout) &&
            TimeSpan.TryParse(ov.DefaultTimeout, out var parsed))
        {
            timeout = parsed;
        }

        var supportsStreaming = desc.SupportsStreaming;
        if (ov.SupportsStreaming.HasValue)
        {
            supportsStreaming = ov.SupportsStreaming.Value;
        }

        var requiringClaims = ov.RequiringClaims.Count > 0
            ? ov.RequiringClaims.ToArray()
            : desc.RequiringClaims;

        var overriddenDescriptor = new EndpointDescriptor
        {
            ServiceName = desc.ServiceName,
            Version = desc.Version,
            Method = desc.Method,
            Path = desc.Path,
            DefaultTimeout = timeout,
            SupportsStreaming = supportsStreaming,
            RequiringClaims = requiringClaims
        };

        result.Add(new EndpointRegistration
        {
            Descriptor = overriddenDescriptor,
            HandlerType = reg.HandlerType
        });
    }

    return result;
}
```

This ensures code defines the set of endpoints; YAML only tunes metadata.

---

## 6. Hot-reload / YAML change handling

**Router side:** you already enabled `reloadOnChange` for `router.yaml` and use `IOptionsMonitor<RouterConfig>`. Next, components that care about changes must **react**:

* Payload limits:

  * `IPayloadBudget` or `TransportDispatchMiddleware` should read `routerConfig.CurrentValue.PayloadLimits` on each request rather than caching it.

* Node region:

  * `RoutingContext.GatewayRegion` can be built from `routerConfig.CurrentValue.Node.Region` per request.

You do **not** need a custom watcher; `IOptionsMonitor` already tracks config changes.

**Microservice side:** for now you can start with **load-on-startup** YAML. If you want hot-reload (a sketch follows this list):

* Implement a `FileSystemWatcher` in `MicroserviceYamlLoader` or a small `IHostedService`:

  * Watch `options.ConfigFilePath` for changes.
  * On change:

    * Reload the YAML.
    * Rebuild the `EndpointDescriptor` list.
    * Send an updated HELLO or an ENDPOINTS_UPDATE frame to the router.
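
A minimal sketch of that watcher as a hosted service, assuming a reload callback is wired up elsewhere (names are illustrative; debouncing of duplicate change events is omitted):

```csharp
internal sealed class MicroserviceYamlWatcher : IHostedService, IDisposable
{
    private readonly FileSystemWatcher _watcher;

    public MicroserviceYamlWatcher(IOptions<StellaMicroserviceOptions> options, Action onYamlChanged)
    {
        var path = Path.GetFullPath(options.Value.ConfigFilePath!);
        _watcher = new FileSystemWatcher(Path.GetDirectoryName(path)!, Path.GetFileName(path));
        // The callback reloads the YAML, rebuilds descriptors, and re-sends HELLO.
        _watcher.Changed += (_, _) => onYamlChanged();
    }

    public Task StartAsync(CancellationToken ct)
    {
        _watcher.EnableRaisingEvents = true;
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken ct)
    {
        _watcher.EnableRaisingEvents = false;
        return Task.CompletedTask;
    }

    public void Dispose() => _watcher.Dispose();
}
```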

Given the complexity, you can postpone true hot reload to a later iteration and document that microservices must be restarted to pick up YAML changes.

---

## 7. Tests

**Router.Config tests:**

* Unit tests for `RouterConfigLoader`:

  * Given a YAML string, bind to `RouterConfig` properly.
  * Validate that `TransportType.Tcp` / `Udp` / `RabbitMq` values map correctly.

* Integration test:

  * Start the gateway with `router.yaml`.
  * Access `IOptionsMonitor<RouterConfig>` in a test controller or test service and assert values.
  * Modify the YAML on disk (if test infra allows) and ensure values update via `IOptionsMonitor`.

**Microservice YAML tests:**

* Unit tests for `MicroserviceYamlLoader`:

  * Load valid YAML; confirm endpoints, claims, and timeouts are parsed.

* `EndpointCatalog` tests:

  * Build a fake `EndpointRegistration` list from reflection.
  * Build YAML overrides.
  * Call `ApplyYamlOverrides` and assert:

    * Timeouts updated.
    * SupportsStreaming updated.
    * RequiringClaims replaced where provided.
    * Descriptors with no matching YAML remain unchanged.

---

## 8. Documentation updates

Update the docs under `docs/router`:

1. **Stella Ops Router – Webserver.md**:

   * Describe `router.yaml`:

     * Node config (region, nodeId).
     * PayloadLimits.
     * Transports.

   * Explain precedence:

     * YAML as the base.
     * Environment variables can override individual fields via `STELLAOPS_Router__Node__Region` etc.

2. **Stella Ops Router – Microservice.md**:

   * Explain `ConfigFilePath` in `StellaMicroserviceOptions`.
   * Show a full example microservice YAML and how it maps to endpoint metadata.
   * Clearly state:

     * Identity comes from options (code/config), not YAML.
     * YAML can override per-endpoint timeout, streaming flag, and requiringClaims.
     * YAML can’t add endpoints that don’t exist in code.

3. **Stella Ops Router Documentation.md**:

   * Add a short “Configuration” chapter:

     * Where `router.yaml` lives.
     * Where the microservice YAML lives.
     * How to run locally with custom configs.

---

## 9. Done criteria for “Add Router.Config + Microservice YAML integration”

You can call step 10 complete when:

* Router:

  * Loads `router.yaml` into `RouterConfig` using `StellaOps.Router.Config`.
  * Uses `RouterConfig.Node.Region` when building the routing context.
  * Uses `RouterConfig.PayloadLimits` for payload budget enforcement.
  * Uses `RouterConfig.Transports` to start the right `ITransportServer` instances.
  * Supports runtime changes to `router.yaml` via `IOptionsMonitor` for at least node identity and payload limits.

* Microservice:

  * Accepts an optional `ConfigFilePath` in `StellaMicroserviceOptions`.
  * Loads YAML (when present) and merges overrides into reflection-discovered endpoints.
  * Sends HELLO with the **merged** descriptors (i.e., YAML-aware defaults).
  * Behavior remains unchanged when no YAML is provided (pure reflection mode).

* Tests:

  * Confirm config binding for router and microservice.
  * Confirm YAML overrides are applied correctly to endpoint metadata.

At that point, configuration is no longer hardcoded, and you have a clear, documented path for both router operators and microservice teams to configure behavior via YAML with predictable precedence.

docs/router/11-Step.md
@@ -0,0 +1,550 @@
Goal for this step: have a **concrete, runnable example** (gateway + one microservice) and a **clear skeleton** for migrating any existing `StellaOps.*.WebService` into `StellaOps.*.Microservice`. After this, devs should be able to:
|
||||||
|
|
||||||
|
* Run a full vertical slice locally.
|
||||||
|
* Open a “migration cookbook” and follow a predictable recipe.
|
||||||
|
|
||||||
|
I’ll split it into two tracks: reference example, then migration skeleton.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Reference example: “Billing” vertical slice
|
||||||
|
|
||||||
|
### 1.1 Create the sample microservice project
|
||||||
|
|
||||||
|
**Project:** `src/StellaOps.Billing.Microservice`
|
||||||
|
**Owner:** feature/example dev
|
||||||
|
|
||||||
|
Tasks:
|
||||||
|
|
||||||
|
1. Create the project:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd src
|
||||||
|
dotnet new worker -n StellaOps.Billing.Microservice
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Add references:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dotnet add StellaOps.Billing.Microservice/StellaOps.Billing.Microservice.csproj reference \
|
||||||
|
__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
|
||||||
|
dotnet add StellaOps.Billing.Microservice/StellaOps.Billing.Microservice.csproj reference \
|
||||||
|
__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
|
||||||
|
```
|
||||||
|
|
||||||
|
3. In `Program.cs`, wire the SDK with **InMemory transport** for now:
|
||||||
|
|
||||||
|
```csharp
|
||||||
|
var builder = Host.CreateApplicationBuilder(args);
|
||||||
|
|
||||||
|
builder.Services.AddStellaMicroservice(opts =>
|
||||||
|
{
|
||||||
|
opts.ServiceName = "Billing";
|
||||||
|
opts.Version = "1.0.0";
|
||||||
|
opts.Region = "eu1";
|
||||||
|
opts.InstanceId = $"billing-{Environment.MachineName}";
|
||||||
|
opts.Routers.Add(new RouterEndpointConfig
|
||||||
|
{
|
||||||
|
Host = "localhost",
|
||||||
|
Port = 50050, // to match gateway’s InMemory/TCP harness
|
||||||
|
TransportType = TransportType.Tcp
|
||||||
|
});
|
||||||
|
opts.ConfigFilePath = "billing.microservice.yaml"; // optional overrides
|
||||||
|
});
|
||||||
|
|
||||||
|
var app = builder.Build();
|
||||||
|
await app.RunAsync();
|
||||||
|
```
|
||||||
|
|
||||||
|
(You can keep `TransportType` as TCP even if implemented in-process for now; once real TCP is in, nothing changes here.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 1.2 Implement a few canonical endpoints
|
||||||
|
|
||||||
|
Pick 3–4 endpoints that exercise different features:
|
||||||
|
|
||||||
|
1. **Health / contract check**
|
||||||
|
|
||||||
|
```csharp
[StellaEndpoint("GET", "/ping")]
public sealed class PingEndpoint : IRawStellaEndpoint
{
    public Task<RawResponse> HandleAsync(RawRequestContext ctx)
    {
        var resp = new RawResponse { StatusCode = 200 };
        resp.Headers["Content-Type"] = "text/plain";
        resp.WriteBodyAsync = async stream =>
        {
            await stream.WriteAsync("pong"u8.ToArray(), ctx.CancellationToken);
        };
        return Task.FromResult(resp);
    }
}
```

2. **Simple JSON read/write (non-streaming)**

```csharp
public sealed record CreateInvoiceRequest(string CustomerId, decimal Amount);
public sealed record CreateInvoiceResponse(Guid Id);

[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, CreateInvoiceResponse>
{
    public Task<CreateInvoiceResponse> HandleAsync(CreateInvoiceRequest req, CancellationToken ct)
    {
        // pretend to store in DB
        return Task.FromResult(new CreateInvoiceResponse(Guid.NewGuid()));
    }
}
```

3. **Streaming upload (large file)**

```csharp
[StellaEndpoint("POST", "/billing/invoices/upload")]
public sealed class InvoiceUploadEndpoint : IRawStellaEndpoint
{
    public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
    {
        var buffer = new byte[64 * 1024];
        var total = 0L;

        int read;
        while ((read = await ctx.Body.ReadAsync(buffer.AsMemory(0, buffer.Length), ctx.CancellationToken)) > 0)
        {
            total += read;
            // process chunk or write to temp file
        }

        var resp = new RawResponse { StatusCode = 200 };
        resp.Headers["Content-Type"] = "application/json";
        resp.WriteBodyAsync = async stream =>
        {
            var json = $"{{\"bytesReceived\":{total}}}";
            await stream.WriteAsync(System.Text.Encoding.UTF8.GetBytes(json), ctx.CancellationToken);
        };
        return resp;
    }
}
```

This gives devs examples of:

* Raw endpoints (`/ping`, `/upload`).
* A typed endpoint (`/billing/invoices`).
* Streaming usage (`Body.ReadAsync`).

---

### 1.3 Microservice YAML override example

**File:** `src/StellaOps.Billing.Microservice/billing.microservice.yaml`

```yaml
endpoints:
  - method: GET
    path: /ping
    timeout: 00:00:02

  - method: POST
    path: /billing/invoices
    timeout: 00:00:05
    supportsStreaming: false
    requiringClaims:
      - type: role
        value: BillingWriter

  - method: POST
    path: /billing/invoices/upload
    timeout: 00:02:00
    supportsStreaming: true
    requiringClaims:
      - type: role
        value: BillingUploader
```

This file demonstrates:

* Timeout override.
* The streaming flag.
* `RequiringClaims` usage.

---

### 1.4 Gateway example config for Billing

**File:** `config/router.billing.yaml` (for local dev)

```yaml
nodeId: "gw-dev-01"
region: "eu1"

payloadLimits:
  maxRequestBytesPerCall: 10485760        # 10 MB
  maxRequestBytesPerConnection: 52428800  # 50 MB
  maxAggregateInflightBytes: 209715200    # 200 MB

services:
  - name: "Billing"
    defaultVersion: "1.0.0"
    endpoints:
      - method: "GET"
        path: "/ping"
        # router defaults, if any
      - method: "POST"
        path: "/billing/invoices"
        defaultTimeout: "00:00:05"
        requiringClaims:
          - type: "role"
            value: "BillingWriter"
      - method: "POST"
        path: "/billing/invoices/upload"
        defaultTimeout: "00:02:00"
        supportsStreaming: true
        requiringClaims:
          - type: "role"
            value: "BillingUploader"
```

This lets you show precedence:

* Reflection → microservice YAML → router YAML.

---

### 1.5 Gateway wiring for the example

**Project:** `StellaOps.Gateway.WebService`

In `Program.cs`:

1. Load the router config and point it to `router.billing.yaml` for dev:

```csharp
builder.Configuration
    .AddJsonFile("appsettings.json", optional: true)
    .AddEnvironmentVariables(prefix: "STELLAOPS_");

builder.Services.AddOptions<RouterConfig>()
    .Configure<IConfiguration>((cfg, configuration) =>
    {
        configuration.GetSection("Router").Bind(cfg);

        var yamlPath = configuration["Router:YamlPath"] ?? "config/router.billing.yaml";
        if (File.Exists(yamlPath))
        {
            var yamlCfg = RouterConfigLoader.LoadFromFile(yamlPath);
            // either replace cfg wholesale (if YAML is the source of truth) or overlay:
            OverlayRouterConfig(cfg, yamlCfg);
        }
    });

builder.Services.AddOptions<GatewayNodeConfig>()
    .Configure<IOptions<RouterConfig>>((node, routerCfg) =>
    {
        var cfg = routerCfg.Value;
        node.NodeId = cfg.NodeId;
        node.Region = cfg.Region;
    });
```

2. Ensure you start the appropriate transport server (for dev, TCP on `localhost:50050`):

* From `RouterConfig.Transports` or a dev shortcut, start the TCP server listening on that port.

3. HTTP pipeline:

* `EndpointResolutionMiddleware`
* `RoutingDecisionMiddleware`
* `TransportDispatchMiddleware`

Now your dev loop is:

* Run `StellaOps.Gateway.WebService`.
* Run `StellaOps.Billing.Microservice`.
* `curl http://localhost:{gatewayPort}/ping` → should go through the gateway to the microservice and back.
* Similarly for `/billing/invoices` and `/billing/invoices/upload`.

---

### 1.6 Example documentation

Create `docs/router/examples/Billing.Sample.md`:

* “How to run the example”:

  * Build the solution.
  * `dotnet run` for the gateway.
  * `dotnet run` for the Billing microservice.

* Show sample `curl` commands:

  * `curl http://localhost:8080/ping`
  * `curl -X POST http://localhost:8080/billing/invoices -d '{"customerId":"C1","amount":123.45}'`
  * `curl -X POST http://localhost:8080/billing/invoices/upload --data-binary @bigfile.bin`

* Note where the config files live and how to change them.

This becomes your canonical reference for new teams.

---

## 2. Migration skeleton: from WebService to Microservice

Now that you have a working example, you need a **repeatable recipe** for migrating any existing `StellaOps.*.WebService` into the microservice router model.

### 2.1 Define the migration target shape

For each webservice you migrate, you want:

* A new project: `StellaOps.{Domain}.Microservice`.
* Shared domain logic extracted into a library (if not already): `StellaOps.{Domain}.Core` or similar.
* Controllers → endpoint classes:

  * `Controller` methods ⇨ `[StellaEndpoint]`-annotated types.
  * `HttpGet`/`HttpPost` attributes ⇨ a `Method` and `Path` pair.

* Configuration:

  * The WebService’s appsettings routes → microservice YAML + router YAML.
  * Authentication/authorization → `RequiringClaims` in endpoint metadata.

Document this target shape in `docs/router/Migration of Webservices to Microservices.md`.

---

### 2.2 Skeleton microservice template

Create a **generic** microservice skeleton that any team can copy:

**Project:** `templates/StellaOps.Template.Microservice`, or at least a folder `samples/MigrationSkeleton/`.

Contents:

* `Program.cs`:

```csharp
var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddStellaMicroservice(opts =>
{
    opts.ServiceName = "{DomainName}";
    opts.Version = "1.0.0";
    opts.Region = "eu1";
    opts.InstanceId = "{DomainName}-" + Environment.MachineName;

    // Mandatory router pool configuration
    opts.Routers.Add(new RouterEndpointConfig
    {
        Host = "localhost", // or injected via env
        Port = 50050,
        TransportType = TransportType.Tcp
    });

    opts.ConfigFilePath = "{DomainName}.microservice.yaml";
});

// domain DI (reuse existing domain services from the WebService)
// builder.Services.AddDomainServices();

var app = builder.Build();
await app.RunAsync();
```

* A sample endpoint mapping from a typical WebService controller method:

Legacy controller:

```csharp
[ApiController]
[Route("api/billing/invoices")]
public class InvoicesController : ControllerBase
{
    [HttpPost]
    [Authorize(Roles = "BillingWriter")]
    public async Task<ActionResult<InvoiceDto>> Create(CreateInvoiceRequest request)
    {
        var result = await _service.Create(request);
        return Ok(result);
    }
}
```

Microservice endpoint:

```csharp
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, InvoiceDto>
{
    private readonly IInvoiceService _service;

    public CreateInvoiceEndpoint(IInvoiceService service)
    {
        _service = service;
    }

    public Task<InvoiceDto> HandleAsync(CreateInvoiceRequest request, CancellationToken ct)
    {
        return _service.Create(request, ct);
    }
}
```

And matching YAML:

```yaml
endpoints:
  - method: POST
    path: /billing/invoices
    timeout: 00:00:05
    requiringClaims:
      - type: role
        value: BillingWriter
```

This skeleton demonstrates the mapping clearly.

---

### 2.3 Migration workflow for a team (per service)

Put this as a checklist in `Migration of Webservices to Microservices.md`:

1. **Inventory the existing HTTP surface**

   * List all controllers and actions with:

     * HTTP method.
     * Route template (full path).
     * Auth attributes (`[Authorize(Roles=..)]` or policies).
     * Whether the action handles large uploads/downloads.

2. **Create the microservice project**

   * Add `StellaOps.{Domain}.Microservice` using the skeleton.
   * Reference the domain logic project (`StellaOps.{Domain}.Core`), or extract one if necessary.

3. **Map each controller action → endpoint**

   For each action:

   * Create an endpoint class in the microservice:

     * `IRawStellaEndpoint` for:

       * Large payloads.
       * Very custom body handling.

     * `IStellaEndpoint<TRequest,TResponse>` for standard JSON APIs.

   * Use `[StellaEndpoint("{METHOD}", "{PATH}")]` matching the existing route.

4. **Wire domain services & auth**

   * Register the same domain services the WebService used (DB contexts, repositories, etc.).
   * Translate role/claim-based `[Authorize]` usage to `RequiringClaims` in the microservice YAML.

5. **Create the microservice YAML**

   * For each new endpoint:

     * Define a default timeout.
     * Set `supportsStreaming: true` where appropriate.
     * Add `requiringClaims` matching the prior auth requirements.

6. **Update the router YAML**

   * Add a service entry under `services`:

     * `name: "{Domain}"`.
     * `defaultVersion: "1.0.0"`.
     * Endpoints (method/path, router-side overrides if needed).

7. **Smoke-test locally**

   * Run the gateway + microservice side by side.
   * Hit the same URLs via the gateway that were previously served by the WebService directly.
   * Compare behavior (status codes, semantics) with the existing environment.

8. **Gradual rollout**

   Strategy options:

   * **Proxy mode**:

     * Keep the WebService behind the gateway for a while.
     * Add router endpoints that proxy to the existing WebService (via HTTP) while the microservice matures.
     * Gradually switch endpoints to the microservice once stable.

   * **Blue/green**:

     * Run the WebService and microservice in parallel.
     * Route a small percentage of traffic to the microservice via the router.
     * Increase gradually.

Outline these as patterns in the migration doc, but keep them high-level here.

---

### 2.4 Migration skeleton repository structure

Add a clear place in the repo for skeleton code & docs:

```text
/docs
  /router
    Migration of Webservices to Microservices.md
    examples/
      Billing.Sample.md

/samples
  /Billing
    StellaOps.Billing.Microservice/   # full example project
    router.billing.yaml               # example router config
  /MigrationSkeleton
    StellaOps.Template.Microservice/  # template project
    example-controller-mapping.md     # before/after snippet
```

The **skeleton** project should:

* Compile.
* Contain TODO markers where teams fill in domain pieces.
* Be referenced in the migration doc so people know where to look.

---

### 2.5 Tests to make the reference stick

Add a minimal test suite around the Billing example:

* **Integration tests** in `tests/StellaOps.Billing.IntegrationTests`:

  * Start the gateway + Billing microservice (using an in-memory test host or docker-compose).
  * `GET /ping` returns 200 and “pong”.
  * `POST /billing/invoices` returns 200 with a JSON body containing an `id`.
  * `POST /billing/invoices/upload` with a large payload succeeds and reports `bytesReceived`.

* Use these tests as a reference for future services: they show how to spin up a microservice + gateway in tests (a minimal shape is sketched below).
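One possible shape for the first two tests, assuming an xUnit fixture (`BillingSliceFixture`, hypothetical) that boots the gateway and Billing microservice over the InMemory transport and exposes an `HttpClient` pointed at the gateway:

```csharp
using System.Net;
using System.Net.Http.Json;
using System.Text.Json;
using Xunit;

public sealed class BillingSmokeTests : IClassFixture<BillingSliceFixture>
{
    private readonly HttpClient _client;

    public BillingSmokeTests(BillingSliceFixture fixture)
    {
        // The fixture is assumed to start gateway + Billing microservice in-process.
        _client = fixture.GatewayClient;
    }

    [Fact]
    public async Task Ping_roundtrips_through_gateway()
    {
        var response = await _client.GetAsync("/ping");

        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        Assert.Equal("pong", await response.Content.ReadAsStringAsync());
    }

    [Fact]
    public async Task CreateInvoice_returns_an_id()
    {
        var response = await _client.PostAsJsonAsync(
            "/billing/invoices",
            new { customerId = "C1", amount = 123.45m });

        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        var body = await response.Content.ReadFromJsonAsync<JsonElement>();
        Assert.True(body.TryGetProperty("id", out _));
    }
}
```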

---

## 3. Done criteria for step 11

You can treat “Build a reference example + migration skeleton” as complete when:

* `StellaOps.Billing.Microservice` exists, runs, and successfully serves requests through the gateway using your real transport (or InMemory/TCP for dev).
* `router.billing.yaml` plus `billing.microservice.yaml` show config patterns for:

  * timeouts
  * streaming
  * requiringClaims

* `docs/router/examples/Billing.Sample.md` explains how to run and test the example.
* `Migration of Webservices to Microservices.md` contains:

  * A concrete mapping example (controller → endpoint + YAML).
  * A step-by-step migration checklist for teams.
  * Pointers to the skeleton project and sample configs.

* A template microservice project exists (`StellaOps.Template.Microservice` or equivalent) that teams can copy to bootstrap new services.

Once you have this, onboarding new domains and migrating old WebServices stops being an ad-hoc effort and becomes a repeatable, documented process.
415
docs/router/12-Step.md
Normal file
@@ -0,0 +1,415 @@
Below is how I’d tell your dev agents to operate on this codebase so it doesn’t turn into chaos over time.

Think of this as the “rules of engagement” for the Stella Ops Router.

---

## 1. Non‑negotiable operating principles

All agents follow these rules:

1. **Specs are law**

   * `docs/router/specs.md` is the primary source of truth.
   * If code and spec differ:

     * Fix the spec **first** (in a PR), then adjust the code.
     * No “quick fixes” that contradict the spec.

2. **Common & protocol are sacred**

   * `StellaOps.Router.Common` and the wire protocol (Frame/FrameType/serialization) are stable layers.
   * Any change to:

     * `Frame`, `FrameType`
     * `EndpointDescriptor`, `ConnectionState`
     * `ITransportClient` / `ITransportServer`

     …requires:

     * An explicit spec update.
     * Compatibility consideration.
     * Code review by someone thinking about all transports and both sides (gateway + microservice).

3. **InMemory first, then real transports**

   * New protocol semantics (e.g. a new frame type, new behavior, new timeout rules) MUST:

     1. Be implemented and proven with InMemory.
     2. Have tests passing with InMemory.
     3. Only then be rolled into TCP/TLS/UDP/RabbitMQ.

4. **No backdoor HTTP between microservices and the router**

   * Microservices must never talk HTTP to the router for control plane or data.
   * All microservice–router traffic goes through the registered transports (UDP/TCP/TLS/RabbitMQ) using `Frame`.

5. **Method + Path = contract**

   * Endpoint identity is always `HTTP Method + Path`, nothing else.
   * No “dynamic” routing hacks that bypass the `(Method, Path)` resolution.

---

## 2. How agents should structure work (vertical slices, not scattered edits)

Whenever you assign work, agents should:

1. **Work in vertical slices**

   * Example slices: “cancellation with InMemory”, “streaming + payload limits with TCP”, “RabbitMQ buffered requests”.
   * Each slice includes:

     * Spec amendments (if needed).
     * Common contracts (if needed).
     * Implementation (gateway + microservice + transport).
     * Tests.

2. **Avoid cross‑cutting, half‑finished changes**

   * Do not:

     * Change Common, start on TCP, then get bored and leave InMemory broken.

   * Do:

     * Finish one vertical slice end‑to‑end, then move on.

3. **Keep changes small and reviewable**

   * Prefer:

     * One PR for “add YAML overrides merging”.
     * Another PR for “add router YAML hot‑reload details”.

   * Avoid huge omnibus PRs that change the protocol, transports, router, and microservice in one go.

---

## 3. Change categories & review rules

Agents should classify their work by category and obey the review level.

1. **Category A – Protocol / Common changes**

   * Affects:

     * `Frame`, `FrameType`, payload DTOs.
     * `EndpointDescriptor`, `ConnectionState`, `RoutingDecision`.
     * `ITransportClient`, `ITransportServer`.

   * Requirements:

     * Spec change with rationale.
     * Cross‑side impact analysis: gateway + microservice + all transports.
     * Tests updated for InMemory and at least one real transport.
     * Review: 2+ reviewers, one acting as “protocol owner”.

2. **Category B – Router logic / routing plugin**

   * Affects:

     * The `IGlobalRoutingState` implementation.
     * `IRoutingPlugin` logic (region, ping, heartbeat).

   * Requirements:

     * Unit tests for the routing plugin (selection rules).
     * At least one integration test through gateway + InMemory.
     * Review: at least one reviewer who understands region/version semantics.

3. **Category C – Transport implementation**

   * Affects:

     * TCP/TLS/UDP/RabbitMQ clients & servers.

   * Requirements:

     * Transport‑specific tests (connection, basic request/response, timeout).
     * No protocol changes.
     * Review: 1–2 reviewers, including one who owns that transport.

4. **Category D – SDK / microservice developer experience**

   * Affects:

     * The `StellaOps.Microservice` public surface, endpoint discovery, YAML merging.

   * Requirements:

     * API review for the public surface.
     * Docs update (`Microservice.md`) if behavior changes.
     * Review: 1–2 reviewers.

5. **Category E – Docs only**

   * Affects:

     * `docs/router/*`, no code.

   * Requirements:

     * Ensure docs match current behavior; if not, spawn follow‑up issues.

---

## 4. Workflow per change (what each agent does)

For any non‑trivial change:

1. **Check the spec**

   * Confirm that:

     * The desired behavior is already described, or
     * You will extend the spec first.

2. **Update / extend the spec if needed**

   * Edit `docs/router/specs.md` or the appropriate doc.
   * Document:

     * What’s changing.
     * Why we need it.
     * Which components are affected.

3. **Adjust Common / contracts if needed**

   * Only after the spec is updated.
   * Keep changes minimal and backwards compatible where possible.

4. **Implement in the InMemory path**

   * Update:

     * The InMemory `ITransportClient`/hub.
     * Microservice and gateway logic that rely on it.

   * Add tests to prove the behavior.

5. **Port to real transports**

   * Implement the same behavior in:

     * TCP (baseline).
     * TLS (wrapping TCP).
     * Others when needed.

   * Reuse the same InMemory test patterns for transport tests.

6. **Add / update tests**

   * Unit tests for logic.
   * Integration tests for gateway + microservice via at least one real transport.

7. **Update documentation**

   * Update the relevant docs:

     * `Stella Ops Router - Webserver.md`
     * `Stella Ops Router - Microservice.md`
     * `Common.md`, if common contracts changed.

   * Highlight any new configuration knobs or invariants.

---

## 5. Testing expectations for all agents

Agents should treat tests as part of the change, not an afterthought.

1. **Unit tests**

   * For:

     * Routing plugin decisions.
     * YAML merge behavior.
     * Payload budget logic.

   * Goal:

     * All tricky branches are covered.

2. **Integration tests**

   * For gateway + microservice using:

     * InMemory.
     * At least one real transport (TCP in dev).

   * Scenarios to maintain:

     * Simple request/response.
     * Streaming upload.
     * Cancellation on client abort.
     * Timeout leading to CANCEL.
     * Payload limit exceeded.

3. **Smoke tests for examples**

   * Ensure the `StellaOps.Billing.Microservice` example always passes a small test:

     * `/billing/health` works.
     * `/billing/invoices/upload` streaming behaves.

4. **CI gating**

   * No PR merges unless:

     * `dotnet build` for the solution succeeds.
     * All tests pass.

   * If agents add new projects/tests, CI must be updated in the same PR.

---

## 6. How agents should use configuration & YAML

1. **Router side**

   * Always read payload limits, node region, and transports from `RouterConfig` (bound from YAML + env).
   * Do not hardcode:

     * Limits.
     * Regions.
     * Ports.

   * If behavior depends on config, fetch it from `IOptionsMonitor<RouterConfig>` at runtime, not from cached fields, unless you explicitly freeze it.

2. **Microservice side**

   * Identity & router pool:

     * From `StellaMicroserviceOptions` (code/env).

   * Endpoint metadata overrides:

     * From YAML (`ConfigFilePath`) merged into the reflection result.

   * Agents must not let YAML create endpoints that don’t exist in code; overrides only.

3. **No hidden defaults**

   * If a default is important (e.g. `HeartbeatInterval`), document it and centralize it.
   * Don’t sprinkle magic numbers across the code.

---

## 7. Adding new capabilities: pattern all agents follow

When someone wants a new capability (e.g. “retry on transient transport failures”):

1. **Open a design issue / doc snippet**

   * Describe:

     * The problem.
     * The proposed design.
     * Where it sits in the architecture (router, microservice, transport, config).

2. **Update the spec**

   * Write the behavior into the appropriate doc section.
   * Include:

     * API shape (if public).
     * Transport impacts.
     * Failure modes.

3. **Follow the vertical slice path**

   * Implement in Common (if needed).
   * Implement in InMemory.
   * Implement in the primary transport (TCP).
   * Add tests.
   * Update docs.

Agents should not just spike code into the TCP implementation without spec or tests.

---

## 8. Logging, tracing, and debugging expectations

Agents should instrument consistently; this matters for operations and for debugging during development.

1. **Use structured logging**

   * At minimum, include:

     * `ServiceName`
     * `InstanceId`
     * `CorrelationId`
     * `Method`
     * `Path`
     * `ConnectionId`

   * Never log full payload bodies by default, for privacy and performance; log sizes and key metadata instead (see the sketch at the end of this section).

2. **Trace correlation**

   * Ensure correlation IDs:

     * Propagate from HTTP (gateway) into `Frame.CorrelationId`.
     * Are used in logs on both sides (gateway + microservice).

3. **Agent debugging guidance**

   * When debugging a routing or transport problem:

     * Turn on debug logging for the gateway + microservice for that service.
     * Use the correlation ID to follow the request end‑to‑end.
     * Verify:

       * HELLO registration.
       * HEARTBEAT events.
       * REQUEST leaving the gateway.
       * RESPONSE arriving.
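A minimal sketch of a correlation-aware logging scope using `Microsoft.Extensions.Logging`, assuming the listed fields are available on the connection, endpoint, and frame objects in scope:

```csharp
// Sketch: wrap each dispatched request in a logging scope so every log line
// carries the routing identity. BeginScope with a dictionary emits structured
// key/value pairs in providers that support scopes (e.g. Serilog).
using (logger.BeginScope(new Dictionary<string, object?>
{
    ["ServiceName"] = connection.Instance.ServiceName,
    ["InstanceId"] = connection.Instance.InstanceId,
    ["CorrelationId"] = frame.CorrelationId,
    ["Method"] = endpoint.Method,
    ["Path"] = endpoint.Path,
    ["ConnectionId"] = connection.ConnectionId,
}))
{
    // Log sizes and key metadata, never the payload body itself.
    logger.LogDebug("Dispatching request ({PayloadBytes} bytes)", frame.Payload.Length);
    // ... dispatch ...
}
```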

---

## 9. Daily agent workflow (practical directions)

For each day / task, an agent should:

1. **Start from an issue or spec line item**

   * Never “just code something” without an issue/state in the backlog.

2. **Locate the relevant doc**

   * The spec section.
   * Example docs (e.g. the Billing sample).
   * The migration doc, if working on a conversion.

3. **Work in a feature branch**

   * The branch name reflects the scope: `feature/streaming-tcp`, `fix/router-cancellation`, etc.

4. **Keep notes**

   * If an assumption is made (e.g. “we currently don’t support streaming over RabbitMQ”), note it in the issue.
   * If you discover an inconsistency in the docs, open a doc‑fix issue.

5. **Finish the full slice**

   * Code + tests + docs.
   * Keep partial implementations behind feature flags (if needed) and clearly marked.

6. **Open a PR with a clear description**

   * What changed.
   * Which spec section it implements or modifies.
   * Any risks or roll‑back notes.

---

## 10. Guardrails against drift

Finally, a few things agents must actively avoid:

* **No silent protocol changes**

  * Don’t change `FrameType` semantics, payload formats, or header layout without:

    * A spec update.
    * A full impact review.

* **No specless behavior**

  * If something matters at runtime (timeouts, retries, routing rules), it has to be in the docs, not just in someone’s head.

* **No bypassing of the router**

  * Do not introduce “temporary” direct calls from clients to microservices. All client HTTP should go via the gateway.

* **No direct dependencies on specific transports in domain code**

  * Domain and microservice endpoint logic must not know if the transport is TCP, TLS, UDP, or RabbitMQ. They only see `RawRequestContext`, `RawResponse`, and cancellation tokens.

---

If you want, I can turn this into a one‑page “Agent Handbook” markdown file you can drop into `docs/router/AGENTS_PROCESS.md` and link from `specs.md` so every AI or human dev working on this stack has the same ground rules.
41
docs/router/SPRINT_7000_0001_0001_router_skeleton.md
Normal file
@@ -0,0 +1,41 @@
# Sprint 7000·0001·0001 · Router Skeleton

## Topic & Scope

- Stand up the dedicated StellaOps Router repo skeleton under `docs/router` as per `specs.md` / `01-Step.md`.
- Produce the empty solution structure, projects, references, and placeholder docs ready for future transport/SDK work.
- Enforce .NET 10 (`net10.0`) across all new projects; ignore prior net8 defaults.
- **Working directory:** `docs/router`.

## Dependencies & Concurrency

- Depends on `docs/router/specs.md` remaining the authoritative requirements source.
- No upstream sprint blockers; this spin-off is self-contained.
- Can run in parallel with other repo work because it writes only under `docs/router`.

## Documentation Prerequisites

- `docs/router/specs.md`
- `docs/router/implplan.md`
- `docs/router/01-Step.md`

## Delivery Tracker

| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
| --- | --- | --- | --- | --- | --- |
| 1 | ROUTER-SKEL-SETUP | TODO | Read specs + step docs | Skeleton Agent | Create repo folders (`src/`, `src/__Libraries/`, `tests/`, `docs/router`) & add `README.md` pointer. |
| 2 | ROUTER-SKEL-SOLUTION | TODO | Task 1 | Skeleton Agent | Generate `StellaOps.Router.sln`, add Gateway + library + test projects targeting `net10.0`. |
| 3 | ROUTER-SKEL-REFS | TODO | Task 2 | Skeleton Agent | Wire project references per plan (Gateway→Common+Config, etc.). |
| 4 | ROUTER-SKEL-BUILDPROPS | TODO | Task 2 | Infra Agent | Add repo-level `Directory.Build.props` pinning `net10.0`, nullable, implicit usings. |
| 5 | ROUTER-SKEL-STUBS | TODO | Tasks 2-4 | Common/Microservice Agents | Add placeholder types/extension methods per `01-Step.md` (no logic). |
| 6 | ROUTER-SKEL-TESTS | TODO | Task 5 | QA Agent | Create dummy `[Fact]` tests in each test project so `dotnet test` passes. |
| 7 | ROUTER-SKEL-CI | TODO | Tasks 2-6 | Infra Agent | Configure a CI pipeline running `dotnet restore/build/test` on the solution. |

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-02 | Created sprint skeleton per router spin-off instructions. | Planning |

## Decisions & Risks

- Use the .NET 10 baseline even though other modules still target net8; future agents must not downgrade frameworks.
- Scope intentionally limited to `docs/router` to avoid cross-repo conflicts; any shared assets must be duplicated or referenced via documentation until later alignment.
- Risk: missing AGENTS.md for this folder; a future sprint should establish one if work extends beyond the skeleton.

## Next Checkpoints

- 2025-12-04: Verify the solution + CI scaffold are committed and passing.
356
docs/router/implplan.md
Normal file
@@ -0,0 +1,356 @@
Start by treating `docs/router/specs.md` as law. Nothing gets coded that contradicts it. The first sprint or two should be about *wiring the skeleton* and proving the core flows with the simplest possible transport, then layering in the real transports and migration paths.

I’d structure the work for your agents like this.

---

## 0. Read & freeze invariants

**All agents:**

* Read `docs/router/specs.md` end to end.
* Extract and pin the non-negotiables:

  * Method + Path identity.
  * Strict semver for versions.
  * Region from `GatewayNodeConfig.Region` (no host/header magic).
  * No HTTP transport for microservice communications.
  * A single connection carrying HELLO + HEARTBEAT + REQUEST/RESPONSE + CANCEL.
  * The router treats bodies as opaque bytes/streams.
  * `RequiringClaims` replaces any form of `AllowedRoles`.

Agree that these are invariants; any future idea that violates them needs an explicit spec change first.

---

## 1. Lay down the solution skeleton

**“Skeleton” agent (or gateway core agent):**

Create the basic project structure, no logic yet:

* `src/__Libraries/StellaOps.Router.Common`
* `src/__Libraries/StellaOps.Router.Config`
* `src/__Libraries/StellaOps.Microservice`
* `src/StellaOps.Gateway.WebService`
* `docs/router/` already has `specs.md` (add placeholders for the other docs).

Goal: everything builds, but most classes are empty or stubs.

---

## 2. Implement the shared core model (Common)

**Common/core agent:**

Implement only the *data* and *interfaces*, no behavior:

* Enums:

  * `TransportType`, `FrameType`, `InstanceHealthStatus`.

* Models:

  * `ClaimRequirement`
  * `EndpointDescriptor`
  * `InstanceDescriptor`
  * `ConnectionState`
  * `RoutingContext`, `RoutingDecision`
  * `PayloadLimits`

* Interfaces:

  * `IGlobalRoutingState`
  * `IRoutingPlugin`
  * `ITransportServer`
  * `ITransportClient`

* `Frame` struct/class:

  * `FrameType`, `CorrelationId`, `Payload` (byte[]).

Leave implementations of `IGlobalRoutingState`, `IRoutingPlugin`, transports, etc. for later steps.

Deliverable: a stable set of contracts that the gateway + microservice SDK depend on. A minimal sketch of a few of these shapes is shown below.
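This is illustrative only; exact property types and names are placeholders consistent with the list above:

```csharp
// Sketch of a few Common contracts. Frame types follow the protocol vocabulary
// used throughout these docs (HELLO, HEARTBEAT, ENDPOINTS_UPDATE, ...).
public enum FrameType
{
    Hello, Heartbeat, EndpointsUpdate,
    Request, Response,
    RequestStreamData, ResponseStreamData,
    Cancel
}

public sealed class Frame
{
    public FrameType FrameType { get; init; }
    public Guid CorrelationId { get; init; }
    public byte[] Payload { get; init; } = Array.Empty<byte>();
}

public sealed class ClaimRequirement
{
    public required string Type { get; init; }
    public string? Value { get; init; }
}

public sealed class EndpointDescriptor
{
    public required string ServiceName { get; init; }
    public required string Version { get; init; }   // strict semver
    public required string Method { get; init; }    // GET/POST/PUT/PATCH/DELETE
    public required string Path { get; init; }      // e.g. /billing/invoices
    public TimeSpan DefaultTimeout { get; set; }
    public bool SupportsStreaming { get; set; }
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; set; } = Array.Empty<ClaimRequirement>();
}
```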

---

## 3. Build a fake “in-memory” transport plugin

**Transport agent:**

Before UDP/TCP/Rabbit, build an **in-process transport**:

* `InMemoryTransportServer` and `InMemoryTransportClient`.
* They share a concurrent dictionary keyed by `ConnectionId`.
* Frames are passed via channels/queues in memory.

Purpose:

* Let you prove HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL semantics and routing logic *without* dealing with sockets and Rabbit yet.
* Let you unit- and integration-test the router and SDK quickly.

This plugin will never ship to production; it’s only for dev tests and CI. A minimal channel-based sketch follows.
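A minimal sketch, assuming the `Frame` shape from Common; the hub class name and method names are placeholders:

```csharp
using System.Collections.Concurrent;
using System.Threading.Channels;

// Sketch: an in-process "wire" shared by InMemoryTransportServer/Client.
// Each logical connection is a pair of unbounded channels keyed by ConnectionId;
// frames written on one side are read on the other, with no sockets involved.
public sealed class InMemoryHub
{
    private readonly ConcurrentDictionary<string, (Channel<Frame> ToRouter, Channel<Frame> ToService)> _connections = new();

    public (ChannelWriter<Frame> Send, ChannelReader<Frame> Receive) ConnectMicroservice(string connectionId)
    {
        var pair = _connections.GetOrAdd(connectionId, _ =>
            (Channel.CreateUnbounded<Frame>(), Channel.CreateUnbounded<Frame>()));
        return (pair.ToRouter.Writer, pair.ToService.Reader);
    }

    public (ChannelWriter<Frame> Send, ChannelReader<Frame> Receive) AttachRouter(string connectionId)
    {
        var pair = _connections.GetOrAdd(connectionId, _ =>
            (Channel.CreateUnbounded<Frame>(), Channel.CreateUnbounded<Frame>()));
        return (pair.ToService.Writer, pair.ToRouter.Reader);
    }
}
```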

---

## 4. Microservice SDK: minimal handshake & dispatch (with InMemory)

**Microservice agent:**

Initial focus: “connect and say HELLO, then handle a simple request.”

1. Implement `StellaMicroserviceOptions`.
2. Implement `AddStellaMicroservice(...)`:

   * Bind options.
   * Register endpoint handlers and SDK internal services.

3. Endpoint discovery:

   * Implement runtime reflection over `[StellaEndpoint]` + handler types.
   * Build the in-memory `EndpointDescriptor` list (simple: no YAML yet).

4. Connection:

   * Use `InMemoryTransportClient` to “connect” to a fake router.
   * On connect, send a HELLO frame with:

     * Identity.
     * The endpoint list and metadata (`SupportsStreaming` false for now, `RequiringClaims` empty).

5. Request handling:

   * Implement `IRawStellaEndpoint` and an adapter to it.
   * Implement `RawRequestContext` / `RawResponse`.
   * Implement a dispatcher that:

     * Receives a `Request` frame.
     * Builds a `RawRequestContext`.
     * Invokes the correct handler.
     * Sends a `Response` frame.

Do **not** handle streaming or cancellation yet; just basic request/response with small bodies. (A minimal endpoint-discovery sketch follows.)
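A minimal sketch of the discovery step, assuming the attribute exposes `Method` and `Path` properties; typed handlers (`IStellaEndpoint<,>`) are adapted separately and omitted here:

```csharp
using System.Reflection;

public static class EndpointDiscovery
{
    // Discover [StellaEndpoint]-annotated handler types in the given assembly
    // and project them into EndpointDescriptor instances for the HELLO frame.
    public static List<EndpointDescriptor> DiscoverEndpoints(Assembly assembly, string serviceName, string version)
    {
        var descriptors = new List<EndpointDescriptor>();
        foreach (var type in assembly.GetTypes())
        {
            var attr = type.GetCustomAttribute<StellaEndpointAttribute>();
            if (attr is null || !typeof(IRawStellaEndpoint).IsAssignableFrom(type))
                continue; // only concrete, annotated raw handlers participate at this step

            descriptors.Add(new EndpointDescriptor
            {
                ServiceName = serviceName,
                Version = version,
                Method = attr.Method,
                Path = attr.Path,
                SupportsStreaming = false,                       // step-4 default
                RequiringClaims = Array.Empty<ClaimRequirement>() // step-4 default
            });
        }
        return descriptors;
    }
}
```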

---

## 5. Gateway: minimal routing using InMemory plugin

**Gateway agent:**

Goal: HTTP → in-memory transport → microservice → HTTP response.

1. Implement `GatewayNodeConfig` and bind it from config.

2. Implement `IGlobalRoutingState` as a simple in-memory implementation that:

   * Holds `ConnectionState` objects.
   * Builds a map `(Method, Path)` → endpoint + connections.

3. Implement a minimal `IRoutingPlugin` that:

   * For now, just picks *any* connection that has the endpoint (no region/ping logic yet).

4. Implement a minimal HTTP pipeline:

   * `EndpointResolutionMiddleware`:

     * `(Method, Path)` → `EndpointDescriptor` from `IGlobalRoutingState`.

   * A naive authorization middleware stub (only checks “needs authenticated user”; ignore real `requiringClaims` for now).
   * `RoutingDecisionMiddleware`:

     * Asks `IRoutingPlugin` for a `RoutingDecision`.

   * `TransportDispatchMiddleware`:

     * Builds a `Request` frame.
     * Uses `InMemoryTransportClient` to send and await the `Response`.
     * Maps the response to HTTP.

5. Implement the HELLO handler on the gateway side:

   * When an InMemory “connection” from a microservice appears and sends HELLO:

     * Construct a `ConnectionState`.
     * Update `IGlobalRoutingState` with the endpoint → connection mapping.

Once this works, you have end-to-end:

* An example microservice.
* An example gateway.
* The in-memory transport.
* A couple of test endpoints returning simple JSON.

A minimal sketch of the buffered dispatch step follows.
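This is a sketch under stated assumptions: `SendRequestAsync` is an assumed helper on the transport client that awaits the matching `Response` frame by `CorrelationId`, and the real status-code mapping would come from a response envelope:

```csharp
using Microsoft.AspNetCore.Http;

// Sketch of the buffered dispatch step: build a Request frame from the HTTP
// request, send it over the chosen connection, and map the Response back.
public sealed class TransportDispatchMiddleware
{
    public TransportDispatchMiddleware(RequestDelegate next) { } // terminal middleware; next is unused

    public async Task InvokeAsync(HttpContext context, ITransportClient transport)
    {
        var decision = context.Features.Get<RoutingDecision>()!; // set by RoutingDecisionMiddleware

        using var buffer = new MemoryStream();
        await context.Request.Body.CopyToAsync(buffer, context.RequestAborted);

        var request = new Frame
        {
            FrameType = FrameType.Request,
            CorrelationId = Guid.NewGuid(),
            Payload = buffer.ToArray()
        };

        // Assumed helper: sends the frame and awaits the correlated Response.
        var response = await transport.SendRequestAsync(decision, request, context.RequestAborted);

        context.Response.StatusCode = StatusCodes.Status200OK; // real mapping comes from the response envelope
        await context.Response.Body.WriteAsync(response.Payload, context.RequestAborted);
    }
}
```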

---

## 6. Add heartbeat, health, and basic routing rules

**Common/core + gateway agent:**

Now enforce liveness and basic routing:

1. Heartbeat:

   * The microservice SDK sends HEARTBEAT frames on a timer.
   * The gateway updates `LastHeartbeatUtc` and `Status`.

2. Health:

   * Add a background job in the gateway that:

     * Marks instances Unhealthy if their heartbeat is stale.

3. Routing:

   * Enhance `IRoutingPlugin` to:

     * Filter out Unhealthy instances.
     * Prefer the gateway’s region (using `GatewayNodeConfig.Region`).
     * Use a simple `AveragePingMs` stub derived from request/response timings.

Still using the InMemory transport; just building the selection logic. (A minimal staleness-check sketch follows.)
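A minimal sketch of the health sweep, assuming `IGlobalRoutingState` exposes an `AllConnections()` enumeration (hypothetical here) and a mutable `Status`; the thresholds are assumed defaults, not spec values:

```csharp
using Microsoft.Extensions.Hosting;

// Sketch: a background sweep that marks instances Unhealthy when their
// heartbeat is older than a configured threshold. Runs inside the gateway.
public sealed class HeartbeatMonitor : BackgroundService
{
    private static readonly TimeSpan StaleAfter = TimeSpan.FromSeconds(30); // assumed default
    private readonly IGlobalRoutingState _state;

    public HeartbeatMonitor(IGlobalRoutingState state) => _state = state;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            var now = DateTime.UtcNow;
            foreach (var connection in _state.AllConnections()) // assumed enumeration helper
            {
                if (now - connection.LastHeartbeatUtc > StaleAfter)
                    connection.Status = InstanceHealthStatus.Unhealthy;
            }
            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }
}
```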

---

## 7. Add cancellation semantics (with InMemory)

**Microservice + gateway agents:**

Wire up cancellation logic before touching real transports:

1. Common:

   * Extend `FrameType` with `Cancel`.

2. Gateway:

   * In `TransportDispatchMiddleware`:

     * Tie `HttpContext.RequestAborted` to a `SendCancelAsync` call.
     * On timeout, send CANCEL.
     * Ignore late `Response`/stream data for canceled correlation IDs.

3. Microservice:

   * Maintain an `_inflight` map of correlation → `CancellationTokenSource`.
   * When a `Cancel` frame arrives, call `cts.Cancel()`.
   * Ensure handlers receive and honor the `CancellationToken`.

Prove via tests: if the client disconnects, the handler stops quickly. (A minimal sketch of the in-flight tracking follows.)
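A minimal sketch of the microservice-side tracking; the tracker class and `dispatchAsync` delegate are illustrative names, not the SDK’s actual API:

```csharp
using System.Collections.Concurrent;

// Sketch: each Request frame gets a CancellationTokenSource; a Cancel frame
// with the same CorrelationId cancels the running handler.
public sealed class InflightTracker
{
    private readonly ConcurrentDictionary<Guid, CancellationTokenSource> _inflight = new();

    public async Task OnRequestAsync(Frame frame, Func<Frame, CancellationToken, Task> dispatchAsync)
    {
        var cts = new CancellationTokenSource();
        _inflight[frame.CorrelationId] = cts;
        try
        {
            await dispatchAsync(frame, cts.Token); // runs the endpoint handler
        }
        finally
        {
            _inflight.TryRemove(frame.CorrelationId, out _);
            cts.Dispose(); // safe: the handler has finished by now
        }
    }

    public void OnCancel(Frame frame)
    {
        // Cancel but do not dispose; OnRequestAsync's finally block owns disposal.
        if (_inflight.TryGetValue(frame.CorrelationId, out var pending))
            pending.Cancel();
    }
}
```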

---

## 8. Add streaming & payload limits (still InMemory)

**Gateway + microservice agents:**

1. Streaming:

   * Extend the InMemory transport to support `RequestStreamData` / `ResponseStreamData` frames.
   * On the gateway:

     * For `SupportsStreaming` endpoints, pipe the HTTP body stream → frame stream.
     * For the response, pipe frames → the HTTP response stream.

   * On the microservice:

     * Expose `RawRequestContext.Body` as a stream reading frames as they arrive.
     * Allow `RawResponse.WriteBodyAsync` to stream out.

2. Payload limits:

   * Implement `PayloadLimits` enforcement at the gateway:

     * Early-reject a large `Content-Length`.
     * Track counters while streaming; trigger cancellation when thresholds are exceeded.

Demonstrate with a fake “upload” endpoint that uses `IRawStellaEndpoint` and streaming. (A minimal budget-tracking sketch follows.)
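A minimal sketch of per-call budget tracking, assuming `PayloadLimits` carries the `MaxRequestBytesPerCall` shown in the sample YAML; per-connection and aggregate budgets would follow the same pattern:

```csharp
// Sketch: per-call budget tracking while streaming. One instance per request,
// so no locking is needed. When TryConsume returns false, the gateway sends
// CANCEL for the correlation ID and fails the HTTP request (e.g. 413).
public sealed class PayloadBudget
{
    private readonly PayloadLimits _limits;
    private long _bytesThisCall;

    public PayloadBudget(PayloadLimits limits) => _limits = limits;

    // Early rejection for buffered requests with a declared length.
    public bool AllowsDeclaredLength(long? contentLength) =>
        contentLength is null || contentLength <= _limits.MaxRequestBytesPerCall;

    // Called for every streamed chunk; returns false once the budget is blown.
    public bool TryConsume(int chunkBytes)
    {
        _bytesThisCall += chunkBytes;
        return _bytesThisCall <= _limits.MaxRequestBytesPerCall;
    }
}
```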

---

## 9. Implement real transport plugins one by one

**Transport agent:**

Now replace InMemory with real transports.

Order:

1. **TCP plugin** (easiest baseline):

   * Length-prefixed frame protocol.
   * A connection per microservice instance (or multi-instance if needed later).
   * Implement HELLO/HEARTBEAT/REQUEST/RESPONSE/STREAM/CANCEL as per the frame model.

2. **Certificate (TLS) plugin**:

   * Wrap the TCP plugin with TLS.
   * Add configuration for server & client certs.

3. **UDP plugin**:

   * Single datagram = single frame; no streaming.
   * Enforce `MaxRequestBytesPerCall`.
   * Use for small, idempotent operations.

4. **RabbitMQ plugin**:

   * Add exchanges/queues for HELLO/HEARTBEAT and REQUEST/RESPONSE.
   * Use `CorrelationId` properties for matching.
   * Guarantee at-most-once semantics where practical.

While each plugin is built, keep the core router and microservice SDK relying only on the `ITransportClient`/`ITransportServer` abstractions. (A minimal framing sketch follows.)
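A minimal sketch of length-prefixed framing for the TCP baseline. The exact byte layout here is illustrative, not the spec’s wire format: `[4-byte little-endian length][1-byte FrameType][16-byte CorrelationId][payload]`:

```csharp
using System.Buffers.Binary;

public static class TcpFraming
{
    public static async Task WriteFrameAsync(Stream stream, Frame frame, CancellationToken ct)
    {
        var header = new byte[4 + 1 + 16];
        // Length covers everything after the 4-byte prefix.
        BinaryPrimitives.WriteInt32LittleEndian(header, 1 + 16 + frame.Payload.Length);
        header[4] = (byte)frame.FrameType;
        frame.CorrelationId.TryWriteBytes(header.AsSpan(5, 16));
        await stream.WriteAsync(header, ct);
        await stream.WriteAsync(frame.Payload, ct);
    }

    public static async Task<Frame> ReadFrameAsync(Stream stream, CancellationToken ct)
    {
        var lengthBuf = new byte[4];
        await stream.ReadExactlyAsync(lengthBuf, ct);
        var length = BinaryPrimitives.ReadInt32LittleEndian(lengthBuf);

        var body = new byte[length];
        await stream.ReadExactlyAsync(body, ct);

        return new Frame
        {
            FrameType = (FrameType)body[0],
            CorrelationId = new Guid(body.AsSpan(1, 16)),
            Payload = body[17..]
        };
    }
}
```

A production version would also cap `length` against the configured payload limits before allocating.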

---

## 10. Add Router.Config + Microservice YAML integration

**Config agent:**

1. Implement `__Libraries/StellaOps.Router.Config`:

   * YAML → `RouterConfig` binding.
   * Services, endpoints, static instances, payload limits.
   * Hot-reload via `IOptionsMonitor` / file watcher.

2. Implement the microservice YAML:

   * Endpoint-level overrides only (timeouts, `requiringClaims`, `SupportsStreaming`).
   * Merge logic: code defaults → YAML override.

3. Integrate:

   * The gateway uses `RouterConfig` for:

     * Defaults when no microservice is registered yet.
     * Payload limits.

   * The microservice uses YAML to refine endpoint metadata before sending HELLO.

---

## 11. Build a reference example + migration skeleton

**DX / migration agent:**

1. Build a `StellaOps.Billing.Microservice` example:

   * A couple of simple endpoints (GET/POST).
   * One streaming upload endpoint.
   * YAML for `requiringClaims` and timeouts.

2. Build a `StellaOps.Gateway.WebService` example config around it.

3. Document the full path:

   * How to run both locally.
   * How to add a new endpoint.
   * How cancellation behaves (kill the client, watch the logs).
   * How payload limits work (try to upload a too-large file).

4. Outline the migration steps from an imaginary `StellaOps.Billing.WebService` using the patterns in `Migration of Webservices to Microservices.md`.

---

## 12. Process guidance for your agents

* **Do not jump to UDP/TCP immediately.**
  Prove the protocol (HELLO/HEARTBEAT/REQUEST/RESPONSE/STREAM/CANCEL), routing, and limits on the InMemory plugin first.

* **Guard the invariants.**
  If someone proposes “just call HTTP between services” or “let’s derive region from host,” they’re violating the spec and must update `docs/router/specs.md` before coding.

* **Keep Common stable.**
  Changes to `StellaOps.Router.Common` must be rare and reviewed; everything else depends on it.

* **Document as you go.**
  Every time a behavior settles (e.g. status mapping, frame layout), update the docs under `docs/router/` so new agents always have a single source of truth.

If you want, next step I can convert this into a task board (epic → stories) per repo folder, so you can assign specific chunks to named agents.

494
docs/router/specs.md
Normal file
@@ -0,0 +1,494 @@
I’ll group everything into requirement buckets, but keep it all as requirement statements (no rationale). This is the union of what you asked for or confirmed across the whole thread.

---

## 1. Architectural / scope requirements

* There SHALL be a single HTTP ingress service named `StellaOps.Gateway.WebService`.
* Microservices SHALL NOT expose HTTP to the router; all microservice-to-router traffic (control + data) MUST use in-house transports (UDP, TCP, certificate/TLS, RabbitMQ).
* There SHALL NOT be a separate control-plane service or protocol; each transport connection between a microservice and the router MUST carry:

  * Initial registration (HELLO) and endpoint configuration.
  * Ongoing heartbeats.
  * Endpoint updates (if any).
  * Request/response and streaming data.

* The router SHALL maintain per-connection endpoint mappings and derive its global routing state from the union of all live connections.
* The router SHALL treat request and response bodies as opaque (raw bytes / streams); all deserialization and schema handling SHALL be the microservice’s responsibility.
* The system SHALL support both buffered and streaming request/response flows end-to-end.
* The design MUST reuse only the generic parts of `__SerdicaTemplate` (dynamic endpoint metadata, attribute-based endpoint discovery, request routing patterns, correlation, connection management) and MUST drop the Serdica-specific stack (Oracle schema, domain logic, etc.).
* The solution MUST be a simpler, generic replacement for the existing Serdica HTTP→RabbitMQ→microservice design.

---

## 2. Service identity, region, versioning

* Each microservice instance SHALL be identified by `(ServiceName, Version, Region, InstanceId)`.
* `Version` MUST follow strict semantic versioning (`major.minor.patch`).
* Routing MUST be strict on version:

  * The router MUST only route a request to instances whose `Version` equals the selected version.
  * When a version is not explicitly specified by the client, a default version MUST be used (from config or metadata).

* Each gateway node SHALL have a static configuration object `GatewayNodeConfig` containing at least:

  * `Region` (e.g. `"eu1"`).
  * `NodeId` (e.g. `"gw-eu1-01"`).
  * `Environment` (e.g. `"prod"`).

* Routing decisions MUST use `GatewayNodeConfig.Region` as the node’s region; the router MUST NOT derive region from HTTP headers or URL host names.
* DNS/host naming conventions SHOULD express region in the domain (e.g. `eu1.global.stella-ops.org`, `mainoffice.contoso.stella-ops.org`), but routing logic MUST be driven by `GatewayNodeConfig.Region` rather than by host parsing.

---

## 3. Endpoint identity and metadata

* Endpoint identity in the router and microservices MUST be `HTTP Method + Path`, for example:

  * `Method`: one of `GET`, `POST`, `PUT`, `PATCH`, `DELETE`.
  * `Path`: e.g. `/section/get/{id}`.

* The router and microservices MUST use the same path template syntax and matching rules (e.g. ASP.NET-style route templates), including decisions on:

  * Case sensitivity.
  * Trailing-slash handling.
  * Parameter segments (e.g. `{id}`).

* The router MUST resolve an incoming HTTP `(Method, Path)` to a logical endpoint descriptor that includes:

  * ServiceName.
  * Version.
  * Method.
  * Path.
  * DefaultTimeout.
  * `RequiringClaims`: a list of claim requirements.
  * A flag indicating whether the endpoint supports streaming.

* Every place that previously spoke of `AllowedRoles` MUST be replaced with `RequiringClaims`:

  * Each requirement MUST at minimum contain a `Type` and MAY contain a `Value`.

* Endpoints MUST support being configured with default `RequiringClaims` in microservices, with the possibility of external override (see the Authority section).

---

## 4. Routing algorithm / instance selection

* Given a resolved endpoint `(ServiceName, Version, Method, Path)`, the router MUST:

  * Filter candidate instances by:

    * Matching `ServiceName`.
    * Matching `Version` (strict semver equality).
    * Health in an acceptable set (e.g. `Healthy` or `Degraded`).

* Instances MUST have health metadata:

  * `Status` ∈ {`Unknown`, `Healthy`, `Degraded`, `Draining`, `Unhealthy`}.
  * `LastHeartbeatUtc`.
  * `AveragePingMs`.

* The router’s instance selection MUST obey these rules:

  * Region:

    * Prefer instances whose `Region == GatewayNodeConfig.Region`.
    * If none, fall back to configured neighbor regions.
    * If none, fall back to all other regions.

  * Within a chosen region tier:

    * Prefer lower `AveragePingMs`.
    * If several are tied, prefer a more recent `LastHeartbeatUtc`.
    * If still tied, use a balancing strategy (e.g. random or round-robin).

* The router MUST support a strict fallback order, as requested:

  * Prefer “closest by region and heartbeat and ping.”
  * If having to choose between worse candidates, fall back in order of:

    * Greater ping (latency).
    * Greater heartbeat age.
    * Less preferred region tier.
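Illustrative only (the requirements above are normative): one way to express this ordering over pre-filtered candidates with LINQ, assuming a `regionTier` function that maps a candidate’s region to 0 (local), 1 (neighbor), or 2 (other):

```csharp
using System.Linq;

public static class InstanceSelection
{
    // Sketch: order healthy candidates per the rules above; the first element wins.
    // regionTier: 0 = gateway's own region, 1 = configured neighbor, 2 = anything else.
    public static ConnectionState? SelectInstance(
        IEnumerable<ConnectionState> healthyCandidates,
        Func<ConnectionState, int> regionTier,
        Random balancer)
    {
        return healthyCandidates
            .OrderBy(regionTier)                          // prefer the closer region tier
            .ThenBy(c => c.AveragePingMs)                 // then lower latency
            .ThenByDescending(c => c.LastHeartbeatUtc)    // then the fresher heartbeat
            .ThenBy(_ => balancer.Next())                 // tie-break: random balancing
            .FirstOrDefault();
    }
}
```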

---

## 5. Transport plugin requirements
|
||||||
|
|
||||||
|
* There MUST be a transport plugin abstraction representing how the router and microservices communicate.
|
||||||
|
* The default transport type MUST be UDP.
|
||||||
|
* Additional supported transport types MUST include:
|
||||||
|
|
||||||
|
* TCP.
|
||||||
|
* Certificate-based TCP (TLS / mTLS).
|
||||||
|
* RabbitMQ.
|
||||||
|
* There MUST NOT be an HTTP transport plugin; HTTP MUST NOT be used for microservice-to-router communications (control or data).
|
||||||
|
* Each transport plugin MUST support:
|
||||||
|
|
||||||
|
* Establishing logical connections between microservices and the router.
|
||||||
|
* Sending/receiving HELLO (registration), HEARTBEAT, optional ENDPOINTS_UPDATE.
|
||||||
|
* Sending/receiving REQUEST/RESPONSE frames.
|
||||||
|
* Supporting streaming via REQUEST_STREAM_DATA / RESPONSE_STREAM_DATA frames where the transport allows it.
|
||||||
|
* Sending/receiving CANCEL frames to abort specific in-flight requests.
|
||||||
|
* UDP transport:
|
||||||
|
|
||||||
|
* MUST be used only for small/bounded payloads (no unbounded streaming).
|
||||||
|
* MUST respect configured `MaxRequestBytesPerCall`.
|
||||||
|
* TCP and Certificate transports:
|
||||||
|
|
||||||
|
* MUST implement a length-prefixed framing protocol capable of multiplexing frames for multiple correlation IDs.
|
||||||
|
* Certificate transport MUST enforce TLS and support optional mutual TLS (verifiable peer identity).
|
||||||
|
* RabbitMQ:
|
||||||
|
|
||||||
|
* MUST implement queue/exchange naming and routing keys sufficient to represent logical connections and correlation IDs.
|
||||||
|
* MUST use message properties (e.g. `CorrelationId`) for request/response matching.
|
||||||
|
|
||||||
|
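A minimal sketch of the plugin abstraction, assuming hypothetical names (`ITransportPlugin`, `Frame`, `FrameType`) rather than a settled contract:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical frame and plugin shapes; names are assumptions for illustration.
public enum FrameType
{
    Hello, Heartbeat, EndpointsUpdate,
    Request, Response,
    RequestStreamData, ResponseStreamData,
    Cancel
}

public sealed record Frame(FrameType Type, Guid CorrelationId, ReadOnlyMemory<byte> Payload);

public interface ITransportConnection : IAsyncDisposable
{
    ValueTask SendAsync(Frame frame, CancellationToken ct);

    // Frames arrive via a callback; transports that cannot stream
    // (e.g. UDP) never surface *_STREAM_DATA frames.
    event Func<Frame, ValueTask> FrameReceived;
}

public interface ITransportPlugin
{
    string TransportType { get; }      // "udp", "tcp", "tls", "rabbitmq"
    bool SupportsStreaming { get; }
    ValueTask<ITransportConnection> ConnectAsync(Uri endpoint, CancellationToken ct);
}
```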
---

## 6. Gateway (`StellaOps.Gateway.WebService`) requirements

### 6.1 HTTP ingress pipeline

* The gateway MUST host an ASP.NET Core HTTP server.

* The HTTP middleware pipeline MUST include at least (a registration sketch follows this list):

  * Forwarded-headers handling (when behind a reverse proxy).
  * Request logging (e.g. via Serilog) including correlation ID, service, endpoint, region, and instance.
  * Global error-handling middleware.
  * Authentication middleware.
  * `EndpointResolutionMiddleware` to resolve `(Method, Path)` → endpoint.
  * Authorization middleware that enforces `RequiringClaims`.
  * `RoutingDecisionMiddleware` to choose connection/instance/transport.
  * `TransportDispatchMiddleware` to carry out buffered or streaming dispatch.

* The gateway MUST read `Method` and `Path` from the HTTP request and use them to resolve endpoints.
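A `Program.cs` ordering sketch for a gateway node. `UseSerilogRequestLogging` comes from the Serilog.AspNetCore package; `GlobalErrorHandlingMiddleware` and `ClaimsAuthorizationMiddleware` are hypothetical names for the error-handling and authorization components above, alongside the middlewares the spec names explicitly:

```csharp
using Serilog;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseForwardedHeaders();                            // when behind a reverse proxy
app.UseSerilogRequestLogging();                       // correlation ID, service, endpoint, region, instance
app.UseMiddleware<GlobalErrorHandlingMiddleware>();   // hypothetical name
app.UseAuthentication();
app.UseMiddleware<EndpointResolutionMiddleware>();    // (Method, Path) -> endpoint descriptor
app.UseMiddleware<ClaimsAuthorizationMiddleware>();   // hypothetical name; enforces RequiringClaims
app.UseMiddleware<RoutingDecisionMiddleware>();       // connection / instance / transport
app.UseMiddleware<TransportDispatchMiddleware>();     // buffered or streaming dispatch

app.Run();
```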
### 6.2 Per-connection state and routing view

* The gateway MUST maintain a `ConnectionState` per logical connection that includes:

  * ConnectionId.
  * `InstanceDescriptor` (`InstanceId`, `ServiceName`, `Version`, `Region`).
  * `Status`, `LastHeartbeatUtc`, `AveragePingMs`.
  * The set of endpoints that this connection serves (`(Method, Path)` → `EndpointDescriptor`).
  * The transport type for that connection.

* The gateway MUST maintain a global routing state (`IGlobalRoutingState`) that:

  * Resolves `(Method, Path)` to an `EndpointDescriptor` (service, version, metadata).
  * Provides the set of `ConnectionState` objects that can handle a given `(ServiceName, Version, Method, Path)`.

### 6.3 Buffered vs streaming dispatch

* The gateway MUST support:

  * **Buffered mode** for small to medium payloads:

    * Read the entire HTTP body into memory (or a temp file when above a threshold).
    * Send it as a single REQUEST payload.

  * **Streaming mode** for large or unknown-length content:

    * Stream from the HTTP body to the microservice via a sequence of REQUEST_STREAM_DATA frames.
    * Stream from the microservice back to HTTP via RESPONSE_STREAM_DATA frames.

* For each endpoint, the gateway MUST know whether it can use streaming or must use buffered mode (`SupportsStreaming` flag).

### 6.4 Opaque body handling

* The gateway MUST treat request and response bodies as opaque byte sequences and MUST NOT attempt to deserialize or interpret payload contents.

* The gateway MUST forward headers and body bytes as given and leave any schema, JSON, or other decoding to the microservice.

### 6.5 Payload and memory protection

* The gateway MUST enforce configured payload limits (see the guard sketch at the end of this section):

  * `MaxRequestBytesPerCall`.
  * `MaxRequestBytesPerConnection`.
  * `MaxAggregateInflightBytes`.

* If `Content-Length` is known and exceeds `MaxRequestBytesPerCall`, the gateway MUST reject the request early (e.g. HTTP 413 Payload Too Large).

* During streaming, the gateway MUST maintain counters of:

  * Bytes read for this request.
  * Bytes read for this connection.
  * Total in-flight bytes across all requests.

* If any limit is exceeded mid-stream, the gateway MUST:

  * Stop reading the HTTP body.
  * Send a CANCEL frame for that correlation ID.
  * Abort the stream to the microservice.
  * Return an appropriate error to the client (e.g. 413 or 503) and log the incident.
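A minimal mid-stream accounting sketch, assuming hypothetical shapes (`PayloadLimits`, `StreamingGuard`). A real guard would also subtract from the counters when a request completes or aborts, and keep the per-connection counter in shared per-connection state:

```csharp
using System.Threading;

// Hypothetical sketch: one guard per in-flight request.
public sealed record PayloadLimits(
    long MaxRequestBytesPerCall,
    long MaxRequestBytesPerConnection,
    long MaxAggregateInflightBytes);

public sealed class StreamingGuard
{
    private static long _aggregateInflightBytes;     // across all requests (process-wide)
    private long _requestBytes;                      // this request
    private long _connectionBytes;                   // this connection (simplified here)

    // Call before forwarding each chunk; false => stop reading the HTTP body,
    // send CANCEL, abort the stream, and return 413/503 to the client.
    public bool TryAccount(int chunkLength, PayloadLimits limits)
    {
        _requestBytes += chunkLength;
        _connectionBytes += chunkLength;
        long aggregate = Interlocked.Add(ref _aggregateInflightBytes, chunkLength);

        return _requestBytes <= limits.MaxRequestBytesPerCall
            && _connectionBytes <= limits.MaxRequestBytesPerConnection
            && aggregate <= limits.MaxAggregateInflightBytes;
    }
}
```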
---

## 7. Microservice SDK (`__Libraries/StellaOps.Microservice`) requirements

### 7.1 Identity & router connections

* `StellaMicroserviceOptions` MUST let microservices configure (a sketch of the options surface follows this list):

  * `ServiceName`.
  * `Version`.
  * `Region`.
  * `InstanceId`.
  * A list of router endpoints (`Routers` / router pool) including host, port, and transport type for each.
  * An optional path to a YAML config file for endpoint-level overrides.

* Providing the router pool (`Routers`) MUST be mandatory; a microservice cannot start without at least one configured router endpoint.

* The router pool SHOULD be configurable via code and MAY optionally be configured via YAML with hot-reload (causing reconnections if changed).
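A sketch of that options surface; the listed members mirror the requirements above, while the nested `RouterEndpointOptions` shape and property names are assumptions:

```csharp
using System.Collections.Generic;

// Hypothetical sketch of StellaMicroserviceOptions.
public sealed class RouterEndpointOptions
{
    public required string Host { get; init; }
    public required int Port { get; init; }
    public required string TransportType { get; init; }  // "udp", "tcp", "tls", "rabbitmq"
}

public sealed class StellaMicroserviceOptions
{
    public required string ServiceName { get; init; }
    public required string Version { get; init; }
    public required string Region { get; init; }
    public required string InstanceId { get; init; }
    public required IReadOnlyList<RouterEndpointOptions> Routers { get; init; }  // at least one
    public string? EndpointOverridesYamlPath { get; init; }                      // optional
}
```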
### 7.2 Endpoint definition & discovery

* Microservice endpoints MUST be declared using attributes that specify `(Method, Path)`:

```csharp
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : ...
```

* The SDK MUST support two handler shapes (a typed example follows this subsection):

  * Raw handler:

    * `IRawStellaEndpoint`, taking a `RawRequestContext` and returning a `RawResponse`, where:

      * `RawRequestContext.Body` is a stream (may be buffered or streaming).
      * Body contents are raw bytes.

  * Typed handlers:

    * `IStellaEndpoint<TRequest, TResponse>`, which takes a typed request and returns a typed response.
    * `IStellaEndpoint<TResponse>`, which has no request payload and returns a typed response.

* The SDK MUST adapt typed endpoints to the raw model internally (microservice-side only), leaving the router unaware of types.

* Endpoint discovery MUST work by:

  * Runtime reflection: scanning assemblies for `[StellaEndpoint]` and handler interfaces.
  * Build-time discovery via source generation:

    * A Roslyn source generator MUST generate a descriptor list at build time.
    * At runtime, the SDK MUST prefer source-generated metadata and only fall back to reflection if generation is not available.
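The typed example promised above; the `HandleAsync` signature and the request/response DTOs are assumptions layered on the `IStellaEndpoint<TRequest, TResponse>` contract:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: DTOs and the exact interface signature are assumptions.
public sealed record CreateInvoiceRequest(string CustomerId, decimal Amount);
public sealed record CreateInvoiceResponse(string InvoiceId);

[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint
    : IStellaEndpoint<CreateInvoiceRequest, CreateInvoiceResponse>
{
    public async Task<CreateInvoiceResponse> HandleAsync(
        CreateInvoiceRequest request, CancellationToken ct)
    {
        // Pass the token to all downstream I/O (DB, file, network), per §7.6.
        var invoiceId = await SaveInvoiceAsync(request, ct);
        return new CreateInvoiceResponse(invoiceId);
    }

    private static Task<string> SaveInvoiceAsync(CreateInvoiceRequest r, CancellationToken ct)
        => Task.FromResult(Guid.NewGuid().ToString("n"));  // stand-in for real persistence
}
```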
### 7.3 Endpoint metadata defaults & overrides

* Microservices MUST be able to provide default endpoint metadata:

  * `SupportsStreaming` flag.
  * Default timeout.
  * Default `RequiringClaims`.

* Microservice-local YAML MUST be allowed to override or refine these defaults per endpoint, keyed by `(Method, Path)`.

* Precedence rules MUST be clearly defined and honored:

  * Service identity & router pool: from `StellaMicroserviceOptions` (not YAML).
  * Endpoint set: from code (attributes/source gen); YAML MAY override properties but ideally not create endpoints absent from code (a policy decision to be documented).
  * `RequiringClaims` and timeouts: YAML overrides defaults from code, unless overridden by the central Authority.

### 7.4 Connection behavior

* On establishing a connection to a router endpoint, the SDK MUST:

  * Immediately send a HELLO frame containing:

    * `ServiceName`, `Version`, `Region`, `InstanceId`.
    * The list of endpoints `(Method, Path)` with their metadata (`SupportsStreaming`, default timeouts, default `RequiringClaims`).

* At regular intervals, the SDK MUST send HEARTBEAT frames on each connection indicating:

  * Instance health status.
  * Optional metrics (e.g. in-flight request count, error rate).

* The SDK SHOULD support an optional ENDPOINTS_UPDATE (or a re-HELLO) to update endpoint metadata at runtime if needed.

### 7.5 Request handling & streaming

* For each incoming REQUEST frame:

  * The SDK MUST create a `RawRequestContext` with:

    * Method.
    * Path.
    * Headers.
    * A `Body` stream that either:

      * Wraps a buffered byte array.
      * Or exposes streaming reads from subsequent REQUEST_STREAM_DATA frames.

    * A `CancellationToken` that will be cancelled when the router sends a CANCEL frame or the connection fails.

* The SDK MUST resolve the correct endpoint handler by `(Method, Path)` using the same path template rules as the router.

* For streaming endpoints, handlers MUST be able to read from `RawRequestContext.Body` incrementally and obey the `CancellationToken`.

### 7.6 Cancellation handling (microservice side)

* The SDK MUST maintain a map of in-flight requests by correlation ID (see the registry sketch below), each entry containing:

  * A `CancellationTokenSource`.
  * The task executing the handler.

* Upon receiving a CANCEL frame for a given correlation ID, the SDK MUST look up the corresponding entry and call `CancellationTokenSource.Cancel()`.

* Handlers (both raw and typed) MUST receive a `CancellationToken`:

  * They MUST observe the token and be coded to cancel promptly where needed.
  * They MUST pass the token to downstream I/O operations (DB calls, file I/O, network).

* If the transport connection is closed, the SDK MUST treat it as a cancellation trigger for all outstanding requests on that connection and cancel their tokens.
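A minimal in-flight registry sketch (names hypothetical; a real implementation would also track the executing handler task per entry, as required above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch: correlation ID -> CancellationTokenSource.
public sealed class InflightRequestRegistry
{
    private readonly ConcurrentDictionary<Guid, CancellationTokenSource> _inflight = new();

    // Called when a REQUEST frame arrives; the returned token goes to the handler.
    public CancellationToken Register(Guid correlationId)
    {
        var cts = new CancellationTokenSource();
        _inflight[correlationId] = cts;
        return cts.Token;
    }

    // Called on a CANCEL frame for one correlation ID.
    public void Cancel(Guid correlationId)
    {
        if (_inflight.TryRemove(correlationId, out var cts)) cts.Cancel();
    }

    // Called when the transport connection closes: cancel everything outstanding.
    public void CancelAll()
    {
        foreach (var id in _inflight.Keys) Cancel(id);
    }

    // Called when a handler completes normally.
    public void Complete(Guid correlationId) => _inflight.TryRemove(correlationId, out _);
}
```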
---

## 8. Control / health / ping requirements

* Heartbeats MUST be sent over the same connection as requests (no separate control channel).

* The router MUST:

  * Track `LastHeartbeatUtc` for each connection.
  * Derive `InstanceHealthStatus` from heartbeat recency and, optionally, metrics (a derivation sketch follows this list).
  * Drop, or mark as `Unhealthy`, any instance whose heartbeats are stale past configured thresholds.

* The router SHOULD measure network latency (ping) by:

  * Timing request-response round trips, or
  * Using explicit ping frames,

  updating `AveragePingMs` for each connection either way.

* The router MUST use heartbeat and ping metrics in its routing decisions as described above.
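A minimal derivation sketch from heartbeat age alone; the threshold parameters are assumptions, and a real implementation could also weigh reported metrics:

```csharp
using System;

public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }

public static class HealthDerivation
{
    // Hypothetical thresholds: e.g. degradedAfter = 10s, unhealthyAfter = 30s.
    public static InstanceHealthStatus FromHeartbeat(
        DateTime lastHeartbeatUtc, DateTime nowUtc,
        TimeSpan degradedAfter, TimeSpan unhealthyAfter)
    {
        var age = nowUtc - lastHeartbeatUtc;
        if (age >= unhealthyAfter) return InstanceHealthStatus.Unhealthy;
        if (age >= degradedAfter) return InstanceHealthStatus.Degraded;
        return InstanceHealthStatus.Healthy;
    }
}
```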
---

## 9. Authorization / RequiringClaims / Authority requirements

* `RequiringClaims` MUST be the only authorization metadata field; `AllowedRoles` MUST NOT be used.

* Every endpoint MUST be able to specify:

  * An empty `RequiringClaims` list (no claims required beyond being authenticated).
  * Or one or more `ClaimRequirement` objects (`Type` + optional `Value`).

* The gateway MUST enforce `RequiringClaims` per request (see the check sketch below):

  * Authorization MUST verify that the request’s user principal has all required claims for the endpoint.

* Microservices MUST provide default `RequiringClaims` as part of their HELLO metadata.

* There MUST be a mechanism for an external Authority service to override `RequiringClaims` centrally:

  * Defaults MUST come from microservices.
  * Authority MUST be able to push or supply overrides that the gateway applies at startup and/or at runtime.
  * The gateway MUST proactively request such overrides on startup (e.g. via a dedicated message or mechanism) before handling traffic, or as early as practical.

* The final, effective `RequiringClaims` enforced at the gateway MUST be derived from microservice defaults plus Authority overrides, with Authority taking precedence where applicable.
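A sketch of the “all required claims present” rule, reusing the earlier `ClaimRequirement` sketch:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

public static class ClaimsEnforcement
{
    // True when the principal carries every requirement: matching Type,
    // and matching Value when the requirement specifies one.
    public static bool Satisfies(
        ClaimsPrincipal principal, IEnumerable<ClaimRequirement> requirements) =>
        requirements.All(req =>
            principal.Claims.Any(c =>
                c.Type == req.Type &&
                (req.Value is null || c.Value == req.Value)));
}
```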
---

## 10. Cancellation requirements (router side)

* The protocol MUST define a `FrameType.Cancel` with:

  * A `CorrelationId` indicating which request to cancel.
  * An optional payload containing a reason code (e.g. `"ClientDisconnected"`, `"Timeout"`, `"PayloadLimitExceeded"`).

* The router MUST send CANCEL frames when (a disconnect-wiring sketch follows this list):

  * The HTTP client disconnects (ASP.NET’s `HttpContext.RequestAborted` fires) while the request is in progress.
  * The router’s effective timeout for the request elapses and no response has been received.
  * The router detects payload/memory limit breaches and has to abort the request.
  * The router is shutting down and explicitly aborts in-flight requests (if implemented).

* The router MUST:

  * Stop forwarding any additional REQUEST_STREAM_DATA to the microservice once a CANCEL has been sent.
  * Stop reading any remaining response frames for that correlation ID and either:

    * Discard them.
    * Or treat them as late, log them, and ignore them.

* For streaming responses, if the HTTP client disconnects or the router cancels:

  * The router MUST stop writing to the HTTP response and ignore any subsequent frames.
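One possible wiring of the client-disconnect trigger; `sendCancelAsync` is an assumed hook into the transport layer, not a defined API:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Hypothetical helper: fires a CANCEL frame (fire-and-forget) when the
// HTTP client disconnects mid-request.
public static class CancelOnDisconnect
{
    public static IDisposable Wire(
        HttpContext http, Guid correlationId,
        Func<Guid, string, ValueTask> sendCancelAsync)
    {
        // RequestAborted fires when the client goes away; dispose the
        // registration once the request completes normally.
        return http.RequestAborted.Register(() =>
            _ = sendCancelAsync(correlationId, "ClientDisconnected"));
    }
}
```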
---

## 11. Configuration and YAML requirements

* `__Libraries/StellaOps.Router.Config` MUST handle (a sketch of the bound shape follows this section):

  * Binding router config from JSON/appsettings + YAML + environment variables.
  * Static service definitions:

    * ServiceName.
    * DefaultVersion.
    * DefaultTransport.
    * An endpoint list `(Method, Path)` with default timeouts, `RequiringClaims`, and streaming flags.

  * Static instance definitions (optional):

    * ServiceName, Version, Region, supported transports, plugin-specific settings.

  * Global payload limits (`PayloadLimits`).

* Router YAML config MUST support hot-reload:

  * Changes SHOULD be picked up at runtime without restarting the gateway.
  * Hot-reload MUST cause in-memory routing state to be updated, including:

    * New or removed services/endpoints.
    * New or removed instances (static).
    * Updated payload limits.

* Microservice YAML config MUST be optional and used for endpoint-level overrides only, not for identity or router pool configuration.

* The router pool for microservices MUST be configured via code and MAY be backed by YAML (with hot-plug / reconnection behavior) if desired.
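A sketch of the bound configuration shape; the POCO and property names are assumptions mirroring the list above, and `ClaimRequirement`/`PayloadLimits` reuse the earlier sketches:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical config POCOs for StellaOps.Router.Config binding.
public sealed class RouterConfig
{
    public List<ServiceDefinition> Services { get; init; } = new();
    public List<StaticInstanceDefinition> StaticInstances { get; init; } = new();
    public PayloadLimits? PayloadLimits { get; init; }
}

public sealed class ServiceDefinition
{
    public required string ServiceName { get; init; }
    public required string DefaultVersion { get; init; }
    public required string DefaultTransport { get; init; }
    public List<EndpointConfig> Endpoints { get; init; } = new();
}

public sealed class EndpointConfig
{
    public required string Method { get; init; }
    public required string Path { get; init; }
    public TimeSpan? DefaultTimeout { get; init; }
    public List<ClaimRequirement> RequiringClaims { get; init; } = new();
    public bool SupportsStreaming { get; init; }
}

public sealed class StaticInstanceDefinition
{
    public required string ServiceName { get; init; }
    public required string Version { get; init; }
    public required string Region { get; init; }
    public List<string> Transports { get; init; } = new();
}
```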
---

## 12. Library naming / repo structure requirements

* The router configuration library MUST be named `__Libraries/StellaOps.Router.Config`.

* The microservice SDK library MUST be named `__Libraries/StellaOps.Microservice`.

* The gateway webservice MUST be named `StellaOps.Gateway.WebService`.

* There MUST be a “common” library for shared types and abstractions (e.g. `__Libraries/StellaOps.Router.Common`).

* Documentation files MUST include at least:

  * `Stella Ops Router.md` (what it is, why, high-level architecture).
  * `Stella Ops Router - Webserver.md` (how the webservice works).
  * `Stella Ops Router - Microservice.md` (how the microservice SDK works and is implemented).
  * `Stella Ops Router - Common.md` (common components and how they are implemented).
  * `Migration of Webservices to Microservices.md`.
  * `Stella Ops Router Documentation.md` (doc structure & guidance).
---

## 13. Documentation & developer-experience requirements

* The docs MUST be detailed; “do not spare details” implies:

  * High-fidelity, concrete examples rather than hand-wavy descriptions.

* For average C# developers, documentation MUST cover:

  * The exact .NET / ASP.NET Core target version and runtime baseline.
  * Required NuGet packages (logging, serialization, YAML parsing, RabbitMQ client, etc.).
  * Exact serialization formats for frames and payloads (JSON vs MessagePack vs others).
  * Exact framing rules for each transport (length-prefix for TCP/TLS, datagrams for UDP, exchanges/queues for RabbitMQ).
  * Concrete sample `Program.cs` files for:

    * A gateway node.
    * A microservice.

  * Example endpoint implementations:

    * Typed (with and without a request).
    * Raw streaming endpoints for large payloads.

  * Example router YAML and microservice YAML with realistic values.
  * An error and HTTP status mapping policy, e.g. “version not found → 404 or 400; no instance available → 503; timeout → 504; payload too large → 413” (see the mapping sketch below).
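That mapping policy, as a hedged sketch; the `RoutingError` enum is an assumption, while the status codes are the ones quoted above:

```csharp
using Microsoft.AspNetCore.Http;

public enum RoutingError { VersionNotFound, NoInstanceAvailable, Timeout, PayloadTooLarge }

public static class StatusMapping
{
    public static int ToHttpStatus(this RoutingError error) => error switch
    {
        RoutingError.VersionNotFound     => StatusCodes.Status404NotFound,
        RoutingError.NoInstanceAvailable => StatusCodes.Status503ServiceUnavailable,
        RoutingError.Timeout             => StatusCodes.Status504GatewayTimeout,
        RoutingError.PayloadTooLarge     => StatusCodes.Status413PayloadTooLarge,
        _ => StatusCodes.Status500InternalServerError,
    };
}
```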
  * Guidelines on:

    * When to use UDP vs TCP vs RabbitMQ.
    * How to configure and validate certificates for the certificate transport.
    * How to write cancellation-friendly handlers (proper use of `CancellationToken`).

  * Testing strategies: local dev setups, integration test harnesses, and how to run the router and a microservice together for tests.
  * A clear explanation of config precedence:

    * Code options vs YAML vs microservice defaults vs Authority for claims.

* Documentation MUST answer, for each major concept:

  * What it is.
  * Why it exists.
  * How it works.
  * How to use it (with examples).
  * What happens when it is misused, and how to debug issues.
---

## 14. Migration requirements

* There MUST be a defined migration path from `StellaOps.*.WebServices` to `StellaOps.*.Microservices`.

* Migration documentation MUST cover:

  * Inventorying existing HTTP routes (Method + Path).
  * Strategy A (in-place adaptation; a wrapping sketch follows this section):

    * Adding the microservice SDK to the WebService.
    * Declaring endpoints with `[StellaEndpoint]`.
    * Wrapping existing controller logic in handlers.
    * Connecting to the router and validating registration.
    * Gradually shifting traffic from direct WebService HTTP ingress to gateway routing.

  * Strategy B (split):

    * Extracting domain logic into shared libraries.
    * Creating a dedicated microservice project using the SDK.
    * Mapping routes and handlers.
    * Phasing out or repurposing the original WebService.

  * Ensuring cancellation tokens are wired throughout migrated code.
  * Handling streaming endpoints (large uploads/downloads) via `IRawStellaEndpoint` and streaming support instead of naively buffered HTTP controllers.
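To make Strategy A concrete, a hedged wrapping sketch: the `InvoiceService` domain class and DTOs are stand-ins for existing WebService code, and the handler shape follows the §7.2 example’s assumptions:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Existing domain logic is reused unchanged behind a [StellaEndpoint] handler.
[StellaEndpoint("GET", "/billing/invoices/{id}")]
public sealed class GetInvoiceEndpoint : IStellaEndpoint<GetInvoiceRequest, InvoiceDto>
{
    private readonly InvoiceService _service;   // pre-migration domain logic

    public GetInvoiceEndpoint(InvoiceService service) => _service = service;

    public Task<InvoiceDto> HandleAsync(GetInvoiceRequest request, CancellationToken ct)
        => _service.GetByIdAsync(request.Id, ct);  // token wired through, per §7.6
}

public sealed record GetInvoiceRequest(string Id);
public sealed record InvoiceDto(string Id, decimal Amount);

// Stand-in for the existing WebService's service layer.
public sealed class InvoiceService
{
    public Task<InvoiceDto> GetByIdAsync(string id, CancellationToken ct)
        => Task.FromResult(new InvoiceDto(id, 0m));
}
```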
---

If you want, I can next turn this requirement set into a machine-readable checklist (e.g. JSON or YAML) or derive a first-pass implementation roadmap directly from these requirements.