Compare commits


3 Commits

Author SHA1 Message Date
master d785a9095f Merge branch 'main' of https://git.stella-ops.org/stella-ops.org/git.stella-ops.org 2025-12-02 18:38:37 +02:00
master 0c9e8d5d18 router planning 2025-12-02 18:38:32 +02:00
master 790801f329 add advisories 2025-12-01 17:50:11 +02:00
22 changed files with 10162 additions and 0 deletions


@@ -0,0 +1,446 @@
Here's a crisp, practical way to turn StellaOps' "verifiable proof spine" into a moat—and how to measure it.
# Why this matters (in plain terms)
Security tools often say "trust me." You'll say "prove it"—every finding and every "not-affected" claim ships with cryptographic receipts anyone can verify.
---
# Differentiators to build in
**1) Bind every verdict to a graph hash**
* Compute a stable **Graph Revision ID** (Merkle root) over: SBOM nodes, edges, policies, feeds, scan params, and tool versions.
* Store the ID on each finding/VEX item; show it in the UI and APIs.
* Rule: any data change → new graph hash → new revisioned verdicts.
**2) Attach machine-verifiable receipts (in-toto/DSSE)**
* For each verdict, emit a **DSSE-wrapped in-toto statement**:
  * predicateType: `stellaops.dev/verdict@v1`
  * includes: graphRevisionId, artifact digests, rule id/version, inputs (CPE/CVE/CVSS), timestamps.
* Sign with your **Authority** (Sigstore key, offline mode supported).
* Keep receipts queryable and exportable; mirror to a Rekor-compatible ledger when online.
**3) Add reachability "call-stack slices" or binary-symbol proofs**
* For code-level reachability, store compact slices: entry → sink, with symbol names + file:line.
* For binary-only targets, include **symbol presence proofs** (e.g., Bloom filters + offsets) with executable digest.
* Compress and embed a hash of the slice/proof inside the DSSE payload.
**4) Deterministic replay manifests**
* Alongside receipts, publish a **Replay Manifest** (inputs, feeds, rule versions, container digests) so any auditor can reproduce the same graph hash and verdicts offline.
---
# Benchmarks to publish (make them your headline KPIs)
**A) False-positive reduction vs. baseline scanners (%)**
* Method: run a public corpus (e.g., sample images + app stacks) across 3–4 popular scanners; label ground truth once; compare FP rate.
* Report: mean & p95 FP reduction.
**B) Proof coverage (% of findings with signed evidence)**
* Definition: `(# findings or VEX items carrying valid DSSE receipts) / (total surfaced items)`.
* Break out: runtime-reachable vs. unreachable, and "not-affected" claims.
**C) Triage time saved (p50/p95)**
* Measure analyst minutes from “alert created” → “final disposition.”
* A/B with receipts hidden vs. visible; publish median/p95 deltas.
**D) Determinism stability**
* Re-run identical scans N times / across nodes; publish `% identical graph hashes` and drift causes when different.
---
# Minimal implementation plan (week-by-week)
**Week 1: primitives**
* Add Graph Revision ID generator in `scanner.webservice` (Merkle over normalized JSON of SBOM+edges+policies+toolVersions).
* Define `VerdictReceipt` schema (protobuf/JSON) and DSSE envelope types.
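A minimal sketch of what that generator could look like, assuming canonicalisation (stable key ordering, sorted node/edge lists) has already happened upstream; the class and method names are illustrative, not the actual `scanner.webservice` API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class GraphRevisionId
{
    // Callers pass already-canonicalised JSON fragments (SBOM nodes, edges, policies,
    // feeds, scan params, tool versions) in a stable, sorted order.
    public static string Compute(IReadOnlyList<string> canonicalJsonLeaves)
    {
        if (canonicalJsonLeaves.Count == 0)
            return Convert.ToHexString(SHA256.HashData(Array.Empty<byte>())).ToLowerInvariant();

        var level = canonicalJsonLeaves
            .Select(leaf => SHA256.HashData(Encoding.UTF8.GetBytes(leaf)))
            .ToList();

        // Fold pairs of hashes upward until a single Merkle root remains.
        while (level.Count > 1)
        {
            var next = new List<byte[]>();
            for (var i = 0; i < level.Count; i += 2)
            {
                var right = i + 1 < level.Count ? level[i + 1] : level[i]; // duplicate last node on odd counts
                next.Add(SHA256.HashData(level[i].Concat(right).ToArray()));
            }
            level = next;
        }
        return Convert.ToHexString(level[0]).ToLowerInvariant();
    }
}
```

Any change to the canonicalised inputs changes a leaf hash, and therefore the root, which is exactly the "new data → new graph hash → new revisioned verdicts" rule above.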
**Week 2: signing + storage**
* Wire DSSE signing in **Authority**; offline key support + rotation.
* Persist receipts in `Receipts` table (Postgres) keyed by `(graphRevisionId, verdictId)`; enable export (JSONL) and ledger mirror.
**Week 3: reachability proofs**
* Add call-stack slice capture in reachability engine; serialize compactly; hash + reference from receipts.
* Binary symbol proof module for ELF/PE: symbol bitmap + digest.
**Week 4: replay + UX**
* Emit `replay.manifest.json` per scan (inputs, tool digests).
* UI: show **"Verified"** badge, graph hash, signature issuer, and a one-click "Copy receipt" button.
* API: `GET /verdicts/{id}/receipt`, `GET /graphs/{rev}/replay`.
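A hedged sketch of those two routes as ASP.NET Core minimal APIs; `IReceiptStore` and its methods are placeholders for whatever persistence layer ends up backing the `Receipts` table, and registering a concrete implementation is omitted.

```csharp
// Hypothetical read-only endpoints matching the routes above.
var builder = WebApplication.CreateBuilder(args);
// builder.Services.AddSingleton<IReceiptStore, PostgresReceiptStore>(); // concrete store omitted in this sketch
var app = builder.Build();

app.MapGet("/verdicts/{id}/receipt", async (string id, IReceiptStore store) =>
    await store.FindReceiptAsync(id) is { } receipt
        ? Results.Ok(receipt)       // DSSE envelope + signature issuer + graph hash
        : Results.NotFound());

app.MapGet("/graphs/{rev}/replay", async (string rev, IReceiptStore store) =>
    await store.FindReplayManifestAsync(rev) is { } manifest
        ? Results.Ok(manifest)      // replay.manifest.json contents
        : Results.NotFound());

app.Run();

public interface IReceiptStore
{
    Task<object?> FindReceiptAsync(string verdictId);
    Task<object?> FindReplayManifestAsync(string graphRevisionId);
}
```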
**Week 5: benchmarks harness**
* Create `bench/` with golden fixtures and a runner:
* Baseline scanner adapters
* Ground-truth labels
* Metrics export (FP%, proof coverage, triage time capture hooks)
---
# Developer guardrails (make these non-negotiable)
* **No receipt, no ship:** any surfaced verdict must carry a DSSE receipt.
* **Schema freeze windows:** changes to rule inputs or policy logic must bump rule version and therefore the graph hash.
* **Replay-first CI:** PRs touching scanning/rules must pass a replay test that reproduces prior graph hashes on gold fixtures.
* **Clock safety:** use monotonic time inside receipts; add UTC wall-clock time separately.
---
# What to show buyers/auditors
* A short **audit kit**: sample container + your receipts + replay manifest + one command to reproduce the same graph hash.
* A one-page **benchmark readout**: FP reduction, proof coverage, and triage time saved (p50/p95), with corpus description.
---
If you want, I'll draft:
1. the DSSE `predicate` schema,
2. the Postgres DDL for `Receipts` and `Graphs`, and
3. a tiny .NET verification CLI (`stellaops-verify`) that replays a manifest and validates signatures.
Here's a focused "developer guidelines" doc just for **Benchmarks for a Testable Security Moat** in StellaOps.
---
# Stella Ops Developer Guidelines
## Benchmarks for a Testable Security Moat
> **Goal:** Benchmarks are how we *prove* StellaOps is better, not just say it is. If a "moat" claim can't be tied to a benchmark, it doesn't exist.
Everything here is about how you, as a developer, design, extend, and run those benchmarks.
---
## 1. What our benchmarks must measure
Every core product claim needs at least one benchmark:
1. **Detection quality**
* Precision / recall vs ground truth.
* False positives vs popular scanners.
* False negatives on known-bad samples.
2. **Proof & evidence quality**
* % of findings with **valid receipts** (DSSE).
* % of VEX "not-affected" with attached proofs.
* Reachability proof quality:
* call-stack slice present?
* symbol proof present for binaries?
3. **Triage & workflow impact**
* Time-to-decision for analysts (p50/p95).
* Click depth and context switches per decision.
* “Verified” vs “unverified” verdict triage times.
4. **Determinism & reproducibility**
* Same inputs → same **Graph Revision ID**.
* Stable verdict sets across runs/nodes.
> **Rule:** If you add a feature that impacts any of these, you must either hook it into an existing benchmark or add a new one.
---
## 2. Benchmark assets and layout
**2.1 Repo layout (convention)**
Under `bench/` we maintain everything benchmark-related:
* `bench/corpus/`
  * `images/` – curated container images / tarballs.
  * `repos/` – sample codebases (with known vulns).
  * `sboms/` – canned SBOMs for edge cases.
* `bench/scenarios/`
  * `*.yaml` – scenario definitions (inputs + expected outputs).
* `bench/golden/`
  * `*.json` – golden results (expected findings, metrics).
* `bench/tools/`
  * adapters for baseline scanners, parsers, helpers.
* `bench/scripts/`
  * `run_benchmarks.[sh/cs]` – single entrypoint.
**2.2 Scenario definition (high-level)**
Each scenario yaml should minimally specify:
* **Inputs**
* artifact references (image name / path / repo SHA / SBOM file).
* environment knobs (features enabled/disabled).
* **Ground truth**
* list of expected vulns (or explicit “none”).
* for some: expected reachability (reachable/unreachable).
* expected VEX entries (affected / not affected).
* **Expectations**
* required metrics (e.g., “no more than 2 FPs”, “no FNs”).
* required proof coverage (e.g., “100% of surfaced findings have receipts”).
---
## 3. Core benchmark metrics (developer-facing definitions)
Use these consistently across code and docs.
### 3.1 Detection metrics
* `true_positive_count` (TP)
* `false_positive_count` (FP)
* `false_negative_count` (FN)
Derived:
* `precision = TP / (TP + FP)`
* `recall = TP / (TP + FN)`
* For UX: track **FP per asset** and **FP per 100 findings**.
**Developer guideline:**
* When you introduce a filter, deduper, or rule tweak, add/modify a scenario where:
* the change **helps** (reduces FP or FN); and
* a different scenario guards against regressions.
### 3.2 Moat-specific metrics
These are the ones that directly support the "testable moat" story:
1. **False-positive reduction vs baseline scanners**
* Run baseline scanners across our corpus (via adapters in `bench/tools`).
* Compute:
* `baseline_fp_rate`
* `stella_fp_rate`
* `fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate`.
2. **Proof coverage**
* `proof_coverage_all = findings_with_valid_receipts / total_findings`
* `proof_coverage_vex = vex_items_with_valid_receipts / total_vex_items`
* `proof_coverage_reachable = reachable_findings_with_proofs / total_reachable_findings`
3. **Triage time improvement**
* In test harnesses, simulate or record:
* `time_to_triage_with_receipts`
* `time_to_triage_without_receipts`
* Compute median & p95 deltas.
4. **Determinism**
* Re-run the same scenario `N` times:
* `% runs with identical Graph Revision ID`
* `% runs with identical verdict sets`
* On mismatch, diff and log the cause (e.g., non-stable sort, non-pinned feed).
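To pin the formulas down, here is a small C# helper computing the headline numbers; the `BenchCounts` shape is illustrative and assumes non-zero denominators.

```csharp
// Illustrative counters collected by the bench harness for one corpus run.
public sealed record BenchCounts(
    int BaselineFalsePositives, int BaselineFindings,
    int StellaFalsePositives, int StellaFindings,
    int FindingsWithValidReceipts, int TotalFindings,
    int RunsWithIdenticalGraphHash, int TotalRuns);

public static class MoatMetrics
{
    // fp_reduction = (baseline_fp_rate - stella_fp_rate) / baseline_fp_rate
    public static double FpReduction(BenchCounts c)
    {
        var baselineRate = (double)c.BaselineFalsePositives / c.BaselineFindings;
        var stellaRate = (double)c.StellaFalsePositives / c.StellaFindings;
        return (baselineRate - stellaRate) / baselineRate;
    }

    // proof_coverage_all = findings_with_valid_receipts / total_findings
    public static double ProofCoverage(BenchCounts c) =>
        (double)c.FindingsWithValidReceipts / c.TotalFindings;

    // % of runs that reproduced the same Graph Revision ID
    public static double Determinism(BenchCounts c) =>
        (double)c.RunsWithIdenticalGraphHash / c.TotalRuns;
}
```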
---
## 4. How developers should work with benchmarks
### 4.1 “No feature without benchmarks”
If you're adding or changing:
* graph structure,
* rule logic,
* scanner integration,
* VEX handling,
* proof / receipt generation,
you **must** do *at least one* of:
1. **Extend an existing scenario**
* Add expectations that cover your change, or
* tighten an existing bound (e.g., lower FP threshold).
2. **Add a new scenario**
* For new attack classes / edge cases / ecosystems.
**Anti-patterns:**
* Shipping a new capability with *no* corresponding scenario.
* Updating golden outputs without explaining why metrics changed.
### 4.2 CI gates
We treat benchmarks as **blocking**:
* Add a CI job, e.g.:
* `make bench:quick` on every PR (small subset).
* `make bench:full` on main / nightly.
* CI fails if:
* Any scenario marked `strict: true` has:
* Precision or recall below its threshold.
* Proof coverage below its configured threshold.
* Global regressions above tolerance:
* e.g. total FP increases > X% without an explicit override.
**Developer rule:**
* If you intentionally change behavior:
* Update the relevant golden files.
* Include a short note in the PR (e.g., `bench-notes.md` snippet) describing:
* what changed,
* why the new result is better, and
* which moat metric it improves (FP, proof coverage, determinism, etc.).
---
## 5. Benchmark implementation guidelines
### 5.1 Make benchmarks deterministic
* **Pin everything**:
* feed snapshots,
* tool container digests,
* rule versions,
* time windows.
* Use **Replay Manifests** as the source of truth:
* `replay.manifest.json` should contain:
* input artifacts,
* tool versions,
* feed versions,
* configuration flags.
* If a benchmark depends on time:
* Inject a **fake clock** or explicit “as of” timestamp.
### 5.2 Keep scenarios small but meaningful
* Prefer many **focused** scenarios over a few huge ones.
* Each scenario should clearly answer:
* “What property of StellaOps are we testing?”
* “What moat claim does this support?”
Examples:
* `bench/scenarios/false_pos_kubernetes.yaml`
* Focus: config noise reduction vs baseline scanner.
* `bench/scenarios/reachability_java_webapp.yaml`
* Focus: reachable vs unreachable vuln proofs.
* `bench/scenarios/vex_not_affected_openssl.yaml`
* Focus: VEX correctness and proof coverage.
### 5.3 Use golden outputs, not ad-hoc assertions
* Bench harness should:
* Run StellaOps on scenario inputs.
* Normalize outputs (sorted lists, stable IDs).
* Compare to `bench/golden/<scenario>.json`.
* Golden file should include:
* expected findings (id, severity, reachable?, etc.),
* expected VEX entries,
* expected metrics (precision, recall, coverage).
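A hedged sketch of the normalise-and-compare step, assuming .NET 8+ and a golden file containing an `expected_findings` array; the `Finding` record and golden layout are illustrative.

```csharp
using System.Text.Json;
using System.Text.Json.Nodes;

public sealed record Finding(string Id, string Severity, bool Reachable);

public static class GoldenCompare
{
    public static bool Matches(IEnumerable<Finding> actual, string goldenPath)
    {
        // Normalise: stable sort by id so ordering differences never fail the diff.
        var normalised = actual.OrderBy(f => f.Id, StringComparer.Ordinal).ToList();
        var actualJson = JsonSerializer.SerializeToNode(normalised);
        var goldenJson = JsonNode.Parse(File.ReadAllText(goldenPath))?["expected_findings"];
        return JsonNode.DeepEquals(actualJson, goldenJson);
    }
}
```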
---
## 6. Moat-critical benchmark types (we must have all of these)
When you're thinking about gaps, check that we have:
1. **Cross-tool comparison**
* Same corpus, multiple scanners.
* Metrics vs baselines for FP/FN.
2. **Proof density & quality**
* Corpus where:
* some vulns are reachable,
* some are not,
* some are not present.
* Ensure:
* reachable ones have rich proofs (stack slices / symbol proofs).
* non-reachable or absent ones have:
* correct disposition, and
* clear receipts explaining why.
3. **VEX accuracy**
* Scenarios with known SBOM + known vulnerability impact.
* Check:
* VEX “affected”/“notaffected” matches ground truth.
* every VEX entry has a receipt.
4. **Analyst workflow**
* Small usability corpus for internal testing:
* Measure time-to-triage with/without receipts.
* Use the same scenarios across releases to track improvement.
5. **Upgrade / drift resistance**
* Scenarios that are **expected to remain stable** across:
* rule changes that *shouldn't* affect outcomes.
* feed updates (within a given version window).
* These act as canaries for unintended regressions.
---
## 7. Developer checklist (TL;DR)
Before merging a change that touches security logic, ask yourself:
1. **Is there at least one benchmark scenario that exercises this change?**
2. **Does the change improve at least one moat metric, or is it neutral?**
3. **Have I run `make bench:quick` locally and checked diffs?**
4. **If goldens changed, did I explain why in the PR?**
5. **Did I keep benchmarks deterministic (pinned versions, fake time, etc.)?**
If any answer is “no”, fix that before merging.
---
If you'd like, as a next step I can sketch a concrete `bench/scenarios/*.yaml` and matching `bench/golden/*.json` example that encodes one *specific* moat claim (e.g., "30% fewer FPs than Scanner X on Kubernetes configs") so your team has a ready-to-copy pattern.


@@ -0,0 +1,287 @@
Here's a condensed **"Stella Ops Developer Guidelines"** based on the official engineering docs and dev guides.
---
## 0. Where to start
* **Dev docs index:** The main entrypoint is `Development Guides & Tooling` (docs/technical/development/README.md). It links to coding standards, test strategy, performance workbook, plugin SDK, examples, and more. ([Gitea: Git with a cup of tea][1])
* **If a term is unfamiliar:** Check the one-page *Glossary of Terms* first. ([Stella Ops][2])
* **Big picture:** Stella Ops is an SBOM-first, offline-ready container security platform; a lot of design decisions (determinism, signatures, policy DSL, SBOM delta scans) flow from that. ([Stella Ops][3])
---
## 1. Core engineering principles
From **Coding Standards & Contributor Guide**: ([Gitea: Git with a cup of tea][4])
1. **SOLID first** – especially interface & dependency inversion.
2. **100-line file rule** – if a file grows >100 physical lines, split or refactor.
3. **Contracts vs runtime** – public DTOs and interfaces live in lightweight `*.Contracts` projects; implementations live in sibling runtime projects.
4. **Single composition root** – DI wiring happens in `StellaOps.Web/Program.cs` and each plugin's `IoCConfigurator`. Nothing else creates a service provider.
5. **No service locator** – constructor injection only; no global `ServiceProvider` or static service lookups.
6. **Fail-fast startup** – validate configuration *before* the web host starts listening.
7. **Hot-load compatibility** – avoid static singletons that would survive plugin unload; don't manually load assemblies outside the built-in loader.
These all serve the product goals of **deterministic, offline, explainable security decisions**. ([Stella Ops][3])
---
## 2. Repository layout & layering
From the repo layout section: ([Gitea: Git with a cup of tea][4])
* **Top-level structure (simplified):**
```text
src/
  backend/
    StellaOps.Web/        # ASP.NET host + composition root
    StellaOps.Common/     # logging, helpers
    StellaOps.Contracts/  # DTO + interface contracts
    …                     # more runtime projects
  plugins-sdk/            # plugin templates & abstractions
  frontend/               # Angular workspace
tests/                    # mirrors src 1-to-1
```
* **Rules:**
* No “Module” folders or nested solution hierarchies.
* Tests mirror `src/` structure 1:1; **no test code in production projects**.
* New features follow *feature folder* layout (e.g., `Scan/ScanService.cs`, `Scan/ScanController.cs`).
---
## 3. Naming, style & language usage
Key conventions: ([Gitea: Git with a cup of tea][4])
* **Namespaces:** file-scoped, `StellaOps.*`.
* **Interfaces:** `I` prefix (`IScannerRunner`).
* **Classes/records:** PascalCase (`ScanRequest`, `TrivyRunner`).
* **Private fields:** `camelCase` (no leading `_`).
* **Constants:** `SCREAMING_SNAKE_CASE`.
* **Async methods:** end with `Async`.
* **Usings:** outside namespace, sorted, no wildcard imports.
* **File length:** keep ≤100 lines including `using` and braces (enforced by tooling).
C# feature usage: ([Gitea: Git with a cup of tea][4])
* Nullable reference types **on**.
* Use `record` for immutable DTOs.
* Prefer pattern matching over long `switch` cascades.
* `Span`/`Memory` only when you've measured that you need them.
* Use `await foreach` instead of manual iterator loops.
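A tiny made-up file that shows most of these conventions in one place; the types are illustrative, not real StellaOps code.

```csharp
using Microsoft.Extensions.Logging;

namespace StellaOps.Scan; // file-scoped namespace

public interface IScannerRunner // interface with "I" prefix
{
    Task<ScanResult> RunAsync(ScanRequest request, CancellationToken ct); // async method ends with Async
}

public sealed record ScanRequest(string ImageDigest);               // records for immutable DTOs
public sealed record ScanResult(bool Succeeded, int FindingCount);

public sealed class TrivyRunner : IScannerRunner
{
    private const string DEFAULT_PROFILE = "balanced";              // SCREAMING_SNAKE_CASE constant
    private readonly ILogger<TrivyRunner> logger;                   // camelCase private field, no leading underscore

    public TrivyRunner(ILogger<TrivyRunner> logger) => this.logger = logger; // constructor injection only

    public async Task<ScanResult> RunAsync(ScanRequest request, CancellationToken ct)
    {
        logger.LogInformation("Scanning {Digest} with profile {Profile}", request.ImageDigest, DEFAULT_PROFILE);
        await Task.Delay(10, ct).ConfigureAwait(false);             // library code uses ConfigureAwait(false)
        return new ScanResult(Succeeded: true, FindingCount: 0);
    }
}
```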
Formatting & analysis:
* `dotnet format` must be clean; StyleCop + security analyzers + CodeQL run in CI and are treated as gates. ([Gitea: Git with a cup of tea][4])
---
## 4. Dependency injection, async & concurrency
DI policy (core + plugins): ([Gitea: Git with a cup of tea][4])
* Exactly **one composition root** per process (`StellaOps.Web/Program.cs`).
* Plugins contribute through:
* `[ServiceBinding]` attributes for simple bindings, or
* An `IoCConfigurator : IDependencyInjectionRoutine` for advanced setups.
* Default lifetime is **scoped**. Use singletons only for truly stateless, thread-safe helpers.
* Never use a service locator or manually build nested service providers except in tests.
Async & threading: ([Gitea: Git with a cup of tea][4])
* All I/O is async; avoid `.Result` / `.Wait()`.
* Library code uses `ConfigureAwait(false)`.
* Control concurrency with channels or `Parallel.ForEachAsync`, not ad-hoc `Task.Run` loops.
---
## 5. Tests, tooling & quality gates
The **Automated Test-Suite Overview** spells out all CI layers and budgets. ([Gitea: Git with a cup of tea][5])
**Test layers (high-level):**
* Unit tests: xUnit.
* Property-based tests: FsCheck.
* Integration:
* API integration with Testcontainers.
* DB/merge flows using Mongo + Redis.
* Contracts: gRPC breakage checks with Buf.
* Frontend:
* Unit tests with Jest.
* E2E tests with Playwright.
* Lighthouse runs for performance & accessibility.
* Non-functional:
* Load tests via k6.
* Chaos experiments (CPU/OOM) using Docker tooling.
* Dependency & license scanning.
* SBOM reproducibility/attestation checks.
**Quality gates (examples):** ([Gitea: Git with a cup of tea][5])
* API unit test line coverage ≥ ~85%.
* API P95 latency ≤ ~120ms in nightly runs.
* ΔSBOM warm scan P95 ≤ ~5s on reference hardware.
* Lighthouse perf score ≥ ~90, a11y ≥ ~95.
**Local workflows:**
* Use `./scripts/dev-test.sh` for “fast” local runs and `--full` for the entire stack (API, UI, Playwright, Lighthouse, etc.). Needs Docker and modern Node. ([Gitea: Git with a cup of tea][5])
* Some suites use Mongo2Go + an OpenSSL 1.1 shim; others use a helper script to spin up a local `mongod` for deeper debugging. ([Gitea: Git with a cup of tea][5])
---
## 6. Plugins & connectors
The **Plugin SDK Guide** is your bible for schedule jobs, scanner adapters, TLS providers, notification channels, etc. ([Gitea: Git with a cup of tea][6])
**Basics:**
* Use `.NET` templates to scaffold:
```bash
dotnet new stellaops-plugin-schedule -n MyPlugin.Schedule --output src
```
* At publish time, copy **signed** artefacts to:
```text
src/backend/Stella.Ops.Plugin.Binaries/<MyPlugin>/
MyPlugin.dll
MyPlugin.dll.sig
```
* The backend:
* Verifies the Cosign signature.
* Enforces `[StellaPluginVersion]` compatibility.
* Loads plugins in isolated `AssemblyLoadContext`s.
**DI entrypoints:**
* For simple cases, mark implementations with `[ServiceBinding(typeof(IMyContract), ServiceLifetime.Scoped, …)]`.
* For more control, implement `IoCConfigurator : IDependencyInjectionRoutine` and configure services/options in `Register(...)`. ([Gitea: Git with a cup of tea][6])
**Examples:**
* **Schedule job:** implement `IJob.ExecuteAsync`, add `[StellaPluginVersion("X.Y.Z")]`, register cron with `services.AddCronJob<MyJob>("0 15 * * *")`.
* **Scanner adapter:** implement `IScannerRunner` and register via `services.AddScanner<MyAltScanner>("alt")`; document Docker sidecars if needed. ([Gitea: Git with a cup of tea][6])
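To make the schedule-job bullet more concrete, here is a hedged, self-contained sketch; the SDK interfaces are stand-ins declared inline so the snippet compiles on its own, and the real signatures may differ slightly from the shipped plugin SDK.

```csharp
using Microsoft.Extensions.DependencyInjection;

// Stand-in declarations for the plugin SDK types referenced above.
public interface IJob { Task ExecuteAsync(CancellationToken ct); }
public interface IDependencyInjectionRoutine { void Register(IServiceCollection services); }

[AttributeUsage(AttributeTargets.Class)]
public sealed class StellaPluginVersionAttribute : Attribute
{
    public StellaPluginVersionAttribute(string version) => Version = version;
    public string Version { get; }
}

public static class CronJobExtensions
{
    public static IServiceCollection AddCronJob<TJob>(this IServiceCollection services, string cron)
        where TJob : class, IJob => services.AddScoped<IJob, TJob>(); // the real SDK also records the cron expression
}

[StellaPluginVersion("1.2.0")]
public sealed class NightlyFeedRefreshJob : IJob
{
    public Task ExecuteAsync(CancellationToken ct) => Task.CompletedTask; // real work goes here
}

public sealed class MyPluginIoCConfigurator : IDependencyInjectionRoutine
{
    public void Register(IServiceCollection services) =>
        services.AddCronJob<NightlyFeedRefreshJob>("0 15 * * *"); // 15:00 daily, as in the SDK example
}
```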
**Signing & deployment:**
* Publish, sign with Cosign, optionally zip:
```bash
dotnet publish -c Release -p:PublishSingleFile=true -o out
cosign sign --key $COSIGN_KEY out/MyPlugin.Schedule.dll
```
* Copy into the backend container (e.g., `/opt/plugins/`) and restart.
* Unsigned DLLs are rejected when `StellaOps:Security:DisableUnsigned=false`. ([Gitea: Git with a cup of tea][6])
**Marketplace:**
* Tag releases like `plugin-vX.Y.Z`, attach the signed ZIP, and submit metadata to the community plugin index so it shows up in the UI Marketplace. ([Gitea: Git with a cup of tea][6])
---
## 7. Policy DSL & security decisions
For policy authors and tooling engineers, the **Stella Policy DSL (stella-dsl@1)** doc is key. ([Stella Ops][7])
**Goals:**
* Deterministic: same inputs → same findings on every machine.
* Declarative: no arbitrary loops, network calls, or clocks.
* Explainable: each decision carries rule, inputs, rationale.
* Offline-friendly and reachability-aware (SBOM + advisories + VEX + reachability). ([Stella Ops][7])
**Structure:**
* One `policy` block per `.stella` file, with:
* `metadata` (description, tags).
* `profile` blocks (severity, trust, reachability adjustments).
* `rule` blocks (`when` / `then` logic).
* Optional `settings`. ([Stella Ops][7])
**Context & builtins:**
* Namespaces like `sbom`, `advisory`, `vex`, `env`, `telemetry`, `secret`, `profile.*`, etc. ([Stella Ops][7])
* Helpers such as `normalize_cvss`, `risk_score`, `vex.any`, `vex.latest`, `sbom.any_component`, `exists`, `coalesce`, and secrets-specific helpers. ([Stella Ops][7])
**Rules of thumb:**
* Always include a clear `because` when you change `status` or `severity`. ([Stella Ops][7])
* Avoid catch-all suppressions (`when true` + `status := "suppressed"`); the linter will flag them. ([Stella Ops][7])
* Use `stella policy lint/compile/simulate` in CI and locally; test in sealed (offline) mode to ensure no network dependencies. ([Stella Ops][7])
---
## 8. Commits, PRs & docs
From the commit/PR checklist: ([Gitea: Git with a cup of tea][4])
Before opening a PR:
1. Use **Conventional Commit** prefixes (`feat:`, `fix:`, `docs:`, etc.).
2. Run `dotnet format` and `dotnet test`; both must be green.
3. Keep new/changed files within the 100-line guideline.
4. Update XML-doc comments for any new public API.
5. If you add/change a public contract:
* Update the relevant markdown docs.
* Update JSON schema / API descriptions as needed.
6. Ensure static analyzers and CI jobs relevant to your change are passing.
For new test layers or jobs, also update the test-suite overview and metrics docs so the CI configuration stays discoverable. ([Gitea: Git with a cup of tea][5])
---
## 9. Licensing & reciprocity
Stella Ops ships under **AGPL-3.0-or-later** with a strong reciprocity clause: ([Stella Ops][8])
* You may run, study, modify, and redistribute it, including as a hosted service.
* If you run a **modified** version for others over a network, you must make that exact source code available to those users.
* Official containers are signed and include SBOMs and attestations; verify them with Cosign as described on the license/security pages. ([Stella Ops][8])
When you build extensions:
* Keep plugins compatible with AGPL expectations around combined works.
* Don't embed proprietary logic into the core without checking license implications.
---
## 10. If you just want a “first contribution” recipe
A practical path that follows the guidelines:
1. Clone the main repo; skim **Coding Standards**, **Test Suite Overview**, and the **Dev Guides & Tooling** index. ([Gitea: Git with a cup of tea][4])
2. Get `dotnet`, Docker, Node set up; run `./scripts/dev-test.sh` to make sure your environment is healthy. ([Gitea: Git with a cup of tea][5])
3. Pick a small issue (docs, small refactor, or new test), make changes respecting:
* 100-line files,
* DI patterns,
* naming & style.
4. Add/adjust tests plus any affected docs or JSON schemas.
5. Run tests + formatting locally, push, and open a PR with a conventional title and a short “how I tested this” note.
If you tell me what you're planning to work on (plugin, policy pack, core feature, or UI), I can turn this into a very concrete checklist tailored to that slice of Stella Ops.
[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/technical/development/README.md "git.stella-ops.org/README.md at 08b27b8a266c82960c7653797460e1e1d17ecd45 - git.stella-ops.org - Gitea: Git with a cup of tea"
[2]: https://stella-ops.org/docs/14_glossary_of_terms/?utm_source=chatgpt.com "Open • Sovereign • Modular container security - Stella Ops"
[3]: https://stella-ops.org/docs/05_SYSTEM_REQUIREMENTS_SPEC/?utm_source=chatgpt.com "system requirements specification - Stella Ops Open • Sovereign ..."
[4]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/18_CODING_STANDARDS.md "git.stella-ops.org/18_CODING_STANDARDS.md at 08b27b8a266c82960c7653797460e1e1d17ecd45 - git.stella-ops.org - Gitea: Git with a cup of tea"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/19_TEST_SUITE_OVERVIEW.md "git.stella-ops.org/19_TEST_SUITE_OVERVIEW.md at 08b27b8a266c82960c7653797460e1e1d17ecd45 - git.stella-ops.org - Gitea: Git with a cup of tea"
[6]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/08b27b8a266c82960c7653797460e1e1d17ecd45/docs/10_PLUGIN_SDK_GUIDE.md "git.stella-ops.org/10_PLUGIN_SDK_GUIDE.md at 08b27b8a266c82960c7653797460e1e1d17ecd45 - git.stella-ops.org - Gitea: Git with a cup of tea"
[7]: https://stella-ops.org/docs/policy/dsl/index.html "Stella Ops Signed Reachability · Deterministic Replay · Sovereign Crypto"
[8]: https://stella-ops.org/license/?utm_source=chatgpt.com "AGPL3.0orlater - Stella Ops"


@@ -0,0 +1,585 @@
Here's a tight, practical pattern to make your scanner's vuln-DB updates rock-solid even when feeds hiccup:
# Offline, verifiable update bundles (DSSE + Rekor v2)
**Idea:** distribute DB updates as offline tarballs. Each tarball ships with:
* a **DSSE-signed** statement (e.g., in-toto style) over the bundle hash
* a **Rekor v2 receipt** proving the signature/statement was logged
* a small **manifest.json** (version, created_at, content hashes)
**Startup flow (happy path):**
1. Load latest tarball from your local `updates/` cache.
2. Verify DSSE signature against your trusted public keys.
3. Verify Rekor v2 receipt (inclusion proof) matches the DSSE payload hash.
4. If both pass, unpack/activate; record the bundle's **trust_id** (e.g., statement digest).
5. If anything fails, **keep using the last good bundle**. No service disruption.
**Why this helps**
* **Air-gap friendly:** no live network needed at activation time.
* **Tamper-evident:** DSSE + Rekor receipt proves provenance and transparency.
* **Operational stability:** feed outages become non-events—scanner just keeps the last good state.
---
## File layout inside each bundle
```
/bundle-2025-11-29/
manifest.json # { version, created_at, entries[], sha256s }
payload.tar.zst # the actual DB/indices
payload.tar.zst.sha256
statement.dsse.json # DSSE-wrapped statement over payload hash
rekor-receipt.json # Rekor v2 inclusion/verification material
```
---
## Acceptance/Activation rules
* **Trust root:** pin one (or more) publisher public keys; rotate via a separate, out-of-band process.
* **Monotonicity:** only activate if `manifest.version > current.version` (or if trust policy explicitly allows replay for rollback testing).
* **Atomic switch:** unpack to `db/staging/`, validate, then symlink-flip to `db/active/`.
* **Quarantine on failure:** move bad bundles to `updates/quarantine/` with a reason code.
---
## Minimal .NET 10 verifier sketch (C#)
```csharp
public sealed record BundlePaths(string Dir) {
public string Manifest => Path.Combine(Dir, "manifest.json");
public string Payload => Path.Combine(Dir, "payload.tar.zst");
public string Dsse => Path.Combine(Dir, "statement.dsse.json");
public string Receipt => Path.Combine(Dir, "rekor-receipt.json");
}
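// Helper types used below (Manifest, Hashes, Dsse, RekorV2, TarZstd, DirUtil, SymlinkUtil,
// LocalDbSelfCheck, State, TrustConfig) are assumed utility classes in this sketch,
// not existing library APIs.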
public async Task<bool> ActivateBundleAsync(BundlePaths b, TrustConfig trust, string activeDir) {
var manifest = await Manifest.LoadAsync(b.Manifest);
if (!await Hashes.VerifyAsync(b.Payload, manifest.PayloadSha256)) return false;
// 1) DSSE verify (publisher keys pinned in trust)
var (okSig, dssePayloadDigest) = await Dsse.VerifyAsync(b.Dsse, trust.PublisherKeys);
if (!okSig || dssePayloadDigest != manifest.PayloadSha256) return false;
// 2) Rekor v2 receipt verify (inclusion + statement digest == dssePayloadDigest)
if (!await RekorV2.VerifyReceiptAsync(b.Receipt, dssePayloadDigest, trust.RekorPub)) return false;
// 3) Stage, validate, then atomically flip
var staging = Path.Combine(activeDir, "..", "staging");
DirUtil.Empty(staging);
await TarZstd.ExtractAsync(b.Payload, staging);
if (!await LocalDbSelfCheck.RunAsync(staging)) return false;
SymlinkUtil.AtomicSwap(source: staging, target: activeDir);
State.WriteLastGood(manifest.Version, dssePayloadDigest);
return true;
}
```
---
## Operational playbook
* **On boot & daily at HH:MM:** try `ActivateBundleAsync()` on the newest bundle; on failure, log and continue.
* **Telemetry (no PII):** reason codes (SIG_FAIL, RECEIPT_FAIL, HASH_MISMATCH, SELFTEST_FAIL), versions, last_good.
* **Keys & rotation:** keep `publisher.pub` and `rekor.pub` in a root-owned, read-only path; rotate via a separate signed "trust bundle".
* **Defense-in-depth:** verify both the **payload hash** and each file's hash listed in `manifest.entries[]`.
* **Rollback:** allow `--force-activate <bundle>` for emergency testing, but mark as **non-monotonic** in state.
---
## What to hand your release team
* A Make/CI target that:
1. Builds `payload.tar.zst` and computes hashes
2. Generates `manifest.json`
3. Creates and signs the **DSSE statement**
4. Submits to Rekor (or your mirror) and saves the **v2 receipt**
5. Packages the bundle folder and publishes to your offline repo
* A checksum file (`*.sha256sum`) for ops to verify out-of-band.
---
If you want, I can turn this into a StellaOps spec page (`docs/modules/scanner/offline-bundles.md`) plus a small reference implementation (C# library + CLI) that drops right into your Scanner service.
Here's a "drop-in" Stella Ops dev guide for **DSSE-signed Offline Scanner Updates** — written in the same spirit as the existing docs and sprint files.
You can treat this as the seed for `docs/modules/scanner/development/dsse-offline-updates.md` (or similar).
---
# DSSE-Signed Offline Scanner Updates — Developer Guidelines
> **Audience**
> Scanner, Export Center, Attestor, CLI, and DevOps engineers implementing DSSE-signed offline vulnerability updates and integrating them into the Offline Update Kit (OUK).
>
> **Context**
>
> * OUK already ships **signed, atomic offline update bundles** with merged vulnerability feeds, container images, and an attested manifest.([git.stella-ops.org][1])
> * DSSE + Rekor is already used for **scan evidence** (SBOM attestations, Rekor proofs).([git.stella-ops.org][2])
> * Sprints 160/162 add **attestation bundles** with manifest, checksums, DSSE signature, and optional transparency log segments, and integrate them into OUK and CLI flows.([git.stella-ops.org][3])
These guidelines tell you how to **wire all of that together** for "offline scanner updates" (feeds, rules, packs) in a way that matches Stella Ops' determinism + sovereignty promises.
---
## 0. Mental model
At a high level, you're building this:
```text
Advisory mirrors / Feeds builders
        │
        ▼
ExportCenter.AttestationBundles
  (creates DSSE + Rekor evidence
   for each offline update snapshot)
        │
        ▼
Offline Update Kit (OUK) builder
  (adds feeds + evidence to kit tarball)
        │
        ▼
stella offline kit import / admin CLI
  (verifies Cosign + DSSE + Rekor segments,
   then atomically swaps scanner feeds)
```
Online, Rekor is live; offline, you rely on **bundled Rekor segments / snapshots** and the existing OUK mechanics (import is atomic, old feeds kept until new bundle is fully verified).([git.stella-ops.org][1])
---
## 1. Goals & non-goals
### Goals
1. **Authentic offline snapshots**
Every offline scanner update (OUK or delta) must be verifiably tied to:
* a DSSE envelope,
* a certificate chain rooted in Stella's Fulcio/KMS profile or BYO KMS/HSM,
* *and* a Rekor v2 inclusion proof or bundled log segment.([Stella Ops][4])
2. **Deterministic replay**
Given:
* a specific offline update kit (`stella-ops-offline-kit-<DATE>.tgz` + `offline-manifest-<DATE>.json`)([git.stella-ops.org][1])
* its DSSE attestation bundle + Rekor segments
every verifier must reach the *same* verdict on integrity and contents — online or fully air-gapped.
3. **Separation of concerns**
* Export Center: build attestation bundles, no business logic about scanning.([git.stella-ops.org][5])
* Scanner: import & apply feeds; verify but not generate DSSE.
* Signer / Attestor: own DSSE & Rekor integration.([git.stella-ops.org][2])
4. **Operational safety**
* Imports remain **atomic and idempotent**.
* Old feeds stay live until the new update is **fully verified** (Cosign + DSSE + Rekor).([git.stella-ops.org][1])
### Non-goals
* Designing new crypto or log formats.
* Per-feed DSSE envelopes (you can have more later, but the minimum contract is **bundle-level** attestation).
---
## 2. Bundle contract for DSSE-signed offline updates
You're extending the existing OUK contract:
* OUK already packs:
* merged vuln feeds (OSV, GHSA, optional NVD 2.0, CNNVD/CNVD, ENISA, JVN, BDU),
* container images (`stella-ops`, Zastava, etc.),
* provenance (Cosign signature, SPDX SBOM, in-toto SLSA attestation),
* `offline-manifest.json` + detached JWS signed during export.([git.stella-ops.org][1])
For **DSSE-signed offline scanner updates**, add a new logical layer:
### 2.1. Files to ship
Inside each offline kit (full or delta) you must produce:
```text
/attestations/
offline-update.dsse.json # DSSE envelope
offline-update.rekor.json # Rekor entry + inclusion proof (or segment descriptor)
/manifest/
offline-manifest.json # existing manifest
offline-manifest.json.jws # existing detached JWS
/feeds/
... # existing feed payloads
```
The exact paths can be adjusted, but keep:
* **One DSSE bundle per kit** (min spec).
* **One canonical Rekor proof file** per DSSE envelope.
### 2.2. DSSE payload contents (minimal)
Define (or reuse) a predicate type such as:
```jsonc
{
"payloadType": "application/vnd.in-toto+json",
"payload": { /* base64 */ }
}
```
Decoded payload (in-toto statement) should **at minimum** contain:
* **Subject**
* `name`: `stella-ops-offline-kit-<DATE>.tgz`
* `digest.sha256`: tarball digest
* **Predicate type** (recommendation)
* `https://stella-ops.org/attestations/offline-update/1`
* **Predicate fields**
* `offline_manifest_sha256` – SHA-256 of `offline-manifest.json`
* `feeds` – array of feed entries such as `{ name, snapshot_date, archive_digest }` (mirrors `rules_and_feeds` style used in the moat doc).([Stella Ops][6])
* `builder` – CI workflow id / git commit / Export Center job id
* `created_at` – UTC ISO-8601
* `oukit_channel` – e.g., `edge`, `stable`, `fips-profile`
**Guideline:** this DSSE payload is the **single canonical description** of “what this offline update snapshot is”.
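A hedged C# model of that predicate; property names mirror the bullets in section 2.2, but the exact shape is illustrative rather than a frozen contract.

```csharp
// Illustrative model of the decoded in-toto statement for offline updates.
public sealed record OfflineUpdateFeed(string Name, string SnapshotDate, string ArchiveDigest);

public sealed record OfflineUpdatePredicate(
    string OfflineManifestSha256,                 // sha-256 of offline-manifest.json
    IReadOnlyList<OfflineUpdateFeed> Feeds,
    string Builder,                               // CI workflow id / git commit / Export Center job id
    DateTimeOffset CreatedAt,                     // UTC, serialized as ISO-8601
    string OukitChannel);                         // e.g. "edge", "stable", "fips-profile"

public sealed record InTotoSubject(string Name, IReadOnlyDictionary<string, string> Digest);

public sealed record OfflineUpdateStatement(
    IReadOnlyList<InTotoSubject> Subject,         // the kit tarball name + sha256 digest
    string PredicateType,                         // https://stella-ops.org/attestations/offline-update/1
    OfflineUpdatePredicate Predicate);
```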
### 2.3. Rekor material
Attestor must:
* Submit `offline-update.dsse.json` to Rekor v2, obtaining:
* `uuid`
* `logIndex`
* inclusion proof (`rootHash`, `hashes`, `checkpoint`)
* Serialize that to `offline-update.rekor.json` and store it in object storage + OUK staging, so it ships in the kit.([git.stella-ops.org][2])
For fully offline operation:
* Either:
* embed a **minimal log segment** containing that entry; or
* rely on daily Rekor snapshot exports included elsewhere in the kit.([git.stella-ops.org][2])
---
## 3. Implementation by module
### 3.1 Export Center — attestation bundles
**Working directory:** `src/ExportCenter/StellaOps.ExportCenter.AttestationBundles`([git.stella-ops.org][7])
**Responsibilities**
1. **Compose attestation bundle job** (EXPORTATTEST74001)
* Input: a snapshot identifier (e.g., offline kit build id or feed snapshot date).
* Read manifest and feed metadata from the Export Center's storage.([git.stella-ops.org][5])
* Generate the DSSE payload structure described above.
* Call `StellaOps.Signer` to wrap it in a DSSE envelope.
* Call `StellaOps.Attestor` to submit DSSE → Rekor and get the inclusion proof.([git.stella-ops.org][2])
* Persist:
* `offline-update.dsse.json`
* `offline-update.rekor.json`
* any log segment artifacts.
2. **Integrate into offline kit packaging** (EXPORTATTEST74002 / 75001)
* The OUK builder (Python script `ops/offline-kit/build_offline_kit.py`) already assembles artifacts & manifests.([Stella Ops][8])
* Extend that pipeline (or add an Export Center step) to:
* fetch the attestation bundle for the snapshot,
* place it under `/attestations/` in the kit staging dir,
* ensure `offline-manifest.json` contains entries for the DSSE and Rekor files (name, sha256, size, capturedAt).([git.stella-ops.org][1])
3. **Contracts & schemas**
* Define a small JSON schema for `offline-update.rekor.json` (UUID, index, proof fields) and check it into `docs/11_DATA_SCHEMAS.md` or module-local schemas.
* Keep all new payload schemas **versioned**; avoid “shape drift”.
**Do / Don't**
* **Do** treat the attestation bundle job as *pure aggregation* (AOC guardrail: no modification of evidence).([git.stella-ops.org][5])
* **Do** rely on Signer + Attestor; don't hand-roll DSSE/Rekor logic in Export Center.([git.stella-ops.org][2])
* **Don't** reach out to external networks from this job — it must run with the same offline-ready posture as the rest of the platform.
---
### 3.2 Offline Update Kit builder
**Working area:** `ops/offline-kit/*` + `docs/24_OFFLINE_KIT.md`([git.stella-ops.org][1])
Guidelines:
1. **Preserve current guarantees**
* Imports must remain **idempotent and atomic**, with **old feeds kept until the new bundle is fully verified**. This now includes DSSE/Rekor checks in addition to Cosign + JWS.([git.stella-ops.org][1])
2. **Staging layout**
* When staging a kit, ensure the tree looks like:
```text
out/offline-kit/staging/
feeds/...
images/...
manifest/offline-manifest.json
attestations/offline-update.dsse.json
attestations/offline-update.rekor.json
```
* Update `offline-manifest.json` so each new file appears with:
* `name`, `sha256`, `size`, `capturedAt`.([git.stella-ops.org][1])
3. **Deterministic ordering**
* File lists in manifests must be in a stable order (e.g., lexical paths).
* Timestamps = UTC ISO-8601 only; never use local time. (Matches determinism guidance in AGENTS.md + policy/runs docs.)([git.stella-ops.org][9])
4. **Delta kits**
* For deltas (`stella-ouk-YYYY-MM-DD.delta.tgz`), DSSE should still cover:
* the delta tarball digest,
* the **logical state** (feeds & versions) after applying the delta.
* Don't shortcut by "attesting only the diff files" — the predicate must describe the resulting snapshot.
---
### 3.3 Scanner — import & activation
**Working directory:** `src/Scanner/StellaOps.Scanner.WebService`, `StellaOps.Scanner.Worker`([git.stella-ops.org][9])
Scanner already exposes admin flows for:
* **Offline kit import**, which:
* validates the Cosign signature of the kit,
* uses the attested manifest,
* keeps old feeds until verification is done.([git.stella-ops.org][1])
Add DSSE/Rekor awareness as follows:
1. **Verification sequence (happy path)**
On `import-offline-usage-kit`:
1. Validate **Cosign** signature of the tarball.
2. Validate `offline-manifest.json` with its JWS signature.
3. Verify **file digests** for all entries (including `/attestations/*`).
4. Verify **DSSE**:
* Call `StellaOps.Attestor.Verify` (or CLI equivalent) with:
* `offline-update.dsse.json`
* `offline-update.rekor.json`
* local Rekor log snapshot / segment (if configured)([git.stella-ops.org][2])
* Ensure the payload digest matches the kit tarball + manifest digests.
5. Only after all checks pass:
* swap Scanner's feed pointer to the new snapshot,
* emit an audit event noting:
* kit filename, tarball digest,
* DSSE statement digest,
* Rekor UUID + log index.
2. **Config surface**
Add config keys (names illustrative):
```yaml
scanner:
offlineKit:
requireDsse: true # fail import if DSSE/Rekor verification fails
rekorOfflineMode: true # use local snapshots only
attestationVerifier: https://attestor.internal
```
* Mirror them via ASP.NET Core config + env vars (`SCANNER__OFFLINEKIT__REQUIREDSSE`, etc.), following the same pattern as the DSSE/Rekor operator guide.([git.stella-ops.org][2])
3. **Failure behaviour**
* **DSSE/Rekor fail, Cosign + manifest OK**
* Keep old feeds active.
* Mark import as failed; surface a `ProblemDetails` error via API/UI.
* Log structured fields: `rekorUuid`, `attestationDigest`, `offlineKitHash`, `failureReason`.([git.stella-ops.org][2])
* **Config flag to soften during rollout**
* When `requireDsse=false`, treat DSSE/Rekor failure as a warning and still allow the import (for initial observation phase), but emit alerts. This mirrors the “observe → enforce” pattern in the DSSE/Rekor operator guide.([git.stella-ops.org][2])
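To tie the verification order in step 1 and the failure behaviour in step 3 together, here is a hedged sketch; every interface below is a hypothetical stand-in for the real Scanner/Attestor services, and only the ordering and rollback semantics matter.

```csharp
public sealed record ImportResult(bool Activated, string? FailureReason);

public static class OfflineKitImport
{
    public static async Task<ImportResult> RunAsync(
        string kitPath, bool requireDsse,
        ICosignVerifier cosign, IManifestVerifier manifest,
        IAttestorClient attestor, IFeedStore feeds, CancellationToken ct)
    {
        if (!await cosign.VerifyAsync(kitPath, ct)) return new(false, "failed_cosign");
        if (!await manifest.VerifyJwsAndDigestsAsync(kitPath, ct)) return new(false, "failed_manifest");

        var dsseOk = await attestor.VerifyDsseAndRekorAsync(kitPath, ct);
        if (!dsseOk && requireDsse) return new(false, "failed_dsse");   // old feeds stay active
        // requireDsse=false: observation mode — log a warning, emit alerts, but continue.

        await feeds.SwapToVerifiedSnapshotAsync(kitPath, ct);            // atomic pointer flip, last step
        return new(true, null);
    }
}

public interface ICosignVerifier { Task<bool> VerifyAsync(string kitPath, CancellationToken ct); }
public interface IManifestVerifier { Task<bool> VerifyJwsAndDigestsAsync(string kitPath, CancellationToken ct); }
public interface IAttestorClient { Task<bool> VerifyDsseAndRekorAsync(string kitPath, CancellationToken ct); }
public interface IFeedStore { Task SwapToVerifiedSnapshotAsync(string kitPath, CancellationToken ct); }
```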
---
### 3.4 Signer & Attestor
You mostly **reuse** existing guidance:([git.stella-ops.org][2])
* Add a new predicate type & schema for offline updates in Signer.
* Ensure Attestor:
* can submit offlineupdate DSSE envelopes to Rekor,
* can emit verification routines (used by CLI and Scanner) that:
* verify the DSSE signature,
* check the certificate chain against the configured root pack (FIPS/eIDAS/GOST/SM, etc.),([Stella Ops][4])
* verify Rekor inclusion using either live log or local snapshot.
* For fully air-gapped installs:
* rely on Rekor **snapshots mirrored** into Offline Kit (already recommended in the operator guide's offline section).([git.stella-ops.org][2])
---
### 3.5 CLI & UI
Extend CLI with explicit verbs (matching EXPORTATTEST sprints):([git.stella-ops.org][10])
* `stella attest bundle verify --bundle path/to/offline-kit.tgz --rekor-key rekor.pub`
* `stella attest bundle import --bundle ...` (for sites that prefer a two-step "verify then import" flow)
* Wire UI Admin → Offline Kit screen so that:
* verification status shows both **Cosign/JWS** and **DSSE/Rekor** state,
* policy banners display kit generation time, manifest hash, and DSSE/Rekor freshness.([Stella Ops][11])
---
## 4. Determinism & offline-safety rules
When touching any of this code, keep these rules front of mind (they align with the policy DSL and architecture docs):([Stella Ops][4])
1. **No hidden network dependencies**
* All verification **must work offline** given the kit + Rekor snapshots.
* Any fallback to live Rekor / Fulcio endpoints must be explicitly toggled and never on by default for “offline mode”.
2. **Stable serialization**
* DSSE payload JSON:
* stable ordering of fields,
* no float weirdness,
* UTC timestamps.
3. **Replayable imports**
* Running `import-offline-usage-kit` twice with the same bundle must be a no-op after the first time.
* The DSSE payload for a given snapshot must not change over time; if it does, bump the predicate or snapshot version.
4. **Explainability**
* When verification fails, errors must explain **what** mismatched (kit digest, manifest digest, DSSE envelope hash, Rekor inclusion) so auditors can reason about it.
---
## 5. Testing & CI expectations
Tie this into the existing CI workflows (`scanner-determinism.yml`, `attestation-bundle.yml`, `offline-kit` pipelines, etc.):([git.stella-ops.org][12])
### 5.1 Unit & integration tests
Write tests that cover:
1. **Happy paths**
* Full kit import with valid:
* Cosign,
* manifest JWS,
* DSSE,
* Rekor proof (online and offline modes).
2. **Corruption scenarios**
* Tampered feed file (hash mismatch).
* Tampered `offline-manifest.json`.
* Tampered DSSE payload (signature fails).
* Mismatched Rekor entry (payload digest doesn't match DSSE).
3. **Offline scenarios**
* No network access, only Rekor snapshot:
* DSSE verification still passes,
* Rekor proof validates against local tree head.
4. **Rollback logic**
* Import fails at DSSE/Rekor step:
* scanner DB still points at previous feeds,
* metrics/logs show failure and no partial state.
### 5.2 SLOs & observability
Reuse metrics suggested by DSSE/Rekor guide and adapt to OUK imports:([git.stella-ops.org][2])
* `offlinekit_import_total{status="success|failed_dsse|failed_rekor|failed_cosign"}`
* `offlinekit_attestation_verify_latency_seconds` (histogram)
* `attestor_rekor_success_total`, `attestor_rekor_retry_total`, `rekor_inclusion_latency`
* Dashboards: kit versions per environment, time since last kit, DSSE/Rekor health.
---
## 6. Developer checklist (TL;DR)
When you pick up a task touching DSSE-signed offline updates:
1. **Read the background**
* `docs/modules/scanner/operations/dsse-rekor-operator-guide.md`([git.stella-ops.org][2])
* `docs/24_OFFLINE_KIT.md` (and public offline kit guide).([git.stella-ops.org][1])
* Relevant sprint file (`SPRINT_160_export_evidence`, `SPRINT_162_exportcenter_i`, etc.).([git.stella-ops.org][10])
2. **Implement**
* Generate DSSE payloads in Export Center only.
* Call Signer & Attestor; persist DSSE + Rekor JSON next to manifests.
* Extend OUK builder to include attestation bundle and list it in `offline-manifest.json`.
* Update Scanner import flow to verify DSSE/Rekor before swapping feeds.
3. **Test**
* Unit tests for bundle composition & schema.
* Integration tests for import + rollback.
* Determinism tests (same inputs → same DSSE payload).
4. **Wire telemetry**
* Counters + latency histograms.
* Logs with `offlineKitHash`, `attestationDigest`, `rekorUuid`.
5. **Document**
* Update `docs/modules/export-center/architecture.md`, `docs/modules/scanner/architecture.md`, and the OUK docs where flows or contracts changed.([git.stella-ops.org][5])
---
If you tell me which module you're actually coding in next (Scanner, Export Center, CLI, or Attestor), I can turn this into a very concrete "AGENTS.md-style" section with exact file paths, class names, and a starter test layout for that module.
[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7bf40b8589c94078e8eadb240553c02f097a5127/docs/24_OFFLINE_KIT.md "git.stella-ops.org/24_OFFLINE_KIT.md at 7bf40b8589c94078e8eadb240553c02f097a5127 - git.stella-ops.org - Gitea: Git with a cup of tea"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/13e4b53dda1575ba46c6188c794fd465ec6fdeec/docs/modules/scanner/operations/dsse-rekor-operator-guide.md "git.stella-ops.org/dsse-rekor-operator-guide.md at 13e4b53dda1575ba46c6188c794fd465ec6fdeec - git.stella-ops.org - Gitea: Git with a cup of tea"
[3]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/raw/commit/61f963fd52cd4d6bb2f86afc5a82eac04c04b00e/docs/implplan/SPRINT_162_exportcenter_i.md?utm_source=chatgpt.com "https://git.stella-ops.org/stella-ops.org/git.stel..."
[4]: https://stella-ops.org/docs/07_high_level_architecture/index.html?utm_source=chatgpt.com "Open • Sovereign • Modular container security - Stella Ops"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d870da18ce194c6a5f1a6d71abea36205d9fb276/docs/export-center/architecture.md?utm_source=chatgpt.com "Export Center Architecture - Stella Ops"
[6]: https://stella-ops.org/docs/moat/?utm_source=chatgpt.com "Open • Sovereign • Modular container security - Stella Ops"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/79b8e53441e92dbc63684f42072434d40b80275f/src/ExportCenter?utm_source=chatgpt.com "Code - Stella Ops"
[8]: https://stella-ops.org/docs/24_offline_kit/?utm_source=chatgpt.com "Offline Update Kit (OUK) — AirGap Bundle - Stella Ops Open ..."
[9]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7768555f2d107326050cc5ff7f5cb81b82b7ce5f/AGENTS.md "git.stella-ops.org/AGENTS.md at 7768555f2d107326050cc5ff7f5cb81b82b7ce5f - git.stella-ops.org - Gitea: Git with a cup of tea"
[10]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/66cb6c4b8af58a33efa1521b7953dda834431497/docs/implplan/SPRINT_160_export_evidence.md?utm_source=chatgpt.com "git.stella-ops.org/SPRINT_160_export_evidence.md at ..."
[11]: https://stella-ops.org/about/?utm_source=chatgpt.com "Signed Reachability · Deterministic Replay · Sovereign Crypto"
[12]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/actions/?actor=0&status=0&workflow=sdk-publish.yml&utm_source=chatgpt.com "Actions - git.stella-ops.org - Gitea: Git with a cup of tea"


@@ -0,0 +1,819 @@
Here's a crisp, opinionated storage blueprint you can hand to your StellaOps devs right now, plus zero-downtime conversion tactics so you can keep prototyping fast without painting yourself into a corner.
# Module → store map (deterministic by default)
* **Authority / OAuth / Accounts & Audit**
* **PostgreSQL** as the primary source of truth.
* Tables: `users`, `clients`, `oauth_tokens`, `roles`, `grants`, `audit_log`.
* **Row-Level Security (RLS)** on `users`, `grants`, `audit_log`; **STRICT FK + CHECK** constraints; **immutable UUID PKs**.
* **Audit**: `audit_log(actor_id, action, entity, entity_id, at timestamptz default now(), diff jsonb)`.
* **Why**: ACID + RLS keeps authz decisions and audit trails deterministic and reviewable.
* **VEX & Vulnerability Writes**
* **PostgreSQL** with **JSONB facts + relational decisions**.
* Tables: `vuln_fact(jsonb)`, `vex_decision(package_id, vuln_id, status, rationale, proof_ref, updated_at)`.
* **Materialized views** for triage queues, e.g. `mv_triage_hotset` (refresh on commit or scheduled).
* **Why**: JSONB lets you ingest vendor-shaped docs; decisions stay relational for joins, integrity, and explainability.
* **Routing / Feature Flags / Rate-limits**
* **PostgreSQL** (truth) + **Redis** (cache).
* Tables: `feature_flag(key, rules jsonb, version)`, `route(domain, service, instance_id, last_heartbeat)`, `rate_limiter(bucket, quota, interval)`.
* Redis keys: `flag:{key}:{version}`, `route:{domain}`, `rl:{bucket}` with short TTLs.
* **Why**: one canonical RDBMS for consistency; Redis for hot path latency.
* **Unknowns Registry (ambiguity tracker)**
* **PostgreSQL** with **temporal tables** (bitemporal pattern via `valid_from/valid_to`, `sys_from/sys_to`).
* Table: `unknowns(subject_hash, kind, context jsonb, valid_from, valid_to, sys_from default now(), sys_to)`.
* Views: `unknowns_current` where `valid_to is null`.
* **Why**: preserves how/when uncertainty changed (critical for proofs and audits).
* **Artifacts / SBOM / VEX files**
* **OCI-compatible CAS** (e.g., self-hosted registry or MinIO bucket as content-addressable store).
* Keys by **digest** (`sha256:...`), metadata in Postgres `artifact(index)` with `digest`, `media_type`, `size`, `signatures`.
* **Why**: blobs don't belong in your RDBMS; use CAS for scale + cryptographic addressing.
---
# PostgreSQL implementation essentials (copy/paste starters)
* **RLS scaffold (Authority)**:
```sql
alter table audit_log enable row level security;
create policy p_audit_read_self
on audit_log for select
using (actor_id = current_setting('app.user_id')::uuid or
exists (select 1 from grants g where g.user_id = current_setting('app.user_id')::uuid and g.role = 'auditor'));
```
* **JSONB facts + relational decisions**:
```sql
create table vuln_fact (
id uuid primary key default gen_random_uuid(),
source text not null,
payload jsonb not null,
received_at timestamptz default now()
);
create table vex_decision (
package_id uuid not null,
vuln_id text not null,
status text check (status in ('not_affected','affected','fixed','under_investigation')),
rationale text,
proof_ref text,
decided_at timestamptz default now(),
primary key (package_id, vuln_id)
);
```
* **Materialized view for triage**:
```sql
create materialized view mv_triage_hotset as
select v.id as fact_id, v.payload->>'vuln' as vuln, v.received_at
from vuln_fact v
where (now() - v.received_at) < interval '7 days';
-- refresh concurrently via job
```
* **Temporal pattern (Unknowns)**:
```sql
create table unknowns (
id uuid primary key default gen_random_uuid(),
subject_hash text not null,
kind text not null,
context jsonb not null,
valid_from timestamptz not null default now(),
valid_to timestamptz,
sys_from timestamptz not null default now(),
sys_to timestamptz
);
create view unknowns_current as
select * from unknowns where valid_to is null;
```
---
# Conversion (not migration): zero-downtime, prototype-friendly
Even if you're "not migrating anything yet," set these rails now so cutting over later is painless.
1. **Encode Mongo-shaped docs into JSONB with versioned schemas**
* Ingest pipeline writes to `*_fact(payload jsonb, schema_version int)`.
* Add a **`validate(schema_version, payload)`** step in your service layer (JSON Schema or SQL checks).
* Keep a **forward-compatible view** that projects stable columns from JSONB (e.g., `payload->>'id' as vendor_id`) so downstream code doesn't break when payload evolves.
2. **Outbox pattern for exactly-once side-effects** (see the dispatcher sketch after this list)
* Add `outbox(id, topic, key, payload jsonb, created_at, dispatched bool default false)`.
* In the same transaction as your write, insert the outbox row.
* A background dispatcher reads `dispatched=false`, publishes to MQ/Webhook, then marks `dispatched=true`.
* Guarantees: no lost events; delivery is at-least-once, so external consumers should be idempotent to avoid duplicate effects.
3. **Parallel read adapters behind feature flags**
* Keep old readers (e.g., Mongo driver) and new Postgres readers in the same service.
* Gate by `feature_flag('pg_reads')` per tenant or env; flip gradually.
* Add a **read-diff monitor** that compares results and logs mismatches to `audit_log(diff)`.
4. **CDC for analytics without coupling**
* Enable **logical replication** (pgoutput) on your key tables.
* Stream changes into analyzers (reachability, heuristics) without hitting primaries.
* This lets you keep OLTP clean and still power dashboards/tests.
5. **Materialized views & job cadence**
* Refresh `mv_*` on a fixed cadence (e.g., every 2–5 minutes) or post-commit for hot paths.
* Keep **“cold path”** analytics in separate schemas (`analytics.*`) sourced from CDC.
6. **Cutover playbook (phased)**
* Phase A (Dark Read): write to Postgres, still serve from Mongo; compare results silently.
* Phase B (Shadow Serve): 5–10% of traffic from Postgres via flag; auto-rollback switch.
* Phase C (Authoritative): Postgres becomes the source; Mongo path left for emergency read-only.
* Phase D (Retire): freeze Mongo, back up, remove writes, delete code paths after 2 stable sprints.
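As referenced in point 2 above, here is a hedged sketch of the background dispatcher using Npgsql directly; connection-string handling, retry policy, and the concrete publisher are deliberately left out, and `IMessagePublisher` stands in for whatever MQ/webhook client you actually use.

```csharp
using Npgsql;

public interface IMessagePublisher
{
    Task PublishAsync(string topic, string key, string payload, CancellationToken ct);
}

public sealed class OutboxDispatcher
{
    private readonly NpgsqlDataSource db;
    private readonly IMessagePublisher publisher;

    public OutboxDispatcher(NpgsqlDataSource db, IMessagePublisher publisher)
        => (this.db, this.publisher) = (db, publisher);

    public async Task DispatchPendingAsync(CancellationToken ct)
    {
        await using var conn = await db.OpenConnectionAsync(ct);
        await using var tx = await conn.BeginTransactionAsync(ct);

        // Lock a small batch so parallel dispatchers never pick up the same rows.
        await using var select = new NpgsqlCommand(
            "select id, topic, key, payload::text from outbox " +
            "where dispatched = false order by created_at limit 100 for update skip locked", conn, tx);

        var rows = new List<(Guid Id, string Topic, string Key, string Payload)>();
        await using (var reader = await select.ExecuteReaderAsync(ct))
            while (await reader.ReadAsync(ct))
                rows.Add((reader.GetGuid(0), reader.GetString(1), reader.GetString(2), reader.GetString(3)));

        foreach (var (id, topic, key, payload) in rows)
        {
            await publisher.PublishAsync(topic, key, payload, ct); // at-least-once: may repeat if we crash below
            await using var mark = new NpgsqlCommand("update outbox set dispatched = true where id = @id", conn, tx);
            mark.Parameters.AddWithValue("id", id);
            await mark.ExecuteNonQueryAsync(ct);
        }
        await tx.CommitAsync(ct);
    }
}
```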
---
# Rate-limits & flags: single truth, fast edges
* **Truth in Postgres** with versioned flag docs:
```sql
create table feature_flag (
key text primary key,
rules jsonb not null,
version int not null default 1,
updated_at timestamptz default now()
);
```
* **Edge cache** in Redis:
* `SETEX flag:{key}:{version} <ttl> <json>`
* On update, bump `version`; readers compose cache key with version (cache-busting without deletes).
* **Rate limiting**: Persist quotas in Postgres; counters in Redis (`INCR rl:{bucket}:{window}`), with periodic reconciliation jobs writing summaries back to Postgres for audits.
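A hedged sketch of the edge-cache side using StackExchange.Redis; key shapes follow the `flag:{key}:{version}` and `rl:{bucket}:{window}` conventions above, and the reconciliation job back to Postgres is omitted.

```csharp
using StackExchange.Redis;

public sealed class EdgeCache
{
    private readonly IDatabase redis;
    public EdgeCache(IDatabase redis) => this.redis = redis;

    public Task CacheFlagAsync(string flagKey, int version, string rulesJson, TimeSpan ttl) =>
        redis.StringSetAsync($"flag:{flagKey}:{version}", rulesJson, ttl); // SETEX equivalent

    public async Task<string?> ReadFlagAsync(string flagKey, int version)
    {
        var value = await redis.StringGetAsync($"flag:{flagKey}:{version}");
        return value.HasValue ? value.ToString() : null; // miss => fall back to the Postgres truth
    }

    public async Task<bool> AllowRequestAsync(string bucket, long quota, TimeSpan window)
    {
        var slot = DateTimeOffset.UtcNow.ToUnixTimeSeconds() / (long)window.TotalSeconds;
        var key = $"rl:{bucket}:{slot}";
        var count = await redis.StringIncrementAsync(key);       // INCR
        if (count == 1) await redis.KeyExpireAsync(key, window); // first hit in the window sets the TTL
        return count <= quota;
    }
}
```

Wire it up with something like `new EdgeCache(ConnectionMultiplexer.Connect("localhost").GetDatabase())` wherever the hot-path reads live.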
---
# CAS for SBOM/VEX/attestations
* Push blobs to OCI/MinIO by digest; store only pointers in Postgres:
```sql
create table artifact_index (
digest text primary key,
media_type text not null,
size bigint not null,
created_at timestamptz default now(),
signature_refs jsonb
);
```
* Benefits: immutable, deduped, easy to mirror into offline kits.
---
# Guardrails your team should follow
* **Always** wrap multi-table writes (facts + outbox + decisions) in a single transaction.
* **Prefer** `jsonb_path_query` for targeted reads; **avoid** scanning entire payloads.
* **Enforce** RLS + least-privilege roles; application sets `app.user_id` at session start.
* **Version everything**: schemas, flags, materialized views; never “change in place” without bumping version.
* **Observability**: expose `pg_stat_statements`, refresh latency for `mv_*`, outbox lag, Redis hit ratio, and RLS policy hits.
---
If you want, I can turn this into:
* ready-to-run **EF Core 10** migrations,
* a **/docs/architecture/store-map.md** for your repo,
* and a tiny **dev seed** (Docker + sample data) so the team can poke it immediately.
Here's a focused "PostgreSQL patterns per module" doc you can hand straight to your StellaOps devs.
---
# StellaOps PostgreSQL Patterns per Module
**Scope:** How each StellaOps module should use PostgreSQL: schema patterns, constraints, RLS, indexing, and transaction rules.
---
## 0. Cross-cutting PostgreSQL Rules
These apply everywhere unless explicitly overridden.
### 0.1 Core conventions
* **Schemas**
* Use **one logical schema** per module: `authority`, `routing`, `vex`, `unknowns`, `artifact`.
* Shared utilities (e.g., `outbox`) live in a `core` schema.
* **Naming**
* Tables: `snake_case`, singular: `user`, `feature_flag`, `vuln_fact`.
* PK: `id uuid primary key`.
* FKs: `<referenced_table>_id` (e.g., `user_id`, `tenant_id`).
* Timestamps:
* `created_at timestamptz not null default now()`
* `updated_at timestamptz not null default now()`
* **Multi-tenancy**
* All tenant-scoped tables must have `tenant_id uuid not null`.
* Enforce tenant isolation with **RLS** on `tenant_id`.
* **Time & timezones**
* Always `timestamptz`, always store **UTC**, let the DB default `now()`.
### 0.2 RLS & security
* RLS must be **enabled** on any table reachable from a user-initiated path.
* Every session must set:
```sql
select set_config('app.user_id', '<uuid>', false);
select set_config('app.tenant_id', '<uuid>', false);
select set_config('app.roles', 'role1,role2', false);
```
* RLS policies:
* Base policy: `tenant_id = current_setting('app.tenant_id')::uuid`.
* Extra predicates for per-user privacy (e.g., only see own tokens, only own API clients).
* DB users:
* Each module's service has its **own role** with access only to its schema + `core.outbox`.
### 0.3 JSONB & versioning
* Any JSONB column must have:
* `payload jsonb not null`,
* `schema_version int not null`.
* Always index:
* by source (`source` / `origin`),
* by a small set of projected fields used in WHERE clauses.
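A sketch of the projected-field rule, using the `vex.vuln_fact` table defined in the VEX module below; the `severity` field is illustrative. Add a btree expression index on the one or two JSON fields that show up in hot WHERE clauses, and make queries use the exact same expression:

```sql
-- Expression index on a projected field (the query expression must match it).
create index vuln_fact_severity_idx
  on vex.vuln_fact ((payload->>'severity'));

select id, external_id
from vex.vuln_fact
where tenant_id = current_setting('app.tenant_id')::uuid
  and (payload->>'severity') = 'critical';
```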
### 0.4 Migrations
* All schema changes via migrations, forward-only.
* Backwards-compat pattern:
1. Add new columns / tables.
2. Backfill.
3. Flip code to use new structure (behind a feature flag).
4. After stability, remove old columns/paths.
---
## 1. Authority Module (auth, accounts, audit)
**Schema:** `authority.*`
**Mission:** identity, OAuth, roles, grants, audit.
### 1.1 Core tables & patterns
* `authority.user`
```sql
create table authority."user" (  -- "user" is reserved in PostgreSQL, so quote it
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
email text not null,
display_name text not null,
is_disabled boolean not null default false,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, email)
);
```
* Never hard-delete users: use `is_disabled` (and optionally `disabled_at`).
* `authority.role`
```sql
create table authority.role (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
name text not null,
description text,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, name)
);
```
* `authority.grant`
```sql
create table authority."grant" (  -- "grant" is reserved in PostgreSQL, so quote it
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
user_id uuid not null references authority."user"(id),
role_id uuid not null references authority.role(id),
created_at timestamptz not null default now(),
unique (tenant_id, user_id, role_id)
);
```
* `authority.oauth_client`, `authority.oauth_token`
* Enforce token uniqueness:
```sql
create table authority.oauth_token (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
user_id uuid not null references authority."user"(id),
client_id uuid not null references authority.oauth_client(id),
token_hash text not null, -- hash, never raw
expires_at timestamptz not null,
created_at timestamptz not null default now(),
revoked_at timestamptz,
unique (token_hash)
);
```
### 1.2 Audit log pattern
* `authority.audit_log`
```sql
create table authority.audit_log (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
actor_id uuid, -- null for system
action text not null,
entity_type text not null,
entity_id uuid,
at timestamptz not null default now(),
diff jsonb not null
);
```
* Insert audit rows in the **same transaction** as the change.
### 1.3 RLS patterns
* Base RLS:
```sql
alter table authority."user" enable row level security;
create policy p_user_tenant on authority."user"
for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```
* Extra policies:
* Audit log is visible only to:
* actor themself, or
* users with an `auditor` or `admin` role.
---
## 2. Routing & Feature Flags Module
**Schema:** `routing.*`
**Mission:** where instances live, what features are on, rate-limit configuration.
### 2.1 Feature flags
* `routing.feature_flag`
```sql
create table routing.feature_flag (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
key text not null,
rules jsonb not null,
version int not null default 1,
is_enabled boolean not null default true,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, key)
);
```
* **Immutability by version**:
* On update, **increment `version`**, don't overwrite historical data.
* Mirror changes into a history table via trigger:
```sql
create table routing.feature_flag_history (
id uuid primary key default gen_random_uuid(),
feature_flag_id uuid not null references routing.feature_flag(id),
tenant_id uuid not null,
key text not null,
rules jsonb not null,
version int not null,
changed_at timestamptz not null default now(),
changed_by uuid
);
```
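A minimal sketch of the mirroring trigger, assuming the two tables above; `app.user_id` is read from the session the same way the RLS policies do:

```sql
create or replace function routing.feature_flag_mirror() returns trigger as $$
begin
  insert into routing.feature_flag_history
    (feature_flag_id, tenant_id, key, rules, version, changed_by)
  values
    (new.id, new.tenant_id, new.key, new.rules, new.version,
     nullif(current_setting('app.user_id', true), '')::uuid);
  return new;
end;
$$ language plpgsql;

create trigger trg_feature_flag_history
after insert or update on routing.feature_flag
for each row execute function routing.feature_flag_mirror();
```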
### 2.2 Instance registry
* `routing.instance`
```sql
create table routing.instance (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
instance_key text not null,
domain text not null,
last_heartbeat timestamptz not null default now(),
status text not null check (status in ('active','draining','offline')),
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, instance_key),
unique (tenant_id, domain)
);
```
* Pattern:
* Heartbeats use `update ... set last_heartbeat = now()` without touching other fields.
* Routing logic filters by `status='active'` and heartbeat recency.
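A sketch of that routing lookup; the 30-second freshness window is illustrative:

```sql
-- Pick active instances whose heartbeat is fresh enough to route to.
select instance_key, domain
from routing.instance
where tenant_id = current_setting('app.tenant_id')::uuid
  and status = 'active'
  and last_heartbeat > now() - interval '30 seconds'
order by last_heartbeat desc;
```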
### 2.3 Rate-limit configuration
* Config in Postgres, counters in Redis:
```sql
create table routing.rate_limit_config (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
key text not null,
limit_per_interval int not null,
interval_seconds int not null,
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, key)
);
```
---
## 3. VEX & Vulnerability Module
**Schema:** `vex.*`
**Mission:** ingest vulnerability facts, keep decisions & triage state.
### 3.1 Facts as JSONB
* `vex.vuln_fact`
```sql
create table vex.vuln_fact (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
source text not null, -- e.g. "nvd", "vendor_x_vex"
external_id text, -- e.g. CVE, advisory id
payload jsonb not null,
schema_version int not null,
received_at timestamptz not null default now()
);
```
* Index patterns:
```sql
create index on vex.vuln_fact (tenant_id, source);
create index on vex.vuln_fact (tenant_id, external_id);
create index vuln_fact_payload_gin on vex.vuln_fact using gin (payload);
```
### 3.2 Decisions as relational data
* `vex.package`
```sql
create table vex.package (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
name text not null,
version text not null,
ecosystem text not null, -- e.g. "pypi", "npm"
created_at timestamptz not null default now(),
unique (tenant_id, name, version, ecosystem)
);
```
* `vex.vex_decision`
```sql
create table vex.vex_decision (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
package_id uuid not null references vex.package(id),
vuln_id text not null,
status text not null check (status in (
'not_affected', 'affected', 'fixed', 'under_investigation'
)),
rationale text,
proof_ref text, -- CAS digest or URL
decided_by uuid,
decided_at timestamptz not null default now(),
created_at timestamptz not null default now(),
updated_at timestamptz not null default now(),
unique (tenant_id, package_id, vuln_id)
);
```
* For history:
* Keep current state in `vex_decision`.
* Mirror previous versions into `vex_decision_history` table (similar to feature flags).
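Recording a decision then becomes an idempotent upsert against the unique key; a sketch, assuming history mirroring is handled by a trigger like the feature-flag one:

```sql
insert into vex.vex_decision
  (tenant_id, package_id, vuln_id, status, rationale, proof_ref, decided_by)
values
  (:tenant_id, :package_id, :vuln_id, :status, :rationale, :proof_ref, :decided_by)
on conflict (tenant_id, package_id, vuln_id)
do update set
  status     = excluded.status,
  rationale  = excluded.rationale,
  proof_ref  = excluded.proof_ref,
  decided_by = excluded.decided_by,
  decided_at = now(),
  updated_at = now();
```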
### 3.3 Triage queues with materialized views
* Example triage view:
```sql
create materialized view vex.mv_triage_queue as
select
d.tenant_id,
p.name,
p.version,
d.vuln_id,
d.status,
d.decided_at
from vex.vex_decision d
join vex.package p on p.id = d.package_id
where d.status = 'under_investigation';
```
* Refresh options:
* Scheduled refresh (cron/worker).
* Or **incremental** via triggers (more complex; use only when needed).
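For scheduled refreshes, a concurrent refresh avoids blocking readers; it requires a unique index on the view. A sketch (add `package_id` or `ecosystem` to the view if name + version is not unique per tenant):

```sql
-- Concurrent refresh needs a unique index covering every row of the view.
create unique index mv_triage_queue_pk
  on vex.mv_triage_queue (tenant_id, vuln_id, name, version);

-- Run from the scheduled worker; readers see the old rows until the swap completes.
refresh materialized view concurrently vex.mv_triage_queue;
```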
### 3.4 RLS for VEX
* All tables scoped by `tenant_id`.
* Typical policy:
```sql
alter table vex.vex_decision enable row level security;
create policy p_vex_tenant on vex.vex_decision
for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```
---
## 4. Unknowns Module
**Schema:** `unknowns.*`
**Mission:** represent uncertainty and how it changes over time.
### 4.1 Bitemporal unknowns table
* `unknowns.unknown`
```sql
create table unknowns.unknown (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
subject_hash text not null, -- stable identifier for "thing" being reasoned about
kind text not null, -- e.g. "reachability", "version_inferred"
context jsonb not null, -- extra info: call graph node, evidence, etc.
valid_from timestamptz not null default now(),
valid_to timestamptz,
sys_from timestamptz not null default now(),
sys_to timestamptz,
created_at timestamptz not null default now()
);
```
* “Exactly one open unknown per subject/kind” pattern:
```sql
create unique index unknown_one_open_per_subject
on unknowns.unknown (tenant_id, subject_hash, kind)
where valid_to is null;
```
### 4.2 Closing an unknown
* Close by setting `valid_to` and `sys_to`:
```sql
update unknowns.unknown
set valid_to = now(), sys_to = now()
where id = :id and valid_to is null;
```
* Never hard-delete; keep all rows for audit/explanation.
### 4.3 Convenience views
* Current unknowns:
```sql
create view unknowns.current as
select *
from unknowns.unknown
where valid_to is null;
```
### 4.4 RLS
* Same tenant policy as other modules; unknowns are tenant-scoped.
---
## 5. Artifact Index / CAS Module
**Schema:** `artifact.*`
**Mission:** index of immutable blobs stored in OCI / S3 / MinIO etc.
### 5.1 Artifact index
* `artifact.artifact`
```sql
create table artifact.artifact (
digest text primary key, -- e.g. "sha256:..."
tenant_id uuid not null,
media_type text not null,
size_bytes bigint not null,
created_at timestamptz not null default now(),
created_by uuid
);
```
* Validate digest shape with a CHECK:
```sql
alter table artifact.artifact
add constraint chk_digest_format
check (digest ~ '^sha[0-9]+:[0-9a-fA-F]{32,}$');
```
### 5.2 Signatures and tags
* `artifact.signature`
```sql
create table artifact.signature (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
artifact_digest text not null references artifact.artifact(digest),
signer text not null,
signature_payload jsonb not null,
created_at timestamptz not null default now()
);
```
* `artifact.tag`
```sql
create table artifact.tag (
id uuid primary key default gen_random_uuid(),
tenant_id uuid not null,
name text not null,
artifact_digest text not null references artifact.artifact(digest),
created_at timestamptz not null default now(),
unique (tenant_id, name)
);
```
### 5.3 RLS
* Ensure that tenants cannot see each other's digests, even if the CAS backing store is shared:
```sql
alter table artifact.artifact enable row level security;
create policy p_artifact_tenant on artifact.artifact
for all using (tenant_id = current_setting('app.tenant_id')::uuid);
```
---
## 6. Shared Outbox / Event Pattern
**Schema:** `core.*`
**Mission:** reliable events for external side-effects.
### 6.1 Outbox table
* `core.outbox`
```sql
create table core.outbox (
id uuid primary key default gen_random_uuid(),
tenant_id uuid,
aggregate_type text not null, -- e.g. "vex_decision", "feature_flag"
aggregate_id uuid,
topic text not null,
payload jsonb not null,
created_at timestamptz not null default now(),
dispatched_at timestamptz,
dispatch_attempts int not null default 0,
error text
);
```
### 6.2 Usage rule
* For anything that must emit an event (webhook, Kafka, notifications):
* In the **same transaction** as the change:
* write primary data (e.g. `vex.vex_decision`),
* insert an `outbox` row.
* A background worker:
* pulls undelivered rows,
* sends to external system,
* updates `dispatched_at`/`dispatch_attempts`/`error`.
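A sketch of the worker's claim-and-mark step, assuming the `core.outbox` columns above; the retry cap and batch size are illustrative, and the external publish happens outside the claiming transaction so the rule about long-running calls inside transactions still holds. `for update skip locked` lets several dispatcher instances drain the outbox without stepping on each other:

```sql
-- Step 1 (short transaction): claim a batch and record the attempt, then commit.
with claimed as (
  select id
  from core.outbox
  where dispatched_at is null
    and dispatch_attempts < 10          -- illustrative retry cap
  order by created_at
  limit 100
  for update skip locked
)
update core.outbox o
set dispatch_attempts = o.dispatch_attempts + 1
from claimed c
where o.id = c.id
returning o.id, o.topic, o.payload;

-- Step 2 (outside any transaction): publish each returned row to MQ/webhook.

-- Step 3: mark success, or record the error for the next sweep.
update core.outbox set dispatched_at = now() where id = :id;
update core.outbox set error = :error_message where id = :id;
```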
---
## 7. Indexing & Query Patterns per Module
### 7.1 Authority
* Index:
* `user(tenant_id, email)`
* `grant(tenant_id, user_id)`
* `oauth_token(token_hash)`
* Typical query patterns:
* Look up user by `tenant_id + email`.
* All roles/grants for a user; design composite indexes accordingly.
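For the "all roles for a user" path, a sketch of the join the composite indexes should serve:

```sql
-- All role names granted to one user in the current tenant.
select r.name
from authority."grant" g
join authority.role r on r.id = g.role_id
where g.tenant_id = current_setting('app.tenant_id')::uuid
  and g.user_id = :user_id;
```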
### 7.2 Routing & Flags
* Index:
* `feature_flag(tenant_id, key)`
* partial index on enabled flags:
```sql
create index on routing.feature_flag (tenant_id, key)
where is_enabled;
```
* `instance(tenant_id, status)`, `instance(tenant_id, domain)`.
### 7.3 VEX
* Index:
* `package(tenant_id, name, version, ecosystem)`
* `vex_decision(tenant_id, package_id, vuln_id)`
* GIN on `vuln_fact.payload` for flexible querying.
### 7.4 Unknowns
* Index:
* unique open unknown per subject/kind (shown above).
* `unknown(tenant_id, kind)` for filtering by kind.
### 7.5 Artifact
* Index:
* PK on `digest`.
* `signature(tenant_id, artifact_digest)`.
* `tag(tenant_id, name)`.
---
## 8. Transaction & Isolation Guidelines
* Default isolation: **READ COMMITTED**.
* For critical sequences (e.g., provisioning a tenant, bulk role updates):
* consider **REPEATABLE READ** or **SERIALIZABLE** and keep transactions short (see the sketch after this list).
* Pattern:
* One transaction per logical user action (e.g., “set flag”, “record decision”).
* Never do long-running external calls inside a database transaction.
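A sketch of the critical-sequence pattern; the retry-on-serialization-failure loop lives in application code:

```sql
begin isolation level serializable;

-- Replace a user's grants for one tenant in a single short transaction.
delete from authority."grant"
where tenant_id = :tenant_id
  and user_id = :user_id;

insert into authority."grant" (tenant_id, user_id, role_id)
select :tenant_id, :user_id, r.id
from authority.role r
where r.tenant_id = :tenant_id
  and r.name = any(:role_names);

commit;  -- on serialization failure (SQLSTATE 40001) the caller retries the block
```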
---
If you'd like, next step I can turn this into:
* concrete `CREATE SCHEMA` + `CREATE TABLE` migration files, and
* a short “How to write queries in each module” cheatsheet for devs (with example SELECT/INSERT/UPDATE patterns).

View File

@@ -0,0 +1,585 @@
Here's a tight, practical pattern to make your scanner's vuln DB updates rock-solid even when feeds hiccup:
# Offline, verifiable update bundles (DSSE + Rekor v2)
**Idea:** distribute DB updates as offline tarballs. Each tarball ships with:
* a **DSSE-signed** statement (e.g., in-toto style) over the bundle hash
* a **Rekor v2 receipt** proving the signature/statement was logged
* a small **manifest.json** (version, created_at, content hashes)
**Startup flow (happy path):**
1. Load latest tarball from your local `updates/` cache.
2. Verify DSSE signature against your trusted public keys.
3. Verify Rekor v2 receipt (inclusion proof) matches the DSSE payload hash.
4. If both pass, unpack/activate; record the bundle's **trust_id** (e.g., statement digest).
5. If anything fails, **keep using the last good bundle**. No service disruption.
**Why this helps**
* **Air-gap friendly:** no live network needed at activation time.
* **Tamper-evident:** DSSE + Rekor receipt proves provenance and transparency.
* **Operational stability:** feed outages become non-events—scanner just keeps the last good state.
---
## File layout inside each bundle
```
/bundle-2025-11-29/
manifest.json # { version, created_at, entries[], sha256s }
payload.tar.zst # the actual DB/indices
payload.tar.zst.sha256
statement.dsse.json # DSSE-wrapped statement over payload hash
rekor-receipt.json # Rekor v2 inclusion/verification material
```
---
## Acceptance/Activation rules
* **Trust root:** pin one (or more) publisher public keys; rotate via separate, out-of-band process.
* **Monotonicity:** only activate if `manifest.version > current.version` (or if trust policy explicitly allows replay for rollback testing).
* **Atomic switch:** unpack to `db/staging/`, validate, then symlink-flip to `db/active/`.
* **Quarantine on failure:** move bad bundles to `updates/quarantine/` with a reason code.
---
## Minimal .NET 10 verifier sketch (C#)
```csharp
public sealed record BundlePaths(string Dir) {
public string Manifest => Path.Combine(Dir, "manifest.json");
public string Payload => Path.Combine(Dir, "payload.tar.zst");
public string Dsse => Path.Combine(Dir, "statement.dsse.json");
public string Receipt => Path.Combine(Dir, "rekor-receipt.json");
}
public async Task<bool> ActivateBundleAsync(BundlePaths b, TrustConfig trust, string activeDir) {
var manifest = await Manifest.LoadAsync(b.Manifest);
if (!await Hashes.VerifyAsync(b.Payload, manifest.PayloadSha256)) return false;
// 1) DSSE verify (publisher keys pinned in trust)
var (okSig, dssePayloadDigest) = await Dsse.VerifyAsync(b.Dsse, trust.PublisherKeys);
if (!okSig || dssePayloadDigest != manifest.PayloadSha256) return false;
// 2) Rekor v2 receipt verify (inclusion + statement digest == dssePayloadDigest)
if (!await RekorV2.VerifyReceiptAsync(b.Receipt, dssePayloadDigest, trust.RekorPub)) return false;
// 3) Stage, validate, then atomically flip
var staging = Path.Combine(activeDir, "..", "staging");
DirUtil.Empty(staging);
await TarZstd.ExtractAsync(b.Payload, staging);
if (!await LocalDbSelfCheck.RunAsync(staging)) return false;
SymlinkUtil.AtomicSwap(source: staging, target: activeDir);
State.WriteLastGood(manifest.Version, dssePayloadDigest);
return true;
}
```
---
## Operational playbook
* **On boot & daily at HH:MM:** try `ActivateBundleAsync()` on the newest bundle; on failure, log and continue.
* **Telemetry (no PII):** reason codes (SIG_FAIL, RECEIPT_FAIL, HASH_MISMATCH, SELFTEST_FAIL), versions, last_good.
* **Keys & rotation:** keep `publisher.pub` and `rekor.pub` in a root-owned, read-only path; rotate via a separate signed “trust bundle”.
* **Defense-in-depth:** verify both the **payload hash** and each file's hash listed in `manifest.entries[]`.
* **Rollback:** allow `--force-activate <bundle>` for emergency testing, but mark as **non-monotonic** in state.
---
## What to hand your release team
* A Make/CI target that:
1. Builds `payload.tar.zst` and computes hashes
2. Generates `manifest.json`
3. Creates and signs the **DSSE statement**
4. Submits to Rekor (or your mirror) and saves the **v2 receipt**
5. Packages the bundle folder and publishes to your offline repo
* A checksum file (`*.sha256sum`) for ops to verify out-of-band.
---
If you want, I can turn this into a StellaOps spec page (`docs/modules/scanner/offline-bundles.md`) plus a small reference implementation (C# library + CLI) that drops right into your Scanner service.
Here's a “drop-in” Stella Ops dev guide for **DSSE-signed Offline Scanner Updates** — written in the same spirit as the existing docs and sprint files.
You can treat this as the seed for `docs/modules/scanner/development/dsse-offline-updates.md` (or similar).
---
# DSSE-Signed Offline Scanner Updates — Developer Guidelines
> **Audience**
> Scanner, Export Center, Attestor, CLI, and DevOps engineers implementing DSSE-signed offline vulnerability updates and integrating them into the Offline Update Kit (OUK).
>
> **Context**
>
> * OUK already ships **signed, atomic offline update bundles** with merged vulnerability feeds, container images, and an attested manifest.([git.stella-ops.org][1])
> * DSSE + Rekor is already used for **scan evidence** (SBOM attestations, Rekor proofs).([git.stella-ops.org][2])
> * Sprints 160/162 add **attestation bundles** with manifest, checksums, DSSE signature, and optional transparency log segments, and integrate them into OUK and CLI flows.([git.stella-ops.org][3])
These guidelines tell you how to **wire all of that together** for “offline scanner updates” (feeds, rules, packs) in a way that matches Stella Ops' determinism + sovereignty promises.
---
## 0. Mental model
At a high level, you're building this:
```text
Advisory mirrors / Feeds builders
        |
        v
ExportCenter.AttestationBundles
  (creates DSSE + Rekor evidence
   for each offline update snapshot)
        |
        v
Offline Update Kit (OUK) builder
  (adds feeds + evidence to kit tarball)
        |
        v
stella offline kit import / admin CLI
  (verifies Cosign + DSSE + Rekor segments,
   then atomically swaps scanner feeds)
```
Online, Rekor is live; offline, you rely on **bundled Rekor segments / snapshots** and the existing OUK mechanics (import is atomic, old feeds kept until new bundle is fully verified).([git.stella-ops.org][1])
---
## 1. Goals & nongoals
### Goals
1. **Authentic offline snapshots**
Every offline scanner update (OUK or delta) must be verifiably tied to:
* a DSSE envelope,
* a certificate chain rooted in Stella's Fulcio/KMS profile or BYO KMS/HSM,
* *and* a Rekor v2 inclusion proof or bundled log segment.([Stella Ops][4])
2. **Deterministic replay**
Given:
* a specific offline update kit (`stella-ops-offline-kit-<DATE>.tgz` + `offline-manifest-<DATE>.json`)([git.stella-ops.org][1])
* its DSSE attestation bundle + Rekor segments
every verifier must reach the *same* verdict on integrity and contents — online or fully air-gapped.
3. **Separation of concerns**
* Export Center: build attestation bundles, no business logic about scanning.([git.stella-ops.org][5])
* Scanner: import & apply feeds; verify but not generate DSSE.
* Signer / Attestor: own DSSE & Rekor integration.([git.stella-ops.org][2])
4. **Operational safety**
* Imports remain **atomic and idempotent**.
* Old feeds stay live until the new update is **fully verified** (Cosign + DSSE + Rekor).([git.stella-ops.org][1])
### Nongoals
* Designing new crypto or log formats.
* Perfeed DSSE envelopes (you can have more later, but the minimum contract is **bundlelevel** attestation).
---
## 2. Bundle contract for DSSE-signed offline updates
You're extending the existing OUK contract:
* OUK already packs:
* merged vuln feeds (OSV, GHSA, optional NVD 2.0, CNNVD/CNVD, ENISA, JVN, BDU),
* container images (`stella-ops`, Zastava, etc.),
* provenance (Cosign signature, SPDX SBOM, intoto SLSA attestation),
* `offline-manifest.json` + detached JWS signed during export.([git.stella-ops.org][1])
For **DSSE-signed offline scanner updates**, add a new logical layer:
### 2.1. Files to ship
Inside each offline kit (full or delta) you must produce:
```text
/attestations/
offline-update.dsse.json # DSSE envelope
offline-update.rekor.json # Rekor entry + inclusion proof (or segment descriptor)
/manifest/
offline-manifest.json # existing manifest
offline-manifest.json.jws # existing detached JWS
/feeds/
... # existing feed payloads
```
The exact paths can be adjusted, but keep:
* **One DSSE bundle per kit** (min spec).
* **One canonical Rekor proof file** per DSSE envelope.
### 2.2. DSSE payload contents (minimal)
Define (or reuse) a predicate type such as:
```jsonc
{
"payloadType": "application/vnd.in-toto+json",
"payload": { /* base64 */ }
}
```
Decoded payload (in-toto statement) should **at minimum** contain:
* **Subject**
* `name`: `stella-ops-offline-kit-<DATE>.tgz`
* `digest.sha256`: tarball digest
* **Predicate type** (recommendation)
* `https://stella-ops.org/attestations/offline-update/1`
* **Predicate fields**
* `offline_manifest_sha256`: SHA-256 of `offline-manifest.json`
* `feeds`: array of feed entries such as `{ name, snapshot_date, archive_digest }` (mirrors the `rules_and_feeds` style used in the moat doc).([Stella Ops][6])
* `builder`: CI workflow id / git commit / Export Center job id
* `created_at`: UTC ISO-8601
* `oukit_channel`: e.g., `edge`, `stable`, `fips-profile`
**Guideline:** this DSSE payload is the **single canonical description** of “what this offline update snapshot is”.
### 2.3. Rekor material
Attestor must:
* Submit `offline-update.dsse.json` to Rekor v2, obtaining:
* `uuid`
* `logIndex`
* inclusion proof (`rootHash`, `hashes`, `checkpoint`)
* Serialize that to `offline-update.rekor.json` and store it in object storage + OUK staging, so it ships in the kit.([git.stella-ops.org][2])
For fully offline operation:
* Either:
* embed a **minimal log segment** containing that entry; or
* rely on daily Rekor snapshot exports included elsewhere in the kit.([git.stella-ops.org][2])
---
## 3. Implementation by module
### 3.1 Export Center — attestation bundles
**Working directory:** `src/ExportCenter/StellaOps.ExportCenter.AttestationBundles`([git.stella-ops.org][7])
**Responsibilities**
1. **Compose attestation bundle job** (EXPORTATTEST74001)
* Input: a snapshot identifier (e.g., offline kit build id or feed snapshot date).
* Read manifest and feed metadata from the Export Center's storage.([git.stella-ops.org][5])
* Generate the DSSE payload structure described above.
* Call `StellaOps.Signer` to wrap it in a DSSE envelope.
* Call `StellaOps.Attestor` to submit DSSE → Rekor and get the inclusion proof.([git.stella-ops.org][2])
* Persist:
* `offline-update.dsse.json`
* `offline-update.rekor.json`
* any log segment artifacts.
2. **Integrate into offline kit packaging** (EXPORTATTEST74002 / 75001)
* The OUK builder (Python script `ops/offline-kit/build_offline_kit.py`) already assembles artifacts & manifests.([Stella Ops][8])
* Extend that pipeline (or add an Export Center step) to:
* fetch the attestation bundle for the snapshot,
* place it under `/attestations/` in the kit staging dir,
* ensure `offline-manifest.json` contains entries for the DSSE and Rekor files (name, sha256, size, capturedAt).([git.stella-ops.org][1])
3. **Contracts & schemas**
* Define a small JSON schema for `offline-update.rekor.json` (UUID, index, proof fields) and check it into `docs/11_DATA_SCHEMAS.md` or module-local schemas.
* Keep all new payload schemas **versioned**; avoid “shape drift”.
**Do / Dont**
* **Do** treat the attestation bundle job as *pure aggregation* (AOC guardrail: no modification of evidence).([git.stella-ops.org][5])
* **Do** rely on Signer + Attestor; don't hand-roll DSSE/Rekor logic in Export Center.([git.stella-ops.org][2])
* **Don't** reach out to external networks from this job — it must run with the same offline-ready posture as the rest of the platform.
---
### 3.2 Offline Update Kit builder
**Working area:** `ops/offline-kit/*` + `docs/24_OFFLINE_KIT.md`([git.stella-ops.org][1])
Guidelines:
1. **Preserve current guarantees**
* Imports must remain **idempotent and atomic**, with **old feeds kept until the new bundle is fully verified**. This now includes DSSE/Rekor checks in addition to Cosign + JWS.([git.stella-ops.org][1])
2. **Staging layout**
* When staging a kit, ensure the tree looks like:
```text
out/offline-kit/staging/
feeds/...
images/...
manifest/offline-manifest.json
attestations/offline-update.dsse.json
attestations/offline-update.rekor.json
```
* Update `offline-manifest.json` so each new file appears with:
* `name`, `sha256`, `size`, `capturedAt`.([git.stella-ops.org][1])
3. **Deterministic ordering**
* File lists in manifests must be in a stable order (e.g., lexical paths).
* Timestamps = UTC ISO-8601 only; never use local time. (Matches determinism guidance in AGENTS.md + policy/runs docs.)([git.stella-ops.org][9])
4. **Delta kits**
* For deltas (`stella-ouk-YYYY-MM-DD.delta.tgz`), DSSE should still cover:
* the delta tarball digest,
* the **logical state** (feeds & versions) after applying the delta.
* Don't shortcut by “attesting only the diff files” — the predicate must describe the resulting snapshot.
---
### 3.3 Scanner — import & activation
**Working directory:** `src/Scanner/StellaOps.Scanner.WebService`, `StellaOps.Scanner.Worker`([git.stella-ops.org][9])
Scanner already exposes admin flows for:
* **Offline kit import**, which:
* validates the Cosign signature of the kit,
* uses the attested manifest,
* keeps old feeds until verification is done.([git.stella-ops.org][1])
Add DSSE/Rekor awareness as follows:
1. **Verification sequence (happy path)**
On `import-offline-usage-kit`:
1. Validate **Cosign** signature of the tarball.
2. Validate `offline-manifest.json` with its JWS signature.
3. Verify **file digests** for all entries (including `/attestations/*`).
4. Verify **DSSE**:
* Call `StellaOps.Attestor.Verify` (or CLI equivalent) with:
* `offline-update.dsse.json`
* `offline-update.rekor.json`
* local Rekor log snapshot / segment (if configured)([git.stella-ops.org][2])
* Ensure the payload digest matches the kit tarball + manifest digests.
5. Only after all checks pass:
* swap Scanners feed pointer to the new snapshot,
* emit an audit event noting:
* kit filename, tarball digest,
* DSSE statement digest,
* Rekor UUID + log index.
2. **Config surface**
Add config keys (names illustrative):
```yaml
scanner:
offlineKit:
requireDsse: true # fail import if DSSE/Rekor verification fails
rekorOfflineMode: true # use local snapshots only
attestationVerifier: https://attestor.internal
```
* Mirror them via ASP.NET Core config + env vars (`SCANNER__OFFLINEKIT__REQUIREDSSE`, etc.), following the same pattern as the DSSE/Rekor operator guide.([git.stella-ops.org][2])
3. **Failure behaviour**
* **DSSE/Rekor fail, Cosign + manifest OK**
* Keep old feeds active.
* Mark import as failed; surface a `ProblemDetails` error via API/UI.
* Log structured fields: `rekorUuid`, `attestationDigest`, `offlineKitHash`, `failureReason`.([git.stella-ops.org][2])
* **Config flag to soften during rollout**
* When `requireDsse=false`, treat DSSE/Rekor failure as a warning and still allow the import (for initial observation phase), but emit alerts. This mirrors the “observe → enforce” pattern in the DSSE/Rekor operator guide.([git.stella-ops.org][2])
---
### 3.4 Signer & Attestor
You mostly **reuse** existing guidance:([git.stella-ops.org][2])
* Add a new predicate type & schema for offline updates in Signer.
* Ensure Attestor:
* can submit offlineupdate DSSE envelopes to Rekor,
* can emit verification routines (used by CLI and Scanner) that:
* verify the DSSE signature,
* check the certificate chain against the configured root pack (FIPS/eIDAS/GOST/SM, etc.),([Stella Ops][4])
* verify Rekor inclusion using either live log or local snapshot.
* For fully airgapped installs:
* rely on Rekor **snapshots mirrored** into Offline Kit (already recommended in the operator guide's offline section).([git.stella-ops.org][2])
---
### 3.5 CLI & UI
Extend CLI with explicit verbs (matching EXPORTATTEST sprints):([git.stella-ops.org][10])
* `stella attest bundle verify --bundle path/to/offline-kit.tgz --rekor-key rekor.pub`
* `stella attest bundle import --bundle ...` (for sites that prefer a two-step “verify then import” flow)
* Wire UI Admin → Offline Kit screen so that:
* verification status shows both **Cosign/JWS** and **DSSE/Rekor** state,
* policy banners display kit generation time, manifest hash, and DSSE/Rekor freshness.([Stella Ops][11])
---
## 4. Determinism & offline-safety rules
When touching any of this code, keep these rules front-of-mind (they align with the policy DSL and architecture docs):([Stella Ops][4])
1. **No hidden network dependencies**
* All verification **must work offline** given the kit + Rekor snapshots.
* Any fallback to live Rekor / Fulcio endpoints must be explicitly toggled and never on by default for “offline mode”.
2. **Stable serialization**
* DSSE payload JSON:
* stable ordering of fields,
* no float weirdness,
* UTC timestamps.
3. **Replayable imports**
* Running `import-offline-usage-kit` twice with the same bundle must be a no-op after the first time.
* The DSSE payload for a given snapshot must not change over time; if it does, bump the predicate or snapshot version.
4. **Explainability**
* When verification fails, errors must explain **what** mismatched (kit digest, manifest digest, DSSE envelope hash, Rekor inclusion) so auditors can reason about it.
---
## 5. Testing & CI expectations
Tie this into the existing CI workflows (`scanner-determinism.yml`, `attestation-bundle.yml`, `offline-kit` pipelines, etc.):([git.stella-ops.org][12])
### 5.1 Unit & integration tests
Write tests that cover:
1. **Happy paths**
* Full kit import with valid:
* Cosign,
* manifest JWS,
* DSSE,
* Rekor proof (online and offline modes).
2. **Corruption scenarios**
* Tampered feed file (hash mismatch).
* Tampered `offline-manifest.json`.
* Tampered DSSE payload (signature fails).
* Mismatched Rekor entry (payload digest doesn't match DSSE).
3. **Offline scenarios**
* No network access, only Rekor snapshot:
* DSSE verification still passes,
* Rekor proof validates against local tree head.
4. **Rollback logic**
* Import fails at DSSE/Rekor step:
* scanner DB still points at previous feeds,
* metrics/logs show failure and no partial state.
### 5.2 SLOs & observability
Reuse metrics suggested by DSSE/Rekor guide and adapt to OUK imports:([git.stella-ops.org][2])
* `offlinekit_import_total{status="success|failed_dsse|failed_rekor|failed_cosign"}`
* `offlinekit_attestation_verify_latency_seconds` (histogram)
* `attestor_rekor_success_total`, `attestor_rekor_retry_total`, `rekor_inclusion_latency`
* Dashboards: kit versions per environment, time since last kit, DSSE/Rekor health.
---
## 6. Developer checklist (TL;DR)
When you pick up a task touching DSSE-signed offline updates:
1. **Read the background**
* `docs/modules/scanner/operations/dsse-rekor-operator-guide.md`([git.stella-ops.org][2])
* `docs/24_OFFLINE_KIT.md` (and public offline kit guide).([git.stella-ops.org][1])
* Relevant sprint file (`SPRINT_160_export_evidence`, `SPRINT_162_exportcenter_i`, etc.).([git.stella-ops.org][10])
2. **Implement**
* Generate DSSE payloads in Export Center only.
* Call Signer & Attestor; persist DSSE + Rekor JSON next to manifests.
* Extend OUK builder to include attestation bundle and list it in `offline-manifest.json`.
* Update Scanner import flow to verify DSSE/Rekor before swapping feeds.
3. **Test**
* Unit tests for bundle composition & schema.
* Integration tests for import + rollback.
* Determinism tests (same inputs → same DSSE payload).
4. **Wire telemetry**
* Counters + latency histograms.
* Logs with `offlineKitHash`, `attestationDigest`, `rekorUuid`.
5. **Document**
* Update `docs/modules/export-center/architecture.md`, `docs/modules/scanner/architecture.md`, and the OUK docs where flows or contracts changed.([git.stella-ops.org][5])
---
If you tell me which module you're actually coding in next (Scanner, Export Center, CLI, or Attestor), I can turn this into a very concrete “AGENTS.md-style” section with exact file paths, class names, and a starter test layout for that module.
[1]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7bf40b8589c94078e8eadb240553c02f097a5127/docs/24_OFFLINE_KIT.md "git.stella-ops.org/24_OFFLINE_KIT.md at 7bf40b8589c94078e8eadb240553c02f097a5127 - git.stella-ops.org - Gitea: Git with a cup of tea"
[2]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/13e4b53dda1575ba46c6188c794fd465ec6fdeec/docs/modules/scanner/operations/dsse-rekor-operator-guide.md "git.stella-ops.org/dsse-rekor-operator-guide.md at 13e4b53dda1575ba46c6188c794fd465ec6fdeec - git.stella-ops.org - Gitea: Git with a cup of tea"
[3]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/raw/commit/61f963fd52cd4d6bb2f86afc5a82eac04c04b00e/docs/implplan/SPRINT_162_exportcenter_i.md?utm_source=chatgpt.com "https://git.stella-ops.org/stella-ops.org/git.stel..."
[4]: https://stella-ops.org/docs/07_high_level_architecture/index.html?utm_source=chatgpt.com "Open • Sovereign • Modular container security - Stella Ops"
[5]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/d870da18ce194c6a5f1a6d71abea36205d9fb276/docs/export-center/architecture.md?utm_source=chatgpt.com "Export Center Architecture - Stella Ops"
[6]: https://stella-ops.org/docs/moat/?utm_source=chatgpt.com "Open • Sovereign • Modular container security - Stella Ops"
[7]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/79b8e53441e92dbc63684f42072434d40b80275f/src/ExportCenter?utm_source=chatgpt.com "Code - Stella Ops"
[8]: https://stella-ops.org/docs/24_offline_kit/?utm_source=chatgpt.com "Offline Update Kit (OUK) — AirGap Bundle - Stella Ops Open ..."
[9]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/7768555f2d107326050cc5ff7f5cb81b82b7ce5f/AGENTS.md "git.stella-ops.org/AGENTS.md at 7768555f2d107326050cc5ff7f5cb81b82b7ce5f - git.stella-ops.org - Gitea: Git with a cup of tea"
[10]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/src/commit/66cb6c4b8af58a33efa1521b7953dda834431497/docs/implplan/SPRINT_160_export_evidence.md?utm_source=chatgpt.com "git.stella-ops.org/SPRINT_160_export_evidence.md at ..."
[11]: https://stella-ops.org/about/?utm_source=chatgpt.com "Signed Reachability · Deterministic Replay · Sovereign Crypto"
[12]: https://git.stella-ops.org/stella-ops.org/git.stella-ops.org/actions/?actor=0&status=0&workflow=sdk-publish.yml&utm_source=chatgpt.com "Actions - git.stella-ops.org - Gitea: Git with a cup of tea"

View File

@@ -0,0 +1,425 @@
Here's a simple metric that will make your security UI (and teams) radically better: **Time-to-Evidence (TTE)** — the time from opening a finding to seeing *raw proof* (a data-flow edge, an SBOM line, or a VEX note), not a summary.
---
### What it is
* **Definition:** TTE = `t_first_proof_rendered - t_open_finding`.
* **Proof =** the exact artifact or path that justifies the claim (e.g., `package-lock.json: line 214 → openssl@1.1.1`, `reachability: A → B → C sink`, or `VEX: not_affected due to unreachable code`).
* **Target:** **P95 ≤ 15s** (stretch: P99 ≤ 30s). If 95% of findings show proof within 15 seconds, the UI stays honest: evidence before opinion, low noise, fast explainability.
---
### Why it matters
* **Trust:** People accept decisions they can *verify* quickly.
* **Triage speed:** Proof-first UIs cut back-and-forth and guesswork.
* **Noise control:** If you can't surface proof fast, you probably shouldn't surface the finding yet.
---
### How to measure (engineeringready)
* Emit two stamps per finding view:
* `t_open_finding` (on route enter or modal open).
* `t_first_proof_rendered` (first DOM paint of SBOM line / path list / VEX clause).
* Store as `tte_ms` in a lightweight events table in Postgres (sketched below) with tags: `tenant`, `finding_id`, `proof_kind` (`sbom|reachability|vex`), `source` (`local|remote|cache`).
* Nightly rollup: compute P50/P90/P95/P99 by proof_kind and page.
* Alert when **P95 > 15s** for 15 minutes.
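A minimal sketch of that events table, matching the rollup query below; column names are assumptions:

```sql
create table tte_events (
  id         bigint generated always as identity primary key,
  tenant     text        not null,
  finding_id text        not null,
  proof_kind text        not null,   -- 'sbom' | 'reachability' | 'vex'
  source     text        not null,   -- 'local' | 'remote' | 'cache'
  tte_ms     integer     not null,
  ts         timestamptz not null default now()
);

create index tte_events_ts_proof_idx on tte_events (ts, proof_kind);
```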
---
### UI contract (keeps the UX honest)
* **Above the fold:** always show a compact **Proof panel** first (not hidden behind tabs).
* **Skeletons over spinners:** reserve space; render partial proof as soon as any piece is ready.
* **Plain text copy affordance:** “Copy SBOM line / path” button right next to the proof.
* **Defer nonproof widgets:** CVSS badges, remediation prose, and charts load *after* proof.
* **Empty-state truth:** if no proof exists, say “No proof available yet” and show the loader for *that* proof type only (don't pretend with summaries).
---
### Backend rules of thumb
* **Pre-index for first paint:** cache top N proof items per hot finding (e.g., first SBOM hit + shortest path).
* **Bound queries:** proof queries must be *O(log n)* on indexed columns (pkg name@version, file hash, graph node id).
* **Chunked streaming:** send first proof chunk <200ms after backend hit; don't hold for the full set.
* **Timeout budget:** 12s backend budget + 3s UI/render margin = 15s P95.
---
### Minimal contract to add in your code
```ts
// Frontend: fire on open
metrics.emit('finding_open', { findingId, t: performance.now() });
// When the first real proof node/line hits the DOM:
metrics.emit('proof_rendered', { findingId, proofKind, t: performance.now() });
```
```sql
-- Rollup (hourly)
SELECT
proof_kind,
percentile_cont(0.95) WITHIN GROUP (ORDER BY tte_ms) AS p95_ms
FROM tte_events
WHERE ts >= now() - interval '1 hour'
GROUP BY proof_kind;
```
---
### What to put on the team dashboard
* **TTE P95 by page** (Findings list, Finding details).
* **TTE P95 by proof_kind** (sbom / reachability / vex).
* **Error budget burn**: minutes over target per day.
* **Top regressions**: last 7 days vs prior 7.
---
### Acceptance checklist for any finding view
* [ ] First paint shows a real proof snippet (not a summary).
* [ ] Copy proof button works within 1 click.
* [ ] TTE P95 in staging ≤ 10s; in prod ≤ 15s.
* [ ] If proof missing, explicit empty-state + retry path.
* [ ] Telemetry sampled ≥ 50% of sessions (or 100% for internal).
---
If you want, I can turn this into a **`docs/UX/tte.md`** spec plus a **Grafana panel JSON** and a tiny **.NET middleware** to emit the metrics; just say the word and I'll drop in ready-to-use snippets.
Perfect, got it; let's turn **Time-to-Evidence (TTE)** into a proper set of *“how we track UX health”* guidelines.
I'll treat this like something you can drop into an internal engineering/UX doc.
---
## 1. What exactly is Time-to-Evidence?
**Definition**
> **TTE = t_first_proof_rendered - t_open_finding**
* **t_open_finding**: when the user first opens a finding / detail view (e.g., vulnerability, alert, ticket, log event).
* **t_first_proof_rendered**: when the UI first paints **actual evidence** that backs the finding, for example:
* The SBOM row showing `package@version`.
* The call-graph/data-flow path to a sink.
* A VEX note explaining why something is (not) affected.
* A raw log snippet that the alert is based on.
**Key principle:**
TTE measures **how long users have to trust you blindly** before they can see proof with their own eyes.
---
## 2. UX health goals & targets
Treat TTE like latency SLOs:
* **Primary SLO**:
* **P95 TTE ≤ 15s** for all findings in normal conditions.
* **Stretch SLO**:
* **P99 TTE ≤ 30s** for heavy cases (big graphs, huge SBOMs, cold caches).
* **Guardrail**:
* P50 TTE should be **< 3s**. If the median creeps up, you're in trouble even if P95 looks OK.
You can refine by feature:
* Simple proof (single SBOM row, small payload):
* P95 ≤ 5s.
* Complex proof (reachability graph, crossrepo joins):
* P95 ≤ 15s.
**UX rule of thumb**
* < 2s: feels instant.
* 2-10s: acceptable if clearly loading something heavy.
* > 10s: needs **strong** feedback (progress, partial results, explanations).
* > 30s: the system should probably **offer fallback** (e.g., “download raw evidence” or “retry”).
---
## 3. Instrumentation guidelines
### 3.1 Event model
Emit two core events per finding view:
1. **`finding_open`**
* When user opens the finding details (route enter / modal open).
* Must include:
* `finding_id`
* `tenant_id` / `org_id`
* `user_role` (admin, dev, triager, etc.)
* `entry_point` (list, search, notification, deep link)
* `ui_version` / `build_sha`
2. **`proof_rendered`**
* First time *any* qualifying proof element is painted.
* Must include:
* `finding_id`
* `proof_kind` (`sbom | reachability | vex | logs | other`)
* `source` (`local_cache | backend_api | 3rd_party`)
* `proof_height` (e.g., pixel offset from top) to ensure it's actually above the fold or very close.
**Derived metric**
Your telemetry pipeline should compute:
```text
tte_ms = proof_rendered.timestamp - finding_open.timestamp
```
If there are multiple `proof_rendered` events for the same `finding_open`, use:
* **TTE (first proof)**: minimum timestamp; this is the primary SLO.
* Optionally, **TTE (full evidence)**: last proof in a defined “bundle” (e.g., path + SBOM row).
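A sketch of that computation in SQL, assuming both events land in a hypothetical `ux_events` table carrying the frontend `performance.now()` value (`t_ms`) and a per-view correlation id (`view_id`):

```sql
-- First-proof TTE per finding view; min(t_ms) picks the earliest proof_rendered,
-- and using the frontend-captured values keeps backend clock skew out of it.
select o.view_id,
       o.finding_id,
       min(p.t_ms) - o.t_ms as tte_ms
from ux_events o
join ux_events p
  on p.view_id = o.view_id
 and p.event = 'proof_rendered'
where o.event = 'finding_open'
group by o.view_id, o.finding_id, o.t_ms;
```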
### 3.2 Implementation notes
**Frontend**
* Emit `finding_open` as soon as:
* The route is confirmed and
* You know which `finding_id` is being displayed.
* Emit `proof_rendered`:
* **Not** when you *fetch* data, but when at least one evidence component is **visibly rendered**.
* Easiest approach: hook into component lifecycle / intersection observer on the evidence container.
Pseudo-example:
```ts
// On route/mount:
metrics.emit('finding_open', {
findingId,
entryPoint,
userRole,
uiVersion,
t: performance.now()
});
// In EvidencePanel component, after first render with real data:
if (!hasEmittedProof && hasRealEvidence) {
metrics.emit('proof_rendered', {
findingId,
proofKind: 'sbom',
source: 'backend_api',
t: performance.now()
});
hasEmittedProof = true;
}
```
**Backend**
* No special requirement beyond:
* Stable IDs (`finding_id`).
* Knowing which API endpoints respond with evidence payloads — you'll want to correlate backend latency with TTE later.
---
## 4. Data quality & sampling
If you want TTE to drive decisions, the data must be boringly reliable.
**Guidelines**
1. **Sample rate**
* Start with **100%** in staging.
* In production, aim for **≥ 25% of sessions** for TTE events at minimum; 100% is ideal if volume is reasonable.
2. **Clock skew**
* Prefer **frontend timestamps** using `performance.now()` for TTE; they're monotonic within a tab.
* Don't mix backend clocks into the TTE calculation.
3. **Bot / synthetic traffic**
* Tag synthetic tests (`is_synthetic = true`) and exclude them from UX health dashboards.
4. **Retry behavior**
* If the proof fails to load and user hits “retry”:
* Treat it as a separate measurement (`retry = true`) or
* Log an additional `proof_error` event with error class (timeout, 5xx, network, parse, etc.).
---
## 5. Dashboards: how to watch TTE
You want a small, opinionated set of views that answer:
> “Is UX getting better or worse for people trying to understand findings?”
### 5.1 Core widgets
1. **TTE distribution**
* P50 / P90 / P95 / P99 per day (or per release).
* Split by `proof_kind`.
2. **TTE by page / surface**
* Finding list → detail.
* Deep links from notifications.
* Direct URLs / bookmarks.
3. **TTE by user segment**
* New users vs power users.
* Different roles (security engineer vs application dev).
4. **Error budget panel**
* “Minutes over SLO per day”: e.g., the sum of all user-minutes where TTE > 15s.
* Use this to prioritize work.
5. **Correlation with engagement**
* Scatter: TTE vs session length, or TTE vs “user clicked ignore / snooze”.
* Aim to confirm the obvious: **long TTE → worse engagement/completion**.
### 5.2 Operational details
* Update granularity: **real-time or ≤ 15 min** for on-call/ops panels.
* Retention: at least **90 days** to see trends across big releases.
* Breakdowns:
* `backend_region` (to catch regional issues).
* `build_version` (to spot regressions quickly).
---
## 6. UX & engineering design rules anchored in TTE
These are the **behavior rules** for the product that keep TTE healthy.
### 6.1 “Evidence first” layout rules
* **Evidence above the fold**
* At least *one* proof element must be visible **without scrolling** on a typical laptop viewport.
* **Summary second**
* CVSS scores, severity badges, long descriptions: all secondary. Evidence should come *before* opinion.
* **No fake proof**
* Don't use placeholders that *look* like evidence but aren't (e.g., “example path” or generic text).
* If evidence is still loading, show a clear skeleton/loader with “Loading evidence…”.
### 6.2 Loading strategy rules
* Start fetching evidence **as soon as navigation begins**, not after the page is fully mounted.
* Use **lazy loading** for noncritical widgets until after proof is shown.
* If a call is known to be heavy:
* Consider **precomputing** and caching the top evidence (shortest path, first SBOM hit).
* Stream results: render first proof item as soon as it arrives; don't wait for the full list.
### 6.3 Empty / error state rules
* If there is genuinely no evidence:
* Explicitly say **“No supporting evidence available yet”** and treat TTE as:
* Either “no value” (excluded), or
* A special bucket `proof_kind = "none"`.
* If loading fails:
* Show a clear error and a **retry** that re-emits `proof_rendered` when successful.
* Log `proof_error` with reason; track error rate alongside TTE.
---
## 7. How to *use* TTE in practice
### 7.1 For releases
For any change that affects findings UI or evidence plumbing:
* Add a release checklist item:
* “No regression on TTE P95 for [pages X, Y].”
* During rollout:
* Compare **pre- vs post-release** TTE P95 by `ui_version`.
* If regression > 20%:
* Roll back, or
* Add a followup ticket explicitly tagged with the regression.
### 7.2 For experiments / A/B tests
When running UI experiments around findings:
* Always capture TTE per variant.
* Compare:
* TTE P50/P95.
* Task completion rate (e.g., “user changed status”).
* Subjective UX (CSAT) if you have it.
You're looking for patterns like:
* Variant B: **+5% completion**, **+8% TTE** → maybe OK.
* Variant C: **+2% completion**, **+70% TTE** → probably not acceptable.
### 7.3 For prioritization
Use TTE as a lever in planning:
* If P95 TTE is healthy and stable:
* More room for new features / experiments.
* If P95 TTE is trending up for 2+ weeks:
* Time to schedule a “TTE debt” story: caching, query optimization, UI re-layout, etc.
---
## 8. Quick “TTEready” checklist
You're “tracking UX health with TTE” if you can honestly tick these:
1. **Instrumentation**
* [ ] `finding_open` + `proof_rendered` events exist and are correlated.
* [ ] TTE computed in a stable pipeline (joins, dedupe, etc.).
2. **Targets**
* [ ] TTE SLOs defined (P95, P99) and agreed by UX + engineering.
3. **Dashboards**
* [ ] A dashboard shows TTE by proof kind, page, and release.
* [ ] On-call / ops can see TTE in near real-time.
4. **UX rules**
* [ ] Evidence is visible above the fold for all main finding types.
* [ ] Noncritical widgets load after evidence.
* [ ] Empty/error states are explicit about evidence availability.
5. **Process**
* [ ] Major UI changes check TTE pre vs post as part of release acceptance.
* [ ] Regressions in TTE create real tickets, not just “we'll watch it”.
---
If you tell me what stack you're on (e.g., React + Next.js + OpenTelemetry + X observability tool), I can turn this into concrete code snippets and an example dashboard spec (fields, queries, charts) tailored exactly to your setup.

View File

@@ -0,0 +1,576 @@
Here's a tight, practical blueprint to turn your SBOM→VEX links into an auditable “proof spine”—using signed DSSE statements and a per-dependency trust anchor—so every VEX verdict can be traced, verified, and replayed.
# What this gives you
* A **chain of evidence** from each SBOM entry → analysis → VEX verdict.
* **Tamper-evident** DSSE-signed records (offline-friendly).
* **Deterministic replay**: same inputs → same verdicts (great for audits/regulators).
# Core objects (canonical IDs)
* **ArtifactID**: digest of package/container (e.g., `sha256:…`).
* **SBOMEntryID**: stable ID for a component in an SBOM (`sbomDigest:package@version[:purl]`).
* **EvidenceID**: hash of raw evidence (scanner JSON, reachability, exploit intel).
* **ReasoningID**: hash of normalized reasoning (rules/lattice inputs used).
* **VEXVerdictID**: hash of the final VEX statement body.
* **ProofBundleID**: merkle root of {SBOMEntryID, EvidenceID[], ReasoningID, VEXVerdictID}.
* **TrustAnchorID**: per-dependency anchor (public key + policy) used to validate the above.
# Signed DSSE envelopes you'll produce
1. **Evidence Statement** (per evidence item)
* `subject`: SBOMEntryID
* `predicateType`: `evidence.stella/v1`
* `predicate`: source, tool version, timestamps, EvidenceID
* **Signers**: scanner/ingestor key
2. **Reasoning Statement**
* `subject`: SBOMEntryID
* `predicateType`: `reasoning.stella/v1` (your lattice/policy inputs + ReasoningID)
* **Signers**: “Policy/Lattice Engine” key (Authority)
3. **VEX Verdict Statement**
* `subject`: SBOMEntryID
* `predicateType`: CycloneDX or CSAF VEX body + VEXVerdictID
* **Signers**: VEXer key (or vendor key if you have it)
4. **Proof Spine Statement** (the spine itself)
* `subject`: SBOMEntryID
* `predicateType`: `proofspine.stella/v1`
* `predicate`: EvidenceID[], ReasoningID, VEXVerdictID, ProofBundleID
* **Signers**: Authority key
# Trust model (per-dependency anchor)
* **TrustAnchor** (per package/purl): { TrustAnchorID, allowed signers (KMS refs, PKs), accepted predicateTypes, policy version, revocation list }.
* Store anchors in **Authority** and pin them in your graph by SBOMEntryID→TrustAnchorID.
* Optional: PQC mode (Dilithium/Falcon) for long-term archives.
# Verification pipeline (deterministic)
1. Resolve SBOMEntryID → TrustAnchorID.
2. Verify every DSSE envelope's signature **against the anchor's allowed keys**.
3. Recompute EvidenceID/ReasoningID/VEXVerdictID from raw content; compare hashes.
4. Recompute ProofBundleID (merkle root) and compare to the spine.
5. Emit a **Receipt**: {ProofBundleID, verification log, tool digests}. Cache it.
# Storage layout (Postgres + blob store)
* `sbom_entries(entry_id PK, bom_digest, purl, version, artifact_digest, trust_anchor_id)`
* `dsse_envelopes(env_id PK, entry_id, predicate_type, signer_keyid, body_hash, envelope_blob_ref, signed_at)`
* `spines(entry_id PK, bundle_id, evidence_ids[], reasoning_id, vex_id, anchor_id, created_at)`
* `trust_anchors(anchor_id PK, purl_pattern, allowed_keyids[], policy_ref, revoked_keys[])`
* Blobs (immutable): raw evidence, normalized reasoning JSON, VEX JSON, DSSE bytes.
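As a concrete example of that layout, a sketch of the `dsse_envelopes` and `spines` DDL; the column types are assumptions, so adjust them to your ID formats:

```sql
create table dsse_envelopes (
  env_id            uuid primary key,
  entry_id          text not null references sbom_entries(entry_id),
  predicate_type    text not null,        -- e.g. 'evidence.stella/v1'
  signer_keyid      text not null,
  body_hash         text not null,        -- hash of the canonical payload
  envelope_blob_ref text not null,        -- content-addressed blob pointer
  signed_at         timestamptz not null
);

create table spines (
  entry_id     text primary key references sbom_entries(entry_id),
  bundle_id    text not null,             -- ProofBundleID (merkle root)
  evidence_ids text[] not null,
  reasoning_id text not null,
  vex_id       text not null,
  anchor_id    text not null references trust_anchors(anchor_id),
  created_at   timestamptz not null default now()
);
```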
# API surface (clean and small)
* `POST /proofs/:entry/spine` → submit or update spine (idempotent by ProofBundleID)
* `GET /proofs/:entry/receipt` → full verification receipt (JSON)
* `GET /proofs/:entry/vex` → the verified VEX body
* `GET /anchors/:anchor` → fetch trust anchor (for offline kits)
# Normalization rules (so hashes are stable)
* Canonical JSON (UTF-8, sorted keys, no insignificant whitespace).
* Strip volatile fields (timestamps that aren't part of the semantic claim).
* Version your schemas: `evidence.stella/v1`, `reasoning.stella/v1`, etc.
# Signing keys & rotation
* Keep keys in your **Authority** module (KMS/HSM; offline export for air-gap).
* Publish key material via an **attestation feed** (or Rekor mirror) for third-party audit.
* Rotate by **adding** new allowed_keyids in the TrustAnchor; never mutate old envelopes.
# CI/CD hooks
* On SBOM ingest → create/refresh SBOMEntry rows + attach TrustAnchor.
* On scan completion → produce Evidence Statements (DSSE) immediately.
* On policy evaluation → produce Reasoning + VEX, then assemble Spine.
* Gate releases on `GET /proofs/:entry/receipt` == PASS.
# UX (auditorfriendly)
* **Proof timeline** per entry: SBOM → Evidence tiles → Reasoning → VEX → Receipt.
* One-click “Recompute & Compare” to show deterministic replay passes.
* Red/amber flags when a signature no longer matches a TrustAnchor or a key is revoked.
# Minimal dev checklist
* [ ] Implement canonicalizers (Evidence, Reasoning, VEX).
* [ ] Implement DSSE sign/verify (ECDSA + optional PQC).
* [ ] TrustAnchor registry + resolver by purl pattern.
* [ ] Merkle bundling to get ProofBundleID.
* [ ] Receipt generator + verifier.
* [ ] Postgres schema + blob GC (contentaddressed).
* [ ] CI gates + API endpoints above.
* [ ] Auditor UI: timeline + diff + receipts download.
If you want, I can drop in a ready-to-use JSON schema set (`evidence.stella/v1`, `reasoning.stella/v1`, `proofspine.stella/v1`) and sample DSSE envelopes wired to your .NET 10 stack.
Here's a focused **Stella Ops Developer Guidelines** doc, specifically for the pipeline that turns **SBOM data into verifiable proofs** (your SBOM → Evidence → Reasoning → VEX → Proof Spine).
Feel free to paste this into your internal handbook and tweak names to match your repos/services.
---
# Stella Ops Developer Guidelines
## Turning SBOM Data Into Verifiable Proofs
---
## 1. Mental Model: What You're Actually Building
For every component in an SBOM, Stella must be able to answer, *“Why should anyone trust our VEX verdict for this dependency, today and ten years from now?”*
We do that with a pipeline:
1. **SBOM Ingest**
Raw SBOM → validated → normalized → `SBOMEntryID`.
2. **Evidence Collection**
Scans, feeds, configs, reachability, etc. → canonical evidence blobs → `EvidenceID` → DSSE-signed.
3. **Reasoning / Policy**
Policy + evidence → deterministic reasoning → `ReasoningID` → DSSE-signed.
4. **VEX Verdict**
VEX statement (CycloneDX / CSAF) → canonicalized → `VEXVerdictID` → DSSE-signed.
5. **Proof Spine**
`{SBOMEntryID, EvidenceIDs[], ReasoningID, VEXVerdictID}` → merkle bundle → `ProofBundleID` → DSSE-signed.
6. **Verification & Receipts**
Re-run verification → `Receipt` that proves everything above is intact and anchored to trusted keys.
Everything you do in this area should keep this spine intact and verifiable.
---
## 2. Non-Negotiable Invariants
These are the rules you don't break without an explicit, company-level decision:
1. **Immutability of Signed Facts**
* DSSE envelopes (evidence, reasoning, VEX, spines) are append-only.
* You never edit or delete content inside a previously signed envelope.
* Corrections are made by **superseding** (new statement pointing at the old one).
2. **Determinism**
* Same `{SBOMEntryID, Evidence set, policyVersion}` ⇒ same `{ReasoningID, VEXVerdictID, ProofBundleID}`.
* No non-deterministic inputs (e.g., “current time”, random IDs) in anything that affects IDs or verdicts.
3. **Traceability**
* Every VEX verdict must be traceable back to:
* The precise SBOM entry
* Concrete evidence blobs
* A specific policy & reasoning snapshot
* A trust anchor defining allowed signers
4. **Least Trust / Least Privilege**
* Services only know the keys and data they need.
* Trust is always explicit: through **TrustAnchors** and signature verification, never “because it's in our DB”.
5. **Backwards Compatibility**
* New code must continue to verify **old proofs**.
* New policies must **not rewrite history**; they produce *new* spines, leaving old ones intact.
---
## 3. SBOM Ingestion Guidelines
**Goal:** Turn arbitrary SBOMs into stable, addressable `SBOMEntryID`s and safe internal models.
### 3.1 Inputs & Formats
* Support at least:
* CycloneDX (JSON)
* SPDX (JSON / Tag-Value)
* For each ingested SBOM, store:
* Raw SBOM bytes (immutable, content-addressed)
* A normalized internal representation (your own model)
### 3.2 IDs
* Generate:
* `sbomDigest` = hash(raw SBOM, canonical form)
* `SBOMEntryID` = `sbomDigest + purl + version` (or equivalent stable tuple)
* `SBOMEntryID` must:
* Not depend on ingestion time or database IDs.
* Be reproducible from SBOM + deterministic normalization.
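A minimal sketch of such an ID computation (helper name and separator choice are illustrative, not the real Stella code):
```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SbomEntryId
{
    // SBOMEntryID derived purely from content + identity: no timestamps, no DB keys.
    public static string Compute(byte[] canonicalSbomBytes, string purl, string version)
    {
        var sbomDigest = Convert.ToHexString(SHA256.HashData(canonicalSbomBytes)).ToLowerInvariant();

        // Stable tuple joined with an unambiguous separator, then hashed again.
        var material = $"{sbomDigest}\n{purl}\n{version}";
        var entryHash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return "sbom-entry:" + Convert.ToHexString(entryHash).ToLowerInvariant();
    }
}
```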
### 3.3 Validation & Errors
* Validate:
* Syntax (JSON, schema)
* Core semantics (package identifiers, digests, versions)
* If invalid:
* Reject the SBOM **but** record a small DSSE “failure attestation” explaining:
* Why it failed
* Which file
* Which system version
* This still gives you a proof trail for “we tried and it failed”.
---
## 4. Evidence Collection Guidelines
**Goal:** Capture all inputs that influence the verdict in a canonical, signed form.
Typical evidence types:
* SCA / vuln scanner results
* CVE feeds & advisory data
* Reachability / call graph analysis
* Runtime context (where this component is used)
* Manual assessments (e.g., security engineer verdicts)
### 4.1 Evidence Canonicalization
For every evidence item:
* Normalize to a schema like `evidence.stella/v1` with fields such as:
* `source` (scanner name, feed)
* `sourceVersion` (tool version, DB version)
* `collectionTime`
* `sbomEntryId`
* `vulnerabilityId` (if applicable)
* `rawFinding` (or pointer to it)
* Canonical JSON rules:
* Sorted keys
* UTF8, no extraneous whitespace
* No volatile fields beyond what's semantically needed (e.g., you might include `collectionTime`, but then be aware that it affects the hash and treat that choice consciously).
Then:
* Compute `EvidenceID = hash(canonicalEvidenceJson)`.
* Wrap in DSSE:
* `subject`: `SBOMEntryID`
* `predicateType`: `evidence.stella/v1`
* `predicate`: canonical evidence + `EvidenceID`.
* Sign with **evidence-ingestor key** (per environment).
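A rough sketch of the canonicalize-then-hash step, assuming a dictionary-shaped evidence model (the real `evidence.stella/v1` schema and DSSE signer wiring are out of scope here):
```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public static class EvidenceCanonicalizer
{
    // Top-level-only sketch: sorted keys, compact UTF-8 JSON, then SHA-256.
    // A real canonicalizer must also sort nested objects recursively.
    public static (string CanonicalJson, string EvidenceId) Canonicalize(
        IReadOnlyDictionary<string, object?> evidence)
    {
        var sorted = new SortedDictionary<string, object?>(StringComparer.Ordinal);
        foreach (var (key, value) in evidence)
        {
            sorted[key] = value;
        }

        var canonicalJson = JsonSerializer.Serialize(sorted);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(canonicalJson));
        return (canonicalJson, "evidence:" + Convert.ToHexString(hash).ToLowerInvariant());
    }
}
```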
### 4.2 Ops Rules
* **Idempotency:**
Re-running the same scan with same inputs should produce the same evidence object and `EvidenceID`.
* **Tool changes:**
When tool version or configuration changes, that's a *new* evidence statement with a new `EvidenceID`. Do not overwrite old evidence.
* **Partial failure:**
If a scan fails, produce a minimal failure evidence record (with error details) instead of “nothing”.
---
## 5. Reasoning & Policy Engine Guidelines
**Goal:** Turn evidence into a defensible, replayable reasoning step with a clear policy version.
### 5.1 Reasoning Object
Define a canonical reasoning schema, e.g. `reasoning.stella/v1`:
* `sbomEntryId`
* `evidenceIds[]` (sorted)
* `policyVersion`
* `inputs`: normalized form of all policy inputs (severity thresholds, lattice rules, etc.)
* `intermediateFindings`: optional but useful — e.g., “reachable vulns = …”
Then:
* Canonicalize JSON and compute `ReasoningID = hash(canonicalReasoning)`.
* Wrap in DSSE:
* `subject`: `SBOMEntryID`
* `predicateType`: `reasoning.stella/v1`
* `predicate`: canonical reasoning + `ReasoningID`.
* Sign with **Policy/Authority key**.
### 5.2 Determinism
* Reasoning functions must be **pure**:
* Inputs: SBOMEntryID, evidence set, policy version, configuration.
* No hidden calls to external APIs at decision time (fetch feeds earlier and record them as evidence).
* If you need “current time” in policy:
* Treat it as **explicit input** and record it inside reasoning under `inputs.currentEvaluationTime`.
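As an illustration, the reasoning entry point can be shaped so that nothing implicit can leak in (interface and record names below are assumptions, not the real Stella types):
```csharp
using System;
using System.Collections.Generic;

public interface IReasoningEngine
{
    // Pure function of its arguments: same inputs always yield the same ReasoningID.
    ReasoningResult Evaluate(
        string sbomEntryId,
        IReadOnlyList<string> evidenceIds,        // sorted, already-recorded evidence
        string policyVersion,
        DateTimeOffset currentEvaluationTime);    // explicit input, recorded under inputs.currentEvaluationTime
}

public sealed record ReasoningResult(string ReasoningId, string CanonicalReasoningJson);
```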
### 5.3 Policy Evolution
* When policy changes:
* Bump `policyVersion`.
* New evaluations produce new `ReasoningID` and new VEX/spines.
* Don't retroactively apply new policy to old reasoning objects; generate new ones alongside.
---
## 6. VEX Verdict Guidelines
**Goal:** Generate VEX statements that are strongly tied to SBOM entries and your reasoning.
### 6.1 Shape
* Target standard formats:
* CycloneDX VEX
* or CSAF
* Required linkages:
* Component reference = `SBOMEntryID` or a resolvable component identifier from your SBOM normalization layer.
* Vulnerability IDs (CVE, GHSA, internal IDs).
* Status (`not_affected`, `affected`, `fixed`, etc.).
* Justification & impact.
### 6.2 Canonicalization & Signing
* Define a canonical VEX body schema (subset of the standard + internal metadata):
* `sbomEntryId`
* `vulnerabilityId`
* `status`
* `justification`
* `policyVersion`
* `reasoningId`
* Canonicalize JSON → `VEXVerdictID = hash(canonicalVexBody)`.
* DSSE-envelope:
* `subject`: `SBOMEntryID`
* `predicateType`: e.g. `cdx-vex.stella/v1`
* `predicate`: canonical VEX + `VEXVerdictID`.
* Sign with **VEXer key** or vendor key (depending on trust anchor).
### 6.3 External VEX
* When importing vendor VEX:
* Verify signature against the vendor's TrustAnchor.
* Canonicalize to your internal schema but preserve:
* Original document
* Original signature material
* Record “source = vendor” vs “source = stella” so auditors see origin.
---
## 7. Proof Spine Guidelines
**Goal:** Build a compact, tamper-evident “bundle” that ties everything together.
### 7.1 Structure
For each `SBOMEntryID`, gather:
* `EvidenceIDs[]` (sorted lexicographically).
* `ReasoningID`.
* `VEXVerdictID`.
Compute:
* Merkle tree root (or deterministic hash) over:
* `sbomEntryId`
* sorted `EvidenceIDs[]`
* `ReasoningID`
* `VEXVerdictID`
* Result is `ProofBundleID`.
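A minimal sketch of the deterministic bundle hash (a flat hash over sorted parts rather than a full Merkle tree; names are illustrative):
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ProofBundleId
{
    public static string Compute(
        string sbomEntryId,
        IEnumerable<string> evidenceIds,
        string reasoningId,
        string vexVerdictId)
    {
        var parts = new List<string> { sbomEntryId };
        parts.AddRange(evidenceIds.OrderBy(id => id, StringComparer.Ordinal)); // sorted lexicographically
        parts.Add(reasoningId);
        parts.Add(vexVerdictId);

        var material = string.Join("\n", parts);
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(material));
        return "proof-bundle:" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```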
Create a DSSE “spine”:
* `subject`: `SBOMEntryID`
* `predicateType`: `proofspine.stella/v1`
* `predicate`:
* `evidenceIds[]`
* `reasoningId`
* `vexVerdictId`
* `policyVersion`
* `proofBundleId`
* Sign with **Authority key**.
### 7.2 Ops Rules
* Spine generation is idempotent:
* Same inputs → same `ProofBundleID`.
* Never mutate existing spines; new policy or new evidence ⇒ new spine.
* Keep a clear API contract:
* `GET /proofs/:entry` returns **all** spines, each labeled with `policyVersion` and timestamps.
---
## 8. Storage & Schema Guidelines
**Goal:** Keep proofs queryable forever without breaking verification.
### 8.1 Tables (conceptual)
* `sbom_entries`: `entry_id`, `bom_digest`, `purl`, `version`, `artifact_digest`, `trust_anchor_id`.
* `dsse_envelopes`: `env_id`, `entry_id`, `predicate_type`, `signer_keyid`, `body_hash`, `envelope_blob_ref`, `signed_at`.
* `spines`: `entry_id`, `proof_bundle_id`, `policy_version`, `evidence_ids[]`, `reasoning_id`, `vex_verdict_id`, `anchor_id`, `created_at`.
* `trust_anchors`: `anchor_id`, `purl_pattern`, `allowed_keyids[]`, `policy_ref`, `revoked_keys[]`.
### 8.2 Schema Changes
Always follow:
1. **Expand**
* Add new columns/tables.
* Make new code tolerant of old data.
2. **Backfill**
* Idempotent jobs that fill in new IDs/fields without touching old DSSE payloads.
3. **Contract**
* Only after all code uses the new model.
* Never drop the raw DSSE or raw SBOM blobs.
---
## 9. Verification & Receipts
**Goal:** Make it trivial (for you, customers, and regulators) to recheck everything.
### 9.1 Verification Flow
Given `SBOMEntryID` or `ProofBundleID`:
1. Fetch spine and trust anchor.
2. Verify:
* Spine DSSE signature against the TrustAnchor's allowed keys.
* VEX, reasoning, and evidence DSSE signatures.
3. Recompute:
* `EvidenceIDs` from stored canonical evidence.
* `ReasoningID` from reasoning.
* `VEXVerdictID` from VEX body.
* `ProofBundleID` from the above.
4. Compare to stored IDs.
Emit a **Receipt**:
* `proofBundleId`
* `verifiedAt`
* `verifierVersion`
* `anchorId`
* `result` (pass/fail, with reasons)
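A possible shape for that receipt (field names follow the list above; the record itself is illustrative):
```csharp
using System;
using System.Collections.Generic;

public sealed record Receipt(
    string ProofBundleId,
    DateTimeOffset VerifiedAt,
    string VerifierVersion,
    string AnchorId,
    string Result,                    // "pass" or "fail"
    IReadOnlyList<string> Reasons);   // failure reasons; empty on pass
```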
### 9.2 Offline Kit
* Provide a minimal CLI (`stella verify`) that:
* Accepts a bundle export (SBOM + DSSE envelopes + anchors).
* Verifies everything without network access.
Developers must ensure:
* Export format is documented and stable.
* All fields required for verification are included.
---
## 10. Security & Key Management (for Devs)
* Keys live in **KMS/HSM**, not env vars or config files.
* Separate keysets:
* `dev`, `staging`, `prod`
* Authority vs VEXer vs Evidence Ingestor.
* TrustAnchors:
* Edit via Authority service only.
* Every change:
* Requires code-reviewed change.
* Writes an audit log entry.
Never:
* Log private keys.
* Log full DSSE envelopes in plaintext logs (log IDs and hashes instead).
---
## 11. Observability & OnCall Expectations
### 11.1 Metrics
For the SBOM→Proof pipeline, expose:
* `sboms_ingested_total`
* `sbom_ingest_errors_total{reason}`
* `evidence_statements_created_total`
* `reasoning_statements_created_total`
* `vex_statements_created_total`
* `proof_spines_created_total`
* `proof_verifications_total{result}` (pass/fail reason)
* Latency histograms per stage (`_duration_seconds`)
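If the services use `System.Diagnostics.Metrics`, the counters above might be registered roughly like this (meter name and tag keys are illustrative):
```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class ProofPipelineMetrics
{
    private static readonly Meter Meter = new("StellaOps.ProofPipeline");

    public static readonly Counter<long> SbomsIngested =
        Meter.CreateCounter<long>("sboms_ingested_total");

    public static readonly Counter<long> ProofVerifications =
        Meter.CreateCounter<long>("proof_verifications_total");

    // Tag the verification result so dashboards can split pass/fail.
    public static void RecordVerification(string result) =>
        ProofVerifications.Add(1, new KeyValuePair<string, object?>("result", result));
}
```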
### 11.2 Logging
Include in structured logs wherever relevant:
* `sbomEntryId`
* `proofBundleId`
* `anchorId`
* `policyVersion`
* `requestId` / `traceId`
### 11.3 Runbooks
You should maintain runbooks for at least:
* “Pipeline is stalled” (backlog of SBOMs, evidence, or spines).
* “Verification failures increased”.
* “Trust anchor or key issues” (rotation, revocation, misconfiguration).
* “Backfill gone wrong” (how to safely stop, resume, and audit).
---
## 12. Dev Workflow & PR Checklist (SBOM→Proof Changes Only)
When your change touches SBOM ingestion, evidence, reasoning, VEX, or proof spines, check:
* [ ] IDs (`SBOMEntryID`, `EvidenceID`, `ReasoningID`, `VEXVerdictID`, `ProofBundleID`) remain **deterministic** and fully specified.
* [ ] No mutation of existing DSSE envelopes or historical proof data.
* [ ] Schema changes follow **expand → backfill → contract**.
* [ ] New/updated TrustAnchors reviewed by Authority owner.
* [ ] Unit tests cover:
* Canonicalization for any new/changed predicate.
* ID computation.
* [ ] Integration test covers:
* SBOM → Evidence → Reasoning → VEX → Spine → Verification → Receipt.
* [ ] Observability updated:
* New paths emit logs & metrics.
* [ ] Rollback plan documented (especially for migrations & policy changes).
---
If you tell me which microservices/repos map to these stages (e.g. `stella-sbom-ingest`, `stella-proof-authority`, `stella-vexer`), I can turn this into a more concrete, service-by-service checklist with example API contracts and class/interface sketches.

docs/router/01-Step.md (new file, 422 lines)
Goal for this phase: get a clean, compiling skeleton in place that matches the spec and folder conventions, with zero real logic and minimal dependencies. After this, all future work plugs into this structure.
I'll break it into concrete tasks you can assign to agents.
---
## 1. Define the repository layout
**Owner: “Skeleton” / infra agent**
Target layout (no code yet, just dirs):
```text
/ (repo root)
StellaOps.Router.sln
/src
/StellaOps.Gateway.WebService
/__Libraries
/StellaOps.Router.Common
/StellaOps.Router.Config
/StellaOps.Microservice
/StellaOps.Microservice.SourceGen (empty stub for now)
/tests
/StellaOps.Router.Common.Tests
/StellaOps.Gateway.WebService.Tests
/StellaOps.Microservice.Tests
/docs
/router
specs.md (already exists)
README.md (placeholder, 2-3 lines)
```
Tasks:
1. Create `src`, `src/__Libraries`, `tests`, `docs/router` directories if missing.
2. Move/confirm `docs/router/specs.md` is the canonical spec.
3. Add `docs/router/README.md` with a pointer: “Start with specs.md; this folder will host router-related docs.”
---
## 2. Create the solution and projects
**Owner: skeleton agent**
### 2.1 Create solution
* At repo root:
```bash
dotnet new sln -n StellaOps.Router
```
* Add projects as they are created in the next step.
### 2.2 Create projects
For each project below:
* `dotnet new` with appropriate template.
* Set `RootNamespace` / `AssemblyName` to match folder & spec.
Projects:
1. **Gateway webservice**
```bash
cd src
dotnet new webapi -n StellaOps.Gateway.WebService
```
* This will create an ASP.NET Core Web API project; we'll trim it later.
2. **Common library**
```bash
cd src/__Libraries
dotnet new classlib -n StellaOps.Router.Common
```
3. **Config library**
```bash
dotnet new classlib -n StellaOps.Router.Config
```
4. **Microservice SDK**
```bash
dotnet new classlib -n StellaOps.Microservice
```
5. **Microservice Source Generator (stub)**
```bash
dotnet new classlib -n StellaOps.Microservice.SourceGen
```
* This will be converted to an Analyzer/SourceGen project later; for now it can compile as a plain library.
6. **Test projects**
Under `tests`:
```bash
cd tests
dotnet new xunit -n StellaOps.Router.Common.Tests
dotnet new xunit -n StellaOps.Gateway.WebService.Tests
dotnet new xunit -n StellaOps.Microservice.Tests
```
### 2.3 Add projects to solution
At repo root:
```bash
dotnet sln StellaOps.Router.sln add \
src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj \
src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj \
src/__Libraries/StellaOps.Microservice.SourceGen/StellaOps.Microservice.SourceGen.csproj \
tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj \
tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj \
tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj
```
---
## 3. Wire basic project references
**Owner: skeleton agent**
The reference graph should be:
* `StellaOps.Gateway.WebService`
* references `StellaOps.Router.Common`
* references `StellaOps.Router.Config`
* `StellaOps.Microservice`
* references `StellaOps.Router.Common`
* (later) references `StellaOps.Microservice.SourceGen` as analyzer; for now no reference.
* `StellaOps.Router.Config`
* references `StellaOps.Router.Common` (for `EndpointDescriptor`, `InstanceDescriptor`, etc.)
Test projects:
* `StellaOps.Router.Common.Tests` → `StellaOps.Router.Common`
* `StellaOps.Gateway.WebService.Tests` → `StellaOps.Gateway.WebService`
* `StellaOps.Microservice.Tests` → `StellaOps.Microservice`
Use `dotnet add reference`:
```bash
dotnet add src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj
dotnet add src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
dotnet add tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
dotnet add tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj reference \
src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj
dotnet add tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj reference \
src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
```
---
## 4. Set common build settings
**Owner: infra agent**
Add a `Directory.Build.props` at repo root to centralize:
* Target framework (e.g. `net8.0`).
* Nullable context.
* LangVersion.
Example (minimal):
```xml
<Project>
<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
<Nullable>enable</Nullable>
<LangVersion>preview</LangVersion> <!-- if needed for newer features -->
<ImplicitUsings>enable</ImplicitUsings>
</PropertyGroup>
</Project>
```
Then, strip redundant `<TargetFramework>` from individual `.csproj` files if desired.
---
## 5. Stub namespaces and “empty” entry points
**Owner: each project's agent**
### 5.1 Common library
Create empty placeholder types that match the spec names (no logic, just shells) so everything compiles and IntelliSense knows the shapes.
Example files:
* `TransportType.cs`
* `FrameType.cs`
* `InstanceHealthStatus.cs`
* `ClaimRequirement.cs`
* `EndpointDescriptor.cs`
* `InstanceDescriptor.cs`
* `ConnectionState.cs`
* `RoutingContext.cs`
* `RoutingDecision.cs`
* `PayloadLimits.cs`
* Interfaces: `IGlobalRoutingState`, `IRoutingPlugin`, `ITransportServer`, `ITransportClient`.
Each type can be an auto-property-only record/class/enum; no methods yet.
Example:
```csharp
namespace StellaOps.Router.Common;
public enum TransportType
{
Udp,
Tcp,
Certificate,
RabbitMq
}
```
and so on.
### 5.2 Config library
Add a minimal `RouterConfig` and `PayloadLimits` class aligned with the spec; again, just properties.
```csharp
namespace StellaOps.Router.Config;
public sealed class RouterConfig
{
public IList<ServiceConfig> Services { get; init; } = new List<ServiceConfig>();
public PayloadLimits PayloadLimits { get; init; } = new();
}
public sealed class ServiceConfig
{
public string Name { get; init; } = string.Empty;
public string DefaultVersion { get; init; } = "1.0.0";
}
```
No YAML binding, no logic yet.
### 5.3 Microservice library
Create:
* `StellaMicroserviceOptions` with required properties.
* `RouterEndpointConfig` (host/port/transport).
* Extension method `AddStellaMicroservice(...)` with an empty body that just registers options and placeholder services.
```csharp
namespace StellaOps.Microservice;
public sealed class StellaMicroserviceOptions
{
public string ServiceName { get; set; } = string.Empty;
public string Version { get; set; } = string.Empty;
public string Region { get; set; } = string.Empty;
public string InstanceId { get; set; } = string.Empty;
public IList<RouterEndpointConfig> Routers { get; set; } = new List<RouterEndpointConfig>();
public string? ConfigFilePath { get; set; }
}
public sealed class RouterEndpointConfig
{
public string Host { get; set; } = string.Empty;
public int Port { get; set; }
public TransportType TransportType { get; set; }
}
```
`AddStellaMicroservice`:
```csharp
public static class ServiceCollectionExtensions
{
public static IServiceCollection AddStellaMicroservice(
this IServiceCollection services,
Action<StellaMicroserviceOptions> configure)
{
services.Configure(configure);
// TODO: register internal SDK services in later phases
return services;
}
}
```
### 5.4 Microservice.SourceGen
For now:
* Leave this as an empty classlib with an empty `README.md` stating:
* “This project will host Roslyn source generators for endpoint discovery. No implementation yet.”
Don't hook it as an analyzer until there is content.
### 5.5 Gateway webservice
Simplify the scaffolded Web API to minimal:
* In `Program.cs`:
* Build a barebones `WebApplication` that:
* Binds `GatewayNodeConfig` from config.
* Adds controllers or minimal endpoints.
* Runs; no router logic yet.
Example:
```csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<GatewayNodeConfig>(
builder.Configuration.GetSection("GatewayNode"));
builder.Services.AddControllers();
var app = builder.Build();
app.MapControllers(); // may be empty for now
app.Run();
```
* Add `GatewayNodeConfig` class in `StellaOps.Gateway.WebService` project.
---
## 6. Make tests compile (even if empty)
**Owner: test agent**
For each test project:
* Reference the appropriate main project (already done).
* Add a single dummy test class so CI passes:
```csharp
public class SmokeTests
{
[Fact]
public void SolutionCompiles()
{
Assert.True(true);
}
}
```
This is just to ensure the pipeline runs; real tests come later.
---
## 7. Add initial CI/build pipeline
**Owner: infra agent**
Set up minimal CI (GitHub Actions, GitLab, Azure DevOps, whatever you use):
* Steps:
* `dotnet restore`
* `dotnet build StellaOps.Router.sln -c Release`
* `dotnet test StellaOps.Router.sln -c Release`
No packaging or deployment yet; just compile + tests.
---
## 8. Sanity check & readiness criteria
Before you let agents move to “Common model implementation” (next phase), confirm:
* The solution builds cleanly in a clean checkout (`dotnet restore`, `dotnet build`).
* All test projects run and pass (even with dummy tests).
* Namespaces and project names match the spec:
* `StellaOps.Gateway.WebService`
* `StellaOps.Router.Common`
* `StellaOps.Router.Config`
* `StellaOps.Microservice`
* There is no real business logic yet:
* No transport logic.
* No routing decisions.
* No reflection or YAML.
* `docs/router/specs.md` is referenced in `docs/router/README.md` as the spec.
At that point, the skeleton is in place and stable. Next phases can then focus on:
* Filling in `Common` contracts properly.
* Implementing the in-memory transport.
* Wiring minimal microservice/gateway flows.
If you want, I can outline the next phase (“implement core model + in-memory transport”) with a similar task breakdown next.

docs/router/02-Step.md (new file, 375 lines)
For this step, the goal is: make `StellaOps.Router.Common` the single, stable contract layer that everything else can depend on, with **no behavior** yet, just shapes. After this, gateway, microservice SDK, transports, and config can all compile against it.
Think of this as “lock down the domain vocabulary”.
---
## 0. Pre-work
**All devs touching Common:**
1. Read `docs/router/specs.md`, specifically:
* The sections describing:
* Enums (`TransportType`, `FrameType`, `InstanceHealthStatus`, etc.).
* Endpoint/instance/routing models.
* Frames and request/response correlation.
* Routing state and routing plugin.
2. Agree that no class/interface will be added to Common if it isn't in the spec (or discussed with you and then added to the spec).
---
## 1. Inventory and file layout
**Owner: “Common” lead**
1. From `specs.md`, extract a **type inventory** for `StellaOps.Router.Common`:
Enumerations:
* `TransportType`
* `FrameType`
* `InstanceHealthStatus`
Core value objects:
* `ClaimRequirement`
* `EndpointDescriptor`
* `InstanceDescriptor`
* `ConnectionState`
* `PayloadLimits` (if used from Common; otherwise keep in Config only)
* Any small value types you've defined (e.g. cancel payload, ping metrics, etc., if present in specs).
Routing:
* `RoutingContext`
* `RoutingDecision`
Frames:
* `Frame` (type + correlation id + payload)
* Optional payload contracts for HELLO, HEARTBEAT, ENDPOINTS_UPDATE, etc., if you've specified them explicitly.
Abstractions/interfaces:
* `IGlobalRoutingState`
* `IRoutingPlugin`
* `ITransportServer`
* `ITransportClient`
* Optional: `IRegionProvider` if you kept it in the spec.
2. Propose a file layout inside `src/__Libraries/StellaOps.Router.Common`:
Example:
```text
/StellaOps.Router.Common
/Enums
TransportType.cs
FrameType.cs
InstanceHealthStatus.cs
/Models
ClaimRequirement.cs
EndpointDescriptor.cs
InstanceDescriptor.cs
ConnectionState.cs
RoutingContext.cs
RoutingDecision.cs
Frame.cs
/Abstractions
IGlobalRoutingState.cs
IRoutingPlugin.cs
ITransportClient.cs
ITransportServer.cs
IRegionProvider.cs (if used)
```
3. Get a quick 👍/👎 from you on the layout (no code yet, just file names and namespaces).
---
## 2. Implement enums and basic models
**Owner: Common dev**
Scope: simple, immutable models, no methods.
1. **Enums**
Implement:
* `TransportType` with `[Udp, Tcp, Certificate, RabbitMq]`.
* `FrameType` with:
* `Hello`, `Heartbeat`, `EndpointsUpdate`, `Request`, `RequestStreamData`, `Response`, `ResponseStreamData`, `Cancel` (and any others in specs).
* `InstanceHealthStatus` with:
* `Unknown`, `Healthy`, `Degraded`, `Draining`, `Unhealthy`.
All enums live under `namespace StellaOps.Router.Common;`.
2. **Value models**
Implement as plain classes/records with auto-properties:
* `ClaimRequirement`:
* `string Type` (required).
* `string? Value` (optional).
* `EndpointDescriptor`:
* `string ServiceName`
* `string Version`
* `string Method`
* `string Path`
* `TimeSpan DefaultTimeout`
* `bool SupportsStreaming`
* `IReadOnlyList<ClaimRequirement> RequiringClaims`
* `InstanceDescriptor`:
* `string InstanceId`
* `string ServiceName`
* `string Version`
* `string Region`
* `ConnectionState`:
* `string ConnectionId`
* `InstanceDescriptor Instance`
* `InstanceHealthStatus Status`
* `DateTime LastHeartbeatUtc`
* `double AveragePingMs`
* `TransportType TransportType`
* `IReadOnlyDictionary<(string Method, string Path), EndpointDescriptor> Endpoints`
Design choices:
* Make constructors minimal (empty constructors okay for now).
* Use `init` where reasonable to encourage immutability for descriptors; `ConnectionState` can have mutable health fields.
3. **PayloadLimits (if in Common)**
If the spec places `PayloadLimits` in Common (versus Config), implement:
```csharp
public sealed class PayloadLimits
{
public long MaxRequestBytesPerCall { get; set; }
public long MaxRequestBytesPerConnection { get; set; }
public long MaxAggregateInflightBytes { get; set; }
}
```
If it's defined in Config only, leave it there and avoid duplication.
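To make the descriptor shape from item 2 concrete, a minimal `EndpointDescriptor` sketch could look like this (defaults such as the 30-second timeout are placeholders):
```csharp
using System;
using System.Collections.Generic;

namespace StellaOps.Router.Common;

public sealed class EndpointDescriptor
{
    public string ServiceName { get; init; } = string.Empty;
    public string Version { get; init; } = string.Empty;
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public TimeSpan DefaultTimeout { get; init; } = TimeSpan.FromSeconds(30);
    public bool SupportsStreaming { get; init; }
    public IReadOnlyList<ClaimRequirement> RequiringClaims { get; init; } = Array.Empty<ClaimRequirement>();
}
```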
---
## 3. Implement frame & correlation model
**Owner: Common dev**
1. Implement `Frame`:
```csharp
public sealed class Frame
{
public FrameType Type { get; init; }
public Guid CorrelationId { get; init; }
public byte[] Payload { get; init; } = Array.Empty<byte>();
}
```
2. If `specs.md` defines specific payload DTOs (e.g. `HelloPayload`, `HeartbeatPayload`, `CancelPayload`), define them too:
* `HelloPayload`:
* `InstanceDescriptor` and list of `EndpointDescriptor`s, or the equivalent properties.
* `HeartbeatPayload`:
* `InstanceId`, `Status`, metrics.
* `CancelPayload`:
* `string Reason` or similar.
Keep them as simple DTOs with no logic.
3. Do **not** implement serialization yet (no JSON/MessagePack references here); Common should only define shapes.
---
## 4. Routing abstractions
**Owner: Common dev**
Implement the routing interface + context & decision types.
1. `RoutingContext`:
* Match the spec. If your `specs.md` version includes `HttpContext`, follow it; if you intentionally kept Common free of ASP.NET types, use a neutral context (e.g. method/path/headers/principal).
* For now, if `HttpContext` is included in spec, define:
```csharp
public sealed class RoutingContext
{
public object HttpContext { get; init; } = default!; // or Microsoft.AspNetCore.Http.HttpContext if allowed
public EndpointDescriptor Endpoint { get; init; } = default!;
public string GatewayRegion { get; init; } = string.Empty;
}
```
Then you can refine the type once you finalize whether Common can reference ASP.NET packages. If you want to avoid that now, define your own lightweight context model and let gateway adapt.
2. `RoutingDecision`:
* Must include:
* `EndpointDescriptor Endpoint`
* `ConnectionState Connection`
* `TransportType TransportType`
* `TimeSpan EffectiveTimeout`
3. `IGlobalRoutingState`:
Interface only, no implementation:
```csharp
public interface IGlobalRoutingState
{
EndpointDescriptor? ResolveEndpoint(string method, string path);
IReadOnlyList<ConnectionState> GetConnectionsFor(
string serviceName,
string version,
string method,
string path);
}
```
4. `IRoutingPlugin`:
* Single method:
```csharp
public interface IRoutingPlugin
{
Task<RoutingDecision?> ChooseInstanceAsync(
RoutingContext context,
CancellationToken cancellationToken);
}
```
* No logic; just interface.
---
## 5. Transport abstractions
**Owner: Common dev**
Implement the shared transport contracts.
1. `ITransportServer`:
```csharp
public interface ITransportServer
{
Task StartAsync(CancellationToken cancellationToken);
Task StopAsync(CancellationToken cancellationToken);
}
```
2. `ITransportClient`:
Per spec, you need:
* A buffered call (request → response).
* A streaming call.
* A cancel call.
Interfaces only; content roughly:
```csharp
public interface ITransportClient
{
Task<Frame> SendRequestAsync(
ConnectionState connection,
Frame requestFrame,
TimeSpan timeout,
CancellationToken cancellationToken);
Task SendCancelAsync(
ConnectionState connection,
Guid correlationId,
string? reason = null);
Task SendStreamingAsync(
ConnectionState connection,
Frame requestHeader,
Stream requestBody,
Func<Stream, Task> readResponseBody,
PayloadLimits limits,
CancellationToken cancellationToken);
}
```
No implementation or transport-specific logic here. No network types beyond `Stream` and `Task`.
3. `IRegionProvider` (if you decided to keep it):
```csharp
public interface IRegionProvider
{
string Region { get; }
}
```
---
## 6. Wire Common into tests (sanity checks only)
**Owner: Common tests dev**
Create a few very simple unit tests in `StellaOps.Router.Common.Tests`:
1. **Shape tests** (these are mostly compile-time):
* That `EndpointDescriptor` has the expected properties and default values can be set.
* That `ConnectionState` can be constructed and that its `Endpoints` dictionary handles `(Method, Path)` keys.
2. **Enum completeness tests**:
* Assert that `Enum.GetValues(typeof(FrameType))` contains all expected values. This catches accidental changes.
3. **No behavior yet**:
* No routing algorithms or transport behavior tests here; just that model contracts behave like dumb DTOs (e.g. property assignment, default value semantics).
This is mostly to lock in the shape and catch accidental refactors later.
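For instance, the enum completeness check from item 2 could be written as an xUnit test along these lines (a sketch against the `FrameType` values listed earlier):
```csharp
using System;
using StellaOps.Router.Common;
using Xunit;

public class FrameTypeShapeTests
{
    [Fact]
    public void FrameType_contains_all_expected_values()
    {
        var expected = new[]
        {
            "Hello", "Heartbeat", "EndpointsUpdate", "Request",
            "RequestStreamData", "Response", "ResponseStreamData", "Cancel"
        };

        var actual = Enum.GetNames(typeof(FrameType));

        foreach (var name in expected)
        {
            Assert.Contains(name, actual);
        }
    }
}
```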
---
## 7. Cleanliness & review checklist
Before you move on to the in-memory transport and gateway/microservice wiring, check:
1. `StellaOps.Router.Common`:
* Compiles with zero warnings (nullable enabled).
* Only references BCL; no ASP.NET or serializer packages unless intentionally agreed in the spec.
2. All types listed in `specs.md` under the Common section exist and match names & property sets.
3. No behavior/logic:
* No LINQ-heavy methods.
* No routing algorithm code.
* No network code.
* No YAML/JSON or serialization.
4. `StellaOps.Router.Common.Tests` runs and passes.
5. `docs/router/specs.md` is updated if there was any discrepancy (or the code is updated to match the spec, not the other way around).
---
If you want the next step, I can outline “3. Build in-memory transport + minimal HELLO/REQUEST/RESPONSE wiring” in the same style, so agents can move from contracts to a working vertical slice.

docs/router/03-Step.md (new file, 144 lines)
For this step, you're not writing any real logic yet; you're just making sure the projects depend on each other in the right direction so future work doesn't turn into spaghetti.
Think of it as locking in the dependency graph.
---
## 1. Pin the desired dependency graph
First, make explicit what is allowed to depend on what.
Target graph:
* `StellaOps.Router.Common`
* Lowest layer.
* **No** project references to any other StellaOps projects.
* `StellaOps.Router.Config`
* References:
* `StellaOps.Router.Common`.
* `StellaOps.Microservice`
* References:
* `StellaOps.Router.Common`.
* `StellaOps.Microservice.SourceGen`
* For now: no references, or only to Common if needed for types in generated code.
* Later: will be consumed as an analyzer by `StellaOps.Microservice`, not via normal project reference.
* `StellaOps.Gateway.WebService`
* References:
* `StellaOps.Router.Common`
* `StellaOps.Router.Config`.
Test projects:
* `StellaOps.Router.Common.Tests``StellaOps.Router.Common`
* `StellaOps.Gateway.WebService.Tests``StellaOps.Gateway.WebService`
* `StellaOps.Microservice.Tests``StellaOps.Microservice`
Explicitly: there should be **no** circular references, and nothing should reference the Gateway from libraries.
---
## 2. Add the project references
From repo root, for each needed edge:
```bash
# Gateway → Common + Config
dotnet add src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj \
src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj
# Microservice → Common
dotnet add src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
# Config → Common
dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
# Tests → main projects
dotnet add tests/StellaOps.Router.Common.Tests/StellaOps.Router.Common.Tests.csproj reference \
src/__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
dotnet add tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj reference \
src/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj
dotnet add tests/StellaOps.Microservice.Tests/StellaOps.Microservice.Tests.csproj reference \
src/__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
```
Do **not** add any references:
* From `Common` → anything.
* From `Config` → Gateway or Microservice.
* From `Microservice` → Gateway.
* From tests → libraries other than their primary target (unless you explicitly want shared test utils later).
---
## 3. Verify the .csproj contents
Have one agent open each `.csproj` and confirm:
* `StellaOps.Router.Common.csproj`
* No `<ProjectReference>` elements.
* `StellaOps.Router.Config.csproj`
* Exactly one `<ProjectReference>`: Common.
* `StellaOps.Microservice.csproj`
* Exactly one `<ProjectReference>`: Common.
* `StellaOps.Microservice.SourceGen.csproj`
* No project references for now (we'll convert it to a proper analyzer / source-generator package later).
* `StellaOps.Gateway.WebService.csproj`
* Exactly two `<ProjectReference>`s: Common + Config.
* No reference to Microservice.
* Test projects:
* Each test project references only its corresponding main project (no cross-test coupling).
If anything else is present (e.g. leftover references from templates), remove them.
---
## 4. Run a full build & test as a sanity check
From repo root:
```bash
dotnet restore
dotnet build StellaOps.Router.sln -c Debug
dotnet test StellaOps.Router.sln -c Debug
```
Acceptance criteria for this step:
* Solution builds without reference errors.
* All test projects compile and run (even if they only have dummy tests).
* Intellisense / navigation in IDE shows:
* Gateway can see Common & Config types.
* Microservice can see Common types.
* Config can see Common types.
* No library can see Gateway unless through tests.
Once this is stable, your devs can safely move on to implementing the Common model and know they won't have to rewrite references later.

docs/router/04-Step.md (new file, 520 lines)
For this step, the goal is: a microservice that can:
* Start up with `AddStellaMicroservice(...)`
* Discover its endpoints from attributes
* Connect to the router (via InMemory transport)
* Send a HELLO with identity + endpoints
* Receive a REQUEST and return a RESPONSE
No streaming, no cancellation, no heartbeat yet. Pure minimal handshake & dispatch.
---
## 0. Preconditions
Before your agents start this step, you should have:
* `StellaOps.Router.Common` contracts in place (enums, `EndpointDescriptor`, `ConnectionState`, `Frame`, etc.).
* The solution skeleton and project references configured.
* A **stub** InMemory transport “router harness” (at least a place to park the future InMemory transport). Even if it's not fully implemented, assume it will expose:
* A way for a microservice to “connect” and register itself.
* A way to deliver frames from router to microservice and back.
If InMemory isn't built yet, the microservice code should be written *against abstractions* so you can plug it in later.
---
## 1. Define microservice public surface (SDK contract)
**Project:** `__Libraries/StellaOps.Microservice`
**Owner:** microservice SDK agent
Purpose: give product teams a stable way to define services and endpoints without caring about transports.
### 1.1 Options
Make sure `StellaMicroserviceOptions` matches the spec:
```csharp
public sealed class StellaMicroserviceOptions
{
public string ServiceName { get; set; } = string.Empty;
public string Version { get; set; } = string.Empty;
public string Region { get; set; } = string.Empty;
public string InstanceId { get; set; } = string.Empty;
public IList<RouterEndpointConfig> Routers { get; set; } = new List<RouterEndpointConfig>();
public string? ConfigFilePath { get; set; }
}
public sealed class RouterEndpointConfig
{
public string Host { get; set; } = string.Empty;
public int Port { get; set; }
public TransportType TransportType { get; set; }
}
```
`Routers` is mandatory: without at least one router configured, the SDK should refuse to start later (that policy can be enforced in the handshake stage).
### 1.2 Public endpoint abstractions
Define:
* Attribute for endpoint identity:
```csharp
[AttributeUsage(AttributeTargets.Class, AllowMultiple = true)]
public sealed class StellaEndpointAttribute : Attribute
{
public string Method { get; }
public string Path { get; }
public StellaEndpointAttribute(string method, string path)
{
Method = method;
Path = path;
}
}
```
* Raw handler:
```csharp
public sealed class RawRequestContext
{
public string Method { get; init; } = string.Empty;
public string Path { get; init; } = string.Empty;
public IReadOnlyDictionary<string,string> Headers { get; init; } =
new Dictionary<string,string>();
public Stream Body { get; init; } = Stream.Null;
public CancellationToken CancellationToken { get; init; }
}
public sealed class RawResponse
{
public int StatusCode { get; set; } = 200;
public IDictionary<string,string> Headers { get; } =
new Dictionary<string,string>();
public Func<Stream,Task>? WriteBodyAsync { get; set; } // may be null
}
public interface IRawStellaEndpoint
{
Task<RawResponse> HandleAsync(RawRequestContext ctx);
}
```
* Typed convenience interfaces (used later, but define now):
```csharp
public interface IStellaEndpoint<TRequest,TResponse>
{
Task<TResponse> HandleAsync(TRequest request, CancellationToken ct);
}
public interface IStellaEndpoint<TResponse>
{
Task<TResponse> HandleAsync(CancellationToken ct);
}
```
At this step, you don't need to implement adapters yet, but the signatures must be fixed.
### 1.3 Registration extension
Extend `AddStellaMicroservice` to wire options + a few internal services:
```csharp
public static class ServiceCollectionExtensions
{
public static IServiceCollection AddStellaMicroservice(
this IServiceCollection services,
Action<StellaMicroserviceOptions> configure)
{
services.Configure(configure);
services.AddSingleton<IEndpointCatalog, EndpointCatalog>(); // to be implemented
services.AddSingleton<IEndpointDispatcher, EndpointDispatcher>(); // to be implemented
services.AddHostedService<MicroserviceBootstrapHostedService>(); // handshake loop
return services;
}
}
```
This still compiles with empty implementations; you fill them in next steps.
---
## 2. Endpoint discovery (reflection only for now)
**Project:** `StellaOps.Microservice`
**Owner:** SDK agent
Goal: given the entry assembly, build:
* A list of `EndpointDescriptor` objects (from Common).
* A mapping `(Method, Path) -> handler type` used for dispatch.
### 2.1 Internal types
Define an internal representation:
```csharp
internal sealed class EndpointRegistration
{
public EndpointDescriptor Descriptor { get; init; } = default!;
public Type HandlerType { get; init; } = default!;
}
```
Define an interface for discovery:
```csharp
internal interface IEndpointDiscovery
{
IReadOnlyList<EndpointRegistration> DiscoverEndpoints(StellaMicroserviceOptions options);
}
```
### 2.2 Implement reflection-based discovery
Create `ReflectionEndpointDiscovery`:
* Scan the entry assembly (and optionally referenced assemblies) for classes that:
* Have `StellaEndpointAttribute`.
* Implement either:
* `IRawStellaEndpoint`, or
* `IStellaEndpoint<,>`, or
* `IStellaEndpoint<>`.
* For each `[StellaEndpoint]` usage:
* Create `EndpointDescriptor` with:
* `ServiceName` = `options.ServiceName`.
* `Version` = `options.Version`.
* `Method`, `Path` from attribute.
* `DefaultTimeout` = some sensible default (e.g. `TimeSpan.FromSeconds(30)`; refine later).
* `SupportsStreaming` = `false` (for now).
* `RequiringClaims` = empty array (for now).
* Create `EndpointRegistration` with `Descriptor` + `HandlerType`.
* Return the list.
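A minimal sketch of that discovery (assuming `EndpointDescriptor` and `EndpointRegistration` use init-only properties as defined earlier; production code would also validate duplicate routes and handler constructors):
```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using StellaOps.Router.Common;

internal sealed class ReflectionEndpointDiscovery : IEndpointDiscovery
{
    public IReadOnlyList<EndpointRegistration> DiscoverEndpoints(StellaMicroserviceOptions options)
    {
        var assembly = Assembly.GetEntryAssembly() ?? Assembly.GetExecutingAssembly();
        var registrations = new List<EndpointRegistration>();

        foreach (var type in assembly.GetTypes().Where(t => t.IsClass && !t.IsAbstract && IsHandler(t)))
        {
            // One registration per [StellaEndpoint] usage (the attribute allows multiple).
            foreach (var attr in type.GetCustomAttributes<StellaEndpointAttribute>())
            {
                registrations.Add(new EndpointRegistration
                {
                    Descriptor = new EndpointDescriptor
                    {
                        ServiceName = options.ServiceName,
                        Version = options.Version,
                        Method = attr.Method,
                        Path = attr.Path,
                        DefaultTimeout = TimeSpan.FromSeconds(30),
                        SupportsStreaming = false,
                        RequiringClaims = Array.Empty<ClaimRequirement>()
                    },
                    HandlerType = type
                });
            }
        }

        return registrations;

        static bool IsHandler(Type t) =>
            typeof(IRawStellaEndpoint).IsAssignableFrom(t) ||
            t.GetInterfaces().Any(i => i.IsGenericType &&
                (i.GetGenericTypeDefinition() == typeof(IStellaEndpoint<,>) ||
                 i.GetGenericTypeDefinition() == typeof(IStellaEndpoint<>)));
    }
}
```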
Wire it into DI:
```csharp
services.AddSingleton<IEndpointDiscovery, ReflectionEndpointDiscovery>();
```
---
## 3. Endpoint catalog & dispatcher (microservice internal)
**Project:** `StellaOps.Microservice`
**Owner:** SDK agent
Goal: presence of:
* A catalog holding endpoints and descriptors.
* A dispatcher that takes frames and calls handlers.
### 3.1 Endpoint catalog
Define:
```csharp
internal interface IEndpointCatalog
{
IReadOnlyList<EndpointDescriptor> Descriptors { get; }
bool TryGetHandler(string method, string path, out EndpointRegistration endpoint);
}
internal sealed class EndpointCatalog : IEndpointCatalog
{
private readonly Dictionary<(string Method, string Path), EndpointRegistration> _map;
public IReadOnlyList<EndpointDescriptor> Descriptors { get; }
public EndpointCatalog(IEndpointDiscovery discovery,
IOptions<StellaMicroserviceOptions> optionsAccessor)
{
var options = optionsAccessor.Value;
var registrations = discovery.DiscoverEndpoints(options);
        // Note: StringComparer cannot serve as an IEqualityComparer for tuple keys,
        // so normalize the method casing inside the key instead.
        _map = registrations.ToDictionary(
            r => (r.Descriptor.Method.ToUpperInvariant(), r.Descriptor.Path),
            r => r);
        Descriptors = registrations.Select(r => r.Descriptor).ToArray();
    }
    public bool TryGetHandler(string method, string path, out EndpointRegistration endpoint) =>
        _map.TryGetValue((method.ToUpperInvariant(), path), out endpoint!);
}
```
You can refine path normalization later; for now, keep it simple.
### 3.2 Endpoint dispatcher
Define:
```csharp
internal interface IEndpointDispatcher
{
Task<Frame> HandleRequestAsync(Frame requestFrame, CancellationToken ct);
}
```
Implement `EndpointDispatcher` with minimal behavior:
1. Decode `requestFrame.Payload` into a small DTO carrying:
* Method
* Path
* Headers (if you already have a format; if not, assume no headers in v0)
* Body bytes
For this step, you can stub decoding as:
* Payload = raw body bytes.
* Method/Path are carried separately in frame header or in a simple DTO; decide a minimal interim format and write it down.
2. Use `IEndpointCatalog.TryGetHandler(method, path, ...)`:
* If not found:
* Build a `RawResponse` with status 404 and empty body.
3. If handler implements `IRawStellaEndpoint`:
* Instantiate via DI (`IServiceProvider.GetRequiredService(handlerType)`).
* Build `RawRequestContext` with:
* Method, Path, Headers, Body (`new MemoryStream(bodyBytes)` for now).
* `CancellationToken` = `ct`.
* Call `HandleAsync`.
* Convert `RawResponse` into a response frame payload.
4. If handler implements `IStellaEndpoint<,>` (typed):
* For now, **you can skip typed handling** or wire a very simple JSON-based adapter if you want to unlock it early. The focus in this step is the raw path; typed adapters can come in the next iteration.
Return a `Frame` with:
* `Type = FrameType.Response`
* `CorrelationId` = `requestFrame.CorrelationId`
* `Payload` = encoded response (status + body bytes).
No streaming, no cancellation logic beyond passing `ct` through — router won't cancel yet.
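One possible interim payload shape for this step (purely illustrative; replace it once the real frame encoding is specified):
```csharp
using System.Collections.Generic;

// Interim wire DTOs, serialized (e.g. as JSON) into Frame.Payload for this step only.
internal sealed record MinimalRequestPayload(
    string Method,
    string Path,
    Dictionary<string, string> Headers,
    byte[] Body);

internal sealed record MinimalResponsePayload(
    int StatusCode,
    Dictionary<string, string> Headers,
    byte[] Body);
```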
---
## 4. Minimal handshake hosted service (using InMemory)
**Project:** `StellaOps.Microservice`
**Owner:** SDK agent
This is where the microservice actually “talks” to the router.
### 4.1 Define a microservice connection abstraction
Your SDK should not depend directly on InMemory; define an internal abstraction:
```csharp
internal interface IMicroserviceConnection
{
Task StartAsync(CancellationToken ct);
Task StopAsync(CancellationToken ct);
}
```
The implementation for this step will target the InMemory transport; later you can add TCP/TLS/RabbitMQ versions.
### 4.2 Implement InMemory microservice connection
Assuming you have or will have an `IInMemoryRouter` (or similar) dev harness, implement:
```csharp
internal sealed class InMemoryMicroserviceConnection : IMicroserviceConnection
{
private readonly IEndpointCatalog _catalog;
private readonly IEndpointDispatcher _dispatcher;
private readonly IOptions<StellaMicroserviceOptions> _options;
private readonly IInMemoryRouterClient _routerClient; // dev-only abstraction
public InMemoryMicroserviceConnection(
IEndpointCatalog catalog,
IEndpointDispatcher dispatcher,
IOptions<StellaMicroserviceOptions> options,
IInMemoryRouterClient routerClient)
{
_catalog = catalog;
_dispatcher = dispatcher;
_options = options;
_routerClient = routerClient;
}
public async Task StartAsync(CancellationToken ct)
{
var opts = _options.Value;
// Build HELLO payload from options + catalog.Descriptors
var helloPayload = BuildHelloPayload(opts, _catalog.Descriptors);
await _routerClient.ConnectAsync(opts, ct);
await _routerClient.SendHelloAsync(helloPayload, ct);
// Start background receive loop
_ = Task.Run(() => ReceiveLoopAsync(ct), ct);
}
public Task StopAsync(CancellationToken ct)
{
// For now: ask routerClient to disconnect; finer handling later
return _routerClient.DisconnectAsync(ct);
}
private async Task ReceiveLoopAsync(CancellationToken ct)
{
await foreach (var frame in _routerClient.GetIncomingFramesAsync(ct))
{
if (frame.Type == FrameType.Request)
{
var response = await _dispatcher.HandleRequestAsync(frame, ct);
await _routerClient.SendFrameAsync(response, ct);
}
else
{
// Ignore other frame types in this minimal step
}
}
}
}
```
`IInMemoryRouterClient` is whatever dev harness you build for the in-memory transport; the exact shape is not important for this step's planning, only that it provides:
* `ConnectAsync`
* `SendHelloAsync`
* `GetIncomingFramesAsync` (async stream of frames)
* `SendFrameAsync` for responses
* `DisconnectAsync`
### 4.3 Hosted service to bootstrap the connection
Implement `MicroserviceBootstrapHostedService`:
```csharp
internal sealed class MicroserviceBootstrapHostedService : IHostedService
{
private readonly IMicroserviceConnection _connection;
public MicroserviceBootstrapHostedService(IMicroserviceConnection connection)
{
_connection = connection;
}
public Task StartAsync(CancellationToken cancellationToken) =>
_connection.StartAsync(cancellationToken);
public Task StopAsync(CancellationToken cancellationToken) =>
_connection.StopAsync(cancellationToken);
}
```
Wire `IMicroserviceConnection` to `InMemoryMicroserviceConnection` in DI for now:
```csharp
services.AddSingleton<IMicroserviceConnection, InMemoryMicroserviceConnection>();
```
In a later phase, you'll swap this to transport-specific connectors.
---
## 5. End-to-end smoke test (InMemory only)
**Project:** `StellaOps.Microservice.Tests` + a minimal InMemory router test harness
**Owner:** test agent
Goal: prove that minimal handshake & dispatch works in memory.
1. Build a trivial test microservice:
* Define a handler:
```csharp
[StellaEndpoint("GET", "/ping")]
public sealed class PingEndpoint : IRawStellaEndpoint
{
public Task<RawResponse> HandleAsync(RawRequestContext ctx)
{
var resp = new RawResponse { StatusCode = 200 };
resp.Headers["Content-Type"] = "text/plain";
        resp.WriteBodyAsync = async stream =>
        {
            // Stream.WriteAsync(byte[]) returns ValueTask, so use an async lambda
            // to match Func<Stream, Task>.
            var payload = Encoding.UTF8.GetBytes("pong");
            await stream.WriteAsync(payload, 0, payload.Length);
        };
return Task.FromResult(resp);
}
}
```
2. Test harness:
* Spin up:
* An instance of the microservice host (generic HostBuilder).
* An in-memory “router” that:
* Accepts HELLO from the microservice.
* Sends a single REQUEST frame for `GET /ping`.
* Receives the RESPONSE frame.
3. Assert:
* The HELLO includes the `/ping` endpoint.
* The REQUEST is dispatched to `PingEndpoint`.
* The RESPONSE has status 200 and body “pong”.
This verifies that:
* `AddStellaMicroservice` wires discovery, catalog, dispatcher, bootstrap.
* The microservice sends HELLO on connect.
* The microservice can handle at least one request via InMemory.
---
## 6. Done criteria for “minimal handshake & dispatch”
You can consider this step complete when:
* `StellaOps.Microservice` exposes:
* Options.
* Attribute & handler interfaces (raw + typed).
* `AddStellaMicroservice` registering discovery, catalog, dispatcher, and hosted service.
* The microservice can:
* Discover endpoints via reflection.
* Build a `HELLO` payload and send it over InMemory on startup.
* Receive a `REQUEST` frame over InMemory.
* Dispatch that request to the correct handler.
* Return a `RESPONSE` frame.
Not yet required in this step:
* Streaming bodies.
* Heartbeats or health evaluation.
* Cancellation via CANCEL frames.
* Authority overrides for requiringClaims.
Those come in subsequent phases; right now you just want a working minimal vertical slice: InMemory microservice that says “HELLO” and responds to one simple request.

docs/router/05-Step.md (new file, 554 lines)
For this step, the goal is: the gateway can accept an HTTP request, route it to **one** microservice over the **InMemory** transport, get a response, and return it to the client.
No health/heartbeat yet. No streaming yet. Just: HTTP → InMemory → microservice → InMemory → HTTP.
I'll assume you're still in the InMemory world and not touching TCP/UDP/RabbitMQ at this stage.
---
## 0. Preconditions
Before you start:
* `StellaOps.Router.Common` exists and exposes:
* `EndpointDescriptor`, `ConnectionState`, `Frame`, `FrameType`, `TransportType`, `RoutingDecision`.
* Interfaces: `IGlobalRoutingState`, `IRoutingPlugin`, `ITransportClient`.
* `StellaOps.Microservice` minimal handshake & dispatch is in place (from your “step 4”):
* Microservice can:
* Discover endpoints.
* Connect to an InMemory router client.
* Send HELLO.
* Receive REQUEST and send RESPONSE.
* Gateway project exists (`StellaOps.Gateway.WebService`) and runs as a basic ASP.NET Core app.
If anything in that list is not true, fix it first or adjust the plan accordingly.
---
## 1. Implement an InMemory transport “hub”
You need a simple in-process component that:
* Keeps track of “connections” from microservices.
* Delivers frames from the gateway to the correct microservice and back.
You can host this either:
* In a dedicated **test/support** assembly, or
* In the gateway project but marked as “dev-only” transport.
For this step, keep it simple and in-memory.
### 1.1 Define an InMemory router hub
Conceptually:
```csharp
public interface IInMemoryRouterHub
{
// Called by microservice side to register a new connection
Task<string> RegisterMicroserviceAsync(
InstanceDescriptor instance,
IReadOnlyList<EndpointDescriptor> endpoints,
Func<Frame, Task> onFrameFromGateway,
CancellationToken ct);
// Called by microservice when it wants to send a frame to the gateway
Task SendFromMicroserviceAsync(string connectionId, Frame frame, CancellationToken ct);
// Called by gateway transport client when sending a frame to a microservice
Task<Frame> SendFromGatewayAsync(string connectionId, Frame frame, CancellationToken ct);
}
```
Internally, the hub maintains per-connection data:
* `ConnectionId`
* `InstanceDescriptor`
* Endpoints
* Delegate `onFrameFromGateway` (microservice receiver)
For minimal routing you can start by:
* Only supporting `SendFromGatewayAsync` for REQUEST and returning RESPONSE.
* For now, heartbeat frames can be ignored or stubbed.
### 1.2 Connect the microservice side
Your `InMemoryMicroserviceConnection` (from step 4) should:
* Call `RegisterMicroserviceAsync` on the hub when it sends HELLO:
* Get `connectionId`.
* Provide a handler `onFrameFromGateway` that:
* Dispatches REQUEST frames via `IEndpointDispatcher`.
* Sends RESPONSE frames back via `SendFromMicroserviceAsync`.
This is mostly microservice work; you should already have most of it outlined.
---
## 2. Implement an InMemory `ITransportClient` in the gateway
Now focus on the gateway side.
**Project:** `StellaOps.Gateway.WebService` (or a small internal infra class in the same project)
### 2.1 `InMemoryTransportClient`
Implement `ITransportClient` using the `IInMemoryRouterHub`:
```csharp
public sealed class InMemoryTransportClient : ITransportClient
{
private readonly IInMemoryRouterHub _hub;
public InMemoryTransportClient(IInMemoryRouterHub hub)
{
_hub = hub;
}
public Task<Frame> SendRequestAsync(
ConnectionState connection,
Frame requestFrame,
TimeSpan timeout,
CancellationToken ct)
{
// connection.ConnectionId must be set when HELLO is processed
return _hub.SendFromGatewayAsync(connection.ConnectionId, requestFrame, ct);
}
public Task SendCancelAsync(ConnectionState connection, Guid correlationId, string? reason = null)
=> Task.CompletedTask; // no-op at this stage
public Task SendStreamingAsync(
ConnectionState connection,
Frame requestHeader,
Stream requestBody,
Func<Stream, Task> readResponseBody,
PayloadLimits limits,
CancellationToken ct)
=> throw new NotSupportedException("Streaming not implemented for InMemory in this step.");
}
```
For now:
* Ignore streaming.
* Ignore cancel.
* Just call `SendFromGatewayAsync` and get a response frame.
### 2.2 Register it in DI
In gateway `Program.cs` or a DI setup:
```csharp
services.AddSingleton<IInMemoryRouterHub, InMemoryRouterHub>(); // your hub implementation
services.AddSingleton<ITransportClient, InMemoryTransportClient>();
```
You'll later swap this with real transport clients (TCP, UDP, Rabbit), but for now everything uses InMemory.
---
## 3. Implement minimal `IGlobalRoutingState`
You now need the gateways internal view of:
* Which endpoints exist.
* Which connections serve them.
**Project:** `StellaOps.Gateway.WebService` or a small internal infra namespace.
### 3.1 In-memory implementation
Implement an `InMemoryGlobalRoutingState` something like:
```csharp
public sealed class InMemoryGlobalRoutingState : IGlobalRoutingState
{
private readonly object _lock = new();
private readonly Dictionary<(string, string), EndpointDescriptor> _endpoints = new();
private readonly List<ConnectionState> _connections = new();
public EndpointDescriptor? ResolveEndpoint(string method, string path)
{
lock (_lock)
{
_endpoints.TryGetValue((method, path), out var endpoint);
return endpoint;
}
}
public IReadOnlyList<ConnectionState> GetConnectionsFor(
string serviceName,
string version,
string method,
string path)
{
lock (_lock)
{
return _connections
.Where(c =>
c.Instance.ServiceName == serviceName &&
c.Instance.Version == version &&
c.Endpoints.ContainsKey((method, path)))
.ToList();
}
}
// Called when HELLO arrives from microservice
public void RegisterConnection(ConnectionState connection)
{
lock (_lock)
{
_connections.Add(connection);
foreach (var kvp in connection.Endpoints)
{
var key = kvp.Key; // (Method, Path)
var descriptor = kvp.Value;
// global endpoint map: any connection's descriptor is ok as "canonical"
_endpoints[(key.Method, key.Path)] = descriptor;
}
}
}
}
```
You will refine this later; for minimal routing it's enough.
### 3.2 Hook HELLO to `IGlobalRoutingState`
In your InMemory router hub, when a microservice registers (HELLO):
* Create a `ConnectionState`:
```csharp
var conn = new ConnectionState
{
ConnectionId = generatedConnectionId,
Instance = instanceDescriptor,
Status = InstanceHealthStatus.Healthy,
LastHeartbeatUtc = DateTime.UtcNow,
AveragePingMs = 0,
TransportType = TransportType.Udp, // or TransportType.Tcp logically for InMemory
Endpoints = endpointDescriptors.ToDictionary(
e => (e.Method, e.Path),
e => e)
};
```
* Call `InMemoryGlobalRoutingState.RegisterConnection(conn)`.
This gives the gateway a routing view as soon as HELLO is processed.
---
## 4. Implement HTTP pipeline middlewares for routing
Now, wire the gateway HTTP pipeline so that an incoming HTTP request is:
1. Resolved to a logical endpoint.
2. Routed to one connection.
3. Dispatched via InMemory transport.
### 4.1 EndpointResolutionMiddleware
This maps `(Method, Path)` to an `EndpointDescriptor`.
Create a middleware:
```csharp
public sealed class EndpointResolutionMiddleware
{
private readonly RequestDelegate _next;
public EndpointResolutionMiddleware(RequestDelegate next) => _next = next;
public async Task Invoke(HttpContext context, IGlobalRoutingState routingState)
{
var method = context.Request.Method;
var path = context.Request.Path.ToString();
var endpoint = routingState.ResolveEndpoint(method, path);
if (endpoint is null)
{
context.Response.StatusCode = StatusCodes.Status404NotFound;
await context.Response.WriteAsync("Endpoint not found");
return;
}
context.Items["Stella.EndpointDescriptor"] = endpoint;
await _next(context);
}
}
```
Register it in the pipeline:
```csharp
app.UseMiddleware<EndpointResolutionMiddleware>();
```
Before or after auth depending on your final pipeline; for minimal routing, order is not critical.
### 4.2 Minimal routing plugin (pick first connection)
Implement a very naive `IRoutingPlugin` just to get things moving:
```csharp
public sealed class NaiveRoutingPlugin : IRoutingPlugin
{
private readonly IGlobalRoutingState _state;
public NaiveRoutingPlugin(IGlobalRoutingState state) => _state = state;
public Task<RoutingDecision?> ChooseInstanceAsync(
RoutingContext context,
CancellationToken cancellationToken)
{
var endpoint = context.Endpoint;
var connections = _state.GetConnectionsFor(
endpoint.ServiceName,
endpoint.Version,
endpoint.Method,
endpoint.Path);
var chosen = connections.FirstOrDefault();
if (chosen is null)
return Task.FromResult<RoutingDecision?>(null);
var decision = new RoutingDecision
{
Endpoint = endpoint,
Connection = chosen,
TransportType = chosen.TransportType,
EffectiveTimeout = endpoint.DefaultTimeout
};
return Task.FromResult<RoutingDecision?>(decision);
}
}
```
Register it:
```csharp
services.AddSingleton<IGlobalRoutingState, InMemoryGlobalRoutingState>();
services.AddSingleton<IRoutingPlugin, NaiveRoutingPlugin>();
```
### 4.3 RoutingDecisionMiddleware
This middleware grabs the endpoint descriptor and asks the routing plugin for a connection.
```csharp
public sealed class RoutingDecisionMiddleware
{
private readonly RequestDelegate _next;
public RoutingDecisionMiddleware(RequestDelegate next) => _next = next;
public async Task Invoke(HttpContext context, IRoutingPlugin routingPlugin)
{
var endpoint = (EndpointDescriptor?)context.Items["Stella.EndpointDescriptor"];
if (endpoint is null)
{
context.Response.StatusCode = 500;
await context.Response.WriteAsync("Endpoint metadata missing");
return;
}
var routingContext = new RoutingContext
{
Endpoint = endpoint,
GatewayRegion = "not_used_yet", // you'll fill this from GatewayNodeConfig later
HttpContext = context
};
var decision = await routingPlugin.ChooseInstanceAsync(routingContext, context.RequestAborted);
if (decision is null)
{
context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
await context.Response.WriteAsync("No instances available");
return;
}
context.Items["Stella.RoutingDecision"] = decision;
await _next(context);
}
}
```
Register it after `EndpointResolutionMiddleware`:
```csharp
app.UseMiddleware<RoutingDecisionMiddleware>();
```
### 4.4 TransportDispatchMiddleware
This middleware:
* Builds a REQUEST frame from HTTP.
* Uses `ITransportClient` to send it to the chosen connection.
* Writes the RESPONSE frame back to HTTP.
Minimal version (buffered, no streaming):
```csharp
public sealed class TransportDispatchMiddleware
{
private readonly RequestDelegate _next;
public TransportDispatchMiddleware(RequestDelegate next) => _next = next;
public async Task Invoke(
HttpContext context,
ITransportClient transportClient)
{
var decision = (RoutingDecision?)context.Items["Stella.RoutingDecision"];
if (decision is null)
{
context.Response.StatusCode = 500;
await context.Response.WriteAsync("Routing decision missing");
return;
}
// Read request body into memory (safe for minimal tests)
byte[] bodyBytes;
using (var ms = new MemoryStream())
{
await context.Request.Body.CopyToAsync(ms);
bodyBytes = ms.ToArray();
}
var requestPayload = new MinimalRequestPayload
{
Method = context.Request.Method,
Path = context.Request.Path.ToString(),
Body = bodyBytes
// headers can be ignored or added later
};
var requestFrame = new Frame
{
Type = FrameType.Request,
CorrelationId = Guid.NewGuid(),
Payload = SerializeRequestPayload(requestPayload)
};
var timeout = decision.EffectiveTimeout;
using var cts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
cts.CancelAfter(timeout);
Frame responseFrame;
try
{
responseFrame = await transportClient.SendRequestAsync(
decision.Connection,
requestFrame,
timeout,
cts.Token);
}
catch (OperationCanceledException)
{
context.Response.StatusCode = StatusCodes.Status504GatewayTimeout;
await context.Response.WriteAsync("Upstream timeout");
return;
}
var responsePayload = DeserializeResponsePayload(responseFrame.Payload);
context.Response.StatusCode = responsePayload.StatusCode;
foreach (var (k, v) in responsePayload.Headers)
{
context.Response.Headers[k] = v;
}
if (responsePayload.Body is { Length: > 0 })
{
await context.Response.Body.WriteAsync(responsePayload.Body);
}
}
}
```
You'll need minimal DTOs and serializers (`MinimalRequestPayload`, `MinimalResponsePayload`) just to move bytes. You can use JSON for now; protocol details will be formalized later.
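A minimal sketch of those DTOs and helpers, assuming JSON on the wire and string-valued headers (the names mirror the middleware above but are not final protocol types):
```csharp
public sealed class MinimalRequestPayload
{
    public string Method { get; init; } = string.Empty;
    public string Path { get; init; } = string.Empty;
    public Dictionary<string, string> Headers { get; init; } = new();
    public byte[] Body { get; init; } = Array.Empty<byte>();
}

public sealed class MinimalResponsePayload
{
    public int StatusCode { get; init; } = 200;
    public Dictionary<string, string> Headers { get; init; } = new();
    public byte[] Body { get; init; } = Array.Empty<byte>();
}

// The Serialize/Deserialize helpers called by the middleware can be thin JSON wrappers for now.
internal static class MinimalPayloadCodec
{
    public static byte[] SerializeRequestPayload(MinimalRequestPayload payload) =>
        System.Text.Json.JsonSerializer.SerializeToUtf8Bytes(payload);

    public static MinimalResponsePayload DeserializeResponsePayload(byte[] bytes) =>
        System.Text.Json.JsonSerializer.Deserialize<MinimalResponsePayload>(bytes)!;
}
```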
Register it after `RoutingDecisionMiddleware`:
```csharp
app.UseMiddleware<TransportDispatchMiddleware>();
```
At this point, you no longer need ASP.NET controllers for microservice endpoints; you can have a catch-all pipeline.
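Putting the pieces together, a minimal gateway `Program.cs` could wire the pipeline like this (a sketch; the shared InMemory hub and `ITransportClient` registrations depend on how you expose them):
```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<IGlobalRoutingState, InMemoryGlobalRoutingState>();
builder.Services.AddSingleton<IRoutingPlugin, NaiveRoutingPlugin>();
// plus: the shared IInMemoryRouterHub and the InMemory ITransportClient

var app = builder.Build();

// Catch-all pipeline: no controllers, every request flows through the three middlewares.
app.UseMiddleware<EndpointResolutionMiddleware>();
app.UseMiddleware<RoutingDecisionMiddleware>();
app.UseMiddleware<TransportDispatchMiddleware>();

app.Run();
```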
---
## 5. Minimal end-to-end test
**Owner:** test agent, probably in `StellaOps.Gateway.WebService.Tests` (plus a simple host for microservice in tests)
Scenario:
1. Start an in-memory microservice host:
* It uses `AddStellaMicroservice`.
* It attaches to the same `IInMemoryRouterHub` instance as the gateway (created inside the test).
* It has a single endpoint:
* `[StellaEndpoint("GET", "/ping")]`
* Handler returns “pong”.
2. Start the gateway host:
* Inject the same `IInMemoryRouterHub`.
* Use middlewares: `EndpointResolutionMiddleware`, `RoutingDecisionMiddleware`, `TransportDispatchMiddleware`.
3. Invoke HTTP `GET /ping` against the gateway (using `WebApplicationFactory` or `TestServer`).
Assert:
* HTTP status 200.
* Body “pong”.
* The router hub saw:
* At least one HELLO frame.
* One REQUEST frame.
* One RESPONSE frame.
This proves:
* HELLO → gateway routing state population.
* Endpoint resolution → connection selection.
* InMemory transport client used.
* Minimal dispatch works.
---
## 6. Done criteria for “Gateway: minimal routing using InMemory plugin”
You're done with this step when:
* A microservice can register with the gateway via InMemory.
* The gateways `IGlobalRoutingState` knows about endpoints and connections.
* The HTTP pipeline:
* Resolves an endpoint based on `(Method, Path)`.
* Asks `IRoutingPlugin` for a connection.
* Uses `ITransportClient` (InMemory) to send REQUEST and get RESPONSE.
* Returns the mapped HTTP response to the client.
* You have at least one automated test showing:
* `GET /ping` through gateway → InMemory → microservice → back to HTTP.
After this, you're ready to:
* Swap `NaiveRoutingPlugin` with the health/region-sensitive plugin you defined.
* Implement heartbeat and latency.
* Later replace InMemory with TCP/UDP/Rabbit without changing the HTTP pipeline.

docs/router/06-Step.md Normal file
@@ -0,0 +1,541 @@
For this step, you're layering **liveness** and **basic routing intelligence** on top of the minimal handshake/dispatch you already designed.
Target outcome:
* Microservices send **heartbeats** over the existing connection.
* The router tracks **LastHeartbeatUtc**, **health status**, and **AveragePingMs** per connection.
* The router's `IRoutingPlugin` uses **region + health + latency** to pick an instance.
No need to handle cancellation or streaming yet; just make routing decisions *not* naive.
---
## 0. Preconditions
Before starting, confirm:
* `StellaOps.Router.Common` already has:
* `InstanceHealthStatus` enum.
* `ConnectionState` with at least `Instance`, `Status`, `LastHeartbeatUtc`, `AveragePingMs`, `TransportType`.
* Minimal handshake is working:
* Microservice sends HELLO (instance + endpoints).
* Router creates `ConnectionState` & populates global routing view.
* Router can send REQUEST and receive RESPONSE via InMemory transport.
If any of that is incomplete, shore it up first.
---
## 1. Extend Common with heartbeat payloads
**Project:** `StellaOps.Router.Common`
**Owner:** Common dev
Add DTOs for heartbeat frames.
### 1.1 Heartbeat payload
```csharp
public sealed class HeartbeatPayload
{
public string InstanceId { get; init; } = string.Empty;
public InstanceHealthStatus Status { get; init; } = InstanceHealthStatus.Healthy;
// Optional basic metrics
public int InFlightRequests { get; init; }
public double ErrorRate { get; init; } // 0–1 range, optional
}
```
* This is application-level health; `Status` lets the microservice say “Degraded” / “Draining”.
* In-flight + error rate can be used later for smarter routing; initially, you can ignore them.
### 1.2 Wire into frame model
Ensure:
* `FrameType` includes `Heartbeat`:
```csharp
public enum FrameType : byte
{
Hello = 1,
Heartbeat = 2,
EndpointsUpdate = 3,
Request = 4,
RequestStreamData = 5,
Response = 6,
ResponseStreamData = 7,
Cancel = 8
}
```
* No behavior in Common; only DTOs and enums.
---
## 2. Microservice SDK: send heartbeats on the same connection
**Project:** `StellaOps.Microservice`
**Owner:** SDK dev
You already have `MicroserviceConnectionHostedService` doing HELLO and request dispatch. Now add heartbeat sending.
### 2.1 Introduce heartbeat options
Extend `StellaMicroserviceOptions` with simple settings:
```csharp
public sealed class StellaMicroserviceOptions
{
// existing fields...
public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(10);
public TimeSpan HeartbeatTimeout { get; set; } = TimeSpan.FromSeconds(30); // used by router, not here
}
```
### 2.2 Internal heartbeat sender
Create an internal interface and implementation:
```csharp
internal interface IHeartbeatSource
{
InstanceHealthStatus GetCurrentStatus();
int GetInFlightRequests();
double GetErrorRate();
}
```
For now you can implement a trivial `DefaultHeartbeatSource`:
* `GetCurrentStatus()` → `Healthy`.
* `GetInFlightRequests()` → 0.
* `GetErrorRate()` → 0.
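A trivial implementation matching those defaults could look like this (a sketch; swap in real metrics later):
```csharp
internal sealed class DefaultHeartbeatSource : IHeartbeatSource
{
    // Always reports a healthy, idle instance until real metrics are wired in.
    public InstanceHealthStatus GetCurrentStatus() => InstanceHealthStatus.Healthy;

    public int GetInFlightRequests() => 0;

    public double GetErrorRate() => 0d;
}
```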
Wire this in DI:
```csharp
services.AddSingleton<IHeartbeatSource, DefaultHeartbeatSource>();
```
### 2.3 Add heartbeat loop to MicroserviceConnectionHostedService
In `StartAsync` of `MicroserviceConnectionHostedService`:
* After sending HELLO and subscribing to requests, start a background heartbeat loop.
Pseudo-plan:
```csharp
private Task? _heartbeatLoop;
public async Task StartAsync(CancellationToken ct)
{
// existing HELLO logic...
await _connection.SendHelloAsync(payload, ct);
_connection.OnRequest(frame => HandleRequestAsync(frame, ct));
_heartbeatLoop = Task.Run(() => HeartbeatLoopAsync(ct), ct);
}
private async Task HeartbeatLoopAsync(CancellationToken outerCt)
{
var opt = _options.Value;
var interval = opt.HeartbeatInterval;
var instanceId = opt.InstanceId;
while (!outerCt.IsCancellationRequested)
{
var payload = new HeartbeatPayload
{
InstanceId = instanceId,
Status = _heartbeatSource.GetCurrentStatus(),
InFlightRequests = _heartbeatSource.GetInFlightRequests(),
ErrorRate = _heartbeatSource.GetErrorRate()
};
var frame = new Frame
{
Type = FrameType.Heartbeat,
CorrelationId = Guid.Empty, // or a reserved value
Payload = SerializeHeartbeatPayload(payload)
};
await _connection.SendHeartbeatAsync(frame, outerCt);
try
{
await Task.Delay(interval, outerCt);
}
catch (TaskCanceledException)
{
break;
}
}
}
```
You'll need to extend `IMicroserviceConnection` with:
```csharp
Task SendHeartbeatAsync(Frame frame, CancellationToken ct);
```
In this step the behavior is simple: every N seconds, push a heartbeat.
---
## 3. Router: accept heartbeats and update connection health
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
You already have an InMemory router or similar structure that:
* Handles HELLO frames, creates `ConnectionState`.
* Maintains a global `IGlobalRoutingState`.
Now you need to:
* Handle HEARTBEAT frames.
* Update `ConnectionState.Status` and `LastHeartbeatUtc`.
### 3.1 Frame dispatch on router side
In your router's InMemory server loop (or equivalent), add a case for `FrameType.Heartbeat`:
* Deserialize `HeartbeatPayload` from `frame.Payload`.
* Find the corresponding `ConnectionState` by `InstanceId` (and/or connection ID).
* Update:
* `LastHeartbeatUtc` = `DateTime.UtcNow`.
* `Status` = `payload.Status`.
You can add a method to your routing-state implementation (this assumes connections are keyed by connection id, which you will want anyway for heartbeat and latency updates):
```csharp
public void UpdateHeartbeat(string connectionId, HeartbeatPayload payload)
{
if (!_connections.TryGetValue(connectionId, out var conn))
return;
conn.LastHeartbeatUtc = DateTime.UtcNow;
conn.Status = payload.Status;
}
```
The router's transport server should know which `connectionId` delivered the frame; pass that along.
### 3.2 Detect stale connections (health degradation)
Add a background “health monitor” in the gateway:
* Reads `HeartbeatTimeout` from configuration (can reuse the same default as microservice or have separate router-side config).
* Periodically scans all `ConnectionState` entries:
* If `Now - LastHeartbeatUtc > HeartbeatTimeout`, mark `Status = Unhealthy` (or remove connection entirely).
* If connection drops (transport disconnect), also mark `Unhealthy` or remove.
This can be a simple `IHostedService`:
```csharp
internal sealed class ConnectionHealthMonitor : IHostedService
{
private readonly IGlobalRoutingState _state;
private readonly TimeSpan _heartbeatTimeout;
private Task? _loop;
private CancellationTokenSource? _cts;
// Assign dependencies explicitly; in real wiring the timeout would come from configuration.
public ConnectionHealthMonitor(IGlobalRoutingState state, TimeSpan heartbeatTimeout)
{
    _state = state;
    _heartbeatTimeout = heartbeatTimeout;
}
public Task StartAsync(CancellationToken cancellationToken)
{
_cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
_loop = Task.Run(() => MonitorLoopAsync(_cts.Token), _cts.Token);
return Task.CompletedTask;
}
public async Task StopAsync(CancellationToken cancellationToken)
{
_cts?.Cancel();
if (_loop is not null)
await _loop;
}
private async Task MonitorLoopAsync(CancellationToken ct)
{
while (!ct.IsCancellationRequested)
{
_state.MarkStaleConnectionsUnhealthy(_heartbeatTimeout, DateTime.UtcNow);
await Task.Delay(TimeSpan.FromSeconds(5), ct);
}
}
}
```
You'll add a method like `MarkStaleConnectionsUnhealthy` on your `IGlobalRoutingState` implementation.
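A sketch of that method, assuming connections are kept in a dictionary keyed by connection id (as in `UpdateHeartbeat` above); guard it with the same lock or a `ConcurrentDictionary` as appropriate:
```csharp
public void MarkStaleConnectionsUnhealthy(TimeSpan heartbeatTimeout, DateTime nowUtc)
{
    foreach (var conn in _connections.Values)
    {
        // A connection that has not sent a heartbeat within the timeout is considered unhealthy.
        if (nowUtc - conn.LastHeartbeatUtc > heartbeatTimeout)
        {
            conn.Status = InstanceHealthStatus.Unhealthy;
        }
    }
}
```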
---
## 4. Track basic latency (AveragePingMs)
**Project:** Gateway + Common
**Owner:** Gateway dev
You want `AveragePingMs` per connection to inform routing decisions.
### 4.1 Decide where to measure
Simplest: measure “request → response” round-trip time in the gateway:
* When you send a `Request` frame to a specific connection, record:
* `SentAtUtc[CorrelationId] = DateTime.UtcNow`.
* When you receive a `Response` frame with that correlation:
* Compute `latencyMs = (UtcNow - SentAtUtc[CorrelationId]).TotalMilliseconds`.
* Discard map entry.
Then update `ConnectionState.AveragePingMs`, e.g. with an exponential moving average:
```csharp
conn.AveragePingMs = conn.AveragePingMs <= 0
? latencyMs
: conn.AveragePingMs * 0.8 + latencyMs * 0.2;
```
### 4.2 Where to hook this
* In the **gateway-side transport client** (InMemory implementation for now):
* When sending `Request` frame:
* Register `SentAtUtc` per correlation ID.
* When receiving `Response` frame:
* Compute latency.
* Call `IGlobalRoutingState.UpdateLatency(connectionId, latencyMs)`.
Add a method to the routing state:
```csharp
public void UpdateLatency(string connectionId, double latencyMs)
{
if (_connections.TryGetValue(connectionId, out var conn))
{
if (conn.AveragePingMs <= 0)
conn.AveragePingMs = latencyMs;
else
conn.AveragePingMs = conn.AveragePingMs * 0.8 + latencyMs * 0.2;
}
}
```
You can keep it simple; sophistication can come later.
---
## 5. Basic routing plugin implementation
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
You already have `IRoutingPlugin` defined. Now implement a concrete `BasicRoutingPlugin` that respects:
* Region (gateway region first, then neighbor tiers).
* Health (`Healthy` / `Degraded` only).
* Latency preference (`AveragePingMs`).
### 5.1 Inputs & data
`RoutingContext` should carry:
* `EndpointDescriptor` (with ServiceName, Version, Method, Path).
* `GatewayRegion` (from `GatewayNodeConfig.Region`).
* The `HttpContext` if you need headers (not needed for routing at this stage).
`IGlobalRoutingState` should provide:
* `GetConnectionsFor(serviceName, version, method, path)` returning all `ConnectionState`s that support that endpoint.
### 5.2 Basic algorithm
Algorithm outline:
```csharp
public sealed class BasicRoutingPlugin : IRoutingPlugin
{
private readonly IGlobalRoutingState _state;
private readonly string[] _neighborRegions; // configured, can be empty
public async Task<RoutingDecision?> ChooseInstanceAsync(
RoutingContext context,
CancellationToken cancellationToken)
{
var endpoint = context.Endpoint;
var candidates = _state.GetConnectionsFor(
endpoint.ServiceName,
endpoint.Version,
endpoint.Method,
endpoint.Path);
if (candidates.Count == 0)
return null;
// 1. Filter by health (only Healthy or Degraded)
var healthy = candidates
.Where(c => c.Status == InstanceHealthStatus.Healthy || c.Status == InstanceHealthStatus.Degraded)
.ToList();
if (healthy.Count == 0)
return null;
// 2. Partition by region tier
var gatewayRegion = context.GatewayRegion;
List<ConnectionState> tier1 = healthy.Where(c => c.Instance.Region == gatewayRegion).ToList();
List<ConnectionState> tier2 = healthy.Where(c => _neighborRegions.Contains(c.Instance.Region)).ToList();
List<ConnectionState> tier3 = healthy.Except(tier1).Except(tier2).ToList();
var chosenTier = tier1.Count > 0 ? tier1 : tier2.Count > 0 ? tier2 : tier3;
if (chosenTier.Count == 0)
return null;
// 3. Sort by latency, then heartbeat freshness
var ordered = chosenTier
.OrderBy(c => c.AveragePingMs <= 0 ? double.MaxValue : c.AveragePingMs)
.ThenByDescending(c => c.LastHeartbeatUtc)
.ToList();
var winner = ordered[0];
// 4. Build decision
return new RoutingDecision
{
Endpoint = endpoint,
Connection = winner,
TransportType = winner.TransportType,
EffectiveTimeout = endpoint.DefaultTimeout // or compose with config later
};
}
}
```
Wire it into DI:
```csharp
services.AddSingleton<IRoutingPlugin, BasicRoutingPlugin>();
```
And ensure `RoutingDecisionMiddleware` calls it.
---
## 6. Integrate health-aware routing into the HTTP pipeline
**Project:** `StellaOps.Gateway.WebService`
**Owner:** Gateway dev
Update your `RoutingDecisionMiddleware` to:
* Use the final `IRoutingPlugin` instead of picking a random connection.
* Handle null decision appropriately:
* If `ChooseInstanceAsync` returns `null`, respond with `503 Service Unavailable` or `502 Bad Gateway` and a generic error body, and log the incident.
Check that:
* The gateway's region is injected (via `GatewayNodeConfig.Region`) into `RoutingContext.GatewayRegion`.
* Endpoint descriptor is resolved before you call the plugin.
---
## 7. Testing plan
**Project:** `StellaOps.Gateway.WebService.Tests`, `StellaOps.Microservice.Tests`
**Owner:** test agent
Write basic tests to lock in behavior.
### 7.1 Microservice heartbeat tests
In `StellaOps.Microservice.Tests`:
* Use a fake `IMicroserviceConnection` that records frames sent.
* Configure `HeartbeatInterval` to a small number (e.g. 100 ms).
* Start a Host with `AddStellaMicroservice`.
* Wait some time, assert:
* At least one HELLO frame was sent.
* At least N HEARTBEAT frames were sent.
* HEARTBEAT payload has correct `InstanceId` and `Status`.
### 7.2 Router health update tests
In `StellaOps.Gateway.WebService.Tests` (or a separate routing-state test project):
* Create an instance of your `IGlobalRoutingState` implementation.
* Add a connection via HELLO simulation.
* Call `UpdateHeartbeat` with a HeartbeatPayload.
* Assert:
* `LastHeartbeatUtc` updated.
* `Status` set to `Healthy` (or whatever payload said).
* Advance time (simulate via injecting a clock or mocking DateTime) and call `MarkStaleConnectionsUnhealthy`:
* Assert that `Status` changed to `Unhealthy`.
### 7.3 Routing plugin tests
Write tests for `BasicRoutingPlugin`:
* Case 1: multiple connections, some unhealthy:
* Only Healthy/Degraded are considered.
* Case 2: multiple regions:
* Instances in gateway region win over others.
* Case 3: same region, different `AveragePingMs`:
* Lower latency chosen.
* Case 4: same latency, different `LastHeartbeatUtc`:
* More recent heartbeat chosen.
These tests will give you confidence that the routing logic behaves as requested and is stable as you add complexity later (streaming, cancellation, etc.).
---
## 8. Done criteria for “Add heartbeat, health, basic routing rules”
You can declare this step complete when:
* Microservices:
* Periodically send HEARTBEAT frames on the same connection they use for requests.
* Gateway/router:
* Updates `LastHeartbeatUtc` and `Status` on receipt of HEARTBEAT.
* Marks stale or disconnected connections as `Unhealthy` (or removes them).
* Tracks `AveragePingMs` per connection based on request/response round trips.
* Routing:
* `IRoutingPlugin` chooses instances based on:
* Strict `ServiceName` + `Version` + endpoint match.
* Health (`Healthy`/`Degraded` only).
* Region preference (gateway region > neighbors > others).
* Latency (`AveragePingMs`) then heartbeat recency.
* Tests:
* Validate heartbeats are sent and processed.
* Validate stale connections are marked unhealthy.
* Validate routing plugin picks the expected instance in simple scenarios.
Once this is in place, you have a live, health-aware routing fabric. The next logical step after this is to add **cancellation** and then **streaming + payload limits** on top of the same structures.

docs/router/07-Step.md Normal file
@@ -0,0 +1,378 @@
For this step you're wiring **request cancellation** end-to-end in the InMemory setup:
> Client / gateway gives up → gateway sends CANCEL → microservice cancels handler
No need to mix in streaming or payload limits yet; just enforce cancellation for timeouts and client disconnects.
---
## 0. Preconditions
Have in place:
* `FrameType.Cancel` in `StellaOps.Router.Common.FrameType`.
* `ITransportClient.SendCancelAsync(ConnectionState, Guid, string?)` in Common.
* Minimal InMemory path from HTTP → gateway → microservice (HELLO + REQUEST/RESPONSE) working.
If `FrameType.Cancel` or `SendCancelAsync` aren't there yet, add them first.
---
## 1. Common: cancel payload (optional, but useful)
If you want reasons attached, add a DTO in Common:
```csharp
public sealed class CancelPayload
{
public string Reason { get; init; } = string.Empty; // e.g. "ClientDisconnected", "Timeout"
}
```
You'll serialize this into `Frame.Payload` when sending a CANCEL. If you don't care about reasons yet, you can skip the payload and just use the correlation id.
No behavior in Common, just the shape.
---
## 2. Gateway: trigger CANCEL on client abort and timeout
### 2.1 Extend `TransportDispatchMiddleware`
You already:
* Generate a `correlationId`.
* Build a `FrameType.Request`.
* Call `ITransportClient.SendRequestAsync(...)` and await it.
Now:
1. Create a linked CTS that combines:
* `HttpContext.RequestAborted`
* The endpoint timeout
2. Register a callback on `RequestAborted` that sends a CANCEL with the same correlationId.
3. On `OperationCanceledException` where the HTTP token is not canceled (pure timeout), send a CANCEL once and return 504.
Sketch:
```csharp
public async Task Invoke(HttpContext context, ITransportClient transportClient)
{
var decision = (RoutingDecision)context.Items[RouterHttpContextKeys.RoutingDecision]!;
var correlationId = Guid.NewGuid();
// build requestFrame as before
var timeout = decision.EffectiveTimeout;
using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
linkedCts.CancelAfter(timeout);
// fire-and-forget cancel on client disconnect
context.RequestAborted.Register(() =>
{
_ = transportClient.SendCancelAsync(
decision.Connection, correlationId, "ClientDisconnected");
});
Frame responseFrame;
try
{
responseFrame = await transportClient.SendRequestAsync(
decision.Connection,
requestFrame,
timeout,
linkedCts.Token);
}
catch (OperationCanceledException) when (!context.RequestAborted.IsCancellationRequested)
{
// internal timeout
await transportClient.SendCancelAsync(
decision.Connection, correlationId, "Timeout");
context.Response.StatusCode = StatusCodes.Status504GatewayTimeout;
await context.Response.WriteAsync("Upstream timeout");
return;
}
// existing response mapping goes here
}
```
Key points:
* The gateway sends CANCEL **as soon as**:
* The client disconnects (RequestAborted).
* Or the internal timeout triggers (catch branch).
* We do not need any global correlation registry on the gateway side; the middleware has the `correlationId` and `Connection`.
---
## 3. InMemory transport: propagate CANCEL to microservice
### 3.1 Implement `SendCancelAsync` in `InMemoryTransportClient` (gateway side)
In your gateway InMemory implementation:
```csharp
public Task SendCancelAsync(ConnectionState connection, Guid correlationId, string? reason = null)
{
var payload = reason is null
? Array.Empty<byte>()
: SerializeCancelPayload(new CancelPayload { Reason = reason });
var frame = new Frame
{
Type = FrameType.Cancel,
CorrelationId = correlationId,
Payload = payload
};
return _hub.SendFromGatewayAsync(connection.ConnectionId, frame, CancellationToken.None);
}
```
`_hub.SendFromGatewayAsync` must route the frame to the microservice's receive loop for that connection.
### 3.2 Hub routing
Ensure your `IInMemoryRouterHub` implementation:
* When `SendFromGatewayAsync(connectionId, cancelFrame, ct)` is called:
* Enqueues that frame onto the microservice's incoming channel (`GetFramesForMicroserviceAsync` stream).
No extra logic; just treat CANCEL like REQUEST/HELLO in terms of delivery.
---
## 4. Microservice: track in-flight requests
Now microservice needs to know **which** request to cancel when a CANCEL arrives.
### 4.1 In-flight registry
In the microservice connection class (the one doing the receive loop):
```csharp
private readonly ConcurrentDictionary<Guid, RequestExecution> _inflight =
new();
private sealed class RequestExecution
{
public CancellationTokenSource Cts { get; init; } = default!;
public Task ExecutionTask { get; init; } = default!;
}
```
When a `Request` frame arrives:
* Create a `CancellationTokenSource`.
* Start the handler using that token.
* Store both in `_inflight`.
Example pattern in `ReceiveLoopAsync`:
```csharp
private async Task ReceiveLoopAsync(CancellationToken ct)
{
await foreach (var frame in _routerClient.GetIncomingFramesAsync(ct))
{
switch (frame.Type)
{
case FrameType.Request:
HandleRequest(frame);
break;
case FrameType.Cancel:
HandleCancel(frame);
break;
// other frame types...
}
}
}
private void HandleRequest(Frame frame)
{
var cts = new CancellationTokenSource();
var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cts.Token); // later link to global shutdown if needed
var exec = new RequestExecution
{
Cts = cts,
ExecutionTask = HandleRequestCoreAsync(frame, linkedCts.Token)
};
_inflight[frame.CorrelationId] = exec;
_ = exec.ExecutionTask.ContinueWith(_ =>
{
_inflight.TryRemove(frame.CorrelationId, out _);
cts.Dispose();
linkedCts.Dispose();
}, TaskScheduler.Default);
}
```
### 4.2 Wire CancellationToken into dispatcher
`HandleRequestCoreAsync` should:
* Deserialize the request payload.
* Build a `RawRequestContext` with `CancellationToken = token`.
* Pass that token through to:
* `IRawStellaEndpoint.HandleAsync(context)` (via the context).
* Or typed handler adapter (`IStellaEndpoint<,>` / `IStellaEndpoint<TResponse>`), passing it explicitly.
Example pattern:
```csharp
private async Task HandleRequestCoreAsync(Frame frame, CancellationToken ct)
{
var req = DeserializeRequestPayload(frame.Payload);
if (!_catalog.TryGetHandler(req.Method, req.Path, out var registration))
{
var notFound = BuildNotFoundResponse(frame.CorrelationId);
await _routerClient.SendFrameAsync(notFound, ct);
return;
}
using var bodyStream = new MemoryStream(req.Body); // minimal case
var ctx = new RawRequestContext
{
Method = req.Method,
Path = req.Path,
Headers = req.Headers,
Body = bodyStream,
CancellationToken = ct
};
var handler = (IRawStellaEndpoint)_serviceProvider.GetRequiredService(registration.HandlerType);
var response = await handler.HandleAsync(ctx);
var respFrame = BuildResponseFrame(frame.CorrelationId, response);
await _routerClient.SendFrameAsync(respFrame, ct);
}
```
Now each handler sees a token that will be canceled when a CANCEL frame arrives.
### 4.3 Handle CANCEL frames
When a `Cancel` frame arrives:
```csharp
private void HandleCancel(Frame frame)
{
if (_inflight.TryGetValue(frame.CorrelationId, out var exec))
{
exec.Cts.Cancel();
}
// Ignore if not found (e.g. already completed)
}
```
If you care about the reason, deserialize `CancelPayload` and log it; not required for behavior.
---
## 5. Handler guidance (for your Microservice docs)
In `Stella Ops Router Microservice.md`, add simple rules devs must follow:
* Any longrunning or IO-heavy code in endpoints MUST:
* Accept a `CancellationToken` (for typed endpoints).
* Or use `RawRequestContext.CancellationToken` for raw endpoints.
* Always pass the token into:
* DB calls.
* File I/O and stream operations.
* HTTP/gRPC calls to other services.
* Do not swallow `OperationCanceledException` unless there is a good reason; normally let it bubble or treat it as a normal cancellation.
Concrete example for devs:
```csharp
[StellaEndpoint("POST", "/billing/slow-operation")]
public sealed class SlowEndpoint : IRawStellaEndpoint
{
public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
{
// Correct: observe token
await Task.Delay(TimeSpan.FromMinutes(5), ctx.CancellationToken);
return new RawResponse { StatusCode = 204 };
}
}
```
---
## 6. Tests
### 6.1 Client abort → CANCEL
Test outline:
* Setup:
* Gateway + microservice wired via InMemory hub.
* Microservice endpoint that:
* Waits on `Task.Delay(TimeSpan.FromMinutes(5), ctx.CancellationToken)`.
* Test:
1. Start HTTP request to `/slow`.
2. After sending the request, cancel the client's HttpClient token or close the connection.
3. Assert:
* Gateways InMemory transport sent a `FrameType.Cancel`.
* Microservice's handler is canceled (e.g. no longer running after a short time).
* No response (or partial) is written; HTTP side will produce whatever your test harness expects when client aborts.
### 6.2 Gateway timeout → CANCEL
* Configure endpoint timeout small (e.g. 100 ms).
* Have endpoint sleep for 5 seconds with the token.
* Assert:
* Gateway returns 504.
* Cancel frame was sent.
* Handler is canceled (task completes early).
These tests lock in the semantics so later additions (real transports, streaming) don't regress cancellation.
---
## 7. Done criteria for “Add cancellation semantics (with InMemory)”
You can mark step 7 as complete when:
* For every routed request, the gateway knows its correlationId and connection.
* On client disconnect:
* Gateway sends a `FrameType.Cancel` with that correlationId.
* On internal timeout:
* Gateway sends a `FrameType.Cancel` and returns 504 to the client.
* InMemory hub delivers CANCEL frames to the microservice.
* Microservice:
* Tracks in-flight requests by correlationId.
* Cancels the proper `CancellationTokenSource` when CANCEL arrives.
* Passes the token into handlers via `RawRequestContext` and typed adapters.
* At least one automated test proves:
* Cancellation propagates from gateway to microservice and stops the handler.
Once this is done, you'll be in good shape to add streaming & payload limits on top, because the cancel path is already wired end-to-end.

docs/router/08-Step.md Normal file
@@ -0,0 +1,501 @@
For this step you're teaching the system to handle **streams** instead of always buffering, and to **enforce payload limits** so the gateway can't be DoS'd by large uploads. Still only using the InMemory transport.
Goal state:
* Gateway can stream HTTP request/response bodies to/from microservice without buffering everything.
* Gateway enforces percall and global/inflight payload limits.
* Microservice sees a `Stream` on `RawRequestContext.Body` and reads from it.
* All of this works over the existing InMemory “connection”.
I'll break it into concrete tasks.
---
## 0. Preconditions
Make sure you already have:
* Minimal InMemory routing working:
* HTTP → gateway → InMemory → microservice → InMemory → HTTP.
* Cancellation wired (step 7):
* `FrameType.Cancel`.
* `ITransportClient.SendCancelAsync` implemented for InMemory.
* Microservice uses `CancellationToken` in `RawRequestContext`.
Then layer streaming & limits on top.
---
## 1. Confirm / finalize Common primitives for streaming & limits
**Project:** `StellaOps.Router.Common`
Tasks:
1. Ensure `FrameType` has:
```csharp
public enum FrameType : byte
{
Hello = 1,
Heartbeat = 2,
EndpointsUpdate = 3,
Request = 4,
RequestStreamData = 5,
Response = 6,
ResponseStreamData = 7,
Cancel = 8
}
```
You may not *use* `RequestStreamData` / `ResponseStreamData` in InMemory implementation initially if you choose the bridging approach, but having them defined keeps the model coherent.
2. Ensure `EndpointDescriptor` has:
```csharp
public bool SupportsStreaming { get; init; }
```
3. Ensure `PayloadLimits` type exists (in Common or Config, but referenced by both):
```csharp
public sealed class PayloadLimits
{
public long MaxRequestBytesPerCall { get; set; } // per HTTP request
public long MaxRequestBytesPerConnection { get; set; } // per microservice connection
public long MaxAggregateInflightBytes { get; set; } // across all requests
}
```
4. `ITransportClient` already contains:
```csharp
Task SendStreamingAsync(
ConnectionState connection,
Frame requestHeader,
Stream requestBody,
Func<Stream, Task> readResponseBody,
PayloadLimits limits,
CancellationToken ct);
```
If not, add it now (implementation will be InMemory-only for this step).
No logic in Common; just shapes.
---
## 2. Gateway: payload budget tracker
You need a small service in the gateway that tracks inflight bytes to enforce limits.
**Project:** `StellaOps.Gateway.WebService`
### 2.1 Define a budget interface
```csharp
public interface IPayloadBudget
{
bool TryReserve(string connectionId, Guid requestId, long bytes);
void Release(string connectionId, Guid requestId, long bytes);
}
```
### 2.2 Implement a simple in-memory tracker
Implementation outline:
* Track:
* `long _globalInflightBytes`.
* `Dictionary<string,long> _perConnectionInflightBytes`.
* `Dictionary<Guid,long> _perRequestInflightBytes`.
All updated under a lock or `ConcurrentDictionary` + `Interlocked`.
Logic for `TryReserve`:
* Compute proposed:
* `newGlobal = _globalInflightBytes + bytes`
* `newConn = perConnection[connectionId] + bytes`
* `newReq = perRequest[requestId] + bytes`
* If any exceed configured limits (`PayloadLimits` from config), return `false`.
* Else:
* Commit updates and return `true`.
`Release` subtracts the bytes, never going below zero.
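A minimal implementation along those lines, using a single lock (a sketch; in real wiring the `PayloadLimits` would come from `RouterConfig` options rather than being injected directly):
```csharp
public sealed class PayloadBudget : IPayloadBudget
{
    private readonly PayloadLimits _limits;
    private readonly object _lock = new();
    private long _globalInflightBytes;
    private readonly Dictionary<string, long> _perConnection = new();
    private readonly Dictionary<Guid, long> _perRequest = new();

    public PayloadBudget(PayloadLimits limits) => _limits = limits;

    public bool TryReserve(string connectionId, Guid requestId, long bytes)
    {
        lock (_lock)
        {
            var newGlobal = _globalInflightBytes + bytes;
            var newConn = _perConnection.GetValueOrDefault(connectionId) + bytes;
            var newReq = _perRequest.GetValueOrDefault(requestId) + bytes;

            if (newGlobal > _limits.MaxAggregateInflightBytes ||
                newConn > _limits.MaxRequestBytesPerConnection ||
                newReq > _limits.MaxRequestBytesPerCall)
            {
                return false; // reject without committing anything
            }

            _globalInflightBytes = newGlobal;
            _perConnection[connectionId] = newConn;
            _perRequest[requestId] = newReq;
            return true;
        }
    }

    public void Release(string connectionId, Guid requestId, long bytes)
    {
        lock (_lock)
        {
            _globalInflightBytes = Math.Max(0, _globalInflightBytes - bytes);
            _perConnection[connectionId] = Math.Max(0, _perConnection.GetValueOrDefault(connectionId) - bytes);

            var remaining = Math.Max(0, _perRequest.GetValueOrDefault(requestId) - bytes);
            if (remaining == 0)
                _perRequest.Remove(requestId);
            else
                _perRequest[requestId] = remaining;
        }
    }
}
```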
Register in DI:
```csharp
services.AddSingleton<IPayloadBudget, PayloadBudget>();
```
---
## 3. Gateway: choose buffered vs streaming path
Extend `TransportDispatchMiddleware` to branch on mode.
**Project:** `StellaOps.Gateway.WebService`
### 3.1 Decide mode
At the start of the middleware:
```csharp
var decision = (RoutingDecision)context.Items[RouterHttpContextKeys.RoutingDecision]!;
var endpoint = decision.Endpoint;
var limits = _options.Value.PayloadLimits; // from RouterConfig
var supportsStreaming = endpoint.SupportsStreaming;
var hasKnownLength = context.Request.ContentLength.HasValue;
var contentLength = context.Request.ContentLength ?? -1;
// Simple rule for now:
var useStreaming =
supportsStreaming &&
(!hasKnownLength || contentLength > limits.MaxRequestBytesPerCall);
```
* If `useStreaming == false`:
* Use buffered path with hard size checks.
* If `useStreaming == true`:
* Use streaming path (`ITransportClient.SendStreamingAsync`).
---
## 4. Gateway: buffered path with limits
**Still in `TransportDispatchMiddleware`**
### 4.1 Early 413 check
When `supportsStreaming == false`:
1. If `Content-Length` known and:
```csharp
if (hasKnownLength && contentLength > limits.MaxRequestBytesPerCall)
{
context.Response.StatusCode = StatusCodes.Status413PayloadTooLarge;
return;
}
```
2. When reading body into memory:
* Read in chunks.
* Track `bytesReadThisCall`.
* If `bytesReadThisCall > limits.MaxRequestBytesPerCall`, abort and return 413.
You don't have to call `IPayloadBudget` for buffered mode yet; you can, but the hard per-call limit already protects RAM for this step.
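A sketch of that capped read (the helper name is illustrative; it returns `null` after writing the 413 so the caller can stop processing):
```csharp
private static async Task<byte[]?> ReadBodyWithLimitAsync(
    HttpContext context, long maxRequestBytesPerCall, CancellationToken ct)
{
    var buffer = new byte[64 * 1024];
    using var ms = new MemoryStream();
    long bytesReadThisCall = 0;
    int read;
    while ((read = await context.Request.Body.ReadAsync(buffer, ct)) > 0)
    {
        bytesReadThisCall += read;
        if (bytesReadThisCall > maxRequestBytesPerCall)
        {
            // Per-call limit exceeded: reject and stop reading.
            context.Response.StatusCode = StatusCodes.Status413PayloadTooLarge;
            return null;
        }
        ms.Write(buffer, 0, read);
    }
    return ms.ToArray();
}
```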
Buffered path then proceeds as before:
* Build `MinimalRequestPayload` with full body.
* Send via `SendRequestAsync`.
* Map response.
---
## 5. Gateway: streaming path (InMemory)
This is the new part.
### 5.1 Use `ITransportClient.SendStreamingAsync`
In the `useStreaming == true` branch:
```csharp
var correlationId = Guid.NewGuid();
var headerPayload = new MinimalRequestPayload
{
Method = context.Request.Method,
Path = context.Request.Path.ToString(),
Headers = ExtractHeaders(context.Request),
Body = Array.Empty<byte>(), // streaming body will follow
IsStreaming = true // add this flag to your payload DTO
};
var headerFrame = new Frame
{
Type = FrameType.Request,
CorrelationId = correlationId,
Payload = SerializeRequestPayload(headerPayload)
};
using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(context.RequestAborted);
linkedCts.CancelAfter(decision.EffectiveTimeout);
// register cancel → SendCancelAsync (already done in step 7)
await _transportClient.SendStreamingAsync(
decision.Connection,
headerFrame,
context.Request.Body,
async responseBodyStream =>
{
// Copy microservice stream directly to HTTP response
await responseBodyStream.CopyToAsync(context.Response.Body, linkedCts.Token);
},
limits,
linkedCts.Token);
```
Key points:
* Streaming path does not buffer the whole body.
* Limits and cancellation are enforced inside `SendStreamingAsync`.
---
## 6. InMemory transport: streaming implementation
**Project:** gateway side InMemory `ITransportClient` implementation and InMemory router hub; microservice side connection.
For InMemory, you can model streaming via **bridged streams**: a producer/consumer pair in memory.
### 6.1 Add streaming call to InMemory client
In `InMemoryTransportClient`:
```csharp
public async Task SendStreamingAsync(
ConnectionState connection,
Frame requestHeader,
Stream httpRequestBody,
Func<Stream, Task> readResponseBody,
PayloadLimits limits,
CancellationToken ct)
{
await _hub.StreamFromGatewayAsync(
connection.ConnectionId,
requestHeader,
httpRequestBody,
readResponseBody,
limits,
ct);
}
```
Expose `StreamFromGatewayAsync` on `IInMemoryRouterHub`:
```csharp
Task StreamFromGatewayAsync(
string connectionId,
Frame requestHeader,
Stream requestBody,
Func<Stream, Task> readResponseBody,
PayloadLimits limits,
CancellationToken ct);
```
### 6.2 InMemory hub streaming strategy (bridging style)
Inside `StreamFromGatewayAsync`:
1. Create a **pair of connected streams** for request body:
* e.g., a custom `ProducerConsumerStream` built on a `Channel<byte[]>` or `System.IO.Pipelines` (see the sketch after this list).
* “Producer” side (writer) will be fed from HTTP.
* “Consumer” side will be given to the microservice as `RawRequestContext.Body`.
2. Create a **pair of connected streams** for response body:
* “Consumer” side will be used in `readResponseBody` to write to HTTP.
* “Producer” side will be given to the microservice handler to write response body.
3. On the microservice side:
* Build a `RawRequestContext` with `Body = requestBodyConsumerStream` and `CancellationToken = ct`.
* Dispatch to the endpoint handler as usual.
* Have the handler's `RawResponse.WriteBodyAsync` pointed at `responseBodyProducerStream`.
4. Parallel tasks:
* Task 1: Copy HTTP → `requestBodyProducerStream` in chunks, enforcing `PayloadLimits` (see next section).
* Task 2: Execute the handler, which reads from `Body` and writes to `responseBodyProducerStream`.
* Task 3: Copy `responseBodyConsumerStream` → HTTP via `readResponseBody`.
5. Propagate cancellation:
* If `ct` is canceled (client disconnect/timeout/payload limit breach):
* Stop HTTP→requestBody copy.
* Signal stream completion / cancellation to handler.
* Handler should see cancellation via `CancellationToken`.
Because this is InMemory, you don't *have* to materialize explicit `RequestStreamData` frames; you only need the behavior. Real transports will implement the same semantics with actual frames.
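One way to create the two bridged stream pairs described above is `System.IO.Pipelines` (a sketch; the variable names match the roles used in this section):
```csharp
using System.IO.Pipelines;

// Request body: the gateway writes, the microservice handler reads.
var requestPipe = new Pipe();
Stream requestBodyProducerStream = requestPipe.Writer.AsStream(); // fed from the HTTP request body
Stream requestBodyConsumerStream = requestPipe.Reader.AsStream(); // becomes RawRequestContext.Body

// Response body: the microservice handler writes, the gateway reads.
var responsePipe = new Pipe();
Stream responseBodyProducerStream = responsePipe.Writer.AsStream(); // handler writes its response body here
Stream responseBodyConsumerStream = responsePipe.Reader.AsStream(); // copied to HTTP via readResponseBody
```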
---
## 7. Enforce payload limits in streaming copy
Still in `StreamFromGatewayAsync` / InMemory side:
### 7.1 HTTP → microservice copy with budget
In Task 1:
```csharp
var buffer = new byte[64 * 1024];
long totalBytesReserved = 0;
int read;
var requestId = requestHeader.CorrelationId;
var connectionId = connectionIdFromArgs;
while ((read = await httpRequestBody.ReadAsync(buffer, 0, buffer.Length, ct)) > 0)
{
    if (!_budget.TryReserve(connectionId, requestId, read))
    {
        // Limit exceeded: stop reading and signal failure
        if (_cancelCallback is not null)
            await _cancelCallback(requestId, "PayloadLimitExceeded"); // or call SendCancelAsync
        break;
    }
    totalBytesReserved += read;
    await requestBodyProducerStream.WriteAsync(buffer.AsMemory(0, read), ct);
}
// After the loop, release whatever was reserved for this request
_budget.Release(connectionId, requestId, totalBytesReserved);
await requestBodyProducerStream.FlushAsync(ct);
await requestBodyProducerStream.DisposeAsync();
```
If `TryReserve` fails:
* Stop reading further bytes.
* Trigger cancellation downstream:
* Either call the existing `SendCancelAsync` path.
* Or signal completion with error and let handler catch cancellation.
Gateway side should then translate this into 413 or 503 to the client.
### 7.2 Response copy
The response path doesn't need budget tracking (the danger is inbound to the gateway); but if you want symmetry, you can also enforce a max outbound size.
For now, just stream microservice → HTTP through `readResponseBody` until EOF or cancellation.
---
## 8. Microservice side: streaming-aware `RawRequestContext.Body`
Your streaming bridging already gives the handler a `Stream` that reads what the gateway sends:
* No changes required in handler interfaces.
* You only need to ensure:
* `RawRequestContext.Body` **may be non-seekable**.
* Handlers know they must treat it as a forward-only stream.
Guidance for devs in `Microservice.md`:
* For binary uploads or large files, implement `IRawStellaEndpoint` and read incrementally:
```csharp
[StellaEndpoint("POST", "/billing/invoices/upload")]
public sealed class InvoiceUploadEndpoint : IRawStellaEndpoint
{
public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
{
var buffer = new byte[64 * 1024];
int read;
while ((read = await ctx.Body.ReadAsync(buffer.AsMemory(0, buffer.Length), ctx.CancellationToken)) > 0)
{
// Process chunk
}
return new RawResponse { StatusCode = 204 };
}
}
```
---
## 9. Tests
**Scope:** still InMemory, but now streaming & limits.
### 9.1 Streaming happy path
* Setup:
* Endpoint with `SupportsStreaming = true`.
* `IRawStellaEndpoint` that:
* Counts total bytes read from `ctx.Body`.
* Returns 200.
* Test:
* Send an HTTP POST with a body larger than `MaxRequestBytesPerCall`, but with streaming enabled.
* Assert:
* Gateway does **not** buffer the entire body in one array (you can assert via instrumentation or at least confirm no 413).
* Handler sees the full number of bytes.
* Response is 200.
### 9.2 Per-call limit breach
* Configure:
* `SupportsStreaming = false` (or use streaming but set low `MaxRequestBytesPerCall`).
* Test:
* Send a body larger than limit.
* Assert:
* Gateway responds 413.
* Handler is not invoked at all.
### 9.3 Global/in-flight limit breach
* Configure:
* `MaxAggregateInflightBytes` very low (e.g. 1 MB).
* Test:
* Start multiple concurrent streaming requests that each try to send more than the allowed total.
* Assert:
* Some of them get a CANCEL / error (413 or 503).
* `IPayloadBudget` denies reservations and releases resources correctly.
---
## 10. Done criteria for “Add streaming & payload limits (InMemory)”
Youre done with this step when:
* Gateway:
* Chooses buffered vs streaming based on `EndpointDescriptor.SupportsStreaming` and size.
* Enforces `MaxRequestBytesPerCall` for buffered requests (413 on violation).
* Uses `ITransportClient.SendStreamingAsync` for streaming.
* Has an `IPayloadBudget` preventing excessive in-flight payload accumulation.
* InMemory transport:
* Implements `SendStreamingAsync` by bridging HTTP streams to microservice handlers and back.
* Enforces payload limits while copying.
* Microservice:
* Receives a functional `Stream` in `RawRequestContext.Body`.
* Can implement `IRawStellaEndpoint` that reads incrementally for large payloads.
* Tests:
* Demonstrate a streaming endpoint works for large payloads.
* Demonstrate per-call and aggregate limits are respected and cause rejections/cancellations.
After this, you can reuse the same semantics when you implement real transports (TCP/TLS/RabbitMQ), with InMemory as your reference implementation.

docs/router/09-Step.md Normal file
@@ -0,0 +1,562 @@
For this step you're taking the protocol you already proved with InMemory and putting it on real transports:
* TCP (baseline)
* Certificate/TLS (secure TCP)
* UDP (small, non-streaming)
* RabbitMQ
The idea: every plugin implements the same `Frame` semantics (HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL, plus streaming where supported), and the gateway/microservices don't change their business logic at all.
I'll structure this as a sequence of sub-steps you can execute in order.
---
## 0. Preconditions
Before you start adding real transports, make sure:
* Frame model is stable in `StellaOps.Router.Common`:
* `Frame`, `FrameType`, `TransportType`.
* Microservice and gateway code use **only**:
* `ITransportClient` to send (gateway side).
* `ITransportServer` / connection abstractions to receive (gateway side).
* `IMicroserviceConnection` + `ITransportClient` under the hood (microservice side).
* InMemory transport is working with:
* HELLO
* REQUEST / RESPONSE
* CANCEL
* Streaming & payload limits (step 8)
If any code still directly talks to “InMemoryRouterHub” from app logic, hide it behind the `ITransportClient` / `ITransportServer` abstractions first.
---
## 1. Freeze the wire protocol and serializer
**Owner:** protocol / infra dev
Before touching sockets or RabbitMQ, lock down **how a `Frame` is encoded** on the wire. This must be consistent across all transports except InMemory (which can cheat a bit internally).
### 1.1 Frame header
Define a simple binary header; for example:
* 1 byte: `FrameType`
* 16 bytes: `CorrelationId` (`Guid`)
* 4 bytes: payload length (`int32`, big- or little-endian, but be consistent)
Total header = 21 bytes. Then `payloadLength` bytes follow.
You can evolve later but start with something simple.
### 1.2 Frame serializer
In a small shared, **non-ASP.NET** assembly (either Common or a new `StellaOps.Router.Protocol` library), implement:
```csharp
public interface IFrameSerializer
{
void WriteFrame(Frame frame, Stream stream, CancellationToken ct);
Task WriteFrameAsync(Frame frame, Stream stream, CancellationToken ct);
Frame ReadFrame(Stream stream, CancellationToken ct);
Task<Frame> ReadFrameAsync(Stream stream, CancellationToken ct);
}
```
Implementation:
* Writes header then payload.
* Reads header then payload; throws on EOF.
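As a concrete illustration of that layout, here is a sketch of the async half of the serializer (little-endian length; it assumes the `Frame` shape used throughout this document and does not claim to be the final implementation):
```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class BinaryFrameSerializer
{
    private const int HeaderLength = 21; // 1 (type) + 16 (correlation id) + 4 (payload length)

    public async Task WriteFrameAsync(Frame frame, Stream stream, CancellationToken ct)
    {
        var header = new byte[HeaderLength];
        header[0] = (byte)frame.Type;
        frame.CorrelationId.TryWriteBytes(header.AsSpan(1, 16));
        BinaryPrimitives.WriteInt32LittleEndian(header.AsSpan(17, 4), frame.Payload?.Length ?? 0);

        await stream.WriteAsync(header, ct);
        if (frame.Payload is { Length: > 0 })
            await stream.WriteAsync(frame.Payload, ct);
    }

    public async Task<Frame> ReadFrameAsync(Stream stream, CancellationToken ct)
    {
        var header = new byte[HeaderLength];
        await ReadExactlyAsync(stream, header, ct);

        var payloadLength = BinaryPrimitives.ReadInt32LittleEndian(header.AsSpan(17, 4));
        var payload = new byte[payloadLength];
        if (payloadLength > 0)
            await ReadExactlyAsync(stream, payload, ct);

        return new Frame
        {
            Type = (FrameType)header[0],
            CorrelationId = new Guid(header.AsSpan(1, 16)),
            Payload = payload
        };
    }

    private static async Task ReadExactlyAsync(Stream stream, byte[] buffer, CancellationToken ct)
    {
        var offset = 0;
        while (offset < buffer.Length)
        {
            var read = await stream.ReadAsync(buffer.AsMemory(offset), ct);
            if (read == 0)
                throw new EndOfStreamException("Connection closed mid-frame.");
            offset += read;
        }
    }
}
```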
For payloads (HELLO, HEARTBEAT, etc.), use one encoding consistently (e.g. `System.Text.Json` for now) and **centralize** DTO ⇒ `byte[]` conversions:
```csharp
public static class PayloadCodec
{
public static byte[] Encode<T>(T payload) =>
    System.Text.Json.JsonSerializer.SerializeToUtf8Bytes(payload);
public static T Decode<T>(byte[] bytes) =>
    System.Text.Json.JsonSerializer.Deserialize<T>(bytes)!;
}
```
All transports use `IFrameSerializer` + `PayloadCodec`.
---
## 2. Introduce a transport registry / resolver
**Projects:** gateway + microservice
**Owner:** infra dev
You need a way to map `TransportType` to a concrete plugin.
### 2.1 Gateway side
Define:
```csharp
public interface ITransportClientResolver
{
ITransportClient GetClient(TransportType transportType);
}
public interface ITransportServerFactory
{
ITransportServer CreateServer(TransportType transportType);
}
```
Initial implementation:
* Registers the available clients:
```csharp
public sealed class TransportClientResolver : ITransportClientResolver
{
private readonly IServiceProvider _sp;
public TransportClientResolver(IServiceProvider sp) => _sp = sp;
public ITransportClient GetClient(TransportType transportType) =>
transportType switch
{
TransportType.Tcp => _sp.GetRequiredService<TcpTransportClient>(),
TransportType.Certificate => _sp.GetRequiredService<TlsTransportClient>(),
TransportType.Udp => _sp.GetRequiredService<UdpTransportClient>(),
TransportType.RabbitMq => _sp.GetRequiredService<RabbitMqTransportClient>(),
_ => throw new NotSupportedException($"Transport {transportType} not supported.")
};
}
```
Then in `TransportDispatchMiddleware`, instead of injecting a single `ITransportClient`, inject `ITransportClientResolver` and choose:
```csharp
var client = clientResolver.GetClient(decision.TransportType);
```
### 2.2 Microservice side
On the microservice, you can do something similar:
```csharp
internal interface IMicroserviceTransportConnector
{
Task ConnectAsync(StellaMicroserviceOptions options, CancellationToken ct);
}
```
Implement one per transport type; later `StellaMicroserviceOptions.Routers` will determine which transport to use for each router endpoint.
---
## 3. Implement plugin 1: TCP
Start with TCP; it's the most straightforward and will largely mirror your InMemory behavior.
### 3.1 Gateway: `TcpTransportServer`
**Project:** `StellaOps.Gateway.WebService` or a transport sub-namespace.
Responsibilities:
* Listen on a configured TCP port (e.g. from `RouterConfig`).
* Accept connections, each mapping to a `ConnectionId`.
* For each connection:
* Start a background receive loop:
* Use `IFrameSerializer.ReadFrameAsync` on a `NetworkStream`.
* On `FrameType.Hello`:
* Deserialize HELLO payload.
* Build a `ConnectionState` and register with `IGlobalRoutingState`.
* On `FrameType.Heartbeat`:
* Update heartbeat for that `ConnectionId`.
* On `FrameType.Response` or `ResponseStreamData`:
* Push the frame into the gateway's correlation / streaming handler (similar to the InMemory path).
* On `FrameType.Cancel` (rare from microservice):
* Optionally implement; can be ignored for now.
* Provide a sending API to the matching `TcpTransportClient` (gateway-side) using `WriteFrameAsync`.
You will likely have:
* A `TcpConnectionContext` per connected microservice:
* Holds `ConnectionId`, `TcpClient`, `NetworkStream`, `TaskCompletionSource` maps for correlation IDs.
### 3.2 Gateway: `TcpTransportClient` (gateway-side, to microservices)
Implements `ITransportClient`:
* `SendRequestAsync`:
* Given `ConnectionState`:
* Get the associated `TcpConnectionContext`.
* Register a `TaskCompletionSource<Frame>` keyed by `CorrelationId`.
* Call `WriteFrameAsync(requestFrame)` on the connection's stream.
* Await the TCS, which is completed in the receive loop when a `Response` frame arrives (see the sketch after this list).
* `SendStreamingAsync`:
* Write header `FrameType.Request`.
* Read from `BudgetedRequestStream` in chunks:
* For TCP plugin you can either:
* Use `RequestStreamData` frames with chunk payloads, or
* Keep the simple bridging approach and send a single `Request` with all body bytes.
* Since you already validated streaming semantics with InMemory, you can decide:
* For first version of TCP, **only support buffered data**, then add chunk frames later.
* `SendCancelAsync`:
* Write a `FrameType.Cancel` frame with the same `CorrelationId`.
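The correlation pattern in `SendRequestAsync` typically looks like this (a sketch; `GetConnectionContext`, `PendingRequests`, and `Stream` are assumed members of the `TcpConnectionContext` bookkeeping described above, and the frame serializer is the one from step 1):
```csharp
public async Task<Frame> SendRequestAsync(
    ConnectionState connection, Frame requestFrame, TimeSpan timeout, CancellationToken ct)
{
    var ctx = GetConnectionContext(connection.ConnectionId); // hypothetical lookup of TcpConnectionContext

    var tcs = new TaskCompletionSource<Frame>(TaskCreationOptions.RunContinuationsAsynchronously);
    ctx.PendingRequests[requestFrame.CorrelationId] = tcs;
    try
    {
        await _frameSerializer.WriteFrameAsync(requestFrame, ctx.Stream, ct);

        // The receive loop completes the TCS when a Response frame with this correlation id arrives.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(timeout);
        return await tcs.Task.WaitAsync(timeoutCts.Token);
    }
    finally
    {
        ctx.PendingRequests.TryRemove(requestFrame.CorrelationId, out _);
    }
}
```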
### 3.3 Microservice: `TcpTransportClientConnection`
**Project:** `StellaOps.Microservice`
Responsibilities on microservice side:
* For each `RouterEndpointConfig` where `TransportType == Tcp`:
* Open a `TcpClient` to `Host:Port`.
* Use `IFrameSerializer` to send:
* `HELLO` frame (payload = identity + descriptors).
* Periodic `HEARTBEAT` frames.
* `RESPONSE` frames for incoming `REQUEST`s.
* Receive loop:
* `ReadFrameAsync` from `NetworkStream`.
* On `REQUEST`:
* Dispatch through `IEndpointDispatcher`.
* For minimal streaming, treat payload as buffered; you'll align with streaming later.
* On `CANCEL`:
* Use correlation ID to cancel the `CancellationTokenSource` you already maintain.
This is conceptually the same as InMemory but using real sockets.
---
## 4. Implement plugin 2: Certificate/TLS
Build TLS on top of TCP plugin; do not fork logic unnecessarily.
### 4.1 Gateway: `TlsTransportServer`
* Wrap accepted `TcpClient` sockets in `SslStream`.
* Load server certificate from configuration (for the node/region).
* Authenticate client if you want mutual TLS.
Structure:
* Reuse almost all of `TcpTransportServer` logic, but instead of `NetworkStream` you use `SslStream` as the underlying stream for `IFrameSerializer`.
### 4.2 Microservice: `TlsTransportClientConnection`
* Instead of plain `TcpClient.GetStream`, wrap in `SslStream`.
* Authenticate server (hostname & certificate).
* Optional: present client certificate.
Configuration fields in `RouterEndpointConfig` (or a TLS-specific sub-config):
* `UseTls` / `TransportType.Certificate`.
* Certificate paths / thumbprints / validation parameters.
At the SDK level, you just treat it as a different transport type; protocol remains identical.
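A sketch of the TLS wrapping on both sides, using the standard `SslStream` (the `tcpClient`, `serverCertificate`, and `routerHostName` variables stand in for values coming from your configuration):
```csharp
// Gateway (server) side: wrap the accepted TcpClient's stream.
var sslServerStream = new SslStream(tcpClient.GetStream(), leaveInnerStreamOpen: false);
await sslServerStream.AuthenticateAsServerAsync(
    serverCertificate,
    clientCertificateRequired: true,   // set false if you don't want mutual TLS
    checkCertificateRevocation: true);

// Microservice (client) side: wrap the outgoing connection.
var sslClientStream = new SslStream(tcpClient.GetStream(), leaveInnerStreamOpen: false);
await sslClientStream.AuthenticateAsClientAsync(routerHostName); // validates the server certificate

// From here on, pass the SslStream to IFrameSerializer exactly like a NetworkStream.
```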
---
## 5. Implement plugin 3: UDP (small, nonstreaming)
UDP is only for small, bounded payloads. No streaming, besteffort delivery.
### 5.1 Constraints
* Use UDP **only** for buffered, small payload endpoints.
* No streaming (`SupportsStreaming` must be `false` for UDP endpoints).
* No guarantee of delivery or ordering; caller must tolerate occasional failures/timeouts.
### 5.2 Gateway: `UdpTransportServer`
Responsibilities:
* Listen on a UDP port.
* Parse each incoming datagram as a full `Frame`:
* `FrameType.Hello`:
* Register a “logical connection” keyed by `(remoteEndpoint)` and `InstanceId`.
* `FrameType.Heartbeat`:
* Update health for that logical connection.
* `FrameType.Response`:
* Use `CorrelationId` and “connectionId” to complete a `TaskCompletionSource` as with TCP.
Because UDP is connectionless, your `ConnectionId` can be:
* A composite of microservice identity + remote endpoint, e.g. `"{instanceId}@{ip}:{port}"`.
### 5.3 Gateway: `UdpTransportClient` (gateway-side)
`SendRequestAsync`:
* Serialize `Frame` to `byte[]`.
* Send via `UdpClient.SendAsync` to the remote endpoint from `ConnectionState`.
* Start a timer:
* Wait for `Response` datagram with matching `CorrelationId`.
* If none comes within timeout → throw `OperationCanceledException`.
`SendStreamingAsync`:
* For this first iteration, **throw NotSupportedException**.
* Router should not route streaming endpoints over UDP; your routing config should enforce that.
`SendCancelAsync`:
* Optionally send a CANCEL datagram; but in practice, if requests are small, this is less useful. You can still implement it for symmetry.
### 5.4 Microservice: UDP connection
For microservice side:
* A single `UdpClient` bound to a local port.
* For each configured router (host/port):
* HELLO: send a `FrameType.Hello` datagram.
* HEARTBEAT: send periodic `FrameType.Heartbeat`.
* REQUEST handling: the roles may differ from TCP here. For UDP you might decide the microservice listens on a local port and the gateway sends request datagrams directly to it; invert the roles as needed.
Given the complexity and limited utility, you can treat UDP as “advanced/optional transport” and implement it last.
---
## 6. Implement plugin 4: RabbitMQ
This is conceptually similar to what you had in Serdica.
### 6.1 Exchange/queue design
Decide and document (in `Protocol & Transport Specification.md`) something like:
* Exchange: `stella.router`
* Routing keys:
* `request.{serviceName}.{version}` — gateway → microservice.
* Microservice reply queue per instance: `reply.{serviceName}.{version}.{instanceId}`.
RabbitMQ usage:
* Gateway:
* Publishes REQUEST frames to `request.{serviceName}.{version}`.
* Consumes from `reply.*` for responses.
* Microservice:
* Consumes from `request.{serviceName}.{version}`.
* Publishes responses to its own reply queue; sets `CorrelationId` property.
### 6.2 Gateway: `RabbitMqTransportClient`
Implements `ITransportClient`:
* `SendRequestAsync`:
* Create a message with:
* Body = serialized `Frame` (REQUEST or buffered streaming).
* Properties:
* `CorrelationId` = `frame.CorrelationId`.
* `ReplyTo` = the microservice's reply queue name for this instance.
* Publish to `request.{serviceName}.{version}`.
* Await a response:
* Consumer on reply queue completes a `TaskCompletionSource<Frame>` keyed by correlation ID.
* `SendStreamingAsync`:
* For v1, you can:
* Only support buffered endpoints over RabbitMQ (like UDP).
* Or send chunked messages (`RequestStreamData` frames as separate messages) and reconstruct on microservice side.
* I'd recommend:
* Start with buffered only over RabbitMQ.
* Mark Rabbit as “no streaming support yet” in config.
* `SendCancelAsync`:
* Option 1: send a separate CANCEL message with same `CorrelationId`.
* Option 2: rely on the timeout; cancellation doesn't buy much given the overhead.
### 6.3 Microservice: RabbitMQ listener
* Single `IConnection` and `IModel`.
* Declare and bind:
* Service request queue: `request.{serviceName}.{version}`.
* Reply queue: `reply.{serviceName}.{version}.{instanceId}`.
* Consume request queue:
* On message:
* Deserialize `Frame`.
* Dispatch through `IEndpointDispatcher`.
* Publish RESPONSE message to `ReplyTo` queue with same `CorrelationId`.
If you already have RabbitMQ experience from Serdica, this should feel familiar.
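For reference, publishing a REQUEST with the classic `RabbitMQ.Client` (pre-7.x) API might look like this (a sketch; `SerializeFrame` stands in for whatever frame encoding the protocol spec settles on):
```csharp
// Gateway side: publish a REQUEST frame and tell the microservice where to reply.
var body = SerializeFrame(requestFrame); // assumed helper; same framing agreed in the protocol spec

var props = channel.CreateBasicProperties();
props.CorrelationId = requestFrame.CorrelationId.ToString();
props.ReplyTo = $"reply.{serviceName}.{version}.{instanceId}";

channel.BasicPublish(
    exchange: "stella.router",
    routingKey: $"request.{serviceName}.{version}",
    basicProperties: props,
    body: body);
```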
---
## 7. Routing config & transport selection
**Projects:** router config + microservice options
**Owner:** config / platform dev
You need to define which transport is actually used in production.
### 7.1 Gateway config (RouterConfig)
Per service/instance, store:
* `TransportType` to listen on / expect connections for.
* Ports / Rabbit URLs / TLS settings.
Example shape in `RouterConfig`:
```csharp
public sealed class ServiceInstanceConfig
{
public string ServiceName { get; set; } = string.Empty;
public string Version { get; set; } = string.Empty;
public string Region { get; set; } = string.Empty;
public TransportType TransportType { get; set; } = TransportType.Udp; // default
public int Port { get; set; } // for TCP/UDP/TLS
public string? RabbitConnectionString { get; set; }
// TLS info, etc.
}
```
`StellaOps.Gateway.WebService` startup:
* Reads these configs.
* Starts corresponding `ITransportServer` instances.
### 7.2 Microservice options
`StellaMicroserviceOptions.Routers` entries must define:
* `Host`
* `Port`
* `TransportType`
* Any transport-specific settings (TLS, Rabbit URL).
At connect time, microservice chooses:
* For each `RouterEndpointConfig`, instantiate the right connector:
```csharp
// IMicroserviceConnector is an assumed common interface for the per-transport connectors;
// constructor shapes are illustrative — the point is one connector type per TransportType.
IMicroserviceConnector connector = config.TransportType switch
{
    TransportType.Tcp => new TcpMicroserviceConnector(config),
    TransportType.Certificate => new TlsMicroserviceConnector(config),
    TransportType.Udp => new UdpMicroserviceConnector(config),
    TransportType.RabbitMq => new RabbitMqMicroserviceConnector(config),
    _ => throw new NotSupportedException($"Unsupported transport type: {config.TransportType}")
};
```
---
## 8. Implementation order & testing strategy
**Owner:** tech lead
Do NOT try to implement all at once. Suggested order:
1. **TCP**:
* Reuse InMemory test suite:
* HELLO + endpoint registration.
* REQUEST → RESPONSE.
* CANCEL.
* Heartbeats.
* (Optional) streaming as buffered stub for v1, then add genuine streaming.
2. **Certificate/TLS**:
* Wrap TCP logic in TLS.
* Same tests, plus:
* Certificate validation.
* Mutual TLS if required.
3. **RabbitMQ**:
* Start with buffered-only endpoints.
* Mirror existing InMemory/TCP tests where payloads are small.
* Add tests for connection resilience (reconnect, etc.).
4. **UDP**:
* Implement only for very small buffered requests; no streaming.
* Add tests that verify:
* HELLO + basic health.
* REQUEST → RESPONSE with small payload.
* Proper timeouts.
At each stage, tests for that plugin must reuse the **same microservice and gateway** code that worked with InMemory. Only the transport factories change.
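One way to enforce that reuse is an abstract conformance suite where only the harness factory differs per transport; a sketch (xUnit), with `ITransportTestHarness` / `TransportTestHarness` as hypothetical helpers that boot a gateway + microservice pair over the given transport, expose an `HttpClient`, and implement `IAsyncDisposable`:
```csharp
// Sketch only: one shared test suite, specialized per transport by overriding the harness factory.
public abstract class TransportConformanceTests
{
    protected abstract Task<ITransportTestHarness> CreateHarnessAsync();   // hypothetical helper

    [Fact]
    public async Task Request_response_roundtrip_works()
    {
        await using ITransportTestHarness harness = await CreateHarnessAsync();

        var response = await harness.GatewayClient.GetAsync("/ping");

        Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        Assert.Equal("pong", await response.Content.ReadAsStringAsync());
    }
}

public sealed class TcpTransportTests : TransportConformanceTests
{
    protected override Task<ITransportTestHarness> CreateHarnessAsync()
        => TransportTestHarness.CreateTcpAsync(port: 50050);   // hypothetical factory; add TLS/Rabbit/UDP variants
}
```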
---
## 9. Done criteria for “Implement real transport plugins one by one”
You can consider step 9 done when:
* There are **concrete implementations** of `ITransportServer` + `ITransportClient` for:
* TCP
* Certificate/TLS
* UDP (buffered only)
* RabbitMQ (buffered at minimum)
* Gateway startup:
* Reads `RouterConfig`.
* Starts appropriate transport servers per node/region.
* Microservice SDK:
* Reads `StellaMicroserviceOptions.Routers`.
* Connects to router nodes using the configured `TransportType`.
* Uses the same HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL semantics as InMemory.
* The same functional tests that passed for InMemory:
* Now pass with TCP plugin.
* At least a subset pass with TLS, Rabbit, and UDP, honoring their constraints (no streaming on UDP, etc.).
From there, you can move into hardening each plugin (reconnect, backoff, error handling) and documenting “which transport to use when” in your router docs.

docs/router/10-Step.md Normal file

@@ -0,0 +1,586 @@
For this step you're wiring **configuration** into the system properly:
* Router reads a strongly-typed config model (including payload limits, node region, transports).
* Microservices can optionally load a YAML file to **override** endpoint metadata discovered by reflection.
* No behavior changes to routing or transports, just how they get their settings.
Think “config plumbing and merging rules,” not new business logic.
---
## 0. Preconditions
Before starting, confirm:
* `__Libraries/StellaOps.Router.Config` project exists and references `StellaOps.Router.Common`.
* `StellaOps.Microservice` has:
* `StellaMicroserviceOptions` (ServiceName, Version, Region, InstanceId, Routers, ConfigFilePath).
* Reflection-based endpoint discovery that produces `EndpointDescriptor` instances.
* Gateway and microservices currently use **hardcoded** or stub config; you're about to replace that with real config.
---
## 1. Define RouterConfig model and YAML schema
**Project:** `__Libraries/StellaOps.Router.Config`
**Owner:** config / platform dev
### 1.1 C# model
Create clear, minimal models to cover current needs (you can extend later):
```csharp
namespace StellaOps.Router.Config;
public sealed class RouterConfig
{
public GatewayNodeConfig Node { get; set; } = new();
public PayloadLimits PayloadLimits { get; set; } = new();
public IList<TransportEndpointConfig> Transports { get; set; } = new List<TransportEndpointConfig>();
public IList<ServiceConfig> Services { get; set; } = new List<ServiceConfig>();
}
public sealed class GatewayNodeConfig
{
public string NodeId { get; set; } = string.Empty;
public string Region { get; set; } = string.Empty;
public string Environment { get; set; } = "prod";
}
public sealed class TransportEndpointConfig
{
public TransportType TransportType { get; set; }
public int Port { get; set; } // for TCP/UDP/TLS
public bool Enabled { get; set; } = true;
// TLS-specific
public string? ServerCertificatePath { get; set; }
public string? ServerCertificatePassword { get; set; }
public bool RequireClientCertificate { get; set; }
// Rabbit-specific
public string? RabbitConnectionString { get; set; }
}
public sealed class ServiceConfig
{
public string Name { get; set; } = string.Empty;
public string DefaultVersion { get; set; } = "1.0.0";
public IList<string> NeighborRegions { get; set; } = new List<string>();
}
```
Use the `PayloadLimits` class from Common (or move it here); either way, keep a single definition.
### 1.2 YAML shape
Decide and document a YAML layout, e.g.:
```yaml
node:
nodeId: "gw-eu1-01"
region: "eu1"
environment: "prod"
payloadLimits:
maxRequestBytesPerCall: 10485760 # 10 MB
maxRequestBytesPerConnection: 52428800
maxAggregateInflightBytes: 209715200
transports:
- transportType: Tcp
port: 45000
enabled: true
- transportType: Certificate
port: 45001
enabled: false
serverCertificatePath: "certs/router.pfx"
serverCertificatePassword: "secret"
- transportType: Udp
port: 45002
enabled: true
- transportType: RabbitMq
enabled: true
rabbitConnectionString: "amqp://guest:guest@localhost:5672"
services:
- name: "Billing"
defaultVersion: "1.0.0"
neighborRegions: ["eu2", "us1"]
- name: "Identity"
defaultVersion: "2.1.0"
neighborRegions: ["eu2"]
```
This YAML is the canonical config for the router; environment variables and JSON can override individual properties later via `IConfiguration`.
---
## 2. Implement Router.Config loader and DI extensions
**Project:** `StellaOps.Router.Config`
### 2.1 Choose YAML library
Add a YAML library (e.g. YamlDotNet) to `StellaOps.Router.Config`:
```bash
dotnet add src/__Libraries/StellaOps.Router.Config/StellaOps.Router.Config.csproj package YamlDotNet
```
### 2.2 Implement simple loader
Provide a helper that can load YAML into `RouterConfig`:
```csharp
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

public static class RouterConfigLoader
{
    public static RouterConfig LoadFromYaml(string path)
    {
        using var reader = new StreamReader(path);

        // Bind the camelCase YAML keys (nodeId, payloadLimits, ...) onto the PascalCase model.
        var deserializer = new DeserializerBuilder()
            .WithNamingConvention(CamelCaseNamingConvention.Instance)
            .IgnoreUnmatchedProperties()
            .Build();

        return deserializer.Deserialize<RouterConfig>(reader);
    }
}
```
Alternatively, walk the YAML nodes, convert them to JSON, and deserialize with `System.Text.Json`; the detail is implementation-specific.
### 2.3 ASP.NET Core integration extension
In the router library, add a DI extension the gateway can call:
```csharp
public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddRouterConfig(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        // Binds the "Router" section; IOptionsMonitor<RouterConfig> is registered by Configure,
        // so components get hot-reloaded values without any extra registration.
        services.Configure<RouterConfig>(configuration.GetSection("Router"));
        return services;
    }
}
```
Gateway will:
* Add the YAML file to the configuration builder.
* Call `AddRouterConfig` to bind it.
---
## 3. Wire RouterConfig into Gateway startup & components
**Project:** `StellaOps.Gateway.WebService`
**Owner:** gateway dev
### 3.1 Program.cs configuration
Adjust `Program.cs`:
```csharp
var builder = WebApplication.CreateBuilder(args);
// add YAML config
builder.Configuration
.AddJsonFile("appsettings.json", optional: true)
.AddYamlFile("router.yaml", optional: false, reloadOnChange: true)
.AddEnvironmentVariables("STELLAOPS_");
// bind RouterConfig
builder.Services.AddRouterConfig(builder.Configuration);
var app = builder.Build();
```
Key points:
* `AddYamlFile("router.yaml", reloadOnChange: true)` ensures hot-reload from YAML. Note that `AddYamlFile` is not built into `Microsoft.Extensions.Configuration`; pull in a YAML configuration provider package (e.g. NetEscapades.Configuration.Yaml).
* `AddEnvironmentVariables("STELLAOPS_")` allows env-based overrides (optional, but useful).
* Because `AddRouterConfig` binds the `Router` section, either nest the YAML content under a top-level `router:` key or bind `RouterConfig` from the configuration root; pick one and document it.
### 3.2 Inject config into transport factories and routing
Where you start transports:
* Inject `IOptionsMonitor<RouterConfig>` into your `ITransportServerFactory`, and use `RouterConfig.Transports` to know which servers to create and on which ports.
Where you need node identity:
* Inject `IOptionsMonitor<RouterConfig>` into any service needing `GatewayNodeConfig` (e.g. when building `RoutingContext.GatewayRegion`):
```csharp
var nodeRegion = routerConfig.CurrentValue.Node.Region;
```
Where you need payload limits:
* Inject `IOptionsMonitor<RouterConfig>` into `IPayloadBudget` or `TransportDispatchMiddleware` to fetch current `PayloadLimits`.
Because you're using `IOptionsMonitor`, components can react to changes when `router.yaml` is modified.
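For example, a payload-limit check can read `CurrentValue` on every request instead of caching it at construction time, so edits to `router.yaml` take effect without a restart (sketch; the actual limit enforcement is simplified):
```csharp
// Sketch only: per-request read of PayloadLimits via IOptionsMonitor, so YAML hot-reload is picked up.
public sealed class PayloadLimitMiddleware
{
    private readonly RequestDelegate _next;
    private readonly IOptionsMonitor<RouterConfig> _routerConfig;

    public PayloadLimitMiddleware(RequestDelegate next, IOptionsMonitor<RouterConfig> routerConfig)
    {
        _next = next;
        _routerConfig = routerConfig;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var limits = _routerConfig.CurrentValue.PayloadLimits;   // re-evaluated on every request

        if (context.Request.ContentLength is long length &&
            length > limits.MaxRequestBytesPerCall)
        {
            context.Response.StatusCode = StatusCodes.Status413PayloadTooLarge;
            return;
        }

        await _next(context);
    }
}
```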
---
## 4. Microservice YAML: schema & loader
**Project:** `__Libraries/StellaOps.Microservice`
**Owner:** SDK dev
Microservice YAML is optional and used **only** to override endpoint metadata, not to define identity or router pool.
### 4.1 Define YAML shape
Keep it focused on endpoints and overrides:
```yaml
service:
serviceName: "Billing"
version: "1.0.0"
region: "eu1"
endpoints:
- method: "POST"
path: "/billing/invoices/upload"
defaultTimeout: "00:02:00"
supportsStreaming: true
requiringClaims:
- type: "role"
value: "billing-editor"
- method: "GET"
path: "/billing/invoices/{id}"
defaultTimeout: "00:00:10"
requiringClaims:
- type: "role"
value: "billing-reader"
```
Identity (`serviceName`, `version`, `region`) in YAML is **informative**; the authoritative values still come from `StellaMicroserviceOptions`. If they differ, log a warning, but don't override the options from YAML.
### 4.2 C# model
In `StellaOps.Microservice`:
```csharp
internal sealed class MicroserviceYamlConfig
{
public MicroserviceYamlService? Service { get; set; }
public IList<MicroserviceYamlEndpoint> Endpoints { get; set; } = new List<MicroserviceYamlEndpoint>();
}
internal sealed class MicroserviceYamlService
{
public string? ServiceName { get; set; }
public string? Version { get; set; }
public string? Region { get; set; }
}
internal sealed class MicroserviceYamlEndpoint
{
public string Method { get; set; } = string.Empty;
public string Path { get; set; } = string.Empty;
public string? DefaultTimeout { get; set; }
public bool? SupportsStreaming { get; set; }
public IList<ClaimRequirement> RequiringClaims { get; set; } = new List<ClaimRequirement>();
}
```
### 4.3 YAML loader
Reuse YamlDotNet (add package to `StellaOps.Microservice` if needed):
```csharp
internal interface IMicroserviceYamlLoader
{
MicroserviceYamlConfig? Load(string? path);
}
internal sealed class MicroserviceYamlLoader : IMicroserviceYamlLoader
{
private readonly ILogger<MicroserviceYamlLoader> _logger;
public MicroserviceYamlLoader(ILogger<MicroserviceYamlLoader> logger)
{
_logger = logger;
}
public MicroserviceYamlConfig? Load(string? path)
{
if (string.IsNullOrWhiteSpace(path) || !File.Exists(path))
return null;
try
{
using var reader = new StreamReader(path);
            var deserializer = new DeserializerBuilder()
                .WithNamingConvention(CamelCaseNamingConvention.Instance)   // camelCase YAML keys → PascalCase properties
                .Build();
return deserializer.Deserialize<MicroserviceYamlConfig>(reader);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to load microservice YAML from {Path}", path);
return null;
}
}
}
```
Register in DI:
```csharp
services.AddSingleton<IMicroserviceYamlLoader, MicroserviceYamlLoader>();
```
---
## 5. Merge YAML overrides with reflection-discovered endpoints
**Project:** `StellaOps.Microservice`
**Owner:** SDK dev
Extend `EndpointCatalog` to apply YAML overrides.
### 5.1 Extend constructor to accept YAML config
Adjust `EndpointCatalog`:
```csharp
internal sealed class EndpointCatalog : IEndpointCatalog
{
public IReadOnlyList<EndpointDescriptor> Descriptors { get; }
    private readonly Dictionary<string, EndpointRegistration> _map;   // keyed on "METHOD PATH", case-insensitive
public EndpointCatalog(
IEndpointDiscovery discovery,
IMicroserviceYamlLoader yamlLoader,
IOptions<StellaMicroserviceOptions> optionsAccessor)
{
var options = optionsAccessor.Value;
var registrations = discovery.DiscoverEndpoints(options);
var yamlConfig = yamlLoader.Load(options.ConfigFilePath);
registrations = ApplyYamlOverrides(registrations, yamlConfig);
        // StringComparer cannot serve as a comparer for tuple keys, so key on "METHOD PATH" instead.
        _map = registrations.ToDictionary(
            r => $"{r.Descriptor.Method} {r.Descriptor.Path}",
            r => r,
            StringComparer.OrdinalIgnoreCase);
Descriptors = registrations.Select(r => r.Descriptor).ToArray();
}
}
```
### 5.2 Implement `ApplyYamlOverrides`
Key rules:
* Identity (ServiceName, Version, Region) always come from `StellaMicroserviceOptions`.
* YAML can override:
* `DefaultTimeout`
* `SupportsStreaming`
* `RequiringClaims`
Implementation sketch:
```csharp
private static IReadOnlyList<EndpointRegistration> ApplyYamlOverrides(
IReadOnlyList<EndpointRegistration> registrations,
MicroserviceYamlConfig? yaml)
{
if (yaml is null || yaml.Endpoints.Count == 0)
return registrations;
    var overrideMap = yaml.Endpoints.ToDictionary(
        e => $"{e.Method} {e.Path}",
        e => e,
        StringComparer.OrdinalIgnoreCase);
var result = new List<EndpointRegistration>(registrations.Count);
foreach (var reg in registrations)
{
        if (!overrideMap.TryGetValue($"{reg.Descriptor.Method} {reg.Descriptor.Path}", out var ov))
{
result.Add(reg);
continue;
}
var desc = reg.Descriptor;
var timeout = desc.DefaultTimeout;
if (!string.IsNullOrWhiteSpace(ov.DefaultTimeout) &&
TimeSpan.TryParse(ov.DefaultTimeout, out var parsed))
{
timeout = parsed;
}
var supportsStreaming = desc.SupportsStreaming;
if (ov.SupportsStreaming.HasValue)
{
supportsStreaming = ov.SupportsStreaming.Value;
}
var requiringClaims = ov.RequiringClaims.Count > 0
? ov.RequiringClaims.ToArray()
: desc.RequiringClaims;
var overriddenDescriptor = new EndpointDescriptor
{
ServiceName = desc.ServiceName,
Version = desc.Version,
Method = desc.Method,
Path = desc.Path,
DefaultTimeout = timeout,
SupportsStreaming = supportsStreaming,
RequiringClaims = requiringClaims
};
result.Add(new EndpointRegistration
{
Descriptor = overriddenDescriptor,
HandlerType = reg.HandlerType
});
}
return result;
}
```
This ensures code defines the set of endpoints; YAML only tunes metadata.
---
## 6. Hot-reload / YAML change handling
**Router side:** you already enabled `reloadOnChange` for `router.yaml`, and use `IOptionsMonitor<RouterConfig>`. Next:
* Components that care about changes must **react**:
* Payload limits:
* `IPayloadBudget` or `TransportDispatchMiddleware` should read `routerConfig.CurrentValue.PayloadLimits` on each request rather than caching.
* Node region:
* `RoutingContext.GatewayRegion` can be built from `routerConfig.CurrentValue.Node.Region` per request.
You do **not** need a custom watcher; `IOptionsMonitor` already tracks config changes.
**Microservice side:** for now you can start with **load-on-startup** YAML. If you want hot-reload:
* Implement a FileSystemWatcher in `MicroserviceYamlLoader` or a small `IHostedService`:
* Watch `options.ConfigFilePath` for changes.
* On change:
* Reload YAML.
* Rebuild `EndpointDescriptor` list.
* Send an updated HELLO or an ENDPOINTS_UPDATE frame to router.
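If you do implement the watcher, a minimal sketch could look like this; `reloadEndpointsAsync` stands in for the hypothetical hook that rebuilds the descriptors and re-sends HELLO, and debouncing plus error handling are omitted:
```csharp
// Sketch only: watch the microservice YAML file and trigger a reload callback on change.
internal sealed class MicroserviceYamlWatcher : IHostedService, IDisposable
{
    private readonly StellaMicroserviceOptions _options;
    private readonly Func<CancellationToken, Task> _reloadEndpointsAsync;   // hypothetical hook
    private FileSystemWatcher? _watcher;

    public MicroserviceYamlWatcher(IOptions<StellaMicroserviceOptions> options,
                                   Func<CancellationToken, Task> reloadEndpointsAsync)
    {
        _options = options.Value;
        _reloadEndpointsAsync = reloadEndpointsAsync;
    }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        var path = _options.ConfigFilePath;
        if (string.IsNullOrWhiteSpace(path) || !File.Exists(path))
            return Task.CompletedTask;

        _watcher = new FileSystemWatcher(Path.GetDirectoryName(Path.GetFullPath(path))!, Path.GetFileName(path))
        {
            NotifyFilter = NotifyFilters.LastWrite,
            EnableRaisingEvents = true
        };
        _watcher.Changed += async (_, _) => await _reloadEndpointsAsync(CancellationToken.None);
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;

    public void Dispose() => _watcher?.Dispose();
}
```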
Given complexity, you can postpone true hot reload to a later iteration and document that microservices must be restarted to pick up YAML changes.
---
## 7. Tests
**Router.Config tests:**
* Unit tests for `RouterConfigLoader`:
* Given a YAML string, bind to `RouterConfig` properly.
* Validate `TransportType.Tcp` / `Udp` / `RabbitMq` values map correctly.
* Integration test:
* Start gateway with `router.yaml`.
* Access `IOptionsMonitor<RouterConfig>` in a test controller or test service and assert values.
* Modify YAML on disk (if test infra allows) and ensure values update via `IOptionsMonitor`.
**Microservice YAML tests:**
* Unit tests for `MicroserviceYamlLoader`:
* Load valid YAML, confirm endpoints and claims/timeouts parsed.
* `EndpointCatalog` tests:
* Build fake `EndpointRegistration` list from reflection.
* Build YAML overrides.
* Call `ApplyYamlOverrides` and assert:
* Timeouts updated.
* SupportsStreaming updated.
* RequiringClaims replaced where provided.
* Descriptors with no matching YAML remain unchanged.
---
## 8. Documentation updates
Update docs under `docs/router`:
1. **Stella Ops Router Webserver.md**:
* Describe `router.yaml`:
* Node config (region, nodeId).
* PayloadLimits.
* Transports.
* Explain precedence:
* YAML as base.
* Environment variables can override individual fields via `STELLAOPS_Router__Node__Region` etc.
2. **Stella Ops Router Microservice.md**:
* Explain `ConfigFilePath` in `StellaMicroserviceOptions`.
* Show full example microservice YAML and how it maps to endpoint metadata.
* Clearly state:
* Identity comes from options (code/config), not YAML.
* YAML can override per-endpoint timeout, streaming flag, requiringClaims.
* YAML can't add endpoints that don't exist in code.
3. **Stella Ops Router Documentation.md**:
* Add a short “Configuration” chapter:
* Where `router.yaml` lives.
* Where microservice YAML lives.
* How to run locally with custom configs.
---
## 9. Done criteria for “Add Router.Config + Microservice YAML integration”
You can call step 10 complete when:
* Router:
* Loads `router.yaml` into `RouterConfig` using `StellaOps.Router.Config`.
* Uses `RouterConfig.Node.Region` when building routing context.
* Uses `RouterConfig.PayloadLimits` for payload budget enforcement.
* Uses `RouterConfig.Transports` to start the right `ITransportServer` instances.
* Supports runtime changes to `router.yaml` via `IOptionsMonitor` for at least node identity and payload limits.
* Microservice:
* Accepts optional `ConfigFilePath` in `StellaMicroserviceOptions`.
* Loads YAML (when present) and merges overrides into reflection-discovered endpoints.
* Sends HELLO with the **merged** descriptors (i.e., YAML-aware defaults).
* Behavior remains unchanged when no YAML is provided (pure reflection mode).
* Tests:
* Confirm config binding for router and microservice.
* Confirm YAML overrides are applied correctly to endpoint metadata.
At that point, configuration is no longer hardcoded, and you have a clear, documented path for both router operators and microservice teams to configure behavior via YAML with predictable precedence.

docs/router/11-Step.md Normal file

@@ -0,0 +1,550 @@
Goal for this step: have a **concrete, runnable example** (gateway + one microservice) and a **clear skeleton** for migrating any existing `StellaOps.*.WebService` into `StellaOps.*.Microservice`. After this, devs should be able to:
* Run a full vertical slice locally.
* Open a “migration cookbook” and follow a predictable recipe.
I'll split it into two tracks: reference example, then migration skeleton.
---
## 1. Reference example: “Billing” vertical slice
### 1.1 Create the sample microservice project
**Project:** `src/StellaOps.Billing.Microservice`
**Owner:** feature/example dev
Tasks:
1. Create the project:
```bash
cd src
dotnet new worker -n StellaOps.Billing.Microservice
```
2. Add references:
```bash
dotnet add StellaOps.Billing.Microservice/StellaOps.Billing.Microservice.csproj reference \
__Libraries/StellaOps.Microservice/StellaOps.Microservice.csproj
dotnet add StellaOps.Billing.Microservice/StellaOps.Billing.Microservice.csproj reference \
__Libraries/StellaOps.Router.Common/StellaOps.Router.Common.csproj
```
3. In `Program.cs`, wire the SDK with **InMemory transport** for now:
```csharp
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddStellaMicroservice(opts =>
{
opts.ServiceName = "Billing";
opts.Version = "1.0.0";
opts.Region = "eu1";
opts.InstanceId = $"billing-{Environment.MachineName}";
opts.Routers.Add(new RouterEndpointConfig
{
Host = "localhost",
        Port = 50050, // to match the gateway's InMemory/TCP harness
TransportType = TransportType.Tcp
});
opts.ConfigFilePath = "billing.microservice.yaml"; // optional overrides
});
var app = builder.Build();
await app.RunAsync();
```
(You can keep `TransportType` as TCP even if implemented in-process for now; once real TCP is in, nothing changes here.)
---
### 1.2 Implement a few canonical endpoints
Pick 3-4 endpoints that exercise different features:
1. **Health / contract check**
```csharp
[StellaEndpoint("GET", "/ping")]
public sealed class PingEndpoint : IRawStellaEndpoint
{
public Task<RawResponse> HandleAsync(RawRequestContext ctx)
{
var resp = new RawResponse { StatusCode = 200 };
resp.Headers["Content-Type"] = "text/plain";
resp.WriteBodyAsync = async stream =>
{
await stream.WriteAsync("pong"u8.ToArray(), ctx.CancellationToken);
};
return Task.FromResult(resp);
}
}
```
2. **Simple JSON read/write (non-streaming)**
```csharp
public sealed record CreateInvoiceRequest(string CustomerId, decimal Amount);
public sealed record CreateInvoiceResponse(Guid Id);
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, CreateInvoiceResponse>
{
public Task<CreateInvoiceResponse> HandleAsync(CreateInvoiceRequest req, CancellationToken ct)
{
// pretend to store in DB
return Task.FromResult(new CreateInvoiceResponse(Guid.NewGuid()));
}
}
```
3. **Streaming upload (large file)**
```csharp
[StellaEndpoint("POST", "/billing/invoices/upload")]
public sealed class InvoiceUploadEndpoint : IRawStellaEndpoint
{
public async Task<RawResponse> HandleAsync(RawRequestContext ctx)
{
var buffer = new byte[64 * 1024];
var total = 0L;
int read;
while ((read = await ctx.Body.ReadAsync(buffer.AsMemory(0, buffer.Length), ctx.CancellationToken)) > 0)
{
total += read;
// process chunk or write to temp file
}
var resp = new RawResponse { StatusCode = 200 };
resp.Headers["Content-Type"] = "application/json";
resp.WriteBodyAsync = async stream =>
{
var json = $"{{\"bytesReceived\":{total}}}";
await stream.WriteAsync(System.Text.Encoding.UTF8.GetBytes(json), ctx.CancellationToken);
};
return resp;
}
}
```
This gives devs examples of:
* Raw endpoint (`/ping`, `/upload`).
* Typed endpoint (`/billing/invoices`).
* Streaming usage (`Body.ReadAsync`).
---
### 1.3 Microservice YAML override example
**File:** `src/StellaOps.Billing.Microservice/billing.microservice.yaml`
```yaml
endpoints:
- method: GET
path: /ping
    defaultTimeout: "00:00:02"
- method: POST
path: /billing/invoices
    defaultTimeout: "00:00:05"
supportsStreaming: false
requiringClaims:
- type: role
value: BillingWriter
- method: POST
path: /billing/invoices/upload
    defaultTimeout: "00:02:00"
supportsStreaming: true
requiringClaims:
- type: role
value: BillingUploader
```
This file demonstrates:
* Timeout override.
* Streaming flag.
* `RequiringClaims` usage.
---
### 1.4 Gateway example config for Billing
**File:** `config/router.billing.yaml` (for local dev)
```yaml
nodeId: "gw-dev-01"
region: "eu1"
payloadLimits:
maxRequestBytesPerCall: 10485760 # 10 MB
maxRequestBytesPerConnection: 52428800 # 50 MB
maxAggregateInflightBytes: 209715200 # 200 MB
services:
- name: "Billing"
defaultVersion: "1.0.0"
endpoints:
- method: "GET"
path: "/ping"
# router defaults, if any
- method: "POST"
path: "/billing/invoices"
defaultTimeout: "00:00:05"
requiringClaims:
- type: "role"
value: "BillingWriter"
- method: "POST"
path: "/billing/invoices/upload"
defaultTimeout: "00:02:00"
supportsStreaming: true
requiringClaims:
- type: "role"
value: "BillingUploader"
```
This lets you show precedence:
* Reflection → microservice YAML → router YAML.
---
### 1.5 Gateway wiring for the example
**Project:** `StellaOps.Gateway.WebService`
In `Program.cs`:
1. Load router config and point it to `router.billing.yaml` for dev:
```csharp
builder.Configuration
.AddJsonFile("appsettings.json", optional: true)
.AddEnvironmentVariables(prefix: "STELLAOPS_");
builder.Services.AddOptions<RouterConfig>()
.Configure<IConfiguration>((cfg, configuration) =>
{
configuration.GetSection("Router").Bind(cfg);
var yamlPath = configuration["Router:YamlPath"] ?? "config/router.billing.yaml";
if (File.Exists(yamlPath))
{
        var yamlCfg = RouterConfigLoader.LoadFromYaml(yamlPath);
// either cfg = yamlCfg (if you treat YAML as source of truth)
OverlayRouterConfig(cfg, yamlCfg);
}
});
builder.Services.AddOptions<GatewayNodeConfig>()
.Configure<IOptions<RouterConfig>>((node, routerCfg) =>
{
var cfg = routerCfg.Value;
        node.NodeId = cfg.Node.NodeId;
        node.Region = cfg.Node.Region;
});
```
2. Ensure you start the appropriate transport server (for dev, TCP on localhost:50050):
* From `RouterConfig.Transports` or a dev shortcut, start the TCP server listening on that port.
3. HTTP pipeline:
* `EndpointResolutionMiddleware`
* `RoutingDecisionMiddleware`
* `TransportDispatchMiddleware`
Now your dev loop is:
* Run `StellaOps.Gateway.WebService`.
* Run `StellaOps.Billing.Microservice`.
* `curl http://localhost:{gatewayPort}/ping` → should go through gateway to microservice and back.
* Similarly for `/billing/invoices` and `/billing/invoices/upload`.
---
### 1.6 Example documentation
Create `docs/router/examples/Billing.Sample.md`:
* “How to run the example”:
* build solution
* `dotnet run` for gateway
* `dotnet run` for Billing microservice
* Show sample `curl` commands:
* `curl http://localhost:8080/ping`
* `curl -X POST http://localhost:8080/billing/invoices -d '{"customerId":"C1","amount":123.45}'`
* `curl -X POST http://localhost:8080/billing/invoices/upload --data-binary @bigfile.bin`
* Note where config files live and how to change them.
This becomes your canonical reference for new teams.
---
## 2. Migration skeleton: from WebService to Microservice
Now that you have a working example, you need a **repeatable recipe** for migrating any existing `StellaOps.*.WebService` into the microservice router model.
### 2.1 Define the migration target shape
For each webservice you migrate, you want:
* A new project: `StellaOps.{Domain}.Microservice`.
* Shared domain logic extracted into a library (if not already): `StellaOps.{Domain}.Core` or similar.
* Controllers → endpoint classes:
* `Controller` methods ⇨ `[StellaEndpoint]`-annotated types.
* `HttpGet/HttpPost` attributes ⇨ `Method` and `Path` pair.
* Configuration:
* WebService's appsettings routes → microservice YAML + router YAML.
* Authentication/authorization → `RequiringClaims` in endpoint metadata.
Document this target shape in `docs/router/Migration of Webservices to Microservices.md`.
---
### 2.2 Skeleton microservice template
Create a **generic** microservice skeleton that any team can copy:
**Project:** `templates/StellaOps.Template.Microservice` or at least a folder `samples/MigrationSkeleton/`.
Contents:
* `Program.cs`:
```csharp
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddStellaMicroservice(opts =>
{
opts.ServiceName = "{DomainName}";
opts.Version = "1.0.0";
opts.Region = "eu1";
opts.InstanceId = "{DomainName}-" + Environment.MachineName;
// Mandatory router pool configuration
opts.Routers.Add(new RouterEndpointConfig
{
Host = "localhost", // or injected via env
Port = 50050,
TransportType = TransportType.Tcp
});
opts.ConfigFilePath = $"{DomainName}.microservice.yaml";
});
// domain DI (reuse existing domain services from WebService)
// builder.Services.AddDomainServices();
var app = builder.Build();
await app.RunAsync();
```
* A sample endpoint mapping from a typical WebService controller method:
Legacy controller:
```csharp
[ApiController]
[Route("api/billing/invoices")]
public class InvoicesController : ControllerBase
{
    private readonly IInvoiceService _service;

    public InvoicesController(IInvoiceService service)
    {
        _service = service;
    }

    [HttpPost]
    [Authorize(Roles = "BillingWriter")]
    public async Task<ActionResult<InvoiceDto>> Create(CreateInvoiceRequest request)
    {
        var result = await _service.Create(request);
        return Ok(result);
    }
}
```
Microservice endpoint:
```csharp
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, InvoiceDto>
{
private readonly IInvoiceService _service;
public CreateInvoiceEndpoint(IInvoiceService service)
{
_service = service;
}
public Task<InvoiceDto> HandleAsync(CreateInvoiceRequest request, CancellationToken ct)
{
return _service.Create(request, ct);
}
}
```
And matching YAML:
```yaml
endpoints:
- method: POST
path: /billing/invoices
    defaultTimeout: "00:00:05"
requiringClaims:
- type: role
value: BillingWriter
```
This skeleton demonstrates the mapping clearly.
---
### 2.3 Migration workflow for a team (per service)
Put this as a checklist in `Migration of Webservices to Microservices.md`:
1. **Inventory existing HTTP surface**
* List all controllers and actions with:
* HTTP method.
* Route template (full path).
* Auth attributes (`[Authorize(Roles=..)]` or policies).
* Whether the action handles large uploads/downloads.
2. **Create microservice project**
* Add `StellaOps.{Domain}.Microservice` using the skeleton.
* Reference domain logic project (`StellaOps.{Domain}.Core`), or extract one if necessary.
3. **Map each controller action → endpoint**
For each action:
* Create an endpoint class in the microservice:
* `IRawStellaEndpoint` for:
* Large payloads.
* Very custom body handling.
* `IStellaEndpoint<TRequest,TResponse>` for standard JSON APIs.
* Use `[StellaEndpoint("{METHOD}", "{PATH}")]` matching the existing route.
4. **Wire domain services & auth**
* Register the same domain services the WebService used (DB contexts, repositories, etc.).
* Translate role/claim-based `[Authorize]` usage to microservice YAML `RequiringClaims`.
5. **Create microservice YAML**
* For each new endpoint:
* Define default timeout.
* `supportsStreaming: true` where appropriate.
* `requiringClaims` matching prior auth requirements.
6. **Update router YAML**
* Add service entry under `services`:
* `name: "{Domain}"`.
* `defaultVersion: "1.0.0"`.
* Add endpoints (method/path, router-side overrides if needed).
7. **Smoke-test locally**
* Run gateway + microservice side-by-side.
* Hit the same URLs via gateway that previously were served by the WebService directly.
* Compare behavior (status codes, semantics) with existing environment.
8. **Gradual rollout**
Strategy options:
* **Proxy mode**:
* Keep WebService behind gateway for a while.
* Add router endpoints that proxy to existing WebService (via HTTP) while microservice matures.
* Gradually switch endpoints to microservice once stable.
* **Blue/green**:
* Run WebService and Microservice in parallel.
* Route a small percentage of traffic to microservice via router.
* Increase gradually.
Outline these as patterns in the migration doc, but keep them high-level here.
---
### 2.4 Migration skeleton repository structure
Add a clear place in repo for skeleton code & docs:
```text
/docs
/router
Migration of Webservices to Microservices.md
examples/
Billing.Sample.md
/samples
/Billing
StellaOps.Billing.Microservice/ # full example project
router.billing.yaml # example router config
/MigrationSkeleton
StellaOps.Template.Microservice/ # template project
example-controller-mapping.md # before/after snippet
```
The **skeleton** project should:
* Compile.
* Contain TODO markers where teams fill in domain pieces.
* Be referenced in the migration doc so people know where to look.
---
### 2.5 Tests to make the reference stick
Add a minimal test suite around the Billing example:
* **Integration tests** in `tests/StellaOps.Billing.IntegrationTests`:
* Start gateway + Billing microservice (using in-memory test host or docker-compose).
* `GET /ping` returns 200 and “pong”.
* `POST /billing/invoices` returns 200 with a JSON body containing an `id`.
* `POST /billing/invoices/upload` with a large payload succeeds and reports `bytesReceived`.
* Use these tests as a reference for future services: they show how to spin up a microservice + gateway in tests.
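A sketch of what those tests can look like, assuming a hypothetical `BillingGatewayFixture` that boots the gateway plus the Billing microservice and exposes a `GatewayClient` (`HttpClient`) pointed at the gateway:
```csharp
// Sketch only: xUnit integration tests against the gateway; the fixture does the host wiring.
public sealed class BillingEndpointTests : IClassFixture<BillingGatewayFixture>
{
    private readonly HttpClient _client;

    public BillingEndpointTests(BillingGatewayFixture fixture) => _client = fixture.GatewayClient;

    [Fact]
    public async Task Create_invoice_returns_an_id()
    {
        var response = await _client.PostAsJsonAsync("/billing/invoices",
            new CreateInvoiceRequest("C1", 123.45m));

        response.EnsureSuccessStatusCode();
        var body = await response.Content.ReadFromJsonAsync<CreateInvoiceResponse>();
        Assert.NotEqual(Guid.Empty, body!.Id);
    }

    [Fact]
    public async Task Upload_reports_bytes_received()
    {
        using var payload = new ByteArrayContent(new byte[5 * 1024 * 1024]);   // 5 MB dummy body

        var response = await _client.PostAsync("/billing/invoices/upload", payload);

        response.EnsureSuccessStatusCode();
        Assert.Contains("bytesReceived", await response.Content.ReadAsStringAsync());
    }
}
```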
---
## 3. Done criteria for step 11
You can treat “Build a reference example + migration skeleton” as complete when:
* `StellaOps.Billing.Microservice` exists, runs, and successfully serves requests through the gateway using your real transport (or InMemory/TCP for dev).
* `router.billing.yaml` plus `billing.microservice.yaml` show config patterns for:
* timeouts
* streaming
* requiringClaims
* `docs/router/examples/Billing.Sample.md` explains how to run and test the example.
* `Migration of Webservices to Microservices.md` contains:
* A concrete mapping example (controller → endpoint + YAML).
* A step-by-step migration checklist for teams.
* Pointers to the skeleton project and sample configs.
* A template microservice project exists (`StellaOps.Template.Microservice` or equivalent) that teams can copy to bootstrap new services.
Once you have this, onboarding new domains and migrating old WebServices stops being an ad-hoc effort and becomes a repeatable, documented process.

docs/router/12-Step.md Normal file

@@ -0,0 +1,415 @@
Below is how I'd tell your dev agents to operate on this codebase so it doesn't turn into chaos over time.
Think of this as the “rules of engagement” for Stella Ops Router.
---
## 1. Non-negotiable operating principles
All agents follow these rules:
1. **Specs are law**
* `docs/router/specs.md` is the primary source of truth.
* If code and spec differ:
* Fix the spec **first** (in a PR), then adjust the code.
* No “quick fixes” that contradict the spec.
2. **Common & protocol are sacred**
* `StellaOps.Router.Common` and the wire protocol (Frame/FrameType/serialization) are stable layers.
* Any change to:
* `Frame`, `FrameType`
* `EndpointDescriptor`, `ConnectionState`
* `ITransportClient` / `ITransportServer`
* …requires:
* Explicit spec update.
* Compatibility consideration.
* Code review by someone thinking about all transports and both sides (gateway + microservice).
3. **InMemory first, then real transports**
* New protocol semantics (e.g., new frame type, new behavior, new timeout rules) MUST:
1. Be implemented and proven with InMemory.
2. Have tests passing with InMemory.
3. Only then be rolled into TCP/TLS/UDP/RabbitMQ.
4. **No backdoor HTTP between microservices and router**
* Microservices must never talk HTTP to the router for control plane or data.
* All microservice-to-router traffic goes through the registered transports (UDP/TCP/TLS/RabbitMQ) using `Frame`.
5. **Method + Path = contract**
* Endpoint identity is always: `HTTP Method + Path`, nothing else.
* No “dynamic” routing hacks that bypass the `(Method, Path)` resolution.
---
## 2. How agents should structure work (vertical slices, not scattered edits)
Whenever you assign work, agents should:
1. **Work in vertical slices**
* Example slice: “Cancellation with InMemory”, “Streaming + payload limits with TCP”, “RabbitMQ buffered requests”.
* Each slice includes:
* Spec amendments (if needed).
* Common contracts (if needed).
* Implementation (gateway + microservice + transport).
* Tests.
2. **Avoid cross-cutting, half-finished changes**
* Do not:
* Change Common, start on TCP, then get bored and leave InMemory broken.
* Do:
* Finish one vertical slice end-to-end, then move on.
3. **Keep changes small and reviewable**
* Prefer:
* One PR for “add YAML overrides merging”.
* Another PR for "add router YAML hot-reload details".
* Avoid huge omnibus PRs that change protocol, transports, router, and microservice in one go.
---
## 3. Change categories & review rules
Agents should classify their work by category and obey the review level.
1. **Category A: Protocol / Common changes**
* Affects:
* `Frame`, `FrameType`, payload DTOs.
* `EndpointDescriptor`, `ConnectionState`, `RoutingDecision`.
* `ITransportClient`, `ITransportServer`.
* Requirements:
* Spec change with rationale.
* Crossside impact analysis: gateway + microservice + all transports.
* Tests updated for InMemory and at least one real transport.
* Review: 2+ reviewers, one acting as “protocol owner”.
2. **Category B: Router logic / routing plugin**
* Affects:
* `IGlobalRoutingState` implementation.
* `IRoutingPlugin` logic (region, ping, heartbeat).
* Requirements:
* Unit tests for routing plugin (selection rules).
* At least one integration test through gateway + InMemory.
* Review: at least one reviewer who understands region/version semantics.
3. **Category C Transport implementation**
* Affects:
* TCP/TLS/UDP/RabbitMQ clients & servers.
* Requirements:
* Transportspecific tests (connection, basic request/response, timeout).
* No protocol changes.
* Review: 1-2 reviewers, including one who owns that transport.
4. **Category D: SDK / Microservice developer experience**
* Affects:
* `StellaOps.Microservice` public surface, endpoint discovery, YAML merging.
* Requirements:
* API review for public surface.
* Docs update (`Microservice.md`) if behavior changes.
* Review: 1-2 reviewers.
5. **Category E: Docs only**
* Affects:
* `docs/router/*`, no code.
* Requirements:
* Ensure docs match current behavior; if not, spawn follow-up issues.
---
## 4. Workflow per change (what each agent does)
For any nontrivial change:
1. **Check the spec**
* Confirm that:
* The desired behavior is already described, or
* You will extend the spec first.
2. **Update / extend spec if needed**
* Edit `docs/router/specs.md` or appropriate doc.
* Document:
* What's changing.
* Why we need it.
* Which components are affected.
3. **Adjust Common / contracts if needed**
* Only after spec is updated.
* Keep changes minimal and backwards compatible where possible.
4. **Implement in InMemory path**
* Update:
* InMemory `ITransportClient`/hub.
* Microservice and gateway logic that rely on it.
* Add tests to prove behavior.
5. **Port to real transports**
* Implement the same behavior in:
* TCP (baseline).
* TLS (wrapping TCP).
* Others when needed.
* Reuse the same InMemory tests pattern for transport tests.
6. **Add / update tests**
* Unit tests for logic.
* Integration tests for gateway + microservice via at least one real transport.
7. **Update documentation**
* Update relevant docs:
* `Stella Ops Router - Webserver.md`
* `Stella Ops Router - Microservice.md`
* `Common.md`, if common contracts changed.
* Highlight any new configuration knobs or invariants.
---
## 5. Testing expectations for all agents
Agents should treat tests as part of the change, not an afterthought.
1. **Unit tests**
* For:
* Routing plugin decisions.
* YAML merge behavior.
* Payload budget logic.
* Goal:
* All tricky branches are covered.
2. **Integration tests**
* For gateway + microservice using:
* InMemory.
* At least one real transport (TCP in dev).
* Scenarios to maintain:
* Simple request/response.
* Streaming upload.
* Cancellation on client abort.
* Timeout leading to CANCEL.
* Payload limit exceeded.
3. **Smoke tests for examples**
* Ensure `StellaOps.Billing.Microservice` example always passes a small test:
* `/billing/health` works.
* `/billing/invoices/upload` streaming behaves.
4. **CI gating**
* No PR merges unless:
* `dotnet build` for solution succeeds.
* All tests pass.
* If agents add new projects/tests, CI must be updated in the same PR.
---
## 6. How agents should use configuration & YAML
1. **Router side**
* Always read payload limits, node region, transports from `RouterConfig` (bound from YAML + env).
* Do not hardcode:
* Limits.
* Regions.
* Ports.
* If behavior depends on config, fetch from `IOptionsMonitor<RouterConfig>` at runtime, not from cached fields unless you explicitly freeze.
2. **Microservice side**
* Identity & router pool:
* From `StellaMicroserviceOptions` (code/env).
* Endpoint metadata overrides:
* From YAML (`ConfigFilePath`) merged into reflection result.
* Agents must not let YAML create endpoints that don't exist in code; overrides only.
3. **No hidden defaults**
* If a default is important (e.g. `HeartbeatInterval`), document it and centralize it.
* Don't sprinkle magic numbers across code.
---
## 7. Adding new capabilities: pattern all agents follow
When someone wants a new capability (e.g. “retry on transient transport failures”):
1. **Open a design issue / doc snippet**
* Describe:
* Problem.
* Proposed design.
* Where it sits in architecture (router, microservice, transport, config).
2. **Update spec**
* Write the behavior in the appropriate doc section.
* Include:
* API shape (if public).
* Transport impacts.
* Failure modes.
3. **Follow the vertical slice path**
* Implement in Common (if needed).
* Implement InMemory.
* Implement in primary transport (TCP).
* Add tests.
* Update docs.
Agents should not just spike code into TCP implementation without spec or tests.
---
## 8. Logging, tracing, and debugging expectations
Agents should instrument consistently; this matters for operations and for debugging during development.
1. **Use structured logging**
* At minimum, include:
* `ServiceName`
* `InstanceId`
* `CorrelationId`
* `Method`
* `Path`
* `ConnectionId`
* Never log full payload bodies by default for privacy and performance; log sizes and key metadata instead.
2. **Trace correlation**
* Ensure correlation IDs:
* Propagate from HTTP (gateway) into `Frame.CorrelationId`.
* Are used in logs on both sides (gateway + microservice).
3. **Agent debugging guidance**
* When debugging a routing or transport problem:
* Turn on debug logging for gateway + microservice for that service.
* Use the correlation ID to follow the request end-to-end.
* Verify:
* HELLO registration.
* HEARTBEAT events.
* REQUEST leaving gateway.
* RESPONSE arriving.
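Tying points 1 and 2 together, a sketch of a per-request logging scope on the gateway side; `decision` stands in for the `RoutingDecision` and its member names here are illustrative:
```csharp
// Sketch only: one logging scope per dispatched request; payload bodies are never logged, only sizes.
using (_logger.BeginScope(new Dictionary<string, object>
{
    ["ServiceName"] = decision.ServiceName,
    ["InstanceId"] = decision.InstanceId,
    ["CorrelationId"] = frame.CorrelationId,
    ["Method"] = context.Request.Method,
    ["Path"] = context.Request.Path.ToString(),
    ["ConnectionId"] = decision.ConnectionId
}))
{
    _logger.LogInformation("Dispatching request ({BodyBytes} bytes) over {TransportType}",
        context.Request.ContentLength ?? 0, decision.TransportType);

    // ... send the frame and await the response ...
}
```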
---
## 9. Daily agent workflow (practical directions)
For each day / task, an agent should:
1. **Start from an issue or spec line item**
* Never “just code something” without an issue/state in the backlog.
2. **Locate the relevant doc**
* Spec section.
* Example docs (e.g. Billing sample).
* Migration doc if working on conversion.
3. **Work in a feature branch**
* Branch name reflects scope: `feature/streaming-tcp`, `fix/router-cancellation`, etc.
4. **Keep notes**
* If an assumption is made (e.g. "we currently don't support streaming over RabbitMQ"), note it in the issue.
* If they discover an inconsistency in the docs, open a doc-fix issue.
5. **Finish the full slice**
* Code + tests + docs.
* Keep partial implementations behind feature flags (if needed) and clearly marked.
6. **Open PR with clear description**
* What changed.
* Which spec section it implements or modifies.
* Any risks or rollback notes.
---
## 10. Guardrails against drift
Finally, a few things agents must actively avoid:
* **No silent protocol changes**
* Don't change `FrameType` semantics, payload formats, or header layout without:
* Spec update.
* Full impact review.
* **No specless behavior**
* If something matters at runtime (timeouts, retries, routing rules), it has to be in the docs, not just in someone's head.
* **No bypassing of router**
* Do not introduce “temporary” direct calls from clients to microservices. All client HTTP should go via gateway.
* **No direct dependencies on specific transports in domain code**
* Domain and microservice endpoint logic must not know if the transport is TCP, TLS, UDP, or RabbitMQ. They only see `RawRequestContext`, `RawResponse`, and cancellation tokens.
---
If you want, I can turn this into a one-page "Agent Handbook" markdown file you can drop into `docs/router/AGENTS_PROCESS.md` and link from `specs.md` so every AI or human dev working on this stack has the same ground rules.


@@ -0,0 +1,41 @@
# Sprint 7000·0001·0001 · Router Skeleton
## Topic & Scope
- Stand up the dedicated StellaOps Router repo skeleton under `docs/router` as per `specs.md` / `01-Step.md`.
- Produce the empty solution structure, projects, references, and placeholder docs ready for future transport/SDK work.
- Enforce .NET 10 (`net10.0`) across all new projects; ignore prior net8 defaults.
- **Working directory:** `docs/router`.
## Dependencies & Concurrency
- Depends on `docs/router/specs.md` remaining the authoritative requirements source.
- No upstream sprint blockers; this spin-off is self-contained.
- Can run in parallel with other repo work because it writes only under `docs/router`.
## Documentation Prerequisites
- `docs/router/specs.md`
- `docs/router/implplan.md`
- `docs/router/01-Step.md`
## Delivery Tracker
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
| --- | --- | --- | --- | --- | --- |
| 1 | ROUTER-SKEL-SETUP | TODO | Read specs + step docs | Skeleton Agent | Create repo folders (`src/`, `src/__Libraries/`, `tests/`, `docs/router`) & add `README.md` pointer. |
| 2 | ROUTER-SKEL-SOLUTION | TODO | Task 1 | Skeleton Agent | Generate `StellaOps.Router.sln`, add Gateway + library + test projects targeting `net10.0`. |
| 3 | ROUTER-SKEL-REFS | TODO | Task 2 | Skeleton Agent | Wire project references per plan (Gateway→Common+Config, etc.). |
| 4 | ROUTER-SKEL-BUILDPROPS | TODO | Task 2 | Infra Agent | Add repo-level `Directory.Build.props` pinning `net10.0`, nullable, implicit usings. |
| 5 | ROUTER-SKEL-STUBS | TODO | Tasks 2-4 | Common/Microservice Agents | Add placeholder types/extension methods per `01-Step.md` (no logic). |
| 6 | ROUTER-SKEL-TESTS | TODO | Task 5 | QA Agent | Create dummy `[Fact]` tests in each test project so `dotnet test` passes. |
| 7 | ROUTER-SKEL-CI | TODO | Tasks 2-6 | Infra Agent | Configure CI pipeline running `dotnet restore/build/test` on solution. |
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-02 | Created sprint skeleton per router spin-off instructions. | Planning |
## Decisions & Risks
- Use .NET 10 baseline even though other modules still target net8; future agents must not downgrade frameworks.
- Scope intentionally limited to `docs/router` to avoid cross-repo conflicts; any shared assets must be duplicated or referenced via documentation until later alignment.
- Risk: missing AGENTS.md for this folder—future sprint should establish one if work extends beyond skeleton.
## Next Checkpoints
- 2025-12-04: Verify solution + CI scaffold committed and passing.

docs/router/implplan.md Normal file

@@ -0,0 +1,356 @@
Start by treating `docs/router/specs.md` as law. Nothing gets coded that contradicts it. The first sprint or two should be about *wiring the skeleton* and proving the core flows with the simplest possible transport, then layering in the real transports and migration paths.
I'd structure the work for your agents like this.
---
## 0. Read & freeze invariants
**All agents:**
* Read `docs/router/specs.md` end to end.
* Extract and pin the non-negotiables:
* Method + Path identity.
* Strict semver for versions.
* Region from `GatewayNodeConfig.Region` (no host/header magic).
* No HTTP transport for microservice communications.
* Single connection carrying HELLO + HEARTBEAT + REQUEST/RESPONSE + CANCEL.
* Router treats body as opaque bytes/streams.
* `RequiringClaims` replaces any form of `AllowedRoles`.
Agree that these are invariants; any future idea that violates them needs an explicit spec change first.
---
## 1. Lay down the solution skeleton
**“Skeleton” agent (or gateway core agent):**
Create the basic project structure, no logic yet:
* `src/__Libraries/StellaOps.Router.Common`
* `src/__Libraries/StellaOps.Router.Config`
* `src/__Libraries/StellaOps.Microservice`
* `src/StellaOps.Gateway.WebService`
* `docs/router/` already has `specs.md` (add placeholders for the other docs).
Goal: everything builds, but most classes are empty or stubs.
---
## 2. Implement the shared core model (Common)
**Common/core agent:**
Implement only the *data* and *interfaces*, no behavior:
* Enums:
* `TransportType`, `FrameType`, `InstanceHealthStatus`.
* Models:
* `ClaimRequirement`
* `EndpointDescriptor`
* `InstanceDescriptor`
* `ConnectionState`
* `RoutingContext`, `RoutingDecision`
* `PayloadLimits`
* Interfaces:
* `IGlobalRoutingState`
* `IRoutingPlugin`
* `ITransportServer`
* `ITransportClient`
* `Frame` struct/class:
* `FrameType`, `CorrelationId`, `Payload` (byte[]).
Leave implementations of `IGlobalRoutingState`, `IRoutingPlugin`, transports, etc., for later steps.
Deliverable: a stable set of contracts that gateway + microservice SDK depend on.
---
## 3. Build a fake “in-memory” transport plugin
**Transport agent:**
Before UDP/TCP/Rabbit, build an **in-process transport**:
* `InMemoryTransportServer` and `InMemoryTransportClient`.
* They share a concurrent dictionary keyed by `ConnectionId`.
* Frames are passed via channels/queues in memory.
Purpose:
* Let you prove HELLO/HEARTBEAT/REQUEST/RESPONSE/CANCEL semantics and routing logic *without* dealing with sockets and Rabbit yet.
* Let you unit and integration test the router and SDK quickly.
This plugin will never ship to production; it's only for dev tests and CI.
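A minimal sketch of such a hub, using `System.Threading.Channels` so frames flow in-process (type and member names are illustrative):
```csharp
// Sketch only: both sides of a "connection" get a channel pair; the hub maps ConnectionId → channels.
public sealed class InMemoryTransportHub
{
    private readonly ConcurrentDictionary<string, (Channel<Frame> ToMicroservice, Channel<Frame> ToGateway)> _connections = new();

    public (ChannelWriter<Frame> Outbound, ChannelReader<Frame> Inbound) ConnectMicroservice(string connectionId)
    {
        var pair = _connections.GetOrAdd(connectionId, _ =>
            (Channel.CreateUnbounded<Frame>(), Channel.CreateUnbounded<Frame>()));
        // Microservice writes frames headed to the gateway and reads frames headed to it.
        return (pair.ToGateway.Writer, pair.ToMicroservice.Reader);
    }

    public (ChannelWriter<Frame> Outbound, ChannelReader<Frame> Inbound) ConnectGateway(string connectionId)
    {
        var pair = _connections.GetOrAdd(connectionId, _ =>
            (Channel.CreateUnbounded<Frame>(), Channel.CreateUnbounded<Frame>()));
        return (pair.ToMicroservice.Writer, pair.ToGateway.Reader);
    }

    public void Disconnect(string connectionId) => _connections.TryRemove(connectionId, out _);
}
```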
---
## 4. Microservice SDK: minimal handshake & dispatch (with InMemory)
**Microservice agent:**
Initial focus: “connect and say HELLO, then handle a simple request.”
1. Implement `StellaMicroserviceOptions`.
2. Implement `AddStellaMicroservice(...)`:
* Bind options.
* Register endpoint handlers and SDK internal services.
3. Endpoint discovery:
* Implement runtime reflection for `[StellaEndpoint]` + handler types.
* Build in-memory `EndpointDescriptor` list (simple: no YAML yet).
4. Connection:
* Use `InMemoryTransportClient` to “connect” to a fake router.
* On connect, send a HELLO frame with:
* Identity.
* Endpoint list and metadata (`SupportsStreaming` false for now, simple `RequiringClaims` empty).
5. Request handling:
* Implement `IRawStellaEndpoint` and adapter to it.
* Implement `RawRequestContext` / `RawResponse`.
* Implement a dispatcher that:
* Receives `Request` frame.
* Builds `RawRequestContext`.
* Invokes the correct handler.
* Sends `Response` frame.
Do **not** handle streaming or cancellation yet; just basic request/response with small bodies.
---
## 5. Gateway: minimal routing using InMemory plugin
**Gateway agent:**
Goal: HTTP → in-memory transport → microservice → HTTP response.
1. Implement `GatewayNodeConfig` and bind it from config.
2. Implement `IGlobalRoutingState` as a simple in-memory implementation that:
* Holds `ConnectionState` objects.
* Builds a map `(Method, Path)` → endpoint + connections.
3. Implement a minimal `IRoutingPlugin` that:
* For now, just picks *any* connection that has the endpoint (no region/ping logic yet).
4. Implement minimal HTTP pipeline:
* `EndpointResolutionMiddleware`:
* `(Method, Path)``EndpointDescriptor` from `IGlobalRoutingState`.
* Naive authorization middleware stub (only checks “needs authenticated user”; ignore real requiringClaims for now).
* `RoutingDecisionMiddleware`:
* Ask `IRoutingPlugin` for a `RoutingDecision`.
* `TransportDispatchMiddleware`:
* Build a `Request` frame.
* Use `InMemoryTransportClient` to send and await `Response`.
* Map response to HTTP.
5. Implement HELLO handler on gateway side:
* When InMemory “connection” from microservice appears and sends HELLO:
* Construct `ConnectionState`.
* Update `IGlobalRoutingState` with endpoint → connection mapping.
Once this works, you have end-to-end:
* Example microservice.
* Example gateway.
* In-memory transport.
* A couple of test endpoints returning simple JSON.
---
## 6. Add heartbeat, health, and basic routing rules
**Common/core + gateway agent:**
Now enforce liveness and basic routing:
1. Heartbeat:
* Microservice SDK sends HEARTBEAT frames on a timer.
* Gateway updates `LastHeartbeatUtc` and `Status`.
2. Health:
* Add background job in gateway that:
* Marks instances Unhealthy if heartbeat stale.
3. Routing:
* Enhance `IRoutingPlugin` to:
* Filter out Unhealthy instances.
* Prefer gateway region (using `GatewayNodeConfig.Region`).
* Use simple `AveragePingMs` stub from request/response timings.
Still using InMemory transport; just building the selection logic.
---
## 7. Add cancellation semantics (with InMemory)
**Microservice + gateway agents:**
Wire up cancellation logic before touching real transports:
1. Common:
* Extend `FrameType` with `Cancel`.
2. Gateway:
* In `TransportDispatchMiddleware`:
* Tie `HttpContext.RequestAborted` to a `SendCancelAsync` call.
* On timeout, send CANCEL.
* Ignore late `Response`/stream data for canceled correlation IDs.
3. Microservice:
* Maintain `_inflight` map of correlation → `CancellationTokenSource`.
* When `Cancel` frame arrives, call `cts.Cancel()`.
* Ensure handlers receive and honor `CancellationToken`.
Prove via tests: if client disconnects, handler stops quickly.
---
## 8. Add streaming & payload limits (still InMemory)
**Gateway + microservice agents:**
1. Streaming:
* Extend InMemory transport to support `RequestStreamData` / `ResponseStreamData` frames.
* On the gateway:
* For `SupportsStreaming` endpoints, pipe HTTP body stream → frame stream.
* For response, pipe frames → HTTP response stream.
* On microservice:
* Expose `RawRequestContext.Body` as a stream reading frames as they arrive.
* Allow `RawResponse.WriteBodyAsync` to stream out.
2. Payload limits:
* Implement `PayloadLimits` enforcement at gateway:
* Early reject large `Content-Length`.
* Track counters in streaming; trigger cancellation when exceeding thresholds.
Demonstrate with a fake “upload” endpoint that uses `IRawStellaEndpoint` and streaming.
---
## 9. Implement real transport plugins one by one
**Transport agent:**
Now replace InMemory with real transports:
Order:
1. **TCP plugin** (easiest baseline):
* Length-prefixed frame protocol.
* Connection per microservice instance (or multi-instance if needed later).
* Implement HELLO/HEARTBEAT/REQUEST/RESPONSE/STREAM/CANCEL as per frame model.
2. **Certificate (TLS) plugin**:
* Wrap TCP plugin with TLS.
* Add configuration for server & client certs.
3. **UDP plugin**:
* Single datagram = single frame; no streaming.
* Enforce `MaxRequestBytesPerCall`.
* Use for small, idempotent operations.
4. **RabbitMQ plugin**:
* Add exchanges/queues for HELLO/HEARTBEAT and REQUEST/RESPONSE.
* Use `CorrelationId` properties for matching.
* Guarantee at-most-once semantics where practical.
While each plugin is built, keep the core router and microservice SDK relying only on `ITransportClient`/`ITransportServer` abstractions.
---
## 10. Add Router.Config + Microservice YAML integration
**Config agent:**
1. Implement `__Libraries/StellaOps.Router.Config`:
* YAML → `RouterConfig` binding.
* Services, endpoints, static instances, payload limits.
* Hot-reload via `IOptionsMonitor` / file watcher.
2. Implement microservice YAML:
* Endpoint-level overrides only (timeouts, requiringClaims, SupportsStreaming).
* Merge logic: code defaults → YAML override.
3. Integrate:
* Gateway uses RouterConfig for:
* Defaults when no microservice registered yet.
* Payload limits.
* Microservice uses YAML to refine endpoint metadata before sending HELLO.
---
## 11. Build a reference example + migration skeleton
**DX / migration agent:**
1. Build a `StellaOps.Billing.Microservice` example:
* A couple of simple endpoints (GET/POST).
* One streaming upload endpoint.
* YAML for requiringClaims and timeouts.
2. Build a `StellaOps.Gateway.WebService` example config around it.
3. Document the full path:
* How to run both locally.
* How to add a new endpoint.
* How cancellation behaves (killing the client, watching logs).
* How payload limits work (try to upload too-large file).
4. Outline migration steps from an imaginary `StellaOps.Billing.WebService` using the patterns in `Migration of Webservices to Microservices.md`.
---
## 12. Process guidance for your agents
* **Do not jump to UDP/TCP immediately.**
Prove the protocol (HELLO/HEARTBEAT/REQUEST/RESPONSE/STREAM/CANCEL), routing, and limits on the InMemory plugin first.
* **Guard the invariants.**
If someone proposes "just call HTTP between services" or "let's derive region from host," they're violating spec and must update `docs/router/specs.md` before coding.
* **Keep Common stable.**
Changes to `StellaOps.Router.Common` must be rare and reviewed; everything else depends on it.
* **Document as you go.**
Every time a behavior settles (e.g. status mapping, frame layout), update the docs under `docs/router/` so new agents always have a single source of truth.
If you want, next step I can convert this into a task board (epic → stories) per repo folder, so you can assign specific chunks to named agents.

docs/router/specs.md Normal file

@@ -0,0 +1,494 @@
I'll group everything into requirement buckets, but keep it all as requirements statements (no rationale). This is the union of what you asked for or confirmed across the whole thread.
---
## 1. Architectural / scope requirements
* There SHALL be a single HTTP ingress service named `StellaOps.Gateway.WebService`.
* Microservices SHALL NOT expose HTTP to the router; all microservice-to-router traffic (control + data) MUST use in-house transports (UDP, TCP, certificate/TLS, RabbitMQ).
* There SHALL NOT be a separate control-plane service or protocol; each transport connection between a microservice and the router MUST carry:
* Initial registration (HELLO) and endpoint configuration.
* Ongoing heartbeats.
* Endpoint updates (if any).
* Request/response and streaming data.
* The router SHALL maintain per-connection endpoint mappings and derive its global routing state from the union of all live connections.
* The router SHALL treat request and response bodies as opaque (raw bytes / streams); all deserialization and schema handling SHALL be the microservice's responsibility.
* The system SHALL support both buffered and streaming request/response flows end-to-end.
* The design MUST reuse only the generic parts of `__SerdicaTemplate` (dynamic endpoint metadata, attribute-based endpoint discovery, request routing patterns, correlation, connection management) and MUST drop Serdica-specific stack (Oracle schema, domain logic, etc.).
* The solution MUST be a simpler, generic replacement for the existing Serdica HTTP→RabbitMQ→microservice design.
---
## 2. Service identity, region, versioning
* Each microservice instance SHALL be identified by `(ServiceName, Version, Region, InstanceId)`.
* `Version` MUST follow strict semantic versioning (`major.minor.patch`).
* Routing MUST be strict on version:
* The router MUST only route a request to instances whose `Version` equals the selected version.
* When a version is not explicitly specified by the client, a default version MUST be used (from config or metadata).
* Each gateway node SHALL have a static configuration object `GatewayNodeConfig` containing at least:
* `Region` (e.g. `"eu1"`).
* `NodeId` (e.g. `"gw-eu1-01"`).
* `Environment` (e.g. `"prod"`).
* Routing decisions MUST use `GatewayNodeConfig.Region` as the node's region; the router MUST NOT derive region from HTTP headers or URL host names.
* DNS/host naming conventions SHOULD express region in the domain (e.g. `eu1.global.stella-ops.org`, `mainoffice.contoso.stella-ops.org`), but routing logic MUST be driven by `GatewayNodeConfig.Region` rather than by host parsing.
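A minimal sketch of the static node configuration, using the example values above (only the three required fields are shown; anything else is not assumed):

```csharp
// Hypothetical static node configuration, set at deployment time.
public sealed record GatewayNodeConfig(
    string Region,       // e.g. "eu1" - the only source of truth for routing decisions
    string NodeId,       // e.g. "gw-eu1-01"
    string Environment); // e.g. "prod"
```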
---
## 3. Endpoint identity and metadata
* Endpoint identity in the router and microservices MUST be `HTTP Method + Path`, for example:
* `Method`: one of `GET`, `POST`, `PUT`, `PATCH`, `DELETE`.
* `Path`: e.g. `/section/get/{id}`.
* The router and microservices MUST use the same path template syntax and matching rules (e.g. ASP.NET-style route templates), including decisions on:
* Case sensitivity.
* Trailing slash handling.
* Parameter segments (e.g. `{id}`).
* The router MUST resolve an incoming HTTP `(Method, Path)` to a logical endpoint descriptor that includes:
* ServiceName.
* Version.
* Method.
* Path.
* DefaultTimeout.
* `RequiringClaims`: a list of claim requirements.
* A flag indicating whether the endpoint supports streaming.
* Every place that previously spoke about `AllowedRoles` MUST be replaced with `RequiringClaims`:
* Each requirement MUST at minimum contain a `Type` and MAY contain a `Value`.
* Endpoints MUST support being configured with default `RequiringClaims` in microservices, with the possibility of external override (see Authority section).
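As an illustration, the resolved endpoint descriptor and claim requirement could be shaped roughly like this (a sketch; field names follow the requirements above):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes for the resolved endpoint descriptor and a claim requirement.
public sealed record ClaimRequirement(string Type, string? Value = null);

public sealed record EndpointDescriptor(
    string ServiceName,
    string Version,                                   // strict semver, e.g. "1.4.2"
    string Method,                                    // GET/POST/PUT/PATCH/DELETE
    string Path,                                      // e.g. "/section/get/{id}"
    TimeSpan DefaultTimeout,
    IReadOnlyList<ClaimRequirement> RequiringClaims,  // empty list = authenticated only
    bool SupportsStreaming);
```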
---
## 4. Routing algorithm / instance selection
* Given a resolved endpoint `(ServiceName, Version, Method, Path)`, the router MUST:
* Filter candidate instances by:
* Matching `ServiceName`.
* Matching `Version` (strict semver equality).
* Health in an acceptable set (e.g. `Healthy` or `Degraded`).
* Instances MUST have health metadata:
* `Status` ∈ {`Unknown`, `Healthy`, `Degraded`, `Draining`, `Unhealthy`}.
* `LastHeartbeatUtc`.
* `AveragePingMs`.
* The router's instance selection MUST obey these rules:
* Region:
* Prefer instances whose `Region == GatewayNodeConfig.Region`.
* If none, fall back to configured neighbor regions.
* If none, fall back to all other regions.
* Within a chosen region tier:
* Prefer lower `AveragePingMs`.
* If several are tied, prefer more recent `LastHeartbeatUtc`.
* If still tied, use a balancing strategy (e.g. random or round-robin).
* The router MUST support a strict fallback order as requested:
* Prefer “closest by region and heartbeat and ping.”
* If only worse candidates remain, fall back in this order, accepting in turn:
* Greater ping (latency).
* Greater heartbeat age.
* Less preferred region tier.
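One possible way to express this ordering in code, as a sketch (filtering by service, version, and acceptable health is assumed to have happened already; all type and member names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes; only the fields needed for selection are shown.
public sealed record InstanceDescriptor(string InstanceId, string ServiceName, string Version, string Region);
public sealed record ConnectionState(string ConnectionId, InstanceDescriptor Instance,
                                     DateTime LastHeartbeatUtc, double AveragePingMs);

public static class InstanceSelector
{
    public static ConnectionState? Choose(IReadOnlyList<ConnectionState> candidates,
                                          string localRegion,
                                          IReadOnlyList<string> neighborRegions,
                                          Random rng)
        => candidates
            .OrderBy(c => RegionTier(c.Instance.Region, localRegion, neighborRegions)) // 0 = local, 1 = neighbor, 2 = other
            .ThenBy(c => c.AveragePingMs)                                              // lower latency first
            .ThenByDescending(c => c.LastHeartbeatUtc)                                 // fresher heartbeat first
            .ThenBy(_ => rng.Next())                                                   // final tie-break: random
            .FirstOrDefault();

    private static int RegionTier(string region, string local, IReadOnlyList<string> neighbors)
        => region == local ? 0 : neighbors.Contains(region) ? 1 : 2;
}
```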
---
## 5. Transport plugin requirements
* There MUST be a transport plugin abstraction representing how the router and microservices communicate.
* The default transport type MUST be UDP.
* Additional supported transport types MUST include:
* TCP.
* Certificate-based TCP (TLS / mTLS).
* RabbitMQ.
* There MUST NOT be an HTTP transport plugin; HTTP MUST NOT be used for microservice-to-router communications (control or data).
* Each transport plugin MUST support:
* Establishing logical connections between microservices and the router.
* Sending/receiving HELLO (registration), HEARTBEAT, optional ENDPOINTS_UPDATE.
* Sending/receiving REQUEST/RESPONSE frames.
* Supporting streaming via REQUEST_STREAM_DATA / RESPONSE_STREAM_DATA frames where the transport allows it.
* Sending/receiving CANCEL frames to abort specific in-flight requests.
* UDP transport:
* MUST be used only for small/bounded payloads (no unbounded streaming).
* MUST respect configured `MaxRequestBytesPerCall`.
* TCP and Certificate transports:
* MUST implement a length-prefixed framing protocol capable of multiplexing frames for multiple correlation IDs.
* Certificate transport MUST enforce TLS and support optional mutual TLS (verifiable peer identity).
* RabbitMQ:
* MUST implement queue/exchange naming and routing keys sufficient to represent logical connections and correlation IDs.
* MUST use message properties (e.g. `CorrelationId`) for request/response matching.
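For the length-prefixed transports, a sketch of writing a single frame might look like the following. The wire layout shown here (4-byte big-endian length, 1-byte frame type, 16-byte correlation id, then payload) is an assumption, not a decided format:

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class TcpFraming
{
    // Assumed layout: [4-byte length][1-byte frame type][16-byte correlation id][payload bytes].
    public static async Task WriteFrameAsync(Stream stream, byte frameType, Guid correlationId,
                                             ReadOnlyMemory<byte> payload, CancellationToken ct)
    {
        var header = new byte[4 + 1 + 16];
        BinaryPrimitives.WriteInt32BigEndian(header.AsSpan(0, 4), payload.Length);
        header[4] = frameType;
        correlationId.TryWriteBytes(header.AsSpan(5, 16));

        await stream.WriteAsync(header, ct);
        await stream.WriteAsync(payload, ct);
        await stream.FlushAsync(ct);
    }
}
```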
---
## 6. Gateway (`StellaOps.Gateway.WebService`) requirements
### 6.1 HTTP ingress pipeline
* The gateway MUST host an ASP.NET Core HTTP server.
* The HTTP middleware pipeline MUST include at least:
* Forwarded headers handling (when behind reverse proxy).
* Request logging (e.g. via Serilog) including correlation ID, service, endpoint, region, instance.
* Global error-handling middleware.
* Authentication middleware.
* `EndpointResolutionMiddleware` to resolve `(Method, Path)` → endpoint.
* Authorization middleware that enforces `RequiringClaims`.
* `RoutingDecisionMiddleware` to choose connection/instance/transport.
* `TransportDispatchMiddleware` to carry out buffered or streaming dispatch.
* The gateway MUST read `Method` and `Path` from the HTTP request and use them to resolve endpoints.
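A sketch of the corresponding `Program.cs` pipeline order; the error-handling and authorization middleware class names (and the Serilog wiring details) are assumptions layered on top of the list above:

```csharp
// Sketch only: shows the ordering of the pipeline, not final registration APIs.
var builder = WebApplication.CreateBuilder(args);
builder.Host.UseSerilog(); // Serilog configuration elided

var app = builder.Build();

app.UseForwardedHeaders();                                    // when behind a reverse proxy
app.UseSerilogRequestLogging();                               // correlation id, service, endpoint, region, instance
app.UseMiddleware<GlobalErrorHandlingMiddleware>();           // assumed name
app.UseAuthentication();
app.UseMiddleware<EndpointResolutionMiddleware>();            // (Method, Path) -> EndpointDescriptor
app.UseMiddleware<RequiringClaimsAuthorizationMiddleware>();  // assumed name; enforces RequiringClaims
app.UseMiddleware<RoutingDecisionMiddleware>();               // pick connection / instance / transport
app.UseMiddleware<TransportDispatchMiddleware>();             // buffered or streaming dispatch

app.Run();
```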
### 6.2 Per-connection state and routing view
* The gateway MUST maintain a `ConnectionState` per logical connection that includes:
* ConnectionId.
* `InstanceDescriptor` (`InstanceId`, `ServiceName`, `Version`, `Region`).
* `Status`, `LastHeartbeatUtc`, `AveragePingMs`.
* The set of endpoints that this connection serves (`(Method, Path)` → `EndpointDescriptor`).
* The transport type for that connection.
* The gateway MUST maintain a global routing state (`IGlobalRoutingState`) that:
* Resolves `(Method, Path)` to an `EndpointDescriptor` (service, version, metadata).
* Provides the set of `ConnectionState` objects that can handle a given `(ServiceName, Version, Method, Path)`.
### 6.3 Buffered vs streaming dispatch
* The gateway MUST support:
* **Buffered mode** for small to medium payloads:
* Read the entire HTTP body into memory (or temp file when above a threshold).
* Send as a single REQUEST payload.
* **Streaming mode** for large or unknown content:
* Streaming from HTTP body to microservice via a sequence of REQUEST_STREAM_DATA frames.
* Streaming from microservice back to HTTP via RESPONSE_STREAM_DATA frames.
* For each endpoint, the gateway MUST know whether it can use streaming or must use buffered mode (`SupportsStreaming` flag).
### 6.4 Opaque body handling
* The gateway MUST treat request and response bodies as opaque byte sequences and MUST NOT attempt to deserialize or interpret payload contents.
* The gateway MUST forward headers and body bytes as given and leave any schema, JSON, or other decoding to the microservice.
### 6.5 Payload and memory protection
* The gateway MUST enforce configured payload limits:
* `MaxRequestBytesPerCall`.
* `MaxRequestBytesPerConnection`.
* `MaxAggregateInflightBytes`.
* If `Content-Length` is known and exceeds `MaxRequestBytesPerCall`, the gateway MUST reject the request early (e.g. HTTP 413 Payload Too Large).
* During streaming, the gateway MUST maintain counters of:
* Bytes read for this request.
* Bytes for this connection.
* Total in-flight bytes across all requests.
* If any limit is exceeded mid-stream, the gateway MUST:
* Stop reading the HTTP body.
* Send a CANCEL frame for that correlation ID.
* Abort the stream to the microservice.
* Return an appropriate error to the client (e.g. 413 or 503) and log the incident.
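A sketch of how the three counters could be enforced mid-stream; type and method names are assumptions:

```csharp
using System.Threading;

// Hypothetical byte accounting shared by the streaming dispatcher.
public sealed record PayloadLimits(long MaxRequestBytesPerCall,
                                   long MaxRequestBytesPerConnection,
                                   long MaxAggregateInflightBytes);

public sealed class InflightByteCounters
{
    private long _aggregate; // total in-flight bytes across all requests

    // Returns false as soon as any limit is breached; the caller then stops reading
    // the HTTP body, sends a CANCEL frame, aborts the stream and answers 413/503.
    public bool TryAddChunk(PayloadLimits limits, ref long requestBytes, ref long connectionBytes, int chunkLength)
    {
        requestBytes    += chunkLength;
        connectionBytes += chunkLength;
        long aggregate = Interlocked.Add(ref _aggregate, chunkLength);

        return requestBytes    <= limits.MaxRequestBytesPerCall
            && connectionBytes <= limits.MaxRequestBytesPerConnection
            && aggregate       <= limits.MaxAggregateInflightBytes;
    }

    // Release a request's bytes once it completes or is aborted.
    public void Release(long requestBytes) => Interlocked.Add(ref _aggregate, -requestBytes);
}
```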
---
## 7. Microservice SDK (`__Libraries/StellaOps.Microservice`) requirements
### 7.1 Identity & router connections
* `StellaMicroserviceOptions` MUST let microservices configure:
* `ServiceName`.
* `Version`.
* `Region`.
* `InstanceId`.
* A list of router endpoints (`Routers` / router pool) including host, port, and transport type for each.
* Optional path to a YAML config file for endpoint-level overrides.
* Providing the router pool (`Routers`) MUST be mandatory; a microservice cannot start without at least one configured router endpoint.
* The router pool SHOULD be configurable via code and MAY optionally be configured via YAML with hot-reload (causing reconnections if changed).
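A sketch of what `StellaMicroserviceOptions` might look like, mirroring the fields above; the shape, the helper types, and the enum values are assumptions:

```csharp
using System.Collections.Generic;

public sealed class StellaMicroserviceOptions
{
    public required string ServiceName { get; init; }  // e.g. "billing"
    public required string Version { get; init; }       // strict semver, e.g. "1.0.0"
    public required string Region { get; init; }        // e.g. "eu1"
    public required string InstanceId { get; init; }    // e.g. "billing-eu1-01"

    // At least one router endpoint is mandatory.
    public required IReadOnlyList<RouterEndpointOptions> Routers { get; init; }

    // Optional YAML file with endpoint-level overrides only.
    public string? EndpointOverridesYamlPath { get; init; }
}

public sealed class RouterEndpointOptions
{
    public required string Host { get; init; }
    public required int Port { get; init; }
    public required TransportType Transport { get; init; } // Udp (default), Tcp, Tls, RabbitMq
}

public enum TransportType { Udp, Tcp, Tls, RabbitMq }
```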
### 7.2 Endpoint definition & discovery
* Microservice endpoints MUST be declared using attributes that specify `(Method, Path)`:
```csharp
[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : ...
```
* The SDK MUST support two handler shapes:
* Raw handler:
* `IRawStellaEndpoint` taking a `RawRequestContext` and returning a `RawResponse`, where:
* `RawRequestContext.Body` is a stream (may be buffered or streaming).
* Body contents are raw bytes.
* Typed handlers:
* `IStellaEndpoint<TRequest, TResponse>` which takes a typed request and returns a typed response.
* `IStellaEndpoint<TResponse>` which has no request payload and returns a typed response.
* The SDK MUST adapt typed endpoints to the raw model internally (microservice-side only), leaving the router unaware of types.
* Endpoint discovery MUST work by:
* Runtime reflection: scanning assemblies for `[StellaEndpoint]` and handler interfaces.
* Build-time reflection via source generation:
* A Roslyn source generator MUST generate a descriptor list at build time.
* At runtime, the SDK MUST prefer source-generated metadata and only fall back to reflection if generation is not available.
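Building on the attribute example above, a typed endpoint might be implemented roughly like this; the exact `HandleAsync` signature and the request/response types are assumptions:

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical typed request/response; the SDK adapts this to the raw model internally.
public sealed record CreateInvoiceRequest(string CustomerId, decimal Amount);
public sealed record CreateInvoiceResponse(string InvoiceId);

[StellaEndpoint("POST", "/billing/invoices")]
public sealed class CreateInvoiceEndpoint : IStellaEndpoint<CreateInvoiceRequest, CreateInvoiceResponse>
{
    public async Task<CreateInvoiceResponse> HandleAsync(CreateInvoiceRequest request, CancellationToken ct)
    {
        // ... validate and persist, passing ct to all downstream I/O ...
        await Task.CompletedTask;
        return new CreateInvoiceResponse(InvoiceId: "inv-0001");
    }
}
```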
### 7.3 Endpoint metadata defaults & overrides
* Microservices MUST be able to provide default endpoint metadata:
* `SupportsStreaming` flag.
* Default timeout.
* Default `RequiringClaims`.
* Microservice-local YAML MUST be allowed to override or refine these defaults per endpoint, keyed by `(Method, Path)`.
* Precedence rules MUST be clearly defined and honored:
* Service identity & router pool: from `StellaMicroserviceOptions` (not YAML).
* Endpoint set: from code (attributes/source gen); YAML MAY override properties but SHOULD NOT create endpoints that are not present in code (policy decision to be documented).
* `RequiringClaims` and timeouts: YAML overrides defaults from code, unless overridden by central Authority.
### 7.4 Connection behavior
* On establishing a connection to a router endpoint, the SDK MUST:
* Immediately send a HELLO frame containing:
* `ServiceName`, `Version`, `Region`, `InstanceId`.
* The list of endpoints (Method, Path) with their metadata (SupportsStreaming, default timeouts, default RequiringClaims).
* At regular intervals, the SDK MUST send HEARTBEAT frames on each connection indicating:
* Instance health status.
* Optional metrics (e.g. in-flight request count, error rate).
* The SDK SHOULD support optional ENDPOINTS_UPDATE (or a re-HELLO) to update endpoint metadata at runtime if needed.
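A sketch of possible HELLO and HEARTBEAT payload shapes, reusing the `EndpointDescriptor` shape sketched in section 3; the wire encoding is not assumed here:

```csharp
using System.Collections.Generic;

// Hypothetical payload shapes, prior to transport-level framing/serialization.
public sealed record HelloPayload(
    string ServiceName,
    string Version,
    string Region,
    string InstanceId,
    IReadOnlyList<EndpointDescriptor> Endpoints); // (Method, Path) plus metadata

public sealed record HeartbeatPayload(
    string InstanceId,
    InstanceHealthStatus Status,
    int InflightRequests,       // optional metric
    double ErrorRatePerMinute); // optional metric

public enum InstanceHealthStatus { Unknown, Healthy, Degraded, Draining, Unhealthy }
```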
### 7.5 Request handling & streaming
* For each incoming REQUEST frame:
* The SDK MUST create a `RawRequestContext` with:
* Method.
* Path.
* Headers.
* A `Body` stream that either:
* Wraps a buffered byte array.
* Or exposes streaming reads from subsequent REQUEST_STREAM_DATA frames.
* A `CancellationToken` that will be cancelled when the router sends a CANCEL frame or the connection fails.
* The SDK MUST resolve the correct endpoint handler by `(Method, Path)` using the same path template rules as the router.
* For streaming endpoints, handlers MUST be able to read from `RawRequestContext.Body` incrementally and obey the `CancellationToken`.
### 7.6 Cancellation handling (microservice side)
* The SDK MUST maintain a map of in-flight requests by correlation ID, each containing:
* A `CancellationTokenSource`.
* The task executing the handler.
* Upon receiving a CANCEL frame for a given correlation ID, the SDK MUST:
* Look up the corresponding entry and call `CancellationTokenSource.Cancel()`.
* Handlers (both raw and typed) MUST receive a `CancellationToken`:
* They MUST observe the token and be coded to cancel promptly where needed.
* They MUST pass the token to downstream I/O operations (DB calls, file I/O, network).
* If the transport connection is closed, the SDK MUST treat it as a cancellation trigger for all outstanding requests on that connection and cancel their tokens.
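A sketch of the in-flight map and its two cancellation paths (CANCEL frame, dropped connection); all names are assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical in-flight tracking on the microservice side.
public sealed class InflightRequests
{
    private sealed record Entry(CancellationTokenSource Cts, Task HandlerTask);
    private readonly ConcurrentDictionary<Guid, Entry> _entries = new();

    // Starts the handler and tracks it under its correlation id.
    public Task Register(Guid correlationId, Func<CancellationToken, Task> handler)
    {
        var cts = new CancellationTokenSource();
        var task = handler(cts.Token);
        _entries[correlationId] = new Entry(cts, task);
        return task;
    }

    // Called when a CANCEL frame arrives for the correlation id.
    public void Cancel(Guid correlationId)
    {
        if (_entries.TryRemove(correlationId, out var entry))
            entry.Cts.Cancel();
    }

    // Called when the transport connection drops: cancel everything outstanding.
    public void CancelAll()
    {
        foreach (var key in _entries.Keys) Cancel(key);
    }
}
```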
---
## 8. Control / health / ping requirements
* Heartbeats MUST be sent over the same connection as requests (no separate control channel).
* The router MUST:
* Track `LastHeartbeatUtc` for each connection.
* Derive `InstanceHealthStatus` based on heartbeat recency and optionally metrics.
* Drop or mark as Unhealthy any instances whose heartbeats are stale past configured thresholds.
* The router SHOULD measure network latency (ping) by:
* Timing request-response round trips, or
* Using explicit ping frames, and updating `AveragePingMs` for each connection.
* The router MUST use heartbeat and ping metrics in its routing decision as described above.
---
## 9. Authorization / requiringClaims / Authority requirements
* `RequiringClaims` MUST be the only authorization metadata field; `AllowedRoles` MUST NOT be used.
* Every endpoint MUST be able to specify:
* An empty `RequiringClaims` list (no additional claims required beyond authenticated).
* Or one or more `ClaimRequirement` objects (Type + optional Value).
* The gateway MUST enforce `RequiringClaims` per request:
* Authorization MUST check that the requests user principal has all required claims for the endpoint.
* Microservices MUST provide default `RequiringClaims` as part of their HELLO metadata.
* There MUST be a mechanism for an external Authority service to override `RequiringClaims` centrally:
* Defaults MUST come from microservices.
* Authority MUST be able to push or supply overrides that the gateway applies at startup and/or at runtime.
* The gateway MUST proactively request such overrides on startup (e.g. via a special message or mechanism) before handling traffic, or as early as practical.
* Final, effective `RequiringClaims` enforced at the gateway MUST be derived from microservice defaults plus Authority overrides, with Authority taking precedence where applicable.
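A sketch of how effective claims could be computed and checked at the gateway. Treating an Authority override as a full replacement of the microservice defaults is one simple interpretation, not a decided policy; the type and method names are assumptions:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Security.Claims;

public sealed record ClaimRequirement(string Type, string? Value = null);

public static class RequiringClaimsEvaluator
{
    // Effective claims = microservice defaults, replaced by the Authority override when one exists.
    public static IReadOnlyList<ClaimRequirement> Effective(
        IReadOnlyList<ClaimRequirement> microserviceDefaults,
        IReadOnlyList<ClaimRequirement>? authorityOverride)
        => authorityOverride ?? microserviceDefaults;

    // The principal must satisfy every requirement; a null Value means "claim type present".
    public static bool IsSatisfied(ClaimsPrincipal user, IReadOnlyList<ClaimRequirement> requirements)
        => requirements.All(r => r.Value is null
            ? user.HasClaim(c => c.Type == r.Type)
            : user.HasClaim(r.Type, r.Value));
}
```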
---
## 10. Cancellation requirements (router side)
* The protocol MUST define a `FrameType.Cancel` with:
* A `CorrelationId` indicating which request to cancel.
* An optional payload containing a reason code (e.g. `"ClientDisconnected"`, `"Timeout"`, `"PayloadLimitExceeded"`).
* The router MUST send CANCEL frames when:
* The HTTP client disconnects (ASP.NET `HttpContext.RequestAborted` fires) while the request is in progress.
* The routers effective timeout for the request elapses, and no response has been received.
* The router detects payload/memory limit breaches and has to abort the request.
* The router is shutting down and explicitly aborts in-flight requests (if implemented).
* The router MUST:
* Stop forwarding any additional REQUEST_STREAM_DATA to the microservice once a CANCEL is sent.
* Stop reading any remaining response frames for that correlation and either:
* Discard them.
* Or treat them as late, log them, and ignore them.
* For streaming responses, if the HTTP client disconnects or router cancels:
* The router MUST stop writing to the HTTP response and treat any subsequent frames as ignored.
---
## 11. Configuration and YAML requirements
* `__Libraries/StellaOps.Router.Config` MUST handle:
* Binding router config from JSON/appsettings + YAML + environment variables.
* Static service definitions:
* ServiceName.
* DefaultVersion.
* DefaultTransport.
* Endpoint list (Method, Path) with default timeouts, requiringClaims, streaming flags.
* Static instance definitions (optional):
* ServiceName, Version, Region, supported transports, plugin-specific settings.
* Global payload limits (`PayloadLimits`).
* Router YAML config MUST support hot-reload:
* Changes SHOULD be picked up at runtime without restarting the gateway.
* Hot-reload MUST cause in-memory routing state to be updated, including:
* New or removed services/endpoints.
* New or removed instances (static).
* Updated payload limits.
* Microservice YAML config MUST be optional and used for endpoint-level overrides only, not for identity or router pool configuration.
* The router pool for microservices MUST be configured via code and MAY be backed by YAML (with hot-plug / reconnection behavior) if desired.
---
## 12. Library naming / repo structure requirements
* The router configuration library MUST be named `__Libraries/StellaOps.Router.Config`.
* The microservice SDK library MUST be named `__Libraries/StellaOps.Microservice`.
* The gateway webservice MUST be named `StellaOps.Gateway.WebService`.
* There MUST be a “common” library for shared types and abstractions (e.g. `__Libraries/StellaOps.Router.Common`).
* Documentation files MUST include at least:
* `Stella Ops Router.md` (what it is, why, high-level architecture).
* `Stella Ops Router - Webserver.md` (how the webservice works).
* `Stella Ops Router - Microservice.md` (how the microservice SDK works and is implemented).
* `Stella Ops Router - Common.md` (common components and how they are implemented).
* `Migration of Webservices to Microservices.md`.
* `Stella Ops Router Documentation.md` (doc structure & guidance).
---
## 13. Documentation & developer-experience requirements
* The docs MUST be detailed; “do not spare details” implies:
* High-fidelity, concrete examples and not hand-wavy descriptions.
* For average C# developers, documentation MUST cover:
* Exact .NET / ASP.NET Core target version and runtime baseline.
* Required NuGet packages (logging, serialization, YAML parsing, RabbitMQ client, etc.).
* Exact serialization formats for frames and payloads (JSON vs MessagePack vs others).
* Exact framing rules for each transport (length-prefix for TCP/TLS, datagrams for UDP, exchanges/queues for Rabbit).
* Concrete sample `Program.cs` for:
* A gateway node.
* A microservice.
* Example endpoint implementations:
* Typed (with and without request).
* Raw streaming endpoints for large payloads.
* Example router YAML and microservice YAML with realistic values.
* Error and HTTP status mapping policy:
* E.g. “version not found → 404 or 400; no instance available → 503; timeout → 504; payload too large → 413.”
* Guidelines on:
* When to use UDP vs TCP vs RabbitMQ.
* How to configure and validate certificates for the certificate transport.
* How to write cancellation-friendly handlers (proper use of `CancellationToken`).
* Testing strategies: local dev setups, integration test harnesses, how to run router + microservice together for tests.
* Clear explanation of config precedence:
* Code options vs YAML vs microservice defaults vs Authority for claims.
* Documentation MUST answer for each major concept:
* What it is.
* Why it exists.
* How it works.
* How to use it (with examples).
* What happens when it is misused and how to debug issues.
---
## 14. Migration requirements
* There MUST be a defined migration path from `StellaOps.*.WebServices` to `StellaOps.*.Microservices`.
* Migration documentation MUST cover:
* Inventorying existing HTTP routes (Method + Path).
* Strategy A (in-place adaptation):
* Adding microservice SDK into WebService.
* Declaring endpoints with `[StellaEndpoint]`.
* Wrapping existing controller logic in handlers.
* Connecting to the router and validating registration.
* Gradually shifting traffic from direct WebService HTTP ingress to gateway routing.
* Strategy B (split):
* Extracting domain logic into shared libraries.
* Creating a dedicated microservice project using the SDK.
* Mapping routes and handlers.
* Phasing out or repurposing the original WebService.
* Ensuring cancellation tokens are wired throughout migrated code.
* Handling streaming endpoints (large uploads/downloads) via `IRawStellaEndpoint` and streaming support instead of naive buffered HTTP controllers.
---
If you want, I can next turn this requirement set into a machine-readable checklist (e.g. JSON or YAML) or derive a first-pass implementation roadmap directly from these requirements.