Make remote localization startup non-blocking

This commit is contained in:
master
2026-03-11 10:07:30 +02:00
parent 7a1c090f2e
commit 5c874c8f64
8 changed files with 299 additions and 18 deletions


@@ -0,0 +1,77 @@
# Sprint 20260311_001 - Graph Remote Localization Startup Nonblocking
## Topic & Scope
- Remove the scratch-setup startup bottleneck where Graph API can stay dark for an extended period while remote localization overrides load before Kestrel binds.
- Treat remote translation bundles as optional startup enrichment, not a dependency that can hold a service offline during a fresh compose bootstrap.
- Verify the fix with focused localization-library tests, a rebuilt Graph image, and live service/browser checks on the scratch stack.
- Working directory: `src/__Libraries/StellaOps.Localization`.
- Allowed coordination edits: `src/Graph/**`, `src/__Libraries/__Tests/**`, `devops/compose/**`, `docs/modules/graph/architecture.md`, `docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md`.
- Expected evidence: targeted localization test output, rebuilt Graph runtime health, and live verification artifacts showing the scratch stack no longer masks the startup fault.
## Dependencies & Concurrency
- Depends on the existing scratch-reset stack being up so the late-start Graph behavior can be reproduced and rechecked.
- Safe parallelism: stay inside the localization library, Graph service, and the listed docs; avoid unrelated web search or component-revival slices.
## Documentation Prerequisites
- `AGENTS.md`
- `src/Graph/AGENTS.md`
- `docs/modules/graph/architecture.md`
- `docs/qa/feature-checks/FLOW.md`
## Delivery Tracker
### GRAPH-LOC-001 - Diagnose the real startup gate
Status: DONE
Dependency: none
Owners: QA, Developer
Task description:
- Reproduce the Graph startup fault from the scratch stack and separate product failures from harness noise.
- Capture why the container can stay unhealthy during scratch setup even though the same binary later starts when rerun interactively.
Completion criteria:
- [x] Container/runtime evidence shows where startup is being gated.
- [x] The diagnosis identifies the shared-library behavior that needs correction.
### GRAPH-LOC-002 - Make remote localization startup-safe
Status: DONE
Dependency: GRAPH-LOC-001
Owners: Architect, Developer
Task description:
- Change the shared localization bootstrap so remote bundle overrides are bounded and parallelized per provider, preserving deterministic merge order while preventing optional remote fetches from serially blocking service readiness.
- Keep the contract library-centric so Graph is fixed through the real root cause rather than a service-specific workaround.
Completion criteria:
- [x] Remote bundle fetches have an explicit bounded timeout.
- [x] The translation registry no longer waits serially on each locale within a single provider.
- [x] Focused tests cover timeout handling and concurrent locale loading.
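
The bounded, concurrent loading described in this task can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the `StellaOps.Localization` implementation: the function names, the `fake_fetch` stand-in, and the timeout value are all hypothetical.

```python
import asyncio

# Hypothetical value; the sprint only states that the timeout is bounded and short.
REMOTE_FETCH_TIMEOUT_SECONDS = 0.05


async def fetch_with_timeout(fetch, locale, timeout):
    """Return the remote bundle for one locale, or {} on timeout or I/O failure."""
    try:
        return await asyncio.wait_for(fetch(locale), timeout)
    except (asyncio.TimeoutError, OSError):
        # Remote overrides are optional, so a slow or failing fetch yields no override.
        return {}


async def load_provider(fetch, locales, timeout=REMOTE_FETCH_TIMEOUT_SECONDS):
    """Fetch every locale of one provider concurrently, then merge in locale order."""
    bundles = await asyncio.gather(
        *(fetch_with_timeout(fetch, locale, timeout) for locale in locales)
    )
    # gather() returns results in input order, so the merge stays deterministic
    # regardless of which fetch finished first.
    return {locale: bundle for locale, bundle in zip(locales, bundles)}


async def fake_fetch(locale):
    """Hypothetical remote endpoint: de-DE hangs, every other locale answers at once."""
    if locale == "de-DE":
        await asyncio.sleep(10)
    return {"greeting": f"hello from {locale}"}
```

Running `asyncio.run(load_provider(fake_fetch, ["en-US", "de-DE"]))` returns after roughly one timeout interval rather than the sum of all fetch times, with an empty override for the hung locale.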
### GRAPH-LOC-003 - Rebuild and prove the scratch-stack behavior
Status: DONE
Dependency: GRAPH-LOC-002
Owners: QA
Task description:
- Rebuild the affected runtime, redeploy the live stack, and verify Graph startup and the related UI surface on the scratch environment.
- Record the new behavior in sprint evidence and module docs.
Completion criteria:
- [x] Graph container becomes healthy promptly after redeploy.
- [x] Focused live checks confirm the reachability/security surfaces no longer show the backend-unavailable fallback on this defect path.
- [x] Docs and sprint log reflect the startup contract change.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-11 | Sprint created after a fresh scratch rebuild showed `stellaops-graph-api` remaining unhealthy while the frontdoor route sweep stayed green. | Developer |
| 2026-03-11 | Reproduced that the Graph binary starts normally on host and in-container when rerun interactively, but the scratch container can stay dark for a long interval before eventually binding. The shared startup gate is `LoadTranslationsAsync()`, which loads remote bundle overrides before `Run()` and executes one remote fetch per locale serially. | QA |
| 2026-03-11 | Implemented the shared-library fix in `StellaOps.Localization`: remote bundle fetches now use a bounded per-request timeout and locale loads run concurrently within a provider while merging back in deterministic order. Added focused tests in `src/__Libraries/__Tests/StellaOps.Localization.Tests` covering timeout fallback and concurrent load behavior. | Developer |
| 2026-03-11 | Verified the fix on the live scratch stack by rebuilding only `graph-api`, stopping Platform, force-recreating the Graph container, and confirming immediate recovery: `stellaops-graph-api` reported `healthy` and `GET http://127.1.0.20/healthz` returned `200` while Platform was still down. Then brought Platform back and ran a live authenticated Playwright check on `/security/supply-chain-data/graph`, which passed with zero console errors, zero request failures, and zero error responses. | QA |
## Decisions & Risks
- Decision: fix the startup contract in `StellaOps.Localization` instead of adding Graph-only retries, because remote translation overrides are used by many services and should never gate service availability during scratch bootstrap.
- Risk: changing translation loading order could accidentally alter merge determinism.
- Mitigation: keep provider priority ordering intact, parallelize only within a provider, and merge results back in deterministic locale order.
- Decision: bounded remote translation fetches default to a short timeout because remote overrides are optional enrichment; if Platform is unavailable during scratch bootstrap, services must prefer embedded bundles and come online instead of waiting unboundedly on localization.
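
The fallback contract in these decisions can be illustrated with a small sketch, assuming each bundle is a flat key/value map. The function names and translation keys below are hypothetical and do not come from the library.

```python
def merge_layers(layers):
    """Merge translation layers in priority order; later layers override earlier keys."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def effective_bundle(common, service, remote_override):
    """Build the bundle a service uses at startup.

    remote_override is None when Platform was unreachable or the fetch timed out;
    in that case the embedded layers alone are used.
    """
    return merge_layers([common, service, remote_override or {}])
```

Because the remote layer is applied last and may be empty, a missing override changes only which strings are shown, never whether the service can start.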
## Next Checkpoints
- Add focused localization tests before changing runtime behavior.
- Rebuild the Graph image and redeploy the stack immediately after the library fix.


@@ -68,6 +68,7 @@ The edge metadata system provides explainability for graph relationships:
- Graph API now initializes localization via `AddStellaOpsLocalization(...)`, `AddTranslationBundle(...)`, `AddRemoteTranslationBundles()`, `UseStellaOpsLocalization()`, and `LoadTranslationsAsync()`.
- Locale resolution order for API messages is deterministic: `X-Locale` header -> `Accept-Language` header -> default locale (`en-US`).
- Translation layering is deterministic: shared embedded `common` bundle -> Graph embedded bundle (`Translations/*.graph.json`) -> Platform runtime override bundle.
- Remote Platform override fetches are bounded by a timeout and loaded concurrently per locale within each provider, so scratch bootstrap cannot hold the Graph API offline while optional translation overrides load.
- This rollout localizes selected error paths (for example, edge/export not found, invalid reason, and tenant/auth validation text) for `en-US` and `de-DE`.
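
The locale resolution order above can be sketched as below. This is a simplified illustration in Python, not the Graph API code: it assumes exact tag matching, ignores `Accept-Language` quality values, and assumes an unsupported `X-Locale` value falls through to the next step. The function name and the supported-locale tuple are hypothetical.

```python
DEFAULT_LOCALE = "en-US"
SUPPORTED_LOCALES = ("en-US", "de-DE")  # the locales named in this rollout


def resolve_locale(headers, supported=SUPPORTED_LOCALES, default=DEFAULT_LOCALE):
    """Resolve a locale: X-Locale header, then Accept-Language, then the default."""
    x_locale = headers.get("X-Locale", "").strip()
    if x_locale in supported:
        return x_locale
    # Walk Accept-Language entries in the order listed, dropping any ";q=" suffix.
    for part in headers.get("Accept-Language", "").split(","):
        tag = part.split(";")[0].strip()
        if tag in supported:
            return tag
    return default
```

For example, `resolve_locale({"Accept-Language": "fr-FR, de-DE;q=0.8"})` skips the unsupported `fr-FR` entry and selects `de-DE`.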
## 4) Storage considerations