Update docs, sprint plans, and compose configuration
Add 12 new sprint files (Integrations, Graph, JobEngine, FE, Router, AdvisoryAI), archive completed scheduler UI sprint, update module architecture docs (router, graph, jobengine, web, integrations), and add Gitea entrypoint script for local dev.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,48 @@
# Sprint 20260403-003 - Console Production Bundle Budget

## Topic & Scope
- Restore deterministic scratch rebuilds by unblocking the Angular production console image build.
- Reconcile the frontend bundle budget with the current production output so the Docker matrix can finish while preserving a meaningful guardrail.
- Capture the rebuild evidence and any remaining budget-related risks for follow-up optimization work.
- Working directory: `src/Web/StellaOps.Web`.
- Expected evidence: `npm run build -- --configuration=production`, `devops/docker/build-all.ps1`, updated sprint log.

## Dependencies & Concurrency
- Depends on the current `devops/docker/build-all.ps1` rebuild lane and the Docker console image path in `devops/docker/Dockerfile.console`.
- Safe to keep scoped to the web workspace and sprint docs; no cross-module code edits expected.

## Documentation Prerequisites
- `src/Web/StellaOps.Web/AGENTS.md`
- `docs/modules/platform/architecture-overview.md`
- `src/Web/StellaOps.Web/angular.json`

## Delivery Tracker

### TASK-1 - Unblock console production image build
Status: DONE
Dependency: none
Owners: Developer
Task description:
- The scratch Stella Ops rebuild completed 58 backend/service images successfully but failed on the final `console` image because the Angular production build exceeded the configured `initial` budget in `src/Web/StellaOps.Web/angular.json`.
- Update the budget guardrail or equivalent frontend build configuration just enough to reflect the current production baseline, then rerun the production build and the Docker image build to confirm the rebuild completes end-to-end.

Completion criteria:
- [x] `src/Web/StellaOps.Web/angular.json` is updated with a justified production bundle budget guardrail.
- [x] `npm run build -- --configuration=production --output-path=dist` completes successfully.
- [x] `devops/docker/build-all.ps1` or an equivalent targeted console rebuild completes successfully for `stellaops/console:dev`.
- [x] Sprint evidence captures the original failure and the final passing verification.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-03 | Sprint created after scratch rebuild failure isolated the `console` Docker image to an Angular production bundle budget overrun. | Developer |
| 2026-04-03 | Raised the production `initial` bundle guardrail to the current 2.08 MB baseline, removed an unused dashboard import, reran `npm run build -- --configuration=production --output-path=dist`, and confirmed the targeted `stellaops/console:dev` Docker rebuild passed. | Developer |

## Decisions & Risks
- The production console build failed with `bundle initial exceeded maximum budget`; the observed output was 2.08 MB versus the configured 2.00 MB error threshold.
- The production guardrail now warns at 2.2 MB and errors at 2.4 MB, which leaves headroom above the current 2.08 MB baseline while preserving a hard failure threshold for further growth.
- The component-style warnings in setup wizard styles remain below the current error threshold and do not block the Docker image build, but they should stay visible for later CSS reduction work.
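
The two-tier guardrail above can be sketched as a small classifier. This is an illustration only: the type and function names are hypothetical and not taken from the repo, and sizes are compared as plain MB numbers rather than the way the Angular CLI parses budget strings.

```typescript
// Illustrative only: names are hypothetical, not part of the StellaOps codebase.
type BudgetVerdict = "ok" | "warning" | "error";

interface BudgetThresholds {
  warningMb: number;
  errorMb: number;
}

// Thresholds recorded in this sprint for the production `initial` bundle.
const productionInitialBudget: BudgetThresholds = { warningMb: 2.2, errorMb: 2.4 };

// Classify a measured bundle size (in MB) against the two-tier guardrail.
function classifyBundle(sizeMb: number, budget: BudgetThresholds): BudgetVerdict {
  if (sizeMb > budget.errorMb) {
    return "error";
  }
  if (sizeMb > budget.warningMb) {
    return "warning";
  }
  return "ok";
}

const baselineVerdict = classifyBundle(2.08, productionInitialBudget); // "ok"
const growthVerdict = classifyBundle(2.3, productionInitialBudget); // "warning"
const overrunVerdict = classifyBundle(2.5, productionInitialBudget); // "error"
```

Under these thresholds the 2.08 MB baseline passes quietly, moderate growth produces a warning, and anything above 2.4 MB fails the build.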

## Next Checkpoints
- Re-run the Angular production build after the budget change.
- Rebuild the `console` image and then resume stack startup from the clean rebuild state.
@@ -0,0 +1,52 @@
# Sprint 20260403-004 - Local Integration Catalog Bootstrap

## Topic & Scope
- Provision every provider-backed local integration service or fixture into the Integrations catalog for tenant `default`.
- Validate live connection and health against compose real services and QA fixtures, including the heavy-profile GitLab service.
- Record the setup gaps discovered during shell/API bootstrap so local bring-up is reproducible.
- Working directory: `src/Integrations/`.
- Expected evidence: `docker compose` service health, `/api/v1/integrations` catalog entries, targeted Integrations test results.

## Dependencies & Concurrency
- Depends on `devops/compose/docker-compose.stella-ops.yml`, `devops/compose/docker-compose.integrations.yml`, and `devops/compose/docker-compose.integration-fixtures.yml` sharing the `stellaops` network.
- Cross-module runtime touchpoints only: `devops/compose/*` hosts the external services, and `docs/integrations/LOCAL_SERVICES.md` documents the bootstrap path.

## Documentation Prerequisites
- `docs/integrations/LOCAL_SERVICES.md`
- `devops/compose/README.md`
- `src/Integrations/AGENTS.md`

## Delivery Tracker

### TASK-1 - Bootstrap local Integration Catalog entries
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Use shell-based API calls against `StellaOps.Integrations.WebService` to create or update every provider-backed local integration entry exposed by `/api/v1/integrations/providers`, excluding the test-only `InMemory` provider.
- Bring up the compose-backed real services and QA fixtures, bind GitLab through Vault-backed `authref://vault/gitlab#access-token`, and verify `/test` plus `/health` for each entry.

Completion criteria:
- [x] Real services and QA fixtures required by the local integration catalog are running.
- [x] Provider-backed local integrations are present in tenant `default` and return successful `/test` results.
- [x] GitLab heavy-profile SCM integration is green with a Vault-backed token reference.
- [x] Targeted Integrations test projects pass and setup/documentation gaps are recorded.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-03 | Bootstrapped 10 local integration catalog entries for tenant `default`, including Harbor/GitHub App fixtures, Gitea, Jenkins, Nexus, Docker Registry, Vault, Consul, runtime-host fixture, and heavy-profile GitLab. Verified `/test` and `/health` for all entries. | Developer |
| 2026-04-03 | Ran targeted test projects: `StellaOps.Integrations.Tests` (57 passed) and `StellaOps.Integrations.Plugin.Tests` (12 passed). | Developer |
| 2026-04-03 | Corrected local setup docs/comments after live validation showed stale credential and provider notes. | Developer |

## Decisions & Risks
- The shipped `stella config integrations` CLI path is still stubbed/sample-data only; live provisioning currently requires shell/API calls against `StellaOps.Integrations.WebService`.
- `POST /api/v1/integrations/{id}/discover` is documented in higher-level API docs but is not implemented by `IntegrationEndpoints`, so local bootstrap is CRUD + test + health only.
- Gitea and Jenkins compose comments previously implied precreated admin users; live checks showed Gitea still needs first-run user creation and Jenkins defaults to anonymous access unless manually hardened.
- GitLab SCM needed a real PAT before the current connector would pass; the token is stored in Vault at `secret/gitlab` under `access-token`.
- Current provider discovery does not expose MinIO/S3 or advisory/feed-mirror connectors, so those local services and fixtures cannot be added through the Integration Catalog today.
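
As a reading aid for the `authref://vault/gitlab#access-token` reference used above, the sketch below splits such a string into its parts. The interpretation (provider, then secret path, then key after `#`) is an assumption inferred from the Vault location noted in this sprint; the real resolver in the Integrations service may parse it differently, and all names here are hypothetical.

```typescript
// Hypothetical shape: assumes authref://<provider>/<path>#<key>.
interface AuthRef {
  provider: string; // e.g. "vault"
  path: string; // e.g. "gitlab"
  key: string; // e.g. "access-token"
}

// Returns null when the input does not look like an authref URI.
function parseAuthRef(ref: string): AuthRef | null {
  const match = /^authref:\/\/([^\/#]+)\/([^#]+)#(.+)$/.exec(ref);
  if (match === null) {
    return null;
  }
  return { provider: match[1], path: match[2], key: match[3] };
}

const gitlabRef = parseAuthRef("authref://vault/gitlab#access-token");
```

Keeping only the reference in the catalog entry means the PAT itself stays in Vault and never appears in the integration payload.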

## Next Checkpoints
- Add backend-backed CLI verbs for integration create/update/test so shell/API bootstrap is no longer required.
- Implement or remove the documented `discover` expectation so docs and service behavior converge.
- Decide whether local compose services should preseed authenticated users/tokens or keep the current manual bootstrap model.
@@ -0,0 +1,124 @@
# Sprint 20260404-001 - Integrations Discovery and CLI Live Catalog

## Topic & Scope
- Converge the Integrations service with the documented contract by implementing discovery and richer provider metadata.
- Remove the sample-data behavior from `stella config integrations` and replace it with live backend-backed CRUD, health, impact, and discovery flows.
- Expose the missing built-in provider identities that already map to local fixtures and compose-backed services, including GitLab CI, GitLab Container Registry, and feed mirror providers.
- Remove the product-path scripts mock binding from the web console so `/ops/scripts` fails visibly against the real backend surface instead of shipping sample state.
- Add object-storage coverage for local MinIO through the Integration Catalog and remove additional trust-admin sample-data fallbacks where a live API already exists.
- Keep test-only providers available for development and tests, but hide them from default user-facing provider listings.
- Working directory: `src/Integrations/`.
- Expected evidence: targeted Integrations and CLI test runs, updated docs, and working `config integrations` commands against the live service.

## Dependencies & Concurrency
- Depends on `docs/architecture/integrations.md`, `docs/modules/release-orchestrator/integrations/overview.md`, and `docs/modules/release-orchestrator/modules/integration-hub.md` for the public contract shape.
- Cross-module edits allowed for `src/Cli/**`, `src/Web/StellaOps.Web/**`, `docs/modules/cli/**`, `docs/integrations/**`, and `docs/implplan/**` to deliver the CLI parity, product-path stub removal, and documentation sync required by this sprint.
- Safe parallelism: plugin-specific discovery additions can proceed independently from CLI command wiring once the contract DTOs are stable.

## Documentation Prerequisites
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/code-of-conduct/TESTING_PRACTICES.md`
- `docs/architecture/integrations.md`
- `docs/modules/release-orchestrator/integrations/overview.md`
- `src/Integrations/AGENTS.md`
- `src/Cli/AGENTS.md`
- `src/Cli/StellaOps.Cli/AGENTS.md`

## Delivery Tracker

### TASK-1 - Implement documented discovery contract in Integrations
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add an optional discovery capability to connector plugins, implement `POST /api/v1/integrations/{id}/discover`, and return stable provider metadata that advertises discovery support and supported resource types.
- Keep unsupported providers deterministic: test-only providers are excluded from default provider listings, unsupported discovery requests return a client error, and missing integrations still return `404`.

Completion criteria:
- [x] Discovery DTOs and optional plugin interface are added in `src/Integrations/__Libraries`.
- [x] `IntegrationService` and `IntegrationEndpoints` expose discovery and richer provider metadata.
- [x] At least the local priority providers expose discovery for registry, SCM, or CI resources.
- [x] Targeted Integrations tests cover discovery success, unsupported resource types, and test-only provider filtering.
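
The deterministic outcomes described for TASK-1 can be summarized as a small decision function. This is a sketch under stated assumptions: the sprint only says "client error" for unsupported requests, so the `400` below is a placeholder, and the field names and the `repositories` resource type are hypothetical rather than the actual DTOs.

```typescript
// Hypothetical metadata shape; the real provider DTO may differ.
interface ProviderMetadata {
  supportsDiscovery: boolean;
  supportedResourceTypes: string[];
}

interface DiscoveryOutcome {
  status: number;
  reason: string;
}

// Decide the response for POST /api/v1/integrations/{id}/discover.
// `provider` is undefined when the integration id does not exist.
function resolveDiscovery(provider: ProviderMetadata | undefined, resourceType: string): DiscoveryOutcome {
  if (provider === undefined) {
    return { status: 404, reason: "integration not found" };
  }
  if (!provider.supportsDiscovery) {
    // Assumption: 400 stands in for the unspecified "client error".
    return { status: 400, reason: "provider does not support discovery" };
  }
  if (provider.supportedResourceTypes.indexOf(resourceType) === -1) {
    return { status: 400, reason: "unsupported resource type" };
  }
  return { status: 200, reason: "ok" };
}

const scmProvider: ProviderMetadata = { supportsDiscovery: true, supportedResourceTypes: ["repositories"] };
const supported = resolveDiscovery(scmProvider, "repositories");
```

Checking the missing-integration case first keeps `404` stable regardless of what the provider supports.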

### TASK-2 - Replace sample-only config integrations CLI flow
Status: DONE
Dependency: TASK-1
Owners: Developer
Task description:
- Remove the hardcoded integration sample data from the CLI and replace it with live calls through `IBackendOperationsClient`.
- Keep `config integrations list` and `test`, and add the missing verbs needed to fully manage the live catalog from the CLI.

Completion criteria:
- [x] `IBackendOperationsClient` and `BackendOperationsClient` support integrations list/get/providers/create/update/delete/test/health/impact/discover.
- [x] `stella config integrations` exposes live backend verbs with deterministic table and JSON output.
- [x] Deprecated aliases from `integrations *` to `config integrations *` cover the supported verb set.
- [x] Targeted CLI tests cover JSON output, argument mapping, and backend call routing for the new integrations commands.
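
The deprecated alias routing listed in the criteria above amounts to an argument rewrite. The sketch below shows the idea only; the actual CLI is a .NET application, and the function and type names here are hypothetical.

```typescript
interface ResolvedCommand {
  args: string[];
  deprecated: boolean; // true when the caller used the old `integrations *` form
}

// Rewrite `integrations <verb> ...` to `config integrations <verb> ...`.
function resolveIntegrationsAlias(args: string[]): ResolvedCommand {
  if (args.length > 0 && args[0] === "integrations") {
    return { args: ["config"].concat(args), deprecated: true };
  }
  return { args: args, deprecated: false };
}

const rewritten = resolveIntegrationsAlias(["integrations", "list", "--json"]);
```

The `deprecated` flag lets the caller print a migration hint while still running the canonical command.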

### TASK-3 - Sync docs and verification evidence
Status: DONE
Dependency: TASK-2
Owners: Developer
Task description:
- Update the architecture and operator docs so they describe the implemented discovery and CLI behavior instead of the previous stubbed path.
- Record concrete verification evidence and any remaining rough edges in this sprint.

Completion criteria:
- [x] Docs reference the real discovery endpoint shape and provider metadata fields.
- [x] CLI/operator docs mention the live `config integrations` workflow.
- [x] Execution Log records the test commands and outcomes.
- [x] Decisions & Risks captures any remaining gaps or deferred provider coverage.

### TASK-4 - Remove product-path scripts mock binding
Status: DONE
Dependency: TASK-2
Owners: Developer
Task description:
- Replace the web console's direct `MockScriptsClient` binding with the HTTP-backed client so the shipped UI no longer serves sample script data in production.
- Surface backend failures in the scripts UI instead of silently falling back to the old mock behavior.

Completion criteria:
- [x] `SCRIPTS_API` resolves to the HTTP client in the shipped Angular app.
- [x] `/ops/scripts` pages surface backend failures with explicit error banners.
- [x] Production Angular build passes after the binding change.

### TASK-5 - Expose MinIO and remove trust-admin audit sample fallbacks
Status: DONE
Dependency: TASK-2
Owners: Developer
Task description:
- Extend the Integrations provider/type model so local MinIO can be represented in the live catalog without shell-side special casing.
- Replace the trust-admin air-gap and incident audit sample-data behavior with the existing Authority audit endpoints, and keep unsupported incident write actions explicitly read-only.

Completion criteria:
- [x] `GET /api/v1/integrations/providers` includes an object-storage provider suitable for local MinIO.
- [x] Focused backend tests cover the object-storage connector and plugin discovery.
- [x] Trust-admin air-gap and incident audit routes use live audit clients instead of embedded sample records.
- [x] Production Angular build passes with the trust-admin changes.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-04 | Sprint created and TASK-1 started to implement discovery and replace the sample-only integrations CLI path. | Developer |
| 2026-04-04 | Implemented live discovery DTOs, `/api/v1/integrations/{id}/discover`, provider metadata flags, and discovery-capable registry/SCM/CI plugins. | Developer |
| 2026-04-04 | Replaced `stella config integrations` sample data with live backend CRUD/test/health/impact/discover commands and deprecated route aliases. | Developer |
| 2026-04-04 | Added GitLab CI, GitLab Container Registry, and feed mirror provider identities; updated docs and local-service guidance. | Developer |
| 2026-04-04 | Switched the web scripts surface to `ScriptsHttpClient` and added visible error handling for list/detail actions. | Developer |
| 2026-04-04 | Added the `S3Compatible` object-storage provider for local MinIO and rewired trust-admin audit pages to Authority audit endpoints with explicit read-only/error behavior. | Developer |
| 2026-04-04 | Verification: `dotnet test src/Integrations/__Tests/StellaOps.Integrations.Tests/StellaOps.Integrations.Tests.csproj -v minimal` passed (68/68). | Developer |
| 2026-04-04 | Verification: `dotnet test src/Integrations/__Tests/StellaOps.Integrations.Plugin.Tests/StellaOps.Integrations.Plugin.Tests.csproj -v minimal` passed (17/17). | Developer |
| 2026-04-04 | Verification: `dotnet build src/Cli/StellaOps.Cli/StellaOps.Cli.csproj -v minimal` passed. | Developer |
| 2026-04-04 | Verification: `npm run build -- --configuration=production --output-path=dist` passed for `src/Web/StellaOps.Web` with only the pre-existing setup-wizard component-style budget warnings. | Developer |

## Decisions & Risks
- This sprint intentionally keeps deterministic test-only fixtures, but removes product-path sample data from `stella config integrations`.
- Provider expansion now covers the missing local GitLab CI, GitLab Container Registry, feed mirror provider identities, and MinIO through the `ObjectStorage`/`S3Compatible` path.
- Feed mirror provider entries currently expose health/test coverage only. They make the catalog honest about what can be connected, but they do not add feed-resource discovery on top of Concelier yet.
- The CLI command tests exist, but `dotnet test` filtering is still unreliable under the repo's Microsoft.Testing.Platform setup. A previous full-suite run executed 1218 tests and surfaced 7 unrelated migration-consolidation failures outside this sprint's write scope.
- `/ops/scripts` now uses the real HTTP surface. Until a scripts backend is implemented at `/api/v2/scripts`, operators will see explicit load/save/validation errors instead of sample data.
- Trust-admin audit pages now read from live Authority audit endpoints. Incident mutation actions remain intentionally read-only until command endpoints exist; the audit view no longer simulates those actions.
- `app.config.ts` no longer registers a broad set of unused mock clients in the shipped provider graph, but many other web routes still retain mock implementations or fallback data outside this sprint's write scope.
- Existing unrelated dirty worktree changes in `src/Workflow/**` and `src/__Libraries/StellaOps.ElkSharp/**` are not part of this sprint and will remain untouched.

## Next Checkpoints
- Replace remaining product-path web sample-data surfaces using the same pattern applied to `/ops/scripts` and trust-admin audit routes: real client binding plus explicit degraded/error UI.
- Add deeper object-storage semantics if bucket/object discovery or credentialed operations need to be represented beyond health/test coverage.
@@ -0,0 +1,56 @@
# Sprint 20260404-002 - FE Evidence And Topology Live Surfaces

## Topic & Scope
- Remove product-path mock state from the Evidence Center page and the environments command page.
- Reuse the live release-evidence and topology APIs that already exist, and surface explicit empty and error states instead of demo data.
- Working directory: `src/Web/StellaOps.Web/`.
- Expected evidence: Angular build, focused web tests where practical, updated module docs, and sprint execution log.

## Dependencies & Concurrency
- Depends on the current app DI work in `SPRINT_20260404_001_Integrations_discovery_and_cli_live_catalog.md` remaining intact.
- Safe to run in parallel with backend-only deployment and findings work as long as touched web files do not overlap.

## Documentation Prerequisites
- `docs/modules/web/architecture.md`
- `docs/modules/jobengine/architecture.md`

## Delivery Tracker

### FE-EVID-002 - Replace Evidence Center sample state with live packet flows
Status: DONE
Dependency: none
Owners: Developer / Implementer, Documentation author
Task description:
- Rewire `features/evidence/evidence-center-page.component.ts` to use the shipped release-evidence client/store path instead of local packet arrays and `console.log` actions.
- Use the existing audit-bundle client for page-level audit exports, and keep verify/export/raw packet actions routed through real HTTP calls.

Completion criteria:
- [x] Evidence Center loads packet data from the release-evidence API path rather than local sample arrays.
- [x] Packet drawer actions trigger live verify/export/raw flows instead of placeholder handlers.
- [x] Page-level audit bundle export uses the existing audit-bundle API and surfaces success or failure to the operator.

### FE-TOPO-002 - Remove environments-command automatic demo fallback
Status: DONE
Dependency: none
Owners: Developer / Implementer, Documentation author
Task description:
- Remove the embedded mock environments, readiness reports, and topology layout fallback from the environments command page.
- Keep live reads from the topology APIs, and add clear no-data / setup-needed / request-failed states for both command and topology views.

Completion criteria:
- [x] `environments-command.component.ts` no longer populates demo environments or a demo topology layout.
- [x] Empty and error states are explicit and user-visible.
- [x] Topology view stays functional when the layout endpoint returns data and behaves cleanly when it does not.
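
The no-data / setup-needed / request-failed states named in FE-TOPO-002 map naturally onto a discriminated union. The sketch below is one way to model them and is not the component's actual code; in particular the `configured` flag used to detect the setup-needed case is a hypothetical input.

```typescript
// Hypothetical result of a topology API read.
type LoadResult<T> =
  | { outcome: "success"; items: T[]; configured: boolean }
  | { outcome: "failure"; error: string };

// Explicit states the page can render instead of demo data.
type ViewState<T> =
  | { kind: "ready"; items: T[] }
  | { kind: "no-data" }
  | { kind: "setup-needed" }
  | { kind: "request-failed"; message: string };

function toViewState<T>(result: LoadResult<T>): ViewState<T> {
  if (result.outcome === "failure") {
    return { kind: "request-failed", message: result.error };
  }
  if (!result.configured) {
    return { kind: "setup-needed" };
  }
  if (result.items.length === 0) {
    return { kind: "no-data" };
  }
  return { kind: "ready", items: result.items };
}

const emptyState = toViewState<string>({ outcome: "success", items: [], configured: true });
```

Because every branch returns a named state, there is no path left on which the page could quietly substitute invented regions or environments.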

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-04 | Sprint created; implementation started for Evidence Center and topology command live-surface cleanup. | Developer |
| 2026-04-04 | Replaced Evidence Center sample state with live release-evidence flows; removed topology demo fallback; verified with Angular production build. | Developer |

## Decisions & Risks
- Evidence Center will reuse the existing release-evidence API/store even if the backend detail endpoint is still shallow; the page must stop fabricating packets locally.
- Topology command will prefer explicit empty/error states over silently inventing regions and environments.

## Next Checkpoints
- 2026-04-04: land web patches and verify with a production Angular build.
@@ -0,0 +1,57 @@
# Sprint 20260404-003 - JobEngine Deployment Run Parity

## Topic & Scope
- Replace deployment compatibility seed responses with a live in-memory deployment store and add real deployment creation.
- Align deployment strategy vocabulary with the shipped web client and remove create-deployment wizard fallback behavior.
- Working directory: `src/JobEngine/`.
- Expected evidence: targeted JobEngine tests, Angular build for wizard integration, updated module docs, and sprint execution log.

## Dependencies & Concurrency
- Depends on web deployment consumers continuing to target `/api/v1/release-orchestrator/deployments`.
- Allows cross-module edits in `src/Web/StellaOps.Web/` and `src/ReleaseOrchestrator/` for wizard/client contract alignment.

## Documentation Prerequisites
- `docs/modules/jobengine/architecture.md`
- `docs/modules/release-orchestrator/architecture.md`
- `docs/modules/web/architecture.md`

## Delivery Tracker

### JOB-DEP-003 - Replace seeded deployment compatibility endpoints with a live store
Status: DONE
Dependency: none
Owners: Developer / Implementer, Documentation author
Task description:
- Introduce a real deployment state store for list/detail/events/logs/metrics and lifecycle mutations in the JobEngine web service.
- Add a canonical create endpoint for deployment runs and persist state changes in the same live store rather than returning canned results.

Completion criteria:
- [x] `/api/v1/release-orchestrator/deployments` list/detail/events/logs/metrics are backed by a live state store instead of `SeedData`.
- [x] Pause, resume, cancel, rollback, and retry mutate deployment state and emit corresponding events.
- [x] `POST /api/v1/release-orchestrator/deployments` creates a deployment run with canonical fields and returns a real deployment object.
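
To illustrate what "mutate deployment state and emit corresponding events" means for an in-memory store, here is a deliberately simplified sketch. It is not the JobEngine implementation (which lives in the .NET service): the status names, the id format, and the absence of transition validation are all assumptions made for brevity.

```typescript
// Hypothetical status and event vocabulary.
type DeploymentStatus = "running" | "paused" | "cancelled" | "rolled_back";
type LifecycleAction = "pause" | "resume" | "cancel" | "rollback" | "retry";

interface Deployment {
  id: string;
  status: DeploymentStatus;
}

interface DeploymentEvent {
  deploymentId: string;
  type: "created" | LifecycleAction;
  status: DeploymentStatus;
}

// Simplification: each action maps straight to a resulting status, with no guard on the current one.
const resultingStatus: Record<LifecycleAction, DeploymentStatus> = {
  pause: "paused",
  resume: "running",
  cancel: "cancelled",
  rollback: "rolled_back",
  retry: "running",
};

class InMemoryDeploymentStore {
  private deployments: Deployment[] = [];
  private events: DeploymentEvent[] = [];

  create(): Deployment {
    const deployment: Deployment = { id: "dep-" + (this.deployments.length + 1), status: "running" };
    this.deployments.push(deployment);
    this.events.push({ deploymentId: deployment.id, type: "created", status: deployment.status });
    return deployment;
  }

  // Returns undefined for unknown ids so the caller can answer with a not-found response.
  apply(id: string, action: LifecycleAction): Deployment | undefined {
    const matches = this.deployments.filter((d) => d.id === id);
    if (matches.length === 0) {
      return undefined;
    }
    const deployment = matches[0];
    deployment.status = resultingStatus[action];
    this.events.push({ deploymentId: id, type: action, status: deployment.status });
    return deployment;
  }

  eventsFor(id: string): DeploymentEvent[] {
    return this.events.filter((e) => e.deploymentId === id);
  }
}
```

Since list, detail, and event reads all come from the same arrays that the mutations write to, a pause followed by a read reflects the pause, which is the property that canned seed responses lacked.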

### FE-DEP-003 - Wire create-deployment wizard to live bundle and deployment APIs
Status: DONE
Dependency: JOB-DEP-003
Owners: Developer / Implementer, Documentation author
Task description:
- Remove shipped mock package lists and creation fallbacks from the deployment wizard.
- Load real bundle/version data from Bundle Organizer and submit deployment creation through the deployment API with canonical strategy names.

Completion criteria:
- [x] `create-deployment.component.ts` no longer relies on `MOCK_VERSIONS` or `MOCK_HOTFIXES`.
- [x] Strategy values exposed to operators match `rolling | blue_green | canary | all_at_once`.
- [x] Backend failures surface as operator-visible errors and do not navigate away on failure.
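
The canonical strategy vocabulary above can be pinned down as a literal union with a runtime guard, so a wizard form value is checked before it reaches the deployment API. The identifiers below are hypothetical; only the four strategy strings come from this sprint.

```typescript
// The four canonical values listed in the completion criteria.
const deploymentStrategies = ["rolling", "blue_green", "canary", "all_at_once"] as const;
type DeploymentStrategy = typeof deploymentStrategies[number];

// Runtime guard for values arriving as plain strings (form input, query params).
function isDeploymentStrategy(value: string): value is DeploymentStrategy {
  return (deploymentStrategies as readonly string[]).indexOf(value) !== -1;
}

const accepted = isDeploymentStrategy("blue_green");
const rejected = isDeploymentStrategy("blue-green");
```

Deriving the type from the array keeps the compile-time union and the runtime check from drifting apart.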

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-04 | Sprint created; deployment endpoint and wizard parity work started. | Developer |
| 2026-04-04 | Deployment compatibility store and create endpoint landed; wizard switched to live bundle and deployment APIs; verified with focused JobEngine tests and Angular production build. | Developer |

## Decisions & Risks
- Initial parity will use an in-memory deployment store inside JobEngine rather than a new persistent schema in this batch; the goal is live contract behavior, not long-term retention yet.
- Deployment creation remains single-environment per runtime deployment; promotion-stage intent stays release metadata rather than a deployment-group model.

## Next Checkpoints
- 2026-04-04: land JobEngine endpoint changes and rerun targeted compatibility tests.
@@ -0,0 +1,56 @@
# Sprint 20260404-004 - Graph Explorer Live Contract

## Topic & Scope
- Add the REST compatibility facade the shipped Angular graph explorer expects.
- Remove fabricated shipped explorer overlay behavior so the visible graph path reflects backend overlays or explicit empties.
- Working directory: `src/Graph/`.
- Expected evidence: targeted Graph API tests, Angular build for graph explorer compatibility, updated docs, and sprint execution log.

## Dependencies & Concurrency
- Allows cross-module edits in `src/Web/StellaOps.Web/` for the shipped explorer route only.
- Independent of deployment and findings work except for shared Angular build verification.

## Documentation Prerequisites
- `docs/modules/graph/architecture.md`
- `docs/modules/web/architecture.md`

## Delivery Tracker

### GRAPH-API-004 - Add REST compatibility facade and saved-view endpoints
Status: DONE
Dependency: none
Owners: Developer / Implementer, Documentation author
Task description:
- Add `GET /graphs`, `GET /graphs/{id}`, `GET /graphs/{id}/tiles`, `GET /search`, `GET /paths`, `GET /graphs/{id}/export`, `GET /assets/{id}/snapshot`, and `GET /nodes/{id}/adjacency` as a compatibility facade over the existing in-memory graph/query services.
- Add saved-view endpoints for future UI persistence on the same compatibility surface.

Completion criteria:
- [x] The shipped `GraphPlatformHttpClient` routes are implemented server-side.
- [x] Saved-view endpoints exist and persist data in a real service abstraction.
- [x] Existing `/graph/*` endpoints remain intact for compatibility.

### FE-GRAPH-004 - Remove fabricated shipped explorer overlays
Status: DONE
Dependency: GRAPH-API-004
Owners: Developer / Implementer, Documentation author
Task description:
- Rewire the shipped graph explorer overlay handling to use live tile overlays rather than generated policy/evidence/license/exposure/reachability mock data.
- Unsupported fabricated overlay controls must be removed or rendered inactive with explicit state instead of generating pseudo data.

Completion criteria:
- [x] Graph explorer loads its visible overlay state from tile payloads.
- [x] Unsupported fabricated overlay types are removed from the shipped explorer path.
- [x] The explorer fails gracefully when overlay data is absent.
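
A minimal sketch of "overlay state comes from tile payloads" follows. The payload shape is an assumption made for illustration and is not the actual tile contract; the allowed kinds (`policy`, `vex`, `aoc`) are the live overlays this sprint's execution log records.

```typescript
// Live overlay kinds recorded for this sprint; anything else is ignored rather than synthesized.
const liveOverlayKinds = ["policy", "vex", "aoc"] as const;
type OverlayKind = typeof liveOverlayKinds[number];

// Hypothetical tile payload: overlays keyed by kind, possibly absent.
interface TilePayload {
  overlays?: { [kind: string]: unknown[] };
}

interface VisibleOverlay {
  kind: OverlayKind;
  entries: unknown[];
}

// Returns only overlays that the backend actually sent; absent data yields an empty list.
function visibleOverlays(tile: TilePayload): VisibleOverlay[] {
  const result: VisibleOverlay[] = [];
  const overlays = tile.overlays;
  if (overlays === undefined) {
    return result;
  }
  for (const kind of liveOverlayKinds) {
    const entries = overlays[kind];
    if (entries !== undefined && entries.length > 0) {
      result.push({ kind: kind, entries: entries });
    }
  }
  return result;
}

const noOverlays = visibleOverlays({});
```

An empty result is the signal for the explorer to render an explicit "no overlay data" state instead of generating pseudo data.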

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-04 | Sprint created; Graph API compatibility and explorer cleanup started. | Developer |
| 2026-04-04 | Added the `/graphs*` compatibility facade and saved-view endpoints, rewired the shipped explorer to live `policy`/`vex`/`aoc` overlays, and verified with focused Graph API tests plus Angular production build. | Developer |

## Decisions & Risks
- Saved-view persistence is in-memory for this sprint; the contract is real, documented in `docs/modules/graph/architecture.md`, and covered by focused integration tests.
- The graph explorer route is the priority shipped surface. Unused demo-only graph helpers are not a blocker unless they leak into that route.

## Next Checkpoints
- 2026-04-04: land facade endpoints and validate the explorer against the compatibility routes.
@@ -0,0 +1,56 @@
# Sprint 20260404-005 - Findings Vulnerability Detail Read Model

## Topic & Scope
- Remove fabricated vulnerability-detail shaping from the shipped web path.
- Expose the v2 vulnerability-detail route the shipped web client expects from Findings Ledger and stop fabricating detail data in the frontend.
- Working directory: `src/Findings/`.
- Expected evidence: targeted Findings Ledger tests, Angular build for vulnerability detail, updated docs, and sprint execution log.

## Dependencies & Concurrency
- Allows cross-module edits in `src/Web/StellaOps.Web/` to remove frontend fallback fabrication and consume the live read model.
- Independent of deployment and graph work apart from shared web build verification.

## Documentation Prerequisites
- `docs/modules/findings-ledger/README.md`
- `docs/modules/web/architecture.md`

## Delivery Tracker

### FIND-API-005 - Expose the v2 vulnerability detail read model from Findings Ledger
Status: DONE
Dependency: none
Owners: Developer / Implementer, Documentation author
Task description:
- Add `/api/v2/security/vulnerabilities/{id}` to Findings Ledger and back it with projection plus optional scoring state.
- Return partial-but-real fields instead of invented enrichment, leaving unknown detail fields null or absent.

Completion criteria:
- [x] `/api/v2/security/vulnerabilities/{id}` exists and returns only real or null/absent fields.
- [x] Projection-backed findings and optional scoring data are mapped into the v2 detail response without fabricated gate, witness, or verification metadata.
- [x] Targeted Findings Ledger integration tests cover v2 detail behavior with and without cached scoring data.
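
The "real or null" rule for the v2 detail response can be shown with a tiny mapper. Every field name below is hypothetical; the sketch only demonstrates that missing scoring state becomes `null` rather than a derived or invented value.

```typescript
// Hypothetical projection and scoring shapes, not the Findings Ledger DTOs.
interface FindingProjection {
  id: string;
  severity: string | null;
}

interface ScoringState {
  score: number;
}

interface VulnerabilityDetailV2 {
  id: string;
  severity: string | null;
  score: number | null; // null when no cached scoring data exists
}

// Map projection plus optional scoring into the detail response without fabricating fields.
function toVulnerabilityDetail(projection: FindingProjection, scoring?: ScoringState): VulnerabilityDetailV2 {
  return {
    id: projection.id,
    severity: projection.severity,
    score: scoring === undefined ? null : scoring.score,
  };
}

const withoutScoring = toVulnerabilityDetail({ id: "finding-1", severity: null });
```

The frontend can then render a gap for `null` fields, which matches the partial-state behavior expected of the vulnerability detail page.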
|
||||
|
||||
### FE-FIND-005 - Remove frontend vulnerability detail fabrication
|
||||
Status: DONE
|
||||
Dependency: FIND-API-005
|
||||
Owners: Developer / Implementer, Documentation author
|
||||
Task description:
|
||||
- Delete deterministic pseudo-score, EPSS, witness-path, and verification fallback shaping from the shipped vulnerability detail client/facade.
|
||||
- Keep partial data rendering, but show gaps honestly when the backend omits fields.
|
||||
|
||||
Completion criteria:
|
||||
- [x] `security-findings.client.ts` no longer fabricates vulnerability detail on HTTP fallback.
|
||||
- [x] `vulnerability-detail.facade.ts` no longer invents signed-score verification data when proof data is absent.
|
||||
- [x] The vulnerability detail page renders partial state cleanly without made-up security metadata.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-04 | Sprint created; vulnerability detail read-model and web fallback removal started. | Developer |
|
||||
| 2026-04-04 | Added the Findings Ledger v2 vulnerability-detail endpoint, restored a live-only web facade, removed frontend fallback fabrication, and verified with focused Findings tests plus Angular production build. | Developer |
|
||||
|
||||
## Decisions & Risks
|
||||
- Real-but-partial fields are acceptable; the page must not invent operator/security facts.
|
||||
- The shipped web route now relies on Findings Ledger `v2` detail responses documented in `docs/modules/findings-ledger/README.md`; rewriting the legacy VulnExplorer sample-data routes is no longer a prerequisite for this shipped path.
|
||||
|
||||
## Next Checkpoints
|
||||
- 2026-04-04: land VulnExplorer read-model changes and rerun focused API tests.
|
||||
@@ -0,0 +1,67 @@
|
||||
# Sprint 20260405-001 - Local Gitea Bootstrap Hardening
|
||||
|
||||
## Topic & Scope
|
||||
- Remove the contradictory local Gitea setup path that marked the instance install-locked while still documenting manual first-login admin creation.
|
||||
- Ensure the compose-backed Gitea service reaches a deterministic admin-ready state on fresh volumes before it reports healthy.
|
||||
- Sync the local-operator docs so they describe the actual bootstrap flow and the remaining manual PAT-to-Vault step.
|
||||
- Working directory: `devops/compose/`.
|
||||
- Expected evidence: `docker compose config` validation, live `gitea admin user list` verification, updated operator docs.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on `docs/integrations/LOCAL_SERVICES.md`, `devops/compose/README.md`, and the local integration catalog bootstrap history in `docs/implplan/SPRINT_20260403_004_Integrations_local_integration_catalog_bootstrap.md`.
|
||||
- Cross-module edits allowed for `docs/integrations/**`, `docs/implplan/**`, and compose helper scripts under `devops/compose/scripts/`.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/operations/devops/README.md`
|
||||
- `docs/operations/devops/architecture.md`
|
||||
- `docs/operations/devops/implementation_plan.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
- `docs/integrations/LOCAL_SERVICES.md`
|
||||
- `devops/compose/README.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### TASK-1 - Harden the compose-backed Gitea bootstrap path
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer
|
||||
Task description:
|
||||
- Replace the incomplete local Gitea bring-up path with a deterministic bootstrap that creates the repository root and first admin user from the compose service itself.
|
||||
- Make the service health check reflect the admin-ready state instead of only proving that `/api/v1/version` responds.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Fresh local Gitea volumes create a deterministic admin user without requiring a manual setup wizard.
|
||||
- [x] The compose service no longer carries the unused `gitea-db` mount that implied a different SQLite location than the image template uses.
|
||||
- [x] The Gitea health check stays red until an admin exists.
|
||||
|
||||
### TASK-2 - Sync operator docs with the corrected bootstrap flow
|
||||
Status: DONE
|
||||
Dependency: TASK-1
|
||||
Owners: Developer
|
||||
Task description:
|
||||
- Update the compose README and local integration service guide so they describe the actual local Gitea admin bootstrap and token workflow.
|
||||
- Record the root cause and the corrected procedure for future local integration bring-up.
|
||||
|
||||
Completion criteria:
|
||||
- [x] `devops/compose/README.md` documents the default local admin credentials and the new health expectation.
|
||||
- [x] `docs/integrations/LOCAL_SERVICES.md` removes the stale first-login guidance and keeps PAT creation explicit.
|
||||
- [x] Decisions & Risks link the corrected docs back to the original setup contradiction.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created after live investigation showed `stellaops-gitea` running install-locked with no admin users despite local docs still describing manual first-login bootstrap. | Developer |
|
||||
| 2026-04-05 | Replaced the incomplete manual path with a self-bootstrap Gitea entrypoint, explicit config persistence, and an admin-aware health check. | Developer |
|
||||
| 2026-04-05 | Updated the compose README and local integration services guide to document deterministic local admin bootstrap and the remaining manual PAT/Vault step. | Developer |
|
||||
| 2026-04-05 | Validation: `docker compose -f devops/compose/docker-compose.integrations.yml config` passed; a disposable fresh-volume Gitea container auto-created the `stellaops` admin and repository root. | Developer |
|
||||
| 2026-04-05 | Applied the corrected compose definition to the live `stellaops-gitea` service with `docker compose -f devops/compose/docker-compose.integrations.yml up -d --force-recreate gitea`; the container returned `healthy` with the admin-aware health check. | Developer |
|
||||
|
||||
## Decisions & Risks
|
||||
- Root cause: the official Gitea image generated `app.ini` with `INSTALL_LOCK=true` and no admin bootstrap, while the local docs still told operators to create the admin on first login. The result was an install-locked but admin-less instance. Corrected paths: `devops/compose/docker-compose.integrations.yml`, `devops/compose/README.md`, `docs/integrations/LOCAL_SERVICES.md`.
|
||||
- Personal access tokens remain a manual step because the token value is only disclosed at creation time. The docs now make that explicit instead of implying a complete zero-touch SCM credential flow.
|
||||
- Existing Gitea volumes with an already-present admin are left intact by the bootstrap logic; the entrypoint only seeds the admin on fresh or admin-less state.
|
||||
- The live diagnostic volume still contains the temporary `codex-probe` admin created during root-cause analysis. The new bootstrap deliberately preserves existing admins instead of mutating them, so removing that account is a separate manual cleanup task rather than part of the deterministic bootstrap fix.
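
The "seed only on fresh or admin-less state" behavior described above can be sketched as follows. This is an illustrative model, not the actual entrypoint script: `list_admins` and `create_admin` are hypothetical callables standing in for whatever the entrypoint uses to query and create Gitea users, and `stellaops` is the admin name recorded in the execution log.

```python
from typing import Callable, List


def ensure_admin(list_admins: Callable[[], List[str]],
                 create_admin: Callable[[str], None],
                 default_admin: str = "stellaops") -> bool:
    """Create the default admin only when no admin exists.

    Returns True when an admin was seeded, False when existing
    admins were found and left untouched.
    """
    existing = list_admins()
    if existing:
        # Existing admins are preserved, never mutated or replaced.
        return False
    create_admin(default_admin)
    return True
```

Because the function never touches an existing admin, a leftover account such as `codex-probe` survives the bootstrap and has to be removed separately, which matches the risk noted above.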
|
||||
|
||||
## Next Checkpoints
|
||||
- Decide whether the local Vault bootstrap should also seed a Gitea PAT for fully automated integration catalog bring-up, or whether keeping PAT creation operator-driven is the preferred local-security tradeoff.
|
||||
- Apply the same "healthy means bootstrapped" rule to any other compose-backed integration services that still report green before their documented local setup is actually complete.
|
||||
@@ -0,0 +1,62 @@
|
||||
# Sprint 20260405-002 - FE Active-Surface Test Lane Repair
|
||||
|
||||
## Topic & Scope
|
||||
- Restore a reliable focused Angular unit-test lane for shipped Graph, Findings, Evidence, Topology, and deployment flows.
|
||||
- Fix the immediate compile blockers that currently prevent focused spec runs on active surfaces.
|
||||
- Working directory: `src/Web/StellaOps.Web/`.
|
||||
- Expected evidence: focused Vitest run for active-surface specs, Angular production build, updated docs, and sprint execution log.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on the shipped-surface parity work completed in `SPRINT_20260404_002_FE_evidence_topology_live_surfaces.md`, `SPRINT_20260404_003_JobEngine_deployment_run_parity.md`, `SPRINT_20260404_004_Graph_graph_explorer_live_contract.md`, and `SPRINT_20260404_005_Findings_vulnerability_detail_read_model.md`.
|
||||
- Safe to run before Graph and JobEngine persistence work; those follow-on sprints depend on this focused verification lane.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/web/architecture.md`
|
||||
- `docs/implplan/SPRINT_20260404_002_FE_evidence_topology_live_surfaces.md`
|
||||
- `docs/implplan/SPRINT_20260404_003_JobEngine_deployment_run_parity.md`
|
||||
- `docs/implplan/SPRINT_20260404_004_Graph_graph_explorer_live_contract.md`
|
||||
- `docs/implplan/SPRINT_20260404_005_Findings_vulnerability_detail_read_model.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### FE-TEST-006 - Repair active-surface Angular compile blockers
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer / Implementer, Test Automation
|
||||
Task description:
|
||||
- Fix the concrete Angular compile faults that currently break focused spec runs for shipped surfaces, including malformed inline templates and missing reactive imports in touched release/evidence flows.
|
||||
- Keep the write scope limited to active shipped surfaces and directly affected tests.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The evidence packet component template compiles cleanly in unit-test builds.
|
||||
- [x] The environment detail component compiles cleanly with its reactive state restored.
|
||||
- [x] Any touched active-surface spec compiles without newly introduced type errors.
|
||||
|
||||
### FE-TEST-007 - Add a focused active-surface spec lane and quarantine note
|
||||
Status: DONE
|
||||
Dependency: FE-TEST-006
|
||||
Owners: Developer / Implementer, Test Automation, Documentation author
|
||||
Task description:
|
||||
- Add a dedicated active-surface test target that only includes the shipped Graph, Findings, Evidence, Topology, and deployment wizard specs needed for current parity work.
|
||||
- Document the intentionally excluded stale-spec backlog so focused verification is auditable rather than accidental.
|
||||
|
||||
Completion criteria:
|
||||
- [x] A dedicated Angular/Vitest target exists for active-surface specs.
|
||||
- [x] The focused lane covers Graph overlays, vulnerability detail, deployment creation, and evidence/topology flows.
|
||||
- [x] The current unrelated stale-spec exclusions are documented in this sprint's Decisions & Risks.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created; active-surface Web test lane repair started. | Developer |
|
||||
| 2026-04-05 | Fixed the evidence packet template, restored the missing `computed` import in environment detail, and corrected the touched active-surface specs. | Developer |
|
||||
| 2026-04-05 | Added the `test-active-surfaces` Angular target plus `npm run test:active-surfaces`, including the deployment-wizard spec for the shipped create-deployment flow. | Developer |
|
||||
| 2026-04-05 | Verification passed: `npm run test:active-surfaces` (25/25) and `npm run build -- --configuration=production --output-path=dist`. | Test Automation |
|
||||
|
||||
## Decisions & Risks
|
||||
- The broader stale Angular spec backlog is intentionally out of scope unless a broken test blocks a shipped active-surface spec.
|
||||
- The focused lane must prove shipped behavior without depending on unrelated legacy spec folders.
|
||||
- The focused lane intentionally excludes the unrelated legacy spec debt still present under moved/removed areas such as `agents`, older `signals` tests, and stale release/policy shell expectations. Those remain backlog work rather than hidden red builds.
|
||||
|
||||
## Next Checkpoints
|
||||
- 2026-04-05: land active-surface compile fixes and run focused Web verification.
|
||||
@@ -0,0 +1,58 @@
|
||||
# Sprint 20260405-003 - Graph Saved Views Persistence
|
||||
|
||||
## Topic & Scope
|
||||
- Replace the temporary in-memory Graph saved-view store with persisted storage.
|
||||
- Add startup migrations for the saved-view schema path and keep the compatibility REST facade unchanged for the shipped Console.
|
||||
- Working directory: `src/Graph/`.
|
||||
- Expected evidence: targeted Graph API tests, restart-aware persistence verification, updated docs, and sprint execution log.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on `SPRINT_20260405_002_FE_test_lane_repair_for_active_surfaces.md` for faster focused frontend verification.
|
||||
- Allows cross-module edits in `src/Web/StellaOps.Web/` only if the live Graph UI needs small adjustments to persisted saved-view behavior.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/graph/architecture.md`
|
||||
- `src/Graph/AGENTS.md`
|
||||
- `docs/implplan/SPRINT_20260404_004_Graph_graph_explorer_live_contract.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### GRAPH-PERSIST-006 - Persist Graph saved views in PostgreSQL with startup migrations
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer / Implementer, Documentation author
|
||||
Task description:
|
||||
- Introduce a persisted saved-view store for the compatibility Graph API and wire startup migrations for its schema ownership path.
|
||||
- Preserve tenant isolation, deterministic ordering, and the existing `/graphs/{graphId}/saved-views` REST contract.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Graph saved views are stored in PostgreSQL rather than process memory when persistence is configured.
|
||||
- [x] Startup migrations create the saved-view tables automatically for a clean database.
|
||||
- [x] Saved-view list/create/delete keeps the existing compatibility API contract.
|
||||
|
||||
### GRAPH-PERSIST-007 - Add restart-aware verification and sync docs
|
||||
Status: DONE
|
||||
Dependency: GRAPH-PERSIST-006
|
||||
Owners: Test Automation, Documentation author
|
||||
Task description:
|
||||
- Add focused tests that prove saved views remain available across service/store reinitialization and document the persistence behavior in module docs.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Targeted Graph tests cover create/read/delete against the persisted store.
|
||||
- [x] At least one test proves persistence across a store or host restart boundary.
|
||||
- [x] `docs/modules/graph/architecture.md` records the saved-view persistence model.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created; Graph saved-view persistence queued behind Web test-lane repair. | Developer |
|
||||
| 2026-04-05 | Added `IGraphSavedViewStore`, PostgreSQL-backed persistence, startup migration `003_saved_views.sql`, and runtime fallback selection between persisted and in-memory stores. | Developer |
|
||||
| 2026-04-05 | Verification passed: `dotnet test "src/Graph/__Tests/StellaOps.Graph.Api.Tests/StellaOps.Graph.Api.Tests.csproj" -- --filter-class StellaOps.Graph.Api.Tests.GraphCompatibilityEndpointsIntegrationTests` (3/3). | Test Automation |
|
||||
|
||||
## Decisions & Risks
|
||||
- Saved views need durable storage now; broader graph dataset persistence remains out of scope for this sprint.
|
||||
- Reuse the repo's existing PostgreSQL migration conventions instead of adding a second migration mechanism.
|
||||
- Store selection is now resolved from bound `Postgres:Graph` options at DI/runtime rather than from an early configuration snapshot, so test-host and deployment overrides correctly pick the persisted store.
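
The difference between an early configuration snapshot and resolution-time selection can be sketched as below. This is a simplified model under stated assumptions, not the Graph service code: the class names and the `ConnectionString` key are hypothetical, and the real service resolves the store through its DI container from the bound `Postgres:Graph` options.

```python
from typing import Dict, Optional


class InMemorySavedViewStore:
    kind = "in-memory"


class PostgresSavedViewStore:
    kind = "postgres"

    def __init__(self, connection_string: str) -> None:
        self.connection_string = connection_string


def resolve_store(options: Dict[str, Optional[str]]):
    """Pick the store from the options as bound at resolution time.

    Reading the options here, rather than capturing them once at
    startup, lets later overrides (test host, deployment) take effect.
    """
    conn = (options.get("ConnectionString") or "").strip()
    if conn:
        return PostgresSavedViewStore(conn)
    return InMemorySavedViewStore()
```

If the options are overridden after startup but before the store is first resolved, the persisted store is selected; a snapshot taken at startup would have fallen back to the in-memory store.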
|
||||
|
||||
## Next Checkpoints
|
||||
- 2026-04-05: land persisted saved-view store, migrations, and focused Graph verification.
|
||||
@@ -0,0 +1,60 @@
|
||||
# Sprint 20260405-004 - JobEngine Deployment Store Persistence
|
||||
|
||||
## Topic & Scope
|
||||
- Replace the in-memory release-control compatibility deployment store with persisted storage in the orchestrator schema.
|
||||
- Keep the shipped deployment compatibility API unchanged while making lifecycle state durable.
|
||||
- Working directory: `src/JobEngine/`.
|
||||
- Expected evidence: targeted JobEngine compatibility tests, restart-aware persistence verification, updated docs, and sprint execution log.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on `SPRINT_20260405_002_FE_test_lane_repair_for_active_surfaces.md` for focused frontend verification of the shipped deployment path.
|
||||
- Allows cross-module edits in `src/Web/StellaOps.Web/` and `docs/modules/release-orchestrator/` only if the persisted behavior requires minor UI/doc alignment.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/jobengine/architecture.md`
|
||||
- `docs/modules/jobengine/README.md`
|
||||
- `src/JobEngine/AGENTS.md`
|
||||
- `docs/implplan/SPRINT_20260404_003_JobEngine_deployment_run_parity.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### ORCH-PERSIST-006 - Persist compatibility deployments in the orchestrator schema
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer / Implementer, Documentation author
|
||||
Task description:
|
||||
- Move the compatibility deployment list/detail/events/logs/metrics and lifecycle mutations onto persisted storage under the existing orchestrator migration regime.
|
||||
- Preserve the shipped endpoint surface and strategy vocabulary already exposed to the Console.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Compatibility deployments are stored durably in PostgreSQL when the WebService uses JobEngine infrastructure.
|
||||
- [x] Startup migrations create the compatibility deployment tables automatically.
|
||||
- [x] Pause, resume, cancel, rollback, retry, and create flows all mutate persisted state and event history.
|
||||
|
||||
### ORCH-PERSIST-007 - Add restart-aware tests and sync docs
|
||||
Status: DONE
|
||||
Dependency: ORCH-PERSIST-006
|
||||
Owners: Test Automation, Documentation author
|
||||
Task description:
|
||||
- Extend the focused JobEngine compatibility tests to prove deployments remain readable across a restart boundary and document the persisted compatibility path.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Targeted JobEngine tests cover persisted create/read/lifecycle behavior.
|
||||
- [x] At least one test proves deployment state survives service/store restart.
|
||||
- [x] `docs/modules/jobengine/architecture.md` records the persisted compatibility deployment store.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created; persisted compatibility deployment store queued behind Web test-lane repair. | Developer |
|
||||
| 2026-04-05 | Replaced the static endpoint-owned compatibility store with DI-backed `IDeploymentCompatibilityStore`, added PostgreSQL persistence plus orchestrator migration `011_compatibility_deployments.sql`, and kept the shipped REST contract intact. | Developer |
|
||||
| 2026-04-05 | Tightened JobEngine configuration precedence so an explicit `JobEngine:Database:ConnectionString` wins over legacy `Orchestrator` fallback values. | Developer |
|
||||
| 2026-04-05 | Verification passed: `dotnet test "src/JobEngine/StellaOps.JobEngine/StellaOps.JobEngine.Tests/StellaOps.JobEngine.Tests.csproj" -m:1 -- --filter-class StellaOps.JobEngine.Tests.ControlPlane.ReleaseCompatibilityEndpointsTests` (5/5). | Test Automation |
|
||||
|
||||
## Decisions & Risks
|
||||
- The compatibility API must remain stable for the shipped Console even as the backing store changes.
|
||||
- Existing seed records can stay as bootstrap data, but runtime state must no longer be process-local only.
|
||||
- Seed deployments remain bootstrap data per tenant, but they are now inserted into persisted storage on demand so lifecycle mutations survive host restart instead of resetting with process memory.
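
The configuration precedence recorded in the execution log (an explicit `JobEngine:Database:ConnectionString` wins over legacy `Orchestrator` fallback values) can be sketched as a simple lookup order. This is an assumption-laden illustration: the legacy key name `Orchestrator:Database:ConnectionString` is hypothetical, and the real binding logic lives in the JobEngine configuration code.

```python
from typing import Mapping, Optional

EXPLICIT_KEY = "JobEngine:Database:ConnectionString"
# Hypothetical legacy key; the actual fallback key may differ.
LEGACY_KEY = "Orchestrator:Database:ConnectionString"


def resolve_connection_string(config: Mapping[str, str]) -> Optional[str]:
    """Return the explicit JobEngine value if set, else the legacy one."""
    explicit = config.get(EXPLICIT_KEY, "").strip()
    if explicit:
        return explicit
    legacy = config.get(LEGACY_KEY, "").strip()
    return legacy or None
```

With both keys present, the explicit JobEngine value is returned and the legacy value is ignored.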
|
||||
|
||||
## Next Checkpoints
|
||||
- 2026-04-05: land orchestrator persistence for compatibility deployments and rerun focused JobEngine verification.
|
||||
@@ -0,0 +1,59 @@
|
||||
# Sprint 20260405-005 - FE Shipped UI Polish
|
||||
|
||||
## Topic & Scope
|
||||
- Remove obvious warning-level friction from the shipped Angular build and tighten empty/error messaging on touched shipped pages.
|
||||
- Keep the scope to the active shipped surfaces touched by recent parity work rather than broad visual redesign.
|
||||
- Working directory: `src/Web/StellaOps.Web/`.
|
||||
- Expected evidence: Angular production build, focused active-surface tests, updated docs, and sprint execution log.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on `SPRINT_20260405_002_FE_test_lane_repair_for_active_surfaces.md`.
|
||||
- Benefits from persisted Graph and JobEngine behavior but may land small UX/build fixes independently where safe.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/web/architecture.md`
|
||||
- `docs/implplan/SPRINT_20260404_002_FE_evidence_topology_live_surfaces.md`
|
||||
- `docs/implplan/SPRINT_20260404_004_Graph_graph_explorer_live_contract.md`
|
||||
- `docs/implplan/SPRINT_20260404_005_Findings_vulnerability_detail_read_model.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### FE-POLISH-006 - Remove current shipped-path build warnings and dead wiring
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer / Implementer
|
||||
Task description:
|
||||
- Address the current setup-wizard style-budget warnings and remove dead imports/template wiring on touched shipped pages.
|
||||
- Keep bundle-budget changes as a last resort; prefer actual CSS or template cleanup.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The Angular production build no longer emits the current setup-wizard style-budget warnings.
|
||||
- [x] Touched shipped components do not retain dead imports or dead template bindings.
|
||||
- [x] No new build warnings are introduced by the polish work.
|
||||
|
||||
### FE-POLISH-007 - Improve shipped empty/error states without fake affordances
|
||||
Status: DONE
|
||||
Dependency: FE-POLISH-006
|
||||
Owners: Developer / Implementer, Documentation author
|
||||
Task description:
|
||||
- Tighten empty-state and unavailable-action messaging on touched Graph, evidence, topology, and vulnerability-detail pages so operators see explicit outcomes rather than silent no-ops.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Touched shipped pages show explicit empty or unavailable messaging where backend data is missing.
|
||||
- [x] No touched shipped page exposes a fake action affordance without a real backend path.
|
||||
- [x] Web architecture docs reflect any operator-visible behavior changes.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created; shipped UI polish queued behind active-surface test-lane repair. | Developer |
|
||||
| 2026-04-05 | Moved the setup wizard and step-content component styles out of oversized inline component bundles into global SCSS so the build clears `anyComponentStyle` budgets without raising them. | Developer |
|
||||
| 2026-04-05 | Revalidated the focused shipped surfaces after the style extraction: `npm run test:active-surfaces` (25/25) and `npm run build -- --configuration=production --output-path=dist` both passed without setup-wizard style-budget warnings. | Test Automation |
|
||||
|
||||
## Decisions & Risks
|
||||
- Build-warning cleanup must stay scoped to active shipped surfaces to avoid turning into a repo-wide CSS rewrite.
|
||||
- Operator-facing clarity takes priority over cosmetic expansion.
|
||||
- The explicit empty/unavailable messaging introduced in the earlier shipped-surface parity sprints remained the correct product behavior; this sprint kept those live-only states intact while removing build-warning debt.
|
||||
|
||||
## Next Checkpoints
|
||||
- 2026-04-05: remove active shipped-path warning debt and rerun build plus focused tests.
|
||||
@@ -0,0 +1,69 @@
|
||||
# Sprint 20260405-007 - Local Integration Idle CPU Tuning
|
||||
|
||||
## Topic & Scope
|
||||
- Reduce unnecessary idle CPU in the local third-party integration lane without breaking the default Stella platform or the CI/testing compose lane.
|
||||
- Move high-idle optional providers behind explicit opt-in startup commands where that better matches their real local usage.
|
||||
- Document which compose lane installs which containers so operators do not confuse `docker-compose.testing.yml` with `docker-compose.integrations.yml`.
|
||||
- Working directory: `devops/compose/`.
|
||||
- Expected evidence: compose config validation, runtime inspection of GitLab/Consul/PostgreSQL/Valkey, updated operator docs.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on `devops/compose/docker-compose.integrations.yml`, `devops/compose/README.md`, `docs/integrations/LOCAL_SERVICES.md`, `docs/INSTALL_GUIDE.md`, and `docs/dev/DEV_ENVIRONMENT_SETUP.md`.
|
||||
- Cross-module edits allowed for `docs/integrations/**`, `docs/implplan/**`, and top-level setup/install docs that point operators at the local compose lanes.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/operations/devops/README.md`
|
||||
- `docs/operations/devops/architecture.md`
|
||||
- `docs/operations/devops/implementation_plan.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
- `devops/compose/README.md`
|
||||
- `docs/integrations/LOCAL_SERVICES.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### TASK-1 - Lower the idle footprint of optional local integration providers
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer
|
||||
Task description:
|
||||
- Reconfigure the local integrations compose lane so Consul no longer burns CPU in the default bring-up path and GitLab uses genuine low-idle omnibus settings for local SCM/API validation.
|
||||
- Preserve an explicit opt-in path for features that justify the extra cost, including Consul KV checks and GitLab registry/package coverage.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Consul is no longer part of the default `docker compose -f docker-compose.integrations.yml up -d` lane.
|
||||
- [x] GitLab uses low-idle local defaults with corrected Puma/Sidekiq tuning and optional registry/package re-enable flags.
|
||||
- [x] The compose file still validates with `docker compose config`.
|
||||
|
||||
### TASK-2 - Clarify which compose lane installs which containers
|
||||
Status: DONE
|
||||
Dependency: TASK-1
|
||||
Owners: Developer
|
||||
Task description:
|
||||
- Update the local compose docs so operators can distinguish the CI/testing stack from the real third-party integration stack and know when GitLab or Consul should be started explicitly.
|
||||
- Record the CPU-triage findings so future local bring-up choices are informed by actual runtime behavior rather than assumptions.
|
||||
|
||||
Completion criteria:
|
||||
- [x] `devops/compose/README.md` explains the low-idle default lane plus the opt-in Consul and GitLab commands.
|
||||
- [x] `docs/integrations/LOCAL_SERVICES.md` reflects the new startup model and GitLab/Consul behavior.
|
||||
- [x] Install/dev guides mention that `docker-compose.testing.yml` does not install GitLab or Consul.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-05 | Sprint created after a two-minute CPU sample showed the local integration lane's top sustained consumers were `router-gateway`, GitLab, PostgreSQL, Consul, and Valkey. | Developer |
|
||||
| 2026-04-05 | Reconfigured `docker-compose.integrations.yml` so Consul is opt-in and GitLab uses corrected low-idle omnibus settings with optional registry/package re-enable flags. | Developer |
|
||||
| 2026-04-05 | Updated compose/install/local-service docs to distinguish the testing lane from the real third-party integration lane and to document the new GitLab/Consul startup model. | Developer |
|
||||
| 2026-04-05 | Runtime validation: stopped the live `stellaops-consul` container, recreated `stellaops-gitlab`, confirmed GitLab returned `healthy` with `gitlab-kas` disabled, and captured fresh PostgreSQL/GitLab/Valkey traces plus a post-change top-5 CPU sample. | Developer |
|
||||
| 2026-04-05 | Follow-up runtime validation: moved Gitea admin-bootstrap proof from the repeating healthcheck into a one-time sentinel written by the entrypoint, recreated `stellaops-gitea`, and confirmed the expensive healthcheck loop no longer dominates Gitea CPU. | Developer |
|
||||
|
||||
## Decisions & Risks
|
||||
- Runtime evidence showed Consul had zero registered services/checks yet still spent CPU in dev-agent churn, so the default local lane now leaves it off unless the Consul connector is being validated explicitly.
|
||||
- GitLab CPU was dominated by Sidekiq cron/background work and a larger-than-expected Puma footprint. The compose file now uses `sidekiq['concurrency']` and `puma['worker_processes']`, which match the Omnibus template keys, instead of the previous ineffective local tuning.
|
||||
- Post-change runtime checks showed GitLab settles back down after reconfigure, but it still runs unavoidable Omnibus background work whenever the container is up. The durable low-idle control is therefore opt-in startup, rather than an assumption that GitLab can be made "free" while running.
|
||||
- The original Gitea fix proved the admin existed by running `gitea admin user list` from the healthcheck every 30 seconds. That caused misleading CPU spikes during later monitoring, so the healthcheck now validates a sentinel file created once by the entrypoint instead.
|
||||
- GitLab registry/package features are now opt-in via env vars for the local lane. Operators who need GitLab registry coverage must start GitLab with `GITLAB_ENABLE_REGISTRY=true` (and packages with `GITLAB_ENABLE_PACKAGES=true`).
|
||||
- PostgreSQL and Valkey remain active because they are core Stella runtime dependencies, not optional third-party fixtures. Their load must be analyzed service-by-service rather than disabled globally.
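
The sentinel-based Gitea healthcheck described above can be modeled in a few lines. This is a sketch of the idea only, not the compose healthcheck or entrypoint script: the sentinel filename and function names are hypothetical. The point is that the expensive admin verification runs once at bootstrap, and the repeating healthcheck only tests for a file.

```python
import os
import tempfile


def mark_bootstrapped(sentinel_path: str) -> None:
    """Called once by the entrypoint after the admin is confirmed."""
    with open(sentinel_path, "w", encoding="utf-8") as handle:
        handle.write("admin-ready\n")


def is_healthy(sentinel_path: str, api_responds: bool) -> bool:
    """Cheap repeating check: API is up and bootstrap has completed."""
    return api_responds and os.path.isfile(sentinel_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as state_dir:
        sentinel = os.path.join(state_dir, "admin-ready")
        print(is_healthy(sentinel, api_responds=True))  # before bootstrap
        mark_bootstrapped(sentinel)
        print(is_healthy(sentinel, api_responds=True))  # after bootstrap
```

The tradeoff is that the healthcheck no longer detects an admin being deleted after bootstrap; it only proves that bootstrap completed once on the current volume.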
|
||||
|
||||
## Next Checkpoints
|
||||
- Re-sample container CPU after the live GitLab recreate and Consul shutdown to confirm the top-5 ranking changed as expected.
|
||||
- If Valkey and router-gateway remain the dominant sustained pair, trace the queue-wait and stream-consumer settings in the router transport next.
|
||||
@@ -0,0 +1,86 @@
|
||||
# Sprint 20260405-008 - Consul, Postgres, And Router Runtime Tuning
|
||||
|
||||
## Topic & Scope
|
||||
- Keep the local Consul integration provider running while reducing its idle CPU footprint.
|
||||
- Increase local PostgreSQL diagnostics enough to capture slow-query and lock context for the active Stella stack.
|
||||
- Trace the router gateway and Valkey messaging behavior to separate real traffic from avoidable idle churn, then apply safe local tuning where it does not sacrifice functionality.
|
||||
- Working directory: `devops/compose/`.
|
||||
- Expected evidence: live container samples, compose updates, PostgreSQL runtime configuration, and documented router/Valkey findings.
|
||||
|
||||

## Dependencies & Concurrency
- Depends on `devops/compose/docker-compose.integrations.yml`, `devops/compose/docker-compose.stella-ops.yml`, `devops/compose/README.md`, `docs/integrations/LOCAL_SERVICES.md`, and `docs/implplan/SPRINT_20260405_007_Integrations_local_idle_cpu_tuning.md`.
- Cross-module read access required for `src/Router/**` to explain runtime messaging behavior.
- Cross-module doc edits allowed for `docs/integrations/**`, `docs/implplan/**`, and top-level setup/devops docs that describe the local runtime.

## Documentation Prerequisites
- `docs/operations/devops/README.md`
- `docs/operations/devops/architecture.md`
- `docs/operations/devops/implementation_plan.md`
- `docs/modules/platform/architecture-overview.md`
- `devops/compose/README.md`
- `docs/integrations/LOCAL_SERVICES.md`
- `src/Router/AGENTS.md`
- `src/Router/__Libraries/StellaOps.Router.Gateway/AGENTS.md`
- `src/Router/__Libraries/StellaOps.Messaging.Transport.Valkey/AGENTS.md`

## Delivery Tracker

### TASK-1 - Keep Consul up with a lower idle footprint
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Replace the current local Consul dev-agent mode with a lower-idle single-server configuration that preserves the HTTP API and local UI surface needed for connector validation.
- Validate the new mode against the live compose service and record before/after CPU evidence.

Completion criteria:
- [x] `stellaops-consul` stays up in the local integrations lane.
- [x] Idle CPU is measurably lower than the current `agent -dev` mode.
- [x] Docs reflect the retained startup and any changed operational caveats.

### TASK-2 - Raise PostgreSQL diagnostics for local tracing
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Enable targeted local PostgreSQL logging that captures slow statements and lock-related context without turning the dev database into an unreadable firehose.
- Record the exact runtime settings and confirm they are active on the live container.

Completion criteria:
- [x] Slow-query and lock-wait logging is enabled on the live `stellaops-postgres` instance.
- [x] The chosen settings are documented in the sprint log and reflected in local ops guidance if they become part of compose defaults.
- [x] At least one follow-up log capture demonstrates the new diagnostics are active.

### TASK-3 - Trace router gateway and Valkey churn without reducing functionality
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Investigate the router gateway's Valkey-backed messaging loops and determine whether the dominant CPU comes from real request throughput, heartbeat traffic, or avoidable control-plane churn.
- Propose or apply safe local tuning only where the behavior preserves routing readiness and service connectivity.

Completion criteria:
- [x] Router gateway, Valkey, and PostgreSQL traces are correlated into a concrete runtime explanation.
- [x] Any applied tuning preserves gateway readiness and microservice connectivity.
- [x] Remaining non-applied improvements are documented with explicit tradeoffs.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-05 | Sprint created after the follow-up request to keep Consul running, increase PostgreSQL diagnostics, and investigate router-gateway/Valkey runtime churn without sacrificing functionality. | Developer |
| 2026-04-05 | Replaced local Consul `agent -dev` with a persistent single-node server (`-server -bootstrap-expect=1 -ui -data-dir=/consul/data`) and validated live CPU falling from roughly 3-4% idle to roughly 0.5-1.3% while keeping the HTTP KV surface and UI available. Updated the integrations compose docs accordingly. | Developer |
| 2026-04-05 | Enabled targeted PostgreSQL diagnostics on the live `stellaops-postgres` container via `ALTER SYSTEM`: `log_min_duration_statement=100ms`, `log_connections=on`, `log_disconnections=on`, `log_lock_waits=on`, `deadlock_timeout=500ms`, and a richer `log_line_prefix`. Verified the settings in `postgresql.auto.conf` and confirmed slow-query logging with a `pg_sleep(0.25)` probe. | Developer |
| 2026-04-05 | Correlated router-gateway, Valkey, and code-level evidence. Empty router request streams ruled out backlog. The dominant churn is repeated HELLO re-registration across the full microservice fleet, not user request load. In a 60-second sample the gateway logged 261 `HELLO received` events and 261 matching `Messaging connection registered` events, aligning with the 10-second `RegistrationRefreshIntervalSeconds` default across roughly 42 connected services. Patched local compose defaults to `30s` messaging heartbeat and `30s` registration refresh for the next live redeploy. | Developer |
| 2026-04-05 | Recreated the main `docker-compose.stella-ops.yml` stack with the new router defaults and re-sampled the live system after it settled. Gateway readiness stayed green. Router HELLO traffic fell from 261/min to 84/min, and the corresponding Valkey command deltas fell to `xreadgroup=621`, `xautoclaim=262`, `publish=168`, `ping=667`, `xadd=168`, `xack=168`, and `xdel=168` over 60 seconds. Router CPU in the same window averaged roughly 3.1% with bursty peaks, while Valkey averaged roughly 1.0%, PostgreSQL roughly 0.3%, and Consul roughly 0.4% outside isolated blips. | Developer |
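The HELLO rates in the log above can be sanity-checked with simple arithmetic. The sketch below assumes each connected service re-registers exactly once per refresh interval; the helper name `hello_events_per_minute` is hypothetical and the service count is the log's own "roughly 42".

```python
def hello_events_per_minute(services: int, refresh_seconds: float) -> float:
    """Expected HELLO frames per minute if every service re-registers once per interval."""
    return services * 60.0 / refresh_seconds


SERVICES = 42  # "roughly 42 connected services" from the execution log

expected_before = hello_events_per_minute(SERVICES, 10)  # 10-second default
expected_after = hello_events_per_minute(SERVICES, 30)   # patched 30-second refresh

# Observed values come from the execution log: 261/min before, 84/min after.
print(f"10s refresh: expected {expected_before:.0f}/min, observed 261/min")
print(f"30s refresh: expected {expected_after:.0f}/min, observed 84/min")
```

Under that assumption the model predicts 252/min and 84/min, which is close to the 261/min and 84/min samples and supports the conclusion that the churn scales with fleet size and refresh interval rather than with user request load.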

## Decisions & Risks
- `docs/operations/devops/TASKS.md` is referenced by the module AGENTS but does not exist in the repository. This sprint records status in `docs/implplan` instead.
- Any router-gateway tuning must preserve the gateway readiness contract and the current required microservice set; lowering CPU by making the gateway slower to detect disconnected services is not acceptable unless the tradeoff is explicit and bounded.
- PostgreSQL diagnostics should stay targeted. Full statement logging would distort the very CPU profile we are trying to understand.
- Router/Valkey analysis corrected an earlier assumption: `VALKEY_QUEUE_WAIT_TIMEOUT=0` does not create extra polling here. In the current implementation it means infinite wait on the pub/sub signal, which is risky for resilience but not the dominant CPU source. The measurable churn comes from repeated HELLO refreshes and gateway re-registration processing.
- PostgreSQL connection logging surfaced separate short-session churn from web workloads even after the router fix. Earlier samples showed bursts from `stellaops-advisory-ai-web` (`172.19.0.62`), while the later 60-second sample showed `stellaops-scanner-web` (`172.19.0.60`) opening most of the remote sessions. That is outside the router fix and should be handled as a dedicated connection-pooling and `Application Name` follow-up if it keeps mattering.
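For reproducibility, the targeted PostgreSQL settings recorded in the execution log could be rendered into statements with a small helper. This is a sketch only: the function names are hypothetical, `log_line_prefix` is left out because the sprint does not record its exact value, and the closing `pg_reload_conf()` call is an assumption about how the settings were activated (the log only says they were applied via `ALTER SYSTEM`).

```python
from typing import Dict, List

# Values copied from the sprint's execution log.
DIAGNOSTIC_SETTINGS: Dict[str, str] = {
    "log_min_duration_statement": "100ms",
    "log_connections": "on",
    "log_disconnections": "on",
    "log_lock_waits": "on",
    "deadlock_timeout": "500ms",
}


def render_alter_system(settings: Dict[str, str]) -> List[str]:
    """Render one ALTER SYSTEM statement per setting, then a config reload."""
    statements = [
        f"ALTER SYSTEM SET {name} = '{value}';" for name, value in settings.items()
    ]
    # Assumption: a reload (rather than a restart) is used to activate the values.
    statements.append("SELECT pg_reload_conf();")
    return statements


def slow_query_probe(seconds: float = 0.25) -> str:
    """A probe that should exceed the 100ms threshold, like the log's pg_sleep(0.25)."""
    return f"SELECT pg_sleep({seconds});"


for statement in render_alter_system(DIAGNOSTIC_SETTINGS):
    print(statement)
print(slow_query_probe())
```

Keeping the list this short matches the decision above to keep diagnostics targeted rather than enabling full statement logging.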

## Next Checkpoints
- Validate the lower-idle Consul mode against the live `stellaops-consul` container.
- Apply and verify PostgreSQL logging changes on the running stack.
- Use the new PostgreSQL logging to identify the highest-churn application sessions and decide whether `pg_stat_statements` or connection-string `Application Name` standardization is needed in local compose.

@@ -0,0 +1,89 @@

# Sprint 20260405-009 - Router Registration Resync And Hello Slimming

## Topic & Scope
- Replace the current periodic full HELLO replay with a cheaper control-plane pattern in the Router module.
- Keep endpoint/schema/OpenAPI replay available for service startup and explicit gateway resync, while periodic liveness traffic stays small.
- Preserve messaging transport resilience when Valkey Pub/Sub notifications degrade or disappear.
- Working directory: `src/Router/`.
- Expected evidence: targeted Router tests, updated router docs, and live compose/runtime samples.

## Dependencies & Concurrency
- Depends on `docs/implplan/SPRINT_20260405_008_Integrations_consul_pg_router_runtime_tuning.md` for the runtime baseline that exposed the HELLO flood.
- Read access required for `devops/compose/docker-compose.stella-ops.yml` and `devops/compose/README.md` to keep local runtime defaults aligned with the Router protocol behavior.
- Cross-module doc edits allowed for `docs/modules/router/**`, `docs/implplan/**`, and `devops/compose/README.md` when the runtime contract changes.

## Documentation Prerequisites
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/README.md`
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
- `docs/modules/platform/architecture-overview.md`
- `docs/modules/router/README.md`
- `docs/modules/router/architecture.md`
- `docs/modules/router/messaging-valkey-transport.md`
- `docs/features/checked/gateway/router-heartbeat-and-health-monitoring.md`
- `src/Router/AGENTS.md`
- `src/Router/__Libraries/StellaOps.Router.Gateway/AGENTS.md`
- `src/Router/__Libraries/StellaOps.Messaging.Transport.Valkey/AGENTS.md`

## Delivery Tracker

### TASK-1 - Trace current HELLO refresh and resync behavior
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Read the current HELLO payload, gateway registration flow, routing-state update path, and Valkey notifiable-queue fallback behavior.
- Produce a concrete design that distinguishes between startup registration, explicit gateway resync, and cheap periodic liveness traffic.

Completion criteria:
- [ ] Existing HELLO refresh triggers are documented in the sprint log with code references.
- [ ] The resubscription / missed-notification fallback behavior in the Valkey transport is documented so the protocol change does not remove needed resilience.
- [ ] The selected protocol change is scoped tightly enough to implement with focused Router tests.

### TASK-2 - Implement explicit resync signaling and slimmer periodic traffic
Status: DONE
Dependency: TASK-1
Owners: Developer
Task description:
- Add the minimal Router protocol/runtime changes needed so services send the heavy registration payload on startup and on explicit gateway resync, while periodic traffic avoids replaying the full endpoint catalog.
- Keep the gateway able to rebuild state after startup or administrative resync without depending on manual service restarts.

Completion criteria:
- [ ] Router code differentiates between full registration replay and lightweight periodic traffic.
- [ ] Gateway can trigger resync without requiring a full service restart.
- [ ] Existing routing, claims, and OpenAPI behaviors remain correct after the change.

### TASK-3 - Validate protocol behavior and runtime impact
Status: DONE
Dependency: TASK-2
Owners: Developer
Task description:
- Add or update targeted Router tests around HELLO/resync handling and Valkey fallback behavior.
- Re-run focused local runtime samples to verify the control-plane traffic drops without sacrificing readiness or routing correctness.

Completion criteria:
- [ ] Targeted Router test projects pass with coverage for the new protocol behavior.
- [ ] Live gateway readiness and routing stay healthy after the change.
- [ ] Sprint and router docs record the final behavior and residual tradeoffs.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-05 | Sprint created to move from compose-only tuning into Router protocol/runtime changes after the HELLO refresh flood was traced to periodic full registration replay across the service fleet. | Developer |
| 2026-04-05 | Traced the remaining messaging resilience path: Valkey consumers still run `XAUTOCLAIM` + `XREADGROUP` checks around `WaitForNotificationAsync(...)`, with timeout fallback, connection-restored wakeups, and randomized proactive re-subscribe retained on purpose for silent Pub/Sub failure recovery. | Developer |
| 2026-04-05 | Implemented explicit messaging resync: startup HELLO is identity-only, gateway can request metadata replay via `ResyncRequest`, microservices answer with `EndpointsUpdate`, and heartbeats now carry instance identity so gateway-state misses can recover without full reconnect churn. | Developer |
| 2026-04-05 | Targeted verification passed with Microsoft Testing Platform class filters: `RouterConnectionManagerTests` (19/19), `MessagingTransportQueueOptionsTests` (6/6), `GatewayRegistrationResyncServiceTests` (3/3), and `MessagingTransportIntegrationTests` (6/6). A full `StellaOps.Gateway.WebService.Tests` run still reports 2 unrelated route-table assertions in `GatewayRouteSearchMappingsTests`, which are outside this sprint write scope. | Developer |
| 2026-04-05 | Rebuilt and redeployed the live Router-dependent `docker-compose.stella-ops.yml` services so the new control frames were rolled out consistently across the running mesh. After health settled, a 60-second `docker stats` sample showed the restarted Stella Ops fleet below 1% CPU on average for every top-10 service; focused follow-up samples put `stellaops-router-gateway` at `1.17%` avg / `3.27%` max, `stellaops-platform` at `0.11%` avg, and `stellaops-signals` at `0.10%` avg. Router logs showed only 8 `HELLO received` events over 2 minutes after rollout. | Developer |
| 2026-04-05 | Extended post-rollout runtime sampling over 3 minutes kept `stellaops-evidence-locker-web` low at `0.19%` avg / `1.75%` max and `stellaops-postgres` at `0.71%` avg / `4.60%` max. Postgres slow-statement logs remained empty in the sampled window, while connection churn was dominated by `172.19.0.58` (`stellaops-advisory-ai-web`) with `173` connection-log entries in 10 minutes and blank `application_name`, which points to attribution/pooling debt rather than Evidence Locker pressure. The broader whole-stack sample still showed transient integration overhead outside this sprint scope, notably `stellaops-gitea` spikes despite an immediate follow-up spot sample already back at `0.04%` CPU. | Developer |
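The resync exchange described in the log can be modeled in a few lines. This is an illustrative sketch, not the Router implementation: the frame names `Hello`, `Heartbeat`, `ResyncRequest`, and `EndpointsUpdate` follow the log, while the `Frame`, `Microservice`, and `Gateway` classes, the instance name, and the endpoint string are hypothetical. It also assumes the gateway asks for a resync whenever it sees an identity it has no catalog for.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    kind: str  # "Hello", "Heartbeat", "ResyncRequest", or "EndpointsUpdate"
    instance_id: str
    endpoints: Tuple[str, ...] = ()  # only populated on EndpointsUpdate


@dataclass
class Microservice:
    instance_id: str
    endpoints: Tuple[str, ...]

    def hello(self) -> Frame:
        # Identity-only startup frame: no endpoint catalog attached.
        return Frame("Hello", self.instance_id)

    def heartbeat(self) -> Frame:
        # Heartbeats carry instance identity so the gateway can detect a state miss.
        return Frame("Heartbeat", self.instance_id)

    def handle(self, frame: Frame) -> Optional[Frame]:
        if frame.kind == "ResyncRequest":
            return Frame("EndpointsUpdate", self.instance_id, self.endpoints)
        return None


@dataclass
class Gateway:
    catalogs: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

    def handle(self, frame: Frame) -> Optional[Frame]:
        if frame.kind == "EndpointsUpdate":
            self.catalogs[frame.instance_id] = frame.endpoints
            return None
        if frame.instance_id not in self.catalogs:
            # Unknown identity: request the heavy metadata once instead of
            # expecting it on every periodic frame.
            return Frame("ResyncRequest", frame.instance_id)
        return None


service = Microservice("scanner-web-1", ("GET /api/v1/scans",))  # hypothetical values
gateway = Gateway()
request = gateway.handle(service.hello())          # gateway asks for metadata
gateway.handle(service.handle(request))            # service replays its catalog once
steady_state = gateway.handle(service.heartbeat()) # no further replay needed

restarted_gateway = Gateway()                      # simulate lost gateway state
recovery = restarted_gateway.handle(service.heartbeat())
```

In this model the endpoint catalog crosses the wire only after a `ResyncRequest`, which is the property that removes the periodic full replay while still letting a restarted gateway rebuild state from ordinary heartbeats.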

## Decisions & Risks
- The periodic HELLO flood was an architectural behavior, not just a bad compose default: `RouterConnectionManager` refreshed via transport `ConnectAsync(...)`, and the messaging transport used to serialize a full `HelloPayload` on every replay. This sprint removes that periodic metadata replay for messaging and replaces it with explicit control frames.
- The Valkey transport already contains explicit resilience traffic for silent Pub/Sub failure: timeout-based fallback waits plus proactive randomized re-subscription. Any protocol change must preserve those recovery paths.
- Backward compatibility matters across Router transports. If a new control frame is introduced, frame parsing and ignore/compatibility behavior must be explicit.
- `RegistrationRefreshInterval` still exists in Router options, but messaging transport no longer uses it to replay endpoint catalogs. Future cleanup can deprecate or rename that knob once non-messaging transport expectations are audited.
- Live rollout had to cover the full running Router mesh, not just `router-gateway`, because the new `ResyncRequest` / `EndpointsUpdate` control frames span shared Router client and server libraries. Partial deployment would have left old services unable to answer explicit resync requests.
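The timeout-based fallback mentioned above can be illustrated with a generic wait-then-drain loop. This is a sketch of the pattern only, under stated assumptions: a plain list stands in for the Valkey stream, an `asyncio.Event` stands in for the Pub/Sub notification, and the `consume` function with its parameters is hypothetical rather than the transport's actual API.

```python
import asyncio
from typing import List


async def consume(queue: List[str], notified: asyncio.Event, handled: List[str],
                  fallback_timeout: float, rounds: int) -> None:
    """Wait for a notification, but drain the queue even if none arrives."""
    for _ in range(rounds):
        try:
            await asyncio.wait_for(notified.wait(), timeout=fallback_timeout)
        except asyncio.TimeoutError:
            # Notification was missed or never sent: fall through and check anyway.
            pass
        notified.clear()
        while queue:
            handled.append(queue.pop(0))


async def demo() -> List[str]:
    queue = ["msg-1", "msg-2"]   # enqueued, but the notification is never set
    handled: List[str] = []
    await consume(queue, asyncio.Event(), handled, fallback_timeout=0.01, rounds=1)
    return handled


result = asyncio.run(demo())
print(result)
```

The messages are still handled even though the notification never fires, which is the recovery path the protocol change must not remove. Passing `timeout=None` to `asyncio.wait_for` would wait indefinitely, which mirrors the resilience concern the previous sprint raised about an infinite wait on the Pub/Sub signal.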

## Next Checkpoints
- Finalize the protocol change after tracing current HELLO and fallback flows.
- Implement and test the Router-side resync behavior.
- Re-sample the live stack after the Router change lands.

@@ -0,0 +1,77 @@

# Sprint 20260405-010 - AdvisoryAI PG Pooling And Gitea Spike Followup

## Topic & Scope
- Reduce AdvisoryAI PostgreSQL connection churn by adding stable application-name attribution and reusing pooled connections in the live knowledge-search and unified-search paths.
- Rebuild and redeploy the affected AdvisoryAI service, then resample PostgreSQL and AdvisoryAI runtime load to confirm the change.
- Capture the next transient Gitea CPU spike with process-level evidence instead of only container-level stats so the remaining integration outlier is attributable.
- Working directory: `src/AdvisoryAI/`.
- Expected evidence: targeted AdvisoryAI tests, updated AdvisoryAI deployment/runtime docs, compose/runtime samples, and Gitea process capture artifacts in the sprint log.

## Dependencies & Concurrency
- Depends on `docs/implplan/SPRINT_20260405_008_Integrations_consul_pg_router_runtime_tuning.md` for the PostgreSQL logging baseline.
- Depends on `docs/implplan/SPRINT_20260405_009_Router_registration_resync_and_hello_slimming.md` for the post-router-redeploy steady-state baseline.
- Cross-module edits allowed for `docs/implplan/**`, `docs/modules/advisory-ai/**`, and `devops/compose/**` when configuration or runtime procedures change.

## Documentation Prerequisites
- `docs/code-of-conduct/CODE_OF_CONDUCT.md`
- `docs/README.md`
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
- `docs/modules/platform/architecture-overview.md`
- `docs/modules/advisory-ai/architecture.md`
- `docs/modules/advisory-ai/deployment.md`
- `src/AdvisoryAI/AGENTS.md`
- `src/AdvisoryAI/StellaOps.AdvisoryAI/AGENTS.md`
- `src/AdvisoryAI/StellaOps.AdvisoryAI.WebService/AGENTS.md`
- `src/AdvisoryAI/StellaOps.AdvisoryAI.Hosting/AGENTS.md`

## Delivery Tracker

### AIAI-PG-POOL-001 - Tighten AdvisoryAI PostgreSQL attribution and pooling
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Trace the current AdvisoryAI PostgreSQL access paths, especially the knowledge-search and unified-search background services that currently use raw `NpgsqlConnection` or short-lived `NpgsqlDataSource` instances.
- Add stable PostgreSQL `application_name` attribution and consolidate those paths onto reusable pooled data sources so advisory-ai-web stops generating bursts of short physical sessions.
- Redeploy the affected AdvisoryAI service and resample PostgreSQL plus AdvisoryAI runtime load to verify the change.

Completion criteria:
- [x] AdvisoryAI PostgreSQL sessions expose a stable `application_name` instead of `[unknown]`.
- [x] AdvisoryAI knowledge-search/unified-search runtime paths reuse pooled connections instead of repeatedly constructing throwaway data sources.
- [x] Targeted AdvisoryAI tests pass and the live advisory-ai-web PostgreSQL churn drops measurably after redeploy.

### INT-GITEA-CPU-001 - Capture transient Gitea CPU spikes with process evidence
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Run a live watcher against `stellaops-gitea` long enough to catch the next transient CPU spike and capture process-level evidence from inside the container at spike time.
- Record what was observed, whether the spike is in the main Gitea process or another child/thread, and whether the existing logs/health probes explain it.

Completion criteria:
- [x] A live watcher captured at least one process-level sample during or immediately adjacent to a Gitea spike, or explicitly records that no spike occurred during the observation window.
- [x] Sprint notes state whether the spike was explained by current evidence or remains unresolved.
- [x] Any runtime procedure change needed for future capture is documented.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-05 | Sprint created after post-router steady-state sampling showed PostgreSQL itself was calm but AdvisoryAI still generated unattributed short sessions, while Gitea remained a transient integration outlier in longer CPU windows. | Developer |
| 2026-04-05 | Replaced AdvisoryAI knowledge-search/unified-search raw PostgreSQL connections and throwaway `NpgsqlDataSource` instances with a shared `KnowledgeSearchDataSourceProvider`; added stable `DatabaseApplicationName` plus idle-pool retention knobs and documented them in `docs/modules/advisory-ai/deployment.md`. | Developer |
| 2026-04-05 | Verified the new connection-string normalization with xUnit v3 direct runner: `dotnet exec src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/bin/Debug/net10.0/StellaOps.AdvisoryAI.Tests.dll -class StellaOps.AdvisoryAI.Tests.KnowledgeSearch.KnowledgeSearchDataSourceProviderTests` => `2/2` passed. | Developer |
| 2026-04-05 | Rebuilt `stellaops/advisory-ai-web:dev` via `devops/docker/build-all.ps1 -Services advisory-ai-web`, force-recreated `stellaops-advisory-ai-web`, and confirmed live env now sets `ADVISORYAI__KnowledgeSearch__DatabaseApplicationName=stellaops-advisory-ai-web/knowledge-search` plus `DatabaseConnectionIdleLifetimeSeconds=900`. | Developer |
| 2026-04-05 | Live PostgreSQL verification after redeploy showed `172.19.0.71` sessions attributed as `stellaops-advisory-ai-web/knowledge-search`; 2-minute steady-state sample settled at `stellaops-advisory-ai-web avg 0.77% CPU`, `stellaops-postgres avg 0.50%`, `stellaops-evidence-locker-web avg 0.14%`, `stellaops-router-gateway avg 0.89%`, `stellaops-gitea avg 0.10%`. | Developer |
| 2026-04-05 | Corrected the Gitea spike watcher to use BusyBox-compatible `sh -c` capture. Artifact `artifacts/runtime/gitea_spike_watch_20260405_175001.log` caught a `104.43%` spike and showed the load inside multiple `/usr/local/bin/gitea -c /etc/gitea/app.ini web` threads, with logs still showing only the periodic `/api/v1/version` health checks. | Developer |
| 2026-04-05 | Extended runtime verification with artifacts `artifacts/runtime/stack_sample_20260405_180815.log`, `artifacts/runtime/postgres_activity_20260405_180815.log`, and `artifacts/runtime/gitea_spike_watch_20260405_180815.log`. Over 23 whole-stack samples, `stellaops-advisory-ai-web avg 0.53% CPU`, `stellaops-postgres avg 0.43%`, `stellaops-evidence-locker-web avg 0.17%`, and `stellaops-gitea avg 0.29%` with no spike captures in 44 Gitea watch samples; PostgreSQL stayed at 4 idle `stellaops-advisory-ai-web/knowledge-search` sessions plus the expected generic idle pool and produced no slow-statement/connection-churn evidence in the sampled window. | Developer |
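The connection-string normalization mentioned in the log can be pictured as follows. This is a language-neutral sketch in Python, not the C# `KnowledgeSearchDataSourceProvider`: the function name is hypothetical, and it assumes that the provider fills in the Npgsql `Application Name` and `Connection Idle Lifetime` keywords only when the configured string does not already set them. The host and database values are placeholders.

```python
def normalize_connection_string(raw: str, application_name: str,
                                idle_lifetime_seconds: int) -> str:
    """Return `raw` with attribution and idle-lifetime keys added when missing."""
    pairs = {}
    for part in raw.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            pairs[key.strip()] = value.strip()

    # Connection-string keywords are compared case-insensitively.
    existing = {key.lower() for key in pairs}
    if "application name" not in existing:
        pairs["Application Name"] = application_name
    if "connection idle lifetime" not in existing:
        pairs["Connection Idle Lifetime"] = str(idle_lifetime_seconds)

    return ";".join(f"{key}={value}" for key, value in pairs.items())


normalized = normalize_connection_string(
    "Host=db;Database=stella",  # placeholder connection string
    "stellaops-advisory-ai-web/knowledge-search",
    900,
)
print(normalized)
```

Routing every knowledge-search and unified-search path through one normalized string is what lets a single pooled data source serve them all, so `pg_stat_activity` shows a handful of attributed idle sessions instead of bursts of anonymous short-lived ones.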

## Decisions & Risks
- AdvisoryAI connection churn was caused by code, not PostgreSQL itself: `UnifiedSearchIndexer`, `SearchAnalyticsService`, `SearchQualityMonitor`, `EntityAliasService`, and `PostgresKnowledgeSearchStore` were mixing pooled and non-pooled access patterns. The shared `KnowledgeSearchDataSourceProvider` is now the single runtime path for knowledge-search/unified-search PostgreSQL access.
- Runtime configuration is now explicit in both code and local compose: `src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/KnowledgeSearchOptions.cs`, `src/AdvisoryAI/StellaOps.AdvisoryAI/KnowledgeSearch/KnowledgeSearchDataSourceProvider.cs`, `devops/compose/docker-compose.stella-ops.yml`, and `docs/modules/advisory-ai/deployment.md`.
- `dotnet test --filter` is not trustworthy in this repo's current Microsoft Testing Platform setup because the VSTest filter property is ignored. Targeted verification for this sprint used the xUnit v3 assembly runner directly instead of pretending the `dotnet test` filter worked.
- PostgreSQL slow-statement logs stayed empty after redeploy, and `pg_stat_activity` now shows AdvisoryAI as `stellaops-advisory-ai-web/knowledge-search`; the remaining dominant PostgreSQL session counts belong to other services.
- Gitea spikes are real but are not explained by health-check traffic. The corrected capture shows transient CPU bursts inside the main multi-threaded Gitea web process itself, not a separate sidecar or shell child. The root cause remains internal to Gitea's runtime behavior on this persisted instance.
- The longer follow-up window did not reproduce a Gitea spike. That reduces urgency for emergency remediation, but it also confirms the problem is intermittent and requires either a longer watch or Gitea-native profiling during the next event for a complete root cause.
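A watcher of the kind described above needs to turn container-level samples into a spike decision before it captures process-level evidence. The sketch below covers only that parsing step. It assumes the samples come from `docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}"`; the function names and the 50% threshold are hypothetical, and the in-container capture command is deliberately left out because the sprint does not record it.

```python
from typing import Dict


def parse_cpu_samples(output: str) -> Dict[str, float]:
    """Parse lines of the form '<container-name> <cpu>%' into a name -> percent map."""
    samples: Dict[str, float] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].endswith("%"):
            samples[parts[0]] = float(parts[1].rstrip("%"))
    return samples


def is_spike(samples: Dict[str, float], container: str = "stellaops-gitea",
             threshold: float = 50.0) -> bool:
    """True when the watched container is at or above the (hypothetical) threshold."""
    return samples.get(container, 0.0) >= threshold


# Values taken from the execution log: the 104.43% spike and a calm PostgreSQL sample.
sample_output = "stellaops-gitea 104.43%\nstellaops-postgres 0.43%"
samples = parse_cpu_samples(sample_output)
print(samples, is_spike(samples))
```

Container CPU percentages can exceed 100% when several cores are busy, which is consistent with the captured spike being spread across multiple Gitea threads.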

## Next Checkpoints
- If AdvisoryAI PostgreSQL attribution needs to cover non-knowledge paths later, extend the same application-name pattern to any future chat-audit or EF-owned connection strings.
- If Gitea spikes need deeper root-cause attribution, the next step is Gitea-native profiling/debug endpoints or Go runtime profiling during a spike; the current shell-based watcher already proved the bursts are internal Gitea thread work, not external request load.