Reduce idle CPU across 62 containers (phase 1)

- Add resource limits (heavy/medium/light tiers) to all 59 .NET services
- Add .NET GC tuning (server/workstation GC, DATAS, conserve memory)
- Convert FirstSignalSnapshotWriter from 10s polling to Valkey pub/sub
- Convert EnvironmentSettingsRefreshService from 60s polling to Valkey pub/sub
- Consolidate GraphAnalytics dual timers to single timer with idle-skip
- Increase healthcheck interval from 30s to 60s (configurable)
- Reduce debug logging to Information on 4 high-traffic services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +1,141 @@
# Sprint 019 — Container CPU Optimization

## Topic & Scope
- Reduce idle CPU pressure from 62 Docker containers by adding resource limits, tuning GC, converting polling to event-driven patterns, and reducing log verbosity.
- Working directory: `devops/compose/`, `src/JobEngine/`, `src/Graph/`, `src/Platform/`.
- Expected evidence: compose validation, `docker stats` showing caps, reduced idle CPU.

## Dependencies & Concurrency
- No upstream sprint dependencies.
- Workstreams 1/2/4/6 (compose-only) are independent of workstreams 3A/3B/3D (C# changes).
- C# workstreams (3A, 3B, 3D) are independent of each other (different modules).

## Documentation Prerequisites
- `docs/modules/router/architecture.md` (Valkey messaging patterns).

## Delivery Tracker

### WS-1 — Resource Limits in Docker Compose
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add three resource tier YAML anchors (heavy/medium/light) to compose file.
- Apply `<<: *resources-{tier}` to all 59 .NET services.
- Infrastructure services (postgres, valkey, rustfs, registry, rekor) remain unconstrained.

Completion criteria:
- [x] Three resource anchors defined
- [x] Tier assignments: Heavy (6), Medium (16), Light (37)
- [x] `docker compose config` validates cleanly
- [x] Infrastructure services have no deploy limits
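
For illustration, the anchor-and-merge pattern described in WS-1 could look roughly like the fragment below. This is a sketch only: the limit values, service names, and image tags are hypothetical and are not taken from the actual compose file.

```yaml
# Hypothetical tier values; the real limits live in the compose file under devops/compose/.
x-resources-heavy: &resources-heavy
  deploy:
    resources:
      limits:
        cpus: "1.0"
        memory: 2G

x-resources-light: &resources-light
  deploy:
    resources:
      limits:
        cpus: "0.25"
        memory: 256M

services:
  example-heavy-service:          # hypothetical .NET service
    image: example/heavy:dev      # hypothetical image
    <<: *resources-heavy          # merges the `deploy:` block into this service

  postgres:                       # infrastructure: no anchor merged, so no limits
    image: postgres:16            # hypothetical tag
```

The YAML merge key is shallow, so a service that also declares its own `deploy:` key would replace the anchor's block rather than extend it.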

### WS-2 — Logging Debug→Information
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Change 4 services from Debug to Information logging, keeping Debug as comments.
- Services: router-gateway, platform, policy-engine, findings-ledger-web.

Completion criteria:
- [x] Debug log levels commented out with Information active
- [x] 4 services updated
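
A minimal sketch of what "Information active, Debug kept as a comment" might look like in compose. It assumes the services read the standard `Microsoft.Extensions.Logging` configuration from environment variables (with `__` as the section separator); the variable name actually used by these services is not shown in this sprint doc.

```yaml
services:
  platform:
    environment:
      # Logging__LogLevel__Default: "Debug"       # kept as a comment for quick re-enable
      Logging__LogLevel__Default: "Information"   # assumed variable name
```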

### WS-3A — FirstSignalSnapshotWriter Valkey Pub/Sub
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Convert 10s polling to Valkey subscription on `notify:firstsignal:dirty`.
- Add 60s fallback timer via `FallbackPollIntervalSeconds` option.
- Fire Valkey notification from JobEngineEventPublisher on job lifecycle events.

Completion criteria:
- [x] SemaphoreSlim + Valkey subscribe pattern implemented
- [x] Fallback timer extended from 10s to 60s
- [x] Event publisher fires dirty notification on orch.jobs channel events
- [x] Project builds with 0 errors
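
The "SemaphoreSlim + subscribe with fallback" pattern named above can be sketched with BCL types only. This is not the actual `FirstSignalSnapshotWriter` code: `DirtySignalLoop`, `NotifyDirty`, and the `refresh` delegate are hypothetical names, and the pub/sub handler that would call `NotifyDirty` is left out.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs `refresh` when notified, or after a fallback interval.
public sealed class DirtySignalLoop
{
    // Max count 1 coalesces a burst of notifications into a single wake-up.
    private readonly SemaphoreSlim _signal = new(0, 1);
    private readonly TimeSpan _fallback;

    public DirtySignalLoop(TimeSpan fallback) => _fallback = fallback;

    // Intended to be called from the pub/sub message handler.
    public void NotifyDirty()
    {
        try { _signal.Release(); }
        catch (SemaphoreFullException) { /* a wake-up is already pending */ }
    }

    // Loops until cancelled; cancellation surfaces as OperationCanceledException.
    public async Task RunAsync(Func<CancellationToken, Task> refresh, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            await refresh(ct);
            // Completes with true when signalled, false when the fallback elapsed.
            await _signal.WaitAsync(_fallback, ct);
        }
    }
}
```

With a 60s fallback, an idle service wakes once a minute instead of every 10s, while a notification still triggers a refresh almost immediately.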

### WS-3B — GraphAnalyticsHostedService Single Timer + Idle Skip
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Consolidate dual PeriodicTimer to single timer using Min(ClusterInterval, CentralityInterval).
- Add idle-check: skip pipeline when no pending snapshots exist.
- Add `SkipWhenIdle` option (default: true).

Completion criteria:
- [x] Single timer replaces dual timers
- [x] Idle check via IGraphSnapshotProvider.GetPendingSnapshotsAsync
- [x] Debug log emitted when skipping
- [x] Project builds with 0 errors
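
A rough sketch of the single-timer loop with idle skip, under stated assumptions: `SingleTimerLoop`, `hasPendingWork`, and `runPipeline` are hypothetical stand-ins for the hosted service, the `IGraphSnapshotProvider.GetPendingSnapshotsAsync` check, and the analytics pipeline respectively.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SingleTimerLoop
{
    public static async Task RunAsync(
        TimeSpan clusterInterval,
        TimeSpan centralityInterval,
        bool skipWhenIdle,
        Func<CancellationToken, Task<bool>> hasPendingWork, // stand-in for the pending-snapshot check
        Func<CancellationToken, Task> runPipeline,          // stand-in for the analytics pipeline
        CancellationToken ct)
    {
        // One timer ticking at the shorter of the two configured intervals.
        var interval = clusterInterval < centralityInterval ? clusterInterval : centralityInterval;
        using var timer = new PeriodicTimer(interval);

        // WaitForNextTickAsync throws OperationCanceledException when ct is cancelled.
        while (await timer.WaitForNextTickAsync(ct))
        {
            if (skipWhenIdle && !await hasPendingWork(ct))
            {
                continue; // nothing pending: skip this tick (the real service logs at Debug here)
            }

            await runPipeline(ct);
        }
    }
}
```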

### WS-3D — EnvironmentSettingsRefreshService Valkey Pub/Sub
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Register IConnectionMultiplexer in Platform DI from ConnectionStrings:Redis.
- Publish `notify:platform:envsettings:dirty` from PostgresEnvironmentSettingsStore on set/delete.
- Convert EnvironmentSettingsRefreshService from Task.Delay(60s) to Valkey subscription with 300s fallback.

Completion criteria:
- [x] IConnectionMultiplexer registered in Platform Program.cs
- [x] Store publishes dirty notification (fire-and-forget)
- [x] Refresh service uses SemaphoreSlim + Valkey subscribe
- [x] Project builds with 0 errors
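
The publish and subscribe halves might be wired roughly as follows. This sketch assumes the StackExchange.Redis package (implied by `IConnectionMultiplexer`) in a version recent enough to provide `RedisChannel.Literal`; the `EnvSettingsNotifications` class, its method names, and the `"1"` payload are hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class EnvSettingsNotifications
{
    private static readonly RedisChannel Channel =
        RedisChannel.Literal("notify:platform:envsettings:dirty");

    // Store side: fire-and-forget, so a set/delete never waits on Valkey.
    public static void PublishDirty(IConnectionMultiplexer mux) =>
        mux.GetSubscriber().Publish(Channel, "1", CommandFlags.FireAndForget);

    // Refresh-service side: invoke `onDirty` (e.g. release a semaphore) per message.
    public static Task SubscribeAsync(IConnectionMultiplexer mux, Action onDirty) =>
        mux.GetSubscriber().SubscribeAsync(Channel, (_, _) => onDirty());
}
```

Because pub/sub delivery is not guaranteed, the 300s fallback remains the correctness backstop and the notification only shortens the usual latency.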

### WS-4 — Health Check Interval 60s (Configurable)
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Change healthcheck anchors from 30s to `${HEALTHCHECK_INTERVAL:-60s}`.
- Propagates to all ~57 services using these anchors.

Completion criteria:
- [x] Both healthcheck anchors updated
- [x] Environment variable override supported
- [x] Rendered config shows 60s intervals
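
As an illustration of the configurable interval, a healthcheck anchor using Compose variable interpolation could look like this. The anchor name, timeout, retries, and probe command are hypothetical; only the `${HEALTHCHECK_INTERVAL:-60s}` expression comes from the task description.

```yaml
x-healthcheck-http: &healthcheck-http      # hypothetical anchor name
  interval: ${HEALTHCHECK_INTERVAL:-60s}   # falls back to 60s when the variable is unset or empty
  timeout: 5s                              # hypothetical
  retries: 3                               # hypothetical

services:
  example-service:                         # hypothetical service
    image: example/service:dev
    healthcheck:
      <<: *healthcheck-http
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # hypothetical probe
```

Running `HEALTHCHECK_INTERVAL=30s docker compose config` should then render 30s intervals, which gives operators a quick way to confirm the override.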

### WS-5 — Messaging Transport (No Changes)
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Verified Valkey messaging transport is already subscription-based with SemaphoreSlim + fallback.
- No changes needed.

Completion criteria:
- [x] Verified ValkeyMessageQueue already uses push-first pattern

### WS-6 — GC Configuration
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add three GC tuning YAML anchors (heavy/medium/light) with DOTNET_gcServer, GCConserveMemory, GCDynamicAdaptationMode.
- Merge into all 59 .NET service environments.

Completion criteria:
- [x] Three GC anchors defined
- [x] Heavy/Medium use Server GC; Light uses Workstation GC
- [x] GCDynamicAdaptationMode=1 (DATAS) on all services
- [x] Not applied to non-.NET infrastructure
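
A sketch of how the GC anchors might be expressed, assuming the settings are passed as `DOTNET_`-prefixed environment variables. The anchor names, the `GCConserveMemory` values, and the example service are hypothetical; only the Server/Workstation split and `GCDynamicAdaptationMode=1` come from the criteria above.

```yaml
x-gc-heavy: &gc-heavy
  DOTNET_gcServer: "1"                   # Server GC
  DOTNET_GCConserveMemory: "5"           # hypothetical value (valid range 0-9)
  DOTNET_GCDynamicAdaptationMode: "1"    # DATAS

x-gc-light: &gc-light
  DOTNET_gcServer: "0"                   # Workstation GC
  DOTNET_GCConserveMemory: "7"           # hypothetical value
  DOTNET_GCDynamicAdaptationMode: "1"

services:
  example-light-service:                 # hypothetical service
    image: example/light:dev
    environment:                         # must use mapping form for the merge key to work
      <<: *gc-light
      ASPNETCORE_URLS: "http://+:8080"   # hypothetical service-specific variable
```

As far as I know, DATAS is a Server GC feature, so the setting is likely a no-op on the Workstation GC tier; keeping it there is harmless and keeps the anchors uniform.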

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-10 | Sprint created. All workstreams completed. All 3 C# projects build clean. Compose validates clean. | Developer |

## Decisions & Risks
- Resource limits are dev/QA defaults; production deployments should tune them per hardware.
- GCDynamicAdaptationMode=1 requires .NET 8+; all services use .NET 8/9.
- The healthcheck interval can be overridden via the `HEALTHCHECK_INTERVAL` env var for operator flexibility.
- Valkey pub/sub notifications are fire-and-forget; the fallback timers ensure correctness if a notification is missed.

## Next Checkpoints
- Rebuild affected images (platform, jobengine, graph-indexer) after C# changes merge.
- Verify `docker stats` shows resource caps in dev environment.