Reduce idle CPU across 62 containers (phase 1)

- Add resource limits (heavy/medium/light tiers) to all 59 .NET services
- Add .NET GC tuning (server/workstation GC, DATAS, conserve memory)
- Convert FirstSignalSnapshotWriter from 10s polling to Valkey pub/sub
- Convert EnvironmentSettingsRefreshService from 60s polling to Valkey pub/sub
- Consolidate GraphAnalytics dual timers to single timer with idle-skip
- Increase healthcheck interval from 30s to 60s (configurable)
- Reduce debug logging to Information on 4 high-traffic services

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
master
2026-03-10 02:16:19 +02:00
parent c0c0267ac9
commit 166745f9f9
12 changed files with 601 additions and 89 deletions

@@ -0,0 +1,141 @@
# Sprint 019 — Container CPU Optimization
## Topic & Scope
- Reduce idle CPU pressure from 62 Docker containers by adding resource limits, tuning GC, converting polling to event-driven patterns, and reducing log verbosity.
- Working directory: `devops/compose/`, `src/JobEngine/`, `src/Graph/`, `src/Platform/`.
- Expected evidence: compose validation, `docker stats` showing caps, reduced idle CPU.
## Dependencies & Concurrency
- No upstream sprint dependencies.
- Workstreams 1/2/4/6 (compose-only) are independent of workstreams 3A/3B/3D (C# changes).
- C# workstreams (3A, 3B, 3D) are independent of each other (different modules).
## Documentation Prerequisites
- `docs/modules/router/architecture.md` (Valkey messaging patterns).
## Delivery Tracker
### WS-1 — Resource Limits in Docker Compose
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add three resource tier YAML anchors (heavy/medium/light) to compose file.
- Apply `<<: *resources-{tier}` to all 59 .NET services.
- Infrastructure services (postgres, valkey, rustfs, registry, rekor) remain unconstrained.
Completion criteria:
- [x] Three resource anchors defined
- [x] Tier assignments: Heavy (6), Medium (16), Light (37)
- [x] `docker compose config` validates cleanly
- [x] Infrastructure services have no deploy limits
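
The anchor-and-merge pattern described in WS-1 could be sketched roughly as follows. This is an illustrative fragment, not the committed compose file: the anchor names follow the `resources-{tier}` form quoted above, while the CPU and memory figures, service names, and images are placeholders chosen for the example.

```yaml
# Hypothetical tier anchors; the limits shown are illustrative only.
x-resources-heavy: &resources-heavy
  deploy:
    resources:
      limits:
        cpus: "1.0"
        memory: 2G

x-resources-light: &resources-light
  deploy:
    resources:
      limits:
        cpus: "0.25"
        memory: 256M

services:
  example-heavy-service:          # placeholder service name
    image: example/heavy:dev      # placeholder image
    <<: *resources-heavy          # merges the deploy block from the anchor

  example-light-service:          # placeholder service name
    image: example/light:dev      # placeholder image
    <<: *resources-light

  postgres:                       # infrastructure: no merge key, so no deploy limits
    image: postgres:16
```

Running `docker compose config` on a file shaped like this should print each service with the merged `deploy.resources.limits` block expanded, which is one way to check the tier assignments.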
### WS-2 — Logging Debug→Information
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Change 4 services from Debug to Information logging, keeping the previous Debug settings as commented-out lines for quick re-enabling.
- Services: router-gateway, platform, policy-engine, findings-ledger-web.
Completion criteria:
- [x] Debug log levels commented out with Information active
- [x] 4 services updated
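
A minimal sketch of what the WS-2 change might look like for one service. It assumes the log level is set through the standard ASP.NET Core `Logging__LogLevel__Default` environment variable; the actual key used in the compose file is not shown in this sprint document, so treat the key name as an assumption.

```yaml
services:
  router-gateway:
    environment:
      # Assumed key name; the Debug value is kept as a comment for quick re-enabling.
      # Logging__LogLevel__Default: "Debug"
      Logging__LogLevel__Default: "Information"
```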
### WS-3A — FirstSignalSnapshotWriter Valkey Pub/Sub
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Convert 10s polling to Valkey subscription on `notify:firstsignal:dirty`.
- Add 60s fallback timer via `FallbackPollIntervalSeconds` option.
- Fire Valkey notification from JobEngineEventPublisher on job lifecycle events.
Completion criteria:
- [x] SemaphoreSlim + Valkey subscribe pattern implemented
- [x] Fallback timer extended from 10s to 60s
- [x] Event publisher fires dirty notification on orch.jobs channel events
- [x] Project builds with 0 errors
### WS-3B — GraphAnalyticsHostedService Single Timer + Idle Skip
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Consolidate dual PeriodicTimer to single timer using Min(ClusterInterval, CentralityInterval).
- Add idle-check: skip pipeline when no pending snapshots exist.
- Add `SkipWhenIdle` option (default: true).
Completion criteria:
- [x] Single timer replaces dual timers
- [x] Idle check via IGraphSnapshotProvider.GetPendingSnapshotsAsync
- [x] Debug log emitted when skipping
- [x] Project builds with 0 errors
### WS-3D — EnvironmentSettingsRefreshService Valkey Pub/Sub
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Register IConnectionMultiplexer in Platform DI from ConnectionStrings:Redis.
- Publish `notify:platform:envsettings:dirty` from PostgresEnvironmentSettingsStore on set/delete.
- Convert EnvironmentSettingsRefreshService from Task.Delay(60s) to Valkey subscription with 300s fallback.
Completion criteria:
- [x] IConnectionMultiplexer registered in Platform Program.cs
- [x] Store publishes dirty notification (fire-and-forget)
- [x] Refresh service uses SemaphoreSlim + Valkey subscribe
- [x] Project builds with 0 errors
### WS-4 — Health Check Interval 60s (Configurable)
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Change the interval in both healthcheck anchors from 30s to `${HEALTHCHECK_INTERVAL:-60s}`.
- The change propagates to all ~57 services that use these anchors.
Completion criteria:
- [x] Both healthcheck anchors updated
- [x] Environment variable override supported
- [x] Rendered config shows 60s intervals
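
The configurable interval from WS-4 could look roughly like this. Only the `${HEALTHCHECK_INTERVAL:-60s}` expression comes from this document; the anchor name, timeout, retry count, service name, and probe command are placeholders.

```yaml
# Hypothetical anchor name; timeout and retries are illustrative values.
x-healthcheck-example: &healthcheck-example
  interval: ${HEALTHCHECK_INTERVAL:-60s}   # falls back to 60s when the variable is unset or empty
  timeout: 5s
  retries: 3

services:
  example-service:                 # placeholder service name
    image: example/service:dev     # placeholder image
    healthcheck:
      <<: *healthcheck-example
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # placeholder probe
```

With a file like this, `HEALTHCHECK_INTERVAL=30s docker compose config` should render a 30s interval, and omitting the variable should render 60s.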
### WS-5 — Messaging Transport (No Changes)
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Verified Valkey messaging transport is already subscription-based with SemaphoreSlim + fallback.
- No changes needed.
Completion criteria:
- [x] Verified ValkeyMessageQueue already uses push-first pattern
### WS-6 — GC Configuration
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add three GC tuning YAML anchors (heavy/medium/light) with DOTNET_gcServer, GCConserveMemory, GCDynamicAdaptationMode.
- Merge into all 59 .NET service environments.
Completion criteria:
- [x] Three GC anchors defined
- [x] Heavy/Medium use Server GC; Light uses Workstation GC
- [x] GCDynamicAdaptationMode=1 (DATAS) on all services
- [x] Not applied to non-.NET infrastructure
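
As a sketch of the WS-6 anchors, the fragment below merges a GC tier into a service's environment. The three `DOTNET_` variable names are the ones listed in the task description; the anchor names, the `GCConserveMemory` values, and the service names are assumptions for illustration. The merge key only works when `environment` is written in map form rather than list form.

```yaml
# Hypothetical anchor names; GCConserveMemory values are illustrative (valid range is 0-9).
x-gc-heavy: &gc-heavy
  DOTNET_gcServer: "1"                  # Server GC
  DOTNET_GCConserveMemory: "5"
  DOTNET_GCDynamicAdaptationMode: "1"   # DATAS

x-gc-light: &gc-light
  DOTNET_gcServer: "0"                  # Workstation GC
  DOTNET_GCConserveMemory: "7"
  DOTNET_GCDynamicAdaptationMode: "1"

services:
  example-heavy-service:                # placeholder service name
    image: example/heavy:dev            # placeholder image
    environment:
      <<: *gc-heavy
      ASPNETCORE_URLS: "http://+:8080"  # service-specific settings sit alongside the merged keys

  example-light-service:                # placeholder service name
    image: example/light:dev            # placeholder image
    environment:
      <<: *gc-light
```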
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-10 | Sprint created. All workstreams completed. All 3 C# projects build clean. Compose validates clean. | Developer |
## Decisions & Risks
- Resource limits are dev/QA defaults; production deployments should tune per hardware.
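- GCDynamicAdaptationMode=1 (DATAS) requires .NET 8+; all services use .NET 8/9. DATAS is a Server GC feature, so the setting is expected to have no effect on Light-tier services running Workstation GC.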
- Healthcheck interval override via HEALTHCHECK_INTERVAL env var for operator flexibility.
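- Valkey pub/sub notifications are fire-and-forget; the fallback timers (60s for FirstSignalSnapshotWriter, 300s for EnvironmentSettingsRefreshService) bound staleness when a notification is missed.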
## Next Checkpoints
- Rebuild affected images (platform, jobengine, graph-indexer) after C# changes merge.
- Verify `docker stats` shows resource caps in dev environment.