Concelier:
- Register Topology.Read, Topology.Manage, Topology.Admin authorization
policies mapped to OrchRead/OrchOperate/PlatformContextRead/IntegrationWrite
scopes. Previously these policies were referenced by endpoints but never
registered, so every topology API call threw
System.InvalidOperationException.
Gateway routes:
- Simplified targets/environments routes (removed specific sub-path routes
in favor of catch-all patterns)
- Changed environments base route to JobEngine (where CRUD lives)
- Changed all topology routes to the ReverseProxy type
KNOWN ISSUE (not yet fixed):
- ReverseProxy routes don't forward the gateway's identity envelope to
Concelier. The regions/targets/bindings endpoints return 401 because
hasPrincipal=False: the gateway authenticates the user but doesn't
pass the identity to the backend via ReverseProxy. Microservice routes
use the Valkey transport, which includes envelope headers. Topology
endpoints need either (a) Valkey transport registration in Concelier, or
(b) Concelier configured to accept raw bearer tokens on ReverseProxy paths.
This is an architecture-level fix.
Journey findings collected so far:
- Integration wizard (Harbor + GitHub App): works end-to-end
- Advisory Check All: fixed (parallel individual checks)
- Mirror domain creation: works, but generate-immediately fails silently
- Topology wizard Step 1 (Region): blocked by auth passthrough issue
- Topology wizard Step 2 (Environment): POST to JobEngine needs verification
- User ID resolution: raw hashes shown everywhere
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded 1-5s polling constants with configurable
QueueWaitTimeoutSeconds (default 0 = pure event-driven). Consumers
now only wake on pub/sub notifications, eliminating ~118 idle
XREADGROUP polls per second across 59 services. Set the
VALKEY_QUEUE_WAIT_TIMEOUT env var to override the default if a
safety-net poll is needed.
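A minimal sketch of how such a setting could be resolved, written in Python for illustration only (the actual services are .NET). The function names, the integer-seconds parsing, and the fallback to the default on invalid input are assumptions; only the env var name and the default of 0 come from this change.

```python
import os


def resolve_queue_wait_timeout(env=None, default_seconds=0):
    """Return the queue wait timeout in seconds; 0 means purely event-driven."""
    env = os.environ if env is None else env
    raw = env.get("VALKEY_QUEUE_WAIT_TIMEOUT", "").strip()
    if not raw:
        return default_seconds
    try:
        value = int(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return default_seconds
    # Assumption: negative values are treated as invalid.
    return value if value >= 0 else default_seconds


def wait_timeout_or_none(seconds):
    """Map 0 to None (block until notified); otherwise return a float timeout."""
    return None if seconds == 0 else float(seconds)
```

With the default of 0 the consumer blocks until a notification arrives; any positive value reintroduces a periodic safety-net wake-up.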
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Valkey transport layer used 100ms busy-polling loops (Task.Delay(100))
across ~90 concurrent loops in 45+ services, generating ~900 idle
commands/sec and burning ~58% CPU while the system was completely idle.
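The command rate follows directly from the loop count and the delay. A small Python check of that arithmetic (the function name is illustrative, and it assumes each loop iteration issues one command):

```python
def idle_commands_per_second(loops, poll_interval_ms):
    """Idle commands per second if every loop issues one command per poll."""
    return loops * 1000 / poll_interval_ms


# ~90 loops polling every 100 ms
rate = idle_commands_per_second(90, 100)
```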
Replace polling with Redis Pub/Sub notifications:
- Publishers fire PUBLISH after each XADD (fire-and-forget)
- Consumers SUBSCRIBE and wait on a SemaphoreSlim with a 30s fallback timeout
- Applies to both ValkeyMessageQueue (INotifiableQueue) and ValkeyEventStream
- Non-Valkey transports fall back to 1s polling via QueueWaitExtensions
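The notify-then-wait pattern in the list above can be sketched roughly as follows. This is an in-memory Python illustration, not the .NET implementation: NotifiableQueue and its methods are hypothetical names, an asyncio.Semaphore stands in for SemaphoreSlim, and a plain list stands in for the Valkey stream.

```python
import asyncio


class NotifiableQueue:
    """In-memory stand-in for a stream plus a pub/sub wake-up signal."""

    def __init__(self):
        self._items = []
        self._signal = asyncio.Semaphore(0)

    def publish(self, item):
        # Append first (the XADD step), then wake a waiter (the PUBLISH step).
        self._items.append(item)
        self._signal.release()

    async def wait_for_work(self, fallback_timeout=30.0):
        # Sleep until notified. The fallback timeout means a missed
        # notification only delays processing instead of stalling it.
        try:
            await asyncio.wait_for(self._signal.acquire(), timeout=fallback_timeout)
            return True
        except asyncio.TimeoutError:
            return False

    def drain(self):
        # Read everything that is pending; an extra wake-up that finds
        # nothing simply returns an empty list.
        items, self._items = self._items, []
        return items


async def demo():
    queue = NotifiableQueue()
    # Simulate a publisher that adds one item 50 ms from now.
    asyncio.get_running_loop().call_later(0.05, queue.publish, "job-1")
    notified = await queue.wait_for_work(fallback_timeout=1.0)
    first = queue.drain()
    # Nothing else is published, so the second wait hits the fallback timeout.
    timed_out = not await queue.wait_for_work(fallback_timeout=0.05)
    return notified, first, timed_out
```

Running `asyncio.run(demo())` exercises both paths: a wake-up from a notification and a wake-up from the fallback timeout. The consumer always re-reads the queue after waking, so the signal is only a hint and never carries data.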
Increase heartbeat interval from 10s to 45s across all transport options,
with corresponding health threshold adjustments (stale: 135s, degraded: 90s).
Expected idle CPU reduction: ~58% → ~3-5%.
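The new thresholds are multiples of the 45s interval (90s = 2 x 45s, 135s = 3 x 45s). A Python sketch of how a health check might classify a peer under those numbers; the function name, the "healthy" label, and the inclusive comparisons are assumptions, not taken from the transport code:

```python
def health_status(seconds_since_heartbeat, degraded_after=90, stale_after=135):
    """Classify a peer by the age of its last heartbeat, in seconds."""
    if seconds_since_heartbeat >= stale_after:
        return "stale"
    if seconds_since_heartbeat >= degraded_after:
        return "degraded"
    return "healthy"
```

Under these assumptions a peer is marked degraded after two missed heartbeats and stale after three.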
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>