synergy moats product advisory implementations
178
docs/operations/runbooks/orchestrator-gate-timeout.md
Normal file
@@ -0,0 +1,178 @@
# Runbook: Release Orchestrator - Gate Evaluation Timeout

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.gate-timeout` |

---

## Symptoms

- [ ] Promotion gates timing out before completing evaluation
- [ ] Alert `OrchestratorGateTimeout` firing
- [ ] Error: "gate evaluation timeout exceeded"
- [ ] Promotion stuck waiting for gate response
- [ ] Metric `orchestrator_gate_timeout_total` increasing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
| **Data integrity** | No data loss; promotion can be retried |
| **SLA impact** | Release SLO violated if timeout persists |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.orchestrator.gate-timeout
   ```

2. **Identify timed-out gates:**
   ```bash
   stella promotion gates <promotion-id> --status timeout
   ```

3. **Check gate service health:**
   ```bash
   stella orch gate-services status
   ```
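
During an incident it can help to run the three quick checks in one pass and keep going even if one of them fails. The following is a minimal sketch that wraps only the commands listed above. The function name `gate_timeout_triage` and the `STELLA_BIN` override are hypothetical, and the sketch assumes the `stella` CLI exits non-zero when a check cannot complete. To use it, source the function and run `gate_timeout_triage <promotion-id>`.

```bash
# Run the quick checks in order; a failing step is counted, not fatal,
# so one broken subcommand does not hide the output of the others.
# Assumption: stella exits non-zero when a check cannot complete.
gate_timeout_triage() {
  local promotion_id="$1"
  local stella="${STELLA_BIN:-stella}"   # hypothetical override, e.g. for dry runs
  local failed=0

  if [ -z "$promotion_id" ]; then
    echo "usage: gate_timeout_triage <promotion-id>" >&2
    return 2
  fi

  echo "== Doctor diagnostics =="
  "$stella" doctor --check check.orchestrator.gate-timeout || failed=$((failed + 1))

  echo "== Timed-out gates for ${promotion_id} =="
  "$stella" promotion gates "$promotion_id" --status timeout || failed=$((failed + 1))

  echo "== Gate service health =="
  "$stella" orch gate-services status || failed=$((failed + 1))

  echo "quick checks finished, ${failed} step(s) failed"
  [ "$failed" -eq 0 ]
}
```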
### Deep diagnosis

1. **Check specific gate latency:**
   ```bash
   stella orch gate stats --gate <gate-name> --last 1h
   ```
   Look for: P95 latency, timeout rate

2. **Check external service connectivity:**
   ```bash
   stella orch connectivity --gate <gate-name>
   ```

3. **Check gate evaluation logs:**
   ```bash
   stella orch logs --gate <gate-name> --promotion <promotion-id>
   ```
   Look for: Slow queries, external API delays

4. **Check policy engine latency (for policy gates):**
   ```bash
   stella policy stats --last 10m
   ```
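
To make the "P95 latency" figure from step 1 concrete, here is a small sketch of a nearest-rank percentile over raw latency samples. The runbook does not show the output format of `stella orch gate stats`, so this assumes you already have plain numbers (seconds, one per line) from logs or an export. The function name `percentile` is illustrative, and it expects an integer percentile. With ten samples whose slowest is 130 seconds, `percentile 95` selects that 130 second sample, which is above the 2m default timeout even though the median looks healthy.

```bash
# Nearest-rank percentile over numbers read from stdin, one per line.
# Usage: printf '%s\n' 12 15 18 20 22 25 30 45 90 130 | percentile 95
percentile() {
  local p="$1"   # integer percentile, e.g. 95
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }                       # samples arrive sorted ascending
    END {
      if (NR == 0) exit 1                # no samples: signal failure
      rank = int((p * NR + 99) / 100)    # ceil(p/100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```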

---

## Resolution

### Immediate mitigation

1. **Increase timeout for specific gate:**
   ```bash
   stella orch config set gates.<gate-name>.timeout 5m
   stella orch reload
   ```

2. **Skip the timed-out gate (requires approval):**
   ```bash
   stella promotion gate skip <promotion-id> <gate-name> \
     --reason "External service timeout - approved by <approver>"
   ```

3. **Retry the promotion:**
   ```bash
   stella promotion retry <promotion-id>
   ```
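
Because skipping a gate requires approval, one option is to wrap the skip command so it refuses to run without a named approver. This is only a sketch: the function name `skip_gate_with_approval` and the `STELLA_BIN` override are hypothetical, and the wrapper does nothing beyond building the same `--reason` string shown in step 2. Call it as `skip_gate_with_approval <promotion-id> <gate-name> <approver>`.

```bash
# Refuse to skip a gate unless promotion, gate, and approver are all given.
skip_gate_with_approval() {
  local promotion_id="$1" gate_name="$2" approver="$3"
  local stella="${STELLA_BIN:-stella}"   # hypothetical override, e.g. for dry runs

  if [ -z "$promotion_id" ] || [ -z "$gate_name" ] || [ -z "$approver" ]; then
    echo "usage: skip_gate_with_approval <promotion-id> <gate-name> <approver>" >&2
    return 2
  fi

  # Same command and reason format as the manual step.
  "$stella" promotion gate skip "$promotion_id" "$gate_name" \
    --reason "External service timeout - approved by ${approver}"
}
```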
### Root cause fix

**If external service is slow:**

1. Configure gate retry with backoff:
   ```bash
   stella orch config set gates.<gate-name>.retries 3
   stella orch config set gates.<gate-name>.retry_backoff 5s
   ```

2. Enable gate result caching:
   ```bash
   stella orch config set gates.<gate-name>.cache_ttl 5m
   ```

3. Configure circuit breaker:
   ```bash
   stella orch config set gates.<gate-name>.circuit_breaker.enabled true
   stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
   ```
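
Retries trade a higher success rate for a longer worst case, so it is worth estimating how long a promotion could stall before changing these values. The sketch below assumes the timeout applies to each attempt and that `retry_backoff` is a fixed delay between attempts. The runbook does not state either, so check the gate documentation before relying on the numbers. Under those assumptions the worst case is `(retries + 1) * timeout + retries * backoff`. With the 2m default timeout, 3 retries, and a 5s backoff, that comes to 495 seconds, a little over 8 minutes. The function name is illustrative.

```bash
# Worst-case wall-clock seconds for one gate.
# Assumptions: timeout is per attempt, backoff is a fixed delay between attempts.
# Usage: worst_case_gate_seconds <timeout-seconds> <retries> <backoff-seconds>
worst_case_gate_seconds() {
  local timeout_s="$1" retries="$2" backoff_s="$3"
  echo $(( (retries + 1) * timeout_s + retries * backoff_s ))
}
```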

**If policy evaluation is slow:**

1. Optimize policy (see `policy-evaluation-slow.md` runbook)

2. Increase policy worker count:
   ```bash
   stella policy config set opa.workers 4
   ```

**If evidence retrieval is slow:**

1. Enable evidence pre-fetching:
   ```bash
   stella orch config set gates.evidence_prefetch true
   ```

2. Increase evidence cache:
   ```bash
   stella orch config set evidence.cache_size 1000
   stella orch config set evidence.cache_ttl 10m
   ```

### Verification

```bash
# Retry promotion
stella promotion retry <promotion-id>

# Monitor gate evaluation
stella promotion gates <promotion-id> --watch

# Check gate latency improved
stella orch gate stats --gate <gate-name> --last 10m

# Verify no timeouts
stella orch logs --filter "timeout" --last 30m
```
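
The final "verify no timeouts" step can be turned into a pass/fail check by counting matching log lines. This sketch assumes the error text from the Symptoms section, "gate evaluation timeout exceeded", appears verbatim in the log output. The log format is not documented here, so adjust the pattern if it differs. The function name is illustrative. It reads from stdin, for example `stella orch logs --filter "timeout" --last 30m | assert_no_gate_timeouts`.

```bash
# Succeeds only if no log line on stdin contains the gate timeout error.
# Assumption: the error string appears verbatim in the log lines.
assert_no_gate_timeouts() {
  local hits
  # grep -c prints the count even when it is 0; "|| true" absorbs grep's exit status 1.
  hits="$(grep -c 'gate evaluation timeout exceeded' || true)"
  echo "timeout lines found: ${hits}"
  [ "$hits" -eq 0 ]
}
```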

---

## Prevention

- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
- [ ] **Monitoring:** Alert on gate P95 latency > 1m
- [ ] **Caching:** Enable caching for slow gates
- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
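
The timeout and alert thresholds above are written as durations (`2m`, `1m`), while latency figures are often reported in seconds. The sketch below converts a single-unit duration to seconds and compares a measured P95 against a threshold. It assumes durations use one `s`, `m`, or `h` suffix, as in the examples in this runbook. The function names are hypothetical and this is not how the orchestrator itself evaluates alerts. For example, `p95_breaches 90s 1m` succeeds because 90 seconds is above the 1m alert threshold.

```bash
# Convert a single-unit duration such as 5s, 2m, or 1h to seconds.
# Assumption: one unit suffix only; values like "1m30s" are rejected.
duration_to_seconds() {
  local value="${1%[smh]}" unit="${1##*[0-9]}"
  case "$value" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    ''|s) echo "$value" ;;
    m)    echo $(( value * 60 )) ;;
    h)    echo $(( value * 3600 )) ;;
    *)    echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the measured P95 is strictly above the threshold.
# Returns 2 if either duration cannot be parsed.
p95_breaches() {
  local p95_s threshold_s
  p95_s="$(duration_to_seconds "$1")" || return 2
  threshold_s="$(duration_to_seconds "$2")" || return 2
  [ "$p95_s" -gt "$threshold_s" ]
}
```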

---

## Related Resources

- **Architecture:** `docs/modules/release-orchestrator/gates.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
- **Dashboard:** Grafana > Stella Ops > Gate Latency