synergy moats product advisory implementations
docs/operations/runbooks/orchestrator-quota-exceeded.md

# Runbook: Release Orchestrator - Promotion Quota Exhausted

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Medium |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.quota-status` |

---

## Symptoms

- [ ] Promotions failing with "quota exceeded"
- [ ] Alert `OrchestratorQuotaExceeded` firing
- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
- [ ] New promotions being rejected
- [ ] Queued promotions not processing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New releases blocked until quota resets or increases |
| **Data integrity** | No data loss; promotions queued for later |
| **SLA impact** | Release frequency SLO may be violated |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.orchestrator.quota-status
   ```

2. **Check current quota usage:**
   ```bash
   stella orch quota status
   ```

3. **Check quota limits:**
   ```bash
   stella orch quota limits show
   ```

### Deep diagnosis

1. **Check promotion history:**
   ```bash
   stella promotion list --last 24h --count
   ```
   Look for: Unusual spike in promotions

2. **Check per-environment quotas:**
   ```bash
   stella orch quota status --by-environment
   ```

3. **Check for runaway automation:**
   ```bash
   stella promotion list --last 1h --by-actor
   ```
   Problem if: Single actor/service making many promotions

4. **Check when quota resets:**
   ```bash
   stella orch quota reset-time
   ```

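The "single actor making many promotions" check in step 3 can be sketched as a small filter. This is an illustration only: the output format of `stella promotion list` is not documented here, so the tab-separated `<promotion-id><TAB><actor>` input, the function name `flag_busy_actors`, and the threshold are all assumptions to adapt to the real output.

```bash
#!/usr/bin/env bash
# ASSUMPTION: input is one promotion per line, "<promotion-id><TAB><actor>".
# The real `stella promotion list` output may look different.

# flag_busy_actors <threshold>
# Reads promotions on stdin and prints "<count> <actor>" for every actor
# with at least <threshold> promotions, busiest first.
flag_busy_actors() {
  local threshold="$1"
  awk -F'\t' -v min="$threshold" '
    NF >= 2 { count[$2]++ }
    END {
      for (actor in count)
        if (count[actor] >= min) print count[actor], actor
    }
  ' | sort -k1,1nr -k2,2
}

# Example with made-up data:
#   printf 'p1\tci-bot\np2\tci-bot\np3\talice\n' | flag_busy_actors 2
# prints:
#   2 ci-bot
```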
---

## Resolution

### Immediate mitigation

1. **Request temporary quota increase:**
   ```bash
   stella orch quota request-increase --amount 50 --reason "Release deadline"
   ```

2. **Prioritize critical promotions:**
   ```bash
   stella promotion priority set <promotion-id> high
   ```

3. **Cancel unnecessary queued promotions:**
   ```bash
   stella promotion list --status queued
   stella promotion cancel <promotion-id>
   ```

### Root cause fix

**If legitimate high volume:**

1. Increase quota limits:
   ```bash
   stella orch quota limits set --daily 200 --hourly 50
   ```

2. Increase per-environment limits:
   ```bash
   stella orch quota limits set --env production --daily 50
   ```

**If runaway automation:**

1. Identify the source:
   ```bash
   stella promotion list --last 1h --by-actor --verbose
   ```

2. Revoke or rate-limit the service account:
   ```bash
   stella auth rate-limit set <service-account> --promotions-per-hour 10
   ```

3. Fix the automation bug

**If promotion retries causing spike:**

1. Check for failing promotions causing retries:
   ```bash
   stella promotion list --status failed --last 24h
   ```

2. Fix underlying promotion failures (see other runbooks)

3. Configure retry limits:
   ```bash
   stella orch config set promotion.max_retries 3
   stella orch config set promotion.retry_backoff 5m
   ```

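Client-side automation that triggers promotions can apply the same idea of a bounded retry count with a pause between attempts. The sketch below is a generic bash helper, not the orchestrator's own retry logic: the name `retry_with_limit` and the fixed (non-exponential) delay are assumptions, and how the orchestrator interprets `promotion.retry_backoff` is not specified in this runbook.

```bash
#!/usr/bin/env bash
# retry_with_limit <max_retries> <delay_seconds> <command> [args...]
# Hypothetical helper: runs the command once, then retries up to
# <max_retries> more times on failure, sleeping <delay_seconds> between
# attempts. Returns 0 on the first success, otherwise the last exit status.
retry_with_limit() {
  local max_retries="$1" delay="$2"
  shift 2
  local attempt=0 status
  while true; do
    status=0
    "$@" || status=$?          # capture the failure without tripping `set -e`
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_retries" ]; then
      return "$status"         # retry budget spent: give up, do not loop forever
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: at most 3 retries, 300 seconds (5m) apart.
#   retry_with_limit 3 300 some-promotion-command
```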
**If quota too restrictive for workload:**

1. Analyze actual promotion patterns:
   ```bash
   stella orch quota analyze --last 30d
   ```

2. Adjust quotas based on analysis:
   ```bash
   stella orch quota limits set --daily <recommended>
   ```

### Verification

```bash
# Check quota status
stella orch quota status

# Verify promotions processing
stella promotion list --status in_progress

# Test new promotion
stella promotion create --test --dry-run

# Check no quota errors
stella orch logs --filter "quota" --level error --last 30m
```

---

## Prevention

- [ ] **Monitoring:** Alert at 80% quota usage
- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
- [ ] **Review:** Regularly review and adjust quotas based on usage patterns

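The 80% monitoring threshold amounts to a simple percentage check. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the `used` and `limit` values would have to come from whatever the quota status command or metrics endpoint actually reports.

```bash
#!/usr/bin/env bash
# quota_usage_pct <used> <limit>
# Prints quota usage as an integer percentage, rounded down.
quota_usage_pct() {
  local used="$1" limit="$2"
  if [ "$limit" -le 0 ]; then
    echo "limit must be positive" >&2
    return 2
  fi
  echo $(( used * 100 / limit ))
}

# quota_alert_needed <used> <limit> [threshold_pct]
# Exits 0 when usage is at or above the threshold (default 80),
# 1 when below it, and 2 when the limit is invalid.
quota_alert_needed() {
  local pct
  pct="$(quota_usage_pct "$1" "$2")" || return 2
  [ "$pct" -ge "${3:-80}" ]
}

# Example: 160 of 200 daily promotions used is exactly 80%, so this alerts.
#   quota_alert_needed 160 200 && echo "quota usage at or above 80%"
```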
---

## Related Resources

- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`
- **Quota management:** `docs/operations/quota-management.md`