synergy moats product advisory implementations
docs/operations/guides/auditor-guide.md
@@ -0,0 +1,256 @@
# Auditor Guide

> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
> **Task:** AUD-007 - Documentation

This guide is for external auditors reviewing Stella Ops release evidence.

## Overview

Stella Ops generates comprehensive, tamper-evident audit bundles that contain all evidence required to verify release decisions. This guide explains how to interpret and verify these bundles.

## Receiving an Audit Bundle

Audit bundles may be delivered as:
- **Directory:** A folder containing all evidence files
- **Archive:** A `.tar.gz` or `.zip` file

### Extracting Archives

```bash
# tar.gz
tar -xzf audit-bundle-sha256-abc123.tar.gz

# zip
unzip audit-bundle-sha256-abc123.zip
```

## Bundle Structure

```
audit-bundle-<digest>-<timestamp>/
├── manifest.json    # Integrity manifest
├── README.md        # Quick reference
├── verdict/         # Release decision
├── evidence/        # Supporting evidence
├── policy/          # Policy configuration
└── replay/          # Verification instructions
```

## Step 1: Verify Bundle Integrity

Before reviewing the contents, verify that the bundle has not been tampered with.

### Using Stella CLI

```bash
stella audit verify ./audit-bundle-sha256-abc123/
```

Expected output:
```
✓ Verified 15/15 files
✓ Integrity hash verified
✓ Bundle integrity verified
```

### Manual Verification

1. Open `manifest.json`
2. For each file listed, compute its SHA-256 digest and compare it with the value recorded in the manifest:
   ```bash
   sha256sum verdict/verdict.json
   ```
3. Recompute the `integrityHash` from the per-file hashes and compare it with the value in `manifest.json`

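The per-file comparison in step 2 can be scripted. The following is a minimal sketch rather than the Stella CLI's implementation. It assumes `manifest.json` has a top-level `files` array whose entries carry `path` and `sha256` fields; those field names are hypothetical, so adjust them to the actual manifest schema. It does not recompute `integrityHash`, because this guide does not specify how that hash is constructed.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(bundle_dir: str) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    Assumed (hypothetical) manifest layout:
    {"files": [{"path": "verdict/verdict.json", "sha256": "<hex>"}]}
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for entry in manifest["files"]:
        target = root / entry["path"]
        # Tolerate an optional "sha256:" prefix on the recorded digest.
        expected = entry["sha256"].removeprefix("sha256:").lower()
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(entry["path"])
    return mismatches
```

An empty list from `find_mismatches("./audit-bundle-sha256-abc123")` would mean every listed file matched its recorded digest under these assumptions.
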
## Step 2: Review the Verdict

The verdict is the official release decision.

### verdict/verdict.json

```json
{
  "artifactDigest": "sha256:abc123...",
  "decision": "PASS",
  "timestamp": "2026-01-17T10:25:00Z",
  "gates": [
    {
      "gateId": "sbom-required",
      "status": "PASS",
      "reason": "Valid CycloneDX SBOM present"
    },
    {
      "gateId": "vex-trust",
      "status": "PASS",
      "reason": "Trust score 0.85 >= 0.70 threshold"
    }
  ]
}
```

### Decision Values

| Decision | Meaning |
|----------|---------|
| `PASS` | All gates passed, artifact approved for deployment |
| `BLOCKED` | One or more gates failed, artifact not approved |
| `PENDING` | Evaluation incomplete, awaiting additional evidence |

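The decision table implies a consistency rule an auditor can check mechanically: a `PASS` decision should not coexist with a gate that did not pass. The sketch below applies that rule to a parsed `verdict.json`. It assumes the field names shown in the sample above and treats any gate `status` other than `PASS` as "not passed", since this guide does not list the other status values.

```python
def check_verdict(verdict: dict) -> list[str]:
    """Return human-readable inconsistencies between decision and gates."""
    problems = []
    decision = verdict.get("decision")
    gates = verdict.get("gates", [])
    # Assumption: any status other than "PASS" means the gate did not pass.
    not_passed = [g.get("gateId", "<unknown>") for g in gates if g.get("status") != "PASS"]
    if decision not in {"PASS", "BLOCKED", "PENDING"}:
        problems.append(f"unexpected decision value: {decision!r}")
    if decision == "PASS" and not_passed:
        problems.append("decision is PASS but gates did not pass: " + ", ".join(not_passed))
    if decision == "BLOCKED" and gates and not not_passed:
        problems.append("decision is BLOCKED but every listed gate passed")
    return problems
```

Load the file with `json.load` and pass the resulting dictionary to `check_verdict`; an empty list means no inconsistency was found by these checks.
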
### verdict/verdict.dsse.json

This file contains the cryptographically signed verdict envelope (DSSE format). Verify signatures using:

```bash
stella audit verify ./bundle/ --check-signatures
```

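For auditors who want to inspect the envelope by hand, it helps to know that DSSE signatures cover a pre-authentication encoding (PAE) of the payload type and payload, not the raw payload alone. The sketch below decodes an envelope and rebuilds those bytes. It assumes the file uses the standard DSSE fields (`payloadType`, base64 `payload`, `signatures[].keyid`); whether Stella Ops adds other fields is not stated here. Checking the signature itself still requires the signer's public key and a cryptography library, which this sketch leaves out.

```python
import base64
import json


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def load_envelope(envelope_json: str) -> tuple[bytes, bytes, list[str]]:
    """Return (decoded payload, bytes to verify, key IDs) for a DSSE envelope."""
    envelope = json.loads(envelope_json)
    # Standard base64 is assumed; use urlsafe_b64decode if the producer
    # emits the URL-safe alphabet.
    payload = base64.b64decode(envelope["payload"])
    key_ids = [sig.get("keyid", "") for sig in envelope.get("signatures", [])]
    return payload, pae(envelope["payloadType"], payload), key_ids
```

The decoded payload can then be compared with `verdict/verdict.json` to confirm that the signed content matches the verdict under review.
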
## Step 3: Review Evidence

### evidence/sbom.json

Software Bill of Materials (SBOM) listing all components in the artifact.

**Key fields:**
- `components[]` - List of all software components
- `dependencies[]` - Dependency relationships
- `metadata.timestamp` - When the SBOM was generated

### evidence/vex-statements/

Vulnerability Exploitability eXchange (VEX) statements that justify vulnerability assessments.

**index.json:**
```json
{
  "statementCount": 3,
  "statements": [
    {"fileName": "vex-001.json", "source": "vendor-security"},
    {"fileName": "vex-002.json", "source": "internal-analysis"}
  ]
}
```

Each VEX statement explains why a vulnerability does or does not affect this artifact.

### evidence/reachability/analysis.json

Reachability analysis showing which vulnerabilities are actually reachable in the code.

```json
{
  "components": [
    {
      "purl": "pkg:npm/lodash@4.17.21",
      "vulnerabilities": [
        {
          "id": "CVE-2021-23337",
          "reachable": false,
          "reason": "Vulnerable function not in call graph"
        }
      ]
    }
  ]
}
```

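When the analysis covers many components, it can be quicker to extract only the findings marked reachable. This is a small sketch that assumes the structure shown in the sample above (`components[].purl` and `components[].vulnerabilities[].reachable`); real files may carry additional fields.

```python
def reachable_findings(analysis: dict) -> list[tuple[str, str]]:
    """Return (purl, vulnerability id) pairs flagged as reachable."""
    findings = []
    for component in analysis.get("components", []):
        for vuln in component.get("vulnerabilities", []):
            if vuln.get("reachable"):
                findings.append((component.get("purl", ""), vuln.get("id", "")))
    return findings
```

For the sample above the result would be an empty list, because the only listed vulnerability has `"reachable": false`.
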
## Step 4: Review Policy

### policy/policy-snapshot.json

The policy configuration used for evaluation:

```json
{
  "policyVersion": "v2.3.1",
  "gates": ["sbom-required", "vex-trust", "cve-threshold"],
  "thresholds": {
    "vexTrustScore": 0.70,
    "maxCriticalCves": 0,
    "maxHighCves": 5
  }
}
```

### policy/gate-decision.json

Detailed breakdown of each gate evaluation:

```json
{
  "gates": [
    {
      "gateId": "vex-trust",
      "decision": "PASS",
      "inputs": {
        "vexStatements": 3,
        "trustScore": 0.85,
        "threshold": 0.70
      }
    }
  ]
}
```

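The recorded inputs make it possible to re-derive a gate's outcome independently. The sketch below does this for the `vex-trust` gate only, assuming the rule implied by the verdict reason shown earlier ("Trust score 0.85 >= 0.70 threshold"): the gate passes when `trustScore` is greater than or equal to `threshold`. Other gates will have their own rules, which this guide does not spell out.

```python
def recheck_vex_trust(gate: dict) -> bool:
    """Return True when the recorded decision agrees with trustScore >= threshold."""
    inputs = gate["inputs"]
    should_pass = inputs["trustScore"] >= inputs["threshold"]
    # Only "PASS" is documented; any other decision is treated as not passing.
    return (gate["decision"] == "PASS") == should_pass
```

A `False` result means the recorded decision contradicts the recorded inputs and is worth raising with the organization's security team.
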
## Step 5: Replay Verification (Optional)

For maximum assurance, you can replay the verdict evaluation.

### Using Stella CLI

```bash
cd audit-bundle-sha256-abc123/
stella replay snapshot --manifest replay/knowledge-snapshot.json
```

This re-evaluates the policy using the frozen inputs and should produce an identical verdict.

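If the replayed verdict is available as JSON, it can be compared with the original field by field. The sketch below is one way to do that. It assumes the replay output has the same shape as `verdict/verdict.json`, and it ignores `timestamp` by default on the assumption that a replay records its own evaluation time; both points are assumptions, not documented behavior.

```python
def verdict_differences(original: dict, replayed: dict, ignore=("timestamp",)) -> list[str]:
    """List differences between two verdicts, comparing gates by gateId."""
    diffs = []
    for key in sorted(set(original) | set(replayed)):
        if key in ignore or key == "gates":
            continue
        if original.get(key) != replayed.get(key):
            diffs.append(f"{key}: {original.get(key)!r} != {replayed.get(key)!r}")
    # Compare gate outcomes independently of their order in the file.
    orig_gates = {g["gateId"]: g["status"] for g in original.get("gates", [])}
    new_gates = {g["gateId"]: g["status"] for g in replayed.get("gates", [])}
    for gate_id in sorted(set(orig_gates) | set(new_gates)):
        if orig_gates.get(gate_id) != new_gates.get(gate_id):
            diffs.append(f"gate {gate_id}: {orig_gates.get(gate_id)!r} != {new_gates.get(gate_id)!r}")
    return diffs
```

An empty list indicates the replay reproduced the decision and every gate status.
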
### Manual Replay Steps

See `replay/replay-instructions.md` for detailed steps.

## Compliance Mapping

| Compliance Framework | Relevant Bundle Components |
|---------------------|---------------------------|
| **SOC 2 (CC7.1)** | verdict/, policy/ |
| **ISO 27001 (A.12.6)** | evidence/sbom.json |
| **FedRAMP** | All components |
| **SLSA Level 3** | evidence/provenance/ |

## Common Questions

### Q: Why was this artifact blocked?

Review `policy/gate-decision.json` for the specific gate that failed and its reason.

### Q: How do I verify the SBOM is accurate?

The SBOM digest is included in the manifest. To assess accuracy, compare the SBOM against the output of the organization's SBOM generation process.

### Q: What if replay produces a different result?

This may indicate:
1. Policy version mismatch
2. Missing evidence files
3. Time-dependent policy rules

Contact the organization's security team for clarification.

### Q: How long should audit bundles be retained?

Stella Ops recommends:
- Production releases: 5 years minimum
- Security-critical systems: 7 years
- Regulated industries: Per compliance requirements

## Support

For questions about this audit bundle:
1. Contact the organization's Stella Ops administrator
2. Reference the Bundle ID from `manifest.json`
3. Include the artifact digest

---

_Last updated: 2026-01-17 (UTC)_

docs/operations/runbooks/COVERAGE.md
@@ -0,0 +1,112 @@
# Runbook Coverage Tracking

This document tracks operational runbook coverage across Stella Ops modules.

**Target:** 80% coverage of critical failure modes before declaring operability moat achieved.

---

## Coverage Summary

| Module | Critical Failures | Runbooks | Coverage | Status |
|--------|-------------------|----------|----------|--------|
| Scanner | 5 | 0 | 0% | 🔴 Gap |
| Policy Engine | 5 | 0 | 0% | 🔴 Gap |
| Release Orchestrator | 5 | 0 | 0% | 🔴 Gap |
| Attestor | 5 | 0 | 0% | 🔴 Gap |
| Feed Connectors | 4 | 0 | 0% | 🔴 Gap |
| **Database (Postgres)** | 4 | 4 | 100% | ✅ Complete |
| **Crypto Subsystem** | 4 | 4 | 100% | ✅ Complete |
| **Evidence Locker** | 4 | 4 | 100% | ✅ Complete |
| **Backup/Restore** | 4 | 4 | 100% | ✅ Complete |
| Authority (OAuth/OIDC) | 3 | 0 | 0% | 🔴 Gap |
| **Overall** | **43** | **16** | **37%** | 🟡 In Progress |

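The overall figure can be reproduced from the per-module rows. The sketch below recomputes it and estimates how many additional runbooks the 80% target requires. It assumes coverage is counted as runbooks divided by critical failure modes, with one runbook covering one failure mode, which is how the table reads but is not stated explicitly.

```python
# (critical failure modes, runbooks) per module, copied from the table above.
MODULES = {
    "Scanner": (5, 0),
    "Policy Engine": (5, 0),
    "Release Orchestrator": (5, 0),
    "Attestor": (5, 0),
    "Feed Connectors": (4, 0),
    "Database (Postgres)": (4, 4),
    "Crypto Subsystem": (4, 4),
    "Evidence Locker": (4, 4),
    "Backup/Restore": (4, 4),
    "Authority (OAuth/OIDC)": (3, 0),
}


def overall(modules: dict) -> tuple[int, int, int]:
    """Return (total critical failures, total runbooks, coverage percent)."""
    critical = sum(c for c, _ in modules.values())
    runbooks = sum(r for _, r in modules.values())
    percent = round(100 * runbooks / critical) if critical else 0
    return critical, runbooks, percent


def runbooks_needed(modules: dict, target_percent: int = 80) -> int:
    """Additional runbooks required to reach the target coverage."""
    critical, runbooks, _ = overall(modules)
    required = -(-target_percent * critical // 100)  # ceiling division
    return max(0, required - runbooks)
```

With the numbers above, 16 of 43 failure modes are covered (37%), and reaching 80% means covering 35 of them, so 19 more runbooks under this counting assumption.
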
---

## Available Runbooks

### Database Operations
- [postgres-ops.md](postgres-ops.md) - PostgreSQL database operations

### Crypto Subsystem
- [crypto-ops.md](crypto-ops.md) - Regional crypto operations (FIPS, eIDAS, GOST, SM)

### Evidence Locker
- [evidence-locker-ops.md](evidence-locker-ops.md) - Evidence locker operations

### Backup/Restore
- [backup-restore-ops.md](backup-restore-ops.md) - Backup and restore procedures

### Vulnerability Operations
- [vuln-ops.md](vuln-ops.md) - Vulnerability management operations

### VEX Operations
- [vex-ops.md](vex-ops.md) - VEX statement operations

### Policy Incidents
- [policy-incident.md](policy-incident.md) - Policy-related incident response

---

## Gap Analysis

### High Priority Gaps (Critical modules without runbooks)

1. **Scanner** - Core scanning functionality
   - Worker stuck
   - OOM on large images
   - Registry auth failures

2. **Policy Engine** - Policy evaluation
   - Slow evaluation
   - OPA crashes
   - Compilation failures

3. **Release Orchestrator** - Promotion workflow
   - Stuck promotions
   - Gate timeouts
   - Missing evidence

### Medium Priority Gaps

4. **Attestor** - Signing and verification
   - Signing failures
   - Key expiration
   - Rekor unavailability

5. **Feed Connectors** - Advisory feeds
   - NVD failures
   - Rate limiting
   - Offline bundle issues

### Lower Priority Gaps

6. **Authority** - Authentication
   - Token validation failures
   - OIDC provider issues

---

## Template

New runbooks should use the template: [_template.md](_template.md)

---

## Doctor Check Integration

Runbooks should be linked from Doctor check output. Current integration status:

| Module | Doctor Checks | Linked to Runbook |
|--------|---------------|-------------------|
| Postgres | 4 | 0 |
| Crypto | 8 | 0 |
| Storage | 3 | 0 |
| Evidence | 4 | 0 |

**Next step:** Update Doctor check implementations to include runbook links in remediation output.

---

_Last updated: 2026-01-17 (UTC)_

docs/operations/runbooks/_template.md
@@ -0,0 +1,157 @@
# Runbook: [Component] - [Failure Scenario]

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-001 - Runbook Template

## Metadata

| Field | Value |
|-------|-------|
| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
| **Severity** | Critical / High / Medium / Low |
| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
| **Last updated** | [YYYY-MM-DD] |
| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |

---

## Symptoms

Observable indicators that this failure is occurring:

- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |

---

## Diagnosis

### Quick checks (< 2 minutes)

Run these first to confirm the failure:

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check [relevant-check-id]
   ```

2. **Check service status:**
   ```bash
   stella [component] status
   ```

3. **Check recent logs:**
   ```bash
   stella [component] logs --tail 50 --level error
   ```

### Deep diagnosis (if quick checks are inconclusive)

1. **[Investigation step 1]:**
   ```bash
   [command]
   ```
   Expected output: [description]
   If unexpected: [what it means]

2. **[Investigation step 2]:**
   ```bash
   [command]
   ```

3. **Check related services:**
   - Postgres connectivity: `stella doctor --check check.storage.postgres`
   - Valkey connectivity: `stella doctor --check check.storage.valkey`
   - Network connectivity: `stella doctor --check check.network.[target]`

---

## Resolution

### Immediate mitigation (restore service quickly)

Use these steps to restore service, even if the root cause isn't fixed yet:

1. **[Mitigation step 1]:**
   ```bash
   [command]
   ```
   This will: [explanation]

2. **[Mitigation step 2]:**
   ```bash
   [command]
   ```

### Root cause fix

Once service is restored, address the underlying issue:

1. **[Fix step 1]:**
   ```bash
   [command]
   ```

2. **[Fix step 2]:**
   ```bash
   [command]
   ```

3. **Verify fix is complete:**
   ```bash
   stella doctor --check [relevant-check-id]
   ```

### Verification

Confirm the issue is fully resolved:

```bash
# Re-run the failing operation
stella [component] [test-command]

# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m

# Verify no new errors in logs
stella [component] logs --tail 20 --level error
```

---

## Prevention

How to prevent this failure from recurring:

- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
- [ ] **Documentation:** [e.g., "Update capacity planning guide"]

---

## Related Resources

- **Architecture doc:** [Link to relevant architecture documentation]
- **Related runbooks:** [Links to related failure scenarios]
- **Doctor check source:** [Link to Doctor check implementation]
- **Grafana dashboard:** [Link to relevant dashboard]

---

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | Initial version |

docs/operations/runbooks/attestor-hsm-connection.md
@@ -0,0 +1,193 @@
# Runbook: Attestor - HSM Connection Issues

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Attestor / Cryptography |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.crypto.hsm-availability` |

---

## Symptoms

- [ ] Signing operations failing with "HSM unavailable"
- [ ] Alert `AttestorHsmConnectionFailed` firing
- [ ] Error: "PKCS#11 operation failed" or "HSM session timeout"
- [ ] Attestations cannot be created
- [ ] Key operations (sign, verify) failing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No attestations can be signed; releases blocked |
| **Data integrity** | Keys are safe in HSM; operations resume when connection restored |
| **SLA impact** | All signing operations blocked; compliance posture at risk |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.crypto.hsm-availability
   ```

2. **Check HSM connection status:**
   ```bash
   stella crypto hsm status
   ```

3. **Test HSM connectivity:**
   ```bash
   stella crypto hsm test
   ```

### Deep diagnosis

1. **Check PKCS#11 library status:**
   ```bash
   stella crypto hsm pkcs11-status
   ```
   Look for: Library loaded, slot available, session active

2. **Check HSM network connectivity:**
   ```bash
   stella crypto hsm ping
   ```

3. **Check HSM session logs:**
   ```bash
   stella crypto hsm logs --last 30m
   ```
   Look for: Session errors, timeout, authentication failures

4. **Check HSM slot status:**
   ```bash
   stella crypto hsm slots list
   ```
   Problem if: Slot not found, slot busy, token not present

---

## Resolution

### Immediate mitigation

1. **Attempt HSM reconnection:**
   ```bash
   stella crypto hsm reconnect
   ```

2. **If HSM unreachable, switch to software signing (if permitted):**
   ```bash
   stella attest config set signing.mode software
   stella attest reload
   ```
   **Warning:** Software signing may not meet compliance requirements

3. **Use backup HSM if configured:**
   ```bash
   stella crypto hsm failover --to backup
   ```

### Root cause fix

**If network connectivity issue:**

1. Check HSM network path:
   ```bash
   stella crypto hsm connectivity --verbose
   ```

2. Verify firewall rules allow HSM port (typically 1792 for Luna, 2225 for SafeNet)

3. Check HSM server status with vendor tools

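When vendor tooling is not at hand, a plain TCP connection attempt can help separate a firewall or routing problem from an HSM-level fault. The sketch below only checks that a TCP handshake completes; it does not speak any HSM protocol and says nothing about PKCS#11 session health. The host name in the usage note is a placeholder, and the port should be whatever the HSM deployment actually uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("hsm.internal.example.com", 1792)` returning `False` from the Attestor host while it returns `True` from another network segment would point towards a firewall rule rather than the HSM itself.
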
**If session timeout:**

1. Increase session timeout:
   ```bash
   stella crypto hsm config set session.timeout 300s
   stella crypto hsm reconnect
   ```

2. Enable session keep-alive:
   ```bash
   stella crypto hsm config set session.keepalive true
   stella crypto hsm config set session.keepalive_interval 60s
   ```

**If authentication failed:**

1. Verify HSM credentials:
   ```bash
   stella crypto hsm auth verify
   ```

2. Update HSM PIN if changed:
   ```bash
   stella crypto hsm auth update --slot <slot-id>
   ```

**If PKCS#11 library issue:**

1. Verify library path:
   ```bash
   stella crypto hsm config get pkcs11.library_path
   ```

2. Reload PKCS#11 library:
   ```bash
   stella crypto hsm pkcs11-reload
   ```

3. Check library compatibility:
   ```bash
   stella crypto hsm pkcs11-info
   ```

### Verification

```bash
# Test HSM connectivity
stella crypto hsm test

# Test signing operation
stella attest test-sign

# Verify key access
stella keys verify <key-id> --operation sign

# Check no errors in logs
stella crypto hsm logs --level error --last 30m
```

---

## Prevention

- [ ] **Redundancy:** Configure backup HSM for failover
- [ ] **Monitoring:** Alert on HSM connection failures immediately
- [ ] **Keep-alive:** Enable session keep-alive to prevent timeouts
- [ ] **Testing:** Include HSM health in regular health checks

---

## Related Resources

- **Architecture:** `docs/modules/cryptography/hsm-integration.md`
- **Related runbooks:** `attestor-signing-failed.md`, `crypto-ops.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Crypto/`
- **HSM setup:** `docs/operations/hsm-configuration.md`

docs/operations/runbooks/attestor-key-expired.md
@@ -0,0 +1,190 @@
# Runbook: Attestor - Signing Key Expired

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.key-expiration` |

---

## Symptoms

- [ ] Attestation creation failing with "key expired" error
- [ ] Alert `AttestorKeyExpired` firing
- [ ] Error: "signing key certificate has expired"
- [ ] New attestations cannot be created
- [ ] Verification of new attestations failing

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No new attestations can be signed; releases blocked |
| **Data integrity** | Existing attestations remain valid; new ones cannot be created |
| **SLA impact** | Release SLO violated; compliance posture compromised |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.attestor.key-expiration
   ```

2. **List signing keys and expiration:**
   ```bash
   stella keys list --type signing --show-expiration
   ```
   Look for: Keys with status "expired" or expiring soon

3. **Check active signing key:**
   ```bash
   stella attest config get signing.key_id
   stella keys show <key-id> --details
   ```

### Deep diagnosis

1. **Check certificate chain validity:**
   ```bash
   stella crypto cert verify-chain --key <key-id>
   ```
   Problem if: Any certificate in chain expired

2. **Check for backup keys:**
   ```bash
   stella keys list --type signing --status inactive
   ```
   Look for: Unexpired backup keys that can be activated

3. **Check key rotation history:**
   ```bash
   stella keys rotation-history --key <key-id>
   ```

---

## Resolution

### Immediate mitigation

1. **If backup key available, activate it:**
   ```bash
   stella keys activate <backup-key-id>
   stella attest config set signing.key_id <backup-key-id>
   stella attest reload
   ```

2. **Verify signing works:**
   ```bash
   stella attest test-sign
   ```

3. **Retry failed attestations:**
   ```bash
   stella attest retry --failed --last 1h
   ```

### Root cause fix

**Generate new signing key:**

1. Generate new key pair:
   ```bash
   stella keys generate \
     --type signing \
     --algorithm ecdsa-p256 \
     --validity 365d \
     --name "signing-key-$(date +%Y%m%d)"
   ```

2. If using HSM:
   ```bash
   stella keys generate \
     --type signing \
     --algorithm ecdsa-p256 \
     --validity 365d \
     --hsm-slot <slot> \
     --name "signing-key-$(date +%Y%m%d)"
   ```

3. Register the new key:
   ```bash
   stella keys register <new-key-id> --purpose attestation-signing
   ```

4. Update signing configuration:
   ```bash
   stella attest config set signing.key_id <new-key-id>
   stella attest reload
   ```

5. Publish new public key to trust anchors:
   ```bash
   stella issuer keys publish <new-key-id>
   ```

**Configure automatic rotation:**

1. Enable auto-rotation:
   ```bash
   stella keys config set rotation.auto true
   stella keys config set rotation.before_expiry 30d
   stella keys config set rotation.overlap_days 14
   ```

2. Set up rotation alerts:
   ```bash
   stella keys config set alerts.expiring_days 30
   stella keys config set alerts.expiring_days_critical 7
   ```

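The two alert thresholds above divide a key's remaining lifetime into bands. The sketch below shows one way to classify a key against them. It is illustrative only: the function name and the exact boundary handling (a key with exactly 7 days left counts as critical) are assumptions, not the behavior of `stella keys`.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(expires_at: datetime, now: datetime,
                  warning_days: int = 30, critical_days: int = 7) -> str:
    """Classify a key as expired, critical, warning, or ok."""
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=critical_days):
        return "critical"
    if remaining <= timedelta(days=warning_days):
        return "warning"
    return "ok"
```

Passing timezone-aware UTC timestamps for both arguments, for example `datetime.now(timezone.utc)` for `now`, avoids mixing naive and aware datetimes.
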
### Verification

```bash
# Verify new key is active
stella keys list --type signing --status active

# Test signing
stella attest test-sign

# Create test attestation
stella attest create --type test --subject "test:key-rotation"

# Verify the attestation
stella verify attestation --last

# Check key expiration
stella keys show <new-key-id> --details | grep -i expir
```

---

## Prevention

- [ ] **Rotation:** Enable automatic key rotation 30 days before expiry
- [ ] **Monitoring:** Alert on keys expiring within 30 days (warning) and 7 days (critical)
- [ ] **Backup:** Maintain at least one backup signing key
- [ ] **Documentation:** Document key rotation procedures and approval process

---

## Related Resources

- **Architecture:** `docs/modules/attestor/architecture.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-hsm-connection.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
- **Key management:** `docs/operations/key-management.md`

docs/operations/runbooks/attestor-rekor-unavailable.md
@@ -0,0 +1,184 @@
# Runbook: Attestor - Rekor Transparency Log Unreachable

> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks

## Metadata

| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.rekor-connectivity` |

---

## Symptoms

- [ ] Attestation transparency logging failing
- [ ] Alert `AttestorRekorUnavailable` firing
- [ ] Error: "Rekor server unavailable" or "transparency log submission failed"
- [ ] Attestations created but not anchored to transparency log
- [ ] Verification failing due to missing log entry

---

## Impact

| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Attestations not publicly verifiable via transparency log |
| **Data integrity** | Attestations still valid locally; transparency reduced |
| **SLA impact** | Compliance may require transparency log anchoring |

---

## Diagnosis

### Quick checks

1. **Check Doctor diagnostics:**
   ```bash
   stella doctor --check check.attestor.rekor-connectivity
   ```

2. **Check Rekor connectivity:**
   ```bash
   stella attest rekor status
   ```

3. **Test Rekor endpoint:**
   ```bash
   stella attest rekor ping
   ```

### Deep diagnosis

1. **Check Rekor server URL:**
   ```bash
   stella attest config get rekor.url
   ```
   Default: https://rekor.sigstore.dev

2. **Check for public Rekor outage:**
   ```bash
   stella attest rekor api-status
   ```
   Also check: https://status.sigstore.dev/

3. **Check network/proxy issues:**
   ```bash
   stella attest rekor test --verbose
   ```
   Look for: TLS errors, proxy blocks, timeout

4. **Check pending log entries:**
   ```bash
   stella attest rekor pending-entries
   ```

---

## Resolution

### Immediate mitigation

1. **Queue attestations for later submission:**
   ```bash
   stella attest config set rekor.queue_on_failure true
   stella attest reload
   ```

2. **Disable Rekor requirement temporarily:**
   ```bash
   stella attest config set rekor.required false
   stella attest reload
   ```
   **Warning:** Reduces transparency guarantees

3. **Use private Rekor instance if available:**
   ```bash
   stella attest config set rekor.url https://rekor.internal.example.com
   stella attest reload
   ```

### Root cause fix

**If public Rekor outage:**

1. Wait for Sigstore to resolve the issue
2. Check status at https://status.sigstore.dev/
3. Process queued entries when service recovers:
   ```bash
   stella attest rekor process-queue
   ```

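Conceptually, queue-on-failure followed by a later `process-queue` run behaves like a retry buffer: submissions that fail while the log is unreachable are held in order and drained once it recovers. The sketch below models that pattern in isolation. It is not the Attestor's implementation; the class name, the use of `ConnectionError` to signal an outage, and the in-memory queue are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable


class SubmissionQueue:
    """Hold entries that could not be submitted and drain them later, in order."""

    def __init__(self, submit: Callable[[str], None]):
        self._submit = submit  # hypothetical function that sends one entry
        self._pending = deque()

    def submit(self, entry: str) -> bool:
        """Try to submit now; queue the entry if the log is unreachable."""
        try:
            self._submit(entry)
            return True
        except ConnectionError:
            self._pending.append(entry)
            return False

    def process_queue(self) -> int:
        """Drain queued entries, stopping at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            try:
                self._submit(self._pending[0])
            except ConnectionError:
                break  # still unreachable; keep the remaining entries queued
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> list[str]:
        return list(self._pending)
```

Stopping at the first failure keeps entries in their original order, which matters if downstream verification expects log entries to follow creation order.
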
**If network/firewall issue:**

1. Verify outbound HTTPS to rekor.sigstore.dev:
   ```bash
   stella attest rekor connectivity --verbose
   ```

2. Configure proxy if required:
   ```bash
   stella attest config set rekor.proxy https://proxy:8080
   ```

3. Add Rekor endpoints to firewall allowlist:
   - rekor.sigstore.dev:443
   - fulcio.sigstore.dev:443 (for certificate issuance)

**If TLS certificate issue:**

1. Check certificate validity:
   ```bash
   stella attest rekor cert-check
   ```

2. Update CA certificates:
   ```bash
   stella crypto ca update
   ```

**If private Rekor instance issue:**

1. Check private Rekor server status
2. Verify Rekor database health
3. Check Rekor signer availability

### Verification

```bash
# Test Rekor connectivity
stella attest rekor ping

# Submit test entry
stella attest rekor test-submit

# Process any queued entries
stella attest rekor process-queue

# Verify recent attestation in log
stella attest rekor lookup --attestation <attestation-id>
```

---

## Prevention

- [ ] **Redundancy:** Configure private Rekor instance as fallback
- [ ] **Queuing:** Enable queue-on-failure for resilience
- [ ] **Monitoring:** Alert on Rekor submission failures
- [ ] **Offline:** Document attestation validity without Rekor for air-gap scenarios

---

## Related Resources

- **Architecture:** `docs/modules/attestor/transparency-log.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-verification-failed.md`
- **Sigstore docs:** https://docs.sigstore.dev/
- **Rekor setup:** `docs/operations/rekor-configuration.md`

176
docs/operations/runbooks/attestor-signing-failed.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# Runbook: Attestor - Signature Generation Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.signing-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation requests failing with "signing failed" error
|
||||
- [ ] Alert `AttestorSigningFailed` firing
|
||||
- [ ] Evidence bundles missing signatures
|
||||
- [ ] Metric `attestor_signing_failures_total` increasing
|
||||
- [ ] Release pipeline blocked due to unsigned attestations
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Releases blocked; attestations cannot be created |
|
||||
| **Data integrity** | Evidence is recorded but unsigned; can be signed later |
|
||||
| **SLA impact** | Release SLO violated; evidence remains unsigned until signing is restored |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.signing-health
|
||||
```
|
||||
|
||||
2. **Check attestor service status:**
|
||||
```bash
|
||||
stella attest status
|
||||
```
|
||||
|
||||
3. **Check signing key availability:**
|
||||
```bash
|
||||
stella keys list --type signing --status active
|
||||
```
|
||||
Problem if: No active signing keys
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Test signing operation:**
|
||||
```bash
|
||||
stella attest test-sign --verbose
|
||||
```
|
||||
Look for: Specific error message
|
||||
|
||||
2. **Check key material access:**
|
||||
```bash
|
||||
stella keys verify <key-id> --operation sign
|
||||
```
|
||||
|
||||
3. **If using HSM, check HSM connectivity:**
|
||||
```bash
|
||||
stella doctor --check check.crypto.hsm-availability
|
||||
```
|
||||
|
||||
4. **Check for key expiration:**
|
||||
```bash
|
||||
stella keys list --expiring-within 7d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If key expired, rotate to backup key:**
|
||||
```bash
|
||||
stella keys activate <backup-key-id>
|
||||
stella attest config set signing.key_id <backup-key-id>
|
||||
```
|
||||
|
||||
2. **If HSM unavailable, switch to software signing (temporary):**
|
||||
```bash
|
||||
stella attest config set signing.mode software
|
||||
stella attest reload
|
||||
```
|
||||
⚠️ **Warning:** Software signing may not meet compliance requirements
|
||||
|
||||
3. **Retry failed attestations:**
|
||||
```bash
|
||||
stella attest retry --failed --last 1h
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If key expired:**
|
||||
|
||||
1. Generate new signing key:
|
||||
```bash
|
||||
stella keys generate --type signing --algorithm ecdsa-p256
|
||||
```
|
||||
|
||||
2. Configure key rotation schedule:
|
||||
```bash
|
||||
stella keys config set rotation.auto true
|
||||
stella keys config set rotation.overlap_days 14
|
||||
```
|
||||
|
||||
**If HSM connection failed:**
|
||||
|
||||
1. Verify HSM configuration:
|
||||
```bash
|
||||
stella crypto hsm verify
|
||||
```
|
||||
|
||||
2. Restart HSM connection:
|
||||
```bash
|
||||
stella crypto hsm reconnect
|
||||
```
|
||||
|
||||
**If certificate chain issue:**
|
||||
|
||||
1. Verify certificate chain:
|
||||
```bash
|
||||
stella crypto cert verify-chain --key <key-id>
|
||||
```
|
||||
|
||||
2. Update intermediate certificates:
|
||||
```bash
|
||||
stella crypto cert update-chain --key <key-id>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test signing
|
||||
stella attest test-sign
|
||||
|
||||
# Create test attestation
|
||||
stella attest create --type test --subject "test:verification"
|
||||
|
||||
# Verify the attestation
|
||||
stella verify attestation --last
|
||||
|
||||
# Check no failures in recent operations
|
||||
stella attest logs --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Key rotation:** Enable automatic key rotation with 14-day overlap
|
||||
- [ ] **Monitoring:** Alert on keys expiring within 30 days
|
||||
- [ ] **Backup:** Maintain backup signing key in different HSM slot
|
||||
- [ ] **Testing:** Include signing test in health check schedule
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/architecture.md`
|
||||
- **Related runbooks:** `attestor-key-expired.md`, `attestor-hsm-connection.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
|
||||
- **Dashboard:** Grafana > Stella Ops > Attestor
|
||||
195
docs/operations/runbooks/attestor-verification-failed.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Attestor - Attestation Verification Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.verification-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation verification failing
|
||||
- [ ] Alert `AttestorVerificationFailed` firing
|
||||
- [ ] Error: "signature verification failed" or "invalid attestation"
|
||||
- [ ] Promotions blocked due to failed verification
|
||||
- [ ] Error: "trust anchor not found" or "certificate chain invalid"
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Artifacts cannot be promoted; release blocked |
|
||||
| **Data integrity** | May indicate tampered attestation or configuration issue |
|
||||
| **SLA impact** | Release pipeline blocked until resolved |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.verification-health
|
||||
```
|
||||
|
||||
2. **Verify specific attestation:**
|
||||
```bash
|
||||
stella verify attestation --attestation <attestation-id> --verbose
|
||||
```
|
||||
|
||||
3. **Check trust anchors:**
|
||||
```bash
|
||||
stella trust-anchors list
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check attestation details:**
|
||||
```bash
|
||||
stella attest show <attestation-id> --details
|
||||
```
|
||||
Look for: Signer identity, timestamp, subject
|
||||
|
||||
2. **Verify certificate chain:**
|
||||
```bash
|
||||
stella verify cert-chain --attestation <attestation-id>
|
||||
```
|
||||
Problem if: Intermediate cert missing, root not trusted
|
||||
|
||||
3. **Check public key availability:**
|
||||
```bash
|
||||
stella keys show <key-id> --public
|
||||
```
|
||||
|
||||
4. **Check if issuer is trusted:**
|
||||
```bash
|
||||
stella issuer trust-status <issuer-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If trust anchor missing, add it:**
|
||||
```bash
|
||||
stella trust-anchors add --cert <issuer-cert.pem>
|
||||
```
|
||||
|
||||
2. **If intermediate cert missing:**
|
||||
```bash
|
||||
stella trust-anchors add-intermediate --cert <intermediate.pem>
|
||||
```
|
||||
|
||||
3. **Re-verify with verbose output:**
|
||||
```bash
|
||||
stella verify attestation --attestation <attestation-id> --verbose
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If signature mismatch:**
|
||||
|
||||
1. Check attestation wasn't modified:
|
||||
```bash
|
||||
stella attest integrity-check <attestation-id>
|
||||
```
|
||||
|
||||
2. If modified, regenerate attestation:
|
||||
```bash
|
||||
stella attest create --subject <digest> --type <type> --force
|
||||
```
|
||||
|
||||
**If key rotated and old key not trusted:**
|
||||
|
||||
1. Add old public key to trust anchors:
|
||||
```bash
|
||||
stella trust-anchors add-key --key <old-key.pem> --expires <date>
|
||||
```
|
||||
|
||||
2. Or fetch from issuer directory:
|
||||
```bash
|
||||
stella issuer keys fetch <issuer-id>
|
||||
```
|
||||
|
||||
**If certificate expired:**
|
||||
|
||||
1. Check certificate validity:
|
||||
```bash
|
||||
stella verify cert --attestation <attestation-id> --show-expiry
|
||||
```
|
||||
|
||||
2. Re-sign with valid certificate:
|
||||
```bash
|
||||
stella attest resign <attestation-id>
|
||||
```
|
||||
|
||||
**If issuer not trusted:**
|
||||
|
||||
1. Verify issuer identity:
|
||||
```bash
|
||||
stella issuer show <issuer-id>
|
||||
```
|
||||
|
||||
2. Add to trusted issuers (requires approval):
|
||||
```bash
|
||||
stella issuer trust <issuer-id> --reason "Approved by security team"
|
||||
```
|
||||
|
||||
**If algorithm not supported:**
|
||||
|
||||
1. Check algorithm:
|
||||
```bash
|
||||
stella attest show <attestation-id> | grep algorithm
|
||||
```
|
||||
|
||||
2. Verify crypto provider supports algorithm:
|
||||
```bash
|
||||
stella crypto providers list --algorithms
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify attestation
|
||||
stella verify attestation --attestation <attestation-id>
|
||||
|
||||
# Verify trust chain
|
||||
stella verify cert-chain --attestation <attestation-id>
|
||||
|
||||
# Test end-to-end verification
|
||||
stella verify artifact --digest <digest>
|
||||
|
||||
# Check no verification errors
|
||||
stella attest logs --filter "verification" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Trust anchors:** Keep trust anchor list current with all valid issuer certs
|
||||
- [ ] **Key rotation:** Plan key rotation with overlap period for verification continuity
|
||||
- [ ] **Monitoring:** Alert on verification failure rate > 0
|
||||
- [ ] **Testing:** Include verification tests in release pipeline
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/verification.md`
|
||||
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-key-expired.md`
|
||||
- **Trust management:** `docs/operations/trust-anchors.md`
|
||||
449
docs/operations/runbooks/backup-restore-ops.md
Normal file
@@ -0,0 +1,449 @@
|
||||
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
# Task: RUN-004 - Backup/Restore Runbook
|
||||
# Backup and Restore Operations Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
|
||||
|
||||
---
|
||||
|
||||
## Backup Architecture Overview
|
||||
|
||||
### Backup Components
|
||||
|
||||
| Component | Backup Type | Default Schedule | Retention |
|
||||
|-----------|-------------|------------------|-----------|
|
||||
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
|
||||
| Evidence Locker | Incremental | Daily | 90 days |
|
||||
| Configuration | Snapshot | Daily + on change | 90 days |
|
||||
| Secrets | Encrypted snapshot | Daily | 30 days |
|
||||
| Attestation Keys | Encrypted export | Weekly | 1 year |
|
||||
|
||||
### Storage Locations
|
||||
|
||||
- **Primary:** `/var/lib/stellaops/backups/` (local)
|
||||
- **Secondary:** S3/Azure Blob/GCS (configurable)
|
||||
- **Offline:** Removable media for air-gap scenarios
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check backup service status
|
||||
stella backup status
|
||||
|
||||
# Verify backup storage
|
||||
stella doctor --check check.storage.backup
|
||||
|
||||
# List recent backups
|
||||
stella backup list --last 7d
|
||||
|
||||
# Test backup restore capability
|
||||
stella backup test-restore --latest --dry-run
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_backup_last_success_timestamp` - Last successful backup
|
||||
- `stella_backup_duration_seconds` - Backup duration
|
||||
- `stella_backup_size_bytes` - Backup size
|
||||
- `stella_restore_test_last_success` - Last restore test
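
`stella_backup_last_success_timestamp` is most useful when read as an age. The sketch below shows the arithmetic, assuming the metric is exported as Unix epoch seconds (the unit is not stated here) and using a hypothetical 26-hour staleness threshold for a daily schedule. The function names are illustrative.

```bash
# Whole hours elapsed between the last successful backup and "now".
# Both arguments are Unix epoch seconds; "now" defaults to the current time.
backup_age_hours() {
  last_success=$1
  now=${2:-$(date +%s)}
  echo $(( (now - last_success) / 3600 ))
}

# Return 0 if the backup is older than the threshold (default: 26 hours).
backup_is_stale() {
  threshold_hours=${3:-26}
  [ "$(backup_age_hours "$1" "$2")" -gt "$threshold_hours" ]
}
```

The threshold leaves a two-hour margin over the daily interval so that a backup which simply runs long does not count as stale.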
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Create Manual Backup
|
||||
|
||||
**When:** Before upgrades, schema changes, or major configuration changes
|
||||
**Duration:** 5-30 minutes depending on data volume
|
||||
|
||||
1. Create full system backup:
|
||||
```bash
|
||||
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
2. Or create component-specific backup:
|
||||
```bash
|
||||
# Database only
|
||||
stella backup create --type database --name "db-pre-migration"
|
||||
|
||||
# Evidence locker only
|
||||
stella backup create --type evidence --name "evidence-snapshot"
|
||||
|
||||
# Configuration only
|
||||
stella backup create --type config --name "config-backup"
|
||||
```
|
||||
|
||||
3. Verify backup:
|
||||
```bash
|
||||
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
4. Copy to offsite storage (recommended):
|
||||
```bash
|
||||
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
|
||||
```
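
Steps 1, 3, and 4 each re-evaluate `$(date +%Y%m%d)`, so a run that crosses midnight would look up a backup name that was never created. The sketch below computes the name once and reuses it. The `stella` invocations are the ones shown in the steps above, `s3://backup-bucket/` remains a placeholder, and the function names are illustrative.

```bash
# Compose a dated backup name, e.g. "pre-upgrade-20260117".
make_backup_name() {
  printf '%s-%s\n' "$1" "$(date +%Y%m%d)"
}

# Create, verify, and copy a backup under a single, stable name.
# Each step runs only if the previous one succeeded.
run_manual_backup() {
  name=$(make_backup_name "pre-upgrade")
  stella backup create --full --name "$name" &&
    stella backup verify --name "$name" &&
    stella backup copy --name "$name" --destination s3://backup-bucket/
}
```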
|
||||
|
||||
### SP-002: Verify Backup Integrity
|
||||
|
||||
**Frequency:** Weekly
|
||||
**Duration:** 15-60 minutes
|
||||
|
||||
1. List backups for verification:
|
||||
```bash
|
||||
stella backup list --unverified
|
||||
```
|
||||
|
||||
2. Verify backup integrity:
|
||||
```bash
|
||||
# Verify specific backup
|
||||
stella backup verify --name <backup-name>
|
||||
|
||||
# Verify all unverified
|
||||
stella backup verify --all-unverified
|
||||
```
|
||||
|
||||
3. Test restore (non-destructive):
|
||||
```bash
|
||||
stella backup test-restore --name <backup-name> --target /tmp/restore-test
|
||||
```
|
||||
|
||||
4. Record verification result:
|
||||
```bash
|
||||
stella backup log-verification --name <backup-name> --result success
|
||||
```
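
Independently of `stella backup verify`, a copied archive can be compared against its original by checksum. This is a sketch only, assuming the backup is a single file on disk and that `sha256sum` (GNU coreutils) is available. Both function names are illustrative.

```bash
# Print the SHA-256 digest of a file (hex digest only, no filename).
sha256_of() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Return 0 if two files have identical SHA-256 digests.
same_digest() {
  [ "$(sha256_of "$1")" = "$(sha256_of "$2")" ]
}
```

For example, `same_digest /var/lib/stellaops/backups/<file> /mnt/offsite/<file>` would confirm that an offsite copy is bit-for-bit identical; the paths are placeholders.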
|
||||
|
||||
### SP-003: Restore from Backup
|
||||
|
||||
**CAUTION: This is a destructive operation**
|
||||
|
||||
#### Full System Restore
|
||||
|
||||
1. Stop all services:
|
||||
```bash
|
||||
stella service stop --all
|
||||
```
|
||||
|
||||
2. List available backups:
|
||||
```bash
|
||||
stella backup list --type full
|
||||
```
|
||||
|
||||
3. Restore:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella backup restore --name <backup-name> --dry-run
|
||||
|
||||
# Execute restore
|
||||
stella backup restore --name <backup-name> --confirm
|
||||
```
|
||||
|
||||
4. Start services:
|
||||
```bash
|
||||
stella service start --all
|
||||
```
|
||||
|
||||
5. Verify restoration:
|
||||
```bash
|
||||
stella doctor --all
|
||||
stella service health
|
||||
```
|
||||
|
||||
#### Component-Specific Restore
|
||||
|
||||
1. Database restore:
|
||||
```bash
|
||||
stella service stop --service api,release-orchestrator
|
||||
stella backup restore --type database --name <backup-name> --confirm
|
||||
stella db migrate # Apply any pending migrations
|
||||
stella service start --service api,release-orchestrator
|
||||
```
|
||||
|
||||
2. Evidence locker restore:
|
||||
```bash
|
||||
stella backup restore --type evidence --name <backup-name> --confirm
|
||||
stella evidence verify --mode quick
|
||||
```
|
||||
|
||||
3. Configuration restore:
|
||||
```bash
|
||||
stella backup restore --type config --name <backup-name> --confirm
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
### SP-004: Point-in-Time Recovery (Database)
|
||||
|
||||
1. Identify target recovery point:
|
||||
```bash
|
||||
# List WAL archives
|
||||
stella backup wal-list --after <start-date> --before <end-date>
|
||||
```
|
||||
|
||||
2. Perform PITR:
|
||||
```bash
|
||||
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
|
||||
```
|
||||
|
||||
3. Verify data state:
|
||||
```bash
|
||||
stella db verify-integrity
|
||||
```
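
The `--to-time` value in step 2 is an ISO 8601 timestamp in UTC. The sketch below produces and sanity-checks that format before a restore is attempted. It is a shape check only (it would accept an impossible date such as month 13), and the helper names are illustrative.

```bash
# Current time in the format used above, e.g. 2026-01-17T10:30:00Z.
utc_now() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}

# Return 0 if the argument looks like YYYY-MM-DDTHH:MM:SSZ.
is_utc_timestamp() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}
```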
|
||||
|
||||
---
|
||||
|
||||
## Backup Schedules
|
||||
|
||||
### Configure Backup Schedule
|
||||
|
||||
```bash
|
||||
# View current schedule
|
||||
stella backup schedule show
|
||||
|
||||
# Set database backup schedule
|
||||
stella backup schedule set --type database --cron "0 2 * * *"
|
||||
|
||||
# Set evidence backup schedule
|
||||
stella backup schedule set --type evidence --cron "0 3 * * *"
|
||||
|
||||
# Set configuration backup schedule
|
||||
stella backup schedule set --type config --cron "0 4 * * *" --on-change
|
||||
```
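
The `--cron` values use the standard five-field format (minute, hour, day of month, month, day of week), so `0 2 * * *` means every day at 02:00. The sketch below decodes the daily expressions used above. It handles only the simple `M H * * *` shape with numeric fields that have no leading zeros, and it says nothing about which time zone the scheduler applies.

```bash
# Print "HH:MM" for a daily cron expression of the form "M H * * *".
# Returns 1 for any other shape.
daily_cron_time() {
  # read splits on spaces without glob-expanding the "*" fields.
  IFS=' ' read -r minute hour dom month dow <<EOF
$1
EOF
  # Only plain numeric minute and hour fields are handled.
  case "$minute$hour" in ''|*[!0-9]*) return 1 ;; esac
  # Every remaining field must be a wildcard for a daily schedule.
  [ "$dom" = "*" ] && [ "$month" = "*" ] && [ "$dow" = "*" ] || return 1
  printf '%02d:%02d\n' "$hour" "$minute"
}
```

With the schedules above, the database, evidence, and configuration backups decode to 02:00, 03:00, and 04:00, which staggers them by an hour each.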
|
||||
|
||||
### Retention Policy
|
||||
|
||||
```bash
|
||||
# View retention policy
|
||||
stella backup retention show
|
||||
|
||||
# Set retention
|
||||
stella backup retention set --type database --days 30
|
||||
stella backup retention set --type evidence --days 90
|
||||
stella backup retention set --type config --days 90
|
||||
|
||||
# Apply retention (cleanup old backups)
|
||||
stella backup retention apply
|
||||
```
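
Conceptually, applying a retention policy means selecting backups older than the configured number of days. The sketch below shows that selection with `find`, assuming backups are plain files under one directory (the actual layout under `/var/lib/stellaops/backups/` is not described here). It only lists candidates and deletes nothing.

```bash
# List regular files under a directory that are older than N days.
# find rounds file age down to whole days, so "+30" matches files
# that are at least 31 days old.
list_expired_backups() {
  dir=$1
  days=$2
  find "$dir" -type f -mtime +"$days"
}
```

Listing before deleting makes it possible to review what a shorter retention window would remove before running `stella backup retention apply`.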
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Backup Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaBackupFailed`
|
||||
- Missing recent backup
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check backup logs
|
||||
stella backup logs --last 24h
|
||||
|
||||
# Check disk space
|
||||
stella doctor --check check.storage.diskspace,check.storage.backup
|
||||
|
||||
# Test backup operation
|
||||
stella backup test --type database
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Disk space issue:**
|
||||
```bash
|
||||
stella backup retention apply --force
|
||||
stella backup cleanup --expired
|
||||
```
|
||||
|
||||
2. **Database connectivity:**
|
||||
```bash
|
||||
stella doctor --check check.postgres.connectivity
|
||||
```
|
||||
|
||||
3. **Permission issue:**
|
||||
- Check backup directory permissions
|
||||
- Verify service account access
|
||||
|
||||
4. **Retry backup:**
|
||||
```bash
|
||||
stella backup create --type <failed-type> --retry
|
||||
```
|
||||
|
||||
### INC-002: Restore Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Restore command fails
|
||||
- Services not starting after restore
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check restore logs
|
||||
stella backup restore-logs --last-attempt
|
||||
|
||||
# Verify backup integrity
|
||||
stella backup verify --name <backup-name>
|
||||
|
||||
# Check disk space
|
||||
stella doctor --check check.storage.diskspace
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Corrupted backup:**
|
||||
```bash
|
||||
# Try previous backup
|
||||
stella backup list --type <type>
|
||||
stella backup restore --name <previous-backup> --confirm
|
||||
```
|
||||
|
||||
2. **Version mismatch:**
|
||||
```bash
|
||||
# Check backup version
|
||||
stella backup info --name <backup-name>
|
||||
|
||||
# Restore with migration
|
||||
stella backup restore --name <backup-name> --with-migration
|
||||
```
|
||||
|
||||
3. **Disk space:**
|
||||
- Free space or expand volume
|
||||
- Restore to alternate location
|
||||
|
||||
### INC-003: Backup Storage Full
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaBackupStorageFull`
|
||||
- New backups failing
|
||||
|
||||
**Immediate Actions:**
|
||||
```bash
|
||||
# Check storage
|
||||
stella backup storage stats
|
||||
|
||||
# Emergency cleanup
|
||||
stella backup cleanup --keep-last 3
|
||||
|
||||
# Delete specific old backups
|
||||
stella backup delete --older-than 14d --confirm
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Adjust retention:**
|
||||
```bash
|
||||
stella backup retention set --type database --days 14
|
||||
stella backup retention apply
|
||||
```
|
||||
|
||||
2. **Expand storage:**
|
||||
- Add disk space
|
||||
- Configure offsite storage
|
||||
|
||||
3. **Archive to cold storage:**
|
||||
```bash
|
||||
stella backup archive --older-than 30d --destination s3://archive-bucket/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Disaster Recovery Scenarios
|
||||
|
||||
### DR-001: Complete System Loss
|
||||
|
||||
1. Provision new infrastructure
|
||||
2. Install Stella Ops
|
||||
3. Restore from offsite backup:
|
||||
```bash
|
||||
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
|
||||
```
|
||||
4. Verify all components
|
||||
5. Update DNS/load balancer
|
||||
|
||||
### DR-002: Database Corruption
|
||||
|
||||
1. Stop services
|
||||
2. Restore database from latest clean backup:
|
||||
```bash
|
||||
stella backup restore --type database --name <last-known-good> --confirm
|
||||
```
|
||||
3. Apply WAL to near-corruption point (PITR)
|
||||
4. Verify data integrity
|
||||
5. Resume services
|
||||
|
||||
### DR-003: Evidence Locker Loss
|
||||
|
||||
1. Restore evidence from backup:
|
||||
```bash
|
||||
stella backup restore --type evidence --name <backup-name> --confirm
|
||||
```
|
||||
2. Rebuild index:
|
||||
```bash
|
||||
stella evidence index rebuild
|
||||
```
|
||||
3. Verify anchor chain:
|
||||
```bash
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Offline/Air-Gap Backup
|
||||
|
||||
### Creating Offline Backup
|
||||
|
||||
```bash
|
||||
# Create encrypted offline bundle
|
||||
stella backup create-offline \
|
||||
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
|
||||
--encrypt \
|
||||
--passphrase-file /secure/backup-key
|
||||
|
||||
# Verify offline backup
|
||||
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
|
||||
```
|
||||
|
||||
### Restoring from Offline Backup
|
||||
|
||||
```bash
|
||||
# Restore from offline backup
|
||||
stella backup restore-offline \
|
||||
--input /media/usb/stellaops-backup-*.enc \
|
||||
--passphrase-file /secure/backup-key \
|
||||
--confirm
|
||||
```
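
The `stellaops-backup-*.enc` glob used above matches every bundle on the media, which becomes ambiguous once more than one exists. The sketch below selects the most recent bundle by the date embedded in its name, assuming the `stellaops-backup-YYYYMMDD.enc` naming from the create step. The function name is illustrative.

```bash
# Print the newest stellaops-backup-YYYYMMDD.enc in a directory.
# Returns 1 if no bundle is found.
latest_offline_bundle() {
  dir=$1
  latest=
  for f in "$dir"/stellaops-backup-*.enc; do
    [ -e "$f" ] || continue   # the glob matched nothing
    latest=$f                 # globs expand in sorted order; keep the last
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

Lexical order equals chronological order for `YYYYMMDD` names, so the result can be passed as a single `--input` value instead of the glob.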
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Backup Status
|
||||
|
||||
Key panels:
|
||||
- Last backup success time
|
||||
- Backup size trend
|
||||
- Backup duration
|
||||
- Restore test status
|
||||
- Storage utilization
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
```bash
|
||||
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
|
||||
2. **L2 (Platform team):** Restore operations, schedule adjustments
|
||||
3. **L3 (Architecture):** Disaster recovery execution
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
196
docs/operations/runbooks/connector-ghsa.md
Normal file
@@ -0,0 +1,196 @@
|
||||
# Runbook: Feed Connector - GitHub Security Advisories (GHSA) Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / GHSA Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.connector.ghsa-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] GHSA feed sync failing or stale
|
||||
- [ ] Alert `ConnectorGhsaSyncFailed` firing
|
||||
- [ ] Error: "GitHub API rate limit exceeded" or "GraphQL query failed"
|
||||
- [ ] GitHub Advisory Database vulnerabilities missing
|
||||
- [ ] Metric `connector_sync_failures_total{source="ghsa"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | GitHub ecosystem vulnerabilities may be missed |
|
||||
| **Data integrity** | Data becomes stale; no data loss |
|
||||
| **SLA impact** | Vulnerability currency SLO violated for GitHub packages |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.ghsa-health
|
||||
```
|
||||
|
||||
2. **Check GHSA sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source ghsa
|
||||
```
|
||||
|
||||
3. **Test GitHub API connectivity:**
|
||||
```bash
|
||||
stella connector test ghsa
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check GitHub API rate limit:**
|
||||
```bash
|
||||
stella connector ghsa rate-limit-status
|
||||
```
|
||||
Problem if: Remaining = 0, rate limit exceeded
|
||||
|
||||
2. **Check GitHub token permissions:**
|
||||
```bash
|
||||
stella connector credentials show ghsa --check-scopes
|
||||
```
|
||||
Required scopes: `public_repo`, `read:packages` (for private advisory access)
|
||||
|
||||
3. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs ghsa --last 1h --level error
|
||||
```
|
||||
Look for: GraphQL errors, pagination issues, timeout
|
||||
|
||||
4. **Check for GitHub API outage:**
|
||||
```bash
|
||||
stella connector ghsa api-status
|
||||
```
|
||||
Also check: https://www.githubstatus.com/
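
The rate-limit state can also be read directly from GitHub. The sketch below uses the public `GET /rate_limit` REST endpoint and assumes `curl` is installed and the token is exported as `GITHUB_TOKEN`. That variable name is hypothetical; where the connector stores its token is not covered here.

```bash
# Fetch the raw rate-limit document for the current token.
github_rate_limit() {
  curl --silent --show-error \
    --header "Authorization: Bearer $GITHUB_TOKEN" \
    --header "Accept: application/vnd.github+json" \
    https://api.github.com/rate_limit
}

# Seconds to wait until a reset time (Unix epoch seconds); never negative.
seconds_until_reset() {
  reset=$1
  now=${2:-$(date +%s)}
  remaining_s=$(( reset - now ))
  if [ "$remaining_s" -lt 0 ]; then
    remaining_s=0
  fi
  echo "$remaining_s"
}
```

The JSON response reports `remaining` and `reset` per resource, including `resources.graphql`; passing that `reset` value to `seconds_until_reset` gives the wait before the next refresh attempt.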
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If rate limited, wait for reset:**
|
||||
```bash
|
||||
stella connector ghsa rate-limit-status
|
||||
# Note the reset time, then:
|
||||
stella admin feeds refresh --source ghsa
|
||||
```
|
||||
|
||||
2. **Use secondary token if available:**
|
||||
```bash
|
||||
stella connector credentials rotate ghsa --to secondary
|
||||
stella admin feeds refresh --source ghsa
|
||||
```
|
||||
|
||||
3. **Load from offline bundle:**
|
||||
```bash
|
||||
stella offline load --source ghsa --package ghsa-bundle-latest.tar.gz
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If rate limit consistently exceeded:**
|
||||
|
||||
1. Increase sync interval:
|
||||
```bash
|
||||
stella connector config set ghsa.sync_interval 4h
|
||||
```
|
||||
|
||||
2. Enable incremental sync:
|
||||
```bash
|
||||
stella connector config set ghsa.incremental_sync true
|
||||
```
|
||||
|
||||
3. Use authenticated requests (5,000 requests/hour on the GitHub REST API, versus 60/hour unauthenticated):
|
||||
```bash
|
||||
stella connector credentials update ghsa --token <github-pat>
|
||||
```
|
||||
|
||||
**If token expired or invalid:**
|
||||
|
||||
1. Generate new GitHub PAT at https://github.com/settings/tokens
|
||||
|
||||
2. Update token:
|
||||
```bash
|
||||
stella connector credentials update ghsa --token <new-token>
|
||||
```
|
||||
|
||||
3. Verify scopes:
|
||||
```bash
|
||||
stella connector credentials show ghsa --check-scopes
|
||||
```
|
||||
|
||||
**If GraphQL query failing:**
|
||||
|
||||
1. Check for API schema changes:
|
||||
```bash
|
||||
stella connector ghsa schema-check
|
||||
```
|
||||
|
||||
2. Update connector if schema changed:
|
||||
```bash
|
||||
stella upgrade --component connector-ghsa
|
||||
```
|
||||
|
||||
**If pagination broken:**
|
||||
|
||||
1. Reset sync cursor:
|
||||
```bash
|
||||
stella connector ghsa reset-cursor
|
||||
```
|
||||
|
||||
2. Force full resync:
|
||||
```bash
|
||||
stella admin feeds refresh --source ghsa --full
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Force sync
|
||||
stella admin feeds refresh --source ghsa
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source ghsa --watch
|
||||
|
||||
# Verify recent advisories present
|
||||
stella vuln query GHSA-xxxx-xxxx-xxxx # Use a recent GHSA ID
|
||||
|
||||
# Check no errors
|
||||
stella connector logs ghsa --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Authentication:** Always use authenticated requests for 5000/hr rate limit
|
||||
- [ ] **Monitoring:** Alert on last sync > 12h or sync failures
|
||||
- [ ] **Redundancy:** Use NVD/OSV as backup for GitHub ecosystem coverage
|
||||
- [ ] **Token rotation:** Rotate tokens before expiration
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/ghsa.md`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-osv.md`
|
||||
- **GitHub API docs:** https://docs.github.com/en/graphql
|
||||
195
docs/operations/runbooks/connector-nvd.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Feed Connector - NVD Connector Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / NVD Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.connector.nvd-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] NVD feed sync failing or stale (> 24h since last successful sync)
|
||||
- [ ] Alert `ConnectorNvdSyncFailed` firing
|
||||
- [ ] Error: "NVD API request failed" or "rate limit exceeded"
|
||||
- [ ] Vulnerability data missing or outdated
|
||||
- [ ] Metric `connector_sync_failures_total{source="nvd"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Vulnerability scans may miss recent CVEs |
|
||||
| **Data integrity** | Data becomes stale; no data loss |
|
||||
| **SLA impact** | Vulnerability currency SLO violated (target: < 24h) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.nvd-health
|
||||
```
|
||||
|
||||
2. **Check NVD sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source nvd
|
||||
```
|
||||
Look for: Last sync time, error message, sync state
|
||||
|
||||
3. **Check NVD API connectivity:**
|
||||
```bash
|
||||
stella connector test nvd
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check NVD API key status:**
|
||||
```bash
|
||||
stella connector credentials show nvd
|
||||
```
|
||||
Problem if: API key expired or rate limit exhausted
|
||||
|
||||
2. **Check NVD API rate limit:**
|
||||
```bash
|
||||
stella connector nvd rate-limit-status
|
||||
```
|
||||
Problem if: Remaining requests = 0, reset time in future
|
||||
|
||||
3. **Check for NVD API outage:**
|
||||
```bash
|
||||
stella connector nvd api-status
|
||||
```
|
||||
Also check: https://nvd.nist.gov/general/news
|
||||
|
||||
4. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs nvd --last 1h --level error
|
||||
```
|
||||
Look for: HTTP status codes, timeout errors, parsing failures
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If rate limited, wait for reset:**
|
||||
```bash
|
||||
stella connector nvd rate-limit-status
|
||||
# Wait for reset time, then:
|
||||
stella admin feeds refresh --source nvd
|
||||
```
|
||||
|
||||
2. **If API key expired, use anonymous mode (slower):**
|
||||
```bash
|
||||
stella connector config set nvd.api_key_mode anonymous
|
||||
stella admin feeds refresh --source nvd
|
||||
```
|
||||
|
||||
3. **Load from offline bundle if urgent:**
|
||||
```bash
|
||||
# If you have a recent offline bundle:
|
||||
stella offline load --source nvd --package nvd-bundle-latest.tar.gz
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If API key expired or invalid:**
|
||||
|
||||
1. Generate new NVD API key at https://nvd.nist.gov/developers/request-an-api-key
|
||||
|
||||
2. Update API key:
|
||||
```bash
|
||||
stella connector credentials update nvd --api-key <new-key>
|
||||
```
|
||||
|
||||
3. Verify connectivity:
|
||||
```bash
|
||||
stella connector test nvd
|
||||
```
|
||||
|
||||
**If rate limit consistently exceeded:**
|
||||
|
||||
1. Increase sync interval to reduce API calls:
|
||||
```bash
|
||||
stella connector config set nvd.sync_interval 6h
|
||||
```
|
||||
|
||||
2. Enable delta sync to reduce data volume:
|
||||
```bash
|
||||
stella connector config set nvd.delta_sync true
|
||||
```
|
||||
|
||||
3. Request higher rate limit from NVD (if available)
|
||||
|
||||
**If network/firewall issue:**
|
||||
|
||||
1. Verify outbound connectivity to NVD API:
|
||||
```bash
|
||||
stella connector test nvd --verbose
|
||||
```
|
||||
|
||||
2. If outbound traffic must go through a proxy, configure it:
|
||||
```bash
|
||||
stella connector config set nvd.proxy https://proxy:8080
|
||||
```
|
||||
|
||||
**If data parsing failures:**
|
||||
|
||||
1. Check for NVD schema changes:
|
||||
```bash
|
||||
stella connector nvd schema-check
|
||||
```
|
||||
|
||||
2. Update connector if schema changed:
|
||||
```bash
|
||||
stella upgrade --component connector-nvd
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Force sync
|
||||
stella admin feeds refresh --source nvd --force
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source nvd --watch
|
||||
|
||||
# Verify recent CVEs are present
|
||||
stella vuln query CVE-2026-XXXX # Use a recent CVE ID
|
||||
|
||||
# Check no errors in recent logs
|
||||
stella connector logs nvd --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **API Key:** Always use API key (not anonymous) for 10x rate limit
|
||||
- [ ] **Monitoring:** Alert on last sync > 24h or sync failure
|
||||
- [ ] **Redundancy:** Configure backup connector (OSV, GitHub Advisory) for overlap
|
||||
- [ ] **Offline:** Maintain weekly offline bundle for disaster recovery
|
||||
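To make the "10x rate limit" point concrete: NVD's published limits are 5 requests per rolling 30-second window without an API key and 50 per window with one. The sketch below derives the minimum spacing between requests from those figures. `nvd_min_interval_ms` is a hypothetical helper for illustration, and how the connector actually paces its requests is not described in this runbook.

```bash
# Minimum spacing between NVD API requests, in milliseconds, derived from the
# published limits: 5 requests per rolling 30 s window without an API key,
# 50 per 30 s window with one.
nvd_min_interval_ms() {
  local has_key="$1"      # "yes" when an API key is configured
  local window_ms=30000
  local limit=5
  if [ "$has_key" = "yes" ]; then
    limit=50
  fi
  echo $(( window_ms / limit ))
}

# Example:
#   nvd_min_interval_ms yes   # keyed access
#   nvd_min_interval_ms no    # anonymous access
```

With a key the floor is 600 ms between requests; anonymously it is 6000 ms, which is why anonymous mode is only suitable as a temporary mitigation.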
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/nvd.md`
|
||||
- **Related runbooks:** `connector-ghsa.md`, `connector-osv.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Feed Connectors
|
||||
193
docs/operations/runbooks/connector-osv.md
Normal file
@@ -0,0 +1,193 @@
|
||||
# Runbook: Feed Connector - OSV (Open Source Vulnerabilities) Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Component** | Concelier / OSV Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.osv-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] OSV feed sync failing or stale
|
||||
- [ ] Alert `ConnectorOsvSyncFailed` firing
|
||||
- [ ] Error: "OSV API request failed" or "ecosystem sync failed"
|
||||
- [ ] OSV vulnerabilities missing from database
|
||||
- [ ] Metric `connector_sync_failures_total{source="osv"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Open source ecosystem vulnerabilities may be missed |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated for affected ecosystems |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.osv-health
|
||||
```
|
||||
|
||||
2. **Check OSV sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source osv
|
||||
```
|
||||
|
||||
3. **Test OSV API connectivity:**
|
||||
```bash
|
||||
stella connector test osv
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check ecosystem-specific status:**
|
||||
```bash
|
||||
stella connector osv ecosystems status
|
||||
```
|
||||
Look for: Failed ecosystems, stale ecosystems
|
||||
|
||||
2. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs osv --last 1h --level error
|
||||
```
|
||||
Look for: API errors, parsing failures, timeout
|
||||
|
||||
3. **Check for OSV API outage:**
|
||||
```bash
|
||||
stella connector osv api-status
|
||||
```
|
||||
Also check: https://osv.dev/
|
||||
|
||||
4. **Check GCS bucket access (OSV uses GCS for bulk data):**
|
||||
```bash
|
||||
stella connector osv gcs-status
|
||||
```
|
||||
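If the GCS check above fails, it can help to probe the public OSV export directly and rule out the Stella CLI itself. OSV publishes per-ecosystem bulk exports as `all.zip` under the `osv-vulnerabilities` bucket. The helper below only builds the HTTPS URL; `osv_bulk_url` is a hypothetical name, and whether the connector fetches this exact object is an assumption.

```bash
# Public HTTPS URL of the OSV bulk export for one ecosystem.
# Ecosystem names are case-sensitive (for example "npm", "PyPI", "Go").
# Names containing spaces would need URL-encoding, which this sketch omits.
osv_bulk_url() {
  local ecosystem="$1"
  if [ -z "$ecosystem" ]; then
    echo "usage: osv_bulk_url <ecosystem>" >&2
    return 2
  fi
  printf 'https://osv-vulnerabilities.storage.googleapis.com/%s/all.zip\n' "$ecosystem"
}

# Example (requires network access; sends a HEAD request only):
#   curl -sSfI "$(osv_bulk_url npm)"
```

A successful HEAD request from the affected host suggests the bucket is reachable and the problem lies in connector configuration rather than the network path.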
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Retry sync for specific ecosystem:**
|
||||
```bash
|
||||
stella admin feeds refresh --source osv --ecosystem npm
|
||||
```
|
||||
|
||||
2. **Sync from GCS bucket directly (faster for bulk):**
|
||||
```bash
|
||||
stella connector osv sync-from-gcs
|
||||
```
|
||||
|
||||
3. **Load from offline bundle:**
|
||||
```bash
|
||||
stella offline load --source osv --package osv-bundle-latest.tar.gz
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If API request failing:**
|
||||
|
||||
1. Check API endpoint:
|
||||
```bash
|
||||
stella connector osv api-test
|
||||
```
|
||||
|
||||
2. If a proxy sits between Stella and the OSV API, configure it explicitly:
|
||||
```bash
|
||||
stella connector config set osv.proxy <proxy-url>
|
||||
```
|
||||
|
||||
**If GCS access failing:**
|
||||
|
||||
1. Check GCS connectivity:
|
||||
```bash
|
||||
stella connector osv gcs-test
|
||||
```
|
||||
|
||||
2. Enable anonymous access (default):
|
||||
```bash
|
||||
stella connector config set osv.gcs_auth anonymous
|
||||
```
|
||||
|
||||
3. Or configure service account:
|
||||
```bash
|
||||
stella connector config set osv.gcs_credentials /path/to/sa-key.json
|
||||
```
|
||||
|
||||
**If specific ecosystem failing:**
|
||||
|
||||
1. Disable problematic ecosystem temporarily:
|
||||
```bash
|
||||
stella connector config set osv.ecosystems.disabled <ecosystem>
|
||||
```
|
||||
|
||||
2. Check ecosystem data format:
|
||||
```bash
|
||||
stella connector osv ecosystem-check <ecosystem>
|
||||
```
|
||||
|
||||
**If parsing errors:**
|
||||
|
||||
1. Check for schema changes:
|
||||
```bash
|
||||
stella connector osv schema-check
|
||||
```
|
||||
|
||||
2. Update connector:
|
||||
```bash
|
||||
stella upgrade --component connector-osv
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Force sync
|
||||
stella admin feeds refresh --source osv --force
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source osv --watch
|
||||
|
||||
# Verify ecosystem coverage
|
||||
stella connector osv ecosystems status
|
||||
|
||||
# Query recent vulnerability
|
||||
stella vuln query OSV-2026-xxxx
|
||||
|
||||
# Check no errors
|
||||
stella connector logs osv --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Bulk sync:** Use GCS bulk sync for initial load and daily updates
|
||||
- [ ] **Monitoring:** Alert on ecosystem sync failures
|
||||
- [ ] **Redundancy:** NVD/GHSA provide overlapping coverage for major ecosystems
|
||||
- [ ] **Offline:** Maintain weekly offline bundle
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/osv.md`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`
|
||||
- **OSV API docs:** https://osv.dev/docs/
|
||||
220
docs/operations/runbooks/connector-vendor-specific.md
Normal file
@@ -0,0 +1,220 @@
|
||||
# Runbook Template: Feed Connector - Vendor-Specific Connectors
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Overview
|
||||
|
||||
This is a template runbook for vendor-specific advisory feed connectors (RedHat, Ubuntu, Debian, Oracle, VMware, etc.). Use this template to create runbooks for specific vendor connectors.
|
||||
|
||||
---
|
||||
|
||||
## Metadata Template
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Component** | Concelier / [Vendor] Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | [Date] |
| **Doctor check** | `check.connector.[vendor]-health` |
|
||||
|
||||
---
|
||||
|
||||
## Common Vendor Connector Issues
|
||||
|
||||
### Authentication Failures
|
||||
|
||||
**Symptoms:**
|
||||
- Sync failing with 401/403 errors
|
||||
- "authentication failed" or "invalid credentials"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check credentials
|
||||
stella connector credentials show <vendor>
|
||||
|
||||
# Update credentials
|
||||
stella connector credentials update <vendor> --api-key <key>
|
||||
|
||||
# Test connectivity
|
||||
stella connector test <vendor>
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
**Symptoms:**
|
||||
- Sync failing with 429 errors
|
||||
- "rate limit exceeded"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check rate limit status
|
||||
stella connector <vendor> rate-limit-status
|
||||
|
||||
# Increase sync interval
|
||||
stella connector config set <vendor>.sync_interval 6h
|
||||
|
||||
# Enable delta sync
|
||||
stella connector config set <vendor>.delta_sync true
|
||||
```
|
||||
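When a feed is rate limited, retrying immediately tends to make things worse. The following is a minimal sketch of exponential backoff around any command. `retry_with_backoff` is a hypothetical wrapper, and it assumes the wrapped command exits non-zero when it is rate limited, which this runbook does not confirm for the Stella CLI.

```bash
# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))        # double the wait after each failure
    attempt=$(( attempt + 1 ))
  done
}

# Example, using the refresh command from this runbook:
#   retry_with_backoff 5 60 stella admin feeds refresh --source <vendor>
```

Doubling the delay keeps the total number of calls small while still recovering quickly from a short throttling window.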
|
||||
### Data Format Changes
|
||||
|
||||
**Symptoms:**
|
||||
- Parsing errors in sync logs
|
||||
- "unexpected format" or "schema validation failed"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check for schema changes
|
||||
stella connector <vendor> schema-check
|
||||
|
||||
# Update connector
|
||||
stella upgrade --component connector-<vendor>
|
||||
```
|
||||
|
||||
### Offline Bundle Refresh
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Create offline bundle
|
||||
stella offline sync --feeds <vendor> --output <vendor>-bundle.tar.gz
|
||||
|
||||
# Load offline bundle
|
||||
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Vendor-Specific Runbooks
|
||||
|
||||
Use this template to create runbooks for:
|
||||
|
||||
### RedHat Security Data
|
||||
|
||||
**Endpoint:** https://access.redhat.com/security/data/
|
||||
**Authentication:** API token or certificate
|
||||
**Connector:** `connector-redhat`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test redhat
|
||||
stella admin feeds status --source redhat
|
||||
stella connector redhat cve-map-status # RHSA to CVE mapping
|
||||
```
|
||||
|
||||
### Ubuntu Security Notices
|
||||
|
||||
**Endpoint:** https://ubuntu.com/security/notices
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-ubuntu`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test ubuntu
|
||||
stella admin feeds status --source ubuntu
|
||||
stella connector ubuntu usn-status # USN sync status
|
||||
```
|
||||
|
||||
### Debian Security Tracker
|
||||
|
||||
**Endpoint:** https://security-tracker.debian.org/
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-debian`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test debian
|
||||
stella admin feeds status --source debian
|
||||
stella connector debian dla-status # DLA sync status
|
||||
```
|
||||
|
||||
### Oracle Security Alerts
|
||||
|
||||
**Endpoint:** https://www.oracle.com/security-alerts/
|
||||
**Authentication:** Oracle account (optional)
|
||||
**Connector:** `connector-oracle`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test oracle
|
||||
stella admin feeds status --source oracle
|
||||
stella connector oracle cpu-status # Critical Patch Update status
|
||||
```
|
||||
|
||||
### VMware Security Advisories
|
||||
|
||||
**Endpoint:** https://www.vmware.com/security/advisories
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-vmware`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test vmware
|
||||
stella admin feeds status --source vmware
|
||||
stella connector vmware vmsa-status # VMSA sync status
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis Checklist
|
||||
|
||||
For any vendor connector issue:
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.<vendor>-health
|
||||
```
|
||||
|
||||
2. **Check sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source <vendor>
|
||||
```
|
||||
|
||||
3. **Test connectivity:**
|
||||
```bash
|
||||
stella connector test <vendor>
|
||||
```
|
||||
|
||||
4. **Check logs:**
|
||||
```bash
|
||||
stella connector logs <vendor> --last 1h --level error
|
||||
```
|
||||
|
||||
5. **Check credentials (if applicable):**
|
||||
```bash
|
||||
stella connector credentials show <vendor>
|
||||
```
|
||||
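The five checklist steps above can be collected into one pass per vendor. This is a sketch only: it reuses the commands exactly as written in the checklist, and the function names `vendor_diag_commands` and `run_vendor_diag` are hypothetical.

```bash
# Print the five diagnosis commands from the checklist for one vendor.
vendor_diag_commands() {
  local vendor="$1"
  if [ -z "$vendor" ]; then
    echo "usage: vendor_diag_commands <vendor>" >&2
    return 2
  fi
  cat <<EOF
stella doctor --check check.connector.${vendor}-health
stella admin feeds status --source ${vendor}
stella connector test ${vendor}
stella connector logs ${vendor} --last 1h --level error
stella connector credentials show ${vendor}
EOF
}

# Run each command in order, continuing past failures so that one broken
# step does not hide the output of the later ones.
run_vendor_diag() {
  local vendor="$1" cmd
  vendor_diag_commands "$vendor" | while IFS= read -r cmd; do
    echo "== $cmd"
    # stdin is redirected so the command cannot consume the command list
    $cmd </dev/null || echo "   (exit status $?)"
  done
}

# Example:
#   run_vendor_diag redhat > /tmp/redhat-diag.txt 2>&1
```

Keeping the command list separate from the runner makes it easy to review what will be executed before running it on a production host.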
|
||||
---
|
||||
|
||||
## Resolution Checklist
|
||||
|
||||
1. **Retry sync:**
|
||||
```bash
|
||||
stella admin feeds refresh --source <vendor>
|
||||
```
|
||||
|
||||
2. **Update credentials (if auth issue):**
|
||||
```bash
|
||||
stella connector credentials update <vendor>
|
||||
```
|
||||
|
||||
3. **Update connector (if format changed):**
|
||||
```bash
|
||||
stella upgrade --component connector-<vendor>
|
||||
```
|
||||
|
||||
4. **Load offline bundle (if API unavailable):**
|
||||
```bash
|
||||
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Connector architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Vendor connector configs:** `docs/modules/concelier/operations/connectors/`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`, `connector-osv.md`
|
||||
370
docs/operations/runbooks/crypto-ops.md
Normal file
@@ -0,0 +1,370 @@
|
||||
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
# Task: RUN-002 - Crypto Subsystem Runbook
|
||||
# Regional Crypto Operations Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Cryptographic subsystem operations including HSM management, regional crypto profile configuration, key rotation, and certificate management for all supported crypto profiles (International, FIPS, eIDAS, GOST, SM).
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check crypto subsystem health
|
||||
stella doctor --category crypto
|
||||
|
||||
# Verify active crypto profile
|
||||
stella crypto profile show
|
||||
|
||||
# List loaded crypto providers
|
||||
stella crypto providers list
|
||||
|
||||
# Check key status
|
||||
stella crypto keys status
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_crypto_operations_total` - Crypto operation count by type
|
||||
- `stella_crypto_operation_duration_seconds` - Signing/verification latency
|
||||
- `stella_hsm_availability` - HSM availability (if configured)
|
||||
- `stella_cert_expiry_days` - Certificate expiration countdown
|
||||
|
||||
---
|
||||
|
||||
## Regional Crypto Profiles
|
||||
|
||||
### Profile Overview
|
||||
|
||||
| Profile | Use Case | Key Algorithms | Compliance |
|---------|----------|----------------|------------|
| `international` | Default, most deployments | RSA-2048+, ECDSA P-256/P-384, Ed25519 | General |
| `fips` | US Government / FedRAMP | FIPS 140-2 approved algorithms only | FIPS 140-2 |
| `eidas` | European Union | RSA-PSS, ECDSA, Ed25519 per ETSI TS 119 312 | eIDAS |
| `gost` | Russian Federation | GOST R 34.10-2012, GOST R 34.11-2012 | Russian standards |
| `sm` | China | SM2, SM3, SM4 | GM/T 0002-2012 (SM4), GM/T 0003-2012 (SM2), GM/T 0004-2012 (SM3) |
|
||||
|
||||
### Switching Profiles
|
||||
|
||||
1. **Pre-switch verification:**
|
||||
```bash
|
||||
# Verify target profile is available
|
||||
stella crypto profile verify --profile <target-profile>
|
||||
|
||||
# Check for incompatible existing signatures
|
||||
stella crypto audit --check-compatibility --target-profile <target-profile>
|
||||
```
|
||||
|
||||
2. **Profile switch:**
|
||||
```bash
|
||||
# Switch profile (requires service restart)
|
||||
stella crypto profile set --profile <target-profile>
|
||||
|
||||
# Restart services to apply
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. **Post-switch verification:**
|
||||
```bash
|
||||
stella doctor --check check.crypto.fips,check.crypto.eidas,check.crypto.gost,check.crypto.sm
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Key Rotation
|
||||
|
||||
**Frequency:** Quarterly or per policy
|
||||
**Duration:** ~15 minutes (no downtime)
|
||||
|
||||
1. Generate new key:
|
||||
```bash
|
||||
# For software keys
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name signing-$(date +%Y%m)
|
||||
|
||||
# For HSM-backed keys
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --provider hsm --name signing-$(date +%Y%m)
|
||||
```
|
||||
|
||||
2. Activate new key:
|
||||
```bash
|
||||
stella crypto keys activate --name signing-$(date +%Y%m)
|
||||
```
|
||||
|
||||
3. Verify signing with new key:
|
||||
```bash
|
||||
echo "test" | stella crypto sign --output /dev/null
|
||||
```
|
||||
|
||||
4. Schedule old key deactivation:
|
||||
```bash
|
||||
stella crypto keys schedule-deactivation --name <old-key-name> --in 30d
|
||||
```
|
||||
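Steps 1 and 2 each evaluate `$(date +%Y%m)` separately, so a rotation that straddles a month boundary would generate one key name and try to activate another. A small sketch that derives the name once, assuming GNU `date`; `rotation_key_name` is a hypothetical helper, and it uses UTC where the commands above use local time.

```bash
# Derive the rotation key name once so every step uses the same value,
# even if the procedure straddles a month boundary.
# Assumes GNU date; the optional argument pins the date for testing.
rotation_key_name() {
  local when="${1:-now}"
  printf 'signing-%s\n' "$(date -u -d "$when" +%Y%m)"
}

# Example:
#   KEY_NAME=$(rotation_key_name)
#   stella crypto keys generate --type signing --algorithm ecdsa-p256 --name "$KEY_NAME"
#   stella crypto keys activate --name "$KEY_NAME"
```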
|
||||
### SP-002: Certificate Renewal
|
||||
|
||||
**When:** Certificate expiring within 30 days
|
||||
|
||||
1. Check expiration:
|
||||
```bash
|
||||
stella crypto certs check-expiry
|
||||
```
|
||||
|
||||
2. Generate CSR:
|
||||
```bash
|
||||
stella crypto certs csr --subject "CN=stellaops.example.com,O=Example Corp" --output cert.csr
|
||||
```
|
||||
|
||||
3. Install renewed certificate:
|
||||
```bash
|
||||
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
|
||||
```
|
||||
|
||||
4. Verify certificate chain:
|
||||
```bash
|
||||
stella doctor --check check.crypto.certchain
|
||||
```
|
||||
|
||||
5. Restart services:
|
||||
```bash
|
||||
stella service restart --graceful
|
||||
```
|
||||
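The 30-day renewal window can also be checked directly against a PEM file with OpenSSL, independent of the Stella CLI. This sketch relies on `openssl x509 -checkend`, which exits 0 when the certificate is still valid the given number of seconds from now; `cert_expires_within_days` is a hypothetical wrapper name.

```bash
# Exit status 0 when the certificate expires within the given number of days
# (default 30), 1 when it is still valid beyond that window.
# An unreadable or malformed file is also reported as "expiring", on the
# assumption that it needs attention either way.
cert_expires_within_days() {
  local cert="$1" days="${2:-30}"
  if openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example:
#   cert_expires_within_days renewed-cert.pem 30 && echo "renew now"
```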
|
||||
### SP-003: HSM Health Check
|
||||
|
||||
**Frequency:** Daily (automated) or on-demand
|
||||
|
||||
1. Check HSM connectivity:
|
||||
```bash
|
||||
stella crypto hsm status
|
||||
```
|
||||
|
||||
2. Verify slot access:
|
||||
```bash
|
||||
stella crypto hsm slots list
|
||||
```
|
||||
|
||||
3. Test signing operation:
|
||||
```bash
|
||||
stella crypto hsm test-sign
|
||||
```
|
||||
|
||||
4. Check HSM metrics:
|
||||
- Free objects/sessions
|
||||
- Temperature/health (vendor-specific)
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: HSM Unavailable
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaHsmUnavailable`
|
||||
- Signing operations failing with "HSM connection error"
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check HSM status
|
||||
stella crypto hsm status
|
||||
|
||||
# Test PKCS#11 module
|
||||
stella crypto hsm test-module
|
||||
|
||||
# Check network to HSM
|
||||
stella network test --host <hsm-host> --port <hsm-port>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Network issue:**
|
||||
- Verify network path to HSM
|
||||
- Check firewall rules
|
||||
- Verify HSM appliance is powered on
|
||||
|
||||
2. **Session exhaustion:**
|
||||
```bash
|
||||
# Release stale sessions
|
||||
stella crypto hsm sessions release --stale
|
||||
|
||||
# Restart crypto service
|
||||
stella service restart --service crypto-signer
|
||||
```
|
||||
|
||||
3. **HSM failure:**
|
||||
- Fail over to secondary HSM (if configured)
|
||||
- Contact HSM vendor support
|
||||
- Consider temporary fallback to software keys (with approval)
|
||||
|
||||
### INC-002: Signing Key Compromised
|
||||
|
||||
**CRITICAL - Follow incident response procedure**
|
||||
|
||||
1. **Immediate containment:**
|
||||
```bash
|
||||
# Revoke compromised key
|
||||
stella crypto keys revoke --name <compromised-key> --reason compromise
|
||||
|
||||
# Block signing with compromised key
|
||||
stella crypto keys block --name <compromised-key>
|
||||
```
|
||||
|
||||
2. **Generate replacement key:**
|
||||
```bash
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name emergency-signing
|
||||
stella crypto keys activate --name emergency-signing
|
||||
```
|
||||
|
||||
3. **Notify downstream:**
|
||||
- Update trust registries with new key
|
||||
- Notify relying parties
|
||||
- Publish key revocation notice
|
||||
|
||||
4. **Forensics:**
|
||||
```bash
|
||||
# Export key usage audit log
|
||||
stella crypto audit export --key <compromised-key> --output /secure/key-audit.json
|
||||
```
|
||||
|
||||
### INC-003: Certificate Expired
|
||||
|
||||
**Symptoms:**
|
||||
- TLS connection failures
|
||||
- Alert: `StellaCertExpired`
|
||||
|
||||
**Immediate Resolution:**
|
||||
|
||||
1. If renewed certificate is available:
|
||||
```bash
|
||||
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
2. If the renewed certificate is not ready, install a temporary self-signed certificate as a stopgap:
|
||||
```bash
|
||||
# Generate emergency certificate (NOT for production use)
|
||||
stella crypto certs generate-self-signed --days 7 --name emergency
|
||||
stella crypto certs install --cert emergency.pem
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. Expedite certificate renewal process
|
||||
|
||||
### INC-004: FIPS Mode Not Enabled
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaFipsNotEnabled`
|
||||
- Compliance audit failure
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Linux (RHEL-family distributions that ship `fips-mode-setup`):**
|
||||
```bash
|
||||
# Enable FIPS mode
|
||||
sudo fips-mode-setup --enable
|
||||
|
||||
# Reboot required
|
||||
sudo reboot
|
||||
|
||||
# Verify after reboot
|
||||
fips-mode-setup --check
|
||||
```
|
||||
|
||||
2. **Windows:**
|
||||
- Enable via Group Policy
|
||||
- Or via registry:
|
||||
```powershell
|
||||
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" -Name "Enabled" -Value 1
|
||||
Restart-Computer
|
||||
```
|
||||
|
||||
3. Restart Stella services:
|
||||
```bash
|
||||
stella service restart
|
||||
stella doctor --check check.crypto.fips
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Regional-Specific Procedures
|
||||
|
||||
### GOST Configuration (Russian Federation)
|
||||
|
||||
1. Install GOST engine:
|
||||
```bash
|
||||
sudo apt install libengine-gost-openssl1.1
|
||||
```
|
||||
|
||||
2. Configure Stella:
|
||||
```bash
|
||||
stella crypto profile set --profile gost
|
||||
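# The engine directory depends on the OpenSSL major version (engines-1.1 vs engines-3);
# confirm the actual location with: openssl version -e
stella crypto config set --gost-engine-path /usr/lib/x86_64-linux-gnu/engines-3/gost.so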
|
||||
```
|
||||
|
||||
3. Verify:
|
||||
```bash
|
||||
stella doctor --check check.crypto.gost
|
||||
```
|
||||
|
||||
### SM Configuration (China)
|
||||
|
||||
1. Ensure OpenSSL 1.1.1+ with SM support:
|
||||
```bash
|
||||
openssl version
|
||||
openssl list -cipher-algorithms | grep -i sm   # SM4 block cipher
openssl list -digest-algorithms | grep -i sm3  # SM3 hash
|
||||
```
|
||||
|
||||
2. Configure Stella:
|
||||
```bash
|
||||
stella crypto profile set --profile sm
|
||||
```
|
||||
|
||||
3. Verify:
|
||||
```bash
|
||||
stella doctor --check check.crypto.sm
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Crypto Subsystem
|
||||
|
||||
Key panels:
|
||||
- Signing operation latency
|
||||
- Key usage by key ID
|
||||
- HSM availability
|
||||
- Certificate expiration countdown
|
||||
- Crypto profile in use
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
```bash
|
||||
# Comprehensive crypto diagnostics
|
||||
stella crypto diagnostics --output /tmp/crypto-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Active crypto profile
|
||||
- Key inventory (public keys only)
|
||||
- Certificate chain
|
||||
- HSM status
|
||||
- Operation audit log (last 24h)
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Certificate installs, key activation
|
||||
2. **L2 (Security team):** Key rotation, HSM issues
|
||||
3. **L3 (Crypto SME):** Algorithm issues, compliance questions
|
||||
4. **HSM Vendor:** Hardware failures
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
408
docs/operations/runbooks/evidence-locker-ops.md
Normal file
@@ -0,0 +1,408 @@
|
||||
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
# Task: RUN-003 - Evidence Locker Runbook
|
||||
# Evidence Locker Operations Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Evidence locker operations including storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery procedures.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check evidence locker health
|
||||
stella doctor --category evidence
|
||||
|
||||
# Verify storage accessibility
|
||||
stella evidence status
|
||||
|
||||
# Check index health
|
||||
stella evidence index status
|
||||
|
||||
# Verify anchor chain
|
||||
stella evidence anchor verify --latest
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_evidence_artifacts_total` - Total artifacts stored
|
||||
- `stella_evidence_retrieval_latency_seconds` - Retrieval latency P99
|
||||
- `stella_evidence_storage_bytes` - Storage consumption
|
||||
- `stella_merkle_anchor_age_seconds` - Time since last anchor
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Daily Integrity Check
|
||||
|
||||
**Frequency:** Daily (automated) or on-demand
|
||||
**Duration:** Varies by locker size (typically 5-30 minutes)
|
||||
|
||||
1. Run integrity verification:
|
||||
```bash
|
||||
# Quick check (sample-based)
|
||||
stella evidence verify --mode quick
|
||||
|
||||
# Full check (all artifacts)
|
||||
stella evidence verify --mode full
|
||||
```
|
||||
|
||||
2. Review results:
|
||||
```bash
|
||||
stella evidence verify-report --latest
|
||||
```
|
||||
|
||||
3. Address any failures:
|
||||
```bash
|
||||
# List failed artifacts
|
||||
stella evidence verify-report --latest --filter failed
|
||||
```
|
||||
|
||||
### SP-002: Index Maintenance
|
||||
|
||||
**Frequency:** Weekly or after large ingestion
|
||||
**Duration:** ~10 minutes
|
||||
|
||||
1. Check index health:
|
||||
```bash
|
||||
stella evidence index status
|
||||
```
|
||||
|
||||
2. Refresh index if needed:
|
||||
```bash
|
||||
# Incremental refresh
|
||||
stella evidence index refresh
|
||||
|
||||
# Full rebuild (if corruption suspected)
|
||||
stella evidence index rebuild
|
||||
```
|
||||
|
||||
3. Optimize index:
|
||||
```bash
|
||||
stella evidence index optimize
|
||||
```
|
||||
|
||||
### SP-003: Merkle Anchoring
|
||||
|
||||
**Frequency:** Per policy (default: every 6 hours)
|
||||
**Duration:** ~2 minutes
|
||||
|
||||
1. Create new anchor:
|
||||
```bash
|
||||
stella evidence anchor create
|
||||
```
|
||||
|
||||
2. Verify anchor chain:
|
||||
```bash
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
3. Export anchor for external archival:
|
||||
```bash
|
||||
stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
|
||||
```
|
||||
|
||||
### SP-004: Storage Cleanup
|
||||
|
||||
**Frequency:** Monthly or when storage alerts trigger
|
||||
**Duration:** Varies
|
||||
|
||||
1. Review storage usage:
|
||||
```bash
|
||||
stella evidence storage stats
|
||||
```
|
||||
|
||||
2. Apply retention policy:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella evidence cleanup --apply-retention --dry-run
|
||||
|
||||
# Execute cleanup
|
||||
stella evidence cleanup --apply-retention
|
||||
```
|
||||
|
||||
3. Archive old evidence (if required):
|
||||
```bash
|
||||
stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y).tar
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Integrity Verification Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceIntegrityFailure`
|
||||
- Verification reports hash mismatch
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Get failure details
|
||||
stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
|
||||
|
||||
# Check specific artifact
|
||||
stella evidence inspect <artifact-id>
|
||||
|
||||
# Check provenance
|
||||
stella evidence provenance show <artifact-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Isolated corruption:**
|
||||
```bash
|
||||
# Attempt recovery from replica (if available)
|
||||
stella evidence recover --id <artifact-id> --source replica
|
||||
|
||||
# If no replica, mark as corrupted
|
||||
stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
|
||||
```
|
||||
|
||||
2. **Widespread corruption:**
|
||||
- Stop evidence ingestion
|
||||
- Identify corruption extent
|
||||
- Restore from backup if necessary
|
||||
- Escalate to L3
|
||||
|
||||
3. **False positive (software bug):**
|
||||
- Verify with multiple hash implementations
|
||||
- Check for recent software updates
|
||||
- Report bug if confirmed
|
||||
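For the false-positive case, "multiple hash implementations" can be as simple as computing the digest with two independent tools and comparing. The sketch below assumes the artifact is available as a local file and that the locker uses SHA-256; neither is confirmed by this runbook, and `sha256_agree` is a hypothetical helper.

```bash
# Compute SHA-256 with coreutils and with OpenSSL and compare the results.
# Prints the digest and returns 0 when both tools agree; returns 1 otherwise.
sha256_agree() {
  local file="$1" a b
  a=$(sha256sum "$file" | awk '{print $1}')
  b=$(openssl dgst -sha256 -r "$file" | awk '{print $1}')
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "$a"
    return 0
  fi
  echo "digest mismatch: sha256sum=$a openssl=$b" >&2
  return 1
}

# Example:
#   sha256_agree /path/to/exported-artifact
```

If both tools agree with each other but not with the digest recorded for the artifact, the stored data has most likely changed; if they disagree with each other, suspect the host or toolchain.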
|
||||
### INC-002: Evidence Retrieval Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceRetrievalFailed`
|
||||
- API returning 404 for known artifacts
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check if artifact exists
|
||||
stella evidence exists <artifact-id>
|
||||
|
||||
# Check index
|
||||
stella evidence index lookup <artifact-id>
|
||||
|
||||
# Check storage backend
|
||||
stella evidence storage check <artifact-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Index corruption:**
|
||||
```bash
|
||||
# Rebuild index
|
||||
stella evidence index rebuild
|
||||
```
|
||||
|
||||
2. **Storage backend issue:**
|
||||
```bash
|
||||
# Check storage health
|
||||
stella doctor --check check.storage.evidencelocker
|
||||
|
||||
# Verify storage connectivity
|
||||
stella evidence storage test
|
||||
```
|
||||
|
||||
3. **File system issue:**
|
||||
- Check disk health
|
||||
- Verify file permissions
|
||||
- Check mount status
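
The permission and mount checks in step 3 could look roughly like the following. This is a sketch using only standard shell tools; `fs_check` is a hypothetical helper name, and the directory you pass in is whatever path your deployment uses for evidence storage.

```bash
# Hypothetical helper: basic permission and mount checks for a storage
# directory. Prints findings and returns non-zero if a check fails.
fs_check() {
  local dir="$1" rc=0
  [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
  [ -r "$dir" ] || { echo "not readable: $dir"; rc=1; }
  [ -w "$dir" ] || { echo "not writable: $dir"; rc=1; }
  # Column 6 of POSIX df output is the mount point of the filesystem
  # that holds the directory.
  echo "mounted on: $(df -P "$dir" | awk 'NR == 2 { print $6 }')"
  return "$rc"
}
```

Run it as the service account that owns the evidence store, since the readable and writable tests reflect the calling user. It does not cover disk health, which needs hardware or provider tooling.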
|
||||
|
||||
### INC-003: Anchor Chain Break
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaMerkleAnchorChainBroken`
|
||||
- Anchor verification fails
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check anchor chain
|
||||
stella evidence anchor verify --all --verbose
|
||||
|
||||
# Find break point
|
||||
stella evidence anchor list --show-links
|
||||
|
||||
# Inspect specific anchor
|
||||
stella evidence anchor inspect <anchor-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Single broken link:**
|
||||
```bash
|
||||
# Attempt to recover from backup
|
||||
stella evidence anchor recover --id <anchor-id> --source backup
|
||||
```
|
||||
|
||||
2. **Multiple breaks:**
|
||||
- Stop new anchoring
|
||||
- Assess extent of damage
|
||||
- Restore from backup or rebuild chain
|
||||
|
||||
3. **Create new chain segment:**
|
||||
```bash
|
||||
# Start new chain (preserves old chain as archived)
|
||||
stella evidence anchor new-chain --reason "chain-break-recovery"
|
||||
```
|
||||
|
||||
### INC-004: Storage Full
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceStorageFull`
|
||||
- Ingestion failing
|
||||
|
||||
**Immediate Actions:**
|
||||
```bash
|
||||
# Check storage usage
|
||||
stella evidence storage stats
|
||||
|
||||
# Emergency cleanup of temporary files
|
||||
stella evidence cleanup --temp-only
|
||||
|
||||
# Find large/old artifacts
|
||||
stella evidence storage analyze --sort size --limit 20
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Apply retention policy:**
|
||||
```bash
|
||||
stella evidence cleanup --apply-retention --aggressive
|
||||
```
|
||||
|
||||
2. **Archive old evidence:**
|
||||
```bash
|
||||
stella evidence archive --older-than 180d --compress
|
||||
```
|
||||
|
||||
3. **Expand storage:**
|
||||
- Follow cloud provider procedure
|
||||
- Or add additional storage volume
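
Before and after expanding storage, it can help to confirm utilization at the operating-system level, independent of the CLI. A minimal sketch using POSIX `df`; the helper names `disk_usage_pct` and `check_disk` are hypothetical, and the 90% default is an illustrative value rather than a documented threshold.

```bash
# Hypothetical helper: print the used-capacity percentage (integer, no % sign)
# of the filesystem that holds the given directory, using POSIX df output.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the volume is at or above a chosen threshold.
check_disk() {
  local dir="$1" limit="${2:-90}" pct
  pct=$(disk_usage_pct "$dir")
  # Bail out if df produced nothing usable (missing path, odd filesystem).
  case "$pct" in
    ''|*[!0-9]*) return 2 ;;
  esac
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN ${dir} at ${pct}% (limit ${limit}%)"
    return 1
  fi
  echo "OK ${dir} at ${pct}%"
}
```

For example, `check_disk /var/lib/stellaops/evidence 85` would check the restore target path used later in this runbook, assuming that is where your evidence volume is mounted.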
|
||||
|
||||
---
|
||||
|
||||
## Disaster Recovery
|
||||
|
||||
### DR-001: Full Evidence Locker Recovery
|
||||
|
||||
**Prerequisites:**
|
||||
- Backup available
|
||||
- Target storage provisioned
|
||||
- Recovery environment ready
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. Provision new storage:
|
||||
```bash
|
||||
stella evidence storage provision --size <size>
|
||||
```
|
||||
|
||||
2. Restore from backup:
|
||||
```bash
|
||||
# List available backups
|
||||
stella backup list --type evidence-locker
|
||||
|
||||
# Restore
|
||||
stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
|
||||
```
|
||||
|
||||
3. Verify restoration:
|
||||
```bash
|
||||
stella evidence verify --mode full
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
4. Update service configuration:
|
||||
```bash
|
||||
stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
|
||||
stella service restart
|
||||
```
|
||||
|
||||
### DR-002: Point-in-Time Recovery
|
||||
|
||||
For recovering to a specific point in time:
|
||||
|
||||
1. Identify target anchor:
|
||||
```bash
|
||||
stella evidence anchor list --before <timestamp>
|
||||
```
|
||||
|
||||
2. Restore to that point:
|
||||
```bash
|
||||
stella evidence restore --to-anchor <anchor-id>
|
||||
```
|
||||
|
||||
3. Verify integrity:
|
||||
```bash
|
||||
stella evidence verify --mode full --to-anchor <anchor-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Offline Mode Operations
|
||||
|
||||
### Preparing Offline Evidence Pack
|
||||
|
||||
```bash
|
||||
# Export evidence for specific artifact
|
||||
stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
|
||||
|
||||
# Export with all dependencies
|
||||
stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
|
||||
```
|
||||
|
||||
### Verifying Evidence Offline
|
||||
|
||||
```bash
|
||||
# Verify evidence pack without network
|
||||
stella evidence verify --offline --input evidence-pack.tar.gz
|
||||
|
||||
# Replay verdict using evidence
|
||||
stella replay --evidence evidence-pack.tar.gz --output verdict.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Evidence Locker
|
||||
|
||||
Key panels:
|
||||
- Artifact ingestion rate
|
||||
- Retrieval latency
|
||||
- Storage utilization trend
|
||||
- Integrity check status
|
||||
- Anchor chain health
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
For any incident:
|
||||
```bash
|
||||
stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Index status
|
||||
- Storage stats
|
||||
- Recent anchor chain
|
||||
- Integrity check results
|
||||
- Operation audit log
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Standard procedures, cleanup operations
|
||||
2. **L2 (Platform team):** Index rebuild, anchor issues
|
||||
3. **L3 (Architecture):** Chain recovery, DR procedures
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
183
docs/operations/runbooks/orchestrator-evidence-missing.md
Normal file
@@ -0,0 +1,183 @@
|
||||
# Runbook: Release Orchestrator - Required Evidence Not Found
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.evidence-availability` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion failing with "required evidence not found"
|
||||
- [ ] Alert `OrchestratorEvidenceMissing` firing
|
||||
- [ ] Gate evaluation blocked waiting for evidence
|
||||
- [ ] Error: "SBOM not found" or "attestation missing"
|
||||
- [ ] Evidence chain incomplete for artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Promotion blocked until evidence is generated |
|
||||
| **Data integrity** | Indicates missing security artifact - must be resolved |
|
||||
| **SLA impact** | Release blocked; compliance requirements not met |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.evidence-availability
|
||||
```
|
||||
|
||||
2. **List missing evidence for promotion:**
|
||||
```bash
|
||||
stella promotion evidence <promotion-id> --missing
|
||||
```
|
||||
|
||||
3. **Check what evidence exists for artifact:**
|
||||
```bash
|
||||
stella evidence list --artifact <digest>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check evidence chain completeness:**
|
||||
```bash
|
||||
stella evidence chain --artifact <digest> --verbose
|
||||
```
|
||||
Look for: Missing nodes in the chain
|
||||
|
||||
2. **Check if scan completed:**
|
||||
```bash
|
||||
stella scanner jobs list --artifact <digest>
|
||||
```
|
||||
Problem if: No completed scan or scan failed
|
||||
|
||||
3. **Check if attestation was created:**
|
||||
```bash
|
||||
stella attest list --subject <digest>
|
||||
```
|
||||
Problem if: No attestation or attestation failed
|
||||
|
||||
4. **Check evidence store health:**
|
||||
```bash
|
||||
stella evidence store health
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Generate missing SBOM:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --sbom-only
|
||||
```
|
||||
|
||||
2. **Generate missing attestation:**
|
||||
```bash
|
||||
stella attest create --subject <digest> --type slsa-provenance
|
||||
```
|
||||
|
||||
3. **Re-scan artifact to regenerate all evidence:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --force
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If scan never ran:**
|
||||
|
||||
1. Check why artifact wasn't scanned:
|
||||
```bash
|
||||
stella scanner queue list --artifact <digest>
|
||||
```
|
||||
|
||||
2. Configure automatic scanning on push:
|
||||
```bash
|
||||
stella scanner config set auto_scan.enabled true
|
||||
stella scanner config set auto_scan.triggers "push,promote"
|
||||
```
|
||||
|
||||
**If evidence was generated but not stored:**
|
||||
|
||||
1. Check evidence store connectivity:
|
||||
```bash
|
||||
stella evidence store health
|
||||
```
|
||||
|
||||
2. Retry evidence storage:
|
||||
```bash
|
||||
stella evidence retry-store --artifact <digest>
|
||||
```
|
||||
|
||||
**If attestation signing failed:**
|
||||
|
||||
1. Check attestor status:
|
||||
```bash
|
||||
stella attest status
|
||||
```
|
||||
|
||||
2. See `attestor-signing-failed.md` runbook
|
||||
|
||||
**If evidence expired or was deleted:**
|
||||
|
||||
1. Check evidence retention policy:
|
||||
```bash
|
||||
stella evidence policy show
|
||||
```
|
||||
|
||||
2. Regenerate evidence:
|
||||
```bash
|
||||
stella scan image --image <image-ref> --force
|
||||
stella attest create --subject <digest> --type slsa-provenance
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check all evidence now exists
|
||||
stella evidence list --artifact <digest>
|
||||
|
||||
# Verify evidence chain is complete
|
||||
stella evidence chain --artifact <digest>
|
||||
|
||||
# Retry promotion
|
||||
stella promotion retry <promotion-id>
|
||||
|
||||
# Verify promotion proceeds
|
||||
stella promotion status <promotion-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Auto-scan:** Enable automatic scanning for all pushed images
|
||||
- [ ] **Gates:** Configure evidence requirements clearly in promotion policy
|
||||
- [ ] **Monitoring:** Alert on evidence generation failures
|
||||
- [ ] **Retention:** Set appropriate evidence retention periods
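
The gate requirement above amounts to a set comparison: the evidence kinds a policy requires versus the kinds that exist for an artifact. A minimal sketch of that comparison follows. The helper name `missing_evidence` and the labels `sbom` and `attestation` are illustrative assumptions; the output format of `stella evidence list` is not specified here, so the sketch takes a plain list of kinds, one per line, on standard input.

```bash
# Hypothetical helper: read the evidence kinds that exist (one per line)
# from stdin and print every required kind (given as arguments) that is
# absent. Returns non-zero when something is missing.
missing_evidence() {
  local present kind rc=0
  present=$(cat)
  for kind in "$@"; do
    if ! printf '%s\n' "$present" | grep -qxF "$kind"; then
      echo "$kind"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `printf 'sbom\n' | missing_evidence sbom attestation` prints `attestation` and returns a non-zero status, which a CI step could use to fail early before a promotion is requested.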
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/evidence-locker/architecture.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `attestor-signing-failed.md`
|
||||
- **Evidence requirements:** `docs/operations/evidence-requirements.md`
|
||||
178
docs/operations/runbooks/orchestrator-gate-timeout.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Runbook: Release Orchestrator - Gate Evaluation Timeout
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.gate-timeout` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion gates timing out before completing evaluation
|
||||
- [ ] Alert `OrchestratorGateTimeout` firing
|
||||
- [ ] Error: "gate evaluation timeout exceeded"
|
||||
- [ ] Promotion stuck waiting for gate response
|
||||
- [ ] Metric `orchestrator_gate_timeout_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
|
||||
| **Data integrity** | No data loss; promotion can be retried |
|
||||
| **SLA impact** | Release SLO violated if timeout persists |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.gate-timeout
|
||||
```
|
||||
|
||||
2. **Identify timed-out gates:**
|
||||
```bash
|
||||
stella promotion gates <promotion-id> --status timeout
|
||||
```
|
||||
|
||||
3. **Check gate service health:**
|
||||
```bash
|
||||
stella orch gate-services status
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check specific gate latency:**
|
||||
```bash
|
||||
stella orch gate stats --gate <gate-name> --last 1h
|
||||
```
|
||||
Look for: P95 latency, timeout rate
|
||||
|
||||
2. **Check external service connectivity:**
|
||||
```bash
|
||||
stella orch connectivity --gate <gate-name>
|
||||
```
|
||||
|
||||
3. **Check gate evaluation logs:**
|
||||
```bash
|
||||
stella orch logs --gate <gate-name> --promotion <promotion-id>
|
||||
```
|
||||
Look for: Slow queries, external API delays
|
||||
|
||||
4. **Check policy engine latency (for policy gates):**
|
||||
```bash
|
||||
stella policy stats --last 10m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase timeout for specific gate:**
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.timeout 5m
|
||||
stella orch reload
|
||||
```
|
||||
|
||||
2. **Skip the timed-out gate (requires approval):**
|
||||
```bash
|
||||
stella promotion gate skip <promotion-id> <gate-name> \
|
||||
--reason "External service timeout - approved by <approver>"
|
||||
```
|
||||
|
||||
3. **Retry the promotion:**
|
||||
```bash
|
||||
stella promotion retry <promotion-id>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If external service is slow:**
|
||||
|
||||
1. Configure gate retry with backoff:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.retries 3
|
||||
stella orch config set gates.<gate-name>.retry_backoff 5s
|
||||
```
|
||||
|
||||
2. Enable gate result caching:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.cache_ttl 5m
|
||||
```
|
||||
|
||||
3. Configure circuit breaker:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.circuit_breaker.enabled true
|
||||
stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
|
||||
```
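
To make the retry settings in step 1 concrete, here is what retry with backoff means in general. This is a sketch, not the orchestrator's implementation: `retry_with_backoff` is a hypothetical helper, and it assumes the delay doubles after each failure. The source does not say whether `retry_backoff` is a fixed or a growing delay.

```bash
# Hypothetical helper: run a command, retrying up to <retries> times after
# the first failure. The delay (whole seconds) doubles after each failure;
# the doubling is an assumption made for this sketch.
retry_with_backoff() {
  local retries="$1" delay="$2" attempt=0
  shift 2
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -gt "$retries" ]; then
      return 1   # retries exhausted
    fi
    sleep "$delay"
    delay=$((delay * 2))
  done
  return 0
}
```

With `retry_with_backoff 3 5 some-command`, the command runs at most four times, waiting 5, 10, and 20 seconds between attempts. That is up to 35 seconds of waiting on top of the command's own run time, which matters when choosing a gate timeout.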
|
||||
|
||||
**If policy evaluation is slow:**
|
||||
|
||||
1. Optimize policy (see `policy-evaluation-slow.md` runbook)
|
||||
|
||||
2. Increase policy worker count:
|
||||
```bash
|
||||
stella policy config set opa.workers 4
|
||||
```
|
||||
|
||||
**If evidence retrieval is slow:**
|
||||
|
||||
1. Enable evidence pre-fetching:
|
||||
```bash
|
||||
stella orch config set gates.evidence_prefetch true
|
||||
```
|
||||
|
||||
2. Increase evidence cache:
|
||||
```bash
|
||||
stella orch config set evidence.cache_size 1000
|
||||
stella orch config set evidence.cache_ttl 10m
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry promotion
|
||||
stella promotion retry <promotion-id>
|
||||
|
||||
# Monitor gate evaluation
|
||||
stella promotion gates <promotion-id> --watch
|
||||
|
||||
# Check gate latency improved
|
||||
stella orch gate stats --gate <gate-name> --last 10m
|
||||
|
||||
# Verify no timeouts
|
||||
stella orch logs --filter "timeout" --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
|
||||
- [ ] **Monitoring:** Alert on gate P95 latency > 1m
|
||||
- [ ] **Caching:** Enable caching for slow gates
|
||||
- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
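
The P95 threshold in the monitoring item is easier to reason about with a concrete calculation. A minimal sketch, assuming you have exported one latency sample per line, in seconds; the export step is not shown and depends on your metrics tooling. `p95` is a hypothetical helper that uses the nearest-rank method.

```bash
# Hypothetical helper: read numeric samples (one per line) from stdin and
# print the 95th percentile using the nearest-rank method.
# Exits non-zero when there are no samples.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      r = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) in integer math
      print v[r]
    }'
}
```

A result above 60 would breach the 1m alert threshold named above. Nearest-rank always returns one of the observed samples, so small sample sets give coarse results.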
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/gates.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Gate Latency
|
||||
168
docs/operations/runbooks/orchestrator-promotion-stuck.md
Normal file
@@ -0,0 +1,168 @@
|
||||
# Runbook: Release Orchestrator - Promotion Job Not Progressing
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.job-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion job stuck in "in_progress" state for >10 minutes
|
||||
- [ ] No progress updates in promotion timeline
|
||||
- [ ] Alert `OrchestratorPromotionStuck` firing
|
||||
- [ ] UI shows promotion spinner indefinitely
|
||||
- [ ] Downstream environment not receiving promoted artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Release blocked, cannot promote to target environment |
|
||||
| **Data integrity** | Artifact is safe; promotion can be retried |
|
||||
| **SLA impact** | Release SLO violated if not resolved within 30 minutes |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.job-health
|
||||
```
|
||||
|
||||
2. **Check promotion status:**
|
||||
```bash
|
||||
stella promotion status <promotion-id>
|
||||
```
|
||||
Look for: Current step, last update time, any error messages
|
||||
|
||||
3. **Check orchestrator service:**
|
||||
```bash
|
||||
stella orch status
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Get detailed promotion trace:**
|
||||
```bash
|
||||
stella promotion trace <promotion-id> --verbose
|
||||
```
|
||||
Look for: Which step is stuck, any timeouts
|
||||
|
||||
2. **Check gate evaluation status:**
|
||||
```bash
|
||||
stella promotion gates <promotion-id>
|
||||
```
|
||||
Problem if: Gate stuck waiting for external service
|
||||
|
||||
3. **Check target environment connectivity:**
|
||||
```bash
|
||||
stella orch connectivity --target <env-name>
|
||||
```
|
||||
|
||||
4. **Check for lock contention:**
|
||||
```bash
|
||||
stella orch locks list
|
||||
```
|
||||
Problem if: Stale locks on the artifact or environment
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If gate is stuck waiting for external service:**
|
||||
```bash
|
||||
# Skip the stuck gate (requires approval)
|
||||
stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
|
||||
```
|
||||
|
||||
2. **If lock is stale:**
|
||||
```bash
|
||||
# Release the lock (use with caution)
|
||||
stella orch locks release <lock-id> --force
|
||||
```
|
||||
|
||||
3. **If orchestrator is unresponsive:**
|
||||
```bash
|
||||
stella service restart orchestrator
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If external gate service is slow:**
|
||||
|
||||
1. Increase gate timeout:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.timeout 5m
|
||||
```
|
||||
|
||||
2. Configure gate retry:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.retries 3
|
||||
```
|
||||
|
||||
**If target environment is unreachable:**
|
||||
|
||||
1. Check network connectivity to target
|
||||
2. Verify credentials for target environment:
|
||||
```bash
|
||||
stella orch credentials verify --target <env-name>
|
||||
```
|
||||
|
||||
**If there is database lock contention:**
|
||||
|
||||
1. Increase lock timeout:
|
||||
```bash
|
||||
stella orch config set locks.timeout 60s
|
||||
```
|
||||
|
||||
2. Enable optimistic locking:
|
||||
```bash
|
||||
stella orch config set locks.mode optimistic
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check promotion completed
|
||||
stella promotion status <promotion-id>
|
||||
|
||||
# Verify artifact in target environment
|
||||
stella orch artifacts list --env <target-env> --filter <artifact-digest>
|
||||
|
||||
# Check no stuck promotions
|
||||
stella promotion list --status in_progress --older-than 5m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Timeouts:** Configure appropriate timeouts for all gates
|
||||
- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
|
||||
- [ ] **Health checks:** Enable connectivity pre-checks before promotion
|
||||
- [ ] **Documentation:** Document SLAs for external gate services
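
The stuck-promotion alert above reduces to an age comparison on the last update time. A minimal sketch of that rule follows; `is_stuck` is a hypothetical helper, and how you obtain the last-update timestamp (for example from promotion status output) is left open because that format is not specified here.

```bash
# Hypothetical helper: succeed (exit 0) when the last update is older than
# the allowed age. All arguments are integers in seconds (epoch times for
# the first two). The default of 600 s matches the 10-minute threshold.
is_stuck() {
  local last_update="$1" now="$2" max_age="${3:-600}"
  [ $(( now - last_update )) -gt "$max_age" ]
}
```

For example, `is_stuck "$last_update_epoch" "$(date +%s)" && echo "promotion looks stuck"`, where `last_update_epoch` is a variable you populate yourself.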
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
|
||||
- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Release Orchestrator
|
||||
189
docs/operations/runbooks/orchestrator-quota-exceeded.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Release Orchestrator - Promotion Quota Exhausted
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.quota-status` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotions failing with "quota exceeded"
|
||||
- [ ] Alert `OrchestratorQuotaExceeded` firing
|
||||
- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
|
||||
- [ ] New promotions being rejected
|
||||
- [ ] Queued promotions not processing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New releases blocked until quota resets or is increased |
|
||||
| **Data integrity** | No data loss; promotions queued for later |
|
||||
| **SLA impact** | Release frequency SLO may be violated |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.quota-status
|
||||
```
|
||||
|
||||
2. **Check current quota usage:**
|
||||
```bash
|
||||
stella orch quota status
|
||||
```
|
||||
|
||||
3. **Check quota limits:**
|
||||
```bash
|
||||
stella orch quota limits show
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check promotion history:**
|
||||
```bash
|
||||
stella promotion list --last 24h --count
|
||||
```
|
||||
Look for: Unusual spike in promotions
|
||||
|
||||
2. **Check per-environment quotas:**
|
||||
```bash
|
||||
stella orch quota status --by-environment
|
||||
```
|
||||
|
||||
3. **Check for runaway automation:**
|
||||
```bash
|
||||
stella promotion list --last 1h --by-actor
|
||||
```
|
||||
Problem if: Single actor/service making many promotions
|
||||
|
||||
4. **Check when quota resets:**
|
||||
```bash
|
||||
stella orch quota reset-time
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Request temporary quota increase:**
|
||||
```bash
|
||||
stella orch quota request-increase --amount 50 --reason "Release deadline"
|
||||
```
|
||||
|
||||
2. **Prioritize critical promotions:**
|
||||
```bash
|
||||
stella promotion priority set <promotion-id> high
|
||||
```
|
||||
|
||||
3. **Cancel unnecessary queued promotions:**
|
||||
```bash
|
||||
stella promotion list --status queued
|
||||
stella promotion cancel <promotion-id>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If legitimate high volume:**
|
||||
|
||||
1. Increase quota limits:
|
||||
```bash
|
||||
stella orch quota limits set --daily 200 --hourly 50
|
||||
```
|
||||
|
||||
2. Increase per-environment limits:
|
||||
```bash
|
||||
stella orch quota limits set --env production --daily 50
|
||||
```
|
||||
|
||||
**If runaway automation:**
|
||||
|
||||
1. Identify the source:
|
||||
```bash
|
||||
stella promotion list --last 1h --by-actor --verbose
|
||||
```
|
||||
|
||||
2. Revoke or rate-limit the service account:
|
||||
```bash
|
||||
stella auth rate-limit set <service-account> --promotions-per-hour 10
|
||||
```
|
||||
|
||||
3. Fix the automation bug
|
||||
|
||||
**If promotion retries are causing a spike:**
|
||||
|
||||
1. Check for failing promotions causing retries:
|
||||
```bash
|
||||
stella promotion list --status failed --last 24h
|
||||
```
|
||||
|
||||
2. Fix underlying promotion failures (see other runbooks)
|
||||
|
||||
3. Configure retry limits:
|
||||
```bash
|
||||
stella orch config set promotion.max_retries 3
|
||||
stella orch config set promotion.retry_backoff 5m
|
||||
```
|
||||
|
||||
**If quota is too restrictive for workload:**
|
||||
|
||||
1. Analyze actual promotion patterns:
|
||||
```bash
|
||||
stella orch quota analyze --last 30d
|
||||
```
|
||||
|
||||
2. Adjust quotas based on analysis:
|
||||
```bash
|
||||
stella orch quota limits set --daily <recommended>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check quota status
|
||||
stella orch quota status
|
||||
|
||||
# Verify promotions processing
|
||||
stella promotion list --status in_progress
|
||||
|
||||
# Test new promotion
|
||||
stella promotion create --test --dry-run
|
||||
|
||||
# Check no quota errors
|
||||
stella orch logs --filter "quota" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Monitoring:** Alert at 80% quota usage
|
||||
- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
|
||||
- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
|
||||
- [ ] **Review:** Regularly review and adjust quotas based on usage patterns
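
The 80% alert rule above can be expressed as simple integer arithmetic. A minimal sketch, assuming you already have the used and limit counts as plain numbers; `quota_state` is a hypothetical helper and does not parse any Stella CLI output.

```bash
# Hypothetical helper: classify quota usage as ok, warn, or exceeded.
# Arguments: used count, limit, optional warning percentage (default 80).
quota_state() {
  local used="$1" limit="$2" warn_pct="${3:-80}"
  if [ "$limit" -le 0 ]; then
    echo "invalid"
    return 2
  fi
  local pct=$(( used * 100 / limit ))   # integer percentage, rounded down
  if [ "$used" -ge "$limit" ]; then
    echo "exceeded ${pct}%"
  elif [ "$pct" -ge "$warn_pct" ]; then
    echo "warn ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}
```

With the daily limit of 200 used earlier in this runbook, `quota_state 160 200` prints `warn 80%`, so the warning fires with 40 promotions of headroom left.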
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`
|
||||
- **Quota management:** `docs/operations/quota-management.md`
|
||||
189
docs/operations/runbooks/orchestrator-rollback-failed.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Release Orchestrator - Rollback Operation Failed
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.rollback-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Rollback operation failing or stuck
|
||||
- [ ] Alert `OrchestratorRollbackFailed` firing
|
||||
- [ ] Error: "rollback failed" or "cannot restore previous version"
|
||||
- [ ] Target environment in inconsistent state
|
||||
- [ ] Previous artifact not available for deployment
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Rollback blocked; potentially broken release in production |
|
||||
| **Data integrity** | Environment may be in partial rollback state |
|
||||
| **SLA impact** | Incident resolution blocked; extended outage |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.rollback-health
|
||||
```
|
||||
|
||||
2. **Check rollback status:**
|
||||
```bash
|
||||
stella rollback status <rollback-id>
|
||||
```
|
||||
|
||||
3. **Check previous deployment history:**
|
||||
```bash
|
||||
stella orch deployments list --env <env-name> --last 10
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check why rollback failed:**
|
||||
```bash
|
||||
stella rollback trace <rollback-id> --verbose
|
||||
```
|
||||
Look for: Which step failed, error message
|
||||
|
||||
2. **Check previous artifact availability:**
|
||||
```bash
|
||||
stella orch artifacts get <previous-digest> --check
|
||||
```
|
||||
Problem if: Artifact deleted, not in registry
|
||||
|
||||
3. **Check environment state:**
|
||||
```bash
|
||||
stella orch env status <env-name> --detailed
|
||||
```
|
||||
|
||||
4. **Check for deployment locks:**
|
||||
```bash
|
||||
stella orch locks list --env <env-name>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Force release lock if stuck:**
|
||||
```bash
|
||||
stella orch locks release --env <env-name> --force
|
||||
```
|
||||
|
||||
2. **Manual rollback using specific artifact:**
|
||||
```bash
|
||||
stella deploy --env <env-name> --artifact <previous-digest> --force
|
||||
```
|
||||
|
||||
3. **If artifact unavailable, deploy last known good:**
|
||||
```bash
|
||||
stella orch deployments list --env <env-name> --status success
|
||||
stella deploy --env <env-name> --artifact <last-good-digest>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If previous artifact not in registry:**
|
||||
|
||||
1. Check artifact retention policy:
|
||||
```bash
|
||||
stella registry retention show
|
||||
```
|
||||
|
||||
2. Restore from backup registry:
|
||||
```bash
|
||||
stella registry restore --artifact <digest> --from backup
|
||||
```
|
||||
|
||||
3. Increase artifact retention:
|
||||
```bash
|
||||
stella registry retention set --min-versions 10
|
||||
```
|
||||
|
||||
**If deployment service unavailable:**
|
||||
|
||||
1. Check deployment target connectivity:
|
||||
```bash
|
||||
stella orch connectivity --target <env-name>
|
||||
```
|
||||
|
||||
2. Check deployment agent status:
|
||||
```bash
|
||||
stella orch agent status --env <env-name>
|
||||
```
|
||||
|
||||
**If configuration drift:**
|
||||
|
||||
1. Check environment configuration:
|
||||
```bash
|
||||
stella orch env config diff <env-name>
|
||||
```
|
||||
|
||||
2. Reset environment to known state:
|
||||
```bash
|
||||
stella orch env reset <env-name> --to-baseline
|
||||
```
|
||||
|
||||
**If database state is inconsistent:**
|
||||
|
||||
1. Check orchestrator database:
|
||||
```bash
|
||||
stella orch db verify
|
||||
```
|
||||
|
||||
2. Repair deployment state:
|
||||
```bash
|
||||
stella orch repair --deployment <deployment-id>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify rollback completed
|
||||
stella rollback status <rollback-id>
|
||||
|
||||
# Verify environment state
|
||||
stella orch env status <env-name>
|
||||
|
||||
# Verify correct version deployed
|
||||
stella orch deployments current --env <env-name>
|
||||
|
||||
# Health check the environment
|
||||
stella orch health-check --env <env-name>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Retention:** Maintain at least 5 previous versions in registry
|
||||
- [ ] **Testing:** Test rollback procedure in staging regularly
|
||||
- [ ] **Monitoring:** Alert on rollback failures immediately
|
||||
- [ ] **Documentation:** Document manual rollback procedures per environment
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
|
||||
- **Rollback procedures:** `docs/operations/rollback-procedures.md`
|
||||
189
docs/operations/runbooks/policy-compilation-failed.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Policy Engine - Rego Compilation Errors
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.compilation-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy deployment failing with "compilation error"
|
||||
- [ ] Alert `PolicyCompilationFailed` firing
|
||||
- [ ] Error: "rego_parse_error" or "rego_type_error"
|
||||
- [ ] New policies not taking effect
|
||||
- [ ] OPA rejecting policy bundle
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New policies cannot be deployed; evaluations continue with stale policies |
|
||||
| **Data integrity** | Existing policies continue to work; new rules not enforced |
|
||||
| **SLA impact** | Policy updates blocked; security posture may be outdated |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.compilation-health
|
||||
```
|
||||
|
||||
2. **Check policy compilation status:**
|
||||
```bash
|
||||
stella policy status --compilation
|
||||
```
|
||||
|
||||
3. **Validate specific policy:**
|
||||
```bash
|
||||
stella policy validate --file <policy-file>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Get detailed compilation errors:**
|
||||
```bash
|
||||
stella policy compile --verbose
|
||||
```
|
||||
Look for: Line numbers, error types, undefined references
|
||||
|
||||
2. **Check for syntax errors:**
|
||||
```bash
|
||||
stella policy lint --file <policy-file>
|
||||
```
|
||||
|
||||
3. **Check for type errors:**
|
||||
```bash
|
||||
stella policy typecheck --file <policy-file>
|
||||
```
|
||||
|
||||
4. **Check OPA version compatibility:**
|
||||
```bash
|
||||
stella policy opa version
|
||||
stella policy check-compat --file <policy-file>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Rollback to last working policy:**
|
||||
```bash
|
||||
stella policy rollback --to-last-good
|
||||
```
|
||||
|
||||
2. **Disable the failing policy:**
|
||||
```bash
|
||||
stella policy disable <policy-id>
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. **Use previous bundle:**
|
||||
```bash
|
||||
stella policy bundle load --version <previous-version>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If syntax error:**
|
||||
|
||||
1. Get exact error location:
|
||||
```bash
|
||||
stella policy validate --file <policy-file> --show-line
|
||||
```
|
||||
|
||||
2. Common syntax issues:
|
||||
- Missing brackets or braces
|
||||
- Invalid rule head syntax
|
||||
- Incorrect import statements
|
||||
|
||||
3. Fix and re-validate:
|
||||
```bash
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
```
|
||||
|
||||
**If undefined reference:**
|
||||
|
||||
1. Check for missing imports:
|
||||
```bash
|
||||
stella policy analyze --file <policy-file> --show-imports
|
||||
```
|
||||
|
||||
2. Verify data references exist:
|
||||
```bash
|
||||
stella policy data show
|
||||
```
|
||||
|
||||
3. Add missing imports or data definitions
|
||||
|
||||
**If type error:**
|
||||
|
||||
1. Check type mismatches:
|
||||
```bash
|
||||
stella policy typecheck --file <policy-file> --verbose
|
||||
```
|
||||
|
||||
2. Common type issues:
|
||||
- Comparing incompatible types
|
||||
- Invalid function arguments
|
||||
- Missing type annotations
|
||||
|
||||
**If OPA version incompatibility:**
|
||||
|
||||
1. Check Rego version features used:
|
||||
```bash
|
||||
stella policy analyze --file <policy-file> --show-features
|
||||
```
|
||||
|
||||
2. Update policy to use compatible features or upgrade OPA
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Validate fixed policy
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
|
||||
# Test policy compilation
|
||||
stella policy compile --file <fixed-policy.rego>
|
||||
|
||||
# Deploy policy
|
||||
stella policy deploy --file <fixed-policy.rego>
|
||||
|
||||
# Test policy evaluation
|
||||
stella policy evaluate --test
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **CI/CD:** Add policy validation to CI pipeline before deployment
|
||||
- [ ] **Linting:** Run `stella policy lint` on all policy changes
|
||||
- [ ] **Testing:** Write unit tests for policies with `stella policy test`
|
||||
- [ ] **Staging:** Deploy to staging environment before production
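
The CI/CD and linting items above could be combined into a single pipeline step. This is a sketch under stated assumptions: `check_policies` is a hypothetical helper, it only looks at `*.rego` files directly inside the given directory (no recursion), and it assumes `stella policy validate` and `stella policy lint` exit non-zero when they find errors.

```bash
# Hypothetical CI helper: validates and lints every .rego file in a directory.
# Assumption: both stella commands exit non-zero when they report errors.
check_policies() {
  local dir="${1:-.}" failed=0 file
  for file in "$dir"/*.rego; do
    [ -e "$file" ] || continue   # the glob matched nothing
    if ! stella policy validate --file "$file"; then
      echo "validate failed: $file" >&2
      failed=1
    fi
    if ! stella policy lint --file "$file"; then
      echo "lint failed: $file" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The loop keeps going after a failure so that one CI run reports every broken file, then returns non-zero to block the deployment.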
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-evaluation-slow.md`
|
||||
- **Rego reference:** https://www.openpolicyagent.org/docs/latest/policy-language/
|
||||
- **Policy testing:** `docs/modules/policy/testing.md`
|
||||
174
docs/operations/runbooks/policy-evaluation-slow.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Policy Engine - Evaluation Latency High
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.evaluation-latency` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
|
||||
- [ ] Gate decisions timing out in CI/CD pipelines
|
||||
- [ ] Alert `PolicyEvaluationSlow` firing
|
||||
- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
|
||||
- [ ] Users report "policy check taking too long"
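
The warning and critical thresholds in the first symptom can be expressed as a small classifier, which is handy when triaging timings by hand. This is an illustrative sketch: `classify_latency_ms` is a hypothetical name, and it expects a whole number of milliseconds that you supply yourself.

```bash
# Hypothetical helper: maps an evaluation time in milliseconds to the
# severity levels listed under Symptoms (>500ms warning, >2s critical).
classify_latency_ms() {
  local ms="$1"
  if [ "$ms" -gt 2000 ]; then
    echo critical
  elif [ "$ms" -gt 500 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Both thresholds are strict, matching the ">" in the symptom: exactly 500 ms is still `ok` and exactly 2000 ms is still `warning`.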
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
|
||||
| **Data integrity** | No data loss; decisions are still correct |
|
||||
| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.evaluation-latency
|
||||
```
|
||||
|
||||
2. **Check policy engine status:**
|
||||
```bash
|
||||
stella policy status
|
||||
```
|
||||
|
||||
3. **Check recent evaluation times:**
|
||||
```bash
|
||||
stella policy stats --last 10m
|
||||
```
|
||||
Look for: P95 latency, cache hit rate
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Profile a slow evaluation:**
|
||||
```bash
|
||||
stella policy evaluate --image <image-ref> --profile
|
||||
```
|
||||
Look for: Which phase is slowest (parse, compile, execute)
|
||||
|
||||
2. **Check OPA compilation cache:**
|
||||
```bash
|
||||
stella policy cache stats
|
||||
```
|
||||
Problem if: Cache hit rate < 90%
|
||||
|
||||
3. **Check policy complexity:**
|
||||
```bash
|
||||
stella policy analyze --complexity
|
||||
```
|
||||
Problem if: Cyclomatic complexity > 50 or rule count > 200
|
||||
|
||||
4. **Check external data fetches:**
|
||||
```bash
|
||||
stella policy logs --filter "external fetch" --level debug
|
||||
```
|
||||
Problem if: Many external fetches or slow responses
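
The numeric "Problem if" thresholds from steps 2 and 3 can be checked together. This is a sketch only: `check_policy_thresholds` is a hypothetical helper, and because this runbook does not specify the output format of `stella policy cache stats` or `stella policy analyze --complexity`, it takes the three numbers as arguments rather than parsing CLI output.

```bash
# Hypothetical helper: applies the "Problem if" thresholds from steps 2 and 3.
# Arguments: cache hit rate (%), cyclomatic complexity, rule count.
# The values are read manually from the CLI output; no parsing is attempted.
check_policy_thresholds() {
  local hit_rate="$1" complexity="$2" rule_count="$3"
  awk -v h="$hit_rate" -v c="$complexity" -v r="$rule_count" 'BEGIN {
    problems = 0
    if (h < 90)  { print "cache hit rate below 90%"; problems++ }
    if (c > 50)  { print "cyclomatic complexity above 50"; problems++ }
    if (r > 200) { print "rule count above 200"; problems++ }
    exit (problems > 0)
  }'
}
```

`awk` handles the comparison so that fractional hit rates such as `87.5` work; the function returns non-zero when any threshold is crossed.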
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Clear and warm the compilation cache:**
|
||||
```bash
|
||||
stella policy cache clear
|
||||
stella policy cache warm
|
||||
```
|
||||
|
||||
2. **Increase OPA worker count:**
|
||||
```bash
|
||||
stella policy config set opa.workers 4
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. **Enable evaluation result caching:**
|
||||
```bash
|
||||
stella policy config set cache.evaluation_ttl 60s
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If policy is too complex:**
|
||||
|
||||
1. Analyze and simplify policy:
|
||||
```bash
|
||||
stella policy analyze --suggest-optimizations
|
||||
```
|
||||
|
||||
2. Split large policies into modules:
|
||||
```bash
|
||||
stella policy refactor --auto-split
|
||||
```
|
||||
|
||||
**If external data fetches are slow:**
|
||||
|
||||
1. Increase external data cache TTL:
|
||||
```bash
|
||||
stella policy config set external_data.cache_ttl 5m
|
||||
```
|
||||
|
||||
2. Pre-fetch external data:
|
||||
```bash
|
||||
stella policy external-data prefetch
|
||||
```
|
||||
|
||||
**If Rego compilation is slow:**
|
||||
|
||||
1. Enable partial evaluation:
|
||||
```bash
|
||||
stella policy config set opa.partial_eval true
|
||||
```
|
||||
|
||||
2. Pre-compile policies:
|
||||
```bash
|
||||
stella policy compile --all
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Run evaluation and check latency
|
||||
stella policy evaluate --image <image-ref> --timing
|
||||
|
||||
# Check P95 latency
|
||||
stella policy stats --last 5m
|
||||
|
||||
# Verify cache is effective
|
||||
stella policy cache stats
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Review:** Review policy complexity before deployment
|
||||
- [ ] **Monitoring:** Alert on P95 latency > 300ms
|
||||
- [ ] **Caching:** Ensure evaluation cache is enabled
|
||||
- [ ] **Pre-warming:** Add cache warming to deployment pipeline
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Policy Engine
|
||||
205
docs/operations/runbooks/policy-opa-crash.md
Normal file
@@ -0,0 +1,205 @@
|
||||
# Runbook: Policy Engine - OPA Process Crashed
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.opa-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluations failing with "OPA unavailable" error
|
||||
- [ ] Alert `PolicyOPACrashed` firing
|
||||
- [ ] OPA process exited unexpectedly
|
||||
- [ ] Error: "connection refused" when connecting to OPA
|
||||
- [ ] Metric `policy_opa_restarts_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | All policy evaluations fail; gate decisions blocked |
|
||||
| **Data integrity** | No data loss; decisions delayed until OPA recovers |
|
||||
| **SLA impact** | Gate latency SLO violated; release pipeline blocked |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.opa-health
|
||||
```
|
||||
|
||||
2. **Check OPA process status:**
|
||||
```bash
|
||||
stella policy status
|
||||
```
|
||||
Look for: OPA process state, restart count
|
||||
|
||||
3. **Check OPA logs for crash reason:**
|
||||
```bash
|
||||
stella policy opa logs --last 30m --level error
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check OPA memory usage before crash:**
|
||||
```bash
|
||||
stella policy stats --opa-metrics
|
||||
```
|
||||
Problem if: Memory usage near limit before crash
|
||||
|
||||
2. **Check for problematic policy:**
|
||||
```bash
|
||||
stella policy list --last-error
|
||||
```
|
||||
Look for: Policies that caused evaluation errors
|
||||
|
||||
3. **Check OPA configuration:**
|
||||
```bash
|
||||
stella policy opa config show
|
||||
```
|
||||
Look for: Invalid configuration, missing bundles
|
||||
|
||||
4. **Check for infinite loops in Rego:**
|
||||
```bash
|
||||
stella policy analyze --detect-loops
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Restart OPA process:**
|
||||
```bash
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
2. **If OPA keeps crashing, start in safe mode:**
|
||||
```bash
|
||||
stella policy opa start --safe-mode
|
||||
```
|
||||
Note: Safe mode disables custom policies
|
||||
|
||||
3. **Enable fail-open temporarily (only if your compliance requirements allow it):**
|
||||
```bash
|
||||
stella policy config set failopen true
|
||||
stella policy reload
|
||||
```
|
||||
**Warning:** Only use if compliance allows fail-open mode
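
Steps 1 and 2 form an escalation: restart first, and fall back to safe mode only if OPA is still unhealthy. A minimal sketch of that sequence follows. `recover_opa` is a hypothetical helper, and it assumes `stella policy opa health` exits non-zero while OPA is unhealthy. Fail-open is deliberately left out because it needs a compliance decision.

```bash
# Hypothetical helper: restart OPA, then fall back to safe mode if needed.
# Assumption: `stella policy opa health` exits non-zero while OPA is unhealthy.
recover_opa() {
  stella policy opa restart
  if stella policy opa health; then
    echo "OPA healthy after restart"
    return 0
  fi
  echo "OPA still unhealthy; starting in safe mode (custom policies disabled)" >&2
  stella policy opa start --safe-mode || return 1
  # The final health check decides the function's return status.
  stella policy opa health
}
```

Depending on how long OPA takes to come up, a short wait before each health check may be needed; the sketch omits it.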
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If OOM killed:**
|
||||
|
||||
1. Increase OPA memory limit:
|
||||
```bash
|
||||
stella policy opa config set memory_limit 2Gi
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
2. Enable garbage collection tuning:
|
||||
```bash
|
||||
stella policy opa config set gc_min_heap_size 256Mi
|
||||
stella policy opa config set gc_max_heap_size 1Gi
|
||||
```
|
||||
|
||||
**If policy caused crash:**
|
||||
|
||||
1. Identify problematic policy:
|
||||
```bash
|
||||
stella policy list --status error
|
||||
```
|
||||
|
||||
2. Disable the problematic policy:
|
||||
```bash
|
||||
stella policy disable <policy-id>
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. Fix and re-enable:
|
||||
```bash
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
stella policy update <policy-id> --file <fixed-policy.rego>
|
||||
stella policy enable <policy-id>
|
||||
```
|
||||
|
||||
**If bundle loading failed:**
|
||||
|
||||
1. Check bundle integrity:
|
||||
```bash
|
||||
stella policy bundle verify
|
||||
```
|
||||
|
||||
2. Rebuild bundle:
|
||||
```bash
|
||||
stella policy bundle build --output bundle.tar.gz
|
||||
stella policy bundle load bundle.tar.gz
|
||||
```
|
||||
|
||||
**If configuration issue:**
|
||||
|
||||
1. Reset to default configuration:
|
||||
```bash
|
||||
stella policy opa config reset
|
||||
```
|
||||
|
||||
2. Reconfigure with validated settings:
|
||||
```bash
|
||||
stella policy opa config set workers 4
|
||||
stella policy opa config set decision_log true
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check OPA is running
|
||||
stella policy status
|
||||
|
||||
# Check OPA health
|
||||
stella policy opa health
|
||||
|
||||
# Test policy evaluation
|
||||
stella policy evaluate --test
|
||||
|
||||
# Check no crashes in recent logs
|
||||
stella policy opa logs --level error --last 30m
|
||||
|
||||
# Monitor stability
|
||||
stella policy stats --watch
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Resources:** Set appropriate memory limits based on policy complexity
|
||||
- [ ] **Validation:** Validate all policies before deployment
|
||||
- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
|
||||
- [ ] **Testing:** Load test policies before production deployment
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
|
||||
- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/
|
||||
178
docs/operations/runbooks/policy-storage-unavailable.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Runbook: Policy Engine - Policy Storage Backend Down
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.storage-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy operations failing with "storage unavailable"
|
||||
- [ ] Alert `PolicyStorageUnavailable` firing
|
||||
- [ ] Error: "failed to connect to policy store" or "database connection refused"
|
||||
- [ ] Policy updates not persisting
|
||||
- [ ] OPA unable to load bundles from storage
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Policy updates fail; cached policies may still work |
|
||||
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
|
||||
| **SLA impact** | Policy management blocked; evaluations use cached data |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.storage-health
|
||||
```
|
||||
|
||||
2. **Check storage connectivity:**
|
||||
```bash
|
||||
stella policy storage status
|
||||
```
|
||||
|
||||
3. **Check database health:**
|
||||
```bash
|
||||
stella db status --component policy
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check PostgreSQL connectivity:**
|
||||
```bash
|
||||
stella db ping --database policy
|
||||
```
|
||||
|
||||
2. **Check connection pool status:**
|
||||
```bash
|
||||
stella db pool-status --database policy
|
||||
```
|
||||
Problem if: Pool exhausted, connections timing out
|
||||
|
||||
3. **Check storage logs:**
|
||||
```bash
|
||||
stella policy logs --filter "storage" --level error --last 30m
|
||||
```
|
||||
|
||||
4. **Check disk space (if local storage):**
|
||||
```bash
|
||||
stella policy storage disk-usage
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Enable read-only mode (use cached policies):**
|
||||
```bash
|
||||
stella policy config set storage.read_only true
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
2. **Switch to backup storage:**
|
||||
```bash
|
||||
stella policy storage failover --to backup
|
||||
```
|
||||
|
||||
3. **Restart policy service to reconnect:**
|
||||
```bash
|
||||
stella service restart policy-engine
|
||||
```
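
After a failover or service restart, the storage backend may take a few seconds to accept connections, so a single connectivity check can fail spuriously. The following generic retry wrapper is a sketch, not a Stella feature; `retry` is a hypothetical helper that re-runs any command until it succeeds or the attempts run out.

```bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Re-runs the command until it succeeds or the attempt limit is reached.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "gave up after $attempts attempts: $*" >&2
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, `retry 5 3 stella db ping --database policy` tries the ping from Deep diagnosis up to five times, three seconds apart.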
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If database connection issue:**
|
||||
|
||||
1. Check database status:
|
||||
```bash
|
||||
stella db status --database policy --verbose
|
||||
```
|
||||
|
||||
2. Restart database connection pool:
|
||||
```bash
|
||||
stella db pool-restart --database policy
|
||||
```
|
||||
|
||||
3. Increase the connection limit if it is too low:
|
||||
```bash
|
||||
stella db config set policy.max_connections 50
|
||||
```
|
||||
|
||||
**If disk space exhausted:**
|
||||
|
||||
1. Check storage usage:
|
||||
```bash
|
||||
stella policy storage disk-usage --verbose
|
||||
```
|
||||
|
||||
2. Clean old policy versions:
|
||||
```bash
|
||||
stella policy versions cleanup --older-than 30d
|
||||
```
|
||||
|
||||
3. Increase storage capacity
|
||||
|
||||
**If storage corruption:**
|
||||
|
||||
1. Verify storage integrity:
|
||||
```bash
|
||||
stella policy storage verify
|
||||
```
|
||||
|
||||
2. Restore from backup:
|
||||
```bash
|
||||
stella policy storage restore --from-backup latest
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check storage status
|
||||
stella policy storage status
|
||||
|
||||
# Test write operation
|
||||
stella policy storage test-write
|
||||
|
||||
# Test policy update
|
||||
stella policy update --test
|
||||
|
||||
# Verify no errors
|
||||
stella policy logs --filter "storage" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Monitoring:** Alert on storage connection failures immediately
|
||||
- [ ] **Redundancy:** Configure backup storage for failover
|
||||
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
|
||||
- [ ] **Capacity:** Monitor disk usage and plan for growth
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/storage.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
|
||||
- **Database setup:** `docs/operations/database-configuration.md`
|
||||
195
docs/operations/runbooks/policy-version-mismatch.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Policy Engine - Policy Version Conflicts
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.version-consistency` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluation returning unexpected results
|
||||
- [ ] Alert `PolicyVersionMismatch` firing
|
||||
- [ ] Error: "policy version conflict" or "bundle version mismatch"
|
||||
- [ ] Different nodes evaluating with different policy versions
|
||||
- [ ] Inconsistent gate decisions for same artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
|
||||
| **Data integrity** | Decisions may not match expected policy behavior |
|
||||
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.version-consistency
|
||||
```
|
||||
|
||||
2. **Check policy version across nodes:**
|
||||
```bash
|
||||
stella policy version --all-nodes
|
||||
```
|
||||
|
||||
3. **Check active policy version:**
|
||||
```bash
|
||||
stella policy active --show-version
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Compare versions across instances:**
|
||||
```bash
|
||||
stella policy version diff --all-instances
|
||||
```
|
||||
Problem if: Different versions on different nodes
|
||||
|
||||
2. **Check bundle distribution status:**
|
||||
```bash
|
||||
stella policy bundle status --all-nodes
|
||||
```
|
||||
|
||||
3. **Check for failed deployments:**
|
||||
```bash
|
||||
stella policy deployments list --status failed --last 24h
|
||||
```
|
||||
|
||||
4. **Check OPA bundle sync:**
|
||||
```bash
|
||||
stella policy opa bundle-status
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Force sync to latest version:**
|
||||
```bash
|
||||
stella policy sync --force --all-nodes
|
||||
```
|
||||
|
||||
2. **Pin specific version:**
|
||||
```bash
|
||||
stella policy pin --version <version>
|
||||
stella policy sync --all-nodes
|
||||
```
|
||||
|
||||
3. **Restart policy engines to force reload:**
|
||||
```bash
|
||||
stella service restart policy-engine --all-nodes
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If bundle distribution failed:**
|
||||
|
||||
1. Check bundle storage:
|
||||
```bash
|
||||
stella policy bundle storage-status
|
||||
```
|
||||
|
||||
2. Rebuild and redistribute bundle:
|
||||
```bash
|
||||
stella policy bundle build
|
||||
stella policy bundle distribute --all-nodes
|
||||
```
|
||||
|
||||
**If node out of sync:**
|
||||
|
||||
1. Check specific node status:
|
||||
```bash
|
||||
stella policy status --node <node-id>
|
||||
```
|
||||
|
||||
2. Force node resync:
|
||||
```bash
|
||||
stella policy sync --node <node-id> --force
|
||||
```
|
||||
|
||||
3. Verify node is receiving updates:
|
||||
```bash
|
||||
stella policy bundle check-subscription --node <node-id>
|
||||
```
|
||||
|
||||
**If concurrent deployments caused conflict:**
|
||||
|
||||
1. Check deployment history:
|
||||
```bash
|
||||
stella policy deployments list --last 1h
|
||||
```
|
||||
|
||||
2. Resolve to single version:
|
||||
```bash
|
||||
stella policy resolve-conflict --to-version <version>
|
||||
```
|
||||
|
||||
3. Enable deployment locking:
|
||||
```bash
|
||||
stella policy config set deployment.locking true
|
||||
```
|
||||
|
||||
**If OPA bundle polling issue:**
|
||||
|
||||
1. Check OPA bundle configuration:
|
||||
```bash
|
||||
stella policy opa config show | grep bundle
|
||||
```
|
||||
|
||||
2. Decrease polling interval for faster sync:
|
||||
```bash
|
||||
stella policy opa config set bundle.polling.min_delay_seconds 10
|
||||
stella policy opa config set bundle.polling.max_delay_seconds 30
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify all nodes on same version
|
||||
stella policy version --all-nodes
|
||||
|
||||
# Test consistent evaluation
|
||||
stella policy evaluate --test --all-nodes
|
||||
|
||||
# Verify bundle status
|
||||
stella policy bundle status --all-nodes
|
||||
|
||||
# Check no version warnings
|
||||
stella policy logs --filter "version" --level warning --last 30m
|
||||
```
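
The "all nodes on same version" check can be automated once the per-node versions are available as text. This is a sketch under a loud assumption: the output format of `stella policy version --all-nodes` is not documented here, so the hypothetical `check_version_drift` helper expects `<node> <version>` pairs on stdin, and you would adapt the field positions to what the CLI actually prints.

```bash
# Hypothetical helper: reads "<node> <version>" pairs on stdin and reports drift.
# Assumption: the input format is illustrative, not the CLI's real output.
# Returns 0 if all nodes agree, 1 on drift, 2 if no version lines were read.
check_version_drift() {
  awk 'NF >= 2 { count[$2]++ }
       END {
         n = 0
         for (v in count) n++
         if (n == 0) { print "no version lines read"; exit 2 }
         if (n > 1) {
           for (v in count) printf "%s: %d node(s)\n", v, count[v]
           exit 1
         }
       }'
}
```

On drift it prints how many nodes run each version, which shows at a glance which nodes to resync.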
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
|
||||
- [ ] **Monitoring:** Alert on version drift between nodes
|
||||
- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
|
||||
- [ ] **Testing:** Deploy to staging before production to catch issues
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/versioning.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
|
||||
- **Deployment guide:** `docs/operations/policy-deployment.md`
|
||||
371
docs/operations/runbooks/postgres-ops.md
Normal file
@@ -0,0 +1,371 @@
|
||||
> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
> **Task:** RUN-001 - PostgreSQL Operations Runbook
|
||||
# PostgreSQL Database Runbook (dev-mock ready)
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check database connection
|
||||
stella db ping
|
||||
|
||||
# Verify connection pool health
|
||||
stella doctor --check check.postgres.connectivity,check.postgres.pool
|
||||
|
||||
# Check migration status
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
|
||||
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
|
||||
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
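
The 80% rule for active connections is simple integer arithmetic. As an illustration, the hypothetical `pool_usage_ok` helper below takes the active and maximum connection counts as arguments (read from your metrics or pool statistics) and reports whether usage is under the threshold.

```bash
# Hypothetical helper: pool_usage_ok <active> <max>
# Prints the usage percentage (rounded down) and succeeds only when it is below 80%.
pool_usage_ok() {
  local active="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max connections must be positive" >&2
    return 2
  fi
  local pct=$(( active * 100 / max ))
  echo "${pct}%"
  [ "$pct" -lt 80 ]
}
```

Rounding down does not change the result here: because 80 is a whole number, the true percentage is below 80 exactly when its integer part is.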
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Daily Health Check
|
||||
|
||||
**Frequency:** Daily or on-demand
|
||||
**Duration:** ~5 minutes
|
||||
|
||||
1. Run comprehensive health check:
|
||||
```bash
|
||||
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
|
||||
```
|
||||
|
||||
2. Review slow queries from last 24h:
|
||||
```bash
|
||||
stella db queries --slow --period 24h --limit 20
|
||||
```
|
||||
|
||||
3. Check replication status (if applicable):
|
||||
```bash
|
||||
stella db replication status
|
||||
```
|
||||
|
||||
4. Verify backup completion:
|
||||
```bash
|
||||
stella backup status --type database
|
||||
```
|
||||
|
||||
### SP-002: Connection Pool Tuning
|
||||
|
||||
**When:** Pool exhaustion alerts or high wait times
|
||||
|
||||
1. Check current pool usage:
|
||||
```bash
|
||||
stella db pool stats --detailed
|
||||
```
|
||||
|
||||
2. Identify connection-holding queries:
|
||||
```bash
|
||||
stella db queries --active --sort duration
|
||||
```
|
||||
|
||||
3. Adjust pool size (if needed):
|
||||
```bash
|
||||
# Review current settings
|
||||
stella config get Database:MaxPoolSize
|
||||
|
||||
# Increase pool size
|
||||
stella config set Database:MaxPoolSize 150
|
||||
|
||||
# Restart affected services
|
||||
stella service restart --service release-orchestrator
|
||||
```
|
||||
|
||||
4. Verify improvement:
|
||||
```bash
|
||||
stella db pool watch --duration 5m
|
||||
```
|
||||
|
||||
### SP-003: Backup and Restore
|
||||
|
||||
**Backup:**
|
||||
```bash
|
||||
# Create immediate backup
|
||||
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
|
||||
# Verify backup
|
||||
stella backup verify --latest
|
||||
```
|
||||
|
||||
**Restore:**
|
||||
```bash
|
||||
# List available backups
|
||||
stella backup list --type database
|
||||
|
||||
# Restore a specific backup by ID (CAUTION: destructive)
|
||||
stella backup restore --id <backup-id> --confirm
|
||||
|
||||
# Verify restoration
|
||||
stella db ping
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### SP-004: Migration Execution
|
||||
|
||||
1. Pre-migration backup:
|
||||
```bash
|
||||
stella backup create --type database --name "pre-migration"
|
||||
```
|
||||
|
||||
2. Run migrations:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella db migrate --dry-run
|
||||
|
||||
# Apply migrations
|
||||
stella db migrate
|
||||
```
|
||||
|
||||
3. Verify migration success:
|
||||
```bash
|
||||
stella db migrations status
|
||||
stella doctor --check check.postgres.migrations
|
||||
```
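
The order of SP-004 matters: no migration should run unless the backup and the dry run both succeeded. The steps can be sketched as one wrapper that stops at the first failure. `run_migration` is a hypothetical helper, and it assumes every `stella` command involved exits non-zero on failure.

```bash
# Hypothetical wrapper around SP-004: stops at the first failing step.
# Assumption: each stella command exits non-zero on failure.
run_migration() {
  stella backup create --type database --name "pre-migration" \
    || { echo "backup failed; not migrating" >&2; return 1; }
  stella db migrate --dry-run \
    || { echo "dry run failed; not migrating" >&2; return 1; }
  stella db migrate \
    || { echo "migration failed; consider restoring the pre-migration backup" >&2; return 1; }
  # Verification: both checks must pass for the function to succeed.
  stella db migrations status &&
    stella doctor --check check.postgres.migrations
}
```

If the migration step itself fails, the restore procedure in SP-003 applies, using the backup created in the first step.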
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Connection Pool Exhaustion
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresPoolExhausted`
|
||||
- Error logs: "connection pool exhausted, waiting for available connection"
|
||||
- Increased request latency
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check pool status
|
||||
stella db pool stats
|
||||
|
||||
# Find long-running queries
|
||||
stella db queries --active --sort duration --limit 10
|
||||
|
||||
# Check for connection leaks
|
||||
stella db connections --by-client
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate relief** - Terminate long-running queries:
|
||||
```bash
|
||||
# Identify stuck queries
|
||||
stella db queries --active --duration ">5m"
|
||||
|
||||
# Terminate specific query (use with caution)
|
||||
stella db query terminate --pid <pid>
|
||||
```
|
||||
|
||||
2. **Scale pool** (if legitimate load):
|
||||
```bash
|
||||
stella config set Database:MaxPoolSize 200
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. **Fix leaks** (if application bug):
|
||||
- Review application logs for unclosed connections
|
||||
- Deploy fix to affected service
|
||||
|
||||
### INC-002: Slow Query Performance
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresQueryLatencyHigh`
|
||||
- P99 query latency > 500ms
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Get slow query report
|
||||
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
|
||||
|
||||
# Analyze specific query
|
||||
stella db query explain --sql "SELECT ..." --analyze
|
||||
|
||||
# Check table statistics
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Index optimization:**
|
||||
```bash
|
||||
# Get index recommendations
|
||||
stella db index suggest --table <table>
|
||||
|
||||
# Create recommended index
|
||||
stella db index create --table <table> --columns "col1,col2"
|
||||
```
|
||||
|
||||
2. **Vacuum/analyze:**
|
||||
```bash
|
||||
stella db vacuum --table <table>
|
||||
stella db analyze --table <table>
|
||||
```
|
||||
|
||||
3. **Query optimization** - Review and rewrite problematic queries
|
||||
|
||||
### INC-003: Database Connectivity Loss
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresConnectionFailed`
|
||||
- All services reporting database connection errors
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Test basic connectivity
|
||||
stella db ping
|
||||
|
||||
# Check DNS resolution
|
||||
stella network dns-lookup <db-host>
|
||||
|
||||
# Check firewall/network
|
||||
stella network test --host <db-host> --port 5432
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Network issue:**
|
||||
- Verify security groups / firewall rules
|
||||
- Check VPN/tunnel status if applicable
|
||||
- Verify DNS resolution
|
||||
|
||||
2. **Database server issue:**
|
||||
- Check PostgreSQL service status on server
|
||||
- Review PostgreSQL logs
|
||||
- Check disk space on database server
|
||||
|
||||
3. **Credential issue:**
|
||||
```bash
|
||||
stella db verify-credentials
|
||||
stella secrets rotate --scope database
|
||||
```
|
||||
|
||||
### INC-004: Disk Space Alert
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
|
||||
- Database write failures
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check disk usage
|
||||
stella db disk-usage
|
||||
|
||||
# Find large tables
|
||||
stella db stats tables --sort size --limit 20
|
||||
|
||||
# Check for bloat
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate cleanup:**
|
||||
```bash
|
||||
# Vacuum to reclaim space
|
||||
stella db vacuum --full --table <large-table>
|
||||
|
||||
# Clean old data (if retention policy allows)
|
||||
stella db prune --table evidence_artifacts --older-than 90d --dry-run
|
||||
```
|
||||
|
||||
2. **Archive old data:**
|
||||
```bash
|
||||
stella db archive --table findings_history --older-than 180d
|
||||
```
|
||||
|
||||
3. **Expand disk** (if legitimate growth):
|
||||
- Follow cloud provider procedure to expand volume
|
||||
- Resize filesystem
|
||||
|
||||
---
|
||||
|
||||
## Maintenance Windows
|
||||
|
||||
### Weekly Maintenance (Sunday 02:00 UTC)
|
||||
|
||||
1. Run vacuum analyze on all tables:
|
||||
```bash
|
||||
stella db vacuum --analyze --all-tables
|
||||
```
|
||||
|
||||
2. Update table statistics:
|
||||
```bash
|
||||
stella db analyze --all-tables
|
||||
```
|
||||
|
||||
3. Clean temporary files:
|
||||
```bash
|
||||
stella db cleanup --temp-files
|
||||
```
|
||||
|
||||
### Monthly Maintenance (First Sunday 03:00 UTC)
|
||||
|
||||
1. Full vacuum on large tables:
|
||||
```bash
|
||||
stella db vacuum --full --table findings --table verdicts
|
||||
```
|
||||
|
||||
2. Reindex if needed:
|
||||
```bash
|
||||
stella db reindex --concurrently --table findings
|
||||
```
|
||||
|
||||
3. Archive old data per retention policy:
|
||||
```bash
|
||||
stella db archive --apply-retention
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
|
||||
|
||||
Key panels:
|
||||
- Connection pool utilization
|
||||
- Query latency percentiles
|
||||
- Disk usage trend
|
||||
- Replication lag (if applicable)
|
||||
- Active queries count
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
For any incident, capture:
|
||||
```bash
|
||||
# Comprehensive database state
|
||||
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Connection stats
|
||||
- Active queries
|
||||
- Lock information
|
||||
- Table statistics
|
||||
- Recent slow query log
|
||||
- Configuration snapshot
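
When several responders capture evidence from hosts in different time zones, a UTC timestamp keeps bundle names comparable. The helper below is a small sketch of that idea; `diag_bundle_path` is a hypothetical name, and the `Z` suffix is an addition to the naming shown above.

```bash
# Prints a timestamped bundle path in UTC, e.g. /tmp/db-diag-<date>T<time>Z.tar.gz
# $1: target directory (defaults to /tmp)
diag_bundle_path() {
  local dir="${1:-/tmp}"
  printf '%s/db-diag-%s.tar.gz\n' "$dir" "$(date -u +%Y%m%dT%H%M%SZ)"
}
```

It would be used as `stella db diagnostics --output "$(diag_bundle_path /tmp)"`.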
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Standard procedures, restart services
|
||||
2. **L2 (Database team):** Query optimization, schema changes
|
||||
3. **L3 (Vendor support):** Hardware/cloud platform issues
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
152
docs/operations/runbooks/scanner-oom.md
Normal file
@@ -0,0 +1,152 @@
|
||||
# Runbook: Scanner - Out of Memory on Large Images
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.memory-usage` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scanner worker exits with code 137 (OOM killed)
|
||||
- [ ] Scans fail consistently for specific large images
|
||||
- [ ] Error log contains "fatal error: runtime: out of memory"
|
||||
- [ ] Alert `ScannerWorkerOOM` firing
|
||||
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Large images cannot be scanned; smaller images may still work |
|
||||
| **Data integrity** | No data loss; failed scans can be retried |
|
||||
| **SLA impact** | Specific images blocked from release pipeline |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Identify the failing image:**
|
||||
```bash
|
||||
stella scanner jobs list --status failed --last 1h
|
||||
```
|
||||
|
||||
2. **Check image size:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --format json | jq '.size'
|
||||
```
|
||||
Problem if: Image size > 2GB or layer count > 100
|
||||
|
||||
3. **Check worker memory limit:**
|
||||
```bash
|
||||
stella scanner config get worker.memory_limit
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Profile memory usage during scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --profile-memory
|
||||
```
|
||||
|
||||
2. **Check SBOM generation memory:**
|
||||
```bash
|
||||
stella scanner logs --filter "sbom" --level debug --last 30m
|
||||
```
|
||||
Look for: "memory allocation failed", "heap exhausted"
|
||||
|
||||
3. **Identify memory-heavy layers:**
|
||||
```bash
|
||||
stella image layers <image-ref> --sort-by size
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase worker memory limit:**
|
||||
```bash
|
||||
stella scanner config set worker.memory_limit 8Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
2. **Enable streaming mode for large images:**
|
||||
```bash
|
||||
stella scanner config set sbom.streaming_threshold 1Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
3. **Retry the failed scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --retry
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**For consistently large images:**
|
||||
|
||||
1. Configure dedicated large-image worker pool:
|
||||
```bash
|
||||
stella scanner workers add --pool large-images --memory 16Gi --count 2
|
||||
stella scanner config set routing.large_image_threshold 2Gi
|
||||
stella scanner config set routing.large_image_pool large-images
|
||||
```
|
||||
|
||||
**For images with many small files (node_modules, etc.):**
|
||||
|
||||
1. Enable incremental SBOM mode:
|
||||
```bash
|
||||
stella scanner config set sbom.incremental_mode true
|
||||
```
|
||||
|
||||
**For base image reuse:**
|
||||
|
||||
1. Enable layer caching:
|
||||
```bash
|
||||
stella scanner config set cache.layer_dedup true
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry the previously failing scan
|
||||
stella scan image --image <image-ref>
|
||||
|
||||
# Monitor memory during scan
|
||||
stella scanner workers stats --watch
|
||||
|
||||
# Verify no OOM in recent logs
|
||||
stella scanner logs --filter "out of memory" --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
|
||||
- [ ] **Routing:** Configure large-image pool for images > 2GB
|
||||
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
|
||||
- [ ] **Documentation:** Document image size limits in user guide
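
The 80% monitoring rule above reduces to a single integer comparison. The sketch below assumes both values are in bytes, as the metric name `scanner_worker_memory_usage_bytes` suggests; `memory_over_threshold` is a hypothetical helper, not a `stella` command.

```bash
# Returns 0 (true) when used memory exceeds PCT percent of the limit.
# $1: used bytes, $2: limit bytes, $3: percent threshold (default 80)
memory_over_threshold() {
  local used_bytes="$1" limit_bytes="$2" pct="${3:-80}"
  # Compare used*100 against limit*pct to stay in integer arithmetic.
  [ $((used_bytes * 100)) -gt $((limit_bytes * pct)) ]
}
```

With a 4Gi limit (4294967296 bytes), `memory_over_threshold 3500000000 4294967296` succeeds because 3.5 GB is above 80% of the limit.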
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Memory
|
||||
195
docs/operations/runbooks/scanner-registry-auth.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Scanner - Registry Authentication Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.registry-auth` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans failing with "401 Unauthorized" or "403 Forbidden"
|
||||
- [ ] Alert `ScannerRegistryAuthFailed` firing
|
||||
- [ ] Error: "failed to authenticate with registry"
|
||||
- [ ] Error: "failed to pull image manifest"
|
||||
- [ ] Scans work for public images but fail for private images
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Cannot scan private images; release pipeline blocked |
|
||||
| **Data integrity** | No data loss; authentication issue only |
|
||||
| **SLA impact** | All scans for affected registry blocked |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.registry-auth
|
||||
```
|
||||
|
||||
2. **List configured registries:**
|
||||
```bash
|
||||
stella registry list --show-status
|
||||
```
|
||||
Look for: Registries with "auth_failed" status
|
||||
|
||||
3. **Test registry authentication:**
|
||||
```bash
|
||||
stella registry test <registry-url>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check credential expiration:**
|
||||
```bash
|
||||
stella registry credentials show <registry-name>
|
||||
```
|
||||
Look for: Expiration date, token type
|
||||
|
||||
2. **Test with verbose output:**
|
||||
```bash
|
||||
stella registry test <registry-url> --verbose
|
||||
```
|
||||
Look for: Specific auth error message, HTTP status code
|
||||
|
||||
3. **Check scanner logs for registry authentication errors:**
|
||||
```bash
|
||||
stella scanner logs --filter "registry auth" --last 30m
|
||||
```
|
||||
|
||||
4. **Verify IAM/OIDC configuration (for cloud registries):**
|
||||
```bash
|
||||
stella registry iam-status <registry-name>
|
||||
```
|
||||
Problem if: IAM role not assumable, OIDC token expired
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Refresh credentials (for token-based auth):**
|
||||
```bash
|
||||
stella registry refresh-credentials <registry-name>
|
||||
```
|
||||
|
||||
2. **Update static credentials:**
|
||||
```bash
|
||||
stella registry update-credentials <registry-name> \
  --username <user> \
  --password <token>
|
||||
```
|
||||
|
||||
3. **For Docker Hub rate limiting:**
|
||||
```bash
|
||||
stella registry configure docker-hub \
  --username <user> \
  --access-token <token>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If credentials expired:**
|
||||
|
||||
1. Generate new access token in registry (ECR, GCR, ACR, etc.)
|
||||
|
||||
2. Update credentials:
|
||||
```bash
|
||||
stella registry update-credentials <registry-name> --from-env
|
||||
```
|
||||
|
||||
3. Configure automatic token refresh:
|
||||
```bash
|
||||
stella registry config set <registry-name>.auto_refresh true
|
||||
stella registry config set <registry-name>.refresh_interval 11h
|
||||
```
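
An 11-hour refresh interval leaves about an hour of headroom for tokens that live 12 hours, which is the lifetime of Amazon ECR authorization tokens; other registries may use different lifetimes. The check behind that choice can be sketched as follows. `token_needs_refresh` and its one-hour default margin are illustrative assumptions, not part of the `stella` CLI.

```bash
# Returns 0 (true) when the token expires within MARGIN seconds (or has expired).
# $1: expiry time (epoch seconds), $2: current time (epoch seconds),
# $3: safety margin in seconds (default 3600)
token_needs_refresh() {
  local expires_epoch="$1" now_epoch="$2" margin_seconds="${3:-3600}"
  [ $((expires_epoch - now_epoch)) -le "$margin_seconds" ]
}
```

For a token issued at time 0 with a 12-hour lifetime (expiry 43200), the check is false at hour 10 (36000) and true at hour 11 (39600).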
|
||||
|
||||
**If IAM role/policy changed (AWS ECR):**
|
||||
|
||||
1. Verify IAM role permissions:
|
||||
```bash
|
||||
stella registry iam verify <registry-name>
|
||||
```
|
||||
|
||||
2. Update IAM role ARN if changed:
|
||||
```bash
|
||||
stella registry configure ecr \
  --region <region> \
  --role-arn <arn>
|
||||
```
|
||||
|
||||
**If OIDC federation changed (GCP Artifact Registry):**
|
||||
|
||||
1. Verify service account:
|
||||
```bash
|
||||
stella registry oidc verify <registry-name>
|
||||
```
|
||||
|
||||
2. Update workload identity configuration:
|
||||
```bash
|
||||
stella registry configure gcr \
  --project <project> \
  --workload-identity-provider <provider>
|
||||
```
|
||||
|
||||
**If certificate changed (self-hosted registries):**
|
||||
|
||||
1. Update CA certificate:
|
||||
```bash
|
||||
stella registry configure <registry-name> \
  --ca-cert /path/to/ca.crt
|
||||
```
|
||||
|
||||
2. Or skip verification (not recommended for production):
|
||||
```bash
|
||||
stella registry configure <registry-name> \
  --insecure-skip-verify
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test authentication
|
||||
stella registry test <registry-url>
|
||||
|
||||
# Test scanning a private image
|
||||
stella scan image --image <registry-url>/<image>:<tag> --dry-run
|
||||
|
||||
# Verify no auth failures in recent logs
|
||||
stella scanner logs --filter "auth" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Credentials:** Use service accounts/workload identity instead of static tokens
|
||||
- [ ] **Rotation:** Configure automatic token refresh before expiration
|
||||
- [ ] **Monitoring:** Alert on authentication failure rate > 0
|
||||
- [ ] **Documentation:** Document registry credential management procedures
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/registry-auth.md`
|
||||
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
|
||||
- **Registry setup:** `docs/operations/registry-configuration.md`
|
||||
188
docs/operations/runbooks/scanner-sbom-generation-failed.md
Normal file
@@ -0,0 +1,188 @@
|
||||
# Runbook: Scanner - SBOM Generation Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.sbom-generation` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans completing but SBOM generation failing
|
||||
- [ ] Alert `ScannerSbomGenerationFailed` firing
|
||||
- [ ] Error: "SBOM generation failed" or "unsupported package format"
|
||||
- [ ] Partial SBOM with missing components
|
||||
- [ ] Metric `scanner_sbom_generation_failures_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Incomplete vulnerability coverage; missing dependencies not scanned |
|
||||
| **Data integrity** | Partial SBOM may miss vulnerabilities; attestations incomplete |
|
||||
| **SLA impact** | SBOM completeness SLO violated (target: > 95%) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.sbom-generation
|
||||
```
|
||||
|
||||
2. **Check failed SBOM jobs:**
|
||||
```bash
|
||||
stella scanner jobs list --status sbom_failed --last 1h
|
||||
```
|
||||
|
||||
3. **Check SBOM completeness rate:**
|
||||
```bash
|
||||
stella scanner stats --sbom-metrics
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Analyze specific failure:**
|
||||
```bash
|
||||
stella scanner job details <job-id> --sbom-errors
|
||||
```
|
||||
Look for: Specific package manager or file type causing failure
|
||||
|
||||
2. **Check for unsupported ecosystems:**
|
||||
```bash
|
||||
stella sbom analyze --image <image-ref> --verbose
|
||||
```
|
||||
Look for: "unsupported", "unknown package format", "parsing failed"
|
||||
|
||||
3. **Check scanner plugin status:**
|
||||
```bash
|
||||
stella scanner plugins list --status
|
||||
```
|
||||
Problem if: Package manager plugin disabled or erroring
|
||||
|
||||
4. **Check for corrupted package files:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --check-integrity
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Enable fallback SBOM generation:**
|
||||
```bash
|
||||
stella scanner config set sbom.fallback_mode true
|
||||
stella scan image --image <image-ref> --sbom-fallback
|
||||
```
|
||||
|
||||
2. **Use alternative SBOM generator:**
|
||||
```bash
|
||||
stella sbom generate --image <image-ref> --generator syft --output sbom.json
|
||||
```
|
||||
|
||||
3. **Generate partial SBOM and continue:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --sbom-partial-ok
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If package manager not supported:**
|
||||
|
||||
1. Check supported package managers:
|
||||
```bash
|
||||
stella scanner plugins list --type package-manager
|
||||
```
|
||||
|
||||
2. Enable additional plugins:
|
||||
```bash
|
||||
stella scanner plugins enable <plugin-name>
|
||||
```
|
||||
|
||||
3. For custom package formats, add mapping:
|
||||
```bash
|
||||
stella scanner config set sbom.custom_mappings.<format> <handler>
|
||||
```
|
||||
|
||||
**If package file corrupted:**
|
||||
|
||||
1. Identify corrupted files:
|
||||
```bash
|
||||
stella image layers <image-ref> --verify-packages
|
||||
```
|
||||
|
||||
2. Report to image owner for fix
|
||||
|
||||
**If memory/resource issue during generation:**
|
||||
|
||||
1. Increase SBOM generator resources:
|
||||
```bash
|
||||
stella scanner config set sbom.memory_limit 4Gi
|
||||
stella scanner config set sbom.timeout 10m
|
||||
```
|
||||
|
||||
2. Enable streaming mode:
|
||||
```bash
|
||||
stella scanner config set sbom.streaming_mode true
|
||||
```
|
||||
|
||||
**If plugin crashed:**
|
||||
|
||||
1. Check plugin logs:
|
||||
```bash
|
||||
stella scanner plugins logs <plugin-name> --last 30m
|
||||
```
|
||||
|
||||
2. Restart plugin:
|
||||
```bash
|
||||
stella scanner plugins restart <plugin-name>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry SBOM generation
|
||||
stella sbom generate --image <image-ref> --output sbom.json
|
||||
|
||||
# Validate SBOM completeness
|
||||
stella sbom validate --file sbom.json --check-completeness
|
||||
|
||||
# Check component count
|
||||
stella sbom stats --file sbom.json
|
||||
|
||||
# Full scan with SBOM
|
||||
stella scan image --image <image-ref>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Plugins:** Keep all package manager plugins enabled and updated
|
||||
- [ ] **Monitoring:** Alert on SBOM completeness < 90%
|
||||
- [ ] **Fallback:** Configure fallback SBOM generator for resilience
|
||||
- [ ] **Testing:** Test SBOM generation for new image types before production
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/sbom-generation.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
|
||||
- **SBOM formats:** `docs/formats/sbom-spdx.md`, `docs/formats/sbom-cyclonedx.md`
|
||||
174
docs/operations/runbooks/scanner-timeout.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Scanner - Scan Timeout on Complex Images
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.timeout-rate` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans failing with "timeout exceeded" error
|
||||
- [ ] Alert `ScannerTimeoutExceeded` firing
|
||||
- [ ] Metric `scanner_scan_timeout_total` increasing
|
||||
- [ ] Specific images consistently timing out
|
||||
- [ ] Error log: "scan operation exceeded timeout of X seconds"
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Specific images cannot be scanned; pipeline blocked |
|
||||
| **Data integrity** | No data loss; scans can be retried with adjusted settings |
|
||||
| **SLA impact** | Release pipeline delayed for affected images |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.timeout-rate
|
||||
```
|
||||
|
||||
2. **Identify failing images:**
|
||||
```bash
|
||||
stella scanner jobs list --status timeout --last 1h
|
||||
```
|
||||
Look for: Pattern in image types or sizes
|
||||
|
||||
3. **Check current timeout settings:**
|
||||
```bash
|
||||
stella scanner config get timeouts
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Analyze image complexity:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
|
||||
```
|
||||
Problem if: > 50 layers, > 100k files, or > 5GB size
|
||||
|
||||
2. **Check scanner worker load:**
|
||||
```bash
|
||||
stella scanner workers stats
|
||||
```
|
||||
Problem if: All workers at capacity during timeouts
|
||||
|
||||
3. **Profile a scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --profile --verbose
|
||||
```
|
||||
Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
|
||||
|
||||
4. **Check for filesystem-heavy images:**
|
||||
```bash
|
||||
stella image layers <image-ref> --sort-by file-count
|
||||
```
|
||||
Problem if: Single layer with > 50k files (e.g., node_modules)
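
The image-level thresholds from step 1 can be applied mechanically once the three numbers are known. The sketch below assumes the `size` field is reported in bytes and reads "5GB" as 5,000,000,000 bytes; neither is confirmed by this runbook. `complexity_flags` is a hypothetical helper.

```bash
# Prints one line per threshold the image exceeds; prints nothing if within limits.
# $1: image size in bytes, $2: layer count, $3: file count
complexity_flags() {
  local size_bytes="$1" layers="$2" files="$3"
  [ "$layers" -gt 50 ] && echo "layers: $layers > 50"
  [ "$files" -gt 100000 ] && echo "files: $files > 100000"
  [ "$size_bytes" -gt 5000000000 ] && echo "size: $size_bytes bytes > 5 GB"
  return 0
}
```

The arguments correspond to the `size`, `layers`, and `files` values produced by the `jq` filter in step 1. An empty result suggests the timeout has another cause, such as worker load.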
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase timeout for specific image:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --timeout 30m
|
||||
```
|
||||
|
||||
2. **Increase global scan timeout:**
|
||||
```bash
|
||||
stella scanner config set timeouts.scan 20m
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
3. **Enable fast mode for initial scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --fast-mode
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If image is too complex:**
|
||||
|
||||
1. Enable incremental scanning:
|
||||
```bash
|
||||
stella scanner config set scan.incremental_mode true
|
||||
```
|
||||
|
||||
2. Configure layer caching:
|
||||
```bash
|
||||
stella scanner config set cache.layer_dedup true
|
||||
stella scanner config set cache.sbom_cache true
|
||||
```
|
||||
|
||||
**If filesystem is too large:**
|
||||
|
||||
1. Enable streaming SBOM generation:
|
||||
```bash
|
||||
stella scanner config set sbom.streaming_threshold 500Mi
|
||||
```
|
||||
|
||||
2. Configure file sampling for massive images:
|
||||
```bash
|
||||
stella scanner config set sbom.file_sample_max 100000
|
||||
```
|
||||
|
||||
**If vulnerability matching is slow:**
|
||||
|
||||
1. Enable parallel matching:
|
||||
```bash
|
||||
stella scanner config set vuln.parallel_matching true
|
||||
stella scanner config set vuln.match_workers 4
|
||||
```
|
||||
|
||||
2. Optimize vulnerability database indexes:
|
||||
```bash
|
||||
stella db optimize --component scanner
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry the previously failing scan
|
||||
stella scan image --image <image-ref> --timeout 30m
|
||||
|
||||
# Monitor scan progress
|
||||
stella scanner jobs watch <job-id>
|
||||
|
||||
# Verify no timeouts in recent scans
|
||||
stella scanner jobs list --status timeout --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
|
||||
- [ ] **Monitoring:** Alert on timeout rate > 5%
|
||||
- [ ] **Caching:** Enable layer and SBOM caching for base images
|
||||
- [ ] **Documentation:** Document image size/complexity limits in user guide
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Performance
|
||||
174
docs/operations/runbooks/scanner-worker-stuck.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Scanner - Worker Not Processing Jobs
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.worker-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
|
||||
- [ ] Scanner worker process shows 0% CPU usage
|
||||
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
|
||||
- [ ] UI shows "Scan in progress" indefinitely
|
||||
- [ ] Metric `scanner_jobs_pending` increasing over time
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
|
||||
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
|
||||
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks (< 2 minutes)
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.worker-health
|
||||
```
|
||||
|
||||
2. **Check scanner service status:**
|
||||
```bash
|
||||
stella scanner status
|
||||
```
|
||||
Expected: "Scanner workers: 4 active, 0 idle"
|
||||
Problem: "Scanner workers: 0 active" or "status: degraded"
|
||||
|
||||
3. **Check job queue depth:**
|
||||
```bash
|
||||
stella scanner queue status
|
||||
```
|
||||
Expected: Queue depth < 50
|
||||
Problem: Queue depth > 100 or growing rapidly
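
The queue-depth guidance in step 3 can be expressed as a small classifier for use in scripts or dashboards. The runbook does not say how to treat depths between 50 and 100, so the `elevated` label below is an assumption, and the sketch does not cover the "growing rapidly" case. `queue_state` is a hypothetical helper.

```bash
# Classifies a queue depth using the thresholds from this runbook.
# $1: current queue depth (integer)
queue_state() {
  local depth="$1"
  if [ "$depth" -lt 50 ]; then
    echo "ok"          # expected range
  elif [ "$depth" -gt 100 ]; then
    echo "problem"     # backlog; continue with deep diagnosis
  else
    echo "elevated"    # assumed label for the undefined 50-100 band
  fi
}
```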
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check worker process logs:**
|
||||
```bash
|
||||
stella scanner logs --tail 100 --level error
|
||||
```
|
||||
Look for: "timeout", "connection refused", "out of memory"
|
||||
|
||||
2. **Check Valkey connectivity (job queue):**
|
||||
```bash
|
||||
stella doctor --check check.storage.valkey
|
||||
```
|
||||
|
||||
3. **Check if workers are OOM-killed:**
|
||||
```bash
|
||||
stella scanner workers inspect
|
||||
```
|
||||
Look for: "exit_code: 137" (OOM) or "exit_code: 143" (SIGTERM)
|
||||
|
||||
4. **Check resource utilization:**
|
||||
```bash
|
||||
stella obs metrics --filter scanner --last 10m
|
||||
```
|
||||
Look for: Memory > 90%, CPU sustained > 95%
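
The exit codes in step 3 follow the shell convention that a process killed by signal N reports 128 + N, so 137 is SIGKILL (9) and 143 is SIGTERM (15). A SIGKILL is what the kernel OOM killer sends, although other senders are possible. The helper below is an illustrative sketch; `classify_exit_code` is a hypothetical name.

```bash
# Translates a process exit code into a short human-readable cause.
# $1: exit code (integer)
classify_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 mean the process was terminated by signal (code - 128).
    case $((code - 128)) in
      9)  echo "SIGKILL (typical of an OOM kill)" ;;
      15) echo "SIGTERM (asked to stop)" ;;
      *)  echo "signal $((code - 128))" ;;
    esac
  else
    echo "application error $code"
  fi
}
```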
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Restart scanner workers:**
|
||||
```bash
|
||||
stella scanner workers restart
|
||||
```
|
||||
This will: Terminate current workers and spawn fresh ones
|
||||
|
||||
2. **If restart fails, force restart the scanner service:**
|
||||
```bash
|
||||
stella service restart scanner
|
||||
```
|
||||
|
||||
3. **Verify workers are processing:**
|
||||
```bash
|
||||
stella scanner queue status --watch
|
||||
```
|
||||
Queue depth should start decreasing
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If workers were OOM-killed:**
|
||||
|
||||
1. Increase worker memory limit:
|
||||
```bash
|
||||
stella scanner config set worker.memory_limit 4Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
2. Reduce concurrent scans per worker:
|
||||
```bash
|
||||
stella scanner config set worker.concurrency 2
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
**If Valkey connection failed:**
|
||||
|
||||
1. Check Valkey health:
|
||||
```bash
|
||||
stella doctor --check check.storage.valkey
|
||||
```
|
||||
|
||||
2. Restart Valkey if needed (see `valkey-connection-failure.md`)
|
||||
|
||||
**If workers are deadlocked:**
|
||||
|
||||
1. Enable deadlock detection:
|
||||
```bash
|
||||
stella scanner config set worker.deadlock_detection true
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify workers are healthy
|
||||
stella doctor --check check.scanner.worker-health
|
||||
|
||||
# Submit a test scan
|
||||
stella scan image --image alpine:latest --dry-run
|
||||
|
||||
# Watch queue drain
|
||||
stella scanner queue status --watch
|
||||
|
||||
# Verify no errors in recent logs
|
||||
stella scanner logs --tail 20 --level error
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Alert:** Ensure `ScannerQueueBacklog` alert is configured with threshold < 100 jobs
|
||||
- [ ] **Monitoring:** Add Grafana panel for worker memory usage
|
||||
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
|
||||
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Overview
|
||||