synergy moats product advisory implementations

2026-01-17 01:30:03 +02:00
parent 77ff029205
commit d8d9c0a6e3
106 changed files with 20603 additions and 123 deletions
--- a/docs/operations/guides/auditor-guide.md
+++ b/docs/operations/guides/auditor-guide.md
@@ -0,0 +1,256 @@
+# Auditor Guide
+
+> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command  
+> **Task:** AUD-007 - Documentation
+
+This guide is for external auditors reviewing Stella Ops release evidence.
+
+## Overview
+
+Stella Ops generates comprehensive, tamper-evident audit bundles that contain all evidence required to verify release decisions. This guide explains how to interpret and verify these bundles.
+
+## Receiving an Audit Bundle
+
+Audit bundles may be delivered as:
+- **Directory:** A folder containing all evidence files
+- **Archive:** A `.tar.gz` or `.zip` file
+
+### Extracting Archives
+
+```bash
+# tar.gz
+tar -xzf audit-bundle-sha256-abc123.tar.gz
+
+# zip
+unzip audit-bundle-sha256-abc123.zip
+```
+
+## Bundle Structure
+
+```
+audit-bundle-<digest>-<timestamp>/
+├── manifest.json              # Integrity manifest
+├── README.md                  # Quick reference
+├── verdict/                   # Release decision
+├── evidence/                  # Supporting evidence
+├── policy/                    # Policy configuration
+└── replay/                    # Verification instructions
+```
+
+## Step 1: Verify Bundle Integrity
+
+Before reviewing contents, verify the bundle has not been tampered with.
+
+### Using Stella CLI
+
+```bash
+stella audit verify ./audit-bundle-sha256-abc123/
+```
+
+Expected output:
+```
+✓ Verified 15/15 files
+✓ Integrity hash verified
+✓ Bundle integrity verified
+```
+
+### Manual Verification
+
+1. Open `manifest.json`
+2. For each file listed, compute SHA-256 and compare:
+   ```bash
+   sha256sum verdict/verdict.json
+   ```
+3. Verify the `integrityHash` by hashing all file hashes
+
+## Step 2: Review the Verdict
+
+The verdict is the official release decision.
+
+### verdict/verdict.json
+
+```json
+{
+  "artifactDigest": "sha256:abc123...",
+  "decision": "PASS",
+  "timestamp": "2026-01-17T10:25:00Z",
+  "gates": [
+    {
+      "gateId": "sbom-required",
+      "status": "PASS",
+      "reason": "Valid CycloneDX SBOM present"
+    },
+    {
+      "gateId": "vex-trust",
+      "status": "PASS", 
+      "reason": "Trust score 0.85 >= 0.70 threshold"
+    }
+  ]
+}
+```
+
+### Decision Values
+
+| Decision | Meaning |
+|----------|---------|
+| `PASS` | All gates passed, artifact approved for deployment |
+| `BLOCKED` | One or more gates failed, artifact not approved |
+| `PENDING` | Evaluation incomplete, awaiting additional evidence |
+
+### verdict/verdict.dsse.json
+
+This file contains the cryptographically signed verdict envelope (DSSE format). Verify signatures using:
+
+```bash
+stella audit verify ./bundle/ --check-signatures
+```
+
+## Step 3: Review Evidence
+
+### evidence/sbom.json
+
+Software Bill of Materials (SBOM) listing all components in the artifact.
+
+**Key fields:**
+- `components[]` - List of all software components
+- `dependencies[]` - Dependency relationships
+- `metadata.timestamp` - When SBOM was generated
+
+### evidence/vex-statements/
+
+Vulnerability Exploitability eXchange (VEX) statements that justify vulnerability assessments.
+
+**index.json:**
+```json
+{
+  "statementCount": 3,
+  "statements": [
+    {"fileName": "vex-001.json", "source": "vendor-security"},
+    {"fileName": "vex-002.json", "source": "internal-analysis"}
+  ]
+}
+```
+
+Each VEX statement explains why a vulnerability does or does not affect this artifact.
+
+### evidence/reachability/analysis.json
+
+Reachability analysis showing which vulnerabilities are actually reachable in the code.
+
+```json
+{
+  "components": [
+    {
+      "purl": "pkg:npm/lodash@4.17.21",
+      "vulnerabilities": [
+        {
+          "id": "CVE-2021-23337",
+          "reachable": false,
+          "reason": "Vulnerable function not in call graph"
+        }
+      ]
+    }
+  ]
+}
+```
+
+## Step 4: Review Policy
+
+### policy/policy-snapshot.json
+
+The policy configuration used for evaluation:
+
+```json
+{
+  "policyVersion": "v2.3.1",
+  "gates": ["sbom-required", "vex-trust", "cve-threshold"],
+  "thresholds": {
+    "vexTrustScore": 0.70,
+    "maxCriticalCves": 0,
+    "maxHighCves": 5
+  }
+}
+```
+
+### policy/gate-decision.json
+
+Detailed breakdown of each gate evaluation:
+
+```json
+{
+  "gates": [
+    {
+      "gateId": "vex-trust",
+      "decision": "PASS",
+      "inputs": {
+        "vexStatements": 3,
+        "trustScore": 0.85,
+        "threshold": 0.70
+      }
+    }
+  ]
+}
+```
+
+## Step 5: Replay Verification (Optional)
+
+For maximum assurance, you can replay the verdict evaluation.
+
+### Using Stella CLI
+
+```bash
+cd audit-bundle-sha256-abc123/
+stella replay snapshot --manifest replay/knowledge-snapshot.json
+```
+
+This re-evaluates the policy using the frozen inputs and should produce an identical verdict.
+
+### Manual Replay Steps
+
+See `replay/replay-instructions.md` for detailed steps.
+
+## Compliance Mapping
+
+| Compliance Framework | Relevant Bundle Components |
+|---------------------|---------------------------|
+| **SOC 2 (CC7.1)** | verdict/, policy/ |
+| **ISO 27001 (A.12.6)** | evidence/sbom.json |
+| **FedRAMP** | All components |
+| **SLSA Level 3** | evidence/provenance/ |
+
+## Common Questions
+
+### Q: Why was this artifact blocked?
+
+Review `policy/gate-decision.json` for the specific gate that failed and its reason.
+
+### Q: How do I verify the SBOM is accurate?
+
+The SBOM digest is included in the manifest. Compare against the organization's SBOM generation process.
+
+### Q: What if replay produces a different result?
+
+This may indicate:
+1. Policy version mismatch
+2. Missing evidence files
+3. Time-dependent policy rules
+
+Contact the organization's security team for clarification.
+
+### Q: How long should audit bundles be retained?
+
+Stella Ops recommends:
+- Production releases: 5 years minimum
+- Security-critical systems: 7 years
+- Regulated industries: Per compliance requirements
+
+## Support
+
+For questions about this audit bundle:
+1. Contact the organization's Stella Ops administrator
+2. Reference the Bundle ID from `manifest.json`
+3. Include the artifact digest
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/COVERAGE.md
+++ b/docs/operations/runbooks/COVERAGE.md
@@ -0,0 +1,112 @@
+# Runbook Coverage Tracking
+
+This document tracks operational runbook coverage across Stella Ops modules.
+
+**Target:** 80% coverage of critical failure modes before declaring operability moat achieved.
+
+---
+
+## Coverage Summary
+
+| Module | Critical Failures | Runbooks | Coverage | Status |
+|--------|-------------------|----------|----------|--------|
+| Scanner | 5 | 0 | 0% | 🔴 Gap |
+| Policy Engine | 5 | 0 | 0% | 🔴 Gap |
+| Release Orchestrator | 5 | 0 | 0% | 🔴 Gap |
+| Attestor | 5 | 0 | 0% | 🔴 Gap |
+| Feed Connectors | 4 | 0 | 0% | 🔴 Gap |
+| **Database (Postgres)** | 4 | 4 | 100% | ✅ Complete |
+| **Crypto Subsystem** | 4 | 4 | 100% | ✅ Complete |
+| **Evidence Locker** | 4 | 4 | 100% | ✅ Complete |
+| **Backup/Restore** | 4 | 4 | 100% | ✅ Complete |
+| Authority (OAuth/OIDC) | 3 | 0 | 0% | 🔴 Gap |
+| **Overall** | **43** | **16** | **37%** | 🟡 In Progress |
+
+---
+
+## Available Runbooks
+
+### Database Operations
+- [postgres-ops.md](postgres-ops.md) - PostgreSQL database operations
+
+### Crypto Subsystem
+- [crypto-ops.md](crypto-ops.md) - Regional crypto operations (FIPS, eIDAS, GOST, SM)
+
+### Evidence Locker
+- [evidence-locker-ops.md](evidence-locker-ops.md) - Evidence locker operations
+
+### Backup/Restore
+- [backup-restore-ops.md](backup-restore-ops.md) - Backup and restore procedures
+
+### Vulnerability Operations
+- [vuln-ops.md](vuln-ops.md) - Vulnerability management operations
+
+### VEX Operations
+- [vex-ops.md](vex-ops.md) - VEX statement operations
+
+### Policy Incidents
+- [policy-incident.md](policy-incident.md) - Policy-related incident response
+
+---
+
+## Gap Analysis
+
+### High Priority Gaps (Critical modules without runbooks)
+
+1. **Scanner** - Core scanning functionality
+   - Worker stuck
+   - OOM on large images
+   - Registry auth failures
+
+2. **Policy Engine** - Policy evaluation
+   - Slow evaluation
+   - OPA crashes
+   - Compilation failures
+
+3. **Release Orchestrator** - Promotion workflow
+   - Stuck promotions
+   - Gate timeouts
+   - Missing evidence
+
+### Medium Priority Gaps
+
+4. **Attestor** - Signing and verification
+   - Signing failures
+   - Key expiration
+   - Rekor unavailability
+
+5. **Feed Connectors** - Advisory feeds
+   - NVD failures
+   - Rate limiting
+   - Offline bundle issues
+
+### Lower Priority Gaps
+
+6. **Authority** - Authentication
+   - Token validation failures
+   - OIDC provider issues
+
+---
+
+## Template
+
+New runbooks should use the template: [_template.md](_template.md)
+
+---
+
+## Doctor Check Integration
+
+Runbooks should be linked from Doctor check output. Current integration status:
+
+| Module | Doctor Checks | Linked to Runbook |
+|--------|---------------|-------------------|
+| Postgres | 4 | 0 |
+| Crypto | 8 | 0 |
+| Storage | 3 | 0 |
+| Evidence | 4 | 0 |
+
+**Next step:** Update Doctor check implementations to include runbook links in remediation output.
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/_template.md
+++ b/docs/operations/runbooks/_template.md
@@ -0,0 +1,157 @@
+# Runbook: [Component] - [Failure Scenario]
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-001 - Runbook Template
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
+| **Severity** | Critical / High / Medium / Low |
+| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
+| **Last updated** | [YYYY-MM-DD] |
+| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |
+
+---
+
+## Symptoms
+
+Observable indicators that this failure is occurring:
+
+- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
+- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
+- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
+| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
+| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
+
+---
+
+## Diagnosis
+
+### Quick checks (< 2 minutes)
+
+Run these first to confirm the failure:
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check [relevant-check-id]
+   ```
+
+2. **Check service status:**
+   ```bash
+   stella [component] status
+   ```
+
+3. **Check recent logs:**
+   ```bash
+   stella [component] logs --tail 50 --level error
+   ```
+
+### Deep diagnosis (if quick checks inconclusive)
+
+1. **[Investigation step 1]:**
+   ```bash
+   [command]
+   ```
+   Expected output: [description]
+   If unexpected: [what it means]
+
+2. **[Investigation step 2]:**
+   ```bash
+   [command]
+   ```
+
+3. **Check related services:**
+   - Postgres connectivity: `stella doctor --check check.storage.postgres`
+   - Valkey connectivity: `stella doctor --check check.storage.valkey`
+   - Network connectivity: `stella doctor --check check.network.[target]`
+
+---
+
+## Resolution
+
+### Immediate mitigation (restore service quickly)
+
+Use these steps to restore service, even if root cause isn't fixed yet:
+
+1. **[Mitigation step 1]:**
+   ```bash
+   [command]
+   ```
+   This will: [explanation]
+
+2. **[Mitigation step 2]:**
+   ```bash
+   [command]
+   ```
+
+### Root cause fix
+
+Once service is restored, address the underlying issue:
+
+1. **[Fix step 1]:**
+   ```bash
+   [command]
+   ```
+
+2. **[Fix step 2]:**
+   ```bash
+   [command]
+   ```
+
+3. **Verify fix is complete:**
+   ```bash
+   stella doctor --check [relevant-check-id]
+   ```
+
+### Verification
+
+Confirm the issue is fully resolved:
+
+```bash
+# Re-run the failing operation
+stella [component] [test-command]
+
+# Verify metrics are healthy
+stella obs metrics --filter [component] --last 5m
+
+# Verify no new errors in logs
+stella [component] logs --tail 20 --level error
+```
+
+---
+
+## Prevention
+
+How to prevent this failure from recurring:
+
+- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
+- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
+- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
+- [ ] **Documentation:** [e.g., "Update capacity planning guide"]
+
+---
+
+## Related Resources
+
+- **Architecture doc:** [Link to relevant architecture documentation]
+- **Related runbooks:** [Links to related failure scenarios]
+- **Doctor check source:** [Link to Doctor check implementation]
+- **Grafana dashboard:** [Link to relevant dashboard]
+
+---
+
+## Revision History
+
+| Date | Author | Changes |
+|------|--------|---------|
+| YYYY-MM-DD | [Name] | Initial version |
--- a/docs/operations/runbooks/attestor-hsm-connection.md
+++ b/docs/operations/runbooks/attestor-hsm-connection.md
@@ -0,0 +1,193 @@
+# Runbook: Attestor - HSM Connection Issues
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-005 - Attestor Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Attestor / Cryptography |
+| **Severity** | Critical |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.crypto.hsm-availability` |
+
+---
+
+## Symptoms
+
+- [ ] Signing operations failing with "HSM unavailable"
+- [ ] Alert `AttestorHsmConnectionFailed` firing
+- [ ] Error: "PKCS#11 operation failed" or "HSM session timeout"
+- [ ] Attestations cannot be created
+- [ ] Key operations (sign, verify) failing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | No attestations can be signed; releases blocked |
+| **Data integrity** | Keys are safe in HSM; operations resume when connection restored |
+| **SLA impact** | All signing operations blocked; compliance posture at risk |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.crypto.hsm-availability
+   ```
+
+2. **Check HSM connection status:**
+   ```bash
+   stella crypto hsm status
+   ```
+
+3. **Test HSM connectivity:**
+   ```bash
+   stella crypto hsm test
+   ```
+
+### Deep diagnosis
+
+1. **Check PKCS#11 library status:**
+   ```bash
+   stella crypto hsm pkcs11-status
+   ```
+   Look for: Library loaded, slot available, session active
+
+2. **Check HSM network connectivity:**
+   ```bash
+   stella crypto hsm ping
+   ```
+
+3. **Check HSM session logs:**
+   ```bash
+   stella crypto hsm logs --last 30m
+   ```
+   Look for: Session errors, timeout, authentication failures
+
+4. **Check HSM slot status:**
+   ```bash
+   stella crypto hsm slots list
+   ```
+   Problem if: Slot not found, slot busy, token not present
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Attempt HSM reconnection:**
+   ```bash
+   stella crypto hsm reconnect
+   ```
+
+2. **If HSM unreachable, switch to software signing (if permitted):**
+   ```bash
+   stella attest config set signing.mode software
+   stella attest reload
+   ```
+   **Warning:** Software signing may not meet compliance requirements
+
+3. **Use backup HSM if configured:**
+   ```bash
+   stella crypto hsm failover --to backup
+   ```
+
+### Root cause fix
+
+**If network connectivity issue:**
+
+1. Check HSM network path:
+   ```bash
+   stella crypto hsm connectivity --verbose
+   ```
+
+2. Verify firewall rules allow HSM port (typically 1792 for Luna, 2225 for SafeNet)
+
+3. Check HSM server status with vendor tools
+
+**If session timeout:**
+
+1. Increase session timeout:
+   ```bash
+   stella crypto hsm config set session.timeout 300s
+   stella crypto hsm reconnect
+   ```
+
+2. Enable session keep-alive:
+   ```bash
+   stella crypto hsm config set session.keepalive true
+   stella crypto hsm config set session.keepalive_interval 60s
+   ```
+
+**If authentication failed:**
+
+1. Verify HSM credentials:
+   ```bash
+   stella crypto hsm auth verify
+   ```
+
+2. Update HSM PIN if changed:
+   ```bash
+   stella crypto hsm auth update --slot <slot-id>
+   ```
+
+**If PKCS#11 library issue:**
+
+1. Verify library path:
+   ```bash
+   stella crypto hsm config get pkcs11.library_path
+   ```
+
+2. Reload PKCS#11 library:
+   ```bash
+   stella crypto hsm pkcs11-reload
+   ```
+
+3. Check library compatibility:
+   ```bash
+   stella crypto hsm pkcs11-info
+   ```
+
+### Verification
+
+```bash
+# Test HSM connectivity
+stella crypto hsm test
+
+# Test signing operation
+stella attest test-sign
+
+# Verify key access
+stella keys verify <key-id> --operation sign
+
+# Check no errors in logs
+stella crypto hsm logs --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Redundancy:** Configure backup HSM for failover
+- [ ] **Monitoring:** Alert on HSM connection failures immediately
+- [ ] **Keep-alive:** Enable session keep-alive to prevent timeouts
+- [ ] **Testing:** Include HSM health in regular health checks
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/cryptography/hsm-integration.md`
+- **Related runbooks:** `attestor-signing-failed.md`, `crypto-ops.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Crypto/`
+- **HSM setup:** `docs/operations/hsm-configuration.md`
--- a/docs/operations/runbooks/attestor-key-expired.md
+++ b/docs/operations/runbooks/attestor-key-expired.md
@@ -0,0 +1,190 @@
+# Runbook: Attestor - Signing Key Expired
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-005 - Attestor Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Attestor |
+| **Severity** | Critical |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.attestor.key-expiration` |
+
+---
+
+## Symptoms
+
+- [ ] Attestation creation failing with "key expired" error
+- [ ] Alert `AttestorKeyExpired` firing
+- [ ] Error: "signing key certificate has expired"
+- [ ] New attestations cannot be created
+- [ ] Verification of new attestations failing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | No new attestations can be signed; releases blocked |
+| **Data integrity** | Existing attestations remain valid; new ones cannot be created |
+| **SLA impact** | Release SLO violated; compliance posture compromised |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.attestor.key-expiration
+   ```
+
+2. **List signing keys and expiration:**
+   ```bash
+   stella keys list --type signing --show-expiration
+   ```
+   Look for: Keys with status "expired" or expiring soon
+
+3. **Check active signing key:**
+   ```bash
+   stella attest config get signing.key_id
+   stella keys show <key-id> --details
+   ```
+
+### Deep diagnosis
+
+1. **Check certificate chain validity:**
+   ```bash
+   stella crypto cert verify-chain --key <key-id>
+   ```
+   Problem if: Any certificate in chain expired
+
+2. **Check for backup keys:**
+   ```bash
+   stella keys list --type signing --status inactive
+   ```
+   Look for: Unexpired backup keys that can be activated
+
+3. **Check key rotation history:**
+   ```bash
+   stella keys rotation-history --key <key-id>
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If backup key available, activate it:**
+   ```bash
+   stella keys activate <backup-key-id>
+   stella attest config set signing.key_id <backup-key-id>
+   stella attest reload
+   ```
+
+2. **Verify signing works:**
+   ```bash
+   stella attest test-sign
+   ```
+
+3. **Retry failed attestations:**
+   ```bash
+   stella attest retry --failed --last 1h
+   ```
+
+### Root cause fix
+
+**Generate new signing key:**
+
+1. Generate new key pair:
+   ```bash
+   stella keys generate \
+     --type signing \
+     --algorithm ecdsa-p256 \
+     --validity 365d \
+     --name "signing-key-$(date +%Y%m%d)"
+   ```
+
+2. If using HSM:
+   ```bash
+   stella keys generate \
+     --type signing \
+     --algorithm ecdsa-p256 \
+     --validity 365d \
+     --hsm-slot <slot> \
+     --name "signing-key-$(date +%Y%m%d)"
+   ```
+
+3. Register the new key:
+   ```bash
+   stella keys register <new-key-id> --purpose attestation-signing
+   ```
+
+4. Update signing configuration:
+   ```bash
+   stella attest config set signing.key_id <new-key-id>
+   stella attest reload
+   ```
+
+5. Publish new public key to trust anchors:
+   ```bash
+   stella issuer keys publish <new-key-id>
+   ```
+
+**Configure automatic rotation:**
+
+1. Enable auto-rotation:
+   ```bash
+   stella keys config set rotation.auto true
+   stella keys config set rotation.before_expiry 30d
+   stella keys config set rotation.overlap_days 14
+   ```
+
+2. Set up rotation alerts:
+   ```bash
+   stella keys config set alerts.expiring_days 30
+   stella keys config set alerts.expiring_days_critical 7
+   ```
+
+### Verification
+
+```bash
+# Verify new key is active
+stella keys list --type signing --status active
+
+# Test signing
+stella attest test-sign
+
+# Create test attestation
+stella attest create --type test --subject "test:key-rotation"
+
+# Verify the attestation
+stella verify attestation --last
+
+# Check key expiration
+stella keys show <new-key-id> --details | grep -i expir
+```
+
+---
+
+## Prevention
+
+- [ ] **Rotation:** Enable automatic key rotation 30 days before expiry
+- [ ] **Monitoring:** Alert on keys expiring within 30 days (warning) and 7 days (critical)
+- [ ] **Backup:** Maintain at least one backup signing key
+- [ ] **Documentation:** Document key rotation procedures and approval process
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/attestor/architecture.md`
+- **Related runbooks:** `attestor-signing-failed.md`, `attestor-hsm-connection.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
+- **Key management:** `docs/operations/key-management.md`
--- a/docs/operations/runbooks/attestor-rekor-unavailable.md
+++ b/docs/operations/runbooks/attestor-rekor-unavailable.md
@@ -0,0 +1,184 @@
+# Runbook: Attestor - Rekor Transparency Log Unreachable
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-005 - Attestor Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Attestor |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.attestor.rekor-connectivity` |
+
+---
+
+## Symptoms
+
+- [ ] Attestation transparency logging failing
+- [ ] Alert `AttestorRekorUnavailable` firing
+- [ ] Error: "Rekor server unavailable" or "transparency log submission failed"
+- [ ] Attestations created but not anchored to transparency log
+- [ ] Verification failing due to missing log entry
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Attestations not publicly verifiable via transparency log |
+| **Data integrity** | Attestations still valid locally; transparency reduced |
+| **SLA impact** | Compliance may require transparency log anchoring |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.attestor.rekor-connectivity
+   ```
+
+2. **Check Rekor connectivity:**
+   ```bash
+   stella attest rekor status
+   ```
+
+3. **Test Rekor endpoint:**
+   ```bash
+   stella attest rekor ping
+   ```
+
+### Deep diagnosis
+
+1. **Check Rekor server URL:**
+   ```bash
+   stella attest config get rekor.url
+   ```
+   Default: https://rekor.sigstore.dev
+
+2. **Check for public Rekor outage:**
+   ```bash
+   stella attest rekor api-status
+   ```
+   Also check: https://status.sigstore.dev/
+
+3. **Check network/proxy issues:**
+   ```bash
+   stella attest rekor test --verbose
+   ```
+   Look for: TLS errors, proxy blocks, timeout
+
+4. **Check pending log entries:**
+   ```bash
+   stella attest rekor pending-entries
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Queue attestations for later submission:**
+   ```bash
+   stella attest config set rekor.queue_on_failure true
+   stella attest reload
+   ```
+
+2. **Disable Rekor requirement temporarily:**
+   ```bash
+   stella attest config set rekor.required false
+   stella attest reload
+   ```
+   **Warning:** Reduces transparency guarantees
+
+3. **Use private Rekor instance if available:**
+   ```bash
+   stella attest config set rekor.url https://rekor.internal.example.com
+   stella attest reload
+   ```
+
+### Root cause fix
+
+**If public Rekor outage:**
+
+1. Wait for Sigstore to resolve the issue
+2. Check status at https://status.sigstore.dev/
+3. Process queued entries when service recovers:
+   ```bash
+   stella attest rekor process-queue
+   ```
+
+**If network/firewall issue:**
+
+1. Verify outbound HTTPS to rekor.sigstore.dev:
+   ```bash
+   stella attest rekor connectivity --verbose
+   ```
+
+2. Configure proxy if required:
+   ```bash
+   stella attest config set rekor.proxy https://proxy:8080
+   ```
+
+3. Add Rekor endpoints to firewall allowlist:
+   - rekor.sigstore.dev:443
+   - fulcio.sigstore.dev:443 (for certificate issuance)
+
+**If TLS certificate issue:**
+
+1. Check certificate validity:
+   ```bash
+   stella attest rekor cert-check
+   ```
+
+2. Update CA certificates:
+   ```bash
+   stella crypto ca update
+   ```
+
+**If private Rekor instance issue:**
+
+1. Check private Rekor server status
+2. Verify Rekor database health
+3. Check Rekor signer availability
+
+### Verification
+
+```bash
+# Test Rekor connectivity
+stella attest rekor ping
+
+# Submit test entry
+stella attest rekor test-submit
+
+# Process any queued entries
+stella attest rekor process-queue
+
+# Verify recent attestation in log
+stella attest rekor lookup --attestation <attestation-id>
+```
+
+---
+
+## Prevention
+
+- [ ] **Redundancy:** Configure private Rekor instance as fallback
+- [ ] **Queuing:** Enable queue-on-failure for resilience
+- [ ] **Monitoring:** Alert on Rekor submission failures
+- [ ] **Offline:** Document attestation validity without Rekor for air-gap scenarios
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/attestor/transparency-log.md`
+- **Related runbooks:** `attestor-signing-failed.md`, `attestor-verification-failed.md`
+- **Sigstore docs:** https://docs.sigstore.dev/
+- **Rekor setup:** `docs/operations/rekor-configuration.md`
--- a/docs/operations/runbooks/attestor-signing-failed.md
+++ b/docs/operations/runbooks/attestor-signing-failed.md
@@ -0,0 +1,176 @@
+# Runbook: Attestor - Signature Generation Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-005 - Attestor Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Attestor |
+| **Severity** | Critical |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.attestor.signing-health` |
+
+---
+
+## Symptoms
+
+- [ ] Attestation requests failing with "signing failed" error
+- [ ] Alert `AttestorSigningFailed` firing
+- [ ] Evidence bundles missing signatures
+- [ ] Metric `attestor_signing_failures_total` increasing
+- [ ] Release pipeline blocked due to unsigned attestations
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Releases blocked; attestations cannot be created |
+| **Data integrity** | Evidence is recorded but unsigned; can be signed later |
+| **SLA impact** | Release SLO violated; evidence integrity compromised |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.attestor.signing-health
+   ```
+
+2. **Check attestor service status:**
+   ```bash
+   stella attest status
+   ```
+
+3. **Check signing key availability:**
+   ```bash
+   stella keys list --type signing --status active
+   ```
+   Problem if: No active signing keys
+
+### Deep diagnosis
+
+1. **Test signing operation:**
+   ```bash
+   stella attest test-sign --verbose
+   ```
+   Look for: Specific error message
+
+2. **Check key material access:**
+   ```bash
+   stella keys verify <key-id> --operation sign
+   ```
+
+3. **If using HSM, check HSM connectivity:**
+   ```bash
+   stella doctor --check check.crypto.hsm-availability
+   ```
+
+4. **Check for key expiration:**
+   ```bash
+   stella keys list --expiring-within 7d
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If key expired, rotate to backup key:**
+   ```bash
+   stella keys activate <backup-key-id>
+   stella attest config set signing.key_id <backup-key-id>
+   ```
+
+2. **If HSM unavailable, switch to software signing (temporary):**
+   ```bash
+   stella attest config set signing.mode software
+   stella attest reload
+   ```
+   ⚠️ **Warning:** Software signing may not meet compliance requirements
+
+3. **Retry failed attestations:**
+   ```bash
+   stella attest retry --failed --last 1h
+   ```
+
+### Root cause fix
+
+**If key expired:**
+
+1. Generate new signing key:
+   ```bash
+   stella keys generate --type signing --algorithm ecdsa-p256
+   ```
+
+2. Configure key rotation schedule:
+   ```bash
+   stella keys config set rotation.auto true
+   stella keys config set rotation.overlap_days 14
+   ```
+
+**If HSM connection failed:**
+
+1. Verify HSM configuration:
+   ```bash
+   stella crypto hsm verify
+   ```
+
+2. Restart HSM connection:
+   ```bash
+   stella crypto hsm reconnect
+   ```
+
+**If certificate chain issue:**
+
+1. Verify certificate chain:
+   ```bash
+   stella crypto cert verify-chain --key <key-id>
+   ```
+
+2. Update intermediate certificates:
+   ```bash
+   stella crypto cert update-chain --key <key-id>
+   ```
+
+### Verification
+
+```bash
+# Test signing
+stella attest test-sign
+
+# Create test attestation
+stella attest create --type test --subject "test:verification"
+
+# Verify the attestation
+stella verify attestation --last
+
+# Check no failures in recent operations
+stella attest logs --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Key rotation:** Enable automatic key rotation with 14-day overlap
+- [ ] **Monitoring:** Alert on keys expiring within 30 days
+- [ ] **Backup:** Maintain backup signing key in different HSM slot
+- [ ] **Testing:** Include signing test in health check schedule
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/attestor/architecture.md`
+- **Related runbooks:** `attestor-key-expired.md`, `attestor-hsm-connection.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
+- **Dashboard:** Grafana > Stella Ops > Attestor
--- a/docs/operations/runbooks/attestor-verification-failed.md
+++ b/docs/operations/runbooks/attestor-verification-failed.md
@@ -0,0 +1,195 @@
+# Runbook: Attestor - Attestation Verification Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-005 - Attestor Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Attestor |
+| **Severity** | High |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.attestor.verification-health` |
+
+---
+
+## Symptoms
+
+- [ ] Attestation verification failing
+- [ ] Alert `AttestorVerificationFailed` firing
+- [ ] Error: "signature verification failed" or "invalid attestation"
+- [ ] Promotions blocked due to failed verification
+- [ ] Error: "trust anchor not found" or "certificate chain invalid"
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Artifacts cannot be promoted; release blocked |
+| **Data integrity** | May indicate tampered attestation or configuration issue |
+| **SLA impact** | Release pipeline blocked until resolved |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.attestor.verification-health
+   ```
+
+2. **Verify specific attestation:**
+   ```bash
+   stella verify attestation --attestation <attestation-id> --verbose
+   ```
+
+3. **Check trust anchors:**
+   ```bash
+   stella trust-anchors list
+   ```
+
+### Deep diagnosis
+
+1. **Check attestation details:**
+   ```bash
+   stella attest show <attestation-id> --details
+   ```
+   Look for: Signer identity, timestamp, subject
+
+2. **Verify certificate chain:**
+   ```bash
+   stella verify cert-chain --attestation <attestation-id>
+   ```
+   Problem if: Intermediate cert missing, root not trusted
+
+3. **Check public key availability:**
+   ```bash
+   stella keys show <key-id> --public
+   ```
+
+4. **Check if issuer is trusted:**
+   ```bash
+   stella issuer trust-status <issuer-id>
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If trust anchor missing, add it:**
+   ```bash
+   stella trust-anchors add --cert <issuer-cert.pem>
+   ```
+
+2. **If intermediate cert missing:**
+   ```bash
+   stella trust-anchors add-intermediate --cert <intermediate.pem>
+   ```
+
+3. **Re-verify with verbose output:**
+   ```bash
+   stella verify attestation --attestation <attestation-id> --verbose
+   ```
+
+### Root cause fix
+
+**If signature mismatch:**
+
+1. Check attestation wasn't modified:
+   ```bash
+   stella attest integrity-check <attestation-id>
+   ```
+
+2. If modified, regenerate attestation:
+   ```bash
+   stella attest create --subject <digest> --type <type> --force
+   ```
+
+**If key rotated and old key not trusted:**
+
+1. Add old public key to trust anchors:
+   ```bash
+   stella trust-anchors add-key --key <old-key.pem> --expires <date>
+   ```
+
+2. Or fetch from issuer directory:
+   ```bash
+   stella issuer keys fetch <issuer-id>
+   ```
+
+**If certificate expired:**
+
+1. Check certificate validity:
+   ```bash
+   stella verify cert --attestation <attestation-id> --show-expiry
+   ```
+
+2. Re-sign with valid certificate:
+   ```bash
+   stella attest resign <attestation-id>
+   ```
+
+**If issuer not trusted:**
+
+1. Verify issuer identity:
+   ```bash
+   stella issuer show <issuer-id>
+   ```
+
+2. Add to trusted issuers (requires approval):
+   ```bash
+   stella issuer trust <issuer-id> --reason "Approved by security team"
+   ```
+
+**If algorithm not supported:**
+
+1. Check algorithm:
+   ```bash
+   stella attest show <attestation-id> | grep algorithm
+   ```
+
+2. Verify crypto provider supports algorithm:
+   ```bash
+   stella crypto providers list --algorithms
+   ```
+
+### Verification
+
+```bash
+# Verify attestation
+stella verify attestation --attestation <attestation-id>
+
+# Verify trust chain
+stella verify cert-chain --attestation <attestation-id>
+
+# Test end-to-end verification
+stella verify artifact --digest <digest>
+
+# Check no verification errors
+stella attest logs --filter "verification" --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Trust anchors:** Keep trust anchor list current with all valid issuer certs
+- [ ] **Key rotation:** Plan key rotation with overlap period for verification continuity
+- [ ] **Monitoring:** Alert on verification failure rate > 0
+- [ ] **Testing:** Include verification tests in release pipeline
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/attestor/verification.md`
+- **Related runbooks:** `attestor-signing-failed.md`, `attestor-key-expired.md`
+- **Trust management:** `docs/operations/trust-anchors.md`
--- a/docs/operations/runbooks/backup-restore-ops.md
+++ b/docs/operations/runbooks/backup-restore-ops.md
@@ -0,0 +1,449 @@
+# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
+# Task: RUN-004 - Backup/Restore Runbook
+# Backup and Restore Operations Runbook
+
+Status: PRODUCTION-READY (2026-01-17 UTC)
+
+## Scope
+Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
+
+---
+
+## Backup Architecture Overview
+
+### Backup Components
+
+| Component | Backup Type | Default Schedule | Retention |
+|-----------|-------------|------------------|-----------|
+| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
+| Evidence Locker | Incremental | Daily | 90 days |
+| Configuration | Snapshot | Daily + on change | 90 days |
+| Secrets | Encrypted snapshot | Daily | 30 days |
+| Attestation Keys | Encrypted export | Weekly | 1 year |
+
+### Storage Locations
+
+- **Primary:** `/var/lib/stellaops/backups/` (local)
+- **Secondary:** S3/Azure Blob/GCS (configurable)
+- **Offline:** Removable media for air-gap scenarios
+
+---
+
+## Pre-flight Checklist
+
+### Environment Verification
+```bash
+# Check backup service status
+stella backup status
+
+# Verify backup storage
+stella doctor --check check.storage.backup
+
+# List recent backups
+stella backup list --last 7d
+
+# Test backup restore capability
+stella backup test-restore --latest --dry-run
+```
+
+### Metrics to Watch
+- `stella_backup_last_success_timestamp` - Last successful backup
+- `stella_backup_duration_seconds` - Backup duration
+- `stella_backup_size_bytes` - Backup size
+- `stella_restore_test_last_success` - Last restore test
+
+---
+
+## Standard Procedures
+
+### SP-001: Create Manual Backup
+
+**When:** Before upgrades, schema changes, or major configuration changes
+**Duration:** 5-30 minutes depending on data volume
+
+1. Create full system backup:
+   ```bash
+   stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
+   ```
+
+2. Or create component-specific backup:
+   ```bash
+   # Database only
+   stella backup create --type database --name "db-pre-migration"
+   
+   # Evidence locker only
+   stella backup create --type evidence --name "evidence-snapshot"
+   
+   # Configuration only
+   stella backup create --type config --name "config-backup"
+   ```
+
+3. Verify backup:
+   ```bash
+   stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
+   ```
+
+4. Copy to offsite storage (recommended):
+   ```bash
+   stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
+   ```
+
+### SP-002: Verify Backup Integrity
+
+**Frequency:** Weekly
+**Duration:** 15-60 minutes
+
+1. List backups for verification:
+   ```bash
+   stella backup list --unverified
+   ```
+
+2. Verify backup integrity:
+   ```bash
+   # Verify specific backup
+   stella backup verify --name <backup-name>
+   
+   # Verify all unverified
+   stella backup verify --all-unverified
+   ```
+
+3. Test restore (non-destructive):
+   ```bash
+   stella backup test-restore --name <backup-name> --target /tmp/restore-test
+   ```
+
+4. Record verification result:
+   ```bash
+   stella backup log-verification --name <backup-name> --result success
+   ```
+
+### SP-003: Restore from Backup
+
+**CAUTION: This is a destructive operation**
+
+#### Full System Restore
+
+1. Stop all services:
+   ```bash
+   stella service stop --all
+   ```
+
+2. List available backups:
+   ```bash
+   stella backup list --type full
+   ```
+
+3. Restore:
+   ```bash
+   # Dry run first
+   stella backup restore --name <backup-name> --dry-run
+   
+   # Execute restore
+   stella backup restore --name <backup-name> --confirm
+   ```
+
+4. Start services:
+   ```bash
+   stella service start --all
+   ```
+
+5. Verify restoration:
+   ```bash
+   stella doctor --all
+   stella service health
+   ```
+
+#### Component-Specific Restore
+
+1. Database restore:
+   ```bash
+   stella service stop --service api,release-orchestrator
+   stella backup restore --type database --name <backup-name> --confirm
+   stella db migrate  # Apply any pending migrations
+   stella service start --service api,release-orchestrator
+   ```
+
+2. Evidence locker restore:
+   ```bash
+   stella backup restore --type evidence --name <backup-name> --confirm
+   stella evidence verify --mode quick
+   ```
+
+3. Configuration restore:
+   ```bash
+   stella backup restore --type config --name <backup-name> --confirm
+   stella service restart --graceful
+   ```
+
+### SP-004: Point-in-Time Recovery (Database)
+
+1. Identify target recovery point:
+   ```bash
+   # List WAL archives
+   stella backup wal-list --after <start-date> --before <end-date>
+   ```
+
+2. Perform PITR:
+   ```bash
+   stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
+   ```
+
+3. Verify data state:
+   ```bash
+   stella db verify-integrity
+   ```
+
+---
+
+## Backup Schedules
+
+### Configure Backup Schedule
+
+```bash
+# View current schedule
+stella backup schedule show
+
+# Set database backup schedule
+stella backup schedule set --type database --cron "0 2 * * *"
+
+# Set evidence backup schedule
+stella backup schedule set --type evidence --cron "0 3 * * *"
+
+# Set configuration backup schedule
+stella backup schedule set --type config --cron "0 4 * * *" --on-change
+```
+
+### Retention Policy
+
+```bash
+# View retention policy
+stella backup retention show
+
+# Set retention
+stella backup retention set --type database --days 30
+stella backup retention set --type evidence --days 90
+stella backup retention set --type config --days 90
+
+# Apply retention (cleanup old backups)
+stella backup retention apply
+```
+
+---
+
+## Incident Procedures
+
+### INC-001: Backup Failure
+
+**Symptoms:**
+- Alert: `StellaBackupFailed`
+- Missing recent backup
+
+**Investigation:**
+```bash
+# Check backup logs
+stella backup logs --last 24h
+
+# Check disk space
+stella doctor --check check.storage.diskspace,check.storage.backup
+
+# Test backup operation
+stella backup test --type database
+```
+
+**Resolution:**
+
+1. **Disk space issue:**
+   ```bash
+   stella backup retention apply --force
+   stella backup cleanup --expired
+   ```
+
+2. **Database connectivity:**
+   ```bash
+   stella doctor --check check.postgres.connectivity
+   ```
+
+3. **Permission issue:**
+   - Check backup directory permissions
+   - Verify service account access
+
+4. **Retry backup:**
+   ```bash
+   stella backup create --type <failed-type> --retry
+   ```
+
+### INC-002: Restore Failure
+
+**Symptoms:**
+- Restore command fails
+- Services not starting after restore
+
+**Investigation:**
+```bash
+# Check restore logs
+stella backup restore-logs --last-attempt
+
+# Verify backup integrity
+stella backup verify --name <backup-name>
+
+# Check disk space
+stella doctor --check check.storage.diskspace
+```
+
+**Resolution:**
+
+1. **Corrupted backup:**
+   ```bash
+   # Try previous backup
+   stella backup list --type <type>
+   stella backup restore --name <previous-backup> --confirm
+   ```
+
+2. **Version mismatch:**
+   ```bash
+   # Check backup version
+   stella backup info --name <backup-name>
+   
+   # Restore with migration
+   stella backup restore --name <backup-name> --with-migration
+   ```
+
+3. **Disk space:**
+   - Free space or expand volume
+   - Restore to alternate location
+
+### INC-003: Backup Storage Full
+
+**Symptoms:**
+- Alert: `StellaBackupStorageFull`
+- New backups failing
+
+**Immediate Actions:**
+```bash
+# Check storage
+stella backup storage stats
+
+# Emergency cleanup
+stella backup cleanup --keep-last 3
+
+# Delete specific old backups
+stella backup delete --older-than 14d --confirm
+```
+
+**Resolution:**
+
+1. **Adjust retention:**
+   ```bash
+   stella backup retention set --type database --days 14
+   stella backup retention apply
+   ```
+
+2. **Expand storage:**
+   - Add disk space
+   - Configure offsite storage
+
+3. **Archive to cold storage:**
+   ```bash
+   stella backup archive --older-than 30d --destination s3://archive-bucket/
+   ```
+
+---
+
+## Disaster Recovery Scenarios
+
+### DR-001: Complete System Loss
+
+1. Provision new infrastructure
+2. Install Stella Ops
+3. Restore from offsite backup:
+   ```bash
+   stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
+   ```
+4. Verify all components
+5. Update DNS/load balancer
+
+### DR-002: Database Corruption
+
+1. Stop services
+2. Restore database from latest clean backup:
+   ```bash
+   stella backup restore --type database --name <last-known-good>
+   ```
+3. Apply WAL to near-corruption point (PITR)
+4. Verify data integrity
+5. Resume services
+
+### DR-003: Evidence Locker Loss
+
+1. Restore evidence from backup:
+   ```bash
+   stella backup restore --type evidence --name <backup-name>
+   ```
+2. Rebuild index:
+   ```bash
+   stella evidence index rebuild
+   ```
+3. Verify anchor chain:
+   ```bash
+   stella evidence anchor verify --all
+   ```
+
+---
+
+## Offline/Air-Gap Backup
+
+### Creating Offline Backup
+
+```bash
+# Create encrypted offline bundle
+stella backup create-offline \
+  --output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
+  --encrypt \
+  --passphrase-file /secure/backup-key
+
+# Verify offline backup
+stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
+```
+
+### Restoring from Offline Backup
+
+```bash
+# Restore from offline backup
+stella backup restore-offline \
+  --input /media/usb/stellaops-backup-*.enc \
+  --passphrase-file /secure/backup-key \
+  --confirm
+```
+
+---
+
+## Monitoring Dashboard
+
+Access: Grafana → Dashboards → Stella Ops → Backup Status
+
+Key panels:
+- Last backup success time
+- Backup size trend
+- Backup duration
+- Restore test status
+- Storage utilization
+
+---
+
+## Evidence Capture
+
+```bash
+stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
+```
+
+---
+
+## Escalation Path
+
+1. **L1 (On-call):** Retry failed backups, basic troubleshooting
+2. **L2 (Platform team):** Restore operations, schedule adjustments
+3. **L3 (Architecture):** Disaster recovery execution
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/connector-ghsa.md
+++ b/docs/operations/runbooks/connector-ghsa.md
@@ -0,0 +1,196 @@
+# Runbook: Feed Connector - GitHub Security Advisories (GHSA) Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-006 - Feed Connector Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Concelier / GHSA Connector |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.connector.ghsa-health` |
+
+---
+
+## Symptoms
+
+- [ ] GHSA feed sync failing or stale
+- [ ] Alert `ConnectorGhsaSyncFailed` firing
+- [ ] Error: "GitHub API rate limit exceeded" or "GraphQL query failed"
+- [ ] GitHub Advisory Database vulnerabilities missing
+- [ ] Metric `connector_sync_failures_total{source="ghsa"}` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | GitHub ecosystem vulnerabilities may be missed |
+| **Data integrity** | Data becomes stale; no data loss |
+| **SLA impact** | Vulnerability currency SLO violated for GitHub packages |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.connector.ghsa-health
+   ```
+
+2. **Check GHSA sync status:**
+   ```bash
+   stella admin feeds status --source ghsa
+   ```
+
+3. **Test GitHub API connectivity:**
+   ```bash
+   stella connector test ghsa
+   ```
+
+### Deep diagnosis
+
+1. **Check GitHub API rate limit:**
+   ```bash
+   stella connector ghsa rate-limit-status
+   ```
+   Problem if: Remaining = 0, rate limit exceeded
+
+2. **Check GitHub token permissions:**
+   ```bash
+   stella connector credentials show ghsa --check-scopes
+   ```
+   Required scopes: `public_repo`, `read:packages` (for private advisory access)
+
+3. **Check sync logs:**
+   ```bash
+   stella connector logs ghsa --last 1h --level error
+   ```
+   Look for: GraphQL errors, pagination issues, timeout
+
+4. **Check for GitHub API outage:**
+   ```bash
+   stella connector ghsa api-status
+   ```
+   Also check: https://www.githubstatus.com/
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If rate limited, wait for reset:**
+   ```bash
+   stella connector ghsa rate-limit-status
+   # Note the reset time, then:
+   stella admin feeds refresh --source ghsa
+   ```
+
+2. **Use secondary token if available:**
+   ```bash
+   stella connector credentials rotate ghsa --to secondary
+   stella admin feeds refresh --source ghsa
+   ```
+
+3. **Load from offline bundle:**
+   ```bash
+   stella offline load --source ghsa --package ghsa-bundle-latest.tar.gz
+   ```
+
+### Root cause fix
+
+**If rate limit consistently exceeded:**
+
+1. Increase sync interval:
+   ```bash
+   stella connector config set ghsa.sync_interval 4h
+   ```
+
+2. Enable incremental sync:
+   ```bash
+   stella connector config set ghsa.incremental_sync true
+   ```
+
+3. Use authenticated requests (10x rate limit):
+   ```bash
+   stella connector credentials update ghsa --token <github-pat>
+   ```
+
+**If token expired or invalid:**
+
+1. Generate new GitHub PAT at https://github.com/settings/tokens
+
+2. Update token:
+   ```bash
+   stella connector credentials update ghsa --token <new-token>
+   ```
+
+3. Verify scopes:
+   ```bash
+   stella connector credentials show ghsa --check-scopes
+   ```
+
+**If GraphQL query failing:**
+
+1. Check for API schema changes:
+   ```bash
+   stella connector ghsa schema-check
+   ```
+
+2. Update connector if schema changed:
+   ```bash
+   stella upgrade --component connector-ghsa
+   ```
+
+**If pagination broken:**
+
+1. Reset sync cursor:
+   ```bash
+   stella connector ghsa reset-cursor
+   ```
+
+2. Force full resync:
+   ```bash
+   stella admin feeds refresh --source ghsa --full
+   ```
+
+### Verification
+
+```bash
+# Force sync
+stella admin feeds refresh --source ghsa
+
+# Monitor sync progress
+stella admin feeds status --source ghsa --watch
+
+# Verify recent advisories present
+stella vuln query GHSA-xxxx-xxxx-xxxx  # Use a recent GHSA ID
+
+# Check no errors
+stella connector logs ghsa --level error --last 1h
+```
+
+---
+
+## Prevention
+
+- [ ] **Authentication:** Always use authenticated requests for 5000/hr rate limit
+- [ ] **Monitoring:** Alert on last sync > 12h or sync failures
+- [ ] **Redundancy:** Use NVD/OSV as backup for GitHub ecosystem coverage
+- [ ] **Token rotation:** Rotate tokens before expiration
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/concelier/connectors.md`
+- **Connector config:** `docs/modules/concelier/operations/connectors/ghsa.md`
+- **Related runbooks:** `connector-nvd.md`, `connector-osv.md`
+- **GitHub API docs:** https://docs.github.com/en/graphql
--- a/docs/operations/runbooks/connector-nvd.md
+++ b/docs/operations/runbooks/connector-nvd.md
@@ -0,0 +1,195 @@
+# Runbook: Feed Connector - NVD Connector Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-006 - Feed Connector Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Concelier / NVD Connector |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.connector.nvd-health` |
+
+---
+
+## Symptoms
+
+- [ ] NVD feed sync failing or stale (> 24h since last successful sync)
+- [ ] Alert `ConnectorNvdSyncFailed` firing
+- [ ] Error: "NVD API request failed" or "rate limit exceeded"
+- [ ] Vulnerability data missing or outdated
+- [ ] Metric `connector_sync_failures_total{source="nvd"}` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Vulnerability scans may miss recent CVEs |
+| **Data integrity** | Data becomes stale; no data loss |
+| **SLA impact** | Vulnerability currency SLO violated (target: < 24h) |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.connector.nvd-health
+   ```
+
+2. **Check NVD sync status:**
+   ```bash
+   stella admin feeds status --source nvd
+   ```
+   Look for: Last sync time, error message, sync state
+
+3. **Check NVD API connectivity:**
+   ```bash
+   stella connector test nvd
+   ```
+
+### Deep diagnosis
+
+1. **Check NVD API key status:**
+   ```bash
+   stella connector credentials show nvd
+   ```
+   Problem if: API key expired or rate limit exhausted
+
+2. **Check NVD API rate limit:**
+   ```bash
+   stella connector nvd rate-limit-status
+   ```
+   Problem if: Remaining requests = 0, reset time in future
+
+3. **Check for NVD API outage:**
+   ```bash
+   stella connector nvd api-status
+   ```
+   Also check: https://nvd.nist.gov/general/news
+
+4. **Check sync logs:**
+   ```bash
+   stella connector logs nvd --last 1h --level error
+   ```
+   Look for: HTTP status codes, timeout errors, parsing failures
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If rate limited, wait for reset:**
+   ```bash
+   stella connector nvd rate-limit-status
+   # Wait for reset time, then:
+   stella admin feeds refresh --source nvd
+   ```
+
+2. **If API key expired, use anonymous mode (slower):**
+   ```bash
+   stella connector config set nvd.api_key_mode anonymous
+   stella admin feeds refresh --source nvd
+   ```
+
+3. **Load from offline bundle if urgent:**
+   ```bash
+   # If you have a recent offline bundle:
+   stella offline load --source nvd --package nvd-bundle-latest.tar.gz
+   ```
+
+### Root cause fix
+
+**If API key expired or invalid:**
+
+1. Generate new NVD API key at https://nvd.nist.gov/developers/request-an-api-key
+
+2. Update API key:
+   ```bash
+   stella connector credentials update nvd --api-key <new-key>
+   ```
+
+3. Verify connectivity:
+   ```bash
+   stella connector test nvd
+   ```
+
+**If rate limit consistently exceeded:**
+
+1. Increase sync interval to reduce API calls:
+   ```bash
+   stella connector config set nvd.sync_interval 6h
+   ```
+
+2. Enable delta sync to reduce data volume:
+   ```bash
+   stella connector config set nvd.delta_sync true
+   ```
+
+3. Request higher rate limit from NVD (if available)
+
+**If network/firewall issue:**
+
+1. Verify outbound connectivity to NVD API:
+   ```bash
+   stella connector test nvd --verbose
+   ```
+
+2. Check proxy configuration if required:
+   ```bash
+   stella connector config set nvd.proxy https://proxy:8080
+   ```
+
+**If data parsing failures:**
+
+1. Check for NVD schema changes:
+   ```bash
+   stella connector nvd schema-check
+   ```
+
+2. Update connector if schema changed:
+   ```bash
+   stella upgrade --component connector-nvd
+   ```
+
+### Verification
+
+```bash
+# Force sync
+stella admin feeds refresh --source nvd --force
+
+# Monitor sync progress
+stella admin feeds status --source nvd --watch
+
+# Verify recent CVEs are present
+stella vuln query CVE-2026-XXXX  # Use a recent CVE ID
+
+# Check no errors in recent logs
+stella connector logs nvd --level error --last 1h
+```
+
+---
+
+## Prevention
+
+- [ ] **API Key:** Always use API key (not anonymous) for 10x rate limit
+- [ ] **Monitoring:** Alert on last sync > 24h or sync failure
+- [ ] **Redundancy:** Configure backup connector (OSV, GitHub Advisory) for overlap
+- [ ] **Offline:** Maintain weekly offline bundle for disaster recovery
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/concelier/connectors.md`
+- **Connector config:** `docs/modules/concelier/operations/connectors/nvd.md`
+- **Related runbooks:** `connector-ghsa.md`, `connector-osv.md`
+- **Dashboard:** Grafana > Stella Ops > Feed Connectors
--- a/docs/operations/runbooks/connector-osv.md
+++ b/docs/operations/runbooks/connector-osv.md
@@ -0,0 +1,193 @@
+# Runbook: Feed Connector - OSV (Open Source Vulnerabilities) Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-006 - Feed Connector Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Concelier / OSV Connector |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.connector.osv-health` |
+
+---
+
+## Symptoms
+
+- [ ] OSV feed sync failing or stale
+- [ ] Alert `ConnectorOsvSyncFailed` firing
+- [ ] Error: "OSV API request failed" or "ecosystem sync failed"
+- [ ] OSV vulnerabilities missing from database
+- [ ] Metric `connector_sync_failures_total{source="osv"}` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Open source ecosystem vulnerabilities may be missed |
+| **Data integrity** | Data becomes stale; no data loss |
+| **SLA impact** | Vulnerability currency SLO violated for affected ecosystems |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.connector.osv-health
+   ```
+
+2. **Check OSV sync status:**
+   ```bash
+   stella admin feeds status --source osv
+   ```
+
+3. **Test OSV API connectivity:**
+   ```bash
+   stella connector test osv
+   ```
+
+### Deep diagnosis
+
+1. **Check ecosystem-specific status:**
+   ```bash
+   stella connector osv ecosystems status
+   ```
+   Look for: Failed ecosystems, stale ecosystems
+
+2. **Check sync logs:**
+   ```bash
+   stella connector logs osv --last 1h --level error
+   ```
+   Look for: API errors, parsing failures, timeout
+
+3. **Check for OSV API outage:**
+   ```bash
+   stella connector osv api-status
+   ```
+   Also check: https://osv.dev/
+
+4. **Check GCS bucket access (OSV uses GCS for bulk data):**
+   ```bash
+   stella connector osv gcs-status
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Retry sync for specific ecosystem:**
+   ```bash
+   stella admin feeds refresh --source osv --ecosystem npm
+   ```
+
+2. **Sync from GCS bucket directly (faster for bulk):**
+   ```bash
+   stella connector osv sync-from-gcs
+   ```
+
+3. **Load from offline bundle:**
+   ```bash
+   stella offline load --source osv --package osv-bundle-latest.tar.gz
+   ```
+
+### Root cause fix
+
+**If API request failing:**
+
+1. Check API endpoint:
+   ```bash
+   stella connector osv api-test
+   ```
+
+2. Verify no proxy blocking:
+   ```bash
+   stella connector config set osv.proxy <proxy-url>
+   ```
+
+**If GCS access failing:**
+
+1. Check GCS connectivity:
+   ```bash
+   stella connector osv gcs-test
+   ```
+
+2. Enable anonymous access (default):
+   ```bash
+   stella connector config set osv.gcs_auth anonymous
+   ```
+
+3. Or configure service account:
+   ```bash
+   stella connector config set osv.gcs_credentials /path/to/sa-key.json
+   ```
+
+**If specific ecosystem failing:**
+
+1. Disable problematic ecosystem temporarily:
+   ```bash
+   stella connector config set osv.ecosystems.disabled <ecosystem>
+   ```
+
+2. Check ecosystem data format:
+   ```bash
+   stella connector osv ecosystem-check <ecosystem>
+   ```
+
+**If parsing errors:**
+
+1. Check for schema changes:
+   ```bash
+   stella connector osv schema-check
+   ```
+
+2. Update connector:
+   ```bash
+   stella upgrade --component connector-osv
+   ```
+
+### Verification
+
+```bash
+# Force sync
+stella admin feeds refresh --source osv
+
+# Monitor sync progress
+stella admin feeds status --source osv --watch
+
+# Verify ecosystem coverage
+stella connector osv ecosystems status
+
+# Query recent vulnerability
+stella vuln query OSV-2026-xxxx
+
+# Check no errors
+stella connector logs osv --level error --last 1h
+```
+
+---
+
+## Prevention
+
+- [ ] **Bulk sync:** Use GCS bulk sync for initial load and daily updates
+- [ ] **Monitoring:** Alert on ecosystem sync failures
+- [ ] **Redundancy:** NVD/GHSA provide overlapping coverage for major ecosystems
+- [ ] **Offline:** Maintain weekly offline bundle
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/concelier/connectors.md`
+- **Connector config:** `docs/modules/concelier/operations/connectors/osv.md`
+- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`
+- **OSV API docs:** https://osv.dev/docs/
--- a/docs/operations/runbooks/connector-vendor-specific.md
+++ b/docs/operations/runbooks/connector-vendor-specific.md
@@ -0,0 +1,220 @@
+# Runbook Template: Feed Connector - Vendor-Specific Connectors
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-006 - Feed Connector Runbooks
+
+## Overview
+
+This is a template runbook for vendor-specific advisory feed connectors (RedHat, Ubuntu, Debian, Oracle, VMware, etc.). Use this template to create runbooks for specific vendor connectors.
+
+---
+
+## Metadata Template
+
+| Field | Value |
+|-------|-------|
+| **Component** | Concelier / [Vendor] Connector |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | [Date] |
+| **Doctor check** | `check.connector.[vendor]-health` |
+
+---
+
+## Common Vendor Connector Issues
+
+### Authentication Failures
+
+**Symptoms:**
+- Sync failing with 401/403 errors
+- "authentication failed" or "invalid credentials"
+
+**Resolution:**
+```bash
+# Check credentials
+stella connector credentials show <vendor>
+
+# Update credentials
+stella connector credentials update <vendor> --api-key <key>
+
+# Test connectivity
+stella connector test <vendor>
+```
+
+### Rate Limiting
+
+**Symptoms:**
+- Sync failing with 429 errors
+- "rate limit exceeded"
+
+**Resolution:**
+```bash
+# Check rate limit status
+stella connector <vendor> rate-limit-status
+
+# Increase sync interval
+stella connector config set <vendor>.sync_interval 6h
+
+# Enable delta sync
+stella connector config set <vendor>.delta_sync true
+```
+
+### Data Format Changes
+
+**Symptoms:**
+- Parsing errors in sync logs
+- "unexpected format" or "schema validation failed"
+
+**Resolution:**
+```bash
+# Check for schema changes
+stella connector <vendor> schema-check
+
+# Update connector
+stella upgrade --component connector-<vendor>
+```
+
+### Offline Bundle Refresh
+
+**Resolution:**
+```bash
+# Create offline bundle
+stella offline sync --feeds <vendor> --output <vendor>-bundle.tar.gz
+
+# Load offline bundle
+stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
+```
+
+---
+
+## Vendor-Specific Runbooks
+
+Use this template to create runbooks for:
+
+### RedHat Security Data
+
+**Endpoint:** https://access.redhat.com/security/data/
+**Authentication:** API token or certificate
+**Connector:** `connector-redhat`
+
+Key commands:
+```bash
+stella connector test redhat
+stella admin feeds status --source redhat
+stella connector redhat cve-map-status  # RHSA to CVE mapping
+```
+
+### Ubuntu Security Notices
+
+**Endpoint:** https://ubuntu.com/security/notices
+**Authentication:** None (public)
+**Connector:** `connector-ubuntu`
+
+Key commands:
+```bash
+stella connector test ubuntu
+stella admin feeds status --source ubuntu
+stella connector ubuntu usn-status  # USN sync status
+```
+
+### Debian Security Tracker
+
+**Endpoint:** https://security-tracker.debian.org/
+**Authentication:** None (public)
+**Connector:** `connector-debian`
+
+Key commands:
+```bash
+stella connector test debian
+stella admin feeds status --source debian
+stella connector debian dla-status  # DLA sync status
+```
+
+### Oracle Security Alerts
+
+**Endpoint:** https://www.oracle.com/security-alerts/
+**Authentication:** Oracle account (optional)
+**Connector:** `connector-oracle`
+
+Key commands:
+```bash
+stella connector test oracle
+stella admin feeds status --source oracle
+stella connector oracle cpu-status  # Critical Patch Update status
+```
+
+### VMware Security Advisories
+
+**Endpoint:** https://www.vmware.com/security/advisories
+**Authentication:** None (public)
+**Connector:** `connector-vmware`
+
+Key commands:
+```bash
+stella connector test vmware
+stella admin feeds status --source vmware
+stella connector vmware vmsa-status  # VMSA sync status
+```
+
+---
+
+## Diagnosis Checklist
+
+For any vendor connector issue:
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.connector.<vendor>-health
+   ```
+
+2. **Check sync status:**
+   ```bash
+   stella admin feeds status --source <vendor>
+   ```
+
+3. **Test connectivity:**
+   ```bash
+   stella connector test <vendor>
+   ```
+
+4. **Check logs:**
+   ```bash
+   stella connector logs <vendor> --last 1h --level error
+   ```
+
+5. **Check credentials (if applicable):**
+   ```bash
+   stella connector credentials show <vendor>
+   ```
+
+---
+
+## Resolution Checklist
+
+1. **Retry sync:**
+   ```bash
+   stella admin feeds refresh --source <vendor>
+   ```
+
+2. **Update credentials (if auth issue):**
+   ```bash
+   stella connector credentials update <vendor>
+   ```
+
+3. **Update connector (if format changed):**
+   ```bash
+   stella upgrade --component connector-<vendor>
+   ```
+
+4. **Load offline bundle (if API unavailable):**
+   ```bash
+   stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
+   ```
+
+---
+
+## Related Resources
+
+- **Connector architecture:** `docs/modules/concelier/connectors.md`
+- **Vendor connector configs:** `docs/modules/concelier/operations/connectors/`
+- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`, `connector-osv.md`
--- a/docs/operations/runbooks/crypto-ops.md
+++ b/docs/operations/runbooks/crypto-ops.md
@@ -0,0 +1,370 @@
+# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
+# Task: RUN-002 - Crypto Subsystem Runbook
+# Regional Crypto Operations Runbook
+
+Status: PRODUCTION-READY (2026-01-17 UTC)
+
+## Scope
+Cryptographic subsystem operations including HSM management, regional crypto profile configuration, key rotation, and certificate management for all supported crypto profiles (International, FIPS, eIDAS, GOST, SM).
+
+---
+
+## Pre-flight Checklist
+
+### Environment Verification
+```bash
+# Check crypto subsystem health
+stella doctor --category crypto
+
+# Verify active crypto profile
+stella crypto profile show
+
+# List loaded crypto providers
+stella crypto providers list
+
+# Check key status
+stella crypto keys status
+```
+
+### Metrics to Watch
+- `stella_crypto_operations_total` - Crypto operation count by type
+- `stella_crypto_operation_duration_seconds` - Signing/verification latency
+- `stella_hsm_availability` - HSM availability (if configured)
+- `stella_cert_expiry_days` - Certificate expiration countdown
+
+---
+
+## Regional Crypto Profiles
+
+### Profile Overview
+
+| Profile | Use Case | Key Algorithms | Compliance |
+|---------|----------|----------------|------------|
+| `international` | Default, most deployments | RSA-2048+, ECDSA P-256/P-384, Ed25519 | General |
+| `fips` | US Government / FedRAMP | FIPS 140-2 approved algorithms only | FIPS 140-2 |
+| `eidas` | European Union | RSA-PSS, ECDSA, Ed25519 per ETSI TS 119 312 | eIDAS |
+| `gost` | Russian Federation | GOST R 34.10-2012, GOST R 34.11-2012 | Russian standards |
+| `sm` | China | SM2, SM3, SM4 | GM/T 0003-2012 |
+
+### Switching Profiles
+
+1. **Pre-switch verification:**
+   ```bash
+   # Verify target profile is available
+   stella crypto profile verify --profile <target-profile>
+   
+   # Check for incompatible existing signatures
+   stella crypto audit --check-compatibility --target-profile <target-profile>
+   ```
+
+2. **Profile switch:**
+   ```bash
+   # Switch profile (requires service restart)
+   stella crypto profile set --profile <target-profile>
+   
+   # Restart services to apply
+   stella service restart --graceful
+   ```
+
+3. **Post-switch verification:**
+   ```bash
+   stella doctor --check check.crypto.fips,check.crypto.eidas,check.crypto.gost,check.crypto.sm
+   ```
+
+---
+
+## Standard Procedures
+
+### SP-001: Key Rotation
+
+**Frequency:** Quarterly or per policy
+**Duration:** ~15 minutes (no downtime)
+
+1. Generate new key:
+   ```bash
+   # For software keys
+   stella crypto keys generate --type signing --algorithm ecdsa-p256 --name signing-$(date +%Y%m)
+   
+   # For HSM-backed keys
+   stella crypto keys generate --type signing --algorithm ecdsa-p256 --provider hsm --name signing-$(date +%Y%m)
+   ```
+
+2. Activate new key:
+   ```bash
+   stella crypto keys activate --name signing-$(date +%Y%m)
+   ```
+
+3. Verify signing with new key:
+   ```bash
+   echo "test" | stella crypto sign --output /dev/null
+   ```
+
+4. Schedule old key deactivation:
+   ```bash
+   stella crypto keys schedule-deactivation --name <old-key-name> --in 30d
+   ```
+
+### SP-002: Certificate Renewal
+
+**When:** Certificate expiring within 30 days
+
+1. Check expiration:
+   ```bash
+   stella crypto certs check-expiry
+   ```
+
+2. Generate CSR:
+   ```bash
+   stella crypto certs csr --subject "CN=stellaops.example.com,O=Example Corp" --output cert.csr
+   ```
+
+3. Install renewed certificate:
+   ```bash
+   stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
+   ```
+
+4. Verify certificate chain:
+   ```bash
+   stella doctor --check check.crypto.certchain
+   ```
+
+5. Restart services:
+   ```bash
+   stella service restart --graceful
+   ```
+
+### SP-003: HSM Health Check
+
+**Frequency:** Daily (automated) or on-demand
+
+1. Check HSM connectivity:
+   ```bash
+   stella crypto hsm status
+   ```
+
+2. Verify slot access:
+   ```bash
+   stella crypto hsm slots list
+   ```
+
+3. Test signing operation:
+   ```bash
+   stella crypto hsm test-sign
+   ```
+
+4. Check HSM metrics:
+   - Free objects/sessions
+   - Temperature/health (vendor-specific)
+
+---
+
+## Incident Procedures
+
+### INC-001: HSM Unavailable
+
+**Symptoms:**
+- Alert: `StellaHsmUnavailable`
+- Signing operations failing with "HSM connection error"
+
+**Investigation:**
+```bash
+# Check HSM status
+stella crypto hsm status
+
+# Test PKCS#11 module
+stella crypto hsm test-module
+
+# Check network to HSM
+stella network test --host <hsm-host> --port <hsm-port>
+```
+
+**Resolution:**
+
+1. **Network issue:**
+   - Verify network path to HSM
+   - Check firewall rules
+   - Verify HSM appliance is powered on
+
+2. **Session exhaustion:**
+   ```bash
+   # Release stale sessions
+   stella crypto hsm sessions release --stale
+   
+   # Restart crypto service
+   stella service restart --service crypto-signer
+   ```
+
+3. **HSM failure:**
+   - Fail over to secondary HSM (if configured)
+   - Contact HSM vendor support
+   - Consider temporary fallback to software keys (with approval)
+
+### INC-002: Signing Key Compromised
+
+**CRITICAL - Follow incident response procedure**
+
+1. **Immediate containment:**
+   ```bash
+   # Revoke compromised key
+   stella crypto keys revoke --name <compromised-key> --reason compromise
+   
+   # Block signing with compromised key
+   stella crypto keys block --name <compromised-key>
+   ```
+
+2. **Generate replacement key:**
+   ```bash
+   stella crypto keys generate --type signing --algorithm ecdsa-p256 --name emergency-signing
+   stella crypto keys activate --name emergency-signing
+   ```
+
+3. **Notify downstream:**
+   - Update trust registries with new key
+   - Notify relying parties
+   - Publish key revocation notice
+
+4. **Forensics:**
+   ```bash
+   # Export key usage audit log
+   stella crypto audit export --key <compromised-key> --output /secure/key-audit.json
+   ```
+
+### INC-003: Certificate Expired
+
+**Symptoms:**
+- TLS connection failures
+- Alert: `StellaCertExpired`
+
+**Immediate Resolution:**
+
+1. If renewed certificate is available:
+   ```bash
+   stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
+   stella service restart --graceful
+   ```
+
+2. If renewal not ready - emergency self-signed (temporary):
+   ```bash
+   # Generate emergency certificate (NOT for production use)
+   stella crypto certs generate-self-signed --days 7 --name emergency
+   stella crypto certs install --cert emergency.pem
+   stella service restart --graceful
+   ```
+
+3. Expedite certificate renewal process
+
+### INC-004: FIPS Mode Not Enabled
+
+**Symptoms:**
+- Alert: `StellaFipsNotEnabled`
+- Compliance audit failure
+
+**Resolution:**
+
+1. **Linux:**
+   ```bash
+   # Enable FIPS mode
+   sudo fips-mode-setup --enable
+   
+   # Reboot required
+   sudo reboot
+   
+   # Verify after reboot
+   fips-mode-setup --check
+   ```
+
+2. **Windows:**
+   - Enable via Group Policy
+   - Or via registry:
+     ```powershell
+     Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" -Name "Enabled" -Value 1
+     Restart-Computer
+     ```
+
+3. Restart Stella services:
+   ```bash
+   stella service restart
+   stella doctor --check check.crypto.fips
+   ```
+
+---
+
+## Regional-Specific Procedures
+
+### GOST Configuration (Russian Federation)
+
+1. Install GOST engine:
+   ```bash
+   sudo apt install libengine-gost-openssl1.1
+   ```
+
+2. Configure Stella:
+   ```bash
+   stella crypto profile set --profile gost
+   stella crypto config set --gost-engine-path /usr/lib/x86_64-linux-gnu/engines-3/gost.so
+   ```
+
+3. Verify:
+   ```bash
+   stella doctor --check check.crypto.gost
+   ```
+
+### SM Configuration (China)
+
+1. Ensure OpenSSL 1.1.1+ with SM support:
+   ```bash
+   openssl version
+   openssl list -cipher-algorithms | grep -i sm
+   ```
+
+2. Configure Stella:
+   ```bash
+   stella crypto profile set --profile sm
+   ```
+
+3. Verify:
+   ```bash
+   stella doctor --check check.crypto.sm
+   ```
+
+---
+
+## Monitoring Dashboard
+
+Access: Grafana → Dashboards → Stella Ops → Crypto Subsystem
+
+Key panels:
+- Signing operation latency
+- Key usage by key ID
+- HSM availability
+- Certificate expiration countdown
+- Crypto profile in use
+
+---
+
+## Evidence Capture
+
+```bash
+# Comprehensive crypto diagnostics
+stella crypto diagnostics --output /tmp/crypto-diag-$(date +%Y%m%dT%H%M%S).tar.gz
+```
+
+Bundle includes:
+- Active crypto profile
+- Key inventory (public keys only)
+- Certificate chain
+- HSM status
+- Operation audit log (last 24h)
+
+---
+
+## Escalation Path
+
+1. **L1 (On-call):** Certificate installs, key activation
+2. **L2 (Security team):** Key rotation, HSM issues
+3. **L3 (Crypto SME):** Algorithm issues, compliance questions
+4. **HSM Vendor:** Hardware failures
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/evidence-locker-ops.md
+++ b/docs/operations/runbooks/evidence-locker-ops.md
@@ -0,0 +1,408 @@
+# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
+# Task: RUN-003 - Evidence Locker Runbook
+# Evidence Locker Operations Runbook
+
+Status: PRODUCTION-READY (2026-01-17 UTC)
+
+## Scope
+Evidence locker operations including storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery procedures.
+
+---
+
+## Pre-flight Checklist
+
+### Environment Verification
+```bash
+# Check evidence locker health
+stella doctor --category evidence
+
+# Verify storage accessibility
+stella evidence status
+
+# Check index health
+stella evidence index status
+
+# Verify anchor chain
+stella evidence anchor verify --latest
+```
+
+### Metrics to Watch
+- `stella_evidence_artifacts_total` - Total artifacts stored
+- `stella_evidence_retrieval_latency_seconds` - Retrieval latency P99
+- `stella_evidence_storage_bytes` - Storage consumption
+- `stella_merkle_anchor_age_seconds` - Time since last anchor
+
+---
+
+## Standard Procedures
+
+### SP-001: Daily Integrity Check
+
+**Frequency:** Daily (automated) or on-demand
+**Duration:** Varies by locker size (typically 5-30 minutes)
+
+1. Run integrity verification:
+   ```bash
+   # Quick check (sample-based)
+   stella evidence verify --mode quick
+   
+   # Full check (all artifacts)
+   stella evidence verify --mode full
+   ```
+
+2. Review results:
+   ```bash
+   stella evidence verify-report --latest
+   ```
+
+3. Address any failures:
+   ```bash
+   # List failed artifacts
+   stella evidence verify-report --latest --filter failed
+   ```
+
+### SP-002: Index Maintenance
+
+**Frequency:** Weekly or after large ingestion
+**Duration:** ~10 minutes
+
+1. Check index health:
+   ```bash
+   stella evidence index status
+   ```
+
+2. Refresh index if needed:
+   ```bash
+   # Incremental refresh
+   stella evidence index refresh
+   
+   # Full rebuild (if corruption suspected)
+   stella evidence index rebuild
+   ```
+
+3. Optimize index:
+   ```bash
+   stella evidence index optimize
+   ```
+
+### SP-003: Merkle Anchoring
+
+**Frequency:** Per policy (default: every 6 hours)
+**Duration:** ~2 minutes
+
+1. Create new anchor:
+   ```bash
+   stella evidence anchor create
+   ```
+
+2. Verify anchor chain:
+   ```bash
+   stella evidence anchor verify --all
+   ```
+
+3. Export anchor for external archival:
+   ```bash
+   stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
+   ```
+
+### SP-004: Storage Cleanup
+
+**Frequency:** Monthly or when storage alerts trigger
+**Duration:** Varies
+
+1. Review storage usage:
+   ```bash
+   stella evidence storage stats
+   ```
+
+2. Apply retention policy:
+   ```bash
+   # Dry run first
+   stella evidence cleanup --apply-retention --dry-run
+   
+   # Execute cleanup
+   stella evidence cleanup --apply-retention
+   ```
+
+3. Archive old evidence (if required):
+   ```bash
+   stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y).tar
+   ```
+
+---
+
+## Incident Procedures
+
+### INC-001: Integrity Verification Failure
+
+**Symptoms:**
+- Alert: `StellaEvidenceIntegrityFailure`
+- Verification reports hash mismatch
+
+**Investigation:**
+```bash
+# Get failure details
+stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
+
+# Check specific artifact
+stella evidence inspect <artifact-id>
+
+# Check provenance
+stella evidence provenance show <artifact-id>
+```
+
+**Resolution:**
+
+1. **Isolated corruption:**
+   ```bash
+   # Attempt recovery from replica (if available)
+   stella evidence recover --id <artifact-id> --source replica
+   
+   # If no replica, mark as corrupted
+   stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
+   ```
+
+2. **Widespread corruption:**
+   - Stop evidence ingestion
+   - Identify corruption extent
+   - Restore from backup if necessary
+   - Escalate to L3
+
+3. **False positive (software bug):**
+   - Verify with multiple hash implementations
+   - Check for recent software updates
+   - Report bug if confirmed
+
+### INC-002: Evidence Retrieval Failure
+
+**Symptoms:**
+- Alert: `StellaEvidenceRetrievalFailed`
+- API returning 404 for known artifacts
+
+**Investigation:**
+```bash
+# Check if artifact exists
+stella evidence exists <artifact-id>
+
+# Check index
+stella evidence index lookup <artifact-id>
+
+# Check storage backend
+stella evidence storage check <artifact-id>
+```
+
+**Resolution:**
+
+1. **Index corruption:**
+   ```bash
+   # Rebuild index
+   stella evidence index rebuild
+   ```
+
+2. **Storage backend issue:**
+   ```bash
+   # Check storage health
+   stella doctor --check check.storage.evidencelocker
+   
+   # Verify storage connectivity
+   stella evidence storage test
+   ```
+
+3. **File system issue:**
+   - Check disk health
+   - Verify file permissions
+   - Check mount status
+
+### INC-003: Anchor Chain Break
+
+**Symptoms:**
+- Alert: `StellaMerkleAnchorChainBroken`
+- Anchor verification fails
+
+**Investigation:**
+```bash
+# Check anchor chain
+stella evidence anchor verify --all --verbose
+
+# Find break point
+stella evidence anchor list --show-links
+
+# Inspect specific anchor
+stella evidence anchor inspect <anchor-id>
+```
+
+**Resolution:**
+
+1. **Single broken link:**
+   ```bash
+   # Attempt to recover from backup
+   stella evidence anchor recover --id <anchor-id> --source backup
+   ```
+
+2. **Multiple breaks:**
+   - Stop new anchoring
+   - Assess extent of damage
+   - Restore from backup or rebuild chain
+
+3. **Create new chain segment:**
+   ```bash
+   # Start new chain (preserves old chain as archived)
+   stella evidence anchor new-chain --reason "chain-break-recovery"
+   ```
+
+### INC-004: Storage Full
+
+**Symptoms:**
+- Alert: `StellaEvidenceStorageFull`
+- Ingestion failing
+
+**Immediate Actions:**
+```bash
+# Check storage usage
+stella evidence storage stats
+
+# Emergency cleanup of temporary files
+stella evidence cleanup --temp-only
+
+# Find large/old artifacts
+stella evidence storage analyze --sort size --limit 20
+```
+
+**Resolution:**
+
+1. **Apply retention policy:**
+   ```bash
+   stella evidence cleanup --apply-retention --aggressive
+   ```
+
+2. **Archive old evidence:**
+   ```bash
+   stella evidence archive --older-than 180d --compress
+   ```
+
+3. **Expand storage:**
+   - Follow cloud provider procedure
+   - Or add additional storage volume
+
+---
+
+## Disaster Recovery
+
+### DR-001: Full Evidence Locker Recovery
+
+**Prerequisites:**
+- Backup available
+- Target storage provisioned
+- Recovery environment ready
+
+**Procedure:**
+
+1. Provision new storage:
+   ```bash
+   stella evidence storage provision --size <size>
+   ```
+
+2. Restore from backup:
+   ```bash
+   # List available backups
+   stella backup list --type evidence-locker
+   
+   # Restore
+   stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
+   ```
+
+3. Verify restoration:
+   ```bash
+   stella evidence verify --mode full
+   stella evidence anchor verify --all
+   ```
+
+4. Update service configuration:
+   ```bash
+   stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
+   stella service restart
+   ```
+
+### DR-002: Point-in-Time Recovery
+
+For recovering to a specific point in time:
+
+1. Identify target anchor:
+   ```bash
+   stella evidence anchor list --before <timestamp>
+   ```
+
+2. Restore to that point:
+   ```bash
+   stella evidence restore --to-anchor <anchor-id>
+   ```
+
+3. Verify integrity:
+   ```bash
+   stella evidence verify --mode full --to-anchor <anchor-id>
+   ```
+
+---
+
+## Offline Mode Operations
+
+### Preparing Offline Evidence Pack
+
+```bash
+# Export evidence for specific artifact
+stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
+
+# Export with all dependencies
+stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
+```
+
+### Verifying Evidence Offline
+
+```bash
+# Verify evidence pack without network
+stella evidence verify --offline --input evidence-pack.tar.gz
+
+# Replay verdict using evidence
+stella replay --evidence evidence-pack.tar.gz --output verdict.json
+```
+
+---
+
+## Monitoring Dashboard
+
+Access: Grafana → Dashboards → Stella Ops → Evidence Locker
+
+Key panels:
+- Artifact ingestion rate
+- Retrieval latency
+- Storage utilization trend
+- Integrity check status
+- Anchor chain health
+
+---
+
+## Evidence Capture
+
+For any incident:
+```bash
+stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
+```
+
+Bundle includes:
+- Index status
+- Storage stats
+- Recent anchor chain
+- Integrity check results
+- Operation audit log
+
+---
+
+## Escalation Path
+
+1. **L1 (On-call):** Standard procedures, cleanup operations
+2. **L2 (Platform team):** Index rebuild, anchor issues
+3. **L3 (Architecture):** Chain recovery, DR procedures
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/orchestrator-evidence-missing.md
+++ b/docs/operations/runbooks/orchestrator-evidence-missing.md
@@ -0,0 +1,183 @@
+# Runbook: Release Orchestrator - Required Evidence Not Found
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-004 - Release Orchestrator Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Release Orchestrator |
+| **Severity** | High |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.orchestrator.evidence-availability` |
+
+---
+
+## Symptoms
+
+- [ ] Promotion failing with "required evidence not found"
+- [ ] Alert `OrchestratorEvidenceMissing` firing
+- [ ] Gate evaluation blocked waiting for evidence
+- [ ] Error: "SBOM not found" or "attestation missing"
+- [ ] Evidence chain incomplete for artifact
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Promotion blocked until evidence is generated |
+| **Data integrity** | Indicates missing security artifact - must be resolved |
+| **SLA impact** | Release blocked; compliance requirements not met |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.orchestrator.evidence-availability
+   ```
+
+2. **List missing evidence for promotion:**
+   ```bash
+   stella promotion evidence <promotion-id> --missing
+   ```
+
+3. **Check what evidence exists for artifact:**
+   ```bash
+   stella evidence list --artifact <digest>
+   ```
+
+### Deep diagnosis
+
+1. **Check evidence chain completeness:**
+   ```bash
+   stella evidence chain --artifact <digest> --verbose
+   ```
+   Look for: Missing nodes in the chain
+
+2. **Check if scan completed:**
+   ```bash
+   stella scanner jobs list --artifact <digest>
+   ```
+   Problem if: No completed scan or scan failed
+
+3. **Check if attestation was created:**
+   ```bash
+   stella attest list --subject <digest>
+   ```
+   Problem if: No attestation or attestation failed
+
+4. **Check evidence store health:**
+   ```bash
+   stella evidence store health
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Generate missing SBOM:**
+   ```bash
+   stella scan image --image <image-ref> --sbom-only
+   ```
+
+2. **Generate missing attestation:**
+   ```bash
+   stella attest create --subject <digest> --type slsa-provenance
+   ```
+
+3. **Re-scan artifact to regenerate all evidence:**
+   ```bash
+   stella scan image --image <image-ref> --force
+   ```
+
+### Root cause fix
+
+**If scan never ran:**
+
+1. Check why artifact wasn't scanned:
+   ```bash
+   stella scanner queue list --artifact <digest>
+   ```
+
+2. Configure automatic scanning on push:
+   ```bash
+   stella scanner config set auto_scan.enabled true
+   stella scanner config set auto_scan.triggers "push,promote"
+   ```
+
+**If evidence was generated but not stored:**
+
+1. Check evidence store connectivity:
+   ```bash
+   stella evidence store health
+   ```
+
+2. Retry evidence storage:
+   ```bash
+   stella evidence retry-store --artifact <digest>
+   ```
+
+**If attestation signing failed:**
+
+1. Check attestor status:
+   ```bash
+   stella attest status
+   ```
+
+2. See `attestor-signing-failed.md` runbook
+
+**If evidence expired or was deleted:**
+
+1. Check evidence retention policy:
+   ```bash
+   stella evidence policy show
+   ```
+
+2. Regenerate evidence:
+   ```bash
+   stella scan image --image <image-ref> --force
+   stella attest create --subject <digest> --type slsa-provenance
+   ```
+
+### Verification
+
+```bash
+# Check all evidence now exists
+stella evidence list --artifact <digest>
+
+# Verify evidence chain is complete
+stella evidence chain --artifact <digest>
+
+# Retry promotion
+stella promotion retry <promotion-id>
+
+# Verify promotion proceeds
+stella promotion status <promotion-id>
+```
+
+---
+
+## Prevention
+
+- [ ] **Auto-scan:** Enable automatic scanning for all pushed images
+- [ ] **Gates:** Configure evidence requirements clearly in promotion policy
+- [ ] **Monitoring:** Alert on evidence generation failures
+- [ ] **Retention:** Set appropriate evidence retention periods
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/evidence-locker/architecture.md`
+- **Related runbooks:** `orchestrator-promotion-stuck.md`, `attestor-signing-failed.md`
+- **Evidence requirements:** `docs/operations/evidence-requirements.md`
--- a/docs/operations/runbooks/orchestrator-gate-timeout.md
+++ b/docs/operations/runbooks/orchestrator-gate-timeout.md
@@ -0,0 +1,178 @@
+# Runbook: Release Orchestrator - Gate Evaluation Timeout
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-004 - Release Orchestrator Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Release Orchestrator |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.orchestrator.gate-timeout` |
+
+---
+
+## Symptoms
+
+- [ ] Promotion gates timing out before completing evaluation
+- [ ] Alert `OrchestratorGateTimeout` firing
+- [ ] Error: "gate evaluation timeout exceeded"
+- [ ] Promotion stuck waiting for gate response
+- [ ] Metric `orchestrator_gate_timeout_total` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
+| **Data integrity** | No data loss; promotion can be retried |
+| **SLA impact** | Release SLO violated if timeout persists |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.orchestrator.gate-timeout
+   ```
+
+2. **Identify timed-out gates:**
+   ```bash
+   stella promotion gates <promotion-id> --status timeout
+   ```
+
+3. **Check gate service health:**
+   ```bash
+   stella orch gate-services status
+   ```
+
+### Deep diagnosis
+
+1. **Check specific gate latency:**
+   ```bash
+   stella orch gate stats --gate <gate-name> --last 1h
+   ```
+   Look for: P95 latency, timeout rate
+
+2. **Check external service connectivity:**
+   ```bash
+   stella orch connectivity --gate <gate-name>
+   ```
+
+3. **Check gate evaluation logs:**
+   ```bash
+   stella orch logs --gate <gate-name> --promotion <promotion-id>
+   ```
+   Look for: Slow queries, external API delays
+
+4. **Check policy engine latency (for policy gates):**
+   ```bash
+   stella policy stats --last 10m
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Increase timeout for specific gate:**
+   ```bash
+   stella orch config set gates.<gate-name>.timeout 5m
+   stella orch reload
+   ```
+
+2. **Skip the timed-out gate (requires approval):**
+   ```bash
+   stella promotion gate skip <promotion-id> <gate-name> \
+     --reason "External service timeout - approved by <approver>"
+   ```
+
+3. **Retry the promotion:**
+   ```bash
+   stella promotion retry <promotion-id>
+   ```
+
+### Root cause fix
+
+**If external service is slow:**
+
+1. Configure gate retry with backoff:
+   ```bash
+   stella orch config set gates.<gate-name>.retries 3
+   stella orch config set gates.<gate-name>.retry_backoff 5s
+   ```
+
+2. Enable gate result caching:
+   ```bash
+   stella orch config set gates.<gate-name>.cache_ttl 5m
+   ```
+
+3. Configure circuit breaker:
+   ```bash
+   stella orch config set gates.<gate-name>.circuit_breaker.enabled true
+   stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
+   ```
+
+**If policy evaluation is slow:**
+
+1. Optimize policy (see `policy-evaluation-slow.md` runbook)
+
+2. Increase policy worker count:
+   ```bash
+   stella policy config set opa.workers 4
+   ```
+
+**If evidence retrieval is slow:**
+
+1. Enable evidence pre-fetching:
+   ```bash
+   stella orch config set gates.evidence_prefetch true
+   ```
+
+2. Increase evidence cache:
+   ```bash
+   stella orch config set evidence.cache_size 1000
+   stella orch config set evidence.cache_ttl 10m
+   ```
+
+### Verification
+
+```bash
+# Retry promotion
+stella promotion retry <promotion-id>
+
+# Monitor gate evaluation
+stella promotion gates <promotion-id> --watch
+
+# Check gate latency improved
+stella orch gate stats --gate <gate-name> --last 10m
+
+# Verify no timeouts
+stella orch logs --filter "timeout" --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
+- [ ] **Monitoring:** Alert on gate P95 latency > 1m
+- [ ] **Caching:** Enable caching for slow gates
+- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/release-orchestrator/gates.md`
+- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
+- **Dashboard:** Grafana > Stella Ops > Gate Latency
--- a/docs/operations/runbooks/orchestrator-promotion-stuck.md
+++ b/docs/operations/runbooks/orchestrator-promotion-stuck.md
@@ -0,0 +1,168 @@
+# Runbook: Release Orchestrator - Promotion Job Not Progressing
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-004 - Release Orchestrator Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Release Orchestrator |
+| **Severity** | Critical |
+| **On-call scope** | Platform team, Release team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.orchestrator.job-health` |
+
+---
+
+## Symptoms
+
+- [ ] Promotion job stuck in "in_progress" state for >10 minutes
+- [ ] No progress updates in promotion timeline
+- [ ] Alert `OrchestratorPromotionStuck` firing
+- [ ] UI shows promotion spinner indefinitely
+- [ ] Downstream environment not receiving promoted artifact
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Release blocked, cannot promote to target environment |
+| **Data integrity** | Artifact is safe; promotion can be retried |
+| **SLA impact** | Release SLO violated if not resolved within 30 minutes |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.orchestrator.job-health
+   ```
+
+2. **Check promotion status:**
+   ```bash
+   stella promotion status <promotion-id>
+   ```
+   Look for: Current step, last update time, any error messages
+
+3. **Check orchestrator service:**
+   ```bash
+   stella orch status
+   ```
+
+### Deep diagnosis
+
+1. **Get detailed promotion trace:**
+   ```bash
+   stella promotion trace <promotion-id> --verbose
+   ```
+   Look for: Which step is stuck, any timeouts
+
+2. **Check gate evaluation status:**
+   ```bash
+   stella promotion gates <promotion-id>
+   ```
+   Problem if: Gate stuck waiting for external service
+
+3. **Check target environment connectivity:**
+   ```bash
+   stella orch connectivity --target <env-name>
+   ```
+
+4. **Check for lock contention:**
+   ```bash
+   stella orch locks list
+   ```
+   Problem if: Stale locks on the artifact or environment
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **If gate is stuck waiting for external service:**
+   ```bash
+   # Skip the stuck gate (requires approval)
+   stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
+   ```
+
+2. **If lock is stale:**
+   ```bash
+   # Release the lock (use with caution)
+   stella orch locks release <lock-id> --force
+   ```
+
+3. **If orchestrator is unresponsive:**
+   ```bash
+   stella service restart orchestrator
+   ```
+
+### Root cause fix
+
+**If external gate service is slow:**
+
+1. Increase gate timeout:
+   ```bash
+   stella orch config set gates.<gate-name>.timeout 5m
+   ```
+
+2. Configure gate retry:
+   ```bash
+   stella orch config set gates.<gate-name>.retries 3
+   ```
+
+**If target environment is unreachable:**
+
+1. Check network connectivity to target
+2. Verify credentials for target environment:
+   ```bash
+   stella orch credentials verify --target <env-name>
+   ```
+
+**If database lock contention:**
+
+1. Increase lock timeout:
+   ```bash
+   stella orch config set locks.timeout 60s
+   ```
+
+2. Enable optimistic locking:
+   ```bash
+   stella orch config set locks.mode optimistic
+   ```
+
+### Verification
+
+```bash
+# Check promotion completed
+stella promotion status <promotion-id>
+
+# Verify artifact in target environment
+stella orch artifacts list --env <target-env> --filter <artifact-digest>
+
+# Check no stuck promotions
+stella promotion list --status in_progress --older-than 5m
+```
+
+---
+
+## Prevention
+
+- [ ] **Timeouts:** Configure appropriate timeouts for all gates
+- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
+- [ ] **Health checks:** Enable connectivity pre-checks before promotion
+- [ ] **Documentation:** Document SLAs for external gate services
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
+- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
+- **Dashboard:** Grafana > Stella Ops > Release Orchestrator
--- a/docs/operations/runbooks/orchestrator-quota-exceeded.md
+++ b/docs/operations/runbooks/orchestrator-quota-exceeded.md
@@ -0,0 +1,189 @@
+# Runbook: Release Orchestrator - Promotion Quota Exhausted
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-004 - Release Orchestrator Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Release Orchestrator |
+| **Severity** | Medium |
+| **On-call scope** | Platform team, Release team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.orchestrator.quota-status` |
+
+---
+
+## Symptoms
+
+- [ ] Promotions failing with "quota exceeded"
+- [ ] Alert `OrchestratorQuotaExceeded` firing
+- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
+- [ ] New promotions being rejected
+- [ ] Queued promotions not processing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | New releases blocked until quota resets or increases |
+| **Data integrity** | No data loss; promotions queued for later |
+| **SLA impact** | Release frequency SLO may be violated |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.orchestrator.quota-status
+   ```
+
+2. **Check current quota usage:**
+   ```bash
+   stella orch quota status
+   ```
+
+3. **Check quota limits:**
+   ```bash
+   stella orch quota limits show
+   ```
+
+### Deep diagnosis
+
+1. **Check promotion history:**
+   ```bash
+   stella promotion list --last 24h --count
+   ```
+   Look for: Unusual spike in promotions
+
+2. **Check per-environment quotas:**
+   ```bash
+   stella orch quota status --by-environment
+   ```
+
+3. **Check for runaway automation:**
+   ```bash
+   stella promotion list --last 1h --by-actor
+   ```
+   Problem if: Single actor/service making many promotions
+
+4. **Check when quota resets:**
+   ```bash
+   stella orch quota reset-time
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Request temporary quota increase:**
+   ```bash
+   stella orch quota request-increase --amount 50 --reason "Release deadline"
+   ```
+
+2. **Prioritize critical promotions:**
+   ```bash
+   stella promotion priority set <promotion-id> high
+   ```
+
+3. **Cancel unnecessary queued promotions:**
+   ```bash
+   stella promotion list --status queued
+   stella promotion cancel <promotion-id>
+   ```
+
+### Root cause fix
+
+**If legitimate high volume:**
+
+1. Increase quota limits:
+   ```bash
+   stella orch quota limits set --daily 200 --hourly 50
+   ```
+
+2. Increase per-environment limits:
+   ```bash
+   stella orch quota limits set --env production --daily 50
+   ```
+
+**If runaway automation:**
+
+1. Identify the source:
+   ```bash
+   stella promotion list --last 1h --by-actor --verbose
+   ```
+
+2. Revoke or rate-limit the service account:
+   ```bash
+   stella auth rate-limit set <service-account> --promotions-per-hour 10
+   ```
+
+3. Fix the automation bug
+
+**If promotion retries causing spike:**
+
+1. Check for failing promotions causing retries:
+   ```bash
+   stella promotion list --status failed --last 24h
+   ```
+
+2. Fix underlying promotion failures (see other runbooks)
+
+3. Configure retry limits:
+   ```bash
+   stella orch config set promotion.max_retries 3
+   stella orch config set promotion.retry_backoff 5m
+   ```
+
+**If quota too restrictive for workload:**
+
+1. Analyze actual promotion patterns:
+   ```bash
+   stella orch quota analyze --last 30d
+   ```
+
+2. Adjust quotas based on analysis:
+   ```bash
+   stella orch quota limits set --daily <recommended>
+   ```
+
+### Verification
+
+```bash
+# Check quota status
+stella orch quota status
+
+# Verify promotions processing
+stella promotion list --status in_progress
+
+# Test new promotion
+stella promotion create --test --dry-run
+
+# Check no quota errors
+stella orch logs --filter "quota" --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Monitoring:** Alert at 80% quota usage
+- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
+- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
+- [ ] **Review:** Regularly review and adjust quotas based on usage patterns
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
+- **Related runbooks:** `orchestrator-promotion-stuck.md`
+- **Quota management:** `docs/operations/quota-management.md`
--- a/docs/operations/runbooks/orchestrator-rollback-failed.md
+++ b/docs/operations/runbooks/orchestrator-rollback-failed.md
@@ -0,0 +1,189 @@
+# Runbook: Release Orchestrator - Rollback Operation Failed
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-004 - Release Orchestrator Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Release Orchestrator |
+| **Severity** | Critical |
+| **On-call scope** | Platform team, Release team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.orchestrator.rollback-health` |
+
+---
+
+## Symptoms
+
+- [ ] Rollback operation failing or stuck
+- [ ] Alert `OrchestratorRollbackFailed` firing
+- [ ] Error: "rollback failed" or "cannot restore previous version"
+- [ ] Target environment in inconsistent state
+- [ ] Previous artifact not available for deployment
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Rollback blocked; potentially broken release in production |
+| **Data integrity** | Environment may be in partial rollback state |
+| **SLA impact** | Incident resolution blocked; extended outage |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.orchestrator.rollback-health
+   ```
+
+2. **Check rollback status:**
+   ```bash
+   stella rollback status <rollback-id>
+   ```
+
+3. **Check previous deployment history:**
+   ```bash
+   stella orch deployments list --env <env-name> --last 10
+   ```
+
+### Deep diagnosis
+
+1. **Check why rollback failed:**
+   ```bash
+   stella rollback trace <rollback-id> --verbose
+   ```
+   Look for: Which step failed, error message
+
+2. **Check previous artifact availability:**
+   ```bash
+   stella orch artifacts get <previous-digest> --check
+   ```
+   Problem if: Artifact deleted, not in registry
+
+3. **Check environment state:**
+   ```bash
+   stella orch env status <env-name> --detailed
+   ```
+
+4. **Check for deployment locks:**
+   ```bash
+   stella orch locks list --env <env-name>
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Force release lock if stuck:**
+   ```bash
+   stella orch locks release --env <env-name> --force
+   ```
+
+2. **Manual rollback using specific artifact:**
+   ```bash
+   stella deploy --env <env-name> --artifact <previous-digest> --force
+   ```
+
+3. **If artifact unavailable, deploy last known good:**
+   ```bash
+   stella orch deployments list --env <env-name> --status success
+   stella deploy --env <env-name> --artifact <last-good-digest>
+   ```
+
+### Root cause fix
+
+**If previous artifact not in registry:**
+
+1. Check artifact retention policy:
+   ```bash
+   stella registry retention show
+   ```
+
+2. Restore from backup registry:
+   ```bash
+   stella registry restore --artifact <digest> --from backup
+   ```
+
+3. Increase artifact retention:
+   ```bash
+   stella registry retention set --min-versions 10
+   ```
+
+**If deployment service unavailable:**
+
+1. Check deployment target connectivity:
+   ```bash
+   stella orch connectivity --target <env-name>
+   ```
+
+2. Check deployment agent status:
+   ```bash
+   stella orch agent status --env <env-name>
+   ```
+
+**If configuration drift:**
+
+1. Check environment configuration:
+   ```bash
+   stella orch env config diff <env-name>
+   ```
+
+2. Reset environment to known state:
+   ```bash
+   stella orch env reset <env-name> --to-baseline
+   ```
+
+**If database state inconsistent:**
+
+1. Check orchestrator database:
+   ```bash
+   stella orch db verify
+   ```
+
+2. Repair deployment state:
+   ```bash
+   stella orch repair --deployment <deployment-id>
+   ```
+
+### Verification
+
+```bash
+# Verify rollback completed
+stella rollback status <rollback-id>
+
+# Verify environment state
+stella orch env status <env-name>
+
+# Verify correct version deployed
+stella orch deployments current --env <env-name>
+
+# Health check the environment
+stella orch health-check --env <env-name>
+```
+
+---
+
+## Prevention
+
+- [ ] **Retention:** Maintain at least 5 previous versions in registry
+- [ ] **Testing:** Test rollback procedure in staging regularly
+- [ ] **Monitoring:** Alert on rollback failures immediately
+- [ ] **Documentation:** Document manual rollback procedures per environment
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
+- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
+- **Rollback procedures:** `docs/operations/rollback-procedures.md`
--- a/docs/operations/runbooks/policy-compilation-failed.md
+++ b/docs/operations/runbooks/policy-compilation-failed.md
@@ -0,0 +1,189 @@
+# Runbook: Policy Engine - Rego Compilation Errors
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-003 - Policy Engine Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Policy Engine |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.policy.compilation-health` |
+
+---
+
+## Symptoms
+
+- [ ] Policy deployment failing with "compilation error"
+- [ ] Alert `PolicyCompilationFailed` firing
+- [ ] Error: "rego_parse_error" or "rego_type_error"
+- [ ] New policies not taking effect
+- [ ] OPA rejecting policy bundle
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | New policies cannot be deployed; using stale policies |
+| **Data integrity** | Existing policies continue to work; new rules not enforced |
+| **SLA impact** | Policy updates blocked; security posture may be outdated |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.policy.compilation-health
+   ```
+
+2. **Check policy compilation status:**
+   ```bash
+   stella policy status --compilation
+   ```
+
+3. **Validate specific policy:**
+   ```bash
+   stella policy validate --file <policy-file>
+   ```
+
+### Deep diagnosis
+
+1. **Get detailed compilation errors:**
+   ```bash
+   stella policy compile --verbose
+   ```
+   Look for: Line numbers, error types, undefined references
+
+2. **Check for syntax errors:**
+   ```bash
+   stella policy lint --file <policy-file>
+   ```
+
+3. **Check for type errors:**
+   ```bash
+   stella policy typecheck --file <policy-file>
+   ```
+
+4. **Check OPA version compatibility:**
+   ```bash
+   stella policy opa version
+   stella policy check-compat --file <policy-file>
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Rollback to last working policy:**
+   ```bash
+   stella policy rollback --to-last-good
+   ```
+
+2. **Disable the failing policy:**
+   ```bash
+   stella policy disable <policy-id>
+   stella policy reload
+   ```
+
+3. **Use previous bundle:**
+   ```bash
+   stella policy bundle load --version <previous-version>
+   ```
+
+### Root cause fix
+
+**If syntax error:**
+
+1. Get exact error location:
+   ```bash
+   stella policy validate --file <policy-file> --show-line
+   ```
+
+2. Common syntax issues:
+   - Missing brackets or braces
+   - Invalid rule head syntax
+   - Incorrect import statements
+
+3. Fix and re-validate:
+   ```bash
+   stella policy validate --file <fixed-policy.rego>
+   ```
+
+**If undefined reference:**
+
+1. Check for missing imports:
+   ```bash
+   stella policy analyze --file <policy-file> --show-imports
+   ```
+
+2. Verify data references exist:
+   ```bash
+   stella policy data show
+   ```
+
+3. Add missing imports or data definitions
+
+**If type error:**
+
+1. Check type mismatches:
+   ```bash
+   stella policy typecheck --file <policy-file> --verbose
+   ```
+
+2. Common type issues:
+   - Comparing incompatible types
+   - Invalid function arguments
+   - Missing type annotations
+
+**If OPA version incompatibility:**
+
+1. Check Rego version features used:
+   ```bash
+   stella policy analyze --file <policy-file> --show-features
+   ```
+
+2. Update policy to use compatible features or upgrade OPA
+
+### Verification
+
+```bash
+# Validate fixed policy
+stella policy validate --file <fixed-policy.rego>
+
+# Test policy compilation
+stella policy compile --file <fixed-policy.rego>
+
+# Deploy policy
+stella policy deploy --file <fixed-policy.rego>
+
+# Test policy evaluation
+stella policy evaluate --test
+```
+
+---
+
+## Prevention
+
+- [ ] **CI/CD:** Add policy validation to CI pipeline before deployment
+- [ ] **Linting:** Run `stella policy lint` on all policy changes
+- [ ] **Testing:** Write unit tests for policies with `stella policy test`
+- [ ] **Staging:** Deploy to staging environment before production
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/policy/architecture.md`
+- **Related runbooks:** `policy-opa-crash.md`, `policy-evaluation-slow.md`
+- **Rego reference:** https://www.openpolicyagent.org/docs/latest/policy-language/
+- **Policy testing:** `docs/modules/policy/testing.md`
--- a/docs/operations/runbooks/policy-evaluation-slow.md
+++ b/docs/operations/runbooks/policy-evaluation-slow.md
@@ -0,0 +1,174 @@
+# Runbook: Policy Engine - Evaluation Latency High
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-003 - Policy Engine Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Policy Engine |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.policy.evaluation-latency` |
+
+---
+
+## Symptoms
+
+- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
+- [ ] Gate decisions timing out in CI/CD pipelines
+- [ ] Alert `PolicyEvaluationSlow` firing
+- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
+- [ ] Users report "policy check taking too long"
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
+| **Data integrity** | No data loss; decisions are still correct |
+| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.policy.evaluation-latency
+   ```
+
+2. **Check policy engine status:**
+   ```bash
+   stella policy status
+   ```
+
+3. **Check recent evaluation times:**
+   ```bash
+   stella policy stats --last 10m
+   ```
+   Look for: P95 latency, cache hit rate
+
+### Deep diagnosis
+
+1. **Profile a slow evaluation:**
+   ```bash
+   stella policy evaluate --image <image-ref> --profile
+   ```
+   Look for: Which phase is slowest (parse, compile, execute)
+
+2. **Check OPA compilation cache:**
+   ```bash
+   stella policy cache stats
+   ```
+   Problem if: Cache hit rate < 90%
+
+3. **Check policy complexity:**
+   ```bash
+   stella policy analyze --complexity
+   ```
+   Problem if: Cyclomatic complexity > 50 or rule count > 200
+
+4. **Check external data fetches:**
+   ```bash
+   stella policy logs --filter "external fetch" --level debug
+   ```
+   Problem if: Many external fetches or slow responses
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Clear and warm the compilation cache:**
+   ```bash
+   stella policy cache clear
+   stella policy cache warm
+   ```
+
+2. **Increase OPA worker count:**
+   ```bash
+   stella policy config set opa.workers 4
+   stella policy reload
+   ```
+
+3. **Enable evaluation result caching:**
+   ```bash
+   stella policy config set cache.evaluation_ttl 60s
+   stella policy reload
+   ```
+
+### Root cause fix
+
+**If policy is too complex:**
+
+1. Analyze and simplify policy:
+   ```bash
+   stella policy analyze --suggest-optimizations
+   ```
+
+2. Split large policies into modules:
+   ```bash
+   stella policy refactor --auto-split
+   ```
+
+**If external data fetches are slow:**
+
+1. Increase external data cache TTL:
+   ```bash
+   stella policy config set external_data.cache_ttl 5m
+   ```
+
+2. Pre-fetch external data:
+   ```bash
+   stella policy external-data prefetch
+   ```
+
+**If Rego compilation is slow:**
+
+1. Enable partial evaluation:
+   ```bash
+   stella policy config set opa.partial_eval true
+   ```
+
+2. Pre-compile policies:
+   ```bash
+   stella policy compile --all
+   ```
+
+### Verification
+
+```bash
+# Run evaluation and check latency
+stella policy evaluate --image <image-ref> --timing
+
+# Check P95 latency
+stella policy stats --last 5m
+
+# Verify cache is effective
+stella policy cache stats
+```
+
+---
+
+## Prevention
+
+- [ ] **Review:** Review policy complexity before deployment
+- [ ] **Monitoring:** Alert on P95 latency > 300ms
+- [ ] **Caching:** Ensure evaluation cache is enabled
+- [ ] **Pre-warming:** Add cache warming to deployment pipeline
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/policy/architecture.md`
+- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
+- **Dashboard:** Grafana > Stella Ops > Policy Engine
--- a/docs/operations/runbooks/policy-opa-crash.md
+++ b/docs/operations/runbooks/policy-opa-crash.md
@@ -0,0 +1,205 @@
+# Runbook: Policy Engine - OPA Process Crashed
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-003 - Policy Engine Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Policy Engine |
+| **Severity** | Critical |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.policy.opa-health` |
+
+---
+
+## Symptoms
+
+- [ ] Policy evaluations failing with "OPA unavailable" error
+- [ ] Alert `PolicyOPACrashed` firing
+- [ ] OPA process exited unexpectedly
+- [ ] Error: "connection refused" when connecting to OPA
+- [ ] Metric `policy_opa_restarts_total` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | All policy evaluations fail; gate decisions blocked |
+| **Data integrity** | No data loss; decisions delayed until OPA recovers |
+| **SLA impact** | Gate latency SLO violated; release pipeline blocked |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.policy.opa-health
+   ```
+
+2. **Check OPA process status:**
+   ```bash
+   stella policy status
+   ```
+   Look for: OPA process state, restart count
+
+3. **Check OPA logs for crash reason:**
+   ```bash
+   stella policy opa logs --last 30m --level error
+   ```
+
+### Deep diagnosis
+
+1. **Check OPA memory usage before crash:**
+   ```bash
+   stella policy stats --opa-metrics
+   ```
+   Problem if: Memory usage near limit before crash
+
+2. **Check for problematic policy:**
+   ```bash
+   stella policy list --last-error
+   ```
+   Look for: Policies that caused evaluation errors
+
+3. **Check OPA configuration:**
+   ```bash
+   stella policy opa config show
+   ```
+   Look for: Invalid configuration, missing bundles
+
+4. **Check for infinite loops in Rego:**
+   ```bash
+   stella policy analyze --detect-loops
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Restart OPA process:**
+   ```bash
+   stella policy opa restart
+   ```
+
+2. **If OPA keeps crashing, start in safe mode:**
+   ```bash
+   stella policy opa start --safe-mode
+   ```
+   Note: Safe mode disables custom policies
+
+3. **Enable failopen temporarily (if allowed by policy):**
+   ```bash
+   stella policy config set failopen true
+   stella policy reload
+   ```
+   **Warning:** Only use if compliance allows fail-open mode
+
+### Root cause fix
+
+**If OOM killed:**
+
+1. Increase OPA memory limit:
+   ```bash
+   stella policy opa config set memory_limit 2Gi
+   stella policy opa restart
+   ```
+
+2. Enable garbage collection tuning:
+   ```bash
+   stella policy opa config set gc_min_heap_size 256Mi
+   stella policy opa config set gc_max_heap_size 1Gi
+   ```
+
+**If policy caused crash:**
+
+1. Identify problematic policy:
+   ```bash
+   stella policy list --status error
+   ```
+
+2. Disable the problematic policy:
+   ```bash
+   stella policy disable <policy-id>
+   stella policy reload
+   ```
+
+3. Fix and re-enable:
+   ```bash
+   stella policy validate --file <fixed-policy.rego>
+   stella policy update <policy-id> --file <fixed-policy.rego>
+   stella policy enable <policy-id>
+   ```
+
+**If bundle loading failed:**
+
+1. Check bundle integrity:
+   ```bash
+   stella policy bundle verify
+   ```
+
+2. Rebuild bundle:
+   ```bash
+   stella policy bundle build --output bundle.tar.gz
+   stella policy bundle load bundle.tar.gz
+   ```
+
+**If configuration issue:**
+
+1. Reset to default configuration:
+   ```bash
+   stella policy opa config reset
+   ```
+
+2. Reconfigure with validated settings:
+   ```bash
+   stella policy opa config set workers 4
+   stella policy opa config set decision_log true
+   stella policy opa restart
+   ```
+
+### Verification
+
+```bash
+# Check OPA is running
+stella policy status
+
+# Check OPA health
+stella policy opa health
+
+# Test policy evaluation
+stella policy evaluate --test
+
+# Check no crashes in recent logs
+stella policy opa logs --level error --last 30m
+
+# Monitor stability
+stella policy stats --watch
+```
+
+---
+
+## Prevention
+
+- [ ] **Resources:** Set appropriate memory limits based on policy complexity
+- [ ] **Validation:** Validate all policies before deployment
+- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
+- [ ] **Testing:** Load test policies before production deployment
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/policy/architecture.md`
+- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
+- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/
--- a/docs/operations/runbooks/policy-storage-unavailable.md
+++ b/docs/operations/runbooks/policy-storage-unavailable.md
@@ -0,0 +1,178 @@
+# Runbook: Policy Engine - Policy Storage Backend Down
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-003 - Policy Engine Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Policy Engine |
+| **Severity** | Critical |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.policy.storage-health` |
+
+---
+
+## Symptoms
+
+- [ ] Policy operations failing with "storage unavailable"
+- [ ] Alert `PolicyStorageUnavailable` firing
+- [ ] Error: "failed to connect to policy store" or "database connection refused"
+- [ ] Policy updates not persisting
+- [ ] OPA unable to load bundles from storage
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Policy updates fail; cached policies may still work |
+| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
+| **SLA impact** | Policy management blocked; evaluations use cached data |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.policy.storage-health
+   ```
+
+2. **Check storage connectivity:**
+   ```bash
+   stella policy storage status
+   ```
+
+3. **Check database health:**
+   ```bash
+   stella db status --component policy
+   ```
+
+### Deep diagnosis
+
+1. **Check PostgreSQL connectivity:**
+   ```bash
+   stella db ping --database policy
+   ```
+
+2. **Check connection pool status:**
+   ```bash
+   stella db pool-status --database policy
+   ```
+   Problem if: Pool exhausted, connections timing out
+
+3. **Check storage logs:**
+   ```bash
+   stella policy logs --filter "storage" --level error --last 30m
+   ```
+
+4. **Check disk space (if local storage):**
+   ```bash
+   stella policy storage disk-usage
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Enable read-only mode (use cached policies):**
+   ```bash
+   stella policy config set storage.read_only true
+   stella policy reload
+   ```
+
+2. **Switch to backup storage:**
+   ```bash
+   stella policy storage failover --to backup
+   ```
+
+3. **Restart policy service to reconnect:**
+   ```bash
+   stella service restart policy-engine
+   ```
+
+### Root cause fix
+
+**If database connection issue:**
+
+1. Check database status:
+   ```bash
+   stella db status --database policy --verbose
+   ```
+
+2. Restart database connection pool:
+   ```bash
+   stella db pool-restart --database policy
+   ```
+
+3. Check and increase connection limits:
+   ```bash
+   stella db config set policy.max_connections 50
+   ```
+
+**If disk space exhausted:**
+
+1. Check storage usage:
+   ```bash
+   stella policy storage disk-usage --verbose
+   ```
+
+2. Clean old policy versions:
+   ```bash
+   stella policy versions cleanup --older-than 30d
+   ```
+
+3. Increase storage capacity
+
+**If storage corruption:**
+
+1. Verify storage integrity:
+   ```bash
+   stella policy storage verify
+   ```
+
+2. Restore from backup:
+   ```bash
+   stella policy storage restore --from-backup latest
+   ```
+
+### Verification
+
+```bash
+# Check storage status
+stella policy storage status
+
+# Test write operation
+stella policy storage test-write
+
+# Test policy update
+stella policy update --test
+
+# Verify no errors
+stella policy logs --filter "storage" --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Monitoring:** Alert on storage connection failures immediately
+- [ ] **Redundancy:** Configure backup storage for failover
+- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
+- [ ] **Capacity:** Monitor disk usage and plan for growth
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/policy/storage.md`
+- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
+- **Database setup:** `docs/operations/database-configuration.md`
--- a/docs/operations/runbooks/policy-version-mismatch.md
+++ b/docs/operations/runbooks/policy-version-mismatch.md
@@ -0,0 +1,195 @@
+# Runbook: Policy Engine - Policy Version Conflicts
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-003 - Policy Engine Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Policy Engine |
+| **Severity** | Medium |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.policy.version-consistency` |
+
+---
+
+## Symptoms
+
+- [ ] Policy evaluation returning unexpected results
+- [ ] Alert `PolicyVersionMismatch` firing
+- [ ] Error: "policy version conflict" or "bundle version mismatch"
+- [ ] Different nodes evaluating with different policy versions
+- [ ] Inconsistent gate decisions for same artifact
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
+| **Data integrity** | Decisions may not match expected policy behavior |
+| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.policy.version-consistency
+   ```
+
+2. **Check policy version across nodes:**
+   ```bash
+   stella policy version --all-nodes
+   ```
+
+3. **Check active policy version:**
+   ```bash
+   stella policy active --show-version
+   ```
+
+### Deep diagnosis
+
+1. **Compare versions across instances:**
+   ```bash
+   stella policy version diff --all-instances
+   ```
+   Problem if: Different versions on different nodes
+
+2. **Check bundle distribution status:**
+   ```bash
+   stella policy bundle status --all-nodes
+   ```
+
+3. **Check for failed deployments:**
+   ```bash
+   stella policy deployments list --status failed --last 24h
+   ```
+
+4. **Check OPA bundle sync:**
+   ```bash
+   stella policy opa bundle-status
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Force sync to latest version:**
+   ```bash
+   stella policy sync --force --all-nodes
+   ```
+
+2. **Pin specific version:**
+   ```bash
+   stella policy pin --version <version>
+   stella policy sync --all-nodes
+   ```
+
+3. **Restart policy engines to force reload:**
+   ```bash
+   stella service restart policy-engine --all-nodes
+   ```
+
+### Root cause fix
+
+**If bundle distribution failed:**
+
+1. Check bundle storage:
+   ```bash
+   stella policy bundle storage-status
+   ```
+
+2. Rebuild and redistribute bundle:
+   ```bash
+   stella policy bundle build
+   stella policy bundle distribute --all-nodes
+   ```
+
+**If node out of sync:**
+
+1. Check specific node status:
+   ```bash
+   stella policy status --node <node-id>
+   ```
+
+2. Force node resync:
+   ```bash
+   stella policy sync --node <node-id> --force
+   ```
+
+3. Verify node is receiving updates:
+   ```bash
+   stella policy bundle check-subscription --node <node-id>
+   ```
+
+**If concurrent deployments caused conflict:**
+
+1. Check deployment history:
+   ```bash
+   stella policy deployments list --last 1h
+   ```
+
+2. Resolve to single version:
+   ```bash
+   stella policy resolve-conflict --to-version <version>
+   ```
+
+3. Enable deployment locking:
+   ```bash
+   stella policy config set deployment.locking true
+   ```
+
+**If OPA bundle polling issue:**
+
+1. Check OPA bundle configuration:
+   ```bash
+   stella policy opa config show | grep bundle
+   ```
+
+2. Decrease polling interval for faster sync:
+   ```bash
+   stella policy opa config set bundle.polling.min_delay_seconds 10
+   stella policy opa config set bundle.polling.max_delay_seconds 30
+   ```
+
+### Verification
+
+```bash
+# Verify all nodes on same version
+stella policy version --all-nodes
+
+# Test consistent evaluation
+stella policy evaluate --test --all-nodes
+
+# Verify bundle status
+stella policy bundle status --all-nodes
+
+# Check no version warnings
+stella policy logs --filter "version" --level warning --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
+- [ ] **Monitoring:** Alert on version drift between nodes
+- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
+- [ ] **Testing:** Deploy to staging before production to catch issues
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/policy/versioning.md`
+- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
+- **Deployment guide:** `docs/operations/policy-deployment.md`
--- a/docs/operations/runbooks/postgres-ops.md
+++ b/docs/operations/runbooks/postgres-ops.md
@@ -0,0 +1,371 @@
+# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
+# Task: RUN-001 - PostgreSQL Operations Runbook
+# PostgreSQL Database Runbook (dev-mock ready)
+
+Status: PRODUCTION-READY (2026-01-17 UTC)
+
+## Scope
+PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
+
+---
+
+## Pre-flight Checklist
+
+### Environment Verification
+```bash
+# Check database connection
+stella db ping
+
+# Verify connection pool health
+stella doctor --check check.postgres.connectivity,check.postgres.pool
+
+# Check migration status
+stella db migrations status
+```
+
+### Metrics to Watch
+- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
+- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
+- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
+
+---
+
+## Standard Procedures
+
+### SP-001: Daily Health Check
+
+**Frequency:** Daily or on-demand
+**Duration:** ~5 minutes
+
+1. Run comprehensive health check:
+   ```bash
+   stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
+   ```
+
+2. Review slow queries from last 24h:
+   ```bash
+   stella db queries --slow --period 24h --limit 20
+   ```
+
+3. Check replication status (if applicable):
+   ```bash
+   stella db replication status
+   ```
+
+4. Verify backup completion:
+   ```bash
+   stella backup status --type database
+   ```
+
+### SP-002: Connection Pool Tuning
+
+**When:** Pool exhaustion alerts or high wait times
+
+1. Check current pool usage:
+   ```bash
+   stella db pool stats --detailed
+   ```
+
+2. Identify connection-holding queries:
+   ```bash
+   stella db queries --active --sort duration
+   ```
+
+3. Adjust pool size (if needed):
+   ```bash
+   # Review current settings
+   stella config get Database:MaxPoolSize
+   
+   # Increase pool size
+   stella config set Database:MaxPoolSize 150
+   
+   # Restart affected services
+   stella service restart --service release-orchestrator
+   ```
+
+4. Verify improvement:
+   ```bash
+   stella db pool watch --duration 5m
+   ```
+
+### SP-003: Backup and Restore
+
+**Backup:**
+```bash
+# Create immediate backup
+stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
+
+# Verify backup
+stella backup verify --latest
+```
+
+**Restore:**
+```bash
+# List available backups
+stella backup list --type database
+
+# Restore to specific point (CAUTION: destructive)
+stella backup restore --id <backup-id> --confirm
+
+# Verify restoration
+stella db ping
+stella db migrations status
+```
+
+### SP-004: Migration Execution
+
+1. Pre-migration backup:
+   ```bash
+   stella backup create --type database --name "pre-migration"
+   ```
+
+2. Run migrations:
+   ```bash
+   # Dry run first
+   stella db migrate --dry-run
+   
+   # Apply migrations
+   stella db migrate
+   ```
+
+3. Verify migration success:
+   ```bash
+   stella db migrations status
+   stella doctor --check check.postgres.migrations
+   ```
+
+---
+
+## Incident Procedures
+
+### INC-001: Connection Pool Exhaustion
+
+**Symptoms:**
+- Alert: `StellaPostgresPoolExhausted`
+- Error logs: "connection pool exhausted, waiting for available connection"
+- Increased request latency
+
+**Investigation:**
+```bash
+# Check pool status
+stella db pool stats
+
+# Find long-running queries
+stella db queries --active --sort duration --limit 10
+
+# Check for connection leaks
+stella db connections --by-client
+```
+
+**Resolution:**
+
+1. **Immediate relief** - Terminate long-running queries:
+   ```bash
+   # Identify stuck queries
+   stella db queries --active --duration ">5m"
+   
+   # Terminate specific query (use with caution)
+   stella db query terminate --pid <pid>
+   ```
+
+2. **Scale pool** (if legitimate load):
+   ```bash
+   stella config set Database:MaxPoolSize 200
+   stella service restart --graceful
+   ```
+
+3. **Fix leaks** (if application bug):
+   - Review application logs for unclosed connections
+   - Deploy fix to affected service
+
+### INC-002: Slow Query Performance
+
+**Symptoms:**
+- Alert: `StellaPostgresQueryLatencyHigh`
+- P99 query latency > 500ms
+
+**Investigation:**
+```bash
+# Get slow query report
+stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
+
+# Analyze specific query
+stella db query explain --sql "SELECT ..." --analyze
+
+# Check table statistics
+stella db stats tables --sort bloat
+```
+
+**Resolution:**
+
+1. **Index optimization:**
+   ```bash
+   # Get index recommendations
+   stella db index suggest --table <table>
+   
+   # Create recommended index
+   stella db index create --table <table> --columns "col1,col2"
+   ```
+
+2. **Vacuum/analyze:**
+   ```bash
+   stella db vacuum --table <table>
+   stella db analyze --table <table>
+   ```
+
+3. **Query optimization** - Review and rewrite problematic queries
+
+### INC-003: Database Connectivity Loss
+
+**Symptoms:**
+- Alert: `StellaPostgresConnectionFailed`
+- All services reporting database connection errors
+
+**Investigation:**
+```bash
+# Test basic connectivity
+stella db ping
+
+# Check DNS resolution
+stella network dns-lookup <db-host>
+
+# Check firewall/network
+stella network test --host <db-host> --port 5432
+```
+
+**Resolution:**
+
+1. **Network issue:**
+   - Verify security groups / firewall rules
+   - Check VPN/tunnel status if applicable
+   - Verify DNS resolution
+
+2. **Database server issue:**
+   - Check PostgreSQL service status on server
+   - Review PostgreSQL logs
+   - Check disk space on database server
+
+3. **Credential issue:**
+   ```bash
+   stella db verify-credentials
+   stella secrets rotate --scope database
+   ```
+
+### INC-004: Disk Space Alert
+
+**Symptoms:**
+- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
+- Database write failures
+
+**Investigation:**
+```bash
+# Check disk usage
+stella db disk-usage
+
+# Find large tables
+stella db stats tables --sort size --limit 20
+
+# Check for bloat
+stella db stats tables --sort bloat
+```
+
+**Resolution:**
+
+1. **Immediate cleanup:**
+   ```bash
+   # Vacuum to reclaim space
+   stella db vacuum --full --table <large-table>
+   
+   # Clean old data (if retention policy allows)
+   stella db prune --table evidence_artifacts --older-than 90d --dry-run
+   ```
+
+2. **Archive old data:**
+   ```bash
+   stella db archive --table findings_history --older-than 180d
+   ```
+
+3. **Expand disk** (if legitimate growth):
+   - Follow cloud provider procedure to expand volume
+   - Resize filesystem
+
+---
+
+## Maintenance Windows
+
+### Weekly Maintenance (Sunday 02:00 UTC)
+
+1. Run vacuum analyze on all tables:
+   ```bash
+   stella db vacuum --analyze --all-tables
+   ```
+
+2. Update table statistics:
+   ```bash
+   stella db analyze --all-tables
+   ```
+
+3. Clean temporary files:
+   ```bash
+   stella db cleanup --temp-files
+   ```
+
+### Monthly Maintenance (First Sunday 03:00 UTC)
+
+1. Full vacuum on large tables:
+   ```bash
+   stella db vacuum --full --table findings --table verdicts
+   ```
+
+2. Reindex if needed:
+   ```bash
+   stella db reindex --concurrently --table findings
+   ```
+
+3. Archive old data per retention policy:
+   ```bash
+   stella db archive --apply-retention
+   ```
+
+---
+
+## Monitoring Dashboard
+
+Access: Grafana → Dashboards → Stella Ops → PostgreSQL
+
+Key panels:
+- Connection pool utilization
+- Query latency percentiles
+- Disk usage trend
+- Replication lag (if applicable)
+- Active queries count
+
+---
+
+## Evidence Capture
+
+For any incident, capture:
+```bash
+# Comprehensive database state
+stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
+```
+
+Bundle includes:
+- Connection stats
+- Active queries
+- Lock information
+- Table statistics
+- Recent slow query log
+- Configuration snapshot
+
+---
+
+## Escalation Path
+
+1. **L1 (On-call):** Standard procedures, restart services
+2. **L2 (Database team):** Query optimization, schema changes
+3. **L3 (Vendor support):** Hardware/cloud platform issues
+
+---
+
+_Last updated: 2026-01-17 (UTC)_
--- a/docs/operations/runbooks/scanner-oom.md
+++ b/docs/operations/runbooks/scanner-oom.md
@@ -0,0 +1,152 @@
+# Runbook: Scanner - Out of Memory on Large Images
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.memory-usage` |
+
+---
+
+## Symptoms
+
+- [ ] Scanner worker exits with code 137 (OOM killed)
+- [ ] Scans fail consistently for specific large images
+- [ ] Error log contains "fatal error: runtime: out of memory"
+- [ ] Alert `ScannerWorkerOOM` firing
+- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Large images cannot be scanned; smaller images may still work |
+| **Data integrity** | No data loss; failed scans can be retried |
+| **SLA impact** | Specific images blocked from release pipeline |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Identify the failing image:**
+   ```bash
+   stella scanner jobs list --status failed --last 1h
+   ```
+
+2. **Check image size:**
+   ```bash
+   stella image inspect <image-ref> --format json | jq '.size'
+   ```
+   Problem if: Image size > 2GB or layer count > 100
+
+3. **Check worker memory limit:**
+   ```bash
+   stella scanner config get worker.memory_limit
+   ```
+
+### Deep diagnosis
+
+1. **Profile memory usage during scan:**
+   ```bash
+   stella scan image --image <image-ref> --profile-memory
+   ```
+
+2. **Check SBOM generation memory:**
+   ```bash
+   stella scanner logs --filter "sbom" --level debug --last 30m
+   ```
+   Look for: "memory allocation failed", "heap exhausted"
+
+3. **Identify memory-heavy layers:**
+   ```bash
+   stella image layers <image-ref> --sort-by size
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Increase worker memory limit:**
+   ```bash
+   stella scanner config set worker.memory_limit 8Gi
+   stella scanner workers restart
+   ```
+
+2. **Enable streaming mode for large images:**
+   ```bash
+   stella scanner config set sbom.streaming_threshold 1Gi
+   stella scanner workers restart
+   ```
+
+3. **Retry the failed scan:**
+   ```bash
+   stella scan image --image <image-ref> --retry
+   ```
+
+### Root cause fix
+
+**For consistently large images:**
+
+1. Configure dedicated large-image worker pool:
+   ```bash
+   stella scanner workers add --pool large-images --memory 16Gi --count 2
+   stella scanner config set routing.large_image_threshold 2Gi
+   stella scanner config set routing.large_image_pool large-images
+   ```
+
+**For images with many small files (node_modules, etc.):**
+
+1. Enable incremental SBOM mode:
+   ```bash
+   stella scanner config set sbom.incremental_mode true
+   ```
+
+**For base image reuse:**
+
+1. Enable layer caching:
+   ```bash
+   stella scanner config set cache.layer_dedup true
+   ```
+
+### Verification
+
+```bash
+# Retry the previously failing scan
+stella scan image --image <image-ref>
+
+# Monitor memory during scan
+stella scanner workers stats --watch
+
+# Verify no OOM in recent logs
+stella scanner logs --filter "out of memory" --last 1h
+```
+
+---
+
+## Prevention
+
+- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
+- [ ] **Routing:** Configure large-image pool for images > 2GB
+- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
+- [ ] **Documentation:** Document image size limits in user guide
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/architecture.md`
+- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
+- **Dashboard:** Grafana > Stella Ops > Scanner Memory
--- a/docs/operations/runbooks/scanner-registry-auth.md
+++ b/docs/operations/runbooks/scanner-registry-auth.md
@@ -0,0 +1,195 @@
+# Runbook: Scanner - Registry Authentication Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | High |
+| **On-call scope** | Platform team, Security team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.registry-auth` |
+
+---
+
+## Symptoms
+
+- [ ] Scans failing with "401 Unauthorized" or "403 Forbidden"
+- [ ] Alert `ScannerRegistryAuthFailed` firing
+- [ ] Error: "failed to authenticate with registry"
+- [ ] Error: "failed to pull image manifest"
+- [ ] Scans work for public images but fail for private images
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Cannot scan private images; release pipeline blocked |
+| **Data integrity** | No data loss; authentication issue only |
+| **SLA impact** | All scans for affected registry blocked |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.scanner.registry-auth
+   ```
+
+2. **List configured registries:**
+   ```bash
+   stella registry list --show-status
+   ```
+   Look for: Registries with "auth_failed" status
+
+3. **Test registry authentication:**
+   ```bash
+   stella registry test <registry-url>
+   ```
+
+### Deep diagnosis
+
+1. **Check credential expiration:**
+   ```bash
+   stella registry credentials show <registry-name>
+   ```
+   Look for: Expiration date, token type
+
+2. **Test with verbose output:**
+   ```bash
+   stella registry test <registry-url> --verbose
+   ```
+   Look for: Specific auth error message, HTTP status code
+
+3. **Check registry logs:**
+   ```bash
+   stella scanner logs --filter "registry auth" --last 30m
+   ```
+
+4. **Verify IAM/OIDC configuration (for cloud registries):**
+   ```bash
+   stella registry iam-status <registry-name>
+   ```
+   Problem if: IAM role not assumable, OIDC token expired
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Refresh credentials (for token-based auth):**
+   ```bash
+   stella registry refresh-credentials <registry-name>
+   ```
+
+2. **Update static credentials:**
+   ```bash
+   stella registry update-credentials <registry-name> \
+     --username <user> \
+     --password <token>
+   ```
+
+3. **For Docker Hub rate limiting:**
+   ```bash
+   stella registry configure docker-hub \
+     --username <user> \
+     --access-token <token>
+   ```
+
+### Root cause fix
+
+**If credentials expired:**
+
+1. Generate new access token in registry (ECR, GCR, ACR, etc.)
+
+2. Update credentials:
+   ```bash
+   stella registry update-credentials <registry-name> --from-env
+   ```
+
+3. Configure automatic token refresh:
+   ```bash
+   stella registry config set <registry-name>.auto_refresh true
+   stella registry config set <registry-name>.refresh_interval 11h
+   ```
+
+**If IAM role/policy changed (AWS ECR):**
+
+1. Verify IAM role permissions:
+   ```bash
+   stella registry iam verify <registry-name>
+   ```
+
+2. Update IAM role ARN if changed:
+   ```bash
+   stella registry configure ecr \
+     --region <region> \
+     --role-arn <arn>
+   ```
+
+**If OIDC federation changed (GCP Artifact Registry):**
+
+1. Verify service account:
+   ```bash
+   stella registry oidc verify <registry-name>
+   ```
+
+2. Update workload identity configuration:
+   ```bash
+   stella registry configure gcr \
+     --project <project> \
+     --workload-identity-provider <provider>
+   ```
+
+**If certificate changed (self-hosted registries):**
+
+1. Update CA certificate:
+   ```bash
+   stella registry configure <registry-name> \
+     --ca-cert /path/to/ca.crt
+   ```
+
+2. Or skip verification (not recommended for production):
+   ```bash
+   stella registry configure <registry-name> \
+     --insecure-skip-verify
+   ```
+
+### Verification
+
+```bash
+# Test authentication
+stella registry test <registry-url>
+
+# Test scanning a private image
+stella scan image --image <registry-url>/<image>:<tag> --dry-run
+
+# Verify no auth failures in recent logs
+stella scanner logs --filter "auth" --level error --last 30m
+```
+
+---
+
+## Prevention
+
+- [ ] **Credentials:** Use service accounts/workload identity instead of static tokens
+- [ ] **Rotation:** Configure automatic token refresh before expiration
+- [ ] **Monitoring:** Alert on authentication failure rate > 0
+- [ ] **Documentation:** Document registry credential management procedures
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/registry-auth.md`
+- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
+- **Registry setup:** `docs/operations/registry-configuration.md`
--- a/docs/operations/runbooks/scanner-sbom-generation-failed.md
+++ b/docs/operations/runbooks/scanner-sbom-generation-failed.md
@@ -0,0 +1,188 @@
+# Runbook: Scanner - SBOM Generation Failures
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | High |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.sbom-generation` |
+
+---
+
+## Symptoms
+
+- [ ] Scans completing but SBOM generation failing
+- [ ] Alert `ScannerSbomGenerationFailed` firing
+- [ ] Error: "SBOM generation failed" or "unsupported package format"
+- [ ] Partial SBOM with missing components
+- [ ] Metric `scanner_sbom_generation_failures_total` increasing
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Incomplete vulnerability coverage; missing dependencies not scanned |
+| **Data integrity** | Partial SBOM may miss vulnerabilities; attestations incomplete |
+| **SLA impact** | SBOM completeness SLO violated (target: > 95%) |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.scanner.sbom-generation
+   ```
+
+2. **Check failed SBOM jobs:**
+   ```bash
+   stella scanner jobs list --status sbom_failed --last 1h
+   ```
+
+3. **Check SBOM completeness rate:**
+   ```bash
+   stella scanner stats --sbom-metrics
+   ```
+
+### Deep diagnosis
+
+1. **Analyze specific failure:**
+   ```bash
+   stella scanner job details <job-id> --sbom-errors
+   ```
+   Look for: Specific package manager or file type causing failure
+
+2. **Check for unsupported ecosystems:**
+   ```bash
+   stella sbom analyze --image <image-ref> --verbose
+   ```
+   Look for: "unsupported", "unknown package format", "parsing failed"
+
+3. **Check scanner plugin status:**
+   ```bash
+   stella scanner plugins list --status
+   ```
+   Problem if: Package manager plugin disabled or erroring
+
+4. **Check for corrupted package files:**
+   ```bash
+   stella image inspect <image-ref> --check-integrity
+   ```
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Enable fallback SBOM generation:**
+   ```bash
+   stella scanner config set sbom.fallback_mode true
+   stella scan image --image <image-ref> --sbom-fallback
+   ```
+
+2. **Use alternative SBOM generator:**
+   ```bash
+   stella sbom generate --image <image-ref> --generator syft --output sbom.json
+   ```
+
+3. **Generate partial SBOM and continue:**
+   ```bash
+   stella scan image --image <image-ref> --sbom-partial-ok
+   ```
+
+### Root cause fix
+
+**If package manager not supported:**
+
+1. Check supported package managers:
+   ```bash
+   stella scanner plugins list --type package-manager
+   ```
+
+2. Enable additional plugins:
+   ```bash
+   stella scanner plugins enable <plugin-name>
+   ```
+
+3. For custom package formats, add mapping:
+   ```bash
+   stella scanner config set sbom.custom_mappings.<format> <handler>
+   ```
+
+**If package file corrupted:**
+
+1. Identify corrupted files:
+   ```bash
+   stella image layers <image-ref> --verify-packages
+   ```
+
+2. Report to image owner for fix
+
+**If memory/resource issue during generation:**
+
+1. Increase SBOM generator resources:
+   ```bash
+   stella scanner config set sbom.memory_limit 4Gi
+   stella scanner config set sbom.timeout 10m
+   ```
+
+2. Enable streaming mode:
+   ```bash
+   stella scanner config set sbom.streaming_mode true
+   ```
+
+**If plugin crashed:**
+
+1. Check plugin logs:
+   ```bash
+   stella scanner plugins logs <plugin-name> --last 30m
+   ```
+
+2. Restart plugin:
+   ```bash
+   stella scanner plugins restart <plugin-name>
+   ```
+
+### Verification
+
+```bash
+# Retry SBOM generation
+stella sbom generate --image <image-ref> --output sbom.json
+
+# Validate SBOM completeness
+stella sbom validate --file sbom.json --check-completeness
+
+# Check component count
+stella sbom stats --file sbom.json
+
+# Full scan with SBOM
+stella scan image --image <image-ref>
+```
+
+---
+
+## Prevention
+
+- [ ] **Plugins:** Keep all package manager plugins enabled and updated
+- [ ] **Monitoring:** Alert on SBOM completeness < 90%
+- [ ] **Fallback:** Configure fallback SBOM generator for resilience
+- [ ] **Testing:** Test SBOM generation for new image types before production
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/sbom-generation.md`
+- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
+- **SBOM formats:** `docs/formats/sbom-spdx.md`, `docs/formats/sbom-cyclonedx.md`
--- a/docs/operations/runbooks/scanner-timeout.md
+++ b/docs/operations/runbooks/scanner-timeout.md
@@ -0,0 +1,174 @@
+# Runbook: Scanner - Scan Timeout on Complex Images
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | Medium |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.timeout-rate` |
+
+---
+
+## Symptoms
+
+- [ ] Scans failing with "timeout exceeded" error
+- [ ] Alert `ScannerTimeoutExceeded` firing
+- [ ] Metric `scanner_scan_timeout_total` increasing
+- [ ] Specific images consistently timing out
+- [ ] Error log: "scan operation exceeded timeout of X seconds"
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | Specific images cannot be scanned; pipeline blocked |
+| **Data integrity** | No data loss; scans can be retried with adjusted settings |
+| **SLA impact** | Release pipeline delayed for affected images |
+
+---
+
+## Diagnosis
+
+### Quick checks
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.scanner.timeout-rate
+   ```
+
+2. **Identify failing images:**
+   ```bash
+   stella scanner jobs list --status timeout --last 1h
+   ```
+   Look for: Pattern in image types or sizes
+
+3. **Check current timeout settings:**
+   ```bash
+   stella scanner config get timeouts
+   ```
+
+### Deep diagnosis
+
+1. **Analyze image complexity:**
+   ```bash
+   stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
+   ```
+   Problem if: > 50 layers, > 100k files, or > 5GB size
+
+2. **Check scanner worker load:**
+   ```bash
+   stella scanner workers stats
+   ```
+   Problem if: All workers at capacity during timeouts
+
+3. **Profile a scan:**
+   ```bash
+   stella scan image --image <image-ref> --profile --verbose
+   ```
+   Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
+
+4. **Check for filesystem-heavy images:**
+   ```bash
+   stella image layers <image-ref> --sort-by file-count
+   ```
+   Problem if: Single layer with > 50k files (e.g., node_modules)
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Increase timeout for specific image:**
+   ```bash
+   stella scan image --image <image-ref> --timeout 30m
+   ```
+
+2. **Increase global scan timeout:**
+   ```bash
+   stella scanner config set timeouts.scan 20m
+   stella scanner workers restart
+   ```
+
+3. **Enable fast mode for initial scan:**
+   ```bash
+   stella scan image --image <image-ref> --fast-mode
+   ```
+
+### Root cause fix
+
+**If image is too complex:**
+
+1. Enable incremental scanning:
+   ```bash
+   stella scanner config set scan.incremental_mode true
+   ```
+
+2. Configure layer caching:
+   ```bash
+   stella scanner config set cache.layer_dedup true
+   stella scanner config set cache.sbom_cache true
+   ```
+
+**If filesystem is too large:**
+
+1. Enable streaming SBOM generation:
+   ```bash
+   stella scanner config set sbom.streaming_threshold 500Gi
+   ```
+
+2. Configure file sampling for massive images:
+   ```bash
+   stella scanner config set sbom.file_sample_max 100000
+   ```
+
+**If vulnerability matching is slow:**
+
+1. Enable parallel matching:
+   ```bash
+   stella scanner config set vuln.parallel_matching true
+   stella scanner config set vuln.match_workers 4
+   ```
+
+2. Optimize vulnerability database indexes:
+   ```bash
+   stella db optimize --component scanner
+   ```
+
+### Verification
+
+```bash
+# Retry the previously failing scan
+stella scan image --image <image-ref> --timeout 30m
+
+# Monitor scan progress
+stella scanner jobs watch <job-id>
+
+# Verify no timeouts in recent scans
+stella scanner jobs list --status timeout --last 1h
+```
+
+---
+
+## Prevention
+
+- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
+- [ ] **Monitoring:** Alert on timeout rate > 5%
+- [ ] **Caching:** Enable layer and SBOM caching for base images
+- [ ] **Documentation:** Document image size/complexity limits in user guide
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/architecture.md`
+- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
+- **Dashboard:** Grafana > Stella Ops > Scanner Performance
--- a/docs/operations/runbooks/scanner-worker-stuck.md
+++ b/docs/operations/runbooks/scanner-worker-stuck.md
@@ -0,0 +1,174 @@
+# Runbook: Scanner - Worker Not Processing Jobs
+
+> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
+> **Task:** RUN-002 - Scanner Runbooks
+
+## Metadata
+
+| Field | Value |
+|-------|-------|
+| **Component** | Scanner |
+| **Severity** | Critical |
+| **On-call scope** | Platform team |
+| **Last updated** | 2026-01-17 |
+| **Doctor check** | `check.scanner.worker-health` |
+
+---
+
+## Symptoms
+
+- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
+- [ ] Scanner worker process shows 0% CPU usage
+- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
+- [ ] UI shows "Scan in progress" indefinitely
+- [ ] Metric `scanner_jobs_pending` increasing over time
+
+---
+
+## Impact
+
+| Impact Type | Description |
+|-------------|-------------|
+| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
+| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
+| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
+
+---
+
+## Diagnosis
+
+### Quick checks (< 2 minutes)
+
+1. **Check Doctor diagnostics:**
+   ```bash
+   stella doctor --check check.scanner.worker-health
+   ```
+
+2. **Check scanner service status:**
+   ```bash
+   stella scanner status
+   ```
+   Expected: "Scanner workers: 4 active, 0 idle"
+   Problem: "Scanner workers: 0 active" or "status: degraded"
+
+3. **Check job queue depth:**
+   ```bash
+   stella scanner queue status
+   ```
+   Expected: Queue depth < 50
+   Problem: Queue depth > 100 or growing rapidly
+
+### Deep diagnosis
+
+1. **Check worker process logs:**
+   ```bash
+   stella scanner logs --tail 100 --level error
+   ```
+   Look for: "timeout", "connection refused", "out of memory"
+
+2. **Check Valkey connectivity (job queue):**
+   ```bash
+   stella doctor --check check.storage.valkey
+   ```
+
+3. **Check if workers are OOM-killed:**
+   ```bash
+   stella scanner workers inspect
+   ```
+   Look for: "exit_code: 137" (OOM) or "exit_code: 143" (SIGTERM)
+
+4. **Check resource utilization:**
+   ```bash
+   stella obs metrics --filter scanner --last 10m
+   ```
+   Look for: Memory > 90%, CPU sustained > 95%
+
+---
+
+## Resolution
+
+### Immediate mitigation
+
+1. **Restart scanner workers:**
+   ```bash
+   stella scanner workers restart
+   ```
+   This will: Terminate current workers and spawn fresh ones
+
+2. **If restart fails, force restart the scanner service:**
+   ```bash
+   stella service restart scanner
+   ```
+
+3. **Verify workers are processing:**
+   ```bash
+   stella scanner queue status --watch
+   ```
+   Queue depth should start decreasing
+
+### Root cause fix
+
+**If workers were OOM-killed:**
+
+1. Increase worker memory limit:
+   ```bash
+   stella scanner config set worker.memory_limit 4Gi
+   stella scanner workers restart
+   ```
+
+2. Reduce concurrent scans per worker:
+   ```bash
+   stella scanner config set worker.concurrency 2
+   stella scanner workers restart
+   ```
+
+**If Valkey connection failed:**
+
+1. Check Valkey health:
+   ```bash
+   stella doctor --check check.storage.valkey
+   ```
+
+2. Restart Valkey if needed (see `valkey-connection-failure.md`)
+
+**If workers are deadlocked:**
+
+1. Enable deadlock detection:
+   ```bash
+   stella scanner config set worker.deadlock_detection true
+   stella scanner workers restart
+   ```
+
+### Verification
+
+```bash
+# Verify workers are healthy
+stella doctor --check check.scanner.worker-health
+
+# Submit a test scan
+stella scan image --image alpine:latest --dry-run
+
+# Watch queue drain
+stella scanner queue status --watch
+
+# Verify no errors in recent logs
+stella scanner logs --tail 20 --level error
+```
+
+---
+
+## Prevention
+
+- [ ] **Alert:** Ensure `ScannerQueueBacklog` alert is configured with threshold < 100 jobs
+- [ ] **Monitoring:** Add Grafana panel for worker memory usage
+- [ ] **Capacity:** Review worker count and memory limits during capacity planning
+- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
+
+---
+
+## Related Resources
+
+- **Architecture:** `docs/modules/scanner/architecture.md`
+- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
+- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
+- **Dashboard:** Grafana > Stella Ops > Scanner Overview