synergy moats product advisory implementations

This commit is contained in:
master
2026-01-17 01:30:03 +02:00
parent 77ff029205
commit d8d9c0a6e3
106 changed files with 20603 additions and 123 deletions

View File

@@ -0,0 +1,256 @@
# Auditor Guide
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
> **Task:** AUD-007 - Documentation
This guide is for external auditors reviewing Stella Ops release evidence.
## Overview
Stella Ops generates comprehensive, tamper-evident audit bundles that contain all evidence required to verify release decisions. This guide explains how to interpret and verify these bundles.
## Receiving an Audit Bundle
Audit bundles may be delivered as:
- **Directory:** A folder containing all evidence files
- **Archive:** A `.tar.gz` or `.zip` file
### Extracting Archives
```bash
# tar.gz
tar -xzf audit-bundle-sha256-abc123.tar.gz
# zip
unzip audit-bundle-sha256-abc123.zip
```
## Bundle Structure
```
audit-bundle-<digest>-<timestamp>/
├── manifest.json # Integrity manifest
├── README.md # Quick reference
├── verdict/ # Release decision
├── evidence/ # Supporting evidence
├── policy/ # Policy configuration
└── replay/ # Verification instructions
```
## Step 1: Verify Bundle Integrity
Before reviewing contents, verify the bundle has not been tampered with.
### Using Stella CLI
```bash
stella audit verify ./audit-bundle-sha256-abc123/
```
Expected output:
```
✓ Verified 15/15 files
✓ Integrity hash verified
✓ Bundle integrity verified
```
### Manual Verification
1. Open `manifest.json`
2. For each file listed, compute SHA-256 and compare:
```bash
sha256sum verdict/verdict.json
```
3. Verify the `integrityHash` by hashing all file hashes
## Step 2: Review the Verdict
The verdict is the official release decision.
### verdict/verdict.json
```json
{
"artifactDigest": "sha256:abc123...",
"decision": "PASS",
"timestamp": "2026-01-17T10:25:00Z",
"gates": [
{
"gateId": "sbom-required",
"status": "PASS",
"reason": "Valid CycloneDX SBOM present"
},
{
"gateId": "vex-trust",
"status": "PASS",
"reason": "Trust score 0.85 >= 0.70 threshold"
}
]
}
```
### Decision Values
| Decision | Meaning |
|----------|---------|
| `PASS` | All gates passed, artifact approved for deployment |
| `BLOCKED` | One or more gates failed, artifact not approved |
| `PENDING` | Evaluation incomplete, awaiting additional evidence |
### verdict/verdict.dsse.json
This file contains the cryptographically signed verdict envelope (DSSE format). Verify signatures using:
```bash
stella audit verify ./bundle/ --check-signatures
```
## Step 3: Review Evidence
### evidence/sbom.json
Software Bill of Materials (SBOM) listing all components in the artifact.
**Key fields:**
- `components[]` - List of all software components
- `dependencies[]` - Dependency relationships
- `metadata.timestamp` - When SBOM was generated
### evidence/vex-statements/
Vulnerability Exploitability eXchange (VEX) statements that justify vulnerability assessments.
**index.json:**
```json
{
"statementCount": 3,
"statements": [
{"fileName": "vex-001.json", "source": "vendor-security"},
{"fileName": "vex-002.json", "source": "internal-analysis"}
]
}
```
Each VEX statement explains why a vulnerability does or does not affect this artifact.
### evidence/reachability/analysis.json
Reachability analysis showing which vulnerabilities are actually reachable in the code.
```json
{
"components": [
{
"purl": "pkg:npm/lodash@4.17.21",
"vulnerabilities": [
{
"id": "CVE-2021-23337",
"reachable": false,
"reason": "Vulnerable function not in call graph"
}
]
}
]
}
```
## Step 4: Review Policy
### policy/policy-snapshot.json
The policy configuration used for evaluation:
```json
{
"policyVersion": "v2.3.1",
"gates": ["sbom-required", "vex-trust", "cve-threshold"],
"thresholds": {
"vexTrustScore": 0.70,
"maxCriticalCves": 0,
"maxHighCves": 5
}
}
```
### policy/gate-decision.json
Detailed breakdown of each gate evaluation:
```json
{
"gates": [
{
"gateId": "vex-trust",
"decision": "PASS",
"inputs": {
"vexStatements": 3,
"trustScore": 0.85,
"threshold": 0.70
}
}
]
}
```
## Step 5: Replay Verification (Optional)
For maximum assurance, you can replay the verdict evaluation.
### Using Stella CLI
```bash
cd audit-bundle-sha256-abc123/
stella replay snapshot --manifest replay/knowledge-snapshot.json
```
This re-evaluates the policy using the frozen inputs and should produce an identical verdict.
### Manual Replay Steps
See `replay/replay-instructions.md` for detailed steps.
## Compliance Mapping
| Compliance Framework | Relevant Bundle Components |
|---------------------|---------------------------|
| **SOC 2 (CC7.1)** | verdict/, policy/ |
| **ISO 27001 (A.12.6)** | evidence/sbom.json |
| **FedRAMP** | All components |
| **SLSA Level 3** | evidence/provenance/ |
## Common Questions
### Q: Why was this artifact blocked?
Review `policy/gate-decision.json` for the specific gate that failed and its reason.
### Q: How do I verify the SBOM is accurate?
The SBOM digest is included in the manifest. Compare against the organization's SBOM generation process.
### Q: What if replay produces a different result?
This may indicate:
1. Policy version mismatch
2. Missing evidence files
3. Time-dependent policy rules
Contact the organization's security team for clarification.
### Q: How long should audit bundles be retained?
Stella Ops recommends:
- Production releases: 5 years minimum
- Security-critical systems: 7 years
- Regulated industries: Per compliance requirements
## Support
For questions about this audit bundle:
1. Contact the organization's Stella Ops administrator
2. Reference the Bundle ID from `manifest.json`
3. Include the artifact digest
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,112 @@
# Runbook Coverage Tracking
This document tracks operational runbook coverage across Stella Ops modules.
**Target:** 80% coverage of critical failure modes before declaring operability moat achieved.
---
## Coverage Summary
| Module | Critical Failures | Runbooks | Coverage | Status |
|--------|-------------------|----------|----------|--------|
| Scanner | 5 | 0 | 0% | 🔴 Gap |
| Policy Engine | 5 | 0 | 0% | 🔴 Gap |
| Release Orchestrator | 5 | 0 | 0% | 🔴 Gap |
| Attestor | 5 | 0 | 0% | 🔴 Gap |
| Feed Connectors | 4 | 0 | 0% | 🔴 Gap |
| **Database (Postgres)** | 4 | 4 | 100% | ✅ Complete |
| **Crypto Subsystem** | 4 | 4 | 100% | ✅ Complete |
| **Evidence Locker** | 4 | 4 | 100% | ✅ Complete |
| **Backup/Restore** | 4 | 4 | 100% | ✅ Complete |
| Authority (OAuth/OIDC) | 3 | 0 | 0% | 🔴 Gap |
| **Overall** | **43** | **16** | **37%** | 🟡 In Progress |
---
## Available Runbooks
### Database Operations
- [postgres-ops.md](postgres-ops.md) - PostgreSQL database operations
### Crypto Subsystem
- [crypto-ops.md](crypto-ops.md) - Regional crypto operations (FIPS, eIDAS, GOST, SM)
### Evidence Locker
- [evidence-locker-ops.md](evidence-locker-ops.md) - Evidence locker operations
### Backup/Restore
- [backup-restore-ops.md](backup-restore-ops.md) - Backup and restore procedures
### Vulnerability Operations
- [vuln-ops.md](vuln-ops.md) - Vulnerability management operations
### VEX Operations
- [vex-ops.md](vex-ops.md) - VEX statement operations
### Policy Incidents
- [policy-incident.md](policy-incident.md) - Policy-related incident response
---
## Gap Analysis
### High Priority Gaps (Critical modules without runbooks)
1. **Scanner** - Core scanning functionality
- Worker stuck
- OOM on large images
- Registry auth failures
2. **Policy Engine** - Policy evaluation
- Slow evaluation
- OPA crashes
- Compilation failures
3. **Release Orchestrator** - Promotion workflow
- Stuck promotions
- Gate timeouts
- Missing evidence
### Medium Priority Gaps
4. **Attestor** - Signing and verification
- Signing failures
- Key expiration
- Rekor unavailability
5. **Feed Connectors** - Advisory feeds
- NVD failures
- Rate limiting
- Offline bundle issues
### Lower Priority Gaps
6. **Authority** - Authentication
- Token validation failures
- OIDC provider issues
---
## Template
New runbooks should use the template: [_template.md](_template.md)
---
## Doctor Check Integration
Runbooks should be linked from Doctor check output. Current integration status:
| Module | Doctor Checks | Linked to Runbook |
|--------|---------------|-------------------|
| Postgres | 4 | 0 |
| Crypto | 8 | 0 |
| Storage | 3 | 0 |
| Evidence | 4 | 0 |
**Next step:** Update Doctor check implementations to include runbook links in remediation output.
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,157 @@
# Runbook: [Component] - [Failure Scenario]
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-001 - Runbook Template
## Metadata
| Field | Value |
|-------|-------|
| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
| **Severity** | Critical / High / Medium / Low |
| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
| **Last updated** | [YYYY-MM-DD] |
| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |
---
## Symptoms
Observable indicators that this failure is occurring:
- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
---
## Diagnosis
### Quick checks (< 2 minutes)
Run these first to confirm the failure:
1. **Check Doctor diagnostics:**
```bash
stella doctor --check [relevant-check-id]
```
2. **Check service status:**
```bash
stella [component] status
```
3. **Check recent logs:**
```bash
stella [component] logs --tail 50 --level error
```
### Deep diagnosis (if quick checks inconclusive)
1. **[Investigation step 1]:**
```bash
[command]
```
Expected output: [description]
If unexpected: [what it means]
2. **[Investigation step 2]:**
```bash
[command]
```
3. **Check related services:**
- Postgres connectivity: `stella doctor --check check.storage.postgres`
- Valkey connectivity: `stella doctor --check check.storage.valkey`
- Network connectivity: `stella doctor --check check.network.[target]`
---
## Resolution
### Immediate mitigation (restore service quickly)
Use these steps to restore service, even if root cause isn't fixed yet:
1. **[Mitigation step 1]:**
```bash
[command]
```
This will: [explanation]
2. **[Mitigation step 2]:**
```bash
[command]
```
### Root cause fix
Once service is restored, address the underlying issue:
1. **[Fix step 1]:**
```bash
[command]
```
2. **[Fix step 2]:**
```bash
[command]
```
3. **Verify fix is complete:**
```bash
stella doctor --check [relevant-check-id]
```
### Verification
Confirm the issue is fully resolved:
```bash
# Re-run the failing operation
stella [component] [test-command]
# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m
# Verify no new errors in logs
stella [component] logs --tail 20 --level error
```
---
## Prevention
How to prevent this failure from recurring:
- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
- [ ] **Documentation:** [e.g., "Update capacity planning guide"]
---
## Related Resources
- **Architecture doc:** [Link to relevant architecture documentation]
- **Related runbooks:** [Links to related failure scenarios]
- **Doctor check source:** [Link to Doctor check implementation]
- **Grafana dashboard:** [Link to relevant dashboard]
---
## Revision History
| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | Initial version |

View File

@@ -0,0 +1,193 @@
# Runbook: Attestor - HSM Connection Issues
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor / Cryptography |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.crypto.hsm-availability` |
---
## Symptoms
- [ ] Signing operations failing with "HSM unavailable"
- [ ] Alert `AttestorHsmConnectionFailed` firing
- [ ] Error: "PKCS#11 operation failed" or "HSM session timeout"
- [ ] Attestations cannot be created
- [ ] Key operations (sign, verify) failing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No attestations can be signed; releases blocked |
| **Data integrity** | Keys are safe in HSM; operations resume when connection restored |
| **SLA impact** | All signing operations blocked; compliance posture at risk |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.crypto.hsm-availability
```
2. **Check HSM connection status:**
```bash
stella crypto hsm status
```
3. **Test HSM connectivity:**
```bash
stella crypto hsm test
```
### Deep diagnosis
1. **Check PKCS#11 library status:**
```bash
stella crypto hsm pkcs11-status
```
Look for: Library loaded, slot available, session active
2. **Check HSM network connectivity:**
```bash
stella crypto hsm ping
```
3. **Check HSM session logs:**
```bash
stella crypto hsm logs --last 30m
```
Look for: Session errors, timeout, authentication failures
4. **Check HSM slot status:**
```bash
stella crypto hsm slots list
```
Problem if: Slot not found, slot busy, token not present
---
## Resolution
### Immediate mitigation
1. **Attempt HSM reconnection:**
```bash
stella crypto hsm reconnect
```
2. **If HSM unreachable, switch to software signing (if permitted):**
```bash
stella attest config set signing.mode software
stella attest reload
```
**Warning:** Software signing may not meet compliance requirements
3. **Use backup HSM if configured:**
```bash
stella crypto hsm failover --to backup
```
### Root cause fix
**If network connectivity issue:**
1. Check HSM network path:
```bash
stella crypto hsm connectivity --verbose
```
2. Verify firewall rules allow HSM port (typically 1792 for Luna, 2225 for SafeNet)
3. Check HSM server status with vendor tools
**If session timeout:**
1. Increase session timeout:
```bash
stella crypto hsm config set session.timeout 300s
stella crypto hsm reconnect
```
2. Enable session keep-alive:
```bash
stella crypto hsm config set session.keepalive true
stella crypto hsm config set session.keepalive_interval 60s
```
**If authentication failed:**
1. Verify HSM credentials:
```bash
stella crypto hsm auth verify
```
2. Update HSM PIN if changed:
```bash
stella crypto hsm auth update --slot <slot-id>
```
**If PKCS#11 library issue:**
1. Verify library path:
```bash
stella crypto hsm config get pkcs11.library_path
```
2. Reload PKCS#11 library:
```bash
stella crypto hsm pkcs11-reload
```
3. Check library compatibility:
```bash
stella crypto hsm pkcs11-info
```
### Verification
```bash
# Test HSM connectivity
stella crypto hsm test
# Test signing operation
stella attest test-sign
# Verify key access
stella keys verify <key-id> --operation sign
# Check no errors in logs
stella crypto hsm logs --level error --last 30m
```
---
## Prevention
- [ ] **Redundancy:** Configure backup HSM for failover
- [ ] **Monitoring:** Alert on HSM connection failures immediately
- [ ] **Keep-alive:** Enable session keep-alive to prevent timeouts
- [ ] **Testing:** Include HSM health in regular health checks
---
## Related Resources
- **Architecture:** `docs/modules/cryptography/hsm-integration.md`
- **Related runbooks:** `attestor-signing-failed.md`, `crypto-ops.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Crypto/`
- **HSM setup:** `docs/operations/hsm-configuration.md`

View File

@@ -0,0 +1,190 @@
# Runbook: Attestor - Signing Key Expired
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.key-expiration` |
---
## Symptoms
- [ ] Attestation creation failing with "key expired" error
- [ ] Alert `AttestorKeyExpired` firing
- [ ] Error: "signing key certificate has expired"
- [ ] New attestations cannot be created
- [ ] Verification of new attestations failing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No new attestations can be signed; releases blocked |
| **Data integrity** | Existing attestations remain valid; new ones cannot be created |
| **SLA impact** | Release SLO violated; compliance posture compromised |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.key-expiration
```
2. **List signing keys and expiration:**
```bash
stella keys list --type signing --show-expiration
```
Look for: Keys with status "expired" or expiring soon
3. **Check active signing key:**
```bash
stella attest config get signing.key_id
stella keys show <key-id> --details
```
### Deep diagnosis
1. **Check certificate chain validity:**
```bash
stella crypto cert verify-chain --key <key-id>
```
Problem if: Any certificate in chain expired
2. **Check for backup keys:**
```bash
stella keys list --type signing --status inactive
```
Look for: Unexpired backup keys that can be activated
3. **Check key rotation history:**
```bash
stella keys rotation-history --key <key-id>
```
---
## Resolution
### Immediate mitigation
1. **If backup key available, activate it:**
```bash
stella keys activate <backup-key-id>
stella attest config set signing.key_id <backup-key-id>
stella attest reload
```
2. **Verify signing works:**
```bash
stella attest test-sign
```
3. **Retry failed attestations:**
```bash
stella attest retry --failed --last 1h
```
### Root cause fix
**Generate new signing key:**
1. Generate new key pair:
```bash
stella keys generate \
--type signing \
--algorithm ecdsa-p256 \
--validity 365d \
--name "signing-key-$(date +%Y%m%d)"
```
2. If using HSM:
```bash
stella keys generate \
--type signing \
--algorithm ecdsa-p256 \
--validity 365d \
--hsm-slot <slot> \
--name "signing-key-$(date +%Y%m%d)"
```
3. Register the new key:
```bash
stella keys register <new-key-id> --purpose attestation-signing
```
4. Update signing configuration:
```bash
stella attest config set signing.key_id <new-key-id>
stella attest reload
```
5. Publish new public key to trust anchors:
```bash
stella issuer keys publish <new-key-id>
```
**Configure automatic rotation:**
1. Enable auto-rotation:
```bash
stella keys config set rotation.auto true
stella keys config set rotation.before_expiry 30d
stella keys config set rotation.overlap_days 14
```
2. Set up rotation alerts:
```bash
stella keys config set alerts.expiring_days 30
stella keys config set alerts.expiring_days_critical 7
```
### Verification
```bash
# Verify new key is active
stella keys list --type signing --status active
# Test signing
stella attest test-sign
# Create test attestation
stella attest create --type test --subject "test:key-rotation"
# Verify the attestation
stella verify attestation --last
# Check key expiration
stella keys show <new-key-id> --details | grep -i expir
```
---
## Prevention
- [ ] **Rotation:** Enable automatic key rotation 30 days before expiry
- [ ] **Monitoring:** Alert on keys expiring within 30 days (warning) and 7 days (critical)
- [ ] **Backup:** Maintain at least one backup signing key
- [ ] **Documentation:** Document key rotation procedures and approval process
---
## Related Resources
- **Architecture:** `docs/modules/attestor/architecture.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-hsm-connection.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
- **Key management:** `docs/operations/key-management.md`

View File

@@ -0,0 +1,184 @@
# Runbook: Attestor - Rekor Transparency Log Unreachable
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.rekor-connectivity` |
---
## Symptoms
- [ ] Attestation transparency logging failing
- [ ] Alert `AttestorRekorUnavailable` firing
- [ ] Error: "Rekor server unavailable" or "transparency log submission failed"
- [ ] Attestations created but not anchored to transparency log
- [ ] Verification failing due to missing log entry
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Attestations not publicly verifiable via transparency log |
| **Data integrity** | Attestations still valid locally; transparency reduced |
| **SLA impact** | Compliance may require transparency log anchoring |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.rekor-connectivity
```
2. **Check Rekor connectivity:**
```bash
stella attest rekor status
```
3. **Test Rekor endpoint:**
```bash
stella attest rekor ping
```
### Deep diagnosis
1. **Check Rekor server URL:**
```bash
stella attest config get rekor.url
```
Default: https://rekor.sigstore.dev
2. **Check for public Rekor outage:**
```bash
stella attest rekor api-status
```
Also check: https://status.sigstore.dev/
3. **Check network/proxy issues:**
```bash
stella attest rekor test --verbose
```
Look for: TLS errors, proxy blocks, timeout
4. **Check pending log entries:**
```bash
stella attest rekor pending-entries
```
---
## Resolution
### Immediate mitigation
1. **Queue attestations for later submission:**
```bash
stella attest config set rekor.queue_on_failure true
stella attest reload
```
2. **Disable Rekor requirement temporarily:**
```bash
stella attest config set rekor.required false
stella attest reload
```
**Warning:** Reduces transparency guarantees
3. **Use private Rekor instance if available:**
```bash
stella attest config set rekor.url https://rekor.internal.example.com
stella attest reload
```
### Root cause fix
**If public Rekor outage:**
1. Wait for Sigstore to resolve the issue
2. Check status at https://status.sigstore.dev/
3. Process queued entries when service recovers:
```bash
stella attest rekor process-queue
```
**If network/firewall issue:**
1. Verify outbound HTTPS to rekor.sigstore.dev:
```bash
stella attest rekor connectivity --verbose
```
2. Configure proxy if required:
```bash
stella attest config set rekor.proxy https://proxy:8080
```
3. Add Rekor endpoints to firewall allowlist:
- rekor.sigstore.dev:443
- fulcio.sigstore.dev:443 (for certificate issuance)
**If TLS certificate issue:**
1. Check certificate validity:
```bash
stella attest rekor cert-check
```
2. Update CA certificates:
```bash
stella crypto ca update
```
**If private Rekor instance issue:**
1. Check private Rekor server status
2. Verify Rekor database health
3. Check Rekor signer availability
### Verification
```bash
# Test Rekor connectivity
stella attest rekor ping
# Submit test entry
stella attest rekor test-submit
# Process any queued entries
stella attest rekor process-queue
# Verify recent attestation in log
stella attest rekor lookup --attestation <attestation-id>
```
---
## Prevention
- [ ] **Redundancy:** Configure private Rekor instance as fallback
- [ ] **Queuing:** Enable queue-on-failure for resilience
- [ ] **Monitoring:** Alert on Rekor submission failures
- [ ] **Offline:** Document attestation validity without Rekor for air-gap scenarios
---
## Related Resources
- **Architecture:** `docs/modules/attestor/transparency-log.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-verification-failed.md`
- **Sigstore docs:** https://docs.sigstore.dev/
- **Rekor setup:** `docs/operations/rekor-configuration.md`

View File

@@ -0,0 +1,176 @@
# Runbook: Attestor - Signature Generation Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.signing-health` |
---
## Symptoms
- [ ] Attestation requests failing with "signing failed" error
- [ ] Alert `AttestorSigningFailed` firing
- [ ] Evidence bundles missing signatures
- [ ] Metric `attestor_signing_failures_total` increasing
- [ ] Release pipeline blocked due to unsigned attestations
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Releases blocked; attestations cannot be created |
| **Data integrity** | Evidence is recorded but unsigned; can be signed later |
| **SLA impact** | Release SLO violated; evidence integrity compromised |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.signing-health
```
2. **Check attestor service status:**
```bash
stella attest status
```
3. **Check signing key availability:**
```bash
stella keys list --type signing --status active
```
Problem if: No active signing keys
### Deep diagnosis
1. **Test signing operation:**
```bash
stella attest test-sign --verbose
```
Look for: Specific error message
2. **Check key material access:**
```bash
stella keys verify <key-id> --operation sign
```
3. **If using HSM, check HSM connectivity:**
```bash
stella doctor --check check.crypto.hsm-availability
```
4. **Check for key expiration:**
```bash
stella keys list --expiring-within 7d
```
---
## Resolution
### Immediate mitigation
1. **If key expired, rotate to backup key:**
```bash
stella keys activate <backup-key-id>
stella attest config set signing.key_id <backup-key-id>
```
2. **If HSM unavailable, switch to software signing (temporary):**
```bash
stella attest config set signing.mode software
stella attest reload
```
⚠️ **Warning:** Software signing may not meet compliance requirements
3. **Retry failed attestations:**
```bash
stella attest retry --failed --last 1h
```
### Root cause fix
**If key expired:**
1. Generate new signing key:
```bash
stella keys generate --type signing --algorithm ecdsa-p256
```
2. Configure key rotation schedule:
```bash
stella keys config set rotation.auto true
stella keys config set rotation.overlap_days 14
```
**If HSM connection failed:**
1. Verify HSM configuration:
```bash
stella crypto hsm verify
```
2. Restart HSM connection:
```bash
stella crypto hsm reconnect
```
**If certificate chain issue:**
1. Verify certificate chain:
```bash
stella crypto cert verify-chain --key <key-id>
```
2. Update intermediate certificates:
```bash
stella crypto cert update-chain --key <key-id>
```
### Verification
```bash
# Test signing
stella attest test-sign
# Create test attestation
stella attest create --type test --subject "test:verification"
# Verify the attestation
stella verify attestation --last
# Check no failures in recent operations
stella attest logs --level error --last 30m
```
---
## Prevention
- [ ] **Key rotation:** Enable automatic key rotation with 14-day overlap
- [ ] **Monitoring:** Alert on keys expiring within 30 days
- [ ] **Backup:** Maintain backup signing key in different HSM slot
- [ ] **Testing:** Include signing test in health check schedule
---
## Related Resources
- **Architecture:** `docs/modules/attestor/architecture.md`
- **Related runbooks:** `attestor-key-expired.md`, `attestor-hsm-connection.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
- **Dashboard:** Grafana > Stella Ops > Attestor

View File

@@ -0,0 +1,195 @@
# Runbook: Attestor - Attestation Verification Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.verification-health` |
---
## Symptoms
- [ ] Attestation verification failing
- [ ] Alert `AttestorVerificationFailed` firing
- [ ] Error: "signature verification failed" or "invalid attestation"
- [ ] Promotions blocked due to failed verification
- [ ] Error: "trust anchor not found" or "certificate chain invalid"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Artifacts cannot be promoted; release blocked |
| **Data integrity** | May indicate tampered attestation or configuration issue |
| **SLA impact** | Release pipeline blocked until resolved |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.verification-health
```
2. **Verify specific attestation:**
```bash
stella verify attestation --attestation <attestation-id> --verbose
```
3. **Check trust anchors:**
```bash
stella trust-anchors list
```
### Deep diagnosis
1. **Check attestation details:**
```bash
stella attest show <attestation-id> --details
```
Look for: Signer identity, timestamp, subject
2. **Verify certificate chain:**
```bash
stella verify cert-chain --attestation <attestation-id>
```
Problem if: Intermediate cert missing, root not trusted
3. **Check public key availability:**
```bash
stella keys show <key-id> --public
```
4. **Check if issuer is trusted:**
```bash
stella issuer trust-status <issuer-id>
```
---
## Resolution
### Immediate mitigation
1. **If trust anchor missing, add it:**
```bash
stella trust-anchors add --cert <issuer-cert.pem>
```
2. **If intermediate cert missing:**
```bash
stella trust-anchors add-intermediate --cert <intermediate.pem>
```
3. **Re-verify with verbose output:**
```bash
stella verify attestation --attestation <attestation-id> --verbose
```
### Root cause fix
**If signature mismatch:**
1. Check attestation wasn't modified:
```bash
stella attest integrity-check <attestation-id>
```
2. If modified, regenerate attestation:
```bash
stella attest create --subject <digest> --type <type> --force
```
**If key rotated and old key not trusted:**
1. Add old public key to trust anchors:
```bash
stella trust-anchors add-key --key <old-key.pem> --expires <date>
```
2. Or fetch from issuer directory:
```bash
stella issuer keys fetch <issuer-id>
```
**If certificate expired:**
1. Check certificate validity:
```bash
stella verify cert --attestation <attestation-id> --show-expiry
```
2. Re-sign with valid certificate:
```bash
stella attest resign <attestation-id>
```
**If issuer not trusted:**
1. Verify issuer identity:
```bash
stella issuer show <issuer-id>
```
2. Add to trusted issuers (requires approval):
```bash
stella issuer trust <issuer-id> --reason "Approved by security team"
```
**If algorithm not supported:**
1. Check algorithm:
```bash
stella attest show <attestation-id> | grep algorithm
```
2. Verify crypto provider supports algorithm:
```bash
stella crypto providers list --algorithms
```
### Verification
```bash
# Verify attestation
stella verify attestation --attestation <attestation-id>
# Verify trust chain
stella verify cert-chain --attestation <attestation-id>
# Test end-to-end verification
stella verify artifact --digest <digest>
# Check no verification errors
stella attest logs --filter "verification" --level error --last 30m
```
---
## Prevention
- [ ] **Trust anchors:** Keep trust anchor list current with all valid issuer certs
- [ ] **Key rotation:** Plan key rotation with overlap period for verification continuity
- [ ] **Monitoring:** Alert on verification failure rate > 0
- [ ] **Testing:** Include verification tests in release pipeline
---
## Related Resources
- **Architecture:** `docs/modules/attestor/verification.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-key-expired.md`
- **Trust management:** `docs/operations/trust-anchors.md`

View File

@@ -0,0 +1,449 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-004 - Backup/Restore Runbook
# Backup and Restore Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
---
## Backup Architecture Overview
### Backup Components
| Component | Backup Type | Default Schedule | Retention |
|-----------|-------------|------------------|-----------|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
| Evidence Locker | Incremental | Daily | 90 days |
| Configuration | Snapshot | Daily + on change | 90 days |
| Secrets | Encrypted snapshot | Daily | 30 days |
| Attestation Keys | Encrypted export | Weekly | 1 year |
### Storage Locations
- **Primary:** `/var/lib/stellaops/backups/` (local)
- **Secondary:** S3/Azure Blob/GCS (configurable)
- **Offline:** Removable media for air-gap scenarios
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check backup service status
stella backup status
# Verify backup storage
stella doctor --check check.storage.backup
# List recent backups
stella backup list --last 7d
# Test backup restore capability
stella backup test-restore --latest --dry-run
```
### Metrics to Watch
- `stella_backup_last_success_timestamp` - Last successful backup
- `stella_backup_duration_seconds` - Backup duration
- `stella_backup_size_bytes` - Backup size
- `stella_restore_test_last_success` - Last restore test
---
## Standard Procedures
### SP-001: Create Manual Backup
**When:** Before upgrades, schema changes, or major configuration changes
**Duration:** 5-30 minutes depending on data volume
1. Create full system backup:
```bash
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
```
2. Or create component-specific backup:
```bash
# Database only
stella backup create --type database --name "db-pre-migration"
# Evidence locker only
stella backup create --type evidence --name "evidence-snapshot"
# Configuration only
stella backup create --type config --name "config-backup"
```
3. Verify backup:
```bash
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
```
4. Copy to offsite storage (recommended):
```bash
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
```
### SP-002: Verify Backup Integrity
**Frequency:** Weekly
**Duration:** 15-60 minutes
1. List backups for verification:
```bash
stella backup list --unverified
```
2. Verify backup integrity:
```bash
# Verify specific backup
stella backup verify --name <backup-name>
# Verify all unverified
stella backup verify --all-unverified
```
3. Test restore (non-destructive):
```bash
stella backup test-restore --name <backup-name> --target /tmp/restore-test
```
4. Record verification result:
```bash
stella backup log-verification --name <backup-name> --result success
```
### SP-003: Restore from Backup
**CAUTION: This is a destructive operation**
#### Full System Restore
1. Stop all services:
```bash
stella service stop --all
```
2. List available backups:
```bash
stella backup list --type full
```
3. Restore:
```bash
# Dry run first
stella backup restore --name <backup-name> --dry-run
# Execute restore
stella backup restore --name <backup-name> --confirm
```
4. Start services:
```bash
stella service start --all
```
5. Verify restoration:
```bash
stella doctor --all
stella service health
```
#### Component-Specific Restore
1. Database restore:
```bash
stella service stop --service api,release-orchestrator
stella backup restore --type database --name <backup-name> --confirm
stella db migrate # Apply any pending migrations
stella service start --service api,release-orchestrator
```
2. Evidence locker restore:
```bash
stella backup restore --type evidence --name <backup-name> --confirm
stella evidence verify --mode quick
```
3. Configuration restore:
```bash
stella backup restore --type config --name <backup-name> --confirm
stella service restart --graceful
```
### SP-004: Point-in-Time Recovery (Database)
1. Identify target recovery point:
```bash
# List WAL archives
stella backup wal-list --after <start-date> --before <end-date>
```
2. Perform PITR:
```bash
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
```
3. Verify data state:
```bash
stella db verify-integrity
```
---
## Backup Schedules
### Configure Backup Schedule
```bash
# View current schedule
stella backup schedule show
# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"
# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"
# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change
```
### Retention Policy
```bash
# View retention policy
stella backup retention show
# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90
# Apply retention (cleanup old backups)
stella backup retention apply
```
---
## Incident Procedures
### INC-001: Backup Failure
**Symptoms:**
- Alert: `StellaBackupFailed`
- Missing recent backup
**Investigation:**
```bash
# Check backup logs
stella backup logs --last 24h
# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup
# Test backup operation
stella backup test --type database
```
**Resolution:**
1. **Disk space issue:**
```bash
stella backup retention apply --force
stella backup cleanup --expired
```
2. **Database connectivity:**
```bash
stella doctor --check check.postgres.connectivity
```
3. **Permission issue:**
- Check backup directory permissions
- Verify service account access
4. **Retry backup:**
```bash
stella backup create --type <failed-type> --retry
```
### INC-002: Restore Failure
**Symptoms:**
- Restore command fails
- Services not starting after restore
**Investigation:**
```bash
# Check restore logs
stella backup restore-logs --last-attempt
# Verify backup integrity
stella backup verify --name <backup-name>
# Check disk space
stella doctor --check check.storage.diskspace
```
**Resolution:**
1. **Corrupted backup:**
```bash
# Try previous backup
stella backup list --type <type>
stella backup restore --name <previous-backup> --confirm
```
2. **Version mismatch:**
```bash
# Check backup version
stella backup info --name <backup-name>
# Restore with migration
stella backup restore --name <backup-name> --with-migration
```
3. **Disk space:**
- Free space or expand volume
- Restore to alternate location
### INC-003: Backup Storage Full
**Symptoms:**
- Alert: `StellaBackupStorageFull`
- New backups failing
**Immediate Actions:**
```bash
# Check storage
stella backup storage stats
# Emergency cleanup
stella backup cleanup --keep-last 3
# Delete specific old backups
stella backup delete --older-than 14d --confirm
```
**Resolution:**
1. **Adjust retention:**
```bash
stella backup retention set --type database --days 14
stella backup retention apply
```
2. **Expand storage:**
- Add disk space
- Configure offsite storage
3. **Archive to cold storage:**
```bash
stella backup archive --older-than 30d --destination s3://archive-bucket/
```
---
## Disaster Recovery Scenarios
### DR-001: Complete System Loss
1. Provision new infrastructure
2. Install Stella Ops
3. Restore from offsite backup:
```bash
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
```
4. Verify all components
5. Update DNS/load balancer
### DR-002: Database Corruption
1. Stop services
2. Restore database from latest clean backup:
```bash
stella backup restore --type database --name <last-known-good>
```
3. Apply WAL to near-corruption point (PITR)
4. Verify data integrity
5. Resume services
### DR-003: Evidence Locker Loss
1. Restore evidence from backup:
```bash
stella backup restore --type evidence --name <backup-name>
```
2. Rebuild index:
```bash
stella evidence index rebuild
```
3. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
---
## Offline/Air-Gap Backup
### Creating Offline Backup
```bash
# Create encrypted offline bundle
stella backup create-offline \
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
--encrypt \
--passphrase-file /secure/backup-key
# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
```
### Restoring from Offline Backup
```bash
# Restore from offline backup
stella backup restore-offline \
--input /media/usb/stellaops-backup-*.enc \
--passphrase-file /secure/backup-key \
--confirm
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Backup Status
Key panels:
- Last backup success time
- Backup size trend
- Backup duration
- Restore test status
- Storage utilization
---
## Evidence Capture
```bash
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
---
## Escalation Path
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
2. **L2 (Platform team):** Restore operations, schedule adjustments
3. **L3 (Architecture):** Disaster recovery execution
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,196 @@
# Runbook: Feed Connector - GitHub Security Advisories (GHSA) Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / GHSA Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.ghsa-health` |
---
## Symptoms
- [ ] GHSA feed sync failing or stale
- [ ] Alert `ConnectorGhsaSyncFailed` firing
- [ ] Error: "GitHub API rate limit exceeded" or "GraphQL query failed"
- [ ] GitHub Advisory Database vulnerabilities missing
- [ ] Metric `connector_sync_failures_total{source="ghsa"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | GitHub ecosystem vulnerabilities may be missed |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated for GitHub packages |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.ghsa-health
```
2. **Check GHSA sync status:**
```bash
stella admin feeds status --source ghsa
```
3. **Test GitHub API connectivity:**
```bash
stella connector test ghsa
```
### Deep diagnosis
1. **Check GitHub API rate limit:**
```bash
stella connector ghsa rate-limit-status
```
Problem if: Remaining = 0, rate limit exceeded
2. **Check GitHub token permissions:**
```bash
stella connector credentials show ghsa --check-scopes
```
Required scopes: `public_repo`, `read:packages` (for private advisory access)
3. **Check sync logs:**
```bash
stella connector logs ghsa --last 1h --level error
```
Look for: GraphQL errors, pagination issues, timeout
4. **Check for GitHub API outage:**
```bash
stella connector ghsa api-status
```
Also check: https://www.githubstatus.com/
---
## Resolution
### Immediate mitigation
1. **If rate limited, wait for reset:**
```bash
stella connector ghsa rate-limit-status
# Note the reset time, then:
stella admin feeds refresh --source ghsa
```
2. **Use secondary token if available:**
```bash
stella connector credentials rotate ghsa --to secondary
stella admin feeds refresh --source ghsa
```
3. **Load from offline bundle:**
```bash
stella offline load --source ghsa --package ghsa-bundle-latest.tar.gz
```
### Root cause fix
**If rate limit consistently exceeded:**
1. Increase sync interval:
```bash
stella connector config set ghsa.sync_interval 4h
```
2. Enable incremental sync:
```bash
stella connector config set ghsa.incremental_sync true
```
3. Use authenticated requests (10x rate limit):
```bash
stella connector credentials update ghsa --token <github-pat>
```
**If token expired or invalid:**
1. Generate new GitHub PAT at https://github.com/settings/tokens
2. Update token:
```bash
stella connector credentials update ghsa --token <new-token>
```
3. Verify scopes:
```bash
stella connector credentials show ghsa --check-scopes
```
**If GraphQL query failing:**
1. Check for API schema changes:
```bash
stella connector ghsa schema-check
```
2. Update connector if schema changed:
```bash
stella upgrade --component connector-ghsa
```
**If pagination broken:**
1. Reset sync cursor:
```bash
stella connector ghsa reset-cursor
```
2. Force full resync:
```bash
stella admin feeds refresh --source ghsa --full
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source ghsa
# Monitor sync progress
stella admin feeds status --source ghsa --watch
# Verify recent advisories present
stella vuln query GHSA-xxxx-xxxx-xxxx # Use a recent GHSA ID
# Check no errors
stella connector logs ghsa --level error --last 1h
```
---
## Prevention
- [ ] **Authentication:** Always use authenticated requests for 5000/hr rate limit
- [ ] **Monitoring:** Alert on last sync > 12h or sync failures
- [ ] **Redundancy:** Use NVD/OSV as backup for GitHub ecosystem coverage
- [ ] **Token rotation:** Rotate tokens before expiration
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/ghsa.md`
- **Related runbooks:** `connector-nvd.md`, `connector-osv.md`
- **GitHub API docs:** https://docs.github.com/en/graphql

View File

@@ -0,0 +1,195 @@
# Runbook: Feed Connector - NVD Connector Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / NVD Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.nvd-health` |
---
## Symptoms
- [ ] NVD feed sync failing or stale (> 24h since last successful sync)
- [ ] Alert `ConnectorNvdSyncFailed` firing
- [ ] Error: "NVD API request failed" or "rate limit exceeded"
- [ ] Vulnerability data missing or outdated
- [ ] Metric `connector_sync_failures_total{source="nvd"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Vulnerability scans may miss recent CVEs |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated (target: < 24h) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.nvd-health
```
2. **Check NVD sync status:**
```bash
stella admin feeds status --source nvd
```
Look for: Last sync time, error message, sync state
3. **Check NVD API connectivity:**
```bash
stella connector test nvd
```
### Deep diagnosis
1. **Check NVD API key status:**
```bash
stella connector credentials show nvd
```
Problem if: API key expired or rate limit exhausted
2. **Check NVD API rate limit:**
```bash
stella connector nvd rate-limit-status
```
Problem if: Remaining requests = 0, reset time in future
3. **Check for NVD API outage:**
```bash
stella connector nvd api-status
```
Also check: https://nvd.nist.gov/general/news
4. **Check sync logs:**
```bash
stella connector logs nvd --last 1h --level error
```
Look for: HTTP status codes, timeout errors, parsing failures
---
## Resolution
### Immediate mitigation
1. **If rate limited, wait for reset:**
```bash
stella connector nvd rate-limit-status
# Wait for reset time, then:
stella admin feeds refresh --source nvd
```
2. **If API key expired, use anonymous mode (slower):**
```bash
stella connector config set nvd.api_key_mode anonymous
stella admin feeds refresh --source nvd
```
3. **Load from offline bundle if urgent:**
```bash
# If you have a recent offline bundle:
stella offline load --source nvd --package nvd-bundle-latest.tar.gz
```
### Root cause fix
**If API key expired or invalid:**
1. Generate new NVD API key at https://nvd.nist.gov/developers/request-an-api-key
2. Update API key:
```bash
stella connector credentials update nvd --api-key <new-key>
```
3. Verify connectivity:
```bash
stella connector test nvd
```
**If rate limit consistently exceeded:**
1. Increase sync interval to reduce API calls:
```bash
stella connector config set nvd.sync_interval 6h
```
2. Enable delta sync to reduce data volume:
```bash
stella connector config set nvd.delta_sync true
```
3. Request higher rate limit from NVD (if available)
**If network/firewall issue:**
1. Verify outbound connectivity to NVD API:
```bash
stella connector test nvd --verbose
```
2. Check proxy configuration if required:
```bash
stella connector config set nvd.proxy https://proxy:8080
```
**If data parsing failures:**
1. Check for NVD schema changes:
```bash
stella connector nvd schema-check
```
2. Update connector if schema changed:
```bash
stella upgrade --component connector-nvd
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source nvd --force
# Monitor sync progress
stella admin feeds status --source nvd --watch
# Verify recent CVEs are present
stella vuln query CVE-2026-XXXX # Use a recent CVE ID
# Check no errors in recent logs
stella connector logs nvd --level error --last 1h
```
---
## Prevention
- [ ] **API Key:** Always use API key (not anonymous) for 10x rate limit
- [ ] **Monitoring:** Alert on last sync > 24h or sync failure
- [ ] **Redundancy:** Configure backup connector (OSV, GitHub Advisory) for overlap
- [ ] **Offline:** Maintain weekly offline bundle for disaster recovery
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/nvd.md`
- **Related runbooks:** `connector-ghsa.md`, `connector-osv.md`
- **Dashboard:** Grafana > Stella Ops > Feed Connectors

View File

@@ -0,0 +1,193 @@
# Runbook: Feed Connector - OSV (Open Source Vulnerabilities) Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / OSV Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.osv-health` |
---
## Symptoms
- [ ] OSV feed sync failing or stale
- [ ] Alert `ConnectorOsvSyncFailed` firing
- [ ] Error: "OSV API request failed" or "ecosystem sync failed"
- [ ] OSV vulnerabilities missing from database
- [ ] Metric `connector_sync_failures_total{source="osv"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Open source ecosystem vulnerabilities may be missed |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated for affected ecosystems |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.osv-health
```
2. **Check OSV sync status:**
```bash
stella admin feeds status --source osv
```
3. **Test OSV API connectivity:**
```bash
stella connector test osv
```
### Deep diagnosis
1. **Check ecosystem-specific status:**
```bash
stella connector osv ecosystems status
```
Look for: Failed ecosystems, stale ecosystems
2. **Check sync logs:**
```bash
stella connector logs osv --last 1h --level error
```
Look for: API errors, parsing failures, timeout
3. **Check for OSV API outage:**
```bash
stella connector osv api-status
```
Also check: https://osv.dev/
4. **Check GCS bucket access (OSV uses GCS for bulk data):**
```bash
stella connector osv gcs-status
```
---
## Resolution
### Immediate mitigation
1. **Retry sync for specific ecosystem:**
```bash
stella admin feeds refresh --source osv --ecosystem npm
```
2. **Sync from GCS bucket directly (faster for bulk):**
```bash
stella connector osv sync-from-gcs
```
3. **Load from offline bundle:**
```bash
stella offline load --source osv --package osv-bundle-latest.tar.gz
```
### Root cause fix
**If API request failing:**
1. Check API endpoint:
```bash
stella connector osv api-test
```
2. Verify no proxy blocking:
```bash
stella connector config set osv.proxy <proxy-url>
```
**If GCS access failing:**
1. Check GCS connectivity:
```bash
stella connector osv gcs-test
```
2. Enable anonymous access (default):
```bash
stella connector config set osv.gcs_auth anonymous
```
3. Or configure service account:
```bash
stella connector config set osv.gcs_credentials /path/to/sa-key.json
```
**If specific ecosystem failing:**
1. Disable problematic ecosystem temporarily:
```bash
stella connector config set osv.ecosystems.disabled <ecosystem>
```
2. Check ecosystem data format:
```bash
stella connector osv ecosystem-check <ecosystem>
```
**If parsing errors:**
1. Check for schema changes:
```bash
stella connector osv schema-check
```
2. Update connector:
```bash
stella upgrade --component connector-osv
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source osv
# Monitor sync progress
stella admin feeds status --source osv --watch
# Verify ecosystem coverage
stella connector osv ecosystems status
# Query recent vulnerability
stella vuln query OSV-2026-xxxx
# Check no errors
stella connector logs osv --level error --last 1h
```
---
## Prevention
- [ ] **Bulk sync:** Use GCS bulk sync for initial load and daily updates
- [ ] **Monitoring:** Alert on ecosystem sync failures
- [ ] **Redundancy:** NVD/GHSA provide overlapping coverage for major ecosystems
- [ ] **Offline:** Maintain weekly offline bundle
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/osv.md`
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`
- **OSV API docs:** https://osv.dev/docs/

View File

@@ -0,0 +1,220 @@
# Runbook Template: Feed Connector - Vendor-Specific Connectors
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Overview
This is a template runbook for vendor-specific advisory feed connectors (RedHat, Ubuntu, Debian, Oracle, VMware, etc.). Use this template to create runbooks for specific vendor connectors.
---
## Metadata Template
| Field | Value |
|-------|-------|
| **Component** | Concelier / [Vendor] Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | [Date] |
| **Doctor check** | `check.connector.[vendor]-health` |
---
## Common Vendor Connector Issues
### Authentication Failures
**Symptoms:**
- Sync failing with 401/403 errors
- "authentication failed" or "invalid credentials"
**Resolution:**
```bash
# Check credentials
stella connector credentials show <vendor>
# Update credentials
stella connector credentials update <vendor> --api-key <key>
# Test connectivity
stella connector test <vendor>
```
### Rate Limiting
**Symptoms:**
- Sync failing with 429 errors
- "rate limit exceeded"
**Resolution:**
```bash
# Check rate limit status
stella connector <vendor> rate-limit-status
# Increase sync interval
stella connector config set <vendor>.sync_interval 6h
# Enable delta sync
stella connector config set <vendor>.delta_sync true
```
### Data Format Changes
**Symptoms:**
- Parsing errors in sync logs
- "unexpected format" or "schema validation failed"
**Resolution:**
```bash
# Check for schema changes
stella connector <vendor> schema-check
# Update connector
stella upgrade --component connector-<vendor>
```
### Offline Bundle Refresh
**Resolution:**
```bash
# Create offline bundle
stella offline sync --feeds <vendor> --output <vendor>-bundle.tar.gz
# Load offline bundle
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
```
---
## Vendor-Specific Runbooks
Use this template to create runbooks for:
### RedHat Security Data
**Endpoint:** https://access.redhat.com/security/data/
**Authentication:** API token or certificate
**Connector:** `connector-redhat`
Key commands:
```bash
stella connector test redhat
stella admin feeds status --source redhat
stella connector redhat cve-map-status # RHSA to CVE mapping
```
### Ubuntu Security Notices
**Endpoint:** https://ubuntu.com/security/notices
**Authentication:** None (public)
**Connector:** `connector-ubuntu`
Key commands:
```bash
stella connector test ubuntu
stella admin feeds status --source ubuntu
stella connector ubuntu usn-status # USN sync status
```
### Debian Security Tracker
**Endpoint:** https://security-tracker.debian.org/
**Authentication:** None (public)
**Connector:** `connector-debian`
Key commands:
```bash
stella connector test debian
stella admin feeds status --source debian
stella connector debian dla-status # DLA sync status
```
### Oracle Security Alerts
**Endpoint:** https://www.oracle.com/security-alerts/
**Authentication:** Oracle account (optional)
**Connector:** `connector-oracle`
Key commands:
```bash
stella connector test oracle
stella admin feeds status --source oracle
stella connector oracle cpu-status # Critical Patch Update status
```
### VMware Security Advisories
**Endpoint:** https://www.vmware.com/security/advisories
**Authentication:** None (public)
**Connector:** `connector-vmware`
Key commands:
```bash
stella connector test vmware
stella admin feeds status --source vmware
stella connector vmware vmsa-status # VMSA sync status
```
---
## Diagnosis Checklist
For any vendor connector issue:
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.<vendor>-health
```
2. **Check sync status:**
```bash
stella admin feeds status --source <vendor>
```
3. **Test connectivity:**
```bash
stella connector test <vendor>
```
4. **Check logs:**
```bash
stella connector logs <vendor> --last 1h --level error
```
5. **Check credentials (if applicable):**
```bash
stella connector credentials show <vendor>
```
---
## Resolution Checklist
1. **Retry sync:**
```bash
stella admin feeds refresh --source <vendor>
```
2. **Update credentials (if auth issue):**
```bash
stella connector credentials update <vendor>
```
3. **Update connector (if format changed):**
```bash
stella upgrade --component connector-<vendor>
```
4. **Load offline bundle (if API unavailable):**
```bash
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
```
---
## Related Resources
- **Connector architecture:** `docs/modules/concelier/connectors.md`
- **Vendor connector configs:** `docs/modules/concelier/operations/connectors/`
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`, `connector-osv.md`

View File

@@ -0,0 +1,370 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-002 - Crypto Subsystem Runbook
# Regional Crypto Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Cryptographic subsystem operations including HSM management, regional crypto profile configuration, key rotation, and certificate management for all supported crypto profiles (International, FIPS, eIDAS, GOST, SM).
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check crypto subsystem health
stella doctor --category crypto
# Verify active crypto profile
stella crypto profile show
# List loaded crypto providers
stella crypto providers list
# Check key status
stella crypto keys status
```
### Metrics to Watch
- `stella_crypto_operations_total` - Crypto operation count by type
- `stella_crypto_operation_duration_seconds` - Signing/verification latency
- `stella_hsm_availability` - HSM availability (if configured)
- `stella_cert_expiry_days` - Certificate expiration countdown
---
## Regional Crypto Profiles
### Profile Overview
| Profile | Use Case | Key Algorithms | Compliance |
|---------|----------|----------------|------------|
| `international` | Default, most deployments | RSA-2048+, ECDSA P-256/P-384, Ed25519 | General |
| `fips` | US Government / FedRAMP | FIPS 140-2 approved algorithms only | FIPS 140-2 |
| `eidas` | European Union | RSA-PSS, ECDSA, Ed25519 per ETSI TS 119 312 | eIDAS |
| `gost` | Russian Federation | GOST R 34.10-2012, GOST R 34.11-2012 | Russian standards |
| `sm` | China | SM2, SM3, SM4 | GM/T 0003-2012 |
### Switching Profiles
1. **Pre-switch verification:**
```bash
# Verify target profile is available
stella crypto profile verify --profile <target-profile>
# Check for incompatible existing signatures
stella crypto audit --check-compatibility --target-profile <target-profile>
```
2. **Profile switch:**
```bash
# Switch profile (requires service restart)
stella crypto profile set --profile <target-profile>
# Restart services to apply
stella service restart --graceful
```
3. **Post-switch verification:**
```bash
stella doctor --check check.crypto.fips,check.crypto.eidas,check.crypto.gost,check.crypto.sm
```
---
## Standard Procedures
### SP-001: Key Rotation
**Frequency:** Quarterly or per policy
**Duration:** ~15 minutes (no downtime)
1. Generate new key:
```bash
# For software keys
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name signing-$(date +%Y%m)
# For HSM-backed keys
stella crypto keys generate --type signing --algorithm ecdsa-p256 --provider hsm --name signing-$(date +%Y%m)
```
2. Activate new key:
```bash
stella crypto keys activate --name signing-$(date +%Y%m)
```
3. Verify signing with new key:
```bash
echo "test" | stella crypto sign --output /dev/null
```
4. Schedule old key deactivation:
```bash
stella crypto keys schedule-deactivation --name <old-key-name> --in 30d
```
### SP-002: Certificate Renewal
**When:** Certificate expiring within 30 days
1. Check expiration:
```bash
stella crypto certs check-expiry
```
2. Generate CSR:
```bash
stella crypto certs csr --subject "CN=stellaops.example.com,O=Example Corp" --output cert.csr
```
3. Install renewed certificate:
```bash
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
```
4. Verify certificate chain:
```bash
stella doctor --check check.crypto.certchain
```
5. Restart services:
```bash
stella service restart --graceful
```
### SP-003: HSM Health Check
**Frequency:** Daily (automated) or on-demand
1. Check HSM connectivity:
```bash
stella crypto hsm status
```
2. Verify slot access:
```bash
stella crypto hsm slots list
```
3. Test signing operation:
```bash
stella crypto hsm test-sign
```
4. Check HSM metrics:
- Free objects/sessions
- Temperature/health (vendor-specific)
---
## Incident Procedures
### INC-001: HSM Unavailable
**Symptoms:**
- Alert: `StellaHsmUnavailable`
- Signing operations failing with "HSM connection error"
**Investigation:**
```bash
# Check HSM status
stella crypto hsm status
# Test PKCS#11 module
stella crypto hsm test-module
# Check network to HSM
stella network test --host <hsm-host> --port <hsm-port>
```
**Resolution:**
1. **Network issue:**
- Verify network path to HSM
- Check firewall rules
- Verify HSM appliance is powered on
2. **Session exhaustion:**
```bash
# Release stale sessions
stella crypto hsm sessions release --stale
# Restart crypto service
stella service restart --service crypto-signer
```
3. **HSM failure:**
- Fail over to secondary HSM (if configured)
- Contact HSM vendor support
- Consider temporary fallback to software keys (with approval)
### INC-002: Signing Key Compromised
**CRITICAL - Follow incident response procedure**
1. **Immediate containment:**
```bash
# Revoke compromised key
stella crypto keys revoke --name <compromised-key> --reason compromise
# Block signing with compromised key
stella crypto keys block --name <compromised-key>
```
2. **Generate replacement key:**
```bash
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name emergency-signing
stella crypto keys activate --name emergency-signing
```
3. **Notify downstream:**
- Update trust registries with new key
- Notify relying parties
- Publish key revocation notice
4. **Forensics:**
```bash
# Export key usage audit log
stella crypto audit export --key <compromised-key> --output /secure/key-audit.json
```
### INC-003: Certificate Expired
**Symptoms:**
- TLS connection failures
- Alert: `StellaCertExpired`
**Immediate Resolution:**
1. If renewed certificate is available:
```bash
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
stella service restart --graceful
```
2. If renewal not ready - emergency self-signed (temporary):
```bash
# Generate emergency certificate (NOT for production use)
stella crypto certs generate-self-signed --days 7 --name emergency
stella crypto certs install --cert emergency.pem
stella service restart --graceful
```
3. Expedite certificate renewal process
### INC-004: FIPS Mode Not Enabled
**Symptoms:**
- Alert: `StellaFipsNotEnabled`
- Compliance audit failure
**Resolution:**
1. **Linux:**
```bash
# Enable FIPS mode
sudo fips-mode-setup --enable
# Reboot required
sudo reboot
# Verify after reboot
fips-mode-setup --check
```
2. **Windows:**
- Enable via Group Policy
- Or via registry:
```powershell
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" -Name "Enabled" -Value 1
Restart-Computer
```
3. Restart Stella services:
```bash
stella service restart
stella doctor --check check.crypto.fips
```
---
## Regional-Specific Procedures
### GOST Configuration (Russian Federation)
1. Install GOST engine:
```bash
sudo apt install libengine-gost-openssl1.1
```
2. Configure Stella:
```bash
stella crypto profile set --profile gost
stella crypto config set --gost-engine-path /usr/lib/x86_64-linux-gnu/engines-3/gost.so
```
3. Verify:
```bash
stella doctor --check check.crypto.gost
```
### SM Configuration (China)
1. Ensure OpenSSL 1.1.1+ with SM support:
```bash
openssl version
openssl list -cipher-algorithms | grep -i sm
```
2. Configure Stella:
```bash
stella crypto profile set --profile sm
```
3. Verify:
```bash
stella doctor --check check.crypto.sm
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Crypto Subsystem
Key panels:
- Signing operation latency
- Key usage by key ID
- HSM availability
- Certificate expiration countdown
- Crypto profile in use
---
## Evidence Capture
```bash
# Comprehensive crypto diagnostics
stella crypto diagnostics --output /tmp/crypto-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Active crypto profile
- Key inventory (public keys only)
- Certificate chain
- HSM status
- Operation audit log (last 24h)
---
## Escalation Path
1. **L1 (On-call):** Certificate installs, key activation
2. **L2 (Security team):** Key rotation, HSM issues
3. **L3 (Crypto SME):** Algorithm issues, compliance questions
4. **HSM Vendor:** Hardware failures
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,408 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-003 - Evidence Locker Runbook
# Evidence Locker Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Evidence locker operations including storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery procedures.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check evidence locker health
stella doctor --category evidence
# Verify storage accessibility
stella evidence status
# Check index health
stella evidence index status
# Verify anchor chain
stella evidence anchor verify --latest
```
### Metrics to Watch
- `stella_evidence_artifacts_total` - Total artifacts stored
- `stella_evidence_retrieval_latency_seconds` - Retrieval latency P99
- `stella_evidence_storage_bytes` - Storage consumption
- `stella_merkle_anchor_age_seconds` - Time since last anchor
---
## Standard Procedures
### SP-001: Daily Integrity Check
**Frequency:** Daily (automated) or on-demand
**Duration:** Varies by locker size (typically 5-30 minutes)
1. Run integrity verification:
```bash
# Quick check (sample-based)
stella evidence verify --mode quick
# Full check (all artifacts)
stella evidence verify --mode full
```
2. Review results:
```bash
stella evidence verify-report --latest
```
3. Address any failures:
```bash
# List failed artifacts
stella evidence verify-report --latest --filter failed
```
### SP-002: Index Maintenance
**Frequency:** Weekly or after large ingestion
**Duration:** ~10 minutes
1. Check index health:
```bash
stella evidence index status
```
2. Refresh index if needed:
```bash
# Incremental refresh
stella evidence index refresh
# Full rebuild (if corruption suspected)
stella evidence index rebuild
```
3. Optimize index:
```bash
stella evidence index optimize
```
### SP-003: Merkle Anchoring
**Frequency:** Per policy (default: every 6 hours)
**Duration:** ~2 minutes
1. Create new anchor:
```bash
stella evidence anchor create
```
2. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
3. Export anchor for external archival:
```bash
stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
```
### SP-004: Storage Cleanup
**Frequency:** Monthly or when storage alerts trigger
**Duration:** Varies
1. Review storage usage:
```bash
stella evidence storage stats
```
2. Apply retention policy:
```bash
# Dry run first
stella evidence cleanup --apply-retention --dry-run
# Execute cleanup
stella evidence cleanup --apply-retention
```
3. Archive old evidence (if required):
```bash
stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y).tar
```
---
## Incident Procedures
### INC-001: Integrity Verification Failure
**Symptoms:**
- Alert: `StellaEvidenceIntegrityFailure`
- Verification reports hash mismatch
**Investigation:**
```bash
# Get failure details
stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
# Check specific artifact
stella evidence inspect <artifact-id>
# Check provenance
stella evidence provenance show <artifact-id>
```
**Resolution:**
1. **Isolated corruption:**
```bash
# Attempt recovery from replica (if available)
stella evidence recover --id <artifact-id> --source replica
# If no replica, mark as corrupted
stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
```
2. **Widespread corruption:**
- Stop evidence ingestion
- Identify corruption extent
- Restore from backup if necessary
- Escalate to L3
3. **False positive (software bug):**
- Verify with multiple hash implementations
- Check for recent software updates
- Report bug if confirmed
### INC-002: Evidence Retrieval Failure
**Symptoms:**
- Alert: `StellaEvidenceRetrievalFailed`
- API returning 404 for known artifacts
**Investigation:**
```bash
# Check if artifact exists
stella evidence exists <artifact-id>
# Check index
stella evidence index lookup <artifact-id>
# Check storage backend
stella evidence storage check <artifact-id>
```
**Resolution:**
1. **Index corruption:**
```bash
# Rebuild index
stella evidence index rebuild
```
2. **Storage backend issue:**
```bash
# Check storage health
stella doctor --check check.storage.evidencelocker
# Verify storage connectivity
stella evidence storage test
```
3. **File system issue:**
- Check disk health
- Verify file permissions
- Check mount status
### INC-003: Anchor Chain Break
**Symptoms:**
- Alert: `StellaMerkleAnchorChainBroken`
- Anchor verification fails
**Investigation:**
```bash
# Check anchor chain
stella evidence anchor verify --all --verbose
# Find break point
stella evidence anchor list --show-links
# Inspect specific anchor
stella evidence anchor inspect <anchor-id>
```
**Resolution:**
1. **Single broken link:**
```bash
# Attempt to recover from backup
stella evidence anchor recover --id <anchor-id> --source backup
```
2. **Multiple breaks:**
- Stop new anchoring
- Assess extent of damage
- Restore from backup or rebuild chain
3. **Create new chain segment:**
```bash
# Start new chain (preserves old chain as archived)
stella evidence anchor new-chain --reason "chain-break-recovery"
```
### INC-004: Storage Full
**Symptoms:**
- Alert: `StellaEvidenceStorageFull`
- Ingestion failing
**Immediate Actions:**
```bash
# Check storage usage
stella evidence storage stats
# Emergency cleanup of temporary files
stella evidence cleanup --temp-only
# Find large/old artifacts
stella evidence storage analyze --sort size --limit 20
```
**Resolution:**
1. **Apply retention policy:**
```bash
stella evidence cleanup --apply-retention --aggressive
```
2. **Archive old evidence:**
```bash
stella evidence archive --older-than 180d --compress
```
3. **Expand storage:**
- Follow cloud provider procedure
- Or add additional storage volume
---
## Disaster Recovery
### DR-001: Full Evidence Locker Recovery
**Prerequisites:**
- Backup available
- Target storage provisioned
- Recovery environment ready
**Procedure:**
1. Provision new storage:
```bash
stella evidence storage provision --size <size>
```
2. Restore from backup:
```bash
# List available backups
stella backup list --type evidence-locker
# Restore
stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
```
3. Verify restoration:
```bash
stella evidence verify --mode full
stella evidence anchor verify --all
```
4. Update service configuration:
```bash
stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
stella service restart
```
### DR-002: Point-in-Time Recovery
For recovering to a specific point in time:
1. Identify target anchor:
```bash
stella evidence anchor list --before <timestamp>
```
2. Restore to that point:
```bash
stella evidence restore --to-anchor <anchor-id>
```
3. Verify integrity:
```bash
stella evidence verify --mode full --to-anchor <anchor-id>
```
---
## Offline Mode Operations
### Preparing Offline Evidence Pack
```bash
# Export evidence for specific artifact
stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
# Export with all dependencies
stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
```
### Verifying Evidence Offline
```bash
# Verify evidence pack without network
stella evidence verify --offline --input evidence-pack.tar.gz
# Replay verdict using evidence
stella replay --evidence evidence-pack.tar.gz --output verdict.json
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Evidence Locker
Key panels:
- Artifact ingestion rate
- Retrieval latency
- Storage utilization trend
- Integrity check status
- Anchor chain health
---
## Evidence Capture
For any incident:
```bash
stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Index status
- Storage stats
- Recent anchor chain
- Integrity check results
- Operation audit log
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, cleanup operations
2. **L2 (Platform team):** Index rebuild, anchor issues
3. **L3 (Architecture):** Chain recovery, DR procedures
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,183 @@
# Runbook: Release Orchestrator - Required Evidence Not Found
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.evidence-availability` |
---
## Symptoms
- [ ] Promotion failing with "required evidence not found"
- [ ] Alert `OrchestratorEvidenceMissing` firing
- [ ] Gate evaluation blocked waiting for evidence
- [ ] Error: "SBOM not found" or "attestation missing"
- [ ] Evidence chain incomplete for artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Promotion blocked until evidence is generated |
| **Data integrity** | Indicates missing security artifact - must be resolved |
| **SLA impact** | Release blocked; compliance requirements not met |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.evidence-availability
```
2. **List missing evidence for promotion:**
```bash
stella promotion evidence <promotion-id> --missing
```
3. **Check what evidence exists for artifact:**
```bash
stella evidence list --artifact <digest>
```
### Deep diagnosis
1. **Check evidence chain completeness:**
```bash
stella evidence chain --artifact <digest> --verbose
```
Look for: Missing nodes in the chain
2. **Check if scan completed:**
```bash
stella scanner jobs list --artifact <digest>
```
Problem if: No completed scan or scan failed
3. **Check if attestation was created:**
```bash
stella attest list --subject <digest>
```
Problem if: No attestation or attestation failed
4. **Check evidence store health:**
```bash
stella evidence store health
```
---
## Resolution
### Immediate mitigation
1. **Generate missing SBOM:**
```bash
stella scan image --image <image-ref> --sbom-only
```
2. **Generate missing attestation:**
```bash
stella attest create --subject <digest> --type slsa-provenance
```
3. **Re-scan artifact to regenerate all evidence:**
```bash
stella scan image --image <image-ref> --force
```
### Root cause fix
**If scan never ran:**
1. Check why artifact wasn't scanned:
```bash
stella scanner queue list --artifact <digest>
```
2. Configure automatic scanning on push:
```bash
stella scanner config set auto_scan.enabled true
stella scanner config set auto_scan.triggers "push,promote"
```
**If evidence was generated but not stored:**
1. Check evidence store connectivity:
```bash
stella evidence store health
```
2. Retry evidence storage:
```bash
stella evidence retry-store --artifact <digest>
```
**If attestation signing failed:**
1. Check attestor status:
```bash
stella attest status
```
2. See `attestor-signing-failed.md` runbook
**If evidence expired or was deleted:**
1. Check evidence retention policy:
```bash
stella evidence policy show
```
2. Regenerate evidence:
```bash
stella scan image --image <image-ref> --force
stella attest create --subject <digest> --type slsa-provenance
```
### Verification
```bash
# Check all evidence now exists
stella evidence list --artifact <digest>
# Verify evidence chain is complete
stella evidence chain --artifact <digest>
# Retry promotion
stella promotion retry <promotion-id>
# Verify promotion proceeds
stella promotion status <promotion-id>
```
---
## Prevention
- [ ] **Auto-scan:** Enable automatic scanning for all pushed images
- [ ] **Gates:** Configure evidence requirements clearly in promotion policy
- [ ] **Monitoring:** Alert on evidence generation failures
- [ ] **Retention:** Set appropriate evidence retention periods
---
## Related Resources
- **Architecture:** `docs/modules/evidence-locker/architecture.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `attestor-signing-failed.md`
- **Evidence requirements:** `docs/operations/evidence-requirements.md`

View File

@@ -0,0 +1,178 @@
# Runbook: Release Orchestrator - Gate Evaluation Timeout
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.gate-timeout` |
---
## Symptoms
- [ ] Promotion gates timing out before completing evaluation
- [ ] Alert `OrchestratorGateTimeout` firing
- [ ] Error: "gate evaluation timeout exceeded"
- [ ] Promotion stuck waiting for gate response
- [ ] Metric `orchestrator_gate_timeout_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
| **Data integrity** | No data loss; promotion can be retried |
| **SLA impact** | Release SLO violated if timeout persists |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.gate-timeout
```
2. **Identify timed-out gates:**
```bash
stella promotion gates <promotion-id> --status timeout
```
3. **Check gate service health:**
```bash
stella orch gate-services status
```
### Deep diagnosis
1. **Check specific gate latency:**
```bash
stella orch gate stats --gate <gate-name> --last 1h
```
Look for: P95 latency, timeout rate
2. **Check external service connectivity:**
```bash
stella orch connectivity --gate <gate-name>
```
3. **Check gate evaluation logs:**
```bash
stella orch logs --gate <gate-name> --promotion <promotion-id>
```
Look for: Slow queries, external API delays
4. **Check policy engine latency (for policy gates):**
```bash
stella policy stats --last 10m
```
---
## Resolution
### Immediate mitigation
1. **Increase timeout for specific gate:**
```bash
stella orch config set gates.<gate-name>.timeout 5m
stella orch reload
```
2. **Skip the timed-out gate (requires approval):**
```bash
stella promotion gate skip <promotion-id> <gate-name> \
--reason "External service timeout - approved by <approver>"
```
3. **Retry the promotion:**
```bash
stella promotion retry <promotion-id>
```
### Root cause fix
**If external service is slow:**
1. Configure gate retry with backoff:
```bash
stella orch config set gates.<gate-name>.retries 3
stella orch config set gates.<gate-name>.retry_backoff 5s
```
2. Enable gate result caching:
```bash
stella orch config set gates.<gate-name>.cache_ttl 5m
```
3. Configure circuit breaker:
```bash
stella orch config set gates.<gate-name>.circuit_breaker.enabled true
stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
```
**If policy evaluation is slow:**
1. Optimize policy (see `policy-evaluation-slow.md` runbook)
2. Increase policy worker count:
```bash
stella policy config set opa.workers 4
```
**If evidence retrieval is slow:**
1. Enable evidence pre-fetching:
```bash
stella orch config set gates.evidence_prefetch true
```
2. Increase evidence cache:
```bash
stella orch config set evidence.cache_size 1000
stella orch config set evidence.cache_ttl 10m
```
### Verification
```bash
# Retry promotion
stella promotion retry <promotion-id>
# Monitor gate evaluation
stella promotion gates <promotion-id> --watch
# Check gate latency improved
stella orch gate stats --gate <gate-name> --last 10m
# Verify no timeouts
stella orch logs --filter "timeout" --last 30m
```
---
## Prevention
- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
- [ ] **Monitoring:** Alert on gate P95 latency > 1m
- [ ] **Caching:** Enable caching for slow gates
- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/gates.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
- **Dashboard:** Grafana > Stella Ops > Gate Latency

View File

@@ -0,0 +1,168 @@
# Runbook: Release Orchestrator - Promotion Job Not Progressing
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.job-health` |
---
## Symptoms
- [ ] Promotion job stuck in "in_progress" state for >10 minutes
- [ ] No progress updates in promotion timeline
- [ ] Alert `OrchestratorPromotionStuck` firing
- [ ] UI shows promotion spinner indefinitely
- [ ] Downstream environment not receiving promoted artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Release blocked, cannot promote to target environment |
| **Data integrity** | Artifact is safe; promotion can be retried |
| **SLA impact** | Release SLO violated if not resolved within 30 minutes |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.job-health
```
2. **Check promotion status:**
```bash
stella promotion status <promotion-id>
```
Look for: Current step, last update time, any error messages
3. **Check orchestrator service:**
```bash
stella orch status
```
### Deep diagnosis
1. **Get detailed promotion trace:**
```bash
stella promotion trace <promotion-id> --verbose
```
Look for: Which step is stuck, any timeouts
2. **Check gate evaluation status:**
```bash
stella promotion gates <promotion-id>
```
Problem if: Gate stuck waiting for external service
3. **Check target environment connectivity:**
```bash
stella orch connectivity --target <env-name>
```
4. **Check for lock contention:**
```bash
stella orch locks list
```
Problem if: Stale locks on the artifact or environment
---
## Resolution
### Immediate mitigation
1. **If gate is stuck waiting for external service:**
```bash
# Skip the stuck gate (requires approval)
stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
```
2. **If lock is stale:**
```bash
# Release the lock (use with caution)
stella orch locks release <lock-id> --force
```
3. **If orchestrator is unresponsive:**
```bash
stella service restart orchestrator
```
### Root cause fix
**If external gate service is slow:**
1. Increase gate timeout:
```bash
stella orch config set gates.<gate-name>.timeout 5m
```
2. Configure gate retry:
```bash
stella orch config set gates.<gate-name>.retries 3
```
**If target environment is unreachable:**
1. Check network connectivity to target
2. Verify credentials for target environment:
```bash
stella orch credentials verify --target <env-name>
```
**If database lock contention:**
1. Increase lock timeout:
```bash
stella orch config set locks.timeout 60s
```
2. Enable optimistic locking:
```bash
stella orch config set locks.mode optimistic
```
### Verification
```bash
# Check promotion completed
stella promotion status <promotion-id>
# Verify artifact in target environment
stella orch artifacts list --env <target-env> --filter <artifact-digest>
# Check no stuck promotions
stella promotion list --status in_progress --older-than 5m
```
---
## Prevention
- [ ] **Timeouts:** Configure appropriate timeouts for all gates
- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
- [ ] **Health checks:** Enable connectivity pre-checks before promotion
- [ ] **Documentation:** Document SLAs for external gate services
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
- **Dashboard:** Grafana > Stella Ops > Release Orchestrator

View File

@@ -0,0 +1,189 @@
# Runbook: Release Orchestrator - Promotion Quota Exhausted
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Medium |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.quota-status` |
---
## Symptoms
- [ ] Promotions failing with "quota exceeded"
- [ ] Alert `OrchestratorQuotaExceeded` firing
- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
- [ ] New promotions being rejected
- [ ] Queued promotions not processing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New releases blocked until quota resets or increases |
| **Data integrity** | No data loss; promotions queued for later |
| **SLA impact** | Release frequency SLO may be violated |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.quota-status
```
2. **Check current quota usage:**
```bash
stella orch quota status
```
3. **Check quota limits:**
```bash
stella orch quota limits show
```
### Deep diagnosis
1. **Check promotion history:**
```bash
stella promotion list --last 24h --count
```
Look for: Unusual spike in promotions
2. **Check per-environment quotas:**
```bash
stella orch quota status --by-environment
```
3. **Check for runaway automation:**
```bash
stella promotion list --last 1h --by-actor
```
Problem if: Single actor/service making many promotions
4. **Check when quota resets:**
```bash
stella orch quota reset-time
```
---
## Resolution
### Immediate mitigation
1. **Request temporary quota increase:**
```bash
stella orch quota request-increase --amount 50 --reason "Release deadline"
```
2. **Prioritize critical promotions:**
```bash
stella promotion priority set <promotion-id> high
```
3. **Cancel unnecessary queued promotions:**
```bash
stella promotion list --status queued
stella promotion cancel <promotion-id>
```
### Root cause fix
**If legitimate high volume:**
1. Increase quota limits:
```bash
stella orch quota limits set --daily 200 --hourly 50
```
2. Increase per-environment limits:
```bash
stella orch quota limits set --env production --daily 50
```
**If runaway automation:**
1. Identify the source:
```bash
stella promotion list --last 1h --by-actor --verbose
```
2. Revoke or rate-limit the service account:
```bash
stella auth rate-limit set <service-account> --promotions-per-hour 10
```
3. Fix the automation bug
**If promotion retries causing spike:**
1. Check for failing promotions causing retries:
```bash
stella promotion list --status failed --last 24h
```
2. Fix underlying promotion failures (see other runbooks)
3. Configure retry limits:
```bash
stella orch config set promotion.max_retries 3
stella orch config set promotion.retry_backoff 5m
```
**If quota too restrictive for workload:**
1. Analyze actual promotion patterns:
```bash
stella orch quota analyze --last 30d
```
2. Adjust quotas based on analysis:
```bash
stella orch quota limits set --daily <recommended>
```
### Verification
```bash
# Check quota status
stella orch quota status
# Verify promotions processing
stella promotion list --status in_progress
# Test new promotion
stella promotion create --test --dry-run
# Check no quota errors
stella orch logs --filter "quota" --level error --last 30m
```
---
## Prevention
- [ ] **Monitoring:** Alert at 80% quota usage
- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
- [ ] **Review:** Regularly review and adjust quotas based on usage patterns
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`
- **Quota management:** `docs/operations/quota-management.md`

View File

@@ -0,0 +1,189 @@
# Runbook: Release Orchestrator - Rollback Operation Failed
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.rollback-health` |
---
## Symptoms
- [ ] Rollback operation failing or stuck
- [ ] Alert `OrchestratorRollbackFailed` firing
- [ ] Error: "rollback failed" or "cannot restore previous version"
- [ ] Target environment in inconsistent state
- [ ] Previous artifact not available for deployment
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Rollback blocked; potentially broken release in production |
| **Data integrity** | Environment may be in partial rollback state |
| **SLA impact** | Incident resolution blocked; extended outage |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.rollback-health
```
2. **Check rollback status:**
```bash
stella rollback status <rollback-id>
```
3. **Check previous deployment history:**
```bash
stella orch deployments list --env <env-name> --last 10
```
### Deep diagnosis
1. **Check why rollback failed:**
```bash
stella rollback trace <rollback-id> --verbose
```
Look for: Which step failed, error message
2. **Check previous artifact availability:**
```bash
stella orch artifacts get <previous-digest> --check
```
Problem if: Artifact deleted, not in registry
3. **Check environment state:**
```bash
stella orch env status <env-name> --detailed
```
4. **Check for deployment locks:**
```bash
stella orch locks list --env <env-name>
```
---
## Resolution
### Immediate mitigation
1. **Force release lock if stuck:**
```bash
stella orch locks release --env <env-name> --force
```
2. **Manual rollback using specific artifact:**
```bash
stella deploy --env <env-name> --artifact <previous-digest> --force
```
3. **If artifact unavailable, deploy last known good:**
```bash
stella orch deployments list --env <env-name> --status success
stella deploy --env <env-name> --artifact <last-good-digest>
```
### Root cause fix
**If previous artifact not in registry:**
1. Check artifact retention policy:
```bash
stella registry retention show
```
2. Restore from backup registry:
```bash
stella registry restore --artifact <digest> --from backup
```
3. Increase artifact retention:
```bash
stella registry retention set --min-versions 10
```
**If deployment service unavailable:**
1. Check deployment target connectivity:
```bash
stella orch connectivity --target <env-name>
```
2. Check deployment agent status:
```bash
stella orch agent status --env <env-name>
```
**If configuration drift:**
1. Check environment configuration:
```bash
stella orch env config diff <env-name>
```
2. Reset environment to known state:
```bash
stella orch env reset <env-name> --to-baseline
```
**If database state inconsistent:**
1. Check orchestrator database:
```bash
stella orch db verify
```
2. Repair deployment state:
```bash
stella orch repair --deployment <deployment-id>
```
### Verification
```bash
# Verify rollback completed
stella rollback status <rollback-id>
# Verify environment state
stella orch env status <env-name>
# Verify correct version deployed
stella orch deployments current --env <env-name>
# Health check the environment
stella orch health-check --env <env-name>
```
---
## Prevention
- [ ] **Retention:** Maintain at least 5 previous versions in registry
- [ ] **Testing:** Test rollback procedure in staging regularly
- [ ] **Monitoring:** Alert on rollback failures immediately
- [ ] **Documentation:** Document manual rollback procedures per environment
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
- **Rollback procedures:** `docs/operations/rollback-procedures.md`

View File

@@ -0,0 +1,189 @@
# Runbook: Policy Engine - Rego Compilation Errors
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.compilation-health` |
---
## Symptoms
- [ ] Policy deployment failing with "compilation error"
- [ ] Alert `PolicyCompilationFailed` firing
- [ ] Error: "rego_parse_error" or "rego_type_error"
- [ ] New policies not taking effect
- [ ] OPA rejecting policy bundle
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New policies cannot be deployed; using stale policies |
| **Data integrity** | Existing policies continue to work; new rules not enforced |
| **SLA impact** | Policy updates blocked; security posture may be outdated |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.compilation-health
```
2. **Check policy compilation status:**
```bash
stella policy status --compilation
```
3. **Validate specific policy:**
```bash
stella policy validate --file <policy-file>
```
### Deep diagnosis
1. **Get detailed compilation errors:**
```bash
stella policy compile --verbose
```
Look for: Line numbers, error types, undefined references
2. **Check for syntax errors:**
```bash
stella policy lint --file <policy-file>
```
3. **Check for type errors:**
```bash
stella policy typecheck --file <policy-file>
```
4. **Check OPA version compatibility:**
```bash
stella policy opa version
stella policy check-compat --file <policy-file>
```
---
## Resolution
### Immediate mitigation
1. **Rollback to last working policy:**
```bash
stella policy rollback --to-last-good
```
2. **Disable the failing policy:**
```bash
stella policy disable <policy-id>
stella policy reload
```
3. **Use previous bundle:**
```bash
stella policy bundle load --version <previous-version>
```
### Root cause fix
**If syntax error:**
1. Get exact error location:
```bash
stella policy validate --file <policy-file> --show-line
```
2. Common syntax issues:
- Missing brackets or braces
- Invalid rule head syntax
- Incorrect import statements
3. Fix and re-validate:
```bash
stella policy validate --file <fixed-policy.rego>
```
**If undefined reference:**
1. Check for missing imports:
```bash
stella policy analyze --file <policy-file> --show-imports
```
2. Verify data references exist:
```bash
stella policy data show
```
3. Add missing imports or data definitions
**If type error:**
1. Check type mismatches:
```bash
stella policy typecheck --file <policy-file> --verbose
```
2. Common type issues:
- Comparing incompatible types
- Invalid function arguments
- Missing type annotations
**If OPA version incompatibility:**
1. Check Rego version features used:
```bash
stella policy analyze --file <policy-file> --show-features
```
2. Update policy to use compatible features or upgrade OPA
### Verification
```bash
# Validate fixed policy
stella policy validate --file <fixed-policy.rego>
# Test policy compilation
stella policy compile --file <fixed-policy.rego>
# Deploy policy
stella policy deploy --file <fixed-policy.rego>
# Test policy evaluation
stella policy evaluate --test
```
---
## Prevention
- [ ] **CI/CD:** Add policy validation to CI pipeline before deployment
- [ ] **Linting:** Run `stella policy lint` on all policy changes
- [ ] **Testing:** Write unit tests for policies with `stella policy test`
- [ ] **Staging:** Deploy to staging environment before production
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-evaluation-slow.md`
- **Rego reference:** https://www.openpolicyagent.org/docs/latest/policy-language/
- **Policy testing:** `docs/modules/policy/testing.md`

View File

@@ -0,0 +1,174 @@
# Runbook: Policy Engine - Evaluation Latency High
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.evaluation-latency` |
---
## Symptoms
- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
- [ ] Gate decisions timing out in CI/CD pipelines
- [ ] Alert `PolicyEvaluationSlow` firing
- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
- [ ] Users report "policy check taking too long"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
| **Data integrity** | No data loss; decisions are still correct |
| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.evaluation-latency
```
2. **Check policy engine status:**
```bash
stella policy status
```
3. **Check recent evaluation times:**
```bash
stella policy stats --last 10m
```
Look for: P95 latency, cache hit rate
### Deep diagnosis
1. **Profile a slow evaluation:**
```bash
stella policy evaluate --image <image-ref> --profile
```
Look for: Which phase is slowest (parse, compile, execute)
2. **Check OPA compilation cache:**
```bash
stella policy cache stats
```
Problem if: Cache hit rate < 90%
3. **Check policy complexity:**
```bash
stella policy analyze --complexity
```
Problem if: Cyclomatic complexity > 50 or rule count > 200
4. **Check external data fetches:**
```bash
stella policy logs --filter "external fetch" --level debug
```
Problem if: Many external fetches or slow responses
---
## Resolution
### Immediate mitigation
1. **Clear and warm the compilation cache:**
```bash
stella policy cache clear
stella policy cache warm
```
2. **Increase OPA worker count:**
```bash
stella policy config set opa.workers 4
stella policy reload
```
3. **Enable evaluation result caching:**
```bash
stella policy config set cache.evaluation_ttl 60s
stella policy reload
```
### Root cause fix
**If policy is too complex:**
1. Analyze and simplify policy:
```bash
stella policy analyze --suggest-optimizations
```
2. Split large policies into modules:
```bash
stella policy refactor --auto-split
```
**If external data fetches are slow:**
1. Increase external data cache TTL:
```bash
stella policy config set external_data.cache_ttl 5m
```
2. Pre-fetch external data:
```bash
stella policy external-data prefetch
```
**If Rego compilation is slow:**
1. Enable partial evaluation:
```bash
stella policy config set opa.partial_eval true
```
2. Pre-compile policies:
```bash
stella policy compile --all
```
### Verification
```bash
# Run evaluation and check latency
stella policy evaluate --image <image-ref> --timing
# Check P95 latency
stella policy stats --last 5m
# Verify cache is effective
stella policy cache stats
```
---
## Prevention
- [ ] **Review:** Review policy complexity before deployment
- [ ] **Monitoring:** Alert on P95 latency > 300ms
- [ ] **Caching:** Ensure evaluation cache is enabled
- [ ] **Pre-warming:** Add cache warming to deployment pipeline
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
- **Dashboard:** Grafana > Stella Ops > Policy Engine

View File

@@ -0,0 +1,205 @@
# Runbook: Policy Engine - OPA Process Crashed
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.opa-health` |
---
## Symptoms
- [ ] Policy evaluations failing with "OPA unavailable" error
- [ ] Alert `PolicyOPACrashed` firing
- [ ] OPA process exited unexpectedly
- [ ] Error: "connection refused" when connecting to OPA
- [ ] Metric `policy_opa_restarts_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | All policy evaluations fail; gate decisions blocked |
| **Data integrity** | No data loss; decisions delayed until OPA recovers |
| **SLA impact** | Gate latency SLO violated; release pipeline blocked |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.opa-health
```
2. **Check OPA process status:**
```bash
stella policy status
```
Look for: OPA process state, restart count
3. **Check OPA logs for crash reason:**
```bash
stella policy opa logs --last 30m --level error
```
### Deep diagnosis
1. **Check OPA memory usage before crash:**
```bash
stella policy stats --opa-metrics
```
Problem if: Memory usage near limit before crash
2. **Check for problematic policy:**
```bash
stella policy list --last-error
```
Look for: Policies that caused evaluation errors
3. **Check OPA configuration:**
```bash
stella policy opa config show
```
Look for: Invalid configuration, missing bundles
4. **Check for infinite loops in Rego:**
```bash
stella policy analyze --detect-loops
```
---
## Resolution
### Immediate mitigation
1. **Restart OPA process:**
```bash
stella policy opa restart
```
2. **If OPA keeps crashing, start in safe mode:**
```bash
stella policy opa start --safe-mode
```
Note: Safe mode disables custom policies
3. **Enable failopen temporarily (if allowed by policy):**
```bash
stella policy config set failopen true
stella policy reload
```
**Warning:** Only use if compliance allows fail-open mode
### Root cause fix
**If OOM killed:**
1. Increase OPA memory limit:
```bash
stella policy opa config set memory_limit 2Gi
stella policy opa restart
```
2. Enable garbage collection tuning:
```bash
stella policy opa config set gc_min_heap_size 256Mi
stella policy opa config set gc_max_heap_size 1Gi
```
**If policy caused crash:**
1. Identify problematic policy:
```bash
stella policy list --status error
```
2. Disable the problematic policy:
```bash
stella policy disable <policy-id>
stella policy reload
```
3. Fix and re-enable:
```bash
stella policy validate --file <fixed-policy.rego>
stella policy update <policy-id> --file <fixed-policy.rego>
stella policy enable <policy-id>
```
**If bundle loading failed:**
1. Check bundle integrity:
```bash
stella policy bundle verify
```
2. Rebuild bundle:
```bash
stella policy bundle build --output bundle.tar.gz
stella policy bundle load bundle.tar.gz
```
**If configuration issue:**
1. Reset to default configuration:
```bash
stella policy opa config reset
```
2. Reconfigure with validated settings:
```bash
stella policy opa config set workers 4
stella policy opa config set decision_log true
stella policy opa restart
```
### Verification
```bash
# Check OPA is running
stella policy status
# Check OPA health
stella policy opa health
# Test policy evaluation
stella policy evaluate --test
# Check no crashes in recent logs
stella policy opa logs --level error --last 30m
# Monitor stability
stella policy stats --watch
```
---
## Prevention
- [ ] **Resources:** Set appropriate memory limits based on policy complexity
- [ ] **Validation:** Validate all policies before deployment
- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
- [ ] **Testing:** Load test policies before production deployment
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/

View File

@@ -0,0 +1,178 @@
# Runbook: Policy Engine - Policy Storage Backend Down
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.storage-health` |
---
## Symptoms
- [ ] Policy operations failing with "storage unavailable"
- [ ] Alert `PolicyStorageUnavailable` firing
- [ ] Error: "failed to connect to policy store" or "database connection refused"
- [ ] Policy updates not persisting
- [ ] OPA unable to load bundles from storage
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Policy updates fail; cached policies may still work |
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
| **SLA impact** | Policy management blocked; evaluations use cached data |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.storage-health
```
2. **Check storage connectivity:**
```bash
stella policy storage status
```
3. **Check database health:**
```bash
stella db status --component policy
```
### Deep diagnosis
1. **Check PostgreSQL connectivity:**
```bash
stella db ping --database policy
```
2. **Check connection pool status:**
```bash
stella db pool-status --database policy
```
Problem if: Pool exhausted, connections timing out
3. **Check storage logs:**
```bash
stella policy logs --filter "storage" --level error --last 30m
```
4. **Check disk space (if local storage):**
```bash
stella policy storage disk-usage
```
---
## Resolution
### Immediate mitigation
1. **Enable read-only mode (use cached policies):**
```bash
stella policy config set storage.read_only true
stella policy reload
```
2. **Switch to backup storage:**
```bash
stella policy storage failover --to backup
```
3. **Restart policy service to reconnect:**
```bash
stella service restart policy-engine
```
### Root cause fix
**If database connection issue:**
1. Check database status:
```bash
stella db status --database policy --verbose
```
2. Restart database connection pool:
```bash
stella db pool-restart --database policy
```
3. Check and increase connection limits:
```bash
stella db config set policy.max_connections 50
```
**If disk space exhausted:**
1. Check storage usage:
```bash
stella policy storage disk-usage --verbose
```
2. Clean old policy versions:
```bash
stella policy versions cleanup --older-than 30d
```
3. Increase storage capacity
**If storage corruption:**
1. Verify storage integrity:
```bash
stella policy storage verify
```
2. Restore from backup:
```bash
stella policy storage restore --from-backup latest
```
### Verification
```bash
# Check storage status
stella policy storage status
# Test write operation
stella policy storage test-write
# Test policy update
stella policy update --test
# Verify no errors
stella policy logs --filter "storage" --level error --last 30m
```
---
## Prevention
- [ ] **Monitoring:** Alert on storage connection failures immediately
- [ ] **Redundancy:** Configure backup storage for failover
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
- [ ] **Capacity:** Monitor disk usage and plan for growth
---
## Related Resources
- **Architecture:** `docs/modules/policy/storage.md`
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
- **Database setup:** `docs/operations/database-configuration.md`

View File

@@ -0,0 +1,195 @@
# Runbook: Policy Engine - Policy Version Conflicts
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.version-consistency` |
---
## Symptoms
- [ ] Policy evaluation returning unexpected results
- [ ] Alert `PolicyVersionMismatch` firing
- [ ] Error: "policy version conflict" or "bundle version mismatch"
- [ ] Different nodes evaluating with different policy versions
- [ ] Inconsistent gate decisions for same artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
| **Data integrity** | Decisions may not match expected policy behavior |
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.version-consistency
```
2. **Check policy version across nodes:**
```bash
stella policy version --all-nodes
```
3. **Check active policy version:**
```bash
stella policy active --show-version
```
### Deep diagnosis
1. **Compare versions across instances:**
```bash
stella policy version diff --all-instances
```
Problem if: Different versions on different nodes
2. **Check bundle distribution status:**
```bash
stella policy bundle status --all-nodes
```
3. **Check for failed deployments:**
```bash
stella policy deployments list --status failed --last 24h
```
4. **Check OPA bundle sync:**
```bash
stella policy opa bundle-status
```
---
## Resolution
### Immediate mitigation
1. **Force sync to latest version:**
```bash
stella policy sync --force --all-nodes
```
2. **Pin specific version:**
```bash
stella policy pin --version <version>
stella policy sync --all-nodes
```
3. **Restart policy engines to force reload:**
```bash
stella service restart policy-engine --all-nodes
```
### Root cause fix
**If bundle distribution failed:**
1. Check bundle storage:
```bash
stella policy bundle storage-status
```
2. Rebuild and redistribute bundle:
```bash
stella policy bundle build
stella policy bundle distribute --all-nodes
```
**If node out of sync:**
1. Check specific node status:
```bash
stella policy status --node <node-id>
```
2. Force node resync:
```bash
stella policy sync --node <node-id> --force
```
3. Verify node is receiving updates:
```bash
stella policy bundle check-subscription --node <node-id>
```
**If concurrent deployments caused conflict:**
1. Check deployment history:
```bash
stella policy deployments list --last 1h
```
2. Resolve to single version:
```bash
stella policy resolve-conflict --to-version <version>
```
3. Enable deployment locking:
```bash
stella policy config set deployment.locking true
```
**If OPA bundle polling issue:**
1. Check OPA bundle configuration:
```bash
stella policy opa config show | grep bundle
```
2. Decrease polling interval for faster sync:
```bash
stella policy opa config set bundle.polling.min_delay_seconds 10
stella policy opa config set bundle.polling.max_delay_seconds 30
```
### Verification
```bash
# Verify all nodes on same version
stella policy version --all-nodes
# Test consistent evaluation
stella policy evaluate --test --all-nodes
# Verify bundle status
stella policy bundle status --all-nodes
# Check no version warnings
stella policy logs --filter "version" --level warning --last 30m
```
---
## Prevention
- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
- [ ] **Monitoring:** Alert on version drift between nodes
- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
- [ ] **Testing:** Deploy to staging before production to catch issues
---
## Related Resources
- **Architecture:** `docs/modules/policy/versioning.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
- **Deployment guide:** `docs/operations/policy-deployment.md`

View File

@@ -0,0 +1,371 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-001 - PostgreSQL Operations Runbook
# PostgreSQL Database Runbook (dev-mock ready)
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check database connection
stella db ping
# Verify connection pool health
stella doctor --check check.postgres.connectivity,check.postgres.pool
# Check migration status
stella db migrations status
```
### Metrics to Watch
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
---
## Standard Procedures
### SP-001: Daily Health Check
**Frequency:** Daily or on-demand
**Duration:** ~5 minutes
1. Run comprehensive health check:
```bash
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
```
2. Review slow queries from last 24h:
```bash
stella db queries --slow --period 24h --limit 20
```
3. Check replication status (if applicable):
```bash
stella db replication status
```
4. Verify backup completion:
```bash
stella backup status --type database
```
### SP-002: Connection Pool Tuning
**When:** Pool exhaustion alerts or high wait times
1. Check current pool usage:
```bash
stella db pool stats --detailed
```
2. Identify connection-holding queries:
```bash
stella db queries --active --sort duration
```
3. Adjust pool size (if needed):
```bash
# Review current settings
stella config get Database:MaxPoolSize
# Increase pool size
stella config set Database:MaxPoolSize 150
# Restart affected services
stella service restart --service release-orchestrator
```
4. Verify improvement:
```bash
stella db pool watch --duration 5m
```
### SP-003: Backup and Restore
**Backup:**
```bash
# Create immediate backup
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
# Verify backup
stella backup verify --latest
```
**Restore:**
```bash
# List available backups
stella backup list --type database
# Restore to specific point (CAUTION: destructive)
stella backup restore --id <backup-id> --confirm
# Verify restoration
stella db ping
stella db migrations status
```
### SP-004: Migration Execution
1. Pre-migration backup:
```bash
stella backup create --type database --name "pre-migration"
```
2. Run migrations:
```bash
# Dry run first
stella db migrate --dry-run
# Apply migrations
stella db migrate
```
3. Verify migration success:
```bash
stella db migrations status
stella doctor --check check.postgres.migrations
```
---
## Incident Procedures
### INC-001: Connection Pool Exhaustion
**Symptoms:**
- Alert: `StellaPostgresPoolExhausted`
- Error logs: "connection pool exhausted, waiting for available connection"
- Increased request latency
**Investigation:**
```bash
# Check pool status
stella db pool stats
# Find long-running queries
stella db queries --active --sort duration --limit 10
# Check for connection leaks
stella db connections --by-client
```
**Resolution:**
1. **Immediate relief** - Terminate long-running queries:
```bash
# Identify stuck queries
stella db queries --active --duration ">5m"
# Terminate specific query (use with caution)
stella db query terminate --pid <pid>
```
2. **Scale pool** (if legitimate load):
```bash
stella config set Database:MaxPoolSize 200
stella service restart --graceful
```
3. **Fix leaks** (if application bug):
- Review application logs for unclosed connections
- Deploy fix to affected service
### INC-002: Slow Query Performance
**Symptoms:**
- Alert: `StellaPostgresQueryLatencyHigh`
- P99 query latency > 500ms
**Investigation:**
```bash
# Get slow query report
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
# Analyze specific query
stella db query explain --sql "SELECT ..." --analyze
# Check table statistics
stella db stats tables --sort bloat
```
**Resolution:**
1. **Index optimization:**
```bash
# Get index recommendations
stella db index suggest --table <table>
# Create recommended index
stella db index create --table <table> --columns "col1,col2"
```
2. **Vacuum/analyze:**
```bash
stella db vacuum --table <table>
stella db analyze --table <table>
```
3. **Query optimization** - Review and rewrite problematic queries
### INC-003: Database Connectivity Loss
**Symptoms:**
- Alert: `StellaPostgresConnectionFailed`
- All services reporting database connection errors
**Investigation:**
```bash
# Test basic connectivity
stella db ping
# Check DNS resolution
stella network dns-lookup <db-host>
# Check firewall/network
stella network test --host <db-host> --port 5432
```
**Resolution:**
1. **Network issue:**
- Verify security groups / firewall rules
- Check VPN/tunnel status if applicable
- Verify DNS resolution
2. **Database server issue:**
- Check PostgreSQL service status on server
- Review PostgreSQL logs
- Check disk space on database server
3. **Credential issue:**
```bash
stella db verify-credentials
stella secrets rotate --scope database
```
### INC-004: Disk Space Alert
**Symptoms:**
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
- Database write failures
**Investigation:**
```bash
# Check disk usage
stella db disk-usage
# Find large tables
stella db stats tables --sort size --limit 20
# Check for bloat
stella db stats tables --sort bloat
```
**Resolution:**
1. **Immediate cleanup:**
```bash
# Vacuum to reclaim space
stella db vacuum --full --table <large-table>
# Clean old data (if retention policy allows)
stella db prune --table evidence_artifacts --older-than 90d --dry-run
```
2. **Archive old data:**
```bash
stella db archive --table findings_history --older-than 180d
```
3. **Expand disk** (if legitimate growth):
- Follow cloud provider procedure to expand volume
- Resize filesystem
---
## Maintenance Windows
### Weekly Maintenance (Sunday 02:00 UTC)
1. Run vacuum analyze on all tables:
```bash
stella db vacuum --analyze --all-tables
```
2. Update table statistics:
```bash
stella db analyze --all-tables
```
3. Clean temporary files:
```bash
stella db cleanup --temp-files
```
### Monthly Maintenance (First Sunday 03:00 UTC)
1. Full vacuum on large tables:
```bash
stella db vacuum --full --table findings --table verdicts
```
2. Reindex if needed:
```bash
stella db reindex --concurrently --table findings
```
3. Archive old data per retention policy:
```bash
stella db archive --apply-retention
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
Key panels:
- Connection pool utilization
- Query latency percentiles
- Disk usage trend
- Replication lag (if applicable)
- Active queries count
---
## Evidence Capture
For any incident, capture:
```bash
# Comprehensive database state
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Connection stats
- Active queries
- Lock information
- Table statistics
- Recent slow query log
- Configuration snapshot
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, restart services
2. **L2 (Database team):** Query optimization, schema changes
3. **L3 (Vendor support):** Hardware/cloud platform issues
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,152 @@
# Runbook: Scanner - Out of Memory on Large Images
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |
---
## Symptoms
- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images blocked from release pipeline |
---
## Diagnosis
### Quick checks
1. **Identify the failing image:**
```bash
stella scanner jobs list --status failed --last 1h
```
2. **Check image size:**
```bash
stella image inspect <image-ref> --format json | jq '.size'
```
Problem if: Image size > 2GB or layer count > 100
3. **Check worker memory limit:**
```bash
stella scanner config get worker.memory_limit
```
### Deep diagnosis
1. **Profile memory usage during scan:**
```bash
stella scan image --image <image-ref> --profile-memory
```
2. **Check SBOM generation memory:**
```bash
stella scanner logs --filter "sbom" --level debug --last 30m
```
Look for: "memory allocation failed", "heap exhausted"
3. **Identify memory-heavy layers:**
```bash
stella image layers <image-ref> --sort-by size
```
---
## Resolution
### Immediate mitigation
1. **Increase worker memory limit:**
```bash
stella scanner config set worker.memory_limit 8Gi
stella scanner workers restart
```
2. **Enable streaming mode for large images:**
```bash
stella scanner config set sbom.streaming_threshold 1Gi
stella scanner workers restart
```
3. **Retry the failed scan:**
```bash
stella scan image --image <image-ref> --retry
```
### Root cause fix
**For consistently large images:**
1. Configure dedicated large-image worker pool:
```bash
stella scanner workers add --pool large-images --memory 16Gi --count 2
stella scanner config set routing.large_image_threshold 2Gi
stella scanner config set routing.large_image_pool large-images
```
**For images with many small files (node_modules, etc.):**
1. Enable incremental SBOM mode:
```bash
stella scanner config set sbom.incremental_mode true
```
**For base image reuse:**
1. Enable layer caching:
```bash
stella scanner config set cache.layer_dedup true
```
### Verification
```bash
# Retry the previously failing scan
stella scan image --image <image-ref>
# Monitor memory during scan
stella scanner workers stats --watch
# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```
---
## Prevention
- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
- [ ] **Routing:** Configure large-image pool for images > 2GB
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
- [ ] **Documentation:** Document image size limits in user guide
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory

View File

@@ -0,0 +1,195 @@
# Runbook: Scanner - Registry Authentication Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.registry-auth` |
---
## Symptoms
- [ ] Scans failing with "401 Unauthorized" or "403 Forbidden"
- [ ] Alert `ScannerRegistryAuthFailed` firing
- [ ] Error: "failed to authenticate with registry"
- [ ] Error: "failed to pull image manifest"
- [ ] Scans work for public images but fail for private images
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Cannot scan private images; release pipeline blocked |
| **Data integrity** | No data loss; authentication issue only |
| **SLA impact** | All scans for affected registry blocked |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.registry-auth
```
2. **List configured registries:**
```bash
stella registry list --show-status
```
Look for: Registries with "auth_failed" status
3. **Test registry authentication:**
```bash
stella registry test <registry-url>
```
### Deep diagnosis
1. **Check credential expiration:**
```bash
stella registry credentials show <registry-name>
```
Look for: Expiration date, token type
2. **Test with verbose output:**
```bash
stella registry test <registry-url> --verbose
```
Look for: Specific auth error message, HTTP status code
3. **Check registry logs:**
```bash
stella scanner logs --filter "registry auth" --last 30m
```
4. **Verify IAM/OIDC configuration (for cloud registries):**
```bash
stella registry iam-status <registry-name>
```
Problem if: IAM role not assumable, OIDC token expired
---
## Resolution
### Immediate mitigation
1. **Refresh credentials (for token-based auth):**
```bash
stella registry refresh-credentials <registry-name>
```
2. **Update static credentials:**
```bash
stella registry update-credentials <registry-name> \
--username <user> \
--password <token>
```
3. **For Docker Hub rate limiting:**
```bash
stella registry configure docker-hub \
--username <user> \
--access-token <token>
```
### Root cause fix
**If credentials expired:**
1. Generate new access token in registry (ECR, GCR, ACR, etc.)
2. Update credentials:
```bash
stella registry update-credentials <registry-name> --from-env
```
3. Configure automatic token refresh:
```bash
stella registry config set <registry-name>.auto_refresh true
stella registry config set <registry-name>.refresh_interval 11h
```
**If IAM role/policy changed (AWS ECR):**
1. Verify IAM role permissions:
```bash
stella registry iam verify <registry-name>
```
2. Update IAM role ARN if changed:
```bash
stella registry configure ecr \
--region <region> \
--role-arn <arn>
```
**If OIDC federation changed (GCP Artifact Registry):**
1. Verify service account:
```bash
stella registry oidc verify <registry-name>
```
2. Update workload identity configuration:
```bash
stella registry configure gcr \
--project <project> \
--workload-identity-provider <provider>
```
**If certificate changed (self-hosted registries):**
1. Update CA certificate:
```bash
stella registry configure <registry-name> \
--ca-cert /path/to/ca.crt
```
2. Or skip verification (not recommended for production):
```bash
stella registry configure <registry-name> \
--insecure-skip-verify
```
### Verification
```bash
# Test authentication
stella registry test <registry-url>
# Test scanning a private image
stella scan image --image <registry-url>/<image>:<tag> --dry-run
# Verify no auth failures in recent logs
stella scanner logs --filter "auth" --level error --last 30m
```
---
## Prevention
- [ ] **Credentials:** Use service accounts/workload identity instead of static tokens
- [ ] **Rotation:** Configure automatic token refresh before expiration
- [ ] **Monitoring:** Alert on authentication failure rate > 0
- [ ] **Documentation:** Document registry credential management procedures
---
## Related Resources
- **Architecture:** `docs/modules/scanner/registry-auth.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Registry setup:** `docs/operations/registry-configuration.md`

View File

@@ -0,0 +1,188 @@
# Runbook: Scanner - SBOM Generation Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.sbom-generation` |
---
## Symptoms
- [ ] Scans completing but SBOM generation failing
- [ ] Alert `ScannerSbomGenerationFailed` firing
- [ ] Error: "SBOM generation failed" or "unsupported package format"
- [ ] Partial SBOM with missing components
- [ ] Metric `scanner_sbom_generation_failures_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Incomplete vulnerability coverage; missing dependencies not scanned |
| **Data integrity** | Partial SBOM may miss vulnerabilities; attestations incomplete |
| **SLA impact** | SBOM completeness SLO violated (target: > 95%) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.sbom-generation
```
2. **Check failed SBOM jobs:**
```bash
stella scanner jobs list --status sbom_failed --last 1h
```
3. **Check SBOM completeness rate:**
```bash
stella scanner stats --sbom-metrics
```
### Deep diagnosis
1. **Analyze specific failure:**
```bash
stella scanner job details <job-id> --sbom-errors
```
Look for: Specific package manager or file type causing failure
2. **Check for unsupported ecosystems:**
```bash
stella sbom analyze --image <image-ref> --verbose
```
Look for: "unsupported", "unknown package format", "parsing failed"
3. **Check scanner plugin status:**
```bash
stella scanner plugins list --status
```
Problem if: Package manager plugin disabled or erroring
4. **Check for corrupted package files:**
```bash
stella image inspect <image-ref> --check-integrity
```
---
## Resolution
### Immediate mitigation
1. **Enable fallback SBOM generation:**
```bash
stella scanner config set sbom.fallback_mode true
stella scan image --image <image-ref> --sbom-fallback
```
2. **Use alternative SBOM generator:**
```bash
stella sbom generate --image <image-ref> --generator syft --output sbom.json
```
3. **Generate partial SBOM and continue:**
```bash
stella scan image --image <image-ref> --sbom-partial-ok
```
### Root cause fix
**If package manager not supported:**
1. Check supported package managers:
```bash
stella scanner plugins list --type package-manager
```
2. Enable additional plugins:
```bash
stella scanner plugins enable <plugin-name>
```
3. For custom package formats, add mapping:
```bash
stella scanner config set sbom.custom_mappings.<format> <handler>
```
**If package file corrupted:**
1. Identify corrupted files:
```bash
stella image layers <image-ref> --verify-packages
```
2. Report to image owner for fix
**If memory/resource issue during generation:**
1. Increase SBOM generator resources:
```bash
stella scanner config set sbom.memory_limit 4Gi
stella scanner config set sbom.timeout 10m
```
2. Enable streaming mode:
```bash
stella scanner config set sbom.streaming_mode true
```
**If plugin crashed:**
1. Check plugin logs:
```bash
stella scanner plugins logs <plugin-name> --last 30m
```
2. Restart plugin:
```bash
stella scanner plugins restart <plugin-name>
```
### Verification
```bash
# Retry SBOM generation
stella sbom generate --image <image-ref> --output sbom.json
# Validate SBOM completeness
stella sbom validate --file sbom.json --check-completeness
# Check component count
stella sbom stats --file sbom.json
# Full scan with SBOM
stella scan image --image <image-ref>
```
---
## Prevention
- [ ] **Plugins:** Keep all package manager plugins enabled and updated
- [ ] **Monitoring:** Alert on SBOM completeness < 90%
- [ ] **Fallback:** Configure fallback SBOM generator for resilience
- [ ] **Testing:** Test SBOM generation for new image types before production
---
## Related Resources
- **Architecture:** `docs/modules/scanner/sbom-generation.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **SBOM formats:** `docs/formats/sbom-spdx.md`, `docs/formats/sbom-cyclonedx.md`

View File

@@ -0,0 +1,174 @@
# Runbook: Scanner - Scan Timeout on Complex Images
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.timeout-rate` |
---
## Symptoms
- [ ] Scans failing with "timeout exceeded" error
- [ ] Alert `ScannerTimeoutExceeded` firing
- [ ] Metric `scanner_scan_timeout_total` increasing
- [ ] Specific images consistently timing out
- [ ] Error log: "scan operation exceeded timeout of X seconds"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Specific images cannot be scanned; pipeline blocked |
| **Data integrity** | No data loss; scans can be retried with adjusted settings |
| **SLA impact** | Release pipeline delayed for affected images |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.timeout-rate
```
2. **Identify failing images:**
```bash
stella scanner jobs list --status timeout --last 1h
```
Look for: Pattern in image types or sizes
3. **Check current timeout settings:**
```bash
stella scanner config get timeouts
```
### Deep diagnosis
1. **Analyze image complexity:**
```bash
stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
```
Problem if: > 50 layers, > 100k files, or > 5GB size
2. **Check scanner worker load:**
```bash
stella scanner workers stats
```
Problem if: All workers at capacity during timeouts
3. **Profile a scan:**
```bash
stella scan image --image <image-ref> --profile --verbose
```
Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
4. **Check for filesystem-heavy images:**
```bash
stella image layers <image-ref> --sort-by file-count
```
Problem if: Single layer with > 50k files (e.g., node_modules)
---
## Resolution
### Immediate mitigation
1. **Increase timeout for specific image:**
```bash
stella scan image --image <image-ref> --timeout 30m
```
2. **Increase global scan timeout:**
```bash
stella scanner config set timeouts.scan 20m
stella scanner workers restart
```
3. **Enable fast mode for initial scan:**
```bash
stella scan image --image <image-ref> --fast-mode
```
### Root cause fix
**If image is too complex:**
1. Enable incremental scanning:
```bash
stella scanner config set scan.incremental_mode true
```
2. Configure layer caching:
```bash
stella scanner config set cache.layer_dedup true
stella scanner config set cache.sbom_cache true
```
**If filesystem is too large:**
1. Enable streaming SBOM generation:
```bash
stella scanner config set sbom.streaming_threshold 500Gi
```
2. Configure file sampling for massive images:
```bash
stella scanner config set sbom.file_sample_max 100000
```
**If vulnerability matching is slow:**
1. Enable parallel matching:
```bash
stella scanner config set vuln.parallel_matching true
stella scanner config set vuln.match_workers 4
```
2. Optimize vulnerability database indexes:
```bash
stella db optimize --component scanner
```
### Verification
```bash
# Retry the previously failing scan
stella scan image --image <image-ref> --timeout 30m
# Monitor scan progress
stella scanner jobs watch <job-id>
# Verify no timeouts in recent scans
stella scanner jobs list --status timeout --last 1h
```
---
## Prevention
- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
- [ ] **Monitoring:** Alert on timeout rate > 5%
- [ ] **Caching:** Enable layer and SBOM caching for base images
- [ ] **Documentation:** Document image size/complexity limits in user guide
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Performance

View File

@@ -0,0 +1,174 @@
# Runbook: Scanner - Worker Not Processing Jobs
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.worker-health` |
---
## Symptoms
- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
- [ ] Scanner worker process shows 0% CPU usage
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
- [ ] UI shows "Scan in progress" indefinitely
- [ ] Metric `scanner_jobs_pending` increasing over time
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
---
## Diagnosis
### Quick checks (< 2 minutes)
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.worker-health
```
2. **Check scanner service status:**
```bash
stella scanner status
```
Expected: "Scanner workers: 4 active, 0 idle"
Problem: "Scanner workers: 0 active" or "status: degraded"
3. **Check job queue depth:**
```bash
stella scanner queue status
```
Expected: Queue depth < 50
Problem: Queue depth > 100 or growing rapidly
### Deep diagnosis
1. **Check worker process logs:**
```bash
stella scanner logs --tail 100 --level error
```
Look for: "timeout", "connection refused", "out of memory"
2. **Check Valkey connectivity (job queue):**
```bash
stella doctor --check check.storage.valkey
```
3. **Check if workers are OOM-killed:**
```bash
stella scanner workers inspect
```
Look for: "exit_code: 137" (OOM) or "exit_code: 143" (SIGTERM)
4. **Check resource utilization:**
```bash
stella obs metrics --filter scanner --last 10m
```
Look for: Memory > 90%, CPU sustained > 95%
---
## Resolution
### Immediate mitigation
1. **Restart scanner workers:**
```bash
stella scanner workers restart
```
This will: Terminate current workers and spawn fresh ones
2. **If restart fails, force restart the scanner service:**
```bash
stella service restart scanner
```
3. **Verify workers are processing:**
```bash
stella scanner queue status --watch
```
Queue depth should start decreasing
### Root cause fix
**If workers were OOM-killed:**
1. Increase worker memory limit:
```bash
stella scanner config set worker.memory_limit 4Gi
stella scanner workers restart
```
2. Reduce concurrent scans per worker:
```bash
stella scanner config set worker.concurrency 2
stella scanner workers restart
```
**If Valkey connection failed:**
1. Check Valkey health:
```bash
stella doctor --check check.storage.valkey
```
2. Restart Valkey if needed (see `valkey-connection-failure.md`)
**If workers are deadlocked:**
1. Enable deadlock detection:
```bash
stella scanner config set worker.deadlock_detection true
stella scanner workers restart
```
### Verification
```bash
# Verify workers are healthy
stella doctor --check check.scanner.worker-health
# Submit a test scan
stella scan image --image alpine:latest --dry-run
# Watch queue drain
stella scanner queue status --watch
# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
```
---
## Prevention
- [ ] **Alert:** Ensure `ScannerQueueBacklog` alert is configured with threshold < 100 jobs
- [ ] **Monitoring:** Add Grafana panel for worker memory usage
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
- **Dashboard:** Grafana > Stella Ops > Scanner Overview