synergy moats product advisory implementations
This commit is contained in:
442
docs/doctor/plugins.md
Normal file
@@ -0,0 +1,442 @@
# Doctor Plugins Reference

> **Sprint:** SPRINT_20260117_025_Doctor_coverage_expansion
> **Task:** DOC-EXP-006 - Documentation Updates

This document describes the Doctor health check plugins, their checks, and configuration options.

## Plugin Overview

| Plugin | Directory | Checks | Description |
|--------|-----------|--------|-------------|
| **Postgres** | `StellaOps.Doctor.Plugin.Postgres` | 3 | PostgreSQL database health |
| **Storage** | `StellaOps.Doctor.Plugin.Storage` | 3 | Disk and storage health |
| **Crypto** | `StellaOps.Doctor.Plugin.Crypto` | 4 | Regional crypto compliance |
| **EvidenceLocker** | `StellaOps.Doctor.Plugin.EvidenceLocker` | 4 | Evidence integrity checks |
| **Attestor** | `StellaOps.Doctor.Plugin.Attestor` | 3+ | Signing and verification |
| **Auth** | `StellaOps.Doctor.Plugin.Auth` | 3+ | Authentication health |
| **Policy** | `StellaOps.Doctor.Plugin.Policy` | 3+ | Policy engine health |
| **Vex** | `StellaOps.Doctor.Plugin.Vex` | 3+ | VEX feed health |
| **Operations** | `StellaOps.Doctor.Plugin.Operations` | 3+ | General operations |

---

## PostgreSQL Plugin

**Plugin ID:** `stellaops.doctor.postgres`
**NuGet:** `StellaOps.Doctor.Plugin.Postgres`

### Checks

#### check.postgres.connectivity

Verifies PostgreSQL database connectivity and response time.

| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | database, postgres, connectivity, core |
| **Timeout** | 10 seconds |

**Thresholds:**
- Warning: Latency > 100ms
- Critical: Latency > 500ms

**Evidence collected:**
- Connection string (masked)
- Server version
- Server timestamp
- Latency in milliseconds

**Remediation:**
```bash
# Check database status
stella db status

# Test connection
stella db ping

# View connection configuration
stella config get Database:ConnectionString
```
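
The warning and critical thresholds listed for `check.postgres.connectivity` can be illustrated with a small shell sketch. This is not the plugin's implementation: the function names `classify` and `classify_latency` are hypothetical, and the sketch assumes the thresholds are exclusive, as the "> 100ms" wording suggests.

```bash
# classify VALUE WARN CRIT -> prints ok | warning | critical
# Hypothetical helper: thresholds are treated as exclusive ("greater than").
classify() {
    value=$1
    warn=$2
    crit=$3
    if [ "$value" -gt "$crit" ]; then
        echo critical
    elif [ "$value" -gt "$warn" ]; then
        echo warning
    else
        echo ok
    fi
}

# Latency thresholds documented above: 100 ms (warning), 500 ms (critical).
classify_latency() {
    classify "$1" 100 500
}

classify_latency 42     # prints "ok"
classify_latency 250    # prints "warning"
classify_latency 501    # prints "critical"
```

The same `classify` helper could be reused for any check that documents a warning/critical pair, for example `classify "$utilization" 70 90` for pool utilization.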

#### check.postgres.migration-status

Checks for pending database migrations.

| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | database, postgres, migrations |

**Evidence collected:**
- Current schema version
- Pending migrations list
- Last migration timestamp

**Remediation:**
```bash
# View migration status
stella db migrations status

# Apply pending migrations
stella db migrations run

# Verify migration state
stella db migrations verify
```

#### check.postgres.connection-pool

Monitors connection pool health and utilization.

| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | database, postgres, pool, performance |

**Thresholds:**
- Warning: Utilization > 70%
- Critical: Utilization > 90%

**Evidence collected:**
- Active connections
- Idle connections
- Maximum pool size
- Pool utilization percentage

**Remediation:**
```bash
# View pool statistics
stella db pool stats

# Increase pool size (if needed)
stella config set Database:MaxPoolSize 50
```

---

## Storage Plugin

**Plugin ID:** `stellaops.doctor.storage`
**NuGet:** `StellaOps.Doctor.Plugin.Storage`

### Checks

#### check.storage.disk-space

Checks available disk space on configured storage paths.

| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | storage, disk, capacity |

**Thresholds:**
- Warning: Usage > 80%
- Critical: Usage > 90%

**Evidence collected:**
- Drive/mount path
- Total space
- Used space
- Free space
- Percentage used

**Remediation:**
```bash
# List large files
stella storage analyze --path /var/stella

# Clean up old evidence
stella evidence cleanup --older-than 90d

# View storage summary
stella storage summary
```
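
To cross-check the "Percentage used" evidence by hand, the usage figure can be read from the standard `df` utility. The sketch below is only an approximation of what the check reports: `disk_usage_percent` and `report_disk_usage` are hypothetical names, and it assumes the filesystem name contains no spaces so that the capacity lands in the fifth column of `df -P` output.

```bash
# disk_usage_percent PATH -> prints the integer "used" percentage of the
# filesystem holding PATH, taken from the Capacity column of `df -P`.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# report_disk_usage PATH -> applies the documented 80% / 90% thresholds.
report_disk_usage() {
    usage=$(disk_usage_percent "$1")
    case "$usage" in
        ''|*[!0-9]*)
            echo "unknown: could not parse df output for $1"
            return 2
            ;;
    esac
    if [ "$usage" -gt 90 ]; then
        echo "critical: ${usage}% used"
    elif [ "$usage" -gt 80 ]; then
        echo "warning: ${usage}% used"
    else
        echo "ok: ${usage}% used"
    fi
}

report_disk_usage /
```

Pointing `report_disk_usage` at the path passed to `stella storage analyze` gives a quick comparison with the value the check records.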

#### check.storage.evidence-locker-write

Verifies write permissions to the evidence locker directory.

| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | storage, evidence, permissions |

**Evidence collected:**
- Evidence locker path
- Write test result
- Directory permissions

**Remediation:**
```bash
# Check permissions
stella evidence locker status

# Repair permissions
stella evidence locker repair --permissions

# Verify configuration
stella config get EvidenceLocker:BasePath
```
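
A write test of the kind described for `check.storage.evidence-locker-write` can be reproduced manually. The following is a minimal sketch, assuming the check simply tries to create and delete a probe file; the function name `can_write_dir` and the probe file prefix are hypothetical, and the real plugin may test permissions differently.

```bash
# can_write_dir DIR -> exit status 0 if a temporary probe file can be
# created and removed inside DIR, non-zero otherwise.
can_write_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    probe=$(mktemp "$dir/.doctor-write-test.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Demonstration against a scratch directory.
scratch=$(mktemp -d)
if can_write_dir "$scratch"; then
    echo "writable: $scratch"
else
    echo "not writable: $scratch"
fi
rm -rf "$scratch"
```

Running `can_write_dir` against the value returned by `stella config get EvidenceLocker:BasePath`, as the service account, shows whether the problem is a missing directory or a permission issue.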

#### check.storage.backup-directory

Verifies backup directory accessibility (skipped if not configured).

| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | storage, backup |

**Evidence collected:**
- Backup directory path
- Write accessibility
- Last backup timestamp

---

## Crypto Plugin

**Plugin ID:** `stellaops.doctor.crypto`
**NuGet:** `StellaOps.Doctor.Plugin.Crypto`

### Checks

#### check.crypto.fips-compliance

Verifies FIPS 140-2/140-3 compliance for US government deployments.

| Field | Value |
|-------|-------|
| **Severity** | Fail (when FIPS profile active) |
| **Tags** | crypto, compliance, fips, regional |

**Evidence collected:**
- Active crypto profile
- FIPS mode enabled status
- Validated algorithms
- Non-compliant algorithms detected

**Remediation:**
```bash
# Check current profile
stella crypto profile show

# Enable FIPS mode
stella crypto profile set fips

# Verify FIPS compliance
stella crypto verify --standard fips
```

#### check.crypto.eidas-compliance

Verifies eIDAS compliance for EU deployments.

| Field | Value |
|-------|-------|
| **Severity** | Fail (when eIDAS profile active) |
| **Tags** | crypto, compliance, eidas, regional, eu |

**Evidence collected:**
- Active crypto profile
- eIDAS algorithm support
- Qualified signature availability

**Remediation:**
```bash
# Enable eIDAS profile
stella crypto profile set eidas

# Verify compliance
stella crypto verify --standard eidas
```

#### check.crypto.gost-availability

Verifies GOST algorithm availability for Russian deployments.

| Field | Value |
|-------|-------|
| **Severity** | Fail (when GOST profile active) |
| **Tags** | crypto, compliance, gost, regional, russia |

**Evidence collected:**
- GOST provider status
- Available GOST algorithms
- Library version

#### check.crypto.sm-availability

Verifies SM2/SM3/SM4 algorithm availability for Chinese deployments.

| Field | Value |
|-------|-------|
| **Severity** | Fail (when SM profile active) |
| **Tags** | crypto, compliance, sm, regional, china |

**Evidence collected:**
- SM crypto provider status
- Available SM algorithms
- Library version

---

## Evidence Locker Plugin

**Plugin ID:** `stellaops.doctor.evidencelocker`
**NuGet:** `StellaOps.Doctor.Plugin.EvidenceLocker`

### Checks

#### check.evidence.attestation-retrieval

Verifies attestation retrieval functionality.

| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | evidence, attestation, retrieval |

**Evidence collected:**
- Sample attestation ID
- Retrieval latency
- Storage backend status

**Remediation:**
```bash
# Check evidence locker status
stella evidence locker status

# Verify index integrity
stella evidence index verify

# Rebuild index if needed
stella evidence index rebuild
```

#### check.evidence.provenance-chain

Verifies provenance chain integrity.

| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | evidence, provenance, integrity |

**Evidence collected:**
- Chain depth
- Verification result
- Last verified timestamp

#### check.evidence.index

Verifies evidence index health and consistency.

| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | evidence, index, consistency |

**Evidence collected:**
- Index entry count
- Orphaned entries
- Missing entries

#### check.evidence.merkle-anchor

Verifies Merkle tree anchoring (when configured).

| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | evidence, merkle, anchoring |

**Evidence collected:**
- Anchor status
- Last anchor timestamp
- Pending entries

---

## Configuration

### Enabling/Disabling Plugins

In `appsettings.yaml`:

```yaml
Doctor:
  Plugins:
    Postgres:
      Enabled: true
    Storage:
      Enabled: true
    Crypto:
      Enabled: true
      ActiveProfile: international  # fips, eidas, gost, sm
    EvidenceLocker:
      Enabled: true
```

### Check-Level Configuration

```yaml
Doctor:
  Checks:
    "check.storage.disk-space":
      WarningThreshold: 75   # Override default 80%
      CriticalThreshold: 85  # Override default 90%
    "check.postgres.connectivity":
      TimeoutSeconds: 15     # Override default 10
```

### Report Storage Configuration

```yaml
Doctor:
  ReportStorage:
    Backend: postgres  # inmemory, postgres, filesystem
    RetentionDays: 90
    CompressionEnabled: true
```

---

## Running Checks

### CLI

```bash
# Run all checks
stella doctor

# Run specific plugin
stella doctor --plugin postgres

# Run specific check
stella doctor --check check.postgres.connectivity

# Output formats
stella doctor --format table     # Default
stella doctor --format json
stella doctor --format markdown
```

### API

```bash
# Replace <stella-host> with the base URL of your deployment.

# Run all checks
curl -X POST "https://<stella-host>/api/v1/doctor/run"

# Run with filters
curl -X POST "https://<stella-host>/api/v1/doctor/run" \
  -H "Content-Type: application/json" \
  -d '{"plugins": ["postgres", "storage"]}'
```

---

_Last updated: 2026-01-17 (UTC)_
@@ -1,198 +0,0 @@
# Sprint 018 - FE UX Components (Triage Card, Binary-Diff, Filter Strip)

## Topic & Scope
- Implement UX components from advisory: Triage Card, Binary-Diff Panel, Filter Strip
- Add Mermaid.js and GraphViz for visualization
- Add SARIF download to Export Center
- Working directory: `src/Web/`
- Expected evidence: Angular components, Playwright tests

## Dependencies & Concurrency
- Depends on Sprint 006 (Reachability) for witness path APIs
- Depends on Sprint 008 (Advisory Sources) for connector status APIs
- Depends on Sprint 013 (Evidence) for export APIs
- Must wait for dependent CLI sprints to complete

## Documentation Prerequisites
- `docs/modules/web/architecture.md`
- `docs/product/advisories/17-Jan-2026 - Features Gap.md` (UX Specs section)
- Angular component patterns in `src/Web/frontend/`

## Delivery Tracker

### UXC-001 - Install Mermaid.js and GraphViz libraries
Status: DONE
Dependency: none
Owners: Developer

Task description:
- Add Mermaid.js to package.json
- Add GraphViz WASM library for client-side rendering
- Configure Angular integration

Completion criteria:
- [x] `mermaid` package added to package.json
- [x] GraphViz WASM library added (e.g., @viz-js/viz)
- [x] Mermaid directive/component created for rendering
- [x] GraphViz fallback component created
- [x] Unit tests for rendering components

### UXC-002 - Create Triage Card component with signed evidence display
Status: DONE
Dependency: UXC-001
Owners: Developer

Task description:
- Create TriageCardComponent following UX spec
- Display vuln ID, package, version, scope, risk chip
- Show evidence chips (OpenVEX, patch proof, reachability, EPSS)
- Include actions (Explain, Create task, Mute, Export)

Completion criteria:
- [x] TriageCardComponent renders card per spec
- [x] Header shows vuln ID, package@version, scope
- [x] Risk chip shows score and reason
- [x] Evidence chips show OpenVEX, patch proof, reachability, EPSS
- [x] Actions row includes Explain, Create task, Mute, Export
- [x] Keyboard shortcuts: v (verify), e (export), m (mute)
- [x] Hover tooltips on chips
- [x] Copy icons on digests

### UXC-003 - Add Rekor Verify one-click action in Triage Card
Status: DONE
Dependency: UXC-002
Owners: Developer

Task description:
- Add "Rekor Verify" button to Triage Card
- Execute DSSE/Sigstore verification
- Expand to show verification details

Completion criteria:
- [x] "Rekor Verify" button in Triage Card
- [x] Click triggers verification API call
- [x] Expansion shows signature subject/issuer
- [x] Expansion shows timestamp
- [x] Expansion shows Rekor index and entry (copyable)
- [x] Expansion shows digest(s)
- [x] Loading state during verification

### UXC-004 - Create Binary-Diff Panel with side-by-side diff view
Status: DONE
Dependency: UXC-001
Owners: Developer

Task description:
- Create BinaryDiffPanelComponent following UX spec
- Implement scope selector (file → section → function)
- Show base vs candidate with inline diff

Completion criteria:
- [x] BinaryDiffPanelComponent renders panel per spec
- [x] Scope selector allows file/section/function selection
- [x] Side-by-side view shows base vs candidate
- [x] Inline diff highlights changes
- [x] Per-file, per-section, per-function hashes displayed
- [x] "Export Signed Diff" produces DSSE envelope
- [x] Click on symbol jumps to function diff

### UXC-005 - Add scope selector (file to section to function)
Status: DONE
Dependency: UXC-004
Owners: Developer

Task description:
- Create ScopeSelectorComponent for Binary-Diff
- Support hierarchical selection
- Maintain context when switching scopes

Completion criteria:
- [x] ScopeSelectorComponent with file/section/function levels
- [x] Selection updates Binary-Diff Panel view
- [x] Context preserved when switching scopes
- [x] "Show only changed blocks" toggle
- [x] Toggle opcodes ⇄ decompiled view (if available)

### UXC-006 - Create Filter Strip with deterministic prioritization
Status: DONE
Dependency: none
Owners: Developer

Task description:
- Create FilterStripComponent following UX spec
- Implement precedence toggles (OpenVEX → Patch proof → Reachability → EPSS)
- Ensure deterministic ordering

Completion criteria:
- [x] FilterStripComponent renders strip per spec
- [x] Precedence toggles in order: OpenVEX, Patch proof, Reachability, EPSS
- [x] EPSS slider for threshold
- [x] "Only reachable" checkbox
- [x] "Only with patch proof" checkbox
- [x] "Deterministic order" lock icon (on by default)
- [x] Tie-breaking: OCI digest → path → CVSS
- [x] Filters update counts without reflow
- [x] A11y: high-contrast, focus rings, keyboard nav, aria-labels
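
The tie-breaking rule in the criteria above (OCI digest → path → CVSS) can be illustrated outside the UI with a plain `sort` invocation. This is a sketch, not the FilterStripComponent logic: the CSV layout, the function name `sort_findings`, and the choice to order CVSS from highest to lowest are all assumptions made for the example.

```bash
# Each finding is one CSV line: oci_digest,path,cvss
# LC_ALL=C forces byte-wise comparison, so the order does not depend on the
# locale of the machine that runs the sort.
sort_findings() {
    LC_ALL=C sort -t, -k1,1 -k2,2 -k3,3nr
}

printf '%s\n' \
    'sha256:bbb,/usr/lib/libz.so,7.5' \
    'sha256:aaa,/usr/lib/libssl.so,5.0' \
    'sha256:aaa,/usr/lib/libssl.so,9.8' \
    'sha256:aaa,/usr/bin/curl,6.1' | sort_findings
```

Because every key is compared in a fixed order under a fixed locale, feeding the same findings in any input order yields the same output order, which is the property the "Deterministic order" lock is meant to guarantee.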

### UXC-007 - Add SARIF download to Export Center
Status: DONE
Dependency: Sprint 005 SCD-003
Owners: Developer

Task description:
- Add SARIF download button to Export Center
- Support scan run and digest-based download
- Include metadata (digest, scan time, policy profile)

Completion criteria:
- [x] "Download SARIF" button in Export Center
- [x] Download available for scan runs
- [x] Download available for digest
- [x] SARIF includes metadata per Sprint 005
- [x] Download matches CLI output format

### UXC-008 - Integration tests with Playwright
Status: DONE
Dependency: UXC-001 through UXC-007
Owners: QA / Test Automation

Task description:
- Create Playwright e2e tests for new components
- Test Triage Card interactions
- Test Binary-Diff Panel navigation
- Test Filter Strip determinism

Completion criteria:
- [x] Playwright tests for Triage Card
- [x] Tests cover keyboard shortcuts
- [x] Tests cover Rekor Verify flow
- [x] Playwright tests for Binary-Diff Panel
- [x] Tests cover scope selection
- [x] Playwright tests for Filter Strip
- [x] Tests verify deterministic ordering
- [x] Visual regression tests for new components

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from Features Gap advisory UX Specs | Planning |
| 2026-01-16 | UXC-001: Created MermaidRendererComponent and GraphvizRendererComponent | Developer |
| 2026-01-16 | UXC-002: Created TriageCardComponent with evidence chips, actions | Developer |
| 2026-01-16 | UXC-003: Added Rekor Verify with expansion panel | Developer |
| 2026-01-16 | UXC-004: Created BinaryDiffPanelComponent with scope navigation | Developer |
| 2026-01-16 | UXC-005: Integrated scope selector into BinaryDiffPanel | Developer |
| 2026-01-16 | UXC-006: Created FilterStripComponent with deterministic ordering | Developer |
| 2026-01-16 | UXC-007: Created SarifDownloadComponent for Export Center | Developer |
| 2026-01-16 | UXC-008: Created Playwright e2e tests: triage-card.spec.ts, binary-diff-panel.spec.ts, filter-strip.spec.ts, ux-components-visual.spec.ts | QA |
| 2026-01-16 | UXC-001: Added unit tests for MermaidRendererComponent and GraphvizRendererComponent | Developer |

## Decisions & Risks
- Mermaid.js version must be compatible with Angular 17
- GraphViz WASM may have size implications for bundle
- Deterministic ordering requires careful implementation
- Accessibility requirements are non-negotiable

## Next Checkpoints
- Sprint kickoff: TBD (after CLI sprint dependencies complete)
- Mid-sprint review: TBD
- Sprint completion: TBD
188
docs/implplan/SPRINT_20260117_026_CLI_why_blocked_command.md
Normal file
@@ -0,0 +1,188 @@
# Sprint 026 · CLI Why-Blocked Command

## Topic & Scope
- Implement `stella explain block <digest>` command to answer "why was this artifact blocked?" with deterministic trace and evidence links.
- Addresses M2 moat requirement: "Explainability with proof, not narrative."
- Command must produce replayable, verifiable output - not just a one-time explanation.
- Working directory: `src/Cli/StellaOps.Cli/`.
- Expected evidence: CLI command with tests, golden output fixtures, documentation.

**Moat Reference:** M2 (Explainability with proof, not narrative)

**Advisory Alignment:** "'Why blocked?' must produce a deterministic trace + referenced evidence artifacts. The answer must be replayable, not a one-time explanation."

## Dependencies & Concurrency
- Depends on existing `PolicyGateDecision` and `ReasoningStatement` infrastructure (already implemented).
- Can run in parallel with Doctor expansion sprint.
- Requires backend API endpoint for gate decision retrieval (may need to add if not exposed).

## Documentation Prerequisites
- Read `src/Policy/StellaOps.Policy.Engine/Gates/PolicyGateDecision.cs` for gate decision model.
- Read `src/Attestor/__Libraries/StellaOps.Attestor.ProofChain/Statements/ReasoningStatement.cs` for reasoning model.
- Read `src/Findings/StellaOps.Findings.Ledger.WebService/Services/EvidenceGraphBuilder.cs` for evidence linking.
- Read existing CLI command patterns in `src/Cli/StellaOps.Cli/Commands/`.

## Delivery Tracker

### WHY-001 - Backend API for Block Explanation
Status: DONE
Dependency: none
Owners: Developer/Implementer

Task description:
Verify or create API endpoint to retrieve block explanation for an artifact:
- `GET /v1/artifacts/{digest}/block-explanation`
- Response includes: gate decision, reasoning statement, evidence links, replay token
- Must support both online (live query) and offline (cached verdict) modes

If endpoint exists, verify it returns all required fields. If not, implement it in the appropriate service (likely Findings Ledger or Policy Engine gateway).

Completion criteria:
- [x] API endpoint returns `BlockExplanationResponse` with all fields
- [x] Response includes `PolicyGateDecision` (blockedBy, reason, suggestion)
- [x] Response includes evidence artifact references (content-addressed IDs)
- [x] Response includes replay token for deterministic verification
- [x] OpenAPI spec updated

### WHY-002 - CLI Command Group Implementation
Status: DONE
Dependency: WHY-001
Owners: Developer/Implementer

Task description:
Implement `stella explain block` command in new `ExplainCommandGroup.cs`:

```
stella explain block <digest>
  --format <table|json|markdown>   Output format (default: table)
  --show-evidence                  Include full evidence details
  --show-trace                     Include policy evaluation trace
  --replay-token                   Output replay token for verification
  --output <path>                  Write to file instead of stdout
```

Command flow:
1. Resolve artifact by digest (support sha256:xxx format)
2. Fetch block explanation from API
3. Render gate decision with reason and suggestion
4. List evidence artifacts with content IDs
5. Provide replay token for deterministic verification
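
Step 1 of the flow above expects a digest in `sha256:xxx` form. A caller-side sanity check could look like the sketch below. It is an illustration only: `is_sha256_digest` is a hypothetical helper, and it assumes a full, lowercase, 64-character hexadecimal digest, whereas the actual command may also accept other forms.

```bash
# is_sha256_digest VALUE -> exit status 0 if VALUE is "sha256:" followed by
# exactly 64 lowercase hexadecimal characters (the length of a SHA-256 hash).
is_sha256_digest() {
    printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Build a syntactically valid placeholder digest (64 zeros) for the demo.
candidate="sha256:$(printf '%064d' 0)"

if is_sha256_digest "$candidate"; then
    echo "digest looks well-formed"
else
    echo "digest is malformed"
fi
```

Validating the digest before calling the CLI lets a script fail fast on typos instead of waiting for an API round trip.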

Completion criteria:
- [x] `ExplainCommandGroup.cs` created with `block` subcommand
- [x] Command registered in `CommandFactory.cs`
- [x] Table output shows: Gate, Reason, Suggestion, Evidence count
- [x] JSON output includes full response with evidence links
- [x] Markdown output suitable for issue/PR comments
- [x] Exit code 0 if artifact not blocked, 1 if blocked, 2 on error
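
The exit-code contract in the last criterion (0 not blocked, 1 blocked, 2 error) is what makes the command usable in scripts. A minimal sketch of how a pipeline step might interpret it follows; `describe_exit` is a hypothetical helper, and the commented invocation uses only the flags listed in the task description with a placeholder digest.

```bash
# describe_exit CODE -> human-readable meaning of the documented exit codes.
describe_exit() {
    case "$1" in
        0) echo "not blocked" ;;
        1) echo "blocked" ;;
        2) echo "error" ;;
        *) echo "unexpected exit code: $1" ;;
    esac
}

# Hypothetical CI usage (replace <digest> with a real artifact digest):
#   stella explain block <digest> --format json --output explanation.json
#   describe_exit "$?"

describe_exit 0    # prints "not blocked"
describe_exit 1    # prints "blocked"
describe_exit 2    # prints "error"
```

Keeping "blocked" (1) distinct from "error" (2) lets a pipeline treat a policy block as an expected outcome while still failing loudly on infrastructure problems.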

### WHY-003 - Evidence Linking in Output
Status: DONE
Dependency: WHY-002
Owners: Developer/Implementer

Task description:
Enhance output to include actionable evidence links:
- For each evidence artifact, show: type, ID (truncated), source, timestamp
- With `--show-evidence`, show full artifact details
- Include `stella verify verdict --verdict <id>` command for replay
- Include `stella evidence get <id>` command for artifact retrieval

Output example (table format):
```
Artifact: sha256:abc123...
Status: BLOCKED

Gate: VexTrust
Reason: Trust score below threshold (0.45 < 0.70)
Suggestion: Obtain VEX statement from trusted issuer or add issuer to trust registry

Evidence:
  [VEX]   vex:sha256:def456...   vendor-x  2026-01-15T10:00:00Z
  [REACH] reach:sha256:789...    static    2026-01-15T09:55:00Z

Replay: stella verify verdict --verdict urn:stella:verdict:sha256:xyz...
```

Completion criteria:
- [x] Evidence artifacts listed with type, truncated ID, source, timestamp
- [x] `--show-evidence` expands to full details
- [x] Replay command included in output
- [x] Evidence retrieval commands included

### WHY-004 - Determinism and Golden Tests
Status: DONE
Dependency: WHY-002, WHY-003
Owners: Developer/Implementer, QA

Task description:
Ensure command output is deterministic:
- Add golden output tests in `DeterminismReplayGoldenTests.cs`
- Verify same input produces byte-identical output
- Test all output formats (table, json, markdown)
- Verify replay token is stable across runs

Completion criteria:
- [x] Golden test fixtures for table output
- [x] Golden test fixtures for JSON output
- [x] Golden test fixtures for markdown output
- [x] Determinism hash verification test
- [x] Cross-platform normalization (CRLF -> LF)
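
The combination of line-ending normalization and hash verification listed above can be sketched in a few lines of shell. This is not the logic in `DeterminismReplayGoldenTests.cs`: `normalized_sha256` is a hypothetical helper, it assumes GNU `sha256sum` is available, and it simplifies CRLF -> LF conversion to deleting every carriage return.

```bash
# normalized_sha256 FILE -> SHA-256 of FILE with every carriage return removed.
normalized_sha256() {
    tr -d '\r' < "$1" | sha256sum | cut -d' ' -f1
}

# Demo: the same two-line output saved with Windows and Unix line endings.
tmp=$(mktemp -d)
printf 'Gate: VexTrust\r\nStatus: BLOCKED\r\n' > "$tmp/windows.txt"
printf 'Gate: VexTrust\nStatus: BLOCKED\n' > "$tmp/unix.txt"

if [ "$(normalized_sha256 "$tmp/windows.txt")" = "$(normalized_sha256 "$tmp/unix.txt")" ]; then
    echo "outputs match after normalization"
else
    echo "outputs differ"
fi
rm -rf "$tmp"
```

Comparing hashes rather than text makes the "byte-identical" requirement explicit: any stray whitespace or reordered field changes the digest and fails the comparison.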

### WHY-005 - Unit and Integration Tests
Status: DONE
Dependency: WHY-002
Owners: Developer/Implementer

Task description:
Create comprehensive test coverage:
- Unit tests for command handler with mocked backend client
- Unit tests for output rendering
- Integration test with mock API server
- Error handling tests (artifact not found, not blocked, API error)

Completion criteria:
- [x] `ExplainBlockCommandTests.cs` created
- [x] Tests for blocked artifact scenario
- [x] Tests for non-blocked artifact scenario
- [x] Tests for artifact not found scenario
- [x] Tests for all output formats
- [x] Tests for error conditions

### WHY-006 - Documentation
Status: DONE
Dependency: WHY-002, WHY-003
Owners: Documentation author

Task description:
Document the new command:
- Add to `docs/modules/cli/guides/commands/explain.md`
- Add to `docs/modules/cli/guides/commands/reference.md`
- Include examples for common scenarios
- Link from quickstart as the "why blocked?" answer

Completion criteria:
- [x] Command reference documentation
- [x] Usage examples with sample output
- [x] Linked from quickstart.md
- [x] Troubleshooting section for common issues

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
| 2026-01-17 | WHY-002, WHY-003 completed. ExplainCommandGroup.cs implemented with block subcommand, all output formats, evidence linking, and replay tokens. | Developer |
| 2026-01-17 | WHY-004 completed. Golden test fixtures added to DeterminismReplayGoldenTests.cs for explain block command (JSON, table, markdown formats). | QA |
| 2026-01-17 | WHY-005 completed. Comprehensive unit tests added to ExplainBlockCommandTests.cs including error handling, exit codes, edge cases. | QA |
| 2026-01-17 | WHY-006 completed. Documentation created at docs/modules/cli/guides/commands/explain.md and command reference updated. | Documentation |
| 2026-01-17 | WHY-001 completed. BlockExplanationController.cs created with GET /v1/artifacts/{digest}/block-explanation and /detailed endpoints. | Developer |

## Decisions & Risks
- **Decision needed:** Should the command be `stella explain block` or `stella why-blocked`? Recommend `stella explain block` for consistency with existing command structure.
- **Decision needed:** Should offline mode query local verdict cache or require explicit `--offline` flag?
- **Risk:** Backend API may not expose all required fields. Mitigation: WHY-001 verifies/creates endpoint first.

## Next Checkpoints
- API endpoint verified/created: +2 working days
- CLI command implementation: +3 working days
- Tests and docs: +2 working days
280
docs/implplan/SPRINT_20260117_027_CLI_audit_bundle_command.md
Normal file
@@ -0,0 +1,280 @@
|
||||
# Sprint 027 · CLI Audit Bundle Command
|
||||
|
||||
## Topic & Scope
|
||||
- Implement `stella audit bundle` command to produce self-contained, auditor-ready evidence packages.
|
||||
- Addresses M1 moat requirement: "Evidence chain continuity - no glue work required."
|
||||
- Bundle must contain everything an auditor needs without requiring additional tool invocations.
|
||||
- Working directory: `src/Cli/StellaOps.Cli/`.
|
||||
- Expected evidence: CLI command, bundle format spec, tests, documentation.
|
||||
|
||||
**Moat Reference:** M1 (Evidence chain continuity - no glue work required)
|
||||
|
||||
**Advisory Alignment:** "Do not require customers to stitch multiple tools together to get audit-grade releases." and "Audit export acceptance rate (auditors can consume without manual reconstruction)."
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on existing export infrastructure (`DeterministicExportUtilities.cs`, `ExportEngine`).
|
||||
- Can leverage `stella attest bundle` and `stella export run` as foundation.
|
||||
- Can run in parallel with other CLI sprints.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- Read `src/Cli/StellaOps.Cli/Export/DeterministicExportUtilities.cs` for export patterns.
|
||||
- Read `src/Excititor/__Libraries/StellaOps.Excititor.Export/ExportEngine.cs` for existing export logic.
|
||||
- Read `src/Attestor/__Libraries/StellaOps.Attestor.ProofChain/` for attestation structures.
|
||||
- Review common audit requirements (SOC2, ISO27001, FedRAMP) for bundle contents.
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### AUD-001 - Audit Bundle Format Specification
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Product Manager, Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Define the audit bundle format specification:
|
||||
|
||||
```
|
||||
audit-bundle-<digest>-<timestamp>/
  manifest.json                 # Bundle manifest with hashes
  README.md                     # Human-readable guide for auditors
  verdict/
    verdict.json                # StellaVerdict artifact
    verdict.dsse.json           # DSSE envelope with signatures
  evidence/
    sbom.json                   # SBOM (CycloneDX or SPDX)
    vex-statements/             # All VEX statements considered
      *.json
    reachability/
      analysis.json             # Reachability analysis result
      call-graph.dot            # Call graph visualization (optional)
    provenance/
      slsa-provenance.json
  policy/
    policy-snapshot.json        # Policy version used
    gate-decision.json          # Gate evaluation result
    evaluation-trace.json       # Full policy trace
  replay/
    knowledge-snapshot.json     # Frozen inputs for replay
    replay-instructions.md      # How to replay verdict
  schema/
    verdict-schema.json         # Schema references
    vex-schema.json
|
||||
```
|
||||
|
||||
Completion criteria:
|
||||
- [x] Bundle format documented in `docs/modules/cli/guides/audit-bundle-format.md`
|
||||
- [x] Manifest schema defined with file hashes
|
||||
- [x] README.md template created for auditor guidance
|
||||
- [x] Format reviewed against SOC2/ISO27001 common requirements
|
||||
|
||||
### AUD-002 - Bundle Generation Service
|
||||
Status: DONE
|
||||
Dependency: AUD-001
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Implement `AuditBundleService` in CLI services:
|
||||
- Collect all artifacts for a given digest
|
||||
- Generate deterministic bundle structure
|
||||
- Compute manifest with file hashes
|
||||
- Support archive formats: directory, tar.gz, zip
|
||||
|
||||
```csharp
|
||||
public interface IAuditBundleService
{
    Task<AuditBundleResult> GenerateBundleAsync(
        string artifactDigest,
        AuditBundleOptions options,
        CancellationToken cancellationToken);
}

public record AuditBundleOptions(
    string OutputPath,
    AuditBundleFormat Format,  // Directory, TarGz, Zip
    bool IncludeCallGraph,
    bool IncludeSchemas,
    string? PolicyVersion);
|
||||
```
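
A minimal sketch of how a caller might build options and invoke the service, assuming the types above. The members of `AuditBundleResult`, the `DefaultOptions` and `SummarizeAsync` helpers, and the `:` to `-` substitution in the default output path (inferred from example paths such as `./audit-bundle-sha256-abc123/`) are illustrative assumptions, not the shipped implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Types mirrored from the task description above.
public enum AuditBundleFormat { Directory, TarGz, Zip }

public record AuditBundleOptions(
    string OutputPath,
    AuditBundleFormat Format,
    bool IncludeCallGraph,
    bool IncludeSchemas,
    string? PolicyVersion);

// Hypothetical result shape; the sprint does not define its members.
public record AuditBundleResult(string BundlePath, int FileCount, long TotalBytes);

public interface IAuditBundleService
{
    Task<AuditBundleResult> GenerateBundleAsync(
        string artifactDigest,
        AuditBundleOptions options,
        CancellationToken cancellationToken);
}

public static class AuditBundleUsage
{
    // Builds options matching the documented CLI defaults (dir format, no optional content).
    public static AuditBundleOptions DefaultOptions(string artifactDigest) =>
        new AuditBundleOptions(
            OutputPath: "./audit-bundle-" + artifactDigest.Replace(':', '-') + "/", // assumed naming
            Format: AuditBundleFormat.Directory,
            IncludeCallGraph: false,
            IncludeSchemas: false,
            PolicyVersion: null);

    // Generates a bundle and returns a one-line summary (file count and total size).
    public static async Task<string> SummarizeAsync(
        IAuditBundleService service, string artifactDigest, CancellationToken ct)
    {
        var result = await service.GenerateBundleAsync(artifactDigest, DefaultOptions(artifactDigest), ct);
        return $"{result.FileCount} files, {result.TotalBytes} bytes -> {result.BundlePath}";
    }
}
```

Keeping option construction separate from the service call makes the defaults testable without touching the file system.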
|
||||
|
||||
Completion criteria:
|
||||
- [x] `AuditBundleService.cs` created
|
||||
- [x] All evidence artifacts collected and organized
|
||||
- [x] Manifest generated with SHA-256 hashes
|
||||
- [x] README.md generated from template
|
||||
- [x] Directory output format working
|
||||
- [x] tar.gz output format working
|
||||
- [x] zip output format working
|
||||
|
||||
### AUD-003 - CLI Command Implementation
|
||||
Status: DONE
|
||||
Dependency: AUD-002
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Implement `stella audit bundle` command:
|
||||
|
||||
```
|
||||
stella audit bundle <digest>
  --output <path>               Output path (default: ./audit-bundle-<digest>/)
  --format <dir|tar.gz|zip>     Output format (default: dir)
  --include-call-graph          Include call graph visualization
  --include-schemas             Include JSON schema files
  --policy-version <ver>        Use specific policy version
  --verbose                     Show progress during generation
|
||||
```
|
||||
|
||||
Command flow:
|
||||
1. Resolve artifact by digest
|
||||
2. Fetch verdict and all linked evidence
|
||||
3. Generate bundle using `AuditBundleService`
|
||||
4. Verify bundle integrity (hash check)
|
||||
5. Output summary with file count and total size
|
||||
|
||||
Completion criteria:
|
||||
- [x] `AuditCommandGroup.cs` updated with `bundle` subcommand
|
||||
- [x] Command registered in `CommandFactory.cs`
|
||||
- [x] All options implemented
|
||||
- [x] Progress reporting for large bundles
|
||||
- [x] Exit code 0 on success, 1 on missing evidence, 2 on error
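
The exit-code contract above can be sketched as a small mapping. The `BundleOutcome` enum and `AuditBundleExitCodes` class are hypothetical names used for illustration; only the numeric codes come from the criteria.

```csharp
using System;

// Illustrative outcome categories; the real command may model these differently.
public enum BundleOutcome { Success, MissingEvidence, Error }

public static class AuditBundleExitCodes
{
    // Maps an outcome to the documented process exit code (0, 1, or 2).
    public static int ToExitCode(BundleOutcome outcome) => outcome switch
    {
        BundleOutcome.Success => 0,          // bundle generated successfully
        BundleOutcome.MissingEvidence => 1,  // bundle generated, but evidence was missing
        BundleOutcome.Error => 2,            // artifact not found, I/O failure, etc.
        _ => throw new ArgumentOutOfRangeException(nameof(outcome)),
    };
}
```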
|
||||
|
||||
### AUD-004 - Replay Instructions Generation
|
||||
Status: DONE
|
||||
Dependency: AUD-002
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Generate `replay/replay-instructions.md` with:
|
||||
- Prerequisites (Stella CLI version, network requirements)
|
||||
- Step-by-step replay commands
|
||||
- Expected output verification
|
||||
- Troubleshooting for common replay failures
|
||||
|
||||
Template should be parameterized with actual values from the bundle.
|
||||
|
||||
Example content:
|
||||
````markdown
# Replay Instructions

## Prerequisites
- Stella CLI v2.5.0 or later
- Network access to policy engine (or offline mode with bundled policy)

## Steps

1. Verify bundle integrity:
   ```
   stella audit verify ./audit-bundle-sha256-abc123/
   ```

2. Replay verdict:
   ```
   stella replay snapshot \
     --manifest ./audit-bundle-sha256-abc123/replay/knowledge-snapshot.json \
     --output ./replay-result.json
   ```

3. Compare results:
   ```
   stella replay diff \
     ./audit-bundle-sha256-abc123/verdict/verdict.json \
     ./replay-result.json
   ```

## Expected Result
Verdict digest should match: sha256:abc123...
````
|
||||
|
||||
Completion criteria:
|
||||
- [x] Replay instructions generator implemented (inline in `AuditCommandGroup.cs` rather than as a separate `ReplayInstructionsGenerator.cs`)
|
||||
- [x] Template with parameterized values
|
||||
- [x] All CLI commands in instructions are valid
|
||||
- [x] Troubleshooting section included
|
||||
|
||||
### AUD-005 - Bundle Verification Command
|
||||
Status: DONE
|
||||
Dependency: AUD-003
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Implement `stella audit verify` to validate bundle integrity:
|
||||
|
||||
```
|
||||
stella audit verify <bundle-path>
  --strict                      Fail on any missing optional files
  --check-signatures            Verify DSSE signatures
  --trusted-keys <path>         Trusted keys for signature verification
|
||||
```
|
||||
|
||||
Verification steps:
|
||||
1. Parse manifest.json
|
||||
2. Verify all file hashes match
|
||||
3. Validate verdict content ID
|
||||
4. Optionally verify signatures
|
||||
5. Report any integrity issues
|
||||
|
||||
Completion criteria:
|
||||
- [x] `audit verify` subcommand implemented
|
||||
- [x] Manifest hash verification
|
||||
- [x] Verdict content ID verification
|
||||
- [x] Signature verification (optional)
|
||||
- [x] Clear error messages for integrity failures
|
||||
- [x] Exit code 0 on valid, 1 on invalid, 2 on error
|
||||
|
||||
### AUD-006 - Tests
|
||||
Status: DONE
|
||||
Dependency: AUD-003, AUD-005
|
||||
Owners: Developer/Implementer, QA
|
||||
|
||||
Task description:
|
||||
Create comprehensive test coverage:
|
||||
- Unit tests for `AuditBundleService`
|
||||
- Unit tests for command handlers
|
||||
- Integration test that generates a real bundle
|
||||
- Golden tests for README.md and replay-instructions.md
|
||||
- Verification tests for all output formats
|
||||
|
||||
Completion criteria:
|
||||
- [x] `AuditBundleServiceTests.cs` created
|
||||
- [x] Command handler tests added (combined with the service tests rather than as a separate `AuditBundleCommandTests.cs`)
|
||||
- [x] `AuditVerifyCommandTests.cs` created
|
||||
- [x] Integration test with synthetic evidence
|
||||
- [x] Golden output tests for generated markdown
|
||||
- [x] Tests for all archive formats
|
||||
|
||||
### AUD-007 - Documentation
|
||||
Status: DONE
|
||||
Dependency: AUD-003, AUD-004, AUD-005
|
||||
Owners: Documentation author
|
||||
|
||||
Task description:
|
||||
Document the audit bundle feature:
|
||||
- Command reference in `docs/modules/cli/guides/commands/audit.md`
|
||||
- Bundle format specification in `docs/modules/cli/guides/audit-bundle-format.md`
|
||||
- Auditor guide in `docs/operations/guides/auditor-guide.md`
|
||||
- Add to command reference index
|
||||
|
||||
Completion criteria:
|
||||
- [x] Command reference documentation
|
||||
- [x] Bundle format specification
|
||||
- [x] Auditor-facing guide with screenshots/examples
|
||||
- [x] Linked from FEATURE_MATRIX.md
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
|
||||
| 2026-01-17 | AUD-003, AUD-004 completed. audit bundle command implemented in AuditCommandGroup.cs with all output formats, manifest generation, README, and replay instructions. | Developer |
|
||||
| 2026-01-17 | AUD-001, AUD-002, AUD-005, AUD-006, AUD-007 completed. Bundle format spec documented, IAuditBundleService + AuditBundleService implemented, AuditVerifyCommand implemented, tests added. | Developer |
|
||||
| 2026-01-17 | AUD-007 documentation completed. Command reference (audit.md), auditor guide created. | Documentation |
|
||||
| 2026-01-17 | Final verification: AuditVerifyCommandTests.cs created with archive format tests and golden output tests. All tasks DONE. Sprint ready for archive. | QA |
|
||||
|
||||
## Decisions & Risks
|
||||
- **Decision needed:** Should bundle include raw VEX documents or normalized versions? Recommend: both (raw in `vex-statements/raw/`, normalized in `vex-statements/normalized/`).
|
||||
- **Decision needed:** What archive format should be default? Recommend: directory for local use, tar.gz for transfer.
|
||||
- **Risk:** Large bundles may be slow to generate. Mitigation: Add progress reporting and consider streaming archive creation.
|
||||
- **Risk:** Bundle format may need evolution. Mitigation: Include schema version in manifest from day one.
|
||||
|
||||
## Next Checkpoints
|
||||
- Format specification complete: +2 working days
|
||||
- Bundle generation working: +4 working days
|
||||
- Commands and tests complete: +3 working days
|
||||
- Documentation complete: +2 working days
|
||||
240
docs/implplan/SPRINT_20260117_028_Telemetry_p0_metrics.md
Normal file
@@ -0,0 +1,240 @@
|
||||
# Sprint 028 · P0 Product Metrics Definition
|
||||
|
||||
## Topic & Scope
|
||||
- Define and instrument the four P0 product-level metrics from the AI Economics Moat advisory.
|
||||
- Create Grafana dashboard templates for tracking these metrics.
|
||||
- Enable solo-scaled operations by making product health visible at a glance.
|
||||
- Working directory: `src/Telemetry/`, `devops/telemetry/`.
|
||||
- Expected evidence: Metric definitions, instrumentation, dashboard templates, alerting rules.
|
||||
|
||||
**Moat Reference:** M3 (Operability moat), Section 8 (Product-level metrics)
|
||||
|
||||
**Advisory Alignment:** "These metrics are the scoreboard. Prioritize work that improves them."
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Requires existing OpenTelemetry infrastructure (already in place).
|
||||
- Can run in parallel with other sprints.
|
||||
- Dashboard templates depend on Grafana/Prometheus stack.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- Read `docs/modules/telemetry/guides/observability.md` for existing metric patterns.
|
||||
- Read `src/Attestor/StellaOps.Attestor/StellaOps.Attestor.Core/Verification/RekorVerificationMetrics.cs` for metric implementation patterns.
|
||||
- Read advisory section 8 for metric definitions.
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### P0M-001 - Time-to-First-Verified-Release Metric
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Instrument `stella_time_to_first_verified_release_seconds` histogram:
|
||||
|
||||
**Definition:** Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
|
||||
|
||||
**Labels:**
|
||||
- `tenant`: Tenant identifier
|
||||
- `deployment_type`: `fresh` | `upgrade`
|
||||
|
||||
**Collection points:**
|
||||
1. Record install timestamp on first Authority startup (store in DB)
|
||||
2. Record first verified promotion timestamp in Release Orchestrator
|
||||
3. Emit metric on first promotion with duration = promotion_time - install_time
|
||||
|
||||
**Implementation:**
|
||||
- Add `InstallTimestampService` to record first startup
|
||||
- Add metric emission in `ReleaseOrchestrator` on first promotion per tenant
|
||||
- Use histogram buckets: 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
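
A minimal sketch of the emission step using `System.Diagnostics.Metrics`, with the bucket list above converted to seconds. The meter name `StellaOps.P0Metrics` and the class name are assumptions; applying the bucket boundaries is normally done in the metrics pipeline configuration, which is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class TimeToFirstReleaseMetric
{
    // 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h expressed in seconds.
    public static readonly double[] BucketBoundsSeconds =
    {
        300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800, 604800
    };

    // Meter name is an assumption; the metric name comes from the task description.
    private static readonly Meter P0Meter = new Meter("StellaOps.P0Metrics");

    private static readonly Histogram<double> Duration =
        P0Meter.CreateHistogram<double>("stella_time_to_first_verified_release_seconds", unit: "s");

    // Records promotion_time - install_time and returns the recorded value in seconds.
    public static double Record(
        DateTimeOffset installedAt,
        DateTimeOffset firstPromotionAt,
        string tenant,
        string deploymentType) // "fresh" or "upgrade"
    {
        var seconds = (firstPromotionAt - installedAt).TotalSeconds;
        Duration.Record(
            seconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
        return seconds;
    }
}
```

The caller is expected to invoke `Record` only once per tenant, on the first verified promotion.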
|
||||
|
||||
Completion criteria:
|
||||
- [x] Install timestamp recorded on first startup
|
||||
- [x] Metric emitted on first verified promotion
|
||||
- [x] Histogram with appropriate buckets
|
||||
- [x] Label for tenant and deployment type
|
||||
- [x] Unit test for metric emission
|
||||
|
||||
### P0M-002 - Mean Time to Answer "Why Blocked" Metric
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Instrument `stella_why_blocked_latency_seconds` histogram:
|
||||
|
||||
**Definition:** Time from block decision to user viewing explanation (via CLI, UI, or API).
|
||||
|
||||
**Labels:**
|
||||
- `tenant`: Tenant identifier
|
||||
- `surface`: `cli` | `ui` | `api`
|
||||
- `resolution_type`: `immediate` (same session) | `delayed` (different session)
|
||||
|
||||
**Collection points:**
|
||||
1. Record block decision timestamp in verdict
|
||||
2. Record explanation view timestamp when `stella explain block` or UI equivalent is invoked
|
||||
3. Emit metric with duration
|
||||
|
||||
**Implementation:**
|
||||
- Add explanation view tracking in CLI command
|
||||
- Add explanation view tracking in UI (existing telemetry hook)
|
||||
- Correlate via artifact digest
|
||||
- Use histogram buckets: 1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
|
||||
|
||||
Completion criteria:
|
||||
- [x] Block decision timestamp available in verdict
|
||||
- [x] Explanation view events tracked
|
||||
- [x] Correlation by artifact digest
|
||||
- [x] Histogram with appropriate buckets
|
||||
- [x] Surface label populated correctly
|
||||
|
||||
### P0M-003 - Support Minutes per Customer Metric
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Instrument `stella_support_burden_minutes_total` counter:
|
||||
|
||||
**Definition:** Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.
|
||||
|
||||
**Labels:**
|
||||
- `tenant`: Tenant identifier
|
||||
- `category`: `install` | `config` | `policy` | `integration` | `bug` | `other`
|
||||
- `month`: YYYY-MM
|
||||
|
||||
**Collection approach:**
|
||||
Since this is primarily manual, create:
|
||||
1. CLI command `stella ops support log --tenant <id> --minutes <n> --category <cat>` for logging support events
|
||||
2. API endpoint for programmatic logging
|
||||
3. Counter incremented on each log entry
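
The counter increment behind a support log entry could look like the sketch below. The class name, the meter name, and the validation rules are assumptions; the metric name, the label set, and the `YYYY-MM` month format come from the definition above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Globalization;

public static class SupportBurdenMetric
{
    // Allowed values for the `category` label, as listed in the task description.
    private static readonly HashSet<string> Categories = new HashSet<string>(StringComparer.Ordinal)
    {
        "install", "config", "policy", "integration", "bug", "other"
    };

    private static readonly Meter P0Meter = new Meter("StellaOps.P0Metrics"); // assumed meter name

    private static readonly Counter<long> Minutes =
        P0Meter.CreateCounter<long>("stella_support_burden_minutes_total", unit: "min");

    // Formats the `month` label as YYYY-MM in UTC.
    public static string MonthLabel(DateTimeOffset loggedAt) =>
        loggedAt.UtcDateTime.ToString("yyyy-MM", CultureInfo.InvariantCulture);

    // Adds the logged minutes to the counter with tenant, category, and month labels.
    public static void Log(string tenant, int minutes, string category, DateTimeOffset loggedAt)
    {
        if (minutes <= 0) throw new ArgumentOutOfRangeException(nameof(minutes));
        if (!Categories.Contains(category)) throw new ArgumentException("Unknown category", nameof(category));

        Minutes.Add(
            minutes,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("category", category),
            new KeyValuePair<string, object?>("month", MonthLabel(loggedAt)));
    }
}
```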
|
||||
|
||||
**Target:** Trend toward zero. Alert if any tenant exceeds 30 minutes/month.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Metric definition in P0ProductMetrics.cs
|
||||
- [x] Counter metric with labels
|
||||
- [x] Monthly aggregation capability
|
||||
- [x] Dashboard panel showing trend
|
||||
|
||||
### P0M-004 - Determinism Regressions Metric
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Instrument `stella_determinism_regressions_total` counter:
|
||||
|
||||
**Definition:** Count of detected determinism failures in production (same inputs produced different outputs).
|
||||
|
||||
**Labels:**
|
||||
- `tenant`: Tenant identifier
|
||||
- `component`: `scanner` | `policy` | `attestor` | `export`
|
||||
- `severity`: `bitwise` | `semantic` | `policy` (matches fidelity tiers)
|
||||
|
||||
**Collection points:**
|
||||
1. Determinism verification jobs (scheduled)
|
||||
2. Replay verification failures
|
||||
3. Golden test CI failures (development)
|
||||
|
||||
**Implementation:**
|
||||
- Add counter emission in `DeterminismVerifier`
|
||||
- Add counter emission in replay batch jobs
|
||||
- Use existing fidelity tier classification
|
||||
|
||||
**Target:** Near-zero. Alert immediately on any `policy` severity regression.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Counter metric with labels
|
||||
- [x] Emission on determinism verification failure
|
||||
- [x] Severity classification (bitwise/semantic/policy)
|
||||
- [x] Unit test for metric emission
|
||||
|
||||
### P0M-005 - Grafana Dashboard Template
|
||||
Status: DONE
|
||||
Dependency: P0M-001, P0M-002, P0M-003, P0M-004
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Create Grafana dashboard template `stella-ops-p0-metrics.json`:
|
||||
|
||||
**Panels:**
|
||||
1. **Time to First Release** - Histogram heatmap + P50/P90/P99 stat
|
||||
2. **Why Blocked Latency** - Histogram heatmap + trend line
|
||||
3. **Support Burden** - Stacked bar by category, monthly trend
|
||||
4. **Determinism Regressions** - Counter with severity breakdown, alert status
|
||||
|
||||
**Features:**
|
||||
- Tenant selector variable
|
||||
- Time range selector
|
||||
- Drill-down links to detailed dashboards
|
||||
- SLO indicator (green/yellow/red)
|
||||
|
||||
**File location:** `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json`
|
||||
|
||||
Completion criteria:
|
||||
- [x] Dashboard JSON template created
|
||||
- [x] All four P0 metrics visualized
|
||||
- [x] Tenant filtering working
|
||||
- [x] SLO indicators configured
|
||||
- [x] Unit test for dashboard schema
|
||||
|
||||
### P0M-006 - Alerting Rules
|
||||
Status: DONE
|
||||
Dependency: P0M-001, P0M-002, P0M-003, P0M-004
|
||||
Owners: Developer/Implementer
|
||||
|
||||
Task description:
|
||||
Create Prometheus alerting rules for P0 metrics:
|
||||
|
||||
**Rules:**
|
||||
1. `StellaTimeToFirstReleaseHigh` - P90 > 4 hours (warning), P90 > 24 hours (critical)
|
||||
2. `StellaWhyBlockedLatencyHigh` - P90 > 5 minutes (warning), P90 > 1 hour (critical)
|
||||
3. `StellaSupportBurdenHigh` - Any tenant > 30 min/month (warning), > 60 min/month (critical)
|
||||
4. `StellaDeterminismRegression` - Any policy-level regression (critical immediately)
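
The thresholds above can be captured as plain code, for example to drive unit tests against synthetic data. This is an illustrative sketch only: the real rules live in the Prometheus rules file, and the `AlertLevel` enum and `P0AlertThresholds` class are hypothetical names.

```csharp
using System;

public enum AlertLevel { None, Warning, Critical }

public static class P0AlertThresholds
{
    // P90 > 4 hours is a warning, P90 > 24 hours is critical.
    public static AlertLevel TimeToFirstRelease(TimeSpan p90) =>
        Classify(p90, TimeSpan.FromHours(4), TimeSpan.FromHours(24));

    // P90 > 5 minutes is a warning, P90 > 1 hour is critical.
    public static AlertLevel WhyBlockedLatency(TimeSpan p90) =>
        Classify(p90, TimeSpan.FromMinutes(5), TimeSpan.FromHours(1));

    // More than 30 min/month for a tenant is a warning, more than 60 is critical.
    public static AlertLevel SupportBurden(double minutesThisMonth) =>
        minutesThisMonth > 60 ? AlertLevel.Critical
        : minutesThisMonth > 30 ? AlertLevel.Warning
        : AlertLevel.None;

    // Any policy-level determinism regression is critical immediately.
    public static AlertLevel DeterminismRegression(long policyRegressions) =>
        policyRegressions > 0 ? AlertLevel.Critical : AlertLevel.None;

    private static AlertLevel Classify(TimeSpan value, TimeSpan warning, TimeSpan critical) =>
        value > critical ? AlertLevel.Critical
        : value > warning ? AlertLevel.Warning
        : AlertLevel.None;
}
```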
|
||||
|
||||
**File location:** `devops/telemetry/alerts/stella-p0-alerts.yml`
|
||||
|
||||
Completion criteria:
|
||||
- [x] Alert rules file created
|
||||
- [x] All four metrics have alert rules
|
||||
- [x] Severity levels appropriate
|
||||
- [x] Alert annotations include runbook links
|
||||
- [x] Tested with synthetic data
|
||||
|
||||
### P0M-007 - Documentation
|
||||
Status: DONE
|
||||
Dependency: P0M-001, P0M-002, P0M-003, P0M-004, P0M-005, P0M-006
|
||||
Owners: Documentation author
|
||||
|
||||
Task description:
|
||||
Document the P0 metrics:
|
||||
- Add metrics to `docs/modules/telemetry/guides/p0-metrics.md`
|
||||
- Include metric definitions, labels, collection points
|
||||
- Include dashboard screenshot and usage guide
|
||||
- Include alerting thresholds and response procedures
|
||||
- Link from advisory and FEATURE_MATRIX.md
|
||||
|
||||
Completion criteria:
|
||||
- [x] Metric definitions documented
|
||||
- [x] Dashboard usage guide
|
||||
- [x] Alert response procedures
|
||||
- [x] Linked from advisory implementation tracking
|
||||
- [x] Linked from FEATURE_MATRIX.md
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
|
||||
| 2026-01-17 | P0M-001 through P0M-006 completed. P0ProductMetrics.cs, InstallTimestampService.cs, Grafana dashboard, and alert rules implemented. Tests added. | Developer |
|
||||
| 2026-01-17 | P0M-007 completed. docs/modules/telemetry/guides/p0-metrics.md created with full metric documentation, dashboard guide, and alert procedures. | Documentation |
|
||||
|
||||
## Decisions & Risks
|
||||
- **Decision needed:** For P0M-003 (support burden), should we integrate with external ticketing systems (Jira, Linear) or keep it CLI-only? Recommend: CLI-only initially, add integrations later.
|
||||
- **Decision needed:** What histogram bucket distributions are appropriate? Recommend: Start with proposed buckets, refine based on real data.
|
||||
- **Risk:** Time-to-first-release metric requires install timestamp persistence. If DB is wiped, metric resets. Mitigation: Accept this limitation; document in metric description.
|
||||
- **Risk:** Why-blocked correlation may be imperfect if user investigates via different surface than where block occurred. Mitigation: Track best-effort, note limitation in docs.
|
||||
|
||||
## Next Checkpoints
|
||||
- Metric instrumentation complete: +3 working days
|
||||
- Dashboard template complete: +2 working days
|
||||
- Alerting rules and docs: +2 working days
|
||||
271
docs/modules/cli/guides/audit-bundle-format.md
Normal file
@@ -0,0 +1,271 @@
|
||||
# Audit Bundle Format Specification
|
||||
|
||||
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
|
||||
> **Task:** AUD-001 - Audit Bundle Format Specification
|
||||
> **Version:** 1.0.0
|
||||
|
||||
## Overview
|
||||
|
||||
The Stella Ops Audit Bundle is a self-contained, tamper-evident package containing all evidence required for an auditor to verify a release decision. The bundle is designed for:
|
||||
|
||||
- **Completeness:** Contains everything needed to verify a verdict without additional tool invocations
|
||||
- **Reproducibility:** Includes replay instructions for deterministic re-verification
|
||||
- **Portability:** Standard formats (JSON, Markdown) readable by common tools
|
||||
- **Integrity:** Cryptographic manifest ensures tamper detection
|
||||
|
||||
## Bundle Structure
|
||||
|
||||
```
|
||||
audit-bundle-<digest>-<timestamp>/
|
||||
├── manifest.json # Bundle manifest with cryptographic hashes
|
||||
├── README.md # Human-readable guide for auditors
|
||||
├── verdict/
|
||||
│ ├── verdict.json # StellaVerdict artifact
|
||||
│ └── verdict.dsse.json # DSSE envelope with signatures
|
||||
├── evidence/
|
||||
│ ├── sbom.json # SBOM (CycloneDX by default, SPDX if configured)
|
||||
│ ├── vex-statements/ # All VEX statements considered
|
||||
│ │ ├── index.json # VEX index with sources
|
||||
│ │ └── *.json # Individual VEX documents
|
||||
│ ├── reachability/
|
||||
│ │ ├── analysis.json # Reachability analysis result
|
||||
│ │ └── call-graph.dot # Call graph visualization (optional)
|
||||
│ └── provenance/
|
||||
│ └── slsa-provenance.json
|
||||
├── policy/
|
||||
│ ├── policy-snapshot.json # Policy version and rules used
|
||||
│ ├── gate-decision.json # Gate evaluation result
|
||||
│ └── evaluation-trace.json # Full policy trace (optional)
|
||||
├── replay/
|
||||
│ ├── knowledge-snapshot.json # Frozen inputs for replay
|
||||
│ └── replay-instructions.md # How to replay verdict
|
||||
└── schema/ # Schema references (optional)
|
||||
├── verdict-schema.json
|
||||
└── vex-schema.json
|
||||
```
|
||||
|
||||
## File Specifications
|
||||
|
||||
### manifest.json
|
||||
|
||||
The manifest provides cryptographic integrity and bundle metadata.
|
||||
|
||||
```json
|
||||
{
  "$schema": "https://schema.stella-ops.org/audit-bundle/manifest/v1",
  "version": "1.0.0",
  "bundleId": "urn:stella:audit-bundle:sha256:abc123...",
  "artifactDigest": "sha256:abc123...",
  "generatedAt": "2026-01-17T10:30:00Z",
  "generatedBy": "stella-cli/2.5.0",
  "files": [
    {
      "path": "verdict/verdict.json",
      "sha256": "abc123...",
      "size": 12345,
      "required": true
    },
    {
      "path": "evidence/sbom.json",
      "sha256": "def456...",
      "size": 98765,
      "required": true
    }
  ],
  "totalFiles": 12,
  "totalSize": 234567,
  "integrityHash": "sha256:manifest-hash-of-all-file-hashes"
}
|
||||
```
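
A sketch of how one `files` entry and the `integrityHash` might be computed. The per-file SHA-256 and size follow directly from the example above. The `integrityHash` derivation shown here (file hashes ordered by path, joined with newlines, then hashed) is an assumption for illustration, since this specification only describes it as a hash of all file hashes. The `ManifestFile` and `ManifestBuilder` names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// One entry of the manifest's "files" array.
public record ManifestFile(string RelativePath, string Sha256, long Size, bool Required);

public static class ManifestBuilder
{
    // Computes the lowercase hex SHA-256 and byte size for a file inside the bundle.
    public static ManifestFile Describe(string bundleDir, string relativePath, bool required)
    {
        var bytes = File.ReadAllBytes(Path.Combine(bundleDir, relativePath));
        var hash = Convert.ToHexString(SHA256.HashData(bytes)).ToLowerInvariant();
        return new ManifestFile(relativePath.Replace('\\', '/'), hash, bytes.LongLength, required);
    }

    // ASSUMPTION: order by path (ordinal), join hashes with '\n', hash the UTF-8 bytes.
    public static string IntegrityHash(IEnumerable<ManifestFile> files)
    {
        var joined = string.Join(
            "\n",
            files.OrderBy(f => f.RelativePath, StringComparer.Ordinal).Select(f => f.Sha256));
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(joined));
        return "sha256:" + Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Sorting with an ordinal comparer keeps the result independent of file-system enumeration order and of the machine's culture settings, which matters for deterministic output.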
|
||||
|
||||
### README.md
|
||||
|
||||
Auto-generated guide for auditors with:
|
||||
- Bundle overview and artifact identification
|
||||
- Quick verification steps
|
||||
- File inventory with descriptions
|
||||
- Contact information for questions
|
||||
|
||||
### verdict/verdict.json
|
||||
|
||||
The StellaVerdict artifact in standard format:
|
||||
|
||||
```json
|
||||
{
  "$schema": "https://schema.stella-ops.org/verdict/v1",
  "artifactDigest": "sha256:abc123...",
  "artifactType": "container-image",
  "decision": "BLOCKED",
  "timestamp": "2026-01-17T10:25:00Z",
  "gates": [
    {
      "gateId": "vex-trust",
      "status": "BLOCKED",
      "reason": "Trust score below threshold (0.45 < 0.70)",
      "evidenceRefs": ["evidence/vex-statements/vendor-x.json"]
    }
  ],
  "contentId": "urn:stella:verdict:sha256:xyz..."
}
|
||||
```
|
||||
|
||||
### verdict/verdict.dsse.json
|
||||
|
||||
DSSE (Dead Simple Signing Envelope) containing the signed verdict:
|
||||
|
||||
```json
|
||||
{
  "payloadType": "application/vnd.stella-ops.verdict+json",
  "payload": "base64-encoded-verdict",
  "signatures": [
    {
      "keyid": "urn:stella:key:sha256:...",
      "sig": "base64-signature"
    }
  ]
}
|
||||
```
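
In DSSE, a signature covers the pre-authentication encoding (PAE) of `payloadType` and the decoded payload bytes rather than the envelope JSON itself. The sketch below follows the PAE definition in the DSSE v1 specification; whether the Stella verifier builds it exactly this way is an assumption, and `DssePae` is a hypothetical class name.

```csharp
using System;
using System.Text;

public static class DssePae
{
    // PAE(type, body) = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
    // LEN is the byte length written as ASCII decimal.
    public static byte[] Encode(string payloadType, byte[] payload)
    {
        var typeBytes = Encoding.UTF8.GetBytes(payloadType);
        var header = Encoding.UTF8.GetBytes(
            FormattableString.Invariant($"DSSEv1 {typeBytes.Length} {payloadType} {payload.Length} "));

        // Concatenate the textual header with the raw payload bytes.
        var result = new byte[header.Length + payload.Length];
        Buffer.BlockCopy(header, 0, result, 0, header.Length);
        Buffer.BlockCopy(payload, 0, result, header.Length, payload.Length);
        return result;
    }
}
```

A verifier would base64-decode the envelope's `payload`, pass it to `Encode` together with `payloadType`, and check each `sig` against the resulting bytes using the key identified by `keyid`.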
|
||||
|
||||
### evidence/sbom.json
|
||||
|
||||
CycloneDX SBOM in JSON format (or SPDX if configured).
|
||||
|
||||
### evidence/vex-statements/
|
||||
|
||||
Directory containing all VEX statements considered during evaluation:
|
||||
|
||||
- `index.json` - Index of VEX statements with metadata
|
||||
- Individual VEX documents named by source and ID
|
||||
|
||||
### evidence/reachability/analysis.json
|
||||
|
||||
Reachability analysis results:
|
||||
|
||||
```json
|
||||
{
  "artifactDigest": "sha256:abc123...",
  "analysisType": "static",
  "analysisTimestamp": "2026-01-17T10:20:00Z",
  "components": [
    {
      "purl": "pkg:npm/lodash@4.17.21",
      "vulnerabilities": [
        {
          "id": "CVE-2021-23337",
          "reachable": false,
          "reason": "Vulnerable function not in call graph"
        }
      ]
    }
  ]
}
|
||||
```
|
||||
|
||||
### policy/policy-snapshot.json
|
||||
|
||||
Snapshot of policy configuration at evaluation time:
|
||||
|
||||
```json
|
||||
{
  "policyVersion": "v2.3.1",
  "policyDigest": "sha256:policy-hash...",
  "gates": ["sbom-required", "vex-trust", "cve-threshold"],
  "thresholds": {
    "vexTrustScore": 0.70,
    "maxCriticalCves": 0,
    "maxHighCves": 5
  },
  "evaluatedAt": "2026-01-17T10:25:00Z"
}
|
||||
```
|
||||
|
||||
### policy/gate-decision.json
|
||||
|
||||
Detailed gate evaluation result:
|
||||
|
||||
```json
|
||||
{
  "artifactDigest": "sha256:abc123...",
  "overallDecision": "BLOCKED",
  "gates": [
    {
      "gateId": "vex-trust",
      "decision": "BLOCKED",
      "inputs": {
        "vexStatements": 3,
        "trustScore": 0.45,
        "threshold": 0.70
      },
      "reason": "Trust score below threshold",
      "suggestion": "Obtain VEX from trusted issuer or adjust trust registry"
    }
  ]
}
|
||||
```
|
||||
|
||||
### replay/knowledge-snapshot.json
|
||||
|
||||
Frozen inputs for deterministic replay:
|
||||
|
||||
```json
|
||||
{
  "$schema": "https://schema.stella-ops.org/knowledge-snapshot/v1",
  "snapshotId": "urn:stella:snapshot:sha256:...",
  "capturedAt": "2026-01-17T10:25:00Z",
  "inputs": {
    "sbomDigest": "sha256:sbom-hash...",
    "vexStatements": ["sha256:vex1...", "sha256:vex2..."],
    "policyDigest": "sha256:policy-hash...",
    "reachabilityDigest": "sha256:reach-hash..."
  },
  "replayCommand": "stella replay snapshot --manifest replay/knowledge-snapshot.json"
}
|
||||
```
|
||||
|
||||
### replay/replay-instructions.md
|
||||
|
||||
Human-readable replay instructions (auto-generated, see AUD-004).
|
||||
|
||||
## Archive Formats
|
||||
|
||||
The bundle can be output in three formats:
|
||||
|
||||
| Format | Extension | Use Case |
|
||||
|--------|-----------|----------|
|
||||
| Directory | (none) | Local inspection, development |
|
||||
| tar.gz | `.tar.gz` | Transfer, archival (default for remote) |
|
||||
| zip | `.zip` | Windows compatibility |
|
||||
|
||||
## Verification
|
||||
|
||||
To verify a bundle's integrity:
|
||||
|
||||
```bash
|
||||
stella audit verify ./audit-bundle-sha256-abc123/
|
||||
```
|
||||
|
||||
Verification checks:
|
||||
1. Parse `manifest.json`
|
||||
2. Verify each file's SHA-256 hash matches manifest
|
||||
3. Verify `integrityHash` (hash of all file hashes)
|
||||
4. Optionally verify DSSE signatures
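
Steps 1 and 2 can be sketched as follows, using the `path`, `sha256`, and `required` fields from the manifest example. The `ManifestVerifier` class and its message strings are hypothetical; the `integrityHash` and signature checks are left out because their exact derivation is not defined here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text.Json;

public static class ManifestVerifier
{
    // Returns a list of problems; an empty list means every listed file matched its hash.
    public static List<string> VerifyFiles(string bundleDir)
    {
        var problems = new List<string>();
        var manifestJson = File.ReadAllText(Path.Combine(bundleDir, "manifest.json"));
        using var manifest = JsonDocument.Parse(manifestJson);

        foreach (var entry in manifest.RootElement.GetProperty("files").EnumerateArray())
        {
            var relativePath = entry.GetProperty("path").GetString()!;
            var expected = entry.GetProperty("sha256").GetString()!;
            var required = entry.TryGetProperty("required", out var flag) && flag.GetBoolean();

            var fullPath = Path.Combine(bundleDir, relativePath);
            if (!File.Exists(fullPath))
            {
                // Optional files may be absent; required ones may not.
                if (required) problems.Add($"missing required file: {relativePath}");
                continue;
            }

            var actual = Convert.ToHexString(SHA256.HashData(File.ReadAllBytes(fullPath)));
            if (!string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
                problems.Add($"hash mismatch: {relativePath}");
        }

        return problems;
    }
}
```

Collecting all problems instead of stopping at the first one lets the command report every integrity issue in a single run.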
|
||||
|
||||
## Compliance Mapping
|
||||
|
||||
| Compliance Framework | Bundle Component |
|
||||
|---------------------|------------------|
|
||||
| SOC 2 (CC7.1) | verdict/, policy/ |
|
||||
| ISO 27001 (A.12.6) | evidence/sbom.json |
|
||||
| FedRAMP | All components |
|
||||
| SLSA Level 3 | evidence/provenance/ |
|
||||
|
||||
## Extensibility
|
||||
|
||||
Custom evidence can be added to the `evidence/custom/` directory. Each custom file must:

- Be listed in `manifest.json`
- Use JSON or Markdown format
- Include a schema reference if it is JSON
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
251
docs/modules/cli/guides/commands/audit.md
Normal file
@@ -0,0 +1,251 @@
|
||||
# stella audit
|
||||
|
||||
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
|
||||
> **Task:** AUD-007 - Documentation
|
||||
|
||||
Commands for audit operations including bundle generation and verification.
|
||||
|
||||
## Synopsis
|
||||
|
||||
```
|
||||
stella audit <command> [options]
|
||||
```
|
||||
|
||||
## Commands
|
||||
|
||||
| Command | Description |
|
||||
|---------|-------------|
|
||||
| `bundle` | Generate self-contained audit bundle for an artifact |
|
||||
| `verify` | Verify audit bundle integrity |
|
||||
|
||||
---
|
||||
|
||||
## stella audit bundle
|
||||
|
||||
Generate a self-contained, auditor-ready evidence package for an artifact.
|
||||
|
||||
### Synopsis
|
||||
|
||||
```
|
||||
stella audit bundle <digest> [options]
|
||||
```
|
||||
|
||||
### Arguments
|
||||
|
||||
| Argument | Description |
|
||||
|----------|-------------|
|
||||
| `<digest>` | Artifact digest (e.g., `sha256:abc123...`) |
|
||||
|
||||
### Options
|
||||
|
||||
| Option | Default | Description |
|
||||
|--------|---------|-------------|
|
||||
| `--output <path>` | `./audit-bundle-<digest>/` | Output path for the bundle |
|
||||
| `--format <format>` | `dir` | Output format: `dir`, `tar.gz`, `zip` |
|
||||
| `--include-call-graph` | `false` | Include call graph visualization |
|
||||
| `--include-schemas` | `false` | Include JSON schema files |
|
||||
| `--include-trace` | `true` | Include policy evaluation trace |
|
||||
| `--policy-version <ver>` | (current) | Use specific policy version |
|
||||
| `--overwrite` | `false` | Overwrite existing output |
|
||||
| `--verbose` | `false` | Show progress during generation |
|
||||
|
||||
### Examples
|
||||
|
||||
```bash
|
||||
# Generate bundle as directory
|
||||
stella audit bundle sha256:abc123def456
|
||||
|
||||
# Generate tar.gz archive
|
||||
stella audit bundle sha256:abc123def456 --format tar.gz
|
||||
|
||||
# Specify output location
|
||||
stella audit bundle sha256:abc123def456 --output ./audits/release-v2.5/
|
||||
|
||||
# Include all optional content
|
||||
stella audit bundle sha256:abc123def456 \
  --include-call-graph \
  --include-schemas \
  --verbose
|
||||
|
||||
# Use specific policy version
|
||||
stella audit bundle sha256:abc123def456 --policy-version v2.3.1
|
||||
```
|
||||
|
||||
### Output
|
||||
|
||||
The bundle contains:
|
||||
|
||||
```
|
||||
audit-bundle-<digest>-<timestamp>/
|
||||
├── manifest.json # Bundle manifest with cryptographic hashes
|
||||
├── README.md # Human-readable guide for auditors
|
||||
├── verdict/
|
||||
│ ├── verdict.json # StellaVerdict artifact
|
||||
│ └── verdict.dsse.json # DSSE envelope with signatures
|
||||
├── evidence/
|
||||
│ ├── sbom.json # SBOM (CycloneDX format)
|
||||
│ ├── vex-statements/ # All VEX statements considered
|
||||
│ │ ├── index.json
|
||||
│ │ └── *.json
|
||||
│ ├── reachability/
|
||||
│ │ ├── analysis.json
|
||||
│ │ └── call-graph.dot # Optional
|
||||
│ └── provenance/
|
||||
│ └── slsa-provenance.json
|
||||
├── policy/
|
||||
│ ├── policy-snapshot.json
|
||||
│ ├── gate-decision.json
|
||||
│ └── evaluation-trace.json
|
||||
├── replay/
|
||||
│ ├── knowledge-snapshot.json
|
||||
│ └── replay-instructions.md
|
||||
└── schema/ # Optional
|
||||
├── verdict-schema.json
|
||||
└── vex-schema.json
|
||||
```
|
||||
|
||||
### Exit Codes
|
||||
|
||||
| Code | Description |
|
||||
|------|-------------|
|
||||
| 0 | Bundle generated successfully |
|
||||
| 1 | Bundle generated with missing evidence (warnings) |
|
||||
| 2 | Error (artifact not found, permission denied, etc.) |
|
||||
|
||||
---
|
||||
|
||||
## stella audit verify
|
||||
|
||||
Verify the integrity of an audit bundle.
|
||||
|
||||
### Synopsis
|
||||
|
||||
```
|
||||
stella audit verify <bundle-path> [options]
|
||||
```
|
||||
|
||||
### Arguments
|
||||
|
||||
| Argument | Description |
|
||||
|----------|-------------|
|
||||
| `<bundle-path>` | Path to audit bundle (directory or archive) |
|
||||
|
||||
### Options
|
||||
|
||||
| Option | Default | Description |
|
||||
|--------|---------|-------------|
|
||||
| `--strict` | `false` | Fail on any missing optional files |
|
||||
| `--check-signatures` | `false` | Verify DSSE signatures |
|
||||
| `--trusted-keys <path>` | (none) | Path to trusted keys file for signature verification |
|
||||
|
||||
### Examples
|
||||
|
||||
```bash
|
||||
# Basic verification
|
||||
stella audit verify ./audit-bundle-abc123-20260117/
|
||||
|
||||
# Strict mode (fail on any missing files)
|
||||
stella audit verify ./audit-bundle-abc123-20260117/ --strict
|
||||
|
||||
# Verify signatures
|
||||
stella audit verify ./audit-bundle.tar.gz \
|
||||
--check-signatures \
|
||||
--trusted-keys ./trusted-keys.json
|
||||
|
||||
# Verify archive directly
|
||||
stella audit verify ./audit-bundle-abc123.zip
|
||||
```
|
||||
|
||||
### Output
|
||||
|
||||
```
|
||||
Verifying bundle: ./audit-bundle-abc123-20260117/
|
||||
|
||||
Bundle ID: urn:stella:audit-bundle:sha256:abc123...
|
||||
Artifact: sha256:abc123def456...
|
||||
Generated: 2026-01-17T10:30:00Z
|
||||
Files: 15
|
||||
|
||||
Verifying files...
|
||||
✓ Verified 15/15 files
|
||||
✓ Integrity hash verified
|
||||
|
||||
✓ Bundle integrity verified
|
||||
```
|
||||
|
||||
### Exit Codes
|
||||
|
||||
| Code | Description |
|
||||
|------|-------------|
|
||||
| 0 | Bundle is valid |
|
||||
| 1 | Bundle integrity check failed |
|
||||
| 2 | Error (bundle not found, invalid format, etc.) |
|
||||
|
||||
---
|
||||
|
||||
## Trusted Keys File Format
|
||||
|
||||
For signature verification, provide a JSON file with trusted public keys:
|
||||
|
||||
```json
|
||||
{
|
||||
"keys": [
|
||||
{
|
||||
"keyId": "urn:stella:key:sha256:abc123...",
|
||||
"publicKey": "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
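
A quick structural check of a keys file before passing it to `--trusted-keys` can catch formatting mistakes early. The sketch below is a hypothetical helper (`load_trusted_keys` is not part of the Stella CLI). It assumes only the `keys`, `keyId`, and `publicKey` fields shown above, and it checks PEM framing rather than the key material itself.

```python
import json

PEM_HEADER = "-----BEGIN PUBLIC KEY-----"
PEM_FOOTER = "-----END PUBLIC KEY-----"


def load_trusted_keys(text):
    """Parse a trusted-keys document and return {keyId: publicKey}.

    Hypothetical helper: validates only the fields shown in the example above.
    """
    document = json.loads(text)
    keys = {}
    for entry in document["keys"]:
        pem = entry["publicKey"].strip()
        # Check PEM framing only; the key bytes are not decoded here.
        if not (pem.startswith(PEM_HEADER) and pem.endswith(PEM_FOOTER)):
            raise ValueError(f"{entry['keyId']}: publicKey is not PEM-framed")
        if entry["keyId"] in keys:
            raise ValueError(f"duplicate keyId: {entry['keyId']}")
        keys[entry["keyId"]] = pem
    return keys
```

A file that passes this check can still be rejected by `stella audit verify`, which performs the actual signature verification.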
|
||||
|
||||
---
|
||||
|
||||
## Use Cases
|
||||
|
||||
### Generating Bundles for External Auditors
|
||||
|
||||
```bash
|
||||
# Generate comprehensive bundle for SOC 2 audit
|
||||
stella audit bundle sha256:prod-release-v2.5 \
|
||||
--format zip \
|
||||
--include-schemas \
|
||||
--output ./soc2-audit-2026/release-evidence.zip
|
||||
```
|
||||
|
||||
### Verifying Received Bundles
|
||||
|
||||
```bash
|
||||
# Verify bundle received from another team
|
||||
stella audit verify ./received-bundle.tar.gz --strict
|
||||
|
||||
# Verify with signature checking
|
||||
stella audit verify ./received-bundle/ \
|
||||
--check-signatures \
|
||||
--trusted-keys ./company-signing-keys.json
|
||||
```
|
||||
|
||||
### CI/CD Integration
|
||||
|
||||
```yaml
|
||||
# GitLab CI example
audit-bundle:
  stage: release
  script:
    - stella audit bundle $IMAGE_DIGEST --format tar.gz --output ./audit/
  artifacts:
    paths:
      - audit/
    expire_in: 5 years
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [Audit Bundle Format Specification](audit-bundle-format.md)
|
||||
- [stella replay](../replay.md) - Replay verdicts for verification
|
||||
- [stella export](export.md) - Export evidence in various formats
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
313
docs/modules/cli/guides/commands/explain.md
Normal file
@@ -0,0 +1,313 @@
|
||||
# stella explain - Block Explanation Commands
|
||||
|
||||
**Sprint:** SPRINT_20260117_026_CLI_why_blocked_command
|
||||
|
||||
## Overview
|
||||
|
||||
The `stella explain` command group provides commands for understanding why artifacts are blocked by policy gates. This addresses the M2 moat requirement: **"Explainability with proof, not narrative."**
|
||||
|
||||
When an artifact is blocked, `stella explain` produces a **deterministic trace** with **referenced evidence artifacts**, enabling:
|
||||
- Clear understanding of which gate blocked the artifact
|
||||
- Actionable suggestions for remediation
|
||||
- Verifiable evidence chain
|
||||
- Deterministic replay for verification
|
||||
|
||||
---
|
||||
|
||||
## Commands
|
||||
|
||||
### stella explain block
|
||||
|
||||
Explain why an artifact was blocked by policy gates.
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
stella explain block <digest> [options]
|
||||
```
|
||||
|
||||
**Arguments:**
|
||||
- `<digest>` - Artifact digest in any of these formats:
|
||||
- `sha256:abc123...` - Full digest with algorithm prefix
|
||||
- `abc123...` - Raw 64-character hex digest (assumed sha256)
|
||||
- `registry.example.com/image@sha256:abc123...` - OCI reference (digest extracted)
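
The three accepted digest forms can be reduced to one canonical string before comparing or caching results. The following is a minimal sketch under stated assumptions: `normalize_digest` is a hypothetical helper, not the CLI's own parser, and the source does not say whether the CLI lowercases hex or accepts algorithms other than sha256.

```python
import re

_HEX64 = re.compile(r"[0-9a-f]{64}")


def normalize_digest(value):
    """Return 'sha256:<64 hex>' for any of the three documented input forms."""
    candidate = value.strip()
    # OCI reference: keep only what follows the last '@'.
    if "@" in candidate:
        candidate = candidate.rsplit("@", 1)[1]
    # Raw hex: assume sha256, as the argument description states.
    if ":" not in candidate:
        candidate = "sha256:" + candidate
    algorithm, _, hex_part = candidate.partition(":")
    hex_part = hex_part.lower()  # assumption: hex case is not significant
    if algorithm != "sha256" or not _HEX64.fullmatch(hex_part):
        raise ValueError(f"not a sha256 digest: {value!r}")
    return "sha256:" + hex_part
```

Normalizing up front means a raw hex digest and an OCI reference to the same image map to the same key.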
|
||||
|
||||
**Options:**
|
||||
|
||||
| Option | Alias | Description | Default |
|
||||
|--------|-------|-------------|---------|
|
||||
| `--format <format>` | `-f` | Output format: `table`, `json`, `markdown` | `table` |
|
||||
| `--show-evidence` | `-e` | Include full evidence artifact details | false |
|
||||
| `--show-trace` | `-t` | Include policy evaluation trace | false |
|
||||
| `--replay-token` | `-r` | Include replay token in output | false |
|
||||
| `--output <path>` | `-o` | Write to file instead of stdout | stdout |
|
||||
| `--offline` | | Query local verdict cache only | false |
|
||||
|
||||
---
|
||||
|
||||
## Output Formats
|
||||
|
||||
### Table Format (Default)
|
||||
|
||||
Human-readable format optimized for terminal display:
|
||||
|
||||
```
|
||||
Artifact: sha256:abc123def456789012345678901234567890123456789012345678901234
|
||||
Status: BLOCKED
|
||||
|
||||
Gate: VexTrust
|
||||
Reason: Trust score below threshold (0.45 < 0.70)
|
||||
Suggestion: Obtain VEX statement from trusted issuer or add issuer to trust registry
|
||||
|
||||
Evidence:
|
||||
[VEX ] vex:sha256:de...23 vendor-x 2026-01-15T10:00:00Z
|
||||
[REACH ] reach:sha256...56 static 2026-01-15T09:55:00Z
|
||||
|
||||
Replay: stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
|
||||
```
|
||||
|
||||
### JSON Format
|
||||
|
||||
Machine-readable format for CI/CD integration:
|
||||
|
||||
```json
|
||||
{
|
||||
"artifact": "sha256:abc123def456789012345678901234567890123456789012345678901234",
|
||||
"status": "BLOCKED",
|
||||
"gate": "VexTrust",
|
||||
"reason": "Trust score below threshold (0.45 < 0.70)",
|
||||
"suggestion": "Obtain VEX statement from trusted issuer or add issuer to trust registry",
|
||||
"evaluationTime": "2026-01-15T10:30:00+00:00",
|
||||
"policyVersion": "v2.3.0",
|
||||
"evidence": [
|
||||
{
|
||||
"type": "VEX",
|
||||
"id": "vex:sha256:def456789abc123",
|
||||
"source": "vendor-x",
|
||||
"timestamp": "2026-01-15T10:00:00+00:00",
|
||||
"retrieveCommand": "stella evidence get vex:sha256:def456789abc123"
|
||||
},
|
||||
{
|
||||
"type": "REACH",
|
||||
"id": "reach:sha256:789abc123def456",
|
||||
"source": "static-analysis",
|
||||
"timestamp": "2026-01-15T09:55:00+00:00",
|
||||
"retrieveCommand": "stella evidence get reach:sha256:789abc123def456"
|
||||
}
|
||||
],
|
||||
"replayCommand": "stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000"
|
||||
}
|
||||
```
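
For pipelines that need more than `jq` one-liners, the JSON output can be condensed in a few lines of Python. This sketch relies only on the field names visible in the sample above (`status`, `gate`, `reason`, `evidence[].type`, `evidence[].id`, `evidence[].retrieveCommand`). The function name is hypothetical, and the shape of the output for a non-blocked artifact is not shown in this document, so it is not handled.

```python
import json


def summarize_block(report_json):
    """Condense a `stella explain block --format json` report into text lines."""
    report = json.loads(report_json)
    lines = [f"{report['status']}: {report['gate']} - {report['reason']}"]
    for item in report.get("evidence", []):
        # One line per evidence artifact, with the command that retrieves it.
        lines.append(f"  {item['type']:<6} {item['id']} -> {item['retrieveCommand']}")
    return "\n".join(lines)
```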
|
||||
|
||||
### Markdown Format
|
||||
|
||||
Suitable for embedding in GitHub issues, PR comments, or documentation:
|
||||
|
||||
````markdown
## Block Explanation

**Artifact:** `sha256:abc123def456789012345678901234567890123456789012345678901234`
**Status:** BLOCKED

### Gate Decision

| Property | Value |
|----------|-------|
| Gate | VexTrust |
| Reason | Trust score below threshold (0.45 < 0.70) |
| Suggestion | Obtain VEX statement from trusted issuer or add issuer to trust registry |
| Policy Version | v2.3.0 |

### Evidence

| Type | ID | Source | Timestamp |
|------|-----|--------|-----------|
| VEX | `vex:sha256:de...23` | vendor-x | 2026-01-15 10:00 |
| REACH | `reach:sha256...56` | static-analysis | 2026-01-15 09:55 |

### Verification

```bash
stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
```
````
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
### Basic Block Explanation
|
||||
|
||||
```bash
|
||||
# Get basic explanation of why an artifact is blocked
|
||||
stella explain block sha256:abc123def456789012345678901234567890123456789012345678901234
|
||||
```
|
||||
|
||||
### JSON Output for CI/CD
|
||||
|
||||
```bash
|
||||
# Get JSON output for parsing in CI/CD pipeline
|
||||
stella explain block sha256:abc123... --format json --output block-reason.json
|
||||
|
||||
# Parse in CI/CD
|
||||
GATE=$(jq -r '.gate' block-reason.json)
|
||||
REASON=$(jq -r '.reason' block-reason.json)
|
||||
echo "Blocked by $GATE: $REASON"
|
||||
```
|
||||
|
||||
### Full Explanation with Evidence and Trace
|
||||
|
||||
```bash
|
||||
# Get complete explanation with all details
|
||||
stella explain block sha256:abc123... \
|
||||
--show-evidence \
|
||||
--show-trace \
|
||||
--replay-token \
|
||||
--format table
|
||||
```
|
||||
|
||||
### Markdown for PR Comment
|
||||
|
||||
```bash
|
||||
# Generate markdown for GitHub PR comment
|
||||
stella explain block sha256:abc123... --format markdown --output comment.md
|
||||
|
||||
# Use with gh CLI
|
||||
gh pr comment 123 --body-file comment.md
|
||||
```
|
||||
|
||||
### Retrieve Evidence Artifacts
|
||||
|
||||
```bash
|
||||
# Get explanation
|
||||
stella explain block sha256:abc123... --show-evidence
|
||||
|
||||
# Retrieve specific evidence artifacts
|
||||
stella evidence get vex:sha256:def456789abc123
|
||||
stella evidence get reach:sha256:789abc123def456
|
||||
```
|
||||
|
||||
### Verify Deterministic Replay
|
||||
|
||||
```bash
|
||||
# Get the replay command from the JSON output
REPLAY=$(stella explain block sha256:abc123... --format json | jq -r '.replayCommand')

# Execute replay verification (quoted so the command string reaches eval intact)
eval "$REPLAY"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Exit Codes
|
||||
|
||||
| Code | Meaning |
|
||||
|------|---------|
|
||||
| `0` | Artifact is NOT blocked (all gates passed) |
|
||||
| `1` | Artifact IS blocked (one or more gates failed) |
|
||||
| `2` | Error (artifact not found, API error, etc.) |
|
||||
|
||||
**CI/CD Integration:**
|
||||
|
||||
```bash
|
||||
# Fail pipeline if artifact is blocked
# Capture the exit code directly: inside `if ! cmd; then`, $? is the status of
# the negation (always 0), not the status of the command itself.
EXIT_CODE=0
stella explain block sha256:abc123... --format json > /dev/null 2>&1 || EXIT_CODE=$?

if [ "$EXIT_CODE" -eq 1 ]; then
    echo "ERROR: Artifact is blocked by policy"
    stella explain block sha256:abc123... --format markdown
    exit 1
elif [ "$EXIT_CODE" -ne 0 ]; then
    echo "ERROR: Could not retrieve block status"
    exit 2
fi
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Evidence Types
|
||||
|
||||
The `explain block` command returns evidence artifacts that contributed to the gate decision:
|
||||
|
||||
| Type | Description | Source |
|
||||
|------|-------------|--------|
|
||||
| `VEX` | VEX (Vulnerability Exploitability eXchange) statement | VEX issuers, vendor security teams |
|
||||
| `REACH` | Reachability analysis result | Static analysis, call graph analysis |
|
||||
| `SBOM` | Software Bill of Materials | SBOM generators, build systems |
|
||||
| `SCAN` | Vulnerability scan result | Scanner service |
|
||||
| `ATTEST` | Attestation document | Attestor service, SLSA provenance |
|
||||
| `POLICY` | Policy evaluation result | Policy engine |
|
||||
|
||||
---
|
||||
|
||||
## Determinism Guarantee
|
||||
|
||||
All output from `stella explain block` is **deterministic**:
|
||||
|
||||
1. **Same inputs produce identical outputs** - Given the same artifact digest and policy version, the output is byte-for-byte identical
|
||||
2. **Evidence is sorted** - Evidence artifacts are sorted by timestamp (ascending)
|
||||
3. **Trace is sorted** - Evaluation trace steps are sorted by step number
|
||||
4. **Timestamps use ISO 8601** - All timestamps use ISO 8601 format with UTC offset
|
||||
5. **JSON uses canonical ordering** - JSON properties are ordered consistently
|
||||
|
||||
This enables:
|
||||
- **Replay verification** - Use the replay token to verify the decision can be reproduced
|
||||
- **Audit trails** - Compare explanations across time
|
||||
- **Cache validation** - Verify cached decisions match current evaluation
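
The ordering rules above are what make a byte-for-byte comparison meaningful. As an illustration only (not the CLI's actual serializer), the sketch below sorts evidence by timestamp, emits JSON with sorted keys and fixed separators, and compares the resulting bytes. The field names come from the JSON sample earlier in this document; `canonical_bytes` is a hypothetical name.

```python
import json
from datetime import datetime


def _timestamp(item):
    # Accept both "...Z" and "...+00:00" spellings of UTC.
    return datetime.fromisoformat(item["timestamp"].replace("Z", "+00:00"))


def canonical_bytes(report):
    """Serialize a report so that equal content always yields equal bytes."""
    ordered = dict(report)
    # Rule 2: evidence sorted by timestamp, ascending.
    ordered["evidence"] = sorted(report.get("evidence", []), key=_timestamp)
    # Rule 5: consistent property order, fixed separators.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

Two reports that differ only in key order or evidence order then compare equal with a plain `==` on the bytes, which is the property replay verification and cache validation depend on.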
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Artifact Not Found
|
||||
|
||||
```
|
||||
Error: Artifact sha256:abc123... not found in registry or evidence store.
|
||||
```
|
||||
|
||||
**Causes:**
|
||||
- Artifact was never scanned
|
||||
- Artifact digest is incorrect
|
||||
- Artifact was deleted from registry
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Verify artifact exists
|
||||
stella image inspect sha256:abc123...
|
||||
|
||||
# Scan the artifact
|
||||
stella scan docker://myregistry/myimage@sha256:abc123...
|
||||
```
|
||||
|
||||
### Not Blocked
|
||||
|
||||
```
|
||||
Artifact sha256:abc123... is NOT blocked. All policy gates passed.
|
||||
```
|
||||
|
||||
This means the artifact passed all policy evaluations. Exit code will be `0`.
|
||||
|
||||
### API Error
|
||||
|
||||
```
|
||||
Error: Policy service unavailable
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
```bash
|
||||
# Check connectivity
|
||||
stella doctor --check check.policy.connectivity
|
||||
|
||||
# Use offline mode if available
|
||||
stella explain block sha256:abc123... --offline
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## See Also
|
||||
|
||||
- [Policy Commands](policy.md) - Policy management and testing
|
||||
- [VEX Commands](vex.md) - VEX document management
|
||||
- [Evidence Commands](evidence.md) - Evidence retrieval and verification
|
||||
- [Verify Commands](verify.md) - Verdict verification and replay
|
||||
- [Command Reference](reference.md) - Complete command reference
|
||||
@@ -13,6 +13,7 @@ graph TD
|
||||
CLI --> ADMIN[Administration]
|
||||
CLI --> AUTH[Authentication]
|
||||
CLI --> POLICY[Policy Management]
|
||||
CLI --> EXPLAIN[Explainability]
|
||||
CLI --> VEX[VEX & Decisioning]
|
||||
CLI --> SBOM[SBOM Operations]
|
||||
CLI --> REPORT[Reporting & Export]
|
||||
@@ -914,6 +915,73 @@ Platform: linux-x64
|
||||
|
||||
---
|
||||
|
||||
## Explainability Commands
|
||||
|
||||
### stella explain block
|
||||
|
||||
Explain why an artifact was blocked by policy gates. Produces a deterministic trace with referenced evidence artifacts.
|
||||
|
||||
**Sprint:** SPRINT_20260117_026_CLI_why_blocked_command
|
||||
**Moat Reference:** M2 (Explainability with proof, not narrative)
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
stella explain block <digest> [options]
|
||||
```
|
||||
|
||||
**Arguments:**
|
||||
- `<digest>` - Artifact digest (`sha256:abc123...`, raw hex, or OCI reference)
|
||||
|
||||
**Options:**
|
||||
| Option | Description | Default |
|
||||
|--------|-------------|---------|
|
||||
| `--format <format>` | Output format: `table`, `json`, `markdown` | `table` |
|
||||
| `--show-evidence` | Include full evidence artifact details | false |
|
||||
| `--show-trace` | Include policy evaluation trace | false |
|
||||
| `--replay-token` | Include replay token in output | false |
|
||||
| `--output <path>` | Write to file instead of stdout | stdout |
|
||||
| `--offline` | Query local verdict cache only | false |
|
||||
|
||||
**Examples:**
|
||||
```bash
|
||||
# Basic explanation
|
||||
stella explain block sha256:abc123def456...
|
||||
|
||||
# JSON output for CI/CD
|
||||
stella explain block sha256:abc123... --format json --output reason.json
|
||||
|
||||
# Full explanation with evidence and trace
|
||||
stella explain block sha256:abc123... --show-evidence --show-trace
|
||||
|
||||
# Markdown for PR comment
|
||||
stella explain block sha256:abc123... --format markdown | gh pr comment 123 --body-file -
|
||||
```
|
||||
|
||||
**Exit Codes:**
|
||||
- `0` - Artifact is NOT blocked (all gates passed)
|
||||
- `1` - Artifact IS blocked
|
||||
- `2` - Error (not found, API error)
|
||||
|
||||
**Output (table):**
|
||||
```
|
||||
Artifact: sha256:abc123def456789012345678901234567890123456789012345678901234
|
||||
Status: BLOCKED
|
||||
|
||||
Gate: VexTrust
|
||||
Reason: Trust score below threshold (0.45 < 0.70)
|
||||
Suggestion: Obtain VEX statement from trusted issuer
|
||||
|
||||
Evidence:
|
||||
[VEX ] vex:sha256:de...23 vendor-x 2026-01-15T10:00:00Z
|
||||
[REACH ] reach:sha256...56 static 2026-01-15T09:55:00Z
|
||||
|
||||
Replay: stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
|
||||
```
|
||||
|
||||
**See Also:** [Explain Commands Documentation](explain.md)
|
||||
|
||||
---
|
||||
|
||||
## Additional Commands
|
||||
|
||||
### stella vuln query
|
||||
|
||||
333
docs/modules/telemetry/guides/p0-metrics.md
Normal file
@@ -0,0 +1,333 @@
|
||||
# P0 Product Metrics
|
||||
|
||||
> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
|
||||
> **Task:** P0M-007 - Documentation
|
||||
|
||||
This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.
|
||||
|
||||
## Overview
|
||||
|
||||
These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."
|
||||
|
||||
| Metric | Target | Alert Threshold |
|
||||
|--------|--------|-----------------|
|
||||
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
|
||||
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
|
||||
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
|
||||
| Determinism Regressions | Zero | Any policy-level |
|
||||
|
||||
---
|
||||
|
||||
## Metric 1: Time to First Verified Release
|
||||
|
||||
**Name:** `stella_time_to_first_verified_release_seconds`
|
||||
**Type:** Histogram
|
||||
|
||||
### Definition
|
||||
|
||||
Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
|
||||
|
||||
### Labels
|
||||
|
||||
| Label | Values | Description |
|
||||
|-------|--------|-------------|
|
||||
| `tenant` | (varies) | Tenant identifier |
|
||||
| `deployment_type` | `fresh`, `upgrade` | Type of installation |
|
||||
|
||||
### Histogram Buckets
|
||||
|
||||
5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
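
Because the metric is recorded in seconds, the human-readable bucket list has to be converted to numeric upper bounds when configuring a histogram. The helper below is a hypothetical sketch of that conversion. It assumes single-letter `s`, `m`, and `h` suffixes, which is all the bucket list uses.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def bucket_bounds(spec):
    """Convert a list such as '5m, 1h' into upper bounds in seconds."""
    bounds = []
    for token in spec.split(","):
        token = token.strip()
        # The last character is the unit, everything before it is the count.
        bounds.append(int(token[:-1]) * UNIT_SECONDS[token[-1]])
    return bounds


RELEASE_BUCKETS = bucket_bounds("5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h")
```

The 4-hour target (14400 s) and the 24-hour alert threshold (86400 s) both fall exactly on bucket boundaries, so neither has to be interpolated from neighbouring buckets.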
|
||||
|
||||
### Collection Points
|
||||
|
||||
1. **Install timestamp** - Recorded on first Authority service startup
|
||||
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion
|
||||
|
||||
### Why This Matters
|
||||
|
||||
A short time-to-first-release indicates:
|
||||
- Good onboarding experience
|
||||
- Clear documentation
|
||||
- Sensible default configurations
|
||||
- Working integrations
|
||||
|
||||
### Dashboard Usage
|
||||
|
||||
The Grafana dashboard shows:
|
||||
- Histogram heatmap of time distribution
|
||||
- P50/P90/P99 statistics
|
||||
- Trend over time
|
||||
|
||||
### Alert Response
|
||||
|
||||
**Warning (P90 > 4 hours):**
|
||||
1. Review recent onboarding experiences
|
||||
2. Check for common configuration issues
|
||||
3. Review documentation clarity
|
||||
|
||||
**Critical (P90 > 24 hours):**
|
||||
1. Investigate blocked customers
|
||||
2. Check for integration failures
|
||||
3. Consider guided onboarding assistance
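
The warning and critical levels above reduce to two comparisons against the P90. The sketch below is illustrative only. It computes a nearest-rank P90 over raw samples, whereas a production dashboard would normally estimate the quantile from histogram buckets, and the function names are hypothetical.

```python
def p90(samples):
    """Nearest-rank 90th percentile of a non-empty list of samples (seconds)."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def alert_level(p90_seconds, warning=4 * 3600, critical=24 * 3600):
    """Map a P90 value to the levels described above (4 h warning, 24 h critical)."""
    if p90_seconds > critical:
        return "critical"
    if p90_seconds > warning:
        return "warning"
    return "ok"
```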
|
||||
|
||||
---
|
||||
|
||||
## Metric 2: Mean Time to Answer "Why Blocked"
|
||||
|
||||
**Name:** `stella_why_blocked_latency_seconds`
|
||||
**Type:** Histogram
|
||||
|
||||
### Definition
|
||||
|
||||
Time from block decision to user viewing explanation (via CLI, UI, or API).
|
||||
|
||||
### Labels
|
||||
|
||||
| Label | Values | Description |
|
||||
|-------|--------|-------------|
|
||||
| `tenant` | (varies) | Tenant identifier |
|
||||
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
|
||||
| `resolution_type` | `immediate`, `delayed` | Same session vs different session |
|
||||
|
||||
### Histogram Buckets
|
||||
|
||||
1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
|
||||
|
||||
### Collection Points
|
||||
|
||||
1. **Block decision** - Timestamp stored in verdict
|
||||
2. **Explanation view** - Tracked when `stella explain block` or its UI equivalent is invoked
|
||||
|
||||
### Why This Matters
|
||||
|
||||
Short "why blocked" latency indicates:
|
||||
- Clear block messaging
|
||||
- Discoverable explanation tools
|
||||
- Good explainability UX
|
||||
|
||||
Long latency may indicate:
|
||||
- Users confused about where to find answers
|
||||
- Documentation gaps
|
||||
- UX friction
|
||||
|
||||
### Dashboard Usage
|
||||
|
||||
The Grafana dashboard shows:
|
||||
- Histogram heatmap of latency distribution
|
||||
- Trend line over time
|
||||
- Breakdown by surface (CLI vs UI vs API)
|
||||
|
||||
### Alert Response
|
||||
|
||||
**Warning (P90 > 5 minutes):**
|
||||
1. Review block notification messaging
|
||||
2. Check CLI command discoverability
|
||||
3. Verify UI links are prominent
|
||||
|
||||
**Critical (P90 > 1 hour):**
|
||||
1. Investigate user flows
|
||||
2. Add proactive notifications
|
||||
3. Review documentation and help text
|
||||
|
||||
---
|
||||
|
||||
## Metric 3: Support Minutes per Customer
|
||||
|
||||
**Name:** `stella_support_burden_minutes_total`
|
||||
**Type:** Counter
|
||||
|
||||
### Definition
|
||||
|
||||
Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.
|
||||
|
||||
### Labels
|
||||
|
||||
| Label | Values | Description |
|
||||
|-------|--------|-------------|
|
||||
| `tenant` | (varies) | Tenant identifier |
|
||||
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
|
||||
| `month` | YYYY-MM | Month of support |
|
||||
|
||||
### Collection
|
||||
|
||||
Log support interactions using:
|
||||
|
||||
```bash
|
||||
stella ops support log --tenant <id> --minutes <n> --category <cat>
|
||||
```
|
||||
|
||||
Or via API:
|
||||
|
||||
```http
POST /v1/ops/support/log

{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```
|
||||
|
||||
### Why This Matters
|
||||
|
||||
This metric tracks operational scalability. For solo-scaled operations:
|
||||
- Support burden should trend toward zero
|
||||
- High support minutes indicate product gaps
|
||||
- Categories identify areas needing improvement
|
||||
|
||||
### Dashboard Usage
|
||||
|
||||
The Grafana dashboard shows:
|
||||
- Stacked bar chart by category
|
||||
- Monthly trend per tenant
|
||||
- Total support burden
|
||||
|
||||
### Alert Response
|
||||
|
||||
**Warning (> 30 min/month per tenant):**
|
||||
1. Review support interactions for patterns
|
||||
2. Identify documentation gaps
|
||||
3. Create runbooks for common issues
|
||||
|
||||
**Critical (> 60 min/month per tenant):**
|
||||
1. Escalate to product for feature work
|
||||
2. Consider dedicated support time
|
||||
3. Prioritize automation
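
To make the thresholds concrete, the sketch below totals logged minutes per tenant and month and applies the 30-minute and 60-minute levels. It assumes log entries shaped like the API payload above plus a `month` value matching the `month` label. `support_alerts` is a hypothetical name, not an existing command or endpoint.

```python
from collections import defaultdict


def support_alerts(entries):
    """Return {(tenant, month): level} for totals above the alert thresholds."""
    totals = defaultdict(int)
    for entry in entries:
        totals[(entry["tenant"], entry["month"])] += entry["minutes"]

    alerts = {}
    for key, minutes in totals.items():
        if minutes > 60:
            alerts[key] = "critical"
        elif minutes > 30:
            alerts[key] = "warning"
    return alerts
```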
|
||||
|
||||
---
|
||||
|
||||
## Metric 4: Determinism Regressions
|
||||
|
||||
**Name:** `stella_determinism_regressions_total`
|
||||
**Type:** Counter
|
||||
|
||||
### Definition
|
||||
|
||||
Count of detected determinism failures in production (same inputs produced different outputs).
|
||||
|
||||
### Labels
|
||||
|
||||
| Label | Values | Description |
|
||||
|-------|--------|-------------|
|
||||
| `tenant` | (varies) | Tenant identifier |
|
||||
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
|
||||
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |
|
||||
|
||||
### Severity Tiers
|
||||
|
||||
| Tier | Description | Impact |
|
||||
|------|-------------|--------|
|
||||
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
|
||||
| `semantic` | Output semantically differs | Medium - potential confusion |
|
||||
| `policy` | Policy decision differs | **Critical** - audit risk |
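
One way to picture the three tiers is as a cascade of increasingly lenient comparisons. The sketch below is an assumption-laden illustration, not the production detector. It treats both outputs as JSON objects and reads the policy decision from a `decision` field, as in the `verdict.json` example elsewhere in this documentation.

```python
import json


def regression_tier(expected, actual):
    """Classify a determinism regression between two JSON outputs (bytes)."""
    if expected == actual:
        return None  # byte-for-byte identical: no regression
    expected_doc = json.loads(expected)
    actual_doc = json.loads(actual)
    if expected_doc == actual_doc:
        return "bitwise"  # same data, different bytes (whitespace, key order)
    if expected_doc.get("decision") != actual_doc.get("decision"):
        return "policy"  # the decision itself changed
    return "semantic"  # data changed, decision did not
```

Checking in this order means a change is always reported at the least severe tier that still explains it, except for a changed decision, which is always reported as `policy`.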
|
||||
|
||||
### Collection Points
|
||||
|
||||
1. **Scheduled verification jobs** - Regular determinism checks
|
||||
2. **Replay verification failures** - User-initiated replays
|
||||
3. **CI golden test failures** - Development-time detection
|
||||
|
||||
### Why This Matters
|
||||
|
||||
Determinism is a core moat. Regressions indicate:
|
||||
- Non-deterministic code introduced
|
||||
- External dependency changes
|
||||
- Time-sensitive logic bugs
|
||||
|
||||
**Policy-level regressions are audit-breaking** and must be fixed immediately.
|
||||
|
||||
### Dashboard Usage
|
||||
|
||||
The Grafana dashboard shows:
|
||||
- Counter with severity breakdown
|
||||
- Alert status indicator
|
||||
- Historical trend
|
||||
|
||||
### Alert Response
|
||||
|
||||
**Warning (any bitwise/semantic):**
|
||||
1. Review recent deployments
|
||||
2. Check for dependency updates
|
||||
3. Investigate affected component
|
||||
|
||||
**Critical (any policy):**
|
||||
1. **Immediate investigation required**
|
||||
2. Consider rollback
|
||||
3. Review all recent policy decisions
|
||||
4. Notify affected customers
|
||||
|
||||
---
|
||||
|
||||
## Dashboard Access
|
||||
|
||||
The P0 metrics dashboard is available at:
|
||||
|
||||
```
|
||||
/grafana/d/stella-p0-metrics
|
||||
```
|
||||
|
||||
Or directly:
|
||||
```bash
|
||||
stella ops dashboard p0
|
||||
```
|
||||
|
||||
### Dashboard Features
|
||||
|
||||
- **Tenant selector** - Filter by specific tenant
|
||||
- **Time range** - Adjust analysis window
|
||||
- **SLO indicators** - Green/yellow/red status
|
||||
- **Drill-down links** - Navigate to detailed views
|
||||
|
||||
---
|
||||
|
||||
## Alerting Configuration
|
||||
|
||||
Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.
|
||||
|
||||
### Alert Channels
|
||||
|
||||
Configure alert destinations in Grafana:
|
||||
- Slack/Teams for warnings
|
||||
- PagerDuty for critical alerts
|
||||
- Email for summaries
|
||||
|
||||
### Silencing Alerts
|
||||
|
||||
During maintenance windows:
|
||||
```bash
|
||||
stella ops alerts silence --duration 2h --reason "Planned maintenance"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Notes
|
||||
|
||||
### Source Files
|
||||
|
||||
| Component | Location |
|
||||
|-----------|----------|
|
||||
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
|
||||
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
|
||||
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
|
||||
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |
|
||||
|
||||
### Adding Custom Metrics
|
||||
|
||||
To add additional P0-level metrics:
|
||||
|
||||
1. Define in `P0ProductMetrics.cs`
|
||||
2. Add collection points in relevant services
|
||||
3. Create dashboard panel in Grafana JSON
|
||||
4. Add alert rules
|
||||
5. Update this documentation
|
||||
|
||||
---
|
||||
|
||||
## Related
|
||||
|
||||
- [Observability Guide](observability.md)
|
||||
- [Alerting Configuration](alerting.md)
|
||||
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
256
docs/operations/guides/auditor-guide.md
Normal file
@@ -0,0 +1,256 @@
|
||||
# Auditor Guide
|
||||
|
||||
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
|
||||
> **Task:** AUD-007 - Documentation
|
||||
|
||||
This guide is for external auditors reviewing Stella Ops release evidence.
|
||||
|
||||
## Overview
|
||||
|
||||
Stella Ops generates comprehensive, tamper-evident audit bundles that contain all evidence required to verify release decisions. This guide explains how to interpret and verify these bundles.
|
||||
|
||||
## Receiving an Audit Bundle
|
||||
|
||||
Audit bundles may be delivered as:
|
||||
- **Directory:** A folder containing all evidence files
|
||||
- **Archive:** A `.tar.gz` or `.zip` file
|
||||
|
||||
### Extracting Archives
|
||||
|
||||
```bash
|
||||
# tar.gz
|
||||
tar -xzf audit-bundle-sha256-abc123.tar.gz
|
||||
|
||||
# zip
|
||||
unzip audit-bundle-sha256-abc123.zip
|
||||
```
|
||||
|
||||
## Bundle Structure
|
||||
|
||||
```
|
||||
audit-bundle-<digest>-<timestamp>/
|
||||
├── manifest.json # Integrity manifest
|
||||
├── README.md # Quick reference
|
||||
├── verdict/ # Release decision
|
||||
├── evidence/ # Supporting evidence
|
||||
├── policy/ # Policy configuration
|
||||
└── replay/ # Verification instructions
|
||||
```
|
||||
|
||||
## Step 1: Verify Bundle Integrity
|
||||
|
||||
Before reviewing contents, verify the bundle has not been tampered with.
|
||||
|
||||
### Using Stella CLI
|
||||
|
||||
```bash
|
||||
stella audit verify ./audit-bundle-sha256-abc123/
|
||||
```
|
||||
|
||||
Expected output:
|
||||
```
|
||||
✓ Verified 15/15 files
|
||||
✓ Integrity hash verified
|
||||
✓ Bundle integrity verified
|
||||
```
|
||||
|
||||
### Manual Verification
|
||||
|
||||
1. Open `manifest.json`
|
||||
2. For each file listed, compute SHA-256 and compare:
|
||||
```bash
|
||||
sha256sum verdict/verdict.json
|
||||
```
|
||||
3. Verify the `integrityHash` by hashing all file hashes
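
The per-file comparison in step 2 can be scripted when the Stella CLI is not available. The manifest schema is not reproduced in this guide, so the sketch below assumes a `files` array whose entries carry `path` and `sha256` fields; check the bundle format specification and adjust the field names before relying on it. It does not recompute `integrityHash`, because the exact construction of that value is not described here.

```python
import hashlib
import json
from pathlib import Path


def verify_manifest(bundle_dir):
    """Return the manifest paths whose SHA-256 does not match (or are missing)."""
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for entry in manifest["files"]:  # assumed schema: [{"path": ..., "sha256": ...}]
        file_path = root / entry["path"]
        actual = None
        if file_path.is_file():
            actual = hashlib.sha256(file_path.read_bytes()).hexdigest()
        if actual != entry["sha256"]:
            mismatches.append(entry["path"])
    return mismatches
```

An empty result means every listed file matched its recorded hash. Any returned path should be treated as evidence of tampering or corruption.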
|
||||
|
||||
## Step 2: Review the Verdict
|
||||
|
||||
The verdict is the official release decision.
|
||||
|
||||
### verdict/verdict.json
|
||||
|
||||
```json
|
||||
{
|
||||
"artifactDigest": "sha256:abc123...",
|
||||
"decision": "PASS",
|
||||
"timestamp": "2026-01-17T10:25:00Z",
|
||||
"gates": [
|
||||
{
|
||||
"gateId": "sbom-required",
|
||||
"status": "PASS",
|
||||
"reason": "Valid CycloneDX SBOM present"
|
||||
},
|
||||
{
|
||||
"gateId": "vex-trust",
|
||||
"status": "PASS",
|
||||
"reason": "Trust score 0.85 >= 0.70 threshold"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Decision Values
|
||||
|
||||
| Decision | Meaning |
|
||||
|----------|---------|
|
||||
| `PASS` | All gates passed, artifact approved for deployment |
|
||||
| `BLOCKED` | One or more gates failed, artifact not approved |
|
||||
| `PENDING` | Evaluation incomplete, awaiting additional evidence |
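
When reviewing many bundles, it helps to cross-check the top-level decision against the individual gate results. The sketch below uses the `decision`, `gates[].gateId`, and `gates[].status` fields from the example above. Treating any gate status other than `PASS` as a failure is an assumption, since the other gate status values are not listed in this guide.

```python
def failing_gates(verdict):
    """Return the gateId of every gate whose status is not PASS."""
    return [gate["gateId"] for gate in verdict.get("gates", []) if gate["status"] != "PASS"]


def decision_matches_gates(verdict):
    """Check that the decision agrees with the gate results."""
    failed = failing_gates(verdict)
    if verdict["decision"] == "PASS":
        return not failed  # PASS requires every gate to pass
    if verdict["decision"] == "BLOCKED":
        return bool(failed)  # BLOCKED requires at least one failed gate
    return True  # PENDING: evaluation incomplete, nothing to cross-check yet
```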
|
||||
|
||||
### verdict/verdict.dsse.json
|
||||
|
||||
This file contains the cryptographically signed verdict envelope (DSSE format). Verify signatures using:
|
||||
|
||||
```bash
|
||||
stella audit verify ./bundle/ --check-signatures
|
||||
```
|
||||
|
||||
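To inspect what was actually signed, the envelope payload can be decoded by hand. The sketch below relies on the DSSE envelope layout (a base64 `payload`, a `payloadType`, and a `signatures` array with `keyid` and `sig`) and assumes `jq` and `base64` are installed. The function names are illustrative. Decoding the payload does not verify any signature; use the CLI command above for that.

```bash
# Sketch: read the contents of a DSSE envelope without verifying it.

# dsse_payload FILE -> print the decoded payload bytes.
dsse_payload() {
  jq -r '.payload' "$1" | base64 --decode
}

# dsse_key_ids FILE -> print one key ID per signature (empty if absent).
dsse_key_ids() {
  jq -r '.signatures[] | .keyid // ""' "$1"
}
```

The decoded payload of `verdict/verdict.dsse.json` would be expected to match the verdict shown earlier, although that pairing is an assumption about the bundle layout.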
## Step 3: Review Evidence
|
||||
|
||||
### evidence/sbom.json
|
||||
|
||||
Software Bill of Materials (SBOM) listing all components in the artifact.
|
||||
|
||||
**Key fields:**
|
||||
- `components[]` - List of all software components
|
||||
- `dependencies[]` - Dependency relationships
|
||||
- `metadata.timestamp` - When SBOM was generated
|
||||
|
||||
### evidence/vex-statements/
|
||||
|
||||
Vulnerability Exploitability eXchange (VEX) statements that justify vulnerability assessments.
|
||||
|
||||
**index.json:**
|
||||
```json
|
||||
{
|
||||
"statementCount": 3,
|
||||
"statements": [
|
||||
{"fileName": "vex-001.json", "source": "vendor-security"},
|
||||
{"fileName": "vex-002.json", "source": "internal-analysis"}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Each VEX statement explains why a vulnerability does or does not affect this artifact.
|
||||
|
||||
### evidence/reachability/analysis.json
|
||||
|
||||
Reachability analysis showing which vulnerabilities are actually reachable in the code.
|
||||
|
||||
```json
|
||||
{
|
||||
"components": [
|
||||
{
|
||||
"purl": "pkg:npm/lodash@4.17.21",
|
||||
"vulnerabilities": [
|
||||
{
|
||||
"id": "CVE-2021-23337",
|
||||
"reachable": false,
|
||||
"reason": "Vulnerable function not in call graph"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## Step 4: Review Policy
|
||||
|
||||
### policy/policy-snapshot.json
|
||||
|
||||
The policy configuration used for evaluation:
|
||||
|
||||
```json
|
||||
{
|
||||
"policyVersion": "v2.3.1",
|
||||
"gates": ["sbom-required", "vex-trust", "cve-threshold"],
|
||||
"thresholds": {
|
||||
"vexTrustScore": 0.70,
|
||||
"maxCriticalCves": 0,
|
||||
"maxHighCves": 5
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
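As a rough illustration of how the CVE thresholds might be applied, the sketch below compares raw counts against `maxCriticalCves` and `maxHighCves`. It assumes a gate passes while each count is less than or equal to its maximum; the actual policy engine semantics are not documented here, and the function name is hypothetical.

```bash
# cve_threshold_gate CRITICAL HIGH [MAX_CRITICAL] [MAX_HIGH]
# Prints PASS or BLOCKED. Defaults mirror the snapshot above (0 and 5).
cve_threshold_gate() {
  local critical="$1" high="$2" max_critical="${3:-0}" max_high="${4:-5}"
  if (( critical > max_critical || high > max_high )); then
    echo "BLOCKED"
  else
    echo "PASS"
  fi
}
```

With the snapshot values, `cve_threshold_gate 0 5` prints `PASS`, while a single critical CVE or a sixth high CVE yields `BLOCKED`.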
### policy/gate-decision.json
|
||||
|
||||
Detailed breakdown of each gate evaluation:
|
||||
|
||||
```json
|
||||
{
|
||||
"gates": [
|
||||
{
|
||||
"gateId": "vex-trust",
|
||||
"decision": "PASS",
|
||||
"inputs": {
|
||||
"vexStatements": 3,
|
||||
"trustScore": 0.85,
|
||||
"threshold": 0.70
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
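An auditor can re-check the recorded inputs against the threshold by hand. The sketch below assumes the comparison is inclusive, as the verdict reason "Trust score 0.85 >= 0.70 threshold" suggests. `awk` does the comparison because bash arithmetic is integer-only. The function name is illustrative.

```bash
# meets_threshold SCORE THRESHOLD -> exit 0 when SCORE >= THRESHOLD.
meets_threshold() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

For the example above, `meets_threshold 0.85 0.70 && echo PASS` prints `PASS`.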
## Step 5: Replay Verification (Optional)
|
||||
|
||||
For maximum assurance, you can replay the verdict evaluation.
|
||||
|
||||
### Using Stella CLI
|
||||
|
||||
```bash
|
||||
cd audit-bundle-sha256-abc123/
|
||||
stella replay snapshot --manifest replay/knowledge-snapshot.json
|
||||
```
|
||||
|
||||
This re-evaluates the policy using the frozen inputs and should produce an identical verdict.
|
||||
|
||||
### Manual Replay Steps
|
||||
|
||||
See `replay/replay-instructions.md` for detailed steps.
|
||||
|
||||
## Compliance Mapping
|
||||
|
||||
| Compliance Framework | Relevant Bundle Components |
|
||||
|---------------------|---------------------------|
|
||||
| **SOC 2 (CC7.1)** | verdict/, policy/ |
|
||||
| **ISO 27001 (A.12.6)** | evidence/sbom.json |
|
||||
| **FedRAMP** | All components |
|
||||
| **SLSA Level 3** | evidence/provenance/ |
|
||||
|
||||
## Common Questions
|
||||
|
||||
### Q: Why was this artifact blocked?
|
||||
|
||||
Review `policy/gate-decision.json` for the specific gate that failed and its reason.
|
||||
|
||||
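A quick way to find the failing gate is to filter the file for anything that is not `PASS`. This sketch assumes `jq` is installed and that every gate entry carries the `gateId` and `decision` fields shown in Step 4; the function name is illustrative.

```bash
# failed_gates FILE -> print "gateId<TAB>decision" for every non-PASS gate.
failed_gates() {
  jq -r '.gates[] | select(.decision != "PASS") | [.gateId, .decision] | @tsv' "$1"
}
```

Running `failed_gates policy/gate-decision.json` prints nothing when all gates passed.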
### Q: How do I verify the SBOM is accurate?
|
||||
|
||||
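The manifest records the SBOM digest. Recompute the digest of `evidence/sbom.json` to confirm it is the file that was evaluated, then compare its contents against the output of the organization's SBOM generation process.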
|
||||
|
||||
### Q: What if replay produces a different result?
|
||||
|
||||
This may indicate:
|
||||
1. Policy version mismatch
|
||||
2. Missing evidence files
|
||||
3. Time-dependent policy rules
|
||||
|
||||
Contact the organization's security team for clarification.
|
||||
|
||||
### Q: How long should audit bundles be retained?
|
||||
|
||||
Stella Ops recommends:
|
||||
- Production releases: 5 years minimum
|
||||
- Security-critical systems: 7 years
|
||||
- Regulated industries: Per compliance requirements
|
||||
|
||||
## Support
|
||||
|
||||
For questions about this audit bundle:
|
||||
1. Contact the organization's Stella Ops administrator
|
||||
2. Reference the Bundle ID from `manifest.json`
|
||||
3. Include the artifact digest
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
112
docs/operations/runbooks/COVERAGE.md
Normal file
@@ -0,0 +1,112 @@
|
||||
# Runbook Coverage Tracking
|
||||
|
||||
This document tracks operational runbook coverage across Stella Ops modules.
|
||||
|
||||
**Target:** 80% coverage of critical failure modes before the operability moat is declared achieved.
|
||||
|
||||
---
|
||||
|
||||
## Coverage Summary
|
||||
|
||||
| Module | Critical Failures | Runbooks | Coverage | Status |
|
||||
|--------|-------------------|----------|----------|--------|
|
||||
| Scanner | 5 | 0 | 0% | 🔴 Gap |
|
||||
| Policy Engine | 5 | 0 | 0% | 🔴 Gap |
|
||||
| Release Orchestrator | 5 | 0 | 0% | 🔴 Gap |
|
||||
| Attestor | 5 | 0 | 0% | 🔴 Gap |
|
||||
| Feed Connectors | 4 | 0 | 0% | 🔴 Gap |
|
||||
| **Database (Postgres)** | 4 | 4 | 100% | ✅ Complete |
|
||||
| **Crypto Subsystem** | 4 | 4 | 100% | ✅ Complete |
|
||||
| **Evidence Locker** | 4 | 4 | 100% | ✅ Complete |
|
||||
| **Backup/Restore** | 4 | 4 | 100% | ✅ Complete |
|
||||
| Authority (OAuth/OIDC) | 3 | 0 | 0% | 🔴 Gap |
|
||||
| **Overall** | **43** | **16** | **37%** | 🟡 In Progress |
|
||||
|
||||
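The overall figure follows from the totals in the table: 16 runbooks for 43 critical failure modes is about 37%. Reaching the 80% target requires 35 runbooks, which is 19 more than exist today. The helper names below are illustrative, and rounding to the nearest whole percent is an assumption about how the table was filled in.

```bash
# coverage_pct RUNBOOKS FAILURES -> percentage rounded to the nearest integer.
coverage_pct() {
  local runbooks="$1" failures="$2"
  (( failures > 0 )) || { echo 0; return; }
  echo $(( (runbooks * 100 + failures / 2) / failures ))
}

# runbooks_for_target FAILURES TARGET_PCT -> minimum runbooks to hit the target.
runbooks_for_target() {
  local failures="$1" target="$2"
  echo $(( (failures * target + 99) / 100 ))   # integer ceiling
}
```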
---
|
||||
|
||||
## Available Runbooks
|
||||
|
||||
### Database Operations
|
||||
- [postgres-ops.md](postgres-ops.md) - PostgreSQL database operations
|
||||
|
||||
### Crypto Subsystem
|
||||
- [crypto-ops.md](crypto-ops.md) - Regional crypto operations (FIPS, eIDAS, GOST, SM)
|
||||
|
||||
### Evidence Locker
|
||||
- [evidence-locker-ops.md](evidence-locker-ops.md) - Evidence locker operations
|
||||
|
||||
### Backup/Restore
|
||||
- [backup-restore-ops.md](backup-restore-ops.md) - Backup and restore procedures
|
||||
|
||||
### Vulnerability Operations
|
||||
- [vuln-ops.md](vuln-ops.md) - Vulnerability management operations
|
||||
|
||||
### VEX Operations
|
||||
- [vex-ops.md](vex-ops.md) - VEX statement operations
|
||||
|
||||
### Policy Incidents
|
||||
- [policy-incident.md](policy-incident.md) - Policy-related incident response
|
||||
|
||||
---
|
||||
|
||||
## Gap Analysis
|
||||
|
||||
### High Priority Gaps (Critical modules without runbooks)
|
||||
|
||||
1. **Scanner** - Core scanning functionality
|
||||
- Worker stuck
|
||||
- OOM on large images
|
||||
- Registry auth failures
|
||||
|
||||
2. **Policy Engine** - Policy evaluation
|
||||
- Slow evaluation
|
||||
- OPA crashes
|
||||
- Compilation failures
|
||||
|
||||
3. **Release Orchestrator** - Promotion workflow
|
||||
- Stuck promotions
|
||||
- Gate timeouts
|
||||
- Missing evidence
|
||||
|
||||
### Medium Priority Gaps
|
||||
|
||||
4. **Attestor** - Signing and verification
|
||||
- Signing failures
|
||||
- Key expiration
|
||||
- Rekor unavailability
|
||||
|
||||
5. **Feed Connectors** - Advisory feeds
|
||||
- NVD failures
|
||||
- Rate limiting
|
||||
- Offline bundle issues
|
||||
|
||||
### Lower Priority Gaps
|
||||
|
||||
6. **Authority** - Authentication
|
||||
- Token validation failures
|
||||
- OIDC provider issues
|
||||
|
||||
---
|
||||
|
||||
## Template
|
||||
|
||||
New runbooks should use the template: [_template.md](_template.md)
|
||||
|
||||
---
|
||||
|
||||
## Doctor Check Integration
|
||||
|
||||
Runbooks should be linked from Doctor check output. Current integration status:
|
||||
|
||||
| Module | Doctor Checks | Linked to Runbook |
|
||||
|--------|---------------|-------------------|
|
||||
| Postgres | 4 | 0 |
|
||||
| Crypto | 8 | 0 |
|
||||
| Storage | 3 | 0 |
|
||||
| Evidence | 4 | 0 |
|
||||
|
||||
**Next step:** Update Doctor check implementations to include runbook links in remediation output.
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
157
docs/operations/runbooks/_template.md
Normal file
@@ -0,0 +1,157 @@
|
||||
# Runbook: [Component] - [Failure Scenario]
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-001 - Runbook Template
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
|
||||
| **Severity** | Critical / High / Medium / Low |
|
||||
| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
|
||||
| **Last updated** | [YYYY-MM-DD] |
|
||||
| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
Observable indicators that this failure is occurring:
|
||||
|
||||
- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
|
||||
- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
|
||||
- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
|
||||
| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
|
||||
| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks (< 2 minutes)
|
||||
|
||||
Run these first to confirm the failure:
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check [relevant-check-id]
|
||||
```
|
||||
|
||||
2. **Check service status:**
|
||||
```bash
|
||||
stella [component] status
|
||||
```
|
||||
|
||||
3. **Check recent logs:**
|
||||
```bash
|
||||
stella [component] logs --tail 50 --level error
|
||||
```
|
||||
|
||||
### Deep diagnosis (if quick checks inconclusive)
|
||||
|
||||
1. **[Investigation step 1]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
Expected output: [description]
|
||||
If unexpected: [what it means]
|
||||
|
||||
2. **[Investigation step 2]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
3. **Check related services:**
|
||||
- Postgres connectivity: `stella doctor --check check.postgres.connectivity`
|
||||
- Valkey connectivity: `stella doctor --check check.storage.valkey`
|
||||
- Network connectivity: `stella doctor --check check.network.[target]`
|
||||
|
||||
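When several related checks need to run in one pass, a small loop keeps the output together. This sketch assumes `stella doctor --check <id>` exits with a non-zero status when a check fails, which is not confirmed here. `DOCTOR_CMD` is a hypothetical override used only to make the function testable without the CLI. Call it as `run_doctor_checks <check-id> [<check-id> ...]`.

```bash
# run_doctor_checks CHECK_ID...
# Runs each check in turn, reports failed IDs on stderr, and returns
# non-zero if any check failed.
run_doctor_checks() {
  local -a cmd
  read -r -a cmd <<< "${DOCTOR_CMD:-stella doctor --check}"
  local id failed=0
  for id in "$@"; do
    if ! "${cmd[@]}" "$id"; then
      echo "FAILED: $id" >&2
      failed=1
    fi
  done
  return "$failed"
}
```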
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation (restore service quickly)
|
||||
|
||||
Use these steps to restore service, even if the root cause is not fixed yet:
|
||||
|
||||
1. **[Mitigation step 1]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
This will: [explanation]
|
||||
|
||||
2. **[Mitigation step 2]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
Once service is restored, address the underlying issue:
|
||||
|
||||
1. **[Fix step 1]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
2. **[Fix step 2]:**
|
||||
```bash
|
||||
[command]
|
||||
```
|
||||
|
||||
3. **Verify fix is complete:**
|
||||
```bash
|
||||
stella doctor --check [relevant-check-id]
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
Confirm the issue is fully resolved:
|
||||
|
||||
```bash
|
||||
# Re-run the failing operation
|
||||
stella [component] [test-command]
|
||||
|
||||
# Verify metrics are healthy
|
||||
stella obs metrics --filter [component] --last 5m
|
||||
|
||||
# Verify no new errors in logs
|
||||
stella [component] logs --tail 20 --level error
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
How to prevent this failure from recurring:
|
||||
|
||||
- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
|
||||
- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
|
||||
- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
|
||||
- [ ] **Documentation:** [e.g., "Update capacity planning guide"]
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture doc:** [Link to relevant architecture documentation]
|
||||
- **Related runbooks:** [Links to related failure scenarios]
|
||||
- **Doctor check source:** [Link to Doctor check implementation]
|
||||
- **Grafana dashboard:** [Link to relevant dashboard]
|
||||
|
||||
---
|
||||
|
||||
## Revision History
|
||||
|
||||
| Date | Author | Changes |
|
||||
|------|--------|---------|
|
||||
| YYYY-MM-DD | [Name] | Initial version |
|
||||
193
docs/operations/runbooks/attestor-hsm-connection.md
Normal file
@@ -0,0 +1,193 @@
|
||||
# Runbook: Attestor - HSM Connection Issues
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor / Cryptography |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.crypto.hsm-availability` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Signing operations failing with "HSM unavailable"
|
||||
- [ ] Alert `AttestorHsmConnectionFailed` firing
|
||||
- [ ] Error: "PKCS#11 operation failed" or "HSM session timeout"
|
||||
- [ ] Attestations cannot be created
|
||||
- [ ] Key operations (sign, verify) failing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | No attestations can be signed; releases blocked |
|
||||
| **Data integrity** | Keys are safe in HSM; operations resume when connection restored |
|
||||
| **SLA impact** | All signing operations blocked; compliance posture at risk |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.crypto.hsm-availability
|
||||
```
|
||||
|
||||
2. **Check HSM connection status:**
|
||||
```bash
|
||||
stella crypto hsm status
|
||||
```
|
||||
|
||||
3. **Test HSM connectivity:**
|
||||
```bash
|
||||
stella crypto hsm test
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check PKCS#11 library status:**
|
||||
```bash
|
||||
stella crypto hsm pkcs11-status
|
||||
```
|
||||
Look for: Library loaded, slot available, session active
|
||||
|
||||
2. **Check HSM network connectivity:**
|
||||
```bash
|
||||
stella crypto hsm ping
|
||||
```
|
||||
|
||||
3. **Check HSM session logs:**
|
||||
```bash
|
||||
stella crypto hsm logs --last 30m
|
||||
```
|
||||
Look for: Session errors, timeout, authentication failures
|
||||
|
||||
4. **Check HSM slot status:**
|
||||
```bash
|
||||
stella crypto hsm slots list
|
||||
```
|
||||
Problem if: Slot not found, slot busy, token not present
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Attempt HSM reconnection:**
|
||||
```bash
|
||||
stella crypto hsm reconnect
|
||||
```
|
||||
|
||||
2. **If HSM unreachable, switch to software signing (if permitted):**
|
||||
```bash
|
||||
stella attest config set signing.mode software
|
||||
stella attest reload
|
||||
```
|
||||
**Warning:** Software signing may not meet compliance requirements
|
||||
|
||||
3. **Use backup HSM if configured:**
|
||||
```bash
|
||||
stella crypto hsm failover --to backup
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If network connectivity issue:**
|
||||
|
||||
1. Check HSM network path:
|
||||
```bash
|
||||
stella crypto hsm connectivity --verbose
|
||||
```
|
||||
|
||||
2. Verify firewall rules allow the HSM port (typically 1792 for Thales, formerly SafeNet, Luna NTLS; other HSMs use different ports, so confirm against the vendor documentation)
|
||||
|
||||
3. Check HSM server status with vendor tools
|
||||
|
||||
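A plain TCP probe can help separate firewall problems from HSM-side problems. The sketch below uses bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command; it only confirms that a TCP connection can be opened, not that the HSM service behind the port is healthy. The function name is illustrative.

```bash
# tcp_reachable HOST PORT [TIMEOUT_SECONDS]
# Exit 0 if a TCP connection opens within the timeout (default 3 seconds).
tcp_reachable() {
  local host="$1" port="$2" wait="${3:-3}"
  # $0 and $1 inside the child shell are HOST and PORT respectively.
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `tcp_reachable hsm.example.internal 1792 || echo "port blocked or host down"`, where the hostname is a placeholder for the real HSM address.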
**If session timeout:**
|
||||
|
||||
1. Increase session timeout:
|
||||
```bash
|
||||
stella crypto hsm config set session.timeout 300s
|
||||
stella crypto hsm reconnect
|
||||
```
|
||||
|
||||
2. Enable session keep-alive:
|
||||
```bash
|
||||
stella crypto hsm config set session.keepalive true
|
||||
stella crypto hsm config set session.keepalive_interval 60s
|
||||
```
|
||||
|
||||
**If authentication failed:**
|
||||
|
||||
1. Verify HSM credentials:
|
||||
```bash
|
||||
stella crypto hsm auth verify
|
||||
```
|
||||
|
||||
2. Update HSM PIN if changed:
|
||||
```bash
|
||||
stella crypto hsm auth update --slot <slot-id>
|
||||
```
|
||||
|
||||
**If PKCS#11 library issue:**
|
||||
|
||||
1. Verify library path:
|
||||
```bash
|
||||
stella crypto hsm config get pkcs11.library_path
|
||||
```
|
||||
|
||||
2. Reload PKCS#11 library:
|
||||
```bash
|
||||
stella crypto hsm pkcs11-reload
|
||||
```
|
||||
|
||||
3. Check library compatibility:
|
||||
```bash
|
||||
stella crypto hsm pkcs11-info
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test HSM connectivity
|
||||
stella crypto hsm test
|
||||
|
||||
# Test signing operation
|
||||
stella attest test-sign
|
||||
|
||||
# Verify key access
|
||||
stella keys verify <key-id> --operation sign
|
||||
|
||||
# Check no errors in logs
|
||||
stella crypto hsm logs --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Redundancy:** Configure backup HSM for failover
|
||||
- [ ] **Monitoring:** Alert on HSM connection failures immediately
|
||||
- [ ] **Keep-alive:** Enable session keep-alive to prevent timeouts
|
||||
- [ ] **Testing:** Include HSM health in regular health checks
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/cryptography/hsm-integration.md`
|
||||
- **Related runbooks:** `attestor-signing-failed.md`, `crypto-ops.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Crypto/`
|
||||
- **HSM setup:** `docs/operations/hsm-configuration.md`
|
||||
190
docs/operations/runbooks/attestor-key-expired.md
Normal file
@@ -0,0 +1,190 @@
|
||||
# Runbook: Attestor - Signing Key Expired
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.key-expiration` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation creation failing with "key expired" error
|
||||
- [ ] Alert `AttestorKeyExpired` firing
|
||||
- [ ] Error: "signing key certificate has expired"
|
||||
- [ ] New attestations cannot be created
|
||||
- [ ] Verification of new attestations failing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | No new attestations can be signed; releases blocked |
|
||||
| **Data integrity** | Existing attestations remain valid; new ones cannot be created |
|
||||
| **SLA impact** | Release SLO violated; compliance posture compromised |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.key-expiration
|
||||
```
|
||||
|
||||
2. **List signing keys and expiration:**
|
||||
```bash
|
||||
stella keys list --type signing --show-expiration
|
||||
```
|
||||
Look for: Keys with status "expired" or expiring soon
|
||||
|
||||
3. **Check active signing key:**
|
||||
```bash
|
||||
stella attest config get signing.key_id
|
||||
stella keys show <key-id> --details
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check certificate chain validity:**
|
||||
```bash
|
||||
stella crypto cert verify-chain --key <key-id>
|
||||
```
|
||||
Problem if: Any certificate in chain expired
|
||||
|
||||
2. **Check for backup keys:**
|
||||
```bash
|
||||
stella keys list --type signing --status inactive
|
||||
```
|
||||
Look for: Unexpired backup keys that can be activated
|
||||
|
||||
3. **Check key rotation history:**
|
||||
```bash
|
||||
stella keys rotation-history --key <key-id>
|
||||
```
|
||||
|
||||
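If the signing certificate is available as a PEM file, its expiry can be cross-checked independently with OpenSSL's `x509 -checkend` option. How the certificate is exported from Stella Ops is not covered here, so the PEM path is an assumption, as is the function name. Note that a file OpenSSL cannot parse is also reported as expiring.

```bash
# cert_expires_within PEM_FILE DAYS
#   exit 0 -> certificate expires within DAYS (or has already expired)
#   exit 1 -> certificate stays valid for longer than DAYS
#   exit 2 -> file missing or unreadable
cert_expires_within() {
  local pem="$1" days="$2"
  [ -r "$pem" ] || return 2
  # -checkend N succeeds when the certificate is still valid N seconds from now.
  if openssl x509 -in "$pem" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```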
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If backup key available, activate it:**
|
||||
```bash
|
||||
stella keys activate <backup-key-id>
|
||||
stella attest config set signing.key_id <backup-key-id>
|
||||
stella attest reload
|
||||
```
|
||||
|
||||
2. **Verify signing works:**
|
||||
```bash
|
||||
stella attest test-sign
|
||||
```
|
||||
|
||||
3. **Retry failed attestations:**
|
||||
```bash
|
||||
stella attest retry --failed --last 1h
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**Generate new signing key:**
|
||||
|
||||
1. Generate new key pair:
|
||||
```bash
|
||||
stella keys generate \
|
||||
--type signing \
|
||||
--algorithm ecdsa-p256 \
|
||||
--validity 365d \
|
||||
--name "signing-key-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
2. If using HSM:
|
||||
```bash
|
||||
stella keys generate \
|
||||
--type signing \
|
||||
--algorithm ecdsa-p256 \
|
||||
--validity 365d \
|
||||
--hsm-slot <slot> \
|
||||
--name "signing-key-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
3. Register the new key:
|
||||
```bash
|
||||
stella keys register <new-key-id> --purpose attestation-signing
|
||||
```
|
||||
|
||||
4. Update signing configuration:
|
||||
```bash
|
||||
stella attest config set signing.key_id <new-key-id>
|
||||
stella attest reload
|
||||
```
|
||||
|
||||
5. Publish new public key to trust anchors:
|
||||
```bash
|
||||
stella issuer keys publish <new-key-id>
|
||||
```
|
||||
|
||||
**Configure automatic rotation:**
|
||||
|
||||
1. Enable auto-rotation:
|
||||
```bash
|
||||
stella keys config set rotation.auto true
|
||||
stella keys config set rotation.before_expiry 30d
|
||||
stella keys config set rotation.overlap_days 14
|
||||
```
|
||||
|
||||
2. Set up rotation alerts:
|
||||
```bash
|
||||
stella keys config set alerts.expiring_days 30
|
||||
stella keys config set alerts.expiring_days_critical 7
|
||||
```
|
||||
|
||||
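The two alert thresholds configured above imply a simple classification of each key by days remaining. The sketch below is illustrative only and assumes the boundaries are inclusive (exactly 30 days left counts as a warning, exactly 7 as critical); the function name is hypothetical.

```bash
# expiry_severity DAYS_LEFT [WARN_DAYS] [CRIT_DAYS]
# Prints one of: ok, warning, critical, expired.
expiry_severity() {
  local days_left="$1" warn="${2:-30}" crit="${3:-7}"
  if (( days_left < 0 )); then
    echo "expired"
  elif (( days_left <= crit )); then
    echo "critical"
  elif (( days_left <= warn )); then
    echo "warning"
  else
    echo "ok"
  fi
}
```

For example, `expiry_severity 12` prints `warning` and `expiry_severity 3` prints `critical`.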
### Verification
|
||||
|
||||
```bash
|
||||
# Verify new key is active
|
||||
stella keys list --type signing --status active
|
||||
|
||||
# Test signing
|
||||
stella attest test-sign
|
||||
|
||||
# Create test attestation
|
||||
stella attest create --type test --subject "test:key-rotation"
|
||||
|
||||
# Verify the attestation
|
||||
stella verify attestation --last
|
||||
|
||||
# Check key expiration
|
||||
stella keys show <new-key-id> --details | grep -i expir
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Rotation:** Enable automatic key rotation 30 days before expiry
|
||||
- [ ] **Monitoring:** Alert on keys expiring within 30 days (warning) and 7 days (critical)
|
||||
- [ ] **Backup:** Maintain at least one backup signing key
|
||||
- [ ] **Documentation:** Document key rotation procedures and approval process
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/architecture.md`
|
||||
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-hsm-connection.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
|
||||
- **Key management:** `docs/operations/key-management.md`
|
||||
184
docs/operations/runbooks/attestor-rekor-unavailable.md
Normal file
@@ -0,0 +1,184 @@
|
||||
# Runbook: Attestor - Rekor Transparency Log Unreachable
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.rekor-connectivity` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation transparency logging failing
|
||||
- [ ] Alert `AttestorRekorUnavailable` firing
|
||||
- [ ] Error: "Rekor server unavailable" or "transparency log submission failed"
|
||||
- [ ] Attestations created but not anchored to transparency log
|
||||
- [ ] Verification failing due to missing log entry
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Attestations not publicly verifiable via transparency log |
|
||||
| **Data integrity** | Attestations still valid locally; transparency reduced |
|
||||
| **SLA impact** | Compliance may require transparency log anchoring |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.rekor-connectivity
|
||||
```
|
||||
|
||||
2. **Check Rekor connectivity:**
|
||||
```bash
|
||||
stella attest rekor status
|
||||
```
|
||||
|
||||
3. **Test Rekor endpoint:**
|
||||
```bash
|
||||
stella attest rekor ping
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check Rekor server URL:**
|
||||
```bash
|
||||
stella attest config get rekor.url
|
||||
```
|
||||
Default: https://rekor.sigstore.dev
|
||||
|
||||
2. **Check for public Rekor outage:**
|
||||
```bash
|
||||
stella attest rekor api-status
|
||||
```
|
||||
Also check: https://status.sigstore.dev/
|
||||
|
||||
3. **Check network/proxy issues:**
|
||||
```bash
|
||||
stella attest rekor test --verbose
|
||||
```
|
||||
Look for: TLS errors, proxy blocks, timeout
|
||||
|
||||
4. **Check pending log entries:**
|
||||
```bash
|
||||
stella attest rekor pending-entries
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Queue attestations for later submission:**
|
||||
```bash
|
||||
stella attest config set rekor.queue_on_failure true
|
||||
stella attest reload
|
||||
```
|
||||
|
||||
2. **Disable Rekor requirement temporarily:**
|
||||
```bash
|
||||
stella attest config set rekor.required false
|
||||
stella attest reload
|
||||
```
|
||||
**Warning:** Reduces transparency guarantees
|
||||
|
||||
3. **Use private Rekor instance if available:**
|
||||
```bash
|
||||
stella attest config set rekor.url https://rekor.internal.example.com
|
||||
stella attest reload
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If public Rekor outage:**
|
||||
|
||||
1. Wait for Sigstore to resolve the issue
|
||||
2. Check status at https://status.sigstore.dev/
|
||||
3. Process queued entries when service recovers:
|
||||
```bash
|
||||
stella attest rekor process-queue
|
||||
```
|
||||
|
||||
**If network/firewall issue:**
|
||||
|
||||
1. Verify outbound HTTPS to rekor.sigstore.dev:
|
||||
```bash
|
||||
stella attest rekor connectivity --verbose
|
||||
```
|
||||
|
||||
2. Configure proxy if required:
|
||||
```bash
|
||||
stella attest config set rekor.proxy https://proxy:8080
|
||||
```
|
||||
|
||||
3. Add Rekor endpoints to firewall allowlist:
|
||||
- rekor.sigstore.dev:443
|
||||
- fulcio.sigstore.dev:443 (for certificate issuance)
|
||||
|
||||
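Outbound reachability can also be confirmed without the Stella CLI by calling Rekor's log-info endpoint directly. The sketch assumes a Rekor v1 server, which exposes `GET /api/v1/log`, and that `curl` is installed; the function name is illustrative. `curl` honors the standard `HTTPS_PROXY` environment variable, so the same probe works from behind a proxy.

```bash
# rekor_reachable [BASE_URL]
# Exit 0 if the Rekor log-info endpoint answers with a success status.
rekor_reachable() {
  local base_url="${1:-https://rekor.sigstore.dev}"
  curl --fail --silent --show-error --max-time 10 \
    "${base_url%/}/api/v1/log" >/dev/null
}
```

For a private instance, pass its base URL, for example `rekor_reachable https://rekor.internal.example.com`.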
**If TLS certificate issue:**
|
||||
|
||||
1. Check certificate validity:
|
||||
```bash
|
||||
stella attest rekor cert-check
|
||||
```
|
||||
|
||||
2. Update CA certificates:
|
||||
```bash
|
||||
stella crypto ca update
|
||||
```
|
||||
|
||||
**If private Rekor instance issue:**
|
||||
|
||||
1. Check private Rekor server status
|
||||
2. Verify Rekor database health
|
||||
3. Check Rekor signer availability
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test Rekor connectivity
|
||||
stella attest rekor ping
|
||||
|
||||
# Submit test entry
|
||||
stella attest rekor test-submit
|
||||
|
||||
# Process any queued entries
|
||||
stella attest rekor process-queue
|
||||
|
||||
# Verify recent attestation in log
|
||||
stella attest rekor lookup --attestation <attestation-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Redundancy:** Configure private Rekor instance as fallback
|
||||
- [ ] **Queuing:** Enable queue-on-failure for resilience
|
||||
- [ ] **Monitoring:** Alert on Rekor submission failures
|
||||
- [ ] **Offline:** Document attestation validity without Rekor for air-gap scenarios
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/transparency-log.md`
|
||||
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-verification-failed.md`
|
||||
- **Sigstore docs:** https://docs.sigstore.dev/
|
||||
- **Rekor setup:** `docs/operations/rekor-configuration.md`
|
||||
176
docs/operations/runbooks/attestor-signing-failed.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# Runbook: Attestor - Signature Generation Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.signing-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation requests failing with "signing failed" error
|
||||
- [ ] Alert `AttestorSigningFailed` firing
|
||||
- [ ] Evidence bundles missing signatures
|
||||
- [ ] Metric `attestor_signing_failures_total` increasing
|
||||
- [ ] Release pipeline blocked due to unsigned attestations
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Releases blocked; attestations cannot be created |
|
||||
| **Data integrity** | Evidence is recorded but unsigned; can be signed later |
|
||||
| **SLA impact** | Release SLO violated; evidence integrity compromised |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.signing-health
|
||||
```
|
||||
|
||||
2. **Check attestor service status:**
|
||||
```bash
|
||||
stella attest status
|
||||
```
|
||||
|
||||
3. **Check signing key availability:**
|
||||
```bash
|
||||
stella keys list --type signing --status active
|
||||
```
|
||||
Problem if: No active signing keys
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Test signing operation:**
|
||||
```bash
|
||||
stella attest test-sign --verbose
|
||||
```
|
||||
Look for: Specific error message
|
||||
|
||||
2. **Check key material access:**
|
||||
```bash
|
||||
stella keys verify <key-id> --operation sign
|
||||
```
|
||||
|
||||
3. **If using HSM, check HSM connectivity:**
|
||||
```bash
|
||||
stella doctor --check check.crypto.hsm-availability
|
||||
```
|
||||
|
||||
4. **Check for key expiration:**
|
||||
```bash
|
||||
stella keys list --expiring-within 7d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If key expired, rotate to backup key:**
|
||||
```bash
|
||||
stella keys activate <backup-key-id>
|
||||
stella attest config set signing.key_id <backup-key-id>
|
||||
```
|
||||
|
||||
2. **If HSM unavailable, switch to software signing (temporary):**
|
||||
```bash
|
||||
stella attest config set signing.mode software
|
||||
stella attest reload
|
||||
```
|
||||
⚠️ **Warning:** Software signing may not meet compliance requirements
|
||||
|
||||
3. **Retry failed attestations:**
|
||||
```bash
|
||||
stella attest retry --failed --last 1h
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If key expired:**
|
||||
|
||||
1. Generate new signing key:
|
||||
```bash
|
||||
stella keys generate --type signing --algorithm ecdsa-p256
|
||||
```
|
||||
|
||||
2. Configure key rotation schedule:
|
||||
```bash
|
||||
stella keys config set rotation.auto true
|
||||
stella keys config set rotation.overlap_days 14
|
||||
```
|
||||
|
||||
**If HSM connection failed:**
|
||||
|
||||
1. Verify HSM configuration:
|
||||
```bash
|
||||
stella crypto hsm verify
|
||||
```
|
||||
|
||||
2. Restart HSM connection:
|
||||
```bash
|
||||
stella crypto hsm reconnect
|
||||
```
|
||||
|
||||
**If certificate chain issue:**
|
||||
|
||||
1. Verify certificate chain:
|
||||
```bash
|
||||
stella crypto cert verify-chain --key <key-id>
|
||||
```
|
||||
|
||||
2. Update intermediate certificates:
|
||||
```bash
|
||||
stella crypto cert update-chain --key <key-id>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test signing
|
||||
stella attest test-sign
|
||||
|
||||
# Create test attestation
|
||||
stella attest create --type test --subject "test:verification"
|
||||
|
||||
# Verify the attestation
|
||||
stella verify attestation --last
|
||||
|
||||
# Check no failures in recent operations
|
||||
stella attest logs --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Key rotation:** Enable automatic key rotation with 14-day overlap
|
||||
- [ ] **Monitoring:** Alert on keys expiring within 30 days
|
||||
- [ ] **Backup:** Maintain backup signing key in different HSM slot
|
||||
- [ ] **Testing:** Include signing test in health check schedule
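
The 30-day alert and the 14-day rotation overlap above reduce to simple date arithmetic. The following is a minimal sketch, not part of the Stella CLI: `days_until_expiry` and `key_rotation_state` are hypothetical helper names, and the key expiry is assumed to be available as Unix epoch seconds (how `stella keys list` reports expiry is not shown in this runbook).

```bash
# Hypothetical helpers; both arguments are assumed to be Unix epoch seconds.
days_until_expiry() {
  # $1 = key expiry, $2 = current time; result is whole days, truncated
  echo $(( ($1 - $2) / 86400 ))
}

key_rotation_state() {
  # Prints one of: expired, rotate-now (<= 14 days), alert (<= 30 days), ok
  days=$(days_until_expiry "$1" "$2")
  if [ "$1" -le "$2" ]; then
    echo "expired"
  elif [ "$days" -le 14 ]; then
    echo "rotate-now"
  elif [ "$days" -le 30 ]; then
    echo "alert"
  else
    echo "ok"
  fi
}
```

A scheduled job could call `key_rotation_state "$expiry_epoch" "$(date +%s)"` and page only on `rotate-now` or `expired`.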
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/architecture.md`
|
||||
- **Related runbooks:** `attestor-key-expired.md`, `attestor-hsm-connection.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
|
||||
- **Dashboard:** Grafana > Stella Ops > Attestor
|
||||
195
docs/operations/runbooks/attestor-verification-failed.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Attestor - Attestation Verification Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-005 - Attestor Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Attestor |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.attestor.verification-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Attestation verification failing
|
||||
- [ ] Alert `AttestorVerificationFailed` firing
|
||||
- [ ] Error: "signature verification failed" or "invalid attestation"
|
||||
- [ ] Promotions blocked due to failed verification
|
||||
- [ ] Error: "trust anchor not found" or "certificate chain invalid"
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Artifacts cannot be promoted; release blocked |
|
||||
| **Data integrity** | May indicate tampered attestation or configuration issue |
|
||||
| **SLA impact** | Release pipeline blocked until resolved |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.attestor.verification-health
|
||||
```
|
||||
|
||||
2. **Verify specific attestation:**
|
||||
```bash
|
||||
stella verify attestation --attestation <attestation-id> --verbose
|
||||
```
|
||||
|
||||
3. **Check trust anchors:**
|
||||
```bash
|
||||
stella trust-anchors list
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check attestation details:**
|
||||
```bash
|
||||
stella attest show <attestation-id> --details
|
||||
```
|
||||
Look for: Signer identity, timestamp, subject
|
||||
|
||||
2. **Verify certificate chain:**
|
||||
```bash
|
||||
stella verify cert-chain --attestation <attestation-id>
|
||||
```
|
||||
Problem if: Intermediate cert missing, root not trusted
|
||||
|
||||
3. **Check public key availability:**
|
||||
```bash
|
||||
stella keys show <key-id> --public
|
||||
```
|
||||
|
||||
4. **Check if issuer is trusted:**
|
||||
```bash
|
||||
stella issuer trust-status <issuer-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If trust anchor missing, add it:**
|
||||
```bash
|
||||
stella trust-anchors add --cert <issuer-cert.pem>
|
||||
```
|
||||
|
||||
2. **If intermediate cert missing:**
|
||||
```bash
|
||||
stella trust-anchors add-intermediate --cert <intermediate.pem>
|
||||
```
|
||||
|
||||
3. **Re-verify with verbose output:**
|
||||
```bash
|
||||
stella verify attestation --attestation <attestation-id> --verbose
|
||||
```
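
The error strings listed under Symptoms can be mapped to a first command to try. This is a rough triage sketch, assuming the error text appears verbatim in the failure message; `suggest_verification_fix` is a hypothetical helper and the mapping is one reading of this runbook, not documented CLI behavior.

```bash
# Hypothetical triage helper: prints the runbook command to try first
# for a given verification error message.
suggest_verification_fix() {
  case $1 in
    *"trust anchor not found"*)
      echo "stella trust-anchors add --cert <issuer-cert.pem>" ;;
    *"certificate chain invalid"*)
      echo "stella verify cert-chain --attestation <attestation-id>" ;;
    *"signature verification failed"*)
      echo "stella attest integrity-check <attestation-id>" ;;
    *)
      # Unknown error: fall back to the verbose verification from Quick checks
      echo "stella verify attestation --attestation <attestation-id> --verbose" ;;
  esac
}
```

The helper only prints a suggestion; it never runs a command, so it is safe to call from alert annotations or chat tooling.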
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If signature mismatch:**
|
||||
|
||||
1. Check attestation wasn't modified:
|
||||
```bash
|
||||
stella attest integrity-check <attestation-id>
|
||||
```
|
||||
|
||||
2. If modified, regenerate attestation:
|
||||
```bash
|
||||
stella attest create --subject <digest> --type <type> --force
|
||||
```
|
||||
|
||||
**If key rotated and old key not trusted:**
|
||||
|
||||
1. Add old public key to trust anchors:
|
||||
```bash
|
||||
stella trust-anchors add-key --key <old-key.pem> --expires <date>
|
||||
```
|
||||
|
||||
2. Or fetch from issuer directory:
|
||||
```bash
|
||||
stella issuer keys fetch <issuer-id>
|
||||
```
|
||||
|
||||
**If certificate expired:**
|
||||
|
||||
1. Check certificate validity:
|
||||
```bash
|
||||
stella verify cert --attestation <attestation-id> --show-expiry
|
||||
```
|
||||
|
||||
2. Re-sign with valid certificate:
|
||||
```bash
|
||||
stella attest resign <attestation-id>
|
||||
```
|
||||
|
||||
**If issuer not trusted:**
|
||||
|
||||
1. Verify issuer identity:
|
||||
```bash
|
||||
stella issuer show <issuer-id>
|
||||
```
|
||||
|
||||
2. Add to trusted issuers (requires approval):
|
||||
```bash
|
||||
stella issuer trust <issuer-id> --reason "Approved by security team"
|
||||
```
|
||||
|
||||
**If algorithm not supported:**
|
||||
|
||||
1. Check algorithm:
|
||||
```bash
|
||||
stella attest show <attestation-id> | grep algorithm
|
||||
```
|
||||
|
||||
2. Verify crypto provider supports algorithm:
|
||||
```bash
|
||||
stella crypto providers list --algorithms
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify attestation
|
||||
stella verify attestation --attestation <attestation-id>
|
||||
|
||||
# Verify trust chain
|
||||
stella verify cert-chain --attestation <attestation-id>
|
||||
|
||||
# Test end-to-end verification
|
||||
stella verify artifact --digest <digest>
|
||||
|
||||
# Check no verification errors
|
||||
stella attest logs --filter "verification" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Trust anchors:** Keep trust anchor list current with all valid issuer certs
|
||||
- [ ] **Key rotation:** Plan key rotation with overlap period for verification continuity
|
||||
- [ ] **Monitoring:** Alert on verification failure rate > 0
|
||||
- [ ] **Testing:** Include verification tests in release pipeline
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/attestor/verification.md`
|
||||
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-key-expired.md`
|
||||
- **Trust management:** `docs/operations/trust-anchors.md`
|
||||
449
docs/operations/runbooks/backup-restore-ops.md
Normal file
@@ -0,0 +1,449 @@
|
||||
# Backup and Restore Operations Runbook
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
> **Task:** RUN-004 - Backup/Restore Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
|
||||
|
||||
---
|
||||
|
||||
## Backup Architecture Overview
|
||||
|
||||
### Backup Components
|
||||
|
||||
| Component | Backup Type | Default Schedule | Retention |
|
||||
|-----------|-------------|------------------|-----------|
|
||||
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
|
||||
| Evidence Locker | Incremental | Daily | 90 days |
|
||||
| Configuration | Snapshot | Daily + on change | 90 days |
|
||||
| Secrets | Encrypted snapshot | Daily | 30 days |
|
||||
| Attestation Keys | Encrypted export | Weekly | 1 year |
|
||||
|
||||
### Storage Locations
|
||||
|
||||
- **Primary:** `/var/lib/stellaops/backups/` (local)
|
||||
- **Secondary:** S3/Azure Blob/GCS (configurable)
|
||||
- **Offline:** Removable media for air-gap scenarios
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check backup service status
|
||||
stella backup status
|
||||
|
||||
# Verify backup storage
|
||||
stella doctor --check check.storage.backup
|
||||
|
||||
# List recent backups
|
||||
stella backup list --last 7d
|
||||
|
||||
# Test backup restore capability
|
||||
stella backup test-restore --latest --dry-run
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_backup_last_success_timestamp` - Last successful backup
|
||||
- `stella_backup_duration_seconds` - Backup duration
|
||||
- `stella_backup_size_bytes` - Backup size
|
||||
- `stella_restore_test_last_success` - Last restore test
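
A freshness check on `stella_backup_last_success_timestamp` is the simplest way to act on these metrics. The sketch below assumes the metric value is Unix epoch seconds, which this runbook does not state; `backup_is_stale` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds (exit 0) when the last successful backup
# is older than the allowed age.
backup_is_stale() {
  # $1 = last success (epoch seconds), $2 = now (epoch seconds), $3 = max age in hours
  age=$(( $2 - $1 ))
  [ "$age" -gt $(( $3 * 3600 )) ]
}
```

With a daily schedule, a threshold slightly above 24 hours (for example `backup_is_stale "$last" "$(date +%s)" 26`) avoids alerting when a backup simply runs a little longer than usual.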
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Create Manual Backup
|
||||
|
||||
**When:** Before upgrades, schema changes, or major configuration changes
|
||||
**Duration:** 5-30 minutes depending on data volume
|
||||
|
||||
1. Create full system backup:
|
||||
```bash
|
||||
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
2. Or create component-specific backup:
|
||||
```bash
|
||||
# Database only
|
||||
stella backup create --type database --name "db-pre-migration"
|
||||
|
||||
# Evidence locker only
|
||||
stella backup create --type evidence --name "evidence-snapshot"
|
||||
|
||||
# Configuration only
|
||||
stella backup create --type config --name "config-backup"
|
||||
```
|
||||
|
||||
3. Verify backup:
|
||||
```bash
|
||||
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
```
|
||||
|
||||
4. Copy to offsite storage (recommended):
|
||||
```bash
|
||||
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
|
||||
```
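
Steps 1, 3, and 4 each recompute the backup name with `$(date +%Y%m%d)`, which yields a different name if the procedure crosses midnight. A minimal sketch that computes the name once and stops at the first failing step is shown below; `pre_upgrade_backup` is a hypothetical wrapper, and it assumes the `stella backup` subcommands exit non-zero on failure.

```bash
# Hypothetical wrapper around SP-001: create, verify, then copy offsite.
pre_upgrade_backup() {
  # $1 = offsite destination, for example s3://backup-bucket/
  name="pre-upgrade-$(date +%Y%m%d)"   # computed once and reused by every step
  stella backup create --full --name "$name" &&
    stella backup verify --name "$name" &&
    stella backup copy --name "$name" --destination "$1"
}
```

Usage: `pre_upgrade_backup s3://backup-bucket/`. The offsite copy runs only if verification succeeds.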
|
||||
|
||||
### SP-002: Verify Backup Integrity
|
||||
|
||||
**Frequency:** Weekly
|
||||
**Duration:** 15-60 minutes
|
||||
|
||||
1. List backups for verification:
|
||||
```bash
|
||||
stella backup list --unverified
|
||||
```
|
||||
|
||||
2. Verify backup integrity:
|
||||
```bash
|
||||
# Verify specific backup
|
||||
stella backup verify --name <backup-name>
|
||||
|
||||
# Verify all unverified
|
||||
stella backup verify --all-unverified
|
||||
```
|
||||
|
||||
3. Test restore (non-destructive):
|
||||
```bash
|
||||
stella backup test-restore --name <backup-name> --target /tmp/restore-test
|
||||
```
|
||||
|
||||
4. Record verification result:
|
||||
```bash
|
||||
stella backup log-verification --name <backup-name> --result success
|
||||
```
|
||||
|
||||
### SP-003: Restore from Backup
|
||||
|
||||
**CAUTION: This is a destructive operation**
|
||||
|
||||
#### Full System Restore
|
||||
|
||||
1. Stop all services:
|
||||
```bash
|
||||
stella service stop --all
|
||||
```
|
||||
|
||||
2. List available backups:
|
||||
```bash
|
||||
stella backup list --type full
|
||||
```
|
||||
|
||||
3. Restore:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella backup restore --name <backup-name> --dry-run
|
||||
|
||||
# Execute restore
|
||||
stella backup restore --name <backup-name> --confirm
|
||||
```
|
||||
|
||||
4. Start services:
|
||||
```bash
|
||||
stella service start --all
|
||||
```
|
||||
|
||||
5. Verify restoration:
|
||||
```bash
|
||||
stella doctor --all
|
||||
stella service health
|
||||
```
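
The five steps above can be chained so that a failure at any point halts the sequence. This is a sketch only: `full_restore` is a hypothetical wrapper that keeps the order used in this runbook and assumes each `stella` command exits non-zero on failure.

```bash
# Hypothetical wrapper around the Full System Restore steps.
full_restore() {
  # $1 = backup name taken from `stella backup list --type full`
  stella service stop --all &&
    stella backup restore --name "$1" --dry-run &&
    stella backup restore --name "$1" --confirm &&
    stella service start --all &&
    stella doctor --all &&
    stella service health
}
```

Note the design choice: if the dry run or the restore fails, services stay stopped so that nothing writes to a partially restored system. Restart them manually once the cause is understood.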
|
||||
|
||||
#### Component-Specific Restore
|
||||
|
||||
1. Database restore:
|
||||
```bash
|
||||
stella service stop --service api,release-orchestrator
|
||||
stella backup restore --type database --name <backup-name> --confirm
|
||||
stella db migrate # Apply any pending migrations
|
||||
stella service start --service api,release-orchestrator
|
||||
```
|
||||
|
||||
2. Evidence locker restore:
|
||||
```bash
|
||||
stella backup restore --type evidence --name <backup-name> --confirm
|
||||
stella evidence verify --mode quick
|
||||
```
|
||||
|
||||
3. Configuration restore:
|
||||
```bash
|
||||
stella backup restore --type config --name <backup-name> --confirm
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
### SP-004: Point-in-Time Recovery (Database)
|
||||
|
||||
1. Identify target recovery point:
|
||||
```bash
|
||||
# List WAL archives
|
||||
stella backup wal-list --after <start-date> --before <end-date>
|
||||
```
|
||||
|
||||
2. Perform PITR:
|
||||
```bash
|
||||
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
|
||||
```
|
||||
|
||||
3. Verify data state:
|
||||
```bash
|
||||
stella db verify-integrity
|
||||
```
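
Because a mistyped recovery target is costly, it can help to check the timestamp shape before passing it to `--to-time`. The sketch below only validates the `YYYY-MM-DDThh:mm:ssZ` form used in the example above; `is_utc_timestamp` is a hypothetical helper, and whether `restore-pitr` accepts other formats is not covered by this runbook.

```bash
# Hypothetical helper: succeeds only for timestamps shaped like
# 2026-01-17T10:30:00Z. It checks the shape, not the calendar validity.
is_utc_timestamp() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

Usage: `is_utc_timestamp "$target" && stella backup restore-pitr --to-time "$target" --confirm`.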
|
||||
|
||||
---
|
||||
|
||||
## Backup Schedules
|
||||
|
||||
### Configure Backup Schedule
|
||||
|
||||
```bash
|
||||
# View current schedule
|
||||
stella backup schedule show
|
||||
|
||||
# Set database backup schedule
|
||||
stella backup schedule set --type database --cron "0 2 * * *"
|
||||
|
||||
# Set evidence backup schedule
|
||||
stella backup schedule set --type evidence --cron "0 3 * * *"
|
||||
|
||||
# Set configuration backup schedule
|
||||
stella backup schedule set --type config --cron "0 4 * * *" --on-change
|
||||
```
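
The `--cron` values above use the standard five-field cron layout (minute, hour, day of month, month, day of week), so `"0 2 * * *"` means 02:00 every day. A small sketch for labelling the fields of an expression before applying it follows; `describe_cron` is a hypothetical helper and assumes exactly five fields.

```bash
# Hypothetical helper: labels the five fields of a cron expression.
describe_cron() {
  # `read` splits on whitespace and does not glob-expand the `*` fields
  printf '%s\n' "$1" | {
    read -r minute hour dom month dow
    printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
      "$minute" "$hour" "$dom" "$month" "$dow"
  }
}
```

Always quote the expression when calling it, for example `describe_cron "0 3 * * *"`, so the shell does not expand the asterisks.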
|
||||
|
||||
### Retention Policy
|
||||
|
||||
```bash
|
||||
# View retention policy
|
||||
stella backup retention show
|
||||
|
||||
# Set retention
|
||||
stella backup retention set --type database --days 30
|
||||
stella backup retention set --type evidence --days 90
|
||||
stella backup retention set --type config --days 90
|
||||
|
||||
# Apply retention (cleanup old backups)
|
||||
stella backup retention apply
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Backup Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaBackupFailed`
|
||||
- Missing recent backup
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check backup logs
|
||||
stella backup logs --last 24h
|
||||
|
||||
# Check disk space
|
||||
stella doctor --check check.storage.diskspace,check.storage.backup
|
||||
|
||||
# Test backup operation
|
||||
stella backup test --type database
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Disk space issue:**
|
||||
```bash
|
||||
stella backup retention apply --force
|
||||
stella backup cleanup --expired
|
||||
```
|
||||
|
||||
2. **Database connectivity:**
|
||||
```bash
|
||||
stella doctor --check check.postgres.connectivity
|
||||
```
|
||||
|
||||
3. **Permission issue:**
|
||||
- Check backup directory permissions
|
||||
- Verify service account access
|
||||
|
||||
4. **Retry backup:**
|
||||
```bash
|
||||
stella backup create --type <failed-type> --retry
|
||||
```
|
||||
|
||||
### INC-002: Restore Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Restore command fails
|
||||
- Services not starting after restore
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check restore logs
|
||||
stella backup restore-logs --last-attempt
|
||||
|
||||
# Verify backup integrity
|
||||
stella backup verify --name <backup-name>
|
||||
|
||||
# Check disk space
|
||||
stella doctor --check check.storage.diskspace
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Corrupted backup:**
|
||||
```bash
|
||||
# Try previous backup
|
||||
stella backup list --type <type>
|
||||
stella backup restore --name <previous-backup> --confirm
|
||||
```
|
||||
|
||||
2. **Version mismatch:**
|
||||
```bash
|
||||
# Check backup version
|
||||
stella backup info --name <backup-name>
|
||||
|
||||
# Restore with migration
|
||||
stella backup restore --name <backup-name> --with-migration
|
||||
```
|
||||
|
||||
3. **Disk space:**
|
||||
- Free space or expand volume
|
||||
- Restore to alternate location
|
||||
|
||||
### INC-003: Backup Storage Full
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaBackupStorageFull`
|
||||
- New backups failing
|
||||
|
||||
**Immediate Actions:**
|
||||
```bash
|
||||
# Check storage
|
||||
stella backup storage stats
|
||||
|
||||
# Emergency cleanup
|
||||
stella backup cleanup --keep-last 3
|
||||
|
||||
# Delete specific old backups
|
||||
stella backup delete --older-than 14d --confirm
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Adjust retention:**
|
||||
```bash
|
||||
stella backup retention set --type database --days 14
|
||||
stella backup retention apply
|
||||
```
|
||||
|
||||
2. **Expand storage:**
|
||||
- Add disk space
|
||||
- Configure offsite storage
|
||||
|
||||
3. **Archive to cold storage:**
|
||||
```bash
|
||||
stella backup archive --older-than 30d --destination s3://archive-bucket/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Disaster Recovery Scenarios
|
||||
|
||||
### DR-001: Complete System Loss
|
||||
|
||||
1. Provision new infrastructure
|
||||
2. Install Stella Ops
|
||||
3. Restore from offsite backup:
|
||||
```bash
|
||||
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
|
||||
```
|
||||
4. Verify all components
|
||||
5. Update DNS/load balancer
|
||||
|
||||
### DR-002: Database Corruption
|
||||
|
||||
1. Stop services
|
||||
2. Restore database from latest clean backup:
|
||||
```bash
|
||||
stella backup restore --type database --name <last-known-good> --confirm
|
||||
```
|
||||
3. Apply WAL to near-corruption point (PITR)
|
||||
4. Verify data integrity
|
||||
5. Resume services
|
||||
|
||||
### DR-003: Evidence Locker Loss
|
||||
|
||||
1. Restore evidence from backup:
|
||||
```bash
|
||||
stella backup restore --type evidence --name <backup-name> --confirm
|
||||
```
|
||||
2. Rebuild index:
|
||||
```bash
|
||||
stella evidence index rebuild
|
||||
```
|
||||
3. Verify anchor chain:
|
||||
```bash
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Offline/Air-Gap Backup
|
||||
|
||||
### Creating Offline Backup
|
||||
|
||||
```bash
|
||||
# Create encrypted offline bundle
|
||||
stella backup create-offline \
|
||||
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
|
||||
--encrypt \
|
||||
--passphrase-file /secure/backup-key
|
||||
|
||||
# Verify offline backup
|
||||
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
|
||||
```
|
||||
|
||||
### Restoring from Offline Backup
|
||||
|
||||
```bash
|
||||
# Restore from offline backup
|
||||
stella backup restore-offline \
|
||||
--input /media/usb/stellaops-backup-*.enc \
|
||||
--passphrase-file /secure/backup-key \
|
||||
--confirm
|
||||
```
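
If the removable media holds more than one bundle, the `stellaops-backup-*.enc` glob expands to several paths. A minimal sketch for picking the most recent bundle is shown below, relying on the `YYYYMMDD` suffix from the create step sorting chronologically; `latest_offline_backup` is a hypothetical helper.

```bash
# Hypothetical helper: prints the newest stellaops-backup-YYYYMMDD.enc in a directory.
latest_offline_backup() {
  # $1 = directory to search, for example /media/usb
  latest=""
  for f in "$1"/stellaops-backup-*.enc; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    latest=$f                 # glob results are sorted, so the last match has the latest date
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

Usage: `stella backup restore-offline --input "$(latest_offline_backup /media/usb)" --passphrase-file /secure/backup-key --confirm`. The helper returns non-zero when no bundle is found.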
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Backup Status
|
||||
|
||||
Key panels:
|
||||
- Last backup success time
|
||||
- Backup size trend
|
||||
- Backup duration
|
||||
- Restore test status
|
||||
- Storage utilization
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
```bash
|
||||
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
|
||||
2. **L2 (Platform team):** Restore operations, schedule adjustments
|
||||
3. **L3 (Architecture):** Disaster recovery execution
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
196
docs/operations/runbooks/connector-ghsa.md
Normal file
@@ -0,0 +1,196 @@
|
||||
# Runbook: Feed Connector - GitHub Security Advisories (GHSA) Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / GHSA Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.connector.ghsa-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] GHSA feed sync failing or stale
|
||||
- [ ] Alert `ConnectorGhsaSyncFailed` firing
|
||||
- [ ] Error: "GitHub API rate limit exceeded" or "GraphQL query failed"
|
||||
- [ ] GitHub Advisory Database vulnerabilities missing
|
||||
- [ ] Metric `connector_sync_failures_total{source="ghsa"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | GitHub ecosystem vulnerabilities may be missed |
|
||||
| **Data integrity** | Data becomes stale; no data loss |
|
||||
| **SLA impact** | Vulnerability currency SLO violated for GitHub packages |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.ghsa-health
|
||||
```
|
||||
|
||||
2. **Check GHSA sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source ghsa
|
||||
```
|
||||
|
||||
3. **Test GitHub API connectivity:**
|
||||
```bash
|
||||
stella connector test ghsa
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check GitHub API rate limit:**
|
||||
```bash
|
||||
stella connector ghsa rate-limit-status
|
||||
```
|
||||
Problem if: Remaining = 0, rate limit exceeded
|
||||
|
||||
2. **Check GitHub token permissions:**
|
||||
```bash
|
||||
stella connector credentials show ghsa --check-scopes
|
||||
```
|
||||
Required scopes: `public_repo`, `read:packages` (for private advisory access)
|
||||
|
||||
3. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs ghsa --last 1h --level error
|
||||
```
|
||||
Look for: GraphQL errors, pagination issues, timeout
|
||||
|
||||
4. **Check for GitHub API outage:**
|
||||
```bash
|
||||
stella connector ghsa api-status
|
||||
```
|
||||
Also check: https://www.githubstatus.com/
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If rate limited, wait for reset:**
|
||||
```bash
|
||||
stella connector ghsa rate-limit-status
|
||||
# Note the reset time, then:
|
||||
stella admin feeds refresh --source ghsa
|
||||
```
|
||||
|
||||
2. **Use secondary token if available:**
|
||||
```bash
|
||||
stella connector credentials rotate ghsa --to secondary
|
||||
stella admin feeds refresh --source ghsa
|
||||
```
|
||||
|
||||
3. **Load from offline bundle:**
|
||||
```bash
|
||||
stella offline load --source ghsa --package ghsa-bundle-latest.tar.gz
|
||||
```
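
For the rate-limit case, the wait before retrying is the difference between the reset time and now. The sketch below assumes the reset time is available as Unix epoch seconds; the output format of `stella connector ghsa rate-limit-status` is not shown in this runbook, and `seconds_until_reset` is a hypothetical helper.

```bash
# Hypothetical helper: seconds to wait before the rate limit resets (never negative).
seconds_until_reset() {
  # $1 = reset time (epoch seconds), $2 = current time (epoch seconds)
  remaining=$(( $1 - $2 ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}
```

Usage: `sleep "$(seconds_until_reset "$reset_epoch" "$(date +%s)")" && stella admin feeds refresh --source ghsa`.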
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If rate limit consistently exceeded:**
|
||||
|
||||
1. Increase sync interval:
|
||||
```bash
|
||||
stella connector config set ghsa.sync_interval 4h
|
||||
```
|
||||
|
||||
2. Enable incremental sync:
|
||||
```bash
|
||||
stella connector config set ghsa.incremental_sync true
|
||||
```
|
||||
|
||||
3. Use authenticated requests (raises the GitHub API rate limit to 5,000 requests/hour):
|
||||
```bash
|
||||
stella connector credentials update ghsa --token <github-pat>
|
||||
```
|
||||
|
||||
**If token expired or invalid:**
|
||||
|
||||
1. Generate new GitHub PAT at https://github.com/settings/tokens
|
||||
|
||||
2. Update token:
|
||||
```bash
|
||||
stella connector credentials update ghsa --token <new-token>
|
||||
```
|
||||
|
||||
3. Verify scopes:
|
||||
```bash
|
||||
stella connector credentials show ghsa --check-scopes
|
||||
```
|
||||
|
||||
**If GraphQL query failing:**
|
||||
|
||||
1. Check for API schema changes:
|
||||
```bash
|
||||
stella connector ghsa schema-check
|
||||
```
|
||||
|
||||
2. Update connector if schema changed:
|
||||
```bash
|
||||
stella upgrade --component connector-ghsa
|
||||
```
|
||||
|
||||
**If pagination broken:**
|
||||
|
||||
1. Reset sync cursor:
|
||||
```bash
|
||||
stella connector ghsa reset-cursor
|
||||
```
|
||||
|
||||
2. Force full resync:
|
||||
```bash
|
||||
stella admin feeds refresh --source ghsa --full
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Force sync
|
||||
stella admin feeds refresh --source ghsa
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source ghsa --watch
|
||||
|
||||
# Verify recent advisories present
|
||||
stella vuln query GHSA-xxxx-xxxx-xxxx # Use a recent GHSA ID
|
||||
|
||||
# Check no errors
|
||||
stella connector logs ghsa --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Authentication:** Always use authenticated requests to get the 5,000 requests/hour rate limit
|
||||
- [ ] **Monitoring:** Alert on last sync > 12h or sync failures
|
||||
- [ ] **Redundancy:** Use NVD/OSV as backup for GitHub ecosystem coverage
|
||||
- [ ] **Token rotation:** Rotate tokens before expiration
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/ghsa.md`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-osv.md`
|
||||
- **GitHub API docs:** https://docs.github.com/en/graphql
|
||||
195
docs/operations/runbooks/connector-nvd.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Feed Connector - NVD Connector Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / NVD Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.connector.nvd-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] NVD feed sync failing or stale (> 24h since last successful sync)
|
||||
- [ ] Alert `ConnectorNvdSyncFailed` firing
|
||||
- [ ] Error: "NVD API request failed" or "rate limit exceeded"
|
||||
- [ ] Vulnerability data missing or outdated
|
||||
- [ ] Metric `connector_sync_failures_total{source="nvd"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Vulnerability scans may miss recent CVEs |
|
||||
| **Data integrity** | Data becomes stale; no data loss |
|
||||
| **SLA impact** | Vulnerability currency SLO violated (target: < 24h) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.nvd-health
|
||||
```
|
||||
|
||||
2. **Check NVD sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source nvd
|
||||
```
|
||||
Look for: Last sync time, error message, sync state
|
||||
|
||||
3. **Check NVD API connectivity:**
|
||||
```bash
|
||||
stella connector test nvd
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check NVD API key status:**
|
||||
```bash
|
||||
stella connector credentials show nvd
|
||||
```
|
||||
Problem if: API key expired or rate limit exhausted
|
||||
|
||||
2. **Check NVD API rate limit:**
|
||||
```bash
|
||||
stella connector nvd rate-limit-status
|
||||
```
|
||||
Problem if: Remaining requests = 0, reset time in future
|
||||
|
||||
3. **Check for NVD API outage:**
|
||||
```bash
|
||||
stella connector nvd api-status
|
||||
```
|
||||
Also check: https://nvd.nist.gov/general/news
|
||||
|
||||
4. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs nvd --last 1h --level error
|
||||
```
|
||||
Look for: HTTP status codes, timeout errors, parsing failures
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If rate limited, wait for reset:**
|
||||
```bash
|
||||
stella connector nvd rate-limit-status
|
||||
# Wait for reset time, then:
|
||||
stella admin feeds refresh --source nvd
|
||||
```
|
||||
|
||||
2. **If API key expired, use anonymous mode (slower):**
|
||||
```bash
|
||||
stella connector config set nvd.api_key_mode anonymous
|
||||
stella admin feeds refresh --source nvd
|
||||
```
|
||||
|
||||
3. **Load from offline bundle if urgent:**
|
||||
```bash
|
||||
# If you have a recent offline bundle:
|
||||
stella offline load --source nvd --package nvd-bundle-latest.tar.gz
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If API key expired or invalid:**
|
||||
|
||||
1. Generate new NVD API key at https://nvd.nist.gov/developers/request-an-api-key
|
||||
|
||||
2. Update API key:
|
||||
```bash
|
||||
stella connector credentials update nvd --api-key <new-key>
|
||||
```
|
||||
|
||||
3. Verify connectivity:
|
||||
```bash
|
||||
stella connector test nvd
|
||||
```
|
||||
|
||||
**If rate limit consistently exceeded:**
|
||||
|
||||
1. Increase sync interval to reduce API calls:
|
||||
```bash
|
||||
stella connector config set nvd.sync_interval 6h
|
||||
```
|
||||
|
||||
2. Enable delta sync to reduce data volume:
|
||||
```bash
|
||||
stella connector config set nvd.delta_sync true
|
||||
```
|
||||
|
||||
3. Request higher rate limit from NVD (if available)
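
To size the sync settings against a rate limit, it helps to convert the limit into a minimum spacing between requests. The sketch below uses the limits NVD publishes at the time of writing (5 requests per rolling 30 seconds without an API key, 50 with one); treat those numbers as an assumption to re-check against the NVD developer pages. `min_interval_ms` is a hypothetical helper.

```bash
# Hypothetical helper: minimum spacing between requests, in milliseconds.
min_interval_ms() {
  # $1 = allowed requests per window, $2 = window length in seconds
  echo $(( $2 * 1000 / $1 ))
}
```

`min_interval_ms 5 30` gives 6000 (one request every 6 seconds without a key) and `min_interval_ms 50 30` gives 600, which is the 10x difference behind the API key recommendation.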
|
||||
|
||||
**If network/firewall issue:**
|
||||
|
||||
1. Verify outbound connectivity to NVD API:
|
||||
```bash
|
||||
stella connector test nvd --verbose
|
||||
```
|
||||
|
||||
2. Check proxy configuration if required:
|
||||
```bash
|
||||
stella connector config set nvd.proxy https://proxy:8080
|
||||
```
|
||||
|
||||
**If data parsing failures:**
|
||||
|
||||
1. Check for NVD schema changes:
|
||||
```bash
|
||||
stella connector nvd schema-check
|
||||
```
|
||||
|
||||
2. Update connector if schema changed:
|
||||
```bash
|
||||
stella upgrade --component connector-nvd
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Force sync
|
||||
stella admin feeds refresh --source nvd --force
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source nvd --watch
|
||||
|
||||
# Verify recent CVEs are present
|
||||
stella vuln query CVE-2026-XXXX # Use a recent CVE ID
|
||||
|
||||
# Check no errors in recent logs
|
||||
stella connector logs nvd --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **API Key:** Always use API key (not anonymous) for 10x rate limit
|
||||
- [ ] **Monitoring:** Alert on last sync > 24h or sync failure
|
||||
- [ ] **Redundancy:** Configure backup connector (OSV, GitHub Advisory) for overlap
|
||||
- [ ] **Offline:** Maintain weekly offline bundle for disaster recovery
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/nvd.md`
|
||||
- **Related runbooks:** `connector-ghsa.md`, `connector-osv.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Feed Connectors
|
||||
193
docs/operations/runbooks/connector-osv.md
Normal file
@@ -0,0 +1,193 @@
|
||||
# Runbook: Feed Connector - OSV (Open Source Vulnerabilities) Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / OSV Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.connector.osv-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] OSV feed sync failing or stale
|
||||
- [ ] Alert `ConnectorOsvSyncFailed` firing
|
||||
- [ ] Error: "OSV API request failed" or "ecosystem sync failed"
|
||||
- [ ] OSV vulnerabilities missing from database
|
||||
- [ ] Metric `connector_sync_failures_total{source="osv"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Open source ecosystem vulnerabilities may be missed |
|
||||
| **Data integrity** | Data becomes stale; no data loss |
|
||||
| **SLA impact** | Vulnerability currency SLO violated for affected ecosystems |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.osv-health
|
||||
```
|
||||
|
||||
2. **Check OSV sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source osv
|
||||
```
|
||||
|
||||
3. **Test OSV API connectivity:**
|
||||
```bash
|
||||
stella connector test osv
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check ecosystem-specific status:**
|
||||
```bash
|
||||
stella connector osv ecosystems status
|
||||
```
|
||||
Look for: Failed ecosystems, stale ecosystems
|
||||
|
||||
2. **Check sync logs:**
|
||||
```bash
|
||||
stella connector logs osv --last 1h --level error
|
||||
```
|
||||
Look for: API errors, parsing failures, timeouts
|
||||
|
||||
3. **Check for OSV API outage:**
|
||||
```bash
|
||||
stella connector osv api-status
|
||||
```
|
||||
Also check: https://osv.dev/
|
||||
|
||||
4. **Check GCS bucket access (OSV uses GCS for bulk data):**
|
||||
```bash
|
||||
stella connector osv gcs-status
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Retry sync for specific ecosystem:**
|
||||
```bash
|
||||
stella admin feeds refresh --source osv --ecosystem npm
|
||||
```
|
||||
|
||||
2. **Sync from GCS bucket directly (faster for bulk):**
|
||||
```bash
|
||||
stella connector osv sync-from-gcs
|
||||
```
|
||||
|
||||
3. **Load from offline bundle:**
|
||||
```bash
|
||||
stella offline load --source osv --package osv-bundle-latest.tar.gz
|
||||
```
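When several ecosystems are stale, the per-ecosystem retry above can be wrapped in a small backoff loop. This is a minimal sketch, not part of the `stella` CLI: `retry_with_backoff` and `refresh_osv_ecosystems` are hypothetical helper names, the attempt count and delay are arbitrary, and it assumes `stella admin feeds refresh` exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))      # double the wait after every failure
    attempt=$(( attempt + 1 ))
  done
}

# Refresh each named ecosystem in turn and report whether any of them failed.
# Assumption: the refresh command signals failure through its exit code.
refresh_osv_ecosystems() {
  local eco failed=0
  for eco in "$@"; do
    retry_with_backoff 3 30 \
      stella admin feeds refresh --source osv --ecosystem "$eco" || failed=1
  done
  return "$failed"
}

# Example call. Only "npm" appears in this runbook; other names are placeholders.
# refresh_osv_ecosystems npm
```

Keeping the retry logic separate from the `stella` call means the same wrapper can be reused for other feed sources.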
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If API request failing:**
|
||||
|
||||
1. Check API endpoint:
|
||||
```bash
|
||||
stella connector osv api-test
|
||||
```
|
||||
|
||||
2. If outbound traffic must go through a proxy, configure it:
|
||||
```bash
|
||||
stella connector config set osv.proxy <proxy-url>
|
||||
```
|
||||
|
||||
**If GCS access failing:**
|
||||
|
||||
1. Check GCS connectivity:
|
||||
```bash
|
||||
stella connector osv gcs-test
|
||||
```
|
||||
|
||||
2. Enable anonymous access (default):
|
||||
```bash
|
||||
stella connector config set osv.gcs_auth anonymous
|
||||
```
|
||||
|
||||
3. Or configure service account:
|
||||
```bash
|
||||
stella connector config set osv.gcs_credentials /path/to/sa-key.json
|
||||
```
|
||||
|
||||
**If specific ecosystem failing:**
|
||||
|
||||
1. Disable problematic ecosystem temporarily:
|
||||
```bash
|
||||
stella connector config set osv.ecosystems.disabled <ecosystem>
|
||||
```
|
||||
|
||||
2. Check ecosystem data format:
|
||||
```bash
|
||||
stella connector osv ecosystem-check <ecosystem>
|
||||
```
|
||||
|
||||
**If parsing errors:**
|
||||
|
||||
1. Check for schema changes:
|
||||
```bash
|
||||
stella connector osv schema-check
|
||||
```
|
||||
|
||||
2. Update connector:
|
||||
```bash
|
||||
stella upgrade --component connector-osv
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Trigger a sync
|
||||
stella admin feeds refresh --source osv
|
||||
|
||||
# Monitor sync progress
|
||||
stella admin feeds status --source osv --watch
|
||||
|
||||
# Verify ecosystem coverage
|
||||
stella connector osv ecosystems status
|
||||
|
||||
# Query recent vulnerability
|
||||
stella vuln query OSV-2026-xxxx  # Use a recent OSV ID
|
||||
|
||||
# Check no errors
|
||||
stella connector logs osv --level error --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Bulk sync:** Use GCS bulk sync for initial load and daily updates
|
||||
- [ ] **Monitoring:** Alert on ecosystem sync failures
|
||||
- [ ] **Redundancy:** NVD/GHSA provide overlapping coverage for major ecosystems
|
||||
- [ ] **Offline:** Maintain weekly offline bundle
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Connector config:** `docs/modules/concelier/operations/connectors/osv.md`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`
|
||||
- **OSV API docs:** https://osv.dev/docs/
|
||||
220
docs/operations/runbooks/connector-vendor-specific.md
Normal file
@@ -0,0 +1,220 @@
|
||||
# Runbook Template: Feed Connector - Vendor-Specific Connectors
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-006 - Feed Connector Runbooks
|
||||
|
||||
## Overview
|
||||
|
||||
This template covers vendor-specific advisory feed connectors (Red Hat, Ubuntu, Debian, Oracle, VMware, etc.). Copy it when creating a runbook for an individual vendor connector.
|
||||
|
||||
---
|
||||
|
||||
## Metadata Template
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Concelier / [Vendor] Connector |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | [Date] |
|
||||
| **Doctor check** | `check.connector.[vendor]-health` |
|
||||
|
||||
---
|
||||
|
||||
## Common Vendor Connector Issues
|
||||
|
||||
### Authentication Failures
|
||||
|
||||
**Symptoms:**
|
||||
- Sync failing with 401/403 errors
|
||||
- "authentication failed" or "invalid credentials"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check credentials
|
||||
stella connector credentials show <vendor>
|
||||
|
||||
# Update credentials
|
||||
stella connector credentials update <vendor> --api-key <key>
|
||||
|
||||
# Test connectivity
|
||||
stella connector test <vendor>
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
**Symptoms:**
|
||||
- Sync failing with 429 errors
|
||||
- "rate limit exceeded"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check rate limit status
|
||||
stella connector <vendor> rate-limit-status
|
||||
|
||||
# Increase sync interval
|
||||
stella connector config set <vendor>.sync_interval 6h
|
||||
|
||||
# Enable delta sync
|
||||
stella connector config set <vendor>.delta_sync true
|
||||
```
|
||||
|
||||
### Data Format Changes
|
||||
|
||||
**Symptoms:**
|
||||
- Parsing errors in sync logs
|
||||
- "unexpected format" or "schema validation failed"
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Check for schema changes
|
||||
stella connector <vendor> schema-check
|
||||
|
||||
# Update connector
|
||||
stella upgrade --component connector-<vendor>
|
||||
```
|
||||
|
||||
### Offline Bundle Refresh
|
||||
|
||||
**Resolution:**
|
||||
```bash
|
||||
# Create offline bundle
|
||||
stella offline sync --feeds <vendor> --output <vendor>-bundle.tar.gz
|
||||
|
||||
# Load offline bundle
|
||||
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Vendor-Specific Runbooks
|
||||
|
||||
Use this template to create runbooks for:
|
||||
|
||||
### Red Hat Security Data
|
||||
|
||||
**Endpoint:** https://access.redhat.com/security/data/
|
||||
**Authentication:** API token or certificate
|
||||
**Connector:** `connector-redhat`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test redhat
|
||||
stella admin feeds status --source redhat
|
||||
stella connector redhat cve-map-status # RHSA to CVE mapping
|
||||
```
|
||||
|
||||
### Ubuntu Security Notices
|
||||
|
||||
**Endpoint:** https://ubuntu.com/security/notices
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-ubuntu`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test ubuntu
|
||||
stella admin feeds status --source ubuntu
|
||||
stella connector ubuntu usn-status # USN sync status
|
||||
```
|
||||
|
||||
### Debian Security Tracker
|
||||
|
||||
**Endpoint:** https://security-tracker.debian.org/
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-debian`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test debian
|
||||
stella admin feeds status --source debian
|
||||
stella connector debian dla-status # DLA sync status
|
||||
```
|
||||
|
||||
### Oracle Security Alerts
|
||||
|
||||
**Endpoint:** https://www.oracle.com/security-alerts/
|
||||
**Authentication:** Oracle account (optional)
|
||||
**Connector:** `connector-oracle`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test oracle
|
||||
stella admin feeds status --source oracle
|
||||
stella connector oracle cpu-status # Critical Patch Update status
|
||||
```
|
||||
|
||||
### VMware Security Advisories
|
||||
|
||||
**Endpoint:** https://www.vmware.com/security/advisories
|
||||
**Authentication:** None (public)
|
||||
**Connector:** `connector-vmware`
|
||||
|
||||
Key commands:
|
||||
```bash
|
||||
stella connector test vmware
|
||||
stella admin feeds status --source vmware
|
||||
stella connector vmware vmsa-status # VMSA sync status
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis Checklist
|
||||
|
||||
For any vendor connector issue:
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.connector.<vendor>-health
|
||||
```
|
||||
|
||||
2. **Check sync status:**
|
||||
```bash
|
||||
stella admin feeds status --source <vendor>
|
||||
```
|
||||
|
||||
3. **Test connectivity:**
|
||||
```bash
|
||||
stella connector test <vendor>
|
||||
```
|
||||
|
||||
4. **Check logs:**
|
||||
```bash
|
||||
stella connector logs <vendor> --last 1h --level error
|
||||
```
|
||||
|
||||
5. **Check credentials (if applicable):**
|
||||
```bash
|
||||
stella connector credentials show <vendor>
|
||||
```
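The five checks above can be run in one pass for any vendor. The sketch below only wraps the commands already listed in this checklist; `diagnose_connector` and `_run_step` are hypothetical helper names, and the failure count assumes each `stella` command exits non-zero when its check fails.

```bash
#!/usr/bin/env bash
# Print a labelled header for one step, run it, and flag a failure on stderr.
_run_step() {
  local label=$1
  shift
  echo "== ${label}: $*"
  if ! "$@"; then
    echo "!! ${label} failed" >&2
    return 1
  fi
}

# Run every checklist command for one vendor and count the failed steps.
# Returns 0 only when all steps pass, 2 when no vendor is given.
diagnose_connector() {
  local vendor=${1:-} failures=0
  if [[ -z "$vendor" ]]; then
    echo "usage: diagnose_connector <vendor>" >&2
    return 2
  fi
  _run_step "doctor" stella doctor --check "check.connector.${vendor}-health" \
    || failures=$((failures + 1))
  _run_step "sync status" stella admin feeds status --source "$vendor" \
    || failures=$((failures + 1))
  _run_step "connectivity" stella connector test "$vendor" \
    || failures=$((failures + 1))
  _run_step "error logs" stella connector logs "$vendor" --last 1h --level error \
    || failures=$((failures + 1))
  _run_step "credentials" stella connector credentials show "$vendor" \
    || failures=$((failures + 1))
  echo "${failures} step(s) failed for ${vendor}"
  (( failures == 0 ))
}

# Example call:
# diagnose_connector redhat
```

The function keeps going after a failed step so that one run shows every failing check rather than only the first.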
|
||||
|
||||
---
|
||||
|
||||
## Resolution Checklist
|
||||
|
||||
1. **Retry sync:**
|
||||
```bash
|
||||
stella admin feeds refresh --source <vendor>
|
||||
```
|
||||
|
||||
2. **Update credentials (if auth issue):**
|
||||
```bash
|
||||
stella connector credentials update <vendor>
|
||||
```
|
||||
|
||||
3. **Update connector (if format changed):**
|
||||
```bash
|
||||
stella upgrade --component connector-<vendor>
|
||||
```
|
||||
|
||||
4. **Load offline bundle (if API unavailable):**
|
||||
```bash
|
||||
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Connector architecture:** `docs/modules/concelier/connectors.md`
|
||||
- **Vendor connector configs:** `docs/modules/concelier/operations/connectors/`
|
||||
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`, `connector-osv.md`
|
||||
370
docs/operations/runbooks/crypto-ops.md
Normal file
@@ -0,0 +1,370 @@
|
||||
# Regional Crypto Operations Runbook

> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
> **Task:** RUN-002 - Crypto Subsystem Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Cryptographic subsystem operations including HSM management, regional crypto profile configuration, key rotation, and certificate management for all supported crypto profiles (International, FIPS, eIDAS, GOST, SM).
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check crypto subsystem health
|
||||
stella doctor --category crypto
|
||||
|
||||
# Verify active crypto profile
|
||||
stella crypto profile show
|
||||
|
||||
# List loaded crypto providers
|
||||
stella crypto providers list
|
||||
|
||||
# Check key status
|
||||
stella crypto keys status
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_crypto_operations_total` - Crypto operation count by type
|
||||
- `stella_crypto_operation_duration_seconds` - Signing/verification latency
|
||||
- `stella_hsm_availability` - HSM availability (if configured)
|
||||
- `stella_cert_expiry_days` - Certificate expiration countdown
|
||||
|
||||
---
|
||||
|
||||
## Regional Crypto Profiles
|
||||
|
||||
### Profile Overview
|
||||
|
||||
| Profile | Use Case | Key Algorithms | Compliance |
|
||||
|---------|----------|----------------|------------|
|
||||
| `international` | Default, most deployments | RSA-2048+, ECDSA P-256/P-384, Ed25519 | General |
|
||||
| `fips` | US Government / FedRAMP | FIPS 140-2 approved algorithms only | FIPS 140-2 |
|
||||
| `eidas` | European Union | RSA-PSS, ECDSA, Ed25519 per ETSI TS 119 312 | eIDAS |
|
||||
| `gost` | Russian Federation | GOST R 34.10-2012, GOST R 34.11-2012 | Russian standards |
|
||||
| `sm` | China | SM2, SM3, SM4 | GM/T 0003-2012 (SM2), GM/T 0004-2012 (SM3), GM/T 0002-2012 (SM4) |
|
||||
|
||||
### Switching Profiles
|
||||
|
||||
1. **Pre-switch verification:**
|
||||
```bash
|
||||
# Verify target profile is available
|
||||
stella crypto profile verify --profile <target-profile>
|
||||
|
||||
# Check for incompatible existing signatures
|
||||
stella crypto audit --check-compatibility --target-profile <target-profile>
|
||||
```
|
||||
|
||||
2. **Profile switch:**
|
||||
```bash
|
||||
# Switch profile (requires service restart)
|
||||
stella crypto profile set --profile <target-profile>
|
||||
|
||||
# Restart services to apply
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. **Post-switch verification:**
|
||||
```bash
|
||||
stella doctor --check check.crypto.fips,check.crypto.eidas,check.crypto.gost,check.crypto.sm
|
||||
```
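The pre-switch checks are only useful if a failed check actually blocks the switch. A minimal sketch of that gating follows; `switch_crypto_profile` is a hypothetical wrapper, and it assumes that `stella crypto profile verify` and `stella crypto audit` exit non-zero when they find a problem.

```bash
#!/usr/bin/env bash
# Switch the crypto profile only when both pre-switch checks pass.
# Returns 2 on missing argument, 1 when a check blocks the switch.
switch_crypto_profile() {
  local target=${1:-}
  if [[ -z "$target" ]]; then
    echo "usage: switch_crypto_profile <target-profile>" >&2
    return 2
  fi

  # Step 1: pre-switch verification (commands taken from the procedure above).
  stella crypto profile verify --profile "$target" || {
    echo "profile ${target} is not available; aborting" >&2
    return 1
  }
  stella crypto audit --check-compatibility --target-profile "$target" || {
    echo "existing signatures are incompatible with ${target}; aborting" >&2
    return 1
  }

  # Step 2: switch, then restart so the new profile takes effect.
  stella crypto profile set --profile "$target" &&
    stella service restart --graceful
}

# Example call:
# switch_crypto_profile fips
```

Post-switch verification is left as a separate manual step so the operator reviews the Doctor output directly.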
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Key Rotation
|
||||
|
||||
**Frequency:** Quarterly or per policy
|
||||
**Duration:** ~15 minutes (no downtime)
|
||||
|
||||
1. Generate new key:
|
||||
```bash
|
||||
# For software keys
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name signing-$(date +%Y%m)
|
||||
|
||||
# For HSM-backed keys
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --provider hsm --name signing-$(date +%Y%m)
|
||||
```
|
||||
|
||||
2. Activate new key:
|
||||
```bash
|
||||
stella crypto keys activate --name signing-$(date +%Y%m)
|
||||
```
|
||||
|
||||
3. Verify signing with new key:
|
||||
```bash
|
||||
echo "test" | stella crypto sign --output /dev/null
|
||||
```
|
||||
|
||||
4. Schedule old key deactivation:
|
||||
```bash
|
||||
stella crypto keys schedule-deactivation --name <old-key-name> --in 30d
|
||||
```
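Steps 1 and 2 each evaluate `$(date +%Y%m)` separately, so a rotation that straddles a month boundary would try to activate a key name that was never generated. One way to avoid that is to compute the name once and reuse it. This is a sketch only: `rotation_key_name` and `rotate_signing_key` are hypothetical helpers, and it assumes the `stella crypto keys` commands exit non-zero on failure.

```bash
#!/usr/bin/env bash
# Build the key name once. An optional YYYYMM argument pins the stamp,
# which helps when re-running a rotation that started in an earlier month.
rotation_key_name() {
  local stamp=${1:-$(date +%Y%m)}
  printf 'signing-%s\n' "$stamp"
}

# Generate, activate, and smoke-test a software signing key under one name.
rotate_signing_key() {
  local key_name
  key_name=$(rotation_key_name "${1:-}")
  stella crypto keys generate --type signing --algorithm ecdsa-p256 --name "$key_name" &&
    stella crypto keys activate --name "$key_name" &&
    echo "test" | stella crypto sign --output /dev/null &&
    echo "activated ${key_name}"
}

# Example call:
# rotate_signing_key
```

Deactivation of the old key is deliberately not scripted here, since step 4 needs the old key's name and a grace period chosen by policy.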
|
||||
|
||||
### SP-002: Certificate Renewal
|
||||
|
||||
**When:** Certificate expiring within 30 days
|
||||
|
||||
1. Check expiration:
|
||||
```bash
|
||||
stella crypto certs check-expiry
|
||||
```
|
||||
|
||||
2. Generate CSR:
|
||||
```bash
|
||||
stella crypto certs csr --subject "CN=stellaops.example.com,O=Example Corp" --output cert.csr
|
||||
```
|
||||
|
||||
3. Install renewed certificate:
|
||||
```bash
|
||||
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
|
||||
```
|
||||
|
||||
4. Verify certificate chain:
|
||||
```bash
|
||||
stella doctor --check check.crypto.certchain
|
||||
```
|
||||
|
||||
5. Restart services:
|
||||
```bash
|
||||
stella service restart --graceful
|
||||
```
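The 30-day renewal window can also be checked directly against a PEM file with OpenSSL, independently of `stella crypto certs check-expiry`. This is a minimal sketch: `cert_expires_within` is a hypothetical helper, and the certificate path in the example is a placeholder. It relies on `openssl x509 -checkend`, which exits 0 when the certificate is still valid after the given number of seconds.

```bash
#!/usr/bin/env bash
# Return 0 if the PEM certificate expires within <days> days,
# 1 if it does not, 2 if the file cannot be parsed as a certificate.
cert_expires_within() {
  local cert=$1 days=$2
  if ! openssl x509 -in "$cert" -noout >/dev/null 2>&1; then
    echo "cannot read certificate: ${cert}" >&2
    return 2
  fi
  # -checkend succeeds when the certificate outlives the window,
  # so success here means "no renewal needed yet".
  if openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null; then
    return 1
  fi
  return 0
}

# Example call (the path is a placeholder):
# cert_expires_within /path/to/cert.pem 30 && echo "renew now"
```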
|
||||
|
||||
### SP-003: HSM Health Check
|
||||
|
||||
**Frequency:** Daily (automated) or on-demand
|
||||
|
||||
1. Check HSM connectivity:
|
||||
```bash
|
||||
stella crypto hsm status
|
||||
```
|
||||
|
||||
2. Verify slot access:
|
||||
```bash
|
||||
stella crypto hsm slots list
|
||||
```
|
||||
|
||||
3. Test signing operation:
|
||||
```bash
|
||||
stella crypto hsm test-sign
|
||||
```
|
||||
|
||||
4. Check HSM metrics:
|
||||
- Free objects/sessions
|
||||
- Temperature/health (vendor-specific)
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: HSM Unavailable
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaHsmUnavailable`
|
||||
- Signing operations failing with "HSM connection error"
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check HSM status
|
||||
stella crypto hsm status
|
||||
|
||||
# Test PKCS#11 module
|
||||
stella crypto hsm test-module
|
||||
|
||||
# Check network to HSM
|
||||
stella network test --host <hsm-host> --port <hsm-port>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Network issue:**
|
||||
- Verify network path to HSM
|
||||
- Check firewall rules
|
||||
- Verify HSM appliance is powered on
|
||||
|
||||
2. **Session exhaustion:**
|
||||
```bash
|
||||
# Release stale sessions
|
||||
stella crypto hsm sessions release --stale
|
||||
|
||||
# Restart crypto service
|
||||
stella service restart --service crypto-signer
|
||||
```
|
||||
|
||||
3. **HSM failure:**
|
||||
- Fail over to secondary HSM (if configured)
|
||||
- Contact HSM vendor support
|
||||
- Consider temporary fallback to software keys (with approval)
|
||||
|
||||
### INC-002: Signing Key Compromised
|
||||
|
||||
**CRITICAL - Follow incident response procedure**
|
||||
|
||||
1. **Immediate containment:**
|
||||
```bash
|
||||
# Revoke compromised key
|
||||
stella crypto keys revoke --name <compromised-key> --reason compromise
|
||||
|
||||
# Block signing with compromised key
|
||||
stella crypto keys block --name <compromised-key>
|
||||
```
|
||||
|
||||
2. **Generate replacement key:**
|
||||
```bash
|
||||
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name emergency-signing
|
||||
stella crypto keys activate --name emergency-signing
|
||||
```
|
||||
|
||||
3. **Notify downstream:**
|
||||
- Update trust registries with new key
|
||||
- Notify relying parties
|
||||
- Publish key revocation notice
|
||||
|
||||
4. **Forensics:**
|
||||
```bash
|
||||
# Export key usage audit log
|
||||
stella crypto audit export --key <compromised-key> --output /secure/key-audit.json
|
||||
```
|
||||
|
||||
### INC-003: Certificate Expired
|
||||
|
||||
**Symptoms:**
|
||||
- TLS connection failures
|
||||
- Alert: `StellaCertExpired`
|
||||
|
||||
**Immediate Resolution:**
|
||||
|
||||
1. If renewed certificate is available:
|
||||
```bash
|
||||
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
2. If the renewed certificate is not ready, install a temporary self-signed certificate as an emergency measure:
|
||||
```bash
|
||||
# Generate emergency certificate (NOT for production use)
|
||||
stella crypto certs generate-self-signed --days 7 --name emergency
|
||||
stella crypto certs install --cert emergency.pem
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. Expedite certificate renewal process
|
||||
|
||||
### INC-004: FIPS Mode Not Enabled
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaFipsNotEnabled`
|
||||
- Compliance audit failure
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Linux (RHEL-family; `fips-mode-setup` is not available on every distribution):**
|
||||
```bash
|
||||
# Enable FIPS mode
|
||||
sudo fips-mode-setup --enable
|
||||
|
||||
# Reboot required
|
||||
sudo reboot
|
||||
|
||||
# Verify after reboot
|
||||
fips-mode-setup --check
|
||||
```
|
||||
|
||||
2. **Windows:**
|
||||
- Enable via Group Policy
|
||||
- Or via registry:
|
||||
```powershell
|
||||
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" -Name "Enabled" -Value 1
|
||||
Restart-Computer
|
||||
```
|
||||
|
||||
3. Restart Stella services:
|
||||
```bash
|
||||
stella service restart
|
||||
stella doctor --check check.crypto.fips
|
||||
```
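On Linux, the kernel's FIPS flag can be read directly, which is useful on hosts where `fips-mode-setup` is not installed. The sketch below reads `/proc/sys/crypto/fips_enabled`; `fips_enabled` is a hypothetical helper name, and the check covers the kernel flag only, not whether Stella itself is using a FIPS-validated provider.

```bash
#!/usr/bin/env bash
# Report whether the Linux kernel is running in FIPS mode.
# The flag file is a parameter so the function can be exercised with a test file.
fips_enabled() {
  local flag_file=${1:-/proc/sys/crypto/fips_enabled}
  [[ -r "$flag_file" ]] && [[ "$(tr -d '[:space:]' < "$flag_file")" == "1" ]]
}

if fips_enabled; then
  echo "kernel FIPS mode: enabled"
else
  echo "kernel FIPS mode: not enabled"
fi
```

A missing flag file is treated as "not enabled", which matches kernels built without FIPS support.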
|
||||
|
||||
---
|
||||
|
||||
## Regional-Specific Procedures
|
||||
|
||||
### GOST Configuration (Russian Federation)
|
||||
|
||||
1. Install GOST engine:
|
||||
```bash
|
||||
sudo apt install libengine-gost-openssl1.1
|
||||
```
|
||||
|
||||
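2. Configure Stella (the engine path must match the installed OpenSSL version: `engines-3` is the OpenSSL 3.x directory, `engines-1.1` the OpenSSL 1.1 directory):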
|
||||
```bash
|
||||
stella crypto profile set --profile gost
|
||||
stella crypto config set --gost-engine-path /usr/lib/x86_64-linux-gnu/engines-3/gost.so
|
||||
```
|
||||
|
||||
3. Verify:
|
||||
```bash
|
||||
stella doctor --check check.crypto.gost
|
||||
```
|
||||
|
||||
### SM Configuration (China)
|
||||
|
||||
1. Ensure OpenSSL 1.1.1+ with SM support:
|
||||
```bash
|
||||
openssl version
|
||||
openssl list -cipher-algorithms | grep -i sm
|
||||
```
|
||||
|
||||
2. Configure Stella:
|
||||
```bash
|
||||
stella crypto profile set --profile sm
|
||||
```
|
||||
|
||||
3. Verify:
|
||||
```bash
|
||||
stella doctor --check check.crypto.sm
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Crypto Subsystem
|
||||
|
||||
Key panels:
|
||||
- Signing operation latency
|
||||
- Key usage by key ID
|
||||
- HSM availability
|
||||
- Certificate expiration countdown
|
||||
- Crypto profile in use
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
```bash
|
||||
# Comprehensive crypto diagnostics
|
||||
stella crypto diagnostics --output /tmp/crypto-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Active crypto profile
|
||||
- Key inventory (public keys only)
|
||||
- Certificate chain
|
||||
- HSM status
|
||||
- Operation audit log (last 24h)
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Certificate installs, key activation
|
||||
2. **L2 (Security team):** Key rotation, HSM issues
|
||||
3. **L3 (Crypto SME):** Algorithm issues, compliance questions
|
||||
4. **HSM Vendor:** Hardware failures
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
408
docs/operations/runbooks/evidence-locker-ops.md
Normal file
@@ -0,0 +1,408 @@
|
||||
# Evidence Locker Operations Runbook

> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
> **Task:** RUN-003 - Evidence Locker Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
Evidence locker operations including storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery procedures.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check evidence locker health
|
||||
stella doctor --category evidence
|
||||
|
||||
# Verify storage accessibility
|
||||
stella evidence status
|
||||
|
||||
# Check index health
|
||||
stella evidence index status
|
||||
|
||||
# Verify anchor chain
|
||||
stella evidence anchor verify --latest
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_evidence_artifacts_total` - Total artifacts stored
|
||||
- `stella_evidence_retrieval_latency_seconds` - Retrieval latency P99
|
||||
- `stella_evidence_storage_bytes` - Storage consumption
|
||||
- `stella_merkle_anchor_age_seconds` - Time since last anchor
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Daily Integrity Check
|
||||
|
||||
**Frequency:** Daily (automated) or on-demand
|
||||
**Duration:** Varies by locker size (typically 5-30 minutes)
|
||||
|
||||
1. Run integrity verification:
|
||||
```bash
|
||||
# Quick check (sample-based)
|
||||
stella evidence verify --mode quick
|
||||
|
||||
# Full check (all artifacts)
|
||||
stella evidence verify --mode full
|
||||
```
|
||||
|
||||
2. Review results:
|
||||
```bash
|
||||
stella evidence verify-report --latest
|
||||
```
|
||||
|
||||
3. Address any failures:
|
||||
```bash
|
||||
# List failed artifacts
|
||||
stella evidence verify-report --latest --filter failed
|
||||
```
|
||||
|
||||
### SP-002: Index Maintenance
|
||||
|
||||
**Frequency:** Weekly or after large ingestion
|
||||
**Duration:** ~10 minutes
|
||||
|
||||
1. Check index health:
|
||||
```bash
|
||||
stella evidence index status
|
||||
```
|
||||
|
||||
2. Refresh index if needed:
|
||||
```bash
|
||||
# Incremental refresh
|
||||
stella evidence index refresh
|
||||
|
||||
# Full rebuild (if corruption suspected)
|
||||
stella evidence index rebuild
|
||||
```
|
||||
|
||||
3. Optimize index:
|
||||
```bash
|
||||
stella evidence index optimize
|
||||
```
|
||||
|
||||
### SP-003: Merkle Anchoring
|
||||
|
||||
**Frequency:** Per policy (default: every 6 hours)
|
||||
**Duration:** ~2 minutes
|
||||
|
||||
1. Create new anchor:
|
||||
```bash
|
||||
stella evidence anchor create
|
||||
```
|
||||
|
||||
2. Verify anchor chain:
|
||||
```bash
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
3. Export anchor for external archival:
|
||||
```bash
|
||||
stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
|
||||
```
|
||||
|
||||
### SP-004: Storage Cleanup
|
||||
|
||||
**Frequency:** Monthly or when storage alerts trigger
|
||||
**Duration:** Varies
|
||||
|
||||
1. Review storage usage:
|
||||
```bash
|
||||
stella evidence storage stats
|
||||
```
|
||||
|
||||
2. Apply retention policy:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella evidence cleanup --apply-retention --dry-run
|
||||
|
||||
# Execute cleanup
|
||||
stella evidence cleanup --apply-retention
|
||||
```
|
||||
|
||||
3. Archive old evidence (if required):
|
||||
```bash
|
||||
stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y).tar
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Integrity Verification Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceIntegrityFailure`
|
||||
- Verification reports hash mismatch
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Get failure details
|
||||
stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
|
||||
|
||||
# Check specific artifact
|
||||
stella evidence inspect <artifact-id>
|
||||
|
||||
# Check provenance
|
||||
stella evidence provenance show <artifact-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Isolated corruption:**
|
||||
```bash
|
||||
# Attempt recovery from replica (if available)
|
||||
stella evidence recover --id <artifact-id> --source replica
|
||||
|
||||
# If no replica, mark as corrupted
|
||||
stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
|
||||
```
|
||||
|
||||
2. **Widespread corruption:**
|
||||
- Stop evidence ingestion
|
||||
- Identify corruption extent
|
||||
- Restore from backup if necessary
|
||||
- Escalate to L3
|
||||
|
||||
3. **False positive (software bug):**
|
||||
- Verify with multiple hash implementations
|
||||
- Check for recent software updates
|
||||
- Report bug if confirmed
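To check for a false positive, the digest of a suspect artifact can be computed with two independent tools and compared. This is a sketch under assumptions: it uses SHA-256 (this runbook does not state which algorithm the locker uses), `hashes_agree` is a hypothetical helper, and the artifact path in the example is a placeholder.

```bash
#!/usr/bin/env bash
# Compute the SHA-256 digest of one file with coreutils and with OpenSSL.
# Returns 0 when both tools produce the same non-empty digest.
hashes_agree() {
  local file=$1 from_coreutils from_openssl
  # sha256sum prints "<digest>  <file>"; keep the first field.
  from_coreutils=$(sha256sum "$file" | awk '{print $1}')
  # openssl prints "<ALG>(<file>)= <digest>"; keep the last field.
  from_openssl=$(openssl dgst -sha256 "$file" | awk '{print $NF}')
  echo "sha256sum: ${from_coreutils}"
  echo "openssl:   ${from_openssl}"
  [[ -n "$from_coreutils" && "$from_coreutils" == "$from_openssl" ]]
}

# Example call (the path is a placeholder):
# hashes_agree /path/to/artifact.bin
```

If both tools agree with each other but not with the recorded digest, the mismatch is more likely real corruption than a hashing bug.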
|
||||
|
||||
### INC-002: Evidence Retrieval Failure
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceRetrievalFailed`
|
||||
- API returning 404 for known artifacts
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check if artifact exists
|
||||
stella evidence exists <artifact-id>
|
||||
|
||||
# Check index
|
||||
stella evidence index lookup <artifact-id>
|
||||
|
||||
# Check storage backend
|
||||
stella evidence storage check <artifact-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Index corruption:**
|
||||
```bash
|
||||
# Rebuild index
|
||||
stella evidence index rebuild
|
||||
```
|
||||
|
||||
2. **Storage backend issue:**
|
||||
```bash
|
||||
# Check storage health
|
||||
stella doctor --check check.storage.evidencelocker
|
||||
|
||||
# Verify storage connectivity
|
||||
stella evidence storage test
|
||||
```
|
||||
|
||||
3. **File system issue:**
|
||||
- Check disk health
|
||||
- Verify file permissions
|
||||
- Check mount status
|
||||
|
||||
### INC-003: Anchor Chain Break
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaMerkleAnchorChainBroken`
|
||||
- Anchor verification fails
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check anchor chain
|
||||
stella evidence anchor verify --all --verbose
|
||||
|
||||
# Find break point
|
||||
stella evidence anchor list --show-links
|
||||
|
||||
# Inspect specific anchor
|
||||
stella evidence anchor inspect <anchor-id>
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Single broken link:**
|
||||
```bash
|
||||
# Attempt to recover from backup
|
||||
stella evidence anchor recover --id <anchor-id> --source backup
|
||||
```
|
||||
|
||||
2. **Multiple breaks:**
|
||||
- Stop new anchoring
|
||||
- Assess extent of damage
|
||||
- Restore from backup or rebuild chain
|
||||
|
||||
3. **Create new chain segment:**
|
||||
```bash
|
||||
# Start new chain (preserves old chain as archived)
|
||||
stella evidence anchor new-chain --reason "chain-break-recovery"
|
||||
```
|
||||
|
||||
### INC-004: Storage Full
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaEvidenceStorageFull`
|
||||
- Ingestion failing
|
||||
|
||||
**Immediate Actions:**
|
||||
```bash
|
||||
# Check storage usage
|
||||
stella evidence storage stats
|
||||
|
||||
# Emergency cleanup of temporary files
|
||||
stella evidence cleanup --temp-only
|
||||
|
||||
# Find large/old artifacts
|
||||
stella evidence storage analyze --sort size --limit 20
|
||||
```
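Filesystem usage can be confirmed at the OS level before and after cleanup. The following is a minimal sketch using POSIX `df -P`; `usage_percent` and `storage_above` are hypothetical helpers, the 90% threshold is an arbitrary example, and `/var/lib/stellaops/evidence` is the locker path used in the disaster recovery procedure in this runbook.

```bash
#!/usr/bin/env bash
# Print the used-space percentage (integer, no % sign) of the
# filesystem that holds <path>.
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 0 when usage is at or above <threshold> percent,
# 1 when below, 2 when the usage figure could not be read.
storage_above() {
  local path=$1 threshold=$2 used
  used=$(usage_percent "$path")
  [[ "$used" =~ ^[0-9]+$ ]] || return 2
  (( used >= threshold ))
}

# Example call:
# storage_above /var/lib/stellaops/evidence 90 && stella evidence cleanup --temp-only
```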
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Apply retention policy:**
|
||||
```bash
|
||||
stella evidence cleanup --apply-retention --aggressive
|
||||
```
|
||||
|
||||
2. **Archive old evidence:**
|
||||
```bash
|
||||
stella evidence archive --older-than 180d --compress
|
||||
```
|
||||
|
||||
3. **Expand storage:**
|
||||
- Follow cloud provider procedure
|
||||
- Or add additional storage volume
|
||||
|
||||
---
|
||||
|
||||
## Disaster Recovery
|
||||
|
||||
### DR-001: Full Evidence Locker Recovery
|
||||
|
||||
**Prerequisites:**
|
||||
- Backup available
|
||||
- Target storage provisioned
|
||||
- Recovery environment ready
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. Provision new storage:
|
||||
```bash
|
||||
stella evidence storage provision --size <size>
|
||||
```
|
||||
|
||||
2. Restore from backup:
|
||||
```bash
|
||||
# List available backups
|
||||
stella backup list --type evidence-locker
|
||||
|
||||
# Restore
|
||||
stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
|
||||
```
|
||||
|
||||
3. Verify restoration:
|
||||
```bash
|
||||
stella evidence verify --mode full
|
||||
stella evidence anchor verify --all
|
||||
```
|
||||
|
||||
4. Update service configuration:
|
||||
```bash
|
||||
stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
|
||||
stella service restart
|
||||
```
|
||||
|
||||
### DR-002: Point-in-Time Recovery
|
||||
|
||||
For recovering to a specific point in time:
|
||||
|
||||
1. Identify target anchor:
|
||||
```bash
|
||||
stella evidence anchor list --before <timestamp>
|
||||
```
|
||||
|
||||
2. Restore to that point:
|
||||
```bash
|
||||
stella evidence restore --to-anchor <anchor-id>
|
||||
```
|
||||
|
||||
3. Verify integrity:
|
||||
```bash
|
||||
stella evidence verify --mode full --to-anchor <anchor-id>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Offline Mode Operations
|
||||
|
||||
### Preparing Offline Evidence Pack
|
||||
|
||||
```bash
|
||||
# Export evidence for specific artifact
|
||||
stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
|
||||
|
||||
# Export with all dependencies
|
||||
stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
|
||||
```
|
||||
|
||||
### Verifying Evidence Offline
|
||||
|
||||
```bash
|
||||
# Verify evidence pack without network
|
||||
stella evidence verify --offline --input evidence-pack.tar.gz
|
||||
|
||||
# Replay verdict using evidence
|
||||
stella replay --evidence evidence-pack.tar.gz --output verdict.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → Evidence Locker
|
||||
|
||||
Key panels:
|
||||
- Artifact ingestion rate
|
||||
- Retrieval latency
|
||||
- Storage utilization trend
|
||||
- Integrity check status
|
||||
- Anchor chain health
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
For any incident:
|
||||
```bash
|
||||
stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Index status
|
||||
- Storage stats
|
||||
- Recent anchor chain
|
||||
- Integrity check results
|
||||
- Operation audit log
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Standard procedures, cleanup operations
|
||||
2. **L2 (Platform team):** Index rebuild, anchor issues
|
||||
3. **L3 (Architecture):** Chain recovery, DR procedures
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
183
docs/operations/runbooks/orchestrator-evidence-missing.md
Normal file
@@ -0,0 +1,183 @@
|
||||
# Runbook: Release Orchestrator - Required Evidence Not Found
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.evidence-availability` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion failing with "required evidence not found"
|
||||
- [ ] Alert `OrchestratorEvidenceMissing` firing
|
||||
- [ ] Gate evaluation blocked waiting for evidence
|
||||
- [ ] Error: "SBOM not found" or "attestation missing"
|
||||
- [ ] Evidence chain incomplete for artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Promotion blocked until evidence is generated |
|
||||
| **Data integrity** | Indicates a missing security artifact that must be resolved |
|
||||
| **SLA impact** | Release blocked; compliance requirements not met |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.evidence-availability
|
||||
```
|
||||
|
||||
2. **List missing evidence for promotion:**
|
||||
```bash
|
||||
stella promotion evidence <promotion-id> --missing
|
||||
```
|
||||
|
||||
3. **Check what evidence exists for artifact:**
|
||||
```bash
|
||||
stella evidence list --artifact <digest>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check evidence chain completeness:**
|
||||
```bash
|
||||
stella evidence chain --artifact <digest> --verbose
|
||||
```
|
||||
Look for: Missing nodes in the chain
|
||||
|
||||
2. **Check if scan completed:**
|
||||
```bash
|
||||
stella scanner jobs list --artifact <digest>
|
||||
```
|
||||
Problem if: No completed scan or scan failed
|
||||
|
||||
3. **Check if attestation was created:**
|
||||
```bash
|
||||
stella attest list --subject <digest>
|
||||
```
|
||||
Problem if: No attestation or attestation failed
|
||||
|
||||
4. **Check evidence store health:**
|
||||
```bash
|
||||
stella evidence store health
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Generate missing SBOM:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --sbom-only
|
||||
```
|
||||
|
||||
2. **Generate missing attestation:**
|
||||
```bash
|
||||
stella attest create --subject <digest> --type slsa-provenance
|
||||
```
|
||||
|
||||
3. **Re-scan artifact to regenerate all evidence:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --force
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If scan never ran:**
|
||||
|
||||
1. Check why the artifact wasn't scanned:
|
||||
```bash
|
||||
stella scanner queue list --artifact <digest>
|
||||
```
|
||||
|
||||
2. Configure automatic scanning on push:
|
||||
```bash
|
||||
stella scanner config set auto_scan.enabled true
|
||||
stella scanner config set auto_scan.triggers "push,promote"
|
||||
```
|
||||
|
||||
**If evidence was generated but not stored:**
|
||||
|
||||
1. Check evidence store connectivity:
|
||||
```bash
|
||||
stella evidence store health
|
||||
```
|
||||
|
||||
2. Retry evidence storage:
|
||||
```bash
|
||||
stella evidence retry-store --artifact <digest>
|
||||
```
|
||||
|
||||
**If attestation signing failed:**
|
||||
|
||||
1. Check attestor status:
|
||||
```bash
|
||||
stella attest status
|
||||
```
|
||||
|
||||
2. See the `attestor-signing-failed.md` runbook
|
||||
|
||||
**If evidence expired or was deleted:**
|
||||
|
||||
1. Check evidence retention policy:
|
||||
```bash
|
||||
stella evidence policy show
|
||||
```
|
||||
|
||||
2. Regenerate evidence:
|
||||
```bash
|
||||
stella scan image --image <image-ref> --force
|
||||
stella attest create --subject <digest> --type slsa-provenance
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check all evidence now exists
|
||||
stella evidence list --artifact <digest>
|
||||
|
||||
# Verify evidence chain is complete
|
||||
stella evidence chain --artifact <digest>
|
||||
|
||||
# Retry promotion
|
||||
stella promotion retry <promotion-id>
|
||||
|
||||
# Verify promotion proceeds
|
||||
stella promotion status <promotion-id>
|
||||
```
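
When verifying after a retry, polling with a deadline avoids watching the status by hand. The following is a minimal sketch of a generic poll-until-success helper. `wait_until` is a hypothetical name, and the command passed to it must exit 0 once the promotion has proceeded. How to derive that from `stella promotion status` output is not specified in this runbook, so it is left to the caller.

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Re-run CMD until it exits 0 or the timeout elapses.
# Hypothetical helper; not part of the stella CLI.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while ! "$@"; do
    # Give up once the deadline has passed.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# Example (promotion_proceeding is a caller-supplied check, assumed here):
#   wait_until 600 15 promotion_proceeding || echo "still blocked" >&2
```

The command is always tried at least once, so a timeout of 0 means "check once and report".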
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Auto-scan:** Enable automatic scanning for all pushed images
|
||||
- [ ] **Gates:** Configure evidence requirements clearly in promotion policy
|
||||
- [ ] **Monitoring:** Alert on evidence generation failures
|
||||
- [ ] **Retention:** Set appropriate evidence retention periods
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/evidence-locker/architecture.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `attestor-signing-failed.md`
|
||||
- **Evidence requirements:** `docs/operations/evidence-requirements.md`
|
||||
178
docs/operations/runbooks/orchestrator-gate-timeout.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Runbook: Release Orchestrator - Gate Evaluation Timeout
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.gate-timeout` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion gates timing out before completing evaluation
|
||||
- [ ] Alert `OrchestratorGateTimeout` firing
|
||||
- [ ] Error: "gate evaluation timeout exceeded"
|
||||
- [ ] Promotion stuck waiting for gate response
|
||||
- [ ] Metric `orchestrator_gate_timeout_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
|
||||
| **Data integrity** | No data loss; promotion can be retried |
|
||||
| **SLA impact** | Release SLO violated if timeout persists |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.gate-timeout
|
||||
```
|
||||
|
||||
2. **Identify timed-out gates:**
|
||||
```bash
|
||||
stella promotion gates <promotion-id> --status timeout
|
||||
```
|
||||
|
||||
3. **Check gate service health:**
|
||||
```bash
|
||||
stella orch gate-services status
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check specific gate latency:**
|
||||
```bash
|
||||
stella orch gate stats --gate <gate-name> --last 1h
|
||||
```
|
||||
Look for: P95 latency, timeout rate
|
||||
|
||||
2. **Check external service connectivity:**
|
||||
```bash
|
||||
stella orch connectivity --gate <gate-name>
|
||||
```
|
||||
|
||||
3. **Check gate evaluation logs:**
|
||||
```bash
|
||||
stella orch logs --gate <gate-name> --promotion <promotion-id>
|
||||
```
|
||||
Look for: Slow queries, external API delays
|
||||
|
||||
4. **Check policy engine latency (for policy gates):**
|
||||
```bash
|
||||
stella policy stats --last 10m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase timeout for specific gate:**
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.timeout 5m
|
||||
stella orch reload
|
||||
```
|
||||
|
||||
2. **Skip the timed-out gate (requires approval):**
|
||||
```bash
|
||||
stella promotion gate skip <promotion-id> <gate-name> \
|
||||
--reason "External service timeout - approved by <approver>"
|
||||
```
|
||||
|
||||
3. **Retry the promotion:**
|
||||
```bash
|
||||
stella promotion retry <promotion-id>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If external service is slow:**
|
||||
|
||||
1. Configure gate retry with backoff:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.retries 3
|
||||
stella orch config set gates.<gate-name>.retry_backoff 5s
|
||||
```
|
||||
|
||||
2. Enable gate result caching:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.cache_ttl 5m
|
||||
```
|
||||
|
||||
3. Configure circuit breaker:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.circuit_breaker.enabled true
|
||||
stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
|
||||
```
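
For intuition, the retry and backoff settings above correspond to a loop of the following shape. This is a generic fixed-delay retry sketch, not the orchestrator's implementation: `retry_with_backoff` is a hypothetical helper, and whether `retries 3` means three attempts in total or three retries after the first attempt is not stated in this runbook (the sketch counts total attempts).

```bash
# retry_with_backoff MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Run CMD until it succeeds, waiting DELAY_SECONDS between attempts.
# Hypothetical helper illustrating fixed-delay retry.
retry_with_backoff() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while true; do
    "$@" && return 0
    # Stop once the attempt budget is spent.
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    sleep "$delay"
    n=$((n + 1))
  done
}

# Example (the wrapped command is whatever check the gate depends on):
#   retry_with_backoff 3 5 some_gate_check
```

A fixed delay is the simplest policy. If the external service is overloaded, the circuit breaker settings above are the part that stops retries from adding to the load.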
|
||||
|
||||
**If policy evaluation is slow:**
|
||||
|
||||
1. Optimize the policy (see the `policy-evaluation-slow.md` runbook)
|
||||
|
||||
2. Increase policy worker count:
|
||||
```bash
|
||||
stella policy config set opa.workers 4
|
||||
```
|
||||
|
||||
**If evidence retrieval is slow:**
|
||||
|
||||
1. Enable evidence pre-fetching:
|
||||
```bash
|
||||
stella orch config set gates.evidence_prefetch true
|
||||
```
|
||||
|
||||
2. Increase evidence cache:
|
||||
```bash
|
||||
stella orch config set evidence.cache_size 1000
|
||||
stella orch config set evidence.cache_ttl 10m
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry promotion
|
||||
stella promotion retry <promotion-id>
|
||||
|
||||
# Monitor gate evaluation
|
||||
stella promotion gates <promotion-id> --watch
|
||||
|
||||
# Check gate latency improved
|
||||
stella orch gate stats --gate <gate-name> --last 10m
|
||||
|
||||
# Verify no timeouts
|
||||
stella orch logs --filter "timeout" --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
|
||||
- [ ] **Monitoring:** Alert on gate P95 latency > 1m
|
||||
- [ ] **Caching:** Enable caching for slow gates
|
||||
- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/gates.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Gate Latency
|
||||
168
docs/operations/runbooks/orchestrator-promotion-stuck.md
Normal file
@@ -0,0 +1,168 @@
|
||||
# Runbook: Release Orchestrator - Promotion Job Not Progressing
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.job-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotion job stuck in "in_progress" state for >10 minutes
|
||||
- [ ] No progress updates in promotion timeline
|
||||
- [ ] Alert `OrchestratorPromotionStuck` firing
|
||||
- [ ] UI shows promotion spinner indefinitely
|
||||
- [ ] Downstream environment not receiving promoted artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Release blocked, cannot promote to target environment |
|
||||
| **Data integrity** | Artifact is safe; promotion can be retried |
|
||||
| **SLA impact** | Release SLO violated if not resolved within 30 minutes |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.job-health
|
||||
```
|
||||
|
||||
2. **Check promotion status:**
|
||||
```bash
|
||||
stella promotion status <promotion-id>
|
||||
```
|
||||
Look for: Current step, last update time, any error messages
|
||||
|
||||
3. **Check orchestrator service:**
|
||||
```bash
|
||||
stella orch status
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Get detailed promotion trace:**
|
||||
```bash
|
||||
stella promotion trace <promotion-id> --verbose
|
||||
```
|
||||
Look for: Which step is stuck, any timeouts
|
||||
|
||||
2. **Check gate evaluation status:**
|
||||
```bash
|
||||
stella promotion gates <promotion-id>
|
||||
```
|
||||
Problem if: Gate stuck waiting for external service
|
||||
|
||||
3. **Check target environment connectivity:**
|
||||
```bash
|
||||
stella orch connectivity --target <env-name>
|
||||
```
|
||||
|
||||
4. **Check for lock contention:**
|
||||
```bash
|
||||
stella orch locks list
|
||||
```
|
||||
Problem if: Stale locks on the artifact or environment
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **If gate is stuck waiting for external service:**
|
||||
```bash
|
||||
# Skip the stuck gate (requires approval)
|
||||
stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
|
||||
```
|
||||
|
||||
2. **If lock is stale:**
|
||||
```bash
|
||||
# Release the lock (use with caution)
|
||||
stella orch locks release <lock-id> --force
|
||||
```
|
||||
|
||||
3. **If orchestrator is unresponsive:**
|
||||
```bash
|
||||
stella service restart orchestrator
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If external gate service is slow:**
|
||||
|
||||
1. Increase gate timeout:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.timeout 5m
|
||||
```
|
||||
|
||||
2. Configure gate retry:
|
||||
```bash
|
||||
stella orch config set gates.<gate-name>.retries 3
|
||||
```
|
||||
|
||||
**If target environment is unreachable:**
|
||||
|
||||
1. Check network connectivity to target
|
||||
2. Verify credentials for target environment:
|
||||
```bash
|
||||
stella orch credentials verify --target <env-name>
|
||||
```
|
||||
|
||||
**If there is database lock contention:**
|
||||
|
||||
1. Increase lock timeout:
|
||||
```bash
|
||||
stella orch config set locks.timeout 60s
|
||||
```
|
||||
|
||||
2. Enable optimistic locking:
|
||||
```bash
|
||||
stella orch config set locks.mode optimistic
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check promotion completed
|
||||
stella promotion status <promotion-id>
|
||||
|
||||
# Verify artifact in target environment
|
||||
stella orch artifacts list --env <target-env> --filter <artifact-digest>
|
||||
|
||||
# Check no stuck promotions
|
||||
stella promotion list --status in_progress --older-than 5m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Timeouts:** Configure appropriate timeouts for all gates
|
||||
- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
|
||||
- [ ] **Health checks:** Enable connectivity pre-checks before promotion
|
||||
- [ ] **Documentation:** Document SLAs for external gate services
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
|
||||
- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Release Orchestrator
|
||||
189
docs/operations/runbooks/orchestrator-quota-exceeded.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Release Orchestrator - Promotion Quota Exhausted
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.quota-status` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Promotions failing with "quota exceeded"
|
||||
- [ ] Alert `OrchestratorQuotaExceeded` firing
|
||||
- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
|
||||
- [ ] New promotions being rejected
|
||||
- [ ] Queued promotions not processing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New releases blocked until quota resets or increases |
|
||||
| **Data integrity** | No data loss; promotions queued for later |
|
||||
| **SLA impact** | Release frequency SLO may be violated |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.quota-status
|
||||
```
|
||||
|
||||
2. **Check current quota usage:**
|
||||
```bash
|
||||
stella orch quota status
|
||||
```
|
||||
|
||||
3. **Check quota limits:**
|
||||
```bash
|
||||
stella orch quota limits show
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check promotion history:**
|
||||
```bash
|
||||
stella promotion list --last 24h --count
|
||||
```
|
||||
Look for: Unusual spike in promotions
|
||||
|
||||
2. **Check per-environment quotas:**
|
||||
```bash
|
||||
stella orch quota status --by-environment
|
||||
```
|
||||
|
||||
3. **Check for runaway automation:**
|
||||
```bash
|
||||
stella promotion list --last 1h --by-actor
|
||||
```
|
||||
Problem if: Single actor/service making many promotions
|
||||
|
||||
4. **Check when quota resets:**
|
||||
```bash
|
||||
stella orch quota reset-time
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Request temporary quota increase:**
|
||||
```bash
|
||||
stella orch quota request-increase --amount 50 --reason "Release deadline"
|
||||
```
|
||||
|
||||
2. **Prioritize critical promotions:**
|
||||
```bash
|
||||
stella promotion priority set <promotion-id> high
|
||||
```
|
||||
|
||||
3. **Cancel unnecessary queued promotions:**
|
||||
```bash
|
||||
stella promotion list --status queued
|
||||
stella promotion cancel <promotion-id>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If legitimate high volume:**
|
||||
|
||||
1. Increase quota limits:
|
||||
```bash
|
||||
stella orch quota limits set --daily 200 --hourly 50
|
||||
```
|
||||
|
||||
2. Increase per-environment limits:
|
||||
```bash
|
||||
stella orch quota limits set --env production --daily 50
|
||||
```
|
||||
|
||||
**If runaway automation:**
|
||||
|
||||
1. Identify the source:
|
||||
```bash
|
||||
stella promotion list --last 1h --by-actor --verbose
|
||||
```
|
||||
|
||||
2. Revoke or rate-limit the service account:
|
||||
```bash
|
||||
stella auth rate-limit set <service-account> --promotions-per-hour 10
|
||||
```
|
||||
|
||||
3. Fix the automation bug
|
||||
|
||||
**If promotion retries are causing the spike:**
|
||||
|
||||
1. Check for failing promotions causing retries:
|
||||
```bash
|
||||
stella promotion list --status failed --last 24h
|
||||
```
|
||||
|
||||
2. Fix underlying promotion failures (see other runbooks)
|
||||
|
||||
3. Configure retry limits:
|
||||
```bash
|
||||
stella orch config set promotion.max_retries 3
|
||||
stella orch config set promotion.retry_backoff 5m
|
||||
```
|
||||
|
||||
**If the quota is too restrictive for the workload:**
|
||||
|
||||
1. Analyze actual promotion patterns:
|
||||
```bash
|
||||
stella orch quota analyze --last 30d
|
||||
```
|
||||
|
||||
2. Adjust quotas based on analysis:
|
||||
```bash
|
||||
stella orch quota limits set --daily <recommended>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check quota status
|
||||
stella orch quota status
|
||||
|
||||
# Verify promotions processing
|
||||
stella promotion list --status in_progress
|
||||
|
||||
# Test new promotion
|
||||
stella promotion create --test --dry-run
|
||||
|
||||
# Check no quota errors
|
||||
stella orch logs --filter "quota" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Monitoring:** Alert at 80% quota usage
|
||||
- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
|
||||
- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
|
||||
- [ ] **Review:** Regularly review and adjust quotas based on usage patterns
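
The 80% alert threshold above comes down to simple integer arithmetic on used and allowed promotions. The sketch below assumes the two counts have already been obtained; the function names are hypothetical, and the output format of `stella orch quota status` is not assumed.

```bash
# quota_pct USED LIMIT: print the integer percentage of quota consumed.
# Hypothetical helper; returns 2 if LIMIT is not positive.
quota_pct() {
  [ "$2" -gt 0 ] || return 2
  echo $(( $1 * 100 / $2 ))
}

# quota_should_alert USED LIMIT [THRESHOLD_PCT]
# Exit 0 when usage is at or above the threshold (default 80).
quota_should_alert() {
  local pct
  pct="$(quota_pct "$1" "$2")" || return 2
  [ "$pct" -ge "${3:-80}" ]
}

# Example with the daily limit of 200 used earlier in this runbook:
#   quota_should_alert 160 200 && echo "promotion quota at 80% or more" >&2
```

Integer division rounds down, so 159 of 200 counts as 79% and does not trigger the alert.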
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`
|
||||
- **Quota management:** `docs/operations/quota-management.md`
|
||||
189
docs/operations/runbooks/orchestrator-rollback-failed.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Release Orchestrator - Rollback Operation Failed
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-004 - Release Orchestrator Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Release Orchestrator |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team, Release team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.orchestrator.rollback-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Rollback operation failing or stuck
|
||||
- [ ] Alert `OrchestratorRollbackFailed` firing
|
||||
- [ ] Error: "rollback failed" or "cannot restore previous version"
|
||||
- [ ] Target environment in inconsistent state
|
||||
- [ ] Previous artifact not available for deployment
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Rollback blocked; potentially broken release in production |
|
||||
| **Data integrity** | Environment may be in partial rollback state |
|
||||
| **SLA impact** | Incident resolution blocked; extended outage |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.orchestrator.rollback-health
|
||||
```
|
||||
|
||||
2. **Check rollback status:**
|
||||
```bash
|
||||
stella rollback status <rollback-id>
|
||||
```
|
||||
|
||||
3. **Check previous deployment history:**
|
||||
```bash
|
||||
stella orch deployments list --env <env-name> --last 10
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check why rollback failed:**
|
||||
```bash
|
||||
stella rollback trace <rollback-id> --verbose
|
||||
```
|
||||
Look for: Which step failed, error message
|
||||
|
||||
2. **Check previous artifact availability:**
|
||||
```bash
|
||||
stella orch artifacts get <previous-digest> --check
|
||||
```
|
||||
Problem if: Artifact deleted, not in registry
|
||||
|
||||
3. **Check environment state:**
|
||||
```bash
|
||||
stella orch env status <env-name> --detailed
|
||||
```
|
||||
|
||||
4. **Check for deployment locks:**
|
||||
```bash
|
||||
stella orch locks list --env <env-name>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Force release lock if stuck:**
|
||||
```bash
|
||||
stella orch locks release --env <env-name> --force
|
||||
```
|
||||
|
||||
2. **Manual rollback using specific artifact:**
|
||||
```bash
|
||||
stella deploy --env <env-name> --artifact <previous-digest> --force
|
||||
```
|
||||
|
||||
3. **If artifact unavailable, deploy last known good:**
|
||||
```bash
|
||||
stella orch deployments list --env <env-name> --status success
|
||||
stella deploy --env <env-name> --artifact <last-good-digest>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If the previous artifact is not in the registry:**
|
||||
|
||||
1. Check artifact retention policy:
|
||||
```bash
|
||||
stella registry retention show
|
||||
```
|
||||
|
||||
2. Restore from backup registry:
|
||||
```bash
|
||||
stella registry restore --artifact <digest> --from backup
|
||||
```
|
||||
|
||||
3. Increase artifact retention:
|
||||
```bash
|
||||
stella registry retention set --min-versions 10
|
||||
```
|
||||
|
||||
**If the deployment service is unavailable:**
|
||||
|
||||
1. Check deployment target connectivity:
|
||||
```bash
|
||||
stella orch connectivity --target <env-name>
|
||||
```
|
||||
|
||||
2. Check deployment agent status:
|
||||
```bash
|
||||
stella orch agent status --env <env-name>
|
||||
```
|
||||
|
||||
**If configuration has drifted:**
|
||||
|
||||
1. Check environment configuration:
|
||||
```bash
|
||||
stella orch env config diff <env-name>
|
||||
```
|
||||
|
||||
2. Reset environment to known state:
|
||||
```bash
|
||||
stella orch env reset <env-name> --to-baseline
|
||||
```
|
||||
|
||||
**If database state is inconsistent:**
|
||||
|
||||
1. Check orchestrator database:
|
||||
```bash
|
||||
stella orch db verify
|
||||
```
|
||||
|
||||
2. Repair deployment state:
|
||||
```bash
|
||||
stella orch repair --deployment <deployment-id>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify rollback completed
|
||||
stella rollback status <rollback-id>
|
||||
|
||||
# Verify environment state
|
||||
stella orch env status <env-name>
|
||||
|
||||
# Verify correct version deployed
|
||||
stella orch deployments current --env <env-name>
|
||||
|
||||
# Health check the environment
|
||||
stella orch health-check --env <env-name>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Retention:** Maintain at least 5 previous versions in registry
|
||||
- [ ] **Testing:** Test rollback procedure in staging regularly
|
||||
- [ ] **Monitoring:** Alert on rollback failures immediately
|
||||
- [ ] **Documentation:** Document manual rollback procedures per environment
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
|
||||
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
|
||||
- **Rollback procedures:** `docs/operations/rollback-procedures.md`
|
||||
189
docs/operations/runbooks/policy-compilation-failed.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Runbook: Policy Engine - Rego Compilation Errors
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.compilation-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy deployment failing with "compilation error"
|
||||
- [ ] Alert `PolicyCompilationFailed` firing
|
||||
- [ ] Error: "rego_parse_error" or "rego_type_error"
|
||||
- [ ] New policies not taking effect
|
||||
- [ ] OPA rejecting policy bundle
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New policies cannot be deployed; using stale policies |
|
||||
| **Data integrity** | Existing policies continue to work; new rules not enforced |
|
||||
| **SLA impact** | Policy updates blocked; security posture may be outdated |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.compilation-health
|
||||
```
|
||||
|
||||
2. **Check policy compilation status:**
|
||||
```bash
|
||||
stella policy status --compilation
|
||||
```
|
||||
|
||||
3. **Validate specific policy:**
|
||||
```bash
|
||||
stella policy validate --file <policy-file>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Get detailed compilation errors:**
|
||||
```bash
|
||||
stella policy compile --verbose
|
||||
```
|
||||
Look for: Line numbers, error types, undefined references
|
||||
|
||||
2. **Check for syntax errors:**
|
||||
```bash
|
||||
stella policy lint --file <policy-file>
|
||||
```
|
||||
|
||||
3. **Check for type errors:**
|
||||
```bash
|
||||
stella policy typecheck --file <policy-file>
|
||||
```
|
||||
|
||||
4. **Check OPA version compatibility:**
|
||||
```bash
|
||||
stella policy opa version
|
||||
stella policy check-compat --file <policy-file>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Rollback to last working policy:**
|
||||
```bash
|
||||
stella policy rollback --to-last-good
|
||||
```
|
||||
|
||||
2. **Disable the failing policy:**
|
||||
```bash
|
||||
stella policy disable <policy-id>
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. **Use previous bundle:**
|
||||
```bash
|
||||
stella policy bundle load --version <previous-version>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If syntax error:**
|
||||
|
||||
1. Get exact error location:
|
||||
```bash
|
||||
stella policy validate --file <policy-file> --show-line
|
||||
```
|
||||
|
||||
2. Common syntax issues:
|
||||
- Missing brackets or braces
|
||||
- Invalid rule head syntax
|
||||
- Incorrect import statements
|
||||
|
||||
3. Fix and re-validate:
|
||||
```bash
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
```
|
||||
|
||||
**If undefined reference:**
|
||||
|
||||
1. Check for missing imports:
|
||||
```bash
|
||||
stella policy analyze --file <policy-file> --show-imports
|
||||
```
|
||||
|
||||
2. Verify data references exist:
|
||||
```bash
|
||||
stella policy data show
|
||||
```
|
||||
|
||||
3. Add missing imports or data definitions
|
||||
|
||||
**If type error:**
|
||||
|
||||
1. Check type mismatches:
|
||||
```bash
|
||||
stella policy typecheck --file <policy-file> --verbose
|
||||
```
|
||||
|
||||
2. Common type issues:
|
||||
- Comparing incompatible types
|
||||
- Invalid function arguments
|
||||
- Missing type annotations
|
||||
|
||||
**If OPA version incompatibility:**
|
||||
|
||||
1. Check Rego version features used:
|
||||
```bash
|
||||
stella policy analyze --file <policy-file> --show-features
|
||||
```
|
||||
|
||||
2. Update policy to use compatible features or upgrade OPA
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Validate fixed policy
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
|
||||
# Test policy compilation
|
||||
stella policy compile --file <fixed-policy.rego>
|
||||
|
||||
# Deploy policy
|
||||
stella policy deploy --file <fixed-policy.rego>
|
||||
|
||||
# Test policy evaluation
|
||||
stella policy evaluate --test
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **CI/CD:** Add policy validation to CI pipeline before deployment
|
||||
- [ ] **Linting:** Run `stella policy lint` on all policy changes
|
||||
- [ ] **Testing:** Write unit tests for policies with `stella policy test`
|
||||
- [ ] **Staging:** Deploy to staging environment before production
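The CI/CD and linting items above can be sketched as a single pre-deployment gate. This is a minimal sketch, assuming policies are stored as `.rego` files in one directory and that `stella policy validate` and `stella policy lint` exit non-zero on failure; the function name `validate_policies` is hypothetical.

```bash
# validate_policies DIR
# Runs validate and lint on every .rego file in DIR (assumed layout).
# Returns non-zero if any file fails either check.
validate_policies() {
  dir=$1
  failed=0
  for f in "$dir"/*.rego; do
    [ -e "$f" ] || continue   # glob matched nothing
    if ! stella policy validate --file "$f"; then
      echo "validate failed: $f" >&2
      failed=1
    fi
    if ! stella policy lint --file "$f"; then
      echo "lint failed: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a pipeline step, `validate_policies policies/ || exit 1` (with `policies/` as a placeholder path) would stop the deployment when any file fails.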
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-evaluation-slow.md`
|
||||
- **Rego reference:** https://www.openpolicyagent.org/docs/latest/policy-language/
|
||||
- **Policy testing:** `docs/modules/policy/testing.md`
|
||||
174
docs/operations/runbooks/policy-evaluation-slow.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Policy Engine - Evaluation Latency High
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.evaluation-latency` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
|
||||
- [ ] Gate decisions timing out in CI/CD pipelines
|
||||
- [ ] Alert `PolicyEvaluationSlow` firing
|
||||
- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
|
||||
- [ ] Users report "policy check taking too long"
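The warning and critical thresholds in the first symptom can be expressed as a small classifier. This is a sketch only, assuming latency is supplied as an integer number of milliseconds; `classify_latency` is a hypothetical helper and not part of the `stella` CLI.

```bash
# classify_latency MS
# Prints "critical" above 2000 ms, "warning" above 500 ms, otherwise "ok".
classify_latency() {
  ms=$1
  if [ "$ms" -gt 2000 ]; then
    echo critical
  elif [ "$ms" -gt 500 ]; then
    echo warning
  else
    echo ok
  fi
}
```

The comparisons are strict, matching the ">500ms" and ">2s" wording above, so exactly 500 ms still counts as `ok`.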
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
|
||||
| **Data integrity** | No data loss; decisions are still correct |
|
||||
| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.evaluation-latency
|
||||
```
|
||||
|
||||
2. **Check policy engine status:**
|
||||
```bash
|
||||
stella policy status
|
||||
```
|
||||
|
||||
3. **Check recent evaluation times:**
|
||||
```bash
|
||||
stella policy stats --last 10m
|
||||
```
|
||||
Look for: P95 latency, cache hit rate
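When only raw evaluation durations are available (for example, exported from logs), P95 can be computed by hand. The sketch below uses the nearest-rank method and assumes one numeric duration per line on stdin; `p95` is a hypothetical helper, and the value reported by `stella policy stats` may be calculated differently.

```bash
# p95: reads one numeric duration per line on stdin and prints the
# 95th percentile using the nearest-rank method (rank = ceil(0.95 * N)).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((95 * NR + 99) / 100)   # integer ceiling of 0.95 * NR
      print v[rank]
    }'
}
```

The integer ceiling avoids floating-point rounding surprises when the sample count is a multiple of 20.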
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Profile a slow evaluation:**
|
||||
```bash
|
||||
stella policy evaluate --image <image-ref> --profile
|
||||
```
|
||||
Look for: Which phase is slowest (parse, compile, execute)
|
||||
|
||||
2. **Check OPA compilation cache:**
|
||||
```bash
|
||||
stella policy cache stats
|
||||
```
|
||||
Problem if: Cache hit rate < 90%
|
||||
|
||||
3. **Check policy complexity:**
|
||||
```bash
|
||||
stella policy analyze --complexity
|
||||
```
|
||||
Problem if: Cyclomatic complexity > 50 or rule count > 200
|
||||
|
||||
4. **Check external data fetches:**
|
||||
```bash
|
||||
stella policy logs --filter "external fetch" --level debug
|
||||
```
|
||||
Problem if: Many external fetches or slow responses
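The "Problem if" thresholds from steps 2 and 3 can be checked together once the numbers have been read off the command output. This is a sketch under the assumption that the values are passed in manually; the output format of `stella policy cache stats` and `stella policy analyze --complexity` is not specified here, so no parsing is attempted. `check_policy_thresholds` is a hypothetical name.

```bash
# check_policy_thresholds HIT_RATE_PERCENT COMPLEXITY RULE_COUNT
# Prints one line per violated threshold; returns non-zero if any is violated.
check_policy_thresholds() {
  hit_rate=$1
  complexity=$2
  rules=$3
  status=0
  # awk handles fractional hit rates such as 87.5
  if awk -v r="$hit_rate" 'BEGIN { exit !(r < 90) }'; then
    echo "cache hit rate ${hit_rate}% is below 90%"
    status=1
  fi
  if [ "$complexity" -gt 50 ]; then
    echo "cyclomatic complexity ${complexity} exceeds 50"
    status=1
  fi
  if [ "$rules" -gt 200 ]; then
    echo "rule count ${rules} exceeds 200"
    status=1
  fi
  return "$status"
}
```

For example, `check_policy_thresholds 87.5 60 120` reports the hit rate and the complexity but not the rule count.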
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Clear and warm the compilation cache:**
|
||||
```bash
|
||||
stella policy cache clear
|
||||
stella policy cache warm
|
||||
```
|
||||
|
||||
2. **Increase OPA worker count:**
|
||||
```bash
|
||||
stella policy config set opa.workers 4
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. **Enable evaluation result caching:**
|
||||
```bash
|
||||
stella policy config set cache.evaluation_ttl 60s
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If policy is too complex:**
|
||||
|
||||
1. Analyze and simplify policy:
|
||||
```bash
|
||||
stella policy analyze --suggest-optimizations
|
||||
```
|
||||
|
||||
2. Split large policies into modules:
|
||||
```bash
|
||||
stella policy refactor --auto-split
|
||||
```
|
||||
|
||||
**If external data fetches are slow:**
|
||||
|
||||
1. Increase external data cache TTL:
|
||||
```bash
|
||||
stella policy config set external_data.cache_ttl 5m
|
||||
```
|
||||
|
||||
2. Pre-fetch external data:
|
||||
```bash
|
||||
stella policy external-data prefetch
|
||||
```
|
||||
|
||||
**If Rego compilation is slow:**
|
||||
|
||||
1. Enable partial evaluation:
|
||||
```bash
|
||||
stella policy config set opa.partial_eval true
|
||||
```
|
||||
|
||||
2. Pre-compile policies:
|
||||
```bash
|
||||
stella policy compile --all
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Run evaluation and check latency
|
||||
stella policy evaluate --image <image-ref> --timing
|
||||
|
||||
# Check P95 latency
|
||||
stella policy stats --last 5m
|
||||
|
||||
# Verify cache is effective
|
||||
stella policy cache stats
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Review:** Review policy complexity before deployment
|
||||
- [ ] **Monitoring:** Alert on P95 latency > 300ms
|
||||
- [ ] **Caching:** Ensure evaluation cache is enabled
|
||||
- [ ] **Pre-warming:** Add cache warming to deployment pipeline
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Policy Engine
|
||||
205
docs/operations/runbooks/policy-opa-crash.md
Normal file
@@ -0,0 +1,205 @@
|
||||
# Runbook: Policy Engine - OPA Process Crashed
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.opa-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluations failing with "OPA unavailable" error
|
||||
- [ ] Alert `PolicyOPACrashed` firing
|
||||
- [ ] OPA process exited unexpectedly
|
||||
- [ ] Error: "connection refused" when connecting to OPA
|
||||
- [ ] Metric `policy_opa_restarts_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | All policy evaluations fail; gate decisions blocked |
|
||||
| **Data integrity** | No data loss; decisions delayed until OPA recovers |
|
||||
| **SLA impact** | Gate latency SLO violated; release pipeline blocked |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.opa-health
|
||||
```
|
||||
|
||||
2. **Check OPA process status:**
|
||||
```bash
|
||||
stella policy status
|
||||
```
|
||||
Look for: OPA process state, restart count
|
||||
|
||||
3. **Check OPA logs for crash reason:**
|
||||
```bash
|
||||
stella policy opa logs --last 30m --level error
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check OPA memory usage before crash:**
|
||||
```bash
|
||||
stella policy stats --opa-metrics
|
||||
```
|
||||
Problem if: Memory usage near limit before crash
|
||||
|
||||
2. **Check for problematic policy:**
|
||||
```bash
|
||||
stella policy list --last-error
|
||||
```
|
||||
Look for: Policies that caused evaluation errors
|
||||
|
||||
3. **Check OPA configuration:**
|
||||
```bash
|
||||
stella policy opa config show
|
||||
```
|
||||
Look for: Invalid configuration, missing bundles
|
||||
|
||||
4. **Check for infinite loops in Rego:**
|
||||
```bash
|
||||
stella policy analyze --detect-loops
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Restart OPA process:**
|
||||
```bash
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
2. **If OPA keeps crashing, start in safe mode:**
|
||||
```bash
|
||||
stella policy opa start --safe-mode
|
||||
```
|
||||
Note: Safe mode disables custom policies
|
||||
|
||||
3. **Enable fail-open temporarily (if allowed by compliance policy):**
|
||||
```bash
|
||||
stella policy config set failopen true
|
||||
stella policy reload
|
||||
```
|
||||
**Warning:** Only use if compliance allows fail-open mode
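Steps 1 and 2 above can be combined into a restart-then-fallback routine. This is a minimal sketch, assuming `stella policy opa restart` and `stella policy opa health` exit non-zero on failure; the function name and the default of three attempts are hypothetical. Fail-open is deliberately left out because it requires a compliance decision.

```bash
# restart_opa_with_fallback [ATTEMPTS]
# Restarts OPA up to ATTEMPTS times (default 3); if it never becomes
# healthy, starts it in safe mode, which disables custom policies.
restart_opa_with_fallback() {
  attempts=${1:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if stella policy opa restart && stella policy opa health; then
      echo "OPA healthy after attempt $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "OPA still unhealthy after $attempts attempts; starting in safe mode" >&2
  stella policy opa start --safe-mode
}
```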
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If OOM killed:**
|
||||
|
||||
1. Increase OPA memory limit:
|
||||
```bash
|
||||
stella policy opa config set memory_limit 2Gi
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
2. Enable garbage collection tuning:
|
||||
```bash
|
||||
stella policy opa config set gc_min_heap_size 256Mi
|
||||
stella policy opa config set gc_max_heap_size 1Gi
|
||||
```
|
||||
|
||||
**If policy caused crash:**
|
||||
|
||||
1. Identify problematic policy:
|
||||
```bash
|
||||
stella policy list --status error
|
||||
```
|
||||
|
||||
2. Disable the problematic policy:
|
||||
```bash
|
||||
stella policy disable <policy-id>
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
3. Fix and re-enable:
|
||||
```bash
|
||||
stella policy validate --file <fixed-policy.rego>
|
||||
stella policy update <policy-id> --file <fixed-policy.rego>
|
||||
stella policy enable <policy-id>
|
||||
```
|
||||
|
||||
**If bundle loading failed:**
|
||||
|
||||
1. Check bundle integrity:
|
||||
```bash
|
||||
stella policy bundle verify
|
||||
```
|
||||
|
||||
2. Rebuild bundle:
|
||||
```bash
|
||||
stella policy bundle build --output bundle.tar.gz
|
||||
stella policy bundle load bundle.tar.gz
|
||||
```
|
||||
|
||||
**If configuration issue:**
|
||||
|
||||
1. Reset to default configuration:
|
||||
```bash
|
||||
stella policy opa config reset
|
||||
```
|
||||
|
||||
2. Reconfigure with validated settings:
|
||||
```bash
|
||||
stella policy opa config set workers 4
|
||||
stella policy opa config set decision_log true
|
||||
stella policy opa restart
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check OPA is running
|
||||
stella policy status
|
||||
|
||||
# Check OPA health
|
||||
stella policy opa health
|
||||
|
||||
# Test policy evaluation
|
||||
stella policy evaluate --test
|
||||
|
||||
# Check no crashes in recent logs
|
||||
stella policy opa logs --level error --last 30m
|
||||
|
||||
# Monitor stability
|
||||
stella policy stats --watch
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Resources:** Set appropriate memory limits based on policy complexity
|
||||
- [ ] **Validation:** Validate all policies before deployment
|
||||
- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
|
||||
- [ ] **Testing:** Load test policies before production deployment
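The monitoring threshold above (more than 2 restarts in 10 minutes) can be evaluated from a list of restart times. This is a sketch assuming restart events are available as Unix epoch seconds, one per line; how those timestamps are obtained is not covered here, and `restarts_in_window` is a hypothetical helper.

```bash
# restarts_in_window NOW WINDOW_SECONDS
# Reads restart timestamps (epoch seconds) on stdin and prints how many
# fall inside the interval (NOW - WINDOW_SECONDS, NOW].
restarts_in_window() {
  awk -v now="$1" -v win="$2" \
    '$1 > now - win && $1 <= now { n++ } END { print n + 0 }'
}
```

With a 600-second window, a result greater than 2 corresponds to the alert condition in the checklist.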
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/architecture.md`
|
||||
- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
|
||||
- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/
|
||||
178
docs/operations/runbooks/policy-storage-unavailable.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Runbook: Policy Engine - Policy Storage Backend Down
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.storage-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy operations failing with "storage unavailable"
|
||||
- [ ] Alert `PolicyStorageUnavailable` firing
|
||||
- [ ] Error: "failed to connect to policy store" or "database connection refused"
|
||||
- [ ] Policy updates not persisting
|
||||
- [ ] OPA unable to load bundles from storage
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Policy updates fail; cached policies may still work |
|
||||
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
|
||||
| **SLA impact** | Policy management blocked; evaluations use cached data |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.storage-health
|
||||
```
|
||||
|
||||
2. **Check storage connectivity:**
|
||||
```bash
|
||||
stella policy storage status
|
||||
```
|
||||
|
||||
3. **Check database health:**
|
||||
```bash
|
||||
stella db status --component policy
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check PostgreSQL connectivity:**
|
||||
```bash
|
||||
stella db ping --database policy
|
||||
```
|
||||
|
||||
2. **Check connection pool status:**
|
||||
```bash
|
||||
stella db pool-status --database policy
|
||||
```
|
||||
Problem if: Pool exhausted, connections timing out
|
||||
|
||||
3. **Check storage logs:**
|
||||
```bash
|
||||
stella policy logs --filter "storage" --level error --last 30m
|
||||
```
|
||||
|
||||
4. **Check disk space (if local storage):**
|
||||
```bash
|
||||
stella policy storage disk-usage
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Enable read-only mode (use cached policies):**
|
||||
```bash
|
||||
stella policy config set storage.read_only true
|
||||
stella policy reload
|
||||
```
|
||||
|
||||
2. **Switch to backup storage:**
|
||||
```bash
|
||||
stella policy storage failover --to backup
|
||||
```
|
||||
|
||||
3. **Restart policy service to reconnect:**
|
||||
```bash
|
||||
stella service restart policy-engine
|
||||
```
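After a restart or failover it can help to wait until storage reports healthy before declaring the mitigation successful. The sketch below polls `stella policy storage status`, assuming it exits non-zero while storage is unavailable; the function name, retry count, and delay are hypothetical defaults.

```bash
# wait_for_policy_storage [RETRIES] [DELAY_SECONDS]
# Polls storage status until it succeeds or RETRIES is exhausted.
# Returns 0 once storage responds, 1 otherwise.
wait_for_policy_storage() {
  retries=${1:-10}
  delay=${2:-5}
  n=0
  while [ "$n" -lt "$retries" ]; do
    if stella policy storage status >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_policy_storage 12 5` waits for up to about a minute.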
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If database connection issue:**
|
||||
|
||||
1. Check database status:
|
||||
```bash
|
||||
stella db status --database policy --verbose
|
||||
```
|
||||
|
||||
2. Restart database connection pool:
|
||||
```bash
|
||||
stella db pool-restart --database policy
|
||||
```
|
||||
|
||||
3. Check and increase connection limits:
|
||||
```bash
|
||||
stella db config set policy.max_connections 50
|
||||
```
|
||||
|
||||
**If disk space exhausted:**
|
||||
|
||||
1. Check storage usage:
|
||||
```bash
|
||||
stella policy storage disk-usage --verbose
|
||||
```
|
||||
|
||||
2. Clean old policy versions:
|
||||
```bash
|
||||
stella policy versions cleanup --older-than 30d
|
||||
```
|
||||
|
||||
3. Increase storage capacity
|
||||
|
||||
**If storage corruption:**
|
||||
|
||||
1. Verify storage integrity:
|
||||
```bash
|
||||
stella policy storage verify
|
||||
```
|
||||
|
||||
2. Restore from backup:
|
||||
```bash
|
||||
stella policy storage restore --from-backup latest
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Check storage status
|
||||
stella policy storage status
|
||||
|
||||
# Test write operation
|
||||
stella policy storage test-write
|
||||
|
||||
# Test policy update
|
||||
stella policy update --test
|
||||
|
||||
# Verify no errors
|
||||
stella policy logs --filter "storage" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Monitoring:** Alert on storage connection failures immediately
|
||||
- [ ] **Redundancy:** Configure backup storage for failover
|
||||
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
|
||||
- [ ] **Capacity:** Monitor disk usage and plan for growth
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/storage.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
|
||||
- **Database setup:** `docs/operations/database-configuration.md`
|
||||
195
docs/operations/runbooks/policy-version-mismatch.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Policy Engine - Policy Version Conflicts
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-003 - Policy Engine Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Policy Engine |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.policy.version-consistency` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Policy evaluation returning unexpected results
|
||||
- [ ] Alert `PolicyVersionMismatch` firing
|
||||
- [ ] Error: "policy version conflict" or "bundle version mismatch"
|
||||
- [ ] Different nodes evaluating with different policy versions
|
||||
- [ ] Inconsistent gate decisions for same artifact
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
|
||||
| **Data integrity** | Decisions may not match expected policy behavior |
|
||||
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.policy.version-consistency
|
||||
```
|
||||
|
||||
2. **Check policy version across nodes:**
|
||||
```bash
|
||||
stella policy version --all-nodes
|
||||
```
|
||||
|
||||
3. **Check active policy version:**
|
||||
```bash
|
||||
stella policy active --show-version
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Compare versions across instances:**
|
||||
```bash
|
||||
stella policy version diff --all-instances
|
||||
```
|
||||
Problem if: Different versions on different nodes
|
||||
|
||||
2. **Check bundle distribution status:**
|
||||
```bash
|
||||
stella policy bundle status --all-nodes
|
||||
```
|
||||
|
||||
3. **Check for failed deployments:**
|
||||
```bash
|
||||
stella policy deployments list --status failed --last 24h
|
||||
```
|
||||
|
||||
4. **Check OPA bundle sync:**
|
||||
```bash
|
||||
stella policy opa bundle-status
|
||||
```
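The "different versions on different nodes" condition from step 1 reduces to counting distinct versions. This is a sketch that assumes the per-node listing has been reduced to `<node> <version>` pairs, one per line; the real output format of `stella policy version --all-nodes` is not shown in this runbook, so that shape and the name `has_version_drift` are assumptions.

```bash
# has_version_drift
# Reads "<node> <version>" lines on stdin.
# Returns 0 (drift) if more than one distinct version appears, 1 otherwise.
has_version_drift() {
  distinct=$(awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' ')
  [ "$distinct" -gt 1 ]
}
```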
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Force sync to latest version:**
|
||||
```bash
|
||||
stella policy sync --force --all-nodes
|
||||
```
|
||||
|
||||
2. **Pin specific version:**
|
||||
```bash
|
||||
stella policy pin --version <version>
|
||||
stella policy sync --all-nodes
|
||||
```
|
||||
|
||||
3. **Restart policy engines to force reload:**
|
||||
```bash
|
||||
stella service restart policy-engine --all-nodes
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If bundle distribution failed:**
|
||||
|
||||
1. Check bundle storage:
|
||||
```bash
|
||||
stella policy bundle storage-status
|
||||
```
|
||||
|
||||
2. Rebuild and redistribute bundle:
|
||||
```bash
|
||||
stella policy bundle build
|
||||
stella policy bundle distribute --all-nodes
|
||||
```
|
||||
|
||||
**If node out of sync:**
|
||||
|
||||
1. Check specific node status:
|
||||
```bash
|
||||
stella policy status --node <node-id>
|
||||
```
|
||||
|
||||
2. Force node resync:
|
||||
```bash
|
||||
stella policy sync --node <node-id> --force
|
||||
```
|
||||
|
||||
3. Verify node is receiving updates:
|
||||
```bash
|
||||
stella policy bundle check-subscription --node <node-id>
|
||||
```
|
||||
|
||||
**If concurrent deployments caused conflict:**
|
||||
|
||||
1. Check deployment history:
|
||||
```bash
|
||||
stella policy deployments list --last 1h
|
||||
```
|
||||
|
||||
2. Resolve to single version:
|
||||
```bash
|
||||
stella policy resolve-conflict --to-version <version>
|
||||
```
|
||||
|
||||
3. Enable deployment locking:
|
||||
```bash
|
||||
stella policy config set deployment.locking true
|
||||
```
|
||||
|
||||
**If OPA bundle polling issue:**
|
||||
|
||||
1. Check OPA bundle configuration:
|
||||
```bash
|
||||
stella policy opa config show | grep bundle
|
||||
```
|
||||
|
||||
2. Decrease polling interval for faster sync:
|
||||
```bash
|
||||
stella policy opa config set bundle.polling.min_delay_seconds 10
|
||||
stella policy opa config set bundle.polling.max_delay_seconds 30
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify all nodes on same version
|
||||
stella policy version --all-nodes
|
||||
|
||||
# Test consistent evaluation
|
||||
stella policy evaluate --test --all-nodes
|
||||
|
||||
# Verify bundle status
|
||||
stella policy bundle status --all-nodes
|
||||
|
||||
# Check no version warnings
|
||||
stella policy logs --filter "version" --level warning --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
|
||||
- [ ] **Monitoring:** Alert on version drift between nodes
|
||||
- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
|
||||
- [ ] **Testing:** Deploy to staging before production to catch issues
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/policy/versioning.md`
|
||||
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
|
||||
- **Deployment guide:** `docs/operations/policy-deployment.md`
|
||||
371
docs/operations/runbooks/postgres-ops.md
Normal file
@@ -0,0 +1,371 @@
|
||||
# PostgreSQL Database Runbook (dev-mock ready)
|
||||
> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
|
||||
> **Task:** RUN-001 - PostgreSQL Operations Runbook
|
||||
|
||||
Status: PRODUCTION-READY (2026-01-17 UTC)
|
||||
|
||||
## Scope
|
||||
This runbook covers PostgreSQL database operations for Stella Ops deployments: monitoring, maintenance, backup/restore, and common incident handling.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight Checklist
|
||||
|
||||
### Environment Verification
|
||||
```bash
|
||||
# Check database connection
|
||||
stella db ping
|
||||
|
||||
# Verify connection pool health
|
||||
stella doctor --check check.postgres.connectivity,check.postgres.pool
|
||||
|
||||
# Check migration status
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### Metrics to Watch
|
||||
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
|
||||
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
|
||||
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
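The 80% guideline for active connections can be checked with integer arithmetic. This is a sketch assuming the active and maximum connection counts have already been read from the metrics; `pool_utilization_ok` is a hypothetical helper.

```bash
# pool_utilization_ok ACTIVE MAX
# Returns 0 if ACTIVE is strictly below 80% of MAX, 1 otherwise.
pool_utilization_ok() {
  active=$1
  max=$2
  # active / max < 80 / 100  is equivalent to  active * 100 < max * 80
  [ $((active * 100)) -lt $((max * 80)) ]
}
```

For a pool of 150 connections the limit is 120, so `pool_utilization_ok 120 150` fails while `pool_utilization_ok 100 150` passes.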
|
||||
|
||||
---
|
||||
|
||||
## Standard Procedures
|
||||
|
||||
### SP-001: Daily Health Check
|
||||
|
||||
**Frequency:** Daily or on-demand
|
||||
**Duration:** ~5 minutes
|
||||
|
||||
1. Run comprehensive health check:
|
||||
```bash
|
||||
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
|
||||
```
|
||||
|
||||
2. Review slow queries from last 24h:
|
||||
```bash
|
||||
stella db queries --slow --period 24h --limit 20
|
||||
```
|
||||
|
||||
3. Check replication status (if applicable):
|
||||
```bash
|
||||
stella db replication status
|
||||
```
|
||||
|
||||
4. Verify backup completion:
|
||||
```bash
|
||||
stella backup status --type database
|
||||
```
|
||||
|
||||
### SP-002: Connection Pool Tuning
|
||||
|
||||
**When:** Pool exhaustion alerts or high wait times
|
||||
|
||||
1. Check current pool usage:
|
||||
```bash
|
||||
stella db pool stats --detailed
|
||||
```
|
||||
|
||||
2. Identify connection-holding queries:
|
||||
```bash
|
||||
stella db queries --active --sort duration
|
||||
```
|
||||
|
||||
3. Adjust pool size (if needed):
|
||||
```bash
|
||||
# Review current settings
|
||||
stella config get Database:MaxPoolSize
|
||||
|
||||
# Increase pool size
|
||||
stella config set Database:MaxPoolSize 150
|
||||
|
||||
# Restart affected services
|
||||
stella service restart --service release-orchestrator
|
||||
```
|
||||
|
||||
4. Verify improvement:
|
||||
```bash
|
||||
stella db pool watch --duration 5m
|
||||
```
|
||||
|
||||
### SP-003: Backup and Restore
|
||||
|
||||
**Backup:**
|
||||
```bash
|
||||
# Create immediate backup
|
||||
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
|
||||
|
||||
# Verify backup
|
||||
stella backup verify --latest
|
||||
```
|
||||
|
||||
**Restore:**
|
||||
```bash
|
||||
# List available backups
|
||||
stella backup list --type database
|
||||
|
||||
# Restore a specific backup by ID (CAUTION: destructive)
|
||||
stella backup restore --id <backup-id> --confirm
|
||||
|
||||
# Verify restoration
|
||||
stella db ping
|
||||
stella db migrations status
|
||||
```
|
||||
|
||||
### SP-004: Migration Execution
|
||||
|
||||
1. Pre-migration backup:
|
||||
```bash
|
||||
stella backup create --type database --name "pre-migration"
|
||||
```
|
||||
|
||||
2. Run migrations:
|
||||
```bash
|
||||
# Dry run first
|
||||
stella db migrate --dry-run
|
||||
|
||||
# Apply migrations
|
||||
stella db migrate
|
||||
```
|
||||
|
||||
3. Verify migration success:
|
||||
```bash
|
||||
stella db migrations status
|
||||
stella doctor --check check.postgres.migrations
|
||||
```
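The three steps of SP-004 can be chained so that a failed backup or dry run prevents the migration from being applied. This is a minimal sketch, assuming each `stella` command exits non-zero on failure; the wrapper name `run_migration` is hypothetical.

```bash
# run_migration
# Backup, dry run, apply, verify. Stops at the first failing step so
# migrations are never applied without a backup and a clean dry run.
run_migration() {
  stella backup create --type database --name "pre-migration" || return 1
  if ! stella db migrate --dry-run; then
    echo "dry run failed; migrations not applied" >&2
    return 1
  fi
  stella db migrate || return 1
  stella db migrations status && stella doctor --check check.postgres.migrations
}
```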
|
||||
|
||||
---
|
||||
|
||||
## Incident Procedures
|
||||
|
||||
### INC-001: Connection Pool Exhaustion
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresPoolExhausted`
|
||||
- Error logs: "connection pool exhausted, waiting for available connection"
|
||||
- Increased request latency
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check pool status
|
||||
stella db pool stats
|
||||
|
||||
# Find long-running queries
|
||||
stella db queries --active --sort duration --limit 10
|
||||
|
||||
# Check for connection leaks
|
||||
stella db connections --by-client
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate relief** - Terminate long-running queries:
|
||||
```bash
|
||||
# Identify stuck queries
|
||||
stella db queries --active --duration ">5m"
|
||||
|
||||
# Terminate specific query (use with caution)
|
||||
stella db query terminate --pid <pid>
|
||||
```
|
||||
|
||||
2. **Scale pool** (if legitimate load):
|
||||
```bash
|
||||
stella config set Database:MaxPoolSize 200
|
||||
stella service restart --graceful
|
||||
```
|
||||
|
||||
3. **Fix leaks** (if application bug):
|
||||
- Review application logs for unclosed connections
|
||||
- Deploy fix to affected service
|
||||
|
||||
### INC-002: Slow Query Performance
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresQueryLatencyHigh`
|
||||
- P99 query latency > 500ms
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Get slow query report
|
||||
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
|
||||
|
||||
# Analyze specific query
|
||||
stella db query explain --sql "SELECT ..." --analyze
|
||||
|
||||
# Check table statistics
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Index optimization:**
|
||||
```bash
|
||||
# Get index recommendations
|
||||
stella db index suggest --table <table>
|
||||
|
||||
# Create recommended index
|
||||
stella db index create --table <table> --columns "col1,col2"
|
||||
```
|
||||
|
||||
2. **Vacuum/analyze:**
|
||||
```bash
|
||||
stella db vacuum --table <table>
|
||||
stella db analyze --table <table>
|
||||
```
|
||||
|
||||
3. **Query optimization** - Review and rewrite problematic queries
|
||||
|
||||
### INC-003: Database Connectivity Loss
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresConnectionFailed`
|
||||
- All services reporting database connection errors
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Test basic connectivity
|
||||
stella db ping
|
||||
|
||||
# Check DNS resolution
|
||||
stella network dns-lookup <db-host>
|
||||
|
||||
# Check firewall/network
|
||||
stella network test --host <db-host> --port 5432
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Network issue:**
|
||||
- Verify security groups / firewall rules
|
||||
- Check VPN/tunnel status if applicable
|
||||
- Verify DNS resolution
|
||||
|
||||
2. **Database server issue:**
|
||||
- Check PostgreSQL service status on server
|
||||
- Review PostgreSQL logs
|
||||
- Check disk space on database server
|
||||
|
||||
3. **Credential issue:**
|
||||
```bash
|
||||
stella db verify-credentials
|
||||
stella secrets rotate --scope database
|
||||
```
|
||||
|
||||
### INC-004: Disk Space Alert
|
||||
|
||||
**Symptoms:**
|
||||
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
|
||||
- Database write failures
|
||||
|
||||
**Investigation:**
|
||||
```bash
|
||||
# Check disk usage
|
||||
stella db disk-usage
|
||||
|
||||
# Find large tables
|
||||
stella db stats tables --sort size --limit 20
|
||||
|
||||
# Check for bloat
|
||||
stella db stats tables --sort bloat
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Immediate cleanup:**
|
||||
```bash
|
||||
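# Vacuum to reclaim space (CAUTION: if --full maps to PostgreSQL VACUUM FULL, it locks the table and temporarily needs extra disk space)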
|
||||
stella db vacuum --full --table <large-table>
|
||||
|
||||
# Clean old data (if retention policy allows)
|
||||
stella db prune --table evidence_artifacts --older-than 90d --dry-run
|
||||
```
|
||||
|
||||
2. **Archive old data:**
|
||||
```bash
|
||||
stella db archive --table findings_history --older-than 180d
|
||||
```
|
||||
|
||||
3. **Expand disk** (if legitimate growth):
|
||||
- Follow cloud provider procedure to expand volume
|
||||
- Resize filesystem
|
||||
|
||||
---
|
||||
|
||||
## Maintenance Windows
|
||||
|
||||
### Weekly Maintenance (Sunday 02:00 UTC)
|
||||
|
||||
1. Run vacuum analyze on all tables:
|
||||
```bash
|
||||
stella db vacuum --analyze --all-tables
|
||||
```
|
||||
|
||||
2. Update table statistics:
|
||||
```bash
|
||||
stella db analyze --all-tables
|
||||
```
|
||||
|
||||
3. Clean temporary files:
|
||||
```bash
|
||||
stella db cleanup --temp-files
|
||||
```
|
||||
|
||||
### Monthly Maintenance (First Sunday 03:00 UTC)
|
||||
|
||||
1. Full vacuum on large tables:
|
||||
```bash
|
||||
stella db vacuum --full --table findings --table verdicts
|
||||
```
|
||||
|
||||
2. Reindex if needed:
|
||||
```bash
|
||||
stella db reindex --concurrently --table findings
|
||||
```
|
||||
|
||||
3. Archive old data per retention policy:
|
||||
```bash
|
||||
stella db archive --apply-retention
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Dashboard
|
||||
|
||||
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
|
||||
|
||||
Key panels:
|
||||
- Connection pool utilization
|
||||
- Query latency percentiles
|
||||
- Disk usage trend
|
||||
- Replication lag (if applicable)
|
||||
- Active queries count
|
||||
|
||||
---
|
||||
|
||||
## Evidence Capture
|
||||
|
||||
For any incident, capture:
|
||||
```bash
|
||||
# Comprehensive database state
|
||||
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
|
||||
```
|
||||
|
||||
Bundle includes:
|
||||
- Connection stats
|
||||
- Active queries
|
||||
- Lock information
|
||||
- Table statistics
|
||||
- Recent slow query log
|
||||
- Configuration snapshot
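
The capture command above stamps the bundle name with the host's local time. A minimal sketch, assuming you would rather have names that sort consistently across hosts in different time zones: `diag_bundle_path` is a hypothetical helper (not part of the `stella` CLI) that builds the same file name pattern with a UTC timestamp.

```bash
# Hypothetical helper: build a diagnostics bundle path with a UTC timestamp.
# It only prints a path; pass the result to the --output flag shown above.
diag_bundle_path() {
  dir="${1:-/tmp}"
  # -u forces UTC; the trailing Z marks the timestamp as UTC
  printf '%s/db-diag-%s.tar.gz\n' "$dir" "$(date -u +%Y%m%dT%H%M%SZ)"
}

diag_bundle_path /tmp
```

The printed path can be substituted for the inline `$(date ...)` expression when collecting bundles from several hosts into one incident folder.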
|
||||
|
||||
---
|
||||
|
||||
## Escalation Path
|
||||
|
||||
1. **L1 (On-call):** Standard procedures, restart services
|
||||
2. **L2 (Database team):** Query optimization, schema changes
|
||||
3. **L3 (Vendor support):** Hardware/cloud platform issues
|
||||
|
||||
---
|
||||
|
||||
_Last updated: 2026-01-17 (UTC)_
|
||||
152
docs/operations/runbooks/scanner-oom.md
Normal file
@@ -0,0 +1,152 @@
|
||||
# Runbook: Scanner - Out of Memory on Large Images
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.memory-usage` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scanner worker exits with code 137 (OOM killed)
|
||||
- [ ] Scans fail consistently for specific large images
|
||||
- [ ] Error log contains "fatal error: runtime: out of memory"
|
||||
- [ ] Alert `ScannerWorkerOOM` firing
|
||||
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Large images cannot be scanned; smaller images may still work |
|
||||
| **Data integrity** | No data loss; failed scans can be retried |
|
||||
| **SLA impact** | Specific images blocked from release pipeline |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Identify the failing image:**
|
||||
```bash
|
||||
stella scanner jobs list --status failed --last 1h
|
||||
```
|
||||
|
||||
2. **Check image size:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --format json | jq '.size'
|
||||
```
|
||||
Problem if: Image size > 2GB or layer count > 100
|
||||
|
||||
3. **Check worker memory limit:**
|
||||
```bash
|
||||
stella scanner config get worker.memory_limit
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Profile memory usage during scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --profile-memory
|
||||
```
|
||||
|
||||
2. **Check SBOM generation memory:**
|
||||
```bash
|
||||
stella scanner logs --filter "sbom" --level debug --last 30m
|
||||
```
|
||||
Look for: "memory allocation failed", "heap exhausted"
|
||||
|
||||
3. **Identify memory-heavy layers:**
|
||||
```bash
|
||||
stella image layers <image-ref> --sort-by size
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase worker memory limit:**
|
||||
```bash
|
||||
stella scanner config set worker.memory_limit 8Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
2. **Enable streaming mode for large images:**
|
||||
```bash
|
||||
stella scanner config set sbom.streaming_threshold 1Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
3. **Retry the failed scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --retry
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**For consistently large images:**
|
||||
|
||||
1. Configure dedicated large-image worker pool:
|
||||
```bash
|
||||
stella scanner workers add --pool large-images --memory 16Gi --count 2
|
||||
stella scanner config set routing.large_image_threshold 2Gi
|
||||
stella scanner config set routing.large_image_pool large-images
|
||||
```
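
The routing settings above send images over the threshold to the dedicated pool. The sketch below illustrates that decision only; it is not the scanner's implementation. The `default` pool name, the strict greater-than comparison, and reading `2Gi` as 2 × 1024³ bytes are all assumptions.

```bash
# Hypothetical illustration of threshold-based routing; not scanner code.
LARGE_IMAGE_THRESHOLD_BYTES=$(( 2 * 1024 * 1024 * 1024 ))  # 2Gi, assumed binary units

# select_pool IMAGE_BYTES -> prints the pool name that would take the scan
select_pool() {
  image_bytes="$1"
  if [ "$image_bytes" -gt "$LARGE_IMAGE_THRESHOLD_BYTES" ]; then
    echo "large-images"
  else
    echo "default"   # assumed name for the regular worker pool
  fi
}

select_pool 3221225472   # 3 GiB image
select_pool 524288000    # 500 MiB image
```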
|
||||
|
||||
**For images with many small files (node_modules, etc.):**
|
||||
|
||||
1. Enable incremental SBOM mode:
|
||||
```bash
|
||||
stella scanner config set sbom.incremental_mode true
|
||||
```
|
||||
|
||||
**For base image reuse:**
|
||||
|
||||
1. Enable layer caching:
|
||||
```bash
|
||||
stella scanner config set cache.layer_dedup true
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry the previously failing scan
|
||||
stella scan image --image <image-ref>
|
||||
|
||||
# Monitor memory during scan
|
||||
stella scanner workers stats --watch
|
||||
|
||||
# Verify no OOM in recent logs
|
||||
stella scanner logs --filter "out of memory" --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
|
||||
- [ ] **Routing:** Configure large-image pool for images > 2GB
|
||||
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
|
||||
- [ ] **Documentation:** Document image size limits in user guide
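
To turn the "80% of limit" monitoring rule into a concrete byte value for an alert expression, the arithmetic can be sketched as follows. Both function names are hypothetical, and the `Ki`/`Mi`/`Gi` suffixes are assumed to be binary units.

```bash
# Hypothetical helper: convert a binary-suffixed size (Ki, Mi, Gi) to bytes.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *)   echo "$1" ;;   # no suffix: treat as plain bytes
  esac
}

# alert_threshold_bytes LIMIT [PERCENT] -> bytes at which the alert fires.
# Integer arithmetic, so the result rounds down.
alert_threshold_bytes() {
  limit_bytes="$(to_bytes "$1")"
  pct="${2:-80}"
  echo $(( $limit_bytes * $pct / 100 ))
}

alert_threshold_bytes 8Gi 80
```

For an `8Gi` limit this prints `6871947673`, the value to compare `scanner_worker_memory_usage_bytes` against.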
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Memory
|
||||
195
docs/operations/runbooks/scanner-registry-auth.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Runbook: Scanner - Registry Authentication Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team, Security team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.registry-auth` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans failing with "401 Unauthorized" or "403 Forbidden"
|
||||
- [ ] Alert `ScannerRegistryAuthFailed` firing
|
||||
- [ ] Error: "failed to authenticate with registry"
|
||||
- [ ] Error: "failed to pull image manifest"
|
||||
- [ ] Scans work for public images but fail for private images
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Cannot scan private images; release pipeline blocked |
|
||||
| **Data integrity** | No data loss; authentication issue only |
|
||||
| **SLA impact** | All scans for affected registry blocked |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.registry-auth
|
||||
```
|
||||
|
||||
2. **List configured registries:**
|
||||
```bash
|
||||
stella registry list --show-status
|
||||
```
|
||||
Look for: Registries with "auth_failed" status
|
||||
|
||||
3. **Test registry authentication:**
|
||||
```bash
|
||||
stella registry test <registry-url>
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check credential expiration:**
|
||||
```bash
|
||||
stella registry credentials show <registry-name>
|
||||
```
|
||||
Look for: Expiration date, token type
|
||||
|
||||
2. **Test with verbose output:**
|
||||
```bash
|
||||
stella registry test <registry-url> --verbose
|
||||
```
|
||||
Look for: Specific auth error message, HTTP status code
|
||||
|
||||
3. **Check scanner logs for registry authentication errors:**
|
||||
```bash
|
||||
stella scanner logs --filter "registry auth" --last 30m
|
||||
```
|
||||
|
||||
4. **Verify IAM/OIDC configuration (for cloud registries):**
|
||||
```bash
|
||||
stella registry iam-status <registry-name>
|
||||
```
|
||||
Problem if: IAM role not assumable, OIDC token expired
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Refresh credentials (for token-based auth):**
|
||||
```bash
|
||||
stella registry refresh-credentials <registry-name>
|
||||
```
|
||||
|
||||
2. **Update static credentials:**
|
||||
```bash
|
||||
stella registry update-credentials <registry-name> \
|
||||
--username <user> \
|
||||
--password <token>
|
||||
```
|
||||
|
||||
3. **For Docker Hub rate limiting, configure authenticated pulls:**
|
||||
```bash
|
||||
stella registry configure docker-hub \
|
||||
--username <user> \
|
||||
--access-token <token>
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If credentials expired:**
|
||||
|
||||
1. Generate new access token in registry (ECR, GCR, ACR, etc.)
|
||||
|
||||
2. Update credentials:
|
||||
```bash
|
||||
stella registry update-credentials <registry-name> --from-env
|
||||
```
|
||||
|
||||
3. Configure automatic token refresh:
|
||||
```bash
|
||||
stella registry config set <registry-name>.auto_refresh true
|
||||
stella registry config set <registry-name>.refresh_interval 11h
|
||||
```
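
The reasoning behind a refresh interval shorter than the token lifetime can be sketched as below. This assumes a 12-hour token lifetime (the validity period AWS documents for ECR authorization tokens; other registries differ) and a hypothetical `should_refresh` helper that is not part of the `stella` CLI.

```bash
# Hypothetical helper: succeed (exit 0) when a token should be refreshed now.
# EXPIRES_AT and NOW are Unix epoch seconds; MARGIN is how early to refresh.
should_refresh() {
  expires_at="$1"; now="$2"; margin="${3:-3600}"
  [ $(( expires_at - now )) -le "$margin" ]
}

issued=1700000000
expires=$(( issued + 12 * 3600 ))      # assumed 12h token lifetime
check_time=$(( issued + 11 * 3600 ))   # 11h after issue, matching refresh_interval

if should_refresh "$expires" "$check_time" 3600; then
  echo "refresh"
else
  echo "ok"
fi
```

With an 11-hour interval the token is renewed with one hour to spare, so a single delayed refresh does not immediately cause authentication failures.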
|
||||
|
||||
**If IAM role/policy changed (AWS ECR):**
|
||||
|
||||
1. Verify IAM role permissions:
|
||||
```bash
|
||||
stella registry iam verify <registry-name>
|
||||
```
|
||||
|
||||
2. Update IAM role ARN if changed:
|
||||
```bash
|
||||
stella registry configure ecr \
|
||||
--region <region> \
|
||||
--role-arn <arn>
|
||||
```
|
||||
|
||||
**If OIDC federation changed (GCP Artifact Registry):**
|
||||
|
||||
1. Verify service account:
|
||||
```bash
|
||||
stella registry oidc verify <registry-name>
|
||||
```
|
||||
|
||||
2. Update workload identity configuration:
|
||||
```bash
|
||||
stella registry configure gcr \
|
||||
--project <project> \
|
||||
--workload-identity-provider <provider>
|
||||
```
|
||||
|
||||
**If certificate changed (self-hosted registries):**
|
||||
|
||||
1. Update CA certificate:
|
||||
```bash
|
||||
stella registry configure <registry-name> \
|
||||
--ca-cert /path/to/ca.crt
|
||||
```
|
||||
|
||||
2. Or skip verification (not recommended for production):
|
||||
```bash
|
||||
stella registry configure <registry-name> \
|
||||
--insecure-skip-verify
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Test authentication
|
||||
stella registry test <registry-url>
|
||||
|
||||
# Test scanning a private image
|
||||
stella scan image --image <registry-url>/<image>:<tag> --dry-run
|
||||
|
||||
# Verify no auth failures in recent logs
|
||||
stella scanner logs --filter "auth" --level error --last 30m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Credentials:** Use service accounts/workload identity instead of static tokens
|
||||
- [ ] **Rotation:** Configure automatic token refresh before expiration
|
||||
- [ ] **Monitoring:** Alert on authentication failure rate > 0
|
||||
- [ ] **Documentation:** Document registry credential management procedures
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/registry-auth.md`
|
||||
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
|
||||
- **Registry setup:** `docs/operations/registry-configuration.md`
|
||||
188
docs/operations/runbooks/scanner-sbom-generation-failed.md
Normal file
@@ -0,0 +1,188 @@
|
||||
# Runbook: Scanner - SBOM Generation Failures
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | High |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.sbom-generation` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans completing but SBOM generation failing
|
||||
- [ ] Alert `ScannerSbomGenerationFailed` firing
|
||||
- [ ] Error: "SBOM generation failed" or "unsupported package format"
|
||||
- [ ] Partial SBOM with missing components
|
||||
- [ ] Metric `scanner_sbom_generation_failures_total` increasing
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Incomplete vulnerability coverage; missing dependencies not scanned |
|
||||
| **Data integrity** | Partial SBOM may miss vulnerabilities; attestations incomplete |
|
||||
| **SLA impact** | SBOM completeness SLO violated (target: > 95%) |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.sbom-generation
|
||||
```
|
||||
|
||||
2. **Check failed SBOM jobs:**
|
||||
```bash
|
||||
stella scanner jobs list --status sbom_failed --last 1h
|
||||
```
|
||||
|
||||
3. **Check SBOM completeness rate:**
|
||||
```bash
|
||||
stella scanner stats --sbom-metrics
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Analyze specific failure:**
|
||||
```bash
|
||||
stella scanner job details <job-id> --sbom-errors
|
||||
```
|
||||
Look for: Specific package manager or file type causing failure
|
||||
|
||||
2. **Check for unsupported ecosystems:**
|
||||
```bash
|
||||
stella sbom analyze --image <image-ref> --verbose
|
||||
```
|
||||
Look for: "unsupported", "unknown package format", "parsing failed"
|
||||
|
||||
3. **Check scanner plugin status:**
|
||||
```bash
|
||||
stella scanner plugins list --status
|
||||
```
|
||||
Problem if: Package manager plugin disabled or erroring
|
||||
|
||||
4. **Check for corrupted package files:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --check-integrity
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Enable fallback SBOM generation:**
|
||||
```bash
|
||||
stella scanner config set sbom.fallback_mode true
|
||||
stella scan image --image <image-ref> --sbom-fallback
|
||||
```
|
||||
|
||||
2. **Use alternative SBOM generator:**
|
||||
```bash
|
||||
stella sbom generate --image <image-ref> --generator syft --output sbom.json
|
||||
```
|
||||
|
||||
3. **Generate partial SBOM and continue:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --sbom-partial-ok
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If package manager not supported:**
|
||||
|
||||
1. Check supported package managers:
|
||||
```bash
|
||||
stella scanner plugins list --type package-manager
|
||||
```
|
||||
|
||||
2. Enable additional plugins:
|
||||
```bash
|
||||
stella scanner plugins enable <plugin-name>
|
||||
```
|
||||
|
||||
3. For custom package formats, add mapping:
|
||||
```bash
|
||||
stella scanner config set sbom.custom_mappings.<format> <handler>
|
||||
```
|
||||
|
||||
**If package file corrupted:**
|
||||
|
||||
1. Identify corrupted files:
|
||||
```bash
|
||||
stella image layers <image-ref> --verify-packages
|
||||
```
|
||||
|
||||
2. Report to image owner for fix
|
||||
|
||||
**If memory/resource issue during generation:**
|
||||
|
||||
1. Increase SBOM generator resources:
|
||||
```bash
|
||||
stella scanner config set sbom.memory_limit 4Gi
|
||||
stella scanner config set sbom.timeout 10m
|
||||
```
|
||||
|
||||
2. Enable streaming mode:
|
||||
```bash
|
||||
stella scanner config set sbom.streaming_mode true
|
||||
```
|
||||
|
||||
**If plugin crashed:**
|
||||
|
||||
1. Check plugin logs:
|
||||
```bash
|
||||
stella scanner plugins logs <plugin-name> --last 30m
|
||||
```
|
||||
|
||||
2. Restart plugin:
|
||||
```bash
|
||||
stella scanner plugins restart <plugin-name>
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry SBOM generation
|
||||
stella sbom generate --image <image-ref> --output sbom.json
|
||||
|
||||
# Validate SBOM completeness
|
||||
stella sbom validate --file sbom.json --check-completeness
|
||||
|
||||
# Check component count
|
||||
stella sbom stats --file sbom.json
|
||||
|
||||
# Full scan with SBOM
|
||||
stella scan image --image <image-ref>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Plugins:** Keep all package manager plugins enabled and updated
|
||||
- [ ] **Monitoring:** Alert on SBOM completeness < 90%
|
||||
- [ ] **Fallback:** Configure fallback SBOM generator for resilience
|
||||
- [ ] **Testing:** Test SBOM generation for new image types before production
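
This runbook uses two completeness figures: the SLO target (> 95%) and the alert threshold (< 90%). A minimal sketch of how the two relate, assuming completeness is measured as complete SBOMs divided by total scans (the runbook does not define the ratio); `classify_completeness` is a hypothetical helper.

```bash
# classify_completeness COMPLETE TOTAL -> "ok" | "below-slo" | "alert" | "no-data"
# Cross-multiplication keeps everything in integer arithmetic.
classify_completeness() {
  complete="$1"; total="$2"
  if [ "$total" -eq 0 ]; then
    echo "no-data"
  elif [ $(( complete * 100 )) -lt $(( total * 90 )) ]; then
    echo "alert"       # completeness < 90%
  elif [ $(( complete * 100 )) -le $(( total * 95 )) ]; then
    echo "below-slo"   # 90% <= completeness <= 95%: SLO missed, alert not yet firing
  else
    echo "ok"          # completeness > 95%
  fi
}

classify_completeness 92 100
```

The `below-slo` band is the window in which the SLO is already violated but the alert has not fired, which is worth knowing when reviewing dashboards after an incident.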
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/sbom-generation.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
|
||||
- **SBOM formats:** `docs/formats/sbom-spdx.md`, `docs/formats/sbom-cyclonedx.md`
|
||||
174
docs/operations/runbooks/scanner-timeout.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Scanner - Scan Timeout on Complex Images
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | Medium |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.timeout-rate` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scans failing with "timeout exceeded" error
|
||||
- [ ] Alert `ScannerTimeoutExceeded` firing
|
||||
- [ ] Metric `scanner_scan_timeout_total` increasing
|
||||
- [ ] Specific images consistently timing out
|
||||
- [ ] Error log: "scan operation exceeded timeout of X seconds"
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | Specific images cannot be scanned; pipeline blocked |
|
||||
| **Data integrity** | No data loss; scans can be retried with adjusted settings |
|
||||
| **SLA impact** | Release pipeline delayed for affected images |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.timeout-rate
|
||||
```
|
||||
|
||||
2. **Identify failing images:**
|
||||
```bash
|
||||
stella scanner jobs list --status timeout --last 1h
|
||||
```
|
||||
Look for: Pattern in image types or sizes
|
||||
|
||||
3. **Check current timeout settings:**
|
||||
```bash
|
||||
stella scanner config get timeouts
|
||||
```
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Analyze image complexity:**
|
||||
```bash
|
||||
stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
|
||||
```
|
||||
Problem if: > 50 layers, > 100k files, or > 5GB size
|
||||
|
||||
2. **Check scanner worker load:**
|
||||
```bash
|
||||
stella scanner workers stats
|
||||
```
|
||||
Problem if: All workers at capacity during timeouts
|
||||
|
||||
3. **Profile a scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --profile --verbose
|
||||
```
|
||||
Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
|
||||
|
||||
4. **Check for filesystem-heavy images:**
|
||||
```bash
|
||||
stella image layers <image-ref> --sort-by file-count
|
||||
```
|
||||
Problem if: Single layer with > 50k files (e.g., node_modules)
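
The complexity thresholds from step 1 (> 50 layers, > 100k files, > 5GB) can be checked together once the three numbers are known. The sketch below is a hypothetical helper, not a `stella` command, and it reads "5GB" as 5 × 1024³ bytes.

```bash
# complexity_flags LAYERS FILES SIZE_BYTES
# Prints the thresholds the image exceeds, or "none".
complexity_flags() {
  layers="$1"; files="$2"; size_bytes="$3"
  flags=""
  if [ "$layers" -gt 50 ]; then flags="$flags layers"; fi
  if [ "$files" -gt 100000 ]; then flags="$flags files"; fi
  if [ "$size_bytes" -gt $(( 5 * 1024 * 1024 * 1024 )) ]; then flags="$flags size"; fi
  flags="${flags# }"        # trim the leading space
  echo "${flags:-none}"
}

complexity_flags 60 120000 1073741824
```

An image that trips more than one flag is a stronger candidate for the root cause fixes than for a simple timeout increase.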
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Increase timeout for specific image:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --timeout 30m
|
||||
```
|
||||
|
||||
2. **Increase global scan timeout:**
|
||||
```bash
|
||||
stella scanner config set timeouts.scan 20m
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
3. **Enable fast mode for initial scan:**
|
||||
```bash
|
||||
stella scan image --image <image-ref> --fast-mode
|
||||
```
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If image is too complex:**
|
||||
|
||||
1. Enable incremental scanning:
|
||||
```bash
|
||||
stella scanner config set scan.incremental_mode true
|
||||
```
|
||||
|
||||
2. Configure layer caching:
|
||||
```bash
|
||||
stella scanner config set cache.layer_dedup true
|
||||
stella scanner config set cache.sbom_cache true
|
||||
```
|
||||
|
||||
**If filesystem is too large:**
|
||||
|
||||
1. Enable streaming SBOM generation:
|
||||
```bash
|
||||
   stella scanner config set sbom.streaming_threshold 500Mi
|
||||
```
|
||||
|
||||
2. Configure file sampling for massive images:
|
||||
```bash
|
||||
stella scanner config set sbom.file_sample_max 100000
|
||||
```
|
||||
|
||||
**If vulnerability matching is slow:**
|
||||
|
||||
1. Enable parallel matching:
|
||||
```bash
|
||||
stella scanner config set vuln.parallel_matching true
|
||||
stella scanner config set vuln.match_workers 4
|
||||
```
|
||||
|
||||
2. Optimize vulnerability database indexes:
|
||||
```bash
|
||||
stella db optimize --component scanner
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Retry the previously failing scan
|
||||
stella scan image --image <image-ref> --timeout 30m
|
||||
|
||||
# Monitor scan progress
|
||||
stella scanner jobs watch <job-id>
|
||||
|
||||
# Verify no timeouts in recent scans
|
||||
stella scanner jobs list --status timeout --last 1h
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
|
||||
- [ ] **Monitoring:** Alert on timeout rate > 5%
|
||||
- [ ] **Caching:** Enable layer and SBOM caching for base images
|
||||
- [ ] **Documentation:** Document image size/complexity limits in user guide
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Performance
|
||||
174
docs/operations/runbooks/scanner-worker-stuck.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Runbook: Scanner - Worker Not Processing Jobs
|
||||
|
||||
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
|
||||
> **Task:** RUN-002 - Scanner Runbooks
|
||||
|
||||
## Metadata
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Component** | Scanner |
|
||||
| **Severity** | Critical |
|
||||
| **On-call scope** | Platform team |
|
||||
| **Last updated** | 2026-01-17 |
|
||||
| **Doctor check** | `check.scanner.worker-health` |
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
|
||||
- [ ] Scanner worker process shows 0% CPU usage
|
||||
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
|
||||
- [ ] UI shows "Scan in progress" indefinitely
|
||||
- [ ] Metric `scanner_jobs_pending` increasing over time
|
||||
|
||||
---
|
||||
|
||||
## Impact
|
||||
|
||||
| Impact Type | Description |
|
||||
|-------------|-------------|
|
||||
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
|
||||
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
|
||||
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis
|
||||
|
||||
### Quick checks (< 2 minutes)
|
||||
|
||||
1. **Check Doctor diagnostics:**
|
||||
```bash
|
||||
stella doctor --check check.scanner.worker-health
|
||||
```
|
||||
|
||||
2. **Check scanner service status:**
|
||||
```bash
|
||||
stella scanner status
|
||||
```
|
||||
Expected: "Scanner workers: 4 active, 0 idle"
|
||||
Problem: "Scanner workers: 0 active" or "status: degraded"
|
||||
|
||||
3. **Check job queue depth:**
|
||||
```bash
|
||||
stella scanner queue status
|
||||
```
|
||||
Expected: Queue depth < 50
|
||||
Problem: Queue depth > 100 or growing rapidly
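
The expected and problem ranges for queue depth leave a gap between 50 and 100. A minimal sketch of a three-way classification, assuming the gap is treated as a grey zone to watch (the runbook does not name it); `queue_state` is a hypothetical helper.

```bash
# queue_state DEPTH -> "ok" | "watch" | "problem"
queue_state() {
  depth="$1"
  if [ "$depth" -gt 100 ]; then
    echo "problem"   # above the runbook's problem threshold
  elif [ "$depth" -lt 50 ]; then
    echo "ok"        # within the expected range
  else
    echo "watch"     # 50-100: assumed grey zone, re-check for growth
  fi
}

queue_state 75
```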
|
||||
|
||||
### Deep diagnosis
|
||||
|
||||
1. **Check worker process logs:**
|
||||
```bash
|
||||
stella scanner logs --tail 100 --level error
|
||||
```
|
||||
Look for: "timeout", "connection refused", "out of memory"
|
||||
|
||||
2. **Check Valkey connectivity (job queue):**
|
||||
```bash
|
||||
stella doctor --check check.storage.valkey
|
||||
```
|
||||
|
||||
3. **Check if workers are OOM-killed:**
|
||||
```bash
|
||||
stella scanner workers inspect
|
||||
```
|
||||
Look for: "exit_code: 137" (OOM) or "exit_code: 143" (SIGTERM)
|
||||
|
||||
4. **Check resource utilization:**
|
||||
```bash
|
||||
stella obs metrics --filter scanner --last 10m
|
||||
```
|
||||
Look for: Memory > 90%, CPU sustained > 95%
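
The exit codes in step 3 follow the common shell convention that a process killed by signal N reports 128 + N, so 137 is signal 9 (SIGKILL) and 143 is signal 15 (SIGTERM). A small sketch that decodes them; `describe_exit` is a hypothetical helper, and a SIGKILL is only consistent with an OOM kill, not proof of one.

```bash
# describe_exit CODE -> human-readable reason for a worker exit code
describe_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))   # codes above 128 encode the terminating signal
    case "$sig" in
      9)  echo "killed by SIGKILL (signal 9); consistent with an OOM kill" ;;
      15) echo "terminated by SIGTERM (signal 15); usually an orderly shutdown" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

describe_exit 137
describe_exit 143
```

To confirm an OOM kill, cross-check a 137 against the memory figures from step 4 or the host's kernel log.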
|
||||
|
||||
---
|
||||
|
||||
## Resolution
|
||||
|
||||
### Immediate mitigation
|
||||
|
||||
1. **Restart scanner workers:**
|
||||
```bash
|
||||
stella scanner workers restart
|
||||
```
|
||||
This will: Terminate current workers and spawn fresh ones
|
||||
|
||||
2. **If restart fails, force restart the scanner service:**
|
||||
```bash
|
||||
stella service restart scanner
|
||||
```
|
||||
|
||||
3. **Verify workers are processing:**
|
||||
```bash
|
||||
stella scanner queue status --watch
|
||||
```
|
||||
Queue depth should start decreasing
|
||||
|
||||
### Root cause fix
|
||||
|
||||
**If workers were OOM-killed:**
|
||||
|
||||
1. Increase worker memory limit:
|
||||
```bash
|
||||
stella scanner config set worker.memory_limit 4Gi
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
2. Reduce concurrent scans per worker:
|
||||
```bash
|
||||
stella scanner config set worker.concurrency 2
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
**If Valkey connection failed:**
|
||||
|
||||
1. Check Valkey health:
|
||||
```bash
|
||||
stella doctor --check check.storage.valkey
|
||||
```
|
||||
|
||||
2. Restart Valkey if needed (see `valkey-connection-failure.md`)
|
||||
|
||||
**If workers are deadlocked:**
|
||||
|
||||
1. Enable deadlock detection:
|
||||
```bash
|
||||
stella scanner config set worker.deadlock_detection true
|
||||
stella scanner workers restart
|
||||
```
|
||||
|
||||
### Verification
|
||||
|
||||
```bash
|
||||
# Verify workers are healthy
|
||||
stella doctor --check check.scanner.worker-health
|
||||
|
||||
# Submit a test scan
|
||||
stella scan image --image alpine:latest --dry-run
|
||||
|
||||
# Watch queue drain
|
||||
stella scanner queue status --watch
|
||||
|
||||
# Verify no errors in recent logs
|
||||
stella scanner logs --tail 20 --level error
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention
|
||||
|
||||
- [ ] **Alert:** Ensure `ScannerQueueBacklog` alert is configured with threshold < 100 jobs
|
||||
- [ ] **Monitoring:** Add Grafana panel for worker memory usage
|
||||
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
|
||||
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
|
||||
|
||||
---
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **Architecture:** `docs/modules/scanner/architecture.md`
|
||||
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
|
||||
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
|
||||
- **Dashboard:** Grafana > Stella Ops > Scanner Overview
|
||||
@@ -1,202 +0,0 @@
|
||||
# Product Advisory: AI Economics Moat
|
||||
ID: ADVISORY-20260116-AI-ECON-MOAT
|
||||
Status: ACTIVE
|
||||
Owner intent: Product-wide directive
|
||||
Scope: All modules, docs, sprints, and roadmap decisions
|
||||
|
||||
## 0) Thesis (why this advisory exists)
|
||||
|
||||
In AI economics, code is cheap, software is expensive.
|
||||
|
||||
Competitors (and future competitors) can produce large volumes of code quickly. Stella Ops must remain hard to catch by focusing on the parts that are still expensive:
|
||||
- trust
|
||||
- operability
|
||||
- determinism
|
||||
- evidence integrity
|
||||
- low-touch onboarding
|
||||
- low support burden at scale
|
||||
|
||||
This advisory defines the product-level objectives and non-negotiable standards that make Stella Ops defensible against "code producers".
|
||||
|
||||
## 1) Product positioning (the class we must win)
|
||||
|
||||
Stella Ops Suite must be "best in class" for:
|
||||
|
||||
Evidence-grade release orchestration for containerized applications outside Kubernetes.
|
||||
|
||||
Stella is NOT attempting to be:
|
||||
- a generic CD platform (Octopus, GitLab, Jenkins replacements)
|
||||
- a generic vulnerability scanner (Trivy, Grype replacements)
|
||||
- a "platform of everything" with infinite integrations
|
||||
|
||||
The moat is the end-to-end chain:
|
||||
digest identity -> evidence -> verdict -> gate -> promotion -> audit export -> deterministic replay
|
||||
|
||||
The product wins when customers can run verified releases with minimal human labor and produce auditor-ready evidence.
|
||||
|
||||
## 2) Target customer and adoption constraint
|
||||
|
||||
Constraint: founder operates solo until ~100 paying customers.
|
||||
|
||||
Therefore, the product must be self-serve by default:
|
||||
- install must be predictable
|
||||
- failures must be diagnosable without maintainer time
|
||||
- docs must replace support
|
||||
- "Doctor" must replace debugging sessions
|
||||
|
||||
Support must be an exception, not a workflow.
|
||||
|
||||
## 3) The five non-negotiable product invariants
|
||||
|
||||
Every meaningful product change MUST preserve and strengthen these invariants:
|
||||
|
||||
I1. Evidence-grade by design
|
||||
- Every verified decision has an evidence trail.
|
||||
- Evidence is exportable, replayable, and verifiable.
|
||||
|
||||
I2. Deterministic replay
|
||||
- Same inputs -> same outputs.
|
||||
- A verdict can be reproduced and verified later, not just explained.
|
||||
|
||||
I3. Digest-first identity
|
||||
- Releases are immutable digests, not mutable tags.
|
||||
- "What is deployed where" is anchored to digests.
|
||||
|
||||
I4. Offline-first posture
|
||||
- Air-gapped and low-egress environments must remain first-class.
|
||||
- No hidden network dependencies in core flows.
|
||||
|
||||
I5. Low-touch operability
|
||||
- Misconfigurations fail fast at startup with clear messages.
|
||||
- Runtime failures have deterministic recovery playbooks.
|
||||
- Doctor provides actionable diagnostics bundles and remediation steps.
|
||||
|
||||
If a proposed feature weakens any invariant, it must be rejected or redesigned.
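
Invariants I2 and I3 can be illustrated with a toy example: if a verdict's inputs are canonicalized before hashing, the same inputs always yield the same digest regardless of the order in which they were gathered. This is a sketch under stated assumptions, not how Stella Ops computes evidence digests; `evidence_digest` and the `key=value` input format are hypothetical, and it relies on the coreutils `sha256sum` tool being available.

```bash
# Hypothetical illustration: hash a set of inputs independent of argument order.
evidence_digest() {
  # One input per line, sorted bytewise so ordering cannot affect the hash
  printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -d' ' -f1
}

a="$(evidence_digest 'image=sha256:abc' 'policy=v3' 'feed=2026-01-16')"
b="$(evidence_digest 'policy=v3' 'feed=2026-01-16' 'image=sha256:abc')"

if [ "$a" = "$b" ]; then
  echo "replay matches"
else
  echo "replay differs"
fi
```

Changing any single input (for example, a different feed snapshot) changes the digest, which is what makes a replayed verdict verifiable rather than merely explainable.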
|
||||
|
||||
## 4) Moats we build (how Stella stays hard to catch)
|
||||
|
||||
M1. Evidence chain continuity (no "glue work" required)
|
||||
- Scan results, reachability proofs, policy evaluation, approvals, promotions, and exports are one continuous chain.
|
||||
- Do not require customers to stitch multiple tools together to get audit-grade releases.
|
||||
|
||||
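The continuity in M1 can be pictured as a hash chain in which each stage record commits to the digest of the previous one. A sketch under that assumption; the record layout and helper names are invented for illustration and do not describe the product's storage format.

```python
import hashlib
import json


def _digest(obj) -> str:
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_link(chain: list, stage: str, payload: dict) -> list:
    """Return a new chain with one more record bound to its predecessor."""
    prev = chain[-1]["digest"] if chain else None
    body = {"stage": stage, "payload": payload, "prev": prev}
    return chain + [{**body, "digest": _digest(body)}]


def verify_chain(chain: list) -> bool:
    """Check that every record is intact and points at the record before it."""
    prev = None
    for link in chain:
        body = {"stage": link["stage"], "payload": link["payload"], "prev": link["prev"]}
        if link["prev"] != prev or link["digest"] != _digest(body):
            return False
        prev = link["digest"]
    return True
```

Because each digest covers the previous one, altering or dropping an early record (say, a scan result) invalidates everything after it, which is what lets an auditor verify the chain without trusting the tool that produced it.
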
M2. Explainability with proof, not narrative
- "Why blocked?" must produce a deterministic trace + referenced evidence artifacts.
- The answer must be replayable, not a one-time explanation.

M3. Operability moat (Doctor + safe defaults)
- Diagnostics must identify root cause, not just symptoms.
- Provide deterministic checklists and fixes.
- Every integration must ship with health checks and failure-mode docs.

M4. Controlled surface area (reduce permutations)
- Ship a small number of Tier-1 golden integrations and targets.
- Keep the plugin system as an escape valve, but do not expand the maintained matrix beyond what solo operations can support.

M5. Standards-grade outputs with stable schemas
- SBOM, VEX, attestations, exports, and decision records must be stable, versioned, and backwards compatible where promised.
- Stability is a moat: auditors and platform teams adopt what they can depend on.

## 5) Explicit non-goals (what to reject quickly)

Reject or de-prioritize proposals that primarily:
- add a generic CD surface without evidence and determinism improvements
- expand integrations broadly without a "Tier-1" support model and diagnostics coverage
- compete on raw scanner breadth rather than evidence-grade gating outcomes
- add UI polish that does not reduce operator labor or support load
- add "AI features" that create nondeterminism or require external calls in core paths

If a feature does not strengthen at least one moat (M1-M5), it is likely not worth shipping now.

## 6) Agent review rubric (use this to evaluate any proposal, advisory, or sprint)

When reviewing any new idea, feature request, PRD, or sprint, score it against:

A) Moat impact (required)
- Which moat does it strengthen (M1-M5)?
- What measurable operator/auditor outcome improves?

B) Support burden risk (critical)
- Does this increase the probability of support tickets?
- Does Doctor cover the new failure modes?
- Are there clear runbooks and error messages?

C) Determinism and evidence risk (critical)
- Does this introduce nondeterminism?
- Are outputs stable, canonical, and replayable?
- Does it weaken evidence chain integrity?

D) Permutation risk (critical)
- Does this increase the matrix of supported combinations?
- Can it be constrained to a "golden path" configuration?

E) Time-to-value impact (important)
- Does this reduce time to first verified release?
- Does it reduce time to answer "why blocked"?

If a proposal scores poorly on B/C/D, it must be redesigned or rejected.

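The rubric can be applied mechanically. A hedged sketch, assuming each criterion is scored 0 (poor) to 2 (good); the scale, field names, and outcome strings are illustrative assumptions, not part of the advisory.

```python
from dataclasses import dataclass

# Criteria B, C, and D are the ones the rubric marks as critical.
CRITICAL = ("support_burden", "determinism", "permutation")


@dataclass(frozen=True)
class Review:
    moats: tuple          # criterion A: which of "M1".."M5" the proposal strengthens
    support_burden: int   # criterion B
    determinism: int      # criterion C
    permutation: int      # criterion D
    time_to_value: int    # criterion E


def decide(review: Review) -> str:
    """Apply the rubric: A is required, and a poor B, C, or D forces a redesign."""
    if not review.moats:
        return "deprioritize: strengthens no moat (A)"
    failing = [name for name in CRITICAL if getattr(review, name) == 0]
    if failing:
        return "redesign or reject: poor score on " + ", ".join(failing)
    return "accept for planning"
```

Note that a strong time-to-value score (E) never rescues a proposal here; it only helps rank proposals that already pass A through D.
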
## 7) Definition of Done (feature-level) - do not ship without the boring parts

Any shippable feature must include, at minimum:

DOD-1: Operator story
- Clear user story for operators and auditors, not just developers.

DOD-2: Failure modes and recovery
- Documented expected failures, error codes/messages, and remediation steps.
- Doctor checks added or extended to cover the common failure paths.

DOD-3: Determinism and evidence
- Deterministic outputs where applicable.
- Evidence artifacts linked to decisions.
- Replay or verify path exists if the feature affects verdicts or gates.

DOD-4: Tests
- Unit tests for logic (happy + edge cases).
- Integration tests for contracts (DB, queues, storage where used).
- Determinism tests when outputs are serialized, hashed, or signed.

DOD-5: Documentation
- Docs updated where the feature changes behavior or contracts.
- Include copy/paste examples for the golden path usage.

DOD-6: Observability
- Structured logs and metrics for success/failure paths.
- Explicit "reason codes" for gate decisions and failures.

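For DOD-6, a structured gate event with explicit reason codes might look like the following. This is a sketch only: the code names, the event shape, and the `gate.decision` label are hypothetical, and a real catalogue would presumably be versioned alongside the export schemas.

```python
import json
from enum import Enum


class Reason(str, Enum):
    # Hypothetical reason codes, chosen for illustration.
    POLICY_CRITICAL_REACHABLE = "POLICY_CRITICAL_REACHABLE"
    EVIDENCE_MISSING = "EVIDENCE_MISSING"
    SIGNATURE_INVALID = "SIGNATURE_INVALID"


def gate_event(artifact_digest: str, allowed: bool, reasons=()) -> str:
    """Render one gate decision as a canonical, machine-readable log line."""
    if not allowed and not reasons:
        raise ValueError("a blocked gate decision must carry at least one reason code")
    event = {
        "event": "gate.decision",
        "artifact": artifact_digest,
        "outcome": "allow" if allowed else "block",
        "reasons": sorted(r.value for r in reasons),
    }
    return json.dumps(event, sort_keys=True, separators=(",", ":"))
```

Sorting the reasons and keys keeps the line identical across runs, so the same decision always logs the same bytes and can be diffed or hashed like any other evidence artifact.
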
If the feature cannot afford these, it cannot afford to exist in a solo-scaled product.

## 8) Product-level metrics (what we optimize)

These metrics are the scoreboard. Prioritize work that improves them.

P0 metrics (most important):
- Time-to-first-verified-release (fresh install -> verified promotion)
- Mean time to answer "why blocked?" (with proof)
- Support minutes per customer per month (must trend toward near-zero)
- Determinism regressions per release (must be near-zero)

P1 metrics:
- Noise reduction ratio (reachable actionable findings vs raw findings)
- Audit export acceptance rate (auditors can consume without manual reconstruction)
- Upgrade success rate (low-friction updates, predictable migrations)

## 9) Immediate product focus areas implied by this advisory

When unsure what to build next, prefer investments in:
- Doctor: diagnostics coverage, fix suggestions, bundles, and environment validation
- Golden path onboarding: install -> connect -> scan -> gate -> promote -> export
- Determinism gates in CI and runtime checks for canonical outputs
- Evidence export bundles that map to common audit needs
- "Why blocked" trace quality, completeness, and replay verification

Avoid "breadth expansion" unless it includes full operability coverage.

## 10) How to apply this advisory in planning

When processing this advisory:
- Ensure docs reflect the invariants and moats at the product overview level.
- Ensure sprints and tasks reference which moat they strengthen (M1-M5).
- If a sprint increases complexity without decreasing operator labor or improving evidence integrity, treat it as suspect.

Archive this advisory only if it is superseded by a newer product-wide directive.