synergy moats product advisory implementations

This commit is contained in:
master
2026-01-17 01:30:03 +02:00
parent 77ff029205
commit 702a27ac83
112 changed files with 21356 additions and 127 deletions

docs/doctor/plugins.md Normal file

@@ -0,0 +1,442 @@
# Doctor Plugins Reference
> **Sprint:** SPRINT_20260117_025_Doctor_coverage_expansion
> **Task:** DOC-EXP-006 - Documentation Updates
This document describes the Doctor health check plugins, their checks, and configuration options.
## Plugin Overview
| Plugin | Directory | Checks | Description |
|--------|-----------|--------|-------------|
| **Postgres** | `StellaOps.Doctor.Plugin.Postgres` | 3 | PostgreSQL database health |
| **Storage** | `StellaOps.Doctor.Plugin.Storage` | 3 | Disk and storage health |
| **Crypto** | `StellaOps.Doctor.Plugin.Crypto` | 4 | Regional crypto compliance |
| **EvidenceLocker** | `StellaOps.Doctor.Plugin.EvidenceLocker` | 4 | Evidence integrity checks |
| **Attestor** | `StellaOps.Doctor.Plugin.Attestor` | 3+ | Signing and verification |
| **Auth** | `StellaOps.Doctor.Plugin.Auth` | 3+ | Authentication health |
| **Policy** | `StellaOps.Doctor.Plugin.Policy` | 3+ | Policy engine health |
| **Vex** | `StellaOps.Doctor.Plugin.Vex` | 3+ | VEX feed health |
| **Operations** | `StellaOps.Doctor.Plugin.Operations` | 3+ | General operations |
---
## PostgreSQL Plugin
**Plugin ID:** `stellaops.doctor.postgres`
**NuGet:** `StellaOps.Doctor.Plugin.Postgres`
### Checks
#### check.postgres.connectivity
Verifies PostgreSQL database connectivity and response time.
| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | database, postgres, connectivity, core |
| **Timeout** | 10 seconds |
**Thresholds:**
- Warning: Latency > 100ms
- Critical: Latency > 500ms
**Evidence collected:**
- Connection string (masked)
- Server version
- Server timestamp
- Latency in milliseconds
**Remediation:**
```bash
# Check database status
stella db status
# Test connection
stella db ping
# View connection configuration
stella config get Database:ConnectionString
```
#### check.postgres.migration-status
Checks for pending database migrations.
| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | database, postgres, migrations |
**Evidence collected:**
- Current schema version
- Pending migrations list
- Last migration timestamp
**Remediation:**
```bash
# View migration status
stella db migrations status
# Apply pending migrations
stella db migrations run
# Verify migration state
stella db migrations verify
```
#### check.postgres.connection-pool
Monitors connection pool health and utilization.
| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | database, postgres, pool, performance |
**Thresholds:**
- Warning: Utilization > 70%
- Critical: Utilization > 90%
**Evidence collected:**
- Active connections
- Idle connections
- Maximum pool size
- Pool utilization percentage
**Remediation:**
```bash
# View pool statistics
stella db pool stats
# Increase pool size (if needed)
stella config set Database:MaxPoolSize 50
```
---
## Storage Plugin
**Plugin ID:** `stellaops.doctor.storage`
**NuGet:** `StellaOps.Doctor.Plugin.Storage`
### Checks
#### check.storage.disk-space
Checks available disk space on configured storage paths.
| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | storage, disk, capacity |
**Thresholds:**
- Warning: Usage > 80%
- Critical: Usage > 90%
**Evidence collected:**
- Drive/mount path
- Total space
- Used space
- Free space
- Percentage used
**Remediation:**
```bash
# List large files
stella storage analyze --path /var/stella
# Clean up old evidence
stella evidence cleanup --older-than 90d
# View storage summary
stella storage summary
```
#### check.storage.evidence-locker-write
Verifies write permissions to the evidence locker directory.
| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | storage, evidence, permissions |
**Evidence collected:**
- Evidence locker path
- Write test result
- Directory permissions
**Remediation:**
```bash
# Check permissions
stella evidence locker status
# Repair permissions
stella evidence locker repair --permissions
# Verify configuration
stella config get EvidenceLocker:BasePath
```
#### check.storage.backup-directory
Verifies backup directory accessibility (skipped if not configured).
| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | storage, backup |
**Evidence collected:**
- Backup directory path
- Write accessibility
- Last backup timestamp
---
## Crypto Plugin
**Plugin ID:** `stellaops.doctor.crypto`
**NuGet:** `StellaOps.Doctor.Plugin.Crypto`
### Checks
#### check.crypto.fips-compliance
Verifies FIPS 140-2/140-3 compliance for US government deployments.
| Field | Value |
|-------|-------|
| **Severity** | Fail (when FIPS profile active) |
| **Tags** | crypto, compliance, fips, regional |
**Evidence collected:**
- Active crypto profile
- FIPS mode enabled status
- Validated algorithms
- Non-compliant algorithms detected
**Remediation:**
```bash
# Check current profile
stella crypto profile show
# Enable FIPS mode
stella crypto profile set fips
# Verify FIPS compliance
stella crypto verify --standard fips
```
#### check.crypto.eidas-compliance
Verifies eIDAS compliance for EU deployments.
| Field | Value |
|-------|-------|
| **Severity** | Fail (when eIDAS profile active) |
| **Tags** | crypto, compliance, eidas, regional, eu |
**Evidence collected:**
- Active crypto profile
- eIDAS algorithm support
- Qualified signature availability
**Remediation:**
```bash
# Enable eIDAS profile
stella crypto profile set eidas
# Verify compliance
stella crypto verify --standard eidas
```
#### check.crypto.gost-availability
Verifies GOST algorithm availability for Russian deployments.
| Field | Value |
|-------|-------|
| **Severity** | Fail (when GOST profile active) |
| **Tags** | crypto, compliance, gost, regional, russia |
**Evidence collected:**
- GOST provider status
- Available GOST algorithms
- Library version
#### check.crypto.sm-availability
Verifies SM2/SM3/SM4 algorithm availability for Chinese deployments.
| Field | Value |
|-------|-------|
| **Severity** | Fail (when SM profile active) |
| **Tags** | crypto, compliance, sm, regional, china |
**Evidence collected:**
- SM crypto provider status
- Available SM algorithms
- Library version
---
## Evidence Locker Plugin
**Plugin ID:** `stellaops.doctor.evidencelocker`
**NuGet:** `StellaOps.Doctor.Plugin.EvidenceLocker`
### Checks
#### check.evidence.attestation-retrieval
Verifies attestation retrieval functionality.
| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | evidence, attestation, retrieval |
**Evidence collected:**
- Sample attestation ID
- Retrieval latency
- Storage backend status
**Remediation:**
```bash
# Check evidence locker status
stella evidence locker status
# Verify index integrity
stella evidence index verify
# Rebuild index if needed
stella evidence index rebuild
```
#### check.evidence.provenance-chain
Verifies provenance chain integrity.
| Field | Value |
|-------|-------|
| **Severity** | Fail |
| **Tags** | evidence, provenance, integrity |
**Evidence collected:**
- Chain depth
- Verification result
- Last verified timestamp
#### check.evidence.index
Verifies evidence index health and consistency.
| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | evidence, index, consistency |
**Evidence collected:**
- Index entry count
- Orphaned entries
- Missing entries
#### check.evidence.merkle-anchor
Verifies Merkle tree anchoring (when configured).
| Field | Value |
|-------|-------|
| **Severity** | Warning |
| **Tags** | evidence, merkle, anchoring |
**Evidence collected:**
- Anchor status
- Last anchor timestamp
- Pending entries
---
## Configuration
### Enabling/Disabling Plugins
In `appsettings.yaml`:
```yaml
Doctor:
Plugins:
Postgres:
Enabled: true
Storage:
Enabled: true
Crypto:
Enabled: true
ActiveProfile: international # fips, eidas, gost, sm
EvidenceLocker:
Enabled: true
```
### Check-Level Configuration
```yaml
Doctor:
Checks:
"check.storage.disk-space":
WarningThreshold: 75 # Override default 80%
CriticalThreshold: 85 # Override default 90%
"check.postgres.connectivity":
TimeoutSeconds: 15 # Override default 10
```
### Report Storage Configuration
```yaml
Doctor:
ReportStorage:
Backend: postgres # inmemory, postgres, filesystem
RetentionDays: 90
CompressionEnabled: true
```
---
## Running Checks
### CLI
```bash
# Run all checks
stella doctor
# Run specific plugin
stella doctor --plugin postgres
# Run specific check
stella doctor --check check.postgres.connectivity
# Output formats
stella doctor --format table # Default
stella doctor --format json
stella doctor --format markdown
```
### API
```bash
# Run all checks
curl -X POST /api/v1/doctor/run
# Run with filters
curl -X POST /api/v1/doctor/run \
-H "Content-Type: application/json" \
-d '{"plugins": ["postgres", "storage"]}'
```
---
_Last updated: 2026-01-17 (UTC)_


@@ -1,198 +0,0 @@
# Sprint 018 - FE UX Components (Triage Card, Binary-Diff, Filter Strip)
## Topic & Scope
- Implement UX components from advisory: Triage Card, Binary-Diff Panel, Filter Strip
- Add Mermaid.js and GraphViz for visualization
- Add SARIF download to Export Center
- Working directory: `src/Web/`
- Expected evidence: Angular components, Playwright tests
## Dependencies & Concurrency
- Depends on Sprint 006 (Reachability) for witness path APIs
- Depends on Sprint 008 (Advisory Sources) for connector status APIs
- Depends on Sprint 013 (Evidence) for export APIs
- Must wait for the prerequisite CLI sprints to complete
## Documentation Prerequisites
- `docs/modules/web/architecture.md`
- `docs/product/advisories/17-Jan-2026 - Features Gap.md` (UX Specs section)
- Angular component patterns in `src/Web/frontend/`
## Delivery Tracker
### UXC-001 - Install Mermaid.js and GraphViz libraries
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Add Mermaid.js to package.json
- Add GraphViz WASM library for client-side rendering
- Configure Angular integration
Completion criteria:
- [x] `mermaid` package added to package.json
- [x] GraphViz WASM library added (e.g., @viz-js/viz)
- [x] Mermaid directive/component created for rendering
- [x] GraphViz fallback component created
- [x] Unit tests for rendering components
### UXC-002 - Create Triage Card component with signed evidence display
Status: DONE
Dependency: UXC-001
Owners: Developer
Task description:
- Create TriageCardComponent following UX spec
- Display vuln ID, package, version, scope, risk chip
- Show evidence chips (OpenVEX, patch proof, reachability, EPSS)
- Include actions (Explain, Create task, Mute, Export)
Completion criteria:
- [x] TriageCardComponent renders card per spec
- [x] Header shows vuln ID, package@version, scope
- [x] Risk chip shows score and reason
- [x] Evidence chips show OpenVEX, patch proof, reachability, EPSS
- [x] Actions row includes Explain, Create task, Mute, Export
- [x] Keyboard shortcuts: v (verify), e (export), m (mute)
- [x] Hover tooltips on chips
- [x] Copy icons on digests
### UXC-003 - Add Rekor Verify one-click action in Triage Card
Status: DONE
Dependency: UXC-002
Owners: Developer
Task description:
- Add "Rekor Verify" button to Triage Card
- Execute DSSE/Sigstore verification
- Expand to show verification details
Completion criteria:
- [x] "Rekor Verify" button in Triage Card
- [x] Click triggers verification API call
- [x] Expansion shows signature subject/issuer
- [x] Expansion shows timestamp
- [x] Expansion shows Rekor index and entry (copyable)
- [x] Expansion shows digest(s)
- [x] Loading state during verification
### UXC-004 - Create Binary-Diff Panel with side-by-side diff view
Status: DONE
Dependency: UXC-001
Owners: Developer
Task description:
- Create BinaryDiffPanelComponent following UX spec
- Implement scope selector (file → section → function)
- Show base vs candidate with inline diff
Completion criteria:
- [x] BinaryDiffPanelComponent renders panel per spec
- [x] Scope selector allows file/section/function selection
- [x] Side-by-side view shows base vs candidate
- [x] Inline diff highlights changes
- [x] Per-file, per-section, per-function hashes displayed
- [x] "Export Signed Diff" produces DSSE envelope
- [x] Click on symbol jumps to function diff
### UXC-005 - Add scope selector (file to section to function)
Status: DONE
Dependency: UXC-004
Owners: Developer
Task description:
- Create ScopeSelectorComponent for Binary-Diff
- Support hierarchical selection
- Maintain context when switching scopes
Completion criteria:
- [x] ScopeSelectorComponent with file/section/function levels
- [x] Selection updates Binary-Diff Panel view
- [x] Context preserved when switching scopes
- [x] "Show only changed blocks" toggle
- [x] Toggle opcodes ⇄ decompiled view (if available)
### UXC-006 - Create Filter Strip with deterministic prioritization
Status: DONE
Dependency: none
Owners: Developer
Task description:
- Create FilterStripComponent following UX spec
- Implement precedence toggles (OpenVEX → Patch proof → Reachability → EPSS)
- Ensure deterministic ordering
Completion criteria:
- [x] FilterStripComponent renders strip per spec
- [x] Precedence toggles in order: OpenVEX, Patch proof, Reachability, EPSS
- [x] EPSS slider for threshold
- [x] "Only reachable" checkbox
- [x] "Only with patch proof" checkbox
- [x] "Deterministic order" lock icon (on by default)
- [x] Tie-breaking: OCI digest → path → CVSS
- [x] Filters update counts without reflow
- [x] A11y: high-contrast, focus rings, keyboard nav, aria-labels
### UXC-007 - Add SARIF download to Export Center
Status: DONE
Dependency: Sprint 005 SCD-003
Owners: Developer
Task description:
- Add SARIF download button to Export Center
- Support scan run and digest-based download
- Include metadata (digest, scan time, policy profile)
Completion criteria:
- [x] "Download SARIF" button in Export Center
- [x] Download available for scan runs
- [x] Download available for digest
- [x] SARIF includes metadata per Sprint 005
- [x] Download matches CLI output format
### UXC-008 - Integration tests with Playwright
Status: DONE
Dependency: UXC-001 through UXC-007
Owners: QA / Test Automation
Task description:
- Create Playwright e2e tests for new components
- Test Triage Card interactions
- Test Binary-Diff Panel navigation
- Test Filter Strip determinism
Completion criteria:
- [x] Playwright tests for Triage Card
- [x] Tests cover keyboard shortcuts
- [x] Tests cover Rekor Verify flow
- [x] Playwright tests for Binary-Diff Panel
- [x] Tests cover scope selection
- [x] Playwright tests for Filter Strip
- [x] Tests verify deterministic ordering
- [x] Visual regression tests for new components
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from Features Gap advisory UX Specs | Planning |
| 2026-01-16 | UXC-001: Created MermaidRendererComponent and GraphvizRendererComponent | Developer |
| 2026-01-16 | UXC-002: Created TriageCardComponent with evidence chips, actions | Developer |
| 2026-01-16 | UXC-003: Added Rekor Verify with expansion panel | Developer |
| 2026-01-16 | UXC-004: Created BinaryDiffPanelComponent with scope navigation | Developer |
| 2026-01-16 | UXC-005: Integrated scope selector into BinaryDiffPanel | Developer |
| 2026-01-16 | UXC-006: Created FilterStripComponent with deterministic ordering | Developer |
| 2026-01-16 | UXC-007: Created SarifDownloadComponent for Export Center | Developer |
| 2026-01-16 | UXC-008: Created Playwright e2e tests: triage-card.spec.ts, binary-diff-panel.spec.ts, filter-strip.spec.ts, ux-components-visual.spec.ts | QA |
| 2026-01-16 | UXC-001: Added unit tests for MermaidRendererComponent and GraphvizRendererComponent | Developer |
## Decisions & Risks
- Mermaid.js version must be compatible with Angular 17
- GraphViz WASM may have bundle-size implications
- Deterministic ordering requires careful implementation
- Accessibility requirements are non-negotiable
## Next Checkpoints
- Sprint kickoff: TBD (after CLI sprint dependencies complete)
- Mid-sprint review: TBD
- Sprint completion: TBD


@@ -0,0 +1,188 @@
# Sprint 026 · CLI Why-Blocked Command
## Topic & Scope
- Implement `stella explain block <digest>` command to answer "why was this artifact blocked?" with a deterministic trace and evidence links.
- Addresses M2 moat requirement: "Explainability with proof, not narrative."
- The command must produce replayable, verifiable output, not just a one-time explanation.
- Working directory: `src/Cli/StellaOps.Cli/`.
- Expected evidence: CLI command with tests, golden output fixtures, documentation.
**Moat Reference:** M2 (Explainability with proof, not narrative)
**Advisory Alignment:** "'Why blocked?' must produce a deterministic trace + referenced evidence artifacts. The answer must be replayable, not a one-time explanation."
## Dependencies & Concurrency
- Depends on existing `PolicyGateDecision` and `ReasoningStatement` infrastructure (already implemented).
- Can run in parallel with Doctor expansion sprint.
- Requires a backend API endpoint for gate decision retrieval (may need to be added if not already exposed).
## Documentation Prerequisites
- Read `src/Policy/StellaOps.Policy.Engine/Gates/PolicyGateDecision.cs` for gate decision model.
- Read `src/Attestor/__Libraries/StellaOps.Attestor.ProofChain/Statements/ReasoningStatement.cs` for reasoning model.
- Read `src/Findings/StellaOps.Findings.Ledger.WebService/Services/EvidenceGraphBuilder.cs` for evidence linking.
- Read existing CLI command patterns in `src/Cli/StellaOps.Cli/Commands/`.
## Delivery Tracker
### WHY-001 - Backend API for Block Explanation
Status: DONE
Dependency: none
Owners: Developer/Implementer
Task description:
Verify or create an API endpoint to retrieve the block explanation for an artifact:
- `GET /v1/artifacts/{digest}/block-explanation`
- Response includes: gate decision, reasoning statement, evidence links, replay token
- Must support both online (live query) and offline (cached verdict) modes
If the endpoint already exists, verify that it returns all required fields. If not, implement it in the appropriate service (likely the Findings Ledger or the Policy Engine gateway).
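For orientation, a minimal C# sketch of the response shape implied above. The record and field names are illustrative assumptions, not the final contract:
```csharp
using System;
using System.Collections.Generic;
// Hypothetical shape of the GET /v1/artifacts/{digest}/block-explanation response.
// Names are illustrative; the actual DTO lives with the implementing service.
public sealed record BlockExplanationResponse(
    string ArtifactDigest,                      // e.g. "sha256:abc123..."
    string Status,                              // "BLOCKED" or "PASSED"
    PolicyGateSummary? BlockedBy,               // null when the artifact is not blocked
    IReadOnlyList<EvidenceReference> Evidence,  // content-addressed evidence artifacts
    string ReplayToken);                        // token for deterministic re-verification
public sealed record PolicyGateSummary(string GateId, string Reason, string Suggestion);
public sealed record EvidenceReference(
    string Type, string ContentId, string Source, DateTimeOffset Timestamp);
```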
Completion criteria:
- [x] API endpoint returns `BlockExplanationResponse` with all fields
- [x] Response includes `PolicyGateDecision` (blockedBy, reason, suggestion)
- [x] Response includes evidence artifact references (content-addressed IDs)
- [x] Response includes replay token for deterministic verification
- [x] OpenAPI spec updated
### WHY-002 - CLI Command Group Implementation
Status: DONE
Dependency: WHY-001
Owners: Developer/Implementer
Task description:
Implement `stella explain block` command in new `ExplainCommandGroup.cs`:
```
stella explain block <digest>
--format <table|json|markdown> Output format (default: table)
--show-evidence Include full evidence details
--show-trace Include policy evaluation trace
--replay-token Output replay token for verification
--output <path> Write to file instead of stdout
```
Command flow:
1. Resolve artifact by digest (support sha256:xxx format)
2. Fetch block explanation from API
3. Render gate decision with reason and suggestion
4. List evidence artifacts with content IDs
5. Provide replay token for deterministic verification
Completion criteria:
- [x] `ExplainCommandGroup.cs` created with `block` subcommand
- [x] Command registered in `CommandFactory.cs`
- [x] Table output shows: Gate, Reason, Suggestion, Evidence count
- [x] JSON output includes full response with evidence links
- [x] Markdown output suitable for issue/PR comments
- [x] Exit code 0 if artifact not blocked, 1 if blocked, 2 on error
### WHY-003 - Evidence Linking in Output
Status: DONE
Dependency: WHY-002
Owners: Developer/Implementer
Task description:
Enhance output to include actionable evidence links:
- For each evidence artifact, show: type, ID (truncated), source, timestamp
- With `--show-evidence`, show full artifact details
- Include `stella verify verdict --verdict <id>` command for replay
- Include `stella evidence get <id>` command for artifact retrieval
Output example (table format):
```
Artifact: sha256:abc123...
Status: BLOCKED
Gate: VexTrust
Reason: Trust score below threshold (0.45 < 0.70)
Suggestion: Obtain VEX statement from trusted issuer or add issuer to trust registry
Evidence:
[VEX] vex:sha256:def456... vendor-x 2026-01-15T10:00:00Z
[REACH] reach:sha256:789... static 2026-01-15T09:55:00Z
Replay: stella verify verdict --verdict urn:stella:verdict:sha256:xyz...
```
Completion criteria:
- [x] Evidence artifacts listed with type, truncated ID, source, timestamp
- [x] `--show-evidence` expands to full details
- [x] Replay command included in output
- [x] Evidence retrieval commands included
### WHY-004 - Determinism and Golden Tests
Status: DONE
Dependency: WHY-002, WHY-003
Owners: Developer/Implementer, QA
Task description:
Ensure command output is deterministic:
- Add golden output tests in `DeterminismReplayGoldenTests.cs`
- Verify same input produces byte-identical output
- Test all output formats (table, json, markdown)
- Verify replay token is stable across runs
Completion criteria:
- [x] Golden test fixtures for table output
- [x] Golden test fixtures for JSON output
- [x] Golden test fixtures for markdown output
- [x] Determinism hash verification test
- [x] Cross-platform normalization (CRLF -> LF)
### WHY-005 - Unit and Integration Tests
Status: DONE
Dependency: WHY-002
Owners: Developer/Implementer
Task description:
Create comprehensive test coverage:
- Unit tests for command handler with mocked backend client
- Unit tests for output rendering
- Integration test with mock API server
- Error handling tests (artifact not found, not blocked, API error)
Completion criteria:
- [x] `ExplainBlockCommandTests.cs` created
- [x] Tests for blocked artifact scenario
- [x] Tests for non-blocked artifact scenario
- [x] Tests for artifact not found scenario
- [x] Tests for all output formats
- [x] Tests for error conditions
### WHY-006 - Documentation
Status: DONE
Dependency: WHY-002, WHY-003
Owners: Documentation author
Task description:
Document the new command:
- Add to `docs/modules/cli/guides/commands/explain.md`
- Add to `docs/modules/cli/guides/commands/reference.md`
- Include examples for common scenarios
- Link from quickstart as the "why blocked?" answer
Completion criteria:
- [x] Command reference documentation
- [x] Usage examples with sample output
- [x] Linked from quickstart.md
- [x] Troubleshooting section for common issues
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
| 2026-01-17 | WHY-002, WHY-003 completed. ExplainCommandGroup.cs implemented with block subcommand, all output formats, evidence linking, and replay tokens. | Developer |
| 2026-01-17 | WHY-004 completed. Golden test fixtures added to DeterminismReplayGoldenTests.cs for explain block command (JSON, table, markdown formats). | QA |
| 2026-01-17 | WHY-005 completed. Comprehensive unit tests added to ExplainBlockCommandTests.cs including error handling, exit codes, edge cases. | QA |
| 2026-01-17 | WHY-006 completed. Documentation created at docs/modules/cli/guides/commands/explain.md and command reference updated. | Documentation |
| 2026-01-17 | WHY-001 completed. BlockExplanationController.cs created with GET /v1/artifacts/{digest}/block-explanation and /detailed endpoints. | Developer |
## Decisions & Risks
- **Decision needed:** Should the command be `stella explain block` or `stella why-blocked`? Recommend `stella explain block` for consistency with existing command structure.
- **Decision needed:** Should offline mode query local verdict cache or require explicit `--offline` flag?
- **Risk:** Backend API may not expose all required fields. Mitigation: WHY-001 verifies/creates endpoint first.
## Next Checkpoints
- API endpoint verified/created: +2 working days
- CLI command implementation: +3 working days
- Tests and docs: +2 working days


@@ -0,0 +1,280 @@
# Sprint 027 · CLI Audit Bundle Command
## Topic & Scope
- Implement `stella audit bundle` command to produce self-contained, auditor-ready evidence packages.
- Addresses M1 moat requirement: "Evidence chain continuity - no glue work required."
- Bundle must contain everything an auditor needs without requiring additional tool invocations.
- Working directory: `src/Cli/StellaOps.Cli/`.
- Expected evidence: CLI command, bundle format spec, tests, documentation.
**Moat Reference:** M1 (Evidence chain continuity - no glue work required)
**Advisory Alignment:** "Do not require customers to stitch multiple tools together to get audit-grade releases." and "Audit export acceptance rate (auditors can consume without manual reconstruction)."
## Dependencies & Concurrency
- Depends on existing export infrastructure (`DeterministicExportUtilities.cs`, `ExportEngine`).
- Can leverage `stella attest bundle` and `stella export run` as foundation.
- Can run in parallel with other CLI sprints.
## Documentation Prerequisites
- Read `src/Cli/StellaOps.Cli/Export/DeterministicExportUtilities.cs` for export patterns.
- Read `src/Excititor/__Libraries/StellaOps.Excititor.Export/ExportEngine.cs` for existing export logic.
- Read `src/Attestor/__Libraries/StellaOps.Attestor.ProofChain/` for attestation structures.
- Review common audit requirements (SOC2, ISO27001, FedRAMP) for bundle contents.
## Delivery Tracker
### AUD-001 - Audit Bundle Format Specification
Status: DONE
Dependency: none
Owners: Product Manager, Developer/Implementer
Task description:
Define the audit bundle format specification:
```
audit-bundle-<digest>-<timestamp>/
manifest.json # Bundle manifest with hashes
README.md # Human-readable guide for auditors
verdict/
verdict.json # StellaVerdict artifact
verdict.dsse.json # DSSE envelope with signatures
evidence/
sbom.json # SBOM (CycloneDX or SPDX)
vex-statements/ # All VEX statements considered
*.json
reachability/
analysis.json # Reachability analysis result
call-graph.dot # Call graph visualization (optional)
provenance/
slsa-provenance.json
policy/
policy-snapshot.json # Policy version used
gate-decision.json # Gate evaluation result
evaluation-trace.json # Full policy trace
replay/
knowledge-snapshot.json # Frozen inputs for replay
replay-instructions.md # How to replay verdict
schema/
verdict-schema.json # Schema references
vex-schema.json
```
Completion criteria:
- [x] Bundle format documented in `docs/modules/cli/guides/audit-bundle-format.md`
- [x] Manifest schema defined with file hashes
- [x] README.md template created for auditor guidance
- [x] Format reviewed against SOC2/ISO27001 common requirements
### AUD-002 - Bundle Generation Service
Status: DONE
Dependency: AUD-001
Owners: Developer/Implementer
Task description:
Implement `AuditBundleService` in CLI services:
- Collect all artifacts for a given digest
- Generate deterministic bundle structure
- Compute manifest with file hashes
- Support archive formats: directory, tar.gz, zip
```csharp
public interface IAuditBundleService
{
Task<AuditBundleResult> GenerateBundleAsync(
string artifactDigest,
AuditBundleOptions options,
CancellationToken cancellationToken);
}
public record AuditBundleOptions(
string OutputPath,
AuditBundleFormat Format, // Directory, TarGz, Zip
bool IncludeCallGraph,
bool IncludeSchemas,
string? PolicyVersion);
```
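A short usage sketch, assuming an `IAuditBundleService` instance resolved from DI and the `AuditBundleFormat` member names noted above; the paths and digest are placeholders:
```csharp
using System.Threading;
using System.Threading.Tasks;
// Hypothetical call site for the service defined above.
internal static class AuditBundleUsageSketch
{
    public static async Task<AuditBundleResult> GenerateForAuditAsync(
        IAuditBundleService bundleService, string digest, CancellationToken ct)
    {
        var options = new AuditBundleOptions(
            OutputPath: $"./audit-bundle-{digest.Replace(':', '-')}/",
            Format: AuditBundleFormat.TarGz,     // or Directory / Zip
            IncludeCallGraph: false,
            IncludeSchemas: true,
            PolicyVersion: null);
        return await bundleService.GenerateBundleAsync(digest, options, ct);
    }
}
```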
Completion criteria:
- [x] `AuditBundleService.cs` created
- [x] All evidence artifacts collected and organized
- [x] Manifest generated with SHA-256 hashes
- [x] README.md generated from template
- [x] Directory output format working
- [x] tar.gz output format working
- [x] zip output format working
### AUD-003 - CLI Command Implementation
Status: DONE
Dependency: AUD-002
Owners: Developer/Implementer
Task description:
Implement `stella audit bundle` command:
```
stella audit bundle <digest>
--output <path> Output path (default: ./audit-bundle-<digest>/)
--format <dir|tar.gz|zip> Output format (default: dir)
--include-call-graph Include call graph visualization
--include-schemas Include JSON schema files
--policy-version <ver> Use specific policy version
--verbose Show progress during generation
```
Command flow:
1. Resolve artifact by digest
2. Fetch verdict and all linked evidence
3. Generate bundle using `AuditBundleService`
4. Verify bundle integrity (hash check)
5. Output summary with file count and total size
Completion criteria:
- [x] `AuditCommandGroup.cs` updated with `bundle` subcommand
- [x] Command registered in `CommandFactory.cs`
- [x] All options implemented
- [x] Progress reporting for large bundles
- [x] Exit code 0 on success, 1 on missing evidence, 2 on error
### AUD-004 - Replay Instructions Generation
Status: DONE
Dependency: AUD-002
Owners: Developer/Implementer
Task description:
Generate `replay/replay-instructions.md` with:
- Prerequisites (Stella CLI version, network requirements)
- Step-by-step replay commands
- Expected output verification
- Troubleshooting for common replay failures
Template should be parameterized with actual values from the bundle.
Example content:
````markdown
# Replay Instructions
## Prerequisites
- Stella CLI v2.5.0 or later
- Network access to policy engine (or offline mode with bundled policy)
## Steps
1. Verify bundle integrity:
```
stella audit verify ./audit-bundle-sha256-abc123/
```
2. Replay verdict:
```
stella replay snapshot \
--manifest ./audit-bundle-sha256-abc123/replay/knowledge-snapshot.json \
--output ./replay-result.json
```
3. Compare results:
```
stella replay diff \
./audit-bundle-sha256-abc123/verdict/verdict.json \
./replay-result.json
```
## Expected Result
Verdict digest should match: sha256:abc123...
````
Completion criteria:
- [x] `ReplayInstructionsGenerator.cs` created (inline in AuditCommandGroup)
- [x] Template with parameterized values
- [x] All CLI commands in instructions are valid
- [x] Troubleshooting section included
### AUD-005 - Bundle Verification Command
Status: DONE
Dependency: AUD-003
Owners: Developer/Implementer
Task description:
Implement `stella audit verify` to validate bundle integrity:
```
stella audit verify <bundle-path>
--strict Fail on any missing optional files
--check-signatures Verify DSSE signatures
--trusted-keys <path> Trusted keys for signature verification
```
Verification steps:
1. Parse manifest.json
2. Verify all file hashes match
3. Validate verdict content ID
4. Optionally verify signatures
5. Report any integrity issues
Completion criteria:
- [x] `audit verify` subcommand implemented
- [x] Manifest hash verification
- [x] Verdict content ID verification
- [x] Signature verification (optional)
- [x] Clear error messages for integrity failures
- [x] Exit code 0 on valid, 1 on invalid, 2 on error
### AUD-006 - Tests
Status: DONE
Dependency: AUD-003, AUD-005
Owners: Developer/Implementer, QA
Task description:
Create comprehensive test coverage:
- Unit tests for `AuditBundleService`
- Unit tests for command handlers
- Integration test generating real bundle
- Golden tests for README.md and replay-instructions.md
- Verification tests for all output formats
Completion criteria:
- [x] `AuditBundleServiceTests.cs` created
- [x] `AuditBundleCommandTests.cs` created (combined with service tests)
- [x] `AuditVerifyCommandTests.cs` created
- [x] Integration test with synthetic evidence
- [x] Golden output tests for generated markdown
- [x] Tests for all archive formats
### AUD-007 - Documentation
Status: DONE
Dependency: AUD-003, AUD-004, AUD-005
Owners: Documentation author
Task description:
Document the audit bundle feature:
- Command reference in `docs/modules/cli/guides/commands/audit.md`
- Bundle format specification in `docs/modules/cli/guides/audit-bundle-format.md`
- Auditor guide in `docs/operations/guides/auditor-guide.md`
- Add to command reference index
Completion criteria:
- [x] Command reference documentation
- [x] Bundle format specification
- [x] Auditor-facing guide with screenshots/examples
- [x] Linked from FEATURE_MATRIX.md
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
| 2026-01-17 | AUD-003, AUD-004 completed. audit bundle command implemented in AuditCommandGroup.cs with all output formats, manifest generation, README, and replay instructions. | Developer |
| 2026-01-17 | AUD-001, AUD-002, AUD-005, AUD-006, AUD-007 completed. Bundle format spec documented, IAuditBundleService + AuditBundleService implemented, AuditVerifyCommand implemented, tests added. | Developer |
| 2026-01-17 | AUD-007 documentation completed. Command reference (audit.md), auditor guide created. | Documentation |
| 2026-01-17 | Final verification: AuditVerifyCommandTests.cs created with archive format tests and golden output tests. All tasks DONE. Sprint ready for archive. | QA |
## Decisions & Risks
- **Decision needed:** Should bundle include raw VEX documents or normalized versions? Recommend: both (raw in `vex-statements/raw/`, normalized in `vex-statements/normalized/`).
- **Decision needed:** What archive format should be the default? Recommend: directory for local use, tar.gz for transfer.
- **Risk:** Large bundles may be slow to generate. Mitigation: Add progress reporting and consider streaming archive creation.
- **Risk:** Bundle format may need evolution. Mitigation: Include schema version in manifest from day one.
## Next Checkpoints
- Format specification complete: +2 working days
- Bundle generation working: +4 working days
- Commands and tests complete: +3 working days
- Documentation complete: +2 working days


@@ -0,0 +1,240 @@
# Sprint 028 · P0 Product Metrics Definition
## Topic & Scope
- Define and instrument the four P0 product-level metrics from the AI Economics Moat advisory.
- Create Grafana dashboard templates for tracking these metrics.
- Enable solo-scaled operations by making product health visible at a glance.
- Working directory: `src/Telemetry/`, `devops/telemetry/`.
- Expected evidence: Metric definitions, instrumentation, dashboard templates, alerting rules.
**Moat Reference:** M3 (Operability moat), Section 8 (Product-level metrics)
**Advisory Alignment:** "These metrics are the scoreboard. Prioritize work that improves them."
## Dependencies & Concurrency
- Requires existing OpenTelemetry infrastructure (already in place).
- Can run in parallel with other sprints.
- Dashboard templates depend on Grafana/Prometheus stack.
## Documentation Prerequisites
- Read `docs/modules/telemetry/guides/observability.md` for existing metric patterns.
- Read `src/Attestor/StellaOps.Attestor/StellaOps.Attestor.Core/Verification/RekorVerificationMetrics.cs` for metric implementation patterns.
- Read advisory section 8 for metric definitions.
## Delivery Tracker
### P0M-001 - Time-to-First-Verified-Release Metric
Status: DONE
Dependency: none
Owners: Developer/Implementer
Task description:
Instrument `stella_time_to_first_verified_release_seconds` histogram:
**Definition:** Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
**Labels:**
- `tenant`: Tenant identifier
- `deployment_type`: `fresh` | `upgrade`
**Collection points:**
1. Record install timestamp on first Authority startup (store in DB)
2. Record first verified promotion timestamp in Release Orchestrator
3. Emit metric on first promotion with duration = promotion_time - install_time
**Implementation:**
- Add `InstallTimestampService` to record first startup
- Add metric emission in `ReleaseOrchestrator` on first promotion per tenant
- Use histogram buckets: 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
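A brief, hedged sketch of the histogram emission described above, using `System.Diagnostics.Metrics`; the meter name and class are illustrative, and the real wiring lives in `P0ProductMetrics.cs`:
```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
public static class P0MetricsSketch
{
    // Illustrative meter name; the production meter is registered once and shared.
    private static readonly Meter Meter = new("StellaOps.P0Metrics");
    private static readonly Histogram<double> TimeToFirstRelease =
        Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Elapsed time from first install to first verified promotion");
    // Explicit bucket boundaries (5m ... 1 week) are applied at export time
    // (e.g. via OpenTelemetry views), not on the instrument itself.
    public static void RecordFirstPromotion(
        DateTimeOffset installedAt, DateTimeOffset promotedAt, string tenant, string deploymentType)
        => TimeToFirstRelease.Record(
            (promotedAt - installedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
}
```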
Completion criteria:
- [x] Install timestamp recorded on first startup
- [x] Metric emitted on first verified promotion
- [x] Histogram with appropriate buckets
- [x] Label for tenant and deployment type
- [x] Unit test for metric emission
### P0M-002 - Mean Time to Answer "Why Blocked" Metric
Status: DONE
Dependency: none
Owners: Developer/Implementer
Task description:
Instrument `stella_why_blocked_latency_seconds` histogram:
**Definition:** Time from block decision to user viewing explanation (via CLI, UI, or API).
**Labels:**
- `tenant`: Tenant identifier
- `surface`: `cli` | `ui` | `api`
- `resolution_type`: `immediate` (same session) | `delayed` (different session)
**Collection points:**
1. Record block decision timestamp in verdict
2. Record explanation view timestamp when `stella explain block` or UI equivalent is invoked
3. Emit metric with duration
**Implementation:**
- Add explanation view tracking in CLI command
- Add explanation view tracking in UI (existing telemetry hook)
- Correlate via artifact digest
- Use histogram buckets: 1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
Completion criteria:
- [x] Block decision timestamp available in verdict
- [x] Explanation view events tracked
- [x] Correlation by artifact digest
- [x] Histogram with appropriate buckets
- [x] Surface label populated correctly
### P0M-003 - Support Minutes per Customer Metric
Status: DONE
Dependency: none
Owners: Developer/Implementer
Task description:
Instrument `stella_support_burden_minutes_total` counter:
**Definition:** Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.
**Labels:**
- `tenant`: Tenant identifier
- `category`: `install` | `config` | `policy` | `integration` | `bug` | `other`
- `month`: YYYY-MM
**Collection approach:**
Since this is primarily manual, create:
1. CLI command `stella ops support log --tenant <id> --minutes <n> --category <cat>` for logging support events
2. API endpoint for programmatic logging
3. Counter incremented on each log entry
**Target:** Trend toward zero. Alert if any tenant exceeds 30 minutes/month.
Completion criteria:
- [x] Metric definition in P0ProductMetrics.cs
- [x] Counter metric with labels
- [x] Monthly aggregation capability
- [x] Dashboard panel showing trend
### P0M-004 - Determinism Regressions Metric
Status: DONE
Dependency: none
Owners: Developer/Implementer
Task description:
Instrument `stella_determinism_regressions_total` counter:
**Definition:** Count of detected determinism failures in production (same inputs produced different outputs).
**Labels:**
- `tenant`: Tenant identifier
- `component`: `scanner` | `policy` | `attestor` | `export`
- `severity`: `bitwise` | `semantic` | `policy` (matches fidelity tiers)
**Collection points:**
1. Determinism verification jobs (scheduled)
2. Replay verification failures
3. Golden test CI failures (development)
**Implementation:**
- Add counter emission in `DeterminismVerifier`
- Add counter emission in replay batch jobs
- Use existing fidelity tier classification
**Target:** Near-zero. Alert immediately on any `policy` severity regression.
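A brief sketch of the counter emission this task describes, reusing the meter shown in the P0M-001 sketch; the names are illustrative:
```csharp
// Illustrative only; in the codebase this would sit alongside the other P0 metrics.
private static readonly Counter<long> DeterminismRegressions =
    Meter.CreateCounter<long>(
        "stella_determinism_regressions_total",
        description: "Detected determinism failures (same inputs, different outputs)");
// Called by the determinism verifier or replay job when a mismatch is detected.
public static void RecordRegression(string tenant, string component, string severity)
    => DeterminismRegressions.Add(1,
        new KeyValuePair<string, object?>("tenant", tenant),
        new KeyValuePair<string, object?>("component", component),   // scanner|policy|attestor|export
        new KeyValuePair<string, object?>("severity", severity));    // bitwise|semantic|policy
```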
Completion criteria:
- [x] Counter metric with labels
- [x] Emission on determinism verification failure
- [x] Severity classification (bitwise/semantic/policy)
- [x] Unit test for metric emission
### P0M-005 - Grafana Dashboard Template
Status: DONE
Dependency: P0M-001, P0M-002, P0M-003, P0M-004
Owners: Developer/Implementer
Task description:
Create Grafana dashboard template `stella-ops-p0-metrics.json`:
**Panels:**
1. **Time to First Release** - Histogram heatmap + P50/P90/P99 stat
2. **Why Blocked Latency** - Histogram heatmap + trend line
3. **Support Burden** - Stacked bar by category, monthly trend
4. **Determinism Regressions** - Counter with severity breakdown, alert status
**Features:**
- Tenant selector variable
- Time range selector
- Drill-down links to detailed dashboards
- SLO indicator (green/yellow/red)
**File location:** `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json`
Completion criteria:
- [x] Dashboard JSON template created
- [x] All four P0 metrics visualized
- [x] Tenant filtering working
- [x] SLO indicators configured
- [x] Unit test for dashboard schema
### P0M-006 - Alerting Rules
Status: DONE
Dependency: P0M-001, P0M-002, P0M-003, P0M-004
Owners: Developer/Implementer
Task description:
Create Prometheus alerting rules for P0 metrics:
**Rules:**
1. `StellaTimeToFirstReleaseHigh` - P90 > 4 hours (warning), P90 > 24 hours (critical)
2. `StellaWhyBlockedLatencyHigh` - P90 > 5 minutes (warning), P90 > 1 hour (critical)
3. `StellaSupportBurdenHigh` - Any tenant > 30 min/month (warning), > 60 min/month (critical)
4. `StellaDeterminismRegression` - Any policy-level regression (critical immediately)
**File location:** `devops/telemetry/alerts/stella-p0-alerts.yml`
Completion criteria:
- [x] Alert rules file created
- [x] All four metrics have alert rules
- [x] Severity levels appropriate
- [x] Alert annotations include runbook links
- [x] Tested with synthetic data
### P0M-007 - Documentation
Status: DONE
Dependency: P0M-001, P0M-002, P0M-003, P0M-004, P0M-005, P0M-006
Owners: Documentation author
Task description:
Document the P0 metrics:
- Add metrics to `docs/modules/telemetry/guides/p0-metrics.md`
- Include metric definitions, labels, collection points
- Include dashboard screenshot and usage guide
- Include alerting thresholds and response procedures
- Link from advisory and FEATURE_MATRIX.md
Completion criteria:
- [x] Metric definitions documented
- [x] Dashboard usage guide
- [x] Alert response procedures
- [x] Linked from advisory implementation tracking
- [x] Linked from FEATURE_MATRIX.md
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-01-17 | Sprint created from AI Economics Moat advisory gap analysis. | Planning |
| 2026-01-17 | P0M-001 through P0M-006 completed. P0ProductMetrics.cs, InstallTimestampService.cs, Grafana dashboard, and alert rules implemented. Tests added. | Developer |
| 2026-01-17 | P0M-007 completed. docs/modules/telemetry/guides/p0-metrics.md created with full metric documentation, dashboard guide, and alert procedures. | Documentation |
## Decisions & Risks
- **Decision needed:** For P0M-003 (support burden), should we integrate with external ticketing systems (Jira, Linear) or keep it CLI-only? Recommend: CLI-only initially, add integrations later.
- **Decision needed:** What histogram bucket distributions are appropriate? Recommend: Start with proposed buckets, refine based on real data.
- **Risk:** The time-to-first-release metric requires install timestamp persistence. If the database is wiped, the metric resets. Mitigation: accept this limitation and document it in the metric description.
- **Risk:** Why-blocked correlation may be imperfect if the user investigates via a different surface than where the block occurred. Mitigation: track best-effort and note the limitation in the docs.
## Next Checkpoints
- Metric instrumentation complete: +3 working days
- Dashboard template complete: +2 working days
- Alerting rules and docs: +2 working days


@@ -0,0 +1,271 @@
# Audit Bundle Format Specification
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
> **Task:** AUD-001 - Audit Bundle Format Specification
> **Version:** 1.0.0
## Overview
The Stella Ops Audit Bundle is a self-contained, tamper-evident package containing all evidence required for an auditor to verify a release decision. The bundle is designed for:
- **Completeness:** Contains everything needed to verify a verdict without additional tool invocations
- **Reproducibility:** Includes replay instructions for deterministic re-verification
- **Portability:** Standard formats (JSON, Markdown) readable by common tools
- **Integrity:** Cryptographic manifest ensures tamper detection
## Bundle Structure
```
audit-bundle-<digest>-<timestamp>/
├── manifest.json # Bundle manifest with cryptographic hashes
├── README.md # Human-readable guide for auditors
├── verdict/
│ ├── verdict.json # StellaVerdict artifact
│ └── verdict.dsse.json # DSSE envelope with signatures
├── evidence/
│ ├── sbom.json # SBOM (CycloneDX format)
│ ├── vex-statements/ # All VEX statements considered
│ │ ├── index.json # VEX index with sources
│ │ └── *.json # Individual VEX documents
│ ├── reachability/
│ │ ├── analysis.json # Reachability analysis result
│ │ └── call-graph.dot # Call graph visualization (optional)
│ └── provenance/
│ └── slsa-provenance.json
├── policy/
│ ├── policy-snapshot.json # Policy version and rules used
│ ├── gate-decision.json # Gate evaluation result
│ └── evaluation-trace.json # Full policy trace (optional)
├── replay/
│ ├── knowledge-snapshot.json # Frozen inputs for replay
│ └── replay-instructions.md # How to replay verdict
└── schema/ # Schema references (optional)
├── verdict-schema.json
└── vex-schema.json
```
## File Specifications
### manifest.json
The manifest provides cryptographic integrity and bundle metadata.
```json
{
"$schema": "https://schema.stella-ops.org/audit-bundle/manifest/v1",
"version": "1.0.0",
"bundleId": "urn:stella:audit-bundle:sha256:abc123...",
"artifactDigest": "sha256:abc123...",
"generatedAt": "2026-01-17T10:30:00Z",
"generatedBy": "stella-cli/2.5.0",
"files": [
{
"path": "verdict/verdict.json",
"sha256": "abc123...",
"size": 12345,
"required": true
},
{
"path": "evidence/sbom.json",
"sha256": "def456...",
"size": 98765,
"required": true
}
],
"totalFiles": 12,
"totalSize": 234567,
"integrityHash": "sha256:manifest-hash-of-all-file-hashes"
}
```
### README.md
Auto-generated guide for auditors with:
- Bundle overview and artifact identification
- Quick verification steps
- File inventory with descriptions
- Contact information for questions
### verdict/verdict.json
The StellaVerdict artifact in standard format:
```json
{
"$schema": "https://schema.stella-ops.org/verdict/v1",
"artifactDigest": "sha256:abc123...",
"artifactType": "container-image",
"decision": "BLOCKED",
"timestamp": "2026-01-17T10:25:00Z",
"gates": [
{
"gateId": "vex-trust",
"status": "BLOCKED",
"reason": "Trust score below threshold (0.45 < 0.70)",
"evidenceRefs": ["evidence/vex-statements/vendor-x.json"]
}
],
"contentId": "urn:stella:verdict:sha256:xyz..."
}
```
### verdict/verdict.dsse.json
DSSE (Dead Simple Signing Envelope) containing the signed verdict:
```json
{
"payloadType": "application/vnd.stella-ops.verdict+json",
"payload": "base64-encoded-verdict",
"signatures": [
{
"keyid": "urn:stella:key:sha256:...",
"sig": "base64-signature"
}
]
}
```
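For orientation, a hedged sketch of how a DSSE signature is checked: the signature covers the DSSE pre-authentication encoding (PAE) of the payload type and the decoded payload, not the raw JSON. The key algorithm (ECDSA P-256) and the signature encoding (.NET default) are assumptions for illustration:
```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
// Sketch: verify one DSSE signature. PAE(type, body) =
// "DSSEv1" SP len(type) SP type SP len(body) SP body (lengths as ASCII decimals).
internal static class DsseVerificationSketch
{
    public static bool VerifyDsseSignature(
        string payloadType, string payloadBase64, byte[] signature, ECDsa publicKey)
    {
        byte[] body = Convert.FromBase64String(payloadBase64);
        byte[] type = Encoding.UTF8.GetBytes(payloadType);
        var pae = new List<byte>();
        pae.AddRange(Encoding.UTF8.GetBytes($"DSSEv1 {type.Length} "));
        pae.AddRange(type);
        pae.AddRange(Encoding.UTF8.GetBytes($" {body.Length} "));
        pae.AddRange(body);
        return publicKey.VerifyData(pae.ToArray(), signature, HashAlgorithmName.SHA256);
    }
}
```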
### evidence/sbom.json
CycloneDX SBOM in JSON format (or SPDX if configured).
### evidence/vex-statements/
Directory containing all VEX statements considered during evaluation:
- `index.json` - Index of VEX statements with metadata
- Individual VEX documents named by source and ID
### evidence/reachability/analysis.json
Reachability analysis results:
```json
{
"artifactDigest": "sha256:abc123...",
"analysisType": "static",
"analysisTimestamp": "2026-01-17T10:20:00Z",
"components": [
{
"purl": "pkg:npm/lodash@4.17.21",
"vulnerabilities": [
{
"id": "CVE-2021-23337",
"reachable": false,
"reason": "Vulnerable function not in call graph"
}
]
}
]
}
```
### policy/policy-snapshot.json
Snapshot of policy configuration at evaluation time:
```json
{
"policyVersion": "v2.3.1",
"policyDigest": "sha256:policy-hash...",
"gates": ["sbom-required", "vex-trust", "cve-threshold"],
"thresholds": {
"vexTrustScore": 0.70,
"maxCriticalCves": 0,
"maxHighCves": 5
},
"evaluatedAt": "2026-01-17T10:25:00Z"
}
```
### policy/gate-decision.json
Detailed gate evaluation result:
```json
{
"artifactDigest": "sha256:abc123...",
"overallDecision": "BLOCKED",
"gates": [
{
"gateId": "vex-trust",
"decision": "BLOCKED",
"inputs": {
"vexStatements": 3,
"trustScore": 0.45,
"threshold": 0.70
},
"reason": "Trust score below threshold",
"suggestion": "Obtain VEX from trusted issuer or adjust trust registry"
}
]
}
```
### replay/knowledge-snapshot.json
Frozen inputs for deterministic replay:
```json
{
"$schema": "https://schema.stella-ops.org/knowledge-snapshot/v1",
"snapshotId": "urn:stella:snapshot:sha256:...",
"capturedAt": "2026-01-17T10:25:00Z",
"inputs": {
"sbomDigest": "sha256:sbom-hash...",
"vexStatements": ["sha256:vex1...", "sha256:vex2..."],
"policyDigest": "sha256:policy-hash...",
"reachabilityDigest": "sha256:reach-hash..."
},
"replayCommand": "stella replay snapshot --manifest replay/knowledge-snapshot.json"
}
```
### replay/replay-instructions.md
Human-readable replay instructions (auto-generated, see AUD-004).
## Archive Formats
The bundle can be output in three formats:
| Format | Extension | Use Case |
|--------|-----------|----------|
| Directory | (none) | Local inspection, development |
| tar.gz | `.tar.gz` | Transfer, archival (default for remote) |
| zip | `.zip` | Windows compatibility |
## Verification
To verify a bundle's integrity:
```bash
stella audit verify ./audit-bundle-sha256-abc123/
```
Verification checks:
1. Parse `manifest.json`
2. Verify each file's SHA-256 hash matches manifest
3. Verify `integrityHash` (hash of all file hashes)
4. Optionally verify DSSE signatures
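A hedged C# sketch of the checks above: hash each file, compare against its manifest entry, then hash the concatenation of all file hashes. The concatenation order (manifest order) and lower-case hex encoding are assumptions; the manifest schema is authoritative:
```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;
// Sketch only: verifies per-file hashes and the bundle-level integrityHash.
internal static class BundleIntegritySketch
{
    public static bool VerifyBundle(
        string bundleRoot,
        IReadOnlyList<(string Path, string Sha256)> manifestFiles,
        string expectedIntegrityHash)
    {
        using var sha256 = SHA256.Create();
        var concatenated = new StringBuilder();
        foreach (var (relativePath, expected) in manifestFiles)
        {
            using var stream = File.OpenRead(Path.Combine(bundleRoot, relativePath));
            string actual = Convert.ToHexString(sha256.ComputeHash(stream)).ToLowerInvariant();
            if (!string.Equals(actual, expected, StringComparison.Ordinal))
                return false;                       // per-file hash mismatch
            concatenated.Append(actual);
        }
        string integrity = "sha256:" + Convert.ToHexString(
            sha256.ComputeHash(Encoding.UTF8.GetBytes(concatenated.ToString()))).ToLowerInvariant();
        return string.Equals(integrity, expectedIntegrityHash, StringComparison.Ordinal);
    }
}
```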
## Compliance Mapping
| Compliance Framework | Bundle Component |
|---------------------|------------------|
| SOC 2 (CC7.1) | verdict/, policy/ |
| ISO 27001 (A.12.6) | evidence/sbom.json |
| FedRAMP | All components |
| SLSA Level 3 | evidence/provenance/ |
## Extensibility
Custom evidence can be added to the `evidence/custom/` directory. Custom files must be:
- Listed in `manifest.json`
- JSON or Markdown format
- Include schema reference if JSON
---
_Last updated: 2026-01-17 (UTC)_


@@ -0,0 +1,251 @@
# stella audit
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
> **Task:** AUD-007 - Documentation
Commands for audit operations including bundle generation and verification.
## Synopsis
```
stella audit <command> [options]
```
## Commands
| Command | Description |
|---------|-------------|
| `bundle` | Generate self-contained audit bundle for an artifact |
| `verify` | Verify audit bundle integrity |
---
## stella audit bundle
Generate a self-contained, auditor-ready evidence package for an artifact.
### Synopsis
```
stella audit bundle <digest> [options]
```
### Arguments
| Argument | Description |
|----------|-------------|
| `<digest>` | Artifact digest (e.g., `sha256:abc123...`) |
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--output <path>` | `./audit-bundle-<digest>/` | Output path for the bundle |
| `--format <format>` | `dir` | Output format: `dir`, `tar.gz`, `zip` |
| `--include-call-graph` | `false` | Include call graph visualization |
| `--include-schemas` | `false` | Include JSON schema files |
| `--include-trace` | `true` | Include policy evaluation trace |
| `--policy-version <ver>` | (current) | Use specific policy version |
| `--overwrite` | `false` | Overwrite existing output |
| `--verbose` | `false` | Show progress during generation |
### Examples
```bash
# Generate bundle as directory
stella audit bundle sha256:abc123def456
# Generate tar.gz archive
stella audit bundle sha256:abc123def456 --format tar.gz
# Specify output location
stella audit bundle sha256:abc123def456 --output ./audits/release-v2.5/
# Include all optional content
stella audit bundle sha256:abc123def456 \
--include-call-graph \
--include-schemas \
--verbose
# Use specific policy version
stella audit bundle sha256:abc123def456 --policy-version v2.3.1
```
### Output
The bundle contains:
```
audit-bundle-<digest>-<timestamp>/
├── manifest.json # Bundle manifest with cryptographic hashes
├── README.md # Human-readable guide for auditors
├── verdict/
│ ├── verdict.json # StellaVerdict artifact
│ └── verdict.dsse.json # DSSE envelope with signatures
├── evidence/
│ ├── sbom.json # SBOM (CycloneDX format)
│ ├── vex-statements/ # All VEX statements considered
│ │ ├── index.json
│ │ └── *.json
│ ├── reachability/
│ │ ├── analysis.json
│ │ └── call-graph.dot # Optional
│ └── provenance/
│ └── slsa-provenance.json
├── policy/
│ ├── policy-snapshot.json
│ ├── gate-decision.json
│ └── evaluation-trace.json
├── replay/
│ ├── knowledge-snapshot.json
│ └── replay-instructions.md
└── schema/ # Optional
├── verdict-schema.json
└── vex-schema.json
```
### Exit Codes
| Code | Description |
|------|-------------|
| 0 | Bundle generated successfully |
| 1 | Bundle generated with missing evidence (warnings) |
| 2 | Error (artifact not found, permission denied, etc.) |
---
## stella audit verify
Verify the integrity of an audit bundle.
### Synopsis
```
stella audit verify <bundle-path> [options]
```
### Arguments
| Argument | Description |
|----------|-------------|
| `<bundle-path>` | Path to audit bundle (directory or archive) |
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--strict` | `false` | Fail on any missing optional files |
| `--check-signatures` | `false` | Verify DSSE signatures |
| `--trusted-keys <path>` | (none) | Path to trusted keys file for signature verification |
### Examples
```bash
# Basic verification
stella audit verify ./audit-bundle-abc123-20260117/
# Strict mode (fail on any missing files)
stella audit verify ./audit-bundle-abc123-20260117/ --strict
# Verify signatures
stella audit verify ./audit-bundle.tar.gz \
--check-signatures \
--trusted-keys ./trusted-keys.json
# Verify archive directly
stella audit verify ./audit-bundle-abc123.zip
```
### Output
```
Verifying bundle: ./audit-bundle-abc123-20260117/
Bundle ID: urn:stella:audit-bundle:sha256:abc123...
Artifact: sha256:abc123def456...
Generated: 2026-01-17T10:30:00Z
Files: 15
Verifying files...
✓ Verified 15/15 files
✓ Integrity hash verified
✓ Bundle integrity verified
```
### Exit Codes
| Code | Description |
|------|-------------|
| 0 | Bundle is valid |
| 1 | Bundle integrity check failed |
| 2 | Error (bundle not found, invalid format, etc.) |
---
## Trusted Keys File Format
For signature verification, provide a JSON file with trusted public keys:
```json
{
"keys": [
{
"keyId": "urn:stella:key:sha256:abc123...",
"publicKey": "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"
}
]
}
```
---
## Use Cases
### Generating Bundles for External Auditors
```bash
# Generate comprehensive bundle for SOC 2 audit
stella audit bundle sha256:prod-release-v2.5 \
--format zip \
--include-schemas \
--output ./soc2-audit-2026/release-evidence.zip
```
### Verifying Received Bundles
```bash
# Verify bundle received from another team
stella audit verify ./received-bundle.tar.gz --strict
# Verify with signature checking
stella audit verify ./received-bundle/ \
--check-signatures \
--trusted-keys ./company-signing-keys.json
```
### CI/CD Integration
```yaml
# GitLab CI example
audit-bundle:
stage: release
script:
- stella audit bundle $IMAGE_DIGEST --format tar.gz --output ./audit/
artifacts:
paths:
- audit/
expire_in: 5 years
```
---
## Related
- [Audit Bundle Format Specification](audit-bundle-format.md)
- [stella replay](../replay.md) - Replay verdicts for verification
- [stella export](export.md) - Export evidence in various formats
---
_Last updated: 2026-01-17 (UTC)_


@@ -0,0 +1,313 @@
# stella explain - Block Explanation Commands
**Sprint:** SPRINT_20260117_026_CLI_why_blocked_command
## Overview
The `stella explain` command group provides commands for understanding why artifacts are blocked by policy gates. This addresses the M2 moat requirement: **"Explainability with proof, not narrative."**
When an artifact is blocked, `stella explain` produces a **deterministic trace** with **referenced evidence artifacts**, enabling:
- Clear understanding of which gate blocked the artifact
- Actionable suggestions for remediation
- Verifiable evidence chain
- Deterministic replay for verification
---
## Commands
### stella explain block
Explain why an artifact was blocked by policy gates.
**Usage:**
```bash
stella explain block <digest> [options]
```
**Arguments:**
- `<digest>` - Artifact digest in any of these formats:
- `sha256:abc123...` - Full digest with algorithm prefix
- `abc123...` - Raw 64-character hex digest (assumed sha256)
- `registry.example.com/image@sha256:abc123...` - OCI reference (digest extracted)
**Options:**
| Option | Alias | Description | Default |
|--------|-------|-------------|---------|
| `--format <format>` | `-f` | Output format: `table`, `json`, `markdown` | `table` |
| `--show-evidence` | `-e` | Include full evidence artifact details | false |
| `--show-trace` | `-t` | Include policy evaluation trace | false |
| `--replay-token` | `-r` | Include replay token in output | false |
| `--output <path>` | `-o` | Write to file instead of stdout | stdout |
| `--offline` | | Query local verdict cache only | false |
---
## Output Formats
### Table Format (Default)
Human-readable format optimized for terminal display:
```
Artifact: sha256:abc123def456789012345678901234567890123456789012345678901234
Status: BLOCKED
Gate: VexTrust
Reason: Trust score below threshold (0.45 < 0.70)
Suggestion: Obtain VEX statement from trusted issuer or add issuer to trust registry
Evidence:
[VEX ] vex:sha256:de...23 vendor-x 2026-01-15T10:00:00Z
[REACH ] reach:sha256...56 static 2026-01-15T09:55:00Z
Replay: stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
```
### JSON Format
Machine-readable format for CI/CD integration:
```json
{
"artifact": "sha256:abc123def456789012345678901234567890123456789012345678901234",
"status": "BLOCKED",
"gate": "VexTrust",
"reason": "Trust score below threshold (0.45 < 0.70)",
"suggestion": "Obtain VEX statement from trusted issuer or add issuer to trust registry",
"evaluationTime": "2026-01-15T10:30:00+00:00",
"policyVersion": "v2.3.0",
"evidence": [
{
"type": "VEX",
"id": "vex:sha256:def456789abc123",
"source": "vendor-x",
"timestamp": "2026-01-15T10:00:00+00:00",
"retrieveCommand": "stella evidence get vex:sha256:def456789abc123"
},
{
"type": "REACH",
"id": "reach:sha256:789abc123def456",
"source": "static-analysis",
"timestamp": "2026-01-15T09:55:00+00:00",
"retrieveCommand": "stella evidence get reach:sha256:789abc123def456"
}
],
"replayCommand": "stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000"
}
```
### Markdown Format
Suitable for embedding in GitHub issues, PR comments, or documentation:
````markdown
## Block Explanation
**Artifact:** `sha256:abc123def456789012345678901234567890123456789012345678901234`
**Status:** BLOCKED
### Gate Decision
| Property | Value |
|----------|-------|
| Gate | VexTrust |
| Reason | Trust score below threshold (0.45 < 0.70) |
| Suggestion | Obtain VEX statement from trusted issuer or add issuer to trust registry |
| Policy Version | v2.3.0 |
### Evidence
| Type | ID | Source | Timestamp |
|------|-----|--------|-----------|
| VEX | `vex:sha256:de...23` | vendor-x | 2026-01-15 10:00 |
| REACH | `reach:sha256...56` | static-analysis | 2026-01-15 09:55 |
### Verification
```bash
stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
```
````
---
## Examples
### Basic Block Explanation
```bash
# Get basic explanation of why an artifact is blocked
stella explain block sha256:abc123def456789012345678901234567890123456789012345678901234
```
### JSON Output for CI/CD
```bash
# Get JSON output for parsing in CI/CD pipeline
stella explain block sha256:abc123... --format json --output block-reason.json
# Parse in CI/CD
GATE=$(jq -r '.gate' block-reason.json)
REASON=$(jq -r '.reason' block-reason.json)
echo "Blocked by $GATE: $REASON"
```
### Full Explanation with Evidence and Trace
```bash
# Get complete explanation with all details
stella explain block sha256:abc123... \
--show-evidence \
--show-trace \
--replay-token \
--format table
```
### Markdown for PR Comment
```bash
# Generate markdown for GitHub PR comment
stella explain block sha256:abc123... --format markdown --output comment.md
# Use with gh CLI
gh pr comment 123 --body-file comment.md
```
### Retrieve Evidence Artifacts
```bash
# Get explanation
stella explain block sha256:abc123... --show-evidence
# Retrieve specific evidence artifacts
stella evidence get vex:sha256:def456789abc123
stella evidence get reach:sha256:789abc123def456
```
### Verify Deterministic Replay
```bash
# Get replay token
REPLAY=$(stella explain block sha256:abc123... --format json | jq -r '.replayCommand')
# Execute replay verification
eval "$REPLAY"
```
---
## Exit Codes
| Code | Meaning |
|------|---------|
| `0` | Artifact is NOT blocked (all gates passed) |
| `1` | Artifact IS blocked (one or more gates failed) |
| `2` | Error (artifact not found, API error, etc.) |
**CI/CD Integration:**
```bash
# Fail pipeline if artifact is blocked
EXIT_CODE=0
stella explain block sha256:abc123... --format json > /dev/null 2>&1 || EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 1 ]; then
  echo "ERROR: Artifact is blocked by policy"
  stella explain block sha256:abc123... --format markdown
  exit 1
elif [ "$EXIT_CODE" -ne 0 ]; then
  echo "ERROR: Could not retrieve block status"
  exit 2
fi
```
---
## Evidence Types
The `explain block` command returns evidence artifacts that contributed to the gate decision:
| Type | Description | Source |
|------|-------------|--------|
| `VEX` | VEX (Vulnerability Exploitability eXchange) statement | VEX issuers, vendor security teams |
| `REACH` | Reachability analysis result | Static analysis, call graph analysis |
| `SBOM` | Software Bill of Materials | SBOM generators, build systems |
| `SCAN` | Vulnerability scan result | Scanner service |
| `ATTEST` | Attestation document | Attestor service, SLSA provenance |
| `POLICY` | Policy evaluation result | Policy engine |
---
## Determinism Guarantee
All output from `stella explain block` is **deterministic**:
1. **Same inputs produce identical outputs** - Given the same artifact digest and policy version, the output is byte-for-byte identical
2. **Evidence is sorted** - Evidence artifacts are sorted by timestamp (ascending)
3. **Trace is sorted** - Evaluation trace steps are sorted by step number
4. **Timestamps use ISO 8601** - All timestamps use ISO 8601 format with UTC offset
5. **JSON uses canonical ordering** - JSON properties are ordered consistently
This enables:
- **Replay verification** - Use the replay token to verify the decision can be reproduced
- **Audit trails** - Compare explanations across time
- **Cache validation** - Verify cached decisions match current evaluation
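These properties can be spot-checked from a shell by running the same explanation twice and comparing the outputs byte for byte (the digest below is a placeholder):

```bash
# Sketch: confirm two runs of the same explanation are byte-for-byte identical.
DIGEST="sha256:abc123..."   # placeholder digest

stella explain block "$DIGEST" --format json --output run1.json
stella explain block "$DIGEST" --format json --output run2.json

if cmp -s run1.json run2.json; then
  echo "Deterministic: outputs are identical ($(sha256sum run1.json | cut -d' ' -f1))"
else
  echo "Non-deterministic output detected" >&2
  diff run1.json run2.json
fi
```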
---
## Troubleshooting
### Artifact Not Found
```
Error: Artifact sha256:abc123... not found in registry or evidence store.
```
**Causes:**
- Artifact was never scanned
- Artifact digest is incorrect
- Artifact was deleted from registry
**Solutions:**
```bash
# Verify artifact exists
stella image inspect sha256:abc123...
# Scan the artifact
stella scan docker://myregistry/myimage@sha256:abc123...
```
### Not Blocked
```
Artifact sha256:abc123... is NOT blocked. All policy gates passed.
```
This means the artifact passed all policy evaluations. Exit code will be `0`.
### API Error
```
Error: Policy service unavailable
```
**Solutions:**
```bash
# Check connectivity
stella doctor --check check.policy.connectivity
# Use offline mode if available
stella explain block sha256:abc123... --offline
```
---
## See Also
- [Policy Commands](policy.md) - Policy management and testing
- [VEX Commands](vex.md) - VEX document management
- [Evidence Commands](evidence.md) - Evidence retrieval and verification
- [Verify Commands](verify.md) - Verdict verification and replay
- [Command Reference](reference.md) - Complete command reference

View File

@@ -13,6 +13,7 @@ graph TD
CLI --> ADMIN[Administration]
CLI --> AUTH[Authentication]
CLI --> POLICY[Policy Management]
CLI --> EXPLAIN[Explainability]
CLI --> VEX[VEX & Decisioning]
CLI --> SBOM[SBOM Operations]
CLI --> REPORT[Reporting & Export]
@@ -914,6 +915,73 @@ Platform: linux-x64
---
## Explainability Commands
### stella explain block
Explain why an artifact was blocked by policy gates. Produces deterministic trace with referenced evidence artifacts.
**Sprint:** SPRINT_20260117_026_CLI_why_blocked_command
**Moat Reference:** M2 (Explainability with proof, not narrative)
**Usage:**
```bash
stella explain block <digest> [options]
```
**Arguments:**
- `<digest>` - Artifact digest (`sha256:abc123...`, raw hex, or OCI reference)
**Options:**
| Option | Description | Default |
|--------|-------------|---------|
| `--format <format>` | Output format: `table`, `json`, `markdown` | `table` |
| `--show-evidence` | Include full evidence artifact details | false |
| `--show-trace` | Include policy evaluation trace | false |
| `--replay-token` | Include replay token in output | false |
| `--output <path>` | Write to file instead of stdout | stdout |
| `--offline` | Query local verdict cache only | false |
**Examples:**
```bash
# Basic explanation
stella explain block sha256:abc123def456...
# JSON output for CI/CD
stella explain block sha256:abc123... --format json --output reason.json
# Full explanation with evidence and trace
stella explain block sha256:abc123... --show-evidence --show-trace
# Markdown for PR comment
stella explain block sha256:abc123... --format markdown | gh pr comment 123 --body-file -
```
**Exit Codes:**
- `0` - Artifact is NOT blocked (all gates passed)
- `1` - Artifact IS blocked
- `2` - Error (not found, API error)
**Output (table):**
```
Artifact: sha256:abc123def456789012345678901234567890123456789012345678901234
Status: BLOCKED
Gate: VexTrust
Reason: Trust score below threshold (0.45 < 0.70)
Suggestion: Obtain VEX statement from trusted issuer
Evidence:
[VEX ] vex:sha256:de...23 vendor-x 2026-01-15T10:00:00Z
[REACH ] reach:sha256...56 static 2026-01-15T09:55:00Z
Replay: stella verify verdict --verdict urn:stella:verdict:sha256:abc123:v2.3.0:1737108000
```
**See Also:** [Explain Commands Documentation](explain.md)
---
## Additional Commands
### stella vuln query

View File

@@ -0,0 +1,333 @@
# P0 Product Metrics
> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation
This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.
## Overview
These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |
---
## Metric 1: Time to First Verified Release
**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram
### Definition
Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |
### Histogram Buckets
5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
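If the metric is also scraped by Prometheus, the P90 can be queried directly from the histogram. The sketch below assumes the standard Prometheus HTTP API and the conventional `_bucket` series suffix; the URL is a placeholder for your environment.

```bash
# Sketch: approximate the P90 for fresh installs from the raw histogram.
# Assumes Prometheus scraping with the usual _bucket series; URL is a placeholder.
PROM_URL="http://prometheus.internal:9090"

curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.90, sum by (le) (rate(stella_time_to_first_verified_release_seconds_bucket{deployment_type="fresh"}[7d])))' \
  | jq '.data.result'
```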
### Collection Points
1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion
### Why This Matters
A short time-to-first-release indicates:
- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations
### Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time
### Alert Response
**Warning (P90 > 4 hours):**
1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity
**Critical (P90 > 24 hours):**
1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance
---
## Metric 2: Mean Time to Answer "Why Blocked"
**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram
### Definition
Time from block decision to user viewing explanation (via CLI, UI, or API).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs different session |
### Histogram Buckets
1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
### Collection Points
1. **Block decision** - Timestamp stored in verdict
2. **Explanation view** - Tracked when `stella explain block` or UI equivalent invoked
### Why This Matters
Short "why blocked" latency indicates:
- Clear block messaging
- Discoverable explanation tools
- Good explainability UX
Long latency may indicate:
- Users confused about where to find answers
- Documentation gaps
- UX friction
### Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)
### Alert Response
**Warning (P90 > 5 minutes):**
1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent
**Critical (P90 > 1 hour):**
1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text
---
## Metric 3: Support Minutes per Customer
**Name:** `stella_support_burden_minutes_total`
**Type:** Counter
### Definition
Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |
### Collection
Log support interactions using:
```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```
Or via API:
```http
POST /v1/ops/support/log
{
"tenant": "acme-corp",
"minutes": 15,
"category": "config"
}
```
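A shell sketch of the same call; the API host and bearer-token authentication are assumptions to adapt to your deployment:

```bash
# Sketch: log a support interaction over the API.
# STELLA_API and the bearer-token auth scheme are placeholders/assumptions.
STELLA_API="https://stella.internal.example.com"

curl -sS -X POST "${STELLA_API}/v1/ops/support/log" \
  -H "Authorization: Bearer ${STELLA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"tenant": "acme-corp", "minutes": 15, "category": "config"}'
```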
### Why This Matters
This metric tracks operational scalability. For solo-scaled operations:
- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement
### Dashboard Usage
The Grafana dashboard shows:
- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden
### Alert Response
**Warning (> 30 min/month per tenant):**
1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues
**Critical (> 60 min/month per tenant):**
1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation
---
## Metric 4: Determinism Regressions
**Name:** `stella_determinism_regressions_total`
**Type:** Counter
### Definition
Count of detected determinism failures in production (same inputs produced different outputs).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |
### Severity Tiers
| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |
### Collection Points
1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection
### Why This Matters
Determinism is a core moat. Regressions indicate:
- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs
**Policy-level regressions are audit-breaking** and must be fixed immediately.
### Dashboard Usage
The Grafana dashboard shows:
- Counter with severity breakdown
- Alert status indicator
- Historical trend
### Alert Response
**Warning (any bitwise/semantic):**
1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component
**Critical (any policy):**
1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers
---
## Dashboard Access
The P0 metrics dashboard is available at:
```
/grafana/d/stella-p0-metrics
```
Or directly:
```bash
stella ops dashboard p0
```
### Dashboard Features
- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views
---
## Alerting Configuration
Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.
### Alert Channels
Configure alert destinations in Grafana:
- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries
### Silencing Alerts
During maintenance windows:
```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```
---
## Implementation Notes
### Source Files
| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |
### Adding Custom Metrics
To add additional P0-level metrics:
1. Define in `P0ProductMetrics.cs`
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation
---
## Related
- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,256 @@
# Auditor Guide
> **Sprint:** SPRINT_20260117_027_CLI_audit_bundle_command
> **Task:** AUD-007 - Documentation
This guide is for external auditors reviewing Stella Ops release evidence.
## Overview
Stella Ops generates comprehensive, tamper-evident audit bundles that contain all evidence required to verify release decisions. This guide explains how to interpret and verify these bundles.
## Receiving an Audit Bundle
Audit bundles may be delivered as:
- **Directory:** A folder containing all evidence files
- **Archive:** A `.tar.gz` or `.zip` file
### Extracting Archives
```bash
# tar.gz
tar -xzf audit-bundle-sha256-abc123.tar.gz
# zip
unzip audit-bundle-sha256-abc123.zip
```
## Bundle Structure
```
audit-bundle-<digest>-<timestamp>/
├── manifest.json # Integrity manifest
├── README.md # Quick reference
├── verdict/ # Release decision
├── evidence/ # Supporting evidence
├── policy/ # Policy configuration
└── replay/ # Verification instructions
```
## Step 1: Verify Bundle Integrity
Before reviewing contents, verify the bundle has not been tampered with.
### Using Stella CLI
```bash
stella audit verify ./audit-bundle-sha256-abc123/
```
Expected output:
```
✓ Verified 15/15 files
✓ Integrity hash verified
✓ Bundle integrity verified
```
### Manual Verification
1. Open `manifest.json`
2. For each file listed, compute SHA-256 and compare:
```bash
sha256sum verdict/verdict.json
```
3. Verify the `integrityHash` by hashing all file hashes
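The per-file comparison in step 2 can be scripted with `jq` and `sha256sum`. This sketch assumes the manifest lists entries under a `files[]` array with `path` and `sha256` fields; adjust to the actual manifest schema in your bundle. For step 3, `stella audit verify` recomputes the integrity hash for you.

```bash
# Sketch: verify every file listed in manifest.json against its recorded hash.
# Assumes manifest entries expose `path` and `sha256` fields -- adjust as needed.
cd audit-bundle-sha256-abc123/

jq -r '.files[] | "\(.sha256)  \(.path)"' manifest.json | sha256sum --check -
```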
## Step 2: Review the Verdict
The verdict is the official release decision.
### verdict/verdict.json
```json
{
"artifactDigest": "sha256:abc123...",
"decision": "PASS",
"timestamp": "2026-01-17T10:25:00Z",
"gates": [
{
"gateId": "sbom-required",
"status": "PASS",
"reason": "Valid CycloneDX SBOM present"
},
{
"gateId": "vex-trust",
"status": "PASS",
"reason": "Trust score 0.85 >= 0.70 threshold"
}
]
}
```
### Decision Values
| Decision | Meaning |
|----------|---------|
| `PASS` | All gates passed, artifact approved for deployment |
| `BLOCKED` | One or more gates failed, artifact not approved |
| `PENDING` | Evaluation incomplete, awaiting additional evidence |
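Using the field names shown above, the verdict can be summarised from the command line:

```bash
# Print the overall decision and the artifact it applies to
jq -r '"\(.decision) for \(.artifactDigest)"' verdict/verdict.json

# List any gate that did not pass
jq -r '.gates[] | select(.status != "PASS") | "\(.gateId): \(.reason)"' verdict/verdict.json
```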
### verdict/verdict.dsse.json
This file contains the cryptographically signed verdict envelope (DSSE format). Verify signatures using:
```bash
stella audit verify ./bundle/ --check-signatures
```
## Step 3: Review Evidence
### evidence/sbom.json
Software Bill of Materials (SBOM) listing all components in the artifact.
**Key fields:**
- `components[]` - List of all software components
- `dependencies[]` - Dependency relationships
- `metadata.timestamp` - When SBOM was generated
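Assuming the SBOM is CycloneDX JSON (as referenced by the `sbom-required` gate), these fields can be inspected quickly:

```bash
# When the SBOM was generated
jq -r '.metadata.timestamp' evidence/sbom.json

# First few components, name@version
jq -r '.components[] | "\(.name)@\(.version // "unknown")"' evidence/sbom.json | sort | head
```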
### evidence/vex-statements/
Vulnerability Exploitability eXchange (VEX) statements that justify vulnerability assessments.
**index.json:**
```json
{
"statementCount": 3,
"statements": [
{"fileName": "vex-001.json", "source": "vendor-security"},
{"fileName": "vex-002.json", "source": "internal-analysis"}
]
}
```
Each VEX statement explains why a vulnerability does or does not affect this artifact.
### evidence/reachability/analysis.json
Reachability analysis showing which vulnerabilities are actually reachable in the code.
```json
{
"components": [
{
"purl": "pkg:npm/lodash@4.17.21",
"vulnerabilities": [
{
"id": "CVE-2021-23337",
"reachable": false,
"reason": "Vulnerable function not in call graph"
}
]
}
]
}
```
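Using the field names shown above, reachable vulnerabilities can be listed directly:

```bash
# List only vulnerabilities the analysis marks as reachable
jq -r '.components[] as $c
       | $c.vulnerabilities[]
       | select(.reachable == true)
       | "\($c.purl): \(.id)"' evidence/reachability/analysis.json
```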
## Step 4: Review Policy
### policy/policy-snapshot.json
The policy configuration used for evaluation:
```json
{
"policyVersion": "v2.3.1",
"gates": ["sbom-required", "vex-trust", "cve-threshold"],
"thresholds": {
"vexTrustScore": 0.70,
"maxCriticalCves": 0,
"maxHighCves": 5
}
}
```
### policy/gate-decision.json
Detailed breakdown of each gate evaluation:
```json
{
"gates": [
{
"gateId": "vex-trust",
"decision": "PASS",
"inputs": {
"vexStatements": 3,
"trustScore": 0.85,
"threshold": 0.70
}
}
]
}
```
## Step 5: Replay Verification (Optional)
For maximum assurance, you can replay the verdict evaluation.
### Using Stella CLI
```bash
cd audit-bundle-sha256-abc123/
stella replay snapshot --manifest replay/knowledge-snapshot.json
```
This re-evaluates the policy using the frozen inputs and should produce an identical verdict.
### Manual Replay Steps
See `replay/replay-instructions.md` for detailed steps.
## Compliance Mapping
| Compliance Framework | Relevant Bundle Components |
|---------------------|---------------------------|
| **SOC 2 (CC7.1)** | verdict/, policy/ |
| **ISO 27001 (A.12.6)** | evidence/sbom.json |
| **FedRAMP** | All components |
| **SLSA Level 3** | evidence/provenance/ |
## Common Questions
### Q: Why was this artifact blocked?
Review `policy/gate-decision.json` for the specific gate that failed and its reason.
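Using the fields shown in the example above, the failed gates and their inputs can be pulled out with `jq`:

```bash
# Show failed gates and the inputs that drove each decision
jq '.gates[] | select(.decision != "PASS") | {gateId, decision, inputs}' policy/gate-decision.json
```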
### Q: How do I verify the SBOM is accurate?
The SBOM digest is included in the manifest. Compare against the organization's SBOM generation process.
### Q: What if replay produces a different result?
This may indicate:
1. Policy version mismatch
2. Missing evidence files
3. Time-dependent policy rules
Contact the organization's security team for clarification.
### Q: How long should audit bundles be retained?
Stella Ops recommends:
- Production releases: 5 years minimum
- Security-critical systems: 7 years
- Regulated industries: Per compliance requirements
## Support
For questions about this audit bundle:
1. Contact the organization's Stella Ops administrator
2. Reference the Bundle ID from `manifest.json`
3. Include the artifact digest
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,112 @@
# Runbook Coverage Tracking
This document tracks operational runbook coverage across Stella Ops modules.
**Target:** 80% coverage of critical failure modes before declaring the operability moat achieved.
---
## Coverage Summary
| Module | Critical Failures | Runbooks | Coverage | Status |
|--------|-------------------|----------|----------|--------|
| Scanner | 5 | 0 | 0% | 🔴 Gap |
| Policy Engine | 5 | 0 | 0% | 🔴 Gap |
| Release Orchestrator | 5 | 0 | 0% | 🔴 Gap |
| Attestor | 5 | 0 | 0% | 🔴 Gap |
| Feed Connectors | 4 | 0 | 0% | 🔴 Gap |
| **Database (Postgres)** | 4 | 4 | 100% | ✅ Complete |
| **Crypto Subsystem** | 4 | 4 | 100% | ✅ Complete |
| **Evidence Locker** | 4 | 4 | 100% | ✅ Complete |
| **Backup/Restore** | 4 | 4 | 100% | ✅ Complete |
| Authority (OAuth/OIDC) | 3 | 0 | 0% | 🔴 Gap |
| **Overall** | **43** | **16** | **37%** | 🟡 In Progress |
---
## Available Runbooks
### Database Operations
- [postgres-ops.md](postgres-ops.md) - PostgreSQL database operations
### Crypto Subsystem
- [crypto-ops.md](crypto-ops.md) - Regional crypto operations (FIPS, eIDAS, GOST, SM)
### Evidence Locker
- [evidence-locker-ops.md](evidence-locker-ops.md) - Evidence locker operations
### Backup/Restore
- [backup-restore-ops.md](backup-restore-ops.md) - Backup and restore procedures
### Vulnerability Operations
- [vuln-ops.md](vuln-ops.md) - Vulnerability management operations
### VEX Operations
- [vex-ops.md](vex-ops.md) - VEX statement operations
### Policy Incidents
- [policy-incident.md](policy-incident.md) - Policy-related incident response
---
## Gap Analysis
### High Priority Gaps (Critical modules without runbooks)
1. **Scanner** - Core scanning functionality
- Worker stuck
- OOM on large images
- Registry auth failures
2. **Policy Engine** - Policy evaluation
- Slow evaluation
- OPA crashes
- Compilation failures
3. **Release Orchestrator** - Promotion workflow
- Stuck promotions
- Gate timeouts
- Missing evidence
### Medium Priority Gaps
4. **Attestor** - Signing and verification
- Signing failures
- Key expiration
- Rekor unavailability
5. **Feed Connectors** - Advisory feeds
- NVD failures
- Rate limiting
- Offline bundle issues
### Lower Priority Gaps
6. **Authority** - Authentication
- Token validation failures
- OIDC provider issues
---
## Template
New runbooks should use the template: [_template.md](_template.md)
---
## Doctor Check Integration
Runbooks should be linked from Doctor check output. Current integration status:
| Module | Doctor Checks | Linked to Runbook |
|--------|---------------|-------------------|
| Postgres | 4 | 0 |
| Crypto | 8 | 0 |
| Storage | 3 | 0 |
| Evidence | 4 | 0 |
**Next step:** Update Doctor check implementations to include runbook links in remediation output.
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,157 @@
# Runbook: [Component] - [Failure Scenario]
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-001 - Runbook Template
## Metadata
| Field | Value |
|-------|-------|
| **Component** | [Module name: Scanner, Policy, Orchestrator, Attestor, etc.] |
| **Severity** | Critical / High / Medium / Low |
| **On-call scope** | [Who should be paged: Platform team, Security team, etc.] |
| **Last updated** | [YYYY-MM-DD] |
| **Doctor check** | [Check ID if applicable, e.g., `check.scanner.worker-health`] |
---
## Symptoms
Observable indicators that this failure is occurring:
- [ ] [Symptom 1: e.g., "Scan jobs stuck in pending state for >5 minutes"]
- [ ] [Symptom 2: e.g., "Error logs contain 'worker timeout exceeded'"]
- [ ] [Metric/alert that fires: e.g., "Alert `ScannerWorkerStuck` firing"]
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | [e.g., "New scans cannot complete, blocking CI/CD pipelines"] |
| **Data integrity** | [e.g., "No data loss, but stale scan results may be served"] |
| **SLA impact** | [e.g., "Scan latency SLO violated if not resolved within 15 minutes"] |
---
## Diagnosis
### Quick checks (< 2 minutes)
Run these first to confirm the failure:
1. **Check Doctor diagnostics:**
```bash
stella doctor --check [relevant-check-id]
```
2. **Check service status:**
```bash
stella [component] status
```
3. **Check recent logs:**
```bash
stella [component] logs --tail 50 --level error
```
### Deep diagnosis (if quick checks inconclusive)
1. **[Investigation step 1]:**
```bash
[command]
```
Expected output: [description]
If unexpected: [what it means]
2. **[Investigation step 2]:**
```bash
[command]
```
3. **Check related services:**
- Postgres connectivity: `stella doctor --check check.storage.postgres`
- Valkey connectivity: `stella doctor --check check.storage.valkey`
- Network connectivity: `stella doctor --check check.network.[target]`
---
## Resolution
### Immediate mitigation (restore service quickly)
Use these steps to restore service, even if root cause isn't fixed yet:
1. **[Mitigation step 1]:**
```bash
[command]
```
This will: [explanation]
2. **[Mitigation step 2]:**
```bash
[command]
```
### Root cause fix
Once service is restored, address the underlying issue:
1. **[Fix step 1]:**
```bash
[command]
```
2. **[Fix step 2]:**
```bash
[command]
```
3. **Verify fix is complete:**
```bash
stella doctor --check [relevant-check-id]
```
### Verification
Confirm the issue is fully resolved:
```bash
# Re-run the failing operation
stella [component] [test-command]
# Verify metrics are healthy
stella obs metrics --filter [component] --last 5m
# Verify no new errors in logs
stella [component] logs --tail 20 --level error
```
---
## Prevention
How to prevent this failure from recurring:
- [ ] **Monitoring:** [e.g., "Add alert for queue depth > 100"]
- [ ] **Configuration:** [e.g., "Increase worker count in high-volume environments"]
- [ ] **Code change:** [e.g., "Implement circuit breaker for external service calls"]
- [ ] **Documentation:** [e.g., "Update capacity planning guide"]
---
## Related Resources
- **Architecture doc:** [Link to relevant architecture documentation]
- **Related runbooks:** [Links to related failure scenarios]
- **Doctor check source:** [Link to Doctor check implementation]
- **Grafana dashboard:** [Link to relevant dashboard]
---
## Revision History
| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | Initial version |

View File

@@ -0,0 +1,193 @@
# Runbook: Attestor - HSM Connection Issues
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor / Cryptography |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.crypto.hsm-availability` |
---
## Symptoms
- [ ] Signing operations failing with "HSM unavailable"
- [ ] Alert `AttestorHsmConnectionFailed` firing
- [ ] Error: "PKCS#11 operation failed" or "HSM session timeout"
- [ ] Attestations cannot be created
- [ ] Key operations (sign, verify) failing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No attestations can be signed; releases blocked |
| **Data integrity** | Keys are safe in HSM; operations resume when connection restored |
| **SLA impact** | All signing operations blocked; compliance posture at risk |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.crypto.hsm-availability
```
2. **Check HSM connection status:**
```bash
stella crypto hsm status
```
3. **Test HSM connectivity:**
```bash
stella crypto hsm test
```
### Deep diagnosis
1. **Check PKCS#11 library status:**
```bash
stella crypto hsm pkcs11-status
```
Look for: Library loaded, slot available, session active
2. **Check HSM network connectivity:**
```bash
stella crypto hsm ping
```
3. **Check HSM session logs:**
```bash
stella crypto hsm logs --last 30m
```
Look for: Session errors, timeout, authentication failures
4. **Check HSM slot status:**
```bash
stella crypto hsm slots list
```
Problem if: Slot not found, slot busy, token not present
---
## Resolution
### Immediate mitigation
1. **Attempt HSM reconnection:**
```bash
stella crypto hsm reconnect
```
2. **If HSM unreachable, switch to software signing (if permitted):**
```bash
stella attest config set signing.mode software
stella attest reload
```
**Warning:** Software signing may not meet compliance requirements
3. **Use backup HSM if configured:**
```bash
stella crypto hsm failover --to backup
```
### Root cause fix
**If network connectivity issue:**
1. Check HSM network path:
```bash
stella crypto hsm connectivity --verbose
```
2. Verify firewall rules allow HSM port (typically 1792 for Luna, 2225 for SafeNet)
3. Check HSM server status with vendor tools
**If session timeout:**
1. Increase session timeout:
```bash
stella crypto hsm config set session.timeout 300s
stella crypto hsm reconnect
```
2. Enable session keep-alive:
```bash
stella crypto hsm config set session.keepalive true
stella crypto hsm config set session.keepalive_interval 60s
```
**If authentication failed:**
1. Verify HSM credentials:
```bash
stella crypto hsm auth verify
```
2. Update HSM PIN if changed:
```bash
stella crypto hsm auth update --slot <slot-id>
```
**If PKCS#11 library issue:**
1. Verify library path:
```bash
stella crypto hsm config get pkcs11.library_path
```
2. Reload PKCS#11 library:
```bash
stella crypto hsm pkcs11-reload
```
3. Check library compatibility:
```bash
stella crypto hsm pkcs11-info
```
### Verification
```bash
# Test HSM connectivity
stella crypto hsm test
# Test signing operation
stella attest test-sign
# Verify key access
stella keys verify <key-id> --operation sign
# Check no errors in logs
stella crypto hsm logs --level error --last 30m
```
---
## Prevention
- [ ] **Redundancy:** Configure backup HSM for failover
- [ ] **Monitoring:** Alert on HSM connection failures immediately
- [ ] **Keep-alive:** Enable session keep-alive to prevent timeouts
- [ ] **Testing:** Include HSM health in regular health checks
---
## Related Resources
- **Architecture:** `docs/modules/cryptography/hsm-integration.md`
- **Related runbooks:** `attestor-signing-failed.md`, `crypto-ops.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Crypto/`
- **HSM setup:** `docs/operations/hsm-configuration.md`

View File

@@ -0,0 +1,190 @@
# Runbook: Attestor - Signing Key Expired
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.key-expiration` |
---
## Symptoms
- [ ] Attestation creation failing with "key expired" error
- [ ] Alert `AttestorKeyExpired` firing
- [ ] Error: "signing key certificate has expired"
- [ ] New attestations cannot be created
- [ ] Verification of new attestations failing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | No new attestations can be signed; releases blocked |
| **Data integrity** | Existing attestations remain valid; new ones cannot be created |
| **SLA impact** | Release SLO violated; compliance posture compromised |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.key-expiration
```
2. **List signing keys and expiration:**
```bash
stella keys list --type signing --show-expiration
```
Look for: Keys with status "expired" or expiring soon
3. **Check active signing key:**
```bash
stella attest config get signing.key_id
stella keys show <key-id> --details
```
### Deep diagnosis
1. **Check certificate chain validity:**
```bash
stella crypto cert verify-chain --key <key-id>
```
Problem if: Any certificate in chain expired
2. **Check for backup keys:**
```bash
stella keys list --type signing --status inactive
```
Look for: Unexpired backup keys that can be activated
3. **Check key rotation history:**
```bash
stella keys rotation-history --key <key-id>
```
---
## Resolution
### Immediate mitigation
1. **If backup key available, activate it:**
```bash
stella keys activate <backup-key-id>
stella attest config set signing.key_id <backup-key-id>
stella attest reload
```
2. **Verify signing works:**
```bash
stella attest test-sign
```
3. **Retry failed attestations:**
```bash
stella attest retry --failed --last 1h
```
### Root cause fix
**Generate new signing key:**
1. Generate new key pair:
```bash
stella keys generate \
--type signing \
--algorithm ecdsa-p256 \
--validity 365d \
--name "signing-key-$(date +%Y%m%d)"
```
2. If using HSM:
```bash
stella keys generate \
--type signing \
--algorithm ecdsa-p256 \
--validity 365d \
--hsm-slot <slot> \
--name "signing-key-$(date +%Y%m%d)"
```
3. Register the new key:
```bash
stella keys register <new-key-id> --purpose attestation-signing
```
4. Update signing configuration:
```bash
stella attest config set signing.key_id <new-key-id>
stella attest reload
```
5. Publish new public key to trust anchors:
```bash
stella issuer keys publish <new-key-id>
```
**Configure automatic rotation:**
1. Enable auto-rotation:
```bash
stella keys config set rotation.auto true
stella keys config set rotation.before_expiry 30d
stella keys config set rotation.overlap_days 14
```
2. Set up rotation alerts:
```bash
stella keys config set alerts.expiring_days 30
stella keys config set alerts.expiring_days_critical 7
```
### Verification
```bash
# Verify new key is active
stella keys list --type signing --status active
# Test signing
stella attest test-sign
# Create test attestation
stella attest create --type test --subject "test:key-rotation"
# Verify the attestation
stella verify attestation --last
# Check key expiration
stella keys show <new-key-id> --details | grep -i expir
```
---
## Prevention
- [ ] **Rotation:** Enable automatic key rotation 30 days before expiry
- [ ] **Monitoring:** Alert on keys expiring within 30 days (warning) and 7 days (critical)
- [ ] **Backup:** Maintain at least one backup signing key
- [ ] **Documentation:** Document key rotation procedures and approval process
---
## Related Resources
- **Architecture:** `docs/modules/attestor/architecture.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-hsm-connection.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
- **Key management:** `docs/operations/key-management.md`

View File

@@ -0,0 +1,184 @@
# Runbook: Attestor - Rekor Transparency Log Unreachable
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.rekor-connectivity` |
---
## Symptoms
- [ ] Attestation transparency logging failing
- [ ] Alert `AttestorRekorUnavailable` firing
- [ ] Error: "Rekor server unavailable" or "transparency log submission failed"
- [ ] Attestations created but not anchored to transparency log
- [ ] Verification failing due to missing log entry
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Attestations not publicly verifiable via transparency log |
| **Data integrity** | Attestations still valid locally; transparency reduced |
| **SLA impact** | Compliance may require transparency log anchoring |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.rekor-connectivity
```
2. **Check Rekor connectivity:**
```bash
stella attest rekor status
```
3. **Test Rekor endpoint:**
```bash
stella attest rekor ping
```
### Deep diagnosis
1. **Check Rekor server URL:**
```bash
stella attest config get rekor.url
```
Default: https://rekor.sigstore.dev
2. **Check for public Rekor outage:**
```bash
stella attest rekor api-status
```
Also check: https://status.sigstore.dev/
3. **Check network/proxy issues:**
```bash
stella attest rekor test --verbose
```
Look for: TLS errors, proxy blocks, timeout
4. **Check pending log entries:**
```bash
stella attest rekor pending-entries
```
---
## Resolution
### Immediate mitigation
1. **Queue attestations for later submission:**
```bash
stella attest config set rekor.queue_on_failure true
stella attest reload
```
2. **Disable Rekor requirement temporarily:**
```bash
stella attest config set rekor.required false
stella attest reload
```
**Warning:** Reduces transparency guarantees
3. **Use private Rekor instance if available:**
```bash
stella attest config set rekor.url https://rekor.internal.example.com
stella attest reload
```
### Root cause fix
**If public Rekor outage:**
1. Wait for Sigstore to resolve the issue
2. Check status at https://status.sigstore.dev/
3. Process queued entries when service recovers:
```bash
stella attest rekor process-queue
```
**If network/firewall issue:**
1. Verify outbound HTTPS to rekor.sigstore.dev:
```bash
stella attest rekor connectivity --verbose
```
2. Configure proxy if required:
```bash
stella attest config set rekor.proxy https://proxy:8080
```
3. Add Rekor endpoints to firewall allowlist:
- rekor.sigstore.dev:443
- fulcio.sigstore.dev:443 (for certificate issuance)
**If TLS certificate issue:**
1. Check certificate validity:
```bash
stella attest rekor cert-check
```
2. Update CA certificates:
```bash
stella crypto ca update
```
**If private Rekor instance issue:**
1. Check private Rekor server status
2. Verify Rekor database health
3. Check Rekor signer availability
### Verification
```bash
# Test Rekor connectivity
stella attest rekor ping
# Submit test entry
stella attest rekor test-submit
# Process any queued entries
stella attest rekor process-queue
# Verify recent attestation in log
stella attest rekor lookup --attestation <attestation-id>
```
---
## Prevention
- [ ] **Redundancy:** Configure private Rekor instance as fallback
- [ ] **Queuing:** Enable queue-on-failure for resilience
- [ ] **Monitoring:** Alert on Rekor submission failures
- [ ] **Offline:** Document attestation validity without Rekor for air-gap scenarios
---
## Related Resources
- **Architecture:** `docs/modules/attestor/transparency-log.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-verification-failed.md`
- **Sigstore docs:** https://docs.sigstore.dev/
- **Rekor setup:** `docs/operations/rekor-configuration.md`

View File

@@ -0,0 +1,176 @@
# Runbook: Attestor - Signature Generation Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | Critical |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.signing-health` |
---
## Symptoms
- [ ] Attestation requests failing with "signing failed" error
- [ ] Alert `AttestorSigningFailed` firing
- [ ] Evidence bundles missing signatures
- [ ] Metric `attestor_signing_failures_total` increasing
- [ ] Release pipeline blocked due to unsigned attestations
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Releases blocked; attestations cannot be created |
| **Data integrity** | Evidence is recorded but unsigned; can be signed later |
| **SLA impact** | Release SLO violated; evidence integrity compromised |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.signing-health
```
2. **Check attestor service status:**
```bash
stella attest status
```
3. **Check signing key availability:**
```bash
stella keys list --type signing --status active
```
Problem if: No active signing keys
### Deep diagnosis
1. **Test signing operation:**
```bash
stella attest test-sign --verbose
```
Look for: Specific error message
2. **Check key material access:**
```bash
stella keys verify <key-id> --operation sign
```
3. **If using HSM, check HSM connectivity:**
```bash
stella doctor --check check.crypto.hsm-availability
```
4. **Check for key expiration:**
```bash
stella keys list --expiring-within 7d
```
---
## Resolution
### Immediate mitigation
1. **If key expired, rotate to backup key:**
```bash
stella keys activate <backup-key-id>
stella attest config set signing.key_id <backup-key-id>
```
2. **If HSM unavailable, switch to software signing (temporary):**
```bash
stella attest config set signing.mode software
stella attest reload
```
⚠️ **Warning:** Software signing may not meet compliance requirements
3. **Retry failed attestations:**
```bash
stella attest retry --failed --last 1h
```
### Root cause fix
**If key expired:**
1. Generate new signing key:
```bash
stella keys generate --type signing --algorithm ecdsa-p256
```
2. Configure key rotation schedule:
```bash
stella keys config set rotation.auto true
stella keys config set rotation.overlap_days 14
```
**If HSM connection failed:**
1. Verify HSM configuration:
```bash
stella crypto hsm verify
```
2. Restart HSM connection:
```bash
stella crypto hsm reconnect
```
**If certificate chain issue:**
1. Verify certificate chain:
```bash
stella crypto cert verify-chain --key <key-id>
```
2. Update intermediate certificates:
```bash
stella crypto cert update-chain --key <key-id>
```
### Verification
```bash
# Test signing
stella attest test-sign
# Create test attestation
stella attest create --type test --subject "test:verification"
# Verify the attestation
stella verify attestation --last
# Check no failures in recent operations
stella attest logs --level error --last 30m
```
---
## Prevention
- [ ] **Key rotation:** Enable automatic key rotation with 14-day overlap
- [ ] **Monitoring:** Alert on keys expiring within 30 days
- [ ] **Backup:** Maintain backup signing key in different HSM slot
- [ ] **Testing:** Include signing test in health check schedule
---
## Related Resources
- **Architecture:** `docs/modules/attestor/architecture.md`
- **Related runbooks:** `attestor-key-expired.md`, `attestor-hsm-connection.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Attestor/`
- **Dashboard:** Grafana > Stella Ops > Attestor

View File

@@ -0,0 +1,195 @@
# Runbook: Attestor - Attestation Verification Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-005 - Attestor Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Attestor |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.attestor.verification-health` |
---
## Symptoms
- [ ] Attestation verification failing
- [ ] Alert `AttestorVerificationFailed` firing
- [ ] Error: "signature verification failed" or "invalid attestation"
- [ ] Promotions blocked due to failed verification
- [ ] Error: "trust anchor not found" or "certificate chain invalid"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Artifacts cannot be promoted; release blocked |
| **Data integrity** | May indicate tampered attestation or configuration issue |
| **SLA impact** | Release pipeline blocked until resolved |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.attestor.verification-health
```
2. **Verify specific attestation:**
```bash
stella verify attestation --attestation <attestation-id> --verbose
```
3. **Check trust anchors:**
```bash
stella trust-anchors list
```
### Deep diagnosis
1. **Check attestation details:**
```bash
stella attest show <attestation-id> --details
```
Look for: Signer identity, timestamp, subject
2. **Verify certificate chain:**
```bash
stella verify cert-chain --attestation <attestation-id>
```
Problem if: Intermediate cert missing, root not trusted
3. **Check public key availability:**
```bash
stella keys show <key-id> --public
```
4. **Check if issuer is trusted:**
```bash
stella issuer trust-status <issuer-id>
```
---
## Resolution
### Immediate mitigation
1. **If trust anchor missing, add it:**
```bash
stella trust-anchors add --cert <issuer-cert.pem>
```
2. **If intermediate cert missing:**
```bash
stella trust-anchors add-intermediate --cert <intermediate.pem>
```
3. **Re-verify with verbose output:**
```bash
stella verify attestation --attestation <attestation-id> --verbose
```
### Root cause fix
**If signature mismatch:**
1. Check attestation wasn't modified:
```bash
stella attest integrity-check <attestation-id>
```
2. If modified, regenerate attestation:
```bash
stella attest create --subject <digest> --type <type> --force
```
**If key rotated and old key not trusted:**
1. Add old public key to trust anchors:
```bash
stella trust-anchors add-key --key <old-key.pem> --expires <date>
```
2. Or fetch from issuer directory:
```bash
stella issuer keys fetch <issuer-id>
```
**If certificate expired:**
1. Check certificate validity:
```bash
stella verify cert --attestation <attestation-id> --show-expiry
```
2. Re-sign with valid certificate:
```bash
stella attest resign <attestation-id>
```
**If issuer not trusted:**
1. Verify issuer identity:
```bash
stella issuer show <issuer-id>
```
2. Add to trusted issuers (requires approval):
```bash
stella issuer trust <issuer-id> --reason "Approved by security team"
```
**If algorithm not supported:**
1. Check algorithm:
```bash
stella attest show <attestation-id> | grep algorithm
```
2. Verify crypto provider supports algorithm:
```bash
stella crypto providers list --algorithms
```
### Verification
```bash
# Verify attestation
stella verify attestation --attestation <attestation-id>
# Verify trust chain
stella verify cert-chain --attestation <attestation-id>
# Test end-to-end verification
stella verify artifact --digest <digest>
# Check no verification errors
stella attest logs --filter "verification" --level error --last 30m
```
---
## Prevention
- [ ] **Trust anchors:** Keep trust anchor list current with all valid issuer certs
- [ ] **Key rotation:** Plan key rotation with overlap period for verification continuity
- [ ] **Monitoring:** Alert on verification failure rate > 0
- [ ] **Testing:** Include verification tests in release pipeline
---
## Related Resources
- **Architecture:** `docs/modules/attestor/verification.md`
- **Related runbooks:** `attestor-signing-failed.md`, `attestor-key-expired.md`
- **Trust management:** `docs/operations/trust-anchors.md`

View File

@@ -0,0 +1,449 @@
# Backup and Restore Operations Runbook
> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
> **Task:** RUN-004 - Backup/Restore Runbook

Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Comprehensive backup and restore procedures for all Stella Ops components including database, evidence locker, configuration, and secrets.
---
## Backup Architecture Overview
### Backup Components
| Component | Backup Type | Default Schedule | Retention |
|-----------|-------------|------------------|-----------|
| PostgreSQL | Full + WAL | Daily full, continuous WAL | 30 days |
| Evidence Locker | Incremental | Daily | 90 days |
| Configuration | Snapshot | Daily + on change | 90 days |
| Secrets | Encrypted snapshot | Daily | 30 days |
| Attestation Keys | Encrypted export | Weekly | 1 year |
### Storage Locations
- **Primary:** `/var/lib/stellaops/backups/` (local)
- **Secondary:** S3/Azure Blob/GCS (configurable)
- **Offline:** Removable media for air-gap scenarios
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check backup service status
stella backup status
# Verify backup storage
stella doctor --check check.storage.backup
# List recent backups
stella backup list --last 7d
# Test backup restore capability
stella backup test-restore --latest --dry-run
```
### Metrics to Watch
- `stella_backup_last_success_timestamp` - Last successful backup
- `stella_backup_duration_seconds` - Backup duration
- `stella_backup_size_bytes` - Backup size
- `stella_restore_test_last_success` - Last restore test
---
## Standard Procedures
### SP-001: Create Manual Backup
**When:** Before upgrades, schema changes, or major configuration changes
**Duration:** 5-30 minutes depending on data volume
1. Create full system backup:
```bash
stella backup create --full --name "pre-upgrade-$(date +%Y%m%d)"
```
2. Or create component-specific backup:
```bash
# Database only
stella backup create --type database --name "db-pre-migration"
# Evidence locker only
stella backup create --type evidence --name "evidence-snapshot"
# Configuration only
stella backup create --type config --name "config-backup"
```
3. Verify backup:
```bash
stella backup verify --name "pre-upgrade-$(date +%Y%m%d)"
```
4. Copy to offsite storage (recommended):
```bash
stella backup copy --name "pre-upgrade-$(date +%Y%m%d)" --destination s3://backup-bucket/
```
### SP-002: Verify Backup Integrity
**Frequency:** Weekly
**Duration:** 15-60 minutes
1. List backups for verification:
```bash
stella backup list --unverified
```
2. Verify backup integrity:
```bash
# Verify specific backup
stella backup verify --name <backup-name>
# Verify all unverified
stella backup verify --all-unverified
```
3. Test restore (non-destructive):
```bash
stella backup test-restore --name <backup-name> --target /tmp/restore-test
```
4. Record verification result:
```bash
stella backup log-verification --name <backup-name> --result success
```
### SP-003: Restore from Backup
**CAUTION: This is a destructive operation**
#### Full System Restore
1. Stop all services:
```bash
stella service stop --all
```
2. List available backups:
```bash
stella backup list --type full
```
3. Restore:
```bash
# Dry run first
stella backup restore --name <backup-name> --dry-run
# Execute restore
stella backup restore --name <backup-name> --confirm
```
4. Start services:
```bash
stella service start --all
```
5. Verify restoration:
```bash
stella doctor --all
stella service health
```
#### Component-Specific Restore
1. Database restore:
```bash
stella service stop --service api,release-orchestrator
stella backup restore --type database --name <backup-name> --confirm
stella db migrate # Apply any pending migrations
stella service start --service api,release-orchestrator
```
2. Evidence locker restore:
```bash
stella backup restore --type evidence --name <backup-name> --confirm
stella evidence verify --mode quick
```
3. Configuration restore:
```bash
stella backup restore --type config --name <backup-name> --confirm
stella service restart --graceful
```
### SP-004: Point-in-Time Recovery (Database)
1. Identify target recovery point:
```bash
# List WAL archives
stella backup wal-list --after <start-date> --before <end-date>
```
2. Perform PITR:
```bash
stella backup restore-pitr --to-time "2026-01-17T10:30:00Z" --confirm
```
3. Verify data state:
```bash
stella db verify-integrity
```
---
## Backup Schedules
### Configure Backup Schedule
```bash
# View current schedule
stella backup schedule show
# Set database backup schedule
stella backup schedule set --type database --cron "0 2 * * *"
# Set evidence backup schedule
stella backup schedule set --type evidence --cron "0 3 * * *"
# Set configuration backup schedule
stella backup schedule set --type config --cron "0 4 * * *" --on-change
```
### Retention Policy
```bash
# View retention policy
stella backup retention show
# Set retention
stella backup retention set --type database --days 30
stella backup retention set --type evidence --days 90
stella backup retention set --type config --days 90
# Apply retention (cleanup old backups)
stella backup retention apply
```
---
## Incident Procedures
### INC-001: Backup Failure
**Symptoms:**
- Alert: `StellaBackupFailed`
- Missing recent backup
**Investigation:**
```bash
# Check backup logs
stella backup logs --last 24h
# Check disk space
stella doctor --check check.storage.diskspace,check.storage.backup
# Test backup operation
stella backup test --type database
```
**Resolution:**
1. **Disk space issue:**
```bash
stella backup retention apply --force
stella backup cleanup --expired
```
2. **Database connectivity:**
```bash
stella doctor --check check.postgres.connectivity
```
3. **Permission issue:**
- Check backup directory permissions
- Verify service account access
4. **Retry backup:**
```bash
stella backup create --type <failed-type> --retry
```
### INC-002: Restore Failure
**Symptoms:**
- Restore command fails
- Services not starting after restore
**Investigation:**
```bash
# Check restore logs
stella backup restore-logs --last-attempt
# Verify backup integrity
stella backup verify --name <backup-name>
# Check disk space
stella doctor --check check.storage.diskspace
```
**Resolution:**
1. **Corrupted backup:**
```bash
# Try previous backup
stella backup list --type <type>
stella backup restore --name <previous-backup> --confirm
```
2. **Version mismatch:**
```bash
# Check backup version
stella backup info --name <backup-name>
# Restore with migration
stella backup restore --name <backup-name> --with-migration
```
3. **Disk space:**
- Free space or expand volume
- Restore to alternate location
### INC-003: Backup Storage Full
**Symptoms:**
- Alert: `StellaBackupStorageFull`
- New backups failing
**Immediate Actions:**
```bash
# Check storage
stella backup storage stats
# Emergency cleanup
stella backup cleanup --keep-last 3
# Delete specific old backups
stella backup delete --older-than 14d --confirm
```
**Resolution:**
1. **Adjust retention:**
```bash
stella backup retention set --type database --days 14
stella backup retention apply
```
2. **Expand storage:**
- Add disk space
- Configure offsite storage
3. **Archive to cold storage:**
```bash
stella backup archive --older-than 30d --destination s3://archive-bucket/
```
---
## Disaster Recovery Scenarios
### DR-001: Complete System Loss
1. Provision new infrastructure
2. Install Stella Ops
3. Restore from offsite backup:
```bash
stella backup restore --source s3://backup-bucket/latest-full.tar.gz --confirm
```
4. Verify all components
5. Update DNS/load balancer
### DR-002: Database Corruption
1. Stop services
2. Restore database from latest clean backup:
```bash
stella backup restore --type database --name <last-known-good>
```
3. Apply WAL to near-corruption point (PITR)
4. Verify data integrity
5. Resume services
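Steps 3 and 4 reuse the PITR and integrity commands from earlier in this runbook; the target timestamp below is an example and should be set just before the corruption was introduced:
```bash
# Step 3: roll the database forward to a point shortly before the corruption
stella backup restore-pitr --to-time "2026-01-17T09:55:00Z" --confirm
# Step 4: confirm data integrity before resuming services
stella db verify-integrity
```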
### DR-003: Evidence Locker Loss
1. Restore evidence from backup:
```bash
stella backup restore --type evidence --name <backup-name>
```
2. Rebuild index:
```bash
stella evidence index rebuild
```
3. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
---
## Offline/Air-Gap Backup
### Creating Offline Backup
```bash
# Create encrypted offline bundle
stella backup create-offline \
--output /media/usb/stellaops-backup-$(date +%Y%m%d).enc \
--encrypt \
--passphrase-file /secure/backup-key
# Verify offline backup
stella backup verify-offline --input /media/usb/stellaops-backup-*.enc
```
### Restoring from Offline Backup
```bash
# Restore from offline backup
stella backup restore-offline \
--input /media/usb/stellaops-backup-*.enc \
--passphrase-file /secure/backup-key \
--confirm
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Backup Status
Key panels:
- Last backup success time
- Backup size trend
- Backup duration
- Restore test status
- Storage utilization
---
## Evidence Capture
```bash
stella backup diagnostics --output /tmp/backup-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
---
## Escalation Path
1. **L1 (On-call):** Retry failed backups, basic troubleshooting
2. **L2 (Platform team):** Restore operations, schedule adjustments
3. **L3 (Architecture):** Disaster recovery execution
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,196 @@
# Runbook: Feed Connector - GitHub Security Advisories (GHSA) Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / GHSA Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.ghsa-health` |
---
## Symptoms
- [ ] GHSA feed sync failing or stale
- [ ] Alert `ConnectorGhsaSyncFailed` firing
- [ ] Error: "GitHub API rate limit exceeded" or "GraphQL query failed"
- [ ] GitHub Advisory Database vulnerabilities missing
- [ ] Metric `connector_sync_failures_total{source="ghsa"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | GitHub ecosystem vulnerabilities may be missed |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated for GitHub packages |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.ghsa-health
```
2. **Check GHSA sync status:**
```bash
stella admin feeds status --source ghsa
```
3. **Test GitHub API connectivity:**
```bash
stella connector test ghsa
```
### Deep diagnosis
1. **Check GitHub API rate limit:**
```bash
stella connector ghsa rate-limit-status
```
Problem if: Remaining = 0, rate limit exceeded
2. **Check GitHub token permissions:**
```bash
stella connector credentials show ghsa --check-scopes
```
Required scopes: `public_repo`, `read:packages` (for private advisory access)
3. **Check sync logs:**
```bash
stella connector logs ghsa --last 1h --level error
```
Look for: GraphQL errors, pagination issues, timeouts
4. **Check for GitHub API outage:**
```bash
stella connector ghsa api-status
```
Also check: https://www.githubstatus.com/
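Both the outage and the remaining quota can also be probed directly from the host, independent of the connector; the Statuspage API path and the placeholder token are assumptions outside this runbook's CLI:
```bash
# GitHub's public status API (standard Statuspage endpoint)
curl -s https://www.githubstatus.com/api/v2/status.json
# Authenticated rate-limit details straight from the GitHub REST API
curl -s -H "Authorization: Bearer <github-pat>" https://api.github.com/rate_limit
```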
---
## Resolution
### Immediate mitigation
1. **If rate limited, wait for reset:**
```bash
stella connector ghsa rate-limit-status
# Note the reset time, then:
stella admin feeds refresh --source ghsa
```
2. **Use secondary token if available:**
```bash
stella connector credentials rotate ghsa --to secondary
stella admin feeds refresh --source ghsa
```
3. **Load from offline bundle:**
```bash
stella offline load --source ghsa --package ghsa-bundle-latest.tar.gz
```
### Root cause fix
**If rate limit consistently exceeded:**
1. Increase sync interval:
```bash
stella connector config set ghsa.sync_interval 4h
```
2. Enable incremental sync:
```bash
stella connector config set ghsa.incremental_sync true
```
3. Use authenticated requests (raises the REST rate limit from 60 to 5,000 requests/hour):
```bash
stella connector credentials update ghsa --token <github-pat>
```
**If token expired or invalid:**
1. Generate new GitHub PAT at https://github.com/settings/tokens
2. Update token:
```bash
stella connector credentials update ghsa --token <new-token>
```
3. Verify scopes:
```bash
stella connector credentials show ghsa --check-scopes
```
**If GraphQL query failing:**
1. Check for API schema changes:
```bash
stella connector ghsa schema-check
```
2. Update connector if schema changed:
```bash
stella upgrade --component connector-ghsa
```
**If pagination broken:**
1. Reset sync cursor:
```bash
stella connector ghsa reset-cursor
```
2. Force full resync:
```bash
stella admin feeds refresh --source ghsa --full
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source ghsa
# Monitor sync progress
stella admin feeds status --source ghsa --watch
# Verify recent advisories present
stella vuln query GHSA-xxxx-xxxx-xxxx # Use a recent GHSA ID
# Check no errors
stella connector logs ghsa --level error --last 1h
```
---
## Prevention
- [ ] **Authentication:** Always use authenticated requests for 5000/hr rate limit
- [ ] **Monitoring:** Alert on last sync > 12h or sync failures
- [ ] **Redundancy:** Use NVD/OSV as backup for GitHub ecosystem coverage
- [ ] **Token rotation:** Rotate tokens before expiration
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/ghsa.md`
- **Related runbooks:** `connector-nvd.md`, `connector-osv.md`
- **GitHub API docs:** https://docs.github.com/en/graphql

View File

@@ -0,0 +1,195 @@
# Runbook: Feed Connector - NVD Connector Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / NVD Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.nvd-health` |
---
## Symptoms
- [ ] NVD feed sync failing or stale (> 24h since last successful sync)
- [ ] Alert `ConnectorNvdSyncFailed` firing
- [ ] Error: "NVD API request failed" or "rate limit exceeded"
- [ ] Vulnerability data missing or outdated
- [ ] Metric `connector_sync_failures_total{source="nvd"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Vulnerability scans may miss recent CVEs |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated (target: < 24h) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.nvd-health
```
2. **Check NVD sync status:**
```bash
stella admin feeds status --source nvd
```
Look for: Last sync time, error message, sync state
3. **Check NVD API connectivity:**
```bash
stella connector test nvd
```
### Deep diagnosis
1. **Check NVD API key status:**
```bash
stella connector credentials show nvd
```
Problem if: API key expired or rate limit exhausted
2. **Check NVD API rate limit:**
```bash
stella connector nvd rate-limit-status
```
Problem if: Remaining requests = 0, reset time in future
3. **Check for NVD API outage:**
```bash
stella connector nvd api-status
```
Also check: https://nvd.nist.gov/general/news (a direct API probe is sketched after this list)
4. **Check sync logs:**
```bash
stella connector logs nvd --last 1h --level error
```
Look for: HTTP status codes, timeout errors, parsing failures
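As a cross-check for step 3, the public NVD 2.0 REST API can be queried directly from the host; the `apiKey` header is optional and the key shown is a placeholder:
```bash
# Fetch a single CVE record to confirm the NVD API is responding
curl -s -H "apiKey: <nvd-api-key>" \
  "https://services.nvd.nist.gov/rest/json/cves/2.0?resultsPerPage=1"
```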
---
## Resolution
### Immediate mitigation
1. **If rate limited, wait for reset:**
```bash
stella connector nvd rate-limit-status
# Wait for reset time, then:
stella admin feeds refresh --source nvd
```
2. **If API key expired, use anonymous mode (slower):**
```bash
stella connector config set nvd.api_key_mode anonymous
stella admin feeds refresh --source nvd
```
3. **Load from offline bundle if urgent:**
```bash
# If you have a recent offline bundle:
stella offline load --source nvd --package nvd-bundle-latest.tar.gz
```
### Root cause fix
**If API key expired or invalid:**
1. Generate new NVD API key at https://nvd.nist.gov/developers/request-an-api-key
2. Update API key:
```bash
stella connector credentials update nvd --api-key <new-key>
```
3. Verify connectivity:
```bash
stella connector test nvd
```
**If rate limit consistently exceeded:**
1. Increase sync interval to reduce API calls:
```bash
stella connector config set nvd.sync_interval 6h
```
2. Enable delta sync to reduce data volume:
```bash
stella connector config set nvd.delta_sync true
```
3. Request higher rate limit from NVD (if available)
**If network/firewall issue:**
1. Verify outbound connectivity to NVD API:
```bash
stella connector test nvd --verbose
```
2. Check proxy configuration if required:
```bash
stella connector config set nvd.proxy https://proxy:8080
```
**If data parsing failures:**
1. Check for NVD schema changes:
```bash
stella connector nvd schema-check
```
2. Update connector if schema changed:
```bash
stella upgrade --component connector-nvd
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source nvd --force
# Monitor sync progress
stella admin feeds status --source nvd --watch
# Verify recent CVEs are present
stella vuln query CVE-2026-XXXX # Use a recent CVE ID
# Check no errors in recent logs
stella connector logs nvd --level error --last 1h
```
---
## Prevention
- [ ] **API Key:** Always use API key (not anonymous) for 10x rate limit
- [ ] **Monitoring:** Alert on last sync > 24h or sync failure
- [ ] **Redundancy:** Configure backup connector (OSV, GitHub Advisory) for overlap
- [ ] **Offline:** Maintain weekly offline bundle for disaster recovery
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/nvd.md`
- **Related runbooks:** `connector-ghsa.md`, `connector-osv.md`
- **Dashboard:** Grafana > Stella Ops > Feed Connectors

View File

@@ -0,0 +1,193 @@
# Runbook: Feed Connector - OSV (Open Source Vulnerabilities) Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Concelier / OSV Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.connector.osv-health` |
---
## Symptoms
- [ ] OSV feed sync failing or stale
- [ ] Alert `ConnectorOsvSyncFailed` firing
- [ ] Error: "OSV API request failed" or "ecosystem sync failed"
- [ ] OSV vulnerabilities missing from database
- [ ] Metric `connector_sync_failures_total{source="osv"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Open source ecosystem vulnerabilities may be missed |
| **Data integrity** | Data becomes stale; no data loss |
| **SLA impact** | Vulnerability currency SLO violated for affected ecosystems |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.osv-health
```
2. **Check OSV sync status:**
```bash
stella admin feeds status --source osv
```
3. **Test OSV API connectivity:**
```bash
stella connector test osv
```
### Deep diagnosis
1. **Check ecosystem-specific status:**
```bash
stella connector osv ecosystems status
```
Look for: Failed ecosystems, stale ecosystems
2. **Check sync logs:**
```bash
stella connector logs osv --last 1h --level error
```
Look for: API errors, parsing failures, timeouts
3. **Check for OSV API outage:**
```bash
stella connector osv api-status
```
Also check: https://osv.dev/ (a direct API probe is sketched after this list)
4. **Check GCS bucket access (OSV uses GCS for bulk data):**
```bash
stella connector osv gcs-status
```
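As a cross-check for steps 3 and 4, the public OSV API and the bulk-export bucket can be probed directly; the package, version, and bucket path below follow OSV's published layout but should be treated as examples:
```bash
# Query the OSV API for a known package/version
curl -s -X POST https://api.osv.dev/v1/query \
  -d '{"package": {"name": "lodash", "ecosystem": "npm"}, "version": "4.17.20"}'
# Confirm the GCS bulk-export bucket is reachable
curl -sI https://storage.googleapis.com/osv-vulnerabilities/npm/all.zip
```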
---
## Resolution
### Immediate mitigation
1. **Retry sync for specific ecosystem:**
```bash
stella admin feeds refresh --source osv --ecosystem npm
```
2. **Sync from GCS bucket directly (faster for bulk):**
```bash
stella connector osv sync-from-gcs
```
3. **Load from offline bundle:**
```bash
stella offline load --source osv --package osv-bundle-latest.tar.gz
```
### Root cause fix
**If API request failing:**
1. Check API endpoint:
```bash
stella connector osv api-test
```
2. Verify no proxy blocking:
```bash
stella connector config set osv.proxy <proxy-url>
```
**If GCS access failing:**
1. Check GCS connectivity:
```bash
stella connector osv gcs-test
```
2. Enable anonymous access (default):
```bash
stella connector config set osv.gcs_auth anonymous
```
3. Or configure service account:
```bash
stella connector config set osv.gcs_credentials /path/to/sa-key.json
```
**If specific ecosystem failing:**
1. Disable problematic ecosystem temporarily:
```bash
stella connector config set osv.ecosystems.disabled <ecosystem>
```
2. Check ecosystem data format:
```bash
stella connector osv ecosystem-check <ecosystem>
```
**If parsing errors:**
1. Check for schema changes:
```bash
stella connector osv schema-check
```
2. Update connector:
```bash
stella upgrade --component connector-osv
```
### Verification
```bash
# Force sync
stella admin feeds refresh --source osv
# Monitor sync progress
stella admin feeds status --source osv --watch
# Verify ecosystem coverage
stella connector osv ecosystems status
# Query recent vulnerability
stella vuln query OSV-2026-xxxx
# Check no errors
stella connector logs osv --level error --last 1h
```
---
## Prevention
- [ ] **Bulk sync:** Use GCS bulk sync for initial load and daily updates
- [ ] **Monitoring:** Alert on ecosystem sync failures
- [ ] **Redundancy:** NVD/GHSA provide overlapping coverage for major ecosystems
- [ ] **Offline:** Maintain weekly offline bundle
---
## Related Resources
- **Architecture:** `docs/modules/concelier/connectors.md`
- **Connector config:** `docs/modules/concelier/operations/connectors/osv.md`
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`
- **OSV API docs:** https://osv.dev/docs/

View File

@@ -0,0 +1,220 @@
# Runbook Template: Feed Connector - Vendor-Specific Connectors
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-006 - Feed Connector Runbooks
## Overview
This template covers vendor-specific advisory feed connectors (Red Hat, Ubuntu, Debian, Oracle, VMware, etc.). Adapt it when creating a runbook for a specific vendor connector.
---
## Metadata Template
| Field | Value |
|-------|-------|
| **Component** | Concelier / [Vendor] Connector |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | [Date] |
| **Doctor check** | `check.connector.[vendor]-health` |
---
## Common Vendor Connector Issues
### Authentication Failures
**Symptoms:**
- Sync failing with 401/403 errors
- "authentication failed" or "invalid credentials"
**Resolution:**
```bash
# Check credentials
stella connector credentials show <vendor>
# Update credentials
stella connector credentials update <vendor> --api-key <key>
# Test connectivity
stella connector test <vendor>
```
### Rate Limiting
**Symptoms:**
- Sync failing with 429 errors
- "rate limit exceeded"
**Resolution:**
```bash
# Check rate limit status
stella connector <vendor> rate-limit-status
# Increase sync interval
stella connector config set <vendor>.sync_interval 6h
# Enable delta sync
stella connector config set <vendor>.delta_sync true
```
### Data Format Changes
**Symptoms:**
- Parsing errors in sync logs
- "unexpected format" or "schema validation failed"
**Resolution:**
```bash
# Check for schema changes
stella connector <vendor> schema-check
# Update connector
stella upgrade --component connector-<vendor>
```
### Offline Bundle Refresh
**Resolution:**
```bash
# Create offline bundle
stella offline sync --feeds <vendor> --output <vendor>-bundle.tar.gz
# Load offline bundle
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
```
---
## Vendor-Specific Runbooks
Use this template to create runbooks for:
### Red Hat Security Data
**Endpoint:** https://access.redhat.com/security/data/
**Authentication:** API token or certificate
**Connector:** `connector-redhat`
Key commands:
```bash
stella connector test redhat
stella admin feeds status --source redhat
stella connector redhat cve-map-status # RHSA to CVE mapping
```
### Ubuntu Security Notices
**Endpoint:** https://ubuntu.com/security/notices
**Authentication:** None (public)
**Connector:** `connector-ubuntu`
Key commands:
```bash
stella connector test ubuntu
stella admin feeds status --source ubuntu
stella connector ubuntu usn-status # USN sync status
```
### Debian Security Tracker
**Endpoint:** https://security-tracker.debian.org/
**Authentication:** None (public)
**Connector:** `connector-debian`
Key commands:
```bash
stella connector test debian
stella admin feeds status --source debian
stella connector debian dla-status # DLA sync status
```
### Oracle Security Alerts
**Endpoint:** https://www.oracle.com/security-alerts/
**Authentication:** Oracle account (optional)
**Connector:** `connector-oracle`
Key commands:
```bash
stella connector test oracle
stella admin feeds status --source oracle
stella connector oracle cpu-status # Critical Patch Update status
```
### VMware Security Advisories
**Endpoint:** https://www.vmware.com/security/advisories
**Authentication:** None (public)
**Connector:** `connector-vmware`
Key commands:
```bash
stella connector test vmware
stella admin feeds status --source vmware
stella connector vmware vmsa-status # VMSA sync status
```
---
## Diagnosis Checklist
For any vendor connector issue:
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.connector.<vendor>-health
```
2. **Check sync status:**
```bash
stella admin feeds status --source <vendor>
```
3. **Test connectivity:**
```bash
stella connector test <vendor>
```
4. **Check logs:**
```bash
stella connector logs <vendor> --last 1h --level error
```
5. **Check credentials (if applicable):**
```bash
stella connector credentials show <vendor>
```
---
## Resolution Checklist
1. **Retry sync:**
```bash
stella admin feeds refresh --source <vendor>
```
2. **Update credentials (if auth issue):**
```bash
stella connector credentials update <vendor>
```
3. **Update connector (if format changed):**
```bash
stella upgrade --component connector-<vendor>
```
4. **Load offline bundle (if API unavailable):**
```bash
stella offline load --source <vendor> --package <vendor>-bundle.tar.gz
```
---
## Related Resources
- **Connector architecture:** `docs/modules/concelier/connectors.md`
- **Vendor connector configs:** `docs/modules/concelier/operations/connectors/`
- **Related runbooks:** `connector-nvd.md`, `connector-ghsa.md`, `connector-osv.md`

View File

@@ -0,0 +1,370 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-002 - Crypto Subsystem Runbook
# Regional Crypto Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Cryptographic subsystem operations including HSM management, regional crypto profile configuration, key rotation, and certificate management for all supported crypto profiles (International, FIPS, eIDAS, GOST, SM).
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check crypto subsystem health
stella doctor --category crypto
# Verify active crypto profile
stella crypto profile show
# List loaded crypto providers
stella crypto providers list
# Check key status
stella crypto keys status
```
### Metrics to Watch
- `stella_crypto_operations_total` - Crypto operation count by type
- `stella_crypto_operation_duration_seconds` - Signing/verification latency
- `stella_hsm_availability` - HSM availability (if configured)
- `stella_cert_expiry_days` - Certificate expiration countdown
---
## Regional Crypto Profiles
### Profile Overview
| Profile | Use Case | Key Algorithms | Compliance |
|---------|----------|----------------|------------|
| `international` | Default, most deployments | RSA-2048+, ECDSA P-256/P-384, Ed25519 | General |
| `fips` | US Government / FedRAMP | FIPS 140-2 approved algorithms only | FIPS 140-2 |
| `eidas` | European Union | RSA-PSS, ECDSA, Ed25519 per ETSI TS 119 312 | eIDAS |
| `gost` | Russian Federation | GOST R 34.10-2012, GOST R 34.11-2012 | Russian standards |
| `sm` | China | SM2, SM3, SM4 | GM/T 0003-2012 |
### Switching Profiles
1. **Pre-switch verification:**
```bash
# Verify target profile is available
stella crypto profile verify --profile <target-profile>
# Check for incompatible existing signatures
stella crypto audit --check-compatibility --target-profile <target-profile>
```
2. **Profile switch:**
```bash
# Switch profile (requires service restart)
stella crypto profile set --profile <target-profile>
# Restart services to apply
stella service restart --graceful
```
3. **Post-switch verification:**
```bash
stella doctor --check check.crypto.fips,check.crypto.eidas,check.crypto.gost,check.crypto.sm
```
---
## Standard Procedures
### SP-001: Key Rotation
**Frequency:** Quarterly or per policy
**Duration:** ~15 minutes (no downtime)
1. Generate new key:
```bash
# For software keys
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name signing-$(date +%Y%m)
# For HSM-backed keys
stella crypto keys generate --type signing --algorithm ecdsa-p256 --provider hsm --name signing-$(date +%Y%m)
```
2. Activate new key:
```bash
stella crypto keys activate --name signing-$(date +%Y%m)
```
3. Verify signing with new key:
```bash
echo "test" | stella crypto sign --output /dev/null
```
4. Schedule old key deactivation:
```bash
stella crypto keys schedule-deactivation --name <old-key-name> --in 30d
```
### SP-002: Certificate Renewal
**When:** Certificate expiring within 30 days
1. Check expiration:
```bash
stella crypto certs check-expiry
```
2. Generate CSR:
```bash
stella crypto certs csr --subject "CN=stellaops.example.com,O=Example Corp" --output cert.csr
```
3. Install renewed certificate:
```bash
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
```
4. Verify certificate chain:
```bash
stella doctor --check check.crypto.certchain
```
5. Restart services:
```bash
stella service restart --graceful
```
### SP-003: HSM Health Check
**Frequency:** Daily (automated) or on-demand
1. Check HSM connectivity:
```bash
stella crypto hsm status
```
2. Verify slot access:
```bash
stella crypto hsm slots list
```
3. Test signing operation:
```bash
stella crypto hsm test-sign
```
4. Check HSM metrics:
- Free objects/sessions
- Temperature/health (vendor-specific)
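For step 4, when the deployment's PKCS#11 module path is known, `pkcs11-tool` from OpenSC can report slot and token details independently of Stella; the module path below is an example:
```bash
# Slot and token details via the PKCS#11 module (module path is an example)
pkcs11-tool --module /usr/lib/pkcs11/libvendor-hsm.so --list-slots
pkcs11-tool --module /usr/lib/pkcs11/libvendor-hsm.so --show-info
```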
---
## Incident Procedures
### INC-001: HSM Unavailable
**Symptoms:**
- Alert: `StellaHsmUnavailable`
- Signing operations failing with "HSM connection error"
**Investigation:**
```bash
# Check HSM status
stella crypto hsm status
# Test PKCS#11 module
stella crypto hsm test-module
# Check network to HSM
stella network test --host <hsm-host> --port <hsm-port>
```
**Resolution:**
1. **Network issue:**
- Verify network path to HSM
- Check firewall rules
- Verify HSM appliance is powered on
2. **Session exhaustion:**
```bash
# Release stale sessions
stella crypto hsm sessions release --stale
# Restart crypto service
stella service restart --service crypto-signer
```
3. **HSM failure:**
- Fail over to secondary HSM (if configured)
- Contact HSM vendor support
- Consider temporary fallback to software keys (with approval)
### INC-002: Signing Key Compromised
**CRITICAL - Follow incident response procedure**
1. **Immediate containment:**
```bash
# Revoke compromised key
stella crypto keys revoke --name <compromised-key> --reason compromise
# Block signing with compromised key
stella crypto keys block --name <compromised-key>
```
2. **Generate replacement key:**
```bash
stella crypto keys generate --type signing --algorithm ecdsa-p256 --name emergency-signing
stella crypto keys activate --name emergency-signing
```
3. **Notify downstream:**
- Update trust registries with new key
- Notify relying parties
- Publish key revocation notice
4. **Forensics:**
```bash
# Export key usage audit log
stella crypto audit export --key <compromised-key> --output /secure/key-audit.json
```
### INC-003: Certificate Expired
**Symptoms:**
- TLS connection failures
- Alert: `StellaCertExpired`
**Immediate Resolution:**
1. If renewed certificate is available:
```bash
stella crypto certs install --cert renewed-cert.pem --chain ca-chain.pem
stella service restart --graceful
```
2. If renewal not ready - emergency self-signed (temporary):
```bash
# Generate emergency self-signed certificate (temporary stopgap; replace with a CA-issued certificate as soon as possible)
stella crypto certs generate-self-signed --days 7 --name emergency
stella crypto certs install --cert emergency.pem
stella service restart --graceful
```
3. Expedite certificate renewal process
### INC-004: FIPS Mode Not Enabled
**Symptoms:**
- Alert: `StellaFipsNotEnabled`
- Compliance audit failure
**Resolution:**
1. **Linux:**
```bash
# Enable FIPS mode
sudo fips-mode-setup --enable
# Reboot required
sudo reboot
# Verify after reboot
fips-mode-setup --check
```
2. **Windows:**
- Enable via Group Policy
- Or via registry:
```powershell
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy" -Name "Enabled" -Value 1
Restart-Computer
```
3. Restart Stella services:
```bash
stella service restart
stella doctor --check check.crypto.fips
```
---
## Regional-Specific Procedures
### GOST Configuration (Russian Federation)
1. Install GOST engine:
```bash
sudo apt install libengine-gost-openssl1.1
```
2. Configure Stella:
```bash
stella crypto profile set --profile gost
stella crypto config set --gost-engine-path /usr/lib/x86_64-linux-gnu/engines-1.1/gost.so
```
3. Verify:
```bash
stella doctor --check check.crypto.gost
```
### SM Configuration (China)
1. Ensure OpenSSL 1.1.1+ with SM support:
```bash
openssl version
openssl list -cipher-algorithms | grep -i sm
```
2. Configure Stella:
```bash
stella crypto profile set --profile sm
```
3. Verify:
```bash
stella doctor --check check.crypto.sm
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Crypto Subsystem
Key panels:
- Signing operation latency
- Key usage by key ID
- HSM availability
- Certificate expiration countdown
- Crypto profile in use
---
## Evidence Capture
```bash
# Comprehensive crypto diagnostics
stella crypto diagnostics --output /tmp/crypto-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Active crypto profile
- Key inventory (public keys only)
- Certificate chain
- HSM status
- Operation audit log (last 24h)
---
## Escalation Path
1. **L1 (On-call):** Certificate installs, key activation
2. **L2 (Security team):** Key rotation, HSM issues
3. **L3 (Crypto SME):** Algorithm issues, compliance questions
4. **HSM Vendor:** Hardware failures
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,408 @@
# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-003 - Evidence Locker Runbook
# Evidence Locker Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
Evidence locker operations including storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery procedures.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check evidence locker health
stella doctor --category evidence
# Verify storage accessibility
stella evidence status
# Check index health
stella evidence index status
# Verify anchor chain
stella evidence anchor verify --latest
```
### Metrics to Watch
- `stella_evidence_artifacts_total` - Total artifacts stored
- `stella_evidence_retrieval_latency_seconds` - Retrieval latency P99
- `stella_evidence_storage_bytes` - Storage consumption
- `stella_merkle_anchor_age_seconds` - Time since last anchor
---
## Standard Procedures
### SP-001: Daily Integrity Check
**Frequency:** Daily (automated) or on-demand
**Duration:** Varies by locker size (typically 5-30 minutes)
1. Run integrity verification:
```bash
# Quick check (sample-based)
stella evidence verify --mode quick
# Full check (all artifacts)
stella evidence verify --mode full
```
2. Review results:
```bash
stella evidence verify-report --latest
```
3. Address any failures:
```bash
# List failed artifacts
stella evidence verify-report --latest --filter failed
```
### SP-002: Index Maintenance
**Frequency:** Weekly or after large ingestion
**Duration:** ~10 minutes
1. Check index health:
```bash
stella evidence index status
```
2. Refresh index if needed:
```bash
# Incremental refresh
stella evidence index refresh
# Full rebuild (if corruption suspected)
stella evidence index rebuild
```
3. Optimize index:
```bash
stella evidence index optimize
```
### SP-003: Merkle Anchoring
**Frequency:** Per policy (default: every 6 hours)
**Duration:** ~2 minutes
1. Create new anchor:
```bash
stella evidence anchor create
```
2. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
3. Export anchor for external archival:
```bash
stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
```
### SP-004: Storage Cleanup
**Frequency:** Monthly or when storage alerts trigger
**Duration:** Varies
1. Review storage usage:
```bash
stella evidence storage stats
```
2. Apply retention policy:
```bash
# Dry run first
stella evidence cleanup --apply-retention --dry-run
# Execute cleanup
stella evidence cleanup --apply-retention
```
3. Archive old evidence (if required):
```bash
stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y).tar
```
---
## Incident Procedures
### INC-001: Integrity Verification Failure
**Symptoms:**
- Alert: `StellaEvidenceIntegrityFailure`
- Verification reports hash mismatch
**Investigation:**
```bash
# Get failure details
stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
# Check specific artifact
stella evidence inspect <artifact-id>
# Check provenance
stella evidence provenance show <artifact-id>
```
**Resolution:**
1. **Isolated corruption:**
```bash
# Attempt recovery from replica (if available)
stella evidence recover --id <artifact-id> --source replica
# If no replica, mark as corrupted
stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
```
2. **Widespread corruption:**
- Stop evidence ingestion
- Identify corruption extent
- Restore from backup if necessary
- Escalate to L3
3. **False positive (software bug):**
- Verify with multiple hash implementations
- Check for recent software updates
- Report bug if confirmed
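For the multi-implementation hash check, standard tools can recompute the digest of the stored blob and compare it with what the locker reports; the on-disk blob path is deployment-specific and shown only as an example:
```bash
# Digest as reported by the locker
stella evidence inspect <artifact-id>
# Recompute with two independent implementations (blob path is an example)
sha256sum /var/lib/stellaops/evidence/blobs/<artifact-id>
openssl dgst -sha256 /var/lib/stellaops/evidence/blobs/<artifact-id>
```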
### INC-002: Evidence Retrieval Failure
**Symptoms:**
- Alert: `StellaEvidenceRetrievalFailed`
- API returning 404 for known artifacts
**Investigation:**
```bash
# Check if artifact exists
stella evidence exists <artifact-id>
# Check index
stella evidence index lookup <artifact-id>
# Check storage backend
stella evidence storage check <artifact-id>
```
**Resolution:**
1. **Index corruption:**
```bash
# Rebuild index
stella evidence index rebuild
```
2. **Storage backend issue:**
```bash
# Check storage health
stella doctor --check check.storage.evidencelocker
# Verify storage connectivity
stella evidence storage test
```
3. **File system issue:**
- Check disk health
- Verify file permissions
- Check mount status
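These host-level commands cover the file-system checks; the mount point is an example and should be replaced with the locker's configured path:
```bash
# Mount status and filesystem type for the evidence path (path is an example)
findmnt /var/lib/stellaops/evidence
# Ownership and permissions on the evidence directory
ls -ld /var/lib/stellaops/evidence
# Recent kernel-level I/O or filesystem errors
dmesg | grep -iE 'i/o error|ext4|xfs' | tail -20
```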
### INC-003: Anchor Chain Break
**Symptoms:**
- Alert: `StellaMerkleAnchorChainBroken`
- Anchor verification fails
**Investigation:**
```bash
# Check anchor chain
stella evidence anchor verify --all --verbose
# Find break point
stella evidence anchor list --show-links
# Inspect specific anchor
stella evidence anchor inspect <anchor-id>
```
**Resolution:**
1. **Single broken link:**
```bash
# Attempt to recover from backup
stella evidence anchor recover --id <anchor-id> --source backup
```
2. **Multiple breaks:**
- Stop new anchoring
- Assess extent of damage
- Restore from backup or rebuild chain
3. **Create new chain segment:**
```bash
# Start new chain (preserves old chain as archived)
stella evidence anchor new-chain --reason "chain-break-recovery"
```
### INC-004: Storage Full
**Symptoms:**
- Alert: `StellaEvidenceStorageFull`
- Ingestion failing
**Immediate Actions:**
```bash
# Check storage usage
stella evidence storage stats
# Emergency cleanup of temporary files
stella evidence cleanup --temp-only
# Find large/old artifacts
stella evidence storage analyze --sort size --limit 20
```
**Resolution:**
1. **Apply retention policy:**
```bash
stella evidence cleanup --apply-retention --aggressive
```
2. **Archive old evidence:**
```bash
stella evidence archive --older-than 180d --compress
```
3. **Expand storage:**
- Follow cloud provider procedure
- Or add additional storage volume
---
## Disaster Recovery
### DR-001: Full Evidence Locker Recovery
**Prerequisites:**
- Backup available
- Target storage provisioned
- Recovery environment ready
**Procedure:**
1. Provision new storage:
```bash
stella evidence storage provision --size <size>
```
2. Restore from backup:
```bash
# List available backups
stella backup list --type evidence-locker
# Restore
stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
```
3. Verify restoration:
```bash
stella evidence verify --mode full
stella evidence anchor verify --all
```
4. Update service configuration:
```bash
stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
stella service restart
```
### DR-002: Point-in-Time Recovery
For recovering to a specific point in time:
1. Identify target anchor:
```bash
stella evidence anchor list --before <timestamp>
```
2. Restore to that point:
```bash
stella evidence restore --to-anchor <anchor-id>
```
3. Verify integrity:
```bash
stella evidence verify --mode full --to-anchor <anchor-id>
```
---
## Offline Mode Operations
### Preparing Offline Evidence Pack
```bash
# Export evidence for specific artifact
stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
# Export with all dependencies
stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
```
### Verifying Evidence Offline
```bash
# Verify evidence pack without network
stella evidence verify --offline --input evidence-pack.tar.gz
# Replay verdict using evidence
stella replay --evidence evidence-pack.tar.gz --output verdict.json
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Evidence Locker
Key panels:
- Artifact ingestion rate
- Retrieval latency
- Storage utilization trend
- Integrity check status
- Anchor chain health
---
## Evidence Capture
For any incident:
```bash
stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Index status
- Storage stats
- Recent anchor chain
- Integrity check results
- Operation audit log
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, cleanup operations
2. **L2 (Platform team):** Index rebuild, anchor issues
3. **L3 (Architecture):** Chain recovery, DR procedures
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,183 @@
# Runbook: Release Orchestrator - Required Evidence Not Found
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.evidence-availability` |
---
## Symptoms
- [ ] Promotion failing with "required evidence not found"
- [ ] Alert `OrchestratorEvidenceMissing` firing
- [ ] Gate evaluation blocked waiting for evidence
- [ ] Error: "SBOM not found" or "attestation missing"
- [ ] Evidence chain incomplete for artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Promotion blocked until evidence is generated |
| **Data integrity** | Indicates missing security artifact - must be resolved |
| **SLA impact** | Release blocked; compliance requirements not met |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.evidence-availability
```
2. **List missing evidence for promotion:**
```bash
stella promotion evidence <promotion-id> --missing
```
3. **Check what evidence exists for artifact:**
```bash
stella evidence list --artifact <digest>
```
### Deep diagnosis
1. **Check evidence chain completeness:**
```bash
stella evidence chain --artifact <digest> --verbose
```
Look for: Missing nodes in the chain
2. **Check if scan completed:**
```bash
stella scanner jobs list --artifact <digest>
```
Problem if: No completed scan or scan failed
3. **Check if attestation was created:**
```bash
stella attest list --subject <digest>
```
Problem if: No attestation or attestation failed
4. **Check evidence store health:**
```bash
stella evidence store health
```
---
## Resolution
### Immediate mitigation
1. **Generate missing SBOM:**
```bash
stella scan image --image <image-ref> --sbom-only
```
2. **Generate missing attestation:**
```bash
stella attest create --subject <digest> --type slsa-provenance
```
3. **Re-scan artifact to regenerate all evidence:**
```bash
stella scan image --image <image-ref> --force
```
### Root cause fix
**If scan never ran:**
1. Check why artifact wasn't scanned:
```bash
stella scanner queue list --artifact <digest>
```
2. Configure automatic scanning on push:
```bash
stella scanner config set auto_scan.enabled true
stella scanner config set auto_scan.triggers "push,promote"
```
**If evidence was generated but not stored:**
1. Check evidence store connectivity:
```bash
stella evidence store health
```
2. Retry evidence storage:
```bash
stella evidence retry-store --artifact <digest>
```
**If attestation signing failed:**
1. Check attestor status:
```bash
stella attest status
```
2. See `attestor-signing-failed.md` runbook
**If evidence expired or was deleted:**
1. Check evidence retention policy:
```bash
stella evidence policy show
```
2. Regenerate evidence:
```bash
stella scan image --image <image-ref> --force
stella attest create --subject <digest> --type slsa-provenance
```
### Verification
```bash
# Check all evidence now exists
stella evidence list --artifact <digest>
# Verify evidence chain is complete
stella evidence chain --artifact <digest>
# Retry promotion
stella promotion retry <promotion-id>
# Verify promotion proceeds
stella promotion status <promotion-id>
```
---
## Prevention
- [ ] **Auto-scan:** Enable automatic scanning for all pushed images
- [ ] **Gates:** Configure evidence requirements clearly in promotion policy
- [ ] **Monitoring:** Alert on evidence generation failures
- [ ] **Retention:** Set appropriate evidence retention periods
---
## Related Resources
- **Architecture:** `docs/modules/evidence-locker/architecture.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `attestor-signing-failed.md`
- **Evidence requirements:** `docs/operations/evidence-requirements.md`

View File

@@ -0,0 +1,178 @@
# Runbook: Release Orchestrator - Gate Evaluation Timeout
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.gate-timeout` |
---
## Symptoms
- [ ] Promotion gates timing out before completing evaluation
- [ ] Alert `OrchestratorGateTimeout` firing
- [ ] Error: "gate evaluation timeout exceeded"
- [ ] Promotion stuck waiting for gate response
- [ ] Metric `orchestrator_gate_timeout_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Promotions delayed or blocked; release pipeline stalled |
| **Data integrity** | No data loss; promotion can be retried |
| **SLA impact** | Release SLO violated if timeout persists |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.gate-timeout
```
2. **Identify timed-out gates:**
```bash
stella promotion gates <promotion-id> --status timeout
```
3. **Check gate service health:**
```bash
stella orch gate-services status
```
### Deep diagnosis
1. **Check specific gate latency:**
```bash
stella orch gate stats --gate <gate-name> --last 1h
```
Look for: P95 latency, timeout rate
2. **Check external service connectivity:**
```bash
stella orch connectivity --gate <gate-name>
```
3. **Check gate evaluation logs:**
```bash
stella orch logs --gate <gate-name> --promotion <promotion-id>
```
Look for: Slow queries, external API delays
4. **Check policy engine latency (for policy gates):**
```bash
stella policy stats --last 10m
```
---
## Resolution
### Immediate mitigation
1. **Increase timeout for specific gate:**
```bash
stella orch config set gates.<gate-name>.timeout 5m
stella orch reload
```
2. **Skip the timed-out gate (requires approval):**
```bash
stella promotion gate skip <promotion-id> <gate-name> \
--reason "External service timeout - approved by <approver>"
```
3. **Retry the promotion:**
```bash
stella promotion retry <promotion-id>
```
### Root cause fix
**If external service is slow:**
1. Configure gate retry with backoff:
```bash
stella orch config set gates.<gate-name>.retries 3
stella orch config set gates.<gate-name>.retry_backoff 5s
```
2. Enable gate result caching:
```bash
stella orch config set gates.<gate-name>.cache_ttl 5m
```
3. Configure circuit breaker:
```bash
stella orch config set gates.<gate-name>.circuit_breaker.enabled true
stella orch config set gates.<gate-name>.circuit_breaker.threshold 5
```
**If policy evaluation is slow:**
1. Optimize policy (see `policy-evaluation-slow.md` runbook)
2. Increase policy worker count:
```bash
stella policy config set opa.workers 4
```
**If evidence retrieval is slow:**
1. Enable evidence pre-fetching:
```bash
stella orch config set gates.evidence_prefetch true
```
2. Increase evidence cache:
```bash
stella orch config set evidence.cache_size 1000
stella orch config set evidence.cache_ttl 10m
```
### Verification
```bash
# Retry promotion
stella promotion retry <promotion-id>
# Monitor gate evaluation
stella promotion gates <promotion-id> --watch
# Check gate latency improved
stella orch gate stats --gate <gate-name> --last 10m
# Verify no timeouts
stella orch logs --filter "timeout" --last 30m
```
---
## Prevention
- [ ] **Timeouts:** Set appropriate timeouts based on gate SLAs (default: 2m)
- [ ] **Monitoring:** Alert on gate P95 latency > 1m
- [ ] **Caching:** Enable caching for slow gates
- [ ] **Circuit breakers:** Enable circuit breakers for external service gates
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/gates.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `policy-evaluation-slow.md`
- **Dashboard:** Grafana > Stella Ops > Gate Latency

View File

@@ -0,0 +1,168 @@
# Runbook: Release Orchestrator - Promotion Job Not Progressing
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.job-health` |
---
## Symptoms
- [ ] Promotion job stuck in "in_progress" state for >10 minutes
- [ ] No progress updates in promotion timeline
- [ ] Alert `OrchestratorPromotionStuck` firing
- [ ] UI shows promotion spinner indefinitely
- [ ] Downstream environment not receiving promoted artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Release blocked, cannot promote to target environment |
| **Data integrity** | Artifact is safe; promotion can be retried |
| **SLA impact** | Release SLO violated if not resolved within 30 minutes |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.job-health
```
2. **Check promotion status:**
```bash
stella promotion status <promotion-id>
```
Look for: Current step, last update time, any error messages
3. **Check orchestrator service:**
```bash
stella orch status
```
### Deep diagnosis
1. **Get detailed promotion trace:**
```bash
stella promotion trace <promotion-id> --verbose
```
Look for: Which step is stuck, any timeouts
2. **Check gate evaluation status:**
```bash
stella promotion gates <promotion-id>
```
Problem if: Gate stuck waiting for external service
3. **Check target environment connectivity:**
```bash
stella orch connectivity --target <env-name>
```
4. **Check for lock contention:**
```bash
stella orch locks list
```
Problem if: Stale locks on the artifact or environment
---
## Resolution
### Immediate mitigation
1. **If gate is stuck waiting for external service:**
```bash
# Skip the stuck gate (requires approval)
stella promotion gate skip <promotion-id> <gate-name> --reason "External service timeout"
```
2. **If lock is stale:**
```bash
# Release the lock (use with caution)
stella orch locks release <lock-id> --force
```
3. **If orchestrator is unresponsive:**
```bash
stella service restart orchestrator
```
### Root cause fix
**If external gate service is slow:**
1. Increase gate timeout:
```bash
stella orch config set gates.<gate-name>.timeout 5m
```
2. Configure gate retry:
```bash
stella orch config set gates.<gate-name>.retries 3
```
**If target environment is unreachable:**
1. Check network connectivity to target
2. Verify credentials for target environment:
```bash
stella orch credentials verify --target <env-name>
```
**If database lock contention:**
1. Increase lock timeout:
```bash
stella orch config set locks.timeout 60s
```
2. Enable optimistic locking:
```bash
stella orch config set locks.mode optimistic
```
### Verification
```bash
# Check promotion completed
stella promotion status <promotion-id>
# Verify artifact in target environment
stella orch artifacts list --env <target-env> --filter <artifact-digest>
# Check no stuck promotions
stella promotion list --status in_progress --older-than 5m
```
---
## Prevention
- [ ] **Timeouts:** Configure appropriate timeouts for all gates
- [ ] **Monitoring:** Alert on promotions stuck > 10 minutes
- [ ] **Health checks:** Enable connectivity pre-checks before promotion
- [ ] **Documentation:** Document SLAs for external gate services
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/architecture.md`
- **Related runbooks:** `orchestrator-gate-timeout.md`, `orchestrator-evidence-missing.md`
- **Dashboard:** Grafana > Stella Ops > Release Orchestrator

View File

@@ -0,0 +1,189 @@
# Runbook: Release Orchestrator - Promotion Quota Exhausted
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Medium |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.quota-status` |
---
## Symptoms
- [ ] Promotions failing with "quota exceeded"
- [ ] Alert `OrchestratorQuotaExceeded` firing
- [ ] Error: "promotion rate limit reached" or "daily quota exhausted"
- [ ] New promotions being rejected
- [ ] Queued promotions not processing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New releases blocked until quota resets or increases |
| **Data integrity** | No data loss; promotions queued for later |
| **SLA impact** | Release frequency SLO may be violated |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.quota-status
```
2. **Check current quota usage:**
```bash
stella orch quota status
```
3. **Check quota limits:**
```bash
stella orch quota limits show
```
### Deep diagnosis
1. **Check promotion history:**
```bash
stella promotion list --last 24h --count
```
Look for: Unusual spike in promotions
2. **Check per-environment quotas:**
```bash
stella orch quota status --by-environment
```
3. **Check for runaway automation:**
```bash
stella promotion list --last 1h --by-actor
```
Problem if: Single actor/service making many promotions
4. **Check when quota resets:**
```bash
stella orch quota reset-time
```
---
## Resolution
### Immediate mitigation
1. **Request temporary quota increase:**
```bash
stella orch quota request-increase --amount 50 --reason "Release deadline"
```
2. **Prioritize critical promotions:**
```bash
stella promotion priority set <promotion-id> high
```
3. **Cancel unnecessary queued promotions:**
```bash
stella promotion list --status queued
stella promotion cancel <promotion-id>
```
### Root cause fix
**If legitimate high volume:**
1. Increase quota limits:
```bash
stella orch quota limits set --daily 200 --hourly 50
```
2. Increase per-environment limits:
```bash
stella orch quota limits set --env production --daily 50
```
**If runaway automation:**
1. Identify the source:
```bash
stella promotion list --last 1h --by-actor --verbose
```
2. Revoke or rate-limit the service account:
```bash
stella auth rate-limit set <service-account> --promotions-per-hour 10
```
3. Fix the automation bug
**If promotion retries causing spike:**
1. Check for failing promotions causing retries:
```bash
stella promotion list --status failed --last 24h
```
2. Fix underlying promotion failures (see other runbooks)
3. Configure retry limits:
```bash
stella orch config set promotion.max_retries 3
stella orch config set promotion.retry_backoff 5m
```
**If quota too restrictive for workload:**
1. Analyze actual promotion patterns:
```bash
stella orch quota analyze --last 30d
```
2. Adjust quotas based on analysis:
```bash
stella orch quota limits set --daily <recommended>
```
### Verification
```bash
# Check quota status
stella orch quota status
# Verify promotions processing
stella promotion list --status in_progress
# Test new promotion
stella promotion create --test --dry-run
# Check no quota errors
stella orch logs --filter "quota" --level error --last 30m
```
---
## Prevention
- [ ] **Monitoring:** Alert at 80% quota usage
- [ ] **Limits:** Set appropriate quotas based on team size and release frequency
- [ ] **Automation:** Implement rate limiting in CI/CD pipelines
- [ ] **Review:** Regularly review and adjust quotas based on usage patterns
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/quotas.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`
- **Quota management:** `docs/operations/quota-management.md`

View File

@@ -0,0 +1,189 @@
# Runbook: Release Orchestrator - Rollback Operation Failed
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-004 - Release Orchestrator Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Release Orchestrator |
| **Severity** | Critical |
| **On-call scope** | Platform team, Release team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.orchestrator.rollback-health` |
---
## Symptoms
- [ ] Rollback operation failing or stuck
- [ ] Alert `OrchestratorRollbackFailed` firing
- [ ] Error: "rollback failed" or "cannot restore previous version"
- [ ] Target environment in inconsistent state
- [ ] Previous artifact not available for deployment
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Rollback blocked; potentially broken release in production |
| **Data integrity** | Environment may be in partial rollback state |
| **SLA impact** | Incident resolution blocked; extended outage |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.orchestrator.rollback-health
```
2. **Check rollback status:**
```bash
stella rollback status <rollback-id>
```
3. **Check previous deployment history:**
```bash
stella orch deployments list --env <env-name> --last 10
```
### Deep diagnosis
1. **Check why rollback failed:**
```bash
stella rollback trace <rollback-id> --verbose
```
Look for: Which step failed, error message
2. **Check previous artifact availability:**
```bash
stella orch artifacts get <previous-digest> --check
```
Problem if: Artifact deleted, not in registry
3. **Check environment state:**
```bash
stella orch env status <env-name> --detailed
```
4. **Check for deployment locks:**
```bash
stella orch locks list --env <env-name>
```
---
## Resolution
### Immediate mitigation
1. **Force-release a stuck lock:**
```bash
stella orch locks release --env <env-name> --force
```
2. **Manual rollback using specific artifact:**
```bash
stella deploy --env <env-name> --artifact <previous-digest> --force
```
3. **If artifact unavailable, deploy last known good:**
```bash
stella orch deployments list --env <env-name> --status success
stella deploy --env <env-name> --artifact <last-good-digest>
```
### Root cause fix
**If previous artifact not in registry:**
1. Check artifact retention policy:
```bash
stella registry retention show
```
2. Restore from backup registry:
```bash
stella registry restore --artifact <digest> --from backup
```
3. Increase artifact retention:
```bash
stella registry retention set --min-versions 10
```
**If deployment service unavailable:**
1. Check deployment target connectivity:
```bash
stella orch connectivity --target <env-name>
```
2. Check deployment agent status:
```bash
stella orch agent status --env <env-name>
```
**If configuration drift:**
1. Check environment configuration:
```bash
stella orch env config diff <env-name>
```
2. Reset environment to known state:
```bash
stella orch env reset <env-name> --to-baseline
```
**If database state inconsistent:**
1. Check orchestrator database:
```bash
stella orch db verify
```
2. Repair deployment state:
```bash
stella orch repair --deployment <deployment-id>
```
### Verification
```bash
# Verify rollback completed
stella rollback status <rollback-id>
# Verify environment state
stella orch env status <env-name>
# Verify correct version deployed
stella orch deployments current --env <env-name>
# Health check the environment
stella orch health-check --env <env-name>
```
---
## Prevention
- [ ] **Retention:** Maintain at least 5 previous versions in registry
- [ ] **Testing:** Test rollback procedure in staging regularly
- [ ] **Monitoring:** Alert on rollback failures immediately
- [ ] **Documentation:** Document manual rollback procedures per environment
---
## Related Resources
- **Architecture:** `docs/modules/release-orchestrator/rollback.md`
- **Related runbooks:** `orchestrator-promotion-stuck.md`, `orchestrator-evidence-missing.md`
- **Rollback procedures:** `docs/operations/rollback-procedures.md`

View File

@@ -0,0 +1,189 @@
# Runbook: Policy Engine - Rego Compilation Errors
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.compilation-health` |
---
## Symptoms
- [ ] Policy deployment failing with "compilation error"
- [ ] Alert `PolicyCompilationFailed` firing
- [ ] Error: "rego_parse_error" or "rego_type_error"
- [ ] New policies not taking effect
- [ ] OPA rejecting policy bundle
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New policies cannot be deployed; using stale policies |
| **Data integrity** | Existing policies continue to work; new rules not enforced |
| **SLA impact** | Policy updates blocked; security posture may be outdated |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.compilation-health
```
2. **Check policy compilation status:**
```bash
stella policy status --compilation
```
3. **Validate specific policy:**
```bash
stella policy validate --file <policy-file>
```
### Deep diagnosis
1. **Get detailed compilation errors:**
```bash
stella policy compile --verbose
```
Look for: Line numbers, error types, undefined references
2. **Check for syntax errors:**
```bash
stella policy lint --file <policy-file>
```
3. **Check for type errors:**
```bash
stella policy typecheck --file <policy-file>
```
4. **Check OPA version compatibility:**
```bash
stella policy opa version
stella policy check-compat --file <policy-file>
```
---
## Resolution
### Immediate mitigation
1. **Rollback to last working policy:**
```bash
stella policy rollback --to-last-good
```
2. **Disable the failing policy:**
```bash
stella policy disable <policy-id>
stella policy reload
```
3. **Use previous bundle:**
```bash
stella policy bundle load --version <previous-version>
```
### Root cause fix
**If syntax error:**
1. Get exact error location:
```bash
stella policy validate --file <policy-file> --show-line
```
2. Common syntax issues:
- Missing brackets or braces
- Invalid rule head syntax
- Incorrect import statements
3. Fix and re-validate:
```bash
stella policy validate --file <fixed-policy.rego>
```
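If the `stella` wrapper is unavailable, the upstream OPA CLI reproduces the same parse errors. A minimal sketch, assuming `opa` is installed locally; the package, rule names, and file path below are illustrative only:
```bash
# Hypothetical minimal policy showing a valid package, import, and rule head
# (rego.v1 / OPA 1.x syntax); all names are illustrative.
cat > /tmp/example.rego <<'EOF'
package release.gates

import rego.v1

default allow := false

allow if {
    input.verdict == "pass"
    count(input.blocking_findings) == 0
}
EOF
# Surface parse/compile errors and formatting drift with upstream tooling.
opa check /tmp/example.rego
opa fmt --diff /tmp/example.rego
```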
**If undefined reference:**
1. Check for missing imports:
```bash
stella policy analyze --file <policy-file> --show-imports
```
2. Verify data references exist:
```bash
stella policy data show
```
3. Add missing imports or data definitions
**If type error:**
1. Check type mismatches:
```bash
stella policy typecheck --file <policy-file> --verbose
```
2. Common type issues:
- Comparing incompatible types
- Invalid function arguments
- Missing type annotations
**If OPA version incompatibility:**
1. Check Rego version features used:
```bash
stella policy analyze --file <policy-file> --show-features
```
2. Update policy to use compatible features or upgrade OPA
### Verification
```bash
# Validate fixed policy
stella policy validate --file <fixed-policy.rego>
# Test policy compilation
stella policy compile --file <fixed-policy.rego>
# Deploy policy
stella policy deploy --file <fixed-policy.rego>
# Test policy evaluation
stella policy evaluate --test
```
---
## Prevention
- [ ] **CI/CD:** Add policy validation to CI pipeline before deployment
- [ ] **Linting:** Run `stella policy lint` on all policy changes
- [ ] **Testing:** Write unit tests for policies with `stella policy test`
- [ ] **Staging:** Deploy to staging environment before production
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-evaluation-slow.md`
- **Rego reference:** https://www.openpolicyagent.org/docs/latest/policy-language/
- **Policy testing:** `docs/modules/policy/testing.md`

View File

@@ -0,0 +1,174 @@
# Runbook: Policy Engine - Evaluation Latency High
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.evaluation-latency` |
---
## Symptoms
- [ ] Policy evaluation takes >500ms (warning) or >2s (critical)
- [ ] Gate decisions timing out in CI/CD pipelines
- [ ] Alert `PolicyEvaluationSlow` firing
- [ ] Metric `policy_evaluation_duration_seconds` P95 > 1s
- [ ] Users report "policy check taking too long"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Slow release gate checks, CI/CD pipeline delays |
| **Data integrity** | No data loss; decisions are still correct |
| **SLA impact** | Gate latency SLO violated (target: P95 < 500ms) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.evaluation-latency
```
2. **Check policy engine status:**
```bash
stella policy status
```
3. **Check recent evaluation times:**
```bash
stella policy stats --last 10m
```
Look for: P95 latency, cache hit rate
### Deep diagnosis
1. **Profile a slow evaluation:**
```bash
stella policy evaluate --image <image-ref> --profile
```
Look for: Which phase is slowest (parse, compile, execute)
2. **Check OPA compilation cache:**
```bash
stella policy cache stats
```
Problem if: Cache hit rate < 90%
3. **Check policy complexity:**
```bash
stella policy analyze --complexity
```
Problem if: Cyclomatic complexity > 50 or rule count > 200
4. **Check external data fetches:**
```bash
stella policy logs --filter "external fetch" --level debug
```
Problem if: Many external fetches or slow responses
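If a copy of the policy bundle and a representative input are available, upstream OPA can profile the same evaluation outside the platform. A sketch; the paths and the query are assumptions about your bundle layout:
```bash
# Show which rules and expressions dominate evaluation time.
opa eval --data /tmp/policy-bundle --input /tmp/sample-input.json \
  --profile --format=pretty 'data.release.gates.allow'
```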
---
## Resolution
### Immediate mitigation
1. **Clear and warm the compilation cache:**
```bash
stella policy cache clear
stella policy cache warm
```
2. **Increase OPA worker count:**
```bash
stella policy config set opa.workers 4
stella policy reload
```
3. **Enable evaluation result caching:**
```bash
stella policy config set cache.evaluation_ttl 60s
stella policy reload
```
### Root cause fix
**If policy is too complex:**
1. Analyze and simplify policy:
```bash
stella policy analyze --suggest-optimizations
```
2. Split large policies into modules:
```bash
stella policy refactor --auto-split
```
**If external data fetches are slow:**
1. Increase external data cache TTL:
```bash
stella policy config set external_data.cache_ttl 5m
```
2. Pre-fetch external data:
```bash
stella policy external-data prefetch
```
**If Rego compilation is slow:**
1. Enable partial evaluation:
```bash
stella policy config set opa.partial_eval true
```
2. Pre-compile policies:
```bash
stella policy compile --all
```
### Verification
```bash
# Run evaluation and check latency
stella policy evaluate --image <image-ref> --timing
# Check P95 latency
stella policy stats --last 5m
# Verify cache is effective
stella policy cache stats
```
---
## Prevention
- [ ] **Review:** Review policy complexity before deployment
- [ ] **Monitoring:** Alert on P95 latency > 300ms
- [ ] **Caching:** Ensure evaluation cache is enabled
- [ ] **Pre-warming:** Add cache warming to deployment pipeline
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-compilation-failed.md`
- **Dashboard:** Grafana > Stella Ops > Policy Engine

View File

@@ -0,0 +1,205 @@
# Runbook: Policy Engine - OPA Process Crashed
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.opa-health` |
---
## Symptoms
- [ ] Policy evaluations failing with "OPA unavailable" error
- [ ] Alert `PolicyOPACrashed` firing
- [ ] OPA process exited unexpectedly
- [ ] Error: "connection refused" when connecting to OPA
- [ ] Metric `policy_opa_restarts_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | All policy evaluations fail; gate decisions blocked |
| **Data integrity** | No data loss; decisions delayed until OPA recovers |
| **SLA impact** | Gate latency SLO violated; release pipeline blocked |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.opa-health
```
2. **Check OPA process status:**
```bash
stella policy status
```
Look for: OPA process state, restart count
3. **Check OPA logs for crash reason:**
```bash
stella policy opa logs --last 30m --level error
```
### Deep diagnosis
1. **Check OPA memory usage before crash:**
```bash
stella policy stats --opa-metrics
```
Problem if: Memory usage near limit before crash
2. **Check for problematic policy:**
```bash
stella policy list --last-error
```
Look for: Policies that caused evaluation errors
3. **Check OPA configuration:**
```bash
stella policy opa config show
```
Look for: Invalid configuration, missing bundles
4. **Check for infinite loops in Rego:**
```bash
stella policy analyze --detect-loops
```
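If the OPA sidecar exposes its HTTP API locally, probing it directly separates "OPA is down" from "the platform cannot reach OPA". A sketch assuming OPA's default port 8181:
```bash
# Returns 200 only when the server is up and, with these parameters,
# bundles are activated and all plugins report healthy.
curl -fsS "http://localhost:8181/health?bundles=true&plugins=true" && echo "OPA healthy"
```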
---
## Resolution
### Immediate mitigation
1. **Restart OPA process:**
```bash
stella policy opa restart
```
2. **If OPA keeps crashing, start in safe mode:**
```bash
stella policy opa start --safe-mode
```
Note: Safe mode disables custom policies
3. **Enable fail-open temporarily (if allowed by policy):**
```bash
stella policy config set failopen true
stella policy reload
```
**Warning:** Only use if compliance allows fail-open mode
### Root cause fix
**If OOM killed:**
1. Increase OPA memory limit:
```bash
stella policy opa config set memory_limit 2Gi
stella policy opa restart
```
2. Enable garbage collection tuning:
```bash
stella policy opa config set gc_min_heap_size 256Mi
stella policy opa config set gc_max_heap_size 1Gi
```
**If policy caused crash:**
1. Identify problematic policy:
```bash
stella policy list --status error
```
2. Disable the problematic policy:
```bash
stella policy disable <policy-id>
stella policy reload
```
3. Fix and re-enable:
```bash
stella policy validate --file <fixed-policy.rego>
stella policy update <policy-id> --file <fixed-policy.rego>
stella policy enable <policy-id>
```
**If bundle loading failed:**
1. Check bundle integrity:
```bash
stella policy bundle verify
```
2. Rebuild bundle:
```bash
stella policy bundle build --output bundle.tar.gz
stella policy bundle load bundle.tar.gz
```
**If configuration issue:**
1. Reset to default configuration:
```bash
stella policy opa config reset
```
2. Reconfigure with validated settings:
```bash
stella policy opa config set workers 4
stella policy opa config set decision_log true
stella policy opa restart
```
### Verification
```bash
# Check OPA is running
stella policy status
# Check OPA health
stella policy opa health
# Test policy evaluation
stella policy evaluate --test
# Check no crashes in recent logs
stella policy opa logs --level error --last 30m
# Monitor stability
stella policy stats --watch
```
---
## Prevention
- [ ] **Resources:** Set appropriate memory limits based on policy complexity
- [ ] **Validation:** Validate all policies before deployment
- [ ] **Monitoring:** Alert on OPA restart count > 2 in 10 minutes
- [ ] **Testing:** Load test policies before production deployment
---
## Related Resources
- **Architecture:** `docs/modules/policy/architecture.md`
- **Related runbooks:** `policy-evaluation-slow.md`, `policy-compilation-failed.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Policy/`
- **OPA documentation:** https://www.openpolicyagent.org/docs/latest/

View File

@@ -0,0 +1,178 @@
# Runbook: Policy Engine - Policy Storage Backend Down
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.storage-health` |
---
## Symptoms
- [ ] Policy operations failing with "storage unavailable"
- [ ] Alert `PolicyStorageUnavailable` firing
- [ ] Error: "failed to connect to policy store" or "database connection refused"
- [ ] Policy updates not persisting
- [ ] OPA unable to load bundles from storage
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Policy updates fail; cached policies may still work |
| **Data integrity** | Policy changes not persisted; risk of inconsistent state |
| **SLA impact** | Policy management blocked; evaluations use cached data |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.storage-health
```
2. **Check storage connectivity:**
```bash
stella policy storage status
```
3. **Check database health:**
```bash
stella db status --component policy
```
### Deep diagnosis
1. **Check PostgreSQL connectivity:**
```bash
stella db ping --database policy
```
2. **Check connection pool status:**
```bash
stella db pool-status --database policy
```
Problem if: Pool exhausted, connections timing out
3. **Check storage logs:**
```bash
stella policy logs --filter "storage" --level error --last 30m
```
4. **Check disk space (if local storage):**
```bash
stella policy storage disk-usage
```
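A direct probe of the database endpoint helps separate a network or credential problem from a policy-service problem. A sketch; host, port, and database name are placeholders:
```bash
# Exit code 0 means the server is accepting connections; 5s timeout keeps it quick.
pg_isready -h <db-host> -p 5432 -d <policy-db> -t 5
```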
---
## Resolution
### Immediate mitigation
1. **Enable read-only mode (use cached policies):**
```bash
stella policy config set storage.read_only true
stella policy reload
```
2. **Switch to backup storage:**
```bash
stella policy storage failover --to backup
```
3. **Restart policy service to reconnect:**
```bash
stella service restart policy-engine
```
### Root cause fix
**If database connection issue:**
1. Check database status:
```bash
stella db status --database policy --verbose
```
2. Restart database connection pool:
```bash
stella db pool-restart --database policy
```
3. Check and increase connection limits:
```bash
stella db config set policy.max_connections 50
```
**If disk space exhausted:**
1. Check storage usage:
```bash
stella policy storage disk-usage --verbose
```
2. Clean old policy versions:
```bash
stella policy versions cleanup --older-than 30d
```
3. Increase storage capacity
**If storage corruption:**
1. Verify storage integrity:
```bash
stella policy storage verify
```
2. Restore from backup:
```bash
stella policy storage restore --from-backup latest
```
### Verification
```bash
# Check storage status
stella policy storage status
# Test write operation
stella policy storage test-write
# Test policy update
stella policy update --test
# Verify no errors
stella policy logs --filter "storage" --level error --last 30m
```
---
## Prevention
- [ ] **Monitoring:** Alert on storage connection failures immediately
- [ ] **Redundancy:** Configure backup storage for failover
- [ ] **Cleanup:** Schedule regular cleanup of old policy versions
- [ ] **Capacity:** Monitor disk usage and plan for growth
---
## Related Resources
- **Architecture:** `docs/modules/policy/storage.md`
- **Related runbooks:** `policy-opa-crash.md`, `postgres-ops.md`
- **Database setup:** `docs/operations/database-configuration.md`

View File

@@ -0,0 +1,195 @@
# Runbook: Policy Engine - Policy Version Conflicts
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-003 - Policy Engine Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Policy Engine |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.policy.version-consistency` |
---
## Symptoms
- [ ] Policy evaluation returning unexpected results
- [ ] Alert `PolicyVersionMismatch` firing
- [ ] Error: "policy version conflict" or "bundle version mismatch"
- [ ] Different nodes evaluating with different policy versions
- [ ] Inconsistent gate decisions for same artifact
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Inconsistent policy decisions; unpredictable gate results |
| **Data integrity** | Decisions may not match expected policy behavior |
| **SLA impact** | Gate accuracy SLO violated; trust in decisions reduced |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.policy.version-consistency
```
2. **Check policy version across nodes:**
```bash
stella policy version --all-nodes
```
3. **Check active policy version:**
```bash
stella policy active --show-version
```
### Deep diagnosis
1. **Compare versions across instances:**
```bash
stella policy version diff --all-instances
```
Problem if: Different versions on different nodes
2. **Check bundle distribution status:**
```bash
stella policy bundle status --all-nodes
```
3. **Check for failed deployments:**
```bash
stella policy deployments list --status failed --last 24h
```
4. **Check OPA bundle sync:**
```bash
stella policy opa bundle-status
```
---
## Resolution
### Immediate mitigation
1. **Force sync to latest version:**
```bash
stella policy sync --force --all-nodes
```
2. **Pin specific version:**
```bash
stella policy pin --version <version>
stella policy sync --all-nodes
```
3. **Restart policy engines to force reload:**
```bash
stella service restart policy-engine --all-nodes
```
### Root cause fix
**If bundle distribution failed:**
1. Check bundle storage:
```bash
stella policy bundle storage-status
```
2. Rebuild and redistribute bundle:
```bash
stella policy bundle build
stella policy bundle distribute --all-nodes
```
**If node out of sync:**
1. Check specific node status:
```bash
stella policy status --node <node-id>
```
2. Force node resync:
```bash
stella policy sync --node <node-id> --force
```
3. Verify node is receiving updates:
```bash
stella policy bundle check-subscription --node <node-id>
```
**If concurrent deployments caused conflict:**
1. Check deployment history:
```bash
stella policy deployments list --last 1h
```
2. Resolve to single version:
```bash
stella policy resolve-conflict --to-version <version>
```
3. Enable deployment locking:
```bash
stella policy config set deployment.locking true
```
**If OPA bundle polling issue:**
1. Check OPA bundle configuration:
```bash
stella policy opa config show | grep bundle
```
2. Decrease polling interval for faster sync:
```bash
stella policy opa config set bundle.polling.min_delay_seconds 10
stella policy opa config set bundle.polling.max_delay_seconds 30
```
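For reference, these two settings map onto OPA's bundle polling configuration. A sketch of the underlying stanza, printed via a heredoc for illustration only; the service and bundle names are assumptions, and the actual file is managed by the platform:
```bash
cat <<'EOF'
bundles:
  stellaops:
    service: bundle-registry
    resource: bundles/policy.tar.gz
    polling:
      min_delay_seconds: 10
      max_delay_seconds: 30
EOF
```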
### Verification
```bash
# Verify all nodes on same version
stella policy version --all-nodes
# Test consistent evaluation
stella policy evaluate --test --all-nodes
# Verify bundle status
stella policy bundle status --all-nodes
# Check no version warnings
stella policy logs --filter "version" --level warning --last 30m
```
---
## Prevention
- [ ] **Locking:** Enable deployment locking to prevent concurrent updates
- [ ] **Monitoring:** Alert on version drift between nodes
- [ ] **Sync:** Configure aggressive bundle polling for fast convergence
- [ ] **Testing:** Deploy to staging before production to catch issues
---
## Related Resources
- **Architecture:** `docs/modules/policy/versioning.md`
- **Related runbooks:** `policy-opa-crash.md`, `policy-storage-unavailable.md`
- **Deployment guide:** `docs/operations/policy-deployment.md`

View File

@@ -0,0 +1,371 @@
# PostgreSQL Database Runbook (dev-mock ready)
Status: PRODUCTION-READY (2026-01-17 UTC)
> **Sprint:** SPRINT_20260117_029_Runbook_coverage_expansion
> **Task:** RUN-001 - PostgreSQL Operations Runbook
## Scope
PostgreSQL database operations including monitoring, maintenance, backup/restore, and common incident handling for Stella Ops deployments.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check database connection
stella db ping
# Verify connection pool health
stella doctor --check check.postgres.connectivity,check.postgres.connection-pool
# Check migration status
stella db migrations status
```
### Metrics to Watch
- `stella_postgres_connections_active` - Active connections (should be < 80% of max)
- `stella_postgres_query_duration_seconds` - P99 query latency (target: < 100ms)
- `stella_postgres_pool_waiting` - Connections waiting for pool (should be 0)
---
## Standard Procedures
### SP-001: Daily Health Check
**Frequency:** Daily or on-demand
**Duration:** ~5 minutes
1. Run comprehensive health check:
```bash
stella doctor --category database --format json > /tmp/db-health-$(date +%Y%m%d).json
```
2. Review slow queries from last 24h:
```bash
stella db queries --slow --period 24h --limit 20
```
3. Check replication status (if applicable):
```bash
stella db replication status
```
4. Verify backup completion:
```bash
stella backup status --type database
```
### SP-002: Connection Pool Tuning
**When:** Pool exhaustion alerts or high wait times
1. Check current pool usage:
```bash
stella db pool stats --detailed
```
2. Identify connection-holding queries:
```bash
stella db queries --active --sort duration
```
3. Adjust pool size (if needed):
```bash
# Review current settings
stella config get Database:MaxPoolSize
# Increase pool size
stella config set Database:MaxPoolSize 150
# Restart affected services
stella service restart --service release-orchestrator
```
4. Verify improvement:
```bash
stella db pool watch --duration 5m
```
### SP-003: Backup and Restore
**Backup:**
```bash
# Create immediate backup
stella backup create --type database --name "pre-upgrade-$(date +%Y%m%d)"
# Verify backup
stella backup verify --latest
```
**Restore:**
```bash
# List available backups
stella backup list --type database
# Restore to specific point (CAUTION: destructive)
stella backup restore --id <backup-id> --confirm
# Verify restoration
stella db ping
stella db migrations status
```
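If the built-in backup tooling is unavailable, plain PostgreSQL tooling works as a fallback. A sketch; connection parameters, database name, and paths are placeholders, and the restore is destructive:
```bash
# Logical backup in custom format (supports selective and parallel restore).
pg_dump "host=<db-host> dbname=<db-name> user=<admin-user>" \
  --format=custom --file=/backups/stellaops-$(date +%Y%m%dT%H%M%S).dump
# Restore over an existing schema (drops and recreates objects first).
pg_restore --clean --if-exists \
  --dbname="host=<db-host> dbname=<db-name> user=<admin-user>" \
  /backups/stellaops-<timestamp>.dump
```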
### SP-004: Migration Execution
1. Pre-migration backup:
```bash
stella backup create --type database --name "pre-migration"
```
2. Run migrations:
```bash
# Dry run first
stella db migrate --dry-run
# Apply migrations
stella db migrate
```
3. Verify migration success:
```bash
stella db migrations status
stella doctor --check check.postgres.migration-status
```
---
## Incident Procedures
### INC-001: Connection Pool Exhaustion
**Symptoms:**
- Alert: `StellaPostgresPoolExhausted`
- Error logs: "connection pool exhausted, waiting for available connection"
- Increased request latency
**Investigation:**
```bash
# Check pool status
stella db pool stats
# Find long-running queries
stella db queries --active --sort duration --limit 10
# Check for connection leaks
stella db connections --by-client
```
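If the CLI itself cannot obtain a connection, the same questions can be answered with raw SQL against `pg_stat_activity`. A sketch; connection parameters are placeholders:
```bash
psql "host=<db-host> dbname=<db-name> user=<readonly-user>" <<'SQL'
-- Longest-running non-idle sessions (typical pool-exhaustion culprits).
SELECT pid, usename, state,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC
LIMIT 10;
-- Sessions grouped by application and client address to spot leaks.
SELECT application_name, client_addr, count(*) AS sessions
FROM pg_stat_activity
GROUP BY application_name, client_addr
ORDER BY sessions DESC;
SQL
```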
**Resolution:**
1. **Immediate relief** - Terminate long-running queries:
```bash
# Identify stuck queries
stella db queries --active --duration ">5m"
# Terminate specific query (use with caution)
stella db query terminate --pid <pid>
```
2. **Scale pool** (if legitimate load):
```bash
stella config set Database:MaxPoolSize 200
stella service restart --graceful
```
3. **Fix leaks** (if application bug):
- Review application logs for unclosed connections
- Deploy fix to affected service
### INC-002: Slow Query Performance
**Symptoms:**
- Alert: `StellaPostgresQueryLatencyHigh`
- P99 query latency > 500ms
**Investigation:**
```bash
# Get slow query report
stella db queries --slow --period 1h --format json > /tmp/slow-queries.json
# Analyze specific query
stella db query explain --sql "SELECT ..." --analyze
# Check table statistics
stella db stats tables --sort bloat
```
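When the `pg_stat_statements` extension is installed, it gives the same slow-query picture directly. A sketch; the extension and a suitable role are assumed, and the column names are those of PostgreSQL 13+:
```bash
psql "host=<db-host> dbname=<db-name> user=<readonly-user>" <<'SQL'
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
SQL
```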
**Resolution:**
1. **Index optimization:**
```bash
# Get index recommendations
stella db index suggest --table <table>
# Create recommended index
stella db index create --table <table> --columns "col1,col2"
```
2. **Vacuum/analyze:**
```bash
stella db vacuum --table <table>
stella db analyze --table <table>
```
3. **Query optimization** - Review and rewrite problematic queries
### INC-003: Database Connectivity Loss
**Symptoms:**
- Alert: `StellaPostgresConnectionFailed`
- All services reporting database connection errors
**Investigation:**
```bash
# Test basic connectivity
stella db ping
# Check DNS resolution
stella network dns-lookup <db-host>
# Check firewall/network
stella network test --host <db-host> --port 5432
```
**Resolution:**
1. **Network issue:**
- Verify security groups / firewall rules
- Check VPN/tunnel status if applicable
- Verify DNS resolution
2. **Database server issue:**
- Check PostgreSQL service status on server
- Review PostgreSQL logs
- Check disk space on database server
3. **Credential issue:**
```bash
stella db verify-credentials
stella secrets rotate --scope database
```
### INC-004: Disk Space Alert
**Symptoms:**
- Alert: `StellaPostgresDiskSpaceWarning` or `Critical`
- Database write failures
**Investigation:**
```bash
# Check disk usage
stella db disk-usage
# Find large tables
stella db stats tables --sort size --limit 20
# Check for bloat
stella db stats tables --sort bloat
```
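Raw SQL gives the same size picture when the CLI is unavailable. A sketch; connection parameters are placeholders:
```bash
psql "host=<db-host> dbname=<db-name> user=<readonly-user>" <<'SQL'
-- Largest relations including indexes and TOAST data.
SELECT schemaname, relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;
SQL
```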
**Resolution:**
1. **Immediate cleanup:**
```bash
# Vacuum to reclaim space
stella db vacuum --full --table <large-table>
# Clean old data (if retention policy allows)
stella db prune --table evidence_artifacts --older-than 90d --dry-run
```
2. **Archive old data:**
```bash
stella db archive --table findings_history --older-than 180d
```
3. **Expand disk** (if legitimate growth):
- Follow cloud provider procedure to expand volume
- Resize filesystem
---
## Maintenance Windows
### Weekly Maintenance (Sunday 02:00 UTC)
1. Run vacuum analyze on all tables:
```bash
stella db vacuum --analyze --all-tables
```
2. Update table statistics:
```bash
stella db analyze --all-tables
```
3. Clean temporary files:
```bash
stella db cleanup --temp-files
```
### Monthly Maintenance (First Sunday 03:00 UTC)
1. Full vacuum on large tables:
```bash
stella db vacuum --full --table findings --table verdicts
```
2. Reindex if needed:
```bash
stella db reindex --concurrently --table findings
```
3. Archive old data per retention policy:
```bash
stella db archive --apply-retention
```
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → PostgreSQL
Key panels:
- Connection pool utilization
- Query latency percentiles
- Disk usage trend
- Replication lag (if applicable)
- Active queries count
---
## Evidence Capture
For any incident, capture:
```bash
# Comprehensive database state
stella db diagnostics --output /tmp/db-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
Bundle includes:
- Connection stats
- Active queries
- Lock information
- Table statistics
- Recent slow query log
- Configuration snapshot
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, restart services
2. **L2 (Database team):** Query optimization, schema changes
3. **L3 (Vendor support):** Hardware/cloud platform issues
---
_Last updated: 2026-01-17 (UTC)_

View File

@@ -0,0 +1,152 @@
# Runbook: Scanner - Out of Memory on Large Images
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.memory-usage` |
---
## Symptoms
- [ ] Scanner worker exits with code 137 (OOM killed)
- [ ] Scans fail consistently for specific large images
- [ ] Error log contains "fatal error: runtime: out of memory"
- [ ] Alert `ScannerWorkerOOM` firing
- [ ] Metric `scanner_worker_restarts_total{reason="oom"}` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Large images cannot be scanned; smaller images may still work |
| **Data integrity** | No data loss; failed scans can be retried |
| **SLA impact** | Specific images blocked from release pipeline |
---
## Diagnosis
### Quick checks
1. **Identify the failing image:**
```bash
stella scanner jobs list --status failed --last 1h
```
2. **Check image size:**
```bash
stella image inspect <image-ref> --format json | jq '.size'
```
Problem if: Image size > 2GB or layer count > 100
3. **Check worker memory limit:**
```bash
stella scanner config get worker.memory_limit
```
### Deep diagnosis
1. **Profile memory usage during scan:**
```bash
stella scan image --image <image-ref> --profile-memory
```
2. **Check SBOM generation memory:**
```bash
stella scanner logs --filter "sbom" --level debug --last 30m
```
Look for: "memory allocation failed", "heap exhausted"
3. **Identify memory-heavy layers:**
```bash
stella image layers <image-ref> --sort-by size
```
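When the worker host is accessible, kernel logs confirm OOM kills independently of scanner metrics. A sketch assuming a systemd host; log locations vary by distribution:
```bash
# Kernel-level evidence of the OOM killer acting on scanner workers.
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom-killer"
dmesg -T | grep -iE "killed process|oom-killer" | tail -20
```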
---
## Resolution
### Immediate mitigation
1. **Increase worker memory limit:**
```bash
stella scanner config set worker.memory_limit 8Gi
stella scanner workers restart
```
2. **Enable streaming mode for large images:**
```bash
stella scanner config set sbom.streaming_threshold 1Gi
stella scanner workers restart
```
3. **Retry the failed scan:**
```bash
stella scan image --image <image-ref> --retry
```
### Root cause fix
**For consistently large images:**
1. Configure dedicated large-image worker pool:
```bash
stella scanner workers add --pool large-images --memory 16Gi --count 2
stella scanner config set routing.large_image_threshold 2Gi
stella scanner config set routing.large_image_pool large-images
```
**For images with many small files (node_modules, etc.):**
1. Enable incremental SBOM mode:
```bash
stella scanner config set sbom.incremental_mode true
```
**For base image reuse:**
1. Enable layer caching:
```bash
stella scanner config set cache.layer_dedup true
```
### Verification
```bash
# Retry the previously failing scan
stella scan image --image <image-ref>
# Monitor memory during scan
stella scanner workers stats --watch
# Verify no OOM in recent logs
stella scanner logs --filter "out of memory" --last 1h
```
---
## Prevention
- [ ] **Capacity:** Set memory limit based on largest expected image (recommend 4Gi minimum)
- [ ] **Routing:** Configure large-image pool for images > 2GB
- [ ] **Monitoring:** Alert on `scanner_worker_memory_usage_bytes` > 80% of limit
- [ ] **Documentation:** Document image size limits in user guide
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Memory

View File

@@ -0,0 +1,195 @@
# Runbook: Scanner - Registry Authentication Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team, Security team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.registry-auth` |
---
## Symptoms
- [ ] Scans failing with "401 Unauthorized" or "403 Forbidden"
- [ ] Alert `ScannerRegistryAuthFailed` firing
- [ ] Error: "failed to authenticate with registry"
- [ ] Error: "failed to pull image manifest"
- [ ] Scans work for public images but fail for private images
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Cannot scan private images; release pipeline blocked |
| **Data integrity** | No data loss; authentication issue only |
| **SLA impact** | All scans for affected registry blocked |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.registry-auth
```
2. **List configured registries:**
```bash
stella registry list --show-status
```
Look for: Registries with "auth_failed" status
3. **Test registry authentication:**
```bash
stella registry test <registry-url>
```
### Deep diagnosis
1. **Check credential expiration:**
```bash
stella registry credentials show <registry-name>
```
Look for: Expiration date, token type
2. **Test with verbose output:**
```bash
stella registry test <registry-url> --verbose
```
Look for: Specific auth error message, HTTP status code
3. **Check registry logs:**
```bash
stella scanner logs --filter "registry auth" --last 30m
```
4. **Verify IAM/OIDC configuration (for cloud registries):**
```bash
stella registry iam-status <registry-name>
```
Problem if: IAM role not assumable, OIDC token expired
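An out-of-band login with the provider's own tooling quickly shows whether the credentials themselves are valid, independent of the scanner. A sketch; account IDs, regions, and repository hosts are placeholders:
```bash
# AWS ECR: mint a short-lived token and test a plain docker login.
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
# GCP Artifact Registry: same idea with an OAuth access token.
gcloud auth print-access-token \
  | docker login -u oauth2accesstoken --password-stdin <region>-docker.pkg.dev
```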
---
## Resolution
### Immediate mitigation
1. **Refresh credentials (for token-based auth):**
```bash
stella registry refresh-credentials <registry-name>
```
2. **Update static credentials:**
```bash
stella registry update-credentials <registry-name> \
--username <user> \
--password <token>
```
3. **For Docker Hub rate limiting:**
```bash
stella registry configure docker-hub \
--username <user> \
--access-token <token>
```
### Root cause fix
**If credentials expired:**
1. Generate new access token in registry (ECR, GCR, ACR, etc.)
2. Update credentials:
```bash
stella registry update-credentials <registry-name> --from-env
```
3. Configure automatic token refresh:
```bash
stella registry config set <registry-name>.auto_refresh true
stella registry config set <registry-name>.refresh_interval 11h
```
**If IAM role/policy changed (AWS ECR):**
1. Verify IAM role permissions:
```bash
stella registry iam verify <registry-name>
```
2. Update IAM role ARN if changed:
```bash
stella registry configure ecr \
--region <region> \
--role-arn <arn>
```
**If OIDC federation changed (GCP Artifact Registry):**
1. Verify service account:
```bash
stella registry oidc verify <registry-name>
```
2. Update workload identity configuration:
```bash
stella registry configure gcr \
--project <project> \
--workload-identity-provider <provider>
```
**If certificate changed (self-hosted registries):**
1. Update CA certificate:
```bash
stella registry configure <registry-name> \
--ca-cert /path/to/ca.crt
```
2. Or skip verification (not recommended for production):
```bash
stella registry configure <registry-name> \
--insecure-skip-verify
```
### Verification
```bash
# Test authentication
stella registry test <registry-url>
# Test scanning a private image
stella scan image --image <registry-url>/<image>:<tag> --dry-run
# Verify no auth failures in recent logs
stella scanner logs --filter "auth" --level error --last 30m
```
---
## Prevention
- [ ] **Credentials:** Use service accounts/workload identity instead of static tokens
- [ ] **Rotation:** Configure automatic token refresh before expiration
- [ ] **Monitoring:** Alert on authentication failure rate > 0
- [ ] **Documentation:** Document registry credential management procedures
---
## Related Resources
- **Architecture:** `docs/modules/scanner/registry-auth.md`
- **Related runbooks:** `scanner-worker-stuck.md`, `scanner-timeout.md`
- **Registry setup:** `docs/operations/registry-configuration.md`

View File

@@ -0,0 +1,188 @@
# Runbook: Scanner - SBOM Generation Failures
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | High |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.sbom-generation` |
---
## Symptoms
- [ ] Scans completing but SBOM generation failing
- [ ] Alert `ScannerSbomGenerationFailed` firing
- [ ] Error: "SBOM generation failed" or "unsupported package format"
- [ ] Partial SBOM with missing components
- [ ] Metric `scanner_sbom_generation_failures_total` increasing
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Incomplete vulnerability coverage; missing dependencies not scanned |
| **Data integrity** | Partial SBOM may miss vulnerabilities; attestations incomplete |
| **SLA impact** | SBOM completeness SLO violated (target: > 95%) |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.sbom-generation
```
2. **Check failed SBOM jobs:**
```bash
stella scanner jobs list --status sbom_failed --last 1h
```
3. **Check SBOM completeness rate:**
```bash
stella scanner stats --sbom-metrics
```
### Deep diagnosis
1. **Analyze specific failure:**
```bash
stella scanner job details <job-id> --sbom-errors
```
Look for: Specific package manager or file type causing failure
2. **Check for unsupported ecosystems:**
```bash
stella sbom analyze --image <image-ref> --verbose
```
Look for: "unsupported", "unknown package format", "parsing failed"
3. **Check scanner plugin status:**
```bash
stella scanner plugins list --status
```
Problem if: Package manager plugin disabled or erroring
4. **Check for corrupted package files:**
```bash
stella image inspect <image-ref> --check-integrity
```
---
## Resolution
### Immediate mitigation
1. **Enable fallback SBOM generation:**
```bash
stella scanner config set sbom.fallback_mode true
stella scan image --image <image-ref> --sbom-fallback
```
2. **Use alternative SBOM generator:**
```bash
stella sbom generate --image <image-ref> --generator syft --output sbom.json
```
3. **Generate partial SBOM and continue:**
```bash
stella scan image --image <image-ref> --sbom-partial-ok
```
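Because the alternative generator above is Syft, running the upstream tool directly also confirms whether the failure is specific to the platform's pipeline. A sketch; the image reference and output paths are placeholders:
```bash
# Generate SBOMs with upstream Syft in both common formats.
syft <registry>/<image>:<tag> -o cyclonedx-json > sbom.cdx.json
syft <registry>/<image>:<tag> -o spdx-json > sbom.spdx.json
```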
### Root cause fix
**If package manager not supported:**
1. Check supported package managers:
```bash
stella scanner plugins list --type package-manager
```
2. Enable additional plugins:
```bash
stella scanner plugins enable <plugin-name>
```
3. For custom package formats, add mapping:
```bash
stella scanner config set sbom.custom_mappings.<format> <handler>
```
**If package file corrupted:**
1. Identify corrupted files:
```bash
stella image layers <image-ref> --verify-packages
```
2. Report to image owner for fix
**If memory/resource issue during generation:**
1. Increase SBOM generator resources:
```bash
stella scanner config set sbom.memory_limit 4Gi
stella scanner config set sbom.timeout 10m
```
2. Enable streaming mode:
```bash
stella scanner config set sbom.streaming_mode true
```
**If plugin crashed:**
1. Check plugin logs:
```bash
stella scanner plugins logs <plugin-name> --last 30m
```
2. Restart plugin:
```bash
stella scanner plugins restart <plugin-name>
```
### Verification
```bash
# Retry SBOM generation
stella sbom generate --image <image-ref> --output sbom.json
# Validate SBOM completeness
stella sbom validate --file sbom.json --check-completeness
# Check component count
stella sbom stats --file sbom.json
# Full scan with SBOM
stella scan image --image <image-ref>
```
---
## Prevention
- [ ] **Plugins:** Keep all package manager plugins enabled and updated
- [ ] **Monitoring:** Alert on SBOM completeness < 90%
- [ ] **Fallback:** Configure fallback SBOM generator for resilience
- [ ] **Testing:** Test SBOM generation for new image types before production
---
## Related Resources
- **Architecture:** `docs/modules/scanner/sbom-generation.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **SBOM formats:** `docs/formats/sbom-spdx.md`, `docs/formats/sbom-cyclonedx.md`

View File

@@ -0,0 +1,174 @@
# Runbook: Scanner - Scan Timeout on Complex Images
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Medium |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.timeout-rate` |
---
## Symptoms
- [ ] Scans failing with "timeout exceeded" error
- [ ] Alert `ScannerTimeoutExceeded` firing
- [ ] Metric `scanner_scan_timeout_total` increasing
- [ ] Specific images consistently timing out
- [ ] Error log: "scan operation exceeded timeout of X seconds"
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | Specific images cannot be scanned; pipeline blocked |
| **Data integrity** | No data loss; scans can be retried with adjusted settings |
| **SLA impact** | Release pipeline delayed for affected images |
---
## Diagnosis
### Quick checks
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.timeout-rate
```
2. **Identify failing images:**
```bash
stella scanner jobs list --status timeout --last 1h
```
Look for: Pattern in image types or sizes
3. **Check current timeout settings:**
```bash
stella scanner config get timeouts
```
### Deep diagnosis
1. **Analyze image complexity:**
```bash
stella image inspect <image-ref> --format json | jq '{size, layers: .layers | length, files: .manifest.fileCount}'
```
Problem if: > 50 layers, > 100k files, or > 5GB size
2. **Check scanner worker load:**
```bash
stella scanner workers stats
```
Problem if: All workers at capacity during timeouts
3. **Profile a scan:**
```bash
stella scan image --image <image-ref> --profile --verbose
```
Look for: Which phase is slowest (layer extraction, SBOM generation, vuln matching)
4. **Check for filesystem-heavy images:**
```bash
stella image layers <image-ref> --sort-by file-count
```
Problem if: Single layer with > 50k files (e.g., node_modules)
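Upstream registry tooling can approximate the same complexity check without pulling the image. A sketch assuming `skopeo` and `jq` are installed; it works as written on single-architecture manifests:
```bash
# Layer count straight from the registry metadata.
skopeo inspect docker://<registry>/<image>:<tag> | jq '.Layers | length'
# Total compressed size of all layers, in bytes, from the raw manifest.
skopeo inspect --raw docker://<registry>/<image>:<tag> | jq '[.layers[].size] | add'
```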
---
## Resolution
### Immediate mitigation
1. **Increase timeout for specific image:**
```bash
stella scan image --image <image-ref> --timeout 30m
```
2. **Increase global scan timeout:**
```bash
stella scanner config set timeouts.scan 20m
stella scanner workers restart
```
3. **Enable fast mode for initial scan:**
```bash
stella scan image --image <image-ref> --fast-mode
```
### Root cause fix
**If image is too complex:**
1. Enable incremental scanning:
```bash
stella scanner config set scan.incremental_mode true
```
2. Configure layer caching:
```bash
stella scanner config set cache.layer_dedup true
stella scanner config set cache.sbom_cache true
```
**If filesystem is too large:**
1. Enable streaming SBOM generation:
```bash
stella scanner config set sbom.streaming_threshold 500Mi
```
2. Configure file sampling for massive images:
```bash
stella scanner config set sbom.file_sample_max 100000
```
**If vulnerability matching is slow:**
1. Enable parallel matching:
```bash
stella scanner config set vuln.parallel_matching true
stella scanner config set vuln.match_workers 4
```
2. Optimize vulnerability database indexes:
```bash
stella db optimize --component scanner
```
### Verification
```bash
# Retry the previously failing scan
stella scan image --image <image-ref> --timeout 30m
# Monitor scan progress
stella scanner jobs watch <job-id>
# Verify no timeouts in recent scans
stella scanner jobs list --status timeout --last 1h
```
---
## Prevention
- [ ] **Capacity:** Configure appropriate timeouts based on expected image complexity (15m default, 30m for large)
- [ ] **Monitoring:** Alert on timeout rate > 5%
- [ ] **Caching:** Enable layer and SBOM caching for base images
- [ ] **Documentation:** Document image size/complexity limits in user guide
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-worker-stuck.md`
- **Dashboard:** Grafana > Stella Ops > Scanner Performance

View File

@@ -0,0 +1,174 @@
# Runbook: Scanner - Worker Not Processing Jobs
> **Sprint:** SPRINT_20260117_029_DOCS_runbook_coverage
> **Task:** RUN-002 - Scanner Runbooks
## Metadata
| Field | Value |
|-------|-------|
| **Component** | Scanner |
| **Severity** | Critical |
| **On-call scope** | Platform team |
| **Last updated** | 2026-01-17 |
| **Doctor check** | `check.scanner.worker-health` |
---
## Symptoms
- [ ] Scan jobs stuck in "pending" or "processing" state for >5 minutes
- [ ] Scanner worker process shows 0% CPU usage
- [ ] Alert `ScannerWorkerStuck` or `ScannerQueueBacklog` firing
- [ ] UI shows "Scan in progress" indefinitely
- [ ] Metric `scanner_jobs_pending` increasing over time
---
## Impact
| Impact Type | Description |
|-------------|-------------|
| **User-facing** | New scans cannot complete, blocking CI/CD pipelines and release gates |
| **Data integrity** | No data loss; pending jobs will resume when worker recovers |
| **SLA impact** | Scan latency SLO violated if not resolved within 15 minutes |
---
## Diagnosis
### Quick checks (< 2 minutes)
1. **Check Doctor diagnostics:**
```bash
stella doctor --check check.scanner.worker-health
```
2. **Check scanner service status:**
```bash
stella scanner status
```
Expected: "Scanner workers: 4 active, 0 idle"
Problem: "Scanner workers: 0 active" or "status: degraded"
3. **Check job queue depth:**
```bash
stella scanner queue status
```
Expected: Queue depth < 50
Problem: Queue depth > 100 or growing rapidly
### Deep diagnosis
1. **Check worker process logs:**
```bash
stella scanner logs --tail 100 --level error
```
Look for: "timeout", "connection refused", "out of memory"
2. **Check Valkey connectivity (job queue):**
```bash
stella doctor --check check.storage.valkey
```
3. **Check if workers are OOM-killed:**
```bash
stella scanner workers inspect
```
Look for: "exit_code: 137" (OOM) or "exit_code: 143" (SIGTERM)
4. **Check resource utilization:**
```bash
stella obs metrics --filter scanner --last 10m
```
Look for: Memory > 90%, CPU sustained > 95%
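For step 2 above, the queue backend can also be probed directly to separate "Valkey is down" from "workers cannot reach it". A sketch; host and port are placeholders, and `valkey-cli` and `redis-cli` are interchangeable here:
```bash
valkey-cli -h <valkey-host> -p 6379 ping                       # expect: PONG
valkey-cli -h <valkey-host> -p 6379 info clients | grep connected_clients
```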
---
## Resolution
### Immediate mitigation
1. **Restart scanner workers:**
```bash
stella scanner workers restart
```
This will: Terminate current workers and spawn fresh ones
2. **If restart fails, force restart the scanner service:**
```bash
stella service restart scanner
```
3. **Verify workers are processing:**
```bash
stella scanner queue status --watch
```
Queue depth should start decreasing
### Root cause fix
**If workers were OOM-killed:**
1. Increase worker memory limit:
```bash
stella scanner config set worker.memory_limit 4Gi
stella scanner workers restart
```
2. Reduce concurrent scans per worker:
```bash
stella scanner config set worker.concurrency 2
stella scanner workers restart
```
**If Valkey connection failed:**
1. Check Valkey health:
```bash
stella doctor --check check.storage.valkey
```
2. Restart Valkey if needed (see `valkey-connection-failure.md`)
**If workers are deadlocked:**
1. Enable deadlock detection:
```bash
stella scanner config set worker.deadlock_detection true
stella scanner workers restart
```
### Verification
```bash
# Verify workers are healthy
stella doctor --check check.scanner.worker-health
# Submit a test scan
stella scan image --image alpine:latest --dry-run
# Watch queue drain
stella scanner queue status --watch
# Verify no errors in recent logs
stella scanner logs --tail 20 --level error
```
---
## Prevention
- [ ] **Alert:** Ensure `ScannerQueueBacklog` alert is configured with threshold < 100 jobs
- [ ] **Monitoring:** Add Grafana panel for worker memory usage
- [ ] **Capacity:** Review worker count and memory limits during capacity planning
- [ ] **Deadlock:** Enable `worker.deadlock_detection` in production
---
## Related Resources
- **Architecture:** `docs/modules/scanner/architecture.md`
- **Related runbooks:** `scanner-oom.md`, `scanner-timeout.md`
- **Doctor check:** `src/Doctor/__Plugins/StellaOps.Doctor.Plugin.Scanner/Checks/WorkerHealthCheck.cs`
- **Dashboard:** Grafana > Stella Ops > Scanner Overview

View File

@@ -1,202 +0,0 @@
# Product Advisory: AI Economics Moat
ID: ADVISORY-20260116-AI-ECON-MOAT
Status: ACTIVE
Owner intent: Product-wide directive
Scope: All modules, docs, sprints, and roadmap decisions
## 0) Thesis (why this advisory exists)
In AI economics, code is cheap; software is expensive.
Competitors (and future competitors) can produce large volumes of code quickly. Stella Ops must remain hard to catch by focusing on the parts that are still expensive:
- trust
- operability
- determinism
- evidence integrity
- low-touch onboarding
- low support burden at scale
This advisory defines the product-level objectives and non-negotiable standards that make Stella Ops defensible against "code producers".
## 1) Product positioning (the class we must win)
Stella Ops Suite must be "best in class" for:
Evidence-grade release orchestration for containerized applications outside Kubernetes.
Stella is NOT attempting to be:
- a generic CD platform (Octopus, GitLab, Jenkins replacements)
- a generic vulnerability scanner (Trivy, Grype replacements)
- a "platform of everything" with infinite integrations
The moat is the end-to-end chain:
digest identity -> evidence -> verdict -> gate -> promotion -> audit export -> deterministic replay
The product wins when customers can run verified releases with minimal human labor and produce auditor-ready evidence.
## 2) Target customer and adoption constraint
Constraint: founder operates solo until ~100 paying customers.
Therefore, the product must be self-serve by default:
- install must be predictable
- failures must be diagnosable without maintainer time
- docs must replace support
- "Doctor" must replace debugging sessions
Support must be an exception, not a workflow.
## 3) The five non-negotiable product invariants
Every meaningful product change MUST preserve and strengthen these invariants:
I1. Evidence-grade by design
- Every verified decision has an evidence trail.
- Evidence is exportable, replayable, and verifiable.
I2. Deterministic replay
- Same inputs -> same outputs.
- A verdict can be reproduced and verified later, not just explained.
I3. Digest-first identity
- Releases are immutable digests, not mutable tags.
- "What is deployed where" is anchored to digests.
I4. Offline-first posture
- Air-gapped and low-egress environments must remain first-class.
- No hidden network dependencies in core flows.
I5. Low-touch operability
- Misconfigurations fail fast at startup with clear messages.
- Runtime failures have deterministic recovery playbooks.
- Doctor provides actionable diagnostics bundles and remediation steps.
If a proposed feature weakens any invariant, it must be rejected or redesigned.
## 4) Moats we build (how Stella stays hard to catch)
M1. Evidence chain continuity (no "glue work" required)
- Scan results, reachability proofs, policy evaluation, approvals, promotions, and exports are one continuous chain.
- Do not require customers to stitch multiple tools together to get audit-grade releases.
M2. Explainability with proof, not narrative
- "Why blocked?" must produce a deterministic trace + referenced evidence artifacts.
- The answer must be replayable, not a one-time explanation.
M3. Operability moat (Doctor + safe defaults)
- Diagnostics must identify root cause, not just symptoms.
- Provide deterministic checklists and fixes.
- Every integration must ship with health checks and failure-mode docs.
M4. Controlled surface area (reduce permutations)
- Ship a small number of Tier-1 golden integrations and targets.
- Keep the plugin system as an escape valve, but do not expand the maintained matrix beyond what solo operations can support.
M5. Standards-grade outputs with stable schemas
- SBOM, VEX, attestations, exports, and decision records must be stable, versioned, and backwards compatible where promised.
- Stability is a moat: auditors and platform teams adopt what they can depend on.
## 5) Explicit non-goals (what to reject quickly)
Reject or de-prioritize proposals that primarily:
- add a generic CD surface without evidence and determinism improvements
- expand integrations broadly without a "Tier-1" support model and diagnostics coverage
- compete on raw scanner breadth rather than evidence-grade gating outcomes
- add UI polish that does not reduce operator labor or support load
- add "AI features" that create nondeterminism or require external calls in core paths
If a feature does not strengthen at least one moat (M1-M5), it is likely not worth shipping now.
## 6) Agent review rubric (use this to evaluate any proposal, advisory, or sprint)
When reviewing any new idea, feature request, PRD, or sprint, score it against:
A) Moat impact (required)
- Which moat does it strengthen (M1-M5)?
- What measurable operator/auditor outcome improves?
B) Support burden risk (critical)
- Does this increase the probability of support tickets?
- Does Doctor cover the new failure modes?
- Are there clear runbooks and error messages?
C) Determinism and evidence risk (critical)
- Does this introduce nondeterminism?
- Are outputs stable, canonical, and replayable?
- Does it weaken evidence chain integrity?
D) Permutation risk (critical)
- Does this increase the matrix of supported combinations?
- Can it be constrained to a "golden path" configuration?
E) Time-to-value impact (important)
- Does this reduce time to first verified release?
- Does it reduce time to answer "why blocked"?
If a proposal scores poorly on B/C/D, it must be redesigned or rejected.
## 7) Definition of Done (feature-level) - do not ship without the boring parts
Any shippable feature must include, at minimum:
DOD-1: Operator story
- Clear user story for operators and auditors, not just developers.
DOD-2: Failure modes and recovery
- Documented expected failures, error codes/messages, and remediation steps.
- Doctor checks added or extended to cover the common failure paths.
DOD-3: Determinism and evidence
- Deterministic outputs where applicable.
- Evidence artifacts linked to decisions.
- Replay or verify path exists if the feature affects verdicts or gates.
DOD-4: Tests
- Unit tests for logic (happy + edge cases).
- Integration tests for contracts (DB, queues, storage where used).
- Determinism tests when outputs are serialized, hashed, or signed.
DOD-5: Documentation
- Docs updated where the feature changes behavior or contracts.
- Include copy/paste examples for the golden path usage.
DOD-6: Observability
- Structured logs and metrics for success/failure paths.
- Explicit "reason codes" for gate decisions and failures.
If the feature cannot afford these, it cannot afford to exist in a solo-scaled product.
## 8) Product-level metrics (what we optimize)
These metrics are the scoreboard. Prioritize work that improves them.
P0 metrics (most important):
- Time-to-first-verified-release (fresh install -> verified promotion)
- Mean time to answer "why blocked?" (with proof)
- Support minutes per customer per month (must trend toward near-zero)
- Determinism regressions per release (must be near-zero)
P1 metrics:
- Noise reduction ratio (reachable actionable findings vs raw findings)
- Audit export acceptance rate (auditors can consume without manual reconstruction)
- Upgrade success rate (low-friction updates, predictable migrations)
## 9) Immediate product focus areas implied by this advisory
When unsure what to build next, prefer investments in:
- Doctor: diagnostics coverage, fix suggestions, bundles, and environment validation
- Golden path onboarding: install -> connect -> scan -> gate -> promote -> export
- Determinism gates in CI and runtime checks for canonical outputs
- Evidence export bundles that map to common audit needs
- "Why blocked" trace quality, completeness, and replay verification
Avoid "breadth expansion" unless it includes full operability coverage.
## 10) How to apply this advisory in planning
When processing this advisory:
- Ensure docs reflect the invariants and moats at the product overview level.
- Ensure sprints and tasks reference which moat they strengthen (M1-M5).
- If a sprint increases complexity without decreasing operator labor or improving evidence integrity, treat it as suspect.
Archive this advisory only if it is superseded by a newer product-wide directive.