# Sprint: SPRINT_20260117_029_Runbook_coverage_expansion
# Task: RUN-003 - Evidence Locker Runbook
# Evidence Locker Operations Runbook
Status: PRODUCTION-READY (2026-01-17 UTC)
## Scope
This runbook covers Evidence Locker operations: storage management, integrity verification, attestation management, provenance chain maintenance, and disaster recovery.
---
## Pre-flight Checklist
### Environment Verification
```bash
# Check evidence locker health
stella doctor --category evidence
# Verify storage accessibility
stella evidence status
# Check index health
stella evidence index status
# Verify anchor chain
stella evidence anchor verify --latest
```
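The four checks above can be chained into a single gate that reports every failure instead of stopping at the first one. This is a minimal sketch, not part of the `stella` CLI: the function names `run_check` and `preflight` are hypothetical, and it assumes each `stella` command exits non-zero when its check fails, which this runbook does not confirm.

```bash
# Sketch: run every pre-flight check and summarize the results.
# ASSUMPTION: each stella command exits non-zero when its check fails.

run_check() {
  # "$@" is the full command line of one check.
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $*"
  else
    echo "FAIL: $*"
    return 1
  fi
}

preflight() {
  local failed=0
  # Commands are taken verbatim from the checklist above.
  run_check stella doctor --category evidence      || failed=1
  run_check stella evidence status                 || failed=1
  run_check stella evidence index status           || failed=1
  run_check stella evidence anchor verify --latest || failed=1
  # Succeed only if every check passed.
  [ "$failed" -eq 0 ]
}
```

Call `preflight` before starting any procedure in this runbook, and rerun a failing command on its own to see the output that the wrapper discards.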
### Metrics to Watch
- `stella_evidence_artifacts_total` - Total artifacts stored
- `stella_evidence_retrieval_latency_seconds` - Retrieval latency (track P99)
- `stella_evidence_storage_bytes` - Storage consumption
- `stella_merkle_anchor_age_seconds` - Time since last anchor
---
## Standard Procedures
### SP-001: Daily Integrity Check
**Frequency:** Daily (automated) or on-demand
**Duration:** Varies by locker size (typically 5-30 minutes)
1. Run integrity verification:
```bash
# Quick check (sample-based)
stella evidence verify --mode quick
# Full check (all artifacts)
stella evidence verify --mode full
```
2. Review results:
```bash
stella evidence verify-report --latest
```
3. List any failed artifacts and resolve them per INC-001:
```bash
# List failed artifacts
stella evidence verify-report --latest --filter failed
```
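For the automated daily run, the steps above can be wrapped so the failure list is only produced when the quick check fails. This is a sketch under one assumption that the runbook does not state: `stella evidence verify` exits non-zero when any artifact fails verification. The function name `daily_integrity_check` is hypothetical.

```bash
# Sketch: quick check first; print the failed artifacts only on failure.
# ASSUMPTION: `stella evidence verify` exits non-zero on integrity failures.

daily_integrity_check() {
  if stella evidence verify --mode quick; then
    echo "integrity: quick check passed"
    return 0
  fi
  echo "integrity: quick check failed; failed artifacts follow" >&2
  stella evidence verify-report --latest --filter failed
  # Non-zero exit lets a scheduler or alerting hook notice the failure.
  return 1
}
```

A non-zero exit from this wrapper is the cue to open INC-001.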
### SP-002: Index Maintenance
**Frequency:** Weekly or after large ingestion
**Duration:** ~10 minutes
1. Check index health:
```bash
stella evidence index status
```
2. Refresh index if needed:
```bash
# Incremental refresh
stella evidence index refresh
# Full rebuild (if corruption suspected)
stella evidence index rebuild
```
3. Optimize index:
```bash
stella evidence index optimize
```
### SP-003: Merkle Anchoring
**Frequency:** Per policy (default: every 6 hours)
**Duration:** ~2 minutes
1. Create new anchor:
```bash
stella evidence anchor create
```
2. Verify anchor chain:
```bash
stella evidence anchor verify --all
```
3. Export anchor for external archival:
```bash
stella evidence anchor export --latest --output anchor-$(date +%Y%m%dT%H%M%S).json
```
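The three steps can be combined so that an anchor is exported only after the chain verifies. The sketch below uses only the commands shown above; the function name `anchor_cycle`, its output-directory argument, and the distinct return codes are hypothetical, and it assumes the `stella` commands signal failure through their exit status.

```bash
# Sketch: create -> verify -> export, stopping at the first failure.
# ASSUMPTION: stella commands exit non-zero on failure.

anchor_cycle() {
  local out_dir="${1:-.}"
  local stamp
  # UTC timestamp so archived anchors sort consistently across hosts.
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"

  stella evidence anchor create || {
    echo "anchor create failed" >&2
    return 1
  }
  stella evidence anchor verify --all || {
    echo "anchor chain verification failed; skipping export" >&2
    return 2
  }
  stella evidence anchor export --latest --output "${out_dir}/anchor-${stamp}.json"
}
```

Exporting only after a successful verification keeps a broken chain out of external archives; a return code of 2 is the cue to follow INC-003.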
### SP-004: Storage Cleanup
**Frequency:** Monthly or when storage alerts trigger
**Duration:** Varies
1. Review storage usage:
```bash
stella evidence storage stats
```
2. Apply retention policy:
```bash
# Dry run first
stella evidence cleanup --apply-retention --dry-run
# Execute cleanup
stella evidence cleanup --apply-retention
```
3. Archive old evidence (if required):
```bash
stella evidence archive --older-than 365d --output /archive/evidence-$(date +%Y%m).tar
```
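Because cleanup deletes data, it can help to force the dry run and an explicit confirmation into one step. This is a minimal sketch using the two cleanup commands from step 2; the function name `cleanup_with_confirmation` and the typed `yes` prompt are illustrative choices, not CLI features.

```bash
# Sketch: always show the dry run, then require a typed "yes" before deleting.

cleanup_with_confirmation() {
  local answer=""
  # Show what the retention policy would remove.
  stella evidence cleanup --apply-retention --dry-run || return 1

  printf 'Apply the retention policy for real? Type "yes" to continue: '
  # An empty or missing answer (for example EOF) counts as "no".
  read -r answer || answer=""
  if [ "$answer" != "yes" ]; then
    echo "Aborted; nothing was deleted."
    return 0
  fi
  stella evidence cleanup --apply-retention
}
```

Anything other than an exact `yes` leaves the locker untouched, so the wrapper also does nothing when run without a terminal.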
---
## Incident Procedures
### INC-001: Integrity Verification Failure
**Symptoms:**
- Alert: `StellaEvidenceIntegrityFailure`
- Verification reports hash mismatch
**Investigation:**
```bash
# Get failure details
stella evidence verify-report --latest --filter failed --format json > /tmp/integrity-failures.json
# Check specific artifact
stella evidence inspect <artifact-id>
# Check provenance
stella evidence provenance show <artifact-id>
```
**Resolution:**
1. **Isolated corruption:**
```bash
# Attempt recovery from replica (if available)
stella evidence recover --id <artifact-id> --source replica
# If no replica, mark as corrupted
stella evidence mark-corrupted --id <artifact-id> --reason "hash-mismatch"
```
2. **Widespread corruption:**
- Stop evidence ingestion
- Identify corruption extent
- Restore from backup if necessary
- Escalate to L3
3. **False positive (software bug):**
- Verify with multiple hash implementations
- Check for recent software updates
- Report bug if confirmed
### INC-002: Evidence Retrieval Failure
**Symptoms:**
- Alert: `StellaEvidenceRetrievalFailed`
- API returning 404 for known artifacts
**Investigation:**
```bash
# Check if artifact exists
stella evidence exists <artifact-id>
# Check index
stella evidence index lookup <artifact-id>
# Check storage backend
stella evidence storage check <artifact-id>
```
**Resolution:**
1. **Index corruption:**
```bash
# Rebuild index
stella evidence index rebuild
```
2. **Storage backend issue:**
```bash
# Check storage health
stella doctor --check check.storage.evidencelocker
# Verify storage connectivity
stella evidence storage test
```
3. **File system issue:**
- Check disk health
- Verify file permissions
- Check mount status
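The permission and mount checks can be scripted with standard tools. This is a sketch for a file-system-backed locker; the default path is the one used in DR-001 and may differ in a given deployment, and the function name `check_evidence_fs` is hypothetical. Disk health (for example SMART data) is hardware-specific and not covered here.

```bash
# Sketch: confirm the evidence directory exists, is accessible, and is mounted.
# ASSUMPTION: the locker is backed by a local or mounted file system.

check_evidence_fs() {
  local path="${1:-/var/lib/stellaops/evidence}"
  local usage mount

  if [ ! -d "$path" ]; then
    echo "FAIL: $path does not exist or is not a directory"
    return 1
  fi
  if [ ! -r "$path" ] || [ ! -w "$path" ] || [ ! -x "$path" ]; then
    echo "FAIL: current user lacks read/write/traverse permission on $path"
    return 1
  fi

  # df -P prints one POSIX-format line per file system; line 2 holds the data.
  usage="$(df -P "$path" | awk 'NR==2 {print $5}')"
  mount="$(df -P "$path" | awk 'NR==2 {print $6}')"
  echo "OK: $path is on the file system mounted at $mount (${usage} used)"
}
```

Run it as the service account rather than as root, since root passes the permission tests regardless of file modes. A mount point of `/` for a path that should live on a dedicated volume suggests the volume is not mounted.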
### INC-003: Anchor Chain Break
**Symptoms:**
- Alert: `StellaMerkleAnchorChainBroken`
- Anchor verification fails
**Investigation:**
```bash
# Check anchor chain
stella evidence anchor verify --all --verbose
# Find break point
stella evidence anchor list --show-links
# Inspect specific anchor
stella evidence anchor inspect <anchor-id>
```
**Resolution:**
1. **Single broken link:**
```bash
# Attempt to recover from backup
stella evidence anchor recover --id <anchor-id> --source backup
```
2. **Multiple breaks:**
- Stop new anchoring
- Assess extent of damage
- Restore from backup or rebuild chain
3. **Create new chain segment:**
```bash
# Start new chain (preserves old chain as archived)
stella evidence anchor new-chain --reason "chain-break-recovery"
```
### INC-004: Storage Full
**Symptoms:**
- Alert: `StellaEvidenceStorageFull`
- Ingestion failing
**Immediate Actions:**
```bash
# Check storage usage
stella evidence storage stats
# Emergency cleanup of temporary files
stella evidence cleanup --temp-only
# Find large/old artifacts
stella evidence storage analyze --sort size --limit 20
```
**Resolution:**
1. **Apply retention policy:**
```bash
stella evidence cleanup --apply-retention --aggressive
```
2. **Archive old evidence:**
```bash
stella evidence archive --older-than 180d --compress
```
3. **Expand storage:**
- Follow cloud provider procedure
- Or add additional storage volume
---
## Disaster Recovery
### DR-001: Full Evidence Locker Recovery
**Prerequisites:**
- Backup available
- Target storage provisioned
- Recovery environment ready
**Procedure:**
1. Provision new storage:
```bash
stella evidence storage provision --size <size>
```
2. Restore from backup:
```bash
# List available backups
stella backup list --type evidence-locker
# Restore
stella evidence restore --backup-id <backup-id> --target /var/lib/stellaops/evidence
```
3. Verify restoration:
```bash
stella evidence verify --mode full
stella evidence anchor verify --all
```
4. Update service configuration:
```bash
stella config set EvidenceLocker:Path /var/lib/stellaops/evidence
stella service restart
```
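Steps 3 and 4 can be tied together so the service is only pointed at the restored locker after both verifications pass. This is a sketch using the commands above; the function name `dr_finalize` is hypothetical, and it assumes the verification commands exit non-zero on failure.

```bash
# Sketch: switch the service over only if the restored data verifies.
# ASSUMPTION: stella verification commands exit non-zero on failure.

dr_finalize() {
  local path="${1:-/var/lib/stellaops/evidence}"

  stella evidence verify --mode full || {
    echo "integrity verification failed; not switching over" >&2
    return 1
  }
  stella evidence anchor verify --all || {
    echo "anchor chain verification failed; not switching over" >&2
    return 1
  }
  # Restart only if the configuration change succeeded.
  stella config set EvidenceLocker:Path "$path" && stella service restart
}
```

Keeping the configuration change behind both checks means a bad restore never becomes the live locker.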
### DR-002: Point-in-Time Recovery
To recover to a specific point in time:
1. Identify target anchor:
```bash
stella evidence anchor list --before <timestamp>
```
2. Restore to that point:
```bash
stella evidence restore --to-anchor <anchor-id>
```
3. Verify integrity:
```bash
stella evidence verify --mode full --to-anchor <anchor-id>
```
---
## Offline Mode Operations
### Preparing Offline Evidence Pack
```bash
# Export evidence for specific artifact
stella evidence export --digest <artifact-digest> --output evidence-pack.tar.gz
# Export with all dependencies
stella evidence export --digest <artifact-digest> --include-deps --output evidence-full.tar.gz
```
### Verifying Evidence Offline
```bash
# Verify evidence pack without network
stella evidence verify --offline --input evidence-pack.tar.gz
# Replay verdict using evidence
stella replay --evidence evidence-pack.tar.gz --output verdict.json
```
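On an air-gapped host it is worth refusing to replay a verdict from a pack that has not verified. This is a minimal sketch combining the two commands above; the function name `offline_replay` and its arguments are hypothetical, and it assumes `stella evidence verify --offline` exits non-zero when the pack is invalid.

```bash
# Sketch: replay a verdict only from an evidence pack that verifies offline.
# ASSUMPTION: the offline verify command exits non-zero for an invalid pack.

offline_replay() {
  local pack="$1"
  local verdict="${2:-verdict.json}"

  [ -f "$pack" ] || { echo "evidence pack not found: $pack" >&2; return 2; }

  stella evidence verify --offline --input "$pack" || {
    echo "evidence pack failed verification; refusing to replay" >&2
    return 1
  }
  stella replay --evidence "$pack" --output "$verdict"
}
```

For example, `offline_replay evidence-pack.tar.gz` writes `verdict.json` in the current directory if the pack verifies.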
---
## Monitoring Dashboard
Access: Grafana → Dashboards → Stella Ops → Evidence Locker
Key panels:
- Artifact ingestion rate
- Retrieval latency
- Storage utilization trend
- Integrity check status
- Anchor chain health
---
## Evidence Capture
For any incident, capture a diagnostics bundle:
```bash
stella evidence diagnostics --output /tmp/evidence-diag-$(date +%Y%m%dT%H%M%S).tar.gz
```
The bundle includes:
- Index status
- Storage stats
- Recent anchor chain
- Integrity check results
- Operation audit log
---
## Escalation Path
1. **L1 (On-call):** Standard procedures, cleanup operations
2. **L2 (Platform team):** Index rebuild, anchor issues
3. **L3 (Architecture):** Chain recovery, DR procedures
---
_Last updated: 2026-01-17 (UTC)_