git.stella-ops.org/docs/reachability/operator-runbook.md
2026-01-28 02:30:48 +02:00
# Operator Runbook
## Overview
This runbook provides operational procedures for managing the eBPF reachability evidence collection system.
## Monitoring
### Key Metrics
Monitor these metrics for system health:
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `stellaops_signals_events_total` | Total events collected | N/A (info) |
| `stellaops_signals_events_rate` | Events per second | > 100,000 (high load) |
| `stellaops_signals_ringbuf_usage` | Ring buffer utilization % | > 80% (overflow risk) |
| `stellaops_signals_drops_total` | Events dropped | > 0 (investigate) |
| `stellaops_signals_enrich_latency_p99` | Enrichment latency | > 50ms (degraded) |
| `stellaops_signals_chunks_signed` | Signed chunks count | N/A (info) |
| `stellaops_signals_rekor_failures` | Rekor submission failures | > 0 (investigate) |
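
The alert thresholds in the table can be checked mechanically against a metrics scrape. The following is a minimal sketch, not part of the `stella` tooling. It assumes the endpoint emits plain Prometheus text lines of the form `name value` with no trailing timestamp, that `stellaops_signals_ringbuf_usage` is reported as a percentage from 0 to 100, and that `stellaops_signals_enrich_latency_p99` is reported in milliseconds. The function name `check_signals_metrics` is hypothetical.

```bash
# Read Prometheus text on stdin and print one line per breached threshold.
# Exits non-zero if any threshold from the Key Metrics table is breached.
# Assumption: each sample line is "name value" (labels allowed, no timestamp).
check_signals_metrics() {
  awk '
    /^#/ || NF < 2 { next }            # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/[{].*/, "", name)           # drop a label set if one is present
      value = $NF + 0                  # assumes the value is the last field
    }
    name == "stellaops_signals_events_rate" && value > 100000 {
      print "ALERT high load: events_rate=" value; bad = 1
    }
    name == "stellaops_signals_ringbuf_usage" && value > 80 {
      print "ALERT overflow risk: ringbuf_usage=" value; bad = 1
    }
    name == "stellaops_signals_drops_total" && value > 0 {
      print "ALERT events dropped: drops_total=" value; bad = 1
    }
    name == "stellaops_signals_enrich_latency_p99" && value > 50 {
      print "ALERT degraded: enrich_latency_p99=" value; bad = 1
    }
    name == "stellaops_signals_rekor_failures" && value > 0 {
      print "ALERT rekor failures: rekor_failures=" value; bad = 1
    }
    END { exit bad }
  '
}
```

A typical invocation would be `curl -s localhost:9090/metrics | check_signals_metrics`, which prints nothing and exits 0 when every metric is within its threshold.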
### Health Checks
```bash
# Quick health check
stella signals health
# Detailed status
stella signals status --verbose
# Prometheus metrics
curl localhost:9090/metrics | grep stellaops_signals
```
### Log Analysis
```bash
# View recent logs
journalctl -u stellaops-signals --since "1 hour ago"
# Filter by severity
journalctl -u stellaops-signals -p err
# Follow live
journalctl -u stellaops-signals -f
```
## Common Issues
### Issue: Probe Failed to Attach
**Symptoms:**
```
Error: Failed to attach tracepoint/syscalls/sys_enter_openat: permission denied
```
**Diagnosis:**
```bash
# Check capabilities
getcap /usr/bin/stella
# Check kernel config
cat /boot/config-$(uname -r) | grep CONFIG_BPF
# Check seccomp/AppArmor
dmesg | grep -i "bpf\|seccomp\|apparmor"
```
**Resolution:**
1. Ensure proper capabilities:
```bash
sudo setcap cap_bpf,cap_perfmon,cap_sys_ptrace+ep /usr/bin/stella
```
2. Or run as root:
```bash
sudo stella signals start
```
3. Check that AppArmor or SELinux is not blocking the probe from attaching
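
The capability check from the diagnosis step can be turned into a quick comparison against the three capabilities granted by the `setcap` command above. This is a minimal sketch; the helper name `missing_caps` is hypothetical. Because the exact `getcap` output format differs between libcap versions, it only looks for each capability name as a substring.

```bash
# Print each required capability that is absent from the given getcap output.
# usage: missing_caps "$(getcap /usr/bin/stella)"
missing_caps() {
  getcap_output=$1
  for cap in cap_bpf cap_perfmon cap_sys_ptrace; do
    case "$getcap_output" in
      *"$cap"*) ;;                     # capability is present
      *) printf '%s\n' "$cap" ;;       # capability is missing
    esac
  done
}
```

Empty output means all three capabilities are set on the binary. Any name printed is a capability to add with `setcap`.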
---
### Issue: Ring Buffer Overflow
**Symptoms:**
```
Warning: Ring buffer full, 1523 events dropped
```
**Diagnosis:**
```bash
# Check buffer usage
stella signals stats | grep ringbuf
# Check event rate
stella signals stats | grep rate
```
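
To gauge how severe an overflow was, the drop counts reported in the log warnings can be summed. The sketch below assumes every warning contains the exact phrase shown under Symptoms (`Ring buffer full, N events dropped`), possibly embedded in a longer line; the function name `sum_dropped_events` is hypothetical.

```bash
# Sum the N values from "Ring buffer full, N events dropped" lines on stdin.
sum_dropped_events() {
  awk '
    match($0, /Ring buffer full, [0-9]+ events dropped/) {
      msg = substr($0, RSTART, RLENGTH)
      sub(/^Ring buffer full, /, "", msg)   # msg now starts with the count
      total += msg + 0                      # numeric prefix of msg
    }
    END { print total + 0 }                 # prints 0 when nothing matched
  '
}
```

For example, `journalctl -u stellaops-signals --since "1 hour ago" | sum_dropped_events` prints the total number of events reported as dropped in the last hour.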
**Resolution:**
1. Increase buffer size:
```yaml
signals:
  ring_buffer_size: 1048576 # 1 MiB
```
2. Enable rate limiting:
```yaml
signals:
  max_events_per_second: 50000
```
3. Add more aggressive filtering:
```yaml
signals:
  filters:
    paths:
      denylist:
        - /proc/**
        - /sys/**
```
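
When choosing a new buffer size, note that the kernel requires a BPF ring buffer to be a power-of-two multiple of the page size. The sketch below estimates a size from the expected event rate. It assumes `ring_buffer_size` is passed to the kernel unchanged and that pages are 4 KiB. The per-event size and burst window are hypothetical inputs to measure or estimate, and the function name `ringbuf_size_for` is also hypothetical.

```bash
# Smallest power-of-two size (at least one 4 KiB page) that can hold one burst.
# usage: ringbuf_size_for EVENTS_PER_SEC BYTES_PER_EVENT BURST_MS
ringbuf_size_for() {
  needed=$(( $1 * $2 * $3 / 1000 ))   # bytes produced during one burst window
  size=4096                           # start at one 4 KiB page
  while [ "$size" -lt "$needed" ]; do
    size=$(( size * 2 ))              # stay on powers of two
  done
  printf '%s\n' "$size"
}
```

With 50,000 events per second, 128 bytes per event, and a 100 ms burst window, `ringbuf_size_for 50000 128 100` prints `1048576`, which matches the 1 MiB value suggested in step 1.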
---
### Issue: High Memory Usage
**Symptoms:**
- OOM kills
- High RSS in process stats
**Diagnosis:**
```bash
# Check memory breakdown
stella signals stats --memory
# Check cache sizes
stella signals cache-stats
```
**Resolution:**
1. Reduce cache sizes:
```yaml
signals:
  resources:
    symbol_cache_max_entries: 50000
    max_cache_memory_mb: 128
```
2. Reduce container cache TTL:
```yaml
signals:
  resources:
    container_cache_ttl_seconds: 60
```
---
### Issue: Symbol Resolution Failures
**Symptoms:**
```
Symbol: addr:0x7f4a3b2c1000 (unresolved)
```
**Diagnosis:**
```bash
# Check if binary has symbols
nm /path/to/binary | head
# Check whether the binary is stripped (no output means symbols were removed)
file /path/to/binary | grep "not stripped"
```
**Resolution:**
1. Install debug symbols:
```bash
# Debian/Ubuntu
apt install libc6-dbg
# RHEL/CentOS
debuginfo-install glibc
```
2. Accept address-only evidence (still valuable for correlation)
---
### Issue: Container Resolution Failures
**Symptoms:**
```
container_id: unknown:1234567890
```
**Diagnosis:**
```bash
# Check cgroup path format
cat /proc/<pid>/cgroup
# Verify container runtime
docker ps
crictl ps
```
**Resolution:**
1. Verify Zastava integration is running
2. Check that the container runtime is supported (containerd/Docker/CRI-O)
3. Restart collector to refresh container mappings
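
When inspecting `/proc/<pid>/cgroup` by hand, it helps to know that Docker, containerd, and CRI-O normally embed the 64-character hexadecimal container ID in the cgroup path. The sketch below extracts that ID. It illustrates the path convention only and is not the collector's own resolution logic; the function name `container_id_from_cgroup` is hypothetical.

```bash
# Read /proc/<pid>/cgroup content on stdin and print the first 64-hex-digit
# container ID found. Prints nothing if no such ID is present.
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | head -n 1
}
```

Run it as `container_id_from_cgroup < /proc/<pid>/cgroup`. An empty result for a process known to run in a container suggests a cgroup layout that differs from the common runtime conventions.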
---
### Issue: Evidence Chain Verification Failure
**Symptoms:**
```
$ stella signals verify-chain /var/lib/stellaops/evidence/
Chain Status: ✗ INVALID
Error: Chain broken at chunk 42
```
**Diagnosis:**
```bash
# Get detailed report
stella signals verify-chain /var/lib/stellaops/evidence/ --verbose --format json
```
**Resolution:**
1. Check for missing chunk files
2. Check for disk corruption
3. If the gap was caused by an intentional restart, document it in the audit trail
4. Re-initialize chain if necessary:
```bash
stella signals reset-chain --confirm
```
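
Checking for missing chunk files amounts to looking for gaps in a sequence. The sketch below is a minimal example that assumes chunks carry contiguous integer sequence numbers. This runbook does not describe the chunk file naming scheme, so extracting the numbers from file names is left to the operator. The function name `missing_sequence_numbers` is hypothetical.

```bash
# Read integers (one per line, any order) on stdin and print every number
# missing between the smallest and the largest.
missing_sequence_numbers() {
  sort -n | awk '
    NR > 1 { for (i = prev + 1; i < $1; i++) print i }   # gap since last value
    { prev = $1 }
  '
}
```

Feeding it `40`, `41`, `43`, `44` prints `42`, the position at which the chain in the Symptoms example would break.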
---
### Issue: Rekor Submission Failures
**Symptoms:**
```
Warning: Failed to submit to Rekor: connection refused
```
**Diagnosis:**
```bash
# Check Rekor connectivity
curl https://rekor.sigstore.dev/api/v1/log
# Check signing service
stella signer status
```
**Resolution:**
1. Check network connectivity to Rekor
2. Verify Fulcio/OIDC tokens are valid
3. Switch to offline mode temporarily:
```yaml
signals:
  signing:
    submit_to_rekor: false
```
4. Retry failed submissions later:
```bash
stella signals resubmit-pending
```
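
Retrying later can be automated with exponential backoff so that a transient Rekor outage does not require manual follow-up. The sketch below is a generic wrapper; the name `retry_with_backoff` is hypothetical. It assumes the wrapped command exits non-zero on failure, which is not confirmed here for `stella signals resubmit-pending`.

```bash
# Run a command until it succeeds, doubling the delay after each failure.
# usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0                       # success: stop retrying
    [ "$n" -ge "$attempts" ] && return 1   # out of attempts: give up
    sleep "$delay"
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
}
```

For example, `retry_with_backoff 5 30 stella signals resubmit-pending` would try up to five times, waiting 30, 60, 120, and 240 seconds between attempts.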
## Operational Procedures
### Procedure: Rotate Evidence Directory
When the evidence directory is full or needs to be archived:
```bash
# 1. Stop collector gracefully
stella signals stop
# 2. Archive current evidence (set the name once so every step uses the same file)
ARCHIVE="evidence-$(date +%Y%m%d).tar.gz"
tar -czvf "$ARCHIVE" /var/lib/stellaops/evidence/
# 3. Verify archive integrity
stella signals verify-chain "$ARCHIVE"
# 4. Move to long-term storage
aws s3 cp "$ARCHIVE" s3://evidence-archive/
# 5. Clear old evidence (keep chain state)
stella signals cleanup --keep-chain-state --older-than 7d
# 6. Restart collector
stella signals start
```
### Procedure: Update Collector
```bash
# 1. Check current version
stella version
# 2. Download new version
curl -fsSL https://stella.ops/install.sh | bash -s -- --version 1.2.0
# 3. Verify probe compatibility
stella signals test-probes
# 4. Restart service
sudo systemctl restart stellaops-signals
# 5. Verify operation
stella signals status
```
### Procedure: Recover from Crash
```bash
# 1. Check service status
systemctl status stellaops-signals
# 2. Check for core dumps
coredumpctl list | grep stella
# 3. Review logs for cause
journalctl -u stellaops-signals --since "30 min ago"
# 4. Verify chain state
stella signals verify-chain /var/lib/stellaops/evidence/
# 5. Restart service
sudo systemctl start stellaops-signals
# 6. Monitor for recurrence
watch -n 5 'stella signals stats'
```
### Procedure: Air-Gap Evidence Export
```bash
# 1. Create signed export bundle
stella signals export \
--from 2026-01-01 \
--to 2026-01-31 \
--include-proofs \
--output january-evidence.tar.gz
# 2. Generate verification manifest
stella signals manifest january-evidence.tar.gz > manifest.json
# 3. Transfer to verification system
scp january-evidence.tar.gz manifest.json airgap-verifier:
# 4. On verifier, import and verify
stella signals import january-evidence.tar.gz
stella signals verify-chain --offline /imported/evidence/
```
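
Independently of the manifest produced by `stella signals manifest`, whose format is not described here, the integrity of the transferred files can be confirmed with plain checksums. The sketch below uses `sha256sum` from GNU coreutils; the helper names `write_checksums` and `verify_checksums` and the file name `SHA256SUMS` are hypothetical.

```bash
# Record SHA-256 checksums for the given files.
# usage: write_checksums CHECKSUM_FILE FILE...
write_checksums() {
  out=$1
  shift
  sha256sum "$@" > "$out"
}

# Recompute and compare; succeeds only if every listed file is unchanged.
# usage: verify_checksums CHECKSUM_FILE   (run from the directory holding the files)
verify_checksums() {
  sha256sum -c "$1" > /dev/null 2>&1
}
```

On the source system, run `write_checksums SHA256SUMS january-evidence.tar.gz manifest.json` and transfer `SHA256SUMS` with the bundle. On the verifier, `verify_checksums SHA256SUMS` returns non-zero if a file was altered or truncated in transit.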
## Configuration Reference
### Full Configuration Example
```yaml
signals:
  enabled: true
  output_directory: /var/lib/stellaops/evidence
  # Ring buffer (kernel space)
  ring_buffer_size: 262144 # 256 KiB
  # Rate limiting
  max_events_per_second: 0 # unlimited
  # Rotation
  rotation:
    max_size_mb: 100
    max_age_hours: 1
  # Signing
  signing:
    enabled: true
    key_id: fulcio
    submit_to_rekor: true
  # Probes
  probes:
    sys_enter_openat: true
    sched_process_exec: true
    inet_sock_set_state: true
    libc_connect: true
    libc_accept: true
    openssl_read: true
    openssl_write: true
  # Filters
  filters:
    target_containers: []
    target_namespaces: []
    paths:
      allowlist:
        - /etc/**
        - /var/lib/**
      denylist:
        - /proc/**
        - /sys/**
        - /dev/**
    networks:
      allowlist: []
      denylist:
        - 127.0.0.0/8
  # Resources
  resources:
    max_cache_memory_mb: 256
    symbol_cache_max_entries: 100000
    container_cache_ttl_seconds: 300
  # Observability
  metrics:
    enabled: true
    port: 9090
  logging:
    level: info
    format: json
```
## Emergency Procedures
### Emergency: Disable Collection
If the collector is causing system issues:
```bash
# Immediate stop
sudo systemctl stop stellaops-signals
# Disable on boot
sudo systemctl disable stellaops-signals
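# Verify no probes remain loaded (they are normally released when the collector exits)
sudo bpftool prog show | grep -i stella
sudo bpftool link show
# If a link is still listed, try force-detaching it by ID (not every link type supports this)
sudo bpftool link detach id <link-id>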
```
### Emergency: Clear Corrupted State
If state is corrupted and normal recovery fails:
```bash
# Stop service
sudo systemctl stop stellaops-signals
# Backup current state
cp -r /var/lib/stellaops/evidence /var/lib/stellaops/evidence.backup
# Clear state
rm -rf /var/lib/stellaops/evidence/*
# Re-initialize
stella signals init
# Start fresh
sudo systemctl start stellaops-signals
```
## Support
For issues not covered in this runbook:
1. Check [GitHub Issues](https://github.com/stellaops/stellaops/issues)
2. Search [Documentation](https://docs.stella.ops/)
3. Contact support with:
- Output of `stella signals status --verbose`
- Relevant log excerpts
- Kernel version (`uname -a`)
- Configuration file (sanitized)