# Operator Runbook

## Overview

This runbook provides operational procedures for managing the eBPF reachability evidence collection system.

## Monitoring

### Key Metrics

Monitor these metrics for system health:

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `stellaops_signals_events_total` | Total events collected | N/A (info) |
| `stellaops_signals_events_rate` | Events per second | > 100,000 (high load) |
| `stellaops_signals_ringbuf_usage` | Ring buffer utilization % | > 80% (overflow risk) |
| `stellaops_signals_drops_total` | Events dropped | > 0 (investigate) |
| `stellaops_signals_enrich_latency_p99` | Enrichment latency | > 50ms (degraded) |
| `stellaops_signals_chunks_signed` | Signed chunks count | N/A (info) |
| `stellaops_signals_rekor_failures` | Rekor submission failures | > 0 (investigate) |

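The thresholds in the table can be checked mechanically against the Prometheus endpoint. The following is a minimal sketch, assuming the metrics are exposed as unlabelled samples of the form `<name> <value>`, that `stellaops_signals_ringbuf_usage` is a percentage from 0 to 100, and that the latency metric is reported in milliseconds. `check_signals_thresholds` is a hypothetical helper, not part of the `stella` CLI.

```bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints one line per breached threshold from the Key Metrics table.
# Assumes unlabelled samples ("<name> <value>"); labelled series are ignored.
check_signals_thresholds() {
  awk '
    /^#/ { next }  # skip HELP and TYPE comment lines
    $1 == "stellaops_signals_events_rate"        && $2 + 0 > 100000 { print "ALERT high load: events_rate=" $2 }
    $1 == "stellaops_signals_ringbuf_usage"      && $2 + 0 > 80     { print "ALERT overflow risk: ringbuf_usage=" $2 }
    $1 == "stellaops_signals_drops_total"        && $2 + 0 > 0      { print "ALERT drops: drops_total=" $2 }
    $1 == "stellaops_signals_enrich_latency_p99" && $2 + 0 > 50     { print "ALERT degraded: enrich_latency_p99=" $2 }
    $1 == "stellaops_signals_rekor_failures"     && $2 + 0 > 0      { print "ALERT rekor: rekor_failures=" $2 }
  '
}
```

A typical invocation would be `curl -s localhost:9090/metrics | check_signals_thresholds`. No output means no threshold was breached.
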
### Health Checks

```bash
# Quick health check
stella signals health

# Detailed status
stella signals status --verbose

# Prometheus metrics
curl localhost:9090/metrics | grep stellaops_signals
```

### Log Analysis

```bash
# View recent logs
journalctl -u stellaops-signals --since "1 hour ago"

# Filter by severity
journalctl -u stellaops-signals -p err

# Follow live
journalctl -u stellaops-signals -f
```

## Common Issues

### Issue: Probe Failed to Attach

**Symptoms:**
```
Error: Failed to attach tracepoint/syscalls/sys_enter_openat: permission denied
```

**Diagnosis:**
```bash
# Check capabilities
getcap /usr/bin/stella

# Check kernel config
grep CONFIG_BPF /boot/config-$(uname -r)

# Check seccomp/AppArmor
dmesg | grep -i "bpf\|seccomp\|apparmor"
```

**Resolution:**
1. Ensure proper capabilities:
   ```bash
   # CAP_BPF and CAP_PERFMON exist only on Linux 5.8+; older kernels need CAP_SYS_ADMIN
   sudo setcap cap_bpf,cap_perfmon,cap_sys_ptrace+ep /usr/bin/stella
   ```
2. Or run as root:
   ```bash
   sudo stella signals start
   ```
3. Check that AppArmor or SELinux isn't blocking the collector

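The kernel config check in the diagnosis step can be narrowed to the options that tracepoint-based BPF programs generally depend on. This is a sketch under assumptions: the option list (`CONFIG_BPF`, `CONFIG_BPF_SYSCALL`, `CONFIG_BPF_EVENTS`) is the usual minimum for attaching BPF programs to tracepoints, not a list confirmed for this collector, and `missing_bpf_options` is a hypothetical helper name.

```bash
# Hypothetical helper: reads a kernel config on stdin and prints every
# required option that is not set to "y". The option list is an assumption.
missing_bpf_options() {
  awk -F= '
    $2 == "y" { on[$1] = 1 }  # remember every option built into the kernel
    END {
      n = split("CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_BPF_EVENTS", want, " ")
      for (i = 1; i <= n; i++)
        if (!(want[i] in on)) print want[i]
    }
  '
}
```

Run it as `missing_bpf_options < /boot/config-$(uname -r)`. Empty output means all three options are enabled, so the failure more likely comes from capabilities or a security module.
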
---

### Issue: Ring Buffer Overflow

**Symptoms:**
```
Warning: Ring buffer full, 1523 events dropped
```

**Diagnosis:**
```bash
# Check buffer usage
stella signals stats | grep ringbuf

# Check event rate
stella signals stats | grep rate
```

**Resolution:**
1. Increase buffer size:
   ```yaml
   signals:
     ring_buffer_size: 1048576  # 1 MiB
   ```
2. Enable rate limiting:
   ```yaml
   signals:
     max_events_per_second: 50000
   ```
3. Add more aggressive filtering:
   ```yaml
   signals:
     filters:
       paths:
         denylist:
           - /proc/**
           - /sys/**
   ```

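When picking a new buffer size, note that kernel BPF ring buffer maps (`BPF_MAP_TYPE_RINGBUF`) require a size that is a power of two and a multiple of the page size. Assuming `ring_buffer_size` is passed to the kernel unchanged, which this runbook does not confirm, a value can be rounded up as sketched below. `round_up_ringbuf_size` is a hypothetical helper, and the 4096-byte default page size is an assumption (check with `getconf PAGESIZE`).

```bash
# Hypothetical helper: round a requested size in bytes up to the next
# power-of-two multiple of the page size.
# $1 = wanted size in bytes, $2 = page size (assumed 4096 if omitted)
round_up_ringbuf_size() {
  local wanted=$1
  local size=${2:-4096}  # start at one page and keep doubling
  while [ "$size" -lt "$wanted" ]; do
    size=$((size * 2))
  done
  echo "$size"
}
```

For example, `round_up_ringbuf_size 1000000` prints `1048576`, the value used in the first resolution step.
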
---

### Issue: High Memory Usage

**Symptoms:**
- OOM kills
- High RSS in process stats

**Diagnosis:**
```bash
# Check memory breakdown
stella signals stats --memory

# Check cache sizes
stella signals cache-stats
```

**Resolution:**
1. Reduce cache sizes:
   ```yaml
   signals:
     resources:
       symbol_cache_max_entries: 50000
       max_cache_memory_mb: 128
   ```
2. Reduce container cache TTL:
   ```yaml
   signals:
     resources:
       container_cache_ttl_seconds: 60
   ```

---

### Issue: Symbol Resolution Failures

**Symptoms:**
```
Symbol: addr:0x7f4a3b2c1000 (unresolved)
```

**Diagnosis:**
```bash
# Check if binary has symbols
nm /path/to/binary | head

# Check if debuginfo available
file /path/to/binary | grep "not stripped"
```

**Resolution:**
1. Install debug symbols:
   ```bash
   # Debian/Ubuntu
   apt install libc6-dbg

   # RHEL/CentOS
   debuginfo-install glibc
   ```
2. Accept address-only evidence (still valuable for correlation)

---

### Issue: Container Resolution Failures

**Symptoms:**
```
container_id: unknown:1234567890
```

**Diagnosis:**
```bash
# Check cgroup path format
cat /proc/<pid>/cgroup

# Verify container runtime
docker ps
crictl ps
```

**Resolution:**
1. Verify that the Zastava integration is running
2. Check that the container runtime is supported (containerd/Docker/CRI-O)
3. Restart the collector to refresh container mappings

---

### Issue: Evidence Chain Verification Failure

**Symptoms:**
```
$ stella signals verify-chain /var/lib/stellaops/evidence/
Chain Status: ✗ INVALID
Error: Chain broken at chunk 42
```

**Diagnosis:**
```bash
# Get detailed report
stella signals verify-chain /var/lib/stellaops/evidence/ --verbose --format json
```

**Resolution:**
1. Check for missing chunk files
2. Check for disk corruption
3. If the break was caused by an intentional restart, document the gap in the audit trail
4. Re-initialize the chain if necessary:
   ```bash
   stella signals reset-chain --confirm
   ```

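The first resolution step, checking for missing chunk files, amounts to finding gaps in the sequence of chunk numbers. The sketch below assumes chunks carry consecutive integer sequence numbers. The on-disk naming used in the usage note (`chunk-<N>.<ext>`) is purely hypothetical, since this runbook does not specify the file layout, and `find_chunk_gaps` is a hypothetical helper.

```bash
# Hypothetical helper: reads one chunk sequence number per line on stdin
# (any order) and prints every number missing between the lowest and highest.
find_chunk_gaps() {
  sort -n | awk '
    NR > 1 && $1 != prev + 1 {
      for (i = prev + 1; i < $1; i++) print i  # numbers skipped since the previous line
    }
    { prev = $1 }
  '
}
```

With the hypothetical naming above, the sequence numbers could be extracted and checked with `ls /var/lib/stellaops/evidence | sed -n 's/^chunk-\([0-9]*\)\..*$/\1/p' | find_chunk_gaps`. Adjust the pattern to the actual file names.
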
---

### Issue: Rekor Submission Failures

**Symptoms:**
```
Warning: Failed to submit to Rekor: connection refused
```

**Diagnosis:**
```bash
# Check Rekor connectivity
curl https://rekor.sigstore.dev/api/v1/log

# Check signing service
stella signer status
```

**Resolution:**
1. Check network connectivity to Rekor
2. Verify Fulcio/OIDC tokens are valid
3. Switch to offline mode temporarily:
   ```yaml
   signals:
     signing:
       submit_to_rekor: false
   ```
4. Retry failed submissions later:
   ```bash
   stella signals resubmit-pending
   ```

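Retrying failed submissions can be automated with a generic backoff wrapper around the resubmit command. This is a sketch, not behaviour built into `stella`: `retry_with_backoff` is a hypothetical helper, and it assumes `stella signals resubmit-pending` exits non-zero while submissions still fail.

```bash
# Hypothetical helper: run a command until it succeeds, doubling the wait
# between attempts.
# $1 = maximum attempts, $2 = initial delay in seconds, rest = command to run
retry_with_backoff() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))       # exponential backoff
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 30 stella signals resubmit-pending` tries up to five times, waiting 30, 60, 120, and 240 seconds between attempts.
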
## Operational Procedures

### Procedure: Rotate Evidence Directory

When the evidence directory is full or needs archiving:

```bash
# 1. Stop collector gracefully
stella signals stop

# 2. Archive current evidence (name computed once so every step uses the same file)
ARCHIVE="evidence-$(date +%Y%m%d).tar.gz"
tar -czvf "$ARCHIVE" /var/lib/stellaops/evidence/

# 3. Verify archive integrity
stella signals verify-chain "$ARCHIVE"

# 4. Move to long-term storage
aws s3 cp "$ARCHIVE" s3://evidence-archive/

# 5. Clear old evidence (keep chain state)
stella signals cleanup --keep-chain-state --older-than 7d

# 6. Restart collector
stella signals start
```

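To decide when rotation is due, the fill level of the evidence filesystem can be read from `df`. The sketch below parses POSIX-format `df -P` output. The helper names and the 80% default threshold are assumptions for illustration, not values taken from the collector.

```bash
# Hypothetical helpers for deciding whether the evidence filesystem is full.

# Reads `df -P` output on stdin and prints the usage percentage as an integer.
fs_usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Reads `df -P` output on stdin; succeeds if usage is at or above the
# threshold given in $1 (assumed default: 80).
needs_rotation() {
  [ "$(fs_usage_percent)" -ge "${1:-80}" ]
}
```

A check before running the procedure might look like `df -P /var/lib/stellaops/evidence | needs_rotation 80 && echo "rotate evidence"`.
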
### Procedure: Update Collector

```bash
# 1. Check current version
stella version

# 2. Download new version
curl -fsSL https://stella.ops/install.sh | bash -s -- --version 1.2.0

# 3. Verify probe compatibility
stella signals test-probes

# 4. Restart service
sudo systemctl restart stellaops-signals

# 5. Verify operation
stella signals status
```

### Procedure: Recover from Crash

```bash
# 1. Check service status
systemctl status stellaops-signals

# 2. Check for core dumps
coredumpctl list | grep stella

# 3. Review logs for cause
journalctl -u stellaops-signals --since "30 min ago"

# 4. Verify chain state
stella signals verify-chain /var/lib/stellaops/evidence/

# 5. Restart service
sudo systemctl start stellaops-signals

# 6. Monitor for recurrence
watch -n 5 'stella signals stats'
```

### Procedure: Air-Gap Evidence Export

```bash
# 1. Create signed export bundle
stella signals export \
  --from 2026-01-01 \
  --to 2026-01-31 \
  --include-proofs \
  --output january-evidence.tar.gz

# 2. Generate verification manifest
stella signals manifest january-evidence.tar.gz > manifest.json

# 3. Transfer to verification system
scp january-evidence.tar.gz manifest.json airgap-verifier:

# 4. On verifier, import and verify
stella signals import january-evidence.tar.gz
stella signals verify-chain --offline /imported/evidence/
```

## Configuration Reference

### Full Configuration Example

```yaml
signals:
  enabled: true
  output_directory: /var/lib/stellaops/evidence

  # Ring buffer (kernel space)
  ring_buffer_size: 262144  # 256 KiB

  # Rate limiting
  max_events_per_second: 0  # unlimited

  # Rotation
  rotation:
    max_size_mb: 100
    max_age_hours: 1

  # Signing
  signing:
    enabled: true
    key_id: fulcio
    submit_to_rekor: true

  # Probes
  probes:
    sys_enter_openat: true
    sched_process_exec: true
    inet_sock_set_state: true
    libc_connect: true
    libc_accept: true
    openssl_read: true
    openssl_write: true

  # Filters
  filters:
    target_containers: []
    target_namespaces: []
    paths:
      allowlist:
        - /etc/**
        - /var/lib/**
      denylist:
        - /proc/**
        - /sys/**
        - /dev/**
    networks:
      allowlist: []
      denylist:
        - 127.0.0.0/8

  # Resources
  resources:
    max_cache_memory_mb: 256
    symbol_cache_max_entries: 100000
    container_cache_ttl_seconds: 300

  # Observability
  metrics:
    enabled: true
    port: 9090
  logging:
    level: info
    format: json
```

## Emergency Procedures

### Emergency: Disable Collection

If the collector is causing system issues:

```bash
# Immediate stop
sudo systemctl stop stellaops-signals

# Disable on boot
sudo systemctl disable stellaops-signals

# Confirm no collector probes remain loaded
# (probes normally unload when the collector process exits)
sudo bpftool prog list | grep stella

# If a pinned link keeps a probe attached, look up its ID and detach it
sudo bpftool link list
sudo bpftool link detach id <link-id>
```

### Emergency: Clear Corrupted State

If the state is corrupted and normal recovery fails:

```bash
# Stop service
sudo systemctl stop stellaops-signals

# Backup current state
cp -r /var/lib/stellaops/evidence /var/lib/stellaops/evidence.backup

# Clear state
rm -rf /var/lib/stellaops/evidence/*

# Re-initialize
stella signals init

# Start fresh
sudo systemctl start stellaops-signals
```

## Support

For issues not covered in this runbook:

1. Check [GitHub Issues](https://github.com/stellaops/stellaops/issues)
2. Search [Documentation](https://docs.stella.ops/)
3. Contact support with:
   - Output of `stella signals status --verbose`
   - Relevant log excerpts
   - Kernel version (`uname -a`)
   - Configuration file (sanitized)