test fixes and new product advisories work
docs/reachability/operator-runbook.md
@@ -0,0 +1,467 @@
# Operator Runbook

## Overview

This runbook provides operational procedures for managing the eBPF reachability evidence collection system.

## Monitoring

### Key Metrics

Monitor these metrics for system health:

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `stellaops_signals_events_total` | Total events collected | N/A (info) |
| `stellaops_signals_events_rate` | Events per second | > 100,000 (high load) |
| `stellaops_signals_ringbuf_usage` | Ring buffer utilization % | > 80% (overflow risk) |
| `stellaops_signals_drops_total` | Events dropped | > 0 (investigate) |
| `stellaops_signals_enrich_latency_p99` | Enrichment latency (p99) | > 50 ms (degraded) |
| `stellaops_signals_chunks_signed` | Signed chunks count | N/A (info) |
| `stellaops_signals_rekor_failures` | Rekor submission failures | > 0 (investigate) |
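
The thresholds in the table can be checked ad hoc against a metrics scrape. The function below is a minimal sketch, not part of the collector: it assumes each metric is exposed as an unlabeled `name value` sample line and that the latency metric is reported in milliseconds. The function name `check_signals_thresholds` is hypothetical.

```bash
# Evaluate a Prometheus text-format scrape (read from stdin) against the
# alert thresholds listed in the Key Metrics table.
# Assumptions: unlabeled "name value" sample lines; latency in milliseconds.
check_signals_thresholds() {
  awk '
    $1 == "stellaops_signals_events_rate"        && $2 + 0 > 100000 { print "ALERT high load: " $1 "=" $2; bad = 1 }
    $1 == "stellaops_signals_ringbuf_usage"      && $2 + 0 > 80     { print "ALERT overflow risk: " $1 "=" $2; bad = 1 }
    $1 == "stellaops_signals_drops_total"        && $2 + 0 > 0      { print "ALERT investigate: " $1 "=" $2; bad = 1 }
    $1 == "stellaops_signals_enrich_latency_p99" && $2 + 0 > 50     { print "ALERT degraded: " $1 "=" $2; bad = 1 }
    $1 == "stellaops_signals_rekor_failures"     && $2 + 0 > 0      { print "ALERT investigate: " $1 "=" $2; bad = 1 }
    END { exit bad }   # non-zero exit status when any threshold is exceeded
  '
}
```

For example, `curl -s localhost:9090/metrics | check_signals_thresholds` prints one line per exceeded threshold and returns a non-zero status if any were found, which makes it usable from cron or a wrapper script.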

### Health Checks

```bash
# Quick health check
stella signals health

# Detailed status
stella signals status --verbose

# Prometheus metrics
curl localhost:9090/metrics | grep stellaops_signals
```

### Log Analysis

```bash
# View recent logs
journalctl -u stellaops-signals --since "1 hour ago"

# Filter by severity
journalctl -u stellaops-signals -p err

# Follow live
journalctl -u stellaops-signals -f
```

## Common Issues

### Issue: Probe Failed to Attach

**Symptoms:**
```
Error: Failed to attach tracepoint/syscalls/sys_enter_openat: permission denied
```

**Diagnosis:**
```bash
# Check capabilities
getcap /usr/bin/stella

# Check kernel config
grep CONFIG_BPF "/boot/config-$(uname -r)"

# Check seccomp/AppArmor
dmesg | grep -i "bpf\|seccomp\|apparmor"
```

**Resolution:**
1. Ensure proper capabilities:
   ```bash
   sudo setcap cap_bpf,cap_perfmon,cap_sys_ptrace+ep /usr/bin/stella
   ```
2. Or run as root:
   ```bash
   sudo stella signals start
   ```
3. Check that AppArmor or SELinux is not blocking the collector.
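
The kernel config check above can be made more specific than a broad `grep`. The sketch below reports which options are not built in. The option list (`CONFIG_BPF`, `CONFIG_BPF_SYSCALL`, `CONFIG_BPF_EVENTS`) is an assumed baseline for tracepoint-based eBPF programs, not the collector's documented requirements, and `missing_bpf_options` is a hypothetical helper name.

```bash
# Report kernel options commonly needed for eBPF tracing that are not
# enabled (=y) in the given kernel config file.
# Assumption: this option list is a reasonable baseline, not an official one.
missing_bpf_options() {
  local config_file=$1 opt missing=0
  for opt in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_BPF_EVENTS; do
    # Anchor on "=y" so CONFIG_BPF does not match CONFIG_BPF_SYSCALL.
    if ! grep -q "^${opt}=y" "$config_file"; then
      echo "missing: $opt"
      missing=1
    fi
  done
  return "$missing"
}
```

Run it as `missing_bpf_options "/boot/config-$(uname -r)"`. It prints nothing and returns 0 when all listed options are enabled.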

---

### Issue: Ring Buffer Overflow

**Symptoms:**
```
Warning: Ring buffer full, 1523 events dropped
```

**Diagnosis:**
```bash
# Check buffer usage
stella signals stats | grep ringbuf

# Check event rate
stella signals stats | grep rate
```

**Resolution:**
1. Increase buffer size:
   ```yaml
   signals:
     ring_buffer_size: 1048576  # 1 MiB
   ```
2. Enable rate limiting:
   ```yaml
   signals:
     max_events_per_second: 50000
   ```
3. Add more aggressive filtering:
   ```yaml
   signals:
     filters:
       paths:
         denylist:
           - /proc/**
           - /sys/**
   ```
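
When choosing a new `ring_buffer_size`, it can help to estimate how many bytes the buffer must absorb while the userspace reader is stalled. Kernel BPF ring buffers must be sized as a power of two that is a multiple of the page size. Whether the collector passes `ring_buffer_size` to the kernel unchanged is an assumption here, as are the helper names, the 4096-byte page size, and the example event size.

```bash
# Round a byte count up to the next power of two, with a 4096-byte floor
# (one page on most x86-64 systems; an assumption, check `getconf PAGESIZE`).
next_pow2() {
  local want=$1 size=4096
  while [ "$size" -lt "$want" ]; do
    size=$((size * 2))
  done
  echo "$size"
}

# Estimate the buffer size needed to ride out a reader stall:
#   rate (events/s) * average event size (bytes) * stall (ms) / 1000
ringbuf_size_for() {
  local rate=$1 event_bytes=$2 stall_ms=$3
  next_pow2 $(( rate * event_bytes * stall_ms / 1000 ))
}
```

For instance, `ringbuf_size_for 50000 128 100` models 50,000 events per second of 128 bytes each with a 100 ms stall, which needs 640,000 bytes and rounds up to 1048576, the 1 MiB value used in step 1.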

---

### Issue: High Memory Usage

**Symptoms:**
- OOM kills
- High RSS in process stats

**Diagnosis:**
```bash
# Check memory breakdown
stella signals stats --memory

# Check cache sizes
stella signals cache-stats
```

**Resolution:**
1. Reduce cache sizes:
   ```yaml
   signals:
     resources:
       symbol_cache_max_entries: 50000
       max_cache_memory_mb: 128
   ```
2. Reduce container cache TTL:
   ```yaml
   signals:
     resources:
       container_cache_ttl_seconds: 60
   ```
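
To confirm the "high RSS" symptom independently of the collector's own stats, the resident set size can be read from `/proc/<pid>/status`. This is a minimal sketch with hypothetical helper names. Comparing against the configured `max_cache_memory_mb` is only a rough guide, since RSS covers more than the caches.

```bash
# Print the resident set size in KiB from a /proc/<pid>/status-style file.
rss_kib() {
  awk '$1 == "VmRSS:" { print $2 }' "$1"
}

# Succeed when RSS exceeds the given limit in MiB.
rss_over_limit() {
  local status_file=$1 limit_mib=$2 rss
  rss=$(rss_kib "$status_file")
  [ -n "$rss" ] && [ "$rss" -gt $(( limit_mib * 1024 )) ]
}
```

Usage: `rss_over_limit /proc/<pid>/status 256 && echo "collector RSS above 256 MiB"`, substituting the collector's process ID.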

---

### Issue: Symbol Resolution Failures

**Symptoms:**
```
Symbol: addr:0x7f4a3b2c1000 (unresolved)
```

**Diagnosis:**
```bash
# Check if binary has symbols
nm /path/to/binary | head

# Check whether the binary is stripped (no match means symbols were removed)
file /path/to/binary | grep "not stripped"
```

**Resolution:**
1. Install debug symbols:
   ```bash
   # Debian/Ubuntu
   apt install libc6-dbg

   # RHEL/CentOS
   debuginfo-install glibc
   ```
2. Accept address-only evidence (still valuable for correlation).

---

### Issue: Container Resolution Failures

**Symptoms:**
```
container_id: unknown:1234567890
```

**Diagnosis:**
```bash
# Check cgroup path format
cat /proc/<pid>/cgroup

# Verify container runtime
docker ps
crictl ps
```

**Resolution:**
1. Verify Zastava integration is running
2. Check container runtime is supported (containerd/Docker/CRI-O)
3. Restart collector to refresh container mappings
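
When reading `/proc/<pid>/cgroup` by hand, the container ID is usually the 64-character hexadecimal string embedded in the path (for example in `docker-<id>.scope` or `cri-containerd-<id>.scope`). The sketch below extracts it. It illustrates the common path formats only and is not a description of how the collector itself resolves containers; `container_id_from_cgroup` is a hypothetical name.

```bash
# Extract the first 64-hex-character container ID from cgroup file content
# read on stdin. Prints nothing if the process is not in a container cgroup.
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | head -n 1
}
```

Usage: `container_id_from_cgroup < /proc/<pid>/cgroup`, then compare the result with the IDs listed by `docker ps --no-trunc` or `crictl ps --no-trunc`.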

---

### Issue: Evidence Chain Verification Failure

**Symptoms:**
```
$ stella signals verify-chain /var/lib/stellaops/evidence/
Chain Status: ✗ INVALID
Error: Chain broken at chunk 42
```

**Diagnosis:**
```bash
# Get detailed report
stella signals verify-chain /var/lib/stellaops/evidence/ --verbose --format json
```

**Resolution:**
1. Check for missing chunk files
2. Check for disk corruption
3. If the gap was caused by an intentional restart, document it in the audit trail
4. Re-initialize the chain if necessary:
   ```bash
   stella signals reset-chain --confirm
   ```
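
For step 1, a gap in the chunk sequence can be spotted mechanically once the chunk numbers are listed. The on-disk naming of chunk files is not specified in this runbook, so the sketch below works on bare sequence numbers read from stdin and leaves the extraction step to the operator. `missing_in_sequence` is a hypothetical helper.

```bash
# Read integer sequence numbers (one per line, any order) on stdin and
# print every number missing between the smallest and the largest.
missing_in_sequence() {
  sort -n | awk '
    NR > 1 { for (i = prev + 1; i < $1; i++) print i }
    { prev = $1 }
  '
}
```

If the chunks carried their index in the file name (an assumption), the numbers could be extracted with `ls` and `sed` and piped into `missing_in_sequence`. An empty result means no chunk is missing, which points toward corruption rather than deletion.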

---

### Issue: Rekor Submission Failures

**Symptoms:**
```
Warning: Failed to submit to Rekor: connection refused
```

**Diagnosis:**
```bash
# Check Rekor connectivity
curl https://rekor.sigstore.dev/api/v1/log

# Check signing service
stella signer status
```

**Resolution:**
1. Check network connectivity to Rekor
2. Verify Fulcio/OIDC tokens are valid
3. Switch to offline mode temporarily:
   ```yaml
   signals:
     signing:
       submit_to_rekor: false
   ```
4. Retry failed submissions later:
   ```bash
   stella signals resubmit-pending
   ```
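
Transient connectivity problems are often easier to ride out with a retry loop than by hand. The helper below is a generic sketch of retry with exponential backoff. It is not a feature of the `stella` CLI, and the name `retry_with_backoff` and its argument order are illustrative.

```bash
# Run a command up to max_attempts times, doubling the delay (in seconds)
# after each failure. Returns 0 on the first success, 1 after the last failure.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, `retry_with_backoff 5 2 curl -fsS https://rekor.sigstore.dev/api/v1/log` waits 2, 4, 8 and 16 seconds between attempts. The same wrapper could be placed around `stella signals resubmit-pending` once connectivity returns.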

## Operational Procedures

### Procedure: Rotate Evidence Directory

When the evidence directory is full or needs archival:

```bash
# 1. Stop collector gracefully
stella signals stop

# 2. Archive current evidence (name computed once so later steps agree across midnight)
ARCHIVE="evidence-$(date +%Y%m%d).tar.gz"
tar -czvf "$ARCHIVE" /var/lib/stellaops/evidence/

# 3. Verify archive integrity
stella signals verify-chain "$ARCHIVE"

# 4. Move to long-term storage
aws s3 cp "$ARCHIVE" s3://evidence-archive/

# 5. Clear old evidence (keep chain state)
stella signals cleanup --keep-chain-state --older-than 7d

# 6. Restart collector
stella signals start
```
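
To decide when rotation is due, the fill level of the filesystem holding the evidence directory can be checked with `df`. This is a minimal sketch: the helper names are hypothetical and the 80% default threshold is an assumption, not a documented limit.

```bash
# Parse `df -P` output on stdin and print the Capacity column of the
# first data row without the trailing "%".
df_usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the filesystem holding dir is at or above threshold percent.
# The default of 80 is an illustrative choice.
needs_rotation() {
  local dir=$1 threshold=${2:-80} used
  used=$(df -P "$dir" | df_usage_percent)
  [ "$used" -ge "$threshold" ]
}
```

Usage: `needs_rotation /var/lib/stellaops/evidence 90 && echo "rotate evidence directory"`.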

### Procedure: Update Collector

```bash
# 1. Check current version
stella version

# 2. Download new version
curl -fsSL https://stella.ops/install.sh | bash -s -- --version 1.2.0

# 3. Verify probe compatibility
stella signals test-probes

# 4. Restart service
sudo systemctl restart stellaops-signals

# 5. Verify operation
stella signals status
```

### Procedure: Recover from Crash

```bash
# 1. Check service status
systemctl status stellaops-signals

# 2. Check for core dumps
coredumpctl list | grep stella

# 3. Review logs for cause
journalctl -u stellaops-signals --since "30 min ago"

# 4. Verify chain state
stella signals verify-chain /var/lib/stellaops/evidence/

# 5. Restart service
sudo systemctl start stellaops-signals

# 6. Monitor for recurrence
watch -n 5 'stella signals stats'
```

### Procedure: Air-Gap Evidence Export

```bash
# 1. Create signed export bundle
stella signals export \
  --from 2026-01-01 \
  --to 2026-01-31 \
  --include-proofs \
  --output january-evidence.tar.gz

# 2. Generate verification manifest
stella signals manifest january-evidence.tar.gz > manifest.json

# 3. Transfer to verification system
scp january-evidence.tar.gz manifest.json airgap-verifier:

# 4. On verifier, import and verify
stella signals import january-evidence.tar.gz
stella signals verify-chain --offline /imported/evidence/
```
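
Because the bundle crosses a trust boundary, it can be useful to confirm that the file arrived byte-for-byte intact before importing it. The sketch below adds a plain SHA-256 sidecar file. It complements the signed manifest and chain verification and does not replace them. The helper names are hypothetical.

```bash
# Write "<file>.sha256" next to the file. The checksum is recorded against
# the bare file name so it can be verified from any directory after transfer.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Verify the file against its sidecar; non-zero exit status on mismatch.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" > /dev/null )
}
```

Run `write_checksum january-evidence.tar.gz` before step 3, copy the `.sha256` file along with the bundle, and run `verify_checksum january-evidence.tar.gz` on the verifier before step 4.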

## Configuration Reference

### Full Configuration Example

```yaml
signals:
  enabled: true
  output_directory: /var/lib/stellaops/evidence

  # Ring buffer (kernel space)
  ring_buffer_size: 262144  # 256 KiB

  # Rate limiting
  max_events_per_second: 0  # unlimited

  # Rotation
  rotation:
    max_size_mb: 100
    max_age_hours: 1

  # Signing
  signing:
    enabled: true
    key_id: fulcio
    submit_to_rekor: true

  # Probes
  probes:
    sys_enter_openat: true
    sched_process_exec: true
    inet_sock_set_state: true
    libc_connect: true
    libc_accept: true
    openssl_read: true
    openssl_write: true

  # Filters
  filters:
    target_containers: []
    target_namespaces: []
    paths:
      allowlist:
        - /etc/**
        - /var/lib/**
      denylist:
        - /proc/**
        - /sys/**
        - /dev/**
    networks:
      allowlist: []
      denylist:
        - 127.0.0.0/8

  # Resources
  resources:
    max_cache_memory_mb: 256
    symbol_cache_max_entries: 100000
    container_cache_ttl_seconds: 300

  # Observability
  metrics:
    enabled: true
    port: 9090
  logging:
    level: info
    format: json
```

## Emergency Procedures

### Emergency: Disable Collection

If the collector is causing system issues:

```bash
# Immediate stop
sudo systemctl stop stellaops-signals

# Disable on boot
sudo systemctl disable stellaops-signals

# Remove all probes manually
sudo bpftool prog list | grep stella | awk '{print $1}' | xargs -I{} sudo bpftool prog detach {}
```

### Emergency: Clear Corrupted State

If the state is corrupted and normal recovery fails:

```bash
# Stop service
sudo systemctl stop stellaops-signals

# Backup current state
cp -r /var/lib/stellaops/evidence /var/lib/stellaops/evidence.backup

# Clear state
rm -rf /var/lib/stellaops/evidence/*

# Re-initialize
stella signals init

# Start fresh
sudo systemctl start stellaops-signals
```

## Support

For issues not covered in this runbook:

1. Check [GitHub Issues](https://github.com/stellaops/stellaops/issues)
2. Search [Documentation](https://docs.stella.ops/)
3. Contact support with:
   - Output of `stella signals status --verbose`
   - Relevant log excerpts
   - Kernel version (`uname -a`)
   - Configuration file (sanitized)