# Operator Runbook ## Overview This runbook provides operational procedures for managing the eBPF reachability evidence collection system. ## Monitoring ### Key Metrics Monitor these metrics for system health: | Metric | Description | Alert Threshold | |--------|-------------|-----------------| | `stellaops_signals_events_total` | Total events collected | N/A (info) | | `stellaops_signals_events_rate` | Events per second | > 100,000 (high load) | | `stellaops_signals_ringbuf_usage` | Ring buffer utilization % | > 80% (overflow risk) | | `stellaops_signals_drops_total` | Events dropped | > 0 (investigate) | | `stellaops_signals_enrich_latency_p99` | Enrichment latency | > 50ms (degraded) | | `stellaops_signals_chunks_signed` | Signed chunks count | N/A (info) | | `stellaops_signals_rekor_failures` | Rekor submission failures | > 0 (investigate) | ### Health Checks ```bash # Quick health check stella signals health # Detailed status stella signals status --verbose # Prometheus metrics curl localhost:9090/metrics | grep stellaops_signals ``` ### Log Analysis ```bash # View recent logs journalctl -u stellaops-signals --since "1 hour ago" # Filter by severity journalctl -u stellaops-signals -p err # Follow live journalctl -u stellaops-signals -f ``` ## Common Issues ### Issue: Probe Failed to Attach **Symptoms:** ``` Error: Failed to attach tracepoint/syscalls/sys_enter_openat: permission denied ``` **Diagnosis:** ```bash # Check capabilities getcap /usr/bin/stella # Check kernel config cat /boot/config-$(uname -r) | grep CONFIG_BPF # Check seccomp/AppArmor dmesg | grep -i "bpf\|seccomp\|apparmor" ``` **Resolution:** 1. Ensure proper capabilities: ```bash sudo setcap cap_bpf,cap_perfmon,cap_sys_ptrace+ep /usr/bin/stella ``` 2. Or run as root: ```bash sudo stella signals start ``` 3. Check AppArmor/SELinux isn't blocking --- ### Issue: Ring Buffer Overflow **Symptoms:** ``` Warning: Ring buffer full, 1523 events dropped ``` **Diagnosis:** ```bash # Check buffer usage stella signals stats | grep ringbuf # Check event rate stella signals stats | grep rate ``` **Resolution:** 1. Increase buffer size: ```yaml signals: ring_buffer_size: 1048576 # 1MB ``` 2. Enable rate limiting: ```yaml signals: max_events_per_second: 50000 ``` 3. Add more aggressive filtering: ```yaml signals: filters: paths: denylist: - /proc/** - /sys/** ``` --- ### Issue: High Memory Usage **Symptoms:** - OOM kills - High RSS in process stats **Diagnosis:** ```bash # Check memory breakdown stella signals stats --memory # Check cache sizes stella signals cache-stats ``` **Resolution:** 1. Reduce cache sizes: ```yaml signals: resources: symbol_cache_max_entries: 50000 max_cache_memory_mb: 128 ``` 2. Reduce container cache TTL: ```yaml signals: resources: container_cache_ttl_seconds: 60 ``` --- ### Issue: Symbol Resolution Failures **Symptoms:** ``` Symbol: addr:0x7f4a3b2c1000 (unresolved) ``` **Diagnosis:** ```bash # Check if binary has symbols nm /path/to/binary | head # Check if debuginfo available file /path/to/binary | grep "not stripped" ``` **Resolution:** 1. Install debug symbols: ```bash # Debian/Ubuntu apt install libc6-dbg # RHEL/CentOS debuginfo-install glibc ``` 2. Accept address-only evidence (still valuable for correlation) --- ### Issue: Container Resolution Failures **Symptoms:** ``` container_id: unknown:1234567890 ``` **Diagnosis:** ```bash # Check cgroup path format cat /proc//cgroup # Verify container runtime docker ps crictl ps ``` **Resolution:** 1. Verify Zastava integration is running 2. Check container runtime is supported (containerd/Docker/CRI-O) 3. Restart collector to refresh container mappings --- ### Issue: Evidence Chain Verification Failure **Symptoms:** ``` $ stella signals verify-chain /var/lib/stellaops/evidence/ Chain Status: ✗ INVALID Error: Chain broken at chunk 42 ``` **Diagnosis:** ```bash # Get detailed report stella signals verify-chain /var/lib/stellaops/evidence/ --verbose --format json ``` **Resolution:** 1. Check for missing chunk files 2. Check for disk corruption 3. If intentional restart, document gap in audit trail 4. Re-initialize chain if necessary: ```bash stella signals reset-chain --confirm ``` --- ### Issue: Rekor Submission Failures **Symptoms:** ``` Warning: Failed to submit to Rekor: connection refused ``` **Diagnosis:** ```bash # Check Rekor connectivity curl https://rekor.sigstore.dev/api/v1/log # Check signing service stella signer status ``` **Resolution:** 1. Check network connectivity to Rekor 2. Verify Fulcio/OIDC tokens are valid 3. Switch to offline mode temporarily: ```yaml signals: signing: submit_to_rekor: false ``` 4. Retry failed submissions later: ```bash stella signals resubmit-pending ``` ## Operational Procedures ### Procedure: Rotate Evidence Directory When evidence directory is full or needs archival: ```bash # 1. Stop collector gracefully stella signals stop # 2. Archive current evidence tar -czvf evidence-$(date +%Y%m%d).tar.gz /var/lib/stellaops/evidence/ # 3. Verify archive integrity stella signals verify-chain evidence-$(date +%Y%m%d).tar.gz # 4. Move to long-term storage aws s3 cp evidence-$(date +%Y%m%d).tar.gz s3://evidence-archive/ # 5. Clear old evidence (keep chain state) stella signals cleanup --keep-chain-state --older-than 7d # 6. Restart collector stella signals start ``` ### Procedure: Update Collector ```bash # 1. Check current version stella version # 2. Download new version curl -fsSL https://stella.ops/install.sh | bash -s -- --version 1.2.0 # 3. Verify probe compatibility stella signals test-probes # 4. Restart service sudo systemctl restart stellaops-signals # 5. Verify operation stella signals status ``` ### Procedure: Recover from Crash ```bash # 1. Check service status systemctl status stellaops-signals # 2. Check for core dumps coredumpctl list | grep stella # 3. Review logs for cause journalctl -u stellaops-signals --since "30 min ago" # 4. Verify chain state stella signals verify-chain /var/lib/stellaops/evidence/ # 5. Restart service sudo systemctl start stellaops-signals # 6. Monitor for recurrence watch -n 5 'stella signals stats' ``` ### Procedure: Air-Gap Evidence Export ```bash # 1. Create signed export bundle stella signals export \ --from 2026-01-01 \ --to 2026-01-31 \ --include-proofs \ --output january-evidence.tar.gz # 2. Generate verification manifest stella signals manifest january-evidence.tar.gz > manifest.json # 3. Transfer to verification system scp january-evidence.tar.gz manifest.json airgap-verifier: # 4. On verifier, import and verify stella signals import january-evidence.tar.gz stella signals verify-chain --offline /imported/evidence/ ``` ## Configuration Reference ### Full Configuration Example ```yaml signals: enabled: true output_directory: /var/lib/stellaops/evidence # Ring buffer (kernel space) ring_buffer_size: 262144 # 256KB # Rate limiting max_events_per_second: 0 # unlimited # Rotation rotation: max_size_mb: 100 max_age_hours: 1 # Signing signing: enabled: true key_id: fulcio submit_to_rekor: true # Probes probes: sys_enter_openat: true sched_process_exec: true inet_sock_set_state: true libc_connect: true libc_accept: true openssl_read: true openssl_write: true # Filters filters: target_containers: [] target_namespaces: [] paths: allowlist: - /etc/** - /var/lib/** denylist: - /proc/** - /sys/** - /dev/** networks: allowlist: [] denylist: - 127.0.0.0/8 # Resources resources: max_cache_memory_mb: 256 symbol_cache_max_entries: 100000 container_cache_ttl_seconds: 300 # Observability metrics: enabled: true port: 9090 logging: level: info format: json ``` ## Emergency Procedures ### Emergency: Disable Collection If collector is causing system issues: ```bash # Immediate stop sudo systemctl stop stellaops-signals # Disable on boot sudo systemctl disable stellaops-signals # Remove all probes manually sudo bpftool prog list | grep stella | awk '{print $1}' | xargs -I{} sudo bpftool prog detach {} ``` ### Emergency: Clear Corrupted State If state is corrupted and normal recovery fails: ```bash # Stop service sudo systemctl stop stellaops-signals # Backup current state cp -r /var/lib/stellaops/evidence /var/lib/stellaops/evidence.backup # Clear state rm -rf /var/lib/stellaops/evidence/* # Re-initialize stella signals init # Start fresh sudo systemctl start stellaops-signals ``` ## Support For issues not covered in this runbook: 1. Check [GitHub Issues](https://github.com/stellaops/stellaops/issues) 2. Search [Documentation](https://docs.stella.ops/) 3. Contact support with: - Output of `stella signals status --verbose` - Relevant log excerpts - Kernel version (`uname -a`) - Configuration file (sanitized)