git.stella-ops.org/docs/reachability/operator-runbook.md
2026-01-28 02:30:48 +02:00

Operator Runbook

Overview

This runbook provides operational procedures for managing the eBPF reachability evidence collection system.

Monitoring

Key Metrics

Monitor these metrics for system health:

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| stellaops_signals_events_total | Total events collected | N/A (info) |
| stellaops_signals_events_rate | Events per second | > 100,000 (high load) |
| stellaops_signals_ringbuf_usage | Ring buffer utilization % | > 80% (overflow risk) |
| stellaops_signals_drops_total | Events dropped | > 0 (investigate) |
| stellaops_signals_enrich_latency_p99 | Enrichment latency | > 50ms (degraded) |
| stellaops_signals_chunks_signed | Signed chunks count | N/A (info) |
| stellaops_signals_rekor_failures | Rekor submission failures | > 0 (investigate) |
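
For a quick spot check without an alerting stack, the thresholds above can be applied directly to the metrics endpoint output. The following is a minimal sketch, not part of the `stella` CLI: the function name `check_signals_metrics` is hypothetical, and it assumes samples carry no trailing timestamp, that `ringbuf_usage` is exported as a percentage (0-100), and that `enrich_latency_p99` is exported in milliseconds.

```shell
# Sketch: read Prometheus text-format samples on stdin and print one line per
# threshold breach. Thresholds are taken from the Key Metrics table.
# Assumptions: no trailing timestamps, ringbuf_usage is a percentage,
# enrich_latency_p99 is in milliseconds.
check_signals_metrics() {
  awk '
    /^#/ || NF < 2 { next }            # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/[{].*/, "", name)           # drop any label set after the metric name
      value = $NF + 0
      if      (name == "stellaops_signals_events_rate"        && value > 100000) reason = "high load"
      else if (name == "stellaops_signals_ringbuf_usage"      && value > 80)     reason = "overflow risk"
      else if (name == "stellaops_signals_drops_total"        && value > 0)      reason = "investigate"
      else if (name == "stellaops_signals_enrich_latency_p99" && value > 50)     reason = "degraded"
      else if (name == "stellaops_signals_rekor_failures"     && value > 0)      reason = "investigate"
      else next                        # informational metric or within limits
      print reason ": " name "=" $NF
    }
  '
}

# Usage (endpoint taken from the Health Checks section):
#   curl -s localhost:9090/metrics | check_signals_metrics
```

No output means every monitored metric is within its threshold.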

Health Checks

# Quick health check
stella signals health

# Detailed status
stella signals status --verbose

# Prometheus metrics
curl -s localhost:9090/metrics | grep stellaops_signals

Log Analysis

# View recent logs
journalctl -u stellaops-signals --since "1 hour ago"

# Filter by severity
journalctl -u stellaops-signals -p err

# Follow live
journalctl -u stellaops-signals -f

Common Issues

Issue: Probe Failed to Attach

Symptoms:

Error: Failed to attach tracepoint/syscalls/sys_enter_openat: permission denied

Diagnosis:

# Check capabilities
getcap /usr/bin/stella

# Check kernel config
grep CONFIG_BPF "/boot/config-$(uname -r)"

# Check seccomp/AppArmor
dmesg | grep -i "bpf\|seccomp\|apparmor"

Resolution:

  1. Ensure proper capabilities:
    sudo setcap cap_bpf,cap_perfmon,cap_sys_ptrace+ep /usr/bin/stella
    
  2. Or run as root:
    sudo stella signals start
    
  3. Check that AppArmor/SELinux is not blocking the probe attach

Issue: Ring Buffer Overflow

Symptoms:

Warning: Ring buffer full, 1523 events dropped

Diagnosis:

# Check buffer usage
stella signals stats | grep ringbuf

# Check event rate
stella signals stats | grep rate

Resolution:

  1. Increase buffer size:
    signals:
      ring_buffer_size: 1048576  # 1 MiB
    
  2. Enable rate limiting:
    signals:
      max_events_per_second: 50000
    
  3. Add more aggressive filtering:
    signals:
      filters:
        paths:
          denylist:
            - /proc/**
            - /sys/**
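
When choosing a new buffer size, a rough estimate is the number of bytes produced between two consumer reads, rounded up to a power of two (the kernel requires BPF ring buffer sizes to be a power of two and a multiple of the page size). The sketch below assumes `ring_buffer_size` maps directly onto that kernel buffer; the helper name `ringbuf_size_for`, the 128-byte average event size, and the 100 ms drain interval are hypothetical inputs to measure on your own system.

```shell
# Sketch: estimate a ring buffer size in bytes.
#   $1 = peak events per second
#   $2 = average event size in bytes (assumption: measure for your workload)
#   $3 = worst-case interval between consumer reads, in milliseconds
# The result is rounded up to a power of two, starting from one 4 KiB page.
ringbuf_size_for() {
  local rate="$1" event_bytes="$2" drain_ms="$3"
  local needed size
  needed=$(( rate * event_bytes * drain_ms / 1000 ))
  size=4096
  while [ "$size" -lt "$needed" ]; do
    size=$(( size * 2 ))
  done
  echo "$size"
}

# Example: 50,000 events/s, 128-byte events, 100 ms between reads
ringbuf_size_for 50000 128 100
```

With these inputs the estimate is 640,000 bytes, which rounds up to 1048576, the 1 MiB value used in the first resolution step.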
    

Issue: High Memory Usage

Symptoms:

  • OOM kills
  • High RSS in process stats

Diagnosis:

# Check memory breakdown
stella signals stats --memory

# Check cache sizes
stella signals cache-stats

Resolution:

  1. Reduce cache sizes:
    signals:
      resources:
        symbol_cache_max_entries: 50000
        max_cache_memory_mb: 128
    
  2. Reduce container cache TTL:
    signals:
      resources:
        container_cache_ttl_seconds: 60
    

Issue: Symbol Resolution Failures

Symptoms:

Symbol: addr:0x7f4a3b2c1000 (unresolved)

Diagnosis:

# Check if binary has symbols
nm /path/to/binary | head

# Check whether the binary is stripped ("not stripped" means the symbol table is present)
file /path/to/binary | grep "not stripped"

Resolution:

  1. Install debug symbols:
    # Debian/Ubuntu
    sudo apt install libc6-dbg
    
    # RHEL/CentOS
    sudo debuginfo-install glibc
    
  2. Accept address-only evidence (still valuable for correlation)

Issue: Container Resolution Failures

Symptoms:

container_id: unknown:1234567890

Diagnosis:

# Check cgroup path format
cat /proc/<pid>/cgroup

# Verify container runtime
docker ps
crictl ps

Resolution:

  1. Verify that the Zastava integration is running
  2. Check that the container runtime is supported (containerd/Docker/CRI-O)
  3. Restart the collector to refresh container mappings
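
To see what the collector has to parse, it can help to extract the container ID from a cgroup path by hand. Docker, containerd, and CRI-O all embed the 64-character hexadecimal container ID in the path, although the surrounding format differs by runtime and cgroup driver. This is a minimal sketch for manual diagnosis; the function name is hypothetical and it is not how the collector itself is documented to resolve containers.

```shell
# Sketch: print the container ID found in /proc/<pid>/cgroup content on stdin.
# Matches any 64-character lowercase hex string and keeps the last one,
# which covers paths such as:
#   0::/system.slice/docker-<id>.scope
#   0::/kubepods.slice/.../cri-containerd-<id>.scope
# Prints nothing when the process is not in a container cgroup.
container_id_from_cgroup() {
  grep -oE '[0-9a-f]{64}' | tail -n 1
}

# Usage:
#   container_id_from_cgroup < /proc/<pid>/cgroup
```

If this prints an ID that `docker ps --no-trunc` or `crictl ps` also lists, the cgroup format is readable and the problem is more likely in the runtime integration.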

Issue: Evidence Chain Verification Failure

Symptoms:

$ stella signals verify-chain /var/lib/stellaops/evidence/
Chain Status: ✗ INVALID
  Error: Chain broken at chunk 42

Diagnosis:

# Get detailed report
stella signals verify-chain /var/lib/stellaops/evidence/ --verbose --format json

Resolution:

  1. Check for missing chunk files
  2. Check for disk corruption
  3. If the gap was caused by an intentional restart, document it in the audit trail
  4. Re-initialize the chain if necessary:
    stella signals reset-chain --confirm
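
For the first resolution step, missing chunk files can be spotted by looking for gaps in the chunk sequence numbers. The sketch below only finds gaps in a list of integers; how sequence numbers appear in chunk file names is not described in this runbook, so the `chunk-<number>.<ext>` pattern in the usage comment is purely hypothetical.

```shell
# Sketch: read one sequence number per line on stdin and print every number
# missing between the smallest and largest value seen.
find_chunk_gaps() {
  sort -n | awk '
    NR > 1 && $1 > prev + 1 {
      for (i = prev + 1; i < $1; i++) print i
    }
    { prev = $1 }
  '
}

# Usage, assuming a hypothetical chunk-<number>.<ext> naming scheme:
#   ls /var/lib/stellaops/evidence/ \
#     | sed -nE 's/^chunk-([0-9]+)\..*/\1/p' \
#     | find_chunk_gaps
```

Any number printed is a chunk that should exist but has no file on disk.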
    

Issue: Rekor Submission Failures

Symptoms:

Warning: Failed to submit to Rekor: connection refused

Diagnosis:

# Check Rekor connectivity
curl https://rekor.sigstore.dev/api/v1/log

# Check signing service
stella signer status

Resolution:

  1. Check network connectivity to Rekor
  2. Verify Fulcio/OIDC tokens are valid
  3. Switch to offline mode temporarily:
    signals:
      signing:
        submit_to_rekor: false
    
  4. Retry failed submissions later:
    stella signals resubmit-pending
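
If Rekor outages are transient, the connectivity check and the resubmission can be wrapped in a retry loop with exponential backoff instead of being re-run by hand. This is a generic shell sketch; `retry_with_backoff` is a hypothetical helper, not a `stella` subcommand.

```shell
# Sketch: run a command until it succeeds, doubling the wait after each failure.
#   $1 = maximum number of attempts
#   $2 = initial delay in seconds
#   remaining arguments = the command to run
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Usage: wait for Rekor to answer, then resubmit
#   retry_with_backoff 5 2 curl -fsS -o /dev/null https://rekor.sigstore.dev/api/v1/log \
#     && stella signals resubmit-pending
```

With five attempts and a 2-second initial delay, the loop waits 2, 4, 8, and 16 seconds between tries before giving up.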
    

Operational Procedures

Procedure: Rotate Evidence Directory

When the evidence directory is full or needs archiving:

# 1. Stop collector gracefully
stella signals stop

# 2. Archive current evidence (compute the name once so every step uses the same file)
ARCHIVE="evidence-$(date +%Y%m%d).tar.gz"
tar -czvf "$ARCHIVE" /var/lib/stellaops/evidence/

# 3. Verify archive integrity
stella signals verify-chain "$ARCHIVE"

# 4. Move to long-term storage
aws s3 cp "$ARCHIVE" s3://evidence-archive/

# 5. Clear old evidence (keep chain state)
stella signals cleanup --keep-chain-state --older-than 7d

# 6. Restart collector
stella signals start

Procedure: Update Collector

# 1. Check current version
stella version

# 2. Download new version
curl -fsSL https://stella.ops/install.sh | bash -s -- --version 1.2.0

# 3. Verify probe compatibility
stella signals test-probes

# 4. Restart service
sudo systemctl restart stellaops-signals

# 5. Verify operation
stella signals status

Procedure: Recover from Crash

# 1. Check service status
systemctl status stellaops-signals

# 2. Check for core dumps
coredumpctl list | grep stella

# 3. Review logs for cause
journalctl -u stellaops-signals --since "30 min ago"

# 4. Verify chain state
stella signals verify-chain /var/lib/stellaops/evidence/

# 5. Restart service
sudo systemctl start stellaops-signals

# 6. Monitor for recurrence
watch -n 5 'stella signals stats'

Procedure: Air-Gap Evidence Export

# 1. Create signed export bundle
stella signals export \
  --from 2026-01-01 \
  --to 2026-01-31 \
  --include-proofs \
  --output january-evidence.tar.gz

# 2. Generate verification manifest
stella signals manifest january-evidence.tar.gz > manifest.json

# 3. Transfer to verification system
scp january-evidence.tar.gz manifest.json airgap-verifier:

# 4. On verifier, import and verify
stella signals import january-evidence.tar.gz
stella signals verify-chain --offline /imported/evidence/
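
Because the bundle crosses a trust boundary on removable media or a one-way link, it can be worth confirming that the files arrived intact before importing them. The sketch below uses `sha256sum` from GNU coreutils as a transfer check that is independent of the `stella` manifest, whose format is not described here. The helper names and the `SHA256SUMS` file name are assumptions.

```shell
# Sketch: record SHA-256 checksums on the source host and verify them on the
# verifier. Paths are stored as given, so run both steps from the directory
# that contains the files.
write_checksums() {
  local out="$1"
  shift
  sha256sum -- "$@" > "$out"
}

verify_checksums() {
  # --quiet prints only files that fail; the exit status is non-zero on any mismatch
  sha256sum --check --quiet -- "$1"
}

# Usage:
#   write_checksums SHA256SUMS january-evidence.tar.gz manifest.json   # before step 3
#   verify_checksums SHA256SUMS                                        # on the verifier, before step 4
```

Transfer `SHA256SUMS` alongside the bundle. A checksum match only shows the copy is intact; the signature and chain checks in step 4 are still what establish authenticity.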

Configuration Reference

Full Configuration Example

signals:
  enabled: true
  output_directory: /var/lib/stellaops/evidence

  # Ring buffer (kernel space)
  ring_buffer_size: 262144  # 256 KiB

  # Rate limiting
  max_events_per_second: 0  # unlimited

  # Rotation
  rotation:
    max_size_mb: 100
    max_age_hours: 1

  # Signing
  signing:
    enabled: true
    key_id: fulcio
    submit_to_rekor: true

  # Probes
  probes:
    sys_enter_openat: true
    sched_process_exec: true
    inet_sock_set_state: true
    libc_connect: true
    libc_accept: true
    openssl_read: true
    openssl_write: true

  # Filters
  filters:
    target_containers: []
    target_namespaces: []
    paths:
      allowlist:
        - /etc/**
        - /var/lib/**
      denylist:
        - /proc/**
        - /sys/**
        - /dev/**
    networks:
      allowlist: []
      denylist:
        - 127.0.0.0/8

  # Resources
  resources:
    max_cache_memory_mb: 256
    symbol_cache_max_entries: 100000
    container_cache_ttl_seconds: 300

  # Observability
  metrics:
    enabled: true
    port: 9090
  logging:
    level: info
    format: json

Emergency Procedures

Emergency: Disable Collection

If the collector is causing system issues:

# Immediate stop
sudo systemctl stop stellaops-signals

# Disable on boot
sudo systemctl disable stellaops-signals

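# Confirm no probes remain loaded (programs normally unload when the collector exits)
sudo bpftool prog list | grep stella

# Force-detach any leftover link by ID (IDs are shown by `sudo bpftool link list`)
sudo bpftool link detach id <link-id>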

Emergency: Clear Corrupted State

If the collector state is corrupted and normal recovery fails:

# Stop service
sudo systemctl stop stellaops-signals

# Back up current state (-a preserves ownership, permissions, and timestamps)
sudo cp -a /var/lib/stellaops/evidence /var/lib/stellaops/evidence.backup

# Clear state
sudo rm -rf /var/lib/stellaops/evidence/*

# Re-initialize
stella signals init

# Start fresh
sudo systemctl start stellaops-signals

Support

For issues not covered in this runbook:

  1. Check GitHub Issues
  2. Search Documentation
  3. Contact support with:
    • Output of stella signals status --verbose
    • Relevant log excerpts
    • Kernel version (uname -a)
    • Configuration file (sanitized)