git.stella-ops.org/docs/runbooks/runtime-linkage-ops.md
2026-01-24 00:12:43 +02:00

Runtime Linkage Verification - Operational Runbook

Audience: Platform operators, SREs, security engineers
Related: Runtime Linkage Guide, Function Map V1 Contract

Overview

This runbook covers production deployment and operation of the runtime linkage verification system. The system uses eBPF probes to observe function calls and verifies them against declared function maps.


Prerequisites

  • Linux kernel 5.8+ (eBPF CO-RE support; CAP_BPF and CAP_PERFMON were introduced in 5.8)
  • CAP_BPF and CAP_PERFMON capabilities for the runtime agent
  • BTF (BPF Type Format) enabled in kernel config
  • Stella runtime agent deployed as a DaemonSet or sidecar
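
The kernel and BTF prerequisites above can be checked on a node before the agent is deployed. The following is a minimal sketch, assuming GNU coreutils (`sort -V`) is available; `version_ge` is an illustrative helper name and not part of the Stella tooling.

```shell
# Hypothetical preflight check; version_ge is an illustrative helper, not a Stella command.

# Succeeds when version $1 is greater than or equal to version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distro suffix from the kernel release, e.g. 5.15.0-91-generic -> 5.15.0
kernel="$(uname -r | cut -d- -f1)"

if version_ge "$kernel" "5.8"; then
  echo "kernel ${kernel}: ok"
else
  echo "kernel ${kernel}: older than 5.8"
fi

# BTF is exposed at this path when the kernel is built with BTF enabled.
if [ -r /sys/kernel/btf/vmlinux ]; then
  echo "BTF: ok"
else
  echo "BTF: /sys/kernel/btf/vmlinux not found"
fi
```

The script only reports; it does not verify capabilities, which depend on how the agent's container or service is configured.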

Deployment

Runtime Agent Configuration

The Stella runtime agent (stella-runtime-agent) attaches eBPF probes based on function map predicates. Configure it through environment variables or YAML:

runtime_agent:
  observation_store:
    type: "memory"  # or "postgres", "valkey"
    retention_hours: 72
    max_batch_size: 1000
  probes:
    max_concurrent: 256
    attach_timeout_ms: 5000
    default_types: ["uprobe", "kprobe"]
  export:
    format: "ndjson"
    flush_interval_ms: 5000
    output_path: "/var/stella/observations/"

Probe Selection Guidance

| Category | Probe Type | Use Case |
|---|---|---|
| Crypto functions | uprobe | OpenSSL/BoringSSL/libsodium calls |
| Network I/O | kprobe | connect/sendto/recvfrom syscalls |
| Auth flows | uprobe | PAM/LDAP/OAuth library calls |
| File access | kprobe | open/read/write on sensitive paths |
| TLS handshake | uprobe | SSL_do_handshake, TLS negotiation |

Prioritization:

  1. Start with crypto and auth paths (highest security relevance)
  2. Add network I/O for service mesh verification
  3. Expand to file access for compliance requirements

Resource Overhead

Expected overhead per probe:

  • CPU: ~0.1-0.5% per active uprobe (per-call overhead ~100ns)
  • Memory: ~2KB per attached probe + observation buffer
  • Disk: ~100 bytes per observation record (NDJSON)

Recommended limits:

  • Max 256 concurrent probes per node
  • Observation buffer: 64MB
  • Flush interval: 5 seconds
  • Retention: 72 hours (configurable)
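
As a rough worked example, the per-probe and per-record figures above can be turned into a node-level estimate. The sustained observation rate of 1000 records per second is an assumption chosen for illustration; substitute a measured value.

```shell
# Back-of-envelope sizing from the figures above.
# obs_per_sec is an ASSUMED rate for illustration, not a measured value.
probes=256              # recommended max concurrent probes per node
bytes_per_probe=2048    # ~2KB per attached probe
buffer_mb=64            # observation buffer
obs_per_sec=1000        # ASSUMED sustained observation rate per node
bytes_per_obs=100       # ~100 bytes per NDJSON record
retention_hours=72

probe_mem_kb=$(( probes * bytes_per_probe / 1024 ))
disk_bytes=$(( obs_per_sec * bytes_per_obs * 3600 * retention_hours ))
disk_mb=$(( disk_bytes / 1000000 ))

echo "probe memory: ${probe_mem_kb} KB plus ${buffer_mb} MB buffer"
echo "disk for ${retention_hours}h retention: ~${disk_mb} MB"
```

Under these assumptions, probe memory is small next to the buffer (512 KB against 64 MB), while uncompressed retention reaches roughly 26 GB, so disk is the resource to watch.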

Operations

Generating Function Maps

Run function map generation as part of the CI/CD pipeline, after SBOM generation:

# In CI after SBOM generation (variables are quoted so paths with spaces survive)
stella function-map generate \
  --sbom "${BUILD_DIR}/sbom.cdx.json" \
  --service "${SERVICE_NAME}" \
  --hot-functions "crypto/*" --hot-functions "net/*" --hot-functions "auth/*" \
  --min-rate 0.95 \
  --window 1800 \
  --build-id "${CI_BUILD_ID}" \
  --output "${BUILD_DIR}/function-map.json"

Store the function map alongside the container image (OCI referrer or artifact registry).

Continuous Verification

Set up periodic verification (cron or controller loop). The example uses GNU date options (-d, -Iseconds):

# Every 30 minutes, verify the last hour of observations
stella function-map verify \
  --function-map /etc/stella/function-map.json \
  --from "$(date -d '1 hour ago' -Iseconds)" \
  --to "$(date -Iseconds)" \
  --format json --output /var/stella/verification/latest.json

Monitoring

Key metrics to alert on:

| Metric | Threshold | Action |
|---|---|---|
| observation_rate | < 0.80 | Warning: coverage dropping |
| observation_rate | < 0.50 | Critical: significant coverage loss |
| unexpected_symbols_count | > 0 | Investigate: undeclared functions executing |
| probe_attach_failures | > 5% | Warning: probe attachment issues |
| observation_buffer_full | true | Critical: observations being dropped |
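
The two observation_rate thresholds can be applied in a script, for example when post-processing a verification result. This is a minimal sketch; `classify_rate` is a hypothetical helper name, and how the rate value is obtained is left open because the report schema is not described here.

```shell
# Illustrative helper mapping an observation_rate value to the severities listed above.
# classify_rate is a hypothetical name, not part of the Stella CLI.
classify_rate() {
  awk -v r="$1" 'BEGIN { if (r < 0.50) print "critical"; else if (r < 0.80) print "warning"; else print "ok" }'
}

classify_rate 0.95   # prints "ok"
classify_rate 0.72   # prints "warning"
classify_rate 0.41   # prints "critical"
```

The comparison is delegated to awk because POSIX shell arithmetic handles integers only.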

Alert Configuration

alerts:
  - name: "function-map-coverage-low"
    condition: observation_rate < 0.80
    severity: warning
    description: "Function map coverage below 80% for {service}"
    runbook: "Check probe attachment, verify no binary update without map regeneration"

  - name: "function-map-unexpected-calls"
    condition: unexpected_symbols_count > 0
    severity: info
    description: "Unexpected function calls detected in {service}"
    runbook: "Review unexpected symbols, regenerate function map if benign"

  - name: "function-map-probe-failures"
    condition: probe_attach_failure_rate > 0.05
    severity: warning
    description: "Probe attachment failure rate above 5%"
    runbook: "Check kernel version, verify BTF availability, check CAP_BPF"

Performance Tuning

High-Traffic Services

For services with >10K calls/second on probed functions:

  1. Sampling: Configure observation sampling rate:

    probes:
      sampling_rate: 0.01  # 1% of calls
    
  2. Aggregation: Use count-based observations instead of per-call:

    export:
      aggregation_window_ms: 1000  # Aggregate per second
    
  3. Selective probing: Use --hot-functions to limit to critical paths only
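
To get a feel for what sampling buys, the sketch below estimates the observation volume left after applying the 1% sampling rate to a service at the 10K calls/second threshold. The call rate is an illustrative assumption, and the record size is the approximate NDJSON figure from the overhead section.

```shell
# Rough effect of sampling on observation volume.
calls_per_sec=10000   # ASSUMED call rate on probed functions
sampling_rate=0.01    # 1% of calls, as in the sampling config above
bytes_per_obs=100     # ~100 bytes per NDJSON record

# Shell arithmetic is integer-only, so the fractional multiply goes through awk.
sampled_per_sec="$(awk -v c="$calls_per_sec" -v s="$sampling_rate" 'BEGIN { printf "%.0f", c * s }')"
bytes_per_sec=$(( sampled_per_sec * bytes_per_obs ))

echo "observations/s after sampling: ${sampled_per_sec}"
echo "export volume: ~${bytes_per_sec} bytes/s"
```

With these inputs, sampling reduces 10,000 calls per second to about 100 observations per second, or roughly 10 KB/s of NDJSON.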

Large Function Maps

For maps with >100 expected paths:

  1. Tag paths by priority: crypto > auth > network > general
  2. Mark low-priority paths as optional: true
  3. Set per-tag minimum rates if needed

Storage Optimization

For long-term observation storage:

  1. Enable retention pruning: pruneOlderThanAsync(72h)
  2. Compress archived observations (gzip NDJSON)
  3. Use dedicated Postgres partitions by date for query performance
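
For the compression step, a small helper built on `find` and `gzip` is one way to do it. This is a sketch under assumptions: the function name is hypothetical, observations are assumed to be stored as `*.ndjson` files, and `-mmin` is assumed to be supported by the installed `find` (GNU and BSD versions support it).

```shell
# Illustrative archival helper; the function name and *.ndjson layout are assumptions.
# Gzip NDJSON files under directory $1 that were last modified more than $2 hours ago.
compress_old_observations() {
  find "$1" -type f -name '*.ndjson' -mmin +"$(( $2 * 60 ))" -exec gzip -- {} +
}

# Example: compress files older than 72 hours in the export directory, if it exists.
obs_dir="${OBS_DIR:-/var/stella/observations}"
if [ -d "$obs_dir" ]; then
  compress_old_observations "$obs_dir" 72
fi
```

gzip replaces each file with a `.gz` copy, so any reader of the archive needs to handle compressed input.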

Incident Response

Coverage Dropped After Deployment

  1. Check if binary was updated without regenerating the function map
  2. Verify probes are still attached: stella observations query --summary
  3. Check for symbol changes (different build, stripped or renamed symbols; ASLR by itself does not move uprobes, which attach by file offset)
  4. Regenerate function map from new SBOM and redeploy

Unexpected Symbols Detected

  1. Identify the unexpected functions from the verification report
  2. Determine if they are:
    • Benign: Dynamic dispatch, plugins, lazy-loaded libraries → add to map
    • Suspicious: Unexpected crypto usage, network calls → escalate to security team
  3. If benign, regenerate function map with broader patterns
  4. If suspicious, correlate with vulnerability findings and open incident
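
A first-pass sort of unexpected symbols into the two buckets above could look like the sketch below. The symbol names and the `crypto/*` and `net/*` patterns are assumptions modelled on the --hot-functions patterns used during generation; this is not a Stella feature and does not replace manual review.

```shell
# Illustrative first-pass triage; symbol naming and patterns are assumptions.
triage_symbol() {
  case "$1" in
    crypto/*|net/*) echo "escalate" ;;  # unexpected crypto or network use: security team
    *)              echo "review"   ;;  # possibly benign: confirm before regenerating the map
  esac
}

# Hypothetical symbols, for demonstration only.
for sym in "crypto/aes_encrypt" "plugins/load_theme" "net/connect"; do
  printf '%s\t%s\n' "$sym" "$(triage_symbol "$sym")"
done
```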

Probe Attachment Failures

  1. Check kernel version: uname -r (need 5.8+)
  2. Verify BTF: ls /sys/kernel/btf/vmlinux
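  3. Check capabilities: capsh --print | grep bpf (run inside the agent's container or service context; capsh reports the capabilities of the process that invokes it)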
  4. Check binary paths: verify binary_path in function map matches deployed binary
  5. Check for SELinux/AppArmor blocking BPF operations
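
When it is not practical to run a shell inside the agent's context, the effective capability mask of a running process can be read from /proc and decoded directly. The sketch below assumes a Linux /proc filesystem; `has_cap` and `report_cap` are illustrative helper names. CAP_PERFMON is capability number 38 and CAP_BPF is number 39.

```shell
# Illustrative check of a process's effective capabilities via /proc.
# CAP_PERFMON is bit 38 and CAP_BPF is bit 39 of the capability mask.

# Succeeds when bit $2 is set in the hexadecimal capability mask $1.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Print whether capability $2 (bit $3) is present in mask $1.
report_cap() {
  if has_cap "$1" "$3"; then echo "$2: present"; else echo "$2: missing"; fi
}

# Inspect PID $1 (defaults to the current shell); pass the agent's PID in practice.
pid="${1:-$$}"
if [ -r "/proc/${pid}/status" ]; then
  cap_eff="$(awk '/^CapEff:/ { print $2 }' "/proc/${pid}/status")"
  report_cap "$cap_eff" "CAP_BPF" 39
  report_cap "$cap_eff" "CAP_PERFMON" 38
fi
```

Saved as a script, it would be invoked with the agent's PID as the only argument.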

Air-Gap Considerations

For air-gapped environments:

  1. Bundle generation (connected side):

    stella function-map generate --sbom app.cdx.json --service my-service --output fm.json
    # Package with observations
    tar czf linkage-bundle.tgz fm.json observations/*.ndjson
    
  2. Transfer via approved media to air-gapped environment

  3. Offline verification (air-gapped side):

    stella function-map verify --function-map fm.json --offline --observations obs.ndjson
    
  4. Result export for compliance reporting:

    stella function-map verify ... --format json --output report.json
    # Sign the report
    stella attest sign --input report.json --output report.dsse.json
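
Because the bundle crosses an air gap on removable media, it can help to record a checksum on the connected side and verify it before unpacking. This is a minimal sketch using `sha256sum` from GNU coreutils; the helper names and the `.sha256` sidecar file are assumptions, not part of the Stella workflow.

```shell
# Illustrative integrity check for the transferred bundle (sha256sum from GNU coreutils).
# Helper names and the .sha256 sidecar file are assumptions.

# Connected side: write <bundle>.sha256 next to the bundle.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Air-gapped side: verify the bundle against <bundle>.sha256; non-zero exit on mismatch.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" )
}
```

Run `write_checksum linkage-bundle.tgz` after the tar step, transfer both files, then run `verify_checksum linkage-bundle.tgz` before extracting. The checksum is stored relative to the bundle's directory so that it still verifies after the files are moved.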