Telemetry and Observability Patterns

Version: 1.0 | Date: 2025-11-29 | Status: Canonical

This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.


1. Executive Summary

The Telemetry module provides unified observability infrastructure across all Stella Ops components. Key capabilities:

  • OpenTelemetry Native - OTLP collection for metrics, traces, logs
  • Forensic Mode - Extended retention and 100% sampling during incidents
  • Profile-Based Configuration - Default, forensic, and air-gap profiles
  • Sealed-Mode Guards - Automatic exporter restrictions in air-gapped deployments
  • Offline Bundles - Signed OTLP archives for compliance

2. Market Drivers

2.1 Target Segments

| Segment | Observability Requirements | Use Case |
|---|---|---|
| Platform Ops | Real-time monitoring | Operational health |
| Security Teams | Forensic investigation | Incident response |
| Compliance | Audit trails | SOC 2, FedRAMP |
| DevSecOps | Pipeline visibility | CI/CD debugging |

2.2 Competitive Positioning

Most vulnerability tools provide minimal observability. Stella Ops differentiates with:

  • Built-in OpenTelemetry across all services
  • Forensic mode with automatic retention extension
  • Sealed-mode compatibility for air-gapped deployments
  • Signed OTLP bundles for compliance archives
  • Incident-triggered sampling escalation

3. Collector Topology

3.1 Architecture

┌─────────────────────────────────────────────────────┐
│                    Services                          │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
                      ▼
┌─────────────────────────────────────────────────────┐
│              OpenTelemetry Collector                 │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │       Logs          │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch           │ Redaction   │
│       │ Sampling   │                 │             │
└───────┼────────────┼─────────────────┼─────────────┘
        │            │                 │
        ▼            ▼                 ▼
   ┌────────┐   ┌──────────┐      ┌────────┐
   │ Tempo  │   │Prometheus│      │  Loki  │
   └────────┘   └──────────┘      └────────┘

3.2 Collector Profiles

| Profile | Use Case | Configuration |
|---|---|---|
| default | Normal operation | 10% trace sampling, 30-day retention |
| forensic | Investigation mode | 100% sampling, 180-day retention |
| airgap | Offline deployment | File exporters, no external network |

4. Metrics

4.1 Standard Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stellaops_request_duration_seconds | Histogram | service, endpoint | Request latency |
| stellaops_request_total | Counter | service, status | Request count |
| stellaops_active_jobs | Gauge | tenant, jobType | Active job count |
| stellaops_queue_depth | Gauge | queue | Queue depth |
| stellaops_scan_duration_seconds | Histogram | tenant | Scan duration |

4.2 Module-Specific Metrics

Policy Engine:

  • policy_run_seconds{mode,tenant,policy}
  • policy_rules_fired_total{policy,rule}
  • policy_vex_overrides_total{policy,vendor}

Scanner:

  • scanner_sbom_components_total{ecosystem}
  • scanner_vulnerabilities_found_total{severity}
  • scanner_attestations_logged_total

Authority:

  • authority_token_issued_total{grant_type,audience}
  • authority_token_rejected_total{reason}
  • authority_dpop_nonce_miss_total

5. Traces

5.1 Trace Context

All services propagate W3C Trace Context:

  • traceparent header
  • tracestate for vendor-specific data
  • baggage for cross-service attributes
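
The traceparent header listed above has a fixed layout in W3C Trace Context version 00: version, trace ID, parent (span) ID, and flags, all lowercase hex separated by hyphens. The following is a minimal sketch of parsing an incoming header and deriving the header for an outgoing call. The names `TraceParent`, `parse_traceparent`, and `child_of` are illustrative and are not taken from the Stella Ops codebase.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the trace-flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)

    def header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags}"


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # The specification forbids version ff and all-zero trace or parent IDs.
    if (
        parsed.version == "ff"
        or set(parsed.trace_id) == {"0"}
        or set(parsed.parent_id) == {"0"}
    ):
        return None
    return parsed


def child_of(parent: TraceParent) -> TraceParent:
    """Keep the trace ID and flags; mint a new span ID for the outgoing call."""
    return parent._replace(parent_id=secrets.token_hex(8))
```

A service would typically call `parse_traceparent` on the inbound request and send `child_of(parsed).header()` downstream. In practice the OpenTelemetry SDK propagators handle this, so the sketch is only meant to show what travels on the wire.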

5.2 Span Conventions

| Span | Attributes | Description |
|---|---|---|
| http.request | url, method, status | HTTP handler |
| db.query | collection, operation | MongoDB ops |
| policy.evaluate | policyId, version | Policy run |
| scan.image | imageRef, digest | Image scan |
| sign.dsse | predicateType | DSSE signing |

5.3 Sampling Strategy

Default (Tail Sampling):

  • Error traces: 100%
  • Slow traces (>2s): 100%
  • Normal traces: 10%

Forensic Mode:

  • All traces: 100%
  • Extended attributes enabled
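
The sampling rules above can be read as a single keep-or-drop decision made once a trace is complete. The sketch below is one possible reading, not the collector's actual configuration: `TraceSummary` and `keep_trace` are hypothetical names, and hashing the trace ID to get a stable 10% baseline is an assumption chosen so that the same trace always gets the same decision.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str            # hex trace ID
    duration_seconds: float  # end-to-end duration of the whole trace
    has_error: bool          # True if any span recorded an error


def keep_trace(
    trace: TraceSummary,
    forensic: bool = False,
    slow_threshold_seconds: float = 2.0,
    baseline_rate: float = 0.10,
) -> bool:
    """Decide whether a completed trace is kept."""
    if forensic:
        return True  # forensic mode: all traces
    if trace.has_error:
        return True  # error traces: 100%
    if trace.duration_seconds > slow_threshold_seconds:
        return True  # slow traces (>2s): 100%
    # Normal traces: map the trace ID to [0, 1) and keep the lowest 10%.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate
```

Because the decision needs the error status and total duration, it can only be made after the trace has finished, which is why the default profile uses tail sampling rather than head sampling.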

6. Logs

6.1 Structured Format

{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
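
As an illustration of how a service might emit this shape, here is a minimal sketch built on Python's standard `logging` module. The class name `JsonLogFormatter` and the convention of passing domain fields through `extra={"fields": {...}}` are assumptions made for the example; the actual services may use a different logging stack.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python level names onto the level names used in this advisory.
_LEVELS = {"WARNING": "warn", "CRITICAL": "error"}


class JsonLogFormatter(logging.Formatter):
    """Render each record as one JSON object with the common fields first."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC with millisecond precision and a "Z" suffix.
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVELS.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Domain fields (traceId, tenant, imageDigest, ...) arrive via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)
```

A handler configured with `JsonLogFormatter("scanner")` would then accept calls such as `logger.info("Scan completed", extra={"fields": {"tenant": "acme-corp", "componentCount": 245}})`.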

6.2 Redaction

Attribute processors strip sensitive data:

  • authorization headers
  • secretRef values
  • PII based on allowed-key policy
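
The three rules above can be sketched as a single pass over a record's attributes. This is an illustrative reading only: it assumes the allowed-key policy is an allow-list (keys not on the list are dropped), and the names `redact`, `ALWAYS_REDACT`, and the `[REDACTED]` placeholder are hypothetical.

```python
from typing import Any, Dict, FrozenSet, Mapping

REDACTED = "[REDACTED]"
# Keys whose values are always masked, compared case-insensitively.
ALWAYS_REDACT = {"authorization", "secretref"}


def redact(attributes: Mapping[str, Any], allowed_keys: FrozenSet[str]) -> Dict[str, Any]:
    """Mask known-sensitive values and drop keys outside the allow-list."""
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        lowered = key.lower()
        if lowered in ALWAYS_REDACT or lowered.endswith(".authorization"):
            # Keep the key so the record shows a value was present.
            cleaned[key] = REDACTED
        elif key in allowed_keys:
            cleaned[key] = value
        # Anything else is treated as potential PII and omitted.
    return cleaned
```

Masking rather than dropping the authorization and secretRef keys keeps the shape of the record visible to investigators without exposing the value; whether the real processors mask or delete is not stated in this advisory.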

6.3 Log Levels

| Level | Purpose | Retention |
|---|---|---|
| error | Failures | 180 days |
| warn | Anomalies | 90 days |
| info | Operations | 30 days |
| debug | Development | 7 days |
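
The retention table translates directly into an expiry check. The sketch below assumes retention is measured from the time the record was logged; `RETENTION_DAYS` and `is_expired` are illustrative names, and the actual enforcement is presumably done by the storage backend rather than application code.

```python
from datetime import datetime, timedelta, timezone

# Retention per log level, in days, as listed in the table above.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def is_expired(level: str, logged_at: datetime, now: datetime) -> bool:
    """Return True once a record has outlived the retention for its level."""
    return now - logged_at > timedelta(days=RETENTION_DAYS[level])
```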

7. Forensic Mode

7.1 Activation

# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"

# Check status
stella telemetry incident status

# Deactivate
stella telemetry incident stop --tenant acme-corp

7.2 Behavior Changes

| Aspect | Default | Forensic |
|---|---|---|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
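
One way to picture the switch is as two immutable settings objects selected per tenant. The sketch below encodes the values from the table; the type `TelemetrySettings` and the function `settings_for` are hypothetical and only illustrate that activation swaps a whole profile rather than individual knobs.

```python
from dataclasses import dataclass
from typing import AbstractSet


@dataclass(frozen=True)
class TelemetrySettings:
    trace_sampling_rate: float     # fraction of normal traces kept
    log_level: str
    retention_days: int
    extended_attributes: bool
    export_interval_seconds: int


# Values taken from the Default and Forensic columns above.
DEFAULT = TelemetrySettings(0.10, "info", 30, False, 60)
FORENSIC = TelemetrySettings(1.0, "debug", 180, True, 10)


def settings_for(tenant: str, tenants_in_incident: AbstractSet[str]) -> TelemetrySettings:
    """Pick the forensic settings for tenants with an active incident."""
    return FORENSIC if tenant in tenants_in_incident else DEFAULT
```

Scoping the lookup by tenant mirrors the `--tenant` flag on the CLI commands in section 7.1.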

7.3 Automatic Triggers

  • Orchestrator incident escalation
  • Policy violation threshold exceeded
  • Circuit breaker activation
  • Manual operator trigger

8. Implementation Strategy

8.1 Phase 1: Core Telemetry (Complete)

  • OpenTelemetry SDK integration
  • Metrics exporter (Prometheus)
  • Trace exporter (Tempo/Jaeger)
  • Log exporter (Loki)

8.2 Phase 2: Advanced Features (Complete)

  • Tail sampling configuration
  • Attribute redaction
  • Profile-based configuration
  • Dashboard provisioning

8.3 Phase 3: Forensic & Offline (In Progress)

  • Forensic mode toggle
  • Forensic bundle export (TELEM-FOR-50-001)
  • Sealed-mode guards (TELEM-SEAL-51-001)
  • Offline bundle signing (TELEM-SIGN-52-001)

9. API Surface

9.1 Configuration

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /telemetry/config/profile/{name} | GET | telemetry:read | Download collector config |
| /telemetry/config/profiles | GET | telemetry:read | List profiles |

9.2 Incident Mode

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /telemetry/incidents/mode | POST | telemetry:admin | Toggle forensic mode |
| /telemetry/incidents/status | GET | telemetry:read | Current mode status |

9.3 Exports

| Endpoint | Method | Scope | Description |
|---|---|---|---|
| /telemetry/exports/forensic/{window} | GET | telemetry:export | Stream OTLP bundle |

10. Offline Support

10.1 Bundle Structure

telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
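
The layout suggests that manifest.json describes the bundle contents and manifest.sig covers the manifest. The advisory does not specify the manifest schema, so the sketch below assumes a simple one: a map from relative path to SHA-256 digest. `build_manifest` and `verify_manifest` are hypothetical helpers, and signing itself is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(bundle_dir: Path) -> Dict[str, Dict[str, str]]:
    """Hash every payload file; the manifest and signatures are excluded."""
    files: Dict[str, str] = {}
    for path in sorted(bundle_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(bundle_dir).as_posix()
        if relative == "manifest.json" or relative.startswith("signatures/"):
            continue
        files[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": files}


def verify_manifest(bundle_dir: Path, manifest: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the paths that were added, removed, or changed since signing."""
    current = build_manifest(bundle_dir)["files"]
    recorded = manifest["files"]
    names = set(current) | set(recorded)
    return sorted(name for name in names if current.get(name) != recorded.get(name))
```

With this arrangement a verifier checks one signature over manifest.json and then the digests, so the large OTLP files never need to be signed individually.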

10.2 Sealed-Mode Guards

// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
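
The three commented steps in the guard can be sketched as a filter over configured exporter endpoints. The example below is written in Python for illustration and does not reflect the StellaOps.Telemetry.Core API; `is_loopback`, `effective_exporters`, and the `file://` fallback convention are assumptions.

```python
import ipaddress
import logging
from typing import List
from urllib.parse import urlsplit

logger = logging.getLogger("telemetry.sealed")


def is_loopback(endpoint: str) -> bool:
    """True for URLs whose host is localhost or a loopback IP address."""
    host = urlsplit(endpoint).hostname  # None when the URL has no scheme/host
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname could resolve externally, so treat it as blocked.
        return False


def effective_exporters(endpoints: List[str], sealed: bool, fallback_path: str) -> List[str]:
    """Apply the sealed-mode guard to a list of exporter endpoint URLs."""
    if not sealed:
        return list(endpoints)
    allowed = [e for e in endpoints if is_loopback(e)]
    blocked = [e for e in endpoints if not is_loopback(e)]
    for endpoint in blocked:
        # Structured warning with a remediation hint.
        logger.warning(
            "sealed mode: exporter disabled",
            extra={"endpoint": endpoint, "remediation": "use a loopback or file exporter"},
        )
    if blocked:
        # Fall back to file-based export so telemetry is not lost.
        allowed.append(f"file://{fallback_path}")
    return allowed
```

Treating unknown hostnames as external is the conservative choice for an air-gapped deployment: a name that happens to resolve to loopback today is not a guarantee.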

11. Dashboards & Alerts

11.1 Standard Dashboards

| Dashboard | Purpose | Panels |
|---|---|---|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |

11.2 Alert Rules

| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 > 5m | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
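
To make the thresholds concrete, the sketch below evaluates the four conditions against a point-in-time snapshot. In a real deployment these would be alerting rules in the monitoring backend; `Snapshot` and `firing_alerts` are hypothetical names used only to show the comparisons and units (error rate as a fraction, p95 in seconds).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    scan_p95_seconds: float  # 95th percentile scan duration
    queue_depth: int
    breaker_open: bool


def firing_alerts(snapshot: Snapshot) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs for every condition that holds."""
    alerts: List[Tuple[str, str]] = []
    if snapshot.error_rate > 0.05:           # error_rate > 5%
        alerts.append(("High Error Rate", "critical"))
    if snapshot.scan_p95_seconds > 5 * 60:   # p95 > 5m
        alerts.append(("Slow Scans", "warning"))
    if snapshot.queue_depth > 1000:          # depth > 1000
        alerts.append(("Queue Backlog", "warning"))
    if snapshot.breaker_open:                # breaker_open = 1
        alerts.append(("Circuit Open", "critical"))
    return alerts
```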

12. Security Considerations

12.1 Data Protection

  • Sensitive attributes redacted at collection
  • Encrypted in transit (TLS)
  • Encrypted at rest (storage layer)
  • Retention policies enforced

12.2 Access Control

  • Authority scopes for API access
  • Tenant isolation in queries
  • Audit logging for forensic access

13. Related Documentation

| Resource | Location |
|---|---|
| Telemetry architecture | docs/modules/telemetry/architecture.md |
| Collector configuration | docs/modules/telemetry/collector-config.md |
| Dashboard provisioning | docs/modules/telemetry/dashboards.md |

14. Sprint Mapping

  • Primary Sprint: SPRINT_0180_0001_0001_telemetry_core.md (NEW)
  • Related Sprints:
    • SPRINT_0181_0001_0002_telemetry_forensic.md
    • SPRINT_0182_0001_0003_telemetry_offline.md

Key Task IDs:

  • TELEM-CORE-40-001 - SDK integration (DONE)
  • TELEM-DASH-41-001 - Dashboard provisioning (DONE)
  • TELEM-FOR-50-001 - Forensic bundles (IN PROGRESS)
  • TELEM-SEAL-51-001 - Sealed-mode guards (TODO)
  • TELEM-SIGN-52-001 - Bundle signing (TODO)

15. Success Metrics

| Metric | Target |
|---|---|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% for errors |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |

Last updated: 2025-11-29