# Telemetry and Observability Patterns

**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical

This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.

---

## 1. Executive Summary

The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:

- **OpenTelemetry Native** - OTLP collection for metrics, traces, logs
- **Forensic Mode** - Extended retention and 100% sampling during incidents
- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gap
- **Offline Bundles** - Signed OTLP archives for compliance

---

## 2. Market Drivers

### 2.1 Target Segments

| Segment | Observability Requirements | Use Case |
|---------|---------------------------|----------|
| **Platform Ops** | Real-time monitoring | Operational health |
| **Security Teams** | Forensic investigation | Incident response |
| **Compliance** | Audit trails | SOC 2, FedRAMP |
| **DevSecOps** | Pipeline visibility | CI/CD debugging |

### 2.2 Competitive Positioning

Most vulnerability tools provide minimal observability. Stella Ops differentiates with:

- **Built-in OpenTelemetry** across all services
- **Forensic mode** with automatic retention extension
- **Sealed-mode compatibility** for air-gap
- **Signed OTLP bundles** for compliance archives
- **Incident-triggered sampling** escalation

---

## 3. Collector Topology

### 3.1 Architecture

```
┌─────────────────────────────────────────────────────┐
│                      Services                       │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
                      ▼
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │        Logs         │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
    ┌────────┐  ┌──────────┐         ┌────────┐
    │ Tempo  │  │Prometheus│         │  Loki  │
    └────────┘  └──────────┘         └────────┘
```

### 3.2 Collector Profiles

| Profile | Use Case | Configuration |
|---------|----------|---------------|
| **default** | Normal operation | 10% trace sampling, 30-day retention |
| **forensic** | Investigation mode | 100% sampling, 180-day retention |
| **airgap** | Offline deployment | File exporters, no external network |

---

## 4. Metrics

### 4.1 Standard Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |

### 4.2 Module-Specific Metrics

**Policy Engine:**

- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`

**Scanner:**

- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`

**Authority:**

- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`

---

## 5. Traces

### 5.1 Trace Context

All services propagate W3C Trace Context:

- `traceparent` header
- `tracestate` for vendor-specific data
- `baggage` for cross-service attributes

### 5.2 Span Conventions

| Span | Attributes | Description |
|------|------------|-------------|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |
| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |

### 5.3 Sampling Strategy

**Default (Tail Sampling):**

- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%

**Forensic Mode:**

- All traces: 100%
- Extended attributes enabled

---

## 6. Logs

### 6.1 Structured Format

```json
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
```

### 6.2 Redaction

Attribute processors strip sensitive data:

- `authorization` headers
- `secretRef` values
- PII based on allowed-key policy

### 6.3 Log Levels

| Level | Purpose | Retention |
|-------|---------|-----------|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |

---

## 7. Forensic Mode

### 7.1 Activation

```bash
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"

# Check status
stella telemetry incident status

# Deactivate
stella telemetry incident stop --tenant acme-corp
```

### 7.2 Behavior Changes

| Aspect | Default | Forensic |
|--------|---------|----------|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |

### 7.3 Automatic Triggers

- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation
- Manual operator trigger

---

## 8. Implementation Strategy

### 8.1 Phase 1: Core Telemetry (Complete)

- [x] OpenTelemetry SDK integration
- [x] Metrics exporter (Prometheus)
- [x] Trace exporter (Tempo/Jaeger)
- [x] Log exporter (Loki)

### 8.2 Phase 2: Advanced Features (Complete)

- [x] Tail sampling configuration
- [x] Attribute redaction
- [x] Profile-based configuration
- [x] Dashboard provisioning

### 8.3 Phase 3: Forensic & Offline (In Progress)

- [x] Forensic mode toggle
- [ ] Forensic bundle export (TELEM-FOR-50-001)
- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
- [ ] Offline bundle signing (TELEM-SIGN-52-001)

---

## 9. API Surface

### 9.1 Configuration

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |

### 9.2 Incident Mode

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |

### 9.3 Exports

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |

---

## 10. Offline Support

### 10.1 Bundle Structure

```
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
```

### 10.2 Sealed-Mode Guards

```csharp
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
```

---

## 11. Dashboards & Alerts

### 11.1 Standard Dashboards

| Dashboard | Purpose | Panels |
|-----------|---------|--------|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |

### 11.2 Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 > 5m | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |

---

## 12. Security Considerations

### 12.1 Data Protection

- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced

### 12.2 Access Control

- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access

---

## 13. Related Documentation

| Resource | Location |
|----------|----------|
| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
| Collector configuration | `docs/modules/telemetry/collector-config.md` |
| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |

---

## 14. Sprint Mapping

- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md

**Key Task IDs:**

- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)

---

## 15. Success Metrics

| Metric | Target |
|--------|--------|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% for errors |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |

---

*Last updated: 2025-11-29*
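
As a rough illustration of the default sampling strategy in section 5.3, the three rules (keep every error trace, keep every trace slower than 2 seconds, keep 10% of the rest) map onto the `tail_sampling` processor from the OpenTelemetry Collector contrib distribution. This is a sketch under assumptions, not the shipped `default` profile: the policy names and the `decision_wait` value are illustrative choices, and the actual collector configuration is described in `docs/modules/telemetry/collector-config.md`.

```yaml
# Hypothetical fragment of a collector config for the default profile.
# A trace is kept if ANY policy below decides to sample it.
processors:
  tail_sampling:
    # Assumed buffer time before a sampling decision is made per trace.
    decision_wait: 10s
    policies:
      # Error traces: 100%
      - name: keep-errors            # illustrative name
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Slow traces (>2s): 100%
      - name: keep-slow              # illustrative name
        type: latency
        latency:
          threshold_ms: 2000
      # Normal traces: 10%
      - name: baseline               # illustrative name
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Under the same assumptions, the forensic profile's 100% sampling could be expressed by replacing the three policies with a single policy of type `always_sample`.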