Telemetry and Observability Patterns
Version: 1.0
Date: 2025-11-29
Status: Canonical
This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.
1. Executive Summary
The Telemetry module provides unified observability infrastructure across all Stella Ops components. Key capabilities:
- OpenTelemetry Native - OTLP collection for metrics, traces, logs
- Forensic Mode - Extended retention and 100% sampling during incidents
- Profile-Based Configuration - Default, forensic, and air-gap profiles
- Sealed-Mode Guards - Automatic exporter restrictions in air-gap
- Offline Bundles - Signed OTLP archives for compliance
2. Market Drivers
2.1 Target Segments
| Segment | Observability Requirements | Use Case |
|---|---|---|
| Platform Ops | Real-time monitoring | Operational health |
| Security Teams | Forensic investigation | Incident response |
| Compliance | Audit trails | SOC 2, FedRAMP |
| DevSecOps | Pipeline visibility | CI/CD debugging |
2.2 Competitive Positioning
Most vulnerability tools provide minimal observability. Stella Ops differentiates with:
- Built-in OpenTelemetry across all services
- Forensic mode with automatic retention extension
- Sealed-mode compatibility for air-gap
- Signed OTLP bundles for compliance archives
- Incident-triggered sampling escalation
3. Collector Topology
3.1 Architecture
┌─────────────────────────────────────────────────────┐
│                      Services                       │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
                      ▼
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │        Logs         │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
    ┌────────┐  ┌──────────┐         ┌────────┐
    │ Tempo  │  │Prometheus│         │  Loki  │
    └────────┘  └──────────┘         └────────┘
3.2 Collector Profiles
| Profile | Use Case | Configuration |
|---|---|---|
| default | Normal operation | 10% trace sampling, 30-day retention |
| forensic | Investigation mode | 100% sampling, 180-day retention |
| airgap | Offline deployment | File exporters, no external network |
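The profile table can be read as a small lookup structure. The following is a minimal sketch, not the module's actual configuration model: the `CollectorProfile` type, the `exporters` labels, and the rule that sealed mode takes precedence over an active incident are all assumptions made for illustration. Values the table does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectorProfile:
    """Hypothetical in-memory view of one row of the profile table."""
    name: str
    trace_sampling: Optional[float]  # fraction of normal traces kept; None = not stated
    retention_days: Optional[int]    # None = not stated in the table
    exporters: str                   # "network" or "file" (labels are assumptions)


# Values copied from the table; anything the table omits stays None.
PROFILES = {
    "default": CollectorProfile("default", 0.10, 30, "network"),
    "forensic": CollectorProfile("forensic", 1.00, 180, "network"),
    "airgap": CollectorProfile("airgap", None, None, "file"),
}


def select_profile(sealed: bool, incident_active: bool) -> CollectorProfile:
    """Pick a profile. Sealed mode winning over forensic is an assumed precedence."""
    if sealed:
        return PROFILES["airgap"]
    return PROFILES["forensic" if incident_active else "default"]
```

For example, `select_profile(sealed=False, incident_active=True)` yields the forensic profile with 100% sampling and 180-day retention.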
4. Metrics
4.1 Standard Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |
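To make the Histogram type and its labels concrete, here is a small stdlib-only sketch of how observations for `stellaops_request_duration_seconds` could be bucketed per `(service, endpoint)` label pair. It is not the SDK's implementation; the bucket bounds and the `DurationHistogram` class are assumptions chosen for illustration. Buckets are reported cumulatively, in the Prometheus style where each bucket counts observations less than or equal to its bound.

```python
import bisect
from collections import defaultdict

# Assumed bucket upper bounds in seconds; the real bounds are not specified here.
BUCKETS = (0.1, 0.5, 1.0, 2.0, 5.0)


class DurationHistogram:
    """Illustrative histogram keyed by the (service, endpoint) labels."""

    def __init__(self, bounds=BUCKETS):
        self.bounds = tuple(bounds)
        # One counter per bucket plus a final +Inf bucket, per label pair.
        self.counts = defaultdict(lambda: [0] * (len(self.bounds) + 1))
        self.sums = defaultdict(float)

    def observe(self, service: str, endpoint: str, seconds: float) -> None:
        key = (service, endpoint)
        # Index of the first bound >= seconds, so bounds are inclusive.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[key][idx] += 1
        self.sums[key] += seconds

    def cumulative(self, service: str, endpoint: str):
        """Return [(upper_bound, cumulative_count), ...] ending with +Inf."""
        total = 0
        out = []
        bounds = self.bounds + (float("inf"),)
        for bound, n in zip(bounds, self.counts[(service, endpoint)]):
            total += n
            out.append((bound, total))
        return out
```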
4.2 Module-Specific Metrics
Policy Engine:
- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`
Scanner:
- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`
Authority:
- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`
5. Traces
5.1 Trace Context
All services propagate W3C Trace Context headers, plus W3C Baggage for application attributes:
- `traceparent` header
- `tracestate` for vendor-specific data
- `baggage` for cross-service attributes
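As an illustration of what propagating `traceparent` involves, the sketch below parses the version-00 header layout from the W3C Trace Context specification (`version-traceid-parentid-flags`, all lowercase hex) and builds the header for a child span. The function and type names are hypothetical, and a real service would normally rely on its OpenTelemetry SDK propagator rather than hand-rolled parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    tp = TraceParent(*match.groups())
    # Version "ff" and all-zero IDs are invalid per the specification.
    if tp.version == "ff" or set(tp.trace_id) == {"0"} or set(tp.parent_id) == {"0"}:
        return None
    return tp


def child_header(parent: TraceParent) -> str:
    """Keep the trace ID and flags, and generate a fresh 8-byte span ID."""
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{parent.flags}"
```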
5.2 Span Conventions
| Span | Attributes | Description |
|---|---|---|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |
| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |
5.3 Sampling Strategy
Default (Tail Sampling):
- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%
Forensic Mode:
- All traces: 100%
- Extended attributes enabled
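The sampling rules above amount to a per-trace keep-or-drop decision made once the whole trace has been seen. The sketch below expresses that decision in plain Python. It is not the collector's tail-sampling processor: the `TraceSummary` type is hypothetical, and hashing the trace ID to get a deterministic 10% is one possible approach chosen here so that the same trace always gets the same verdict.

```python
import hashlib
from dataclasses import dataclass

SLOW_THRESHOLD_S = 2.0  # "slow" traces are those longer than 2 seconds
NORMAL_RATE = 0.10      # share of ordinary traces kept in the default profile


@dataclass
class TraceSummary:
    """Hypothetical summary available after a trace has completed."""
    trace_id: str
    has_error: bool
    duration_s: float


def keep_trace(trace: TraceSummary, forensic: bool = False) -> bool:
    """Apply the default tail-sampling rules, or keep everything in forensic mode."""
    if forensic or trace.has_error or trace.duration_s > SLOW_THRESHOLD_S:
        return True
    # Map the trace ID to a stable value in [0, 1) and keep the lowest 10%.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < NORMAL_RATE
```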
6. Logs
6.1 Structured Format
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
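A log line in this shape can be produced with a custom formatter. The following is a sketch using Python's standard `logging` module, not the services' actual logging stack. The `JsonFormatter` class and the convention of passing extra fields under a `fields` key are assumptions for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Python calls the level "WARNING"; the format above uses "warn".
_LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the common fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Structured fields (traceId, tenant, ...) arrive via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)
```

A caller would attach the formatter to a handler and log with `logger.info("Scan completed", extra={"fields": {"tenant": "acme-corp", "componentCount": 245}})`.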
6.2 Redaction
Attribute processors strip sensitive data:
- `authorization` headers
- `secretRef` values
- PII based on allowed-key policy
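The redaction rules can be pictured as a filter over an attribute map. The sketch below is a simplified stand-in for the collector's attribute processors, assuming flat attributes and case-insensitive key matching. The contents of `ALLOWED_KEYS` are hypothetical, since the actual allowed-key policy is not listed here.

```python
# Keys that are always removed, regardless of the allow list.
ALWAYS_STRIP = {"authorization", "secretref"}

# Hypothetical allow list; the real policy is defined elsewhere.
ALLOWED_KEYS = {"service", "tenant", "traceid", "spanid", "imagedigest"}


def redact(attributes: dict, allowed=frozenset(ALLOWED_KEYS)) -> dict:
    """Return a copy containing only allow-listed, non-sensitive attributes."""
    cleaned = {}
    for key, value in attributes.items():
        lowered = key.lower()
        if lowered in ALWAYS_STRIP:
            continue  # credentials never pass through
        if lowered not in allowed:
            continue  # anything not explicitly allowed is treated as potential PII
        cleaned[key] = value
    return cleaned
```

Using an allow list rather than a deny list means a newly introduced attribute is dropped by default until someone approves it.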
6.3 Log Levels
| Level | Purpose | Retention |
|---|---|---|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |
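As a worked example of the per-level retention above, the sketch below computes when a log entry becomes eligible for deletion. The function names are hypothetical, and actual enforcement is expected to happen in the storage layer rather than in application code.

```python
from datetime import datetime, timedelta, timezone

# Retention per level, in days, as listed in the table.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def expires_at(level: str, logged_at: datetime) -> datetime:
    """Return the moment an entry of this level passes its retention window."""
    return logged_at + timedelta(days=RETENTION_DAYS[level])


def is_expired(level: str, logged_at: datetime, now: datetime) -> bool:
    return now >= expires_at(level, logged_at)
```

An `info` entry logged at 2025-11-29T12:00:00Z expires at 2025-12-29T12:00:00Z, while an `error` entry logged at the same time is kept for 180 days.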
7. Forensic Mode
7.1 Activation
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"
# Check status
stella telemetry incident status
# Deactivate
stella telemetry incident stop --tenant acme-corp
7.2 Behavior Changes
| Aspect | Default | Forensic |
|---|---|---|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
7.3 Automatic Triggers
- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation
- Manual operator trigger
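One way to picture the trigger list is as a check over a snapshot of platform signals. This is only a sketch: the signal names (`orchestrator_escalated`, `policy_violations`, `breaker_open`, `operator_requested`) are hypothetical, and the violation threshold is a parameter because no value is specified here.

```python
from enum import Enum


class Trigger(Enum):
    ORCHESTRATOR_ESCALATION = "orchestrator incident escalation"
    POLICY_VIOLATIONS = "policy violation threshold exceeded"
    CIRCUIT_BREAKER = "circuit breaker activation"
    MANUAL = "manual operator trigger"


def active_triggers(signals: dict, violation_threshold: int) -> list:
    """Return every trigger that currently applies, in table order."""
    found = []
    if signals.get("orchestrator_escalated"):
        found.append(Trigger.ORCHESTRATOR_ESCALATION)
    if signals.get("policy_violations", 0) > violation_threshold:
        found.append(Trigger.POLICY_VIOLATIONS)
    if signals.get("breaker_open"):
        found.append(Trigger.CIRCUIT_BREAKER)
    if signals.get("operator_requested"):
        found.append(Trigger.MANUAL)
    return found


def should_enter_forensic(signals: dict, violation_threshold: int) -> bool:
    """Any single trigger is enough to activate forensic mode."""
    return bool(active_triggers(signals, violation_threshold))
```

Returning the full list of triggers, rather than a bare boolean, leaves room to record why forensic mode was entered.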
8. Implementation Strategy
8.1 Phase 1: Core Telemetry (Complete)
- OpenTelemetry SDK integration
- Metrics exporter (Prometheus)
- Trace exporter (Tempo/Jaeger)
- Log exporter (Loki)
8.2 Phase 2: Advanced Features (Complete)
- Tail sampling configuration
- Attribute redaction
- Profile-based configuration
- Dashboard provisioning
8.3 Phase 3: Forensic & Offline (In Progress)
- Forensic mode toggle
- Forensic bundle export (TELEM-FOR-50-001)
- Sealed-mode guards (TELEM-SEAL-51-001)
- Offline bundle signing (TELEM-SIGN-52-001)
9. API Surface
9.1 Configuration
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |
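A client call to the profile download endpoint might be prepared as follows. This sketch only builds the request object and does not send it. The base URL is a placeholder, and passing the token as an `Authorization: Bearer` header is an assumption about how Authority-issued tokens carrying the `telemetry:read` scope are presented.

```python
import urllib.request
from urllib.parse import quote


def profile_config_request(base_url: str, name: str, token: str) -> urllib.request.Request:
    """Build a GET request for /telemetry/config/profile/{name}."""
    # Percent-encode the profile name so it cannot alter the path.
    path = f"/telemetry/config/profile/{quote(name, safe='')}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="GET",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
```

The request could then be sent with `urllib.request.urlopen(req)` against a reachable deployment.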
9.2 Incident Mode
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |
9.3 Exports
| Endpoint | Method | Scope | Description |
|---|---|---|---|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
10. Offline Support
10.1 Bundle Structure
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
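The layout suggests that `manifest.json` describes the bundle contents and `manifest.sig` signs that description. The manifest schema is not specified here, so the sketch below assumes a simple mapping from relative path to SHA-256 digest and shows how such a manifest could be built and checked. Signature creation and verification are deliberately left out.

```python
import hashlib
from pathlib import Path


def build_manifest(bundle_dir: Path) -> dict:
    """Hash every payload file; the {"files": {path: sha256}} shape is assumed."""
    files = {}
    for path in sorted(bundle_dir.rglob("*")):
        rel = path.relative_to(bundle_dir).as_posix()
        # The manifest and its signature are not part of the hashed payload.
        if not path.is_file() or rel == "manifest.json" or rel.startswith("signatures/"):
            continue
        files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": files}


def verify_manifest(bundle_dir: Path, manifest: dict) -> list:
    """Return the paths that are missing, unexpected, or have a different digest."""
    current = build_manifest(bundle_dir)["files"]
    expected = manifest["files"]
    return sorted(
        p for p in expected.keys() | current.keys()
        if expected.get(p) != current.get(p)
    )
```

An empty list from `verify_manifest` means the files on disk match the manifest. Trusting the manifest itself still depends on checking `signatures/manifest.sig`.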
10.2 Sealed-Mode Guards
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
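The three steps in the guard (disable non-loopback exporters, warn, fall back to files) can be sketched in Python as shown below. This illustrates the logic only and is not the `StellaOps.Telemetry.Core` implementation. The fallback path is a placeholder, and exporter endpoints are assumed to be full URLs with a scheme.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback(endpoint: str) -> bool:
    """True if the endpoint URL points at localhost or a loopback IP."""
    host = urlsplit(endpoint).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Other hostnames are not resolved here and are treated as external.
        return False


def apply_sealed_mode(exporters, sealed, fallback="file:///var/lib/telemetry/export"):
    """Return (effective_exporters, warnings). The fallback path is hypothetical."""
    if not sealed:
        return list(exporters), []
    allowed = [e for e in exporters if is_loopback(e)]
    blocked = [e for e in exporters if not is_loopback(e)]
    warnings = [
        f"sealed mode: exporter {e} disabled; use a loopback or file exporter"
        for e in blocked
    ]
    if blocked:
        allowed.append(fallback)  # keep telemetry flowing to local files
    return allowed, warnings
```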
11. Dashboards & Alerts
11.1 Standard Dashboards
| Dashboard | Purpose | Panels |
|---|---|---|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |
11.2 Alert Rules
| Alert | Condition | Severity |
|---|---|---|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 > 5m | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
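The alert conditions can be read as predicates over a snapshot of current values. The sketch below encodes the four rules that way. It is illustrative rather than an actual alerting configuration: the snapshot keys (`error_rate`, `scan_p95_seconds`, `queue_depth`, `breaker_open`) are assumed names, with the error rate expressed as a fraction and the scan p95 in seconds.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    fires: Callable[[dict], bool]


# Thresholds taken from the table; metric key names are assumptions.
RULES = [
    AlertRule("High Error Rate", "critical", lambda m: m["error_rate"] > 0.05),
    AlertRule("Slow Scans", "warning", lambda m: m["scan_p95_seconds"] > 5 * 60),
    AlertRule("Queue Backlog", "warning", lambda m: m["queue_depth"] > 1000),
    AlertRule("Circuit Open", "critical", lambda m: m["breaker_open"] == 1),
]


def firing(metrics: dict) -> list:
    """Return (name, severity) for every rule whose condition holds."""
    return [(rule.name, rule.severity) for rule in RULES if rule.fires(metrics)]
```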
12. Security Considerations
12.1 Data Protection
- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced
12.2 Access Control
- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access
13. Related Documentation
| Resource | Location |
|---|---|
| Telemetry architecture | docs/modules/telemetry/architecture.md |
| Collector configuration | docs/modules/telemetry/collector-config.md |
| Dashboard provisioning | docs/modules/telemetry/dashboards.md |
14. Sprint Mapping
- Primary Sprint: SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- Related Sprints:
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md
Key Task IDs:
- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)
15. Success Metrics
| Metric | Target |
|---|---|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% of error traces retained |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |
Last updated: 2025-11-29