true the date

2025-11-30 19:23:21 +02:00
parent 71e9a56cfd
commit 0bef705bcc
14 changed files with 0 additions and 0 deletions
--- a/docs/product-advisories/28-Nov-2025
+++ b/docs/product-advisories/28-Nov-2025
@@ -0,0 +1,373 @@
+# Telemetry and Observability Patterns
+
+**Version:** 1.0
+**Date:** 2025-11-29
+**Status:** Canonical
+
+This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.
+
+---
+
+## 1. Executive Summary
+
+The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:
+
+- **OpenTelemetry Native** - OTLP collection for metrics, traces, and logs
+- **Forensic Mode** - Extended retention and 100% sampling during incidents
+- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
+- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gapped deployments
+- **Offline Bundles** - Signed OTLP archives for compliance
+
+---
+
+## 2. Market Drivers
+
+### 2.1 Target Segments
+
+| Segment | Observability Requirements | Use Case |
+|---------|---------------------------|----------|
+| **Platform Ops** | Real-time monitoring | Operational health |
+| **Security Teams** | Forensic investigation | Incident response |
+| **Compliance** | Audit trails | SOC 2, FedRAMP |
+| **DevSecOps** | Pipeline visibility | CI/CD debugging |
+
+### 2.2 Competitive Positioning
+
+Most vulnerability-management tools ship with minimal built-in observability. Stella Ops differentiates itself with:
+- **Built-in OpenTelemetry** across all services
+- **Forensic mode** with automatic retention extension
+- **Sealed-mode compatibility** for air-gapped deployments
+- **Signed OTLP bundles** for compliance archives
+- **Incident-triggered sampling escalation**
+
+---
+
+## 3. Collector Topology
+
+### 3.1 Architecture
+
+```
+┌─────────────────────────────────────────────────────┐
+│                    Services                          │
+│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
+└─────────────────────┬───────────────────────────────┘
+                      │ OTLP/gRPC
+                      ▼
+┌─────────────────────────────────────────────────────┐
+│              OpenTelemetry Collector                 │
+│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
+│  │ Traces  │  │ Metrics  │  │       Logs          │ │
+│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
+│       │ Tail       │ Batch           │ Redaction   │
+│       │ Sampling   │                 │             │
+└───────┼────────────┼─────────────────┼─────────────┘
+        │            │                 │
+        ▼            ▼                 ▼
+   ┌────────┐   ┌──────────┐      ┌────────┐
+   │ Tempo  │   │Prometheus│      │  Loki  │
+   └────────┘   └──────────┘      └────────┘
+```
+
+### 3.2 Collector Profiles
+
+| Profile | Use Case | Configuration |
+|---------|----------|---------------|
+| **default** | Normal operation | 10% baseline trace sampling (error and slow traces always kept), 30-day retention |
+| **forensic** | Investigation mode | 100% sampling, 180-day retention |
+| **airgap** | Offline deployment | File exporters, no external network |
+
+---
+
+## 4. Metrics
+
+### 4.1 Standard Metrics
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
+| `stellaops_request_total` | Counter | service, status | Request count |
+| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
+| `stellaops_queue_depth` | Gauge | queue | Queue depth |
+| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |
+
+### 4.2 Module-Specific Metrics
+
+**Policy Engine:**
+- `policy_run_seconds{mode,tenant,policy}`
+- `policy_rules_fired_total{policy,rule}`
+- `policy_vex_overrides_total{policy,vendor}`
+
+**Scanner:**
+- `scanner_sbom_components_total{ecosystem}`
+- `scanner_vulnerabilities_found_total{severity}`
+- `scanner_attestations_logged_total`
+
+**Authority:**
+- `authority_token_issued_total{grant_type,audience}`
+- `authority_token_rejected_total{reason}`
+- `authority_dpop_nonce_miss_total`
+
+---
+
+## 5. Traces
+
+### 5.1 Trace Context
+
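+All services propagate context using the W3C Trace Context and W3C Baggage headers:
+- `traceparent` for the trace ID, parent span ID, and sampling flag
+- `tracestate` for vendor-specific data
+- `baggage` (W3C Baggage, a separate specification) for cross-service attributes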
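As an illustration of the `traceparent` header above, the following sketch parses a version `00` header as laid out in the W3C Trace Context specification (`version-traceid-parentid-flags`, lowercase hex). The names `TraceParent` and `parse_traceparent` are hypothetical and are not taken from the Stella Ops codebase; later header versions are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A service that receives a malformed header would typically start a new trace rather than reject the request.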
+
+### 5.2 Span Conventions
+
+| Span | Attributes | Description |
+|------|------------|-------------|
+| `http.request` | url, method, status | HTTP handler |
+| `db.query` | collection, operation | MongoDB ops |
+| `policy.evaluate` | policyId, version | Policy run |
+| `scan.image` | imageRef, digest | Image scan |
+| `sign.dsse` | predicateType | DSSE signing |
+
+### 5.3 Sampling Strategy
+
+**Default (Tail Sampling):**
+- Error traces: 100%
+- Slow traces (>2s): 100%
+- Normal traces: 10%
+
+**Forensic Mode:**
+- All traces: 100%
+- Extended attributes enabled
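The sampling rules above can be sketched as a single decision function. This is a minimal illustration, not the collector's actual tail-sampling configuration: `TraceSummary` and `keep_trace` are hypothetical names, and hashing the trace ID to choose the 10% baseline is an assumption made so that the decision is reproducible.

```python
import hashlib
from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 2.0  # "Slow traces (>2s)" from the list above
BASELINE_RATE = 0.10          # "Normal traces: 10%"


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str
    has_error: bool
    duration_seconds: float


def keep_trace(trace: TraceSummary, forensic: bool = False) -> bool:
    """Decide whether a completed trace is kept."""
    if forensic or trace.has_error:
        return True
    if trace.duration_seconds > SLOW_THRESHOLD_SECONDS:
        return True
    # Map the trace ID to a stable value in [0, 1) so the same trace
    # always gets the same decision, regardless of which replica sees it.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATE
```

Because the decision needs the error status and total duration, it can only be made once the whole trace has been seen, which is why this is tail sampling rather than head sampling.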
+
+---
+
+## 6. Logs
+
+### 6.1 Structured Format
+
+```json
+{
+  "timestamp": "2025-11-29T12:00:00.123Z",
+  "level": "info",
+  "message": "Scan completed",
+  "service": "scanner",
+  "traceId": "abc123...",
+  "spanId": "def456...",
+  "tenant": "acme-corp",
+  "imageDigest": "sha256:...",
+  "componentCount": 245,
+  "vulnerabilityCount": 12
+}
+```
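For illustration, a record in the shape shown above could be produced with Python's standard `logging` module. This is a sketch only: the Stella Ops services may use a different runtime and logging stack, and the `JsonFormatter` class and the `fields` attribute are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone

# The example record uses short lowercase level names.
_LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Structured fields are passed as extra={"fields": {...}} by the caller.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)
```

A caller would attach the formatter to a handler and log with `logger.info("Scan completed", extra={"fields": {"tenant": "acme-corp", "componentCount": 245}})`.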
+
+### 6.2 Redaction
+
+Attribute processors in the collector strip sensitive data before export:
+- `authorization` headers
+- `secretRef` values
+- PII, filtered through an allow-list of permitted attribute keys
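The redaction rules above can be sketched as a pure function over an attribute map. The allow-list contents and the `[REDACTED]` marker are assumptions for illustration; the source does not specify them. Masking `authorization` and `secretRef` rather than dropping them is a design choice in this sketch that keeps evidence that the attribute was present.

```python
from typing import Any, Dict, FrozenSet

REDACTED = "[REDACTED]"

# Always masked, matched case-insensitively (from the list above).
ALWAYS_REDACT: FrozenSet[str] = frozenset({"authorization", "secretref"})

# Hypothetical allow-list; the real policy is not given in this advisory.
ALLOWED_KEYS: FrozenSet[str] = frozenset({
    "service", "tenant", "traceid", "spanid",
    "imagedigest", "componentcount", "vulnerabilitycount",
})


def redact(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that is safe to export."""
    cleaned: Dict[str, Any] = {}
    for key, value in attributes.items():
        lowered = key.lower()
        if lowered in ALWAYS_REDACT:
            cleaned[key] = REDACTED
        elif lowered in ALLOWED_KEYS:
            cleaned[key] = value
        # Any other key is dropped, since unknown attributes may carry PII.
    return cleaned
```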
+
+### 6.3 Log Levels
+
+| Level | Purpose | Retention |
+|-------|---------|-----------|
+| `error` | Failures | 180 days |
+| `warn` | Anomalies | 90 days |
+| `info` | Operations | 30 days |
+| `debug` | Development | 7 days |
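The retention table above maps directly to an expiry calculation. The sketch below assumes retention is counted in whole days from the record's timestamp; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Retention per level, in days, copied from the table above.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def expires_at(level: str, timestamp: datetime) -> datetime:
    """Return the moment a record of this level becomes eligible for deletion."""
    return timestamp + timedelta(days=RETENTION_DAYS[level])


def is_expired(level: str, timestamp: datetime, now: datetime) -> bool:
    return now >= expires_at(level, timestamp)
```

An unknown level raises `KeyError`, which makes a misconfigured level visible instead of silently applying a default.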
+
+---
+
+## 7. Forensic Mode
+
+### 7.1 Activation
+
+```bash
+# Activate forensic mode for tenant
+stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"
+
+# Check status
+stella telemetry incident status
+
+# Deactivate
+stella telemetry incident stop --tenant acme-corp
+```
+
+### 7.2 Behavior Changes
+
+| Aspect | Default | Forensic |
+|--------|---------|----------|
+| Trace sampling | 10% | 100% |
+| Log level | info | debug |
+| Retention | 30 days | 180 days |
+| Attributes | Standard | Extended |
+| Export interval | 1 minute | 10 seconds |
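One way to read the table above is as two immutable settings objects, with the active one chosen per tenant. The sketch assumes forensic mode is scoped to a tenant, as the `--tenant` flag in section 7.1 suggests; `TelemetryProfile` and `effective_profile` are hypothetical names.

```python
from dataclasses import dataclass
from typing import AbstractSet


@dataclass(frozen=True)
class TelemetryProfile:
    trace_sampling: float          # fraction of normal traces kept
    log_level: str
    retention_days: int
    extended_attributes: bool
    export_interval_seconds: int


# Values copied from the table above.
DEFAULT = TelemetryProfile(0.10, "info", 30, False, 60)
FORENSIC = TelemetryProfile(1.0, "debug", 180, True, 10)


def effective_profile(tenant: str, active_incidents: AbstractSet[str]) -> TelemetryProfile:
    """Return the forensic profile for tenants with an open incident."""
    return FORENSIC if tenant in active_incidents else DEFAULT
```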
+
+### 7.3 Automatic Triggers
+
+- Orchestrator incident escalation
+- Policy violation threshold exceeded
+- Circuit breaker activation
+- Manual operator trigger
+
+---
+
+## 8. Implementation Strategy
+
+### 8.1 Phase 1: Core Telemetry (Complete)
+
+- [x] OpenTelemetry SDK integration
+- [x] Metrics exporter (Prometheus)
+- [x] Trace exporter (Tempo/Jaeger)
+- [x] Log exporter (Loki)
+
+### 8.2 Phase 2: Advanced Features (Complete)
+
+- [x] Tail sampling configuration
+- [x] Attribute redaction
+- [x] Profile-based configuration
+- [x] Dashboard provisioning
+
+### 8.3 Phase 3: Forensic & Offline (In Progress)
+
+- [x] Forensic mode toggle
+- [ ] Forensic bundle export (TELEM-FOR-50-001)
+- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
+- [ ] Offline bundle signing (TELEM-SIGN-52-001)
+
+---
+
+## 9. API Surface
+
+### 9.1 Configuration
+
+| Endpoint | Method | Scope | Description |
+|----------|--------|-------|-------------|
+| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
+| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |
+
+### 9.2 Incident Mode
+
+| Endpoint | Method | Scope | Description |
+|----------|--------|-------|-------------|
+| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
+| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |
+
+### 9.3 Exports
+
+| Endpoint | Method | Scope | Description |
+|----------|--------|-------|-------------|
+| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
+
+---
+
+## 10. Offline Support
+
+### 10.1 Bundle Structure
+
+```
+telemetry-bundle/
+├── otlp/
+│   ├── metrics.pb
+│   ├── traces.pb
+│   └── logs.pb
+├── config/
+│   ├── collector.yaml
+│   └── dashboards/
+├── manifest.json
+└── signatures/
+    └── manifest.sig
+```
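To make the role of `manifest.json` concrete, the sketch below builds a manifest of SHA-256 digests for every file in the bundle except the manifest itself and the `signatures/` directory. The manifest schema shown here is an assumption, since the advisory does not define it, and signing is left out because it depends on key management that is not described.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(bundle_dir: Path) -> dict:
    """Map each bundled file's relative path to its SHA-256 digest."""
    files = {}
    for path in sorted(bundle_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(bundle_dir)
        # The manifest and its signature cannot describe themselves.
        if rel.as_posix() == "manifest.json" or rel.parts[0] == "signatures":
            continue
        files[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": files}  # hypothetical schema


def write_manifest(bundle_dir: Path) -> Path:
    manifest_path = bundle_dir / "manifest.json"
    text = json.dumps(build_manifest(bundle_dir), indent=2, sort_keys=True)
    manifest_path.write_text(text + "\n", encoding="utf-8")
    return manifest_path
```

Sorting the paths and keys keeps the manifest byte-for-byte reproducible, which matters if a signature is later computed over it.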
+
+### 10.2 Sealed-Mode Guards
+
+```csharp
+// StellaOps.Telemetry.Core enforces IEgressPolicy
+if (sealedMode.IsActive)
+{
+    // Disable non-loopback exporters
+    // Emit structured warning with remediation
+    // Fall back to file-based export
+}
+```
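The guard behaviour outlined in the comments above (disable non-loopback exporters, fall back to file export) can be sketched in Python as follows. This is not the `StellaOps.Telemetry.Core` implementation; the function names and the fallback path are hypothetical, and hostnames other than `localhost` are treated as external without any DNS lookup.

```python
import ipaddress
from typing import List, Sequence, Tuple
from urllib.parse import urlsplit

# Hypothetical fallback location for file-based export.
FILE_FALLBACK = "file:///var/lib/stellaops/telemetry/otlp"


def is_local_endpoint(endpoint: str) -> bool:
    """True for file exporters and for loopback network endpoints."""
    parts = urlsplit(endpoint)
    if parts.scheme == "file":
        return True
    host = parts.hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name: assume it is external


def apply_sealed_mode(endpoints: Sequence[str], sealed: bool) -> Tuple[List[str], List[str]]:
    """Return (allowed, blocked) exporter endpoints."""
    if not sealed:
        return list(endpoints), []
    allowed = [e for e in endpoints if is_local_endpoint(e)]
    blocked = [e for e in endpoints if not is_local_endpoint(e)]
    if blocked and FILE_FALLBACK not in allowed:
        allowed.append(FILE_FALLBACK)  # keep telemetry flowing to disk
    return allowed, blocked
```

The `blocked` list is returned so the caller can emit the structured warning with remediation guidance.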
+
+---
+
+## 11. Dashboards & Alerts
+
+### 11.1 Standard Dashboards
+
+| Dashboard | Purpose | Panels |
+|-----------|---------|--------|
+| Platform Health | Overall status | Request rate, error rate, latency |
+| Scan Operations | Scanner metrics | Scan rate, duration, findings |
+| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
+| Job Orchestration | Queue metrics | Queue depth, job latency, failures |
+
+### 11.2 Alert Rules
+
+| Alert | Condition | Severity |
+|-------|-----------|----------|
+| High Error Rate | error_rate > 5% | critical |
+| Slow Scans | p95 scan duration > 5m | warning |
+| Queue Backlog | depth > 1000 | warning |
+| Circuit Open | breaker_open = 1 | critical |
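The alert conditions above can be expressed as predicates over a snapshot of metric values. This sketch is illustrative only: the metric keys are hypothetical, `5m` is read as five minutes, and the p95 value is assumed to refer to scan duration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    condition: Callable[[Metrics], bool]


# Thresholds copied from the table above.
RULES = [
    AlertRule("High Error Rate", "critical", lambda m: m["error_rate"] > 0.05),
    AlertRule("Slow Scans", "warning", lambda m: m["scan_p95_seconds"] > 5 * 60),
    AlertRule("Queue Backlog", "warning", lambda m: m["queue_depth"] > 1000),
    AlertRule("Circuit Open", "critical", lambda m: m["breaker_open"] == 1),
]


def firing(metrics: Metrics) -> List[Tuple[str, str]]:
    """Return (name, severity) for every rule whose condition holds."""
    return [(rule.name, rule.severity) for rule in RULES if rule.condition(metrics)]
```

All comparisons are strict, so a queue depth of exactly 1000 does not fire the backlog alert.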
+
+---
+
+## 12. Security Considerations
+
+### 12.1 Data Protection
+
+- Sensitive attributes are redacted at collection time
+- Data is encrypted in transit (TLS)
+- Data is encrypted at rest by the storage layer
+- Retention policies are enforced
+
+### 12.2 Access Control
+
+- Authority scopes for API access
+- Tenant isolation in queries
+- Audit logging for forensic access
+
+---
+
+## 13. Related Documentation
+
+| Resource | Location |
+|----------|----------|
+| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
+| Collector configuration | `docs/modules/telemetry/collector-config.md` |
+| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |
+
+---
+
+## 14. Sprint Mapping
+
+- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
+- **Related Sprints:**
+  - SPRINT_0181_0001_0002_telemetry_forensic.md
+  - SPRINT_0182_0001_0003_telemetry_offline.md
+
+**Key Task IDs:**
+- `TELEM-CORE-40-001` - SDK integration (DONE)
+- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
+- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
+- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
+- `TELEM-SIGN-52-001` - Bundle signing (TODO)
+
+---
+
+## 15. Success Metrics
+
+| Metric | Target |
+|--------|--------|
+| Collection overhead | < 2% CPU |
+| Trace sampling accuracy | 100% of error traces retained |
+| Log ingestion latency | < 5 seconds |
+| Forensic activation time | < 30 seconds |
+| Bundle export time | < 5 minutes (24h data) |
+
+---
+
+*Last updated: 2025-11-29*