true the date
StellaOps Bot
2025-11-30 19:23:21 +02:00
parent 71e9a56cfd
commit 0bef705bcc
14 changed files with 0 additions and 0 deletions

@@ -0,0 +1,373 @@
# Telemetry and Observability Patterns
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical

This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.

---
## 1. Executive Summary
The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:
- **OpenTelemetry Native** - OTLP collection for metrics, traces, logs
- **Forensic Mode** - Extended retention and 100% sampling during incidents
- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gapped deployments (planned; tracked in Section 8.3)
- **Offline Bundles** - Signed OTLP archives for compliance (planned; tracked in Section 8.3)
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Observability Requirements | Use Case |
|---------|---------------------------|----------|
| **Platform Ops** | Real-time monitoring | Operational health |
| **Security Teams** | Forensic investigation | Incident response |
| **Compliance** | Audit trails | SOC 2, FedRAMP |
| **DevSecOps** | Pipeline visibility | CI/CD debugging |
### 2.2 Competitive Positioning
Most vulnerability-scanning tools ship with little built-in observability. Stella Ops differentiates with:
- **Built-in OpenTelemetry** across all services
- **Forensic mode** with automatic retention extension
- **Sealed-mode compatibility** for air-gapped deployments
- **Signed OTLP bundles** for compliance archives
- **Incident-triggered sampling** escalation
---
## 3. Collector Topology
### 3.1 Architecture
```
┌─────────────────────────────────────────────────────┐
│                      Services                       │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└──────────────────────────┬──────────────────────────┘
                           │ OTLP/gRPC
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │        Logs         │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
    ┌────────┐  ┌──────────┐         ┌────────┐
    │ Tempo  │  │Prometheus│         │  Loki  │
    └────────┘  └──────────┘         └────────┘
```
### 3.2 Collector Profiles
| Profile | Use Case | Configuration |
|---------|----------|---------------|
| **default** | Normal operation | 10% trace sampling, 30-day retention |
| **forensic** | Investigation mode | 100% sampling, 180-day retention |
| **airgap** | Offline deployment | File exporters, no external network |
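
The profile table can be read as a small lookup structure. The following is a minimal Python sketch under that reading; the class name, field names, and `get_profile` helper are illustrative assumptions, not the module's actual configuration schema. The air-gap row gives no sampling ratio or retention, so those fields are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectorProfile:
    """Hypothetical in-memory form of one row of the profile table."""
    name: str
    trace_sample_ratio: Optional[float]  # None where the table gives no value
    retention_days: Optional[int]        # None where the table gives no value
    file_exporters_only: bool


# Values are copied from the table above.
PROFILES = {
    "default": CollectorProfile("default", 0.10, 30, False),
    "forensic": CollectorProfile("forensic", 1.00, 180, False),
    "airgap": CollectorProfile("airgap", None, None, True),
}


def get_profile(name: str) -> CollectorProfile:
    """Return the named profile, or raise ValueError for an unknown name."""
    try:
        return PROFILES[name]
    except KeyError:
        raise ValueError(f"unknown collector profile: {name!r}") from None
```
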
---
## 4. Metrics
### 4.1 Standard Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |
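
One way to keep emitted series consistent with this table is to validate label names before recording a sample. The sketch below is an illustration only, using an in-memory `Counter` instead of a real metrics client; `series_key` and `request_counts` are hypothetical names.

```python
from collections import Counter

# Label names per metric, copied from the table above.
METRIC_LABELS = {
    "stellaops_request_duration_seconds": ("service", "endpoint"),
    "stellaops_request_total": ("service", "status"),
    "stellaops_active_jobs": ("tenant", "jobType"),
    "stellaops_queue_depth": ("queue",),
    "stellaops_scan_duration_seconds": ("tenant",),
}


def series_key(metric: str, labels: dict) -> str:
    """Build a Prometheus-style series identifier, rejecting undeclared labels."""
    expected = METRIC_LABELS[metric]
    if set(labels) != set(expected):
        raise ValueError(f"{metric} expects labels {expected}, got {sorted(labels)}")
    rendered = ",".join(f'{name}="{labels[name]}"' for name in expected)
    return f"{metric}{{{rendered}}}"


# Usage: count one request for the scanner service.
request_counts = Counter()
request_counts[series_key("stellaops_request_total",
                          {"service": "scanner", "status": "200"})] += 1
```

Rejecting unexpected labels at the call site is a cheap guard against accidental label drift and high-cardinality series.
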
### 4.2 Module-Specific Metrics
**Policy Engine:**

- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`

**Scanner:**

- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`

**Authority:**

- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`
---
## 5. Traces
### 5.1 Trace Context
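All services propagate W3C Trace Context headers, plus W3C Baggage for cross-service attributes:
- `traceparent` header (trace ID, parent span ID, and trace flags)
- `tracestate` header for vendor-specific data
- `baggage` header for cross-service attributes (defined by the separate W3C Baggage specification)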
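
To make the propagation step concrete, here is a minimal Python sketch that parses a version-00 `traceparent` value and derives the header for an outgoing call. It follows the W3C Trace Context layout (`version-traceid-parentid-flags`); the function names are illustrative, and the sketch deliberately ignores `tracestate`, `baggage`, and future header versions.

```python
import re
import secrets

# Version 00 layout from the W3C Trace Context specification:
# "00" "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header: str) -> str:
    """Keep the trace ID and sampled flag, but mint a new span ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        # No valid incoming context: start a new, unsampled trace.
        return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-00"
    trace_id, _, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"
```
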
### 5.2 Span Conventions
| Span | Attributes | Description |
|------|------------|-------------|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |

| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |
### 5.3 Sampling Strategy
**Default (Tail Sampling):**

- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%

**Forensic Mode:**

- All traces: 100%
- Extended attributes enabled
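
The sampling rules above can be sketched as a single decision function applied once a trace is complete. This is an illustration of the policy only, not the collector's tail-sampling configuration; `keep_trace` and the hash-based bucketing are assumptions made for the example.

```python
import hashlib

SLOW_THRESHOLD_SECONDS = 2.0  # "Slow traces (>2s)" from the list above
BASELINE_RATIO = 0.10         # "Normal traces: 10%"


def keep_trace(trace_id: str, has_error: bool, duration_seconds: float,
               forensic: bool = False) -> bool:
    """Decide whether a completed trace is kept."""
    if forensic or has_error or duration_seconds > SLOW_THRESHOLD_SECONDS:
        return True
    # Hash the trace ID into [0, 1) so that every collector replica makes
    # the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < BASELINE_RATIO
```

Deriving the decision from the trace ID, rather than from a random draw, keeps the outcome stable if the same trace is evaluated more than once.
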
---
## 6. Logs
### 6.1 Structured Format
```json
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
```
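
As an illustration of how a service might emit records in this shape, the following Python sketch uses the standard `logging` module. The `JsonLogFormatter` class and the convention of passing contextual fields under a single `fields` key in `extra=` are assumptions of this example, not part of the Stella Ops codebase.

```python
import json
import logging
from datetime import datetime, timezone

# Python calls the level WARNING; the format above uses "warn".
_LEVEL_NAMES = {"WARNING": "warn"}


class JsonLogFormatter(logging.Formatter):
    """Render log records as single-line JSON with the fields shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Contextual fields (traceId, tenant, ...) arrive via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


# Usage: attach the formatter and log one structured event.
handler = logging.StreamHandler()
handler.setFormatter(JsonLogFormatter("scanner"))
logger = logging.getLogger("stellaops.scanner")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Scan completed",
            extra={"fields": {"tenant": "acme-corp", "componentCount": 245}})
```
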
### 6.2 Redaction
Collector attribute processors strip sensitive data before export:
- `authorization` headers
- `secretRef` values
- PII, filtered by an allow-list of permitted attribute keys
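
The redaction rules can be sketched as a filter over an attribute dictionary. The allow-list below is hypothetical (the actual policy is not given in this document), and the sketch masks values rather than dropping keys so that redaction stays visible downstream, which is also an assumption.

```python
REDACTED = "[REDACTED]"

# Keys that are always masked, taken from the list above (lower-cased).
ALWAYS_REDACT = {"authorization", "secretref"}

# Hypothetical allow-list; the real policy is not specified in this document.
ALLOWED_KEYS = {"service", "tenant", "traceid", "spanid", "imagedigest",
                "componentcount", "vulnerabilitycount"}


def redact_attributes(attributes: dict) -> dict:
    """Return a copy with sensitive or non-allow-listed values masked."""
    cleaned = {}
    for key, value in attributes.items():
        normalized = key.lower()
        if normalized in ALWAYS_REDACT or normalized not in ALLOWED_KEYS:
            cleaned[key] = REDACTED
        else:
            cleaned[key] = value
    return cleaned
```

Because anything outside the allow-list is masked by default, a newly introduced attribute stays hidden until someone explicitly permits it.
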
### 6.3 Log Levels
| Level | Purpose | Retention |
|-------|---------|-----------|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |
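
A retention check based on this table might look like the following sketch. `RETENTION_DAYS` mirrors the table; `is_expired` is a hypothetical helper, and how expiry is actually enforced by the storage layer is not described here.

```python
from datetime import datetime, timedelta, timezone

# Retention windows in days, copied from the table above.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def is_expired(level: str, logged_at: datetime, now: datetime) -> bool:
    """True once a record is older than the retention window for its level."""
    return now - logged_at > timedelta(days=RETENTION_DAYS[level])


# Usage: an eight-day-old record has outlived "debug" retention but not "error".
now = datetime(2025, 11, 29, tzinfo=timezone.utc)
eight_days_ago = now - timedelta(days=8)
debug_expired = is_expired("debug", eight_days_ago, now)
error_expired = is_expired("error", eight_days_ago, now)
```
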
---
## 7. Forensic Mode
### 7.1 Activation
```bash
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"
# Check status
stella telemetry incident status
# Deactivate
stella telemetry incident stop --tenant acme-corp
```
### 7.2 Behavior Changes
| Aspect | Default | Forensic |
|--------|---------|----------|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
### 7.3 Automatic Triggers
- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation
- Manual operator trigger
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Telemetry (Complete)
- [x] OpenTelemetry SDK integration
- [x] Metrics exporter (Prometheus)
- [x] Trace exporter (Tempo/Jaeger)
- [x] Log exporter (Loki)
### 8.2 Phase 2: Advanced Features (Complete)
- [x] Tail sampling configuration
- [x] Attribute redaction
- [x] Profile-based configuration
- [x] Dashboard provisioning
### 8.3 Phase 3: Forensic & Offline (In Progress)
- [x] Forensic mode toggle
- [ ] Forensic bundle export (TELEM-FOR-50-001)
- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
- [ ] Offline bundle signing (TELEM-SIGN-52-001)
---
## 9. API Surface
### 9.1 Configuration
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |
### 9.2 Incident Mode
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |
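
The sketch below shows how a client could construct the toggle request using only the Python standard library. The endpoint path and method come from the table; the JSON body fields (`tenant`, `enabled`, `reason`), the bearer-token header, and the base URL are assumptions for illustration, since the request schema is not specified in this document. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_incident_request(base_url: str, token: str, tenant: str,
                           enabled: bool, reason: str = "") -> Request:
    """Build (but do not send) a POST to /telemetry/incidents/mode."""
    # Hypothetical payload shape; the real schema is not documented here.
    body = {"tenant": tenant, "enabled": enabled, "reason": reason}
    return Request(
        url=f"{base_url.rstrip('/')}/telemetry/incidents/mode",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # The token is assumed to carry the telemetry:admin scope.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; that step is omitted because the response format is likewise unspecified.
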
### 9.3 Exports
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
---
## 10. Offline Support
### 10.1 Bundle Structure
```
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
```
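
The role of `manifest.json` can be illustrated with a digest manifest over the bundle's payload files. The manifest layout used here (a `files` map of relative path to SHA-256 hex digest) is an assumption, as the document does not define it, and verifying `signatures/manifest.sig` is out of scope for this sketch.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(bundle_root: Path) -> dict:
    """Map each payload file (relative POSIX path) to its SHA-256 digest."""
    entries = {}
    for path in sorted(bundle_root.rglob("*")):
        relative = path.relative_to(bundle_root)
        # Skip directories, the manifest itself, and detached signatures.
        if not path.is_file() or relative.parts[0] in {"manifest.json", "signatures"}:
            continue
        entries[relative.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": entries}


def verify_manifest(bundle_root: Path) -> list:
    """Return the relative paths whose digest differs from manifest.json."""
    recorded = json.loads((bundle_root / "manifest.json").read_text())["files"]
    current = build_manifest(bundle_root)["files"]
    return sorted(name for name in recorded.keys() | current.keys()
                  if recorded.get(name) != current.get(name))
```

An empty list from `verify_manifest` means every payload file matches its recorded digest; added, removed, or modified files are all reported.
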
### 10.2 Sealed-Mode Guards
```csharp
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
```
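
The three steps in the guard above (disable non-loopback exporters, warn, fall back to file export) can be sketched in Python as follows. This illustrates the logic only and is not the `StellaOps.Telemetry.Core` implementation; the function names, the exporter dictionary shape, and the warning text are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def is_permitted_when_sealed(endpoint: str) -> bool:
    """True for file exporters and for endpoints on localhost or a loopback IP."""
    parts = urlsplit(endpoint)
    if parts.scheme == "file":
        return True
    host = parts.hostname
    if host is None:
        return False  # unparseable endpoints are treated as external
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any other hostname is treated as external


def apply_sealed_mode(exporters: dict, sealed: bool, fallback_path: str):
    """Return (allowed_exporters, warnings) for the given sealed-mode state."""
    if not sealed:
        return dict(exporters), []
    allowed = {name: url for name, url in exporters.items()
               if is_permitted_when_sealed(url)}
    warnings = [f"sealed mode: exporter {name!r} ({url}) disabled; "
                f"configure a loopback or file exporter instead"
                for name, url in exporters.items() if name not in allowed]
    if warnings:
        # Fall back to file-based export so telemetry is not silently lost.
        allowed["file"] = f"file://{fallback_path}"
    return allowed, warnings
```
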
---
## 11. Dashboards & Alerts
### 11.1 Standard Dashboards
| Dashboard | Purpose | Panels |
|-----------|---------|--------|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |
### 11.2 Alert Rules
| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 scan duration > 5 min | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
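
The alert conditions can be expressed as predicates over a metrics snapshot. The sketch below is illustrative: thresholds come from the table, while the snapshot keys (`error_rate`, `scan_p95_seconds`, `queue_depth`, `breaker_open`) and the `AlertRule` type are hypothetical and do not represent the deployed alerting rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    fires: Callable[[dict], bool]


# Thresholds come from the table above; the snapshot keys are hypothetical.
RULES = [
    AlertRule("High Error Rate", "critical", lambda m: m["error_rate"] > 0.05),
    AlertRule("Slow Scans", "warning", lambda m: m["scan_p95_seconds"] > 5 * 60),
    AlertRule("Queue Backlog", "warning", lambda m: m["queue_depth"] > 1000),
    AlertRule("Circuit Open", "critical", lambda m: m["breaker_open"] == 1),
]


def firing_alerts(snapshot: dict) -> list:
    """Return (name, severity) for every rule whose condition holds."""
    return [(rule.name, rule.severity) for rule in RULES if rule.fires(snapshot)]
```
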
---
## 12. Security Considerations
### 12.1 Data Protection
- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced
### 12.2 Access Control
- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
| Collector configuration | `docs/modules/telemetry/collector-config.md` |
| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md

**Key Task IDs:**
- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% of error traces retained |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |
---
*Last updated: 2025-11-29*