# Telemetry and Observability Patterns

**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical

This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.

---

## 1. Executive Summary

The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:

- **OpenTelemetry Native** - OTLP collection for metrics, traces, and logs
- **Forensic Mode** - Extended retention and 100% sampling during incidents
- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gapped deployments
- **Offline Bundles** - Signed OTLP archives for compliance

---

## 2. Market Drivers

### 2.1 Target Segments

| Segment | Observability Requirements | Use Case |
|---------|---------------------------|----------|
| **Platform Ops** | Real-time monitoring | Operational health |
| **Security Teams** | Forensic investigation | Incident response |
| **Compliance** | Audit trails | SOC 2, FedRAMP |
| **DevSecOps** | Pipeline visibility | CI/CD debugging |

### 2.2 Competitive Positioning

Most vulnerability tools provide minimal observability. Stella Ops differentiates with:
- **Built-in OpenTelemetry** across all services
- **Forensic mode** with automatic retention extension
- **Sealed-mode compatibility** for air-gapped deployments
- **Signed OTLP bundles** for compliance archives
- **Incident-triggered sampling** escalation

---

## 3. Collector Topology

### 3.1 Architecture

```
┌─────────────────────────────────────────────────────┐
│                      Services                       │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
                      ▼
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │        Logs         │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
    ┌────────┐  ┌──────────┐         ┌────────┐
    │ Tempo  │  │Prometheus│         │  Loki  │
    └────────┘  └──────────┘         └────────┘
```

### 3.2 Collector Profiles

| Profile | Use Case | Configuration |
|---------|----------|---------------|
| **default** | Normal operation | 10% trace sampling, 30-day retention |
| **forensic** | Investigation mode | 100% sampling, 180-day retention |
| **airgap** | Offline deployment | File exporters, no external network |

---

## 4. Metrics

### 4.1 Standard Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |

### 4.2 Module-Specific Metrics

**Policy Engine:**
- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`

**Scanner:**
- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`

**Authority:**
- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`
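
To make the `name{label,...}` notation above concrete, here is a minimal sketch of how one labeled sample could be rendered in the Prometheus text exposition format. It uses only the Python standard library; `render_sample` is a hypothetical helper for illustration and is not part of the Stella Ops codebase, and Python is used purely as neutral illustration.

```python
from typing import Dict, Union


def _escape_label_value(value: str) -> str:
    # The Prometheus text format escapes backslash, double quote and newline
    # inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: Union[int, float]) -> str:
    """Render one sample line, e.g. name{label="value"} 3 (hypothetical helper)."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable across calls.
    body = ",".join(
        f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value}"


line = render_sample(
    "stellaops_request_total", {"service": "scanner", "status": "200"}, 3
)
```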

---

## 5. Traces

### 5.1 Trace Context

All services propagate the W3C Trace Context headers, plus W3C Baggage:
- `traceparent` header
- `tracestate` for vendor-specific data
- `baggage` for cross-service attributes
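
As an illustration of what propagating `traceparent` involves, the sketch below parses a version `00` header (`version-traceid-parentid-flags`, all lowercase hex) and derives the header a service would send downstream. The function names are hypothetical, and the code handles only the version `00` layout defined by W3C Trace Context.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed (hypothetical helper)."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero identifiers are invalid per the specification.
    if parsed.version == "ff" or set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


def child_traceparent(parent: TraceParent) -> str:
    """Keep the trace ID and flags, but mint a new span ID for the outgoing call."""
    span_id = secrets.token_hex(8)
    while span_id == "0" * 16:  # an all-zero span ID would be invalid
        span_id = secrets.token_hex(8)
    return f"00-{parent.trace_id}-{span_id}-{parent.flags}"
```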

### 5.2 Span Conventions

| Span | Attributes | Description |
|------|------------|-------------|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |
| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |

### 5.3 Sampling Strategy

**Default (Tail Sampling):**
- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%

**Forensic Mode:**
- All traces: 100%
- Extended attributes enabled
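
The sampling rules above amount to a decision made once a trace has completed. The following is a minimal sketch of that decision, not the collector's actual tail-sampling processor; the `CompletedTrace` type and `keep_trace` function are hypothetical names, and the random source is injectable so the behavior can be tested deterministically.

```python
import random
from dataclasses import dataclass
from typing import Callable

SLOW_THRESHOLD_SECONDS = 2.0   # "Slow traces (>2s)"
DEFAULT_SAMPLE_RATIO = 0.10    # "Normal traces: 10%"


@dataclass(frozen=True)
class CompletedTrace:
    # Hypothetical summary of a finished trace.
    trace_id: str
    duration_seconds: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    forensic_mode: bool = False,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to retain a completed trace."""
    if forensic_mode:
        return True                      # forensic mode keeps everything
    if trace.has_error:
        return True                      # error traces: 100%
    if trace.duration_seconds > SLOW_THRESHOLD_SECONDS:
        return True                      # slow traces: 100%
    return rng() < DEFAULT_SAMPLE_RATIO  # normal traces: 10%
```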

---

## 6. Logs

### 6.1 Structured Format

```json
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
```
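
One way to produce entries in this shape is a custom formatter on top of a standard logging library. The sketch below uses Python's `logging` module for illustration only; the `JsonFormatter` class and the `service` / `fields` record attributes are assumptions made for this example, not the format used by the Stella Ops services.

```python
import json
import logging
from datetime import datetime, timezone

# The document uses "warn", while Python's level name is "WARNING".
_LEVEL_NAMES = {"warning": "warn"}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter emitting one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        level = record.levelname.lower()
        entry = {
            "timestamp": stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": _LEVEL_NAMES.get(level, level),
            "message": record.getMessage(),
            # "service" and "fields" are supplied by the caller through `extra=`.
            "service": getattr(record, "service", "unknown"),
        }
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("scanner")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Scan completed",
    extra={"service": "scanner", "fields": {"tenant": "acme-corp", "componentCount": 245}},
)
```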

### 6.2 Redaction

Attribute processors strip sensitive data before export:
- `authorization` headers
- `secretRef` values
- PII, according to an allowed-key policy
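
A rough sketch of how such a redaction step could work is shown below. It reflects one possible reading of the list above: values of known-sensitive keys are replaced with a marker, and any attribute whose key is not on the allow-list is dropped. The `redact` function, the marker string, and matching on the last dot-separated key segment are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable

REDACTED = "[REDACTED]"                       # hypothetical marker value
SENSITIVE_KEYS = frozenset({"authorization", "secretref"})


def redact(attributes: Dict[str, Any], allowed_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `attributes` that is safe to export (illustrative only)."""
    allowed = set(allowed_keys)
    result: Dict[str, Any] = {}
    for key, value in attributes.items():
        # Compare on the last dot-separated segment so that a key such as
        # "http.request.header.authorization" is caught as well.
        leaf = key.rsplit(".", 1)[-1].lower()
        if leaf in SENSITIVE_KEYS:
            result[key] = REDACTED
        elif key in allowed:
            result[key] = value
        # Anything else is dropped: only allow-listed keys pass through.
    return result
```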

### 6.3 Log Levels

| Level | Purpose | Retention |
|-------|---------|-----------|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |
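
The retention table translates directly into an expiry calculation. The sketch below is illustrative only: the function names are hypothetical, and in practice retention would more likely be enforced by the log store than by application code.

```python
from datetime import datetime, timedelta, timezone

# Retention per level, taken from the table above.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def expires_at(level: str, logged_at: datetime) -> datetime:
    """Time after which an entry of this level may be deleted."""
    return logged_at + timedelta(days=RETENTION_DAYS[level])


def is_expired(level: str, logged_at: datetime, now: datetime) -> bool:
    return now >= expires_at(level, logged_at)


logged = datetime(2025, 11, 29, 12, 0, tzinfo=timezone.utc)
debug_expiry = expires_at("debug", logged)   # 7 days after the entry was written
```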

---

## 7. Forensic Mode

### 7.1 Activation

```bash
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"

# Check status
stella telemetry incident status

# Deactivate
stella telemetry incident stop --tenant acme-corp
```

### 7.2 Behavior Changes

| Aspect | Default | Forensic |
|--------|---------|----------|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
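
The table above can be read as two fixed settings bundles, with the active incident state choosing between them. The sketch below models that with an in-memory registry. It assumes forensic mode is tracked per tenant (suggested by the `--tenant` flag above, but not confirmed by this document); `TelemetrySettings` and `IncidentRegistry` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TelemetrySettings:
    trace_sample_ratio: float
    log_level: str
    retention_days: int
    extended_attributes: bool
    export_interval_seconds: int


# Values copied from the "Default" and "Forensic" columns.
DEFAULT = TelemetrySettings(0.10, "info", 30, False, 60)
FORENSIC = TelemetrySettings(1.00, "debug", 180, True, 10)


class IncidentRegistry:
    """In-memory stand-in for incident state (illustration only)."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}   # tenant -> reason

    def start(self, tenant: str, reason: str) -> None:
        self._active[tenant] = reason

    def stop(self, tenant: str) -> None:
        self._active.pop(tenant, None)

    def reason(self, tenant: str) -> Optional[str]:
        return self._active.get(tenant)

    def settings_for(self, tenant: str) -> TelemetrySettings:
        return FORENSIC if tenant in self._active else DEFAULT
```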

### 7.3 Automatic Triggers

Forensic mode is activated by any of the following events. All except the last fire without operator action:

- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation
- Manual operator trigger

---

## 8. Implementation Strategy

### 8.1 Phase 1: Core Telemetry (Complete)

- [x] OpenTelemetry SDK integration
- [x] Metrics exporter (Prometheus)
- [x] Trace exporter (Tempo/Jaeger)
- [x] Log exporter (Loki)

### 8.2 Phase 2: Advanced Features (Complete)

- [x] Tail sampling configuration
- [x] Attribute redaction
- [x] Profile-based configuration
- [x] Dashboard provisioning

### 8.3 Phase 3: Forensic & Offline (In Progress)

- [x] Forensic mode toggle
- [ ] Forensic bundle export (TELEM-FOR-50-001)
- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
- [ ] Offline bundle signing (TELEM-SIGN-52-001)

---

## 9. API Surface

### 9.1 Configuration

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |

### 9.2 Incident Mode

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |

### 9.3 Exports

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
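
The endpoint tables above pair each route with a required scope. A minimal sketch of the corresponding authorization check follows. It assumes scopes are matched exactly, so `telemetry:admin` does not imply `telemetry:read`; that is an assumption, since this document does not describe a scope hierarchy. The function names are hypothetical.

```python
import re
from typing import Iterable, Optional

# (method, path pattern, required scope), transcribed from the tables above.
# Path parameters such as {name} and {window} match a single path segment.
ROUTES = [
    ("GET", r"/telemetry/config/profile/[^/]+", "telemetry:read"),
    ("GET", r"/telemetry/config/profiles", "telemetry:read"),
    ("POST", r"/telemetry/incidents/mode", "telemetry:admin"),
    ("GET", r"/telemetry/incidents/status", "telemetry:read"),
    ("GET", r"/telemetry/exports/forensic/[^/]+", "telemetry:export"),
]


def required_scope(method: str, path: str) -> Optional[str]:
    """Return the scope a route requires, or None for an unknown route."""
    for route_method, pattern, scope in ROUTES:
        if route_method == method.upper() and re.fullmatch(pattern, path):
            return scope
    return None


def is_authorized(method: str, path: str, granted_scopes: Iterable[str]) -> bool:
    scope = required_scope(method, path)
    # Unknown routes are denied rather than allowed by default.
    return scope is not None and scope in set(granted_scopes)
```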

---

## 10. Offline Support

### 10.1 Bundle Structure

```
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
```
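
This document does not specify the contents of `manifest.json`. As a sketch of one plausible approach, the code below lists every payload file with its SHA-256 digest and size, so that a signature over the manifest covers the whole bundle. The manifest schema (`files`, `path`, `sha256`, `size`) and both function names are assumptions for illustration, and signing itself is out of scope here.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(bundle_root: Path) -> dict:
    """Collect digests for every file except the manifest and signatures."""
    entries = []
    for path in sorted(bundle_root.rglob("*")):
        relative = path.relative_to(bundle_root).as_posix()
        if not path.is_file():
            continue
        if relative == "manifest.json" or relative.startswith("signatures/"):
            continue  # the manifest cannot describe itself or its own signature
        entries.append(
            {
                "path": relative,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
                "size": path.stat().st_size,
            }
        )
    return {"files": entries}  # hypothetical schema


def write_manifest(bundle_root: Path) -> dict:
    manifest = build_manifest(bundle_root)
    # Sorted keys and fixed indentation keep the bytes stable for signing.
    text = json.dumps(manifest, indent=2, sort_keys=True)
    (bundle_root / "manifest.json").write_text(text, encoding="utf-8")
    return manifest
```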

### 10.2 Sealed-Mode Guards

```csharp
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
```
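
The three steps in the comments above can be sketched as a filter over configured exporter endpoints. This is an illustration in Python, not the `StellaOps.Telemetry.Core` implementation: the function names, the warning text, and the `file-fallback` exporter key are hypothetical, and host names other than `localhost` are conservatively treated as external because they are not resolved.

```python
import ipaddress
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def is_allowed_when_sealed(endpoint: str) -> bool:
    """True for file exporters and loopback network endpoints."""
    parts = urlsplit(endpoint)
    if parts.scheme == "file":
        return True
    host = parts.hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unresolved host names count as external


def apply_sealed_mode(
    exporters: Dict[str, str], sealed: bool, fallback_file: str
) -> Tuple[Dict[str, str], List[str]]:
    """Return the exporters that may stay enabled, plus any warnings."""
    if not sealed:
        return dict(exporters), []
    kept: Dict[str, str] = {}
    warnings: List[str] = []
    for name, endpoint in exporters.items():
        if is_allowed_when_sealed(endpoint):
            kept[name] = endpoint
        else:
            warnings.append(
                f"sealed mode: exporter {name!r} ({endpoint}) disabled; "
                "use a loopback or file exporter"
            )
    if warnings:
        kept.setdefault("file-fallback", fallback_file)  # fall back to file export
    return kept, warnings
```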

---

## 11. Dashboards & Alerts

### 11.1 Standard Dashboards

| Dashboard | Purpose | Panels |
|-----------|---------|--------|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |

### 11.2 Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 scan duration > 5 min | warning |
| Queue Backlog | queue depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
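
To show how the alert conditions evaluate against a metrics snapshot, here is a small sketch. In a real deployment these rules would normally live in the alerting system rather than application code; the snapshot keys (`error_rate`, `scan_p95_seconds`, `queue_depth`, `breaker_open`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, float]


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    condition: Callable[[Snapshot], bool]


# Thresholds transcribed from the table above; missing values default to 0.
RULES = [
    AlertRule("High Error Rate", "critical", lambda m: m.get("error_rate", 0) > 0.05),
    AlertRule("Slow Scans", "warning", lambda m: m.get("scan_p95_seconds", 0) > 5 * 60),
    AlertRule("Queue Backlog", "warning", lambda m: m.get("queue_depth", 0) > 1000),
    AlertRule("Circuit Open", "critical", lambda m: m.get("breaker_open", 0) == 1),
]


def firing(snapshot: Snapshot) -> List[Tuple[str, str]]:
    """Return (alert name, severity) for every rule whose condition holds."""
    return [(rule.name, rule.severity) for rule in RULES if rule.condition(snapshot)]


alerts = firing(
    {"error_rate": 0.08, "scan_p95_seconds": 120, "queue_depth": 1500, "breaker_open": 0}
)
```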

---

## 12. Security Considerations

### 12.1 Data Protection

- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced

### 12.2 Access Control

- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access

---

## 13. Related Documentation

| Resource | Location |
|----------|----------|
| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
| Collector configuration | `docs/modules/telemetry/collector-config.md` |
| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |

---

## 14. Sprint Mapping

- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md

**Key Task IDs:**
- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)

---

## 15. Success Metrics

| Metric | Target |
|--------|--------|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% of error traces retained |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |

---

*Last updated: 2025-11-29*