# Telemetry and Observability Patterns

**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical

This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.

---

## 1. Executive Summary

The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:

- **OpenTelemetry Native** - OTLP collection for metrics, traces, logs
- **Forensic Mode** - Extended retention and 100% sampling during incidents
- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gapped deployments
- **Offline Bundles** - Signed OTLP archives for compliance

---

## 2. Market Drivers

### 2.1 Target Segments

| Segment | Observability Requirements | Use Case |
|---------|---------------------------|----------|
| **Platform Ops** | Real-time monitoring | Operational health |
| **Security Teams** | Forensic investigation | Incident response |
| **Compliance** | Audit trails | SOC 2, FedRAMP |
| **DevSecOps** | Pipeline visibility | CI/CD debugging |

### 2.2 Competitive Positioning

Most vulnerability-management tools ship with little built-in observability. Stella Ops differentiates with:
- **Built-in OpenTelemetry** across all services
- **Forensic mode** with automatic retention extension
- **Sealed-mode compatibility** for air-gapped deployments
- **Signed OTLP bundles** for compliance archives
- **Incident-triggered sampling** escalation

---

## 3. Collector Topology

### 3.1 Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Services                          │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
                      ▼
┌─────────────────────────────────────────────────────┐
│              OpenTelemetry Collector                 │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │       Logs          │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
   ┌────────┐   ┌──────────┐        ┌────────┐
   │ Tempo  │   │Prometheus│        │  Loki  │
   └────────┘   └──────────┘        └────────┘
```

### 3.2 Collector Profiles

| Profile | Use Case | Configuration |
|---------|----------|---------------|
| **default** | Normal operation | 10% sampling of normal traces (error and slow traces always kept), 30-day retention |
| **forensic** | Investigation mode | 100% sampling, 180-day retention |
| **airgap** | Offline deployment | File exporters, no external network |

---

## 4. Metrics

### 4.1 Standard Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |
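
To make the naming and label conventions concrete, the sketch below models `stellaops_request_total` as an in-memory counter and renders it in the Prometheus text exposition format. The `LabeledCounter` class is hypothetical and stands in for whatever metrics SDK the services actually use; treat it as an illustration of the metric shape, not as the module's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class LabeledCounter:
    """Hypothetical monotonic counter keyed by a fixed set of label names."""

    def __init__(self, name: str, label_names: Iterable[str]) -> None:
        self.name = name
        self.label_names = tuple(label_names)
        self._values: Dict[Tuple[str, ...], float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Reject missing or unexpected labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {sorted(labels)}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Render all series as text exposition lines, sorted by label values."""
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{name}="{_escape(label)}"' for name, label in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


requests_total = LabeledCounter("stellaops_request_total", ["service", "status"])
requests_total.inc(service="scanner", status="200")
requests_total.inc(service="scanner", status="200")
requests_total.inc(service="policy", status="500")
```

Keeping the label set fixed per metric (here `service` and `status`, as in the table) keeps series cardinality predictable.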

### 4.2 Module-Specific Metrics

**Policy Engine:**
- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`

**Scanner:**
- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`

**Authority:**
- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`

---

## 5. Traces

### 5.1 Trace Context

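All services propagate W3C context headers:
- `traceparent` header (W3C Trace Context) for the trace ID, parent span ID, and trace flags
- `tracestate` header (W3C Trace Context) for vendor-specific data
- `baggage` header (W3C Baggage, a separate specification) for cross-service attributes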
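
In version 00 of the W3C Trace Context specification, the `traceparent` value has a fixed layout: `version-traceid-parentid-flags`, all lowercase hex. The sketch below parses such a header and derives the header a service would send downstream. The helper names `parse_traceparent` and `child_traceparent` are hypothetical, and in practice services would normally rely on the OpenTelemetry SDK's propagators rather than hand-written parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout only: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag.
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed or all-zero."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, int(flags, 16))


def child_traceparent(parent: TraceParent) -> str:
    """Build the outgoing header: same trace ID and flags, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{parent.trace_id}-{span_id}-{parent.flags:02x}"
```

An invalid incoming header should cause the service to start a new trace rather than fail the request, which is why `parse_traceparent` returns `None` instead of raising.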

### 5.2 Span Conventions

| Span | Attributes | Description |
|------|------------|-------------|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |
| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |

### 5.3 Sampling Strategy

**Default (Tail Sampling):**
- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%

**Forensic Mode:**
- All traces: 100%
- Extended attributes enabled
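
The sampling rules above can be expressed as a single keep-or-drop decision made once a trace is complete. The following is a minimal sketch under stated assumptions: `TraceSummary` and `keep_trace` are hypothetical names, and the hash-based 10% selection is one possible way to make the decision deterministic per trace ID, not a description of how the collector's sampler works internally.

```python
import hashlib
from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 2.0   # "slow" cutoff from the default strategy
NORMAL_SAMPLE_RATIO = 0.10     # share of ordinary traces kept by default


@dataclass(frozen=True)
class TraceSummary:
    trace_id: str
    duration_seconds: float
    has_error: bool


def _trace_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(trace: TraceSummary, forensic: bool = False) -> bool:
    """Apply the rules in priority order: forensic, error, slow, then ratio."""
    if forensic:
        return True
    if trace.has_error:
        return True
    if trace.duration_seconds > SLOW_THRESHOLD_SECONDS:
        return True
    return _trace_bucket(trace.trace_id) < NORMAL_SAMPLE_RATIO
```

Hashing the trace ID, rather than drawing a random number, means every collector instance reaches the same decision for the same trace.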

---

## 6. Logs

### 6.1 Structured Format

```json
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
```

### 6.2 Redaction

Attribute processors strip sensitive data:
- `authorization` headers
- `secretRef` values
- PII based on allowed-key policy
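
As an illustration of the allowed-key approach, the sketch below drops denied keys and anything outside an allow-list before a record is serialized. The allow-list contents are an assumption (copied from the fields of the structured log example in 6.1), and `redact` and `to_log_line` are hypothetical helpers; in the deployed pipeline this work is done by the collector's attribute processors.

```python
import json
from typing import Any, Dict, Mapping

# Keys that must never be exported, matched case-insensitively.
DENIED_KEYS = frozenset({"authorization", "secretref"})

# Assumed allow-list, taken from the structured log example.
ALLOWED_KEYS = frozenset({
    "timestamp", "level", "message", "service", "traceId", "spanId",
    "tenant", "imageDigest", "componentCount", "vulnerabilityCount",
})


def redact(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without denied or non-allow-listed keys."""
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in DENIED_KEYS and key in ALLOWED_KEYS
    }


def to_log_line(record: Mapping[str, Any]) -> str:
    """Serialize the redacted record as a single JSON line."""
    return json.dumps(redact(record), sort_keys=True)
```

An allow-list fails closed: a newly introduced attribute is dropped until someone explicitly approves it, which is the safer default for PII.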

### 6.3 Log Levels

| Level | Purpose | Retention |
|-------|---------|-----------|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |
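
The retention table translates directly into an expiry check. This is a sketch only: `is_expired` is a hypothetical helper, and in a real deployment retention would typically be enforced by the log store's own configuration rather than by application code.

```python
from datetime import datetime, timedelta, timezone

# Retention per level, in days, as listed in the table.
RETENTION_DAYS = {"error": 180, "warn": 90, "info": 30, "debug": 7}


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-11-29T12:00:00.123Z."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_expired(level: str, timestamp: str, now: datetime) -> bool:
    """True when the record is older than the retention window for its level."""
    age = now - parse_timestamp(timestamp)
    return age > timedelta(days=RETENTION_DAYS[level])


now = datetime(2025, 12, 10, tzinfo=timezone.utc)
debug_expired = is_expired("debug", "2025-11-29T12:00:00.123Z", now)  # older than 7 days
info_expired = is_expired("info", "2025-11-29T12:00:00.123Z", now)    # within 30 days
```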

---

## 7. Forensic Mode

### 7.1 Activation

```bash
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"

# Check status
stella telemetry incident status

# Deactivate
stella telemetry incident stop --tenant acme-corp
```

### 7.2 Behavior Changes

| Aspect | Default | Forensic |
|--------|---------|----------|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
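
The table can be read as two settings bundles with a per-tenant switch between them. The sketch below assumes forensic mode is scoped per tenant (suggested by the `--tenant` flag in the CLI examples); `TelemetrySettings` and `IncidentRegistry` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TelemetrySettings:
    trace_sample_ratio: float
    log_level: str
    retention_days: int
    extended_attributes: bool
    export_interval_seconds: int


# Values copied from the Default and Forensic columns of the table.
DEFAULT = TelemetrySettings(0.10, "info", 30, False, 60)
FORENSIC = TelemetrySettings(1.00, "debug", 180, True, 10)


class IncidentRegistry:
    """Hypothetical in-memory record of tenants with forensic mode active."""

    def __init__(self) -> None:
        self._reasons: Dict[str, str] = {}

    def start(self, tenant: str, reason: str) -> None:
        self._reasons[tenant] = reason

    def stop(self, tenant: str) -> None:
        # Stopping a tenant that is not in forensic mode is a no-op.
        self._reasons.pop(tenant, None)

    def reason(self, tenant: str) -> Optional[str]:
        return self._reasons.get(tenant)

    def settings_for(self, tenant: str) -> TelemetrySettings:
        return FORENSIC if tenant in self._reasons else DEFAULT
```

Recording the reason alongside the activation keeps the audit trail attached to the state change itself.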

### 7.3 Activation Triggers

Forensic mode activates automatically on any of the following events:

- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation

Operators can also trigger it manually (see 7.1).

---

## 8. Implementation Strategy

### 8.1 Phase 1: Core Telemetry (Complete)

- [x] OpenTelemetry SDK integration
- [x] Metrics exporter (Prometheus)
- [x] Trace exporter (Tempo/Jaeger)
- [x] Log exporter (Loki)

### 8.2 Phase 2: Advanced Features (Complete)

- [x] Tail sampling configuration
- [x] Attribute redaction
- [x] Profile-based configuration
- [x] Dashboard provisioning

### 8.3 Phase 3: Forensic & Offline (In Progress)

- [x] Forensic mode toggle
- [ ] Forensic bundle export (TELEM-FOR-50-001)
- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
- [ ] Offline bundle signing (TELEM-SIGN-52-001)

---

## 9. API Surface

### 9.1 Configuration

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |

### 9.2 Incident Mode

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |

### 9.3 Exports

| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
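
For orientation, the sketch below builds (but does not send) requests for two of the endpoints listed in this section. Only the paths, methods, and scopes come from the tables; the bearer-token header, the base URL, and the JSON body of the mode toggle are assumptions, since the request schemas are not specified here.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def _url(base_url: str, path: str) -> str:
    return base_url.rstrip("/") + path


def profile_request(base_url: str, token: str, name: str) -> Request:
    """GET /telemetry/config/profile/{name} (scope: telemetry:read)."""
    path = "/telemetry/config/profile/" + quote(name, safe="")
    return Request(
        _url(base_url, path),
        method="GET",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )


def incident_mode_request(
    base_url: str, token: str, tenant: str, enabled: bool, reason: str = ""
) -> Request:
    """POST /telemetry/incidents/mode (scope: telemetry:admin)."""
    # Assumed body shape; the endpoint's real schema is not given in this document.
    body = json.dumps({"tenant": tenant, "enabled": enabled, "reason": reason})
    return Request(
        _url(base_url, "/telemetry/incidents/mode"),
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

A caller would pass the resulting object to `urllib.request.urlopen`; keeping construction separate from sending makes the request shape easy to unit test.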

---

## 10. Offline Support

### 10.1 Bundle Structure

```
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
```
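
One plausible role for `manifest.json` is to list a digest for every payload file so that a single signature over the manifest covers the whole bundle. The sketch below assumes exactly that: the `{"files": {path: sha256}}` layout and the helper names are hypothetical, and signature creation and verification for `manifest.sig` are out of scope because the signing scheme is not described here.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def _is_payload(rel_path: str) -> bool:
    """The manifest and its signatures are not hashed into the manifest."""
    return rel_path != "manifest.json" and not rel_path.startswith("signatures/")


def build_manifest(bundle_dir: Path) -> Dict[str, Dict[str, str]]:
    """Hash every payload file under the bundle root, keyed by relative path."""
    files: Dict[str, str] = {}
    for path in sorted(bundle_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(bundle_dir).as_posix()
        if _is_payload(rel):
            files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": files}


def write_manifest(bundle_dir: Path) -> Path:
    target = bundle_dir / "manifest.json"
    target.write_text(json.dumps(build_manifest(bundle_dir), indent=2, sort_keys=True))
    return target


def find_mismatches(bundle_dir: Path) -> List[str]:
    """Return paths that were added, removed, or changed since the manifest was written."""
    recorded = json.loads((bundle_dir / "manifest.json").read_text())["files"]
    actual = build_manifest(bundle_dir)["files"]
    return sorted(p for p in set(recorded) | set(actual) if recorded.get(p) != actual.get(p))
```

An importer on the air-gapped side would first verify the signature over `manifest.json`, then call `find_mismatches` and reject the bundle if the list is non-empty.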

### 10.2 Sealed-Mode Guards

```csharp
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
```
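
The "disable non-loopback exporters" step comes down to classifying each exporter endpoint. The following Python sketch shows one way to make that decision; it is not the `StellaOps.Telemetry.Core` implementation, and `is_loopback_endpoint` and `partition_exporters` are hypothetical names. Endpoints are assumed to be full URLs with a scheme.

```python
import ipaddress
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def is_loopback_endpoint(endpoint: str) -> bool:
    """True only when the endpoint's host is localhost or a loopback IP literal."""
    host = urlsplit(endpoint).hostname
    if host is None:
        return False  # unparseable endpoints are treated as external
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name other than localhost: never resolved, always treated as external.
        return False


def partition_exporters(
    endpoints: Iterable[str], sealed: bool
) -> Tuple[List[str], List[str]]:
    """Split endpoints into (allowed, blocked) for the current sealed-mode state."""
    allowed: List[str] = []
    blocked: List[str] = []
    for endpoint in endpoints:
        if sealed and not is_loopback_endpoint(endpoint):
            blocked.append(endpoint)
        else:
            allowed.append(endpoint)
    return allowed, blocked
```

The guard deliberately avoids DNS resolution: in sealed mode, a hostname that happens to resolve to loopback is still blocked, so the decision does not depend on resolver state. Each blocked endpoint is a candidate for the structured warning and file-export fallback described in the snippet above.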

---

## 11. Dashboards & Alerts

### 11.1 Standard Dashboards

| Dashboard | Purpose | Panels |
|-----------|---------|--------|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |

### 11.2 Alert Rules

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 scan duration > 5 min | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
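
To show how the thresholds combine, the sketch below evaluates the four rules against a snapshot of current values. The snapshot keys (`error_rate`, `scan_p95_seconds`, `queue_depth`, `breaker_open`) and the `AlertRule` type are assumptions for illustration; in production these conditions would live in the alerting system's own rule files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Snapshot = Dict[str, float]


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    condition: Callable[[Snapshot], bool]


RULES = [
    AlertRule("High Error Rate", "critical", lambda m: m["error_rate"] > 0.05),
    AlertRule("Slow Scans", "warning", lambda m: m["scan_p95_seconds"] > 5 * 60),
    AlertRule("Queue Backlog", "warning", lambda m: m["queue_depth"] > 1000),
    AlertRule("Circuit Open", "critical", lambda m: m["breaker_open"] == 1),
]


def firing(snapshot: Snapshot) -> List[AlertRule]:
    """Return the rules whose condition holds for this snapshot, in table order."""
    return [rule for rule in RULES if rule.condition(snapshot)]


snapshot = {
    "error_rate": 0.07,        # 7% of requests failing
    "scan_p95_seconds": 120,   # 2 minutes
    "queue_depth": 1500,
    "breaker_open": 0,
}
active = [rule.name for rule in firing(snapshot)]
```

With this snapshot, the error-rate and queue-depth rules fire while the scan-latency and circuit-breaker rules stay quiet.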

---

## 12. Security Considerations

### 12.1 Data Protection

- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced

### 12.2 Access Control

- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access

---

## 13. Related Documentation

| Resource | Location |
|----------|----------|
| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
| Collector configuration | `docs/modules/telemetry/collector-config.md` |
| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |

---

## 14. Sprint Mapping

- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md

**Key Task IDs:**
- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)

---

## 15. Success Metrics

| Metric | Target |
|--------|--------|
| Collection overhead | < 2% CPU |
| Error trace capture rate | 100% |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes for 24 hours of data |

---

*Last updated: 2025-11-29*