feat: Add VEX Status Chip component and integration tests for reachability drift detection

- Introduced `VexStatusChipComponent` to display VEX status with color coding and tooltips.
- Implemented integration tests for reachability drift detection, covering various scenarios including drift detection, determinism, and error handling.
- Enhanced `ScannerToSignalsReachabilityTests` with a null implementation of `ICallGraphSyncService` for better test isolation.
- Updated project references to include the new Reachability Drift library.
StellaOps Bot
2025-12-20 01:26:42 +02:00
parent edc91ea96f
commit 5fc469ad98
159 changed files with 41116 additions and 2305 deletions

@@ -0,0 +1,473 @@
# OBS-50 Telemetry Baselines Contract v1.0.0
**Status:** APPROVED
**Version:** 1.0.0
**Effective:** 2025-12-19
**Owner:** Observability Guild + Telemetry Core Guild
**Sprint:** SPRINT_0170_0001_0001 (unblocks 51-002, ORCH-OBS-50-001)
---
## 1. Purpose
This contract defines the baseline telemetry standards for all StellaOps services, ensuring consistent observability across the platform. It specifies common envelope schemas, metric naming conventions, trace span standards, log formats, and redaction requirements.
## 2. Schema References
| Schema | Location |
|--------|----------|
| Telemetry Config | `docs/modules/telemetry/schemas/telemetry-config.schema.json` |
| Telemetry Bundle | `docs/modules/telemetry/schemas/telemetry-bundle.schema.json` |
| Telemetry Standards | `docs/observability/telemetry-standards.md` |
| Telemetry Bootstrap | `docs/observability/telemetry-bootstrap.md` |
## 3. Common Envelope Schema
### 3.1 Required Fields
All telemetry signals (traces, metrics, logs) MUST include these envelope fields:
```csharp
public sealed record TelemetryEnvelope
{
    /// <summary>W3C trace context identifier.</summary>
    public required string TraceId { get; init; }
    /// <summary>W3C span identifier.</summary>
    public required string SpanId { get; init; }
    /// <summary>W3C trace flags.</summary>
    public int TraceFlags { get; init; }
    /// <summary>Tenant identifier.</summary>
    public required string TenantId { get; init; }
    /// <summary>Service/workload name.</summary>
    public required string Workload { get; init; }
    /// <summary>Deployment region.</summary>
    public required string Region { get; init; }
    /// <summary>Environment (dev/stage/prod).</summary>
    public required string Environment { get; init; }
    /// <summary>Service version (git SHA or semver).</summary>
    public required string Version { get; init; }
    /// <summary>Module/component name.</summary>
    public required string Component { get; init; }
    /// <summary>Operation name (verb/action).</summary>
    public required string Operation { get; init; }
    /// <summary>UTC ISO-8601 timestamp.</summary>
    public required DateTimeOffset Timestamp { get; init; }
    /// <summary>Outcome status.</summary>
    public required TelemetryStatus Status { get; init; }
}
public enum TelemetryStatus
{
    Ok = 0,
    Error = 1,
    Fault = 2,
    Throttle = 3
}
```
### 3.2 Optional Fields
```csharp
public sealed record TelemetryContext
{
    /// <summary>Correlation ID for request chains.</summary>
    public string? CorrelationId { get; init; }
    /// <summary>Subject identifier (PURL, URI, or hashed ID).</summary>
    public string? Resource { get; init; }
    /// <summary>Project identifier within tenant.</summary>
    public string? ProjectId { get; init; }
    /// <summary>Actor identity (user/service).</summary>
    public string? Actor { get; init; }
    /// <summary>Policy rule that was applied.</summary>
    public string? ImposedRule { get; init; }
    /// <summary>Job/task run identifier.</summary>
    public string? RunId { get; init; }
}
```
### 3.3 JSON Example
```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": 1,
  "tenant_id": "tenant-001",
  "workload": "StellaOps.Orchestrator",
  "region": "eu-west-1",
  "environment": "prod",
  "version": "1.2.3",
  "component": "scheduler",
  "operation": "job.dispatch",
  "timestamp": "2025-12-19T10:00:00.000Z",
  "status": "ok",
  "correlation_id": "req-abc123",
  "run_id": "run-xyz789"
}
```
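
The mapping from the PascalCase record properties to the snake_case JSON keys can be sketched as follows. This is a minimal sketch, assuming `System.Text.Json` on .NET 8 or later (for `JsonNamingPolicy.SnakeCaseLower`); the trimmed `EnvelopeSample` record, `SampleStatus` enum, and `EnvelopeJson` helper are hypothetical names, and the contract itself does not prescribe a serializer.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical stand-in for TelemetryStatus.
public enum SampleStatus { Ok = 0, Error = 1, Fault = 2, Throttle = 3 }

// Hypothetical, trimmed-down stand-in for TelemetryEnvelope.
public sealed record EnvelopeSample
{
    public required string TraceId { get; init; }
    public required string TenantId { get; init; }
    public required string Workload { get; init; }
    public required SampleStatus Status { get; init; }
    public string? CorrelationId { get; init; }
}

public static class EnvelopeJson
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // TraceId -> trace_id, TenantId -> tenant_id, and so on.
        PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
        // Optional fields that are null are left out of the payload.
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
        // Enum members are written as lowercase strings ("ok", "error", ...).
        Converters = { new JsonStringEnumConverter(JsonNamingPolicy.SnakeCaseLower) }
    };

    public static string Serialize(EnvelopeSample envelope) =>
        JsonSerializer.Serialize(envelope, Options);
}
```

Calling `EnvelopeJson.Serialize` on an instance with `Status = SampleStatus.Ok` and no `CorrelationId` yields compact JSON with `"status":"ok"` and no `correlation_id` key.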
## 4. Metric Naming Conventions
### 4.1 Naming Pattern
```
{module}_{component}_{metric_type}_{unit}
```
Examples:
- `orchestrator_jobs_dispatched_total` (counter)
- `scanner_analysis_duration_seconds` (histogram)
- `policy_evaluations_active` (gauge)
- `concelier_ingestion_bytes_total` (counter)
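
A name check for this convention might look like the sketch below. It is deliberately lenient and rests on an assumption: because the examples above range from three to four underscore-separated segments, the hypothetical `MetricNameCheck` helper only enforces lowercase snake_case with at least three segments, not a strict four-part split.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the contract.
public static class MetricNameCheck
{
    // Lowercase snake_case, first segment starts with a letter,
    // followed by at least two more segments.
    private static readonly Regex SnakeCase =
        new(@"^[a-z][a-z0-9]*(_[a-z0-9]+){2,}\z", RegexOptions.CultureInvariant);

    public static bool IsValid(string? name) =>
        !string.IsNullOrEmpty(name) && SnakeCase.IsMatch(name);
}
```

All four example names above pass this check, while names such as `Orchestrator.Jobs` or `jobs_total` are rejected.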
### 4.2 Required Labels
| Label | Description | Cardinality |
|-------|-------------|-------------|
| `tenant` | Tenant identifier | Low |
| `workload` | Service name | Low |
| `environment` | Deployment environment | Low |
| `status` | Outcome (ok/error/fault/throttle) | Low |
### 4.3 Histogram Buckets
| Metric Type | Default Buckets |
|-------------|-----------------|
| Duration (seconds) | `[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]` |
| Size (bytes) | `[256, 512, 1024, 4096, 16384, 65536, 262144, 1048576]` |
| Count | `[1, 5, 10, 25, 50, 100, 250, 500, 1000]` |
### 4.4 Golden Signal Metrics
Every service MUST expose these metrics:
| Metric | Type | Description |
|--------|------|-------------|
| `{service}_requests_total` | counter | Total requests by status |
| `{service}_request_duration_seconds` | histogram | Request latency |
| `{service}_errors_total` | counter | Error count by type |
| `{service}_saturation_ratio` | gauge | Resource utilization (0.0-1.0) |
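
The four golden signal instruments could be registered with the built-in `System.Diagnostics.Metrics` API roughly as follows. This is a sketch under stated assumptions: the `GoldenSignals` class, the `StellaOps.{service}` meter name, and the `type` label on the error counter are hypothetical, and every non-`ok` status is counted as an error. Histogram bucket boundaries are not configured here; that is typically done in the exporter or SDK.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading;

// Hypothetical helper; class name and meter name are illustrative only.
public sealed class GoldenSignals : IDisposable
{
    private readonly Meter _meter;
    private readonly Counter<long> _requests;
    private readonly Histogram<double> _duration;
    private readonly Counter<long> _errors;
    private double _saturation;

    public GoldenSignals(string service)
    {
        _meter = new Meter($"StellaOps.{service}");
        _requests = _meter.CreateCounter<long>($"{service}_requests_total");
        _duration = _meter.CreateHistogram<double>($"{service}_request_duration_seconds", unit: "s");
        _errors = _meter.CreateCounter<long>($"{service}_errors_total");
        // The gauge is observed on demand and reports the last stored ratio.
        _meter.CreateObservableGauge<double>($"{service}_saturation_ratio",
            () => Volatile.Read(ref _saturation));
    }

    // Stores the utilization ratio, clamped to the 0.0-1.0 range.
    public void SetSaturation(double ratio) =>
        Volatile.Write(ref _saturation, Math.Clamp(ratio, 0.0, 1.0));

    public void RecordRequest(double seconds, string tenant, string workload,
        string environment, string status, string? errorType = null)
    {
        // Required labels from section 4.2.
        var tags = new TagList
        {
            { "tenant", tenant }, { "workload", workload },
            { "environment", environment }, { "status", status }
        };
        _requests.Add(1, tags);
        _duration.Record(seconds, tags);
        if (status != "ok")
        {
            tags.Add("type", errorType ?? "unknown"); // assumed label name
            _errors.Add(1, tags);
        }
    }

    public void Dispose() => _meter.Dispose();
}
```

A service would create one instance at startup, for example `new GoldenSignals("scanner")`, and call `RecordRequest` once per handled request.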
## 5. Trace Span Standards
### 5.1 Span Naming
```
{component}.{operation}
```
Examples:
- `scheduler.dispatch`
- `policy.evaluate`
- `scanner.analyze`
- `concelier.ingest`
### 5.2 Required Span Attributes
| Attribute | Description |
|-----------|-------------|
| `tenant.id` | Tenant identifier |
| `workload` | Service name |
| `component` | Module/subsystem |
| `operation` | Action being performed |
| `status.code` | OpenTelemetry status code |
| `status.message` | Status description |
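
With the built-in `System.Diagnostics.ActivitySource` API, a span carrying the required attributes could be started as in the sketch below. The `SpanSketch` class and the `StellaOps.Sample` source name are hypothetical. `StartActivity` returns `null` when no listener samples the source, so callers need to handle that case.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; the source name is illustrative only.
public static class SpanSketch
{
    public static readonly ActivitySource Source = new("StellaOps.Sample");

    public static Activity? StartSpan(string component, string operation,
        string tenantId, string workload)
    {
        // Span name follows {component}.{operation}.
        var activity = Source.StartActivity($"{component}.{operation}", ActivityKind.Internal);
        if (activity is null) return null; // no listener is sampling this source

        activity.SetTag("tenant.id", tenantId);
        activity.SetTag("workload", workload);
        activity.SetTag("component", component);
        activity.SetTag("operation", operation);
        return activity;
    }

    public static void Complete(Activity activity, Exception? error = null)
    {
        var code = error is null ? ActivityStatusCode.Ok : ActivityStatusCode.Error;
        activity.SetStatus(code, error?.Message);
        // Mirror the status into the attributes named in the table above.
        activity.SetTag("status.code", code.ToString());
        activity.SetTag("status.message", error?.Message ?? string.Empty);
    }
}
```

Wrapping the returned activity in a `using` statement ends the span when the scope exits; `Complete` should be called before that point.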
### 5.3 Span Events
Use span events for notable occurrences within a span:
```csharp
public sealed record SpanEventContract
{
    public required string Name { get; init; }
    public required DateTimeOffset Timestamp { get; init; }
    public ImmutableDictionary<string, object>? Attributes { get; init; }
}
```
Standard event names:
- `exception` - Exception occurred
- `retry` - Retry attempt
- `cache.hit` / `cache.miss` - Cache interaction
- `policy.applied` - Policy rule applied
## 6. Log Format Standards
### 6.1 Structured Log Fields
```csharp
public sealed record StructuredLogEntry
{
    /// <summary>UTC ISO-8601 timestamp.</summary>
    public required DateTimeOffset Timestamp { get; init; }
    /// <summary>Log severity level.</summary>
    public required LogLevel Level { get; init; }
    /// <summary>Log message template.</summary>
    public required string MessageTemplate { get; init; }
    /// <summary>Rendered message.</summary>
    public required string Message { get; init; }
    /// <summary>Exception details if present.</summary>
    public ExceptionInfo? Exception { get; init; }
    /// <summary>Trace context.</summary>
    public required TraceContext TraceContext { get; init; }
    /// <summary>Service context.</summary>
    public required ServiceContext ServiceContext { get; init; }
    /// <summary>Additional properties.</summary>
    public ImmutableDictionary<string, object>? Properties { get; init; }
}
public enum LogLevel
{
    Trace = 0,
    Debug = 1,
    Information = 2,
    Warning = 3,
    Error = 4,
    Critical = 5
}
```
### 6.2 Log Rate Limits
| Level | Default Rate | Notes |
|-------|--------------|-------|
| Trace/Debug | 10/s per component | Disabled in production |
| Information | 100/s per component | Sampled under pressure |
| Warning | 500/s per component | Never sampled |
| Error/Critical | Unlimited | Always emitted |
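
One way to apply these defaults is a fixed one-second window per component and level, sketched below. This is a simplification under stated assumptions: the `LogRateLimiter` class and `SketchLevel` enum are hypothetical, every finite rate in the table is treated as a hard cap, Trace and Debug each get their own budget, and the "disabled in production" and "sampled under pressure" behaviors are not modeled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the LogLevel enum.
public enum SketchLevel { Trace, Debug, Information, Warning, Error, Critical }

// Hypothetical fixed-window limiter keyed by (component, level).
public sealed class LogRateLimiter
{
    private readonly Dictionary<(string Component, SketchLevel Level), (long Window, int Count)> _windows = new();
    private readonly object _gate = new();

    // Default per-second limits from the table; null means unlimited.
    public static int? LimitPerSecond(SketchLevel level) => level switch
    {
        SketchLevel.Trace or SketchLevel.Debug => 10,
        SketchLevel.Information => 100,
        SketchLevel.Warning => 500,
        _ => null
    };

    public bool ShouldEmit(string component, SketchLevel level, DateTimeOffset now)
    {
        int? limit = LimitPerSecond(level);
        if (limit is null) return true; // Error/Critical are always emitted

        long window = now.ToUnixTimeSeconds();
        lock (_gate)
        {
            var key = (component, level);
            // Reuse the counter only if it belongs to the current second.
            var state = _windows.TryGetValue(key, out var existing) && existing.Window == window
                ? existing
                : (Window: window, Count: 0);
            if (state.Count >= limit.Value) return false;
            _windows[key] = (window, state.Count + 1);
            return true;
        }
    }
}
```

Passing the clock value in as a parameter keeps the limiter deterministic in tests.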
## 7. Redaction and Scrubbing
### 7.1 Denylist Patterns
The following patterns MUST be redacted before emission:
| Category | Patterns |
|----------|----------|
| Secrets | `authorization`, `bearer`, `token`, `api[-_]?key`, `secret`, `password`, `credential` |
| PII | `email`, `phone`, `ssn`, `address`, `name` (when user-provided) |
| Security | `private[-_]?key`, `certificate`, `session[-_]?id` |
### 7.2 Redaction Format
```json
{
  "authorization": "[REDACTED]",
  "redaction": {
    "reason": "secret",
    "policy": "default-v1",
    "timestamp": "2025-12-19T10:00:00Z"
  }
}
```
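
A key-based scrubber for the Secrets and Security categories could be sketched as follows. The matching strategy is an assumption: the contract lists patterns but does not say whether they apply to keys or values, so the hypothetical `RedactionSketch` helper matches them case-insensitively against field names only. PII patterns and the `redaction` metadata object are left out of this sketch.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; matches denylist patterns against field names.
public static class RedactionSketch
{
    public const string Placeholder = "[REDACTED]";

    // Secrets and Security patterns from section 7.1.
    private static readonly Regex Denylist = new(
        @"authorization|bearer|token|api[-_]?key|secret|password|credential" +
        @"|private[-_]?key|certificate|session[-_]?id",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static Dictionary<string, string> Redact(IReadOnlyDictionary<string, string> fields)
    {
        var result = new Dictionary<string, string>(fields.Count);
        foreach (var (key, value) in fields)
        {
            // Replace the value, keep the key so the field stays visible.
            result[key] = Denylist.IsMatch(key) ? Placeholder : value;
        }
        return result;
    }
}
```

The input dictionary is not modified; the caller exports the returned copy.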
### 7.3 Hash Policy
When identifiers need to be preserved for correlation but hidden:
```csharp
public sealed record HashedIdentifier
{
    /// <summary>SHA-256 lowercase hex of original value.</summary>
    public required string Hash { get; init; }
    /// <summary>Marker indicating this is a hash.</summary>
    public bool IsHashed { get; init; } = true;
    /// <summary>Original field name.</summary>
    public required string FieldName { get; init; }
}
```
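
Producing the `Hash` value could look like the sketch below. The `IdentifierHasher` helper is hypothetical, and hashing the UTF-8 bytes of the string is an assumption; the contract only specifies SHA-256 with lowercase hex output.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; UTF-8 input encoding is an assumption.
public static class IdentifierHasher
{
    public static string Sha256LowerHex(string value)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
        // Convert.ToHexString emits uppercase, so normalize to lowercase.
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

The result is always 64 hexadecimal characters, so equal inputs stay correlatable across signals without exposing the original value.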
## 8. Sampling Policies
### 8.1 Trace Sampling
| Environment | Head Sampling | Error Boost | Audit Boost |
|-------------|--------------|-------------|-------------|
| Development | 100% | - | - |
| Staging | 10% | 100% | 100% |
| Production | 5% | 100% | 100% |
### 8.2 Audit Spans
Spans tagged `audit=true` are always sampled and retained for extended periods:
```csharp
public interface IAuditableOperation
{
    /// <summary>Mark span for audit trail.</summary>
    void MarkAudit(string reason);
}
```
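
The interaction between the head sampling ratio and the two boosts can be sketched as a single decision function. The approach here is an assumption, not the contract's algorithm: the hypothetical `SamplingSketch.ShouldSample` derives a deterministic bucket from the last 8 hex characters of the trace ID, and it takes the error and audit flags as inputs even though a real pipeline may only know about errors at tail-sampling time.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; the bucket derivation is illustrative only.
public static class SamplingSketch
{
    public static bool ShouldSample(string traceId, double headRatio, bool isError, bool isAudit)
    {
        // Error Boost and Audit Boost are 100% in the table above.
        if (isError || isAudit) return true;
        if (headRatio >= 1.0) return true;
        if (headRatio <= 0.0) return false;

        // Interpret the last 8 hex characters as a 32-bit bucket so the
        // same trace ID always gets the same decision.
        uint bucket = uint.Parse(traceId.AsSpan(traceId.Length - 8),
            NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return bucket < headRatio * 4294967296.0; // 2^32
    }
}
```

Because the decision depends only on the trace ID and the ratio, every service that sees the same trace reaches the same answer without coordination.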
## 9. Service Integration
### 9.1 Bootstrap Registration
```csharp
public static class TelemetryBootstrap
{
    public static IServiceCollection AddStellaOpsTelemetry(
        this IServiceCollection services,
        IConfiguration configuration,
        string serviceName,
        string serviceVersion,
        Action<TelemetryOptions>? configureOptions = null,
        Action<MeterProviderBuilder>? configureMetrics = null,
        Action<TracerProviderBuilder>? configureTracing = null);
}
public sealed class TelemetryOptions
{
    public CollectorOptions Collector { get; set; } = new();
    public SamplingOptions Sampling { get; set; } = new();
    public RedactionOptions Redaction { get; set; } = new();
    public bool SealedMode { get; set; }
}
```
### 9.2 Context Propagation
HTTP headers for W3C trace context:
- `traceparent`: `{version}-{trace-id}-{parent-id}-{trace-flags}`
- `tracestate`: Custom vendor state
- `baggage`: Tenant/correlation context
gRPC metadata:
- `x-trace-id`
- `x-span-id`
- `x-tenant-id`
- `x-correlation-id`
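
Parsing the `traceparent` header into its parts could be sketched as follows. The `TraceParent` type is hypothetical and only accepts version `00`, which has exactly four dash-separated fields; the field lengths and the all-zero checks follow the W3C Trace Context format.

```csharp
using System;
using System.Globalization;

// Hypothetical value type for a parsed version-00 traceparent header.
public readonly record struct TraceParent(string TraceId, string ParentId, byte Flags)
{
    // Bit 0 of trace-flags is the "sampled" flag.
    public bool Sampled => (Flags & 0x01) != 0;

    public static bool TryParse(string? header, out TraceParent result)
    {
        result = default;
        if (string.IsNullOrEmpty(header)) return false;

        string[] parts = header.Split('-');
        if (parts.Length != 4 || parts[0] != "00") return false;
        if (!IsLowerHex(parts[1], 32) || !IsLowerHex(parts[2], 16) || !IsLowerHex(parts[3], 2))
            return false;
        // All-zero trace-id and parent-id values are invalid.
        if (IsAllZero(parts[1]) || IsAllZero(parts[2])) return false;

        byte flags = byte.Parse(parts[3], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        result = new TraceParent(parts[1], parts[2], flags);
        return true;
    }

    private static bool IsLowerHex(string s, int length)
    {
        if (s.Length != length) return false;
        foreach (char c in s)
        {
            bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!ok) return false;
        }
        return true;
    }

    private static bool IsAllZero(string s) => s.Trim('0').Length == 0;
}
```

A receiver that fails to parse the header would normally start a new trace rather than reject the request.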
## 10. Orchestrator Integration (ORCH-OBS-50-001)
### 10.1 Required Spans
The Orchestrator service MUST emit these trace spans:
| Span Name | Description |
|-----------|-------------|
| `scheduler.dispatch` | Job dispatch to worker |
| `scheduler.schedule` | Job scheduling decision |
| `controller.create_job` | Job creation API |
| `controller.cancel_job` | Job cancellation API |
| `worker.execute` | Worker job execution |
### 10.2 Required Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `orchestrator_jobs_dispatched_total` | counter | Jobs dispatched by type |
| `orchestrator_jobs_pending` | gauge | Jobs in queue |
| `orchestrator_job_duration_seconds` | histogram | Job execution time |
| `orchestrator_dispatch_latency_seconds` | histogram | Time to dispatch |
| `orchestrator_worker_utilization` | gauge | Worker pool utilization |
### 10.3 Required Logs
| Event | Level | Fields |
|-------|-------|--------|
| Job scheduled | Information | `job_id`, `type`, `tenant_id`, `scheduled_at` |
| Job started | Information | `job_id`, `worker_id`, `trace_id` |
| Job completed | Information | `job_id`, `duration_ms`, `status` |
| Job failed | Error | `job_id`, `error_code`, `error_message`, `retry_count` |
## 11. Telemetry
### 11.1 Self-Monitoring Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `telemetry_exports_total` | counter | Export operations by status |
| `telemetry_export_duration_seconds` | histogram | Export latency |
| `telemetry_buffer_size` | gauge | Buffer utilization |
| `telemetry_dropped_total` | counter | Dropped signals |
### 11.2 Alerts
```yaml
groups:
  - name: telemetry-baselines
    rules:
      - alert: TelemetryExportFailure
        expr: increase(telemetry_exports_total{status="error"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Telemetry export failures detected"
      - alert: TelemetryHighDropRate
        expr: rate(telemetry_dropped_total[5m]) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High telemetry signal drop rate"
## 12. Configuration
```yaml
# etc/telemetry.yaml
Telemetry:
  Collector:
    Enabled: true
    Endpoint: "https://otel-collector.example:4317"
    Protocol: "grpc"
  Sampling:
    HeadSamplingRatio: 0.05
    ErrorBoost: true
    AuditBoost: true
  Redaction:
    Enabled: true
    PolicyVersion: "v1"
    StrictMode: true
  SealedMode: false # Enable for air-gap
```
## 13. Validation Rules
1. All signals MUST include `trace_id`, `tenant_id`, `workload`
2. Timestamps MUST be UTC ISO-8601 format
3. Metric names MUST follow `{module}_{component}_{metric_type}_{unit}` pattern
4. Span names MUST follow `{component}.{operation}` pattern
5. Redaction MUST be applied before any external export
6. Hash values MUST use SHA-256 lowercase hex
7. Log messages MUST NOT contain raw PII/secrets
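
Rules 1, 2, and 4 lend themselves to a mechanical check, sketched below. The `SignalValidator` helper is hypothetical and makes two assumptions: a signal is represented as a flat string dictionary using the snake_case keys from section 3.3, and a UTC timestamp is recognized by a trailing `Z`. `Validate` returns a list of human-readable violations, which is empty for a conforming signal.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper covering validation rules 1, 2, and 4 only.
public static class SignalValidator
{
    private static readonly string[] RequiredKeys = { "trace_id", "tenant_id", "workload" };

    // {component}.{operation}, lowercase with optional underscores.
    private static readonly Regex SpanName =
        new(@"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\z", RegexOptions.CultureInvariant);

    public static List<string> Validate(IReadOnlyDictionary<string, string> signal, string? spanName = null)
    {
        var errors = new List<string>();

        // Rule 1: required identifiers must be present and non-empty.
        foreach (string key in RequiredKeys)
        {
            if (!signal.TryGetValue(key, out var value) || string.IsNullOrWhiteSpace(value))
                errors.Add($"missing required field '{key}'");
        }

        // Rule 2: timestamp, when present, must parse and be marked UTC.
        if (signal.TryGetValue("timestamp", out var ts))
        {
            bool utc = ts.EndsWith("Z", StringComparison.Ordinal)
                && DateTimeOffset.TryParse(ts, CultureInfo.InvariantCulture, DateTimeStyles.None, out _);
            if (!utc) errors.Add("timestamp must be UTC ISO-8601");
        }

        // Rule 4: span names follow {component}.{operation}.
        if (spanName is not null && !SpanName.IsMatch(spanName))
            errors.Add($"span name '{spanName}' does not match {{component}}.{{operation}}");

        return errors;
    }
}
```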
---
## Changelog
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-19 | Initial release - unblocks 51-002, ORCH-OBS-50-001 |