OBS-50 Telemetry Baselines Contract v1.0.0

Status: APPROVED
Version: 1.0.0
Effective: 2025-12-19
Owner: Observability Guild + Telemetry Core Guild
Sprint: SPRINT_0170_0001_0001 (unblocks 51-002, ORCH-OBS-50-001)


1. Purpose

This contract defines the baseline telemetry standards for all StellaOps services, ensuring consistent observability across the platform. It specifies common envelope schemas, metric naming conventions, trace span standards, log formats, and redaction requirements.

2. Schema References

| Schema | Location |
|--------|----------|
| Telemetry Config | docs/modules/telemetry/schemas/telemetry-config.schema.json |
| Telemetry Bundle | docs/modules/telemetry/schemas/telemetry-bundle.schema.json |
| Telemetry Standards | docs/observability/telemetry-standards.md |
| Telemetry Bootstrap | docs/observability/telemetry-bootstrap.md |

3. Common Envelope Schema

3.1 Required Fields

All telemetry signals (traces, metrics, logs) MUST carry the following envelope fields:

public sealed record TelemetryEnvelope
{
    /// <summary>W3C trace context identifier.</summary>
    public required string TraceId { get; init; }

    /// <summary>W3C span identifier.</summary>
    public required string SpanId { get; init; }

    /// <summary>W3C trace flags.</summary>
    public int TraceFlags { get; init; }

    /// <summary>Tenant identifier.</summary>
    public required string TenantId { get; init; }

    /// <summary>Service/workload name.</summary>
    public required string Workload { get; init; }

    /// <summary>Deployment region.</summary>
    public required string Region { get; init; }

    /// <summary>Environment (dev/stage/prod).</summary>
    public required string Environment { get; init; }

    /// <summary>Service version (git SHA or semver).</summary>
    public required string Version { get; init; }

    /// <summary>Module/component name.</summary>
    public required string Component { get; init; }

    /// <summary>Operation name (verb/action).</summary>
    public required string Operation { get; init; }

    /// <summary>UTC ISO-8601 timestamp.</summary>
    public required DateTimeOffset Timestamp { get; init; }

    /// <summary>Outcome status.</summary>
    public required TelemetryStatus Status { get; init; }
}

public enum TelemetryStatus
{
    Ok = 0,
    Error = 1,
    Fault = 2,
    Throttle = 3
}

3.2 Optional Fields

public sealed record TelemetryContext
{
    /// <summary>Correlation ID for request chains.</summary>
    public string? CorrelationId { get; init; }

    /// <summary>Subject identifier (PURL, URI, or hashed ID).</summary>
    public string? Resource { get; init; }

    /// <summary>Project identifier within tenant.</summary>
    public string? ProjectId { get; init; }

    /// <summary>Actor identity (user/service).</summary>
    public string? Actor { get; init; }

    /// <summary>Policy rule that was applied.</summary>
    public string? ImposedRule { get; init; }

    /// <summary>Job/task run identifier.</summary>
    public string? RunId { get; init; }
}

3.3 JSON Example

{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": 1,
  "tenant_id": "tenant-001",
  "workload": "StellaOps.Orchestrator",
  "region": "eu-west-1",
  "environment": "prod",
  "version": "1.2.3",
  "component": "scheduler",
  "operation": "job.dispatch",
  "timestamp": "2025-12-19T10:00:00.000Z",
  "status": "ok",
  "correlation_id": "req-abc123",
  "run_id": "run-xyz789"
}
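
The required fields in section 3.1 can be checked mechanically against a serialized envelope. The following is a minimal sketch, not the project's validator: the class name `EnvelopeValidator` is hypothetical, and the snake_case key names are an assumption inferred from the JSON example above. The ID length checks follow the W3C Trace Context format (32 hex characters for the trace ID, 16 for the span ID).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical helper; key names mirror the JSON example and are an assumption.
public static class EnvelopeValidator
{
    private static readonly string[] RequiredKeys =
    {
        "trace_id", "span_id", "tenant_id", "workload", "region", "environment",
        "version", "component", "operation", "timestamp", "status"
    };

    /// <summary>Returns the problems found; an empty list means the envelope passed.</summary>
    public static IReadOnlyList<string> Validate(string json)
    {
        var problems = new List<string>();
        using var doc = JsonDocument.Parse(json);
        var root = doc.RootElement;
        if (root.ValueKind != JsonValueKind.Object)
        {
            problems.Add("envelope must be a JSON object");
            return problems;
        }

        foreach (var key in RequiredKeys)
        {
            if (!HasText(root, key)) problems.Add($"missing or empty required field: {key}");
        }

        // W3C Trace Context: lowercase hex, fixed length, not all zeros.
        if (!IsHexId(root, "trace_id", 32)) problems.Add("trace_id must be 32 lowercase hex characters");
        if (!IsHexId(root, "span_id", 16)) problems.Add("span_id must be 16 lowercase hex characters");
        return problems;
    }

    private static bool HasText(JsonElement root, string key) =>
        root.TryGetProperty(key, out var value)
        && value.ValueKind == JsonValueKind.String
        && !string.IsNullOrWhiteSpace(value.GetString());

    private static bool IsHexId(JsonElement root, string key, int length)
    {
        if (!HasText(root, key)) return false;
        var text = root.GetProperty(key).GetString() ?? string.Empty;
        return text.Length == length
            && text.All(c => (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))
            && text.Any(c => c != '0');
    }
}
```

Returning a list of problems rather than throwing lets a caller log every violation in one pass.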

4. Metric Naming Conventions

4.1 Naming Pattern

{module}_{component}_{metric_type}_{unit}

Examples:

  • orchestrator_jobs_dispatched_total (counter)
  • scanner_analysis_duration_seconds (histogram)
  • policy_evaluations_active (gauge)
  • concelier_ingestion_bytes_total (counter)
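
A name check along these lines could run in a unit test or a lint step. This is a sketch under stated assumptions: the contract gives a pattern and examples but no formal grammar, so the regular expression below (lowercase snake_case with at least two segments) and the `MetricNames` class are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the regex is an assumed reading of the naming pattern.
public static class MetricNames
{
    private static readonly Regex Pattern = new Regex(
        @"^[a-z][a-z0-9]*(_[a-z0-9]+)+\z",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    /// <summary>True when the name is lowercase snake_case with two or more segments.</summary>
    public static bool IsValid(string name) => Pattern.IsMatch(name);

    /// <summary>Joins segments with underscores and rejects results that fail the check.</summary>
    public static string Build(params string[] segments)
    {
        var name = string.Join("_", segments);
        if (!IsValid(name))
        {
            throw new ArgumentException($"'{name}' does not follow the metric naming pattern.");
        }
        return name;
    }
}
```

For example, `MetricNames.Build("scanner", "analysis", "duration", "seconds")` yields `scanner_analysis_duration_seconds`.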

4.2 Required Labels

| Label | Description | Cardinality |
|-------|-------------|-------------|
| tenant | Tenant identifier | Low |
| workload | Service name | Low |
| environment | Deployment environment | Low |
| status | Outcome (ok/error/fault) | Low |

4.3 Histogram Buckets

| Metric Type | Default Buckets |
|-------------|-----------------|
| Duration (seconds) | [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] |
| Size (bytes) | [256, 512, 1024, 4096, 16384, 65536, 262144, 1048576] |
| Count | [1, 5, 10, 25, 50, 100, 250, 500, 1000] |
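
To make the boundaries concrete, the sketch below shows which bucket an observation lands in. It assumes inclusive upper bounds (the convention used by Prometheus `le` buckets and OpenTelemetry explicit-bucket histograms); the contract itself does not state the boundary semantics, and `HistogramBuckets` is a hypothetical name.

```csharp
// Hypothetical helper illustrating bucket assignment with inclusive upper bounds.
public static class HistogramBuckets
{
    /// <summary>Default duration boundaries in seconds, copied from the table.</summary>
    public static readonly double[] DurationSeconds =
        { 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 };

    /// <summary>
    /// Returns the upper bound of the first bucket that contains the value,
    /// or positive infinity when the value exceeds the last boundary.
    /// Boundaries must be sorted in ascending order.
    /// </summary>
    public static double UpperBoundFor(double value, double[] boundaries)
    {
        foreach (var bound in boundaries)
        {
            if (value <= bound) return bound;
        }
        return double.PositiveInfinity;
    }
}
```

A 30 ms request (`0.03`) therefore counts toward the `0.05` bucket, and anything slower than 10 s falls into the overflow bucket.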

4.4 Golden Signal Metrics

Every service MUST expose these metrics:

| Metric | Type | Description |
|--------|------|-------------|
| {service}_requests_total | counter | Total requests by status |
| {service}_request_duration_seconds | histogram | Request latency |
| {service}_errors_total | counter | Error count by type |
| {service}_saturation_ratio | gauge | Resource utilization (0.0-1.0) |
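
One way a service might register these four instruments is with `System.Diagnostics.Metrics` from the .NET base class library. This is a sketch, not the StellaOps implementation: the `GoldenSignals` class, the meter name, and the choice to count every non-`ok` status as an error are assumptions.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical wrapper; class and meter names are illustrative only.
public sealed class GoldenSignals : IDisposable
{
    private readonly Meter _meter;
    private readonly Counter<long> _requests;
    private readonly Counter<long> _errors;
    private readonly Histogram<double> _duration;

    public GoldenSignals(string service, Func<double> saturation)
    {
        _meter = new Meter($"StellaOps.{service}");
        _requests = _meter.CreateCounter<long>($"{service}_requests_total");
        _errors = _meter.CreateCounter<long>($"{service}_errors_total");
        _duration = _meter.CreateHistogram<double>($"{service}_request_duration_seconds", unit: "s");
        // Observable gauge: the callback is polled at collection time and should return 0.0-1.0.
        _meter.CreateObservableGauge<double>($"{service}_saturation_ratio", saturation);
    }

    public string RequestsMetricName => _requests.Name;

    /// <summary>Records one request with the required labels from section 4.2.</summary>
    public void RecordRequest(string tenant, string workload, string environment, string status, double seconds)
    {
        var tags = new TagList
        {
            { "tenant", tenant },
            { "workload", workload },
            { "environment", environment },
            { "status", status }
        };
        _requests.Add(1, tags);
        _duration.Record(seconds, tags);
        if (status != "ok") _errors.Add(1, tags); // assumption: any non-ok outcome counts as an error
    }

    public void Dispose() => _meter.Dispose();
}
```

Some exporters append `_total` or a unit suffix on their own, so it is worth checking the exported names against the table before relying on them in dashboards.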

5. Trace Span Standards

5.1 Span Naming

{component}.{operation}

Examples:

  • scheduler.dispatch
  • policy.evaluate
  • scanner.analyze
  • concelier.ingest

5.2 Required Span Attributes

| Attribute | Description |
|-----------|-------------|
| tenant.id | Tenant identifier |
| workload | Service name |
| component | Module/subsystem |
| operation | Action being performed |
| status.code | OpenTelemetry status code |
| status.message | Status description |
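
The naming rule and the required attributes can be applied together when a span is started. The sketch below uses `System.Diagnostics.ActivitySource`, the .NET representation of OpenTelemetry spans. The `SpanConventions` class and the source name are hypothetical, and it assumes the exporter maps the activity status onto `status.code` and `status.message`.

```csharp
using System.Diagnostics;

// Hypothetical helper; the ActivitySource name is illustrative only.
public static class SpanConventions
{
    private static readonly ActivitySource Source = new ActivitySource("StellaOps.Telemetry.Sketch");

    /// <summary>
    /// Starts a span named {component}.{operation} and sets the required attributes.
    /// Returns null when no listener is sampling this source.
    /// </summary>
    public static Activity? Start(string tenantId, string workload, string component, string operation)
    {
        var activity = Source.StartActivity($"{component}.{operation}", ActivityKind.Internal);
        activity?.SetTag("tenant.id", tenantId);
        activity?.SetTag("workload", workload);
        activity?.SetTag("component", component);
        activity?.SetTag("operation", operation);
        return activity;
    }

    /// <summary>Sets the final status and ends the span. The message is kept only for errors.</summary>
    public static void Complete(Activity? activity, bool success, string? message = null)
    {
        activity?.SetStatus(success ? ActivityStatusCode.Ok : ActivityStatusCode.Error, message);
        activity?.Dispose();
    }
}
```

Because `StartActivity` returns null when nothing is listening, every call on the activity uses the null-conditional operator.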

5.3 Span Events

Use span events for notable occurrences within a span:

using System.Collections.Immutable;

public sealed record SpanEventContract
{
    public required string Name { get; init; }
    public required DateTimeOffset Timestamp { get; init; }
    public ImmutableDictionary<string, object>? Attributes { get; init; }
}

Standard event names:

  • exception - Exception occurred
  • retry - Retry attempt
  • cache.hit / cache.miss - Cache interaction
  • policy.applied - Policy rule applied
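
In .NET, span events correspond to `ActivityEvent` instances attached to the current `Activity`. The sketch below records two of the standard event names; the `SpanEvents` class and the attribute keys (`retry.attempt`, `retry.delay_ms`) are illustrative assumptions, since the contract names the events but not their attributes.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; attribute keys are not defined by the contract.
public static class SpanEvents
{
    /// <summary>Adds a "retry" event with the attempt number and back-off delay.</summary>
    public static void Retry(Activity? activity, int attempt, TimeSpan delay)
    {
        var tags = new ActivityTagsCollection
        {
            { "retry.attempt", attempt },
            { "retry.delay_ms", delay.TotalMilliseconds }
        };
        Add(activity, "retry", tags);
    }

    /// <summary>Adds a "cache.hit" or "cache.miss" event.</summary>
    public static void Cache(Activity? activity, bool hit)
    {
        Add(activity, hit ? "cache.hit" : "cache.miss", null);
    }

    private static void Add(Activity? activity, string name, ActivityTagsCollection? tags)
    {
        // No-op when the span was not sampled (activity is null).
        activity?.AddEvent(new ActivityEvent(name, DateTimeOffset.UtcNow, tags));
    }
}
```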

6. Log Format Standards

6.1 Structured Log Fields

using System.Collections.Immutable;

public sealed record StructuredLogEntry
{
    /// <summary>UTC ISO-8601 timestamp.</summary>
    public required DateTimeOffset Timestamp { get; init; }

    /// <summary>Log severity level.</summary>
    public required LogLevel Level { get; init; }

    /// <summary>Log message template.</summary>
    public required string MessageTemplate { get; init; }

    /// <summary>Rendered message.</summary>
    public required string Message { get; init; }

    /// <summary>Exception details if present.</summary>
    public ExceptionInfo? Exception { get; init; }

    /// <summary>Trace context.</summary>
    public required TraceContext TraceContext { get; init; }

    /// <summary>Service context.</summary>
    public required ServiceContext ServiceContext { get; init; }

    /// <summary>Additional properties.</summary>
    public ImmutableDictionary<string, object>? Properties { get; init; }
}

public enum LogLevel
{
    Trace = 0,
    Debug = 1,
    Information = 2,
    Warning = 3,
    Error = 4,
    Critical = 5
}

6.2 Log Rate Limits

| Level | Default Rate | Notes |
|-------|--------------|-------|
| Trace/Debug | 10/s per component | Disabled in production |
| Information | 100/s per component | Sampled under pressure |
| Warning | 500/s per component | Never sampled |
| Error/Critical | Unlimited | Always emitted |
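
The per-component budgets can be enforced with a simple fixed-window counter. The following is a sketch under stated assumptions, not the platform's limiter: `LogRateLimiter` is a hypothetical class, the window is one wall-clock second, the code is not thread-safe, and what happens to entries over budget (dropped or sampled) is left to the caller because the table does not specify it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical fixed-window limiter keyed by (component, level). Not thread-safe.
public sealed class LogRateLimiter
{
    private readonly Dictionary<(string Component, string Level), (long Window, int Count)> _state =
        new Dictionary<(string Component, string Level), (long Window, int Count)>();
    private readonly Func<DateTimeOffset> _clock;

    public LogRateLimiter(Func<DateTimeOffset> clock)
    {
        _clock = clock;
    }

    // Per-second budgets from the table; null means unlimited.
    private static int? LimitFor(string level)
    {
        switch (level)
        {
            case "Trace":
            case "Debug": return 10;
            case "Information": return 100;
            case "Warning": return 500;
            default: return null; // Error and Critical are always emitted
        }
    }

    /// <summary>Returns true when the entry fits within the current one-second budget.</summary>
    public bool TryAcquire(string component, string level)
    {
        var limit = LimitFor(level);
        if (limit == null) return true;

        var window = _clock().ToUnixTimeSeconds();
        var key = (component, level);
        var count = _state.TryGetValue(key, out var current) && current.Window == window
            ? current.Count
            : 0;
        if (count >= limit.Value) return false;

        _state[key] = (window, count + 1);
        return true;
    }
}
```

Injecting the clock keeps the limiter deterministic in tests; production code would pass `() => DateTimeOffset.UtcNow`.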

7. Redaction and Scrubbing

7.1 Denylist Patterns

The following patterns MUST be redacted before emission:

| Category | Patterns |
|----------|----------|
| Secrets | authorization, bearer, token, api[-_]?key, secret, password, credential |
| PII | email, phone, ssn, address, name (when user-provided) |
| Security | private[-_]?key, certificate, session[-_]?id |

7.2 Redaction Format

{
  "authorization": "[REDACTED]",
  "redaction": {
    "reason": "secret",
    "policy": "default-v1",
    "timestamp": "2025-12-19T10:00:00Z"
  }
}
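
A key-based scrubber built from the denylist in section 7.1 might look like the sketch below. Several choices here are assumptions rather than contract text: the `Redactor` class, case-insensitive substring matching on field names, and leaving out the `name` pattern, because whether a name is user-provided cannot be decided from the key alone. Substring matching is deliberately broad, so a key such as `ip_address` is also redacted.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical scrubber; matching rules are an assumed reading of the denylist.
public static class Redactor
{
    public const string Placeholder = "[REDACTED]";

    // Denylist patterns joined into one case-insensitive expression.
    // "name" is intentionally omitted: it applies only to user-provided values.
    private static readonly Regex Denylist = new Regex(
        "authorization|bearer|token|api[-_]?key|secret|password|credential|" +
        "email|phone|ssn|address|" +
        "private[-_]?key|certificate|session[-_]?id",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    /// <summary>Returns a copy of the fields with denylisted values replaced by the placeholder.</summary>
    public static Dictionary<string, string> Scrub(IReadOnlyDictionary<string, string> fields)
    {
        var result = new Dictionary<string, string>(fields.Count);
        foreach (var pair in fields)
        {
            result[pair.Key] = Denylist.IsMatch(pair.Key) ? Placeholder : pair.Value;
        }
        return result;
    }
}
```

The sketch replaces values only; attaching the `redaction` metadata object from the example above is left out for brevity.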

7.3 Hash Policy

When identifiers need to be preserved for correlation but hidden:

public sealed record HashedIdentifier
{
    /// <summary>SHA-256 lowercase hex of original value.</summary>
    public required string Hash { get; init; }

    /// <summary>Marker indicating this is a hash.</summary>
    public bool IsHashed { get; init; } = true;

    /// <summary>Original field name.</summary>
    public required string FieldName { get; init; }
}
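
Producing the `Hash` value is a one-liner with the base class library. In this sketch the UTF-8 encoding and the absence of a salt are assumptions; the contract fixes only the algorithm (SHA-256) and the output form (lowercase hex). `IdentifierHasher` is a hypothetical name.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper; encoding and salting policy are assumptions.
public static class IdentifierHasher
{
    /// <summary>Returns the SHA-256 digest of the UTF-8 bytes as 64 lowercase hex characters.</summary>
    public static string Sha256LowerHex(string value)
    {
        var digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
        return Convert.ToHexString(digest).ToLowerInvariant();
    }
}
```

Because the hash is deterministic, the same identifier always maps to the same value, which is what allows correlation across signals. Low-entropy identifiers remain guessable without a salt or key.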

8. Sampling Policies

8.1 Trace Sampling

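| Environment | Head Sampling | Error Boost | Audit Boost |
|-------------|---------------|-------------|-------------|
| Development | 100% | - | - |
| Staging | 10% | 100% | 100% |
| Production | 5% | 100% | 100% |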
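
The table can be read as a decision function: boosted spans are always kept, and everything else is kept with the head-sampling probability. The sketch below makes that decision deterministic per trace by mapping the first 16 hex characters of the trace ID onto the range 0 to 1. That mapping and the `SamplingPolicy` class are assumptions, not the platform's sampler. Note also that an error is usually not known when a span starts, so a real error boost typically needs a tail-based decision; here the flags are simply inputs.

```csharp
using System;
using System.Globalization;

// Hypothetical decision function; the trace-id-to-ratio mapping is an assumption.
public static class SamplingPolicy
{
    /// <summary>Decides whether to keep a trace given the head ratio and boost flags.</summary>
    public static bool ShouldSample(string traceId, double headRatio, bool isError, bool isAudit)
    {
        if (isError || isAudit) return true; // boosts: always sampled
        if (headRatio >= 1.0) return true;
        if (headRatio <= 0.0) return false;

        // Interpret the first 8 bytes (16 hex chars) of the trace ID as a fraction of ulong.MaxValue.
        var prefix = ulong.Parse(traceId.AsSpan(0, 16), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return prefix / (double)ulong.MaxValue < headRatio;
    }
}
```

Deriving the decision from the trace ID means every service reaches the same verdict for a given trace without coordination.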

8.2 Audit Spans

Spans tagged audit=true are always sampled and retained for extended periods:

public interface IAuditableOperation
{
    /// <summary>Mark span for audit trail.</summary>
    void MarkAudit(string reason);
}

9. Service Integration

9.1 Bootstrap Registration

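using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

public static class TelemetryBootstrap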
{
    public static IServiceCollection AddStellaOpsTelemetry(
        this IServiceCollection services,
        IConfiguration configuration,
        string serviceName,
        string serviceVersion,
        Action<TelemetryOptions>? configureOptions = null,
        Action<MeterProviderBuilder>? configureMetrics = null,
        Action<TracerProviderBuilder>? configureTracing = null);
}

public sealed class TelemetryOptions
{
    public CollectorOptions Collector { get; set; } = new();
    public SamplingOptions Sampling { get; set; } = new();
    public RedactionOptions Redaction { get; set; } = new();
    public bool SealedMode { get; set; }
}

9.2 Context Propagation

HTTP headers (W3C Trace Context and W3C Baggage):

  • traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}
  • tracestate: Custom vendor state
  • baggage: Tenant/correlation context

gRPC metadata:

  • x-trace-id
  • x-span-id
  • x-tenant-id
  • x-correlation-id
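
On the receiving side, the `traceparent` header can be parsed with `ActivityContext.TryParse` from `System.Diagnostics` instead of hand-rolled string splitting. The wrapper below is a sketch; the `TraceParent` class and its method shape are hypothetical.

```csharp
using System.Diagnostics;

// Hypothetical wrapper around the built-in W3C traceparent parser.
public static class TraceParent
{
    /// <summary>
    /// Extracts the trace ID, parent span ID, and sampled flag from a traceparent header.
    /// Returns false when the header is malformed.
    /// </summary>
    public static bool TryRead(string header, out string traceId, out string spanId, out bool sampled)
    {
        traceId = string.Empty;
        spanId = string.Empty;
        sampled = false;

        if (!ActivityContext.TryParse(header, null, out var context)) return false;

        traceId = context.TraceId.ToHexString();
        spanId = context.SpanId.ToHexString();
        sampled = (context.TraceFlags & ActivityTraceFlags.Recorded) != 0;
        return true;
    }
}
```

For the header `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`, the trace ID and span ID match the values in the JSON example in section 3.3, and the trailing `01` marks the trace as sampled.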

10. Orchestrator Integration (ORCH-OBS-50-001)

10.1 Required Spans

The Orchestrator service MUST emit these trace spans:

| Span Name | Description |
|-----------|-------------|
| scheduler.dispatch | Job dispatch to worker |
| scheduler.schedule | Job scheduling decision |
| controller.create_job | Job creation API |
| controller.cancel_job | Job cancellation API |
| worker.execute | Worker job execution |

10.2 Required Metrics

| Metric | Type | Description |
|--------|------|-------------|
| orchestrator_jobs_dispatched_total | counter | Jobs dispatched by type |
| orchestrator_jobs_pending | gauge | Jobs in queue |
| orchestrator_job_duration_seconds | histogram | Job execution time |
| orchestrator_dispatch_latency_seconds | histogram | Time to dispatch |
| orchestrator_worker_utilization | gauge | Worker pool utilization |

10.3 Required Logs

| Event | Level | Fields |
|-------|-------|--------|
| Job scheduled | Info | job_id, type, tenant_id, scheduled_at |
| Job started | Info | job_id, worker_id, trace_id |
| Job completed | Info | job_id, duration_ms, status |
| Job failed | Error | job_id, error_code, error_message, retry_count |

11. Telemetry

11.1 Self-Monitoring Metrics

| Metric | Type | Description |
|--------|------|-------------|
| telemetry_exports_total | counter | Export operations by status |
| telemetry_export_duration_seconds | histogram | Export latency |
| telemetry_buffer_size | gauge | Buffer utilization |
| telemetry_dropped_total | counter | Dropped signals |

11.2 Alerts

groups:
  - name: telemetry-baselines
    rules:
      - alert: TelemetryExportFailure
        expr: increase(telemetry_exports_total{status="error"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Telemetry export failures detected"

      - alert: TelemetryHighDropRate
        expr: rate(telemetry_dropped_total[5m]) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High telemetry signal drop rate"

12. Configuration

# etc/telemetry.yaml
Telemetry:
  Collector:
    Enabled: true
    Endpoint: "https://otel-collector.example:4317"
    Protocol: "grpc"

  Sampling:
    HeadSamplingRatio: 0.05
    ErrorBoost: true
    AuditBoost: true

  Redaction:
    Enabled: true
    PolicyVersion: "v1"
    StrictMode: true

  SealedMode: false  # Enable for air-gap

13. Validation Rules

  1. All signals MUST include trace_id, tenant_id, workload
  2. Timestamps MUST be UTC ISO-8601 format
  3. Metric names MUST follow the {module}_{component}_{metric_type}_{unit} pattern
  4. Span names MUST follow {component}.{operation} pattern
  5. Redaction MUST be applied before any external export
  6. Hash values MUST use SHA-256 lowercase hex
  7. Log messages MUST NOT contain raw PII/secrets

Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2025-12-19 | Initial release - unblocks 51-002, ORCH-OBS-50-001 |