# P0 Product Metrics

> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

## Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |

---

## Metric 1: Time to First Verified Release

**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram

### Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

### Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

### Collection Points

1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion

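A minimal sketch of how this histogram could be defined and recorded with .NET's `System.Diagnostics.Metrics`. The metric name and labels come from this page; the meter name, class, and method names are illustrative assumptions, and the actual `P0ProductMetrics.cs` may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch only - not the actual P0ProductMetrics.cs contents.
public static class TimeToFirstReleaseSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Bucket boundaries (5m ... 168h) are applied in the exporter's
    // histogram view configuration, not on the instrument itself.
    private static readonly Histogram<double> TimeToFirstVerifiedRelease =
        Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Fresh install to first verified promotion");

    // Called by the Release Orchestrator on the first verified promotion;
    // installedAt comes from the install timestamp recorded by Authority.
    public static void Record(DateTimeOffset installedAt, DateTimeOffset promotedAt,
                              string tenant, string deploymentType) =>
        TimeToFirstVerifiedRelease.Record(
            (promotedAt - installedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
}
```

Computing the elapsed time at promotion, from the persisted install timestamp, avoids keeping any long-lived timer alive across service restarts.
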
### Why This Matters

A short time-to-first-release indicates:
- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations

### Dashboard Usage

The Grafana dashboard shows:
- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time

### Alert Response

**Warning (P90 > 4 hours):**
1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity

**Critical (P90 > 24 hours):**
1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance

---

## Metric 2: Mean Time to Answer "Why Blocked"

**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram

### Definition

Time from block decision to user viewing explanation (via CLI, UI, or API).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Viewed in the same session as the block vs. a later session |

### Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

### Collection Points

1. **Block decision** - Timestamp stored in verdict
2. **Explanation view** - Tracked when `stella explain block` or the UI equivalent is invoked

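As a sketch of the two collection points above (again, names besides the metric and labels are illustrative assumptions), the latency could be recorded at view time from the block timestamp stored in the verdict:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch only; resolutionType ("immediate" vs "delayed") is
// assumed to be determined by comparing session identifiers, per the label
// definition above.
public static class WhyBlockedLatencySketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> WhyBlockedLatency =
        Meter.CreateHistogram<double>(
            "stella_why_blocked_latency_seconds",
            unit: "s",
            description: "Block decision to explanation view");

    // Called when `stella explain block` or a UI/API equivalent is invoked.
    public static void RecordExplanationViewed(DateTimeOffset blockDecidedAt,
        string tenant, string surface, string resolutionType) =>
        WhyBlockedLatency.Record(
            (DateTimeOffset.UtcNow - blockDecidedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("surface", surface),
            new KeyValuePair<string, object?>("resolution_type", resolutionType));
}
```
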
### Why This Matters

Short "why blocked" latency indicates:
- Clear block messaging
- Discoverable explanation tools
- Good explainability UX

Long latency may indicate:
- Users confused about where to find answers
- Documentation gaps
- UX friction

### Dashboard Usage

The Grafana dashboard shows:
- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)

### Alert Response

**Warning (P90 > 5 minutes):**
1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent

**Critical (P90 > 1 hour):**
1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text

---

## Metric 3: Support Minutes per Customer

**Name:** `stella_support_burden_minutes_total`
**Type:** Counter

### Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |

### Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```

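Behind either entry point, the counter increment itself is simple. A hedged sketch (the handler wiring, meter name, and method names are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch of what the CLI/API entry points above might call.
public static class SupportBurdenSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> SupportMinutes =
        Meter.CreateCounter<long>(
            "stella_support_burden_minutes_total",
            unit: "min",
            description: "Accumulated support time per customer");

    public static void LogSupport(string tenant, long minutes, string category) =>
        SupportMinutes.Add(minutes,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("category", category),
            // `month` label per the table above (YYYY-MM).
            new KeyValuePair<string, object?>("month",
                DateTime.UtcNow.ToString("yyyy-MM")));
}
```
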
### Why This Matters

This metric tracks operational scalability. For solo-scaled operations:
- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement

### Dashboard Usage

The Grafana dashboard shows:
- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden

### Alert Response

**Warning (> 30 min/month per tenant):**
1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues

**Critical (> 60 min/month per tenant):**
1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation

---

## Metric 4: Determinism Regressions

**Name:** `stella_determinism_regressions_total`
**Type:** Counter

### Definition

Count of detected determinism failures in production (same inputs produced different outputs).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |

### Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs (semantically identical) | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |

### Collection Points

1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection

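A hedged sketch of the counter increment shared by all three collection points (the wrapper class and method names are illustrative assumptions; label values mirror the tables above):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch; component and severity values mirror the label
// tables above.
public static class DeterminismRegressionSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> Regressions =
        Meter.CreateCounter<long>(
            "stella_determinism_regressions_total",
            description: "Determinism failures detected in production");

    // Called from scheduled verification jobs, replay checks, and CI golden tests.
    public static void Record(string tenant, string component, string severity) =>
        Regressions.Add(1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("component", component),
            new KeyValuePair<string, object?>("severity", severity));
}
```
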
### Why This Matters

Determinism is a core moat. Regressions indicate:
- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs

**Policy-level regressions are audit-breaking** and must be fixed immediately.

### Dashboard Usage

The Grafana dashboard shows:
- Counter with severity breakdown
- Alert status indicator
- Historical trend

### Alert Response

**Warning (any bitwise/semantic):**
1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component

**Critical (any policy):**
1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers

---

## Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or via the CLI:
```bash
stella ops dashboard p0
```

### Dashboard Features

- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views

---

## Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

### Alert Channels

Configure alert destinations in Grafana:
- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries

### Silencing Alerts

During maintenance windows:
```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

---

## Implementation Notes

### Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

### Adding Custom Metrics

To add additional P0-level metrics:

1. Define in `P0ProductMetrics.cs` (a sketch follows this list)
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation

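A hedged template for step 1; the meter and member names are illustrative assumptions, so follow whatever conventions `P0ProductMetrics.cs` actually uses:

```csharp
using System.Diagnostics.Metrics;

// Illustrative template, not the actual file contents.
public static class CustomMetricSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Step 1: define the instrument with a stella_-prefixed name.
    public static readonly Counter<long> ExampleMetric =
        Meter.CreateCounter<long>(
            "stella_example_events_total",
            description: "Describe what this counts");

    // Step 2: call ExampleMetric.Add(1, tags) at the relevant collection points.
}
```
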
---

## Related

- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)

---

_Last updated: 2026-01-17 (UTC)_