# P0 Product Metrics
> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation
This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.
## Overview
These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |
---
## Metric 1: Time to First Verified Release
**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram
### Definition
Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |
### Histogram Buckets
5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)
### Collection Points
1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion
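The two collection points above amount to a simple elapsed-time computation at promotion time. The following is a minimal sketch, not the actual `InstallTimestampService.cs` logic; the function names are hypothetical:

```python
from datetime import datetime, timezone

# Bucket boundaries in seconds: 5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h
BUCKETS = [300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 172800, 604800]

def elapsed_seconds(install_ts: datetime, first_promotion_ts: datetime) -> float:
    """Elapsed time from first service startup to first verified promotion."""
    return (first_promotion_ts - install_ts).total_seconds()

def bucket_for(value: float) -> float:
    """Smallest histogram bucket boundary >= value, or +inf past the last bucket."""
    for bound in BUCKETS:
        if value <= bound:
            return bound
    return float("inf")

install = datetime(2026, 1, 17, 8, 0, tzinfo=timezone.utc)
promo = datetime(2026, 1, 17, 10, 30, tzinfo=timezone.utc)
secs = elapsed_seconds(install, promo)  # 9000.0 seconds (2.5 hours)
print(secs, bucket_for(secs))           # lands in the 4h (14400s) bucket
```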
### Why This Matters
A short time-to-first-release indicates:
- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations
### Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time
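The dashboard derives P50/P90/P99 from the histogram buckets; for intuition, a nearest-rank percentile over raw observations can be sketched as:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical time-to-first-release observations, in seconds
durations = [600, 1200, 1800, 3600, 5400, 7200, 14400, 21600, 36000, 90000]
print(percentile(durations, 90) / 3600)  # 10.0 hours: above the 4h target
```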
### Alert Response
**Warning (P90 > 4 hours):**
1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity
**Critical (P90 > 24 hours):**
1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance
---
## Metric 2: Mean Time to Answer "Why Blocked"
**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram
### Definition
Time from block decision to user viewing explanation (via CLI, UI, or API).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Same session vs different session |
### Histogram Buckets
1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h
### Collection Points
1. **Block decision** - Timestamp stored in verdict
2. **Explanation view** - Tracked when `stella explain block` or UI equivalent invoked
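Combining the two collection points, the latency observation can be sketched as follows. The verdict field name and the 30-minute session window are illustrative assumptions, not the product's actual schema:

```python
from datetime import datetime, timezone

def why_blocked_latency(verdict: dict, viewed_at: datetime) -> tuple[float, str]:
    """Latency from block decision to explanation view, plus resolution_type.

    'immediate' if viewed within the same session window (assumed 30 minutes),
    'delayed' otherwise.
    """
    blocked_at = verdict["blocked_at"]
    latency = (viewed_at - blocked_at).total_seconds()
    resolution = "immediate" if latency <= 30 * 60 else "delayed"
    return latency, resolution

verdict = {"blocked_at": datetime(2026, 1, 17, 12, 0, tzinfo=timezone.utc)}
latency, resolution = why_blocked_latency(
    verdict, datetime(2026, 1, 17, 12, 4, tzinfo=timezone.utc))
print(latency, resolution)  # 240.0 immediate
```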
### Why This Matters
Short "why blocked" latency indicates:
- Clear block messaging
- Discoverable explanation tools
- Good explainability UX
Long latency may indicate:
- Users confused about where to find answers
- Documentation gaps
- UX friction
### Dashboard Usage
The Grafana dashboard shows:
- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)
### Alert Response
**Warning (P90 > 5 minutes):**
1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent
**Critical (P90 > 1 hour):**
1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text
---
## Metric 3: Support Minutes per Customer
**Name:** `stella_support_burden_minutes_total`
**Type:** Counter
### Definition
Accumulated support time per customer per month. This metric is recorded manually or semi-automatically and is intended for tracking support burden in solo-scaled operations.
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |
### Collection
Log support interactions using:
```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```
Or via API:
```http
POST /v1/ops/support/log
{
"tenant": "acme-corp",
"minutes": 15,
"category": "config"
}
```
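Aggregating logged interactions into the monthly per-tenant totals that the alert thresholds apply to can be sketched as follows (a stdlib illustration, not the service's actual aggregation code):

```python
from collections import defaultdict

WARN_MINUTES, CRIT_MINUTES = 30, 60  # per tenant per month, per the thresholds below

def monthly_totals(entries: list[dict]) -> dict:
    """Sum support minutes per (tenant, month), mirroring the counter labels."""
    totals = defaultdict(int)
    for e in entries:
        totals[(e["tenant"], e["month"])] += e["minutes"]
    return dict(totals)

def alert_level(minutes: int) -> str:
    if minutes > CRIT_MINUTES:
        return "critical"
    if minutes > WARN_MINUTES:
        return "warning"
    return "ok"

log = [
    {"tenant": "acme-corp", "month": "2026-01", "minutes": 15, "category": "config"},
    {"tenant": "acme-corp", "month": "2026-01", "minutes": 25, "category": "policy"},
]
totals = monthly_totals(log)
print(totals, alert_level(totals[("acme-corp", "2026-01")]))  # 40 minutes -> warning
```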
### Why This Matters
This metric tracks operational scalability. For solo-scaled operations:
- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement
### Dashboard Usage
The Grafana dashboard shows:
- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden
### Alert Response
**Warning (> 30 min/month per tenant):**
1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues
**Critical (> 60 min/month per tenant):**
1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation
---
## Metric 4: Determinism Regressions
**Name:** `stella_determinism_regressions_total`
**Type:** Counter
### Definition
Count of detected determinism failures in production (same inputs produced different outputs).
### Labels
| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |
### Severity Tiers
| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |
### Collection Points
1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection
### Why This Matters
Determinism is a core moat. Regressions indicate:
- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs
**Policy-level regressions are audit-breaking** and must be fixed immediately.
### Dashboard Usage
The Grafana dashboard shows:
- Counter with severity breakdown
- Alert status indicator
- Historical trend
### Alert Response
**Warning (any bitwise/semantic):**
1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component
**Critical (any policy):**
1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers
---
## Dashboard Access
The P0 metrics dashboard is available at:
```
/grafana/d/stella-p0-metrics
```
Or directly:
```bash
stella ops dashboard p0
```
### Dashboard Features
- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views
---
## Alerting Configuration
Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.
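For orientation, a rule in that file might take a shape like the following. This is an illustrative Prometheus-style sketch only, not the actual contents of `stella-p0-alerts.yml`:

```yaml
groups:
  - name: stella-p0
    rules:
      - alert: DeterminismPolicyRegression
        # Any policy-level regression fires immediately (zero-tolerance target).
        expr: increase(stella_determinism_regressions_total{severity="policy"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Policy-level determinism regression in {{ $labels.component }}"
```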
### Alert Channels
Configure alert destinations in Grafana:
- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries
### Silencing Alerts
During maintenance windows:
```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```
---
## Implementation Notes
### Source Files
| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |
### Adding Custom Metrics
To add additional P0-level metrics:
1. Define in `P0ProductMetrics.cs`
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation
---
## Related
- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)
---
_Last updated: 2026-01-17 (UTC)_