# P0 Product Metrics

> **Sprint:** SPRINT_20260117_028_Telemetry_p0_metrics
> **Task:** P0M-007 - Documentation

This document describes the four P0 (highest priority) product-level metrics for tracking Stella Ops operational health.

## Overview

These metrics serve as the primary scoreboard for product health and should guide prioritization decisions. Per the AI Economics Moat advisory: "Prioritize work that improves them."

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Time to First Verified Release | P90 < 4 hours | P90 > 24 hours |
| Mean Time to Answer "Why Blocked" | P90 < 5 minutes | P90 > 1 hour |
| Support Minutes per Customer | Trend toward 0 | > 30 min/month |
| Determinism Regressions | Zero | Any policy-level |

---

## Metric 1: Time to First Verified Release

**Name:** `stella_time_to_first_verified_release_seconds`
**Type:** Histogram

### Definition

Elapsed time from fresh install (first service startup) to first successful verified promotion (policy gate passed, evidence recorded).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `deployment_type` | `fresh`, `upgrade` | Type of installation |

### Histogram Buckets

5m, 15m, 30m, 1h, 2h, 4h, 8h, 24h, 48h, 168h (1 week)

### Collection Points

1. **Install timestamp** - Recorded on first Authority service startup
2. **First promotion** - Recorded in Release Orchestrator on first verified promotion

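A minimal sketch of how this histogram could be defined and recorded with .NET's `System.Diagnostics.Metrics`. The metric name and labels come from this page; the meter name, class, and method names are illustrative assumptions, and the actual `P0ProductMetrics.cs` may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch only - not the actual P0ProductMetrics.cs contents.
public static class TimeToFirstReleaseSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Bucket boundaries (5m ... 168h) are applied in the exporter's
    // histogram view configuration, not on the instrument itself.
    private static readonly Histogram<double> TimeToFirstVerifiedRelease =
        Meter.CreateHistogram<double>(
            "stella_time_to_first_verified_release_seconds",
            unit: "s",
            description: "Fresh install to first verified promotion");

    // Called by the Release Orchestrator on the first verified promotion;
    // installedAt comes from the install timestamp recorded by Authority.
    public static void Record(DateTimeOffset installedAt, DateTimeOffset promotedAt,
                              string tenant, string deploymentType) =>
        TimeToFirstVerifiedRelease.Record(
            (promotedAt - installedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("deployment_type", deploymentType));
}
```

Computing the elapsed time at promotion, from the persisted install timestamp, avoids keeping any long-lived timer alive across service restarts.
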
### Why This Matters

A short time-to-first-release indicates:
- Good onboarding experience
- Clear documentation
- Sensible default configurations
- Working integrations

### Dashboard Usage

The Grafana dashboard shows:
- Histogram heatmap of time distribution
- P50/P90/P99 statistics
- Trend over time

### Alert Response

**Warning (P90 > 4 hours):**
1. Review recent onboarding experiences
2. Check for common configuration issues
3. Review documentation clarity

**Critical (P90 > 24 hours):**
1. Investigate blocked customers
2. Check for integration failures
3. Consider guided onboarding assistance

---

## Metric 2: Mean Time to Answer "Why Blocked"

**Name:** `stella_why_blocked_latency_seconds`
**Type:** Histogram

### Definition

Time from block decision to user viewing explanation (via CLI, UI, or API).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `surface` | `cli`, `ui`, `api` | Interface used to view explanation |
| `resolution_type` | `immediate`, `delayed` | Viewed in the same session as the block vs. a later session |

### Histogram Buckets

1s, 5s, 30s, 1m, 5m, 15m, 1h, 4h, 24h

### Collection Points

1. **Block decision** - Timestamp stored in verdict
2. **Explanation view** - Tracked when `stella explain block` or the UI equivalent is invoked

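As a sketch of the two collection points above (again, names besides the metric and labels are illustrative assumptions), the latency could be recorded at view time from the block timestamp stored in the verdict:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch only; resolutionType ("immediate" vs "delayed") is
// assumed to be determined by comparing session identifiers, per the label
// definition above.
public static class WhyBlockedLatencySketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Histogram<double> WhyBlockedLatency =
        Meter.CreateHistogram<double>(
            "stella_why_blocked_latency_seconds",
            unit: "s",
            description: "Block decision to explanation view");

    // Called when `stella explain block` or a UI/API equivalent is invoked.
    public static void RecordExplanationViewed(DateTimeOffset blockDecidedAt,
        string tenant, string surface, string resolutionType) =>
        WhyBlockedLatency.Record(
            (DateTimeOffset.UtcNow - blockDecidedAt).TotalSeconds,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("surface", surface),
            new KeyValuePair<string, object?>("resolution_type", resolutionType));
}
```
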
### Why This Matters

Short "why blocked" latency indicates:
- Clear block messaging
- Discoverable explanation tools
- Good explainability UX

Long latency may indicate:
- Users confused about where to find answers
- Documentation gaps
- UX friction

### Dashboard Usage

The Grafana dashboard shows:
- Histogram heatmap of latency distribution
- Trend line over time
- Breakdown by surface (CLI vs UI vs API)

### Alert Response

**Warning (P90 > 5 minutes):**
1. Review block notification messaging
2. Check CLI command discoverability
3. Verify UI links are prominent

**Critical (P90 > 1 hour):**
1. Investigate user flows
2. Add proactive notifications
3. Review documentation and help text

---

## Metric 3: Support Minutes per Customer

**Name:** `stella_support_burden_minutes_total`
**Type:** Counter

### Definition

Accumulated support time per customer per month. This is a manual/semi-automated metric for solo operations tracking.

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `category` | `install`, `config`, `policy`, `integration`, `bug`, `other` | Support category |
| `month` | YYYY-MM | Month of support |

### Collection

Log support interactions using:

```bash
stella ops support log --tenant <id> --minutes <n> --category <cat>
```

Or via API:

```http
POST /v1/ops/support/log
{
  "tenant": "acme-corp",
  "minutes": 15,
  "category": "config"
}
```

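Behind either entry point, the counter increment itself is simple. A hedged sketch (the handler wiring, meter name, and method names are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch of what the CLI/API entry points above might call.
public static class SupportBurdenSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> SupportMinutes =
        Meter.CreateCounter<long>(
            "stella_support_burden_minutes_total",
            unit: "min",
            description: "Accumulated support time per customer");

    public static void LogSupport(string tenant, long minutes, string category) =>
        SupportMinutes.Add(minutes,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("category", category),
            // `month` label per the table above (YYYY-MM).
            new KeyValuePair<string, object?>("month",
                DateTime.UtcNow.ToString("yyyy-MM")));
}
```
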
### Why This Matters

This metric tracks operational scalability. For solo-scaled operations:
- Support burden should trend toward zero
- High support minutes indicate product gaps
- Categories identify areas needing improvement

### Dashboard Usage

The Grafana dashboard shows:
- Stacked bar chart by category
- Monthly trend per tenant
- Total support burden

### Alert Response

**Warning (> 30 min/month per tenant):**
1. Review support interactions for patterns
2. Identify documentation gaps
3. Create runbooks for common issues

**Critical (> 60 min/month per tenant):**
1. Escalate to product for feature work
2. Consider dedicated support time
3. Prioritize automation

---

## Metric 4: Determinism Regressions

**Name:** `stella_determinism_regressions_total`
**Type:** Counter

### Definition

Count of detected determinism failures in production (same inputs produced different outputs).

### Labels

| Label | Values | Description |
|-------|--------|-------------|
| `tenant` | (varies) | Tenant identifier |
| `component` | `scanner`, `policy`, `attestor`, `export` | Component with regression |
| `severity` | `bitwise`, `semantic`, `policy` | Fidelity tier of regression |

### Severity Tiers

| Tier | Description | Impact |
|------|-------------|--------|
| `bitwise` | Byte-for-byte output differs (semantically identical) | Low - cosmetic |
| `semantic` | Output semantically differs | Medium - potential confusion |
| `policy` | Policy decision differs | **Critical** - audit risk |

### Collection Points

1. **Scheduled verification jobs** - Regular determinism checks
2. **Replay verification failures** - User-initiated replays
3. **CI golden test failures** - Development-time detection

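A hedged sketch of the counter increment shared by all three collection points (the wrapper class and method names are illustrative assumptions; label values mirror the tables above):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Illustrative sketch; component and severity values mirror the label
// tables above.
public static class DeterminismRegressionSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    private static readonly Counter<long> Regressions =
        Meter.CreateCounter<long>(
            "stella_determinism_regressions_total",
            description: "Determinism failures detected in production");

    // Called from scheduled verification jobs, replay checks, and CI golden tests.
    public static void Record(string tenant, string component, string severity) =>
        Regressions.Add(1,
            new KeyValuePair<string, object?>("tenant", tenant),
            new KeyValuePair<string, object?>("component", component),
            new KeyValuePair<string, object?>("severity", severity));
}
```
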
### Why This Matters

Determinism is a core moat. Regressions indicate:
- Non-deterministic code introduced
- External dependency changes
- Time-sensitive logic bugs

**Policy-level regressions are audit-breaking** and must be fixed immediately.

### Dashboard Usage

The Grafana dashboard shows:
- Counter with severity breakdown
- Alert status indicator
- Historical trend

### Alert Response

**Warning (any bitwise/semantic):**
1. Review recent deployments
2. Check for dependency updates
3. Investigate affected component

**Critical (any policy):**
1. **Immediate investigation required**
2. Consider rollback
3. Review all recent policy decisions
4. Notify affected customers

---

## Dashboard Access

The P0 metrics dashboard is available at:

```
/grafana/d/stella-p0-metrics
```

Or via the CLI:
```bash
stella ops dashboard p0
```

### Dashboard Features

- **Tenant selector** - Filter by specific tenant
- **Time range** - Adjust analysis window
- **SLO indicators** - Green/yellow/red status
- **Drill-down links** - Navigate to detailed views

---

## Alerting Configuration

Alerts are configured in `devops/telemetry/alerts/stella-p0-alerts.yml`.

### Alert Channels

Configure alert destinations in Grafana:
- Slack/Teams for warnings
- PagerDuty for critical alerts
- Email for summaries

### Silencing Alerts

During maintenance windows:
```bash
stella ops alerts silence --duration 2h --reason "Planned maintenance"
```

---

## Implementation Notes

### Source Files

| Component | Location |
|-----------|----------|
| Metric definitions | `src/Telemetry/StellaOps.Telemetry.Core/P0ProductMetrics.cs` |
| Install timestamp | `src/Telemetry/StellaOps.Telemetry.Core/InstallTimestampService.cs` |
| Dashboard template | `devops/telemetry/grafana/dashboards/stella-ops-p0-metrics.json` |
| Alert rules | `devops/telemetry/alerts/stella-p0-alerts.yml` |

### Adding Custom Metrics

To add additional P0-level metrics:

1. Define in `P0ProductMetrics.cs` (a sketch follows this list)
2. Add collection points in relevant services
3. Create dashboard panel in Grafana JSON
4. Add alert rules
5. Update this documentation

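A hedged template for step 1; the meter and member names are illustrative assumptions, so follow whatever conventions `P0ProductMetrics.cs` actually uses:

```csharp
using System.Diagnostics.Metrics;

// Illustrative template, not the actual file contents.
public static class CustomMetricSketch
{
    private static readonly Meter Meter = new("StellaOps.Telemetry");

    // Step 1: define the instrument with a stella_-prefixed name.
    public static readonly Counter<long> ExampleMetric =
        Meter.CreateCounter<long>(
            "stella_example_events_total",
            description: "Describe what this counts");

    // Step 2: call ExampleMetric.Add(1, tags) at the relevant collection points.
}
```
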
---

## Related

- [Observability Guide](observability.md)
- [Alerting Configuration](alerting.md)
- [Runbook: Metric Collection Issues](../../operations/runbooks/telemetry-metrics-ops.md)

---

_Last updated: 2026-01-17 (UTC)_