notify doctors work, audit work, new product advisory sprints
docs/doctor/README.md

# Stella Ops Doctor

> Self-service diagnostics for Stella Ops deployments

## Overview

The Doctor system provides comprehensive diagnostics for Stella Ops deployments, enabling operators, DevOps engineers, and developers to:

- **Diagnose** what is working and what is not
- **Understand** why failures occur with collected evidence
- **Remediate** issues with copy/paste commands
- **Verify** fixes with re-runnable checks

## Quick Start

### CLI

```bash
# Quick health check
stella doctor

# Full diagnostic with all checks
stella doctor --full

# Check a specific category
stella doctor --category database

# Export report for support
stella doctor export --output diagnostic-bundle.zip
```

### UI

Navigate to `/ops/doctor` in the Stella Ops console to access the interactive Doctor Dashboard.

### API

```bash
# Run diagnostics
POST /api/v1/doctor/run

# Get available checks
GET /api/v1/doctor/checks

# Stream results
WebSocket /api/v1/doctor/stream
```

## Available Checks

The Doctor system includes 48+ diagnostic checks across 7 plugins:

| Plugin | Category | Checks | Description |
|--------|----------|--------|-------------|
| `stellaops.doctor.core` | Core | 9 | Configuration, runtime, disk, memory, time, crypto |
| `stellaops.doctor.database` | Database | 8 | Connectivity, migrations, schema, connection pool |
| `stellaops.doctor.servicegraph` | ServiceGraph | 6 | Gateway, routing, service health |
| `stellaops.doctor.security` | Security | 9 | OIDC, LDAP, TLS, Vault |
| `stellaops.doctor.scm.*` | Integration.SCM | 8 | GitHub, GitLab connectivity/auth/permissions |
| `stellaops.doctor.registry.*` | Integration.Registry | 6 | Harbor, ECR connectivity/auth/pull |
| `stellaops.doctor.observability` | Observability | 4 | OTLP, logs, metrics |

### Check ID Convention

```
check.{category}.{subcategory}.{specific}
```

Examples (the number of segments varies with the check's scope):
- `check.config.required`
- `check.database.migrations.pending`
- `check.services.gateway.routing`
- `check.integration.scm.github.auth`
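
When scripting around check IDs (for example, grouping results by category), a small validator can catch typos before they reach `--check`. The sketch below is an assumption-laden illustration: the pattern is inferred from the example IDs above, not taken from the Doctor source, and the function names `is_valid_check_id` and `check_category` are hypothetical.

```bash
# Sketch only: the accepted shape is inferred from the documented examples.

# Return 0 if the argument looks like a check ID:
# "check." followed by two or more dot-separated lowercase segments.
is_valid_check_id() {
  printf '%s\n' "$1" | grep -Eq '^check(\.[a-z0-9_-]+){2,}$'
}

# Print the category, i.e. the first segment after "check.".
# Returns non-zero if the ID does not match the expected shape.
check_category() {
  is_valid_check_id "$1" || return 1
  _rest="${1#check.}"
  printf '%s\n' "${_rest%%.*}"
}
```

For instance, `check_category check.database.migrations.pending` prints `database`, which could then be passed to `stella doctor --category`.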

## CLI Reference

See [CLI Reference](./cli-reference.md) for complete command documentation.

### Common Commands

```bash
# Quick health check (checks tagged 'quick' only)
stella doctor --quick

# Full diagnostic with all checks
stella doctor --full

# Filter by category
stella doctor --category database
stella doctor --category security

# Filter by plugin
stella doctor --plugin scm.github

# Run a single check
stella doctor --check check.database.migrations.pending

# Output formats
stella doctor --format json
stella doctor --format markdown
stella doctor --format text

# Filter output by severity
stella doctor --severity fail,warn

# Export diagnostic bundle
stella doctor export --output diagnostic.zip
stella doctor export --include-logs --log-duration 4h
```

## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more warnings |
| 2 | One or more failures |
| 3 | Doctor engine error |
| 4 | Invalid arguments |
| 5 | Timeout exceeded |
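
The exit codes above make `stella doctor` usable as a gate in scripts and CI jobs. The following is a minimal sketch of how that might look; the function names and the policy of tolerating warnings are assumptions for illustration, not part of the Doctor CLI.

```bash
# Map a `stella doctor` exit code to the meaning listed in the table above.
doctor_exit_meaning() {
  case "$1" in
    0) echo "All checks passed" ;;
    1) echo "One or more warnings" ;;
    2) echo "One or more failures" ;;
    3) echo "Doctor engine error" ;;
    4) echo "Invalid arguments" ;;
    5) echo "Timeout exceeded" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}

# Hypothetical policy: let a pipeline continue on success or warnings
# (codes 0 and 1) and stop on anything else.
doctor_gate() {
  case "$1" in
    0|1) return 0 ;;
    *)   return 1 ;;
  esac
}
```

A pipeline step could run `stella doctor --quick`, capture `$?`, log `doctor_exit_meaning "$code"`, and abort when `doctor_gate "$code"` fails. Teams that treat warnings as blocking would drop `1` from the first branch of `doctor_gate`.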

## Output Example

```
Stella Ops Doctor
=================

Running 47 checks across 8 plugins...

[PASS] check.config.required
  All required configuration values are present

[PASS] check.database.connectivity
  PostgreSQL connection successful (latency: 12ms)

[WARN] check.tls.certificates.expiry
  Diagnosis: TLS certificate expires in 14 days

  Evidence:
    Certificate: /etc/ssl/certs/stellaops.crt
    Subject: CN=stellaops.example.com
    Expires: 2026-01-26T00:00:00Z
    Days remaining: 14

  Likely Causes:
    1. Certificate renewal not scheduled
    2. ACME/Let's Encrypt automation not configured

  Fix Steps:
    # 1. Check current certificate
    openssl x509 -in /etc/ssl/certs/stellaops.crt -noout -dates

    # 2. Renew certificate (if using certbot)
    sudo certbot renew --cert-name stellaops.example.com

    # 3. Restart services to pick up new certificate
    sudo systemctl restart stellaops-gateway

  Verification:
    stella doctor --check check.tls.certificates.expiry

[FAIL] check.database.migrations.pending
  Diagnosis: 3 pending release migrations detected in schema 'auth'

  Evidence:
    Schema: auth
    Current version: 099_add_dpop_thumbprints
    Pending migrations:
      - 100_add_tenant_quotas
      - 101_add_audit_retention
      - 102_add_session_revocation

  Likely Causes:
    1. Release migrations not applied before deployment
    2. Migration files added after last deployment

  Fix Steps:
    # 1. Backup database first (RECOMMENDED)
    pg_dump -h localhost -U stella_admin -d stellaops -F c \
      -f stellaops_backup_$(date +%Y%m%d_%H%M%S).dump

    # 2. Apply pending release migrations
    stella system migrations-run --module Authority --category release

    # 3. Verify migrations applied
    stella system migrations-status --module Authority

  Verification:
    stella doctor --check check.database.migrations.pending

--------------------------------------------------------------------------------
Summary: 44 passed, 2 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------
```
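
For a quick triage of a saved text report, the status markers in the sample above can be filtered with standard tools. This is a rough sketch that assumes every result starts a line with `[STATUS] <check id>`, exactly as in the sample; the function name `doctor_problem_checks` is hypothetical. For real automation, `--format json` is probably the better input, but its schema is not shown here.

```bash
# Read Doctor text output on stdin and print "<STATUS> <check id>" for every
# check reported as WARN or FAIL. Lines that do not start with one of those
# markers (evidence, fix steps, summary) are ignored.
doctor_problem_checks() {
  sed -n -E 's/^\[(WARN|FAIL)\] +([^ ]+).*$/\1 \2/p'
}
```

Piping the sample output through `doctor_problem_checks` would list `check.tls.certificates.expiry` and `check.database.migrations.pending`, each of which can be re-run individually with `stella doctor --check`.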

## Export Bundle

The Doctor export feature creates a diagnostic bundle for support escalation:

```bash
stella doctor export --output diagnostic-bundle.zip
```

The bundle contains:
- `doctor-report.json` - Full diagnostic report
- `doctor-report.md` - Human-readable report
- `environment.json` - Environment information
- `system-info.json` - System details (OS, runtime, memory)
- `config-sanitized.json` - Sanitized configuration (secrets redacted)
- `logs/` - Recent log files (optional)
- `README.md` - Bundle contents guide
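
Before sending a bundle to support, it can be worth confirming that the documented files are actually present. The sketch below checks a file listing against the list above. It assumes the files sit at the archive root, which is not confirmed here, and the names `DOCTOR_BUNDLE_FILES` and `doctor_bundle_missing` are hypothetical.

```bash
# Entries the bundle is documented to contain. "logs/" is optional and
# therefore not checked.
DOCTOR_BUNDLE_FILES="doctor-report.json doctor-report.md environment.json system-info.json config-sanitized.json README.md"

# Read a file listing (one path per line) on stdin and print every expected
# entry that is missing. Returns non-zero if anything is missing.
doctor_bundle_missing() {
  _listing="$(cat)"
  _missing=0
  for _f in $DOCTOR_BUNDLE_FILES; do
    if ! printf '%s\n' "$_listing" | grep -Fxq "$_f"; then
      printf '%s\n' "$_f"
      _missing=1
    fi
  done
  return "$_missing"
}
```

With Info-ZIP installed, a listing can be produced with `unzip -Z1 diagnostic-bundle.zip | doctor_bundle_missing`. A bundle created with `--no-config` would presumably report `config-sanitized.json` as missing, which is expected in that case.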

### Export Options

```bash
# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h

# Exclude configuration
stella doctor export --no-config

# Custom output path
stella doctor export --output /tmp/support-bundle.zip
```

## Security

### Secret Redaction

All evidence output is sanitized. Sensitive values (passwords, tokens, connection strings) are replaced with `***REDACTED***` in:
- Console output
- JSON exports
- Diagnostic bundles
- Log files
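
Doctor performs this redaction itself. If you attach extra material to a support ticket alongside the bundle (shell history, ad hoc config dumps), a similar pass can be approximated with `sed`. The sketch below is not Doctor's implementation: the key names and the `key=value` / `key: value` shapes it matches are assumptions, matching is case-sensitive, and the function name `redact_secrets` is hypothetical.

```bash
# Replace the value of sensitive-looking "key=value" or "key: value" pairs
# with the ***REDACTED*** marker. Only lowercase keys ending in password,
# token, secret, or connection_string are matched; values end at the first
# whitespace character.
redact_secrets() {
  sed -E 's/((password|token|secret|connection_string)[[:space:]]*[=:][[:space:]]*)[^[:space:]]+/\1***REDACTED***/g'
}
```

For example, `redact_secrets < notes.txt > notes-redacted.txt` rewrites `password=hunter2` as `password=***REDACTED***`. A pattern this simple will miss differently cased keys and secrets embedded in URLs, so review the result before sharing it.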

### RBAC Permissions

| Scope | Description |
|-------|-------------|
| `doctor:run` | Execute doctor checks |
| `doctor:run:full` | Execute all checks, including sensitive ones |
| `doctor:export` | Export diagnostic reports |
| `admin:system` | Access system-level checks |

## Plugin Development

To create a custom Doctor plugin, implement `IDoctorPlugin`:

```csharp
public class MyCustomPlugin : IDoctorPlugin
{
    public string PluginId => "stellaops.doctor.custom";
    public string DisplayName => "Custom Checks";
    public Version Version => new(1, 0, 0);
    public DoctorCategory Category => DoctorCategory.Integration;

    public bool IsAvailable(IServiceProvider services) => true;

    public IReadOnlyList<IDoctorCheck> GetChecks(DoctorPluginContext context)
    {
        return new IDoctorCheck[]
        {
            new MyCustomCheck()
        };
    }

    public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
        => Task.CompletedTask;
}
```

Implement checks using `IDoctorCheck`:

```csharp
public class MyCustomCheck : IDoctorCheck
{
    public string CheckId => "check.custom.mycheck";
    public string Name => "My Custom Check";
    public string Description => "Validates custom configuration";
    public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
    public IReadOnlyList<string> Tags => new[] { "custom", "quick" };
    public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(2);

    public bool CanRun(DoctorPluginContext context) => true;

    public async Task<DoctorCheckResult> RunAsync(
        DoctorPluginContext context,
        CancellationToken ct)
    {
        // Perform check logic
        var isValid = await ValidateAsync(ct);

        if (isValid)
        {
            return DoctorCheckResult.Pass(
                checkId: CheckId,
                diagnosis: "Custom configuration is valid",
                evidence: new Evidence
                {
                    Description = "Validation passed",
                    Data = new Dictionary<string, string>
                    {
                        ["validated_at"] = context.TimeProvider.GetUtcNow().ToString("O")
                    }
                });
        }

        return DoctorCheckResult.Fail(
            checkId: CheckId,
            diagnosis: "Custom configuration is invalid",
            evidence: new Evidence
            {
                Description = "Validation failed",
                Data = new Dictionary<string, string>
                {
                    ["error"] = "Configuration file missing"
                }
            },
            remediation: new Remediation
            {
                Steps = new[]
                {
                    new RemediationStep
                    {
                        Order = 1,
                        Description = "Create configuration file",
                        Command = "cp /etc/stellaops/custom.yaml.sample /etc/stellaops/custom.yaml",
                        CommandType = CommandType.Shell
                    }
                }
            });
    }

    // Placeholder for the real validation logic (not part of IDoctorCheck);
    // here it only tests that the configuration file exists.
    private static Task<bool> ValidateAsync(CancellationToken ct)
        => Task.FromResult(System.IO.File.Exists("/etc/stellaops/custom.yaml"));
}
```

Register the plugin in DI:

```csharp
services.AddSingleton<IDoctorPlugin, MyCustomPlugin>();
```

## Architecture

```
+------------------+       +------------------+       +------------------+
|       CLI        |       |        UI        |       |     External     |
|  stella doctor   |       |   /ops/doctor    |       |    Monitoring    |
+--------+---------+       +--------+---------+       +--------+---------+
         |                          |                          |
         v                          v                          v
+------------------------------------------------------------------------+
|                            Doctor API Layer                            |
|  POST /api/v1/doctor/run             GET /api/v1/doctor/checks         |
|  GET /api/v1/doctor/report           WebSocket /api/v1/doctor/stream   |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                          Doctor Engine (Core)                          |
|  +------------------+    +------------------+    +------------------+  |
|  | Check Registry   |    | Check Executor   |    | Report Generator |  |
|  | - Discovery      |    | - Parallel exec  |    | - JSON/MD/Text   |  |
|  | - Filtering      |    | - Timeout mgmt   |    | - Remediation    |  |
|  +------------------+    +------------------+    +------------------+  |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                             Plugin System                              |
+---+----------+---------+---------+---------+---------+-----------+-----+
    |          |         |         |         |         |           |
    v          v         v         v         v         v           v
+--------+ +-------+ +-------+ +-------+ +-------+ +-------+  +----------+
| Core   | | DB &  | |Service| |  SCM  | |Regis- | |Observ-|  | Security |
| Plugin | |Migra- | | Graph | |Plugin | |  try  | |ability|  |  Plugin  |
|        | | tions | |Plugin | |       | |Plugin | |Plugin |  |          |
+--------+ +-------+ +-------+ +-------+ +-------+ +-------+  +----------+
```

## Related Documentation

- [CLI Reference](./cli-reference.md) - Complete CLI command reference
- [Doctor Capabilities Specification](./doctor-capabilities.md) - Full technical specification
- [Plugin Development Guide](./plugin-development.md) - Creating custom plugins

## Troubleshooting

### Doctor Engine Error (Exit Code 3)

If `stella doctor` returns exit code 3:

1. Check the error message for details
2. Verify required services are running
3. Check connectivity to databases
4. Review logs at `/var/log/stellaops/doctor.log`

### Timeout Exceeded (Exit Code 5)

If checks are timing out:

```bash
# Increase per-check timeout
stella doctor --timeout 60s

# Run with reduced parallelism
stella doctor --parallel 2
```
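
In unattended runs, the two adjustments above can be applied automatically when the first attempt times out. The wrapper below is a sketch under assumptions: that `--timeout` and `--parallel` can be combined with each other and with other filters, and that one retry is enough. The function name `doctor_retry_on_timeout` is hypothetical.

```bash
# Run `stella doctor` with the given arguments. If it exits with code 5
# (timeout exceeded), retry once with a longer per-check timeout and reduced
# parallelism, then return the exit code of the last attempt.
doctor_retry_on_timeout() {
  _code=0
  stella doctor "$@" || _code=$?
  if [ "$_code" -eq 5 ]; then
    echo "Doctor timed out; retrying with --timeout 60s --parallel 2" >&2
    _code=0
    stella doctor "$@" --timeout 60s --parallel 2 || _code=$?
  fi
  return "$_code"
}
```

For example, `doctor_retry_on_timeout --category database` runs the database checks and only falls back to the slower settings when needed. Any other non-zero exit code is passed through unchanged.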

### Checks Not Found

If expected checks are not appearing:

1. Verify the plugin is registered in DI
2. Check that `CanRun()` returns `true` for your environment
3. Review plugin initialization logs