# Stella Ops Doctor

> Self-service diagnostics for Stella Ops deployments

## Overview

The Doctor system provides comprehensive diagnostics for Stella Ops deployments, enabling operators, DevOps engineers, and developers to:

- **Diagnose** what is working and what is not
- **Understand** why failures occur with collected evidence
- **Remediate** issues with copy/paste commands
- **Verify** fixes with re-runnable checks

## Quick Start

### CLI

```bash
# Quick health check
stella doctor

# Full diagnostic with all checks
stella doctor --full

# Check specific category
stella doctor --category database

# Export report for support
stella doctor export --output diagnostic-bundle.zip

# Apply safe fixes from a report (dry-run by default; --apply executes them)
stella doctor fix --from doctor-report.json --apply
```

### UI

Navigate to `/ops/doctor` in the Stella Ops console to access the interactive Doctor Dashboard.
Fix actions are exposed in the UI and mirror CLI commands; destructive steps are never executed by Doctor.

### API

```text
# Run diagnostics
POST /api/v1/doctor/run

# Get available checks
GET /api/v1/doctor/checks

# Stream results
WebSocket /api/v1/doctor/stream
```

## Available Checks

The Doctor system includes 48+ diagnostic checks across 7 plugins:

| Plugin | Category | Checks | Description |
|--------|----------|--------|-------------|
| `stellaops.doctor.core` | Core | 9 | Configuration, runtime, disk, memory, time, crypto |
| `stellaops.doctor.database` | Database | 8 | Connectivity, migrations, schema, connection pool |
| `stellaops.doctor.servicegraph` | ServiceGraph | 6 | Gateway, routing, service health |
| `stellaops.doctor.security` | Security | 9 | OIDC, LDAP, TLS, Vault |
| `stellaops.doctor.scm.*` | Integration.SCM | 8 | GitHub, GitLab connectivity/auth/permissions |
| `stellaops.doctor.registry.*` | Integration.Registry | 6 | Harbor, ECR connectivity/auth/pull |
| `stellaops.doctor.observability` | Observability | 4 | OTLP, logs, metrics |

### Check ID Convention

```
check.{category}.{subcategory}.{specific}
```

The number of segments after `check.` varies: `check.config.required` has no subcategory, while `check.integration.scm.github.auth` has more than one.

Examples:
- `check.config.required`
- `check.database.migrations.pending`
- `check.services.gateway.routing`
- `check.integration.scm.github.auth`

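
When scripting around check IDs (for example, grouping results by category), it can help to validate and split them. The following is a minimal sketch, assuming only what the examples above show: an ID starts with `check.` and has at least two further dot-separated segments. The function names `is_check_id` and `check_category` are hypothetical helpers, not part of the `stella` CLI.

```bash
# Sketch only: helper names are hypothetical, not part of the stella CLI.

# Succeeds when the argument looks like check.<category>.<...>
# (assumed shape: lowercase segments, at least two after "check").
is_check_id() {
  printf '%s\n' "$1" | grep -Eq '^check(\.[a-z0-9_-]+){2,}$'
}

# Prints the category, i.e. the segment right after the "check." prefix.
check_category() {
  is_check_id "$1" || return 1
  local rest="${1#check.}"      # drop the "check." prefix
  printf '%s\n' "${rest%%.*}"   # keep everything before the next dot
}
```

For instance, `check_category check.database.migrations.pending` yields the category segment `database`, which matches the value accepted by `stella doctor --category` in the commands shown in this document.
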
## CLI Reference

See [CLI Reference](./cli-reference.md) for complete command documentation.

### Common Commands

```bash
# Quick health check (tagged 'quick' checks only)
stella doctor --quick

# Full diagnostic with all checks
stella doctor --full

# Filter by category
stella doctor --category database
stella doctor --category security

# Filter by plugin
stella doctor --plugin scm.github

# Run single check
stella doctor --check check.database.migrations.pending

# Output formats
stella doctor --format json
stella doctor --format markdown
stella doctor --format text

# Filter output by severity
stella doctor --severity fail,warn

# Export diagnostic bundle
stella doctor export --output diagnostic.zip
stella doctor export --include-logs --log-duration 4h
```

## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more warnings |
| 2 | One or more failures |
| 3 | Doctor engine error |
| 4 | Invalid arguments |
| 5 | Timeout exceeded |

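
The exit codes make `stella doctor` usable as a gate in scripts and CI jobs. The sketch below maps each code to the meaning in the table above and applies one possible policy: tolerate warnings, block on everything else. The function names and the policy itself are assumptions for illustration; adjust them to your own pipeline.

```bash
# Sketch only: function names and the gate policy are hypothetical.

# Prints the documented meaning of a `stella doctor` exit code.
doctor_exit_meaning() {
  case "$1" in
    0) echo "All checks passed" ;;
    1) echo "One or more warnings" ;;
    2) echo "One or more failures" ;;
    3) echo "Doctor engine error" ;;
    4) echo "Invalid arguments" ;;
    5) echo "Timeout exceeded" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}

# Assumed policy: codes 0 and 1 let the pipeline continue, anything else stops it.
doctor_gate() {
  case "$1" in
    0|1) return 0 ;;
    *)   return 1 ;;
  esac
}

# Usage (not executed here):
#   code=0
#   stella doctor --quick || code=$?
#   echo "Doctor: $(doctor_exit_meaning "$code")"
#   doctor_gate "$code" || exit "$code"
```

Capturing the code with `|| code=$?` keeps the script working under `set -e`, where a non-zero exit would otherwise abort before the code can be inspected.
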
## Output Example

```
Stella Ops Doctor
=================

Running 47 checks across 8 plugins...

[PASS] check.config.required
       All required configuration values are present

[PASS] check.database.connectivity
       PostgreSQL connection successful (latency: 12ms)

[WARN] check.tls.certificates.expiry
       Diagnosis: TLS certificate expires in 14 days

       Evidence:
         Certificate: /etc/ssl/certs/stellaops.crt
         Subject: CN=stellaops.example.com
         Expires: 2026-01-26T00:00:00Z
         Days remaining: 14

       Likely Causes:
         1. Certificate renewal not scheduled
         2. ACME/Let's Encrypt automation not configured

       Fix Steps:
         # 1. Check current certificate
         openssl x509 -in /etc/ssl/certs/stellaops.crt -noout -dates

         # 2. Renew certificate (if using certbot)
         sudo certbot renew --cert-name stellaops.example.com

         # 3. Restart services to pick up new certificate
         sudo systemctl restart stellaops-gateway

       Verification:
         stella doctor --check check.tls.certificates.expiry

[FAIL] check.database.migrations.pending
       Diagnosis: 3 pending release migrations detected in schema 'auth'

       Evidence:
         Schema: auth
         Current version: 099_add_dpop_thumbprints
         Pending migrations:
           - 100_add_tenant_quotas
           - 101_add_audit_retention
           - 102_add_session_revocation

       Likely Causes:
         1. Release migrations not applied before deployment
         2. Migration files added after last deployment

       Fix Steps:
         # 1. Backup database first (RECOMMENDED)
         pg_dump -h localhost -U stella_admin -d stellaops -F c \
           -f stellaops_backup_$(date +%Y%m%d_%H%M%S).dump

         # 2. Apply pending release migrations
         stella system migrations-run --module Authority --category release

         # 3. Verify migrations applied
         stella system migrations-status --module Authority

       Verification:
         stella doctor --check check.database.migrations.pending

--------------------------------------------------------------------------------
Summary: 44 passed, 2 warnings, 1 failed (47 total)
Duration: 8.3s
--------------------------------------------------------------------------------
```

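
Because each result in the text output starts with a status tag followed by the check ID, failing checks can be pulled out and re-run individually after a fix. This is a minimal sketch that assumes the text format looks like the example above, with `[FAIL]` at the start of the line; `failed_checks` is a hypothetical helper name. For anything beyond ad-hoc use, parsing `--format json` is likely more robust, though its schema is not described here.

```bash
# Sketch only: assumes "[FAIL] <check-id>" at the start of a line,
# as in the sample text output. The helper name is hypothetical.

# Reads Doctor text output on stdin and prints one failing check ID per line.
failed_checks() {
  sed -n 's/^\[FAIL\] *//p'
}

# Usage (not executed here): re-run only the checks that failed.
#   stella doctor --format text | failed_checks | while read -r id; do
#     stella doctor --check "$id"
#   done
```
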
## Export Bundle

The Doctor export feature creates a diagnostic bundle for support escalation:

```bash
stella doctor export --output diagnostic-bundle.zip
```

The bundle contains:
- `doctor-report.json` - Full diagnostic report
- `doctor-report.md` - Human-readable report
- `environment.json` - Environment information
- `system-info.json` - System details (OS, runtime, memory)
- `config-sanitized.json` - Sanitized configuration (secrets redacted)
- `logs/` - Recent log files (optional)
- `README.md` - Bundle contents guide

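
Before sending a bundle to support, it can be worth confirming that the expected files are present. The sketch below reads a file listing on stdin and reports which of the entries listed above are missing. It assumes the files sit at the root of the archive, which this document does not state; `logs/` is skipped because it is optional. `missing_bundle_entries` is a hypothetical helper name.

```bash
# Sketch only: helper name is hypothetical; assumes entries live at the
# archive root. logs/ is not checked because it is optional.

# Reads a listing (one entry per line) on stdin and prints every expected
# bundle entry that is absent. Prints nothing when the bundle is complete.
missing_bundle_entries() {
  local listing entry
  listing="$(cat)"
  for entry in doctor-report.json doctor-report.md environment.json \
               system-info.json config-sanitized.json README.md; do
    printf '%s\n' "$listing" | grep -Fxq "$entry" || printf '%s\n' "$entry"
  done
}

# Usage (not executed here); `unzip -Z1` lists archive entries one per line:
#   unzip -Z1 diagnostic-bundle.zip | missing_bundle_entries
```

A bundle created with `--no-config` would be expected to report `config-sanitized.json` as missing, so treat that entry as informational in that case.
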
### Export Options

```bash
# Include logs from last 4 hours
stella doctor export --include-logs --log-duration 4h

# Exclude configuration
stella doctor export --no-config

# Custom output path
stella doctor export --output /tmp/support-bundle.zip
```

## Security

### Secret Redaction

All evidence output is sanitized. Sensitive values (passwords, tokens, connection strings) are replaced with `***REDACTED***` in:
- Console output
- JSON exports
- Diagnostic bundles
- Log files

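
To illustrate the effect of redaction, the sketch below masks values in `key=value` and `key: value` pairs whose key contains `password`, `token`, or `secret`. This is not Doctor's actual redaction logic, which is not described in this document; the key patterns and the helper name `redact_secrets` are assumptions. A filter like this can serve as an extra pass over files collected outside of Doctor before they are shared.

```bash
# Sketch only: NOT Doctor's implementation. Key patterns and helper name
# are assumptions chosen for illustration.

# Reads text on stdin and replaces the value of any pair whose key contains
# password, token, or secret with ***REDACTED***. Values end at whitespace,
# ';' or ',' so connection-string style input is handled per field.
redact_secrets() {
  sed -E 's/([A-Za-z0-9_.-]*([Pp]assword|[Tt]oken|[Ss]ecret)[A-Za-z0-9_.-]*[[:space:]]*[=:][[:space:]]*)[^[:space:];,]+/\1***REDACTED***/g'
}

# Usage (not executed here):
#   redact_secrets < app.env > app.env.sanitized
```
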
### RBAC Permissions

| Scope | Description |
|-------|-------------|
| `doctor:run` | Execute doctor checks |
| `doctor:run:full` | Execute all checks including sensitive |
| `doctor:export` | Export diagnostic reports |
| `admin:system` | Access system-level checks |

## Plugin Development

To create a custom Doctor plugin, implement `IDoctorPlugin`:

```csharp
public class MyCustomPlugin : IDoctorPlugin
{
    public string PluginId => "stellaops.doctor.custom";
    public string DisplayName => "Custom Checks";
    public Version Version => new(1, 0, 0);
    public DoctorCategory Category => DoctorCategory.Integration;

    public bool IsAvailable(IServiceProvider services) => true;

    public IReadOnlyList<IDoctorCheck> GetChecks(DoctorPluginContext context)
    {
        return new IDoctorCheck[]
        {
            new MyCustomCheck()
        };
    }

    public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
        => Task.CompletedTask;
}
```

Implement checks using `IDoctorCheck`:

```csharp
public class MyCustomCheck : IDoctorCheck
{
    public string CheckId => "check.custom.mycheck";
    public string Name => "My Custom Check";
    public string Description => "Validates custom configuration";
    public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
    public IReadOnlyList<string> Tags => new[] { "custom", "quick" };
    public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(2);

    public bool CanRun(DoctorPluginContext context) => true;

    public async Task<DoctorCheckResult> RunAsync(
        DoctorPluginContext context,
        CancellationToken ct)
    {
        // Perform check logic
        var isValid = await ValidateAsync(ct);

        if (isValid)
        {
            return DoctorCheckResult.Pass(
                checkId: CheckId,
                diagnosis: "Custom configuration is valid",
                evidence: new Evidence
                {
                    Description = "Validation passed",
                    Data = new Dictionary<string, string>
                    {
                        ["validated_at"] = context.TimeProvider.GetUtcNow().ToString("O")
                    }
                });
        }

        return DoctorCheckResult.Fail(
            checkId: CheckId,
            diagnosis: "Custom configuration is invalid",
            evidence: new Evidence
            {
                Description = "Validation failed",
                Data = new Dictionary<string, string>
                {
                    ["error"] = "Configuration file missing"
                }
            },
            remediation: new Remediation
            {
                Steps = new[]
                {
                    new RemediationStep
                    {
                        Order = 1,
                        Description = "Create configuration file",
                        Command = "cp /etc/stellaops/custom.yaml.sample /etc/stellaops/custom.yaml",
                        CommandType = CommandType.Shell
                    }
                }
            });
    }

    // Hypothetical placeholder so the example is complete: replace with your
    // own validation logic. Here it only checks that the file exists.
    private static Task<bool> ValidateAsync(CancellationToken ct)
        => Task.FromResult(System.IO.File.Exists("/etc/stellaops/custom.yaml"));
}
```

Register the plugin in DI:

```csharp
services.AddSingleton<IDoctorPlugin, MyCustomPlugin>();
```

## Architecture

```
+------------------+       +------------------+       +------------------+
|       CLI        |       |        UI        |       |     External     |
|  stella doctor   |       |   /ops/doctor    |       |    Monitoring    |
+--------+---------+       +--------+---------+       +--------+---------+
         |                          |                          |
         v                          v                          v
+------------------------------------------------------------------------+
|                            Doctor API Layer                            |
|  POST /api/v1/doctor/run           GET /api/v1/doctor/checks           |
|  GET /api/v1/doctor/report         WebSocket /api/v1/doctor/stream     |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                          Doctor Engine (Core)                          |
|  +------------------+    +------------------+    +------------------+  |
|  | Check Registry   |    | Check Executor   |    | Report Generator |  |
|  | - Discovery      |    | - Parallel exec  |    | - JSON/MD/Text   |  |
|  | - Filtering      |    | - Timeout mgmt   |    | - Remediation    |  |
|  +------------------+    +------------------+    +------------------+  |
+------------------------------------------------------------------------+
                                    |
                                    v
+------------------------------------------------------------------------+
|                             Plugin System                              |
+---+---------+---------+---------+---------+---------+---------+--------+
    |         |         |         |         |         |         |
    v         v         v         v         v         v         v
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +--------+
| Core  | | DB &  | |Service| |  SCM  | |Regis- | |Observ-| |Security|
|Plugin | |Migra- | | Graph | |Plugin | |try    | |ability| | Plugin |
|       | |tions  | |Plugin | |       | |Plugin | |Plugin | |        |
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +--------+
```

## Related Documentation

- [CLI Reference](./cli-reference.md) - Complete CLI command reference
- [Doctor Capabilities Specification](./doctor-capabilities.md) - Full technical specification
- [Plugin Development Guide](./plugin-development.md) - Creating custom plugins

## Troubleshooting

### Doctor Engine Error (Exit Code 3)

If `stella doctor` returns exit code 3:

1. Check the error message for details
2. Verify required services are running
3. Check connectivity to databases
4. Review logs at `/var/log/stellaops/doctor.log`

### Timeout Exceeded (Exit Code 5)

If checks are timing out:

```bash
# Increase per-check timeout
stella doctor --timeout 60s

# Run with reduced parallelism
stella doctor --parallel 2
```

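
In unattended runs, the two options above can be combined into a single retry. The following is a sketch, assuming that one retry with `--timeout 60s --parallel 2` is an acceptable response to exit code 5; the wrapper name `doctor_with_timeout_retry` is hypothetical, and the values are the ones shown above rather than tuned recommendations.

```bash
# Sketch only: wrapper name is hypothetical; the retry policy is an assumption.

# Runs `stella doctor` with the given arguments. If the run exits with
# code 5 (timeout exceeded), retries once with a longer per-check timeout
# and reduced parallelism. Returns the exit code of the last run.
doctor_with_timeout_retry() {
  local code=0
  stella doctor "$@" || code=$?
  if [ "$code" -eq 5 ]; then
    echo "Doctor timed out; retrying with --timeout 60s --parallel 2" >&2
    code=0
    stella doctor "$@" --timeout 60s --parallel 2 || code=$?
  fi
  return "$code"
}

# Usage (not executed here):
#   doctor_with_timeout_retry --category database
```

Other exit codes pass through unchanged, so the wrapper can still be combined with exit-code based gating.
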
### Checks Not Found

If expected checks are not appearing:

1. Verify that the plugin is registered in DI
2. Check that `CanRun()` returns `true` for your environment
3. Review the plugin initialization logs