git.stella-ops.org/docs/modules/doctor/architecture.md
master 908619e739 feat(scheduler): plugin architecture + Doctor health check plugin
- Create ISchedulerJobPlugin abstraction with JobKind routing
- Add SchedulerPluginRegistry for plugin discovery and resolution
- Wrap existing scan logic as ScanJobPlugin (zero behavioral change)
- Extend Schedule model with JobKind (default "scan") and PluginConfig (jsonb)
- Add SQL migrations 007 (job_kind/plugin_config) and 008 (doctor_trends table)
- Implement DoctorJobPlugin replacing standalone doctor-scheduler service
- Add PostgresDoctorTrendRepository for persistent trend storage
- Register Doctor trend endpoints at /api/v1/scheduler/doctor/trends/*
- Seed 3 default Doctor schedules (daily full, hourly quick, weekly compliance)
- Comment out doctor-scheduler container in compose and services-matrix
- Update Doctor architecture docs and AGENTS.md with scheduling migration info

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:24:46 +03:00


# Doctor Architecture
> Module: Doctor
> Sprint: SPRINT_0127_001_0002_oci_registry_compatibility
Stella Doctor is a diagnostic framework for validating system health, configuration, and integration connectivity across the StellaOps platform.
## 1) Overview
Doctor provides a plugin-based diagnostic system that enables:
- **Health checks** for all platform components
- **Integration validation** for external systems (registries, SCM, CI, secrets)
- **Configuration verification** before deployment
- **Capability probing** for feature compatibility
- **Evidence collection** for troubleshooting and compliance
### Scheduler Integration (Sprint 20260408-003)
> **The standalone Doctor Scheduler service is deprecated.**
> Doctor health check scheduling is now handled by the Scheduler service's `DoctorJobPlugin`.
Doctor schedules are managed via the Scheduler API with `jobKind="doctor"` and plugin-specific
configuration in `pluginConfig`. Trend data is stored in `scheduler.doctor_trends` (PostgreSQL).
**Scheduler-hosted Doctor endpoints:**
- `GET /api/v1/scheduler/doctor/trends` -- aggregated trend summaries
- `GET /api/v1/scheduler/doctor/trends/checks/{checkId}` -- per-check trend data
- `GET /api/v1/scheduler/doctor/trends/categories/{category}` -- per-category trend data
- `GET /api/v1/scheduler/doctor/trends/degrading` -- checks with degrading health
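A client could compose these routes as sketched below. `DoctorTrendRoutes` is a hypothetical helper introduced only for illustration; the route templates are the only part taken from the list above.

```csharp
using System;

// Hypothetical helper (not part of StellaOps): builds the trend routes listed above.
public static class DoctorTrendRoutes
{
    private const string Base = "/api/v1/scheduler/doctor/trends";

    // Aggregated trend summaries.
    public static string Summary() => Base;

    // Per-check trend data; the id is escaped so it is safe as a path segment.
    public static string ForCheck(string checkId) =>
        $"{Base}/checks/{Uri.EscapeDataString(checkId)}";

    // Per-category trend data.
    public static string ForCategory(string category) =>
        $"{Base}/categories/{Uri.EscapeDataString(category)}";

    // Checks with degrading health.
    public static string Degrading() => $"{Base}/degrading";
}
```

Escaping the path segment keeps the helper safe if a check ID or category ever contains characters such as `/` or spaces.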
**Schedule management** uses the standard Scheduler API at `/api/v1/scheduler/schedules`
with `jobKind="doctor"` and `pluginConfig` containing Doctor-specific options (mode, categories, alerts).
Three default Doctor schedules are seeded by `SystemScheduleBootstrap`:
- `doctor-full-daily` (`0 4 * * *`, daily at 04:00) -- Full health check
- `doctor-quick-hourly` (`0 * * * *`, at the top of every hour) -- Quick health check
- `doctor-compliance-weekly` (`0 5 * * 0`, Sundays at 05:00) -- Compliance category audit
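A request body for creating a Doctor schedule might look like the sketch below. Only `jobKind`, `pluginConfig`, and the option names `mode` and `categories` come from this document; the `name` and `cronExpression` property names, the `"quick"` mode value, and the `DoctorScheduleSketch` helper are assumptions for illustration and may not match the actual Scheduler contract.

```csharp
using System.Text.Json;

// Hypothetical helper: serializes a schedule-creation body for a Doctor job.
public static class DoctorScheduleSketch
{
    public static string BuildCreateBody(string name, string cron, string mode, string[] categories)
    {
        var body = new
        {
            name,                    // assumed property name
            cronExpression = cron,   // assumed property name
            jobKind = "doctor",      // routes the schedule to DoctorJobPlugin
            pluginConfig = new       // Doctor-specific options
            {
                mode,                // option name from this document; values are assumed
                categories
            }
        };
        return JsonSerializer.Serialize(body);
    }
}
```

For example, `DoctorScheduleSketch.BuildCreateBody("doctor-quick-hourly", "0 * * * *", "quick", new[] { "health" })` yields a JSON document that could be sent with `POST /api/v1/scheduler/schedules`, assuming the API accepts this shape.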
The Doctor WebService (`src/Doctor/StellaOps.Doctor.WebService/`) remains the execution engine.
The plugin communicates with it via HTTP POST to `/api/v1/doctor/run`.
### AdvisoryAI Diagnosis Surface (run-003 remediation)
Doctor WebService now exposes a diagnosis endpoint for AdvisoryAI-backed health analysis:
- `POST /api/v1/doctor/diagnosis`
The endpoint accepts either:
- `runId` referencing a stored Doctor report, or
- an inline `DoctorRunResultResponse` payload.
Runtime wiring includes:
- `IDoctorContextAdapter` for deterministic context projection from Doctor reports
- `IDoctorAIDiagnosisService` (deterministic implementation) for assessment, root cause, correlation, and remediation projection
- schema enrichment through `IEvidenceSchemaRegistry.RegisterCommonSchemas()`
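A caller could request a diagnosis for a stored run roughly as follows. This is a sketch under assumptions: the exact request schema is not documented here, so a body of the form `{"runId": "..."}` is a guess based on the `runId` option above, and `DoctorDiagnosisSketch` is a hypothetical helper.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;

// Hypothetical helper: builds a diagnosis request that references a stored Doctor report.
public static class DoctorDiagnosisSketch
{
    public static HttpRequestMessage ForStoredRun(Uri doctorBaseUri, string runId) =>
        new(HttpMethod.Post, new Uri(doctorBaseUri, "/api/v1/doctor/diagnosis"))
        {
            // Assumed body shape: a JSON object carrying only the run identifier.
            Content = JsonContent.Create(new { runId })
        };
}
```

The resulting message can be passed to `HttpClient.SendAsync`. The inline variant would carry a `DoctorRunResultResponse` payload in place of `runId`.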
### AdvisoryAI Diagnosis Surface (run-002 remediation)
Doctor WebService now exposes an AdvisoryAI diagnosis endpoint:
- `POST /api/v1/doctor/diagnosis`
The endpoint accepts either inline Doctor run payloads or a persisted `runId`, maps them through the shared Doctor AdvisoryAI context adapter (`src/__Libraries/StellaOps.Doctor/AdvisoryAI/**`), and returns deterministic diagnosis output (issues, root causes, recommended actions, and related runbook links).
## 2) Plugin Architecture
### Core Interfaces
```csharp
public interface IDoctorPlugin
{
string PluginId { get; }
string DisplayName { get; }
string Category { get; }
Version Version { get; }
IEnumerable<IDoctorCheck> GetChecks();
Task InitializeAsync(DoctorPluginContext context, CancellationToken ct);
}
public interface IDoctorCheck
{
string CheckId { get; }
string Name { get; }
string Description { get; }
DoctorSeverity DefaultSeverity { get; }
IReadOnlyList<string> Tags { get; }
TimeSpan EstimatedDuration { get; }
bool CanRun(DoctorPluginContext context);
Task<CheckResult> RunAsync(DoctorPluginContext context, CancellationToken ct);
}
```
### Plugin Context
```csharp
public sealed class DoctorPluginContext
{
public IServiceProvider Services { get; }
public IConfiguration Configuration { get; }
public TimeProvider TimeProvider { get; }
public ILogger Logger { get; }
public string EnvironmentName { get; }
public IReadOnlyDictionary<string, object> PluginConfig { get; }
}
```
### Check Results
```csharp
public sealed record CheckResult
{
public DoctorSeverity Severity { get; init; }
public string Diagnosis { get; init; }
public Evidence Evidence { get; init; }
public IReadOnlyList<string> LikelyCauses { get; init; }
public Remediation? Remediation { get; init; }
public string? VerificationCommand { get; init; }
}
public enum DoctorSeverity
{
Pass, // Check succeeded
Info, // Informational (no action needed)
Warn, // Warning (degraded but functional)
Fail, // Failure (requires action)
Skip // Check skipped (preconditions not met)
}
```
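Because `Skip` is declared last, comparing the enum's numeric values would rank a skipped check above a failed one. The sketch below shows one way to derive an overall run severity with an explicit ranking. The ranking and the `SeveritySketch` helper are assumptions for illustration; this document does not state how Doctor aggregates results. The enum is repeated so that the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Mirrors the enum above so the sketch compiles on its own.
public enum DoctorSeverity { Pass, Info, Warn, Fail, Skip }

// Hypothetical helper: picks the most severe result of a run.
public static class SeveritySketch
{
    // Assumed ranking: Skip is lowest so that skipped checks never dominate.
    private static int Rank(DoctorSeverity severity) => severity switch
    {
        DoctorSeverity.Skip => 0,
        DoctorSeverity.Pass => 1,
        DoctorSeverity.Info => 2,
        DoctorSeverity.Warn => 3,
        DoctorSeverity.Fail => 4,
        _ => throw new ArgumentOutOfRangeException(nameof(severity))
    };

    // Returns Skip for an empty run, otherwise the highest-ranked severity.
    public static DoctorSeverity Worst(IEnumerable<DoctorSeverity> results) =>
        results.DefaultIfEmpty(DoctorSeverity.Skip).MaxBy(s => Rank(s));
}
```

An explicit rank function keeps the aggregation independent of the enum's declaration order, so reordering or extending the enum later does not silently change the outcome.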
## 3) Built-in Plugins
### IntegrationPlugin
Validates external system connectivity and capabilities.
**Check Catalog:**
| Check ID | Name | Severity | Description |
|----------|------|----------|-------------|
| `check.integration.oci.credentials` | OCI Registry Credentials | Fail | Validate registry authentication |
| `check.integration.oci.pull` | OCI Registry Pull Authorization | Fail | Verify pull permissions |
| `check.integration.oci.push` | OCI Registry Push Authorization | Fail | Verify push permissions |
| `check.integration.oci.referrers` | OCI Registry Referrers API | Warn | Check OCI 1.1 referrers support |
| `check.integration.oci.capabilities` | OCI Registry Capability Matrix | Info | Probe all registry capabilities |
See [Registry Diagnostic Checks](./registry-checks.md) for detailed documentation.
### ConfigurationPlugin
Validates platform configuration.
| Check ID | Name | Severity | Description |
|----------|------|----------|-------------|
| `check.config.database` | Database Connection | Fail | Verify database connectivity |
| `check.config.secrets` | Secrets Provider | Fail | Verify secrets access |
| `check.config.tls` | TLS Configuration | Warn | Validate TLS certificates |
### HealthPlugin
Validates platform component health.
| Check ID | Name | Severity | Description |
|----------|------|----------|-------------|
| `check.health.api` | API Health | Fail | Verify API endpoints |
| `check.health.worker` | Worker Health | Fail | Verify background workers |
| `check.health.storage` | Storage Health | Fail | Verify storage backends |
## 4) Check Patterns
### Non-Destructive Probing
Registry checks use non-destructive operations:
```csharp
// Pull check: HEAD request only (no manifest body is transferred)
using var headRequest = new HttpRequestMessage(HttpMethod.Head, manifestUrl);
using var response = await client.SendAsync(headRequest, ct);
// Push check: start an upload session, then cancel it immediately
using var uploadResponse = await client.PostAsync(uploadsUrl, null, ct);
if (uploadResponse.StatusCode == HttpStatusCode.Accepted
    && uploadResponse.Headers.Location is { } location)
{
    // Location may be relative; HttpClient resolves it against client.BaseAddress
    await client.DeleteAsync(location, ct); // Cancel upload
}
```
### Capability Detection
Registry capability probing sequence:
```
1. GET /v2/ → Extract OCI-Distribution-API-Version header
2. GET /v2/{repo}/referrers/{digest} → Check referrers API support
3. POST /v2/{repo}/blobs/uploads/ → Check chunked upload support
└─ DELETE {location} → Cancel upload session
4. POST /v2/{repo}/blobs/uploads/?mount=...&from=... → Check cross-repo mount
5. OPTIONS /v2/{repo}/manifests/{ref} → Check delete support (Allow header)
6. OPTIONS /v2/{repo}/blobs/{digest} → Check blob delete support
```
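Steps 5 and 6 rely on the `Allow` header of the `OPTIONS` response. A minimal sketch of that interpretation follows; `CapabilitySketch` is a hypothetical helper, and the real probe may apply additional rules, such as handling registries that reject `OPTIONS` altogether.

```csharp
using System;
using System.Linq;
using System.Net.Http;

// Hypothetical helper: interprets the Allow header of an OPTIONS response.
public static class CapabilitySketch
{
    // In .NET, Allow is exposed as a content header (HttpContentHeaders.Allow).
    public static bool AllowsDelete(HttpResponseMessage optionsResponse) =>
        optionsResponse.Content.Headers.Allow
            .Contains("DELETE", StringComparer.OrdinalIgnoreCase);
}
```

A missing `Allow` header yields `false`, which in this sketch is treated the same as "delete not supported".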
### Evidence Collection
All checks collect structured evidence:
```csharp
var result = CheckResultBuilder.Create(check)
.Pass("Registry authentication successful")
.WithEvidence(eb => eb
.Add("registry_url", registryUrl)
.Add("auth_method", "bearer")
.Add("response_time_ms", elapsed.TotalMilliseconds.ToString("F0"))
.AddSensitive("token_preview", RedactToken(token)))
.Build();
```
### Credential Redaction
Sensitive values are automatically redacted:
```csharp
// Redact to first 2 + last 2 characters
private static string Redact(string? value)
{
if (string.IsNullOrEmpty(value) || value.Length <= 4)
return "****";
return $"{value[..2]}...{value[^2..]}";
}
// "mysecretpassword" → "my...rd"
```
## 5) CLI Integration
```bash
# Run all checks
stella doctor
# Run checks by tag
stella doctor --tag registry
stella doctor --tag configuration
# Run specific check
stella doctor --check check.integration.oci.referrers
# Output formats
stella doctor --format table # Default: human-readable
stella doctor --format json # Machine-readable
stella doctor --format sarif # SARIF for CI integration
# Verbosity
stella doctor --verbose # Include evidence details
stella doctor --quiet # Only show failures
# Filtering by severity
stella doctor --min-severity warn # Skip info/pass
```
## 6) Extensibility
### Creating a Custom Check
```csharp
public sealed class MyCustomCheck : IDoctorCheck
{
public string CheckId => "check.custom.mycheck";
public string Name => "My Custom Check";
public string Description => "Validates custom integration";
public DoctorSeverity DefaultSeverity => DoctorSeverity.Fail;
public IReadOnlyList<string> Tags => ["custom", "integration"];
public TimeSpan EstimatedDuration => TimeSpan.FromSeconds(5);
public bool CanRun(DoctorPluginContext context)
{
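// Return false if preconditions are not met (case-insensitive, so "True" also enables the check)
return string.Equals(context.Configuration["Custom:Enabled"], "true", StringComparison.OrdinalIgnoreCase);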
}
public async Task<CheckResult> RunAsync(DoctorPluginContext context, CancellationToken ct)
{
var builder = CheckResultBuilder.Create(this);
try
{
// Perform check logic (ValidateAsync is a placeholder for your own validation routine)
var result = await ValidateAsync(context, ct);
if (result.Success)
{
return builder
.Pass("Custom validation successful")
.WithEvidence(eb => eb.Add("detail", result.Detail))
.Build();
}
return builder
.Fail("Custom validation failed")
.WithLikelyCause("Configuration is invalid")
.WithRemediation(rb => rb
.AddManualStep(1, "Check configuration", "Verify Custom:Setting is correct")
.WithRunbookUrl("https://docs.stella-ops.org/runbooks/custom-check"))
.Build();
}
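// Let cancellation propagate instead of reporting it as a check failure
catch (Exception ex) when (ex is not OperationCanceledException)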
{
return builder
.Fail($"Check failed with error: {ex.Message}")
.WithEvidence(eb => eb.Add("exception_type", ex.GetType().Name))
.Build();
}
}
}
```
### Creating a Custom Plugin
```csharp
public sealed class MyCustomPlugin : IDoctorPlugin
{
public string PluginId => "custom";
public string DisplayName => "Custom Checks";
public string Category => "Integration";
public Version Version => new(1, 0, 0);
public IEnumerable<IDoctorCheck> GetChecks()
{
yield return new MyCustomCheck();
yield return new AnotherCustomCheck(); // Placeholder for a second IDoctorCheck implementation
}
public Task InitializeAsync(DoctorPluginContext context, CancellationToken ct)
{
// Optional initialization
return Task.CompletedTask;
}
}
```
## 7) Telemetry
Doctor emits metrics and traces for observability:
**Metrics:**
- `doctor_check_duration_seconds{check_id, severity}` - Check execution time
- `doctor_check_results_total{check_id, severity}` - Result counts
- `doctor_plugin_load_duration_seconds{plugin_id}` - Plugin initialization time
**Traces:**
- `doctor.run` - Full doctor run span
- `doctor.check.{check_id}` - Individual check spans with evidence as attributes
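Instruments and spans with these names could be produced with the standard `System.Diagnostics` APIs, as sketched below. The meter and activity source name `StellaOps.Doctor.Sketch` and the `DoctorTelemetrySketch` class are assumptions for illustration; the names Doctor actually registers are not given in this document.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical sketch: emits the check metrics and spans listed above.
public static class DoctorTelemetrySketch
{
    // Assumed meter and source name, chosen for this example only.
    public static readonly Meter Meter = new("StellaOps.Doctor.Sketch");
    public static readonly ActivitySource Source = new("StellaOps.Doctor.Sketch");

    private static readonly Histogram<double> CheckDuration =
        Meter.CreateHistogram<double>("doctor_check_duration_seconds", unit: "s");

    private static readonly Counter<long> CheckResults =
        Meter.CreateCounter<long>("doctor_check_results_total");

    // Records one finished check with the check_id and severity labels.
    public static void RecordCheck(string checkId, string severity, TimeSpan elapsed)
    {
        var tags = new TagList { { "check_id", checkId }, { "severity", severity } };
        CheckDuration.Record(elapsed.TotalSeconds, tags);
        CheckResults.Add(1, tags);
    }

    // Starts a per-check span; returns null when no listener is attached.
    public static Activity? StartCheckSpan(string checkId) =>
        Source.StartActivity($"doctor.check.{checkId}");
}
```

A check runner would call `StartCheckSpan` before the check and `RecordCheck` afterwards. An exporter such as OpenTelemetry then has to subscribe to the meter and activity source for the data to leave the process.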
## 8) Related Documentation
- [Registry Diagnostic Checks](./registry-checks.md)
- [Registry Compatibility Runbook](../../runbooks/registry-compatibility.md)
- [Registry Referrer Troubleshooting](../../runbooks/registry-referrer-troubleshooting.md)