feat(scheduler): wire startup migrations, dedupe 007/008, fix UI trend path

TASK-013: SchedulerPersistenceExtensions now calls AddStartupMigrations so
the embedded SQL files (including 007 job_kind + 008 doctor_trends) run on
every cold start. Deletes duplicate migrations 007_add_job_kind_plugin_config
(kept 007_add_schedule_job_kind.sql with tenant-scoped index) and
008_doctor_trends_table (kept 008_add_doctor_trends.sql with RLS + BRIN
time-series index).

TASK-010: Doctor UI trend service now calls
/api/v1/scheduler/doctor/trends/categories/{category} (was
/api/v1/doctor/scheduler/...) so it routes through the scheduler plugin
endpoints rather than the deprecated standalone doctor-scheduler path.

TASK-009: New DoctorJobPluginTests exercises plugin lifecycle: identity,
config validation for full/quick/categories/plugins modes, plan creation,
JSON schema shape, and PluginConfig round-trip (including alerts). 10 tests
added, all pass (26/26 in Plugin.Tests project).

Archives the sprint (all 13 tasks now DONE) and archives the platform
retest sprint (SPRINT_20260409_002) whose RETEST-008 completed via the
earlier feed-mirror cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -1,562 +0,0 @@
# Sprint 20260408-003 - Scheduler Plugin Architecture + Doctor Migration

## Topic & Scope

- Design and implement a generic job-plugin system for the Scheduler service, enabling non-scanning workloads (health checks, policy sweeps, graph builds, etc.) to be scheduled and executed as first-class Scheduler jobs.
- Migrate Doctor's thin scheduling layer (`StellaOps.Doctor.Scheduler`) to become the first Scheduler job plugin, eliminating a standalone service while preserving Doctor-specific UX and trending.
- Working directory: `src/JobEngine/` (primary), `src/Doctor/` (migration source), `src/Web/StellaOps.Web/src/app/features/doctor/` (UI adapter).
- Expected evidence: interface definitions compile, Doctor plugin builds, existing Scheduler tests pass, new plugin tests pass, Doctor UI still renders schedules and trends.

## Dependencies & Concurrency

- No upstream sprint blockers. The Scheduler WebService and Doctor Scheduler are both stable.
- Batch 1 (tasks 001-004) can proceed independently of Batch 2 (005-009).
- Batch 2 (Doctor plugin) depends on Batch 1 (plugin contracts).
- Batch 3 (UI + cleanup, tasks 010-012) depends on Batch 2.
- Safe to develop in parallel with any FE or Findings sprints since working directories do not overlap.

## Documentation Prerequisites

- `docs/modules/scheduler/architecture.md` (read before DOING)
- `src/JobEngine/AGENTS.Scheduler.md`
- `src/Doctor/AGENTS.md`
- `docs/doctor/doctor-capabilities.md`

---

## Architecture Design

### A. Current State Analysis

**Scheduler** (src/JobEngine/StellaOps.Scheduler.WebService):
- Manages `Schedule` entities with cron expressions, `ScheduleMode` (AnalysisOnly, ContentRefresh), `Selector` (image targeting), `ScheduleOnlyIf` preconditions, `ScheduleNotify` preferences, and `ScheduleLimits`.
- Creates `Run` entities with state machine: Planning -> Queued -> Running -> Completed/Error/Cancelled.
- The `Schedule.Mode` enum is hardcoded to scanning modes. The `Selector` model is image-centric (digests, namespaces, repositories, labels).
- Worker Host processes queue segments via `StellaOps.Scheduler.Queue` and `StellaOps.Scheduler.Worker.DependencyInjection`.
- Has an empty `StellaOps.Scheduler.plugins/scheduler/` directory and a working `PluginHostOptions` / `PluginHost.LoadPlugins()` assembly-loading pipeline via `StellaOps.Plugin.Hosting`.
- `SystemScheduleBootstrap` seeds 6 system schedules on startup.
- Already registers plugin assemblies via `RegisterPluginRoutines()` in Program.cs (line 189), which scans for `IDependencyInjectionRoutine` implementations.

**Doctor Scheduler** (src/Doctor/StellaOps.Doctor.Scheduler):
- Standalone slim WebApplication (~65 lines in Program.cs).
- `DoctorScheduleWorker` (BackgroundService): polls every N seconds, evaluates cron via Cronos, dispatches to `ScheduleExecutor`.
- `ScheduleExecutor`: makes HTTP POST to Doctor WebService `/api/v1/doctor/run`, polls for completion, stores trend data, evaluates alert rules.
- `DoctorSchedule` model: ScheduleId, Name, CronExpression, Mode (Quick/Full/Categories/Plugins), Categories[], Plugins[], Enabled, Alerts (AlertConfiguration), TimeZoneId, LastRunAt/Id/Status.
- All persistence is in-memory (`InMemoryScheduleRepository`, `InMemoryTrendRepository`). No Postgres implementation exists yet.
- Exposes REST endpoints at `/api/v1/doctor/scheduler/schedules` and `/api/v1/doctor/scheduler/trends`.
- 20 Doctor plugins across 18+ directories under `src/Doctor/__Plugins/`, each implementing `IDoctorPlugin` with `IDoctorCheck[]`.

**Doctor UI** (src/Web/StellaOps.Web/src/app/features/doctor):
- Calls Doctor WebService directly (`/doctor/api/v1/doctor/...`) for runs, checks, plugins, reports.
- Calls Doctor Scheduler at `/api/v1/doctor/scheduler/trends/categories/{category}` for trend sparklines.
- No schedule management UI exists yet (schedules are created via API or seed data).

### B. Plugin Architecture Design

#### B.1 The `ISchedulerJobPlugin` Contract

A new library `StellaOps.Scheduler.Plugin.Abstractions` defines the plugin contract:

```csharp
// Namespaces for IEndpointRouteBuilder, IConfiguration, and IServiceCollection.
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

namespace StellaOps.Scheduler.Plugin;

/// <summary>
/// Identifies the kind of job a plugin handles. Used in Schedule.JobKind
/// to route cron triggers to the correct plugin at execution time.
/// </summary>
public interface ISchedulerJobPlugin
{
    /// <summary>
    /// Unique, stable identifier for this job kind (e.g., "scan", "doctor", "policy-sweep").
    /// Stored in the Schedule record; must be immutable once published.
    /// </summary>
    string JobKind { get; }

    /// <summary>
    /// Human-readable display name for the UI.
    /// </summary>
    string DisplayName { get; }

    /// <summary>
    /// Plugin version for compatibility checking.
    /// </summary>
    Version Version { get; }

    /// <summary>
    /// Creates a typed execution plan from a Schedule + Run.
    /// Called when the cron fires or a manual run is created.
    /// Returns a plan object that the Scheduler persists as the Run's plan payload.
    /// </summary>
    Task<JobPlan> CreatePlanAsync(JobPlanContext context, CancellationToken ct);

    /// <summary>
    /// Executes the plan. Called by the Worker Host.
    /// Must be idempotent and support cancellation.
    /// Updates Run state via the provided IRunProgressReporter.
    /// </summary>
    Task ExecuteAsync(JobExecutionContext context, CancellationToken ct);

    /// <summary>
    /// Optionally validates plugin-specific configuration stored in Schedule.PluginConfig.
    /// Called on schedule create/update.
    /// </summary>
    Task<JobConfigValidationResult> ValidateConfigAsync(
        IReadOnlyDictionary<string, object?> pluginConfig,
        CancellationToken ct);

    /// <summary>
    /// Returns the JSON schema for plugin-specific configuration, enabling UI-driven forms.
    /// </summary>
    string? GetConfigJsonSchema();

    /// <summary>
    /// Registers plugin-specific services into DI.
    /// Called once during host startup.
    /// </summary>
    void ConfigureServices(IServiceCollection services, IConfiguration configuration);

    /// <summary>
    /// Registers plugin-specific HTTP endpoints (optional).
    /// Called during app.Map* phase.
    /// </summary>
    void MapEndpoints(IEndpointRouteBuilder routes);
}
```

#### B.2 Supporting Types

```csharp
/// <summary>
/// Immutable context passed to CreatePlanAsync.
/// </summary>
public sealed record JobPlanContext(
    Schedule Schedule,
    Run Run,
    IServiceProvider Services,
    TimeProvider TimeProvider);

/// <summary>
/// The plan produced by a plugin. Serialized to JSON and stored on the Run.
/// </summary>
public sealed record JobPlan(
    string JobKind,
    IReadOnlyDictionary<string, object?> Payload,
    int EstimatedSteps = 1);

/// <summary>
/// Context passed to ExecuteAsync.
/// </summary>
public sealed record JobExecutionContext(
    Schedule Schedule,
    Run Run,
    JobPlan Plan,
    IRunProgressReporter Reporter,
    IServiceProvider Services,
    TimeProvider TimeProvider);

/// <summary>
/// Callback interface for plugins to report progress and update Run state.
/// </summary>
public interface IRunProgressReporter
{
    Task ReportProgressAsync(int completed, int total, string? message = null, CancellationToken ct = default);
    Task TransitionStateAsync(RunState newState, string? error = null, CancellationToken ct = default);
    Task AppendLogAsync(string message, string level = "info", CancellationToken ct = default);
}

/// <summary>
/// Result of plugin config validation.
/// </summary>
public sealed record JobConfigValidationResult(
    bool IsValid,
    IReadOnlyList<string> Errors);
```

#### B.3 Schedule Model Extension

The existing `Schedule` record needs two new fields:

1. **`JobKind`** (string, default `"scan"`): routes to the correct `ISchedulerJobPlugin`. Existing schedules implicitly use `"scan"`.
2. **`PluginConfig`** (`ImmutableDictionary<string, object?>?`, optional): plugin-specific configuration stored as JSON. For scan jobs this is null (mode/selector cover everything). For Doctor jobs this contains `{ "doctorMode": "full", "categories": [...], "plugins": [...], "alerts": {...} }`.

The existing `ScheduleMode` and `Selector` remain valid for scan-type jobs. Plugins that don't target images can ignore `Selector` and set `Scope = AllImages` as a no-op.

#### B.4 Plugin Registry and Discovery

```
SchedulerPluginRegistry : ISchedulerPluginRegistry
  - Dictionary<string, ISchedulerJobPlugin> _plugins
  - Register(ISchedulerJobPlugin plugin)
  - Resolve(string jobKind) -> ISchedulerJobPlugin?
  - ListRegistered() -> IReadOnlyList<(string JobKind, string DisplayName)>
```

Plugins are discovered in two ways:
1. **Built-in**: The existing scan logic is refactored into `ScanJobPlugin : ISchedulerJobPlugin` with `JobKind = "scan"`. Registered in DI unconditionally.
2. **Assembly-loaded**: The existing `PluginHost.LoadPlugins()` pipeline scans `plugins/scheduler/` for DLLs. Any type implementing `ISchedulerJobPlugin` is instantiated and registered. This uses the existing `PluginHostOptions` infrastructure already wired in the Scheduler.
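
The registry outlined above can be sketched as follows. This is a minimal, self-contained illustration, not the shipped `SchedulerPluginRegistry`: `ISketchJobPlugin`, `SketchPlugin`, and `SketchPluginRegistry` are hypothetical stand-ins, and both the ordinal key comparison and the null-to-`"scan"` fallback inside `Resolve` are assumptions (the fallback mirrors the default described for existing schedules).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: register two plugins, resolve by kind, then try a duplicate.
var registry = new SketchPluginRegistry();
registry.Register(new SketchPlugin("scan", "Image Scans"));
registry.Register(new SketchPlugin("doctor", "Doctor Health Checks"));

Console.WriteLine(registry.Resolve("doctor")?.DisplayName);  // Doctor Health Checks
Console.WriteLine(registry.Resolve(null)?.JobKind);          // scan
Console.WriteLine(registry.Resolve("policy-sweep") is null); // True

try
{
    registry.Register(new SketchPlugin("doctor", "Another Doctor"));
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}

// Hypothetical, trimmed stand-in for ISchedulerJobPlugin (only what the registry needs).
public interface ISketchJobPlugin
{
    string JobKind { get; }
    string DisplayName { get; }
}

public sealed record SketchPlugin(string JobKind, string DisplayName) : ISketchJobPlugin;

public sealed class SketchPluginRegistry
{
    // Assumption: job kinds are compared ordinally because they are stable identifiers.
    private readonly Dictionary<string, ISketchJobPlugin> _plugins = new(StringComparer.Ordinal);

    public void Register(ISketchJobPlugin plugin)
    {
        ArgumentNullException.ThrowIfNull(plugin);
        if (!_plugins.TryAdd(plugin.JobKind, plugin))
        {
            throw new InvalidOperationException(
                $"Job kind '{plugin.JobKind}' is already registered.");
        }
    }

    // Assumption: a null or blank kind falls back to "scan".
    public ISketchJobPlugin? Resolve(string? jobKind)
    {
        var key = string.IsNullOrWhiteSpace(jobKind) ? "scan" : jobKind;
        return _plugins.TryGetValue(key, out var plugin) ? plugin : null;
    }

    public IReadOnlyList<(string JobKind, string DisplayName)> ListRegistered() =>
        _plugins.Values
            .OrderBy(p => p.JobKind, StringComparer.Ordinal)
            .Select(p => (p.JobKind, p.DisplayName))
            .ToList();
}
```

Rejecting a duplicate kind at registration time surfaces a conflicting plugin DLL at startup, not when a cron trigger is routed.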

#### B.5 Execution Flow

```
Cron fires for Schedule (jobKind="doctor")
  -> SchedulerPluginRegistry.Resolve("doctor") -> DoctorJobPlugin
  -> DoctorJobPlugin.CreatePlanAsync(schedule, run) -> JobPlan
  -> Run persisted with state=Queued, plan payload
  -> Worker dequeues Run
  -> DoctorJobPlugin.ExecuteAsync(context)
       -> Calls Doctor WebService HTTP API (same as current ScheduleExecutor)
       -> Reports progress via IRunProgressReporter
       -> Stores trend data
       -> Evaluates alerts
  -> Run transitions to Completed/Error
```
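
The flow above can be walked through in code. The sketch below collapses the cron side and the worker side into one method to make the state transitions visible. It is an illustration under stated assumptions: `SketchRun`, `SketchRunState`, `ISketchJobPlugin`, `EchoPlugin`, and `SketchDispatcher` are hypothetical, the dispatcher sets run state directly where the real contract goes through `IRunProgressReporter`, and persistence and queueing are omitted.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Usage: one registered plugin, one run for jobKind "doctor".
var plugins = new Dictionary<string, ISketchJobPlugin> { ["doctor"] = new EchoPlugin() };
var run = new SketchRun("run-1", "doctor");

await SketchDispatcher.TriggerAndExecuteAsync(plugins, run, CancellationToken.None);
Console.WriteLine(run.State); // Completed

public enum SketchRunState { Planning, Queued, Running, Completed, Error, Cancelled }

public sealed class SketchRun
{
    public SketchRun(string id, string jobKind) { Id = id; JobKind = jobKind; }
    public string Id { get; }
    public string JobKind { get; }
    public SketchRunState State { get; set; } = SketchRunState.Planning;
    public string? PlanJson { get; set; }
    public string? Error { get; set; }
}

// Hypothetical, trimmed stand-in for ISchedulerJobPlugin.
public interface ISketchJobPlugin
{
    Task<string> CreatePlanAsync(SketchRun run, CancellationToken ct);
    Task ExecuteAsync(SketchRun run, string planJson, CancellationToken ct);
}

public sealed class EchoPlugin : ISketchJobPlugin
{
    public Task<string> CreatePlanAsync(SketchRun run, CancellationToken ct) =>
        Task.FromResult("{\"checks\":[\"example\"]}");

    public Task ExecuteAsync(SketchRun run, string planJson, CancellationToken ct) =>
        Task.CompletedTask;
}

public static class SketchDispatcher
{
    public static async Task TriggerAndExecuteAsync(
        IReadOnlyDictionary<string, ISketchJobPlugin> plugins,
        SketchRun run,
        CancellationToken ct)
    {
        if (!plugins.TryGetValue(run.JobKind, out var plugin))
        {
            run.State = SketchRunState.Error;
            run.Error = $"No plugin registered for job kind '{run.JobKind}'.";
            return;
        }

        // Cron side: build the plan, then the run becomes Queued.
        run.PlanJson = await plugin.CreatePlanAsync(run, ct);
        run.State = SketchRunState.Queued;

        // Worker side: execute and record the terminal state.
        run.State = SketchRunState.Running;
        try
        {
            await plugin.ExecuteAsync(run, run.PlanJson, ct);
            run.State = SketchRunState.Completed;
        }
        catch (OperationCanceledException)
        {
            run.State = SketchRunState.Cancelled;
        }
        catch (Exception ex)
        {
            run.State = SketchRunState.Error;
            run.Error = ex.Message;
        }
    }
}
```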

#### B.6 Backward Compatibility

- `Schedule.JobKind` defaults to `"scan"` for all existing schedules (migration adds column with default).
- `Schedule.PluginConfig` defaults to null for existing schedules.
- `ScanJobPlugin` wraps the current execution logic with no behavioral change.
- The `ScheduleMode` enum remains but is only meaningful for `jobKind="scan"`. Other plugins ignore it (or set a sentinel value).
- All existing API contracts (`/api/v1/scheduler/schedules`, `/api/v1/scheduler/runs`) are extended, not broken.

### C. Doctor Plugin Design

#### C.1 DoctorJobPlugin

```csharp
public sealed class DoctorJobPlugin : ISchedulerJobPlugin
{
    public string JobKind => "doctor";
    public string DisplayName => "Doctor Health Checks";

    // CreatePlanAsync: reads DoctorScheduleConfig from Schedule.PluginConfig,
    // resolves which checks to run, returns JobPlan with check list.

    // ExecuteAsync: HTTP POST to Doctor WebService /api/v1/doctor/run,
    // polls for completion (same logic as current ScheduleExecutor),
    // stores trend data via ITrendRepository,
    // evaluates alerts via IAlertService.

    // MapEndpoints: registers /api/v1/scheduler/doctor/trends/* endpoints
    // to serve trend data (proxied from Scheduler's database).
}
```

#### C.2 Doctor-Specific Config Schema

```json
{
  "doctorMode": "full|quick|categories|plugins",
  "categories": ["security", "platform"],
  "plugins": ["stellaops.doctor.agent"],
  "timeoutSeconds": 300,
  "alerts": {
    "enabled": true,
    "alertOnFail": true,
    "alertOnWarn": false,
    "alertOnStatusChange": true,
    "channels": ["email"],
    "emailRecipients": [],
    "webhookUrls": [],
    "minSeverity": "Fail"
  }
}
```

This replaces `DoctorSchedule.Mode`, `Categories`, `Plugins`, and `Alerts` with structured data inside `Schedule.PluginConfig`.
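
As an illustration of how a plugin might read and validate this shape, the sketch below deserializes a config with `System.Text.Json` and applies a few checks. `SketchDoctorConfig` and `SketchDoctorConfigValidator` are hypothetical names, the `alerts` object is left out for brevity, and the specific rules (a non-empty `categories` list in `categories` mode, a non-empty `plugins` list in `plugins` mode, a positive timeout) are assumptions rather than the plugin's confirmed validation logic.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

const string json = """
{
  "doctorMode": "categories",
  "categories": ["security", "platform"],
  "timeoutSeconds": 300
}
""";

var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var config = JsonSerializer.Deserialize<SketchDoctorConfig>(json, options)
             ?? throw new InvalidOperationException("pluginConfig deserialized to null.");

var errors = SketchDoctorConfigValidator.Validate(config);
Console.WriteLine(errors.Count == 0 ? "valid" : string.Join("; ", errors)); // valid

// Hypothetical typed view over Schedule.PluginConfig for Doctor jobs.
public sealed class SketchDoctorConfig
{
    public string DoctorMode { get; init; } = "full";
    public IReadOnlyList<string> Categories { get; init; } = Array.Empty<string>();
    public IReadOnlyList<string> Plugins { get; init; } = Array.Empty<string>();
    public int TimeoutSeconds { get; init; } = 300;
}

public static class SketchDoctorConfigValidator
{
    private static readonly HashSet<string> Modes =
        new(StringComparer.OrdinalIgnoreCase) { "full", "quick", "categories", "plugins" };

    public static IReadOnlyList<string> Validate(SketchDoctorConfig config)
    {
        var errors = new List<string>();

        if (!Modes.Contains(config.DoctorMode))
        {
            errors.Add($"Unknown doctorMode '{config.DoctorMode}'.");
        }

        // Assumed rule: selective modes need at least one selector entry.
        if (IsMode(config, "categories") && config.Categories.Count == 0)
        {
            errors.Add("doctorMode 'categories' requires at least one category.");
        }

        if (IsMode(config, "plugins") && config.Plugins.Count == 0)
        {
            errors.Add("doctorMode 'plugins' requires at least one plugin.");
        }

        if (config.TimeoutSeconds <= 0)
        {
            errors.Add("timeoutSeconds must be positive.");
        }

        return errors;
    }

    private static bool IsMode(SketchDoctorConfig config, string mode) =>
        string.Equals(config.DoctorMode, mode, StringComparison.OrdinalIgnoreCase);
}
```

The resulting error list maps naturally onto `JobConfigValidationResult(IsValid, Errors)` from B.2.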

#### C.3 What Stays vs. What Moves

| Component | Current Location | After Migration |
|---|---|---|
| Doctor WebService | `src/Doctor/StellaOps.Doctor.WebService/` | **Stays unchanged** -- remains the execution engine |
| Doctor Scheduler (standalone service) | `src/Doctor/StellaOps.Doctor.Scheduler/` | **Deprecated** -- replaced by DoctorJobPlugin in Scheduler |
| Doctor checks (20 plugins) | `src/Doctor/__Plugins/` | **Stay unchanged** -- loaded by Doctor WebService |
| Doctor schedule CRUD | Doctor Scheduler endpoints | **Moves** to Scheduler schedule CRUD (with jobKind="doctor") |
| Doctor trend storage | `InMemoryTrendRepository` | **Moves** to Scheduler persistence (new table `scheduler.doctor_trends`) |
| Doctor trend endpoints | `/api/v1/doctor/scheduler/trends/*` | **Moves** to DoctorJobPlugin.MapEndpoints at same paths (or proxied) |
| Doctor UI | `src/Web/.../doctor/` | **Minor change** -- trend API base URL may change, schedule API uses Scheduler |

#### C.4 Doctor UI Continuity

The Doctor UI (`doctor.client.ts`) currently calls:
1. `/doctor/api/v1/doctor/...` (runs, checks, plugins, reports) -- **no change needed**, Doctor WebService stays.
2. `/api/v1/doctor/scheduler/trends/categories/{category}` (trends) -- **routed to DoctorJobPlugin endpoints registered in Scheduler**, or the existing Doctor Scheduler service can be kept running temporarily as a compatibility shim.

Strategy: DoctorJobPlugin registers the same trend endpoints under the Scheduler service. The gateway route for `doctor-scheduler.stella-ops.local` is remapped to the Scheduler service. UI code requires zero changes.

### D. What This Architecture Enables (Future)

After this sprint, adding a new scheduled job type requires:
1. Implement `ISchedulerJobPlugin` (one class + supporting types).
2. Drop the DLL into `plugins/scheduler/`.
3. Create schedules with `jobKind="your-kind"` and `pluginConfig={...}`.
4. No Scheduler core changes needed.

Future plugin candidates: `policy-sweep`, `graph-build`, `feed-refresh`, `evidence-export`, `compliance-audit`.

---

## Delivery Tracker

### TASK-001 - Create StellaOps.Scheduler.Plugin.Abstractions library
Status: DONE
Dependency: none
Owners: Developer (Backend)
Task description:
- Create new class library `src/JobEngine/StellaOps.Scheduler.__Libraries/StellaOps.Scheduler.Plugin.Abstractions/`.
- Define `ISchedulerJobPlugin`, `JobPlanContext`, `JobPlan`, `JobExecutionContext`, `IRunProgressReporter`, `JobConfigValidationResult`.
- Target net10.0. No external dependencies beyond `StellaOps.Scheduler.Models`.
- Add to `StellaOps.JobEngine.sln`.

Completion criteria:
- [ ] Library compiles with zero warnings
- [ ] All types documented with XML comments
- [ ] Added to solution and referenced by Scheduler.WebService and Scheduler.Worker.Host csproj files

### TASK-002 - Create SchedulerPluginRegistry
Status: DONE
Dependency: TASK-001
Owners: Developer (Backend)
Task description:
- Create `ISchedulerPluginRegistry` and `SchedulerPluginRegistry` in the Scheduler.WebService project (or a shared library).
- Registry stores `Dictionary<string, ISchedulerJobPlugin>` keyed by `JobKind`.
- Provides `Register()`, `Resolve(string jobKind)`, `ListRegistered()`.
- Wire into DI as singleton in Program.cs.
- Integrate with existing `PluginHost.LoadPlugins()` to discover and register `ISchedulerJobPlugin` implementations from plugin assemblies.

Completion criteria:
- [ ] Registry resolves built-in plugins
- [ ] Registry discovers plugins from assembly-loaded DLLs
- [ ] Unit tests verify registration, resolution, and duplicate-kind rejection

### TASK-003 - Extend Schedule model with JobKind and PluginConfig
Status: DONE
Dependency: TASK-001
Owners: Developer (Backend)
Task description:
- Add `JobKind` (string, default "scan") and `PluginConfig` (`ImmutableDictionary<string, object?>?`) to the `Schedule` record.
- Update `ScheduleCreateRequest` and `ScheduleUpdateRequest` contracts to accept these fields.
- Update `ScheduleEndpoints` create/update handlers to validate `PluginConfig` via the resolved plugin's `ValidateConfigAsync()`.
- Add SQL migration to add `job_kind` (varchar, default 'scan') and `plugin_config` (jsonb, nullable) columns to the schedules table.
- Update EF Core entity mapping and compiled model.
- Update `SystemScheduleBootstrap` to set `JobKind = "scan"` explicitly.

Completion criteria:
- [ ] Existing schedule tests pass (backward compatible)
- [ ] New schedules can be created with jobKind and pluginConfig
- [ ] SQL migration is embedded resource and auto-applies
- [ ] Serialization round-trips correctly for pluginConfig

### TASK-004 - Refactor existing scan logic into ScanJobPlugin
Status: DONE
Dependency: TASK-001, TASK-002
Owners: Developer (Backend)
Task description:
- Create `ScanJobPlugin : ISchedulerJobPlugin` with `JobKind = "scan"`.
- `CreatePlanAsync`: reuse existing run-planning logic (impact resolution, selector evaluation, queue dispatch).
- `ExecuteAsync`: reuse existing worker segment processing.
- `ValidateConfigAsync`: validate ScheduleMode is valid.
- `ConfigureServices`: no-op (scan services already registered).
- `MapEndpoints`: no-op (scan endpoints already registered).
- Register as built-in plugin in `SchedulerPluginRegistry` during DI setup.
- This is a refactoring task. No behavioral change allowed.

Completion criteria:
- [ ] Existing scan schedules work identically through the plugin path
- [ ] All existing Scheduler tests pass without modification
- [ ] ScanJobPlugin is the default plugin when jobKind is "scan" or null

### TASK-005 - Create StellaOps.Scheduler.Plugin.Doctor library
Status: DONE
Dependency: TASK-001, TASK-003
Owners: Developer (Backend)
Task description:
- Create new class library `src/JobEngine/StellaOps.Scheduler.plugins/StellaOps.Scheduler.Plugin.Doctor/`.
- Implement `DoctorJobPlugin : ISchedulerJobPlugin` with `JobKind = "doctor"`.
- Port `ScheduleExecutor` logic: HTTP POST to Doctor WebService, poll for completion, map results.
- Port `DoctorScheduleConfig` deserialization from `Schedule.PluginConfig`.
- Port `AlertConfiguration` evaluation and `IAlertService` integration.
- `ConfigureServices`: register `HttpClient` for Doctor API, `IAlertService`, `ITrendRepository`.
- Use Scheduler's persistence layer for trend storage (new table via embedded SQL migration).

Completion criteria:
- [ ] Plugin compiles and loads via PluginHost
- [ ] Plugin can create a plan from a doctor-type schedule
- [ ] Plugin executes a doctor run via HTTP against Doctor WebService
- [ ] Trend data is stored in Scheduler's Postgres schema

### TASK-006 - Add Doctor trend persistence to Scheduler schema
Status: DONE
Dependency: TASK-005
Owners: Developer (Backend)
Task description:
- Add SQL migration creating `scheduler.doctor_trends` table (timestamp, check_id, plugin_id, category, run_id, status, health_score, duration_ms, evidence_values jsonb).
- Add `scheduler.doctor_trend_summaries` materialized view or summary query.
- Implement `PostgresDoctorTrendRepository : ITrendRepository` using Scheduler's DB connection.
- Implement data retention pruning (configurable, default 365 days).

Completion criteria:
- [ ] Migration auto-applies on Scheduler startup
- [ ] Trend data round-trips correctly
- [ ] Pruning removes old data beyond retention period
- [ ] Query performance acceptable for 365-day windows
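
The retention rule in this task reduces to a cutoff comparison, which the sketch below shows against an in-memory list. It is an assumption-laden illustration: `SketchTrendRow` and `SketchRetention` are hypothetical, the row carries only three of the listed columns, and a Postgres-backed repository would presumably express the same cutoff as a `DELETE ... WHERE timestamp < @cutoff` statement instead of filtering in memory.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: one row older than the 365-day default, one inside the window.
var now = new DateTimeOffset(2026, 4, 13, 0, 0, 0, TimeSpan.Zero);
var rows = new List<SketchTrendRow>
{
    new(now.AddDays(-400), "check.a", "pass"),
    new(now.AddDays(-10), "check.a", "fail"),
};

var kept = SketchRetention.Prune(rows, now, TimeSpan.FromDays(365));
Console.WriteLine(kept.Count); // 1

// Hypothetical subset of the scheduler.doctor_trends columns.
public sealed record SketchTrendRow(DateTimeOffset Timestamp, string CheckId, string Status);

public static class SketchRetention
{
    // Keeps rows whose timestamp is at or after (now - retention).
    public static IReadOnlyList<SketchTrendRow> Prune(
        IEnumerable<SketchTrendRow> rows,
        DateTimeOffset now,
        TimeSpan retention)
    {
        if (retention <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(retention), "Retention must be positive.");
        }

        var cutoff = now - retention;
        return rows.Where(r => r.Timestamp >= cutoff).ToList();
    }
}
```

Passing `now` in explicitly, for example from an injected `TimeProvider`, keeps the pruning boundary deterministic in tests.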

### TASK-007 - Register Doctor trend and schedule endpoints in DoctorJobPlugin
Status: DONE
Dependency: TASK-005, TASK-006
Owners: Developer (Backend)
Task description:
- Implement `DoctorJobPlugin.MapEndpoints()` to register:
  - `GET /api/v1/scheduler/doctor/trends` (mirrors existing `/api/v1/doctor/scheduler/trends`)
  - `GET /api/v1/scheduler/doctor/trends/checks/{checkId}`
  - `GET /api/v1/scheduler/doctor/trends/categories/{category}`
  - `GET /api/v1/scheduler/doctor/trends/degrading`
- Ensure response shapes match current Doctor Scheduler endpoint contracts for UI compatibility.
- Add gateway route alias so requests to `/api/v1/doctor/scheduler/trends/*` are forwarded to Scheduler service.

Completion criteria:
- [ ] All trend endpoints return correct data shapes
- [ ] Existing Doctor UI trend sparklines work without code changes
- [ ] Gateway routing verified
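
The gateway alias in this task amounts to a prefix rewrite from the legacy Doctor Scheduler path to the new Scheduler plugin path. A small sketch of that mapping follows. `SketchTrendPathAlias` is a hypothetical helper, and the real gateway is presumably configured declaratively, so this only illustrates the intended mapping.

```csharp
#nullable enable
using System;

// Usage: legacy trend paths are rewritten, anything else passes through.
Console.WriteLine(SketchTrendPathAlias.Rewrite("/api/v1/doctor/scheduler/trends/categories/security"));
// /api/v1/scheduler/doctor/trends/categories/security
Console.WriteLine(SketchTrendPathAlias.Rewrite("/api/v1/scheduler/runs"));
// /api/v1/scheduler/runs

public static class SketchTrendPathAlias
{
    private const string LegacyPrefix = "/api/v1/doctor/scheduler/trends";
    private const string NewPrefix = "/api/v1/scheduler/doctor/trends";

    public static string Rewrite(string path)
    {
        // Exact match on the collection route.
        if (path.Equals(LegacyPrefix, StringComparison.OrdinalIgnoreCase))
        {
            return NewPrefix;
        }

        // Sub-routes: require the '/' so look-alike prefixes are not rewritten.
        if (path.StartsWith(LegacyPrefix + "/", StringComparison.OrdinalIgnoreCase))
        {
            return NewPrefix + path.Substring(LegacyPrefix.Length);
        }

        return path;
    }
}
```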

### TASK-008 - Seed default Doctor schedules via SystemScheduleBootstrap
Status: DONE
Dependency: TASK-003, TASK-005
Owners: Developer (Backend)
Task description:
- Add Doctor system schedules to `SystemScheduleBootstrap.SystemSchedules`:
  - `doctor-full-daily` ("Daily Health Check", `0 4 * * *`, jobKind="doctor", pluginConfig for Full mode)
  - `doctor-quick-hourly` ("Hourly Quick Check", `0 * * * *`, jobKind="doctor", pluginConfig for Quick mode)
  - `doctor-compliance-weekly` ("Weekly Compliance Audit", `0 5 * * 0`, jobKind="doctor", pluginConfig for Categories=["compliance"])
- These replace the in-memory seeds from Doctor Scheduler's `InMemoryScheduleRepository`.

Completion criteria:
- [ ] Doctor schedules are created on fresh DB
- [ ] Existing scan schedules unaffected
- [ ] Schedules appear in Scheduler API with correct jobKind and pluginConfig

### TASK-009 - Integration tests for Doctor plugin lifecycle
Status: TODO
Dependency: TASK-005, TASK-006, TASK-007, TASK-008
Owners: Developer (Backend), Test Automation
Task description:
- Add integration tests in `src/JobEngine/StellaOps.Scheduler.__Tests/`:
  - Plugin discovery and registration test
  - Doctor schedule create/update with pluginConfig validation
  - Doctor plan creation from schedule
  - Doctor execution mock (mock HTTP to Doctor WebService)
  - Trend storage and query
  - Alert evaluation
- Use deterministic fixtures, and replace `TimeProvider.System` with a controllable `TimeProvider` so tests control time.

Completion criteria:
- [ ] All new tests pass
- [ ] No flaky tests (deterministic time, no network)
- [ ] Coverage includes happy path, validation errors, execution errors, cancellation
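
One way to get the deterministic time this task asks for is a small `TimeProvider` subclass whose clock only moves when the test advances it. The sketch below assumes .NET 8 or later, where `System.TimeProvider` exists. `SketchFixedTimeProvider` is a hypothetical name, and a packaged fake such as `FakeTimeProvider` from `Microsoft.Extensions.TimeProvider.Testing` may be preferable in a real test project.

```csharp
#nullable enable
using System;

// Usage: the clock starts at 04:00 UTC and moves only when advanced.
var clock = new SketchFixedTimeProvider(new DateTimeOffset(2026, 4, 8, 4, 0, 0, TimeSpan.Zero));
Console.WriteLine(clock.GetUtcNow().Hour); // 4

clock.Advance(TimeSpan.FromHours(1));
Console.WriteLine(clock.GetUtcNow().Hour); // 5

// Hypothetical test double; overrides only the UTC clock.
public sealed class SketchFixedTimeProvider : TimeProvider
{
    private DateTimeOffset _utcNow;

    public SketchFixedTimeProvider(DateTimeOffset start) => _utcNow = start;

    public override DateTimeOffset GetUtcNow() => _utcNow;

    public void Advance(TimeSpan delta) => _utcNow += delta;
}
```

Because `JobPlanContext` and `JobExecutionContext` both carry a `TimeProvider`, a test can hand this instance to the plugin without touching global state.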

### TASK-010 - Update Doctor UI trend API base URL
Status: TODO
Dependency: TASK-007
Owners: Developer (Frontend)
Task description:
- If gateway routing alias is set up correctly (TASK-007), this may be a no-op.
- If API path changes, update `doctor.client.ts` `getTrends()` method to use new endpoint path.
- Verify trend sparklines render correctly.

Completion criteria:
- [ ] Doctor dashboard trend sparklines display data
- [ ] No console errors related to trend API calls

### TASK-011 - Deprecate Doctor Scheduler standalone service
Status: DONE
Dependency: TASK-009 (all tests pass)
Owners: Developer (Backend), Project Manager
Task description:
- Add deprecation notice to `src/Doctor/StellaOps.Doctor.Scheduler/README.md`.
- Remove Doctor Scheduler from `docker-compose.stella-ops.yml` (or disable by default).
- Remove Doctor Scheduler from `devops/compose/services-matrix.env` if present.
- Keep source code intact for one release cycle before deletion.
- Update `docs/modules/doctor/` to reflect that scheduling is now handled by the Scheduler service.

Completion criteria:
- [ ] Doctor Scheduler container no longer starts in default compose
- [ ] All Doctor scheduling functionality verified via Scheduler service
- [ ] Deprecation documented

### TASK-012 - Update architecture documentation
Status: DONE
Dependency: TASK-004, TASK-005
Owners: Documentation Author
Task description:
- Update `docs/modules/scheduler/architecture.md` with plugin architecture section.
- Add `ISchedulerJobPlugin` contract reference.
- Update `docs/modules/doctor/` to document scheduler integration.
- Update `docs/07_HIGH_LEVEL_ARCHITECTURE.md` if Scheduler's role description needs updating.
- Create or update `src/JobEngine/StellaOps.Scheduler.plugins/AGENTS.md` with plugin development guide.

Completion criteria:
- [ ] Architecture docs reflect plugin system
- [ ] Doctor scheduling migration documented
- [ ] Plugin development guide exists for future plugin authors

### TASK-013 - Wire Scheduler persistence auto-migrations + dedupe 007/008
Status: TODO
Dependency: TASK-006
Owners: Developer / Implementer
Task description:
- The Scheduler service does not wire `AddStartupMigrations` (see `src/JobEngine/StellaOps.Scheduler.__Libraries/StellaOps.Scheduler.Persistence/Extensions/SchedulerPersistenceExtensions.cs`). Migrations 007 (`job_kind`/`plugin_config`) and 008 (`doctor_trends`) never execute on startup. `SystemScheduleBootstrap` crashes on every boot with `Npgsql.PostgresException 42703: column "job_kind" of relation "schedules" does not exist`, blocking default Doctor schedule seeding (the goal of TASK-008).
- Two collision pairs exist under `Migrations/`: `007_add_job_kind_plugin_config.sql` + `007_add_schedule_job_kind.sql`, and `008_add_doctor_trends.sql` + `008_doctor_trends_table.sql`. Pick one of each, delete the other, reconcile index/comment differences.
- Wire `AddStartupMigrations("scheduler", "StellaOps.Scheduler", persistenceAssembly)` from `StellaOps.Infrastructure.Postgres.Migrations` in `AddSchedulerPersistence(...)` so the embedded SQL files run on every cold start. Reference pattern: `src/Signals/__Libraries/StellaOps.Signals.Persistence/Extensions/`.
- The missing wiring violates the non-negotiable rule in `CLAUDE.md §2.7` (Database auto-migration requirement). The symptom was observed on 2026-04-13 on a stack that had been running since fresh DB bootstrap.

Completion criteria:
- [ ] `scheduler.schema_migrations` table exists after a fresh-DB startup (parity with other services in §2.7)
- [ ] `scheduler.schedules` has `job_kind TEXT NOT NULL DEFAULT 'scan'` and `plugin_config JSONB` columns
- [ ] `scheduler.doctor_trends` table exists
- [ ] `SystemScheduleBootstrap` seeds 3 default Doctor schedules without error on fresh DB
- [ ] Duplicate 007/008 SQL files collapsed to one each
- [ ] Integration test (or targeted manual run) proves volume-reset → working scheduler without any manual `psql`
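
The 007/008 collisions described in this task could be caught mechanically by grouping migration file names on their numeric prefix. The sketch below runs that check over the four file names listed above. `SketchMigrationLint` is a hypothetical helper, and treating everything before the first underscore as the version prefix is an assumption about the naming convention.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: the four colliding files named in the task description.
var files = new[]
{
    "007_add_job_kind_plugin_config.sql",
    "007_add_schedule_job_kind.sql",
    "008_add_doctor_trends.sql",
    "008_doctor_trends_table.sql",
};

foreach (var (prefix, names) in SketchMigrationLint.FindCollisions(files))
{
    Console.WriteLine($"{prefix}: {string.Join(", ", names)}");
}
// 007: 007_add_job_kind_plugin_config.sql, 007_add_schedule_job_kind.sql
// 008: 008_add_doctor_trends.sql, 008_doctor_trends_table.sql

public static class SketchMigrationLint
{
    // Returns every prefix shared by more than one file, with the files that share it.
    public static IReadOnlyList<(string Prefix, IReadOnlyList<string> Files)> FindCollisions(
        IEnumerable<string> fileNames) =>
        fileNames
            .Select(name => (Prefix: name.Split('_', 2)[0], Name: name))
            .GroupBy(x => x.Prefix, StringComparer.Ordinal)
            .Where(g => g.Count() > 1)
            .OrderBy(g => g.Key, StringComparer.Ordinal)
            .Select(g => (
                Prefix: g.Key,
                Files: (IReadOnlyList<string>)g
                    .Select(x => x.Name)
                    .OrderBy(n => n, StringComparer.Ordinal)
                    .ToList()))
            .ToList();
}
```

Fed from `Assembly.GetManifestResourceNames()` in a unit test, a check of this kind would fail the build as soon as a second file reuses a prefix.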

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-04-08 | Sprint created with full architectural design after codebase analysis. 12 tasks defined across 3 batches. | Planning |
| 2026-04-08 | Batch 1 complete: Plugin.Abstractions library (ISchedulerJobPlugin, SchedulerPluginRegistry, ScanJobPlugin), Schedule model extended with JobKind+PluginConfig, SQL migration 007, contracts updated, Program.cs wired. All 143 existing tests pass. | Developer |
| 2026-04-08 | Batch 2 complete: DoctorJobPlugin created with HTTP execution, trend storage (PostgresDoctorTrendRepository), alert service, trend endpoints. SQL migration 008 for doctor_trends table. 3 default Doctor schedules seeded. | Developer |
| 2026-04-08 | Batch 3 complete: doctor-scheduler commented out in both compose files. AGENTS.md created for scheduler plugins. Build verified: WebService + Doctor plugin compile with 0 warnings/errors. | Developer |
| 2026-04-13 | QA verification on running stack: Doctor trend endpoints returned 500 due to missing `[FromServices]` on `IDoctorTrendRepository?` in three endpoints. Fixed (commit `337aa5802`); all four trend endpoints now return HTTP 200 via gateway. Discovered Scheduler persistence never wires `AddStartupMigrations`: migrations 007/008 never ran; `SystemScheduleBootstrap` crashes on every boot; duplicate 007/008 SQL files present. Opened TASK-013. | QA / Developer |

## Decisions & Risks

### Decisions

1. **Plugin interface vs. message-based dispatch**: Chose an in-process `ISchedulerJobPlugin` interface over a message queue dispatch model. Rationale: the Scheduler already has assembly-loading infrastructure (`PluginHost`), and in-process execution avoids adding another IPC layer. Plugins that need to call remote services (like Doctor) do so via HttpClient, which is already the pattern.

2. **Schedule model extension vs. separate table**: Chose to extend the existing `Schedule` record with `JobKind` + `PluginConfig` rather than creating a separate `PluginSchedule` table. Rationale: keeps the CRUD API unified, avoids join complexity, and the JSON pluginConfig column provides flexibility without schema changes per plugin.

3. **Doctor WebService stays**: Doctor WebService remains a standalone service. The plugin only replaces the scheduling/triggering layer (Doctor Scheduler). This preserves the existing Doctor engine, plugin loading, check execution, and report storage. The plugin communicates with Doctor WebService via HTTP, same as today.

4. **Trend data in Scheduler schema**: Doctor trend data moves to the Scheduler's Postgres schema rather than staying in Doctor's (non-existent) Postgres. Rationale: Scheduler already has persistent storage; Doctor Scheduler was in-memory only. This gives trends durability without adding a new database dependency to Doctor.

5. **ScanJobPlugin as refactoring, not rewrite**: The existing scan logic is wrapped in `ScanJobPlugin` by extracting and delegating, not by rewriting. This minimizes regression risk.
|
||||
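Decisions 1 and 2 can be illustrated with a small sketch. It is a Python analogue, not the actual C# `ISchedulerJobPlugin` contract: the class names, method names, return strings, and the `"scan"` default for `job_kind` are all assumptions made for illustration. It shows one registry dispatching on a schedule's job kind and validating the free-form plugin config before execution.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class JobPlugin(Protocol):
    """Hypothetical stand-in for an in-process scheduler job plugin."""

    job_kind: str

    def validate_config(self, config: dict[str, Any]) -> list[str]: ...

    def execute(self, config: dict[str, Any]) -> str: ...


@dataclass
class Schedule:
    """One schedule record: the job kind selects the plugin, the config is plugin-owned."""

    name: str
    job_kind: str = "scan"  # assumed default so pre-existing schedules keep working
    plugin_config: dict[str, Any] = field(default_factory=dict)


class ScanPlugin:
    job_kind = "scan"

    def validate_config(self, config: dict[str, Any]) -> list[str]:
        return []  # the wrapped scan logic needs no plugin config in this sketch

    def execute(self, config: dict[str, Any]) -> str:
        return "scan executed"


class DoctorPlugin:
    job_kind = "doctor"
    _modes = {"full", "quick", "categories", "plugins"}

    def validate_config(self, config: dict[str, Any]) -> list[str]:
        mode = config.get("mode")
        return [] if mode in self._modes else [f"unknown doctor mode: {mode!r}"]

    def execute(self, config: dict[str, Any]) -> str:
        # The real plugin would call Doctor WebService over HTTP at this point.
        return f"doctor run ({config['mode']})"


class PluginRegistry:
    def __init__(self) -> None:
        self._plugins: dict[str, JobPlugin] = {}

    def register(self, plugin: JobPlugin) -> None:
        if plugin.job_kind in self._plugins:
            raise ValueError(f"job kind already registered: {plugin.job_kind}")
        self._plugins[plugin.job_kind] = plugin

    def dispatch(self, schedule: Schedule) -> str:
        plugin = self._plugins.get(schedule.job_kind)
        if plugin is None:
            raise LookupError(f"no plugin for job kind: {schedule.job_kind}")
        errors = plugin.validate_config(schedule.plugin_config)
        if errors:
            raise ValueError("; ".join(errors))
        return plugin.execute(schedule.plugin_config)
```

Because the config travels as an opaque dictionary on the schedule record, adding a new plugin in this sketch needs a new `register` call but no change to `Schedule`, which is the trade-off decision 2 describes.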
|
||||
### Risks
|
||||
|
||||
1. **Schedule.PluginConfig schema evolution**: As plugin configs evolve, backward compatibility of the JSON blob must be maintained. Mitigation: plugins should version their config schema and handle migration in `ValidateConfigAsync`.
|
||||
|
||||
2. **Doctor WebService availability during scheduled runs**: If Doctor WebService is down, the DoctorJobPlugin's execution will fail. Mitigation: implement retry with backoff in the plugin, and use Run state machine to track Error state with meaningful messages.
|
||||
|
||||
3. **Gateway routing for trend endpoints**: The UI currently hits Doctor Scheduler directly. After migration, requests must be routed to the Scheduler service. Mitigation: TASK-007 explicitly addresses gateway configuration, and TASK-010 handles UI fallback.
|
||||
|
||||
4. **Compiled model regeneration**: Adding columns to Schedule requires regenerating EF Core compiled models. This is mechanical but must not be forgotten.
|
||||
|
||||
5. **Plugin isolation**: In-process plugins share the Scheduler's AppDomain. A misbehaving plugin (memory leak, thread starvation) affects all jobs. Mitigation: use `SemaphoreSlim` for concurrency limits (same pattern as current Doctor Scheduler), add plugin execution timeouts.
|
||||
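The mitigations named in risks 2 and 5 (retry with backoff, a concurrency gate, and an execution timeout) can be combined in one wrapper. The following is a minimal Python sketch under stated assumptions, not the plugin's actual code: `run_with_limits`, its parameters, and the choice of retryable exceptions are hypothetical, and `asyncio.Semaphore` stands in for the `SemaphoreSlim` mentioned above.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_limits(
    job: Callable[[], Awaitable[str]],
    *,
    gate: asyncio.Semaphore,
    timeout_s: float,
    attempts: int = 3,
    base_delay_s: float = 0.01,
) -> str:
    """Run `job` under a concurrency gate and a timeout, retrying with backoff."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        # Hold a gate slot only while the job runs, not while backing off.
        async with gate:
            try:
                return await asyncio.wait_for(job(), timeout=timeout_s)
            except (asyncio.TimeoutError, ConnectionError) as exc:
                last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff: base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_s * 2**attempt)
    # Surface a meaningful message so the run can be recorded in an error state.
    raise RuntimeError(f"job failed after {attempts} attempts: {last_error!r}")
```

Releasing the gate before sleeping keeps a failing plugin from occupying a slot during its backoff, so other jobs can proceed while it waits.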
|
||||
## Next Checkpoints
|
||||
|
||||
- **Batch 1 complete** (TASK-001 through TASK-004): Plugin abstractions + registry + scan refactoring. Demo: existing scan schedules work through plugin dispatch. Estimated: 3-4 days.
|
||||
- **Batch 2 complete** (TASK-005 through TASK-009): Doctor plugin + trend storage + tests. Demo: doctor health checks triggered by Scheduler, trends visible. Estimated: 4-5 days.
|
||||
- **Batch 3 complete** (TASK-010 through TASK-012): UI fix-up, deprecation, docs. Demo: full end-to-end. Estimated: 2 days.
|
||||
- **Total estimated effort**: 9-11 working days for one backend developer + 1 day frontend.
|
||||
@@ -1,186 +0,0 @@
|
||||
# Sprint 20260409-002 -- Local Stack Regression Retest
|
||||
|
||||
## Topic & Scope
|
||||
- Regress the rebuilt local Stella Ops environment in the order requested by the user: backend unit tests, backend end-to-end checks, frontend unit tests, then frontend end-to-end checks.
|
||||
- Keep execution strictly serial: no overlapping test runs across projects or suites.
|
||||
- Capture failures with concrete project, command, and runtime evidence so blockers can be fixed and retested deterministically.
|
||||
- Working directory: `.`.
|
||||
- Expected evidence: per-project test command output, live API/browser verification artifacts, and sprint execution log updates for each lane.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Required docs: `docs/qa/feature-checks/FLOW.md`, `docs/code-of-conduct/TESTING_PRACTICES.md`, `devops/compose/README.md`, `docs/modules/platform/architecture-overview.md`.
|
||||
- Depends on [SPRINT_20260409_001_Platform_local_container_rebuild_integrations_sources.md](/C:/dev/New%20folder/git.stella-ops.org/docs/implplan/SPRINT_20260409_001_Platform_local_container_rebuild_integrations_sources.md) because the retest uses the rebuilt local stack and seeded integration/source catalogs from that sprint.
|
||||
- Test execution is strictly sequential per the user request. Only one project-level test command may run at a time.
|
||||
- Cross-module reads and test execution are allowed across `src/**`, `src/Web/**`, and `devops/compose/**`. Code edits are allowed only if a failing test exposes a concrete product defect that must be fixed to continue the retest.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/qa/feature-checks/FLOW.md`
|
||||
- `docs/code-of-conduct/TESTING_PRACTICES.md`
|
||||
- `devops/compose/README.md`
|
||||
- `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
### RETEST-001 - Run backend unit test lane sequentially
|
||||
Status: DONE
|
||||
Dependency: none
|
||||
Owners: QA / Test Automation
|
||||
Task description:
|
||||
- Identify the backend test projects that cover the rebuilt local integration and advisory-source paths plus the core services currently known to be unhealthy after fresh bootstrap.
|
||||
- Run each backend unit/integration-oriented project one at a time, record exact commands and outcomes, and stop to triage any hard failures before advancing to backend E2E checks.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Backend project list and execution order are recorded in the sprint log.
|
||||
- [x] Each selected backend test project is run individually with exact command evidence.
|
||||
- [x] Failures are captured with concrete project-level detail and triage notes.
|
||||
|
||||
### RETEST-002 - Run backend end-to-end verification against the live stack
|
||||
Status: DONE
|
||||
Dependency: RETEST-001
|
||||
Owners: QA
|
||||
Task description:
|
||||
- Exercise the live backend surfaces exposed by the rebuilt local stack, starting with the integration and advisory-source APIs already proven during rebuild and extending into the remaining core services needed for a broader backend regression call.
|
||||
- Use real HTTP requests and service health evidence from the local environment; do not treat compile/test passes as sufficient.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Live backend endpoints are exercised with fresh requests against the rebuilt environment.
|
||||
- [x] Responses and any failing services are captured with exact evidence.
|
||||
- [x] Backend E2E status is recorded as pass/fail/blocker with follow-up notes.
|
||||
|
||||
### RETEST-003 - Run frontend unit test lane sequentially
|
||||
Status: DONE
|
||||
Dependency: RETEST-002
|
||||
Owners: QA / Test Automation
|
||||
Task description:
|
||||
- Run the frontend unit-test suites one project at a time after backend E2E completes, using the repository-supported Node toolchain already present on this machine.
|
||||
- Record command output, failures, and any environment issues before moving to browser-based verification.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Frontend unit test projects are executed serially.
|
||||
- [x] Results are captured with exact commands and pass/fail counts.
|
||||
- [x] Any failures or skipped areas are recorded with reason.
|
||||
|
||||
### RETEST-004 - Run frontend end-to-end verification serially
|
||||
Status: DONE
|
||||
Dependency: RETEST-003
|
||||
Owners: QA
|
||||
Task description:
|
||||
- Run browser-based UI verification against the rebuilt local environment after frontend unit tests finish, using the host aliases now present on the machine.
|
||||
- Validate the key local integration and source-management flows visible through the web UI, and capture failures with enough detail to reproduce.
|
||||
|
||||
Completion criteria:
|
||||
- [x] Browser-based frontend verification runs after unit tests, not before.
|
||||
- [x] Key UI flows are exercised against the local stack.
|
||||
- [x] Outcomes and blockers are recorded with concrete evidence.
|
||||
|
||||
### RETEST-005 - Restore router frontdoor startup for local browser access
|
||||
Status: DONE
|
||||
Dependency: RETEST-004
|
||||
Owners: Developer / QA
|
||||
Task description:
|
||||
- Diagnose and fix the local `router-gateway` startup failure that leaves `https://stella-ops.local/` unavailable.
|
||||
- Keep the gateway's fail-fast configuration validation intact; remove the duplicate `/platform` route at the correct configuration-loading layer rather than weakening validation.
|
||||
- Reverify the router with focused gateway tests plus live HTTP/TLS checks against the local host aliases.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The local router no longer crash-loops on duplicate `/platform` routes.
|
||||
- [x] `https://stella-ops.local/` responds without `ERR_CONNECTION_CLOSED`.
|
||||
- [x] Focused router tests and live frontdoor checks are recorded in the sprint log.
|
||||
|
||||
### RETEST-006 - Restore local admin login convergence after Authority bootstrap race
|
||||
Status: DONE
|
||||
Dependency: RETEST-005
|
||||
Owners: Developer / QA
|
||||
Task description:
|
||||
- Diagnose and fix the local login failure for the documented demo admin account (`admin / Admin@Stella2026!`).
|
||||
- Keep the Authority bootstrap and password-verification paths deterministic; do not paper over the issue by weakening authentication checks.
|
||||
- Reverify the fix with focused Authority Standard Plugin tests plus a real browser/UI login against `https://stella-ops.local/`.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The Authority bootstrap admin user is created reliably even when PostgreSQL is not ready on the first startup attempt.
|
||||
- [x] `admin / Admin@Stella2026!` can sign in successfully through the local browser flow.
|
||||
- [x] Focused tests and live login evidence are recorded in the sprint log.
|
||||
|
||||
### RETEST-007 - Restore local scripts catalog convergence for `/ops/scripts`
|
||||
Status: DONE
|
||||
Dependency: RETEST-006
|
||||
Owners: Developer / QA
|
||||
Task description:
|
||||
- Diagnose and fix the local `/ops/scripts` failure so the scripts catalog loads through the browser against the rebuilt local stack.
|
||||
- Keep schema ownership deterministic: the service that serves `/api/v2/scripts` must own and auto-migrate the `scripts` schema on startup instead of depending on another module's migrations or manual SQL.
|
||||
- Reverify the fix with focused Release Orchestrator tests plus live API/browser checks against `https://stella-ops.local/ops/scripts`.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The `scripts` schema converges automatically on fresh local startup under the owning Release Orchestrator service.
|
||||
- [x] `GET /api/v2/scripts` succeeds through the local gateway without `relation "scripts.scripts" does not exist`.
|
||||
- [x] The `/ops/scripts` UI loads script data instead of showing the generic load failure banner.
|
||||
|
||||
### RETEST-008 - Remove bogus local feed-mirror timeout state for `mirror-osv-001`
|
||||
Status: DONE
|
||||
Dependency: RETEST-007
|
||||
Owners: Developer / QA
|
||||
Task description:
|
||||
- Diagnose the `Sync Error / Connection timeout after 30s.` message shown on the local mirror detail page at `/ops/operations/feeds/mirror/mirror-osv-001`.
|
||||
- Keep local feed-mirror behavior truthful: if the mirror management surface is currently backed by seeded/stubbed data, it must not report a fake runtime timeout that never actually occurred.
|
||||
- Reverify the fix with focused API/UI checks against the local stack.
|
||||
|
||||
Completion criteria:
|
||||
- [x] The local OSV mirror detail no longer reports a fabricated timeout error.
|
||||
- [x] Backend and frontend seed/mock fixtures stay aligned for the OSV mirror state.
|
||||
- [x] Live local verification is recorded in the sprint log.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2026-04-09 | Sprint created for serial backend/frontend regression retesting of the rebuilt local stack. | QA |
|
||||
| 2026-04-09 | Backend unit lane selected and ordered for serial execution: `StellaOps.Platform.WebService.Tests`, `StellaOps.Integrations.Tests`, `StellaOps.Integrations.Plugin.Tests`, `StellaOps.Concelier.WebService.Tests`, `StellaOps.Gateway.WebService.Tests`, `StellaOps.Scheduler.Worker.Tests`, `StellaOps.Graph.Api.Tests`, `StellaOps.Findings.Ledger.Tests`, `StellaOps.Timeline.WebService.Tests`. | QA |
|
||||
| 2026-04-09 | Backend unit lane executed serially. Passed: `StellaOps.Integrations.Plugin.Tests` (18/18), `StellaOps.Scheduler.Worker.Tests` (139/139), `StellaOps.Graph.Api.Tests` (77/77). Failed: `StellaOps.Platform.WebService.Tests` (9 failing assertions around seed/migration/quota flows), `StellaOps.Integrations.Tests` (7 host-boot failures from missing `StellaOps.Audit.Emission`), `StellaOps.Gateway.WebService.Tests` (1 readiness regression), `StellaOps.Findings.Ledger.Tests` (5 contract failures due to PostgreSQL connectivity), `StellaOps.Timeline.WebService.Tests` (17 failures due to PostgreSQL startup migration connectivity). `StellaOps.Concelier.WebService.Tests` was aborted after a non-advancing run following host startup, and is currently treated as a hang-class blocker pending deeper triage. | QA |
|
||||
| 2026-04-09 | Backend E2E over host aliases completed. Healthy surfaces: `platform` (`/healthz`, `/api/v1/platform/health/summary`), `integrations` (`providers` 17, catalog 13/13 active), and Concelier static source-management (`POST /api/v1/advisory-sources/check` => `74/74` healthy). Blockers: `router` unreachable (`stellaops-router-gateway` unhealthy), `scheduler` connection closed (`stellaops-scheduler-web` unhealthy), `graph` unreachable while `stellaops-graph-api` restart-loops, `findings` unreachable while `stellaops-findings-ledger-web` restart-loops, and `timeline` unreachable while `stellaops-timeline-web` restart-loops. Additional defect: Concelier UI read-model endpoints (`GET /api/v1/advisory-sources`, `/summary`) only report 3 sources (2 healthy, 1 stale), which diverges from the 74-source static catalog/check surface. | QA |
|
||||
| 2026-04-09 | Frontend unit lane executed serially. `npm test -- --watch=false` failed in batch 1 before test execution because the Angular build is broken (`AuditStatsSummary.byModule` missing new module keys, missing `toggleExpand`, missing scheduler component imports, and implicit-`any` callbacks in scheduler spec files). `npm run test:topology` also failed at compile time with a large missing-file/type-program regression across route specs and lazy-loaded feature components. `npm run test:active-surfaces` passed (`7` files / `25` tests). | QA |
|
||||
| 2026-04-09 | Frontend E2E over the live frontdoor is currently blocked at entry. Both `npm run test:e2e:live:auth` and `npm run test:e2e:live:changed-surfaces` failed on `page.goto('https://stella-ops.local/welcome')` with `net::ERR_CONNECTION_CLOSED`; plain HTTP `http://stella-ops.local` also returned an empty reply, matching the unhealthy router/frontdoor state. | QA |
|
||||
| 2026-04-09 | Expanded serial test sweep requested. Second-pass backend order selected: `StellaOps.Authority.Core.Tests`, `StellaOps.Auth.ServerIntegration.Tests`, `StellaOps.Policy.Engine.Tests`, `StellaOps.Policy.Scoring.Tests`, `StellaOps.Scanner.Core.Tests`, `StellaOps.Scanner.WebService.Tests`, `StellaOps.ReleaseOrchestrator.Workflow.Tests`, `StellaOps.ReleaseOrchestrator.Integration.Tests`, `StellaOps.EvidenceLocker.Tests`, `StellaOps.BinaryIndex.WebService.Tests`, `StellaOps.Doctor.WebService.Tests`, `StellaOps.ReachGraph.WebService.Tests`, `StellaOps.Notify.WebService.Tests`, `StellaOps.VexHub.WebService.Tests`, `StellaOps.Unknowns.WebService.Tests`, followed by repo-level integration/E2E projects as time permits. | QA |
|
||||
| 2026-04-09 | Repository test surface enumerated for scope control: `rg --files src -g "*Tests.csproj"` returned `503` test projects under `src`, confirming the retest remains a sampled regression sweep rather than an exhaustive full-repo run. | QA |
|
||||
| 2026-04-09 | Second-pass backend sweep executed serially. Passed: `StellaOps.Authority.Core.Tests` (46/46), `StellaOps.Auth.ServerIntegration.Tests` (30/30), `StellaOps.Policy.Scoring.Tests` (263/263), `StellaOps.Scanner.Core.Tests` (339/339), `StellaOps.ReleaseOrchestrator.Workflow.Tests` (488/488), `StellaOps.ReleaseOrchestrator.Integration.Tests` (12/12), `StellaOps.EvidenceLocker.Tests` (132/132), `StellaOps.BinaryIndex.WebService.Tests` (54/54), `StellaOps.Doctor.WebService.Tests` (35/35), `StellaOps.ReachGraph.WebService.Tests` (26/26), `StellaOps.Unknowns.WebService.Tests` (10/10). Failed: `StellaOps.Policy.Engine.Tests` (4 failures from duplicate endpoint name `ListRiskProfiles` in host boot), `StellaOps.Notify.WebService.Tests` (4 endpoint-contract regressions: readiness `400`, normalize endpoints `401`), `StellaOps.VexHub.WebService.Tests` (compile break: `InMemoryVexSourceRepository` missing `UpdateFailureTrackingAsync`). `StellaOps.Scanner.WebService.Tests` was aborted after a non-advancing run entered execution and left only an empty `TestResults` log, so it is currently tracked as a stall-class blocker. | QA |
|
||||
| 2026-04-09 | Third-pass serial backend sweep focused on root-cause isolation. Passed: `StellaOps.Notify.Core.Tests` (59/59), `StellaOps.Notify.Engine.Tests` (33/33), `StellaOps.Notify.Persistence.Tests` (109/109), `StellaOps.Scheduler.Persistence.Tests` (95/95), `StellaOps.Workflow.WebService.Tests` (4/4), `StellaOps.VexHub.Core.Tests` (1/1), `StellaOps.Concelier.Core.Tests` (569/569), `StellaOps.Concelier.SourceIntel.Tests` (61/61), `StellaOps.Feedser.Core.Tests` (81/81). Failed: `StellaOps.Scheduler.WebService.Tests` (107 failures / 18 passes, all cascading from unresolved DI registrations for `IImpactSnapshotRepository`, `IPolicyRunJobRepository`, and `IGraphJobRepository`), `StellaOps.Workflow.Engine.Tests` (6 failures / 133 passes, canonical workflow rendering now returns only 5 of 10 expected definitions and round-trip compilation injects `assign-business-reference` producing non-identical canonical JSON), `StellaOps.Concelier.Persistence.Tests` (29 failures / 207 passes, missing PostgreSQL relations including `kev_flags`, `sources`, and `merge_events`). This isolates Notify web regressions to the web surface rather than core/engine/persistence layers, and isolates Scheduler breakage to web-host service wiring rather than persistence repositories. | QA |
|
||||
| 2026-04-09 | Fourth-pass serial backend sweep widened coverage without code changes. Passed: `StellaOps.Notify.Queue.Tests` (14/14), `StellaOps.Scheduler.Queue.Tests` (102/102), `StellaOps.Scheduler.Plugin.Tests` (16/16), `StellaOps.Workflow.DataStore.PostgreSQL.Tests` (13/13), `StellaOps.Feedser.BinaryAnalysis.Tests` (26/26), `StellaOps.Concelier.SbomIntegration.Tests` (130/130). Failed: `StellaOps.Notify.Worker.Tests` failed at build time because the referenced worker project path `src/Notify/StellaOps.Notify.Worker/StellaOps.Notify.Worker.csproj` does not exist and worker namespaces/types (`Handlers`, `Processing`, `INotifyEventHandler`, `NotifyWorkerOptions`) cannot be resolved; `StellaOps.Scheduler.Models.Tests` failed 1/143 because `ScheduleSample_RoundtripsThroughCanonicalSerializer` now emits extra fields (`jobKind`, `source`) and no longer round-trips to the expected canonical sample; `StellaOps.Excititor.WebService.Tests` failed 7/37 because OIDC metadata bootstrapping points to `http://localhost/.well-known/openid-configuration` and rejects non-HTTPS (`IDX20108`). `StellaOps.Concelier.Integration.Tests` did not execute any real integration coverage because its only test was skipped behind `STELLAOPS_INTEGRATION_TESTS=true`. `StellaOps.Workflow.Renderer.Tests` was manually stopped after entering a long-running artifact-generation loop under `TestResults/workflow-renderings/20260409/DocumentProcessingWorkflow` with no terminal result emitted during the observation window. | QA |
|
||||
| 2026-04-09 | Fifth-pass serial workflow and Excititor-internal sweep executed without code changes. Passed: `StellaOps.Workflow.Signaling.Redis.Tests` (2/2), `StellaOps.Workflow.DataStore.MongoDB.Tests` (11/11), `StellaOps.Excititor.Core.Tests` (185/185), `StellaOps.Excititor.Policy.Tests` (2/2), `StellaOps.Excititor.Export.Tests` (16/16), `StellaOps.Excititor.Formats.OpenVEX.Tests` (15/15), `StellaOps.Excititor.Formats.CycloneDX.Tests` (15/15), `StellaOps.Excititor.Formats.CSAF.Tests` (13/13), `StellaOps.Excititor.Plugin.Tests` (25/25), `StellaOps.Excititor.Attestation.Tests` (17/17), `StellaOps.Excititor.Worker.Tests` (70/70), and `StellaOps.Excititor.ArtifactStores.S3.Tests` (2/2). Failed: `StellaOps.Workflow.DataStore.Oracle.Tests` (26 failures / 14 passes; DI/runtime configuration gaps including missing `StackExchange.Redis.IConnectionMultiplexer`, missing `IOracleAqTransport`, and unconfigured EF `DbContext` provider), and `StellaOps.Excititor.Persistence.Tests` (48 failures / 6 passes; shared Postgres test fixture cannot apply Excititor migrations because PostgreSQL reports `42601: syntax error at or near "("`). `StellaOps.Excititor.Core.UnitTests` remained a harness anomaly: `dotnet test` exited `0` after restore/build, but emitted no test-host execution and produced no `TestResults`. | QA |
|
||||
| 2026-04-09 | Fifth-pass serial Excititor connector sweep broadened coverage beyond the core libraries. Passed: `StellaOps.Excititor.Connectors.Cisco.CSAF.Tests` (9/9), `StellaOps.Excititor.Connectors.RedHat.CSAF.Tests` (13/14 with 1 skip), `StellaOps.Excititor.Connectors.Ubuntu.CSAF.Tests` (10/10), and `StellaOps.Excititor.Connectors.Oracle.CSAF.Tests` (10/10). This confirms the connector-specific CSAF import/export layers are largely healthy even while Excititor persistence and web-host/OIDC paths remain broken. | QA |
|
||||
| 2026-04-09 | Sixth-pass serial Excititor connector sweep continued into additional source types. Passed: `StellaOps.Excititor.Connectors.MSRC.CSAF.Tests` (12/12) and `StellaOps.Excititor.Connectors.OCI.OpenVEX.Attest.Tests` (17/17). The connector matrix continues to indicate localized breakage in persistence and web-host startup rather than a connector-wide ingestion/export regression. | QA |
|
||||
| 2026-04-09 | User-reported browser failure on `https://stella-ops.local/` was rechecked live. `curl -vk https://stella-ops.local/` resolves `stella-ops.local` to `127.1.0.1` but fails during TLS handshake (`schannel: failed to receive handshake`), and `docker ps` still reports `stellaops-router-gateway` as `unhealthy`. Router logs confirm the same startup blocker as earlier: `Duplicate route path '/platform' already defined by Route[96]`, so the frontdoor remains non-functional even though many backend services and test projects are healthy. | QA |
|
||||
| 2026-04-09 | Router/frontdoor defect fixed without weakening gateway validation. Root cause was config composition in local compose: `devops/compose/router-gateway-local.json` was being mounted as `/app/appsettings.local.json`, so its route table merged with the baked-in gateway `appsettings.json` and duplicated `/platform`. The local router config was normalized into a standalone gateway configuration (`Node`, `Transports`, `Routing`, `OpenApi`, `Auth`, `Health`, `Routes`, `Logging`) and compose now mounts it as `/app/appsettings.json` in both local compose variants so it replaces, rather than merges with, the baked-in route table. A guard test was added to keep the compose config standalone. | Developer / QA |
|
||||
| 2026-04-09 | Focused router verification after the fix: `dotnet test src/Router/__Tests/StellaOps.Gateway.WebService.Tests/StellaOps.Gateway.WebService.Tests.csproj -v minimal` now runs with the new compose-config guard and reports `289` passed / `1` failed, where the sole remaining failure is the pre-existing readiness regression (`HealthReady_ReturnsOk_WhenRequiredMicroserviceIsRegistered` expected `200`, got `503`). The router container was then force-recreated from `devops/compose`, after which `docker ps` shows `stellaops-router-gateway` `healthy`, `curl -vk https://stella-ops.local/` returns `HTTP/1.1 200 OK` with the Stella Ops HTML shell, and `curl -I http://stella-ops.local/` returns `302 Found` redirecting to HTTPS. | Developer / QA |
|
||||
| 2026-04-09 | Local login failure triaged to Authority bootstrap convergence, not bad credentials. Live Authority logs showed the browser POSTs reaching `/authorize`, but the standard plugin rejected `admin` as an unknown user. Startup logs showed the root cause: during local cold start the plugin attempted bootstrap seeding once, hit `Npgsql.NpgsqlException: Failed to connect ... Connection refused`, logged the error, and never retried, leaving the service healthy but without the bootstrap admin account. | Developer / QA |
|
||||
| 2026-04-09 | Authority standard-plugin bootstrap hardening shipped. `StandardPluginBootstrapper` now retries the bootstrap pass instead of abandoning seeding after the first transient storage failure, and the focused Authority Standard Plugin suite now passes `44/44`, including a new transient-failure regression test. Module docs were updated to record that bootstrap user/client provisioning now converges when storage becomes reachable after startup. | Developer / QA |
|
||||
| 2026-04-09 | Live deployment and browser verification completed. Rebuilding `stellaops/authority:dev` through `devops/docker/Dockerfile.platform` was blocked by stale Dockerfile paths (`/src/Signer`, `/src/Scheduler` no longer exist), so the updated Authority host was published locally and copied into the running container before restarting `stellaops-authority`. On restart, Authority logs confirmed `assigned role 'admin' to bootstrap user 'admin'`. The real browser-level script `npm run test:e2e:live:auth` then succeeded and produced a full `demo-prod` session for `admin`, landing on `https://stella-ops.local/?tenant=demo-prod&regions=apac,eu-west,us-east,us-west` with title `Dashboard - StellaOps`. | Developer / QA |
|
||||
| 2026-04-10 | `/ops/scripts` was fixed at the owning-service layer. Release Orchestrator scripts now bind isolated `ScriptsPostgresOptions`, embed their own `scripts` schema DDL/seed migration, and register a dedicated startup migration host. A shared infrastructure bug also had to be corrected: `AddStartupMigrations(...)` previously used `AddHostedService(...)`, which deduplicated the second migration host in a single service and prevented the `scripts` schema migrator from starting beside the core `release_orchestrator` migrator. Focused Release Orchestrator integration tests now pass `15/15`, release-orchestrator was republished into the live container, startup logs show `Migration.ReleaseOrchestrator.Scripts` applying `001_initial.sql`, and a real authenticated Playwright probe confirmed `https://stella-ops.local/ops/scripts` loads with heading `Scripts`, no generic error banner, and 4 rendered scripts while `GET /api/v2/scripts` returns `200`. | Developer / QA |
|
||||
| 2026-04-13 | Feed-mirror backend convergence work replaced the remaining active Concelier placeholder handlers for snapshot actions, retention updates, air-gap bundles/imports, and version locks with persisted PostgreSQL/filesystem-backed behavior. A focused xUnit v3 class run (`StellaOps.Concelier.WebService.Tests.AdvisorySourceEndpointsTests`) verified the touched endpoint slice at `10` total / `9` passed / `1` pre-existing unrelated failure (`ListEndpoints_WithoutTenantHeader_ReturnsBadRequest` returns `200` instead of `400`). | Developer / QA |
|
||||
| 2026-04-13 | Republish of the updated Concelier host exposed two production-only runtime faults that the prior local binary had been hiding: embedded Release Orchestrator topology stores were still registered through an unbound `Func<Guid>` path, and `001_regions_and_infra_bindings.sql` used an invalid table-level `UNIQUE(... COALESCE(...))` expression. Both were fixed, Concelier was republished into the live container, and startup now applies `ReleaseOrchestrator.Environment` migrations cleanly before reaching `healthy`. | Developer / QA |
|
||||
| 2026-04-13 | Live browser verification of `https://stella-ops.local/ops/operations/feeds/mirror/mirror-osv-001` succeeded through the authenticated frontdoor. The page no longer renders the `Sync Error` banner, and the browser-backed API call to `/api/v1/concelier/mirrors/mirror-osv-001` returned `200` with `feedType=osv`, `syncStatus=synced`, and `errorMessage=null`. | Developer / QA |
|
||||
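The router defect recorded in the log above (a local route table layered over the baked-in one, tripping fail-fast duplicate-route validation) can be reproduced in miniature. This is a hedged Python sketch: the route lists are invented, and the index-keyed overlay is an assumption about how layered configuration sources combine arrays, not a copy of the gateway's loader. Only the error wording mirrors the logged message.

```python
def overlay_by_index(base: list[str], override: list[str]) -> list[str]:
    """Layer `override` on top of `base` position by position.

    Entries at the same index are replaced; base entries beyond the
    override's length survive, which is how stale routes can leak through.
    """
    merged = list(base)
    for index, route in enumerate(override):
        if index < len(merged):
            merged[index] = route
        else:
            merged.append(route)
    return merged


def validate_routes(routes: list[str]) -> None:
    """Fail fast on the first duplicate route path."""
    seen: dict[str, int] = {}
    for index, path in enumerate(routes):
        if path in seen:
            raise ValueError(
                f"Duplicate route path '{path}' already defined by Route[{seen[path]}]"
            )
        seen[path] = index


# Hypothetical route tables, for illustration only.
BAKED_IN = ["/api", "/doctor", "/platform"]
LOCAL = ["/platform", "/api"]

merged = overlay_by_index(BAKED_IN, LOCAL)  # ['/platform', '/api', '/platform']
```

Validating `LOCAL` on its own passes, while validating `merged` raises. That matches the shape of the fix: mount the local file as a full replacement so validation stays strict and only one route table is ever loaded.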
|
||||
## Decisions & Risks
|
||||
- Decision: the retest follows the user-specified order exactly: backend unit, backend E2E, frontend unit, frontend E2E.
|
||||
- Decision: no more than one project-level test run will execute at once, even where the repo could support more concurrency.
|
||||
- Risk: the fresh bootstrap still leaves several core services unhealthy (`router-gateway`, `findings-ledger-web`, `timeline-web`, `graph-api`, `scheduler-web`), so backend and frontend E2E coverage may be partially blocked even if unit suites pass.
|
||||
- Risk: some `.NET` test entrypoints in this repo use Microsoft.Testing.Platform, which previously ignored `--filter` in at least one integration suite; command output must be inspected carefully so suite totals are not misreported as targeted evidence.
|
||||
- Risk: the repository currently exposes `503` distinct `*Tests.csproj` projects under `src`, so this sprint is explicitly a risk-based sampled regression sweep, not a complete full-repo certification run.
|
||||
- Risk: additional second-pass failures show contract drift and compile drift beyond the original container-health blockers, notably duplicate endpoint names in Policy host boot, normalize/auth regressions in Notify, and an interface-implementation mismatch in VexHub test infrastructure.
|
||||
- Risk: deeper isolation shows several failures are layer-specific rather than module-wide: Notify core/engine/persistence pass while web contracts fail; Scheduler persistence passes while the web host cannot resolve repository services; Concelier core/source-intel/feed ingestion pass while persistence fails on missing relations and the web-service suite still hangs.
|
||||
- Risk: the latest batch adds more compile/configuration drift not visible from service health alone: missing Notify worker project references, scheduler model sample drift, Excititor/OIDC test assumptions that now require HTTPS metadata, Concelier integration tests gated entirely by `STELLAOPS_INTEGRATION_TESTS`, and a long-running workflow renderer suite that generates artifacts without reaching a terminal result promptly.
|
||||
- Risk: the latest workflow datastore pass shows Oracle-specific integration coverage is significantly behind PostgreSQL/MongoDB parity: Oracle tests fail on missing Redis multiplexer wiring, missing `IOracleAqTransport`, and missing EF provider configuration inside the runtime host.
|
||||
- Risk: Excititor internals are not uniformly broken. Most core/policy/export/format/worker/connector projects are green, but persistence is heavily red because the shared Postgres fixture cannot apply Excititor migrations (`42601` syntax error near `(`), and `StellaOps.Excititor.Core.UnitTests` appears miswired as a non-executing test harness despite `dotnet test` returning success.
|
||||
- Decision: local compose now treats `devops/compose/router-gateway-local.json` as a full replacement gateway configuration mounted at container `appsettings.json`. This preserves strict duplicate-route validation while preventing local route-table double-loading.
|
||||
- Risk: the gateway suite still has one unrelated readiness failure (`HealthReady_ReturnsOk_WhenRequiredMicroserviceIsRegistered` returning `503`), so router startup/frontdoor availability is fixed but the router test project is not yet fully green.
|
||||
- Decision: Authority standard-plugin bootstrap provisioning now retries transient startup failures so the documented local admin account and seeded console client converge after PostgreSQL becomes reachable, rather than depending on startup order luck.
|
||||
- Risk: `devops/docker/Dockerfile.platform` is stale relative to the current repo layout (`/src/Signer` and `/src/Scheduler` COPY steps fail), so image rebuilds for the updated Authority service required a temporary local publish plus container copy path instead of a clean Docker target rebuild.
|
||||
- Decision: startup schema migration hosts now register as explicit `IHostedService` singletons instead of through `AddHostedService(...)`, so one service can own and auto-migrate multiple PostgreSQL schemas without the second host being silently deduplicated.
|
||||
- Risk: the local browser trust chain is still not accepted by the ad hoc Playwright CLI browser session in this terminal, so the final UI verification for `/ops/scripts` used the repo’s existing `live-frontdoor-auth.mjs` harness with `ignoreHTTPSErrors: true` instead of the Playwright MCP bridge.
|
||||
|
||||
- Decision: the active Concelier feed-mirror surfaces now use real backend persistence for snapshot operations, retention settings, bundle creation/download/import, and version locks instead of placeholder `501 Not Implemented` handlers or in-memory catalogs behind the live UI routes.
|
||||
- Decision: Concelier's embedded Release Orchestrator topology runtime now binds its Postgres-backed stores explicitly through `ConcelierTopologyIdentityAccessor` rather than an unbound `Func<Guid>` DI primitive, because the latter only failed once the live service was republished and the deletion worker activated on startup.
|
||||
- Decision: `src/ReleaseOrchestrator/__Libraries/StellaOps.ReleaseOrchestrator.Environment/Migrations/001_regions_and_infra_bindings.sql` now enforces the infrastructure-binding scope uniqueness rule with a named unique index, not an invalid table-level `UNIQUE` constraint containing `COALESCE(...)`.
|
||||
- Risk: the Concelier WebService test project still has a pre-existing auth-contract failure (`ListEndpoints_WithoutTenantHeader_ReturnsBadRequest` currently returns `200` instead of `400`), and the repo's Microsoft.Testing.Platform setup still rejects legacy `dotnet test --filter` targeted evidence (`MTP0001`). Focused verification for this fix path therefore used the xUnit v3 class runner plus live frontdoor browser checks.
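
The unique-index decision for `001_regions_and_infra_bindings.sql` above can be illustrated with a minimal sketch. The table name, column names, and index name below are hypothetical and are not taken from the actual migration; only the technique (an expression-based unique index replacing a table-level `UNIQUE` constraint) comes from the decision itself. PostgreSQL accepts only plain column names in a table-level `UNIQUE` constraint, so an expression such as `COALESCE(...)` there fails with a syntax error, whereas a unique index may be built over expressions.

```sql
-- Hypothetical schema for illustration only; the real migration's columns may differ.
CREATE TABLE IF NOT EXISTS infrastructure_bindings_example (
    id           UUID PRIMARY KEY,
    tenant_id    UUID NOT NULL,
    scope_type   TEXT NOT NULL,
    scope_id     UUID NULL,          -- NULL is assumed to mean "tenant-wide" in this sketch
    binding_role TEXT NOT NULL
    -- Not valid PostgreSQL: table-level UNIQUE takes column names only, so this
    -- would raise a syntax error at the opening parenthesis of COALESCE:
    --   , UNIQUE (tenant_id, scope_type, COALESCE(scope_id, '...'), binding_role)
);

-- Valid alternative: a named unique index over an expression.
-- COALESCE maps NULL scope_id values to a sentinel so that two tenant-wide
-- rows with the same role collide instead of both being treated as distinct.
CREATE UNIQUE INDEX IF NOT EXISTS ux_infrastructure_bindings_example_scope
    ON infrastructure_bindings_example (
        tenant_id,
        scope_type,
        (COALESCE(scope_id, '00000000-0000-0000-0000-000000000000'::uuid)),
        binding_role
    );
```

Naming the index also makes any later uniqueness violation easier to trace back to this rule in error messages.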
|
||||
|
||||
## Next Checkpoints
|
||||
- Finish backend unit lane and decide whether failures require code fixes before backend E2E.
|
||||
- Use the live stack for backend API verification once the backend unit lane is complete.
|
||||
- Proceed to frontend unit and frontend E2E only after backend lanes are recorded.
|
||||