Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit c58a236d70 (parent fbd24e71de), branch master, 2026-03-27 12:28:00 +02:00
326 changed files with 18500 additions and 463 deletions

---
checkId: check.agent.capacity
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, capacity, performance]
---
# Agent Capacity
## What It Checks
Verifies that agents have sufficient capacity to handle incoming tasks. The check queries the agent store for the current tenant and categorizes agents by status:
1. **Fail** if zero agents have `AgentStatus.Active` -- no agents are available to run tasks.
2. **Pass** if at least one active agent exists, reporting the active-vs-total count.
Evidence collected: `ActiveAgents`, `TotalAgents`.
Thresholds defined in source (not yet wired to the simplified implementation):
- High utilization: >= 90%
- Warning utilization: >= 75%
The check skips with a warning if the tenant ID is missing or unparseable.
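As a rough illustration, the categorization above reduces to the following Python sketch. The shipped check is a C# class; the `Agent` record and `check_capacity` function here are invented for the example:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the agent store records used by the real check.
@dataclass
class Agent:
    name: str
    active: bool  # corresponds to AgentStatus.Active

def check_capacity(agents: list) -> tuple:
    """Fail when no agent is active; otherwise pass with the active/total count.

    The 90% / 75% utilization thresholds mentioned above are defined in source
    but not yet wired into this simplified logic.
    """
    active = sum(1 for a in agents if a.active)
    total = len(agents)
    if active == 0:
        return ("fail", f"No active agents (0/{total})")
    return ("pass", f"{active}/{total} agents active")
```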
## Why It Matters
When no active agents are available, the platform cannot execute deployment tasks, scans, or any agent-dispatched work. Releases stall, scan queues grow, and SLA timers expire silently. Detecting zero-capacity before a promotion attempt prevents failed deployments and on-call pages.
## Common Causes
- All agents are offline (host crash, network partition, maintenance window)
- No agents have been registered for this tenant
- Agents exist but are in `Revoked` or `Inactive` status and none remain `Active`
- Agent bootstrap was started but never completed
## How to Fix
### Docker Compose
```bash
# Check agent container health
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# View agent container logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 100
# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```
### Bare Metal / systemd
```bash
# Check agent service status
systemctl status stella-agent
# Restart agent service
sudo systemctl restart stella-agent
# Bootstrap a new agent if none registered
stella agent bootstrap --name agent-01 --env production --platform linux
```
### Kubernetes / Helm
```bash
# Check agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# Describe agent deployment
kubectl describe deployment stellaops-agent -n stellaops
# Scale agent replicas
kubectl scale deployment stellaops-agent --replicas=2 -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.capacity
```
## Related Checks
- `check.agent.heartbeat.freshness` -- agents may be registered but not sending heartbeats
- `check.agent.stale` -- agents offline for extended periods may need decommissioning
- `check.agent.resource.utilization` -- active agents may be resource-constrained

---
checkId: check.agent.certificate.expiry
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security, quick]
---
# Agent Certificate Expiry
## What It Checks
Inspects the `CertificateExpiresAt` field on every non-revoked, non-inactive agent and classifies each into one of four buckets:
1. **Expired** -- `CertificateExpiresAt` is in the past. Result: **Fail**.
2. **Critical** -- certificate expires within **1 day** (24 hours). Result: **Fail**.
3. **Warning** -- certificate expires within **7 days**. Result: **Warn**.
4. **Healthy** -- certificate has more than 7 days remaining. Result: **Pass**.
The check short-circuits to the most severe bucket found. Evidence includes per-agent names with time-since-expiry or time-until-expiry, plus counts of `TotalActive`, `Expired`, `Critical`, and `Warning` agents.
Agents whose `CertificateExpiresAt` is null or default are silently skipped (certificate info not available). If no active agents exist the check is skipped entirely.
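The four-bucket classification can be sketched in Python as follows (illustrative only; the real check is C# and reads `CertificateExpiresAt` per agent):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def classify_certificate(expires_at: Optional[datetime],
                         now: Optional[datetime] = None) -> str:
    """Bucket one agent certificate by time remaining, per the thresholds above."""
    now = now or datetime.now(timezone.utc)
    if expires_at is None:
        return "skipped"               # certificate info not available
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "expired"               # -> Fail
    if remaining <= timedelta(days=1):
        return "critical"              # -> Fail
    if remaining <= timedelta(days=7):
        return "warning"               # -> Warn
    return "healthy"                   # -> Pass
```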
## Why It Matters
Agent mTLS certificates authenticate the agent to the orchestrator. An expired certificate causes the agent to fail heartbeats, reject task assignments, and drop out of the fleet. In production this means deployments and scans silently stop being dispatched to that agent, potentially leaving environments unserviced.
## Common Causes
- Certificate auto-renewal is disabled on the agent
- Agent was offline when renewal was due (missed the renewal window)
- Certificate authority is unreachable from the agent host
- Agent bootstrap was incomplete (certificate provisioned but auto-renewal not configured)
- Certificate renewal threshold not yet reached (warning-level)
- Certificate authority rate limiting prevented renewal (critical-level)
## How to Fix
### Docker Compose
```bash
# Check certificate expiry for agent containers
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent health --show-cert
# Force certificate renewal
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent renew-cert --force
# Verify auto-renewal configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent config show | grep auto_renew
```
### Bare Metal / systemd
```bash
# Force certificate renewal on an affected agent
stella agent renew-cert --agent-id <agent-id> --force
# If agent is unreachable, re-bootstrap
stella agent bootstrap --name <agent-name> --env <environment>
# Verify auto-renewal is enabled
stella agent config --agent-id <agent-id> | grep auto_renew
# Check agent logs for renewal failures
stella agent logs --agent-id <agent-id> --level warn
```
### Kubernetes / Helm
```bash
# Check cert expiry across agent pods
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent health --show-cert
# Force renewal via pod exec
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent renew-cert --force
# If using cert-manager, check Certificate resource
kubectl get certificate -n stellaops
kubectl describe certificate stellaops-agent-tls -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.certificate.expiry
```
## Related Checks
- `check.agent.certificate.validity` -- verifies certificate chain of trust (not just expiry)
- `check.agent.heartbeat.freshness` -- expired certs cause heartbeat failures
- `check.agent.stale` -- agents with expired certs often show as stale

---
checkId: check.agent.certificate.validity
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security]
---
# Agent Certificate Validity
## What It Checks
Validates the full certificate chain of trust for agent mTLS certificates. The check is designed to verify:
1. Certificate is signed by a trusted CA
2. Certificate chain is complete (no missing intermediates)
3. No revoked certificates in the chain (CRL/OCSP check)
4. Certificate subject matches the agent's registered identity
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The framework and metadata are wired; the chain-validation logic is not yet connected.
Evidence collected: none yet (pending implementation).
The check requires `IAgentStore` to be registered in DI; otherwise it will not run.
## Why It Matters
A valid certificate expiry date (checked by `check.agent.certificate.expiry`) is necessary but not sufficient. An agent could present a non-expired certificate that was signed by an untrusted CA, has a broken chain, or has been revoked. Any of these conditions would allow an impersonating agent to receive task dispatches or exfiltrate deployment secrets.
## Common Causes
- CA certificate rotated but agent still presents cert signed by old CA
- Intermediate certificate missing from agent's cert bundle
- Certificate revoked via CRL but agent not yet re-provisioned
- Agent identity mismatch after hostname change or migration
## How to Fix
### Docker Compose
```bash
# Inspect agent certificate chain
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# Verify chain against CA bundle
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
openssl verify -CAfile /etc/stellaops/ca/ca.crt /etc/stellaops/agent/tls.crt
```
### Bare Metal / systemd
```bash
# Inspect agent certificate
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# Verify certificate chain
openssl verify -CAfile /etc/stellaops/ca/ca.crt -untrusted /etc/stellaops/ca/intermediate.crt \
/etc/stellaops/agent/tls.crt
# Re-bootstrap if chain is broken
stella agent bootstrap --name <agent-name> --env <environment>
```
### Kubernetes / Helm
```bash
# Check certificate in agent pod
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# If using cert-manager, check CertificateRequest status
kubectl get certificaterequest -n stellaops
kubectl describe certificaterequest <name> -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.certificate.validity
```
## Related Checks
- `check.agent.certificate.expiry` -- checks expiry dates (complementary to chain validation)
- `check.agent.heartbeat.freshness` -- invalid certs prevent heartbeat communication

---
checkId: check.agent.cluster.health
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, ha, resilience]
---
# Agent Cluster Health
## What It Checks
Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify:
1. All cluster members are reachable
2. A leader is elected and healthy
3. State synchronization is working across members
4. Failover is possible if the current leader goes down
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet perform cluster health probes.
## Why It Matters
In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.
## Common Causes
- Network partition between cluster members
- Leader node crashed without triggering failover
- State sync backlog due to high task volume
- Clock skew between cluster members causing consensus protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)
## How to Fix
### Docker Compose
```bash
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster
# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```
Set clustering configuration in your `.env` or compose override:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```
### Bare Metal / systemd
```bash
# Check cluster status
stella agent cluster status
# View cluster member health
stella agent cluster members
# Force leader re-election if leader is unhealthy
stella agent cluster elect --force
# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```
### Kubernetes / Helm
```bash
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster
# Helm values for clustering
# agent:
# cluster:
# enabled: true
# replicas: 3
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```
## Verification
```bash
stella doctor run --check check.agent.cluster.health
```
## Related Checks
- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity

---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---
# Agent Cluster Quorum
## What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`. It is designed to verify:
1. Minimum members are online (a strict majority, `floor(n/2) + 1`, or the configured minimum)
2. Leader election is possible with current membership
3. Split-brain prevention mechanisms are active
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet query cluster membership.
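The quorum arithmetic itself is simple. A Python sketch of the intended rule (function names invented for illustration):

```python
from typing import Optional

def quorum_size(total_members: int, configured_minimum: Optional[int] = None) -> int:
    """Smallest online count that forms a strict majority, unless a minimum is configured."""
    majority = total_members // 2 + 1
    return configured_minimum if configured_minimum is not None else majority

def has_quorum(online: int, total_members: int,
               configured_minimum: Optional[int] = None) -> bool:
    return online >= quorum_size(total_members, configured_minimum)
```

For a 3-member cluster the majority is 2, which is why the scaling examples below target at least 3 agents.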
## Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
## Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
## How to Fix
### Docker Compose
```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```
Ensure cluster member list is correct in `.env`:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```
### Bare Metal / systemd
```bash
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
```
### Kubernetes / Helm
```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
```
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
```yaml
# values.yaml
agent:
cluster:
enabled: true
replicas: 3
podDisruptionBudget:
minAvailable: 2
```
## Verification
```bash
stella doctor run --check check.agent.cluster.quorum
```
## Related Checks
- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity

---
checkId: check.agent.heartbeat.freshness
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, heartbeat, connectivity, quick]
---
# Agent Heartbeat Freshness
## What It Checks
Queries all non-revoked, non-inactive agents for the current tenant and classifies each by the age of its last heartbeat:
1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with time since heartbeat in seconds.
3. **Healthy** (<= 2 minutes): Result is **Pass**.
If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, it warns about being unable to check.
Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.
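The heartbeat-age bucketing reduces to the following (a Python illustration of the C# logic):

```python
from datetime import timedelta

def classify_heartbeat(age: timedelta) -> str:
    """Bucket one agent by the age of its last heartbeat, per the thresholds above."""
    if age > timedelta(minutes=5):
        return "stale"        # -> Fail, reported in minutes
    if age > timedelta(minutes=2):
        return "warning"      # -> Warn, reported in seconds
    return "healthy"          # -> Pass
```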
## Why It Matters
Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out, and the lack of timely detection causes deployment delays and alert fatigue.
## Common Causes
- Agent process has crashed or stopped
- Network connectivity issue between agent and orchestrator
- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
- Agent host is unreachable or powered off
- mTLS certificate has expired (see `check.agent.certificate.expiry`)
- Agent is under heavy load (warning-level)
- Network latency between agent and orchestrator (warning-level)
- Agent is processing long-running tasks that block the heartbeat loop (warning-level)
## How to Fix
### Docker Compose
```bash
# Check agent container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent
# View agent logs for crash or error messages
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200
# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
# Verify network connectivity from agent to orchestrator
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
curl -k https://orchestrator:8443/health
```
### Bare Metal / systemd
```bash
# Check agent service status
systemctl status stella-agent
# View recent agent logs
journalctl -u stella-agent --since '10 minutes ago'
# Run agent diagnostics
stella agent doctor
# Check network connectivity to orchestrator
curl -k https://orchestrator:8443/health
# If certificate expired, renew it
stella agent renew-cert --force
# Restart the service
sudo systemctl restart stella-agent
```
### Kubernetes / Helm
```bash
# Check agent pod status and restarts
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# View agent pod logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200
# Check network policy allowing agent -> orchestrator traffic
kubectl get networkpolicy -n stellaops
# Restart agent pods via rollout
kubectl rollout restart deployment/stellaops-agent -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.heartbeat.freshness
```
## Related Checks
- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats

---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---
# Agent Resource Utilization
## What It Checks
Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:
1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true, so the check will always appear in results.
## Why It Matters
Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.
## Common Causes
- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)
## How to Fix
### Docker Compose
```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)
# Set resource limits in compose override
# docker-compose.override.yml:
# services:
# agent:
# deploy:
# resources:
# limits:
# cpus: '2.0'
# memory: 4G
# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent cleanup --older-than 7d
```
### Bare Metal / systemd
```bash
# Check resource usage
stella agent health <agent-id>
# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops
# Clean up old task artifacts
stella agent cleanup --older-than 7d
# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```
### Kubernetes / Helm
```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops
# Set resource requests and limits in Helm values
# agent:
# resources:
# requests:
# cpu: "500m"
# memory: "1Gi"
# limits:
# cpu: "2000m"
# memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```
## Verification
```bash
stella doctor run --check check.agent.resource.utilization
```
## Related Checks
- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale

---
checkId: check.agent.stale
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, maintenance, cleanup]
---
# Stale Agent Detection
## What It Checks
Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all non-revoked, non-inactive agents and categorizes them:
1. **Decommission candidates** -- offline for more than **7 days**. Result: **Warn** listing each agent with days offline.
2. **Stale** -- offline for more than **1 hour** but less than 7 days. Result: **Warn** listing each agent with hours offline.
3. **All healthy** -- no agents exceed the 1-hour stale threshold. Result: **Pass**.
The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having `TimeSpan.MaxValue` offline duration.
Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, per-agent names with offline durations.
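The offline-duration bucketing can be sketched as follows (a Python illustration; the real check reads `LastHeartbeatAt` from the agent store):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def classify_offline(last_heartbeat_at: Optional[datetime],
                     now: Optional[datetime] = None) -> str:
    """Bucket one agent by offline duration, per the thresholds above."""
    now = now or datetime.now(timezone.utc)
    # A missing heartbeat is treated as offline "forever"
    # (TimeSpan.MaxValue in the C# check).
    offline = timedelta.max if last_heartbeat_at is None else now - last_heartbeat_at
    if offline > timedelta(days=7):
        return "decommission-candidate"   # Warn, reported in days
    if offline > timedelta(hours=1):
        return "stale"                    # Warn, reported in hours
    return "healthy"
```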
## Why It Matters
Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.
## Common Causes
- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)
## How to Fix
### Docker Compose
```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent list --all
# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent deactivate --agent-id <agent-id>
```
### Bare Metal / systemd
```bash
# Review stale agents
stella agent list --status stale
# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>
# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'
# Check network connectivity from the agent host
ssh <agent-host> 'curl -k https://orchestrator:8443/health'
# Restart agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```
### Kubernetes / Helm
```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops --field-selector=status.phase!=Running
# Remove stale agent registrations via API
stella agent deactivate --agent-id <agent-id>
# If pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```
## Verification
```bash
stella doctor run --check check.agent.stale
```
## Related Checks
- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates

---
checkId: check.agent.task.backlog
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, queue, capacity]
---
# Task Queue Backlog
## What It Checks
Monitors the pending task queue depth across the agent fleet to detect capacity issues. The check is designed to evaluate:
1. Total queued tasks across the entire fleet
2. Age of the oldest queued task (how long tasks wait before dispatch)
3. Queue growth rate trend (growing, stable, or draining)
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
## Why It Matters
A growing task backlog means agents cannot keep up with incoming work. Tasks age in the queue, SLA timers expire, and users experience delayed deployments and scan results. If the backlog grows unchecked, it can cascade: delayed scans block policy gates, which block promotions, which block release trains. Detecting backlog growth early allows operators to scale the fleet or prioritize the queue.
## Common Causes
- Insufficient agent count for current workload
- One or more agents offline, reducing effective fleet capacity
- Task burst from bulk operations (mass rescans, environment-wide deployments)
- Slow tasks monopolizing agent slots (large image scans, complex builds)
- Task dispatch paused due to configuration or freeze window
## How to Fix
### Docker Compose
```bash
# Check current queue depth
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status queued --count
# Scale agents to reduce backlog
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
# Increase concurrent task limit per agent
# Set environment variable in compose override:
# AGENT__MAXCONCURRENTTASKS=8
```
### Bare Metal / systemd
```bash
# Check queue depth and oldest task
stella agent tasks --status queued
# Increase concurrent task limit
stella agent config --agent-id <id> --set max_concurrent_tasks=8
# Add more agents to the fleet
stella agent bootstrap --name agent-03 --env production --platform linux
```
### Kubernetes / Helm
```bash
# Check queue depth
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent tasks --status queued --count
# Scale agent deployment
kubectl scale deployment stellaops-agent --replicas=5 -n stellaops
# Or use HPA for auto-scaling
# agent:
# autoscaling:
# enabled: true
# minReplicas: 2
# maxReplicas: 10
# targetCPUUtilizationPercentage: 70
helm upgrade stellaops stellaops/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.agent.task.backlog
```
## Related Checks
- `check.agent.capacity` -- backlog grows when capacity is insufficient
- `check.agent.task.failure.rate` -- failed tasks may be re-queued, inflating the backlog
- `check.agent.resource.utilization` -- saturated agents process tasks slowly
- `check.agent.heartbeat.freshness` -- offline agents reduce dispatch targets

---
checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, failure, reliability]
---
# Task Failure Rate
## What It Checks
Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:
1. Overall task failure rate over the last hour
2. Per-agent failure rate to isolate problematic agents
3. Failure rate trend (increasing, decreasing, or stable)
4. Common failure reasons to guide remediation
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
## Why It Matters
A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.
## Common Causes
- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)
## How to Fix
### Docker Compose
```bash
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
grep -i "task.*fail\|error\|exception"
# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status failed --last 1h
```
### Bare Metal / systemd
```bash
# View failed tasks
stella agent tasks --status failed --last 1h
# Check per-agent failure rates
stella agent health <agent-id> --show-tasks
# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
```
### Kubernetes / Helm
```bash
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
grep -i "task.*fail\|error"
# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
```
## Verification
```bash
stella doctor run --check check.agent.task.failure.rate
```
## Related Checks
- `check.agent.resource.utilization` -- resource exhaustion causes task failures
- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
- `check.agent.version.consistency` -- version skew can cause task compatibility failures

---
checkId: check.agent.version.consistency
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, version, maintenance]
---
# Agent Version Consistency
## What It Checks
Groups all non-revoked, non-inactive agents by their reported `Version` field and evaluates version skew:
1. **Single version** across all agents: **Pass** -- all agents are consistent.
2. **Two versions** with skew affecting less than half the fleet: **Pass** (minor skew acceptable).
3. **Significant skew** (more than 2 distinct versions, or outdated agents exceed half the fleet): **Warn** with evidence listing the version distribution and up to 10 outdated agent names.
4. **No active agents**: **Skip**.
The "majority version" is the version running on the most agents. All other versions are considered outdated. Evidence collected: `MajorityVersion`, `VersionDistribution` (e.g., "1.5.0: 8, 1.4.2: 2"), `OutdatedAgents` (list of names with their versions).
## Why It Matters
Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.
## Common Causes
- Auto-update is disabled on some agents
- Some agents failed to update (download failure, permission issue, disk full)
- Phased rollout in progress (expected, temporary skew)
- Agents on isolated networks that cannot reach the update server
## How to Fix
### Docker Compose
```bash
# Check agent image versions
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
jq '.[] | {name: .Name, image: .Image}'
# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent
```
### Bare Metal / systemd
```bash
# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>
# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true
# Batch update all agents
stella agent update --version <target-version> --all
```
### Kubernetes / Helm
```bash
# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Update image tag in Helm values and rollout
helm upgrade stellaops stellaops/stellaops --set agent.image.tag=<target-version>
# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.version.consistency
```
## Related Checks
- `check.agent.heartbeat.freshness` -- version mismatch can cause heartbeat protocol failures
- `check.agent.capacity` -- outdated agents may be unable to accept newer task types