Doctor plugin checks: implement health check classes and documentation

Implement remediation-aware health checks across all Doctor plugin modules
(Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment,
EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release,
Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation,
Authority, Core, Cryptography, Database, Docker, Integration, Notify,
Observability, Security, ServiceGraph, Sources, Verification).

Each check now emits structured remediation metadata (severity, category,
runbook links, and fix suggestions) consumed by the Doctor dashboard
remediation panel.

Also adds:
- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Commit c58a236d70 (parent fbd24e71de), branch master, 2026-03-27 12:28:00 +02:00
326 changed files with 18500 additions and 463 deletions

---
checkId: check.agent.capacity
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, capacity, performance]
---
# Agent Capacity
## What It Checks
Verifies that agents have sufficient capacity to handle incoming tasks. The check queries the agent store for the current tenant and categorizes agents by status:
1. **Fail** if zero agents have `AgentStatus.Active` -- no agents are available to run tasks.
2. **Pass** if at least one active agent exists, reporting the active-vs-total count.
Evidence collected: `ActiveAgents`, `TotalAgents`.
Thresholds defined in source (not yet wired to the simplified implementation):
- High utilization: >= 90%
- Warning utilization: >= 75%
The check skips with a warning if the tenant ID is missing or unparseable.
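As a rough illustration, the categorization above reduces to the following Python sketch. The shipped check is a C# class; the `Agent` record and `check_capacity` function here are invented for the example:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the agent store records used by the real check.
@dataclass
class Agent:
    name: str
    active: bool  # corresponds to AgentStatus.Active

def check_capacity(agents: list) -> tuple:
    """Fail when no agent is active; otherwise pass with the active/total count.

    The 90% / 75% utilization thresholds mentioned above are defined in source
    but not yet wired into this simplified logic.
    """
    active = sum(1 for a in agents if a.active)
    total = len(agents)
    if active == 0:
        return ("fail", f"No active agents (0/{total})")
    return ("pass", f"{active}/{total} agents active")
```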
## Why It Matters
When no active agents are available, the platform cannot execute deployment tasks, scans, or any agent-dispatched work. Releases stall, scan queues grow, and SLA timers expire silently. Detecting zero-capacity before a promotion attempt prevents failed deployments and on-call pages.
## Common Causes
- All agents are offline (host crash, network partition, maintenance window)
- No agents have been registered for this tenant
- Agents exist but are in `Revoked` or `Inactive` status and none remain `Active`
- Agent bootstrap was started but never completed
## How to Fix
### Docker Compose
```bash
# Check agent container health
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# View agent container logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 100
# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```
### Bare Metal / systemd
```bash
# Check agent service status
systemctl status stella-agent
# Restart agent service
sudo systemctl restart stella-agent
# Bootstrap a new agent if none registered
stella agent bootstrap --name agent-01 --env production --platform linux
```
### Kubernetes / Helm
```bash
# Check agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# Describe agent deployment
kubectl describe deployment stellaops-agent -n stellaops
# Scale agent replicas
kubectl scale deployment stellaops-agent --replicas=2 -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.capacity
```
## Related Checks
- `check.agent.heartbeat.freshness` -- agents may be registered but not sending heartbeats
- `check.agent.stale` -- agents offline for extended periods may need decommissioning
- `check.agent.resource.utilization` -- active agents may be resource-constrained

---
checkId: check.agent.certificate.expiry
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security, quick]
---
# Agent Certificate Expiry
## What It Checks
Inspects the `CertificateExpiresAt` field on every non-revoked, non-inactive agent and classifies each into one of four buckets:
1. **Expired** -- `CertificateExpiresAt` is in the past. Result: **Fail**.
2. **Critical** -- certificate expires within **1 day** (24 hours). Result: **Fail**.
3. **Warning** -- certificate expires within **7 days**. Result: **Warn**.
4. **Healthy** -- certificate has more than 7 days remaining. Result: **Pass**.
The check short-circuits to the most severe bucket found. Evidence includes per-agent names with time-since-expiry or time-until-expiry, plus counts of `TotalActive`, `Expired`, `Critical`, and `Warning` agents.
Agents whose `CertificateExpiresAt` is null or default are silently skipped (certificate info not available). If no active agents exist the check is skipped entirely.
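The four-bucket classification can be sketched in Python as follows (illustrative only; the real check is C# and reads `CertificateExpiresAt` per agent):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def classify_certificate(expires_at: Optional[datetime],
                         now: Optional[datetime] = None) -> str:
    """Bucket one agent certificate by time remaining, per the thresholds above."""
    now = now or datetime.now(timezone.utc)
    if expires_at is None:
        return "skipped"               # certificate info not available
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "expired"               # -> Fail
    if remaining <= timedelta(days=1):
        return "critical"              # -> Fail
    if remaining <= timedelta(days=7):
        return "warning"               # -> Warn
    return "healthy"                   # -> Pass
```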
## Why It Matters
Agent mTLS certificates authenticate the agent to the orchestrator. An expired certificate causes the agent to fail heartbeats, reject task assignments, and drop out of the fleet. In production this means deployments and scans silently stop being dispatched to that agent, potentially leaving environments unserviced.
## Common Causes
- Certificate auto-renewal is disabled on the agent
- Agent was offline when renewal was due (missed the renewal window)
- Certificate authority is unreachable from the agent host
- Agent bootstrap was incomplete (certificate provisioned but auto-renewal not configured)
- Certificate renewal threshold not yet reached (warning-level)
- Certificate authority rate limiting prevented renewal (critical-level)
## How to Fix
### Docker Compose
```bash
# Check certificate expiry for agent containers
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent health --show-cert
# Force certificate renewal
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent renew-cert --force
# Verify auto-renewal configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent config show | grep auto_renew
```
### Bare Metal / systemd
```bash
# Force certificate renewal on an affected agent
stella agent renew-cert --agent-id <agent-id> --force
# If agent is unreachable, re-bootstrap
stella agent bootstrap --name <agent-name> --env <environment>
# Verify auto-renewal is enabled
stella agent config --agent-id <agent-id> | grep auto_renew
# Check agent logs for renewal failures
stella agent logs --agent-id <agent-id> --level warn
```
### Kubernetes / Helm
```bash
# Check cert expiry across agent pods
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent health --show-cert
# Force renewal via pod exec
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent renew-cert --force
# If using cert-manager, check Certificate resource
kubectl get certificate -n stellaops
kubectl describe certificate stellaops-agent-tls -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.certificate.expiry
```
## Related Checks
- `check.agent.certificate.validity` -- verifies certificate chain of trust (not just expiry)
- `check.agent.heartbeat.freshness` -- expired certs cause heartbeat failures
- `check.agent.stale` -- agents with expired certs often show as stale

---
checkId: check.agent.certificate.validity
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security]
---
# Agent Certificate Validity
## What It Checks
Validates the full certificate chain of trust for agent mTLS certificates. The check is designed to verify:
1. Certificate is signed by a trusted CA
2. Certificate chain is complete (no missing intermediates)
3. No revoked certificates in the chain (CRL/OCSP check)
4. Certificate subject matches the agent's registered identity
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The framework and metadata are wired; the chain-validation logic is not yet connected.
Evidence collected: none yet (pending implementation).
The check requires `IAgentStore` to be registered in DI; otherwise it will not run.
## Why It Matters
A valid certificate expiry date (checked by `check.agent.certificate.expiry`) is necessary but not sufficient. An agent could present a non-expired certificate that was signed by an untrusted CA, has a broken chain, or has been revoked. Any of these conditions would allow an impersonating agent to receive task dispatches or exfiltrate deployment secrets.
## Common Causes
- CA certificate rotated but agent still presents cert signed by old CA
- Intermediate certificate missing from agent's cert bundle
- Certificate revoked via CRL but agent not yet re-provisioned
- Agent identity mismatch after hostname change or migration
## How to Fix
### Docker Compose
```bash
# Inspect agent certificate chain
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# Verify chain against CA bundle
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
openssl verify -CAfile /etc/stellaops/ca/ca.crt /etc/stellaops/agent/tls.crt
```
### Bare Metal / systemd
```bash
# Inspect agent certificate
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# Verify certificate chain
openssl verify -CAfile /etc/stellaops/ca/ca.crt -untrusted /etc/stellaops/ca/intermediate.crt \
/etc/stellaops/agent/tls.crt
# Re-bootstrap if chain is broken
stella agent bootstrap --name <agent-name> --env <environment>
```
### Kubernetes / Helm
```bash
# Check certificate in agent pod
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout
# If using cert-manager, check CertificateRequest status
kubectl get certificaterequest -n stellaops
kubectl describe certificaterequest <name> -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.certificate.validity
```
## Related Checks
- `check.agent.certificate.expiry` -- checks expiry dates (complementary to chain validation)
- `check.agent.heartbeat.freshness` -- invalid certs prevent heartbeat communication

---
checkId: check.agent.cluster.health
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, ha, resilience]
---
# Agent Cluster Health
## What It Checks
Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify:
1. All cluster members are reachable
2. A leader is elected and healthy
3. State synchronization is working across members
4. Failover is possible if the current leader goes down
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet perform cluster health probes.
## Why It Matters
In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.
## Common Causes
- Network partition between cluster members
- Leader node crashed without triggering failover
- State sync backlog due to high task volume
- Clock skew between cluster members causing consensus protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)
## How to Fix
### Docker Compose
```bash
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster
# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```
Set clustering configuration in your `.env` or compose override:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```
### Bare Metal / systemd
```bash
# Check cluster status
stella agent cluster status
# View cluster member health
stella agent cluster members
# Force leader re-election if leader is unhealthy
stella agent cluster elect --force
# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```
### Kubernetes / Helm
```bash
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster
# Helm values for clustering
# agent:
# cluster:
# enabled: true
# replicas: 3
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```
## Verification
```bash
stella doctor run --check check.agent.cluster.health
```
## Related Checks
- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity

---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---
# Agent Cluster Quorum
## What It Checks
Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`. It is designed to verify:
1. Minimum members are online (a strict majority, `floor(n/2) + 1`, or the configured minimum)
2. Leader election is possible with current membership
3. Split-brain prevention mechanisms are active
**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet query cluster membership.
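The quorum arithmetic itself is simple. A Python sketch of the intended rule (function names invented for illustration):

```python
from typing import Optional

def quorum_size(total_members: int, configured_minimum: Optional[int] = None) -> int:
    """Smallest online count that forms a strict majority, unless a minimum is configured."""
    majority = total_members // 2 + 1
    return configured_minimum if configured_minimum is not None else majority

def has_quorum(online: int, total_members: int,
               configured_minimum: Optional[int] = None) -> bool:
    return online >= quorum_size(total_members, configured_minimum)
```

For a 3-member cluster the majority is 2, which is why the scaling examples below target at least 3 agents.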
## Why It Matters
Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.
## Common Causes
- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first
## How to Fix
### Docker Compose
```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent
# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```
Ensure cluster member list is correct in `.env`:
```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```
### Bare Metal / systemd
```bash
# Check how many cluster members are online
stella agent cluster members --status online
# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'
# Verify quorum status
stella agent cluster quorum
```
### Kubernetes / Helm
```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops
# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops
# Check pod disruption budget
kubectl get pdb -n stellaops
```
Set a PodDisruptionBudget to prevent quorum loss during rollouts:
```yaml
# values.yaml
agent:
cluster:
enabled: true
replicas: 3
podDisruptionBudget:
minAvailable: 2
```
## Verification
```bash
stella doctor run --check check.agent.cluster.quorum
```
## Related Checks
- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity

---
checkId: check.agent.heartbeat.freshness
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, heartbeat, connectivity, quick]
---
# Agent Heartbeat Freshness
## What It Checks
Queries all non-revoked, non-inactive agents for the current tenant and classifies each by the age of its last heartbeat:
1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with time since heartbeat in seconds.
3. **Healthy** (<= 2 minutes): Result is **Pass**.
If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, it warns about being unable to check.
Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.
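The heartbeat-age bucketing reduces to the following (a Python illustration of the C# logic):

```python
from datetime import timedelta

def classify_heartbeat(age: timedelta) -> str:
    """Bucket one agent by the age of its last heartbeat, per the thresholds above."""
    if age > timedelta(minutes=5):
        return "stale"        # -> Fail, reported in minutes
    if age > timedelta(minutes=2):
        return "warning"      # -> Warn, reported in seconds
    return "healthy"          # -> Pass
```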
## Why It Matters
Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out, and the lack of timely detection causes deployment delays and alert fatigue.
## Common Causes
- Agent process has crashed or stopped
- Network connectivity issue between agent and orchestrator
- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
- Agent host is unreachable or powered off
- mTLS certificate has expired (see `check.agent.certificate.expiry`)
- Agent is under heavy load (warning-level)
- Network latency between agent and orchestrator (warning-level)
- Agent is processing long-running tasks that block the heartbeat loop (warning-level)
## How to Fix
### Docker Compose
```bash
# Check agent container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent
# View agent logs for crash or error messages
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200
# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
# Verify network connectivity from agent to orchestrator
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
curl -k https://orchestrator:8443/health
```
### Bare Metal / systemd
```bash
# Check agent service status
systemctl status stella-agent
# View recent agent logs
journalctl -u stella-agent --since '10 minutes ago'
# Run agent diagnostics
stella agent doctor
# Check network connectivity to orchestrator
curl -k https://orchestrator:8443/health
# If certificate expired, renew it
stella agent renew-cert --force
# Restart the service
sudo systemctl restart stella-agent
```
### Kubernetes / Helm
```bash
# Check agent pod status and restarts
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops
# View agent pod logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200
# Check network policy allowing agent -> orchestrator traffic
kubectl get networkpolicy -n stellaops
# Restart agent pods via rollout
kubectl rollout restart deployment/stellaops-agent -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.heartbeat.freshness
```
## Related Checks
- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats

---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---
# Agent Resource Utilization
## What It Checks
Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:
1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true, so the check will always appear in results.
## Why It Matters
Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.
## Common Causes
- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)
## How to Fix
### Docker Compose
```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)
# Set resource limits in compose override
# docker-compose.override.yml:
# services:
# agent:
# deploy:
# resources:
# limits:
# cpus: '2.0'
# memory: 4G
# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent cleanup --older-than 7d
```
### Bare Metal / systemd
```bash
# Check resource usage
stella agent health <agent-id>
# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops
# Clean up old task artifacts
stella agent cleanup --older-than 7d
# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```
### Kubernetes / Helm
```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops
# Set resource requests and limits in Helm values
# agent:
# resources:
# requests:
# cpu: "500m"
# memory: "1Gi"
# limits:
# cpu: "2000m"
# memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml
# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```
## Verification
```bash
stella doctor run --check check.agent.resource.utilization
```
## Related Checks
- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale

---
checkId: check.agent.stale
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, maintenance, cleanup]
---
# Stale Agent Detection
## What It Checks
Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all non-revoked, non-inactive agents and categorizes them:
1. **Decommission candidates** -- offline for more than **7 days**. Result: **Warn** listing each agent with days offline.
2. **Stale** -- offline for more than **1 hour** but less than 7 days. Result: **Warn** listing each agent with hours offline.
3. **All healthy** -- no agents exceed the 1-hour stale threshold. Result: **Pass**.
The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having `TimeSpan.MaxValue` offline duration.
Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, per-agent names with offline durations.
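The offline-duration bucketing can be sketched as follows (a Python illustration; the real check reads `LastHeartbeatAt` from the agent store):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def classify_offline(last_heartbeat_at: Optional[datetime],
                     now: Optional[datetime] = None) -> str:
    """Bucket one agent by offline duration, per the thresholds above."""
    now = now or datetime.now(timezone.utc)
    # A missing heartbeat is treated as offline "forever"
    # (TimeSpan.MaxValue in the C# check).
    offline = timedelta.max if last_heartbeat_at is None else now - last_heartbeat_at
    if offline > timedelta(days=7):
        return "decommission-candidate"   # Warn, reported in days
    if offline > timedelta(hours=1):
        return "stale"                    # Warn, reported in hours
    return "healthy"
```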
## Why It Matters
Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.
## Common Causes
- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)
## How to Fix
### Docker Compose
```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent list --all
# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent deactivate --agent-id <agent-id>
```
### Bare Metal / systemd
```bash
# Review stale agents
stella agent list --status stale
# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>
# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'
# Check network connectivity from the agent host
ssh <agent-host> 'curl -k https://orchestrator:8443/health'
# Restart agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```
### Kubernetes / Helm
```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops --field-selector=status.phase!=Running
# Remove stale agent registrations via API
stella agent deactivate --agent-id <agent-id>
# If pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```
## Verification
```bash
stella doctor run --check check.agent.stale
```
## Related Checks
- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates

---
checkId: check.agent.task.backlog
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, queue, capacity]
---
# Task Queue Backlog
## What It Checks
Monitors the pending task queue depth across the agent fleet to detect capacity issues. The check is designed to evaluate:
1. Total queued tasks across the entire fleet
2. Age of the oldest queued task (how long tasks wait before dispatch)
3. Queue growth rate trend (growing, stable, or draining)
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
## Why It Matters
A growing task backlog means agents cannot keep up with incoming work. Tasks age in the queue, SLA timers expire, and users experience delayed deployments and scan results. If the backlog grows unchecked, it can cascade: delayed scans block policy gates, which block promotions, which block release trains. Detecting backlog growth early allows operators to scale the fleet or prioritize the queue.
## Common Causes
- Insufficient agent count for current workload
- One or more agents offline, reducing effective fleet capacity
- Task burst from bulk operations (mass rescans, environment-wide deployments)
- Slow tasks monopolizing agent slots (large image scans, complex builds)
- Task dispatch paused due to configuration or freeze window
## How to Fix
### Docker Compose
```bash
# Check current queue depth
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status queued --count
# Scale agents to reduce backlog
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
# Increase concurrent task limit per agent
# Set environment variable in compose override:
# AGENT__MAXCONCURRENTTASKS=8
```
### Bare Metal / systemd
```bash
# Check queue depth and oldest task
stella agent tasks --status queued
# Increase concurrent task limit
stella agent config --agent-id <id> --set max_concurrent_tasks=8
# Add more agents to the fleet
stella agent bootstrap --name agent-03 --env production --platform linux
```
### Kubernetes / Helm
```bash
# Check queue depth
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
stella agent tasks --status queued --count
# Scale agent deployment
kubectl scale deployment stellaops-agent --replicas=5 -n stellaops
# Or use HPA for auto-scaling
# agent:
# autoscaling:
# enabled: true
# minReplicas: 2
# maxReplicas: 10
# targetCPUUtilizationPercentage: 70
helm upgrade stellaops stellaops/stellaops -f values.yaml
```
## Verification
```bash
stella doctor run --check check.agent.task.backlog
```
## Related Checks
- `check.agent.capacity` -- backlog grows when capacity is insufficient
- `check.agent.task.failure.rate` -- failed tasks may be re-queued, inflating the backlog
- `check.agent.resource.utilization` -- saturated agents process tasks slowly
- `check.agent.heartbeat.freshness` -- offline agents reduce dispatch targets

---
checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, failure, reliability]
---
# Task Failure Rate
## What It Checks
Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:
1. Overall task failure rate over the last hour
2. Per-agent failure rate to isolate problematic agents
3. Failure rate trend (increasing, decreasing, or stable)
4. Common failure reasons to guide remediation
**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
## Why It Matters
A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.
## Common Causes
- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)
## How to Fix
### Docker Compose
```bash
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
grep -i "task.*fail\|error\|exception"
# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
stella agent tasks --status failed --last 1h
```
### Bare Metal / systemd
```bash
# View failed tasks
stella agent tasks --status failed --last 1h
# Check per-agent failure rates
stella agent health <agent-id> --show-tasks
# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
```
### Kubernetes / Helm
```bash
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
grep -i "task.*fail\|error"
# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
```
## Verification
```bash
stella doctor run --check check.agent.task.failure.rate
```
## Related Checks
- `check.agent.resource.utilization` -- resource exhaustion causes task failures
- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
- `check.agent.version.consistency` -- version skew can cause task compatibility failures

---
checkId: check.agent.version.consistency
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, version, maintenance]
---
# Agent Version Consistency
## What It Checks
Groups all non-revoked, non-inactive agents by their reported `Version` field and evaluates version skew:
1. **Single version** across all agents: **Pass** -- all agents are consistent.
2. **Two versions** with skew affecting less than half the fleet: **Pass** (minor skew acceptable).
3. **Significant skew** (more than 2 distinct versions, or outdated agents exceed half the fleet): **Warn** with evidence listing the version distribution and up to 10 outdated agent names.
4. **No active agents**: **Skip**.
The "majority version" is the version running on the most agents. All other versions are considered outdated. Evidence collected: `MajorityVersion`, `VersionDistribution` (e.g., "1.5.0: 8, 1.4.2: 2"), `OutdatedAgents` (list of names with their versions).
## Why It Matters
Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.
## Common Causes
- Auto-update is disabled on some agents
- Some agents failed to update (download failure, permission issue, disk full)
- Phased rollout in progress (expected, temporary skew)
- Agents on isolated networks that cannot reach the update server
## How to Fix
### Docker Compose
```bash
# Check agent image versions
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
jq '.[] | {name: .Name, image: .Image}'
# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent
```
### Bare Metal / systemd
```bash
# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>
# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true
# Batch update all agents
stella agent update --version <target-version> --all
```
### Kubernetes / Helm
```bash
# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
# Update image tag in Helm values and rollout
helm upgrade stellaops stellaops/stellaops --set agent.image.tag=<target-version>
# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
```
## Verification
```bash
stella doctor run --check check.agent.version.consistency
```
## Related Checks
- `check.agent.heartbeat.freshness` -- version mismatch can cause heartbeat protocol failures
- `check.agent.capacity` -- outdated agents may be unable to accept newer task types