Doctor plugin checks: implement health check classes and documentation
Implement remediation-aware health checks across all Doctor plugin modules (Agent, Attestor, Auth, BinaryAnalysis, Compliance, Crypto, Environment, EvidenceLocker, Notify, Observability, Operations, Policy, Postgres, Release, Scanner, Storage, Vex) and their backing library counterparts (AI, Attestation, Authority, Core, Cryptography, Database, Docker, Integration, Notify, Observability, Security, ServiceGraph, Sources, Verification). Each check now emits structured remediation metadata (severity, category, runbook links, and fix suggestions) consumed by the Doctor dashboard remediation panel.

Also adds:

- docs/doctor/articles/ knowledge base for check explanations
- Advisory AI search seed and allowlist updates for doctor content
- Sprint plan for doctor checks documentation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
47
docs/doctor/articles/_TEMPLATE.md
Normal file
@@ -0,0 +1,47 @@
---
checkId: check.<plugin>.<name>
plugin: stellaops.doctor.<plugin>
severity: fail | warn | info
tags: [tag1, tag2]
---

# <Check Name>

## What It Checks

<Exact condition tested. What thresholds trigger fail vs warn vs pass. What evidence is collected.>

## Why It Matters

<Business impact if this check fails in production. What breaks, what data is at risk, what users experience.>

## Common Causes

- Cause 1 (specific: exact misconfiguration, missing file, wrong env var)
- Cause 2
- Cause 3

## How to Fix

### Docker Compose

```bash
# Step-by-step commands for docker-compose deployments
# Use exact env var names with __ separator
# Reference exact file paths relative to devops/compose/
```

### Bare Metal / systemd

```bash
# Step-by-step commands for bare-metal / systemd deployments
# Reference exact config file paths (e.g., /etc/stellaops/appsettings.json)
```

### Kubernetes / Helm

```bash
# Step-by-step commands for Kubernetes/Helm deployments
# Reference exact Helm values, ConfigMap keys, Secret names
```

## Verification

```bash
stella doctor run --check check.<plugin>.<name>
```

## Related Checks

- `check.related.id` - brief explanation of relationship
86
docs/doctor/articles/agent/capacity.md
Normal file
@@ -0,0 +1,86 @@
---
checkId: check.agent.capacity
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, capacity, performance]
---

# Agent Capacity

## What It Checks

Verifies that agents have sufficient capacity to handle incoming tasks. The check queries the agent store for the current tenant and categorizes agents by status:

1. **Fail** if zero agents have `AgentStatus.Active` -- no agents are available to run tasks.
2. **Pass** if at least one active agent exists, reporting the active-vs-total count.

Evidence collected: `ActiveAgents`, `TotalAgents`.

Thresholds defined in source (not yet wired to the simplified implementation):

- High utilization: >= 90%
- Warning utilization: >= 75%

The check skips with a warning if the tenant ID is missing or unparseable.
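The dormant utilization thresholds above can be made concrete with a few lines of shell. This is purely illustrative: the variable names and the busy-vs-active ratio are assumptions for this sketch, not the Doctor source's actual evidence keys.

```shell
# Hypothetical classification against the dormant thresholds.
busy=9; active=10                       # assumed inputs, not real evidence keys
utilization=$(( busy * 100 / active ))  # integer percent
if [ "$utilization" -ge 90 ]; then
  echo "high"        # >= 90% utilization band
elif [ "$utilization" -ge 75 ]; then
  echo "warning"     # >= 75% warning band
else
  echo "ok"
fi
```

With the sample values (9 busy of 10 active), utilization is exactly 90% and lands in the high band.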
## Why It Matters

When no active agents are available, the platform cannot execute deployment tasks, scans, or any agent-dispatched work. Releases stall, scan queues grow, and SLA timers expire silently. Detecting zero capacity before a promotion attempt prevents failed deployments and on-call pages.

## Common Causes

- All agents are offline (host crash, network partition, maintenance window)
- No agents have been registered for this tenant
- Agents exist but are in `Revoked` or `Inactive` status and none remain `Active`
- Agent bootstrap was started but never completed

## How to Fix

### Docker Compose

```bash
# Check agent container health
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View agent container logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 100

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```

### Bare Metal / systemd

```bash
# Check agent service status
systemctl status stella-agent

# Restart agent service
sudo systemctl restart stella-agent

# Bootstrap a new agent if none registered
stella agent bootstrap --name agent-01 --env production --platform linux
```

### Kubernetes / Helm

```bash
# Check agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# Describe agent deployment
kubectl describe deployment stellaops-agent -n stellaops

# Scale agent replicas
kubectl scale deployment stellaops-agent --replicas=2 -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.capacity
```

## Related Checks

- `check.agent.heartbeat.freshness` -- agents may be registered but not sending heartbeats
- `check.agent.stale` -- agents offline for extended periods may need decommissioning
- `check.agent.resource.utilization` -- active agents may be resource-constrained
95
docs/doctor/articles/agent/certificate-expiry.md
Normal file
@@ -0,0 +1,95 @@
---
checkId: check.agent.certificate.expiry
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security, quick]
---

# Agent Certificate Expiry

## What It Checks

Inspects the `CertificateExpiresAt` field on every non-revoked, non-inactive agent and classifies each into one of four buckets:

1. **Expired** -- `CertificateExpiresAt` is in the past. Result: **Fail**.
2. **Critical** -- certificate expires within **1 day** (24 hours). Result: **Fail**.
3. **Warning** -- certificate expires within **7 days**. Result: **Warn**.
4. **Healthy** -- certificate has more than 7 days remaining. Result: **Pass**.

The check short-circuits to the most severe bucket found. Evidence includes per-agent names with time-since-expiry or time-until-expiry, plus counts of `TotalActive`, `Expired`, `Critical`, and `Warning` agents.

Agents whose `CertificateExpiresAt` is null or default are silently skipped (certificate info not available). If no active agents exist, the check is skipped entirely.
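For a manual spot-check of a single certificate against the same 1-day/7-day thresholds, `openssl` can compute the days remaining. The certificate path is an assumption for this sketch (override `CERT_PATH` for your deployment), and GNU `date -d` is assumed.

```shell
# Classify one certificate file using the same buckets as the check.
# CERT_PATH is hypothetical; point it at the agent's actual certificate.
cert="${CERT_PATH:-/etc/stellaops/agent/tls.crt}"
end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
days_left=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
if [ "$days_left" -lt 0 ]; then
  echo "expired"
elif [ "$days_left" -le 1 ]; then
  echo "critical"
elif [ "$days_left" -le 7 ]; then
  echo "warning"
else
  echo "healthy"
fi
```

On macOS, substitute `gdate` from coreutils for `date`.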
## Why It Matters

Agent mTLS certificates authenticate the agent to the orchestrator. An expired certificate causes the agent to fail heartbeats, reject task assignments, and drop out of the fleet. In production this means deployments and scans silently stop being dispatched to that agent, potentially leaving environments unserviced.

## Common Causes

- Certificate auto-renewal is disabled on the agent
- Agent was offline when renewal was due (missed the renewal window)
- Certificate authority is unreachable from the agent host
- Agent bootstrap was incomplete (certificate provisioned but auto-renewal not configured)
- Certificate renewal threshold not yet reached (warning-level)
- Certificate authority rate limiting prevented renewal (critical-level)

## How to Fix

### Docker Compose

```bash
# Check certificate expiry for agent containers
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent health --show-cert

# Force certificate renewal
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent renew-cert --force

# Verify auto-renewal configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent config show | grep auto_renew
```

### Bare Metal / systemd

```bash
# Force certificate renewal on an affected agent
stella agent renew-cert --agent-id <agent-id> --force

# If agent is unreachable, re-bootstrap
stella agent bootstrap --name <agent-name> --env <environment>

# Verify auto-renewal is enabled
stella agent config --agent-id <agent-id> | grep auto_renew

# Check agent logs for renewal failures
stella agent logs --agent-id <agent-id> --level warn
```

### Kubernetes / Helm

```bash
# Check cert expiry across agent pods
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
  stella agent health --show-cert

# Force renewal via pod exec
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
  stella agent renew-cert --force

# If using cert-manager, check Certificate resource
kubectl get certificate -n stellaops
kubectl describe certificate stellaops-agent-tls -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.certificate.expiry
```

## Related Checks

- `check.agent.certificate.validity` -- verifies certificate chain of trust (not just expiry)
- `check.agent.heartbeat.freshness` -- expired certs cause heartbeat failures
- `check.agent.stale` -- agents with expired certs often show as stale
84
docs/doctor/articles/agent/certificate-validity.md
Normal file
@@ -0,0 +1,84 @@
---
checkId: check.agent.certificate.validity
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, certificate, security]
---

# Agent Certificate Validity

## What It Checks

Validates the full certificate chain of trust for agent mTLS certificates. The check is designed to verify:

1. Certificate is signed by a trusted CA
2. Certificate chain is complete (no missing intermediates)
3. No revoked certificates in the chain (CRL/OCSP check)
4. Certificate subject matches the agent's registered identity

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The framework and metadata are wired; the chain-validation logic is not yet connected.

Evidence collected: none yet (pending implementation).

The check requires `IAgentStore` to be registered in DI; otherwise it will not run.

## Why It Matters

A valid certificate expiry date (checked by `check.agent.certificate.expiry`) is necessary but not sufficient. An agent could present a non-expired certificate that was signed by an untrusted CA, has a broken chain, or has been revoked. Any of these conditions would allow an impersonating agent to receive task dispatches or exfiltrate deployment secrets.

## Common Causes

- CA certificate rotated but agent still presents cert signed by old CA
- Intermediate certificate missing from agent's cert bundle
- Certificate revoked via CRL but agent not yet re-provisioned
- Agent identity mismatch after hostname change or migration

## How to Fix

### Docker Compose

```bash
# Inspect agent certificate chain
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout

# Verify chain against CA bundle
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  openssl verify -CAfile /etc/stellaops/ca/ca.crt /etc/stellaops/agent/tls.crt
```

### Bare Metal / systemd

```bash
# Inspect agent certificate
openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout

# Verify certificate chain
openssl verify -CAfile /etc/stellaops/ca/ca.crt -untrusted /etc/stellaops/ca/intermediate.crt \
  /etc/stellaops/agent/tls.crt

# Re-bootstrap if chain is broken
stella agent bootstrap --name <agent-name> --env <environment>
```

### Kubernetes / Helm

```bash
# Check certificate in agent pod
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
  openssl x509 -in /etc/stellaops/agent/tls.crt -text -noout

# If using cert-manager, check CertificateRequest status
kubectl get certificaterequest -n stellaops
kubectl describe certificaterequest <name> -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.certificate.validity
```

## Related Checks

- `check.agent.certificate.expiry` -- checks expiry dates (complementary to chain validation)
- `check.agent.heartbeat.freshness` -- invalid certs prevent heartbeat communication
97
docs/doctor/articles/agent/cluster-health.md
Normal file
@@ -0,0 +1,97 @@
---
checkId: check.agent.cluster.health
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, ha, resilience]
---

# Agent Cluster Health

## What It Checks

Monitors the health of the agent cluster when clustering is enabled. The check only runs when the configuration key `Agent:Cluster:Enabled` is set to `true`. It is designed to verify:

1. All cluster members are reachable
2. A leader is elected and healthy
3. State synchronization is working across members
4. Failover is possible if the current leader goes down

**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet perform cluster health probes.

## Why It Matters

In high-availability deployments, agents form a cluster to provide redundancy and automatic failover. If cluster health degrades -- members become unreachable, leader election fails, or state sync stalls -- task dispatch can stop entirely or produce split-brain scenarios where two agents execute the same task concurrently, leading to deployment conflicts.

## Common Causes

- Network partition between cluster members
- Leader node crashed without triggering failover
- State sync backlog due to high task volume
- Clock skew between cluster members causing consensus protocol failures
- Insufficient cluster members for quorum (see `check.agent.cluster.quorum`)

## How to Fix

### Docker Compose

```bash
# Check cluster member containers
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# View cluster-specific logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200 | grep -i cluster

# Restart all agent containers to force re-election
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent
```

Set clustering configuration in your `.env` or compose override:

```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MEMBERS=agent-1:8500,agent-2:8500,agent-3:8500
```

### Bare Metal / systemd

```bash
# Check cluster status
stella agent cluster status

# View cluster member health
stella agent cluster members

# Force leader re-election if leader is unhealthy
stella agent cluster elect --force

# Restart agent to rejoin cluster
sudo systemctl restart stella-agent
```

### Kubernetes / Helm

```bash
# Check agent StatefulSet pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View cluster gossip logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=100 | grep -i cluster

# Helm values for clustering:
# agent:
#   cluster:
#     enabled: true
#     replicas: 3
helm upgrade stellaops stellaops/stellaops --set agent.cluster.enabled=true --set agent.cluster.replicas=3
```

## Verification

```bash
stella doctor run --check check.agent.cluster.health
```

## Related Checks

- `check.agent.cluster.quorum` -- verifies minimum members for consensus
- `check.agent.heartbeat.freshness` -- individual agent connectivity
- `check.agent.capacity` -- fleet-level task capacity
97
docs/doctor/articles/agent/cluster-quorum.md
Normal file
@@ -0,0 +1,97 @@
---
checkId: check.agent.cluster.quorum
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, cluster, quorum, ha]
---

# Agent Cluster Quorum

## What It Checks

Verifies that the agent cluster has sufficient members online to maintain quorum for leader election and consensus operations. The check only runs when `Agent:Cluster:Enabled` is `true`. It is designed to verify:

1. Minimum members are online (n/2 + 1 for odd-numbered clusters, or the configured minimum)
2. Leader election is possible with current membership
3. Split-brain prevention mechanisms are active

**Current status:** implementation pending -- the check returns Skip with a placeholder message. The `CanRun` gate is functional (reads cluster config), but `RunAsync` does not yet query cluster membership.
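The majority-quorum arithmetic from point 1 is worth making concrete, since even-sized clusters buy no extra fault tolerance:

```shell
# quorum = floor(n/2) + 1 for each cluster size n
for n in 1 2 3 4 5; do
  echo "$n members -> quorum $(( n / 2 + 1 ))"
done
```

Note that 3 and 4 members both yield a quorum of 2 and 3 respectively, so a 4-member cluster tolerates only one failure -- the same as a 3-member cluster. Prefer odd cluster sizes.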
## Why It Matters

Without quorum, the agent cluster cannot elect a leader, which means no task dispatch, no failover, and potentially a complete halt of agent-driven operations. Losing quorum is often the step before a full cluster outage. Monitoring quorum proactively allows operators to add members or fix partitions before the cluster becomes non-functional.

## Common Causes

- Too many cluster members went offline simultaneously (maintenance, host failure)
- Network partition isolating a minority of members from the majority
- Cluster scaled down below quorum threshold
- New deployment removed members without draining them first

## How to Fix

### Docker Compose

```bash
# Verify all agent containers are running
docker compose -f devops/compose/docker-compose.stella-ops.yml ps | grep agent

# Scale agents to restore quorum (minimum 3 for quorum of 2)
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3
```

Ensure the cluster member list is correct in `.env`:

```
AGENT__CLUSTER__ENABLED=true
AGENT__CLUSTER__MINMEMBERS=2
```

### Bare Metal / systemd

```bash
# Check how many cluster members are online
stella agent cluster members --status online

# If a member is down, restart it
ssh <agent-host> 'sudo systemctl restart stella-agent'

# Verify quorum status
stella agent cluster quorum
```

### Kubernetes / Helm

```bash
# Check agent pod count vs desired
kubectl get statefulset stellaops-agent -n stellaops

# Scale up if below quorum
kubectl scale statefulset stellaops-agent --replicas=3 -n stellaops

# Check pod disruption budget
kubectl get pdb -n stellaops
```

Set a PodDisruptionBudget to prevent quorum loss during rollouts:

```yaml
# values.yaml
agent:
  cluster:
    enabled: true
    replicas: 3
  podDisruptionBudget:
    minAvailable: 2
```

## Verification

```bash
stella doctor run --check check.agent.cluster.quorum
```

## Related Checks

- `check.agent.cluster.health` -- overall cluster health including leader and sync status
- `check.agent.capacity` -- even with quorum, capacity may be insufficient
- `check.agent.heartbeat.freshness` -- individual member connectivity
104
docs/doctor/articles/agent/heartbeat-freshness.md
Normal file
@@ -0,0 +1,104 @@
---
checkId: check.agent.heartbeat.freshness
plugin: stellaops.doctor.agent
severity: fail
tags: [agent, heartbeat, connectivity, quick]
---

# Agent Heartbeat Freshness

## What It Checks

Queries all non-revoked, non-inactive agents for the current tenant and classifies each by the age of its last heartbeat:

1. **Stale** (> 5 minutes since last heartbeat): Result is **Fail**. Evidence lists each stale agent with the time since its last heartbeat in minutes.
2. **Warning** (> 2 minutes but <= 5 minutes): Result is **Warn**. Evidence lists each delayed agent with time since heartbeat in seconds.
3. **Healthy** (<= 2 minutes): Result is **Pass**.

If no active agents are registered, the check returns **Warn** with a prompt to bootstrap agents. If the tenant ID is missing, it warns that the check cannot run.

Evidence collected: `TotalActive`, `Stale` count, `Warning` count, `Healthy` count, per-agent names and heartbeat ages.
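The bucket boundaries above can be sketched directly; `age` is a stand-in for the computed time since the last heartbeat, not a value the check actually exposes:

```shell
# Classify a heartbeat age (in seconds) with the 2-minute/5-minute cutoffs.
age=200   # hypothetical: 3 min 20 s since last heartbeat
if [ "$age" -gt 300 ]; then
  echo "stale -> Fail"
elif [ "$age" -gt 120 ]; then
  echo "warning -> Warn"
else
  echo "healthy -> Pass"
fi
```

The sample age of 200 seconds falls in the warning band (over 2 minutes, under 5).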
## Why It Matters

Heartbeats are the primary signal that an agent is alive and accepting work. A stale heartbeat means the agent has stopped communicating with the orchestrator -- it may have crashed, lost network connectivity, or had its mTLS certificate expire. Tasks dispatched to a stale agent will time out, and the lack of timely detection causes deployment delays and alert fatigue.

## Common Causes

- Agent process has crashed or stopped
- Network connectivity issue between agent and orchestrator
- Firewall blocking agent heartbeat traffic (typically HTTPS on port 8443)
- Agent host is unreachable or powered off
- mTLS certificate has expired (see `check.agent.certificate.expiry`)
- Agent is under heavy load (warning-level)
- Network latency between agent and orchestrator (warning-level)
- Agent is processing long-running tasks that block the heartbeat loop (warning-level)

## How to Fix

### Docker Compose

```bash
# Check agent container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent

# View agent logs for crash or error messages
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 200

# Restart agent container
docker compose -f devops/compose/docker-compose.stella-ops.yml restart agent

# Verify network connectivity from agent to orchestrator
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  curl -k https://orchestrator:8443/health
```

### Bare Metal / systemd

```bash
# Check agent service status
systemctl status stella-agent

# View recent agent logs
journalctl -u stella-agent --since '10 minutes ago'

# Run agent diagnostics
stella agent doctor

# Check network connectivity to orchestrator
curl -k https://orchestrator:8443/health

# If certificate expired, renew it
stella agent renew-cert --force

# Restart the service
sudo systemctl restart stella-agent
```

### Kubernetes / Helm

```bash
# Check agent pod status and restarts
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops

# View agent pod logs
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=200

# Check network policy allowing agent -> orchestrator traffic
kubectl get networkpolicy -n stellaops

# Restart agent pods via rollout
kubectl rollout restart deployment/stellaops-agent -n stellaops
```

## Verification

```bash
stella doctor run --check check.agent.heartbeat.freshness
```

## Related Checks

- `check.agent.stale` -- detects agents offline for hours/days (longer threshold than heartbeat freshness)
- `check.agent.certificate.expiry` -- expired certificates cause heartbeat authentication failures
- `check.agent.capacity` -- heartbeat failures reduce effective fleet capacity
- `check.agent.resource.utilization` -- overloaded agents may delay heartbeats
103
docs/doctor/articles/agent/resource-utilization.md
Normal file
@@ -0,0 +1,103 @@
---
checkId: check.agent.resource.utilization
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, resource, performance, capacity]
---

# Agent Resource Utilization

## What It Checks

Monitors CPU, memory, and disk utilization across the agent fleet. The check is designed to verify:

1. CPU utilization per agent
2. Memory utilization per agent
3. Disk space per agent (for task workspace, logs, and cached artifacts)
4. Resource usage trends (increasing/stable/decreasing)

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true, so the check will always appear in results.

## Why It Matters

Agents that exhaust CPU, memory, or disk become unable to execute tasks reliably. CPU saturation causes task timeouts; memory exhaustion triggers OOM kills that look like intermittent crashes; disk exhaustion prevents artifact downloads and log writes. Proactive monitoring prevents these cascading failures before they impact deployment SLAs.

## Common Causes

- Agent running too many concurrent tasks for its resource allocation
- Disk filled by accumulated scan artifacts, logs, or cached images
- Memory leak in long-running agent process
- Noisy neighbor on shared infrastructure consuming resources
- Resource limits not configured (no cgroup/container memory cap)

## How to Fix

### Docker Compose

```bash
# Check agent container resource usage
docker stats --no-stream $(docker compose -f devops/compose/docker-compose.stella-ops.yml ps -q agent)

# Set resource limits in a compose override
# (docker-compose.override.yml):
# services:
#   agent:
#     deploy:
#       resources:
#         limits:
#           cpus: '2.0'
#           memory: 4G

# Clean up old task artifacts
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent cleanup --older-than 7d
```

### Bare Metal / systemd

```bash
# Check resource usage
stella agent health <agent-id>

# View system resources on agent host
top -bn1 | head -20
df -h /var/lib/stellaops

# Clean up old task artifacts
stella agent cleanup --older-than 7d

# Adjust concurrent task limit
stella agent config --agent-id <agent-id> --set max_concurrent_tasks=4
```

### Kubernetes / Helm

```bash
# Check agent pod resource usage
kubectl top pods -l app.kubernetes.io/component=agent -n stellaops

# Set resource requests and limits in Helm values:
# agent:
#   resources:
#     requests:
#       cpu: "500m"
#       memory: "1Gi"
#     limits:
#       cpu: "2000m"
#       memory: "4Gi"
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Check if pods are being OOM-killed
kubectl get events -n stellaops --field-selector reason=OOMKilling
```

## Verification

```bash
stella doctor run --check check.agent.resource.utilization
```

## Related Checks

- `check.agent.capacity` -- resource exhaustion reduces effective capacity
- `check.agent.heartbeat.freshness` -- resource saturation can delay heartbeats
- `check.agent.task.backlog` -- high utilization combined with backlog indicates need to scale
91
docs/doctor/articles/agent/stale.md
Normal file
@@ -0,0 +1,91 @@
---
checkId: check.agent.stale
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, maintenance, cleanup]
---

# Stale Agent Detection

## What It Checks

Identifies agents that have been offline (no heartbeat) for extended periods and may need investigation or decommissioning. The check inspects all non-revoked, non-inactive agents and categorizes them:

1. **Decommission candidates** -- offline for more than **7 days**. Result: **Warn**, listing each agent with days offline.
2. **Stale** -- offline for more than **1 hour** but less than 7 days. Result: **Warn**, listing each agent with hours offline.
3. **All healthy** -- no agents exceed the 1-hour stale threshold. Result: **Pass**.

The check uses `LastHeartbeatAt` from the agent store. Agents with no recorded heartbeat (`null`) are treated as having `TimeSpan.MaxValue` offline duration.

Evidence collected: `DecommissionCandidates` count, `StaleAgents` count, per-agent names with offline durations.
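The offline-duration bucketing can be reproduced in shell. The timestamp below is simulated (a stand-in for `LastHeartbeatAt`), and GNU `date -d` is assumed:

```shell
# Bucket an agent by hours offline: > 168 h (7 days) = decommission
# candidate, > 1 h = stale. The heartbeat timestamp here is simulated.
last=$(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)
hours_offline=$(( ( $(date +%s) - $(date -d "$last" +%s) ) / 3600 ))
if [ "$hours_offline" -gt 168 ]; then
  echo "decommission candidate"
elif [ "$hours_offline" -gt 1 ]; then
  echo "stale"
else
  echo "healthy"
fi
```

With a heartbeat two hours old, the agent lands in the stale bucket, matching the check's middle category.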
## Why It Matters

Stale agents consume fleet management overhead, confuse capacity planning, and may hold allocated resources (IP addresses, certificates, license seats) that could be reclaimed. An agent that has been offline for 7+ days is unlikely to return without intervention and should be explicitly deactivated or investigated. Ignoring stale agents leads to a growing inventory of ghost entries that obscure the true fleet state.

## Common Causes

- Agent host has been permanently removed (decommissioned hardware, terminated cloud instance)
- Agent was replaced by a new instance but the old registration was not deactivated
- Infrastructure change (network re-architecture, datacenter migration) without cleanup
- Agent host is undergoing extended maintenance
- Network partition isolating the agent
- Agent process crash without auto-restart configured (systemd restart policy missing)

## How to Fix

### Docker Compose

```bash
# List all agent registrations with status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent list --all

# Deactivate a stale agent
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent deactivate --agent-id <agent-id>
```

### Bare Metal / systemd

```bash
# Review stale agents
stella agent list --status stale

# Deactivate agents that are no longer needed
stella agent deactivate --agent-id <agent-id>

# If the agent should still be active, investigate the host
ssh <agent-host> 'systemctl status stella-agent'

# Check network connectivity from the agent host
ssh <agent-host> 'curl -k https://orchestrator:8443/health'

# Restart agent on the host
ssh <agent-host> 'sudo systemctl restart stella-agent'
```

### Kubernetes / Helm

```bash
# Check for terminated or evicted agent pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops --field-selector=status.phase!=Running

# Remove stale agent registrations via API
stella agent deactivate --agent-id <agent-id>

# If pod was evicted, check node status
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions
```

## Verification

```bash
stella doctor run --check check.agent.stale
```

## Related Checks

- `check.agent.heartbeat.freshness` -- short-term heartbeat staleness (minutes vs. hours/days)
- `check.agent.capacity` -- stale agents do not contribute to capacity
- `check.agent.certificate.expiry` -- long-offline agents likely have expired certificates
92
docs/doctor/articles/agent/task-backlog.md
Normal file
@@ -0,0 +1,92 @@
---
checkId: check.agent.task.backlog
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, queue, capacity]
---
# Task Queue Backlog

## What It Checks

Monitors the pending task queue depth across the agent fleet to detect capacity issues. The check is designed to evaluate:

1. Total queued tasks across the entire fleet
2. Age of the oldest queued task (how long tasks wait before dispatch)
3. Queue growth rate trend (growing, stable, or draining)

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
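Until the real implementation lands, the intended evaluation can be sketched roughly like this (hypothetical thresholds chosen purely for illustration -- the shipped check does not enforce any):

```bash
# Hypothetical backlog evaluation: warn when the queue is deep or the
# oldest task has waited too long before dispatch. Thresholds are
# illustrative, not the check's real values.
evaluate_backlog() {
  local queued=$1 oldest_wait_min=$2
  if [ "$queued" -gt 100 ] || [ "$oldest_wait_min" -gt 30 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

evaluate_backlog 12 5     # pass
evaluate_backlog 250 5    # warn (deep queue)
evaluate_backlog 12 45    # warn (old tasks)
```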
## Why It Matters

A growing task backlog means agents cannot keep up with incoming work. Tasks age in the queue, SLA timers expire, and users experience delayed deployments and scan results. If the backlog grows unchecked, it can cascade: delayed scans block policy gates, which block promotions, which block release trains. Detecting backlog growth early allows operators to scale the fleet or prioritize the queue.

## Common Causes

- Insufficient agent count for current workload
- One or more agents offline, reducing effective fleet capacity
- Task burst from bulk operations (mass rescans, environment-wide deployments)
- Slow tasks monopolizing agent slots (large image scans, complex builds)
- Task dispatch paused due to configuration or freeze window

## How to Fix

### Docker Compose

```bash
# Check current queue depth
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status queued --count

# Scale agents to reduce backlog
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d --scale agent=3

# Increase concurrent task limit per agent
# Set environment variable in compose override:
# AGENT__MAXCONCURRENTTASKS=8
```

### Bare Metal / systemd

```bash
# Check queue depth and oldest task
stella agent tasks --status queued

# Increase concurrent task limit
stella agent config --agent-id <id> --set max_concurrent_tasks=8

# Add more agents to the fleet
stella agent bootstrap --name agent-03 --env production --platform linux
```

### Kubernetes / Helm

```bash
# Check queue depth
kubectl exec -it deploy/stellaops-agent -n stellaops -- \
  stella agent tasks --status queued --count

# Scale agent deployment
kubectl scale deployment stellaops-agent --replicas=5 -n stellaops

# Or use HPA for auto-scaling:
# agent:
#   autoscaling:
#     enabled: true
#     minReplicas: 2
#     maxReplicas: 10
#     targetCPUUtilizationPercentage: 70
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```
stella doctor run --check check.agent.task.backlog
```

## Related Checks

- `check.agent.capacity` -- backlog grows when capacity is insufficient
- `check.agent.task.failure.rate` -- failed tasks may be re-queued, inflating the backlog
- `check.agent.resource.utilization` -- saturated agents process tasks slowly
- `check.agent.heartbeat.freshness` -- offline agents reduce dispatch targets
82
docs/doctor/articles/agent/task-failure-rate.md
Normal file
@@ -0,0 +1,82 @@
---
checkId: check.agent.task.failure.rate
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, task, failure, reliability]
---
# Task Failure Rate

## What It Checks

Monitors the task failure rate across the agent fleet to detect systemic issues. The check is designed to evaluate:

1. Overall task failure rate over the last hour
2. Per-agent failure rate to isolate problematic agents
3. Failure rate trend (increasing, decreasing, or stable)
4. Common failure reasons to guide remediation

**Current status:** implementation pending -- the check always returns Pass with a placeholder message. The `CanRun` method always returns true.
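The rate computation the check is expected to perform reduces to a simple ratio; a sketch with an assumed 10% warn threshold (the shipped check does not yet enforce one):

```bash
# Compute a failure rate over a window and classify it.
# The 10% threshold is an assumption for illustration.
classify_failure_rate() {
  awk -v f="$1" -v t="$2" 'BEGIN {
    r = (t ? 100 * f / t : 0)
    print (r > 10 ? "warn" : "pass")
  }'
}

classify_failure_rate 3 40    # 7.5% of tasks failed -> pass
classify_failure_rate 8 40    # 20% of tasks failed  -> warn
```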
## Why It Matters

A rising task failure rate is an early indicator of systemic problems: infrastructure issues, misconfigured environments, expired credentials, or agent software bugs. Catching a spike before it reaches 100% failure allows operators to intervene, roll back, or redirect tasks to healthy agents before an outage fully materializes.

## Common Causes

- Registry or artifact store unreachable (tasks cannot pull images)
- Expired credentials used by tasks (registry tokens, cloud provider keys)
- Agent software bug introduced by recent update
- Target environment misconfigured (wrong endpoints, firewall rules)
- Disk exhaustion on agent hosts preventing artifact staging
- OOM kills during resource-intensive tasks (scans, builds)

## How to Fix

### Docker Compose

```bash
# Check agent logs for task failures
docker compose -f devops/compose/docker-compose.stella-ops.yml logs agent --tail 500 | \
  grep -i "task.*fail\|error\|exception"

# Review recent task history
docker compose -f devops/compose/docker-compose.stella-ops.yml exec agent \
  stella agent tasks --status failed --last 1h
```

### Bare Metal / systemd

```bash
# View failed tasks
stella agent tasks --status failed --last 1h

# Check per-agent failure rates
stella agent health <agent-id> --show-tasks

# Review agent logs for failure patterns
journalctl -u stella-agent --since '1 hour ago' | grep -i 'fail\|error'
```

### Kubernetes / Helm

```bash
# Check agent pod logs for task errors
kubectl logs -l app.kubernetes.io/component=agent -n stellaops --tail=500 | \
  grep -i "task.*fail\|error"

# Check pod events for OOM or crash signals
kubectl get events -n stellaops --sort-by='.lastTimestamp' | grep -i agent
```

## Verification

```
stella doctor run --check check.agent.task.failure.rate
```

## Related Checks

- `check.agent.resource.utilization` -- resource exhaustion causes task failures
- `check.agent.task.backlog` -- high failure rate combined with backlog indicates systemic issue
- `check.agent.heartbeat.freshness` -- crashing agents fail tasks and go stale
- `check.agent.version.consistency` -- version skew can cause task compatibility failures
81
docs/doctor/articles/agent/version-consistency.md
Normal file
@@ -0,0 +1,81 @@
---
checkId: check.agent.version.consistency
plugin: stellaops.doctor.agent
severity: warn
tags: [agent, version, maintenance]
---
# Agent Version Consistency

## What It Checks

Groups all non-revoked, non-inactive agents by their reported `Version` field and evaluates version skew:

1. **Single version** across all agents: **Pass** -- all agents are consistent.
2. **Two versions** with skew affecting less than half the fleet: **Pass** (minor skew acceptable).
3. **Significant skew** (more than 2 distinct versions, or outdated agents exceed half the fleet): **Warn** with evidence listing the version distribution and up to 10 outdated agent names.
4. **No active agents**: **Skip**.

The "majority version" is the version running on the most agents. All other versions are considered outdated. Evidence collected: `MajorityVersion`, `VersionDistribution` (e.g., "1.5.0: 8, 1.4.2: 2"), `OutdatedAgents` (list of names with their versions).
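The majority computation can be reproduced from a plain list of reported versions (a sketch using standard tools, not the service's code):

```bash
# One reported version per line; find the majority version and count
# the agents running anything else (the "outdated" set).
versions='1.5.0
1.5.0
1.4.2
1.5.0'

majority=$(printf '%s\n' "$versions" | sort | uniq -c | sort -rn | awk 'NR==1 {print $2}')
outdated=$(printf '%s\n' "$versions" | grep -cvxF "$majority")

echo "majority: $majority, outdated agents: $outdated"   # majority: 1.5.0, outdated agents: 1
```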
## Why It Matters

Version skew across the agent fleet can cause subtle compatibility issues: newer agents may support task types that older agents reject, protocol changes may cause heartbeat or dispatch failures, and mixed versions make incident triage harder because behavior differs across agents. Keeping the fleet consistent reduces operational surprises.

## Common Causes

- Auto-update is disabled on some agents
- Some agents failed to update (download failure, permission issue, disk full)
- Phased rollout in progress (expected, temporary skew)
- Agents on isolated networks that cannot reach the update server

## How to Fix

### Docker Compose

```bash
# Check agent image versions
docker compose -f devops/compose/docker-compose.stella-ops.yml ps agent --format json | \
  jq '.[] | {name: .Name, image: .Image}'

# Pull latest image and recreate
docker compose -f devops/compose/docker-compose.stella-ops.yml pull agent
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d agent
```

### Bare Metal / systemd

```bash
# Update outdated agents to target version
stella agent update --version <target-version> --agent-id <id>

# Enable auto-update
stella agent config --agent-id <id> --set auto_update.enabled=true

# Batch update all agents
stella agent update --version <target-version> --all
```

### Kubernetes / Helm

```bash
# Check running image versions across pods
kubectl get pods -l app.kubernetes.io/component=agent -n stellaops \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Update image tag in Helm values and rollout
helm upgrade stellaops stellaops/stellaops --set agent.image.tag=<target-version>

# Monitor rollout
kubectl rollout status deployment/stellaops-agent -n stellaops
```

## Verification

```
stella doctor run --check check.agent.version.consistency
```

## Related Checks

- `check.agent.heartbeat.freshness` -- version mismatch can cause heartbeat protocol failures
- `check.agent.capacity` -- outdated agents may be unable to accept newer task types
142
docs/doctor/articles/attestor/clock-skew.md
Normal file
@@ -0,0 +1,142 @@
---
checkId: check.attestation.clock.skew
plugin: stellaops.doctor.attestor
severity: fail
tags: [attestation, time, ntp, quick, setup]
---
# Clock Skew

## What It Checks

Verifies that the system clock is synchronized accurately enough for attestation validity by comparing local time against the Rekor transparency log server's `Date` response header. Additionally collects NTP daemon status and virtual machine detection as discriminating evidence for root-cause analysis.

**Threshold:** maximum allowed skew is **5 seconds** (`MaxSkewSeconds`).

The check performs these steps:

1. Collects NTP status: daemon type (chronyd, ntpd, systemd-timesyncd, w32time), running state, configured servers, last sync time, and sync age.
2. Detects virtual machine environment: VMware, Hyper-V, KVM, Xen, or container. Checks whether VM clock synchronization is enabled.
3. Sends an HTTP GET to `{rekorUrl}/api/v1/log` (configured via `Attestor:Rekor:Url` or `Transparency:Rekor:Url`, defaults to `https://rekor.sigstore.dev`) with a 5-second timeout.
4. Extracts server time from the HTTP `Date` header.
5. Computes skew as `localTime - serverTime`.

Results:

- **Skew <= 5s**: **Pass** with exact skew value.
- **Skew > 5s**: **Fail** with skew, NTP status, and VM detection evidence. Remediation steps are platform-specific (Linux: chronyd/ntpd/timesyncd; Windows: w32time; VM: clock sync integration).
- **Server unreachable or non-2xx**: **Warn** (cannot verify, includes NTP evidence).
- **No Date header**: **Skip**.
- **HTTP exception**: **Warn** with classified error type (ssl_error, dns_failure, refused, timeout, connection_failed).
- **Timeout**: **Warn** with a 5-second timeout note.

Evidence collected: `local_time_utc`, `server_time_utc`, `skew_seconds`, `max_allowed_skew`, `ntp_daemon_running`, `ntp_daemon_type`, `ntp_servers_configured`, `last_sync_time_utc`, `sync_age_seconds`, `is_virtual_machine`, `vm_type`, `vm_clock_sync_enabled`, `connection_error_type`.
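The skew computation can be reproduced by hand. A sketch using GNU `date`; because the live measurement needs network access, the helper takes the server's `Date` header and a local epoch timestamp as inputs:

```bash
# Skew = local time minus server time, in seconds.
skew_seconds() {
  echo $(( $2 - $(date -ud "$1" +%s) ))
}

# Live measurement (requires network access to Rekor):
#   hdr=$(curl -sI https://rekor.sigstore.dev/api/v1/log | tr -d '\r' | sed -n 's/^[Dd]ate: //p')
#   skew_seconds "$hdr" "$(date -u +%s)"

skew_seconds "Thu, 01 Jan 2026 00:00:00 GMT" 1767225605   # 5 (local clock 5s ahead)
```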
## Why It Matters

Attestation timestamps must be accurate for signature validity. Rekor transparency log entries include timestamps that are verified against the signing time. If the system clock is skewed beyond the tolerance, attestations may be rejected as invalid, signatures may fail verification, and OIDC tokens used in keyless signing will be rejected for having future or expired timestamps. Even a few seconds of skew can cause intermittent failures that are difficult to diagnose.

## Common Causes

- NTP service not running (stopped, disabled, or not installed)
- NTP server unreachable (firewall, DNS, or network issue)
- System clock manually set incorrectly
- Virtual machine clock drift (common when VM clock sync is disabled)
- Container relying on host clock which is itself drifted
- Hibernation/resume causing sudden clock jump

## How to Fix

### Docker Compose

Docker containers inherit the host clock. Fix the host time:

```bash
# Check host time
date -u

# Linux host: ensure NTP is running
sudo timedatectl set-ntp true
sudo systemctl start systemd-timesyncd

# Windows host: resync time
w32tm /resync /nowait
```

### Bare Metal / systemd

**Linux with chronyd:**
```bash
# Start NTP service
sudo systemctl start chronyd

# Enable NTP synchronization
sudo timedatectl set-ntp true

# Force immediate sync
sudo chronyc -a makestep

# Check status
timedatectl status
chronyc tracking
```

**Linux with ntpd:**
```bash
# Start NTP service
sudo systemctl start ntpd

# Enable NTP synchronization
sudo timedatectl set-ntp true

# Force immediate sync
sudo ntpdate -u pool.ntp.org
```

**Linux with systemd-timesyncd:**
```bash
# Start and enable
sudo systemctl start systemd-timesyncd
sudo timedatectl set-ntp true
```

**Windows:**
```bash
# Start Windows Time service
net start w32time

# Force time resync
w32tm /resync /nowait

# Check status
w32tm /query /status
```

**Virtual machine with clock sync disabled:**
```
Enable time synchronization in Hyper-V Integration Services or VMware Tools settings.
```

### Kubernetes / Helm

Kubernetes pods inherit the node clock. Fix the node:

```bash
# Check node time
kubectl debug node/<node-name> -it --image=busybox -- date -u

# Ensure NTP is configured on all nodes (varies by OS)
# For systemd-based nodes:
ssh <node> 'sudo timedatectl set-ntp true'
```

## Verification

```
stella doctor run --check check.attestation.clock.skew
```

## Related Checks

- `check.attestation.rekor.connectivity` -- clock skew check requires Rekor connectivity
- `check.attestation.rekor.verification.job` -- verification job can fail due to clock skew
- `check.attestation.transparency.consistency` -- timestamp accuracy affects consistency proofs
128
docs/doctor/articles/attestor/cosign-keymaterial.md
Normal file
@@ -0,0 +1,128 @@
---
checkId: check.attestation.cosign.keymaterial
plugin: stellaops.doctor.attestor
severity: fail
tags: [attestation, cosign, signing, setup]
---
# Cosign Key Material

## What It Checks

Verifies that signing key material is available for container image attestation. The check reads the signing mode from configuration (`Attestor:Signing:Mode` or `Signing:Mode`, defaulting to `keyless`) and validates the appropriate key material for that mode:

### Keyless mode
Checks that the Fulcio URL is configured (defaults to `https://fulcio.sigstore.dev`). Uses OIDC identity for signing -- no persistent key material required. Result: **Pass** if configured.

### File mode
1. If `KeyPath` is not configured: **Fail** with "KeyPath not configured".
2. If the key file does not exist at the configured path: **Fail** with "Signing key file not found".
3. If the key file cannot be read (permission error): **Fail** with the error message.
4. If the key file exists and is readable: **Pass** with file size and last modification time.

### KMS mode
1. If `KmsKeyRef` is not configured: **Fail** with "KmsKeyRef not configured".
2. If configured, the check parses the KMS provider from the key reference URI prefix (`awskms://`, `gcpkms://`, `azurekms://`, `hashivault://`) and reports it. Result: **Pass** with provider name and key reference.

### Unknown mode
**Fail** with "Unknown signing mode" and the list of supported modes.

Evidence collected varies by mode: `SigningMode`, `FulcioUrl`, `KeyPath`, `FileExists`, `FileSize`, `LastModified`, `KmsKeyRef`, `Provider`.
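The provider detection is a straightforward prefix match; a sketch mirroring the scheme prefixes listed above (the provider labels are illustrative, not the check's exact strings):

```bash
# Map a KMS key reference URI to a provider label by scheme prefix.
kms_provider() {
  case "$1" in
    awskms://*)     echo "aws" ;;
    gcpkms://*)     echo "gcp" ;;
    azurekms://*)   echo "azure" ;;
    hashivault://*) echo "hashivault" ;;
    *)              echo "unknown" ;;
  esac
}

kms_provider 'awskms:///arn:aws:kms:us-east-1:123456789:key/abcd-1234'   # aws
kms_provider 'file:///etc/stellaops/cosign.key'                          # unknown
```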
## Why It Matters

Without valid signing key material, the Attestor cannot sign container images, SBOMs, or provenance attestations. Unsigned artifacts cannot pass policy gates that require signature verification, blocking the entire release pipeline. This check ensures the signing infrastructure is correctly configured before any signing operations are attempted.

## Common Causes

- KeyPath not set in configuration (file mode, incomplete setup)
- Configuration file not loaded (missing appsettings, environment variable not set)
- Key file was moved or deleted from the configured path
- Wrong path configured (typo, path changed during migration)
- Key file not yet generated (first-run setup incomplete)
- KmsKeyRef not configured (KMS mode, missing configuration)
- Unknown or misspelled signing mode in configuration

## How to Fix

### Docker Compose

For **file** mode:
```bash
# Generate a new Cosign key pair
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  cosign generate-key-pair --output-key-prefix stellaops

# Set key path in environment
# ATTESTOR__SIGNING__MODE=file
# ATTESTOR__SIGNING__KEYPATH=/etc/stellaops/cosign.key

# Restart attestor
docker compose -f devops/compose/docker-compose.stella-ops.yml restart attestor
```

For **keyless** mode:
```bash
# Set signing mode to keyless (default)
# ATTESTOR__SIGNING__MODE=keyless
# ATTESTOR__FULCIO__URL=https://fulcio.sigstore.dev
```

For **KMS** mode:
```bash
# Set KMS key reference
# ATTESTOR__SIGNING__MODE=kms
# ATTESTOR__SIGNING__KMSKEYREF=awskms:///arn:aws:kms:us-east-1:123456789:key/abcd-1234
```

### Bare Metal / systemd

```bash
# Configure signing mode
stella attestor signing configure --mode keyless

# For file mode, generate keys
cosign generate-key-pair --output-key-prefix stellaops
stella attestor signing configure --mode file --key-path /etc/stellaops/cosign.key

# For KMS mode (AWS example)
stella attestor signing configure --mode kms \
  --kms-key-ref 'awskms:///arn:aws:kms:us-east-1:123456789:key/abcd-1234'

# For KMS mode (GCP example)
stella attestor signing configure --mode kms \
  --kms-key-ref 'gcpkms://projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key'

# Check if key exists at another location
find /etc/stellaops -name '*.key' -o -name 'cosign*'
```

### Kubernetes / Helm

```bash
# For file mode, create secret with key material
kubectl create secret generic stellaops-cosign-key -n stellaops \
  --from-file=cosign.key=/path/to/cosign.key \
  --from-file=cosign.pub=/path/to/cosign.pub

# Set signing configuration in Helm values:
# attestor:
#   signing:
#     mode: "kms" # or "file" or "keyless"
#     kmsKeyRef: "awskms:///arn:aws:kms:..."
#     # For file mode:
#     # keyPath: "/etc/stellaops/cosign.key"
#     # keySecret: "stellaops-cosign-key"
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```
stella doctor run --check check.attestation.cosign.keymaterial
```

## Related Checks

- `check.attestation.keymaterial` -- signing key expiration monitoring
- `check.attestation.rekor.connectivity` -- Rekor required for keyless signing verification
- `check.attestation.clock.skew` -- clock accuracy required for keyless OIDC tokens
110
docs/doctor/articles/attestor/keymaterial.md
Normal file
@@ -0,0 +1,110 @@
---
checkId: check.attestation.keymaterial
plugin: stellaops.doctor.attestor
severity: warn
tags: [attestation, signing, security, expiration]
---
# Signing Key Expiration

## What It Checks

Monitors the expiration timeline of attestation signing keys. The check reads the signing mode from configuration and, for modes that use expiring keys (file, kms, certificate), retrieves key information and classifies each key:

1. **Expired** -- key has already expired (`daysUntilExpiry < 0`). Result: **Fail** with the list of expired key IDs.
2. **Critical** -- key expires within **7 days**. Result: **Fail** with key IDs and days remaining.
3. **Warning** -- key expires within **30 days**. Result: **Warn** with key IDs and days remaining.
4. **Healthy** -- all keys have more than 30 days until expiration. Result: **Pass** with key count and per-key expiry dates (up to 5 keys shown).

For **keyless** signing mode, the check returns **Skip** because keyless signing does not use expiring key material.

If no signing keys are found, the check returns **Skip** with a note that no file-based or certificate-based keys were found.

Evidence collected: `ExpiredKeys` (list of IDs), `CriticalKeys` (ID + days), `WarningKeys` (ID + days), `TotalKeys`, `HealthyKeys`, per-key entries showing `Key:<id>` with expiry date and days remaining.

Thresholds:

- Warning: 30 days (`WarningDays`)
- Critical: 7 days (`CriticalDays`)
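The classification reduces to comparing days-until-expiry against the two thresholds (a sketch of the logic, not the service's code; exact boundary handling is an assumption):

```bash
# Classify a key by days until expiry, mirroring the documented thresholds.
classify_key() {
  if [ "$1" -lt 0 ]; then echo "expired"
  elif [ "$1" -le 7 ]; then echo "critical"
  elif [ "$1" -le 30 ]; then echo "warning"
  else echo "healthy"
  fi
}

classify_key -3    # expired
classify_key 5     # critical
classify_key 20    # warning
classify_key 90    # healthy
```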
## Why It Matters

Expired signing keys make it impossible to create new attestations, blocking the release pipeline at policy gates that require signed artifacts. Keys approaching expiration should be rotated proactively to ensure overlap between old and new keys, allowing verifiers to accept signatures from both during the transition period. Without monitoring, key expiration causes a sudden, hard outage.

## Common Causes

- Keys were not rotated before expiration (manual process forgotten)
- Scheduled rotation job failed (permissions, connectivity)
- Key expiration not monitored (no alerting configured)
- Normal lifecycle -- keys approaching the warning threshold (plan rotation)
- Rotation reminders not configured

## How to Fix

### Docker Compose

```bash
# Check key status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  stella keys status

# Rotate expired or critical keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  stella keys rotate <key-id>

# Set up expiration monitoring
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  stella notify channels add --type email --event key.expiring --threshold-days 30
```

### Bare Metal / systemd

```bash
# Rotate expired keys immediately
stella keys rotate <expired-key-id>

# Set up key expiration monitoring
stella notify channels add --type email --event key.expiring --threshold-days 30

# Schedule immediate key rotation for critical keys (with overlap)
stella keys rotate <critical-key-id> --overlap-days 7

# Plan rotation for warning-level keys (dry run first)
stella keys rotate <warning-key-id> --dry-run

# Execute rotation with overlap period
stella keys rotate <warning-key-id> --overlap-days 14

# Review all key status
stella keys status
```

### Kubernetes / Helm

```bash
# Check key status
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  stella keys status

# Rotate keys
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  stella keys rotate <key-id> --overlap-days 14

# Configure automatic key rotation in Helm values:
# attestor:
#   signing:
#     autoRotate: true
#     rotationBeforeDays: 30
#     overlapDays: 14
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```
stella doctor run --check check.attestation.keymaterial
```

## Related Checks

- `check.attestation.cosign.keymaterial` -- verifies key material availability (existence, not expiration)
- `check.auth.signing-key` -- auth signing key health (separate from attestation keys)
- `check.attestation.rekor.verification.job` -- expired keys cause verification failures
115
docs/doctor/articles/attestor/rekor-connectivity.md
Normal file
@@ -0,0 +1,115 @@
---
checkId: check.attestation.rekor.connectivity
plugin: stellaops.doctor.attestor
severity: fail
tags: [attestation, rekor, transparency, quick, setup]
---
# Rekor Connectivity

## What It Checks

Tests connectivity to the Rekor transparency log by sending an HTTP GET request to the log info endpoint (`{rekorUrl}/api/v1/log`). The Rekor URL is read from configuration (`Attestor:Rekor:Url` or `Transparency:Rekor:Url`, defaulting to `https://rekor.sigstore.dev`). Request timeout is 10 seconds.

Results:

1. **HTTP 2xx success**: **Pass**. Parses the response JSON for `treeSize` and reports the endpoint, response latency (ms), and current tree size.
2. **HTTP non-2xx**: **Fail** with status code and latency.
3. **Connection timeout** (`TaskCanceledException`): **Fail** with "Connection timeout (10s)".
4. **HTTP request exception** (DNS failure, SSL error, connection refused): **Fail** with the exception message.

Evidence collected: `Endpoint`, `Latency` (ms), `TreeSize`, `StatusCode`, `Error`.

The check always runs (`CanRun` returns true) because Rekor connectivity is essential for attestation.
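The probe can be reproduced with `curl`; a sketch that separates the network request from the testable status classification (the 10-second timeout mirrors the check; the classifier labels are illustrative):

```bash
# Classify an HTTP status code the way the check does: 2xx passes,
# anything else fails. curl reports 000 when no connection was made.
classify_status() {
  case "$1" in
    2??) echo "pass" ;;
    000) echo "fail: no connection" ;;
    *)   echo "fail: http $1" ;;
  esac
}

# Live probe (requires network access to Rekor):
#   code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' https://rekor.sigstore.dev/api/v1/log)
#   classify_status "$code"

classify_status 200   # pass
classify_status 503   # fail: http 503
```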

## Why It Matters

Rekor is the transparency log that records attestation entries, providing tamper-evident proof that signatures were created at a specific time. Without Rekor connectivity, the Attestor cannot submit new log entries, and verifiers cannot confirm that attestations were properly logged. In non-air-gapped deployments, Rekor connectivity is a hard requirement for the signing and verification pipeline.

## Common Causes

- Rekor service is down or undergoing maintenance
- Network connectivity issue (proxy not configured, routing problem)
- Firewall blocking outbound HTTPS (port 443)
- DNS resolution failure for `rekor.sigstore.dev`
- Wrong Rekor endpoint configured
- SSL/TLS handshake failure (expired CA cert, corporate MITM proxy)
- Air-gapped environment without offline bundle configured

## How to Fix

### Docker Compose

```bash
# Test Rekor connectivity from the attestor container
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# Check DNS resolution
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  nslookup rekor.sigstore.dev

# Set Rekor URL in environment
# ATTESTOR__REKOR__URL=https://rekor.sigstore.dev

# For air-gapped environments, configure offline mode
# ATTESTOR__OFFLINE__ENABLED=true
```

### Bare Metal / systemd

```bash
# Test Rekor connectivity manually
curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# Check network connectivity
nc -zv rekor.sigstore.dev 443

# Check DNS resolution
nslookup rekor.sigstore.dev

# Check SSL certificates
openssl s_client -connect rekor.sigstore.dev:443 -brief

# Verify Rekor URL configuration
grep -r 'rekor' /etc/stellaops/*.yaml

# For air-gapped environments, download offline bundle
stella attestor offline-bundle download --output /var/lib/stellaops/rekor-offline

# Enable offline mode
stella attestor config set --key offline.enabled --value true
```

### Kubernetes / Helm

```bash
# Test from attestor pod
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# Check egress NetworkPolicy
kubectl get networkpolicy -n stellaops -o yaml | grep -A10 egress

# Set Rekor URL in Helm values
# attestor:
#   rekor:
#     url: "https://rekor.sigstore.dev"
#   # For air-gapped:
#   # offline:
#   #   enabled: true
#   #   bundlePath: "/var/lib/stellaops/rekor-offline"
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```bash
stella doctor run --check check.attestation.rekor.connectivity
```

## Related Checks

- `check.attestation.clock.skew` -- clock accuracy affects Rekor entry timestamps
- `check.attestation.transparency.consistency` -- consistency check requires Rekor connectivity
- `check.attestation.rekor.verification.job` -- verification job depends on Rekor access
- `check.attestation.cosign.keymaterial` -- keyless signing requires Rekor for transparency logging
138
docs/doctor/articles/attestor/rekor-verification-job.md
Normal file
@@ -0,0 +1,138 @@
---
checkId: check.attestation.rekor.verification.job
plugin: stellaops.doctor.attestor
severity: warn
tags: [attestation, rekor, verification, background]
---
# Rekor Verification Job

## What It Checks

Monitors the health of the periodic background job that re-verifies attestation entries stored in Rekor. The check queries `IRekorVerificationStatusProvider` from DI and evaluates several conditions in priority order:

1. **Service not registered**: **Skip** if `IRekorVerificationStatusProvider` is not in the DI container.
2. **Never run**: **Warn** if `LastRunAt` is null (the job has never executed).
3. **Critical alerts**: **Fail** if `CriticalAlertCount > 0` (possible log tampering, root hash mismatch, mass signature failures).
4. **Root consistency failed**: **Fail** if `RootConsistent` is false (stored checkpoint disagrees with remote log state).
5. **Stale run**: **Warn** if the job has not run in more than **48 hours**.
6. **High failure rate**: **Warn** if `FailureRate > 10%` (more than 10% of verified entries failed).
7. **Healthy**: **Pass** with last run time, status, entries verified, failure rate, root consistency, and duration.

The check only runs when verification is enabled (`Attestor:Verification:Enabled` or `Transparency:Verification:Enabled` is not set to `false`).

Evidence collected: `LastRun`, `LastRunStatus`, `IsRunning`, `NextScheduledRun`, `CriticalAlerts`, `RootConsistent`, `LastConsistencyCheck`, `HoursSinceLastRun`, `EntriesVerified`, `EntriesFailed`, `FailureRate`, `TimeSkewViolations`, `Duration`.
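The priority ordering above can be sketched as a small shell function. The argument names mirror the provider fields and the thresholds are those stated above; this is an illustration, not the actual C# implementation:

```shell
classify_verification() {
  # Arguments mirror IRekorVerificationStatusProvider fields (illustrative names)
  last_run_at=$1 critical_alerts=$2 root_consistent=$3 hours_since=$4 failure_rate_pct=$5
  # Conditions are evaluated in priority order; the first match wins
  [ -z "$last_run_at" ]            && { echo "Warn: job has never run"; return; }
  [ "$critical_alerts" -gt 0 ]     && { echo "Fail: $critical_alerts critical alerts"; return; }
  [ "$root_consistent" = "false" ] && { echo "Fail: root consistency check failed"; return; }
  [ "$hours_since" -gt 48 ]        && { echo "Warn: last run ${hours_since}h ago (stale)"; return; }
  [ "$failure_rate_pct" -gt 10 ]   && { echo "Warn: failure rate ${failure_rate_pct}%"; return; }
  echo "Pass"
}

classify_verification "2025-01-01T00:00:00Z" 0 true 12 2   # → Pass
```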

## Why It Matters

The verification job is the integrity watchdog for the attestation pipeline. It periodically re-checks that Rekor log entries have not been tampered with, that the root hash is consistent, and that signatures remain valid. Without this job running, an attacker could modify transparency log entries without detection, undermining the entire attestation trust model. A high failure rate may indicate clock skew, key rotation issues, or data corruption.

## Common Causes

- Job was just deployed and has not run yet (first-run delay)
- Job is disabled in configuration
- Background service failed to start (DI error, missing dependency)
- Transparency log tampering detected (critical alert)
- Root hash mismatch with stored checkpoints
- Mass signature verification failures after key rotation
- Background service stopped or scheduler not running (stale run)
- Job stuck or failed repeatedly
- Clock skew causing timestamp validation failures (high failure rate)
- Invalid signatures from previous key rotations
- Corrupted entries in local database

## How to Fix

### Docker Compose

```bash
# Check attestor container status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps attestor

# View verification job logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs attestor --tail 300 | \
  grep -i 'verification\|rekor'

# Trigger manual verification run
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  stella attestor verification run --now

# Enable verification if disabled
# ATTESTOR__VERIFICATION__ENABLED=true

# Restart attestor service
docker compose -f devops/compose/docker-compose.stella-ops.yml restart attestor
```

### Bare Metal / systemd

```bash
# Check if the job is scheduled
stella attestor verification status

# Trigger a manual verification run
stella attestor verification run --now

# Check application logs for errors
journalctl -u stellaops-attestor --since '1 hour ago' | grep -i 'verification\|rekor'

# Review critical alerts
stella attestor verification alerts --severity critical

# Check transparency log status
stella attestor transparency status

# Review failed entries (high failure rate)
stella attestor verification failures --last-run

# Check system clock synchronization (if time skew violations)
timedatectl status

# Re-sync failed entries from Rekor
stella attestor verification resync --failed-only

# Restart the service if the job is stale
sudo systemctl restart stellaops-attestor

# Review recent error logs (stale job)
journalctl -u stellaops-attestor --since '48 hours ago' | grep -i error
```

### Kubernetes / Helm

```bash
# Check attestor pod status
kubectl get pods -l app.kubernetes.io/component=attestor -n stellaops

# View verification logs
kubectl logs -l app.kubernetes.io/component=attestor -n stellaops --tail=300 | \
  grep -i 'verification\|rekor'

# Trigger manual verification
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  stella attestor verification run --now

# Enable verification in Helm values
# attestor:
#   verification:
#     enabled: true
#     intervalHours: 24
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Restart attestor pods
kubectl rollout restart deployment/stellaops-attestor -n stellaops
```

If critical alerts indicate possible log tampering, this may be a security incident. Review the evidence carefully before dismissing alerts.

## Verification

```bash
stella doctor run --check check.attestation.rekor.verification.job
```

## Related Checks

- `check.attestation.rekor.connectivity` -- verification job requires Rekor connectivity
- `check.attestation.transparency.consistency` -- complementary consistency check against stored checkpoints
- `check.attestation.clock.skew` -- clock skew causes verification timestamp failures
- `check.attestation.keymaterial` -- expired signing keys cause verification failures
151
docs/doctor/articles/attestor/transparency-consistency.md
Normal file
@@ -0,0 +1,151 @@
---
checkId: check.attestation.transparency.consistency
plugin: stellaops.doctor.attestor
severity: fail
tags: [attestation, transparency, security]
---
# Transparency Log Consistency

## What It Checks

Verifies that locally stored transparency log checkpoints are consistent with the remote Rekor log. This is a critical security check that detects log rollback or tampering.

The check only runs if a checkpoint path is configured (`Attestor:Transparency:CheckpointPath` or `Transparency:CheckpointPath`) or a checkpoint file exists at the default path (`{AppData}/stellaops/transparency/checkpoint.json`).

Steps performed:

1. **Read stored checkpoint** -- parses the local `checkpoint.json` file containing `TreeSize`, `RootHash`, `UpdatedAt`, and `LogId`.
   - If the file does not exist: **Skip** (the checkpoint will be created on the first verification run).
   - If the JSON is invalid: **Fail** with remediation to remove the corrupted file and re-sync.
   - If the file is empty/null: **Fail**.
2. **Fetch remote log state** -- HTTP GET to `{rekorUrl}/api/v1/log` (10-second timeout). Parses `treeSize` and `rootHash` from the response.
   - If Rekor is unreachable: **Skip** (cannot verify consistency without remote state).
3. **Compare tree sizes**:
   - If the remote tree size < stored tree size: **Fail** with "possible fork/rollback" (the log should only grow, never shrink). This is a CRITICAL security finding.
   - If tree sizes match but root hashes differ: **Fail** with "possible tampering" (same size but different content). This is a CRITICAL security finding.
   - If the remote tree size >= stored tree size and hashes are consistent: **Pass** with entries-behind count and checkpoint age.

Evidence collected: `CheckpointPath`, `Exists`, `StoredTreeSize`, `RemoteTreeSize`, `StoredRootHash`, `RemoteRootHash`, `EntriesBehind`, `CheckpointAge`, `ConsistencyVerified`, `Error`.
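The comparison in step 3 can be sketched in shell with hypothetical values; the real check reads `checkpoint.json` and GETs `{rekorUrl}/api/v1/log`:

```shell
# Hypothetical stored-checkpoint vs. remote-log values
stored_tree_size=1000 stored_root_hash=aaa111
remote_tree_size=1005 remote_root_hash=bbb222

if [ "$remote_tree_size" -lt "$stored_tree_size" ]; then
  # The log should only grow; shrinking indicates a fork or rollback
  result="Fail: possible fork/rollback"
elif [ "$remote_tree_size" -eq "$stored_tree_size" ] && [ "$remote_root_hash" != "$stored_root_hash" ]; then
  # Same size but different content indicates tampering
  result="Fail: possible tampering"
else
  result="Pass: $((remote_tree_size - stored_tree_size)) entries behind"
fi
echo "$result"   # → Pass: 5 entries behind
```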

## Why It Matters

The transparency log is the tamper-evident backbone of the attestation system. If an attacker modifies or rolls back the log, they could hide revoked signatures, alter attestation records, or forge provenance data. This check is the primary defense against such attacks. A root hash mismatch at the same tree size is one of the strongest indicators of log tampering and should trigger an immediate security investigation.

## Common Causes

**For log rollback (remote < stored):**
- Transparency log was actually rolled back (CRITICAL security event)
- Stored checkpoint is from a different Rekor instance
- Man-in-the-middle attack on log queries (network interception)
- Configuration changed to point at a different Rekor server

**For root hash mismatch:**
- Transparency log was modified (CRITICAL security event)
- Man-in-the-middle attack returning forged log state
- Checkpoint file corruption (disk error, incomplete write)

**For corrupted checkpoint file:**
- Disk failure during checkpoint write
- Process killed during checkpoint update
- Manual editing of the checkpoint file

## How to Fix

### Docker Compose

```bash
# Check stored checkpoint
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  cat /app/data/transparency/checkpoint.json | jq .

# Verify you are connecting to the correct Rekor instance
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# If the checkpoint is corrupted, remove it and re-sync
docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  rm /app/data/transparency/checkpoint.json

docker compose -f devops/compose/docker-compose.stella-ops.yml exec attestor \
  stella attestor transparency sync
```

### Bare Metal / systemd

For a **corrupted checkpoint**:
```bash
# Back up the corrupted checkpoint first
cp /path/to/checkpoint.json /path/to/checkpoint.json.bak

# Remove the corrupted checkpoint
rm /path/to/checkpoint.json

# Trigger re-sync
stella attestor transparency sync
```

For **log rollback or hash mismatch** (CRITICAL):
```bash
# CRITICAL: This may indicate a security incident. Do not dismiss without investigation.

# Get current root hash from Rekor
curl -s https://rekor.sigstore.dev/api/v1/log | jq .rootHash

# Compare with stored checkpoint
stella attestor transparency checkpoint show

# Verify you are connecting to the correct Rekor instance
curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# Check stored checkpoint
cat /path/to/checkpoint.json | jq .

# If using the wrong log instance, reset the checkpoint (DESTRUCTIVE -- only after confirming wrong instance)
rm /path/to/checkpoint.json
stella attestor transparency sync

# If the mismatch persists with the correct log, escalate to the security team
```

### Kubernetes / Helm

```bash
# Check stored checkpoint
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  cat /app/data/transparency/checkpoint.json | jq .

# Verify Rekor connectivity
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  curl -s https://rekor.sigstore.dev/api/v1/log | jq .

# If corrupted, remove the checkpoint and re-sync
kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  rm /app/data/transparency/checkpoint.json

kubectl exec -it deploy/stellaops-attestor -n stellaops -- \
  stella attestor transparency sync

# Check checkpoint persistence (PVC)
kubectl get pvc -l app.kubernetes.io/component=attestor -n stellaops

# Set checkpoint path in Helm values
# attestor:
#   transparency:
#     checkpointPath: "/app/data/transparency/checkpoint.json"
```

Root hash mismatches or log rollbacks should be treated as potential security incidents. Do not reset the checkpoint without first investigating whether the remote log was actually compromised.

## Verification

```bash
stella doctor run --check check.attestation.transparency.consistency
```

## Related Checks

- `check.attestation.rekor.connectivity` -- consistency check requires Rekor access
- `check.attestation.rekor.verification.job` -- verification job also checks root consistency
- `check.attestation.clock.skew` -- clock accuracy affects consistency proof timestamps
108
docs/doctor/articles/auth/config.md
Normal file
@@ -0,0 +1,108 @@
---
checkId: check.auth.config
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, core, config]
---
# Auth Configuration

## What It Checks

Validates the overall authentication configuration by inspecting three layers in sequence:

1. **Authentication configured** -- verifies that the auth subsystem has been set up (issuer URL present, basic config loaded). If not: **Fail** with "Authentication not configured".
2. **Signing keys available** -- checks whether signing keys exist for token issuance. If configured but no keys: **Fail** with "No signing keys available".
3. **Signing key expiration** -- checks whether the active signing key is approaching expiration. If it will expire soon: **Warn** with the number of days remaining.
4. **All healthy** -- issuer URL configured, signing keys available, key not near expiry. Result: **Pass**.

Evidence collected: `AuthConfigured` (YES/NO), `IssuerConfigured` (YES/NO), `IssuerUrl`, `SigningKeysConfigured`/`SigningKeysAvailable` (YES/NO), `KeyExpiration` (days), `ActiveClients` count, `ActiveScopes` count.

The check always runs (`CanRun` returns true).
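The layered decision can be sketched in shell. The values are hypothetical, and the 30-day window is an assumption borrowed from the `check.auth.signing-key` documentation; the real check reads Authority state directly:

```shell
# Hypothetical Authority state
auth_configured=true
signing_keys_available=true
key_expiration_days=12   # assumed 30-day warning window, as documented for check.auth.signing-key

if [ "$auth_configured" != "true" ]; then
  result="Fail: Authentication not configured"
elif [ "$signing_keys_available" != "true" ]; then
  result="Fail: No signing keys available"
elif [ "$key_expiration_days" -le 30 ]; then
  result="Warn: signing key expires in $key_expiration_days days"
else
  result="Pass"
fi
echo "$result"   # → Warn: signing key expires in 12 days
```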

## Why It Matters

Authentication is the foundation of every API call in Stella Ops. If the auth subsystem is not configured, no user can log in, no service-to-service call can authenticate, and the entire platform is non-functional. Missing signing keys mean tokens cannot be issued, and an expiring key that is not rotated will cause a hard outage when it expires.

## Common Causes

- Authority service not configured (fresh installation without `stella setup auth`)
- Missing issuer URL configuration in environment variables or config files
- Signing keys not yet generated (first-run setup incomplete)
- Key material corrupted (disk failure, accidental deletion)
- HSM/PKCS#11 module not accessible (hardware key store offline)
- Signing key approaching expiration without scheduled rotation

## How to Fix

### Docker Compose

```bash
# Check Authority service configuration
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  cat /app/appsettings.json | grep -A5 "Issuer\|Signing"

# Set issuer URL via environment variable
# In .env or docker-compose.override.yml:
# AUTHORITY__ISSUER__URL=https://stella-ops.local/authority

# Restart Authority service after config changes
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority

# Generate signing keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys generate --type rsa
```

### Bare Metal / systemd

```bash
# Run initial auth setup
stella setup auth

# Configure issuer URL
stella auth configure --issuer https://auth.yourdomain.com

# Generate signing keys
stella keys generate --type rsa

# Rotate signing keys (if approaching expiration)
stella keys rotate

# Schedule automatic key rotation
stella keys rotate --schedule 30d

# Check key store health
stella doctor run --check check.crypto.keystore
```

### Kubernetes / Helm

```bash
# Check authority pod configuration
kubectl get configmap stellaops-authority-config -n stellaops -o yaml

# Set issuer URL in Helm values
# authority:
#   issuer:
#     url: "https://auth.yourdomain.com"
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Generate keys via job
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys generate --type rsa

# Check secrets for key material
kubectl get secret stellaops-signing-keys -n stellaops
```

## Verification

```bash
stella doctor run --check check.auth.config
```

## Related Checks

- `check.auth.signing-key` -- deeper signing key health (algorithm, size, rotation schedule)
- `check.auth.token-service` -- verifies the token endpoint is responsive
- `check.auth.oidc` -- external OIDC provider connectivity
100
docs/doctor/articles/auth/oidc.md
Normal file
@@ -0,0 +1,100 @@
---
checkId: check.auth.oidc
plugin: stellaops.doctor.auth
severity: warn
tags: [auth, oidc, connectivity]
---
# OIDC Provider Connectivity

## What It Checks

Tests connectivity to an external OIDC provider by performing real HTTP requests. The check reads the issuer URL from configuration keys (in priority order): `Authentication:Oidc:Issuer`, `Auth:Oidc:Authority`, `Oidc:Issuer`. If none is configured, the check passes immediately (local authority mode).

When an external provider is configured, the check performs a multi-step validation:

1. **Fetch discovery document** -- HTTP GET to `{issuerUrl}/.well-known/openid-configuration` with a 10-second timeout. If unreachable: **Fail** with a connection error type classification (ssl_error, dns_failure, refused, timeout, connection_failed).
2. **Validate discovery fields** -- parses the discovery JSON and verifies the presence of `authorization_endpoint`, `token_endpoint`, and `jwks_uri`. If any are missing: **Warn** listing the missing fields.
3. **Fetch JWKS** -- HTTP GET to the `jwks_uri` from the discovery document. Counts the number of keys in the `keys` array. If zero keys: **Warn** (token validation may fail).
4. **All healthy** -- provider reachable, discovery valid, JWKS has keys. Result: **Pass**.

Evidence collected: `issuer_url`, `discovery_reachable`, `discovery_response_ms`, `authorization_endpoint_present`, `token_endpoint_present`, `jwks_uri_present`, `jwks_key_count`, `jwks_fetch_ms`, `http_status_code`, `error_message`, `connection_error_type`.
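Step 2 can be reproduced against a captured discovery document; a minimal sketch with a hypothetical document (the real check GETs `{issuerUrl}/.well-known/openid-configuration`):

```shell
# Hypothetical captured discovery document
discovery='{"issuer":"https://idp.example","authorization_endpoint":"https://idp.example/authorize","token_endpoint":"https://idp.example/token","jwks_uri":"https://idp.example/jwks"}'

# Collect any missing required fields
missing=""
for field in authorization_endpoint token_endpoint jwks_uri; do
  printf '%s' "$discovery" | grep -q "\"$field\"" || missing="$missing $field"
done

if [ -n "$missing" ]; then
  result="Warn: discovery document missing:$missing"
else
  result="Pass: required discovery fields present"
fi
echo "$result"   # → Pass: required discovery fields present
```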

## Why It Matters

When Stella Ops is configured to delegate authentication to an external OIDC provider (Azure AD, Keycloak, Okta, etc.), all user logins and token validations depend on that provider being reachable and correctly configured. A connectivity failure means users cannot log in and services cannot validate tokens, leading to a platform-wide authentication outage.

## Common Causes

- OIDC provider is down or undergoing maintenance
- Network connectivity issue (proxy misconfiguration, firewall rule change)
- DNS resolution failure for the provider hostname
- Firewall blocking outbound HTTPS to the provider
- Discovery document missing required fields (misconfigured provider)
- Token endpoint misconfigured after a provider upgrade
- JWKS endpoint returning an empty key set (key rotation in progress)
- OIDC provider rate limiting or returning errors

## How to Fix

### Docker Compose

```bash
# Test OIDC provider connectivity from the authority container
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check DNS resolution
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  nslookup <oidc-host>

# Set OIDC configuration via environment
# AUTHENTICATION__OIDC__ISSUER=https://login.microsoftonline.com/<tenant>/v2.0
```

### Bare Metal / systemd

```bash
# Test provider connectivity
curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check DNS resolution
nslookup <oidc-host>

# Validate OIDC configuration
stella auth oidc validate

# Check JWKS endpoint
curl -s $(curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq -r .jwks_uri) | jq .

# Check network connectivity
stella doctor run --check check.network.dns
```

### Kubernetes / Helm

```bash
# Test from authority pod
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  curl -s https://<oidc-issuer>/.well-known/openid-configuration | jq .

# Check NetworkPolicy allows egress to the OIDC provider
kubectl get networkpolicy -n stellaops -o yaml | grep -A10 egress

# Set OIDC configuration in Helm values
# authority:
#   oidc:
#     issuer: "https://login.microsoftonline.com/<tenant>/v2.0"
helm upgrade stellaops stellaops/stellaops -f values.yaml
```

## Verification

```bash
stella doctor run --check check.auth.oidc
```

## Related Checks

- `check.auth.config` -- overall auth configuration health
- `check.auth.signing-key` -- local signing key health (used when not delegating to external OIDC)
- `check.auth.token-service` -- token endpoint availability
106
docs/doctor/articles/auth/signing-key.md
Normal file
@@ -0,0 +1,106 @@
---
checkId: check.auth.signing-key
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, security, keys]
---
# Signing Key Health

## What It Checks

Verifies the health of the active signing key used for token issuance. The check evaluates three conditions in sequence:

1. **No active key** -- if `HasActiveKey` is false: **Fail** with "No active signing key available". Evidence includes `ActiveKey: NONE` and the total key count.
2. **Approaching expiration** -- if the active key expires within **30 days** (`ExpirationWarningDays`): **Warn** with the number of days remaining. Evidence includes key ID, algorithm, days until expiration, and whether rotation is scheduled.
3. **Healthy** -- an active key exists with more than 30 days until expiration. Result: **Pass**. Evidence includes key ID, algorithm, key size (bits), days until expiration, and rotation schedule status.

The check always runs (`CanRun` returns true).

Evidence collected: `ActiveKeyId`, `Algorithm`, `KeySize`, `DaysUntilExpiration`, `RotationScheduled` (YES/NO), `TotalKeys`.
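The three-condition sequence can be sketched with hypothetical key state; the thresholds are those stated above, and the variable names are illustrative:

```shell
# Hypothetical key state
has_active_key=true
days_until_expiration=21

if [ "$has_active_key" != "true" ]; then
  result="Fail: No active signing key available"
elif [ "$days_until_expiration" -le 30 ]; then
  # Within the 30-day ExpirationWarningDays window
  result="Warn: key expires in $days_until_expiration days"
else
  result="Pass"
fi
echo "$result"   # → Warn: key expires in 21 days
```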

## Why It Matters

The signing key is used to sign every JWT issued by the Authority service. If no active key exists, no tokens can be issued and the entire platform's authentication stops working. If the key is approaching expiration without a rotation plan, the platform faces a hard outage on the expiration date -- all tokens signed with the key become unverifiable.

## Common Causes

- Signing keys not generated (incomplete setup)
- All keys expired without rotation
- Key store corrupted (file system issue, accidental deletion)
- Key rotation not scheduled (manual process that was forgotten)
- Previous rotation attempt failed (permissions, HSM connectivity)

## How to Fix

### Docker Compose

```bash
# Check current key status
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys status

# Generate new signing key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys generate --type rsa --bits 4096

# Activate the new key
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys activate

# Rotate keys
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella keys rotate
```

### Bare Metal / systemd

```bash
# Generate new signing key
stella keys generate --type rsa --bits 4096

# Activate the key
stella keys activate

# Rotate signing key
stella keys rotate

# Schedule automatic rotation (every 30 days)
stella keys rotate --schedule 30d

# Check key status
stella keys status
```

### Kubernetes / Helm

```bash
# Check key status
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys status

# Generate and activate key
kubectl exec -it deploy/stellaops-authority -n stellaops -- \
  stella keys generate --type rsa --bits 4096

# Set automatic rotation in Helm values
# authority:
#   signing:
#     autoRotate: true
#     rotationIntervalDays: 30
helm upgrade stellaops stellaops/stellaops -f values.yaml

# Check the signing key secret exists
kubectl get secret stellaops-signing-keys -n stellaops -o yaml
```

## Verification

```bash
stella doctor run --check check.auth.signing-key
```

## Related Checks

- `check.auth.config` -- overall auth configuration including signing key presence
- `check.auth.token-service` -- token issuance depends on a healthy signing key
- `check.attestation.keymaterial` -- attestor signing keys (separate from auth signing keys)
114
docs/doctor/articles/auth/token-service.md
Normal file
@@ -0,0 +1,114 @@
---
checkId: check.auth.token-service
plugin: stellaops.doctor.auth
severity: fail
tags: [auth, service, health]
---
# Token Service Health

## What It Checks

Verifies the availability and performance of the token service endpoint (`/connect/token`). The check evaluates four conditions:

1. **Service unavailable** -- the token endpoint is not responding. Result: **Fail** with the endpoint URL and error message.
2. **Critically slow** -- response time exceeds **2000ms**. Result: **Fail** with the actual response time and threshold.
3. **Slow** -- response time exceeds **500ms** but is under 2000ms. Result: **Warn** with response time, threshold, and token issuance count.
4. **Healthy** -- the service is available and response time is under 500ms. Result: **Pass** with response time, tokens issued in the last 24 hours, and active session count.

Evidence collected: `ServiceAvailable` (YES/NO), `Endpoint`, `ResponseTimeMs`, `CriticalThreshold` (2000), `WarningThreshold` (500), `TokensIssuedLast24h`, `ActiveSessions`, `Error`.

The check always runs (`CanRun` returns true).
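The latency thresholds can be applied to a measured response time (for example, one obtained with `curl -w '%{time_total}'`); a minimal sketch using a hypothetical measurement:

```shell
# Hypothetical measured latency in milliseconds
response_ms=750

if [ "$response_ms" -gt 2000 ]; then
  result="Fail: critically slow (${response_ms}ms > 2000ms)"
elif [ "$response_ms" -gt 500 ]; then
  result="Warn: slow (${response_ms}ms > 500ms)"
else
  result="Pass: ${response_ms}ms"
fi
echo "$result"   # → Warn: slow (750ms > 500ms)
```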
|
||||
|
||||
## Why It Matters

The token service is the single point through which all access tokens are issued. If it is unavailable, no user can log in, no service can authenticate, and every API call fails with 401. Even when the service is available but slow, logins degrade, automated integrations time out, and the platform feels unresponsive. This check is typically the first to detect Authority database issues or resource starvation.

## Common Causes

- Authority service not running (container stopped, process crashed)
- Token endpoint misconfigured (wrong path, wrong port)
- Database connectivity issue (Authority cannot query clients/keys)
- Database performance issues (slow queries for token validation)
- Service overloaded (high authentication request volume)
- Resource contention (CPU/memory pressure on Authority host)
- Higher than normal load (warning-level)
- Database query performance degraded (warning-level)
## How to Fix

### Docker Compose

```bash
# Check Authority service status
docker compose -f devops/compose/docker-compose.stella-ops.yml ps authority

# View Authority service logs
docker compose -f devops/compose/docker-compose.stella-ops.yml logs authority --tail 200

# Restart Authority service
docker compose -f devops/compose/docker-compose.stella-ops.yml restart authority

# Test token endpoint directly
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  curl -s -o /dev/null -w "%{http_code} %{time_total}s" http://localhost:80/connect/token

# Check database connectivity
docker compose -f devops/compose/docker-compose.stella-ops.yml exec authority \
  stella doctor run --check check.storage.postgres
```

### Bare Metal / systemd

```bash
# Check authority service status
stella auth status

# Restart authority service
stella service restart authority

# Check database connectivity
stella doctor run --check check.storage.postgres

# Monitor service metrics
stella auth metrics --period 1h

# Review database performance
stella doctor run --check check.storage.performance

# Watch metrics in real-time (warning-level slowness)
stella auth metrics --watch
```

### Kubernetes / Helm

```bash
# Check authority pod status
kubectl get pods -l app.kubernetes.io/component=authority -n stellaops

# View pod logs
kubectl logs -l app.kubernetes.io/component=authority -n stellaops --tail=200

# Check resource usage
kubectl top pods -l app.kubernetes.io/component=authority -n stellaops

# Restart authority pods
kubectl rollout restart deployment/stellaops-authority -n stellaops

# Scale up if under load
kubectl scale deployment stellaops-authority --replicas=3 -n stellaops

# Check liveness/readiness probe status
kubectl describe pod -l app.kubernetes.io/component=authority -n stellaops | grep -A5 "Liveness\|Readiness"
```

## Verification

```
stella doctor run --check check.auth.token-service
```

## Related Checks

- `check.auth.config` -- auth must be configured before the token service can function
- `check.auth.signing-key` -- token issuance requires a valid signing key
- `check.auth.oidc` -- if delegating to external OIDC, that provider must also be healthy
74
docs/doctor/articles/binary-analysis/buildinfo-cache.md
Normal file
@@ -0,0 +1,74 @@
---
checkId: check.binaryanalysis.buildinfo.cache
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, buildinfo, debian, cache, security]
---
# Debian Buildinfo Cache

## What It Checks

Verifies Debian buildinfo service accessibility and local cache directory configuration. The check:

- Tests HTTPS connectivity to `buildinfos.debian.net` and `reproduce.debian.net` via HEAD requests.
- Checks the local cache directory (default `/var/cache/stella/buildinfo`, configurable via `BinaryAnalysis:BuildinfoCache:Directory`) for existence and writability by creating and deleting a temp file.
- Fails if both services are unreachable AND the cache directory does not exist.
- Warns if services are unreachable but the cache exists (offline mode possible), or if services are reachable but the cache directory is missing or not writable.
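The writability probe described above (create and delete a temp file) can be approximated with a small shell helper. The function name and probe filename are illustrative, not part of the product:

```shell
# Illustrative sketch of the cache writability probe: try to create and
# delete a marker file, mirroring what the check does internally.
cache_writable() {
  probe="$1/.doctor-probe-$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "yes"
  else
    echo "no"
  fi
}

cache_writable /var/cache/stella/buildinfo
```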
## Why It Matters

Buildinfo files from Debian are used for reproducible-build verification. Without access to buildinfo services or a local cache, binary analysis cannot verify whether packages were built reproducibly, degrading supply-chain assurance for Debian-based container images.

## Common Causes

- Firewall blocking HTTPS access to Debian buildinfo services
- Network connectivity issues or DNS resolution failure
- Proxy configuration required but not set
- Cache directory not created
- Insufficient permissions on cache directory
## How to Fix

### Docker Compose

```yaml
environment:
  BinaryAnalysis__BuildinfoCache__Directory: "/var/cache/stella/buildinfo"
volumes:
  - buildinfo-cache:/var/cache/stella/buildinfo
```

Test connectivity:

```bash
docker exec <binaryindex-container> curl -I https://buildinfos.debian.net
```

### Bare Metal / systemd

```bash
# Create cache directory
sudo mkdir -p /var/cache/stella/buildinfo
sudo chmod 755 /var/cache/stella/buildinfo

# Test connectivity
curl -I https://buildinfos.debian.net

# If behind a proxy
export HTTPS_PROXY=http://proxy.example.com:8080
```

### Kubernetes / Helm

```yaml
binaryAnalysis:
  buildinfo:
    cacheDirectory: "/var/cache/stella/buildinfo"
persistence:
  enabled: true
  size: 5Gi
```

For air-gapped environments, pre-populate the buildinfo cache with required files or disable this check.

## Verification

```
stella doctor run --check check.binaryanalysis.buildinfo.cache
```

## Related Checks

- `check.binaryanalysis.symbol.recovery.fallback` — meta-check ensuring at least one symbol recovery path is available
- `check.binaryanalysis.debuginfod.available` — verifies debuginfod service connectivity
@@ -0,0 +1,67 @@
---
checkId: check.binaryanalysis.corpus.mirror.freshness
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, corpus, mirrors, freshness, security, groundtruth]
---
# Corpus Mirror Freshness

## What It Checks

Verifies that local corpus mirrors are not stale. The check:

- Reads the mirrors root directory (default `/var/lib/stella/mirrors`, configurable via `BinaryAnalysis:Corpus:MirrorsDirectory`).
- Inspects five known mirror subdirectories: `debian/archive`, `debian/snapshot`, `ubuntu/usn-index`, `alpine/secdb`, and `osv`.
- For each existing mirror, finds the most recent file modification time (sampling up to 1000 files) and compares it against a staleness threshold (default 7 days, configurable via `BinaryAnalysis:Corpus:StalenessThresholdDays`).
- Fails if no mirrors exist or all mirrors are stale. Warns if some mirrors are stale. Reports info if all present mirrors are fresh but optional mirrors are missing.
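The freshness test can be approximated from the command line. This is an illustrative sketch (it assumes GNU `find` and `date`, and the function name is ours), not the check's actual implementation:

```shell
# Illustrative freshness probe: succeeds when the newest file under a mirror
# directory is younger than the threshold in days (stale or empty -> failure).
mirror_is_fresh() {
  dir=$1 days=$2
  newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -rn | head -n1)
  [ -n "$newest" ] || return 1                       # no files -> treat as stale
  age=$(( $(date +%s) - ${newest%.*} ))
  [ "$age" -le $(( days * 86400 )) ]
}

mirror_is_fresh /var/lib/stella/mirrors/osv 7 && echo fresh || echo stale
```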
## Why It Matters

Corpus mirrors provide ground-truth vulnerability and package data for binary analysis. Stale mirrors mean symbol recovery operates on outdated data, leading to missed vulnerabilities and inaccurate matching in security scans.

## Common Causes

- Corpus mirrors have not been initialized
- Mirror sync job has not run recently or is disabled
- Network connectivity issues preventing sync
- Air-gapped setup incomplete (mirrors not pre-populated)
## How to Fix

### Docker Compose

```bash
# Initialize all mirrors
docker exec <binaryindex-container> stella groundtruth mirror sync --all
```

### Bare Metal / systemd

```bash
# Create mirrors directory
sudo mkdir -p /var/lib/stella/mirrors

# Sync all mirrors
stella groundtruth mirror sync --all

# Set up a timer for automatic sync
sudo systemctl enable stella-mirror-sync.timer
sudo systemctl start stella-mirror-sync.timer
```

### Kubernetes / Helm

```yaml
binaryAnalysis:
  corpus:
    mirrorsDirectory: "/var/lib/stella/mirrors"
    syncSchedule: "0 2 * * *" # daily at 2am
persistence:
  enabled: true
  size: 50Gi
```

For air-gapped environments, transfer pre-populated mirrors from an online system.

## Verification

```
stella doctor run --check check.binaryanalysis.corpus.mirror.freshness
```

## Related Checks

- `check.binaryanalysis.corpus.kpi.baseline` — verifies KPI baseline exists for regression detection
- `check.binaryanalysis.symbol.recovery.fallback` — meta-check for symbol recovery path availability
61
docs/doctor/articles/binary-analysis/ddeb-repo-enabled.md
Normal file
@@ -0,0 +1,61 @@
---
checkId: check.binaryanalysis.ddeb.enabled
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, ddeb, ubuntu, symbols, security]
---
# Ubuntu Ddeb Repository

## What It Checks

Verifies the Ubuntu debug symbol repository (ddebs.ubuntu.com) is configured and accessible. The check (Linux only):

- Parses `/etc/apt/sources.list` and `/etc/apt/sources.list.d/*.list` (and `.sources` DEB822 files) for entries containing `ddebs.ubuntu.com`.
- Tests HTTP connectivity to `http://ddebs.ubuntu.com` via a HEAD request.
- Detects the distribution codename from `/etc/lsb-release` or `/etc/os-release`.
- Reports different warnings based on whether the repo is configured, reachable, or both.
- Skips on non-Linux platforms.
## Why It Matters

The Ubuntu ddeb repository provides debug symbol packages (`-dbgsym`) needed for binary analysis of Ubuntu-based container images. Without debug symbols, binary matching accuracy is significantly reduced, weakening vulnerability detection for Ubuntu workloads.

## Common Causes

- Ddeb repository not added to apt sources
- Network connectivity issues preventing access to ddebs.ubuntu.com
- Firewall blocking HTTP access
- Running on a non-Ubuntu Linux distribution
## How to Fix

### Docker Compose

Add ddeb repository inside the binary analysis container:

```bash
docker exec <binaryindex-container> bash -c \
  'echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" > /etc/apt/sources.list.d/ddebs.list'
docker exec <binaryindex-container> apt-key adv --keyserver keyserver.ubuntu.com \
  --recv-keys F2EDC64DC5AEE1F6B9C621F0C8CAB6595FDFF622
docker exec <binaryindex-container> apt update
```

### Bare Metal / systemd

```bash
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" \
  | sudo tee /etc/apt/sources.list.d/ddebs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com \
  --recv-keys F2EDC64DC5AEE1F6B9C621F0C8CAB6595FDFF622
sudo apt update
```
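On releases where apt uses DEB822-style `.sources` files (which this check also parses), an equivalent entry can be written as follows. This fragment is illustrative; replace the suite names with your release's codename:

```
# /etc/apt/sources.list.d/ddebs.sources (illustrative)
Types: deb
URIs: http://ddebs.ubuntu.com
Suites: noble noble-updates
Components: main restricted universe multiverse
```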
### Kubernetes / Helm

Include the ddeb repository in your container image's Dockerfile or use an init container to configure it at startup.

For air-gapped environments, set up a local ddeb mirror or use offline symbol packages.

## Verification

```
stella doctor run --check check.binaryanalysis.ddeb.enabled
```

## Related Checks

- `check.binaryanalysis.debuginfod.available` — verifies debuginfod service availability
- `check.binaryanalysis.symbol.recovery.fallback` — meta-check for symbol recovery path availability
@@ -0,0 +1,71 @@
---
checkId: check.binaryanalysis.debuginfod.available
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, debuginfod, symbols, security]
---
# Debuginfod Availability

## What It Checks

Verifies the `DEBUGINFOD_URLS` environment variable and debuginfod service connectivity. The check:

- Reads the `DEBUGINFOD_URLS` environment variable (space-separated list of URLs).
- If not set, falls back to the default Fedora debuginfod at `https://debuginfod.fedoraproject.org`.
- Tests HTTP connectivity to each URL via HEAD requests.
- Reports info if `DEBUGINFOD_URLS` is not set but the default is reachable.
- Warns if some configured URLs are unreachable. Fails if none are reachable.
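The per-URL probe can be sketched in shell. This is illustrative: `probe_url` is our name, and the 5-second timeout is a stand-in for the check's HTTP timeout, not a documented value:

```shell
# Illustrative connectivity probe: HEAD-request each URL in the
# space-separated DEBUGINFOD_URLS list, falling back to the documented default.
probe_url() {
  curl -fsI --max-time 5 "$1" >/dev/null 2>&1 && echo "reachable" || echo "unreachable"
}

for url in ${DEBUGINFOD_URLS:-https://debuginfod.fedoraproject.org}; do
  echo "$url: $(probe_url "$url")"
done
```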
## Why It Matters

Debuginfod provides on-demand debug information (DWARF, source) for ELF binaries. It is the primary mechanism for symbol recovery in binary analysis. Without a reachable debuginfod endpoint, binary matching accuracy drops significantly, reducing the effectiveness of vulnerability correlation and reachability analysis.

## Common Causes

- `DEBUGINFOD_URLS` environment variable is not set
- Configured debuginfod servers are down
- Firewall blocking HTTPS access to debuginfod servers
- Proxy configuration required but not set
- DNS resolution failure for debuginfod hostnames
## How to Fix

### Docker Compose

```yaml
environment:
  DEBUGINFOD_URLS: "https://debuginfod.fedoraproject.org"
```

Test connectivity:

```bash
docker exec <binaryindex-container> curl -I https://debuginfod.fedoraproject.org
```

### Bare Metal / systemd

```bash
# Set the environment variable
export DEBUGINFOD_URLS="https://debuginfod.fedoraproject.org"

# Or add to service file
sudo systemctl edit stellaops-binaryindex
# Add: Environment=DEBUGINFOD_URLS=https://debuginfod.fedoraproject.org

# Verify connectivity
curl -I https://debuginfod.fedoraproject.org
```

### Kubernetes / Helm

```yaml
binaryAnalysis:
  debuginfod:
    urls: "https://debuginfod.fedoraproject.org"
```

For air-gapped environments, deploy a local debuginfod instance or use offline symbol bundles. See `docs/modules/binary-index/ground-truth-corpus.md` for offline setup.

## Verification

```
stella doctor run --check check.binaryanalysis.debuginfod.available
```

## Related Checks

- `check.binaryanalysis.ddeb.enabled` — verifies Ubuntu ddeb repository availability
- `check.binaryanalysis.buildinfo.cache` — verifies Debian buildinfo service and cache
- `check.binaryanalysis.symbol.recovery.fallback` — meta-check aggregating all symbol sources
75
docs/doctor/articles/binary-analysis/kpi-baseline-exists.md
Normal file
@@ -0,0 +1,75 @@
---
checkId: check.binaryanalysis.corpus.kpi.baseline
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, corpus, kpi, baseline, regression, ci, groundtruth, security]
---
# KPI Baseline Configuration

## What It Checks

Verifies that a KPI baseline file exists for regression detection in CI gates. The check:

- Looks for a baseline file at the configured directory (default `/var/lib/stella/baselines`) and filename (default `current.json`), configurable via `BinaryAnalysis:Corpus:BaselineDirectory` and `BinaryAnalysis:Corpus:BaselineFilename`.
- Warns if the directory does not exist.
- If the default baseline file is missing but other `.json` files exist in the directory, warns and identifies the latest one.
- Validates the baseline file as JSON and checks for expected KPI fields: `precision`, `recall`, `falseNegativeRate`, `deterministicReplayRate`, `ttfrpP95Ms`.
- Fails if the file exists but is invalid JSON or has no recognized KPI fields.
- Warns if some recommended fields are missing.
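A minimal baseline file containing all the recognized KPI fields might look like the fragment below. The field names come from the check; the values are purely illustrative:

```json
{
  "precision": 0.97,
  "recall": 0.94,
  "falseNegativeRate": 0.03,
  "deterministicReplayRate": 1.0,
  "ttfrpP95Ms": 4200
}
```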
## Why It Matters

Without a KPI baseline, CI gates cannot detect regressions in binary matching accuracy. A regression in precision or recall means vulnerability detection quality has degraded without anyone being alerted. The baseline enables automated quality gates that block releases when binary analysis accuracy drops.

## Common Causes

- KPI baseline has never been established (first run of corpus validation not yet completed)
- Baseline directory path misconfigured
- Baseline file was deleted or corrupted
- Baseline created with an older tool version missing newer KPI fields
## How to Fix

### Docker Compose

```bash
# Create baseline directory
docker exec <binaryindex-container> mkdir -p /var/lib/stella/baselines

# Run corpus validation to establish baseline
docker exec <binaryindex-container> stella groundtruth validate run \
  --corpus datasets/golden-corpus/seed/ --output-baseline
```

### Bare Metal / systemd

```bash
sudo mkdir -p /var/lib/stella/baselines

# Run validation and save baseline
stella groundtruth validate run \
  --corpus datasets/golden-corpus/seed/ \
  --output /var/lib/stella/baselines/current.json

# Or promote latest results
stella groundtruth baseline update --from-latest \
  --output /var/lib/stella/baselines/current.json
```

### Kubernetes / Helm

```yaml
binaryAnalysis:
  corpus:
    baselineDirectory: "/var/lib/stella/baselines"
persistence:
  enabled: true
```

Run a one-time job to establish the baseline:

```bash
kubectl exec -it <binaryindex-pod> -- stella groundtruth validate run --output-baseline
```

## Verification

```
stella doctor run --check check.binaryanalysis.corpus.kpi.baseline
```

## Related Checks

- `check.binaryanalysis.corpus.mirror.freshness` — verifies corpus mirror data is not stale
- `check.binaryanalysis.symbol.recovery.fallback` — meta-check for symbol recovery availability
@@ -0,0 +1,69 @@
---
checkId: check.binaryanalysis.symbol.recovery.fallback
plugin: stellaops.doctor.binaryanalysis
severity: warn
tags: [binaryanalysis, symbols, fallback, security, meta]
---
# Symbol Recovery Fallback

## What It Checks

Meta-check that ensures at least one symbol recovery path is available. The check aggregates results from three child checks:

- **Debuginfod Availability** (`check.binaryanalysis.debuginfod.available`)
- **Ubuntu Ddeb Repository** (`check.binaryanalysis.ddeb.enabled`) -- skipped on non-Linux
- **Debian Buildinfo Cache** (`check.binaryanalysis.buildinfo.cache`)

Fails if zero sources are available. Reports info if some but not all sources are available. Passes if all sources are operational.
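The aggregation rule (fail on zero sources, info on a partial set, pass when all are available) can be sketched as a shell function. The function name and yes/no convention are ours, for illustration only:

```shell
# Illustrative sketch of the meta-check aggregation; each argument is the
# availability (yes/no) of one child symbol source.
aggregate_sources() {
  available=0 total=0
  for s in "$@"; do
    total=$((total + 1))
    [ "$s" = "yes" ] && available=$((available + 1))
  done
  if [ "$available" -eq 0 ]; then echo "fail"
  elif [ "$available" -lt "$total" ]; then echo "info"
  else echo "pass"
  fi
}

aggregate_sources yes no yes   # -> info
```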
## Why It Matters

Symbol recovery is critical for binary analysis accuracy. If all symbol sources are unavailable, binary matching operates without debug information, severely degrading vulnerability detection quality. Having at least one source ensures a minimum level of binary analysis capability; having multiple sources provides redundancy.

## Common Causes

- All symbol recovery endpoints unreachable
- Network connectivity issues affecting all sources
- Firewall blocking access to symbol servers
- Air-gapped environment without offline symbol cache configured
## How to Fix

### Docker Compose

Configure at least one symbol source:

```yaml
environment:
  DEBUGINFOD_URLS: "https://debuginfod.fedoraproject.org"
  BinaryAnalysis__BuildinfoCache__Directory: "/var/cache/stella/buildinfo"
```

### Bare Metal / systemd

```bash
# Option 1: Configure debuginfod
export DEBUGINFOD_URLS="https://debuginfod.fedoraproject.org"

# Option 2: Set up buildinfo cache
sudo mkdir -p /var/cache/stella/buildinfo

# Option 3: Configure ddeb repository (Ubuntu)
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/ddebs.list
```

### Kubernetes / Helm

```yaml
binaryAnalysis:
  debuginfod:
    urls: "https://debuginfod.fedoraproject.org"
  buildinfo:
    cacheDirectory: "/var/cache/stella/buildinfo"
```

For air-gapped environments, set up an offline symbol bundle. See `docs/modules/binary-index/ground-truth-corpus.md` for instructions on creating and importing offline symbol packs.

## Verification

```
stella doctor run --check check.binaryanalysis.symbol.recovery.fallback
```

## Related Checks

- `check.binaryanalysis.debuginfod.available` — individual debuginfod connectivity check
- `check.binaryanalysis.ddeb.enabled` — individual Ubuntu ddeb repository check
- `check.binaryanalysis.buildinfo.cache` — individual Debian buildinfo cache check
101
docs/doctor/articles/compliance/attestation-signing.md
Normal file
@@ -0,0 +1,101 @@
---
checkId: check.compliance.attestation-signing
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, attestation, signing, crypto]
---
# Attestation Signing Health

## What It Checks

Monitors attestation signing capability by querying the Attestor service at `/api/v1/signing/status`. The check validates:

- **Key availability**: whether a signing key is loaded and accessible (via `keyAvailable` in the response).
- **Key expiration**: if the key has an `expiresAt` timestamp, the check fails when the key is already expired, warns when expiry is within 30 days, and passes otherwise.
- **Signing activity**: reports the key type and the number of signatures produced in the last 24 hours.

The check only runs when `Attestor:Url` or `Services:Attestor:Url` is configured. It uses a 10-second HTTP timeout.

| Condition | Result |
|---|---|
| Attestor unreachable or HTTP error | Fail |
| Key not available | Fail |
| Key expired | Fail |
| Key expires within 30 days | Warn |
| Key available and not expiring soon | Pass |
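The expiry policy in the table can be sketched as a function of days remaining until `expiresAt`. This is illustrative (the function name is ours); the 30-day window is the documented threshold:

```shell
# Illustrative sketch of the key-expiry policy: negative days-remaining
# means the key is already expired.
expiry_status() {
  days=$1
  if [ "$days" -lt 0 ]; then echo "fail"       # key expired
  elif [ "$days" -lt 30 ]; then echo "warn"    # expires within 30 days
  else echo "pass"
  fi
}

expiry_status 90   # -> pass
```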
## Why It Matters

Attestation signing is the foundation of Stella Ops' evidence chain. Without a working signing key, the system cannot create attestations for releases, SBOM scans, or policy decisions. This breaks the entire compliance audit trail and makes releases unverifiable. Key expiration without timely rotation causes the same downstream impact as a missing key, but with no advance warning unless monitored.

## Common Causes

- HSM/KMS connectivity issue preventing key access
- Key rotation in progress (brief window of unavailability)
- Key expired or revoked without replacement
- Permission denied on the key management backend
- Attestor service unavailable or misconfigured endpoint URL
## How to Fix

### Docker Compose

Verify the Attestor service is running and the URL is correct:

```bash
# Check attestor container health
docker compose ps attestor

# Verify signing key status
docker compose exec attestor stella attestor key status

# If key is expired, rotate it
docker compose exec attestor stella attestor key rotate

# Ensure the URL is correct in your .env or compose override
# Attestor__Url=http://attestor:5082
```

### Bare Metal / systemd

Check the Attestor service and key configuration:

```bash
# Check service status
sudo systemctl status stellaops-attestor

# Verify key status
stella attestor key status

# Test HSM/KMS connectivity
stella attestor hsm test

# Rotate an expired key
stella attestor key rotate

# If using appsettings.json, verify Attestor:Url is correct
jq '.Attestor' /etc/stellaops/appsettings.json
```

### Kubernetes / Helm

```bash
# Check attestor pod status
kubectl get pods -l app=stellaops-attestor

# Check signing key status
kubectl exec deploy/stellaops-attestor -- stella attestor key status

# Verify HSM/KMS connectivity from the pod
kubectl exec deploy/stellaops-attestor -- stella attestor hsm test

# Schedule key rotation via Helm values
helm upgrade stellaops ./charts/stellaops \
  --set attestor.keyRotation.enabled=true \
  --set attestor.keyRotation.scheduleBeforeExpiryDays=30
```

## Verification

```
stella doctor run --check check.compliance.attestation-signing
```

## Related Checks

- `check.compliance.evidence-rate` — monitors evidence generation success rate, which depends on signing
- `check.compliance.provenance-completeness` — verifies provenance records exist for releases (requires working signing)
- `check.compliance.evidence-integrity` — verifies signatures on stored evidence
- `check.crypto.hsm` — validates HSM/PKCS#11 module availability used by the signing key
100
docs/doctor/articles/compliance/audit-readiness.md
Normal file
@@ -0,0 +1,100 @@
---
checkId: check.compliance.audit-readiness
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, audit, evidence]
---
# Audit Readiness

## What It Checks

Verifies the system is ready for compliance audits by querying the Evidence Locker at `/api/v1/evidence/audit-readiness`. The check evaluates four readiness criteria:

- **Retention policy configured**: whether a data retention policy is active.
- **Audit logging enabled**: whether audit log capture is turned on.
- **Backup verified**: whether the most recent backup has been validated.
- **Evidence retention age**: whether the oldest evidence meets the required retention period (default 365 days).

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| 3 or more issues found | Fail |
| 1-2 issues found | Warn |
| All criteria satisfied | Pass |

Evidence collected: `issues_count`, `retention_policy_configured`, `audit_log_enabled`, `backup_verified`, `evidence_count`, `oldest_evidence_days`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 15-second HTTP timeout.
## Why It Matters

Compliance audits (SOC2, FedRAMP, HIPAA, PCI-DSS) require verifiable evidence retention, continuous audit logging, and validated backups. If any of these controls is missing, the organization cannot demonstrate compliance during an audit. A missing retention policy means evidence may be silently deleted. Disabled audit logging creates gaps in the chain of custody. Unverified backups risk data loss during incident recovery.

## Common Causes

- No retention policy configured (default is not set)
- Audit logging disabled in configuration or by error
- Backup verification job not running or failing silently
- Evidence retention shorter than the required period (e.g., 90 days configured but 365 required)
## How to Fix

### Docker Compose

```bash
# Configure retention policy
docker compose exec evidence-locker stella evidence retention set --days 365

# Enable audit logging
docker compose exec platform stella audit enable

# Verify backup status
docker compose exec evidence-locker stella evidence backup verify

# Set environment variables if needed
# EvidenceLocker__Retention__Days=365
# AuditLog__Enabled=true
```

### Bare Metal / systemd

```bash
# Configure retention policy
stella evidence retention set --days 365

# Enable audit logging
stella audit enable

# Verify backup status
stella evidence backup verify

# Edit appsettings.json
# "EvidenceLocker": { "Retention": { "Days": 365 } }
# "AuditLog": { "Enabled": true }

sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm

```yaml
# values.yaml
evidenceLocker:
  retention:
    days: 365
  backup:
    enabled: true
    schedule: "0 2 * * *"
    verifyAfterBackup: true
auditLog:
  enabled: true
```

```bash
helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification

```
stella doctor run --check check.compliance.audit-readiness
```

## Related Checks

- `check.compliance.evidence-integrity` — verifies evidence has not been tampered with
- `check.compliance.export-readiness` — verifies evidence can be exported for auditors
- `check.compliance.evidence-rate` — monitors evidence generation health
- `check.compliance.framework` — verifies compliance framework controls are passing
100
docs/doctor/articles/compliance/evidence-integrity.md
Normal file
@@ -0,0 +1,100 @@
---
|
||||
checkId: check.compliance.evidence-integrity
|
||||
plugin: stellaops.doctor.compliance
|
||||
severity: fail
|
||||
tags: [compliance, security, integrity, signatures]
|
||||
---
|
||||
# Evidence Integrity
|
||||
|
||||
## What It Checks
|
||||
Detects evidence tampering or integrity issues by querying the Evidence Locker at `/api/v1/evidence/integrity-check`. The check verifies cryptographic signatures and hash chains across all stored evidence records. It evaluates:
|
||||
|
||||
- **Tampered records**: evidence records where the signature or hash does not match the stored content.
|
||||
- **Verification errors**: records that could not be verified (e.g., missing certificates, unsupported algorithms).
|
||||
- **Hash chain validity**: whether the sequential hash chain linking evidence records is intact.
|
||||
|
||||
| Condition | Result |
|
||||
|---|---|
|
||||
| Evidence Locker unreachable | Warn |
|
||||
| Any tampered records detected (tamperedCount > 0) | Fail (CRITICAL) |
|
||||
| Verification errors but no tampering | Warn |
|
||||
| All records verified, no tampering | Pass |
|
||||
|
||||
Evidence collected: `tampered_count`, `verified_count`, `total_checked`, `first_tampered_id`, `verification_errors`, `hash_chain_valid`.
|
||||
|
||||
The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 60-second HTTP timeout due to the intensive nature of the integrity scan.
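The decision table above can be sketched as a small shell function. This is an illustrative sketch of the result mapping only, not the plugin's actual implementation; the field names mirror the evidence keys listed above.

```bash
# Illustrative: map the integrity-check response fields to a Doctor result.
classify_evidence_integrity() {
  local reachable="$1" tampered="$2" errors="$3"
  if [ "$reachable" != "true" ]; then echo "warn: locker unreachable"; return; fi
  if [ "$tampered" -gt 0 ]; then echo "fail (CRITICAL): tampered_count=${tampered}"; return; fi
  if [ "$errors" -gt 0 ]; then echo "warn: verification_errors=${errors}"; return; fi
  echo "pass"
}

classify_evidence_integrity true 0 2
```

Note that tampering dominates verification errors: a single tampered record fails the check even if every other record verified cleanly.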
## Why It Matters
Evidence integrity is the cornerstone of compliance and audit trust. Tampered evidence records indicate storage corruption, a security breach, or malicious modification of release decisions. Any tampering invalidates the entire evidence chain and must be treated as a security incident. Verification errors, while less severe, mean some evidence cannot be independently validated, weakening the audit posture.

## Common Causes
- Evidence modification after signing (accidental or malicious)
- Storage corruption (disk errors, incomplete writes)
- Malicious tampering by an attacker with storage access
- Key or certificate mismatch after key rotation
- Missing signing certificates needed for verification
- Certificate expiration rendering signatures unverifiable
- Unsupported signature algorithm in older evidence records

## How to Fix

### Docker Compose
```bash
# List tampered evidence (DO NOT DELETE - preserve for investigation)
docker compose exec evidence-locker stella evidence audit --tampered

# Check for storage corruption
docker compose exec evidence-locker stella evidence integrity-check --verbose

# If tampering is confirmed, escalate to security team
# Preserve all logs and evidence for forensic analysis
docker compose logs evidence-locker > evidence-locker-forensic.log

# For verification errors (missing certs), import the required certificates
docker compose exec evidence-locker stella evidence certs import --path /certs/
```

### Bare Metal / systemd
```bash
# List tampered evidence
stella evidence audit --tampered

# Full integrity check with details
stella evidence integrity-check --verbose

# Check for disk errors
sudo smartctl -H /dev/sda
sudo fsck -n /dev/sda1

# Import missing certificates for verification
stella evidence certs import --path /etc/stellaops/certs/

# DO NOT delete tampered evidence - preserve for investigation
```

### Kubernetes / Helm
```bash
# List tampered evidence
kubectl exec deploy/stellaops-evidence-locker -- stella evidence audit --tampered

# Full integrity check
kubectl exec deploy/stellaops-evidence-locker -- stella evidence integrity-check --verbose

# Check persistent volume health
kubectl describe pvc stellaops-evidence-data

# Export forensic logs
kubectl logs deploy/stellaops-evidence-locker --all-containers > forensic.log
```

## Verification
```bash
stella doctor run --check check.compliance.evidence-integrity
```

## Related Checks
- `check.compliance.attestation-signing` — signing key health affects evidence signature creation
- `check.compliance.evidence-rate` — evidence generation failures may relate to integrity issues
- `check.evidencelocker.merkle` — Merkle anchor verification provides additional integrity guarantees
- `check.evidencelocker.provenance` — provenance chain integrity validates the evidence chain
- `check.compliance.audit-readiness` — overall audit readiness depends on evidence integrity
94
docs/doctor/articles/compliance/evidence-rate.md
Normal file
@@ -0,0 +1,94 @@
---
checkId: check.compliance.evidence-rate
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, evidence, attestation]
---
# Evidence Generation Rate

## What It Checks
Monitors evidence generation success rate by querying the Evidence Locker at `/api/v1/evidence/metrics`. The check computes the success rate as `(totalGenerated - failed) / totalGenerated` over the last 24 hours and compares it against two thresholds:

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| Success rate < 95% | Fail |
| Success rate 95%-99% | Warn |
| Success rate >= 99% | Pass |

Evidence collected: `success_rate`, `total_generated_24h`, `failed_24h`, `pending_24h`, `avg_generation_time_ms`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 10-second HTTP timeout. If no evidence has been generated (`totalGenerated == 0`), the success rate defaults to 100%.
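The threshold logic above can be sketched as a small shell function; this is an illustrative sketch using integer percentages, not the plugin's actual implementation.

```bash
# Illustrative: apply the 95% / 99% thresholds to the metrics fields.
classify_evidence_rate() {
  local total="$1" failed="$2"
  if [ "$total" -eq 0 ]; then echo "pass (100%)"; return; fi
  local rate=$(( (total - failed) * 100 / total ))
  if [ "$rate" -lt 95 ]; then echo "fail (${rate}%)"
  elif [ "$rate" -lt 99 ]; then echo "warn (${rate}%)"
  else echo "pass (${rate}%)"
  fi
}

classify_evidence_rate 1000 60   # 94% -> fail
```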
## Why It Matters
Evidence generation is a critical path in the release pipeline. Every release decision, scan result, and policy evaluation produces evidence that feeds compliance audits and attestation chains. A dropping success rate means evidence records are being lost, which creates gaps in the audit trail. Below 95%, the system is losing more than 1 in 20 evidence records, making compliance reporting unreliable and potentially invalidating release approvals that lack supporting evidence.

## Common Causes
- Evidence generation service failures (internal errors, OOM)
- Database connectivity issues preventing evidence persistence
- Signing key unavailable, blocking signed evidence creation
- Storage quota exceeded on the evidence backend
- Intermittent failures due to high load or resource contention

## How to Fix

### Docker Compose
```bash
# Check evidence locker logs for errors
docker compose logs evidence-locker --since 1h | grep -i error

# Verify signing keys
docker compose exec evidence-locker stella evidence keys status

# Check database connectivity
docker compose exec evidence-locker stella evidence db check

# Check storage capacity
docker compose exec evidence-locker df -h /data/evidence

# If storage is full, clean up or expand volume
docker compose exec evidence-locker stella evidence cleanup --older-than 90d --dry-run
```

### Bare Metal / systemd
```bash
# Check service logs
journalctl -u stellaops-evidence-locker --since "1 hour ago" | grep -i error

# Verify signing keys
stella evidence keys status

# Check database connectivity
stella evidence db check

# Check storage usage
df -h /var/lib/stellaops/evidence

# Restart the service after addressing the root cause
sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm
```bash
# Check evidence locker pod logs
kubectl logs deploy/stellaops-evidence-locker --since=1h | grep -i error

# Verify signing keys
kubectl exec deploy/stellaops-evidence-locker -- stella evidence keys status

# Check persistent volume usage
kubectl exec deploy/stellaops-evidence-locker -- df -h /data/evidence

# Check for OOMKilled pods
kubectl get events --field-selector reason=OOMKilled -n stellaops
```

## Verification
```bash
stella doctor run --check check.compliance.evidence-rate
```

## Related Checks
- `check.compliance.attestation-signing` — signing key health affects evidence generation
- `check.compliance.evidence-integrity` — integrity of generated evidence
- `check.compliance.provenance-completeness` — provenance depends on evidence generation
- `check.compliance.audit-readiness` — overall audit readiness depends on evidence availability
104
docs/doctor/articles/compliance/export-readiness.md
Normal file
@@ -0,0 +1,104 @@
---
checkId: check.compliance.export-readiness
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, export, audit]
---
# Evidence Export Readiness

## What It Checks
Verifies that evidence can be exported in auditor-ready formats by querying the Evidence Locker at `/api/v1/evidence/export/capabilities`. The check evaluates four export capabilities:

- **PDF export**: ability to generate PDF evidence reports.
- **JSON export**: ability to export evidence as structured JSON.
- **Signed bundle export**: ability to create cryptographically signed evidence bundles.
- **Chain of custody report**: ability to generate chain-of-custody documentation.

| Condition | Result |
|---|---|
| Evidence Locker unreachable | Warn |
| 2 or more export formats unavailable | Fail |
| 1 export format unavailable | Warn |
| All 4 export formats available | Pass |

Evidence collected: `pdf_export`, `json_export`, `signed_bundle`, `chain_of_custody`, `available_formats`.

The check only runs when `EvidenceLocker:Url` or `Services:EvidenceLocker:Url` is configured. It uses a 10-second HTTP timeout.
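The capability counting can be sketched as follows; this is an illustrative sketch, passing one `true`/`false` flag per capability (PDF, JSON, signed bundle, chain of custody), not the plugin's actual implementation.

```bash
# Illustrative: count unavailable formats and map the count to a result.
classify_export_readiness() {
  local unavailable=0
  for cap in "$@"; do
    [ "$cap" = "true" ] || unavailable=$((unavailable + 1))
  done
  if [ "$unavailable" -ge 2 ]; then echo "fail (${unavailable} formats unavailable)"
  elif [ "$unavailable" -eq 1 ]; then echo "warn (1 format unavailable)"
  else echo "pass"
  fi
}

classify_export_readiness true true false true
```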
## Why It Matters
Auditors require evidence in specific formats. PDF reports are the most common delivery format for compliance reviews. Signed bundles provide cryptographic proof of evidence authenticity. The chain of custody report demonstrates that evidence has not been modified since collection. If these export capabilities are not available when an auditor requests them, it delays the audit process and may raise concerns about evidence integrity.

## Common Causes
- Export dependencies not installed (e.g., PDF rendering libraries)
- Signing keys not configured for evidence bundle signing
- Template files missing for PDF report generation
- Evidence Locker deployed without export module enabled

## How to Fix

### Docker Compose
```bash
# Check export configuration
docker compose exec evidence-locker stella evidence export --check

# Verify export dependencies are installed
docker compose exec evidence-locker dpkg -l | grep -i wkhtmltopdf

# Enable export features in environment
# EvidenceLocker__Export__PdfEnabled=true
# EvidenceLocker__Export__SignedBundleEnabled=true
# EvidenceLocker__Export__ChainOfCustodyEnabled=true

# Restart after configuration changes
docker compose restart evidence-locker
```

### Bare Metal / systemd
```bash
# Check export configuration
stella evidence export --check

# Install PDF rendering dependencies if missing
sudo apt install wkhtmltopdf

# Configure export in appsettings.json
# "EvidenceLocker": {
#   "Export": {
#     "PdfEnabled": true,
#     "SignedBundleEnabled": true,
#     "ChainOfCustodyEnabled": true
#   }
# }

sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm
```yaml
# values.yaml
evidenceLocker:
  export:
    pdfEnabled: true
    jsonEnabled: true
    signedBundleEnabled: true
    chainOfCustodyEnabled: true
    signingKeySecret: "stellaops-export-signing-key"
```

```bash
# Create signing key secret for bundles
kubectl create secret generic stellaops-export-signing-key \
  --from-file=key.pem=./export-signing-key.pem

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.compliance.export-readiness
```

## Related Checks
- `check.compliance.audit-readiness` — overall audit readiness including retention and logging
- `check.compliance.attestation-signing` — signing key health required for signed bundle export
- `check.compliance.evidence-integrity` — integrity of the evidence being exported
90
docs/doctor/articles/compliance/framework.md
Normal file
@@ -0,0 +1,90 @@
---
checkId: check.compliance.framework
plugin: stellaops.doctor.compliance
severity: warn
tags: [compliance, framework, soc2, fedramp]
---
# Compliance Framework

## What It Checks
Verifies that configured compliance framework requirements are met by querying the Policy service at `/api/v1/compliance/status`. The check supports SOC2, FedRAMP, HIPAA, PCI-DSS, and custom frameworks. It evaluates:

- **Failing controls**: any compliance controls in a failed state trigger a fail result.
- **Compliance score**: a score below 100% (but with zero failing controls) triggers a warning.
- **Control counts**: reports total, passing, and failing control counts along with the framework name.

| Condition | Result |
|---|---|
| Policy service unreachable | Warn |
| Any controls failing (failingControls > 0) | Fail |
| Compliance score < 100% | Warn |
| All controls passing, score = 100% | Pass |

The check only runs when `Compliance:Frameworks` is configured. It uses a 15-second HTTP timeout.
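The result mapping above can be sketched as a small shell function; this is an illustrative sketch (score as an integer percent), not the plugin's actual implementation.

```bash
# Illustrative: failing controls dominate; a sub-100% score alone only warns.
classify_framework() {
  local failing="$1" score="$2"
  if [ "$failing" -gt 0 ]; then echo "fail (${failing} controls failing)"
  elif [ "$score" -lt 100 ]; then echo "warn (score ${score}%)"
  else echo "pass"
  fi
}

classify_framework 0 97
```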
## Why It Matters
Compliance frameworks define the security and operational controls your organization must satisfy. Failing controls mean the system is not meeting regulatory requirements, which can result in audit findings, failed certifications, or legal exposure. Even partial non-compliance (score below 100%) indicates controls that need attention before the next audit cycle.

## Common Causes
- Control requirements not implemented in the platform configuration
- Evidence gaps where expected artifacts are missing
- Policy violations detected by the policy engine
- Configuration drift from the established compliance baseline
- New controls added to the framework that have not been addressed

## How to Fix

### Docker Compose
```bash
# List all failing controls
docker compose exec policy stella compliance audit --failing

# Generate remediation plan
docker compose exec policy stella compliance remediate --plan

# Review compliance status in detail
docker compose exec policy stella compliance status --framework soc2

# Configure frameworks in your .env
# Compliance__Frameworks=soc2,hipaa
```

### Bare Metal / systemd
```bash
# List failing controls
stella compliance audit --failing

# Generate remediation plan
stella compliance remediate --plan

# Configure frameworks in appsettings.json
# "Compliance": { "Frameworks": "soc2,hipaa" }

sudo systemctl restart stellaops-policy
```

### Kubernetes / Helm
```yaml
# values.yaml
compliance:
  frameworks: "soc2,hipaa"
  autoRemediate: false
  reportSchedule: "0 6 * * 1" # Weekly Monday 6am
```

```bash
# Apply and check
helm upgrade stellaops ./charts/stellaops -f values.yaml
kubectl exec deploy/stellaops-policy -- stella compliance audit --failing
```

## Verification
```bash
stella doctor run --check check.compliance.framework
```

## Related Checks
- `check.compliance.audit-readiness` — verifies the system is ready for compliance audits
- `check.compliance.evidence-integrity` — verifies evidence integrity for compliance evidence
- `check.compliance.provenance-completeness` — verifies provenance records support compliance claims
- `check.compliance.export-readiness` — verifies evidence can be exported for auditor review
102
docs/doctor/articles/compliance/provenance-completeness.md
Normal file
@@ -0,0 +1,102 @@
---
checkId: check.compliance.provenance-completeness
plugin: stellaops.doctor.compliance
severity: fail
tags: [compliance, provenance, slsa]
---
# Provenance Completeness

## What It Checks
Verifies that provenance records exist for all releases by querying the Provenance service at `/api/v1/provenance/completeness`. The check computes a completeness rate as `(totalReleases - missingCount) / totalReleases` and evaluates the SLSA (Supply-chain Levels for Software Artifacts) level:

| Condition | Result |
|---|---|
| Provenance service unreachable | Warn |
| Completeness rate < 99% | Fail |
| SLSA level < 2 (but completeness >= 99%) | Warn |
| Completeness >= 99% and SLSA level >= 2 | Pass |

Evidence collected: `completeness_rate`, `total_releases`, `missing_count`, `slsa_level`.

The check only runs when `Provenance:Url` or `Services:Provenance:Url` is configured. It uses a 15-second HTTP timeout. If no releases exist (`totalReleases == 0`), completeness defaults to 100%.
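The completeness and SLSA thresholds can be sketched as follows; this is an illustrative sketch using integer percentages, not the plugin's actual implementation.

```bash
# Illustrative: completeness gates first, then the SLSA level.
classify_provenance() {
  local total="$1" missing="$2" slsa="$3"
  if [ "$total" -eq 0 ]; then echo "pass (no releases)"; return; fi
  local rate=$(( (total - missing) * 100 / total ))
  if [ "$rate" -lt 99 ]; then echo "fail (completeness ${rate}%)"
  elif [ "$slsa" -lt 2 ]; then echo "warn (SLSA level ${slsa})"
  else echo "pass (completeness ${rate}%)"
  fi
}

classify_provenance 200 4 2
```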
## Why It Matters
Provenance records document the complete history of how a software artifact was built, including the source code, build system, and build steps. Without provenance, there is no verifiable link between source code and the deployed artifact. This is a foundational requirement for SLSA compliance and supply-chain security. Missing provenance for even a small percentage of releases creates audit gaps that undermine the trustworthiness of the entire release pipeline.

## Common Causes
- Build pipeline not configured to generate provenance attestations
- Provenance upload failures due to network or authentication issues
- Legacy releases created before provenance generation was enabled
- Manual deployments that bypass the standard build pipeline
- Build system not meeting SLSA level 2+ requirements

## How to Fix

### Docker Compose
```bash
# List releases missing provenance
docker compose exec provenance stella provenance audit --missing

# Generate backfill provenance for existing releases (dry run first)
docker compose exec provenance stella provenance backfill --dry-run

# If dry run looks correct, run the actual backfill
docker compose exec provenance stella provenance backfill

# Check SLSA level
docker compose exec provenance stella provenance slsa-level

# Ensure provenance generation is enabled in the pipeline
# Provenance__Enabled=true
# Provenance__SlsaLevel=2
```

### Bare Metal / systemd
```bash
# List releases missing provenance
stella provenance audit --missing

# Backfill provenance (dry run first)
stella provenance backfill --dry-run

# Check SLSA level configuration
stella provenance slsa-level

# Configure in appsettings.json
# "Provenance": { "Enabled": true, "SlsaLevel": 2 }

sudo systemctl restart stellaops-provenance
```

### Kubernetes / Helm
```yaml
# values.yaml
provenance:
  enabled: true
  slsaLevel: 2
  backfill:
    enabled: true
    schedule: "0 3 * * 0" # Weekly Sunday 3am
```

```bash
# List missing provenance
kubectl exec deploy/stellaops-provenance -- stella provenance audit --missing

# Backfill
kubectl exec deploy/stellaops-provenance -- stella provenance backfill --dry-run

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.compliance.provenance-completeness
```

## Related Checks
- `check.compliance.attestation-signing` — signing key required for provenance attestations
- `check.compliance.evidence-rate` — evidence generation rate includes provenance records
- `check.compliance.evidence-integrity` — integrity of provenance evidence
- `check.evidencelocker.provenance` — provenance chain integrity at the storage level
- `check.compliance.framework` — compliance frameworks may require specific SLSA levels
97
docs/doctor/articles/core/auth-config.md
Normal file
@@ -0,0 +1,97 @@
---
checkId: check.core.auth.config
plugin: stellaops.doctor.core
severity: warn
tags: [security, authentication, configuration]
---
# Authentication Configuration

## What It Checks
Verifies that authentication and authorization configuration is valid. The check inspects three configuration sections (`Authentication`, `Authority`, `Identity`) and validates:

- **JWT settings** (under `Authentication:Jwt`): ensures `Issuer` and `Audience` are set, the `SecretKey` is at least 32 characters long, and the key does not contain common weak values such as "secret" or "changeme".
- **OpenID Connect settings** (under `Authentication:OpenIdConnect`): ensures the `Authority` URL is configured.
- **Authority provider settings** (under `Authority`): reports which providers are enabled via `EnabledProviders`.

The check only runs when at least one of the three auth configuration sections exists. If none exist, it reports an informational result noting that authentication may not be configured.
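The JWT secret-key rules above can be sketched as a small shell function; this is an illustrative sketch (minimum length 32, rejecting "secret"/"changeme" substrings), not the plugin's actual implementation.

```bash
# Illustrative: apply the length and weak-value rules to a candidate key.
check_jwt_secret() {
  local key="$1"
  if [ "${#key}" -lt 32 ]; then echo "fail: key shorter than 32 characters"; return; fi
  case "$(printf '%s' "$key" | tr '[:upper:]' '[:lower:]')" in
    *secret*|*changeme*) echo "fail: key contains a common weak value"; return ;;
  esac
  echo "pass"
}

check_jwt_secret "changeme-changeme-changeme-changeme"   # long enough, but weak
```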
## Why It Matters
Misconfigured authentication allows unauthorized access to the Stella Ops control plane. A missing JWT issuer or audience disables token validation. A short or default signing key can be brute-forced or guessed, enabling token forgery. Without a properly configured OIDC authority, federated login flows will fail entirely.

## Common Causes
- JWT Issuer not configured
- JWT Audience not configured
- JWT SecretKey is shorter than 32 characters
- JWT SecretKey contains common weak values like "secret" or "changeme"
- OpenIdConnect Authority URL is missing
- Using development defaults in production

## How to Fix

### Docker Compose
Set the appropriate environment variables in your service definition inside `docker-compose.yml` or the `.env` file:

```yaml
environment:
  Authentication__Jwt__Issuer: "https://stella-ops.local"
  Authentication__Jwt__Audience: "stellaops-api"
  Authentication__Jwt__SecretKey: "<generate-a-strong-key-at-least-32-chars>"
  Authentication__OpenIdConnect__Authority: "https://authority.stella-ops.local"
```

Generate a strong key:
```bash
openssl rand -base64 48
```

### Bare Metal / systemd
Edit `appsettings.json` or `appsettings.Production.json`:

```json
{
  "Authentication": {
    "Jwt": {
      "Issuer": "https://stella-ops.yourdomain.com",
      "Audience": "stellaops-api",
      "SecretKey": "<strong-key-at-least-32-characters>"
    },
    "OpenIdConnect": {
      "Authority": "https://authority.yourdomain.com"
    }
  }
}
```

Restart the service:
```bash
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
Set values in your Helm `values.yaml`:

```yaml
authentication:
  jwt:
    issuer: "https://stella-ops.yourdomain.com"
    audience: "stellaops-api"
    signingKeySecret: "stellaops-jwt-secret" # reference a Kubernetes Secret
  oidc:
    authority: "https://authority.yourdomain.com"
```

Create the signing key secret:
```bash
kubectl create secret generic stellaops-jwt-secret \
  --from-literal=key="$(openssl rand -base64 48)"
```

## Verification
```bash
stella doctor run --check check.core.auth.config
```

## Related Checks
- `check.security.jwt.config` — deep validation of JWT signing, algorithm, and expiration settings
- `check.security.secrets` — ensures secrets like JWT keys are not stored as plain text in config
- `check.security.password.policy` — validates password complexity requirements
85
docs/doctor/articles/core/config-loaded.md
Normal file
@@ -0,0 +1,85 @@
---
checkId: check.core.config.loaded
plugin: stellaops.doctor.core
severity: fail
tags: [quick, configuration, startup]
---
# Configuration Loaded

## What It Checks
Verifies that the application configuration system is properly loaded and accessible. The check calls `IConfiguration.GetChildren()` and counts the number of root configuration sections. It collects:

- **SectionCount**: total number of top-level configuration sections found.
- **RootSections**: names of up to 10 root sections (e.g., `Logging`, `ConnectionStrings`, `Authentication`).
- **Environment**: the current hosting environment name.

If zero sections are found, the check fails. If the configuration object throws an exception when accessed, the check also fails with the exception details.

## Why It Matters
Configuration is the foundation of every Stella Ops service. Without a loaded configuration, connection strings, authentication settings, feature flags, and service endpoints are all missing. The service will fail to connect to databases, message brokers, and upstream services. This check catches the scenario where config files are missing from the container image, environment variables are not injected, or a configuration provider failed to initialize.

## Common Causes
- Configuration file (`appsettings.json`) is missing or empty
- Configuration provider not registered in `Program.cs`
- Environment variables not set in the deployment
- Config file not included in the Docker image build
- Volume mount overwriting the config directory with an empty directory

## How to Fix

### Docker Compose
Verify the configuration file exists inside the container:

```bash
docker compose exec <service> ls -la /app/appsettings.json
```

If missing, check your `Dockerfile` to ensure the file is copied. Alternatively, mount it as a volume:

```yaml
volumes:
  - ./config/appsettings.json:/app/appsettings.json:ro
```

Check that environment variables are being injected:

```bash
docker compose exec <service> printenv | grep -i stella
```

### Bare Metal / systemd
Verify the config file exists in the application directory:

```bash
ls -la /opt/stellaops/appsettings.json
head -5 /opt/stellaops/appsettings.json
```

Check environment variables:

```bash
printenv | grep -i stella
```

### Kubernetes / Helm
Check that the ConfigMap is mounted:

```bash
kubectl exec -it <pod> -- cat /app/appsettings.json
kubectl exec -it <pod> -- printenv | grep -i STELLA
```

Verify the ConfigMap exists:

```bash
kubectl get configmap stellaops-config -o yaml
```

## Verification
```bash
stella doctor run --check check.core.config.loaded
```

## Related Checks
- `check.core.config.required` — verifies specific required settings are present
- `check.core.env.variables` — verifies environment variables are set
104
docs/doctor/articles/core/config-required.md
Normal file
@@ -0,0 +1,104 @@
---
checkId: check.core.config.required
plugin: stellaops.doctor.core
severity: fail
tags: [quick, configuration, startup]
---
# Required Settings

## What It Checks
Verifies that required configuration settings are present and have non-empty values. The check supports multiple key variants to accommodate both `appsettings.json` (colon-separated) and environment variable (double-underscore-separated) configuration styles.

**Required settings** (at least one variant must be present):

| Canonical Name | Accepted Variants |
|---|---|
| `ConnectionStrings:DefaultConnection` | `ConnectionStrings:DefaultConnection`, `ConnectionStrings:Default`, `CONNECTIONSTRINGS__DEFAULTCONNECTION`, `CONNECTIONSTRINGS__DEFAULT` |

**Recommended settings** (warn if missing, not fail):

| Setting | Purpose |
|---|---|
| `Logging:LogLevel:Default` | Default log level |

The check also reads `PluginConfig:RequiredSettings` for additional plugin-specific required settings configured at runtime. For each required setting, it checks both the `IConfiguration` value and the direct environment variable (converting `:` to `__`).
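The `:` to `__` conversion can be reproduced with a one-liner; this is an illustrative helper matching the variants table above, not part of the check itself.

```bash
# Illustrative: convert an appsettings-style key to its environment-variable
# form by replacing ':' with '__' and uppercasing.
to_env_var() {
  printf '%s' "$1" | sed 's/:/__/g' | tr '[:lower:]' '[:upper:]'
}

to_env_var "ConnectionStrings:DefaultConnection"   # CONNECTIONSTRINGS__DEFAULTCONNECTION
```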
## Why It Matters
The database connection string is the most critical setting for any Stella Ops service. Without it, the service cannot connect to PostgreSQL, auto-migration cannot run, and every database-dependent operation will fail with a 500 error. This check catches the most common deployment mistake: forgetting to set the connection string.

## Common Causes
- Database connection string not configured in environment variables or appsettings
- Environment variables not set (check Docker compose `.env` or service environment section)
- Typo in the environment variable name (e.g., `CONNECTIONSTRING` instead of `CONNECTIONSTRINGS`)
- Config file present but missing the `ConnectionStrings` section

## How to Fix

### Docker Compose
Add the connection string to your `.env` file or directly in `docker-compose.yml`:

```bash
# In .env file
CONNECTIONSTRINGS__DEFAULTCONNECTION=Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops
```

Or in the service environment section:

```yaml
services:
  platform:
    environment:
      ConnectionStrings__DefaultConnection: "Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```

### Bare Metal / systemd
Add to `appsettings.json`:

```json
{
  "ConnectionStrings": {
    "DefaultConnection": "Host=localhost;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information"
    }
  }
}
```

Or set as an environment variable in the systemd unit:

```ini
[Service]
Environment=CONNECTIONSTRINGS__DEFAULTCONNECTION=Host=localhost;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops
```

### Kubernetes / Helm
Set connection string via a Kubernetes Secret:

```bash
kubectl create secret generic stellaops-db \
  --from-literal=connection-string="Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```

Reference in Helm values:

```yaml
env:
  - name: ConnectionStrings__DefaultConnection
    valueFrom:
      secretKeyRef:
        name: stellaops-db
        key: connection-string
```

## Verification
```bash
stella doctor run --check check.core.config.required
```

## Related Checks
- `check.core.config.loaded` — verifies the configuration system itself is loaded
- `check.core.env.variables` — verifies environment variables are set
- `check.core.services.health` — database health checks will fail if the connection string is missing
88 docs/doctor/articles/core/crypto-available.md Normal file
@@ -0,0 +1,88 @@
---
checkId: check.core.crypto.available
plugin: stellaops.doctor.core
severity: fail
tags: [quick, security, crypto]
---
# Cryptography Providers

## What It Checks
Verifies that required cryptographic algorithms are available on the host system. The check tests six algorithms by actually executing them:

| Algorithm | Test |
|-----------|------|
| **SHA-256** | Hashes a 4-byte test payload |
| **SHA-384** | Hashes a 4-byte test payload |
| **SHA-512** | Hashes a 4-byte test payload |
| **RSA** | Creates an RSA key pair and reads the key size |
| **ECDSA** | Creates an ECDSA key pair and reads the key size |
| **AES** | Creates an AES cipher and reads the key size |

The check also detects whether FIPS mode is enforced on the system via `CryptoConfig.AllowOnlyFipsAlgorithms` and reports the OS platform.

If any algorithm fails to execute, the check reports `fail` with the list of unavailable algorithms.
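A rough shell-level equivalent of the probe can be run with `openssl` (the check itself exercises the .NET crypto APIs, so this only approximates it on the host):

```shell
# Probe a few of the same primitives with openssl; each line prints
# "<alg> ok" only if the operation actually executes.
printf 'test' | openssl dgst -sha256 -binary >/dev/null && echo "SHA-256 ok"
printf 'test' | openssl dgst -sha512 -binary >/dev/null && echo "SHA-512 ok"
openssl ecparam -name prime256v1 -genkey -noout >/dev/null 2>&1 && echo "ECDSA ok"
printf 'test' | openssl enc -aes-256-cbc -pbkdf2 -k probe >/dev/null 2>&1 && echo "AES ok"
```

A missing line in the output points at the primitive the container image cannot provide.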

## Why It Matters
Stella Ops relies on these cryptographic primitives for:
- **SHA-256/384/512**: SBOM digests, evidence hashing, content-addressable storage, DSSE payloads.
- **RSA/ECDSA**: JWT signing, TLS certificates, code signing, attestation signatures.
- **AES**: Data-at-rest encryption, data protection keys.

If any algorithm is unavailable, core features like evidence signing, token validation, and encrypted storage will fail at runtime.

## Common Causes
- Operating system does not support required algorithms (minimal or stripped-down containers)
- FIPS mode restrictions preventing non-FIPS algorithms
- Missing cryptographic libraries (e.g., OpenSSL not installed in Alpine images)
- Running on a platform with limited crypto support

## How to Fix

### Docker Compose
If using Alpine-based images, ensure OpenSSL is installed:

```dockerfile
RUN apk add --no-cache openssl
```

Or switch to a Debian/Ubuntu-based image that includes full crypto support:

```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:8.0
```

### Bare Metal / systemd
Install required crypto libraries:

```bash
# Debian/Ubuntu
sudo apt-get install -y openssl libssl-dev

# RHEL/CentOS
sudo yum install -y openssl openssl-devel
```

If FIPS mode is required, ensure all algorithms used are FIPS-compliant:

```bash
# Check FIPS status
cat /proc/sys/crypto/fips_enabled
```

### Kubernetes / Helm
Use a base image with full cryptographic support. In your Helm values:

```yaml
image:
  repository: stellaops/platform
  tag: latest # Uses Debian-based runtime with full crypto
```

## Verification
```
stella doctor run --check check.core.crypto.available
```

## Related Checks
- `check.security.encryption` — validates encryption key configuration and algorithms
- `check.security.tls.certificate` — validates TLS certificate availability and validity
101 docs/doctor/articles/core/env-diskspace.md Normal file
@@ -0,0 +1,101 @@
---
checkId: check.core.env.diskspace
plugin: stellaops.doctor.core
severity: fail
tags: [quick, environment, resources]
---
# Disk Space

## What It Checks
Verifies sufficient disk space is available on the drive where the application is running. The check reads the drive information for the current working directory and applies two thresholds:

| Threshold | Value | Result |
|---|---|---|
| Critical (fail) | Less than **1 GB** free | `fail` |
| Warning | Less than **5 GB** free | `warn` |
| Healthy | 5 GB or more free | `pass` |

Evidence collected includes: drive name, free space, total space, and used percentage.
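The same thresholds can be applied by hand from a shell, which is useful for cross-checking the check's verdict on the host (POSIX `df` output is assumed):

```shell
# Classify free space on the current directory's filesystem using the
# check's thresholds (fail < 1 GB, warn < 5 GB).
free_kb=$(df -Pk . | awk 'NR==2 {print $4}')
free_gb=$((free_kb / 1024 / 1024))
if [ "$free_gb" -lt 1 ]; then
  echo "fail: ${free_gb} GB free"
elif [ "$free_gb" -lt 5 ]; then
  echo "warn: ${free_gb} GB free"
else
  echo "pass: ${free_gb} GB free"
fi
```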

## Why It Matters
Stella Ops services write logs, evidence files, SBOM data, scan results, and temporary processing artifacts to disk. When disk space is critically low:

- Database writes fail (PostgreSQL requires WAL space).
- Container images cannot be pulled or built.
- Log files cannot be written, causing silent data loss.
- Evidence locker writes fail, breaking the audit trail.
- Temporary scan artifacts fill up, causing scanner crashes.

## Common Causes
- Log files consuming disk space without rotation
- Temporary files not cleaned up after processing
- Application data growth (evidence locker, SBOM storage, scan results)
- Docker images and volumes consuming space on the same partition
- Database WAL files growing due to long-running transactions

## How to Fix

### Docker Compose
Check disk usage and clean up:

```bash
# Check overall disk usage
df -h

# Find large files
du -sh /var/lib/docker/* | sort -hr | head -20

# Clean Docker artifacts
docker system prune -a --volumes

# Clean application temp files
docker compose exec <service> rm -rf /tmp/*
```

Set up log rotation by adding to your service definition:

```yaml
logging:
  driver: json-file
  options:
    max-size: "10m"
    max-file: "3"
```

### Bare Metal / systemd
```bash
# Find large files
du -sh /* | sort -hr | head -20

# Clean temp files
rm -rf /tmp/*

# Rotate logs
sudo logrotate -f /etc/logrotate.conf

# Check and clean old journal logs
sudo journalctl --vacuum-size=100M
```

### Kubernetes / Helm
```bash
# Check node disk usage
kubectl top nodes

# Check PVC usage
kubectl exec -it <pod> -- df -h
```

Set ephemeral storage limits in Helm values:

```yaml
resources:
  limits:
    ephemeral-storage: "2Gi"
  requests:
    ephemeral-storage: "1Gi"
```

## Verification
```
stella doctor run --check check.core.env.diskspace
```

## Related Checks
- `check.docker.storage` — checks Docker-specific storage driver and disk usage
- `check.core.env.memory` — checks process memory usage
113 docs/doctor/articles/core/env-memory.md Normal file
@@ -0,0 +1,113 @@
---
checkId: check.core.env.memory
plugin: stellaops.doctor.core
severity: warn
tags: [quick, environment, resources]
---
# Memory Usage

## What It Checks
Verifies that the application process memory usage is within acceptable limits. The check reads the current process metrics and applies two thresholds:

| Threshold | Value (Working Set) | Result |
|---|---|---|
| Critical (fail) | Greater than **2 GB** | `fail` |
| Warning | Greater than **1 GB** | `warn` |
| Healthy | 1 GB or less | `pass` |

Evidence collected includes:
- **WorkingSet**: physical memory currently allocated to the process.
- **PrivateBytes**: total private memory allocated.
- **GCHeapSize**: managed heap size reported by the GC.
- **GCMemory**: total managed memory from `GC.GetTotalMemory()`.
- **Gen0/Gen1/Gen2 Collections**: garbage collection counts for each generation.
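On Linux you can read the working-set figure straight from `/proc`, which is handy for cross-checking the evidence the check reports (the shell's own PID is used here as a stand-in for the service process):

```shell
# VmRSS in /proc/<pid>/status is the resident set size, i.e. the
# working set the check measures (Linux only).
pid=$$
rss_kb=$(awk '/^VmRSS/ {print $2}' /proc/$pid/status)
echo "WorkingSet: $((rss_kb / 1024)) MB"
```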

## Why It Matters
Excessive memory usage leads to out-of-memory (OOM) kills, especially in containerized deployments where memory limits are enforced. When a Stella Ops service is OOM-killed:

- In-flight requests are dropped.
- Evidence writes may be incomplete, compromising the audit trail.
- Scan results in progress are lost.
- The container restarts, causing a brief outage and potential data corruption.

High Gen2 GC counts can also indicate a memory leak, where objects are promoted to the long-lived generation faster than they can be collected.

## Common Causes
- Memory leak in application code (undisposed resources, growing caches)
- Large data sets loaded entirely into memory (SBOM graphs, scan results)
- Insufficient memory limits configured for the container
- Normal operation with high load (many concurrent scans or requests)
- Memory-intensive operations in progress (large SBOM diff, graph analysis)

## How to Fix

### Docker Compose
Set memory limits for the service in `docker-compose.yml`:

```yaml
services:
  platform:
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M
```

Analyze memory usage:

```bash
# Check container memory stats
docker stats --no-stream <container>

# Capture a memory dump for analysis
docker compose exec <service> dotnet-dump collect -p 1
```

### Bare Metal / systemd
Set memory limits in the systemd unit:

```ini
[Service]
MemoryMax=2G
MemoryHigh=1536M
```

Analyze memory:

```bash
# Install diagnostics tools
dotnet tool install -g dotnet-gcdump

# Capture GC dump
dotnet-gcdump collect -p <pid>

# Analyze with dotnet-dump
dotnet-dump analyze <dump-file>
```

### Kubernetes / Helm
Set resource limits in Helm values:

```yaml
resources:
  limits:
    memory: "2Gi"
  requests:
    memory: "512Mi"
```

Monitor memory:
```bash
kubectl top pods -l app=stellaops-platform
```

## Verification
```
stella doctor run --check check.core.env.memory
```

## Related Checks
- `check.core.env.diskspace` — checks available disk space
- `check.core.services.health` — overall service health, which can degrade under memory pressure
88 docs/doctor/articles/core/env-variables.md Normal file
@@ -0,0 +1,88 @@
---
checkId: check.core.env.variables
plugin: stellaops.doctor.core
severity: warn
tags: [quick, environment, configuration]
---
# Environment Variables

## What It Checks
Verifies that expected environment variables are configured for the runtime environment. The check looks for two recommended variables:

| Variable | Purpose |
|---|---|
| `ASPNETCORE_ENVIRONMENT` | Sets the ASP.NET Core hosting environment (Development, Staging, Production) |
| `DOTNET_ENVIRONMENT` | Sets the .NET hosting environment (fallback for non-ASP.NET hosts) |

The check also counts all platform-related environment variables matching these prefixes: `STELLA*`, `ASPNETCORE*`, `DOTNET*`, `CONNECTIONSTRINGS*`.

**Result logic:**
- If neither recommended variable is set but other platform variables exist (e.g., `STELLAOPS_*`, `CONNECTIONSTRINGS__*`), the check **passes** with a note that the environment defaults are being used.
- If no platform variables at all are found, the check **warns** that the service may not be running in a configured deployment.
- If at least one recommended variable is set, the check **passes** and reports the current environment name and total platform variable count.
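The variable scan can be reproduced from a shell; the prefix list below mirrors the one the check uses:

```shell
# Count platform-related variables; warn when none are present, as the
# check does.
count=$(env | grep -Ec '^(STELLA|ASPNETCORE|DOTNET|CONNECTIONSTRINGS)')
if [ "$count" -eq 0 ]; then
  echo "warn: no platform environment variables found"
else
  echo "pass: $count platform variable(s) present"
fi
```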

## Why It Matters
The hosting environment controls which configuration files are loaded (`appsettings.Development.json` vs. `appsettings.Production.json`), whether developer exception pages are shown, and how logging is configured. Running in the wrong environment can expose detailed error information in production or apply development-only settings that degrade performance.

## Common Causes
- No StellaOps, ASP.NET, or .NET environment variables found in the process
- The service is not running in a configured deployment (e.g., running directly without Docker or systemd)
- Docker compose `.env` file missing or not loaded
- Environment variables defined in the wrong scope (host-level vs. container-level)

## How to Fix

### Docker Compose
Add the environment variable to your service in `docker-compose.yml`:

```yaml
services:
  platform:
    environment:
      ASPNETCORE_ENVIRONMENT: Production
```

Or in the `.env` file:

```bash
ASPNETCORE_ENVIRONMENT=Production
```

### Bare Metal / systemd
Set the variable in the systemd unit file:

```ini
[Service]
Environment=ASPNETCORE_ENVIRONMENT=Production
```

Or export it in the shell:

```bash
export ASPNETCORE_ENVIRONMENT=Production
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
Set in Helm values:

```yaml
env:
  - name: ASPNETCORE_ENVIRONMENT
    value: "Production"
```

Or set it on the deployment directly:

```bash
kubectl set env deployment/stellaops-platform ASPNETCORE_ENVIRONMENT=Production
```

## Verification
```
stella doctor run --check check.core.env.variables
```

## Related Checks
- `check.core.config.loaded` — verifies the configuration system is loaded (environment affects which config files load)
- `check.core.config.required` — verifies specific required settings are present
70 docs/doctor/articles/core/services-dependencies.md Normal file
@@ -0,0 +1,70 @@
---
checkId: check.core.services.dependencies
plugin: stellaops.doctor.core
severity: fail
tags: [quick, services, di]
---
# Required Services

## What It Checks
Verifies that required infrastructure services are registered in the .NET dependency injection (DI) container. The check resolves the following service types from the `IServiceProvider`:

| Service Type | Purpose |
|---|---|
| `TimeProvider` | Abstracts the system clock for testability and time-based logic |
| `ILoggerFactory` | Provides structured logging across all components |

For each service type, the check attempts `GetService<T>()`. If the service resolves to `null` or throws, it is recorded as missing.

The check reports the count of registered vs. missing services and lists the missing ones by name.

## Why It Matters
These services are foundational dependencies used by nearly every Stella Ops component. If `TimeProvider` is missing, time-based features (token expiration, certificate validity, scheduling) will not work. If `ILoggerFactory` is missing, no structured logging is produced, making troubleshooting impossible. A missing DI registration usually indicates a misconfigured `Program.cs` or a missing `AddStellaOps*()` call during startup.

## Common Causes
- Services not registered in the DI container during application startup
- Missing `builder.Services.AddXxx()` call in `Program.cs` or `Startup.cs`
- Incorrect service registration order causing dependency resolution failures
- Custom host builder that skips default service registrations

## How to Fix

### Docker Compose
This is a code-level issue, not a deployment configuration problem. Ensure the service's `Program.cs` includes the standard Stella Ops service registration:

```csharp
builder.Services.AddSingleton(TimeProvider.System);
builder.Services.AddLogging();
```

Rebuild the container after code changes:
```bash
docker compose build <service> --no-cache
docker compose up -d <service>
```

### Bare Metal / systemd
Verify the application is using the standard Stella Ops host builder. Check `Program.cs` for the required registrations.

Restart after any code changes:
```bash
sudo systemctl restart stellaops-<service>
```

### Kubernetes / Helm
This issue requires a code fix and a new container image. After fixing the registration, build and push a new image:

```bash
docker build -t stellaops/<service>:latest .
docker push stellaops/<service>:latest
kubectl rollout restart deployment/<service>
```

## Verification
```
stella doctor run --check check.core.services.dependencies
```

## Related Checks
- `check.core.services.health` — aggregates health check results from registered health check services
- `check.core.config.loaded` — verifies the configuration system is loaded (a prerequisite for service registration)
110 docs/doctor/articles/core/services-health.md Normal file
@@ -0,0 +1,110 @@
---
checkId: check.core.services.health
plugin: stellaops.doctor.core
severity: fail
tags: [health, services]
---
# Service Health

## What It Checks
Aggregates health status from all registered ASP.NET Core `IHealthCheck` services. The check resolves `HealthCheckService` from the DI container and calls `CheckHealthAsync()`. It then categorizes each registered health check as Healthy, Degraded, or Unhealthy.

| Overall Status | Result |
|---|---|
| **Unhealthy** (any check unhealthy) | `fail` — lists the failing checks by name, with error details for up to 5 |
| **Degraded** (any check degraded, none unhealthy) | `warn` |
| **Healthy** (all checks healthy) | `pass` — reports total count and duration |

If `HealthCheckService` is not registered in the DI container, the check is skipped.

Evidence collected includes: overall status, total checks count, healthy/degraded/unhealthy counts, failed check names, and execution duration.
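The status-to-result mapping is simple enough to sketch; the same tiering can be applied when reading `/health` output manually (a literal status value stands in for the live response):

```shell
# Map the aggregate HealthStatus to the doctor result.
status="Degraded"   # stand-in; read from the /health endpoint in practice
case "$status" in
  Unhealthy) echo "fail" ;;
  Degraded)  echo "warn" ;;
  *)         echo "pass" ;;
esac
```

With `Degraded` as input this prints `warn`, matching the table above.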

## Why It Matters
Health checks are the primary mechanism for detecting infrastructure problems: database connectivity, message broker availability, external API reachability, and internal service dependencies. An unhealthy result means at least one critical dependency is down and the service cannot function correctly. Load balancers and orchestrators use health check endpoints to route traffic away from unhealthy instances.

## Common Causes
- Dependent service unavailable (database, Valkey, external API)
- Database connection failed or timed out
- External API unreachable (network partition, DNS failure)
- Health check timeout exceeded (the default estimated check duration is 5 seconds)
- Configuration error preventing a dependency from connecting

## How to Fix

### Docker Compose
Check the health endpoint directly:

```bash
# Hit the health endpoint
curl -s http://localhost:5000/health | jq

# Check dependent service connectivity
docker compose exec <service> curl -s http://postgres:5432
docker compose exec <service> curl -s http://valkey:6379

# Restart unhealthy services
docker compose restart <failing-dependency>
```

Ensure dependent services are healthy before starting:

```yaml
services:
  platform:
    depends_on:
      postgres:
        condition: service_healthy
      valkey:
        condition: service_healthy
```

### Bare Metal / systemd
```bash
# Check the health endpoint
curl -s http://localhost:5000/health | jq

# Check database connectivity
pg_isready -h localhost -p 5432

# Check service logs for errors
journalctl -u stellaops-platform --since "5 minutes ago" | grep -i error
```

### Kubernetes / Helm
```bash
# Check pod health
kubectl describe pod <pod> | grep -A 5 "Conditions"

# Check health endpoint inside the pod
kubectl exec -it <pod> -- curl -s http://localhost:5000/health | jq

# Check events for restart loops
kubectl get events --field-selector involvedObject.name=<pod> --sort-by='.lastTimestamp'
```

Configure health check probes in Helm values:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
```

## Verification
```
stella doctor run --check check.core.services.health
```

## Related Checks
- `check.core.services.dependencies` — verifies required DI services are registered (a prerequisite for health checks)
- `check.core.config.required` — verifies required settings like connection strings are present
- `check.docker.daemon` — verifies the Docker daemon is running (relevant when health checks include Docker connectivity)
127 docs/doctor/articles/crypto/certchain.md Normal file
@@ -0,0 +1,127 @@
---
checkId: check.crypto.certchain
plugin: stellaops.doctor.crypto
severity: warn
tags: [crypto, certificate, tls, security]
---
# Certificate Chain Validation

## What It Checks
Verifies certificate chain completeness, trust anchor validity, and expiration for the configured TLS certificate. The check reads the certificate path from `Crypto:TlsCertPath`, `Kestrel:Certificates:Default:Path`, or `Server:TlsCertificate` and validates:

- **File existence**: whether the configured certificate file exists on disk.
- **Chain completeness**: whether all intermediate certificates are present (no missing links).
- **Trust anchor validity**: whether the root CA is trusted by the system trust store.
- **Expiration**: days until the certificate expires, with tiered severity.

| Condition | Result |
|---|---|
| No TLS certificate configured | Skip |
| Certificate file not found | Fail |
| Certificate chain incomplete (missing intermediates) | Fail |
| Trust anchor not valid (unknown root CA) | Fail |
| Certificate already expired | Fail |
| Certificate expires within 7 days | Fail |
| Certificate expires within 30 days | Warn |
| Chain complete, trust anchor valid, not expiring soon | Pass |

Evidence collected: `CertPath`, `ChainLength`, `MissingIntermediates`, `TrustAnchorValid`, `TrustAnchorIssuer`, `ExpirationDate`, `DaysRemaining`.

This check always runs (no precondition) but skips if no TLS certificate path is configured.
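The expiry tiering reduces to a small comparison over `DaysRemaining`; a sketch (the `days` value is illustrative):

```shell
# Tier certificate expiry the way the check does.
days=25   # illustrative DaysRemaining value
if [ "$days" -lt 0 ]; then
  echo "fail: certificate already expired"
elif [ "$days" -le 7 ]; then
  echo "fail: expires in ${days} days"
elif [ "$days" -le 30 ]; then
  echo "warn: expires in ${days} days"
else
  echo "pass: ${days} days remaining"
fi
```

With 25 days remaining this lands in the warn tier, giving time to renew before the 7-day fail threshold.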

## Why It Matters
An incomplete certificate chain causes TLS handshake failures for clients that do not have intermediate certificates cached. An untrusted root CA triggers browser and API client warnings, or outright connection refusal. An expired certificate causes an immediate service outage for all HTTPS connections. Certificate issues affect every component that communicates over TLS, including the UI, API, inter-service communication, and external integrations.

## Common Causes
- Certificate file was moved or deleted from the configured path
- Incorrect certificate path in configuration
- Missing intermediate certificates in the certificate bundle
- Incomplete certificate bundle (only the leaf certificate, no intermediates)
- Root CA not added to the system trust store
- Self-signed certificate not explicitly trusted
- Certificate not renewed before expiration
- Automated renewal process failed silently

## How to Fix

### Docker Compose
```bash
# Check if certificate file exists at configured path
docker compose exec gateway ls -la /certs/

# Verify certificate details
docker compose exec gateway openssl x509 -in /certs/server.crt -noout -dates -subject -issuer

# Verify certificate chain
docker compose exec gateway openssl verify -untrusted /certs/chain.pem /certs/server.crt

# Bundle certificates correctly (leaf + intermediates)
cat server.crt intermediate.crt > fullchain.pem

# Update configuration in .env or compose override
# Crypto__TlsCertPath=/certs/fullchain.pem

# Set up automated renewal notification
# Notify__CertExpiry__ThresholdDays=14
```

### Bare Metal / systemd
```bash
# Verify certificate file exists
ls -la /etc/stellaops/certs/server.crt

# Check certificate expiration
openssl x509 -in /etc/stellaops/certs/server.crt -noout -enddate

# Download missing intermediates
stella crypto cert fetch-chain --cert /etc/stellaops/certs/server.crt --output /etc/stellaops/certs/fullchain.pem

# Add CA to system trust store (Debian/Ubuntu)
sudo cp root-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates

# Or configure explicit trust anchor
stella crypto trust-anchors add --type ca --cert root-ca.crt

# Renew certificate
stella crypto cert renew --cert /etc/stellaops/certs/server.crt

# Update appsettings.json
# "Crypto": { "TlsCertPath": "/etc/stellaops/certs/fullchain.pem" }

sudo systemctl restart stellaops-gateway
```

### Kubernetes / Helm
```bash
# Check certificate secret
kubectl get secret stellaops-tls-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# Verify certificate chain
kubectl get secret stellaops-tls-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl verify

# Update TLS certificate secret
kubectl create secret tls stellaops-tls-cert \
  --cert=fullchain.pem \
  --key=server.key \
  --dry-run=client -o yaml | kubectl apply -f -
```

```yaml
# values.yaml - use cert-manager for automated renewal
certManager:
  enabled: true
  issuer: letsencrypt-prod
  renewBefore: 360h # 15 days before expiry
```

## Verification
```
stella doctor run --check check.crypto.certchain
```

## Related Checks
- `check.crypto.fips` — FIPS compliance may impose certificate algorithm constraints
- `check.crypto.eidas` — eIDAS compliance requires specific signature algorithms on certificates
- `check.crypto.hsm` — HSM may store the private key associated with the certificate
- `check.compliance.attestation-signing` — attestation signing uses related key material
101 docs/doctor/articles/crypto/eidas.md Normal file
@@ -0,0 +1,101 @@
---
checkId: check.crypto.eidas
plugin: stellaops.doctor.crypto
severity: fail
tags: [crypto, eidas, eu, compliance, signature]
---
# eIDAS Compliance

## What It Checks
Verifies that eIDAS-compliant signature algorithms are available for EU deployments. The check references ETSI TS 119 312 (Cryptographic Suites) and validates availability of the following required algorithms:

- **RSA-PSS-SHA256** (RSA-PSS with SHA-256)
- **RSA-PSS-SHA384** (RSA-PSS with SHA-384)
- **RSA-PSS-SHA512** (RSA-PSS with SHA-512)
- **ECDSA-P256-SHA256** (ECDSA with P-256 and SHA-256)
- **ECDSA-P384-SHA384** (ECDSA with P-384 and SHA-384)
- **Ed25519** (EdDSA with Curve25519)

The check also validates the minimum RSA key size. Per eIDAS guidelines post-2024, RSA keys must be at least 3072 bits. The configured minimum is read from `Crypto:MinRsaKeySize` (default 2048).

| Condition | Result |
|---|---|
| Any required algorithm missing | Fail |
| All algorithms available but RSA key size < 3072 | Warn |
| All algorithms available and key size >= 3072 | Pass |

Evidence collected: `CryptoProfile`, `AvailableAlgorithms`, `MissingAlgorithms`, `MinRsaKeySize`, `RequiredMinRsaKeySize`.

The check only runs when `Crypto:Profile` or `Cryptography:Profile` contains "eidas", "eu", or "european".
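The key-size gate can be exercised against a real key with `openssl`; the sketch below generates a deliberately undersized 2048-bit key to show the warn path (the `sed` parsing of openssl's text output is an assumption about your local output format):

```shell
# Generate a 2048-bit RSA key and compare its size to the 3072-bit floor.
tmpkey=$(mktemp)
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out "$tmpkey" 2>/dev/null
bits=$(openssl rsa -in "$tmpkey" -noout -text 2>/dev/null \
  | sed -n 's/.*(\([0-9][0-9]*\) bit.*/\1/p' | head -n1)
if [ "$bits" -lt 3072 ]; then
  echo "warn: RSA key is ${bits} bits (< 3072)"
else
  echo "pass: RSA key is ${bits} bits"
fi
rm -f "$tmpkey"
```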

## Why It Matters
eIDAS (Electronic Identification, Authentication and Trust Services) is an EU regulation that establishes standards for electronic signatures and trust services. Deployments in the EU that create qualified electronic signatures or seals must use algorithms approved by ETSI. Using non-compliant algorithms means signatures may not be legally recognized, and the deployment may fail regulatory requirements. RSA keys below 3072 bits are considered insufficient for long-term security under current eIDAS guidelines.

## Common Causes
- OpenSSL version too old to support all required algorithms
- Crypto libraries compiled without required algorithm support
- Configuration restricting the set of available algorithms
- Legacy RSA key size configuration not updated for post-2024 requirements
- Using LibreSSL instead of OpenSSL (missing some algorithms)

## How to Fix

### Docker Compose
```bash
# Check OpenSSL version and available algorithms
docker compose exec gateway openssl version
docker compose exec gateway openssl list -signature-algorithms

# Update minimum RSA key size
# Crypto__MinRsaKeySize=3072
# Crypto__Profile=eu

# Restart services after configuration change
docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Check OpenSSL version
openssl version

# Verify available signature algorithms
openssl list -signature-algorithms

# Update OpenSSL if algorithms are missing
sudo apt update && sudo apt install openssl libssl-dev

# Configure eIDAS crypto profile
stella crypto profile set --profile eu

# Set minimum RSA key size in appsettings.json
# "Crypto": { "Profile": "eu", "MinRsaKeySize": 3072 }

sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
crypto:
  profile: eu
  minRsaKeySize: 3072
```

```bash
# Verify algorithm support in pod
kubectl exec deploy/stellaops-gateway -- openssl list -signature-algorithms

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```
stella doctor run --check check.crypto.eidas
```

## Related Checks
- `check.crypto.certchain` — certificate chain must use eIDAS-compliant algorithms
- `check.crypto.fips` — FIPS and eIDAS have overlapping but distinct algorithm requirements
- `check.crypto.hsm` — an HSM may be required for qualified eIDAS signatures
- `check.compliance.attestation-signing` — attestation signing should use eIDAS-compliant algorithms in EU deployments
141
docs/doctor/articles/crypto/fips.md
Normal file
@@ -0,0 +1,141 @@
---
checkId: check.crypto.fips
plugin: stellaops.doctor.crypto
severity: fail
tags: [crypto, fips, compliance, security]
---
# FIPS 140-2 Compliance

## What It Checks
Verifies that FIPS 140-2 mode is enabled and that FIPS-compliant algorithms are functional. The check performs two phases:

**Phase 1 - FIPS mode detection:**
- On Linux: reads `/proc/sys/crypto/fips_enabled` (expects "1").
- On Windows: checks the registry at `HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy\Enabled` and the `DOTNET_SYSTEM_NET_SECURITY_USEFIPSVALIDATED` environment variable.
- Reports the platform, crypto provider (OpenSSL/bcrypt/CoreCrypto), and whether the OpenSSL FIPS module is loaded.

**Phase 2 - Algorithm verification** (actual crypto operations, not just configuration):
- **AES-256**: creates key, encrypts test data, verifies output.
- **SHA-256**: hashes test data, verifies 32-byte output.
- **SHA-384**: hashes test data, verifies 48-byte output.
- **SHA-512**: hashes test data, verifies 64-byte output.
- **RSA-2048**: generates key pair, signs and verifies test data.
- **ECDSA-P256**: generates key pair, signs and verifies test data.

| Condition | Result |
|---|---|
| FIPS mode not enabled at OS level | Fail |
| FIPS mode enabled but some algorithms fail testing | Warn |
| FIPS mode enabled and all algorithms pass | Pass |

Evidence collected: `fips_mode_enabled`, `platform`, `crypto_provider`, `openssl_fips_module_loaded`, `crypto_profile`, `algorithms_tested`, `algorithms_available`, `algorithms_missing`, per-algorithm test results.

The check only runs when `Crypto:Profile` or `Cryptography:Profile` contains "fips", "fedramp", or equals "us-gov".
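Phase 2's pattern — perform a real operation and validate the output shape rather than trusting configuration — can be mimicked from the shell. A minimal sketch of the SHA-256 case (a 32-byte digest renders as 64 hex characters):

```shell
# Hash a test vector and confirm the digest length is 64 hex chars (32 bytes)
digest=$(printf 'fips-selftest' | openssl dgst -sha256 | awk '{print $NF}')
echo "${#digest}"   # 64
```

On a FIPS-enabled host, a non-approved digest invoked the same way fails outright instead of producing output.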
## Why It Matters
FIPS 140-2 compliance is mandatory for US government deployments (FedRAMP, DoD, ITAR) and many regulated industries (finance, healthcare). Running without FIPS mode means cryptographic operations may use non-validated implementations, which violates federal security requirements. Even with FIPS mode enabled, individual algorithm failures indicate a broken crypto subsystem that could silently produce invalid signatures or weak encryption.

## Common Causes
- FIPS mode not enabled in the operating system
- OpenSSL FIPS provider not loaded or not installed
- .NET runtime not configured for FIPS-validated algorithms
- FIPS module version incompatible with the OpenSSL version
- Algorithm test failure due to incomplete FIPS provider installation

## How to Fix

### Docker Compose
```bash
# Check if FIPS mode is enabled in the container
docker compose exec gateway cat /proc/sys/crypto/fips_enabled

# Enable FIPS mode in the host OS first (container inherits host FIPS)
# Then restart the compose stack

# Set crypto profile
# Crypto__Profile=fips

# Verify algorithms inside container
docker compose exec gateway openssl list -providers
docker compose exec gateway openssl list -digest-algorithms
```

### Bare Metal / systemd

**Linux (RHEL/CentOS/Fedora):**
```bash
# Enable FIPS mode
sudo fips-mode-setup --enable

# Verify FIPS status
fips-mode-setup --check

# Reboot required after enabling
sudo reboot

# After reboot, verify
cat /proc/sys/crypto/fips_enabled  # Should output "1"

# Restart StellaOps services
sudo systemctl restart stellaops
```

**Linux (Ubuntu/Debian):**
```bash
# Install FIPS packages (requires an Ubuntu Pro subscription)
sudo apt install ubuntu-fips
sudo ua enable fips

# Reboot required
sudo reboot
```

**Windows:**
```
Enable via Local Security Policy:
Security Settings > Local Policies > Security Options >
"System cryptography: Use FIPS compliant algorithms" = Enabled

Or via registry (requires reboot):
reg add HKLM\System\CurrentControlSet\Control\Lsa\FipsAlgorithmPolicy /v Enabled /t REG_DWORD /d 1 /f
```

```bash
# Configure StellaOps
# "Crypto": { "Profile": "fips" }

sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
crypto:
  profile: fips

# FIPS must be enabled at the node level
# For EKS: use Amazon Linux 2 FIPS AMI
# For AKS: use FIPS-enabled node pools
# For GKE: use Container-Optimized OS with FIPS
```

```bash
# Verify FIPS in pod
kubectl exec deploy/stellaops-gateway -- cat /proc/sys/crypto/fips_enabled

# Check OpenSSL FIPS provider
kubectl exec deploy/stellaops-gateway -- openssl list -providers

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.crypto.fips
```

## Related Checks
- `check.crypto.certchain` — certificates must use FIPS-approved algorithms
- `check.crypto.eidas` — eIDAS has overlapping but distinct requirements from FIPS
- `check.crypto.hsm` — FIPS 140-2 Level 3+ may require HSM for key storage
- `check.compliance.attestation-signing` — signing must use FIPS-validated algorithms in FIPS deployments
120
docs/doctor/articles/crypto/gost.md
Normal file
@@ -0,0 +1,120 @@
---
checkId: check.crypto.gost
plugin: stellaops.doctor.crypto
severity: fail
tags: [crypto, gost, russia, compliance]
---
# GOST Algorithm Availability

## What It Checks
Verifies that GOST cryptographic algorithms are available for Russian deployments. The check validates two layers:

**Layer 1 - GOST engine detection:**
Checks whether the OpenSSL GOST engine is loaded by looking for the engine shared object at:
- A custom path configured via `Crypto:Gost:EnginePath`
- Common system paths: `/usr/lib/x86_64-linux-gnu/engines-{3,1.1}/gost.so`, `/usr/lib64/engines-{3,1.1}/gost.so`

**Layer 2 - Algorithm availability** (only if engine is loaded):
Verifies the following GOST algorithms are accessible:
- **GOST R 34.10-2012-256** (digital signature, 256-bit)
- **GOST R 34.10-2012-512** (digital signature, 512-bit)
- **GOST R 34.11-2012-256** (Stribog hash, 256-bit)
- **GOST R 34.11-2012-512** (Stribog hash, 512-bit)
- **GOST R 34.12-2015** (Kuznyechik block cipher)
- **GOST 28147-89** (Magma legacy block cipher)

| Condition | Result |
|---|---|
| GOST engine not loaded | Fail |
| Engine loaded but some algorithms missing | Warn |
| Engine loaded and all algorithms available | Pass |

Evidence collected: `CryptoProfile`, `GostEngineLoaded`, `AvailableAlgorithms`, `MissingAlgorithms`, `RequiredAlgorithms`.

The check only runs when `Crypto:Profile` or `Cryptography:Profile` contains "gost", "russia", or equals "ru".

## Why It Matters
Russian regulatory requirements mandate the use of GOST cryptographic algorithms for government and many commercial deployments. Without GOST algorithm support, the platform cannot create compliant digital signatures or encrypt data according to Russian standards. This blocks deployment in regulated Russian environments and may violate data protection requirements.

## Common Causes
- OpenSSL GOST engine not installed on the system
- GOST engine not configured in `openssl.cnf`
- Missing `gost-engine` package
- GOST engine version too old (missing newer algorithms)
- GOST engine installed but algorithm disabled in configuration
- Incomplete GOST engine installation

## How to Fix

### Docker Compose
```bash
# Check if GOST engine is available
docker compose exec gateway openssl engine gost -c 2>/dev/null || echo "GOST engine not found"

# Install GOST engine in the container (add to Dockerfile for persistence)
# For Debian/Ubuntu based images:
# RUN apt-get install -y libengine-gost-openssl1.1

# Set crypto profile
# Crypto__Profile=ru
# Crypto__Gost__EnginePath=/usr/lib/x86_64-linux-gnu/engines-3/gost.so

docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Install GOST engine (Debian/Ubuntu)
sudo apt install libengine-gost-openssl1.1

# Or install from source
git clone https://github.com/gost-engine/engine
cd engine && mkdir build && cd build
cmake .. && make && sudo make install

# Configure OpenSSL to load GOST engine
# Add to /etc/ssl/openssl.cnf:
# [gost_section]
# engine_id = gost
# default_algorithms = ALL

# Verify engine is loaded
openssl engine gost -c

# Configure StellaOps GOST profile
stella crypto profile set --profile ru

# In appsettings.json:
# "Crypto": { "Profile": "ru" }

sudo systemctl restart stellaops-platform
```
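The `openssl.cnf` comment in the Bare Metal steps shows only the engine section; OpenSSL also needs the top-level hooks that point at it before the engine loads. A fuller sketch, following the upstream gost-engine README (the `dynamic_path` is whichever shared-object path Layer 1 found on your system):

```ini
# /etc/ssl/openssl.cnf (openssl_conf must appear near the top of the file)
openssl_conf = openssl_def

[openssl_def]
engines = engine_section

[engine_section]
gost = gost_section

[gost_section]
engine_id = gost
dynamic_path = /usr/lib/x86_64-linux-gnu/engines-3/gost.so
default_algorithms = ALL
```

After editing, `openssl engine gost -c` should list the GOST algorithms without any extra flags.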
### Kubernetes / Helm
```yaml
# values.yaml
crypto:
  profile: ru
  gost:
    enginePath: /usr/lib/x86_64-linux-gnu/engines-3/gost.so
```

```bash
# Verify in pod
kubectl exec deploy/stellaops-gateway -- openssl engine gost -c

# Use a base image that includes GOST engine support
# Or mount the engine as a volume
helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.crypto.gost
```

## Related Checks
- `check.crypto.certchain` — certificates in GOST deployments should use GOST signature algorithms
- `check.crypto.fips` — FIPS and GOST are mutually exclusive regional crypto profiles
- `check.crypto.sm` — SM (Chinese) is another regional crypto profile with similar structure
- `check.crypto.hsm` — GOST keys may be stored in an HSM with GOST support
131
docs/doctor/articles/crypto/hsm.md
Normal file
@@ -0,0 +1,131 @@
---
checkId: check.crypto.hsm
plugin: stellaops.doctor.crypto
severity: warn
tags: [crypto, hsm, pkcs11, security]
---
# HSM/PKCS#11 Availability

## What It Checks
Verifies HSM (Hardware Security Module) availability via the PKCS#11 interface. The check validates four layers:

1. **Module configuration**: whether a PKCS#11 module path is configured via `Crypto:Hsm:ModulePath` or `Cryptography:Pkcs11:ModulePath`.
2. **Module file existence**: whether the configured `.so` (Linux) or `.dll` (Windows) file exists on disk.
3. **Slot access**: whether the PKCS#11 module can enumerate slots and access the configured slot.
4. **Token presence**: whether a token is initialized in the slot and accessible (login test).

| Condition | Result |
|---|---|
| Module path not configured | Fail |
| Module file not found at configured path | Fail |
| Slot access failed (init error, no slots, permission denied) | Fail |
| Token not accessible (not initialized, login failure) | Warn |
| Module loaded, slot accessible, token present | Pass |

Evidence collected: `ModulePath`, `ModuleExists`, `SlotId`, `SlotLabel`, `SlotAccess`, `TokenPresent`, `TokenLabel`.

The check only runs when `Crypto:Hsm:Enabled` or `Cryptography:Pkcs11:Enabled` is set to "true".

## Why It Matters
HSMs provide tamper-resistant hardware protection for cryptographic keys. When the HSM is enabled, all signing operations (attestations, evidence seals, certificate signing) depend on the HSM being accessible. An unavailable HSM means no signing can occur, which blocks evidence generation, attestation creation, and release approvals. HSM connectivity issues can silently degrade to software-based signing if fallback is enabled, which may violate compliance requirements for FIPS 140-2 Level 3 or eIDAS qualified signatures.

## Common Causes
- PKCS#11 module path not configured in application settings
- Module file was moved or deleted from the configured path
- HSM software not installed (e.g., SoftHSM2 not installed for development)
- PKCS#11 module initialization failure (driver compatibility issues)
- No slots available in the HSM
- Permission denied accessing the PKCS#11 module or device
- Token not initialized in the configured slot
- Token login required but PIN not configured or incorrect

## How to Fix

### Docker Compose
```bash
# Verify HSM module is accessible
docker compose exec gateway ls -la /usr/lib/softhsm/libsofthsm2.so

# Initialize a token if needed (SoftHSM2 for development)
docker compose exec gateway softhsm2-util --init-token --slot 0 --label "stellaops" --pin 1234 --so-pin 0000

# List available slots
docker compose exec gateway softhsm2-util --show-slots

# Set environment variables
# Crypto__Hsm__Enabled=true
# Crypto__Hsm__ModulePath=/usr/lib/softhsm/libsofthsm2.so
# Crypto__Hsm__Pin=1234

docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Install SoftHSM2 (for development/testing)
sudo apt install softhsm2

# Configure PKCS#11 module path
stella crypto config set --hsm-module /usr/lib/softhsm/libsofthsm2.so

# Initialize token
softhsm2-util --init-token --slot 0 --label "stellaops" --pin 1234 --so-pin 0000

# List slots
softhsm2-util --show-slots

# Verify module permissions
ls -la /usr/lib/softhsm/libsofthsm2.so

# Configure token PIN
stella crypto config set --hsm-pin <your-pin>

# For Windows with SoftHSM2:
# stella crypto config set --hsm-module C:\SoftHSM2\lib\softhsm2.dll

# In appsettings.json:
# "Crypto": {
#   "Hsm": {
#     "Enabled": true,
#     "ModulePath": "/usr/lib/softhsm/libsofthsm2.so"
#   }
# }

sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
crypto:
  hsm:
    enabled: true
    modulePath: /usr/lib/softhsm/libsofthsm2.so
    pinSecret: stellaops-hsm-pin
    slotId: 0
```

```bash
# Create HSM PIN secret
kubectl create secret generic stellaops-hsm-pin \
  --from-literal=pin=<your-pin>

# For hardware HSMs, expose the device to the pod
# (e.g. a hostPath volume for the device node, or a vendor device plugin)

# Initialize token
kubectl exec deploy/stellaops-gateway -- softhsm2-util --init-token --slot 0 --label stellaops --pin 1234 --so-pin 0000

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.crypto.hsm
```

## Related Checks
- `check.crypto.fips` — FIPS 140-2 Level 3+ requires key storage in a validated HSM
- `check.crypto.eidas` — qualified eIDAS signatures may require HSM-backed keys
- `check.crypto.certchain` — the TLS certificate private key may reside in the HSM
- `check.compliance.attestation-signing` — attestation signing keys may be HSM-protected
112
docs/doctor/articles/crypto/sm.md
Normal file
@@ -0,0 +1,112 @@
---
checkId: check.crypto.sm
plugin: stellaops.doctor.crypto
severity: fail
tags: [crypto, sm2, sm3, sm4, china, compliance]
---
# SM2/SM3/SM4 Availability

## What It Checks
Verifies that Chinese national cryptographic algorithms (GM/T standards) are available for CN deployments. The check validates:

1. **OpenSSL version**: SM algorithms are natively supported in OpenSSL 1.1.1+. If the version is older, the check fails immediately.
2. **Algorithm availability**: tests each required algorithm:
   - **SM2**: Elliptic curve cryptography (signature, key exchange)
   - **SM3**: Cryptographic hash function (256-bit output)
   - **SM4**: Block cipher (128-bit blocks, 128-bit key)
3. **SM2 curve parameters**: verifies the SM2 elliptic curve is properly initialized.

| Condition | Result |
|---|---|
| OpenSSL < 1.1.1 and algorithms missing | Fail |
| Any SM algorithm unavailable | Fail |
| All algorithms available but SM2 curve cannot be verified | Warn |
| All algorithms available and SM2 curve verified | Pass |

Evidence collected: `CryptoProfile`, `OpenSslVersion`, `NativeSmSupport`, `AvailableAlgorithms`, `MissingAlgorithms`, `SM2CurveVerified`.

The check only runs when `Crypto:Profile` or `Cryptography:Profile` contains "sm", "china", or equals "cn".
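Whether the local OpenSSL carries SM3 can be probed in one line; stock OpenSSL 1.1.1+ builds enable it by default, and a missing algorithm makes `openssl dgst -sm3` fail outright. A minimal sketch:

```shell
# SM3 is a 256-bit hash, so a working build prints a 64-hex-char digest
digest=$(printf 'abc' | openssl dgst -sm3 | awk '{print $NF}')
echo "${#digest}"   # 64
```

The same one-liner run inside a container or pod distinguishes "OpenSSL too old" from "SM disabled in this build".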
## Why It Matters
Chinese regulatory requirements (GB/T standards) mandate the use of SM2, SM3, and SM4 algorithms for government systems, financial services, and critical infrastructure. Without SM algorithm support, the platform cannot create compliant digital signatures or encrypt data according to Chinese national standards. This blocks deployment in regulated Chinese environments and may violate the Cryptography Law of the People's Republic of China.

## Common Causes
- OpenSSL version too old (pre-1.1.1) to include native SM support
- Using LibreSSL instead of OpenSSL (lacks SM algorithm support)
- System OpenSSL not updated to a version with SM support
- OpenSSL compiled without SM algorithm support (custom builds)
- SM algorithms disabled in OpenSSL configuration
- SM2 curve not properly initialized in the crypto provider
- Missing external SM crypto provider (e.g., GmSSL)

## How to Fix

### Docker Compose
```bash
# Check OpenSSL version (must be 1.1.1+)
docker compose exec gateway openssl version

# Verify SM algorithm support
docker compose exec gateway openssl list -cipher-algorithms | grep -i sm
docker compose exec gateway openssl ecparam -list_curves | grep -i sm2

# Set crypto profile
# Crypto__Profile=cn

# If OpenSSL is too old, rebuild with a newer base image
# FROM ubuntu:22.04 (includes OpenSSL 3.0+)

docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Check current OpenSSL version
openssl version

# Update OpenSSL to 1.1.1+ if needed
sudo apt update && sudo apt install openssl

# Verify SM algorithm support
openssl list -cipher-algorithms | grep -i sm
openssl ecparam -list_curves | grep -i sm2

# Configure SM crypto profile
stella crypto profile set --profile cn

# Or use external SM provider (GmSSL)
stella crypto config set --sm-provider gmssl

# In appsettings.json:
# "Crypto": { "Profile": "cn" }

sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
crypto:
  profile: cn
  # Optionally specify external SM provider
  smProvider: native  # or "gmssl" for GmSSL
```

```bash
# Verify SM support in pod
kubectl exec deploy/stellaops-gateway -- openssl version
kubectl exec deploy/stellaops-gateway -- openssl ecparam -list_curves | grep -i sm2

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.crypto.sm
```

## Related Checks
- `check.crypto.certchain` — certificates in CN deployments should use SM2 signatures
- `check.crypto.gost` — GOST (Russian) is another regional crypto profile with similar structure
- `check.crypto.fips` — FIPS and SM are mutually exclusive regional crypto profiles
- `check.crypto.hsm` — SM keys may be stored in an HSM with SM algorithm support
94
docs/doctor/articles/docker/apiversion.md
Normal file
@@ -0,0 +1,94 @@
---
checkId: check.docker.apiversion
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, api, compatibility]
---
# Docker API Version

## What It Checks
Validates that the Docker API version meets minimum requirements for Stella Ops. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and queries the API version via `System.GetVersionAsync()`.

| API Version | Result |
|---|---|
| Below **1.41** | `warn` — below minimum required |
| At least **1.41** but below **1.43** | `warn` — below recommended |
| **1.43** or higher | `pass` |

The minimum API version 1.41 corresponds to Docker Engine 20.10+. The recommended version 1.43 corresponds to Docker Engine 23.0+.
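The table's thresholds reduce to a numeric comparison on the version components. A minimal shell sketch (the hard-coded `api` value is illustrative; in practice it would come from `docker version --format '{{.Server.APIVersion}}'`):

```shell
# Classify a Docker API version against the 1.41 minimum / 1.43 recommendation
api="1.41"
major=${api%%.*}
minor=${api##*.}
if [ "$major" -gt 1 ] || [ "$minor" -ge 43 ]; then
  echo "pass"
elif [ "$minor" -ge 41 ]; then
  echo "warn: below recommended"
else
  echo "warn: below minimum"
fi
```

For the sample value `1.41` this prints `warn: below recommended`, matching the middle row of the table.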
Evidence collected includes: API version, Docker version, minimum required version, recommended version, OS, build time, and git commit.

Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`

## Why It Matters
Stella Ops uses Docker API features for container management, image inspection, and network configuration. Older API versions may not support required features such as:

- BuildKit-based image builds (API 1.39+).
- Multi-platform image inspection (API 1.41+).
- Container resource management improvements (API 1.43+).

Running an outdated Docker version also means missing security patches and bug fixes.

## Common Causes
- Docker Engine is outdated (version < 20.10)
- Docker Engine is functional but below recommended version (< 23.0)
- Using a Docker-compatible runtime (Podman, containerd) that reports a lower API version
- Docker not updated after OS upgrade

## How to Fix

### Docker Compose
Update Docker Engine to the latest stable version:

```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

# RHEL/CentOS
sudo yum update docker-ce docker-ce-cli containerd.io

# Verify version
docker version
```

### Bare Metal / systemd
```bash
# Check current version
docker version

# Update Docker
curl -fsSL https://get.docker.com | sh

# Restart Docker
sudo systemctl restart docker

# Verify
docker version
```

### Kubernetes / Helm
Update the container runtime on cluster nodes. The method depends on your Kubernetes distribution:

```bash
# Check node runtime version
kubectl get nodes -o wide

# For kubeadm clusters, update containerd on each node
sudo apt-get update && sudo apt-get install containerd.io

# Verify
sudo crictl version
```

## Verification
```bash
stella doctor run --check check.docker.apiversion
```

## Related Checks
- `check.docker.daemon` — verifies the Docker daemon is running (prerequisite for the version check)
- `check.docker.socket` — verifies the Docker socket is accessible
124
docs/doctor/articles/docker/daemon.md
Normal file
@@ -0,0 +1,124 @@
---
checkId: check.docker.daemon
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, daemon, container]
---
# Docker Daemon

## What It Checks
Validates that the Docker daemon is running and responsive. The check connects to the Docker daemon (using `Docker:Host` configuration or the platform default) and performs two operations:

1. **Ping**: Sends a ping request to verify the daemon is alive (with a configurable timeout, default 10 seconds via `Docker:TimeoutSeconds`).
2. **Version**: Retrieves version information to confirm the daemon is fully operational.

Evidence collected on success: host address, Docker version, API version, OS, architecture, and kernel version.

On failure, the check distinguishes between:
- **DockerApiException**: The daemon is running but returned an error (reports status code and response body).
- **Connection failure**: Cannot connect to the daemon at all (Docker not installed, not running, or socket inaccessible).

Default Docker host:
- **Linux**: `unix:///var/run/docker.sock`
- **Windows**: `npipe://./pipe/docker_engine`

## Why It Matters
The Docker daemon is the core runtime for all Stella Ops containers. If the daemon is down:

- No containers can start, stop, or restart.
- Health checks for all containerized services fail.
- Image pulls and builds are impossible.
- Docker Compose operations fail entirely.
- The entire Stella Ops platform is offline in container-based deployments.

## Common Causes
- Docker daemon is not running or not accessible
- Docker is not installed on the host
- Docker service crashed or was stopped
- Docker daemon returned an error response (resource exhaustion, configuration error)
- Timeout connecting to the daemon (overloaded host, slow disk)

## How to Fix

### Docker Compose
Check and restart the Docker daemon:

```bash
# Check daemon status
sudo systemctl status docker

# Start the daemon
sudo systemctl start docker

# Enable auto-start on boot
sudo systemctl enable docker

# Verify
docker info
```

If Docker is not installed:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```

### Bare Metal / systemd
```bash
# Check status
sudo systemctl status docker

# View daemon logs
sudo journalctl -u docker --since "10 minutes ago"

# Restart the daemon
sudo systemctl restart docker

# Verify connectivity
docker version
docker info
```

If the daemon crashes repeatedly, check for resource exhaustion:
```bash
# Check disk space (Docker requires space for images/containers)
df -h /var/lib/docker

# Check memory
free -h

# Clean up Docker resources
docker system prune -a
```

### Kubernetes / Helm
On Kubernetes nodes, the container runtime (containerd/CRI-O) replaces the Docker daemon. Check the runtime:

```bash
# Check containerd status
sudo systemctl status containerd

# Check CRI-O status
sudo systemctl status crio

# Restart if needed
sudo systemctl restart containerd
```

For Docker Desktop (development):
```bash
# Restart Docker Desktop
# macOS: killall Docker && open -a Docker
# Windows: Restart-Service docker
```

## Verification
```bash
stella doctor run --check check.docker.daemon
```

## Related Checks
- `check.docker.socket` — verifies the Docker socket exists and has correct permissions
- `check.docker.apiversion` — verifies the Docker API version is compatible
- `check.docker.storage` — verifies Docker storage is healthy (requires running daemon)
- `check.docker.network` — verifies Docker networks are configured (requires running daemon)
104
docs/doctor/articles/docker/network.md
Normal file
@@ -0,0 +1,104 @@
---
checkId: check.docker.network
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, network, connectivity]
---
# Docker Network

## What It Checks
Validates Docker network configuration and connectivity. The check connects to the Docker daemon and lists all networks, then verifies:

1. **Required networks exist**: Checks that each network listed in the `Docker:RequiredNetworks` configuration is present. Defaults to `["bridge"]` if not configured.
2. **Bridge driver available**: Verifies at least one network using the `bridge` driver exists.

Evidence collected includes: total network count, available network drivers, found/missing required networks, and bridge network name.

If the Docker daemon is unreachable, the check is skipped.

## Why It Matters
Docker networks provide isolated communication channels between containers. Stella Ops services communicate over dedicated networks for:

- **Service-to-service communication**: Platform, Authority, Gateway, and other services need to reach each other.
- **Database access**: PostgreSQL and Valkey are on specific networks.
- **Network isolation**: Separating frontend, backend, and data tiers.

Missing networks cause container DNS resolution failures and connection refused errors between services.

## Common Causes
- Required network not found (not yet created or was deleted)
- No bridge network driver available (Docker networking misconfigured)
- Docker Compose network not created (compose project not started)
- Network name mismatch between configuration and actual Docker networks

## How to Fix

### Docker Compose
Docker Compose normally creates networks automatically. If missing:

```bash
# List existing networks
docker network ls

# Start compose to create networks
docker compose -f devops/compose/docker-compose.stella-ops.yml up -d

# Create a network manually if needed
docker network create stellaops-network

# Inspect a network
docker network inspect <network-name>
```

Configure required networks for the check:
```yaml
environment:
  Docker__RequiredNetworks__0: "stellaops-network"
  Docker__RequiredNetworks__1: "bridge"
```

### Bare Metal / systemd
For bare metal deployments, Docker networks must be created manually:

```bash
# Create required networks
docker network create --driver bridge stellaops-frontend
docker network create --driver bridge stellaops-backend
docker network create --driver bridge stellaops-data

# List networks
docker network ls

# Inspect network details
docker network inspect stellaops-backend
```

### Kubernetes / Helm
Docker networks are not used in Kubernetes; instead, Kubernetes networking (Services, NetworkPolicies) handles inter-pod communication. Configure the check to skip Docker network requirements:

```yaml
doctor:
  docker:
    requiredNetworks: [] # Not applicable in Kubernetes
```

Or verify Kubernetes networking:
```bash
# Check services
kubectl get svc -n stellaops

# Check network policies
kubectl get networkpolicy -n stellaops

# Test connectivity between pods
kubectl exec -it <pod-a> -- curl http://<service-b>:5000/health
```

## Verification
```
stella doctor run --check check.docker.network
```

## Related Checks
- `check.docker.daemon` — Docker daemon must be running to query networks
- `check.docker.socket` — Docker socket must be accessible to communicate with the daemon
125
docs/doctor/articles/docker/socket.md
Normal file
@@ -0,0 +1,125 @@
---
checkId: check.docker.socket
plugin: stellaops.doctor.docker
severity: fail
tags: [docker, socket, permissions]
---
# Docker Socket

## What It Checks
Validates that the Docker socket exists and is accessible with correct permissions. The check behavior differs by platform:

### Linux / Unix
Checks the Unix socket at the path extracted from `Docker:Host` (default: `/var/run/docker.sock`):

| Condition | Result |
|---|---|
| Socket does not exist + running inside a container | `pass` — socket mount is optional for most services |
| Socket does not exist + not inside a container | `warn` |
| Socket exists but not readable or writable | `warn` — insufficient permissions |
| Socket exists and is readable + writable | `pass` |

The check detects whether the process is running inside a Docker container by checking for `/.dockerenv` or `/proc/1/cgroup`. When running inside a container without a mounted socket, this is considered normal for services that don't need direct Docker access.
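The detection heuristic above can be reproduced in shell. This is an illustrative sketch, not the plugin's actual implementation; in particular, the exact cgroup keywords matched (`docker`, `containerd`, `kubepods`) are an assumption:

```shell
# Rough equivalent of the container-detection heuristic described above.
# Prints "yes" when the process is likely inside a container.
in_container() {
  if [ -f /.dockerenv ]; then
    echo "yes"
  elif grep -qE 'docker|containerd|kubepods' /proc/1/cgroup 2>/dev/null; then
    echo "yes"
  else
    echo "no"
  fi
}
in_container
```

Running this on the host should print `no`; inside most Docker containers it prints `yes` because `/.dockerenv` exists.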

### Windows
On Windows, the check verifies that the named pipe path is configured (default: `npipe://./pipe/docker_engine`). Actual connectivity testing is deferred to the daemon check, since named pipe access differs from Unix sockets.

Evidence collected includes: socket path, existence, readability, writability, and whether the process is running inside a container.

## Why It Matters
The Docker socket is the communication channel between clients (CLI, SDKs, Stella Ops services) and the Docker daemon. Without socket access:

- Docker CLI commands fail.
- Services that manage containers (scanner, job engine) cannot create or inspect containers.
- Docker Compose operations fail.
- Health checks that query Docker state cannot run.

Note that most Stella Ops services do NOT need direct Docker socket access. Only services that manage containers (e.g., scanner, job engine) require the socket to be mounted.

## Common Causes
- Docker socket not found at the expected path
- Docker not installed or daemon not running
- Insufficient permissions on the socket file (user not in `docker` group)
- Docker socket not mounted into the container (for containerized services that need it)
- SELinux or AppArmor blocking socket access

## How to Fix

### Docker Compose
Mount the Docker socket for services that need container management:

```yaml
services:
  scanner:
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  # Most services do NOT need the socket:
  platform:
    # No socket mount needed
```

Fix socket permissions on the host:
```bash
# Add your user to the docker group
sudo usermod -aG docker $USER

# Log out and back in, then verify
docker ps
```

### Bare Metal / systemd
```bash
# Check if Docker is installed
which docker

# Check socket existence
ls -la /var/run/docker.sock

# Check socket permissions
stat /var/run/docker.sock

# Add user to docker group
sudo usermod -aG docker $USER
logout # Must log out and back in

# If socket is missing, start Docker
sudo systemctl start docker

# Verify
docker ps
```

If SELinux is blocking access:
```bash
# Check SELinux denials
sudo ausearch -m avc -ts recent | grep docker

# Toggle a relevant SELinux boolean; tailor this to the specific denial
# (audit2allow can generate a custom policy module if needed)
sudo setsebool -P container_manage_cgroup on
```

### Kubernetes / Helm
In Kubernetes, the Docker socket is typically not available. Use the container runtime socket instead:

```yaml
# For containerd
volumes:
  - name: containerd-sock
    hostPath:
      path: /run/containerd/containerd.sock
      type: Socket
```

Most Stella Ops services should NOT mount any runtime socket in Kubernetes. Only the scanner or job engine may need it for container-in-container operations.

## Verification
```
stella doctor run --check check.docker.socket
```

## Related Checks
- `check.docker.daemon` — verifies the Docker daemon is running and responsive (uses the socket)
- `check.docker.apiversion` — verifies Docker API version compatibility (requires socket access)
- `check.docker.network` — verifies Docker networks (requires socket access)
- `check.docker.storage` — verifies Docker storage (requires socket access)
123
docs/doctor/articles/docker/storage.md
Normal file
@@ -0,0 +1,123 @@
---
checkId: check.docker.storage
plugin: stellaops.doctor.docker
severity: warn
tags: [docker, storage, disk]
---
# Docker Storage

## What It Checks
Validates the Docker storage driver and disk space usage. The check connects to the Docker daemon and retrieves system information, then inspects:

| Condition | Result |
|---|---|
| Storage driver is not `overlay2`, `btrfs`, or `zfs` | `warn` — non-recommended driver |
| Free disk space on the Docker root partition < **10 GB** (configurable via `Docker:MinFreeSpaceGb`) | `warn` |
| Disk usage > **85%** (configurable via `Docker:MaxStorageUsagePercent`) | `warn` |

The check reads the Docker root directory (typically `/var/lib/docker`) and queries drive info for that partition. On platforms where disk info is unavailable, the check still validates the storage driver.

Evidence collected includes: storage driver, Docker root directory, total space, free space, usage percentage, and whether the driver is recommended.
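The same disk evidence can be gathered by hand. This sketch assumes the default Docker root path; on hosts with Docker installed, `docker info --format '{{.DockerRootDir}}'` reports the actual value:

```shell
# Reproduce the check's disk-space evidence for the Docker root partition.
# /var/lib/docker is the common default; falls back to / if it is absent.
root_dir=/var/lib/docker
df -P "$root_dir" 2>/dev/null || df -P /
```

The `Use%` column of the output is the usage percentage the check compares against `Docker:MaxStorageUsagePercent`.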

## Why It Matters
Docker storage issues are a leading cause of container deployment failures:

- **Non-recommended storage drivers** (e.g., `vfs`, `devicemapper`) have performance and reliability problems. `overlay2` is the recommended driver for most workloads.
- **Low disk space** prevents image pulls, container creation, and volume writes. Docker images and layers consume significant space.
- **High disk usage** can cause container crashes, database corruption, and evidence write failures.

The Docker root directory often shares a partition with the OS, so storage exhaustion affects the entire host.

## Common Causes
- Storage driver is not overlay2, btrfs, or zfs (e.g., using legacy `devicemapper` or `vfs`)
- Low disk space on the Docker root partition (less than 10 GB free)
- Disk usage exceeds the 85% threshold
- Unused images, containers, and volumes consuming space
- Large build caches not pruned

## How to Fix

### Docker Compose
Check and clean Docker storage:

```bash
# Check disk usage
docker system df

# Detailed disk usage
docker system df -v

# Prune unused data (images, containers, networks, build cache)
docker system prune -a

# Prune volumes too (WARNING: removes data volumes)
docker system prune -a --volumes

# Check storage driver
docker info | grep "Storage Driver"
```

Configure storage thresholds:
```yaml
environment:
  Docker__MinFreeSpaceGb: "10"
  Docker__MaxStorageUsagePercent: "85"
```

### Bare Metal / systemd
Switch to the overlay2 storage driver if not already using it:

```bash
# Check current driver
docker info | grep "Storage Driver"

# Configure overlay2 in /etc/docker/daemon.json:
#   { "storage-driver": "overlay2" }

# Restart Docker (WARNING: may require re-pulling images)
sudo systemctl restart docker
```

Free up disk space:
```bash
# Find large Docker directories
du -sh /var/lib/docker/*

# Clean unused resources
docker system prune -a

# Set up automatic weekly cleanup via cron
# (note: this overwrites the existing root crontab)
echo "0 2 * * 0 docker system prune -f --filter 'until=168h'" | sudo crontab -
```

### Kubernetes / Helm
Monitor node disk usage:

```bash
# Check node disk pressure
kubectl describe node <node> | grep -A 5 "Conditions"

# Check for DiskPressure condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{"\n"}{end}{end}'
```

Configure kubelet garbage collection thresholds:
```yaml
# In the kubelet config
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```

## Verification
```
stella doctor run --check check.docker.storage
```

## Related Checks
- `check.core.env.diskspace` — checks general disk space (not Docker-specific)
- `check.docker.daemon` — daemon must be running to query storage info
84
docs/doctor/articles/environment/environment-capacity.md
Normal file
@@ -0,0 +1,84 @@
---
checkId: check.environment.capacity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, capacity, resources, cpu, memory, storage]
---
# Environment Capacity

## What It Checks
Queries the Release Orchestrator API (`/api/v1/environments/capacity`) and evaluates CPU, memory, storage, and deployment slot usage for every configured environment. Each resource is compared against two thresholds:
- **Warn** when usage >= 75%
- **Fail** when usage >= 90%

Deployment slot utilization is calculated as `activeDeployments / maxConcurrentDeployments * 100`. If no environments exist, the check passes with a note. If the orchestrator is unreachable, the check returns warn.
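The threshold logic reduces to simple percentage comparisons. A minimal shell sketch (illustrative names only, not the actual check implementation):

```shell
# classify <usage-percent>: maps a usage percentage to a check result,
# mirroring the warn/fail thresholds described above.
classify() {
  pct=$1
  if [ "$pct" -ge 90 ]; then
    echo "fail"
  elif [ "$pct" -ge 75 ]; then
    echo "warn"
  else
    echo "pass"
  fi
}

# Deployment slot utilization: activeDeployments / maxConcurrentDeployments * 100
slot_pct=$((9 * 100 / 10))   # 9 active of 10 max slots -> 90
classify "$slot_pct"          # prints: fail
classify 80                   # prints: warn
```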

## Why It Matters
Resource exhaustion in a target environment blocks deployments and can cause running services to crash or degrade. Detecting capacity pressure early gives operators time to scale up, clean up unused deployments, or redistribute workloads before an outage occurs. In production environments, exceeding 90% on any resource dimension is a leading indicator of imminent service disruption.

## Common Causes
- Gradual organic growth without corresponding resource scaling
- Runaway or leaked processes consuming CPU/memory
- Accumulated old deployments that were never cleaned up
- Resource limits set too tightly relative to the actual workload
- Unexpected traffic spike or batch job saturating storage

## How to Fix

### Docker Compose
```bash
# Check current resource usage on the host
docker stats --no-stream

# Increase resource limits in docker-compose.stella-ops.yml
# Edit the target service under deploy.resources.limits:
#   cpus: '4.0'
#   memory: 8G

# Remove stopped containers to free deployment slots
docker container prune -f

# Restart with updated limits
docker compose -f docker-compose.stella-ops.yml up -d
```

### Bare Metal / systemd
```bash
# Check system resource usage
free -h && df -h && top -bn1 | head -20

# Increase memory/CPU limits in systemd unit overrides
sudo systemctl edit stellaops-environment-agent.service
# Add under [Service]:
#   MemoryMax=8G
#   CPUQuota=400%

sudo systemctl daemon-reload && sudo systemctl restart stellaops-environment-agent.service

# Clean up old deployments
stella env cleanup <environment-name>
```

### Kubernetes / Helm
```bash
# Check node resource usage
kubectl top nodes
kubectl top pods -n stellaops

# Scale up resources via Helm values
helm upgrade stellaops stellaops/stellaops \
  --set environments.resources.limits.cpu=4 \
  --set environments.resources.limits.memory=8Gi \
  --set environments.maxConcurrentDeployments=20

# Or add more nodes to the cluster for horizontal scaling
```

## Verification
```bash
stella doctor run --check check.environment.capacity
```

## Related Checks
- `check.environment.deployments` - checks deployed service health, which may degrade under capacity pressure
- `check.environment.connectivity` - verifies agents are reachable, which capacity exhaustion can prevent
98
docs/doctor/articles/environment/environment-connectivity.md
Normal file
@@ -0,0 +1,98 @@
---
checkId: check.environment.connectivity
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, connectivity, agent, network]
---
# Environment Connectivity

## What It Checks
Retrieves the list of environments from the Release Orchestrator (`/api/v1/environments`), then probes each environment agent's `/health` endpoint. For each agent the check measures:
- **Reachability** -- whether the health endpoint returns a success status code
- **Latency** -- warns if the response takes more than 500ms
- **TLS certificate validity** -- warns if the agent's TLS certificate expires within 30 days
- **Authentication** -- detects 401/403 responses indicating credential issues

If any agent is unreachable, the check fails. High latency or an expiring certificate produces a warn.
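The 30-day expiry rule can be verified by hand with `openssl x509 -checkend` (30 days = 2592000 seconds). This sketch generates a throwaway certificate to demonstrate; against a live agent you would pipe the certificate from `openssl s_client -connect <agent-host>:<agent-port>` instead:

```shell
# Demonstrate the expiry test on a locally generated certificate.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-agent" \
  -keyout /tmp/agent-key.pem -out /tmp/agent-cert.pem -days 365 2>/dev/null

# Print the expiry date
openssl x509 -in /tmp/agent-cert.pem -noout -enddate

# -checkend exits 0 if the cert is still valid 2592000 s (30 days) from now
if openssl x509 -in /tmp/agent-cert.pem -noout -checkend 2592000; then
  echo "certificate ok"
else
  echo "certificate expires within 30 days"
fi
```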

## Why It Matters
Environment agents are the control surface through which Stella Ops manages deployments, collects telemetry, and enforces policy. An unreachable agent means the platform cannot deploy to, monitor, or roll back services in that environment. TLS certificate expiry causes hard connectivity failures with no graceful degradation. High latency slows deployment pipelines and can cause timeouts in approval workflows.

## Common Causes
- Environment agent service is stopped or crashed
- Firewall rule change blocking the agent port
- Network partition between the Stella Ops control plane and the target environment
- TLS certificate not renewed before expiry
- Agent authentication credentials rotated without updating Stella Ops configuration
- DNS resolution failure for the agent hostname

## How to Fix

### Docker Compose
```bash
# Check if the environment agent container is running
docker ps --filter "name=environment-agent"

# View agent logs for errors
docker logs stellaops-environment-agent --tail 100

# Restart the agent
docker compose -f docker-compose.stella-ops.yml restart environment-agent

# If the TLS cert is expiring, replace the certificate files
# mounted into the agent container and restart
cp /path/to/new/cert.pem devops/compose/certs/agent.pem
cp /path/to/new/key.pem devops/compose/certs/agent-key.pem
docker compose -f docker-compose.stella-ops.yml restart environment-agent
```

### Bare Metal / systemd
```bash
# Check agent service status
sudo systemctl status stellaops-environment-agent

# View logs
sudo journalctl -u stellaops-environment-agent --since "1 hour ago"

# Restart agent
sudo systemctl restart stellaops-environment-agent

# Renew TLS certificate
sudo cp /path/to/new/cert.pem /etc/stellaops/certs/agent.pem
sudo cp /path/to/new/key.pem /etc/stellaops/certs/agent-key.pem
sudo systemctl restart stellaops-environment-agent

# Test network connectivity from the control plane
curl -v https://<agent-host>:<agent-port>/health
```

### Kubernetes / Helm
```bash
# Check agent pod status
kubectl get pods -n stellaops -l app=environment-agent

# View agent logs
kubectl logs -n stellaops -l app=environment-agent --tail=100

# Restart agent pods
kubectl rollout restart deployment/environment-agent -n stellaops

# Renew TLS certificate via cert-manager or a manual secret update
kubectl create secret tls agent-tls \
  --cert=/path/to/cert.pem \
  --key=/path/to/key.pem \
  -n stellaops --dry-run=client -o yaml | kubectl apply -f -

# Check network policies
kubectl get networkpolicies -n stellaops
```

## Verification
```bash
stella doctor run --check check.environment.connectivity
```

## Related Checks
- `check.environment.deployments` - checks health of services deployed via agents
- `check.environment.network.policy` - verifies network policies that may block agent connectivity
- `check.environment.secrets` - agent credentials may need rotation
@@ -0,0 +1,90 @@
---
checkId: check.environment.deployments
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, deployment, services, health]
---
# Environment Deployment Health

## What It Checks
Queries the Release Orchestrator (`/api/v1/environments/deployments`) for all deployed services across all environments. Each service is evaluated for:
- **Status** -- `failed`, `stopped`, `degraded`, or healthy
- **Replica health** -- compares `healthyReplicas` against total `replicas`; partial health triggers degraded status

Severity escalation:
- **Fail** if any production service has status `failed` (production is detected by the environment name containing "prod")
- **Fail** if any non-production service has status `failed`
- **Warn** if services are `degraded` (partial replica health)
- **Warn** if services are `stopped`
- **Pass** if all services are healthy

## Why It Matters
Failed services in production directly impact end users and violate SLA commitments. Degraded services with partial replica health reduce fault tolerance and can cascade into full outages under load. Stopped services may indicate incomplete deployments or maintenance windows that were never closed. This check provides the earliest signal that a deployment rollout needs intervention.

## Common Causes
- Service crashed due to an unhandled exception or OOM kill
- Deployment rolled out a bad image version
- A dependency (database, cache, message broker) became unavailable
- Resource exhaustion preventing replicas from starting
- Health check endpoint misconfigured, causing false failures
- Node failure taking down co-located replicas

## How to Fix

### Docker Compose
```bash
# Identify failed containers
docker ps -a --filter "status=exited" --filter "status=dead"

# View logs for the failed service
docker logs <container-name> --tail 200

# Restart the failed service
docker compose -f docker-compose.stella-ops.yml restart <service-name>

# If the image is bad, roll back to the previous version:
# edit docker-compose.stella-ops.yml to pin the previous image tag, then
docker compose -f docker-compose.stella-ops.yml up -d <service-name>
```

### Bare Metal / systemd
```bash
# Check service status
sudo systemctl status stellaops-<service-name>

# View logs for crash details
sudo journalctl -u stellaops-<service-name> --since "30 minutes ago" --no-pager

# Restart the service
sudo systemctl restart stellaops-<service-name>

# Roll back to the previous binary
sudo cp /opt/stellaops/backup/<service-name> /opt/stellaops/bin/<service-name>
sudo systemctl restart stellaops-<service-name>
```

### Kubernetes / Helm
```bash
# Check pod status across environments
kubectl get pods -n stellaops-<env> --field-selector=status.phase!=Running

# View events and logs for failing pods
kubectl describe pod <pod-name> -n stellaops-<env>
kubectl logs <pod-name> -n stellaops-<env> --previous

# Roll back a deployment
kubectl rollout undo deployment/<service-name> -n stellaops-<env>

# Or via Helm
helm rollback stellaops <previous-revision> -n stellaops-<env>
```

## Verification
```bash
stella doctor run --check check.environment.deployments
```

## Related Checks
- `check.environment.capacity` - resource exhaustion can cause deployment failures
- `check.environment.connectivity` - agent must be reachable to report deployment health
- `check.environment.drift` - configuration drift can cause services to fail after redeployment
86
docs/doctor/articles/environment/environment-drift.md
Normal file
@@ -0,0 +1,86 @@
---
checkId: check.environment.drift
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, drift, configuration, consistency]
---
# Environment Drift Detection

## What It Checks
Queries the Release Orchestrator drift report API (`/api/v1/environments/drift`) and compares configuration snapshots across environments. The check requires at least two environments to perform a comparison. Each drift item carries a severity classification:
- **Fail** if any drift is classified as `critical` (e.g., security-relevant configuration differences between staging and production)
- **Warn** if drifts exist but none are critical
- **Pass** if no configuration drift is detected between environments

Evidence includes the specific configuration keys that drifted and which environments are affected.

## Why It Matters
Configuration drift between environments undermines the core promise of promotion-based releases: that what you test in staging is what runs in production. Drift can cause subtle behavioral differences that only manifest under production load, making bugs nearly impossible to reproduce. Critical drift in security-related configuration (TLS settings, authentication, network policies) can create compliance violations and security exposures.

## Common Causes
- Manual configuration changes applied directly to one environment (bypassing the release pipeline)
- Failed deployment that left partial configuration in one environment
- Configuration sync job that did not propagate to all environments
- Environment restored from an outdated backup
- Intentional per-environment overrides that were not tracked as accepted exceptions

## How to Fix

### Docker Compose
```bash
# View the current drift report
stella env drift show

# Compare specific configuration between environments
diff <(docker exec stellaops-staging cat /app/appsettings.json) \
     <(docker exec stellaops-prod cat /app/appsettings.json)

# Reconcile by redeploying from the canonical source
docker compose -f docker-compose.stella-ops.yml up -d --force-recreate <service>

# If drift is intentional, mark it as accepted
stella env drift accept <config-key>
```

### Bare Metal / systemd
```bash
# View drift report
stella env drift show

# Compare config files between environments
diff /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json

# Reconcile by copying from the source of truth
sudo cp /etc/stellaops/staging/appsettings.json /etc/stellaops/prod/appsettings.json
sudo systemctl restart stellaops-<service>

# Or accept drift as intentional
stella env drift accept <config-key>
```

### Kubernetes / Helm
```bash
# View drift between environments
stella env drift show

# Compare Helm values between environments
diff <(helm get values stellaops -n stellaops-staging -o yaml) \
     <(helm get values stellaops -n stellaops-prod -o yaml)

# Reconcile by redeploying with consistent values
helm upgrade stellaops stellaops/stellaops -n stellaops-prod \
  -f values-prod.yaml

# Compare ConfigMaps
kubectl diff -f configmap.yaml -n stellaops-prod
```

## Verification
```bash
stella doctor run --check check.environment.drift
```

## Related Checks
- `check.environment.deployments` - drift can cause service failures after redeployment
- `check.environment.secrets` - secret configuration differences between environments
- `check.environment.network.policy` - network policy drift is a security concern
107
docs/doctor/articles/environment/environment-network-policy.md
Normal file
@@ -0,0 +1,107 @@
---
checkId: check.environment.network.policy
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, network, policy, security, isolation]
---
# Environment Network Policy

## What It Checks
Retrieves network policies from the Release Orchestrator (`/api/v1/environments/network-policies`) and evaluates the isolation posture of each environment. The check enforces these rules:
- **Production environments must not allow ingress from dev** -- detected as a critical violation
- **Production environments should use default-deny policies** -- missing default-deny is a warning
- **No environment should have wildcard ingress** (`*` or `0.0.0.0/0`) -- critical for production, warning for others
- **Wildcard egress** (`*` or `0.0.0.0/0`) is flagged as informational

Severity:
- **Fail** if any critical violations exist (prod ingress from dev, wildcard ingress on prod)
- **Warn** if only warning-level violations exist (missing default-deny, wildcard ingress on non-prod)
- **Warn** if no network policies are configured at all
- **Pass** if all policies are correctly configured
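The wildcard test above amounts to matching a source specifier against `*` and `0.0.0.0/0`. A minimal sketch (illustrative only, not the plugin's implementation):

```shell
# is_wildcard <source-spec>: mirrors the wildcard-ingress test described above.
is_wildcard() {
  case "$1" in
    '*'|0.0.0.0/0) echo "yes" ;;
    *)             echo "no"  ;;
  esac
}
is_wildcard "0.0.0.0/0"   # prints: yes
is_wildcard "10.0.0.0/8"  # prints: no
```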

## Why It Matters
Network isolation between environments is a fundamental security control. Allowing dev-to-production ingress means compromised development infrastructure can directly attack production services. Missing default-deny policies mean any new service added to the environment is implicitly network-accessible. Wildcard ingress exposes services to the entire network or internet. These misconfigurations are common audit findings that can block compliance certifications.

## Common Causes
- Network policies not yet defined for a new environment
- Legacy policy left in place from initial setup
- Production policy copied from dev without tightening rules
- Manual firewall rule change not reflected in Stella Ops policy
- Policy update deployed to staging but not promoted to production

## How to Fix

### Docker Compose
```bash
# Review current network policies
stella env network-policy list

# Create a default-deny policy for production
stella env network-policy create prod --default-deny

# Allow only staging ingress to production
stella env network-policy update prod --default-deny --allow-from staging

# Restrict egress to specific destinations
stella env network-policy update prod --egress-allow "10.0.0.0/8,registry.internal"

# In Docker Compose, use network isolation
# Define separate networks in docker-compose.stella-ops.yml:
# networks:
#   prod-internal:
#     internal: true
#   staging-internal:
#     internal: true
```

### Bare Metal / systemd
```bash
# Review current iptables/nftables rules
sudo iptables -L -n -v
# or
sudo nft list ruleset

# Apply default-deny for production network interface
sudo iptables -A INPUT -i prod0 -j DROP
sudo iptables -I INPUT -i prod0 -s <staging-subnet> -j ACCEPT

# Or configure via stellaops policy
stella env network-policy update prod --default-deny --allow-from staging

# Persist firewall rules
sudo netfilter-persistent save
```

### Kubernetes / Helm
```bash
# Review existing network policies
kubectl get networkpolicies -n stellaops-prod

# Apply default-deny via Helm
helm upgrade stellaops stellaops/stellaops \
  --set environments.prod.networkPolicy.defaultDeny=true \
  --set environments.prod.networkPolicy.allowFrom[0]=stellaops-staging

# Or apply a NetworkPolicy manifest directly
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: stellaops-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
```

## Verification
```bash
stella doctor run --check check.environment.network.policy
```

## Related Checks
- `check.environment.connectivity` - network policies can block agent connectivity if misconfigured
- `check.environment.drift` - network policy differences between environments are a form of drift
- `check.environment.secrets` - network isolation protects secret transmission
@@ -0,0 +1,94 @@
---
checkId: check.environment.secrets
plugin: stellaops.doctor.environment
severity: warn
tags: [environment, secrets, security, rotation, expiry]
---
# Environment Secret Health

## What It Checks
Queries the Release Orchestrator secrets status API (`/api/v1/environments/secrets/status`) for metadata about all configured secrets (no actual secret values are retrieved). Each secret is evaluated for:
- **Expiry** -- secrets already expired, expiring within 7 days (critical), or expiring within 30 days (warning)
- **Rotation compliance** -- if a rotation policy is defined, checks whether `lastRotated` exceeds the policy interval by more than 10% grace

Severity escalation:
- **Fail** if any secret has expired, or if any production secret is expiring within 7 days
- **Warn** if secrets are expiring within 30 days or rotation is overdue
- **Pass** if all secrets are healthy
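The escalation rules can be sketched as a classifier over days-until-expiry. This is a hedged illustration: the 7/30-day thresholds come from the check description, while the `production` environment label and the days-based interface are assumptions.

```shell
# Classify one secret's expiry status; negative days_left means already expired.
classify_secret_expiry() {
  local env="$1" days_left="$2"
  if [ "$days_left" -lt 0 ]; then
    echo fail                      # expired secrets always fail
  elif [ "$env" = "production" ] && [ "$days_left" -le 7 ]; then
    echo fail                      # production secret expiring imminently
  elif [ "$days_left" -le 30 ]; then
    echo warn                      # anything expiring within 30 days
  else
    echo ok
  fi
}

classify_secret_expiry production 3    # fail
classify_secret_expiry staging 3       # warn
```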

## Why It Matters
Expired secrets cause immediate authentication and authorization failures. Services that depend on expired credentials will fail to connect to databases, registries, external APIs, and other integrations. In production, this means outages. Secrets expiring within 7 days require urgent rotation to prevent imminent failures. Overdue rotation violates security policies and increases the blast radius of a credential compromise.

## Common Causes
- Secret expired without automated rotation being configured
- Rotation job failed silently (scheduler down, permissions changed)
- Secret provider (Vault, Key Vault) connection lost during rotation window
- Manual secret set with fixed expiry and no follow-up rotation
- Rotation policy interval shorter than actual rotation cadence

## How to Fix

### Docker Compose
```bash
# List secrets with expiry status
stella env secrets list --expiring

# Rotate an expired or expiring secret immediately
stella env secrets rotate <environment> <secret-name>

# Check secret provider connectivity
stella secrets provider status

# Update secret in .env file for compose deployments
# Edit devops/compose/.env with the new secret value
# Then restart affected services
docker compose -f docker-compose.stella-ops.yml restart <service>
```

### Bare Metal / systemd
```bash
# List secrets with expiry details
stella env secrets list --expiring

# Rotate expired secret
stella env secrets rotate <environment> <secret-name>

# If using file-based secrets, update the file
sudo vi /etc/stellaops/secrets/<secret-name>
sudo chmod 600 /etc/stellaops/secrets/<secret-name>
sudo systemctl restart stellaops-<service>

# Schedule automated rotation
stella env secrets rotate-scheduled --days 7
```

### Kubernetes / Helm
```bash
# List expiring secrets
stella env secrets list --expiring

# Rotate secret and update Kubernetes secret
stella env secrets rotate <environment> <secret-name>

# Or update manually
kubectl create secret generic <secret-name> \
  --from-literal=value=<new-value> \
  -n stellaops-<env> --dry-run=client -o yaml | kubectl apply -f -

# Restart pods to pick up new secret
kubectl rollout restart deployment/<service> -n stellaops-<env>

# For external-secrets-operator, trigger a refresh
kubectl annotate externalsecret <name> -n stellaops force-sync=$(date +%s)
```

## Verification
```bash
stella doctor run --check check.environment.secrets
```

## Related Checks
- `check.environment.connectivity` - expired agent credentials cause connectivity failures
- `check.environment.deployments` - services fail when their secrets expire
- `check.integration.secrets.manager` - verifies the secrets manager itself is healthy

docs/doctor/articles/evidence-locker/index.md
@@ -0,0 +1,120 @@
---
checkId: check.evidencelocker.index
plugin: stellaops.doctor.evidencelocker
severity: warn
tags: [evidence, index, consistency]
---
# Evidence Index Consistency

## What It Checks
Verifies that the evidence index is consistent with the artifacts stored on disk. The check operates on the local evidence locker path (`EvidenceLocker:Path`) and performs:

1. **Index existence**: looks for `index.json` or an `index/` directory at the locker root.
2. **Artifact counting**: counts `.json` files across five artifact directories: `attestations/`, `sboms/`, `vex/`, `verdicts/`, `provenance/`.
3. **Cross-reference validation**: for each entry in `index.json`, verifies the referenced artifact file exists on disk. Records any artifacts that are indexed but missing from disk.
4. **Drift detection**: compares the total indexed count against the total disk artifact count. Flags a warning if drift exceeds 10% of total artifacts.

| Condition | Result |
|---|---|
| Evidence locker path not configured or missing | Skip |
| Index file and index directory both missing | Warn |
| Artifacts indexed but missing from disk | Fail |
| Index count drifts > 10% from disk count | Warn |
| Index consistent with disk artifacts | Pass |

Evidence collected: `IndexedCount`, `DiskArtifactCount`, `MissingFromDisk`, `MissingSamples`, `Drift`, per-directory counts (`attestationsCount`, `sbomsCount`, `vexCount`, `verdictsCount`, `provenanceCount`).

The check only runs when `EvidenceLocker:Path` is configured and the directory exists.
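Steps 2 and 4 amount to a count comparison. A minimal sketch, with one stated assumption: the check description does not say which denominator the drift percentage uses, so this takes it against the larger of the two counts.

```shell
# Count artifact files the way step 2 describes.
count_disk_artifacts() {
  local root="$1" total=0 dir
  for dir in attestations sboms vex verdicts provenance; do
    [ -d "$root/$dir" ] || continue
    total=$(( total + $(find "$root/$dir" -name '*.json' | wc -l) ))
  done
  echo "$total"
}

# Drift as a percentage; the denominator choice is an assumption (see above).
drift_pct() {
  local indexed="$1" on_disk="$2"
  local total=$(( indexed > on_disk ? indexed : on_disk ))
  [ "$total" -eq 0 ] && { echo 0; return; }
  local diff=$(( indexed - on_disk ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  echo $(( diff * 100 / total ))       # the check warns when this exceeds 10
}
```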

## Why It Matters
The evidence index provides fast lookup for attestations, SBOMs, VEX documents, and provenance records. An inconsistent index means queries may return stale references to deleted artifacts (causing retrieval errors) or miss artifacts that exist on disk (causing incomplete audit reports). Index drift accumulates over time and degrades the reliability of evidence searches, compliance exports, and release verification lookups.

## Common Causes
- Index never created (evidence locker not initialized)
- Index file was deleted or corrupted
- Artifacts deleted without updating the index (manual cleanup)
- Disk corruption causing artifact loss
- Background indexer not running or crashed
- Race condition during concurrent writes
- Incomplete cleanup operations removing files but not index entries

## How to Fix

### Docker Compose
```bash
# Check index status
docker compose exec evidence-locker ls -la /data/evidence/index.json

# Rebuild evidence index
docker compose exec evidence-locker stella evidence index rebuild

# Fix orphaned index entries
docker compose exec evidence-locker stella evidence index rebuild --fix-orphans

# Verify evidence integrity after rebuild
docker compose exec evidence-locker stella evidence verify --all

# Refresh index (less aggressive than rebuild)
docker compose exec evidence-locker stella evidence index refresh

# Check disk health
docker compose exec evidence-locker df -h /data/evidence
```

### Bare Metal / systemd
```bash
# Check index file
ls -la /var/lib/stellaops/evidence/index.json

# Rebuild evidence index
stella evidence index rebuild

# Fix orphaned entries
stella evidence index rebuild --fix-orphans

# Refresh index
stella evidence index refresh

# Check for disk errors
sudo fsck -n /dev/sda1

# Verify evidence integrity
stella evidence verify --all

sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm
```bash
# Check index in pod
kubectl exec deploy/stellaops-evidence-locker -- ls -la /data/evidence/index.json

# Rebuild index
kubectl exec deploy/stellaops-evidence-locker -- stella evidence index rebuild --fix-orphans

# Verify evidence
kubectl exec deploy/stellaops-evidence-locker -- stella evidence verify --all

# Check persistent volume health
kubectl describe pvc stellaops-evidence-data
```

```yaml
# values.yaml - enable background indexer
evidenceLocker:
  indexer:
    enabled: true
    intervalMinutes: 15
    repairOnDrift: true
```

## Verification
```bash
stella doctor run --check check.evidencelocker.index
```

## Related Checks
- `check.evidencelocker.retrieval` — retrieval depends on index accuracy for lookups
- `check.evidencelocker.provenance` — provenance records are one of the indexed artifact types
- `check.evidencelocker.merkle` — Merkle anchors reference indexed artifacts
- `check.compliance.evidence-integrity` — evidence integrity includes index consistency

docs/doctor/articles/evidence-locker/merkle.md
@@ -0,0 +1,122 @@
---
checkId: check.evidencelocker.merkle
plugin: stellaops.doctor.evidencelocker
severity: fail
tags: [evidence, merkle, anchoring, integrity]
---
# Merkle Anchor Verification

## What It Checks
Verifies Merkle root anchoring integrity when anchoring is enabled. The check operates on anchor records stored in the `anchors/` subdirectory of the evidence locker path. It validates:

1. **Anchor record presence**: checks for `.json` anchor files in the anchors directory.
2. **Anchor structural validity**: each anchor must contain `merkleRoot`, `timestamp`, and `signature` fields with non-empty values.
3. **Anchor integrity**: validates the most recent 5 anchor records for structural completeness and signature presence.
4. **Anchor freshness**: compares the latest anchor timestamp against the configured interval (`EvidenceLocker:Anchoring:IntervalHours`, default 24). Warns if the latest anchor is older than 2x the configured interval.

| Condition | Result |
|---|---|
| Anchoring not enabled | Skip |
| Evidence locker path not configured | Skip |
| No anchor records found | Warn |
| Any anchor records invalid (missing fields, corrupt) | Fail |
| Latest anchor older than 2x configured interval | Warn |
| All checked anchors valid and fresh | Pass |

Evidence collected: `CheckedCount`, `ValidCount`, `InvalidCount`, `InvalidAnchors`, `LatestAnchorTime`, `AnchorAgeHours`, `ExpectedIntervalHours`, `LatestRoot`.

The check only runs when `EvidenceLocker:Anchoring:Enabled` is set to "true".
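The structural check in step 2 can be sketched with plain `grep`. This is a hedged illustration that assumes single-line JSON anchor records; only the three field names come from the check description.

```shell
# Pass only when merkleRoot, timestamp, and signature are present and non-empty.
validate_anchor() {
  local f="$1" field
  for field in merkleRoot timestamp signature; do
    # Require the field followed by a quoted value with at least one character.
    if ! grep -q "\"$field\" *: *\"[^\"]" "$f"; then
      echo "INVALID $f (missing or empty $field)"
      return 1
    fi
  done
  echo "VALID $f"
}
```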

## Why It Matters
Merkle anchoring provides cryptographic proof that evidence has not been tampered with since the anchor was created. Each anchor captures the Merkle root hash of all evidence at a point in time, creating an immutable checkpoint. Invalid anchors mean the integrity chain is broken and evidence cannot be independently verified against its anchor. Stale anchors indicate the anchoring job has stopped running, creating a window where evidence changes are not captured by any checkpoint.

## Common Causes
- Anchoring job not run yet (new deployment)
- Anchoring job scheduler not running or misconfigured
- Anchor record corrupted on disk
- Evidence modified or tampered with after anchoring, causing a Merkle root mismatch
- Anchors directory deleted during maintenance
- Anchor creation failing silently (check job logs)

## How to Fix

### Docker Compose
```bash
# Create an initial anchor
docker compose exec evidence-locker stella evidence anchor create

# Check anchor job status
docker compose exec evidence-locker stella evidence anchor status

# Audit anchor integrity
docker compose exec evidence-locker stella evidence anchor audit --full

# Verify a specific anchor
docker compose exec evidence-locker stella evidence anchor verify <anchor-filename>

# Enable anchoring in configuration
# EvidenceLocker__Anchoring__Enabled=true
# EvidenceLocker__Anchoring__IntervalHours=24

docker compose restart evidence-locker
```

### Bare Metal / systemd
```bash
# Create an anchor
stella evidence anchor create

# Check anchor status
stella evidence anchor status

# Full anchor audit
stella evidence anchor audit --full

# Configure anchoring in appsettings.json
# "EvidenceLocker": {
#   "Anchoring": {
#     "Enabled": true,
#     "IntervalHours": 24
#   }
# }

# Verify the anchor job is scheduled
stella jobs list --filter anchor

sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm
```yaml
# values.yaml
evidenceLocker:
  anchoring:
    enabled: true
    intervalHours: 24
    schedule: "0 */24 * * *" # Every 24 hours
```

```bash
# Create initial anchor
kubectl exec deploy/stellaops-evidence-locker -- stella evidence anchor create

# Check anchor status
kubectl exec deploy/stellaops-evidence-locker -- stella evidence anchor status

# Audit anchors
kubectl exec deploy/stellaops-evidence-locker -- stella evidence anchor audit --full

helm upgrade stellaops ./charts/stellaops -f values.yaml
```

## Verification
```bash
stella doctor run --check check.evidencelocker.merkle
```

## Related Checks
- `check.evidencelocker.provenance` — provenance chain integrity complements Merkle anchoring
- `check.evidencelocker.index` — index consistency ensures anchored artifacts are still present
- `check.evidencelocker.retrieval` — retrieval health required to validate anchored artifacts
- `check.compliance.evidence-integrity` — evidence integrity is the broader check that includes anchoring

docs/doctor/articles/evidence-locker/provenance.md
@@ -0,0 +1,116 @@
---
checkId: check.evidencelocker.provenance
plugin: stellaops.doctor.evidencelocker
severity: fail
tags: [evidence, provenance, integrity, chain]
---
# Provenance Chain Integrity

## What It Checks
Validates provenance chain integrity using random sample verification. The check operates on provenance records stored in the `provenance/` subdirectory of the evidence locker path. It performs:

1. **Random sampling**: selects up to 5 random provenance records from the pool for validation (configurable via `SampleSize` constant).
2. **Hash verification**: for each sampled record, reads the `contentHash` and `payload` fields, recomputes the SHA-256 hash of the payload, and compares it against the declared hash. Supports `sha256:` prefixed hash values.
3. **Structural validation**: verifies that each record contains required `contentHash` and `payload` fields.

| Condition | Result |
|---|---|
| Evidence locker path not configured | Skip |
| No provenance directory or no records | Pass (nothing to verify) |
| Any sampled records fail hash verification | Fail |
| All sampled records pass hash verification | Pass |

Evidence collected: `TotalRecords`, `SamplesChecked`, `ValidCount`, `InvalidCount`, `InvalidRecords`.

The check only runs when `EvidenceLocker:Path` is configured and the directory exists.
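Step 2 can be sketched end to end for a single record. This is a hedged illustration: the naive `sed`-based field extraction assumes single-line JSON with plain string values, and the exact payload serialization the real check hashes is not documented here.

```shell
# Recompute the payload hash and compare it with the declared contentHash.
verify_record() {
  local file="$1" declared payload actual
  declared=$(sed -n 's/.*"contentHash" *: *"\([^"]*\)".*/\1/p' "$file")
  payload=$(sed -n 's/.*"payload" *: *"\([^"]*\)".*/\1/p' "$file")
  if [ -z "$declared" ] || [ -z "$payload" ]; then
    echo "INVALID $file (missing contentHash or payload)"   # structural failure
    return 1
  fi
  declared=${declared#sha256:}                # tolerate the sha256: prefix
  actual=$(printf '%s' "$payload" | sha256sum | cut -d' ' -f1)
  if [ "$declared" = "$actual" ]; then
    echo "PASS $file"
  else
    echo "FAIL $file"
  fi
}
```

A hash format mismatch (the "Common Causes" list below) shows up here as the prefix-stripping step: without it, an otherwise-valid `sha256:`-prefixed record would fail comparison.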

## Why It Matters
Provenance records link each software artifact to its build source, build system, and build steps. The content hash ensures that the provenance payload has not been modified since it was created. A broken hash indicates the provenance record was corrupted or tampered with, which invalidates the supply-chain integrity guarantee for the associated release. Even a single invalid provenance record undermines trust in the entire provenance chain and should be investigated as a potential security incident.

## Common Causes
- Provenance record corrupted on disk (storage errors, incomplete writes)
- Hash verification failure after accidental file modification
- Chain link broken due to missing predecessor records
- Data tampered or modified by unauthorized access
- Hash format mismatch (missing or extra `sha256:` prefix)
- Character encoding differences during payload serialization

## How to Fix

### Docker Compose
```bash
# Run full provenance audit
docker compose exec evidence-locker stella evidence audit --type provenance --full

# Check specific invalid records
docker compose exec evidence-locker stella evidence verify --id <record-id>

# Review evidence locker integrity
docker compose exec evidence-locker stella evidence integrity-check

# Check for storage errors
docker compose exec evidence-locker dmesg | grep -i error

# Check disk health
docker compose exec evidence-locker df -h /data/evidence/provenance/
```

### Bare Metal / systemd
```bash
# Full provenance audit
stella evidence audit --type provenance --full

# Verify specific records
stella evidence verify --id <record-id>

# Full integrity check
stella evidence integrity-check

# Check filesystem health
sudo fsck -n /dev/sda1

# Check for disk I/O errors
dmesg | grep -i "i/o error"

# List provenance records
ls -la /var/lib/stellaops/evidence/provenance/
```

### Kubernetes / Helm
```bash
# Full provenance audit
kubectl exec deploy/stellaops-evidence-locker -- stella evidence audit --type provenance --full

# Verify specific record
kubectl exec deploy/stellaops-evidence-locker -- stella evidence verify --id <record-id>

# Integrity check
kubectl exec deploy/stellaops-evidence-locker -- stella evidence integrity-check

# Check persistent volume health
kubectl describe pvc stellaops-evidence-data

# Check for pod restarts that might indicate storage issues
kubectl get events --field-selector involvedObject.name=stellaops-evidence-locker -n stellaops
```

```yaml
# values.yaml - schedule periodic integrity checks
evidenceLocker:
  integrityCheck:
    enabled: true
    schedule: "0 4 * * *" # Daily at 4am
    sampleSize: 10
```

## Verification
```bash
stella doctor run --check check.evidencelocker.provenance
```

## Related Checks
- `check.evidencelocker.merkle` — Merkle anchoring provides checkpoint-level integrity on top of per-record verification
- `check.evidencelocker.index` — index consistency ensures provenance records are discoverable
- `check.evidencelocker.retrieval` — retrieval health is required to access provenance records
- `check.compliance.provenance-completeness` — verifies provenance exists for all releases (completeness vs. integrity)
- `check.compliance.evidence-integrity` — broader evidence integrity check including provenance

docs/doctor/articles/evidence-locker/retrieval.md
@@ -0,0 +1,139 @@
---
checkId: check.evidencelocker.retrieval
plugin: stellaops.doctor.evidencelocker
severity: fail
tags: [evidence, attestation, retrieval, core]
---
# Attestation Retrieval

## What It Checks
Verifies that attestation artifacts can be retrieved from the evidence locker. The check supports two modes depending on the deployment:

**HTTP mode** (when `IHttpClientFactory` is available):
Sends a GET request to `{endpoint}/v1/attestations/sample` with a 5-second timeout and measures response latency.

**Local file mode** (fallback):
Checks the local evidence locker path at `EvidenceLocker:Path`, verifies the `attestations/` subdirectory exists, and attempts to read a sample attestation JSON file.

| Condition | Result |
|---|---|
| Endpoint not configured | Skip |
| HTTP request times out (> 5000ms) | Fail |
| HTTP error status code | Fail |
| Connection error | Fail |
| HTTP success but latency > 500ms | Warn |
| Local attestations directory missing | Warn |
| HTTP success with latency <= 500ms | Pass |
| Local file read successful | Pass |

Evidence collected: `Endpoint`, `StatusCode`, `LatencyMs`, `Threshold`, `Path`, `SampleAttestation`, `ContentLength`.

The check only runs when `EvidenceLocker:Endpoint` or `Services:EvidenceLocker` is configured.
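The HTTP-mode rows of the table reduce to a probe plus a classifier. A hedged sketch: the commented `curl` line is one way to collect the raw status and latency (the endpoint variable is hypothetical); the 2xx/500ms rules come from the table above.

```shell
# One possible probe (commented out so the sketch stays self-contained):
#   curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}' \
#     "$ENDPOINT/v1/attestations/sample"

classify_retrieval() {
  local status="$1" latency_ms="$2"
  if [ "$status" -lt 200 ] || [ "$status" -ge 300 ]; then
    echo FAIL          # timeout, connection error, or HTTP error status
  elif [ "$latency_ms" -gt 500 ]; then
    echo WARN          # reachable but slower than the 500ms threshold
  else
    echo PASS
  fi
}

classify_retrieval 200 120   # PASS
classify_retrieval 200 900   # WARN
```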

## Why It Matters
Attestation retrieval is a core operation used throughout the release pipeline. Release approvals, audit queries, compliance reports, and evidence exports all depend on being able to retrieve attestation artifacts from the evidence locker. If retrieval is slow or failing, release approvals may time out, audit queries will fail, and compliance reports cannot be generated. Latency above 500ms indicates performance degradation that will compound when retrieving multiple attestations during a release or audit.

## Common Causes
- Evidence locker service unavailable or not running
- Authentication failure when accessing the evidence locker API
- Artifact not found (empty or uninitialized evidence locker)
- Evidence locker under heavy load causing elevated latency
- Network latency between services
- Storage backend slow (disk I/O bottleneck)
- Local evidence locker path not configured or directory missing
- File permission issues on local attestation files

## How to Fix

### Docker Compose
```bash
# Check evidence locker service status
docker compose ps evidence-locker

# Test evidence retrieval
docker compose exec evidence-locker stella evidence status

# Test authentication
docker compose exec evidence-locker stella evidence auth-test

# Check service logs for errors
docker compose logs evidence-locker --since 5m

# If local mode, verify the evidence path and permissions
docker compose exec evidence-locker ls -la /data/evidence/attestations/

# Initialize evidence locker if needed
docker compose exec evidence-locker stella evidence init

# Set endpoint configuration
# EvidenceLocker__Endpoint=http://evidence-locker:5080
```

### Bare Metal / systemd
```bash
# Check service status
sudo systemctl status stellaops-evidence-locker

# Test evidence retrieval
stella evidence status

# Test connectivity
stella evidence ping

# Check attestations directory
ls -la /var/lib/stellaops/evidence/attestations/

# Initialize if empty
stella evidence init

# Check disk I/O
iostat -x 1 5

# In appsettings.json:
# "EvidenceLocker": { "Endpoint": "http://localhost:5080" }

sudo systemctl restart stellaops-evidence-locker
```

### Kubernetes / Helm
```bash
# Check evidence locker pod status
kubectl get pods -l app=stellaops-evidence-locker

# Check pod logs
kubectl logs deploy/stellaops-evidence-locker --since=5m

# Test retrieval from within cluster
kubectl exec deploy/stellaops-evidence-locker -- stella evidence status

# Check persistent volume
kubectl describe pvc stellaops-evidence-data

# Check for resource constraints
kubectl top pod -l app=stellaops-evidence-locker
```

```yaml
# values.yaml
evidenceLocker:
  endpoint: http://stellaops-evidence-locker:5080
  resources:
    requests:
      memory: 256Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m
```

## Verification
```bash
stella doctor run --check check.evidencelocker.retrieval
```

## Related Checks
- `check.evidencelocker.index` — evidence index consistency affects retrieval accuracy
- `check.evidencelocker.provenance` — provenance chain integrity depends on reliable retrieval
- `check.evidencelocker.merkle` — Merkle anchor verification requires attestation access
- `check.compliance.evidence-rate` — evidence generation feeds the retrieval pipeline
- `check.compliance.evidence-integrity` — integrity verification requires successful retrieval

docs/doctor/articles/integration/ci-system-connectivity.md
@@ -0,0 +1,71 @@
---
checkId: check.integration.ci.system
plugin: stellaops.doctor.integration
severity: warn
tags: [integration, ci, cd, jenkins, gitlab, github]
---
# CI System Connectivity

## What It Checks
Iterates over all CI/CD systems defined under `CI:Systems` (or the legacy `CI:Url` single-system key). For each system it sends an HTTP GET to a type-specific health endpoint (Jenkins `/api/json`, GitLab `/api/v4/version`, GitHub `/rate_limit`, Azure DevOps `/_apis/connectionData`, or generic `/health`), sets the appropriate auth header (Bearer for GitHub/generic, `PRIVATE-TOKEN` for GitLab), and records reachability, authentication success, and latency. If the system is reachable and authenticated, it optionally queries runner/agent status (Jenkins `/computer/api/json`, GitLab `/api/v4/runners?status=online`).

Severity:
- **Fail** when any system is unreachable or returns 401/403
- **Warn** when all systems are reachable but one or more has zero available runners (out of a non-zero total)
- **Pass** otherwise
||||
|
||||
## Why It Matters
|
||||
CI/CD systems are the trigger point for automated builds, tests, and release pipelines. If a CI system is unreachable or its credentials have expired, new commits will not be built, security scans will not run, and promotions will stall. Runner exhaustion has the same effect: pipelines queue indefinitely, delaying releases and blocking evidence collection.
|
||||
|
||||
## Common Causes
|
||||
- CI system is down or undergoing maintenance
|
||||
- Network connectivity issue between Stella Ops and the CI host
|
||||
- API credentials (token or password) have expired or been rotated
|
||||
- Firewall or security group blocking the CI API port
|
||||
- All CI runners/agents are offline or busy
|
||||
|
||||
## How to Fix

### Docker Compose
```bash
# Verify the CI URL is correct in your environment file
grep -E '^CI__' .env

# Test connectivity from within the Docker network
docker compose exec gateway curl -sv https://ci.example.com/api/json

# Rotate or set a new API token
echo 'CI__Systems__0__ApiToken=<new-token>' >> .env
docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Check config in appsettings
cat /etc/stellaops/appsettings.Production.json | jq '.CI'

# Test connectivity
curl -H "Authorization: Bearer $CI_TOKEN" https://ci.example.com/api/json

# Update the token
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
ci:
  systems:
    - name: jenkins-prod
      url: https://ci.example.com
      type: jenkins
      apiToken: <token> # or use existingSecret
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.ci.system
```

## Related Checks

- `check.integration.webhooks` -- validates webhook delivery from CI events
- `check.integration.git` -- validates Git provider reachability (often same host as CI)

66
docs/doctor/articles/integration/git-provider-api.md
Normal file
@@ -0,0 +1,66 @@
---
checkId: check.integration.git
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, git, scm]
---
# Git Provider API

## What It Checks

Resolves the configured Git provider URL from `Git:Url`, `Scm:Url`, `GitHub:Url`, `GitLab:Url`, or `Gitea:Url`. Auto-detects the provider type (GitHub, GitLab, Gitea, Bitbucket, Azure DevOps) from the URL and sends an HTTP GET to the corresponding API endpoint (e.g., GitHub -> `api.github.com`, GitLab -> `/api/v4/version`, Gitea -> `/api/v1/version`, Bitbucket -> `/rest/api/1.0/application-properties`). The check **passes** if the response is 2xx, 401, or 403 (reachable even if auth is needed), **warns** on other non-error status codes, and **fails** on connection errors or exceptions.

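The URL-based auto-detection can be sketched as a simple pattern match. This is an illustration, not the product's detector: the host patterns below are simplified assumptions, and the real implementation may inspect more than the URL.

```shell
# Illustrative reproduction of provider auto-detection: URL in, probe URL out.
git_probe_url() {
  case "$1" in
    https://github.com*) echo "https://api.github.com" ;;
    *gitea*)             echo "$1/api/v1/version" ;;
    *bitbucket*)         echo "$1/rest/api/1.0/application-properties" ;;
    *)                   echo "$1/api/v4/version" ;;  # GitLab and unknown self-hosted
  esac
}

git_probe_url https://git.example.com
```

You can then `curl -sv` the printed URL; per the verdict rules above, a 2xx/401/403 response means the provider is reachable.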
## Why It Matters

Git provider connectivity is essential for source-code scanning, SBOM ingestion, webhook event reception, and commit-status reporting. A misconfigured or unreachable Git URL silently breaks SCM-triggered workflows and prevents evidence collection from source repositories.

## Common Causes

- Git provider URL is incorrect or has a trailing-path typo
- Network connectivity issues or DNS failure
- Git provider service is down or undergoing maintenance
- Provider uses a non-standard API path

## How to Fix

### Docker Compose
```bash
# Check current Git URL (config keys are case-insensitive)
grep -iE 'Git__Url|Scm__Url|GitHub__Url' .env

# Test from inside the network
docker compose exec gateway curl -sv https://git.example.com/api/v4/version

# Update the URL
echo 'Git__Url=https://git.example.com' >> .env
docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Verify configuration
cat /etc/stellaops/appsettings.Production.json | jq '.Git'

# Test connectivity
curl -v https://git.example.com/api/v4/version

# Fix the URL
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
git:
  url: https://git.example.com
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.git
```

## Related Checks

- `check.integration.ci.system` -- CI systems often share the same Git host
- `check.integration.webhooks` -- webhook endpoints receive events from Git providers

72
docs/doctor/articles/integration/ldap-connectivity.md
Normal file
@@ -0,0 +1,72 @@
---
checkId: check.integration.ldap
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, ldap, directory, auth]
---
# LDAP/AD Connectivity

## What It Checks

Reads the LDAP host from `Ldap:Host`, `ActiveDirectory:Host`, or `Authority:Ldap:Host` and the port from the corresponding `:Port` key (defaulting to 389, or 636 when `UseSsl` is true). Opens a raw TCP connection to the host and port with a 5-second timeout. The check **passes** if the TCP connection succeeds, **fails** on timeout, socket error, or connection refusal.

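The probe is a plain TCP connect, so it can be reproduced with bash's `/dev/tcp` and the same 5-second cap. A minimal sketch (the `ldap_tcp_probe` helper and the host below are placeholders):

```shell
# Minimal reproduction of the probe: raw TCP connect with a 5-second timeout.
ldap_tcp_probe() {
  local host="$1" port="${2:-389}"   # 389 default, pass 636 for LDAPS
  if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo PASS
  else
    echo FAIL
  fi
}
```

Example: `ldap_tcp_probe ldap.example.com 636`. Note this only proves the socket opens; it says nothing about bind credentials or TLS validity.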
## Why It Matters

LDAP or Active Directory integration is used for user authentication, group synchronization, and role mapping. If the LDAP server is unreachable, users cannot log in via directory credentials, group-based access policies cannot be evaluated, and new user provisioning stops. This directly impacts operator access to the control plane.

## Common Causes

- LDAP/AD server is not running or is being restarted
- Firewall blocking LDAP port (389) or LDAPS port (636)
- DNS resolution failure for the LDAP hostname
- Network unreachable between Stella Ops and the directory server
- Incorrect host or port in configuration

## How to Fix

### Docker Compose
```bash
# Check LDAP configuration (config keys are case-insensitive)
grep -iE 'Ldap__|ActiveDirectory__' .env

# Test TCP connectivity from the gateway container
docker compose exec gateway bash -c "echo > /dev/tcp/ldap.example.com/389 && echo OK || echo FAIL"

# Update LDAP host/port
echo 'Ldap__Host=ldap.example.com' >> .env
echo 'Ldap__Port=636' >> .env
echo 'Ldap__UseSsl=true' >> .env
docker compose restart gateway
```

### Bare Metal / systemd
```bash
# Verify configuration
cat /etc/stellaops/appsettings.Production.json | jq '.Ldap'

# Test connectivity
telnet ldap.example.com 389
# or
nslookup ldap.example.com

# Update configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
ldap:
  host: ldap.example.com
  port: 636
  useSsl: true
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.ldap
```

## Related Checks

- `check.integration.oidc` -- OIDC provider connectivity (alternative auth mechanism)

73
docs/doctor/articles/integration/object-storage.md
Normal file
@@ -0,0 +1,73 @@
---
checkId: check.integration.s3.storage
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, s3, storage]
---
# Object Storage Connectivity

## What It Checks

Reads the S3 endpoint from `S3:Endpoint`, `Storage:S3:Endpoint`, or `AWS:S3:ServiceURL`. Parses the URI to extract host and port (defaulting to 443 for HTTPS, 80 for HTTP). Opens a raw TCP connection with a 5-second timeout. The check **passes** if the TCP connection succeeds, **fails** on timeout, socket error, invalid URI format, or connection refusal.

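The parsing step (scheme-based port defaulting) can be sketched with plain string handling; the helper name is illustrative:

```shell
# Sketch of the endpoint parsing the check performs: URI in, "host port" out.
s3_parse_endpoint() {
  local url="$1" scheme hostport host port
  scheme="${url%%://*}"
  hostport="${url#*://}"; hostport="${hostport%%/*}"
  host="${hostport%%:*}"
  case "$hostport" in
    *:*) port="${hostport##*:}" ;;                              # explicit port
    *)   if [ "$scheme" = https ]; then port=443; else port=80; fi ;;
  esac
  echo "$host $port"
}

s3_parse_endpoint http://minio:9000       # -> minio 9000
s3_parse_endpoint https://s3.example.com  # -> s3.example.com 443
```

The check then applies the same 5-second TCP probe to the resulting host/port, e.g. `timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port"`.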
## Why It Matters

S3-compatible object storage is used for evidence packet archival, SBOM storage, offline kit distribution, and large artifact persistence. If the storage endpoint is unreachable, evidence export fails, SBOM uploads are rejected, and offline kit generation cannot complete. This blocks audit compliance workflows and air-gap distribution.

## Common Causes

- S3 endpoint (MinIO, AWS S3, or compatible) is unreachable
- Network connectivity issues or DNS failure
- Firewall blocking the storage port
- Invalid endpoint URL format in configuration
- MinIO or S3-compatible service is not running

## How to Fix

### Docker Compose
```bash
# Check S3 configuration (config keys are case-insensitive)
grep -iE 'S3__|Storage__S3' .env

# Test connectivity to MinIO
docker compose exec gateway curl -v http://minio:9000/minio/health/live

# Restart MinIO if stopped
docker compose up -d minio

# Update endpoint
echo 'S3__Endpoint=http://minio:9000' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Verify S3 configuration
cat /etc/stellaops/appsettings.Production.json | jq '.S3'

# Test connectivity
curl -v http://minio.example.com:9000/minio/health/live

# Check if MinIO is running
sudo systemctl status minio

# Update configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
s3:
  endpoint: http://minio.storage.svc.cluster.local:9000
  bucket: stellaops-evidence
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.s3.storage
```

## Related Checks

- `check.integration.oci.registry` -- OCI registries may also store artifacts

@@ -0,0 +1,70 @@
---
checkId: check.integration.oci.registry
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, oci, registry]
---
# OCI Registry Connectivity

## What It Checks

Reads the registry URL from `OCI:RegistryUrl` or `Registry:Url`. Sends an HTTP GET to `<registryUrl>/v2/` (the OCI Distribution Spec base endpoint). The check **passes** if the response is 200 (open registry) or 401 (registry reachable, auth required), **warns** on any other status code, and **fails** on connection errors.

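The verdict mapping can be reproduced with a one-function sketch (the `oci_v2_probe` helper is illustrative; `curl -w '%{http_code}'` prints `000` on connection errors):

```shell
# Sketch: probe /v2/ and map the status code to the verdict described above.
oci_v2_probe() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1/v2/")
  case "$code" in
    200) echo "PASS (open registry)" ;;
    401) echo "PASS (reachable, auth required)" ;;
    000) echo "FAIL (connection error)" ;;
    *)   echo "WARN (unexpected status $code)" ;;
  esac
}
```

Example: `oci_v2_probe https://registry.example.com`. Note 401 here is a *pass*: it proves the registry answered, which is all this check asserts.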
## Why It Matters

The OCI registry is the central artifact store for container images, SBOMs, attestations, and signatures. If the registry is unreachable, image pulls fail during deployment, SBOM scans cannot fetch manifests, attestation verification cannot retrieve signatures, and promotions are blocked. This is a foundational dependency for nearly every Stella Ops workflow.

## Common Causes

- Registry URL is incorrect (typo, wrong port, wrong scheme)
- Network connectivity issues between Stella Ops and the registry
- Registry service is down or restarting
- Registry does not support the OCI Distribution spec at `/v2/`
- Registry endpoint is misconfigured (path prefix required)

## How to Fix

### Docker Compose
```bash
# Check registry configuration (config keys are case-insensitive)
grep -iE 'OCI__RegistryUrl|Registry__Url' .env

# Test the /v2/ endpoint from inside the network
docker compose exec gateway curl -sv https://registry.example.com/v2/

# Update registry URL
echo 'OCI__RegistryUrl=https://registry.example.com' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Verify configuration
cat /etc/stellaops/appsettings.Production.json | jq '.OCI'

# Test connectivity
curl -v https://registry.example.com/v2/

# Fix configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
oci:
  registryUrl: https://registry.example.com
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oci.registry
```

## Related Checks

- `check.integration.oci.credentials` -- validates registry credentials
- `check.integration.oci.pull` -- verifies pull authorization
- `check.integration.oci.push` -- verifies push authorization
- `check.integration.oci.referrers` -- checks OCI 1.1 referrers API support
- `check.integration.oci.capabilities` -- probes full capability matrix

75
docs/doctor/articles/integration/oidc-provider.md
Normal file
@@ -0,0 +1,75 @@
---
checkId: check.integration.oidc
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, oidc, auth, identity]
---
# OIDC Provider

## What It Checks

Reads the OIDC issuer URL from `Oidc:Issuer`, `Authentication:Oidc:Issuer`, or `Authority:Oidc:Issuer`. Fetches the OpenID Connect discovery document at `<issuer>/.well-known/openid-configuration`. On a successful response, parses the JSON for three required endpoints: `authorization_endpoint`, `token_endpoint`, and `jwks_uri`. The check **passes** if all three are present, **warns** if the discovery document is incomplete (missing one or more endpoints), and **fails** if the discovery endpoint returns a non-success status code or on connection errors.

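The completeness test can be sketched offline. The helper below is an illustration only (it does naive string matching on the document; the real check parses JSON):

```shell
# Sketch: read a discovery document on stdin and report the verdict.
oidc_doc_complete() {
  local doc; doc=$(cat)
  for key in authorization_endpoint token_endpoint jwks_uri; do
    case "$doc" in
      *"\"$key\""*) ;;                                # key present, keep going
      *) echo "WARN (missing $key)"; return ;;
    esac
  done
  echo PASS
}
```

Usage: `curl -s https://auth.example.com/.well-known/openid-configuration | oidc_doc_complete`.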
## Why It Matters

OIDC authentication is the primary identity mechanism for Stella Ops operators and API clients. If the OIDC provider is unreachable or misconfigured, users cannot log in, API tokens cannot be validated, and all authenticated workflows halt. An incomplete discovery document causes subtle failures where some auth flows work but others (e.g., token refresh) silently break.

## Common Causes

- OIDC issuer URL is incorrect or has a trailing slash issue
- OIDC provider (Authority, Keycloak, Azure AD, etc.) is down
- Network connectivity issues between Stella Ops and the identity provider
- Provider does not support OpenID Connect discovery
- Discovery document is missing required endpoints

## How to Fix

### Docker Compose
```bash
# Check OIDC configuration (config keys are case-insensitive)
grep -iE 'Oidc__Issuer|Authentication__Oidc' .env

# Test discovery endpoint
docker compose exec gateway curl -sv \
  https://auth.example.com/.well-known/openid-configuration

# Verify the Authority service is running
docker compose ps authority

# Update issuer URL
echo 'Oidc__Issuer=https://auth.example.com' >> .env
docker compose restart gateway platform
```

### Bare Metal / systemd
```bash
# Verify configuration
cat /etc/stellaops/appsettings.Production.json | jq '.Oidc'

# Test discovery
curl -v https://auth.example.com/.well-known/openid-configuration

# Check required fields in the response
curl -s https://auth.example.com/.well-known/openid-configuration \
  | jq '{authorization_endpoint, token_endpoint, jwks_uri}'

# Fix configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
oidc:
  issuer: https://auth.example.com
  clientId: stellaops-ui
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oidc
```

## Related Checks

- `check.integration.ldap` -- alternative directory-based authentication

@@ -0,0 +1,89 @@
---
checkId: check.integration.oci.capabilities
plugin: stellaops.doctor.integration
severity: info
tags: [registry, oci, capabilities, compatibility]
---
# OCI Registry Capability Matrix

## What It Checks

Probes the configured OCI registry for five capabilities using a test repository (`OCI:TestRepository`, default `library/alpine`):

1. **Distribution version** -- GET `/v2/`, reads the `OCI-Distribution-API-Version` or `Docker-Distribution-API-Version` header.
2. **Referrers API** -- GET `/v2/<repo>/referrers/<digest>` with the OCI accept header; passes if 200 or if a 404 response contains OCI index JSON.
3. **Chunked upload** -- POST `/v2/<repo>/blobs/uploads/`; passes on 202 Accepted (the upload session is immediately cancelled).
4. **Cross-repo mount** -- POST `/v2/<repo>/blobs/uploads/?mount=<digest>&from=library/alpine`; passes on 201 Created or 202 Accepted.
5. **Delete support** (manifests and blobs) -- OPTIONS request to check whether `DELETE` appears in the `Allow` header.

Calculates a capability score (N/5). The check **warns** if the referrers API is unsupported, returns **info** if any other capability is missing, **passes** if all 5 are supported, and **fails** on connection errors.

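The scoring step can be sketched with stub probes. Everything below is illustrative: the `probe_*` functions are stand-ins for the five HTTP probes above, here hard-wired to a hypothetical registry that lacks only the referrers API.

```shell
# Stub probes (assumption: a registry missing only the referrers API).
probe_distribution_version() { true; }
probe_referrers()            { false; }
probe_chunked_upload()       { true; }
probe_cross_repo_mount()     { true; }
probe_delete()               { true; }

# Tally a capability score the way the check does.
score=0
for cap in distribution_version referrers chunked_upload cross_repo_mount delete; do
  "probe_$cap" && score=$((score+1))
done
echo "capability score: $score/5"   # -> capability score: 4/5
```

Replacing each stub with the corresponding `curl` probe turns this into a manual capability audit of a real registry.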
## Why It Matters

Different OCI registries support different subsets of the OCI Distribution Spec. Stella Ops uses referrers for attestation linking, chunked uploads for large SBOMs, cross-repo mounts for efficient promotion, and deletes for garbage collection. Knowing the capability matrix upfront prevents mysterious failures during release operations and allows operators to configure appropriate fallbacks.

## Common Causes

- Registry does not implement OCI Distribution Spec v1.1 (no referrers API)
- Registry has delete operations disabled by policy
- Chunked upload is disabled in registry configuration
- Cross-repo mount is not supported by the registry implementation
- Registry version is too old for newer OCI features

## How to Fix

### Docker Compose
```bash
# Check registry type and version
docker compose exec gateway curl -sv https://registry.example.com/v2/ \
  -o /dev/null 2>&1 | grep -i 'distribution-api-version'

# If referrers API is missing, consider upgrading the registry
# Harbor 2.6+, Quay 3.12+, ACR, ECR, GCR/Artifact Registry support referrers

# Enable delete in Harbor
# Update harbor.yml: delete_enabled: true
# Restart Harbor
```

### Bare Metal / systemd
```bash
# Test referrers API directly
curl -H "Accept: application/vnd.oci.image.index.v1+json" \
  https://registry.example.com/v2/library/alpine/referrers/sha256:abc...

# Test chunked upload
curl -X POST https://registry.example.com/v2/test/blobs/uploads/

# Enable delete in Docker Distribution
# In /etc/docker/registry/config.yml:
# storage:
#   delete:
#     enabled: true
sudo systemctl restart docker-registry
```

### Kubernetes / Helm
```yaml
# values.yaml (for Harbor)
harbor:
  registry:
    deleteEnabled: true

# values.yaml (for Stella Ops)
oci:
  registryUrl: https://registry.example.com
  testRepository: library/alpine
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oci.capabilities
```

## Related Checks

- `check.integration.oci.registry` -- basic registry connectivity
- `check.integration.oci.referrers` -- focused referrers API check with digest resolution
- `check.integration.oci.credentials` -- credential validation
- `check.integration.oci.pull` -- pull authorization
- `check.integration.oci.push` -- push authorization

76
docs/doctor/articles/integration/registry-credentials.md
Normal file
@@ -0,0 +1,76 @@
---
checkId: check.integration.oci.credentials
plugin: stellaops.doctor.integration
severity: fail
tags: [registry, oci, credentials, secrets, auth]
---
# OCI Registry Credentials

## What It Checks

Determines the authentication method from configuration: bearer token (`OCI:Token` / `Registry:Token`), basic auth (`OCI:Username` + `OCI:Password` / `Registry:Username` + `Registry:Password`), or anonymous. Immediately **fails** if a username is provided without a password. Then validates credentials by sending an authenticated HTTP GET to `<registryUrl>/v2/`. The check **passes** on 200 OK, or on 401 if the response includes a `WWW-Authenticate: Bearer` challenge and basic credentials are configured (OAuth2 token exchange scenario). It **fails** on 401 (invalid credentials), 403 (forbidden), and on connection errors or timeouts.

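The subtle part is distinguishing "bad credentials" from "token exchange required" via the `WWW-Authenticate` header. A hedged sketch (the helper name is illustrative; pass the registry URL and `user:password`):

```shell
# Sketch: classify the /v2/ response the way the check does.
check_registry_auth() {
  local hdrs
  if ! hdrs=$(curl -sI --max-time 10 -u "$2" "$1/v2/"); then
    echo "FAIL (connection)"; return
  fi
  if printf '%s' "$hdrs" | grep -q '^HTTP.* 200'; then
    echo PASS
  elif printf '%s' "$hdrs" | grep -qi '^www-authenticate: *bearer'; then
    echo "PASS (OAuth2 token exchange expected)"
  else
    echo "FAIL (invalid credentials or forbidden)"
  fi
}
```

Example: `check_registry_auth https://registry.example.com stellaops-svc:<password>`.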
## Why It Matters

Invalid or expired registry credentials cause image pull/push failures across all deployment pipelines. Because credentials are often rotated on a schedule, this check provides early detection of expired tokens before they silently break promotions, SBOM ingestion, or attestation storage. A username-without-password misconfiguration indicates a secret reference that failed to resolve.

## Common Causes

- Credentials are invalid or have been rotated without updating the configuration
- Token has been revoked by the registry administrator
- Username provided without a corresponding password (broken secret reference)
- Service account token expired
- IP address or network not in the registry's allowlist

## How to Fix

### Docker Compose
```bash
# Check credential configuration (config keys are case-insensitive)
grep -iE 'OCI__Username|OCI__Password|OCI__Token|Registry__' .env

# Test credentials manually
docker login registry.example.com

# Rotate credentials
echo 'OCI__Username=stellaops-svc' >> .env
echo 'OCI__Password=<new-password>' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Check credential configuration (secret values are masked)
cat /etc/stellaops/appsettings.Production.json | jq '.OCI | {Username, Password: (if .Password then "****" else null end), Token: (if .Token then "****" else null end)}'

# Test with curl
curl -u stellaops-svc:<password> https://registry.example.com/v2/

# Update credentials
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
oci:
  registryUrl: https://registry.example.com
  existingSecret: stellaops-registry-creds # Secret with username/password keys
```
```bash
# Create or update the secret
kubectl create secret generic stellaops-registry-creds \
  --from-literal=username=stellaops-svc \
  --from-literal=password=<new-password> \
  --dry-run=client -o yaml | kubectl apply -f -

helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oci.credentials
```

## Related Checks

- `check.integration.oci.registry` -- basic connectivity (does not test auth)
- `check.integration.oci.pull` -- verifies pull authorization with these credentials
- `check.integration.oci.push` -- verifies push authorization with these credentials

@@ -0,0 +1,72 @@
---
checkId: check.integration.oci.pull
plugin: stellaops.doctor.integration
severity: fail
tags: [registry, oci, pull, authorization, credentials]
---
# OCI Registry Pull Authorization

## What It Checks

Sends an authenticated HTTP HEAD request to `<registryUrl>/v2/<testRepo>/manifests/<testTag>` with OCI and Docker manifest accept headers. Uses the test repository from `OCI:TestRepository` (default `library/alpine`) and the test tag from `OCI:TestTag` (default `latest`). The check **passes** on 2xx (records manifest digest and content type), returns **info** on 404 (test image not found -- cannot verify), **fails** on 401 (invalid credentials), **fails** on 403 (valid credentials but no pull permission), and **fails** on connection errors or timeouts.

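The status-to-verdict mapping can be sketched as a small helper (illustrative only; note that 401 and 403 lead to different remediations, which is why the check reports them separately):

```shell
# Sketch: map the HEAD status code of the manifest request to the verdict.
pull_verdict() {
  case "$1" in
    2??) echo "PASS" ;;
    404) echo "INFO (test image not found)" ;;
    401) echo "FAIL (invalid credentials)" ;;
    403) echo "FAIL (no pull permission)" ;;
    *)   echo "FAIL (connection error)" ;;
  esac
}
```

Usage: capture the code with `curl -s -o /dev/null -w '%{http_code}' -I -H "Accept: application/vnd.oci.image.manifest.v1+json" -u user:pass <registryUrl>/v2/<repo>/manifests/<tag>` and feed it to `pull_verdict`.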
## Why It Matters

Pull authorization is the most fundamental registry operation. Stella Ops pulls images for scanning, SBOM extraction, attestation verification, and deployment. If pull authorization fails, the entire image-based workflow is blocked. This check tests actual pull permissions rather than just credential validity, catching permission misconfigurations that `check.integration.oci.credentials` cannot detect.

## Common Causes

- Credentials are invalid or expired
- Token has been revoked
- Anonymous pull is not allowed and no credentials are configured
- Service account has been removed from the repository's access list
- Repository access restricted by IP, network, or organization policy
- Test image does not exist in the registry (404 -- configure `OCI:TestRepository`)

## How to Fix

### Docker Compose
```bash
# Test pull manually
docker pull registry.example.com/library/alpine:latest

# Check configured test repository (config keys are case-insensitive)
grep -iE 'OCI__TestRepository|Registry__TestRepository' .env

# Set a valid test image that exists in your registry
echo 'OCI__TestRepository=myorg/base-image' >> .env
echo 'OCI__TestTag=latest' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Test pull authorization with curl
curl -I -H "Accept: application/vnd.oci.image.manifest.v1+json" \
  -u stellaops-svc:<password> \
  https://registry.example.com/v2/library/alpine/manifests/latest

# Configure a test image that exists in your registry
sudo nano /etc/stellaops/appsettings.Production.json
# Set OCI:TestRepository and OCI:TestTag
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
oci:
  registryUrl: https://registry.example.com
  testRepository: myorg/base-image
  testTag: latest
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oci.pull
```

## Related Checks

- `check.integration.oci.credentials` -- validates credential configuration and token validity
- `check.integration.oci.push` -- verifies push authorization
- `check.integration.oci.registry` -- basic registry connectivity

@@ -0,0 +1,74 @@
---
checkId: check.integration.oci.push
plugin: stellaops.doctor.integration
severity: fail
tags: [registry, oci, push, authorization, credentials]
---
# OCI Registry Push Authorization

## What It Checks

Sends an authenticated HTTP POST to `<registryUrl>/v2/<testRepo>/blobs/uploads/` to initiate a blob upload session. Uses the test repository from `OCI:TestRepository` or `OCI:PushTestRepository` (default `stellaops/doctor-test`). Only runs if credentials are configured. The check **passes** on 202 Accepted (the upload session is immediately cancelled by sending a DELETE to the returned Location header), **fails** on 401 (invalid credentials), **fails** on 403 (valid credentials but no push permission), and **fails** on connection errors or timeouts. No data is actually written to the registry.

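The non-destructive probe (open an upload session, then cancel it) can be reproduced by hand. The sketch below is illustrative and assumes the registry returns an absolute `Location` header; arguments are registry URL, `user:password`, and repository.

```shell
# Sketch: initiate a blob upload session, then DELETE it so nothing is written.
probe_push() {
  local loc
  loc=$(curl -s -o /dev/null -D - --max-time 10 -X POST -u "$2" \
        "$1/v2/$3/blobs/uploads/" \
        | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')
  if [ -n "$loc" ]; then
    curl -s --max-time 10 -X DELETE -u "$2" "$loc" >/dev/null  # cancel session
    echo PASS
  else
    echo FAIL
  fi
}
```

Example: `probe_push https://registry.example.com stellaops-svc:<password> stellaops/doctor-test`. A 202 with a `Location` header yields PASS; a 401/403 or connection failure yields FAIL.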
## Why It Matters

Push authorization is required for storing attestations, SBOMs, signatures, and promoted images in the registry. Without push access, Stella Ops cannot attach evidence artifacts to releases, sign images, or complete promotion workflows. This check verifies the actual push permission grant, not just credential validity, using a non-destructive probe that leaves no artifacts behind.

## Common Causes

- Credentials are valid but lack push (write) permissions
- Repository does not exist and the registry does not support auto-creation
- Service account has read-only access
- Organization or team policy restricts push to specific accounts
- Token has been revoked or expired
- IP or network restrictions prevent write operations

## How to Fix

### Docker Compose
```bash
# Test push manually (tag any local image into the probe repository)
docker tag alpine:latest registry.example.com/stellaops/doctor-test:probe
docker push registry.example.com/stellaops/doctor-test:probe

# Grant push permissions to the service account in your registry UI

# Set a writable test repository
echo 'OCI__PushTestRepository=myorg/stellaops-test' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Test push authorization with curl
curl -X POST \
  -u stellaops-svc:<password> \
  https://registry.example.com/v2/stellaops/doctor-test/blobs/uploads/

# Expected: 202 Accepted with Location header

# Fix permissions in registry
# Harbor: Add stellaops-svc as Developer/Admin to the project
# GitLab: Grant Developer (or higher) role to the service account
# ECR: Attach ecr:InitiateLayerUpload policy

sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
oci:
  registryUrl: https://registry.example.com
  pushTestRepository: myorg/stellaops-test
  existingSecret: stellaops-registry-creds
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification

```
stella doctor run --check check.integration.oci.push
```

## Related Checks

- `check.integration.oci.credentials` -- validates credential configuration and token validity
- `check.integration.oci.pull` -- verifies pull authorization
- `check.integration.oci.registry` -- basic registry connectivity

82
docs/doctor/articles/integration/registry-referrers-api.md
Normal file
@@ -0,0 +1,82 @@
---
checkId: check.integration.oci.referrers
plugin: stellaops.doctor.integration
severity: warn
tags: [registry, oci, referrers, compatibility, oci-1.1]
---
# OCI Registry Referrers API Support

## What It Checks

First resolves the manifest digest for the test image (`OCI:TestRepository`:`OCI:TestTag`, defaults to `library/alpine:latest`) by sending a HEAD request to the manifests endpoint and reading the `Docker-Content-Digest` header. Then probes the referrers API at `<registryUrl>/v2/<repo>/referrers/<digest>` with the `application/vnd.oci.image.index.v1+json` accept header. The check **passes** on 200 OK, or on 404 if the response body contains OCI index JSON (a valid response meaning no referrers exist yet). It **warns** on 404 without an OCI index (API not supported, tag-based fallback required) or on 405 Method Not Allowed. Returns **info** if the test image is not found (cannot verify). **Fails** on connection errors.

## Why It Matters

The OCI 1.1 referrers API enables artifact linking: attaching SBOMs, signatures, attestations, and VEX documents directly to container image manifests. Without it, Stella Ops must fall back to the tag-based referrer pattern (`sha256-{digest}.{artifactType}`), which is less efficient, harder to discover, and may conflict with registry tag naming policies. Knowing referrers API availability determines which linking strategy is used.

## Common Causes

- Registry does not implement OCI Distribution Spec v1.1
- Registry version is too old (pre-referrers API)
- Referrers API disabled in registry configuration
- Test image does not exist in registry (cannot resolve digest to probe)
- Credentials lack pull permissions for the test image

## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```bash
|
||||
# Check registry version and referrers support
|
||||
docker compose exec gateway curl -sv \
|
||||
-H "Accept: application/vnd.oci.image.index.v1+json" \
|
||||
https://registry.example.com/v2/library/alpine/referrers/sha256:abc...
|
||||
|
||||
# Upgrade registry to a version supporting OCI 1.1 referrers:
|
||||
# - Harbor 2.6+
|
||||
# - Quay 3.12+
|
||||
# - ACR (default)
|
||||
# - ECR (default)
|
||||
# - GCR/Artifact Registry (default)
|
||||
# - Distribution 2.8+
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
# Verify registry version
|
||||
curl -I https://registry.example.com/v2/ 2>&1 | grep -i distribution
|
||||
|
||||
# Test referrers API
|
||||
DIGEST=$(curl -sI -H "Accept: application/vnd.oci.image.manifest.v1+json" \
|
||||
https://registry.example.com/v2/library/alpine/manifests/latest \
|
||||
| grep Docker-Content-Digest | awk '{print $2}' | tr -d '\r')
|
||||
|
||||
curl -H "Accept: application/vnd.oci.image.index.v1+json" \
|
||||
https://registry.example.com/v2/library/alpine/referrers/$DIGEST
|
||||
|
||||
# Upgrade the registry package
|
||||
sudo apt upgrade docker-registry # or equivalent
|
||||
sudo systemctl restart docker-registry
|
||||
```
|
||||
|
||||
### Kubernetes / Helm
|
||||
```yaml
|
||||
# Upgrade Harbor chart
|
||||
helm upgrade harbor harbor/harbor --set registry.referrers.enabled=true
|
||||
|
||||
# Or configure Stella Ops with a test image that exists
|
||||
# values.yaml
|
||||
oci:
|
||||
registryUrl: https://registry.example.com
|
||||
testRepository: myorg/base-image
|
||||
testTag: latest
|
||||
```
|
||||
```bash
|
||||
helm upgrade stellaops ./chart -f values.yaml
|
||||
```
|
||||
|
||||
## Verification
|
||||
```
|
||||
stella doctor run --check check.integration.oci.referrers
|
||||
```
|
||||
|
||||
## Related Checks
|
||||
- `check.integration.oci.capabilities` -- broader capability matrix including referrers
|
||||
- `check.integration.oci.registry` -- basic registry connectivity
|
||||
- `check.integration.oci.pull` -- pull authorization (needed to resolve test image digest)
|
||||
@@ -0,0 +1,89 @@
---
checkId: check.integration.secrets.manager
plugin: stellaops.doctor.integration
severity: fail
tags: [integration, secrets, vault, security, keyvault]
---
# Secrets Manager Connectivity

## What It Checks
Iterates over all secrets managers defined under `Secrets:Managers` (or the legacy `Secrets:Vault:Url` / `Vault:Url` single-manager key). For each manager it sends an HTTP GET to a type-specific health endpoint: Vault uses `/v1/sys/health?standbyok=true&sealedcode=200&uninitcode=200`, Azure Key Vault uses `/healthstatus`, and others use `/health`. Sets the appropriate auth header (`X-Vault-Token` for Vault, `Bearer` for others). Records reachability, authentication success, and latency. For Vault, parses the response JSON for `sealed`, `initialized`, and `version` fields. The check **fails** if any manager is unreachable or returns 401/403, **fails** if any Vault instance is sealed, and **passes** if all managers are healthy and unsealed.
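The sealed-state handling can be illustrated with a canned health response; a live check would GET the body from the Vault URL instead:

```bash
# Sample /v1/sys/health body (canned for illustration).
response='{"initialized":true,"sealed":true,"version":"1.15.2"}'
case "$response" in
  *'"sealed":true'*) verdict="FAIL: Vault is sealed" ;;
  *)                 verdict="PASS: Vault is unsealed" ;;
esac
echo "$verdict"
```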
## Why It Matters
Secrets managers store registry credentials, signing keys, API tokens, and encryption keys. If a secrets manager is unreachable, Stella Ops cannot retrieve credentials for deployments, cannot sign attestations, and cannot decrypt sensitive configuration. A sealed Vault is equally critical: all secret reads fail until it is manually unsealed. This is a hard blocker for any release operation.

## Common Causes
- Secrets manager service is down or restarting
- Network connectivity issue between Stella Ops and the secrets manager
- Authentication token has expired or been revoked
- TLS certificate issue (expired, untrusted CA)
- Vault was restarted and needs manual unseal
- Vault auto-seal triggered due to HSM connectivity loss

## How to Fix

### Docker Compose
```bash
# Check secrets manager configuration
grep 'SECRETS__\|VAULT__' .env

# Test Vault health
docker compose exec gateway curl -sv \
  http://vault:8200/v1/sys/health

# Unseal Vault if sealed
docker compose exec vault vault operator unseal <key1>
docker compose exec vault vault operator unseal <key2>
docker compose exec vault vault operator unseal <key3>

# Refresh Vault token
docker compose exec vault vault token create -policy=stellaops
echo 'Secrets__Managers__0__Token=<new-token>' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Check Vault status
vault status

# Unseal if needed
vault operator unseal

# Renew the Vault token
vault token renew

# Check Azure Key Vault health
curl -v https://myvault.vault.azure.net/healthstatus

# Update configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
secrets:
  managers:
    - name: vault-prod
      url: http://vault.vault.svc.cluster.local:8200
      type: vault
      existingSecret: stellaops-vault-token
```
```bash
# Update Vault token secret
kubectl create secret generic stellaops-vault-token \
  --from-literal=token=<new-token> \
  --dry-run=client -o yaml | kubectl apply -f -

helm upgrade stellaops ./chart -f values.yaml
```

## Verification
```
stella doctor run --check check.integration.secrets.manager
```

## Related Checks
- `check.integration.oci.credentials` -- registry credentials that may be sourced from the secrets manager
74
docs/doctor/articles/integration/slack-webhook.md
Normal file
@@ -0,0 +1,74 @@
---
checkId: check.integration.slack
plugin: stellaops.doctor.integration
severity: info
tags: [notification, slack, webhook]
---
# Slack Webhook

## What It Checks
Reads the Slack webhook URL from `Slack:WebhookUrl` or `Notify:Slack:WebhookUrl`. First validates the URL format: **warns** if the URL does not start with `https://hooks.slack.com/`. Then tests host reachability by sending an HTTP GET to the base URL (`https://hooks.slack.com`). The check **passes** if the Slack host is reachable, **warns** if the host is unreachable or if the URL format is suspicious. Does not send an actual webhook payload to avoid generating noise in the Slack channel.
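The format validation amounts to a prefix test; a rough sketch, where the URL is a placeholder rather than a live webhook:

```bash
webhook_url="https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
case "$webhook_url" in
  "https://hooks.slack.com/"*) verdict="format ok" ;;
  *)                           verdict="warn: unexpected webhook host" ;;
esac
echo "$verdict"
```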
## Why It Matters
Slack notifications keep operators informed about deployment status, policy violations, security findings, and approval requests in real time. A misconfigured or unreachable Slack webhook means critical alerts go undelivered, potentially delaying incident response, approval workflows, or security remediation.

## Common Causes
- Network connectivity issues between Stella Ops and Slack
- Firewall blocking outbound HTTPS to `hooks.slack.com`
- Proxy misconfiguration preventing external HTTPS
- Webhook URL is malformed or points to the wrong service
- Slack webhook URL has been regenerated (old URL invalidated)

## How to Fix

### Docker Compose
```bash
# Check Slack webhook configuration
grep 'SLACK__WEBHOOKURL\|NOTIFY__SLACK' .env

# Test connectivity to Slack
docker compose exec gateway curl -sv https://hooks.slack.com/ -o /dev/null

# Update webhook URL
echo 'Slack__WebhookUrl=https://hooks.slack.com/services/T.../B.../xxx' >> .env
docker compose restart platform

# If behind a proxy
echo 'HTTP_PROXY=http://proxy:8080' >> .env
echo 'HTTPS_PROXY=http://proxy:8080' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Verify configuration
jq '.Slack' /etc/stellaops/appsettings.Production.json

# Test connectivity
curl -sv https://hooks.slack.com/ -o /dev/null

# Update webhook URL
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
slack:
  webhookUrl: https://hooks.slack.com/services/T.../B.../xxx
  # or use an existing secret
  existingSecret: stellaops-slack-webhook
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification
```
stella doctor run --check check.integration.slack
```

## Related Checks
- `check.integration.teams` -- Microsoft Teams webhook (alternative notification channel)
- `check.integration.webhooks` -- general webhook health monitoring
76
docs/doctor/articles/integration/smtp-connectivity.md
Normal file
@@ -0,0 +1,76 @@
---
checkId: check.integration.smtp
plugin: stellaops.doctor.integration
severity: warn
tags: [connectivity, email, smtp]
---
# SMTP Email Connectivity

## What It Checks
Reads the SMTP host from `Smtp:Host`, `Email:Smtp:Host`, or `Notify:Email:Host` and the port from the corresponding `:Port` key (defaulting to 587). Opens a raw TCP connection to the SMTP server with a 5-second timeout. The check **passes** if the TCP connection succeeds, **fails** on timeout, socket error, DNS failure, or connection refusal.

## Why It Matters
Email notifications deliver approval requests, security alerts, deployment summaries, and audit reports to operators who may not be monitoring Slack or the web UI. If the SMTP server is unreachable, these notifications silently fail. For organizations with compliance requirements, email delivery may be the mandated audit notification channel.

## Common Causes
- SMTP server is not running or is being restarted
- Firewall blocking SMTP port (25, 465, or 587)
- DNS resolution failure for the SMTP hostname
- Network unreachable between Stella Ops and the mail server
- Incorrect host or port in configuration
- ISP/cloud provider blocking outbound SMTP

## How to Fix

### Docker Compose
```bash
# Check SMTP configuration
grep 'SMTP__\|EMAIL__SMTP\|NOTIFY__EMAIL' .env

# Test TCP connectivity
docker compose exec gateway bash -c \
  "echo > /dev/tcp/smtp.example.com/587 && echo OK || echo FAIL"

# Update SMTP settings
echo 'Smtp__Host=smtp.example.com' >> .env
echo 'Smtp__Port=587' >> .env
echo 'Smtp__UseSsl=true' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Verify configuration
jq '.Smtp' /etc/stellaops/appsettings.Production.json

# Test connectivity
telnet smtp.example.com 587
# or
nslookup smtp.example.com

# Update configuration
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
smtp:
  host: smtp.example.com
  port: 587
  useSsl: true
  existingSecret: stellaops-smtp-creds  # Secret with username/password
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification
```
stella doctor run --check check.integration.smtp
```

## Related Checks
- `check.integration.slack` -- Slack notifications (alternative channel)
- `check.integration.teams` -- Teams notifications (alternative channel)
75
docs/doctor/articles/integration/teams-webhook.md
Normal file
@@ -0,0 +1,75 @@
---
checkId: check.integration.teams
plugin: stellaops.doctor.integration
severity: info
tags: [notification, teams, webhook]
---
# Teams Webhook

## What It Checks
Reads the Microsoft Teams webhook URL from `Teams:WebhookUrl` or `Notify:Teams:WebhookUrl`. First validates the URL format: **warns** if the URL does not contain `webhook.office.com` or `teams.microsoft.com`. Then tests host reachability by sending an HTTP GET to the base URL of the webhook host. The check **passes** if the Teams host is reachable, **warns** if the host is unreachable or if the URL format is suspicious. Does not send an actual webhook payload to avoid generating noise in the Teams channel.
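The format validation is a substring test against the two accepted Microsoft domains; a rough sketch with a placeholder URL:

```bash
webhook_url="https://webhook.office.com/webhookb2/T000/IncomingWebhook/XXXX"  # placeholder
case "$webhook_url" in
  *"webhook.office.com"*|*"teams.microsoft.com"*) verdict="format ok" ;;
  *) verdict="warn: unexpected Teams webhook host" ;;
esac
echo "$verdict"
```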
## Why It Matters
Microsoft Teams notifications keep operators informed about deployment status, policy violations, security findings, and approval requests. A misconfigured or unreachable Teams webhook means critical alerts go undelivered, potentially delaying incident response and approval workflows. For organizations standardized on Microsoft 365, Teams may be the primary notification channel.

## Common Causes
- Network connectivity issues between Stella Ops and Microsoft services
- Firewall blocking outbound HTTPS to `webhook.office.com`
- Proxy misconfiguration preventing external HTTPS
- Webhook URL is malformed or was copied incorrectly
- Teams webhook connector has been removed or regenerated
- Microsoft has migrated to a new webhook URL domain

## How to Fix

### Docker Compose
```bash
# Check Teams webhook configuration
grep 'TEAMS__WEBHOOKURL\|NOTIFY__TEAMS' .env

# Test connectivity to Teams webhook host
docker compose exec gateway curl -sv https://webhook.office.com/ -o /dev/null

# Update webhook URL
echo 'Teams__WebhookUrl=https://webhook.office.com/webhookb2/...' >> .env
docker compose restart platform

# If behind a proxy
echo 'HTTP_PROXY=http://proxy:8080' >> .env
echo 'HTTPS_PROXY=http://proxy:8080' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Verify configuration
jq '.Teams' /etc/stellaops/appsettings.Production.json

# Test connectivity
curl -sv https://webhook.office.com/ -o /dev/null

# Update webhook URL
sudo nano /etc/stellaops/appsettings.Production.json
sudo systemctl restart stellaops-platform
```

### Kubernetes / Helm
```yaml
# values.yaml
teams:
  webhookUrl: https://webhook.office.com/webhookb2/...
  # or use an existing secret
  existingSecret: stellaops-teams-webhook
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification
```
stella doctor run --check check.integration.teams
```

## Related Checks
- `check.integration.slack` -- Slack webhook (alternative notification channel)
- `check.integration.webhooks` -- general webhook health monitoring
77
docs/doctor/articles/integration/webhook-health.md
Normal file
@@ -0,0 +1,77 @@
---
checkId: check.integration.webhooks
plugin: stellaops.doctor.integration
severity: warn
tags: [integration, webhooks, notifications, events]
---
# Integration Webhook Health

## What It Checks
Iterates over all webhook endpoints defined under `Webhooks:Endpoints`. For **outbound** webhooks it sends an HTTP HEAD request to the target URL and considers the endpoint reachable if the response status code is below 500. For **inbound** webhooks it marks reachability as true (endpoint is local). It then calculates the delivery failure rate from `TotalDeliveries` and `SuccessfulDeliveries` counters. The check **fails** if any outbound endpoint is unreachable or if any webhook's failure rate exceeds 20%, **warns** if any webhook's failure rate is between 5% and 20%, and **passes** otherwise.
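The threshold logic can be reproduced directly from the two counters; the sample values below are illustrative:

```bash
# Sample counters; live values come from TotalDeliveries / SuccessfulDeliveries.
total=200 ok=186
failed=$((total - ok))
rate=$((failed * 100 / total))   # integer percent is enough for these thresholds
if   [ "$rate" -gt 20 ]; then verdict="FAIL: ${rate}% failures"
elif [ "$rate" -ge 5 ];  then verdict="WARN: ${rate}% failures"
else                          verdict="PASS: ${rate}% failures"
fi
echo "$verdict"
```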
## Why It Matters
Webhooks are the primary event-driven communication channel between Stella Ops and external systems. Unreachable outbound endpoints mean notifications, CI triggers, and audit event deliveries silently fail. A rising failure rate is an early warning of endpoint degradation that can cascade into missed alerts, delayed approvals, and incomplete audit trails.

## Common Causes
- Webhook endpoint is down or returning 5xx errors
- Network connectivity issue or DNS resolution failure
- TLS certificate expired or untrusted
- Payload format changed, causing the receiver to reject events
- Rate limiting by the receiving service
- Intermittent timeouts under load

## How to Fix

### Docker Compose
```bash
# List configured webhooks
grep 'WEBHOOKS__' .env

# Test an outbound webhook endpoint
docker compose exec gateway curl -I https://hooks.example.com/stellaops

# View webhook delivery logs
docker compose logs platform | grep -i webhook

# Update a webhook URL
echo 'Webhooks__Endpoints__0__Url=https://hooks.example.com/v2/stellaops' >> .env
docker compose restart platform
```

### Bare Metal / systemd
```bash
# Check webhook configuration
jq '.Webhooks' /etc/stellaops/appsettings.Production.json

# Test endpoint connectivity
curl -I https://hooks.example.com/stellaops

# Review delivery history
stella webhooks logs <webhook-name> --status failed

# Retry failed deliveries
stella webhooks retry <webhook-name>
```

### Kubernetes / Helm
```yaml
# values.yaml
webhooks:
  endpoints:
    - name: slack-releases
      url: https://hooks.example.com/stellaops
      direction: outbound
```
```bash
helm upgrade stellaops ./chart -f values.yaml
```

## Verification
```
stella doctor run --check check.integration.webhooks
```

## Related Checks
- `check.integration.slack` -- Slack-specific webhook validation
- `check.integration.teams` -- Teams-specific webhook validation
- `check.integration.ci.system` -- CI systems that receive webhook events
86
docs/doctor/articles/notify/email-configured.md
Normal file
@@ -0,0 +1,86 @@
---
checkId: check.notify.email.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, email, smtp, quick, configuration]
---
# Email Configuration

## What It Checks
Verifies that the email (SMTP) notification channel is properly configured. The check reads the `Notify:Channels:Email` configuration section and validates:

- **SMTP host** (`SmtpHost` or `Host`): must be set and non-empty.
- **SMTP port** (`SmtpPort` or `Port`): must be a valid number between 1 and 65535.
- **From address** (`FromAddress` or `From`): must be set so outbound emails have a valid sender.
- **Enabled flag** (`Enabled`): if explicitly set to `false`, reports a warning that the channel is configured but disabled.

The check only runs when the `Notify:Channels:Email` configuration section exists.
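The three validations can be sketched in shell; the sample values stand in for what the check reads from `Notify:Channels:Email`:

```bash
smtp_host="smtp.example.com" smtp_port="587" from_address="noreply@example.com"
problems=""
[ -n "$smtp_host" ] || problems="$problems missing-host"
case "$smtp_port" in
  ''|*[!0-9]*) problems="$problems invalid-port" ;;
  *) { [ "$smtp_port" -ge 1 ] && [ "$smtp_port" -le 65535 ]; } || problems="$problems invalid-port" ;;
esac
[ -n "$from_address" ] || problems="$problems missing-from"
verdict="${problems:-ok}"
echo "$verdict"
```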
## Why It Matters
Email notifications deliver critical alerts for release gate failures, policy violations, and security findings. Without a properly configured SMTP host, no email notifications can be sent, leaving operators blind to events that require immediate action. A missing from-address causes emails to be rejected by receiving mail servers.

## Common Causes
- SMTP host not set in configuration
- Missing `Notify:Channels:Email:SmtpHost` setting
- SMTP port not specified or set to an invalid value
- From address not configured
- Email channel explicitly disabled in configuration

## How to Fix

### Docker Compose
Add environment variables to your service definition:

```yaml
environment:
  Notify__Channels__Email__SmtpHost: "smtp.example.com"
  Notify__Channels__Email__SmtpPort: "587"
  Notify__Channels__Email__FromAddress: "noreply@example.com"
  Notify__Channels__Email__UseSsl: "true"
```

### Bare Metal / systemd
Edit `appsettings.json`:

```json
{
  "Notify": {
    "Channels": {
      "Email": {
        "SmtpHost": "smtp.example.com",
        "SmtpPort": 587,
        "FromAddress": "noreply@example.com",
        "UseSsl": true
      }
    }
  }
}
```

Restart the service:
```bash
sudo systemctl restart stellaops-notify
```

### Kubernetes / Helm
Set values in your Helm `values.yaml`:

```yaml
notify:
  channels:
    email:
      smtpHost: "smtp.example.com"
      smtpPort: 587
      fromAddress: "noreply@example.com"
      useSsl: true
      credentialsSecret: "stellaops-smtp-credentials"
```

## Verification
```
stella doctor run --check check.notify.email.configured
```

## Related Checks
- `check.notify.email.connectivity` -- tests whether the configured SMTP server is reachable
- `check.notify.queue.health` -- verifies the notification delivery queue is healthy
78
docs/doctor/articles/notify/email-connectivity.md
Normal file
@@ -0,0 +1,78 @@
---
checkId: check.notify.email.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, email, smtp, connectivity, network]
---
# Email Connectivity

## What It Checks
Verifies that the configured SMTP server is reachable by opening a TCP connection to the SMTP host and port. The check:

- Opens a TCP socket to `SmtpHost:SmtpPort` with a 10-second timeout.
- Reads the SMTP banner and verifies it starts with `220` (standard SMTP greeting).
- Reports an info-level result if the connection succeeds but the banner is not a recognized SMTP response.
- Fails if the connection times out, is refused, or encounters a socket error.

The check only runs when both `SmtpHost` and `SmtpPort` are configured with valid values.
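The banner classification is a prefix test on the first line the server sends; sketched here with a canned banner:

```bash
banner="220 smtp.example.com ESMTP ready"   # canned sample; the real banner is read from the socket
case "$banner" in
  220*) verdict="PASS: SMTP greeting received" ;;
  *)    verdict="INFO: connected, but not a recognized SMTP banner" ;;
esac
echo "$verdict"
```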
## Why It Matters
A configured but unreachable SMTP server means email notifications will silently fail. Release gate alerts, security finding notifications, and approval requests will never reach operators, potentially delaying incident response.

## Common Causes
- SMTP server not running
- Wrong host or port in configuration
- Firewall blocking outbound SMTP connections
- DNS resolution failure for the SMTP hostname
- Network latency too high (exceeding the 10-second timeout)

## How to Fix

### Docker Compose
Verify network connectivity from the container:

```bash
docker exec <notify-container> nc -zv smtp.example.com 587
docker exec <notify-container> nslookup smtp.example.com
```

Ensure the container network can reach the SMTP server. If behind a proxy, configure it:
```yaml
environment:
  HTTP_PROXY: "http://proxy.example.com:8080"
```

### Bare Metal / systemd
Test connectivity manually:

```bash
nc -zv smtp.example.com 587
telnet smtp.example.com 587
nslookup smtp.example.com
```

Check firewall rules:
```bash
sudo iptables -L -n | grep 587
```

### Kubernetes / Helm
Verify connectivity from the pod:

```bash
kubectl exec -it <notify-pod> -- nc -zv smtp.example.com 587
```

Check NetworkPolicy resources that might block egress:
```bash
kubectl get networkpolicy -n stellaops
```

## Verification
```
stella doctor run --check check.notify.email.connectivity
```

## Related Checks
- `check.notify.email.configured` -- verifies SMTP configuration is complete
- `check.notify.queue.health` -- verifies the notification delivery queue is healthy
93
docs/doctor/articles/notify/queue-health.md
Normal file
@@ -0,0 +1,93 @@
---
checkId: check.notify.queue.health
plugin: stellaops.doctor.notify
severity: fail
tags: [notify, queue, redis, nats, infrastructure]
---
# Notification Queue Health

## What It Checks
Verifies that the notification event and delivery queues are healthy. The check:

- Reads the `Notify:Queue:Transport` (or `Kind`) setting to determine the queue transport type (Redis/Valkey or NATS).
- Resolves `NotifyQueueHealthCheck` and `NotifyDeliveryQueueHealthCheck` from the DI container.
- Invokes each registered health check and aggregates the results.
- Fails if any queue reports an `Unhealthy` status; warns if degraded; passes if all are healthy.

The check only runs when a queue transport is configured in `Notify:Queue:Transport`.
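The aggregation rule (any `Unhealthy` outranks `Degraded`, which outranks healthy) can be sketched with sample statuses:

```bash
# Sample statuses; live values come from the two registered health checks.
event_queue="Healthy" delivery_queue="Degraded"
verdict="PASS"
for status in "$event_queue" "$delivery_queue"; do
  case "$status" in
    Unhealthy) verdict="FAIL" ;;
    Degraded)  [ "$verdict" = "PASS" ] && verdict="WARN" ;;
  esac
done
echo "$verdict"
```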
## Why It Matters
The notification queue is the backbone of the notification pipeline. If the event queue is unhealthy, new notification events are lost. If the delivery queue is unhealthy, pending notifications to email, Slack, Teams, and webhook channels will not be delivered. This is a severity-fail check because queue failure means complete notification blackout.

## Common Causes
- Queue server (Redis/Valkey/NATS) not running
- Network connectivity issues between the Notify service and the queue server
- Authentication failure (wrong password or credentials)
- Incorrect connection string in configuration

## How to Fix

### Docker Compose
For Redis/Valkey transport:

```bash
# Check Redis health
docker exec <redis-container> redis-cli ping

# Check connection string
docker exec <notify-container> env | grep Notify__Queue

# Restart Redis if needed
docker restart <redis-container>
```

For NATS transport:

```bash
# Check NATS server status
docker exec <nats-container> nats server ping

# Check NATS logs
docker logs <nats-container> --tail 50
```

### Bare Metal / systemd
```bash
# Redis/Valkey
redis-cli ping
redis-cli info server

# NATS
nats server ping
systemctl status nats
```

Verify the connection string in `appsettings.json`:
```json
{
  "Notify": {
    "Queue": {
      "Transport": "redis",
      "Redis": {
        "ConnectionString": "127.0.0.1:6379"
      }
    }
  }
}
```

### Kubernetes / Helm
```bash
kubectl exec -it <redis-pod> -- redis-cli ping
kubectl logs <notify-pod> --tail 50 | grep -i queue
```

## Verification
```
stella doctor run --check check.notify.queue.health
```

## Related Checks
- `check.notify.email.configured` -- verifies email channel configuration
- `check.notify.slack.configured` -- verifies Slack channel configuration
- `check.notify.webhook.configured` -- verifies webhook channel configuration
72
docs/doctor/articles/notify/slack-configured.md
Normal file
@@ -0,0 +1,72 @@
---
checkId: check.notify.slack.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, slack, quick, configuration]
---
# Slack Configuration

## What It Checks
Verifies that the Slack notification channel is properly configured. The check reads `Notify:Channels:Slack` and validates:

- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning that Slack is configured but disabled.

The check only runs when the `Notify:Channels:Slack` configuration section exists.

## Why It Matters
Slack is a primary real-time notification channel for many operations teams. Without a configured webhook URL, security alerts, release gate notifications, and approval requests cannot reach Slack channels, delaying incident response.

## Common Causes
- Slack webhook URL not set in configuration
- Missing `Notify:Channels:Slack:WebhookUrl` setting
- Environment variable not bound to configuration
- Slack notifications explicitly disabled

## How to Fix

### Docker Compose
```yaml
environment:
  Notify__Channels__Slack__WebhookUrl: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

> **Security note:** Slack webhook URLs are secrets. Store them in a secrets manager or Docker secrets, not in plain-text compose files.

### Bare Metal / systemd
Edit `appsettings.json`:

```json
{
  "Notify": {
    "Channels": {
      "Slack": {
        "WebhookUrl": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
      }
    }
  }
}
```

### Kubernetes / Helm
```yaml
notify:
  channels:
    slack:
      webhookUrlSecret: "stellaops-slack-webhook"
```

Create the secret:
```bash
kubectl create secret generic stellaops-slack-webhook \
  --from-literal=url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

## Verification
```
stella doctor run --check check.notify.slack.configured
```

## Related Checks
- `check.notify.slack.connectivity` -- tests whether the Slack webhook endpoint is reachable
- `check.notify.queue.health` -- verifies the notification delivery queue is healthy
68
docs/doctor/articles/notify/slack-connectivity.md
Normal file
@@ -0,0 +1,68 @@
---
checkId: check.notify.slack.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, slack, connectivity, network]
---
# Slack Connectivity

## What It Checks
Verifies that the configured Slack webhook endpoint is reachable. The check:

- Sends an empty-text POST payload to the webhook URL with a 10-second timeout.
- Relies on Slack returning `no_text` for empty messages, which proves the endpoint is alive without posting a visible message.
- Passes if the response is successful or contains `no_text`.
- Warns if an unexpected HTTP status is returned (e.g., invalid or revoked webhook).
- Fails on connection timeout or HTTP request exceptions.

The check only runs when `Notify:Channels:Slack:WebhookUrl` is set and is a valid absolute URL.
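
The decision rules above can be sketched as a small classifier (hypothetical names; the actual check lives in the Notify doctor plugin):

```python
from typing import Optional

def classify_slack_probe(status: Optional[int], body: str = "",
                         timed_out: bool = False) -> str:
    """Classify an empty-text POST to a Slack webhook.

    - timeout or transport error        -> "fail"
    - 2xx, or body containing "no_text" -> "pass" (endpoint is alive)
    - anything else                     -> "warn" (likely revoked webhook)
    """
    if timed_out or status is None:
        return "fail"
    if 200 <= status < 300 or "no_text" in body:
        return "pass"
    return "warn"

print(classify_slack_probe(400, body="no_text"))   # pass
print(classify_slack_probe(404))                   # warn
print(classify_slack_probe(None, timed_out=True))  # fail
```
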

## Why It Matters
A configured but unreachable Slack webhook means notifications are silently dropped. Teams relying on Slack for release alerts and security findings will miss critical events.

## Common Causes
- Invalid or expired webhook URL
- Slack workspace configuration changed
- Webhook URL revoked or regenerated
- Rate limiting by Slack
- Firewall blocking outbound HTTPS to hooks.slack.com
- Proxy configuration required but not set

## How to Fix

### Docker Compose
Test connectivity from the container:

```bash
docker exec <notify-container> curl -v https://hooks.slack.com/
```

If behind a proxy:

```yaml
environment:
  HTTPS_PROXY: "http://proxy.example.com:8080"
```

### Bare Metal / systemd
```bash
curl -v https://hooks.slack.com/
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Doctor test"}' \
  'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```

### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://hooks.slack.com/
```

If the webhook URL has been revoked, create a new one in the Slack App settings under **Incoming Webhooks** and update the configuration.

## Verification
```
stella doctor run --check check.notify.slack.connectivity
```

## Related Checks
- `check.notify.slack.configured` — verifies Slack webhook URL is set
- `check.notify.queue.health` — verifies the notification delivery queue is healthy
67
docs/doctor/articles/notify/teams-configured.md
Normal file
@@ -0,0 +1,67 @@
---
checkId: check.notify.teams.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, teams, quick, configuration]
---
# Teams Configuration

## What It Checks
Verifies that the Microsoft Teams notification channel is properly configured. The check reads `Notify:Channels:Teams` and validates:

- **Webhook URL** (`WebhookUrl`): must be set and non-empty.
- **URL format**: validates that the URL belongs to a Microsoft domain (`webhook.office.com` or `microsoft.com`).
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.

The check only runs when the `Notify:Channels:Teams` configuration section exists.
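
The URL-format rule can be sketched as follows (illustrative; the hostname comparison mirrors the two domains listed above):

```python
from urllib.parse import urlparse

MICROSOFT_DOMAINS = ("webhook.office.com", "microsoft.com")

def is_valid_teams_webhook(url: str) -> bool:
    """True when the URL is absolute http(s) and its host is one of the
    Microsoft domains above, or a subdomain of one. Illustrative sketch."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if parsed.scheme not in ("http", "https") or not host:
        return False
    return any(host == d or host.endswith("." + d) for d in MICROSOFT_DOMAINS)

print(is_valid_teams_webhook("https://contoso.webhook.office.com/webhookb2/abc"))  # True
print(is_valid_teams_webhook("https://hooks.slack.com/services/T000/B000/XXX"))    # False
```
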

## Why It Matters
Teams is a common enterprise notification channel. Without a valid webhook URL, notifications about release decisions, policy violations, and security findings cannot reach Teams channels.

## Common Causes
- Teams webhook URL not set in configuration
- Webhook URL is not from a Microsoft domain (malformed or legacy URL)
- Teams notifications explicitly disabled
- Environment variable not bound to configuration

## How to Fix

### Docker Compose
```yaml
environment:
  Notify__Channels__Teams__WebhookUrl: "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
```

> **Security note:** Teams webhook URLs are secrets. Use Docker secrets or a vault.

### Bare Metal / systemd
```json
{
  "Notify": {
    "Channels": {
      "Teams": {
        "WebhookUrl": "https://YOUR_TENANT.webhook.office.com/webhookb2/..."
      }
    }
  }
}
```

### Kubernetes / Helm
```yaml
notify:
  channels:
    teams:
      webhookUrlSecret: "stellaops-teams-webhook"
```

To create the webhook in Teams: Channel > Connectors > Incoming Webhook > Create.

## Verification
```
stella doctor run --check check.notify.teams.configured
```

## Related Checks
- `check.notify.teams.connectivity` — tests whether the Teams webhook endpoint is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy
60
docs/doctor/articles/notify/teams-connectivity.md
Normal file
@@ -0,0 +1,60 @@
---
checkId: check.notify.teams.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, teams, connectivity, network]
---
# Teams Connectivity

## What It Checks
Verifies that the configured Microsoft Teams webhook endpoint is reachable. The check:

- Sends a minimal Adaptive Card payload to the webhook URL with a 10-second timeout.
- Passes if the response is successful (HTTP 2xx).
- Warns if an unexpected HTTP status is returned (invalid, expired, or revoked webhook).
- Fails on connection timeout or HTTP request exceptions.

The check only runs when `Notify:Channels:Teams:WebhookUrl` is set and is a valid absolute URL.
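
A minimal Adaptive Card message envelope looks roughly like the sketch below (hand-written to follow the public Adaptive Card schema; the exact payload the check sends may differ):

```python
import json

def minimal_adaptive_card(text: str) -> dict:
    """Build a minimal Adaptive Card envelope for a Teams incoming webhook.

    Illustrative sketch following the public message/attachment shape,
    not StellaOps internals.
    """
    return {
        "type": "message",
        "attachments": [{
            "contentType": "application/vnd.microsoft.card.adaptive",
            "content": {
                "type": "AdaptiveCard",
                "version": "1.4",
                "body": [{"type": "TextBlock", "text": text}],
            },
        }],
    }

print(json.dumps(minimal_adaptive_card("Doctor connectivity probe"))[:40])
```
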

## Why It Matters
An unreachable Teams webhook means notifications silently fail to deliver. Operations teams will miss release alerts and security findings if the webhook is broken.

## Common Causes
- Invalid or expired webhook URL
- Teams connector disabled or deleted
- Microsoft 365 tenant configuration changed
- Firewall blocking outbound HTTPS to webhook.office.com
- Proxy configuration required but not set

## How to Fix

### Docker Compose
```bash
docker exec <notify-container> curl -v https://webhook.office.com/
```

### Bare Metal / systemd
```bash
curl -v https://webhook.office.com/
curl -H 'Content-Type: application/json' \
  -d '{"text":"Doctor test"}' \
  'https://YOUR_TENANT.webhook.office.com/webhookb2/...'
```

Check Microsoft 365 service status at https://status.office.com.

### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://webhook.office.com/
```

If the webhook is broken, recreate it: Teams channel > Connectors > Incoming Webhook > delete and recreate.

## Verification
```
stella doctor run --check check.notify.teams.connectivity
```

## Related Checks
- `check.notify.teams.configured` — verifies Teams webhook URL is set and valid
- `check.notify.queue.health` — verifies the notification delivery queue is healthy
68
docs/doctor/articles/notify/webhook-configured.md
Normal file
@@ -0,0 +1,68 @@
---
checkId: check.notify.webhook.configured
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, webhook, quick, configuration]
---
# Webhook Configuration

## What It Checks
Verifies that the generic webhook notification channel is properly configured. The check reads `Notify:Channels:Webhook` and validates:

- **URL** (`Url` or `Endpoint`): must be set and be a valid HTTP or HTTPS URL.
- **Enabled flag** (`Enabled`): if explicitly `false`, reports a warning.
- Also reads `Method` (defaults to `POST`) and `ContentType` (defaults to `application/json`) for evidence.

The check only runs when the `Notify:Channels:Webhook` configuration section exists.
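
The URL rule amounts to "absolute http/https with a host", roughly (illustrative sketch, not the actual validator):

```python
from urllib.parse import urlparse

def is_valid_webhook_url(url: str) -> bool:
    """True when the value parses as an absolute http/https URL with a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_webhook_url("https://your-endpoint/webhook"))  # True
print(is_valid_webhook_url("your-endpoint/webhook"))          # False (missing scheme)
```
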

## Why It Matters
Generic webhooks integrate Stella Ops notifications with third-party systems (PagerDuty, OpsGenie, custom dashboards, SIEM tools). A missing or malformed URL prevents these integrations from receiving events.

## Common Causes
- Webhook URL not set in configuration
- Malformed URL (missing protocol `http://` or `https://`)
- Invalid characters in URL
- Webhook channel explicitly disabled

## How to Fix

### Docker Compose
```yaml
environment:
  Notify__Channels__Webhook__Url: "https://your-endpoint/webhook"
  Notify__Channels__Webhook__Method: "POST"
  Notify__Channels__Webhook__ContentType: "application/json"
```

### Bare Metal / systemd
```json
{
  "Notify": {
    "Channels": {
      "Webhook": {
        "Url": "https://your-endpoint/webhook",
        "Method": "POST",
        "ContentType": "application/json"
      }
    }
  }
}
```

### Kubernetes / Helm
```yaml
notify:
  channels:
    webhook:
      url: "https://your-endpoint/webhook"
      method: "POST"
```

## Verification
```
stella doctor run --check check.notify.webhook.configured
```

## Related Checks
- `check.notify.webhook.connectivity` — tests whether the webhook endpoint is reachable
- `check.notify.queue.health` — verifies the notification delivery queue is healthy
||||
58
docs/doctor/articles/notify/webhook-connectivity.md
Normal file
@@ -0,0 +1,58 @@
---
checkId: check.notify.webhook.connectivity
plugin: stellaops.doctor.notify
severity: warn
tags: [notify, webhook, connectivity, network]
---
# Webhook Connectivity

## What It Checks
Verifies that the configured generic webhook endpoint is reachable. The check:

- Sends a HEAD request to the webhook URL (falls back to OPTIONS if HEAD is unsupported) with a 10-second timeout.
- Treats any response with HTTP status < 500 as reachable (even 401/403, which indicate the endpoint exists but requires authentication).
- Warns on HTTP 5xx responses (server-side errors).
- Fails on connection timeout or HTTP request exceptions.

The check only runs when `Notify:Channels:Webhook:Url` (or `Endpoint`) is set and is a valid absolute URL.
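
The probe order and classification rules can be sketched as follows (illustrative names, not the actual implementation):

```python
from typing import Optional

PROBE_METHODS = ("HEAD", "OPTIONS")  # HEAD first, OPTIONS as fallback

def classify_reachability(status: Optional[int] = None,
                          timed_out: bool = False) -> str:
    """Classify a probe response per the rules above:

    - timeout or transport error          -> "fail"
    - any status < 500 (even 401/403)     -> "pass" (endpoint exists)
    - 5xx                                 -> "warn" (server-side error)
    """
    if timed_out or status is None:
        return "fail"
    return "pass" if status < 500 else "warn"

print(classify_reachability(403))  # pass: endpoint exists, auth required
print(classify_reachability(503))  # warn
```
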

## Why It Matters
A configured but unreachable webhook endpoint means third-party integrations silently stop receiving notifications. Events that should trigger PagerDuty alerts, SIEM ingestion, or custom dashboard updates will be lost.

## Common Causes
- Endpoint server not responding
- Network connectivity issue or firewall blocking the connection
- DNS resolution failure
- TLS/SSL certificate problem on the endpoint
- Webhook endpoint service is down

## How to Fix

### Docker Compose
```bash
docker exec <notify-container> curl -v --max-time 10 https://your-endpoint/webhook
docker exec <notify-container> nslookup your-endpoint
```

### Bare Metal / systemd
```bash
curl -I https://your-endpoint/webhook
nslookup your-endpoint
nc -zv your-endpoint 443
```

### Kubernetes / Helm
```bash
kubectl exec -it <notify-pod> -- curl -v https://your-endpoint/webhook
```

Check that egress NetworkPolicies allow traffic to the webhook destination.

## Verification
```
stella doctor run --check check.notify.webhook.connectivity
```

## Related Checks
- `check.notify.webhook.configured` — verifies webhook URL is set and valid
- `check.notify.queue.health` — verifies the notification delivery queue is healthy
68
docs/doctor/articles/observability/log-directory-writable.md
Normal file
@@ -0,0 +1,68 @@
---
checkId: check.logs.directory.writable
plugin: stellaops.doctor.observability
severity: fail
tags: [observability, logs, quick]
---
# Log Directory Writable

## What It Checks
Verifies that the log directory exists and is writable. The check:

- Reads the log path from `Logging:Path` configuration. Falls back to platform defaults: `/var/log/stellaops` on Linux, `%ProgramData%\StellaOps\logs` on Windows.
- Verifies the directory exists.
- Writes a temporary file to test write access, then deletes it.
- Fails if the directory does not exist, is not writable due to permissions, or encounters an I/O error.
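
The write probe described above can be sketched like this (illustrative; the real check is part of the Observability doctor plugin):

```python
import os
import tempfile

def probe_writable(directory: str) -> bool:
    """Probe write access the way the check does: create a temporary file
    in the directory, then remove it. Returns False when the directory is
    missing or not writable.
    """
    if not os.path.isdir(directory):
        return False
    try:
        fd, path = tempfile.mkstemp(prefix=".doctor-probe-", dir=directory)
        os.close(fd)
        os.unlink(path)
        return True
    except OSError:
        return False

print(probe_writable(tempfile.gettempdir()))  # True on a healthy system
```
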

## Why It Matters
If the log directory is not writable, application logs are silently lost. Without logs, troubleshooting service failures, debugging policy evaluation issues, and performing security incident investigations becomes impossible. This is a severity-fail check because log loss breaks the auditability guarantee.

## Common Causes
- Log directory not created during installation
- Directory was deleted
- Configuration points to the wrong path
- Insufficient permissions or directory owned by a different user
- Read-only file system
- Disk full

## How to Fix

### Docker Compose
```yaml
volumes:
  - log-data:/var/log/stellaops
```

```bash
docker exec <platform-container> mkdir -p /var/log/stellaops
```

### Bare Metal / systemd
```bash
# Create log directory
sudo mkdir -p /var/log/stellaops

# Set ownership and permissions
sudo chown -R stellaops:stellaops /var/log/stellaops
sudo chmod 755 /var/log/stellaops
```

### Kubernetes / Helm
```yaml
logging:
  path: "/var/log/stellaops"
  persistence:
    enabled: true
    size: 10Gi
```

Or use an `emptyDir` volume for ephemeral log storage with a sidecar shipping logs to an external system.

## Verification
```
stella doctor run --check check.logs.directory.writable
```

## Related Checks
- `check.logs.rotation.configured` — verifies log rotation is configured
- `check.storage.diskspace` — verifies sufficient disk space is available
83
docs/doctor/articles/observability/log-rotation.md
Normal file
@@ -0,0 +1,83 @@
---
checkId: check.logs.rotation.configured
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, logs]
---
# Log Rotation

## What It Checks
Verifies that log rotation is configured to prevent disk exhaustion. The check:

- Looks for application-level rotation via `Logging:RollingPolicy` configuration.
- Checks for Serilog rolling configuration at `Serilog:WriteTo:0:Args:rollingInterval`.
- On Linux, checks for system-level logrotate at `/etc/logrotate.d/stellaops`.
- Scans log files in the log directory and flags any file exceeding 100MB.
- Warns if rotation is not configured and large log files exist or total log size exceeds 200MB.
- Reports info if rotation is not configured but logs are still small.
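
The size thresholds combine like this (a minimal sketch of the decision rules, not the actual check):

```python
MAX_FILE_BYTES = 100 * 1024 * 1024   # per-file flag threshold (100MB)
MAX_TOTAL_BYTES = 200 * 1024 * 1024  # total-size warn threshold (200MB)

def classify_logs(file_sizes, rotation_configured: bool) -> str:
    """Apply the thresholds above to a list of log file sizes in bytes."""
    if rotation_configured:
        return "pass"
    oversized = any(size > MAX_FILE_BYTES for size in file_sizes)
    if oversized or sum(file_sizes) > MAX_TOTAL_BYTES:
        return "warn"
    return "info"  # no rotation yet, but logs are still small

print(classify_logs([150 * 1024 * 1024], rotation_configured=False))  # warn
```
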

## Why It Matters
Without log rotation, log files grow unbounded until they exhaust disk space. Disk exhaustion causes cascading failures across all services. Even before exhaustion, very large log files are slow to search and analyze during incident response.

## Common Causes
- Log rotation not configured in application settings
- logrotate not installed or stellaops config missing from `/etc/logrotate.d/`
- Application-level rotation disabled
- Rotation threshold set too high
- Very high log volume overwhelming the rotation schedule

## How to Fix

### Docker Compose
Set application-level log rotation:

```yaml
environment:
  Logging__RollingPolicy: "Size"
  Serilog__WriteTo__0__Args__rollingInterval: "Day"
  Serilog__WriteTo__0__Args__fileSizeLimitBytes: "104857600" # 100MB
```

### Bare Metal / systemd
Option 1 -- Application-level rotation in `appsettings.json`:
```json
{
  "Logging": {
    "RollingPolicy": "Size"
  }
}
```

Option 2 -- System-level logrotate:
```bash
sudo cp /usr/share/stellaops/logrotate.conf /etc/logrotate.d/stellaops

# Or create manually:
cat <<EOF | sudo tee /etc/logrotate.d/stellaops
/var/log/stellaops/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    maxsize 100M
}
EOF
```

### Kubernetes / Helm
```yaml
logging:
  rollingPolicy: "Size"
  maxFileSizeMB: 100
  retainFiles: 14
```

## Verification
```
stella doctor run --check check.logs.rotation.configured
```

## Related Checks
- `check.logs.directory.writable` — verifies the log directory exists and is writable
- `check.storage.diskspace` — verifies sufficient disk space is available
86
docs/doctor/articles/observability/otlp-endpoint.md
Normal file
@@ -0,0 +1,86 @@
---
checkId: check.telemetry.otlp.endpoint
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, telemetry, otlp]
---
# OTLP Endpoint

## What It Checks
Verifies that the OTLP (OpenTelemetry Protocol) collector endpoint is reachable. The check:

- Reads the endpoint from `Telemetry:OtlpEndpoint` configuration.
- Sends a GET request to `{endpoint}/v1/health` with a 5-second timeout.
- Passes if the endpoint returns a successful HTTP response.
- Warns on non-success status codes, timeouts, or connection failures.

The check only runs when `Telemetry:OtlpEndpoint` is configured.
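
The probe URL is derived from the configured endpoint roughly like this (illustrative sketch; the `/v1/health` path follows the description above):

```python
def health_url(endpoint: str) -> str:
    """Build the health probe URL from Telemetry:OtlpEndpoint,
    tolerating a trailing slash on the configured value."""
    return endpoint.rstrip("/") + "/v1/health"

print(health_url("http://otel-collector:4317"))
# http://otel-collector:4317/v1/health
```
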

## Why It Matters
OTLP is the standard protocol for exporting traces, metrics, and logs to observability backends (Grafana, Jaeger, Datadog, etc.). If the collector is unreachable, telemetry data is lost, making it impossible to monitor service performance, trace request flows, or detect anomalies.

## Common Causes
- OTLP collector not running
- Wrong endpoint configured
- Network connectivity issue or firewall blocking the connection
- Collector health endpoint not available at `/v1/health`

## How to Fix

### Docker Compose
```yaml
environment:
  Telemetry__OtlpEndpoint: "http://otel-collector:4317"
```

```bash
# Check if the collector is running
docker ps | grep otel

# Check collector logs
docker logs otel-collector --tail 50

# Test connectivity
docker exec <platform-container> curl -v http://otel-collector:4317/v1/health
```

### Bare Metal / systemd
```bash
# Check collector status
systemctl status otel-collector

# Test endpoint
curl -v http://localhost:4317/v1/health

# Check port binding
netstat -an | grep 4317
```

Edit `appsettings.json`:
```json
{
  "Telemetry": {
    "OtlpEndpoint": "http://localhost:4317"
  }
}
```

### Kubernetes / Helm
```yaml
telemetry:
  otlpEndpoint: "http://otel-collector.monitoring.svc:4317"
```

```bash
kubectl get pods -n monitoring | grep otel
kubectl logs -n monitoring <otel-collector-pod> --tail 50
```

## Verification
```
stella doctor run --check check.telemetry.otlp.endpoint
```

## Related Checks
- `check.metrics.prometheus.scrape` — verifies Prometheus metrics endpoint accessibility
- `check.logs.directory.writable` — verifies the log directory is writable
90
docs/doctor/articles/observability/prometheus-scrape.md
Normal file
@@ -0,0 +1,90 @@
---
checkId: check.metrics.prometheus.scrape
plugin: stellaops.doctor.observability
severity: warn
tags: [observability, metrics, prometheus]
---
# Prometheus Scrape

## What It Checks
Verifies that the application metrics endpoint is accessible for Prometheus scraping. The check:

- Reads `Metrics:Path` (default `/metrics`), `Metrics:Port` (default `8080`), and `Metrics:Host` (default `localhost`).
- Sends a GET request to `http://{host}:{port}{path}` with a 5-second timeout.
- Counts the number of Prometheus-formatted metric lines in the response.
- Passes if the endpoint returns a successful response with metrics.
- Warns on non-success status codes, timeouts, or connection failures.

The check only runs when `Metrics:Enabled` is set to `true`.
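
Counting metric lines in the Prometheus text exposition format amounts to skipping blanks and `# HELP` / `# TYPE` comments (illustrative sketch):

```python
def count_metric_lines(exposition: str) -> int:
    """Count sample lines in Prometheus text exposition format,
    skipping blank lines and # comment lines."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )

sample = """# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
"""
print(count_metric_lines(sample))  # 2
```
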

## Why It Matters
Prometheus metrics provide real-time visibility into service health, request latencies, error rates, and resource utilization. Without a scrapeable metrics endpoint, alerting rules cannot fire, dashboards go blank, and capacity planning has no data.

## Common Causes
- Metrics endpoint not enabled in configuration
- Wrong port configured
- Service not running on the expected port
- Authentication required but not configured for Prometheus
- Firewall blocking the metrics port

## How to Fix

### Docker Compose
```yaml
environment:
  Metrics__Enabled: "true"
  Metrics__Path: "/metrics"
  Metrics__Port: "8080"
```

```bash
# Test the metrics endpoint
docker exec <platform-container> curl -s http://localhost:8080/metrics | head -5
```

### Bare Metal / systemd
Edit `appsettings.json`:
```json
{
  "Metrics": {
    "Enabled": true,
    "Path": "/metrics",
    "Port": 8080
  }
}
```

```bash
# Verify metrics are exposed
curl -s http://localhost:8080/metrics | head -5

# Check port binding
netstat -an | grep 8080
```

### Kubernetes / Helm
```yaml
metrics:
  enabled: true
  port: 8080
  path: "/metrics"
  serviceMonitor:
    enabled: true
```

Add Prometheus annotations to the pod:
```yaml
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"
```

## Verification
```
stella doctor run --check check.metrics.prometheus.scrape
```

## Related Checks
- `check.telemetry.otlp.endpoint` — verifies OTLP collector endpoint reachability
- `check.logs.directory.writable` — verifies the log directory is writable
90
docs/doctor/articles/operations/dead-letter.md
Normal file
@@ -0,0 +1,90 @@
---
checkId: check.operations.dead-letter
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, queue, dead-letter]
---
# Dead Letter Queue

## What It Checks
Examines the dead letter queue for failed jobs that have exhausted their retry attempts and require manual review:

- **Critical threshold**: fail when more than 50 failed jobs accumulate in the dead letter queue.
- **Warning threshold**: warn when more than 10 failed jobs are present.
- **Acceptable range**: 1-10 failed jobs pass with an informational note.

Evidence collected: `FailedJobs`, `OldestFailure`, `MostCommonError`, `RetryableCount`.

This check always runs (no configuration prerequisites).
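
The thresholds above reduce to a simple classification (a minimal sketch, not the actual check):

```python
def classify_dead_letter(failed_jobs: int) -> str:
    """Apply the dead letter thresholds: fail > 50, warn > 10,
    pass otherwise (1-10 passes with an informational note)."""
    if failed_jobs > 50:
        return "fail"
    if failed_jobs > 10:
        return "warn"
    return "pass"

print(classify_dead_letter(0), classify_dead_letter(7), classify_dead_letter(73))
# pass pass fail
```
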

## Why It Matters
Dead letter queue entries represent work that the system was unable to complete after all retry attempts. Each entry is a job that may have had side effects (partial writes, notifications sent, resources allocated) and now sits in an inconsistent state. A growing dead letter queue indicates a systemic issue -- a downstream service outage, a configuration error, or a bug that is causing repeated failures. Left unattended, dead letters accumulate and can mask the root cause of operational issues.

## Common Causes
- Persistent downstream service failures (registry unavailable, external API down)
- Configuration errors causing jobs to fail deterministically (wrong credentials, missing endpoints)
- Resource exhaustion (out of memory, disk full) during job execution
- Integration service outage (SCM, CI, secrets manager)
- Transient failures accumulating faster than the retry mechanism can clear them
- Jobs consistently failing on specific artifact types or inputs

## How to Fix

### Docker Compose
```bash
# List dead letter queue entries
stella orchestrator deadletter list --limit 20

# Analyze common failure patterns
stella orchestrator deadletter analyze

# Retry jobs that are eligible for retry
stella orchestrator deadletter retry --filter retryable

# Retry all failed jobs
stella orchestrator deadletter retry --all

# View orchestrator logs for root cause
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "error\|fail"
```

### Bare Metal / systemd
```bash
# List recent failures
stella orchestrator deadletter list --since 1h

# Analyze failure patterns
stella orchestrator deadletter analyze

# Retry retryable jobs
stella orchestrator deadletter retry --filter retryable

# Check orchestrator service health
sudo systemctl status stellaops-orchestrator
sudo journalctl -u stellaops-orchestrator --since "4 hours ago" | grep -i "deadletter\|error"
```

### Kubernetes / Helm
```bash
# List dead letter entries
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter list --limit 20

# Analyze failures
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter analyze

# Retry retryable jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator deadletter retry --filter retryable

# Check orchestrator pod logs
kubectl logs -l app=stellaops-orchestrator --tail=200 | grep -i dead.letter
```

## Verification
```
stella doctor run --check check.operations.dead-letter
```

## Related Checks
- `check.operations.job-queue` -- job queue backlog can indicate the same underlying issue
- `check.operations.scheduler` -- scheduler failures may produce dead letter entries
- `check.postgres.connectivity` -- database issues are a common root cause of job failures
113
docs/doctor/articles/operations/job-queue.md
Normal file
@@ -0,0 +1,113 @@
---
checkId: check.operations.job-queue
plugin: stellaops.doctor.operations
severity: fail
tags: [operations, queue, jobs, core]
---
# Job Queue Health

## What It Checks
Evaluates the platform job queue health across three dimensions:

- **Worker availability**: fail immediately if no workers are active.
- **Queue depth**: warn at 100+ pending jobs, fail at 500+ pending jobs.
- **Processing rate**: warn if the processing rate drops below 10 jobs/minute.

Evidence collected: `QueueDepth`, `ActiveWorkers`, `TotalWorkers`, `ProcessingRate`, `OldestJobAge`, `CompletedLast24h`, `CriticalThreshold`, `WarningThreshold`, `RateStatus`.

This check always runs (no configuration prerequisites).
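
The three dimensions combine with the most severe outcome winning (a minimal sketch of the rules above, not the actual check):

```python
def classify_job_queue(active_workers: int, queue_depth: int,
                       rate_per_min: float) -> str:
    """No workers or depth >= 500 -> fail; depth >= 100 or
    rate < 10 jobs/min -> warn; otherwise pass."""
    if active_workers == 0 or queue_depth >= 500:
        return "fail"
    if queue_depth >= 100 or rate_per_min < 10:
        return "warn"
    return "pass"

print(classify_job_queue(active_workers=4, queue_depth=120, rate_per_min=42))  # warn
```
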
|
||||
|
||||
## Why It Matters
|
||||
The job queue is the backbone of asynchronous processing in Stella Ops. It handles scan jobs, SBOM generation, vulnerability matching, evidence collection, notification delivery, and many other background tasks. If no workers are available, all background processing stops. A deep queue means jobs are waiting longer than expected, which cascades into delayed scan results, stale findings, and blocked release gates. A low processing rate indicates a performance bottleneck that will only get worse under load.
|
||||
|
||||
## Common Causes
|
||||
- Worker service not running (crashed, not started, configuration error)
|
||||
- All workers crashed or became unhealthy simultaneously
|
||||
- Job processing slower than submission rate during high-activity periods
|
||||
- Workers overloaded or misconfigured (too few workers for the workload)
|
||||
- Downstream service bottleneck (database slow, external API rate-limited)
|
||||
- Database performance issues slowing job dequeue operations
|
||||
- Higher than normal job submission rate (bulk scan, new integration)
|
||||
|
||||
## How to Fix
|
||||
|
||||
### Docker Compose
|
||||
```bash
|
||||
# Check orchestrator service status
|
||||
docker compose -f docker-compose.stella-ops.yml ps orchestrator
|
||||
|
||||
# View worker logs
|
||||
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator
|
||||
|
||||
# Restart the orchestrator service
|
||||
docker compose -f docker-compose.stella-ops.yml restart orchestrator
|
||||
|
||||
# Scale workers
|
||||
docker compose -f docker-compose.stella-ops.yml up -d --scale orchestrator=4
|
||||
```
|
||||
|
||||
```yaml
|
||||
services:
|
||||
orchestrator:
|
||||
environment:
|
||||
Orchestrator__Workers__Count: "8"
|
||||
Orchestrator__Workers__MaxConcurrent: "4"
|
||||
```
|
||||
|
||||
### Bare Metal / systemd
|
||||
```bash
|
||||
# Check orchestrator service
|
||||
sudo systemctl status stellaops-orchestrator
|
||||
|
||||
# View logs for worker errors
|
||||
sudo journalctl -u stellaops-orchestrator --since "1 hour ago" | grep -i "worker\|queue"
|
||||
|
||||
# Restart workers
|
||||
stella orchestrator workers restart
|
||||
|
||||
# Scale workers
|
||||
stella orchestrator workers scale --count 8
|
||||
|
||||
# Monitor queue depth trend
|
||||
stella orchestrator queue watch
|
||||
```

### Kubernetes / Helm
```bash
# Check orchestrator pods
kubectl get pods -l app=stellaops-orchestrator

# View worker logs
kubectl logs -l app=stellaops-orchestrator --tail=200

# Scale workers
kubectl scale deployment stellaops-orchestrator --replicas=4

# Check for stuck jobs
kubectl exec -it <orchestrator-pod> -- stella orchestrator jobs list --status stuck
```

Set in Helm `values.yaml`:

```yaml
orchestrator:
  replicas: 4
  workers:
    count: 8
    maxConcurrent: 4
  resources:
    limits:
      memory: 2Gi
      cpu: "2"
```

## Verification
```
stella doctor run --check check.operations.job-queue
```

## Related Checks
- `check.operations.dead-letter` -- failed jobs end up in the dead letter queue
- `check.operations.scheduler` -- scheduler feeds jobs into the queue
- `check.scanner.queue` -- scanner-specific queue health
- `check.postgres.connectivity` -- database issues affect job dequeue performance
108
docs/doctor/articles/operations/scheduler.md
Normal file
@@ -0,0 +1,108 @@
---
checkId: check.operations.scheduler
plugin: stellaops.doctor.operations
severity: warn
tags: [operations, scheduler, core]
---
# Scheduler Health

## What It Checks
Evaluates the scheduler service status, scheduled jobs, and execution history:

- **Service status**: fail if the scheduler service is not running.
- **Missed executions**: warn if any scheduled job executions were missed (scheduled time passed without the job running).

Evidence collected: `ServiceStatus`, `ScheduledJobs`, `MissedExecutions`, `LastExecution`, `NextExecution`, `CompletedToday`.

This check always runs (no configuration prerequisites).
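
The missed-execution condition can be sketched as a comparison of timestamps; `missed_execution` is a hypothetical helper, and the real check reads these values from the scheduler's execution history:

```shell
# A job is "missed" when its scheduled time is in the past and no run was
# recorded at or after that time. All timestamps are epoch seconds.
missed_execution() {
  local scheduled_at=$1 last_run_at=$2 now=$3
  if [ "$scheduled_at" -le "$now" ] && [ "$last_run_at" -lt "$scheduled_at" ]; then
    echo "missed"
  else
    echo "ok"
  fi
}

missed_execution 1000 900 1100    # scheduled time passed, no run since → missed
missed_execution 1000 1005 1100   # ran after the scheduled time → ok
```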

## Why It Matters
The scheduler is responsible for triggering time-based operations across the platform: vulnerability database syncs, periodic scans, evidence expiration, report generation, feed updates, and more. If the scheduler is down, none of these periodic tasks run, causing data staleness across the system. Missed executions indicate that the scheduler was unable to trigger a job at its scheduled time, which can cause cascading data freshness issues.

## Common Causes
- Scheduler service crashed or was not started
- Service configuration error preventing startup
- System was down during a scheduled execution time (maintenance, outage)
- Scheduler overloaded with too many concurrent scheduled jobs
- Clock skew between the scheduler and other services
- Resource exhaustion preventing the scheduler from processing triggers

## How to Fix

### Docker Compose
```bash
# Check scheduler/orchestrator service status
docker compose -f docker-compose.stella-ops.yml ps orchestrator

# View scheduler logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 orchestrator | grep -i "scheduler\|schedule"

# Restart the service
docker compose -f docker-compose.stella-ops.yml restart orchestrator

# Review missed executions
stella scheduler preview --missed

# Trigger catch-up for missed jobs
stella scheduler catchup --dry-run
stella scheduler catchup
```

### Bare Metal / systemd
```bash
# Check scheduler service status
sudo systemctl status stellaops-scheduler

# Start the scheduler if stopped
sudo systemctl start stellaops-scheduler

# View scheduler logs
sudo journalctl -u stellaops-scheduler --since "4 hours ago"

# Review missed executions
stella scheduler preview --missed

# Trigger catch-up
stella scheduler catchup --dry-run

# Verify system clock is synchronized
timedatectl status
```

### Kubernetes / Helm
```bash
# Check scheduler pod status
kubectl get pods -l app=stellaops-scheduler

# View logs for the scheduler pod
kubectl logs -l app=stellaops-scheduler --tail=200

# Restart the scheduler
kubectl rollout restart deployment stellaops-scheduler

# Check NTP synchronization on the node
kubectl exec -it <scheduler-pod> -- date -u
```

Set in Helm `values.yaml`:

```yaml
scheduler:
  replicas: 1  # only one scheduler instance to avoid duplicate execution
  resources:
    limits:
      memory: 512Mi
      cpu: "0.5"
  catchupOnStart: true  # run missed jobs on startup
```

## Verification
```
stella doctor run --check check.operations.scheduler
```

## Related Checks
- `check.operations.job-queue` -- scheduler feeds jobs into the queue
- `check.operations.dead-letter` -- scheduler-triggered jobs that fail end up in dead letter
- `check.release.schedule` -- release schedule depends on the scheduler service
- `check.scanner.vuln` -- vulnerability database sync is scheduler-driven
124
docs/doctor/articles/policy/engine.md
Normal file
@@ -0,0 +1,124 @@
---
checkId: check.policy.engine
plugin: stellaops.doctor.policy
severity: fail
tags: [policy, core, health]
---
# Policy Engine Health

## What It Checks
Performs a three-part health check against the policy engine (OPA):

1. **Compilation**: queries `/health` to verify the engine is responding, `/v1/policies` to count loaded policies and verify they compiled, and `/v1/status` for engine version and cache metrics.
2. **Evaluation**: sends a canary POST to `/v1/data/system/health` with a minimal input and measures response time. HTTP 200 or 404 are acceptable (no policy at that path is fine). HTTP 500 indicates an engine error. Evaluation latency above 100ms triggers a warning.
3. **Storage**: queries `/v1/data` to verify the policy data store is accessible and counts top-level data entries.

If any of the three sub-checks fail, the overall result is fail. If all pass but evaluation latency exceeds 100ms, the result is warn.

Evidence collected: `engine_type`, `engine_version`, `engine_url`, `compilation_status`, `evaluation_status`, `storage_status`, `policy_count`, `compilation_time_ms`, `evaluation_latency_p50_ms`, `cache_hit_ratio`, `last_compilation_error`, `evaluation_error`, `storage_error`.

The check requires `Policy:Engine:Url` or `PolicyEngine:BaseUrl` to be configured.
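
The warn/fail logic of the evaluation sub-check can be sketched as a small shell helper (thresholds taken from the description above; `classify_canary` is a hypothetical name, not part of the Doctor CLI):

```shell
# Map the canary's HTTP status code and evaluation latency (ms) to an outcome.
classify_canary() {
  local http_code=$1 latency_ms=$2
  if [ "$http_code" -eq 500 ]; then
    echo "fail"   # engine error during evaluation
  elif [ "$latency_ms" -gt 100 ]; then
    echo "warn"   # engine answered, but slower than the 100ms budget
  else
    echo "pass"   # 200 or 404 are both acceptable
  fi
}

classify_canary 200 35    # → pass
classify_canary 404 40    # → pass (no policy at the canary path is fine)
classify_canary 200 180   # → warn
classify_canary 500 20    # → fail
```

In a live deployment the inputs would come from a POST to `/v1/data/system/health` (the standard OPA Data API), e.g. via `curl -w '%{http_code} %{time_total}'`.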

## Why It Matters
The policy engine is the decision authority for all release gates, promotion approvals, and security policy enforcement. If the policy engine is down, no release can pass its policy gate. If compilation fails, policies are not enforced. Slow evaluation delays release pipelines. A corrupt or inaccessible policy store means decisions are being made against stale or missing rules, which can result in either blocked releases or unintended policy bypasses.

## Common Causes
- Policy engine service (OPA) not running or crashed
- Policy storage backend unavailable (bundled or external)
- OPA/Rego compilation error in a recently pushed policy
- Policy cache corrupted after abnormal shutdown
- Policy evaluation slower than expected due to complex rules
- Network connectivity issue between Stella Ops services and the policy engine
- Firewall blocking access to the policy engine port
- DNS resolution failure for the policy engine hostname

## How to Fix

### Docker Compose
```bash
# Check policy engine container status
docker compose -f docker-compose.stella-ops.yml ps policy-engine

# View policy engine logs
docker compose -f docker-compose.stella-ops.yml logs --tail 200 policy-engine

# Test engine health directly
curl -s http://localhost:8181/health

# Recompile all policies
stella policy compile --all

# Warm the policy cache
stella policy cache warm
```

Ensure the engine URL is configured and add a container healthcheck:

```yaml
services:
  policy-engine:
    environment:
      Policy__Engine__Url: "http://policy-engine:8181"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8181/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

### Bare Metal / systemd
```bash
# Check OPA service status
sudo systemctl status stellaops-policy-engine

# View logs
sudo journalctl -u stellaops-policy-engine --since "1 hour ago"

# Restart the service
sudo systemctl restart stellaops-policy-engine

# Verify health
curl -s http://localhost:8181/health

# Recompile policies
stella policy compile --all
```

### Kubernetes / Helm
```bash
# Check policy engine pods
kubectl get pods -l app=stellaops-policy-engine

# View pod logs
kubectl logs -l app=stellaops-policy-engine --tail=200

# Restart policy engine
kubectl rollout restart deployment stellaops-policy-engine

# Verify health from within the cluster
kubectl exec -it <any-stellaops-pod> -- curl -s http://stellaops-policy-engine:8181/health
```

Set in Helm `values.yaml`:

```yaml
policyEngine:
  replicas: 2
  resources:
    limits:
      memory: 1Gi
      cpu: "1"
  livenessProbe:
    httpGet:
      path: /health
      port: 8181
    initialDelaySeconds: 10
    periodSeconds: 30
```

## Verification
```
stella doctor run --check check.policy.engine
```

## Related Checks
- `check.release.promotion.gates` -- promotion gates depend on policy engine availability
- `check.postgres.connectivity` -- policy storage may depend on database connectivity
124
docs/doctor/articles/postgres/connectivity.md
Normal file
@@ -0,0 +1,124 @@
---
checkId: check.postgres.connectivity
plugin: stellaops.doctor.postgres
severity: fail
tags: [database, postgres, connectivity, core]
---
# PostgreSQL Connectivity

## What It Checks
Opens a connection to PostgreSQL and executes `SELECT version(), current_timestamp` to verify the database is accessible and responsive. Measures round-trip latency:

- **Critical latency**: fail if response time exceeds 500ms.
- **Warning latency**: warn if response time exceeds 100ms.
- **Connection timeout**: fail if the connection attempt exceeds 10 seconds.
- **Connection failure**: fail on authentication errors, DNS failures, or network issues.

The connection string password is masked in all evidence output.

Evidence collected: `ConnectionString` (masked), `LatencyMs`, `Version`, `ServerTime`, `Status`, `Threshold`, `ErrorCode`, `ErrorMessage`, `TimeoutSeconds`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.
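
The masking applied to the connection string can be sketched with `sed`; `mask_password` is a hypothetical helper name, and the exact masking format used by the check may differ:

```shell
# Replace the Password= value in a keyword connection string with asterisks,
# so credentials never appear in evidence output.
mask_password() {
  sed -E 's/(Password=)[^;]*/\1*****/'
}

echo "Host=postgres;Port=5432;Username=stellaops;Password=s3cret" | mask_password
# → Host=postgres;Port=5432;Username=stellaops;Password=*****
```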

## Why It Matters
PostgreSQL is the primary data store for the entire Stella Ops platform. Every service depends on it for configuration, state, and transactional data. If the database is unreachable, the platform is effectively down. High latency propagates through every database operation, degrading the performance of all services, API endpoints, and background jobs simultaneously. This is the most fundamental infrastructure health check.

## Common Causes
- Database server not running or crashed
- Network connectivity issues between the application and database
- Firewall blocking the database port (5432)
- DNS resolution failure for the database hostname
- Invalid connection string (wrong host, port, or database name)
- Authentication failure (wrong username or password)
- Database does not exist
- Database server overloaded (high CPU, memory pressure, I/O saturation)
- Network latency between application and database hosts
- Slow queries blocking connections
- SSL/TLS certificate issues

## How to Fix

### Docker Compose
```bash
# Check postgres container status
docker compose -f docker-compose.stella-ops.yml ps postgres

# Test direct connection
docker compose -f docker-compose.stella-ops.yml exec postgres \
  pg_isready -U stellaops -d stellaops_platform

# View postgres logs
docker compose -f docker-compose.stella-ops.yml logs --tail 100 postgres

# Restart postgres if needed
docker compose -f docker-compose.stella-ops.yml restart postgres
```

Verify connection string in environment:

```yaml
services:
  platform:
    environment:
      ConnectionStrings__StellaOps: "Host=postgres;Port=5432;Database=stellaops_platform;Username=stellaops;Password=stellaops"
```

### Bare Metal / systemd
```bash
# Check PostgreSQL service status
sudo systemctl status postgresql

# Test connectivity
pg_isready -h localhost -p 5432 -U stellaops -d stellaops_platform

# Check PostgreSQL logs
sudo tail -100 /var/log/postgresql/postgresql-*.log

# Verify connection string
stella config get ConnectionStrings:StellaOps

# Test connection manually
psql -h localhost -p 5432 -U stellaops -d stellaops_platform -c "SELECT 1;"
```

### Kubernetes / Helm
```bash
# Check PostgreSQL pod status
kubectl get pods -l app=postgresql

# Test connectivity from an application pod
kubectl exec -it <platform-pod> -- pg_isready -h postgres -p 5432

# View PostgreSQL pod logs
kubectl logs -l app=postgresql --tail=100

# Check service DNS resolution
kubectl exec -it <platform-pod> -- nslookup postgres
```

Verify connection string in secret:

```bash
kubectl get secret stellaops-db-credentials -o jsonpath='{.data.connection-string}' | base64 -d
```

Set in Helm `values.yaml`:

```yaml
postgresql:
  host: postgres
  port: 5432
  database: stellaops_platform
  auth:
    existingSecret: stellaops-db-credentials
```

## Verification
```
stella doctor run --check check.postgres.connectivity
```

## Related Checks
- `check.postgres.pool` -- pool exhaustion can masquerade as connectivity issues
- `check.postgres.migrations` -- migration checks depend on connectivity
- `check.operations.job-queue` -- database issues cause job queue failures
127
docs/doctor/articles/postgres/migrations.md
Normal file
@@ -0,0 +1,127 @@
---
checkId: check.postgres.migrations
plugin: stellaops.doctor.postgres
severity: warn
tags: [database, postgres, migrations, schema]
---
# PostgreSQL Migration Status

## What It Checks
Connects to PostgreSQL and examines the EF Core migration history to identify pending migrations:

1. **Migration table existence**: checks for the `__EFMigrationsHistory` table in the `public` schema. Warns if the table does not exist.
2. **Applied migrations**: queries the migration history table (ordered by `MigrationId` descending) to determine which migrations have been applied.
3. **Pending migrations**: compares applied migrations against the expected set to identify any unapplied migrations. Warns if pending migrations are found.

Evidence collected: `TableExists`, `AppliedCount`, `PendingCount`, `LatestApplied`, `PendingMigrations`, `Status`.

The check requires `ConnectionStrings:StellaOps` or `Database:ConnectionString` to be configured.
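
The pending-migration comparison boils down to a set difference; a minimal sketch with hypothetical migration names:

```shell
# Expected set: migrations shipped with this build. Applied set: rows read
# from __EFMigrationsHistory. Names below are illustrative, not real migrations.
printf '%s\n' 20240101_Init 20240215_AddEvidence 20240320_AddRemediation | sort > expected.txt
printf '%s\n' 20240101_Init 20240215_AddEvidence | sort > applied.txt

# Lines unique to expected.txt are the pending migrations.
comm -23 expected.txt applied.txt
# → 20240320_AddRemediation
```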

## Why It Matters
Pending database migrations mean the database schema does not match what the application code expects. This causes 500 errors when the application tries to access tables or columns that do not exist, or uses schema features that have not been applied. In Stella Ops, missing migrations are the number one cause of service failures after an upgrade. Services may start and appear healthy but fail on the first database operation that touches a missing table or column.

## Common Causes
- New deployment with schema changes but migration not executed
- Migration was not run after a version update
- Previous migration attempt failed partway through
- Database initialized without EF Core (manual SQL scripts used instead)
- Migration history table was accidentally dropped
- First deployment to a fresh database with no migration history
- Auto-migration disabled or not configured in service startup

## How to Fix

### Docker Compose
```bash
# Check migration status
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrations status

# Apply pending migrations
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrate

# If auto-migration is configured, restart the service (it migrates on startup)
docker compose -f docker-compose.stella-ops.yml restart platform

# Verify migration status after applying
docker compose -f docker-compose.stella-ops.yml exec platform \
  stella db migrations list
```

Ensure auto-migration is enabled:

```yaml
services:
  platform:
    environment:
      Platform__AutoMigrate: "true"
```

### Bare Metal / systemd
```bash
# List pending migrations
stella db migrations list --pending

# Apply pending migrations
stella db migrate

# Verify all migrations are applied
stella db migrations status

# If auto-migration is configured, restart the service
sudo systemctl restart stellaops-platform
```

Edit `/etc/stellaops/platform/appsettings.json` to enable auto-migration:

```json
{
  "Platform": {
    "AutoMigrate": true
  }
}
```

### Kubernetes / Helm
```bash
# Check migration status
kubectl exec -it <platform-pod> -- stella db migrations status

# Apply pending migrations
kubectl exec -it <platform-pod> -- stella db migrate

# Or use a migration Job
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: stellaops-migrate
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: stellaops/platform:latest
        command: ["stella", "db", "migrate"]
      restartPolicy: Never
EOF
```

Set in Helm `values.yaml`:

```yaml
platform:
  autoMigrate: true
  migrations:
    runOnStartup: true
```

## Verification
```
stella doctor run --check check.postgres.migrations
```

## Related Checks
- `check.postgres.connectivity` -- migrations require a working database connection
- `check.postgres.pool` -- connection pool issues can cause migration failures
Some files were not shown because too many files have changed in this diff.