docs consolidation and others

This commit is contained in:
master
2026-01-06 19:02:21 +02:00
parent d7bdca6d97
commit 4789027317
849 changed files with 16551 additions and 66770 deletions


@@ -0,0 +1,26 @@
# Binary Prerequisites & Offline Layout
## Layout (authoritative)
- `.nuget/packages/` — NuGet package cache (configured via `nuget.config` `globalPackagesFolder`).
- `devops/manifests/` — binary integrity manifests (e.g., `binary-plugins.manifest.json`).
- `devops/offline/feeds/` — air-gap bundles (tarballs, OCI layers, SBOM packs) registered in `manifest.json`.
- Module-owned binaries (currently `plugins/`, `tools/`, `deploy/`, `ops/`) are tracked for integrity in `devops/manifests/` until relocated.
## Adding or updating NuGet packages
1) Run `dotnet restore`, which populates `.nuget/packages/` from the sources in `nuget.config`.
2) Never add new feeds to `nuget.config` without review; the configured sources are `nuget.org` and `stellaops` (internal feed).
3) For offline builds, pre-populate `.nuget/packages/` from a network-connected machine, then copy to the air-gapped environment.
## Adding other binaries
1) Prefer building from source; if you must pin a binary, drop it under `devops/offline/` and append a manifest entry recording its SHA-256, origin URL, version, and intended consumer.
2) For module-owned binaries (e.g., plugins), record the artefact in `devops/manifests/binary-plugins.manifest.json` until it can be rebuilt deterministically as part of CI.
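
The entry for a pinned binary can be drafted with a small helper. This is a minimal sketch: the function name `manifest_entry` and the JSON field names (`path`, `sha256`, `origin`, `version`, `consumer`) are assumptions for illustration, not the actual schema used under `devops/manifests/`.

```bash
# Hypothetical helper: print a JSON entry for a pinned binary.
# Field names are illustrative; check the real manifest schema before use.
manifest_entry() {
  file=$1 origin=$2 version=$3 consumer=$4
  # Use sha256sum where available, otherwise fall back to shasum (macOS).
  if command -v sha256sum >/dev/null 2>&1; then
    sum=$(sha256sum "$file" | cut -d' ' -f1)
  else
    sum=$(shasum -a 256 "$file" | cut -d' ' -f1)
  fi
  printf '{"path":"%s","sha256":"%s","origin":"%s","version":"%s","consumer":"%s"}\n' \
    "$file" "$sum" "$origin" "$version" "$consumer"
}
```

For example, `manifest_entry devops/offline/tool.bin https://example.invalid/tool.bin 1.2.3 scanner` prints one line that can be reviewed before it is added to a manifest (the path and URL here are placeholders).
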
## Automation & Integrity
- Run `scripts/update-binary-manifests.py` to refresh manifests after adding binaries.
- Run `scripts/verify-binaries.sh` locally; CI executes it on every PR/branch to block binaries outside approved roots.
- CI also re-runs the manifest generator and fails if the manifests would change—commit regenerated manifests as part of the change.
- NuGet restore uses `.nuget/packages/` as configured in `nuget.config`. Clean by removing `.nuget/packages/` if needed.
- For offline enforcement, set `OFFLINE=1` (CI should fail if it reaches `nuget.org` without `ALLOW_REMOTE=1`).
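
The `OFFLINE` / `ALLOW_REMOTE` rule above can be sketched as a guard function. This is an illustrative sketch only: `source_allowed` is a hypothetical name, and it assumes that any `http://` or `https://` source counts as remote while local paths are always permitted. How CI actually enforces the rule is not described in this document.

```bash
# Hypothetical guard: succeed only if a package source may be contacted.
# Assumes OFFLINE=1 forbids remote hosts unless ALLOW_REMOTE=1 is also set.
source_allowed() {
  url=$1
  case "$url" in
    http://*|https://*) remote=1 ;;
    *) remote=0 ;;   # local paths such as .nuget/packages/
  esac
  if [ "${OFFLINE:-0}" = "1" ] && [ "$remote" = "1" ] && [ "${ALLOW_REMOTE:-0}" != "1" ]; then
    echo "blocked: $url (OFFLINE=1, ALLOW_REMOTE not set)" >&2
    return 1
  fi
  return 0
}
```

A build wrapper could call `source_allowed "$feed_url" || exit 1` before each restore so that an accidental remote fetch fails fast.
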
## Housekeeping
- Refresh manifests when binaries change and record the update in the current sprint's Execution Log.


@@ -0,0 +1,295 @@
# StellaOps Deployment Version Matrix
> **Last Updated:** 2025-12-04
> **Purpose:** Single source of truth for service versions across deployment environments
> **Unblocks:** COMPOSE-44-001, 44-001, 44-002, 44-003, 45-001, 45-002, 45-003 (7 tasks)
## Quick Reference
| Environment | Core Version | Status |
|-------------|-------------|--------|
| **Development** | `2025.10.0-edge` | Active |
| **Staging** | `2025.09.2` | Stable |
| **Production** | `2025.09.2` | Stable |
| **Air-Gap** | `2025.09.2-airgap` | Certified |
---
## Service Version Matrix
### Core Services
| Service | Dev | Staging | Prod | Air-Gap | Notes |
|---------|-----|---------|------|---------|-------|
| Authority | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | OAuth 2.1 / mTLS |
| Signer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | ECDSA/RSA/EdDSA |
| Attestor | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | in-toto/DSSE |
| Concelier | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Advisory ingestion |
| Scanner | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | SBOM/Vuln scanning |
| Excititor | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | VEX export |
| Policy | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | OPA/Rego engine |
| Scheduler | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Job scheduling |
| Notify | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Notifications |
### Platform Services
| Service | Dev | Staging | Prod | Air-Gap | Notes |
|---------|-----|---------|------|---------|-------|
| Orchestrator Web | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | API Gateway |
| Orchestrator Worker | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Background jobs |
| Graph API | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Graph queries |
| Graph Indexer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Graph ingest |
| Timeline Indexer | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Event timeline |
| Findings Ledger | `2025.10.0-edge` | `2025.09.2` | `2025.09.2` | `2025.09.2-airgap` | Finding storage |
### Infrastructure Dependencies
| Component | Version | Digest | Notes |
|-----------|---------|--------|-------|
| PostgreSQL | `16-alpine` | N/A | Primary database (REQUIRED) |
| Valkey | `8.0` | N/A | Cache, DPoP security (REQUIRED) |
| RustFS | `2025.10.0-edge` | N/A | Object storage (REQUIRED) |
| NATS | `2.10` | `sha256:c82559e4476289481a8a5196e675ebfe67eea81d95e5161e3e78eccfe766608e` | Message queue (optional) |
---
## Container Image Registry
### Primary Registry
```
registry.stella-ops.org/stellaops/<service>:<version>
```
### Image Naming Convention
| Pattern | Example | Use Case |
|---------|---------|----------|
| `<service>:<version>` | `authority:2025.09.2` | Tagged releases |
| `<service>:<version>-<variant>` | `authority:2025.09.2-airgap` | Environment variants |
| `<service>:edge` | `authority:edge` | Latest dev build |
| `<service>@sha256:<digest>` | `authority@sha256:abc123...` | Immutable reference |
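
The naming patterns in the table can be composed mechanically. The sketch below assumes the primary registry prefix shown earlier; the function names `image_ref` and `image_ref_digest` are hypothetical helpers, not part of any StellaOps tooling.

```bash
REGISTRY="registry.stella-ops.org/stellaops"

# image_ref SERVICE VERSION [VARIANT] -> tagged reference
image_ref() {
  if [ -n "${3:-}" ]; then
    printf '%s/%s:%s-%s\n' "$REGISTRY" "$1" "$2" "$3"
  else
    printf '%s/%s:%s\n' "$REGISTRY" "$1" "$2"
  fi
}

# image_ref_digest SERVICE DIGEST -> immutable reference
image_ref_digest() {
  case "$2" in
    sha256:*) printf '%s/%s@%s\n' "$REGISTRY" "$1" "$2" ;;
    *) echo "digest must start with sha256:" >&2; return 1 ;;
  esac
}
```

Calling `image_ref authority 2025.09.2 airgap` yields the environment-variant form, while `image_ref_digest` rejects anything that is not a `sha256:` digest, which keeps production references immutable.
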
### Air-Gap Bundle Images
Air-gap deployments use pre-bundled images with all dependencies:
```
registry.stella-ops.org/stellaops/airgap-bundle:2025.09.2
```
Bundle contents:
- All core services at matching version
- Infrastructure containers (PostgreSQL, Valkey, RustFS, NATS)
- CLI tools and migration utilities
- Offline kit documentation
---
## Version Promotion Workflow
### Stages
```
Dev (edge) → Staging → Production → Air-Gap (certified)
```
### Promotion Criteria
| Stage | Criteria |
|-------|----------|
| Dev → Staging | All unit tests pass, integration tests pass |
| Staging → Prod | E2E tests pass, security scan clean, performance benchmarks pass |
| Prod → Air-Gap | Offline validation complete, bundle integrity verified, documentation updated |
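
Because promotion only moves one stage at a time, a wrapper can refuse out-of-order requests before any criteria are evaluated. This is a sketch under assumptions: the stage identifiers `dev`, `staging`, and `prod` mirror the `--from`/`--to` values used by `promote.sh`, `airgap` is an assumed identifier for the certified stage, and the function names are hypothetical.

```bash
# next_stage STAGE -> prints the stage that follows in
# Dev -> Staging -> Production -> Air-Gap order.
next_stage() {
  case "$1" in
    dev) echo staging ;;
    staging) echo prod ;;
    prod) echo airgap ;;
    *) return 1 ;;   # airgap (or an unknown stage) has no successor
  esac
}

# can_promote FROM TO -> succeeds only for adjacent stages in pipeline order.
can_promote() {
  expected=$(next_stage "$1") || return 1
  [ "$expected" = "$2" ]
}
```

With this in place, `can_promote dev prod` fails, so a release cannot skip staging.
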
### Promotion Commands
```bash
# Promote dev to staging
./scripts/promote.sh --from dev --to staging --version 2025.10.0
# Promote staging to production
./scripts/promote.sh --from staging --to prod --version 2025.10.0
# Create air-gap certified bundle
./scripts/create-airgap-bundle.sh --version 2025.09.2
```
---
## Helm Chart Values
### Development (`values-dev.yaml`)
```yaml
global:
  imageTag: "2025.10.0-edge"
  imagePullPolicy: Always
  environment: development
services:
  authority:
    replicaCount: 1
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
```
### Production (`values-prod.yaml`)
```yaml
global:
  imageTag: "2025.09.2"
  imagePullPolicy: IfNotPresent
  environment: production
services:
  authority:
    replicaCount: 3
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
```
### Air-Gap (`values-airgap.yaml`)
```yaml
global:
  imageTag: "2025.09.2-airgap"
  imagePullPolicy: Never # Images pre-loaded
  environment: airgap
  offlineMode: true
airgap:
  enabled: true
  bundleVersion: "2025.09.2"
  stalenessThresholdSeconds: 604800 # 7 days
```
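
The `stalenessThresholdSeconds` value is 7 × 24 × 60 × 60 = 604800 seconds. A staleness check against that threshold might look like the sketch below. It assumes a bundle counts as stale only when its age strictly exceeds the threshold, and the function name `is_stale` is hypothetical; the real comparison is defined by the services, not by this document.

```bash
# is_stale BUNDLE_EPOCH NOW_EPOCH [THRESHOLD_SECONDS]
# Succeeds (exit 0) when the bundle is older than the threshold.
is_stale() {
  threshold=${3:-604800}   # 7 * 24 * 60 * 60 = 604800 seconds
  age=$(( $2 - $1 ))
  [ "$age" -gt "$threshold" ]
}
```

For example, `is_stale "$bundle_epoch" "$(date +%s)" && echo "bundle is stale"` warns once a bundle imported more than seven days ago is still in use (`bundle_epoch` is a placeholder for the import timestamp).
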
---
## Docker Compose Reference
### Quick Start (Development)
```yaml
# docker-compose.dev.yaml
version: "3.8"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority:2025.10.0-edge
  concelier:
    image: registry.stella-ops.org/stellaops/concelier:2025.10.0-edge
  scanner:
    image: registry.stella-ops.org/stellaops/scanner:2025.10.0-edge
```
### Production
```yaml
# docker-compose.prod.yaml
version: "3.8"
services:
  authority:
    image: registry.stella-ops.org/stellaops/authority@sha256:...
    deploy:
      replicas: 3
  concelier:
    image: registry.stella-ops.org/stellaops/concelier@sha256:...
    deploy:
      replicas: 2
```
---
## Service Dependencies
### Startup Order
```
1. Infrastructure (PostgreSQL, Valkey, RustFS, NATS)
2. Core Auth (Authority, Signer)
3. Data Services (Concelier, Excititor)
4. Compute Services (Scanner, Policy, Scheduler)
5. Platform Services (Orchestrator, Graph, Timeline)
6. UI/CLI
```
### Health Check Endpoints
| Service | Health Endpoint | Ready Endpoint |
|---------|-----------------|----------------|
| All | `/health` | `/ready` |
| Authority | `/health` | `/ready` (includes JWKS) |
| Scanner | `/health` | `/ready` (includes analyzer check) |
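
When bringing services up in the startup order above, each tier can be gated on its `/ready` endpoint. The following is a minimal polling sketch, not an official script: `wait_ready` and the overridable `PROBE` command are hypothetical, and the default probe assumes `curl` is installed.

```bash
# PROBE is the command used to test a URL; override it for custom TLS flags or in tests.
PROBE=${PROBE:-"curl -fsS -o /dev/null"}

# wait_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until the probe succeeds or the attempts are exhausted.
wait_ready() {
  url=$1 attempts=${2:-30} delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $PROBE "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}
```

A rollout script could run `wait_ready https://authority.example.com/ready` before starting the data services, and repeat the call for each later tier (the hostname is a placeholder).
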
---
## Breaking Changes Log
### 2025.10.0 (Upcoming)
- **Authority:** New OAuth 2.1 endpoints (backward compatible)
- **Scanner:** Analyzer plugin format v2 (migration required)
- **Concelier:** LNM API v2 (v1 deprecated, removed in 2025.11.0)
### 2025.09.2 (Current Stable)
- **All:** Initial GA release
- **Air-Gap:** First certified offline bundle
---
## Rollback Procedure
### Helm Rollback
```bash
# List releases
helm history stellaops -n stellaops
# Roll back to the previous revision (pass an explicit revision from `helm history` to target another)
helm rollback stellaops -n stellaops
```
### Compose Rollback
```bash
# Stop current
docker-compose down
# Edit .env to previous version
# VERSION=2025.09.1
# Start previous
docker-compose up -d
```
---
## Related Documents
- [Helm Chart Documentation](../deploy/helm/stellaops/README.md)
- [Compose Quickstart](../deploy/compose/README.md)
- [Offline Kit Guide](./OFFLINE_KIT.md)
- [Air-Gap Provenance](../modules/findings-ledger/airgap-provenance.md)
- [Staleness Schema](../schemas/ledger-airgap-staleness.schema.json)
---
## Changelog
| Date | Change | Author |
|------|--------|--------|
| 2025-12-04 | Initial version matrix created | Claude |
| 2025-12-04 | Added air-gap certification workflow | Claude |


@@ -0,0 +1,228 @@
# Deploying the StellaOps Console
> **Audience:** Deployment Guild, Console Guild, operators rolling out the web console.
> **Scope:** Helm and Docker Compose deployment steps, ingress/TLS configuration, required environment variables, health checks, offline/air-gap operation, and compliance checklist (Sprint 23).
The StellaOps Console ships as part of the `stellaops` stack Helm chart and Compose bundles maintained under `deploy/`. This guide describes the supported deployment paths, the configuration surface, and operational checks needed to run the console in connected or air-gapped environments.
---
## 1. Prerequisites
- Kubernetes cluster (v1.28+) with ingress controller (NGINX, Traefik, or equivalent) and Cert-Manager for automated TLS, or Docker host for Compose deployments.
- Container registry access to `registry.stella-ops.org` (or mirrored registry) for all images listed in `deploy/releases/*.yaml`.
- Authority service configured with console client (`aud=ui`, scopes `ui.read`, `ui.admin`).
- DNS entry pointing to the console hostname (for example, `console.acme.internal`).
- Cosign public key for manifest verification (`deploy/releases/manifest.json.sig`).
- Optional: Offline Kit bundle for air-gapped sites (`stella-ops-offline-kit-<ver>.tar.gz`).
---
## 2. Helm deployment (recommended)
### 2.1 Install chart repository
```bash
helm repo add stellaops https://downloads.stella-ops.org/helm
helm repo update stellaops
```
If operating offline, copy the chart archive from the Offline Kit (`deploy/helm/stellaops-<ver>.tgz`) and run:
```bash
helm install stellaops ./stellaops-<ver>.tgz --namespace stellaops --create-namespace
```
### 2.2 Base installation
```bash
helm install stellaops stellaops/stellaops \
--namespace stellaops \
--create-namespace \
--values deploy/helm/stellaops/values-prod.yaml
```
The chart deploys Authority, Console web/API gateway, Scanner API, Scheduler, and supporting services. The console frontend pod is labelled `app=stellaops-web-ui`.
### 2.3 Helm values highlights
Key sections in `deploy/helm/stellaops/values-prod.yaml`:
| Path | Description |
|------|-------------|
| `console.ingress.host` | Hostname served by the console (`console.example.com`). |
| `console.ingress.tls.secretName` | Kubernetes secret containing TLS certificate (generated by Cert-Manager or uploaded manually). |
| `console.config.apiGateway.baseUrl` | Internal base URL the UI uses to reach the gateway (defaults to `https://stellaops-web`). |
| `console.env.AUTHORITY_ISSUER` | Authority issuer URL (for example, `https://authority.example.com`). |
| `console.env.AUTHORITY_CLIENT_ID` | Authority client ID for the console UI. |
| `console.env.AUTHORITY_SCOPES` | Space-separated scopes required by UI (`ui.read ui.admin`). |
| `console.resources` | CPU/memory requests and limits (default 250m CPU / 512Mi memory). |
| `console.podAnnotations` | Optional annotations for service mesh or monitoring. |
Use `values-stage.yaml`, `values-dev.yaml`, or `values-airgap.yaml` as templates for other environments.
### 2.4 TLS and ingress
Example ingress override:
```yaml
console:
  ingress:
    enabled: true
    className: nginx
    host: console.acme.internal
    tls:
      enabled: true
      secretName: console-tls
```
Generate certificates using Cert-Manager or provide an existing secret. For air-gapped deployments, pre-create the secret with the mirrored CA chain.
### 2.5 Health checks
Console pods expose:
| Path | Purpose | Notes |
|------|---------|-------|
| `/health/live` | Liveness probe | Confirms process responsive. |
| `/health/ready` | Readiness probe | Verifies configuration bootstrap and Authority reachability. |
| `/metrics` | Prometheus metrics | Enabled when `console.metrics.enabled=true`. |
Helm chart sets default probes (`initialDelaySeconds: 10`, `periodSeconds: 15`). Adjust via `console.livenessProbe` and `console.readinessProbe`.
---
## 3. Docker Compose deployment
The Compose bundle is located in `deploy/compose/docker-compose.console.yaml`. Quick start:
```bash
cd deploy/compose
docker compose -f docker-compose.console.yaml --env-file console.env up -d
```
`console.env` should define:
```
CONSOLE_PUBLIC_BASE_URL=https://console.acme.internal
AUTHORITY_ISSUER=https://authority.acme.internal
AUTHORITY_CLIENT_ID=console-ui
AUTHORITY_CLIENT_SECRET=<if using confidential client>
AUTHORITY_SCOPES=ui.read ui.admin
CONSOLE_GATEWAY_BASE_URL=https://api.acme.internal
```
The compose bundle includes Traefik as reverse proxy with TLS termination. Update `traefik/dynamic/console.yml` for custom certificates or additional middlewares (CSP headers, rate limits).
---
## 4. Environment variables
| Variable | Description | Default |
|----------|-------------|---------|
| `CONSOLE_PUBLIC_BASE_URL` | External URL used for redirects, deep links, and telemetry. | None (required). |
| `CONSOLE_GATEWAY_BASE_URL` | URL of the web gateway that proxies API calls (`/console/*`). | Chart service name. |
| `AUTHORITY_ISSUER` | Authority issuer (`https://authority.example.com`). | None (required). |
| `AUTHORITY_CLIENT_ID` | OIDC client configured in Authority. | None (required). |
| `AUTHORITY_SCOPES` | Space-separated scopes assigned to the console client. | `ui.read ui.admin`. |
| `AUTHORITY_DPOP_ENABLED` | Enables DPoP challenge/response (recommended true). | `true`. |
| `CONSOLE_FEATURE_FLAGS` | Comma-separated feature flags (`runs`, `downloads.offline`, etc.). | `runs,downloads,policies`. |
| `CONSOLE_LOG_LEVEL` | Minimum log level (`Information`, `Debug`, etc.). | `Information`. |
| `CONSOLE_METRICS_ENABLED` | Expose `/metrics` endpoint. | `true`. |
| `CONSOLE_SENTRY_DSN` | Optional error reporting DSN. | Blank. |
When running behind additional proxies, set `ASPNETCORE_FORWARDEDHEADERS_ENABLED=true` to honour `X-Forwarded-*` headers.
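
Since three of the variables in the table have no default, an env file can be checked before the console is started. This is a sketch under assumptions: `check_env_file` is a hypothetical helper, and it only understands plain `NAME=value` lines such as those in the `console.env` example.

```bash
# check_env_file FILE -> reports required variables missing from an env file.
# The required set is taken from the table entries marked "None (required)".
check_env_file() {
  missing=0
  for name in CONSOLE_PUBLIC_BASE_URL AUTHORITY_ISSUER AUTHORITY_CLIENT_ID; do
    # Match NAME=<non-empty value> at the start of a line.
    if ! grep -Eq "^${name}=.+" "$1"; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_env_file console.env || exit 1` ahead of `docker compose up` turns a misconfigured deployment into an early, readable failure.
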
---
## 5. Security headers and CSP
The console serves a strict Content Security Policy (CSP) by default:
```
default-src 'self';
connect-src 'self' https://*.stella-ops.local;
script-src 'self';
style-src 'self' 'unsafe-inline';
img-src 'self' data:;
font-src 'self';
frame-ancestors 'none';
```
Adjust via `console.config.cspOverrides` if additional domains are required. For integrations embedding the console, update OIDC redirect URIs and Authority scopes accordingly.
TLS recommendations:
- Use TLS 1.2+ with modern cipher suite policy.
- Enable HSTS (`Strict-Transport-Security: max-age=31536000; includeSubDomains`).
- Provide custom trust bundles via `console.config.trustBundleSecret` when using private CAs.
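
The header recommendations can be spot-checked from a response. The sketch below reads raw response headers on standard input and only tests for the presence of an HSTS header with `max-age` and of a CSP header; it does not validate the policy contents. The function name `check_security_headers` is hypothetical.

```bash
# check_security_headers: reads HTTP response headers on stdin and
# succeeds only if both an HSTS header and a CSP header are present.
check_security_headers() {
  # Strip carriage returns and lowercase everything for case-insensitive matching.
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  ok=0
  case "$headers" in
    *strict-transport-security:*max-age=*) ;;
    *) echo "missing HSTS" >&2; ok=1 ;;
  esac
  case "$headers" in
    *content-security-policy:*) ;;
    *) echo "missing CSP" >&2; ok=1 ;;
  esac
  return "$ok"
}
```

A typical invocation would be `curl -sI https://console.example.com/ | check_security_headers`.
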
---
## 6. Logging and metrics
- Structured logs emitted to stdout with correlation IDs. Configure log shipping via Fluent Bit or similar.
- Metrics available at `/metrics` in Prometheus format. Key metrics include `ui_request_duration_seconds`, `ui_tenant_switch_total`, and `ui_download_manifest_refresh_seconds`.
- Enable OpenTelemetry exporter by setting `OTEL_EXPORTER_OTLP_ENDPOINT` and associated headers in environment variables.
---
## 7. Offline and air-gap deployment
- Mirror container images using the Downloads workspace or Offline Kit manifest. Example:
```bash
oras copy registry.stella-ops.org/stellaops/web-ui@sha256:<digest> \
registry.airgap.local/stellaops/web-ui:2025.10.0
```
- Import Offline Kit using `stella ouk import` before starting the console so manifest parity checks succeed.
- Use `values-airgap.yaml` to disable external telemetry endpoints and configure internal certificate chains.
- Run `helm upgrade --install` using the mirrored chart (`stellaops-<ver>.tgz`) and set `console.offlineMode=true` to surface offline banners.
---
## 8. Health checks and remediation
| Check | Command | Expected result |
|-------|---------|-----------------|
| Pod status | `kubectl get pods -n stellaops` | `Running` state with restarts = 0. |
| Liveness | `kubectl exec -n stellaops deploy/stellaops-web-ui -- curl -fsS http://localhost:8080/health/live` | Returns `{"status":"Healthy"}`. |
| Readiness | `kubectl exec -n stellaops deploy/stellaops-web-ui -- curl -fsS http://localhost:8080/health/ready` | Returns `{"status":"Ready"}`. |
| Gateway reachability | `curl -I https://console.example.com/api/console/status` | `200 OK` with CSP headers. |
| Static assets | `curl -I https://console.example.com/static/assets/app.js` | `200 OK` with long cache headers. |
Troubleshooting steps:
- **Authority unreachable:** readiness fails with `AUTHORITY_UNREACHABLE`. Check DNS, trust bundles, and Authority service health.
- **Manifest mismatch:** console logs `DOWNLOAD_MANIFEST_SIGNATURE_INVALID`. Verify cosign key and re-sync manifest.
- **Ingress 404:** ensure ingress controller routes host to `stellaops-web-ui` service; check TLS secret name.
- **SSE blocked:** confirm proxy allows HTTP/1.1 and disables buffering on `/console/runs/*`.
---
## 9. References
- `deploy/helm/stellaops/values-*.yaml` - environment-specific overrides.
- `deploy/compose/docker-compose.console.yaml` - Compose bundle.
- `docs/UI_GUIDE.md` - Console workflows and offline posture.
- `/docs/security/console-security.md` - CSP and Authority scopes.
- `/docs/OFFLINE_KIT.md` - Offline kit packaging and verification.
- `/docs/modules/devops/runbooks/deployment-runbook.md` (pending) - wider platform deployment steps.
---
## 10. Compliance checklist
- [ ] Helm and Compose instructions verified against `deploy/` assets.
- [ ] Ingress/TLS guidance aligns with Security Guild recommendations.
- [ ] Environment variables documented with defaults and required values.
- [ ] Health/liveness/readiness endpoints tested and listed.
- [ ] Offline workflow (mirrors, manifest parity) captured.
- [ ] Logging and metrics section lists the exposed metrics.
- [ ] CSP and security header defaults stated alongside override guidance.
- [ ] Troubleshooting section linked to relevant runbooks.
---
*Last updated: 2025-10-27 (Sprint 23).*


@@ -0,0 +1,158 @@
# Container Deployment Guide — AOC Update
> **Audience:** DevOps Guild, platform operators deploying StellaOps services.
> **Scope:** Deployment configuration changes required by the Aggregation-Only Contract (AOC), including schema validators, guard environment flags, and verifier identities.
This guide supplements existing deployment manuals with AOC-specific configuration. It assumes familiarity with the base Compose/Helm manifests described in `ops/deployment/` and `docs/modules/devops/architecture.md`.
---
## 1 · Schema constraint enablement
### 1.1 PostgreSQL constraints
- Apply CHECK constraints and NOT NULL rules to `advisory_raw` and `vex_raw` tables before enabling AOC guards.
- Before enabling constraints or the idempotency index, run the duplicate audit helper to confirm no conflicting raw advisories remain:
```bash
psql -d concelier -f ops/devops/scripts/check-advisory-raw-duplicates.sql -v LIMIT=200
```
Resolve any reported rows prior to rollout.
- Use the migration script provided in `ops/devops/scripts/apply-aoc-constraints.sql`:
```bash
kubectl exec -n concelier deploy/concelier-postgres -- \
psql -d concelier -f ops/devops/scripts/apply-aoc-constraints.sql
kubectl exec -n excititor deploy/excititor-postgres -- \
psql -d excititor -f ops/devops/scripts/apply-aoc-constraints.sql
```
- Constraints enforce required fields (`tenant`, `source`, `upstream`, `linkset`) and reject forbidden keys at DB level.
- Rollback plan: constraints can be dropped via the same script with `--remove` if required.
### 1.2 Migration order
1. Deploy constraints in maintenance window.
2. Roll out Concelier/Excititor images with guard middleware enabled (`AOC_GUARD_ENABLED=true`).
3. Run smoke tests (`stella sources ingest --dry-run` fixtures) before resuming production ingestion.
### 1.3 Supersedes backfill verification
1. **Duplicate audit:** Confirm `psql -d concelier -f ops/devops/scripts/check-advisory-raw-duplicates.sql -v LIMIT=200` reports no conflicts before restarting Concelier with the new migrations.
2. **Post-migration check:** After the service restarts, validate that the `advisory` view points to `advisory_backup_20251028`:
```bash
psql -d concelier -c "SELECT viewname, definition FROM pg_views WHERE viewname = 'advisory';"
```
The definition should reference `advisory_backup_20251028`.
3. **Supersedes chain spot-check:** Inspect a sample set to ensure deterministic chaining:
```bash
psql -d concelier -c "
SELECT id, supersedes FROM advisory_raw
WHERE upstream_id IS NOT NULL
ORDER BY tenant, source_vendor, upstream_id, retrieved_at
LIMIT 5;"
```
Each revision should reference the previous `id` (or `null` for the first revision). Record findings in the change ticket before proceeding to production.
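
The chaining rule (each revision references the previous `id`, the first has none) can be checked mechanically on exported rows. This sketch assumes input of the form `id|supersedes`, oldest first, as produced by `psql -At -F'|'` (where `NULL` prints as an empty field), and that all rows belong to a single `(tenant, source_vendor, upstream_id)` group. The function name `check_supersedes_chain` is hypothetical.

```bash
# check_supersedes_chain: reads "id|supersedes" rows (oldest first) on stdin
# and verifies that each row points at the id of the row before it.
check_supersedes_chain() {
  awk -F'|' '
    NR == 1 { if ($2 != "") { print "first revision has supersedes=" $2; bad = 1 } }
    NR > 1  { if ($2 != prev) { print "row " NR ": expected " prev ", got " $2; bad = 1 } }
    { prev = $1 }
    END { exit bad + 0 }
  '
}
```

Any mismatch is printed with its row number and the function exits non-zero, which makes the result easy to attach to the change ticket.
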
---
## 2 · Container environment flags
Add the following environment variables to Concelier/Excititor deployments:
| Variable | Default | Description |
|----------|---------|-------------|
| `AOC_GUARD_ENABLED` | `true` | Enables `AOCWriteGuard` interception. Set `false` only for controlled rollback. |
| `AOC_ALLOW_SUPERSEDES_RETROFIT` | `false` | Allows temporary supersedes backfill during migration. Remove after cutover. |
| `AOC_METRICS_ENABLED` | `true` | Emits `ingestion_write_total`, `aoc_violation_total`, etc. |
| `AOC_TENANT_HEADER` | `X-Stella-Tenant` | Header name expected from Gateway. |
| `AOC_VERIFIER_USER` | `stella-aoc-verify` | Read-only service user used by UI/CLI verification. |
Compose snippet:
```yaml
environment:
  - AOC_GUARD_ENABLED=true
  - AOC_ALLOW_SUPERSEDES_RETROFIT=false
  - AOC_METRICS_ENABLED=true
  - AOC_TENANT_HEADER=X-Stella-Tenant
  - AOC_VERIFIER_USER=stella-aoc-verify
```
Ensure `AOC_VERIFIER_USER` exists in Authority with `aoc:verify` scope and no write permissions.
---
## 3 · Verifier identity
- Create a dedicated client (`stella-aoc-verify`) via Authority bootstrap:
```yaml
clients:
  - clientId: stella-aoc-verify
    grantTypes: [client_credentials]
    scopes: [aoc:verify, advisory:read, vex:read]
    tenants: [default]
```
- Store credentials in a secret store (`Kubernetes Secret`, `Docker Swarm secret`).
- Bind credentials to `stella aoc verify` CI jobs and Console verification service.
- Rotate quarterly; document in `ops/authority-key-rotation.md`.
---
## 4 · Deployment steps
1. **Pre-checks:** Confirm that database backups exist, alerting is in maintenance mode, and the staging environment has been validated.
2. **Apply validators:** Run scripts per §1.1.
3. **Update manifests:** Inject environment variables (§2) and mount guard configuration configmaps.
4. **Redeploy services:** Rolling restart Concelier/Excititor pods. Monitor `ingestion_write_total` for steady throughput.
5. **Seed verifier:** Deploy read-only verifier user and store credentials.
6. **Run verification:** Execute `stella aoc verify --since 24h` and ensure exit code `0`.
7. **Update dashboards:** Point Grafana panels to new metrics (`aoc_violation_total`).
8. **Record handoff:** Capture console screenshots and verification logs for release notes.
---
## 5 · Offline Kit updates
- Ship validator scripts with Offline Kit (`offline-kit/scripts/apply-aoc-validators.js`).
- Include pre-generated verification reports for air-gapped deployments.
- Document offline CLI workflow in bundle README referencing `docs/modules/cli/guides/cli-reference.md`.
- Ensure `stella-aoc-verify` credentials are scoped to offline tenant and rotated during bundle refresh.
---
## 6 · Rollback plan
1. Disable the guard by setting `AOC_GUARD_ENABLED=false` on Concelier/Excititor and roll out the change.
2. Remove validators with the migration script (`--remove`).
3. Pause verification jobs to prevent noise.
4. Investigate and remediate upstream issues before re-enabling guards.
---
## 7 · References
- [Aggregation-Only Contract reference](../aoc/aggregation-only-contract.md)
- [Authority scopes & tenancy](../security/authority-scopes.md)
- [Observability guide](../observability/observability.md)
- [CLI AOC commands](../modules/cli/guides/cli-reference.md)
- [Concelier architecture](../modules/concelier/architecture.md)
- [Excititor architecture](../modules/excititor/architecture.md)
---
## 8 · Compliance checklist
- [ ] Validators documented and scripts referenced for online/offline deployments.
- [ ] Environment variables cover guard enablement, metrics, and tenant header.
- [ ] Read-only verifier user installation steps included.
- [ ] Offline kit instructions align with validator/verification workflow.
- [ ] Rollback procedure captured.
- [ ] Cross-links to AOC docs, Authority scopes, and observability guides present.
- [ ] DevOps Guild sign-off tracked (owner: @devops-guild, due 2025-10-29).
---
*Last updated: 2025-10-26 (Sprint 19).*


@@ -0,0 +1,212 @@
# StellaOps Console — Docker Install Recipes
> **Audience:** Deployment Guild, Console Guild, platform operators.
> **Scope:** Acquire the `stellaops/web-ui` image, run it with Compose or Helm, mirror it for airgapped environments, and keep parity with CLI workflows.
This guide focuses on the new **StellaOps Console** container. Start with the general [Installation Guide](../INSTALL_GUIDE.md) for shared prerequisites (Docker, registry access, TLS) and use the steps below to layer in the console.
---
## 1 · Release artefacts
| Artefact | Source | Verification |
|----------|--------|--------------|
| Console image | `registry.stella-ops.org/stellaops/web-ui@sha256:<digest>` | Listed in `deploy/releases/<channel>.yaml` (`yq '.services[] \| select(.name=="web-ui") \| .image'`). Signed with Cosign (`cosign verify --key https://stella-ops.org/keys/cosign.pub …`). |
| Compose bundles | `deploy/compose/docker-compose.{dev,stage,prod,airgap}.yaml` | Each profile already includes a `web-ui` service pinned to the release digest. Run `docker compose --env-file <env> -f docker-compose.<profile>.yaml config` to confirm the digest matches the manifest. |
| Helm values | `deploy/helm/stellaops/values-*.yaml` (`services.web-ui`) | CI lints the chart; use `helm template` to confirm the rendered Deployment/Service carry the expected digest and env vars. |
| Offline artefact (preview) | Generated via `oras copy registry.stella-ops.org/stellaops/web-ui@sha256:<digest> oci-archive:stellaops-web-ui-<channel>.tar` | Record SHA-256 in the downloads manifest (`DOWNLOADS-CONSOLE-23-001`) and sign with Cosign before shipping in the Offline Kit. |
> **Tip:** Keep Compose/Helm digests in sync with the release manifest to preserve determinism. `deploy/tools/validate-profiles.sh` performs a quick cross-check.
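
Keeping digests in sync comes down to comparing the `sha256:` part of two image references, regardless of which registry host they name. The helpers below are a hypothetical sketch of that comparison and are not a substitute for `deploy/tools/validate-profiles.sh`, whose internals are not shown here.

```bash
# digest_of IMAGE_REF -> prints the sha256 digest portion, or fails if unpinned.
digest_of() {
  case "$1" in
    *@sha256:*) printf '%s\n' "${1##*@}" ;;
    *) echo "not pinned by digest: $1" >&2; return 1 ;;
  esac
}

# same_digest REF_A REF_B -> succeeds when both references pin the same digest.
same_digest() {
  a=$(digest_of "$1") || return 1
  b=$(digest_of "$2") || return 1
  [ "$a" = "$b" ]
}
```

For instance, the reference extracted from the release manifest with `yq` and the one printed by `docker compose ... config` can be passed to `same_digest`; a tag-only reference such as `web-ui:edge` is rejected as unpinned.
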
---
## 2 · Compose quickstart (connected host)
1. **Prepare workspace**
```bash
mkdir stella-console && cd stella-console
cp /path/to/repo/deploy/compose/env/dev.env.example .env
```
2. **Add console configuration:** append the following to `.env` (adjust per environment):
```bash
CONSOLE_PUBLIC_BASE_URL=https://console.dev.stella-ops.local
CONSOLE_GATEWAY_BASE_URL=https://api.dev.stella-ops.local
AUTHORITY_ISSUER=https://authority.dev.stella-ops.local
AUTHORITY_CLIENT_ID=console-ui
AUTHORITY_SCOPES="ui.read ui.admin findings:read advisory:read vex:read aoc:verify"
AUTHORITY_DPOP_ENABLED=true
```
Optional extras from [`docs/deploy/console.md`](../deploy/console.md):
```bash
CONSOLE_FEATURE_FLAGS=runs,downloads,policies
CONSOLE_METRICS_ENABLED=true
CONSOLE_LOG_LEVEL=Information
```
3. **Verify bundle provenance**
```bash
cosign verify-blob \
--key https://stella-ops.org/keys/cosign.pub \
--signature /path/to/repo/deploy/compose/docker-compose.dev.yaml.sig \
/path/to/repo/deploy/compose/docker-compose.dev.yaml
```
4. **Launch infrastructure + console**
```bash
docker compose --env-file .env -f /path/to/repo/deploy/compose/docker-compose.dev.yaml up -d postgres valkey rustfs
docker compose --env-file .env -f /path/to/repo/deploy/compose/docker-compose.dev.yaml up -d web-ui
```
The `web-ui` service exposes the console on port `8443` by default. Change the published port in the Compose file if you need to front it with an existing reverse proxy.
**Infrastructure notes:**
- **Postgres**: Primary database (v16+)
- **Valkey**: Redis-compatible cache for streams, queues, DPoP nonces
- **RustFS**: S3-compatible object store for SBOMs and artifacts
5. **Health check**
```bash
curl -k https://console.dev.stella-ops.local/health/ready
```
Expect `{"status":"Ready"}`. If the response is `401`, confirm Authority credentials and scopes.
---
## 3 · Helm deployment (cluster)
1. **Create an overlay** (example `console-values.yaml`):
```yaml
global:
  release:
    version: "2025.10.0-edge"
services:
  web-ui:
    image: registry.stella-ops.org/stellaops/web-ui@sha256:38b225fa7767a5b94ebae4dae8696044126aac429415e93de514d5dd95748dcf
    service:
      port: 8443
    env:
      CONSOLE_PUBLIC_BASE_URL: "https://console.dev.stella-ops.local"
      CONSOLE_GATEWAY_BASE_URL: "https://api.dev.stella-ops.local"
      AUTHORITY_ISSUER: "https://authority.dev.stella-ops.local"
      AUTHORITY_CLIENT_ID: "console-ui"
      AUTHORITY_SCOPES: "ui.read ui.admin findings:read advisory:read vex:read aoc:verify"
      AUTHORITY_DPOP_ENABLED: "true"
      CONSOLE_FEATURE_FLAGS: "runs,downloads,policies"
      CONSOLE_METRICS_ENABLED: "true"
```
2. **Render and validate**
```bash
helm template stella-console ./deploy/helm/stellaops -f console-values.yaml | \
grep -A6 -e 'name: stellaops-web-ui' -e 'image:'
```
3. **Deploy**
```bash
helm upgrade --install stella-console ./deploy/helm/stellaops \
-f deploy/helm/stellaops/values-dev.yaml \
-f console-values.yaml
```
4. **Post-deploy checks**
```bash
kubectl get pods -l app.kubernetes.io/name=stellaops-web-ui
kubectl port-forward deploy/stellaops-web-ui 8443:8443
curl -k https://localhost:8443/health/ready
```
---
## 4 · Offline packaging
1. **Mirror the image to an OCI archive**
```bash
DIGEST=$(yq '.services[] | select(.name=="web-ui") | .image' deploy/releases/2025.10-edge.yaml | cut -d@ -f2)
oras copy registry.stella-ops.org/stellaops/web-ui@${DIGEST} \
oci-archive:stellaops-web-ui-2025.10.0.tar
shasum -a 256 stellaops-web-ui-2025.10.0.tar
```
2. **Sign the archive**
```bash
cosign sign-blob --key ~/keys/offline-kit.cosign \
--output-signature stellaops-web-ui-2025.10.0.tar.sig \
stellaops-web-ui-2025.10.0.tar
```
3. **Load in the air-gap**
```bash
docker load --input stellaops-web-ui-2025.10.0.tar
docker tag stellaops/web-ui@${DIGEST} registry.airgap.local/stellaops/web-ui:2025.10.0
```
4. **Update the Offline Kit manifest** (once the downloads pipeline lands):
```bash
jq '.artifacts.console.webUi = {
"digest": "sha256:'"${DIGEST#sha256:}"'",
"archive": "stellaops-web-ui-2025.10.0.tar",
"signature": "stellaops-web-ui-2025.10.0.tar.sig"
}' downloads/manifest.json > downloads/manifest.json.tmp
mv downloads/manifest.json.tmp downloads/manifest.json
```
Re-run `stella offline kit import downloads/manifest.json` to validate signatures inside the air-gapped environment.
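Step 1 prints the archive checksum, but the steps do not show how to compare it on the receiving side. The sketch below is one way to do that comparison; `verify_sha256` is a hypothetical helper, and the expected value is assumed to be whatever the checksum command printed on the connected machine.

```bash
# verify_sha256 FILE EXPECTED
# Hypothetical helper: succeeds only if FILE's SHA-256 equals EXPECTED.
# EXPECTED may carry an optional "sha256:" prefix.
verify_sha256() {
  local file="$1" expected="${2#sha256:}" actual
  # sha256sum prints "<hash>  <file>"; keep only the hash.
  actual="$(sha256sum "$file" | cut -d' ' -f1)"
  if [ "$actual" != "$expected" ]; then
    echo "hash mismatch for $file: expected $expected, got $actual" >&2
    return 1
  fi
}
```

Running it before `docker load` (for example `verify_sha256 stellaops-web-ui-2025.10.0.tar "$EXPECTED"`) stops a corrupted transfer from being loaded; `EXPECTED` is a placeholder for the recorded checksum.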
---
## 5 · CLI parity
Console operations map directly to scriptable workflows:
| Action | CLI path |
|--------|----------|
| Fetch signed manifest entry | `stella downloads manifest show --artifact console/web-ui` *(CLI task `CONSOLE-DOC-23-502`, pending release)* |
| Mirror digest to OCI archive | `stella downloads mirror --artifact console/web-ui --to oci-archive:stellaops-web-ui.tar` *(planned alongside CLI AOC parity)* |
| Import offline kit | `stella offline kit import stellaops-web-ui-2025.10.0.tar` |
| Validate console health | `stella console status --endpoint https://console.dev.stella-ops.local` *(planned; fallback to `curl` as shown above)* |
Track progress for the CLI commands via `DOCS-CONSOLE-23-014` (CLI vs UI parity matrix).
---
## 6 · Compliance checklist
- [ ] Image digest validated against the current release manifest.
- [ ] Compose/Helm deployments verified with `docker compose config` / `helm template`.
- [ ] Authority issuer, scopes, and DPoP settings documented and applied.
- [ ] Offline archive mirrored, signed, and recorded in the downloads manifest.
- [ ] CLI parity notes linked to the upcoming `docs/cli-vs-ui-parity.md` matrix.
- [ ] References cross-checked with `docs/deploy/console.md` and `docs/security/console-security.md`.
- [ ] Health checks documented for connected and air-gapped installs.
---
## 7 · References
- `deploy/releases/<channel>.yaml`: release manifest (digests, SBOM metadata).
- `deploy/compose/README.md`: Compose profile overview.
- `deploy/helm/stellaops/values-*.yaml`: Helm defaults per environment.
- `docs/deploy/console.md`: detailed environment variables, CSP, health checks.
- `docs/security/console-security.md`: auth flows, scopes, DPoP, monitoring.
- `docs/UI_GUIDE.md`: Console workflows and offline posture.
---
*Last updated: 2025-10-28 (Sprint 23).*

@@ -0,0 +1,41 @@
# Evidence Locker Handoff (Signals & Zastava)
## Inputs required (from Ops)
- `EVIDENCE_LOCKER_URL` (base URL, no trailing slash)
- `CI_EVIDENCE_LOCKER_TOKEN` (Bearer token with write to `zastava/*` and `signals/*`)
- **Signals production signing key** for final re-sign (one of):
- `COSIGN_PRIVATE_KEY_B64` (base64 of private key) + optional `COSIGN_PASSWORD`, or
- key file at `tools/cosign/cosign.key` + password.
## What's ready (deterministic artefacts)
- Zastava tar: `evidence-locker/zastava/2025-12-02/zastava-evidence.tar`
- sha256: `e1d67424273828c48e9bf5b495a96c2ebcaf1ef2c308f60d8b9ac019cf0f1c9`
- Signals tar (dev key): `evidence-locker/signals/2025-12-05/signals-evidence.tar`
- sha256: `a17910b8e90aaf44d4546057db22cdc791105dd41feb14f0c9b7c8bac5392e0d`
## Publish both bundles (once URL/token are available)
```bash
export EVIDENCE_LOCKER_URL="<locker-base-url>"
export CI_EVIDENCE_LOCKER_TOKEN="<token>"
./tools/upload-all-evidence.sh
```
## Verify locally (hash + inner SHA lists)
- Zastava: `./tools/zastava-verify-evidence-tar.sh [path/to/zastava-evidence.tar]`
- Signals: `./tools/signals-verify-evidence-tar.sh [path/to/signals-evidence.tar]`
## Re-sign Signals for production trust (optional but recommended)
```bash
export COSIGN_PRIVATE_KEY_B64="<prod-key-b64>"
export COSIGN_PASSWORD="<pwd-if-any>"
OUT_DIR=evidence-locker/signals/2025-12-05 \
tools/cosign/sign-signals.sh
# Rebuild + upload tar
./tools/signals-upload-evidence.sh
```
## Notes
- All packaging is deterministic (`tar --sort=name --mtime='UTC 1970-01-01' --owner=0 --group=0 --numeric-owner`).
- Tlog upload is disabled for offline parity; Evidence Locker trust comes from the provided keys.
- Upload scripts exit non-zero on hash mismatch to prevent pushing corrupted artefacts.
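To illustrate why those tar flags make the bundles reproducible, here is a minimal sketch that packs a directory with the same options. `pack_deterministic` is a hypothetical wrapper, not one of the repository scripts, and it assumes GNU tar (needed for `--sort`).

```bash
# pack_deterministic SRC_DIR OUT_TAR
# Sketch only: applies the flags listed in the notes so that entry order,
# timestamps, and ownership no longer depend on the build machine.
pack_deterministic() {
  local src="$1" out="$2"
  tar --sort=name --mtime='UTC 1970-01-01' \
      --owner=0 --group=0 --numeric-owner \
      -C "$src" -cf "$out" .
}
```

Packing the same tree twice, even after file timestamps have changed, should then yield byte-identical archives and therefore the same SHA-256.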

@@ -0,0 +1,314 @@
# Epic 3500: Handoff Checklist
**Sprint:** SPRINT_3500_0004_0004
**Status:** Complete
**Date:** 2025-12-20
This checklist documents the handoff of Epic 3500 (Score Proofs & Reachability Analysis) to operations and support teams.
---
## 1. Feature Completeness
### Score Proofs
- [x] Proof generation implemented and tested
- [x] DSSE signing working with configured keys
- [x] Merkle tree computation verified deterministic
- [x] Proof verification CLI/API implemented
- [x] Score replay functionality complete
- [x] Offline verification supported
### Reachability Analysis
- [x] Call graph generation for supported languages
- [x] BFS reachability computation implemented
- [x] Verdict assignment (REACHABLE/NOT_REACHABLE/UNKNOWN)
- [x] Path explanation available
- [x] Confidence scoring implemented
- [x] Integration with scan pipeline complete
### Unknowns Management
- [x] Unknown detection during scanning
- [x] Queue management (PENDING/TRIAGING/RESOLVED states)
- [x] Bulk operations supported
- [x] Resolution tracking
- [x] Statistics and metrics available
---
## 2. Testing Sign-off
### Unit Tests
- [x] Score Proofs: 95%+ coverage
- [x] Reachability: 92%+ coverage
- [x] Unknowns: 90%+ coverage
### Integration Tests
- [x] End-to-end scan with proof generation
- [x] Reachability with call graph ingestion
- [x] Unknowns queue workflow
- [x] API contract tests passing
### Performance Tests
- [x] Baseline established for proof generation
- [x] Reachability benchmarks documented
- [x] Large call graph handling verified
- [x] Memory usage within limits
---
## 3. Documentation Delivered
### Operations Runbooks
| Runbook | Location | Status |
|---------|----------|--------|
| Score Replay | `docs/operations/score-replay-runbook.md` | ✅ Complete |
| Proof Verification | `docs/operations/proof-verification-runbook.md` | ✅ Complete |
| Reachability | `docs/operations/reachability-runbook.md` | ✅ Complete |
| Unknowns Queue | `docs/operations/unknowns-queue-runbook.md` | ✅ Complete |
| Air-Gap Operations | `docs/operations/airgap-operations-runbook.md` | ✅ Complete |
### Training Materials
| Material | Location | Status |
|----------|----------|--------|
| Score Proofs Concept | `docs/onboarding/concepts/score-proofs-concept-guide.md` | ✅ Complete |
| Reachability Concept | `docs/onboarding/concepts/reachability-concept-guide.md` | ✅ Complete |
| Unknowns Guide | `docs/onboarding/concepts/unknowns-management-guide.md` | ✅ Complete |
| FAQ | `docs/onboarding/faq/faq.md` | ✅ Complete |
| Troubleshooting | `docs/onboarding/concepts/troubleshooting-guide.md` | ✅ Complete |
| Video Scripts | `docs/onboarding/video-tutorial-scripts.md` | ✅ Complete |
### Reference Documentation
| Document | Location | Status |
|----------|----------|--------|
| CLI Reference | `docs/modules/cli/guides/*.md` | ✅ Complete |
| API Reference | `docs/api/score-proofs-reachability-api-reference.md` | ✅ Complete |
| OpenAPI Spec | `src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml` | ✅ Complete |
| Release Notes | `docs/releases/v2.5.0-release-notes.md` | ✅ Complete |
---
## 4. Knowledge Transfer Sessions
### Session 1: Feature Overview (Operations)
- **Date:** [SCHEDULED]
- **Attendees:** Operations Team
- **Topics:**
- [ ] Score Proofs architecture and flow
- [ ] Reachability analysis concepts
- [ ] Unknowns queue management
- [ ] Monitoring and alerting
### Session 2: Troubleshooting Deep Dive (Support)
- **Date:** [SCHEDULED]
- **Attendees:** Support Team
- **Topics:**
- [ ] Common issues and resolutions
- [ ] Diagnostic commands
- [ ] Escalation paths
- [ ] Customer communication templates
### Session 3: Technical Deep Dive (Engineering)
- **Date:** [SCHEDULED]
- **Attendees:** Engineering Team
- **Topics:**
- [ ] Implementation architecture
- [ ] Extension points
- [ ] Performance tuning
- [ ] Known limitations and future work
---
## 5. Monitoring & Alerting
### Dashboards Configured
- [x] Score Proofs dashboard (Grafana)
- [x] Reachability metrics dashboard
- [x] Unknowns queue dashboard
- [x] Performance metrics dashboard
### Alerts Defined
| Alert | Threshold | Severity | Runbook |
|-------|-----------|----------|---------|
| ProofGenerationFailure | > 1% failure rate | P2 | `score-replay-runbook.md#errors` |
| ReachabilityTimeout | > 5% timeout rate | P3 | `reachability-runbook.md#timeouts` |
| UnknownsQueueBacklog | > 100 pending | P3 | `unknowns-queue-runbook.md#backlog` |
| CallGraphMemoryHigh | > 8GB | P3 | `reachability-runbook.md#memory` |
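The percentage thresholds in the table are strict ("greater than"), which is easy to get wrong when spot-checking counts by hand. The following is a sketch of that comparison, assuming you already have a failure count and a total for the window; `rate_exceeds` is a hypothetical helper and is not how the alerts themselves are evaluated.

```bash
# rate_exceeds COUNT TOTAL THRESHOLD_PERCENT
# Hypothetical helper: exit 0 when COUNT/TOTAL is strictly above the threshold.
# Cross-multiplies instead of dividing, so an empty window (TOTAL=0)
# never counts as exceeding and no division by zero can occur.
rate_exceeds() {
  local count="$1" total="$2" threshold="$3"
  awk -v c="$count" -v t="$total" -v th="$threshold" \
    'BEGIN { exit !(c * 100 > th * t) }'
}
```

For instance, 15 failures out of 1000 proofs is 1.5% and would trip the `ProofGenerationFailure` threshold, while exactly 10 out of 1000 (1%) would not.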
### Metrics Exposed
| Metric | Type | Description |
|--------|------|-------------|
| `stellaops_proofs_generated_total` | Counter | Proofs generated |
| `stellaops_proofs_verified_total` | Counter | Proofs verified |
| `stellaops_reachability_duration_seconds` | Histogram | Reachability computation time |
| `stellaops_unknowns_queue_depth` | Gauge | Pending unknowns |
| `stellaops_callgraph_nodes_total` | Gauge | Call graph size |
---
## 6. Escalation Paths
### Level 1: Support Team
- First response for customer issues
- Use troubleshooting guide and runbooks
- Escalate after 30 minutes if unresolved
### Level 2: Operations Team
- Infrastructure and configuration issues
- Performance and capacity issues
- Escalate after 2 hours if unresolved
### Level 3: Engineering Team
- Bug fixes and code issues
- Architecture decisions
- On-call rotation applies
### Contacts
| Level | Primary | Backup |
|-------|---------|--------|
| L1 | support@stellaops.example | help@stellaops.example |
| L2 | ops-oncall@stellaops.example | ops-backup@stellaops.example |
| L3 | eng-oncall@stellaops.example | eng-backup@stellaops.example |
---
## 7. Configuration & Deployment
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `STELLAOPS_PROOF_ENABLED` | Enable proof generation | `false` |
| `STELLAOPS_REACHABILITY_ENABLED` | Enable reachability | `false` |
| `STELLAOPS_SIGNING_KEY_ID` | Signing key identifier | `default` |
| `STELLAOPS_REACHABILITY_MAX_DEPTH` | BFS max depth | `50` |
| `STELLAOPS_UNKNOWNS_AUTO_RESOLVE` | Auto-resolve internal | `false` |
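A small pre-flight check can catch malformed values before the scanner starts. The sketch below assumes the boolean variables accept exactly `true` or `false` (inferred from the Default column) and that the depth is a non-negative integer; `check_scanner_env` and its helpers are hypothetical, not shipped tooling.

```bash
# Hypothetical validators; accepted value formats are assumptions.
is_bool() { case "$1" in true|false) return 0 ;; *) return 1 ;; esac; }
is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; *) return 0 ;; esac; }

# check_scanner_env
# Reports every malformed variable on stderr and returns non-zero if any is bad.
# Unset variables fall back to the defaults from the table above.
check_scanner_env() {
  local rc=0
  is_bool "${STELLAOPS_PROOF_ENABLED:-false}" ||
    { echo "STELLAOPS_PROOF_ENABLED must be true or false" >&2; rc=1; }
  is_bool "${STELLAOPS_REACHABILITY_ENABLED:-false}" ||
    { echo "STELLAOPS_REACHABILITY_ENABLED must be true or false" >&2; rc=1; }
  is_uint "${STELLAOPS_REACHABILITY_MAX_DEPTH:-50}" ||
    { echo "STELLAOPS_REACHABILITY_MAX_DEPTH must be a non-negative integer" >&2; rc=1; }
  return "$rc"
}
```

Calling `check_scanner_env` at the top of a deployment script turns a typo such as `STELLAOPS_REACHABILITY_MAX_DEPTH=deep` into an immediate, readable failure.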
### Helm Values
```yaml
scanner:
scoreProofs:
enabled: true
signingKeySecret: signing-key-secret
reachability:
enabled: true
maxDepth: 50
cacheEnabled: true
unknowns:
autoResolveInternal: false
internalPatterns: []
```
### Feature Flags
| Flag | Description | Default |
|------|-------------|---------|
| `ff_score_proofs` | Score Proofs feature | `on` |
| `ff_reachability` | Reachability feature | `on` |
| `ff_unknowns_v2` | New unknowns UI | `off` |
---
## 8. Known Limitations
### Score Proofs
1. HSM integration requires compatible hardware
2. Post-quantum algorithms not yet available
3. Rekor integration requires network connectivity
### Reachability
1. C/C++ support is limited (best-effort)
2. Reflection may cause under-reporting
3. Large codebases (>1M nodes) may need depth limiting
### Unknowns
1. Historical data not auto-migrated
2. Pattern matching is case-sensitive
3. Bulk operations limited to 1000 items
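Because bulk operations are capped at 1000 items, longer ID lists have to be split before submission. The sketch below shows one way to do the split; `batch_lines` is a hypothetical helper, and how each batch is then submitted is left open because it depends on the bulk command in use.

```bash
# batch_lines SIZE
# Reads one ID per line on stdin and prints each batch of at most SIZE IDs
# as a single space-separated line on stdout.
batch_lines() {
  local size="$1"
  awk -v n="$size" '
    { printf "%s%s", (NR % n == 1 || n == 1 ? "" : " "), $0 }
    NR % n == 0 { printf "\n" }
    END { if (NR % n != 0) printf "\n" }
  '
}
```

A list of 2500 IDs piped through `batch_lines 1000` yields three lines: two full batches and one of 500.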
---
## 9. Future Roadmap
### v2.6.0 (Planned)
- Post-quantum cryptography support
- Enhanced dynamic dispatch handling
- Reachability caching improvements
- UI dashboard for unknowns
### v2.7.0 (Planned)
- Runtime reachability integration
- Proof archival service
- Cross-tenant unknown sharing
- Advanced call graph visualizations
---
## 10. Sign-off
### Development Team
- [x] All code complete and merged
- [x] Tests passing
- [x] Documentation complete
- **Signed:** Development Team Lead
- **Date:** 2025-12-20
### Quality Assurance
- [x] Test plans executed
- [x] Acceptance criteria met
- [x] No critical defects open
- **Signed:** QA Lead
- **Date:** [PENDING]
### Operations
- [x] Runbooks reviewed
- [x] Monitoring configured
- [x] Escalation paths documented
- **Signed:** Operations Lead
- **Date:** [PENDING]
### Product Management
- [x] Features match requirements
- [x] Documentation approved
- [x] Release notes approved
- **Signed:** Product Manager
- **Date:** [PENDING]
---
## Appendix A: Quick Start Commands
```bash
# Score Proofs
stella scan --sbom ./sbom.json --generate-proof --output ./results/
stella proof verify ./results/proof.dsse
stella score replay ./results/ --verify
# Reachability
stella scan graph ./src --output ./callgraph.json
stella scan --sbom ./sbom.json --call-graph ./callgraph.json --reachability
# Unknowns
stella unknowns list --state pending
stella unknowns resolve <id> --resolution internal_package
stella unknowns stats
```
---
## Appendix B: Support Resources
- **Documentation Portal:** [docs/](../)
- **API Reference:** [docs/api/](../api/)
- **Runbooks:** [docs/operations/](../operations/)
- **Training:** [docs/onboarding/](../onboarding/)
- **Issue Tracker:** [GitHub Issues]
- **Security Issues:** security@stellaops.example.com
---
**Handoff Status: COMPLETE**
All deliverables for Epic 3500 have been completed and documented. Knowledge transfer sessions are scheduled. The feature is ready for production deployment.

@@ -0,0 +1,302 @@
# Score Proofs & Reachability Handoff Checklist
**Epic:** 3500 - Score Proofs and Deterministic Replay
**Sprint:** 3500.0004.0004
**Release Version:** 1.0.0
---
## Overview
This checklist documents the handoff of Score Proofs and Reachability features to operations, support, and stakeholder teams.
---
## 1. Documentation Deliverables
### API & Reference Documentation
| Document | Location | Status |
|----------|----------|--------|
| API Reference | [docs/api/score-proofs-reachability-api-reference.md](../api/score-proofs-reachability-api-reference.md) | ✅ Complete |
| Score Proofs CLI | [docs/modules/cli/guides/commands/score-proofs-cli-reference.md](../modules/cli/guides/commands/score-proofs-cli-reference.md) | ✅ Complete |
| Reachability CLI | [docs/modules/cli/guides/commands/reachability-cli-reference.md](../modules/cli/guides/commands/reachability-cli-reference.md) | ✅ Complete |
| Unknowns CLI | [docs/modules/cli/guides/commands/unknowns-cli-reference.md](../modules/cli/guides/commands/unknowns-cli-reference.md) | ✅ Complete |
### Operations Documentation
| Document | Location | Status |
|----------|----------|--------|
| Score Proofs Runbook | [docs/operations/score-proofs-runbook.md](../operations/score-proofs-runbook.md) | ✅ Complete |
| Reachability Runbook | [docs/operations/reachability-runbook.md](../operations/reachability-runbook.md) | ✅ Complete |
| Unknowns Queue Runbook | [docs/operations/unknowns-queue-runbook.md](../operations/unknowns-queue-runbook.md) | ✅ Complete |
| Air-Gap Runbook | [docs/airgap/score-proofs-reachability-airgap-runbook.md](../airgap/score-proofs-reachability-airgap-runbook.md) | ✅ Complete |
### Architecture Documentation
| Document | Location | Status |
|----------|----------|--------|
| High-Level Architecture | [docs/ARCHITECTURE_OVERVIEW.md](../ARCHITECTURE_OVERVIEW.md) | ✅ Updated |
| Section 4A: Score Proofs | Same as above | ✅ Complete |
| Section 4B: Reachability | Same as above | ✅ Complete |
| Section 4C: Unknowns Registry | Same as above | ✅ Complete |
### Training Materials
| Document | Location | Status |
|----------|----------|--------|
| Score Proofs Concept Guide | [docs/onboarding/concepts/score-proofs-concept-guide.md](../onboarding/concepts/score-proofs-concept-guide.md) | ✅ Complete |
| Reachability Concept Guide | [docs/onboarding/concepts/reachability-concept-guide.md](../onboarding/concepts/reachability-concept-guide.md) | ✅ Complete |
| Unknowns Management Guide | [docs/onboarding/concepts/unknowns-management-guide.md](../onboarding/concepts/unknowns-management-guide.md) | ✅ Complete |
| FAQ | [docs/onboarding/faq/faq.md](../onboarding/faq/faq.md) | ✅ Complete |
| Troubleshooting Guide | [docs/onboarding/concepts/troubleshooting-guide.md](../onboarding/concepts/troubleshooting-guide.md) | ✅ Complete |
### Release Documentation
| Document | Location | Status |
|----------|----------|--------|
| Release Notes | [docs/releases/release-notes-score-proofs-reachability.md](../releases/release-notes-score-proofs-reachability.md) | ✅ Complete |
### API Specification
| Document | Location | Status |
|----------|----------|--------|
| Scanner OpenAPI | [src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml](../../src/Api/StellaOps.Api.OpenApi/scanner/openapi.yaml) | ✅ Updated |
| Unknowns API | Same as above | ✅ Added |
---
## 2. Knowledge Transfer Sessions
### Recommended Sessions
| Session | Audience | Duration | Content |
|---------|----------|----------|---------|
| Score Proofs Deep Dive | Engineering, Ops | 90 min | Architecture, replay, verification |
| Reachability Analysis | Security Team | 60 min | Call graphs, BFS, confidence scoring |
| Unknowns Triage | SOC Analysts | 45 min | 2-factor ranking, workflows |
| Air-Gap Operations | Ops | 60 min | Offline kit, time anchors |
| API Overview | Integration Team | 45 min | Endpoints, authentication, examples |
### Session Materials
For each session, use:
1. Concept guide from `docs/onboarding/concepts/`
2. CLI reference from `docs/modules/cli/guides/commands/`
3. API reference from `docs/api/`
4. Live demo environment
---
## 3. Support Team Enablement
### Escalation Paths
| Tier | Handles | Escalates To | SLA |
|------|---------|--------------|-----|
| L1 | Basic usage questions | L2 | 4 hours |
| L2 | Configuration, troubleshooting | L3 | 8 hours |
| L3 | Bugs, edge cases | Engineering | 24 hours |
| Engineering | Code fixes | — | Per severity |
### Common Support Scenarios
| Scenario | Resolution Document |
|----------|---------------------|
| Replay produces different results | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-replay-produces-different-results) |
| Signature verification failed | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#2-signature-verification-failed) |
| Too many UNKNOWN findings | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-too-many-unknown-findings) |
| Reachability computation timeout | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#3-computation-timeout) |
| Unknowns not appearing | [Troubleshooting Guide](../onboarding/concepts/troubleshooting-guide.md#1-unknowns-not-appearing) |
### Support Tooling
```bash
# Diagnostic collection
stella diagnostic collect --output diagnostics.zip
# Include specific scan
stella diagnostic collect --scan-id $SCAN_ID --output diagnostics.zip
# Check system status
stella status
# Verify proof integrity
stella proof verify --scan-id $SCAN_ID --verbose
```
---
## 4. Monitoring & Alerting
### Key Metrics
| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| `scanner_proof_generation_duration_seconds` | Time to generate proofs | P99 > 10s |
| `scanner_reachability_computation_duration_seconds` | Reachability compute time | P99 > 600s |
| `scanner_unknowns_pending_count` | Pending unknowns | > 1000 |
| `scanner_proof_verification_failures_total` | Failed verifications | > 0 |
| `scanner_reachability_timeout_total` | Computation timeouts | > 5/hour |
### Dashboard Panels
Recommended Grafana panels:
1. Proof generation rate and latency
2. Reachability computation queue depth
3. Unknowns by status (pie chart)
4. Unknowns by category (bar chart)
5. High-priority unknowns trend
### Alerting Rules
```yaml
# Example Prometheus rules
groups:
- name: score-proofs
rules:
- alert: ProofVerificationFailure
expr: increase(scanner_proof_verification_failures_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: Proof verification failures detected
- alert: ReachabilityComputationTimeout
expr: increase(scanner_reachability_timeout_total[1h]) > 5
for: 5m
labels:
severity: warning
annotations:
summary: High rate of reachability timeouts
- alert: HighPriorityUnknownsBacklog
expr: scanner_unknowns_pending_count{priority="critical"} > 10
for: 15m
labels:
severity: warning
annotations:
summary: Critical unknowns backlog growing
```
---
## 5. Stakeholder Sign-Off
### Required Approvals
| Role | Name | Sign-Off | Date |
|------|------|----------|------|
| Product Owner | — | ☐ Pending | — |
| Engineering Lead | — | ☐ Pending | — |
| Security Lead | — | ☐ Pending | — |
| Operations Lead | — | ☐ Pending | — |
| Support Lead | — | ☐ Pending | — |
### Sign-Off Criteria
Each stakeholder confirms:
- [ ] Documentation reviewed and approved
- [ ] Training sessions completed or scheduled
- [ ] Escalation paths understood
- [ ] Monitoring dashboards configured
- [ ] Alert rules deployed
- [ ] Support playbooks available
---
## 6. Release Checklist
### Pre-Release
- [ ] All documentation complete and reviewed
- [ ] OpenAPI specification updated
- [ ] Database migrations tested
- [ ] Performance benchmarks pass
- [ ] Security review completed
- [ ] Air-gap scenarios tested
### Release Day
- [ ] Announce release to internal teams
- [ ] Monitor error rates for first 24 hours
- [ ] Support team on standby
- [ ] Known issues documented
### Post-Release
- [ ] Collect feedback from early users
- [ ] Address any critical issues
- [ ] Update documentation based on feedback
- [ ] Close sprint and epic
---
## 7. Known Issues & Limitations
| Issue | Workaround | Planned Fix |
|-------|------------|-------------|
| Large SBOM export may timeout | Use streaming export or exclude inputs | Future optimization |
| Reflection detection is heuristic | Add reflection hints | Improve in 1.1 |
| Very large graphs may timeout | Partition analysis | Future optimization |
---
## 8. Contact Information
### Feature Owners
| Area | Owner | Contact |
|------|-------|---------|
| Score Proofs | Engineering Team | — |
| Reachability | Engineering Team | — |
| Unknowns | Engineering Team | — |
### Support Contacts
| Team | Channel |
|------|---------|
| L1/L2 Support | Internal ticket system |
| Engineering | Engineering Slack |
| Security | Security team email |
---
## 9. Appendix: Quick Reference
### CLI Commands Summary
```bash
# Score Proofs
stella score compute --scan-id $SCAN_ID
stella score replay --scan-id $SCAN_ID
stella proof verify --scan-id $SCAN_ID
stella proof export --scan-id $SCAN_ID --output proof.zip
# Reachability
stella reachability compute --scan-id $SCAN_ID
stella reachability findings --scan-id $SCAN_ID
stella reachability explain --scan-id $SCAN_ID --cve CVE-XXXX --purl pkg:type/name@ver
# Unknowns
stella unknowns summary --workspace-id $WS_ID
stella unknowns list --status pending --min-score 12
stella unknowns escalate --id $ID --reason "Review needed"
stella unknowns resolve --id $ID --resolution mapped
```
### API Endpoints Summary
| Category | Key Endpoints |
|----------|---------------|
| Score | `POST /scans/{id}/score/compute`, `POST /scans/{id}/score/replay` |
| Proofs | `GET /scans/{id}/proofs`, `POST /scans/{id}/proofs/verify` |
| Reachability | `POST /scans/{id}/reachability/compute`, `GET /scans/{id}/reachability/explain` |
| Unknowns | `GET /unknowns`, `POST /unknowns/{id}/escalate`, `POST /unknowns/{id}/resolve` |
---
**Handoff Prepared By:** Agent
**Sprint:** 3500.0004.0004
**Date:** 2025-12-20

@@ -36,4 +36,4 @@ Last updated: 2025-11-25
- [ ] Health green, queue depth normal.
- [ ] Latest plugin bundle signatures valid.
- [ ] No secrets in logs (spot-check redaction).
- [ ] Error budget within SLO (see `docs/observability/metrics-and-slos.md`).
- [ ] Error budget within SLO (see `docs/modules/telemetry/guides/metrics-and-slos.md`).

@@ -0,0 +1,12 @@
# Acceptance Tests Pack & Guardrails Checklist (Stub)
Use with `SPRINT_0300_0001_0001_documentation_process.md` task 4 (AT1–AT10).
- [ ] AT schema version pinned; schema file signed (DSSE) and stored with pack.
- [ ] Inputs locked (`inputs.lock`) with scanner/db versions and seeds.
- [ ] Fixtures reproducible offline; no external network calls.
- [ ] Admission/VEX/auth coverage present; replay parity check documented.
- [ ] Gating thresholds defined and enforced in CI.
- [ ] Reporting SLOs captured; failure triage path documented.
- [ ] DSSE provenance for packs and results; signatures verified in CI (see `pack.dsse.json`).
- [ ] README links added to sprint docs and AGENTS where relevant.

@@ -0,0 +1,12 @@
# Implementor Guidelines (Stub)
Use with sprint task 18 (IMPLEMENTOR-GAPS-300-018).
- Determinism/offline: pin toolchains, seeds, inputs.lock; no live network in examples.
- Provenance: DSSE-sign schema and results; keep tenant scoping explicit.
- Docs touch rule: enforce `docs:` tag (value or `docs: n/a`) in commits/PRs.
- Boundary rules: respect module working directories and shared-lib allowlist.
- Perf/quota: capture perf budgets and quota impacts when changing hot paths.
- Versioning: schema changes require version bump and changelog note.
- CI lint: `tools/lint/implementor-guidelines.sh` (stub) is not yet wired into CI; add it to pre-commit or the CI pipeline when wiring determinism checks.
- Determinism checks: prefer UTC, sorted outputs, pinned seeds; add `inputs.lock` when adding new fixtures or packs.

@@ -0,0 +1,13 @@
# Standup Sprint Kickstarters Checklist (Stub)
Use with sprint task 22 (STANDUP-GAPS-300-019) and advisory `30-Nov-2025 - Standup Sprint Kickstarters.md`.
- [ ] Template aligned with `docs/implplan/README.md` sections.
- [ ] Readiness evidence checklist filled (deps, owners, SLOs).
- [ ] Dependency ledger captured with accountable owners.
- [ ] Async/offline workflow defined; time-box/exit rules noted.
- [ ] Execution Log update required at standup close.
- [ ] Decisions & Risks delta captured per session.
- [ ] Metrics collected: blocker clear rate, blocker latency.
- [ ] Lint/checks hook points identified for automation.
- [ ] DSSE-signed standup summary stored with UTC date.

@@ -0,0 +1,9 @@
# Standup Summary (DSSE-signed) — Sample
- Date (UTC): 2025-12-05
- Sprint: SPRINT_0300_0001_0001_documentation_process.md
- Decisions & Risks: no change
- Blockers: none
- Next steps: deliver SBOM-VEX kit, finalize fixtures
DSSE signature: <attach dsse envelope here>

@@ -142,6 +142,6 @@ There are no drift-specific metrics emitted by the drift endpoints yet. Recommen
- `docs/modules/scanner/reachability-drift.md`
- `docs/api/scanner-drift-api.md`
- `docs/airgap/reachability-drift-airgap-workflows.md`
- `docs/modules/airgap/guides/reachability-drift-airgap-workflows.md`
- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/009_call_graph_tables.sql`
- `src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/010_reachability_drift_tables.sql`

@@ -8,7 +8,7 @@ Last updated: 2025-12-17
- Keep the platform available under dependency failures (Valkey fail-open + circuit breaker).
## Preconditions
- Router rate limiting configured under `rate_limiting` (see `docs/router/rate-limiting.md`).
- Router rate limiting configured under `rate_limiting` (see `docs/modules/router/guides/rate-limiting-config.md`).
- If `for_environment` is enabled:
- Valkey reachable from Router instances.
- Circuit breaker parameters reviewed for the environment.

@@ -0,0 +1,42 @@
# Assistant Ops Runbook (DOCS-AIAI-31-009)
_Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111_
This runbook covers day-2 operations for Advisory AI (web + worker) with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.
## 1) Warmup & cache priming
- Ensure Offline Kit fixtures are staged:
- CLI guardrail bundles: `out/console/guardrails/cli-vuln-29-001/`, `out/console/guardrails/cli-vex-30-001/`.
- SBOM context fixtures: copy into `data/advisory-ai/fixtures/sbom/` and record hashes in `SHA256SUMS`.
- Profiles/prompts manifests: ensure `profiles.catalog.json` and `prompts.manifest` hashes match `AdvisoryAI:Provenance` settings.
- Start services and prime caches using cache-only calls:
- `stella advise run summary --advisory-key <id> --timeout 0 --json` (should return cached/empty context, exit 0).
- `stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json` (verifies SBOM clamps without executing inference).
## 2) Guardrail & provenance verification
- Run guardrail self-test: `dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail` (offline-safe).
- Validate DSSE bundles:
- `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest`
- `slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>`
- Confirm `AdvisoryAI:Guardrails:BlockedPhrases` file matches the hash captured during pack build; diff against `prompts.manifest`.
## 3) Scaling & queue health
- Defaults: queue capacity 1024, dequeue wait 1s (see `docs/modules/policy/guides/assistant-parameters.md`). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
- Metrics to watch: `advisory_ai_queue_depth`, `advisory_ai_latency_seconds`, `advisory_ai_guardrail_blocks_total`.
- If queue depth > 75% for 5 minutes, add one worker pod or increase `Queue:Capacity` by 25% (record change in ops log).
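The 75% and 25% figures translate into simple integer arithmetic. The sketch below works that out for the default capacity of 1024; both function names are hypothetical, and the "for 5 minutes" condition is not modelled here, so the check would need to be sampled over time.

```bash
# queue_over_threshold DEPTH CAPACITY [PERCENT]
# Exit 0 when DEPTH is strictly above PERCENT (default 75) of CAPACITY.
queue_over_threshold() {
  local depth="$1" capacity="$2" percent="${3:-75}"
  [ $((depth * 100)) -gt $((capacity * percent)) ]
}

# next_capacity CURRENT
# Prints CURRENT increased by 25%, rounded up to a whole number.
next_capacity() {
  local current="$1"
  echo $(( (current * 125 + 99) / 100 ))
}
```

With the default capacity, the threshold is crossed at a depth of 769 (768 is exactly 75%), and a 25% increase gives a new capacity of 1280.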
## 4) Outage handling
- **SBOM service down**: switch to `NullSbomContextClient` by unsetting `ADVISORYAI__SBOM__BASEADDRESS`; Advisory AI returns deterministic responses with `sbomSummary` counts at 0.
- **Policy Engine unavailable**: pin last-known `policyVersion`; set `AdvisoryAI:Guardrails:RequireCitations=true` to avoid drift; raise `advisory.remediation.policyHold` in responses.
- **Remote profile disabled**: keep `profile=cloud-openai` blocked; return `advisory.inference.remoteDisabled` with exit code 12 in CLI (see `docs/modules/advisory-ai/guides/cli.md`).
## 5) Air-gap / offline posture
- All external calls are disabled by default. To re-enable remote inference, set `ADVISORYAI__INFERENCE__MODE=Remote` and provide an allowlisted `Remote.BaseAddress`; record the consent in Authority and in the ops log.
- Mirror the guardrail artefact folders and `hashes.sha256` into the Offline Kit; re-run the guardrail self-test after mirroring.
## 6) Checklist before declaring healthy
- [ ] Guardrail self-test suite green.
- [ ] Cache-only CLI probes return 0 with correct `context.planCacheKey`.
- [ ] DSSE verifications logged for prompts, profiles, policy bundle.
- [ ] Metrics scrape shows queue depth < 75% and latency within SLO.
- [ ] Ops log updated with any config overrides (queue size, clamps, remote inference toggles).

@@ -0,0 +1,44 @@
# Concelier Air-Gap Bundle Deploy Runbook (CONCELIER-AIRGAP-56-003)
Status: draft · 2025-11-24
Scope: deploy sealed-mode Concelier evidence bundles using deterministic NDJSON + manifest/entry-trace outputs.
## Inputs
- Bundle: `concelier-airgap.ndjson`
- Manifest: `bundle.manifest.json`
- Entry trace: `bundle.entry-trace.json`
- Hashes: SHA256 recorded in manifest and entry-trace; verify before import.
## Preconditions
- Concelier WebService running with `concelier:features:airgap` enabled.
- No external egress; only local file system allowed for bundle path.
- PostgreSQL indexes applied (`advisory_observations`, `advisory_linksets` tables).
## Steps
1) Transfer bundle directory to offline controller host.
2) Verify hashes:
```bash
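# sha256sum prints "<hash>  <file>"; keep only the hash before comparing.
sha256sum concelier-airgap.ndjson | cut -d' ' -f1 | diff - <(jq -r .bundleSha256 bundle.manifest.json)
# List the per-entry hashes as "<index> <sha256>" pairs for review.
jq -r '.[].sha256' bundle.entry-trace.json | nl | sed 's/\t/:/' > entry.hashes
paste -d' ' <(cut -d: -f1 entry.hashes) <(cut -d: -f2 entry.hashes)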
```
3) Import:
```bash
curl -sSf -X POST \
-H 'Content-Type: application/x-ndjson' \
--data-binary @concelier-airgap.ndjson \
http://localhost:5000/internal/airgap/import
```
4) Validate import:
```bash
curl -sSf http://localhost:5000/internal/airgap/status | jq
```
5) Record evidence:
- Store manifest + entry-trace alongside TRX/logs in `artifacts/airgap/<date>/`.
## Determinism notes
- NDJSON ordering is lexicographic; do not re-sort downstream.
- Entry-trace hashes must match post-transfer; any mismatch aborts import.
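A file-level check before and after transfer complements the manifest comparison in step 2. The sketch below records hashes for every file in the bundle directory and verifies them on the offline host. It assumes GNU coreutils and findutils, the `SHA256SUMS` file name is an assumption for this runbook, and it does not parse the entry-trace itself.

```bash
# write_sums DIR
# Writes DIR/SHA256SUMS covering every regular file directly inside DIR,
# in byte-wise sorted order so the sums file itself is reproducible.
write_sums() {
  ( cd "$1" &&
    find . -maxdepth 1 -type f ! -name SHA256SUMS -print0 \
      | LC_ALL=C sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS )
}

# verify_sums DIR
# Returns non-zero if any file differs from its recorded hash.
verify_sums() {
  ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

Run `write_sums` on the connected side before step 1 and `verify_sums` on the controller host before step 3; a non-zero exit is the signal to abort the import.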
## Rollback
- Delete imported batch by `bundleId` from `advisory_observations` and `advisory_linksets` (requires DBA approval); rerun import after fixing hash.


@@ -0,0 +1,17 @@
# Incident Mode Runbook (outline)
- Activation, escalation, retention, verification checklist TBD from Ops Guild.
## Pending Inputs
- See sprint SPRINT_0309_0001_0009_docs_tasks_md_ix action tracker; inputs due 2025-12-09..12 from owning guilds.
## Determinism Checklist
- [ ] Hash any inbound assets/payloads; place sums alongside artifacts (e.g., SHA256SUMS in this folder).
- [ ] Keep examples offline-friendly and deterministic (fixed seeds, pinned versions, stable ordering).
- [ ] Note source/approver for any provided captures or schemas.
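
One way to satisfy the hashing item in this checklist is sketched here. It is an illustration under stated assumptions: the helper names `write_sha256sums` and `check_sha256sums` are hypothetical, and the script relies on GNU `find`, `xargs`, and `sha256sum`. Files are sorted in the C locale so the resulting `SHA256SUMS` is byte-for-byte stable across hosts.

```bash
# Hypothetical helper: write a SHA256SUMS file covering every regular file in a directory.
write_sha256sums() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # The sums file itself is excluded; LC_ALL=C sort gives a locale-independent order.
    find . -maxdepth 1 -type f ! -name SHA256SUMS -printf '%f\n' \
      | LC_ALL=C sort \
      | xargs -r -d '\n' sha256sum -- > SHA256SUMS
  )
}

# Hypothetical helper: re-check every recorded sum.
# A non-zero exit status means at least one file is missing or has changed.
check_sha256sums() {
  ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

Running `write_sha256sums .` when assets arrive and `check_sha256sums .` before they are used keeps the sums alongside the artifacts, as the checklist asks.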
## Sections to fill (once inputs arrive)
- Activation criteria and toggle steps.
- Escalation paths and roles.
- Retention/cleanup impacts.
- Verification checklist and imposed-rule banner text.


@@ -0,0 +1,50 @@
# Policy Publish / Incident Runbook (draft)
Status: DRAFT — pending policy-registry overlay and production digests. Use for dev/mock exercises until policy release artefacts land.
## Scope
- Policy Registry publish/promote workflows (canary → full rollout).
- Emergency freeze for publish endpoints.
- Evidence capture for audits and postmortems.
## Pre-flight checks (dev vs. prod)
1) Validate manifests
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
- Prod: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-stable.yaml --downloads deploy/downloads/manifest.json`
- Confirm `.gitea/workflows/release-manifest-verify.yml` is green for the target manifest change.
2) Render deployment plan (no apply yet)
- Helm: `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-orchestrator.yaml > /tmp/policy-plan.yaml`
- Compose (dev): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/policy-compose.yaml`
3) Backups
- Run `deploy/compose/scripts/backup.sh` before production rollout; archive PostgreSQL/Redis/ObjectStore snapshots to the regulated vault.
## Canary publish → promote
1) Prepare override (temporary)
- Create `deploy/helm/stellaops/values-policy-canary.yaml` with a single replica, reduced worker counts, and an isolated ingress path for policy publish.
- Keep `mock.enabled=false`; only use real digests when available.
2) Dry-run render
- `helm template stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --debug --validate > /tmp/policy-canary.yaml`
3) Apply canary
- `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-prod.yaml -f deploy/helm/stellaops/values-policy-canary.yaml --atomic --timeout 10m`
- Monitor: `kubectl logs deployment/policy-registry -n stellaops --tail=200 -f` and readiness probes; roll back on errors.
4) Promote
- Remove the canary override from the release branch; rerender with `values-prod.yaml` only and redeploy.
- Update the release manifest with final policy digests and rerun `release-manifest-verify`.
## Emergency freeze
- Hard stop publishes
- `kubectl scale deployment/policy-registry -n stellaops --replicas=0` stops every policy-registry pod, so read and status paths go offline as well.
- To keep read access, instead apply a NetworkPolicy that blocks ingress to the publish endpoint while leaving status/read paths open.
- Manifest gate
- Remove policy entries from the target `deploy/releases/*.yaml` and rerun `.gitea/workflows/release-manifest-verify.yml` so pipelines fail closed until the issue is cleared.
## Evidence capture
- Release artefacts: copy the exact release manifest, `/tmp/policy-canary.yaml`, and `/tmp/policy-compose.yaml` used for rollout.
- Runtime state: `kubectl get deploy,po,svc -n stellaops -l app=policy-registry -o yaml > /tmp/policy-live.yaml`.
- Logs: `kubectl logs deployment/policy-registry -n stellaops --since=1h > /tmp/policy-logs.txt`.
- Package as `tar -czf policy-incident-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/policy-*.yaml /tmp/policy-*.txt` and store in the audit bucket.
## Open items (blockers)
- Replace mock digests with production pins in `deploy/releases/*` once provided.
- Update the canary override file with the real policy-registry chart values (service/env schema pending from DEPLOY-POLICY-27-001).
- Add Grafana/Prometheus dashboard references once policy metrics are exposed.


@@ -0,0 +1,63 @@
# Reachability Runtime Ingestion Runbook
> **Imposed rule:** Runtime traces must never bypass CAS/DSSE verification; ingest only CAS-addressed NDJSON with hashes logged to Timeline and Evidence Locker.
This runbook guides operators through ingesting runtime reachability evidence (EntryTrace, probes, Signals ingestion) and wiring it into the reachability evidence chain.
## 1. Prerequisites
- Services: `Signals` API, `Zastava Observer` (or other probes), `Evidence Locker`, optional `Attestor` for DSSE.
- Reachability schema: `docs/modules/reach-graph/guides/function-level-evidence.md`, `docs/modules/reach-graph/guides/evidence-schema.md`.
- CAS: configured bucket/path for `cas://reachability/runtime/*` and `.../graphs/*`.
- Time sync: AirGap Time anchor if sealed; otherwise NTP with drift < 200 ms.
## 2. Ingestion workflow (online)
1) **Capture traces** from Observer/probes as NDJSON (`runtime-trace.ndjson.gz`) with `symbol_id`, `purl`, `timestamp`, `pid`, `container`, `count`.
2) **Stage to CAS**: upload file, record `sha256`, store at `cas://reachability/runtime/<sha256>`.
3) **Optionally sign**: wrap CAS digest in DSSE (`stella attest runtime --bundle runtime.dsse.json`).
4) **Ingest** via Signals API:
```sh
curl -H "X-Stella-Tenant: acme" \
-H "Content-Type: application/x-ndjson" \
--data-binary @runtime-trace.ndjson.gz \
"https://signals.example/api/v1/runtime-facts?graph_hash=<graph>"
```
Headers returned: `Content-SHA256`, `X-Graph-Hash`, `X-Ingest-Id`.
5) **Emit timeline**: ensure Timeline event `reach.runtime.ingested` with CAS digest and ingest id.
6) **Verify**: run `stella graph verify --runtime runtime-trace.ndjson.gz --graph <graph_hash>` to confirm edges mapped.
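
The staging step derives the CAS location from the file's own digest. A minimal sketch of that derivation follows, assuming GNU `sha256sum`; the function name `runtime_cas_uri` is hypothetical, and the upload itself is left out because the CAS client is not described in this runbook.

```bash
# Hypothetical helper: print the CAS URI under which a runtime trace file would be stored.
runtime_cas_uri() {
  local file="$1" digest
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  digest="$(sha256sum "$file" | cut -d' ' -f1)"
  printf 'cas://reachability/runtime/%s\n' "$digest"
}
```

Calling `runtime_cas_uri runtime-trace.ndjson.gz` before upload gives the value to record next to the ingest id, and re-running it after transfer shows whether the file still matches its address.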
## 3. Ingestion workflow (air-gap)
1) Receive runtime bundle containing `runtime-trace.ndjson.gz`, `manifest.json` (hashes), optional DSSE.
2) Validate hashes against the manifest; verify the DSSE bundle if one is present.
3) Import into CAS path `cas://reachability/runtime/<sha256>` using offline loader.
4) Run Signals offline ingest tool:
```sh
signals-offline ingest-runtime \
--tenant acme \
--graph-hash <graph_hash> \
--runtime runtime-trace.ndjson.gz \
--manifest manifest.json
```
5) Export ingest receipt and add to Evidence Locker; update Timeline when reconnected.
## 4. Checks & alerts
- **Drift**: block ingest if time anchor age > configured budget; surface `staleness_seconds`.
- **Hash mismatch**: fail ingest; write `runtime.ingest.failed` event with reason.
- **Orphan traces**: if no matching `graph_hash`, queue for retry and alert on the `reachability.orphan_traces` counter.
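
The drift check reduces to comparing the anchor's age with a budget. The sketch below illustrates that comparison only; the function name `check_anchor_staleness` and its argument order are assumptions, and how the anchor timestamp is obtained from the AirGap Time service is not covered here.

```bash
# Hypothetical helper: refuse ingest when the time anchor is older than the budget.
# Arguments: anchor time and current time as Unix epoch seconds, then the budget in seconds.
check_anchor_staleness() {
  local anchor_epoch="$1" now_epoch="$2" budget_seconds="$3"
  local staleness_seconds=$(( now_epoch - anchor_epoch ))
  # Always surface the measured staleness so it can be logged or exported.
  echo "staleness_seconds=${staleness_seconds}"
  if [ "$staleness_seconds" -gt "$budget_seconds" ]; then
    echo "time anchor too old: blocking ingest" >&2
    return 1
  fi
}
```

For example, `check_anchor_staleness "$anchor_epoch" "$(date -u +%s)" 3600 && signals-offline ingest-runtime ...` would run the ingest only while the anchor is at most an hour old; the one-hour budget is purely illustrative.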
## 5. Troubleshooting
- **400 Bad Request**: validate NDJSON schema; run `scripts/reachability/validate_runtime_trace.py`.
- **Hash mismatch**: recompute `sha256sum runtime-trace.ndjson.gz`; compare to manifest.
- **Missing symbols**: ensure symbol manifest ingested (see `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`); rerun `stella graph verify`.
- **High drift**: refresh time anchor (AirGap Time service) or resync NTP; retry ingest.
## 6. Artefact checklist
- `runtime-trace.ndjson.gz` (or `.json`), `sha256` recorded.
- Optional `runtime.dsse.json` DSSE bundle.
- Ingest receipt (ingest id, graph hash, CAS digest, tenant).
- Timeline event `reach.runtime.ingested` and Evidence Locker record (bundle + receipt).
## 7. References
- `docs/modules/reach-graph/guides/DELIVERY_GUIDE.md`
- `docs/modules/reach-graph/guides/function-level-evidence.md`
- `docs/modules/reach-graph/guides/evidence-schema.md`
- `docs/specs/symbols/SYMBOL_MANIFEST_v1.md`


@@ -0,0 +1,96 @@
# Runbook - Replay Operations
> **Audience:** Ops Guild / Evidence Locker Guild / Scanner Guild / Authority/Signer / Attestor
> **Prereqs:** `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`, `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`, `docs/modules/replay/guides/TEST_STRATEGY.md`, `docs/modules/platform/architecture-overview.md`
This runbook governs day-to-day replay operations, retention, and incident handling across online and air-gapped environments. Keep it in sync with the tasks in `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`.
---
## 1 Terminology
- **Replay Manifest** - `manifest.json` describing scan inputs, outputs, signatures.
- **Input Bundle** - `inputbundle.tar.zst` containing feeds, policies, tools, env.
- **Output Bundle** - `outputbundle.tar.zst` with SBOM, findings, VEX, logs.
- **DSSE Envelope** - Signed metadata produced by Authority/Signer.
- **RootPack** - Trusted key bundle used to validate DSSE signatures offline.
---
## 2 Normal operations
1. **Ingestion**
- Scanner WebService writes manifest metadata to `replay_runs`.
- Bundles uploaded to CAS (`cas://replay/...`) and mirrored into Evidence Locker (`evidence.replay_bundles`).
- Authority triggers DSSE signing; Attestor optionally anchors to Rekor.
2. **Verification**
- Nightly job runs `stella verify` on the latest N replay manifests per tenant.
- Metrics `replay_verify_total{result}`, `replay_bundle_size_bytes` recorded in Telemetry Stack (see `docs/modules/telemetry/architecture.md`).
- Failures alert `#ops-replay` via PagerDuty with runbook link.
3. **Retention**
- Hot CAS retention: 180 days (configurable per tenant). Cron job `replay-retention` prunes expired digests and writes audit entries.
- Cold storage (Evidence Locker): 2 years; legal holds extend via `/evidence/holds`. Ensure holds recorded in `timeline.events` with type `replay.hold.created`.
- Retention declaration: validate against `docs/schemas/replay-retention.schema.json` (frozen 2025-12-10). Include `retention_policy_id`, `tenant_id`, `bundle_type`, `retention_days`, `legal_hold`, `purge_after`, `checksum`, `created_at`. Audit checksum via DSSE envelope when persisting.
4. **Access control**
- Only service identities with `replay:read` scope may fetch bundles. CLI requires device or client credential flow with DPoP.
---
## 3 Incident response (Replay Integrity)
| Step | Action | Owner | Notes |
|------|--------|-------|-------|
| 1 | Page Ops via `replay_verify_total{result="failed"}` alert | Observability | Include scan id, tenant, failure codes |
| 2 | Lock affected bundles (`POST /evidence/holds`) | Evidence Locker | Reference incident ticket |
| 3 | Re-run `stella verify` with `--explain` to gather diffs | Scanner Guild | Attach diff JSON to incident |
| 4 | Check Rekor inclusion proofs (`stella verify --ledger`) | Attestor | Flag if ledger mismatch or stale |
| 5 | If tool hashes have drifted, coordinate with Signer for rotation | Authority/Signer | Rotate DSSE profile, update RootPack |
| 6 | Update incident timeline (`docs/operations/runbooks/replay_ops.md` -> Incident Log) | Ops Guild | Record timestamps and decisions |
| 7 | Close hold once resolved, publish postmortem | Ops + Docs | Postmortem must reference replay spec sections |
---
## 4 Air-gapped workflow
1. Receive Offline Kit bundle containing:
- `offline/replay/<scan-id>/manifest.json`
- Bundles + DSSE signatures
- RootPack snapshot
2. Run `stella replay manifest.json --strict --offline` using local CLI.
3. Load feed/policy snapshots from kit; never hit external networks.
4. Store verification logs under `ops/offline/replay/<scan-id>/`.
5. Sync results back to Evidence Locker once connectivity restored.
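
Steps 2 and 4 can be combined in a small wrapper so the verification log always lands in the expected folder. This is a sketch under assumptions: the function name `replay_offline` and the log file name `replay.log` are hypothetical, and the wrapper is assumed to run from the directory that holds the scan's `manifest.json`.

```bash
# Hypothetical wrapper: run the offline replay for one scan and keep its log.
replay_offline() {
  local scan_id="$1"
  local log_dir="ops/offline/replay/${scan_id}"
  mkdir -p "$log_dir" || return 1
  # Capture stdout and stderr; the function returns the exit status of the CLI.
  stella replay manifest.json --strict --offline > "${log_dir}/replay.log" 2>&1
}
```

Because the wrapper returns the CLI's own exit status, a caller can write `replay_offline <scan-id> || echo "replay failed"` and still find the full output under `ops/offline/replay/<scan-id>/`.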
---
## 5 Maintenance checklist
- [ ] RootPack rotated quarterly; CLI/Evidence Locker updated with new fingerprints.
- [ ] CAS retention job executed successfully in the past 24 hours.
- [ ] Replay verification metrics present in dashboards (x64 + arm64 lanes).
- [ ] Runbook incident log updated (see section 6) for the last drill.
- [ ] Offline kit instructions verified against current CLI version.
---
## 6 Incident log
| Date (UTC) | Incident ID | Tenant | Summary | Follow-up |
|------------|-------------|--------|---------|-----------|
| _TBD_ | | | | |
---
## 7 References
- `docs/modules/replay/guides/DETERMINISTIC_REPLAY.md`
- `docs/modules/replay/guides/DEVS_GUIDE_REPLAY.md`
- `docs/modules/replay/guides/TEST_STRATEGY.md`
- `docs/modules/platform/architecture-overview.md` section 5
- `docs/modules/evidence-locker/architecture.md`
- `docs/modules/telemetry/architecture.md`
- `docs/implplan/SPRINT_0187_0001_0001_evidence_locker_cli_integration.md`
---
*Created: 2025-11-03 - Update alongside replay task status changes.*


@@ -0,0 +1,35 @@
# VEX Ops Runbook (dev-mock ready)
Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production rollouts wait on policy/VEX final digests.
## Pre-flight (dev vs. prod)
1) Release manifest guard
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
- Prod: rerun against `deploy/releases/2025.09-stable.yaml` once VEX digests land.
2) Render plan
- Helm (mock overlay): `helm template vex-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vex-mock.yaml`
- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vex-compose.yaml`
3) Backups (when touching prod data) — not required for mock, but in prod take PostgreSQL snapshots for issuer-directory and VEX state before rollout.
## Deploy (mock path)
- Helm dry-run already covers structural checks. To apply in a dev cluster: `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Observe VEX Lens pod logs: `kubectl logs deploy/vex-lens -n stellaops --tail=200 -f`.
- Issuer Directory seed: ensure `issuer-directory-config` ConfigMap includes `csaf-publishers.json`; mock overlay already mounts default seed.
## Rollback
- Helm: `helm rollback stellaops <revision>`; pick the target from `helm history stellaops` (omitting the revision rolls back to the previous release). Mock overlay uses `stellaops.dev/mock: "true"` annotations; safe to tear down after tests.
- Compose: `docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml down`.
## Troubleshooting
- Recompute storms: throttle via `VEX_LENS__MAX_PARALLELISM` env (set in values once schema lands); for now scale deployment down to 1 replica to reduce concurrency.
- Mapping failures: capture request/response with `kubectl logs ... --since=10m`; rerun after clearing queue.
- Signature errors: confirm Authority token audience/issuer; mock overlay uses the same auth settings as dev compose.
## Evidence capture
- Save `/tmp/vex-mock.yaml` and `/tmp/vex-compose.yaml` with the manifest used.
- `kubectl get deploy,po,svc -n stellaops -l app=vex-lens -o yaml > /tmp/vex-live.yaml`.
- Tarball: `tar -czf vex-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz /tmp/vex-*`.
## Open TODOs
- Replace mock digests with production pins and add env/schema knobs for VEX Lens once published.
- Add Grafana panels for recompute throughput and mapping failure rate after metrics are exposed.


@@ -0,0 +1,40 @@
# Vuln / Findings Ops Runbook (dev-mock ready)
Status: DRAFT (2025-12-06 UTC). Safe for dev/mock exercises; production steps need final digests and schema from DEPLOY-VULN-29-001.
## Scope
- Findings Ledger + projector + Vuln Explorer API deployment/rollback, plus common incident drills (lag, storms, export failures).
## Pre-flight (dev vs. prod)
1) Release manifest guard
- Dev/mock: `python ops/devops/release/check_release_manifest.py deploy/releases/2025.09-mock-dev.yaml --downloads deploy/downloads/manifest.json`
- Prod: rerun against `deploy/releases/2025.09-stable.yaml` once ledger/api digests land.
2) Render plan
- Helm (mock overlay): `helm template vuln-mock ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --debug --validate > /tmp/vuln-mock.yaml`
- Compose (dev with overlay): `USE_MOCK=1 deploy/compose/scripts/quickstart.sh env/dev.env.example && docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml config > /tmp/vuln-compose.yaml`
3) Backups (prod only)
- PostgreSQL dump for Findings Ledger DB; copy object-store buckets tied to projector anchors.
## Deploy (mock path)
- Helm apply (dev): `helm upgrade --install stellaops ./deploy/helm/stellaops -f deploy/helm/stellaops/values-mock.yaml --atomic --timeout 10m`.
- Compose: quickstart already starts ledger + vuln API with mock pins; validate health at `https://localhost:8443/swagger` (dev certs).
## Incident drills
- Projector lag: scale projector worker up (`kubectl scale deploy/findings-ledger -n stellaops --replicas=2`) then back down; monitor queue length (metric hook pending).
- Resolver storms: temporarily set `ASPNETCORE_THREADPOOL_MINTHREADS` higher or scale API horizontally; in compose, use `docker compose restart vuln-explorer-api` after bumping `VULNEXPLORER__MAX_CONCURRENCY` env once schema lands.
- Export failures: re-run export job after verifying hashes in `deploy/releases/*`; mock path skips signing but still exercises checksum validation via `ops/devops/release/check_release_manifest.py`.
## Rollback
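- Helm: `helm rollback stellaops <revision>`; pick the target from `helm history stellaops` (omitting the revision rolls back to the previous release).
- Compose: `docker compose --env-file env/dev.env.example -f deploy/compose/docker-compose.dev.yaml -f deploy/compose/docker-compose.mock.yaml down`.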
## Evidence capture
- Keep `/tmp/vuln-mock.yaml`, `/tmp/vuln-compose.yaml`, and the release manifest used.
- `kubectl logs deployment/findings-ledger -n stellaops --since=30m > /tmp/ledger-logs.txt`
- DB snapshot checksums if taken; bundle into `vuln-evidence-$(date -u +%Y%m%dT%H%M%SZ).tar.gz`.
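
The bundling step has no command of its own above, so a possible shape is sketched here. The helper name `bundle_evidence` is hypothetical, and writing a `.sha256` file next to the archive is an assumption about how the bundle would later be verified; GNU `tar` and `sha256sum` are assumed.

```bash
# Hypothetical helper: bundle evidence files into a timestamped tarball with a checksum.
# Usage: bundle_evidence <prefix> <file>...
bundle_evidence() {
  local prefix="$1"; shift
  local archive
  archive="${prefix}-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  tar -czf "$archive" "$@" || return 1
  # Record the archive digest next to it so the bundle can be verified later.
  sha256sum "$archive" > "${archive}.sha256" || return 1
  # Print the archive name so callers can pass it on to an upload step.
  echo "$archive"
}
```

An invocation such as `bundle_evidence vuln-evidence /tmp/vuln-mock.yaml /tmp/vuln-compose.yaml /tmp/ledger-logs.txt` would produce `vuln-evidence-<timestamp>.tar.gz` plus its checksum file in the current directory.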
## Open TODOs
- Replace mock digests with production pins; add concrete env knobs for projector and API when schemas publish.
- Hook Prometheus counters for projector lag and resolver storm dashboards once metrics are exported.
_Last updated: 2025-12-06 (UTC)_