Assistant Ops Runbook (DOCS-AIAI-31-009)

Updated: 2025-11-24 · Owners: DevOps Guild · Advisory AI Guild · Sprint 0111

This runbook covers day-2 operations for Advisory AI (web + worker) with emphasis on cache priming, guardrail verification, and outage handling in offline/air-gapped installs.

1) Warmup & cache priming

  • Ensure Offline Kit fixtures are staged:
    • CLI guardrail bundles: out/console/guardrails/cli-vuln-29-001/, out/console/guardrails/cli-vex-30-001/.
    • SBOM context fixtures: copy into data/advisory-ai/fixtures/sbom/ and record hashes in SHA256SUMS.
    • Profiles/prompts manifests: ensure profiles.catalog.json and prompts.manifest hashes match AdvisoryAI:Provenance settings.
  • Start services and prime caches using cache-only calls (see the scripted warmup sketch after this list):
    • stella advise run summary --advisory-key <id> --timeout 0 --json (should return cached/empty context, exit 0).
    • stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json (verifies SBOM clamps without executing inference).
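
A minimal warmup sketch, assuming bash, jq, and the stella CLI are on PATH; the input file warmup-advisories.txt and the loop are illustrative, while the probe command and the context.planCacheKey field come from the steps above and the checklist in section 6:

```bash
#!/usr/bin/env bash
# Cache-only warmup: --timeout 0 exercises the cache/context path without running inference.
# warmup-advisories.txt (one advisory key per line) is an illustrative input file.
set -euo pipefail

while read -r advisory_key; do
  stella advise run summary --advisory-key "$advisory_key" --timeout 0 --json \
    | jq -e '.context.planCacheKey' >/dev/null \
    || echo "WARN: no planCacheKey for $advisory_key"
done < warmup-advisories.txt

# Repeat with 'stella advise run remediation --advisory-key <id> --artifact-id <id> --timeout 0 --json'
# for artefacts whose SBOM clamps you want to verify.
```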

2) Guardrail & provenance verification

  • Run guardrail self-test: dotnet test src/AdvisoryAI/__Tests/StellaOps.AdvisoryAI.Tests/StellaOps.AdvisoryAI.Tests.csproj --filter Guardrail (offline-safe).
  • Validate DSSE bundles:
    • slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/prompts.manifest.dsse --source prompts.manifest
    • slsa-verifier verify-attestation --bundle offline-kit/advisory-ai/provenance/policy-bundle.intoto.jsonl --digest <policy-digest>
  • Confirm AdvisoryAI:Guardrails:BlockedPhrases file matches the hash captured during pack build; diff against prompts.manifest.
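
A hash-check sketch for the blocked-phrases file; both the file path and the manifest key below are assumptions, so adjust them to your pack layout:

```bash
# Compare the live BlockedPhrases file against the hash recorded at pack build time.
# The file path and the .guardrails.blockedPhrases.sha256 key are illustrative.
actual="$(sha256sum etc/advisory-ai/guardrails/blocked-phrases.txt | cut -d' ' -f1)"
expected="$(jq -r '.guardrails.blockedPhrases.sha256' offline-kit/advisory-ai/provenance/prompts.manifest)"
if [ "$actual" = "$expected" ]; then
  echo "BlockedPhrases hash OK"
else
  echo "BlockedPhrases hash MISMATCH: $actual != $expected" >&2
fi
```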

3) Scaling & queue health

  • Defaults: queue capacity 1024, dequeue wait 1s (see docs/policy/assistant-parameters.md). For bursty tenants, scale workers horizontally before increasing queue size to preserve determinism.
  • Metrics to watch: advisory_ai_queue_depth, advisory_ai_latency_seconds, advisory_ai_guardrail_blocks_total.
  • If queue depth stays above 75% of capacity for 5 minutes, add one worker pod or increase Queue:Capacity by 25% (record the change in the ops log).
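
If the metrics land in Prometheus, a one-off check of the 75% condition can look like the sketch below; the Prometheus URL is an assumption, and 1024 is the default Queue:Capacity from above (substitute your override if you changed it):

```bash
# Queue depth as a fraction of the configured capacity (default 1024).
# Wrap the same expression in an alert rule with 'for: 5m' to match the runbook threshold.
curl -sG http://prometheus.internal:9090/api/v1/query \
  --data-urlencode 'query=advisory_ai_queue_depth / 1024 > 0.75'
```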

4) Outage handling

  • SBOM service down: switch to NullSbomContextClient by unsetting ADVISORYAI__SBOM__BASEADDRESS; Advisory AI returns deterministic responses with sbomSummary counts at 0 (see the fallback sketch after this list).
  • Policy Engine unavailable: pin last-known policyVersion; set AdvisoryAI:Guardrails:RequireCitations=true to avoid drift; raise advisory.remediation.policyHold in responses.
  • Remote profile disabled: keep profile=cloud-openai blocked; the CLI returns advisory.inference.remoteDisabled with exit code 12 (see docs/advisory-ai/cli.md).
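
A fallback sketch for the SBOM outage case, assuming a Kubernetes install with a worker deployment named advisory-ai-worker in namespace stellaops (both names are illustrative):

```bash
# Remove the SBOM base address so the worker falls back to NullSbomContextClient,
# then wait for the rollout to settle. Deployment and namespace names are illustrative.
kubectl -n stellaops set env deployment/advisory-ai-worker ADVISORYAI__SBOM__BASEADDRESS-
kubectl -n stellaops rollout status deployment/advisory-ai-worker
```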

5) Air-gap / offline posture

  • All external calls are disabled by default. To re-enable remote inference, set ADVISORYAI__INFERENCE__MODE=Remote and provide an allowlisted Remote.BaseAddress; record the consent in Authority and in the ops log.
  • Mirror the guardrail artefact folders and hashes.sha256 into the Offline Kit; re-run the guardrail self-test after mirroring.
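
A mirroring sketch, assuming the guardrail bundles live under out/console/guardrails/ (per section 1) and that hashes.sha256 sits alongside them in the Offline Kit; the destination path is illustrative:

```bash
# Mirror guardrail artefacts into the Offline Kit and verify the recorded hashes,
# then re-run the guardrail self-test from section 2.
rsync -a out/console/guardrails/ offline-kit/advisory-ai/guardrails/
(cd offline-kit/advisory-ai/guardrails && sha256sum -c hashes.sha256)
```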

6) Checklist before declaring healthy

  • Guardrail self-test suite green.
  • Cache-only CLI probes return 0 with correct context.planCacheKey.
  • DSSE verifications logged for prompts, profiles, policy bundle.
  • Metrics scrape shows queue depth below 75% of capacity and latency within SLO.
  • Ops log updated with any config overrides (queue size, clamps, remote inference toggles).