git.stella-ops.org/docs/security/assistant-guardrails.md
2026-01-13 18:53:39 +02:00

Advisory AI Guardrails & Redaction Policy

Audience: Advisory AI guild, Security guild, Docs guild, operators consuming Advisory AI outputs.

Scope: Prompt redaction rules, injection defenses, telemetry/alert wiring, and audit guidance for Advisory AI (Epic 8).

Advisory AI accepts structured evidence from Concelier/Excititor and assembles prompts before executing downstream inference. Guardrails enforce provenance, block injection attempts, and redact sensitive content prior to handing data to any inference provider (online or offline). This document enumerates the guardrail surface and how to observe, alert, and audit it.


1 · Input validation & injection defense

Advisory prompts are rejected when any of the following checks fail:

  1. Citation coverage: every prompt must carry at least one citation with an index, document id, and chunk id. Missing or malformed citations raise the citation_missing or citation_invalid violation, respectively.
  2. Prompt length: AdvisoryGuardrailOptions.MaxPromptLength defaults to 16000 characters. Longer payloads raise prompt_too_long.
  3. Blocked phrases: the guardrail pipeline lowercases the prompt and searches it for every entry in the blocked phrase cache (ignore previous instructions, disregard earlier instructions, you are now the system, override the system prompt, please jailbreak). Each hit raises prompt_injection and increments the blocked_phrase_count metadata.
  4. Optional per-profile rules: additional phrases supplied in configuration are appended to the cache at startup and evaluated with the same logic.
  5. Token and rate budgets: per-user and per-org budgets cap prompt size, requests/min, and tool calls/day; overages raise quota_exceeded.

Any validation failure stops the pipeline before inference, sets guardrail_blocked = true in the persisted output, and increments the corresponding metric counter.
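
The checks above can be sketched as follows. This is a minimal illustration, not the Advisory AI implementation: the class and function names are hypothetical, the quota check is omitted, and counting one hit per distinct blocked phrase is an assumption. Only the default length, the phrase list, and the violation names come from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence

# Values taken from the text above.
MAX_PROMPT_LENGTH = 16000
BLOCKED_PHRASES = (
    "ignore previous instructions",
    "disregard earlier instructions",
    "you are now the system",
    "override the system prompt",
    "please jailbreak",
)


@dataclass
class Citation:  # hypothetical shape
    index: Optional[int]
    document_id: Optional[str]
    chunk_id: Optional[str]


@dataclass
class GuardrailResult:  # hypothetical shape
    violations: List[str] = field(default_factory=list)
    metadata: Dict[str, int] = field(default_factory=dict)

    @property
    def blocked(self) -> bool:
        return bool(self.violations)


def validate_prompt(
    prompt: str,
    citations: Sequence[Citation],
    max_length: int = MAX_PROMPT_LENGTH,
    extra_phrases: Sequence[str] = (),
) -> GuardrailResult:
    result = GuardrailResult()

    # 1. Citation coverage: at least one citation, each fully populated.
    if not citations:
        result.violations.append("citation_missing")
    elif any(c.index is None or not c.document_id or not c.chunk_id for c in citations):
        result.violations.append("citation_invalid")

    # 2. Prompt length.
    if len(prompt) > max_length:
        result.violations.append("prompt_too_long")

    # 3 and 4. Blocked phrases, built-in plus per-profile, matched on the
    # lowercased prompt. Assumption: one hit per distinct phrase found.
    lowered = prompt.lower()
    phrases = (*BLOCKED_PHRASES, *(p.lower() for p in extra_phrases))
    hits = sum(1 for phrase in phrases if phrase in lowered)
    result.violations.extend(["prompt_injection"] * hits)
    result.metadata["blocked_phrase_count"] = hits

    return result
```

A caller would skip inference whenever `result.blocked` is true and persist the violations alongside the output.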

2 · Redaction rules

Redactions are deterministic so caches remain stable. The current rule set (in order) is:

| Rule | Regex | Replacement |
| --- | --- | --- |
| AWS secret access keys | `(?i)(aws_secret_access_key\s*[:=]\s*)([A-Za-z0-9/+=]{40,})` | `$1[REDACTED_AWS_SECRET]` |
| Credentials/tokens | `(?i)(token apikey | |
| High entropy strings | entropy >= threshold | `[REDACTED_HIGH_ENTROPY]` |
| PEM private keys | `(?is)-----BEGIN [^-]+ PRIVATE KEY-----.*?-----END [^-]+ PRIVATE KEY-----` | `[REDACTED_PRIVATE_KEY]` |

Redaction counts are surfaced via guardrailResult.Metadata["redaction_count"] and emitted as log fields to simplify threat hunting.
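
As an illustration of ordered, deterministic redaction, the sketch below applies the AWS secret and PEM private key rules from this section and reports a redaction count. It is an assumption-laden example rather than the production scrubber: the function name and return shape are hypothetical, the credentials/token and entropy rules are left out, and the `$1` replacement is written as `\g<1>` because Python's `re` module uses that syntax.

```python
import re
from typing import Dict, Tuple

# Rules are applied in list order; patterns are copied from this section.
RULES = [
    (
        re.compile(r"(?i)(aws_secret_access_key\s*[:=]\s*)([A-Za-z0-9/+=]{40,})"),
        r"\g<1>[REDACTED_AWS_SECRET]",
    ),
    (
        re.compile(r"(?is)-----BEGIN [^-]+ PRIVATE KEY-----.*?-----END [^-]+ PRIVATE KEY-----"),
        "[REDACTED_PRIVATE_KEY]",
    ),
]


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Apply every rule in order and count the substitutions made."""
    count = 0
    for pattern, replacement in RULES:
        text, substitutions = pattern.subn(replacement, text)
        count += substitutions
    # Hypothetical metadata shape mirroring the "redaction_count" key above.
    return text, {"redaction_count": count}
```

Because the rules are pure regex substitutions applied in a fixed order, the same input always yields the same output, which is what keeps cache keys stable.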

Allowlist and entropy tuning

  • Allowlist patterns bypass redaction for known-safe identifiers (scan IDs, digest prefixes, evidence refs).
  • Entropy thresholds are configurable per profile to reduce false positives in long hex IDs.
  • Configure scrubber knobs via AdvisoryAI:Guardrails:EntropyThreshold, AdvisoryAI:Guardrails:EntropyMinLength, AdvisoryAI:Guardrails:AllowlistFile, and AdvisoryAI:Guardrails:AllowlistPatterns.
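
The interaction between the entropy threshold, the minimum length, and the allowlist can be sketched like this. The document does not state how entropy is measured or what the defaults are, so the Shannon entropy calculation, the whitespace tokenisation, and the default values below are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import Sequence, Tuple


def shannon_entropy(token: str) -> float:
    """Bits per character, computed from the token's character frequencies."""
    total = len(token)
    return -sum((n / total) * math.log2(n / total) for n in Counter(token).values())


def redact_high_entropy(
    text: str,
    threshold: float = 3.5,      # hypothetical stand-in for EntropyThreshold
    min_length: int = 20,        # hypothetical stand-in for EntropyMinLength
    allowlist_patterns: Sequence[str] = (),
) -> Tuple[str, int]:
    allow = [re.compile(p) for p in allowlist_patterns]
    redactions = 0

    def replace(match: "re.Match[str]") -> str:
        nonlocal redactions
        token = match.group(0)
        if len(token) < min_length:
            return token  # too short to judge
        if any(p.fullmatch(token) for p in allow):
            return token  # known-safe identifier
        if shannon_entropy(token) < threshold:
            return token  # low entropy, unlikely to be a secret
        redactions += 1
        return "[REDACTED_HIGH_ENTROPY]"

    # Assumption: candidate tokens are whitespace-delimited.
    return re.sub(r"\S+", replace, text), redactions
```

Raising the threshold or the minimum length reduces false positives on long hex IDs at the cost of missing shorter secrets; allowlisting a digest pattern avoids that trade-off for identifiers that are known to be safe.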

3 · Telemetry, logs, and traces

Advisory AI now exposes the following metrics (all tagged with task_type and, where applicable, cache/citation metadata):

| Metric | Type | Description |
| --- | --- | --- |
| advisory_ai_latency_seconds | Histogram | End-to-end worker latency from dequeue through persisted output. Aggregated with plan_cache_hit to compare cached vs. regenerated plans. |
| advisory_ai_guardrail_blocks_total | Counter | Number of guardrail rejections per task. |
| advisory_ai_validation_failures_total | Counter | Total validation violations emitted by the guardrail pipeline (one increment per violation instance). |
| advisory_ai_citation_coverage_ratio | Histogram | Ratio of unique citations to structured chunks (0–1). Tags include citations and structured_chunks. |
| advisory_plans_created/queued/processed | Counters | Existing plan lifecycle metrics (unchanged but now tagged by task type). |
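
To make the citation coverage ratio concrete, here is one way it could be computed. The function name, the use of chunk ids as the unit of comparison, and the handling of empty input are assumptions; the document only defines the metric as unique citations over structured chunks, bounded to 0–1.

```python
from typing import Iterable


def citation_coverage_ratio(cited_chunk_ids: Iterable[str], structured_chunk_ids: Iterable[str]) -> float:
    """Share of structured chunks that are referenced by at least one citation."""
    chunks = set(structured_chunk_ids)
    if not chunks:
        return 0.0  # assumption: no chunks means no coverage
    # Deduplicate citations and ignore ids that are not structured chunks,
    # which keeps the result inside the 0-1 range.
    cited = set(cited_chunk_ids) & chunks
    return len(cited) / len(chunks)
```

For example, two distinct cited chunks out of four structured chunks give a ratio of 0.5, regardless of how many times each chunk is cited.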

Logging

  • Successful writes: the "Stored advisory pipeline output {CacheKey}" log line now includes guardrail_blocked, validation_failures, and citation_coverage.
  • Guardrail rejection: warning log includes violation count and advisory key.
  • All dequeued jobs emit info logs carrying cache:{Cache} for quicker diagnosis.

Tracing

  • WebService (/v1/advisory-ai/pipeline*) emits advisory_ai.plan_request / plan_batch spans with tags for tenant, advisory key, cache key, and validation state.
  • Worker emits advisory_ai.process spans for each queue item with latency measurement and cache hit tags.

4 · Dashboards & alerts

Update the “Advisory AI” Grafana board with the new metrics:

  1. Latency panel: plot advisory_ai_latency_seconds p50/p95 split by plan_cache_hit. Alert when p95 > 30s for 5 minutes.
  2. Guardrail burn rate: plot advisory_ai_guardrail_blocks_total against advisory_ai_validation_failures_total. Alert when either exceeds 5 blocks/min or 1% of total traffic.
  3. Citation coverage histogram: heatmap of advisory_ai_citation_coverage_ratio to identify evidence gaps. Alert when the ratio stays below 0.6 for more than 10 minutes.

All alerts should route to #advisory-ai-ops with the tenant, task type, and recent advisory keys in the message template.
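
The alert conditions above reduce to simple threshold logic, sketched here so the intended semantics are easy to review. In practice these would be Grafana alert rules; the function names and the one-sample-per-minute input format are assumptions, and "more than 10 minutes" is approximated as ten consecutive breaching samples.

```python
from typing import Callable, Sequence


def sustained(samples: Sequence[float], breached: Callable[[float], bool], minutes: int) -> bool:
    """True when the last `minutes` per-minute samples all breach the threshold."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(breached(s) for s in recent)


def latency_alert(p95_seconds: Sequence[float]) -> bool:
    # p95 > 30s for 5 minutes.
    return sustained(p95_seconds, lambda s: s > 30.0, 5)


def burn_rate_alert(events_per_min: float, requests_per_min: float) -> bool:
    # More than 5 events/min, or more than 1% of total traffic.
    over_rate = events_per_min > 5
    over_share = requests_per_min > 0 and events_per_min / requests_per_min > 0.01
    return over_rate or over_share


def coverage_alert(coverage_ratio: Sequence[float]) -> bool:
    # Coverage below 0.6, sustained across ten per-minute samples.
    return sustained(coverage_ratio, lambda r: r < 0.6, 10)
```

`burn_rate_alert` would be evaluated once for guardrail blocks and once for validation failures, since either series can trigger the alert.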

5 · Operations & audit

  • When an alert fires: capture the guardrail log entry, relevant metrics sample, and the cached plan from the worker output store. Attach them to the incident timeline entry.
  • Tenant overrides: any request to loosen guardrails or blocked phrase lists requires a signed change request and security approval. Update AdvisoryGuardrailOptions via configuration bundles and document the reason in the change log.
  • Chat settings overrides: quotas and tool allowlists can be adjusted via the chat settings endpoints; env values remain defaults.
  • Doctor check: use /api/v1/chat/doctor to confirm quota/tool limits when chat requests are rejected.
  • Offline kit checks: ensure the offline inference bundle uses the same guardrail configuration file as production; mismatches should fail the bundle validation step.
  • Forensics: persisted outputs now contain guardrail_blocked, plan_cache_hit, and citation_coverage metadata. Include these fields when exporting evidence bundles to prove guardrail enforcement.
  • Chat audit trail: retain prompt hashes, redaction metadata, tool call hashes, and policy decisions for post-incident review.
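
A chat audit record that retains hashes instead of raw content might look like the sketch below. The field names, the choice of SHA-256, and the canonical JSON encoding of tool calls are assumptions made for illustration; the document only requires that prompt hashes, redaction metadata, tool call hashes, and policy decisions be retained.

```python
import hashlib
import json
from typing import Any, Dict, List, Mapping, Sequence


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(
    prompt: str,
    tool_calls: Sequence[Mapping[str, Any]],
    redaction_count: int,
    policy_decision: str,
) -> Dict[str, Any]:
    """Build an audit entry that stores hashes rather than raw prompt text."""
    tool_hashes: List[str] = [
        # Sorted keys and fixed separators make the hash independent of dict order.
        sha256_hex(json.dumps(call, sort_keys=True, separators=(",", ":")))
        for call in tool_calls
    ]
    return {
        "prompt_sha256": sha256_hex(prompt),
        "tool_call_sha256": tool_hashes,
        "redaction_count": redaction_count,
        "policy_decision": policy_decision,
    }
```

Storing hashes lets a reviewer confirm that a given prompt or tool call was the one processed without keeping sensitive text in the audit trail.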

Keep this document synced whenever guardrail rules, telemetry names, or alert targets change.