# Sovereign/Offline AI Inference with Signed Model Bundles

## Module

AdvisoryAI

## Status

IMPLEMENTED

## Description

Local LLM inference for air-gapped environments via a pluggable provider architecture supporting llama.cpp server, Ollama, OpenAI, Claude, and Gemini. DSSE-signed model bundle management with regional crypto support (eIDAS/FIPS/GOST/SM), digest verification at load time, deterministic output configuration (temperature=0, fixed seed), inference caching, a benchmarking harness, and offline replay verification.

## Implementation Details

- **Modules**: `src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/`
- **Key Classes**:
  - `SignedModelBundleManager` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/SignedModelBundleManager.cs`) - manages DSSE-signed model bundles with digest verification at load time
  - `ModelBundle` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ModelBundle.cs`) - model bundle metadata, including hash, signature, and regional crypto info
  - `LlamaCppRuntime` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlamaCppRuntime.cs`) - llama.cpp local inference runtime
  - `OnnxRuntime` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/OnnxRuntime.cs`) - ONNX runtime for local model inference
  - `AdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/AdvisoryInferenceClient.cs`) - main inference client with provider routing
  - `ProviderBasedAdvisoryInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/ProviderBasedAdvisoryInferenceClient.cs`) - provider-based inference with caching
  - `LlmBenchmark` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LlmBenchmark.cs`) - benchmarking harness for inference performance
  - `LocalInferenceOptions` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LocalInferenceOptions.cs`) - configuration for local inference (temperature, seed, context size)
  - `LocalLlmConfig` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Inference/LocalLlmConfig.cs`) - local LLM configuration (model path, quantization, GPU layers)
  - `LocalChatInferenceClient` (`src/AdvisoryAi/StellaOps.AdvisoryAI/Chat/Services/LocalChatInferenceClient.cs`) - chat-specific local inference client
- **Interfaces**: `ILocalLlmRuntime`
- **Source**: SPRINT_20251226_019_AI_offline_inference.md

## E2E Test Plan

- [ ] Load a signed model bundle via `SignedModelBundleManager` and verify that the DSSE signature and digest are validated
- [ ] Verify that `SignedModelBundleManager` rejects a model bundle with a tampered digest
- [ ] Run inference through `LlamaCppRuntime` with temperature=0 and a fixed seed and verify deterministic output
- [ ] Run `LlmBenchmark` and verify that it measures tokens/second and latency metrics
- [ ] Verify that `OnnxRuntime` loads and runs inference with an ONNX model
- [ ] Configure `LocalInferenceOptions` with air-gap settings and verify that no external network calls are made
- [ ] Verify that `ProviderBasedAdvisoryInferenceClient` caches deterministic responses and returns cached results on repeat queries
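For background on the DSSE signing mentioned above: DSSE verifiers do not sign the raw payload but its pre-authentication encoding (PAE), which binds the payload type to the payload body, per the DSSE v1 specification. The sketch below (Python, for illustration only; the actual implementation is C# and the regional signature algorithms are out of scope here) shows how that byte string is constructed:

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE v1 pre-authentication encoding: the exact byte string that is
    signed and verified. Format per the DSSE spec:
    "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
    """
    type_bytes = payload_type.encode("utf-8")
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)
```

Because the payload type is length-prefixed inside the signed bytes, an attacker cannot shift bytes between the type and the body without invalidating the signature.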
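The load-time digest check exercised by the first two E2E tests (recompute the model artifact's hash and compare it against the digest recorded in the signed bundle metadata) can be sketched as follows. This is an illustrative Python sketch, not the `SignedModelBundleManager` API; the `sha256:` digest prefix is an assumption:

```python
import hashlib
import hmac

def compute_sha256(path: str) -> str:
    """Stream the file through SHA-256 in chunks so large GGUF/ONNX
    weights never need to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def verify_bundle_digest(model_path: str, expected_digest: str) -> bool:
    """Recompute the artifact digest and compare in constant time.
    A mismatch means the bundle was tampered with and must be
    rejected before the model is loaded into the runtime."""
    actual = compute_sha256(model_path)
    return hmac.compare_digest(actual, expected_digest)
```

Rejecting on mismatch before load is what makes the "tampered digest" test meaningful: the runtime never sees unverified weights.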
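Deterministic output configuration (temperature=0, fixed seed) is what makes the response caching in the last E2E test sound: identical requests against the same model always yield the same completion, so a cache key derived from the model digest, prompt, and options can safely serve repeats. A minimal sketch, assuming a content-addressed key scheme (the real `ProviderBasedAdvisoryInferenceClient` cache layout may differ):

```python
import hashlib
import json

def cache_key(model_digest: str, prompt: str, options: dict) -> str:
    """Canonical JSON (sorted keys, no whitespace) keeps the key stable
    regardless of the order options were supplied in."""
    payload = json.dumps(
        {"model": model_digest, "prompt": prompt, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class InferenceCache:
    """In-memory cache: compute on miss, serve the stored result on repeat."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, key: str, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Including the model digest in the key also means a re-signed or updated bundle automatically invalidates stale entries.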
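The tokens/second and latency metrics the `LlmBenchmark` E2E test checks reduce to timing a run and dividing token count by elapsed time. A hedged sketch of that arithmetic (illustrative only; `run_inference` is a hypothetical callable, not the actual harness API):

```python
import time
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    tokens_generated: int
    latency_seconds: float

    @property
    def tokens_per_second(self) -> float:
        # Guard against division by zero on pathologically fast runs.
        if self.latency_seconds <= 0:
            return 0.0
        return self.tokens_generated / self.latency_seconds

def benchmark(run_inference, prompt: str) -> BenchmarkResult:
    """Time one inference call with a monotonic clock and report
    throughput; run_inference returns the number of tokens generated."""
    start = time.perf_counter()
    tokens = run_inference(prompt)
    latency = time.perf_counter() - start
    return BenchmarkResult(tokens_generated=tokens, latency_seconds=latency)
```

Using `time.perf_counter()` rather than wall-clock time avoids skew from system clock adjustments during a run.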