# Offline AI Model Bundles

**Sprint:** SPRINT_20251226_019_AI_offline_inference · **Tasks:** OFFLINE-23, OFFLINE-26
This guide covers transferring and configuring AI model bundles for air-gapped deployments.
## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred out of band: by sneakernet (USB drives or other portable media) or via an internal package server. The AdvisoryAI module supports deterministic local inference with signed model bundles.
## Model Bundle Format

```text
/offline/models/<model-id>/
├── manifest.json            # Bundle metadata + file digests
├── signature.dsse           # DSSE envelope with model signature
├── weights/
│   ├── model.gguf           # Quantized weights (llama.cpp format)
│   └── model.gguf.sha256    # SHA-256 digest
├── tokenizer/
│   ├── tokenizer.json       # Tokenizer config
│   └── special_tokens.json  # Special tokens map
└── config/
    ├── model_config.json    # Model architecture config
    └── inference.json       # Recommended inference settings
```
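Before anything cryptographic runs, it is worth confirming that the bundle layout is complete. A minimal Python sketch (the required-file list simply mirrors the tree above and is illustrative, not a canonical set):

```python
from pathlib import Path

# Required paths, mirroring the bundle layout above (illustrative list).
REQUIRED = [
    "manifest.json",
    "signature.dsse",
    "weights/model.gguf",
    "weights/model.gguf.sha256",
    "tokenizer/tokenizer.json",
    "tokenizer/special_tokens.json",
    "config/model_config.json",
    "config/inference.json",
]

def missing_files(bundle_dir: str) -> list[str]:
    """Return the relative paths that are absent from the bundle."""
    bundle = Path(bundle_dir)
    return [p for p in REQUIRED if not (bundle / p).is_file()]

print(missing_files("/offline/models/llama3-8b-q4km-v1") or "complete")
```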
## Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```
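Because every file carries a recorded digest and size, the manifest check can be scripted with nothing but the standard library. A minimal sketch of the digest recheck (an illustration, not the `stella model verify` implementation):

```python
import hashlib
import json
import sys
from pathlib import Path

def verify_bundle_digests(bundle_dir: str) -> bool:
    """Recompute the SHA-256 digest of every file listed in manifest.json
    and compare it against the recorded value. True only if all match."""
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text())
    ok = True
    for entry in manifest["files"]:
        path = bundle / entry["path"]
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
                h.update(chunk)
        actual = f"sha256:{h.hexdigest()}"
        if actual != entry["digest"] or path.stat().st_size != entry["size"]:
            print(f"MISMATCH {entry['path']}")
            ok = False
        else:
            print(f"OK       {entry['path']}")
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify_bundle_digests(sys.argv[1]) else 1)
```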
## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```
### 2. Transfer Verification
Before physically transferring the media, verify the bundle integrity:
```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4...  model.gguf

# Cross-check against the bundle manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' \
  /mnt/usb/models/llama3-8b-q4km/manifest.json
```
### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```
## CLI Model Commands

| Command | Description |
|---|---|
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <model-id>` | Display bundle details and verification status |
| `stella model remove <model-id>` | Remove installed model bundle |
### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```
## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local                 # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU           # CPU | GPU | NPU
      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42
      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0              # 0 = CPU only
```
### Environment Variables

| Variable | Description | Default |
|---|---|---|
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |
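For tooling that runs alongside the service, the same variables can be read with the documented defaults; note that `ADVISORYAI_MODEL_VERIFY` arrives as a string and needs explicit boolean parsing. A short sketch (treating the environment as an override of the YAML file is an assumption):

```python
import os

# Defaults mirror the table above. Treating environment variables as
# overrides of the YAML file is an assumption for illustration.
MODE = os.environ.get("ADVISORYAI_INFERENCE_MODE", "Local")
MODEL_PATH = os.environ.get("ADVISORYAI_MODEL_PATH", "/var/lib/stellaops/models")
VERIFY = os.environ.get("ADVISORYAI_MODEL_VERIFY", "true").lower() == "true"
THREADS = int(os.environ.get("ADVISORYAI_INFERENCE_THREADS", "4"))
```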
## Hardware Requirements
| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
|---|---|---|---|---|
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |
### Recommended Configurations

**Minimal (CPU-only, 8 GB RAM):**
- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**
- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**
- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec
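The RAM figures above are roughly the weight file plus KV cache plus runtime overhead, which you can sanity-check with back-of-the-envelope arithmetic. A sketch (the bits-per-weight and KV-cache figures are approximations, not values from this guide):

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    context_length: int = 4096, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate: quantized weights + KV cache + runtime overhead."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 2**30
    # ~0.5 MiB of KV cache per token is a coarse figure for an 8B-class
    # model; the real value depends on layer count and attention layout.
    kv_cache_gb = context_length * 0.5 / 1024
    return weights_gb + kv_cache_gb + overhead_gb

# Q4_K_M averages roughly 4.8 bits/weight; FP16 is 16 bits/weight.
print(f"8B  Q4_K_M: ~{estimate_ram_gb(8, 4.8):.1f} GB")   # ~7.5 GB, fits 8 GB
print(f"13B Q4_K_M: ~{estimate_ram_gb(13, 4.8):.1f} GB")  # ~10.3 GB, fits 16 GB
```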
## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:
```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded-manifest-digest>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-signature>"
    }
  ]
}
```
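Verifiers reconstruct the standard DSSE pre-authentication encoding (PAE) over the payload type and the decoded payload, then check each signature against a trusted key. A minimal Ed25519 sketch using the `cryptography` package (the trusted public key is assumed to be distributed out of band; key lookup by `keyId` is omitted):

```python
import base64
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    t = payload_type.encode()
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)

def verify_envelope(envelope_path: str, public_key_bytes: bytes) -> bool:
    with open(envelope_path) as f:
        env = json.load(f)
    payload = base64.b64decode(env["payload"])
    message = pae(env["payloadType"], payload)
    key = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    for sig in env["signatures"]:
        try:
            key.verify(base64.b64decode(sig["sig"]), message)
            return True  # one valid signature from the trusted key suffices
        except InvalidSignature:
            continue
    return False
```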
### Regional Crypto Support
| Region | Algorithm | Key Type |
|---|---|---|
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |
### Verification at Load Time
When a model is loaded, the following checks occur:
- **Signature verification:** DSSE envelope is verified against known keys
- **Manifest integrity:** All file digests are recalculated and compared
- **Bundle completeness:** All required files are present
- **Configuration validation:** Inference settings are within safe bounds (see the sketch below)
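The last check can be as simple as range checks over the bundled `config/inference.json`. A sketch with illustrative bounds (the limits the loader actually enforces are not specified in this guide):

```python
import json

# Illustrative bounds; not the loader's actual limits.
BOUNDS = {
    "contextLength": (512, 32768),
    "temperature": (0.0, 2.0),
    "threads": (1, 64),
    "batchSize": (1, 2048),
}

def validate_inference_config(path: str) -> list[str]:
    """Return bound violations; an empty list means the settings pass."""
    with open(path) as f:
        cfg = json.load(f)
    return [
        f"{key}={cfg[key]} outside [{lo}, {hi}]"
        for key, (lo, hi) in BOUNDS.items()
        if key in cfg and not (lo <= cfg[key] <= hi)
    ]

print(validate_inference_config("config/inference.json") or "within bounds")
```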
## Deterministic Inference
For reproducible AI outputs (required for attestation replay):
```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```
With these settings, the same prompt will produce identical output across runs, enabling:
- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification
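Because the weights ship as GGUF, the same settings map directly onto a llama.cpp binding. A sketch using the `llama-cpp-python` package (an assumption; this guide does not prescribe a specific runtime), showing greedy decoding with a fixed seed:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (assumed runtime)

llm = Llama(
    model_path="/var/lib/stellaops/models/llama3-8b-q4km/weights/model.gguf",
    n_ctx=4096,    # contextLength
    n_threads=4,   # threads
    seed=42,       # fixed seed
)

def run(prompt: str) -> str:
    out = llm(prompt, max_tokens=128, temperature=0.0, top_k=1, top_p=1.0)
    return out["choices"][0]["text"]

# Greedy decoding with a fixed seed: the same prompt yields identical output
# across runs on the same host and build, which is what attestation replay
# and divergence detection rely on.
assert run("Summarize the bundle verification steps.") == \
       run("Summarize the bundle verification steps.")
```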
## Benchmarking
Run local inference benchmarks:
```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```
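The same statistics are easy to reproduce for a custom prompt set from raw per-request timings. A sketch (the input lists are placeholders for your own measurements; TTFT needs per-token timestamps and is omitted):

```python
import statistics

def summarize(latencies_ms: list[float], tokens_out: list[int]) -> dict:
    """Core latency/throughput figures from per-request measurements."""
    ordered = sorted(latencies_ms)
    pct = lambda p: ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]
    total_s = sum(latencies_ms) / 1000
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "tokens_per_sec": sum(tokens_out) / total_s,
        "requests_per_min": 60 * len(latencies_ms) / total_s,
    }

# Five requests of 128 tokens each at roughly 13 s apiece (~10 tokens/sec).
print(summarize([12800.0, 13050.0, 13400.0, 14100.0, 12950.0], [128] * 5))
```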
## Troubleshooting

| Symptom | Cause | Resolution |
|---|---|---|
| `signature verification failed` | Bundle tampered with or wrong key | Re-download the bundle, verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source, verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in config |
| `out of memory` | Model too large for the host | Use a smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase the timeout or enable GPU offload |
| `non-deterministic output` | Wrong sampling settings | Set `temperature: 0.0` and a fixed `seed` |