# Offline AI Model Bundles

> **Sprint:** SPRINT_20251226_019_AI_offline_inference
> **Task:** OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.

## Overview

Local LLM inference in air-gapped environments requires model weight bundles to be transferred via sneakernet (USB, portable media, or internal package servers). The AdvisoryAI module supports deterministic local inference with signed model bundles.

## Model Bundle Format
```
/offline/models/<model-id>/
├── manifest.json            # Bundle metadata + file digests
├── signature.dsse           # DSSE envelope with model signature
├── weights/
│   ├── model.gguf           # Quantized weights (llama.cpp format)
│   └── model.gguf.sha256    # SHA-256 digest
├── tokenizer/
│   ├── tokenizer.json       # Tokenizer config
│   └── special_tokens.json  # Special tokens map
└── config/
    ├── model_config.json    # Model architecture config
    └── inference.json       # Recommended inference settings
```
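A quick way to sanity-check a bundle against this contract is to test for each expected file. The following is a minimal sketch in plain POSIX shell (no stella CLI required); the bundle path is illustrative:

```bash
# Layout check (sketch): confirm the files from the tree above are present
# before attempting an import. The bundle path is an example.
bundle=/offline/models/llama3-8b-q4km
missing=0
for f in manifest.json signature.dsse \
         weights/model.gguf weights/model.gguf.sha256 \
         tokenizer/tokenizer.json tokenizer/special_tokens.json \
         config/model_config.json config/inference.json; do
  [ -f "$bundle/$f" ] || { echo "missing: $f" >&2; missing=1; }
done
[ "$missing" -eq 0 ] && echo "layout OK"
```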
## Manifest Schema

```json
{
  "bundle_id": "llama3-8b-q4km-v1",
  "model_family": "llama3",
  "model_size": "8B",
  "quantization": "Q4_K_M",
  "license": "Apache-2.0",
  "created_at": "2025-12-26T00:00:00Z",
  "files": [
    {
      "path": "weights/model.gguf",
      "digest": "sha256:a1b2c3d4e5f6...",
      "size": 4893456789
    },
    {
      "path": "tokenizer/tokenizer.json",
      "digest": "sha256:1a2b3c4d5e6f...",
      "size": 1842
    }
  ],
  "crypto_scheme": "ed25519",
  "signature_id": "ed25519-20251226-a1b2c3d4"
}
```
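Because the manifest carries per-file digests, a bundle can be re-verified with nothing but jq and coreutils, which is handy on hosts where the stella CLI is not yet installed. A sketch, assuming the `sha256:` digest prefix shown above:

```bash
# Manifest digest re-check (sketch): recompute each file digest listed in
# manifest.json and compare it to the recorded value. Assumes jq and GNU
# coreutils are available.
cd /offline/models/llama3-8b-q4km

jq -r '.files[] | "\(.digest) \(.path)"' manifest.json |
while read -r digest path; do
  expected="${digest#sha256:}"
  actual="$(sha256sum "$path" | awk '{print $1}')"
  if [ "$expected" = "$actual" ]; then
    echo "OK   $path"
  else
    echo "FAIL $path" >&2
  fi
done
```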
## Transfer Workflow

### 1. Export on Connected Machine

```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/

# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```

### 2. Transfer Verification

Before physically transferring the media, verify the bundle integrity:

```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json

# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4...  model.gguf

# Cross-check against the bundle's manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' \
  /mnt/usb/models/llama3-8b-q4km/manifest.json
```
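For bulk verification, a plain checksum file travels well alongside the bundle. A hedged sketch using coreutils only; `SHA256SUMS` is a convention used here, not part of the bundle format:

```bash
# Bulk checksum sketch: write a checksum file next to the bundle on the
# connected machine, then verify it byte-for-byte on the receiving side.
cd /mnt/usb/models/llama3-8b-q4km
find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS

# On the air-gapped host, before import:
cd /mnt/usb/models/llama3-8b-q4km && sha256sum -c --quiet SHA256SUMS \
  && echo "media intact"
```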
### 3. Import on Air-Gapped Host

```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
  --verify-signature \
  --destination /var/lib/stellaops/models/

# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify

# List all installed models
stella model list
```
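In scripted imports it helps to gate follow-up steps on the verification result. A sketch, assuming the CLI exits non-zero when signature verification fails:

```bash
# Scripted import (sketch): only proceed to the post-import check when the
# import itself verified cleanly. Exit-code behavior is an assumption.
if stella model import /mnt/usb/models/llama3-8b-q4km/ \
     --verify-signature \
     --destination /var/lib/stellaops/models/; then
  stella model info llama3-8b-q4km --verify
else
  echo "import failed: keep the media copy for investigation" >&2
  exit 1
fi
```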
## CLI Model Commands

| Command | Description |
|---------|-------------|
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <model-id>` | Display bundle details and verification status |
| `stella model remove <model-id>` | Remove installed model bundle |

### Command Examples

```bash
# List models with details
stella model list --verbose

# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/

# Verify all installed bundles
stella model verify --all

# Get model info including signature status
stella model info llama3-8b-q4km --show-signature

# Remove model bundle
stella model remove llama3-8b-q4km --force
```
## Configuration

### Local Inference Configuration

Configure in `etc/advisory-ai.yaml`:

```yaml
advisoryAi:
  inference:
    mode: Local              # Local | Remote
    local:
      bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
      requiredDigest: "sha256:a1b2c3d4e5f6..."
      verifySignature: true
      deviceType: CPU        # CPU | GPU | NPU

      # Determinism settings (required for replay)
      contextLength: 4096
      temperature: 0.0
      seed: 42

      # Performance tuning
      threads: 4
      batchSize: 512
      gpuLayers: 0           # 0 = CPU only
```
### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |
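The same settings can be supplied via the environment, e.g. in a container or systemd unit. A sketch; the assumption here is that environment variables take precedence over `etc/advisory-ai.yaml`:

```bash
# Environment-based configuration (sketch). Values mirror the YAML example
# above; precedence over the YAML file is an assumption.
export ADVISORYAI_INFERENCE_MODE=Local
export ADVISORYAI_MODEL_PATH=/var/lib/stellaops/models/llama3-8b-q4km
export ADVISORYAI_MODEL_VERIFY=true
export ADVISORYAI_INFERENCE_THREADS=8
```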
## Hardware Requirements

| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
|------------|--------------|--------------|----------|-----------------|
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |
### Recommended Configurations

**Minimal (CPU-only, 8 GB RAM):**
- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec

**Standard (CPU, 16 GB RAM):**
- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)

**GPU-Accelerated (8 GB VRAM):**
- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec
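As a rough sizing aid, the profiles above can be turned into a quick check on the target host. A sketch for Linux (reads `/proc/meminfo`); the thresholds are the ones listed above, so tune them for your workload:

```bash
# Rough sizing helper (sketch): pick a profile from available memory.
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
if   [ "$mem_gb" -ge 16 ]; then echo "standard: 8B or 13B Q4_K_M, threads: 8"
elif [ "$mem_gb" -ge 8 ];  then echo "minimal: 8B Q4_K_M, threads: 4"
else echo "below 8 GB RAM: local inference not recommended" >&2
fi
```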
## Signing and Verification

### Model Bundle Signing

Bundles are signed using the DSSE (Dead Simple Signing Envelope) format:

```json
{
  "payloadType": "application/vnd.stellaops.model-bundle+json",
  "payload": "<base64-encoded-manifest-digest>",
  "signatures": [
    {
      "keyId": "stellaops-model-signer-2025",
      "sig": "<base64-signature>"
    }
  ]
}
```
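For out-of-band verification of the default Ed25519 scheme, the DSSE pre-authentication encoding (PAE) can be rebuilt by hand and checked with OpenSSL 3.x. A sketch; the public key file name is an assumption (export it from your trust store):

```bash
# Out-of-band DSSE check (sketch): the signature covers
# PAE(payloadType, payload), so rebuild that encoding and verify it.
payload_type=$(jq -r '.payloadType' signature.dsse)
jq -r '.payload' signature.dsse | base64 -d > payload.bin
jq -r '.signatures[0].sig' signature.dsse | base64 -d > sig.bin

{
  printf 'DSSEv1 %d %s %d ' \
    "${#payload_type}" "$payload_type" "$(wc -c < payload.bin)"
  cat payload.bin
} > pae.bin

# model-signer.pub.pem is an assumed export of the signer's public key.
openssl pkeyutl -verify -pubin -inkey model-signer.pub.pem \
  -rawin -in pae.bin -sigfile sig.bin
```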
### Regional Crypto Support

| Region | Algorithm | Key Type |
|--------|-----------|----------|
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |
### Verification at Load Time

When a model is loaded, the following checks occur:

1. **Signature verification**: DSSE envelope is verified against known keys
2. **Manifest integrity**: All file digests are recalculated and compared
3. **Bundle completeness**: All required files are present
4. **Configuration validation**: Inference settings are within safe bounds
## Deterministic Inference

For reproducible AI outputs (required for attestation replay):

```yaml
advisoryAi:
  inference:
    local:
      # CRITICAL: These settings ensure deterministic output
      temperature: 0.0
      seed: 42
      topK: 1
      topP: 1.0
```

With these settings, the same prompt will produce identical output across runs, enabling:
- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification
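A quick way to confirm the settings took effect is to hash two runs of the same prompt. The command below is hypothetical (`stella model run` is not part of the documented command set); substitute whatever entry point drives local inference in your deployment:

```bash
# Determinism smoke test (sketch): identical hashes => reproducible output.
prompt="Summarize the remediation guidance for CVE-2025-0001."
h1=$(stella model run llama3-8b-q4km --prompt "$prompt" | sha256sum)
h2=$(stella model run llama3-8b-q4km --prompt "$prompt" | sha256sum)
[ "$h1" = "$h2" ] && echo "deterministic" || echo "divergent output" >&2
```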
## Benchmarking

Run local inference benchmarks:

```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10

# Output includes:
# - Latency: mean, median, p95, p99, TTFT (time to first token)
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```
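Because throughput on CPU hosts is largely thread-bound, a small sweep helps pick the `threads` value. A sketch combining the documented benchmark command with the `ADVISORYAI_INFERENCE_THREADS` variable; whether the benchmark honors that variable is an assumption:

```bash
# Thread sweep (sketch): rerun the benchmark at several thread counts and
# compare the reported throughput.
for t in 2 4 8; do
  echo "--- threads=$t ---"
  ADVISORYAI_INFERENCE_THREADS=$t stella model benchmark llama3-8b-q4km --iterations 10
done
```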
## Troubleshooting

| Symptom | Cause | Resolution |
|---------|-------|------------|
| `signature verification failed` | Bundle tampered with or wrong key | Re-download the bundle; verify chain of custody |
| `digest mismatch` | Corruption during transfer | Re-copy from source; verify SHA-256 digests |
| `model not found` | Wrong bundle path | Check `bundlePath` in config |
| `out of memory` | Model too large for available RAM | Use a smaller quantization (e.g. Q4_K_M) |
| `inference timeout` | CPU too slow | Increase the timeout or enable GPU offload |
| `non-deterministic output` | Sampling settings not pinned | Set `temperature: 0.0` and a fixed `seed` |
## Related Documentation

- [Advisory AI Architecture](../architecture.md)
- [Offline Kit Overview](../../../24_OFFLINE_KIT.md)
- [AI Attestations](../../../implplan/SPRINT_20251226_018_AI_attestations.md)
- [Replay Semantics](./replay-semantics.md)