# Offline AI Model Bundles
> **Sprint:** SPRINT_20251226_019_AI_offline_inference
> **Task:** OFFLINE-23, OFFLINE-26

This guide covers transferring and configuring AI model bundles for air-gapped deployments.
## Overview
Local LLM inference in air-gapped environments requires model weight bundles to be transferred via sneakernet (USB or other portable media) or through an internal package server. The AdvisoryAI module supports deterministic local inference with signed model bundles.
## Model Bundle Format
```
/offline/models/<model-id>/
├── manifest.json # Bundle metadata + file digests
├── signature.dsse # DSSE envelope with model signature
├── weights/
│ ├── model.gguf # Quantized weights (llama.cpp format)
│ └── model.gguf.sha256 # SHA-256 digest
├── tokenizer/
│ ├── tokenizer.json # Tokenizer config
│ └── special_tokens.json # Special tokens map
└── config/
├── model_config.json # Model architecture config
└── inference.json # Recommended inference settings
```
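The weights ship with a `.sha256` sidecar so they can be spot-checked without parsing the manifest. Assuming the sidecar uses the standard `sha256sum` output format, a quick check from inside the bundle looks like:
```bash
# Spot-check the weights against the bundled sidecar digest
cd /offline/models/llama3-8b-q4km/weights
sha256sum -c model.gguf.sha256   # prints "model.gguf: OK" on success
```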
## Manifest Schema
```json
{
"bundle_id": "llama3-8b-q4km-v1",
"model_family": "llama3",
"model_size": "8B",
"quantization": "Q4_K_M",
"license": "Apache-2.0",
"created_at": "2025-12-26T00:00:00Z",
"files": [
{
"path": "weights/model.gguf",
"digest": "sha256:a1b2c3d4e5f6...",
"size": 4893456789
},
{
"path": "tokenizer/tokenizer.json",
"digest": "sha256:1a2b3c4d5e6f...",
"size": 1842
}
],
"crypto_scheme": "ed25519",
"signature_id": "ed25519-20251226-a1b2c3d4"
}
```
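Every entry in `files[]` can be re-verified on disk before signing or transfer. A minimal sketch, assuming the bundle root as the working directory and `jq` plus GNU coreutils available:
```bash
# Recompute and compare the digest of every file listed in manifest.json
cd /offline/models/llama3-8b-q4km
jq -r '.files[] | "\(.digest)  \(.path)"' manifest.json \
  | sed 's/^sha256://' \
  | sha256sum -c -
# sha256sum -c prints "<path>: OK" per file and exits non-zero on any mismatch
```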
## Transfer Workflow
### 1. Export on Connected Machine
```bash
# Pull model from registry and create signed bundle
stella model pull llama3-8b-q4km --offline --output /mnt/usb/models/
# Verify bundle before transfer
stella model verify /mnt/usb/models/llama3-8b-q4km/ --verbose
```
### 2. Transfer Verification
Before physically transferring the media, verify the bundle integrity:
```bash
# Generate transfer manifest with all digests
stella model export-manifest /mnt/usb/models/ --output transfer-manifest.json
# Print weights digest for phone/radio verification
sha256sum /mnt/usb/models/llama3-8b-q4km/weights/model.gguf
# Example output: a1b2c3d4... model.gguf
# Cross-check against the bundle manifest
jq '.files[] | select(.path | contains("model.gguf")) | .digest' /mnt/usb/models/llama3-8b-q4km/manifest.json
```
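The manual cross-check above can also be scripted end to end; a minimal sketch using the same paths (adjust the bundle directory to your media):
```bash
# Compare the computed weights digest against the manifest entry
BUNDLE=/mnt/usb/models/llama3-8b-q4km
want=$(jq -r '.files[] | select(.path == "weights/model.gguf") | .digest' "$BUNDLE/manifest.json")
have="sha256:$(sha256sum "$BUNDLE/weights/model.gguf" | awk '{print $1}')"
[ "$want" = "$have" ] && echo "digest OK" || { echo "digest MISMATCH"; exit 1; }
```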
### 3. Import on Air-Gapped Host
```bash
# Import with signature verification
stella model import /mnt/usb/models/llama3-8b-q4km/ \
--verify-signature \
--destination /var/lib/stellaops/models/
# Verify loaded model matches expected digest
stella model info llama3-8b-q4km --verify
# List all installed models
stella model list
```
## CLI Model Commands
| Command | Description |
|---------|-------------|
| `stella model list` | List installed model bundles |
| `stella model pull --offline` | Download bundle to local path for transfer |
| `stella model verify <path>` | Verify bundle integrity and signature |
| `stella model import <path>` | Import bundle from external media |
| `stella model info <model-id>` | Display bundle details and verification status |
| `stella model remove <model-id>` | Remove installed model bundle |
### Command Examples
```bash
# List models with details
stella model list --verbose
# Pull specific model variant
stella model pull llama3-8b --quantization Q4_K_M --offline --output ./bundle/
# Verify all installed bundles
stella model verify --all
# Get model info including signature status
stella model info llama3-8b-q4km --show-signature
# Remove model bundle
stella model remove llama3-8b-q4km --force
```
## Configuration
### Local Inference Configuration
Configure in `etc/advisory-ai.yaml`:
```yaml
advisoryAi:
inference:
mode: Local # Local | Remote
local:
bundlePath: /var/lib/stellaops/models/llama3-8b-q4km
requiredDigest: "sha256:a1b2c3d4e5f6..."
verifySignature: true
deviceType: CPU # CPU | GPU | NPU
# Determinism settings (required for replay)
contextLength: 4096
temperature: 0.0
seed: 42
# Performance tuning
threads: 4
batchSize: 512
gpuLayers: 0 # 0 = CPU only
```
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `ADVISORYAI_INFERENCE_MODE` | `Local` or `Remote` | `Local` |
| `ADVISORYAI_MODEL_PATH` | Path to model bundle | `/var/lib/stellaops/models` |
| `ADVISORYAI_MODEL_VERIFY` | Verify signature on load | `true` |
| `ADVISORYAI_INFERENCE_THREADS` | CPU threads for inference | `4` |
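For hosts configured without editing `etc/advisory-ai.yaml`, the same settings can be pinned through the environment; the values below are illustrative:
```bash
# Illustrative environment configuration for an air-gapped host
export ADVISORYAI_INFERENCE_MODE=Local
export ADVISORYAI_MODEL_PATH=/var/lib/stellaops/models/llama3-8b-q4km
export ADVISORYAI_MODEL_VERIFY=true
export ADVISORYAI_INFERENCE_THREADS=8
```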
## Hardware Requirements
| Model Size | Quantization | RAM Required | GPU VRAM | Inference Speed |
|------------|--------------|--------------|----------|-----------------|
| 7-8B | Q4_K_M | 8 GB | N/A (CPU) | ~10 tokens/sec |
| 7-8B | FP16 | 16 GB | 8 GB | ~50 tokens/sec |
| 13B | Q4_K_M | 16 GB | N/A (CPU) | ~5 tokens/sec |
| 13B | FP16 | 32 GB | 16 GB | ~30 tokens/sec |
### Recommended Configurations
**Minimal (CPU-only, 8 GB RAM):**
- Model: Llama 3 8B Q4_K_M
- Settings: `threads: 4`, `batchSize: 256`
- Expected: ~10 tokens/sec
**Standard (CPU, 16 GB RAM):**
- Model: Llama 3 8B Q4_K_M or 13B Q4_K_M
- Settings: `threads: 8`, `batchSize: 512`
- Expected: ~15-20 tokens/sec (8B), ~5-8 tokens/sec (13B)
**GPU-Accelerated (8 GB VRAM):**
- Model: Llama 3 8B FP16
- Settings: `gpuLayers: 35`, `batchSize: 512`
- Expected: ~50 tokens/sec
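Before importing a bundle, it is worth confirming the host actually meets the RAM figures above. A rough pre-flight sketch for the CPU-only 8B case (the 8 GB threshold comes from the table above):
```bash
# Pre-flight: does this host have ~8 GB available for an 8B Q4_K_M model?
need_kb=$((8 * 1024 * 1024))
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt "$need_kb" ]; then
  echo "WARN: only $((avail_kb / 1024)) MiB available; consider a smaller model or quantization"
fi
```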
## Signing and Verification
### Model Bundle Signing
Bundles are signed using DSSE (Dead Simple Signing Envelope) format:
```json
{
"payloadType": "application/vnd.stellaops.model-bundle+json",
"payload": "<base64-encoded-manifest-digest>",
"signatures": [
{
"keyId": "stellaops-model-signer-2025",
"sig": "<base64-signature>"
}
]
}
```
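Outside of `stella model verify`, the envelope can also be checked manually with standard tooling. The sketch below assumes an Ed25519-signed bundle, a plain-text payload, and the signer's public key exported as `model-signer.pub` (a hypothetical filename); the pre-authentication encoding (PAE) follows the DSSE specification:
```bash
# Rebuild the DSSE pre-authentication encoding (PAE) and verify the Ed25519 signature
ptype=$(jq -r '.payloadType' signature.dsse)
payload=$(jq -r '.payload' signature.dsse | base64 -d)      # assumes a text payload
jq -r '.signatures[0].sig' signature.dsse | base64 -d > sig.bin
printf 'DSSEv1 %d %s %d %s' "${#ptype}" "$ptype" "${#payload}" "$payload" > pae.bin
# Requires OpenSSL >= 1.1.1 for raw Ed25519 verification
openssl pkeyutl -verify -pubin -inkey model-signer.pub -rawin -in pae.bin -sigfile sig.bin
```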
### Regional Crypto Support
| Region | Algorithm | Key Type |
|--------|-----------|----------|
| Default | Ed25519 | Ed25519 |
| FIPS (US) | ECDSA-P256 | NIST P-256 |
| GOST (RU) | GOST R 34.10-2012 | GOST R 34.10-2012 |
| SM (CN) | SM2 | SM2 |
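The manifest's `crypto_scheme` field declares which algorithm signed the bundle, so it can be checked against local policy before import. Only `ed25519` appears in the sample manifest; the other identifiers below are illustrative:
```bash
# Confirm the bundle's declared signing scheme matches local policy
scheme=$(jq -r '.crypto_scheme' /mnt/usb/models/llama3-8b-q4km/manifest.json)
case "$scheme" in
  ed25519|ecdsa-p256|gost-34.10-2012|sm2) echo "scheme: $scheme" ;;
  *) echo "unsupported crypto_scheme: $scheme"; exit 1 ;;
esac
```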
### Verification at Load Time
When a model is loaded, the following checks occur:
1. **Signature verification**: DSSE envelope is verified against known keys
2. **Manifest integrity**: All file digests are recalculated and compared
3. **Bundle completeness**: All required files are present
4. **Configuration validation**: Inference settings are within safe bounds
## Deterministic Inference
For reproducible AI outputs (required for attestation replay):
```yaml
advisoryAi:
inference:
local:
# CRITICAL: These settings ensure deterministic output
temperature: 0.0
seed: 42
topK: 1
topP: 1.0
```
With these settings, the same prompt will produce identical output across runs, enabling:
- AI artifact replay for compliance audits
- Divergence detection between environments
- Attestation verification
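A quick end-to-end determinism check is to run the same prompt twice and compare output hashes. The prompt entry point below is hypothetical; substitute whatever command issues prompts in your deployment:
```bash
# Hypothetical determinism smoke test: identical prompts must hash identically
PROMPT="Summarize the impact of the advisory."
h1=$(stella advise --prompt "$PROMPT" | sha256sum | awk '{print $1}')   # 'stella advise' is illustrative
h2=$(stella advise --prompt "$PROMPT" | sha256sum | awk '{print $1}')
[ "$h1" = "$h2" ] && echo "deterministic" || echo "DIVERGENCE detected"
```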
## Benchmarking
Run local inference benchmarks:
```bash
# Run standard benchmark suite
stella model benchmark llama3-8b-q4km --iterations 10
# Output includes:
# - Latency: mean, median, p95, p99, TTFT
# - Throughput: tokens/sec, requests/min
# - Resource usage: peak memory, CPU utilization
```
## Troubleshooting
| Symptom | Cause | Resolution |
|---------|-------|------------|
| `signature verification failed` | Bundle tampered or wrong key | Re-download bundle, verify chain of custody |
| `digest mismatch` | Corrupted during transfer | Re-copy from source, verify SHA-256 |
| `model not found` | Wrong bundle path | Check `bundlePath` in config |
| `out of memory` | Model too large | Use smaller quantization (Q4_K_M) |
| `inference timeout` | CPU too slow | Increase timeout or enable GPU |
| `non-deterministic output` | Sampling settings not pinned | Set `temperature: 0.0`, fixed `seed`, `topK: 1` |
## Related Documentation
- [Advisory AI Architecture](../architecture.md)
- [Offline Kit Overview](../../../24_OFFLINE_KIT.md)
- [AI Attestations](../../../implplan/SPRINT_20251226_018_AI_attestations.md)
- [Replay Semantics](./replay-semantics.md)