# Trust Lattice Operations Runbook

> **Version**: 1.0.0
> **Last Updated**: 2025-12-22
> **Audience**: Operations and Support teams

---

## 1. Overview

The Trust Lattice is a VEX claim scoring framework that produces explainable, deterministic verdicts. This runbook covers operational procedures for monitoring, troubleshooting, and maintaining the system.

---

## 2. System Components

| Component | Service | Purpose |
|-----------|---------|---------|
| TrustVector | Excititor | 3-component trust scoring (P/C/R) |
| ClaimScoreMerger | Policy | Merge scored claims into verdicts |
| PolicyGates | Policy | Enforce trust thresholds |
| VerdictManifest | Authority | Store signed verdicts |
| Calibration | Excititor | Adjust trust vectors over time |

---

## 3. Monitoring

### 3.1 Key Metrics

| Metric | Alert Threshold | Description |
|--------|-----------------|-------------|
| `trustlattice_score_latency_p95` | > 100 ms | Claim scoring latency (95th percentile) |
| `trustlattice_merge_conflicts_total` | Rate increase | Claims with status conflicts |
| `policy_gate_failures_total` | Rate increase | Gate rejections |
| `verdict_manifest_replay_failures` | > 0 | Non-deterministic verdicts |
| `calibration_drift_percent` | > 10% | Trust vector drift from baseline |

### 3.2 Dashboards

Dashboards and useful queries:
- Grafana: `https://<grafana>/d/trustlattice`
- Prometheus queries:
  ```promql
  # Average claim score by source class
  avg(trustlattice_claim_score) by (source_class)

  # Gate failure rate
  rate(policy_gate_failures_total[5m])

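  # Median verdict confidence (assumes a standard Prometheus histogram; buckets are aggregated by `le`)
  histogram_quantile(0.5, sum by (le) (rate(trustlattice_verdict_confidence_bucket[5m])))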
  ```

### 3.3 Log Queries

Key log entries, written as LogQL for Loki (adapt the syntax for ELK):
```logql
# Claim scoring
{app="excititor"} |= "ClaimScore computed"

# Gate failures
{app="policy"} |= "Gate failed" | json | gate_name != ""

# Verdict replay failures
{app="authority"} |= "Replay mismatch"
```

---

## 4. Common Operations

### 4.1 Viewing Current Trust Vectors

```bash
# Via CLI
stella trustvector list --source-class vendor

# Via API
curl -H "Authorization: Bearer $TOKEN" \
  https://api.example.com/api/v1/trustlattice/vectors
```

### 4.2 Inspecting a Verdict

```bash
# Get verdict details
stella verdict show verd:acme:abc123:CVE-2025-12345:1734873600

# Verify verdict replay
stella verdict replay verd:acme:abc123:CVE-2025-12345:1734873600
```
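
The verdict identifiers in the commands above look colon-delimited. The sketch below splits one into its parts, which can be handy when correlating verdicts with logs. It assumes the layout `verd:<tenant>:<asset>:<vulnerability>:<unix-timestamp>`; the runbook does not define this layout, and the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VerdictId:
    # Field names are hypothetical; the runbook does not define the ID layout.
    tenant: str
    asset: str
    vulnerability: str
    evaluated_at: datetime


def parse_verdict_id(raw: str) -> VerdictId:
    """Split a verdict ID, assuming five colon-separated parts starting with 'verd'."""
    parts = raw.split(":")
    if len(parts) != 5 or parts[0] != "verd":
        raise ValueError(f"unexpected verdict id format: {raw!r}")
    _, tenant, asset, vulnerability, epoch = parts
    # ASSUMPTION: the trailing number is a Unix timestamp in seconds (UTC).
    evaluated_at = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return VerdictId(tenant, asset, vulnerability, evaluated_at)


vid = parse_verdict_id("verd:acme:abc123:CVE-2025-12345:1734873600")
print(vid.tenant, vid.vulnerability, vid.evaluated_at.isoformat())
```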

### 4.3 Viewing Gate Configuration

```bash
# List enabled gates
stella gates list --environment production

# Show gate thresholds
stella gates show minimumConfidence --environment production
```

### 4.4 Triggering Manual Calibration

```bash
# Trigger calibration epoch for a source
stella calibration run --source vendor:redhat \
  --start 2025-11-01 --end 2025-12-01

# View calibration history
stella calibration history vendor:redhat
```

---

## 5. Emergency Procedures

### 5.1 High Gate Failure Rate

**Symptoms:**
- Spike in `policy_gate_failures_total`
- Many builds failing due to low confidence

**Steps:**
1. Check whether the VEX source is unavailable or stale:
   ```bash
   stella vex source status vendor:redhat
   ```

2. If the source is stale, consider temporarily reducing the threshold in `etc/policy-gates.yaml`:
   ```yaml
   # etc/policy-gates.yaml
   gates:
     minimumConfidence:
       thresholds:
         production: 0.60  # Reduced from 0.75
   ```

3. Restart the Policy Engine to apply the change.

4. Monitor the gate failure rate and restore the original threshold (0.75) once the source recovers.
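
Before lowering a threshold, it can help to estimate how many recent verdicts the change would admit. This is a minimal sketch, assuming the `minimumConfidence` gate passes when confidence is greater than or equal to the threshold. The comparison rule and the sample confidences are assumptions, not taken from the Policy Engine.

```python
def gate_passes(confidence: float, threshold: float) -> bool:
    # ASSUMPTION: the gate uses an inclusive comparison.
    return confidence >= threshold


def pass_rate(confidences: list, threshold: float) -> float:
    """Fraction of verdicts that would pass the gate at the given threshold."""
    if not confidences:
        return 0.0
    passed = sum(1 for c in confidences if gate_passes(c, threshold))
    return passed / len(confidences)


# Illustrative confidences only; real values would come from recent verdicts.
recent = [0.58, 0.62, 0.66, 0.71, 0.74, 0.78, 0.81, 0.90]

normal_rate = pass_rate(recent, 0.75)   # standard production threshold
reduced_rate = pass_rate(recent, 0.60)  # temporary reduced threshold
print(f"pass rate at 0.75: {normal_rate:.3f}, at 0.60: {reduced_rate:.3f}")
```

A large jump in pass rate suggests the reduction would admit many low-confidence verdicts, which is worth weighing against the cost of blocked builds.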

### 5.2 Verdict Replay Failures

**Symptoms:**
- `verdict_manifest_replay_failures` > 0
- Audit compliance check failures

**Steps:**
1. Identify failing verdict:
   ```bash
   stella verdict list --replay-status failed --limit 10
   ```

2. Compare original and replayed inputs:
   ```bash
   stella verdict diff <manifestId>
   ```

3. Common causes:
   - VEX document modified after verdict
   - Clock drift during evaluation
   - Policy configuration changed

4. For clock drift, verify NTP synchronization:
   ```bash
   timedatectl status
   ```
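
The common causes in step 3 all amount to the replay seeing different inputs than the original evaluation. The sketch below shows one generic way to detect that: hash a canonical JSON form of each input set and list the keys that differ. It is not a description of how `stella verdict diff` works, and the input keys (`vex_digest`, `policy_version`, `evaluated_at`) are hypothetical.

```python
import hashlib
import json

_MISSING = object()


def canonical_digest(inputs: dict) -> str:
    """SHA-256 over JSON with sorted keys, so logically equal inputs hash identically."""
    blob = json.dumps(inputs, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def changed_keys(original: dict, replayed: dict) -> list:
    """Keys whose values differ, or that exist on only one side."""
    keys = set(original) | set(replayed)
    return sorted(
        k for k in keys
        if original.get(k, _MISSING) != replayed.get(k, _MISSING)
    )


# Hypothetical input records for one verdict.
original = {
    "vex_digest": "sha256:aaa",
    "policy_version": "1.4",
    "evaluated_at": "2025-12-01T00:00:00Z",
}
replayed = dict(original, policy_version="1.5")

deterministic = canonical_digest(original) == canonical_digest(replayed)
print("deterministic:", deterministic, "changed:", changed_keys(original, replayed))
```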

### 5.3 Trust Vector Drift Emergency

**Symptoms:**
- `calibration_drift_percent` > 20%
- Sudden confidence changes across many assets

**Steps:**
1. Freeze calibration:
   ```bash
   stella calibration freeze vendor:redhat
   ```

2. Investigate recent calibration epochs:
   ```bash
   stella calibration history vendor:redhat --epochs 5
   ```

3. If the false-positive rate increased, roll back to the last known-good epoch (epoch 41 in this example):
   ```bash
   stella calibration rollback vendor:redhat --to-epoch 41
   ```

4. Unfreeze after investigation:
   ```bash
   stella calibration unfreeze vendor:redhat
   ```
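
The runbook does not define how `calibration_drift_percent` is computed. One plausible reading, shown here purely as an assumption, is the largest relative change of any trust vector component against its baseline. The 10% alert and 20% emergency cut-offs come from section 3.1 and the symptoms above; the vector values are illustrative.

```python
ALERT_DRIFT = 10.0      # alert threshold from the metrics table (section 3.1)
EMERGENCY_DRIFT = 20.0  # symptom threshold for this emergency procedure


def drift_percent(baseline: dict, current: dict) -> float:
    # ASSUMPTION: drift is the largest relative change of any component, in percent.
    worst = 0.0
    for name, base in baseline.items():
        if base == 0:
            raise ValueError(f"baseline for {name!r} is zero; relative drift is undefined")
        worst = max(worst, abs(current[name] - base) / base * 100.0)
    return worst


def classify(drift: float) -> str:
    if drift > EMERGENCY_DRIFT:
        return "emergency"
    if drift > ALERT_DRIFT:
        return "alert"
    return "ok"


# Illustrative P/C/R values, not real calibration data.
baseline = {"P": 0.80, "C": 0.70, "R": 0.60}
current = {"P": 0.80, "C": 0.525, "R": 0.63}

drift = drift_percent(baseline, current)
print(f"drift: {drift:.1f}% -> {classify(drift)}")
```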

---

## 6. Configuration

### 6.1 Configuration Files

| File | Purpose |
|------|---------|
| `etc/trust-lattice.yaml` | Trust vector weights and defaults |
| `etc/policy-gates.yaml` | Gate thresholds and rules |
| `etc/excititor-calibration.yaml` | Calibration parameters |

### 6.2 Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TRUSTLATTICE_WEIGHTS_PROVENANCE` | 0.45 | Provenance weight |
| `TRUSTLATTICE_WEIGHTS_COVERAGE` | 0.35 | Coverage weight |
| `TRUSTLATTICE_FRESHNESS_HALFLIFE` | 90 | Freshness half-life (days) |
| `GATES_MINIMUM_CONFIDENCE_PROD` | 0.75 | Production confidence threshold |
| `CALIBRATION_LEARNING_RATE` | 0.02 | Calibration learning rate |
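
To make the defaults above concrete, the sketch below shows how they might combine. The runbook does not state the scoring formula, so three things here are assumptions: that the base trust is a weighted sum of the P/C/R components, that the third component takes the remaining weight (0.20), and that freshness decays exponentially and multiplies the base trust. Only the half-life arithmetic itself is standard.

```python
W_PROVENANCE = 0.45    # TRUSTLATTICE_WEIGHTS_PROVENANCE default
W_COVERAGE = 0.35      # TRUSTLATTICE_WEIGHTS_COVERAGE default
# ASSUMPTION: the third (R) component receives whatever weight remains.
W_THIRD = 1.0 - W_PROVENANCE - W_COVERAGE
HALF_LIFE_DAYS = 90.0  # TRUSTLATTICE_FRESHNESS_HALFLIFE default


def base_trust(p: float, c: float, r: float) -> float:
    # ASSUMPTION: base trust is a weighted sum of the three components.
    return W_PROVENANCE * p + W_COVERAGE * c + W_THIRD * r


def freshness(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Exponential decay: the factor halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def claim_score(p: float, c: float, r: float, age_days: float) -> float:
    # ASSUMPTION: freshness scales the base trust multiplicatively.
    return base_trust(p, c, r) * freshness(age_days)


for age in (0, 90, 180):
    print(age, round(claim_score(0.9, 0.8, 0.7, age), 4))
```

Under these assumptions, a claim that is 90 days old scores half of what the same claim would score when fresh, and a quarter at 180 days.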

---

## 7. Maintenance Tasks

### 7.1 Daily

- [ ] Review gate failure alerts
- [ ] Check verdict replay success rate
- [ ] Monitor trust vector stability

### 7.2 Weekly

- [ ] Review calibration epoch results
- [ ] Analyze conflict rate trends
- [ ] Update trust vectors for new sources

### 7.3 Monthly

- [ ] Audit high-drift sources
- [ ] Review and tune gate thresholds
- [ ] Clean up expired verdict manifests

---

## 8. Contact

- **On-call**: #trustlattice-oncall (Slack)
- **Escalation**: VEX Guild Lead
- **Documentation**: `docs/modules/excititor/trust-lattice.md`

---

*Document Version: 1.0.0*
*Sprint: 7100.0003.0002*