up
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled
AOC Guard CI / aoc-guard (push) Has been cancelled
AOC Guard CI / aoc-verify (push) Has been cancelled

This commit is contained in:
StellaOps Bot
2025-11-29 11:08:08 +02:00
parent 7e7be4d2fd
commit 3488b22c0c
102 changed files with 18487 additions and 969 deletions


@@ -1,602 +0,0 @@
Here's a simple, low-friction way to keep priorities fresh without constant manual grooming: **let confidence decay over time**.
![A small curve sloping down over time, illustrating exponential decay](https://dummyimage.com/800x250/ffffff/000000\&text=confidence\(t\)%20=%20e^{-t/τ})
# Exponential confidence decay (what & why)
* **Idea:** Every item (task, lead, bug, doc, hypothesis) has a confidence score that **automatically shrinks with time** if you don't touch it.
* **Formula:** `confidence(t) = e^(-t/τ)` where `t` is days since last signal (edit, comment, commit, new data), and **τ (“tau”)** is the decay constant.
* **Rule of thumb:** With **τ = 30 days**, at **t = 30** the confidence is **e^(-1) ≈ 0.37**, about a **63% drop**. This surfaces long-ignored items *gradually*, not with harsh “stale/expired” flips.
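A quick way to get a feel for the curve is to tabulate it. The sketch below is illustrative Python; the names `confidence` and `half_life` are made up for this example and are not part of any existing service.

```python
import math

def confidence(t_days: float, tau_days: float) -> float:
    """Exponential decay: 1.0 at t = 0, e^-1 (about 0.37) at t = tau."""
    return math.exp(-t_days / tau_days)

def half_life(tau_days: float) -> float:
    """Days until confidence falls to 0.5, i.e. tau * ln(2)."""
    return tau_days * math.log(2)

if __name__ == "__main__":
    for t in (0, 7, 30, 60, 90):
        print(f"t={t:>2}d  confidence={confidence(t, 30):.2f}")
    print(f"half-life at tau=30: {half_life(30):.1f} days")
```

With τ = 30, confidence halves roughly every 21 days (τ · ln 2 ≈ 20.8), which is often an easier number to reason about than τ itself.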
# How to use it in practice
* **Signals that reset t → 0:** comment on the ticket, new benchmark, fresh log sample, doc update, CI run, new market news.
* **Sort queues by:** `priority × confidence(t)` (or severity × confidence). Quiet items drift down; truly active ones stay up.
* **Escalation bands:**
* `>0.6` = green (recently touched)
* `0.3–0.6` = amber (review soon)
* `<0.3` = red (poke or close)
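It can help to translate the band thresholds into days of silence. The following is a small sketch under the assumption that boundary values (exactly 0.6 or 0.3) fall into the upper band; `band` and `days_until` are hypothetical helper names.

```python
import math

# Thresholds from the escalation bands above.
GREEN_MIN = 0.6
AMBER_MIN = 0.3

def band(conf: float) -> str:
    """Map a confidence value to a colour band."""
    if conf >= GREEN_MIN:
        return "green"
    if conf >= AMBER_MIN:
        return "amber"
    return "red"

def days_until(threshold: float, tau_days: float) -> float:
    """Solve e^(-t/tau) = threshold for t."""
    return -tau_days * math.log(threshold)

if __name__ == "__main__":
    tau = 30
    print(f"amber after {days_until(GREEN_MIN, tau):.1f} days")
    print(f"red after   {days_until(AMBER_MIN, tau):.1f} days")
```

For τ = 30 this puts the green/amber boundary at about 15 days and the amber/red boundary at about 36 days without a signal.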
# Quick presets
* **Fast-moving queues (incidents, hot leads):** τ = **7–14** days
* **Engineering tasks / product docs:** τ = **30** days
* **Research bets / roadmaps:** τ = **60–90** days
# For your world (StellaOps + ops/dev work)
* **Vuln tickets:** `risk_score = CVSS × reachability × e^(-t/30)`
* **Roadmap epics:** `value_score = impact × e^(-t/60)` to re-rank quarterly.
* **Docs:** show a badge “freshness: 42%” derived from last edit age to nudge updates.
# Minimal implementation sketch
* Store per-item: `last_signal_at`, `base_priority`.
* Compute on read:
```
days = (now - last_signal_at).days
conf = exp(-days / tau)
score = base_priority * conf
```
* Recompute in your API layer or materialize nightly; no cron spam needed.
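A runnable version of the compute-on-read idea might look like the sketch below. The `Item` class and the sample data are invented for illustration; it uses fractional days rather than whole days, which is a deliberate deviation so that scores change smoothly instead of stepping at midnight.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Item:
    name: str
    base_priority: float
    last_signal_at: datetime  # timezone-aware, UTC

def score(item: Item, tau_days: float, now: datetime) -> float:
    """base_priority * e^(-days/tau); a signal dated in the future counts as age 0."""
    days = max((now - item.last_signal_at).total_seconds() / 86400, 0.0)
    return item.base_priority * math.exp(-days / tau_days)

def rank(items, tau_days, now):
    """Highest effective score first."""
    return sorted(items, key=lambda i: score(i, tau_days, now), reverse=True)

if __name__ == "__main__":
    now = datetime(2025, 11, 29, tzinfo=timezone.utc)
    items = [
        Item("quiet-but-important", 90, now - timedelta(days=60)),
        Item("active-medium", 50, now - timedelta(days=2)),
    ]
    for it in rank(items, 30, now):
        print(it.name, round(score(it, 30, now), 1))
```

Here the recently touched medium-priority item outranks the high-priority item that has been silent for two decay constants.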
If you want, I can draft a tiny C# helper (and SQL snippet) you can drop into your issue service to add `confidence(t)` and color bands to your lists.
Perfect, let's turn the idea into something your devs can actually build.
Below is an implementation plan you can drop into a ticketing system or PRD, with clear phases, data model changes, APIs, and some sample code (C# + SQL). I'll also sprinkle in StellaOps-specific notes.
---
## 0. Scope & Objectives
**Goal:** Introduce `confidence(t)` as an automatic freshness factor that decays with time and is used to rank and highlight work.
We'll apply it to:
* Vulnerabilities (StellaOps)
* General issues / tasks / epics
* (Optional) Docs, leads, hypotheses later
**Core behavior:**
* Each item has:
* A base priority / risk (from severity, business impact, etc.)
* A timestamp of last signal (meaningful activity)
* A decay rate τ (tau) in days
* Effective priority = `base_priority × confidence(t)`
* `confidence(t) = exp(-t / τ)` where `t` = days since last_signal
---
## 1. Data Model Changes
### 1.1. Add fields to core “work item” tables
For each relevant table (`Issues`, `Vulnerabilities`, `Epics`, …):
**New columns:**
* `base_priority` (FLOAT or INT)
* Example: 1–100, or derived from severity.
* `last_signal_at` (DATETIME, NOT NULL, default = `created_at`)
* `tau_days` (FLOAT, nullable, falls back to type default)
* (Optional) `confidence_cached` (FLOAT, for materialized score)
* (Optional) `is_confidence_frozen` (BOOL, default FALSE)
For pinned items that should not decay.
**Example Postgres migration (Issues):**
```sql
ALTER TABLE issues
ADD COLUMN base_priority DOUBLE PRECISION,
ADD COLUMN last_signal_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ADD COLUMN tau_days DOUBLE PRECISION,
ADD COLUMN confidence_cached DOUBLE PRECISION,
ADD COLUMN is_confidence_frozen BOOLEAN NOT NULL DEFAULT FALSE;
```
For StellaOps:
```sql
ALTER TABLE vulnerabilities
ADD COLUMN base_risk DOUBLE PRECISION,
ADD COLUMN last_signal_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ADD COLUMN tau_days DOUBLE PRECISION,
ADD COLUMN confidence_cached DOUBLE PRECISION,
ADD COLUMN is_confidence_frozen BOOLEAN NOT NULL DEFAULT FALSE;
```
### 1.2. Add a config table for τ per entity type
```sql
CREATE TABLE confidence_decay_config (
id SERIAL PRIMARY KEY,
entity_type TEXT NOT NULL, -- 'issue', 'vulnerability', 'epic', 'doc'
tau_days_default DOUBLE PRECISION NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
INSERT INTO confidence_decay_config (entity_type, tau_days_default) VALUES
('incident', 7),
('vulnerability', 30),
('issue', 30),
('epic', 60),
('doc', 90);
```
---
## 2. Define “signal” events & instrumentation
We need a standardized way to say: “this item got activity → reset last_signal_at”.
### 2.1. Signals that should reset `last_signal_at`
For **issues / epics:**
* New comment
* Status change (e.g., Open → In Progress)
* Field change that matters (severity, owner, milestone)
* Attachment added
* Link to PR added or updated
* New CI failure linked
For **vulnerabilities (StellaOps):**
* New scanner result attached or status updated (e.g., “Verified”, “False Positive”)
* New evidence (PoC, exploit notes)
* SLA override change
* Assignment / ownership change
* Integration events (e.g., PR merge that references the vuln)
For **docs (if you do it):**
* Any edit
* Comment/annotation
### 2.2. Implement a shared helper to record a signal
**Service-level helper (pseudocode / C#-ish):**
```csharp
public interface IConfidenceSignalService
{
    Task RecordSignalAsync(WorkItemType type, Guid itemId, DateTime? signalTimeUtc = null);
}

public class ConfidenceSignalService : IConfidenceSignalService
{
    private readonly IWorkItemRepository _repo;
    private readonly IConfidenceConfigService _config;

    public ConfidenceSignalService(IWorkItemRepository repo, IConfidenceConfigService config)
    {
        _repo = repo;
        _config = config;
    }

    public async Task RecordSignalAsync(WorkItemType type, Guid itemId, DateTime? signalTimeUtc = null)
    {
        var now = signalTimeUtc ?? DateTime.UtcNow;
        var item = await _repo.GetByIdAsync(type, itemId);
        if (item == null) return;

        item.LastSignalAt = now;

        // No per-item override yet: pin the entity-type default.
        if (item.TauDays == null)
        {
            item.TauDays = await _config.GetDefaultTauAsync(type);
        }

        await _repo.UpdateAsync(item);
    }
}
```
### 2.3. Wire signals into existing flows
Create small tasks for devs like:
* **ISS-01:** Call `RecordSignalAsync` on:
* New issue comment handler
* Issue status update handler
* Issue field update handler (severity/priority/owner)
* **VULN-01:** Call `RecordSignalAsync` when:
* New scanner result ingested for a vuln
* Vulnerability status, SLA, or owner changes
* New exploit evidence is attached
---
## 3. Confidence & scoring calculation
### 3.1. Shared confidence function
Definition:
```csharp
public static class ConfidenceMath
{
    // t = days since last signal
    public static double ConfidenceScore(DateTime lastSignalAtUtc, double tauDays, DateTime? nowUtc = null)
    {
        var now = nowUtc ?? DateTime.UtcNow;
        var tDays = (now - lastSignalAtUtc).TotalDays;
        if (tDays <= 0) return 1.0;
        if (tauDays <= 0) return 1.0; // guard / fallback
        var score = Math.Exp(-tDays / tauDays);
        // Optional: never drop below a tiny floor, so items never "disappear"
        const double floor = 0.01;
        return Math.Max(score, floor);
    }
}
```
### 3.2. Effective priority formulas
**Generic issues / tasks:**
```csharp
double effectiveScore = issue.BasePriority * ConfidenceMath.ConfidenceScore(issue.LastSignalAt, issue.TauDays ?? defaultTau);
```
**Vulnerabilities (StellaOps):**
Let's define:
* `severity_weight`: map CVSS or severity string to numeric (e.g. Critical=100, High=80, Medium=50, Low=20).
* `reachability`: 0–1 (e.g. from your reachability analysis).
* `exploitability`: 0–1 (optional, based on known exploits).
* `confidence`: as above.
```csharp
double baseRisk = severityWeight * reachability * exploitability; // or simpler: severityWeight * reachability
double conf = ConfidenceMath.ConfidenceScore(vuln.LastSignalAt, vuln.TauDays ?? defaultTau);
double effectiveRisk = baseRisk * conf;
```
Store `baseRisk` → `vulnerabilities.base_risk`, and compute `effectiveRisk` on the fly or via job.
### 3.3. SQL implementation (optional for server-side sorting)
**Postgres example:**
```sql
-- t_days = age in days
-- tau = tau_days
-- score = exp(-t_days / tau)
SELECT
    i.*,
    i.base_priority *
        GREATEST(
            EXP(-EXTRACT(EPOCH FROM (NOW() - i.last_signal_at)) / (86400 * COALESCE(i.tau_days, 30))),
            0.01
        ) AS effective_priority
FROM issues i
ORDER BY effective_priority DESC;
```
You can wrap that in a view:
```sql
CREATE VIEW issues_with_confidence AS
SELECT
    i.*,
    GREATEST(
        EXP(-EXTRACT(EPOCH FROM (NOW() - i.last_signal_at)) / (86400 * COALESCE(i.tau_days, 30))),
        0.01
    ) AS confidence,
    i.base_priority *
        GREATEST(
            EXP(-EXTRACT(EPOCH FROM (NOW() - i.last_signal_at)) / (86400 * COALESCE(i.tau_days, 30))),
            0.01
        ) AS effective_priority
FROM issues i;
```
---
## 4. Caching & performance
You have two options:
### 4.1. Compute on read (simplest to start)
* Use the helper function in your service layer or a DB view.
* Pros:
* No jobs, always fresh.
* Cons:
* Slight CPU cost on heavy lists.
**Plan:** Start with this. If you see perf issues, move to 4.2.
### 4.2. Periodic materialization job (optional later)
Add a scheduled job (e.g. hourly) that:
1. Selects all active items.
2. Computes `confidence_score` and `effective_priority`.
3. Writes to `confidence_cached` and `effective_priority_cached` (if you add such a column).
Service then sorts by cached values.
---
## 5. Backfill & migration
### 5.1. Initial backfill script
For existing records:
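* Set `last_signal_at` to `created_at`. The migration's `NOT NULL DEFAULT NOW()` stamps every existing row with the migration time, so an `IS NULL` filter would match nothing.
* Derive `base_priority` / `base_risk` from existing severity fields.
* Set `tau_days` from config.
**Example:**
```sql
-- Run once, right after the migration and before any signals are recorded.
UPDATE issues
SET last_signal_at = created_at;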
UPDATE issues
SET base_priority = CASE severity
WHEN 'critical' THEN 100
WHEN 'high' THEN 80
WHEN 'medium' THEN 50
WHEN 'low' THEN 20
ELSE 10
END
WHERE base_priority IS NULL;
UPDATE issues i
SET tau_days = c.tau_days_default
FROM confidence_decay_config c
WHERE c.entity_type = 'issue'
AND i.tau_days IS NULL;
```
Do similarly for `vulnerabilities` using severity / CVSS.
### 5.2. Sanity checks
Add a small script/test to verify:
* Newly created items → `confidence ≈ 1.0`.
* 30-day-old items with τ=30 → `confidence ≈ 0.37`.
* Ordering changes when you edit/comment on items.
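The checks above could be scripted roughly as follows. This is a Python sketch that mirrors the guards and the 0.01 floor of the `ConfidenceMath` helper in section 3.1; the function names and the fixed `now` value are assumptions made for the example.

```python
import math
from datetime import datetime, timedelta, timezone

FLOOR = 0.01  # same floor as the ConfidenceMath helper

def confidence_score(last_signal_at, tau_days, now):
    """Python mirror of ConfidenceMath.ConfidenceScore, including its guards."""
    t_days = (now - last_signal_at).total_seconds() / 86400
    if t_days <= 0 or tau_days <= 0:
        return 1.0
    return max(math.exp(-t_days / tau_days), FLOOR)

def run_sanity_checks() -> bool:
    now = datetime(2025, 11, 29, tzinfo=timezone.utc)
    # Newly created item: confidence is 1.0.
    assert confidence_score(now, 30, now) == 1.0
    # 30 days old with tau = 30: about 0.37.
    assert abs(confidence_score(now - timedelta(days=30), 30, now) - 0.37) < 0.005
    # A fresh signal lifts an item above one that stayed untouched.
    stale = confidence_score(now - timedelta(days=45), 30, now)
    touched = confidence_score(now - timedelta(hours=1), 30, now)
    assert touched > stale
    # Very old items stop at the floor instead of reaching zero.
    assert confidence_score(now - timedelta(days=3650), 30, now) == FLOOR
    return True

if __name__ == "__main__":
    print("ok" if run_sanity_checks() else "failed")
```

Passing `now` explicitly keeps the checks deterministic, which matters when the same assertions run in CI.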
---
## 6. API & Query Layer
### 6.1. New sorting options
Update list APIs:
* Accept parameter: `sort=effective_priority` or `sort=confidence`.
* Default sort for some views:
* Vulnerabilities backlog: `sort=effective_risk` (risk × confidence).
* Issues backlog: `sort=effective_priority`.
**Example REST API contract:**
`GET /api/issues?sort=effective_priority&state=open`
**Response fields (additions):**
```json
{
  "id": "ISS-123",
  "title": "Fix login bug",
  "base_priority": 80,
  "last_signal_at": "2025-11-01T10:00:00Z",
  "tau_days": 30,
  "confidence": 0.63,
  "effective_priority": 50.4,
  "confidence_band": "green"
}
```
### 6.2. Confidence banding (for UI)
Define bands server-side (easy to change):
* Green: `confidence >= 0.6`
* Amber: `0.3 ≤ confidence < 0.6`
* Red: `confidence < 0.3`
You can compute on server:
```csharp
string ConfidenceBand(double confidence) =>
    confidence >= 0.6 ? "green"
    : confidence >= 0.3 ? "amber"
    : "red";
```
---
## 7. UI / UX changes
### 7.1. List views (issues / vulns / epics)
For each item row:
* Show a small freshness pill:
* Text: `Active`, `Review soon`, `Stale`
* Derived from confidence band.
* Tooltip:
* “Confidence 90%. Last activity 3 days ago. τ = 30 days.”
* Sort default: by `effective_priority` / `effective_risk`.
* Filters:
* `Freshness: [All | Active | Review soon | Stale]`
* Optionally: “Show stale only” toggle.
**Example labels:**
* Green: “Active (confidence 82%)”
* Amber: “Review soon (confidence 45%)”
* Red: “Stale (confidence 18%)”
### 7.2. Detail views
On an issue / vuln page:
* Add a “Confidence” section:
* “Confidence: **67%**”
* “Last signal: **12 days ago**”
* “Decay τ: **30 days**”
* “Effective priority: **Base 80 × 0.67 = 54**”
* (Optional) small mini-chart (text-only or simple bar) showing approximate decay, but not necessary for first iteration.
### 7.3. Admin / settings UI
Add an internal settings page:
* Table of entity types with editable τ:
| Entity type | τ (days) | Notes |
| ------------- | -------- | ---------------------------- |
| Incident | 7 | Fast-moving |
| Vulnerability | 30 | Standard risk review cadence |
| Issue | 30 | Sprint-level decay |
| Epic | 60 | Quarterly |
| Doc | 90 | Slow decay |
* Optionally: toggle to pin item (`is_confidence_frozen`) from UI.
---
## 8. StellaOps-specific behavior
For vulnerabilities:
### 8.1. Base risk calculation
Ingested fields you likely already have:
* `cvss_score` or `severity`
* `reachable` (true/false or numeric)
* (Optional) `exploit_available` (bool) or exploitability score
* `asset_criticality` (1–5)
Define `base_risk` as:
```text
severity_weight = f(cvss_score or severity)
reachability = reachable ? 1.0 : 0.5 -- example
exploitability = exploit_available ? 1.0 : 0.7
asset_factor = 0.5 + 0.1 * asset_criticality -- 1 → 0.6, 5 → 1.0
base_risk = severity_weight * reachability * exploitability * asset_factor
```
Store `base_risk` on vuln row.
Then:
```text
effective_risk = base_risk * confidence(t)
```
Use `effective_risk` for backlog ordering and SLA dashboards.
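As a worked example of the two formulas, here is a hedged Python sketch. The severity weights reuse the example mapping from section 3.2, the factor values are the example ones from the block above, and the function names are hypothetical.

```python
import math

# Example weights (Critical=100, High=80, Medium=50, Low=20); tune per deployment.
SEVERITY_WEIGHT = {"critical": 100, "high": 80, "medium": 50, "low": 20}

def base_risk(severity: str, reachable: bool, exploit_available: bool,
              asset_criticality: int) -> float:
    """severity_weight * reachability * exploitability * asset_factor."""
    reachability = 1.0 if reachable else 0.5
    exploitability = 1.0 if exploit_available else 0.7
    # As written, this spans 0.6 .. 1.0 for criticality 1 .. 5.
    asset_factor = 0.5 + 0.1 * asset_criticality
    return SEVERITY_WEIGHT[severity] * reachability * exploitability * asset_factor

def effective_risk(base: float, t_days: float, tau_days: float = 30.0) -> float:
    """Apply the time decay to a stored base risk."""
    return base * math.exp(-t_days / tau_days)

if __name__ == "__main__":
    base = base_risk("critical", reachable=True, exploit_available=False, asset_criticality=5)
    print(f"base={base:.1f} effective after 30d={effective_risk(base, 30):.1f}")
```

Keeping `base_risk` free of any time component means it only needs recomputing when the ingested fields change, while the decay is applied at read time.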
### 8.2. Signals for vulns
Make sure these all call `RecordSignalAsync(Vulnerability, vulnId)`:
* New scan result for same vuln (re-detected).
* Change status to “In Progress”, “Ready for Deploy”, “Verified Fixed”, etc.
* Assigning an owner.
* Attaching PoC / exploit details.
### 8.3. Vuln UI copy ideas
* Pill text:
* “Risk: 850 (confidence 68%)”
* “Last analyst activity 11 days ago”
* In backlog view: show **Effective Risk** as main sort, with a smaller subtext “Base 1200 × Confidence 71%”.
---
## 9. Rollout plan
### Phase 1: Infrastructure (backend-only)
* [ ] DB migrations & config table
* [ ] Implement `ConfidenceMath` and helper functions
* [ ] Implement `IConfidenceSignalService`
* [ ] Wire signals into key flows (comments, state changes, scanner ingestion)
* [ ] Add `confidence` and `effective_priority/risk` to API responses
* [ ] Backfill script + dry run in staging
### Phase 2: Internal UI & feature flag
* [ ] Add optional sorting by effective score to internal/staff views
* [ ] Add confidence pill (hidden behind feature flag `confidence_decay_v1`)
* [ ] Dogfood internally:
* Do items bubble up/down as expected?
* Are any items “disappearing” because decay is too aggressive?
### Phase 3: Parameter tuning
* [ ] Adjust τ per type based on feedback:
* If things decay too fast → increase τ
* If queues rarely change → decrease τ
* [ ] Decide on confidence floor (0.01? 0.05?) so nothing goes to literal 0.
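When tuning, it may be easier to start from a target in days and derive τ or the floor's reach from it. The helpers below are an illustrative Python sketch; their names are invented for this example.

```python
import math

def tau_for_half_life(half_life_days: float) -> float:
    """Pick tau so that confidence reaches 0.5 after `half_life_days`."""
    return half_life_days / math.log(2)

def days_to_floor(tau_days: float, floor: float) -> float:
    """Days of silence before the floor takes over: tau * ln(1 / floor)."""
    return tau_days * math.log(1.0 / floor)

if __name__ == "__main__":
    print(f"tau for a 14-day half-life: {tau_for_half_life(14):.1f}")
    for floor in (0.01, 0.05):
        print(f"floor {floor}: reached after {days_to_floor(30, floor):.0f} days at tau=30")
```

At τ = 30, a floor of 0.01 only takes effect after roughly 138 silent days, while 0.05 takes effect after roughly 90, so the choice mainly affects how very old items rank against each other.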
### Phase 4: General release
* [ ] Make effective score the default sort for key views:
* Vulnerabilities backlog
* Issues backlog
* [ ] Document behavior for users (help center / inline tooltip)
* [ ] Add admin UI to tweak τ per entity type.
---
## 10. Edge cases & safeguards
* **New items**
* `last_signal_at = created_at`, confidence = 1.0.
* **Pinned items**
* If `is_confidence_frozen = true`, treat confidence as 1.0.
* **Items without τ**
* Always fallback to entity type default.
* **Timezones**
* Always store & compute in UTC.
* **Very old items**
* Floor the confidence so they're still visible when explicitly searched.
---
If you want, I can turn this into:
* A short **technical design doc** (with sections: Problem, Proposal, Alternatives, Rollout).
* Or a **set of Jira tickets** grouped by backend / frontend / infra that your team can pick up directly.


@@ -0,0 +1,402 @@
# CLI Developer Experience and Command UX
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, command surface design, and implementation strategy for the Stella Ops CLI, covering developer experience, CI/CD integration, output formatting, and offline operation.
---
## 1. Executive Summary
The Stella Ops CLI is the **primary interface for developers and CI/CD pipelines** interacting with the platform. Key capabilities:
- **Native AOT Binary** - Sub-20ms startup, single binary distribution
- **DPoP-Bound Authentication** - Secure device-code and service principal flows
- **Deterministic Outputs** - JSON/table modes with stable exit codes for CI
- **Buildx Integration** - SBOM generation at build time
- **Offline Kit Management** - Air-gapped deployment support
- **Shell Completions** - Bash/Zsh/Fish/PowerShell auto-complete
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | CLI Requirements | Use Case |
|---------|-----------------|----------|
| **DevSecOps** | CI integration, exit codes, JSON output | Pipeline gates |
| **Security Engineers** | Verification commands, policy testing | Audit workflows |
| **Platform Operators** | Offline kit, admin commands | Air-gap management |
| **Developers** | Scan commands, buildx integration | Local development |
### 2.2 Competitive Positioning
Most CLI tools in the vulnerability space are slow or lack CI ergonomics. Stella Ops differentiates with:
- **Native AOT** for instant startup (< 20ms vs 500ms+ for JIT)
- **Deterministic exit codes** (12 distinct codes for CI decision trees)
- **DPoP security** (no long-lived tokens on disk)
- **Unified command surface** (50+ commands, consistent patterns)
- **Offline-first design** (works without network in sealed mode)
---
## 3. Command Surface Architecture
### 3.1 Command Categories
| Category | Commands | Purpose |
|----------|----------|---------|
| **Auth** | `login`, `logout`, `status`, `token` | Authentication management |
| **Scan** | `scan image`, `scan fs` | Vulnerability scanning |
| **Export** | `export sbom`, `report final` | Artifact retrieval |
| **Verify** | `verify attestation`, `verify referrers`, `verify image-signature` | Cryptographic verification |
| **Policy** | `policy get`, `policy set`, `policy apply` | Policy management |
| **Buildx** | `buildx install`, `buildx verify`, `buildx build` | Build-time SBOM |
| **Runtime** | `runtime policy test` | Zastava integration |
| **Offline** | `offline kit pull`, `offline kit import`, `offline kit status` | Air-gap operations |
| **Decision** | `decision export`, `decision verify`, `decision compare` | VEX evidence management |
| **AOC** | `sources ingest`, `aoc verify` | Aggregation-only guards |
| **KMS** | `kms export`, `kms import` | Key management |
| **Advise** | `advise run` | AI-powered advisory summaries |
### 3.2 Output Modes
**Human Mode (default):**
```
$ stella scan image nginx:latest --wait
Scanning nginx:latest...
Found 12 vulnerabilities (2 critical, 3 high, 5 medium, 2 low)
Policy verdict: FAIL
Critical:
- CVE-2025-12345 in openssl (fixed in 3.0.14)
- CVE-2025-12346 in libcurl (no fix available)
See: https://ui.internal/scans/sha256:abc123...
```
**JSON Mode (`--json`):**
```json
{"event":"scan.complete","status":"fail","critical":2,"high":3,"medium":5,"low":2,"url":"https://..."}
```
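A script consuming JSON mode could parse each line as below. This is a sketch only: the field names are taken from the sample line above, and the assumption that other event types may appear on the stream is mine.

```python
import json

def summarize(line: str) -> str:
    """Parse one `--json` event line and return a one-line summary."""
    event = json.loads(line)
    if event.get("event") != "scan.complete":
        return "ignored"
    counts = {k: event.get(k, 0) for k in ("critical", "high", "medium", "low")}
    total = sum(counts.values())
    return f"{event['status']}: {total} findings, {counts['critical']} critical"

sample = '{"event":"scan.complete","status":"fail","critical":2,"high":3,"medium":5,"low":2,"url":"https://..."}'
print(summarize(sample))
```

Gating decisions are better driven by the exit code than by parsing `status`, since the exit code is the documented CI contract.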
### 3.3 Exit Codes
| Code | Meaning | CI Action |
|------|---------|-----------|
| 0 | Success | Continue |
| 2 | Policy fail | Block deployment |
| 3 | Verification failed | Security alert |
| 4 | Auth error | Re-authenticate |
| 5 | Resource not found | Check inputs |
| 6 | Rate limited | Retry with backoff |
| 7 | Backend unavailable | Retry |
| 9 | Invalid arguments | Fix command |
| 11-17 | AOC guard violations | Review ingestion |
| 18 | Verification truncated | Increase limit |
| 70 | Transport failure | Check network |
| 71 | Usage error | Fix command |
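In a pipeline wrapper, the table above can be encoded as a lookup. The following Python sketch copies the codes and actions from the table; the wrapper functions themselves are hypothetical and not part of the CLI.

```python
# Exit codes and CI actions copied from the table above.
EXIT_ACTIONS = {
    0: "continue",
    2: "block deployment",
    3: "security alert",
    4: "re-authenticate",
    5: "check inputs",
    6: "retry with backoff",
    7: "retry",
    9: "fix command",
    18: "increase limit",
    70: "check network",
    71: "fix command",
}

# Only the codes whose documented action is a retry.
RETRYABLE = {6, 7}

def ci_action(code: int) -> str:
    """Translate a CLI exit code into the suggested CI action."""
    if 11 <= code <= 17:
        return "review ingestion"  # AOC guard violations
    return EXIT_ACTIONS.get(code, "unknown exit code: fail the job")

def should_retry(code: int) -> bool:
    return code in RETRYABLE

if __name__ == "__main__":
    for code in (0, 2, 6, 14, 99):
        print(code, ci_action(code), should_retry(code))
```

Treating unlisted codes as failures keeps the gate conservative if new codes are added later.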
---
## 4. Authentication Model
### 4.1 Device Code Flow (Interactive)
```bash
$ stella auth login
Opening browser for authentication...
Device code: ABCD-EFGH
Waiting for authorization...
Logged in as user@example.com (tenant: acme-corp)
```
### 4.2 Service Principal (CI/CD)
```bash
$ stella auth login --client-credentials \
    --client-id "$STELLA_CLIENT_ID" \
    --private-key "$STELLA_PRIVATE_KEY"
```
### 4.3 DPoP Key Management
- Ephemeral Ed25519 keypair generated on first login
- Stored in OS keychain (Keychain/DPAPI/KWallet/GNOME Keyring)
- Every request includes DPoP proof header
- Tokens refreshed proactively (30s before expiry)
### 4.4 Token Credential Helper
```bash
# Get one-shot token for curl/scripts
TOKEN=$(stella auth token --aud scanner)
curl -H "Authorization: Bearer $TOKEN" https://scanner.internal/api/...
```
---
## 5. Buildx Integration
### 5.1 Generator Installation
```bash
$ stella buildx install
Installing SBOM generator plugin...
Verifying signature: OK
Generator installed at ~/.docker/cli-plugins/docker-buildx-stellaops
$ stella buildx verify
Docker version: 24.0.7
Buildx version: 0.12.1
Generator: stellaops/sbom-indexer:v1.2.3@sha256:abc123...
Status: Ready
```
### 5.2 Build with SBOM
```bash
$ stella buildx build -t myapp:v1.0.0 --push --attest
Building myapp:v1.0.0...
SBOM generation: enabled (stellaops/sbom-indexer)
Provenance: enabled
Attestation: requested
Build complete!
Image: myapp:v1.0.0@sha256:def456...
SBOM: attached as referrer
Attestation: logged to Rekor (uuid: abc123)
```
---
## 6. Implementation Strategy
### 6.1 Phase 1: Core Commands (Complete)
- [x] Auth commands with DPoP
- [x] Scan/export commands
- [x] JSON output mode
- [x] Exit code standardization
- [x] Shell completions
### 6.2 Phase 2: Buildx & Verification (Complete)
- [x] Buildx plugin management
- [x] Attestation verification
- [x] Referrer verification
- [x] Report commands
### 6.3 Phase 3: Advanced Features (In Progress)
- [x] Decision export/verify commands
- [x] AOC guard helpers
- [x] KMS management
- [ ] Advisory AI integration (CLI-ADVISE-48-001)
- [ ] Filesystem scanning (CLI-SCAN-49-001)
### 6.4 Phase 4: Distribution (Planned)
- [ ] Homebrew formula
- [ ] Scoop/Winget manifests
- [ ] Self-update mechanism
- [ ] Cosign signature verification
---
## 7. CI/CD Integration Patterns
### 7.1 GitHub Actions
```yaml
- name: Install Stella CLI
  run: |
    curl -sSL https://get.stella-ops.io | sh
    echo "$HOME/.stella/bin" >> "$GITHUB_PATH"
- name: Authenticate
  run: stella auth login --client-credentials
  env:
    STELLAOPS_CLIENT_ID: ${{ secrets.STELLA_CLIENT_ID }}
    STELLAOPS_PRIVATE_KEY: ${{ secrets.STELLA_PRIVATE_KEY }}
- name: Scan Image
  run: |
    # The default shell runs with `-e`, so capture the exit code explicitly.
    rc=0
    stella scan image "${{ env.IMAGE_REF }}" --wait --json > scan-results.json || rc=$?
    if [ "$rc" -eq 2 ]; then
      echo "::error::Policy failed - blocking deployment"
    fi
    exit "$rc"
- name: Verify Attestation
  run: stella verify attestation --artifact "${{ env.IMAGE_DIGEST }}"
```
### 7.2 GitLab CI
```yaml
scan:
  script:
    - stella auth login --client-credentials
    - stella buildx install
    - docker buildx build --attest=type=sbom,generator=stellaops/sbom-indexer -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - stella scan image $CI_REGISTRY_IMAGE@$IMAGE_DIGEST --wait --json > scan-results.json
  artifacts:
    reports:
      container_scanning: scan-results.json
```
---
## 8. Configuration Model
### 8.1 Precedence
CLI flags > Environment variables > Config file > Defaults
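The precedence chain can be modelled as a layered lookup in which the first layer that defines a key wins. This Python sketch uses `collections.ChainMap`; the setting names and sample values are illustrative and do not reflect the CLI's actual internals.

```python
from collections import ChainMap

def resolve(flags: dict, env: dict, config: dict, defaults: dict) -> ChainMap:
    """Earlier maps win: flags > environment > config file > defaults."""
    return ChainMap(flags, env, config, defaults)

settings = resolve(
    flags={"json": True},                                   # e.g. --json was passed
    env={"authority": "https://authority.env.example"},     # e.g. STELLAOPS_AUTHORITY
    config={"authority": "https://authority.example.com", "color": "auto"},
    defaults={"json": False, "color": "auto", "authority": None},
)

print(settings["json"], settings["authority"], settings["color"])
```

One caveat with this approach: an option that was not supplied must be absent from its layer rather than set to `None`, otherwise it would shadow the lower layers.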
### 8.2 Config File
```yaml
# ~/.config/stellaops/config.yaml
cli:
  authority: "https://authority.example.com"
  backend:
    scanner: "https://scanner.example.com"
    attestor: "https://attestor.example.com"
  auth:
    deviceCode: true
    audienceDefault: "scanner"
  output:
    json: false
    color: auto
  tls:
    caBundle: "/etc/ssl/certs/ca-bundle.crt"
  offline:
    kitMirror: "s3://mirror/stellaops-kit"
```
### 8.3 Environment Variables
| Variable | Purpose |
|----------|---------|
| `STELLAOPS_AUTHORITY` | Authority URL |
| `STELLAOPS_SCANNER_URL` | Scanner service URL |
| `STELLAOPS_CLIENT_ID` | Service principal ID |
| `STELLAOPS_PRIVATE_KEY` | Service principal key |
| `STELLAOPS_TENANT` | Default tenant |
| `STELLAOPS_JSON` | Enable JSON output |
---
## 9. Offline Operation
### 9.1 Sealed Mode Detection
```bash
$ stella scan image nginx:latest
Error: Sealed mode active - external network access blocked
Remediation: Import offline kit or disable sealed mode
$ stella offline kit import latest-kit.tar.gz
Importing offline kit...
Advisories: 45,230 records
VEX documents: 12,450 records
Policy packs: 3 bundles
Import complete!
$ stella scan image nginx:latest
Scanning with offline data (2025-11-28)...
```
### 9.2 Air-Gap Guard
All HTTP flows route through `StellaOps.AirGap.Policy`. When sealed mode is active:
- External egress is blocked with `AIRGAP_EGRESS_BLOCKED` error
- CLI provides clear remediation guidance
- Local verification continues to work
---
## 10. Security Considerations
### 10.1 Credential Protection
- DPoP private keys stored in OS keychain only
- No plaintext tokens on disk
- Short-lived OpToks held in memory only
- Authorization headers redacted from verbose logs
### 10.2 Binary Verification
```bash
# Verify CLI binary signature
$ stella version --verify
Version: 1.2.3
Built: 2025-11-29T12:00:00Z
Signature: Valid (cosign)
Signer: release@stella-ops.io
```
### 10.3 Hard Lines
- Refuse to print token values
- Disallow `--insecure` without explicit env var opt-in
- Enforce short token TTL with proactive refresh
- Device-code cache bound to machine + user
---
## 11. Performance Targets
| Metric | Target |
|--------|--------|
| Startup time | < 20ms (AOT) |
| Request overhead | < 5ms |
| Large download (100MB) | > 80 MB/s |
| Buildx wrapper overhead | < 1ms |
---
## 12. Related Documentation
| Resource | Location |
|----------|----------|
| CLI architecture | `docs/modules/cli/architecture.md` |
| Policy CLI guide | `docs/modules/cli/guides/policy.md` |
| API/CLI reference | `docs/09_API_CLI_REFERENCE.md` |
| Offline operation | `docs/24_OFFLINE_KIT.md` |
---
## 13. Sprint Mapping
- **Primary Sprint:** SPRINT_0400_cli_ux.md (NEW)
- **Related Sprints:**
- SPRINT_210_ui_ii.md (UI integration)
- SPRINT_0187_0001_0001_evidence_locker_cli_integration.md (Evidence CLI)
**Key Task IDs:**
- `CLI-AUTH-10-001` - DPoP authentication (DONE)
- `CLI-SCAN-20-001` - Scan commands (DONE)
- `CLI-BUILDX-30-001` - Buildx integration (DONE)
- `CLI-ADVISE-48-001` - Advisory AI commands (IN PROGRESS)
- `CLI-SCAN-49-001` - Filesystem scanning (TODO)
---
## 14. Success Metrics
| Metric | Target |
|--------|--------|
| Startup latency | < 20ms p99 |
| CI adoption | 80% of pipelines use CLI |
| Exit code coverage | 100% of failure modes |
| Shell completion coverage | 100% of commands |
| Offline operation success | Works without network |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,476 @@
# Concelier Advisory Ingestion Model
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, ingestion semantics, and implementation strategy for the Concelier module, covering the Link-Not-Merge model, connector pipelines, observation storage, and deterministic exports.
---
## 1. Executive Summary
Concelier is the **advisory ingestion engine** that acquires, normalizes, and correlates vulnerability advisories from authoritative sources. Key capabilities:
- **Aggregation-Only Contract** - No derived semantics in ingestion
- **Link-Not-Merge** - Observations correlated, never merged
- **Multi-Source Connectors** - Vendor PSIRTs, distros, OSS ecosystems
- **Deterministic Exports** - Reproducible JSON, Trivy DB bundles
- **Conflict Detection** - Structured payloads for divergent claims
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Ingestion Requirements | Use Case |
|---------|------------------------|----------|
| **Security Teams** | Authoritative data | Accurate vulnerability assessment |
| **Compliance** | Provenance tracking | Audit trail for advisory sources |
| **DevSecOps** | Fast updates | CI/CD pipeline integration |
| **Air-Gap Ops** | Offline bundles | Disconnected environment support |
### 2.2 Competitive Positioning
Most vulnerability databases merge data, losing provenance. Stella Ops differentiates with:
- **Link-Not-Merge** preserving all source claims
- **Conflict visibility** showing where sources disagree
- **Deterministic exports** enabling reproducible builds
- **Multi-format support** (CSAF, OSV, GHSA, vendor-specific)
- **Signature verification** for upstream integrity
---
## 3. Aggregation-Only Contract (AOC)
### 3.1 Core Principles
The AOC ensures ingestion purity:
1. **No derived semantics** - No severity consensus, merged status, or fix hints
2. **Immutable raw docs** - Append-only with version chains
3. **Mandatory provenance** - Source, timestamp, signature status
4. **Linkset only** - Joins stored separately, never mutate content
5. **Deterministic canonicalization** - Stable JSON output
6. **Idempotent upserts** - Same hash = no new record
7. **CI verification** - AOCVerifier enforces at runtime
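Principles 2 and 6 (append-only storage and idempotent upserts) can be illustrated with a toy store keyed by content hash. This is a Python sketch, not Concelier's implementation: the `ObservationStore` class is hypothetical, and hashing a re-serialized sorted-key form is an assumption standing in for whatever canonicalization the module actually uses.

```python
import hashlib
import json

def content_hash(raw_doc: dict) -> str:
    """Hash a canonical JSON form: sorted keys, no insignificant whitespace."""
    canonical = json.dumps(raw_doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class ObservationStore:
    """In-memory stand-in for an append-only observation collection."""

    def __init__(self):
        self._seen = set()
        self.log = []  # append-only; entries are never modified

    def upsert(self, upstream_id: str, raw_doc: dict) -> bool:
        """Return True if a record was appended, False if this hash is already stored."""
        digest = content_hash(raw_doc)
        key = (upstream_id, digest)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.log.append({"upstreamId": upstream_id, "contentHash": digest, "raw": raw_doc})
        return True

if __name__ == "__main__":
    store = ObservationStore()
    print(store.upsert("GHSA-example", {"severity": "high", "id": 1}))  # appended
    print(store.upsert("GHSA-example", {"id": 1, "severity": "high"}))  # same content
```

Because the hash is computed over a key-order-independent form, re-fetching an unchanged upstream document produces no new record.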
### 3.2 Enforcement
```csharp
// AOCWriteGuard checks before every write
public class AOCWriteGuard
{
    public Task GuardAsync(AdvisoryObservation obs)
    {
        // Verify no forbidden properties
        // Validate provenance completeness
        // Check tenant claims
        // Normalize timestamps
        // Compute content hash
        return Task.CompletedTask; // outline only; each step above is a placeholder
    }
}
```
Roslyn analyzers (`StellaOps.AOC.Analyzers`) scan connectors at build time to prevent forbidden property usage.
---
## 4. Advisory Observation Model
### 4.1 Observation Structure
```json
{
  "_id": "tenant:vendor:upstreamId:revision",
  "tenant": "acme-corp",
  "source": {
    "vendor": "OSV",
    "stream": "github",
    "api": "https://api.osv.dev/v1/.../GHSA-...",
    "collectorVersion": "concelier/1.7.3"
  },
  "upstream": {
    "upstreamId": "GHSA-xxxx-....",
    "documentVersion": "2025-09-01T12:13:14Z",
    "fetchedAt": "2025-09-01T13:04:05Z",
    "receivedAt": "2025-09-01T13:04:06Z",
    "contentHash": "sha256:...",
    "signature": {
      "present": true,
      "format": "dsse",
      "keyId": "rekor:.../key/abc"
    }
  },
  "content": {
    "format": "OSV",
    "specVersion": "1.6",
    "raw": { /* unmodified upstream document */ }
  },
  "identifiers": {
    "primary": "GHSA-xxxx-....",
    "aliases": ["CVE-2025-12345", "GHSA-xxxx-...."]
  },
  "linkset": {
    "purls": ["pkg:npm/lodash@4.17.21"],
    "cpes": ["cpe:2.3:a:lodash:lodash:4.17.21:*:*:*:*:*:*:*"],
    "references": [
      {"type": "advisory", "url": "https://..."},
      {"type": "fix", "url": "https://..."}
    ]
  },
  "supersedes": "tenant:vendor:upstreamId:prev-revision",
  "createdAt": "2025-09-01T13:04:06Z"
}
```
### 4.2 Linkset Correlation
```json
{
  "_id": "sha256:...",
  "tenant": "acme-corp",
  "key": {
    "vulnerabilityId": "CVE-2025-12345",
    "productKey": "pkg:npm/lodash@4.17.21",
    "confidence": "high"
  },
  "observations": [
    {
      "observationId": "tenant:osv:GHSA-...:v1",
      "sourceVendor": "OSV",
      "statement": { "severity": "high" },
      "collectedAt": "2025-09-01T13:04:06Z"
    },
    {
      "observationId": "tenant:nvd:CVE-2025-12345:v2",
      "sourceVendor": "NVD",
      "statement": { "severity": "critical" },
      "collectedAt": "2025-09-01T14:00:00Z"
    }
  ],
  "conflicts": [
    {
      "conflictId": "sha256:...",
      "type": "severity-mismatch",
      "observations": [
        { "source": "OSV", "value": "high" },
        { "source": "NVD", "value": "critical" }
      ],
      "confidence": "medium",
      "detectedAt": "2025-09-01T14:00:01Z"
    }
  ]
}
```
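To show how a `severity-mismatch` entry could be derived from the `observations` array without merging anything, here is a minimal Python sketch. The function is hypothetical and only reads the field names present in the linkset example above; it deliberately does not pick a winner.

```python
def severity_conflicts(linkset: dict) -> list:
    """Return one severity-mismatch conflict if observations disagree, else []."""
    claims = [
        {"source": o["sourceVendor"], "value": o["statement"]["severity"]}
        for o in linkset.get("observations", [])
        if "severity" in o.get("statement", {})
    ]
    if len({c["value"] for c in claims}) <= 1:
        return []
    # Sort so the result does not depend on observation order.
    claims.sort(key=lambda c: (c["source"], c["value"]))
    return [{"type": "severity-mismatch", "observations": claims}]

if __name__ == "__main__":
    linkset = {
        "observations": [
            {"sourceVendor": "OSV", "statement": {"severity": "high"}},
            {"sourceVendor": "NVD", "statement": {"severity": "critical"}},
        ]
    }
    print(severity_conflicts(linkset))
```

Sorting the claims keeps the conflict payload stable across runs, which matters if its hash is used as the `conflictId`.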
---
## 5. Source Connectors
### 5.1 Source Families
| Family | Examples | Format |
|--------|----------|--------|
| **Vendor PSIRTs** | Microsoft, Oracle, Cisco, Adobe | CSAF, proprietary |
| **Linux Distros** | Red Hat, SUSE, Ubuntu, Debian, Alpine | CSAF, JSON, XML |
| **OSS Ecosystems** | OSV, GHSA, npm, PyPI, Maven | OSV, GraphQL |
| **CERTs** | CISA (KEV), JVN, CERT-FR | JSON, XML |
### 5.2 Connector Contract
```csharp
public interface IFeedConnector
{
string SourceName { get; }
// Fetch signed feeds or offline mirrors
Task FetchAsync(IServiceProvider sp, CancellationToken ct);
// Normalize to strongly-typed DTOs
Task ParseAsync(IServiceProvider sp, CancellationToken ct);
// Build canonical records with provenance
Task MapAsync(IServiceProvider sp, CancellationToken ct);
}
```
### 5.3 Connector Lifecycle
1. **Snapshot** - Fetch with cursor, ETag, rate limiting
2. **Parse** - Schema validation, normalization
3. **Guard** - AOCWriteGuard enforcement
4. **Write** - Append-only insert
5. **Event** - Emit `advisory.observation.updated`
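
The Guard, Write, and Event steps can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: `ObservationStore` is hypothetical, the guard is reduced to "never overwrite an existing ID", and the real AOCWriteGuard rules are not described in this section.

```python
class ObservationStore:
    """In-memory stand-in for an append-only observation collection."""

    def __init__(self):
        self._by_id = {}
        self.events = []

    def write(self, observation):
        obs_id = observation["_id"]
        # Guard: an existing record is never overwritten; a new revision
        # must arrive under a new ID and point back via "supersedes".
        if obs_id in self._by_id:
            raise ValueError(f"observation {obs_id} already exists")
        # Write: append-only insert.
        self._by_id[obs_id] = observation
        # Event: announce the insert so downstream consumers can react.
        self.events.append({
            "type": "advisory.observation.updated",
            "observationId": obs_id,
            "supersedes": observation.get("supersedes"),
        })


store = ObservationStore()
store.write({"_id": "tenant:osv:GHSA-example:v1"})
store.write({
    "_id": "tenant:osv:GHSA-example:v2",
    "supersedes": "tenant:osv:GHSA-example:v1",
})
```

The IDs are illustrative placeholders that follow the `tenant:vendor:upstreamId:revision` pattern seen in the schema examples.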
---
## 6. Version Semantics
### 6.1 Ecosystem Normalization
| Ecosystem | Format | Normalization |
|-----------|--------|---------------|
| npm, PyPI, Maven | SemVer | Intervals with `<`, `>=`, `~`, `^` |
| RPM | EVR | `epoch:version-release` with order keys |
| DEB | dpkg | Version comparison with order keys |
| APK | Alpine | Computed order keys |
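
For the SemVer row, an affected range can be modeled as a half-open interval. The sketch below is an assumption-laden illustration: `parse_semver` and `in_interval` are hypothetical names, only the numeric `major.minor.patch` core is compared, and pre-release ordering is deliberately ignored.

```python
def parse_semver(version):
    """Return the numeric core of a SemVer string as a tuple of ints."""
    # Drop pre-release ("-...") and build metadata ("+...") suffixes.
    core = version.split("-", 1)[0].split("+", 1)[0]
    return tuple(int(part) for part in core.split("."))


def in_interval(version, lower=None, upper=None):
    """True if lower <= version < upper; a missing bound is unbounded."""
    v = parse_semver(version)
    if lower is not None and v < parse_semver(lower):
        return False
    if upper is not None and v >= parse_semver(upper):
        return False
    return True


# ">= 4.0.0, < 4.17.21" contains 4.17.20 but not the fixed release.
affected = in_interval("4.17.20", lower="4.0.0", upper="4.17.21")
fixed = in_interval("4.17.21", lower="4.0.0", upper="4.17.21")
```

RPM, DEB, and APK versions do not follow these rules, which is why the table lists separate order keys for them.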
### 6.2 CVSS Handling
- Normalize CVSS v2/v3/v4 where available
- Track all source CVSS values
- Effective severity defaults to the maximum across sources (the strategy is configurable)
- Store KEV evidence with source and date
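
The "track all values, pick one effective value" rule can be sketched as follows. `effective_severity` is a hypothetical helper and the scores are made-up illustration data; the default strategy mirrors the "max" rule above.

```python
def effective_severity(scores, strategy=max):
    """scores maps source name -> CVSS base score; all values stay tracked."""
    if not scores:
        return None  # no source supplied a score
    return strategy(scores.values())


scores = {"NVD": 9.8, "OSV": 7.5}
default_choice = effective_severity(scores)        # highest score wins
lenient_choice = effective_severity(scores, min)   # alternative strategy
```

Passing the strategy as a callable keeps the per-source scores untouched, so a different policy can be applied later without re-ingesting.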
---
## 7. Conflict Detection
### 7.1 Conflict Types
| Type | Description | Resolution |
|------|-------------|------------|
| `severity-mismatch` | Different severity ratings | Policy decides |
| `affected-range-divergence` | Different version ranges | Most specific wins |
| `reference-clash` | Contradictory references | Surface all |
| `alias-inconsistency` | Different alias mappings | Union with provenance |
| `metadata-gap` | Missing information | Flag for review |
### 7.2 Conflict Visibility
Conflicts are never hidden - they are:
- Stored in linkset documents
- Surfaced in API responses
- Included in exports
- Displayed in Console UI
---
## 8. Deterministic Exports
### 8.1 JSON Export
```
exports/json/
├── CVE/
│   ├── 20/
│   │   └── CVE-2025-12345.json
│   └── ...
├── manifest.json
└── export-digest.sha256
```
- Deterministic folder structure
- Canonical JSON (sorted keys, stable timestamps)
- Manifest with SHA-256 per file
- Reproducible across runs
### 8.2 Trivy DB Export
```
exports/trivy/
├── db.tar.gz
├── metadata.json
└── manifest.json
```
- Bolt DB compatible with Trivy
- Full and delta modes
- ORAS push to registries
- Mirror manifests for domains
### 8.3 Export Determinism
Running the same export against the same data must produce:
- Identical file contents
- Identical manifest hashes
- Identical export digests
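
One way to obtain this property is to hash a canonical serialization rather than whatever the serializer happens to emit. The sketch below assumes a simplified manifest layout (`files`, `exportDigest`) that is illustrative only and not the actual Concelier manifest schema.

```python
import hashlib
import json


def canonical_json(doc):
    """Serialize with sorted keys and fixed separators so bytes are stable."""
    return json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def build_manifest(files):
    """files maps relative path -> document; returns per-file and overall digests."""
    entries = [
        {"path": path, "sha256": hashlib.sha256(canonical_json(doc)).hexdigest()}
        for path, doc in sorted(files.items())  # stable file order
    ]
    export_digest = hashlib.sha256(canonical_json(entries)).hexdigest()
    return {"files": entries, "exportDigest": export_digest}


# Same data, different key insertion order: the manifests must be identical.
run_a = build_manifest({"CVE/20/CVE-2025-12345.json": {"id": "CVE-2025-12345", "severity": "high"}})
run_b = build_manifest({"CVE/20/CVE-2025-12345.json": {"severity": "high", "id": "CVE-2025-12345"}})
```

Timestamps are left out on purpose: any value that changes between runs has to be pinned or excluded before hashing.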
---
## 9. Implementation Strategy
### 9.1 Phase 1: Core Pipeline (Complete)
- [x] AOCWriteGuard implementation
- [x] Observation storage
- [x] Basic connectors (Red Hat, SUSE, OSV)
- [x] JSON export
### 9.2 Phase 2: Link-Not-Merge (Complete)
- [x] Linkset correlation engine
- [x] Conflict detection
- [x] Event emission
- [x] API surface
### 9.3 Phase 3: Expanded Sources (In Progress)
- [x] GHSA GraphQL connector
- [x] Debian DSA connector
- [ ] Alpine secdb connector (CONCELIER-CONN-50-001)
- [ ] CISA KEV enrichment (CONCELIER-KEV-51-001)
### 9.4 Phase 4: Export Enhancements (Planned)
- [ ] Delta Trivy DB exports
- [ ] ORAS registry push
- [ ] Attestation hand-off
- [ ] Mirror bundle signing
---
## 10. API Surface
### 10.1 Sources & Jobs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/concelier/sources` | GET | `concelier.read` | List sources |
| `/api/v1/concelier/sources/{name}/trigger` | POST | `concelier.admin` | Trigger fetch |
| `/api/v1/concelier/sources/{name}/pause` | POST | `concelier.admin` | Pause source |
| `/api/v1/concelier/jobs/{id}` | GET | `concelier.read` | Job status |
### 10.2 Exports
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/concelier/exports/json` | POST | `concelier.export` | Trigger JSON export |
| `/api/v1/concelier/exports/trivy` | POST | `concelier.export` | Trigger Trivy export |
| `/api/v1/concelier/exports/{id}` | GET | `concelier.read` | Export status |
### 10.3 Search
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/concelier/advisories/{key}` | GET | `concelier.read` | Get advisory |
| `/api/v1/concelier/observations/{id}` | GET | `concelier.read` | Get observation |
| `/api/v1/concelier/linksets` | GET | `concelier.read` | Query linksets |
---
## 11. Storage Model
### 11.1 Collections
| Collection | Purpose | Key Indexes |
|------------|---------|-------------|
| `sources` | Connector catalog | `{_id}` |
| `source_state` | Run state | `{sourceName}` |
| `documents` | Raw payloads | `{sourceName, uri}` |
| `advisory_observations` | Normalized records | `{tenant, upstream.upstreamId}` |
| `advisory_linksets` | Correlations | `{tenant, key.vulnerabilityId, key.productKey}` |
| `advisory_events` | Change log | `{type, occurredAt}` |
| `export_state` | Export cursors | `{exportKind}` |
### 11.2 GridFS Buckets
- `fs.documents` - Raw payloads (immutable)
- `fs.exports` - Historical archives
---
## 12. Event Model
### 12.1 Events
| Event | Trigger | Content |
|-------|---------|---------|
| `advisory.observation.updated@1` | New/superseded observation | IDs, hash, supersedes |
| `advisory.linkset.updated@1` | Correlation change | Deltas, conflicts |
### 12.2 Event Transport
- Primary: NATS
- Fallback: Redis Stream
- Offline Kit captures for replay
---
## 13. Observability
### 13.1 Metrics
- `concelier.fetch.docs_total{source}`
- `concelier.fetch.bytes_total{source}`
- `concelier.parse.failures_total{source}`
- `concelier.observations.write_total{result}`
- `concelier.linksets.updated_total{result}`
- `concelier.linksets.conflicts_total{type}`
- `concelier.export.duration_seconds{kind}`
### 13.2 Performance Targets
| Operation | Target |
|-----------|--------|
| Ingest throughput | 5k docs/min |
| Observation write | < 5ms p95 |
| Linkset build | < 15ms p95 |
| Export (1M advisories) | < 90 seconds |
---
## 14. Security Considerations
### 14.1 Outbound Security
- Allowlist per connector (domains, protocols)
- Proxy support with TLS pinning
- Rate limiting per source
### 14.2 Signature Verification
- PGP, cosign, and X.509 verification results are stored
- Documents that fail verification are flagged, not rejected
- Policy can down-weight unsigned sources
### 14.3 Determinism
- Canonical JSON writer
- Stable export digests
- Reproducible across runs
---
## 15. Related Documentation
| Resource | Location |
|----------|----------|
| Concelier architecture | `docs/modules/concelier/architecture.md` |
| Link-Not-Merge schema | `docs/modules/concelier/link-not-merge-schema.md` |
| Event schemas | `docs/modules/concelier/events/` |
| Attestation guide | `docs/modules/concelier/attestation.md` |
---
## 16. Sprint Mapping
- **Primary Sprint:** SPRINT_0115_0001_0004_concelier_iv.md
- **Related Sprints:**
- SPRINT_0113_0001_0002_concelier_ii.md
- SPRINT_0114_0001_0003_concelier_iii.md
**Key Task IDs:**
- `CONCELIER-AOC-40-001` - AOC enforcement (DONE)
- `CONCELIER-LNM-41-001` - Link-Not-Merge (DONE)
- `CONCELIER-CONN-50-001` - Alpine connector (IN PROGRESS)
- `CONCELIER-KEV-51-001` - KEV enrichment (TODO)
- `CONCELIER-EXPORT-55-001` - Delta exports (TODO)
---
## 17. Success Metrics
| Metric | Target |
|--------|--------|
| Advisory freshness | < 1 hour from source |
| Ingestion accuracy | 100% provenance retention |
| Export determinism | 100% hash reproducibility |
| Conflict detection | 100% of source divergence |
| Source coverage | 20+ authoritative sources |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,449 @@
# Export Center and Reporting Strategy
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, profile system, and implementation strategy for the Export Center module, covering bundle generation, adapter architecture, distribution channels, and compliance reporting.
---
## 1. Executive Summary
The Export Center is the **dedicated service layer for packaging reproducible evidence bundles**. Key capabilities:
- **Profile-Based Exports** - Seven built-in profile variants across four kinds (JSON, Trivy, Mirror, DevPortal)
- **Deterministic Bundles** - Bit-for-bit reproducible outputs with DSSE signatures
- **Multi-Format Adapters** - Pluggable adapters for different consumer needs
- **Distribution Channels** - HTTP download, OCI push, object storage
- **Compliance Ready** - Provenance, signatures, audit trails for SOC 2/FedRAMP
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Export Requirements | Use Case |
|---------|---------------------|----------|
| **Compliance Teams** | Signed bundles, provenance | Audit evidence |
| **Security Vendors** | Trivy DB format | Scanner integration |
| **Air-Gap Operators** | Offline mirrors | Disconnected environments |
| **Development Teams** | JSON exports | CI/CD integration |
### 2.2 Competitive Positioning
Most vulnerability platforms offer basic CSV/JSON exports. Stella Ops differentiates with:
- **Reproducible bundles** with cryptographic verification
- **Multi-format adapters** (Trivy, CycloneDX, SPDX, custom)
- **OCI distribution** for container-native workflows
- **Provenance attestations** meeting SLSA Level 2+
- **Delta exports** for bandwidth-efficient updates
---
## 3. Profile System
### 3.1 Built-in Profiles
| Profile | Variant | Description | Output Format |
|---------|---------|-------------|---------------|
| **JSON** | `raw` | Unprocessed advisory/VEX data | `.jsonl.zst` |
| **JSON** | `policy` | Policy-evaluated findings | `.jsonl.zst` |
| **Trivy** | `db` | Trivy vulnerability database | BoltDB |
| **Trivy** | `java-db` | Trivy Java advisory database | SQLite |
| **Mirror** | `full` | Complete offline mirror | Filesystem tree |
| **Mirror** | `delta` | Incremental updates | Filesystem tree |
| **DevPortal** | `offline` | Developer portal assets | Archive |
### 3.2 Profile Configuration
```yaml
apiVersion: stellaops.io/export.v1
kind: ExportProfile
metadata:
  name: compliance-report-monthly
  tenant: acme-corp
spec:
  kind: json
  variant: policy
  schedule: "0 0 1 * *" # Monthly
  selectors:
    tenants: ["acme-corp"]
    timeWindow: "30d"
    severities: ["critical", "high"]
    ecosystems: ["npm", "maven", "pypi"]
  options:
    compression: zstd
    encryption:
      enabled: true
      recipients: ["age1..."]
    signing:
      enabled: true
      keyRef: "kms://acme-corp/export-signing-key"
  distribution:
    - type: http
      retention: 90d
    - type: oci
      registry: "registry.acme.com/exports"
      repository: "compliance-reports"
```
### 3.3 Selector Expressions
| Selector | Description | Example |
|----------|-------------|---------|
| `tenants` | Tenant filter | `["acme-*"]` |
| `timeWindow` | Time range | `"30d"`, `"2025-01-01/2025-12-31"` |
| `products` | Product PURLs | `["pkg:npm/*", "pkg:maven/org.apache/*"]` |
| `severities` | Severity filter | `["critical", "high"]` |
| `ecosystems` | Package ecosystems | `["npm", "maven"]` |
| `policyVersions` | Policy snapshot IDs | `["rev-42", "rev-43"]` |
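
Selector evaluation can be pictured as a conjunction of per-field pattern checks. The sketch below is an assumption: it treats `*` as a shell-style wildcard via `fnmatchcase`, uses hypothetical record field names, and ignores `timeWindow` and `policyVersions`.

```python
from fnmatch import fnmatchcase


def matches(record, selectors):
    """True if the record satisfies every selector that is set."""
    fields = {
        "tenants": record.get("tenant"),
        "products": record.get("purl"),
        "severities": record.get("severity"),
        "ecosystems": record.get("ecosystem"),
    }
    for name, value in fields.items():
        patterns = selectors.get(name)
        if patterns is None:
            continue  # selector not set: no constraint on this field
        if value is None or not any(fnmatchcase(value, p) for p in patterns):
            return False
    return True


record = {
    "tenant": "acme-corp",
    "purl": "pkg:npm/lodash@4.17.21",
    "severity": "high",
    "ecosystem": "npm",
}
selected = matches(record, {"tenants": ["acme-*"], "products": ["pkg:npm/*"]})
```

Selectors combine with AND across fields and OR within a field's pattern list; whether the real engine follows the same semantics is not stated here.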
---
## 4. Adapter Architecture
### 4.1 Adapter Contract
```csharp
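using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// ExportContext, ExportRecord, and ExportResult are Export Center types not shown here.
public interface IExportAdapter
{
    string Kind { get; }    // e.g. "json", "trivy", "mirror"
    string Variant { get; } // e.g. "raw", "policy", "db"

    Task<ExportResult> RunAsync(
        ExportContext context,
        IAsyncEnumerable<ExportRecord> records,
        CancellationToken ct);
}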
```
### 4.2 JSON Adapter
**Responsibilities:**
- Canonical JSON serialization (sorted keys, RFC3339 UTC)
- Linkset preservation for traceability
- Zstandard compression
- AOC guardrails (no derived modifications to raw fields)
**Output:**
```
export/
├── advisories.jsonl.zst
├── vex-statements.jsonl.zst
├── findings.jsonl.zst (policy variant)
└── manifest.json
```
### 4.3 Trivy Adapter
**Responsibilities:**
- Map Stella Ops advisory schema to Trivy DB format
- Handle namespace collisions across ecosystems
- Validate against supported Trivy schema versions
- Generate severity distribution summary
**Compatibility:**
- Trivy DB schema v2 (current)
- Fail-fast on unsupported schema versions
### 4.4 Mirror Adapter
**Responsibilities:**
- Build self-contained filesystem layout
- Delta comparison against base manifest
- Optional encryption of `/data` subtree
- OCI layer generation
**Layout:**
```
mirror/
├── manifests/
│   ├── advisories.manifest.json
│   └── vex.manifest.json
├── data/
│   ├── raw/
│   │   ├── advisories/
│   │   └── vex/
│   └── policy/
│       └── findings/
├── indexes/
│   └── by-cve.index
└── manifest.json
```
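The delta comparison against a base manifest reduces to a set difference over paths and digests. This is a minimal sketch, assuming each manifest can be flattened to a `path -> digest` mapping; `manifest_delta` and the sample paths are hypothetical.

```python
def manifest_delta(base, current):
    """Compare two path -> digest mappings and classify every difference."""
    added = sorted(p for p in current if p not in base)
    removed = sorted(p for p in base if p not in current)
    changed = sorted(p for p in current if p in base and base[p] != current[p])
    return {"added": added, "changed": changed, "removed": removed}


base = {
    "data/raw/advisories/a.json": "sha256:aaa",
    "data/raw/vex/v.json": "sha256:bbb",
}
current = {
    "data/raw/advisories/a.json": "sha256:ccc",   # content changed
    "data/raw/advisories/b.json": "sha256:ddd",   # new file
}
delta = manifest_delta(base, current)
```

A delta bundle would then ship only the `added` and `changed` files plus the list of `removed` paths.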
---
## 5. Bundle Structure
### 5.1 Export Manifest
```json
{
"version": "1.0.0",
"exportId": "export-20251129-001",
"profile": {
"kind": "json",
"variant": "policy",
"name": "compliance-report-monthly"
},
"tenant": "acme-corp",
"generatedAt": "2025-11-29T12:00:00Z",
"generatedBy": "export-center-worker-1",
"selectors": {
"timeWindow": "2025-11-01/2025-11-30",
"severities": ["critical", "high"]
},
"contents": [
{
"path": "findings.jsonl.zst",
"size": 1048576,
"digest": "sha256:abc123...",
"recordCount": 8920
}
],
"totals": {
"advisories": 45230,
"vexStatements": 12450,
"findings": 8920
}
}
```
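A consumer can check a downloaded bundle against the `contents` entries of such a manifest. The following is a sketch, assuming the `path`, `size`, and `digest` fields shown above; `verify_contents` is a hypothetical helper and reads each file fully into memory, which is acceptable only for illustration.

```python
import hashlib
from pathlib import Path


def verify_contents(manifest, bundle_dir):
    """Return a list of problems; an empty list means every entry matches."""
    problems = []
    for entry in manifest["contents"]:
        path = Path(bundle_dir) / entry["path"]
        if not path.is_file():
            problems.append(f"{entry['path']}: missing")
            continue
        data = path.read_bytes()
        if len(data) != entry["size"]:
            problems.append(f"{entry['path']}: size mismatch")
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        if digest != entry["digest"]:
            problems.append(f"{entry['path']}: digest mismatch")
    return problems
```

This covers integrity only; authenticity still depends on verifying the signature over the manifest itself.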
### 5.2 Provenance Attestation
```json
{
"predicateType": "https://slsa.dev/provenance/v1",
"subject": [
{
"name": "export-20251129-001.tar.gz",
"digest": { "sha256": "def456..." }
}
],
"predicate": {
"buildDefinition": {
"buildType": "https://stellaops.io/export/v1",
"externalParameters": {
"profile": "compliance-report-monthly",
"selectors": { "...": "..." }
}
},
"runDetails": {
"builder": {
"id": "https://stellaops.io/export-center",
"version": "1.2.3"
},
"metadata": {
"invocationId": "export-run-123",
"startedOn": "2025-11-29T12:00:00Z",
"finishedOn": "2025-11-29T12:05:00Z"
}
}
}
}
```
---
## 6. Distribution Channels
### 6.1 HTTP Download
```bash
# Download bundle
curl -H "Authorization: Bearer $TOKEN" \
"https://export.stellaops.io/api/export/runs/{id}/download" \
-o export-bundle.tar.gz
# Verify signature
cosign verify-blob --key export-key.pub \
--signature export-bundle.sig \
export-bundle.tar.gz
```
**Features:**
- Chunked transfer encoding
- Range request support (resumable)
- `X-Export-Digest` header
- Optional encryption metadata
### 6.2 OCI Push
```bash
# Pull from registry
oras pull registry.example.com/exports/compliance:2025-11
# Verify annotations
oras manifest fetch registry.example.com/exports/compliance:2025-11 | jq
```
**Annotations:**
- `io.stellaops.export.profile`
- `io.stellaops.export.tenant`
- `io.stellaops.export.manifest-digest`
- `io.stellaops.export.provenance-ref`
### 6.3 Object Storage
```yaml
distribution:
  - type: object
    provider: s3
    bucket: stella-exports
    prefix: "${tenant}/${exportId}"
    retention: 365d
    immutable: true
```
---
## 7. Implementation Strategy
### 7.1 Phase 1: Core Infrastructure (Complete)
- [x] Profile CRUD APIs
- [x] JSON adapter (raw, policy)
- [x] HTTP download distribution
- [x] Manifest generation
### 7.2 Phase 2: Trivy Integration (Complete)
- [x] Trivy DB adapter
- [x] Trivy Java DB adapter
- [x] Schema version validation
- [x] Compatibility testing
### 7.3 Phase 3: Mirror & Distribution (In Progress)
- [x] Mirror full adapter
- [x] Mirror delta adapter
- [ ] OCI push distribution (EXPORT-OCI-45-001)
- [ ] DevPortal adapter (EXPORT-DEV-46-001)
### 7.4 Phase 4: Advanced Features (Planned)
- [ ] Encryption at rest
- [ ] Scheduled exports
- [ ] Retention policies
- [ ] Cross-tenant exports (with approval)
---
## 8. API Surface
### 8.1 Profile Management
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/export/profiles` | GET | `export:read` | List profiles |
| `/api/export/profiles` | POST | `export:profile:manage` | Create profile |
| `/api/export/profiles/{id}` | PATCH | `export:profile:manage` | Update profile |
| `/api/export/profiles/{id}` | DELETE | `export:profile:manage` | Delete profile |
### 8.2 Export Runs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/export/runs` | POST | `export:run` | Start export |
| `/api/export/runs/{id}` | GET | `export:read` | Get status |
| `/api/export/runs/{id}/events` | GET (SSE) | `export:read` | Stream progress |
| `/api/export/runs/{id}/cancel` | POST | `export:run` | Cancel export |
### 8.3 Downloads
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/export/runs/{id}/download` | GET | `export:download` | Download bundle |
| `/api/export/runs/{id}/manifest` | GET | `export:read` | Get manifest |
| `/api/export/runs/{id}/provenance` | GET | `export:read` | Get provenance |
---
## 9. Observability
### 9.1 Metrics
- `exporter_run_duration_seconds{profile,tenant}`
- `exporter_run_bytes_total{profile}`
- `exporter_run_failures_total{error_code}`
- `exporter_active_runs{tenant}`
- `exporter_distribution_push_seconds{type}`
### 9.2 Logs
Structured fields:
- `run_id`, `tenant`, `profile_kind`, `adapter`
- `phase` (plan, resolve, adapter, manifest, sign, distribute)
- `correlation_id`, `error_code`
---
## 10. Security Considerations
### 10.1 Access Control
- Tenant claim enforced at every query
- Cross-tenant selectors rejected (unless approved)
- RBAC scopes: `export:profile:manage`, `export:run`, `export:read`, `export:download`
### 10.2 Encryption
- Optional encryption per profile
- Keys derived from Authority-managed KMS
- Mirror encryption uses tenant-specific recipients
- Transport security (TLS) always required
### 10.3 Signing
- Cosign-compatible signatures
- SLSA Level 2 attestations by default
- Detached signatures stored alongside manifests
---
## 11. Related Documentation
| Resource | Location |
|----------|----------|
| Export Center architecture | `docs/modules/export-center/architecture.md` |
| Profile definitions | `docs/modules/export-center/profiles.md` |
| API reference | `docs/modules/export-center/api.md` |
| DevPortal bundle spec | `docs/modules/export-center/devportal-offline.md` |
---
## 12. Sprint Mapping
- **Primary Sprint:** SPRINT_0160_0001_0001_export_evidence.md
- **Related Sprints:**
- SPRINT_0161_0001_0001_evidencelocker.md
- SPRINT_0125_0001_0001_mirror.md
**Key Task IDs:**
- `EXPORT-CORE-40-001` - Profile system (DONE)
- `EXPORT-JSON-41-001` - JSON adapters (DONE)
- `EXPORT-TRIVY-42-001` - Trivy adapters (DONE)
- `EXPORT-OCI-45-001` - OCI distribution (IN PROGRESS)
- `EXPORT-DEV-46-001` - DevPortal adapter (TODO)
---
## 13. Success Metrics
| Metric | Target |
|--------|--------|
| Export reproducibility | 100% bit-identical |
| Bundle generation time | < 5 min for 100k records |
| Signature verification | 100% success rate |
| Distribution availability | 99.9% uptime |
| Retention compliance | 100% policy adherence |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,407 @@
# Findings Ledger and Immutable Audit Trail
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, ledger semantics, and implementation strategy for the Findings Ledger module, covering append-only events, Merkle anchoring, projections, and deterministic exports.
---
## 1. Executive Summary
The Findings Ledger provides **immutable, auditable records** of all vulnerability findings and their state transitions. Key capabilities:
- **Append-Only Events** - Every finding change recorded permanently
- **Merkle Anchoring** - Cryptographic proof of event ordering
- **Projections** - Materialized current state views
- **Deterministic Exports** - Reproducible compliance archives
- **Chain Integrity** - Hash-linked event sequences per tenant
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Ledger Requirements | Use Case |
|---------|---------------------|----------|
| **Compliance** | Immutable audit trail | SOC 2, FedRAMP evidence |
| **Security Teams** | Finding history | Investigation timelines |
| **Legal/eDiscovery** | Tamper-proof records | Litigation support |
| **Auditors** | Verifiable exports | Third-party attestation |
### 2.2 Competitive Positioning
Most vulnerability tools provide mutable databases. Stella Ops differentiates with:
- **Append-only architecture** ensuring no record deletion
- **Merkle trees** for cryptographic verification
- **Chain integrity** with hash-linked events
- **Deterministic exports** for reproducible audits
- **Air-gap support** with signed bundles
---
## 3. Event Model
### 3.1 Ledger Event Structure
```json
{
"id": "uuid",
"type": "finding.status.changed",
"tenant": "acme-corp",
"chainId": "chain-uuid",
"sequence": 12345,
"policyVersion": "sha256:abc...",
"finding": {
"id": "artifact:sha256:...|pkg:npm/lodash",
"artifactId": "sha256:...",
"vulnId": "CVE-2025-12345"
},
"actor": {
"id": "user:jane@acme.com",
"type": "human"
},
"occurredAt": "2025-11-29T12:00:00Z",
"recordedAt": "2025-11-29T12:00:01Z",
"payload": {
"previousStatus": "open",
"newStatus": "triaged",
"reason": "Under investigation"
},
"evidenceBundleRef": "bundle://tenant/2025/11/29/...",
"eventHash": "sha256:...",
"previousHash": "sha256:...",
"merkleLeafHash": "sha256:..."
}
```
### 3.2 Event Types
| Type | Trigger | Payload |
|------|---------|---------|
| `finding.discovered` | New finding | severity, purl, advisory |
| `finding.status.changed` | State transition | old/new status, reason |
| `finding.verdict.changed` | Policy decision | verdict, rules matched |
| `finding.vex.applied` | VEX override | status, justification |
| `finding.assigned` | Owner change | assignee, team |
| `finding.commented` | Annotation | comment text (redacted) |
| `finding.resolved` | Resolution | resolution type, version |
### 3.3 Chain Semantics
- Each tenant has one or more event chains
- Events are strictly ordered by sequence number
- `previousHash` links to prior event for integrity
- Chain forks are prohibited (409 on conflict)
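
These rules can be sketched with a small in-memory chain. Everything below is an assumption for illustration: the hash input layout, the all-zero genesis value, and the `Chain` class are hypothetical and not the Findings Ledger's actual hashing scheme.

```python
import hashlib
import json

GENESIS = "sha256:" + "0" * 64  # assumed placeholder for the first previousHash


def event_hash(body, previous_hash):
    """Hash the event body together with the hash of its predecessor."""
    material = json.dumps(
        {"body": body, "previousHash": previous_hash},
        sort_keys=True, separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


class Chain:
    def __init__(self):
        self.events = []

    def append(self, body, expected_sequence):
        next_sequence = len(self.events) + 1
        if expected_sequence != next_sequence:
            # A stale writer would fork the chain; an API would answer 409.
            raise ValueError("sequence conflict")
        previous = self.events[-1]["eventHash"] if self.events else GENESIS
        event = {
            "sequence": next_sequence,
            "body": body,
            "previousHash": previous,
            "eventHash": event_hash(body, previous),
        }
        self.events.append(event)
        return event

    def verify(self):
        """Recompute every link; any edited or reordered event breaks it."""
        previous = GENESIS
        for index, event in enumerate(self.events, start=1):
            if (event["sequence"] != index
                    or event["previousHash"] != previous
                    or event["eventHash"] != event_hash(event["body"], previous)):
                return False
            previous = event["eventHash"]
        return True
```

Because each hash covers the previous one, changing any historical event invalidates every later link.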
---
## 4. Merkle Anchoring
### 4.1 Tree Structure
```
              Root Hash
             /         \
      Hash(A+B)       Hash(C+D)
       /    \          /    \
   H(E1)  H(E2)    H(E3)  H(E4)
     |      |        |      |
  Event1  Event2  Event3  Event4
```
### 4.2 Anchoring Process
1. **Batch collection** - Events accumulate in windows (default 15 min)
2. **Tree construction** - Leaves are event hashes
3. **Root computation** - Merkle root represents batch
4. **Anchor record** - Root stored with timestamp
5. **Optional external** - Root can be published to external ledger
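
Steps 2 and 3 can be sketched as a bottom-up fold over the batch's event hashes. This is a minimal sketch under stated assumptions: an odd node is paired with itself, and no leaf/node domain separation is applied (RFC 6962, for example, prefixes leaves and interior nodes differently); the actual ledger scheme may differ.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    """Compute the root over a non-empty list of leaf hashes (bytes)."""
    if not leaf_hashes:
        raise ValueError("cannot anchor an empty batch")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed convention: duplicate the last node
        # Hash each adjacent pair to build the next level up.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Four event hashes, as in the diagram in 4.1.
leaves = [sha256(f"event-{n}".encode("utf-8")) for n in range(1, 5)]
root = merkle_root(leaves)
```

The root is what the anchor record stores; changing any event in the batch changes the root.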
### 4.3 Configuration
```yaml
findings:
  ledger:
    merkle:
      batchSize: 1000
      windowDuration: 00:15:00
      algorithm: sha256
      externalAnchor:
        enabled: false
        type: rekor # or custom
```
---
## 5. Projections
### 5.1 Purpose
Projections provide **current state** views derived from event history. They are:
- Materialized for fast queries
- Reconstructible from events
- Validated via `cycleHash`
### 5.2 Finding Projection
```json
{
"tenantId": "acme-corp",
"findingId": "artifact:sha256:...|pkg:npm/lodash@4.17.20",
"policyVersion": "sha256:5f38c...",
"status": "triaged",
"severity": 6.7,
"riskScore": 85.2,
"riskSeverity": "high",
"riskProfileVersion": "v2.1",
"labels": {
"kev": true,
"runtime": "exposed"
},
"currentEventId": "uuid",
"cycleHash": "sha256:...",
"policyRationale": [
"explain://tenant/findings/...",
"policy://tenant/policy-v1/rationale/accepted"
],
"updatedAt": "2025-11-29T12:00:00Z"
}
```
### 5.3 Projection Refresh
| Trigger | Action |
|---------|--------|
| New event | Incremental update |
| Policy change | Full recalculation |
| Manual request | On-demand rebuild |
| Scheduled | Periodic validation |
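
The relationship between incremental updates and a full rebuild can be sketched as a fold over events. The helpers below are hypothetical, handle only two event types, and assume that a newly discovered finding starts in status `open`; `cycleHash` computation is not modeled.

```python
def apply_event(projection, event):
    """Incremental update: fold one event into the current-state view."""
    updated = dict(projection)
    if event["type"] == "finding.discovered":
        updated["status"] = "open"  # assumed initial status
        updated["severity"] = event["payload"].get("severity")
    elif event["type"] == "finding.status.changed":
        updated["status"] = event["payload"]["newStatus"]
    updated["currentEventId"] = event["id"]
    return updated


def rebuild(events):
    """Full rebuild: replay every event in sequence order."""
    projection = {}
    for event in sorted(events, key=lambda e: e["sequence"]):
        projection = apply_event(projection, event)
    return projection


events = [
    {"id": "e1", "sequence": 1, "type": "finding.discovered",
     "payload": {"severity": 6.7}},
    {"id": "e2", "sequence": 2, "type": "finding.status.changed",
     "payload": {"previousStatus": "open", "newStatus": "triaged"}},
]
current = rebuild(events)
```

Periodic validation then amounts to comparing the materialized view with the result of a replay.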
---
## 6. Export Capabilities
### 6.1 Export Shapes
| Shape | Description | Use Case |
|-------|-------------|----------|
| `canonical` | Full event detail | Complete audit |
| `compact` | Summary fields only | Quick reports |
### 6.2 Export Types
**Findings Export:**
```json
{
"eventSequence": 12345,
"observedAt": "2025-11-29T12:00:00Z",
"findingId": "artifact:...|pkg:...",
"policyVersion": "sha256:...",
"status": "triaged",
"severity": 6.7,
"cycleHash": "sha256:...",
"evidenceBundleRef": "bundle://...",
"provenance": {
"policyVersion": "sha256:...",
"cycleHash": "sha256:...",
"ledgerEventHash": "sha256:..."
}
}
```
### 6.3 Export Formats
- **JSON** - Paged API responses
- **NDJSON** - Streaming exports
- **Bundle** - Signed archive packages
---
## 7. Implementation Strategy
### 7.1 Phase 1: Core Ledger (Complete)
- [x] Append-only event store
- [x] Hash-linked chains
- [x] Basic projection engine
- [x] REST API surface
### 7.2 Phase 2: Merkle & Exports (In Progress)
- [x] Merkle tree construction
- [x] Batch anchoring
- [ ] External anchor integration (LEDGER-MERKLE-50-001)
- [ ] Deterministic NDJSON exports (LEDGER-EXPORT-51-001)
### 7.3 Phase 3: Advanced Features (Planned)
- [ ] Chain integrity verification CLI
- [ ] Projection replay tooling
- [ ] Cross-tenant federation
- [ ] Long-term archival
---
## 8. API Surface
### 8.1 Events
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/v1/ledger/events` | GET | `vuln:audit` | List ledger events |
| `/v1/ledger/events` | POST | `vuln:operate` | Append event |
### 8.2 Projections
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/v1/ledger/projections/findings` | GET | `vuln:view` | List projections |
### 8.3 Exports
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/v1/ledger/export/findings` | GET | `vuln:audit` | Export findings |
| `/v1/ledger/export/vex` | GET | `vuln:audit` | Export VEX |
| `/v1/ledger/export/advisories` | GET | `vuln:audit` | Export advisories |
| `/v1/ledger/export/sboms` | GET | `vuln:audit` | Export SBOMs |
### 8.4 Attestations
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/v1/ledger/attestations` | GET | `vuln:audit` | List verifications |
---
## 9. Storage Model
### 9.1 Collections
| Collection | Purpose | Key Indexes |
|------------|---------|-------------|
| `ledger_events` | Append-only events | `{tenant, chainId, sequence}` |
| `ledger_chains` | Chain metadata | `{tenant, chainId}` |
| `ledger_merkle_roots` | Anchor records | `{tenant, batchId, anchoredAt}` |
| `finding_projections` | Current state | `{tenant, findingId}` |
### 9.2 Integrity Constraints
- Events are append-only (no update/delete)
- Sequence numbers are strictly increasing within a chain
- Hash chain validated on write
- Merkle roots immutable
---
## 10. Observability
### 10.1 Metrics
- `ledger.events.appended_total{tenant,type}`
- `ledger.events.rejected_total{reason}`
- `ledger.merkle.batches_total`
- `ledger.merkle.anchor_latency_seconds`
- `ledger.projection.updates_total`
- `ledger.projection.staleness_seconds`
- `ledger.export.rows_total{type,shape}`
### 10.2 SLO Targets
| Metric | Target |
|--------|--------|
| Event append latency | < 50ms p95 |
| Projection freshness | < 5 seconds |
| Merkle anchor window | 15 minutes |
| Export throughput | 10k rows/sec |
---
## 11. Security Considerations
### 11.1 Immutability Guarantees
- No UPDATE/DELETE operations exposed
- Admin override requires audit event
- Merkle roots provide tamper evidence
- External anchoring for non-repudiation
### 11.2 Access Control
- `vuln:view` - Read projections
- `vuln:investigate` - Triage actions
- `vuln:operate` - State transitions
- `vuln:audit` - Export and verify
### 11.3 Data Protection
- Sensitive payloads redacted in exports
- Comment text hashed, not stored
- PII filtered at ingest
- Tenant isolation enforced
---
## 12. Air-Gap Support
### 12.1 Offline Bundles
- Signed NDJSON exports
- Merkle proofs included
- Time anchors from trusted source
- Bundle verification CLI
### 12.2 Staleness Tracking
```yaml
airgap:
  staleness:
    warningThresholdDays: 7
    blockThresholdDays: 30
    riskCriticalExportsBlocked: true
```
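The two thresholds suggest a three-state evaluation. The sketch below is an assumption about how they might be applied: `staleness_state` is a hypothetical helper, thresholds are treated as inclusive, and `riskCriticalExportsBlocked` is not modeled.

```python
from datetime import datetime, timedelta, timezone


def staleness_state(last_sync, now, warning_days=7, block_days=30):
    """Classify data age against the warning and block thresholds."""
    age = now - last_sync
    if age >= timedelta(days=block_days):
        return "blocked"
    if age >= timedelta(days=warning_days):
        return "warning"
    return "fresh"


now = datetime(2025, 11, 29, tzinfo=timezone.utc)
state = staleness_state(now - timedelta(days=10), now)
```

Passing `now` explicitly keeps the function deterministic, which matters in air-gapped sites where the time anchor comes from a trusted source rather than the local clock.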
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Ledger schema | `docs/modules/findings-ledger/schema.md` |
| OpenAPI spec | `docs/modules/findings-ledger/openapi/` |
| Export guide | `docs/modules/findings-ledger/exports.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0186_0001_0001_record_deterministic_execution.md
- **Related Sprints:**
- SPRINT_0120_0000_0001_policy_reasoning.md
- SPRINT_311_docs_tasks_md_xi.md
**Key Task IDs:**
- `LEDGER-CORE-40-001` - Event store (DONE)
- `LEDGER-PROJ-41-001` - Projections (DONE)
- `LEDGER-MERKLE-50-001` - Merkle anchoring (IN PROGRESS)
- `LEDGER-EXPORT-51-001` - Deterministic exports (IN PROGRESS)
- `LEDGER-AIRGAP-56-001` - Bundle provenance (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Event durability | 100% (no data loss) |
| Chain integrity | 100% hash verification |
| Projection accuracy | 100% event replay match |
| Export determinism | 100% hash reproducibility |
| Audit compliance | SOC 2 Type II |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,331 @@
# Graph Analytics and Dependency Insights
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, graph model, and implementation strategy for the Graph module, covering dependency analysis, impact visualization, and offline exports.
---
## 1. Executive Summary
The Graph module provides **dependency analysis and impact visualization** across the vulnerability landscape. Key capabilities:
- **Unified Graph Model** - Artifacts, components, advisories, policies linked
- **Impact Analysis** - Blast radius, affected paths, transitive dependencies
- **Policy Overlays** - VEX and policy decisions visualized on graph
- **Analytics** - Clustering, centrality, community detection
- **Offline Export** - Deterministic graph snapshots for air-gap
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Graph Requirements | Use Case |
|---------|-------------------|----------|
| **Security Teams** | Impact analysis | Vulnerability prioritization |
| **Developers** | Dependency visualization | Upgrade planning |
| **Compliance** | Audit trails | Relationship documentation |
| **Management** | Risk dashboards | Portfolio risk view |
### 2.2 Competitive Positioning
Most vulnerability tools show flat lists. Stella Ops differentiates with:
- **Graph-native architecture** linking all entities
- **Impact visualization** showing blast radius
- **Policy overlays** embedding decisions in graph
- **Offline-compatible** exports for air-gap analysis
- **Analytics** for community detection and centrality
---
## 3. Graph Model
### 3.1 Node Types
| Node | Description | Key Properties |
|------|-------------|----------------|
| **Artifact** | Image/application digest | tenant, environment, labels |
| **Component** | Package version | purl, ecosystem, version |
| **File** | Source/binary path | hash, mtime |
| **License** | License identifier | spdx-id, restrictions |
| **Advisory** | Vulnerability record | cve-id, severity, sources |
| **VEXStatement** | VEX decision | status, justification |
| **PolicyVersion** | Signed policy pack | version, digest |
### 3.2 Edge Types
| Edge | From | To | Properties |
|------|------|-----|------------|
| `DEPENDS_ON` | Component | Component | scope, optional |
| `BUILT_FROM` | Artifact | Component | layer, path |
| `DECLARED_IN` | Component | File | sbom-id |
| `AFFECTED_BY` | Component | Advisory | version-range |
| `VEX_EXEMPTS` | VEXStatement | Advisory | justification |
| `GOVERNS_WITH` | PolicyVersion | Artifact | run-id |
| `OBSERVED_RUNTIME` | Artifact | Component | zastava-event-id |
### 3.3 Provenance
Every edge carries:
- `createdAt` - UTC timestamp
- `sourceDigest` - SRM/SBOM hash
- `provenanceRef` - Link to source document
---
## 4. Overlay System
### 4.1 Overlay Types
| Overlay | Purpose | Content |
|---------|---------|---------|
| `policy.overlay.v1` | Policy decisions | verdict, severity, rules |
| `openvex.v1` | VEX status | status, justification |
| `reachability.v1` | Runtime reachability | state, confidence |
| `clustering.v1` | Community detection | cluster-id, modularity |
| `centrality.v1` | Node importance | degree, betweenness |
### 4.2 Overlay Structure
```json
{
"overlayId": "sha256(tenant|nodeId|overlayKind)",
"overlayKind": "policy.overlay.v1",
"nodeId": "component:pkg:npm/lodash@4.17.21",
"tenant": "acme-corp",
"generatedAt": "2025-11-29T12:00:00Z",
"content": {
"verdict": "blocked",
"severity": "critical",
"rulesMatched": ["rule-001", "rule-002"],
"explainTrace": "sampled trace data..."
}
}
```
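The `overlayId` expression implies a deterministic identifier derived from three fields. The sketch below is one possible reading: the literal `|` separator, UTF-8 encoding, and `sha256:` prefix are assumptions, and `overlay_id` is a hypothetical helper.

```python
import hashlib


def overlay_id(tenant, node_id, overlay_kind):
    """Derive a stable ID from tenant, node, and overlay kind."""
    material = "|".join([tenant, node_id, overlay_kind])
    return "sha256:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


policy_overlay = overlay_id(
    "acme-corp", "component:pkg:npm/lodash@4.17.21", "policy.overlay.v1"
)
vex_overlay = overlay_id(
    "acme-corp", "component:pkg:npm/lodash@4.17.21", "openvex.v1"
)
```

Because the ID depends only on those three fields, regenerating an overlay for the same node and kind replaces the earlier one instead of creating a duplicate.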
---
## 5. Query Capabilities
### 5.1 Search API
```http
POST /graph/search
{
"tenant": "acme-corp",
"query": "severity:critical AND ecosystem:npm",
"nodeTypes": ["Component", "Advisory"],
"limit": 100
}
```
### 5.2 Path Query
```http
POST /graph/paths
{
"source": "artifact:sha256:abc123...",
"target": "advisory:CVE-2025-12345",
"maxDepth": 6,
"includeOverlays": true
}
```
**Response:**
```json
{
"paths": [
{
"nodes": ["artifact:sha256:...", "component:pkg:npm/...", "advisory:CVE-..."],
"edges": [{"type": "BUILT_FROM"}, {"type": "AFFECTED_BY"}],
"length": 2
}
],
"overlays": [
{"nodeId": "component:...", "overlayKind": "policy.overlay.v1", "content": {...}}
]
}
```
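To illustrate what a depth-bounded path query might do internally, here is a rough sketch of a breadth-first search over `(from, type, to)` edge triples. The in-memory edge list and the `find_paths` name are assumptions for illustration; the real service presumably queries `graph_edges` instead.

```python
from collections import deque


def find_paths(edges, source, target, max_depth=6):
    """Enumerate simple paths from source to target using at most max_depth edges."""
    adjacency = {}
    for from_id, edge_type, to_id in edges:
        adjacency.setdefault(from_id, []).append((edge_type, to_id))

    results = []
    queue = deque([([source], [])])  # (nodes visited so far, edge types taken)
    while queue:
        nodes, types = queue.popleft()
        if nodes[-1] == target:
            results.append({
                "nodes": nodes,
                "edges": [{"type": t} for t in types],
                "length": len(types),
            })
            continue
        if len(types) >= max_depth:
            continue  # depth budget exhausted on this branch
        for edge_type, next_id in adjacency.get(nodes[-1], []):
            if next_id not in nodes:  # keep paths simple (no cycles)
                queue.append((nodes + [next_id], types + [edge_type]))
    return results
```

Breadth-first order means shorter paths are reported first, which matches the response shape above where `length` counts edges.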
### 5.3 Diff Query
```bash
POST /graph/diff
{
"snapshotA": "snapshot-2025-11-28",
"snapshotB": "snapshot-2025-11-29",
"includeOverlays": true
}
```
---
## 6. Analytics Pipeline
### 6.1 Clustering
- **Algorithm:** Louvain community detection
- **Output:** Cluster IDs per node, modularity score
- **Use Case:** Identify tightly coupled component groups
### 6.2 Centrality
- **Degree centrality:** Most connected nodes
- **Betweenness centrality:** Critical path nodes
- **Use Case:** Identify high-impact components
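As a small illustration of the degree measure, the sketch below computes normalized degree centrality over an undirected view of the edges. The normalization by `n - 1` is the common textbook convention and an assumption here; the planned analytics worker may use a different one.

```python
from collections import Counter


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected view of (from, to) edges."""
    degree = Counter()
    for from_id, to_id in edges:
        degree[from_id] += 1
        degree[to_id] += 1
    n = len(degree)
    if n <= 1:
        return {node: 0.0 for node in degree}
    return {node: count / (n - 1) for node, count in degree.items()}
```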
### 6.3 Background Processing
```yaml
analytics:
  enabled: true
  schedule: "0 */6 * * *" # Every 6 hours
  algorithms:
    - clustering
    - centrality
  snapshotRetention: 30
```
---
## 7. Implementation Strategy
### 7.1 Phase 1: Core Model (Complete)
- [x] Node/edge schema
- [x] SBOM ingestion pipeline
- [x] Advisory/VEX linking
- [x] Basic search API
### 7.2 Phase 2: Overlays (In Progress)
- [x] Policy overlay generation
- [x] VEX overlay generation
- [ ] Reachability overlay (GRAPH-REACH-50-001)
- [ ] Inline overlay in query responses (GRAPH-QUERY-51-001)
### 7.3 Phase 3: Analytics (Planned)
- [ ] Clustering algorithm
- [ ] Centrality calculations
- [ ] Background worker
- [ ] Analytics overlays export
### 7.4 Phase 4: Visualization (Planned)
- [ ] Console graph viewer
- [ ] Impact tree visualization
- [ ] Diff visualization
---
## 8. API Surface
### 8.1 Core APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/graph/search` | POST | `graph:read` | Search nodes |
| `/graph/query` | POST | `graph:read` | Complex queries |
| `/graph/paths` | POST | `graph:read` | Path finding |
| `/graph/diff` | POST | `graph:read` | Snapshot diff |
| `/graph/nodes/{id}` | GET | `graph:read` | Node detail |
### 8.2 Export APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/graph/export` | POST | `graph:export` | Start export job |
| `/graph/export/{jobId}` | GET | `graph:read` | Job status |
| `/graph/export/{jobId}/download` | GET | `graph:export` | Download bundle |
---
## 9. Storage Model
### 9.1 Collections
| Collection | Purpose | Key Indexes |
|------------|---------|-------------|
| `graph_nodes` | Node records | `{tenant, nodeType, nodeId}` |
| `graph_edges` | Edge records | `{tenant, fromId, toId, edgeType}` |
| `graph_overlays` | Overlay data | `{tenant, nodeId, overlayKind}` |
| `graph_snapshots` | Point-in-time snapshots | `{tenant, snapshotId}` |
### 9.2 Export Format
```
graph-export/
├── nodes.jsonl # Sorted by nodeId
├── edges.jsonl # Sorted by (from, to, type)
├── overlays/
│   ├── policy.jsonl
│   ├── openvex.jsonl
│   └── manifest.json
└── manifest.json
```
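One way to get reproducible export files is to sort records and serialize them canonically before hashing. The following is a sketch under those assumptions (`export_jsonl` and the `sha256:` digest prefix are illustrative, not the actual exporter):

```python
import hashlib
import json


def export_jsonl(records, sort_key):
    """Serialize records as sorted, canonical JSON lines and return (bytes, digest)."""
    lines = [
        json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for record in sorted(records, key=sort_key)
    ]
    body = "".join(line + "\n" for line in lines).encode("utf-8")
    digest = "sha256:" + hashlib.sha256(body).hexdigest()
    return body, digest
```

With stable record ordering and stable key ordering, two exports of the same snapshot should yield byte-identical files, so the digest recorded in `manifest.json` can be re-checked offline.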
---
## 10. Observability
### 10.1 Metrics
- `graph_ingest_lag_seconds`
- `graph_nodes_total{nodeType}`
- `graph_edges_total{edgeType}`
- `graph_query_latency_seconds{queryType}`
- `graph_analytics_runs_total`
- `graph_analytics_clusters_total`
### 10.2 Offline Support
- Graph snapshots packaged for Offline Kit
- Deterministic NDJSON exports
- Overlay manifests with digests
---
## 11. Related Documentation
| Resource | Location |
|----------|----------|
| Graph architecture | `docs/modules/graph/architecture.md` |
| Query language | `docs/modules/graph/query-language.md` |
| Overlay specification | `docs/modules/graph/overlays.md` |
---
## 12. Sprint Mapping
- **Primary Sprint:** SPRINT_0141_0001_0001_graph_indexer.md
- **Related Sprints:**
- SPRINT_0401_0001_0001_reachability_evidence_chain.md
- SPRINT_0140_0001_0001_runtime_signals.md
**Key Task IDs:**
- `GRAPH-CORE-40-001` - Core model (DONE)
- `GRAPH-INGEST-41-001` - SBOM ingestion (DONE)
- `GRAPH-REACH-50-001` - Reachability overlay (IN PROGRESS)
- `GRAPH-ANALYTICS-55-001` - Clustering (TODO)
- `GRAPH-VIZ-60-001` - Visualization (FUTURE)
---
## 13. Success Metrics
| Metric | Target |
|--------|--------|
| Query latency | < 500ms p95 |
| Ingestion lag | < 5 minutes |
| Path query depth | Up to 6 hops |
| Export reproducibility | 100% deterministic |
| Analytics freshness | < 6 hours |
---
*Last updated: 2025-11-29*

@@ -0,0 +1,469 @@
# Notification Rules and Alerting Engine
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, rules engine semantics, and implementation strategy for the Notify module, covering channel connectors, throttling, digests, and delivery management.
---
## 1. Executive Summary
The Notify module provides **rules-driven, tenant-aware notification delivery** across security workflows. Key capabilities:
- **Rules Engine** - Declarative matchers for event routing
- **Multi-Channel Delivery** - Slack, Teams, Email, Webhooks
- **Noise Control** - Throttling, deduplication, digest windows
- **Approval Tokens** - DSSE-signed ack tokens for one-click workflows
- **Audit Trail** - Complete delivery history with redacted payloads
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Notification Requirements | Use Case |
|---------|--------------------------|----------|
| **Security Teams** | Real-time critical alerts | Incident response |
| **DevSecOps** | CI/CD integration | Pipeline notifications |
| **Compliance** | Audit trails | Delivery verification |
| **Management** | Digest summaries | Executive reporting |
### 2.2 Competitive Positioning
Most vulnerability tools offer only basic email alerts. Stella Ops differentiates with:
- **Rules-based routing** with fine-grained matchers
- **Native Slack/Teams integration** with rich formatting
- **Digest windows** to prevent alert fatigue
- **Cryptographic ack tokens** for approval workflows
- **Tenant isolation** with quota controls
---
## 3. Rules Engine
### 3.1 Rule Structure
```yaml
name: "critical-alerts-prod"
enabled: true
tenant: "acme-corp"
match:
  eventKinds:
    - "scanner.report.ready"
    - "scheduler.rescan.delta"
    - "zastava.admission"
  namespaces: ["prod-*"]
  repos: ["ghcr.io/acme/*"]
  minSeverity: "high"
  kev: true
  verdict: ["fail", "deny"]
  vex:
    includeRejectedJustifications: false
actions:
  - channel: "slack:sec-alerts"
    template: "concise"
    throttle: "5m"
  - channel: "email:soc"
    digest: "hourly"
    template: "detailed"
```
### 3.2 Matcher Types
| Matcher | Description | Example |
|---------|-------------|---------|
| `eventKinds` | Event type filter | `["scanner.report.ready"]` |
| `namespaces` | Namespace patterns | `["prod-*", "staging"]` |
| `repos` | Repository patterns | `["ghcr.io/acme/*"]` |
| `minSeverity` | Minimum severity | `"high"` |
| `kev` | KEV-tagged required | `true` |
| `verdict` | Report/admission verdict | `["fail", "deny"]` |
| `labels` | Kubernetes labels | `{"env": "production"}` |
### 3.3 Evaluation Order
1. **Tenant check** - Discard if rule tenant ≠ event tenant
2. **Kind filter** - Early discard for non-matching kinds
3. **Scope match** - Namespace/repo/label matching
4. **Delta gates** - Severity threshold evaluation
5. **VEX gate** - Filter based on VEX status
6. **Throttle/dedup** - Idempotency key check
7. **Actions** - Enqueue per-channel jobs
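The first gates of this evaluation order can be sketched as a short-circuiting predicate. This is only an illustration: the `maxSeverity` payload field, the severity ranking, and the use of shell-style globs for `namespaces`/`repos` are assumptions, and the VEX and throttle steps are left out.

```python
from fnmatch import fnmatchcase

# Assumed ordering; the source does not enumerate severity levels.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def rule_matches(rule, event):
    """Apply tenant, kind, scope, severity and verdict gates in order."""
    # 1. Tenant check
    if not rule.get("enabled", True) or rule["tenant"] != event["tenant"]:
        return False
    match = rule.get("match", {})
    # 2. Kind filter (cheap early discard)
    kinds = match.get("eventKinds")
    if kinds and event["kind"] not in kinds:
        return False
    # 3. Scope match on namespace and repo patterns
    scope = event.get("scope", {})
    for matcher, field in (("namespaces", "namespace"), ("repos", "repo")):
        patterns = match.get(matcher)
        if patterns and not any(fnmatchcase(scope.get(field, ""), p) for p in patterns):
            return False
    # 4. Severity and verdict gates (hypothetical payload.maxSeverity field)
    payload = event.get("payload", {})
    min_severity = match.get("minSeverity")
    if min_severity and SEVERITY_RANK[payload.get("maxSeverity", "low")] < SEVERITY_RANK[min_severity]:
        return False
    verdicts = match.get("verdict")
    if verdicts and payload.get("verdict") not in verdicts:
        return False
    return True
```

Ordering the cheap checks first keeps the per-event cost low when most rules belong to other tenants or event kinds.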
---
## 4. Channel Connectors
### 4.1 Built-in Channels
| Channel | Features | Rate Limits |
|---------|----------|-------------|
| **Slack** | Blocks, threads, reactions | 1 msg/sec per channel |
| **Teams** | Adaptive Cards, webhooks | 4 msgs/sec |
| **Email** | HTML+text, attachments | Relay-dependent |
| **Webhook** | JSON, HMAC signing | 10 req/sec |
### 4.2 Channel Configuration
```yaml
channels:
  - name: "slack:sec-alerts"
    type: slack
    config:
      channel: "#security-alerts"
      workspace: "acme-corp"
    secretRef: "ref://notify/slack-token"
  - name: "email:soc"
    type: email
    config:
      to: ["soc@acme.com"]
      from: "stellaops@acme.com"
      smtpHost: "smtp.acme.com"
    secretRef: "ref://notify/smtp-creds"
  - name: "webhook:siem"
    type: webhook
    config:
      url: "https://siem.acme.com/api/events"
      signMethod: "ed25519"
      signKeyRef: "ref://notify/webhook-key"
```
### 4.3 Connector Contract
```csharp
public interface INotifyConnector
{
    string Type { get; }
    Task<DeliveryResult> SendAsync(DeliveryContext ctx, CancellationToken ct);
    Task<HealthResult> HealthAsync(ChannelConfig cfg, CancellationToken ct);
}
```
---
## 5. Noise Control
### 5.1 Throttling
- **Per-action throttle** - Suppress duplicates within window
- **Idempotency key** - `hash(ruleId | actionId | event.kind | scope.digest | day)`
- **Configurable windows** - 5m, 15m, 1h, 1d
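The idempotency key and per-action throttle can be sketched as follows. The `|` separator, SHA-256, the UTC day bucket, and the in-memory `Throttle` class are assumptions for illustration; a real deployment would need shared state across workers.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def idempotency_key(rule_id, action_id, event_kind, scope_digest, when):
    """hash(ruleId | actionId | event.kind | scope.digest | day), with day taken in UTC."""
    day = when.astimezone(timezone.utc).strftime("%Y-%m-%d")
    material = "|".join((rule_id, action_id, event_kind, scope_digest, day))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class Throttle:
    """Suppress repeat sends of the same key inside a sliding window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_sent = {}

    def allow(self, key: str, now: datetime) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # duplicate within the window: suppress
        self._last_sent[key] = now
        return True
```

Note that suppressed attempts do not extend the window in this sketch, so a steady stream of duplicates still produces one message per window.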
### 5.2 Digest Windows
```yaml
actions:
  - channel: "email:weekly-summary"
    digest: "weekly"
    digestOptions:
      maxItems: 100
      groupBy: ["severity", "namespace"]
    template: "digest-summary"
```
**Behavior:**
- Coalesce events within window
- Summarize top N items with counts
- Flush on window close or max items
- Safe truncation with "and X more" links
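The coalescing behavior might look roughly like this: group events by the `groupBy` fields, keep the largest groups up to `maxItems`, and report how many groups were truncated. The flat event dictionaries and the `build_digest` name are assumptions for the sketch.

```python
from collections import Counter


def build_digest(events, group_by, max_items):
    """Coalesce events into counted groups, keeping the top max_items groups."""
    counts = Counter(tuple(str(event.get(field, "")) for field in group_by) for event in events)
    # Largest groups first; ties broken by group key so output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    shown = ranked[:max_items]
    items = [dict(zip(group_by, key), count=count) for key, count in shown]
    return {"items": items, "more": len(ranked) - len(shown)}
```

A template could then render `more` as the "and X more" link when it is non-zero.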
### 5.3 Quiet Hours
```yaml
notify:
  quietHours:
    enabled: true
    window: "22:00-06:00"
    timezone: "America/New_York"
    minSeverity: "critical"
```
During quiet hours, only alerts at or above `minSeverity` (here, critical) are delivered immediately; all others are deferred to digests.
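A quiet-hours window such as `22:00-06:00` wraps past midnight, which is easy to get wrong. The sketch below handles both wrapping and non-wrapping windows; the severity ranking and function names are assumptions, and the caller is expected to pass a `tzinfo` for the configured timezone (for example `zoneinfo.ZoneInfo("America/New_York")` where the tz database is available).

```python
from datetime import datetime, time, timedelta, timezone

# Assumed ordering; the source does not enumerate severity levels.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def in_quiet_hours(now: datetime, window: str, tz) -> bool:
    """True if `now` (timezone-aware) falls inside the "HH:MM-HH:MM" window in tz."""
    start_text, end_text = window.split("-")
    start, end = time.fromisoformat(start_text), time.fromisoformat(end_text)
    local = now.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window wraps past midnight


def deliver_now(now: datetime, severity: str, cfg: dict, tz) -> bool:
    """False means the alert should be deferred to a digest."""
    if not cfg["enabled"] or not in_quiet_hours(now, cfg["window"], tz):
        return True
    return SEVERITY_RANK[severity] >= SEVERITY_RANK[cfg["minSeverity"]]
```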
---
## 6. Templates & Rendering
### 6.1 Template Engine
- Handlebars-style safe templates
- No arbitrary code execution
- Deterministic outputs (stable property order)
- Locale-aware formatting
### 6.2 Template Variables
| Variable | Description |
|----------|-------------|
| `event.kind` | Event type |
| `event.ts` | Timestamp |
| `scope.namespace` | Kubernetes namespace |
| `scope.repo` | Repository |
| `scope.digest` | Image digest |
| `payload.verdict` | Policy verdict |
| `payload.delta.newCritical` | New critical count |
| `payload.links.ui` | UI deep link |
| `topFindings[]` | Top N findings |
### 6.3 Channel-Specific Rendering
**Slack:**
```json
{
"blocks": [
{"type": "header", "text": {"type": "plain_text", "text": "Policy FAIL: nginx:latest"}},
{"type": "section", "text": {"type": "mrkdwn", "text": "*2 critical*, 3 high vulnerabilities"}}
]
}
```
**Email:**
```html
<h2>Policy FAIL: nginx:latest</h2>
<table>
<tr><td>Critical</td><td>2</td></tr>
<tr><td>High</td><td>3</td></tr>
</table>
<a href="https://ui.internal/reports/...">View Details</a>
```
---
## 7. Ack Tokens
### 7.1 Token Structure
DSSE-signed tokens for one-click acknowledgements:
```json
{
"payloadType": "application/vnd.stellaops.notify-ack-token+json",
"payload": {
"tenant": "acme-corp",
"deliveryId": "delivery-123",
"notificationId": "notif-456",
"channel": "slack:sec-alerts",
"webhookUrl": "https://notify.internal/ack",
"nonce": "random-nonce",
"actions": ["acknowledge", "escalate"],
"expiresAt": "2025-11-29T13:00:00Z"
},
"signatures": [{"keyid": "notify-ack-key-01", "sig": "..."}]
}
```
### 7.2 Token Workflow
1. **Issue** - `POST /notify/ack-tokens/issue`
2. **Embed** - Token included in message action button
3. **Click** - User clicks button, token sent to webhook
4. **Verify** - `POST /notify/ack-tokens/verify`
5. **Audit** - Ack event recorded
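After the DSSE signature has been verified, the verify step presumably still has to check the token's claims. A minimal sketch of those checks, assuming single-use nonces tracked in a set and hypothetical result strings (signature verification itself is out of scope here):

```python
from datetime import datetime, timezone


def check_ack_claims(payload: dict, action: str, now: datetime, seen_nonces: set) -> str:
    """Validate expiry, allowed action and nonce reuse for an already-verified token."""
    expires_at = datetime.fromisoformat(payload["expiresAt"].replace("Z", "+00:00"))
    if now >= expires_at:
        return "expired"
    if action not in payload["actions"]:
        return "action-not-allowed"
    if payload["nonce"] in seen_nonces:
        return "replayed"
    seen_nonces.add(payload["nonce"])  # consume the nonce so a second click is rejected
    return "ok"
```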
### 7.3 Token Rotation
```bash
# Rotate ack token signing key
stella notify rotate-ack-key --key-source kms://notify/ack-key
```
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Engine (Complete)
- [x] Rules engine with matchers
- [x] Slack connector
- [x] Teams connector
- [x] Email connector
- [x] Webhook connector
### 8.2 Phase 2: Noise Control (Complete)
- [x] Throttling
- [x] Digest windows
- [x] Idempotency
- [x] Quiet hours
### 8.3 Phase 3: Ack Tokens (In Progress)
- [x] Token issuance
- [x] Token verification
- [ ] Token rotation API (NOTIFY-ACK-45-001)
- [ ] Escalation workflows (NOTIFY-ESC-46-001)
### 8.4 Phase 4: Advanced Features (Planned)
- [ ] PagerDuty connector
- [ ] Jira ticket creation
- [ ] In-app notifications
- [ ] Anomaly suppression
---
## 9. API Surface
### 9.1 Channels
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/channels` | GET/POST | `notify.read/admin` | List/create channels |
| `/api/v1/notify/channels/{id}` | GET/PATCH/DELETE | `notify.admin` | Manage channel |
| `/api/v1/notify/channels/{id}/test` | POST | `notify.admin` | Send test message |
| `/api/v1/notify/channels/{id}/health` | GET | `notify.read` | Health check |
### 9.2 Rules
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/rules` | GET/POST | `notify.read/admin` | List/create rules |
| `/api/v1/notify/rules/{id}` | GET/PATCH/DELETE | `notify.admin` | Manage rule |
| `/api/v1/notify/rules/{id}/test` | POST | `notify.admin` | Dry-run rule |
### 9.3 Deliveries
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/v1/notify/deliveries` | GET | `notify.read` | List deliveries |
| `/api/v1/notify/deliveries/{id}` | GET | `notify.read` | Delivery detail |
| `/api/v1/notify/deliveries/{id}/retry` | POST | `notify.admin` | Retry delivery |
---
## 10. Event Sources
### 10.1 Subscribed Events
| Event | Source | Typical Actions |
|-------|--------|-----------------|
| `scanner.scan.completed` | Scanner | Immediate/digest |
| `scanner.report.ready` | Scanner | Immediate |
| `scheduler.rescan.delta` | Scheduler | Immediate/digest |
| `attestor.logged` | Attestor | Immediate |
| `zastava.admission` | Zastava | Immediate |
| `concelier.export.completed` | Concelier | Digest |
| `excititor.export.completed` | Excititor | Digest |
### 10.2 Event Envelope
```json
{
"eventId": "uuid",
"kind": "scanner.report.ready",
"tenant": "acme-corp",
"ts": "2025-11-29T12:00:00Z",
"actor": "scanner-webservice",
"scope": {
"namespace": "production",
"repo": "ghcr.io/acme/api",
"digest": "sha256:..."
},
"payload": {
"reportId": "report-123",
"verdict": "fail",
"summary": {"total": 12, "blocked": 2},
"delta": {"newCritical": 1, "kev": ["CVE-2025-..."]}
}
}
```
---
## 11. Observability
### 11.1 Metrics
- `notify.events_consumed_total{kind}`
- `notify.rules_matched_total{ruleId}`
- `notify.throttled_total{reason}`
- `notify.digest_coalesced_total{window}`
- `notify.sent_total{channel}`
- `notify.failed_total{channel,code}`
- `notify.delivery_latency_seconds{channel}`
### 11.2 SLO Targets
| Metric | Target |
|--------|--------|
| Event-to-delivery p95 | < 60 seconds |
| Failure rate | < 0.5% per hour |
| Duplicate rate | < 0.01% |
---
## 12. Security Considerations
### 12.1 Secret Management
- Secrets stored as references only
- Just-in-time fetch at send time
- No plaintext secrets persisted in MongoDB
### 12.2 Webhook Signing
```
X-StellaOps-Signature: t=1764417600,v1=abc123...
X-StellaOps-Timestamp: 2025-11-29T12:00:00Z
```
- HMAC-SHA256 or Ed25519
- Replay window protection
- Canonical body hash
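For the HMAC-SHA256 variant, signing and verification could be sketched as below. The signed string (`"<t>.<sha256 hex of body>"`), the 300-second tolerance, and the function names are assumptions; only the `t=...,v1=...` header shape comes from the example above.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Build a t=<unix>,v1=<hex mac> header value over the timestamp and body hash."""
    body_hash = hashlib.sha256(body).hexdigest()
    signed = f"{timestamp}.{body_hash}".encode("utf-8")  # assumed signing string
    mac = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={mac}"


def verify(secret: bytes, header: str, body: bytes, now: int, tolerance: int = 300) -> bool:
    """Check the replay window first, then compare MACs in constant time."""
    try:
        parts = dict(part.split("=", 1) for part in header.split(","))
        timestamp = int(parts["t"])
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > tolerance:
        return False  # outside the replay window
    return hmac.compare_digest(sign(secret, timestamp, body), header)
```

Binding the timestamp into the MAC is what makes the replay window meaningful: an attacker cannot refresh `t` without invalidating `v1`.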
### 12.3 Loop Prevention
- Webhook target allowlist
- Event origin tags
- Own webhooks rejected
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Notify architecture | `docs/modules/notify/architecture.md` |
| Channel schemas | `docs/modules/notify/resources/schemas/` |
| Sample payloads | `docs/modules/notify/resources/samples/` |
| Bootstrap pack | `docs/modules/notify/bootstrap-pack.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0170_0001_0001_notify_engine.md (NEW)
- **Related Sprints:**
- SPRINT_0171_0001_0002_notify_connectors.md
- SPRINT_0172_0001_0003_notify_ack_tokens.md
**Key Task IDs:**
- `NOTIFY-ENGINE-40-001` - Rules engine (DONE)
- `NOTIFY-CONN-41-001` - Connectors (DONE)
- `NOTIFY-NOISE-42-001` - Throttling/digests (DONE)
- `NOTIFY-ACK-45-001` - Token rotation (IN PROGRESS)
- `NOTIFY-ESC-46-001` - Escalation workflows (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Delivery latency | < 60s p95 |
| Delivery success rate | > 99.5% |
| Duplicate rate | < 0.01% |
| Rule evaluation time | < 10ms |
| Channel health | 99.9% uptime |
---
*Last updated: 2025-11-29*

@@ -0,0 +1,432 @@
# Orchestrator Event Model and Job Lifecycle
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, job lifecycle semantics, and implementation strategy for the Orchestrator module, covering event models, quota governance, replay semantics, and TaskRunner bridge.
---
## 1. Executive Summary
The Orchestrator is the **central job coordination layer** for all Stella Ops asynchronous operations. Key capabilities:
- **Unified Job Lifecycle** - Enqueue, schedule, lease, complete with audit trail
- **Quota Governance** - Per-tenant rate limits, burst controls, circuit breakers
- **Replay Semantics** - Deterministic job replay for audit and recovery
- **TaskRunner Bridge** - Pack-run integration with heartbeats and artifacts
- **Event Fan-Out** - SSE/GraphQL feeds for dashboards and notifications
- **Offline Export** - Audit bundles for compliance and investigations
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Orchestration Requirements | Use Case |
|---------|---------------------------|----------|
| **Enterprise** | Rate limiting, quota management | Multi-team resource sharing |
| **MSP/MSSP** | Multi-tenant isolation | Managed security services |
| **Compliance Teams** | Audit trails, replay | SOC 2, FedRAMP evidence |
| **DevSecOps** | CI/CD integration, webhooks | Pipeline automation |
### 2.2 Competitive Positioning
Most vulnerability platforms lack sophisticated job orchestration. Stella Ops differentiates with:
- **Deterministic replay** for audit and debugging
- **Fine-grained quotas** per tenant/job-type
- **Circuit breakers** for automatic failure isolation
- **Native pack-run integration** for workflow automation
- **Offline-compatible** audit bundles
---
## 3. Job Lifecycle Model
### 3.1 State Machine
```
[Created] --> [Queued] --> [Leased] --> [Running] --> [Completed]
                 |            |             |             |
                 |            |             v             v
                 |            +-------> [Failed] <----[Canceled]
                 |                          |
                 v                          v
            [Throttled]                [Incident]
```
### 3.2 Lifecycle Phases
| Phase | Description | Transitions |
|-------|-------------|-------------|
| **Created** | Job request received | -> Queued |
| **Queued** | Awaiting scheduling | -> Leased, Throttled |
| **Throttled** | Rate limit applied | -> Queued (after delay) |
| **Leased** | Worker acquired job | -> Running, Expired |
| **Running** | Active execution | -> Completed, Failed, Canceled |
| **Completed** | Success, archived | Terminal |
| **Failed** | Error, may retry | -> Queued (retry), Incident |
| **Canceled** | Operator abort | Terminal |
| **Incident** | Escalated failure | Terminal (requires operator) |
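The transitions column can be encoded directly as a lookup table so that illegal state changes are rejected early. This is a sketch only; `Expired` is kept because the table lists it as a target of `Leased`, although it has no row of its own, and the function names are hypothetical.

```python
# Allowed transitions, copied from the lifecycle table.
TRANSITIONS = {
    "Created": {"Queued"},
    "Queued": {"Leased", "Throttled"},
    "Throttled": {"Queued"},
    "Leased": {"Running", "Expired"},
    "Running": {"Completed", "Failed", "Canceled"},
    "Completed": set(),
    "Failed": {"Queued", "Incident"},
    "Canceled": set(),
    "Incident": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True if the lifecycle table allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```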
### 3.3 Job Request Structure
```json
{
"jobId": "uuid",
"jobType": "scan|policy-run|export|pack-run|advisory-sync",
"tenant": "tenant-id",
"priority": "low|normal|high|emergency",
"payloadDigest": "sha256:...",
"payload": { "imageRef": "nginx:latest", "options": {} },
"dependencies": ["job-id-1", "job-id-2"],
"idempotencyKey": "unique-request-key",
"correlationId": "trace-id",
"requestedBy": "user-id|service-id",
"requestedAt": "2025-11-29T12:00:00Z"
}
```
---
## 4. Quota Governance
### 4.1 Quota Model
```yaml
quotas:
  - tenant: "acme-corp"
    jobType: "*"
    maxActive: 50
    maxPerHour: 500
    burst: 10
    priority:
      emergency:
        maxActive: 5
        skipQueue: true
  - tenant: "acme-corp"
    jobType: "export"
    maxActive: 4
    maxPerHour: 100
```
### 4.2 Rate Limit Enforcement
1. **Quota Check** - Before leasing, verify tenant hasn't exceeded limits
2. **Burst Control** - Allow short bursts within configured window
3. **Staging** - Jobs exceeding limits staged with `nextEligibleAt` timestamp
4. **Priority Bypass** - Emergency jobs can skip queue (with separate limits)
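The quota check and staging steps could be sketched as a function that either clears a job for leasing or returns the `nextEligibleAt` timestamp. Burst handling and priority bypass are omitted, and the 30-second retry delay for the `maxActive` case is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_eligible_at(quota, active_count, recent_starts, now, retry_delay=timedelta(seconds=30)):
    """Return None if the job may be leased now, else the time to stage it until."""
    window_start = now - timedelta(hours=1)
    in_window = sorted(t for t in recent_starts if t > window_start)
    if len(in_window) >= quota["maxPerHour"]:
        # Eligible once enough old starts have aged out of the one-hour window.
        index = len(in_window) - quota["maxPerHour"]
        return in_window[index] + timedelta(hours=1)
    if active_count >= quota["maxActive"]:
        return now + retry_delay  # assumed fixed back-off while slots are full
    return None
```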
### 4.3 Dynamic Controls
| Control | API | Purpose |
|---------|-----|---------|
| `pauseSource` | `POST /api/limits/pause` | Halt specific job sources |
| `resumeSource` | `POST /api/limits/resume` | Resume paused sources |
| `throttle` | `POST /api/limits/throttle` | Apply temporary throttle |
| `updateQuota` | `PATCH /api/quotas/{id}` | Modify quota limits |
### 4.4 Circuit Breakers
- Auto-pause job types when failure rate > threshold (default 50%)
- Incident events generated via Notify
- Half-open testing after cooldown period
- Manual reset via operator action
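The breaker behavior described above follows the usual closed / open / half-open pattern. A simplified sketch, assuming a minimum sample count before tripping and numeric timestamps in seconds (both are illustrative choices, not documented parameters):

```python
class CircuitBreaker:
    """Trip when the failure rate exceeds threshold; probe again after cooldown."""

    def __init__(self, threshold=0.5, min_samples=10, cooldown=60.0):
        self.threshold = threshold
        self.min_samples = min_samples
        self.cooldown = cooldown
        self.state = "closed"
        self.successes = 0
        self.failures = 0
        self.opened_at = None

    def allow(self, now: float) -> bool:
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"  # let a probe job through
        return self.state != "open"

    def record(self, success: bool, now: float) -> None:
        if self.state == "open":
            return  # ignore stragglers while paused
        if self.state == "half-open":
            if success:
                self.reset()
            else:
                self._trip(now)
            return
        if success:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_samples and self.failures / total > self.threshold:
            self._trip(now)

    def reset(self) -> None:
        """Manual or automatic reset back to closed."""
        self.state, self.successes, self.failures, self.opened_at = "closed", 0, 0, None

    def _trip(self, now: float) -> None:
        self.state, self.opened_at = "open", now
```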
---
## 5. TaskRunner Bridge
### 5.1 Pack-Run Integration
The Orchestrator provides specialized support for TaskRunner pack executions:
```json
{
"jobType": "pack-run",
"payload": {
"packId": "vuln-scan-and-report",
"packVersion": "1.2.0",
"planHash": "sha256:...",
"inputs": { "imageRef": "nginx:latest" },
"artifacts": [],
"logChannel": "sse:/runs/{runId}/logs",
"heartbeatCadence": 30
}
}
```
### 5.2 Heartbeat Protocol
- Workers send heartbeats every `heartbeatCadence` seconds
- Missed heartbeats trigger lease expiration
- Lease can be extended for long-running tasks
- Dead workers detected within 2x heartbeat interval
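Dead-worker detection then reduces to comparing each worker's last heartbeat against twice the cadence. A tiny sketch, assuming timestamps in seconds and an in-memory map of last heartbeats (the `expired_leases` name is hypothetical):

```python
def expired_leases(last_heartbeat: dict, cadence: float, now: float, grace_factor: float = 2.0):
    """Return worker IDs whose last heartbeat is older than grace_factor * cadence."""
    deadline = grace_factor * cadence
    return sorted(worker for worker, seen in last_heartbeat.items() if now - seen > deadline)
```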
### 5.3 Artifact & Log Streaming
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/runs/{runId}/logs` | SSE | Stream execution logs |
| `/runs/{runId}/artifacts` | GET | List produced artifacts |
| `/runs/{runId}/artifacts/{name}` | GET | Download artifact |
| `/runs/{runId}/heartbeat` | POST | Extend lease |
---
## 6. Event Model
### 6.1 Event Envelope
```json
{
"eventId": "uuid",
"eventType": "job.queued|job.leased|job.completed|job.failed",
"timestamp": "2025-11-29T12:00:00Z",
"tenant": "tenant-id",
"jobId": "job-id",
"jobType": "scan",
"correlationId": "trace-id",
"idempotencyKey": "unique-key",
"payload": {
"status": "completed",
"duration": 45.2,
"result": { "verdict": "pass" }
},
"provenance": {
"workerId": "worker-1",
"leaseId": "lease-id",
"taskRunnerId": "runner-1"
}
}
```
### 6.2 Event Types
| Event | Trigger | Consumers |
|-------|---------|-----------|
| `job.queued` | Job enqueued | Dashboard, Notify |
| `job.leased` | Worker acquired job | Dashboard |
| `job.started` | Execution began | Dashboard, Notify |
| `job.progress` | Progress update | Dashboard (SSE) |
| `job.completed` | Success | Dashboard, Notify, Export |
| `job.failed` | Error occurred | Dashboard, Notify, Incident |
| `job.canceled` | Operator abort | Dashboard, Notify |
| `job.replayed` | Replay initiated | Dashboard, Audit |
### 6.3 Fan-Out Channels
- **SSE** - Real-time dashboard feeds
- **GraphQL Subscriptions** - Console UI
- **Notify** - Alert routing based on rules
- **Webhooks** - External integrations
- **Audit Log** - Compliance storage
---
## 7. Replay Semantics
### 7.1 Deterministic Replay
Jobs can be replayed for audit, debugging, or recovery:
```bash
# Replay a completed job
stella job replay --id job-12345
# Replay with sealed mode (offline verification)
stella job replay --id job-12345 --sealed --bundle output.tar.gz
```
### 7.2 Replay Guarantees
| Property | Guarantee |
|----------|-----------|
| **Input preservation** | Same payloadDigest, cursors |
| **Ordering** | Same processing order |
| **Determinism** | Same outputs for same inputs |
| **Provenance** | `replayOf` pointer to original |
### 7.3 Replay Record
```json
{
"jobId": "replay-job-id",
"replayOf": "original-job-id",
"priority": "high",
"reason": "audit-verification",
"requestedBy": "auditor@example.com",
"cursors": {
"advisory": "cursor-abc",
"vex": "cursor-def"
}
}
```
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Lifecycle (Complete)
- [x] Job state machine
- [x] MongoDB queue with leasing
- [x] Basic quota enforcement
- [x] Dashboard SSE feeds
### 8.2 Phase 2: Pack-Run Bridge (In Progress)
- [x] Pack-run job type registration
- [x] Log/artifact streaming
- [ ] Heartbeat protocol (ORCH-PACK-37-001)
- [ ] Event envelope finalization (ORCH-SVC-37-101)
### 8.3 Phase 3: Advanced Controls (Planned)
- [ ] Circuit breaker automation
- [ ] Quota analytics dashboard
- [ ] Replay verification tooling
- [ ] Incident mode integration
---
## 9. API Surface
### 9.1 Job Management
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/jobs` | GET | `orch:read` | List jobs with filters |
| `/api/jobs/{id}` | GET | `orch:read` | Job detail |
| `/api/jobs/{id}/cancel` | POST | `orch:operate` | Cancel job |
| `/api/jobs/{id}/replay` | POST | `orch:operate` | Schedule replay |
### 9.2 Quota Management
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/quotas` | GET | `orch:read` | List quotas |
| `/api/quotas/{id}` | PATCH | `orch:quota` | Update quota |
| `/api/limits/throttle` | POST | `orch:quota` | Apply throttle |
| `/api/limits/pause` | POST | `orch:quota` | Pause source |
| `/api/limits/resume` | POST | `orch:quota` | Resume source |
### 9.3 Dashboard
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/dashboard/metrics` | GET | `orch:read` | Aggregated metrics |
| `/api/dashboard/events` | SSE | `orch:read` | Real-time events |
---
## 10. Storage Model
### 10.1 Collections
| Collection | Purpose | Key Fields |
|------------|---------|------------|
| `jobs` | Current job state | `_id`, `tenant`, `jobType`, `status`, `priority` |
| `job_history` | Append-only audit | `jobId`, `event`, `timestamp`, `actor` |
| `sources` | Job sources registry | `sourceId`, `tenant`, `status` |
| `quotas` | Quota definitions | `tenant`, `jobType`, `limits` |
| `throttles` | Active throttles | `tenant`, `source`, `until` |
| `incidents` | Escalated failures | `jobId`, `reason`, `status` |
### 10.2 Indexes
- `{tenant, jobType, status}` on `jobs`
- `{tenant, status, startedAt}` on `jobs`
- `{jobId, timestamp}` on `job_history`
- TTL index on transient lease records
---
## 11. Observability
### 11.1 Metrics
- `job_queue_depth{jobType,tenant}`
- `job_latency_seconds{jobType,phase}`
- `job_failures_total{jobType,reason}`
- `job_retry_total{jobType}`
- `lease_extensions_total{jobType}`
- `quota_exceeded_total{tenant}`
- `circuit_breaker_state{jobType}`
### 11.2 Pack-Run Metrics
- `pack_run_logs_stream_lag_seconds`
- `pack_run_heartbeats_total`
- `pack_run_artifacts_total`
- `pack_run_duration_seconds`
---
## 12. Offline Support
### 12.1 Audit Bundle Export
```bash
stella orch export --tenant acme-corp --since 2025-11-01 --output audit-bundle.tar.gz
```
Bundle contents:
- `jobs.jsonl` - Job records
- `history.jsonl` - State transitions
- `throttles.jsonl` - Throttle events
- `manifest.json` - Bundle metadata
- `signatures/` - DSSE signatures
### 12.2 Replay Verification
```bash
# Verify job determinism
stella job verify --bundle audit-bundle.tar.gz --job-id job-12345
```
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Orchestrator architecture | `docs/modules/orchestrator/architecture.md` |
| Event envelope spec | `docs/modules/orchestrator/event-envelope.md` |
| TaskRunner integration | `docs/modules/taskrunner/orchestrator-bridge.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0151_0001_0001_orchestrator_i.md
- **Related Sprints:**
- SPRINT_0152_0001_0002_orchestrator_ii.md
- SPRINT_0153_0001_0003_orchestrator_iii.md
- SPRINT_0157_0001_0001_taskrunner_i.md
**Key Task IDs:**
- `ORCH-CORE-30-001` - Job lifecycle (DONE)
- `ORCH-QUOTA-31-001` - Quota governance (DONE)
- `ORCH-PACK-37-001` - Pack-run bridge (IN PROGRESS)
- `ORCH-SVC-37-101` - Event envelope (IN PROGRESS)
- `ORCH-REPLAY-38-001` - Replay verification (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Job scheduling latency | < 100ms p99 |
| Lease acquisition time | < 50ms p99 |
| Event fan-out delay | < 500ms |
| Quota enforcement accuracy | 100% |
| Replay determinism | 100% match |
---
*Last updated: 2025-11-29*

@@ -0,0 +1,394 @@
# Policy Simulation and Shadow Gates
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, simulation semantics, and implementation strategy for Policy Engine simulation features, covering shadow runs, coverage fixtures, and promotion gates.
---
## 1. Executive Summary
Policy simulation enables **safe testing of policy changes** before production deployment. Key capabilities:
- **Shadow Runs** - Execute policies without enforcement
- **Diff Summaries** - Compare old vs new policy outcomes
- **Coverage Fixtures** - Validate expected findings
- **Promotion Gates** - Block promotion until tests pass
- **Deterministic Replay** - Reproduce simulation results
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Simulation Requirements | Use Case |
|---------|------------------------|----------|
| **Policy Authors** | Preview changes | Development workflow |
| **Security Leads** | Approve promotions | Change management |
| **Compliance** | Audit trail | Policy change evidence |
| **DevSecOps** | CI integration | Automated testing |
### 2.2 Competitive Positioning
Most vulnerability tools lack policy simulation. Stella Ops differentiates with:
- **Shadow execution** without production impact
- **Diff visualization** of policy changes
- **Coverage testing** with fixture validation
- **Promotion gates** for governance
- **Deterministic replay** for audit
---
## 3. Simulation Modes
### 3.1 Shadow Run
Execute policy against real data without enforcement:
```bash
stella policy simulate \
--policy my-policy:v2 \
--scope "tenant:acme-corp,namespace:production" \
--shadow
```
**Behavior:**
- Evaluates all findings
- Records verdicts to shadow collections
- No enforcement actions
- No notifications triggered
- Metrics tagged with `shadow=true`
### 3.2 Diff Run
Compare two policy versions:
```bash
stella policy diff \
--old my-policy:v1 \
--new my-policy:v2 \
--scope "tenant:acme-corp"
```
**Output:**
```json
{
"summary": {
"added": 12,
"removed": 5,
"changed": 8,
"unchanged": 234
},
"changes": [
{
"findingId": "finding-123",
"cve": "CVE-2025-12345",
"oldVerdict": "warned",
"newVerdict": "blocked",
"reason": "rule 'critical-cves' now matches"
}
]
}
```
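A diff summary of this shape can be derived from two maps of finding ID to verdict. In the sketch below, "added" and "removed" are interpreted as findings present in only one of the two runs; that reading, and the `diff_verdicts` name, are assumptions.

```python
def diff_verdicts(old: dict, new: dict) -> dict:
    """Compare {findingId: verdict} maps from two policy versions."""
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    common = old.keys() & new.keys()
    changed = sorted(key for key in common if old[key] != new[key])
    return {
        "summary": {
            "added": len(added),
            "removed": len(removed),
            "changed": len(changed),
            "unchanged": len(common) - len(changed),
        },
        "changes": [
            {"findingId": key, "oldVerdict": old[key], "newVerdict": new[key]}
            for key in changed
        ],
    }
```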
### 3.3 Coverage Run
Validate policy against fixture expectations:
```bash
stella policy coverage \
--policy my-policy:v2 \
--fixtures fixtures/policy-tests.yaml
```
---
## 4. Coverage Fixtures
### 4.1 Fixture Format
```yaml
apiVersion: stellaops.io/policy-test.v1
kind: PolicyFixture
metadata:
  name: critical-cve-blocking
  policy: my-policy
fixtures:
  - name: "Block critical CVE in production"
    input:
      finding:
        cve: "CVE-2025-12345"
        severity: critical
        ecosystem: npm
        component: "lodash@4.17.20"
      context:
        namespace: production
        labels:
          tier: frontend
    expected:
      verdict: blocked
      rulesMatched: ["critical-cves", "production-strict"]
  - name: "Warn on high CVE in staging"
    input:
      finding:
        cve: "CVE-2025-12346"
        severity: high
        ecosystem: npm
    expected:
      verdict: warned
  - name: "Ignore low CVE with VEX"
    input:
      finding:
        cve: "CVE-2025-12347"
        severity: low
        vexStatus: not_affected
        vexJustification: "component_not_present"
    expected:
      verdict: ignored
```
### 4.2 Fixture Results
```json
{
"total": 25,
"passed": 23,
"failed": 2,
"failures": [
{
"fixture": "Block critical CVE in production",
"expected": {"verdict": "blocked"},
"actual": {"verdict": "warned"},
"diff": "rule 'critical-cves' did not match due to missing label"
}
]
}
```
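A coverage runner producing this result shape might look like the sketch below. It assumes fixtures have already been parsed into dictionaries and that `evaluate` is a caller-supplied function standing in for the policy engine; only the keys listed under `expected` are compared, and list values such as `rulesMatched` are compared order-sensitively.

```python
def run_coverage(fixtures, evaluate):
    """Run each fixture through evaluate() and collect mismatches."""
    failures = []
    for fixture in fixtures:
        actual = evaluate(fixture["input"])
        expected = fixture["expected"]
        observed = {key: actual.get(key) for key in expected}
        if observed != expected:
            failures.append({
                "fixture": fixture["name"],
                "expected": expected,
                "actual": observed,
            })
    total = len(fixtures)
    return {
        "total": total,
        "passed": total - len(failures),
        "failed": len(failures),
        "failures": failures,
    }
```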
---
## 5. Promotion Gates
### 5.1 Gate Requirements
Before a policy can be promoted from draft to active:
| Gate | Requirement | Enforcement |
|------|-------------|-------------|
| Shadow Run | Complete without errors | Required |
| Coverage | 100% fixtures pass | Required |
| Diff Review | Changes reviewed | Optional |
| Approval | Human sign-off | Configurable |
### 5.2 Promotion Workflow
```mermaid
stateDiagram-v2
[*] --> Draft
Draft --> Shadow: Start shadow run
Shadow --> Coverage: Run coverage tests
Coverage --> Review: Pass fixtures
Review --> Approval: Review diff
Approval --> Active: Approve
Coverage --> Draft: Fix failures
Approval --> Draft: Reject
```
### 5.3 CLI Commands
```bash
# Start shadow run
stella policy promote start --policy my-policy:v2
# Check promotion status
stella policy promote status --policy my-policy:v2
# Complete promotion (requires approval)
stella policy promote complete --policy my-policy:v2 --comment "Reviewed and approved"
```
---
## 6. Determinism Requirements
### 6.1 Simulation Guarantees
| Property | Guarantee |
|----------|-----------|
| Input ordering | Stable sort by (tenant, policyId, findingKey) |
| Rule evaluation | First-match semantics |
| Timestamp handling | Injected TimeProvider |
| Random values | Injected IRandom |
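
The first two guarantees can be illustrated in a few lines. This is a sketch under assumptions: findings are dictionaries carrying `tenant`, `policyId`, and `findingKey`, and a rule is a hypothetical `(name, predicate)` pair evaluated in declaration order.

```python
def order_findings(findings):
    """Sort by (tenant, policyId, findingKey) so arrival order never matters."""
    return sorted(findings, key=lambda f: (f["tenant"], f["policyId"], f["findingKey"]))


def first_match(rules, finding):
    """First-match semantics: the earliest rule whose predicate holds wins."""
    for rule_name, predicate in rules:
        if predicate(finding):
            return rule_name
    return None  # no rule matched
```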
### 6.2 Replay Hash
Each simulation computes:
```
determinismHash = SHA256(policyVersion + inputsHash + rulesHash)
```
Replays with the same hash must produce identical results.
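
A literal reading of the formula can be sketched as follows. The source does not specify an encoding or a separator, so this example assumes UTF-8 strings concatenated directly; `determinism_hash` is a hypothetical helper name.

```python
import hashlib


def determinism_hash(policy_version, inputs_hash, rules_hash):
    """SHA-256 over the three parts, concatenated exactly as the formula reads."""
    # Assumption: parts are UTF-8 strings joined with no delimiter.
    material = (policy_version + inputs_hash + rules_hash).encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

Plain concatenation is ambiguous when the parts have variable length (`"ab" + "c"` and `"a" + "bc"` hash identically), so an implementation would likely want fixed-length inputs or an explicit delimiter.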
---
## 7. Implementation Strategy
### 7.1 Phase 1: Shadow Runs (Complete)
- [x] Shadow collection isolation
- [x] Shadow metrics tagging
- [x] Shadow run API
- [x] CLI integration
### 7.2 Phase 2: Diff & Coverage (In Progress)
- [x] Policy diff algorithm
- [x] Diff visualization
- [ ] Coverage fixture parser (POLICY-COV-50-001)
- [ ] Coverage runner (POLICY-COV-50-002)
### 7.3 Phase 3: Promotion Gates (Planned)
- [ ] Gate configuration schema
- [ ] Promotion state machine
- [ ] Approval workflow integration
- [ ] Console UI for review
---
## 8. API Surface
### 8.1 Simulation APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/policy/simulate` | POST | `policy:simulate` | Start simulation |
| `/api/policy/simulate/{id}` | GET | `policy:read` | Get simulation status |
| `/api/policy/simulate/{id}/results` | GET | `policy:read` | Get results |
### 8.2 Diff APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/policy/diff` | POST | `policy:read` | Compare versions |
### 8.3 Coverage APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/policy/coverage` | POST | `policy:simulate` | Run coverage |
| `/api/policy/coverage/{id}` | GET | `policy:read` | Get results |
### 8.4 Promotion APIs
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/api/policy/promote` | POST | `policy:promote` | Start promotion |
| `/api/policy/promote/{id}` | GET | `policy:read` | Get status |
| `/api/policy/promote/{id}/approve` | POST | `policy:approve` | Approve promotion |
| `/api/policy/promote/{id}/reject` | POST | `policy:approve` | Reject promotion |
---
## 9. Storage Model
### 9.1 Collections
| Collection | Purpose |
|------------|---------|
| `policy_simulations` | Simulation records |
| `policy_simulation_results` | Per-finding results |
| `policy_coverage_runs` | Coverage executions |
| `policy_promotions` | Promotion state |
### 9.2 Shadow Isolation
Shadow results are stored in separate collections:
- `effective_finding_{policyId}_shadow`
- Never mixed with production data
- TTL-based cleanup (default 7 days)
---
## 10. Observability
### 10.1 Metrics
- `policy_simulation_duration_seconds{mode}`
- `policy_coverage_pass_rate{policy}`
- `policy_promotion_gate_status{gate,status}`
- `policy_diff_changes_total{changeType}`
### 10.2 Audit Events
- `policy.simulation.started`
- `policy.simulation.completed`
- `policy.coverage.passed`
- `policy.coverage.failed`
- `policy.promotion.approved`
- `policy.promotion.rejected`
---
## 11. Console Integration
### 11.1 Policy Editor
- Inline simulation button
- Real-time diff preview
- Coverage status badge
### 11.2 Promotion Dashboard
- Pending promotions list
- Gate status visualization
- Approval/reject actions
---
## 12. Related Documentation
| Resource | Location |
|----------|----------|
| Policy architecture | `docs/modules/policy/architecture.md` |
| DSL reference | `docs/policy/dsl.md` |
| Lifecycle guide | `docs/policy/lifecycle.md` |
| Runtime guide | `docs/policy/runtime.md` |
---
## 13. Sprint Mapping
- **Primary Sprint:** SPRINT_0185_0001_0001_policy_simulation.md (NEW)
- **Related Sprints:**
  - SPRINT_0120_0000_0001_policy_reasoning.md
  - SPRINT_0121_0001_0001_policy_reasoning.md
**Key Task IDs:**
- `POLICY-SIM-40-001` - Shadow runs (DONE)
- `POLICY-DIFF-41-001` - Diff algorithm (DONE)
- `POLICY-COV-50-001` - Coverage fixtures (IN PROGRESS)
- `POLICY-COV-50-002` - Coverage runner (IN PROGRESS)
- `POLICY-PROM-55-001` - Promotion gates (TODO)
---
## 14. Success Metrics
| Metric | Target |
|--------|--------|
| Simulation latency | < 2 min (10k findings) |
| Coverage accuracy | 100% fixture matching |
| Promotion gate enforcement | 100% adherence |
| Shadow isolation | Zero production leakage |
| Replay determinism | 100% hash match |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,444 @@
# Runtime Posture and Observation with Zastava
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, observation model, and implementation strategy for the Zastava module, covering runtime inspection, admission control, drift detection, and posture verification.
---
## 1. Executive Summary
Zastava is the **runtime inspector and enforcer** that provides ground truth from running environments. Key capabilities:
- **Runtime Observation** - Inventory containers, track entrypoints, monitor loaded DSOs
- **Admission Control** - Kubernetes ValidatingAdmissionWebhook for pre-flight gates
- **Drift Detection** - Identify unexpected processes, libraries, and file changes
- **Posture Verification** - Validate signatures, SBOM referrers, attestations
- **Build-ID Tracking** - Correlate binaries to debug symbols and source
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Runtime Requirements | Use Case |
|---------|---------------------|----------|
| **Enterprise Security** | Runtime visibility | Post-deploy monitoring |
| **Platform Engineering** | Admission gates | Policy enforcement |
| **Compliance Teams** | Continuous verification | Runtime attestation |
| **DevSecOps** | Drift detection | Configuration management |
### 2.2 Competitive Positioning
Most vulnerability scanners focus on build-time analysis. Stella Ops differentiates with:
- **Runtime ground-truth** from actual container execution
- **DSO tracking** - which libraries are actually loaded
- **Entrypoint tracing** - what programs actually run
- **Native Kubernetes admission** with policy integration
- **Build-ID correlation** for symbol resolution
---
## 3. Architecture Overview
### 3.1 Component Topology
**Kubernetes Deployment:**
```
stellaops/zastava-observer # DaemonSet on every node (read-only host mounts)
stellaops/zastava-webhook # ValidatingAdmissionWebhook (Deployment, 2+ replicas)
```
**Docker/VM Deployment:**
```
stellaops/zastava-agent # System service; watch Docker events; observer only
```
### 3.2 Dependencies
| Dependency | Purpose |
|------------|---------|
| Authority | OpToks (DPoP/mTLS) for API calls |
| Scanner.WebService | Event ingestion, policy decisions |
| OCI Registry | Referrer/signature checks |
| Container Runtime | containerd/CRI-O/Docker interfaces |
| Kubernetes API | Pod watching, admission webhook |
---
## 4. Runtime Event Model
### 4.1 Event Types
| Kind | Trigger | Payload |
|------|---------|---------|
| `CONTAINER_START` | Container lifecycle | Image, entrypoint, namespace |
| `CONTAINER_STOP` | Container termination | Exit code, duration |
| `DRIFT` | Unexpected change | Changed files, new binaries |
| `POLICY_VIOLATION` | Rule breach | Reason, severity |
| `ATTESTATION_STATUS` | Verification result | Signed, SBOM present |
### 4.2 Event Envelope
```json
{
  "eventId": "uuid",
  "when": "2025-11-29T12:00:00Z",
  "kind": "CONTAINER_START",
  "tenant": "acme-corp",
  "node": "worker-node-01",
  "runtime": {
    "engine": "containerd",
    "version": "1.7.19"
  },
  "workload": {
    "platform": "kubernetes",
    "namespace": "production",
    "pod": "api-7c9fbbd8b7-ktd84",
    "container": "api",
    "containerId": "containerd://abc123...",
    "imageRef": "ghcr.io/acme/api@sha256:def456...",
    "owner": {
      "kind": "Deployment",
      "name": "api"
    }
  },
  "process": {
    "pid": 12345,
    "entrypoint": ["/entrypoint.sh", "--serve"],
    "entryTrace": [
      {"file": "/entrypoint.sh", "line": 3, "op": "exec", "target": "/usr/bin/python3"},
      {"file": "<argv>", "op": "python", "target": "/opt/app/server.py"}
    ],
    "buildId": "9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1"
  },
  "loadedLibs": [
    {"path": "/lib/x86_64-linux-gnu/libssl.so.3", "inode": 123456, "sha256": "..."},
    {"path": "/usr/lib/x86_64-linux-gnu/libcrypto.so.3", "inode": 123457, "sha256": "..."}
  ],
  "posture": {
    "imageSigned": true,
    "sbomReferrer": "present",
    "attestation": {
      "uuid": "rekor-uuid",
      "verified": true
    }
  }
}
```
---
## 5. Observer Capabilities
### 5.1 Container Lifecycle Tracking
- Watch container start/stop via CRI socket
- Resolve container to image digest
- Map mount points and rootfs paths
- Track container metadata (labels, annotations)
### 5.2 Entrypoint Tracing
- Attach short-lived nsenter to container PID 1
- Parse shell scripts for exec chain
- Record terminal program (actual binary)
- Bounded depth to prevent infinite loops
### 5.3 Loaded Library Sampling
- Read `/proc/<pid>/maps` for loaded DSOs
- Compute SHA-256 for each mapped file
- Track GNU build-IDs for symbol correlation
- Rate limits prevent resource exhaustion
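
The first step of the sampling described above, listing DSOs from `/proc/<pid>/maps`, can be sketched as a pure text parser. This is illustrative only: `loaded_dsos` is a hypothetical name, and two filters are assumptions of this sketch (keep executable mappings only, and treat a file as a DSO when its basename contains `.so`). Hashing and build-ID extraction are left out.

```python
def loaded_dsos(maps_text, max_libs=256):
    """Extract distinct shared-object paths from /proc/<pid>/maps content."""
    seen = {}
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        perms, inode, path = fields[1], fields[4], fields[5].strip()
        if not path.startswith("/"):
            continue  # pseudo-entries such as [heap] or [stack]
        if "x" not in perms:
            continue  # assumption: only executable mappings count as loaded code
        if ".so" not in path.rsplit("/", 1)[-1]:
            continue  # assumption: DSOs are recognised by name
        seen.setdefault(path, int(inode))
        if len(seen) >= max_libs:
            break  # bounded work per container, in the spirit of `maxLibs`
    return [{"path": path, "inode": inode} for path, inode in seen.items()]
```

Parsing text rather than opening the file directly keeps the function easy to test; a caller would read the maps file under the configured procfs root and pass its content in.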
### 5.4 Posture Verification
- Image signature presence (cosign policies)
- SBOM referrers check (registry HEAD)
- Rekor attestation lookup via Scanner.WebService
- Policy verdict from backend
---
## 6. Admission Control
### 6.1 Gate Criteria
| Criterion | Description | Configurable |
|-----------|-------------|--------------|
| Image Signature | Cosign-verifiable to configured keys | Yes |
| SBOM Availability | CycloneDX referrer or catalog entry | Yes |
| Policy Verdict | Backend PASS required | Yes |
| Registry Allowlist | Permitted registries | Yes |
| Tag Bans | Reject `:latest`, etc. | Yes |
| Base Image Allowlist | Permitted base digests | Yes |
### 6.2 Decision Flow
```mermaid
sequenceDiagram
participant K8s as API Server
participant WH as Zastava Webhook
participant SW as Scanner.WebService
K8s->>WH: AdmissionReview(Pod)
WH->>WH: Resolve images to digests
WH->>SW: POST /policy/runtime
SW-->>WH: {signed, hasSbom, verdict, reasons}
alt All pass
WH-->>K8s: Allow
else Any fail
WH-->>K8s: Deny (with reasons)
end
```
### 6.3 Response Caching
- Per-digest results cached for TTL (default 300s)
- Fail-open or fail-closed per namespace
- Cache invalidation on policy updates
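
The three caching rules above can be sketched together. This is a minimal illustration, not the webhook's implementation: `DecisionCache`, `decide`, and the `"allow"`/`"deny"` strings are hypothetical, and the backend call is passed in as a plain function.

```python
import time


class DecisionCache:
    """Per-digest decision cache with a fixed TTL."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}

    def get(self, digest):
        entry = self._entries.get(digest)
        if entry is None:
            return None
        stored_at, decision = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[digest]  # expired
            return None
        return decision

    def put(self, digest, decision):
        self._entries[digest] = (self._clock(), decision)

    def invalidate_all(self):
        """Call when policy changes so stale verdicts are not reused."""
        self._entries.clear()


def decide(digest, namespace, cache, query_backend, fail_open_namespaces=()):
    """Return a decision for an image digest, honouring cache and failure mode."""
    cached = cache.get(digest)
    if cached is not None:
        return cached
    try:
        decision = query_backend(digest)
    except Exception:
        # Backend unreachable: fail open only where the namespace allows it.
        return "allow" if namespace in fail_open_namespaces else "deny"
    cache.put(digest, decision)
    return decision
```

Failure decisions are deliberately not cached in this sketch, so a recovered backend is consulted on the next request.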
---
## 7. Drift Detection
### 7.1 Signal Types
| Signal | Detection Method | Action |
|--------|-----------------|--------|
| Process Drift | Terminal program differs from EntryTrace baseline | Alert |
| Library Drift | Loaded DSOs not in Usage SBOM | Alert, delta scan |
| Filesystem Drift | New executables with mtime after image creation | Alert |
| Network Drift | Unexpected listening ports | Alert (optional) |
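
Library drift, as described in the table, reduces to a set difference between what is loaded and what the Usage SBOM lists. The sketch below assumes the SBOM can be reduced to a set of SHA-256 digests; the actual matching key is not stated in this document, and `library_drift` is a hypothetical name.

```python
def library_drift(loaded_libs, usage_sbom_hashes):
    """Return loaded libraries whose digest is absent from the Usage SBOM."""
    drifted = [lib for lib in loaded_libs if lib["sha256"] not in usage_sbom_hashes]
    # Sort by path so repeated runs report drift in a stable order.
    return sorted(drifted, key=lambda lib: lib["path"])
```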
### 7.2 Drift Event
```json
{
  "kind": "DRIFT",
  "delta": {
    "baselineImageDigest": "sha256:abc...",
    "changedFiles": ["/opt/app/server.py"],
    "newBinaries": [
      {"path": "/usr/local/bin/helper", "sha256": "..."}
    ]
  },
  "evidence": [
    {"signal": "procfs.maps", "value": "/lib/.../libssl.so.3@0x7f..."},
    {"signal": "cri.task.inspect", "value": "pid=12345"}
  ]
}
```
---
## 8. Build-ID Workflow
### 8.1 Capture
1. Observer extracts `NT_GNU_BUILD_ID` from `/proc/<pid>/exe`
2. Normalize to lower-case hex
3. Include in runtime event as `process.buildId`
### 8.2 Correlation
1. Scanner.WebService persists observation
2. Policy responses include `buildIds` list
3. Debug files matched via `.build-id/<aa>/<rest>.debug`
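
The normalization and path layout from the capture and correlation steps can be sketched as below. The helper names are hypothetical, and the default root `/usr/lib/debug` is an assumption based on the common debugger convention rather than anything this document specifies.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_build_id(raw):
    """Lower-case a GNU build-ID and verify that it is plain hex."""
    build_id = raw.strip().lower()
    if len(build_id) < 3 or not set(build_id) <= HEX_DIGITS:
        raise ValueError(f"not a build-ID: {raw!r}")
    return build_id


def debug_file_path(build_id, root="/usr/lib/debug"):
    """Map a build-ID to `.build-id/<aa>/<rest>.debug` under `root`."""
    build_id = normalize_build_id(build_id)
    # The first two hex characters form the directory, the rest the file name.
    return f"{root}/.build-id/{build_id[:2]}/{build_id[2:]}.debug"
```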
### 8.3 Symbol Resolution
```bash
# Via CLI
stella runtime policy test --image sha256:abc123... | jq '.buildIds'
# Via debuginfod
debuginfod-find debuginfo 9f3a1cd4c0b7adfe91c0e3b51d2f45fb0f76a4c1
```
---
## 9. Implementation Strategy
### 9.1 Phase 1: Observer Core (Complete)
- [x] CRI socket integration
- [x] Container lifecycle tracking
- [x] Entrypoint tracing
- [x] Loaded library sampling
- [x] Event batching and compression
### 9.2 Phase 2: Admission Webhook (Complete)
- [x] ValidatingAdmissionWebhook
- [x] Image digest resolution
- [x] Policy integration
- [x] Response caching
- [x] Fail-open/closed modes
### 9.3 Phase 3: Drift Detection (In Progress)
- [x] Process drift detection
- [x] Library drift detection
- [ ] Filesystem drift monitoring (ZASTAVA-DRIFT-50-001)
- [ ] Network posture checks (ZASTAVA-NET-51-001)
### 9.4 Phase 4: Advanced Features (Planned)
- [ ] eBPF syscall tracing (optional)
- [ ] Windows container support
- [ ] Live used-by-entrypoint synthesis
- [ ] Admission dry-run dashboards
---
## 10. Configuration
```yaml
zastava:
  mode:
    observer: true
    webhook: true
  backend:
    baseAddress: "https://scanner-web.internal"
    policyPath: "/api/v1/scanner/policy/runtime"
    requestTimeoutSeconds: 5
  runtime:
    authority:
      issuer: "https://authority.internal"
      clientId: "zastava-observer"
      audience: ["scanner", "zastava"]
      scopes: ["api:scanner.runtime.write"]
      requireDpop: true
      requireMutualTls: true
    tenant: "acme-corp"
    engine: "auto" # containerd|cri-o|docker|auto
    procfs: "/host/proc"
    collect:
      entryTrace: true
      loadedLibs: true
      maxLibs: 256
      maxHashBytesPerContainer: 64000000
  admission:
    enforce: true
    failOpenNamespaces: ["dev", "test"]
    verify:
      imageSignature: true
      sbomReferrer: true
      scannerPolicyPass: true
    cacheTtlSeconds: 300
  limits:
    eventsPerSecond: 50
    burst: 200
    perNodeQueue: 10000
```
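
One common way to interpret an `eventsPerSecond` / `burst` pair like the one under `limits` is a token bucket. Whether Zastava uses this algorithm is not stated here, so treat the following as an assumption-laden sketch; `TokenBucket` is a hypothetical class.

```python
import time


class TokenBucket:
    """Allow `rate_per_second` events on average, with bursts up to `burst`."""

    def __init__(self, rate_per_second=50, burst=200, clock=time.monotonic):
        self._rate = rate_per_second
        self._capacity = burst
        self._tokens = float(burst)  # start full so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self):
        """Consume one token if available; return False when the event must wait or drop."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```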
---
## 11. Security Posture
### 11.1 Privileges
| Capability | Purpose | Mode |
|------------|---------|------|
| `CAP_SYS_PTRACE` | nsenter trace | Optional |
| `CAP_DAC_READ_SEARCH` | Read /proc | Required |
| Host PID namespace | Container PIDs | Required |
| Read-only mounts | /proc, sockets | Required |
### 11.2 Least Privilege
- No write mounts
- No host networking
- No privilege escalation
- Read-only rootfs
### 11.3 Data Minimization
- No env var exfiltration
- No command argument logging (unless diagnostic mode)
- Rate limits prevent abuse
---
## 12. Observability
### 12.1 Observer Metrics
- `zastava.runtime.events.total{kind}`
- `zastava.runtime.backend.latency.ms{endpoint}`
- `zastava.proc_maps.samples.total{result}`
- `zastava.entrytrace.depth{p99}`
- `zastava.hash.bytes.total`
- `zastava.buffer.drops.total`
### 12.2 Webhook Metrics
- `zastava.admission.decisions.total{decision}`
- `zastava.admission.cache.hits.total`
- `zastava.backend.failures.total`
---
## 13. Performance Targets
| Operation | Target |
|-----------|--------|
| `/proc/<pid>/maps` sampling | < 30ms (64 files) |
| Full library hash set | < 200ms (256 libs) |
| Admission with warm cache | < 8ms p95 |
| Admission with backend call | < 50ms p95 |
| Event throughput | 5k events/min/node |
---
## 14. Related Documentation
| Resource | Location |
|----------|----------|
| Zastava architecture | `docs/modules/zastava/architecture.md` |
| Runtime event schema | `docs/modules/zastava/event-schema.md` |
| Admission configuration | `docs/modules/zastava/admission-config.md` |
| Deployment guide | `docs/modules/zastava/deployment.md` |
---
## 15. Sprint Mapping
- **Primary Sprint:** SPRINT_0144_0001_0001_zastava_runtime_signals.md
- **Related Sprints:**
  - SPRINT_0140_0001_0001_runtime_signals.md
  - SPRINT_0143_0000_0001_signals.md
**Key Task IDs:**
- `ZASTAVA-OBS-40-001` - Observer core (DONE)
- `ZASTAVA-ADM-41-001` - Admission webhook (DONE)
- `ZASTAVA-DRIFT-50-001` - Filesystem drift (IN PROGRESS)
- `ZASTAVA-NET-51-001` - Network posture (TODO)
- `ZASTAVA-EBPF-60-001` - eBPF integration (FUTURE)
---
## 16. Success Metrics
| Metric | Target |
|--------|--------|
| Event capture rate | 99.9% of container starts |
| Admission latency | < 50ms p95 |
| Drift detection rate | 100% of runtime changes |
| False positive rate | < 1% of drift alerts |
| Node resource usage | < 2% CPU, < 100MB RAM |
---
*Last updated: 2025-11-29*


@@ -0,0 +1,373 @@
# Telemetry and Observability Patterns
**Version:** 1.0
**Date:** 2025-11-29
**Status:** Canonical
This advisory defines the product rationale, collector topology, and implementation strategy for the Telemetry module, covering metrics, traces, logs, forensic pipelines, and offline packaging.
---
## 1. Executive Summary
The Telemetry module provides **unified observability infrastructure** across all Stella Ops components. Key capabilities:
- **OpenTelemetry Native** - OTLP collection for metrics, traces, logs
- **Forensic Mode** - Extended retention and 100% sampling during incidents
- **Profile-Based Configuration** - Default, forensic, and air-gap profiles
- **Sealed-Mode Guards** - Automatic exporter restrictions in air-gap
- **Offline Bundles** - Signed OTLP archives for compliance
---
## 2. Market Drivers
### 2.1 Target Segments
| Segment | Observability Requirements | Use Case |
|---------|---------------------------|----------|
| **Platform Ops** | Real-time monitoring | Operational health |
| **Security Teams** | Forensic investigation | Incident response |
| **Compliance** | Audit trails | SOC 2, FedRAMP |
| **DevSecOps** | Pipeline visibility | CI/CD debugging |
### 2.2 Competitive Positioning
Most vulnerability tools provide minimal observability. Stella Ops differentiates with:
- **Built-in OpenTelemetry** across all services
- **Forensic mode** with automatic retention extension
- **Sealed-mode compatibility** for air-gap
- **Signed OTLP bundles** for compliance archives
- **Incident-triggered sampling** escalation
---
## 3. Collector Topology
### 3.1 Architecture
```
┌─────────────────────────────────────────────────────┐
│                      Services                       │
│  Scanner │ Policy │ Authority │ Orchestrator │ ...  │
└─────────────────────┬───────────────────────────────┘
                      │ OTLP/gRPC
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│  ┌─────────┐  ┌──────────┐  ┌─────────────────────┐ │
│  │ Traces  │  │ Metrics  │  │        Logs         │ │
│  └────┬────┘  └────┬─────┘  └──────────┬──────────┘ │
│       │ Tail       │ Batch             │ Redaction  │
│       │ Sampling   │                   │            │
└───────┼────────────┼───────────────────┼────────────┘
        │            │                   │
        ▼            ▼                   ▼
    ┌────────┐  ┌──────────┐         ┌────────┐
    │ Tempo  │  │Prometheus│         │  Loki  │
    └────────┘  └──────────┘         └────────┘
```
### 3.2 Collector Profiles
| Profile | Use Case | Configuration |
|---------|----------|---------------|
| **default** | Normal operation | 10% trace sampling, 30-day retention |
| **forensic** | Investigation mode | 100% sampling, 180-day retention |
| **airgap** | Offline deployment | File exporters, no external network |
---
## 4. Metrics
### 4.1 Standard Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stellaops_request_duration_seconds` | Histogram | service, endpoint | Request latency |
| `stellaops_request_total` | Counter | service, status | Request count |
| `stellaops_active_jobs` | Gauge | tenant, jobType | Active job count |
| `stellaops_queue_depth` | Gauge | queue | Queue depth |
| `stellaops_scan_duration_seconds` | Histogram | tenant | Scan duration |
### 4.2 Module-Specific Metrics
**Policy Engine:**
- `policy_run_seconds{mode,tenant,policy}`
- `policy_rules_fired_total{policy,rule}`
- `policy_vex_overrides_total{policy,vendor}`
**Scanner:**
- `scanner_sbom_components_total{ecosystem}`
- `scanner_vulnerabilities_found_total{severity}`
- `scanner_attestations_logged_total`
**Authority:**
- `authority_token_issued_total{grant_type,audience}`
- `authority_token_rejected_total{reason}`
- `authority_dpop_nonce_miss_total`
---
## 5. Traces
### 5.1 Trace Context
All services propagate W3C Trace Context:
- `traceparent` header
- `tracestate` for vendor-specific data
- `baggage` for cross-service attributes
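
For reference, a `traceparent` value has four dash-separated lowercase hex fields: version, trace ID, parent (span) ID, and flags. The sketch below parses the version `00` layout strictly; `parse_traceparent` is a hypothetical helper, and real services would normally rely on the OpenTelemetry SDK propagators instead.

```python
import re

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the parsed fields, or None when the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "traceId": trace_id,
        "parentId": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest flag bit is "sampled"
    }
```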
### 5.2 Span Conventions
| Span | Attributes | Description |
|------|------------|-------------|
| `http.request` | url, method, status | HTTP handler |
| `db.query` | collection, operation | MongoDB ops |
| `policy.evaluate` | policyId, version | Policy run |
| `scan.image` | imageRef, digest | Image scan |
| `sign.dsse` | predicateType | DSSE signing |
### 5.3 Sampling Strategy
**Default (Tail Sampling):**
- Error traces: 100%
- Slow traces (>2s): 100%
- Normal traces: 10%
**Forensic Mode:**
- All traces: 100%
- Extended attributes enabled
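
The sampling rules above amount to a small decision function evaluated once a trace is complete. The following sketch is illustrative: in practice this logic lives in the collector's tail-sampling configuration, and `keep_trace` plus its hash-based 10% selection are assumptions of this example.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_seconds, forensic=False,
               normal_rate=0.10, slow_threshold_seconds=2.0):
    """Decide whether a finished trace is kept."""
    if forensic or has_error or duration_seconds > slow_threshold_seconds:
        return True  # forensic mode, errors, and slow traces are always kept
    # Map the trace ID to a stable value in [0, 1) so every collector
    # instance makes the same choice for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < normal_rate
```

Hashing the trace ID instead of drawing a random number keeps the decision reproducible, which fits the determinism emphasis elsewhere in these advisories.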
---
## 6. Logs
### 6.1 Structured Format
```json
{
  "timestamp": "2025-11-29T12:00:00.123Z",
  "level": "info",
  "message": "Scan completed",
  "service": "scanner",
  "traceId": "abc123...",
  "spanId": "def456...",
  "tenant": "acme-corp",
  "imageDigest": "sha256:...",
  "componentCount": 245,
  "vulnerabilityCount": 12
}
```
### 6.2 Redaction
Attribute processors strip sensitive data:
- `authorization` headers
- `secretRef` values
- PII based on allowed-key policy
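
An allowed-key policy can be sketched as a filter that keeps only attributes on an allow-list and always drops the two named sensitive keys. This is one possible interpretation: `redact` is a hypothetical function, key matching is assumed to be case-insensitive, and the real processors run inside the collector rather than in application code.

```python
# Keys that are stripped even if someone adds them to the allow-list.
ALWAYS_STRIPPED = {"authorization", "secretref"}


def redact(attributes, allowed_keys):
    """Keep only allow-listed attributes; never keep the sensitive ones."""
    allowed = {key.lower() for key in allowed_keys} - ALWAYS_STRIPPED
    return {key: value for key, value in attributes.items() if key.lower() in allowed}
```

An allow-list fails safe: an attribute nobody anticipated is dropped by default instead of leaking.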
### 6.3 Log Levels
| Level | Purpose | Retention |
|-------|---------|-----------|
| `error` | Failures | 180 days |
| `warn` | Anomalies | 90 days |
| `info` | Operations | 30 days |
| `debug` | Development | 7 days |
---
## 7. Forensic Mode
### 7.1 Activation
```bash
# Activate forensic mode for tenant
stella telemetry incident start --tenant acme-corp --reason "CVE-2025-12345 investigation"
# Check status
stella telemetry incident status
# Deactivate
stella telemetry incident stop --tenant acme-corp
```
### 7.2 Behavior Changes
| Aspect | Default | Forensic |
|--------|---------|----------|
| Trace sampling | 10% | 100% |
| Log level | info | debug |
| Retention | 30 days | 180 days |
| Attributes | Standard | Extended |
| Export frequency | 1 minute | 10 seconds |
### 7.3 Automatic Triggers
- Orchestrator incident escalation
- Policy violation threshold exceeded
- Circuit breaker activation
- Manual operator trigger
---
## 8. Implementation Strategy
### 8.1 Phase 1: Core Telemetry (Complete)
- [x] OpenTelemetry SDK integration
- [x] Metrics exporter (Prometheus)
- [x] Trace exporter (Tempo/Jaeger)
- [x] Log exporter (Loki)
### 8.2 Phase 2: Advanced Features (Complete)
- [x] Tail sampling configuration
- [x] Attribute redaction
- [x] Profile-based configuration
- [x] Dashboard provisioning
### 8.3 Phase 3: Forensic & Offline (In Progress)
- [x] Forensic mode toggle
- [ ] Forensic bundle export (TELEM-FOR-50-001)
- [ ] Sealed-mode guards (TELEM-SEAL-51-001)
- [ ] Offline bundle signing (TELEM-SIGN-52-001)
---
## 9. API Surface
### 9.1 Configuration
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/config/profile/{name}` | GET | `telemetry:read` | Download collector config |
| `/telemetry/config/profiles` | GET | `telemetry:read` | List profiles |
### 9.2 Incident Mode
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/incidents/mode` | POST | `telemetry:admin` | Toggle forensic mode |
| `/telemetry/incidents/status` | GET | `telemetry:read` | Current mode status |
### 9.3 Exports
| Endpoint | Method | Scope | Description |
|----------|--------|-------|-------------|
| `/telemetry/exports/forensic/{window}` | GET | `telemetry:export` | Stream OTLP bundle |
---
## 10. Offline Support
### 10.1 Bundle Structure
```
telemetry-bundle/
├── otlp/
│   ├── metrics.pb
│   ├── traces.pb
│   └── logs.pb
├── config/
│   ├── collector.yaml
│   └── dashboards/
├── manifest.json
└── signatures/
    └── manifest.sig
```
### 10.2 Sealed-Mode Guards
```csharp
// StellaOps.Telemetry.Core enforces IEgressPolicy
if (sealedMode.IsActive)
{
    // Disable non-loopback exporters
    // Emit structured warning with remediation
    // Fall back to file-based export
}
```
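
The guard's first step, disabling non-loopback exporters, can be illustrated outside of .NET. The sketch below is in Python and is not the `StellaOps.Telemetry.Core` implementation: the exporter dictionaries, the `kind`/`endpoint` fields, and both function names are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback(endpoint):
    """True when the endpoint URL points at localhost or a loopback address."""
    host = urlsplit(endpoint).hostname
    if host is None:
        return False  # unparseable endpoints are treated as external
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name other than localhost


def apply_sealed_mode(exporters, sealed):
    """Split exporters into (active, blocked) according to sealed mode."""
    if not sealed:
        return list(exporters), []
    active, blocked = [], []
    for exporter in exporters:
        # File exporters never leave the host, so they stay enabled.
        if exporter["kind"] == "file" or is_loopback(exporter.get("endpoint", "")):
            active.append(exporter)
        else:
            blocked.append(exporter)
    return active, blocked
```

The `blocked` list is what a caller would report in the structured warning mentioned in the comments above.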
---
## 11. Dashboards & Alerts
### 11.1 Standard Dashboards
| Dashboard | Purpose | Panels |
|-----------|---------|--------|
| Platform Health | Overall status | Request rate, error rate, latency |
| Scan Operations | Scanner metrics | Scan rate, duration, findings |
| Policy Engine | Policy metrics | Evaluation rate, rule hits, verdicts |
| Job Orchestration | Queue metrics | Queue depth, job latency, failures |
### 11.2 Alert Rules
| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | error_rate > 5% | critical |
| Slow Scans | p95 > 5m | warning |
| Queue Backlog | depth > 1000 | warning |
| Circuit Open | breaker_open = 1 | critical |
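
The alert table can be read as threshold checks over a metrics snapshot. In a real deployment these would be alerting rules in the monitoring stack; the sketch below only illustrates the conditions, and the snapshot keys (`error_rate`, `scan_p95_seconds`, `queue_depth`, `breaker_open`) are assumed names.

```python
import operator

# (alert name, snapshot key, comparison, threshold, severity)
ALERT_RULES = [
    ("High Error Rate", "error_rate", operator.gt, 0.05, "critical"),
    ("Slow Scans", "scan_p95_seconds", operator.gt, 300, "warning"),  # 5 minutes
    ("Queue Backlog", "queue_depth", operator.gt, 1000, "warning"),
    ("Circuit Open", "breaker_open", operator.eq, 1, "critical"),
]


def firing_alerts(snapshot):
    """Return (name, severity) for every rule whose condition holds."""
    return [
        (name, severity)
        for name, key, compare, threshold, severity in ALERT_RULES
        if key in snapshot and compare(snapshot[key], threshold)
    ]
```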
---
## 12. Security Considerations
### 12.1 Data Protection
- Sensitive attributes redacted at collection
- Encrypted in transit (TLS)
- Encrypted at rest (storage layer)
- Retention policies enforced
### 12.2 Access Control
- Authority scopes for API access
- Tenant isolation in queries
- Audit logging for forensic access
---
## 13. Related Documentation
| Resource | Location |
|----------|----------|
| Telemetry architecture | `docs/modules/telemetry/architecture.md` |
| Collector configuration | `docs/modules/telemetry/collector-config.md` |
| Dashboard provisioning | `docs/modules/telemetry/dashboards.md` |
---
## 14. Sprint Mapping
- **Primary Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md
**Key Task IDs:**
- `TELEM-CORE-40-001` - SDK integration (DONE)
- `TELEM-DASH-41-001` - Dashboard provisioning (DONE)
- `TELEM-FOR-50-001` - Forensic bundles (IN PROGRESS)
- `TELEM-SEAL-51-001` - Sealed-mode guards (TODO)
- `TELEM-SIGN-52-001` - Bundle signing (TODO)
---
## 15. Success Metrics
| Metric | Target |
|--------|--------|
| Collection overhead | < 2% CPU |
| Trace sampling accuracy | 100% for errors |
| Log ingestion latency | < 5 seconds |
| Forensic activation time | < 30 seconds |
| Bundle export time | < 5 minutes (24h data) |
---
*Last updated: 2025-11-29*


@@ -157,6 +157,107 @@ These are the authoritative advisories to reference for implementation:
- `docs/security/dpop-mtls-rollout.md` - Sender constraints
- **Status:** Fills HIGH-priority gap - consolidates token model, scopes, multi-tenant isolation
### CLI Developer Experience & Command UX
- **Canonical:** `29-Nov-2025 - CLI Developer Experience and Command UX.md`
- **Sprint:** SPRINT_0201_0001_0001_cli_i.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_203_cli_iii.md
  - SPRINT_205_cli_v.md
- **Related Docs:**
  - `docs/modules/cli/architecture.md` - Module architecture
  - `docs/09_API_CLI_REFERENCE.md` - Command reference
- **Status:** Fills HIGH-priority gap - covers command surface, auth model, Buildx integration
### Orchestrator Event Model & Job Lifecycle
- **Canonical:** `29-Nov-2025 - Orchestrator Event Model and Job Lifecycle.md`
- **Sprint:** SPRINT_0151_0001_0001_orchestrator_i.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_152_orchestrator_ii.md
  - SPRINT_0152_0001_0002_orchestrator_ii.md
- **Related Docs:**
  - `docs/modules/orchestrator/architecture.md` - Module architecture
- **Status:** Fills HIGH-priority gap - covers job lifecycle, quota governance, replay semantics
### Export Center & Reporting Strategy
- **Canonical:** `29-Nov-2025 - Export Center and Reporting Strategy.md`
- **Sprint:** SPRINT_0160_0001_0001_export_evidence.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_0161_0001_0001_evidencelocker.md
- **Related Docs:**
  - `docs/modules/export-center/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers profile system, adapters, distribution channels
### Runtime Posture & Observation (Zastava)
- **Canonical:** `29-Nov-2025 - Runtime Posture and Observation with Zastava.md`
- **Sprint:** SPRINT_0144_0001_0001_zastava_runtime_signals.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_0140_0001_0001_runtime_signals.md
  - SPRINT_0143_0000_0001_signals.md
- **Related Docs:**
  - `docs/modules/zastava/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers runtime events, admission control, drift detection
### Notification Rules & Alerting Engine
- **Canonical:** `29-Nov-2025 - Notification Rules and Alerting Engine.md`
- **Sprint:** SPRINT_0170_0001_0001_notify_engine.md (NEW)
- **Related Sprints:**
  - SPRINT_0171_0001_0002_notify_connectors.md
  - SPRINT_0172_0001_0003_notify_ack_tokens.md
- **Related Docs:**
  - `docs/modules/notify/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers rules engine, channels, noise control, ack tokens
### Graph Analytics & Dependency Insights
- **Canonical:** `29-Nov-2025 - Graph Analytics and Dependency Insights.md`
- **Sprint:** SPRINT_0141_0001_0001_graph_indexer.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_0401_0001_0001_reachability_evidence_chain.md
  - SPRINT_0140_0001_0001_runtime_signals.md
- **Related Docs:**
  - `docs/modules/graph/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers graph model, overlays, analytics, visualization
### Telemetry & Observability Patterns
- **Canonical:** `29-Nov-2025 - Telemetry and Observability Patterns.md`
- **Sprint:** SPRINT_0180_0001_0001_telemetry_core.md (NEW)
- **Related Sprints:**
  - SPRINT_0181_0001_0002_telemetry_forensic.md
  - SPRINT_0182_0001_0003_telemetry_offline.md
- **Related Docs:**
  - `docs/modules/telemetry/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers collector topology, forensic mode, offline bundles
### Policy Simulation & Shadow Gates
- **Canonical:** `29-Nov-2025 - Policy Simulation and Shadow Gates.md`
- **Sprint:** SPRINT_0185_0001_0001_policy_simulation.md (NEW)
- **Related Sprints:**
  - SPRINT_0120_0000_0001_policy_reasoning.md
  - SPRINT_0121_0001_0001_policy_reasoning.md
- **Related Docs:**
  - `docs/modules/policy/architecture.md` - Module architecture
- **Status:** Fills MEDIUM-priority gap - covers shadow runs, coverage fixtures, promotion gates
### Findings Ledger & Immutable Audit Trail
- **Canonical:** `29-Nov-2025 - Findings Ledger and Immutable Audit Trail.md`
- **Sprint:** SPRINT_0186_0001_0001_record_deterministic_execution.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_0120_0000_0001_policy_reasoning.md
  - SPRINT_311_docs_tasks_md_xi.md
- **Related Docs:**
  - `docs/modules/findings-ledger/openapi/findings-ledger.v1.yaml` - OpenAPI spec
- **Status:** Fills MEDIUM-priority gap - covers append-only events, Merkle anchoring, projections
### Concelier Advisory Ingestion Model
- **Canonical:** `29-Nov-2025 - Concelier Advisory Ingestion Model.md`
- **Sprint:** SPRINT_0115_0001_0004_concelier_iv.md (PRIMARY)
- **Related Sprints:**
  - SPRINT_0113_0001_0002_concelier_ii.md
  - SPRINT_0114_0001_0003_concelier_iii.md
- **Related Docs:**
  - `docs/modules/concelier/architecture.md` - Module architecture
  - `docs/modules/concelier/link-not-merge-schema.md` - LNM schema
- **Status:** Fills MEDIUM-priority gap - covers AOC, Link-Not-Merge, connectors, deterministic exports
## Files Archived
The following files have been moved to `archived/27-Nov-2025-superseded/`:
@@ -198,6 +299,16 @@ The following issues were fixed:
| Mirror & Offline Kit | SPRINT_0125_0001_0001 | EXISTING |
| Task Pack Orchestration | SPRINT_0157_0001_0001 | EXISTING |
| Auth/AuthZ Architecture | Multiple (100, 314, 0514) | EXISTING |
| CLI Developer Experience | SPRINT_0201_0001_0001 | NEW |
| Orchestrator Event Model | SPRINT_0151_0001_0001 | NEW |
| Export Center Strategy | SPRINT_0160_0001_0001 | NEW |
| Zastava Runtime Posture | SPRINT_0144_0001_0001 | NEW |
| Notification Rules Engine | SPRINT_0170_0001_0001 | NEW |
| Graph Analytics | SPRINT_0141_0001_0001 | NEW |
| Telemetry & Observability | SPRINT_0180_0001_0001 | NEW |
| Policy Simulation | SPRINT_0185_0001_0001 | NEW |
| Findings Ledger | SPRINT_0186_0001_0001 | NEW |
| Concelier Ingestion | SPRINT_0115_0001_0004 | NEW |
## Implementation Priority
@@ -210,11 +321,21 @@ Based on gap analysis:
5. **P1 - Sovereign Crypto** (Sprint 0514) - Regional compliance enablement
6. **P1 - Evidence Bundle & Replay** (Sprint 0161, 0187) - Audit/compliance critical
7. **P1 - Mirror & Offline Kit** (Sprint 0125, 0150) - Air-gap deployment critical
8. **P2 - Task Pack Orchestration** (Sprint 0157, 0158) - Automation foundation
9. **P2 - Explainability** (Sprint 0401) - UX enhancement, existing tasks
10. **P2 - Plugin Architecture** (Multiple) - Foundational extensibility patterns
11. **P2 - Auth/AuthZ Architecture** (Multiple) - Security consolidation
12. **P3 - Already Implemented** - Unknowns, Graph IDs, DSSE batching
8. **P1 - CLI Developer Experience** (Sprint 0201) - Developer UX critical
9. **P1 - Orchestrator Event Model** (Sprint 0151) - Job lifecycle foundation
10. **P2 - Task Pack Orchestration** (Sprint 0157, 0158) - Automation foundation
11. **P2 - Explainability** (Sprint 0401) - UX enhancement, existing tasks
12. **P2 - Plugin Architecture** (Multiple) - Foundational extensibility patterns
13. **P2 - Auth/AuthZ Architecture** (Multiple) - Security consolidation
14. **P2 - Export Center** (Sprint 0160) - Reporting flexibility
15. **P2 - Zastava Runtime** (Sprint 0144) - Runtime observability
16. **P2 - Notification Rules** (Sprint 0170) - Alert management
17. **P2 - Graph Analytics** (Sprint 0141) - Dependency insights
18. **P2 - Telemetry** (Sprint 0180) - Observability infrastructure
19. **P2 - Policy Simulation** (Sprint 0185) - Safe policy testing
20. **P2 - Findings Ledger** (Sprint 0186) - Audit immutability
21. **P2 - Concelier Ingestion** (Sprint 0115) - Advisory pipeline
22. **P3 - Already Implemented** - Unknowns, Graph IDs, DSSE batching
## Implementer Quick Reference
@@ -241,6 +362,15 @@ For each topic, the implementer should read:
| Evidence Locker | `docs/modules/evidence-locker/*.md` | `src/EvidenceLocker/*/AGENTS.md` |
| Mirror | `docs/modules/mirror/*.md` | `src/Mirror/*/AGENTS.md` |
| TaskRunner | `docs/modules/taskrunner/*.md` | `src/TaskRunner/*/AGENTS.md` |
| CLI | `docs/modules/cli/architecture.md` | `src/Cli/*/AGENTS.md` |
| Orchestrator | `docs/modules/orchestrator/architecture.md` | `src/Orchestrator/*/AGENTS.md` |
| Export Center | `docs/modules/export-center/architecture.md` | `src/ExportCenter/*/AGENTS.md` |
| Zastava | `docs/modules/zastava/architecture.md` | `src/Zastava/*/AGENTS.md` |
| Notify | `docs/modules/notify/architecture.md` | `src/Notify/*/AGENTS.md` |
| Graph | `docs/modules/graph/architecture.md` | `src/Graph/*/AGENTS.md` |
| Telemetry | `docs/modules/telemetry/architecture.md` | `src/Telemetry/*/AGENTS.md` |
| Findings Ledger | `docs/modules/findings-ledger/openapi/` | `src/Findings/*/AGENTS.md` |
| Concelier | `docs/modules/concelier/architecture.md` | `src/Concelier/*/AGENTS.md` |
## Topical Gaps (Advisory Needed)
@@ -254,12 +384,17 @@ The following topics are mentioned in CLAUDE.md or module docs but lack dedicate
| ~~Mirror/Offline Kit Strategy~~ | HIGH | **FILLED** | `29-Nov-2025 - Mirror and Offline Kit Strategy.md` |
| ~~Task Pack Orchestration~~ | HIGH | **FILLED** | `29-Nov-2025 - Task Pack Orchestration and Automation.md` |
| ~~Auth/AuthZ Architecture~~ | HIGH | **FILLED** | `29-Nov-2025 - Authentication and Authorization Architecture.md` |
| ~~CLI Developer Experience~~ | HIGH | **FILLED** | `29-Nov-2025 - CLI Developer Experience and Command UX.md` |
| ~~Orchestrator Event Model~~ | HIGH | **FILLED** | `29-Nov-2025 - Orchestrator Event Model and Job Lifecycle.md` |
| ~~Export Center Strategy~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Export Center and Reporting Strategy.md` |
| ~~Runtime Posture & Observation~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Runtime Posture and Observation with Zastava.md` |
| ~~Notification Rules Engine~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Notification Rules and Alerting Engine.md` |
| ~~Graph Analytics & Clustering~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Graph Analytics and Dependency Insights.md` |
| ~~Telemetry & Observability~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Telemetry and Observability Patterns.md` |
| ~~Policy Simulation & Shadow Gates~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Policy Simulation and Shadow Gates.md` |
| ~~Findings Ledger & Audit Trail~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Findings Ledger and Immutable Audit Trail.md` |
| ~~Concelier Advisory Ingestion~~ | MEDIUM | **FILLED** | `29-Nov-2025 - Concelier Advisory Ingestion Model.md` |
| **CycloneDX 1.6 .NET Integration** | LOW | Open | Deep Architecture covers generically; expand with .NET-specific guidance |
| **Findings Ledger & Audit Trail** | MEDIUM | Open | Immutable verdict tracking; module exists but no advisory |
| **Runtime Posture & Observation** | MEDIUM | Open | Zastava runtime signals; sprints exist but no advisory |
| **Graph Analytics & Clustering** | MEDIUM | Open | Community detection, blast-radius; implementation underway |
| **Policy Simulation & Shadow Gates** | MEDIUM | Open | Impact modeling; extensive sprints but no contract advisory |
| **Notification Rules Engine** | MEDIUM | Open | Throttling, digests, templating; sprints active |
## Known Issues (Non-Blocking)
@@ -274,4 +409,4 @@ Several filenames use a non-breaking hyphen (U+2011) instead of a regular hyphen (-). This may c
---
*Index created: 2025-11-27*
*Last updated: 2025-11-29*
*Last updated: 2025-11-29 (added 10 new advisories filling all identified gaps)*