release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00
parent c84f421e2f
commit d509c44411
130 changed files with 70292 additions and 721 deletions
--- a/docs/modules/release-orchestrator/design/decisions.md
+++ b/docs/modules/release-orchestrator/design/decisions.md
@@ -0,0 +1,249 @@
+# Key Architectural Decisions
+
+This document records significant architectural decisions and their rationale.
+
+## ADR-001: Digest-First Release Identity
+
+**Status:** Accepted
+
+**Context:**
+Container images can be referenced by tags (e.g., `v1.2.3`) or digests (e.g., `sha256:abc123...`). Tags are mutable: the same tag can point to different images over time.
+
+**Decision:**
+All releases are identified by immutable OCI digests, never tags. Tags are accepted as input but immediately resolved to digests at release creation time.
+
+**Consequences:**
+- Releases are immutable and reproducible
+- Digest mismatch at pull time is treated as possible tampering or corruption, and the deployment fails
+- Rollback targets specific digest, not "previous tag"
+- Requires registry integration for tag resolution
+- Users see both tag (friendly) and digest (authoritative) in UI
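+
+A minimal sketch of the pull-time check, assuming the digest is the SHA-256 of the raw manifest bytes as served by the registry. `OciDigest`, `Compute`, and `Matches` are hypothetical names used for illustration, not part of the orchestrator's API:
+
+```csharp
+using System;
+using System.Security.Cryptography;
+
+// Hypothetical helper for illustration; not the orchestrator's actual API.
+public static class OciDigest
+{
+    private const string Prefix = "sha256:";
+
+    // Computes the "sha256:<lowercase hex>" digest of the raw manifest bytes.
+    public static string Compute(ReadOnlySpan<byte> manifestBytes)
+    {
+        byte[] hash = SHA256.HashData(manifestBytes);
+        return Prefix + Convert.ToHexString(hash).ToLowerInvariant();
+    }
+
+    // Returns true only when the pulled content hashes to the pinned digest.
+    public static bool Matches(string pinnedDigest, ReadOnlySpan<byte> manifestBytes)
+    {
+        // Only sha256 digests are handled in this sketch.
+        if (!pinnedDigest.StartsWith(Prefix, StringComparison.Ordinal))
+            return false;
+
+        return string.Equals(pinnedDigest, Compute(manifestBytes), StringComparison.Ordinal);
+    }
+}
+```
+
+A caller could fail the deployment whenever `OciDigest.Matches` returns `false` for the pulled manifest, rather than falling back to the tag.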
+
+---
+
+## ADR-002: Evidence for Every Decision
+
+**Status:** Accepted
+
+**Context:**
+Compliance and audit requirements demand proof of what was deployed, when, by whom, and why.
+
+**Decision:**
+Every promotion and deployment produces a cryptographically signed evidence packet that is immutable and append-only.
+
+**Consequences:**
+- Evidence table has no UPDATE/DELETE permissions
+- Evidence enables audit-grade compliance reporting
+- Evidence enables deterministic replay (same inputs + policy = same decision)
+- Evidence packets are exportable for external audit systems
+- Storage requirements increase over time
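+
+One way to produce a signed packet can be sketched as follows. This assumes ECDSA over the UTF-8 bytes of the packet JSON; the signing scheme, key management, and the `SignedEvidence` / `EvidenceSigner` names are assumptions for illustration only:
+
+```csharp
+using System;
+using System.Security.Cryptography;
+using System.Text;
+
+// Hypothetical shape: packet content, its hash, and a Base64 signature.
+public sealed record SignedEvidence(string ContentJson, string ContentHash, string Signature);
+
+// Illustrative signer; the real signing scheme is not specified in this document.
+public static class EvidenceSigner
+{
+    private static string HashOf(byte[] payload) =>
+        "sha256:" + Convert.ToHexString(SHA256.HashData(payload)).ToLowerInvariant();
+
+    // Hashes and signs the packet content before it is stored.
+    public static SignedEvidence Sign(string contentJson, ECDsa key)
+    {
+        byte[] payload = Encoding.UTF8.GetBytes(contentJson);
+        byte[] signature = key.SignData(payload, HashAlgorithmName.SHA256);
+        return new SignedEvidence(contentJson, HashOf(payload), Convert.ToBase64String(signature));
+    }
+
+    // Verifies both the stored content hash and the signature.
+    public static bool Verify(SignedEvidence evidence, ECDsa publicKey)
+    {
+        byte[] payload = Encoding.UTF8.GetBytes(evidence.ContentJson);
+        if (!string.Equals(HashOf(payload), evidence.ContentHash, StringComparison.Ordinal))
+            return false;
+
+        byte[] signature;
+        try
+        {
+            signature = Convert.FromBase64String(evidence.Signature);
+        }
+        catch (FormatException)
+        {
+            return false; // A malformed signature is treated as invalid.
+        }
+
+        return publicKey.VerifyData(payload, signature, HashAlgorithmName.SHA256);
+    }
+}
+```
+
+Any change to the stored content after signing makes `Verify` return `false`, which is what lets an auditor trust an exported packet without trusting the database it came from.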
+
+---
+
+## ADR-003: Plugin Architecture for Integrations
+
+**Status:** Accepted
+
+**Context:**
+Organizations use diverse toolchains (registries, CI/CD, vaults, notification systems). Hard-coding integrations limits adoption.
+
+**Decision:**
+All integrations are implemented as plugins via a three-surface contract (Manifest, Connector Runtime, Step Provider). Core orchestration is stable and plugin-agnostic.
+
+**Consequences:**
+- Core has no hard-coded vendor integrations
+- New integrations can be added without core changes
+- Plugin failures cannot crash core (sandbox isolation)
+- Plugin interface must be versioned and stable
+- Additional complexity in plugin lifecycle management
+
+---
+
+## ADR-004: No Feature Gating
+
+**Status:** Accepted
+
+**Context:**
+Enterprise software often gates security features behind premium tiers, creating "pay for security" anti-patterns.
+
+**Decision:**
+All plans include all features. Pricing is based only on:
+- Number of environments
+- New digests analyzed per day
+- Fair use on deployments
+
+**Consequences:**
+- No feature flags tied to billing tier
+- Transparent pricing without feature fragmentation
+- May limit revenue optimization per customer
+- Quota enforcement must be clear and user-friendly
+
+---
+
+## ADR-005: Offline-First Operation
+
+**Status:** Accepted
+
+**Context:**
+Many organizations operate in air-gapped or restricted network environments. Dependency on external services limits adoption.
+
+**Decision:**
+All core operations must work in air-gapped environments. External data is synced via mirror bundles. Plugins may require connectivity; core does not.
+
+**Consequences:**
+- No runtime calls to external APIs for core decisions
+- Advisory data synced via offline bundles
+- Plugin connectivity requirements are declared in manifest
+- Evidence packets exportable for external submission
+- Additional complexity in data synchronization
+
+---
+
+## ADR-006: Agent-Based and Agentless Deployment
+
+**Status:** Accepted
+
+**Context:**
+Some organizations prefer agents for security isolation; others prefer agentless for simplicity.
+
+**Decision:**
+Support both agent-based (persistent daemon on targets) and agentless (SSH/WinRM on demand) deployment models.
+
+**Consequences:**
+- Agent provides better performance and reliability
+- Agentless reduces infrastructure footprint
+- Unified task model abstracts deployment details
+- Security model must handle both patterns
+- Larger testing matrix
+
+---
+
+## ADR-007: PostgreSQL as Primary Database
+
+**Status:** Accepted
+
+**Context:**
+Database choice affects scalability, operations, and feature availability.
+
+**Decision:**
+Use PostgreSQL (16+) as the primary database, with:
+- Per-module schema isolation
+- Row-level security for multi-tenancy
+- JSONB for flexible configuration
+- Append-only triggers for evidence tables
+
+**Consequences:**
+- Proven scalability and reliability
+- Rich feature set (JSONB, RLS, triggers)
+- Single database technology to operate
+- Requires PostgreSQL expertise
+- Schema migrations must be carefully managed
+
+---
+
+## ADR-008: Workflow Engine with DAG Execution
+
+**Status:** Accepted
+
+**Context:**
+Deployment workflows need conditional logic, parallel execution, error handling, and rollback support.
+
+**Decision:**
+Implement a DAG-based workflow engine where:
+- Workflows are templates with nodes (steps) and edges (dependencies)
+- Steps execute when all dependencies are satisfied
+- Expressions reference previous step outputs
+- Built-in support for approval, retry, timeout, and rollback
+
+**Consequences:**
+- Flexible workflow composition
+- Visual representation in UI
+- Complex error handling scenarios supported
+- Learning curve for workflow authors
+- Expression engine security considerations
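+
+The rule "steps execute when all dependencies are satisfied" can be sketched as a layered topological sort, where each layer holds steps that could run in parallel. `WorkflowPlanner` and `PlanWaves` are hypothetical names; approvals, retries, timeouts, and rollback are deliberately left out:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Illustrative planner; not the actual workflow engine.
+public static class WorkflowPlanner
+{
+    // Input maps each step to the steps it depends on.
+    // Output is a list of waves; every step in a wave has all dependencies in earlier waves.
+    public static IReadOnlyList<IReadOnlyList<string>> PlanWaves(
+        IReadOnlyDictionary<string, string[]> dependencies)
+    {
+        var remaining = dependencies.ToDictionary(
+            kv => kv.Key,
+            kv => new HashSet<string>(kv.Value, StringComparer.Ordinal),
+            StringComparer.Ordinal);
+
+        // Reject edges that point at steps missing from the template.
+        foreach (var (step, deps) in remaining)
+            foreach (var dep in deps)
+                if (!remaining.ContainsKey(dep))
+                    throw new ArgumentException($"Step '{step}' depends on unknown step '{dep}'.");
+
+        var waves = new List<IReadOnlyList<string>>();
+        while (remaining.Count > 0)
+        {
+            // A step is ready once none of its dependencies are still pending.
+            var ready = remaining
+                .Where(kv => kv.Value.Count == 0)
+                .Select(kv => kv.Key)
+                .OrderBy(s => s, StringComparer.Ordinal)
+                .ToList();
+
+            if (ready.Count == 0)
+                throw new InvalidOperationException("Workflow contains a dependency cycle.");
+
+            foreach (var step in ready)
+                remaining.Remove(step);
+            foreach (var deps in remaining.Values)
+                deps.ExceptWith(ready);
+
+            waves.Add(ready);
+        }
+
+        return waves;
+    }
+}
+```
+
+Rejecting cycles and unknown dependencies at planning time keeps a malformed template from stalling at run time with steps that can never become ready.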
+
+---
+
+## ADR-009: Separation of Duties Enforcement
+
+**Status:** Accepted
+
+**Context:**
+Compliance requires that the person requesting a change must not be the same person who approves it.
+
+**Decision:**
+Separation of Duties (SoD) is enforced at the approval gateway level, preventing self-approval when SoD is enabled for an environment.
+
+**Consequences:**
+- Prevents single-person deployment to sensitive environments
+- Configurable per environment
+- May slow down deployments
+- Requires minimum team size for SoD-enabled environments
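+
+The self-approval check might look like the sketch below. `SeparationOfDuties`, its methods, and the `requiredApprovals` parameter are hypothetical, and identities are compared as case-insensitive strings purely for illustration:
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Illustrative helper; the real approval gateway contract is not shown in this document.
+public static class SeparationOfDuties
+{
+    // Returns the distinct approvers whose approval counts toward the promotion.
+    // With SoD enabled, the requester's own approval is discarded.
+    public static IReadOnlyList<string> EffectiveApprovers(
+        string requestedBy,
+        IEnumerable<string> approvers,
+        bool sodEnabled)
+    {
+        return approvers
+            .Distinct(StringComparer.OrdinalIgnoreCase)
+            .Where(a => !sodEnabled
+                || !string.Equals(a, requestedBy, StringComparison.OrdinalIgnoreCase))
+            .ToList();
+    }
+
+    // True when enough distinct, eligible approvers have signed off.
+    public static bool IsApproved(
+        string requestedBy,
+        IEnumerable<string> approvers,
+        bool sodEnabled,
+        int requiredApprovals)
+    {
+        return EffectiveApprovers(requestedBy, approvers, sodEnabled).Count >= requiredApprovals;
+    }
+}
+```
+
+This also makes the minimum team size concrete: with SoD enabled and `requiredApprovals` set to 1, an environment needs at least two people who are able to act on a promotion.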
+
+---
+
+## ADR-010: Version Stickers for Drift Detection
+
+**Status:** Accepted
+
+**Context:**
+Knowing what's actually deployed on targets is essential for audit and troubleshooting.
+
+**Decision:**
+Every deployment writes a `stella.version.json` sticker file on the target containing release ID, digests, deployment timestamp, and deployer identity.
+
+**Consequences:**
+- Enables drift detection (expected vs actual)
+- Provides audit trail on target hosts
+- Enables accurate "what's deployed where" queries
+- Requires file access on targets
+- Sticker corruption/deletion must be handled
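+
+A sketch of sticker handling and drift detection, under stated assumptions: the `VersionSticker` fields mirror the list in the decision above, but the actual JSON schema of `stella.version.json` is not defined here, and `StickerFile` is a hypothetical helper. The write goes to a temporary file first and is then renamed, so readers never observe a half-written sticker when both paths are on the same volume:
+
+```csharp
+#nullable enable
+using System;
+using System.Collections.Generic;
+using System.IO;
+using System.Linq;
+using System.Text.Json;
+
+// Assumed sticker shape: release ID, component -> digest map, timestamp, deployer.
+public sealed record VersionSticker(
+    string ReleaseId,
+    Dictionary<string, string> Digests,
+    DateTimeOffset DeployedAt,
+    string DeployedBy);
+
+// Illustrative helper; not the agent's actual implementation.
+public static class StickerFile
+{
+    // Writes to a sibling temp file, then renames it over the target path.
+    public static void WriteAtomically(string path, VersionSticker sticker)
+    {
+        string temp = path + ".tmp";
+        File.WriteAllText(temp, JsonSerializer.Serialize(sticker));
+        File.Move(temp, path, overwrite: true);
+    }
+
+    // Returns null when the sticker is missing or unreadable,
+    // so callers can report that case as drift instead of crashing.
+    public static VersionSticker? TryRead(string path)
+    {
+        if (!File.Exists(path))
+            return null;
+        try
+        {
+            return JsonSerializer.Deserialize<VersionSticker>(File.ReadAllText(path));
+        }
+        catch (JsonException)
+        {
+            return null;
+        }
+    }
+
+    // Lists components whose digest on the target differs from the expected one.
+    // A missing sticker or missing component counts as drift.
+    public static IReadOnlyList<string> FindDrift(
+        IReadOnlyDictionary<string, string> expected,
+        VersionSticker? actual)
+    {
+        return expected
+            .Where(kv => actual is null
+                || !actual.Digests.TryGetValue(kv.Key, out var digest)
+                || digest != kv.Value)
+            .Select(kv => kv.Key)
+            .OrderBy(k => k, StringComparer.Ordinal)
+            .ToList();
+    }
+}
+```
+
+Treating a missing or corrupt sticker as "everything drifted" is one possible policy; it errs on the side of flagging the target for attention rather than assuming it is in the expected state.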
+
+---
+
+## ADR-011: Security Gate Integration
+
+**Status:** Accepted
+
+**Context:**
+Security scanning exists as a separate concern; release orchestration should leverage but not duplicate it.
+
+**Decision:**
+Security scanning remains in existing modules (Scanner, VEX). Release orchestration consumes scan results through a security gate that evaluates vulnerability thresholds.
+
+**Consequences:**
+- Clear separation of concerns
+- Existing scanning investment preserved
+- Gate configuration determines block thresholds
+- Requires API integration with scanning modules
+- Policy engine evaluates security verdicts
+
+---
+
+## ADR-012: gRPC for Agent Communication
+
+**Status:** Accepted
+
+**Context:**
+Agent communication requires efficient, bidirectional, and secure data transfer.
+
+**Decision:**
+Use gRPC for agent communication with:
+- mTLS for transport security
+- Bidirectional streaming for logs and progress
+- Protocol buffers for efficient serialization
+
+**Consequences:**
+- Efficient binary protocol
+- Strong typing via protobuf
+- Built-in streaming support
+- Requires gRPC infrastructure
+- Firewall considerations for gRPC traffic
+
+---
+
+## References
+
+- [Design Principles](principles.md)
+- [Security Architecture](../security/overview.md)
+- [Plugin System](../modules/plugin-system.md)
--- a/docs/modules/release-orchestrator/design/principles.md
+++ b/docs/modules/release-orchestrator/design/principles.md
@@ -0,0 +1,221 @@
+# Design Principles & Invariants
+
+> These principles are **inviolable** and MUST be reflected in all code, UI, documentation, and audit artifacts.
+
+## Core Principles
+
+### Principle 1: Release Identity via Digest
+
+```
+INVARIANT: A release is a set of OCI image digests (component → digest mapping), never tags.
+```
+
+- Tags are convenience inputs for resolution
+- Tags are resolved to digests at release creation time
+- All downstream operations (promotion, deployment, rollback) use digests
+- Digest mismatch at pull time = deployment failure (tamper detection)
+
+**Implementation Requirements:**
+- Release creation API accepts tags but immediately resolves to digests
+- All internal references use `sha256:` prefixed digests
+- Agent deployment verifies digest at pull time
+- Rollback targets specific digest, not "previous tag"
+
+### Principle 2: Determinism and Evidence
+
+```
+INVARIANT: Every deployment/promotion produces an immutable evidence record.
+```
+
+An evidence record contains:
+- **Who**: User identity (from Authority)
+- **What**: Release bundle (digests), target environment, target hosts
+- **Why**: Policy evaluation result, approval records, decision reasons
+- **How**: Generated artifacts (compose files, scripts), execution logs
+- **When**: Timestamps for request, decision, execution, completion
+
+Evidence enables:
+- Audit-grade compliance reporting
+- Deterministic replay (same inputs + policy → same decision)
+- "Why blocked?" explainability
+
+**Implementation Requirements:**
+- Evidence is generated synchronously with decision
+- Evidence is signed before storage
+- Evidence table is append-only (no UPDATE/DELETE)
+- Evidence includes hash of all inputs for replay verification
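+
+The input hash used for replay verification can be sketched as below, assuming inputs are available as name/value string pairs. `ReplayHash` is a hypothetical helper; the sorting and length-prefixing scheme is one possible canonicalization, not a documented format:
+
+```csharp
+using System;
+using System.Buffers.Binary;
+using System.Collections.Generic;
+using System.Linq;
+using System.Security.Cryptography;
+using System.Text;
+
+// Illustrative helper; the real canonical form of decision inputs is not specified here.
+public static class ReplayHash
+{
+    // Hashes inputs in ordinal key order so the result does not depend on insertion order.
+    public static string Compute(IReadOnlyDictionary<string, string> inputs)
+    {
+        using var sha = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
+        foreach (var (name, value) in inputs.OrderBy(kv => kv.Key, StringComparer.Ordinal))
+        {
+            Append(sha, name);
+            Append(sha, value);
+        }
+        return "sha256:" + Convert.ToHexString(sha.GetHashAndReset()).ToLowerInvariant();
+    }
+
+    // Length-prefixes each field so ("ab", "c") and ("a", "bc") hash differently.
+    private static void Append(IncrementalHash sha, string text)
+    {
+        byte[] bytes = Encoding.UTF8.GetBytes(text);
+        Span<byte> length = stackalloc byte[4];
+        BinaryPrimitives.WriteInt32BigEndian(length, bytes.Length);
+        sha.AppendData(length);
+        sha.AppendData(bytes);
+    }
+}
+```
+
+During replay, recomputing this hash over the recorded inputs and comparing it with the stored value shows whether the replay is running against the same inputs as the original decision.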
+
+### Principle 3: Pluggable Everything, Stable Core
+
+```
+INVARIANT: Integrations are plugins; the core orchestration engine is stable.
+```
+
+**Plugins contribute:**
+- Configuration screens (UI)
+- Connector logic (runtime)
+- Step node types (workflow)
+- Doctor checks (diagnostics)
+- Agent types (deployment)
+
+**Core engine provides:**
+- Workflow execution (DAG processing)
+- State machine management
+- Evidence generation
+- Policy evaluation
+- Credential brokering
+
+**Implementation Requirements:**
+- Core has no hard-coded integrations
+- Plugin interface is versioned and stable
+- Plugin failures cannot crash core
+- Core provides fallback behavior when plugins unavailable
+
+### Principle 4: No Feature Gating
+
+```
+INVARIANT: All plans include all features. Limits are only:
+- Number of environments
+- Number of new digests analyzed per day
+- Fair use on deployments
+```
+
+This prevents:
+- "Pay for security" anti-pattern
+- Per-project/per-seat billing landmines
+- Feature fragmentation across tiers
+
+**Implementation Requirements:**
+- No feature flags tied to billing tier
+- Quota enforcement is transparent (clear error messages)
+- Usage metrics exposed for customer visibility
+- Overage handling is graceful (soft limits with warnings)
+
+### Principle 5: Offline-First Operation
+
+```
+INVARIANT: All core operations MUST work in air-gapped environments.
+```
+
+Implications:
+- No runtime calls to external APIs for core decisions
+- Vulnerability data synced via mirror bundles
+- Plugins may require connectivity; core does not
+- Evidence packets exportable for external audit
+
+**Implementation Requirements:**
+- Core decision logic has no external HTTP calls
+- All external data is pre-synced and cached
+- Plugin connectivity requirements are declared in manifest
+- Offline mode is an explicit configuration, not a degraded fallback
+
+### Principle 6: Immutable Generated Artifacts
+
+```
+INVARIANT: Every deployment generates and stores immutable artifacts.
+```
+
+Generated artifacts:
+- `compose.stella.lock.yml`: Pinned digests, resolved env refs
+- `deploy.stella.script.dll`: Compiled C# script (or hash reference)
+- `release.evidence.json`: Decision record
+- `stella.version.json`: Version sticker placed on target
+
+Version sticker enables:
+- Drift detection (expected vs actual)
+- Audit trail on target host
+- Rollback reference
+
+**Implementation Requirements:**
+- Artifacts are content-addressed (hash in filename or metadata)
+- Artifacts are stored before deployment execution
+- Artifact storage is immutable (no overwrites)
+- Version sticker is written atomically on the target
+
+---
+
+## Architectural Invariants (Enforced by Design)
+
+These invariants are enforced through database constraints, code architecture, and operational controls.
+
+| Invariant | Enforcement Mechanism |
+|-----------|----------------------|
+| Digests are immutable | Database constraint: digest column is unique, no updates |
+| Evidence packets are append-only | Evidence table has no UPDATE/DELETE permissions |
+| Secrets never in database | Vault integration; only references stored |
+| Plugins cannot bypass policy | Policy evaluation in core, not plugin |
+| Multi-tenant isolation | `tenant_id` FK on all tables; row-level security |
+| Workflow state is auditable | State transitions logged; no direct state manipulation |
+| Approvals are tamper-evident | Approval records are signed and append-only |
+
+### Database Enforcement
+
+```sql
+-- Example: Evidence table with no UPDATE/DELETE
+CREATE TABLE release.evidence_packets (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id UUID NOT NULL REFERENCES tenants(id),
+    promotion_id UUID NOT NULL REFERENCES release.promotions(id),
+    content_hash TEXT NOT NULL,
+    content JSONB NOT NULL,
+    signature TEXT NOT NULL,
+    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
+    -- No updated_at column; immutable by design
+);
+
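+-- Revoke mutation privileges from the application role.
+-- TRUNCATE is included because it would otherwise bypass the append-only guarantee.
+REVOKE UPDATE, DELETE, TRUNCATE ON release.evidence_packets FROM app_role;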
+```
+
+### Code Architecture Enforcement
+
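+```csharp
+using System.Collections.Generic;
+using System.Threading;
+using System.Threading.Tasks;
+
+// Policy evaluation is ALWAYS in core, never delegated to plugins
+public sealed class PromotionDecisionEngine
+{
+    // The type names PromotionPolicy and IDecisionAggregator are assumed for illustration
+    private readonly PromotionPolicy _policy;
+    private readonly IDecisionAggregator _decisionAggregator;
+
+    public PromotionDecisionEngine(PromotionPolicy policy, IDecisionAggregator decisionAggregator)
+    {
+        _policy = policy;
+        _decisionAggregator = decisionAggregator;
+    }
+
+    // Plugins provide gate implementations, but core orchestrates evaluation
+    public async Task<DecisionResult> EvaluateAsync(
+        Promotion promotion,
+        IReadOnlyList<IGateProvider> gates,
+        CancellationToken ct)
+    {
+        // Core controls evaluation order and aggregation
+        var results = new List<GateResult>();
+        foreach (var gate in gates)
+        {
+            ct.ThrowIfCancellationRequested();
+
+            // Plugin provides evaluation logic
+            var result = await gate.EvaluateAsync(promotion, ct);
+            results.Add(result);
+
+            // Core decides whether to stop early (plugins cannot override)
+            if (result.IsBlocking && _policy.FailFast)
+                break;
+        }
+
+        // Core makes final decision
+        return _decisionAggregator.Aggregate(results);
+    }
+}
+```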
+
+---
+
+## Document Conventions
+
+Throughout the Release Orchestrator documentation:
+
+- **MUST**: Mandatory requirement; non-compliance is a bug
+- **SHOULD**: Recommended but not mandatory; deviation requires justification
+- **MAY**: Optional; implementation decision
+- **Entity names**: `PascalCase` (e.g., `ReleaseBundle`)
+- **Table names**: `snake_case` (e.g., `release_bundles`)
+- **API paths**: `/api/v1/resource-name`
+- **Module names**: `kebab-case` (e.g., `release-manager`)
+
+---
+
+## References
+
+- [Key Architectural Decisions](decisions.md)
+- [Module Architecture](../modules/overview.md)
+- [Security Architecture](../security/overview.md)