master d509c44411 release orchestrator pivot, architecture and planning

2026-01-10 22:37:22 +02:00

Key Architectural Decisions

This document records significant architectural decisions and their rationale.

ADR-001: Digest-First Release Identity

Status: Accepted

Context: Container images can be referenced by tags (e.g., v1.2.3) or digests (e.g., sha256:abc123...). Tags are mutable: the same tag can point to different images over time.

Decision: All releases are identified by immutable OCI digests, never tags. Tags are accepted as input but immediately resolved to digests at release creation time.

Consequences:

Releases are immutable and reproducible
A digest mismatch at pull time fails the deployment, because it indicates tampering or corruption
Rollback targets specific digest, not "previous tag"
Requires registry integration for tag resolution
Users see both tag (friendly) and digest (authoritative) in UI
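
The tag-to-digest rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: Release, create_release, and verify_pulled_content are hypothetical names, and resolve_tag stands in for whatever registry integration performs the lookup.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable, Optional

DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


@dataclass(frozen=True)
class Release:
    image: str           # repository name (hypothetical field)
    digest: str          # authoritative identity
    tag: Optional[str]   # friendly label, kept only for display


def create_release(image: str, ref: str,
                   resolve_tag: Callable[[str, str], str]) -> Release:
    """Accept a tag or a digest as input, but always store a digest."""
    if DIGEST_RE.match(ref):
        return Release(image, ref, None)
    digest = resolve_tag(image, ref)  # assumed registry lookup, done once
    if not DIGEST_RE.match(digest):
        raise ValueError(f"registry returned an invalid digest: {digest!r}")
    return Release(image, digest, ref)


def verify_pulled_content(release: Release, manifest: bytes) -> None:
    """Fail the deployment if the pulled bytes do not hash to the digest."""
    actual = "sha256:" + hashlib.sha256(manifest).hexdigest()
    if actual != release.digest:
        raise RuntimeError(
            f"digest mismatch: expected {release.digest}, got {actual}")
```

Because the tag is resolved once at creation time, a later re-push of the same tag does not change what the release refers to.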

ADR-002: Evidence for Every Decision

Status: Accepted

Context: Compliance and audit requirements demand proof of what was deployed, when, by whom, and why.

Decision: Every promotion and deployment produces a cryptographically signed, immutable evidence packet that is written to an append-only store.

Consequences:

Evidence table has no UPDATE/DELETE permissions
Evidence enables audit-grade compliance reporting
Evidence enables deterministic replay (same inputs + policy = same decision)
Evidence packets are exportable for external audit systems
Storage requirements increase over time
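
As a rough sketch of what a signed evidence packet could look like, the example below signs a canonical JSON encoding of the packet. It makes two loud assumptions: HMAC-SHA256 from the standard library stands in for whatever signature scheme the project actually uses, and the function and field names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(payload: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding, so the
    # same payload always produces the same signature.
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sign_packet(payload: dict, key: bytes) -> dict:
    """Wrap a payload with an HMAC-SHA256 signature (stand-in scheme)."""
    signature = hmac.new(key, canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_packet(packet: dict, key: bytes) -> bool:
    """Return True only if the payload still matches its signature."""
    expected = hmac.new(key, canonical(packet["payload"]),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(packet["signature"], expected)
```

Any change to the payload after signing makes verify_packet return False, which is the property an external auditor would rely on when receiving an exported packet.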

ADR-003: Plugin Architecture for Integrations

Status: Accepted

Context: Organizations use diverse toolchains (registries, CI/CD, vaults, notification systems). Hard-coding integrations limits adoption.

Decision: All integrations are implemented as plugins via a three-surface contract (Manifest, Connector Runtime, Step Provider). Core orchestration is stable and plugin-agnostic.

Consequences:

Core has no hard-coded vendor integrations
New integrations can be added without core changes
Plugin failures cannot crash core (sandbox isolation)
Plugin interface must be versioned and stable
Additional complexity in plugin lifecycle management
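
The three surfaces named in the decision could be modelled roughly as below. The source does not specify the shape of any surface, so every field and method here is an assumption made for illustration. Catching exceptions is also only a stand-in for real sandbox isolation, which would typically use a separate process or runtime.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Manifest:
    name: str
    api_version: str        # plugin interface version the plugin targets
    requires_network: bool  # connectivity declared up front (hypothetical)


class ConnectorRuntime(Protocol):
    def connect(self, config: dict[str, Any]) -> None: ...


class StepProvider(Protocol):
    def steps(self) -> list[str]: ...
    def run(self, step: str, inputs: dict[str, Any]) -> dict[str, Any]: ...


def run_step_isolated(provider: StepProvider, step: str,
                      inputs: dict[str, Any]) -> dict[str, Any]:
    """Core-side wrapper: a plugin exception becomes a failed result
    instead of propagating into the orchestrator."""
    try:
        return {"ok": True, "outputs": provider.run(step, inputs)}
    except Exception as exc:  # deliberately broad: plugins are untrusted
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Using structural Protocol types keeps the core free of imports from any specific plugin, which matches the plugin-agnostic goal.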

ADR-004: No Feature Gating

Status: Accepted

Context: Enterprise software often gates security features behind premium tiers, creating "pay for security" anti-patterns.

Decision: All plans include all features. Pricing is based only on:

Number of environments
New digests analyzed per day
Fair use on deployments

Consequences:

No feature flags tied to billing tier
Transparent pricing without feature fragmentation
May limit revenue optimization per customer
Quota enforcement must be clear and user-friendly

ADR-005: Offline-First Operation

Status: Accepted

Context: Many organizations operate in air-gapped or restricted network environments. Dependency on external services limits adoption.

Decision: All core operations must work in air-gapped environments. External data is synced via mirror bundles. Plugins may require connectivity; core does not.

Consequences:

No runtime calls to external APIs for core decisions
Advisory data synced via offline bundles
Plugin connectivity requirements are declared in manifest
Evidence packets exportable for external submission
Additional complexity in data synchronization

ADR-006: Agent-Based and Agentless Deployment

Status: Accepted

Context: Some organizations prefer agents for security isolation; others prefer agentless for simplicity.

Decision: Support both agent-based (persistent daemon on targets) and agentless (SSH/WinRM on demand) deployment models.

Consequences:

Agent provides better performance and reliability
Agentless reduces infrastructure footprint
Unified task model abstracts deployment details
Security model must handle both patterns
Larger testing matrix

ADR-007: PostgreSQL as Primary Database

Status: Accepted

Context: Database choice affects scalability, operations, and feature availability.

Decision: Use PostgreSQL (16+) as the primary database, with:

Per-module schema isolation
Row-level security for multi-tenancy
JSONB for flexible configuration
Append-only triggers for evidence tables

Consequences:

Proven scalability and reliability
Rich feature set (JSONB, RLS, triggers)
Single database technology to operate
Requires PostgreSQL expertise
Schema migrations must be carefully managed
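
The append-only trigger idea can be demonstrated in a self-contained way. The sketch below uses an in-memory SQLite database purely as a runnable stand-in: PostgreSQL trigger syntax differs (it uses trigger functions), and the table and column names are hypothetical.

```python
import sqlite3


def open_evidence_store() -> sqlite3.Connection:
    """Create an evidence table whose rows can be inserted but never
    updated or deleted (SQLite stand-in for the PostgreSQL triggers)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE evidence (
            id        INTEGER PRIMARY KEY,
            tenant_id TEXT NOT NULL,
            body      TEXT NOT NULL      -- JSON document stored as text
        );
        CREATE TRIGGER evidence_no_update BEFORE UPDATE ON evidence
        BEGIN
            SELECT RAISE(ABORT, 'evidence is append-only');
        END;
        CREATE TRIGGER evidence_no_delete BEFORE DELETE ON evidence
        BEGIN
            SELECT RAISE(ABORT, 'evidence is append-only');
        END;
    """)
    return conn


def try_mutate(conn: sqlite3.Connection, sql: str) -> bool:
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False
```

Enforcing the rule in the database, rather than only in application code, means a buggy or compromised service still cannot rewrite history through ordinary statements.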

ADR-008: Workflow Engine with DAG Execution

Status: Accepted

Context: Deployment workflows need conditional logic, parallel execution, error handling, and rollback support.

Decision: Implement a DAG-based workflow engine where:

Workflows are templates with nodes (steps) and edges (dependencies)
Steps execute when all dependencies are satisfied
Expressions reference previous step outputs
Built-in support for approval, retry, timeout, and rollback

Consequences:

Flexible workflow composition
Visual representation in UI
Complex error handling scenarios supported
Learning curve for workflow authors
Expression engine security considerations
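
The rule that a step executes once all its dependencies are satisfied is a topological traversal of the workflow graph. The sketch below uses Python's standard graphlib module. The run_workflow function and the shape of steps and deps are assumptions for illustration, and approval, retry, timeout, and rollback are left out.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable

# A step receives the outputs of all previously completed steps.
Step = Callable[[dict[str, Any]], Any]


def run_workflow(steps: dict[str, Step],
                 deps: dict[str, set[str]]) -> dict[str, Any]:
    """Run each step after all of its dependencies have completed."""
    unknown = {d for ds in deps.values() for d in ds} - steps.keys()
    if unknown:
        raise ValueError(f"dependencies on undefined steps: {sorted(unknown)}")

    outputs: dict[str, Any] = {}
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Every step returned by get_ready() has all dependencies done, so a
        # real engine could dispatch this batch in parallel.
        for name in sorter.get_ready():
            outputs[name] = steps[name](outputs)
            sorter.done(name)
    return outputs
```

Passing the accumulated outputs to each step is a simple way to model expressions that reference previous step outputs. A production engine would restrict what those expressions can evaluate.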

ADR-009: Separation of Duties Enforcement

Status: Accepted

Context: Compliance requires that the person requesting a change cannot be the same person approving it.

Decision: Separation of Duties (SoD) is enforced at the approval gateway level, preventing self-approval when SoD is enabled for an environment.

Consequences:

Prevents single-person deployment to sensitive environments
Configurable per environment
May slow down deployments
Requires minimum team size for SoD-enabled environments
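
A minimal sketch of the self-approval check, assuming a per-environment flag. Environment, SelfApprovalError, and check_approval are hypothetical names, and a real gateway would compare verified identities rather than plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    sod_enabled: bool  # Separation of Duties is configured per environment


class SelfApprovalError(Exception):
    """Raised when the requester tries to approve their own change."""


def check_approval(env: Environment, requester: str, approver: str) -> None:
    """Reject self-approval only where SoD is enabled."""
    if env.sod_enabled and requester == approver:
        raise SelfApprovalError(
            f"{approver} cannot approve their own request for {env.name}")
```

With sod_enabled set to False the same call passes, which reflects the per-environment configurability noted above.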

ADR-010: Version Stickers for Drift Detection

Status: Accepted

Context: Knowing what's actually deployed on targets is essential for audit and troubleshooting.

Decision: Every deployment writes a stella.version.json sticker file on the target containing release ID, digests, deployment timestamp, and deployer identity.

Consequences:

Enables drift detection (expected vs actual)
Provides audit trail on target hosts
Enables accurate "what's deployed where" queries
Requires file access on targets
Sticker corruption/deletion must be handled
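
Drift detection then reduces to comparing the expected digests with what the sticker reports. The sketch below is illustrative only: the JSON field names (release_id, digests, deployed_at, deployed_by) are assumptions, since the source lists the sticker's contents but not its schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_sticker(release_id: str, digests: dict[str, str],
                  deployer: str) -> str:
    """Serialize the sticker content (field names are hypothetical)."""
    return json.dumps({
        "release_id": release_id,
        "digests": digests,  # component name -> image digest
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "deployed_by": deployer,
    }, indent=2, sort_keys=True)


def detect_drift(expected: dict[str, str],
                 sticker_text: Optional[str]) -> list[str]:
    """Return a list of findings; an empty list means no drift."""
    if sticker_text is None:
        return ["sticker missing"]
    try:
        actual = json.loads(sticker_text)["digests"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["sticker unreadable"]
    if not isinstance(actual, dict):
        return ["sticker unreadable"]

    findings = []
    for name, digest in sorted(expected.items()):
        if actual.get(name) != digest:
            findings.append(
                f"{name}: expected {digest}, found {actual.get(name)}")
    for name in sorted(actual.keys() - expected.keys()):
        findings.append(f"{name}: present on target but not expected")
    return findings
```

Treating a missing or unreadable sticker as a finding, rather than as an error, lets the caller decide how to handle corruption or deletion.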

ADR-011: Security Gate Integration

Status: Accepted

Context: Security scanning exists as a separate concern; release orchestration should leverage but not duplicate it.

Decision: Security scanning remains in existing modules (Scanner, VEX). Release orchestration consumes scan results through a security gate that evaluates vulnerability thresholds.

Consequences:

Clear separation of concerns
Existing scanning investment preserved
Gate configuration determines block thresholds
Requires API integration with scanning modules
Policy engine evaluates security verdicts
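
A threshold-based gate of this kind might look roughly like the following. The severity names, GateConfig, and evaluate_gate are assumptions for illustration. The actual gate consumes results from the Scanner and VEX modules, whose formats are not described here.

```python
from dataclasses import dataclass

SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class GateConfig:
    # Maximum number of findings allowed per severity; a severity that is
    # absent from the mapping is not limited by this gate.
    max_allowed: dict[str, int]


def evaluate_gate(counts: dict[str, int],
                  config: GateConfig) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for a set of scan result counts."""
    reasons = []
    for severity in SEVERITIES:
        limit = config.max_allowed.get(severity)
        found = counts.get(severity, 0)
        if limit is not None and found > limit:
            reasons.append(f"{severity}: {found} found, {limit} allowed")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict gives the orchestrator something concrete to record when a promotion is blocked.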

ADR-012: gRPC for Agent Communication

Status: Accepted

Context: Agent communication requires efficient, bidirectional, and secure data transfer.

Decision: Use gRPC for agent communication with:

mTLS for transport security
Bidirectional streaming for logs and progress
Protocol buffers for efficient serialization

Consequences:

Efficient binary protocol
Strong typing via protobuf
Built-in streaming support
Requires gRPC infrastructure
Firewall considerations for gRPC traffic
