Key Architectural Decisions
This document records significant architectural decisions and their rationale.
ADR-001: Digest-First Release Identity
Status: Accepted
Context: Container images can be referenced by tags (e.g., v1.2.3) or by digests (e.g., sha256:abc123...). Tags are mutable: the same tag can point to different images over time.
Decision: All releases are identified by immutable OCI digests, never tags. Tags are accepted as input but immediately resolved to digests at release creation time.
Consequences:
- Releases are immutable and reproducible
- A digest mismatch at pull time indicates tampering or corruption, and the deployment fails
- Rollback targets a specific digest, not the "previous tag"
- Requires registry integration for tag resolution
- Users see both tag (friendly) and digest (authoritative) in UI
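The digest checks this ADR relies on can be sketched as follows. The sketch assumes manifests are fetched as raw bytes, and `compute_digest`, `verify_pulled_manifest`, and `DigestMismatchError` are hypothetical names, not an existing API. An OCI content digest is the algorithm name plus the hex hash of the exact bytes, so any change to the manifest changes the digest.

```python
import hashlib
import re

# OCI sha256 digest format: "sha256:" followed by 64 lowercase hex characters.
_SHA256_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


class DigestMismatchError(Exception):
    """Hypothetical: raised when pulled content does not match the pinned digest."""


def compute_digest(manifest_bytes: bytes) -> str:
    """Digest of the exact manifest bytes, in OCI "algorithm:hex" form."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


def is_valid_digest(ref: str) -> bool:
    return bool(_SHA256_DIGEST.match(ref))


def verify_pulled_manifest(pinned_digest: str, manifest_bytes: bytes) -> None:
    """Fail the deployment if the pulled manifest differs from the release digest."""
    if not is_valid_digest(pinned_digest):
        raise ValueError(f"release must pin a sha256 digest, got {pinned_digest!r}")
    actual = compute_digest(manifest_bytes)
    if actual != pinned_digest:
        raise DigestMismatchError(f"expected {pinned_digest}, got {actual}")
```

Tag resolution, meaning asking the registry what v1.2.3 currently points to, happens once at release creation. From then on, only the digest is used.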
ADR-002: Evidence for Every Decision
Status: Accepted
Context: Compliance and audit requirements demand proof of what was deployed, when, by whom, and why.
Decision: Every promotion and deployment produces a cryptographically signed evidence packet that is immutable and append-only.
Consequences:
- Evidence table has no UPDATE/DELETE permissions
- Evidence enables audit-grade compliance reporting
- Evidence enables deterministic replay (same inputs + policy = same decision)
- Evidence packets are exportable for external audit systems
- Storage requirements increase over time
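A hash chain illustrates the append-only, signed property. Each packet stores the hash of its predecessor, so altering or removing any earlier packet breaks verification. The sketch uses HMAC-SHA256 with a shared key purely as a stand-in for a real signature scheme; a production system would more likely use asymmetric signatures. `EvidenceLog` and the packet fields are hypothetical. Canonical JSON (sorted keys, fixed separators) keeps the hashed bytes deterministic, and that determinism is also what makes replay comparisons meaningful.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


def canonical(obj) -> bytes:
    # Sorted keys and fixed separators: equal content always hashes identically.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


@dataclass
class EvidenceLog:
    key: bytes  # HMAC key, a stand-in for a real signing key (assumption)
    _packets: list = field(default_factory=list)

    def append(self, body: dict) -> dict:
        prev = self._packets[-1]["hash"] if self._packets else "0" * 64
        record = {"prev": prev, "body": body}
        packet = {
            **record,
            "hash": hashlib.sha256(canonical(record)).hexdigest(),
            "sig": hmac.new(self.key, canonical(record), hashlib.sha256).hexdigest(),
        }
        self._packets.append(packet)
        return packet

    # Deliberately no update() or delete(): the log is append-only.

    def verify(self) -> bool:
        """Recompute every hash and signature and check the chain links."""
        prev = "0" * 64
        for p in self._packets:
            record = {"prev": p["prev"], "body": p["body"]}
            expected_sig = hmac.new(self.key, canonical(record), hashlib.sha256).hexdigest()
            if p["prev"] != prev:
                return False
            if p["hash"] != hashlib.sha256(canonical(record)).hexdigest():
                return False
            if not hmac.compare_digest(p["sig"], expected_sig):
                return False
            prev = p["hash"]
        return True
```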
ADR-003: Plugin Architecture for Integrations
Status: Accepted
Context: Organizations use diverse toolchains (registries, CI/CD, vaults, notification systems). Hard-coding integrations limits adoption.
Decision: All integrations are implemented as plugins via a three-surface contract (Manifest, Connector Runtime, Step Provider). Core orchestration is stable and plugin-agnostic.
Consequences:
- Core has no hard-coded vendor integrations
- New integrations can be added without core changes
- Plugin failures cannot crash core (sandbox isolation)
- Plugin interface must be versioned and stable
- Additional complexity in plugin lifecycle management
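One way to keep the plugin interface versioned and stable is for core to check each manifest against the contract version it implements before loading anything. The sketch assumes semantic-versioned contracts where only the major version must match. The field names (`contract_version`, `surfaces`, `requires_network`) and `CORE_CONTRACT_MAJOR` are hypothetical and do not describe the actual manifest schema.

```python
from dataclasses import dataclass

CORE_CONTRACT_MAJOR = 1  # hypothetical: major version of the contract core implements
# The manifest itself is the first surface; these are the runtime surfaces it may declare.
KNOWN_SURFACES = {"connector_runtime", "step_provider"}


@dataclass(frozen=True)
class PluginManifest:
    name: str
    contract_version: str           # e.g. "1.4.0"
    surfaces: frozenset             # runtime surfaces the plugin provides
    requires_network: bool = False  # declared connectivity (see ADR-005)


def check_manifest(m: PluginManifest) -> list:
    """Return a list of problems; an empty list means core may load the plugin."""
    try:
        major = int(m.contract_version.split(".")[0])
    except ValueError:
        return [f"{m.name}: unparseable contract_version {m.contract_version!r}"]
    problems = []
    if major != CORE_CONTRACT_MAJOR:
        problems.append(f"{m.name}: contract major {major} != core {CORE_CONTRACT_MAJOR}")
    unknown = set(m.surfaces) - KNOWN_SURFACES
    if unknown:
        problems.append(f"{m.name}: unknown surfaces {sorted(unknown)}")
    return problems
```

Sandboxing is a separate concern and is not shown here. It would mean running plugins out of process so that a crash cannot take down core.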
ADR-004: No Feature Gating
Status: Accepted
Context: Enterprise software often gates security features behind premium tiers, creating "pay for security" anti-patterns.
Decision: All plans include all features. Pricing is based only on:
- Number of environments
- New digests analyzed per day
- Fair-use policy on deployments
Consequences:
- No feature flags tied to billing tier
- Transparent pricing without feature fragmentation
- May limit revenue optimization per customer
- Quota enforcement must be clear and user-friendly
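Because pricing depends only on usage, enforcement reduces to comparing counters against plan limits, with no per-feature checks anywhere. The minimal sketch below covers only the two hard dimensions and leaves out fair use on deployments, since that is not a fixed limit. `PlanLimits` and `check_quota` are hypothetical. Every refusal states the limit and the current usage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanLimits:
    environments: int
    new_digests_per_day: int


def check_quota(limits: PlanLimits, environments: int, new_digests_today: int,
                adding_env: int = 0, adding_digests: int = 0) -> Optional[str]:
    """Return None if the action is allowed, else a user-facing explanation."""
    if environments + adding_env > limits.environments:
        return (f"Environment limit reached: {environments} of "
                f"{limits.environments} in use.")
    if new_digests_today + adding_digests > limits.new_digests_per_day:
        return (f"Daily new-digest analysis limit reached: {new_digests_today} of "
                f"{limits.new_digests_per_day} analyzed today.")
    return None
```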
ADR-005: Offline-First Operation
Status: Accepted
Context: Many organizations operate in air-gapped or restricted network environments. Dependency on external services limits adoption.
Decision: All core operations must work in air-gapped environments. External data is synced via mirror bundles. Plugins may require connectivity; core does not.
Consequences:
- No runtime calls to external APIs for core decisions
- Advisory data synced via offline bundles
- Plugin connectivity requirements are declared in manifest
- Evidence packets exportable for external submission
- Additional complexity in data synchronization
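Offline sync works when the bundle carries its own integrity data, so an import can be verified with no network access. The sketch assumes a hypothetical bundle layout: a directory of data files plus an `index.json` that maps relative paths to SHA-256 hashes. In practice the index itself would also be signed; that step is omitted here.

```python
import hashlib
import json
from pathlib import Path


def verify_bundle(bundle_dir: str) -> list:
    """Check every file listed in index.json against its recorded SHA-256.

    Returns a list of problems; empty means the bundle can be imported.
    Only local file access is used, no network calls.
    """
    root = Path(bundle_dir).resolve()
    index = json.loads((root / "index.json").read_text())
    problems = []
    for rel_path, expected in index["files"].items():
        path = (root / rel_path).resolve()
        if not path.is_relative_to(root):  # reject "../" entries (Python 3.9+)
            problems.append(f"path escapes bundle: {rel_path}")
            continue
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        if hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```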
ADR-006: Agent-Based and Agentless Deployment
Status: Accepted
Context: Some organizations prefer agents for security isolation; others prefer agentless for simplicity.
Decision: Support both agent-based (persistent daemon on targets) and agentless (SSH/WinRM on demand) deployment models.
Consequences:
- Agent provides better performance and reliability
- Agentless reduces infrastructure footprint
- Unified task model abstracts deployment details
- Security model must handle both patterns
- Larger testing matrix
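The unified task model can be pictured as one task type dispatched through an interchangeable transport. In the sketch below, `DeployTask`, `Transport`, `AgentTransport`, and `AgentlessTransport` are hypothetical names. Both transports are stubs that only report what they would do; they do not open gRPC or SSH/WinRM connections.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class DeployTask:
    target: str
    image_digest: str  # always a digest, per ADR-001


class Transport(Protocol):
    def execute(self, task: DeployTask) -> str: ...


class AgentTransport:
    """Stub: would hand the task to the persistent agent on the target."""
    def execute(self, task: DeployTask) -> str:
        return f"agent:{task.target}:{task.image_digest}"


class AgentlessTransport:
    """Stub: would open an SSH or WinRM session on demand and run the task."""
    def __init__(self, protocol: str = "ssh"):
        self.protocol = protocol

    def execute(self, task: DeployTask) -> str:
        return f"{self.protocol}:{task.target}:{task.image_digest}"


def run(task: DeployTask, transport: Transport) -> str:
    # Orchestration code sees only DeployTask and Transport, never the mechanism.
    return transport.execute(task)
```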
ADR-007: PostgreSQL as Primary Database
Status: Accepted
Context: Database choice affects scalability, operations, and feature availability.
Decision: PostgreSQL (16+) as the primary database with:
- Per-module schema isolation
- Row-level security for multi-tenancy
- JSONB for flexible configuration
- Append-only triggers for evidence tables
Consequences:
- Proven scalability and reliability
- Rich feature set (JSONB, RLS, triggers)
- Single database technology to operate
- Requires PostgreSQL expertise
- Schema migrations must be carefully managed
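The evidence-table requirements from this ADR and ADR-002 map onto standard PostgreSQL features. To stay in Python, the sketch generates the DDL as a migration string. The schema name, the table layout, and the `app.tenant_id` session setting used by the RLS policy are all assumptions. Two details matter here. RLS does not apply to the table owner unless `FORCE ROW LEVEL SECURITY` is set. Row-level triggers do not fire on TRUNCATE, so a separate statement-level trigger is needed to block it.

```python
def evidence_migration(schema: str) -> str:
    """Return PostgreSQL DDL for an append-only, tenant-isolated evidence table."""
    # Python's isidentifier() is only a rough guard against injected SQL.
    if not schema.isidentifier():
        raise ValueError("schema must be a plain identifier")
    return f"""
CREATE SCHEMA IF NOT EXISTS {schema};

CREATE TABLE {schema}.evidence (
    id          bigserial   PRIMARY KEY,
    tenant_id   uuid        NOT NULL,
    packet      jsonb       NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Sessions see only their tenant's rows; an unset app.tenant_id raises an error.
ALTER TABLE {schema}.evidence ENABLE ROW LEVEL SECURITY;
ALTER TABLE {schema}.evidence FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON {schema}.evidence
    USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Append-only: reject UPDATE, DELETE, and TRUNCATE.
CREATE FUNCTION {schema}.reject_evidence_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION 'evidence is append-only';
END;
$$;

CREATE TRIGGER evidence_no_update_delete
    BEFORE UPDATE OR DELETE ON {schema}.evidence
    FOR EACH ROW EXECUTE FUNCTION {schema}.reject_evidence_change();

CREATE TRIGGER evidence_no_truncate
    BEFORE TRUNCATE ON {schema}.evidence
    FOR EACH STATEMENT EXECUTE FUNCTION {schema}.reject_evidence_change();
"""
```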
ADR-008: Workflow Engine with DAG Execution
Status: Accepted
Context: Deployment workflows need conditional logic, parallel execution, error handling, and rollback support.
Decision: Implement a DAG-based workflow engine where:
- Workflows are templates with nodes (steps) and edges (dependencies)
- Steps execute when all dependencies are satisfied
- Expressions reference previous step outputs
- Built-in support for approval, retry, timeout, and rollback
Consequences:
- Flexible workflow composition
- Visual representation in UI
- Complex error handling scenarios supported
- Learning curve for workflow authors
- Expression engine security considerations
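Dependency-ordered execution is a topological traversal. The sketch uses Python's standard `graphlib.TopologicalSorter` to run each step once all of its dependencies have finished. It resolves a hypothetical `${steps.<id>.outputs.<key>}` expression syntax with a fixed pattern lookup rather than `eval`, which is one way to address the expression-engine security concern. Steps run sequentially here. A real engine would dispatch each ready batch in parallel and add retry, timeout, approval, and rollback handling.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical expression syntax: ${steps.<step>.outputs.<key>}
_REF = re.compile(r"\$\{steps\.([\w-]+)\.outputs\.([\w-]+)\}")


def resolve(template: str, outputs: dict) -> str:
    """Substitute references to earlier step outputs; no code is evaluated."""
    return _REF.sub(lambda m: str(outputs[m.group(1)][m.group(2)]), template)


def run_workflow(steps: dict, edges: dict) -> dict:
    """steps: id -> callable(resolve_fn) returning a dict of outputs.
    edges: id -> set of step ids it depends on."""
    sorter = TopologicalSorter({s: edges.get(s, set()) for s in steps})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    outputs = {}
    while sorter.is_active():
        for step_id in sorter.get_ready():  # all dependencies of these are done
            outputs[step_id] = steps[step_id](lambda t: resolve(t, outputs))
            sorter.done(step_id)
    return outputs
```

For example, with `edges = {"scan": {"build"}, "deploy": {"build", "scan"}}`, the steps run in the order build, scan, deploy. A deploy step can then refer to `${steps.build.outputs.digest}`.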
ADR-009: Separation of Duties Enforcement
Status: Accepted
Context: Compliance requires that the person requesting a change cannot be the same person approving it.
Decision: Separation of Duties (SoD) is enforced at the approval gateway level, preventing self-approval when SoD is enabled for an environment.
Consequences:
- Prevents single-person deployment to sensitive environments
- Configurable per environment
- May slow down deployments
- Requires minimum team size for SoD-enabled environments
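The gateway check itself is small. What matters is that it runs server-side against authenticated identities. The sketch uses hypothetical names (`ApprovalRequest`, `sod_enabled`, `required_approvals`) and assumes identities are compared as stable user IDs rather than display names.

```python
from dataclasses import dataclass, field


class SoDViolation(Exception):
    pass


@dataclass
class ApprovalRequest:
    requester_id: str
    sod_enabled: bool
    required_approvals: int = 1
    approvers: set = field(default_factory=set)

    def approve(self, approver_id: str) -> None:
        # Self-approval is rejected only when SoD is enabled for the environment.
        if self.sod_enabled and approver_id == self.requester_id:
            raise SoDViolation("requester cannot approve their own change")
        self.approvers.add(approver_id)  # a set: repeat approvals count once

    def is_approved(self) -> bool:
        return len(self.approvers) >= self.required_approvals
```

With SoD on and `required_approvals=2`, an environment needs at least three people: one requester and two distinct approvers. That is the minimum-team-size consequence in concrete terms.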
ADR-010: Version Stickers for Drift Detection
Status: Accepted
Context: Knowing what's actually deployed on targets is essential for audit and troubleshooting.
Decision: Every deployment writes a stella.version.json sticker file on the target containing the release ID, digests, deployment timestamp, and deployer identity.
Consequences:
- Enables drift detection (expected vs actual)
- Provides audit trail on target hosts
- Enables accurate "what's deployed where" queries
- Requires file access on targets
- Sticker corruption/deletion must be handled
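The sketch below writes and checks the sticker, assuming a JSON layout with hypothetical field names (`release_id`, `digests`, `deployed_at`, `deployed_by`). On POSIX filesystems, writing to a temporary file in the same directory and renaming it with `os.replace` means a reader never sees a half-written sticker. The drift check reports a missing or unparseable sticker as its own state rather than crashing.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

STICKER = "stella.version.json"


def write_sticker(directory: str, release_id: str, digests: dict, deployer: str) -> None:
    data = {
        "release_id": release_id,
        "digests": digests,
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        "deployed_by": deployer,
    }
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, os.path.join(directory, STICKER))


def check_drift(directory: str, expected_digests: dict) -> str:
    """Return 'in_sync', 'drifted', 'missing', or 'corrupt'."""
    path = Path(directory, STICKER)
    if not path.exists():
        return "missing"
    try:
        actual = json.loads(path.read_text())["digests"]
    except (ValueError, KeyError, TypeError):  # bad JSON, missing key, wrong shape
        return "corrupt"
    return "in_sync" if actual == expected_digests else "drifted"
```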
ADR-011: Security Gate Integration
Status: Accepted
Context: Security scanning exists as a separate concern; release orchestration should leverage but not duplicate it.
Decision: Security scanning remains in existing modules (Scanner, VEX). Release orchestration consumes scan results through a security gate that evaluates vulnerability thresholds.
Consequences:
- Clear separation of concerns
- Existing scanning investment preserved
- Gate configuration determines block thresholds
- Requires API integration with scanning modules
- Policy engine evaluates security verdicts
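A threshold gate over scan results can be sketched as below. It assumes findings arrive as simple records carrying a severity and an optional VEX status. The severity names, the threshold shape, and the rule that excludes findings whose VEX status is `not_affected` or `fixed` are all assumptions; they do not describe the actual Scanner or VEX output format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    cve: str
    severity: str                     # e.g. "critical", "high", "medium", "low"
    vex_status: Optional[str] = None  # e.g. "not_affected", "affected", "fixed"


def evaluate_gate(findings: list, max_allowed: dict) -> tuple:
    """Return (passed, counts). max_allowed maps severity -> maximum permitted
    count; severities absent from max_allowed are not limited."""
    # Assumed rule: VEX statements of not_affected or fixed remove a finding.
    relevant = [f for f in findings if f.vex_status not in ("not_affected", "fixed")]
    counts = Counter(f.severity for f in relevant)
    passed = all(counts[sev] <= limit for sev, limit in max_allowed.items())
    return passed, dict(counts)
```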
ADR-012: gRPC for Agent Communication
Status: Accepted
Context: Agent communication requires efficient, bidirectional, and secure data transfer.
Decision: Use gRPC for agent communication with:
- mTLS for transport security
- Bidirectional streaming for logs and progress
- Protocol buffers for efficient serialization
Consequences:
- Efficient binary protocol
- Strong typing via protobuf
- Built-in streaming support
- Requires gRPC infrastructure
- Firewall considerations for gRPC traffic
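On the agent side, mTLS in gRPC comes down to presenting a client certificate and trusting only the orchestrator's CA. The sketch uses the `grpcio` package, imported lazily inside the function so the PEM helper works without it. The file paths, target address, and keepalive values are assumptions. The generated stub for the bidirectional stream would come from the project's own `.proto` definitions, which are not shown. Keepalive pings are included because firewalls and NAT devices often cut long-lived streams that look idle.

```python
from pathlib import Path


def read_pem(path: str) -> bytes:
    """Read a PEM file and fail early if it is obviously not PEM."""
    data = Path(path).read_bytes()
    if b"-----BEGIN" not in data:
        raise ValueError(f"{path} does not look like a PEM file")
    return data


def make_agent_channel(target: str, ca_path: str, cert_path: str, key_path: str):
    import grpc  # requires the grpcio package

    credentials = grpc.ssl_channel_credentials(
        root_certificates=read_pem(ca_path),    # trust only the orchestrator's CA
        private_key=read_pem(key_path),         # the agent's own key and
        certificate_chain=read_pem(cert_path),  # certificate: the "mutual" part
    )
    options = [
        ("grpc.keepalive_time_ms", 30_000),     # send a ping every 30 s
        ("grpc.keepalive_timeout_ms", 10_000),  # close if no ack within 10 s
    ]
    return grpc.secure_channel(target, credentials, options=options)
```

The server must be configured to accept pings at this rate, because gRPC servers enforce a minimum ping interval by default. The two settings have to be agreed together.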