# 01. Requirements And Principles

## 1. Product Goal

Build a Serdica-owned workflow engine that can run the current Bulstrad workflow corpus without Elsa while preserving the existing service-level workflow product:

- workflow start
- task inbox and task lifecycle
- business-reference-based lookup
- runtime state inspection
- workflow diagrams
- canonical schema and canonical validation exposure
- workflow retention and hosted jobs

The engine must execute the same business behavior currently expressed in the declarative workflow DSL and canonical workflow definition model.

## 2. Functional Requirements

### 2.1 Workflow Definition Handling

The engine must:

- discover workflow registrations from authored C# workflow classes
- resolve the latest or exact workflow version through the existing registration catalog
- compile authored declarative workflows into canonical runtime definitions
- keep canonical validation as a first-class platform capability
- reject invalid or unsupported definitions during startup or validation

### 2.2 Workflow Start

The engine must:

- bind the untyped start payload to the workflow start request type
- resolve or derive business reference data
- initialize canonical workflow state
- execute the initial sequence until a wait boundary or completion
- create workflow projections and runtime state in one durable flow
- support workflow continuations created during start

### 2.3 Human Tasks

The engine must:

- activate human tasks with:
  - task type
  - route
  - workflow roles
  - task roles
  - runtime roles
  - payload
  - business reference
- preserve the current task assignment model:
  - assign to self
  - assign to user
  - assign to runtime roles
  - release
- expose completed and active task history through the existing projection model

### 2.4 Task Completion

The engine must:

- load the current workflow state and task context
- authorize completion through the existing service layer
- apply the completion payload
- continue execution from the task completion entry point
- produce next tasks, next waits, next continuations, or completion
- update runtime state and read projections durably

### 2.5 Runtime Semantics

The engine must support the semantic surface already present in declarative workflows:

- state assignment
- business reference assignment
- human task activation
- microservice calls
- legacy rabbit calls
- GraphQL calls
- HTTP calls
- conditional branches
- decision branches
- repeat loops
- subworkflow invocation
- continue-with orchestration
- timeout branches
- failure branches
- function-backed expressions

### 2.6 Subworkflows

The engine must:

- start child workflows
- persist parent resume frames
- carry child output back into parent state
- support nested resume across multiple levels
- preserve current declarative subworkflow semantics
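
One way to picture parent resume frames is the minimal Python sketch below. It is an illustration only, not the engine's actual C# model: it assumes each child instance persists a frame naming its parent, the position the parent resumes at, and the state key that receives the child output. All names (`ResumeFrame`, `Instance`, `complete_child`, `output_key`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ResumeFrame:
    """Hypothetical frame persisted when a parent starts a child workflow."""
    parent_instance_id: str
    resume_position: int   # where the parent continues after the child finishes
    output_key: str        # parent state key that receives the child output


@dataclass
class Instance:
    instance_id: str
    state: dict = field(default_factory=dict)
    position: int = 0
    status: str = "waiting"
    resume_frame: Optional[ResumeFrame] = None


def complete_child(
    child: Instance, output: dict, instances: Dict[str, Instance]
) -> Optional[Instance]:
    """Mark the child completed and hand its output to the parent, if any."""
    child.status = "completed"
    frame = child.resume_frame
    if frame is None:
        return None  # a root workflow has no parent to resume
    parent = instances[frame.parent_instance_id]
    parent.state[frame.output_key] = output
    parent.position = frame.resume_position
    parent.status = "running"
    return parent
```

Because every level stores its own frame, nested resume falls out naturally: completing a grandchild resumes the child, and completing the child resumes the root.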

### 2.7 Scheduling

The engine must support:

- timeouts
- retry wake-ups
- delayed continuation
- explicit wait-until behavior

This must happen without a steady-state polling loop.

### 2.8 Inspection And Operations

The service must continue to expose:

- workflow definitions
- workflow instances
- workflow tasks
- workflow task events
- workflow diagrams
- runtime state snapshots
- canonical schema
- canonical validation

## 3. Non-Functional Requirements

### 3.1 Multi-Instance Deployment

The service must support multiple application nodes against one shared Oracle database.

Implications:

- no single-node assumptions
- no in-memory-only correctness logic
- no sticky workflow ownership
- duplicate signal delivery must be safe

### 3.2 Durability

The system of record must be durable across:

- process restart
- node restart
- full cluster restart
- database restart

Workflow progress, pending waits, active tasks, and due timers must not be lost.

### 3.3 No Polling

Signal-driven wake-up is mandatory.

The engine must not rely on a periodic database scan loop to discover work. Blocking or event-driven delivery is required for:

- task completion wake-up
- delayed resume wake-up
- subworkflow completion wake-up
- external signal wake-up
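
The difference between blocking delivery and a scan loop can be sketched in a few lines of Python. Here `queue.Queue` is only an in-process stand-in for a durable queue with blocking dequeue; the signal strings and the `run_demo` helper are hypothetical.

```python
import queue
import threading


def worker(signals: queue.Queue, handled: list) -> None:
    """Consume signals with a blocking get; there is no sleep-and-scan loop."""
    while True:
        signal = signals.get()  # blocks until a message is available
        if signal is None:      # sentinel used only to stop this demo
            break
        handled.append(signal)


def run_demo() -> list:
    signals: queue.Queue = queue.Queue()
    handled: list = []
    thread = threading.Thread(target=worker, args=(signals, handled))
    thread.start()
    signals.put("task-completed:wf-1")
    signals.put("timer-due:wf-2")
    signals.put(None)
    thread.join()
    return handled
```

The worker consumes no CPU and issues no queries while idle; it wakes only when a message arrives.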

### 3.4 One Database

Oracle is the shared durable state backend for:

- workflow projections
- workflow runtime snapshots
- host coordination
- signal and schedule durability through Oracle AQ

Redis may exist in the wider platform, but it is not required for engine correctness.

### 3.5 Observability

The engine must produce enough telemetry to answer:

- which instance is waiting
- why it is waiting
- which signal resumed it
- which node executed it
- which definition version it used
- why it failed
- whether a message was retried, dead-lettered, or ignored as stale

### 3.6 Compatibility

The engine must preserve the existing public workflow service contracts unless a future product change explicitly changes them.

The following service-contract groups are especially important:

- workflow start contracts
- workflow definition contracts
- workflow task contracts
- workflow instance contracts
- workflow operational contracts

## 4. Explicit V1 Assumptions

These assumptions simplify the engine architecture and are intentional.

### 4.1 Single Active Runtime Provider Per Deployment

The service runs one engine provider at a time.

This means:

- no mixed-provider instance routing
- no live migration between engines
- no simultaneous old-runtime and engine execution inside one deployment

The design still keeps abstractions around the runtime, signaling bus, and scheduler so that future replacement remains possible.

### 4.2 Canonical Runtime, Not Elsa Activity Runtime

The target engine executes canonical workflow definitions directly.

Authored C# remains the source of truth, but runtime semantics are driven by canonical definitions compiled from that source.

### 4.3 Oracle AQ Is The Default Event Backbone

Oracle AQ is treated as part of the durable engine platform because it satisfies:

- one-database architecture
- blocking dequeue
- durable delivery
- delayed delivery
- transactional behavior

## 5. Design Principles

### 5.1 Keep The Product Surface Stable

The workflow service remains the product boundary. The engine is an internal subsystem.

### 5.2 Separate Read Model From Runtime Model

Task and instance projections are optimized for product reads.

Runtime snapshots are optimized for deterministic resume.

They are related, but they are not the same data structure.
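
As a rough illustration of the split, the Python sketch below keeps two separate shapes and derives the read-side row from the runtime side. The field names and the `project_task` helper are assumptions made for the example, not the service's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuntimeSnapshot:
    """Resume-oriented shape: everything needed to continue execution."""
    instance_id: str
    definition_version: str
    position: int
    state: dict
    pending_wait_id: Optional[str]


@dataclass(frozen=True)
class TaskProjection:
    """Read-oriented shape: flat fields a task inbox can query and display."""
    task_id: str
    instance_id: str
    task_type: str
    route: str
    business_reference: str
    status: str


def project_task(
    snapshot: RuntimeSnapshot, task_id: str, task_type: str, route: str
) -> TaskProjection:
    """Derive a read-model row; execution details such as position stay out."""
    return TaskProjection(
        task_id=task_id,
        instance_id=snapshot.instance_id,
        task_type=task_type,
        route=route,
        business_reference=str(snapshot.state.get("businessReference", "")),
        status="active",
    )
```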

### 5.3 Run To Wait

The engine should never keep a workflow instance “hot” in memory for correctness.

Execution should run until:

- a task is activated
- a timer is scheduled
- an external signal wait is registered
- the workflow completes

Then the snapshot is persisted and the instance is released from memory.
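
The run-to-wait loop can be sketched as follows. This is a simplified Python illustration under stated assumptions: a workflow is modeled as a flat list of step functions, and `Outcome`, `Snapshot`, `run_to_wait`, and the `save` callback are hypothetical names rather than engine types.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    CONTINUE = "continue"
    WAIT = "wait"          # task activated, timer scheduled, or signal wait registered
    COMPLETE = "complete"


@dataclass
class Snapshot:
    instance_id: str
    position: int = 0
    state: dict = field(default_factory=dict)
    status: str = "running"


Step = Callable[[dict], Outcome]


def run_to_wait(
    snapshot: Snapshot, steps: List[Step], save: Callable[[Snapshot], None]
) -> Snapshot:
    """Execute steps until a wait boundary or completion, then persist once."""
    while snapshot.position < len(steps):
        outcome = steps[snapshot.position](snapshot.state)
        snapshot.position += 1  # resume continues after the step that waited
        if outcome is Outcome.WAIT:
            snapshot.status = "waiting"
            break
        if outcome is Outcome.COMPLETE:
            snapshot.status = "completed"
            break
    else:
        # Ran off the end of the sequence without an explicit completion step.
        snapshot.status = "completed"
    save(snapshot)
    return snapshot
```

Resuming is then just another call to `run_to_wait` with a snapshot reloaded from storage, so any node can pick up the instance.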

### 5.4 Make Delivery At-Least-Once And Resume Idempotent

Distributed delivery is never exactly-once in practice.

The engine must treat duplicate signals, duplicate wake-ups, and late timer arrivals as normal conditions.
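
A common way to make resume idempotent is to compare each incoming signal against the wait the snapshot is currently parked on. The Python sketch below assumes a per-wait identifier stored in the snapshot; `WaitingInstance`, `Signal`, `try_resume`, and the result strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaitingInstance:
    instance_id: str
    version: int
    pending_wait_id: Optional[str]  # None when the instance is not waiting


@dataclass(frozen=True)
class Signal:
    instance_id: str
    wait_id: str


def try_resume(instance: WaitingInstance, signal: Signal) -> str:
    """Resume only if the signal matches the wait currently registered."""
    if instance.pending_wait_id != signal.wait_id:
        # Duplicate delivery or a late timer for a wait that was already consumed.
        return "ignored_stale"
    instance.pending_wait_id = None
    instance.version += 1
    return "resumed"
```

Delivering the same signal twice changes the instance once; the second delivery is recognized as stale and dropped.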

### 5.5 Keep Signals Small

Signals should identify work, not carry the full workflow state.

The database snapshot remains authoritative.
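
For illustration, a signal envelope under this principle might carry only identifiers, with the handler reloading state from the store. The field names and helpers in this Python sketch are assumptions, and a plain dictionary stands in for the snapshot store.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SignalEnvelope:
    """Identifies the work to resume; deliberately carries no workflow state."""
    instance_id: str
    wait_id: str
    signal_type: str


def encode(signal: SignalEnvelope) -> str:
    return json.dumps(asdict(signal), sort_keys=True)


def decode(raw: str) -> SignalEnvelope:
    return SignalEnvelope(**json.loads(raw))


def handle(raw: str, snapshots: dict) -> dict:
    """Look up the authoritative snapshot instead of trusting the message."""
    signal = decode(raw)
    return snapshots[signal.instance_id]
```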

### 5.6 Keep Abstractions At The Backend Boundary

Abstract:

- runtime provider
- signal bus
- schedule bus
- snapshot store

Do not abstract away the workflow semantics themselves.
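
A minimal Python sketch of what boundary-level abstraction could look like, using `typing.Protocol`. The interface shapes, the in-memory implementations, and `register_timeout` are hypothetical and only show that engine-side code depends on the interfaces, not on a concrete backend.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class SnapshotStore(Protocol):
    def load(self, instance_id: str) -> Optional[dict]: ...
    def save(self, instance_id: str, snapshot: dict) -> None: ...


class ScheduleBus(Protocol):
    def schedule(self, instance_id: str, wait_id: str, delay_seconds: float) -> None: ...


class InMemorySnapshotStore:
    """Test double; a real backend would persist to the database."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def load(self, instance_id: str) -> Optional[dict]:
        return self._rows.get(instance_id)

    def save(self, instance_id: str, snapshot: dict) -> None:
        self._rows[instance_id] = dict(snapshot)


class RecordingScheduleBus:
    """Test double that records requests instead of enqueueing delayed messages."""

    def __init__(self) -> None:
        self.scheduled: List[Tuple[str, str, float]] = []

    def schedule(self, instance_id: str, wait_id: str, delay_seconds: float) -> None:
        self.scheduled.append((instance_id, wait_id, delay_seconds))


def register_timeout(
    store: SnapshotStore,
    bus: ScheduleBus,
    instance_id: str,
    wait_id: str,
    delay_seconds: float,
) -> None:
    """Persist the pending wait, then request a delayed wake-up for it."""
    snapshot = store.load(instance_id) or {"instance_id": instance_id}
    snapshot["pending_wait_id"] = wait_id
    store.save(instance_id, snapshot)
    bus.schedule(instance_id, wait_id, delay_seconds)
```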

### 5.7 Prefer Transactional Consistency Over Cleverness

If a feature can be made transactional in Oracle, prefer that over eventually-consistent coordination tricks.

## 6. Success Criteria

The engine architecture is successful when:

- the service can start and complete workflows without Elsa
- task projections remain correct
- delayed resumes happen without polling
- a stopped cluster resumes safely after restart
- a multi-node deployment does not corrupt workflow state
- canonical definitions remain the execution contract
- operations can inspect and support the system with existing product-level APIs