# Feature Verification Pipeline - FLOW

This document defines the state machine, tier system, artifact format, and priority rules
for the automated feature verification pipeline.

All agents in the pipeline MUST read this document before taking any action.

> **THE PRIMARY GOAL IS END-TO-END BEHAVIORAL VERIFICATION.**
>
> This pipeline exists to prove that features **actually work** by exercising them
> as a real user would -- through APIs, CLIs, UIs, and integration tests. Tier 0
> (file checks) and Tier 1 (build + unit tests) are necessary prerequisites, but
> they are NOT the goal. **Tier 2 (E2E behavioral verification) is the goal.**
>
> Agents MUST:
>
> 1. Start Docker / Docker Desktop before running any checks
> 2. Set up required containers (Postgres, Redis, RabbitMQ, etc.)
> 3. Start application services needed for behavioral testing
> 4. Run ALL tiers, including Tier 2 -- never stop at Tier 1
> 5. Act as a user: call APIs, run CLI commands, interact with UIs
>
> **Skipping Tier 2 is a verification failure, not a verification pass.**

> **EXECUTION ORDER IS PROBLEMS-FIRST (MANDATORY).**
>
> Agents MUST resolve in-progress/problem states before starting any new `queued` feature.
> A "problem/in-progress" state is any of:
> `checking`, `failed`, `triaged`, `confirmed`, `fixing`, `retesting`.
>
> If any feature exists in those states, starting a new `queued` feature is a FLOW violation.
> Only after all such features reach a terminal state (`done`, `blocked`, `skipped`, `not_implemented`)
> may the pipeline continue with `queued` features.

## 0. Execution Preflight (Mandatory Checklist)

Before selecting any feature, run this checklist in order:

1. Scan every state file in `docs/qa/feature-checks/state/*.json`.
2. If any feature is in `checking`, `failed`, `triaged`, `confirmed`, `fixing`, or `retesting`, pick from those problem features only.
3. Do not start any `queued` feature until all problem features are terminal (`done`, `blocked`, `skipped`, `not_implemented`).
4. Record the selected feature's state transition in the state notes before running Tier 0.
5. Create a fresh run directory (`run-XYZ`) before collecting any tier artifacts.
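
Steps 1-3 can be sketched as a shell check. The fixture directory and file below are hypothetical stand-ins for `docs/qa/feature-checks/state/*.json`; `grep` keeps the sketch dependency-free, though `jq` would parse the JSON more robustly in a real run.

```shell
# Hypothetical fixture standing in for docs/qa/feature-checks/state/*.json;
# a real run would point STATE_DIR at that directory instead.
STATE_DIR=$(mktemp -d)
cat > "$STATE_DIR/gateway.json" <<'EOF'
{ "features": { "f1": { "status": "failed" }, "f2": { "status": "queued" } } }
EOF

# Steps 1-3: if any feature is in a problem state, queued work must wait.
if grep -qE '"status": "(checking|failed|triaged|confirmed|fixing|retesting)"' \
    "$STATE_DIR"/*.json; then
  decision="problem-first"   # pick only from problem features
else
  decision="queued-ok"       # all terminal: queued selection is allowed
fi
echo "$decision"
```

Here the fixture contains a `failed` feature, so the scan resolves to problem-first selection.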

---

## 1. Directory Layout

```
docs/features/
  unchecked/<module>/<feature>.md       # Input: features to verify (1,144 files)
  checked/<module>/<feature>.md         # Output: features that passed verification
  unimplemented/<module>/<feature>.md   # Not implemented / intentionally dropped

docs/qa/feature-checks/
  FLOW.md                               # This file (state machine spec)
  state/<module>.json                   # Per-module state ledger (one file per module)
  runs/<module>/<feature>/<runId>/      # Artifacts per verification run
```

---

## 2. State Machine

### 2.1 States

| State | Meaning |
|-------|---------|
| `queued` | Discovered, not yet processed |
| `checking` | Feature checker is running |
| `passed` | All tier checks passed |
| `failed` | Check found issues (pre-triage) |
| `triaged` | Issue-finder identified root cause |
| `confirmed` | Issue-confirmer validated triage |
| `fixing` | Fixer is implementing the fix |
| `retesting` | Retester is re-running checks |
| `done` | Verified and moved to `checked/` |
| `blocked` | Requires human intervention |
| `skipped` | Cannot be automatically verified (manual-only) |
| `not_implemented` | Source files missing despite sprint claiming DONE |

### 2.2 Transitions

```
queued ──────────────> checking
                          │
                ┌─────────┼─────────────┐
                v         v             v
             passed    failed    not_implemented
                │         │             │
                v         v             │  (move file back
              done     triaged          │   to unimplemented/)
                │         │             v
                │         v         [terminal]
                │     confirmed
                │         │
                │         v
                │       fixing
                │         │
                │         v
                │     retesting
                │       │    │
                │       v    v
                │     done  failed ──> (retry or blocked)
                │
                v
     [move file to checked/]
```
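
The diagram's legal edges can be encoded as a small lookup. This is a sketch transcribed from the diagram above, not an exhaustive product requirement; the retry edge back into the loop is intentionally omitted since it re-enters via `failed`.

```shell
# Legality check for state transitions, transcribed from the 2.2 diagram.
is_legal() {
  case "$1->$2" in
    "queued->checking"|"checking->passed"|"checking->failed"|"checking->not_implemented"|\
    "passed->done"|"failed->triaged"|"triaged->confirmed"|"confirmed->fixing"|\
    "fixing->retesting"|"retesting->done"|"retesting->failed") echo "legal" ;;
    *) echo "illegal" ;;
  esac
}

is_legal failed triaged   # a valid edge in the failure loop
is_legal queued done      # skipping the pipeline is not allowed
```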

### 2.3 Retry Policy

- Maximum retry count: **3** per feature
- After 3 retries with failures: transition to `blocked`
- Blocked features require human review before re-entering the pipeline
- Each retry increments `retryCount` in state
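
The retry decision reduces to one comparison; `retry_count` below is a stand-in for the feature's `retryCount` field from the state file.

```shell
# Retry policy from 2.3: three failed retries exhausts the budget.
MAX_RETRIES=3
retry_count=3              # stand-in for the feature's `retryCount` field

if [ "$retry_count" -ge "$MAX_RETRIES" ]; then
  next_state="blocked"     # retries exhausted: human review required
else
  next_state="failed"      # stays in the failure loop for another attempt
fi
echo "$next_state"
```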

### 2.4 Skip Criteria (STRICT -- almost nothing qualifies)

Features may ONLY be marked `skipped` if they match one of these 3 physical constraints:

- `hardware_required`: Requires a physical HSM, smart card, or eIDAS hardware token
- `multi_datacenter`: Requires geographically distributed infrastructure
- `air_gap_network`: Requires a physically disconnected network

**Everything else MUST be tested.** Features that were previously classified as
"performance benchmarking" or "multi-node cluster" should be tested with whatever
scale is available locally (single-node Docker, local containers). Partial behavioral
verification is better than no verification.

The checker agent determines skip eligibility during Tier 0 and MUST justify the
skip with one of the 3 reasons above. Any other reason is invalid.

---

## 3. Tier System

Verification proceeds in tiers. Each tier is a gate -- a feature must pass
the current tier before advancing to the next. **A feature is NOT verified
until ALL applicable tiers pass.** File existence alone is not verification.

### Tier 0: Source Verification (fast, cheap)

**Purpose**: Verify that the source files referenced in the feature file actually exist.

**Process**:

1. Read the feature `.md` file
2. Extract file paths from the `## Implementation Details`, `## Key files`, or `## What's Implemented` sections
3. For each path, check whether the file exists on disk
4. Extract class/interface names and grep for their declarations

**Outcomes**:

- All key files found: `source_verified = true`, advance to Tier 1
- Key files missing (>50% absent): `status = not_implemented`
- Some files missing (<50% absent): `source_verified = partial`, add a note, advance to Tier 1
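
The 50% threshold can be applied with integer arithmetic, avoiding floating point; the counts here are illustrative stand-ins for the file-existence pass over a feature's key files.

```shell
# Tier 0 outcome rule: more than half the key files absent => not_implemented.
checked=4                       # total key files referenced by the feature
missing=3                       # how many of them are absent on disk

if [ "$missing" -eq 0 ]; then
  outcome="source_verified"
elif [ $((missing * 2)) -gt "$checked" ]; then
  outcome="not_implemented"     # >50% of the key files are absent
else
  outcome="partial"             # some missing: note it, advance to Tier 1
fi
echo "$outcome"
```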

**What this proves**: The code exists on disk. Nothing more.

**Cost**: ~0.01 USD per feature (file existence checks only)

### Tier 1: Build + Code Review (medium)

**Purpose**: Verify that the module compiles, tests pass, AND the code actually implements
the described behavior.

**Process**:

1. Identify the `.csproj` file(s) for the feature's module
2. Run `dotnet build <project>.csproj` and capture the output
3. Run `dotnet test <test-project>.csproj --filter <relevant-filter>` -- tests MUST actually execute and pass
4. For Angular/frontend features: run `npx ng build` and `npx ng test` for the relevant library/app
5. **Code review** (CRITICAL): Read the key source files and verify:
   - The classes/methods described in the feature file actually contain the logic claimed
   - The feature description matches what the code does (not just that it exists)
   - Tests cover the core behavior described in the feature (not just compilation)
6. If the build succeeds but tests are blocked by upstream dependency errors:
   - Record as `build_verified = true, tests_blocked_upstream = true`
   - The feature CANNOT advance to `passed` -- mark it `failed` with category `env_issue`
   - The upstream blocker must be resolved before the feature can pass

**Code Review Checklist** (must answer YES to all):

- [ ] Does the main class/service exist with a non-trivial implementation (not stubs/TODOs)?
- [ ] Does the logic match what the feature description claims?
- [ ] Are there unit tests that exercise the core behavior?
- [ ] Do those tests actually assert meaningful outcomes (not just "doesn't throw")?

**Outcomes**:

- Build + tests pass + code review confirms behavior: `build_verified = true`, advance to Tier 2
- Build fails: `status = failed`, record build errors
- Tests fail or are blocked: `status = failed`, record the reason
- Code review finds stubs/missing logic: `status = failed`, category = `missing_code`
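
Turning a build result into the Tier 1 artifact can be sketched as follows. The `build_output` string simulates the tail of a `dotnet build` run, and the project name is a placeholder; the artifact shape follows the table in Section 5.

```shell
# build_output is a simulated `dotnet build` tail; a real run would capture
# the actual command output instead of hard-coding it.
build_output="Build succeeded. 0 Warning(s) 0 Error(s)"

case "$build_output" in
  *"Build succeeded"*) build_result="pass" ;;
  *)                   build_result="fail" ;;
esac

artifact=$(mktemp)
printf '{ "project": "Gateway.csproj", "buildResult": "%s", "errors": [] }\n' \
  "$build_result" > "$artifact"
cat "$artifact"
```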

**What this proves**: The code compiles, tests pass, and someone has verified the code
does what it claims.

**Cost**: ~0.05 USD per feature (compile + test execution + code reading)

### Tier 2: Behavioral Verification (API / CLI / UI) -- THE MAIN PURPOSE

**Purpose**: Verify the feature works end-to-end by actually exercising it through
its external interface. This is the only tier that proves the feature WORKS, not
just that code exists. **This is the primary reason the verification pipeline exists.**

**EVERY feature MUST have a Tier 2 check. E2E tests MUST NOT be skipped.** The whole
point of this pipeline is to act as a user and verify the software works. Tier 0 and
Tier 1 are prerequisites -- Tier 2 is the actual verification.

**If the environment is not set up, set it up.** If Docker is not running, start it.
If containers are not running, start them. If the app is not running, start it.
"Environment not ready" is never an excuse to skip -- it is a setup step the agent
must perform (see Section 9).

The check type depends on the module's external surface.

### Tier 2 Acceptance Gate (HARD REQUIREMENT)

A Tier 2 run is valid only if ALL of the following are true:

1. It uses a new run directory (`run-XYZ`) created for the current execution.
2. It contains fresh evidence captured in this run (new timestamps and new command/request outputs).
3. It includes user-surface interactions (HTTP requests, CLI invocations, or UI interactions), not only library test counts.
4. It verifies both positive and negative behavior paths when the feature has error semantics.
5. For rechecks, at least one new user transaction per feature is captured in the new run.
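
Gate item 2 can be probed mechanically by comparing evidence timestamps against the run's start time. The timestamps below are illustrative, and `date -d` is the GNU coreutils form (BSD/macOS `date` uses `-j -f` instead).

```shell
# Evidence must postdate the start of the current run (gate item 2).
run_started="2026-02-10T11:59:00Z"
captured="2026-02-10T12:00:01Z"     # capturedAtUtc from the Tier 2 artifact

if [ "$(date -u -d "$captured" +%s)" -gt "$(date -u -d "$run_started" +%s)" ]; then
  freshness="fresh"
else
  freshness="stale"                 # evidence predates this run: gate fails
fi
echo "$freshness"
```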

The following are forbidden and invalidate Tier 2:

- Copying a previous run directory and editing only the `runId`, timestamps, or summary text.
- Declaring a Tier 2 pass from suite totals alone, without fresh request/response, command output, or UI step evidence.
- Reusing screenshots or response payloads from prior runs without replaying the interaction.

If any forbidden shortcut is detected, mark the feature `failed` with category `test_gap`
and rerun Tier 2 from scratch.

#### Tier 2a: API Testing (Gateway, Router, Api, Platform, backend services with HTTP endpoints)

**Process**:

1. Ensure the service is running (check the port, or start it via `docker compose up`)
2. Send HTTP requests to the feature's endpoints using `curl` or a test script
3. Verify response status codes, headers, and body structure
4. Test error cases (unauthorized, bad input, rate limited, etc.)
5. Verify the behavior described in the feature file actually happens

**Example for `gateway-identity-header-strip`**:

```bash
# Send a request with a spoofed identity header and capture the response headers
curl -s -D - -o /dev/null -H "X-Forwarded-User: attacker" http://localhost:5000/api/test
# Verify the header was stripped: the response must reflect the authenticated
# identity, not the spoofed one
```

**Artifact**: `tier2-api-check.json`

```json
{
  "type": "api",
  "baseUrl": "http://localhost:5000",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "requests": [
    {
      "description": "Verify spoofed identity header is stripped",
      "method": "GET",
      "path": "/api/test",
      "headers": { "X-Forwarded-User": "attacker" },
      "expectedStatus": 200,
      "actualStatus": 200,
      "assertion": "Response X-Forwarded-User header matches authenticated user, not 'attacker'",
      "result": "pass|fail",
      "evidence": "actual response headers/body",
      "requestCapturedAtUtc": "2026-02-10T12:00:01Z",
      "responseSnippet": "HTTP/1.1 200 ..."
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2b: CLI Testing (Cli, Tools, Bench modules)

**Process**:

1. Build the CLI tool if needed
2. Run the CLI command described in the feature's E2E Test Plan
3. Verify stdout/stderr output matches the expected behavior
4. Test error cases (invalid args, missing config, etc.)
5. Verify exit codes

**Example for `cli-baseline-selection-logic`**:

```bash
stella scan --baseline last-green myimage:latest
# Verify output shows the baseline was selected correctly
echo $?  # Verify exit code 0
```

**Artifact**: `tier2-cli-check.json`

```json
{
  "type": "cli",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "commands": [
    {
      "description": "Verify baseline selection with last-green strategy",
      "command": "stella scan --baseline last-green myimage:latest",
      "expectedExitCode": 0,
      "actualExitCode": 0,
      "expectedOutput": "Using baseline: ...",
      "actualOutput": "...",
      "result": "pass|fail",
      "commandCapturedAtUtc": "2026-02-10T12:00:01Z"
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2c: UI Testing (Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry)

**Process**:

1. Ensure the Angular app is running (`ng serve` or Docker)
2. Use the Playwright CLI or MCP to navigate to the feature's UI route
3. Follow the E2E Test Plan steps: verify elements render, interactions work, data displays
4. Capture screenshots as evidence
5. Test accessibility (keyboard navigation, ARIA labels) if listed in the E2E plan

**Example for `pipeline-run-centric-view`**:

```bash
npx playwright test --grep "pipeline-run" --reporter=json
# Or manually via MCP: navigate to /release-orchestrator/runs, verify the table renders
```

**Artifact**: `tier2-ui-check.json`

```json
{
  "type": "ui",
  "baseUrl": "http://localhost:4200",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "steps": [
    {
      "description": "Navigate to /release-orchestrator/runs",
      "action": "navigate",
      "target": "/release-orchestrator/runs",
      "expected": "Runs list table renders with columns",
      "result": "pass|fail",
      "screenshot": "step-1-runs-list.png",
      "stepCapturedAtUtc": "2026-02-10T12:00:01Z"
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2d: Library/Internal Testing (Attestor, Policy, Scanner, etc. with no external surface)

For modules with no HTTP/CLI/UI surface, Tier 2 means running **targeted
integration tests** or **behavioral unit tests** that prove the feature logic:

**Process**:

1. Identify tests that specifically exercise the feature's behavior
2. Run those tests: `dotnet test --filter "FullyQualifiedName~FeatureClassName"`
3. Read the test code to confirm it asserts meaningful behavior (not just "compiles")
4. If no behavioral tests exist, write a focused test and run it

**Example for `evidence-weighted-score-model`**:

```bash
dotnet test --filter "FullyQualifiedName~EwsCalculatorTests"
# Verify: normalizers produce the expected dimension scores
# Verify: guardrails cap/floor scores correctly
# Verify: the composite score is deterministic for the same inputs
```

**Artifact**: `tier2-integration-check.json`

```json
{
  "type": "integration",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "testFilter": "FullyQualifiedName~EwsCalculatorTests",
  "testsRun": 21,
  "testsPassed": 21,
  "testsFailed": 0,
  "behaviorVerified": [
    "6-dimension normalization produces expected scores",
    "Guardrails enforce caps and floors",
    "Composite score is deterministic"
  ],
  "verdict": "pass|fail"
}
```

### When to skip Tier 2 (ALMOST NEVER)

**Default: Tier 2 is MANDATORY.** Agents must exhaust all options before marking a skip.

The ONLY acceptable skip reasons (must match exactly one):

- `hardware_required`: Feature requires a physical HSM, smart card, or eIDAS token
- `multi_datacenter`: Feature requires geographically distributed infrastructure
- `air_gap_network`: Feature requires a physically disconnected network (not just no internet)

**These are NOT valid skip reasons:**

- "The app isn't running" -- **START IT** (see Section 9). If it won't start, mark `failed` with `env_issue`.
- "Docker isn't running" -- **START DOCKER** (see Section 9.0). If it won't start, mark `failed` with `env_issue`.
- "No E2E tests exist" -- **WRITE ONE.** A focused behavioral test that exercises the feature as a user would.
- "The database isn't set up" -- **SET IT UP** using Docker containers (see Section 9.1).
- "Environment not ready" -- **PREPARE IT.** That is part of the agent's job, not an excuse.
- "Too complex to test" -- Break it into smaller testable steps. Test what you can.
- "Only unit tests needed" -- Unit tests are Tier 1. Tier 2 is behavioral/integration/E2E.
- "Application not running" -- See "The app isn't running" above.

**If an agent skips Tier 2 without one of the 3 valid reasons above, the entire
feature verification is INVALID and must be re-run.**

### Tier Classification by Module

| Tier 2 Type | Modules | Feature Count |
|-------------|---------|---------------|
| 2a (API) | Gateway, Router, Api, Platform | ~30 |
| 2b (CLI) | Cli, Tools, Bench | ~110 |
| 2c (UI/Playwright) | Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry | ~190 |
| 2d (Integration) | Attestor, Policy, Scanner, BinaryIndex, Concelier, Libraries, EvidenceLocker, Orchestrator, Signals, Authority, Signer, Cryptography, ReachGraph, Graph, RiskEngine, Replay, Unknowns, Scheduler, TaskRunner, Timeline, Notifier, Findings, SbomService, Mirror, Feedser, Analyzers | ~700 |
| Manual (skip) | AirGap (subset), SmRemote (HSM), DevOps (infra) | ~25 |

---

## 4. State File Format

Per-module state files live at `docs/qa/feature-checks/state/<module>.json`.

```json
{
  "module": "gateway",
  "featureCount": 8,
  "lastUpdatedUtc": "2026-02-09T12:00:00Z",
  "features": {
    "router-back-pressure-middleware": {
      "status": "queued",
      "tier": 0,
      "retryCount": 0,
      "sourceVerified": null,
      "buildVerified": null,
      "e2eVerified": null,
      "skipReason": null,
      "lastRunId": null,
      "lastUpdatedUtc": "2026-02-09T12:00:00Z",
      "featureFile": "docs/features/unchecked/gateway/router-back-pressure-middleware.md",
      "notes": []
    }
  }
}
```

### State File Rules

- **Single writer**: Only the orchestrator writes state files
- **Subagents report back**: Subagents return results to the orchestrator via their output; they do NOT write state files directly
- **Atomic updates**: Each state transition must update `lastUpdatedUtc`
- **Append-only notes**: The `notes` array is append-only; never remove entries
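
One way to honor the single-writer/atomic-update rules is the classic write-to-temp-then-rename pattern. The fixture file and `sed` edit below are illustrative; a real orchestrator would rewrite the JSON with a proper parser.

```shell
# Fixture state file; real files live under docs/qa/feature-checks/state/.
STATE="$(mktemp -d)/gateway.json"
printf '{ "module": "gateway" }\n' > "$STATE"

# Write the updated document to a temp file, then rename: readers never
# observe a half-written state file (rename is atomic on one filesystem).
tmp="$STATE.tmp"
sed 's/"gateway" }/"gateway", "lastUpdatedUtc": "2026-02-09T12:00:00Z" }/' \
  "$STATE" > "$tmp"
mv "$tmp" "$STATE"
cat "$STATE"
```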

---

## 5. Run Artifact Format

Each verification run produces artifacts under:
`docs/qa/feature-checks/runs/<module>/<feature-slug>/<runId>/`

Where `<runId>` = `run-001`, `run-002`, etc. (zero-padded, sequential).
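
Computing the next zero-padded sequential `runId` can be sketched as below. `RUN_ROOT` is a throwaway stand-in for `runs/<module>/<feature-slug>/`, and the `10#` base prefix (a bash-ism) prevents `008`/`009` from being parsed as octal.

```shell
# RUN_ROOT stands in for runs/<module>/<feature-slug>/ with two prior runs.
RUN_ROOT=$(mktemp -d)
mkdir -p "$RUN_ROOT/run-001" "$RUN_ROOT/run-002"

# Highest existing run number, or 0 when the directory is empty.
last=$(ls "$RUN_ROOT" | sed -n 's/^run-//p' | sort -n | tail -1)
next=$(printf 'run-%03d' "$((10#${last:-0} + 1))")
echo "$next"
```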

### Required Artifacts

| Stage | File | Format |
|-------|------|--------|
| Tier 0 | `tier0-source-check.json` | `{ "filesChecked": [...], "found": [...], "missing": [...], "verdict": "pass\|fail\|partial" }` |
| Tier 1 | `tier1-build-check.json` | `{ "project": "...", "buildResult": "pass\|fail", "testResult": "pass\|fail\|skipped", "errors": [...] }` |
| Tier 2 | `tier2-e2e-check.json` | `{ "steps": [{ "description": "...", "result": "pass\|fail", "evidence": "..." }], "screenshots": [...] }` |
| Triage | `triage.json` | `{ "rootCause": "...", "category": "missing_code\|bug\|config\|test_gap\|env_issue", "affectedFiles": [...], "confidence": 0.0-1.0 }` |
| Confirm | `confirmation.json` | `{ "approved": true\|false, "reason": "...", "revisedRootCause": "..." }` |
| Fix | `fix-summary.json` | `{ "filesModified": [...], "testsAdded": [...], "description": "..." }` |
| Retest | `retest-result.json` | `{ "previousFailures": [...], "retestResults": [...], "verdict": "pass\|fail" }` |

### Artifact Freshness Rules (MANDATORY)

- Every new run (`run-XYZ`) MUST be generated from fresh execution, not by copying prior run files.
- Every Tier 2 artifact MUST include `capturedAtUtc` and per-step/per-command/per-request capture times.
- Evidence fields MUST contain fresh raw output from the current run (response snippets, command output, screenshots, or logs).
- Recheck runs MUST include at least one newly captured user interaction per feature in that run directory.
- If a previous run is reused as input for convenience, that run is INVALID until all Tier 2 evidence files are regenerated.

### Screenshot Convention

Screenshots for Tier 2 go in `<runId>/screenshots/` with names:
`step-<N>-<description-slug>.png`

---

## 6. Priority Rules

When selecting the next feature to process, the orchestrator follows this priority order:

1. **`retesting`** - Finish in-progress retests first
2. **`fixing`** - Complete in-progress fixes
3. **`confirmed`** - Confirmed issues ready for a fix
4. **`triaged`** - Triaged issues ready for confirmation
5. **`failed`** (retryCount < 3) - Failed features ready for triage
6. **`queued`** - New features not yet checked

### Failure-First Enforcement (Hard Rule)

- If the current feature enters `failed`, `triaged`, `confirmed`, `fixing`, or `retesting`, agents MUST complete that failure loop for the same feature before starting any next `queued` feature.
- A sprint batch task MUST NOT advance to the next feature while the current feature still has unresolved failures.
- The only exception is an explicit human instruction to pause or reorder.

Within the same priority level, prefer:

- Features in smaller modules first (faster to clear a module completely)
- Features with lower `retryCount`
- Alphabetical order by feature slug (deterministic ordering)

### 6.1 Problem-First Lock (MANDATORY)

- If any feature exists in `checking`, `retesting`, `fixing`, `confirmed`, `triaged`, or `failed`, the orchestrator MUST NOT start a `queued` feature.
- This lock is cross-module: do not start or continue another module while the current module has an unresolved feature in `checking`/`failed`/`triaged`/`confirmed`/`fixing`/`retesting`.
- The orchestrator MUST pick the highest-priority problem feature and keep working that same feature through the chain (`failed -> triaged -> confirmed -> fixing -> retesting`) until it reaches a terminal state (`done` or `blocked`) for that cycle.
- "Touch one problem, then switch to another" is forbidden unless the current feature is explicitly moved to `blocked` with a recorded reason.
- This rule is strict and exists to prevent the problem backlog from growing while new work is started.

### 6.2 Deterministic Selection Algorithm (MUST FOLLOW)

Before selecting a next feature:

1. Scan all module state files under `docs/qa/feature-checks/state/*.json`.
2. If any feature is in `checking`, `retesting`, `fixing`, `confirmed`, `triaged`, or `failed`, select from those only.
3. The selection order MUST be:
   `retesting` > `fixing` > `confirmed` > `triaged` > `failed` > `checking`.
4. Within the same status:
   - lower `retryCount` first
   - then alphabetical by `module`
   - then alphabetical by `feature` slug
5. Only when no features are in those statuses may the orchestrator select `queued`.

This algorithm is mandatory and overrides ad-hoc feature picking.
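
The ordering in 6.2 can be encoded as a composite sort key: status rank, then `retryCount`, then module, then feature slug. The candidate list below is illustrative data, not real pipeline state.

```shell
# Map each status to its selection rank (lower sorts first).
rank() {
  case "$1" in
    retesting) echo 0 ;;
    fixing)    echo 1 ;;
    confirmed) echo 2 ;;
    triaged)   echo 3 ;;
    failed)    echo 4 ;;
    checking)  echo 5 ;;
    *)         echo 9 ;;   # queued: eligible only when nothing above exists
  esac
}

# Candidate lines: status retryCount module feature (illustrative data).
candidates='failed 1 gateway feat-b
retesting 2 web feat-a
failed 0 cli feat-c'

selected=$(printf '%s\n' "$candidates" | while read -r s r m f; do
  printf '%s %s %s %s %s\n' "$(rank "$s")" "$r" "$m" "$f" "$s"
done | sort -k1,1n -k2,2n -k3,3 -k4,4 | head -n 1)
echo "$selected"
```

Here the `retesting` feature wins despite its higher `retryCount`, because status rank dominates the key.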

---

## 7. File Movement Rules

### On `passed` -> `done`

1. Copy the feature file from `docs/features/unchecked/<module>/<feature>.md` to `docs/features/checked/<module>/<feature>.md`
2. Update the status line in the file from `IMPLEMENTED` to `VERIFIED`
3. Append a `## Verification` section with the run ID and date
4. Remove the original from `unchecked/`
5. Create the target module directory in `checked/` if it doesn't exist
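
The five steps above can be sketched against a throwaway fixture tree; `ROOT` stands in for `docs/features/`, and the stub feature file and run ID are hypothetical.

```shell
# ROOT stands in for docs/features/; the feature file is a stub fixture.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/unchecked/gateway"
printf 'Status: IMPLEMENTED\n' > "$ROOT/unchecked/gateway/feat.md"

mkdir -p "$ROOT/checked/gateway"                          # step 5
sed 's/IMPLEMENTED/VERIFIED/' "$ROOT/unchecked/gateway/feat.md" \
  > "$ROOT/checked/gateway/feat.md"                       # steps 1-2
printf '\n## Verification\nrun-001 (2026-02-10)\n' \
  >> "$ROOT/checked/gateway/feat.md"                      # step 3
rm "$ROOT/unchecked/gateway/feat.md"                      # step 4

cat "$ROOT/checked/gateway/feat.md"
```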

### On `not_implemented`

1. Copy the feature file from `docs/features/unchecked/<module>/<feature>.md` to `docs/features/unimplemented/<module>/<feature>.md`
2. Update the status from `IMPLEMENTED` to `PARTIALLY_IMPLEMENTED`
3. Add notes about what was missing
4. Remove the original from `unchecked/`

### On `blocked`

- Do NOT move the file
- Add a `## Blocked` section to the feature file in `unchecked/` with the reason
- The feature stays in `unchecked/` until a human unblocks it

---

## 8. Agent Contracts

### stella-orchestrator

- **Reads**: State files, feature files (to pick next work)
- **Writes**: State files; moves feature files on pass/fail
- **Dispatches**: Subagents with specific feature context
- **Rule**: NEVER run checks itself; always delegate to subagents

### stella-feature-checker

- **Receives**: Feature file path, current tier, module info
- **Reads**: Feature `.md` file, source code files, build output
- **Executes**: File existence checks, `dotnet build`, `dotnet test`, Playwright CLI, Docker commands
- **Returns**: Tier check results (JSON) to the orchestrator
- **Rule**: Read-only on feature files; never modify source code; never write state
- **MUST**: Set up required infrastructure (Docker, containers, databases) before testing.
  Environment setup is part of the checker's job. If Docker is not running, start it.
  If containers are needed, spin them up. If the app needs to be running, start it.
  The checker MUST leave the environment in a testable state before running Tier 2.
- **MUST NOT**: Copy a previous run's artifacts to satisfy Tier 2. The checker must capture fresh user-surface evidence for each run.

### stella-issue-finder

- **Receives**: Check failure details, feature file path
- **Reads**: Source code in the relevant module, test files, build errors
- **Returns**: Triage JSON with root cause, category, affected files, confidence
- **Rule**: Read-only; never modify files; fast analysis

### stella-issue-confirmer

- **Receives**: Triage JSON, feature file path
- **Reads**: Same source code as the finder, plus broader context
- **Returns**: Confirmation JSON (approved/rejected with reason)
- **Rule**: Read-only; never modify files; thorough analysis

### stella-fixer

- **Receives**: Confirmed triage, feature file path, affected files list
- **Writes**: Source code fixes, new/updated tests
- **Returns**: Fix summary JSON
- **Rule**: Only modify files listed in the confirmed triage; add tests for every change; follow CODE_OF_CONDUCT.md

### stella-retester

- **Receives**: Feature file path, previous failure details, fix summary
- **Executes**: Same checks as stella-feature-checker for the tiers that previously failed
- **Returns**: Retest result JSON
- **Rule**: Same constraints as stella-feature-checker; never modify source code

---

## 9. Environment Prerequisites (MANDATORY - DO NOT SKIP)

The verification pipeline exists to prove features **actually work** from a user's
perspective. Agents MUST set up the full runtime environment before running checks.
Skipping environment setup is NEVER acceptable.

### 9.0 Docker / Container Runtime (MUST BE RUNNING FIRST)

Docker is required for Tier 1 tests (Testcontainers for Postgres, Redis, RabbitMQ, etc.)
and for Tier 2 behavioral checks (running services). **Start Docker before anything else.**

```bash
# Step 1: Ensure Docker Desktop is running (Windows/macOS)
# On Windows (PowerShell): start Docker Desktop from the Start Menu or:
Start-Process "C:\Program Files\Docker\Docker\Docker Desktop.exe"
# On macOS:
open -a Docker

# Step 2: Wait for Docker to be ready
docker info > /dev/null 2>&1
# If this fails, Docker is not running. DO NOT proceed without Docker.
# Retry for up to 60 seconds:
for i in $(seq 1 12); do docker info > /dev/null 2>&1 && break || sleep 5; done

# Step 3: Verify Docker is functional
docker ps  # Should return a (possibly empty) container list without errors
```

**If Docker is not available or cannot start:** Mark all affected features as
`failed` with category `env_issue` and the note `"Docker unavailable"`. Do NOT mark
them as `skipped` -- infrastructure failures are failures, not skips.

### 9.1 Container Setup and Cleanup

Before running tests, ensure a clean container state:

```bash
# Clean up any stale containers from previous runs
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true

# Pull required images
docker compose -f devops/compose/docker-compose.dev.yml pull

# Start infrastructure services (Postgres, Redis, RabbitMQ, etc.)
docker compose -f devops/compose/docker-compose.dev.yml up -d

# Wait for services to be healthy (check health status)
docker compose -f devops/compose/docker-compose.dev.yml ps
# Verify all services show "healthy" or "running"

# If no docker-compose file exists, start the minimum required services manually:
docker run -d --name stella-postgres -e POSTGRES_PASSWORD=stella -e POSTGRES_DB=stellaops -p 5432:5432 postgres:16-alpine
docker run -d --name stella-redis -p 6379:6379 redis:7-alpine
docker run -d --name stella-rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management-alpine
```

### 9.2 Backend (.NET)

```bash
# Verify the .NET SDK is available
dotnet --version  # Expected: 10.0.x

# Restore and build the solution
dotnet restore src/StellaOps.sln
dotnet build src/StellaOps.sln
```

### 9.3 Frontend (Angular)

```bash
# Verify Node.js and the Angular CLI
node --version  # Expected: 22.x
npx ng version  # Expected: 21.x

# Install dependencies and build
cd src/Web/StellaOps.Web && npm ci && npx ng build
```

### 9.4 Playwright (Tier 2c UI testing)

```bash
npx playwright install chromium
```

### 9.5 Application Runtime (Tier 2 - ALL behavioral checks)

The application MUST be running for Tier 2 checks. This is not optional.

```bash
# Option A: Docker Compose (preferred - starts everything)
docker compose -f devops/compose/docker-compose.dev.yml up -d

# Option B: Run services individually
# Backend API:
dotnet run --project src/Gateway/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj &
# Frontend:
cd src/Web/StellaOps.Web && npx ng serve &

# Verify services are reachable
curl -s http://localhost:5000/health || echo "Backend not reachable"
curl -s http://localhost:4200 || echo "Frontend not reachable"
```

### 9.6 Environment Teardown (after all checks complete)

```bash
# Stop and clean up all containers
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true
docker rm -f stella-postgres stella-redis stella-rabbitmq 2>/dev/null || true
```

---

## 10. Cost Estimation

| Tier | Per Feature | 1,144 Features | Notes |
|------|-------------|----------------|-------|
| Tier 0 | ~$0.01 | ~$11 | File existence only |
| Tier 1 | ~$0.05 | ~$57 | Build + test |
| Tier 2 | ~$0.50 | ~$165 (330 user-surface features) | Playwright + Opus |
| Triage | ~$0.10 | ~$30 (est. 300 failures) | Sonnet |
| Confirm | ~$0.15 | ~$30 (est. 200 confirmed) | Opus |
| Fix | ~$0.50 | ~$75 (est. 150 fixes) | o3 |
| Retest | ~$0.20 | ~$30 (est. 150 retests) | Opus |
| **Total** | | **~$400** | Conservative estimate |

Run Tier 0 first to filter out `not_implemented` features before spending on higher tiers.
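
The table's total can be re-derived from its own per-feature costs and estimated counts; the figures below are the table's estimates, not measured spend.

```shell
# Re-derive the table's total: per-feature cost x estimated feature count.
total=$(awk 'BEGIN {
  t = 1144*0.01 + 1144*0.05 + 330*0.50 + 300*0.10 + 200*0.15 + 150*0.50 + 150*0.20
  printf "%.0f", t
}')
echo "estimated total: \$${total}"
```

The rounded sum lands just under the table's conservative ~$400.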