# Feature Verification Pipeline - FLOW

This document defines the state machine, tier system, artifact format, and priority rules
for the automated feature verification pipeline.

All agents in the pipeline MUST read this document before taking any action.

> **THE PRIMARY GOAL IS END-TO-END BEHAVIORAL VERIFICATION.**
>
> This pipeline exists to prove that features **actually work** by exercising them
> as a real user would -- through APIs, CLIs, UIs, and integration tests. Tier 0
> (file checks) and Tier 1 (build + unit tests) are necessary prerequisites, but
> they are NOT the goal. **Tier 2 (E2E behavioral verification) is the goal.**
>
> Agents MUST:
>
> 1. Start Docker / Docker Desktop before running any checks
> 2. Set up required containers (Postgres, Redis, RabbitMQ, etc.)
> 3. Start application services needed for behavioral testing
> 4. Run ALL tiers, including Tier 2 -- never stop at Tier 1
> 5. Act as a user: call APIs, run CLI commands, interact with UIs
>
> **Skipping Tier 2 is a verification failure, not a verification pass.**

> **EXECUTION ORDER IS PROBLEMS-FIRST (MANDATORY).**
>
> Agents MUST resolve in-progress/problem states before starting any new `queued` feature.
> A "problem/in-progress" state is any of:
> `checking`, `failed`, `triaged`, `confirmed`, `fixing`, `retesting`.
>
> If any feature exists in those states, starting a new `queued` feature is a FLOW violation.
> Only after all such features reach a terminal state (`done`, `blocked`, `skipped`, `not_implemented`)
> may the pipeline continue with `queued` features.

## 0. Execution Preflight (Mandatory Checklist)

Before selecting any feature, run this checklist in order:

1. Scan every state file in `docs/qa/feature-checks/state/*.json`.
2. If any feature is in `checking`, `failed`, `triaged`, `confirmed`, `fixing`, or `retesting`, pick from those problem features only.
3. Do not start any `queued` feature until all problem features are terminal (`done`, `blocked`, `skipped`, `not_implemented`).
4. Record the selected feature's state transition in the state notes before running Tier 0.
5. Create a fresh run directory (`run-XYZ`) before collecting any tier artifacts.
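
Steps 1-3 can be sketched as a shell check. The fixture directory and file below are hypothetical stand-ins for `docs/qa/feature-checks/state/*.json`; `grep` keeps the sketch dependency-free, though `jq` would parse the JSON more robustly in a real run.

```shell
# Hypothetical fixture standing in for docs/qa/feature-checks/state/*.json;
# a real run would point STATE_DIR at that directory instead.
STATE_DIR=$(mktemp -d)
cat > "$STATE_DIR/gateway.json" <<'EOF'
{ "features": { "f1": { "status": "failed" }, "f2": { "status": "queued" } } }
EOF

# Steps 1-3: if any feature is in a problem state, queued work must wait.
if grep -qE '"status": "(checking|failed|triaged|confirmed|fixing|retesting)"' \
    "$STATE_DIR"/*.json; then
  decision="problem-first"   # pick only from problem features
else
  decision="queued-ok"       # all terminal: queued selection is allowed
fi
echo "$decision"
```

Here the fixture contains a `failed` feature, so the scan resolves to problem-first selection.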

---

## 1. Directory Layout

```
docs/features/
  unchecked/<module>/<feature>.md       # Input: features to verify (1,144 files)
  checked/<module>/<feature>.md         # Output: features that passed verification
  unimplemented/<module>/<feature>.md   # Not implemented / intentionally dropped

docs/qa/feature-checks/
  FLOW.md                               # This file (state machine spec)
  state/<module>.json                   # Per-module state ledger (one file per module)
  runs/<module>/<feature>/<runId>/      # Artifacts per verification run
```

---

## 2. State Machine

### 2.1 States

| State | Meaning |
|-------|---------|
| `queued` | Discovered, not yet processed |
| `checking` | Feature checker is running |
| `passed` | All tier checks passed |
| `failed` | Check found issues (pre-triage) |
| `triaged` | Issue-finder identified root cause |
| `confirmed` | Issue-confirmer validated triage |
| `fixing` | Fixer is implementing the fix |
| `retesting` | Retester is re-running checks |
| `done` | Verified and moved to `checked/` |
| `blocked` | Requires human intervention |
| `skipped` | Cannot be automatically verified (manual-only) |
| `not_implemented` | Source files missing despite sprint claiming DONE |

### 2.2 Transitions

```
queued ──────────────> checking
                          │
                ┌─────────┼─────────────┐
                v         v             v
             passed    failed    not_implemented
                │         │             │
                v         v             │  (move file back
              done     triaged          │   to unimplemented/)
                │         │             v
                │         v         [terminal]
                │     confirmed
                │         │
                │         v
                │       fixing
                │         │
                │         v
                │     retesting
                │       │    │
                │       v    v
                │     done  failed ──> (retry or blocked)
                │
                v
     [move file to checked/]
```
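
The diagram's legal edges can be encoded as a small lookup. This is a sketch transcribed from the diagram above, not an exhaustive product requirement; the retry edge back into the loop is intentionally omitted since it re-enters via `failed`.

```shell
# Legality check for state transitions, transcribed from the 2.2 diagram.
is_legal() {
  case "$1->$2" in
    "queued->checking"|"checking->passed"|"checking->failed"|"checking->not_implemented"|\
    "passed->done"|"failed->triaged"|"triaged->confirmed"|"confirmed->fixing"|\
    "fixing->retesting"|"retesting->done"|"retesting->failed") echo "legal" ;;
    *) echo "illegal" ;;
  esac
}

is_legal failed triaged   # a valid edge in the failure loop
is_legal queued done      # skipping the pipeline is not allowed
```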

### 2.3 Retry Policy

- Maximum retry count: **3** per feature
- After 3 retries with failures: transition to `blocked`
- Blocked features require human review before re-entering the pipeline
- Each retry increments `retryCount` in state
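
The retry decision reduces to one comparison; `retry_count` below is a stand-in for the feature's `retryCount` field from the state file.

```shell
# Retry policy from 2.3: three failed retries exhausts the budget.
MAX_RETRIES=3
retry_count=3              # stand-in for the feature's `retryCount` field

if [ "$retry_count" -ge "$MAX_RETRIES" ]; then
  next_state="blocked"     # retries exhausted: human review required
else
  next_state="failed"      # stays in the failure loop for another attempt
fi
echo "$next_state"
```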

### 2.4 Skip Criteria (STRICT -- almost nothing qualifies)

Features may ONLY be marked `skipped` if they match one of these 3 physical constraints:

- `hardware_required`: Requires a physical HSM, smart card, or eIDAS hardware token
- `multi_datacenter`: Requires geographically distributed infrastructure
- `air_gap_network`: Requires a physically disconnected network

**Everything else MUST be tested.** Features that were previously classified as
"performance benchmarking" or "multi-node cluster" should be tested with whatever
scale is available locally (single-node Docker, local containers). Partial behavioral
verification is better than no verification.

The checker agent determines skip eligibility during Tier 0 and MUST justify the
skip with one of the 3 reasons above. Any other reason is invalid.

---

## 3. Tier System

Verification proceeds in tiers. Each tier is a gate -- a feature must pass
the current tier before advancing to the next. **A feature is NOT verified
until ALL applicable tiers pass.** File existence alone is not verification.

### Tier 0: Source Verification (fast, cheap)

**Purpose**: Verify that the source files referenced in the feature file actually exist.

**Process**:

1. Read the feature `.md` file
2. Extract file paths from the `## Implementation Details`, `## Key files`, or `## What's Implemented` sections
3. For each path, check whether the file exists on disk
4. Extract class/interface names and grep for their declarations

**Outcomes**:

- All key files found: `source_verified = true`, advance to Tier 1
- Key files missing (>50% absent): `status = not_implemented`
- Some files missing (<50% absent): `source_verified = partial`, add a note, advance to Tier 1
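
The 50% threshold can be applied with integer arithmetic, avoiding floating point; the counts here are illustrative stand-ins for the file-existence pass over a feature's key files.

```shell
# Tier 0 outcome rule: more than half the key files absent => not_implemented.
checked=4                       # total key files referenced by the feature
missing=3                       # how many of them are absent on disk

if [ "$missing" -eq 0 ]; then
  outcome="source_verified"
elif [ $((missing * 2)) -gt "$checked" ]; then
  outcome="not_implemented"     # >50% of the key files are absent
else
  outcome="partial"             # some missing: note it, advance to Tier 1
fi
echo "$outcome"
```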

**What this proves**: The code exists on disk. Nothing more.

**Cost**: ~0.01 USD per feature (file existence checks only)

### Tier 1: Build + Code Review (medium)

**Purpose**: Verify that the module compiles, tests pass, AND the code actually implements
the described behavior.

**Process**:

1. Identify the `.csproj` file(s) for the feature's module
2. Run `dotnet build <project>.csproj` and capture the output
3. Run `dotnet test <test-project>.csproj --filter <relevant-filter>` -- tests MUST actually execute and pass
4. For Angular/frontend features: run `npx ng build` and `npx ng test` for the relevant library/app
5. **Code review** (CRITICAL): Read the key source files and verify:
   - The classes/methods described in the feature file actually contain the logic claimed
   - The feature description matches what the code does (not just that it exists)
   - Tests cover the core behavior described in the feature (not just compilation)
6. If the build succeeds but tests are blocked by upstream dependency errors:
   - Record as `build_verified = true, tests_blocked_upstream = true`
   - The feature CANNOT advance to `passed` -- mark it `failed` with category `env_issue`
   - The upstream blocker must be resolved before the feature can pass

**Code Review Checklist** (must answer YES to all):

- [ ] Does the main class/service exist with a non-trivial implementation (not stubs/TODOs)?
- [ ] Does the logic match what the feature description claims?
- [ ] Are there unit tests that exercise the core behavior?
- [ ] Do those tests actually assert meaningful outcomes (not just "doesn't throw")?

**Outcomes**:

- Build + tests pass + code review confirms behavior: `build_verified = true`, advance to Tier 2
- Build fails: `status = failed`, record build errors
- Tests fail or are blocked: `status = failed`, record the reason
- Code review finds stubs/missing logic: `status = failed`, category = `missing_code`
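
Turning a build result into the Tier 1 artifact can be sketched as follows. The `build_output` string simulates the tail of a `dotnet build` run, and the project name is a placeholder; the artifact shape follows the table in Section 5.

```shell
# build_output is a simulated `dotnet build` tail; a real run would capture
# the actual command output instead of hard-coding it.
build_output="Build succeeded. 0 Warning(s) 0 Error(s)"

case "$build_output" in
  *"Build succeeded"*) build_result="pass" ;;
  *)                   build_result="fail" ;;
esac

artifact=$(mktemp)
printf '{ "project": "Gateway.csproj", "buildResult": "%s", "errors": [] }\n' \
  "$build_result" > "$artifact"
cat "$artifact"
```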

**What this proves**: The code compiles, tests pass, and someone has verified the code
does what it claims.

**Cost**: ~0.05 USD per feature (compile + test execution + code reading)

### Tier 2: Behavioral Verification (API / CLI / UI) -- THE MAIN PURPOSE

**Purpose**: Verify the feature works end-to-end by actually exercising it through
its external interface. This is the only tier that proves the feature WORKS, not
just that code exists. **This is the primary reason the verification pipeline exists.**

**EVERY feature MUST have a Tier 2 check. E2E tests MUST NOT be skipped.** The whole
point of this pipeline is to act as a user and verify the software works. Tier 0 and
Tier 1 are prerequisites -- Tier 2 is the actual verification.

**If the environment is not set up, set it up.** If Docker is not running, start it.
If containers are not running, start them. If the app is not running, start it.
"Environment not ready" is never an excuse to skip -- it is a setup step the agent
must perform (see Section 9).

The check type depends on the module's external surface.

### Tier 2 Acceptance Gate (HARD REQUIREMENT)

A Tier 2 run is valid only if ALL of the following are true:

1. It uses a new run directory (`run-XYZ`) created for the current execution.
2. It contains fresh evidence captured in this run (new timestamps and new command/request outputs).
3. It includes user-surface interactions (HTTP requests, CLI invocations, or UI interactions), not only library test counts.
4. It verifies both positive and negative behavior paths when the feature has error semantics.
5. For rechecks, at least one new user transaction per feature is captured in the new run.
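
Gate item 2 can be probed mechanically by comparing evidence timestamps against the run's start time. The timestamps below are illustrative, and `date -d` is the GNU coreutils form (BSD/macOS `date` uses `-j -f` instead).

```shell
# Evidence must postdate the start of the current run (gate item 2).
run_started="2026-02-10T11:59:00Z"
captured="2026-02-10T12:00:01Z"     # capturedAtUtc from the Tier 2 artifact

if [ "$(date -u -d "$captured" +%s)" -gt "$(date -u -d "$run_started" +%s)" ]; then
  freshness="fresh"
else
  freshness="stale"                 # evidence predates this run: gate fails
fi
echo "$freshness"
```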

The following are forbidden and invalidate Tier 2:

- Copying a previous run directory and editing only the `runId`, timestamps, or summary text.
- Declaring a Tier 2 pass from suite totals alone, without fresh request/response, command output, or UI step evidence.
- Reusing screenshots or response payloads from prior runs without replaying the interaction.

If any forbidden shortcut is detected, mark the feature `failed` with category `test_gap`
and rerun Tier 2 from scratch.

#### Tier 2a: API Testing (Gateway, Router, Api, Platform, backend services with HTTP endpoints)

**Process**:

1. Ensure the service is running (check the port, or start it via `docker compose up`)
2. Send HTTP requests to the feature's endpoints using `curl` or a test script
3. Verify response status codes, headers, and body structure
4. Test error cases (unauthorized, bad input, rate limited, etc.)
5. Verify the behavior described in the feature file actually happens

**Example for `gateway-identity-header-strip`**:

```bash
# Send a request with a spoofed identity header and capture the response headers
curl -s -D - -o /dev/null -H "X-Forwarded-User: attacker" http://localhost:5000/api/test
# Verify the header was stripped: the response must reflect the authenticated
# identity, not the spoofed one
```

**Artifact**: `tier2-api-check.json`

```json
{
  "type": "api",
  "baseUrl": "http://localhost:5000",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "requests": [
    {
      "description": "Verify spoofed identity header is stripped",
      "method": "GET",
      "path": "/api/test",
      "headers": { "X-Forwarded-User": "attacker" },
      "expectedStatus": 200,
      "actualStatus": 200,
      "assertion": "Response X-Forwarded-User header matches authenticated user, not 'attacker'",
      "result": "pass|fail",
      "evidence": "actual response headers/body",
      "requestCapturedAtUtc": "2026-02-10T12:00:01Z",
      "responseSnippet": "HTTP/1.1 200 ..."
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2b: CLI Testing (Cli, Tools, Bench modules)

**Process**:

1. Build the CLI tool if needed
2. Run the CLI command described in the feature's E2E Test Plan
3. Verify stdout/stderr output matches the expected behavior
4. Test error cases (invalid args, missing config, etc.)
5. Verify exit codes

**Example for `cli-baseline-selection-logic`**:

```bash
stella scan --baseline last-green myimage:latest
# Verify output shows the baseline was selected correctly
echo $?  # Verify exit code 0
```

**Artifact**: `tier2-cli-check.json`

```json
{
  "type": "cli",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "commands": [
    {
      "description": "Verify baseline selection with last-green strategy",
      "command": "stella scan --baseline last-green myimage:latest",
      "expectedExitCode": 0,
      "actualExitCode": 0,
      "expectedOutput": "Using baseline: ...",
      "actualOutput": "...",
      "result": "pass|fail",
      "commandCapturedAtUtc": "2026-02-10T12:00:01Z"
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2c: UI Testing (Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry)

**Process**:

1. Ensure the Angular app is running (`ng serve` or Docker)
2. Use the Playwright CLI or MCP to navigate to the feature's UI route
3. Follow the E2E Test Plan steps: verify elements render, interactions work, data displays
4. Capture screenshots as evidence
5. Test accessibility (keyboard navigation, ARIA labels) if listed in the E2E plan

**Example for `pipeline-run-centric-view`**:

```bash
npx playwright test --grep "pipeline-run" --reporter=json
# Or manually via MCP: navigate to /release-orchestrator/runs, verify the table renders
```

**Artifact**: `tier2-ui-check.json`

```json
{
  "type": "ui",
  "baseUrl": "http://localhost:4200",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "steps": [
    {
      "description": "Navigate to /release-orchestrator/runs",
      "action": "navigate",
      "target": "/release-orchestrator/runs",
      "expected": "Runs list table renders with columns",
      "result": "pass|fail",
      "screenshot": "step-1-runs-list.png",
      "stepCapturedAtUtc": "2026-02-10T12:00:01Z"
    }
  ],
  "verdict": "pass|fail|skip"
}
```

#### Tier 2d: Library/Internal Testing (Attestor, Policy, Scanner, etc. with no external surface)

For modules with no HTTP/CLI/UI surface, Tier 2 means running **targeted
integration tests** or **behavioral unit tests** that prove the feature logic:

**Process**:

1. Identify tests that specifically exercise the feature's behavior
2. Run those tests: `dotnet test --filter "FullyQualifiedName~FeatureClassName"`
3. Read the test code to confirm it asserts meaningful behavior (not just "compiles")
4. If no behavioral tests exist, write a focused test and run it

**Example for `evidence-weighted-score-model`**:

```bash
dotnet test --filter "FullyQualifiedName~EwsCalculatorTests"
# Verify: normalizers produce the expected dimension scores
# Verify: guardrails cap/floor scores correctly
# Verify: the composite score is deterministic for the same inputs
```

**Artifact**: `tier2-integration-check.json`

```json
{
  "type": "integration",
  "capturedAtUtc": "2026-02-10T12:00:00Z",
  "testFilter": "FullyQualifiedName~EwsCalculatorTests",
  "testsRun": 21,
  "testsPassed": 21,
  "testsFailed": 0,
  "behaviorVerified": [
    "6-dimension normalization produces expected scores",
    "Guardrails enforce caps and floors",
    "Composite score is deterministic"
  ],
  "verdict": "pass|fail"
}
```

### When to skip Tier 2 (ALMOST NEVER)

**Default: Tier 2 is MANDATORY.** Agents must exhaust all options before marking a skip.

The ONLY acceptable skip reasons (must match exactly one):

- `hardware_required`: Feature requires a physical HSM, smart card, or eIDAS token
- `multi_datacenter`: Feature requires geographically distributed infrastructure
- `air_gap_network`: Feature requires a physically disconnected network (not just no internet)

**These are NOT valid skip reasons:**

- "The app isn't running" -- **START IT** (see Section 9). If it won't start, mark `failed` with `env_issue`.
- "Docker isn't running" -- **START DOCKER** (see Section 9.0). If it won't start, mark `failed` with `env_issue`.
- "No E2E tests exist" -- **WRITE ONE.** A focused behavioral test that exercises the feature as a user would.
- "The database isn't set up" -- **SET IT UP** using Docker containers (see Section 9.1).
- "Environment not ready" -- **PREPARE IT.** That is part of the agent's job, not an excuse.
- "Too complex to test" -- Break it into smaller testable steps. Test what you can.
- "Only unit tests needed" -- Unit tests are Tier 1. Tier 2 is behavioral/integration/E2E.
- "Application not running" -- See "The app isn't running" above.

**If an agent skips Tier 2 without one of the 3 valid reasons above, the entire
feature verification is INVALID and must be re-run.**

### Tier Classification by Module

| Tier 2 Type | Modules | Feature Count |
|-------------|---------|---------------|
| 2a (API) | Gateway, Router, Api, Platform | ~30 |
| 2b (CLI) | Cli, Tools, Bench | ~110 |
| 2c (UI/Playwright) | Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry | ~190 |
| 2d (Integration) | Attestor, Policy, Scanner, BinaryIndex, Concelier, Libraries, EvidenceLocker, Orchestrator, Signals, Authority, Signer, Cryptography, ReachGraph, Graph, RiskEngine, Replay, Unknowns, Scheduler, TaskRunner, Timeline, Notifier, Findings, SbomService, Mirror, Feedser, Analyzers | ~700 |
| Manual (skip) | AirGap (subset), SmRemote (HSM), DevOps (infra) | ~25 |

---

## 4. State File Format

Per-module state files live at `docs/qa/feature-checks/state/<module>.json`.

```json
{
  "module": "gateway",
  "featureCount": 8,
  "lastUpdatedUtc": "2026-02-09T12:00:00Z",
  "features": {
    "router-back-pressure-middleware": {
      "status": "queued",
      "tier": 0,
      "retryCount": 0,
      "sourceVerified": null,
      "buildVerified": null,
      "e2eVerified": null,
      "skipReason": null,
      "lastRunId": null,
      "lastUpdatedUtc": "2026-02-09T12:00:00Z",
      "featureFile": "docs/features/unchecked/gateway/router-back-pressure-middleware.md",
      "notes": []
    }
  }
}
```

### State File Rules

- **Single writer**: Only the orchestrator writes state files
- **Subagents report back**: Subagents return results to the orchestrator via their output; they do NOT write state files directly
- **Atomic updates**: Each state transition must update `lastUpdatedUtc`
- **Append-only notes**: The `notes` array is append-only; never remove entries
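
One way to honor the single-writer/atomic-update rules is the classic write-to-temp-then-rename pattern. The fixture file and `sed` edit below are illustrative; a real orchestrator would rewrite the JSON with a proper parser.

```shell
# Fixture state file; real files live under docs/qa/feature-checks/state/.
STATE="$(mktemp -d)/gateway.json"
printf '{ "module": "gateway" }\n' > "$STATE"

# Write the updated document to a temp file, then rename: readers never
# observe a half-written state file (rename is atomic on one filesystem).
tmp="$STATE.tmp"
sed 's/"gateway" }/"gateway", "lastUpdatedUtc": "2026-02-09T12:00:00Z" }/' \
  "$STATE" > "$tmp"
mv "$tmp" "$STATE"
cat "$STATE"
```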

---

## 5. Run Artifact Format

Each verification run produces artifacts under:
`docs/qa/feature-checks/runs/<module>/<feature-slug>/<runId>/`

Where `<runId>` = `run-001`, `run-002`, etc. (zero-padded, sequential).
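
Computing the next zero-padded sequential `runId` can be sketched as below. `RUN_ROOT` is a throwaway stand-in for `runs/<module>/<feature-slug>/`, and the `10#` base prefix (a bash-ism) prevents `008`/`009` from being parsed as octal.

```shell
# RUN_ROOT stands in for runs/<module>/<feature-slug>/ with two prior runs.
RUN_ROOT=$(mktemp -d)
mkdir -p "$RUN_ROOT/run-001" "$RUN_ROOT/run-002"

# Highest existing run number, or 0 when the directory is empty.
last=$(ls "$RUN_ROOT" | sed -n 's/^run-//p' | sort -n | tail -1)
next=$(printf 'run-%03d' "$((10#${last:-0} + 1))")
echo "$next"
```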

### Required Artifacts

| Stage | File | Format |
|-------|------|--------|
| Tier 0 | `tier0-source-check.json` | `{ "filesChecked": [...], "found": [...], "missing": [...], "verdict": "pass\|fail\|partial" }` |
| Tier 1 | `tier1-build-check.json` | `{ "project": "...", "buildResult": "pass\|fail", "testResult": "pass\|fail\|skipped", "errors": [...] }` |
| Tier 2 | `tier2-e2e-check.json` | `{ "steps": [{ "description": "...", "result": "pass\|fail", "evidence": "..." }], "screenshots": [...] }` |
| Triage | `triage.json` | `{ "rootCause": "...", "category": "missing_code\|bug\|config\|test_gap\|env_issue", "affectedFiles": [...], "confidence": 0.0-1.0 }` |
| Confirm | `confirmation.json` | `{ "approved": true\|false, "reason": "...", "revisedRootCause": "..." }` |
| Fix | `fix-summary.json` | `{ "filesModified": [...], "testsAdded": [...], "description": "..." }` |
| Retest | `retest-result.json` | `{ "previousFailures": [...], "retestResults": [...], "verdict": "pass\|fail" }` |

### Artifact Freshness Rules (MANDATORY)

- Every new run (`run-XYZ`) MUST be generated from fresh execution, not by copying prior run files.
- Every Tier 2 artifact MUST include `capturedAtUtc` and per-step/per-command/per-request capture times.
- Evidence fields MUST contain fresh raw output from the current run (response snippets, command output, screenshots, or logs).
- Recheck runs MUST include at least one newly captured user interaction per feature in that run directory.
- If a previous run is reused as input for convenience, that run is INVALID until all Tier 2 evidence files are regenerated.

### Screenshot Convention

Screenshots for Tier 2 go in `<runId>/screenshots/` with names:
`step-<N>-<description-slug>.png`

---

## 6. Priority Rules

When selecting the next feature to process, the orchestrator follows this priority order:

1. **`retesting`** - Finish in-progress retests first
2. **`fixing`** - Complete in-progress fixes
3. **`confirmed`** - Confirmed issues ready for a fix
4. **`triaged`** - Triaged issues ready for confirmation
5. **`failed`** (retryCount < 3) - Failed features ready for triage
6. **`queued`** - New features not yet checked

### Failure-First Enforcement (Hard Rule)

- If the current feature enters `failed`, `triaged`, `confirmed`, `fixing`, or `retesting`, agents MUST complete that failure loop for the same feature before starting any next `queued` feature.
- A sprint batch task MUST NOT advance to the next feature while the current feature still has unresolved failures.
- The only exception is an explicit human instruction to pause or reorder.

Within the same priority level, prefer:

- Features in smaller modules first (faster to clear a module completely)
- Features with lower `retryCount`
- Alphabetical order by feature slug (deterministic ordering)

### 6.1 Problem-First Lock (MANDATORY)

- If any feature exists in `checking`, `retesting`, `fixing`, `confirmed`, `triaged`, or `failed`, the orchestrator MUST NOT start a `queued` feature.
- This lock is cross-module: do not start or continue another module while the current module has an unresolved feature in `checking`/`failed`/`triaged`/`confirmed`/`fixing`/`retesting`.
- The orchestrator MUST pick the highest-priority problem feature and keep working that same feature through the chain (`failed -> triaged -> confirmed -> fixing -> retesting`) until it reaches a terminal state (`done` or `blocked`) for that cycle.
- "Touch one problem, then switch to another" is forbidden unless the current feature is explicitly moved to `blocked` with a recorded reason.
- This rule is strict and exists to prevent the problem backlog from growing while new work is started.

### 6.2 Deterministic Selection Algorithm (MUST FOLLOW)

Before selecting a next feature:

1. Scan all module state files under `docs/qa/feature-checks/state/*.json`.
2. If any feature is in `checking`, `retesting`, `fixing`, `confirmed`, `triaged`, or `failed`, select from those only.
3. The selection order MUST be:
   `retesting` > `fixing` > `confirmed` > `triaged` > `failed` > `checking`.
4. Within the same status:
   - lower `retryCount` first
   - then alphabetical by `module`
   - then alphabetical by `feature` slug
5. Only when no features are in those statuses may the orchestrator select `queued`.

This algorithm is mandatory and overrides ad-hoc feature picking.
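
The ordering in 6.2 can be encoded as a composite sort key: status rank, then `retryCount`, then module, then feature slug. The candidate list below is illustrative data, not real pipeline state.

```shell
# Map each status to its selection rank (lower sorts first).
rank() {
  case "$1" in
    retesting) echo 0 ;;
    fixing)    echo 1 ;;
    confirmed) echo 2 ;;
    triaged)   echo 3 ;;
    failed)    echo 4 ;;
    checking)  echo 5 ;;
    *)         echo 9 ;;   # queued: eligible only when nothing above exists
  esac
}

# Candidate lines: status retryCount module feature (illustrative data).
candidates='failed 1 gateway feat-b
retesting 2 web feat-a
failed 0 cli feat-c'

selected=$(printf '%s\n' "$candidates" | while read -r s r m f; do
  printf '%s %s %s %s %s\n' "$(rank "$s")" "$r" "$m" "$f" "$s"
done | sort -k1,1n -k2,2n -k3,3 -k4,4 | head -n 1)
echo "$selected"
```

Here the `retesting` feature wins despite its higher `retryCount`, because status rank dominates the key.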

---

## 7. File Movement Rules

### On `passed` -> `done`

1. Copy the feature file from `docs/features/unchecked/<module>/<feature>.md` to `docs/features/checked/<module>/<feature>.md`
2. Update the status line in the file from `IMPLEMENTED` to `VERIFIED`
3. Append a `## Verification` section with the run ID and date
4. Remove the original from `unchecked/`
5. Create the target module directory in `checked/` if it doesn't exist
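
The five steps above can be sketched against a throwaway fixture tree; `ROOT` stands in for `docs/features/`, and the stub feature file and run ID are hypothetical.

```shell
# ROOT stands in for docs/features/; the feature file is a stub fixture.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/unchecked/gateway"
printf 'Status: IMPLEMENTED\n' > "$ROOT/unchecked/gateway/feat.md"

mkdir -p "$ROOT/checked/gateway"                          # step 5
sed 's/IMPLEMENTED/VERIFIED/' "$ROOT/unchecked/gateway/feat.md" \
  > "$ROOT/checked/gateway/feat.md"                       # steps 1-2
printf '\n## Verification\nrun-001 (2026-02-10)\n' \
  >> "$ROOT/checked/gateway/feat.md"                      # step 3
rm "$ROOT/unchecked/gateway/feat.md"                      # step 4

cat "$ROOT/checked/gateway/feat.md"
```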

### On `not_implemented`

1. Copy the feature file from `docs/features/unchecked/<module>/<feature>.md` to `docs/features/unimplemented/<module>/<feature>.md`
2. Update the status from `IMPLEMENTED` to `PARTIALLY_IMPLEMENTED`
3. Add notes about what was missing
4. Remove the original from `unchecked/`

### On `blocked`

- Do NOT move the file
- Add a `## Blocked` section to the feature file in `unchecked/` with the reason
- The feature stays in `unchecked/` until a human unblocks it

---

## 8. Agent Contracts

### stella-orchestrator

- **Reads**: State files, feature files (to pick next work)
- **Writes**: State files; moves feature files on pass/fail
- **Dispatches**: Subagents with specific feature context
- **Rule**: NEVER run checks itself; always delegate to subagents

### stella-feature-checker

- **Receives**: Feature file path, current tier, module info
- **Reads**: Feature `.md` file, source code files, build output
- **Executes**: File existence checks, `dotnet build`, `dotnet test`, Playwright CLI, Docker commands
- **Returns**: Tier check results (JSON) to the orchestrator
- **Rule**: Read-only on feature files; never modify source code; never write state
- **MUST**: Set up required infrastructure (Docker, containers, databases) before testing.
  Environment setup is part of the checker's job. If Docker is not running, start it.
  If containers are needed, spin them up. If the app needs to be running, start it.
  The checker MUST leave the environment in a testable state before running Tier 2.
- **MUST NOT**: Copy a previous run's artifacts to satisfy Tier 2. The checker must capture fresh user-surface evidence for each run.

### stella-issue-finder

- **Receives**: Check failure details, feature file path
- **Reads**: Source code in the relevant module, test files, build errors
- **Returns**: Triage JSON with root cause, category, affected files, confidence
- **Rule**: Read-only; never modify files; fast analysis

### stella-issue-confirmer

- **Receives**: Triage JSON, feature file path
- **Reads**: Same source code as the finder, plus broader context
- **Returns**: Confirmation JSON (approved/rejected with reason)
- **Rule**: Read-only; never modify files; thorough analysis

### stella-fixer

- **Receives**: Confirmed triage, feature file path, affected files list
- **Writes**: Source code fixes, new/updated tests
- **Returns**: Fix summary JSON
- **Rule**: Only modify files listed in the confirmed triage; add tests for every change; follow CODE_OF_CONDUCT.md

### stella-retester

- **Receives**: Feature file path, previous failure details, fix summary
- **Executes**: Same checks as stella-feature-checker for the tiers that previously failed
- **Returns**: Retest result JSON
- **Rule**: Same constraints as stella-feature-checker; never modify source code

---

## 9. Environment Prerequisites (MANDATORY - DO NOT SKIP)

The verification pipeline exists to prove features **actually work** from a user's
perspective. Agents MUST set up the full runtime environment before running checks.
Skipping environment setup is NEVER acceptable.

### 9.0 Docker / Container Runtime (MUST BE RUNNING FIRST)

Docker is required for Tier 1 tests (Testcontainers for Postgres, Redis, RabbitMQ, etc.)
and for Tier 2 behavioral checks (running services). **Start Docker before anything else.**

```bash
# Step 1: Ensure Docker Desktop is running (Windows/macOS)
# On Windows (PowerShell): start Docker Desktop from the Start Menu or:
Start-Process "C:\Program Files\Docker\Docker\Docker Desktop.exe"
# On macOS:
open -a Docker

# Step 2: Wait for Docker to be ready
docker info > /dev/null 2>&1
# If this fails, Docker is not running. DO NOT proceed without Docker.
# Retry for up to 60 seconds:
for i in $(seq 1 12); do docker info > /dev/null 2>&1 && break || sleep 5; done

# Step 3: Verify Docker is functional
docker ps  # Should return a (possibly empty) container list without errors
```

**If Docker is not available or cannot start:** Mark all affected features as
`failed` with category `env_issue` and the note `"Docker unavailable"`. Do NOT mark
them as `skipped` -- infrastructure failures are failures, not skips.

### 9.1 Container Setup and Cleanup

Before running tests, ensure a clean container state:

```bash
# Clean up any stale containers from previous runs
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true

# Pull required images
docker compose -f devops/compose/docker-compose.dev.yml pull

# Start infrastructure services (Postgres, Redis, RabbitMQ, etc.)
docker compose -f devops/compose/docker-compose.dev.yml up -d

# Wait for services to be healthy (check health status)
docker compose -f devops/compose/docker-compose.dev.yml ps
# Verify all services show "healthy" or "running"

# If no docker-compose file exists, start the minimum required services manually:
docker run -d --name stella-postgres -e POSTGRES_PASSWORD=stella -e POSTGRES_DB=stellaops -p 5432:5432 postgres:16-alpine
docker run -d --name stella-redis -p 6379:6379 redis:7-alpine
docker run -d --name stella-rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management-alpine
```

### 9.2 Backend (.NET)

```bash
# Verify the .NET SDK is available
dotnet --version  # Expected: 10.0.x

# Restore and build the solution
dotnet restore src/StellaOps.sln
dotnet build src/StellaOps.sln
```

### 9.3 Frontend (Angular)

```bash
# Verify Node.js and the Angular CLI
node --version  # Expected: 22.x
npx ng version  # Expected: 21.x

# Install dependencies and build
cd src/Web/StellaOps.Web && npm ci && npx ng build
```

### 9.4 Playwright (Tier 2c UI testing)

```bash
npx playwright install chromium
```

### 9.5 Application Runtime (Tier 2 - ALL behavioral checks)

The application MUST be running for Tier 2 checks. This is not optional.

```bash
# Option A: Docker Compose (preferred - starts everything)
docker compose -f devops/compose/docker-compose.dev.yml up -d

# Option B: Run services individually
# Backend API:
dotnet run --project src/Gateway/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj &
# Frontend:
cd src/Web/StellaOps.Web && npx ng serve &

# Verify services are reachable
curl -s http://localhost:5000/health || echo "Backend not reachable"
curl -s http://localhost:4200 || echo "Frontend not reachable"
```

### 9.6 Environment Teardown (after all checks complete)

```bash
# Stop and clean up all containers
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true
docker rm -f stella-postgres stella-redis stella-rabbitmq 2>/dev/null || true
```

---

## 10. Cost Estimation

| Tier | Per Feature | 1,144 Features | Notes |
|------|-------------|----------------|-------|
| Tier 0 | ~$0.01 | ~$11 | File existence only |
| Tier 1 | ~$0.05 | ~$57 | Build + test |
| Tier 2 | ~$0.50 | ~$165 (330 user-surface features) | Playwright + Opus |
| Triage | ~$0.10 | ~$30 (est. 300 failures) | Sonnet |
| Confirm | ~$0.15 | ~$30 (est. 200 confirmed) | Opus |
| Fix | ~$0.50 | ~$75 (est. 150 fixes) | o3 |
| Retest | ~$0.20 | ~$30 (est. 150 retests) | Opus |
| **Total** | | **~$400** | Conservative estimate |

Run Tier 0 first to filter out `not_implemented` features before spending on higher tiers.
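
The table's total can be re-derived from its own per-feature costs and estimated counts; the figures below are the table's estimates, not measured spend.

```shell
# Re-derive the table's total: per-feature cost x estimated feature count.
total=$(awk 'BEGIN {
  t = 1144*0.01 + 1144*0.05 + 330*0.50 + 300*0.10 + 200*0.15 + 150*0.50 + 150*0.20
  printf "%.0f", t
}')
echo "estimated total: \$${total}"
```

The rounded sum lands just under the table's conservative ~$400.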