git.stella-ops.org/docs/qa/feature-checks/FLOW.md

Feature Verification Pipeline - FLOW

This document defines the state machine, tier system, artifact format, and priority rules for the automated feature verification pipeline.

All agents in the pipeline MUST read this document before taking any action.


1. Directory Layout

docs/features/
  unchecked/<module>/<feature>.md    # Input: features to verify (1,144 files)
  checked/<module>/<feature>.md      # Output: features that passed verification
  unimplemented/<module>/<feature>.md # Source files missing despite sprint claiming DONE

docs/qa/feature-checks/
  FLOW.md                            # This file (state machine spec)
  state/<module>.json                # Per-module state ledger (one file per module)
  runs/<module>/<feature>/<runId>/   # Artifacts per verification run

2. State Machine

2.1 States

| State | Meaning |
|---|---|
| queued | Discovered, not yet processed |
| checking | Feature checker is running |
| passed | All tier checks passed |
| failed | Check found issues (pre-triage) |
| triaged | Issue-finder identified root cause |
| confirmed | Issue-confirmer validated triage |
| fixing | Fixer is implementing the fix |
| retesting | Retester is re-running checks |
| done | Verified and moved to checked/ |
| blocked | Requires human intervention |
| skipped | Cannot be automatically verified (manual-only) |
| not_implemented | Source files missing despite sprint claiming DONE |

2.2 Transitions

queued ──────────────> checking
                          │
                ┌─────────┼─────────────┐
                v         v             v
            passed     failed      not_implemented
               │         │              │
               v         v              │ (move file to
            done     triaged            │  unimplemented/)
               │         │              v
               │         v           [terminal]
               │     confirmed
               │         │
               │         v
               │      fixing
               │         │
               │         v
               │     retesting
               │       │    │
               │       v    v
               │    done  failed ──> (retry or blocked)
               │
               v
         [move file to checked/]
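The diagram above can be captured as a transition table. A minimal Python sketch follows; state names come from this spec, while the failed -> checking retry edge is an assumption derived from the retry policy in 2.3:

```python
# Allowed transitions, transcribed from the diagram above. The failed ->
# checking edge models a retry (see 2.3); failed -> blocked models retry
# exhaustion. This table is illustrative, not normative.
ALLOWED = {
    "queued":    {"checking"},
    "checking":  {"passed", "failed", "not_implemented", "skipped"},
    "passed":    {"done"},
    "failed":    {"triaged", "checking", "blocked"},
    "triaged":   {"confirmed"},
    "confirmed": {"fixing"},
    "fixing":    {"retesting"},
    "retesting": {"done", "failed"},
    # Terminal states have no outgoing transitions.
    "done": set(), "blocked": set(), "skipped": set(), "not_implemented": set(),
}

def can_transition(src: str, dst: str) -> bool:
    """Return True if the state machine permits src -> dst."""
    return dst in ALLOWED.get(src, set())
```

An orchestrator can reject any update whose (current, next) pair fails this check before touching the state file.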

2.3 Retry Policy

  • Maximum retry count: 3 per feature
  • After 3 retries with failures: transition to blocked
  • Blocked features require human review before re-entering the pipeline
  • Each retry increments retryCount in state
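The retry policy above reduces to a small pure function; this sketch is illustrative and the function name is not part of the spec:

```python
MAX_RETRIES = 3  # maximum retry count per feature, per the policy above

def next_status_after_failure(retry_count: int) -> tuple[str, int]:
    """Apply the retry policy on a failed check.

    Returns the next status and the updated retryCount: retry (stay in
    "failed", increment the counter) until MAX_RETRIES is reached, then
    transition to "blocked" for human review.
    """
    if retry_count >= MAX_RETRIES:
        return "blocked", retry_count
    return "failed", retry_count + 1
```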

2.4 Skip Criteria

Features that CANNOT be automatically E2E tested should be marked skipped:

  • Air-gap/offline features (require disconnected environment)
  • Crypto-sovereign features (require HSM/eIDAS hardware)
  • Multi-node cluster features (require multi-host setup)
  • Performance benchmarking features (require dedicated infra)
  • Features with description containing "manual verification required"

The checker agent determines skip eligibility during Tier 0.
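The skip criteria above can be approximated with a keyword heuristic. This is a sketch only: the keyword list is an assumption, and the real checker may use richer signals than the feature description text:

```python
# Illustrative keyword patterns mirroring the skip criteria above.
# The exact patterns are assumptions, not part of the spec.
SKIP_PATTERNS = [
    "air-gap", "offline",            # disconnected environment required
    "hsm", "eidas",                  # crypto-sovereign hardware required
    "multi-node", "cluster",         # multi-host setup required
    "benchmark",                     # dedicated perf infra required
    "manual verification required",  # explicit opt-out in the description
]

def is_skip_eligible(description: str) -> bool:
    """Heuristic Tier 0 check: does the description match a skip criterion?"""
    text = description.lower()
    return any(pattern in text for pattern in SKIP_PATTERNS)
```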


3. Tier System

Verification proceeds in tiers. Each tier is a gate: a feature must pass the current tier before advancing to the next. A feature is NOT verified until ALL applicable tiers pass. File existence alone is not verification.

Tier 0: Source Verification (fast, cheap)

Purpose: Verify that the source files referenced in the feature file actually exist.

Process:

  1. Read the feature .md file
  2. Extract file paths from ## Implementation Details, ## Key files, or ## What's Implemented sections
  3. For each path, check if the file exists on disk
  4. Extract class/interface names and grep for their declarations

Outcomes:

  • All key files found: source_verified = true, advance to Tier 1
  • Key files missing (>50% absent): status = not_implemented
  • Some files missing (<50% absent): source_verified = partial, add note, advance to Tier 1

What this proves: The code exists on disk. Nothing more.

Cost: ~0.01 USD per feature (file existence checks only)
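The outcome thresholds above can be expressed directly. Note the spec does not assign the exact-50% boundary; this sketch treats it as partial:

```python
def tier0_verdict(paths_checked: list[str], missing: list[str]) -> str:
    """Map the missing-file fraction to the Tier 0 outcomes above.

    Returns "pass" (source_verified = true), "fail" (not_implemented),
    or "partial" (source_verified = partial, advance with a note).
    Exactly 50% absent is unspecified above; it falls to "partial" here.
    """
    if not missing:
        return "pass"
    if len(missing) / len(paths_checked) > 0.5:
        return "fail"
    return "partial"
```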

Tier 1: Build + Code Review (medium)

Purpose: Verify the module compiles, tests pass, AND the code actually implements the described behavior.

Process:

  1. Identify the .csproj file(s) for the feature's module
  2. Run dotnet build <project>.csproj and capture output
  3. Run dotnet test <test-project>.csproj --filter <relevant-filter> -- tests MUST actually execute and pass
  4. For Angular/frontend features: run npx ng build and npx ng test for the relevant library/app
  5. Code review (CRITICAL): Read the key source files and verify:
    • The classes/methods described in the feature file actually contain the logic claimed
    • The feature description matches what the code does (not just that it exists)
    • Tests cover the core behavior described in the feature (not just compilation)
  6. If the build succeeds but tests are blocked by upstream dependency errors:
    • Record as build_verified = true, tests_blocked_upstream = true
    • The feature CANNOT advance to passed -- mark as failed with category env_issue
    • The upstream blocker must be resolved before the feature can pass

Code Review Checklist (must answer YES to all):

  • Does the main class/service exist with non-trivial implementation (not stubs/TODOs)?
  • Does the logic match what the feature description claims?
  • Are there unit tests that exercise the core behavior?
  • Do those tests actually assert meaningful outcomes (not just "doesn't throw")?

Outcomes:

  • Build + tests pass + code review confirms behavior: build_verified = true, advance to Tier 2
  • Build fails: status = failed, record build errors
  • Tests fail or blocked: status = failed, record reason
  • Code review finds stubs/missing logic: status = failed, category = missing_code

What this proves: The code compiles, tests pass, and someone has verified the code does what it claims.

Cost: ~0.05 USD per feature (compile + test execution + code reading)

Tier 2: Behavioral Verification (API / CLI / UI)

Purpose: Verify the feature works end-to-end by actually exercising it through its external interface. This is the only tier that proves the feature WORKS, not just that code exists.

EVERY feature MUST have a Tier 2 check unless explicitly skipped. The check type depends on the module's external surface.

Tier 2a: API Testing (Gateway, Router, Api, Platform, backend services with HTTP endpoints)

Process:

  1. Ensure the service is running (check port, or start via docker compose up)
  2. Send HTTP requests to the feature's endpoints using curl or a test script
  3. Verify response status codes, headers, and body structure
  4. Test error cases (unauthorized, bad input, rate limited, etc.)
  5. Verify the behavior described in the feature file actually happens

Example for gateway-identity-header-strip:

# Send request with spoofed identity header
curl -H "X-Forwarded-User: attacker" http://localhost:5000/api/test
# Verify the header was stripped (response should use authenticated identity, not spoofed)

Artifact: tier2-api-check.json

{
  "type": "api",
  "baseUrl": "http://localhost:5000",
  "requests": [
    {
      "description": "Verify spoofed identity header is stripped",
      "method": "GET",
      "path": "/api/test",
      "headers": { "X-Forwarded-User": "attacker" },
      "expectedStatus": 200,
      "actualStatus": 200,
      "assertion": "Response X-Forwarded-User header matches authenticated user, not 'attacker'",
      "result": "pass|fail",
      "evidence": "actual response headers/body"
    }
  ],
  "verdict": "pass|fail|skip"
}

Tier 2b: CLI Testing (Cli, Tools, Bench modules)

Process:

  1. Build the CLI tool if needed
  2. Run the CLI command described in the feature's E2E Test Plan
  3. Verify stdout/stderr output matches expected behavior
  4. Test error cases (invalid args, missing config, etc.)
  5. Verify exit codes

Example for cli-baseline-selection-logic:

stella scan --baseline last-green myimage:latest
# Verify output shows baseline was selected correctly
echo $?  # Verify exit code 0

Artifact: tier2-cli-check.json

{
  "type": "cli",
  "commands": [
    {
      "description": "Verify baseline selection with last-green strategy",
      "command": "stella scan --baseline last-green myimage:latest",
      "expectedExitCode": 0,
      "actualExitCode": 0,
      "expectedOutput": "Using baseline: ...",
      "actualOutput": "...",
      "result": "pass|fail"
    }
  ],
  "verdict": "pass|fail|skip"
}

Tier 2c: UI Testing (Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry)

Process:

  1. Ensure the Angular app is running (ng serve or docker)
  2. Use Playwright CLI or MCP to navigate to the feature's UI route
  3. Follow E2E Test Plan steps: verify elements render, interactions work, data displays
  4. Capture screenshots as evidence
  5. Test accessibility (keyboard navigation, ARIA labels) if listed in E2E plan

Example for pipeline-run-centric-view:

npx playwright test --grep "pipeline-run" --reporter=json
# Or manually via MCP: navigate to /release-orchestrator/runs, verify table renders

Artifact: tier2-ui-check.json

{
  "type": "ui",
  "baseUrl": "http://localhost:4200",
  "steps": [
    {
      "description": "Navigate to /release-orchestrator/runs",
      "action": "navigate",
      "target": "/release-orchestrator/runs",
      "expected": "Runs list table renders with columns",
      "result": "pass|fail",
      "screenshot": "step-1-runs-list.png"
    }
  ],
  "verdict": "pass|fail|skip"
}

Tier 2d: Library/Internal Testing (Attestor, Policy, Scanner, etc. with no external surface)

For modules with no HTTP/CLI/UI surface, Tier 2 means running targeted integration tests or behavioral unit tests that prove the feature logic:

Process:

  1. Identify tests that specifically exercise the feature's behavior
  2. Run those tests: dotnet test --filter "FullyQualifiedName~FeatureClassName"
  3. Read the test code to confirm it asserts meaningful behavior (not just "compiles")
  4. If no behavioral tests exist, write a focused test and run it

Example for evidence-weighted-score-model:

dotnet test --filter "FullyQualifiedName~EwsCalculatorTests"
# Verify: normalizers produce expected dimension scores
# Verify: guardrails cap/floor scores correctly
# Verify: composite score is deterministic for same inputs

Artifact: tier2-integration-check.json

{
  "type": "integration",
  "testFilter": "FullyQualifiedName~EwsCalculatorTests",
  "testsRun": 21,
  "testsPassed": 21,
  "testsFailed": 0,
  "behaviorVerified": [
    "6-dimension normalization produces expected scores",
    "Guardrails enforce caps and floors",
    "Composite score is deterministic"
  ],
  "verdict": "pass|fail"
}

When to skip Tier 2

Mark skipped ONLY for features that literally cannot be tested in the current environment:

  • Air-gap features requiring a disconnected network
  • HSM/eIDAS features requiring physical hardware
  • Multi-datacenter features requiring distributed infrastructure
  • Performance benchmark features requiring dedicated load-gen infrastructure

"The app isn't running" is NOT a skip reason -- it is a failed with category env_issue. "No tests exist" is NOT a skip reason -- write a focused test.

Tier Classification by Module

| Tier 2 Type | Modules | Feature Count |
|---|---|---|
| 2a (API) | Gateway, Router, Api, Platform | ~30 |
| 2b (CLI) | Cli, Tools, Bench | ~110 |
| 2c (UI/Playwright) | Web, ExportCenter, DevPortal, VulnExplorer, PacksRegistry | ~190 |
| 2d (Integration) | Attestor, Policy, Scanner, BinaryIndex, Concelier, Libraries, EvidenceLocker, Orchestrator, Signals, Authority, Signer, Cryptography, ReachGraph, Graph, RiskEngine, Replay, Unknowns, Scheduler, TaskRunner, Timeline, Notifier, Findings, SbomService, Mirror, Feedser, Analyzers | ~700 |
| Manual (skip) | AirGap (subset), SmRemote (HSM), DevOps (infra) | ~25 |

4. State File Format

Per-module state files live at docs/qa/feature-checks/state/<module>.json.

{
  "module": "gateway",
  "featureCount": 8,
  "lastUpdatedUtc": "2026-02-09T12:00:00Z",
  "features": {
    "router-back-pressure-middleware": {
      "status": "queued",
      "tier": 0,
      "retryCount": 0,
      "sourceVerified": null,
      "buildVerified": null,
      "e2eVerified": null,
      "skipReason": null,
      "lastRunId": null,
      "lastUpdatedUtc": "2026-02-09T12:00:00Z",
      "featureFile": "docs/features/unchecked/gateway/router-back-pressure-middleware.md",
      "notes": []
    }
  }
}

State File Rules

  • Single writer: Only the orchestrator writes state files
  • Subagents report back: Subagents return results to the orchestrator via their output; they do NOT write state files directly
  • Atomic updates: Each state transition must update lastUpdatedUtc
  • Append-only notes: The notes array is append-only; never remove entries

5. Run Artifact Format

Each verification run produces artifacts under: docs/qa/feature-checks/runs/<module>/<feature-slug>/<runId>/

Where <runId> = run-001, run-002, etc. (zero-padded, sequential).
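Allocating the next runId from the existing run directories is a one-liner; this sketch assumes directory names always match the run-NNN pattern:

```python
def next_run_id(existing: list[str]) -> str:
    """Return the next zero-padded sequential runId (run-001, run-002, ...)."""
    highest = max((int(name.split("-")[1]) for name in existing), default=0)
    return f"run-{highest + 1:03d}"
```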

Required Artifacts

  • Tier 0 (tier0-source-check.json): { "filesChecked": [...], "found": [...], "missing": [...], "verdict": "pass|fail|partial" }
  • Tier 1 (tier1-build-check.json): { "project": "...", "buildResult": "pass|fail", "testResult": "pass|fail|skipped", "errors": [...] }
  • Tier 2 (tier2-e2e-check.json): { "steps": [{ "description": "...", "result": "pass|fail", "evidence": "..." }], "screenshots": [...] }
  • Triage (triage.json): { "rootCause": "...", "category": "missing_code|bug|config|test_gap|env_issue", "affectedFiles": [...], "confidence": 0.0-1.0 }
  • Confirm (confirmation.json): { "approved": true|false, "reason": "...", "revisedRootCause": "..." }
  • Fix (fix-summary.json): { "filesModified": [...], "testsAdded": [...], "description": "..." }
  • Retest (retest-result.json): { "previousFailures": [...], "retestResults": [...], "verdict": "pass|fail" }


Screenshot Convention

Screenshots for Tier 2 go in <runId>/screenshots/ with names: step-<N>-<description-slug>.png


6. Priority Rules

When selecting the next feature to process, the orchestrator follows this priority order:

  1. retesting - Finish in-progress retests first
  2. fixing - Complete in-progress fixes
  3. confirmed - Confirmed issues ready for fix
  4. triaged - Triaged issues ready for confirmation
  5. failed (retryCount < 3) - Failed features ready for triage
  6. queued - New features not yet checked

Within the same priority level, prefer:

  • Features in smaller modules first (faster to clear a module completely)
  • Features with lower retryCount
  • Alphabetical by feature slug (deterministic ordering)

7. File Movement Rules

On passed -> done

  1. Create the target module directory in checked/ if it doesn't exist
  2. Copy feature file from docs/features/unchecked/<module>/<feature>.md to docs/features/checked/<module>/<feature>.md
  3. Update the status line in the file from IMPLEMENTED to VERIFIED
  4. Append a ## Verification section with the run ID and date
  5. Remove the original from unchecked/
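The path and content changes in this movement reduce to a pure transform; a sketch (function name and exact Verification wording are illustrative):

```python
def promote(path: str, content: str, run_id: str, date: str) -> tuple[str, str]:
    """Return (destination path, updated content) for passed -> done.

    Directory creation, the actual copy, and removal of the original are
    left to the caller.
    """
    new_path = path.replace("/unchecked/", "/checked/", 1)
    updated = content.replace("IMPLEMENTED", "VERIFIED", 1)  # status line
    updated += f"\n## Verification\n\nVerified in {run_id} on {date}.\n"
    return new_path, updated
```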

On not_implemented

  1. Copy feature file from docs/features/unchecked/<module>/<feature>.md to docs/features/unimplemented/<module>/<feature>.md
  2. Update status from IMPLEMENTED to PARTIALLY_IMPLEMENTED
  3. Add notes about what was missing
  4. Remove the original from unchecked/

On blocked

  • Do NOT move the file
  • Add a ## Blocked section to the feature file in unchecked/ with the reason
  • The feature stays in unchecked/ until a human unblocks it

8. Agent Contracts

stella-orchestrator

  • Reads: State files, feature files (to pick next work)
  • Writes: State files, moves feature files on pass/fail
  • Dispatches: Subagents with specific feature context
  • Rule: NEVER run checks itself; always delegate to subagents

stella-feature-checker

  • Receives: Feature file path, current tier, module info
  • Reads: Feature .md file, source code files, build output
  • Executes: File existence checks, dotnet build, dotnet test, Playwright CLI
  • Returns: Tier check results (JSON) to orchestrator
  • Rule: Read-only on feature files; never modify source code; never write state

stella-issue-finder

  • Receives: Check failure details, feature file path
  • Reads: Source code in the relevant module, test files, build errors
  • Returns: Triage JSON with root cause, category, affected files, confidence
  • Rule: Read-only; never modify files; fast analysis

stella-issue-confirmer

  • Receives: Triage JSON, feature file path
  • Reads: Same source code as finder, plus broader context
  • Returns: Confirmation JSON (approved/rejected with reason)
  • Rule: Read-only; never modify files; thorough analysis

stella-fixer

  • Receives: Confirmed triage, feature file path, affected files list
  • Writes: Source code fixes, new/updated tests
  • Returns: Fix summary JSON
  • Rule: Only modify files listed in confirmed triage; add tests for every change; follow CODE_OF_CONDUCT.md

stella-retester

  • Receives: Feature file path, previous failure details, fix summary
  • Executes: Same checks as feature-checker for the tiers that previously failed
  • Returns: Retest result JSON
  • Rule: Same constraints as feature-checker; never modify source code

9. Environment Prerequisites

Before running Tier 1+ checks, ensure:

Backend (.NET)

# Verify .NET SDK is available
dotnet --version  # Expected: 10.0.x

# Verify the solution builds
dotnet build src/StellaOps.sln --no-restore

Frontend (Angular)

# Verify Node.js and Angular CLI
node --version    # Expected: 22.x
npx ng version   # Expected: 21.x

# Build the frontend
cd src/Web/StellaOps.Web && npm ci && npx ng build

Playwright (Tier 2 only)

npx playwright install chromium

Application Runtime (Tier 2 only)

# Start backend + frontend (if docker compose exists)
docker compose -f devops/compose/docker-compose.dev.yml up -d

# Or run individually
cd src/Web/StellaOps.Web && npx ng serve &

If the environment cannot be brought up, record Tier 2 checks as failed with category env_issue (per Section 3, "the app isn't running" is not a skip reason; skipped is reserved for features that cannot be tested in any environment).


10. Cost Estimation

| Tier | Per Feature | 1,144 Features | Notes |
|---|---|---|---|
| Tier 0 | ~$0.01 | ~$11 | File existence only |
| Tier 1 | ~$0.05 | ~$57 | Build + test |
| Tier 2 | ~$0.50 | ~$165 (330 UI features) | Playwright + Opus |
| Triage | ~$0.10 | ~$30 (est. 300 failures) | Sonnet |
| Confirm | ~$0.15 | ~$30 (est. 200 confirmed) | Opus |
| Fix | ~$0.50 | ~$75 (est. 150 fixes) | o3 |
| Retest | ~$0.20 | ~$30 (est. 150 retests) | Opus |
| Total | | ~$400 | Conservative estimate |

Run Tier 0 first to filter out not_implemented features before spending on higher tiers.