save checkpoint

This commit is contained in:
master
2026-02-11 01:32:14 +02:00
parent 5593212b41
commit cf5b72974f
2316 changed files with 68799 additions and 3808 deletions


All agents in the pipeline MUST read this document before taking any action.
> **THE PRIMARY GOAL IS END-TO-END BEHAVIORAL VERIFICATION.**
>
> This pipeline exists to prove that features **actually work** by exercising them
> as a real user would -- through APIs, CLIs, UIs, and integration tests. Tier 0
> (file checks) and Tier 1 (build + unit tests) are necessary prerequisites, but
> they are NOT the goal. **Tier 2 (E2E behavioral verification) is the goal.**
>
> Agents MUST:
> 1. Start Docker / Docker Desktop before running any checks
> 2. Set up required containers (Postgres, Redis, RabbitMQ, etc.)
> 3. Start application services needed for behavioral testing
> 4. Run ALL tiers including Tier 2 -- never stop at Tier 1
> 5. Act as a user: call APIs, run CLI commands, interact with UIs
>
> **Skipping Tier 2 is a verification failure, not a verification pass.**
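The five steps above can be sketched as a single preflight routine. The probes are injected as arguments so the ordering logic itself is testable; the real commands shown in the usage comment are assumptions about the local setup, not fixed pipeline contracts:

```shell
#!/usr/bin/env bash
# Sketch of the agent checklist above as an ordered preflight.
preflight() {
  local docker_probe="$1" compose_up="$2" services_up="$3"
  $docker_probe || { echo "step 1 failed: Docker is not running"; return 1; }
  $compose_up   || { echo "step 2 failed: containers did not start"; return 2; }
  $services_up  || { echo "step 3 failed: app services not reachable"; return 3; }
  echo "environment ready: run Tier 0, Tier 1, AND Tier 2 -- never stop at Tier 1"
}

# Assumed real invocation (see Section 9 for the actual setup commands):
#   preflight "docker info" \
#             "docker compose -f devops/compose/docker-compose.dev.yml up -d" \
#             "curl -sf http://localhost:5000/health"
preflight true true true
```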
---
## 1. Directory Layout
- Blocked features require human review before re-entering the pipeline
- Each retry increments `retryCount` in state
### 2.4 Skip Criteria (STRICT - almost nothing qualifies)
Features may ONLY be marked `skipped` if they match one of these 3 physical constraints:
- `hardware_required`: Requires physical HSM, smart card, or eIDAS hardware token
- `multi_datacenter`: Requires geographically distributed infrastructure
- `air_gap_network`: Requires a physically disconnected network
**Everything else MUST be tested.** Features that were previously classified as
"performance benchmarking" or "multi-node cluster" should be tested with whatever
scale is available locally (single-node Docker, local containers). Partial behavioral
verification is better than no verification.
The checker agent determines skip eligibility during Tier 0 and MUST justify the
skip with one of the 3 reasons above. Any other reason is invalid.
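As a sketch, skip eligibility reduces to a whitelist check against exactly these three reason codes (the function name is illustrative, not part of the pipeline contract):

```shell
#!/usr/bin/env bash
# Sketch: whitelist check for the 3 valid skip reasons above.
is_valid_skip_reason() {
  case "$1" in
    hardware_required|multi_datacenter|air_gap_network) return 0 ;;
    *) return 1 ;;
  esac
}

is_valid_skip_reason "hardware_required" && echo "valid skip"
is_valid_skip_reason "performance_benchmark" || echo "invalid: test it instead"
```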
---
**Cost**: ~0.10 USD per feature (compile + test execution + code reading)
### Tier 2: Behavioral Verification (API / CLI / UI) -- THE MAIN PURPOSE
**Purpose**: Verify the feature works end-to-end by actually exercising it through
its external interface. This is the only tier that proves the feature WORKS, not
just that code exists. **This is the primary reason the verification pipeline exists.**
**EVERY feature MUST have a Tier 2 check. E2E tests MUST NOT be skipped.** The whole
point of this pipeline is to act as a user and verify the software works. Tier 0 and
Tier 1 are prerequisites -- Tier 2 is the actual verification.
**If the environment is not set up, set it up.** If Docker is not running, start it.
If containers are not running, start them. If the app is not running, start it.
"Environment not ready" is never an excuse to skip -- it is a setup step the agent
must perform (see Section 9).
The check type depends on the module's external surface.
### Tier 2 Acceptance Gate (HARD REQUIREMENT)
A Tier 2 run is valid only if ALL of the following are true:
1. It uses a new run directory (`run-XYZ`) created for the current execution.
2. It contains fresh evidence captured in this run (new timestamps and new command/request outputs).
3. It includes user-surface interactions (HTTP requests, CLI invocations, or UI interactions), not only library test counts.
4. It verifies both positive and negative behavior paths when the feature has error semantics.
5. For rechecks, at least one new user transaction per feature is captured in the new run.
The following are forbidden and invalidate Tier 2:
- Copying a previous run directory and only editing `runId`, timestamps, or summary text.
- Declaring Tier 2 pass from suite totals alone without fresh request/response, command output, or UI step evidence.
- Reusing screenshots or response payloads from prior runs without replaying the interaction.
If any forbidden shortcut is detected, mark the feature `failed` with category `test_gap`
and rerun Tier 2 from scratch.
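A minimal sketch of satisfying rule 1, assuming zero-padded sequential `run-XYZ` directory names under a runs root (the root path argument is illustrative; the real layout may differ):

```shell
#!/usr/bin/env bash
# Sketch: allocate the NEXT zero-padded run directory (run-001, run-002, ...)
# so Tier 2 evidence is always captured fresh, never into a copied directory.
next_run_id() {
  local root="$1" last
  last=$(ls "$root" 2>/dev/null | grep -E '^run-[0-9]{3}$' | sort | tail -n 1 || true)
  if [ -z "$last" ]; then
    echo "run-001"
  else
    printf 'run-%03d\n' $(( 10#${last#run-} + 1 ))
  fi
}

root=$(mktemp -d)
mkdir "$root/run-001" "$root/run-002"
next_run_id "$root"   # -> run-003
```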
#### Tier 2a: API Testing (Gateway, Router, Api, Platform, backend services with HTTP endpoints)
{
"type": "api",
"baseUrl": "http://localhost:5000",
"capturedAtUtc": "2026-02-10T12:00:00Z",
"requests": [
{
"description": "Verify spoofed identity header is stripped",
"actualStatus": 200,
"assertion": "Response X-Forwarded-User header matches authenticated user, not 'attacker'",
"result": "pass|fail",
"evidence": "actual response headers/body"
"evidence": "actual response headers/body",
"requestCapturedAtUtc": "2026-02-10T12:00:01Z",
"responseSnippet": "HTTP/1.1 200 ..."
}
],
"verdict": "pass|fail|skip"
```json
{
"type": "cli",
"capturedAtUtc": "2026-02-10T12:00:00Z",
"commands": [
{
"description": "Verify baseline selection with last-green strategy",
"actualExitCode": 0,
"expectedOutput": "Using baseline: ...",
"actualOutput": "...",
"result": "pass|fail"
"result": "pass|fail",
"commandCapturedAtUtc": "2026-02-10T12:00:01Z"
}
],
"verdict": "pass|fail|skip"
{
"type": "ui",
"baseUrl": "http://localhost:4200",
"capturedAtUtc": "2026-02-10T12:00:00Z",
"steps": [
{
"description": "Navigate to /release-orchestrator/runs",
"target": "/release-orchestrator/runs",
"expected": "Runs list table renders with columns",
"result": "pass|fail",
"screenshot": "step-1-runs-list.png"
"screenshot": "step-1-runs-list.png",
"stepCapturedAtUtc": "2026-02-10T12:00:01Z"
}
],
"verdict": "pass|fail|skip"
```json
{
"type": "integration",
"capturedAtUtc": "2026-02-10T12:00:00Z",
"testFilter": "FullyQualifiedName~EwsCalculatorTests",
"testsRun": 21,
"testsPassed": 21,
}
```
### When to skip Tier 2 (ALMOST NEVER)
**Default: Tier 2 is MANDATORY.** Agents must exhaust all options before marking skip.
"The app isn't running" is NOT a skip reason -- it's a `failed` with `env_issue`.
"No tests exist" is NOT a skip reason -- write a focused test.
The ONLY acceptable skip reasons (must match exactly one):
- `hardware_required`: Feature requires physical HSM, smart card, or eIDAS token
- `multi_datacenter`: Feature requires geographically distributed infrastructure
- `air_gap_network`: Feature requires a physically disconnected network (not just no internet)
**These are NOT valid skip reasons:**
- "The app isn't running" -- **START IT** (see Section 9). If it won't start, mark `failed` with `env_issue`.
- "Docker isn't running" -- **START DOCKER** (see Section 9.0). If it won't start, mark `failed` with `env_issue`.
- "No E2E tests exist" -- **WRITE ONE.** A focused behavioral test that exercises the feature as a user would.
- "The database isn't set up" -- **SET IT UP** using Docker containers (see Section 9.1).
- "Environment not ready" -- **PREPARE IT.** That is part of the agent's job, not an excuse.
- "Too complex to test" -- Break it into smaller testable steps. Test what you can.
- "Only unit tests needed" -- Unit tests are Tier 1. Tier 2 is behavioral/integration/E2E.
- "Application not running" -- See "The app isn't running" above.
**If an agent skips Tier 2 without one of the 3 valid reasons above, the entire
feature verification is INVALID and must be re-run.**
### Tier Classification by Module
Where `<runId>` = `run-001`, `run-002`, etc. (zero-padded, sequential).
| Fix | `fix-summary.json` | `{ "filesModified": [...], "testsAdded": [...], "description": "..." }` |
| Retest | `retest-result.json` | `{ "previousFailures": [...], "retestResults": [...], "verdict": "pass\|fail" }` |
### Artifact Freshness Rules (MANDATORY)
- Every new run (`run-XYZ`) MUST be generated from fresh execution, not by copying prior run files.
- Every Tier 2 artifact MUST include `capturedAtUtc` and per-step/per-command/per-request capture times.
- Evidence fields MUST contain fresh raw output from the current run (response snippets, command output, screenshots, or logs).
- Recheck runs MUST include at least one newly captured user interaction per feature in that run directory.
- If a previous run is reused as input for convenience, that run is INVALID until all Tier 2 evidence files are regenerated.
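A minimal freshness probe for the `capturedAtUtc` field required above. This is a sketch, assuming GNU `date`; the `sed` extraction is a convenience to avoid a `jq` dependency, and the function name is illustrative:

```shell
#!/usr/bin/env bash
# Sketch: reject a Tier 2 artifact whose capturedAtUtc predates the run start.
artifact_is_fresh() {
  local artifact="$1" run_start_epoch="$2" captured captured_epoch
  captured=$(sed -n 's/.*"capturedAtUtc": *"\([^"]*\)".*/\1/p' "$artifact" | head -n 1)
  [ -n "$captured" ] || return 1                 # missing field counts as stale
  captured_epoch=$(date -u -d "$captured" +%s)   # GNU date assumed
  [ "$captured_epoch" -ge "$run_start_epoch" ]
}

run_start=$(date -u +%s)
f=$(mktemp)
printf '{ "type": "api", "capturedAtUtc": "%s" }\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$f"
artifact_is_fresh "$f" "$run_start" && echo "fresh evidence"
```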
### Screenshot Convention
Screenshots for Tier 2 go in `<runId>/screenshots/` with names:
### stella-feature-checker
- **Receives**: Feature file path, current tier, module info
- **Reads**: Feature .md file, source code files, build output
- **Executes**: File existence checks, `dotnet build`, `dotnet test`, Playwright CLI, Docker commands
- **Returns**: Tier check results (JSON) to orchestrator
- **Rule**: Read-only on feature files; never modify source code; never write state
- **MUST**: Set up required infrastructure (Docker, containers, databases) before testing.
Environment setup is part of the checker's job. If Docker is not running, start it.
If containers are needed, spin them up. If the app needs to be running, start it.
The checker MUST leave the environment in a testable state before running Tier 2.
- **MUST NOT**: copy a previous run's artifacts to satisfy Tier 2. Checker must capture fresh user-surface evidence for each run.
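A sketch of how the checker might self-audit for the copied-run shortcut, assuming evidence files live directly in each run directory (the function name and layout are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: self-audit for the forbidden copied-run shortcut. If every file in
# the new run is byte-identical to the previous run, no fresh evidence was
# captured in it.
copied_run() {
  local prev="$1" curr="$2"
  # diff -rq prints one line per differing/extra file; empty output => a copy
  [ -z "$(diff -rq "$prev" "$curr" 2>/dev/null)" ]
}

prev=$(mktemp -d); curr=$(mktemp -d)
echo '{"result":"pass"}' > "$prev/api-check.json"
cp "$prev/api-check.json" "$curr/api-check.json"
copied_run "$prev" "$curr" && echo "INVALID: rerun Tier 2 with fresh evidence"
```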
### stella-issue-finder
- **Receives**: Check failure details, feature file path
---
## 9. Environment Prerequisites (MANDATORY - DO NOT SKIP)
The verification pipeline exists to prove features **actually work** from a user's
perspective. Agents MUST set up the full runtime environment before running checks.
Skipping environment setup is NEVER acceptable.
### 9.0 Docker / Container Runtime (MUST BE RUNNING FIRST)
Docker is required for Tier 1 tests (Testcontainers for Postgres, Redis, RabbitMQ, etc.)
and for Tier 2 behavioral checks (running services). **Start Docker before anything else.**
```bash
# Step 1: Ensure Docker Desktop is running (Windows/macOS)
# On Windows: Start Docker Desktop from Start Menu or:
Start-Process "C:\Program Files\Docker\Docker\Docker Desktop.exe"
# On macOS:
open -a Docker
# Step 2: Wait for Docker to be ready
docker info > /dev/null 2>&1
# If this fails, Docker is not running. DO NOT proceed without Docker.
# Retry up to 60 seconds:
for i in $(seq 1 12); do docker info > /dev/null 2>&1 && break || sleep 5; done
# Step 3: Verify Docker is functional
docker ps # Should return (possibly empty) container list without errors
```
**If Docker is not available or cannot start:** Mark all affected features as
`failed` with category `env_issue` and note `"Docker unavailable"`. Do NOT mark
them as `skipped` -- infrastructure failures are failures, not skips.
### 9.1 Container Setup and Cleanup
Before running tests, ensure a clean container state:
```bash
# Clean up any stale containers from previous runs
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true
# Pull required images
docker compose -f devops/compose/docker-compose.dev.yml pull
# Start infrastructure services (Postgres, Redis, RabbitMQ, etc.)
docker compose -f devops/compose/docker-compose.dev.yml up -d
# Wait for services to be healthy (check health status)
docker compose -f devops/compose/docker-compose.dev.yml ps
# Verify all services show "healthy" or "running"
# If no docker-compose file exists, start minimum required services manually:
docker run -d --name stella-postgres -e POSTGRES_PASSWORD=stella -e POSTGRES_DB=stellaops -p 5432:5432 postgres:16-alpine
docker run -d --name stella-redis -p 6379:6379 redis:7-alpine
docker run -d --name stella-rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management-alpine
```
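The "wait for services to be healthy" step above can be sketched as a generic readiness poll; the probe commands in the usage comment are assumptions about the local setup:

```shell
#!/usr/bin/env bash
# Sketch: poll a readiness probe until it succeeds or a timeout elapses.
wait_until_ready() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 2
  done
}

# Assumed probes for the services started above:
#   wait_until_ready 60 docker exec stella-postgres pg_isready -U postgres
#   wait_until_ready 60 curl -sf http://localhost:15672
wait_until_ready 5 true && echo "ready"
```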
### 9.2 Backend (.NET)
```bash
# Verify .NET SDK is available
dotnet --version # Expected: 10.0.x
# Restore and build the solution
dotnet restore src/StellaOps.sln
dotnet build src/StellaOps.sln
```
### 9.3 Frontend (Angular)
```bash
# Verify Node.js and Angular CLI
node --version # Expected: 22.x
npx ng version # Expected: 21.x
# Install dependencies and build
cd src/Web/StellaOps.Web && npm ci && npx ng build
```
### 9.4 Playwright (Tier 2c UI testing)
```bash
npx playwright install chromium
```
### 9.5 Application Runtime (Tier 2 - ALL behavioral checks)
The application MUST be running for Tier 2 checks. This is not optional.
```bash
# Option A: Docker Compose (preferred - starts everything)
docker compose -f devops/compose/docker-compose.dev.yml up -d
# Option B: Run services individually
# Backend API:
dotnet run --project src/Gateway/StellaOps.Gateway.WebService/StellaOps.Gateway.WebService.csproj &
# Frontend:
cd src/Web/StellaOps.Web && npx ng serve &
# Verify services are reachable
curl -s http://localhost:5000/health || echo "Backend not reachable"
curl -s http://localhost:4200 || echo "Frontend not reachable"
```
### 9.6 Environment Teardown (after all checks complete)
```bash
# Stop and clean up all containers
docker compose -f devops/compose/docker-compose.dev.yml down --volumes --remove-orphans 2>/dev/null || true
docker rm -f stella-postgres stella-redis stella-rabbitmq 2>/dev/null || true
```
---