# Post-Incident Testing Guide

**Version:** 1.0
**Status:** Turn #6 Implementation
**Audience:** StellaOps developers, QA engineers, incident responders

---

## Overview

Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.

### Key Principles

1. **Permanent Regression**: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
2. **Deterministic Replay**: Tests are generated from replay manifests captured during the incident.
3. **Severity-Gated**: P1/P2 incident tests block releases; P3/P4 tests never block (P3 failures raise a warning, P4 failures are informational).
4. **Traceable**: Every incident test links back to the incident report and fix.

---

## Workflow

### 1. Incident Triggers Replay Capture

When an incident occurs, the replay infrastructure automatically captures:

- Event sequences with correlation IDs
- Input data (sanitized for PII)
- System state at time of incident
- Configuration and policy digests

This produces a **replay manifest** stored in the Evidence Locker.
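
This guide does not specify the manifest schema. As a rough illustration only, the sketch below models the captured items with hypothetical types and loads them with `System.Text.Json`. `ReplayManifest`, `ReplayEvent`, `ReplayManifestLoader`, and every field name are assumptions made for this example, not the actual Evidence Locker format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical manifest shape; the real Evidence Locker schema may differ.
public sealed record ReplayEvent(
    string CorrelationId,
    DateTimeOffset Timestamp,
    string Type,
    string PayloadJson);

public sealed record ReplayManifest(
    string IncidentId,
    IReadOnlyList<ReplayEvent> Events,
    string ConfigDigest,
    string PolicyDigest);

public static class ReplayManifestLoader
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Accept camelCase JSON for the PascalCase record members.
        PropertyNameCaseInsensitive = true
    };

    // Parses the manifest and rejects documents that carry no events.
    public static ReplayManifest Load(string json)
    {
        var manifest = JsonSerializer.Deserialize<ReplayManifest>(json, Options)
            ?? throw new InvalidOperationException("Manifest JSON was null.");
        if (manifest.Events is null || manifest.Events.Count == 0)
            throw new InvalidOperationException("Manifest contains no events.");
        return manifest;
    }
}
```

Failing fast on an empty event list is a design choice for the sketch: a manifest without events cannot drive a replay, so it is better caught before a scaffold is generated from it.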

### 2. Generate Test Scaffold

Use the `IncidentTestGenerator` to create a test scaffold from the replay manifest:

```csharp
using System;
using System.IO;
using StellaOps.TestKit.Incident;

// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");

// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};

// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);

// Output the generated test code (create the target directory if it is missing)
var code = scaffold.GenerateTestCode();
Directory.CreateDirectory("Tests");
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);
```

### 3. Review and Complete Test

The generated scaffold is a starting point. A human must:

1. **Review fixtures**: Ensure input data is appropriate and sanitized.
2. **Complete assertions**: Add specific assertions for the expected behavior.
3. **Verify determinism**: Ensure the test produces consistent results.
4. **Add to CI**: Include the test in the appropriate test project.
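
One way to approach the determinism check in step 3 is to run the scenario several times and compare a digest of its output. The sketch below is a minimal illustration; `DeterminismCheck` and its `scenario` delegate are hypothetical helpers written for this guide, not part of TestKit.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DeterminismCheck
{
    // Runs the scenario `runs` times and returns true when every run
    // yields byte-identical output (compared via SHA-256 digests).
    public static bool IsDeterministic(Func<string> scenario, int runs = 5)
    {
        if (runs < 2) throw new ArgumentOutOfRangeException(nameof(runs));

        string? first = null;
        for (var i = 0; i < runs; i++)
        {
            var bytes = Encoding.UTF8.GetBytes(scenario());
            var digest = Convert.ToHexString(SHA256.HashData(bytes));
            first ??= digest;
            if (digest != first) return false;
        }
        return true;
    }
}
```

A scenario that serializes its result to a canonical string (for example, sorted JSON) works best here, since incidental ordering differences would otherwise show up as false failures.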

### 4. Register for Tracking

Register the incident test for reporting:

```csharp
generator.RegisterIncidentTest(metadata.IncidentId, scaffold);

// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");
```

---

## Incident Metadata

The `IncidentMetadata` record captures essential incident context:

| Property | Required | Description |
|----------|----------|-------------|
| `IncidentId` | Yes | Unique identifier from incident management system |
| `OccurredAt` | Yes | When the incident occurred (UTC) |
| `RootCause` | Yes | Brief description of the root cause |
| `AffectedModules` | Yes | Modules impacted by the incident |
| `Severity` | Yes | P1 (critical) through P4 (low impact) |
| `Title` | No | Short descriptive title |
| `ReportUrl` | No | Link to incident report or postmortem |
| `ResolvedAt` | No | When the incident was resolved |
| `CorrelationIds` | No | IDs for replay matching |
| `FixTaskId` | No | Sprint task that implemented the fix |
| `Tags` | No | Categorization tags |

### Severity Levels

| Severity | Description | CI Behavior |
|----------|-------------|-------------|
| P1 | Critical: service down, data loss, security breach | Blocks releases |
| P2 | Major: significant degradation, partial outage | Blocks releases |
| P3 | Minor: limited impact, workaround available | Warning only |
| P4 | Low: cosmetic issues, minor bugs | Informational |
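
The mapping in the severity table above can be expressed as a small function, which is handy when a CI script has to decide what a failing test means. This is a sketch under assumptions: `Severity`, `GateOutcome`, and `SeverityGate` are hypothetical names that mirror the table, and the real `IncidentSeverity` type lives in `StellaOps.TestKit.Incident`.

```csharp
using System;

// Hypothetical mirror of the severity levels in the table above.
public enum Severity { P1, P2, P3, P4 }

// What a failing incident test of a given severity does to the pipeline.
public enum GateOutcome { Block, Warn, Inform }

public static class SeverityGate
{
    public static GateOutcome OutcomeForFailure(Severity severity) => severity switch
    {
        Severity.P1 or Severity.P2 => GateOutcome.Block,
        Severity.P3 => GateOutcome.Warn,
        Severity.P4 => GateOutcome.Inform,
        _ => throw new ArgumentOutOfRangeException(nameof(severity))
    };
}
```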

---

## Generated Test Structure

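The scaffold generates a test class like the following, with traits for category, incident ID, and severity, and placeholder Arrange/Act/Assert sections: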

```csharp
[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
```
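
To give a feel for how the placeholders might be completed for this particular incident, the sketch below reproduces the race with an in-memory stand-in. `InMemoryBundleStore` and `RaceConditionScenario` are hypothetical types invented for this example; a real test would exercise the actual EvidenceLocker code path and its storage constraint.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the bundle store. The composite key
// mimics a unique constraint on (tenant_id, bundle_id).
public sealed class InMemoryBundleStore
{
    private readonly ConcurrentDictionary<(string TenantId, string BundleId), byte> _bundles = new();

    // Returns true only for the caller that actually created the bundle.
    public bool TryCreate(string tenantId, string bundleId) =>
        _bundles.TryAdd((tenantId, bundleId), 0);

    public int Count => _bundles.Count;
}

public static class RaceConditionScenario
{
    // Act: many workers race to create the same bundle.
    // Returns how many of them reported a successful creation.
    public static async Task<int> RunAsync(InMemoryBundleStore store, int workers)
    {
        var attempts = Enumerable.Range(0, workers)
            .Select(_ => Task.Run(() => store.TryCreate("tenant-a", "bundle-1")));
        var results = await Task.WhenAll(attempts);
        return results.Count(created => created);
    }
}
```

In the scaffold's Assert section, the check would then be that exactly one worker succeeded and that the store holds exactly one bundle, regardless of how many workers raced.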

---

## CI Integration

### Test Filtering

Filter post-incident tests in CI:

```bash
# Run all post-incident tests
dotnet test --filter "Category=PostIncident"

# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"

# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"

# Run tests for a specific module (assumes the tests carry a Module:<name> trait)
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
```

### CI Lanes

| Lane | Filter | Trigger | Behavior |
|------|--------|---------|----------|
| PR Gate | `Category=PostIncident&(Severity=P1\|Severity=P2)` | Pull requests | Blocks merge |
| Release Gate | `Category=PostIncident` | Release builds | P1/P2 block, P3/P4 warn |
| Nightly | `Category=PostIncident` | Scheduled | Full run, report only |
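
When several pipelines need the lane filters from the table above, composing the expression in one place helps avoid drift between them. The helper below is a minimal sketch; `IncidentTestFilter` is a hypothetical name, and the only thing it relies on is the `Category` and `Severity` traits already shown in this guide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IncidentTestFilter
{
    // Builds a `dotnet test --filter` expression for the given severities.
    // An empty collection selects every post-incident test.
    public static string ForSeverities(IReadOnlyCollection<string> severities)
    {
        const string category = "Category=PostIncident";
        if (severities.Count == 0) return category;

        var clause = string.Join("|", severities.Select(s => $"Severity={s}"));
        return $"{category}&({clause})";
    }
}
```

Calling `IncidentTestFilter.ForSeverities(new[] { "P1", "P2" })` yields the PR Gate expression, and passing an empty array yields the Release Gate and Nightly expression.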

### Example CI Configuration

```yaml
# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]

jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'
```

---

## Best Practices

### 1. Sanitize Fixtures

Remove or mask any PII or sensitive data from replay fixtures:

```csharp
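// Before storing the fixture, replace known sensitive values.
// `fixture`, `userEmail`, and `apiKey` are strings taken from the captured data.
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");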
```
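
Literal replacement only works when the sensitive values are known in advance. As a complementary sketch, the pattern-based sanitizer below masks anything that looks like an email address or a secret-bearing JSON field. `FixtureSanitizer` and the field names `apiKey`, `token`, and `password` are assumptions for illustration, and the email pattern is deliberately simple rather than standards-complete.

```csharp
using System.Text.RegularExpressions;

public static class FixtureSanitizer
{
    // Deliberately simple email pattern; it does not cover every valid address.
    private static readonly Regex Email =
        new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    // Matches JSON string fields named apiKey, token, or password (assumed names).
    private static readonly Regex Secret =
        new("\"(apiKey|token|password)\"\\s*:\\s*\"[^\"]*\"",
            RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Masks emails first, then replaces secret field values with "REDACTED".
    public static string Sanitize(string fixture)
    {
        var result = Email.Replace(fixture, "user@example.com");
        return Secret.Replace(result, m => $"\"{m.Groups[1].Value}\": \"REDACTED\"");
    }
}
```

Pattern matching can miss sensitive data in unexpected fields, so a human review of the fixture remains necessary.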

### 2. Use Deterministic Infrastructure

Ensure incident tests use TestKit's deterministic primitives:

```csharp
// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);

// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
```
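
To illustrate what such primitives do under the hood, the sketch below builds a fixed clock on .NET's `TimeProvider` (available in .NET 8 and later) and a seeded random sequence. `FixedTimeProvider` and `SeededSequence` are hypothetical types written for this guide; they are not TestKit's `DeterministicTime` or `DeterministicRandom`, whose actual behavior may differ.

```csharp
using System;

// Hypothetical fixed clock built on .NET's TimeProvider.
public sealed class FixedTimeProvider : TimeProvider
{
    private DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset start) => _now = start.ToUniversalTime();

    // Always returns the controlled instant instead of the wall clock.
    public override DateTimeOffset GetUtcNow() => _now;

    // Time moves only when the test says so.
    public void Advance(TimeSpan delta) => _now += delta;
}

public static class SeededSequence
{
    // The same seed yields the same sequence on a given runtime.
    public static int[] Next(int seed, int count)
    {
        var random = new Random(seed);
        var values = new int[count];
        for (var i = 0; i < count; i++) values[i] = random.Next();
        return values;
    }
}
```

Code under test that accepts a `TimeProvider` instead of calling `DateTimeOffset.UtcNow` directly can be replayed at the exact instant of the incident.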

### 3. Document the Incident

Include comprehensive documentation in the test:

```csharp
/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>
```

### 4. Link to Sprint Tasks

Connect incident tests to the fix implementation:

```csharp
[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
```

### 5. Evolve Tests Over Time

Incident tests may need updates as the codebase evolves:

- Update fixtures when schemas change
- Adjust assertions when behavior intentionally changes
- Add new scenarios discovered during subsequent incidents

---

## Troubleshooting

### Manifest Not Available

If the replay manifest wasn't captured:

1. Check Evidence Locker for any captured events
2. Reconstruct the scenario from logs and metrics
3. Create a synthetic manifest for testing
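
The reconstruction steps above could look roughly like the sketch below, which filters log lines by correlation ID, orders them by time, and emits a JSON document flagged as synthetic. Everything here is an assumption for illustration: `SyntheticManifestBuilder`, the pipe-delimited log format, and the output field names are invented and would need adapting to the real logs and manifest schema.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.Json;

public static class SyntheticManifestBuilder
{
    // Assumes log lines shaped as "<ISO-8601 timestamp>|<correlationId>|<eventType>".
    // That format is invented for this sketch.
    public static string Build(string incidentId, string correlationId, IEnumerable<string> logLines)
    {
        var events = logLines
            .Select(line => line.Split('|'))
            .Where(parts => parts.Length == 3 && parts[1] == correlationId)
            .Select(parts => new
            {
                timestamp = DateTimeOffset.Parse(parts[0], CultureInfo.InvariantCulture),
                correlationId = parts[1],
                type = parts[2]
            })
            .OrderBy(e => e.timestamp)
            .ToList();

        // Mark the manifest as synthetic so reviewers know it was reconstructed.
        return JsonSerializer.Serialize(new { incidentId, synthetic = true, events });
    }
}
```

Flagging the manifest as synthetic keeps reconstructed evidence distinguishable from data captured automatically during the incident.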

### Flaky Incident Tests

If the test is non-deterministic:

1. Identify non-deterministic inputs (time, random, external state)
2. Replace with TestKit deterministic primitives
3. Add retry logic only as a last resort

### Test No Longer Relevant

If the fix makes the scenario impossible:

1. Document why the test is no longer applicable
2. Move to an "archived incidents" test category
3. Keep the test for documentation purposes

---

## Related Documentation

- [TestKit Usage Guide](testkit-usage-guide.md)
- [Testing Practices](../../code-of-conduct/TESTING_PRACTICES.md)
- [CI Quality Gates](ci-quality-gates.md)
- [Replay Infrastructure](../../modules/replay/architecture.md)

---

## Changelog

### v1.0 (2026-01-27)
- Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
- CI integration patterns
- Best practices and troubleshooting