git.stella-ops.org/docs/technical/testing/post-incident-testing-guide.md
2026-01-28 02:30:48 +02:00


Post-Incident Testing Guide

Version: 1.0
Status: Turn #6 Implementation
Audience: StellaOps developers, QA engineers, incident responders


Overview

Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.

Key Principles

  1. Permanent Regression: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
  2. Deterministic Replay: Tests are generated from replay manifests captured during the incident.
  3. Severity-Gated: P1/P2 incident tests block releases; P3/P4 tests are warning-only.
  4. Traceable: Every incident test links back to the incident report and fix.

Workflow

1. Incident Triggers Replay Capture

When an incident occurs, the replay infrastructure automatically captures:

  • Event sequences with correlation IDs
  • Input data (sanitized for PII)
  • System state at time of incident
  • Configuration and policy digests

This produces a replay manifest stored in the Evidence Locker.

2. Generate Test Scaffold

Use the IncidentTestGenerator to create a test scaffold from the replay manifest:

using System;
using System.IO;
using StellaOps.TestKit.Incident;

// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");

// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};

// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);

// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);

3. Review and Complete Test

The generated scaffold is a starting point. A human must:

  1. Review fixtures: Ensure input data is appropriate and sanitized.
  2. Complete assertions: Add specific assertions for the expected behavior.
  3. Verify determinism: Ensure the test produces consistent results.
  4. Add to CI: Include the test in the appropriate test project.
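
The determinism check in step 3 can be partly automated by running the scenario several times and comparing a digest of each result. The following is a minimal sketch, assuming the scenario can be expressed as a function that returns its output as a string. `DeterminismCheck`, `Digest`, and `IsDeterministic` are hypothetical names used for illustration and are not part of TestKit.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper (not part of StellaOps.TestKit): runs a scenario
// repeatedly and reports whether every run produced identical output.
public static class DeterminismCheck
{
    // Hashes the scenario output so large payloads compare cheaply.
    public static string Digest(string output) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(output)));

    // Returns true only if all runs yield the same digest as the first run.
    public static bool IsDeterministic(Func<string> scenario, int runs = 5)
    {
        if (runs < 2) throw new ArgumentOutOfRangeException(nameof(runs));

        var first = Digest(scenario());
        for (var i = 1; i < runs; i++)
        {
            if (Digest(scenario()) != first) return false;
        }
        return true;
    }
}
```

A test could call `DeterminismCheck.IsDeterministic(() => RunScenario())`, where `RunScenario` stands for whatever replays the incident. A passing check does not prove determinism; it only catches variation that shows up within the chosen number of runs.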

4. Register for Tracking

Register the incident test for reporting:

generator.RegisterIncidentTest(metadata.IncidentId, scaffold);

// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");

Incident Metadata

The IncidentMetadata record captures essential incident context:

Property         Required  Description
IncidentId       Yes       Unique identifier from incident management system
OccurredAt       Yes       When the incident occurred (UTC)
RootCause        Yes       Brief description of the root cause
AffectedModules  Yes       Modules impacted by the incident
Severity         Yes       P1 (critical) through P4 (low impact)
Title            No        Short descriptive title
ReportUrl        No        Link to incident report or postmortem
ResolvedAt       No        When the incident was resolved
CorrelationIds   No        IDs for replay matching
FixTaskId        No        Sprint task that implemented the fix
Tags             No        Categorization tags

Severity Levels

Severity  Description                                         CI Behavior
P1        Critical: service down, data loss, security breach  Blocks releases
P2        Major: significant degradation, partial outage      Blocks releases
P3        Minor: limited impact, workaround available         Warning only
P4        Low: cosmetic issues, minor bugs                    Informational
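
The mapping from severity to CI behavior can be written down as a small lookup, which is useful when a pipeline step has to decide whether a failing incident test should fail the build. This is a sketch only: the `IncidentSeverity` enum here is a local stand-in for the TestKit type, and `CiBehavior` and `SeverityGate` are hypothetical names.

```csharp
using System;

// Local stand-ins for illustration; TestKit defines its own IncidentSeverity.
public enum IncidentSeverity { P1, P2, P3, P4 }
public enum CiBehavior { BlocksRelease, WarningOnly, Informational }

public static class SeverityGate
{
    // Mirrors the severity table: P1/P2 block, P3 warns, P4 informs.
    public static CiBehavior BehaviorFor(IncidentSeverity severity) => severity switch
    {
        IncidentSeverity.P1 or IncidentSeverity.P2 => CiBehavior.BlocksRelease,
        IncidentSeverity.P3 => CiBehavior.WarningOnly,
        IncidentSeverity.P4 => CiBehavior.Informational,
        _ => throw new ArgumentOutOfRangeException(nameof(severity))
    };

    // True when a failing test of this severity should fail the release.
    public static bool BlocksRelease(IncidentSeverity severity) =>
        BehaviorFor(severity) == CiBehavior.BlocksRelease;
}
```

Keeping the rule in one place means a change to the gating policy is a one-line edit rather than a search through pipeline scripts.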

Generated Test Structure

The scaffold generates a test class with:

[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
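
The class name in the example above appears to be derived from the incident ID by replacing characters that are not valid in a C# identifier. The sketch below reproduces that naming under this assumption; the generator's actual rule is not documented here, and `IncidentTestNaming` is a hypothetical helper.

```csharp
using System.Linq;

public static class IncidentTestNaming
{
    // Assumed convention, inferred from the example class name: every
    // character that is not a letter or digit becomes an underscore.
    public static string ClassNameFor(string incidentId)
    {
        var safe = new string(incidentId
            .Select(c => char.IsLetterOrDigit(c) ? c : '_')
            .ToArray());
        return $"Incident_{safe}_Tests";
    }
}
```

Under this assumption, `IncidentTestNaming.ClassNameFor("INC-2026-001")` yields `Incident_INC_2026_001_Tests`, which matches the scaffold shown above.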

CI Integration

Test Filtering

Filter post-incident tests in CI:

# Run all post-incident tests
dotnet test --filter "Category=PostIncident"

# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"

# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"

# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"

CI Lanes

Lane          Filter                                           Trigger         Behavior
PR Gate       Category=PostIncident&(Severity=P1|Severity=P2)  Pull requests   Blocks merge
Release Gate  Category=PostIncident                            Release builds  P1/P2 block, P3/P4 warn
Nightly       Category=PostIncident                            Scheduled       Full run, report only

Example CI Configuration

# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]

jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'

Best Practices

1. Sanitize Fixtures

Remove or mask any PII or sensitive data from replay fixtures:

// Before storing fixture
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");
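
Replacing known values works when the sensitive strings are already in hand. When they are not, pattern-based masking is one alternative. The sketch below is an illustration under stated assumptions: the fixture is JSON text, the secret lives in a field named `apiKey`, and the email pattern is deliberately simple. `FixtureSanitizer` is a hypothetical helper, not a TestKit API.

```csharp
using System.Text.RegularExpressions;

public static class FixtureSanitizer
{
    // Deliberately simple email pattern; a sketch, not an RFC 5322 parser.
    private static readonly Regex Email =
        new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    // Matches the string value of a JSON field named "apiKey".
    // The field name is an assumption made for this example.
    private static readonly Regex ApiKey =
        new("(\"apiKey\"\\s*:\\s*\")[^\"]*(\")", RegexOptions.Compiled);

    // Masks emails first, then replaces the apiKey value while keeping
    // the surrounding JSON (captured in groups 1 and 2) intact.
    public static string Sanitize(string fixture)
    {
        var masked = Email.Replace(fixture, "user@example.com");
        return ApiKey.Replace(masked, "${1}REDACTED${2}");
    }
}
```

Pattern matching can miss sensitive values that do not fit the patterns, so the human review of fixtures described earlier still applies.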

2. Use Deterministic Infrastructure

Ensure incident tests use TestKit's deterministic primitives:

// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);

// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
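
To show the idea behind a deterministic clock, the sketch below builds a fixed clock on .NET 8's `TimeProvider`. It is not TestKit's `DeterministicTime`, whose implementation is not shown in this guide; `FixedTimeProvider` is a hypothetical class that only illustrates the principle that time advances when the test says so.

```csharp
using System;

// Sketch only: a fixed clock built on .NET 8's TimeProvider. This is NOT
// TestKit's DeterministicTime, just an illustration of the same idea.
public sealed class FixedTimeProvider : TimeProvider
{
    private DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset start) => _now = start.ToUniversalTime();

    // Always returns the pinned instant instead of the system clock.
    public override DateTimeOffset GetUtcNow() => _now;

    // Time moves only when the test advances it explicitly.
    public void Advance(TimeSpan delta) => _now += delta;
}
```

Code under test that accepts a `TimeProvider` can be handed `new FixedTimeProvider(Incident.OccurredAt)` so that timestamps, expirations, and ordering are identical on every run.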

3. Document the Incident

Include comprehensive documentation in the test:

/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>

4. Link to Sprint Tasks

Connect incident tests to the fix implementation:

[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
{
    // Arrange, Act, Assert as in the generated scaffold
}

5. Evolve Tests Over Time

Incident tests may need updates as the codebase evolves:

  • Update fixtures when schemas change
  • Adjust assertions when behavior intentionally changes
  • Add new scenarios discovered during subsequent incidents

Troubleshooting

Manifest Not Available

If the replay manifest wasn't captured:

  1. Check Evidence Locker for any captured events
  2. Reconstruct the scenario from logs and metrics
  3. Create a synthetic manifest for testing

Flaky Incident Tests

If the test is non-deterministic:

  1. Identify non-deterministic inputs (time, random, external state)
  2. Replace with TestKit deterministic primitives
  3. Add retry logic only as a last resort

Test No Longer Relevant

If the fix makes the scenario impossible:

  1. Document why the test is no longer applicable
  2. Move to an "archived incidents" test category
  3. Keep the test for documentation purposes


Changelog

v1.0 (2026-01-27)

  • Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
  • CI integration patterns
  • Best practices and troubleshooting