master 644887997c test fixes and new product advisories work

2026-01-28 02:30:48 +02:00

Post-Incident Testing Guide

Version: 1.0
Status: Turn #6 Implementation
Audience: StellaOps developers, QA engineers, incident responders

Overview

Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.

Key Principles

Permanent Regression: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
Deterministic Replay: Tests are generated from replay manifests captured during the incident.
Severity-Gated: P1/P2 incident tests block releases; P3 tests raise warnings and P4 tests are informational only.
Traceable: Every incident test links back to the incident report and fix.

Workflow

1. Incident Triggers Replay Capture

When an incident occurs, the replay infrastructure automatically captures:

Event sequences with correlation IDs
Input data (sanitized for PII)
System state at time of incident
Configuration and policy digests

This produces a replay manifest stored in the Evidence Locker.

2. Generate Test Scaffold

Use the IncidentTestGenerator to create a test scaffold from the replay manifest:

using System;
using System.IO;
using StellaOps.TestKit.Incident;

// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");

// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};

// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);

// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);

3. Review and Complete Test

The generated scaffold is a starting point. A human must:

Review fixtures: Ensure input data is appropriate and sanitized.
Complete assertions: Add specific assertions for the expected behavior.
Verify determinism: Ensure the test produces consistent results.
Add to CI: Include the test in the appropriate test project.
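
One way to approach the "verify determinism" step is to run the scenario several times and compare a digest of its output. The sketch below assumes the scenario can be reduced to a function returning a string; `DeterminismCheck` and `IsDeterministic` are hypothetical names for illustration and are not part of StellaOps.TestKit.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of StellaOps.TestKit.
public static class DeterminismCheck
{
    // Runs the scenario `runs` times and returns true only if every run
    // yields the same SHA-256 digest as the first run.
    public static bool IsDeterministic(Func<string> scenario, int runs = 5)
    {
        if (runs < 2) throw new ArgumentOutOfRangeException(nameof(runs));

        var first = Digest(scenario());
        for (var i = 1; i < runs; i++)
        {
            if (Digest(scenario()) != first) return false;
        }
        return true;
    }

    // Hex-encoded SHA-256 of the UTF-8 bytes of the output.
    private static string Digest(string output) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(output)));
}
```

A check like this can only show that the runs it made agreed with each other. It does not prove that the test is free of hidden dependencies on time or external state.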

4. Register for Tracking

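Once the test is complete, register it with the generator so that it is counted in incident-test reports:

generator.RegisterIncidentTest(metadata.IncidentId, scaffold);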

// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");

Incident Metadata

The IncidentMetadata record captures essential incident context:

Property	Required	Description
`IncidentId`	Yes	Unique identifier from incident management system
`OccurredAt`	Yes	When the incident occurred (UTC)
`RootCause`	Yes	Brief description of the root cause
`AffectedModules`	Yes	Modules impacted by the incident
`Severity`	Yes	P1 (critical) through P4 (low impact)
`Title`	No	Short descriptive title
`ReportUrl`	No	Link to incident report or postmortem
`ResolvedAt`	No	When the incident was resolved
`CorrelationIds`	No	IDs for replay matching
`FixTaskId`	No	Sprint task that implemented the fix
`Tags`	No	Categorization tags

Severity Levels

Severity	Description	CI Behavior
P1	Critical: service down, data loss, security breach	Blocks releases
P2	Major: significant degradation, partial outage	Blocks releases
P3	Minor: limited impact, workaround available	Warning only
P4	Low: cosmetic issues, minor bugs	Informational
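
The table above can be read as a mapping from severity to CI behavior. The following is a minimal sketch of that mapping. `SketchSeverity`, `CiBehavior`, and `SeverityGate` are stand-in names introduced here; the real `IncidentSeverity` type lives in StellaOps.TestKit.Incident and may be shaped differently.

```csharp
using System;

// Stand-in enum mirroring the severity table; not the TestKit type.
public enum SketchSeverity { P1, P2, P3, P4 }

// The three CI behaviors listed in the table.
public enum CiBehavior { BlocksRelease, WarningOnly, Informational }

public static class SeverityGate
{
    // Maps each severity to the CI behavior given in the table.
    public static CiBehavior BehaviorFor(SketchSeverity severity) => severity switch
    {
        SketchSeverity.P1 or SketchSeverity.P2 => CiBehavior.BlocksRelease,
        SketchSeverity.P3 => CiBehavior.WarningOnly,
        SketchSeverity.P4 => CiBehavior.Informational,
        _ => throw new ArgumentOutOfRangeException(nameof(severity))
    };
}
```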

Generated Test Structure

The scaffold generates a test class along these lines:

[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
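
The class name in the example appears to be derived from the incident ID by replacing characters that are not valid in a C# identifier with underscores. The sketch below reproduces that mapping for the ID shown. It is an assumption about the naming scheme, not the generator's actual code, and `IncidentNaming` is a hypothetical name.

```csharp
using System;
using System.Linq;

// Hypothetical helper; the real generator may name classes differently.
public static class IncidentNaming
{
    // Replaces every character that is not a letter or digit with '_'
    // and wraps the result as Incident_<id>_Tests.
    public static string TestClassNameFor(string incidentId)
    {
        if (string.IsNullOrWhiteSpace(incidentId))
            throw new ArgumentException("Incident ID is required.", nameof(incidentId));

        var safe = new string(incidentId
            .Select(c => char.IsLetterOrDigit(c) ? c : '_')
            .ToArray());
        return $"Incident_{safe}_Tests";
    }
}
```

With this sketch, `IncidentNaming.TestClassNameFor("INC-2026-001")` returns `Incident_INC_2026_001_Tests`, matching the class name shown above.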

CI Integration

Test Filtering

Filter post-incident tests in CI:

# Run all post-incident tests
dotnet test --filter "Category=PostIncident"

# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"

# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"

# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
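
When the same filter expressions are used in several places (local scripts, CI lanes), it can help to build them from one helper rather than retyping them. The sketch below composes the severity filters shown above. `IncidentFilter` is a hypothetical name, and the trait names `Category` and `Severity` are taken from the examples in this guide.

```csharp
using System.Linq;

// Hypothetical helper for composing `dotnet test --filter` expressions.
public static class IncidentFilter
{
    // Returns the post-incident category filter, optionally narrowed to
    // the given severities, e.g. ForSeverities("P1", "P2").
    public static string ForSeverities(params string[] severities)
    {
        const string category = "Category=PostIncident";
        if (severities.Length == 0) return category;

        var alternatives = string.Join("|", severities.Select(s => $"Severity={s}"));
        return $"{category}&({alternatives})";
    }
}
```

`IncidentFilter.ForSeverities("P1", "P2")` yields `Category=PostIncident&(Severity=P1|Severity=P2)`, the release-gating filter used above.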

CI Lanes

Lane	Filter	Trigger	Behavior
PR Gate	`Category=PostIncident&(Severity=P1|Severity=P2)`	Pull requests	Blocks merge
Release Gate	`Category=PostIncident`	Release builds	P1/P2 block, P3/P4 warn
Nightly	`Category=PostIncident`	Scheduled	Full run, report only
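
The Release Gate lane runs every post-incident test but treats failures differently by severity. The sketch below shows one way such a gate could partition failures. It assumes failed tests are available as (incident ID, severity) pairs, for example parsed from a TRX report; `FailedIncidentTest`, `GateResult`, and `ReleaseGate` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one failed incident test.
public sealed record FailedIncidentTest(string IncidentId, string Severity);

// Failures split into release-blocking and warning-only groups.
public sealed record GateResult(
    IReadOnlyList<FailedIncidentTest> Blocking,
    IReadOnlyList<FailedIncidentTest> Warnings)
{
    public bool ReleaseAllowed => Blocking.Count == 0;
}

public static class ReleaseGate
{
    // P1/P2 failures block the release; everything else is reported as a warning.
    public static GateResult Evaluate(IEnumerable<FailedIncidentTest> failures)
    {
        var bySeverity = failures.ToLookup(f => f.Severity is "P1" or "P2");
        return new GateResult(bySeverity[true].ToList(), bySeverity[false].ToList());
    }
}
```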

Example CI Configuration

# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]

jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"

      - name: Upload Results
        if: always()  # upload the report even when the test step fails
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'

Best Practices

1. Sanitize Fixtures

Remove or mask any PII or sensitive data from replay fixtures:

// Before storing fixture
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");
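
Literal replacement only covers values known in advance. For fixtures that may contain email addresses you have not enumerated, a pattern-based pass is one option. The sketch below is illustrative only: `FixtureSanitizer` is a hypothetical name, not a TestKit API, and the pattern is deliberately simple rather than a complete email grammar.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of StellaOps.TestKit.
public static class FixtureSanitizer
{
    // Simple email pattern for masking purposes; it will miss unusual
    // address forms, so sanitized fixtures still need human review.
    private static readonly Regex Email = new(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Replaces every matched address with a fixed placeholder.
    public static string MaskEmails(string fixture) =>
        Email.Replace(fixture, "user@example.com");
}
```

Because the placeholder itself matches the pattern and is replaced with the same text, running the pass twice gives the same result as running it once.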

2. Use Deterministic Infrastructure

Ensure incident tests use TestKit's deterministic primitives:

// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);

// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
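
To illustrate the idea behind these primitives without depending on TestKit, the sketch below pins the clock using the base class library's `TimeProvider` (available in .NET 8 and later). `FixedTimeProvider` is a stand-in written for this example; TestKit's `DeterministicTime` may work differently.

```csharp
using System;

// Illustrative stand-in built on System.TimeProvider; not the TestKit type.
public sealed class FixedTimeProvider : TimeProvider
{
    private DateTimeOffset _now;

    // The clock starts at the given instant, normalized to UTC.
    public FixedTimeProvider(DateTimeOffset start) => _now = start.ToUniversalTime();

    // Always returns the pinned instant; it never follows the wall clock.
    public override DateTimeOffset GetUtcNow() => _now;

    // Moves the clock forward only when the test asks for it.
    public void Advance(TimeSpan delta) => _now += delta;
}
```

Code under test that accepts a `TimeProvider` then sees the incident timestamp on every run, and time only moves when the test calls `Advance`.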

3. Document the Incident

Include comprehensive documentation in the test:

/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>

4. Link to Sprint Tasks

Connect incident tests to the fix implementation:

[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()

5. Evolve Tests Over Time

Incident tests may need updates as the codebase evolves:

Update fixtures when schemas change
Adjust assertions when behavior intentionally changes
Add new scenarios discovered during subsequent incidents

Troubleshooting

Manifest Not Available

If the replay manifest wasn't captured:

Check Evidence Locker for any captured events
Reconstruct the scenario from logs and metrics
Create a synthetic manifest for testing

Flaky Incident Tests

If the test is non-deterministic:

Identify non-deterministic inputs (time, random, external state)
Replace with TestKit deterministic primitives
Add retry logic only as a last resort

Test No Longer Relevant

If the fix makes the scenario impossible:

Document why the test is no longer applicable
Move to an "archived incidents" test category
Keep the test for documentation purposes

Changelog

v1.0 (2026-01-27)

Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
CI integration patterns
Best practices and troubleshooting
