Post-Incident Testing Guide
Version: 1.0
Status: Turn #6 Implementation
Audience: StellaOps developers, QA engineers, incident responders
Overview
Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.
Key Principles
- Permanent Regression: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
- Deterministic Replay: Tests are generated from replay manifests captured during the incident.
- Severity-Gated: P1/P2 incident tests block releases; P3/P4 tests never block (P3 failures raise a warning, P4 failures are informational).
- Traceable: Every incident test links back to the incident report and fix.
Workflow
1. Incident Triggers Replay Capture
When an incident occurs, the replay infrastructure automatically captures:
- Event sequences with correlation IDs
- Input data (sanitized for PII)
- System state at time of incident
- Configuration and policy digests
This produces a replay manifest stored in the Evidence Locker.
2. Generate Test Scaffold
Use the IncidentTestGenerator to create a test scaffold from the replay manifest:
using StellaOps.TestKit.Incident;
// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");
// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};
// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);
// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);
3. Review and Complete Test
The generated scaffold is a starting point. A human must:
- Review fixtures: Ensure input data is appropriate and sanitized.
- Complete assertions: Add specific assertions for the expected behavior.
- Verify determinism: Ensure the test produces consistent results.
- Add to CI: Include the test in the appropriate test project.
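The "verify determinism" step can be approximated by running the scenario several times and comparing output digests. This is a minimal sketch, not part of TestKit: `DeterminismCheck` and `IsDeterministic` are hypothetical names, and the scenario is assumed to be reducible to a string result.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not a TestKit API.
public static class DeterminismCheck
{
    // Hex-encoded SHA-256 digest of the scenario output.
    private static string Digest(string output) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(output)));

    // Runs the scenario `runs` times and reports whether every run
    // produced output identical to the first run.
    public static bool IsDeterministic(Func<string> scenario, int runs = 3)
    {
        var first = Digest(scenario());
        for (var i = 1; i < runs; i++)
        {
            if (Digest(scenario()) != first)
            {
                return false;
            }
        }
        return true;
    }
}
```

A scenario that reads the wall clock or an unseeded random source would typically fail this check, which is a hint to swap in deterministic inputs before adding the test to CI.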
4. Register for Tracking
Register the incident test for reporting:
generator.RegisterIncidentTest(metadata.IncidentId, scaffold);
// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");
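Conceptually, the per-severity counts in such a report are a group-by over the registered tests. The sketch below illustrates that idea only; `RegisteredTest`, `IncidentReportSketch`, and the local `IncidentSeverity` enum are stand-ins defined here, and the real report type may be built differently.

```csharp
using System.Collections.Generic;
using System.Linq;

// Local stand-ins for illustration; the real TestKit types may differ.
public enum IncidentSeverity { P1, P2, P3, P4 }

public sealed record RegisteredTest(string IncidentId, IncidentSeverity Severity);

public static class IncidentReportSketch
{
    // Counts registered tests per severity. Severities with no tests are
    // absent from the result, so callers read them with a default of 0.
    public static Dictionary<IncidentSeverity, int> CountBySeverity(
        IEnumerable<RegisteredTest> tests) =>
        tests.GroupBy(t => t.Severity)
             .ToDictionary(g => g.Key, g => g.Count());
}
```

Because missing severities are simply not in the dictionary, reading a count with `GetValueOrDefault(severity, 0)` avoids a `KeyNotFoundException`.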
Incident Metadata
The IncidentMetadata record captures essential incident context:
| Property | Required | Description |
|---|---|---|
| IncidentId | Yes | Unique identifier from incident management system |
| OccurredAt | Yes | When the incident occurred (UTC) |
| RootCause | Yes | Brief description of the root cause |
| AffectedModules | Yes | Modules impacted by the incident |
| Severity | Yes | P1 (critical) through P4 (low impact) |
| Title | No | Short descriptive title |
| ReportUrl | No | Link to incident report or postmortem |
| ResolvedAt | No | When the incident was resolved |
| CorrelationIds | No | IDs for replay matching |
| FixTaskId | No | Sprint task that implemented the fix |
| Tags | No | Categorization tags |
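One way to encode the required/optional split from the table is with C# `required` members plus a small runtime check. The record below is a hypothetical mirror written for illustration: the property types for the optional fields and the `Validate` helper are assumptions, and the real `IncidentMetadata` may differ.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public enum IncidentSeverity { P1, P2, P3, P4 }

// Hypothetical mirror of the table; not the actual TestKit record.
public sealed record IncidentMetadataSketch
{
    public required string IncidentId { get; init; }
    public required DateTimeOffset OccurredAt { get; init; }
    public required string RootCause { get; init; }
    public required IReadOnlyList<string> AffectedModules { get; init; }
    public required IncidentSeverity Severity { get; init; }

    // Optional fields; types are assumed for this sketch.
    public string? Title { get; init; }
    public string? ReportUrl { get; init; }
    public DateTimeOffset? ResolvedAt { get; init; }
    public IReadOnlyList<string> CorrelationIds { get; init; } = Array.Empty<string>();
    public string? FixTaskId { get; init; }
    public IReadOnlyList<string> Tags { get; init; } = Array.Empty<string>();

    // Returns a list of problems; an empty list means the metadata is usable.
    public IReadOnlyList<string> Validate()
    {
        var problems = new List<string>();
        if (string.IsNullOrWhiteSpace(IncidentId)) problems.Add("IncidentId is blank");
        if (string.IsNullOrWhiteSpace(RootCause)) problems.Add("RootCause is blank");
        if (AffectedModules.Count == 0) problems.Add("AffectedModules is empty");
        if (OccurredAt.Offset != TimeSpan.Zero) problems.Add("OccurredAt is not UTC");
        return problems;
    }
}
```

The `required` keyword (C# 11 and later) makes the compiler reject object initializers that omit a required property, while `Validate` catches values that are present but empty.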
Severity Levels
| Severity | Description | CI Behavior |
|---|---|---|
| P1 | Critical: service down, data loss, security breach | Blocks releases |
| P2 | Major: significant degradation, partial outage | Blocks releases |
| P3 | Minor: limited impact, workaround available | Warning only |
| P4 | Low: cosmetic issues, minor bugs | Informational |
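The CI behavior column can be read as a mapping from (severity, test result) to a gate outcome. The following is a sketch of that mapping under the table's rules; `GateOutcome` and `SeverityGate` are hypothetical names, and the enum is redefined locally so the example stands alone.

```csharp
public enum IncidentSeverity { P1, P2, P3, P4 }

// Hypothetical outcome type for illustration.
public enum GateOutcome { Pass, Inform, Warn, Block }

public static class SeverityGate
{
    // Maps a test result to a gate outcome following the severity table:
    // failing P1/P2 tests block, P3 warns, P4 is informational.
    public static GateOutcome Evaluate(IncidentSeverity severity, bool testPassed)
    {
        if (testPassed)
        {
            return GateOutcome.Pass;
        }

        return severity switch
        {
            IncidentSeverity.P1 or IncidentSeverity.P2 => GateOutcome.Block,
            IncidentSeverity.P3 => GateOutcome.Warn,
            _ => GateOutcome.Inform,
        };
    }
}
```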
Generated Test Structure
The scaffold generates a test class with:
[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
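The class name in the example, `Incident_INC_2026_001_Tests`, looks like the incident ID with non-identifier characters replaced by underscores. A sketch of one way to derive such a name follows; `TestClassNaming` is hypothetical and the generator's actual naming rule is not documented here.

```csharp
using System.Linq;

// Hypothetical helper; the real generator may use a different rule.
public static class TestClassNaming
{
    // Replaces every character that is not a letter or digit with '_'
    // and wraps the result in the "Incident_..._Tests" pattern.
    public static string FromIncidentId(string incidentId)
    {
        var safe = new string(
            incidentId.Select(c => char.IsLetterOrDigit(c) ? c : '_').ToArray());
        return "Incident_" + safe + "_Tests";
    }
}
```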
CI Integration
Test Filtering
Filter post-incident tests in CI:
# Run all post-incident tests
dotnet test --filter "Category=PostIncident"
# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"
# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"
# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
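When pipelines assemble these filter expressions programmatically, a small builder keeps the `&` and `|` grouping consistent. This is an illustrative sketch; `TestFilter` is a hypothetical helper, and it only produces strings in the same form as the severity filters shown above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for building --filter expressions.
public static class TestFilter
{
    private const string Category = "Category=PostIncident";

    // Builds a filter that selects post-incident tests, optionally
    // restricted to the given severity labels (for example "P1", "P2").
    public static string ForSeverities(IEnumerable<string> severities)
    {
        var clauses = severities.Select(s => "Severity=" + s).ToList();
        if (clauses.Count == 0)
        {
            return Category;
        }

        var joined = string.Join("|", clauses);
        return Category + "&(" + joined + ")";
    }
}
```

The parentheses matter: they keep the OR over severities grouped before it is combined with the category clause.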
CI Lanes
| Lane | Filter | Trigger | Behavior |
|---|---|---|---|
| PR Gate | Category=PostIncident&(Severity=P1\|Severity=P2) | Pull requests | Blocks merge |
| Release Gate | Category=PostIncident | Release builds | P1/P2 block, P3/P4 warn |
| Nightly | Category=PostIncident | Scheduled | Full run, report only |
Example CI Configuration
# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]
jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'
      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'
Best Practices
1. Sanitize Fixtures
Remove or mask any PII or sensitive data from replay fixtures:
// Before storing fixture
var sanitizedFixture = fixture
.Replace(userEmail, "user@example.com")
.Replace(apiKey, "REDACTED");
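Literal replacement only works when the sensitive values are known in advance. For fixtures where they are not, a pattern-based pass can help. The sketch below masks anything that looks like an email address; `FixtureSanitizer` is a hypothetical helper, and the regular expression is a deliberately simple approximation rather than a full address grammar.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of TestKit.
public static class FixtureSanitizer
{
    // Simplified email pattern: local part, '@', domain, dot, TLD.
    private static readonly Regex Email = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Replaces every email-like substring with a fixed placeholder.
    public static string MaskEmails(string fixture) =>
        Email.Replace(fixture, "user@example.com");
}
```

Pattern-based masking can miss unusual formats, so a manual review of the fixture is still advisable.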
2. Use Deterministic Infrastructure
Ensure incident tests use TestKit's deterministic primitives:
// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);
// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
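For readers unfamiliar with the idea behind these primitives, the same effect can be illustrated with base-library types alone. The sketch below does not show TestKit's `DeterministicTime` or `DeterministicRandom`; it assumes .NET 8 or later for `System.TimeProvider`, and `FixedTimeProvider` is a hypothetical class defined here.

```csharp
using System;

// Hypothetical fixed clock built on the BCL TimeProvider abstraction.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FixedTimeProvider(DateTimeOffset now) => _now = now.ToUniversalTime();

    // Always returns the instant supplied at construction.
    public override DateTimeOffset GetUtcNow() => _now;
}

public static class DeterministicInputs
{
    // Two System.Random instances created with the same seed yield the
    // same sequence within a given runtime, so results are repeatable.
    public static int[] Sample(int seed, int count)
    {
        var random = new Random(seed);
        var values = new int[count];
        for (var i = 0; i < count; i++)
        {
            values[i] = random.Next();
        }
        return values;
    }
}
```

Code under test that takes a `TimeProvider` instead of calling `DateTimeOffset.UtcNow` directly can then be driven from the incident timestamp.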
3. Document the Incident
Include comprehensive documentation in the test:
/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>
4. Link to Sprint Tasks
Connect incident tests to the fix implementation:
[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
5. Evolve Tests Over Time
Incident tests may need updates as the codebase evolves:
- Update fixtures when schemas change
- Adjust assertions when behavior intentionally changes
- Add new scenarios discovered during subsequent incidents
Troubleshooting
Manifest Not Available
If the replay manifest wasn't captured:
- Check Evidence Locker for any captured events
- Reconstruct the scenario from logs and metrics
- Create a synthetic manifest for testing
Flaky Incident Tests
If the test is non-deterministic:
- Identify non-deterministic inputs (time, random, external state)
- Replace with TestKit deterministic primitives
- Add retry logic only as a last resort
Test No Longer Relevant
If the fix makes the scenario impossible:
- Document why the test is no longer applicable
- Move to an "archived incidents" test category
- Keep the test for documentation purposes
Related Documentation
Changelog
v1.0 (2026-01-27)
- Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
- CI integration patterns
- Best practices and troubleshooting