# Post-Incident Testing Guide

**Version:** 1.0
**Status:** Turn #6 Implementation
**Audience:** StellaOps developers, QA engineers, incident responders

---

## Overview

Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.

### Key Principles

1. **Permanent Regression**: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
2. **Deterministic Replay**: Tests are generated from replay manifests captured during the incident.
3. **Severity-Gated**: P1/P2 incident tests block releases; P3/P4 tests are warning-only.
4. **Traceable**: Every incident test links back to the incident report and fix.

---

## Workflow

### 1. Incident Triggers Replay Capture

When an incident occurs, the replay infrastructure automatically captures:

- Event sequences with correlation IDs
- Input data (sanitized for PII)
- System state at time of incident
- Configuration and policy digests

This produces a **replay manifest** stored in the Evidence Locker.

### 2. Generate Test Scaffold

Use the `IncidentTestGenerator` to create a test scaffold from the replay manifest:

```csharp
using StellaOps.TestKit.Incident;

// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");

// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};

// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);

// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);
```

### 3. Review and Complete Test

The generated scaffold is a starting point. A human must:

1. **Review fixtures**: Ensure input data is appropriate and sanitized.
2. **Complete assertions**: Add specific assertions for the expected behavior.
3. **Verify determinism**: Ensure the test produces consistent results.
4. **Add to CI**: Include the test in the appropriate test project.

### 4. Register for Tracking

Register the incident test for reporting:

```csharp
generator.RegisterIncidentTest(metadata.IncidentId, scaffold);

// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");
```

---

## Incident Metadata

The `IncidentMetadata` record captures essential incident context:

| Property | Required | Description |
|----------|----------|-------------|
| `IncidentId` | Yes | Unique identifier from incident management system |
| `OccurredAt` | Yes | When the incident occurred (UTC) |
| `RootCause` | Yes | Brief description of the root cause |
| `AffectedModules` | Yes | Modules impacted by the incident |
| `Severity` | Yes | P1 (critical) through P4 (low impact) |
| `Title` | No | Short descriptive title |
| `ReportUrl` | No | Link to incident report or postmortem |
| `ResolvedAt` | No | When the incident was resolved |
| `CorrelationIds` | No | IDs for replay matching |
| `FixTaskId` | No | Sprint task that implemented the fix |
| `Tags` | No | Categorization tags |

### Severity Levels

| Severity | Description | CI Behavior |
|----------|-------------|-------------|
| P1 | Critical: service down, data loss, security breach | Blocks releases |
| P2 | Major: significant degradation, partial outage | Blocks releases |
| P3 | Minor: limited impact, workaround available | Warning only |
| P4 | Low: cosmetic issues, minor bugs | Informational |

---

## Generated Test Structure

The scaffold generates a test class with:

```csharp
[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
```

---

## CI Integration

### Test Filtering

Filter post-incident tests in CI:

```bash
# Run all post-incident tests
dotnet test --filter "Category=PostIncident"

# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"

# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"

# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
```

### CI Lanes

| Lane | Filter | Trigger | Behavior |
|------|--------|---------|----------|
| PR Gate | `Category=PostIncident&(Severity=P1\|Severity=P2)` | Pull requests | Blocks merge |
| Release Gate | `Category=PostIncident` | Release builds | P1/P2 block, P3/P4 warn |
| Nightly | `Category=PostIncident` | Scheduled | Full run, report only |

### Example CI Configuration

```yaml
# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests

on:
  pull_request:
  release:
    types: [created]

jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'
```

---

## Best Practices

### 1. Sanitize Fixtures

Remove or mask any PII or sensitive data from replay fixtures:

```csharp
// Before storing fixture
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");
```

### 2. Use Deterministic Infrastructure

Ensure incident tests use TestKit's deterministic primitives:

```csharp
// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);

// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
```

### 3. Document the Incident

Include comprehensive documentation in the test:

```csharp
/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>
```

### 4. Link to Sprint Tasks

Connect incident tests to the fix implementation:

```csharp
[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
```

### 5. Evolve Tests Over Time

Incident tests may need updates as the codebase evolves:

- Update fixtures when schemas change
- Adjust assertions when behavior intentionally changes
- Add new scenarios discovered during subsequent incidents

---

## Troubleshooting

### Manifest Not Available

If the replay manifest wasn't captured:

1. Check Evidence Locker for any captured events
2. Reconstruct the scenario from logs and metrics
3. Create a synthetic manifest for testing

### Flaky Incident Tests

If the test is non-deterministic:

1. Identify non-deterministic inputs (time, random, external state)
2. Replace with TestKit deterministic primitives
3. Add retry logic only as a last resort

### Test No Longer Relevant

If the fix makes the scenario impossible:

1. Document why the test is no longer applicable
2. Move to an "archived incidents" test category
3. Keep the test for documentation purposes

---

## Related Documentation

- [TestKit Usage Guide](testkit-usage-guide.md)
- [Testing Practices](../../code-of-conduct/TESTING_PRACTICES.md)
- [CI Quality Gates](ci-quality-gates.md)
- [Replay Infrastructure](../../modules/replay/architecture.md)

---

## Changelog

### v1.0 (2026-01-27)

- Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
- CI integration patterns
- Best practices and troubleshooting
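
---

## Appendix: Severity Gating Sketch

The Release Gate lane described under CI Lanes runs every post-incident test but treats failures differently by severity: P1/P2 block, P3/P4 only warn. The guide does not show how a pipeline script might make that decision, so the following is a minimal sketch, not the StellaOps implementation. The helper names `gate_action` and `run_lane` are hypothetical, and the sketch assumes each severity is run as its own command so its exit code can be interpreted separately. Unknown severities are treated as blocking so that a typo fails closed.

```bash
#!/usr/bin/env bash

# Hypothetical helper: map an incident severity to the CI behavior listed
# in the Severity Levels table (block / warn / info).
gate_action() {
  case "$1" in
    P1|P2) echo "block" ;;
    P3)    echo "warn" ;;
    P4)    echo "info" ;;
    *)     echo "unknown severity: $1" >&2; return 2 ;;
  esac
}

# Hypothetical wrapper: run the given command for one severity lane and
# translate a failure into a blocking or non-blocking result.
run_lane() {
  local severity="$1"
  shift
  local action

  # Fail closed: an unrecognized severity is treated as blocking.
  action="$(gate_action "$severity" 2>/dev/null)" || action="block"

  # A passing test command never blocks, whatever the severity.
  if "$@"; then
    return 0
  fi

  case "$action" in
    warn|info)
      echo "warning: $severity incident tests failed (non-blocking)" >&2
      return 0
      ;;
    *)
      echo "error: $severity incident tests failed (blocking)" >&2
      return 1
      ;;
  esac
}
```

Under these assumptions, a release job could call `run_lane P1 dotnet test --filter "Category=PostIncident&Severity=P1"` and `run_lane P3 dotnet test --filter "Category=PostIncident&Severity=P3"` in turn: the first propagates a failure to the job, while the second only logs a warning.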