# Post-Incident Testing Guide
**Version:** 1.0
**Status:** Turn #6 Implementation
**Audience:** StellaOps developers, QA engineers, incident responders
---
## Overview
Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.
### Key Principles
1. **Permanent Regression**: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
2. **Deterministic Replay**: Tests are generated from replay manifests captured during the incident.
3. **Severity-Gated**: P1/P2 incident tests block releases; P3/P4 tests are warning-only.
4. **Traceable**: Every incident test links back to the incident report and fix.
---
## Workflow
### 1. Incident Triggers Replay Capture
When an incident occurs, the replay infrastructure automatically captures:
- Event sequences with correlation IDs
- Input data (sanitized for PII)
- System state at time of incident
- Configuration and policy digests
This produces a **replay manifest** stored in the Evidence Locker.
### 2. Generate Test Scaffold
Use the `IncidentTestGenerator` to create a test scaffold from the replay manifest:
```csharp
using System;
using System.IO;
using StellaOps.TestKit.Incident;
// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");
// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};
// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);
// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);
```
### 3. Review and Complete Test
The generated scaffold is a starting point. A human must:
1. **Review fixtures**: Ensure input data is appropriate and sanitized.
2. **Complete assertions**: Add specific assertions for the expected behavior.
3. **Verify determinism**: Ensure the test produces consistent results.
4. **Add to CI**: Include the test in the appropriate test project.
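One way to approach the "verify determinism" step is to run the scenario several times and compare a digest of its output. The sketch below assumes the scenario can be expressed as a function returning a serialized result. `DeterminismCheck` and `IsDeterministic` are hypothetical names used for illustration and are not part of TestKit.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; not part of TestKit.
public static class DeterminismCheck
{
    // Runs the scenario `runs` times and returns true only when every run
    // produces the same SHA-256 digest of its serialized output.
    public static bool IsDeterministic(Func<string> scenario, int runs = 3)
    {
        if (runs < 2) throw new ArgumentOutOfRangeException(nameof(runs));

        string? first = null;
        for (var i = 0; i < runs; i++)
        {
            var bytes = Encoding.UTF8.GetBytes(scenario());
            var digest = Convert.ToHexString(SHA256.HashData(bytes));
            first ??= digest;               // remember the first run's digest
            if (digest != first) return false;
        }
        return true;
    }
}
```

A failing check usually points at an unpinned clock, an unseeded random source, or external state leaking into the scenario.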
### 4. Register for Tracking
Register the incident test for reporting:
```csharp
generator.RegisterIncidentTest(metadata.IncidentId, scaffold);
// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");
```
---
## Incident Metadata
The `IncidentMetadata` record captures essential incident context:
| Property | Required | Description |
|----------|----------|-------------|
| `IncidentId` | Yes | Unique identifier from incident management system |
| `OccurredAt` | Yes | When the incident occurred (UTC) |
| `RootCause` | Yes | Brief description of the root cause |
| `AffectedModules` | Yes | Modules impacted by the incident |
| `Severity` | Yes | P1 (critical) through P4 (low impact) |
| `Title` | No | Short descriptive title |
| `ReportUrl` | No | Link to incident report or postmortem |
| `ResolvedAt` | No | When the incident was resolved |
| `CorrelationIds` | No | IDs for replay matching |
| `FixTaskId` | No | Sprint task that implemented the fix |
| `Tags` | No | Categorization tags |
### Severity Levels
| Severity | Description | CI Behavior |
|----------|-------------|-------------|
| P1 | Critical: service down, data loss, security breach | Blocks releases |
| P2 | Major: significant degradation, partial outage | Blocks releases |
| P3 | Minor: limited impact, workaround available | Warning only |
| P4 | Low: cosmetic issues, minor bugs | Informational |
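The severity-to-CI mapping in the table can be expressed as a small function. This is a minimal sketch under stated assumptions: `Severity`, `GateOutcome`, and `SeverityGate` are local stand-ins defined here for illustration, and the real `IncidentSeverity` enum and CI gating logic may be shaped differently.

```csharp
using System;

// Local stand-in for the TestKit severity enum (assumption).
public enum Severity { P1, P2, P3, P4 }

// Hypothetical outcomes mirroring the "CI Behavior" column.
public enum GateOutcome { BlockRelease, Warn, Inform }

public static class SeverityGate
{
    // Maps a severity to the CI behavior listed in the table above.
    public static GateOutcome OutcomeFor(Severity severity) => severity switch
    {
        Severity.P1 or Severity.P2 => GateOutcome.BlockRelease,
        Severity.P3 => GateOutcome.Warn,
        Severity.P4 => GateOutcome.Inform,
        _ => throw new ArgumentOutOfRangeException(nameof(severity))
    };
}
```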
---
## Generated Test Structure
The scaffold generates a test class like the following:
```csharp
[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
```
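The class name in the example appears to be derived from the incident ID by replacing characters that are not valid in a C# identifier. The generator's actual naming rule is not documented here, so the following is only a sketch inferred from that one example. `IncidentNaming` and `TestClassNameFor` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the generator's real naming rule may differ.
public static class IncidentNaming
{
    // Replaces every character outside [A-Za-z0-9_] with an underscore
    // and wraps the result in the Incident_..._Tests pattern.
    public static string TestClassNameFor(string incidentId)
    {
        var safe = Regex.Replace(incidentId, "[^A-Za-z0-9_]", "_");
        return $"Incident_{safe}_Tests";
    }
}
```

For the ID used throughout this guide, `IncidentNaming.TestClassNameFor("INC-2026-001")` returns `Incident_INC_2026_001_Tests`.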
---
## CI Integration
### Test Filtering
Filter post-incident tests in CI:
```bash
# Run all post-incident tests
dotnet test --filter "Category=PostIncident"
# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"
# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"
# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
```
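When filters are assembled by tooling rather than typed by hand, the severity clause can be built programmatically. The sketch below only composes the filter string shown above. `IncidentFilters` and `ForSeverities` are hypothetical names, not an existing TestKit API.

```csharp
using System.Linq;

// Hypothetical helper that composes a `dotnet test --filter` expression.
public static class IncidentFilters
{
    // With no severities, selects every post-incident test; otherwise
    // restricts the category to the given severities, OR-ed together.
    public static string ForSeverities(params string[] severities)
    {
        if (severities.Length == 0) return "Category=PostIncident";

        var clause = string.Join("|", severities.Select(s => $"Severity={s}"));
        return $"Category=PostIncident&({clause})";
    }
}
```

The result of `IncidentFilters.ForSeverities("P1", "P2")` matches the release-gating filter above and can be passed to `dotnet test --filter` inside double quotes.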
### CI Lanes
| Lane | Filter | Trigger | Behavior |
|------|--------|---------|----------|
| PR Gate | `Category=PostIncident&(Severity=P1\|Severity=P2)` | Pull requests | Blocks merge |
| Release Gate | `Category=PostIncident` | Release builds | P1/P2 block, P3/P4 warn |
| Nightly | `Category=PostIncident` | Scheduled | Full run, report only |
### Example CI Configuration
```yaml
# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]
jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'
      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'
```
---
## Best Practices
### 1. Sanitize Fixtures
Remove or mask any PII or sensitive data from replay fixtures:
```csharp
// Before storing fixture
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");
```
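Literal replacement works when the sensitive values are known in advance. When they are not, a pattern-based pass can catch them. The following is a minimal sketch: `FixtureSanitizer` is a hypothetical helper, the email pattern is deliberately simple, and the `sk_` API key format is an assumption made for illustration only.

```csharp
using System.Text.RegularExpressions;

// Hypothetical sanitizer; adapt the patterns to the data actually captured.
public static class FixtureSanitizer
{
    // Simple email pattern; not a full RFC 5322 matcher.
    private static readonly Regex Email = new(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        RegexOptions.Compiled);

    // Assumed key format: "sk_" followed by at least 16 alphanumerics.
    private static readonly Regex ApiKey = new(
        @"\bsk_[A-Za-z0-9]{16,}\b",
        RegexOptions.Compiled);

    // Masks emails first, then anything that looks like an API key.
    public static string Sanitize(string fixture)
    {
        var withoutEmails = Email.Replace(fixture, "user@example.com");
        return ApiKey.Replace(withoutEmails, "REDACTED");
    }
}
```

Pattern-based masking can miss values in unexpected formats, so the human fixture review described earlier still applies.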
### 2. Use Deterministic Infrastructure
Ensure incident tests use TestKit's deterministic primitives:
```csharp
// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);
// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
```
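To illustrate the idea behind pinning the clock, the sketch below uses .NET's built-in `TimeProvider` abstraction (available since .NET 8) rather than TestKit. `FixedTimeProvider` is a hypothetical class, and TestKit's `DeterministicTime` may work differently internally.

```csharp
using System;

// Hypothetical clock that always reports the same instant.
public sealed class FixedTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    // Normalizes the supplied instant to UTC once, at construction.
    public FixedTimeProvider(DateTimeOffset now) => _now = now.ToUniversalTime();

    // Every call returns the pinned instant, so code that depends on this
    // provider sees identical timestamps on every test run.
    public override DateTimeOffset GetUtcNow() => _now;
}
```

Code under test that accepts a `TimeProvider` can be handed this instance in place of `TimeProvider.System`.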
### 3. Document the Incident
Include comprehensive documentation in the test:
```csharp
/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>
```
### 4. Link to Sprint Tasks
Connect incident tests to the fix implementation:
```csharp
[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
```
### 5. Evolve Tests Over Time
Incident tests may need updates as the codebase evolves:
- Update fixtures when schemas change
- Adjust assertions when behavior intentionally changes
- Add new scenarios discovered during subsequent incidents
---
## Troubleshooting
### Manifest Not Available
If the replay manifest wasn't captured:
1. Check Evidence Locker for any captured events
2. Reconstruct the scenario from logs and metrics
3. Create a synthetic manifest for testing
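A synthetic manifest can be assembled by hand once the scenario has been reconstructed. The manifest schema is not described in this guide, so every field name below (`incidentId`, `occurredAt`, `synthetic`, `events`) is a placeholder, and `SyntheticManifest` is a hypothetical helper. Align the fields with the real replay manifest format before use.

```csharp
using System;
using System.Text.Json;

// Hypothetical builder; field names are placeholders, not the real schema.
public static class SyntheticManifest
{
    // Serializes a minimal manifest and flags it as synthetic so reviewers
    // can tell it apart from a manifest captured during the incident.
    public static string Build(
        string incidentId, DateTimeOffset occurredAt, params string[] eventNames)
    {
        var manifest = new
        {
            incidentId,
            occurredAt = occurredAt.ToUniversalTime().ToString("O"),
            synthetic = true,
            events = eventNames
        };
        return JsonSerializer.Serialize(
            manifest, new JsonSerializerOptions { WriteIndented = true });
    }
}
```

Marking the manifest as synthetic keeps the provenance visible when the resulting test is reviewed later.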
### Flaky Incident Tests
If the test is non-deterministic:
1. Identify non-deterministic inputs (time, random, external state)
2. Replace with TestKit deterministic primitives
3. Add retry logic only as a last resort
### Test No Longer Relevant
If the fix makes the scenario impossible:
1. Document why the test is no longer applicable
2. Move to an "archived incidents" test category
3. Keep the test for documentation purposes
---
## Related Documentation
- [TestKit Usage Guide](testkit-usage-guide.md)
- [Testing Practices](../../code-of-conduct/TESTING_PRACTICES.md)
- [CI Quality Gates](ci-quality-gates.md)
- [Replay Infrastructure](../../modules/replay/architecture.md)
---
## Changelog
### v1.0 (2026-01-27)
- Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
- CI integration patterns
- Best practices and troubleshooting