# Post-Incident Testing Guide

**Version:** 1.0
**Status:** Turn #6 Implementation
**Audience:** StellaOps developers, QA engineers, incident responders

---

## Overview

Every production incident should produce a permanent regression test. This guide describes the infrastructure and workflow for generating, reviewing, and maintaining post-incident tests in the StellaOps codebase.

### Key Principles

1. **Permanent Regression**: Incidents that reach production indicate a gap in testing. That gap must be permanently closed.
2. **Deterministic Replay**: Tests are generated from replay manifests captured during the incident.
3. **Severity-Gated**: P1/P2 incident tests block releases; P3/P4 tests are warning-only.
4. **Traceable**: Every incident test links back to the incident report and fix.

---

## Workflow

### 1. Incident Triggers Replay Capture

When an incident occurs, the replay infrastructure automatically captures:

- Event sequences with correlation IDs
- Input data (sanitized for PII)
- System state at time of incident
- Configuration and policy digests

This produces a **replay manifest** stored in the Evidence Locker.
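
To make the captured items concrete, here is a minimal sketch of reading a manifest with `System.Text.Json`. The JSON shape and every field name in it (`events`, `sequence`, `type`, `correlationId`, `configDigest`, `policyDigest`) are assumptions made for illustration; the real schema is defined by the replay infrastructure and may look quite different.

```csharp
using System;
using System.Text.Json;

// Hypothetical manifest content; the real schema is owned by the replay infrastructure.
const string manifestJson = """
{
  "incidentId": "INC-2026-001",
  "events": [
    { "sequence": 1, "type": "BundleCreateRequested", "correlationId": "corr-001" },
    { "sequence": 2, "type": "BundleCreateRequested", "correlationId": "corr-002" }
  ],
  "configDigest": "sha256:<config-digest>",
  "policyDigest": "sha256:<policy-digest>"
}
""";

using var doc = JsonDocument.Parse(manifestJson);
var events = doc.RootElement.GetProperty("events");

Console.WriteLine($"Events captured: {events.GetArrayLength()}");
foreach (var e in events.EnumerateArray())
{
    // Each event carries a correlation ID so related events can be matched during replay.
    var sequence = e.GetProperty("sequence").GetInt32();
    var type = e.GetProperty("type").GetString();
    var correlationId = e.GetProperty("correlationId").GetString();
    Console.WriteLine($"{sequence}: {type} ({correlationId})");
}
```

Inspecting a manifest this way before generating a scaffold can help confirm that the event sequence and digests were actually captured.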

### 2. Generate Test Scaffold

Use the `IncidentTestGenerator` to create a test scaffold from the replay manifest:

```csharp
using StellaOps.TestKit.Incident;

// Load the replay manifest
var manifestJson = File.ReadAllText("incident-replay-manifest.json");

// Create incident metadata
var metadata = new IncidentMetadata
{
    IncidentId = "INC-2026-001",
    OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
    RootCause = "Race condition in concurrent bundle creation",
    AffectedModules = ["EvidenceLocker", "Policy"],
    Severity = IncidentSeverity.P1,
    Title = "Evidence bundle duplication in high-concurrency scenario",
    ReportUrl = "https://incidents.stella-ops.internal/INC-2026-001"
};

// Generate the test scaffold
var generator = new IncidentTestGenerator();
var scaffold = generator.GenerateFromManifestJson(manifestJson, metadata);

// Output the generated test code
var code = scaffold.GenerateTestCode();
File.WriteAllText($"Tests/{scaffold.TestClassName}.cs", code);
```

### 3. Review and Complete Test

The generated scaffold is a starting point. A human must:

1. **Review fixtures**: Ensure input data is appropriate and sanitized.
2. **Complete assertions**: Add specific assertions for the expected behavior.
3. **Verify determinism**: Ensure the test produces consistent results.
4. **Add to CI**: Include the test in the appropriate test project.
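
One way to approach the determinism check in step 3 is to run the scenario several times and compare a digest of each run's output. The sketch below uses only standard .NET APIs; `RunScenario` is a hypothetical stand-in for replaying the manifest and is not part of TestKit.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for the scenario under test; a real test would
// replay the manifest here instead of generating numbers.
static string RunScenario(int seed)
{
    var random = new Random(seed);
    var ids = Enumerable.Range(0, 5).Select(_ => random.Next(1000)).OrderBy(x => x);
    return string.Join(",", ids);
}

// Hash the output so large results can be compared cheaply.
static string Digest(string output) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(output)));

// Run the scenario ten times with the same seed and collect the distinct digests.
var digests = Enumerable.Range(0, 10)
    .Select(_ => Digest(RunScenario(seed: 42)))
    .Distinct()
    .ToList();

Console.WriteLine(digests.Count == 1
    ? "Deterministic: all runs produced the same digest."
    : $"Non-deterministic: {digests.Count} distinct digests.");
```

More than one distinct digest suggests an uncontrolled input such as wall-clock time, unseeded randomness, or external state.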

### 4. Register for Tracking

Register the incident test for reporting:

```csharp
generator.RegisterIncidentTest(metadata.IncidentId, scaffold);

// Generate a summary report
var report = generator.GenerateReport();
Console.WriteLine($"Total incident tests: {report.TotalTests}");
Console.WriteLine($"P1 tests: {report.BySeverity.GetValueOrDefault(IncidentSeverity.P1, 0)}");
```

---

## Incident Metadata

The `IncidentMetadata` record captures essential incident context:

| Property | Required | Description |
|----------|----------|-------------|
| `IncidentId` | Yes | Unique identifier from incident management system |
| `OccurredAt` | Yes | When the incident occurred (UTC) |
| `RootCause` | Yes | Brief description of the root cause |
| `AffectedModules` | Yes | Modules impacted by the incident |
| `Severity` | Yes | P1 (critical) through P4 (low impact) |
| `Title` | No | Short descriptive title |
| `ReportUrl` | No | Link to incident report or postmortem |
| `ResolvedAt` | No | When the incident was resolved |
| `CorrelationIds` | No | IDs for replay matching |
| `FixTaskId` | No | Sprint task that implemented the fix |
| `Tags` | No | Categorization tags |

### Severity Levels

| Severity | Description | CI Behavior |
|----------|-------------|-------------|
| P1 | Critical: service down, data loss, security breach | Blocks releases |
| P2 | Major: significant degradation, partial outage | Blocks releases |
| P3 | Minor: limited impact, workaround available | Warning only |
| P4 | Low: cosmetic issues, minor bugs | Informational |
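
The CI behavior column can be expressed as a small mapping. The sketch below assumes a local `IncidentSeverity` enum that mirrors the names used in this guide and a hypothetical `GateAction` enum; neither declaration is taken from the TestKit source.

```csharp
using System;

// Map each severity to the CI behavior listed in the table above.
static GateAction ActionFor(IncidentSeverity severity) => severity switch
{
    IncidentSeverity.P1 or IncidentSeverity.P2 => GateAction.BlockRelease,
    IncidentSeverity.P3 => GateAction.Warn,
    IncidentSeverity.P4 => GateAction.Inform,
    _ => throw new ArgumentOutOfRangeException(nameof(severity))
};

foreach (var severity in Enum.GetValues<IncidentSeverity>())
{
    Console.WriteLine($"{severity}: {ActionFor(severity)}");
}

// Local stand-in mirroring the severity names used in this guide.
public enum IncidentSeverity { P1, P2, P3, P4 }

// Hypothetical type for illustration; not part of TestKit.
public enum GateAction { BlockRelease, Warn, Inform }
```

Keeping the mapping in one place makes it harder for the PR gate and the release gate to drift apart.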

---

## Generated Test Structure

The scaffold generates a test class with:

```csharp
[Trait("Category", TestCategories.PostIncident)]
[Trait("Incident", "INC-2026-001")]
[Trait("Severity", "P1")]
public sealed class Incident_INC_2026_001_Tests
{
    private static readonly IncidentMetadata Incident = new()
    {
        IncidentId = "INC-2026-001",
        OccurredAt = DateTimeOffset.Parse("2026-01-15T10:30:00Z"),
        RootCause = "Race condition in concurrent bundle creation",
        AffectedModules = ["EvidenceLocker", "Policy"],
        Severity = IncidentSeverity.P1,
        Title = "Evidence bundle duplication"
    };

    [Fact]
    public async Task Validates_RaceCondition_Fix()
    {
        // Arrange
        // TODO: Load fixtures from replay manifest

        // Act
        // TODO: Execute the scenario that triggered the incident

        // Assert
        // TODO: Verify the fix prevents the incident condition
    }
}
```
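
The class name in the scaffold above is consistent with replacing every non-alphanumeric character of the incident ID with an underscore. The following is a sketch of that rule only; `ToTestClassName` is a hypothetical helper, and the generator's actual naming logic may differ.

```csharp
using System;
using System.Linq;

// Replace every character that is not a letter or digit with '_' so the
// incident ID becomes a valid fragment of a C# identifier.
static string ToTestClassName(string incidentId)
{
    var safe = new string(incidentId
        .Select(c => char.IsLetterOrDigit(c) ? c : '_')
        .ToArray());
    return $"Incident_{safe}_Tests";
}

Console.WriteLine(ToTestClassName("INC-2026-001")); // Incident_INC_2026_001_Tests
```

A predictable name makes it easy to locate the test for a given incident from the incident ID alone.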

---

## CI Integration

### Test Filtering

Filter post-incident tests in CI:

```bash
# Run all post-incident tests
dotnet test --filter "Category=PostIncident"

# Run only P1/P2 tests (release-gating)
dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)"

# Run tests for a specific incident
dotnet test --filter "Incident=INC-2026-001"

# Run tests for a specific module
dotnet test --filter "Category=PostIncident&Module:EvidenceLocker=true"
```
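
When filters are assembled by tooling rather than typed by hand, a small helper can keep the expressions consistent. This is a minimal sketch; `BuildFilter` is a hypothetical name and only produces the severity-based expressions shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Build a `dotnet test --filter` expression for post-incident tests,
// optionally restricted to the given severities.
static string BuildFilter(IReadOnlyCollection<string> severities)
{
    var clauses = new List<string> { "Category=PostIncident" };
    if (severities.Count == 1)
    {
        clauses.Add($"Severity={severities.First()}");
    }
    else if (severities.Count > 1)
    {
        // Multiple severities are OR-ed together inside parentheses.
        clauses.Add("(" + string.Join("|", severities.Select(s => $"Severity={s}")) + ")");
    }
    return string.Join("&", clauses);
}

Console.WriteLine(BuildFilter(new[] { "P1", "P2" }));
// Category=PostIncident&(Severity=P1|Severity=P2)
```

The resulting string must still be quoted when passed on a shell command line, since `&`, `|`, and parentheses are shell metacharacters.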

### CI Lanes

| Lane | Filter | Trigger | Behavior |
|------|--------|---------|----------|
| PR Gate | `Category=PostIncident&(Severity=P1\|Severity=P2)` | Pull requests | Blocks merge |
| Release Gate | `Category=PostIncident` | Release builds | P1/P2 block, P3/P4 warn |
| Nightly | `Category=PostIncident` | Scheduled | Full run, report only |

### Example CI Configuration

```yaml
# .gitea/workflows/post-incident-tests.yml
name: Post-Incident Tests
on:
  pull_request:
  release:
    types: [created]

jobs:
  post-incident:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Run P1/P2 Incident Tests
        run: |
          dotnet test --filter "Category=PostIncident&(Severity=P1|Severity=P2)" \
            --logger "trx;LogFileName=incident-results.trx"

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: incident-test-results
          path: '**/incident-results.trx'
```

---

## Best Practices

### 1. Sanitize Fixtures

Remove or mask any PII or sensitive data from replay fixtures:

```csharp
// Before storing fixture
var sanitizedFixture = fixture
    .Replace(userEmail, "user@example.com")
    .Replace(apiKey, "REDACTED");
```
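
Literal replacement only works when the sensitive values are known in advance. A pattern-based pass can catch values that are not, as in the sketch below. The email pattern is deliberately simple, and the `apiKey` property name is an assumption about the fixture format; both would need adjusting for real data.

```csharp
using System;
using System.Text.RegularExpressions;

static string Sanitize(string fixture)
{
    // Mask anything shaped like an email address (a deliberately simple pattern).
    var masked = Regex.Replace(
        fixture,
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
        "user@example.com");

    // Redact the value of any JSON string property named "apiKey" (hypothetical field name).
    return Regex.Replace(masked, "(\"apiKey\"\\s*:\\s*\")[^\"]*(\")", "$1REDACTED$2");
}

var raw = "{\"email\":\"alice@corp.test\",\"apiKey\":\"sk-12345\"}";
Console.WriteLine(Sanitize(raw));
// {"email":"user@example.com","apiKey":"REDACTED"}
```

Pattern-based masking complements the human fixture review in the workflow and does not replace it.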

### 2. Use Deterministic Infrastructure

Ensure incident tests use TestKit's deterministic primitives:

```csharp
// Use deterministic time
using var time = new DeterministicTime(Incident.OccurredAt);

// Use deterministic random if needed
var random = new DeterministicRandom(seed: 42);
```
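
The same idea can be illustrated with standard .NET types only: the code under test receives its clock and random source as parameters, so a test can pin both. This sketch uses `TimeProvider` (.NET 8 and later) and a seeded `Random`; `FixedTimeProvider` and `CreateBundleId` are hypothetical names, and the sketch does not describe how TestKit's `DeterministicTime` or `DeterministicRandom` are implemented.

```csharp
using System;

var incidentTime = DateTimeOffset.Parse("2026-01-15T10:30:00Z");
TimeProvider clock = new FixedTimeProvider(incidentTime);
var random = new Random(42);

// Hypothetical code under test: it takes the clock and random source as
// parameters instead of reading DateTimeOffset.UtcNow or Random.Shared.
static string CreateBundleId(TimeProvider clock, Random random) =>
    $"{clock.GetUtcNow():yyyyMMddHHmmss}-{random.Next(0, 10000):D4}";

var first = CreateBundleId(clock, random);
var replay = CreateBundleId(new FixedTimeProvider(incidentTime), new Random(42));
Console.WriteLine(first == replay); // True

// Hypothetical helper: a TimeProvider frozen at a single instant.
public sealed class FixedTimeProvider(DateTimeOffset now) : TimeProvider
{
    public override DateTimeOffset GetUtcNow() => now;
}
```

Because both inputs are pinned, a replay with the same time and seed yields the same identifier, which is the property an incident test relies on.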

### 3. Document the Incident

Include comprehensive documentation in the test:

```csharp
/// <summary>
/// Regression test for incident INC-2026-001: Evidence bundle duplication.
/// </summary>
/// <remarks>
/// Root cause: Race condition in concurrent bundle creation.
///
/// The incident occurred when multiple workers attempted to create the same
/// evidence bundle simultaneously. The fix added optimistic locking with
/// a unique constraint on (tenant_id, bundle_id).
///
/// Report: https://incidents.stella-ops.internal/INC-2026-001
/// Fix: PR #1234
/// </remarks>
```

### 4. Link to Sprint Tasks

Connect incident tests to the fix implementation:

```csharp
[Fact]
[Trait("SprintTask", "EVIDENCE-0115-001")]
public async Task Validates_RaceCondition_Fix()
```

### 5. Evolve Tests Over Time

Incident tests may need updates as the codebase evolves:

- Update fixtures when schemas change
- Adjust assertions when behavior intentionally changes
- Add new scenarios discovered during subsequent incidents

---

## Troubleshooting

### Manifest Not Available

If the replay manifest wasn't captured:

1. Check Evidence Locker for any captured events
2. Reconstruct the scenario from logs and metrics
3. Create a synthetic manifest for testing

### Flaky Incident Tests

If the test is non-deterministic:

1. Identify non-deterministic inputs (time, random, external state)
2. Replace with TestKit deterministic primitives
3. Add retry logic only as a last resort

### Test No Longer Relevant

If the fix makes the scenario impossible:

1. Document why the test is no longer applicable
2. Move to an "archived incidents" test category
3. Keep the test for documentation purposes

---

## Related Documentation

- [TestKit Usage Guide](testkit-usage-guide.md)
- [Testing Practices](../../code-of-conduct/TESTING_PRACTICES.md)
- [CI Quality Gates](ci-quality-gates.md)
- [Replay Infrastructure](../../modules/replay/architecture.md)

---

## Changelog

### v1.0 (2026-01-27)
- Initial release: IncidentTestGenerator, IncidentMetadata, TestScaffold
- CI integration patterns
- Best practices and troubleshooting