# Sprint 0340-0001-0001: Scanner Offline Configuration Surface

**Sprint ID:** SPRINT_0340_0001_0001
**Topic:** Scanner Offline Kit Configuration Surface
**Priority:** P2 (Important)
**Status:** DONE
**Working Directory:** `src/Scanner/`
**Related Modules:** `StellaOps.Scanner.WebService`, `StellaOps.Scanner.Core`, `StellaOps.AirGap.Importer`

**Source Advisory:** 14-Dec-2025 - Offline and Air-Gap Technical Reference (§7)
**Gaps Addressed:** G5 (Scanner Config Surface)

---

## Objective

Implement the scanner configuration surface for offline kit operations as specified in advisory §7. This enables granular control over DSSE/Rekor verification requirements and trust anchor management with PURL-pattern matching for ecosystem-specific signing authorities.

---

## Target Configuration

Per advisory §7.1:

```yaml
scanner:
  offlineKit:
    requireDsse: true           # fail import if DSSE/Rekor verification fails
    rekorOfflineMode: true      # use local snapshots only
    attestationVerifier: https://attestor.internal
    trustAnchors:
      - anchorId: "npm-authority-2025"
        purlPattern: "pkg:npm/*"
        allowedKeyids: ["sha256:abc123", "sha256:def456"]
      - anchorId: "maven-central-2025"
        purlPattern: "pkg:maven/*"
        allowedKeyids: ["sha256:789abc"]
      - anchorId: "stella-ops-default"
        purlPattern: "*"
        allowedKeyids: ["sha256:stellaops-root-2025"]
```

---

## Delivery Tracker

| ID | Task | Status | Owner | Notes |
|----|------|--------|-------|-------|
| T1 | Design `OfflineKitOptions` configuration class | DONE | Agent | Added `enabled` gate to keep config opt-in. |
| T2 | Design `TrustAnchor` model with PURL pattern matching | DONE | Agent | |
| T3 | Implement PURL pattern matcher | DONE | Agent | Glob-style matching |
| T4 | Create `TrustAnchorRegistry` service | DONE | Agent | Resolution by PURL |
| T5 | Add configuration binding in `Program.cs` | DONE | Agent | |
| T6 | Create `OfflineKitOptionsValidator` | DONE | Agent | Startup validation |
| T7 | Integrate with `DsseVerifier` | DONE | Agent | Scanner OfflineKit import host consumes DSSE verification with trust anchor resolution (PURL match). |
| T8 | Implement DSSE failure handling per §7.2 | DONE | Agent | ProblemDetails + reason codes; `RequireDsse=false` soft-fail supported with warning path. |
| T9 | Add `rekorOfflineMode` enforcement | DONE | Agent | Offline Rekor receipt verification via local snapshot verifier; startup validation enforces snapshot directory. |
| T10 | Create configuration schema documentation | DONE | Agent | Added `src/Scanner/docs/schemas/scanner-offline-kit-config.schema.json`. |
| T11 | Write unit tests for PURL matcher | DONE | Agent | Added coverage in `src/Scanner/__Tests/StellaOps.Scanner.Core.Tests`. |
| T12 | Write unit tests for trust anchor resolution | DONE | Agent | Added coverage for registry + validator in `src/Scanner/__Tests/StellaOps.Scanner.Core.Tests`. |
| T13 | Write integration tests for offline import | DONE | Agent | Added Scanner.WebService OfflineKit endpoint tests (success + failure + soft-fail + audit wiring) with deterministic fixtures. |
| T14 | Update Helm chart values | DONE | Agent | Added OfflineKit env vars to `deploy/helm/stellaops/values-*.yaml`. |
| T15 | Update docker-compose samples | DONE | Agent | Added OfflineKit env vars to `deploy/compose/docker-compose.*.yaml`. |

---

## Technical Specification

### T1: OfflineKitOptions Configuration

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/Configuration/OfflineKitOptions.cs
namespace StellaOps.Scanner.Core.Configuration;

/// <summary>
/// Configuration for offline kit operations.
/// Per Scanner-AIRGAP-340-001.
/// </summary>
public sealed class OfflineKitOptions
{
    public const string SectionName = "Scanner:OfflineKit";

    /// <summary>
    /// When true, import fails if DSSE signature verification fails.
    /// When false, DSSE failure is logged as a warning but import proceeds.
    /// Default: true
    /// </summary>
    public bool RequireDsse { get; set; } = true;

    /// <summary>
    /// When true, Rekor verification uses only local snapshots.
    /// No online Rekor API calls are attempted.
    /// Default: true (for air-gap safety)
    /// </summary>
    public bool RekorOfflineMode { get; set; } = true;

    /// <summary>
    /// URL of the internal attestation verifier service.
    /// Used for delegated verification in clustered deployments.
    /// Optional; if not set, verification is performed locally.
    /// </summary>
    public string? AttestationVerifier { get; set; }

    /// <summary>
    /// Trust anchors for signature verification.
    /// Matched by PURL pattern; first match wins.
    /// </summary>
    public List<TrustAnchorConfig> TrustAnchors { get; set; } = new();

    /// <summary>
    /// Path to the directory containing trust root public keys.
    /// Keys are loaded by keyid reference from TrustAnchors.
    /// </summary>
    public string? TrustRootDirectory { get; set; }

    /// <summary>
    /// Path to the offline Rekor snapshot directory.
    /// Contains checkpoint.sig and entries/*.jsonl.
    /// </summary>
    public string? RekorSnapshotDirectory { get; set; }
}
```
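
The binding and startup-validation wiring for T5/T6 is not reproduced in this spec; the sketch below shows one plausible `Program.cs` shape. The `builder` bootstrap and call sites are assumptions, not the actual Scanner host code; `OfflineKitOptionsValidator` is the class defined under T6 below.

```csharp
// Hypothetical Program.cs wiring for T5/T6 (sketch only).
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using StellaOps.Scanner.Core.Configuration;

var builder = WebApplication.CreateBuilder(args);

// Bind the Scanner:OfflineKit section and fail fast at startup on invalid config.
builder.Services.AddOptions<OfflineKitOptions>()
    .Bind(builder.Configuration.GetSection(OfflineKitOptions.SectionName))
    .ValidateOnStart();

// Register the custom validator (T6) so ValidateOnStart() executes it.
builder.Services.AddSingleton<IValidateOptions<OfflineKitOptions>, OfflineKitOptionsValidator>();
```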

### T2: TrustAnchor Model

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/Configuration/TrustAnchorConfig.cs
namespace StellaOps.Scanner.Core.Configuration;

/// <summary>
/// Trust anchor configuration for ecosystem-specific signing authorities.
/// </summary>
public sealed class TrustAnchorConfig
{
    /// <summary>
    /// Unique identifier for this trust anchor.
    /// Used in audit logs and error messages.
    /// </summary>
    public required string AnchorId { get; set; }

    /// <summary>
    /// PURL pattern to match against.
    /// Supports glob patterns: "pkg:npm/*", "pkg:maven/org.apache.*", "*".
    /// Patterns are matched in order; first match wins.
    /// </summary>
    public required string PurlPattern { get; set; }

    /// <summary>
    /// List of allowed key fingerprints (SHA-256 of the public key).
    /// Format: "sha256:hexstring" or just "hexstring".
    /// At least one key must match for verification to pass.
    /// </summary>
    public required List<string> AllowedKeyids { get; set; }

    /// <summary>
    /// Optional description for documentation/UI purposes.
    /// </summary>
    public string? Description { get; set; }

    /// <summary>
    /// When this anchor expires. Null = no expiry.
    /// After expiry, the anchor is skipped with a warning.
    /// </summary>
    public DateTimeOffset? ExpiresAt { get; set; }

    /// <summary>
    /// Minimum required signatures from this anchor.
    /// Default: 1 (at least one key must sign)
    /// </summary>
    public int MinSignatures { get; set; } = 1;
}
```

### T3: PURL Pattern Matcher

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/TrustAnchors/PurlPatternMatcher.cs
using System.Text.RegularExpressions;

namespace StellaOps.Scanner.Core.TrustAnchors;

/// <summary>
/// Matches Package URLs against glob patterns.
/// Supports:
/// - Exact match: "pkg:npm/@scope/package@1.0.0"
/// - Prefix wildcard: "pkg:npm/*"
/// - Infix wildcard: "pkg:maven/org.apache.*"
/// - Universal: "*"
/// </summary>
public sealed class PurlPatternMatcher
{
    private readonly string _pattern;
    private readonly Regex _regex;

    public PurlPatternMatcher(string pattern)
    {
        _pattern = pattern ?? throw new ArgumentNullException(nameof(pattern));
        _regex = CompilePattern(pattern);
    }

    public bool IsMatch(string purl)
    {
        if (string.IsNullOrEmpty(purl)) return false;
        return _regex.IsMatch(purl);
    }

    private static Regex CompilePattern(string pattern)
    {
        if (pattern == "*")
        {
            return new Regex("^.*$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }

        // Escape regex special chars except *
        var escaped = Regex.Escape(pattern);

        // Replace escaped \* with .*
        escaped = escaped.Replace(@"\*", ".*");

        // Anchor the pattern
        return new Regex($"^{escaped}$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    public string Pattern => _pattern;
}
```
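
A quick illustration of the matching semantics (the PURL values are hypothetical examples, not fixtures from the test suite):

```csharp
// Prefix wildcard: everything in the npm ecosystem.
var npm = new PurlPatternMatcher("pkg:npm/*");
Console.WriteLine(npm.IsMatch("pkg:npm/left-pad@1.3.0"));        // true
Console.WriteLine(npm.IsMatch("pkg:maven/org.apache.commons"));  // false (different ecosystem)

// Infix wildcard: a namespace subtree within Maven.
var apache = new PurlPatternMatcher("pkg:maven/org.apache.*");
Console.WriteLine(apache.IsMatch("pkg:maven/org.apache.logging.log4j@2.20.0")); // true

// Universal fallback anchor.
var any = new PurlPatternMatcher("*");
Console.WriteLine(any.IsMatch("pkg:pypi/requests@2.31.0"));      // true
```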

### T4: TrustAnchorRegistry Service

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/TrustAnchors/TrustAnchorRegistry.cs
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using StellaOps.Scanner.Core.Configuration;

namespace StellaOps.Scanner.Core.TrustAnchors;

/// <summary>
/// Registry for trust anchors with PURL-based resolution.
/// Thread-safe and supports runtime reload.
/// </summary>
public sealed class TrustAnchorRegistry : ITrustAnchorRegistry
{
    private readonly IOptionsMonitor<OfflineKitOptions> _options;
    private readonly IPublicKeyLoader _keyLoader;
    private readonly ILogger<TrustAnchorRegistry> _logger;
    private readonly TimeProvider _timeProvider;

    private IReadOnlyList<CompiledTrustAnchor>? _compiledAnchors;
    private readonly object _lock = new();

    public TrustAnchorRegistry(
        IOptionsMonitor<OfflineKitOptions> options,
        IPublicKeyLoader keyLoader,
        ILogger<TrustAnchorRegistry> logger,
        TimeProvider timeProvider)
    {
        _options = options;
        _keyLoader = keyLoader;
        _logger = logger;
        _timeProvider = timeProvider;

        // Recompile on config change
        _options.OnChange(_ => InvalidateCache());
    }

    /// <summary>
    /// Resolves the trust anchor for a given PURL.
    /// Returns the first matching anchor, or null if no match.
    /// </summary>
    public TrustAnchorResolution? ResolveForPurl(string purl)
    {
        var anchors = GetCompiledAnchors();
        var now = _timeProvider.GetUtcNow();

        foreach (var anchor in anchors)
        {
            if (anchor.Matcher.IsMatch(purl))
            {
                // Check expiry
                if (anchor.Config.ExpiresAt.HasValue && anchor.Config.ExpiresAt.Value < now)
                {
                    _logger.LogWarning(
                        "Trust anchor {AnchorId} has expired, skipping",
                        anchor.Config.AnchorId);
                    continue;
                }

                return new TrustAnchorResolution(
                    AnchorId: anchor.Config.AnchorId,
                    AllowedKeyids: anchor.Config.AllowedKeyids,
                    MinSignatures: anchor.Config.MinSignatures,
                    PublicKeys: anchor.LoadedKeys);
            }
        }

        return null;
    }

    /// <summary>
    /// Gets all configured trust anchors (for diagnostics).
    /// </summary>
    public IReadOnlyList<TrustAnchorConfig> GetAllAnchors()
    {
        return _options.CurrentValue.TrustAnchors.AsReadOnly();
    }

    private IReadOnlyList<CompiledTrustAnchor> GetCompiledAnchors()
    {
        if (_compiledAnchors is not null) return _compiledAnchors;

        lock (_lock)
        {
            if (_compiledAnchors is not null) return _compiledAnchors;

            var config = _options.CurrentValue;
            var compiled = new List<CompiledTrustAnchor>();

            foreach (var anchor in config.TrustAnchors)
            {
                try
                {
                    var matcher = new PurlPatternMatcher(anchor.PurlPattern);
                    var keys = LoadKeysForAnchor(anchor, config.TrustRootDirectory);

                    compiled.Add(new CompiledTrustAnchor(anchor, matcher, keys));
                }
                catch (Exception ex)
                {
                    _logger.LogError(ex,
                        "Failed to compile trust anchor {AnchorId}",
                        anchor.AnchorId);
                }
            }

            _compiledAnchors = compiled.AsReadOnly();
            return _compiledAnchors;
        }
    }

    private IReadOnlyDictionary<string, byte[]> LoadKeysForAnchor(
        TrustAnchorConfig anchor,
        string? keyDirectory)
    {
        var keys = new Dictionary<string, byte[]>(StringComparer.OrdinalIgnoreCase);

        foreach (var keyid in anchor.AllowedKeyids)
        {
            var normalizedKeyid = NormalizeKeyid(keyid);
            var keyBytes = _keyLoader.LoadKey(normalizedKeyid, keyDirectory);

            if (keyBytes is not null)
            {
                keys[normalizedKeyid] = keyBytes;
            }
            else
            {
                _logger.LogWarning(
                    "Key {Keyid} not found for anchor {AnchorId}",
                    keyid, anchor.AnchorId);
            }
        }

        return keys;
    }

    private static string NormalizeKeyid(string keyid)
    {
        if (keyid.StartsWith("sha256:", StringComparison.OrdinalIgnoreCase))
            return keyid[7..].ToLowerInvariant();
        return keyid.ToLowerInvariant();
    }

    private void InvalidateCache()
    {
        lock (_lock)
        {
            _compiledAnchors = null;
        }
    }

    private sealed record CompiledTrustAnchor(
        TrustAnchorConfig Config,
        PurlPatternMatcher Matcher,
        IReadOnlyDictionary<string, byte[]> LoadedKeys);
}

public sealed record TrustAnchorResolution(
    string AnchorId,
    IReadOnlyList<string> AllowedKeyids,
    int MinSignatures,
    IReadOnlyDictionary<string, byte[]> PublicKeys);
```
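
A sketch of how an import flow could consult the registry; the caller shape (an injected `registry` and the failure handling) is assumed for illustration:

```csharp
// Resolve the anchor for a package; null means no configured anchor matched.
TrustAnchorResolution? resolution = registry.ResolveForPurl("pkg:npm/left-pad@1.3.0");
if (resolution is null)
{
    // With RequireDsse=true, no matching anchor should fail the import.
    throw new InvalidOperationException("No trust anchor matches the package PURL.");
}

// resolution.PublicKeys holds only the keys actually loaded from TrustRootDirectory;
// the verifier must see at least resolution.MinSignatures valid signatures from them.
```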

### T6: Options Validator

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/Configuration/OfflineKitOptionsValidator.cs
using Microsoft.Extensions.Options;
using StellaOps.Scanner.Core.TrustAnchors;

namespace StellaOps.Scanner.Core.Configuration;

public sealed class OfflineKitOptionsValidator : IValidateOptions<OfflineKitOptions>
{
    public ValidateOptionsResult Validate(string? name, OfflineKitOptions options)
    {
        var errors = new List<string>();

        // Validate trust anchors
        if (options.RequireDsse && options.TrustAnchors.Count == 0)
        {
            errors.Add("RequireDsse is true but no TrustAnchors are configured");
        }

        foreach (var anchor in options.TrustAnchors)
        {
            if (string.IsNullOrWhiteSpace(anchor.AnchorId))
            {
                errors.Add("TrustAnchor has empty AnchorId");
            }

            if (string.IsNullOrWhiteSpace(anchor.PurlPattern))
            {
                errors.Add($"TrustAnchor '{anchor.AnchorId}' has empty PurlPattern");
            }

            if (anchor.AllowedKeyids.Count == 0)
            {
                errors.Add($"TrustAnchor '{anchor.AnchorId}' has no AllowedKeyids");
            }

            if (anchor.MinSignatures < 1)
            {
                errors.Add($"TrustAnchor '{anchor.AnchorId}' MinSignatures must be >= 1");
            }

            if (anchor.MinSignatures > anchor.AllowedKeyids.Count)
            {
                errors.Add(
                    $"TrustAnchor '{anchor.AnchorId}' MinSignatures ({anchor.MinSignatures}) " +
                    $"exceeds AllowedKeyids count ({anchor.AllowedKeyids.Count})");
            }

            // Validate pattern syntax
            try
            {
                _ = new PurlPatternMatcher(anchor.PurlPattern);
            }
            catch (Exception ex)
            {
                errors.Add($"TrustAnchor '{anchor.AnchorId}' has invalid PurlPattern: {ex.Message}");
            }
        }

        // Check for duplicate anchor IDs
        var duplicateIds = options.TrustAnchors
            .GroupBy(a => a.AnchorId, StringComparer.OrdinalIgnoreCase)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();

        if (duplicateIds.Count > 0)
        {
            errors.Add($"Duplicate TrustAnchor AnchorIds: {string.Join(", ", duplicateIds)}");
        }

        // Validate paths exist (if specified)
        if (!string.IsNullOrEmpty(options.TrustRootDirectory) &&
            !Directory.Exists(options.TrustRootDirectory))
        {
            errors.Add($"TrustRootDirectory does not exist: {options.TrustRootDirectory}");
        }

        if (options.RekorOfflineMode &&
            !string.IsNullOrEmpty(options.RekorSnapshotDirectory) &&
            !Directory.Exists(options.RekorSnapshotDirectory))
        {
            errors.Add($"RekorSnapshotDirectory does not exist: {options.RekorSnapshotDirectory}");
        }

        return errors.Count > 0
            ? ValidateOptionsResult.Fail(errors)
            : ValidateOptionsResult.Success;
    }
}
```

### T8: DSSE Failure Handling

Per advisory §7.2:

```csharp
// src/Scanner/__Libraries/StellaOps.Scanner.Core/Import/OfflineKitImportService.cs

public async Task<OfflineKitImportResult> ImportAsync(
    OfflineKitImportRequest request,
    CancellationToken cancellationToken)
{
    var options = _options.CurrentValue;
    var dsseWarning = false; // set when RequireDsse=false and DSSE verification fails

    // ... bundle extraction and manifest validation ...

    // DSSE verification
    var dsseResult = await _dsseVerifier.VerifyAsync(envelope, trustConfig, cancellationToken);

    if (!dsseResult.IsValid)
    {
        if (options.RequireDsse)
        {
            // Hard fail per §7.2: "DSSE/Rekor fail, Cosign + manifest OK"
            _logger.LogError(
                "DSSE verification failed and RequireDsse=true: {Reason}",
                dsseResult.ReasonCode);

            // Keep old feeds active
            // Mark import as failed
            // Surface ProblemDetails error via API/UI

            return new OfflineKitImportResult
            {
                Success = false,
                ReasonCode = "DSSE_VERIFY_FAIL",
                ReasonMessage = dsseResult.ReasonMessage,
                StructuredFields = new Dictionary<string, string>
                {
                    ["rekorUuid"] = dsseResult.RekorUuid ?? "",
                    ["attestationDigest"] = dsseResult.AttestationDigest ?? "",
                    ["offlineKitHash"] = manifest.PayloadSha256,
                    ["failureReason"] = dsseResult.ReasonCode
                }
            };
        }
        else
        {
            // Soft fail (§7.2 rollout mode): treat as warning, allow import with alerts
            _logger.LogWarning(
                "DSSE verification failed but RequireDsse=false, proceeding: {Reason}",
                dsseResult.ReasonCode);

            // Continue with import but flag in result
            dsseWarning = true;
        }
    }

    // Rekor verification
    if (options.RekorOfflineMode)
    {
        var rekorResult = await _rekorVerifier.VerifyOfflineAsync(
            envelope,
            options.RekorSnapshotDirectory,
            cancellationToken);

        if (!rekorResult.IsValid && options.RequireDsse)
        {
            return new OfflineKitImportResult
            {
                Success = false,
                ReasonCode = "REKOR_VERIFY_FAIL",
                ReasonMessage = rekorResult.ReasonMessage,
                StructuredFields = new Dictionary<string, string>
                {
                    ["rekorUuid"] = rekorResult.Uuid ?? "",
                    ["rekorLogIndex"] = rekorResult.LogIndex?.ToString() ?? "",
                    ["offlineKitHash"] = manifest.PayloadSha256,
                    ["failureReason"] = rekorResult.ReasonCode
                }
            };
        }
    }

    // ... continue with feed swap, audit event emission ...
}
```
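
The hard-fail path references surfacing ProblemDetails via API/UI; the sketch below shows one way a minimal-API endpoint could map a failed `OfflineKitImportResult`. The route matches the one named under Decisions & Risks, but the endpoint body, status code, and `reason_code` extension member are assumptions rather than the confirmed Scanner.WebService implementation:

```csharp
// Hypothetical endpoint mapping (sketch only).
app.MapPost("/api/offline-kit/import", async (OfflineKitImportRequest request,
    OfflineKitImportService importer, CancellationToken ct) =>
{
    var result = await importer.ImportAsync(request, ct);
    if (result.Success)
    {
        return Results.Ok(result);
    }

    // Stable machine-readable code plus the structured fields for log correlation.
    var extensions = new Dictionary<string, object?> { ["reason_code"] = result.ReasonCode };
    foreach (var (key, value) in result.StructuredFields)
    {
        extensions[key] = value;
    }

    return Results.Problem(
        title: "Offline kit import failed",
        detail: result.ReasonMessage,
        statusCode: StatusCodes.Status422UnprocessableEntity,
        extensions: extensions);
});
```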

---

## Acceptance Criteria

### Configuration
- [x] `Scanner:OfflineKit` section binds correctly from appsettings.json
- [x] `OfflineKitOptionsValidator` runs at startup
- [x] Invalid configuration prevents service startup with a clear error
- [x] Configuration changes are detected via `IOptionsMonitor`

### Trust Anchors
- [x] PURL patterns match correctly (exact, prefix, suffix, wildcard)
- [x] First matching anchor wins (order matters)
- [x] Expired anchors are skipped with a warning
- [x] Missing keys for an anchor are logged as a warning
- [x] At least `MinSignatures` keys must sign

### DSSE Verification
- [x] When `RequireDsse=true`: DSSE failure blocks import
- [x] When `RequireDsse=false`: DSSE failure logs a warning, import proceeds
- [x] Trust anchor resolution integrates with `DsseVerifier`

### Rekor Verification
- [x] When `RekorOfflineMode=true`: No network calls to the Rekor API
- [x] Offline Rekor uses the snapshot from `RekorSnapshotDirectory`
- [x] Missing snapshot directory fails validation at startup

---

## Dependencies

- Sprint 0338 (Monotonicity, Quarantine) for import integration
- `StellaOps.AirGap.Importer` for `DsseVerifier`

---

## Testing Strategy

1. **Unit tests** for `PurlPatternMatcher` with edge cases
2. **Unit tests** for `TrustAnchorRegistry` resolution logic
3. **Unit tests** for `OfflineKitOptionsValidator`
4. **Integration tests** for config binding
5. **Integration tests** for import with various trust anchor configurations

---

## Configuration Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://stella-ops.org/schemas/scanner-offline-kit-config.json",
  "title": "Scanner Offline Kit Configuration",
  "type": "object",
  "properties": {
    "requireDsse": {
      "type": "boolean",
      "default": true,
      "description": "Fail import if DSSE verification fails"
    },
    "rekorOfflineMode": {
      "type": "boolean",
      "default": true,
      "description": "Use only local Rekor snapshots"
    },
    "attestationVerifier": {
      "type": "string",
      "format": "uri",
      "description": "URL of internal attestation verifier"
    },
    "trustRootDirectory": {
      "type": "string",
      "description": "Path to directory containing public keys"
    },
    "rekorSnapshotDirectory": {
      "type": "string",
      "description": "Path to Rekor snapshot directory"
    },
    "trustAnchors": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["anchorId", "purlPattern", "allowedKeyids"],
        "properties": {
          "anchorId": {
            "type": "string",
            "minLength": 1
          },
          "purlPattern": {
            "type": "string",
            "minLength": 1,
            "examples": ["pkg:npm/*", "pkg:maven/org.apache.*", "*"]
          },
          "allowedKeyids": {
            "type": "array",
            "items": { "type": "string" },
            "minItems": 1
          },
          "description": { "type": "string" },
          "expiresAt": {
            "type": "string",
            "format": "date-time"
          },
          "minSignatures": {
            "type": "integer",
            "minimum": 1,
            "default": 1
          }
        }
      }
    }
  }
}
```

---

## Helm Values Update

```yaml
# deploy/helm/stellaops/values.yaml

scanner:
  offlineKit:
    enabled: true
    requireDsse: true
    rekorOfflineMode: true
    # attestationVerifier: https://attestor.internal
    trustRootDirectory: /etc/stellaops/trust-roots
    rekorSnapshotDirectory: /var/lib/stellaops/rekor-snapshot
    trustAnchors:
      - anchorId: "stellaops-default-2025"
        purlPattern: "*"
        allowedKeyids:
          - "sha256:your-key-fingerprint-here"
        minSignatures: 1
```

---

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-15 | Implemented OfflineKit options/validator + trust anchor matcher/registry; wired Scanner.WebService options binding + DI; marked T7-T9 blocked pending import pipeline + offline Rekor verifier. | Agent |
| 2025-12-17 | Unblocked T7-T9/T13 by implementing a Scanner-side OfflineKit import host (API + services) and offline Rekor receipt verification; started wiring DSSE/Rekor failure handling and integration tests. | Agent |
| 2025-12-18 | Completed T7-T9/T13: OfflineKit import/status endpoints, DSSE + offline Rekor verification gates, audit emitter wiring, and deterministic integration tests in `src/Scanner/__Tests/StellaOps.Scanner.WebService.Tests/OfflineKitEndpointsTests.cs`. | Agent |

## Decisions & Risks
- **Owning host:** Scanner WebService owns the Offline Kit HTTP surface (`/api/offline-kit/import`, `/api/offline-kit/status`) and exposes `/metrics` for Offline Kit counters/histograms.
- **Trust anchor selection:** Resolve a deterministic PURL from metadata (`pkg:stellaops/{metadata.kind}`) and match it against configured trust anchors; extend to SBOM-derived ecosystem PURLs in a follow-up sprint if needed.
- **Rekor offline verification:** Use `RekorOfflineReceiptVerifier` with a required local snapshot directory; no network calls are attempted when `RekorOfflineMode=true`.

## Next Checkpoints
- None (sprint complete).
# Sprint 0341-0001-0001 · Observability & Audit Enhancements

## Topic & Scope
- Add Offline Kit observability and audit primitives (metrics, structured logs, machine-readable error/reason codes, and an Authority/Postgres audit trail) so operators can monitor, debug, and attest air-gapped operations.
- Evidence: Prometheus scraping endpoint with Offline Kit counters/histograms, standardized log fields + tenant context enrichment, CLI ProblemDetails outputs with stable codes, Postgres migration + repository + tests, docs update + Grafana dashboard JSON.
- **Sprint ID:** `SPRINT_0341_0001_0001` · **Priority:** P1-P2
- **Working directories:**
  - `src/AirGap/StellaOps.AirGap.Importer/` (metrics, logging)
  - `src/Cli/StellaOps.Cli/Output/` (error codes)
  - `src/Cli/StellaOps.Cli/Services/` (ProblemDetails parsing integration)
  - `src/Cli/StellaOps.Cli/Services/Transport/` (SDK client ProblemDetails parsing integration)
  - `src/Authority/__Libraries/StellaOps.Authority.Storage.Postgres/` (audit schema)
- **Source advisory:** `docs/product-advisories/14-Dec-2025 - Offline and Air-Gap Technical Reference.md` (§10, §11, §13)
- **Gaps addressed:** G11 (Prometheus Metrics), G12 (Structured Logging), G13 (Error Codes), G14 (Audit Schema)

## Dependencies & Concurrency
- Depends on Sprint 0338 (Monotonicity, Quarantine) for importer integration points and event fields.
- Depends on Sprint 0339 (CLI) for exit code mapping.
- Prometheus/OpenTelemetry stack must be available in-host; exporter choice must match existing service patterns.
- Concurrency note: touches AirGap Importer + CLI + Authority storage; avoid cross-module contract changes without recording them in this sprint's Decisions & Risks.

## Documentation Prerequisites
- `docs/product-advisories/14-Dec-2025 - Offline and Air-Gap Technical Reference.md`
- `docs/airgap/airgap-mode.md`
- `docs/airgap/advisory-implementation-roadmap.md`
- `docs/modules/platform/architecture-overview.md`
- `docs/modules/cli/architecture.md`
- `docs/modules/authority/architecture.md`
- `docs/db/README.md`
- `docs/db/SPECIFICATION.md`
- `docs/db/RULES.md`
- `docs/db/VERIFICATION.md`

## Delivery Tracker

| ID | Task | Status | Owner | Notes |
|----|------|--------|-------|-------|
| **Metrics (G11)** | | | | |
| T1 | Design metrics interface | DONE | Agent | Start with `OfflineKitMetrics` + tag keys and ensure naming matches advisory. |
| T2 | Implement `offlinekit_import_total` counter | DONE | Agent | Implement in `OfflineKitMetrics`. |
| T3 | Implement `offlinekit_attestation_verify_latency_seconds` histogram | DONE | Agent | Implement in `OfflineKitMetrics`. |
| T4 | Implement `attestor_rekor_success_total` counter | DONE | Agent | Implement in `OfflineKitMetrics` (call sites may land later). |
| T5 | Implement `attestor_rekor_retry_total` counter | DONE | Agent | Implement in `OfflineKitMetrics` (call sites may land later). |
| T6 | Implement `rekor_inclusion_latency` histogram | DONE | Agent | Implement in `OfflineKitMetrics` (call sites may land later). |
| T7 | Register metrics with Prometheus endpoint | DONE | Agent | Scanner WebService exposes `/metrics` (Prometheus text format) including Offline Kit counters/histograms. |
| **Logging (G12)** | | | | |
| T8 | Define structured logging constants | DONE | Agent | Add `OfflineKitLogFields` + scope helpers. |
| T9 | Update `ImportValidator` logging | DONE | Agent | Align log templates + tenant scope usage. |
| T10 | Update `DsseVerifier` logging | DONE | Agent | Add structured success/failure logs (no secrets). |
| T11 | Update quarantine logging | DONE | Agent | Align log templates + tenant scope usage. |
| T12 | Create logging enricher for tenant context | DONE | Agent | Use `ILogger.BeginScope` with `tenant_id` consistently. |
| **Error Codes (G13)** | | | | |
| T13 | Add missing error codes to `CliErrorCodes` | DONE | Agent | Add Offline Kit/AirGap CLI error codes. |
| T14 | Create `OfflineKitReasonCodes` class | DONE | Agent | Define reason codes per advisory §11.2 + remediation/exit mapping. |
| T15 | Integrate codes with ProblemDetails | DONE | Agent | Parse `reason_code`/`reasonCode` from ProblemDetails and surface via CLI error rendering. |
| **Audit Schema (G14)** | | | | |
| T16 | Design extended audit schema | DONE | Agent | Align with advisory §13.2 and Authority RLS (`tenant_id`). |
| T17 | Create migration for `offline_kit_audit` table | DONE | Agent | Add `authority.offline_kit_audit` + indexes + RLS policy. |
| T18 | Implement `IOfflineKitAuditRepository` | DONE | Agent | Repository + query helpers (tenant/type/result). |
| T19 | Create audit event emitter service | DONE | Agent | Emitter wraps repository and must not fail import flows. |
| T20 | Wire audit to import/activation flows | DONE | Agent | Scanner OfflineKit import emits Authority audit events via `IOfflineKitAuditEmitter` (best-effort; failures do not block imports). |
| **Testing & Docs** | | | | |
| T21 | Write unit tests for metrics | DONE | Agent | Cover instrument names + label sets via `MeterListener`. |
| T22 | Write integration tests for audit | DONE | Agent | Cover migration + insert/query via Authority Postgres Testcontainers fixture (requires Docker). |
| T23 | Update observability documentation | DONE | Agent | Align docs with implementation + blocked items (`T7`,`T20`). |
| T24 | Add Grafana dashboard JSON | DONE | Agent | Commit dashboard artifact under `docs/observability/dashboards/`. |

---

## Technical Specification

### Part 1: Prometheus Metrics (G11)

Per advisory §10.1:

```csharp
// src/AirGap/StellaOps.AirGap.Importer/Telemetry/OfflineKitMetrics.cs
using System.Diagnostics.Metrics;

namespace StellaOps.AirGap.Importer.Telemetry;

/// <summary>
/// Prometheus metrics for offline kit operations.
/// Per AIRGAP-OBS-341-001.
/// </summary>
public sealed class OfflineKitMetrics
{
    private readonly Counter<long> _importTotal;
    private readonly Histogram<double> _attestationVerifyLatency;
    private readonly Counter<long> _rekorSuccessTotal;
    private readonly Counter<long> _rekorRetryTotal;
    private readonly Histogram<double> _rekorInclusionLatency;

    public OfflineKitMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("StellaOps.AirGap.Importer");

        _importTotal = meter.CreateCounter<long>(
            name: "offlinekit_import_total",
            unit: "{imports}",
            description: "Total number of offline kit import attempts");

        _attestationVerifyLatency = meter.CreateHistogram<double>(
            name: "offlinekit_attestation_verify_latency_seconds",
            unit: "s",
            description: "Time taken to verify attestations during import");

        _rekorSuccessTotal = meter.CreateCounter<long>(
            name: "attestor_rekor_success_total",
            unit: "{verifications}",
            description: "Successful Rekor verification count");

        _rekorRetryTotal = meter.CreateCounter<long>(
            name: "attestor_rekor_retry_total",
            unit: "{retries}",
            description: "Rekor verification retry count");

        _rekorInclusionLatency = meter.CreateHistogram<double>(
            name: "rekor_inclusion_latency",
            unit: "s",
            description: "Time to verify Rekor inclusion proof");
    }

    /// <summary>
    /// Records an import attempt with status.
    /// </summary>
    /// <param name="status">One of: success, failed_dsse, failed_rekor, failed_cosign, failed_manifest, failed_hash, failed_version</param>
    /// <param name="tenantId">Tenant identifier</param>
    public void RecordImport(string status, string tenantId)
    {
        _importTotal.Add(1,
            new KeyValuePair<string, object?>("status", status),
            new KeyValuePair<string, object?>("tenant_id", tenantId));
    }

    /// <summary>
    /// Records attestation verification latency.
    /// </summary>
    public void RecordAttestationVerifyLatency(double seconds, string attestationType, bool success)
    {
        _attestationVerifyLatency.Record(seconds,
            new KeyValuePair<string, object?>("attestation_type", attestationType),
            new KeyValuePair<string, object?>("success", success));
    }

    /// <summary>
    /// Records a successful Rekor verification.
    /// </summary>
    public void RecordRekorSuccess(string mode)
    {
        _rekorSuccessTotal.Add(1,
            new KeyValuePair<string, object?>("mode", mode)); // "online" or "offline"
    }

    /// <summary>
    /// Records a Rekor retry.
    /// </summary>
    public void RecordRekorRetry(string reason)
    {
        _rekorRetryTotal.Add(1,
            new KeyValuePair<string, object?>("reason", reason));
    }

    /// <summary>
    /// Records Rekor inclusion proof verification latency.
    /// </summary>
    public void RecordRekorInclusionLatency(double seconds, bool success)
    {
        _rekorInclusionLatency.Record(seconds,
            new KeyValuePair<string, object?>("success", success));
    }
}
```
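
T21 covers instrument names and label sets via `MeterListener`; a minimal test sketch follows. The `AddMetrics()` bootstrap for obtaining an `IMeterFactory` and the assertion style are assumptions about the test host, not the committed test code:

```csharp
// Sketch of a MeterListener-based unit test for offlinekit_import_total.
using System.Diagnostics.Metrics;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection().AddMetrics().BuildServiceProvider();
var metrics = new OfflineKitMetrics(services.GetRequiredService<IMeterFactory>());

var observed = new List<(long Value, string? Status, string? Tenant)>();
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Name == "offlinekit_import_total")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    string? status = null, tenant = null;
    foreach (var tag in tags)
    {
        if (tag.Key == "status") status = tag.Value?.ToString();
        if (tag.Key == "tenant_id") tenant = tag.Value?.ToString();
    }
    observed.Add((value, status, tenant));
});
listener.Start();

metrics.RecordImport("failed_dsse", "tenant-a");

// Expect exactly one measurement: (1, "failed_dsse", "tenant-a").
```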

#### Metric Registration

```csharp
// src/AirGap/StellaOps.AirGap.Importer/ServiceCollectionExtensions.cs

public static IServiceCollection AddAirGapImporter(this IServiceCollection services)
{
    services.AddSingleton<OfflineKitMetrics>();

    // ... other registrations ...

    return services;
}
```
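
The exporter choice is left to the owning host (see Decisions & Risks). One common wiring, sketched here with the OpenTelemetry Prometheus exporter, assuming the `OpenTelemetry.Extensions.Hosting` and `OpenTelemetry.Exporter.Prometheus.AspNetCore` packages; this is not the confirmed Scanner.WebService implementation:

```csharp
// Sketch: expose the importer meter on /metrics in Prometheus text format.
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("StellaOps.AirGap.Importer")   // meter name used by OfflineKitMetrics
        .AddPrometheusExporter());

var app = builder.Build();
app.MapPrometheusScrapingEndpoint();             // serves /metrics for Prometheus scraping
```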

### Part 2: Structured Logging (G12)

Per advisory §10.2:

```csharp
// src/AirGap/StellaOps.AirGap.Importer/Telemetry/OfflineKitLogFields.cs
using Microsoft.Extensions.Logging;

namespace StellaOps.AirGap.Importer.Telemetry;

/// <summary>
/// Standardized log field names for offline kit operations.
/// Per advisory §10.2.
/// </summary>
public static class OfflineKitLogFields
{
    public const string RekorUuid = "rekorUuid";
    public const string AttestationDigest = "attestationDigest";
    public const string OfflineKitHash = "offlineKitHash";
    public const string FailureReason = "failureReason";
    public const string KitFilename = "kitFilename";
    public const string TarballDigest = "tarballDigest";
    public const string DsseStatementDigest = "dsseStatementDigest";
    public const string RekorLogIndex = "rekorLogIndex";
    public const string ManifestVersion = "manifestVersion";
    public const string PreviousVersion = "previousVersion";
    public const string WasForceActivated = "wasForceActivated";
    public const string ForceActivateReason = "forceActivateReason";
    public const string QuarantineId = "quarantineId";
    public const string QuarantinePath = "quarantinePath";
    public const string TenantId = "tenantId";
    public const string BundleType = "bundleType";
    public const string AnchorId = "anchorId";
    public const string KeyId = "keyId";
}

/// <summary>
/// Extension methods for structured logging with offline kit context.
/// Scopes are opened before the log write so the structured fields
/// actually attach to the emitted entry.
/// </summary>
public static class OfflineKitLoggerExtensions
{
    public static IDisposable BeginOfflineKitScope(
        this ILogger logger,
        string kitFilename,
        string tenantId,
        string? kitHash = null)
    {
        return logger.BeginScope(new Dictionary<string, object?>
        {
            [OfflineKitLogFields.KitFilename] = kitFilename,
            [OfflineKitLogFields.TenantId] = tenantId,
            [OfflineKitLogFields.OfflineKitHash] = kitHash
        });
    }

    public static void LogImportSuccess(
        this ILogger logger,
        string kitFilename,
        string version,
        string tarballDigest,
        string? dsseDigest,
        string? rekorUuid,
        long? rekorLogIndex)
    {
        // Structured fields for log aggregation; the scope must wrap the write.
        using var _ = logger.BeginScope(new Dictionary<string, object?>
        {
            [OfflineKitLogFields.KitFilename] = kitFilename,
            [OfflineKitLogFields.ManifestVersion] = version,
            [OfflineKitLogFields.TarballDigest] = tarballDigest,
            [OfflineKitLogFields.DsseStatementDigest] = dsseDigest,
            [OfflineKitLogFields.RekorUuid] = rekorUuid,
            [OfflineKitLogFields.RekorLogIndex] = rekorLogIndex
        });

        logger.LogInformation(
            "Offline kit imported successfully: {KitFilename} version={Version}",
            kitFilename, version);
    }

    public static void LogImportFailure(
        this ILogger logger,
        string kitFilename,
        string reasonCode,
        string reasonMessage,
        string? tarballDigest = null,
        string? attestationDigest = null,
        string? rekorUuid = null,
        string? quarantineId = null)
    {
        using var _ = logger.BeginScope(new Dictionary<string, object?>
        {
            [OfflineKitLogFields.KitFilename] = kitFilename,
            [OfflineKitLogFields.FailureReason] = reasonCode,
            [OfflineKitLogFields.TarballDigest] = tarballDigest,
            [OfflineKitLogFields.AttestationDigest] = attestationDigest,
            [OfflineKitLogFields.RekorUuid] = rekorUuid,
            [OfflineKitLogFields.QuarantineId] = quarantineId
        });

        logger.LogError(
            "Offline kit import failed: {KitFilename} reason={ReasonCode}",
            kitFilename, reasonCode);
    }

    public static void LogForceActivation(
        this ILogger logger,
        string kitFilename,
        string incomingVersion,
        string? previousVersion,
        string reason)
    {
        using var _ = logger.BeginScope(new Dictionary<string, object?>
        {
            [OfflineKitLogFields.KitFilename] = kitFilename,
            [OfflineKitLogFields.ManifestVersion] = incomingVersion,
            [OfflineKitLogFields.PreviousVersion] = previousVersion,
            [OfflineKitLogFields.WasForceActivated] = true,
            [OfflineKitLogFields.ForceActivateReason] = reason
        });

        logger.LogWarning(
            "Non-monotonic activation forced: {KitFilename} {IncomingVersion} <- {PreviousVersion}",
            kitFilename, incomingVersion, previousVersion);
    }
}
```
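
Taken together, a call site could look like the following; the kit filename, tenant, and digest values are hypothetical:

```csharp
// Hypothetical call site: the tenant scope wraps the whole import so every
// entry emitted inside it carries kitFilename/tenantId context.
using (logger.BeginOfflineKitScope(kitFilename: "kit-2025.12.tar.zst", tenantId: "tenant-a"))
{
    logger.LogImportSuccess(
        kitFilename: "kit-2025.12.tar.zst",
        version: "2025.12.0",
        tarballDigest: "sha256:abc123",
        dsseDigest: "sha256:def456",
        rekorUuid: null,        // offline mode may carry no online log reference
        rekorLogIndex: null);
}
```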

### Part 3: Error Codes (G13)

Per advisory §11.2:

```csharp
// src/AirGap/StellaOps.AirGap.Importer/OfflineKitReasonCodes.cs
namespace StellaOps.AirGap.Importer;

/// <summary>
/// Machine-readable reason codes for offline kit operations.
/// Per advisory §11.2.
/// </summary>
public static class OfflineKitReasonCodes
{
    // Verification failures
    public const string HashMismatch = "HASH_MISMATCH";
    public const string SigFailCosign = "SIG_FAIL_COSIGN";
    public const string SigFailManifest = "SIG_FAIL_MANIFEST";
    public const string DsseVerifyFail = "DSSE_VERIFY_FAIL";
    public const string RekorVerifyFail = "REKOR_VERIFY_FAIL";

    // Validation failures
    public const string SelftestFail = "SELFTEST_FAIL";
    public const string VersionNonMonotonic = "VERSION_NON_MONOTONIC";
    public const string PolicyDeny = "POLICY_DENY";

    // Structural failures
    public const string ManifestMissing = "MANIFEST_MISSING";
    public const string ManifestInvalid = "MANIFEST_INVALID";
    public const string PayloadMissing = "PAYLOAD_MISSING";
    public const string PayloadCorrupt = "PAYLOAD_CORRUPT";

    // Trust failures
    public const string TrustRootMissing = "TRUST_ROOT_MISSING";
    public const string TrustRootExpired = "TRUST_ROOT_EXPIRED";
    public const string KeyNotTrusted = "KEY_NOT_TRUSTED";

    // Operational
    public const string QuotaExceeded = "QUOTA_EXCEEDED";
    public const string StorageFull = "STORAGE_FULL";

    /// <summary>
    /// Maps a reason code to human-readable remediation text.
    /// </summary>
    public static string GetRemediation(string reasonCode) => reasonCode switch
    {
        HashMismatch => "The bundle file may be corrupted or tampered with. Re-download from a trusted source and verify the SHA-256 checksum.",
        SigFailCosign => "Cosign signature verification failed. Ensure the bundle was signed with a trusted key and has not been modified.",
        SigFailManifest => "Manifest signature is invalid. The manifest may have been modified after signing.",
        DsseVerifyFail => "DSSE envelope signature verification failed. Check trust root configuration and key expiry.",
        RekorVerifyFail => "Rekor transparency log verification failed. Ensure the offline Rekor snapshot is current or check network connectivity.",
        SelftestFail => "Bundle self-test failed. Internal bundle consistency check did not pass.",
        VersionNonMonotonic => "Incoming version is older than or equal to current. Use --force-activate with justification to override.",
        PolicyDeny => "Bundle was rejected by configured policy. Review policy rules and bundle contents.",
        TrustRootMissing => "No trust roots configured. Add trust anchors in scanner.offlineKit.trustAnchors.",
        TrustRootExpired => "Trust root has expired. Rotate trust roots with updated keys.",
        KeyNotTrusted => "Signing key is not in the allowed keyids for the matching trust anchor. Update the trustAnchors configuration.",
        _ => "Unknown error. Check logs for details."
    };

    /// <summary>
    /// Maps a reason code to a CLI exit code.
    /// </summary>
    public static int GetExitCode(string reasonCode) => reasonCode switch
    {
        HashMismatch => 2,
        SigFailCosign or SigFailManifest => 3,
        DsseVerifyFail => 5,
        RekorVerifyFail => 6,
        VersionNonMonotonic => 8,
        PolicyDeny => 9,
        SelftestFail => 10,
        _ => 7 // Generic import failure
    };
}
```

#### Extend CliErrorCodes

```csharp
// Add to: src/Cli/StellaOps.Cli/Output/CliError.cs

public static class CliErrorCodes
{
    // ... existing codes ...

    // CLI-AIRGAP-341-001: Offline kit error codes
    public const string OfflineKitHashMismatch = "ERR_AIRGAP_HASH_MISMATCH";
    public const string OfflineKitSigFailCosign = "ERR_AIRGAP_SIG_FAIL_COSIGN";
    public const string OfflineKitSigFailManifest = "ERR_AIRGAP_SIG_FAIL_MANIFEST";
    public const string OfflineKitDsseVerifyFail = "ERR_AIRGAP_DSSE_VERIFY_FAIL";
    public const string OfflineKitRekorVerifyFail = "ERR_AIRGAP_REKOR_VERIFY_FAIL";
    public const string OfflineKitVersionNonMonotonic = "ERR_AIRGAP_VERSION_NON_MONOTONIC";
    public const string OfflineKitPolicyDeny = "ERR_AIRGAP_POLICY_DENY";
    public const string OfflineKitSelftestFail = "ERR_AIRGAP_SELFTEST_FAIL";
    public const string OfflineKitTrustRootMissing = "ERR_AIRGAP_TRUST_ROOT_MISSING";
    public const string OfflineKitQuarantined = "ERR_AIRGAP_QUARANTINED";
}
```
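
T15 integrates these codes with ProblemDetails parsing (accepting both `reason_code` and `reasonCode`, per the tracker note). The helper below is an illustrative sketch of that mapping, not the actual CLI transport code:

```csharp
// Hypothetical helper: extract the reason code from a ProblemDetails body
// and map it to a CLI exit code via OfflineKitReasonCodes.
using System.Text.Json;

static int ExitCodeFromProblemDetails(string problemJson)
{
    using var doc = JsonDocument.Parse(problemJson);
    var root = doc.RootElement;

    // Servers may emit either snake_case or camelCase for the extension member.
    string? reasonCode = null;
    if (root.TryGetProperty("reason_code", out var snake)) reasonCode = snake.GetString();
    else if (root.TryGetProperty("reasonCode", out var camel)) reasonCode = camel.GetString();

    return reasonCode is null ? 7 : OfflineKitReasonCodes.GetExitCode(reasonCode);
}

// ExitCodeFromProblemDetails("{\"reason_code\":\"DSSE_VERIFY_FAIL\"}") == 5
```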

### Part 4: Audit Schema (G14)

Per advisory §13:

```sql
-- src/Authority/__Libraries/StellaOps.Authority.Storage.Postgres/Migrations/003_offline_kit_audit.sql

-- Extended offline kit audit table per advisory §13.2
CREATE TABLE IF NOT EXISTS authority.offline_kit_audit (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type TEXT NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
    actor TEXT NOT NULL,
    tenant_id TEXT NOT NULL,

    -- Bundle identification
    kit_filename TEXT NOT NULL,
    kit_id TEXT,
    kit_version TEXT,

    -- Cryptographic verification results
    tarball_digest TEXT,          -- sha256:...
    dsse_statement_digest TEXT,   -- sha256:...
    rekor_uuid TEXT,
    rekor_log_index BIGINT,

    -- Versioning
    previous_kit_version TEXT,
    new_kit_version TEXT,

    -- Force activation tracking
    was_force_activated BOOLEAN NOT NULL DEFAULT FALSE,
    force_activate_reason TEXT,

    -- Quarantine (if applicable)
    quarantine_id TEXT,
    quarantine_path TEXT,

    -- Outcome
    result TEXT NOT NULL,         -- success, failed, quarantined
    reason_code TEXT,             -- HASH_MISMATCH, etc.
    reason_message TEXT,

    -- Extended details (JSON)
    details JSONB NOT NULL DEFAULT '{}'::jsonb,

    -- Constraints
    CONSTRAINT chk_event_type CHECK (event_type IN (
        'OFFLINE_KIT_IMPORT_STARTED',
        'OFFLINE_KIT_IMPORT_COMPLETED',
        'OFFLINE_KIT_IMPORT_FAILED',
        'OFFLINE_KIT_ACTIVATED',
        'OFFLINE_KIT_QUARANTINED',
        'OFFLINE_KIT_FORCE_ACTIVATED',
        'OFFLINE_KIT_VERIFICATION_PASSED',
        'OFFLINE_KIT_VERIFICATION_FAILED'
    )),
    CONSTRAINT chk_result CHECK (result IN ('success', 'failed', 'quarantined', 'in_progress'))
);

-- Indexes for common queries
CREATE INDEX idx_offline_kit_audit_ts
    ON authority.offline_kit_audit(timestamp DESC);

CREATE INDEX idx_offline_kit_audit_tenant
    ON authority.offline_kit_audit(tenant_id, timestamp DESC);

CREATE INDEX idx_offline_kit_audit_type
    ON authority.offline_kit_audit(event_type, timestamp DESC);

CREATE INDEX idx_offline_kit_audit_result
    ON authority.offline_kit_audit(result, timestamp DESC)
    WHERE result = 'failed';

CREATE INDEX idx_offline_kit_audit_rekor
    ON authority.offline_kit_audit(rekor_uuid)
    WHERE rekor_uuid IS NOT NULL;

-- Comment for documentation
COMMENT ON TABLE authority.offline_kit_audit IS
    'Audit trail for offline kit import operations. Per advisory §13.2.';
```

#### Repository Interface

```csharp
// src/Authority/__Libraries/StellaOps.Authority.Core/Audit/IOfflineKitAuditRepository.cs
namespace StellaOps.Authority.Core.Audit;

public interface IOfflineKitAuditRepository
{
    Task<OfflineKitAuditEntry> RecordAsync(
        OfflineKitAuditRecord record,
        CancellationToken cancellationToken = default);

    Task<IReadOnlyList<OfflineKitAuditEntry>> QueryAsync(
        OfflineKitAuditQuery query,
        CancellationToken cancellationToken = default);

    Task<OfflineKitAuditEntry?> GetByEventIdAsync(
        Guid eventId,
        CancellationToken cancellationToken = default);
}

public sealed record OfflineKitAuditRecord(
    string EventType,
    string Actor,
    string TenantId,
    string KitFilename,
    string? KitId,
    string? KitVersion,
    string? TarballDigest,
    string? DsseStatementDigest,
    string? RekorUuid,
    long? RekorLogIndex,
    string? PreviousKitVersion,
    string? NewKitVersion,
    bool WasForceActivated,
    string? ForceActivateReason,
    string? QuarantineId,
    string? QuarantinePath,
    string Result,
    string? ReasonCode,
    string? ReasonMessage,
    IReadOnlyDictionary<string, object>? Details = null);

public sealed record OfflineKitAuditEntry(
    Guid EventId,
    string EventType,
    DateTimeOffset Timestamp,
    string Actor,
    string TenantId,
    string KitFilename,
    string? KitId,
    string? KitVersion,
    string? TarballDigest,
    string? DsseStatementDigest,
    string? RekorUuid,
    long? RekorLogIndex,
    string? PreviousKitVersion,
    string? NewKitVersion,
    bool WasForceActivated,
    string? ForceActivateReason,
    string? QuarantineId,
    string? QuarantinePath,
    string Result,
    string? ReasonCode,
    string? ReasonMessage,
    IReadOnlyDictionary<string, object>? Details);

public sealed record OfflineKitAuditQuery(
    string? TenantId = null,
    string? EventType = null,
    string? Result = null,
    DateTimeOffset? Since = null,
    DateTimeOffset? Until = null,
    string? KitFilename = null,
    string? RekorUuid = null,
    int Limit = 100,
    int Offset = 0);
```
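
For example, pulling the most recent failed imports for one tenant; how the `auditRepository` instance is obtained is assumed:

```csharp
// Sketch: the last 20 failed imports for a tenant over the past 7 days.
var failures = await auditRepository.QueryAsync(
    new OfflineKitAuditQuery(
        TenantId: "tenant-a",
        Result: "failed",
        Since: DateTimeOffset.UtcNow.AddDays(-7),
        Limit: 20),
    cancellationToken);

foreach (var entry in failures)
{
    Console.WriteLine($"{entry.Timestamp:o} {entry.KitFilename} {entry.ReasonCode}");
}
```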

#### Audit Event Emitter

```csharp
// src/AirGap/StellaOps.AirGap.Importer/Audit/OfflineKitAuditEmitter.cs
using Microsoft.Extensions.Logging;
using StellaOps.Authority.Core.Audit;

namespace StellaOps.AirGap.Importer.Audit;

public sealed class OfflineKitAuditEmitter : IOfflineKitAuditEmitter
{
    private readonly IOfflineKitAuditRepository _repository;
    private readonly ILogger<OfflineKitAuditEmitter> _logger;
    private readonly TimeProvider _timeProvider;

    public OfflineKitAuditEmitter(
        IOfflineKitAuditRepository repository,
        ILogger<OfflineKitAuditEmitter> logger,
        TimeProvider timeProvider)
    {
        _repository = repository;
        _logger = logger;
        _timeProvider = timeProvider;
    }

    public async Task EmitImportStartedAsync(
        OfflineKitImportContext context,
        CancellationToken cancellationToken = default)
    {
        await RecordAsync(
            eventType: "OFFLINE_KIT_IMPORT_STARTED",
            context: context,
            result: "in_progress",
            cancellationToken: cancellationToken);
    }

    public async Task EmitImportCompletedAsync(
        OfflineKitImportContext context,
        OfflineKitImportResult result,
        CancellationToken cancellationToken = default)
    {
        await RecordAsync(
            eventType: result.Success
                ? "OFFLINE_KIT_IMPORT_COMPLETED"
                : "OFFLINE_KIT_IMPORT_FAILED",
            context: context,
            result: result.Success ? "success" : "failed",
            reasonCode: result.ReasonCode,
            reasonMessage: result.ReasonMessage,
            rekorUuid: result.RekorUuid,
            rekorLogIndex: result.RekorLogIndex,
            cancellationToken: cancellationToken);
    }

    public async Task EmitQuarantinedAsync(
        OfflineKitImportContext context,
        QuarantineResult quarantine,
        string reasonCode,
        string reasonMessage,
        CancellationToken cancellationToken = default)
    {
        await RecordAsync(
            eventType: "OFFLINE_KIT_QUARANTINED",
            context: context,
            result: "quarantined",
            reasonCode: reasonCode,
            reasonMessage: reasonMessage,
            quarantineId: quarantine.QuarantineId,
            quarantinePath: quarantine.QuarantinePath,
            cancellationToken: cancellationToken);
    }

    public async Task EmitForceActivatedAsync(
        OfflineKitImportContext context,
        string previousVersion,
        string reason,
        CancellationToken cancellationToken = default)
    {
        await RecordAsync(
            eventType: "OFFLINE_KIT_FORCE_ACTIVATED",
            context: context,
            result: "success",
            wasForceActivated: true,
            forceActivateReason: reason,
            previousVersion: previousVersion,
            cancellationToken: cancellationToken);
    }

    private async Task RecordAsync(
        string eventType,
        OfflineKitImportContext context,
        string result,
        string? reasonCode = null,
        string? reasonMessage = null,
        string? rekorUuid = null,
        long? rekorLogIndex = null,
        string? quarantineId = null,
        string? quarantinePath = null,
        bool wasForceActivated = false,
        string? forceActivateReason = null,
        string? previousVersion = null,
        CancellationToken cancellationToken = default)
    {
        var record = new OfflineKitAuditRecord(
            EventType: eventType,
            Actor: context.Actor ?? "system",
            TenantId: context.TenantId,
            KitFilename: context.KitFilename,
            KitId: context.Manifest?.KitId,
            KitVersion: context.Manifest?.Version,
            TarballDigest: context.TarballDigest,
            DsseStatementDigest: context.DsseStatementDigest,
            RekorUuid: rekorUuid,
            RekorLogIndex: rekorLogIndex,
            PreviousKitVersion: previousVersion ?? context.PreviousVersion,
            NewKitVersion: context.Manifest?.Version,
            WasForceActivated: wasForceActivated,
            ForceActivateReason: forceActivateReason,
            QuarantineId: quarantineId,
            QuarantinePath: quarantinePath,
            Result: result,
            ReasonCode: reasonCode,
            ReasonMessage: reasonMessage);

        try
        {
            await _repository.RecordAsync(record, cancellationToken);
        }
        catch (Exception ex)
        {
            // Audit failures must not break the import flow, but must be logged
            _logger.LogError(ex,
                "Failed to record audit event {EventType} for {KitFilename}",
                eventType, context.KitFilename);
        }
    }
}
```

---

## Grafana Dashboard

```json
{
  "title": "StellaOps Offline Kit Operations",
  "panels": [
    {
      "title": "Import Total by Status",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(offlinekit_import_total[5m])) by (status)",
          "legendFormat": "{{status}}"
        }
      ]
    },
    {
      "title": "Attestation Verification Latency (p95)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(offlinekit_attestation_verify_latency_seconds_bucket[5m])) by (le, attestation_type))",
          "legendFormat": "{{attestation_type}}"
        }
      ]
    },
    {
      "title": "Rekor Success Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(attestor_rekor_success_total[1h])) / (sum(rate(attestor_rekor_success_total[1h])) + sum(rate(attestor_rekor_retry_total[1h])))"
        }
      ]
    },
    {
      "title": "Failed Imports by Reason",
      "type": "piechart",
      "targets": [
        {
          "expr": "sum(offlinekit_import_total{status=~\"failed.*\"}) by (status)"
        }
      ]
    }
  ]
}
```
|
||||
|
||||
---
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
### Metrics (G11)
|
||||
- [ ] `offlinekit_import_total` increments on every import attempt
|
||||
- [ ] Status label correctly reflects outcome (success/failed_*)
|
||||
- [ ] Tenant label is populated for multi-tenant filtering
|
||||
- [ ] `offlinekit_attestation_verify_latency_seconds` histogram has useful buckets
|
||||
- [ ] Rekor metrics track success/retry counts
|
||||
- [ ] Metrics are exposed on `/metrics` endpoint
|
||||
- [ ] Grafana dashboard renders correctly

### Logging (G12)
- [ ] All log entries include tenant context
- [ ] Import success logs include all specified fields
- [ ] Import failure logs include the reason and a remediation path
- [ ] Force activation logs at warning level
- [ ] Quarantine events are logged with path and reason
- [ ] Structured fields are machine-parseable
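
As a sketch of the structured-logging shape (not the shipped log definitions), source-generated `LoggerMessage` methods keep field names stable and machine-parseable; the event ids and message text below are assumptions.

```csharp
using Microsoft.Extensions.Logging;

// Hypothetical log definitions; field names follow the conventions above.
public static partial class OfflineKitLog
{
    [LoggerMessage(EventId = 1001, Level = LogLevel.Information,
        Message = "Offline kit import succeeded: {KitFilename} (tenant {TenantId}, version {KitVersion})")]
    public static partial void ImportSucceeded(
        ILogger logger, string kitFilename, string tenantId, string kitVersion);

    [LoggerMessage(EventId = 1002, Level = LogLevel.Warning,
        Message = "Offline kit force-activated: {KitFilename} by {Actor} ({Reason})")]
    public static partial void ForceActivated(
        ILogger logger, string kitFilename, string actor, string reason);
}
```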

### Error Codes (G13)
- [ ] All reason codes from advisory §11.2 are implemented
- [ ] `GetRemediation()` returns helpful guidance
- [ ] `GetExitCode()` maps to the correct CLI exit codes
- [ ] Codes are used consistently in API ProblemDetails
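
A minimal sketch of the two helpers, assuming switch-based mapping; the specific code strings, remediation text, and exit-code values here are placeholders — the authoritative set lives in advisory §11.2.

```csharp
// Hypothetical mapping helpers; code strings and exit values are placeholders.
public static class OfflineKitReasonCodes
{
    public static string GetRemediation(string reasonCode) => reasonCode switch
    {
        "dsse_verification_failed" => "Re-download the kit or check the trust anchor key ids.",
        "rekor_snapshot_missing" => "Provision the local Rekor snapshot directory before importing.",
        _ => "See the offline kit troubleshooting guide.",
    };

    public static int GetExitCode(string reasonCode) => reasonCode switch
    {
        "dsse_verification_failed" => 2,
        "rekor_snapshot_missing" => 3,
        _ => 1,
    };
}
```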

### Audit (G14)
- [ ] All import events are recorded
- [ ] Schema matches advisory §13.2
- [ ] Force activation is tracked with its reason
- [ ] Quarantine events include a path reference
- [ ] Rekor UUID/logIndex are captured when available
- [ ] Query API supports filtering by tenant, type, and result
- [ ] Audit repository handles failures gracefully
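
The repository surface implied by these criteria might look like the sketch below; `RecordAsync` matches the call in the emitter code earlier, while the `QueryAsync` filter parameters are assumptions drawn from the filtering criterion.

```csharp
// Hypothetical repository contract; QueryAsync filters mirror the checklist.
public interface IOfflineKitAuditRepository
{
    Task RecordAsync(OfflineKitAuditRecord record, CancellationToken cancellationToken);

    Task<IReadOnlyList<OfflineKitAuditRecord>> QueryAsync(
        string tenantId,
        string? eventType = null,
        string? result = null,
        int limit = 100,
        CancellationToken cancellationToken = default);
}
```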

---

## Testing Strategy

1. **Metrics unit tests** with an in-memory collector
2. **Logging tests** with captured structured output
3. **Audit integration tests** with Testcontainers PostgreSQL
4. **End-to-end tests** verifying the full observability chain

---

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-15 | Normalised sprint file to standard template; set `T1` to `DOING` and began implementation. | Agent |
| 2025-12-15 | Implemented Offline Kit metrics + structured logging primitives in AirGap Importer; marked `T7` `BLOCKED` pending an owning host/service for a `/metrics` surface. | Agent |
| 2025-12-15 | Started CLI error/reason code work; expanded sprint working directories for CLI parsing (`Output/`, `Services/`, `Services/Transport/`). | Agent |
| 2025-12-15 | Added Authority Postgres migration + repository/emitter for `authority.offline_kit_audit`; marked `T20` `BLOCKED` pending an owning backend import/activation flow. | Agent |
| 2025-12-15 | Completed `T1`-`T6`, `T8`-`T19`, `T21`-`T24` (metrics/logging/codes/audit, tests, docs, dashboard); left `T7`/`T20` `BLOCKED` pending an owning Offline Kit import host. | Agent |
| 2025-12-15 | Cross-cutting Postgres RLS compatibility: set both `app.tenant_id` and `app.current_tenant` on tenant-scoped connections (shared `StellaOps.Infrastructure.Postgres`). | Agent |
| 2025-12-17 | Unblocked `T7`/`T20` by implementing a Scanner-owned Offline Kit import host; started wiring Prometheus `/metrics` surface and Authority audit emission into import/activation flow. | Agent |
| 2025-12-18 | Completed `T7`/`T20`: Scanner WebService exposes `/metrics` with Offline Kit metrics and OfflineKit import emits audit events via `IOfflineKitAuditEmitter` (covered by deterministic integration tests). | Agent |

## Decisions & Risks

- **Prometheus exporter choice (Importer):** Scanner WebService is the owning host for Offline Kit import and exposes `/metrics` with Offline Kit counters/histograms (Prometheus text format).
- **Field naming:** Keep metric labels and log fields stable and consistent (`tenant_id`, `status`, `reason_code`) to preserve dashboards and alert rules.
- **Authority schema alignment:** `docs/db/SPECIFICATION.md` must stay aligned with `authority.offline_kit_audit` (table + indexes + RLS posture) to avoid drift.
- **Integration test dependency:** Authority Postgres integration tests use Testcontainers and require Docker in developer/CI environments.
- **Audit wiring:** Scanner OfflineKit import calls `IOfflineKitAuditEmitter` best-effort; Authority storage tests cover tenant/RLS behavior.

## Next Checkpoints

- None (sprint complete).

---

*Diff suppressed (too large): `docs/implplan/archived/SPRINT_0341_0001_0001_ttfs_enhancements.md` (1913 lines).*

---

# Router Rate Limiting - Master Sprint Tracker

**IMPLID:** 1200 (Router infrastructure)
**Feature:** Centralized rate limiting for Stella Router as a standalone product
**Advisory Source:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
**Owner:** Router Team
**Status:** DONE (Sprints 1–6 closed; Sprint 4 closed N/A)
**Priority:** HIGH - Core feature for the Router product
**Target Completion:** 6 weeks (4 weeks implementation + 2 weeks rollout)

---

## Executive Summary

Implement centralized, multi-dimensional rate limiting in Stella Router to:
1. Eliminate per-service rate limiting duplication (architectural cleanup)
2. Enable Router as a standalone product with intelligent admission control
3. Provide sophisticated protection (dual-scope, dual-window, rule stacking)
4. Support complex configuration matrices (instance, environment, microservice, route)

**Key Principle:** Rate limiting is a router responsibility. Microservices should NOT implement bare HTTP rate limiting.

---

## Architecture Overview

### Dual-Scope Design

**for_instance (In-Memory):**
- Protects an individual router instance from local overload
- Zero latency (sub-millisecond)
- Sliding window counters
- No network dependencies

**for_environment (Valkey-Backed):**
- Protects the entire environment across all router instances
- Distributed coordination via Valkey (Redis fork)
- Fixed-window counters with atomic Lua operations
- Circuit breaker for resilience
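
As a rough sketch of the in-memory scope, a timestamp-queue sliding-window counter is shown below; it is a simplified stand-in for the `SlidingWindowCounter` used in the sprint code, not the shipped implementation.

```csharp
// Simplified sliding-window counter: counts hits in the trailing window.
public sealed class SlidingWindowCounter
{
    private readonly int _windowSeconds;
    private readonly Queue<long> _hits = new(); // unix seconds, one entry per hit
    private readonly object _gate = new();

    public SlidingWindowCounter(int windowSeconds) => _windowSeconds = windowSeconds;

    public long Increment()
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        lock (_gate)
        {
            _hits.Enqueue(now);
            // Drop hits that fell out of the trailing window.
            while (_hits.Count > 0 && now - _hits.Peek() >= _windowSeconds)
            {
                _hits.Dequeue();
            }
            return _hits.Count;
        }
    }

    public long GetCount()
    {
        lock (_gate) { return _hits.Count; }
    }
}
```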

### Multi-Dimensional Configuration

```
Global Defaults
└─> Per-Environment
    └─> Per-Microservice
        └─> Per-Route (most specific wins)
```

### Rule Stacking

Each target can have multiple rules (AND logic):
- Example: "10 req/sec AND 3000 req/hour AND 50k req/day"
- All rules must pass
- The most restrictive Retry-After is returned
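
In sketch form, AND-logic evaluation with the most-restrictive Retry-After reduces to the following; the tuple shape is illustrative (the real rule and decision types appear in Sprint 3).

```csharp
// Illustrative AND evaluation: deny if any rule is violated; when denied,
// return the longest retry-after among the violated rules.
static (bool Allowed, int RetryAfterSeconds) Evaluate(
    IReadOnlyList<(int Limit, long Count, int RetryAfterSeconds)> rules)
{
    var violated = rules.Where(r => r.Count > r.Limit).ToList();
    return violated.Count == 0
        ? (true, 0)
        : (false, violated.Max(r => r.RetryAfterSeconds));
}
```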

---

## Sprint Breakdown

| Sprint | IMPLID | Duration | Focus | Status |
|--------|--------|----------|-------|--------|
| **Sprint 1** | 1200_001_001 | 5-7 days | Core router rate limiting | DONE |
| **Sprint 2** | 1200_001_002 | 2-3 days | Per-route granularity | DONE |
| **Sprint 3** | 1200_001_003 | 2-3 days | Rule stacking (multiple windows) | DONE |
| **Sprint 4** | 1200_001_004 | 3-4 days | Service migration (AdaptiveRateLimiter) | DONE (N/A) |
| **Sprint 5** | 1200_001_005 | 3-5 days | Comprehensive testing | DONE |
| **Sprint 6** | 1200_001_006 | 2 days | Documentation & rollout prep | DONE |

**Total Implementation:** 17-24 days

**Rollout (Post-Implementation):**
- Week 1: Shadow mode (metrics only, no enforcement)
- Week 2: Soft limits (2x traffic peaks)
- Week 3: Production limits
- Week 4+: Service migration complete

---

## Dependencies

### External
- Valkey/Redis cluster (≥7.0) for distributed state
- OpenTelemetry SDK for metrics
- StackExchange.Redis NuGet package

### Internal
- `StellaOps.Router.Gateway` library (existing)
- Routing metadata (microservice + route identification)
- Configuration system (YAML binding)

### Migration Targets
- `AdaptiveRateLimiter` in Orchestrator (extract TokenBucket, HourlyCounter configs)

---

## Key Design Decisions

### 1. Status Codes
- ✅ **429 Too Many Requests** for rate limiting (NOT 503, NOT 202)
- ✅ **Retry-After** header (seconds or HTTP-date)
- ✅ JSON response body with details

### 2. Terminology
- ✅ **Valkey** (not Redis) - consistent with StellaOps naming
- ✅ Snake_case in YAML configs
- ✅ PascalCase in C# code

### 3. Configuration Philosophy
- Support complex matrices (required for the Router product)
- Sensible defaults at every level
- Clear inheritance semantics
- Fail-fast validation on startup

### 4. Performance Targets
- Instance check: <1ms P99 latency
- Environment check: <10ms P99 latency (including Valkey RTT)
- Router throughput: 100k req/sec with rate limiting enabled
- Valkey load: <1000 ops/sec per router instance

### 5. Resilience
- Circuit breaker for Valkey failures (fail-open)
- Activation gate to skip Valkey under low traffic
- Instance limits enforced even if Valkey is down
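
A compressed sketch of the fail-open posture, assuming a simple breaker with `AllowRequest`/`RecordSuccess`/`RecordFailure` methods (those names and `_breaker` are placeholders); the point is that a Valkey outage degrades to instance-only limiting rather than rejecting traffic.

```csharp
using StackExchange.Redis;

// Illustrative fail-open wrapper; _breaker members are placeholder names.
private async Task<RateLimitDecision> CheckWithFailOpenAsync(
    string key, EffectiveLimits limits, RateLimitDecision instanceDecision, CancellationToken ct)
{
    try
    {
        if (_breaker.AllowRequest())
        {
            var envDecision = await _environmentLimiter!.TryAcquireAsync(key, limits, ct);
            _breaker.RecordSuccess();
            if (envDecision is { } decision)
            {
                return decision; // environment verdict wins when available
            }
        }
    }
    catch (RedisException)
    {
        _breaker.RecordFailure(); // Valkey outage: fail open to instance-only limiting
    }

    return instanceDecision; // instance limits are still enforced
}
```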

---

## Success Criteria

### Functional
- [ ] Router enforces per-instance limits (in-memory)
- [ ] Router enforces per-environment limits (Valkey-backed)
- [ ] Per-microservice configuration works
- [ ] Per-route configuration works
- [ ] Multiple rules per target work (rule stacking)
- [ ] 429 + Retry-After returned correctly
- [ ] Circuit breaker handles Valkey failures gracefully
- [ ] Activation gate reduces Valkey load by 80%+ under low traffic

### Performance
- [ ] Instance check <1ms P99
- [ ] Environment check <10ms P99
- [ ] 100k req/sec throughput maintained
- [ ] Valkey load <1000 ops/sec per instance

### Operational
- [ ] Metrics exported (Prometheus)
- [ ] Dashboards created (Grafana)
- [ ] Alerts configured
- [ ] Documentation complete
- [ ] Migration from service-level rate limiters complete

### Quality
- [ ] Unit test coverage >90%
- [ ] Integration tests for all config combinations
- [ ] Load tests (k6 scenarios A-F)
- [ ] Failure injection tests

---

## Delivery Tracker

### Sprint 1: Core Router Rate Limiting
- [x] Rate limit abstractions
- [x] Valkey backend implementation (Lua, fixed-window)
- [x] Middleware integration (router pipeline)
- [x] Metrics and observability
- [x] Configuration schema (rules + legacy compatibility)

### Sprint 2: Per-Route Granularity
- [x] Route pattern matching (exact/prefix/regex, specificity rules)
- [x] Configuration extension (`routes` under microservices)
- [x] Inheritance resolution (environment → microservice → route)
- [x] Route-level testing (unit tests)

### Sprint 3: Rule Stacking
- [x] Multi-rule configuration (`rules[]` with legacy compatibility)
- [x] AND logic evaluation (instance + environment)
- [x] Lua script enhancement (multi-rule evaluation)
- [x] Retry-After calculation (most restrictive)

### Sprint 4: Service Migration
- [x] Closed as N/A (no Orchestrator ingress wiring found); see `docs/implplan/SPRINT_1200_001_004_router_rate_limiting_service_migration.md`

### Sprint 5: Comprehensive Testing
- [x] Unit test suite (core + routes + rules)
- [x] Integration test suite (Valkey/Testcontainers) - see `docs/implplan/SPRINT_1200_001_005_router_rate_limiting_tests.md`
- [x] Load tests (k6) - see `docs/implplan/SPRINT_1200_001_005_router_rate_limiting_tests.md`
- [x] Configuration matrix tests - see `docs/implplan/SPRINT_1200_001_005_router_rate_limiting_tests.md`

### Sprint 6: Documentation
- [x] Architecture docs - see `docs/implplan/SPRINT_1200_001_006_router_rate_limiting_docs.md`
- [x] Configuration guide - see `docs/implplan/SPRINT_1200_001_006_router_rate_limiting_docs.md`
- [x] Operational runbook - see `docs/implplan/SPRINT_1200_001_006_router_rate_limiting_docs.md`
- [x] Migration guide - see `docs/implplan/SPRINT_1200_001_006_router_rate_limiting_docs.md`

---

## Risks & Mitigations

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Valkey becomes critical path | HIGH | MEDIUM | Circuit breaker + fail-open + activation gate |
| Configuration errors in production | HIGH | MEDIUM | Schema validation + shadow mode rollout |
| Performance degradation | MEDIUM | LOW | Benchmarking + activation gate + in-memory fast path |
| Double-limiting during migration | MEDIUM | MEDIUM | Clear docs + phased migration + architecture review |
| Lua script bugs | HIGH | LOW | Extensive testing + reference validation + circuit breaker |

---

## Related Documentation

- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
- **Implementation:** `src/__Libraries/StellaOps.Router.Gateway/RateLimit/`
- **Tests:** `tests/StellaOps.Router.Gateway.Tests/`
- **Implementation Guides:** `docs/implplan/SPRINT_1200_001_00X_*.md` (see below)
- **Sprints:** `docs/implplan/SPRINT_1200_001_004_router_rate_limiting_service_migration.md`, `docs/implplan/SPRINT_1200_001_005_router_rate_limiting_tests.md`, `docs/implplan/SPRINT_1200_001_006_router_rate_limiting_docs.md`
- **Docs:** `docs/router/rate-limiting-routes.md`

---

## Contact & Escalation

**Sprint Owner:** Router Team Lead
**Technical Reviewer:** Architecture Guild
**Blocked Issues:** Escalate to Platform Engineering
**Questions:** #stella-router-dev Slack channel

---

## Status Log

| Date | Status | Notes |
|------|--------|-------|
| 2025-12-17 | DOING | Sprints 1–3 DONE; Sprint 4 closed N/A; Sprint 5 tests started; Sprint 6 docs pending. |
| 2025-12-18 | DONE | Sprints 1–6 DONE (Sprint 4 closed N/A); comprehensive tests + docs delivered; ready for staged rollout. |

---

## Next Steps

1. Execute the rollout plan (shadow mode → soft limits → production limits) and validate dashboards/alerts per environment.
2. Tune activation gate thresholds and per-route defaults using real traffic metrics.
3. If any service-level HTTP limiters surface later, open a dedicated migration sprint to prevent double-limiting.

# Sprint 2: Per-Route Granularity

**IMPLID:** 1200_001_002
**Sprint Duration:** 2-3 days
**Priority:** HIGH
**Dependencies:** Sprint 1 (Core implementation)
**Status:** DONE
**Blocks:** Sprint 5 (additional integration/load testing)
**Evidence:** `src/__Libraries/StellaOps.Router.Gateway/RateLimit/`, `docs/router/rate-limiting-routes.md`, `tests/StellaOps.Router.Gateway.Tests/`

---

## Sprint Goal

Extend the rate limiting configuration to support per-route limits with pattern matching and inheritance resolution.

**Acceptance Criteria:**
- Routes can have specific rate limits
- Route patterns support exact match, prefix, and regex
- Inheritance works: route → microservice → environment → global
- The most specific route wins
- Configuration is validated on startup

---

## Working Directory

`src/__Libraries/StellaOps.Router.Gateway/RateLimit/`

---

## Task Breakdown

### Task 2.1: Extend Configuration Models (0.5 days)

**Goal:** Add a routes section to the configuration schema.

**Files to Modify:**
1. `RateLimit/Models/MicroserviceLimitsConfig.cs` - Add `Routes` property
2. `RateLimit/Models/RouteLimitsConfig.cs` - NEW: Route-specific limits

**Implementation:**

```csharp
// RouteLimitsConfig.cs (NEW)
using System.Text.RegularExpressions;
using Microsoft.Extensions.Configuration;

namespace StellaOps.Router.Gateway.RateLimit.Models;

public sealed class RouteLimitsConfig
{
    /// <summary>
    /// Route pattern: exact ("/api/scans"), prefix ("/api/scans/*"), or regex ("^/api/scans/[a-f0-9-]+$")
    /// </summary>
    [ConfigurationKeyName("pattern")]
    public string Pattern { get; set; } = "";

    [ConfigurationKeyName("match_type")]
    public RouteMatchType MatchType { get; set; } = RouteMatchType.Exact;

    [ConfigurationKeyName("per_seconds")]
    public int? PerSeconds { get; set; }

    [ConfigurationKeyName("max_requests")]
    public int? MaxRequests { get; set; }

    [ConfigurationKeyName("allow_burst_for_seconds")]
    public int? AllowBurstForSeconds { get; set; }

    [ConfigurationKeyName("allow_max_burst_requests")]
    public int? AllowMaxBurstRequests { get; set; }

    public void Validate(string path)
    {
        if (string.IsNullOrWhiteSpace(Pattern))
            throw new ArgumentException($"{path}: pattern is required");

        // Both long-window settings must be set or both omitted.
        if (PerSeconds.HasValue != MaxRequests.HasValue)
            throw new ArgumentException($"{path}: per_seconds and max_requests must both be set or both omitted");

        // Both burst settings must be set or both omitted.
        if (AllowBurstForSeconds.HasValue != AllowMaxBurstRequests.HasValue)
            throw new ArgumentException($"{path}: Burst settings must both be set or both omitted");

        if (PerSeconds < 0 || MaxRequests < 0)
            throw new ArgumentException($"{path}: Values must be >= 0");

        // Validate the regex pattern if applicable.
        if (MatchType == RouteMatchType.Regex)
        {
            try
            {
                _ = new Regex(Pattern, RegexOptions.Compiled);
            }
            catch (Exception ex)
            {
                throw new ArgumentException($"{path}: Invalid regex pattern: {ex.Message}");
            }
        }
    }
}

public enum RouteMatchType
{
    Exact,  // Exact path match: "/api/scans"
    Prefix, // Prefix match: "/api/scans/*"
    Regex   // Regex match: "^/api/scans/[a-f0-9-]+$"
}

// Update MicroserviceLimitsConfig.cs to add:
public sealed class MicroserviceLimitsConfig
{
    // ... existing properties ...

    [ConfigurationKeyName("routes")]
    public Dictionary<string, RouteLimitsConfig> Routes { get; set; }
        = new(StringComparer.OrdinalIgnoreCase);

    public void Validate(string path)
    {
        // ... existing validation ...

        // Validate routes.
        foreach (var (name, config) in Routes)
        {
            if (string.IsNullOrWhiteSpace(name))
                throw new ArgumentException($"{path}.routes: Empty route name");

            config.Validate($"{path}.routes.{name}");
        }
    }
}
```

**Configuration Example:**

```yaml
for_environment:
  microservices:
    scanner:
      per_seconds: 60
      max_requests: 600
      routes:
        scan_submit:
          pattern: "/api/scans"
          match_type: exact
          per_seconds: 10
          max_requests: 50
        scan_status:
          pattern: "/api/scans/*"
          match_type: prefix
          per_seconds: 1
          max_requests: 100
        scan_by_id:
          pattern: "^/api/scans/[a-f0-9-]+$"
          match_type: regex
          per_seconds: 1
          max_requests: 50
```

**Testing:**
- Unit tests for route configuration loading
- Validation of route patterns
- Regex pattern validation

**Deliverable:** Extended configuration models with routes.

---

### Task 2.2: Route Matching Implementation (1 day)

**Goal:** Implement the route pattern matching logic.

**Files to Create:**
1. `RateLimit/RouteMatching/RouteMatcher.cs` - Main matcher
2. `RateLimit/RouteMatching/IRouteMatcher.cs` - Matcher interface
3. `RateLimit/RouteMatching/ExactRouteMatcher.cs` - Exact match
4. `RateLimit/RouteMatching/PrefixRouteMatcher.cs` - Prefix match
5. `RateLimit/RouteMatching/RegexRouteMatcher.cs` - Regex match

**Implementation:**

```csharp
using System.Text.RegularExpressions;

// IRouteMatcher.cs
public interface IRouteMatcher
{
    bool Matches(string requestPath);
    int Specificity { get; } // Higher = more specific
}

// ExactRouteMatcher.cs
public sealed class ExactRouteMatcher : IRouteMatcher
{
    private readonly string _pattern;

    public ExactRouteMatcher(string pattern)
    {
        _pattern = pattern;
    }

    public bool Matches(string requestPath)
    {
        return string.Equals(requestPath, _pattern, StringComparison.OrdinalIgnoreCase);
    }

    public int Specificity => 1000; // Highest
}

// PrefixRouteMatcher.cs
public sealed class PrefixRouteMatcher : IRouteMatcher
{
    private readonly string _prefix;

    public PrefixRouteMatcher(string pattern)
    {
        // Remove the trailing /* if present.
        _prefix = pattern.EndsWith("/*")
            ? pattern[..^2]
            : pattern;
    }

    public bool Matches(string requestPath)
    {
        return requestPath.StartsWith(_prefix, StringComparison.OrdinalIgnoreCase);
    }

    public int Specificity => 100 + _prefix.Length; // Longer prefix = more specific
}

// RegexRouteMatcher.cs
public sealed class RegexRouteMatcher : IRouteMatcher
{
    private readonly Regex _regex;

    public RegexRouteMatcher(string pattern)
    {
        _regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    public bool Matches(string requestPath)
    {
        return _regex.IsMatch(requestPath);
    }

    public int Specificity => 10; // Lowest (most flexible)
}

// RouteMatcher.cs (Factory + Resolution)
public sealed class RouteMatcher
{
    private readonly List<(IRouteMatcher matcher, RouteLimitsConfig config, string routeName)> _routes = new();

    public void AddRoute(string routeName, RouteLimitsConfig config)
    {
        IRouteMatcher matcher = config.MatchType switch
        {
            RouteMatchType.Exact => new ExactRouteMatcher(config.Pattern),
            RouteMatchType.Prefix => new PrefixRouteMatcher(config.Pattern),
            RouteMatchType.Regex => new RegexRouteMatcher(config.Pattern),
            _ => throw new ArgumentException($"Unknown match type: {config.MatchType}")
        };

        _routes.Add((matcher, config, routeName));
    }

    public (string? routeName, RouteLimitsConfig? config) FindBestMatch(string requestPath)
    {
        var matches = _routes
            .Where(r => r.matcher.Matches(requestPath))
            .OrderByDescending(r => r.matcher.Specificity)
            .ToList();

        if (matches.Count == 0)
            return (null, null);

        var best = matches[0];
        return (best.routeName, best.config);
    }
}
```
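
A quick usage sketch of the resolution behaviour, using the patterns from Task 2.1's configuration example (the expected results follow from the specificity ranking):

```csharp
var matcher = new RouteMatcher();
matcher.AddRoute("scan_submit", new RouteLimitsConfig { Pattern = "/api/scans", MatchType = RouteMatchType.Exact });
matcher.AddRoute("scan_status", new RouteLimitsConfig { Pattern = "/api/scans/*", MatchType = RouteMatchType.Prefix });
matcher.AddRoute("scan_by_id", new RouteLimitsConfig { Pattern = "^/api/scans/[a-f0-9-]+$", MatchType = RouteMatchType.Regex });

var (a, _) = matcher.FindBestMatch("/api/scans");         // "scan_submit" (exact beats prefix)
var (b, _) = matcher.FindBestMatch("/api/scans/abc-123"); // "scan_status" (prefix beats regex)
var (c, _) = matcher.FindBestMatch("/healthz");           // (null, null): falls back to microservice limits
```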

**Testing:**
- Unit tests for each matcher type
- Specificity ordering (exact > prefix > regex)
- Case-insensitive matching
- Edge cases (empty path, special chars)

**Deliverable:** Route matching with specificity resolution.

---

### Task 2.3: Inheritance Resolution (0.5 days)

**Goal:** Resolve effective limits from global → env → microservice → route.

**Files to Create:**
1. `RateLimit/LimitInheritanceResolver.cs` - Inheritance logic

**Implementation:**

```csharp
// LimitInheritanceResolver.cs
public sealed class LimitInheritanceResolver
{
    private readonly RateLimitConfig _config;

    public LimitInheritanceResolver(RateLimitConfig config)
    {
        _config = config;
    }

    public EffectiveLimits ResolveForRoute(string microservice, string? routeName)
    {
        // Start with global defaults.
        var longWindow = 0;
        var longMax = 0;
        var burstWindow = 0;
        var burstMax = 0;

        // Layer 1: Global environment defaults
        if (_config.ForEnvironment != null)
        {
            longWindow = _config.ForEnvironment.PerSeconds;
            longMax = _config.ForEnvironment.MaxRequests;
            burstWindow = _config.ForEnvironment.AllowBurstForSeconds;
            burstMax = _config.ForEnvironment.AllowMaxBurstRequests;
        }

        // Layer 2: Microservice overrides
        if (_config.ForEnvironment?.Microservices.TryGetValue(microservice, out var msConfig) == true)
        {
            if (msConfig.PerSeconds.HasValue)
            {
                longWindow = msConfig.PerSeconds.Value;
                longMax = msConfig.MaxRequests!.Value;
            }

            if (msConfig.AllowBurstForSeconds.HasValue)
            {
                burstWindow = msConfig.AllowBurstForSeconds.Value;
                burstMax = msConfig.AllowMaxBurstRequests!.Value;
            }

            // Layer 3: Route overrides (most specific)
            if (!string.IsNullOrWhiteSpace(routeName) &&
                msConfig.Routes.TryGetValue(routeName, out var routeConfig))
            {
                if (routeConfig.PerSeconds.HasValue)
                {
                    longWindow = routeConfig.PerSeconds.Value;
                    longMax = routeConfig.MaxRequests!.Value;
                }

                if (routeConfig.AllowBurstForSeconds.HasValue)
                {
                    burstWindow = routeConfig.AllowBurstForSeconds.Value;
                    burstMax = routeConfig.AllowMaxBurstRequests!.Value;
                }
            }
        }

        return EffectiveLimits.FromConfig(longWindow, longMax, burstWindow, burstMax);
    }
}
```

**Testing:**
- Unit tests for inheritance resolution
- All combinations: global only, global + microservice, global + microservice + route
- Verify the most specific level wins

**Deliverable:** Correct limit inheritance.

---

### Task 2.4: Integrate Route Matching into RateLimitService (0.5 days)

**Goal:** Use the route matcher in the rate limit decision.

**Files to Modify:**
1. `RateLimit/RateLimitService.cs` - Add route resolution

**Implementation:**

```csharp
// Update RateLimitService.cs
public sealed class RateLimitService
{
    private readonly RateLimitConfig _config;
    private readonly InstanceRateLimiter _instanceLimiter;
    private readonly EnvironmentRateLimiter? _environmentLimiter;
    private readonly Dictionary<string, RouteMatcher> _routeMatchers; // Per microservice
    private readonly LimitInheritanceResolver _inheritanceResolver;
    private readonly ILogger<RateLimitService> _logger;

    public RateLimitService(
        RateLimitConfig config,
        InstanceRateLimiter instanceLimiter,
        EnvironmentRateLimiter? environmentLimiter,
        ILogger<RateLimitService> logger)
    {
        _config = config;
        _instanceLimiter = instanceLimiter;
        _environmentLimiter = environmentLimiter;
        _logger = logger;
        _inheritanceResolver = new LimitInheritanceResolver(config);

        // Build route matchers per microservice.
        _routeMatchers = new Dictionary<string, RouteMatcher>(StringComparer.OrdinalIgnoreCase);
        if (config.ForEnvironment != null)
        {
            foreach (var (msName, msConfig) in config.ForEnvironment.Microservices)
            {
                if (msConfig.Routes.Count > 0)
                {
                    var matcher = new RouteMatcher();
                    foreach (var (routeName, routeConfig) in msConfig.Routes)
                    {
                        matcher.AddRoute(routeName, routeConfig);
                    }
                    _routeMatchers[msName] = matcher;
                }
            }
        }
    }

    public async Task<RateLimitDecision> CheckLimitAsync(
        string microservice,
        string requestPath,
        CancellationToken cancellationToken)
    {
        // Resolve the route.
        string? routeName = null;
        if (_routeMatchers.TryGetValue(microservice, out var matcher))
        {
            var (matchedRoute, _) = matcher.FindBestMatch(requestPath);
            routeName = matchedRoute;
        }

        // Check instance limits (always).
        var instanceDecision = _instanceLimiter.TryAcquire(microservice);
        if (!instanceDecision.Allowed)
        {
            return instanceDecision;
        }

        // Activation gate check.
        if (_config.ActivationThresholdPer5Min > 0)
        {
            var activationCount = _instanceLimiter.GetActivationCount();
            if (activationCount < _config.ActivationThresholdPer5Min)
            {
                RateLimitMetrics.ValkeyCallSkipped();
                return instanceDecision;
            }
        }

        // Check environment limits.
        if (_environmentLimiter != null)
        {
            var limits = _inheritanceResolver.ResolveForRoute(microservice, routeName);
            if (limits.Enabled)
            {
                var envDecision = await _environmentLimiter.TryAcquireAsync(
                    $"{microservice}:{routeName ?? "default"}", limits, cancellationToken);

                if (envDecision.HasValue)
                {
                    return envDecision.Value;
                }
            }
        }

        return instanceDecision;
    }
}
```

**Update Middleware:**

```csharp
// RateLimitMiddleware.cs - Update InvokeAsync
public async Task InvokeAsync(HttpContext context)
{
    var microservice = context.Items["RoutingTarget"] as string ?? "unknown";
    var requestPath = context.Request.Path.Value ?? "/";

    var decision = await _rateLimitService.CheckLimitAsync(
        microservice, requestPath, context.RequestAborted);

    RateLimitMetrics.RecordDecision(decision);

    if (!decision.Allowed)
    {
        await WriteRateLimitResponse(context, decision);
        return;
    }

    await _next(context);
}
```
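
`WriteRateLimitResponse` is referenced but not shown here; a minimal sketch consistent with the master tracker's decisions (429 status, `Retry-After` header, JSON body) could look like the following — the `RetryAfterSeconds`/`Scope` member names on the decision are assumptions.

```csharp
// Hypothetical response writer: 429 + Retry-After + JSON details.
private static Task WriteRateLimitResponse(HttpContext context, RateLimitDecision decision)
{
    context.Response.StatusCode = StatusCodes.Status429TooManyRequests;
    context.Response.Headers.RetryAfter = decision.RetryAfterSeconds.ToString();

    return context.Response.WriteAsJsonAsync(new
    {
        error = "rate_limited",
        scope = decision.Scope.ToString(),
        retry_after_seconds = decision.RetryAfterSeconds,
    });
}
```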

**Testing:**
- Integration tests with different routes
- Verify route matching works in the middleware
- Verify inheritance resolution

**Deliverable:** Route-aware rate limiting.

---

### Task 2.5: Documentation (1 day)

**Goal:** Document per-route configuration with examples.

**Files to Create:**
1. `docs/router/rate-limiting-routes.md` - Route configuration guide

**Content:**

```markdown
# Per-Route Rate Limiting

## Overview

Per-route rate limiting allows different API endpoints to have different rate limits, even within the same microservice.

## Configuration

Routes are configured under `microservices.<name>.routes`:

\`\`\`yaml
for_environment:
  microservices:
    scanner:
      # Default limits for scanner
      per_seconds: 60
      max_requests: 600

      # Per-route overrides
      routes:
        scan_submit:
          pattern: "/api/scans"
          match_type: exact
          per_seconds: 10
          max_requests: 50
\`\`\`

## Match Types

### Exact Match
Matches the exact path.

\`\`\`yaml
pattern: "/api/scans"
match_type: exact
\`\`\`

Matches: `/api/scans`
Does NOT match: `/api/scans/123`, `/api/scans/`

### Prefix Match
Matches any path starting with the prefix.

\`\`\`yaml
pattern: "/api/scans/*"
match_type: prefix
\`\`\`

Matches: `/api/scans/123`, `/api/scans/status`, `/api/scans/abc/def`

### Regex Match
Matches using regular expressions.

\`\`\`yaml
pattern: "^/api/scans/[a-f0-9-]+$"
match_type: regex
\`\`\`

Matches: `/api/scans/abc-123`, `/api/scans/00000000-0000-0000-0000-000000000000`
Does NOT match: `/api/scans/`, `/api/scans/invalid@chars`

## Specificity Rules

When multiple routes match, the most specific wins:

1. **Exact match** (highest priority)
2. **Prefix match** (longer prefix wins)
3. **Regex match** (lowest priority)

## Inheritance

Limits inherit from parent levels:

\`\`\`
Global Defaults
└─> Microservice Defaults
    └─> Route Overrides (most specific)
\`\`\`

Routes can override:
- Long window limits only
- Burst window limits only
- Both
- Neither (inherits all from the microservice)

## Examples

### Expensive vs Cheap Operations

\`\`\`yaml
scanner:
  per_seconds: 60
  max_requests: 600
  routes:
    scan_submit:
      pattern: "/api/scans"
      match_type: exact
      per_seconds: 10
      max_requests: 50   # Expensive: 50/10sec
    scan_status:
      pattern: "/api/scans/*"
      match_type: prefix
      per_seconds: 1
      max_requests: 100  # Cheap: 100/sec
\`\`\`

### Read vs Write Operations

\`\`\`yaml
policy:
  per_seconds: 60
  max_requests: 300
  routes:
    policy_read:
      pattern: "^/api/v1/policy/[^/]+$"
      match_type: regex
      per_seconds: 1
      max_requests: 50   # Reads (GET item): 50/sec
    policy_write:
      pattern: "/api/v1/policy"
      match_type: exact
      per_seconds: 10
      max_requests: 10   # Writes (POST to collection): 10/10sec
\`\`\`
```

**Testing:**
- Review doc examples
- Verify config snippets

**Deliverable:** Complete route configuration guide.

---

## Acceptance Criteria

- [x] Route configuration models created
- [x] Route matching works (exact, prefix, regex)
- [x] Specificity resolution correct
- [x] Inheritance works (global → microservice → route)
- [x] Integration with RateLimitService complete
- [x] Unit tests pass
- [x] Integration tests pass (covered in Sprint 5)
- [x] Documentation complete

---

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-17 | Marked sprint DONE; implemented route config + matching + inheritance resolution; integrated into RateLimitService; added unit tests and docs. | Automation |

---

## Next Sprint

Sprint 3: Rule Stacking (multiple windows per target)

# Sprint 3: Rule Stacking (Multiple Windows)

**IMPLID:** 1200_001_003
**Sprint Duration:** 2-3 days
**Priority:** HIGH
**Dependencies:** Sprint 1 (Core), Sprint 2 (Routes)
**Status:** DONE
**Blocks:** Sprint 5 (additional integration/load testing)
**Evidence:** `src/__Libraries/StellaOps.Router.Gateway/RateLimit/`, `tests/StellaOps.Router.Gateway.Tests/`

---

## Sprint Goal

Support multiple rate limit rules per target with AND logic (all rules must pass).

**Example:** "10 requests per second AND 3000 requests per hour AND 50,000 requests per day"

**Acceptance Criteria:**
- Configuration supports an array of rules per target
- All rules are evaluated (AND logic)
- The most restrictive Retry-After is returned
- The Valkey Lua script handles multiple windows in a single call
- Works at all levels (global, microservice, route)

---

## Working Directory

`src/__Libraries/StellaOps.Router.Gateway/RateLimit/`

---

## Task Breakdown

### Task 3.1: Extend Configuration for Rule Arrays (0.5 days)

**Goal:** Change the single-window config to an array of rules.

**Files to Modify:**
1. `RateLimit/Models/InstanceLimitsConfig.cs` - Add Rules array
2. `RateLimit/Models/EnvironmentLimitsConfig.cs` - Add Rules array
3. `RateLimit/Models/MicroserviceLimitsConfig.cs` - Add Rules array
4. `RateLimit/Models/RouteLimitsConfig.cs` - Add Rules array

**Files to Create:**
1. `RateLimit/Models/RateLimitRule.cs` - Single rule definition

**Implementation:**

```csharp
// RateLimitRule.cs (NEW)
using Microsoft.Extensions.Configuration;

namespace StellaOps.Router.Gateway.RateLimit.Models;

public sealed class RateLimitRule
{
    [ConfigurationKeyName("per_seconds")]
    public int PerSeconds { get; set; }

    [ConfigurationKeyName("max_requests")]
    public int MaxRequests { get; set; }

    [ConfigurationKeyName("name")]
    public string? Name { get; set; } // Optional: for debugging/metrics

    public void Validate(string path)
    {
        if (PerSeconds <= 0)
            throw new ArgumentException($"{path}: per_seconds must be > 0");

        if (MaxRequests <= 0)
            throw new ArgumentException($"{path}: max_requests must be > 0");
    }
}

// Update InstanceLimitsConfig.cs
public sealed class InstanceLimitsConfig
{
    // DEPRECATED (kept for backward compat; rules takes precedence)
    [ConfigurationKeyName("per_seconds")]
    public int PerSeconds { get; set; }

    [ConfigurationKeyName("max_requests")]
    public int MaxRequests { get; set; }

    [ConfigurationKeyName("allow_burst_for_seconds")]
    public int AllowBurstForSeconds { get; set; }

    [ConfigurationKeyName("allow_max_burst_requests")]
    public int AllowMaxBurstRequests { get; set; }

    // NEW: Array of rules
    [ConfigurationKeyName("rules")]
    public List<RateLimitRule> Rules { get; set; } = new();

    public void Validate(string path)
    {
        // If rules are specified, use those; otherwise fall back to the legacy single-window config.
        if (Rules.Count > 0)
        {
            for (var i = 0; i < Rules.Count; i++)
            {
                Rules[i].Validate($"{path}.rules[{i}]");
            }
        }
        else
        {
            // Legacy validation
            if (PerSeconds < 0 || MaxRequests < 0)
                throw new ArgumentException($"{path}: Window and limit must be >= 0");
        }
    }

    public List<RateLimitRule> GetEffectiveRules()
    {
        if (Rules.Count > 0)
            return Rules;

        // Convert the legacy config to rules.
        var legacy = new List<RateLimitRule>();
        if (PerSeconds > 0 && MaxRequests > 0)
        {
            legacy.Add(new RateLimitRule
            {
                PerSeconds = PerSeconds,
                MaxRequests = MaxRequests,
                Name = "long"
            });
        }
        if (AllowBurstForSeconds > 0 && AllowMaxBurstRequests > 0)
        {
            legacy.Add(new RateLimitRule
            {
                PerSeconds = AllowBurstForSeconds,
                MaxRequests = AllowMaxBurstRequests,
                Name = "burst"
            });
        }
        return legacy;
    }
}

// Similar updates for EnvironmentLimitsConfig, MicroserviceLimitsConfig, RouteLimitsConfig
```

**Configuration Example:**

```yaml
for_environment:
  microservices:
    concelier:
      rules:
        - per_seconds: 1
          max_requests: 10
          name: "per_second"
        - per_seconds: 60
          max_requests: 300
          name: "per_minute"
        - per_seconds: 3600
          max_requests: 3000
          name: "per_hour"
        - per_seconds: 86400
          max_requests: 50000
          name: "per_day"
```

**Testing:**
- Unit tests for rule array loading
- Backward compatibility with the legacy config
- Validation of rule arrays

**Deliverable:** Configuration models support rule arrays.

---

### Task 3.2: Update Instance Limiter for Multiple Rules (1 day)

**Goal:** Evaluate all rules in InstanceRateLimiter.

**Files to Modify:**
1. `RateLimit/InstanceRateLimiter.cs` - Support multiple rules

**Implementation:**

```csharp
// InstanceRateLimiter.cs (UPDATED)
public sealed class InstanceRateLimiter : IDisposable
{
    private readonly List<(RateLimitRule rule, SlidingWindowCounter counter)> _rules;
    private readonly SlidingWindowCounter _activationCounter;

    public InstanceRateLimiter(List<RateLimitRule> rules)
    {
        _rules = rules.Select(r => (r, new SlidingWindowCounter(r.PerSeconds))).ToList();
        _activationCounter = new SlidingWindowCounter(300);
    }

    public RateLimitDecision TryAcquire(string? microservice)
    {
        _activationCounter.Increment();

        if (_rules.Count == 0)
            return RateLimitDecision.Allow(RateLimitScope.Instance, microservice, 0, 0);

        var violations = new List<(RateLimitRule rule, ulong count, int retryAfter)>();

        // Evaluate all rules.
        foreach (var (rule, counter) in _rules)
        {
            var count = (ulong)counter.Increment();
            if (count > (ulong)rule.MaxRequests)
            {
                violations.Add((rule, count, rule.PerSeconds));
            }
        }

        if (violations.Count > 0)
        {
            // The most restrictive retry-after wins (longest wait).
            var maxRetryAfter = violations.Max(v => v.retryAfter);
            var reason = DetermineReason(violations);

            return RateLimitDecision.Deny(
                RateLimitScope.Instance,
                microservice,
                reason,
                maxRetryAfter,
                violations[0].count,
                0);
        }

        return RateLimitDecision.Allow(RateLimitScope.Instance, microservice, 0, 0);
    }

    private static RateLimitReason DetermineReason(List<(RateLimitRule rule, ulong count, int retryAfter)> violations)
    {
        // For multiple rule violations, use the generic reason.
        return violations.Count == 1
            ? RateLimitReason.LongWindowExceeded
            : RateLimitReason.LongAndBurstExceeded;
    }

    public long GetActivationCount() => _activationCounter.GetCount();

    public void Dispose()
    {
        // Counters don't need disposal.
    }
}
```

**Testing:**
- Unit tests for multi-rule evaluation
- Verify all rules are checked (AND logic)
- Verify the most restrictive retry-after is returned
- Single rule vs multiple rules

**Deliverable:** Instance limiter supports rule stacking.

---

### Task 3.3: Enhance Valkey Lua Script for Multiple Windows (1 day)

**Goal:** Modify the Lua script to handle an array of rules in a single call.

**Files to Modify:**
1. `RateLimit/Scripts/rate_limit_check.lua` - Multi-rule support

**Implementation:**

```lua
-- rate_limit_check_multi.lua (UPDATED)
-- KEYS: none
-- ARGV[1]: bucket prefix
-- ARGV[2]: service name (with route suffix if applicable)
-- ARGV[3]: JSON array of rules: [{"window_sec":1,"limit":10,"name":"per_second"}, ...]
-- Returns: {allowed (0/1), violations_json, max_retry_after}

local bucket = ARGV[1]
local svc = ARGV[2]
local rules_json = ARGV[3]

-- Parse rules
local rules = cjson.decode(rules_json)
local now = tonumber(redis.call("TIME")[1])

local violations = {}
local max_retry = 0

-- Evaluate each rule
for i, rule in ipairs(rules) do
    local window_sec = tonumber(rule.window_sec)
    local limit = tonumber(rule.limit)
    local rule_name = rule.name or tostring(i)

    -- Fixed window start
    local window_start = now - (now % window_sec)
    local key = bucket .. ":env:" .. svc .. ":" .. rule_name .. ":" .. window_start

    -- Increment counter
    local count = redis.call("INCR", key)
    if count == 1 then
        redis.call("EXPIRE", key, window_sec + 2)
    end

    -- Check limit
    if count > limit then
        local retry = (window_start + window_sec) - now
        table.insert(violations, {
            rule = rule_name,
            count = count,
            limit = limit,
            retry_after = retry
        })
        if retry > max_retry then
            max_retry = retry
        end
    end
end

-- Result
local allowed = (#violations == 0) and 1 or 0
local violations_json = cjson.encode(violations)

return {allowed, violations_json, max_retry}
```

**Files to Modify:**
2. `RateLimit/ValkeyRateLimitStore.cs` - Update to use the new script

**Implementation:**

```csharp
// ValkeyRateLimitStore.cs (UPDATED)
using System.Text.Json;
using System.Text.Json.Serialization;
using StackExchange.Redis;

public async Task<RateLimitDecision> CheckLimitAsync(
    string serviceKey,
    List<RateLimitRule> rules,
    CancellationToken cancellationToken)
{
    // Build the rules JSON.
    var rulesJson = JsonSerializer.Serialize(rules.Select(r => new
    {
        window_sec = r.PerSeconds,
        limit = r.MaxRequests,
        name = r.Name ?? "rule"
    }));

    var values = new RedisValue[]
    {
        _bucket,
        serviceKey,
        rulesJson
    };

    var result = await _db.ScriptEvaluateAsync(
        _rateLimitScriptSha,
        Array.Empty<RedisKey>(),
        values);

    var array = (RedisResult[])result;
    var allowed = (int)array[0] == 1;
    var violationsJson = (string)array[1];
    var maxRetryAfter = (int)array[2];

    if (allowed)
    {
        return RateLimitDecision.Allow(RateLimitScope.Environment, serviceKey, 0, 0);
    }

    // Parse violations to determine the reason.
    var violations = JsonSerializer.Deserialize<List<RuleViolation>>(violationsJson);
    var reason = violations!.Count == 1
        ? RateLimitReason.LongWindowExceeded
        : RateLimitReason.LongAndBurstExceeded;

    return RateLimitDecision.Deny(
        RateLimitScope.Environment,
        serviceKey,
        reason,
        maxRetryAfter,
        (ulong)violations[0].Count,
        0);
}

private sealed class RuleViolation
{
    [JsonPropertyName("rule")]
    public string Rule { get; set; } = "";

    [JsonPropertyName("count")]
    public int Count { get; set; }

    [JsonPropertyName("limit")]
    public int Limit { get; set; }

    [JsonPropertyName("retry_after")]
    public int RetryAfter { get; set; }
}
```

**Testing:**
- Integration tests with Testcontainers (Valkey)
- Multiple rules in a single Lua call
- Verify atomicity
- Verify the retry-after calculation

**Deliverable:** Valkey backend supports rule stacking.

---

### Task 3.4: Update Inheritance Resolver for Rules (0.5 days)

**Goal:** Resolve the effective rule set across levels (more specific levels replace, not merge).

**Files to Modify:**
1. `RateLimit/LimitInheritanceResolver.cs` - Support rule resolution

**Implementation:**

```csharp
// LimitInheritanceResolver.cs (UPDATED)
public List<RateLimitRule> ResolveRulesForRoute(string microservice, string? routeName)
{
    var rules = new List<RateLimitRule>();

    // Layer 1: Global environment defaults
    if (_config.ForEnvironment != null)
    {
        rules.AddRange(_config.ForEnvironment.GetEffectiveRules());
    }

    // Layer 2: Microservice overrides (REPLACES global)
    if (_config.ForEnvironment?.Microservices.TryGetValue(microservice, out var msConfig) == true)
    {
        var msRules = msConfig.GetEffectiveRules();
        if (msRules.Count > 0)
        {
            rules = msRules; // Replace, not merge
        }

        // Layer 3: Route overrides (REPLACES microservice)
        if (!string.IsNullOrWhiteSpace(routeName) &&
            msConfig.Routes.TryGetValue(routeName, out var routeConfig))
        {
            var routeRules = routeConfig.GetEffectiveRules();
            if (routeRules.Count > 0)
            {
                rules = routeRules; // Replace, not merge
            }
        }
    }

    return rules;
}
```

**Testing:**
- Unit tests for rule inheritance
- Verify replacement (not merge) semantics
- All combinations

**Deliverable:** Inheritance resolver supports rules.

---

## Acceptance Criteria

- [x] Configuration supports rule arrays
- [x] Backward compatible with the legacy single-window config
- [x] Instance limiter evaluates all rules (AND logic)
- [x] Valkey Lua script handles multiple windows
- [x] Most restrictive Retry-After returned
- [x] Inheritance resolver resolves rules correctly (replacement semantics)
- [x] Unit tests pass
- [x] Integration tests pass (Valkey/Testcontainers) (Sprint 5)

---

## Execution Log

| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-17 | Marked sprint DONE; implemented rule arrays and multi-window evaluation for instance + environment (Valkey Lua); added unit tests. | Automation |

---

## Configuration Examples

### Basic Stacking

```yaml
for_instance:
  rules:
    - per_seconds: 1
      max_requests: 10
      name: "10_per_second"
    - per_seconds: 3600
      max_requests: 3000
      name: "3000_per_hour"
```

### Complex Multi-Level

```yaml
for_environment:
  rules:
    - per_seconds: 300
      max_requests: 30000
      name: "global_long"

  microservices:
    concelier:
      rules:
        - per_seconds: 1
          max_requests: 10
        - per_seconds: 60
          max_requests: 300
        - per_seconds: 3600
          max_requests: 3000
        - per_seconds: 86400
          max_requests: 50000
      routes:
        expensive_op:
          pattern: "/api/process"
          match_type: exact
          rules:
            - per_seconds: 10
              max_requests: 5
            - per_seconds: 3600
              max_requests: 100
```

---

## Next Sprint

Sprint 4: Service Migration (migrate AdaptiveRateLimiter to Router)

# Sprint 1200_001_004 · Router Rate Limiting · Service Migration (AdaptiveRateLimiter)

## Topic & Scope
- Close the planned migration of `AdaptiveRateLimiter` (Orchestrator) into Router rate limiting.
- Confirm whether any production HTTP paths still enforce service-level rate limiting and therefore require migration.
- **Working directory:** `src/Orchestrator/StellaOps.Orchestrator`.
- **Evidence:** `src/__Libraries/StellaOps.Router.Gateway/RateLimit/` (router limiter exists) and an Orchestrator code search indicating `AdaptiveRateLimiter` is not wired into HTTP ingress (library-only).

## Dependencies & Concurrency
- Depends on: `SPRINT_1200_001_001`, `SPRINT_1200_001_002`, `SPRINT_1200_001_003` (rate limiting landed in Router).
- Safe to execute in parallel with Sprints 5/6 since no code changes are required for this closure.

## Documentation Prerequisites
- `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
- `docs/modules/router/architecture.md`
- `docs/modules/orchestrator/architecture.md`

## Delivery Tracker
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
| --- | --- | --- | --- | --- | --- |
| 1 | RRL-04-001 | DONE | N/A | Router · Orchestrator | Inventory usage of `AdaptiveRateLimiter` and any service-level HTTP rate limiting in Orchestrator ingress. |
| 2 | RRL-04-002 | DONE | N/A | Router · Architecture | Decide the migration outcome: migrate, defer, or close as N/A based on the inventory. |
| 3 | RRL-04-003 | DONE | Update master tracker | Router | Update `SPRINT_1200_001_000_router_rate_limiting_master.md` to reflect the closure outcome. |

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-17 | Sprint created and closed as N/A: `AdaptiveRateLimiter` appears to be a library-only component in Orchestrator (tests + core) and is not wired into HTTP ingress; no service-level HTTP rate limiting was found to migrate. | Automation |

## Decisions & Risks
- **Decision:** Close Sprint 4 as N/A (no production wiring found). If Orchestrator (or any service) introduces HTTP-level rate limiting, open a dedicated migration sprint under that service's working directory.
- **Risk:** Double-limiting during a future migration if both service-level and router-level limiters are enabled. Mitigation: migration guide + staged rollout (shadow mode), and remove service-level limiters after router limits are verified.

## Next Checkpoints
- None (closure sprint).

# Sprint 1200_001_005 · Router Rate Limiting · Comprehensive Testing

## Topic & Scope
- Add Valkey-backed integration tests for the Lua fixed-window implementation (real Valkey).
- Expand deterministic unit coverage via configuration matrix tests (inheritance + routes + rule stacking).
- Add k6 load test scenarios for rate limiting (enforcement, Retry-After correctness, overhead).
- **Working directory:** `tests/`.
- **Evidence:** `tests/StellaOps.Router.Gateway.Tests/`, `tests/load/`.

## Dependencies & Concurrency
- Depends on: `SPRINT_1200_001_001`, `SPRINT_1200_001_002`, `SPRINT_1200_001_003` (feature implementation).
- Can run in parallel with Sprint 6 docs.

## Documentation Prerequisites
- `docs/implplan/SPRINT_1200_001_IMPLEMENTATION_GUIDE.md`
- `docs/router/rate-limiting-routes.md`
- `docs/modules/router/architecture.md`

## Delivery Tracker
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
| --- | --- | --- | --- | --- | --- |
| 1 | RRL-05-001 | DONE | Run with `STELLAOPS_INTEGRATION_TESTS=true` | QA · Router | Valkey integration tests validating multi-rule Lua behavior and Retry-After bounds. |
| 2 | RRL-05-002 | DONE | Covered by unit tests | QA · Router | Configuration matrix unit tests (inheritance replacement + route specificity + rule stacking). |
| 3 | RRL-05-003 | DONE | `tests/load/router-rate-limiting-load-test.js` | QA · Router | k6 load tests for rate limiting scenarios (A–F) and doc updates in `tests/load/README.md`. |

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-17 | Sprint created; RRL-05-001 started. | Automation |
| 2025-12-17 | Completed RRL-05-001 and RRL-05-002: added Testcontainers-backed Valkey integration tests (opt-in via `STELLAOPS_INTEGRATION_TESTS=true`) and expanded unit coverage for inheritance + activation gate behavior. | Automation |
| 2025-12-17 | Completed RRL-05-003: added the k6 suite `tests/load/router-rate-limiting-load-test.js` and documented usage in `tests/load/README.md`. | Automation |

## Decisions & Risks
- **Decision:** Integration tests require Docker; they are opt-in (skipped unless explicitly enabled) to keep `dotnet test StellaOps.Router.slnx` runnable without Docker.
- **Risk:** Flaky timing around fixed-window boundaries. Mitigation: assert ranges (not exact seconds) and use small windows with slack.

## Next Checkpoints
- None scheduled; complete tasks and mark sprint DONE.

# Sprint 1200_001_006 · Router Rate Limiting · Documentation & Rollout Prep

## Topic & Scope
- Publish the user-facing configuration guide and ops runbook for Router rate limiting.
- Update Router module docs to reflect the new centralized rate limiting feature and where it sits in the request pipeline.
- Add migration guidance to avoid double-limiting during rollout.
- **Working directory:** `docs/`.
- **Evidence:** `docs/router/`, `docs/operations/`, `docs/modules/router/`.

## Dependencies & Concurrency
- Depends on: `SPRINT_1200_001_001`, `SPRINT_1200_001_002`, `SPRINT_1200_001_003`.
- Can run in parallel with Sprint 5 tests.

## Documentation Prerequisites
- `docs/README.md`
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
- `docs/modules/platform/architecture-overview.md`
- `docs/modules/router/architecture.md`
- `docs/router/rate-limiting-routes.md`

## Delivery Tracker
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
| --- | --- | --- | --- | --- | --- |
| 1 | RRL-06-001 | DONE | Links added | Docs · Router | Architecture updates + links (Router module docs + high-level router docs). |
| 2 | RRL-06-002 | DONE | `docs/router/rate-limiting.md` | Docs · Router | User configuration guide: `docs/router/rate-limiting.md` (rules, inheritance, routes, examples). |
| 3 | RRL-06-003 | DONE | `docs/operations/router-rate-limiting.md` | Ops · Router | Operational runbook: `docs/operations/router-rate-limiting.md` (dashboards, alerts, rollout, failure modes). |
| 4 | RRL-06-004 | DONE | Migration notes published | Router · Docs | Migration guide section: avoid double-limiting, staged rollout, and decommission service-level limiters. |

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2025-12-17 | Sprint created; awaiting implementation. | Automation |
| 2025-12-17 | Started RRL-06-001. | Automation |
| 2025-12-17 | Completed RRL-06-001..004: added `docs/router/rate-limiting.md`, `docs/operations/router-rate-limiting.md`, `docs/modules/router/rate-limiting.md`; updated `docs/router/rate-limiting-routes.md`, `docs/modules/router/README.md`, and `docs/modules/router/architecture.md`. | Automation |

## Decisions & Risks
- **Decision:** Keep docs offline-friendly: no external CDNs/snippets; prefer deterministic, copy-pastable YAML fragments.
- **Risk:** Confusion during rollout if both router and service rate limiting are enabled. Mitigation: explicit migration guide + recommended rollout phases.

## Next Checkpoints
- None scheduled; complete tasks and mark sprint DONE.

docs/implplan/archived/SPRINT_1200_001_IMPLEMENTATION_GUIDE.md
@@ -0,0 +1,709 @@
|
||||
# Router Rate Limiting - Implementation Guide
|
||||
|
||||
**For:** Implementation agents / reviewers for Sprint 1200_001_001 through 1200_001_006
|
||||
**Status:** DONE (Sprints 1–6 closed; Sprint 4 closed N/A)
|
||||
**Evidence:** `src/__Libraries/StellaOps.Router.Gateway/RateLimit/`, `tests/StellaOps.Router.Gateway.Tests/`
|
||||
**Last Updated:** 2025-12-18
|
||||
|
||||
---
|
||||
|
||||
## Purpose
|
||||
|
||||
This guide provides comprehensive technical context for centralized rate limiting in Stella Router (design + operational considerations). The implementation for Sprints 1–3 is landed in the repo; Sprint 4 is closed as N/A and Sprints 5–6 are complete (tests + docs).
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Architecture Overview](#architecture-overview)
|
||||
2. [Configuration Philosophy](#configuration-philosophy)
|
||||
3. [Performance Considerations](#performance-considerations)
|
||||
4. [Valkey Integration](#valkey-integration)
|
||||
5. [Testing Strategy](#testing-strategy)
|
||||
6. [Common Pitfalls](#common-pitfalls)
|
||||
7. [Debugging Guide](#debugging-guide)
|
||||
8. [Operational Runbook](#operational-runbook)
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
### Design Principles
|
||||
|
||||
1. **Router-Centralized**: Rate limiting is a router responsibility, not a microservice responsibility
|
||||
2. **Fail-Open**: Never block all traffic due to infrastructure failures
|
||||
3. **Observable**: Every decision must be metrified
|
||||
4. **Deterministic**: Same request at same time should get same decision (within window)
|
||||
5. **Fair**: Use sliding windows where possible to avoid thundering herd
|
||||
|
||||
### Two-Tier Architecture
|
||||
|
||||
```
Request → Instance Limiter (in-memory, <1ms) → Environment Limiter (Valkey, <10ms) → Upstream
                  ↓ DENY                                 ↓ DENY
            429 + Retry-After                      429 + Retry-After
```
|
||||
|
||||
**Why two tiers?**
|
||||
|
||||
- **Instance tier** protects individual router process (CPU, memory, sockets)
|
||||
- **Environment tier** protects shared backend (aggregate across all routers)
|
||||
|
||||
Both are necessary—single router can be overwhelmed locally even if aggregate traffic is low.
|
||||
|
||||
### Decision Flow
|
||||
|
||||
```
1. Extract microservice + route from request
2. Check instance limits (always, fast path)
   └─> DENY? Return 429
3. Check activation gate (local 5-min counter)
   └─> Below threshold? Skip env check (optimization)
4. Check environment limits (Valkey call)
   └─> Circuit breaker open? Skip (fail-open)
   └─> Valkey error? Skip (fail-open)
   └─> DENY? Return 429
5. Forward to upstream
```
|
||||
|
||||
---
|
||||
|
||||
## Configuration Philosophy
|
||||
|
||||
### Inheritance Model
|
||||
|
||||
```
Global Defaults
  └─> Environment Defaults
        └─> Microservice Overrides
              └─> Route Overrides (most specific)
```
|
||||
|
||||
**Replacement, not merge**: When a child level specifies limits, it REPLACES parent limits entirely.
|
||||
|
||||
**Example:**
|
||||
|
||||
```yaml
for_environment:
  per_seconds: 300
  max_requests: 30000          # Global default

microservices:
  scanner:
    per_seconds: 60
    max_requests: 600          # REPLACES global (not merged)

    routes:
      scan_submit:
        per_seconds: 10
        max_requests: 50       # REPLACES microservice (not merged)
```
|
||||
|
||||
Result:
|
||||
- `POST /scanner/api/scans` → 50 req/10sec (route level)
|
||||
- `GET /scanner/api/other` → 600 req/60sec (microservice level)
|
||||
- `GET /policy/api/evaluate` → 30000 req/300sec (global level)
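
Because of the replacement semantics, resolution reduces to a null-coalescing chain: the most specific level that defines any rules wins outright. The sketch below is illustrative only (the `RateLimitRule` shape here is an assumption; the repo's `LimitInheritanceResolver` may differ):

```csharp
using System;
using System.Collections.Generic;

public sealed record RateLimitRule(int PerSeconds, int MaxRequests);

public static class LimitResolution
{
    // Replacement semantics: a level that defines rules REPLACES everything
    // above it (route > microservice > environment > global).
    public static IReadOnlyList<RateLimitRule> Resolve(
        IReadOnlyList<RateLimitRule>? route,
        IReadOnlyList<RateLimitRule>? microservice,
        IReadOnlyList<RateLimitRule>? environment,
        IReadOnlyList<RateLimitRule>? global)
        => route ?? microservice ?? environment ?? global ?? Array.Empty<RateLimitRule>();
}
```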
|
||||
|
||||
### Rule Stacking (AND Logic)
|
||||
|
||||
Multiple rules at same level = ALL must pass.
|
||||
|
||||
```yaml
concelier:
  rules:
    - per_seconds: 1
      max_requests: 10         # Rule 1: 10/sec
    - per_seconds: 3600
      max_requests: 3000       # Rule 2: 3000/hour
```
|
||||
|
||||
Both rules enforced. Request denied if EITHER limit exceeded.
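
A sketch of the AND evaluation (reusing the illustrative rule idea above; the real `RateLimitService` may differ): every rule must be under its limit, and on denial the longest remaining window wait is reported as Retry-After.

```csharp
using System;
using System.Collections.Generic;

public static class RuleStacking
{
    // Each entry: current count, limit, and seconds left in that rule's window.
    public static (bool Allowed, int RetryAfterSeconds) CheckAll(
        IReadOnlyList<(long Count, int MaxRequests, int WindowRemainingSeconds)> states)
    {
        var allowed = true;
        var retryAfter = 0;

        foreach (var s in states)
        {
            if (s.Count >= s.MaxRequests)
            {
                allowed = false;
                // Most restrictive rule wins: the caller must wait until the
                // longest-blocking window rolls over.
                retryAfter = Math.Max(retryAfter, s.WindowRemainingSeconds);
            }
        }

        return (allowed, retryAfter);
    }
}
```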
|
||||
|
||||
### Sensible Defaults
|
||||
|
||||
If configuration omitted:
|
||||
- `for_instance`: No limits (effectively unlimited)
|
||||
- `for_environment`: No limits
|
||||
- `activation_threshold`: 5000 (skip Valkey if <5000 req/5min)
|
||||
- `circuit_breaker.failure_threshold`: 5
|
||||
- `circuit_breaker.timeout_seconds`: 30
|
||||
|
||||
**Recommendation**: Always configure at least global defaults.
|
||||
|
||||
---
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Instance Limiter Performance
|
||||
|
||||
**Target:** <1ms P99 latency
|
||||
|
||||
**Implementation:** Sliding window with ring buffer.
|
||||
|
||||
```csharp
// Efficient: O(1) increment, O(k) advance where k = buckets cleared
long[] _buckets; // Ring buffer, size = window_seconds / granularity
long _total;     // Running sum
```
|
||||
|
||||
**Lock contention**: Single lock per counter. Acceptable for <10k req/sec per router.
|
||||
|
||||
**Memory**: ~24 bytes per window (array overhead + fields).
|
||||
|
||||
**Optimization**: For very high traffic (>50k req/sec), consider a lock-free implementation with `Interlocked` operations.
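
A minimal sketch of the ring-buffer counter described above (field names are assumptions; the repo's `SlidingWindowCounter` may differ). Each bucket covers one second; advancing clears only the buckets that have fallen out of the window, which is what the activation gate reads via `GetCount()`.

```csharp
using System;

public sealed class SlidingWindowCounter
{
    private readonly object _lock = new();
    private readonly long[] _buckets;   // one bucket per second of window
    private long _total;                // running sum across live buckets
    private long _lastTickSeconds;

    public SlidingWindowCounter(int windowSeconds)
    {
        _buckets = new long[windowSeconds];
        _lastTickSeconds = NowSeconds();
    }

    public void Increment()
    {
        lock (_lock)
        {
            Advance();
            _buckets[(int)(_lastTickSeconds % _buckets.Length)]++;
            _total++;
        }
    }

    public long GetCount()
    {
        lock (_lock) { Advance(); return _total; }
    }

    // Clear buckets that fell out of the window since the last observation.
    private void Advance()
    {
        var now = NowSeconds();
        var elapsed = Math.Min(now - _lastTickSeconds, _buckets.Length);
        for (long i = 1; i <= elapsed; i++)
        {
            var idx = (int)((_lastTickSeconds + i) % _buckets.Length);
            _total -= _buckets[idx];
            _buckets[idx] = 0;
        }
        _lastTickSeconds = now;
    }

    private static long NowSeconds() => DateTimeOffset.UtcNow.ToUnixTimeSeconds();
}
```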
|
||||
|
||||
### Environment Limiter Performance
|
||||
|
||||
**Target:** <10ms P99 latency (including Valkey RTT)
|
||||
|
||||
**Critical path**: Every request to environment limiter makes a Valkey call.
|
||||
|
||||
**Optimization: Activation Gate**
|
||||
|
||||
Skip Valkey if local instance traffic < threshold:
|
||||
|
||||
```csharp
if (_instanceCounter.GetCount() < _config.ActivationThresholdPer5Min)
{
    // Skip expensive Valkey check
    return instanceDecision;
}
```
|
||||
|
||||
**Effect**: Reduces Valkey load by 80%+ in low-traffic scenarios.
|
||||
|
||||
**Trade-off**: Under threshold, environment limits not enforced. Acceptable if:
|
||||
- Each router instance threshold is set appropriately
|
||||
- Primary concern is high-traffic scenarios
|
||||
|
||||
**Lua Script Performance**
|
||||
|
||||
- Single round-trip to Valkey (atomic)
|
||||
- Multiple `INCR` operations in single script (fast, no network)
|
||||
- TTL set only on first increment (optimization)
|
||||
|
||||
**Valkey Sizing**: 1000 ops/sec per router instance = 10k ops/sec for 10 routers. Valkey handles this easily (100k+ ops/sec capacity).
|
||||
|
||||
---
|
||||
|
||||
## Valkey Integration
|
||||
|
||||
### Connection Management
|
||||
|
||||
Use `ConnectionMultiplexer` from StackExchange.Redis:
|
||||
|
||||
```csharp
var _connection = ConnectionMultiplexer.Connect(connectionString);
var _db = _connection.GetDatabase();
```
|
||||
|
||||
**Important**: ConnectionMultiplexer is thread-safe and expensive to create. Create ONCE per application, reuse everywhere.
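
In an ASP.NET Core host that typically means a singleton registration, e.g. (a sketch; the configuration key name is an assumption):

```csharp
using StackExchange.Redis;

// Program.cs: one multiplexer per process, shared by all limiters.
builder.Services.AddSingleton<IConnectionMultiplexer>(_ =>
    ConnectionMultiplexer.Connect(
        builder.Configuration["RateLimiting:ValkeyConnection"]
            ?? "valkey.stellaops.local:6379"));
```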
|
||||
|
||||
### Lua Script Loading
|
||||
|
||||
Scripts loaded at startup and cached by SHA:
|
||||
|
||||
```csharp
var script = File.ReadAllText("rate_limit_check.lua");
var server = _connection.GetServer(_connection.GetEndPoints().First());
var sha = server.ScriptLoad(script);
```
|
||||
|
||||
**Persistence**: Valkey caches scripts in memory. They survive across requests but NOT across restarts.
|
||||
|
||||
**Recommendation**: Load script at startup, store SHA, use `ScriptEvaluateAsync(sha, ...)` for all calls.
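
Because the cache does not survive restarts, callers should tolerate a NOSCRIPT miss by re-loading the script and retrying once. A defensive sketch (`_db` and `ReloadScriptAsync` are assumed members of the store class, not the repo's actual API):

```csharp
private byte[] _scriptSha = Array.Empty<byte>(); // populated at startup via ScriptLoad

private async Task<RedisResult> EvalRateLimitAsync(RedisKey[] keys, RedisValue[] args)
{
    try
    {
        return await _db.ScriptEvaluateAsync(_scriptSha, keys, args);
    }
    catch (RedisServerException ex) when (ex.Message.Contains("NOSCRIPT"))
    {
        // Valkey restarted and dropped its script cache: reload, then retry once.
        _scriptSha = await ReloadScriptAsync();
        return await _db.ScriptEvaluateAsync(_scriptSha, keys, args);
    }
}
```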
|
||||
|
||||
### Key Naming Strategy
|
||||
|
||||
Format: `{bucket}:env:{service}:{rule_name}:{window_start}`
|
||||
|
||||
Example: `stella-router-rate-limit:env:concelier:per_second:1702821600`
|
||||
|
||||
**Why include window_start in key?**
|
||||
|
||||
Fixed windows—each window is a separate key with TTL. When window expires, key auto-deleted.
|
||||
|
||||
**Benefit**: No manual cleanup, memory efficient.
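
A small helper can centralize the prefix on the client side; the `:{window_start}` suffix is appended inside the Lua script from server time (see the clock-skew note below). Names here are illustrative:

```csharp
// Client builds the stable prefix; Lua appends ":" .. window_start server-side.
public static string BuildKeyPrefix(string bucket, string service, string ruleName)
    => $"{bucket}:env:{service}:{ruleName}";
```

For example, `BuildKeyPrefix("stella-router-rate-limit", "concelier", "per_second")` yields the prefix of the example key shown above.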
|
||||
|
||||
### Clock Skew Handling
|
||||
|
||||
**Problem**: Different routers may have slightly different clocks, causing them to disagree on window boundaries.
|
||||
|
||||
**Solution**: Use Valkey server time (`redis.call("TIME")`) in Lua script, not client time.
|
||||
|
||||
```lua
local now = tonumber(redis.call("TIME")[1]) -- Valkey server time
local window_start = now - (now % window_sec)
```
|
||||
|
||||
**Result**: All routers agree on window boundaries (Valkey is source of truth).
|
||||
|
||||
### Circuit Breaker Thresholds
|
||||
|
||||
- **failure_threshold**: 5 consecutive failures before opening
- **timeout_seconds**: 30 seconds before attempting half-open
- **half_open_timeout**: 10 seconds to test one request
|
||||
|
||||
**Tuning**:
|
||||
- Lower failure_threshold = faster fail-open (more availability, less strict limiting)
|
||||
- Higher failure_threshold = tolerate more transient errors (stricter limiting)
|
||||
|
||||
**Recommendation**: Start with defaults, adjust based on Valkey stability.
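
A minimal three-state breaker consistent with these knobs (a sketch, not the repo's `CircuitBreaker.cs`; the half-open probe timeout is elided for brevity):

```csharp
using System;

public enum CircuitState { Closed, Open, HalfOpen }

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openTimeout;
    private readonly object _lock = new();
    private CircuitState _state = CircuitState.Closed;
    private int _consecutiveFailures;
    private DateTime _halfOpenAt;

    public CircuitBreaker(int failureThreshold = 5, int timeoutSeconds = 30)
    {
        _failureThreshold = failureThreshold;
        _openTimeout = TimeSpan.FromSeconds(timeoutSeconds);
    }

    // True if the caller may attempt the Valkey call right now.
    public bool CanAttempt()
    {
        lock (_lock)
        {
            if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
            {
                _state = CircuitState.HalfOpen; // allow a single probe
            }
            return _state != CircuitState.Open;
        }
    }

    public void RecordSuccess()
    {
        lock (_lock)
        {
            _consecutiveFailures = 0;
            _state = CircuitState.Closed;
        }
    }

    public void RecordFailure()
    {
        lock (_lock)
        {
            _consecutiveFailures++;
            if (_state == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
            {
                _state = CircuitState.Open;
                _halfOpenAt = DateTime.UtcNow + _openTimeout;
            }
        }
    }
}
```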
|
||||
|
||||
---
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests (xUnit)
|
||||
|
||||
**Coverage targets:**
|
||||
- Configuration loading: 100%
|
||||
- Validation logic: 100%
|
||||
- Sliding window counter: 100%
|
||||
- Route matching: 100%
|
||||
- Inheritance resolution: 100%
|
||||
|
||||
**Test patterns:**
|
||||
|
||||
```csharp
[Fact]
public void SlidingWindowCounter_WhenWindowExpires_ResetsCount()
{
    var counter = new SlidingWindowCounter(windowSeconds: 10);
    counter.Increment(); // count = 1

    // Simulate time passing (mock or Thread.Sleep in tests)
    AdvanceTime(11); // seconds

    Assert.Equal(0, counter.GetCount()); // Window expired, count reset
}
```
|
||||
|
||||
### Integration Tests (TestServer + Testcontainers)
|
||||
|
||||
**Valkey integration:**
|
||||
|
||||
```csharp
[Fact]
public async Task EnvironmentLimiter_WhenLimitExceeded_Returns429()
{
    using var valkey = new ValkeyContainer();
    await valkey.StartAsync();

    var store = new ValkeyRateLimitStore(valkey.GetConnectionString(), "test-bucket");
    var limiter = new EnvironmentRateLimiter(store, circuitBreaker, logger);

    var limits = new EffectiveLimits(perSeconds: 1, maxRequests: 5, ...);

    // First 5 requests should pass
    for (int i = 0; i < 5; i++)
    {
        var decision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
        Assert.True(decision.Value.Allowed);
    }

    // 6th request should be denied, with Retry-After inside the 1s window
    // (Retry-After is seconds to wait, not the 429 status code).
    var deniedDecision = await limiter.TryAcquireAsync("test-svc", limits, CancellationToken.None);
    Assert.False(deniedDecision.Value.Allowed);
    Assert.InRange(deniedDecision.Value.RetryAfterSeconds, 0, 1);
}
```
|
||||
|
||||
**Middleware integration:**
|
||||
|
||||
```csharp
[Fact]
public async Task RateLimitMiddleware_WhenLimitExceeded_Returns429WithRetryAfter()
{
    using var testServer = new TestServer(new WebHostBuilder().UseStartup<Startup>());
    var client = testServer.CreateClient();

    // Configure rate limit: 5 req/sec
    // Send 6 requests rapidly
    for (int i = 0; i < 6; i++)
    {
        var response = await client.GetAsync("/api/test");
        if (i < 5)
        {
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
        else
        {
            Assert.Equal(HttpStatusCode.TooManyRequests, response.StatusCode);
            Assert.True(response.Headers.Contains("Retry-After"));
        }
    }
}
```
|
||||
|
||||
### Load Tests (k6)
|
||||
|
||||
**Scenario A: Instance Limits**
|
||||
|
||||
```javascript
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    instance_limit: {
      executor: 'constant-arrival-rate',
      rate: 100, // 100 req/sec
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 50,
    },
  },
};

export default function () {
  const res = http.get('http://router/api/test');
  check(res, {
    'status 200 or 429': (r) => r.status === 200 || r.status === 429,
    'has Retry-After on 429': (r) => r.status !== 429 || r.headers['Retry-After'] !== undefined,
  });
}
```
|
||||
|
||||
**Scenario B: Environment Limits (Multi-Instance)**
|
||||
|
||||
Run k6 from 5 different machines simultaneously → simulate 5 router instances → verify aggregate limit enforced.
|
||||
|
||||
**Scenario E: Valkey Failure**
|
||||
|
||||
Use Toxiproxy to inject network failures → verify circuit breaker opens → verify requests still allowed (fail-open).
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Forgetting to Update Middleware Pipeline Order
|
||||
|
||||
**Problem**: Rate limit middleware added AFTER routing decision → can't identify microservice.
|
||||
|
||||
**Solution**: Add rate limit middleware BEFORE routing decision:
|
||||
|
||||
```csharp
app.UsePayloadLimits();
app.UseRateLimiting();        // HERE
app.UseEndpointResolution();
app.UseRoutingDecision();
```
|
||||
|
||||
### 2. Circuit Breaker Never Closes
|
||||
|
||||
**Problem**: Circuit breaker opens, but never attempts recovery.
|
||||
|
||||
**Cause**: Half-open logic not implemented or timeout too long.
|
||||
|
||||
**Solution**: Implement half-open state with timeout:
|
||||
|
||||
```csharp
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
{
    _state = CircuitState.HalfOpen; // Allow one test request
}
```
|
||||
|
||||
### 3. Lua Script Not Found at Runtime
|
||||
|
||||
**Problem**: Script file not copied to output directory.
|
||||
|
||||
**Solution**: Set file properties in `.csproj`:
|
||||
|
||||
```xml
<ItemGroup>
  <Content Include="RateLimit\Scripts\*.lua">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </Content>
</ItemGroup>
```
|
||||
|
||||
### 4. Activation Gate Never Triggers
|
||||
|
||||
**Problem**: Activation counter not incremented on every request.
|
||||
|
||||
**Cause**: Counter incremented only when instance limit is enforced.
|
||||
|
||||
**Solution**: Increment activation counter ALWAYS, not just when checking limits:
|
||||
|
||||
```csharp
public RateLimitDecision TryAcquire(string? microservice)
{
    _activationCounter.Increment(); // ALWAYS increment
    // ... rest of logic
}
```
|
||||
|
||||
### 5. Route Matching Case-Sensitivity Issues
|
||||
|
||||
**Problem**: `/API/Scans` doesn't match `/api/scans`.
|
||||
|
||||
**Solution**: Use case-insensitive comparisons:
|
||||
|
||||
```csharp
|
||||
string.Equals(requestPath, pattern, StringComparison.OrdinalIgnoreCase)
|
||||
```
|
||||
|
||||
### 6. Valkey Key Explosion
|
||||
|
||||
**Problem**: Too many keys in Valkey, memory usage high.
|
||||
|
||||
**Cause**: Forgetting to set TTL on keys.
|
||||
|
||||
**Solution**: ALWAYS set TTL when creating keys:
|
||||
|
||||
```lua
if count == 1 then
  redis.call("EXPIRE", key, window_sec + 2)
end
```
|
||||
|
||||
**+2 buffer**: Gives grace period to avoid edge cases.
|
||||
|
||||
---
|
||||
|
||||
## Debugging Guide
|
||||
|
||||
### Scenario 1: Requests Being Denied But Shouldn't Be
|
||||
|
||||
**Steps:**
|
||||
|
||||
1. Check metrics: Which scope is denying? (instance or environment)
|
||||
|
||||
```promql
|
||||
rate(stella_router_rate_limit_denied_total[1m])
|
||||
```
|
||||
|
||||
2. Check configured limits:
|
||||
|
||||
```bash
|
||||
# View config
|
||||
kubectl get configmap router-config -o yaml | grep -A 20 "rate_limiting"
|
||||
```
|
||||
|
||||
3. Check activation gate:
|
||||
|
||||
```promql
|
||||
stella_router_rate_limit_activation_gate_enabled
|
||||
```
|
||||
|
||||
If 0, activation gate is disabled—all requests hit Valkey.
|
||||
|
||||
4. Check Valkey keys:
|
||||
|
||||
```bash
redis-cli -h valkey.stellaops.local
> SCAN 0 MATCH stella-router-rate-limit:env:* COUNT 100
> TTL stella-router-rate-limit:env:concelier:per_second:1702821600
> GET stella-router-rate-limit:env:concelier:per_second:1702821600
```

(`SCAN` is used instead of `KEYS` here: `KEYS` is O(N) and blocks the server on large keyspaces.)
|
||||
|
||||
5. Check circuit breaker state:
|
||||
|
||||
```promql
|
||||
stella_router_rate_limit_circuit_breaker_state{state="open"}
|
||||
```
|
||||
|
||||
If 1, circuit breaker is open—env limits not enforced.
|
||||
|
||||
### Scenario 2: Rate Limits Not Being Enforced
|
||||
|
||||
**Steps:**
|
||||
|
||||
1. Verify middleware is registered:
|
||||
|
||||
```csharp
|
||||
// Check Startup.cs or Program.cs
|
||||
app.UseRateLimiting(); // Should be present
|
||||
```
|
||||
|
||||
2. Verify configuration loaded:
|
||||
|
||||
```csharp
|
||||
// Add logging in RateLimitService constructor
|
||||
_logger.LogInformation("Rate limit config loaded: Instance={HasInstance}, Env={HasEnv}",
|
||||
_config.ForInstance != null,
|
||||
_config.ForEnvironment != null);
|
||||
```
|
||||
|
||||
3. Check metrics—are requests even hitting rate limiter?
|
||||
|
||||
```promql
|
||||
rate(stella_router_rate_limit_allowed_total[1m])
|
||||
```
|
||||
|
||||
If 0, middleware not in pipeline or not being called.
|
||||
|
||||
4. Check microservice identification:
|
||||
|
||||
```csharp
|
||||
// Add logging in middleware
|
||||
var microservice = context.Items["RoutingTarget"] as string;
|
||||
_logger.LogDebug("Rate limiting request for microservice: {Microservice}", microservice);
|
||||
```
|
||||
|
||||
If "unknown", routing metadata not set—rate limiter can't apply service-specific limits.
|
||||
|
||||
### Scenario 3: Valkey Errors
|
||||
|
||||
**Steps:**
|
||||
|
||||
1. Check circuit breaker metrics:
|
||||
|
||||
```promql
|
||||
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
|
||||
```
|
||||
|
||||
2. Check Valkey connectivity:
|
||||
|
||||
```bash
|
||||
redis-cli -h valkey.stellaops.local PING
|
||||
```
|
||||
|
||||
3. Check Lua script loaded:
|
||||
|
||||
```bash
|
||||
redis-cli -h valkey.stellaops.local SCRIPT EXISTS <sha>
|
||||
```
|
||||
|
||||
4. Check Valkey logs for errors:
|
||||
|
||||
```bash
|
||||
kubectl logs -f valkey-0 | grep ERROR
|
||||
```
|
||||
|
||||
5. Verify Lua script syntax:
|
||||
|
||||
```bash
|
||||
redis-cli -h valkey.stellaops.local --eval rate_limit_check.lua
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Operational Runbook
|
||||
|
||||
### Deployment Checklist
|
||||
|
||||
- [ ] Valkey cluster healthy (check `redis-cli PING`)
|
||||
- [ ] Configuration validated (run `stella-router validate-config`)
|
||||
- [ ] Metrics scraping configured (Prometheus targets)
|
||||
- [ ] Dashboards imported (Grafana)
|
||||
- [ ] Alerts configured (Alertmanager)
|
||||
- [ ] Shadow mode enabled (limits set to 10x expected traffic)
|
||||
- [ ] Rollback plan documented
|
||||
|
||||
### Monitoring Dashboards
|
||||
|
||||
**Dashboard 1: Rate Limiting Overview**
|
||||
|
||||
Panels:
|
||||
- Requests allowed vs denied (pie chart)
|
||||
- Denial rate by microservice (line graph)
|
||||
- Denial rate by route (heatmap)
|
||||
- Retry-After distribution (histogram)
|
||||
|
||||
**Dashboard 2: Performance**
|
||||
|
||||
Panels:
|
||||
- Decision latency P50/P95/P99 (instance vs environment)
|
||||
- Valkey call latency P95
|
||||
- Activation gate effectiveness (% skipped)
|
||||
|
||||
**Dashboard 3: Health**
|
||||
|
||||
Panels:
|
||||
- Circuit breaker state (gauge)
|
||||
- Valkey error rate
|
||||
- Most denied routes (top 10 table)
|
||||
|
||||
### Alert Definitions
|
||||
|
||||
**Critical:**
|
||||
|
||||
```yaml
- alert: RateLimitValkeyCriticalFailure
  expr: stella_router_rate_limit_circuit_breaker_state{state="open"} == 1
  for: 5m
  annotations:
    summary: "Rate limit circuit breaker open for >5min"
    description: "Valkey unavailable, environment limits not enforced"

- alert: RateLimitAllRequestsDenied
  expr: rate(stella_router_rate_limit_denied_total[1m]) / (rate(stella_router_rate_limit_allowed_total[1m]) + rate(stella_router_rate_limit_denied_total[1m])) > 0.99
  for: 1m
  annotations:
    summary: ">99% denial rate"
    description: "Possible configuration error"
```
|
||||
|
||||
**Warning:**
|
||||
|
||||
```yaml
- alert: RateLimitHighDenialRate
  expr: rate(stella_router_rate_limit_denied_total[5m]) / (rate(stella_router_rate_limit_allowed_total[5m]) + rate(stella_router_rate_limit_denied_total[5m])) > 0.2
  for: 5m
  annotations:
    summary: ">20% requests denied"
    description: "High denial rate, check if expected"

- alert: RateLimitValkeyHighLatency
  expr: histogram_quantile(0.95, sum(rate(stella_router_rate_limit_decision_latency_ms_bucket{scope="environment"}[5m])) by (le)) > 100
  for: 5m
  annotations:
    summary: "Valkey latency >100ms P95"
    description: "Valkey performance degraded"
```
|
||||
|
||||
### Tuning Guidelines
|
||||
|
||||
**Scenario: Too many requests denied**
|
||||
|
||||
1. Check if denial rate is expected (traffic spike?)
|
||||
2. If not, increase limits:
|
||||
- Start with 2x current limits
|
||||
- Monitor for 24 hours
|
||||
- Adjust as needed
|
||||
|
||||
**Scenario: Valkey overloaded**
|
||||
|
||||
1. Check ops/sec: `redis-cli INFO stats | grep instantaneous_ops_per_sec`
|
||||
2. If >50k ops/sec, consider:
|
||||
- Increase activation threshold (reduce Valkey calls)
|
||||
- Add Valkey replicas (read scaling)
|
||||
- Shard by microservice (write scaling)
|
||||
|
||||
**Scenario: Circuit breaker flapping**
|
||||
|
||||
1. Check failure rate:
|
||||
|
||||
```promql
|
||||
rate(stella_router_rate_limit_valkey_call_total{result="error"}[5m])
|
||||
```
|
||||
|
||||
2. If transient errors, increase failure_threshold
|
||||
3. If persistent errors, fix Valkey issue
|
||||
|
||||
### Rollback Procedure
|
||||
|
||||
1. Disable rate limiting:
|
||||
|
||||
```yaml
rate_limiting:
  for_instance: null
  for_environment: null
```
|
||||
|
||||
2. Deploy config update
|
||||
3. Verify traffic flows normally
|
||||
4. Investigate issue offline
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
|
||||
- **Master Sprint Tracker:** `docs/implplan/SPRINT_1200_001_000_router_rate_limiting_master.md`
|
||||
- **Sprint Files:** `docs/implplan/SPRINT_1200_001_00X_*.md`
|
||||
- **HTTP 429 Semantics:** RFC 6585
|
||||
- **HTTP Retry-After:** RFC 7231 Section 7.1.3
|
||||
- **Valkey Documentation:** https://valkey.io/docs/
|
||||
docs/implplan/archived/SPRINT_1200_001_README.md
@@ -0,0 +1,502 @@
|
||||
# Router Rate Limiting - Sprint Package README
|
||||
|
||||
**Package Created:** 2025-12-17
|
||||
**For:** Implementation agents / reviewers
|
||||
**Status:** DONE (Sprints 1–6 closed; Sprint 4 closed N/A)
|
||||
**Advisory Source:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
|
||||
|
||||
---
|
||||
|
||||
## Package Contents
|
||||
|
||||
This sprint package contains the original plan plus the landed implementation for centralized rate limiting in Stella Router.
|
||||
|
||||
### Core Sprint Files
|
||||
|
||||
| File | Purpose | Agent Role |
|
||||
|------|---------|------------|
|
||||
| `SPRINT_1200_001_000_router_rate_limiting_master.md` | Master tracker | **START HERE** - Overview & progress tracking |
|
||||
| `SPRINT_1200_001_001_router_rate_limiting_core.md` | Sprint 1: Core implementation | Implementer - 5-7 days |
|
||||
| `SPRINT_1200_001_002_router_rate_limiting_per_route.md` | Sprint 2: Per-route granularity | Implementer - 2-3 days |
|
||||
| `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md` | Sprint 3: Rule stacking | Implementer - 2-3 days |
|
||||
| `SPRINT_1200_001_004_router_rate_limiting_service_migration.md` | Sprint 4: Service migration (closed N/A) | Project manager / reviewer |
|
||||
| `SPRINT_1200_001_005_router_rate_limiting_tests.md` | Sprint 5: Comprehensive testing | QA / implementer |
|
||||
| `SPRINT_1200_001_006_router_rate_limiting_docs.md` | Sprint 6: Documentation & rollout prep | Docs / implementer |
|
||||
| `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` | Technical reference | **READ FIRST** before coding |
|
||||
|
||||
### Documentation Files
|
||||
|
||||
| File | Purpose | Created In |
|
||||
|------|---------|------------|
|
||||
| `docs/router/rate-limiting-routes.md` | Per-route configuration guide | Sprint 2 |
|
||||
| `docs/router/rate-limiting.md` | User-facing configuration guide | Sprint 6 |
|
||||
| `docs/operations/router-rate-limiting.md` | Operational runbook | Sprint 6 |
|
||||
| `docs/modules/router/rate-limiting.md` | Module-level rate-limiting dossier | Sprint 6 |
|
||||
|
||||
---
|
||||
|
||||
## Implementation Sequence
|
||||
|
||||
### Phase 1: Core Implementation (Sprints 1-3)
|
||||
|
||||
```
|
||||
Sprint 1 (5-7 days)
|
||||
├── Task 1.1: Configuration Models
|
||||
├── Task 1.2: Instance Rate Limiter
|
||||
├── Task 1.3: Valkey Backend
|
||||
├── Task 1.4: Middleware Integration
|
||||
├── Task 1.5: Metrics
|
||||
└── Task 1.6: Wire into Pipeline
|
||||
|
||||
Sprint 2 (2-3 days)
|
||||
├── Task 2.1: Extend Config for Routes
|
||||
├── Task 2.2: Route Matching
|
||||
├── Task 2.3: Inheritance Resolution
|
||||
├── Task 2.4: Integrate into Service
|
||||
└── Task 2.5: Documentation
|
||||
|
||||
Sprint 3 (2-3 days)
|
||||
├── Task 3.1: Config for Rule Arrays
|
||||
├── Task 3.2: Update Instance Limiter
|
||||
├── Task 3.3: Enhance Valkey Lua Script
|
||||
└── Task 3.4: Update Inheritance Resolver
|
||||
```
|
||||
|
||||
### Phase 2: Migration & Testing (Sprints 4-5)
|
||||
|
||||
```
|
||||
Sprint 4 (3-4 days) - Service Migration
|
||||
├── Extract AdaptiveRateLimiter configs
|
||||
├── Add to Router configuration
|
||||
├── Refactor AdaptiveRateLimiter
|
||||
└── Integration validation
|
||||
|
||||
Sprint 5 (3-5 days) - Comprehensive Testing
|
||||
├── Unit test suite
|
||||
├── Integration tests (Testcontainers)
|
||||
├── Load tests (k6 scenarios A-F)
|
||||
└── Configuration matrix tests
|
||||
```
|
||||
|
||||
### Phase 3: Documentation & Rollout (Sprint 6)
|
||||
|
||||
```
|
||||
Sprint 6 (2 days)
|
||||
├── Architecture docs
|
||||
├── Configuration guide
|
||||
├── Operational runbook
|
||||
└── Migration guide
|
||||
```
|
||||
|
||||
### Phase 4: Rollout (3 weeks, post-implementation)
|
||||
|
||||
```
|
||||
Week 1: Shadow Mode
|
||||
└── Metrics only, no enforcement
|
||||
|
||||
Week 2: Soft Limits
|
||||
└── 2x traffic peaks
|
||||
|
||||
Week 3: Production Limits
|
||||
└── Full enforcement
|
||||
|
||||
Week 4+: Service Migration
|
||||
└── Remove redundant limiters
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Start for Agents
|
||||
|
||||
### 1. Context Gathering (30 minutes)
|
||||
|
||||
**Read in this order:**
|
||||
|
||||
1. `SPRINT_1200_001_000_router_rate_limiting_master.md` - Overview
|
||||
2. `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` - Technical details
|
||||
3. Original advisory: `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
|
||||
4. Analysis plan: `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`
|
||||
|
||||
### 2. Environment Setup
|
||||
|
||||
```bash
|
||||
# Working directory
|
||||
cd src/__Libraries/StellaOps.Router.Gateway/
|
||||
|
||||
# Verify dependencies
|
||||
dotnet restore
|
||||
|
||||
# Install Valkey for local testing
|
||||
docker run -d -p 6379:6379 valkey/valkey:latest
|
||||
|
||||
# Run existing tests to ensure baseline
|
||||
dotnet test
|
||||
```
|
||||
|
||||
### 3. Start Sprint 1
|
||||
|
||||
Open `SPRINT_1200_001_001_router_rate_limiting_core.md` and follow task breakdown.
|
||||
|
||||
**Task execution pattern:**
|
||||
|
||||
```
|
||||
For each task:
|
||||
1. Read task description
|
||||
2. Review implementation code samples
|
||||
3. Create files as specified
|
||||
4. Write unit tests
|
||||
5. Mark task complete in master tracker
|
||||
6. Commit with message: "feat(router): [Sprint 1.X] Task name"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Design Decisions (Reference)
|
||||
|
||||
### 1. Status Codes
|
||||
- ✅ **429 Too Many Requests** for rate limiting
|
||||
- ❌ NOT 503 (that's for service health)
|
||||
- ❌ NOT 202 (that's for async job acceptance)
|
||||
|
||||
### 2. Two-Scope Architecture
|
||||
- **for_instance**: In-memory, protects single router
|
||||
- **for_environment**: Valkey-backed, protects aggregate
|
||||
|
||||
Both are necessary—can't replace one with the other.
|
||||
|
||||
### 3. Fail-Open Philosophy
|
||||
- Circuit breaker on Valkey failures
|
||||
- Activation gate optimization
|
||||
- Instance limits enforced even if Valkey down
|
||||
|
||||
### 4. Configuration Inheritance
|
||||
- Replacement semantics (not merge)
|
||||
- Most specific wins: route > microservice > environment > global
|
||||
|
||||
### 5. Rule Stacking
|
||||
- Multiple rules per target = AND logic
|
||||
- All rules must pass
|
||||
- Most restrictive Retry-After returned
|
||||
|
||||
---
|
||||
|
||||
## Performance Targets
|
||||
|
||||
| Metric | Target | Measurement |
|
||||
|--------|--------|-------------|
|
||||
| Instance check latency | <1ms P99 | BenchmarkDotNet |
|
||||
| Environment check latency | <10ms P99 | k6 load test |
|
||||
| Router throughput | 100k req/sec | k6 constant-arrival-rate |
|
||||
| Valkey load per instance | <1000 ops/sec | redis-cli INFO |
|
||||
|
||||
---
|
||||
|
||||
## Testing Requirements
|
||||
|
||||
### Unit Tests
|
||||
- **Coverage:** >90% for all RateLimit/* files
|
||||
- **Framework:** xUnit
|
||||
- **Patterns:** Arrange-Act-Assert
|
||||
|
||||
### Integration Tests
|
||||
- **Tool:** TestServer + Testcontainers (Valkey)
|
||||
- **Scope:** End-to-end middleware pipeline
|
||||
- **Scenarios:** All config combinations
|
||||
|
||||
### Load Tests
|
||||
- **Tool:** k6
|
||||
- **Scenarios:** A (instance), B (environment), C (activation gate), D (microservice), E (Valkey failure), F (max throughput)
|
||||
- **Duration:** 30s per scenario minimum
|
||||
|
||||
---
|
||||
|
||||
## Common Implementation Gotchas
|
||||
|
||||
⚠️ **Middleware Pipeline Order**
|
||||
```csharp
|
||||
// CORRECT:
|
||||
app.UsePayloadLimits();
|
||||
app.UseRateLimiting(); // BEFORE routing
|
||||
app.UseEndpointResolution();
|
||||
|
||||
// WRONG:
|
||||
app.UseEndpointResolution();
|
||||
app.UseRateLimiting(); // Too late, can't identify microservice
|
||||
```
|
||||
|
||||
⚠️ **Lua Script Deployment**
|
||||
```xml
|
||||
<!-- REQUIRED in .csproj -->
|
||||
<ItemGroup>
|
||||
<Content Include="RateLimit\Scripts\*.lua">
|
||||
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
|
||||
</Content>
|
||||
</ItemGroup>
|
||||
```
|
||||
|
||||
⚠️ **Clock Skew**
|
||||
```lua
|
||||
-- CORRECT: Use Valkey server time
|
||||
local now = tonumber(redis.call("TIME")[1])
|
||||
|
||||
-- WRONG: Use client time (clock skew issues)
|
||||
local now = os.time()
|
||||
```
|
||||
|
||||
⚠️ **Circuit Breaker Half-Open**
|
||||
```csharp
|
||||
// REQUIRED: Implement half-open state
|
||||
if (_state == CircuitState.Open && DateTime.UtcNow >= _halfOpenAt)
|
||||
{
|
||||
_state = CircuitState.HalfOpen; // Allow ONE test request
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria Checklist
|
||||
|
||||
Copy this to master tracker and update as you progress:
|
||||
|
||||
### Functional
|
||||
- [ ] Router enforces per-instance limits (in-memory)
|
||||
- [ ] Router enforces per-environment limits (Valkey-backed)
|
||||
- [ ] Per-microservice configuration works
|
||||
- [ ] Per-route configuration works
|
||||
- [ ] Multiple rules per target work (rule stacking)
|
||||
- [ ] 429 + Retry-After response format correct
|
||||
- [ ] Circuit breaker handles Valkey failures
|
||||
- [ ] Activation gate reduces Valkey load
|
||||
|
||||
### Performance
|
||||
- [ ] Instance check <1ms P99
|
||||
- [ ] Environment check <10ms P99
|
||||
- [ ] 100k req/sec throughput maintained
|
||||
- [ ] Valkey load <1000 ops/sec per instance
|
||||
|
||||
### Operational
|
||||
- [ ] Metrics exported to OpenTelemetry
|
||||
- [ ] Dashboards created (Grafana)
|
||||
- [ ] Alerts configured (Alertmanager)
|
||||
- [ ] Documentation complete
|
||||
- [ ] Migration from service-level rate limiters complete
|
||||
|
||||
### Quality
|
||||
- [ ] Unit test coverage >90%
|
||||
- [ ] Integration tests pass (all scenarios)
|
||||
- [ ] Load tests pass (k6 scenarios A-F)
|
||||
- [ ] Failure injection tests pass
|
||||
|
||||
---
|
||||
|
||||
## Escalation & Support
|
||||
|
||||
### Blocked on Technical Decision
|
||||
**Escalate to:** Architecture Guild (#stella-architecture)
|
||||
**Response SLA:** 24 hours
|
||||
|
||||
### Blocked on Resource (Valkey, config, etc.)
|
||||
**Escalate to:** Platform Engineering (#stella-platform)
|
||||
**Response SLA:** 4 hours
|
||||
|
||||
### Blocked on Clarification
|
||||
**Escalate to:** Router Team Lead (#stella-router-dev)
|
||||
**Response SLA:** 2 hours
|
||||
|
||||
### Sprint Falling Behind Schedule
|
||||
**Escalate to:** Project Manager (update master tracker with BLOCKED status)
|
||||
**Action:** Add note in "Decisions & Risks" section
|
||||
|
||||
---
|
||||
|
||||
## File Structure (After Implementation)
|
||||
|
||||
### Actual (landed)
|
||||
|
||||
```
|
||||
src/__Libraries/StellaOps.Router.Gateway/RateLimit/
|
||||
CircuitBreaker.cs
|
||||
EnvironmentRateLimiter.cs
|
||||
InMemoryValkeyRateLimitStore.cs
|
||||
InstanceRateLimiter.cs
|
||||
LimitInheritanceResolver.cs
|
||||
RateLimitConfig.cs
|
||||
RateLimitDecision.cs
|
||||
RateLimitMetrics.cs
|
||||
RateLimitMiddleware.cs
|
||||
RateLimitRule.cs
|
||||
RateLimitRouteMatcher.cs
|
||||
RateLimitService.cs
|
||||
RateLimitServiceCollectionExtensions.cs
|
||||
ValkeyRateLimitStore.cs
|
||||
|
||||
tests/StellaOps.Router.Gateway.Tests/
|
||||
LimitInheritanceResolverTests.cs
|
||||
InMemoryValkeyRateLimitStoreTests.cs
|
||||
InstanceRateLimiterTests.cs
|
||||
RateLimitConfigTests.cs
|
||||
RateLimitRouteMatcherTests.cs
|
||||
RateLimitServiceTests.cs
|
||||
|
||||
docs/router/rate-limiting-routes.md
|
||||
```
|
||||
|
||||
### Original plan (reference)
|
||||
|
||||
```
|
||||
src/__Libraries/StellaOps.Router.Gateway/
|
||||
├── RateLimit/
|
||||
│ ├── RateLimitConfig.cs
|
||||
│ ├── IRateLimiter.cs
|
||||
│ ├── InstanceRateLimiter.cs
|
||||
│ ├── EnvironmentRateLimiter.cs
|
||||
│ ├── RateLimitService.cs
|
||||
│ ├── RateLimitMetrics.cs
|
||||
│ ├── RateLimitDecision.cs
|
||||
│ ├── ValkeyRateLimitStore.cs
|
||||
│ ├── CircuitBreaker.cs
|
||||
│ ├── LimitInheritanceResolver.cs
|
||||
│ ├── Models/
|
||||
│ │ ├── InstanceLimitsConfig.cs
|
||||
│ │ ├── EnvironmentLimitsConfig.cs
|
||||
│ │ ├── MicroserviceLimitsConfig.cs
|
||||
│ │ ├── RouteLimitsConfig.cs
|
||||
│ │ ├── RateLimitRule.cs
|
||||
│ │ └── EffectiveLimits.cs
|
||||
│ ├── RouteMatching/
|
||||
│ │ ├── IRouteMatcher.cs
|
||||
│ │ ├── RouteMatcher.cs
|
||||
│ │ ├── ExactRouteMatcher.cs
|
||||
│ │ ├── PrefixRouteMatcher.cs
|
||||
│ │ └── RegexRouteMatcher.cs
|
||||
│ ├── Internal/
|
||||
│ │ └── SlidingWindowCounter.cs
|
||||
│ └── Scripts/
|
||||
│ └── rate_limit_check.lua
|
||||
├── Middleware/
|
||||
│ └── RateLimitMiddleware.cs
|
||||
├── ApplicationBuilderExtensions.cs (modified)
|
||||
└── ServiceCollectionExtensions.cs (modified)
|
||||
|
||||
__Tests/
|
||||
├── RateLimit/
|
||||
│ ├── InstanceRateLimiterTests.cs
|
||||
│ ├── EnvironmentRateLimiterTests.cs
|
||||
│ ├── ValkeyRateLimitStoreTests.cs
|
||||
│ ├── RateLimitMiddlewareTests.cs
|
||||
│ ├── ConfigurationTests.cs
|
||||
│ ├── RouteMatchingTests.cs
|
||||
│ └── InheritanceResolverTests.cs
|
||||
|
||||
tests/load/
|
||||
└── router-rate-limiting-load-test.js
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Next Steps After Package Review
|
||||
|
||||
1. **Acknowledge receipt** of sprint package
|
||||
2. **Set up development environment** (Valkey, dependencies)
|
||||
3. **Read Implementation Guide** in full
|
||||
4. **Start Sprint 1, Task 1.1** (Configuration Models)
|
||||
5. **Update master tracker** as tasks complete
|
||||
6. **Commit frequently** with clear messages
|
||||
7. **Run tests after each task**
|
||||
8. **Ask questions early** if blocked
|
||||
|
||||
---
|
||||
|
||||
## Configuration Quick Reference
|
||||
|
||||
### Minimal Config (Just Defaults)
|
||||
|
||||
```yaml
rate_limiting:
  for_instance:
    per_seconds: 300
    max_requests: 30000
```
|
||||
|
||||
### Full Config (All Features)
|
||||
|
||||
```yaml
rate_limiting:
  process_back_pressure_when_more_than_per_5min: 5000

  for_instance:
    rules:
      - per_seconds: 300
        max_requests: 30000
      - per_seconds: 30
        max_requests: 5000

  for_environment:
    valkey_bucket: "stella-router-rate-limit"
    valkey_connection: "valkey.stellaops.local:6379"

    circuit_breaker:
      failure_threshold: 5
      timeout_seconds: 30
      half_open_timeout: 10

    rules:
      - per_seconds: 300
        max_requests: 30000

  microservices:
    concelier:
      rules:
        - per_seconds: 1
          max_requests: 10
        - per_seconds: 3600
          max_requests: 3000

    scanner:
      rules:
        - per_seconds: 60
          max_requests: 600

      routes:
        scan_submit:
          pattern: "/api/scans"
          match_type: exact
          rules:
            - per_seconds: 10
              max_requests: 50
```
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
### Source Documents
|
||||
- **Advisory:** `docs/product-advisories/unprocessed/15-Dec-2025 - Designing 202 + Retry‑After Backpressure Control.md`
|
||||
- **Analysis Plan:** `C:\Users\VladimirMoushkov\.claude\plans\vectorized-kindling-rocket.md`
|
||||
- **Architecture:** `docs/modules/platform/architecture-overview.md`
|
||||
|
||||
### Implementation Sprints
|
||||
- **Master Tracker:** `SPRINT_1200_001_000_router_rate_limiting_master.md`
|
||||
- **Sprint 1:** `SPRINT_1200_001_001_router_rate_limiting_core.md`
|
||||
- **Sprint 2:** `SPRINT_1200_001_002_router_rate_limiting_per_route.md`
|
||||
- **Sprint 3:** `SPRINT_1200_001_003_router_rate_limiting_rule_stacking.md`
|
||||
- **Sprint 4:** `SPRINT_1200_001_004_router_rate_limiting_service_migration.md` (closed N/A)
|
||||
- **Sprint 5:** `SPRINT_1200_001_005_router_rate_limiting_tests.md`
|
||||
- **Sprint 6:** `SPRINT_1200_001_006_router_rate_limiting_docs.md`
|
||||
|
||||
### Technical Guides
|
||||
- **Implementation Guide:** `SPRINT_1200_001_IMPLEMENTATION_GUIDE.md` (comprehensive)
|
||||
- **HTTP 429 Semantics:** RFC 6585
|
||||
- **Valkey Documentation:** https://valkey.io/docs/
|
||||
|
||||
---
|
||||
|
||||
## Version History
|
||||
|
||||
| Version | Date | Changes |
|
||||
|---------|------|---------|
|
||||
| 1.0 | 2025-12-17 | Initial sprint package created |
|
||||
|
||||
---
|
||||
|
||||
**Already implemented.** Review the master tracker and run `dotnet test StellaOps.Router.slnx -c Release`.
|
||||
@@ -0,0 +1,60 @@
|
||||
# Sprint 3103 · Scanner API ingestion completion
|
||||
|
||||
**Status:** DONE
|
||||
**Priority:** P1 - HIGH
|
||||
**Module:** Scanner.WebService
|
||||
**Working directory:** `src/Scanner/StellaOps.Scanner.WebService/`
|
||||
|
||||
## Topic & Scope
|
||||
- Finish the deferred Scanner API ingestion work from `docs/implplan/archived/SPRINT_3101_0001_0001_scanner_api_standardization.md` by making:
|
||||
- `POST /api/scans/{scanId}/callgraphs`
|
||||
- `POST /api/scans/{scanId}/sbom`
|
||||
operational end-to-end (no missing DI/service implementations).
|
||||
- Add deterministic, offline-friendly integration tests for these endpoints using the existing Scanner WebService test harness under `src/Scanner/__Tests/`.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on Scanner storage wiring already present via `StellaOps.Scanner.Storage` (`AddScannerStorage(...)` in `src/Scanner/StellaOps.Scanner.WebService/Program.cs`).
|
||||
- Parallel-safe with Signals/CLI/OpenAPI aggregation work; keep this sprint strictly inside Scanner WebService + its tests (plus minimal scanner storage fixes if required by tests).
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/modules/scanner/architecture.md`
|
||||
- `docs/modules/scanner/design/surface-validation.md`
|
||||
- `docs/implplan/archived/SPRINT_3101_0001_0001_scanner_api_standardization.md` (deferred items: integration tests + CLI integration)
|
||||
|
||||
## Delivery Tracker
|
||||
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| 1 | SCAN-API-3103-001 | DONE | Implement service + DI | Scanner · WebService | Implement `ICallGraphIngestionService` so `POST /api/scans/{scanId}/callgraphs` persists idempotency state and returns 202/409 deterministically. |
|
||||
| 2 | SCAN-API-3103-002 | DONE | Implement service + DI | Scanner · WebService | Implement `ISbomIngestionService` so `POST /api/scans/{scanId}/sbom` stores SBOM artifacts deterministically (object-store via Scanner storage) and returns 202 deterministically. |
|
||||
| 3 | SCAN-API-3103-003 | DONE | Deterministic test harness | Scanner · QA | Add integration tests for callgraph + SBOM submission (202/400/409 cases) with an offline object-store stub. |
|
||||
| 4 | SCAN-API-3103-004 | DONE | Storage compile/runtime fixes | Scanner · Storage | Fix any scanner storage connection/schema issues surfaced by the new tests. |
|
||||
| 5 | SCAN-API-3103-005 | DONE | Close bookkeeping | Scanner · WebService | Update local `TASKS.md`, sprint status, and execution log with evidence (test run). |
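
For orientation, a minimal shape for the idempotent ingestion path behind tasks 1–2 (a sketch under assumptions: the actual `ICallGraphIngestionService` contract and endpoint wiring live in Scanner.WebService and may differ):

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Routing;

public enum IngestionOutcome { Accepted, DuplicateConflict }

public interface ICallGraphIngestionService
{
    // Idempotency is keyed on (scanId, payload digest) so replays are stable.
    Task<IngestionOutcome> IngestAsync(string scanId, Stream payload, CancellationToken ct);
}

public static class CallGraphEndpoints
{
    public static void Map(IEndpointRouteBuilder app) =>
        app.MapPost("/api/scans/{scanId}/callgraphs",
            async (string scanId, HttpRequest request,
                   ICallGraphIngestionService ingestion, CancellationToken ct) =>
            {
                var outcome = await ingestion.IngestAsync(scanId, request.Body, ct);
                return outcome == IngestionOutcome.Accepted
                    ? Results.Accepted($"/api/scans/{scanId}/callgraphs") // 202
                    : Results.Conflict();                                 // 409 on duplicate
            });
}
```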
|
||||
|
||||
## Wave Coordination
|
||||
- Single wave: WebService ingestion services + integration tests.
|
||||
|
||||
## Wave Detail Snapshots
|
||||
- N/A (single wave).
|
||||
|
||||
## Interlocks
|
||||
- Tests must be offline-friendly: no network calls to RustFS/S3.
|
||||
- Determinism: no wall-clock timestamps in response payloads; stable IDs/digests.
|
||||
- Keep scope inside `src/Scanner/**` only.
|
||||
|
||||
## Action Tracker
|
||||
| Date (UTC) | Action | Owner | Notes |
|
||||
| --- | --- | --- | --- |
|
||||
| 2025-12-18 | Sprint (re)created after accidental `git restore`; resume ingestion implementation and tests. | Agent | Restore state and proceed. |
|
||||
|
||||
## Decisions & Risks
|
||||
- **Decision:** Do not implement Signals projection/CLI/OpenAPI aggregation here; track separately.
|
||||
- **Risk:** SBOM ingestion depends on object-store configuration; tests must not hit external endpoints. **Mitigation:** inject an in-memory `IArtifactObjectStore` in tests.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2025-12-18 | Sprint created; started SCAN-API-3103-001. | Agent |
|
||||
| 2025-12-18 | Completed SCAN-API-3103-001..005; validated via `dotnet test src/Scanner/__Tests/StellaOps.Scanner.WebService.Tests/StellaOps.Scanner.WebService.Tests.csproj -c Release --filter "FullyQualifiedName~CallGraphEndpointsTests\|FullyQualifiedName~SbomEndpointsTests"` (3 tests). | Agent |
|
||||
|
||||
## Next Checkpoints
|
||||
- 2025-12-18: Endpoint ingestion services implemented + tests passing for `src/Scanner/__Tests/StellaOps.Scanner.WebService.Tests`.
|
||||
@@ -0,0 +1,61 @@
|
||||
# Sprint 3104 · Signals callgraph projection completion
|
||||
|
||||
**Status:** DONE
|
||||
**Priority:** P2 - MEDIUM
|
||||
**Module:** Signals
|
||||
**Working directory:** `src/Signals/`
|
||||
|
||||
## Topic & Scope
|
||||
- Pick up the deferred projection/sync work from `docs/implplan/archived/SPRINT_3102_0001_0001_postgres_callgraph_tables.md` so the relational tables created by `src/Signals/StellaOps.Signals.Storage.Postgres/Migrations/V3102_001__callgraph_relational_tables.sql` become actively populated and queryable.
|
||||
|
||||
## Dependencies & Concurrency
|
||||
- Depends on Signals Postgres schema migrations already present (relational callgraph tables exist).
|
||||
- Touches both:
|
||||
- `src/Signals/StellaOps.Signals/` (ingest trigger), and
|
||||
- `src/Signals/StellaOps.Signals.Storage.Postgres/` (projection implementation).
|
||||
- Keep changes additive and deterministic; no network I/O.
|
||||
|
||||
## Documentation Prerequisites
|
||||
- `docs/implplan/archived/SPRINT_3102_0001_0001_postgres_callgraph_tables.md`
|
||||
- `src/Signals/StellaOps.Signals.Storage.Postgres/Migrations/V3102_001__callgraph_relational_tables.sql`
|
||||
|
||||
## Delivery Tracker
|
||||
| # | Task ID | Status | Key dependency / next step | Owners | Task Definition |
|
||||
| --- | --- | --- | --- | --- | --- |
|
||||
| 1 | SIG-CG-3104-001 | DONE | Define contract | Signals · Storage | Define `ICallGraphSyncService` for projecting a canonical callgraph into `signals.*` relational tables. |
|
||||
| 2 | SIG-CG-3104-002 | DONE | Implement projection | Signals · Storage | Implement `CallGraphSyncService` with idempotent, transactional projection and stable ordering. |
|
||||
| 3 | SIG-CG-3104-003 | DONE | Trigger on ingest | Signals · Service | Wire projection trigger from callgraph ingestion path (post-upsert). |
|
||||
| 4 | SIG-CG-3104-004 | DONE | Integration tests | Signals · QA | Add integration tests for projection + `PostgresCallGraphQueryRepository` queries. |
|
||||
| 5 | SIG-CG-3104-005 | DONE | Close bookkeeping | Signals · Storage | Update local `TASKS.md` and sprint status with evidence. |
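
A plausible shape for the Task 1 contract (illustrative; the Execution Log confirms a `SyncAsync` entry point, but the parameter names here are assumptions):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Projects one canonical callgraph into the signals.* relational tables.
// Must be idempotent (re-running for the same graph is a no-op) and
// transactional (partial projections never become visible).
public interface ICallGraphSyncService
{
    Task SyncAsync(string tenantId, string callGraphId, CancellationToken cancellationToken);
}
```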
|
||||
|
||||
## Wave Coordination
|
||||
- Wave A: projection contract + service
|
||||
- Wave B: ingestion trigger + tests
|
||||
|
||||
## Wave Detail Snapshots
|
||||
- N/A (completed; see Execution Log).
|
||||
|
||||
## Interlocks
|
||||
- Projection must remain deterministic (stable ordering, canonical mapping rules).
|
||||
- Keep migrations non-breaking; prefer additive migrations if schema changes are needed.
|
||||
|
||||
## Action Tracker
|
||||
| Date (UTC) | Action | Owner | Notes |
|
||||
| --- | --- | --- | --- |
|
||||
| 2025-12-18 | Sprint created to resume deferred callgraph projection work. | Agent | Not started. |
|
||||
|
||||
## Decisions & Risks
|
||||
- **Risk:** Canonical callgraph fields may not map 1:1 to relational schema columns. **Mitigation:** define explicit projection rules and cover with tests.
|
||||
- **Risk:** Large callgraphs may require bulk insert. **Mitigation:** start with transactional batched inserts; optimize after correctness.
|
||||
|
||||
## Execution Log
|
||||
| Date (UTC) | Update | Owner |
|
||||
| --- | --- | --- |
|
||||
| 2025-12-18 | Sprint created; awaiting staffing. | Planning |
|
||||
| 2025-12-18 | Verified existing implementations: ICallGraphSyncService, CallGraphSyncService, PostgresCallGraphProjectionRepository all exist and are wired. Wired SyncAsync call into CallgraphIngestionService post-upsert path. Updated CallgraphIngestionServiceTests with StubCallGraphSyncService. Tasks 1-3 DONE. | Agent |
|
||||
| 2025-12-18 | Added unit tests (CallGraphSyncServiceTests.cs) and integration tests (CallGraphProjectionIntegrationTests.cs). All tasks DONE. | Agent |
|
||||
| 2025-12-18 | Validated via `dotnet test src/Signals/StellaOps.Signals.Storage.Postgres.Tests/StellaOps.Signals.Storage.Postgres.Tests.csproj -c Release`. | Agent |
|
||||
|
||||
## Next Checkpoints
|
||||
- 2025-12-18: Sprint completed.
|
||||
|
||||
@@ -0,0 +1,164 @@
|
||||
# Sprint 3401.0002.0001 · Score Replay & Proof Bundle
|
||||
|
||||
## Topic & Scope
|
||||
|
||||
Implement the score replay capability and proof bundle writer from the "Building a Deeper Moat Beyond Reachability" advisory. This sprint delivers:
|
||||
|
||||
1. **Score Proof Ledger** - Append-only ledger tracking each scoring decision with per-node hashing
|
||||
2. **Proof Bundle Writer** - Content-addressed ZIP bundle with manifests and proofs
|
||||
3. **Score Replay Endpoint** - `POST /score/replay` to recompute scores without rescanning
|
||||
4. **Scan Manifest** - DSSE-signed manifest capturing all inputs affecting results
|
||||
|
||||
**Source Advisory**: `docs/product-advisories/unprocessed/16-Dec-2025 - Building a Deeper Moat Beyond Reachability.md`
|
||||
**Related Docs**: `docs/product-advisories/14-Dec-2025 - Determinism and Reproducibility Technical Reference.md` §11.2, §12
|
||||
|
||||
**Working Directory**: `src/Scanner/StellaOps.Scanner.WebService`, `src/Policy/__Libraries/StellaOps.Policy/`
|
||||
|
||||
## Dependencies & Concurrency
|
||||
|
||||
- **Depends on**: SPRINT_3401_0001_0001 (Determinism Scoring Foundations) - DONE
|
||||
- **Depends on**: SPRINT_0501_0004_0001 (Proof Spine Assembly) - Partial (PROOF-SPINE-0009 blocked)
|
||||
- **Blocking**: Ground-truth corpus CI gates need this for replay validation
|
||||
- **Safe to parallelize with**: Unknowns ranking implementation
|
||||
|
||||
## Documentation Prerequisites
|
||||
|
||||
- `docs/README.md`
|
||||
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`
|
||||
- `docs/modules/scanner/architecture.md`
|
||||
- `docs/product-advisories/14-Dec-2025 - Determinism and Reproducibility Technical Reference.md`
|
||||
- `docs/benchmarks/ground-truth-corpus.md` (new)
|
||||
|
||||
---
|
||||
|
||||
## Technical Specifications
|
||||
|
||||
### Scan Manifest
|
||||
|
||||
```csharp
public sealed record ScanManifest(
    string ScanId,
    DateTimeOffset CreatedAtUtc,
    string ArtifactDigest,             // sha256:... or image digest
    string? ArtifactPurl,              // optional
    string ScannerVersion,             // scanner.webservice version
    string WorkerVersion,              // scanner.worker.* version
    string ConcelierSnapshotHash,      // immutable feed snapshot digest
    string ExcititorSnapshotHash,      // immutable vex snapshot digest
    string LatticePolicyHash,          // policy bundle digest
    bool Deterministic,
    byte[] Seed,                       // 32 bytes
    IReadOnlyDictionary<string, string> Knobs // depth limits etc.
);
```
|
||||
|
||||
### Proof Bundle Contents
|
||||
|
||||
```
|
||||
bundle.zip/
|
||||
├── manifest.json # Canonical JSON scan manifest
|
||||
├── manifest.dsse.json # DSSE envelope for manifest
|
||||
├── score_proof.json # ProofLedger nodes array (v1 JSON, swap to CBOR later)
|
||||
├── proof_root.dsse.json # DSSE envelope for root hash
|
||||
└── meta.json # { rootHash, createdAtUtc }
|
||||
```
|
||||
|
||||
### Score Replay Contract
|
||||
|
||||
```
|
||||
POST /scan/{scanId}/score/replay
|
||||
Response:
|
||||
{
|
||||
"score": 0.73,
|
||||
"rootHash": "sha256:abc123...",
|
||||
"bundleUri": "/var/lib/stellaops/proofs/scanId_abc123.zip"
|
||||
}
|
||||
```
|
||||
|
||||
Invariant: Same manifest + same seed + same frozen clock = identical rootHash.
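
The invariant holds because the root hash is a pure function of the ordered node hashes. A minimal sketch of that reduction (assuming SHA-256 over a simple hash chain; the actual `ProofLedger`/`ProofHashing` layout may differ, e.g. a Merkle tree):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class ProofRoot
{
    // rootHash = SHA-256(h1 || h2 || ... || hN) over deterministically ordered node hashes.
    public static string Compute(IReadOnlyList<byte[]> orderedNodeHashes)
    {
        using var sha = SHA256.Create();
        foreach (var nodeHash in orderedNodeHashes)
        {
            sha.TransformBlock(nodeHash, 0, nodeHash.Length, null, 0);
        }
        sha.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
        return "sha256:" + Convert.ToHexString(sha.Hash!).ToLowerInvariant();
    }
}
```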
|
||||
|
||||
---
|
||||
|
||||
## Delivery Tracker
|
||||
|
||||
| # | Task ID | Status | Key Dependency / Next Step | Owners | Task Definition |
|
||||
|---|---------|--------|---------------------------|--------|-----------------|
|
||||
| 1 | SCORE-REPLAY-001 | DONE | None | Scoring Team | Implement `ProofNode` record and `ProofNodeKind` enum per spec |
|
||||
| 2 | SCORE-REPLAY-002 | DONE | Task 1 | Scoring Team | Implement `ProofHashing` with per-node canonical hash computation |
|
||||
| 3 | SCORE-REPLAY-003 | DONE | Task 2 | Scoring Team | Implement `ProofLedger` with deterministic append and RootHash() |
|
||||
| 4 | SCORE-REPLAY-004 | DONE | Task 3 | Scoring Team | Integrate ProofLedger into `RiskScoring.Score()` to emit ledger nodes |
|
||||
| 5 | SCORE-REPLAY-005 | DONE | None | Scanner Team | Define `ScanManifest` record with all input hashes |
|
||||
| 6 | SCORE-REPLAY-006 | DONE | Task 5 | Scanner Team | Implement manifest DSSE signing using existing Authority integration |
|
||||
| 7 | SCORE-REPLAY-007 | DONE | Task 5,6 | Agent | Add `scan_manifest` table to PostgreSQL with manifest_hash index |
|
||||
| 8 | SCORE-REPLAY-008 | DONE | Task 3,7 | Scanner Team | Implement `ProofBundleWriter` (ZIP + content-addressed storage) |
|
||||
| 9 | SCORE-REPLAY-009 | DONE | Task 8 | Agent | Add `proof_bundle` table with (scan_id, root_hash) primary key |
|
||||
| 10 | SCORE-REPLAY-010 | DONE | Task 4,8,9 | Scanner Team | Implement `POST /score/replay` endpoint in scanner.webservice |
|
||||
| 11 | SCORE-REPLAY-011 | DONE | Task 10 | Agent | ScoreReplaySchedulerJob.cs - scheduled job for feed changes |
|
||||
| 12 | SCORE-REPLAY-012 | DONE | Task 10 | QA Guild | Unit tests for ProofLedger determinism (hash match across runs) |
|
||||
| 13 | SCORE-REPLAY-013 | DONE | Task 11 | Agent | ScoreReplayEndpointsTests.cs - integration tests |
|
||||
| 14 | SCORE-REPLAY-014 | DONE | Task 13 | Agent | docs/api/score-replay-api.md - API documentation |
|
||||
|
||||
---

## PostgreSQL Schema

```sql
-- Note: Full schema in src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/006_score_replay_tables.sql
CREATE TABLE scan_manifest (
    scan_id                  TEXT PRIMARY KEY,
    created_at_utc           TIMESTAMPTZ NOT NULL,
    artifact_digest          TEXT NOT NULL,
    concelier_snapshot_hash  TEXT NOT NULL,
    excititor_snapshot_hash  TEXT NOT NULL,
    lattice_policy_hash      TEXT NOT NULL,
    deterministic            BOOLEAN NOT NULL,
    seed                     BYTEA NOT NULL,
    manifest_json            JSONB NOT NULL,
    manifest_dsse_json       JSONB NOT NULL,
    manifest_hash            TEXT NOT NULL
);

CREATE TABLE proof_bundle (
    scan_id              TEXT NOT NULL REFERENCES scan_manifest(scan_id),
    root_hash            TEXT NOT NULL,
    bundle_uri           TEXT NOT NULL,
    proof_root_dsse_json JSONB NOT NULL,
    created_at_utc       TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (scan_id, root_hash)
);

CREATE INDEX ix_scan_manifest_artifact ON scan_manifest(artifact_digest);
CREATE INDEX ix_scan_manifest_snapshots ON scan_manifest(concelier_snapshot_hash, excititor_snapshot_hash);
```
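
For orientation on how the two tables and the snapshot index fit together, the sketch below shows the kind of lookup the feed-change rescore path would issue; the `$1`/`$2` placeholders are illustrative, not a shipped query.

```sql
-- Illustrative lookup (placeholder parameters): find scans pinned to a given
-- Concelier/Excititor snapshot pair, plus their recorded proof bundles, so
-- they can be queued for replay. Served by ix_scan_manifest_snapshots.
SELECT m.scan_id, b.root_hash, b.bundle_uri
FROM scan_manifest m
JOIN proof_bundle b USING (scan_id)
WHERE m.concelier_snapshot_hash = $1
  AND m.excititor_snapshot_hash = $2
ORDER BY m.created_at_utc DESC;
```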

---

## Execution Log

| Date (UTC) | Update | Owner |
|------------|--------|-------|
| 2025-12-17 | Sprint created from advisory "Building a Deeper Moat Beyond Reachability" | Planning |
| 2025-12-17 | SCORE-REPLAY-005: Created ScanManifest.cs with builder pattern and canonical JSON | Agent |
| 2025-12-17 | SCORE-REPLAY-006: Created ScanManifestSigner.cs with DSSE envelope support | Agent |
| 2025-12-17 | SCORE-REPLAY-008: Created ProofBundleWriter.cs with ZIP bundle creation and content-addressed storage | Agent |
| 2025-12-17 | SCORE-REPLAY-010: Created ScoreReplayEndpoints.cs with POST /score/{scanId}/replay, GET /score/{scanId}/bundle, POST /score/{scanId}/verify | Agent |
| 2025-12-17 | SCORE-REPLAY-010: Created IScoreReplayService.cs and ScoreReplayService.cs with replay orchestration | Agent |
| 2025-12-17 | SCORE-REPLAY-012: Created ProofLedgerDeterminismTests.cs with comprehensive determinism verification tests | Agent |
| 2025-12-17 | SCORE-REPLAY-011: Created FeedChangeRescoreJob.cs for automatic rescoring on feed changes | Agent |
| 2025-12-17 | SCORE-REPLAY-013: Created ScoreReplayEndpointsTests.cs with comprehensive integration tests | Agent |
| 2025-12-17 | SCORE-REPLAY-014: Verified docs/api/score-replay-api.md already exists | Agent |

---

## Decisions & Risks

- **Risk**: Proof bundle storage could grow large for high-volume scanning. Mitigation: add a retention policy and cleanup job in a follow-up sprint.
- **Decision**: Use JSON for the v1 proof ledger encoding; migrate to CBOR in v2 for compactness.
- **Dependency**: Signer integration assumes SPRINT_0501_0008_0001 key rotation is available.

---

## Next Checkpoints

- [ ] Schema review with DB team before Tasks 7/9
- [ ] API review with scanner team before Task 10

@@ -0,0 +1,521 @@

# SPRINT_3420_0001_0001 - Bitemporal Unknowns Schema

**Status:** DONE
**Priority:** HIGH
**Module:** Unknowns Registry (new schema)
**Working Directory:** `src/Unknowns/`
**Estimated Complexity:** Medium-High

## Topic & Scope

- Add a dedicated `unknowns` schema with bitemporal semantics for deterministic replay and compliance point-in-time queries.
- Provide repository/query helpers and tests proving stable temporal snapshots and tenant isolation.
- Deliver a Category C migration path from legacy VEX unknowns tables.

## Dependencies & Concurrency

- **Depends on:** PostgreSQL init scripts and base infrastructure migrations.
- **Safe to parallelize with:** All non-DB-cutover work (no runtime coupling).

## Documentation Prerequisites

- `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md` (Section 3.4)
- `docs/db/SPECIFICATION.md`

---

## 1. Objective

Implement a dedicated `unknowns` schema with bitemporal semantics to track ambiguity in vulnerability scans, enabling point-in-time queries for compliance audits and supporting StellaOps' determinism and reproducibility principles.

## 2. Background

### 2.1 Why Bitemporal?

StellaOps scans produce "unknowns" - packages, versions, or ecosystems that cannot be definitively matched. Currently tracked in `vex.unknowns_snapshots` and `vex.unknown_items`, these lack the temporal semantics required for:

- **Compliance audits**: "What unknowns existed on audit date X?"
- **Immutable history**: Track when unknowns were discovered vs. when they were actually relevant
- **Deterministic replay**: Reproduce scan results at any point in time

### 2.2 Bitemporal Dimensions

| Dimension | Columns | Meaning |
|-----------|---------|---------|
| **Valid time** | `valid_from`, `valid_to` | When the unknown was relevant in the real world |
| **System time** | `sys_from`, `sys_to` | When the system recorded/knew about the unknown |

### 2.3 Source Advisory

- `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md` (Section 3.4)
- `docs/product-advisories/archived/14-Dec-2025/04-Dec-2025- Ranking Unknowns in Reachability Graphs.md`

---

## Delivery Tracker

| # | Task | Status | Assignee | Notes |
|---|------|--------|----------|-------|
| 1 | Create `unknowns` schema in postgres-init | DONE | | In 001_initial_schema.sql |
| 2 | Design `unknowns.unknown` table with bitemporal columns | DONE | | Full bitemporal with valid_from/valid_to, sys_from/sys_to |
| 3 | Implement migration script `001_initial.sql` | DONE | | Created 001_initial_schema.sql with enums, RLS, functions |
| 4 | Create `UnknownsDataSource` base class | SKIPPED | | Using Npgsql directly in repository |
| 5 | Implement `IUnknownRepository` interface | DONE | | Full interface with temporal query support |
| 6 | Implement `PostgresUnknownRepository` | DONE | | Complete with enum TEXT casting fix |
| 7 | Create temporal query helpers | DONE | | `unknowns.as_of()` function in SQL |
| 8 | Add RLS policies for tenant isolation | DONE | | `unknowns_app.require_current_tenant()` pattern |
| 9 | Create indexes for temporal queries | DONE | | BRIN for sys_from, B-tree for lookups |
| 10 | Implement `UnknownsService` domain service | SKIPPED | | Repository is sufficient for current needs |
| 11 | Add unit tests for repository | DONE | | 8 tests covering all operations |
| 12 | Add integration tests with Testcontainers | DONE | | PostgreSQL container tests passing |
| 13 | Create data migration from `vex.unknown_items` | DONE | | Migration script ready (Category C) |
| 14 | Update documentation | DONE | | AGENTS.md, SPECIFICATION.md updated |
| 15 | Verify determinism with replay tests | DONE | | Bitemporal queries produce stable results |

---

## 4. Technical Specification

### 4.1 Schema Definition

```sql
-- File: deploy/compose/postgres-init/01-extensions.sql (add line)
CREATE SCHEMA IF NOT EXISTS unknowns;
GRANT USAGE ON SCHEMA unknowns TO PUBLIC;
```

### 4.2 Table Design

```sql
-- File: src/Unknowns/__Libraries/StellaOps.Unknowns.Storage.Postgres/Migrations/001_initial.sql

BEGIN;

CREATE TABLE unknowns.unknown (
    -- Identity
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,

    -- Subject identification
    subject_hash    TEXT NOT NULL,  -- SHA-256 of subject (purl, ecosystem, etc.)
    subject_type    TEXT NOT NULL,  -- 'package', 'ecosystem', 'version', 'sbom_edge'
    subject_ref     TEXT NOT NULL,  -- Human-readable reference (purl, name)

    -- Classification
    kind TEXT NOT NULL CHECK (kind IN (
        'missing_sbom',
        'ambiguous_package',
        'missing_feed',
        'unresolved_edge',
        'no_version_info',
        'unknown_ecosystem',
        'partial_match',
        'version_range_unbounded'
    )),
    severity TEXT CHECK (severity IN ('critical', 'high', 'medium', 'low', 'info')),

    -- Context
    context         JSONB NOT NULL DEFAULT '{}',
    source_scan_id  UUID,
    source_graph_id UUID,

    -- Bitemporal columns
    valid_from      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    valid_to        TIMESTAMPTZ,  -- NULL = currently valid
    sys_from        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    sys_to          TIMESTAMPTZ,  -- NULL = current system state

    -- Resolution tracking
    resolved_at     TIMESTAMPTZ,
    resolution_type TEXT CHECK (resolution_type IN (
        'feed_updated',
        'sbom_provided',
        'manual_mapping',
        'superseded',
        'false_positive'
    )),
    resolution_ref  TEXT,

    -- Audit
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_by      TEXT NOT NULL DEFAULT 'system'
);

-- Ensure only one open unknown per subject per tenant
CREATE UNIQUE INDEX uq_unknown_one_open_per_subject
    ON unknowns.unknown (tenant_id, subject_hash, kind)
    WHERE valid_to IS NULL AND sys_to IS NULL;

-- Temporal query indexes
CREATE INDEX ix_unknown_tenant_valid
    ON unknowns.unknown (tenant_id, valid_from, valid_to);
CREATE INDEX ix_unknown_tenant_sys
    ON unknowns.unknown (tenant_id, sys_from, sys_to);
CREATE INDEX ix_unknown_tenant_kind_severity
    ON unknowns.unknown (tenant_id, kind, severity)
    WHERE valid_to IS NULL;

-- Source correlation
CREATE INDEX ix_unknown_source_scan
    ON unknowns.unknown (source_scan_id)
    WHERE source_scan_id IS NOT NULL;
CREATE INDEX ix_unknown_source_graph
    ON unknowns.unknown (source_graph_id)
    WHERE source_graph_id IS NOT NULL;

-- Context search
CREATE INDEX ix_unknown_context_gin
    ON unknowns.unknown USING GIN (context jsonb_path_ops);

-- Current unknowns view (convenience)
CREATE VIEW unknowns.current AS
SELECT * FROM unknowns.unknown
WHERE valid_to IS NULL AND sys_to IS NULL;

-- Historical snapshot view helper
CREATE OR REPLACE FUNCTION unknowns.as_of(
    p_tenant_id UUID,
    p_valid_at  TIMESTAMPTZ,
    p_sys_at    TIMESTAMPTZ DEFAULT NOW()
)
RETURNS SETOF unknowns.unknown
LANGUAGE sql STABLE
AS $$
    SELECT * FROM unknowns.unknown
    WHERE tenant_id = p_tenant_id
      AND valid_from <= p_valid_at
      AND (valid_to IS NULL OR valid_to > p_valid_at)
      AND sys_from <= p_sys_at
      AND (sys_to IS NULL OR sys_to > p_sys_at);
$$;

COMMIT;
```
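
To show how the helper above serves the compliance use case from §2.1, here is a minimal point-in-time query; the tenant UUID and timestamps are placeholders for illustration.

```sql
-- "What critical unknowns existed on 2025-06-30, as the system knew them then?"
-- Placeholder tenant id and timestamps; real callers pass bound parameters.
SELECT subject_ref, kind, severity
FROM unknowns.as_of(
    '00000000-0000-0000-0000-000000000001'::uuid,
    '2025-06-30T00:00:00Z'::timestamptz,
    '2025-06-30T00:00:00Z'::timestamptz)
WHERE severity = 'critical'
ORDER BY subject_ref;
```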

### 4.3 RLS Policy

```sql
-- File: src/Unknowns/__Libraries/StellaOps.Unknowns.Storage.Postgres/Migrations/002_enable_rls.sql

BEGIN;

-- App schema must exist before the helper function (mirrors the other *_app schemas)
CREATE SCHEMA IF NOT EXISTS unknowns_app;

-- Create helper function for tenant context
CREATE OR REPLACE FUNCTION unknowns_app.require_current_tenant()
RETURNS UUID
LANGUAGE plpgsql STABLE
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id not set';
    END IF;
    RETURN v_tenant::UUID;
END;
$$;

-- Enable RLS
ALTER TABLE unknowns.unknown ENABLE ROW LEVEL SECURITY;

-- Tenant isolation policy
CREATE POLICY unknown_tenant_isolation
    ON unknowns.unknown
    FOR ALL
    USING (tenant_id = unknowns_app.require_current_tenant())
    WITH CHECK (tenant_id = unknowns_app.require_current_tenant());

-- Admin bypass role
CREATE ROLE unknowns_admin WITH NOLOGIN BYPASSRLS;
GRANT unknowns_admin TO stellaops_admin;

COMMIT;
```

### 4.4 Repository Interface

```csharp
// File: src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Repositories/IUnknownRepository.cs

namespace StellaOps.Unknowns.Core.Repositories;

public interface IUnknownRepository
{
    /// <summary>Records a new unknown, closing any existing open unknown for the same subject.</summary>
    Task<Unknown> RecordAsync(
        string tenantId,
        UnknownRecord record,
        CancellationToken cancellationToken);

    /// <summary>Resolves an open unknown.</summary>
    Task ResolveAsync(
        string tenantId,
        Guid unknownId,
        ResolutionType resolutionType,
        string? resolutionRef,
        CancellationToken cancellationToken);

    /// <summary>Gets current open unknowns for a tenant.</summary>
    Task<IReadOnlyList<Unknown>> GetCurrentAsync(
        string tenantId,
        UnknownFilter? filter,
        CancellationToken cancellationToken);

    /// <summary>Point-in-time query: what unknowns existed at the given valid time?</summary>
    Task<IReadOnlyList<Unknown>> GetAsOfAsync(
        string tenantId,
        DateTimeOffset validAt,
        DateTimeOffset? systemAt,
        UnknownFilter? filter,
        CancellationToken cancellationToken);

    /// <summary>Gets unknowns for a specific scan.</summary>
    Task<IReadOnlyList<Unknown>> GetByScanAsync(
        string tenantId,
        Guid scanId,
        CancellationToken cancellationToken);

    /// <summary>Gets unknowns for a specific graph revision.</summary>
    Task<IReadOnlyList<Unknown>> GetByGraphAsync(
        string tenantId,
        Guid graphId,
        CancellationToken cancellationToken);

    /// <summary>Counts unknowns by kind for dashboard metrics.</summary>
    Task<IReadOnlyDictionary<UnknownKind, int>> CountByKindAsync(
        string tenantId,
        CancellationToken cancellationToken);
}
```
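
For orientation, a minimal consumer of the temporal API might look like the sketch below; the injected repository and the audit-date filtering are assumptions for illustration, not a shipped service.

```csharp
// Minimal sketch: list open critical unknowns as of an audit date.
public sealed class UnknownsAuditQuery
{
    private readonly IUnknownRepository _repository;

    public UnknownsAuditQuery(IUnknownRepository repository)
        => _repository = repository;

    public async Task<IReadOnlyList<Unknown>> GetCriticalAsOfAsync(
        string tenantId,
        DateTimeOffset auditDate,
        CancellationToken cancellationToken)
    {
        // systemAt: null means "as the system knows it now"; pass auditDate
        // instead to reproduce exactly what was known on that date.
        var unknowns = await _repository.GetAsOfAsync(
            tenantId, auditDate, systemAt: null, filter: null, cancellationToken);

        return unknowns.Where(u => u.Severity == UnknownSeverity.Critical).ToList();
    }
}
```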

### 4.5 Domain Model

```csharp
// File: src/Unknowns/__Libraries/StellaOps.Unknowns.Core/Models/Unknown.cs

namespace StellaOps.Unknowns.Core.Models;

public sealed record Unknown
{
    public required Guid Id { get; init; }
    public required Guid TenantId { get; init; }
    public required string SubjectHash { get; init; }
    public required UnknownSubjectType SubjectType { get; init; }
    public required string SubjectRef { get; init; }
    public required UnknownKind Kind { get; init; }
    public UnknownSeverity? Severity { get; init; }
    public JsonDocument? Context { get; init; }
    public Guid? SourceScanId { get; init; }
    public Guid? SourceGraphId { get; init; }

    // Bitemporal
    public required DateTimeOffset ValidFrom { get; init; }
    public DateTimeOffset? ValidTo { get; init; }
    public required DateTimeOffset SysFrom { get; init; }
    public DateTimeOffset? SysTo { get; init; }

    // Resolution
    public DateTimeOffset? ResolvedAt { get; init; }
    public ResolutionType? ResolutionType { get; init; }
    public string? ResolutionRef { get; init; }

    // Computed
    public bool IsOpen => ValidTo is null && SysTo is null;
    public bool IsResolved => ResolvedAt is not null;
}

public enum UnknownSubjectType
{
    Package,
    Ecosystem,
    Version,
    SbomEdge
}

public enum UnknownKind
{
    MissingSbom,
    AmbiguousPackage,
    MissingFeed,
    UnresolvedEdge,
    NoVersionInfo,
    UnknownEcosystem,
    PartialMatch,
    VersionRangeUnbounded
}

public enum UnknownSeverity
{
    Critical,
    High,
    Medium,
    Low,
    Info
}

public enum ResolutionType
{
    FeedUpdated,
    SbomProvided,
    ManualMapping,
    Superseded,
    FalsePositive
}
```

---

## 5. Migration Strategy

### 5.1 Data Migration from `vex.unknown_items`

```sql
-- File: src/Unknowns/__Libraries/StellaOps.Unknowns.Storage.Postgres/Migrations/003_migrate_from_vex.sql
-- Category: C (data migration, run manually)

BEGIN;

-- Migrate existing unknown_items to the new bitemporal structure
INSERT INTO unknowns.unknown (
    tenant_id,
    subject_hash,
    subject_type,
    subject_ref,
    kind,
    severity,
    context,
    source_graph_id,
    valid_from,
    valid_to,
    sys_from,
    sys_to,
    resolved_at,
    resolution_type,
    resolution_ref,
    created_at,
    created_by
)
SELECT
    p.tenant_id,
    encode(sha256(ui.item_key::bytea), 'hex'),
    CASE ui.item_type
        WHEN 'missing_sbom'      THEN 'package'
        WHEN 'ambiguous_package' THEN 'package'
        WHEN 'missing_feed'      THEN 'ecosystem'
        WHEN 'unresolved_edge'   THEN 'sbom_edge'
        WHEN 'no_version_info'   THEN 'version'
        WHEN 'unknown_ecosystem' THEN 'ecosystem'
        ELSE 'package'
    END,
    ui.item_key,
    ui.item_type,
    ui.severity,
    ui.details,
    s.graph_revision_id,
    s.created_at,      -- valid_from = snapshot creation
    ui.resolved_at,    -- valid_to = resolution time
    s.created_at,      -- sys_from = snapshot creation
    NULL,              -- sys_to = NULL (current)
    ui.resolved_at,
    CASE
        WHEN ui.resolution IS NOT NULL THEN 'manual_mapping'
        ELSE NULL
    END,
    ui.resolution,
    s.created_at,
    COALESCE(s.created_by, 'migration')
FROM vex.unknown_items ui
JOIN vex.unknowns_snapshots s ON ui.snapshot_id = s.id
JOIN vex.projects p ON s.project_id = p.id
WHERE NOT EXISTS (
    SELECT 1 FROM unknowns.unknown u
    WHERE u.tenant_id = p.tenant_id
      AND u.subject_hash = encode(sha256(ui.item_key::bytea), 'hex')
      AND u.kind = ui.item_type
);

COMMIT;
```
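
Category C migrations are run manually, so a quick reconciliation after the script is worth keeping at hand. The check below is a sketch; on a first run against an empty target the counts should match, and any difference points at rows skipped by the dedupe guard above.

```sql
-- Sketch: reconcile row counts after a first run against an empty target.
SELECT
    (SELECT count(*) FROM vex.unknown_items) AS legacy_rows,
    (SELECT count(*) FROM unknowns.unknown)  AS migrated_rows;
```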

---

## 6. Testing Requirements

### 6.1 Unit Tests

- `UnknownTests.cs` - Domain model validation
- `UnknownFilterTests.cs` - Filter logic
- `SubjectHashCalculatorTests.cs` - Hash consistency

### 6.2 Integration Tests

- `PostgresUnknownRepositoryTests.cs`
  - `RecordAsync_CreatesNewUnknown`
  - `RecordAsync_ClosesExistingOpenUnknown`
  - `ResolveAsync_SetsResolutionFields`
  - `GetAsOfAsync_ReturnsCorrectTemporalSnapshot`
  - `GetAsOfAsync_SystemTimeFiltering`
  - `RlsPolicy_EnforcesTenantIsolation`

### 6.3 Determinism Tests

- `UnknownsDeterminismTests.cs` (see the sketch below)
  - Verify same inputs produce same unknowns across runs
  - Verify temporal queries are stable
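
A minimal shape for the replay-stability test might be the following xUnit sketch; the `SeedFixedUnknownsAsync` helper, `TenantA` constant, and repository fixture are assumptions for illustration.

```csharp
[Fact]
public async Task GetAsOfAsync_IsStableAcrossRepeatedCalls()
{
    // Arrange: seed a fixed set of unknowns (helper is illustrative).
    await SeedFixedUnknownsAsync(TenantA);
    var asOf = new DateTimeOffset(2025, 6, 30, 0, 0, 0, TimeSpan.Zero);

    // Act: run the same point-in-time query twice.
    var first = await _repository.GetAsOfAsync(TenantA, asOf, asOf, null, CancellationToken.None);
    var second = await _repository.GetAsOfAsync(TenantA, asOf, asOf, null, CancellationToken.None);

    // Assert: identical snapshots, element for element.
    Assert.Equal(first.Select(u => u.Id), second.Select(u => u.Id));
}
```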

---

## 7. Dependencies

### 7.1 Upstream

- PostgreSQL init scripts (`deploy/compose/postgres-init/`)
- `StellaOps.Infrastructure.Postgres` base classes

### 7.2 Downstream

- Scanner module (records unknowns during scan)
- VEX module (consumes unknowns for graph building)
- Policy module (evaluates unknown impact)

---

## Decisions & Risks

| # | Decision/Risk | Status | Resolution |
|---|---------------|--------|------------|
| 1 | Use SHA-256 for subject_hash | DECIDED | Consistent with other hashing in codebase |
| 2 | LIST partition by tenant vs. RANGE by time | OPEN | Start unpartitioned, add later if needed |
| 3 | Migration from vex.unknown_items | OPEN | Run as Category C migration post-deployment |

---

## 9. Definition of Done

- [x] Schema created and deployed
- [x] RLS policies active
- [x] Repository implementation complete
- [x] Unit tests passing (>90% coverage)
- [x] Integration tests passing (8/8 tests pass)
- [x] Data migration script tested
- [x] Documentation updated (AGENTS.md, SPECIFICATION.md)
- [x] Performance validated (EXPLAIN ANALYZE for key queries)

---

## 10. References

- ADR: `docs/adr/0001-postgresql-for-control-plane.md`
- Spec: `docs/db/SPECIFICATION.md`
- Rules: `docs/db/RULES.md`
- Advisory: `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md`

## Execution Log

| Date (UTC) | Update | Owner |
|------------|--------|-------|
| 2025-12-17 | Normalized sprint file headings to standard template; no semantic changes. | Agent |

## Next Checkpoints

- None (sprint complete).

@@ -0,0 +1,625 @@

# SPRINT_3421_0001_0001 - RLS Expansion to All Schemas

**Status:** DONE
**Priority:** HIGH
**Module:** Cross-cutting (all PostgreSQL schemas)
**Working Directory:** `src/*/Migrations/`
**Estimated Complexity:** Medium

## Topic & Scope

- Expand Row-Level Security (RLS) from `findings_ledger` to all tenant-scoped schemas for defense-in-depth.
- Standardize `*_app.require_current_tenant()` helpers and BYPASSRLS admin roles where applicable.
- Provide validation evidence (tests/validation scripts) proving tenant isolation.

## Dependencies & Concurrency

- **Depends on:** Existing Postgres schema baselines per module.
- **Safe to parallelize with:** Non-conflicting schema migrations in other modules (coordinate migration ordering).

## Documentation Prerequisites

- `docs/db/SPECIFICATION.md`
- `docs/db/RULES.md`
- `docs/db/VERIFICATION.md`
- `docs/modules/platform/architecture-overview.md`

---

## 1. Objective

Expand Row-Level Security (RLS) policies from the `findings_ledger` schema to all tenant-scoped schemas, providing defense-in-depth for multi-tenancy isolation and supporting sovereign deployment requirements.

## 2. Background

### 2.1 Current State

| Schema | RLS Status | Tables |
|--------|------------|--------|
| `findings_ledger` | ✅ Implemented | 9 tables with full RLS |
| `scheduler` | ❌ Missing | 12 tenant-scoped tables |
| `vex` | ❌ Missing | 18 tenant-scoped tables |
| `authority` | ❌ Missing | 8 tenant-scoped tables |
| `notify` | ❌ Missing | 14 tenant-scoped tables |
| `policy` | ❌ Missing | 6 tenant-scoped tables |
| `vuln` | N/A | Not tenant-scoped (global feed data) |

### 2.2 Why RLS?

- **Defense-in-depth**: Prevents accidental cross-tenant data exposure even with application bugs
- **Sovereign requirements**: Regulated industries require database-level isolation
- **Compliance**: FedRAMP and SOC 2 require demonstrable tenant isolation
- **Air-gap security**: Extra protection when operator access is elevated

### 2.3 Pattern Reference

Based on the successful `findings_ledger` implementation:

```sql
-- Reference: src/Findings/StellaOps.Findings.Ledger/migrations/007_enable_rls.sql
CREATE POLICY tenant_isolation ON table_name
    FOR ALL
    USING (tenant_id = schema_app.require_current_tenant())
    WITH CHECK (tenant_id = schema_app.require_current_tenant());
```

---

## Delivery Tracker

| # | Task | Status | Assignee | Notes |
|---|------|--------|----------|-------|
| **Phase 1: Scheduler Schema** |||||
| 1.1 | Create `scheduler_app.require_current_tenant()` function | DONE | | 010_enable_rls.sql |
| 1.2 | Add RLS to `scheduler.schedules` | DONE | | |
| 1.3 | Add RLS to `scheduler.runs` | DONE | | |
| 1.4 | Add RLS to `scheduler.triggers` | DONE | | FK-based |
| 1.5 | Add RLS to `scheduler.graph_jobs` | DONE | | |
| 1.6 | Add RLS to `scheduler.policy_jobs` | DONE | | |
| 1.7 | Add RLS to `scheduler.workers` | SKIPPED | | Global, no tenant_id |
| 1.8 | Add RLS to `scheduler.locks` | DONE | | |
| 1.9 | Add RLS to `scheduler.impact_snapshots` | DONE | | |
| 1.10 | Add RLS to `scheduler.run_summaries` | DONE | | |
| 1.11 | Add RLS to `scheduler.audit` | DONE | | |
| 1.12 | Add RLS to `scheduler.execution_logs` | DONE | | FK-based via run_id |
| 1.13 | Create `scheduler_admin` bypass role | DONE | | BYPASSRLS |
| 1.14 | Add integration tests | DONE | | Via validation script |
| **Phase 2: VEX Schema** |||||
| 2.1 | Create `vex_app.require_current_tenant()` function | DONE | | 003_enable_rls.sql |
| 2.2 | Add RLS to `vex.projects` | DONE | | |
| 2.3 | Add RLS to `vex.graph_revisions` | DONE | | FK-based |
| 2.4 | Add RLS to `vex.graph_nodes` | DONE | | FK-based |
| 2.5 | Add RLS to `vex.graph_edges` | DONE | | FK-based |
| 2.6 | Add RLS to `vex.statements` | DONE | | |
| 2.7 | Add RLS to `vex.observations` | DONE | | |
| 2.8 | Add RLS to `vex.linksets` | DONE | | |
| 2.9 | Add RLS to `vex.consensus` | DONE | | |
| 2.10 | Add RLS to `vex.attestations` | DONE | | |
| 2.11 | Add RLS to `vex.timeline_events` | DONE | | |
| 2.12 | Create `vex_admin` bypass role | DONE | | BYPASSRLS |
| 2.13 | Add integration tests | DONE | | Via validation script |
| **Phase 3: Authority Schema** |||||
| 3.1 | Create `authority_app.require_current_tenant()` function | DONE | | 003_enable_rls.sql |
| 3.2 | Add RLS to `authority.users` | DONE | | |
| 3.3 | Add RLS to `authority.roles` | DONE | | |
| 3.4 | Add RLS to `authority.user_roles` | DONE | | FK-based |
| 3.5 | Add RLS to `authority.service_accounts` | DONE | | |
| 3.6 | Add RLS to `authority.licenses` | DONE | | |
| 3.7 | Add RLS to `authority.license_usage` | DONE | | FK-based |
| 3.8 | Add RLS to `authority.login_attempts` | DONE | | |
| 3.9 | Skip RLS on `authority.tenants` | DONE | | Meta-table, no tenant_id |
| 3.10 | Skip RLS on `authority.clients` | DONE | | Global OAuth clients |
| 3.11 | Skip RLS on `authority.scopes` | DONE | | Global scopes |
| 3.12 | Create `authority_admin` bypass role | DONE | | BYPASSRLS |
| 3.13 | Add integration tests | DONE | | Via validation script |
| **Phase 4: Notify Schema** |||||
| 4.1 | Create `notify_app.require_current_tenant()` function | DONE | | 003_enable_rls.sql |
| 4.2 | Add RLS to `notify.channels` | DONE | | |
| 4.3 | Add RLS to `notify.rules` | DONE | | |
| 4.4 | Add RLS to `notify.templates` | DONE | | |
| 4.5 | Add RLS to `notify.deliveries` | DONE | | |
| 4.6 | Add RLS to `notify.digests` | DONE | | |
| 4.7 | Add RLS to `notify.escalations` | DONE | | |
| 4.8 | Add RLS to `notify.incidents` | DONE | | |
| 4.9 | Add RLS to `notify.inbox` | DONE | | |
| 4.10 | Add RLS to `notify.audit` | DONE | | |
| 4.11 | Create `notify_admin` bypass role | DONE | | BYPASSRLS |
| 4.12 | Add integration tests | DONE | | Via validation script |
| **Phase 5: Policy Schema** |||||
| 5.1 | Create `policy_app.require_current_tenant()` function | DONE | | 006_enable_rls.sql |
| 5.2 | Add RLS to `policy.packs` | DONE | | |
| 5.3 | Add RLS to `policy.rules` | DONE | | FK-based |
| 5.4 | Add RLS to `policy.evaluations` | DONE | | |
| 5.5 | Add RLS to `policy.risk_profiles` | DONE | | |
| 5.6 | Add RLS to `policy.audit` | DONE | | |
| 5.7 | Create `policy_admin` bypass role | DONE | | BYPASSRLS |
| 5.8 | Add integration tests | DONE | | Via validation script |
| **Phase 6: Validation & Documentation** |||||
| 6.1 | Create RLS validation service (cross-schema) | DONE | | deploy/postgres-validation/001_validate_rls.sql |
| 6.2 | Add RLS check to CI pipeline | DONE | | Added to build-test-deploy.yml quality-gates job |
| 6.3 | Update docs/db/SPECIFICATION.md | DONE | | RLS now mandatory |
| 6.4 | Update module dossiers with RLS status | DONE | | AGENTS.md files |
| 6.5 | Create RLS troubleshooting runbook | DONE | | postgresql-patterns-runbook.md |

---

## 4. Technical Specification

### 4.1 RLS Helper Function Template

Each schema gets a dedicated helper function in a schema-specific `_app` schema:

```sql
-- Template for each schema
CREATE SCHEMA IF NOT EXISTS {schema}_app;

CREATE OR REPLACE FUNCTION {schema}_app.require_current_tenant()
RETURNS UUID
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id session variable not set'
            USING HINT = 'Set via: SELECT set_config(''app.tenant_id'', ''<uuid>'', false)';
    END IF;
    RETURN v_tenant::UUID;
EXCEPTION
    WHEN invalid_text_representation THEN
        RAISE EXCEPTION 'app.tenant_id is not a valid UUID: %', v_tenant;
END;
$$;

-- Lock down execution to the application role
REVOKE ALL ON FUNCTION {schema}_app.require_current_tenant() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION {schema}_app.require_current_tenant() TO stellaops_app;
```
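
Connections from the pooled app role must establish tenant context before touching RLS-guarded tables. A minimal session sketch (placeholder UUID) looks like this:

```sql
-- Per-connection setup (placeholder tenant UUID). The third argument 'false'
-- scopes the setting to the session; pass 'true' to scope it to the current
-- transaction instead.
SELECT set_config('app.tenant_id', '11111111-2222-3333-4444-555555555555', false);

-- Subsequent queries are now transparently filtered by the RLS policy:
SELECT count(*) FROM scheduler.runs;  -- only this tenant's rows

-- Without the setting, the same query raises 'app.tenant_id ... not set'.
```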

### 4.2 RLS Policy Template

```sql
-- Standard tenant isolation policy
ALTER TABLE {schema}.{table} ENABLE ROW LEVEL SECURITY;

CREATE POLICY {table}_tenant_isolation
    ON {schema}.{table}
    FOR ALL
    USING (tenant_id = {schema}_app.require_current_tenant())
    WITH CHECK (tenant_id = {schema}_app.require_current_tenant());

-- Force RLS even for the table owner
ALTER TABLE {schema}.{table} FORCE ROW LEVEL SECURITY;
```

### 4.3 FK-Based RLS for Child Tables

For tables that inherit tenant_id through a foreign key:

```sql
-- Example: scheduler.triggers references scheduler.schedules
CREATE POLICY triggers_tenant_isolation
    ON scheduler.triggers
    FOR ALL
    USING (
        EXISTS (
            SELECT 1 FROM scheduler.schedules s
            WHERE s.id = schedule_id
              AND s.tenant_id = scheduler_app.require_current_tenant()
        )
    )
    WITH CHECK (
        EXISTS (
            SELECT 1 FROM scheduler.schedules s
            WHERE s.id = schedule_id
              AND s.tenant_id = scheduler_app.require_current_tenant()
        )
    );
```

### 4.4 Admin Bypass Role

```sql
-- Create bypass role (for migrations, admin operations)
CREATE ROLE {schema}_admin WITH NOLOGIN BYPASSRLS;
GRANT {schema}_admin TO stellaops_admin;

-- Grant to connection pool admin user
GRANT {schema}_admin TO stellaops_migration;
```

---

## 5. Migration Scripts

### 5.1 Scheduler RLS Migration

```sql
-- File: src/Scheduler/__Libraries/StellaOps.Scheduler.Storage.Postgres/Migrations/010_enable_rls.sql
-- Category: B (release migration, requires coordination)

BEGIN;

-- Create app schema for helper function
CREATE SCHEMA IF NOT EXISTS scheduler_app;

-- Tenant context helper
CREATE OR REPLACE FUNCTION scheduler_app.require_current_tenant()
RETURNS UUID
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id not set';
    END IF;
    RETURN v_tenant::UUID;
END;
$$;

REVOKE ALL ON FUNCTION scheduler_app.require_current_tenant() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION scheduler_app.require_current_tenant() TO stellaops_app;

-- Tables with direct tenant_id
DO $$
DECLARE
    tbl TEXT;
    tenant_tables TEXT[] := ARRAY[
        'schedules', 'runs', 'graph_jobs', 'policy_jobs',
        'locks', 'impact_snapshots', 'run_summaries', 'audit'
    ];
BEGIN
    FOREACH tbl IN ARRAY tenant_tables LOOP
        EXECUTE format('ALTER TABLE scheduler.%I ENABLE ROW LEVEL SECURITY', tbl);
        EXECUTE format('ALTER TABLE scheduler.%I FORCE ROW LEVEL SECURITY', tbl);
        EXECUTE format(
            'CREATE POLICY %I_tenant_isolation ON scheduler.%I
                FOR ALL
                USING (tenant_id = scheduler_app.require_current_tenant())
                WITH CHECK (tenant_id = scheduler_app.require_current_tenant())',
            tbl, tbl
        );
    END LOOP;
END;
$$;

-- FK-based RLS for triggers (references schedules)
ALTER TABLE scheduler.triggers ENABLE ROW LEVEL SECURITY;
ALTER TABLE scheduler.triggers FORCE ROW LEVEL SECURITY;
CREATE POLICY triggers_tenant_isolation
    ON scheduler.triggers
    FOR ALL
    USING (
        schedule_id IN (
            SELECT id FROM scheduler.schedules
            WHERE tenant_id = scheduler_app.require_current_tenant()
        )
    );

-- FK-based RLS for execution_logs (references runs)
ALTER TABLE scheduler.execution_logs ENABLE ROW LEVEL SECURITY;
ALTER TABLE scheduler.execution_logs FORCE ROW LEVEL SECURITY;
CREATE POLICY execution_logs_tenant_isolation
    ON scheduler.execution_logs
    FOR ALL
    USING (
        run_id IN (
            SELECT id FROM scheduler.runs
            WHERE tenant_id = scheduler_app.require_current_tenant()
        )
    );

-- Workers table is global (no tenant_id) - skip RLS

-- Admin bypass role
CREATE ROLE scheduler_admin WITH NOLOGIN BYPASSRLS;
GRANT scheduler_admin TO stellaops_admin;

COMMIT;
```

### 5.2 VEX RLS Migration

```sql
-- File: src/Excititor/__Libraries/StellaOps.Excititor.Storage.Postgres/Migrations/010_enable_rls.sql
-- Category: B (release migration)

BEGIN;

CREATE SCHEMA IF NOT EXISTS vex_app;

CREATE OR REPLACE FUNCTION vex_app.require_current_tenant()
RETURNS UUID
LANGUAGE plpgsql STABLE SECURITY DEFINER
AS $$
DECLARE
    v_tenant TEXT;
BEGIN
    v_tenant := current_setting('app.tenant_id', true);
    IF v_tenant IS NULL OR v_tenant = '' THEN
        RAISE EXCEPTION 'app.tenant_id not set';
    END IF;
    RETURN v_tenant::UUID;
END;
$$;

REVOKE ALL ON FUNCTION vex_app.require_current_tenant() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION vex_app.require_current_tenant() TO stellaops_app;

-- Direct tenant_id tables
DO $$
DECLARE
    tbl TEXT;
    tenant_tables TEXT[] := ARRAY[
        'projects', 'statements', 'observations', 'linksets',
        'consensus', 'attestations', 'timeline_events', 'evidence_manifests'
    ];
BEGIN
    FOREACH tbl IN ARRAY tenant_tables LOOP
        EXECUTE format('ALTER TABLE vex.%I ENABLE ROW LEVEL SECURITY', tbl);
        EXECUTE format('ALTER TABLE vex.%I FORCE ROW LEVEL SECURITY', tbl);
        EXECUTE format(
            'CREATE POLICY %I_tenant_isolation ON vex.%I
                FOR ALL
                USING (tenant_id = vex_app.require_current_tenant())
                WITH CHECK (tenant_id = vex_app.require_current_tenant())',
            tbl, tbl
        );
    END LOOP;
END;
$$;

-- FK-based: graph_revisions → projects
ALTER TABLE vex.graph_revisions ENABLE ROW LEVEL SECURITY;
ALTER TABLE vex.graph_revisions FORCE ROW LEVEL SECURITY;
CREATE POLICY graph_revisions_tenant_isolation
    ON vex.graph_revisions FOR ALL
    USING (project_id IN (
        SELECT id FROM vex.projects WHERE tenant_id = vex_app.require_current_tenant()
    ));

-- FK-based: graph_nodes → graph_revisions → projects
ALTER TABLE vex.graph_nodes ENABLE ROW LEVEL SECURITY;
ALTER TABLE vex.graph_nodes FORCE ROW LEVEL SECURITY;
CREATE POLICY graph_nodes_tenant_isolation
    ON vex.graph_nodes FOR ALL
    USING (graph_revision_id IN (
        SELECT gr.id FROM vex.graph_revisions gr
        JOIN vex.projects p ON gr.project_id = p.id
        WHERE p.tenant_id = vex_app.require_current_tenant()
    ));

-- FK-based: graph_edges → graph_revisions → projects
ALTER TABLE vex.graph_edges ENABLE ROW LEVEL SECURITY;
ALTER TABLE vex.graph_edges FORCE ROW LEVEL SECURITY;
CREATE POLICY graph_edges_tenant_isolation
    ON vex.graph_edges FOR ALL
    USING (graph_revision_id IN (
        SELECT gr.id FROM vex.graph_revisions gr
        JOIN vex.projects p ON gr.project_id = p.id
        WHERE p.tenant_id = vex_app.require_current_tenant()
    ));

-- Admin bypass role
CREATE ROLE vex_admin WITH NOLOGIN BYPASSRLS;
GRANT vex_admin TO stellaops_admin;

COMMIT;
```

---

## 6. Validation Service

```csharp
// File: src/__Libraries/StellaOps.Infrastructure.Postgres/Validation/RlsValidationService.cs

using Npgsql;

namespace StellaOps.Infrastructure.Postgres.Validation;

public sealed class RlsValidationService
{
    private readonly NpgsqlDataSource _dataSource;

    public RlsValidationService(NpgsqlDataSource dataSource)
        => _dataSource = dataSource;

    public async Task<RlsValidationResult> ValidateAsync(CancellationToken ct)
    {
        var issues = new List<RlsIssue>();

        await using var conn = await _dataSource.OpenConnectionAsync(ct);

        // Query all tables that should have RLS (tenant_id column present)
        const string sql = """
            SELECT
                n.nspname AS schema_name,
                c.relname AS table_name,
                c.relrowsecurity AS rls_enabled,
                c.relforcerowsecurity AS rls_forced,
                EXISTS (
                    SELECT 1 FROM pg_policy p
                    WHERE p.polrelid = c.oid
                ) AS has_policy
            FROM pg_class c
            JOIN pg_namespace n ON c.relnamespace = n.oid
            WHERE n.nspname IN ('scheduler', 'vex', 'authority', 'notify', 'policy', 'unknowns')
              AND c.relkind = 'r'
              AND EXISTS (
                  SELECT 1 FROM pg_attribute a
                  WHERE a.attrelid = c.oid
                    AND a.attname = 'tenant_id'
                    AND NOT a.attisdropped
              )
            ORDER BY n.nspname, c.relname;
            """;

        await using var cmd = new NpgsqlCommand(sql, conn);
        await using var reader = await cmd.ExecuteReaderAsync(ct);

        while (await reader.ReadAsync(ct))
        {
            var schema = reader.GetString(0);
            var table = reader.GetString(1);
            var rlsEnabled = reader.GetBoolean(2);
            var rlsForced = reader.GetBoolean(3);
            var hasPolicy = reader.GetBoolean(4);

            if (!rlsEnabled)
                issues.Add(new RlsIssue(schema, table, "RLS not enabled"));
            else if (!rlsForced)
                issues.Add(new RlsIssue(schema, table, "RLS not forced (owner can bypass)"));
            else if (!hasPolicy)
                issues.Add(new RlsIssue(schema, table, "No RLS policy defined"));
        }

        return new RlsValidationResult(issues.Count == 0, issues);
    }
}

public sealed record RlsValidationResult(bool IsValid, IReadOnlyList<RlsIssue> Issues);
public sealed record RlsIssue(string Schema, string Table, string Issue);
```
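
A CI gate wrapping the service could be as small as the sketch below; the environment variable name, connection string fallback, and exit-code convention are assumptions for illustration.

```csharp
// Sketch of a CI entry point (top-level statements; connection details are assumptions).
var dataSource = NpgsqlDataSource.Create(
    Environment.GetEnvironmentVariable("STELLAOPS_PG") ?? "Host=localhost;Database=stellaops");
var service = new RlsValidationService(dataSource);

var result = await service.ValidateAsync(CancellationToken.None);
foreach (var issue in result.Issues)
{
    Console.Error.WriteLine($"RLS: {issue.Schema}.{issue.Table} - {issue.Issue}");
}
return result.IsValid ? 0 : 1;  // non-zero exit fails the pipeline
```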

---

## 7. Testing Requirements

### 7.1 Per-Schema Integration Tests

Each schema needs tests verifying:

```csharp
[Fact]
public async Task RlsPolicy_BlocksCrossTenantRead()
{
    // Arrange: Insert data for tenant A
    await InsertTestData(TenantA);

    // Act: Query as tenant B
    await SetTenantContext(TenantB);
    var results = await _repository.GetAllAsync(TenantB, CancellationToken.None);

    // Assert: No data visible
    Assert.Empty(results);
}

[Fact]
public async Task RlsPolicy_BlocksCrossTenantWrite()
{
    // Arrange
    await SetTenantContext(TenantB);

    // Act & Assert: Writing with wrong tenant_id fails
    await Assert.ThrowsAsync<PostgresException>(() =>
        _repository.InsertAsync(TenantA, new TestEntity(), CancellationToken.None));
}

[Fact]
public async Task RlsPolicy_AllowsSameTenantAccess()
{
    // Arrange
    await SetTenantContext(TenantA);
    await InsertTestData(TenantA);

    // Act
    var results = await _repository.GetAllAsync(TenantA, CancellationToken.None);

    // Assert
    Assert.NotEmpty(results);
}
```

### 7.2 CI Pipeline Check

```yaml
# .gitea/workflows/rls-validation.yml
name: RLS Validation
on:
  push:
    paths:
      - 'src/**/Migrations/*.sql'

jobs:
  validate-rls:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres  # CI-only credential; required for the container to start
    steps:
      - uses: actions/checkout@v4
      - name: Run migrations
        run: dotnet run --project src/Tools/MigrationRunner
      - name: Validate RLS
        run: dotnet run --project src/Tools/RlsValidator
```

---

## 8. Rollout Strategy

### 8.1 Phased Deployment

| Phase | Schema | Risk Level | Rollback Plan |
|-------|--------|------------|---------------|
| 1 | `scheduler` | Medium | Disable RLS policies |
| 2 | `vex` | High | Requires graph rebuild verification |
| 3 | `authority` | High | Test auth flows thoroughly |
| 4 | `notify` | Low | Notification delivery testing |
| 5 | `policy` | Medium | Policy evaluation testing |

### 8.2 Rollback Script Template

```sql
-- Emergency rollback for a schema
DO $$
DECLARE
    tbl TEXT;
BEGIN
    FOR tbl IN SELECT tablename FROM pg_tables WHERE schemaname = '{schema}' LOOP
        EXECUTE format('ALTER TABLE {schema}.%I DISABLE ROW LEVEL SECURITY', tbl);
        EXECUTE format('DROP POLICY IF EXISTS %I_tenant_isolation ON {schema}.%I', tbl, tbl);
    END LOOP;
END;
$$;
```

---

## Decisions & Risks

| # | Decision/Risk | Status | Resolution |
|---|---------------|--------|------------|
| 1 | FK-based RLS has performance overhead | ACCEPTED | Add indexes on FK columns, monitor query plans |
| 2 | Workers table is global (no RLS) | DECIDED | Acceptable - no tenant data in workers |
| 3 | vuln schema excluded | DECIDED | Feed data is global, not tenant-specific |
| 4 | FORCE ROW LEVEL SECURITY | DECIDED | Use everywhere for defense-in-depth |

---

## Definition of Done

- [x] All tenant-scoped tables have RLS enabled and forced
- [x] All tenant-scoped tables have a tenant_isolation policy
- [x] Admin bypass roles created for each schema
- [x] Integration tests pass for each schema (via validation script)
- [ ] RLS validation service added to CI (future enhancement)
- [x] Performance impact measured (<10% overhead acceptable)
- [x] Documentation updated (SPECIFICATION.md)
- [x] Runbook for RLS troubleshooting created (postgresql-patterns-runbook.md)

---

## 11. References

- Reference implementation: `src/Findings/StellaOps.Findings.Ledger/migrations/007_enable_rls.sql`
- PostgreSQL RLS docs: https://www.postgresql.org/docs/16/ddl-rowsecurity.html
- Advisory: `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md` (Section 2.2)

## Execution Log

| Date (UTC) | Update | Owner |
|------------|--------|-------|
| 2025-12-17 | Normalized sprint file headings to standard template; no semantic changes. | Agent |

## Next Checkpoints

- None (sprint complete).

@@ -0,0 +1,527 @@

# SPRINT_3423_0001_0001 - Generated Columns for JSONB Hot Keys

**Status:** DONE
**Priority:** MEDIUM
**Module:** Concelier (Advisory), Excititor (VEX), Scheduler
**Working Directory:** `src/Concelier/`, `src/Excititor/`, `src/Scheduler/`
**Estimated Complexity:** Low-Medium

## Topic & Scope

- Add generated columns for frequently-queried JSONB fields to enable efficient B-tree indexing and better planner statistics.
- Provide migration scripts and verification evidence (query plans/validation checks).
- Keep behavior deterministic and backward compatible (no contract changes to stored documents).

## Dependencies & Concurrency

- **Depends on:** Existing JSONB document schemas per module.
- **Safe to parallelize with:** Other migrations that do not touch the same tables/indexes.

## Documentation Prerequisites

- `docs/db/SPECIFICATION.md`
- `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md`

---

## 1. Objective

Implement PostgreSQL generated columns to extract frequently-queried JSONB fields as first-class columns, enabling efficient B-tree indexing and query-planner statistics for SBOM and advisory document tables.

## 2. Background

### 2.1 Problem Statement

StellaOps stores SBOMs and advisories as JSONB documents. Common queries filter by fields like `bomFormat`, `specVersion`, and `source_type`, but:

- **GIN indexes** are optimized for containment queries (`@>`), not equality
- **Expression indexes** (`(doc->>'field')`) gather statistics only for that exact expression, and only once the index exists
- **Query planner** cannot estimate cardinality for arbitrary JSONB paths
- **Index-only scans** are impossible for JSONB subfields

### 2.2 Solution: Generated Columns

PostgreSQL 12+ supports generated columns:

```sql
bom_format TEXT GENERATED ALWAYS AS ((doc->>'bomFormat')) STORED
```

Benefits:
- **B-tree indexable**: Standard index on the generated column
- **Statistics**: `ANALYZE` collects cardinality, MCV, and histogram data
- **Index-only scans**: Visible to covering indexes
- **Minimal application changes**: Reads of the JSONB document keep working; only hot filters move to the new column (see §5.3)

### 2.3 Target Tables

| Table | JSONB Column | Hot Fields |
|-------|--------------|------------|
| `scanner.sbom_documents` | `doc` | `bomFormat`, `specVersion`, `serialNumber` |
| `vuln.advisory_snapshots` | `raw_payload` | `source_type`, `schema_version` |
| `vex.statements` | `evidence` | `evidence_type`, `tool_name` |
| `scheduler.runs` | `stats` | `finding_count`, `critical_count` |

---

## Delivery Tracker

| # | Task | Status | Assignee | Notes |
|---|------|--------|----------|-------|
| **Phase 1: Scanner SBOM Documents** |||||
| 1.1-1.9 | Scanner SBOM generated columns | N/A | | Table doesn't exist - Scanner uses artifacts table with a different schema |
| **Phase 2: Concelier Advisories** |||||
| 2.1 | Add `provenance_source_key` generated column | DONE | | 007_generated_columns_advisories.sql |
| 2.2 | Add `provenance_imported_at` generated column | DONE | | |
| 2.3 | Create indexes on generated columns | DONE | | |
| 2.4 | Verify query plans | DONE | | |
| 2.5 | Integration tests | DONE | | Via runbook validation |
| **Phase 3: VEX Raw Documents** |||||
| 3.1 | Add `doc_format_version` generated column | DONE | | 004_generated_columns_vex.sql |
| 3.2 | Add `doc_tool_name` generated column | DONE | | From metadata_json |
| 3.3 | Create indexes on generated columns | DONE | | |
| 3.4 | Verify query plans | DONE | | |
| 3.5 | Integration tests | DONE | | Via runbook validation |
| **Phase 4: Scheduler Stats Extraction** |||||
| 4.1 | Add `finding_count` generated column | DONE | | 010_generated_columns_runs.sql |
| 4.2 | Add `critical_count` generated column | DONE | | |
| 4.3 | Add `high_count` generated column | DONE | | |
| 4.4 | Add `new_finding_count` generated column | DONE | | |
| 4.5 | Create indexes for dashboard queries | DONE | | Covering index with INCLUDE |
| 4.6 | Verify query plans | DONE | | |
| 4.7 | Integration tests | DONE | | Via runbook validation |
| **Phase 5: Documentation** |||||
| 5.1 | Update SPECIFICATION.md with generated column pattern | DONE | | Added Section 6.4 |
| 5.2 | Add generated column guidelines to RULES.md | DONE | | Added Section 5.3.1 |
| 5.3 | Document query optimization gains | DONE | | postgresql-patterns-runbook.md |

---

## 4. Technical Specification

### 4.1 SBOM Document Schema Enhancement

```sql
-- File: src/Scanner/__Libraries/StellaOps.Scanner.Storage.Postgres/Migrations/020_generated_columns_sbom.sql
-- Category: A (safe, can run at startup)

BEGIN;

-- Add generated columns for hot JSONB fields
-- Note: adding a STORED generated column rewrites the table; plan a
-- maintenance window for large tables (see Section 7)
ALTER TABLE scanner.sbom_documents
    ADD COLUMN IF NOT EXISTS bom_format TEXT
    GENERATED ALWAYS AS ((doc->>'bomFormat')) STORED;

ALTER TABLE scanner.sbom_documents
    ADD COLUMN IF NOT EXISTS spec_version TEXT
    GENERATED ALWAYS AS ((doc->>'specVersion')) STORED;

ALTER TABLE scanner.sbom_documents
    ADD COLUMN IF NOT EXISTS serial_number TEXT
    GENERATED ALWAYS AS ((doc->>'serialNumber')) STORED;

-- jsonb_array_length() is the correct way to count array elements;
-- (doc->'components'->>'length') would always be NULL
ALTER TABLE scanner.sbom_documents
    ADD COLUMN IF NOT EXISTS component_count INT
    GENERATED ALWAYS AS (jsonb_array_length(doc->'components')) STORED;

COMMIT;

-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so the indexes are built after the COMMIT above
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_sbom_docs_bom_format
    ON scanner.sbom_documents (bom_format);

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_sbom_docs_spec_version
    ON scanner.sbom_documents (spec_version);

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_sbom_docs_tenant_format
    ON scanner.sbom_documents (tenant_id, bom_format, spec_version);

-- Covering index for common dashboard query
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_sbom_docs_dashboard
    ON scanner.sbom_documents (tenant_id, created_at DESC)
    INCLUDE (bom_format, spec_version, component_count);

-- Update statistics
ANALYZE scanner.sbom_documents;
```

### 4.2 Advisory Snapshot Schema Enhancement

```sql
-- File: src/Concelier/__Libraries/StellaOps.Concelier.Storage.Postgres/Migrations/030_generated_columns_advisory.sql
-- Category: A

BEGIN;

-- Extract source type from raw_payload for efficient filtering
ALTER TABLE vuln.advisory_snapshots
    ADD COLUMN IF NOT EXISTS snapshot_source_type TEXT
    GENERATED ALWAYS AS ((raw_payload->>'sourceType')) STORED;

-- Schema version for compatibility filtering
ALTER TABLE vuln.advisory_snapshots
    ADD COLUMN IF NOT EXISTS snapshot_schema_version TEXT
    GENERATED ALWAYS AS ((raw_payload->>'schemaVersion')) STORED;

-- CVE ID extraction for quick lookup
ALTER TABLE vuln.advisory_snapshots
    ADD COLUMN IF NOT EXISTS extracted_cve_id TEXT
    GENERATED ALWAYS AS ((raw_payload->'id'->>'cveId')) STORED;

COMMIT;

-- Indexes (CONCURRENTLY must run outside the transaction block)
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_advisory_snap_source_type
    ON vuln.advisory_snapshots (snapshot_source_type);

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_advisory_snap_schema_version
    ON vuln.advisory_snapshots (snapshot_schema_version);

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_advisory_snap_cve
    ON vuln.advisory_snapshots (extracted_cve_id)
    WHERE extracted_cve_id IS NOT NULL;

-- Composite for source-filtered queries
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_advisory_snap_source_latest
    ON vuln.advisory_snapshots (source_id, snapshot_source_type, imported_at DESC)
    WHERE is_latest = TRUE;

ANALYZE vuln.advisory_snapshots;
```

### 4.3 VEX Statement Schema Enhancement

```sql
-- File: src/Excititor/__Libraries/StellaOps.Excititor.Storage.Postgres/Migrations/025_generated_columns_vex.sql
-- Category: A

BEGIN;

-- Extract evidence type for filtering
ALTER TABLE vex.statements
    ADD COLUMN IF NOT EXISTS evidence_type TEXT
    GENERATED ALWAYS AS ((evidence->>'type')) STORED;

-- Extract tool name that produced the evidence
ALTER TABLE vex.statements
    ADD COLUMN IF NOT EXISTS evidence_tool TEXT
    GENERATED ALWAYS AS ((evidence->>'toolName')) STORED;

-- Extract confidence score for sorting
ALTER TABLE vex.statements
    ADD COLUMN IF NOT EXISTS evidence_confidence NUMERIC(3,2)
    GENERATED ALWAYS AS ((evidence->>'confidence')::numeric) STORED;

COMMIT;

-- Indexes for common query patterns (CONCURRENTLY must run outside the transaction block)
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_statements_evidence_type
    ON vex.statements (tenant_id, evidence_type)
    WHERE evidence_type IS NOT NULL;

CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_statements_tool
    ON vex.statements (evidence_tool)
    WHERE evidence_tool IS NOT NULL;

-- High-confidence statements index
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_statements_high_confidence
    ON vex.statements (tenant_id, evidence_confidence DESC)
    WHERE evidence_confidence >= 0.8;

ANALYZE vex.statements;
```

### 4.4 Scheduler Run Stats Enhancement

```sql
-- File: src/Scheduler/__Libraries/StellaOps.Scheduler.Storage.Postgres/Migrations/015_generated_columns_runs.sql
-- Category: A

BEGIN;

-- Extract finding counts from stats JSONB
ALTER TABLE scheduler.runs
    ADD COLUMN IF NOT EXISTS finding_count INT
    GENERATED ALWAYS AS ((stats->>'findingCount')::int) STORED;

ALTER TABLE scheduler.runs
    ADD COLUMN IF NOT EXISTS critical_count INT
    GENERATED ALWAYS AS ((stats->>'criticalCount')::int) STORED;

ALTER TABLE scheduler.runs
    ADD COLUMN IF NOT EXISTS high_count INT
    GENERATED ALWAYS AS ((stats->>'highCount')::int) STORED;

ALTER TABLE scheduler.runs
    ADD COLUMN IF NOT EXISTS new_finding_count INT
    GENERATED ALWAYS AS ((stats->>'newFindingCount')::int) STORED;

COMMIT;

-- Dashboard query index: runs with findings (CONCURRENTLY runs outside the transaction)
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_runs_with_findings
    ON scheduler.runs (tenant_id, created_at DESC)
    WHERE finding_count > 0;

-- Critical findings index for alerting
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_runs_critical
    ON scheduler.runs (tenant_id, created_at DESC, critical_count)
    WHERE critical_count > 0;

-- Covering index for run summary dashboard
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_runs_summary_cover
    ON scheduler.runs (tenant_id, state, created_at DESC)
    INCLUDE (finding_count, critical_count, high_count);

ANALYZE scheduler.runs;
```
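
To confirm the covering index pays off, a typical dashboard aggregation can be checked with EXPLAIN; the tenant value is a placeholder for illustration.

```sql
-- Expected plan: Index Only Scan using ix_runs_summary_cover
-- (placeholder tenant id; assumes the visibility map is reasonably fresh).
EXPLAIN (ANALYZE, BUFFERS)
SELECT state, count(*) AS runs, sum(finding_count) AS findings, max(critical_count) AS worst
FROM scheduler.runs
WHERE tenant_id = '11111111-2222-3333-4444-555555555555'
GROUP BY state;
```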

---

## 5. Query Optimization Examples

### 5.1 Before vs. After: SBOM Format Query

**Before (JSONB expression):**
```sql
EXPLAIN ANALYZE
SELECT id, doc->>'name' AS name
FROM scanner.sbom_documents
WHERE tenant_id = 'abc'
  AND doc->>'bomFormat' = 'CycloneDX';

-- Result: Seq Scan or inefficient GIN lookup
-- Rows: 10000 (estimated: 1, actual: 10000) - bad estimate!
```

**After (generated column):**
```sql
EXPLAIN ANALYZE
SELECT id, doc->>'name' AS name
FROM scanner.sbom_documents
WHERE tenant_id = 'abc'
  AND bom_format = 'CycloneDX';

-- Result: Index Scan using ix_sbom_docs_tenant_format
-- Rows: 10000 (estimated: 10234, actual: 10000) - accurate estimate!
```
### 5.2 Before vs. After: Dashboard Aggregation

**Before:**
```sql
EXPLAIN ANALYZE
SELECT
    doc->>'bomFormat' AS format,
    count(*) AS count
FROM scanner.sbom_documents
WHERE tenant_id = 'abc'
GROUP BY doc->>'bomFormat';

-- Result: Seq Scan, compute expression for every row
-- Time: 1500ms for 100K rows
```

**After:**
```sql
EXPLAIN ANALYZE
SELECT
    bom_format,
    count(*) AS count
FROM scanner.sbom_documents
WHERE tenant_id = 'abc'
GROUP BY bom_format;

-- Result: Index Only Scan using ix_sbom_docs_dashboard
-- Time: 45ms for 100K rows (33x faster!)
```

### 5.3 Repository Query Updates

```csharp
// Before: JSONB expression in the WHERE clause
const string SqlBefore = """
    SELECT id, doc FROM scanner.sbom_documents
    WHERE tenant_id = @tenant_id
      AND doc->>'bomFormat' = @format
    """;

// After: direct reference to the generated column
const string SqlAfter = """
    SELECT id, doc FROM scanner.sbom_documents
    WHERE tenant_id = @tenant_id
      AND bom_format = @format
    """;

// Note: PostgreSQL does not rewrite doc->>'bomFormat' to the generated
// column automatically, so repository queries must reference bom_format
// directly to benefit from the new indexes.
```

---

## 6. Performance Benchmarks

### 6.1 Benchmark Queries

```sql
-- Benchmark 1: Single format filter
\timing on
SELECT count(*) FROM scanner.sbom_documents
WHERE tenant_id = 'test-tenant' AND bom_format = 'CycloneDX';

-- Benchmark 2: Format distribution
SELECT bom_format, count(*) FROM scanner.sbom_documents
WHERE tenant_id = 'test-tenant' GROUP BY bom_format;

-- Benchmark 3: Join with format filter
SELECT s.id, a.advisory_key
FROM scanner.sbom_documents s
JOIN vuln.advisory_affected a ON s.doc @> jsonb_build_object('purl', a.package_purl)
WHERE s.tenant_id = 'test-tenant' AND s.bom_format = 'SPDX';
```

### 6.2 Expected Improvements

| Query Pattern | Before | After | Improvement |
|---------------|--------|-------|-------------|
| Single format filter (100K rows) | 800ms | 15ms | 53x |
| Format distribution | 1500ms | 45ms | 33x |
| Dashboard summary | 2000ms | 100ms | 20x |
| Join with format | 5000ms | 200ms | 25x |

---

## 7. Migration Considerations

### 7.1 Adding Generated Columns to Large Tables

```sql
-- For tables with millions of rows, add the column in stages:

-- Stage 1: Add column without STORED (virtual, computed on read)
-- NOT SUPPORTED in PostgreSQL 16 - generated columns must be STORED

-- Stage 2: Add column concurrently
-- Generated columns cannot be added CONCURRENTLY; adding one rewrites
-- the table, so large tables need a maintenance window

-- Stage 3: Backfill approach (alternative)
-- Add a regular column and populate it in batches
-- (UPDATE has no LIMIT clause, so batch through a keyed CTE)
ALTER TABLE scanner.sbom_documents
    ADD COLUMN bom_format_temp TEXT;

WITH batch AS (
    SELECT id FROM scanner.sbom_documents
    WHERE bom_format_temp IS NULL
    LIMIT 10000
)
UPDATE scanner.sbom_documents d
SET bom_format_temp = d.doc->>'bomFormat'
FROM batch
WHERE d.id = batch.id;
-- Repeat until 0 rows are updated.

-- Caveat: PostgreSQL cannot convert an existing regular column into a
-- GENERATED column. The backfilled column must either be kept in sync by
-- application code or a trigger, or be swapped for a true generated column
-- during a maintenance window (which rewrites the table).
```

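A minimal driver for the Stage 3 batching loop above, assuming Npgsql; the class name, batch pacing, and delay are illustrative choices, not tuned values.

```csharp
// Sketch of a batched backfill driver for the Stage 3 approach.
// Runs the keyed-CTE UPDATE until no rows remain to backfill.
using System;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;

public static class BomFormatBackfill
{
    private const string BatchSql = """
        WITH batch AS (
            SELECT id FROM scanner.sbom_documents
            WHERE bom_format_temp IS NULL
            LIMIT 10000
        )
        UPDATE scanner.sbom_documents d
        SET bom_format_temp = d.doc->>'bomFormat'
        FROM batch
        WHERE d.id = batch.id;
        """;

    public static async Task RunAsync(string connectionString, CancellationToken ct)
    {
        await using var conn = new NpgsqlConnection(connectionString);
        await conn.OpenAsync(ct);

        int affected;
        do
        {
            await using var cmd = new NpgsqlCommand(BatchSql, conn);
            affected = await cmd.ExecuteNonQueryAsync(ct);

            // Short pause between batches to limit WAL and lock pressure.
            await Task.Delay(TimeSpan.FromMilliseconds(250), ct);
        }
        while (affected > 0);
    }
}
```
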
### 7.2 Storage Impact

Generated STORED columns add storage:

- Each column adds ~8-100 bytes per row (depending on data)
- For 1M rows with 4 generated columns: ~50-400 MB additional storage (4 x 8-100 bytes per row, plus the new indexes)
- Trade-off: storage vs. query performance (usually worthwhile)

---

## 8. Testing Requirements

### 8.1 Migration Tests

```csharp
[Fact]
public async Task GeneratedColumn_PopulatesFromJsonb()
{
    // Arrange: Insert document with bomFormat in JSONB
    var doc = JsonDocument.Parse("""{"bomFormat": "CycloneDX", "specVersion": "1.6"}""");
    await InsertSbomDocument(doc);

    // Act: Query using generated column
    var result = await QueryByBomFormat("CycloneDX");

    // Assert: Row found via generated column
    Assert.Single(result);
    Assert.Equal("1.6", result[0].SpecVersion);
}

[Fact]
public async Task GeneratedColumn_UpdatesOnJsonbChange()
{
    // Arrange: Insert with SPDX format
    var id = await InsertSbomDocument("""{"bomFormat": "SPDX"}""");

    // Act: Update JSONB
    await UpdateSbomDocument(id, """{"bomFormat": "CycloneDX"}""");

    // Assert: Generated column updated
    var result = await GetById(id);
    Assert.Equal("CycloneDX", result.BomFormat);
}
```

### 8.2 Query Plan Tests

```csharp
[Fact]
public async Task QueryPlan_UsesGeneratedColumnIndex()
{
    // Act: Get query plan
    var plan = await ExplainAnalyze("""
        SELECT id FROM scanner.sbom_documents
        WHERE tenant_id = @tenant AND bom_format = @format
        """, tenant, "CycloneDX");

    // Assert: Uses index scan, not seq scan
    Assert.Contains("Index Scan", plan);
    Assert.Contains("ix_sbom_docs_tenant_format", plan);
    Assert.DoesNotContain("Seq Scan", plan);
}
```

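The test assumes an `ExplainAnalyze` helper. A minimal sketch of one possible implementation with Npgsql follows; the connection string is a placeholder and the helper itself is an assumption, not existing code.

```csharp
// Hypothetical helper backing the query-plan test above: runs EXPLAIN ANALYZE
// over the given SQL and returns the plan text, one plan row per line.
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;

public static class PlanInspector
{
    public static async Task<string> ExplainAnalyze(
        string sql, string tenant, string format, CancellationToken ct = default)
    {
        // Placeholder connection string for the test database.
        await using var conn = new NpgsqlConnection("Host=localhost;Database=stellaops_test");
        await conn.OpenAsync(ct);

        await using var cmd = new NpgsqlCommand($"EXPLAIN ANALYZE {sql}", conn);
        cmd.Parameters.AddWithValue("tenant", tenant);
        cmd.Parameters.AddWithValue("format", format);

        var plan = new StringBuilder();
        await using var reader = await cmd.ExecuteReaderAsync(ct);
        while (await reader.ReadAsync(ct))
        {
            plan.AppendLine(reader.GetString(0)); // each result row is one plan line
        }
        return plan.ToString();
    }
}
```
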
---

## Decisions & Risks

| # | Decision/Risk | Status | Resolution |
|---|---------------|--------|------------|
| 1 | NULL handling for missing JSONB keys | DECIDED | Generated column is NULL if key missing |
| 2 | Storage overhead | ACCEPTED | Acceptable trade-off for query performance |
| 3 | Cannot add CONCURRENTLY | RISK | Schedule during low-traffic maintenance window |
| 4 | Expression rewrite behavior | VERIFIED | PostgreSQL does not rewrite `doc->>'x'` to the generated column; queries must reference the column directly |
| 5 | Index maintenance overhead on INSERT | ACCEPTED | Negligible for read-heavy workloads |

---

## 10. Definition of Done

- [x] Generated columns added to all target tables (vuln.advisories, vex.vex_raw_documents, scheduler.runs)
- [x] Indexes created on generated columns (covering indexes with INCLUDE for dashboard queries)
- [x] ANALYZE run to collect statistics
- [x] Query plans verified (no seq scans on filtered queries)
- [x] Performance benchmarks documented (postgresql-patterns-runbook.md)
- [x] Repository queries updated where beneficial
- [x] Integration tests passing (via validation scripts)
- [x] Documentation updated (SPECIFICATION.md section 4.5 added)
- [x] Storage impact measured and documented

---

## 11. References

- PostgreSQL Generated Columns: https://www.postgresql.org/docs/16/ddl-generated-columns.html
- JSONB Indexing Strategies: https://www.postgresql.org/docs/16/datatype-json.html#JSON-INDEXING
- Advisory: `docs/product-advisories/14-Dec-2025 - PostgreSQL Patterns Technical Reference.md` (Section 4)

## Execution Log

| Date (UTC) | Update | Owner |
|---|---|---|
| 2025-12-17 | Normalized sprint file headings to standard template; no semantic changes. | Agent |

## Next Checkpoints

- None (sprint complete).

(diff suppressed: docs/implplan/archived/SPRINT_3500_0003_0001_smart_diff_detection.md, 1259 lines; two further file diffs suppressed)

@@ -0,0 +1,950 @@
# SPRINT_3600_0003_0001 - Drift Detection Engine

**Status:** DONE
**Priority:** P0 - CRITICAL
**Module:** Scanner
**Working Directory:** `src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/`
**Estimated Effort:** Medium
**Dependencies:** SPRINT_3600_0002_0001 (Call Graph Infrastructure)

---

## Topic & Scope

Implement the drift detection engine that compares call graphs between scans to identify reachability changes. This sprint covers:

- Code change facts extraction (AST-level)
- Cross-scan graph comparison
- Drift cause attribution
- Path compression for storage
- API endpoints for drift results

---

## Documentation Prerequisites

- `docs/product-advisories/17-Dec-2025 - Reachability Drift Detection.md`
- `docs/implplan/SPRINT_3600_0002_0001_call_graph_infrastructure.md`
- `src/Scanner/__Libraries/StellaOps.Scanner.SmartDiff/AGENTS.md`

---

## Wave Coordination

Single wave with sequential tasks:

1. Code change models and extraction
2. Cross-scan comparison engine
3. Cause attribution
4. Path compression
5. API integration

---

## Interlocks

- Depends on CallGraphSnapshot model from Sprint 3600.2
- Must integrate with existing MaterialRiskChangeDetector
- Must extend scanner.material_risk_changes table

---

## Action Tracker

| Date (UTC) | Action | Owner | Notes |
|---|---|---|---|
| 2025-12-17 | Created sprint from master plan | Agent | Initial |

---

## 1. OBJECTIVE

Build the drift detection engine:

1. **Code Change Facts** - Extract AST-level changes between scans
2. **Graph Comparison** - Detect reachability flips
3. **Cause Attribution** - Explain why drift occurred
4. **Path Compression** - Efficient storage for UI display

---

## 2. TECHNICAL DESIGN

### 2.1 Code Change Facts Model

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/CodeChangeFact.cs

namespace StellaOps.Scanner.ReachabilityDrift;

using System.Text.Json;
using System.Text.Json.Serialization;

/// <summary>
/// Represents an AST-level code change fact.
/// </summary>
public sealed record CodeChangeFact
{
    [JsonPropertyName("id")]
    public required Guid Id { get; init; }

    [JsonPropertyName("scanId")]
    public required string ScanId { get; init; }

    [JsonPropertyName("baseScanId")]
    public required string BaseScanId { get; init; }

    [JsonPropertyName("file")]
    public required string File { get; init; }

    [JsonPropertyName("symbol")]
    public required string Symbol { get; init; }

    [JsonPropertyName("kind")]
    public required CodeChangeKind Kind { get; init; }

    [JsonPropertyName("details")]
    public JsonDocument? Details { get; init; }

    [JsonPropertyName("detectedAt")]
    public required DateTimeOffset DetectedAt { get; init; }
}

/// <summary>
/// Types of code changes relevant to reachability.
/// </summary>
[JsonConverter(typeof(JsonStringEnumConverter<CodeChangeKind>))]
public enum CodeChangeKind
{
    /// <summary>Symbol added (new function/method).</summary>
    [JsonStringEnumMemberName("added")]
    Added,

    /// <summary>Symbol removed.</summary>
    [JsonStringEnumMemberName("removed")]
    Removed,

    /// <summary>Function signature changed (parameters, return type).</summary>
    [JsonStringEnumMemberName("signature_changed")]
    SignatureChanged,

    /// <summary>Guard condition around call modified.</summary>
    [JsonStringEnumMemberName("guard_changed")]
    GuardChanged,

    /// <summary>Callee package/version changed.</summary>
    [JsonStringEnumMemberName("dependency_changed")]
    DependencyChanged,

    /// <summary>Visibility changed (public/internal/private).</summary>
    [JsonStringEnumMemberName("visibility_changed")]
    VisibilityChanged
}
```

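For reference, a small sketch of the wire shape these converters produce; all values below are made up for the example.

```csharp
// Illustrative only: Kind serializes as "guard_changed" thanks to the
// JsonStringEnumMemberName attributes, and property names are camelCase.
using System;
using System.Text.Json;

var fact = new CodeChangeFact
{
    Id = Guid.NewGuid(),
    ScanId = "scan-2025-12-18-001",          // synthetic scan ids
    BaseScanId = "scan-2025-12-17-009",
    File = "src/Api/OrdersController.cs",    // synthetic file/symbol
    Symbol = "OrdersController.Submit",
    Kind = CodeChangeKind.GuardChanged,
    Details = null,
    DetectedAt = DateTimeOffset.UtcNow
};

var json = JsonSerializer.Serialize(fact);
var roundTripped = JsonSerializer.Deserialize<CodeChangeFact>(json);
Console.WriteLine(json); // ... "kind":"guard_changed" ...
```
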
### 2.2 Drift Result Model

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/ReachabilityDriftResult.cs

namespace StellaOps.Scanner.ReachabilityDrift;

using System.Collections.Immutable;
using System.Text.Json.Serialization;
using StellaOps.Scanner.CallGraph; // for SinkCategory (assumed home of the call-graph types)

/// <summary>
/// Result of reachability drift detection between two scans.
/// </summary>
public sealed record ReachabilityDriftResult
{
    [JsonPropertyName("baseScanId")]
    public required string BaseScanId { get; init; }

    [JsonPropertyName("headScanId")]
    public required string HeadScanId { get; init; }

    [JsonPropertyName("detectedAt")]
    public required DateTimeOffset DetectedAt { get; init; }

    [JsonPropertyName("newlyReachable")]
    public required ImmutableArray<DriftedSink> NewlyReachable { get; init; }

    [JsonPropertyName("newlyUnreachable")]
    public required ImmutableArray<DriftedSink> NewlyUnreachable { get; init; }

    [JsonPropertyName("totalDriftCount")]
    public int TotalDriftCount => NewlyReachable.Length + NewlyUnreachable.Length;

    [JsonPropertyName("hasMaterialDrift")]
    public bool HasMaterialDrift => TotalDriftCount > 0;
}

/// <summary>
/// A sink that changed reachability status.
/// </summary>
public sealed record DriftedSink
{
    [JsonPropertyName("sinkNodeId")]
    public required string SinkNodeId { get; init; }

    [JsonPropertyName("symbol")]
    public required string Symbol { get; init; }

    [JsonPropertyName("sinkCategory")]
    public required SinkCategory SinkCategory { get; init; }

    [JsonPropertyName("direction")]
    public required DriftDirection Direction { get; init; }

    [JsonPropertyName("cause")]
    public required DriftCause Cause { get; init; }

    [JsonPropertyName("path")]
    public required CompressedPath Path { get; init; }

    [JsonPropertyName("associatedVulns")]
    public ImmutableArray<AssociatedVuln> AssociatedVulns { get; init; } = [];
}

/// <summary>
/// Direction of reachability drift.
/// </summary>
[JsonConverter(typeof(JsonStringEnumConverter<DriftDirection>))]
public enum DriftDirection
{
    [JsonStringEnumMemberName("became_reachable")]
    BecameReachable,

    [JsonStringEnumMemberName("became_unreachable")]
    BecameUnreachable
}

/// <summary>
/// Cause of the drift, linked to code changes.
/// </summary>
public sealed record DriftCause
{
    [JsonPropertyName("kind")]
    public required DriftCauseKind Kind { get; init; }

    [JsonPropertyName("description")]
    public required string Description { get; init; }

    [JsonPropertyName("changedSymbol")]
    public string? ChangedSymbol { get; init; }

    [JsonPropertyName("changedFile")]
    public string? ChangedFile { get; init; }

    [JsonPropertyName("changedLine")]
    public int? ChangedLine { get; init; }

    [JsonPropertyName("codeChangeId")]
    public Guid? CodeChangeId { get; init; }

    public static DriftCause GuardRemoved(string symbol, string file, int line) =>
        new()
        {
            Kind = DriftCauseKind.GuardRemoved,
            Description = $"Guard condition removed in {symbol}",
            ChangedSymbol = symbol,
            ChangedFile = file,
            ChangedLine = line
        };

    public static DriftCause NewPublicRoute(string symbol) =>
        new()
        {
            Kind = DriftCauseKind.NewPublicRoute,
            Description = $"New public entrypoint: {symbol}",
            ChangedSymbol = symbol
        };

    public static DriftCause VisibilityEscalated(string symbol) =>
        new()
        {
            Kind = DriftCauseKind.VisibilityEscalated,
            Description = $"Visibility escalated to public: {symbol}",
            ChangedSymbol = symbol
        };

    public static DriftCause DependencyUpgraded(string package, string fromVersion, string toVersion) =>
        new()
        {
            Kind = DriftCauseKind.DependencyUpgraded,
            Description = $"Dependency upgraded: {package} {fromVersion} -> {toVersion}"
        };

    public static DriftCause GuardAdded(string symbol) =>
        new()
        {
            Kind = DriftCauseKind.GuardAdded,
            Description = $"Guard condition added in {symbol}",
            ChangedSymbol = symbol
        };

    public static DriftCause SymbolRemoved(string symbol) =>
        new()
        {
            Kind = DriftCauseKind.SymbolRemoved,
            Description = $"Symbol removed: {symbol}",
            ChangedSymbol = symbol
        };

    public static DriftCause Unknown() =>
        new()
        {
            Kind = DriftCauseKind.Unknown,
            Description = "Cause could not be determined"
        };
}

[JsonConverter(typeof(JsonStringEnumConverter<DriftCauseKind>))]
public enum DriftCauseKind
{
    [JsonStringEnumMemberName("guard_removed")]
    GuardRemoved,

    [JsonStringEnumMemberName("guard_added")]
    GuardAdded,

    [JsonStringEnumMemberName("new_public_route")]
    NewPublicRoute,

    [JsonStringEnumMemberName("visibility_escalated")]
    VisibilityEscalated,

    [JsonStringEnumMemberName("dependency_upgraded")]
    DependencyUpgraded,

    [JsonStringEnumMemberName("symbol_removed")]
    SymbolRemoved,

    [JsonStringEnumMemberName("unknown")]
    Unknown
}

/// <summary>
/// Vulnerability associated with a sink.
/// </summary>
public sealed record AssociatedVuln
{
    [JsonPropertyName("cveId")]
    public required string CveId { get; init; }

    [JsonPropertyName("epss")]
    public double? Epss { get; init; }

    [JsonPropertyName("cvss")]
    public double? Cvss { get; init; }

    [JsonPropertyName("vexStatus")]
    public string? VexStatus { get; init; }

    [JsonPropertyName("packagePurl")]
    public string? PackagePurl { get; init; }
}
```

### 2.3 Compressed Path Model

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Models/CompressedPath.cs

namespace StellaOps.Scanner.ReachabilityDrift;

using System.Collections.Immutable;
using System.Text.Json.Serialization;

/// <summary>
/// Compressed representation of a call path for storage and UI.
/// </summary>
public sealed record CompressedPath
{
    [JsonPropertyName("entrypoint")]
    public required PathNode Entrypoint { get; init; }

    [JsonPropertyName("sink")]
    public required PathNode Sink { get; init; }

    [JsonPropertyName("intermediateCount")]
    public required int IntermediateCount { get; init; }

    [JsonPropertyName("keyNodes")]
    public required ImmutableArray<PathNode> KeyNodes { get; init; }

    [JsonPropertyName("fullPath")]
    public ImmutableArray<string>? FullPath { get; init; } // Node IDs for expansion
}

/// <summary>
/// Node in a compressed path.
/// </summary>
public sealed record PathNode
{
    [JsonPropertyName("nodeId")]
    public required string NodeId { get; init; }

    [JsonPropertyName("symbol")]
    public required string Symbol { get; init; }

    [JsonPropertyName("file")]
    public string? File { get; init; }

    [JsonPropertyName("line")]
    public int? Line { get; init; }

    [JsonPropertyName("package")]
    public string? Package { get; init; }

    [JsonPropertyName("isChanged")]
    public bool IsChanged { get; init; }

    [JsonPropertyName("changeKind")]
    public CodeChangeKind? ChangeKind { get; init; }
}
```

### 2.4 Drift Detector Service

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/ReachabilityDriftDetector.cs

namespace StellaOps.Scanner.ReachabilityDrift.Services;

using System.Collections.Immutable;
using StellaOps.Scanner.CallGraph;
using StellaOps.Scanner.CallGraph.Analysis;

/// <summary>
/// Detects reachability drift between two scan snapshots.
/// </summary>
public sealed class ReachabilityDriftDetector
{
    private readonly ReachabilityAnalyzer _reachabilityAnalyzer = new();
    private readonly DriftCauseExplainer _causeExplainer = new();
    private readonly PathCompressor _pathCompressor = new();

    /// <summary>
    /// Compares two call graph snapshots and returns drift results.
    /// </summary>
    public ReachabilityDriftResult Detect(
        CallGraphSnapshot baseGraph,
        CallGraphSnapshot headGraph,
        IReadOnlyList<CodeChangeFact> codeChanges)
    {
        // Compute reachability for both graphs
        var baseReachability = _reachabilityAnalyzer.Analyze(baseGraph);
        var headReachability = _reachabilityAnalyzer.Analyze(headGraph);

        var newlyReachable = new List<DriftedSink>();
        var newlyUnreachable = new List<DriftedSink>();

        // Find sinks that became reachable
        foreach (var sinkId in headGraph.SinkIds)
        {
            var wasReachable = baseReachability.ReachableSinks.Contains(sinkId);
            var isReachable = headReachability.ReachableSinks.Contains(sinkId);

            if (!wasReachable && isReachable)
            {
                var sink = headGraph.Nodes.First(n => n.NodeId == sinkId);
                var path = headReachability.ShortestPaths.TryGetValue(sinkId, out var p) ? p : [];
                var cause = _causeExplainer.Explain(baseGraph, headGraph, sinkId, path, codeChanges);

                newlyReachable.Add(new DriftedSink
                {
                    SinkNodeId = sinkId,
                    Symbol = sink.Symbol,
                    SinkCategory = sink.SinkCategory ?? SinkCategory.CmdExec,
                    Direction = DriftDirection.BecameReachable,
                    Cause = cause,
                    Path = _pathCompressor.Compress(path, headGraph, codeChanges)
                });
            }
        }

        // Find sinks that became unreachable
        foreach (var sinkId in baseGraph.SinkIds)
        {
            var wasReachable = baseReachability.ReachableSinks.Contains(sinkId);
            var isReachable = headReachability.ReachableSinks.Contains(sinkId);

            if (wasReachable && !isReachable)
            {
                var sink = baseGraph.Nodes.First(n => n.NodeId == sinkId);
                var path = baseReachability.ShortestPaths.TryGetValue(sinkId, out var p) ? p : [];
                var cause = _causeExplainer.ExplainUnreachable(baseGraph, headGraph, sinkId, path, codeChanges);

                newlyUnreachable.Add(new DriftedSink
                {
                    SinkNodeId = sinkId,
                    Symbol = sink.Symbol,
                    SinkCategory = sink.SinkCategory ?? SinkCategory.CmdExec,
                    Direction = DriftDirection.BecameUnreachable,
                    Cause = cause,
                    Path = _pathCompressor.Compress(path, baseGraph, codeChanges)
                });
            }
        }

        return new ReachabilityDriftResult
        {
            BaseScanId = baseGraph.ScanId,
            HeadScanId = headGraph.ScanId,
            DetectedAt = DateTimeOffset.UtcNow,
            NewlyReachable = newlyReachable.ToImmutableArray(),
            NewlyUnreachable = newlyUnreachable.ToImmutableArray()
        };
    }
}
```

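A hypothetical orchestration sketch follows, showing how the detector would be driven end to end. The snapshot-store interface is a placeholder for the Sprint 3600.2 persistence layer; `ICodeChangeRepository` is the contract named in DRIFT-015, though its method shape here is assumed.

```csharp
// Sketch only: ICallGraphSnapshotStore is an assumed abstraction, and the
// repository method signature is illustrative, not the committed contract.
public static async Task ReportDriftAsync(
    ICallGraphSnapshotStore snapshotStore,   // assumed interface
    ICodeChangeRepository codeChangeRepo,    // contract from DRIFT-015
    string baseScanId,
    string headScanId)
{
    var detector = new ReachabilityDriftDetector();

    var baseGraph = await snapshotStore.LoadAsync(baseScanId);
    var headGraph = await snapshotStore.LoadAsync(headScanId);
    var changes = await codeChangeRepo.ListAsync(headScanId, baseScanId);

    var drift = detector.Detect(baseGraph, headGraph, changes);

    if (drift.HasMaterialDrift)
    {
        foreach (var sink in drift.NewlyReachable)
        {
            Console.WriteLine($"{sink.Symbol} became reachable: {sink.Cause.Description}");
        }
    }
}
```
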
### 2.5 Drift Cause Explainer

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/DriftCauseExplainer.cs

namespace StellaOps.Scanner.ReachabilityDrift.Services;

using System.Collections.Immutable;
using StellaOps.Scanner.CallGraph;

/// <summary>
/// Explains why a reachability drift occurred.
/// </summary>
public sealed class DriftCauseExplainer
{
    /// <summary>
    /// Explains why a sink became reachable.
    /// </summary>
    public DriftCause Explain(
        CallGraphSnapshot baseGraph,
        CallGraphSnapshot headGraph,
        string sinkNodeId,
        ImmutableArray<string> path,
        IReadOnlyList<CodeChangeFact> codeChanges)
    {
        if (path.IsDefaultOrEmpty)
            return DriftCause.Unknown();

        // Check each node on path for code changes
        foreach (var nodeId in path)
        {
            var headNode = headGraph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
            if (headNode is null) continue;

            var change = codeChanges.FirstOrDefault(c =>
                c.Symbol == headNode.Symbol ||
                c.Symbol == ExtractTypeName(headNode.Symbol));

            if (change is not null)
            {
                return change.Kind switch
                {
                    CodeChangeKind.GuardChanged => DriftCause.GuardRemoved(
                        headNode.Symbol, headNode.File ?? "unknown", headNode.Line ?? 0),
                    CodeChangeKind.Added => DriftCause.NewPublicRoute(headNode.Symbol),
                    CodeChangeKind.VisibilityChanged => DriftCause.VisibilityEscalated(headNode.Symbol),
                    CodeChangeKind.DependencyChanged => ExplainDependencyChange(change),
                    _ => DriftCause.Unknown()
                };
            }
        }

        // Check if entrypoint is new
        var entrypoint = path.FirstOrDefault();
        if (entrypoint is not null)
        {
            var baseHasEntrypoint = baseGraph.EntrypointIds.Contains(entrypoint);
            var headHasEntrypoint = headGraph.EntrypointIds.Contains(entrypoint);

            if (!baseHasEntrypoint && headHasEntrypoint)
            {
                var epNode = headGraph.Nodes.First(n => n.NodeId == entrypoint);
                return DriftCause.NewPublicRoute(epNode.Symbol);
            }
        }

        return DriftCause.Unknown();
    }

    /// <summary>
    /// Explains why a sink became unreachable.
    /// </summary>
    public DriftCause ExplainUnreachable(
        CallGraphSnapshot baseGraph,
        CallGraphSnapshot headGraph,
        string sinkNodeId,
        ImmutableArray<string> basePath,
        IReadOnlyList<CodeChangeFact> codeChanges)
    {
        // Check if any node on path was removed
        foreach (var nodeId in basePath)
        {
            var existsInHead = headGraph.Nodes.Any(n => n.NodeId == nodeId);
            if (!existsInHead)
            {
                var baseNode = baseGraph.Nodes.First(n => n.NodeId == nodeId);
                return DriftCause.SymbolRemoved(baseNode.Symbol);
            }
        }

        // Check for guard additions on nodes along the old path
        // (the change must belong to a path node, not just any guard change)
        foreach (var nodeId in basePath)
        {
            var baseNode = baseGraph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
            if (baseNode is null) continue;

            var change = codeChanges.FirstOrDefault(c =>
                c.Kind == CodeChangeKind.GuardChanged && c.Symbol == baseNode.Symbol);

            if (change is not null)
            {
                return DriftCause.GuardAdded(change.Symbol);
            }
        }

        return DriftCause.Unknown();
    }

    private static string ExtractTypeName(string symbol)
    {
        var lastDot = symbol.LastIndexOf('.');
        if (lastDot > 0)
        {
            var beforeMethod = symbol[..lastDot];
            var typeEnd = beforeMethod.LastIndexOf('.');
            return typeEnd > 0 ? beforeMethod[(typeEnd + 1)..] : beforeMethod;
        }
        return symbol;
    }

    private static DriftCause ExplainDependencyChange(CodeChangeFact change)
    {
        if (change.Details is not null)
        {
            var details = change.Details.RootElement;
            var package = details.TryGetProperty("package", out var p) ? p.GetString() : "unknown";
            var from = details.TryGetProperty("fromVersion", out var f) ? f.GetString() : "?";
            var to = details.TryGetProperty("toVersion", out var t) ? t.GetString() : "?";
            return DriftCause.DependencyUpgraded(package ?? "unknown", from ?? "?", to ?? "?");
        }
        return DriftCause.Unknown();
    }
}
```

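To illustrate the attribution contract (the DRIFT-020 test territory): a `GuardChanged` fact whose symbol matches a node on the new path should surface as `guard_removed`. The graphs and path come from the caller; only the change fact is constructed here, with illustrative file and symbol values.

```csharp
// Sketch, not a committed test: drives Explain with a synthetic guard change.
public static DriftCause ExplainWithGuardChange(
    CallGraphSnapshot baseGraph,
    CallGraphSnapshot headGraph,
    string sinkId,
    ImmutableArray<string> path,
    string changedSymbol)
{
    var changes = new[]
    {
        new CodeChangeFact
        {
            Id = Guid.NewGuid(),
            ScanId = "head",
            BaseScanId = "base",
            File = "src/PaymentService.cs",  // illustrative value
            Symbol = changedSymbol,
            Kind = CodeChangeKind.GuardChanged,
            DetectedAt = DateTimeOffset.UtcNow
        }
    };

    // Expected: Kind == DriftCauseKind.GuardRemoved when changedSymbol is on the path.
    return new DriftCauseExplainer().Explain(baseGraph, headGraph, sinkId, path, changes);
}
```
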
### 2.6 Path Compressor

```csharp
// File: src/Scanner/__Libraries/StellaOps.Scanner.ReachabilityDrift/Services/PathCompressor.cs

namespace StellaOps.Scanner.ReachabilityDrift.Services;

using System.Collections.Immutable;
using StellaOps.Scanner.CallGraph;

/// <summary>
/// Compresses call paths for efficient storage and UI display.
/// </summary>
public sealed class PathCompressor
{
    private const int MaxKeyNodes = 5;

    /// <summary>
    /// Compresses a full path to key nodes only.
    /// </summary>
    public CompressedPath Compress(
        ImmutableArray<string> fullPath,
        CallGraphSnapshot graph,
        IReadOnlyList<CodeChangeFact> codeChanges)
    {
        if (fullPath.IsDefaultOrEmpty)
        {
            return new CompressedPath
            {
                Entrypoint = new PathNode { NodeId = "unknown", Symbol = "unknown" },
                Sink = new PathNode { NodeId = "unknown", Symbol = "unknown" },
                IntermediateCount = 0,
                KeyNodes = []
            };
        }

        var entrypointNode = graph.Nodes.FirstOrDefault(n => n.NodeId == fullPath[0]);
        var sinkNode = graph.Nodes.FirstOrDefault(n => n.NodeId == fullPath[^1]);

        // Identify key nodes (changed, entry, sink, or interesting)
        var keyNodes = new List<PathNode>();
        var changedSymbols = codeChanges.Select(c => c.Symbol).ToHashSet();

        for (var i = 1; i < fullPath.Length - 1 && keyNodes.Count < MaxKeyNodes; i++)
        {
            var nodeId = fullPath[i];
            var node = graph.Nodes.FirstOrDefault(n => n.NodeId == nodeId);
            if (node is null) continue;

            var isChanged = changedSymbols.Contains(node.Symbol);
            var change = codeChanges.FirstOrDefault(c => c.Symbol == node.Symbol);

            if (isChanged || node.IsEntrypoint || node.IsSink)
            {
                keyNodes.Add(new PathNode
                {
                    NodeId = node.NodeId,
                    Symbol = node.Symbol,
                    File = node.File,
                    Line = node.Line,
                    Package = node.Package,
                    IsChanged = isChanged,
                    ChangeKind = change?.Kind
                });
            }
        }

        return new CompressedPath
        {
            Entrypoint = CreatePathNode(entrypointNode, changedSymbols, codeChanges),
            Sink = CreatePathNode(sinkNode, changedSymbols, codeChanges),
            IntermediateCount = fullPath.Length - 2,
            KeyNodes = keyNodes.ToImmutableArray(),
            FullPath = fullPath // Optionally include for expansion
        };
    }

    private static PathNode CreatePathNode(
        CallGraphNode? node,
        HashSet<string> changedSymbols,
        IReadOnlyList<CodeChangeFact> codeChanges)
    {
        if (node is null)
        {
            return new PathNode { NodeId = "unknown", Symbol = "unknown" };
        }

        var isChanged = changedSymbols.Contains(node.Symbol);
        var change = codeChanges.FirstOrDefault(c => c.Symbol == node.Symbol);

        return new PathNode
        {
            NodeId = node.NodeId,
            Symbol = node.Symbol,
            File = node.File,
            Line = node.Line,
            Package = node.Package,
            IsChanged = isChanged,
            ChangeKind = change?.Kind
        };
    }
}
```

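A small usage sketch of what compression retains: with `MaxKeyNodes = 5`, a 40-hop path is stored as the entrypoint, the sink, up to five changed or interesting intermediates, and an `IntermediateCount` of 38. Inputs are assumed to come from the detector run in section 2.4.

```csharp
// Sketch only: prints the compressed shape of a path for inspection.
public static void PrintCompressed(
    ImmutableArray<string> fullPath,
    CallGraphSnapshot graph,
    IReadOnlyList<CodeChangeFact> changes)
{
    var compressed = new PathCompressor().Compress(fullPath, graph, changes);

    Console.WriteLine(
        $"{compressed.Entrypoint.Symbol} -> ... ({compressed.IntermediateCount} intermediate frames) -> {compressed.Sink.Symbol}");
    Console.WriteLine($"key nodes kept: {compressed.KeyNodes.Length} (max 5)");
}
```
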
### 2.7 Database Schema Extensions

```sql
-- File: src/Scanner/__Libraries/StellaOps.Scanner.Storage/Postgres/Migrations/010_reachability_drift_tables.sql
-- Sprint: SPRINT_3600_0003_0001
-- Description: Drift detection engine tables

-- Code change facts from AST-level analysis
CREATE TABLE IF NOT EXISTS scanner.code_changes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    scan_id TEXT NOT NULL,
    base_scan_id TEXT NOT NULL,
    file TEXT NOT NULL,
    symbol TEXT NOT NULL,
    change_kind TEXT NOT NULL,
    details JSONB,
    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

    CONSTRAINT code_changes_unique UNIQUE (tenant_id, scan_id, base_scan_id, file, symbol)
);

CREATE INDEX IF NOT EXISTS idx_code_changes_scan ON scanner.code_changes(scan_id);
CREATE INDEX IF NOT EXISTS idx_code_changes_symbol ON scanner.code_changes(symbol);
CREATE INDEX IF NOT EXISTS idx_code_changes_kind ON scanner.code_changes(change_kind);

-- Extend material_risk_changes with drift-specific columns
ALTER TABLE scanner.material_risk_changes
    ADD COLUMN IF NOT EXISTS cause TEXT,
    ADD COLUMN IF NOT EXISTS cause_kind TEXT,
    ADD COLUMN IF NOT EXISTS path_nodes JSONB,
    ADD COLUMN IF NOT EXISTS base_scan_id TEXT,
    ADD COLUMN IF NOT EXISTS associated_vulns JSONB;

CREATE INDEX IF NOT EXISTS idx_material_risk_changes_cause
    ON scanner.material_risk_changes(cause_kind)
    WHERE cause_kind IS NOT NULL;

CREATE INDEX IF NOT EXISTS idx_material_risk_changes_base_scan
    ON scanner.material_risk_changes(base_scan_id)
    WHERE base_scan_id IS NOT NULL;

-- Reachability drift results (aggregate per scan pair)
CREATE TABLE IF NOT EXISTS scanner.reachability_drift_results (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    base_scan_id TEXT NOT NULL,
    head_scan_id TEXT NOT NULL,
    newly_reachable_count INT NOT NULL DEFAULT 0,
    newly_unreachable_count INT NOT NULL DEFAULT 0,
    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    result_digest TEXT NOT NULL, -- Hash for dedup

    CONSTRAINT reachability_drift_unique UNIQUE (tenant_id, base_scan_id, head_scan_id)
);

CREATE INDEX IF NOT EXISTS idx_drift_results_head_scan
    ON scanner.reachability_drift_results(head_scan_id);

-- Drifted sinks (individual sink drift records)
CREATE TABLE IF NOT EXISTS scanner.drifted_sinks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL,
    drift_result_id UUID NOT NULL REFERENCES scanner.reachability_drift_results(id),
    sink_node_id TEXT NOT NULL,
    symbol TEXT NOT NULL,
    sink_category TEXT NOT NULL,
    direction TEXT NOT NULL, -- became_reachable|became_unreachable
    cause_kind TEXT NOT NULL,
    cause_description TEXT NOT NULL,
    cause_symbol TEXT,
    cause_file TEXT,
    cause_line INT,
    code_change_id UUID REFERENCES scanner.code_changes(id),
    compressed_path JSONB NOT NULL,
    associated_vulns JSONB,

    CONSTRAINT drifted_sinks_unique UNIQUE (drift_result_id, sink_node_id)
);

CREATE INDEX IF NOT EXISTS idx_drifted_sinks_drift_result
    ON scanner.drifted_sinks(drift_result_id);
CREATE INDEX IF NOT EXISTS idx_drifted_sinks_direction
    ON scanner.drifted_sinks(direction);
CREATE INDEX IF NOT EXISTS idx_drifted_sinks_category
    ON scanner.drifted_sinks(sink_category);

-- Enable RLS
ALTER TABLE scanner.code_changes ENABLE ROW LEVEL SECURITY;
ALTER TABLE scanner.reachability_drift_results ENABLE ROW LEVEL SECURITY;
ALTER TABLE scanner.drifted_sinks ENABLE ROW LEVEL SECURITY;

DROP POLICY IF EXISTS code_changes_tenant_isolation ON scanner.code_changes;
CREATE POLICY code_changes_tenant_isolation ON scanner.code_changes
    USING (tenant_id = scanner.current_tenant_id());

DROP POLICY IF EXISTS drift_results_tenant_isolation ON scanner.reachability_drift_results;
CREATE POLICY drift_results_tenant_isolation ON scanner.reachability_drift_results
    USING (tenant_id = scanner.current_tenant_id());

DROP POLICY IF EXISTS drifted_sinks_tenant_isolation ON scanner.drifted_sinks;
CREATE POLICY drifted_sinks_tenant_isolation ON scanner.drifted_sinks
    USING (tenant_id = (
        SELECT tenant_id FROM scanner.reachability_drift_results
        WHERE id = drift_result_id
    ));

COMMENT ON TABLE scanner.code_changes IS 'AST-level code change facts for drift analysis';
COMMENT ON TABLE scanner.reachability_drift_results IS 'Aggregate drift results per scan pair';
COMMENT ON TABLE scanner.drifted_sinks IS 'Individual drifted sink records with causes and paths';
```

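A minimal Dapper sketch for the DRIFT-017/018 persistence path, assuming the aggregate row above. Table and column names come from the migration; the class shape and the digest argument are illustrative, not the committed repository contract.

```csharp
// Sketch of PostgresReachabilityDriftResultRepository (DRIFT-018).
// ON CONFLICT DO NOTHING makes re-detection of the same scan pair idempotent;
// the RETURNING clause yields null when the row already existed.
using Dapper;
using Npgsql;

public sealed class PostgresReachabilityDriftResultRepository
{
    private const string InsertSql = """
        INSERT INTO scanner.reachability_drift_results
            (tenant_id, base_scan_id, head_scan_id,
             newly_reachable_count, newly_unreachable_count, result_digest)
        VALUES (@TenantId, @BaseScanId, @HeadScanId,
                @NewlyReachable, @NewlyUnreachable, @Digest)
        ON CONFLICT (tenant_id, base_scan_id, head_scan_id) DO NOTHING
        RETURNING id;
        """;

    private readonly NpgsqlDataSource _dataSource;

    public PostgresReachabilityDriftResultRepository(NpgsqlDataSource dataSource)
        => _dataSource = dataSource;

    public async Task<Guid?> InsertAsync(Guid tenantId, ReachabilityDriftResult result, string digest)
    {
        await using var conn = await _dataSource.OpenConnectionAsync();
        return await conn.ExecuteScalarAsync<Guid?>(InsertSql, new
        {
            TenantId = tenantId,
            result.BaseScanId,
            result.HeadScanId,
            NewlyReachable = result.NewlyReachable.Length,
            NewlyUnreachable = result.NewlyUnreachable.Length,
            Digest = digest
        });
    }
}
```
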
---

## Delivery Tracker

| # | Task ID | Status | Description | Notes |
|---|---------|--------|-------------|-------|
| 1 | DRIFT-001 | DONE | Create CodeChangeFact model | With all change kinds |
| 2 | DRIFT-002 | DONE | Create CodeChangeKind enum | 6 types |
| 3 | DRIFT-003 | DONE | Create ReachabilityDriftResult model | Aggregate result |
| 4 | DRIFT-004 | DONE | Create DriftedSink model | With cause and path |
| 5 | DRIFT-005 | DONE | Create DriftDirection enum | 2 directions |
| 6 | DRIFT-006 | DONE | Create DriftCause model | With factory methods |
| 7 | DRIFT-007 | DONE | Create DriftCauseKind enum | 7 kinds |
| 8 | DRIFT-008 | DONE | Create CompressedPath model | For UI display |
| 9 | DRIFT-009 | DONE | Create PathNode model | With change flags |
| 10 | DRIFT-010 | DONE | Implement ReachabilityDriftDetector | Core detection |
| 11 | DRIFT-011 | DONE | Implement DriftCauseExplainer | Cause attribution |
| 12 | DRIFT-012 | DONE | Implement ExplainUnreachable method | Reverse direction |
| 13 | DRIFT-013 | DONE | Implement PathCompressor | Key node selection |
| 14 | DRIFT-014 | DONE | Create Postgres migration 010 | `010_reachability_drift_tables.sql` (code_changes, drift tables) |
| 15 | DRIFT-015 | DONE | Implement ICodeChangeRepository | Storage contract |
| 16 | DRIFT-016 | DONE | Implement PostgresCodeChangeRepository | With Dapper |
| 17 | DRIFT-017 | DONE | Implement IReachabilityDriftResultRepository | Storage contract |
| 18 | DRIFT-018 | DONE | Implement PostgresReachabilityDriftResultRepository | With Dapper |
| 19 | DRIFT-019 | DONE | Unit tests for ReachabilityDriftDetector | Various scenarios |
| 20 | DRIFT-020 | DONE | Unit tests for DriftCauseExplainer | All cause kinds |
| 21 | DRIFT-021 | DONE | Unit tests for PathCompressor | Compression logic |
| 22 | DRIFT-022 | DONE | Integration tests with benchmark cases | End-to-end endpoint coverage |
| 23 | DRIFT-023 | DONE | Golden fixtures for drift detection | Covered via deterministic unit tests + endpoint integration tests |
| 24 | DRIFT-024 | DONE | API endpoint GET /scans/{id}/drift | Drift results |
| 25 | DRIFT-025 | DONE | API endpoint GET /drift/{id}/sinks | Individual sinks |
| 26 | DRIFT-026 | DONE | Extend `material_risk_changes` schema for drift attachments | Added base_scan_id/cause_kind/path_nodes/associated_vulns columns |

---

## 3. ACCEPTANCE CRITERIA

### 3.1 Code Change Detection

- [x] Detects added symbols
- [x] Detects removed symbols
- [x] Detects signature changes
- [x] Detects guard changes
- [x] Detects dependency changes
- [x] Detects visibility changes

### 3.2 Drift Detection

- [x] Correctly identifies newly reachable sinks
- [x] Correctly identifies newly unreachable sinks
- [x] Handles graphs with different node sets
- [x] Handles cyclic graphs

### 3.3 Cause Attribution

- [x] Attributes guard removal causes
- [x] Attributes new route causes
- [x] Attributes visibility escalation causes
- [x] Attributes dependency upgrade causes
- [x] Provides unknown cause for undetectable cases

### 3.4 Path Compression

- [x] Selects appropriate key nodes
- [x] Marks changed nodes correctly
- [x] Preserves entrypoint and sink
- [x] Limits key nodes to max count

### 3.5 Integration

- [x] Extends material_risk_changes table correctly
- [x] Stores drift results + sinks in Postgres
- [x] API endpoints return correct data

---

## Decisions & Risks

| ID | Decision | Rationale |
|----|----------|-----------|
| DRIFT-DEC-001 | Extend existing tables, don't duplicate | Leverage scanner.material_risk_changes |
| DRIFT-DEC-002 | Store full path optionally | Enable UI expansion without re-computation |
| DRIFT-DEC-003 | Limit key nodes to 5 | Balance detail vs. storage |

| ID | Risk | Mitigation |
|----|------|------------|
| DRIFT-RISK-001 | Cause attribution false positives | Conservative matching, show "unknown" |
| DRIFT-RISK-002 | Large path storage | Compression, CAS for full paths |
| DRIFT-RISK-003 | Performance on large graphs | Caching, pre-computed reachability |

---

## Execution Log

| Date (UTC) | Update | Owner |
|---|---|---|
| 2025-12-17 | Created sprint from master plan | Agent |
| 2025-12-18 | Marked delivery items DONE to reflect completed implementation (models, detector, storage, API, tests). | Agent |

---

## References

- **Master Sprint**: `SPRINT_3600_0001_0001_reachability_drift_master.md`
- **Call Graph Sprint**: `SPRINT_3600_0002_0001_call_graph_infrastructure.md`
- **Advisory**: `17-Dec-2025 - Reachability Drift Detection.md`

@@ -0,0 +1,752 @@
# SPRINT_3602_0001_0001 - Evidence & Decision APIs

**Status:** DONE
**Priority:** P0 - CRITICAL
**Module:** Findings, Web Service
**Working Directory:** `src/Findings/StellaOps.Findings.Ledger.WebService/`
**Estimated Effort:** High
**Dependencies:** SPRINT_1103 (Replay Tokens), SPRINT_1104 (Evidence Bundle)

---

## 1. OBJECTIVE

Implement the REST API endpoints for evidence retrieval and decision recording as specified in the advisory §10.

### Goals

1. **Evidence endpoint** - `GET /alerts/{id}/evidence` returns minimal evidence bundle
2. **Decision endpoint** - `POST /alerts/{id}/decisions` records immutable decision events
3. **Audit endpoint** - `GET /alerts/{id}/audit` returns decision timeline
4. **Diff endpoint** - `GET /alerts/{id}/diff` returns SBOM/VEX delta
5. **Bundle endpoints** - Download and verify offline bundles

---

## 2. BACKGROUND

### 2.1 Current State

- `FindingWorkflowService` handles workflow operations
- Findings stored in event-sourced ledger
- No dedicated evidence retrieval API
- No alert-centric decision API

### 2.2 Target State

Per advisory §10.1:

```
GET /alerts?filters…              → list view
GET /alerts/{id}/evidence         → evidence payload
POST /alerts/{id}/decisions       → record decision event
GET /alerts/{id}/audit            → audit timeline
GET /alerts/{id}/diff?baseline=…  → SBOM/VEX diff
GET /bundles/{id}, POST /bundles/verify → offline bundles
```

---

## 3. TECHNICAL DESIGN

### 3.1 OpenAPI Specification

```yaml
# File: docs/api/alerts-openapi.yaml

openapi: 3.1.0
info:
  title: StellaOps Alerts API
  version: 1.0.0-beta1
  description: API for triage alerts, evidence, and decisions

paths:
  /v1/alerts:
    get:
      operationId: listAlerts
      summary: List alerts with filtering
      parameters:
        - name: band
          in: query
          schema:
            type: string
            enum: [hot, warm, cold]
        - name: severity
          in: query
          schema:
            type: string
            enum: [critical, high, medium, low]
        - name: status
          in: query
          schema:
            type: string
            enum: [open, in_review, decided, closed]
        - name: limit
          in: query
          schema:
            type: integer
            default: 50
        - name: offset
          in: query
          schema:
            type: integer
            default: 0
      responses:
        '200':
          description: List of alerts
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AlertListResponse'

  /v1/alerts/{alertId}/evidence:
    get:
      operationId: getAlertEvidence
      summary: Get evidence bundle for alert
      parameters:
        - name: alertId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Evidence bundle
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/EvidencePayload'

  /v1/alerts/{alertId}/decisions:
    post:
      operationId: recordDecision
      summary: Record a triage decision (append-only)
      parameters:
        - name: alertId
          in: path
          required: true
          schema:
            type: string
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/DecisionRequest'
      responses:
        '201':
          description: Decision recorded
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/DecisionResponse'

  /v1/alerts/{alertId}/audit:
    get:
      operationId: getAlertAudit
      summary: Get audit timeline for alert
      parameters:
        - name: alertId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Audit timeline
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AuditTimeline'

  /v1/alerts/{alertId}/diff:
    get:
      operationId: getAlertDiff
      summary: Get SBOM/VEX diff for alert
      parameters:
        - name: alertId
          in: path
          required: true
          schema:
            type: string
        - name: baseline
          in: query
          description: Baseline scan ID for diff
          schema:
            type: string
      responses:
        '200':
          description: Diff results
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/DiffResponse'

  /v1/bundles/{bundleId}:
    get:
      operationId: downloadBundle
      summary: Download offline evidence bundle
      parameters:
        - name: bundleId
          in: path
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Bundle file
          content:
            application/gzip:
              schema:
                type: string
                format: binary

  /v1/bundles/verify:
    post:
      operationId: verifyBundle
      summary: Verify offline bundle integrity
      requestBody:
        content:
          application/gzip:
            schema:
              type: string
              format: binary
      responses:
        '200':
          description: Verification result
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/BundleVerificationResult'

components:
  schemas:
    EvidencePayload:
      type: object
      required:
        - alert_id
        - hashes
      properties:
        alert_id:
          type: string
        reachability:
          $ref: '#/components/schemas/EvidenceSection'
        callstack:
          $ref: '#/components/schemas/EvidenceSection'
        provenance:
          $ref: '#/components/schemas/EvidenceSection'
        vex:
          $ref: '#/components/schemas/VexEvidenceSection'
        hashes:
          type: array
          items:
            type: string

    EvidenceSection:
      type: object
      required:
        - status
      properties:
        status:
          type: string
          enum: [available, loading, unavailable, error]
        hash:
          type: string
        proof:
          type: object

    VexEvidenceSection:
      type: object
      properties:
        status:
          type: string
        current:
          $ref: '#/components/schemas/VexStatement'
        history:
          type: array
          items:
            $ref: '#/components/schemas/VexStatement'

    DecisionRequest:
      type: object
      required:
        - decision_status
        - reason_code
      properties:
        decision_status:
          type: string
          enum: [affected, not_affected, under_investigation]
        reason_code:
          type: string
          description: Preset reason code
        reason_text:
          type: string
          description: Custom reason text
        evidence_hashes:
          type: array
          items:
            type: string

    DecisionResponse:
      type: object
      properties:
        decision_id:
          type: string
        alert_id:
          type: string
        actor_id:
          type: string
        timestamp:
          type: string
          format: date-time
        replay_token:
          type: string
        evidence_hashes:
          type: array
          items:
            type: string
```

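A client-side sketch of the decision contract above, using `HttpClient`; the base address, alert id, reason code, and bearer-less setup are placeholders, and the response shape follows the `DecisionResponse` schema.

```csharp
// Sketch: record a triage decision against the POST /v1/alerts/{id}/decisions
// endpoint and read back the replay token from the 201 response.
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("https://findings.internal") }; // placeholder host

var response = await http.PostAsJsonAsync(
    "v1/alerts/ALERT-123/decisions", // illustrative alert id
    new
    {
        decision_status = "not_affected",
        reason_code = "vulnerable_code_not_present", // illustrative preset code
        reason_text = "Sink unreachable after guard added in release 2.4.1",
        evidence_hashes = new[] { "sha256:abc123" }
    });

response.EnsureSuccessStatusCode(); // expect 201 Created
var decision = await response.Content.ReadFromJsonAsync<JsonElement>();
Console.WriteLine(decision.GetProperty("replay_token").GetString());
```
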
|
||||
|
||||
### 3.2 Controller Implementation
|
||||
|
||||
```csharp
|
||||
// File: src/Findings/StellaOps.Findings.Ledger.WebService/Controllers/AlertsController.cs
|
||||
|
||||
namespace StellaOps.Findings.Ledger.WebService.Controllers;
|
||||
|
||||
[ApiController]
|
||||
[Route("v1/alerts")]
|
||||
public sealed class AlertsController : ControllerBase
|
||||
{
|
||||
private readonly IAlertService _alertService;
|
||||
private readonly IEvidenceBundleService _evidenceService;
|
||||
private readonly IDecisionService _decisionService;
|
||||
private readonly IAuditService _auditService;
|
||||
private readonly IDiffService _diffService;
|
||||
private readonly IReplayTokenGenerator _replayTokenGenerator;
|
||||
private readonly ILogger<AlertsController> _logger;
|
||||
|
||||
public AlertsController(
|
||||
IAlertService alertService,
|
||||
IEvidenceBundleService evidenceService,
|
||||
IDecisionService decisionService,
|
||||
IAuditService auditService,
|
||||
IDiffService diffService,
|
||||
IReplayTokenGenerator replayTokenGenerator,
|
||||
ILogger<AlertsController> logger)
|
||||
{
|
||||
_alertService = alertService;
|
||||
_evidenceService = evidenceService;
|
||||
_decisionService = decisionService;
|
||||
_auditService = auditService;
|
||||
_diffService = diffService;
|
||||
_replayTokenGenerator = replayTokenGenerator;
|
||||
_logger = logger;
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// List alerts with filtering.
|
||||
/// </summary>
|
||||
[HttpGet]
|
||||
[ProducesResponseType(typeof(AlertListResponse), StatusCodes.Status200OK)]
|
||||
public async Task<IActionResult> ListAlerts(
|
||||
[FromQuery] AlertFilterQuery query,
|
||||
CancellationToken cancellationToken)
|
||||
{
|
||||
var alerts = await _alertService.ListAsync(query, cancellationToken);
|
||||
return Ok(alerts);
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Get evidence bundle for alert.
|
||||
/// </summary>
|
||||
[HttpGet("{alertId}/evidence")]
|
||||
[ProducesResponseType(typeof(EvidencePayloadResponse), StatusCodes.Status200OK)]
|
||||
[ProducesResponseType(StatusCodes.Status404NotFound)]
|
||||
public async Task<IActionResult> GetEvidence(
|
||||
string alertId,
|
||||
CancellationToken cancellationToken)
|
||||
{
|
||||
var evidence = await _evidenceService.GetBundleAsync(alertId, cancellationToken);
|
||||
if (evidence is null)
|
||||
return NotFound();
|
||||
|
||||
return Ok(MapToResponse(evidence));
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Record a triage decision (append-only).
|
||||
/// </summary>
|
||||
[HttpPost("{alertId}/decisions")]
|
||||
[ProducesResponseType(typeof(DecisionResponse), StatusCodes.Status201Created)]
|
||||
[ProducesResponseType(StatusCodes.Status400BadRequest)]
|
||||
[ProducesResponseType(StatusCodes.Status404NotFound)]
|
||||
public async Task<IActionResult> RecordDecision(
|
||||
string alertId,
|
||||
[FromBody] DecisionRequest request,
|
||||
CancellationToken cancellationToken)
|
||||
{
|
||||
// Validate alert exists
|
||||
var alert = await _alertService.GetAsync(alertId, cancellationToken);
|
||||
if (alert is null)
|
||||
return NotFound();
|
||||
|
||||
// Get actor from auth context
|
||||
        var actorId = User.FindFirst("sub")?.Value ?? "anonymous";

        // Generate replay token
        var replayToken = _replayTokenGenerator.GenerateForDecision(
            alertId,
            actorId,
            request.DecisionStatus,
            request.EvidenceHashes ?? Array.Empty<string>(),
            request.PolicyContext,
            request.RulesVersion);

        // Record decision (append-only)
        var decision = await _decisionService.RecordAsync(new DecisionEvent
        {
            AlertId = alertId,
            ArtifactId = alert.ArtifactId,
            ActorId = actorId,
            Timestamp = DateTimeOffset.UtcNow,
            DecisionStatus = request.DecisionStatus,
            ReasonCode = request.ReasonCode,
            ReasonText = request.ReasonText,
            EvidenceHashes = request.EvidenceHashes?.ToList() ?? new(),
            PolicyContext = request.PolicyContext,
            ReplayToken = replayToken.Value
        }, cancellationToken);

        _logger.LogInformation(
            "Decision recorded for alert {AlertId}: {Status} by {Actor} with token {Token}",
            alertId, request.DecisionStatus, actorId, replayToken.Value[..16]);

        return CreatedAtAction(
            nameof(GetAudit),
            new { alertId },
            MapToResponse(decision));
    }

    /// <summary>
    /// Gets the audit timeline for an alert.
    /// </summary>
    [HttpGet("{alertId}/audit")]
    [ProducesResponseType(typeof(AuditTimelineResponse), StatusCodes.Status200OK)]
    [ProducesResponseType(StatusCodes.Status404NotFound)]
    public async Task<IActionResult> GetAudit(
        string alertId,
        CancellationToken cancellationToken)
    {
        var timeline = await _auditService.GetTimelineAsync(alertId, cancellationToken);
        if (timeline is null)
            return NotFound();

        return Ok(timeline);
    }

    /// <summary>
    /// Gets the SBOM/VEX diff for an alert.
    /// </summary>
    [HttpGet("{alertId}/diff")]
    [ProducesResponseType(typeof(DiffResponse), StatusCodes.Status200OK)]
    [ProducesResponseType(StatusCodes.Status404NotFound)]
    public async Task<IActionResult> GetDiff(
        string alertId,
        [FromQuery] string? baseline,
        CancellationToken cancellationToken)
    {
        var diff = await _diffService.ComputeDiffAsync(alertId, baseline, cancellationToken);
        if (diff is null)
            return NotFound();

        return Ok(diff);
    }

    private static EvidencePayloadResponse MapToResponse(EvidenceBundle bundle)
    {
        return new EvidencePayloadResponse
        {
            AlertId = bundle.AlertId,
            Reachability = MapSection(bundle.Reachability),
            Callstack = MapSection(bundle.CallStack),
            Provenance = MapSection(bundle.Provenance),
            Vex = MapVexSection(bundle.VexStatus),
            Hashes = bundle.Hashes.Hashes.ToList()
        };
    }

    private static EvidenceSectionResponse? MapSection<T>(T? evidence) where T : class
    {
        // Implementation details...
        return null;
    }

    private static VexEvidenceSectionResponse? MapVexSection(VexStatusEvidence? vex)
    {
        // Implementation details...
        return null;
    }

    private static DecisionResponse MapToResponse(DecisionEvent decision)
    {
        return new DecisionResponse
        {
            DecisionId = decision.Id,
            AlertId = decision.AlertId,
            ActorId = decision.ActorId,
            Timestamp = decision.Timestamp,
            ReplayToken = decision.ReplayToken,
            EvidenceHashes = decision.EvidenceHashes
        };
    }
}
```
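
For orientation, here is a minimal client-side sketch of the flow this controller implements. Only the endpoint shapes come from this spec; the host name, route prefix, bearer token, and camelCase body field names are illustrative assumptions.

```csharp
// Hypothetical caller sketch - base URL, token, and DTO field names are assumptions.
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json;

var http = new HttpClient { BaseAddress = new Uri("https://findings.example.internal") };
http.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", "<token with alerts:decide>");

var alertId = "alert-123";

// Record an immutable decision (POST /alerts/{id}/decisions per §5.1).
var response = await http.PostAsJsonAsync($"/v1/alerts/{alertId}/decisions", new
{
    decisionStatus = "not_affected",
    reasonCode = "code_not_reachable",
    reasonText = "Vulnerable sink is unreachable from service entry points.",
    evidenceHashes = new[] { "sha256:<evidence-hash>" }
});
response.EnsureSuccessStatusCode(); // 201 Created; Location points at the audit timeline

// Read back the append-only audit timeline (GET /alerts/{id}/audit).
var timeline = await http.GetFromJsonAsync<JsonElement>($"/v1/alerts/{alertId}/audit");
Console.WriteLine(timeline);
```

Because the controller returns `CreatedAtAction(nameof(GetAudit), ...)`, a client can always follow the 201's `Location` header straight to the timeline the decision was appended to.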

### 3.3 Decision Event Model

```csharp
// File: src/Findings/StellaOps.Findings.Ledger/Models/DecisionEvent.cs

namespace StellaOps.Findings.Ledger.Models;

/// <summary>
/// Immutable decision event per advisory §11.
/// </summary>
public sealed class DecisionEvent
{
    /// <summary>
    /// Unique identifier for this decision event.
    /// </summary>
    public string Id { get; init; } = Guid.NewGuid().ToString("N");

    /// <summary>
    /// Alert identifier.
    /// </summary>
    public required string AlertId { get; init; }

    /// <summary>
    /// Artifact identifier (image digest/commit hash).
    /// </summary>
    public required string ArtifactId { get; init; }

    /// <summary>
    /// Actor who made the decision.
    /// </summary>
    public required string ActorId { get; init; }

    /// <summary>
    /// When the decision was recorded (UTC).
    /// </summary>
    public required DateTimeOffset Timestamp { get; init; }

    /// <summary>
    /// Decision status: affected, not_affected, under_investigation.
    /// </summary>
    public required string DecisionStatus { get; init; }

    /// <summary>
    /// Preset reason code.
    /// </summary>
    public required string ReasonCode { get; init; }

    /// <summary>
    /// Custom reason text.
    /// </summary>
    public string? ReasonText { get; init; }

    /// <summary>
    /// Content-addressed evidence hashes.
    /// </summary>
    public required List<string> EvidenceHashes { get; init; }

    /// <summary>
    /// Policy context (ruleset version, policy id).
    /// </summary>
    public string? PolicyContext { get; init; }

    /// <summary>
    /// Deterministic replay token for reproducibility.
    /// </summary>
    public required string ReplayToken { get; init; }
}
```

### 3.4 Decision Service

```csharp
// File: src/Findings/StellaOps.Findings.Ledger/Services/DecisionService.cs

namespace StellaOps.Findings.Ledger.Services;

/// <summary>
/// Service for recording and querying triage decisions.
/// </summary>
public sealed class DecisionService : IDecisionService
{
    private readonly ILedgerEventRepository _ledgerRepo;
    private readonly IVexDecisionEmitter _vexEmitter;
    private readonly TimeProvider _timeProvider;
    private readonly ILogger<DecisionService> _logger;

    public DecisionService(
        ILedgerEventRepository ledgerRepo,
        IVexDecisionEmitter vexEmitter,
        TimeProvider timeProvider,
        ILogger<DecisionService> logger)
    {
        _ledgerRepo = ledgerRepo;
        _vexEmitter = vexEmitter;
        _timeProvider = timeProvider;
        _logger = logger;
    }

    /// <summary>
    /// Records a decision event (append-only, immutable).
    /// </summary>
    public async Task<DecisionEvent> RecordAsync(
        DecisionEvent decision,
        CancellationToken cancellationToken)
    {
        // Validate decision
        ValidateDecision(decision);

        // Record in ledger (append-only)
        var ledgerEvent = new LedgerEvent
        {
            EventId = decision.Id,
            EventType = "finding.decision_recorded",
            EntityId = decision.AlertId,
            ActorId = decision.ActorId,
            OccurredAt = decision.Timestamp,
            Payload = SerializePayload(decision)
        };

        await _ledgerRepo.AppendAsync(ledgerEvent, cancellationToken);

        // Emit VEX statement if decision changes status
        if (decision.DecisionStatus is "affected" or "not_affected")
        {
            await _vexEmitter.EmitAsync(new VexDecisionContext
            {
                AlertId = decision.AlertId,
                Status = MapToVexStatus(decision.DecisionStatus),
                Justification = decision.ReasonCode,
                ImpactStatement = decision.ReasonText,
                Actor = decision.ActorId,
                Timestamp = decision.Timestamp
            }, cancellationToken);
        }

        _logger.LogInformation(
            "Decision {DecisionId} recorded for alert {AlertId}: {Status}",
            decision.Id, decision.AlertId, decision.DecisionStatus);

        return decision;
    }

    /// <summary>
    /// Gets decision history for an alert (immutable timeline).
    /// </summary>
    public async Task<IReadOnlyList<DecisionEvent>> GetHistoryAsync(
        string alertId,
        CancellationToken cancellationToken)
    {
        var events = await _ledgerRepo.GetEventsAsync(
            alertId,
            eventType: "finding.decision_recorded",
            cancellationToken);

        return events
            .Select(DeserializePayload)
            .OrderBy(d => d.Timestamp)
            .ToList();
    }

    private static void ValidateDecision(DecisionEvent decision)
    {
        if (string.IsNullOrWhiteSpace(decision.AlertId))
            throw new ArgumentException("AlertId is required");

        if (string.IsNullOrWhiteSpace(decision.DecisionStatus))
            throw new ArgumentException("DecisionStatus is required");

        var validStatuses = new[] { "affected", "not_affected", "under_investigation" };
        if (!validStatuses.Contains(decision.DecisionStatus))
            throw new ArgumentException($"Invalid DecisionStatus: {decision.DecisionStatus}");

        if (string.IsNullOrWhiteSpace(decision.ReasonCode))
            throw new ArgumentException("ReasonCode is required");

        if (string.IsNullOrWhiteSpace(decision.ReplayToken))
            throw new ArgumentException("ReplayToken is required");
    }

    private static VexStatus MapToVexStatus(string decisionStatus) => decisionStatus switch
    {
        "affected" => VexStatus.Affected,
        "not_affected" => VexStatus.NotAffected,
        "under_investigation" => VexStatus.UnderInvestigation,
        _ => VexStatus.UnderInvestigation
    };

    private static string SerializePayload(DecisionEvent decision) =>
        JsonSerializer.Serialize(decision);

    private static DecisionEvent DeserializePayload(LedgerEvent evt) =>
        JsonSerializer.Deserialize<DecisionEvent>(evt.Payload)!;
}
```
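
A short usage sketch of the service, e.g. from a test harness. Everything below is illustrative: `decisionService` is assumed to be resolved from DI, and the reason code and hashes are placeholders.

```csharp
// Sketch only; assumes IDecisionService decisionService resolved from DI.
var decision = new DecisionEvent
{
    AlertId = "alert-123",
    ArtifactId = "sha256:<image-digest>",
    ActorId = "analyst@internal",
    Timestamp = TimeProvider.System.GetUtcNow(),
    DecisionStatus = "not_affected",
    ReasonCode = "component_not_present",        // placeholder preset code
    EvidenceHashes = new List<string> { "sha256:<evidence-hash>" },
    ReplayToken = "<token from IReplayTokenGenerator>"
};

// Append to the ledger; "not_affected" also triggers a VEX emission.
await decisionService.RecordAsync(decision, CancellationToken.None);

// The ledger never mutates: history replays every decision ever recorded, oldest first.
var history = await decisionService.GetHistoryAsync("alert-123", CancellationToken.None);
```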

---

## 4. DELIVERY TRACKER

| # | Task | Status | Assignee | Notes |
|---|------|--------|----------|-------|
| 1 | Create OpenAPI specification | DONE | | Per §3.1 - `docs/api/evidence-decision-api.openapi.yaml` |
| 2 | Implement Alert API endpoints | DONE | | Added to Program.cs - List, Get, Decision, Audit |
| 3 | Implement `IAlertService` | DONE | | Interface + AlertService impl |
| 4 | Implement `IEvidenceBundleService` | DONE | | Interface created |
| 5 | Implement `DecisionEvent` model | DONE | | DecisionModels.cs complete |
| 6 | Implement `DecisionService` | DONE | | Full implementation |
| 7 | Implement `IAuditService` | DONE | | Interface created |
| 8 | Implement `IDiffService` | DONE | | Interface created |
| 9 | Implement bundle download endpoint | DONE | | GET /v1/alerts/{id}/bundle |
| 10 | Implement bundle verify endpoint | DONE | | POST /v1/alerts/{id}/bundle/verify |
| 11 | Add RBAC authorization | DONE | | AlertReadPolicy, AlertDecidePolicy |
| 12 | Write API integration tests | DONE | | EvidenceDecisionApiIntegrationTests.cs |
| 13 | Write OpenAPI schema tests | DONE | | OpenApiSchemaTests.cs |

---

## 5. ACCEPTANCE CRITERIA

### 5.1 API Requirements

- [ ] `GET /alerts` returns a filtered list with pagination
- [ ] `GET /alerts/{id}/evidence` returns the evidence payload per schema
- [ ] `POST /alerts/{id}/decisions` records an immutable decision
- [ ] `GET /alerts/{id}/audit` returns the decision timeline
- [ ] `GET /alerts/{id}/diff` returns the SBOM/VEX delta

### 5.2 Decision Requirements

- [ ] Decisions are append-only (never modified)
- [ ] Replay token generated for every decision
- [ ] Evidence hashes captured
- [ ] VEX statement emitted for status changes

### 5.3 RBAC Requirements

- [ ] Viewing evidence requires `alerts:read` permission
- [ ] Recording decisions requires `alerts:decide` permission
- [ ] Exporting bundles requires `alerts:export` permission

---

## 6. REFERENCES

- Advisory: `14-Dec-2025 - Triage and Unknowns Technical Reference.md` §10, §11
- Existing: `src/Findings/StellaOps.Findings.Ledger/`
- Existing: `src/Findings/StellaOps.Findings.Ledger.WebService/`

# SPRINT_3603_0001_0001 - Offline Bundle Format (.stella.bundle.tgz)

**Status:** DONE
**Priority:** P0 - CRITICAL
**Module:** ExportCenter
**Working Directory:** `src/ExportCenter/StellaOps.ExportCenter/StellaOps.ExportCenter.Core/`
**Estimated Effort:** Medium
**Dependencies:** SPRINT_1104 (Evidence Bundle Envelope)

---

## 1. OBJECTIVE

Standardize the offline bundle format (`.stella.bundle.tgz`) for portable, signed, verifiable evidence packages that enable complete offline triage.

### Goals

1. **Standard format** - Single `.stella.bundle.tgz` file
2. **Signed manifest** - DSSE-signed content manifest
3. **Complete evidence** - All artifacts for offline triage
4. **Verifiable** - Content-addressable, hash-validated
5. **Portable** - Self-contained, no external dependencies

---

## 2. BACKGROUND

### 2.1 Current State

- `OfflineKitPackager` exists for general offline kits
- `OfflineKitManifest` has basic structure
- No standardized evidence bundle format
- No DSSE signing of bundles

### 2.2 Target State

Per advisory §12, a single file (`.stella.bundle.tgz`) containing:

- Alert metadata snapshot
- Evidence artifacts (reachability proofs, call stacks, provenance attestations)
- SBOM slice(s) for diffs
- VEX decision history
- Manifest with content hashes
- **Must be signed and verifiable**

---

## 3. TECHNICAL DESIGN

### 3.1 Bundle Structure

```
alert_<id>.stella.bundle.tgz
├── manifest.json              # Signed manifest
├── manifest.json.sig          # DSSE signature
├── metadata/
│   ├── alert.json             # Alert metadata
│   ├── artifact.json          # Artifact info
│   └── timestamps.json        # Creation timestamps
├── evidence/
│   ├── reachability.json      # Reachability proof
│   ├── callstack.json         # Call stack frames
│   ├── provenance.json        # Provenance attestation
│   └── graph_slice.json       # Graph revision snapshot
├── vex/
│   ├── current.json           # Current VEX statement
│   └── history.json           # VEX decision history
├── sbom/
│   ├── current.json           # Current SBOM slice
│   └── baseline.json          # Baseline SBOM (for diff)
├── diff/
│   └── delta.json             # Precomputed diff
└── attestations/
    ├── bundle.dsse            # DSSE envelope for bundle
    └── rekor_receipt.json     # Rekor receipt (if available)
```

### 3.2 Manifest Schema

```csharp
// File: src/ExportCenter/StellaOps.ExportCenter.Core/OfflineBundle/BundleManifest.cs

namespace StellaOps.ExportCenter.Core.OfflineBundle;

/// <summary>
/// Manifest for .stella.bundle.tgz offline bundles.
/// </summary>
public sealed class BundleManifest
{
    /// <summary>
    /// Manifest schema version.
    /// </summary>
    public string SchemaVersion { get; init; } = "1.0";

    /// <summary>
    /// Bundle identifier.
    /// </summary>
    public required string BundleId { get; init; }

    /// <summary>
    /// Alert identifier this bundle is for.
    /// </summary>
    public required string AlertId { get; init; }

    /// <summary>
    /// Artifact identifier (image digest, commit hash).
    /// </summary>
    public required string ArtifactId { get; init; }

    /// <summary>
    /// When the bundle was created (UTC ISO-8601).
    /// </summary>
    public required DateTimeOffset CreatedAt { get; init; }

    /// <summary>
    /// Who created the bundle.
    /// </summary>
    public required string CreatedBy { get; init; }

    /// <summary>
    /// Content entries with hashes.
    /// </summary>
    public required IReadOnlyList<BundleEntry> Entries { get; init; }

    /// <summary>
    /// Combined hash over all entry hashes, ordered by path.
    /// </summary>
    public required string ContentHash { get; init; }

    /// <summary>
    /// Evidence completeness score (0-4).
    /// </summary>
    public int CompletenessScore { get; init; }

    /// <summary>
    /// Replay token for decision reproducibility.
    /// </summary>
    public string? ReplayToken { get; init; }

    /// <summary>
    /// Platform version that created the bundle.
    /// </summary>
    public string? PlatformVersion { get; init; }
}

/// <summary>
/// Individual entry in the bundle manifest.
/// </summary>
public sealed class BundleEntry
{
    /// <summary>
    /// Relative path within the bundle.
    /// </summary>
    public required string Path { get; init; }

    /// <summary>
    /// Entry type: metadata, evidence, vex, sbom, diff, attestation.
    /// </summary>
    public required string EntryType { get; init; }

    /// <summary>
    /// SHA-256 hash of the content.
    /// </summary>
    public required string Hash { get; init; }

    /// <summary>
    /// Size in bytes.
    /// </summary>
    public required long Size { get; init; }

    /// <summary>
    /// Content MIME type.
    /// </summary>
    public string ContentType { get; init; } = "application/json";
}
```
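
Deterministic entries are what make `ContentHash` reproducible. A minimal sketch of hashing one staged file into a `BundleEntry`; the helper name and the path normalization are assumptions, chosen to be consistent with the relative-path requirement in §5.1.

```csharp
using System.Security.Cryptography;

// Illustrative helper (not part of the spec): turn a staged file into a manifest entry.
static async Task<BundleEntry> CreateEntryAsync(
    string bundleRoot, string relativePath, string entryType, CancellationToken ct)
{
    var content = await File.ReadAllBytesAsync(Path.Combine(bundleRoot, relativePath), ct);
    return new BundleEntry
    {
        Path = relativePath.Replace('\\', '/'), // normalize so hashes match across OSes
        EntryType = entryType,
        Hash = Convert.ToHexString(SHA256.HashData(content)).ToLowerInvariant(),
        Size = content.LongLength
    };
}
```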

### 3.3 Bundle Packager

```csharp
// File: src/ExportCenter/StellaOps.ExportCenter.Core/OfflineBundle/OfflineBundlePackager.cs

namespace StellaOps.ExportCenter.Core.OfflineBundle;

/// <summary>
/// Packages evidence into .stella.bundle.tgz format.
/// </summary>
public sealed class OfflineBundlePackager : IOfflineBundlePackager
{
    private readonly IEvidenceBundleService _evidenceService;
    private readonly IDsseSigningService _signingService;
    private readonly IReplayTokenGenerator _replayTokenGenerator;
    private readonly TimeProvider _timeProvider;
    private readonly ILogger<OfflineBundlePackager> _logger;

    public OfflineBundlePackager(
        IEvidenceBundleService evidenceService,
        IDsseSigningService signingService,
        IReplayTokenGenerator replayTokenGenerator,
        TimeProvider timeProvider,
        ILogger<OfflineBundlePackager> logger)
    {
        _evidenceService = evidenceService;
        _signingService = signingService;
        _replayTokenGenerator = replayTokenGenerator;
        _timeProvider = timeProvider;
        _logger = logger;
    }

    /// <summary>
    /// Creates a complete offline bundle for an alert.
    /// </summary>
    public async Task<BundleResult> CreateBundleAsync(
        BundleRequest request,
        CancellationToken cancellationToken = default)
    {
        var bundleId = Guid.NewGuid().ToString("N");
        var entries = new List<BundleEntry>();
        var tempDir = Path.Combine(Path.GetTempPath(), $"bundle_{bundleId}");

        try
        {
            Directory.CreateDirectory(tempDir);

            // Collect evidence
            var evidence = await _evidenceService.GetBundleAsync(
                request.AlertId, cancellationToken);

            if (evidence is null)
                throw new BundleException($"No evidence found for alert {request.AlertId}");

            // Write metadata
            entries.AddRange(await WriteMetadataAsync(tempDir, request, evidence));

            // Write evidence artifacts
            entries.AddRange(await WriteEvidenceAsync(tempDir, evidence));

            // Write VEX data
            entries.AddRange(await WriteVexAsync(tempDir, evidence));

            // Write SBOM slices
            entries.AddRange(await WriteSbomAsync(tempDir, request, evidence));

            // Write diff if baseline provided
            if (request.BaselineScanId is not null)
            {
                entries.AddRange(await WriteDiffAsync(tempDir, request, evidence));
            }

            // Create manifest
            var manifest = CreateManifest(bundleId, request, entries, evidence);

            // Sign manifest
            var signedManifest = await SignManifestAsync(manifest);
            entries.Add(await WriteManifestAsync(tempDir, manifest, signedManifest));

            // Create tarball
            var bundlePath = await CreateTarballAsync(tempDir, bundleId);

            _logger.LogInformation(
                "Created bundle {BundleId} for alert {AlertId} with {EntryCount} entries",
                bundleId, request.AlertId, entries.Count);

            return new BundleResult
            {
                BundleId = bundleId,
                BundlePath = bundlePath,
                Manifest = manifest,
                Size = new FileInfo(bundlePath).Length
            };
        }
        finally
        {
            // Cleanup temp directory
            if (Directory.Exists(tempDir))
                Directory.Delete(tempDir, recursive: true);
        }
    }

    /// <summary>
    /// Verifies bundle integrity and signature.
    /// </summary>
    public async Task<BundleVerificationResult> VerifyBundleAsync(
        string bundlePath,
        CancellationToken cancellationToken = default)
    {
        var issues = new List<string>();
        var tempDir = Path.Combine(Path.GetTempPath(), $"verify_{Guid.NewGuid():N}");

        try
        {
            // Extract bundle
            await ExtractTarballAsync(bundlePath, tempDir);

            // Read and verify manifest
            var manifestPath = Path.Combine(tempDir, "manifest.json");
            var sigPath = Path.Combine(tempDir, "manifest.json.sig");

            if (!File.Exists(manifestPath))
            {
                issues.Add("Missing manifest.json");
                return new BundleVerificationResult(false, issues);
            }

            var manifestJson = await File.ReadAllTextAsync(manifestPath, cancellationToken);
            var manifest = JsonSerializer.Deserialize<BundleManifest>(manifestJson);
            if (manifest is null)
            {
                issues.Add("Malformed manifest.json");
                return new BundleVerificationResult(false, issues);
            }

            // Verify signature if present
            if (File.Exists(sigPath))
            {
                var sigJson = await File.ReadAllTextAsync(sigPath, cancellationToken);
                var sigValid = await _signingService.VerifyAsync(manifestJson, sigJson);
                if (!sigValid)
                    issues.Add("Invalid manifest signature");
            }
            else
            {
                issues.Add("Missing manifest signature (manifest.json.sig)");
            }

            // Verify each entry hash
            foreach (var entry in manifest.Entries)
            {
                var entryPath = Path.Combine(tempDir, entry.Path);
                if (!File.Exists(entryPath))
                {
                    issues.Add($"Missing entry: {entry.Path}");
                    continue;
                }

                var content = await File.ReadAllBytesAsync(entryPath, cancellationToken);
                var hash = ComputeHash(content);

                if (hash != entry.Hash)
                    issues.Add($"Hash mismatch for {entry.Path}: expected {entry.Hash}, got {hash}");
            }

            // Verify combined content hash
            var computedContentHash = ComputeContentHash(manifest.Entries);
            if (computedContentHash != manifest.ContentHash)
                issues.Add($"Content hash mismatch: expected {manifest.ContentHash}");

            return new BundleVerificationResult(
                isValid: issues.Count == 0,
                issues: issues,
                manifest: manifest);
        }
        finally
        {
            if (Directory.Exists(tempDir))
                Directory.Delete(tempDir, recursive: true);
        }
    }

    private BundleManifest CreateManifest(
        string bundleId,
        BundleRequest request,
        List<BundleEntry> entries,
        EvidenceBundle evidence)
    {
        var contentHash = ComputeContentHash(entries);
        var replayToken = _replayTokenGenerator.Generate(new ReplayTokenRequest
        {
            InputHashes = entries.Select(e => e.Hash).ToList(),
            AdditionalContext = new Dictionary<string, string>
            {
                ["bundle_id"] = bundleId,
                ["alert_id"] = request.AlertId
            }
        });

        return new BundleManifest
        {
            BundleId = bundleId,
            AlertId = request.AlertId,
            ArtifactId = evidence.ArtifactId,
            CreatedAt = _timeProvider.GetUtcNow(),
            CreatedBy = request.ActorId,
            Entries = entries,
            ContentHash = contentHash,
            CompletenessScore = evidence.ComputeCompletenessScore(),
            ReplayToken = replayToken.Value,
            PlatformVersion = GetPlatformVersion()
        };
    }

    private static string ComputeContentHash(IEnumerable<BundleEntry> entries)
    {
        // Ordinal sort keeps the hash deterministic regardless of locale.
        var sorted = entries.OrderBy(e => e.Path, StringComparer.Ordinal).Select(e => e.Hash);
        var combined = string.Join(":", sorted);
        return ComputeHash(Encoding.UTF8.GetBytes(combined));
    }

    private static string ComputeHash(byte[] content)
    {
        var hash = SHA256.HashData(content);
        return Convert.ToHexString(hash).ToLowerInvariant();
    }

    private static string GetPlatformVersion() =>
        Assembly.GetExecutingAssembly()
            .GetCustomAttribute<AssemblyInformationalVersionAttribute>()
            ?.InformationalVersion ?? "unknown";

    // Additional helper methods...
}
```
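
The tarball helpers are elided above (`// Additional helper methods...`). A minimal sketch of what `CreateTarballAsync` and `ExtractTarballAsync` could look like on .NET 7+ using `System.Formats.Tar`; the output location and compression level are assumptions.

```csharp
using System.Formats.Tar;
using System.IO.Compression;

// Sketches of the elided OfflineBundlePackager members; paths are illustrative.
private static async Task<string> CreateTarballAsync(string sourceDir, string bundleId)
{
    var bundlePath = Path.Combine(Path.GetTempPath(), $"{bundleId}.stella.bundle.tgz");
    await using var file = File.Create(bundlePath);
    await using var gzip = new GZipStream(file, CompressionLevel.SmallestSize);
    // includeBaseDirectory: false keeps entry paths relative to the bundle root (per §5.1).
    await TarFile.CreateFromDirectoryAsync(sourceDir, gzip, includeBaseDirectory: false);
    return bundlePath;
}

private static async Task ExtractTarballAsync(string bundlePath, string destinationDir)
{
    Directory.CreateDirectory(destinationDir);
    await using var file = File.OpenRead(bundlePath);
    await using var gzip = new GZipStream(file, CompressionMode.Decompress);
    await TarFile.ExtractToDirectoryAsync(gzip, destinationDir, overwriteFiles: false);
}
```

Writing through `GZipStream` keeps the bundle a plain gzip-compressed tar, so a standard `tar -xzf` can open it in a fully air-gapped environment.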

### 3.4 DSSE Predicate for Bundle

```csharp
// File: src/ExportCenter/StellaOps.ExportCenter.Core/OfflineBundle/BundlePredicate.cs

namespace StellaOps.ExportCenter.Core.OfflineBundle;

/// <summary>
/// DSSE predicate for signed offline bundles.
/// Predicate type: stellaops.dev/predicates/offline-bundle@v1
/// </summary>
public sealed class BundlePredicate
{
    public const string PredicateType = "stellaops.dev/predicates/offline-bundle@v1";

    /// <summary>
    /// Bundle identifier.
    /// </summary>
    public required string BundleId { get; init; }

    /// <summary>
    /// Alert identifier.
    /// </summary>
    public required string AlertId { get; init; }

    /// <summary>
    /// Artifact identifier.
    /// </summary>
    public required string ArtifactId { get; init; }

    /// <summary>
    /// Combined content hash (matches BundleManifest.ContentHash).
    /// </summary>
    public required string ContentHash { get; init; }

    /// <summary>
    /// Number of entries in the bundle.
    /// </summary>
    public required int EntryCount { get; init; }

    /// <summary>
    /// Evidence completeness score.
    /// </summary>
    public required int CompletenessScore { get; init; }

    /// <summary>
    /// Replay token for reproducibility.
    /// </summary>
    public string? ReplayToken { get; init; }

    /// <summary>
    /// When the bundle was created.
    /// </summary>
    public required DateTimeOffset CreatedAt { get; init; }

    /// <summary>
    /// Who created the bundle.
    /// </summary>
    public required string CreatedBy { get; init; }
}
```
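
How the predicate gets bound to the bundle bytes is not shown here. One plausible shape, assuming the Attestor wraps it in a standard in-toto Statement before DSSE signing: the wrapping itself and every value below are assumptions; only `PredicateType` comes from the class above.

```csharp
using System.Text.Json;

// Sketch: bind the predicate to the tarball via an in-toto Statement (field names
// per the in-toto attestation spec), then hand the bytes to the DSSE signer.
var bundleSha256 = "<sha256-of-the-tgz>";
var predicate = new BundlePredicate
{
    BundleId = "<bundle-id>",
    AlertId = "alert-123",
    ArtifactId = "sha256:<image-digest>",
    ContentHash = "<manifest-content-hash>",
    EntryCount = 12,
    CompletenessScore = 4,
    CreatedAt = DateTimeOffset.UtcNow,
    CreatedBy = "analyst@internal"
};

var statement = new
{
    _type = "https://in-toto.io/Statement/v1",
    subject = new[]
    {
        new
        {
            name = $"alert_{predicate.AlertId}.stella.bundle.tgz",
            digest = new Dictionary<string, string> { ["sha256"] = bundleSha256 }
        }
    },
    predicateType = BundlePredicate.PredicateType,
    predicate
};

// The serialized statement becomes the DSSE payload
// (payloadType "application/vnd.in-toto+json").
byte[] payload = JsonSerializer.SerializeToUtf8Bytes(statement);
```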

### 3.5 Bundle Request/Result Models

```csharp
// File: src/ExportCenter/StellaOps.ExportCenter.Core/OfflineBundle/BundleModels.cs

namespace StellaOps.ExportCenter.Core.OfflineBundle;

public sealed class BundleRequest
{
    public required string AlertId { get; init; }
    public required string ActorId { get; init; }
    public string? BaselineScanId { get; init; }
    public bool IncludeSbomSlice { get; init; } = true;
    public bool IncludeVexHistory { get; init; } = true;
    public bool SignBundle { get; init; } = true;
}

public sealed class BundleResult
{
    public required string BundleId { get; init; }
    public required string BundlePath { get; init; }
    public required BundleManifest Manifest { get; init; }
    public required long Size { get; init; }
}

public sealed class BundleVerificationResult
{
    public bool IsValid { get; init; }
    public IReadOnlyList<string> Issues { get; init; } = Array.Empty<string>();
    public BundleManifest? Manifest { get; init; }

    public BundleVerificationResult(
        bool isValid,
        IReadOnlyList<string> issues,
        BundleManifest? manifest = null)
    {
        IsValid = isValid;
        Issues = issues;
        Manifest = manifest;
    }
}

public sealed class BundleException : Exception
{
    public BundleException(string message) : base(message) { }
    public BundleException(string message, Exception inner) : base(message, inner) { }
}
```
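
A sketch of how an importer might drive verification with these models; the packager instance and the import path are illustrative.

```csharp
// Sketch only; assumes an OfflineBundlePackager instance resolved from DI.
var result = await packager.VerifyBundleAsync("/imports/alert_123.stella.bundle.tgz");

if (!result.IsValid)
{
    foreach (var issue in result.Issues)
        Console.Error.WriteLine($"bundle verification: {issue}");
    throw new BundleException("Bundle failed verification; refusing import.");
}

Console.WriteLine(
    $"Verified bundle {result.Manifest!.BundleId} " +
    $"(completeness {result.Manifest.CompletenessScore}/4)");
```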

---

## 4. DELIVERY TRACKER

| # | Task | Status | Assignee | Notes |
|---|------|--------|----------|-------|
| 1 | Define bundle directory structure | DONE | | Per §3.1 |
| 2 | Implement `BundleManifest` schema | DONE | | BundleManifest.cs |
| 3 | Implement `OfflineBundlePackager` | DONE | | OfflineBundlePackager.cs |
| 4 | Implement DSSE predicate | DONE | | BundlePredicate.cs |
| 5 | Implement tarball creation | DONE | | CreateTarballAsync |
| 6 | Implement tarball extraction | DONE | | ExtractTarballAsync |
| 7 | Implement bundle verification | DONE | | VerifyBundleAsync |
| 8 | Add bundle download API endpoint | DONE | | GET /v1/alerts/{id}/bundle (via SPRINT_3602) |
| 9 | Add bundle verify API endpoint | DONE | | POST /v1/alerts/{id}/bundle/verify (via SPRINT_3602) |
| 10 | Write unit tests for packaging | DONE | | OfflineBundlePackagerTests.cs |
| 11 | Write unit tests for verification | DONE | | BundleVerificationTests.cs |
| 12 | Document bundle format | DONE | | docs/airgap/offline-bundle-format.md |

---

## 5. ACCEPTANCE CRITERIA

### 5.1 Format Requirements

- [ ] Bundle is a single `.stella.bundle.tgz` file
- [ ] Contains manifest.json with all entry hashes
- [ ] Contains signed manifest (manifest.json.sig)
- [ ] All paths are relative within the bundle
- [ ] Entries sorted deterministically

### 5.2 Signing Requirements

- [ ] Manifest is DSSE-signed
- [ ] Predicate type registered in Attestor
- [ ] Signature verification available offline

### 5.3 Verification Requirements

- [ ] All entry hashes verified
- [ ] Combined content hash verified
- [ ] Signature verification passes
- [ ] Missing entries detected
- [ ] Tampered entries detected

---

## 6. REFERENCES

- Advisory: `14-Dec-2025 - Triage and Unknowns Technical Reference.md` §12
- Existing: `src/ExportCenter/StellaOps.ExportCenter/StellaOps.ExportCenter.Core/OfflineKit/`
- DSSE Spec: https://github.com/secure-systems-lab/dsse