diff --git a/docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md b/docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md
new file mode 100644
index 000000000..c10d5395f
--- /dev/null
+++ b/docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md
@@ -0,0 +1,77 @@
+# Sprint 20260311_001 - Graph Remote Localization Startup Nonblocking
+
+## Topic & Scope
+- Remove the scratch-setup startup bottleneck where Graph API can stay dark for an extended period while remote localization overrides load before Kestrel binds.
+- Treat remote translation bundles as optional startup enrichment, not a dependency that can hold a service offline during a fresh compose bootstrap.
+- Verify the fix with focused localization-library tests, a rebuilt Graph image, and live service/browser checks on the scratch stack.
+- Working directory: `src/__Libraries/StellaOps.Localization`.
+- Allowed coordination edits: `src/Graph/**`, `src/__Libraries/__Tests/**`, `devops/compose/**`, `docs/modules/graph/architecture.md`, `docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md`.
+- Expected evidence: targeted localization test output, rebuilt Graph runtime health, and live verification artifacts showing the scratch stack no longer masks the startup fault.
+
+## Dependencies & Concurrency
+- Depends on the existing scratch-reset stack being up so the late-start Graph behavior can be reproduced and rechecked.
+- Safe parallelism: stay inside the localization library, Graph service, and the listed docs; avoid unrelated web search or component-revival slices.
+
+## Documentation Prerequisites
+- `AGENTS.md`
+- `src/Graph/AGENTS.md`
+- `docs/modules/graph/architecture.md`
+- `docs/qa/feature-checks/FLOW.md`
+
+## Delivery Tracker
+
+### GRAPH-LOC-001 - Diagnose the real startup gate
+Status: DONE
+Dependency: none
+Owners: QA, Developer
+Task description:
+- Reproduce the Graph startup fault from the scratch stack and separate product failures from harness noise.
+- Capture why the container can stay unhealthy during scratch setup even though the same binary later starts when rerun interactively.
+
+Completion criteria:
+- [x] Container/runtime evidence shows where startup is being gated.
+- [x] The diagnosis identifies the shared-library behavior that needs correction.
+
+### GRAPH-LOC-002 - Make remote localization startup-safe
+Status: DONE
+Dependency: GRAPH-LOC-001
+Owners: Architect, Developer
+Task description:
+- Change the shared localization bootstrap so remote bundle overrides are bounded and parallelized per provider, preserving deterministic merge order while preventing optional remote fetches from serially blocking service readiness.
+- Keep the contract library-centric so Graph is fixed through the real root cause rather than a service-specific workaround.
+
+Completion criteria:
+- [x] Remote bundle fetches have an explicit bounded timeout.
+- [x] Translation registry no longer serially waits per locale for a single provider.
+- [x] Focused tests cover timeout handling and concurrent locale loading.
+
+### GRAPH-LOC-003 - Rebuild and prove the scratch-stack behavior
+Status: DONE
+Dependency: GRAPH-LOC-002
+Owners: QA
+Task description:
+- Rebuild the affected runtime, redeploy the live stack, and verify Graph startup and the related UI surface on the scratch environment.
+- Record the new behavior in sprint evidence and module docs.
+
+Completion criteria:
+- [x] Graph container becomes healthy promptly after redeploy.
+- [x] Focused live checks confirm the reachability/security surfaces no longer surface backend-unavailable fallback on this defect path.
+- [x] Docs and sprint log reflect the startup contract change.
+
+## Execution Log
+| Date (UTC) | Update | Owner |
+| --- | --- | --- |
+| 2026-03-11 | Sprint created after a fresh scratch rebuild showed `stellaops-graph-api` remaining unhealthy while the frontdoor route sweep stayed green. | Developer |
+| 2026-03-11 | Reproduced that the Graph binary starts normally on host and in-container when rerun interactively, but the scratch container can stay dark for a long interval before eventually binding. The shared startup gate is `LoadTranslationsAsync()` calling remote bundle overrides before `Run()`, with one remote fetch per locale executed serially. | QA |
+| 2026-03-11 | Implemented the shared-library fix in `StellaOps.Localization`: remote bundle fetches now use a bounded per-request timeout and locale loads run concurrently within a provider while merging back in deterministic order. Added focused tests in `src/__Libraries/__Tests/StellaOps.Localization.Tests` covering timeout fallback and concurrent load behavior. | Developer |
+| 2026-03-11 | Verified the fix on the live scratch stack by rebuilding only `graph-api`, stopping Platform, force-recreating the Graph container, and confirming immediate recovery: `stellaops-graph-api` reported `healthy` and `GET http://127.1.0.20/healthz` returned `200` while Platform was still down. Then brought Platform back and ran a live authenticated Playwright check on `/security/supply-chain-data/graph`, which passed with zero console errors, zero request failures, and zero error responses. | QA |
+
+## Decisions & Risks
+- Decision: fix the startup contract in `StellaOps.Localization` instead of adding Graph-only retries, because remote translation overrides are used by many services and should never gate service availability during scratch bootstrap.
+- Risk: changing translation loading order could accidentally alter merge determinism.
+- Mitigation: keep provider priority ordering intact, parallelize only within a provider, and merge results back in deterministic locale order.
+- Decision: bounded remote translation fetches default to a short timeout because remote overrides are optional enrichment; if Platform is unavailable during scratch bootstrap, services must prefer embedded bundles and come online instead of waiting unboundedly on localization.
+
+## Next Checkpoints
+- Add focused localization tests before changing runtime behavior.
+- Rebuild the Graph image and redeploy the stack immediately after the library fix.
diff --git a/docs/modules/graph/architecture.md b/docs/modules/graph/architecture.md
index a04e79df4..1f152a6e7 100644
--- a/docs/modules/graph/architecture.md
+++ b/docs/modules/graph/architecture.md
@@ -68,6 +68,7 @@ The edge metadata system provides explainability for graph relationships:
 - Graph API now initializes localization via `AddStellaOpsLocalization(...)`, `AddTranslationBundle(...)`, `AddRemoteTranslationBundles()`, `UseStellaOpsLocalization()`, and `LoadTranslationsAsync()`.
 - Locale resolution order for API messages is deterministic: `X-Locale` header -> `Accept-Language` header -> default locale (`en-US`).
 - Translation layering is deterministic: shared embedded `common` bundle -> Graph embedded bundle (`Translations/*.graph.json`) -> Platform runtime override bundle.
+- Remote Platform override fetches are bounded and loaded concurrently per provider locale so scratch bootstrap cannot hold the Graph API offline while optional translation overrides load.
 - This rollout localizes selected error paths (for example, edge/export not found, invalid reason, and tenant/auth validation text) for `en-US` and `de-DE`.
 
 ## 4) Storage considerations
diff --git a/src/__Libraries/StellaOps.Localization/RemoteBundleProvider.cs b/src/__Libraries/StellaOps.Localization/RemoteBundleProvider.cs
index 335e29fe3..45fb1f7f9 100644
--- a/src/__Libraries/StellaOps.Localization/RemoteBundleProvider.cs
+++ b/src/__Libraries/StellaOps.Localization/RemoteBundleProvider.cs
@@ -11,6 +11,7 @@ namespace StellaOps.Localization;
 /// </summary>
 public sealed class RemoteBundleProvider : ITranslationBundleProvider
 {
+    private static readonly TimeSpan DefaultRequestTimeout = TimeSpan.FromSeconds(3);
     private readonly TranslationOptions _options;
     private readonly IHttpClientFactory? _httpClientFactory;
     private readonly ILogger _logger;
@@ -45,7 +46,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         try
         {
             var client = _httpClientFactory.CreateClient("StellaOpsLocalization");
-            var response = await client.GetAsync(url, ct).ConfigureAwait(false);
+            var requestTimeout = ResolveRequestTimeout();
+            client.Timeout = requestTimeout;
+
+            using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+            timeoutCts.CancelAfter(requestTimeout);
+
+            var response = await client.GetAsync(url, timeoutCts.Token).ConfigureAwait(false);
 
             if (!response.IsSuccessStatusCode)
             {
@@ -78,6 +85,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         return Task.FromResult<IReadOnlyList<string>>([]);
     }
 
+    private TimeSpan ResolveRequestTimeout()
+    {
+        return _options.RemoteBundleRequestTimeout > TimeSpan.Zero
+            ? _options.RemoteBundleRequestTimeout
+            : DefaultRequestTimeout;
+    }
+
     private sealed class RemoteBundleResponse
     {
         public string? Locale { get; set; }
diff --git a/src/__Libraries/StellaOps.Localization/TranslationOptions.cs b/src/__Libraries/StellaOps.Localization/TranslationOptions.cs
index d01c46bf4..4c0045fc3 100644
--- a/src/__Libraries/StellaOps.Localization/TranslationOptions.cs
+++ b/src/__Libraries/StellaOps.Localization/TranslationOptions.cs
@@ -22,6 +22,9 @@ public sealed class TranslationOptions
     /// <summary>Cache TTL for remote bundles.</summary>
     public TimeSpan RemoteBundleCacheDuration { get; set; } = TimeSpan.FromMinutes(30);
 
+    /// <summary>Maximum time to wait for a single remote bundle request during startup.</summary>
+    public TimeSpan RemoteBundleRequestTimeout { get; set; } = TimeSpan.FromSeconds(3);
+
     /// <summary>Whether to return the key as fallback when translation is missing.</summary>
     public bool ReturnKeyWhenMissing { get; set; } = true;
 }
diff --git a/src/__Libraries/StellaOps.Localization/TranslationRegistry.cs b/src/__Libraries/StellaOps.Localization/TranslationRegistry.cs
index 217d7ca80..37edcf243 100644
--- a/src/__Libraries/StellaOps.Localization/TranslationRegistry.cs
+++ b/src/__Libraries/StellaOps.Localization/TranslationRegistry.cs
@@ -44,28 +44,28 @@ public sealed class TranslationRegistry
         // Ensure default locale is always loaded
         allLocales.Add(_options.DefaultLocale);
 
-        // Load bundles in priority order (lower first, higher overwrites)
+        // Load bundles in priority order (lower first, higher overwrites).
+        // Locales within the same provider are independent, so load them concurrently
+        // and merge back in deterministic locale order.
         foreach (var provider in ordered)
         {
-            foreach (var locale in allLocales)
+            var loadTasks = allLocales
+                .OrderBy(locale => locale, StringComparer.OrdinalIgnoreCase)
+                .Select(locale => LoadProviderBundleAsync(provider, locale, ct))
+                .ToArray();
+
+            var results = await Task.WhenAll(loadTasks).ConfigureAwait(false);
+            foreach (var result in results)
             {
-                try
+                if (result.Bundle.Count == 0)
                 {
-                    var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
-                    if (bundle.Count > 0)
-                    {
-                        MergeBundles(locale, bundle);
-                        _logger.LogDebug(
-                            "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
-                            bundle.Count, locale, provider.Priority);
-                    }
-                }
-                catch (Exception ex)
-                {
-                    _logger.LogWarning(ex,
-                        "Failed to load translations for locale {Locale} from provider (priority {Priority})",
-                        locale, provider.Priority);
+                    continue;
                 }
+
+                MergeBundles(result.Locale, result.Bundle);
+                _logger.LogDebug(
+                    "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
+                    result.Bundle.Count, result.Locale, provider.Priority);
             }
         }
 
@@ -75,6 +75,27 @@ public sealed class TranslationRegistry
             _store.Count, totalKeys);
     }
 
+    private async Task<ProviderLocaleBundle> LoadProviderBundleAsync(
+        ITranslationBundleProvider provider,
+        string locale,
+        CancellationToken ct)
+    {
+        try
+        {
+            var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
+            return new ProviderLocaleBundle(locale, bundle);
+        }
+        catch (Exception ex)
+        {
+            _logger.LogWarning(ex,
+                "Failed to load translations for locale {Locale} from provider (priority {Priority})",
+                locale, provider.Priority);
+            return new ProviderLocaleBundle(
+                locale,
+                new Dictionary<string, string>(StringComparer.Ordinal));
+        }
+    }
+
     /// <summary>
     /// Merges a bundle into the store. Higher-priority values overwrite lower.
     /// </summary>
@@ -248,4 +269,6 @@ public sealed class TranslationRegistry
             _ => value.ToString() ?? string.Empty
         };
     }
+
+    private sealed record ProviderLocaleBundle(string Locale, IReadOnlyDictionary<string, string> Bundle);
 }
diff --git a/src/__Libraries/__Tests/StellaOps.Localization.Tests/RemoteBundleProviderTests.cs b/src/__Libraries/__Tests/StellaOps.Localization.Tests/RemoteBundleProviderTests.cs
new file mode 100644
index 000000000..4e17aa076
--- /dev/null
+++ b/src/__Libraries/__Tests/StellaOps.Localization.Tests/RemoteBundleProviderTests.cs
@@ -0,0 +1,58 @@
+using System.Diagnostics;
+using System.Net;
+using System.Net.Http;
+using System.Net.Http.Json;
+using Microsoft.Extensions.Logging.Abstractions;
+using Microsoft.Extensions.Options;
+using StellaOps.Localization;
+
+namespace StellaOps.Localization.Tests;
+
+public sealed class RemoteBundleProviderTests
+{
+    [Fact]
+    public async Task LoadAsync_ReturnsEmptyBundle_WhenRemoteFetchTimesOut()
+    {
+        using var client = new HttpClient(new BlockingMessageHandler());
+        var provider = new RemoteBundleProvider(
+            Options.Create(new TranslationOptions
+            {
+                EnableRemoteBundles = true,
+                RemoteBundleUrl = "http://platform.stella-ops.local",
+                RemoteBundleRequestTimeout = TimeSpan.FromMilliseconds(100)
+            }),
+            NullLogger.Instance,
+            new FixedHttpClientFactory(client));
+
+        var stopwatch = Stopwatch.StartNew();
+        var bundle = await provider.LoadAsync("en-US", TestContext.Current.CancellationToken);
+        stopwatch.Stop();
+
+        Assert.Empty(bundle);
+        Assert.InRange(stopwatch.Elapsed, TimeSpan.Zero, TimeSpan.FromSeconds(1));
+    }
+
+    private sealed class FixedHttpClientFactory : IHttpClientFactory
+    {
+        private readonly HttpClient _client;
+
+        public FixedHttpClientFactory(HttpClient client)
+        {
+            _client = client;
+        }
+
+        public HttpClient CreateClient(string name) => _client;
+    }
+
+    private sealed class BlockingMessageHandler : HttpMessageHandler
+    {
+        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
+        {
+            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
+            return new HttpResponseMessage(HttpStatusCode.OK)
+            {
+                Content = JsonContent.Create(new { locale = "en-US", strings = new Dictionary<string, string>() })
+            };
+        }
+    }
+}
diff --git a/src/__Libraries/__Tests/StellaOps.Localization.Tests/StellaOps.Localization.Tests.csproj b/src/__Libraries/__Tests/StellaOps.Localization.Tests/StellaOps.Localization.Tests.csproj
new file mode 100644
index 000000000..6e3dbeeb0
--- /dev/null
+++ b/src/__Libraries/__Tests/StellaOps.Localization.Tests/StellaOps.Localization.Tests.csproj
@@ -0,0 +1,12 @@
+
+
+
+    net10.0
+    enable
+    enable
+    false
+
+
+
+
+
diff --git a/src/__Libraries/__Tests/StellaOps.Localization.Tests/TranslationRegistryTests.cs b/src/__Libraries/__Tests/StellaOps.Localization.Tests/TranslationRegistryTests.cs
new file mode 100644
index 000000000..7e0372742
--- /dev/null
+++ b/src/__Libraries/__Tests/StellaOps.Localization.Tests/TranslationRegistryTests.cs
@@ -0,0 +1,93 @@
+using Microsoft.Extensions.Logging.Abstractions;
+using Microsoft.Extensions.Options;
+using StellaOps.Localization;
+
+namespace StellaOps.Localization.Tests;
+
+public sealed class TranslationRegistryTests
+{
+    [Fact]
+    public async Task LoadAsync_LoadsLocalesConcurrentlyWithinSingleProvider()
+    {
+        var provider = new BlockingProvider("en-US", "de-DE", "bg-BG");
+        var registry = new TranslationRegistry(
+            Options.Create(new TranslationOptions
+            {
+                DefaultLocale = "en-US",
+                SupportedLocales = ["en-US", "de-DE", "bg-BG"]
+            }),
+            NullLogger.Instance);
+
+        var loadTask = registry.LoadAsync([provider], TestContext.Current.CancellationToken);
+
+        await provider.AllLocalesStarted.Task.WaitAsync(TestContext.Current.CancellationToken);
+        Assert.True(provider.MaxConcurrentLoads >= 2);
+
+        provider.Release();
+        await loadTask;
+
+        Assert.Equal("bundle:de-DE", registry.GetBundle("de-DE")["translation.loaded"]);
+        Assert.Equal("bundle:bg-BG", registry.GetBundle("bg-BG")["translation.loaded"]);
+    }
+
+    private sealed class BlockingProvider : ITranslationBundleProvider
+    {
+        private readonly IReadOnlyList<string> _locales;
+        private readonly TaskCompletionSource _release = new(TaskCreationOptions.RunContinuationsAsynchronously);
+        private int _activeLoads;
+        private int _startedLoads;
+        private int _maxConcurrentLoads;
+
+        public BlockingProvider(params string[] locales)
+        {
+            _locales = locales;
+        }
+
+        public int Priority => 10;
+
+        public int MaxConcurrentLoads => _maxConcurrentLoads;
+
+        public TaskCompletionSource AllLocalesStarted { get; } = new(TaskCreationOptions.RunContinuationsAsynchronously);
+
+        public Task<IReadOnlyList<string>> GetAvailableLocalesAsync(CancellationToken ct)
+            => Task.FromResult(_locales);
+
+        public async Task<IReadOnlyDictionary<string, string>> LoadAsync(string locale, CancellationToken ct)
+        {
+            var concurrentLoads = Interlocked.Increment(ref _activeLoads);
+            UpdateMaxConcurrentLoads(concurrentLoads);
+
+            if (Interlocked.Increment(ref _startedLoads) == _locales.Count)
+            {
+                AllLocalesStarted.TrySetResult();
+            }
+
+            await _release.Task.WaitAsync(ct);
+            Interlocked.Decrement(ref _activeLoads);
+
+            return new Dictionary<string, string>(StringComparer.Ordinal)
+            {
+                ["translation.loaded"] = $"bundle:{locale}"
+            };
+        }
+
+        public void Release() => _release.TrySetResult();
+
+        private void UpdateMaxConcurrentLoads(int concurrentLoads)
+        {
+            while (true)
+            {
+                var snapshot = _maxConcurrentLoads;
+                if (concurrentLoads <= snapshot)
+                {
+                    return;
+                }
+
+                if (Interlocked.CompareExchange(ref _maxConcurrentLoads, concurrentLoads, snapshot) == snapshot)
+                {
+                    return;
+                }
+            }
+        }
+    }
+}