Make remote localization startup non-blocking
@@ -0,0 +1,77 @@
# Sprint 20260311_001 - Graph Remote Localization Startup Nonblocking

## Topic & Scope
- Remove the scratch-setup startup bottleneck where Graph API can stay dark for an extended period while remote localization overrides load before Kestrel binds.
- Treat remote translation bundles as optional startup enrichment, not a dependency that can hold a service offline during a fresh compose bootstrap.
- Verify the fix with focused localization-library tests, a rebuilt Graph image, and live service/browser checks on the scratch stack.
- Working directory: `src/__Libraries/StellaOps.Localization`.
- Allowed coordination edits: `src/Graph/**`, `src/__Libraries/__Tests/**`, `devops/compose/**`, `docs/modules/graph/architecture.md`, `docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md`.
- Expected evidence: targeted localization test output, rebuilt Graph runtime health, and live verification artifacts showing the scratch stack no longer masks the startup fault.

## Dependencies & Concurrency
- Depends on the existing scratch-reset stack being up so the late-start Graph behavior can be reproduced and rechecked.
- Safe parallelism: stay inside the localization library, Graph service, and the listed docs; avoid unrelated web search or component-revival slices.

## Documentation Prerequisites
- `AGENTS.md`
- `src/Graph/AGENTS.md`
- `docs/modules/graph/architecture.md`
- `docs/qa/feature-checks/FLOW.md`

## Delivery Tracker

### GRAPH-LOC-001 - Diagnose the real startup gate
Status: DONE
Dependency: none
Owners: QA, Developer
Task description:
- Reproduce the Graph startup fault from the scratch stack and separate product failures from harness noise.
- Capture why the container can stay unhealthy during scratch setup even though the same binary later starts when rerun interactively.

Completion criteria:
- [x] Container/runtime evidence shows where startup is being gated.
- [x] The diagnosis identifies the shared-library behavior that needs correction.

### GRAPH-LOC-002 - Make remote localization startup-safe
Status: DONE
Dependency: GRAPH-LOC-001
Owners: Architect, Developer
Task description:
- Change the shared localization bootstrap so remote bundle overrides are bounded and parallelized per provider, preserving deterministic merge order while preventing optional remote fetches from serially blocking service readiness.
- Keep the contract library-centric so Graph is fixed through the real root cause rather than a service-specific workaround.

Completion criteria:
- [x] Remote bundle fetches have an explicit bounded timeout.
- [x] Translation registry no longer serially waits per locale for a single provider.
- [x] Focused tests cover timeout handling and concurrent locale loading.

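The bounded-timeout criterion above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's implementation: `BoundedFetch` and `TryFetchAsync` are hypothetical names, and the choice to let caller-initiated cancellation propagate while swallowing only the timeout is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of StellaOps.Localization.
public static class BoundedFetch
{
    private static readonly IReadOnlyDictionary<string, string> Empty =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Runs an optional fetch under a bounded timeout. If only the timeout fires,
    // the result is an empty bundle; cancellation requested by the caller still
    // surfaces as OperationCanceledException.
    public static async Task<IReadOnlyDictionary<string, string>> TryFetchAsync(
        Func<CancellationToken, Task<IReadOnlyDictionary<string, string>>> fetch,
        TimeSpan timeout,
        CancellationToken ct = default)
    {
        // Link to the caller's token so either source can cancel the fetch.
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(timeout);

        try
        {
            return await fetch(timeoutCts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // The timeout fired: treat the remote bundle as absent so that
            // embedded bundles remain the effective translations.
            return Empty;
        }
    }
}
```

Returning an empty bundle instead of throwing keeps the caller's merge loop simple: an empty result is skipped, and startup continues with whatever lower-priority bundles were already loaded.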
### GRAPH-LOC-003 - Rebuild and prove the scratch-stack behavior
Status: DONE
Dependency: GRAPH-LOC-002
Owners: QA
Task description:
- Rebuild the affected runtime, redeploy the live stack, and verify Graph startup and the related UI surface on the scratch environment.
- Record the new behavior in sprint evidence and module docs.

Completion criteria:
- [x] Graph container becomes healthy promptly after redeploy.
- [x] Focused live checks confirm the reachability/security surfaces no longer surface backend-unavailable fallback on this defect path.
- [x] Docs and sprint log reflect the startup contract change.

## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-11 | Sprint created after a fresh scratch rebuild showed `stellaops-graph-api` remaining unhealthy while the frontdoor route sweep stayed green. | Developer |
| 2026-03-11 | Reproduced that the Graph binary starts normally on host and in-container when rerun interactively, but the scratch container can stay dark for a long interval before eventually binding. The shared startup gate is `LoadTranslationsAsync()` calling remote bundle overrides before `Run()`, with one remote fetch per locale executed serially. | QA |
| 2026-03-11 | Implemented the shared-library fix in `StellaOps.Localization`: remote bundle fetches now use a bounded per-request timeout and locale loads run concurrently within a provider while merging back in deterministic order. Added focused tests in `src/__Libraries/__Tests/StellaOps.Localization.Tests` covering timeout fallback and concurrent load behavior. | Developer |
| 2026-03-11 | Verified the fix on the live scratch stack by rebuilding only `graph-api`, stopping Platform, force-recreating the Graph container, and confirming immediate recovery: `stellaops-graph-api` reported `healthy` and `GET http://127.1.0.20/healthz` returned `200` while Platform was still down. Then brought Platform back and ran a live authenticated Playwright check on `/security/supply-chain-data/graph`, which passed with zero console errors, zero request failures, and zero error responses. | QA |

## Decisions & Risks
- Decision: fix the startup contract in `StellaOps.Localization` instead of adding Graph-only retries, because remote translation overrides are used by many services and should never gate service availability during scratch bootstrap.
- Risk: changing translation loading order could accidentally alter merge determinism.
- Mitigation: keep provider priority ordering intact, parallelize only within a provider, and merge results back in deterministic locale order.
- Decision: bounded remote translation fetches default to a short timeout because remote overrides are optional enrichment; if Platform is unavailable during scratch bootstrap, services must prefer embedded bundles and come online instead of waiting unboundedly on localization.

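The mitigation above relies on one property: `Task.WhenAll` returns results in the order of the input tasks, not in completion order. A minimal sketch of that property, assuming a hypothetical `OrderedMerge.LoadAllAsync` helper whose names and signature are illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical illustration; names are not taken from StellaOps.Localization.
public static class OrderedMerge
{
    // Starts every locale load up front, then returns the results in
    // ordinal-ignore-case locale order. Because Task.WhenAll preserves the
    // order of its input tasks, the merge order does not depend on which
    // load happens to finish first.
    public static async Task<List<string>> LoadAllAsync(
        IEnumerable<string> locales,
        Func<string, Task<string>> load)
    {
        var tasks = locales
            .OrderBy(locale => locale, StringComparer.OrdinalIgnoreCase)
            .Select(locale => load(locale))
            .ToArray(); // ToArray forces all loads to start before awaiting.

        var results = await Task.WhenAll(tasks).ConfigureAwait(false);
        return results.ToList();
    }
}
```

A caller that merges these results sequentially sees the same locale order on every run, even when the slowest load belongs to the first locale in sort order.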
## Next Checkpoints
- Add focused localization tests before changing runtime behavior.
- Rebuild the Graph image and redeploy the stack immediately after the library fix.
@@ -68,6 +68,7 @@ The edge metadata system provides explainability for graph relationships:
 - Graph API now initializes localization via `AddStellaOpsLocalization(...)`, `AddTranslationBundle(...)`, `AddRemoteTranslationBundles()`, `UseStellaOpsLocalization()`, and `LoadTranslationsAsync()`.
 - Locale resolution order for API messages is deterministic: `X-Locale` header -> `Accept-Language` header -> default locale (`en-US`).
 - Translation layering is deterministic: shared embedded `common` bundle -> Graph embedded bundle (`Translations/*.graph.json`) -> Platform runtime override bundle.
+- Remote Platform override fetches are bounded and loaded concurrently per provider locale so scratch bootstrap cannot hold the Graph API offline while optional translation overrides load.
 - This rollout localizes selected error paths (for example, edge/export not found, invalid reason, and tenant/auth validation text) for `en-US` and `de-DE`.

 ## 4) Storage considerations
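The locale resolution order stated in the architecture note (`X-Locale` header, then `Accept-Language`, then the default `en-US`) can be illustrated with a small pure function. This is a sketch under assumptions: `LocaleResolution.Resolve` is a hypothetical name, filtering candidates against a supported-locale list is assumed, and `Accept-Language` q-weights are deliberately ignored in favor of listed order.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the documented resolution order;
// not the middleware shipped in StellaOps.Localization.
public static class LocaleResolution
{
    public static string Resolve(
        string? xLocale,
        string? acceptLanguage,
        IReadOnlyCollection<string> supported,
        string defaultLocale = "en-US")
    {
        // 1) An explicit X-Locale header wins when it names a supported locale.
        if (Match(xLocale, supported) is { } fromHeader)
        {
            return fromHeader;
        }

        // 2) Accept-Language tags in listed order (simplification: q-weights ignored).
        foreach (var part in (acceptLanguage ?? string.Empty)
                     .Split(',', StringSplitOptions.RemoveEmptyEntries))
        {
            var tag = part.Split(';')[0].Trim();
            if (Match(tag, supported) is { } fromAccept)
            {
                return fromAccept;
            }
        }

        // 3) Fall back to the default locale.
        return defaultLocale;
    }

    // Returns the supported locale equal to the candidate (case-insensitive), or null.
    private static string? Match(string? candidate, IReadOnlyCollection<string> supported)
    {
        if (string.IsNullOrWhiteSpace(candidate))
        {
            return null;
        }

        var wanted = candidate.Trim();
        return supported.FirstOrDefault(
            s => string.Equals(s, wanted, StringComparison.OrdinalIgnoreCase));
    }
}
```

With `supported = { "en-US", "de-DE" }`, a request carrying `X-Locale: de-DE` resolves to `de-DE` regardless of `Accept-Language`, and a request with neither header resolves to `en-US`.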
@@ -11,6 +11,7 @@ namespace StellaOps.Localization;
 /// </summary>
 public sealed class RemoteBundleProvider : ITranslationBundleProvider
 {
+    private static readonly TimeSpan DefaultRequestTimeout = TimeSpan.FromSeconds(3);
     private readonly TranslationOptions _options;
     private readonly IHttpClientFactory? _httpClientFactory;
     private readonly ILogger<RemoteBundleProvider> _logger;
@@ -45,7 +46,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         try
         {
             var client = _httpClientFactory.CreateClient("StellaOpsLocalization");
-            var response = await client.GetAsync(url, ct).ConfigureAwait(false);
+            var requestTimeout = ResolveRequestTimeout();
+            client.Timeout = requestTimeout;
+
+            using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+            timeoutCts.CancelAfter(requestTimeout);
+
+            var response = await client.GetAsync(url, timeoutCts.Token).ConfigureAwait(false);

             if (!response.IsSuccessStatusCode)
             {
@@ -78,6 +85,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         return Task.FromResult<IReadOnlyList<string>>([]);
     }

+    private TimeSpan ResolveRequestTimeout()
+    {
+        return _options.RemoteBundleRequestTimeout > TimeSpan.Zero
+            ? _options.RemoteBundleRequestTimeout
+            : DefaultRequestTimeout;
+    }
+
     private sealed class RemoteBundleResponse
     {
         public string? Locale { get; set; }
@@ -22,6 +22,9 @@ public sealed class TranslationOptions
     /// <summary>Cache TTL for remote bundles.</summary>
     public TimeSpan RemoteBundleCacheDuration { get; set; } = TimeSpan.FromMinutes(30);

+    /// <summary>Maximum time to wait for a single remote bundle request during startup.</summary>
+    public TimeSpan RemoteBundleRequestTimeout { get; set; } = TimeSpan.FromSeconds(3);
+
     /// <summary>Whether to return the key as fallback when translation is missing.</summary>
     public bool ReturnKeyWhenMissing { get; set; } = true;
 }
@@ -44,28 +44,28 @@ public sealed class TranslationRegistry
         // Ensure default locale is always loaded
         allLocales.Add(_options.DefaultLocale);

-        // Load bundles in priority order (lower first, higher overwrites)
+        // Load bundles in priority order (lower first, higher overwrites).
+        // Locales within the same provider are independent, so load them concurrently
+        // and merge back in deterministic locale order.
         foreach (var provider in ordered)
         {
-            foreach (var locale in allLocales)
+            var loadTasks = allLocales
+                .OrderBy(locale => locale, StringComparer.OrdinalIgnoreCase)
+                .Select(locale => LoadProviderBundleAsync(provider, locale, ct))
+                .ToArray();
+
+            var results = await Task.WhenAll(loadTasks).ConfigureAwait(false);
+            foreach (var result in results)
             {
-                try
+                if (result.Bundle.Count == 0)
                 {
-                    var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
-                    if (bundle.Count > 0)
-                    {
-                        MergeBundles(locale, bundle);
-                        _logger.LogDebug(
-                            "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
-                            bundle.Count, locale, provider.Priority);
-                    }
-                }
-                catch (Exception ex)
-                {
-                    _logger.LogWarning(ex,
-                        "Failed to load translations for locale {Locale} from provider (priority {Priority})",
-                        locale, provider.Priority);
+                    continue;
                 }
+
+                MergeBundles(result.Locale, result.Bundle);
+                _logger.LogDebug(
+                    "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
+                    result.Bundle.Count, result.Locale, provider.Priority);
             }
         }

@@ -75,6 +75,27 @@ public sealed class TranslationRegistry
             _store.Count, totalKeys);
     }

+    private async Task<ProviderLocaleBundle> LoadProviderBundleAsync(
+        ITranslationBundleProvider provider,
+        string locale,
+        CancellationToken ct)
+    {
+        try
+        {
+            var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
+            return new ProviderLocaleBundle(locale, bundle);
+        }
+        catch (Exception ex)
+        {
+            _logger.LogWarning(ex,
+                "Failed to load translations for locale {Locale} from provider (priority {Priority})",
+                locale, provider.Priority);
+            return new ProviderLocaleBundle(
+                locale,
+                new Dictionary<string, string>(StringComparer.Ordinal));
+        }
+    }
+
     /// <summary>
     /// Merges a bundle into the store. Higher-priority values overwrite lower.
     /// </summary>
@@ -248,4 +269,6 @@ public sealed class TranslationRegistry
             _ => value.ToString() ?? string.Empty
         };
     }
+
+    private sealed record ProviderLocaleBundle(string Locale, IReadOnlyDictionary<string, string> Bundle);
 }
@@ -0,0 +1,58 @@
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using StellaOps.Localization;

namespace StellaOps.Localization.Tests;

public sealed class RemoteBundleProviderTests
{
    [Fact]
    public async Task LoadAsync_ReturnsEmptyBundle_WhenRemoteFetchTimesOut()
    {
        using var client = new HttpClient(new BlockingMessageHandler());
        var provider = new RemoteBundleProvider(
            Options.Create(new TranslationOptions
            {
                EnableRemoteBundles = true,
                RemoteBundleUrl = "http://platform.stella-ops.local",
                RemoteBundleRequestTimeout = TimeSpan.FromMilliseconds(100)
            }),
            NullLogger<RemoteBundleProvider>.Instance,
            new FixedHttpClientFactory(client));

        var stopwatch = Stopwatch.StartNew();
        var bundle = await provider.LoadAsync("en-US", TestContext.Current.CancellationToken);
        stopwatch.Stop();

        Assert.Empty(bundle);
        Assert.InRange(stopwatch.Elapsed, TimeSpan.Zero, TimeSpan.FromSeconds(1));
    }

    private sealed class FixedHttpClientFactory : IHttpClientFactory
    {
        private readonly HttpClient _client;

        public FixedHttpClientFactory(HttpClient client)
        {
            _client = client;
        }

        public HttpClient CreateClient(string name) => _client;
    }

    private sealed class BlockingMessageHandler : HttpMessageHandler
    {
        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
            return new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = JsonContent.Create(new { locale = "en-US", strings = new Dictionary<string, string>() })
            };
        }
    }
}
@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="utf-8"?>
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <UseConcelierTestInfra>false</UseConcelierTestInfra>
  </PropertyGroup>
  <ItemGroup>
    <ProjectReference Include="../../StellaOps.Localization/StellaOps.Localization.csproj" />
  </ItemGroup>
</Project>
@@ -0,0 +1,93 @@
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using StellaOps.Localization;

namespace StellaOps.Localization.Tests;

public sealed class TranslationRegistryTests
{
    [Fact]
    public async Task LoadAsync_LoadsLocalesConcurrentlyWithinSingleProvider()
    {
        var provider = new BlockingProvider("en-US", "de-DE", "bg-BG");
        var registry = new TranslationRegistry(
            Options.Create(new TranslationOptions
            {
                DefaultLocale = "en-US",
                SupportedLocales = ["en-US", "de-DE", "bg-BG"]
            }),
            NullLogger<TranslationRegistry>.Instance);

        var loadTask = registry.LoadAsync([provider], TestContext.Current.CancellationToken);

        await provider.AllLocalesStarted.Task.WaitAsync(TestContext.Current.CancellationToken);
        Assert.True(provider.MaxConcurrentLoads >= 2);

        provider.Release();
        await loadTask;

        Assert.Equal("bundle:de-DE", registry.GetBundle("de-DE")["translation.loaded"]);
        Assert.Equal("bundle:bg-BG", registry.GetBundle("bg-BG")["translation.loaded"]);
    }

    private sealed class BlockingProvider : ITranslationBundleProvider
    {
        private readonly IReadOnlyList<string> _locales;
        private readonly TaskCompletionSource _release = new(TaskCreationOptions.RunContinuationsAsynchronously);
        private int _activeLoads;
        private int _startedLoads;
        private int _maxConcurrentLoads;

        public BlockingProvider(params string[] locales)
        {
            _locales = locales;
        }

        public int Priority => 10;

        public int MaxConcurrentLoads => _maxConcurrentLoads;

        public TaskCompletionSource AllLocalesStarted { get; } = new(TaskCreationOptions.RunContinuationsAsynchronously);

        public Task<IReadOnlyList<string>> GetAvailableLocalesAsync(CancellationToken ct)
            => Task.FromResult(_locales);

        public async Task<IReadOnlyDictionary<string, string>> LoadAsync(string locale, CancellationToken ct)
        {
            var concurrentLoads = Interlocked.Increment(ref _activeLoads);
            UpdateMaxConcurrentLoads(concurrentLoads);

            if (Interlocked.Increment(ref _startedLoads) == _locales.Count)
            {
                AllLocalesStarted.TrySetResult();
            }

            await _release.Task.WaitAsync(ct);
            Interlocked.Decrement(ref _activeLoads);

            return new Dictionary<string, string>(StringComparer.Ordinal)
            {
                ["translation.loaded"] = $"bundle:{locale}"
            };
        }

        public void Release() => _release.TrySetResult();

        private void UpdateMaxConcurrentLoads(int concurrentLoads)
        {
            while (true)
            {
                var snapshot = _maxConcurrentLoads;
                if (concurrentLoads <= snapshot)
                {
                    return;
                }

                if (Interlocked.CompareExchange(ref _maxConcurrentLoads, concurrentLoads, snapshot) == snapshot)
                {
                    return;
                }
            }
        }
    }
}