Make remote localization startup non-blocking

This commit is contained in: master
2026-03-11 10:07:30 +02:00
parent 7a1c090f2e
commit 5c874c8f64
8 changed files with 299 additions and 18 deletions


@@ -0,0 +1,77 @@
# Sprint 20260311_001 - Graph Remote Localization Startup Nonblocking
## Topic & Scope
- Remove the scratch-setup startup bottleneck where Graph API can stay dark for an extended period while remote localization overrides load before Kestrel binds.
- Treat remote translation bundles as optional startup enrichment, not a dependency that can hold a service offline during a fresh compose bootstrap.
- Verify the fix with focused localization-library tests, a rebuilt Graph image, and live service/browser checks on the scratch stack.
- Working directory: `src/__Libraries/StellaOps.Localization`.
- Allowed coordination edits: `src/Graph/**`, `src/__Libraries/__Tests/**`, `devops/compose/**`, `docs/modules/graph/architecture.md`, `docs/implplan/SPRINT_20260311_001_Graph_remote_localization_startup_nonblocking.md`.
- Expected evidence: targeted localization test output, rebuilt Graph runtime health, and live verification artifacts showing the scratch stack no longer masks the startup fault.
## Dependencies & Concurrency
- Depends on the existing scratch-reset stack being up so the late-start Graph behavior can be reproduced and rechecked.
- Safe parallelism: stay inside the localization library, Graph service, and the listed docs; avoid unrelated web search or component-revival slices.
## Documentation Prerequisites
- `AGENTS.md`
- `src/Graph/AGENTS.md`
- `docs/modules/graph/architecture.md`
- `docs/qa/feature-checks/FLOW.md`
## Delivery Tracker
### GRAPH-LOC-001 - Diagnose the real startup gate
Status: DONE
Dependency: none
Owners: QA, Developer
Task description:
- Reproduce the Graph startup fault from the scratch stack and separate product failures from harness noise.
- Capture why the container can stay unhealthy during scratch setup even though the same binary later starts when rerun interactively.
Completion criteria:
- [x] Container/runtime evidence shows where startup is being gated.
- [x] The diagnosis identifies the shared-library behavior that needs correction.
### GRAPH-LOC-002 - Make remote localization startup-safe
Status: DONE
Dependency: GRAPH-LOC-001
Owners: Architect, Developer
Task description:
- Change the shared localization bootstrap so remote bundle overrides are bounded and parallelized per provider, preserving deterministic merge order while preventing optional remote fetches from serially blocking service readiness.
- Keep the contract library-centric so Graph is fixed through the real root cause rather than a service-specific workaround.
Completion criteria:
- [x] Remote bundle fetches have an explicit bounded timeout.
- [x] Translation registry no longer serially waits per locale for a single provider.
- [x] Focused tests cover timeout handling and concurrent locale loading.
### GRAPH-LOC-003 - Rebuild and prove the scratch-stack behavior
Status: DONE
Dependency: GRAPH-LOC-002
Owners: QA
Task description:
- Rebuild the affected runtime, redeploy the live stack, and verify Graph startup and the related UI surface on the scratch environment.
- Record the new behavior in sprint evidence and module docs.
Completion criteria:
- [x] Graph container becomes healthy promptly after redeploy.
- [x] Focused live checks confirm the reachability/security surfaces no longer show the backend-unavailable fallback on this defect path.
- [x] Docs and sprint log reflect the startup contract change.
## Execution Log
| Date (UTC) | Update | Owner |
| --- | --- | --- |
| 2026-03-11 | Sprint created after a fresh scratch rebuild showed `stellaops-graph-api` remaining unhealthy while the frontdoor route sweep stayed green. | Developer |
| 2026-03-11 | Reproduced that the Graph binary starts normally on host and in-container when rerun interactively, but the scratch container can stay dark for a long interval before eventually binding. The shared startup gate is `LoadTranslationsAsync()` calling remote bundle overrides before `Run()`, with one remote fetch per locale executed serially. | QA |
| 2026-03-11 | Implemented the shared-library fix in `StellaOps.Localization`: remote bundle fetches now use a bounded per-request timeout and locale loads run concurrently within a provider while merging back in deterministic order. Added focused tests in `src/__Libraries/__Tests/StellaOps.Localization.Tests` covering timeout fallback and concurrent load behavior. | Developer |
| 2026-03-11 | Verified the fix on the live scratch stack by rebuilding only `graph-api`, stopping Platform, force-recreating the Graph container, and confirming immediate recovery: `stellaops-graph-api` reported `healthy` and `GET http://127.1.0.20/healthz` returned `200` while Platform was still down. Then brought Platform back and ran a live authenticated Playwright check on `/security/supply-chain-data/graph`, which passed with zero console errors, zero request failures, and zero error responses. | QA |
## Decisions & Risks
- Decision: fix the startup contract in `StellaOps.Localization` instead of adding Graph-only retries, because remote translation overrides are used by many services and should never gate service availability during scratch bootstrap.
- Risk: changing translation loading order could accidentally alter merge determinism.
- Mitigation: keep provider priority ordering intact, parallelize only within a provider, and merge results back in deterministic locale order.
- Decision: bounded remote translation fetches default to a short timeout because remote overrides are optional enrichment; if Platform is unavailable during scratch bootstrap, services must prefer embedded bundles and come online instead of waiting unboundedly on localization.
## Next Checkpoints
- Add focused localization tests before changing runtime behavior.
- Rebuild the Graph image and redeploy the stack immediately after the library fix.


@@ -68,6 +68,7 @@ The edge metadata system provides explainability for graph relationships:
 - Graph API now initializes localization via `AddStellaOpsLocalization(...)`, `AddTranslationBundle(...)`, `AddRemoteTranslationBundles()`, `UseStellaOpsLocalization()`, and `LoadTranslationsAsync()`.
 - Locale resolution order for API messages is deterministic: `X-Locale` header -> `Accept-Language` header -> default locale (`en-US`).
 - Translation layering is deterministic: shared embedded `common` bundle -> Graph embedded bundle (`Translations/*.graph.json`) -> Platform runtime override bundle.
+- Remote Platform override fetches are bounded by a per-request timeout and loaded concurrently across locales within each provider, so scratch bootstrap cannot hold the Graph API offline while optional translation overrides load.
 - This rollout localizes selected error paths (for example, edge/export not found, invalid reason, and tenant/auth validation text) for `en-US` and `de-DE`.
 ## 4) Storage considerations
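
The locale resolution order stated in the hunk above (`X-Locale` header, then `Accept-Language`, then the default locale) can be sketched as follows. This is only an illustration under stated assumptions: `ResolveLocale` and `FindSupported` are hypothetical helpers, not the `StellaOps.Localization` API, and the sketch assumes that unsupported header values are skipped and that `Accept-Language` q-values are ignored in favor of list order.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of StellaOps.Localization): applies the documented
// order X-Locale -> Accept-Language -> default locale.
static string ResolveLocale(
    string? xLocale,
    string? acceptLanguage,
    IReadOnlyCollection<string> supported,
    string defaultLocale = "en-US")
{
    // 1. An explicit X-Locale header wins when it names a supported locale (assumption).
    var match = FindSupported(xLocale, supported);
    if (match is not null)
    {
        return match;
    }

    // 2. Otherwise take the first supported tag from Accept-Language.
    //    Simplification: q-values are stripped and ignored; list order decides.
    var parts = (acceptLanguage ?? string.Empty).Split(
        ',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    foreach (var part in parts)
    {
        match = FindSupported(part.Split(';')[0], supported);
        if (match is not null)
        {
            return match;
        }
    }

    // 3. Fall back to the default locale.
    return defaultLocale;
}

// Case-insensitive lookup that returns the canonical supported tag, or null.
static string? FindSupported(string? candidate, IReadOnlyCollection<string> supported) =>
    string.IsNullOrWhiteSpace(candidate)
        ? null
        : supported.FirstOrDefault(
            s => string.Equals(s, candidate.Trim(), StringComparison.OrdinalIgnoreCase));

var supportedLocales = new[] { "en-US", "de-DE" };
Console.WriteLine(ResolveLocale("de-DE", "en-US", supportedLocales));            // de-DE
Console.WriteLine(ResolveLocale(null, "fr-FR, de-DE;q=0.8", supportedLocales));  // de-DE
Console.WriteLine(ResolveLocale(null, null, supportedLocales));                  // en-US
```

Whatever the real resolver does with unsupported tags, a fixed evaluation order means the same request headers always select the same bundle.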


@@ -11,6 +11,7 @@ namespace StellaOps.Localization;
 /// </summary>
 public sealed class RemoteBundleProvider : ITranslationBundleProvider
 {
+    private static readonly TimeSpan DefaultRequestTimeout = TimeSpan.FromSeconds(3);
     private readonly TranslationOptions _options;
     private readonly IHttpClientFactory? _httpClientFactory;
     private readonly ILogger<RemoteBundleProvider> _logger;
@@ -45,7 +46,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         try
         {
             var client = _httpClientFactory.CreateClient("StellaOpsLocalization");
-            var response = await client.GetAsync(url, ct).ConfigureAwait(false);
+            var requestTimeout = ResolveRequestTimeout();
+            client.Timeout = requestTimeout;
+            using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
+            timeoutCts.CancelAfter(requestTimeout);
+            var response = await client.GetAsync(url, timeoutCts.Token).ConfigureAwait(false);
             if (!response.IsSuccessStatusCode)
             {
@@ -78,6 +85,13 @@ public sealed class RemoteBundleProvider : ITranslationBundleProvider
         return Task.FromResult<IReadOnlyList<string>>([]);
     }
+    private TimeSpan ResolveRequestTimeout()
+    {
+        return _options.RemoteBundleRequestTimeout > TimeSpan.Zero
+            ? _options.RemoteBundleRequestTimeout
+            : DefaultRequestTimeout;
+    }
     private sealed class RemoteBundleResponse
     {
         public string? Locale { get; set; }
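
The hunks above bound each remote fetch with a linked `CancellationTokenSource`, but the handling of the resulting cancellation falls outside the visible context. A minimal sketch of the pattern, assuming a timeout should yield an empty bundle while caller cancellation should still propagate, might look like this. `LoadWithTimeoutAsync` and the simulated slow fetch are hypothetical and stand in for the provider's real HTTP call.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not the library's code): runs an optional fetch under a bounded
// timeout and falls back to an empty bundle when only the timeout fired.
static async Task<IReadOnlyDictionary<string, string>> LoadWithTimeoutAsync(
    Func<CancellationToken, Task<IReadOnlyDictionary<string, string>>> fetch,
    TimeSpan timeout,
    CancellationToken ct)
{
    // The linked source cancels when the caller cancels OR when the timeout elapses.
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    timeoutCts.CancelAfter(timeout);

    try
    {
        return await fetch(timeoutCts.Token).ConfigureAwait(false);
    }
    catch (OperationCanceledException) when (!ct.IsCancellationRequested)
    {
        // The caller did not cancel, so the timeout did: treat the bundle as absent.
        return new Dictionary<string, string>(StringComparer.Ordinal);
    }
}

// Simulated remote endpoint that would take a minute to answer.
var bundle = await LoadWithTimeoutAsync(
    async token =>
    {
        await Task.Delay(TimeSpan.FromMinutes(1), token);
        IReadOnlyDictionary<string, string> loaded =
            new Dictionary<string, string> { ["greeting"] = "hello" };
        return loaded;
    },
    TimeSpan.FromMilliseconds(100),
    CancellationToken.None);

Console.WriteLine(bundle.Count); // 0
```

The exception filter is the design point: a timeout degrades to embedded bundles, whereas a shutdown request from the caller is not swallowed.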


@@ -22,6 +22,9 @@ public sealed class TranslationOptions
     /// <summary>Cache TTL for remote bundles.</summary>
     public TimeSpan RemoteBundleCacheDuration { get; set; } = TimeSpan.FromMinutes(30);
+    /// <summary>Maximum time to wait for a single remote bundle request during startup.</summary>
+    public TimeSpan RemoteBundleRequestTimeout { get; set; } = TimeSpan.FromSeconds(3);
     /// <summary>Whether to return the key as fallback when translation is missing.</summary>
     public bool ReturnKeyWhenMissing { get; set; } = true;
 }


@@ -44,28 +44,28 @@ public sealed class TranslationRegistry
         // Ensure default locale is always loaded
         allLocales.Add(_options.DefaultLocale);
-        // Load bundles in priority order (lower first, higher overwrites)
+        // Load bundles in priority order (lower first, higher overwrites).
+        // Locales within the same provider are independent, so load them concurrently
+        // and merge back in deterministic locale order.
         foreach (var provider in ordered)
         {
-            foreach (var locale in allLocales)
+            var loadTasks = allLocales
+                .OrderBy(locale => locale, StringComparer.OrdinalIgnoreCase)
+                .Select(locale => LoadProviderBundleAsync(provider, locale, ct))
+                .ToArray();
+            var results = await Task.WhenAll(loadTasks).ConfigureAwait(false);
+            foreach (var result in results)
             {
-                try
+                if (result.Bundle.Count == 0)
                 {
-                    var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
-                    if (bundle.Count > 0)
-                    {
-                        MergeBundles(locale, bundle);
-                        _logger.LogDebug(
-                            "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
-                            bundle.Count, locale, provider.Priority);
-                    }
-                }
-                catch (Exception ex)
-                {
-                    _logger.LogWarning(ex,
-                        "Failed to load translations for locale {Locale} from provider (priority {Priority})",
-                        locale, provider.Priority);
+                    continue;
                 }
+                MergeBundles(result.Locale, result.Bundle);
+                _logger.LogDebug(
+                    "Loaded {Count} translations for locale {Locale} from provider (priority {Priority})",
+                    result.Bundle.Count, result.Locale, provider.Priority);
             }
         }
@@ -75,6 +75,27 @@ public sealed class TranslationRegistry
             _store.Count, totalKeys);
     }
+    private async Task<ProviderLocaleBundle> LoadProviderBundleAsync(
+        ITranslationBundleProvider provider,
+        string locale,
+        CancellationToken ct)
+    {
+        try
+        {
+            var bundle = await provider.LoadAsync(locale, ct).ConfigureAwait(false);
+            return new ProviderLocaleBundle(locale, bundle);
+        }
+        catch (Exception ex)
+        {
+            _logger.LogWarning(ex,
+                "Failed to load translations for locale {Locale} from provider (priority {Priority})",
+                locale, provider.Priority);
+            return new ProviderLocaleBundle(
+                locale,
+                new Dictionary<string, string>(StringComparer.Ordinal));
+        }
+    }
     /// <summary>
     /// Merges a bundle into the store. Higher-priority values overwrite lower.
     /// </summary>
@@ -248,4 +269,6 @@ public sealed class TranslationRegistry
             _ => value.ToString() ?? string.Empty
         };
     }
+    private sealed record ProviderLocaleBundle(string Locale, IReadOnlyDictionary<string, string> Bundle);
 }
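
The registry change above relies on `Task.WhenAll` returning results in the order of the input tasks rather than in completion order, which is what keeps the merge deterministic even though the loads overlap. A small sketch of that property, assuming a hypothetical `LoadAsync` stand-in with artificial delays chosen so the last-sorted locale finishes first:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a provider load: completes after the given delay.
static async Task<(string Locale, int DelayMs)> LoadAsync(string locale, int delayMs)
{
    await Task.Delay(delayMs);
    return (locale, delayMs);
}

var locales = new[] { "en-US", "de-DE", "bg-BG" };

// Same ordering rule as the registry: case-insensitive ordinal sort of locale tags.
var ordered = locales.OrderBy(l => l, StringComparer.OrdinalIgnoreCase).ToArray();

// Start every load up front; earlier locales get the longest delay,
// so completion order is the reverse of the sorted order.
var tasks = ordered
    .Select((locale, index) => LoadAsync(locale, (ordered.Length - index) * 50))
    .ToArray();

// WhenAll yields results positionally, matching the order of `tasks`.
var results = await Task.WhenAll(tasks);

Console.WriteLine(string.Join(", ", results.Select(r => r.Locale)));
// bg-BG, de-DE, en-US
```

Because the merge loop iterates `results` positionally, the total wait per provider is roughly the slowest single load instead of the sum of all loads, while the merge sequence stays the same on every run.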


@@ -0,0 +1,58 @@
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using StellaOps.Localization;

namespace StellaOps.Localization.Tests;

public sealed class RemoteBundleProviderTests
{
    [Fact]
    public async Task LoadAsync_ReturnsEmptyBundle_WhenRemoteFetchTimesOut()
    {
        using var client = new HttpClient(new BlockingMessageHandler());
        var provider = new RemoteBundleProvider(
            Options.Create(new TranslationOptions
            {
                EnableRemoteBundles = true,
                RemoteBundleUrl = "http://platform.stella-ops.local",
                RemoteBundleRequestTimeout = TimeSpan.FromMilliseconds(100)
            }),
            NullLogger<RemoteBundleProvider>.Instance,
            new FixedHttpClientFactory(client));

        var stopwatch = Stopwatch.StartNew();
        var bundle = await provider.LoadAsync("en-US", TestContext.Current.CancellationToken);
        stopwatch.Stop();

        Assert.Empty(bundle);
        Assert.InRange(stopwatch.Elapsed, TimeSpan.Zero, TimeSpan.FromSeconds(1));
    }

    private sealed class FixedHttpClientFactory : IHttpClientFactory
    {
        private readonly HttpClient _client;

        public FixedHttpClientFactory(HttpClient client)
        {
            _client = client;
        }

        public HttpClient CreateClient(string name) => _client;
    }

    private sealed class BlockingMessageHandler : HttpMessageHandler
    {
        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
            return new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = JsonContent.Create(new { locale = "en-US", strings = new Dictionary<string, string>() })
            };
        }
    }
}


@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="utf-8"?>
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net10.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <UseConcelierTestInfra>false</UseConcelierTestInfra>
  </PropertyGroup>
  <ItemGroup>
    <ProjectReference Include="../../StellaOps.Localization/StellaOps.Localization.csproj" />
  </ItemGroup>
</Project>


@@ -0,0 +1,93 @@
using Microsoft.Extensions.Logging.Abstractions;
using Microsoft.Extensions.Options;
using StellaOps.Localization;

namespace StellaOps.Localization.Tests;

public sealed class TranslationRegistryTests
{
    [Fact]
    public async Task LoadAsync_LoadsLocalesConcurrentlyWithinSingleProvider()
    {
        var provider = new BlockingProvider("en-US", "de-DE", "bg-BG");
        var registry = new TranslationRegistry(
            Options.Create(new TranslationOptions
            {
                DefaultLocale = "en-US",
                SupportedLocales = ["en-US", "de-DE", "bg-BG"]
            }),
            NullLogger<TranslationRegistry>.Instance);

        var loadTask = registry.LoadAsync([provider], TestContext.Current.CancellationToken);
        await provider.AllLocalesStarted.Task.WaitAsync(TestContext.Current.CancellationToken);

        Assert.True(provider.MaxConcurrentLoads >= 2);

        provider.Release();
        await loadTask;

        Assert.Equal("bundle:de-DE", registry.GetBundle("de-DE")["translation.loaded"]);
        Assert.Equal("bundle:bg-BG", registry.GetBundle("bg-BG")["translation.loaded"]);
    }

    private sealed class BlockingProvider : ITranslationBundleProvider
    {
        private readonly IReadOnlyList<string> _locales;
        private readonly TaskCompletionSource _release = new(TaskCreationOptions.RunContinuationsAsynchronously);
        private int _activeLoads;
        private int _startedLoads;
        private int _maxConcurrentLoads;

        public BlockingProvider(params string[] locales)
        {
            _locales = locales;
        }

        public int Priority => 10;

        public int MaxConcurrentLoads => _maxConcurrentLoads;

        public TaskCompletionSource AllLocalesStarted { get; } = new(TaskCreationOptions.RunContinuationsAsynchronously);

        public Task<IReadOnlyList<string>> GetAvailableLocalesAsync(CancellationToken ct)
            => Task.FromResult(_locales);

        public async Task<IReadOnlyDictionary<string, string>> LoadAsync(string locale, CancellationToken ct)
        {
            var concurrentLoads = Interlocked.Increment(ref _activeLoads);
            UpdateMaxConcurrentLoads(concurrentLoads);

            if (Interlocked.Increment(ref _startedLoads) == _locales.Count)
            {
                AllLocalesStarted.TrySetResult();
            }

            await _release.Task.WaitAsync(ct);
            Interlocked.Decrement(ref _activeLoads);

            return new Dictionary<string, string>(StringComparer.Ordinal)
            {
                ["translation.loaded"] = $"bundle:{locale}"
            };
        }

        public void Release() => _release.TrySetResult();

        private void UpdateMaxConcurrentLoads(int concurrentLoads)
        {
            while (true)
            {
                var snapshot = _maxConcurrentLoads;
                if (concurrentLoads <= snapshot)
                {
                    return;
                }

                if (Interlocked.CompareExchange(ref _maxConcurrentLoads, concurrentLoads, snapshot) == snapshot)
                {
                    return;
                }
            }
        }
    }
}