Implement ledger metrics for observability and add tests for Ruby packages endpoints
Some checks failed
Docs CI / lint-and-preview (push) Has been cancelled

- Added `LedgerMetrics` class to record write latency and total events for ledger operations.
- Created comprehensive tests for Ruby packages endpoints, covering scenarios for missing inventory, successful retrieval, and identifier handling.
- Introduced `TestSurfaceSecretsScope` for managing environment variables during tests.
- Developed `ProvenanceMongoExtensions` for attaching DSSE provenance and trust information to event documents.
- Implemented `EventProvenanceWriter` and `EventWriter` classes for managing event provenance in MongoDB.
- Established MongoDB indexes for efficient querying of events based on provenance and trust.
- Added models and JSON parsing logic for DSSE provenance and trust information.
This commit is contained in:
master
2025-11-13 09:29:09 +02:00
parent 151f6b35cc
commit 61f963fd52
101 changed files with 5881 additions and 1776 deletions

View File

@@ -40,35 +40,49 @@ Surface.Env exposes `ISurfaceEnvironment` which returns an immutable `SurfaceEnv
| Variable | Description | Default | Notes |
|----------|-------------|---------|-------|
| `SCANNER_SURFACE_FS_ENDPOINT` | Base URI for Surface.FS service (RustFS, S3-compatible). | _required_ | e.g. `https://surface-cache.svc.cluster.local`. Zastava uses `ZASTAVA_SURFACE_FS_ENDPOINT`; when absent, falls back to scanner value. |
| `SCANNER_SURFACE_FS_BUCKET` | Bucket/container name used for manifests and artefacts. | `surface-cache` | Must be unique per tenant. |
| `SCANNER_SURFACE_FS_REGION` | Optional region (S3-style). | `null` | Required for AWS S3. |
| `SCANNER_SURFACE_CACHE_ROOT` | Local filesystem directory for warm caches. | `/var/lib/stellaops/surface` | Should reside on fast SSD. |
| `SCANNER_SURFACE_CACHE_QUOTA_MB` | Soft limit for local cache usage. | `4096` | Enforced by Surface.FS eviction policy. |
| `SCANNER_SURFACE_TLS_CERT_PATH` | Path to PEM bundle for mutual TLS with Surface.FS. | `null` | If provided, library loads cert/key pair. |
| `SCANNER_SURFACE_TENANT` | Tenant identifier used for cache namespaces. | derived from Authority token | Can be overridden for multi-tenant workers. |
| `SCANNER_SURFACE_PREFETCH_ENABLED` | Toggle surface prefetch threads. | `false` | If `true`, Worker prefetches manifests before analyzer stage. |
| `SCANNER_SURFACE_FEATURES` | Comma-separated feature switches. | `""` | e.g. `validation,prewarm,runtime-diff`. |
| `SCANNER_SURFACE_FS_ENDPOINT` | Base URI for Surface.FS / RustFS / S3-compatible store. | _required_ | Throws `SurfaceEnvironmentException` when `RequireSurfaceEndpoint = true`. When disabled (tests), builder falls back to `https://surface.invalid` so validation can fail fast. Also binds `Surface:Fs:Endpoint` from `IConfiguration`. |
| `SCANNER_SURFACE_FS_BUCKET` | Bucket/container used for manifests and artefacts. | `surface-cache` | Must be unique per tenant; validators enforce non-empty value. |
| `SCANNER_SURFACE_FS_REGION` | Optional region for S3-compatible stores. | `null` | Needed only when the backing store requires it (AWS/GCS). |
| `SCANNER_SURFACE_CACHE_ROOT` | Local directory for warm caches. | `<temp>/stellaops/surface` | Directory is created if missing. Override to `/var/lib/stellaops/surface` (or another fast SSD) in production. |
| `SCANNER_SURFACE_CACHE_QUOTA_MB` | Soft limit for on-disk cache usage. | `4096` | Enforced range 64262144 MB; validation emits `SURFACE_ENV_CACHE_QUOTA_INVALID` outside the range. |
| `SCANNER_SURFACE_PREFETCH_ENABLED` | Enables manifest prefetch threads. | `false` | Workers honour this before analyzer execution. |
| `SCANNER_SURFACE_TENANT` | Tenant namespace used by cache + secret resolvers. | `TenantResolver(...)` or `"default"` | Default resolver may pull from Authority claims; you can override via env for multi-tenant pools. |
| `SCANNER_SURFACE_FEATURES` | Comma-separated feature switches. | `""` | Compared against `SurfaceEnvironmentOptions.KnownFeatureFlags`; unknown flags raise warnings. |
| `SCANNER_SURFACE_TLS_CERT_PATH` | Path to PEM/PKCS#12 file for client auth. | `null` | When present, `SurfaceEnvironmentBuilder` loads the certificate into `SurfaceTlsConfiguration`. |
| `SCANNER_SURFACE_TLS_KEY_PATH` | Optional private-key path when cert/key are stored separately. | `null` | Stored in `SurfaceTlsConfiguration` for hosts that need to hydrate the key themselves. |
### 3.2 Secrets provider keys
| Variable | Description | Notes |
|----------|-------------|-------|
| `SCANNER_SURFACE_SECRETS_PROVIDER` | Provider ID (`kubernetes`, `file`, `inline`). | Controls Surface.Secrets back-end. |
| `SCANNER_SURFACE_SECRETS_ROOT` | Path or secret namespace. | Example: `/etc/stellaops/secrets` for file provider. |
| `SCANNER_SURFACE_SECRETS_TENANT` | Tenant override for secret lookup. | Defaults to `SCANNER_SURFACE_TENANT`. |
| `SCANNER_SURFACE_SECRETS_PROVIDER` | Provider ID (`kubernetes`, `file`, `inline`, future back-ends). | Defaults to `kubernetes`; validators reject unknown values via `SURFACE_SECRET_PROVIDER_UNKNOWN`. |
| `SCANNER_SURFACE_SECRETS_ROOT` | Path or base namespace for the provider. | Required for the `file` provider (e.g., `/etc/stellaops/secrets`). |
| `SCANNER_SURFACE_SECRETS_NAMESPACE` | Kubernetes namespace used by the secrets provider. | Mandatory when `provider = kubernetes`. |
| `SCANNER_SURFACE_SECRETS_FALLBACK_PROVIDER` | Optional secondary provider ID. | Enables tiered lookups (e.g., `kubernetes``inline`) without changing code. |
| `SCANNER_SURFACE_SECRETS_ALLOW_INLINE` | Allows returning inline secrets (useful for tests). | Defaults to `false`; Production deployments should keep this disabled. |
| `SCANNER_SURFACE_SECRETS_TENANT` | Tenant override for secret lookups. | Defaults to `SCANNER_SURFACE_TENANT` or the tenant resolver result. |
### 3.3 Zastava-specific keys
### 3.3 Component-specific prefixes
Zastava containers read the same primary variables but may override names under the `ZASTAVA_` prefix (e.g., `ZASTAVA_SURFACE_CACHE_ROOT`, `ZASTAVA_SURFACE_FEATURES`). Surface.Env automatically checks component-specific prefixes before falling back to the scanner defaults.
`SurfaceEnvironmentOptions.Prefixes` controls the order in which suffixes are probed. Every suffix listed above is combined with each prefix (e.g., `SCANNER_SURFACE_FS_ENDPOINT`, `ZASTAVA_SURFACE_FS_ENDPOINT`) and finally the bare suffix (`SURFACE_FS_ENDPOINT`). Configure prefixes per host so local overrides win but global scanner defaults remain available:
| Component | Suggested prefixes (first match wins) | Notes |
|-----------|---------------------------------------|-------|
| Scanner.Worker / WebService | `SCANNER` | Default already added by `AddSurfaceEnvironment`. |
| Zastava Observer/Webhook (planned) | `ZASTAVA`, `SCANNER` | Call `options.AddPrefix("ZASTAVA")` before relying on `ZASTAVA_*` overrides. |
| Future CLI / BuildX plug-ins | `CLI`, `SCANNER` | Allows per-user overrides without breaking shared env files. |
This approach means operators can define a single env file (SCANNER_*) and only override the handful of settings that diverge for a specific component by introducing an additional prefix.
### 3.4 Configuration precedence
1. Explicit overrides passed to `SurfaceEnvBuilder` (e.g., from appsettings).
2. Component-specific env (e.g., `ZASTAVA_SURFACE_FS_ENDPOINT`).
3. Scanner global env (e.g., `SCANNER_SURFACE_FS_ENDPOINT`).
4. `SurfaceEnvDefaults.json` (shipped with library for sensible defaults).
5. Emergency fallback values defined in code (only for development scenarios).
The builder resolves every suffix using the following precedence:
1. Environment variables using the configured prefixes (e.g., `ZASTAVA_SURFACE_FS_ENDPOINT`, then `SCANNER_SURFACE_FS_ENDPOINT`, then the bare `SURFACE_FS_ENDPOINT`).
2. Configuration values under the `Surface:*` section (for example `Surface:Fs:Endpoint`, `Surface:Cache:Root` in `appsettings.json` or Helm values).
3. Hard-coded defaults baked into `SurfaceEnvironmentBuilder` (temporary directory, `surface-cache` bucket, etc.).
`SurfaceEnvironmentOptions.RequireSurfaceEndpoint` controls whether a missing endpoint results in an exception (default: `true`). Other values fall back to the default listed in §3.1/3.2 and are further validated by the Surface.Validation pipeline.
## 4. API Surface
@@ -79,65 +93,99 @@ public interface ISurfaceEnvironment
IReadOnlyDictionary<string, string> RawVariables { get; }
}
public sealed record SurfaceEnvironmentSettings
(
public sealed record SurfaceEnvironmentSettings(
Uri SurfaceFsEndpoint,
string SurfaceFsBucket,
string? SurfaceFsRegion,
DirectoryInfo CacheRoot,
int CacheQuotaMegabytes,
X509Certificate2Collection? ClientCertificates,
string Tenant,
bool PrefetchEnabled,
IReadOnlyCollection<string> FeatureFlags,
SecretProviderConfiguration Secrets,
IDictionary<string,string> ComponentOverrides
);
SurfaceSecretsConfiguration Secrets,
string Tenant,
SurfaceTlsConfiguration Tls)
{
public DateTimeOffset CreatedAtUtc { get; init; }
}
public sealed record SurfaceSecretsConfiguration(
string Provider,
string Tenant,
string? Root,
string? Namespace,
string? FallbackProvider,
bool AllowInline);
public sealed record SurfaceTlsConfiguration(
string? CertificatePath,
string? PrivateKeyPath,
X509Certificate2Collection? ClientCertificates);
```
Consumers access `ISurfaceEnvironment.Settings` and pass the record into Surface.FS / Surface.Secrets factories. The interface memoises results so repeated access is cheap.
`ISurfaceEnvironment.RawVariables` captures the exact env/config keys that produced the snapshot so operators can export them in diagnostics bundles.
`SurfaceEnvironmentOptions` configures how the snapshot is built:
* `ComponentName` used in logs/validation output.
* `Prefixes` ordered list of env prefixes (see §3.3). Defaults to `["SCANNER"]`.
* `RequireSurfaceEndpoint` throw when no endpoint is provided (default `true`).
* `TenantResolver` delegate invoked when `SCANNER_SURFACE_TENANT` is absent.
* `KnownFeatureFlags` recognised feature switches; unexpected values raise warnings.
Example registration:
```csharp
builder.Services.AddSurfaceEnvironment(options =>
{
options.ComponentName = "Scanner.Worker";
options.AddPrefix("ZASTAVA"); // optional future override
options.KnownFeatureFlags.Add("validation");
options.TenantResolver = sp => sp.GetRequiredService<ITenantContext>().TenantId;
});
```
Consumers access `ISurfaceEnvironment.Settings` and pass the record into Surface.FS, Surface.Secrets, cache, and validation helpers. The interface memoises results so repeated access is cheap.
## 5. Validation
Surface.Env invokes the following validators (implemented in Surface.Validation):
`SurfaceEnvironmentBuilder` only throws `SurfaceEnvironmentException` for malformed inputs (non-integer quota, invalid URI, missing required variable when `RequireSurfaceEndpoint = true`). The richer validation pipeline lives in `StellaOps.Scanner.Surface.Validation` and runs via `services.AddSurfaceValidation()`:
1. **EndpointValidator** ensures endpoint URI is absolute HTTPS and not localhost in production.
2. **CacheQuotaValidator** verifies quota > 0 and below host max.
3. **FilesystemValidator** checks cache root exists/writable; attempts to create directory if missing.
4. **SecretsProviderValidator** ensures provider-specific settings (e.g., Kubernetes namespace) are present.
5. **FeatureFlagValidator** warns on unknown feature flag tokens.
1. **SurfaceEndpointValidator** checks for a non-placeholder endpoint and bucket (`SURFACE_ENV_MISSING_ENDPOINT`, `SURFACE_FS_BUCKET_MISSING`).
2. **SurfaceCacheValidator** verifies the cache directory exists/is writable and that the quota is positive (`SURFACE_ENV_CACHE_DIR_UNWRITABLE`, `SURFACE_ENV_CACHE_QUOTA_INVALID`).
3. **SurfaceSecretsValidator** validates provider names, required namespace/root fields, and tenant presence (`SURFACE_SECRET_PROVIDER_UNKNOWN`, `SURFACE_SECRET_CONFIGURATION_MISSING`, `SURFACE_ENV_TENANT_MISSING`).
Failures throw `SurfaceEnvironmentException` with error codes (`SURFACE_ENV_MISSING_ENDPOINT`, `SURFACE_ENV_CACHE_DIR_UNWRITABLE`, etc.). Hosts log the error and fail fast during startup.
Validators emit `SurfaceValidationIssue` instances with codes defined in `SurfaceValidationIssueCodes`. `LoggingSurfaceValidationReporter` writes structured log entries (Info/Warning/Error) using the component name, issue code, and remediation hint. Hosts fail startup if any issue has `Error` severity; warnings allow startup but surface actionable hints.
## 6. Integration Guidance
- **Scanner Worker**: call `services.AddSurfaceEnvironment()` in `Program.cs` before registering analyzers. Pass `hostContext.Configuration.GetSection("Surface")` for overrides.
- **Scanner WebService**: build environment during startup using `AddSurfaceEnvironment`, `AddSurfaceValidation`, `AddSurfaceFileCache`, and `AddSurfaceSecrets`; readiness checks execute the validator runner and scan/report APIs emit Surface CAS pointers derived from the resolved configuration.
- **Zastava Observer/Webhook**: use the same builder; ensure Helm charts set `ZASTAVA_` variables.
- **Scheduler Planner (future)**: treat Surface.Env as read-only input; do not mutate settings.
- `Scanner.Worker` and `Scanner.WebService` automatically bind the `SurfaceCacheOptions.RootDirectory` to `SurfaceEnvironment.Settings.CacheRoot` (2025-11-05); both hosts emit structured warnings (`surface.env.misconfiguration`) when the helper detects missing cache roots, endpoints, or secrets provider settings (2025-11-06).
- **Scanner Worker**: register `AddSurfaceEnvironment`, `AddSurfaceValidation`, `AddSurfaceFileCache`, and `AddSurfaceSecrets` before analyzer/services (see `src/Scanner/StellaOps.Scanner.Worker/Program.cs`). `SurfaceCacheOptionsConfigurator` already binds the cache root from `ISurfaceEnvironment`.
- **Scanner WebService**: identical wiring, plus `SurfacePointerService`/`ScannerSurfaceSecretConfigurator` reuse the resolved settings (`Program.cs` demonstrates the pattern).
- **Zastava Observer/Webhook**: will reuse the same helper once the service adds `AddSurfaceEnvironment(options => options.AddPrefix("ZASTAVA"))` so per-component overrides function without diverging defaults.
- **Scheduler / CLI / BuildX (future)**: treat `ISurfaceEnvironment` as read-only input; secret lookup, cache plumbing, and validation happen before any queue/enqueue work.
### 6.1 Misconfiguration warnings
Readiness probes should invoke `ISurfaceValidatorRunner` (registered by `AddSurfaceValidation`) and fail the endpoint when any issue is returned. The Scanner Worker/WebService hosted services already run the validators on startup; other consumers should follow the same pattern.
Surface.Env surfaces actionable warnings that appear in structured logs and readiness responses:
### 6.1 Validation output
- `surface.env.cache_root_missing` emitted when the resolved cache directory does not exist or is not writable. The host attempts to create the directory once; subsequent failures block startup.
- `surface.env.endpoint_unreachable` emitted when `SurfaceFsEndpoint` is missing or not an absolute HTTPS URI.
- `surface.env.secrets_provider_invalid` emitted when the configured secrets provider lacks mandatory fields (e.g., `SCANNER_SURFACE_SECRETS_ROOT` for the `file` provider).
`LoggingSurfaceValidationReporter` produces log entries that include:
Each warning includes remediation text and a reference to this design document; operations runbooks should treat these warnings as blockers in production and as validation hints in staging.
```
Surface validation issue for component Scanner.Worker: SURFACE_ENV_MISSING_ENDPOINT - Surface FS endpoint is missing or invalid. Hint: Set SCANNER_SURFACE_FS_ENDPOINT to the RustFS/S3 endpoint.
```
Treat `SurfaceValidationIssueCodes.*` with severity `Error` as hard blockers (readiness must fail). `Warning` entries flag configuration drift (for example, missing namespaces) but allow startup so staging/offline runs can proceed. The codes appear in both the structured log state and the reporter payload, making it easy to alert on them.
## 7. Security & Observability
- Never log raw secrets; Surface.Env redacts values by default.
- Emit metric `surface_env_validation_total{status}` to observe validation outcomes.
- Provide `/metrics` gauge for cache quota/residual via Surface.FS integration.
- Surface.Env never logs raw values; only suffix names and issue codes appear in logs. `RawVariables` is intended for diagnostics bundles and should be treated as sensitive metadata.
- TLS certificates are loaded into memory and not re-serialised; only the configured paths are exposed to downstream services.
- To emit metrics, register a custom `ISurfaceValidationReporter` (e.g., wrapping Prometheus counters) in addition to the logging reporter.
## 8. Offline & Air-Gap Support
- Defaults assume no public network access; endpoints should point to internal RustFS or S3-compatible system.
- Offline kit templates supply env files under `offline/scanner/surface-env.env`.
- Document steps in `docs/modules/devops/runbooks/zastava-deployment.md` and `offline-kit` tasks for synchronising env values.
- Defaults assume no public network access; point `SCANNER_SURFACE_FS_ENDPOINT` at an internal RustFS/S3 mirror.
- Offline bundles must capture an env file (Ops track this under the Offline Kit tasks) so operators can seed `SCANNER_*` values before first boot.
- Keep `docs/modules/devops/runbooks/zastava-deployment.md` in sync so Zastava deployments reuse the same env contract.
## 9. Testing Strategy