Add tests for SBOM generation determinism across multiple formats

- Created `StellaOps.TestKit.Tests` project for unit tests related to determinism.
- Implemented `DeterminismManifestTests` to validate deterministic output for canonical bytes and strings, file read/write operations, and error handling for invalid schema versions.
- Added `SbomDeterminismTests` to ensure identical inputs produce consistent SBOMs across SPDX 3.0.1 and CycloneDX 1.6/1.7 formats, including parallel execution tests.
- Updated project references in `StellaOps.Integration.Determinism` to include the new determinism testing library.
This commit is contained in: master
2025-12-23 18:56:12 +02:00
parent 7ac70ece71
commit bc4318ef97
88 changed files with 6974 additions and 1230 deletions

@@ -36,6 +36,7 @@ How to navigate
- orchestrator/api.md - Orchestrator API surface
- orchestrator/cli.md - Orchestrator CLI commands
- orchestrator/console.md - Orchestrator console views
- orchestrator/runbook.md - Orchestrator operations runbook
- operations/quickstart.md - First scan workflow
- operations/install-deploy.md - Install and deployment guidance
- operations/deployment-versioning.md - Versioning and promotion model
@@ -47,6 +48,12 @@ How to navigate
- operations/runtime-readiness.md - Runtime readiness checks
- operations/slo.md - Service SLO overview
- operations/runbooks.md - Operational runbooks and incident response
- operations/key-rotation.md - Signing key rotation runbook
- operations/proof-verification.md - Proof verification runbook
- operations/score-proofs.md - Score proofs and replay operations
- operations/reachability.md - Reachability operations
- operations/trust-lattice.md - Trust lattice operations
- operations/unknowns-queue.md - Unknowns queue operations
- operations/notifications.md - Notifications Studio operations
- notifications/overview.md - Notifications overview
- notifications/rules.md - Notification rules and routing
@@ -54,8 +61,11 @@ How to navigate
- notifications/templates.md - Notification templates
- notifications/digests.md - Notification digests
- notifications/pack-approvals.md - Pack approval notifications
- notifications/runbook.md - Notifications operations runbook
- operations/router-rate-limiting.md - Gateway rate limiting
- release/release-engineering.md - Release and CI/CD overview
- release/promotion-attestations.md - Promotion-time attestation predicate
- release/release-notes.md - Release notes index and templates
- api/overview.md - API surface and conventions
- api/auth-and-tokens.md - Authority, OpTok, DPoP and mTLS, PoE
- policy/policy-system.md - Policy DSL, lifecycle, and governance
@@ -99,12 +109,16 @@ How to navigate
- ui/branding.md - Tenant branding model
- data-and-schemas.md - Storage, schemas, and determinism rules
- data/persistence.md - Database model and migration notes
- data/postgresql-operations.md - PostgreSQL operations guide
- data/postgresql-patterns.md - RLS and partitioning patterns
- data/events.md - Event envelopes and validation
- sbom/overview.md - SBOM formats, mapping, and heuristics
- governance/approvals.md - Approval routing and audit
- governance/exceptions.md - Exception lifecycle and controls
- security-and-governance.md - Security policy, hardening, governance, compliance
- security/identity-tenancy-and-scopes.md - Authority scopes and tenancy rules
- security/multi-tenancy.md - Tenant lifecycle and isolation model
- security/row-level-security.md - Database RLS enforcement
- security/crypto-and-trust.md - Crypto profiles and trust roots
- security/crypto-compliance.md - Regional crypto profiles and licensing notes
- security/quota-and-licensing.md - Offline quota and JWT licensing
@@ -114,8 +128,19 @@ How to navigate
- security/audit-events.md - Authority audit event schema
- security/revocation-bundles.md - Revocation bundle format and verification
- security/risk-model.md - Risk scoring model and explainability
- risk/overview.md - Risk scoring overview
- risk/factors.md - Risk factor catalog
- risk/formulas.md - Risk scoring formulas
- risk/profiles.md - Risk profile schema and lifecycle
- risk/explainability.md - Risk explainability payloads
- risk/api.md - Risk API endpoints
- security/forensics-and-evidence-locker.md - Evidence locker and forensic storage
- security/evidence-locker-publishing.md - Evidence locker publishing process
- security/timeline.md - Timeline event ledger and exports
- provenance/inline-provenance.md - DSSE metadata and transparency links
- provenance/attestation-workflow.md - Attestation workflow and verification
- provenance/rekor-policy.md - Rekor submission budget policy
- provenance/backfill.md - Provenance backfill procedure
- signals/unknowns.md - Unknowns registry and signals model
- signals/unknowns-ranking.md - Unknowns scoring and triage bands
- signals/uncertainty.md - Uncertainty states and tiers
@@ -129,7 +154,18 @@ How to navigate
- migration/overview.md - Migration paths and parity guidance
- vex/consensus.md - VEX consensus overview
- testing-and-quality.md - Test strategy and quality gates
- testing/router-chaos.md - Router chaos testing scenarios
- observability.md - Metrics, logs, tracing, telemetry stack
- observability-standards.md - Telemetry envelope, scrubbing, sampling
- observability-logging.md - Logging fields and redaction
- observability-tracing.md - Trace propagation and span conventions
- observability-metrics-slos.md - Core metrics and SLO guidance
- observability-telemetry-controls.md - Propagation, sealed mode, incident mode
- observability-aoc.md - AOC ingestion observability
- observability-aggregation.md - Aggregation pipeline observability
- observability-policy.md - Policy Engine observability
- observability-ui-telemetry.md - Console telemetry metrics and alerts
- observability-vuln-telemetry.md - Vulnerability explorer telemetry
- developer/onboarding.md - Local dev setup and workflows
- developer/plugin-sdk.md - Plugin SDK summary
- developer/devportal.md - Developer portal publishing

@@ -7,6 +7,11 @@ Envelope types
- Orchestrator events: versioned envelopes with idempotency keys and trace context.
- Legacy Redis envelopes: transitional schemas used for older consumers.
Event catalog (examples)
- scanner.event.report.ready@1 and scanner.event.scan.completed@1 (orchestrator envelopes).
- scanner.report.ready@1 and scanner.scan.completed@1 (legacy Redis envelopes).
- scheduler.rescan.delta@1, scheduler.graph.job.completed@1, attestor.logged@1.
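The catalog entries above share a `name@version` shape. A minimal sketch of splitting such an identifier, assuming the text after the last `@` is always an integer schema version; the helper name `parse_event_kind` is hypothetical and not part of the documented tooling:

```python
from typing import NamedTuple


class EventKind(NamedTuple):
    name: str
    version: int


def parse_event_kind(identifier: str) -> EventKind:
    """Split 'scanner.event.report.ready@1' into its name and integer version."""
    name, sep, version = identifier.rpartition("@")
    # Reject identifiers with no '@', an empty name, or a non-numeric version.
    if not sep or not name or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not a name@version identifier: {identifier!r}")
    return EventKind(name, int(version))
```

Parsing the version as an integer lets a consumer pick the matching schema file without string comparisons.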
Orchestrator envelope fields (v1)
- eventId, kind, version, tenant
- occurredAt, recordedAt
@@ -26,6 +31,8 @@ Versioning rules
Validation
- Schemas and samples live under docs/events/ and docs/events/samples/.
- Offline validation uses ajv-cli; keep schema checks deterministic.
- Validate schemas with ajv compile and validate samples against matching schemas.
- Add new samples for each new schema version.
Related references
- docs/events/README.md

@@ -32,3 +32,5 @@ Migration notes
Related references
- ADR: docs/adr/0001-postgresql-for-control-plane.md
- Module architecture: docs/modules/*/architecture.md
- data/postgresql-operations.md
- data/postgresql-patterns.md

@@ -0,0 +1,36 @@
# PostgreSQL operations
Purpose
- Operate the canonical PostgreSQL control plane with deterministic behavior.
Schema topology
- Per-module schemas: authority, vuln, vex, scheduler, notify, policy, concelier, audit.
- Tenant isolation enforced via tenant_id and RLS policies.
Performance setup
- Enable pg_stat_statements for query analysis.
- Tune shared_buffers, effective_cache_size, work_mem, and WAL sizes per host.
- Use PgBouncer in transaction pooling mode for high concurrency.
Session defaults
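- SET app.tenant_id per connection; behind PgBouncer transaction pooling, set it per transaction (SET LOCAL or set_config with is_local = true), because session-level settings persist on the pooled server connection and can leak between clients.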
- SET timezone to UTC.
- Enforce statement_timeout for long-running queries.
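The session defaults above could be applied through PostgreSQL's `set_config(name, value, is_local)` function, which accepts bind parameters where plain `SET` does not. The sketch below only builds the statements; the `%s` placeholder style, the helper name, and the 30-second timeout are assumptions for illustration, not values from this guide:

```python
from typing import List, Tuple

Statement = Tuple[str, Tuple[str, ...]]


def session_default_statements(tenant_id: str, statement_timeout_ms: int = 30_000) -> List[Statement]:
    """Return (sql, params) pairs that apply the session defaults listed above."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    # is_local = false makes the setting last for the session, like plain SET.
    sql = "SELECT set_config(%s, %s, false)"
    return [
        (sql, ("app.tenant_id", tenant_id)),
        (sql, ("TimeZone", "UTC")),
        # A bare integer statement_timeout is interpreted as milliseconds.
        (sql, ("statement_timeout", str(statement_timeout_ms))),
    ]
```

Passing the tenant id as a parameter keeps it out of the SQL text, which matters when the value originates from a request header.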
Query analysis
- Use pg_stat_statements to find high total and high mean latency queries.
- Use EXPLAIN ANALYZE with BUFFERS to detect missing indexes.
Backups and restore
- Use scheduled logical or physical backups with tested restore paths.
- Keep PITR capability where required by retention policies.
- Validate backups with deterministic restore tests.
Monitoring
- Track connection count, replication lag, and slow query rates.
- Alert on pool saturation and replication delays.
Related references
- data/postgresql-patterns.md
- data/persistence.md
- docs/operations/postgresql-guide.md

@@ -0,0 +1,33 @@
# PostgreSQL patterns
Row-level security (RLS)
- Require tenant context via app.tenant_id session setting.
- Policies filter by tenant_id on all tenant-scoped tables.
- Admin operations use explicit bypass roles and audited access.
Validating RLS
- Run staging tests that attempt cross-tenant reads and writes.
- Use deterministic replay tests for RLS regressions.
Bitemporal unknowns
- Store current and historical states with valid_from and valid_to.
- Support point-in-time queries and deterministic ordering.
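A point-in-time lookup over valid_from and valid_to can be sketched in memory as follows. The half-open window, the `None` marker for the current row, and the field names other than valid_from and valid_to are assumptions made for the example:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class UnknownState:
    unknown_id: str
    state: str
    valid_from: datetime
    valid_to: Optional[datetime]  # None marks the current (open-ended) row


def as_of(rows: List[UnknownState], instant: datetime) -> List[UnknownState]:
    """Return rows valid at `instant`, using a half-open [valid_from, valid_to) window."""
    hits = [
        r for r in rows
        if r.valid_from <= instant and (r.valid_to is None or instant < r.valid_to)
    ]
    # Sort on a stable key so repeated queries return rows in the same order.
    return sorted(hits, key=lambda r: (r.unknown_id, r.valid_from))


# Example history: one unknown that was open in January and resolved from February on.
JAN = datetime(2025, 1, 1, tzinfo=timezone.utc)
FEB = datetime(2025, 2, 1, tzinfo=timezone.utc)
HISTORY = [UnknownState("u1", "open", JAN, FEB), UnknownState("u1", "resolved", FEB, None)]
```

The half-open window means a row stops being valid at exactly the instant its successor starts, so no timestamp matches two versions of the same record.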
Time-based partitioning
- Partition high-volume tables by time.
- Pre-create future partitions and archive old partitions.
- Use deterministic maintenance checklists for partition health.
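Pre-creating future partitions comes down to computing the upcoming time ranges ahead of need. A small sketch, assuming monthly partitions and a `<table>_YYYY_MM` naming scheme; neither the granularity nor the naming is specified in this document:

```python
from datetime import date
from typing import List, Tuple


def month_start(d: date, offset: int) -> date:
    """First day of the month `offset` months after the month containing `d`."""
    index = d.year * 12 + (d.month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def future_partitions(table: str, today: date, months_ahead: int) -> List[Tuple[str, date, date]]:
    """List (partition_name, from_inclusive, to_exclusive) for this month and the next ones."""
    out = []
    for i in range(months_ahead + 1):
        start, end = month_start(today, i), month_start(today, i + 1)
        out.append((f"{table}_{start:%Y_%m}", start, end))
    return out
```

A maintenance job could compare this list with the partitions that already exist and create only the missing ones, which keeps the check idempotent.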
Generated columns
- Use generated columns for derived flags and query optimization.
- Add columns via migrations and backfill deterministically.
Troubleshooting
- RLS failures: verify tenant context and policy attachment.
- Partition issues: check missing partitions and default tables.
- Bitemporal queries: confirm valid time windows and index usage.
Related references
- data/postgresql-operations.md
- security/multi-tenancy.md
- docs/operations/postgresql-patterns-runbook.md

@@ -22,3 +22,4 @@ Related references
- docs/notifications/overview.md
- docs/notifications/architecture.md
- docs2/operations/notifications.md
- notifications/runbook.md

@@ -0,0 +1,40 @@
# Notifications runbook
Purpose
- Deploy and operate the Notifications WebService and Worker.
Pre-flight
- Secrets stored in Authority (SMTP, Slack, webhook HMAC).
- Outbound allowlist configured for channels.
- PostgreSQL and Valkey reachable; health checks pass.
- Offline kit loaded with templates and rule seeds.
Deploy
- Deploy images with digests pinned.
- Set Notify Postgres, Redis, Authority, and allowlist settings.
- Warm caches via /api/v1/notify/admin/warm when needed.
Monitor
- notify_delivery_attempts_total by status and channel.
- notify_escalation_stage_total and notify_rule_eval_seconds.
- Logs include tenant, ruleId, deliveryId, channel, status.
Common operations
- List failed deliveries and replay.
- Pause a tenant without dropping audit events.
- Rotate channel secrets via refresh endpoints.
Failure recovery
- After worker crashes, validate templates and Redis connectivity.
- Replay deliveries after database recovery.
- Disable channels during upstream outages.
Determinism safeguards
- Rule snapshots versioned per tenant.
- Template rendering uses deterministic helpers.
- UTC time sources for quiet hours.
Related references
- notifications/overview.md
- notifications/rules.md
- docs/operations/notifier-runbook.md

@@ -0,0 +1,34 @@
# Aggregation observability
Purpose
- Track Link-Not-Merge aggregation and overlay pipelines.
Metrics
- aggregation_ingest_latency_seconds{tenant,source,status}
- aggregation_conflict_total{tenant,advisory,product,reason}
- aggregation_overlay_cache_hits_total, aggregation_overlay_cache_misses_total
- aggregation_vex_gate_total{tenant,status}
- aggregation_queue_depth{tenant}
Traces
- Span: aggregation.process
- Attributes: tenant, advisory, product, vex_status, source_kind, overlay_version, cache_hit
Logs
- tenant, advisory, product, vex_status
- decision (merged, suppressed, dropped)
- reason, duration_ms, trace_id
SLOs
- Ingest latency p95 < 500ms per statement.
- Overlay cache hit rate > 80%.
- Error rate < 0.1% over 10 minutes.
Alerts
- HighConflictRate: aggregation_conflict_total delta > 100 per minute.
- QueueBacklog: aggregation_queue_depth > 10k for 5 minutes.
- LowCacheHit: cache hit rate < 60% for 10 minutes.
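The alert thresholds above can be expressed as a simple evaluation over sampled values. This sketch checks instantaneous values only and ignores the "for N minutes" hold periods, which an alerting system would normally enforce; the function names are illustrative:

```python
from typing import List


def cache_hit_rate(hits: int, misses: int) -> float:
    """Hit ratio from the two overlay cache counters; 1.0 when there was no traffic."""
    total = hits + misses
    return 1.0 if total == 0 else hits / total


def firing_alerts(conflicts_per_minute: float, queue_depth: int, hits: int, misses: int) -> List[str]:
    """Return the names of alerts whose threshold is currently crossed."""
    alerts = []
    if conflicts_per_minute > 100:
        alerts.append("HighConflictRate")
    if queue_depth > 10_000:
        alerts.append("QueueBacklog")
    if cache_hit_rate(hits, misses) < 0.60:
        alerts.append("LowCacheHit")
    return alerts
```

Note the gap between the 80% SLO and the 60% alert: a hit rate of 70% misses the SLO without paging anyone.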
Offline posture
- Export metrics to local Prometheus scrape.
- Deterministic ordering preserved; cache warmers seeded from bundled fixtures.

@@ -0,0 +1,49 @@
# AOC observability
Purpose
- Monitor Aggregation-Only ingestion for Concelier and Excititor.
- Provide deterministic metrics, traces, and logs for AOC guardrails.
Core metrics
- ingestion_write_total{source,tenant,result}
- ingestion_latency_seconds{source,tenant,phase}
- aoc_violation_total{source,tenant,code}
- ingestion_signature_verified_total{source,tenant,result}
- advisory_revision_count{source,tenant}
- verify_runs_total{tenant,initiator}
- verify_duration_seconds{tenant,initiator}
Alert guidance
- Violation spike: increase(aoc_violation_total[15m]) > 0 for critical sources.
- Stale ingestion: no growth in ingestion_write_total for > 60 minutes.
- Signature drop: rising ingestion_signature_verified_total{result="fail"}.
Health snapshot endpoint
- GET /obs/excititor/health returns ingest, link, signature, conflict status.
- Settings control warning and critical thresholds for lag, coverage, and conflict ratio.
Trace taxonomy
- ingest.fetch, ingest.transform, ingest.write
- aoc.guard for violations
- verify.run for verification jobs
Log fields
- traceId, tenant, source.vendor, upstream.upstreamId
- contentHash, violation.code, verification.window
- Correlation headers: X-Stella-TraceId, X-Stella-CorrelationId
Advisory AI chunk metrics
- advisory_ai_chunk_requests_total
- advisory_ai_chunk_latency_milliseconds
- advisory_ai_chunk_segments
- advisory_ai_chunk_sources
- advisory_ai_guardrail_blocks_total
Dashboards
- AOC ingestion health: sources overview, violations, signature rate, supersedes depth.
- Offline mode dashboard from offline snapshots.
Offline posture
- Metrics exporters write to local Prometheus snapshots in offline kits.
- CLI verification reports are hashed and archived.
- Dashboards support offline data sources.

@@ -0,0 +1,39 @@
# Logging standards
Goals
- Deterministic, structured logs for all services.
- Safe for tenant isolation and offline review.
Required fields
- timestamp (UTC ISO-8601)
- tenant, workload, env, region, version
- level (debug, info, warn, error, fatal)
- category and operation
- trace_id, span_id, correlation_id when present
- message (concise, no secrets)
- status (ok, error, fault, throttle)
- error.code, error.message (redacted), retryable when status is not ok
Optional fields
- resource, http.method, http.status_code, duration_ms
- host, pid, thread
Offline kit import fields
- tenant_id, bundle_type, bundle_digest, bundle_path
- manifest_version, manifest_created_at
- force_activate, force_activate_reason
- result, reason_code, reason_message
- quarantine_id, quarantine_path
Redaction rules
- Never log auth headers, tokens, passwords, private keys, or full bodies.
- Redact to "[redacted]" and add redaction.reason.
- Hash low-cardinality identifiers and mark hashed=true.
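A scrubber following these redaction rules might look like the sketch below. The key names in `SECRET_KEYS`, the single `secret` reason, and the choice of SHA-256 for hashing are assumptions for illustration, not the platform's actual scrubber:

```python
import hashlib
from typing import Any, Dict, FrozenSet

# Assumed field names; a real deployment would configure its own denylist.
SECRET_KEYS = {"authorization", "token", "password", "private_key", "body"}


def scrub(fields: Dict[str, Any], hash_keys: FrozenSet[str] = frozenset()) -> Dict[str, Any]:
    """Return a copy of `fields` with secrets redacted and selected identifiers hashed."""
    out: Dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SECRET_KEYS:
            out[key] = "[redacted]"
            out["redaction.reason"] = "secret"
        elif key in hash_keys:
            out[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            out["hashed"] = True
        else:
            out[key] = value
    return out
```

Returning a copy rather than mutating the input keeps the original record available to code paths that legitimately need the unredacted value.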
Determinism and offline posture
- NDJSON with LF endings; UTC timestamps only.
- No external enrichment; rely on bundled metadata.
Sampling and rate limits
- Info logs rate-limited per component; warn and error never sampled.
- Audit logs are never sampled and include actor, action, target, result.

@@ -0,0 +1,57 @@
# Metrics and SLOs
Core metrics (platform-wide)
- http_requests_total{tenant,workload,route,status}
- http_request_duration_seconds (histogram)
- worker_jobs_total{tenant,queue,status}
- worker_job_duration_seconds (histogram)
- db_query_duration_seconds{db,operation}
- db_pool_in_use, db_pool_available
- cache_requests_total{result=hit|miss}
- cache_latency_seconds (histogram)
- queue_depth{tenant,queue}
- errors_total{tenant,workload,code}
SLO targets (suggested)
- API availability: 99.9% monthly per public service.
- P95 latency: <300ms reads, <1s writes.
- Worker job success: >99% over 30d.
- Queue backlog: alert when queue_depth > 1000 for 5 minutes.
Alert examples
- Error rate: rate(errors_total[5m]) / rate(http_requests_total[5m]) > 0.02
- Latency regression: p95 http_request_duration_seconds > 0.3s
- Queue backlog: queue_depth > 1000 for 5 minutes
- Job failures: rate(worker_jobs_total{status="failed"}[10m]) > 0.01
UX KPIs (triage TTFS)
- P95 first evidence <= 1.5s; skeleton <= 0.2s.
- Clicks-to-closure median <= 6.
- Evidence completeness >= 90% (>= 3.6/4).
TTFS metrics
- ttfs_latency_seconds{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_signal_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_cache_hit_total, ttfs_cache_miss_total
- ttfs_slo_breach_total{surface,cache_hit,signal_source,kind,phase,tenant_id}
- ttfs_error_total{surface,cache_hit,signal_source,kind,phase,tenant_id,error_type,error_code}
Offline kit metrics
- offlinekit_import_total{status,tenant_id}
- offlinekit_attestation_verify_latency_seconds{attestation_type,success}
- attestor_rekor_success_total{mode}
- attestor_rekor_retry_total{reason}
- rekor_inclusion_latency{success}
Scanner FN-Drift metrics
- scanner.fn_drift.percent (30-day rolling percentage)
- scanner.fn_drift.transitions_30d and scanner.fn_drift.evaluated_30d
- scanner.fn_drift.cause.feed_delta, rule_delta, lattice_delta, reachability_delta, engine
- scanner.classification_changes_total{cause}
- scanner.fn_transitions_total{cause}
- SLO targets: warning above 1.0%, critical above 2.5%, engine drift > 0%
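The FN-Drift thresholds can be checked with a few lines of arithmetic. The sketch assumes that the percentage is transitions_30d divided by evaluated_30d, which this page does not state explicitly, and it reports engine drift as a separate flag rather than folding it into the warning and critical levels:

```python
from typing import Tuple

WARNING_PERCENT = 1.0
CRITICAL_PERCENT = 2.5


def fn_drift_percent(transitions_30d: int, evaluated_30d: int) -> float:
    """Assumed definition: transitions as a percentage of evaluated findings."""
    if evaluated_30d <= 0:
        return 0.0
    return 100.0 * transitions_30d / evaluated_30d


def classify_drift(percent: float, engine_transitions: int) -> Tuple[str, bool]:
    """Return (level, engine_drift_breached) against the SLO targets above."""
    if percent > CRITICAL_PERCENT:
        level = "critical"
    elif percent > WARNING_PERCENT:
        level = "warning"
    else:
        level = "ok"
    return level, engine_transitions > 0
```

For example, 12 transitions out of 1,000 evaluated findings gives 1.2%, which lands in the warning band.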
Hygiene
- Tag metrics with tenant, workload, env, region, version.
- Keep metric names stable and namespace custom metrics per module.
- Use deterministic bucket boundaries and consistent units.

@@ -0,0 +1,48 @@
# Policy observability
Purpose
- Capture Policy Engine metrics, logs, traces, and incident workflows.
Metrics
- policy_run_seconds{tenant,policy,mode}
- policy_run_queue_depth{tenant}
- policy_run_failures_total{tenant,policy,reason}
- policy_run_retries_total{tenant,policy}
- policy_run_inputs_pending_bytes{tenant}
- policy_rules_fired_total{tenant,policy,rule}
- policy_vex_overrides_total{tenant,policy,vendor,justification}
- policy_suppressions_total{tenant,policy,action}
- policy_selection_batch_duration_seconds{tenant,policy}
- policy_materialization_conflicts_total{tenant,policy}
- policy_api_requests_total{endpoint,method,status}
- policy_api_latency_seconds{endpoint,method}
- policy_api_rate_limited_total{endpoint}
- policy_queue_leases_active{tenant}
- policy_queue_lease_expirations_total{tenant}
- policy_delta_backlog_age_seconds{tenant,source}
Logs
- Structured JSON with policyId, policyVersion, tenant, runId, rule, traceId, env.sealed.
- Categories: policy.run, policy.evaluate, policy.materialize, policy.simulate, policy.lifecycle.
- Rule-hit logs sample at 1% by default; incident mode raises to 100%.
Traces
- policy.api, policy.select, policy.evaluate, policy.materialize, policy.simulate.
- Trace context propagated to CLI and UI.
Alerts
- PolicyRunSlaBreach: p95 policy_run_seconds too high.
- PolicyQueueStuck: policy_delta_backlog_age_seconds > 600.
- DeterminismMismatch: ERR_POL_004 or replay diff.
- SimulationDrift: simulation exit 20 over threshold.
- VexOverrideSpike and SuppressionSurge.
Incident mode
- POST /api/policy/incidents/activate toggles sampling to 100%.
- Retention extends to 30 days during incident.
- policy.incident.activated event emitted.
Integration points
- Authority metrics for scope_denied events.
- Concelier and Excititor trace propagation via gRPC metadata.
- Offline kits export metrics and logs snapshots.

@@ -0,0 +1,29 @@
# Observability standards
Common envelope fields
- Trace context: trace_id, span_id, trace_flags; propagate W3C traceparent and baggage.
- Tenant and workload: tenant, workload (service), region, env, version.
- Subject: component, operation, resource (purl or uri when safe).
- Timing: UTC ISO-8601 timestamp; durations in milliseconds.
- Outcome: status (ok, error, fault, throttle), error.code, redacted error.message, retryable.
Scrubbing policy
- Denylist PII and secrets: emails, tokens, auth headers, private keys, passwords.
- Redact to "[redacted]" and add redaction.reason (secret, pii, tenant_policy).
- Hash low-cardinality identifiers with sha256 and mark hashed=true.
- Never log full request or response bodies; store hashes and lengths only.
Sampling defaults
- Traces: 10% non-prod, 5% prod; always sample error or audit spans.
- Logs: info logs rate-limited; warn and error never sampled.
- Metrics: never sampled; stable histogram buckets per component.
Redaction override
- Overrides require a ticket id and are time-bound.
- Config: telemetry.redaction.overrides and telemetry.redaction.override_ttl (default 24h).
- Emit telemetry.redaction.audit with actor, fields, and TTL.
Determinism and offline
- No external enrichers; use bundled service maps and tenant metadata only.
- Export ordering: timestamp, workload, operation.
- Always use UTC; NDJSON for log exports.
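The export ordering rule can be made concrete with a short serializer. This sketch assumes every record carries timestamp, workload, and operation keys and that timestamps are UTC ISO-8601 strings in one uniform format, so string order matches time order:

```python
import json
from typing import Dict, Iterable


def export_ndjson(records: Iterable[Dict[str, object]]) -> str:
    """Serialize records as NDJSON ordered by timestamp, workload, operation."""
    ordered = sorted(records, key=lambda r: (r["timestamp"], r["workload"], r["operation"]))
    # sort_keys and fixed separators keep the bytes identical across runs.
    lines = [json.dumps(r, sort_keys=True, separators=(",", ":")) for r in ordered]
    # LF line endings only, with a trailing newline after the last record.
    return "".join(line + "\n" for line in lines)
```

Because both the record order and the key order are fixed, two exports of the same data hash to the same value regardless of input order.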

@@ -0,0 +1,61 @@
# Telemetry controls and propagation
Bootstrap wiring
- AddStellaOpsTelemetry wires metrics and tracing with deterministic defaults.
- Disable exporters when sealed or when egress is not allowed.
Minimal host wiring (example)
```csharp
builder.Services.AddStellaOpsTelemetry(
    builder.Configuration,
    serviceName: "StellaOps.SampleService",
    serviceVersion: builder.Configuration["VERSION"],
    configureOptions: options =>
    {
        options.Collector.Enabled = builder.Configuration.GetValue<bool>("Telemetry:Collector:Enabled", true);
        options.Collector.Endpoint = builder.Configuration["Telemetry:Collector:Endpoint"];
        options.Collector.Protocol = TelemetryCollectorProtocol.Grpc;
    },
    configureMetrics: m => m.AddAspNetCoreInstrumentation(),
    configureTracing: t => t.AddHttpClientInstrumentation());
```
Propagation rules
- HTTP headers: traceparent, tracestate, x-stella-tenant, x-stella-actor, x-stella-imposed-rule.
- gRPC metadata: stella-tenant, stella-actor, stella-imposed-rule.
- Tenant is required for all requests except sealed diagnostics jobs.
Metrics helper expectations
- Golden signals: http.server.duration, http.client.duration, messaging.operation.duration, job.execution.duration, runtime.gc.pause, db.call.duration.
- Mandatory tags: tenant, service, endpoint or operation, result (ok|error|cancelled|throttled), sealed.
- Cardinality guard trims tag values to 64 chars and caps distinct values per key.
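The cardinality guard described above can be pictured as follows. This is an illustrative Python sketch rather than the .NET helper itself; the cap of 100 distinct values and the `other` overflow placeholder are assumptions, since only the 64-character trim is stated here:

```python
from collections import defaultdict
from typing import Dict, Set

MAX_VALUE_LENGTH = 64       # from the rule above
MAX_DISTINCT_PER_KEY = 100  # assumed cap; the text does not give a number
OVERFLOW_VALUE = "other"    # assumed placeholder for values past the cap


class CardinalityGuard:
    def __init__(self, max_distinct: int = MAX_DISTINCT_PER_KEY) -> None:
        self._max = max_distinct
        self._seen: Dict[str, Set[str]] = defaultdict(set)

    def apply(self, key: str, value: str) -> str:
        """Trim the tag value, then admit it only while the key is under its cap."""
        trimmed = value[:MAX_VALUE_LENGTH]
        seen = self._seen[key]
        if trimmed in seen:
            return trimmed
        if len(seen) >= self._max:
            return OVERFLOW_VALUE
        seen.add(trimmed)
        return trimmed
```

Values admitted before the cap was reached keep passing through unchanged, so existing time series stay stable while new ones are folded into the overflow bucket.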
Scrubbing configuration
- Telemetry:Scrub:Enabled (default true)
- Telemetry:Scrub:Sealed (forces scrubbing when sealed)
- Telemetry:Scrub:HashSalt (optional)
- Telemetry:Scrub:MaxValueLength (default 256)
Sealed mode behavior
- Disable external exporters; use memory or file OTLP.
- Tag sealed=true and scrubbed=true on all records.
- Sampling capped by Telemetry:Sealed:MaxSamplingPercent.
- File exporter rotates deterministically and enforces 0600 permissions.
Sealed mode config keys
- Telemetry:Sealed:Enabled
- Telemetry:Sealed:Exporter (memory|file)
- Telemetry:Sealed:FilePath
- Telemetry:Sealed:MaxBytes
- Telemetry:Sealed:MaxSamplingPercent
Incident mode (CLI)
- Flag: --incident-mode
- Config: Telemetry:Incident:Enabled and Telemetry:Incident:TTL
- State file: ~/.stellaops/incident-mode.json (0600 permissions)
- Emits telemetry.incident.activated and telemetry.incident.expired audit events.
Determinism
- UTC timestamps and stable ordering for OTLP exports.
- No external enrichment in sealed mode.

@@ -0,0 +1,27 @@
# Tracing standards
Goals
- Consistent distributed tracing across services, workers, and CLI.
- Safe for offline and air-gapped deployments.
Context propagation
- Use W3C traceparent and baggage only.
- Preserve incoming trace_id and create child spans per operation.
- For async work, attach stored trace context as links rather than a new parent.
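Preserving the incoming trace_id while creating a child span can be sketched against the W3C traceparent layout (`version-traceid-parentid-flags`). The helper names are hypothetical, and the sketch accepts only the version `00` layout:

```python
import re
import secrets
from typing import Optional, Tuple

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    # All-zero ids are explicitly invalid in the W3C format.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def child_traceparent(header: str) -> Optional[str]:
    """Keep the incoming trace_id and flags, but mint a new span id for the child."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    trace_id, _parent, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Returning `None` for a malformed header leaves the caller free to start a fresh trace instead of propagating a broken context.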
Span conventions
- Names use <component>.<operation> (example: policy.evaluate).
- Required attributes: tenant, workload, env, region, version, operation, status.
- HTTP spans: http.method, http.route, http.status_code, net.peer.name, net.peer.port.
- DB spans: db.system, db.name, db.operation, db.statement (no literals).
- Message spans: messaging.system, messaging.destination, messaging.operation, messaging.message_id.
- Errors: status=error with error.code, redacted error.message, retryable.
Sampling
- Default head sampling: 10% non-prod, 5% prod.
- Always sample error or audit spans.
- Override via Tracing__SampleRate per service.
Offline posture
- No external exporters; emit OTLP to local collector or file.
- UTC timestamps only.

@@ -0,0 +1,45 @@
# Console telemetry
Purpose
- Capture console performance, security signals, and offline behavior.
Metrics
- ui_route_render_seconds{route,tenant,device}
- ui_request_duration_seconds{service,method,status,tenant}
- ui_filter_apply_total{route,filter,tenant}
- ui_tenant_switch_total{fromTenant,toTenant,trigger}
- ui_offline_banner_seconds{reason,tenant}
- ui_dpop_failure_total{endpoint,reason}
- ui_fresh_auth_prompt_total{action,tenant}
- ui_fresh_auth_failure_total{action,reason}
- ui_download_manifest_refresh_seconds{tenant,channel}
- ui_download_export_queue_depth{tenant,artifactType}
- ui_download_command_copied_total{tenant,artifactType}
- ui_telemetry_batch_failures_total{transport,reason}
- ui_telemetry_queue_depth{priority,tenant}
Logs
- Categories: ui.action, ui.tenant.switch, ui.download.commandCopied, ui.security.anomaly, ui.telemetry.failure.
- Core fields: timestamp, level, action, route, tenant, subject, correlationId, offlineMode.
- PII is scrubbed; user identifiers are hashed.
Traces
- ui.route.transition, ui.api.fetch, ui.sse.stream, ui.telemetry.batch, ui.policy.action.
- W3C traceparent propagated through the gateway for cross-service stitching.
Feature flags and config
- CONSOLE_METRICS_ENABLED, CONSOLE_METRICS_VERBOSE, CONSOLE_LOG_LEVEL.
- OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS.
- CONSOLE_TELEMETRY_SSE_ENABLED to expose /console/telemetry.
Offline workflow
- Metrics scraped locally and stored with offline bundles.
- OTLP batches queue locally and expose ui_telemetry_queue_depth.
- Retain telemetry bundles for audit; export Grafana JSON with bundles.
Alerting hints
- ConsoleLatencyHigh when ui_route_render_seconds p95 exceeds target.
- BackendLatencyHigh when ui_request_duration_seconds spikes.
- TenantSwitchFailures when ui_dpop_failure_total increases.
- DownloadsBacklog when ui_download_export_queue_depth grows.
- TelemetryExportErrors when ui_telemetry_batch_failures_total > 0.

@@ -0,0 +1,22 @@
# Vuln explorer telemetry
Purpose
- Define metrics, logs, traces, and dashboards for vulnerability triage.
Planned metrics (pending final identifiers)
- findings_open_total
- mttr_seconds
- triage_actions_total
- report_generation_seconds
Planned logs
- Fields: findingId, artifactId, advisoryId, policyVersion, actor, actionType.
- Deterministic JSON with correlation IDs.
Planned traces
- Spans for triage actions and report generation.
- Sampling follows global tracing defaults; errors always sampled.
Assets and hashes
- Capture metrics, logs, traces, and dashboard exports with SHA256SUMS.
- Store assets under docs/assets/vuln-explorer/ once available.

@@ -1,14 +1,23 @@
# Observability
## Telemetry signals
- Metrics for scan latency, cache hit rate, policy evaluation time, queue depth.
- Logs are structured and include correlation IDs.
- Traces connect Scanner, Policy, Scheduler, and Notify workflows.
Overview
- Deterministic metrics, logs, and traces with tenant isolation.
- Offline-friendly exports for audits and air-gap review.
## Audit trails
- Signing and policy actions are recorded for compliance.
- Tenant and actor metadata is included in audit records.
Core references
- observability-standards.md
- observability-logging.md
- observability-tracing.md
- observability-metrics-slos.md
- observability-telemetry-controls.md
## Telemetry stack
- Telemetry module provides collectors, dashboards, and alert rules.
- Offline bundles include telemetry assets for air-gapped installs.
Service and workflow observability
- observability-aoc.md
- observability-aggregation.md
- observability-policy.md
- observability-ui-telemetry.md
- observability-vuln-telemetry.md
Audit alignment
- security/forensics-and-evidence-locker.md
- security/timeline.md

@@ -6,6 +6,30 @@ Core runbooks
- Quarantine: isolate bundles with hash or signature mismatches.
- Sealed startup diagnostics: confirm egress block and time anchor validity.
Offline kit management
- Generate full or delta kits in connected environments.
- Verify kit hash and signature before transfer.
- Import and install kit, then confirm component freshness.
Feed updates
- Use delta kits for smaller updates.
- Roll back to previous snapshot when feeds introduce regressions.
- Track feed age and kit expiry thresholds.
Scanning in air-gap mode
- Scan local images or SBOMs without registry pull.
- Generate SBOMs locally and scan from file.
- Force offline feeds when required by policy.
Verification in air-gap mode
- Verify proof bundles offline with local trust roots.
- Export and import trust bundles for signer and CA rotation.
- Run score replay with frozen timestamps if needed.
Health checks
- Monitor kit age, feed freshness, trust store validity, disk usage.
- Use deterministic health checks and keep results for audit.
Import and verify
- Validate bundle hash, manifest entries, and schema checks.
- Record import receipt with operator, time anchor, and manifest hash.
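Checking manifest entries against bundle contents might be sketched as below. The manifest layout (path mapped to a lowercase hex SHA-256 digest) and the reason strings are assumptions for illustration; the real offline kit format may differ:

```python
import hashlib
from typing import Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_manifest_entries(manifest: Dict[str, str], files: Dict[str, bytes]) -> Dict[str, str]:
    """Compare each manifest digest with the bytes on hand.

    Returns {path: reason} for every problem; an empty dict means the bundle checks out.
    """
    failures: Dict[str, str] = {}
    for path in sorted(manifest):
        if path not in files:
            failures[path] = "missing"
        elif sha256_hex(files[path]) != manifest[path].lower():
            failures[path] = "hash_mismatch"
    # Files that the manifest does not list are reported too.
    for path in sorted(set(files) - set(manifest)):
        failures[path] = "unlisted"
    return failures
```

Reporting every failure instead of stopping at the first one gives the operator a complete picture to record in the import receipt or quarantine entry.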

@@ -0,0 +1,49 @@
# Key rotation
Purpose
- Rotate signing keys without invalidating historical DSSE proofs.
Principles
- Do not mutate old DSSE envelopes.
- Keep key history; revoke instead of delete.
- Publish key material to trust anchors and mirrors.
- Audit all key lifecycle events.
Key profiles (examples)
- default: SHA256-ED25519
- fips: SHA256-ECDSA-P256
- gost: GOST-R-34.10-2012
- sm2: SM2-P256
- pqc: ML-DSA-65
Rotation workflow
1. Generate a new key in the configured keystore.
2. Add the key to the trust anchor without removing old keys.
3. Run a transition period where both keys verify.
4. Revoke the old key with an effective date.
5. Publish updated key material to attestation feeds or mirrors.
Trust anchors
- Scoped by PURL pattern and allowed predicate types.
- Store allowedKeyIds, revokedKeys, and keyHistory with timestamps.
Verification with key history
- Verify signatures using the key valid at the time of signing.
- Revoked keys remain valid for pre-revocation attestations.
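The rule that a revoked key still verifies pre-revocation attestations reduces to a time-window check over the key history. A minimal sketch, with hypothetical field names (`added_at`, `revoked_at`) standing in for whatever keyHistory actually stores:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class KeyRecord:
    key_id: str
    added_at: datetime
    revoked_at: Optional[datetime] = None  # None means the key is still active


def key_valid_at(key: KeyRecord, signed_at: datetime) -> bool:
    """True if the key was trusted at signing time: added_at <= signed_at < revoked_at."""
    if signed_at < key.added_at:
        return False
    return key.revoked_at is None or signed_at < key.revoked_at


# Example: a key revoked on 1 March 2025 still covers a signature made in February.
OLD_KEY = KeyRecord(
    "key-2024",
    added_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    revoked_at=datetime(2025, 3, 1, tzinfo=timezone.utc),
)
FEB_SIGNATURE = datetime(2025, 2, 10, tzinfo=timezone.utc)
```

The check is only as trustworthy as `signed_at`: if the signing time comes from the signer alone, a holder of a compromised key could backdate it, so an independent timestamp is the safer input.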
Emergency revocation
- Revoke compromised keys immediately and publish updated anchors.
- Re-issue trust bundles and notify downstream verifiers.
Metrics and alerts
- signer_key_age_days
- signer_keys_active_total
- signer_keys_revoked_total
- signer_rotation_events_total
- signer_verification_key_lookups_total
- Alert when keys approach or exceed maximum age.
Related references
- security/crypto-and-trust.md
- provenance/attestation-workflow.md
- docs/operations/key-rotation-runbook.md

@@ -0,0 +1,37 @@
# Proof verification
Purpose
- Verify DSSE bundles and transparency proofs for scan and score evidence.
Components
- DSSE envelope and signature bundle.
- Certificate chain and trust roots.
- Rekor inclusion proof and checkpoint when online.
Basic verification
- Verify DSSE signature against trusted roots.
- Confirm subject digest matches expected artifact.
- Validate Merkle inclusion proof when available.
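Validating a Merkle inclusion proof can be sketched as follows, assuming an RFC 6962-style tree (SHA-256, `0x00` prefix for leaves, `0x01` prefix for interior nodes) of the kind used by Rekor-compatible transparency logs. The function names are illustrative, and the bundle parsing that would supply the index, tree size, and path is left out:

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def verify_inclusion(leaf: bytes, index: int, tree_size: int, path: List[bytes], root: bytes) -> bool:
    """Recompute the root from a leaf hash and its audit path, then compare with `root`."""
    if index < 0 or index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf
    for p in path:
        if sn == 0:
            return False  # path is longer than the tree is deep
        if (fn & 1) or fn == sn:
            # Current node is a right child, or the last node on its level.
            r = node_hash(p, r)
            if not (fn & 1):
                # Skip levels where this node had no sibling.
                while (fn & 1) == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```

The computed root still has to be compared with a root taken from a signed checkpoint; a proof that matches an unauthenticated root proves nothing.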
Offline verification
- Use embedded proofs and local trust bundles.
- Skip online Rekor queries in sealed mode.
- Record verification results in timeline events.
Transparency log integration
- Check Rekor entry status and inclusion proof.
- When Rekor is unavailable, rely on cached checkpoint and proofs.
Troubleshooting cues
- DSSE signature invalid: check key rotation or trust anchors.
- Merkle root mismatch: verify checkpoint and bundle integrity.
- Certificate chain failure: refresh trust roots.
Monitoring
- Track verification latency and failure counts.
- Alert on certificate expiry or rising verification failures.
Related references
- provenance/attestation-workflow.md
- release/promotion-attestations.md
- docs/operations/proof-verification-runbook.md


@@ -0,0 +1,36 @@
# Reachability operations
Purpose
- Operate call graph ingestion, reachability computation, and explain queries.
Reachability statuses
- unreachable, possibly_reachable, reachable_static, reachable_proven, unknown.
Call graph operations
- Upload call graphs and validate schema.
- Inspect entrypoints and merge graphs when required.
- Enforce size limits and deterministic ordering.
Computation
- Trigger reachability computation per scan or batch.
- Monitor jobs for timeouts and memory caps.
- Persist results with graph_cache_epoch for replay.
Explain queries
- Explain a single finding or batch.
- Provide alternate paths and reasons for unreachable results.
Drift handling
- Track changes due to graph updates or reachability algorithm changes.
- Use drift reports to compare runs and highlight path changes.
Monitoring
- Track computation latency, queue depth, and explain request rates.
- Alert on repeated timeouts or inconsistent results.
Related references
- architecture/reachability-lattice.md
- architecture/reachability-evidence.md
- operations/score-proofs.md
- docs/operations/reachability-runbook.md
- docs/operations/reachability-drift-guide.md


@@ -12,6 +12,12 @@ Runbook set (current)
- docs/runbooks/replay_ops.md
- docs/runbooks/vex-ops.md
- docs/runbooks/vuln-ops.md
- operations/score-proofs.md
- operations/proof-verification.md
- operations/reachability.md
- operations/trust-lattice.md
- operations/unknowns-queue.md
- operations/key-rotation.md
Common expectations
- Hash and store any inbound artifacts with SHA256SUMS.


@@ -0,0 +1,46 @@
# Score proofs and replay
Purpose
- Provide deterministic score proofs with replayable inputs and attestations.
When to replay
- Determinism audits and compliance checks.
- Dispute resolution or vendor verification.
- Regression investigation after feed or policy changes.
Replay operations
- Trigger replay via CLI or API with scan or job id.
- Support batch replay with concurrency limits.
- Nightly replay jobs validate determinism at scale.
Verification
- Online verification uses DSSE and Rekor proofs.
- Offline verification uses embedded proofs and local trust bundles.
- Verification checks include bundle hash, signature, and input digests.
Bundle contents
- Manifest with inputs and hashes.
- SBOM, advisories, VEX snapshots.
- Deterministic scoring outputs and explain traces.
- DSSE bundle and transparency proof.
Retention and export
- Retain bundles per policy; export for audit with manifests.
- Store in Evidence Locker and Offline Kits.
Monitoring metrics
- score_replay_duration_seconds
- proof_verification_success_rate
- proof_bundle_size_bytes
- replay_queue_depth
- proof_generation_failures
Alerting cues
- Replay latency p95 > 30s.
- Verification failures or queue backlog spikes.
Related references
- operations/proof-verification.md
- operations/replay-and-determinism.md
- docs/operations/score-proofs-runbook.md
- docs/operations/score-replay-runbook.md


@@ -0,0 +1,33 @@
# Trust lattice operations
Purpose
- Monitor and operate trust lattice gates for VEX and policy decisions.
Core components
- Trust vectors and gate configuration.
- Verdict replay for deterministic validation.
Monitoring
- Track gate failure rate, verdict replay failures, and trust vector drift.
- Use dashboards for gate health and override usage.
Common operations
- View current trust vectors and gate configuration.
- Inspect a verdict and its trust inputs.
- Trigger manual calibration when required.
Emergency procedures
- High gate failure rate: pause dependent workflows and investigate sources.
- Verdict replay failures: verify inputs, cache epochs, and policy versions.
- Trust vector drift: run replay with frozen inputs and compare hashes.
Maintenance
- Daily checks: gate failure rate and queue depth.
- Weekly checks: trust vector calibration and drift review.
- Monthly checks: update trust bundles and audit logs.
Related references
- architecture/reachability-vex.md
- vex/consensus.md
- docs/operations/trust-lattice-runbook.md
- docs/operations/trust-lattice-troubleshooting.md


@@ -0,0 +1,32 @@
# Unknowns queue operations
Purpose
- Manage unknown components with deterministic triage and SLA tracking.
Queue model
- Bands: HOT, WARM, COLD based on score and SLA.
- Reasons include reachability gaps, provenance gaps, VEX conflicts, and ingestion gaps.
Core workflows
- List and triage unknowns by band and reason.
- Escalate or resolve with documented justification.
- Suppress with expiry and audit trail when approved.
Budgets and SLAs
- Per-environment budgets cap unknowns by reason.
- SLA timers trigger alerts when breached.
Monitoring
- unknowns_total, unknowns_hot_count, unknowns_sla_breached
- unknowns_escalation_failures, unknowns_avg_age_hours
- KEV-specific unknown counts and age
Alerting cues
- HOT band spikes or SLA breaches.
- KEV unknowns older than 24 hours.
- Rising queue growth rate.
Related references
- signals/unknowns.md
- signals/unknowns-ranking.md
- docs/operations/unknowns-queue-runbook.md


@@ -39,3 +39,4 @@ Related references
- orchestrator/cli.md
- orchestrator/console.md
- orchestrator/run-ledger.md
- orchestrator/runbook.md


@@ -0,0 +1,36 @@
# Orchestrator runbook
Pre-flight
- Verify database and queue backends are healthy.
- Confirm tenant allowlist and orchestrator scopes in Authority.
- Ensure plugin bundles are present and signatures verified.
Common operations
- Start a run via API or CLI.
- Cancel runs with idempotent requests.
- Stream status via WebSocket or CLI.
- Export run ledger as NDJSON for audit.
Incident response
- Queue backlog: scale workers and drain oldest first.
- Repeated failures: inspect error codes and inputsHash; roll back DAG version.
- Plugin auth errors: rotate secrets and warm caches.
Health checks
- /admin/health for liveness and queue depth.
- Metrics: orchestrator_runs_total, orchestrator_queue_depth,
orchestrator_step_retries_total, orchestrator_run_duration_seconds.
- Logs include tenant, dagId, runId, status with redaction.
Determinism and immutability
- Runs are append-only; never mutate ledger entries.
- Use runToken for idempotent retries.
Offline posture
- Keep DAG specs and plugins in sealed storage.
- Export logs, metrics, and traces as NDJSON.
Related references
- orchestrator/overview.md
- orchestrator/architecture.md
- docs/operations/orchestrator-runbook.md


@@ -0,0 +1,46 @@
# Attestation workflow
Purpose
- Ensure all exported evidence includes DSSE signatures and transparency proofs.
- Provide deterministic verification for online and air-gapped environments.
Workflow overview
- Producer emits a payload and requests signing.
- Signer validates policy and signs with tenant or keyless credentials.
- Attestor wraps the payload in DSSE, records transparency data, and publishes bundles.
- Export Center and Evidence Locker embed bundles in export artifacts.
- Verifiers (CLI, services, auditors) validate signatures and proofs.
Payload types
- StellaOps.BuildProvenance@1
- StellaOps.SBOMAttestation@1
- StellaOps.ScanResults@1
- StellaOps.PolicyEvaluation@1
- StellaOps.VEXAttestation@1
- StellaOps.RiskProfileEvidence@1
- StellaOps.PromotionAttestation@1
Signing and storage controls
- Default is short-lived keyless signing; tenant KMS keys are supported.
- Ed25519 and ECDSA P-256 are supported.
- Payloads must exclude PII and secrets; redaction is required before signing.
- Evidence Locker stores immutable copies with retention and legal hold.
Verification steps
- Verify DSSE signature against trusted roots.
- Confirm subject digest matches expected artifact.
- Verify transparency proof when available.
- Enforce freshness using attestation.max_age_days policy.
- Record verification results in timeline events.
Offline posture
- Bundles include DSSE, transparency proofs, and certificate chains.
- Offline verification uses embedded proofs and cached trust roots.
- Pending transparency entries are replayed when connectivity returns.
Related references
- provenance/inline-provenance.md
- security/forensics-and-evidence-locker.md
- docs/modules/attestor/architecture.md
- docs/modules/signer/architecture.md
- docs/modules/export-center/architecture.md


@@ -0,0 +1,24 @@
# Provenance backfill
Purpose
- Backfill missing provenance records with deterministic ordering.
Inputs
- Attestation inventory (NDJSON) with subject and digest data.
- Subject to Rekor map for resolving transparency entries.
Procedure
1. Validate inventory records (UUID or ULID identifiers and digest formats).
2. Resolve each subject to a Rekor entry; record gaps and skip if missing.
3. Emit backfilled provenance events using a backfill mode that preserves ordering.
4. Log every backfilled subject and Rekor digest pair as NDJSON.
5. Repeat until gaps are zero and record completion in audit logs.
Determinism
- Sort by subject then Rekor entry before processing.
- Use canonical JSON writers and UTC timestamps.
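Ordering sketch
- A minimal sketch of the determinism rules above, assuming inventory records are dicts with subject and rekor_entry keys (hypothetical field names). Sorted keys with compact separators stand in for a canonical JSON writer; this is not a full RFC 8785 implementation.

```python
import json
from typing import Iterable, List


def canonical_json(obj: dict) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def order_for_backfill(records: Iterable[dict]) -> List[dict]:
    """Sort by subject, then Rekor entry, before any processing."""
    return sorted(records, key=lambda r: (r["subject"], r["rekor_entry"]))


def to_ndjson(records: Iterable[dict]) -> str:
    """Emit one canonical JSON document per line, in backfill order."""
    return "".join(canonical_json(r) + "\n" for r in order_for_backfill(records))


# Example: input order does not affect the emitted log.
records = [
    {"subject": "pkg:b", "rekor_entry": "2"},
    {"subject": "pkg:a", "rekor_entry": "1"},
]
print(to_ndjson(records) == to_ndjson(list(reversed(records))))
```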
Related references
- provenance/inline-provenance.md
- provenance/attestation-workflow.md
- docs/provenance/prov-backfill-plan.md


@@ -0,0 +1,34 @@
# Rekor submission policy
Purpose
- Balance transparency log usage with budget limits and offline safety.
Submission tiers
- Tier 1: graph-level attestations per scan (default).
- Tier 2: edge bundle attestations for escalations.
Budgets
- Hourly limits for graph submissions.
- Daily limits for edge bundle submissions.
- Burst windows for Tier 1 only.
Enforcement
- Queue excess submissions with backpressure.
- Retry failed submissions with backoff.
- Store overflow locally for later submission.
Offline behavior
- Queue submissions in attestor.rekor_offline_queue.
- Bundle pending submissions in offline kits.
- Drain queue when connectivity returns.
Monitoring
- attestor_rekor_submissions_total
- attestor_rekor_submission_latency_seconds
- attestor_rekor_queue_depth
- attestor_rekor_budget_remaining
Related references
- provenance/attestation-workflow.md
- security/crypto-and-trust.md
- docs/operations/rekor-policy.md


@@ -0,0 +1,41 @@
# Promotion attestations
Purpose
- Capture promotion-time evidence in a DSSE predicate for offline audit.
Predicate: stella.ops/promotion@v1
- subject: image name and digest.
- materials: SBOM and VEX digests with format and OCI uri.
- promotion: from, to, actor, timestamp, pipeline, ticket, notes.
- rekor: uuid, logIndex, inclusionProof, checkpoint.
- attestation: bundle_sha256 and optional witness.
Producer workflow
1. Resolve and freeze image digest.
2. Hash SBOM and VEX artifacts and publish to OCI if needed.
3. Obtain Rekor inclusion proof and checkpoint.
4. Build promotion predicate JSON.
5. Sign with Signer to produce DSSE bundle.
6. Store bundle in Evidence Locker and Export Center.
Verification flow
- Verify DSSE signature using trusted roots.
- Verify Merkle inclusion using the embedded proof and checkpoint.
- Hash SBOM and VEX artifacts and compare to materials digests.
- Confirm promotion metadata and ticket evidence.
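Materials check sketch
- A minimal sketch of the materials comparison step, assuming each materials entry exposes uri and digest fields with digests written as sha256:<hex>. The field names and the function are hypothetical; fetching artifacts from OCI is out of scope here.

```python
import hashlib
import hmac
from typing import Dict, List


def sha256_digest(data: bytes) -> str:
    """Digest in the assumed 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def mismatched_materials(materials: List[dict],
                         artifacts: Dict[str, bytes]) -> List[str]:
    """Return URIs whose bytes are missing or whose digest differs from the predicate."""
    bad = []
    for material in materials:
        data = artifacts.get(material["uri"])
        if data is None or not hmac.compare_digest(
            sha256_digest(data), material["digest"]
        ):
            bad.append(material["uri"])
    return bad


# Example: one artifact matches its recorded digest, so nothing is reported.
sbom = b'{"bomFormat":"CycloneDX"}'
materials = [{"uri": "oci://example/sbom", "digest": sha256_digest(sbom)}]
print(mismatched_materials(materials, {"oci://example/sbom": sbom}))
```

- An empty result means every listed material was present and matched; any returned URI should fail verification.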
Storage and APIs
- Signer: /api/v1/signer/sign/dsse
- Attestor: /api/v1/rekor/entries
- Export Center: serves promotion bundles for offline kits
- Evidence Locker: long-term retention of DSSE and proofs
Security considerations
- Promotion metadata is tenant scoped.
- Rekor proofs must be embedded for air-gap verification.
- Key rotation follows Signer and Authority policies.
Related references
- release/release-engineering.md
- provenance/attestation-workflow.md
- security/forensics-and-evidence-locker.md


@@ -23,6 +23,7 @@ Artifact signing
- Cosign for containers and bundles
- DSSE envelopes for attestations
- Optional Rekor anchoring when available
- Promotion attestations capture release evidence for offline audit
Offline update kit (OUK)
- Monthly bundle of feeds and tooling
@@ -41,3 +42,5 @@ Related references
- docs/ci/*
- docs/devops/*
- docs/release/* and docs/releases/*
- release/promotion-attestations.md
- release/release-notes.md


@@ -0,0 +1,22 @@
# Release notes and templates
Release notes
- Historical release notes live under docs/releases/.
- Use release notes for time-specific changes; refer to docs2 for current behavior.
Determinism snippet template
- Use a deterministic score summary in release notes when publishing scans.
Template
```
- Determinism score: {{overall_score}} (threshold {{overall_min}})
- {{image_digest}} score {{score}} ({{identical}}/{{runs}} identical)
- Inputs: policy {{policy_sha}}, feeds {{feeds_sha}}, scanner {{scanner_sha}}, platform {{platform}}
- Evidence: determinism.json and artifact hashes (DSSE signed, offline ready)
- Actions: rerun stella detscore run --bundle determinism.json if score < threshold
```
Related references
- release/release-engineering.md
- operations/replay-and-determinism.md
- docs/release/templates/determinism-score.md

36
docs2/risk/api.md Normal file

@@ -0,0 +1,36 @@
# Risk API
Purpose
- Expose risk jobs, profiles, simulations, explainability, and exports.
Endpoints (v1)
- POST /api/v1/risk/jobs: submit scoring job.
- GET /api/v1/risk/jobs/{job_id}: job status and results.
- GET /api/v1/risk/explain/{job_id}: explainability payload.
- GET /api/v1/risk/profiles: list profiles with hashes and versions.
- POST /api/v1/risk/profiles: create or update profiles with DSSE metadata.
- POST /api/v1/risk/simulations: dry-run scoring with fixtures.
- GET /api/v1/risk/export/{job_id}: export bundle for audit.
Auth and tenancy
- Headers: X-Stella-Tenant, Authorization Bearer token.
- Optional X-Stella-Scope for imposed rule reminders.
Error model
- Envelope: code, message, correlation_id, severity, remediation.
- Rate-limit headers: Retry-After, X-RateLimit-Remaining.
- ETag headers for profile and explain responses.
Feature flags
- risk.jobs, risk.explain, risk.simulations, risk.export.
Determinism and offline
- Samples in docs/risk/samples/api/ with SHA256SUMS.
- Stable field ordering and UTC timestamps.
Related references
- risk/overview.md
- risk/profiles.md
- risk/factors.md
- risk/formulas.md
- risk/explainability.md


@@ -0,0 +1,28 @@
# Risk explainability
Purpose
- Provide per-factor contributions with provenance and gating rationale.
Explainability envelope
- job_id, tenant_id, context_id
- profile_id, profile_version, profile_hash
- finding_id, raw_score, normalized_score, severity
- signal_values and signal_contributions
- override_applied, override_reason, gates_triggered
- scored_at and provenance hashes
UI and CLI expectations
- Deterministic ordering by factor type, source, then timestamp.
- Highlight top contributors and gates.
- Export Center bundles include explain payload and manifest hashes.
Determinism and offline
- Fixtures under docs/risk/samples/explain/ with SHA256SUMS.
- No live calls in examples or captures.
Related references
- risk/overview.md
- risk/factors.md
- risk/formulas.md
- risk/profiles.md
- risk/api.md

29
docs2/risk/factors.md Normal file

@@ -0,0 +1,29 @@
# Risk factors
Purpose
- Define factor catalog and normalization rules for risk scoring.
Factor catalog (examples)
- CVSS or exploit likelihood: numeric 0-10 normalized to 0-1.
- KEV flag: boolean boost with provenance.
- Reachability: numeric with entrypoint and path provenance.
- Runtime facts: categorical or numeric with trace references.
- Fix availability: vendor status and mitigation context.
- Asset criticality: tenant or service criticality signals.
- Provenance trust: categorical trust tier with attestation hash.
- Custom overrides: scoped, expiring, and auditable.
Normalization rules
- Validate against profile signal types and transforms.
- Clamp numeric inputs to 0-1 and record original values in provenance.
- Apply TTL or decay deterministically; drop expired signals.
- Precedence: signed over unsigned, runtime over static, newer over older.
Determinism and ordering
- Sort factors by factor type, source, then timestamp.
- Hash fixtures and record SHA256 in docs/risk/samples/factors/.
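Normalization and ordering sketch
- A minimal sketch of clamping with provenance capture and the stable sort described above. The dict keys (factor_type, source, timestamp, value, provenance, original_value) are assumptions for illustration, and timestamps are assumed to be UTC ISO-8601 strings in one fixed format so they sort lexically.

```python
from typing import Iterable, List


def normalize_factor(factor: dict) -> dict:
    """Clamp the value to 0-1 and keep the original in provenance."""
    raw = factor["value"]
    normalized = dict(factor)  # copy so the input record is left untouched
    normalized["value"] = max(0.0, min(1.0, raw))
    normalized["provenance"] = {
        **factor.get("provenance", {}),
        "original_value": raw,
    }
    return normalized


def order_factors(factors: Iterable[dict]) -> List[dict]:
    """Sort by factor type, then source, then timestamp."""
    return sorted(
        factors,
        key=lambda f: (f["factor_type"], f["source"], f["timestamp"]),
    )


# Example: an out-of-range input is clamped and its original value retained.
factor = {
    "factor_type": "cvss",
    "source": "nvd",
    "timestamp": "2025-01-01T00:00:00Z",
    "value": 1.7,
}
print(normalize_factor(factor)["value"])
```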
Related references
- risk/overview.md
- risk/formulas.md
- risk/profiles.md

28
docs2/risk/formulas.md Normal file

@@ -0,0 +1,28 @@
# Risk formulas
Purpose
- Define how normalized factors combine into a risk score and severity.
Formula building blocks
- Weighted sum with per-factor caps and family caps.
- Normalize raw score to 0-1 and apply gates.
- VEX gate: not_affected can short-circuit to 0.0.
- CVSS + KEV boost: clamp01((cvss/10) + kev_bonus).
- Trust gates: fail or down-weight low-trust provenance.
- Decay: apply time-based decay to stale signals.
- Overrides: tenant or asset overrides with expiry and audit.
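Formula sketch
- A minimal sketch of the CVSS + KEV boost and the VEX gate from the list above. The default kev_bonus of 0.2 is an assumed value for illustration; real weights come from the active profile.

```python
from typing import Optional


def clamp01(x: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def cvss_kev_score(cvss: float, kev: bool, kev_bonus: float = 0.2) -> float:
    """clamp01((cvss / 10) + kev_bonus), with the bonus applied only when KEV is set."""
    return clamp01(cvss / 10.0 + (kev_bonus if kev else 0.0))


def apply_vex_gate(score: float, vex_status: Optional[str]) -> float:
    """Short-circuit to 0.0 when VEX says not_affected; otherwise pass through."""
    return 0.0 if vex_status == "not_affected" else score


# Example: CVSS 9.0 with a KEV listing saturates before the VEX gate is applied.
boosted = cvss_kev_score(9.0, kev=True)
print(boosted, apply_vex_gate(boosted, "not_affected"))
```

- Keeping the gate as a separate step makes it possible to record in explainability output which gate fired and what the pre-gate score was.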
Severity mapping
- Map normalized_score to critical, high, medium, low, informational.
- Store band rationale in explainability output.
Determinism
- Stable factor ordering before aggregation.
- Fixed precision (example: 4 decimals) before severity mapping.
- Hash fixtures and record SHA256 in docs/risk/samples/formulas/.
Related references
- risk/overview.md
- risk/factors.md
- risk/profiles.md
- risk/explainability.md

36
docs2/risk/overview.md Normal file

@@ -0,0 +1,36 @@
# Risk overview
Purpose
- Explain risk scoring concepts, lifecycle, and artifacts.
- Preserve deterministic, provenance-backed outputs.
Core concepts
- Signals become evidence after validation and normalization.
- Profiles define weights, thresholds, overrides, and severity mapping.
- Formulas aggregate normalized factors into a 0-1 score.
- Provenance carries source hashes and attestation references.
Lifecycle
1. Submit a risk job with tenant, context, profile, and findings.
2. Ingest evidence from scanners, reachability, VEX, runtime signals, and KEV.
3. Normalize and dedupe by provenance hash.
4. Evaluate profile rules, gates, and overrides.
5. Assign severity band and emit explainability output.
6. Export bundles with profile hash and evidence references.
Artifacts
- Profile schema: id, version, signals, weights, overrides, metadata, provenance.
- Job and result fields: job_id, profile_hash, normalized_score, severity.
- Explainability envelope: signal_values, signal_contributions, gates_triggered.
Determinism and offline posture
- Stable ordering for factors and contributions.
- Fixed precision math with UTC timestamps only.
- Fixtures and hashes live under docs/risk/samples/.
Related references
- risk/factors.md
- risk/formulas.md
- risk/profiles.md
- risk/explainability.md
- risk/api.md

37
docs2/risk/profiles.md Normal file

@@ -0,0 +1,37 @@
# Risk profiles
Purpose
- Define profile schema, lifecycle, and governance for risk scoring.
Schema essentials
- id, version, description, signals[], weights, metadata.
- signals[] fields: name, source, type (numeric, boolean, categorical), path, transform, unit.
- overrides: severity rules and decision rules.
- Optional: extends, rollout flags, valid_from, valid_until.
Severity levels
- critical, high, medium, low, informational.
Lifecycle
1. Author profiles in Policy Studio.
2. Simulate against deterministic fixtures.
3. Review and approve with DSSE signatures.
4. Promote and activate in Policy Engine.
5. Roll back by activating a previous version.
Governance and determinism
- Profiles are immutable after promotion.
- Each version carries a profile_hash and signed manifest entry.
- Simulation and production share the same evaluation codepath.
- Offline bundles include profiles and fixtures with hashes.
Explainability and observability
- Emit per-factor contributions with stable ordering.
- Track evaluation latency, factor coverage, profile hit rate, and override usage.
Related references
- risk/overview.md
- risk/factors.md
- risk/formulas.md
- risk/explainability.md
- risk/api.md


@@ -32,3 +32,6 @@ Related references
- docs/security/crypto-simulation-services.md
- docs/security/crypto-compliance.md
- docs/airgap/staleness-and-time.md
- operations/key-rotation.md
- provenance/rekor-policy.md
- release/promotion-attestations.md


@@ -0,0 +1,30 @@
# Evidence locker publishing
Purpose
- Publish deterministic evidence bundles to the Evidence Locker.
Required inputs
- Evidence locker base URL (no trailing slash).
- Bearer token with write scopes for required prefixes.
- Signing key for final bundle signing (Cosign key or key file).
Publishing flow
- Build deterministic tar bundles for each producer (signals, runtime, evidence packs).
- Verify bundle hashes and inner SHA256 lists before upload.
- Upload bundles to the Evidence Locker using the configured token.
- Re-sign bundles with production keys when required.
Deterministic packaging rules
- tar --sort=name
- fixed mtime (UTC 1970-01-01)
- owner and group set to 0
- numeric-owner enabled
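Packaging sketch
- A minimal sketch of the same rules using Python's standard tarfile module instead of the tar CLI: entries sorted by name, mtime fixed at the epoch, owner and group zeroed with empty names. The fixed 0644 mode and the in-memory input are assumptions added for the example; sorting full paths approximates tar --sort=name.

```python
import hashlib
import io
import tarfile
from typing import Dict


def build_deterministic_tar(files: Dict[str, bytes]) -> bytes:
    """Pack name -> content pairs into a byte-stable, uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name in sorted(files):  # stable entry order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0  # UTC 1970-01-01
            info.uid = info.gid = 0  # owner and group set to 0
            info.uname = info.gname = ""  # numeric owner only
            info.mode = 0o644  # assumed fixed mode
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 of the bundle, for comparison before upload."""
    return hashlib.sha256(data).hexdigest()


# Example: insertion order of the inputs does not change the bundle hash.
first = build_deterministic_tar({"b.txt": b"two", "a.txt": b"one"})
second = build_deterministic_tar({"a.txt": b"one", "b.txt": b"two"})
print(sha256_hex(first) == sha256_hex(second))
```

- The archive is left uncompressed because compressed containers can embed their own timestamps; any compression step would need the same care.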
Offline posture
- Transparency log upload may be disabled in sealed mode.
- Trust derives from local keys and recorded hashes.
- Upload scripts must fail on hash mismatch.
Related references
- security/forensics-and-evidence-locker.md
- provenance/attestation-workflow.md


@@ -28,7 +28,8 @@ Minimum bundle layout
- signatures/ for DSSE or sigstore bundles
Related references
- provenance/attestation-workflow.md
- security/timeline.md
- security/evidence-locker-publishing.md
- docs/forensics/evidence-locker.md
- docs/forensics/provenance-attestation.md
- docs/forensics/timeline.md
- docs/evidence-locker/evidence-pack-schema.md


@@ -0,0 +1,27 @@
# Multi-tenancy
Purpose
- Ensure strict tenant isolation across APIs, storage, and observability.
Tenant lifecycle
- Create tenants with scoped roles and default policies.
- Suspend or retire tenants with audit records.
- Migrations and data retention follow governance policy.
Isolation model
- Tokens carry tenant identifiers and scopes.
- APIs require tenant headers; cross-tenant actions are explicit.
- Datastores enforce tenant_id and RLS where supported.
Observability
- Metrics, logs, and traces always include tenant.
- Cross-tenant access attempts emit audit events.
Offline posture
- Offline bundles are tenant scoped.
- Tenant list in offline mode is limited to snapshot contents.
Related references
- security/identity-tenancy-and-scopes.md
- security/row-level-security.md
- docs/operations/multi-tenancy.md


@@ -40,3 +40,9 @@ Related references
- docs/risk/profiles.md
- docs/risk/api.md
- docs/guides/epss-integration.md
- risk/overview.md
- risk/factors.md
- risk/formulas.md
- risk/profiles.md
- risk/explainability.md
- risk/api.md


@@ -0,0 +1,21 @@
# Row-level security
Purpose
- Enforce tenant isolation at the database level with RLS policies.
Strategy
- Apply RLS to tenant-scoped tables and views.
- Require app.tenant_id session setting on every connection.
- Deny access when tenant context is missing.
Policy evaluation
- Policies filter rows by tenant_id and optional scope.
- Admin bypass uses explicit roles with audited access.
Validation
- Run cross-tenant read and write tests in staging.
- Include RLS checks in deterministic replay suites.
Related references
- data/postgresql-patterns.md
- docs/operations/rls-and-data-isolation.md


@@ -0,0 +1,47 @@
# Timeline forensics
Purpose
- Provide an append-only event ledger for audit, replay, and incident analysis.
- Support deterministic exports for offline review.
Event model
- event_id (ULID)
- tenant
- timestamp (UTC ISO-8601)
- category (scanner, policy, runtime, evidence, notify)
- details (JSON payload)
- trace_id for correlation
Event kinds
- scan.completed
- policy.verdict
- attestation.verified
- evidence.ingested
- notify.sent
- runtime.alert
- redaction_notice (compensating event)
APIs
- GET /api/v1/timeline/events with filters for tenant, category, time window, trace_id.
- GET /api/v1/timeline/events/{id} for a single event.
- GET /api/v1/timeline/export for NDJSON exports.
- Headers: X-Stella-Tenant, optional X-Stella-TraceId, If-None-Match.
Query guidance
- Use category plus trace_id to track the scan-to-policy-to-notify flow.
- Use tenant and timestamp ranges for SLA audits.
- CLI parity: stella timeline list mirrors the API.
Retention and redaction
- Append-only storage; no deletes.
- Redactions use redaction_notice events that reference the superseded event.
- Retention is tenant-configurable and exported weekly to cold storage.
Offline posture
- Offline kits include timeline exports for compliance review.
- Exports include stable ordering and manifest hashes.
Related references
- security/forensics-and-evidence-locker.md
- observability.md
- docs/forensics/timeline.md


@@ -10,15 +10,37 @@ Core states (examples)
- U4: Unknown (no analysis yet)
Tiers and scoring
- Tiers group states by entropy ranges.
- The aggregate tier is the maximum severity present.
- Risk score adds an entropy-based modifier.
- Tiers group states by entropy ranges (T1 high to T4 negligible).
- Aggregate tier is the maximum tier across states.
- Risk score adds tier and entropy modifiers.
Tier ranges (example)
- T1: 0.7 to 1.0, blocks not_affected.
- T2: 0.4 to 0.69, warns on not_affected.
- T3: 0.1 to 0.39, allow with caveat.
- T4: 0.0 to 0.09, no special handling.
Risk score formula (simplified)
- meanEntropy = avg(states[].entropy)
- entropyBoost = clamp(meanEntropy * k, 0..boostCeiling)
- tierModifier = {T1:0.50, T2:0.25, T3:0.10, T4:0.00}[aggregateTier]
- riskScore = clamp(baseScore * (1 + tierModifier + entropyBoost), 0..1)
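Risk score sketch
- A minimal sketch of the simplified formula, using the example tier ranges. The values k = 0.5 and boostCeiling = 0.2 are assumptions for illustration, and the tier lower bounds are treated as inclusive thresholds, which closes the small gaps between the example ranges.

```python
from typing import List

TIER_MODIFIER = {"T1": 0.50, "T2": 0.25, "T3": 0.10, "T4": 0.00}
TIER_ORDER = ["T1", "T2", "T3", "T4"]  # most to least severe


def clamp(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))


def tier_for_entropy(entropy: float) -> str:
    """Map an entropy value to a tier using the example lower bounds."""
    if entropy >= 0.7:
        return "T1"
    if entropy >= 0.4:
        return "T2"
    if entropy >= 0.1:
        return "T3"
    return "T4"


def aggregate_tier(entropies: List[float]) -> str:
    """Most severe tier across all states; T4 when there are no states."""
    tiers = [tier_for_entropy(e) for e in entropies]
    return min(tiers, key=TIER_ORDER.index) if tiers else "T4"


def risk_score(base_score: float, entropies: List[float],
               k: float = 0.5, boost_ceiling: float = 0.2) -> float:
    """riskScore = clamp(baseScore * (1 + tierModifier + entropyBoost), 0..1)."""
    mean_entropy = sum(entropies) / len(entropies) if entropies else 0.0
    entropy_boost = clamp(mean_entropy * k, 0.0, boost_ceiling)
    tier_modifier = TIER_MODIFIER[aggregate_tier(entropies)]
    return clamp(base_score * (1 + tier_modifier + entropy_boost), 0.0, 1.0)


# Example: base 0.4 with entropies 0.8 and 0.4 gives tier T1 and a capped boost,
# so the score is roughly 0.4 * (1 + 0.50 + 0.20).
print(round(risk_score(0.4, [0.8, 0.4]), 4))
```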
Policy guidance
- High uncertainty blocks not_affected claims.
- Lower tiers allow decisions with caveats.
- Remediation hints are attached to findings.
Remediation examples
- U1: upload symbols or resolve unknowns registry.
- U2: generate lockfile and resolve package coordinates.
- U3: cross-reference trusted advisories.
- U4: run initial analysis to remove unknown state.
Payload fields
- states[] include code, name, entropy, tier, timestamp, evidence.
- aggregateTier and riskScore recorded with computedAt timestamp.
Determinism rules
- Stable ordering of uncertainty states.
- UTC timestamps and fixed precision for entropy values.


@@ -17,3 +17,6 @@
- Interop checks against external tooling formats.
- Offline E2E runs as a release gate.
- Policy and schema validation in CI.
Related references
- testing/router-chaos.md


@@ -0,0 +1,34 @@
# Router chaos testing
Purpose
- Validate backpressure, recovery, and cache failure behavior for the router.
Test categories
- Load testing with spike scenarios (baseline, 10x, 50x, recovery).
- Backpressure verification for 429 and 503 with Retry-After.
- Recovery tests to ensure queues drain within the 30-second recovery target.
- Valkey failure injection with graceful fallback.
Expected behavior
- Normal load returns 200 OK.
- High load returns 429 with Retry-After.
- Critical load returns 503 with Retry-After.
- Recovery within 30 seconds, zero data loss.
Metrics
- http_requests_total{status}
- router_request_queue_depth
- request_recovery_seconds
Alert cues
- Throttle rate above 10% for 5 minutes.
- P95 recovery time above 30 seconds.
- Missing Retry-After headers.
CI integration
- Runs on PRs touching router code and in nightly staging runs.
- Stores results as artifacts for audits.
Related references
- operations/router-rate-limiting.md
- docs/operations/router-chaos-testing-runbook.md


@@ -18,6 +18,10 @@ Architecture and system model
docs/modules/platform/architecture-overview.md, docs/modules/*/architecture.md
- Docs2: architecture/overview.md, architecture/workflows.md, modules/index.md
Advisory alignment
- Sources: docs/architecture/advisory-alignment-report.md
- Docs2: architecture/advisory-alignment.md
Component map
- Sources: docs/technical/architecture/component-map.md
- Docs2: architecture/component-map.md
@@ -77,7 +81,7 @@ Advisory AI
Orchestrator detail
- Sources: docs/orchestrator/*
- Docs2: orchestrator/overview.md, orchestrator/architecture.md, orchestrator/api.md,
orchestrator/cli.md, orchestrator/console.md
orchestrator/cli.md, orchestrator/console.md, orchestrator/runbook.md
Orchestrator run ledger
- Sources: docs/orchestrator/run-ledger.md
@@ -118,7 +122,10 @@ Replay and determinism
Runbooks and incident response
- Sources: docs/runbooks/*, docs/operations/*
- Docs2: operations/runbooks.md
- Docs2: operations/runbooks.md, operations/key-rotation.md,
operations/proof-verification.md, operations/score-proofs.md,
operations/reachability.md, operations/trust-lattice.md,
operations/unknowns-queue.md
Notifications
- Sources: docs/notifications/*, docs/modules/notify/*
@@ -129,7 +136,8 @@ Notifications details
docs/notifications/channels.md, docs/notifications/templates.md,
docs/notifications/digests.md, docs/notifications/pack-approvals-integration.md
- Docs2: notifications/overview.md, notifications/rules.md, notifications/channels.md,
notifications/templates.md, notifications/digests.md, notifications/pack-approvals.md
notifications/templates.md, notifications/digests.md, notifications/pack-approvals.md,
notifications/runbook.md
Router rate limiting
- Sources: docs/router/*
@@ -138,7 +146,8 @@ Router rate limiting
Release engineering and CI/DevOps
- Sources: docs/13_RELEASE_ENGINEERING_PLAYBOOK.md, docs/ci/*, docs/devops/*,
docs/release/*, docs/releases/*
- Docs2: release/release-engineering.md
- Docs2: release/release-engineering.md, release/promotion-attestations.md,
release/release-notes.md
API and contracts
- Sources: docs/09_API_CLI_REFERENCE.md, docs/api/*, docs/schemas/*,
@@ -177,7 +186,8 @@ Regulator threat and evidence model
Identity, tenancy, and scopes
- Sources: docs/security/authority-scopes.md, docs/security/scopes-and-roles.md,
docs/architecture/console-admin-rbac.md
- Docs2: security/identity-tenancy-and-scopes.md
- Docs2: security/identity-tenancy-and-scopes.md, security/multi-tenancy.md,
security/row-level-security.md
Console admin RBAC
- Sources: docs/architecture/console-admin-rbac.md
@@ -213,20 +223,26 @@ Quota and licensing
Risk model and scoring
- Sources: docs/risk/*, docs/contracts/risk-scoring.md
- Docs2: security/risk-model.md
- Docs2: security/risk-model.md, risk/overview.md, risk/factors.md, risk/formulas.md,
risk/profiles.md, risk/explainability.md, risk/api.md
Forensics and evidence locker
- Sources: docs/forensics/*, docs/evidence-locker/*
- Docs2: security/forensics-and-evidence-locker.md
- Sources: docs/forensics/*, docs/evidence-locker/*, docs/ops/evidence-locker-handoff.md
- Docs2: security/forensics-and-evidence-locker.md, security/evidence-locker-publishing.md
Timeline forensics
- Sources: docs/forensics/timeline.md
- Docs2: security/timeline.md
Provenance and transparency
- Sources: docs/provenance/*, docs/security/trust-and-signing.md,
docs/modules/attestor/*, docs/modules/signer/*
- Docs2: provenance/inline-provenance.md
- Docs2: provenance/inline-provenance.md, provenance/attestation-workflow.md,
provenance/rekor-policy.md, provenance/backfill.md
Database and persistence
- Sources: docs/db/*, docs/adr/0001-postgresql-for-control-plane.md
- Docs2: data/persistence.md
- Docs2: data/persistence.md, data/postgresql-operations.md, data/postgresql-patterns.md
Events and messaging
- Sources: docs/events/*, docs/samples/*
@@ -334,19 +350,22 @@ Vuln Explorer overview
Testing and quality
- Sources: docs/19_TEST_SUITE_OVERVIEW.md, docs/testing/*
- Docs2: testing-and-quality.md
- Docs2: testing-and-quality.md, testing/router-chaos.md
Observability and telemetry
- Sources: docs/metrics/*, docs/observability/*, docs/modules/telemetry/*,
docs/technical/observability/*
- Docs2: observability.md
- Docs2: observability.md, observability-standards.md, observability-logging.md,
observability-tracing.md, observability-metrics-slos.md, observability-telemetry-controls.md,
observability-aoc.md, observability-aggregation.md, observability-policy.md,
observability-ui-telemetry.md, observability-vuln-telemetry.md
Benchmarks and performance
- Sources: docs/benchmarks/*, docs/12_PERFORMANCE_WORKBOOK.md
- Docs2: benchmarks.md
Guides and workflows
- Sources: docs/guides/*, docs/ci/sarif-integration.md
- Sources: docs/guides/*, docs/ci/sarif-integration.md, docs/architecture/epss-versioning-clarification.md
- Docs2: guides/compare-workflow.md, guides/epss-integration.md
Examples and fixtures