git.stella-ops.org/docs/implplan/EPIC_14.md
root 68da90a11a
Restructure solution layout by module
2025-10-28 15:10:40 +02:00


Fine. Identity and tenancy: the part everyone underestimates until they trip over it in prod. Here's the clean, doc-ready version.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


Epic 14: Authority-Backed Scopes & Tenancy

Short name: Authority-Backed Scopes & Tenancy
Primary components: Authority (authN/Z), Web Services API, Policy Engine, Orchestrator, Task Runner, Console, CLI
Surfaces: /auth/*, request middleware, DB schema (RLS), object storage layout, message bus topics, audit logs, CLI login/impersonation flows
Touches: Conseiller (Feedser), Excitator (Vexer), Findings Ledger, Export Center, Notifications Studio, Advisory AI Assistant

AOC ground rule reminder: Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Enforcement of aggregation-only behavior is tenant-agnostic and must hold across all scopes.


1) What it is

A uniform model for identity, authorization, and isolation that is enforced end-to-end:

  • Authority-backed tokens: JWT/OIDC tokens issued by a configured Authority. Tokens carry scopes, roles, and tenant memberships as claims. Services verify and authorize locally; no out-of-band ACL calls during the hot path.
  • Tenancy: First-class multi-tenant isolation with optional projects within a tenant. Strong separation at the database layer via row-level security (RLS) and in object storage via tenant-prefixed paths (and optionally per-tenant KMS keys).
  • Scopes & roles: Minimal set of composable scopes (stella:{resource}:{verb}) that map to roles (viewer, editor, operator, admin, owner) and can be constrained to {tenant}/{project}.
  • Context propagation: Every API request, job, message, and artifact is stamped with {tenant_id, project_id, actor} and validated at ingress and again at persistence.
  • Service accounts & delegation: Robot identities with scoped, expiring credentials for CI, Task Packs, and webhooks. Human-to-robot delegation is explicit and auditable.
  • Audit: Immutable decision logs for authN/Z events with resource, scope, and policy evaluation outcomes.

Tenancy model:

Organization (optional, for billing) 
└── Tenant (isolation boundary)
    ├── Projects (isolation + grouping)
    │   ├── Sources (registries, repos)
    │   ├── Jobs & Runs
    │   ├── SBOMs & Artifacts
    │   ├── Findings / Evaluations
    │   └── Policies (bound or inherited)
    └── Shared tenant services (notifications, exports, secrets)

Knowledge planes:

  • Global knowledge plane: Advisories, CVE metadata, CPE, KEV, etc. No tenant data.
  • Tenant plane: SBOMs, VEX attachments, policy results, exposures, notifications, exports, audits.

Conseiller/Excitator live across both planes: they collect into the global plane and link to tenant plane without merging sources.


2) Why (brief)

Security that depends on “being careful” is not security. We need hard boundaries the platform cannot cross by accident:

  • Run many teams/customers safely on one deployment.
  • Minimize blast radius for credentials and mistakes.
  • Make CI and automation safe with least-privileged scopes.
  • Keep latency low by verifying locally with signed tokens.

3) How it should work (maximum detail)

3.1 Tokens, claims, and scopes

Token type: JWT, signed by the Authority. Services cache JWKS and verify locally.

Required claims:

  • iss, sub, aud, iat, exp
  • scope: space-separated scopes (stella:sbom:read, stella:job:run)
  • tenants: array of tenant IDs the subject may access
  • tenant (active): the currently selected tenant for the request
  • roles: object map { "<tenant>": ["viewer", "editor", ...] }
  • projects (optional): array of project IDs or { "<tenant>": ["projA", "projB"] }
  • mfa (optional): boolean or level for step-up enforcement
  • act (optional): actor chain for delegation/impersonation

Scope grammar:

stella:{resource}:{verb}[#{constraint}]
  resource ∈ {tenant, project, source, job, sbom, vex, advisory, policy, finding, export, notify, secret, pack, ledger, console}
  verb ∈ {read, list, write, delete, run, execute, approve, admin}
  constraint := tenant/{tenantId}[/project/{projectId}]

Examples:

  • stella:sbom:read#tenant/t-123/project/p-abc
  • stella:job:run#tenant/t-123
  • stella:tenant:admin#tenant/t-123 (Tenant Owner)
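
The grammar above can be sketched as a small parser and matcher. This is a minimal illustration, not the Authority's actual implementation: the names Scope, parse_scope, and grants are hypothetical, and it assumes that a granted scope with no constraint acts as a wildcard over tenants and projects.

```python
from dataclasses import dataclass
from typing import Optional

RESOURCES = {"tenant", "project", "source", "job", "sbom", "vex", "advisory",
             "policy", "finding", "export", "notify", "secret", "pack",
             "ledger", "console"}
VERBS = {"read", "list", "write", "delete", "run", "execute", "approve", "admin"}


@dataclass(frozen=True)
class Scope:
    resource: str
    verb: str
    tenant: Optional[str] = None   # None: no tenant constraint
    project: Optional[str] = None  # None: no project constraint


def parse_scope(text: str) -> Scope:
    """Parse stella:{resource}:{verb}[#tenant/{id}[/project/{id}]]."""
    head, sep, constraint = text.partition("#")
    parts = head.split(":")
    if len(parts) != 3 or parts[0] != "stella":
        raise ValueError(f"malformed scope: {text!r}")
    resource, verb = parts[1], parts[2]
    if resource not in RESOURCES or verb not in VERBS:
        raise ValueError(f"unknown resource or verb: {text!r}")
    if not sep:
        return Scope(resource, verb)
    seg = constraint.split("/")
    if len(seg) == 2 and seg[0] == "tenant" and seg[1]:
        return Scope(resource, verb, tenant=seg[1])
    if (len(seg) == 4 and seg[0] == "tenant" and seg[2] == "project"
            and seg[1] and seg[3]):
        return Scope(resource, verb, tenant=seg[1], project=seg[3])
    raise ValueError(f"malformed constraint: {text!r}")


def grants(granted: Scope, required: Scope) -> bool:
    """True if `granted` covers `required` (assumed wildcard semantics)."""
    if (granted.resource, granted.verb) != (required.resource, required.verb):
        return False
    if granted.tenant is not None and granted.tenant != required.tenant:
        return False
    if granted.project is not None and granted.project != required.project:
        return False
    return True
```

Under these assumptions, stella:job:run#tenant/t-123 covers a job run in any project of t-123 but nothing in another tenant. A verb outside the grammar's verb set is rejected at parse time.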

Role mapping (default):

  • viewer → read/list on most resources
  • editor → viewer + write on sbom/policy
  • operator → editor + job:run, export:run, notify:manage
  • admin → operator + user/role management inside tenant
  • owner → admin + billing/tenant lifecycle

3.2 Selecting the active tenant

  • API: X-Stella-Tenant: <tenant_id> header or ?tenant=<id> query. If omitted and the token has exactly one tenant, that tenant is active; else 400.
  • Console: Tenant switcher in the top bar. Console includes header on all calls.
  • CLI: stella login --tenant <id> sets the default; override per command --tenant.

All services must reject requests where the active tenant is not in tenants[], or where the token's scopes do not cover that tenant.
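
The selection rules above could look roughly like this. It is a sketch under stated assumptions: TenantError and resolve_active_tenant are hypothetical names, header names are assumed to be already normalized by the web framework, a header/query mismatch is assumed to be a 400, and a non-member tenant is assumed to be a 403.

```python
from typing import Mapping


class TenantError(Exception):
    """Carries the HTTP status the caller should return."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status


def resolve_active_tenant(claims: Mapping, headers: Mapping, query: Mapping) -> str:
    from_header = headers.get("X-Stella-Tenant")
    from_query = query.get("tenant")
    # Assumption: conflicting selectors are rejected rather than silently ranked.
    if from_header and from_query and from_header != from_query:
        raise TenantError(400, "tenant header and query parameter disagree")
    requested = from_header or from_query
    allowed = list(claims.get("tenants", []))
    if not requested:
        # Implicit selection is only allowed for single-tenant tokens.
        if len(allowed) == 1:
            return allowed[0]
        raise TenantError(400, "active tenant must be specified")
    if requested not in allowed:
        raise TenantError(403, f"subject is not a member of tenant {requested}")
    return requested
```

Rejecting a header/query mismatch outright is one way to limit the tenant confusion bugs this epic is concerned with; a deployment might instead define a strict precedence.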

3.3 Request pipeline

  1. Authentication middleware: verify JWT signature and expiry.
  2. Tenant activation: pick active tenant per header; set per-request context {tenant_id, actor}.
  3. Scope check: compare required scope for the route with token scopes. If route accepts project limiters, check constraints align.
  4. Policy overlay (optional): ABAC evaluation for fine controls (e.g., “deny job:run outside business hours”).
  5. Persistence guard: set DB session GUC stella.tenant_id and verify any writes include matching tenant_id. Enforce Postgres RLS.
  6. Audit: write decision to audit bus (async) with permit|deny, reasons, and matched rule.
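
The ordering of these steps, with a deny short-circuiting the chain and every outcome reaching the audit sink, can be illustrated as follows. This is only a skeleton: run_pipeline and the Step signature are hypothetical, each step stands in for one of items 1 to 5, and the audit callable stands in for the async audit bus.

```python
from typing import Callable, List, Optional, Tuple

# A step inspects the request context and returns a denial reason,
# or None to let the request continue to the next step.
Step = Callable[[dict], Optional[str]]


def run_pipeline(ctx: dict,
                 steps: List[Tuple[str, Step]],
                 audit: Callable[[dict], None]) -> bool:
    """Run steps in order; stop at the first denial; always audit."""
    for name, step in steps:
        reason = step(ctx)
        if reason is not None:
            audit({"effect": "deny", "step": name, "reason": reason})
            return False
    audit({"effect": "permit", "step": None, "reason": None})
    return True
```

A real emitter would add the fields listed in §3.10 and hand the record to a non-blocking producer instead of calling a function inline.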

3.4 Database isolation

Approach: shared schema with Row Level Security. Every tenant-scoped table includes tenant_id and optional project_id.

RLS policy template (Postgres):

ALTER TABLE sboms ENABLE ROW LEVEL SECURITY;

CREATE POLICY sboms_isolate ON sboms
USING (tenant_id = current_setting('stella.tenant_id', true));

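-- INSERT/UPDATE guard. The policy above (FOR ALL by default) already reuses
-- its USING clause as the write check; this policy states that check explicitly.
CREATE POLICY sboms_write_guard ON sboms
AS PERMISSIVE FOR ALL
TO PUBLIC
WITH CHECK (tenant_id = current_setting('stella.tenant_id', true));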

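Set stella.tenant_id at connection checkout, inside the transaction that will run the queries. The third argument (is_local = true) scopes the value to the current transaction, so it does not leak to the next user of a pooled connection:

SELECT set_config('stella.tenant_id', $1, true); -- $1 = active tenant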

Migrations:

  • Add tenant_id to all tenant-scoped tables.
  • Backfill existing rows with the default tenant in Quickstart.
  • Enable RLS and policies in a reversible migration.
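
Because the same policy template has to land on every tenant-scoped table, a migration helper might generate the statements instead of hand-writing them. A minimal sketch, assuming a hypothetical rls_statements helper and lowercase snake_case table names:

```python
import re
from typing import List

# Table names are interpolated into DDL, so accept plain identifiers only.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def rls_statements(table: str, guc: str = "stella.tenant_id") -> List[str]:
    """Return the RLS template from this section, instantiated for one table."""
    if not _IDENT.match(table):
        raise ValueError(f"refusing unsafe table name: {table!r}")
    check = f"tenant_id = current_setting('{guc}', true)"
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY {table}_isolate ON {table} USING ({check});",
        f"CREATE POLICY {table}_write_guard ON {table} "
        f"AS PERMISSIVE FOR ALL TO PUBLIC WITH CHECK ({check});",
    ]
```

Looping this over the table list in §4.2 keeps policy names predictable, which also makes the matching down-migration (dropping the policies and disabling RLS) easy to derive.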

3.5 Object storage and artifacts

  • Layout: s3://<bucket>/tenants/<tenant_id>/projects/<project_id>/<resource>/<uuid>...
  • KMS keys: optional per-tenant key alias. Map via kms_alias = "stella-<tenant_id>".
  • Ensure Task Runner and Export Center only operate within the prefixed path of the active tenant.
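
One way to keep workers inside the active tenant's prefix is to build keys through a single helper and to normalize any externally supplied key before use. The sketch below is illustrative: object_key, kms_alias, and ensure_within_tenant are hypothetical names, identifiers are assumed to be limited to letters, digits, underscores, and hyphens, and the object ID is treated as the final path segment.

```python
import posixpath
import re

_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]*$")


def _checked(value: str) -> str:
    if not _ID.match(value):
        raise ValueError(f"invalid path segment: {value!r}")
    return value


def object_key(tenant_id: str, project_id: str, resource: str, object_id: str) -> str:
    """Build tenants/<tenant_id>/projects/<project_id>/<resource>/<object_id>."""
    parts = [_checked(p) for p in (tenant_id, project_id, resource, object_id)]
    return "tenants/{}/projects/{}/{}/{}".format(*parts)


def kms_alias(tenant_id: str) -> str:
    return f"stella-{_checked(tenant_id)}"


def ensure_within_tenant(key: str, tenant_id: str) -> str:
    """Normalize `key` and refuse anything outside the tenant's prefix."""
    normalized = posixpath.normpath(key)  # collapses "..", "." and "//"
    if not normalized.startswith(f"tenants/{_checked(tenant_id)}/"):
        raise PermissionError(f"key escapes tenant prefix: {key!r}")
    return normalized
```

Normalizing before the prefix check matters: a key such as tenants/t-1/../t-2/x starts with the right characters but resolves into another tenant's space.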

3.6 Message bus topics

  • Topic naming: stella.<tenant_id>.<domain>.<event> for tenant-scoped events.
  • Global knowledge events remain stella.global.kb.*.
  • Subscriptions always include a tenant filter unless consuming global knowledge.
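
The naming rule and the subscription filter can be sketched as two small functions. The names tenant_topic and may_consume are hypothetical, and the sketch assumes tenant IDs contain no dots and that "global" is reserved and never used as a tenant ID.

```python
def tenant_topic(tenant_id: str, domain: str, event: str) -> str:
    """Build stella.<tenant_id>.<domain>.<event>."""
    for part in (tenant_id, domain, event):
        if not part or "." in part:
            raise ValueError(f"invalid topic segment: {part!r}")
    if tenant_id == "global":
        raise ValueError("'global' is reserved for the knowledge plane")
    return f"stella.{tenant_id}.{domain}.{event}"


def may_consume(topic: str, active_tenant: str) -> bool:
    """Allow global knowledge topics and topics of the active tenant only."""
    parts = topic.split(".")
    if len(parts) < 4 or parts[0] != "stella":
        return False
    if parts[1] == "global":
        return parts[2] == "kb"
    return parts[1] == active_tenant
```

A consumer running under tenant t-1 would pass stella.t-1.audit.decisions and anything under stella.global.kb, and drop everything else.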

3.7 Background workers

  • Orchestrator & Task Runner: each job carries {tenant_id, project_id}. Workers set stella.tenant_id before any DB or object store access. Reject jobs that miss the context.
  • Conseiller/Excitator: ingest to global plane; linking jobs (matching advisories to tenant SBOMs) run per tenant and respect RLS.
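
The "reject jobs that miss the context" rule is easiest to enforce at the very top of a worker, before any DB or object store client is touched. A minimal sketch, with JobContext, MissingTenantContext, and require_context as hypothetical names and the job spec assumed to be a plain dict:

```python
from dataclasses import dataclass


class MissingTenantContext(Exception):
    """Raised when a job arrives without its tenancy stamp."""


@dataclass(frozen=True)
class JobContext:
    tenant_id: str
    project_id: str


def require_context(job: dict) -> JobContext:
    """Extract {tenant_id, project_id} or refuse the job outright."""
    tenant_id = job.get("tenant_id")
    project_id = job.get("project_id")
    if not tenant_id or not project_id:
        raise MissingTenantContext(
            f"job {job.get('id', '<unknown>')} lacks tenant_id/project_id")
    return JobContext(tenant_id, project_id)
```

The returned context is what a worker would then use to set stella.tenant_id and to derive object store prefixes.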

3.8 ABAC policy overlays

Integrate Policy Engine with condition keys:

  • tenant, project, resource.type, resource.id, actor.role, actor.mfa, time, ip.

Examples:

  • Deny job:run from IPs outside CIDR.
  • Require mfa=true to approve notification templates.
  • Quotas: “max exports per hour per tenant.”
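
To make the first two examples concrete, here is a sketch of overlay rules evaluated as plain functions over the condition keys. This is not the Policy Engine's rule language. The helper names are hypothetical, the "action" key and the notify:approve action string are assumptions added for illustration, and the first matching deny wins.

```python
import ipaddress
from typing import Callable, Iterable, Optional

# A rule returns a denial reason, or None if it has no objection.
Rule = Callable[[dict], Optional[str]]


def deny_job_run_outside(cidr: str) -> Rule:
    network = ipaddress.ip_network(cidr)

    def rule(ctx: dict) -> Optional[str]:
        if ctx["action"] == "job:run" and ipaddress.ip_address(ctx["ip"]) not in network:
            return f"job:run is only allowed from {cidr}"
        return None

    return rule


def require_mfa_for(action: str) -> Rule:
    def rule(ctx: dict) -> Optional[str]:
        if ctx["action"] == action and not ctx.get("actor.mfa", False):
            return f"{action} requires MFA"
        return None

    return rule


def evaluate(ctx: dict, rules: Iterable[Rule]) -> Optional[str]:
    """Return the first denial reason, or None to permit."""
    for rule in rules:
        reason = rule(ctx)
        if reason is not None:
            return reason
    return None
```

The quota example would need shared state (a per-tenant counter over a time window) and is left out of this sketch.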

3.9 Service accounts & delegation

  • Robot principals: sa:{tenant}:{name} with scopes constrained to tenant/project. Default TTL 1h; max TTL policy-controlled.
  • Token minting: Tenant admins can generate tokens via API/Console; all tokens auditable; optionally bound to CIDR or workload identity.
  • Delegation: stella token delegate --to sa:... --scopes ... --ttl 15m produces a token with act chain, recorded in audit log.
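
The claim sets behind robot tokens and delegation might be assembled as below (signing is out of scope here). Several points are assumptions rather than settled design: the function names are hypothetical; the act claim follows the nesting convention of RFC 8693, where the token keeps the delegator as sub and the outermost act names the current actor; delegated scopes must be a subset of the delegator's; and a delegated token never outlives its parent.

```python
from typing import Iterable

DEFAULT_TTL_SECONDS = 3600  # "Default TTL 1h"


def service_account_claims(tenant: str, name: str, scopes: Iterable[str], now: int,
                           ttl: int = DEFAULT_TTL_SECONDS,
                           max_ttl: int = DEFAULT_TTL_SECONDS) -> dict:
    """Claims for a robot principal sa:{tenant}:{name}; ttl is clamped to max_ttl."""
    ttl = min(ttl, max_ttl)
    return {
        "sub": f"sa:{tenant}:{name}",
        "tenants": [tenant],
        "tenant": tenant,
        "scope": " ".join(scopes),
        "iat": now,
        "exp": now + ttl,
    }


def delegated_claims(delegator: dict, to_sub: str, scopes: Iterable[str],
                     now: int, ttl: int) -> dict:
    """Claims for a token that lets `to_sub` act on behalf of the delegator."""
    requested = list(scopes)
    held = set(delegator["scope"].split())
    extra = [s for s in requested if s not in held]
    if extra:
        raise PermissionError(f"cannot delegate scopes not held: {extra}")
    actor = {"sub": to_sub}
    if "act" in delegator:
        actor["act"] = delegator["act"]  # keep earlier actors nested inside
    return {
        "sub": delegator["sub"],
        "tenants": [delegator["tenant"]],
        "tenant": delegator["tenant"],
        "scope": " ".join(requested),
        "iat": now,
        "exp": min(now + ttl, delegator["exp"]),
        "act": actor,
    }
```

With these rules, a 15-minute delegation from a user to sa:t-123:ci yields a token whose act is {"sub": "sa:t-123:ci"}, which is the chain the audit log would record.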

3.10 Auditing

Every decision logs:

  • ts, tenant, actor, route, resource, action, effect, reason, scopes_used, policy_rule_id
  • Persist in tenant-scoped audit table and stream to stella.<tenant>.audit.decisions.
  • Expose search/filter in Console → Admin → Audit.
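
A decision record with exactly the fields listed above could be modeled as follows. The Decision class and to_message helper are hypothetical, and JSON is an assumed wire format; the topic name follows the pattern given in this section.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Decision:
    ts: str                # timestamp, e.g. ISO 8601
    tenant: str
    actor: str
    route: str
    resource: str
    action: str
    effect: str            # "permit" or "deny"
    reason: str
    scopes_used: List[str]
    policy_rule_id: Optional[str] = None

    def __post_init__(self) -> None:
        if self.effect not in ("permit", "deny"):
            raise ValueError(f"effect must be permit or deny, got {self.effect!r}")


def to_message(decision: Decision) -> Tuple[str, str]:
    """Return (topic, JSON payload) for the audit stream."""
    topic = f"stella.{decision.tenant}.audit.decisions"
    return topic, json.dumps(asdict(decision), sort_keys=True)
```

The same record can be written to the tenant-scoped audit table, so the stream and the table never disagree about field names.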

3.11 CLI and Console UX

  • CLI: stella login, stella whoami, stella tenants list, --tenant flag everywhere. Clear error if token lacks tenant or scope.
  • Console: Tenant switcher, role badges, “why denied?” modal showing scope and policy reasons without leaking internals.
  • Impersonation (admin only): sudo as <user> for debugging with visible banner; issues delegated token with act chain.

3.12 Compatibility modes

  • Quickstart single-tenant: a hidden tenant named local. Header optional. RLS enabled with a constant tenant ID.
  • Multi-tenant: full model active; migrations buttoned up; Console exposes tenant admin.

4) Architecture

4.1 New/updated modules

  • auth/authority: JWKS fetching, token validation, scope parser, cache.
  • auth/middleware: HTTP/gRPC interceptors for authN/Z, tenant activation, audit emit.
  • auth/roles: role → scope mapping + tenant/project constraints.
  • auth/policy-bridge: optional ABAC evaluation using Policy Engine.
  • storage/tenantctx: helpers to set stella.tenant_id in DB session and object-store prefixes.
  • audit/decisions: structured logging and bus producer.
  • cli/auth: login, token store, tenant switcher, whoami.

4.2 Data model changes

  • Add tenant_id (and project_id where appropriate) to: sources, jobs, runs, sboms, components, findings, policies, exports, notifications, secrets, packs, ledger, audits.

  • Create tables:

    • tenants(id, name, status, created_at, owner_user_id)
    • projects(id, tenant_id, name, meta, created_at)
    • memberships(user_id, tenant_id, roles[])
    • service_accounts(id, tenant_id, name, scopes[], created_at, disabled)
    • audit_decisions(...) (tenant-scoped)

5) APIs and contracts

5.1 Standard headers

  • Authorization: Bearer <jwt>
  • X-Stella-Tenant: <tenant_id>
  • X-Request-ID (propagated for audit correlation)

5.2 Auth endpoints

  • POST /auth/login (OIDC code flow start for Console)
  • GET /auth/jwks.json (proxy/cached from Authority if needed)
  • GET /auth/whoami → { sub, tenants[], activeTenant, roles, scopes, mfa }
  • POST /auth/tokens/service (tenant admin) → mint robot token with constrained scopes/ttl
  • POST /auth/tokens/delegate → mint delegated token with act chain

5.3 Tenant admin endpoints

  • POST /tenants (owner only)
  • GET /tenants, GET /tenants/:id
  • POST /tenants/:id/projects, GET /tenants/:id/projects
  • POST /tenants/:id/members (assign role), DELETE /tenants/:id/members/:user
  • GET /tenants/:id/audit (search)

5.4 Route protection conventions

Each route declares:

  • resource, verb
  • Whether it requires project constraint
  • Optional policy gates (e.g., require_mfa)

Example (OpenAPI extension):

x-stella-auth:
  resource: sbom
  verb: read
  requireTenant: true
  allowProjectScoped: true
  requireMFA: false
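
Middleware could turn such an annotation into the exact scope string it needs, and reuse that string in the 403 body. This is a sketch: required_scope and denial_payload are hypothetical names, the annotation is assumed to be already loaded into a dict, and the payload shape is an assumption (the epic only requires a readable reason and the missing scope string).

```python
from typing import Optional


def required_scope(auth: dict, tenant: str, project: Optional[str] = None) -> str:
    """Derive the scope a route needs from its x-stella-auth annotation."""
    scope = f"stella:{auth['resource']}:{auth['verb']}"
    if auth.get("requireTenant"):
        scope += f"#tenant/{tenant}"
        if auth.get("allowProjectScoped") and project:
            scope += f"/project/{project}"
    return scope


def denial_payload(missing_scope: str) -> dict:
    """Body for a 403 that explains the denial to machines and humans."""
    return {
        "error": "forbidden",
        "message": f"missing required scope {missing_scope}",
        "missing_scope": missing_scope,
    }
```

For the annotation above with tenant t-123 and project p-abc, the derived scope is stella:sbom:read#tenant/t-123/project/p-abc.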

6) Documentation changes

Create/update:

  1. /docs/security/tenancy-overview.md Concepts, knowledge planes, tenant/project model, isolation layers.

  2. /docs/security/scopes-and-roles.md Scope grammar, default roles, examples, custom role mapping.

  3. /docs/security/authority-config.md How to connect to an OIDC provider, JWKS caching, audience, issuers, time skew, MFA signal.

  4. /docs/operations/multi-tenancy.md Running multi-tenant deployments: quotas, KMS per tenant, object store layout, message topics, backup/restore per tenant.

  5. /docs/operations/rls-and-data-isolation.md Postgres RLS policy reference, migrations, troubleshooting leaks.

  6. /docs/console/admin-tenants.md Tenant switcher, managing members, roles, audit viewer.

  7. /docs/cli/authentication.md login, whoami, tenants list, --tenant, service tokens, delegation.

  8. /docs/api/authentication.md Headers, error codes, sample requests, OpenAPI x-stella-auth annotations.

  9. /docs/policy/examples/abac-overlays.md Optional policy snippets: MFA requirements, time windows, IP restrictions, quotas.

  10. /docs/install/configuration-reference.md New STELLA_AUTH_*, STELLA_TENANCY_*, and per-service flags.

Add at the top of each page:

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.


7) Implementation plan

Middleware & libraries

  • Implement auth/middleware with:

    • JWKS cache and signature verification (kid pinning, 10 min refresh).
    • Scope parser and matcher with constraint evaluation.
    • Tenant activator reading header/query and verifying membership.
    • Policy hook for ABAC (feature flag).
    • Decision audit emitter (non-blocking).
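
The JWKS cache item could be shaped roughly as follows. It is a sketch, not the planned auth/authority module: JwksCache is a hypothetical name, the fetch callable stands in for an HTTP GET of the Authority's JWKS document, and serving previously cached keys when a refresh fails is an assumed way to tolerate short outages.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches signing keys by kid and refreshes them on a fixed interval."""

    def __init__(self, fetch: Callable[[], dict],
                 refresh_seconds: float = 600.0,  # "10 min refresh"
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._keys: Optional[Dict[str, dict]] = None
        self._fetched_at = 0.0

    def _reload(self) -> None:
        jwks = self._fetch()
        self._keys = {key["kid"]: key for key in jwks["keys"]}
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> dict:
        stale = (self._keys is None
                 or self._clock() - self._fetched_at >= self._refresh_seconds)
        if stale or kid not in self._keys:
            try:
                self._reload()
            except Exception:
                # Keep serving cached keys through a short outage;
                # with nothing cached there is no safe fallback.
                if self._keys is None:
                    raise
        if kid not in self._keys:
            raise KeyError(f"unknown signing key: {kid}")
        return self._keys[kid]
```

An unknown kid triggers an immediate refresh so that key rotation is picked up without waiting for the interval. A production version would likely rate-limit those refreshes, since otherwise tokens with random kids could force a fetch per request.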

Services

  • API: wrap all handlers with middleware; declare route protection via decorators/annotations; enforce project constraints.
  • DB access layer: on connection checkout set stella.tenant_id; forbid raw SQL that bypasses the session GUC.
  • Orchestrator/Task Runner: include {tenant_id, project_id} in job spec; enforce before any IO.
  • Export/Notify/AI: stamp tenant in outbound payloads and logs; include it in idempotency keys.
  • Conseiller/Excitator: keep global ingestion; ensure linking jobs run with tenant context only.

Console

  • Add tenant switcher and “whoami” panel.
  • Show role badges; display “why denied?” with scope/policy explanation from 403 payload.
  • Tenant Admin screens: members, roles, service tokens, audit.

CLI

  • stella login (device/code or local browser) and tenant selection.
  • Persist tokens per profile; --tenant override; whoami.
  • Commands fail with clear errors on scope violation.

Storage

  • Prefix object store paths by tenant/project.
  • Optional per-tenant KMS key integration.

Migrations

  • Add tenant_id and backfill.
  • Enable RLS with policies and tests.
  • Seed default tenant for Quickstart.

8) Engineering tasks

Auth & middleware

  • Implement JWKS retrieval and caching with rotation tests.
  • Implement scope parser and matcher with constraint support.
  • Build HTTP/gRPC interceptors and integrate across services.
  • Add ABAC policy hook and sample policies.
  • Emit structured decision audits.

Data & storage

  • Add tenant_id columns and indices; backfill in migration.
  • Enable Postgres RLS policies for all tenant-scoped tables.
  • Update ORM/queries to rely on RLS; remove any “WHERE tenant_id = ...” duplication.
  • Tenant-prefixed object store paths; optional per-tenant KMS keys.

Services

  • Annotate all routes with x-stella-auth or equivalent decorator.
  • Propagate tenant context through orchestrator and workers.
  • Update Conseiller/Excitator linkers to require tenant context.

Console

  • Implement tenant switcher, role display, and “whoami.”
  • Add Tenant Admin screens (members, projects, service accounts).
  • Implement “Why denied?” modal reading 403 details.

CLI

  • login, whoami, tenants list, tenant flag and persistence.
  • Service token minting for tenant admins.
  • Delegate token creation for robot use.

Audit

  • Create audit_decisions table and producer.
  • Add search API and Console viewer.

Docs

  • Author the ten docs listed in §6 with examples and diagrams.
  • Add troubleshooting: common 401/403 causes and fixes.
  • Add migration guide from single-tenant to multi-tenant.

Tests

  • Unit tests for scope matching and token validation.
  • RLS tests verifying cross-tenant reads/writes fail.
  • E2E tests: multi-tenant users, robot tokens, ABAC overlays, orchestrator runs.
  • Fuzz tests on header handling to prevent tenant confusion bugs.

9) Feature changes required

  • All services: must expose 403 payloads with machine- and human-readable denial reasons and the missing scope string.
  • Export Center: requires tenant in all manifests; deny cross-tenant exports by default; allow explicit cross-tenant mirror via signed bundle.
  • Notifications: destinations and templates are tenant-scoped; sending pipeline stamps tenant.
  • Advisory AI Assistant: restricts training context to a tenant's data; the global knowledge plane may be referenced, but tenant data must never be logged into it.
  • Findings Ledger: partition by tenant; queries must be tenant-filtered or rely solely on RLS.
  • Policy Engine: support condition keys for tenant and actor; ship example policies.

10) Acceptance criteria

  • Requests lacking X-Stella-Tenant in multi-tenant mode are rejected; the header is optional only in single-tenant Quickstart.
  • RLS prevents cross-tenant leakage, proven by tests that attempt blind selects/inserts.
  • CLI can log in, list tenants, select tenant, and perform a job limited to that tenant.
  • Console shows tenant switcher; admin can invite a member and assign roles.
  • Service token can be minted with narrow scopes and expires as configured.
  • Every 403 returns a clear “missing required scope …” with the exact scope string.
  • Conseiller/Excitator continue aggregation-only behavior; linking jobs run strictly under tenant context.
  • Audit stream captures all permit/deny decisions with correlation IDs.

11) Risks & mitigations

  • RLS misconfiguration. Write tests that run with and without RLS; block migrations unless policies are present. Provide a canary query per service on boot to verify isolation.
  • Scope explosion. Keep a minimal, stable scope set; use constraints for specificity; document patterns.
  • JWKS outages. Cache keys with TTL, support multiple kids, and tolerate short network failures.
  • Privilege creep in robots. Short TTLs by default, clear UI to rotate/revoke, and audit for usage anomalies.
  • Tenant confusion bugs. Require tenant header, validate against token memberships, and pin tenant context into DB session and job payloads, never into thread-locals only.

12) Philosophy

  • Isolation by default. Tenancy isn't a UI filter; it's enforced where the data lives.
  • Least privilege wins. Humans and robots get only what they need for as long as they need it.
  • Explain denials. If the platform can't explain “why no,” it's broken.
  • Global vs tenant plane. Public knowledge is shared; customer data is not, ever.

Imposed rule: Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.