git.stella-ops.org/docs/implplan/EPIC_14.md


> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

# Epic 14: Authority‑Backed Scopes & Tenancy

**Short name:** `Authority‑Backed Scopes & Tenancy`
**Primary components:** Authority (authN/Z), Web Services API, Policy Engine, Orchestrator, Task Runner, Console, CLI
**Surfaces:** `/auth/*`, request middleware, DB schema (RLS), object storage layout, message bus topics, audit logs, CLI login/impersonation flows
**Touches:** Conseiller (Feedser), Excitator (Vexer), Findings Ledger, Export Center, Notifications Studio, Advisory AI Assistant

**AOC ground rule reminder:** Conseiller and Excitator aggregate and link advisories/VEX. They never merge or mutate source records. Enforcement of aggregation‑only behavior is tenant‑agnostic and must hold across all scopes.

---

## 1) What it is

A uniform model for identity, authorization, and isolation that is enforced end‑to‑end:

* **Authority‑backed tokens:** JWT/OIDC tokens issued by a configured Authority. Tokens carry **scopes**, **roles**, and **tenant memberships** as claims. Services verify and authorize locally; no out‑of‑band ACL calls during the hot path.
* **Tenancy:** First‑class multi‑tenant isolation with optional **projects** within a tenant. Strong separation at the database layer via **row‑level security (RLS)** and in object storage via **tenant‑prefixed paths** (and optionally per‑tenant KMS keys).
* **Scopes & roles:** Minimal set of composable scopes (`stella:{resource}:{verb}`) that map to roles (`viewer`, `editor`, `operator`, `admin`, `owner`) and can be constrained to `{tenant}/{project}`.
* **Context propagation:** Every API request, job, message, and artifact is stamped with `{tenant_id, project_id, actor}` and validated at ingress and again at persistence.
* **Service accounts & delegation:** Robot identities with scoped, expiring credentials for CI, Task Packs, and webhooks. Human-to-robot delegation is explicit and auditable.
* **Audit:** Immutable decision logs for authN/Z events with resource, scope, and policy evaluation outcomes.

**Tenancy model:**

```
Organization (optional, for billing)
└── Tenant (isolation boundary)
    ├── Projects (isolation + grouping)
    │   ├── Sources (registries, repos)
    │   ├── Jobs & Runs
    │   ├── SBOMs & Artifacts
    │   ├── Findings / Evaluations
    │   └── Policies (bound or inherited)
    └── Shared tenant services (notifications, exports, secrets)
```

**Knowledge planes:**

* **Global knowledge plane:** Advisories, CVE metadata, CPE, KEV, etc. No tenant data.
* **Tenant plane:** SBOMs, VEX attachments, policy results, exposures, notifications, exports, audits.

Conseiller/Excitator span both planes: they collect into the global plane and link into the tenant plane without merging sources.

---

## 2) Why (brief)

Security that depends on “being careful” is not security. We need hard boundaries the platform cannot cross by accident:

* Run many teams/customers safely on one deployment.
* Minimize blast radius for credentials and mistakes.
* Make CI and automation safe with least‑privileged scopes.
* Keep latency low by verifying locally with signed tokens.

---

## 3) How it should work (maximum detail)

### 3.1 Tokens, claims, and scopes

**Token type:** JWT, signed by the Authority. Services cache JWKS and verify locally.

**Required claims:**

* `iss`, `sub`, `aud`, `iat`, `exp`
* `scope`: space‑separated scopes (`stella:sbom:read`, `stella:job:run`)
* `tenants`: array of tenant IDs the subject may access
* `tenant` (active): the currently selected tenant for the request
* `roles`: object map `{ "<tenant>": ["viewer", "editor", ...] }`
* `projects` (optional): array of project IDs or `{ "<tenant>": ["projA", "projB"] }`
* `mfa` (optional): boolean or level for step‑up enforcement
* `act` (optional): actor chain for delegation/impersonation

**Scope grammar:**

```
stella:{resource}:{verb}[#{constraint}]
  resource ∈ {tenant, project, source, job, sbom, vex, advisory, policy, finding, export, notify, secret, pack, ledger, console}
  verb ∈ {read, list, write, delete, run, execute, approve, admin}
  constraint := tenant/{tenantId}[/project/{projectId}]
```

Examples:

* `stella:sbom:read#tenant/t-123/project/p-abc`
* `stella:job:run#tenant/t-123`
* `stella:tenant:admin#tenant/t-123` (Tenant Owner)
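The grammar above can be checked mechanically. The following is a minimal sketch of a parser, assuming tenant and project IDs are non-empty strings without `/`, `#`, or whitespace (the source does not define the ID alphabet). The names `Scope` and `parse_scope` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

RESOURCES = {"tenant", "project", "source", "job", "sbom", "vex", "advisory",
             "policy", "finding", "export", "notify", "secret", "pack",
             "ledger", "console"}
VERBS = {"read", "list", "write", "delete", "run", "execute", "approve", "admin"}

# stella:{resource}:{verb}[#tenant/{tenantId}[/project/{projectId}]]
# Assumption: IDs may contain anything except '/', '#', and whitespace.
_SCOPE_RE = re.compile(
    r"stella:(?P<resource>[a-z]+):(?P<verb>[a-z]+)"
    r"(?:#tenant/(?P<tenant>[^/#\s]+)(?:/project/(?P<project>[^/#\s]+))?)?"
)


@dataclass(frozen=True)
class Scope:
    resource: str
    verb: str
    tenant: Optional[str] = None
    project: Optional[str] = None


def parse_scope(text: str) -> Scope:
    """Parse one scope string, rejecting unknown resources and verbs."""
    match = _SCOPE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"malformed scope: {text!r}")
    if match["resource"] not in RESOURCES:
        raise ValueError(f"unknown resource in scope: {text!r}")
    if match["verb"] not in VERBS:
        raise ValueError(f"unknown verb in scope: {text!r}")
    return Scope(match["resource"], match["verb"], match["tenant"], match["project"])
```

Rejecting unknown resources and verbs at parse time keeps typos in role definitions from silently granting nothing.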

**Role mapping (default):**

* `viewer` → read/list on most resources
* `editor` → viewer + write on sbom/policy
* `operator` → editor + job:run, export:run, notify:manage
* `admin` → operator + user/role management inside tenant
* `owner` → admin + billing/tenant lifecycle

### 3.2 Selecting the active tenant

* **API:** `X-Stella-Tenant: <tenant_id>` header or `?tenant=<id>` query parameter. If both are omitted and the token has exactly one tenant, that tenant is active; otherwise the request is rejected with 400.
* **Console:** Tenant switcher in the top bar. Console includes header on all calls.
* **CLI:** `stella login --tenant <id>` sets the default; override per command `--tenant`.

All services must reject requests where the active tenant is not in `tenants[]` or where the token's scopes do not include a constraint for that tenant.
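The selection rules above fit in one small function. This is a sketch under stated assumptions: the source only says such requests are rejected, so the 403 status for a non-member tenant and the 400 for a header/query mismatch are illustrative choices. `TenantError` and `resolve_active_tenant` are hypothetical names.

```python
from typing import Optional, Sequence


class TenantError(Exception):
    """Carries the HTTP status the middleware should return."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


def resolve_active_tenant(header: Optional[str],
                          query: Optional[str],
                          token_tenants: Sequence[str]) -> str:
    """Pick the active tenant from the header or query, else the sole membership."""
    # Assumption: conflicting selectors are treated as a client error.
    if header and query and header != query:
        raise TenantError(400, "tenant header and query parameter disagree")
    requested = header or query
    if not requested:
        if len(token_tenants) == 1:
            return token_tenants[0]
        raise TenantError(400, "tenant must be specified")
    if requested not in token_tenants:
        # Assumption: 403 for a tenant the subject is not a member of.
        raise TenantError(403, "token has no membership in the requested tenant")
    return requested
```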

### 3.3 Request pipeline

1. **Authentication middleware:** verify JWT signature and expiry.
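2. **Tenant activation:** pick the active tenant from the header or query parameter (§3.2); set per-request context `{tenant_id, project_id, actor}`, with `project_id` present when the route is project-scoped.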
3. **Scope check:** compare required scope for the route with token scopes. If route accepts project limiters, check constraints align.
4. **Policy overlay (optional):** ABAC evaluation for fine controls (e.g., “deny job:run outside business hours”).
5. **Persistence guard:** set DB session GUC `stella.tenant_id` and verify any writes include matching `tenant_id`. Enforce Postgres RLS.
6. **Audit:** write decision to audit bus (async) with `permit|deny`, reasons, and matched rule.
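Step 3 of the pipeline can be sketched as follows. The matching semantics are an assumption, since the source does not spell them out: a grant must name the active tenant, a tenant-wide grant covers every project in that tenant, and a project-constrained grant covers only that project. `scope_permits` and `authorize` are hypothetical names.

```python
from typing import Optional


def scope_permits(granted: str, resource: str, verb: str,
                  tenant: str, project: Optional[str] = None) -> bool:
    """Return True if one granted scope string covers the requested access."""
    name, _, constraint = granted.partition("#")
    if name != f"stella:{resource}:{verb}":
        return False
    # constraint := tenant/{tenantId}[/project/{projectId}]
    parts = constraint.split("/") if constraint else []
    if len(parts) not in (2, 4) or parts[0] != "tenant" or parts[1] != tenant:
        # Assumption: unconstrained scopes never satisfy a tenant-scoped route.
        return False
    if len(parts) == 4:
        # A project-constrained grant only covers requests for that project.
        return parts[2] == "project" and project is not None and parts[3] == project
    return True


def authorize(scope_claim: str, resource: str, verb: str,
              tenant: str, project: Optional[str] = None) -> bool:
    """Check the space-separated `scope` claim against a route's requirement."""
    return any(scope_permits(s, resource, verb, tenant, project)
               for s in scope_claim.split())
```

Under these assumptions a project-scoped grant does not satisfy a tenant-wide request, which errs on the side of denial.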

### 3.4 Database isolation

**Approach:** shared schema with **Row Level Security**. Every tenant‑scoped table includes `tenant_id` and optional `project_id`.

**RLS policy template (Postgres):**

```sql
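ALTER TABLE sboms ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced; apply it to them as well.
ALTER TABLE sboms FORCE ROW LEVEL SECURITY;

-- Existing rows are visible (and updatable/deletable) only for the active tenant.
-- current_setting(..., true) returns NULL if the setting was never set, so no rows match.
CREATE POLICY sboms_isolate ON sboms
USING (tenant_id = current_setting('stella.tenant_id', true));

-- Explicit INSERT/UPDATE guard: new row values must carry the active tenant.
CREATE POLICY sboms_write_guard ON sboms
AS PERMISSIVE FOR ALL
TO PUBLIC
WITH CHECK (tenant_id = current_setting('stella.tenant_id', true));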
```

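Set `stella.tenant_id` at the start of each transaction on the checked-out connection. The third argument (`is_local = true`) makes the value transaction-local, so it cannot leak to the next user of a pooled connection, but it must be set inside the transaction that runs the queries: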

```sql
SELECT set_config('stella.tenant_id', $1, true); -- $1 = active tenant; true = applies to the current transaction only
```
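A data-access helper might wrap this call so that no query runs without a tenant. The sketch below assumes a DB-API 2.0 connection with autocommit off and the `%s` parameter style (as in psycopg). `tenant_transaction` is a hypothetical name.

```python
from contextlib import contextmanager

# Transaction-local setting: it disappears at COMMIT/ROLLBACK.
SET_TENANT_SQL = "SELECT set_config('stella.tenant_id', %s, true)"


@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Yield a cursor whose transaction is pinned to `tenant_id`."""
    if not tenant_id:
        raise ValueError("tenant_id is required before any DB access")
    cur = conn.cursor()
    try:
        # With autocommit off, this first statement opens the transaction,
        # so the setting covers every query issued through the cursor.
        cur.execute(SET_TENANT_SQL, (tenant_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Usage would look like `with tenant_transaction(conn, "t-123") as cur: cur.execute("SELECT id FROM sboms")`, leaving RLS to filter rows.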

**Migrations:**

* Add `tenant_id` to all tenant‑scoped tables.
* Backfill existing rows with the default tenant in Quickstart.
* Enable RLS and policies in a reversible migration.

### 3.5 Object storage and artifacts

* **Layout:** `s3://<bucket>/tenants/<tenant_id>/projects/<project_id>/<resource>/<uuid>...`
* **KMS keys:** optional per‑tenant key alias. Map via `kms_alias = "stella-<tenant_id>"`.
* Ensure Task Runner and Export Center only operate within the prefixed path of the active tenant.
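As an illustration of the layout and the prefix guard, here is a small sketch. It builds the key portion after `s3://<bucket>/`. The allowed character set for path segments is an assumption, and the function names are hypothetical.

```python
import re

# Assumption: IDs and resource names use only these characters.
_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


def _segment(value: str, label: str) -> str:
    """Validate one path segment so it cannot escape the tenant prefix."""
    if not _SEGMENT.fullmatch(value) or value in {".", ".."}:
        raise ValueError(f"invalid {label}: {value!r}")
    return value


def object_key(tenant_id: str, project_id: str, resource: str, object_id: str) -> str:
    """Build tenants/<tenant_id>/projects/<project_id>/<resource>/<uuid>."""
    return "/".join([
        "tenants", _segment(tenant_id, "tenant_id"),
        "projects", _segment(project_id, "project_id"),
        _segment(resource, "resource"),
        _segment(object_id, "object_id"),
    ])


def kms_alias(tenant_id: str) -> str:
    """Per-tenant key alias, following kms_alias = "stella-<tenant_id>"."""
    return f"stella-{_segment(tenant_id, 'tenant_id')}"


def within_tenant(key: str, tenant_id: str) -> bool:
    """True if `key` lies under the active tenant's prefix."""
    # The trailing slash keeps tenant "t-12" from matching "t-123".
    return key.startswith(f"tenants/{tenant_id}/")
```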

### 3.6 Message bus topics

* Topic naming: `stella.<tenant_id>.<domain>.<event>` for tenant‑scoped events.
* Global knowledge events remain `stella.global.kb.*`.
* Subscriptions always include a tenant filter unless consuming global knowledge.
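The naming rule and the subscription filter could be expressed as below. This sketch assumes topic parts never contain `.` and that `global` is reserved and cannot be used as a tenant ID; neither rule is stated in the source. `tenant_topic` and `may_consume` are hypothetical helpers.

```python
GLOBAL_PREFIX = "stella.global.kb."


def tenant_topic(tenant_id: str, domain: str, event: str) -> str:
    """Build stella.<tenant_id>.<domain>.<event> for a tenant-scoped event."""
    for part in (tenant_id, domain, event):
        if not part or "." in part:
            raise ValueError(f"invalid topic part: {part!r}")
    if tenant_id == "global":
        # Assumption: reserved so a tenant cannot shadow the global plane.
        raise ValueError("'global' is reserved for the knowledge plane")
    return f"stella.{tenant_id}.{domain}.{event}"


def may_consume(topic: str, tenant_id: str) -> bool:
    """A consumer bound to one tenant sees its own topics plus global knowledge."""
    return topic.startswith(GLOBAL_PREFIX) or topic.startswith(f"stella.{tenant_id}.")
```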

### 3.7 Background workers

* **Orchestrator & Task Runner:** each job carries `{tenant_id, project_id}`. Workers set `stella.tenant_id` before any DB or object store access and reject jobs that lack this context.
* **Conseiller/Excitator:** ingest to global plane; linking jobs (matching advisories to tenant SBOMs) run per tenant and respect RLS.

### 3.8 Policy overlay (optional but recommended)

Integrate Policy Engine with condition keys:

* `tenant`, `project`, `resource.type`, `resource.id`, `actor.role`, `actor.mfa`, `time`, `ip`.

Examples:

* Deny `job:run` from IPs outside an allowed CIDR range.
* Require `mfa=true` to approve notification templates.
* Quotas: “max exports per hour per tenant.”
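To make the overlay concrete, here is a minimal sketch of two of the example rules evaluated over the condition keys. It is not the Policy Engine's rule language, which this document does not define. The action name `notify:approve` and the reason codes are assumptions.

```python
import ipaddress
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Attributes:
    """Subset of the condition keys used by the two example rules."""
    tenant: str
    action: str       # "{resource}:{verb}", e.g. "job:run"
    actor_mfa: bool
    ip: str


def evaluate(attrs: Attributes, allowed_cidr: str) -> Tuple[str, str]:
    """Return (effect, reason); runs after the scope check has permitted."""
    network = ipaddress.ip_network(allowed_cidr)
    if attrs.action == "job:run" and ipaddress.ip_address(attrs.ip) not in network:
        return ("deny", "ip_outside_allowed_cidr")
    if attrs.action == "notify:approve" and not attrs.actor_mfa:
        return ("deny", "mfa_required")
    return ("permit", "no_overlay_rule_matched")
```

The overlay can only narrow what scopes already allow; it never grants access on its own in this sketch.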

### 3.9 Service accounts & delegation

* **Robot principals:** `sa:{tenant}:{name}` with scopes constrained to tenant/project. Default TTL 1h; max TTL policy‑controlled.
* **Token minting:** Tenant admins can generate tokens via API/Console; all tokens are auditable and can optionally be bound to a CIDR range or workload identity.
* **Delegation:** `stella token delegate --to sa:... --scopes ... --ttl 15m` produces a token with `act` chain, recorded in audit log.
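The shape of a delegated token's claims might look like the sketch below. Everything about the encoding is an assumption: it uses the nested `act` object form from RFC 8693, requires the requested scopes to be a subset of the delegator's, and caps expiry at the parent token's `exp`. The source only states that an `act` chain is produced and audited. `delegate_claims` is a hypothetical name.

```python
import time
from typing import Any, Dict, Iterable, Optional


def delegate_claims(parent: Dict[str, Any], delegate_sub: str,
                    scopes: Iterable[str], ttl_seconds: int,
                    now: Optional[int] = None) -> Dict[str, Any]:
    """Build claims for a token that `delegate_sub` uses on behalf of parent['sub']."""
    requested = set(scopes)
    held = set(parent.get("scope", "").split())
    if not requested <= held:
        # Assumption: delegation may narrow scopes but never widen them.
        raise PermissionError(f"cannot delegate scopes not held: {sorted(requested - held)}")
    issued = int(time.time()) if now is None else now
    # Assumption: a delegated token never outlives the token it derives from.
    expires = min(issued + ttl_seconds, parent["exp"])
    actor: Dict[str, Any] = {"sub": delegate_sub}
    if "act" in parent:
        # Earlier actors nest inside the current one, forming the chain.
        actor["act"] = parent["act"]
    return {
        "sub": parent["sub"],
        "scope": " ".join(sorted(requested)),
        "iat": issued,
        "exp": expires,
        "act": actor,
    }
```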

### 3.10 Auditing

Every decision logs:

* `ts, tenant, actor, route, resource, action, effect, reason, scopes_used, policy_rule_id`
* Persist in tenant‑scoped audit table and stream to `stella.<tenant>.audit.decisions`.
* Expose search/filter in Console → Admin → Audit.
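A decision record with the fields listed above could be modelled as follows. The field types, the ISO-8601 timestamp string, and the JSON encoding are assumptions. `Decision` is a hypothetical name.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Decision:
    ts: str                 # assumed ISO-8601 UTC timestamp
    tenant: str
    actor: str
    route: str
    resource: str
    action: str
    effect: str             # "permit" or "deny"
    reason: str
    scopes_used: List[str]
    policy_rule_id: Optional[str] = None

    def __post_init__(self) -> None:
        if self.effect not in ("permit", "deny"):
            raise ValueError(f"effect must be permit or deny, got {self.effect!r}")

    def topic(self) -> str:
        """Stream name for this record: stella.<tenant>.audit.decisions."""
        return f"stella.{self.tenant}.audit.decisions"

    def to_json(self) -> str:
        """Serialize with stable key order for the audit bus."""
        return json.dumps(asdict(self), sort_keys=True)
```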

### 3.11 CLI and Console UX

* **CLI:** `stella login`, `stella whoami`, `stella tenants list`, `--tenant` flag everywhere. Clear error if token lacks tenant or scope.
* **Console:** Tenant switcher, role badges, “why denied?” modal showing scope and policy reasons without leaking internals.
* **Impersonation (admin only):** `sudo as <user>` for debugging with visible banner; issues delegated token with `act` chain.

### 3.12 Compatibility modes

* **Quickstart single-tenant:** hidden tenant `local`. The tenant header is optional. RLS stays enabled, with `stella.tenant_id` fixed to the constant `local`.
* **Multi-tenant:** full model active; all tenancy migrations applied; Console exposes tenant admin.

---

## 4) Architecture

### 4.1 New/updated modules

* `auth/authority`: JWKS fetching, token validation, scope parser, cache.
* `auth/middleware`: HTTP/gRPC interceptors for authN/Z, tenant activation, audit emit.
* `auth/roles`: role → scope mapping + tenant/project constraints.
* `auth/policy-bridge`: optional ABAC evaluation using Policy Engine.
* `storage/tenantctx`: helpers to set `stella.tenant_id` in DB session and object‑store prefixes.
* `audit/decisions`: structured logging and bus producer.
* `cli/auth`: login, token store, tenant switcher, whoami.

### 4.2 Data model changes

* Add `tenant_id` (and `project_id` where appropriate) to: `sources, jobs, runs, sboms, components, findings, policies, exports, notifications, secrets, packs, ledger, audits`.
* Create tables:

  * `tenants(id, name, status, created_at, owner_user_id)`
  * `projects(id, tenant_id, name, meta, created_at)`
  * `memberships(user_id, tenant_id, roles[])`
  * `service_accounts(id, tenant_id, name, scopes[], created_at, disabled)`
  * `audit_decisions(...)` (tenant‑scoped)

---

## 5) APIs and contracts

### 5.1 Standard headers

* `Authorization: Bearer <jwt>`
* `X-Stella-Tenant: <tenant_id>`
* `X-Request-ID` (propagated for audit correlation)

### 5.2 Auth endpoints

* `POST /auth/login` (OIDC code flow start for Console)
* `GET /auth/jwks.json` (proxy/cached from Authority if needed)
* `GET /auth/whoami` → `{ sub, tenants[], activeTenant, roles, scopes, mfa }`
* `POST /auth/tokens/service` (tenant admin) → mint robot token with constrained scopes/ttl
* `POST /auth/tokens/delegate` → mint delegated token with `act` chain

### 5.3 Tenant admin endpoints

* `POST /tenants` (owner only)
* `GET /tenants`, `GET /tenants/:id`
* `POST /tenants/:id/projects`, `GET /tenants/:id/projects`
* `POST /tenants/:id/members` (assign role), `DELETE /tenants/:id/members/:user`
* `GET /tenants/:id/audit` (search)

### 5.4 Route protection conventions

Each route declares:

* `resource`, `verb`
* Whether it requires project constraint
* Optional policy gates (e.g., `require_mfa`)

Example (OpenAPI extension):

```yaml
x-stella-auth:
  resource: sbom
  verb: read
  requireTenant: true
  allowProjectScoped: true
  requireMFA: false
```
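Middleware could turn such an annotation into the scope string it looks for, which is also the string a 403 would report as missing. The sketch below takes the annotation as a plain dict mirroring the YAML keys. Treating `requireTenant` as defaulting to true is an assumption, and `required_scope` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def required_scope(auth: Dict[str, Any], tenant: Optional[str],
                   project: Optional[str] = None) -> str:
    """Return the narrowest scope string that would satisfy the route."""
    scope = f"stella:{auth['resource']}:{auth['verb']}"
    if auth.get("requireTenant", True):
        if not tenant:
            raise ValueError("route requires an active tenant")
        scope += f"#tenant/{tenant}"
        # Only add the project constraint when the route accepts one.
        if project and auth.get("allowProjectScoped", False):
            scope += f"/project/{project}"
    return scope
```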

---

## 6) Documentation changes

Create/update:

1. `/docs/security/tenancy-overview.md`
   Concepts, knowledge planes, tenant/project model, isolation layers.

2. `/docs/security/scopes-and-roles.md`
   Scope grammar, default roles, examples, custom role mapping.

3. `/docs/security/authority-config.md`
   How to connect to an OIDC provider, JWKS caching, audience, issuers, time skew, MFA signal.

4. `/docs/operations/multi-tenancy.md`
   Running multi‑tenant deployments: quotas, KMS per tenant, object store layout, message topics, backup/restore per tenant.

5. `/docs/operations/rls-and-data-isolation.md`
   Postgres RLS policy reference, migrations, troubleshooting leaks.

6. `/docs/console/admin-tenants.md`
   Tenant switcher, managing members, roles, audit viewer.

7. `/docs/cli/authentication.md`
   `login`, `whoami`, `tenants list`, `--tenant`, service tokens, delegation.

8. `/docs/api/authentication.md`
   Headers, error codes, sample requests, OpenAPI `x-stella-auth` annotations.

9. `/docs/policy/examples/abac-overlays.md`
   Optional policy snippets: MFA requirements, time windows, IP restrictions, quotas.

10. `/docs/install/configuration-reference.md`
    New `STELLA_AUTH_*`, `STELLA_TENANCY_*`, and per‑service flags.

Add at the top of each page:

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.

---

## 7) Implementation plan

### Middleware & libraries

* Implement `auth/middleware` with:

  * JWKS cache and signature verification (kid pinning, 10 min refresh).
  * Scope parser and matcher with constraint evaluation.
  * Tenant activator reading header/query and verifying membership.
  * Policy hook for ABAC (feature flag).
  * Decision audit emitter (non‑blocking).
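The JWKS cache in the first item might be structured like this. It is a sketch under stated assumptions: the fetcher is injected (so no HTTP client is implied), it returns a mapping from `kid` to key material, the 600-second TTL mirrors the 10-minute refresh above, and the minimum interval between refreshes for unknown `kid`s is an added safeguard not named in the source. `JwksCache` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches signing keys by `kid` and keeps serving them through short outages."""

    def __init__(self, fetch: Callable[[], Dict[str, dict]],
                 ttl_seconds: float = 600.0,
                 min_refresh_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._min_refresh = min_refresh_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def get(self, kid: str) -> Optional[dict]:
        """Return the key for `kid`, refreshing when stale or on an unseen kid."""
        now = self._clock()
        age = None if self._fetched_at is None else now - self._fetched_at
        expired = age is None or age >= self._ttl
        # An unseen kid may mean rotation; rate-limit refreshes so bogus kids
        # cannot force a fetch on every request.
        rotated = kid not in self._keys and (age is None or age >= self._min_refresh)
        if expired or rotated:
            self._refresh(now)
        return self._keys.get(kid)

    def _refresh(self, now: float) -> None:
        try:
            self._keys = dict(self._fetch())
            self._fetched_at = now
        except Exception:
            if self._fetched_at is None:
                raise  # nothing cached yet: surface the failure
            # Otherwise keep the previous keys and retry on a later call.
```

A caller that gets `None` back should treat the token as unverifiable and reject it.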

### Services

* **API:** wrap all handlers with middleware; declare route protection via decorators/annotations; enforce project constraints.
* **DB access layer:** set `stella.tenant_id` at the start of every transaction on a checked-out connection; forbid raw SQL that bypasses the session GUC.
* **Orchestrator/Task Runner:** include `{tenant_id, project_id}` in job spec; enforce before any IO.
* **Export/Notify/AI:** stamp tenant in outbound payloads and logs; include it in idempotency keys.
* **Conseiller/Excitator:** keep global ingestion; ensure linking jobs run with tenant context only.

### Console

* Add tenant switcher and “whoami” panel.
* Show role badges; display “why denied?” with scope/policy explanation from 403 payload.
* Tenant Admin screens: members, roles, service tokens, audit.

### CLI

* `stella login` (device/code or local browser) and tenant selection.
* Persist tokens per profile; `--tenant` override; `whoami`.
* Commands fail with clear errors on scope violation.

### Storage

* Prefix object store paths by tenant/project.
* Optional per‑tenant KMS key integration.

### Migrations

* Add `tenant_id` and backfill.
* Enable RLS with policies and tests.
* Seed default tenant for Quickstart.

---

## 8) Engineering tasks

**Auth & middleware**

* [ ] Implement JWKS retrieval and caching with rotation tests.
* [ ] Implement scope parser and matcher with constraint support.
* [ ] Build HTTP/gRPC interceptors and integrate across services.
* [ ] Add ABAC policy hook and sample policies.
* [ ] Emit structured decision audits.

**Data & storage**

* [ ] Add `tenant_id` columns and indices; backfill in migration.
* [ ] Enable Postgres RLS policies for all tenant‑scoped tables.
* [ ] Update ORM/queries to rely on RLS; remove any “WHERE tenant_id = ...” duplication.
* [ ] Tenant‑prefixed object store paths; optional per‑tenant KMS keys.

**Services**

* [ ] Annotate all routes with `x-stella-auth` or equivalent decorator.
* [ ] Propagate tenant context through orchestrator and workers.
* [ ] Update Conseiller/Excitator linkers to require tenant context.

**Console**

* [ ] Implement tenant switcher, role display, and “whoami.”
* [ ] Add Tenant Admin screens (members, projects, service accounts).
* [ ] Implement “Why denied?” modal reading 403 details.

**CLI**

* [ ] `login`, `whoami`, `tenants list`, tenant flag and persistence.
* [ ] Service token minting for tenant admins.
* [ ] Delegate token creation for robot use.

**Audit**

* [ ] Create audit_decisions table and producer.
* [ ] Add search API and Console viewer.

**Docs**

* [ ] Author the ten docs listed in §6 with examples and diagrams.
* [ ] Add troubleshooting: common 401/403 causes and fixes.
* [ ] Add migration guide from single‑tenant to multi‑tenant.

**Tests**

* [ ] Unit tests for scope matching and token validation.
* [ ] RLS tests verifying cross‑tenant reads/writes fail.
* [ ] E2E tests: multi‑tenant users, robot tokens, ABAC overlays, orchestrator runs.
* [ ] Fuzz tests on header handling to prevent tenant confusion bugs.

---

## 9) Feature changes required

* **All services:** must return 403 payloads with machine-readable and human-readable denial reasons, including the missing scope string.
* **Export Center:** requires tenant in all manifests; deny cross‑tenant exports by default; allow explicit cross‑tenant mirror via signed bundle.
* **Notifications:** destinations and templates are tenant‑scoped; sending pipeline stamps tenant.
* **Advisory AI Assistant:** restricts training context to the active tenant's data; the global knowledge plane may be referenced, but tenant data must never be logged or written into it.
* **Findings Ledger:** partition by tenant; queries must be tenant‑filtered or rely solely on RLS.
* **Policy Engine:** support condition keys for tenant and actor; ship example policies.
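For the 403 requirement on all services, a denial body might look like the sketch below. Every field name here is hypothetical; the source only requires machine-readable and human-readable reasons plus the exact missing scope string.

```python
from typing import Any, Dict, Optional


def denial_payload(missing_scope: str, request_id: str,
                   policy_rule_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a 403 body that a CLI or the Console's "why denied?" modal can render."""
    return {
        "status": 403,
        "code": "missing_scope",                      # machine-readable reason
        "message": f"missing required scope {missing_scope}",  # human-readable
        "missingScope": missing_scope,                # exact scope string
        "requestId": request_id,                      # ties back to the audit record
        "policyRuleId": policy_rule_id,               # set when an overlay rule denied
    }
```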

---

## 10) Acceptance criteria

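* Requests that name no tenant (neither `X-Stella-Tenant` nor `?tenant=`) are rejected in multi-tenant mode unless the token carries exactly one tenant (§3.2); in single-tenant Quickstart the header is optional.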
* RLS prevents cross‑tenant leakage proven by tests that attempt blind selects/inserts.
* CLI can log in, list tenants, select tenant, and perform a job limited to that tenant.
* Console shows tenant switcher; admin can invite a member and assign roles.
* Service token can be minted with narrow scopes and expires as configured.
* Every 403 returns a clear “missing required scope …” with the exact scope string.
* Conseiller/Excitator continue aggregation‑only behavior; linking jobs run strictly under tenant context.
* Audit stream captures all permit/deny decisions with correlation IDs.

---

## 11) Risks & mitigations

* **RLS misconfiguration.** Write tests that run with and without RLS; block migrations unless policies are present. Provide a canary query per service on boot to verify isolation.
* **Scope explosion.** Keep a minimal, stable scope set; use constraints for specificity; document patterns.
* **JWKS outages.** Cache keys with TTL, support multiple `kid`s, and tolerate short network failures.
* **Privilege creep in robots.** Short TTLs by default, clear UI to rotate/revoke, and audit for usage anomalies.
* **Tenant confusion bugs.** Require tenant header, validate against token memberships, and pin tenant context into DB session and job payloads, never thread‑locals only.

---

## 12) Philosophy

* **Isolation by default.** Tenancy isn’t a UI filter; it’s enforced where the data lives.
* **Least privilege wins.** Humans and robots get only what they need for as long as they need it.
* **Explain denials.** If the platform can’t explain “why no,” it’s broken.
* **Global vs tenant plane.** Public knowledge is shared; customer data is not, ever.

> **Imposed rule:** Work of this type or tasks of this type on this component must also be applied everywhere else it should be applied.