# Scanner Schema Specification **Schema**: `scanner` **Owner**: Scanner.WebService **Purpose**: Scan orchestration, call-graphs, proof bundles, reachability analysis **Sprint**: SPRINT_3500_0002_0001, SPRINT_3500_0003_0002 --- ## Overview The `scanner` schema contains all tables related to: 1. Scan manifests and deterministic replay 2. Proof bundles (content-addressed storage metadata) 3. Call-graph nodes and edges (reachability analysis) 4. Entrypoints (framework-specific entry discovery) 5. Runtime samples (profiling data for reachability validation) **Design Principles**: - All tables use `scan_id` as primary partition key for scan isolation - Deterministic data only (no timestamps in core algorithms) - Content-addressed references (hashes, not paths) - Forward-only schema evolution --- ## Tables ### 1. scan_manifest **Purpose**: Stores immutable scan manifests capturing all inputs for deterministic replay. **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Primary key; UUID format | | `created_at_utc` | `timestamptz` | NOT NULL | Scan creation timestamp | | `artifact_digest` | `text` | NOT NULL | Image/artifact digest (sha256:...) | | `artifact_purl` | `text` | NULL | PURL identifier (pkg:oci/...) | | `scanner_version` | `text` | NOT NULL | Scanner.WebService version | | `worker_version` | `text` | NOT NULL | Scanner.Worker version | | `concelier_snapshot_hash` | `text` | NOT NULL | Concelier feed snapshot digest | | `excititor_snapshot_hash` | `text` | NOT NULL | Excititor VEX snapshot digest | | `lattice_policy_hash` | `text` | NOT NULL | Policy bundle digest | | `deterministic` | `boolean` | NOT NULL | Whether scan used deterministic mode | | `seed` | `bytea` | NOT NULL | 32-byte deterministic seed | | `knobs` | `jsonb` | NULL | Configuration knobs (depth limits, etc.) | | `manifest_hash` | `text` | NOT NULL | SHA-256 of canonical manifest JSON (UNIQUE) | | `manifest_json` | `jsonb` | NOT NULL | Canonical JSON manifest | | `manifest_dsse_json` | `jsonb` | NOT NULL | DSSE signature envelope | **Indexes**: ```sql CREATE INDEX idx_scan_manifest_artifact ON scanner.scan_manifest(artifact_digest); CREATE INDEX idx_scan_manifest_snapshots ON scanner.scan_manifest(concelier_snapshot_hash, excititor_snapshot_hash); CREATE INDEX idx_scan_manifest_created ON scanner.scan_manifest(created_at_utc DESC); CREATE UNIQUE INDEX idx_scan_manifest_hash ON scanner.scan_manifest(manifest_hash); ``` **Constraints**: - `manifest_hash` format: `sha256:[0-9a-f]{64}` - `seed` must be exactly 32 bytes - `scan_id` format: UUID v4 **Partitioning**: None (lookup table, <100k rows expected) **Retention**: 180 days (drop scans older than 180 days) --- ### 2. proof_bundle **Purpose**: Metadata for content-addressed proof bundles (zip archives). **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Foreign key to `scan_manifest.scan_id` | | `root_hash` | `text` | NOT NULL | Merkle root hash of bundle contents | | `bundle_uri` | `text` | NOT NULL | File path or S3 URI to bundle zip | | `proof_root_dsse_json` | `jsonb` | NOT NULL | DSSE signature of root hash | | `created_at_utc` | `timestamptz` | NOT NULL | Bundle creation timestamp | **Primary Key**: `(scan_id, root_hash)` **Indexes**: ```sql CREATE INDEX idx_proof_bundle_scan ON scanner.proof_bundle(scan_id); CREATE INDEX idx_proof_bundle_created ON scanner.proof_bundle(created_at_utc DESC); ``` **Constraints**: - `root_hash` format: `sha256:[0-9a-f]{64}` - `bundle_uri` must be accessible file path or S3 URI **Partitioning**: None (<100k rows expected) **Retention**: 365 days (compliance requirement for signed bundles) --- ### 3. cg_node (call-graph nodes) **Purpose**: Stores call-graph nodes (methods/functions) extracted from artifacts. **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Partition key | | `node_id` | `text` | NOT NULL | Deterministic node ID (hash-based) | | `artifact_key` | `text` | NOT NULL | Artifact identifier (assembly name, JAR, etc.) | | `symbol_key` | `text` | NOT NULL | Canonical symbol name (Namespace.Type::Method) | | `visibility` | `text` | NOT NULL | `public`, `internal`, `private`, `unknown` | | `flags` | `integer` | NOT NULL | Bitfield: `IS_ENTRYPOINT_CANDIDATE=1`, `IS_VIRTUAL=2`, etc. | **Primary Key**: `(scan_id, node_id)` **Indexes**: ```sql CREATE INDEX idx_cg_node_artifact ON scanner.cg_node(scan_id, artifact_key); CREATE INDEX idx_cg_node_symbol ON scanner.cg_node(scan_id, symbol_key); CREATE INDEX idx_cg_node_flags ON scanner.cg_node(scan_id, flags) WHERE (flags & 1) = 1; -- Entrypoint candidates ``` **Constraints**: - `node_id` format: `sha256:[0-9a-f]{64}` (deterministic hash) - `visibility` must be one of: `public`, `internal`, `private`, `unknown` **Partitioning**: Hash partition by `scan_id` (for scans with >100k nodes) **Retention**: 90 days (call-graphs recomputed on rescan) --- ### 4. cg_edge (call-graph edges) **Purpose**: Stores call-graph edges (invocations) between nodes. **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Partition key | | `from_node_id` | `text` | NOT NULL | Caller node ID | | `to_node_id` | `text` | NOT NULL | Callee node ID | | `kind` | `smallint` | NOT NULL | `1=static`, `2=heuristic` | | `reason` | `smallint` | NOT NULL | `1=direct_call`, `2=virtual_call`, `3=reflection_string`, etc. | | `weight` | `real` | NOT NULL | Edge confidence weight (0.0-1.0) | **Primary Key**: `(scan_id, from_node_id, to_node_id, kind, reason)` **Indexes**: ```sql CREATE INDEX idx_cg_edge_from ON scanner.cg_edge(scan_id, from_node_id); CREATE INDEX idx_cg_edge_to ON scanner.cg_edge(scan_id, to_node_id); CREATE INDEX idx_cg_edge_static ON scanner.cg_edge(scan_id, kind) WHERE kind = 1; CREATE INDEX idx_cg_edge_heuristic ON scanner.cg_edge(scan_id, kind) WHERE kind = 2; ``` **Constraints**: - `kind` must be 1 (static) or 2 (heuristic) - `reason` must be in range 1-10 (enum defined in code) - `weight` must be in range [0.0, 1.0] **Partitioning**: Hash partition by `scan_id` (for scans with >500k edges) **Retention**: 90 days **Notes**: - High-volume table (1M+ rows per large scan) - Use partial indexes for `kind` to optimize static-only queries - Consider GIN index on `(from_node_id, to_node_id)` for bidirectional BFS --- ### 5. entrypoint **Purpose**: Stores discovered entrypoints (HTTP routes, CLI commands, background jobs). **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Partition key | | `node_id` | `text` | NOT NULL | Reference to `cg_node.node_id` | | `kind` | `text` | NOT NULL | `http`, `grpc`, `cli`, `job`, `event`, `unknown` | | `framework` | `text` | NOT NULL | `aspnetcore`, `spring`, `express`, etc. | | `route` | `text` | NULL | HTTP route pattern (e.g., `/api/orders/{id}`) | | `metadata` | `jsonb` | NULL | Framework-specific metadata | **Primary Key**: `(scan_id, node_id, kind, framework, route)` **Indexes**: ```sql CREATE INDEX idx_entrypoint_scan ON scanner.entrypoint(scan_id); CREATE INDEX idx_entrypoint_kind ON scanner.entrypoint(scan_id, kind); CREATE INDEX idx_entrypoint_framework ON scanner.entrypoint(scan_id, framework); ``` **Constraints**: - `kind` must be one of: `http`, `grpc`, `cli`, `job`, `event`, `unknown` - `route` required for `kind='http'` or `kind='grpc'` **Partitioning**: None (<10k rows per scan) **Retention**: 90 days --- ### 6. runtime_sample **Purpose**: Stores runtime profiling samples (stack traces) for reachability validation. **Schema**: | Column | Type | Nullable | Description | |--------|------|----------|-------------| | `scan_id` | `text` | NOT NULL | Partition key (links to scan) | | `collected_at` | `timestamptz` | NOT NULL | Sample collection timestamp | | `env_hash` | `text` | NOT NULL | Environment hash (k8s ns+pod+container) | | `sample_id` | `bigserial` | NOT NULL | Auto-incrementing sample ID | | `timestamp` | `timestamptz` | NOT NULL | Sample timestamp | | `pid` | `integer` | NOT NULL | Process ID | | `thread_id` | `integer` | NOT NULL | Thread ID | | `frames` | `text[]` | NOT NULL | Array of node IDs (stack trace) | | `weight` | `real` | NOT NULL | Sample weight (1.0 for discrete samples) | **Primary Key**: `(scan_id, sample_id)` **Indexes**: ```sql CREATE INDEX idx_runtime_sample_scan ON scanner.runtime_sample(scan_id, collected_at DESC); CREATE INDEX idx_runtime_sample_frames ON scanner.runtime_sample USING GIN(frames); CREATE INDEX idx_runtime_sample_env ON scanner.runtime_sample(scan_id, env_hash); ``` **Constraints**: - `frames` array length must be >0 and <1000 - `weight` must be >0.0 **Partitioning**: **TIME-BASED** (monthly partitions by `collected_at`) ```sql CREATE TABLE scanner.runtime_sample_2025_01 PARTITION OF scanner.runtime_sample FOR VALUES FROM ('2025-01-01') TO ('2025-02-01'); ``` **Retention**: 90 days (drop old partitions automatically) **Notes**: - **Highest volume table** (10M+ rows for long-running services) - GIN index on `frames[]` enables fast "find samples containing node X" queries - Partition pruning critical for performance --- ## Enums (Defined in Code) ### cg_edge.kind | Value | Name | Description | |-------|------|-------------| | 1 | `static` | Statically proven call edge | | 2 | `heuristic` | Heuristic/inferred edge (reflection, DI, dynamic) | ### cg_edge.reason | Value | Name | Description | |-------|------|-------------| | 1 | `direct_call` | Direct method invocation | | 2 | `virtual_call` | Virtual/interface dispatch | | 3 | `reflection_string` | Reflection with string name | | 4 | `di_binding` | Dependency injection registration | | 5 | `dynamic_import` | Dynamic module import (JS/Python) | | 6 | `delegate_invoke` | Delegate/lambda invocation | | 7 | `async_await` | Async method call | | 8 | `constructor` | Object constructor invocation | | 9 | `plt_got` | PLT/GOT indirect call (native binaries) | | 10 | `unknown` | Unknown edge type | ### cg_node.flags (Bitfield) | Bit | Flag | Description | |-----|------|-------------| | 0 | `IS_ENTRYPOINT_CANDIDATE` | Node could be an entrypoint | | 1 | `IS_VIRTUAL` | Virtual or interface method | | 2 | `IS_ASYNC` | Async method | | 3 | `IS_CONSTRUCTOR` | Constructor method | | 4 | `IS_EXPORTED` | Publicly exported (for native binaries) | --- ## Schema Evolution ### Migration Categories Per `docs/db/SPECIFICATION.md`: | Category | Prefix | Execution | Description | |----------|--------|-----------|-------------| | Startup (A) | `001-099` | Automatic at boot | Non-breaking DDL (CREATE IF NOT EXISTS) | | Release (B) | `100-199` | Manual via CLI | Breaking changes (requires maintenance window) | | Seed | `S001-S999` | After schema | Reference data with ON CONFLICT DO NOTHING | | Data (C) | `DM001-DM999` | Background job | Batched data transformations | ### Upcoming Migrations | Migration | Category | Sprint | Description | |-----------|----------|--------|-------------| | `010_scanner_schema.sql` | Startup (A) | 3500.0002.0001 | Create scanner schema, scan_manifest, proof_bundle | | `011_call_graph_tables.sql` | Startup (A) | 3500.0003.0002 | Create cg_node, cg_edge, entrypoint | | `012_runtime_sample_partitions.sql` | Startup (A) | 3500.0003.0004 | Create runtime_sample with monthly partitions | | `S001_seed_edge_reasons.sql` | Seed | 3500.0003.0002 | Seed edge reason lookup table | --- ## Performance Considerations ### Query Patterns **High-frequency queries**: 1. **Scan manifest lookup by artifact**: ```sql SELECT * FROM scanner.scan_manifest WHERE artifact_digest = $1 ORDER BY created_at_utc DESC LIMIT 1; ``` - Index: `idx_scan_manifest_artifact` 2. **Reachability BFS (forward)**: ```sql SELECT to_node_id FROM scanner.cg_edge WHERE scan_id = $1 AND from_node_id = ANY($2) AND kind = 1; ``` - Index: `idx_cg_edge_from` 3. **Reachability BFS (backward)**: ```sql SELECT from_node_id FROM scanner.cg_edge WHERE scan_id = $1 AND to_node_id = $2 AND kind = 1; ``` - Index: `idx_cg_edge_to` 4. **Find runtime samples containing node**: ```sql SELECT * FROM scanner.runtime_sample WHERE scan_id = $1 AND $2 = ANY(frames); ``` - Index: `idx_runtime_sample_frames` (GIN) ### Index Maintenance **Reindex schedule**: - `cg_edge` indexes: Weekly (high churn) - `runtime_sample` GIN index: Monthly (after partition drops) **Vacuum**: - Autovacuum enabled for all tables - Manual VACUUM ANALYZE after bulk inserts (>1M rows) ### Partition Management **Automated partition creation** (cron job): ```sql -- Create next month's partition 7 days in advance CREATE TABLE IF NOT EXISTS scanner.runtime_sample_2025_02 PARTITION OF scanner.runtime_sample FOR VALUES FROM ('2025-02-01') TO ('2025-03-01'); ``` **Automated partition dropping** (90-day retention): ```sql DROP TABLE IF EXISTS scanner.runtime_sample_2024_10; -- Older than 90 days ``` --- ## Compliance & Auditing ### DSSE Signatures All proof bundles and manifests include DSSE signatures: - `manifest_dsse_json` in `scan_manifest` - `proof_root_dsse_json` in `proof_bundle` **Verification**: - Signatures verified on read using `IContentSigner.Verify` - Invalid signatures → reject proof bundle ### Immutability **Immutable tables**: - `scan_manifest` — No updates allowed after insert - `proof_bundle` — No updates allowed after insert **Enforcement**: Application-level (no UPDATE grants in production) ### Retention Policies | Table | Retention | Enforcement | |-------|-----------|-------------| | `scan_manifest` | 180 days | DELETE WHERE created_at_utc < NOW() - INTERVAL '180 days' | | `proof_bundle` | 365 days | DELETE WHERE created_at_utc < NOW() - INTERVAL '365 days' | | `cg_node` | 90 days | CASCADE delete on scan_manifest | | `cg_edge` | 90 days | CASCADE delete on scan_manifest | | `runtime_sample` | 90 days | DROP PARTITION (monthly) | --- ## Monitoring ### Key Metrics 1. **Table sizes**: ```sql SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) FROM pg_tables WHERE schemaname = 'scanner'; ``` 2. **Index usage**: ```sql SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'scanner' ORDER BY idx_scan DESC; ``` 3. **Partition sizes**: ```sql SELECT tablename, pg_size_pretty(pg_total_relation_size('scanner.'||tablename)) FROM pg_tables WHERE schemaname = 'scanner' AND tablename LIKE 'runtime_sample_%' ORDER BY tablename DESC; ``` ### Alerts - **Table growth**: Alert if `cg_edge` >10GB per scan - **Index bloat**: Alert if index size >2x expected - **Partition creation**: Alert if next month's partition not created 7 days in advance - **Vacuum lag**: Alert if last autovacuum >7 days --- ## References - `docs/07_HIGH_LEVEL_ARCHITECTURE.md` — Schema isolation design - `docs/db/SPECIFICATION.md` — Database specification - `docs/operations/postgresql-guide.md` — Operations guide - `SPRINT_3500_0002_0001_score_proofs_foundations.md` — Implementation sprint - `SPRINT_3500_0003_0002_reachability_dotnet_call_graphs.md` — Call-graph implementation --- **Last Updated**: 2025-12-17 **Schema Version**: 1.0 **Next Review**: Sprint 3500.0003.0004 (partition strategy)