Scanner Schema Specification
Schema: scanner
Owner: Scanner.WebService
Purpose: Scan orchestration, call-graphs, proof bundles, reachability analysis
Sprint: SPRINT_3500_0002_0001, SPRINT_3500_0003_0002
Overview
The scanner schema contains all tables related to:
- Scan manifests and deterministic replay
- Proof bundles (content-addressed storage metadata)
- Call-graph nodes and edges (reachability analysis)
- Entrypoints (framework-specific entry discovery)
- Runtime samples (profiling data for reachability validation)
Design Principles:
- All tables use `scan_id` as primary partition key for scan isolation
- Deterministic data only (no timestamps in core algorithms)
- Content-addressed references (hashes, not paths)
- Forward-only schema evolution
Tables
1. scan_manifest
Purpose: Stores immutable scan manifests capturing all inputs for deterministic replay.
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Primary key; UUID format |
| `created_at_utc` | `timestamptz` | NOT NULL | Scan creation timestamp |
| `artifact_digest` | `text` | NOT NULL | Image/artifact digest (sha256:...) |
| `artifact_purl` | `text` | NULL | PURL identifier (pkg:oci/...) |
| `scanner_version` | `text` | NOT NULL | Scanner.WebService version |
| `worker_version` | `text` | NOT NULL | Scanner.Worker version |
| `concelier_snapshot_hash` | `text` | NOT NULL | Concelier feed snapshot digest |
| `excititor_snapshot_hash` | `text` | NOT NULL | Excititor VEX snapshot digest |
| `lattice_policy_hash` | `text` | NOT NULL | Policy bundle digest |
| `deterministic` | `boolean` | NOT NULL | Whether scan used deterministic mode |
| `seed` | `bytea` | NOT NULL | 32-byte deterministic seed |
| `knobs` | `jsonb` | NULL | Configuration knobs (depth limits, etc.) |
| `manifest_hash` | `text` | NOT NULL | SHA-256 of canonical manifest JSON (UNIQUE) |
| `manifest_json` | `jsonb` | NOT NULL | Canonical JSON manifest |
| `manifest_dsse_json` | `jsonb` | NOT NULL | DSSE signature envelope |
Indexes:
CREATE INDEX idx_scan_manifest_artifact ON scanner.scan_manifest(artifact_digest);
CREATE INDEX idx_scan_manifest_snapshots ON scanner.scan_manifest(concelier_snapshot_hash, excititor_snapshot_hash);
CREATE INDEX idx_scan_manifest_created ON scanner.scan_manifest(created_at_utc DESC);
CREATE UNIQUE INDEX idx_scan_manifest_hash ON scanner.scan_manifest(manifest_hash);
Constraints:
- `manifest_hash` format: `sha256:[0-9a-f]{64}`
- `seed` must be exactly 32 bytes
- `scan_id` format: UUID v4
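The three constraints above can be checked on the application side before a row is inserted. The following is a minimal Python sketch, not the service's actual validation code; the function name `validate_manifest_row` and the error messages are hypothetical.

```python
import re
import uuid
from typing import List

# Pattern from the constraint list: "sha256:" followed by 64 lowercase hex digits.
MANIFEST_HASH_RE = re.compile(r"sha256:[0-9a-f]{64}")


def validate_manifest_row(scan_id: str, seed: bytes, manifest_hash: str) -> List[str]:
    """Return the list of violated constraints (empty when the row is valid)."""
    errors = []
    if not MANIFEST_HASH_RE.fullmatch(manifest_hash):
        errors.append("manifest_hash must match sha256:[0-9a-f]{64}")
    if len(seed) != 32:
        errors.append("seed must be exactly 32 bytes")
    try:
        # uuid.UUID raises ValueError for strings that are not UUIDs at all.
        if uuid.UUID(scan_id).version != 4:
            errors.append("scan_id must be a UUID v4")
    except ValueError:
        errors.append("scan_id must be a UUID v4")
    return errors
```

Returning a list instead of raising on the first failure lets a caller report every violated constraint at once.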
Partitioning: None (lookup table, <100k rows expected)
Retention: 180 days (drop scans older than 180 days)
2. proof_bundle
Purpose: Metadata for content-addressed proof bundles (zip archives).
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Foreign key to scan_manifest.scan_id |
| `root_hash` | `text` | NOT NULL | Merkle root hash of bundle contents |
| `bundle_uri` | `text` | NOT NULL | File path or S3 URI to bundle zip |
| `proof_root_dsse_json` | `jsonb` | NOT NULL | DSSE signature of root hash |
| `created_at_utc` | `timestamptz` | NOT NULL | Bundle creation timestamp |
Primary Key: (scan_id, root_hash)
Indexes:
CREATE INDEX idx_proof_bundle_scan ON scanner.proof_bundle(scan_id);
CREATE INDEX idx_proof_bundle_created ON scanner.proof_bundle(created_at_utc DESC);
Constraints:
- `root_hash` format: `sha256:[0-9a-f]{64}`
- `bundle_uri` must be an accessible file path or S3 URI
Partitioning: None (<100k rows expected)
Retention: 365 days (compliance requirement for signed bundles)
3. cg_node (call-graph nodes)
Purpose: Stores call-graph nodes (methods/functions) extracted from artifacts.
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Partition key |
| `node_id` | `text` | NOT NULL | Deterministic node ID (hash-based) |
| `artifact_key` | `text` | NOT NULL | Artifact identifier (assembly name, JAR, etc.) |
| `symbol_key` | `text` | NOT NULL | Canonical symbol name (Namespace.Type::Method) |
| `visibility` | `text` | NOT NULL | public, internal, private, unknown |
| `flags` | `integer` | NOT NULL | Bitfield: IS_ENTRYPOINT_CANDIDATE=1, IS_VIRTUAL=2, etc. |
Primary Key: (scan_id, node_id)
Indexes:
CREATE INDEX idx_cg_node_artifact ON scanner.cg_node(scan_id, artifact_key);
CREATE INDEX idx_cg_node_symbol ON scanner.cg_node(scan_id, symbol_key);
CREATE INDEX idx_cg_node_flags ON scanner.cg_node(scan_id, flags) WHERE (flags & 1) = 1; -- Entrypoint candidates
Constraints:
- `node_id` format: `sha256:[0-9a-f]{64}` (deterministic hash)
- `visibility` must be one of: `public`, `internal`, `private`, `unknown`
Partitioning: Hash partition by scan_id (for scans with >100k nodes)
Retention: 90 days (call-graphs recomputed on rescan)
4. cg_edge (call-graph edges)
Purpose: Stores call-graph edges (invocations) between nodes.
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Partition key |
| `from_node_id` | `text` | NOT NULL | Caller node ID |
| `to_node_id` | `text` | NOT NULL | Callee node ID |
| `kind` | `smallint` | NOT NULL | 1=static, 2=heuristic |
| `reason` | `smallint` | NOT NULL | 1=direct_call, 2=virtual_call, 3=reflection_string, etc. |
| `weight` | `real` | NOT NULL | Edge confidence weight (0.0-1.0) |
Primary Key: (scan_id, from_node_id, to_node_id, kind, reason)
Indexes:
CREATE INDEX idx_cg_edge_from ON scanner.cg_edge(scan_id, from_node_id);
CREATE INDEX idx_cg_edge_to ON scanner.cg_edge(scan_id, to_node_id);
CREATE INDEX idx_cg_edge_static ON scanner.cg_edge(scan_id, kind) WHERE kind = 1;
CREATE INDEX idx_cg_edge_heuristic ON scanner.cg_edge(scan_id, kind) WHERE kind = 2;
Constraints:
- `kind` must be 1 (static) or 2 (heuristic)
- `reason` must be in range 1-10 (enum defined in code)
- `weight` must be in range [0.0, 1.0]
Partitioning: Hash partition by scan_id (for scans with >500k edges)
Retention: 90 days
Notes:
- High-volume table (1M+ rows per large scan)
- Use partial indexes for `kind` to optimize static-only queries
- Consider a GIN index on `(from_node_id, to_node_id)` for bidirectional BFS (GIN over plain `text` columns requires the `btree_gin` extension)
5. entrypoint
Purpose: Stores discovered entrypoints (HTTP routes, CLI commands, background jobs).
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Partition key |
| `node_id` | `text` | NOT NULL | Reference to cg_node.node_id |
| `kind` | `text` | NOT NULL | http, grpc, cli, job, event, unknown |
| `framework` | `text` | NOT NULL | aspnetcore, spring, express, etc. |
| `route` | `text` | NULL | HTTP route pattern (e.g., /api/orders/{id}) |
| `metadata` | `jsonb` | NULL | Framework-specific metadata |
Primary Key: (scan_id, node_id, kind, framework, route)
Indexes:
CREATE INDEX idx_entrypoint_scan ON scanner.entrypoint(scan_id);
CREATE INDEX idx_entrypoint_kind ON scanner.entrypoint(scan_id, kind);
CREATE INDEX idx_entrypoint_framework ON scanner.entrypoint(scan_id, framework);
Constraints:
- `kind` must be one of: `http`, `grpc`, `cli`, `job`, `event`, `unknown`
- `route` required for `kind='http'` or `kind='grpc'`
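The conditional rule on `route` is easy to get wrong, so here is a small Python sketch of both entrypoint constraints. The names `check_entrypoint`, `ALLOWED_KINDS`, and `ROUTE_REQUIRED_KINDS` are hypothetical, and treating an empty string as a missing route is an assumption that the specification does not state.

```python
from typing import List, Optional

ALLOWED_KINDS = {"http", "grpc", "cli", "job", "event", "unknown"}
ROUTE_REQUIRED_KINDS = {"http", "grpc"}


def check_entrypoint(kind: str, route: Optional[str]) -> List[str]:
    """Return the list of violated entrypoint constraints (empty when valid)."""
    errors = []
    if kind not in ALLOWED_KINDS:
        errors.append(f"unknown kind: {kind!r}")
    # Assumption: None and "" both count as "no route".
    if kind in ROUTE_REQUIRED_KINDS and not route:
        errors.append(f"route is required for kind={kind!r}")
    return errors
```

For example, `check_entrypoint("cli", None)` passes, while `check_entrypoint("grpc", None)` reports the missing route.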
Partitioning: None (<10k rows per scan)
Retention: 90 days
6. runtime_sample
Purpose: Stores runtime profiling samples (stack traces) for reachability validation.
Schema:
| Column | Type | Nullable | Description |
|---|---|---|---|
| `scan_id` | `text` | NOT NULL | Partition key (links to scan) |
| `collected_at` | `timestamptz` | NOT NULL | Sample collection timestamp |
| `env_hash` | `text` | NOT NULL | Environment hash (k8s ns+pod+container) |
| `sample_id` | `bigserial` | NOT NULL | Auto-incrementing sample ID |
| `timestamp` | `timestamptz` | NOT NULL | Sample timestamp |
| `pid` | `integer` | NOT NULL | Process ID |
| `thread_id` | `integer` | NOT NULL | Thread ID |
| `frames` | `text[]` | NOT NULL | Array of node IDs (stack trace) |
| `weight` | `real` | NOT NULL | Sample weight (1.0 for discrete samples) |
Primary Key: (scan_id, sample_id)
Indexes:
CREATE INDEX idx_runtime_sample_scan ON scanner.runtime_sample(scan_id, collected_at DESC);
CREATE INDEX idx_runtime_sample_frames ON scanner.runtime_sample USING GIN(frames);
CREATE INDEX idx_runtime_sample_env ON scanner.runtime_sample(scan_id, env_hash);
Constraints:
- `frames` array length must be >0 and <1000
- `weight` must be >0.0
Partitioning: TIME-BASED (monthly partitions by collected_at)
CREATE TABLE scanner.runtime_sample_2025_01 PARTITION OF scanner.runtime_sample
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
Retention: 90 days (drop old partitions automatically)
Notes:
- Highest volume table (10M+ rows for long-running services)
- GIN index on `frames[]` enables fast "find samples containing node X" queries
- Partition pruning is critical for performance
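Conceptually, a GIN index over an array column is an inverted index: it maps each array element to the rows that contain it. The Python sketch below imitates that idea in memory to show why "find samples containing node X" becomes a single lookup. It is an illustration only; `build_frame_index` is a hypothetical helper, and the short node names stand in for the real `sha256:` node IDs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_frame_index(samples: Iterable[Tuple[int, List[str]]]) -> Dict[str, Set[int]]:
    """Map each node ID to the set of sample IDs whose stack trace contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sample_id, frames in samples:
        for node_id in frames:
            index[node_id].add(sample_id)
    return index


# (sample_id, frames) pairs with placeholder node IDs.
samples = [
    (1, ["node-a", "node-b", "node-c"]),
    (2, ["node-a", "node-d"]),
    (3, ["node-e"]),
]
index = build_frame_index(samples)
samples_with_a = index["node-a"]  # one dictionary lookup instead of scanning every stack
```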
Enums (Defined in Code)
cg_edge.kind
| Value | Name | Description |
|---|---|---|
| 1 | `static` | Statically proven call edge |
| 2 | `heuristic` | Heuristic/inferred edge (reflection, DI, dynamic) |
cg_edge.reason
| Value | Name | Description |
|---|---|---|
| 1 | `direct_call` | Direct method invocation |
| 2 | `virtual_call` | Virtual/interface dispatch |
| 3 | `reflection_string` | Reflection with string name |
| 4 | `di_binding` | Dependency injection registration |
| 5 | `dynamic_import` | Dynamic module import (JS/Python) |
| 6 | `delegate_invoke` | Delegate/lambda invocation |
| 7 | `async_await` | Async method call |
| 8 | `constructor` | Object constructor invocation |
| 9 | `plt_got` | PLT/GOT indirect call (native binaries) |
| 10 | `unknown` | Unknown edge type |
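Because `kind` and `reason` are stored as `smallint`, readers of the table need the same value-to-name mapping as the writer. The sketch below mirrors the two tables above as Python `IntEnum` classes; the class names and the `decode_edge` helper are hypothetical and are not the enum definitions the scanner code itself uses.

```python
from enum import IntEnum


class EdgeKind(IntEnum):
    STATIC = 1
    HEURISTIC = 2


class EdgeReason(IntEnum):
    DIRECT_CALL = 1
    VIRTUAL_CALL = 2
    REFLECTION_STRING = 3
    DI_BINDING = 4
    DYNAMIC_IMPORT = 5
    DELEGATE_INVOKE = 6
    ASYNC_AWAIT = 7
    CONSTRUCTOR = 8
    PLT_GOT = 9
    UNKNOWN = 10


def decode_edge(kind: int, reason: int, weight: float):
    """Decode stored smallints; raises ValueError for values outside the documented ranges."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in range [0.0, 1.0]")
    # EdgeKind(...) and EdgeReason(...) raise ValueError for unmapped integers.
    return EdgeKind(kind), EdgeReason(reason), weight
```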
cg_node.flags (Bitfield)
| Bit | Flag | Description |
|---|---|---|
| 0 | `IS_ENTRYPOINT_CANDIDATE` | Node could be an entrypoint |
| 1 | `IS_VIRTUAL` | Virtual or interface method |
| 2 | `IS_ASYNC` | Async method |
| 3 | `IS_CONSTRUCTOR` | Constructor method |
| 4 | `IS_EXPORTED` | Publicly exported (for native binaries) |
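The table lists bit positions, while the `flags` column stores the combined integer (bit 0 has value 1, bit 1 has value 2, and so on). The following Python sketch shows the conversion with `IntFlag`; `NodeFlags` and `is_entrypoint_candidate` are hypothetical names chosen for illustration.

```python
from enum import IntFlag


class NodeFlags(IntFlag):
    IS_ENTRYPOINT_CANDIDATE = 1 << 0
    IS_VIRTUAL = 1 << 1
    IS_ASYNC = 1 << 2
    IS_CONSTRUCTOR = 1 << 3
    IS_EXPORTED = 1 << 4


def is_entrypoint_candidate(flags: int) -> bool:
    # Same test as the partial index predicate on cg_node: (flags & 1) = 1
    return (flags & NodeFlags.IS_ENTRYPOINT_CANDIDATE) == NodeFlags.IS_ENTRYPOINT_CANDIDATE


# An async entrypoint candidate is stored as 1 | 4 = 5.
stored = int(NodeFlags.IS_ENTRYPOINT_CANDIDATE | NodeFlags.IS_ASYNC)
```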
Schema Evolution
Migration Categories
Per docs/db/SPECIFICATION.md:
| Category | Prefix | Execution | Description |
|---|---|---|---|
| Startup (A) | `001-099` | Automatic at boot | Non-breaking DDL (CREATE IF NOT EXISTS) |
| Release (B) | `100-199` | Manual via CLI | Breaking changes (requires maintenance window) |
| Seed | `S001-S999` | After schema | Reference data with ON CONFLICT DO NOTHING |
| Data (C) | `DM001-DM999` | Background job | Batched data transformations |
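The prefix ranges above are enough to derive a migration's category from its file name. Below is a minimal Python sketch of that mapping. The function `classify_migration` is hypothetical, and it assumes the prefix is followed by an underscore, as in `010_scanner_schema.sql`.

```python
import re
from typing import Optional

# Optional letter prefix (DM or S), three digits, then an underscore.
_PREFIX_RE = re.compile(r"(DM|S)?(\d{3})_")


def classify_migration(filename: str) -> Optional[str]:
    """Return the migration category for a file name, or None if no prefix rule matches."""
    match = _PREFIX_RE.match(filename)
    if not match:
        return None
    letters, number = match.group(1), int(match.group(2))
    if letters == "DM" and number >= 1:
        return "Data (C)"
    if letters == "S" and number >= 1:
        return "Seed"
    if letters is None and 1 <= number <= 99:
        return "Startup (A)"
    if letters is None and 100 <= number <= 199:
        return "Release (B)"
    return None
```

A check like this could run in CI to reject a file whose prefix falls outside every documented range, such as `200_other.sql`.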
Upcoming Migrations
| Migration | Category | Sprint | Description |
|---|---|---|---|
| `010_scanner_schema.sql` | Startup (A) | 3500.0002.0001 | Create scanner schema, scan_manifest, proof_bundle |
| `011_call_graph_tables.sql` | Startup (A) | 3500.0003.0002 | Create cg_node, cg_edge, entrypoint |
| `012_runtime_sample_partitions.sql` | Startup (A) | 3500.0003.0004 | Create runtime_sample with monthly partitions |
| `S001_seed_edge_reasons.sql` | Seed | 3500.0003.0002 | Seed edge reason lookup table |
Performance Considerations
Query Patterns
High-frequency queries:
- Scan manifest lookup by artifact:
  SELECT * FROM scanner.scan_manifest WHERE artifact_digest = $1 ORDER BY created_at_utc DESC LIMIT 1;
  - Index: `idx_scan_manifest_artifact`
- Reachability BFS (forward):
  SELECT to_node_id FROM scanner.cg_edge WHERE scan_id = $1 AND from_node_id = ANY($2) AND kind = 1;
  - Index: `idx_cg_edge_from`
- Reachability BFS (backward):
  SELECT from_node_id FROM scanner.cg_edge WHERE scan_id = $1 AND to_node_id = $2 AND kind = 1;
  - Index: `idx_cg_edge_to`
- Find runtime samples containing node:
  SELECT * FROM scanner.runtime_sample WHERE scan_id = $1 AND frames @> ARRAY[$2]::text[];
  - Index: `idx_runtime_sample_frames` (GIN; used by the `@>` containment operator, not by `$2 = ANY(frames)`)
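The forward BFS query is meant to be issued once per frontier level, passing the whole frontier as the array parameter. The Python sketch below shows that loop under stated assumptions: `fetch_callees` is a hypothetical callable standing in for the SQL query, and `STATIC_EDGES` is an in-memory stand-in for the static edges (`kind = 1`) of one scan.

```python
from typing import Callable, Dict, Iterable, List, Set


def reachable_nodes(
    entry_nodes: Iterable[str],
    fetch_callees: Callable[[List[str]], Iterable[str]],
) -> Set[str]:
    """Level-by-level BFS: one fetch_callees call per frontier, like from_node_id = ANY($2)."""
    visited: Set[str] = set(entry_nodes)
    frontier: List[str] = sorted(visited)
    while frontier:
        # Drop already-visited nodes so cycles in the call graph terminate.
        next_frontier = set(fetch_callees(frontier)) - visited
        visited |= next_frontier
        frontier = sorted(next_frontier)  # sorted so query parameters are deterministic
    return visited


# Placeholder call graph: caller -> callees.
STATIC_EDGES: Dict[str, List[str]] = {
    "main": ["parse", "run"],
    "run": ["parse", "write"],
    "write": [],
    "orphan": ["write"],
}


def fetch_from_memory(frontier: List[str]) -> List[str]:
    return [callee for caller in frontier for callee in STATIC_EDGES.get(caller, [])]


reached = reachable_nodes(["main"], fetch_from_memory)
```

Batching per level keeps the number of round trips equal to the graph depth, not the number of nodes, which matters for scans with 1M+ edges.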
Index Maintenance
Reindex schedule:
- `cg_edge` indexes: Weekly (high churn)
- `runtime_sample` GIN index: Monthly (after partition drops)
Vacuum:
- Autovacuum enabled for all tables
- Manual VACUUM ANALYZE after bulk inserts (>1M rows)
Partition Management
Automated partition creation (cron job):
-- Create next month's partition 7 days in advance
CREATE TABLE IF NOT EXISTS scanner.runtime_sample_2025_02 PARTITION OF scanner.runtime_sample
FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
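A cron job needs to compute the partition name and bounds for an arbitrary month, including the December rollover. The Python sketch below generates the same statement shape as the example above; `month_partition_ddl` is a hypothetical helper, and how the resulting SQL is executed is left out.

```python
from datetime import date


def month_partition_ddl(year: int, month: int) -> str:
    """Build the CREATE TABLE statement for one monthly runtime_sample partition."""
    start = date(year, month, 1)
    # Upper bound is the first day of the following month; December rolls into next year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"scanner.runtime_sample_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF scanner.runtime_sample\n"
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )
```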
Automated partition dropping (90-day retention):
DROP TABLE IF EXISTS scanner.runtime_sample_2024_10; -- Older than 90 days
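A monthly partition can only be dropped once every row in it is past retention, that is, once the partition's upper bound is at or before the cutoff date. The Python sketch below selects such partitions. `expired_partitions` is a hypothetical helper, and it assumes partition names end in `_YYYY_MM` as in the examples above.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 90


def expired_partitions(partition_names: Iterable[str], today: date) -> List[str]:
    """Return partitions whose entire month lies before the retention cutoff."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = []
    for name in partition_names:
        # Assumption: the name ends in _YYYY_MM, e.g. runtime_sample_2025_01.
        year, month = (int(part) for part in name.rsplit("_", 2)[-2:])
        upper_bound = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        if upper_bound <= cutoff:
            expired.append(name)
    return expired
```

On 2025-02-15 the cutoff is 2024-11-17, so `runtime_sample_2024_10` qualifies while `runtime_sample_2024_11` still holds rows inside the retention window.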
Compliance & Auditing
DSSE Signatures
All proof bundles and manifests include DSSE signatures:
- `manifest_dsse_json` in `scan_manifest`
- `proof_root_dsse_json` in `proof_bundle`
Verification:
- Signatures verified on read using `IContentSigner.Verify`
- Invalid signatures → reject proof bundle
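In DSSE, the signature does not cover the raw payload but its pre-authentication encoding (PAE), which binds the payload type to the payload bytes. The Python sketch below recomputes those signed bytes from an envelope. It assumes the stored envelopes use the standard DSSE fields `payloadType` and `payload` with standard base64, which this document does not confirm, and it does not show what `IContentSigner.Verify` does internally. The cryptographic signature check itself is out of scope.

```python
import base64


def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes a DSSE signature covers."""
    type_bytes = payload_type.encode("utf-8")
    # "DSSEv1" SP LEN(type) SP type SP LEN(payload) SP payload, lengths in ASCII decimal.
    return b"DSSEv1 %d %b %d %b" % (len(type_bytes), type_bytes, len(payload), payload)


def signed_bytes(envelope: dict) -> bytes:
    """Recompute the signed bytes from a parsed envelope (assumed standard DSSE layout)."""
    payload = base64.b64decode(envelope["payload"])
    return dsse_pae(envelope["payloadType"], payload)
```

A verifier would pass the result of `signed_bytes` together with each entry in the envelope's signature list to the signature algorithm of the configured key.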
Immutability
Immutable tables:
- `scan_manifest`: No updates allowed after insert
- `proof_bundle`: No updates allowed after insert
Enforcement: Application-level (no UPDATE grants in production)
Retention Policies
| Table | Retention | Enforcement |
|---|---|---|
| `scan_manifest` | 180 days | DELETE WHERE created_at_utc < NOW() - INTERVAL '180 days' |
| `proof_bundle` | 365 days | DELETE WHERE created_at_utc < NOW() - INTERVAL '365 days' |
| `cg_node` | 90 days | CASCADE delete on scan_manifest |
| `cg_edge` | 90 days | CASCADE delete on scan_manifest |
| `runtime_sample` | 90 days | DROP PARTITION (monthly) |
Monitoring
Key Metrics
- Table sizes:
  SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) FROM pg_tables WHERE schemaname = 'scanner';
- Index usage:
  SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE schemaname = 'scanner' ORDER BY idx_scan DESC;
- Partition sizes:
  SELECT tablename, pg_size_pretty(pg_total_relation_size('scanner.'||tablename)) FROM pg_tables WHERE schemaname = 'scanner' AND tablename LIKE 'runtime_sample_%' ORDER BY tablename DESC;
Alerts
- Table growth: Alert if `cg_edge` >10GB per scan
- Index bloat: Alert if index size >2x expected
- Partition creation: Alert if next month's partition is not created 7 days in advance
- Vacuum lag: Alert if the last autovacuum ran more than 7 days ago
References
- `docs/07_HIGH_LEVEL_ARCHITECTURE.md`: Schema isolation design
- `docs/db/SPECIFICATION.md`: Database specification
- `docs/operations/postgresql-guide.md`: Operations guide
- `SPRINT_3500_0002_0001_score_proofs_foundations.md`: Implementation sprint
- `SPRINT_3500_0003_0002_reachability_dotnet_call_graphs.md`: Call-graph implementation
Last Updated: 2025-12-17
Schema Version: 1.0
Next Review: Sprint 3500.0003.0004 (partition strategy)