Scanner Schema Specification

Schema: scanner
Owner: Scanner.WebService
Purpose: Scan orchestration, call-graphs, proof bundles, reachability analysis
Sprint: SPRINT_3500_0002_0001, SPRINT_3500_0003_0002

Overview

The scanner schema contains all tables related to:

Scan manifests and deterministic replay
Proof bundles (content-addressed storage metadata)
Call-graph nodes and edges (reachability analysis)
Entrypoints (framework-specific entry discovery)
Runtime samples (profiling data for reachability validation)

Design Principles:

All tables are keyed by scan_id (the leading primary-key column) for scan isolation
Deterministic data only (no timestamps in core algorithms)
Content-addressed references (hashes, not paths)
Forward-only schema evolution

Tables

1. scan_manifest

Purpose: Stores immutable scan manifests capturing all inputs for deterministic replay.

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Primary key; UUID format
`created_at_utc`	`timestamptz`	NOT NULL	Scan creation timestamp
`artifact_digest`	`text`	NOT NULL	Image/artifact digest (sha256:...)
`artifact_purl`	`text`	NULL	PURL identifier (pkg:oci/...)
`scanner_version`	`text`	NOT NULL	Scanner.WebService version
`worker_version`	`text`	NOT NULL	Scanner.Worker version
`concelier_snapshot_hash`	`text`	NOT NULL	Concelier feed snapshot digest
`excititor_snapshot_hash`	`text`	NOT NULL	Excititor VEX snapshot digest
`lattice_policy_hash`	`text`	NOT NULL	Policy bundle digest
`deterministic`	`boolean`	NOT NULL	Whether scan used deterministic mode
`seed`	`bytea`	NOT NULL	32-byte deterministic seed
`knobs`	`jsonb`	NULL	Configuration knobs (depth limits, etc.)
`manifest_hash`	`text`	NOT NULL	SHA-256 of canonical manifest JSON (UNIQUE)
`manifest_json`	`jsonb`	NOT NULL	Canonical JSON manifest
`manifest_dsse_json`	`jsonb`	NOT NULL	DSSE signature envelope

Indexes:

CREATE INDEX idx_scan_manifest_artifact ON scanner.scan_manifest(artifact_digest);
CREATE INDEX idx_scan_manifest_snapshots ON scanner.scan_manifest(concelier_snapshot_hash, excititor_snapshot_hash);
CREATE INDEX idx_scan_manifest_created ON scanner.scan_manifest(created_at_utc DESC);
CREATE UNIQUE INDEX idx_scan_manifest_hash ON scanner.scan_manifest(manifest_hash);

Constraints:

manifest_hash format: sha256:[0-9a-f]{64}
seed must be exactly 32 bytes
scan_id format: UUID v4
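
The three rules above could be enforced in the database with CHECK constraints. The following is a sketch under stated assumptions, not the project's actual migration: the constraint names are hypothetical, and the UUID pattern assumes scan_id is stored in lowercase canonical form.

```sql
-- Hypothetical constraint names; patterns assume lowercase hex throughout.
ALTER TABLE scanner.scan_manifest
    -- manifest_hash format: sha256:[0-9a-f]{64}
    ADD CONSTRAINT chk_scan_manifest_hash_format
        CHECK (manifest_hash ~ '^sha256:[0-9a-f]{64}$'),
    -- seed must be exactly 32 bytes
    ADD CONSTRAINT chk_scan_manifest_seed_length
        CHECK (octet_length(seed) = 32),
    -- scan_id format: UUID v4 (version nibble 4, variant nibble 8, 9, a or b)
    ADD CONSTRAINT chk_scan_manifest_scan_id_uuid_v4
        CHECK (scan_id ~ '^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$');
```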

Partitioning: None (lookup table, <100k rows expected)

Retention: 180 days (drop scans older than 180 days)

2. proof_bundle

Purpose: Metadata for content-addressed proof bundles (zip archives).

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Foreign key to `scan_manifest.scan_id`
`root_hash`	`text`	NOT NULL	Merkle root hash of bundle contents
`bundle_uri`	`text`	NOT NULL	File path or S3 URI to bundle zip
`proof_root_dsse_json`	`jsonb`	NOT NULL	DSSE signature of root hash
`created_at_utc`	`timestamptz`	NOT NULL	Bundle creation timestamp

Primary Key: (scan_id, root_hash)

Indexes:

CREATE INDEX idx_proof_bundle_scan ON scanner.proof_bundle(scan_id);
CREATE INDEX idx_proof_bundle_created ON scanner.proof_bundle(created_at_utc DESC);

Constraints:

root_hash format: sha256:[0-9a-f]{64}
bundle_uri must be accessible file path or S3 URI
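
A minimal DDL sketch consistent with the column table above, assuming scan_manifest.scan_id is that table's primary key. The ON DELETE behaviour is not stated in this document and is left at the PostgreSQL default here. Whether bundle_uri is accessible cannot be checked by the database and remains an application concern.

```sql
-- Sketch only: the foreign-key delete behaviour is an assumption (default NO ACTION).
CREATE TABLE IF NOT EXISTS scanner.proof_bundle (
    scan_id              text        NOT NULL
        REFERENCES scanner.scan_manifest (scan_id),
    root_hash            text        NOT NULL
        CHECK (root_hash ~ '^sha256:[0-9a-f]{64}$'),
    bundle_uri           text        NOT NULL,
    proof_root_dsse_json jsonb       NOT NULL,
    created_at_utc       timestamptz NOT NULL,
    PRIMARY KEY (scan_id, root_hash)
);
```

With the default NO ACTION, deleting a scan_manifest row fails while a proof_bundle row still references it. This is worth keeping in mind because the two tables have different retention periods (180 and 365 days).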

Partitioning: None (<100k rows expected)

Retention: 365 days (compliance requirement for signed bundles)

3. cg_node (call-graph nodes)

Purpose: Stores call-graph nodes (methods/functions) extracted from artifacts.

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Partition key
`node_id`	`text`	NOT NULL	Deterministic node ID (hash-based)
`artifact_key`	`text`	NOT NULL	Artifact identifier (assembly name, JAR, etc.)
`symbol_key`	`text`	NOT NULL	Canonical symbol name (Namespace.Type::Method)
`visibility`	`text`	NOT NULL	`public`, `internal`, `private`, `unknown`
`flags`	`integer`	NOT NULL	Bitfield: `IS_ENTRYPOINT_CANDIDATE=1`, `IS_VIRTUAL=2`, etc.

Primary Key: (scan_id, node_id)

Indexes:

CREATE INDEX idx_cg_node_artifact ON scanner.cg_node(scan_id, artifact_key);
CREATE INDEX idx_cg_node_symbol ON scanner.cg_node(scan_id, symbol_key);
CREATE INDEX idx_cg_node_flags ON scanner.cg_node(scan_id, flags) WHERE (flags & 1) = 1; -- Entrypoint candidates

Constraints:

node_id format: sha256:[0-9a-f]{64} (deterministic hash)
visibility must be one of: public, internal, private, unknown

Partitioning: Hash partition by scan_id (for scans with >100k nodes)

Retention: 90 days (call-graphs recomputed on rescan)

4. cg_edge (call-graph edges)

Purpose: Stores call-graph edges (invocations) between nodes.

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Partition key
`from_node_id`	`text`	NOT NULL	Caller node ID
`to_node_id`	`text`	NOT NULL	Callee node ID
`kind`	`smallint`	NOT NULL	`1=static`, `2=heuristic`
`reason`	`smallint`	NOT NULL	`1=direct_call`, `2=virtual_call`, `3=reflection_string`, etc.
`weight`	`real`	NOT NULL	Edge confidence weight (0.0-1.0)

Primary Key: (scan_id, from_node_id, to_node_id, kind, reason)

Indexes:

CREATE INDEX idx_cg_edge_from ON scanner.cg_edge(scan_id, from_node_id);
CREATE INDEX idx_cg_edge_to ON scanner.cg_edge(scan_id, to_node_id);
CREATE INDEX idx_cg_edge_static ON scanner.cg_edge(scan_id, kind) WHERE kind = 1;
CREATE INDEX idx_cg_edge_heuristic ON scanner.cg_edge(scan_id, kind) WHERE kind = 2;

Constraints:

kind must be 1 (static) or 2 (heuristic)
reason must be in range 1-10 (enum defined in code)
weight must be in range [0.0, 1.0]

Partitioning: Hash partition by scan_id (for scans with >500k edges)

Retention: 90 days

Notes:

High-volume table (1M+ rows per large scan)
Use partial indexes for kind to optimize static-only queries
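Bidirectional BFS is served by the B-tree indexes idx_cg_edge_from and idx_cg_edge_to; a GIN index on the scalar text columns (from_node_id, to_node_id) would require the btree_gin extension and generally does not outperform the equivalent B-tree indexes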

5. entrypoint

Purpose: Stores discovered entrypoints (HTTP routes, CLI commands, background jobs).

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Partition key
`node_id`	`text`	NOT NULL	Reference to `cg_node.node_id`
`kind`	`text`	NOT NULL	`http`, `grpc`, `cli`, `job`, `event`, `unknown`
`framework`	`text`	NOT NULL	`aspnetcore`, `spring`, `express`, etc.
`route`	`text`	NULL	HTTP route pattern (e.g., `/api/orders/{id}`)
`metadata`	`jsonb`	NULL	Framework-specific metadata

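Primary Key: (scan_id, node_id, kind, framework, route). PostgreSQL primary-key columns are implicitly NOT NULL, so this key conflicts with the nullable route column above; entrypoints without a route need either a non-NULL placeholder (such as an empty string) or a unique index in place of the primary key.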

Indexes:

CREATE INDEX idx_entrypoint_scan ON scanner.entrypoint(scan_id);
CREATE INDEX idx_entrypoint_kind ON scanner.entrypoint(scan_id, kind);
CREATE INDEX idx_entrypoint_framework ON scanner.entrypoint(scan_id, framework);

Constraints:

kind must be one of: http, grpc, cli, job, event, unknown
route required for kind='http' or kind='grpc'

Partitioning: None (<10k rows per scan)

Retention: 90 days

6. runtime_sample

Purpose: Stores runtime profiling samples (stack traces) for reachability validation.

Schema:

Column	Type	Nullable	Description
`scan_id`	`text`	NOT NULL	Partition key (links to scan)
`collected_at`	`timestamptz`	NOT NULL	Sample collection timestamp
`env_hash`	`text`	NOT NULL	Environment hash (k8s ns+pod+container)
`sample_id`	`bigserial`	NOT NULL	Auto-incrementing sample ID
`timestamp`	`timestamptz`	NOT NULL	Sample timestamp
`pid`	`integer`	NOT NULL	Process ID
`thread_id`	`integer`	NOT NULL	Thread ID
`frames`	`text[]`	NOT NULL	Array of node IDs (stack trace)
`weight`	`real`	NOT NULL	Sample weight (1.0 for discrete samples)

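Primary Key: (scan_id, sample_id). PostgreSQL requires a primary key on a partitioned table to include every partition-key column, so with the range partitioning on collected_at described below the declared key has to include collected_at, e.g. (scan_id, collected_at, sample_id).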

Indexes:

CREATE INDEX idx_runtime_sample_scan ON scanner.runtime_sample(scan_id, collected_at DESC);
CREATE INDEX idx_runtime_sample_frames ON scanner.runtime_sample USING GIN(frames);
CREATE INDEX idx_runtime_sample_env ON scanner.runtime_sample(scan_id, env_hash);

Constraints:

frames array length must be >0 and <1000
weight must be >0.0

Partitioning: TIME-BASED (monthly partitions by collected_at)

CREATE TABLE scanner.runtime_sample_2025_01 PARTITION OF scanner.runtime_sample
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
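
The partition above attaches to a range-partitioned parent table. The sketch below shows one way that parent could be declared from the column table, assuming PostgreSQL 11 or later. The primary key includes collected_at because PostgreSQL requires unique constraints on a partitioned table to cover the partition key; that choice is an assumption of this sketch, not a confirmed part of the schema.

```sql
-- Sketch of the partitioned parent table; key layout is an assumption.
CREATE TABLE IF NOT EXISTS scanner.runtime_sample (
    scan_id      text        NOT NULL,
    collected_at timestamptz NOT NULL,
    env_hash     text        NOT NULL,
    sample_id    bigserial   NOT NULL,
    "timestamp"  timestamptz NOT NULL,
    pid          integer     NOT NULL,
    thread_id    integer     NOT NULL,
    -- frames array length must be >0 and <1000
    frames       text[]      NOT NULL
        CHECK (cardinality(frames) > 0 AND cardinality(frames) < 1000),
    -- weight must be >0.0
    weight       real        NOT NULL CHECK (weight > 0.0),
    PRIMARY KEY (scan_id, collected_at, sample_id)
) PARTITION BY RANGE (collected_at);
```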

Retention: 90 days (drop old partitions automatically)

Notes:

Highest volume table (10M+ rows for long-running services)
GIN index on frames[] enables fast "find samples containing node X" queries
Partition pruning critical for performance

Enums (Defined in Code)

cg_edge.kind

Value	Name	Description
1	`static`	Statically proven call edge
2	`heuristic`	Heuristic/inferred edge (reflection, DI, dynamic)

cg_edge.reason

Value	Name	Description
1	`direct_call`	Direct method invocation
2	`virtual_call`	Virtual/interface dispatch
3	`reflection_string`	Reflection with string name
4	`di_binding`	Dependency injection registration
5	`dynamic_import`	Dynamic module import (JS/Python)
6	`delegate_invoke`	Delegate/lambda invocation
7	`async_await`	Async method call
8	`constructor`	Object constructor invocation
9	`plt_got`	PLT/GOT indirect call (native binaries)
10	`unknown`	Unknown edge type

cg_node.flags (Bitfield)

Bit	Flag	Description
0	`IS_ENTRYPOINT_CANDIDATE`	Node could be an entrypoint
1	`IS_VIRTUAL`	Virtual or interface method
2	`IS_ASYNC`	Async method
3	`IS_CONSTRUCTOR`	Constructor method
4	`IS_EXPORTED`	Publicly exported (for native binaries)
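
Bit n in the table corresponds to the mask value 2^n (bit 0 = 1, bit 1 = 2, bit 2 = 4, and so on). As an illustration, the flags column can be decoded in a query like the one below. The output column aliases are chosen for this example only.

```sql
-- Decode the cg_node.flags bitfield into one boolean per flag.
SELECT node_id,
       (flags & 1)  <> 0 AS is_entrypoint_candidate,  -- bit 0
       (flags & 2)  <> 0 AS is_virtual,               -- bit 1
       (flags & 4)  <> 0 AS is_async,                 -- bit 2
       (flags & 8)  <> 0 AS is_constructor,           -- bit 3
       (flags & 16) <> 0 AS is_exported               -- bit 4
FROM scanner.cg_node
WHERE scan_id = $1;

-- Entrypoint candidates only; the predicate matches the partial index
-- idx_cg_node_flags, which is defined WHERE (flags & 1) = 1.
SELECT node_id, symbol_key
FROM scanner.cg_node
WHERE scan_id = $1 AND (flags & 1) = 1;
```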

Schema Evolution

Migration Categories

Per docs/db/SPECIFICATION.md:

Category	Prefix	Execution	Description
Startup (A)	`001-099`	Automatic at boot	Non-breaking DDL (CREATE IF NOT EXISTS)
Release (B)	`100-199`	Manual via CLI	Breaking changes (requires maintenance window)
Seed	`S001-S999`	After schema	Reference data with ON CONFLICT DO NOTHING
Data (C)	`DM001-DM999`	Background job	Batched data transformations

Upcoming Migrations

Migration	Category	Sprint	Description
`010_scanner_schema.sql`	Startup (A)	3500.0002.0001	Create scanner schema, scan_manifest, proof_bundle
`011_call_graph_tables.sql`	Startup (A)	3500.0003.0002	Create cg_node, cg_edge, entrypoint
`012_runtime_sample_partitions.sql`	Startup (A)	3500.0003.0004	Create runtime_sample with monthly partitions
`S001_seed_edge_reasons.sql`	Seed	3500.0003.0002	Seed edge reason lookup table
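
To illustrate how the Startup and Seed categories fit together, the following sketch pairs an idempotent table creation with a seed insert that uses ON CONFLICT DO NOTHING. The table name cg_edge_reason and its columns are hypothetical; the actual layout of the edge reason lookup table is not given in this document.

```sql
-- Startup (A) style: safe to run on every boot.
-- Table and column names are assumptions made for this example.
CREATE TABLE IF NOT EXISTS scanner.cg_edge_reason (
    value smallint PRIMARY KEY,
    name  text     NOT NULL UNIQUE
);

-- Seed style: re-running leaves existing rows untouched.
INSERT INTO scanner.cg_edge_reason (value, name) VALUES
    (1, 'direct_call'),
    (2, 'virtual_call'),
    (3, 'reflection_string'),
    (4, 'di_binding'),
    (5, 'dynamic_import'),
    (6, 'delegate_invoke'),
    (7, 'async_await'),
    (8, 'constructor'),
    (9, 'plt_got'),
    (10, 'unknown')
ON CONFLICT DO NOTHING;
```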

Performance Considerations

Query Patterns

High-frequency queries:

Scan manifest lookup by artifact:

SELECT * FROM scanner.scan_manifest
WHERE artifact_digest = $1
ORDER BY created_at_utc DESC LIMIT 1;

Index: idx_scan_manifest_artifact

Reachability BFS (forward):

SELECT to_node_id FROM scanner.cg_edge
WHERE scan_id = $1 AND from_node_id = ANY($2) AND kind = 1;

Index: idx_cg_edge_from

Reachability BFS (backward):

SELECT from_node_id FROM scanner.cg_edge
WHERE scan_id = $1 AND to_node_id = $2 AND kind = 1;

Index: idx_cg_edge_to
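
The forward query above expands one BFS level per round trip. As an alternative sketch (not necessarily how Scanner.WebService performs the traversal), the whole static reachable set can be computed in one statement with a recursive CTE seeded from the entrypoint table. UNION rather than UNION ALL discards nodes already visited, so the recursion terminates on cyclic call graphs.

```sql
-- Forward reachability over static edges (kind = 1), starting at entrypoints.
WITH RECURSIVE reachable(node_id) AS (
    -- Seed: every discovered entrypoint of the scan
    SELECT node_id
    FROM scanner.entrypoint
    WHERE scan_id = $1
  UNION
    -- Step: follow static edges out of nodes found so far
    SELECT e.to_node_id
    FROM reachable r
    JOIN scanner.cg_edge e
      ON e.scan_id = $1
     AND e.from_node_id = r.node_id
     AND e.kind = 1
)
SELECT node_id FROM reachable;
```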

Find runtime samples containing node:

SELECT * FROM scanner.runtime_sample
WHERE scan_id = $1 AND frames @> ARRAY[$2]::text[]; -- @> can use the GIN index; $2 = ANY(frames) cannot

Index: idx_runtime_sample_frames (GIN)

Index Maintenance

Reindex schedule:

cg_edge indexes: Weekly (high churn)
runtime_sample GIN index: Monthly (after partition drops)

Vacuum:

Autovacuum enabled for all tables
Manual VACUUM ANALYZE after bulk inserts (>1M rows)
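
One possible form of the maintenance steps above, assuming PostgreSQL 12 or later for REINDEX ... CONCURRENTLY (rebuilding an index on a partitioned table this way needs PostgreSQL 14 or later). Neither REINDEX CONCURRENTLY nor VACUUM can run inside a transaction block.

```sql
-- Weekly: rebuild the high-churn cg_edge indexes without blocking writes.
REINDEX INDEX CONCURRENTLY scanner.idx_cg_edge_from;
REINDEX INDEX CONCURRENTLY scanner.idx_cg_edge_to;

-- After a bulk insert (>1M rows): reclaim space and refresh planner statistics.
VACUUM (ANALYZE) scanner.cg_edge;
```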

Partition Management

Automated partition creation (cron job):

-- Create next month's partition 7 days in advance
CREATE TABLE IF NOT EXISTS scanner.runtime_sample_2025_02 PARTITION OF scanner.runtime_sample
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

Automated partition dropping (90-day retention):

DROP TABLE IF EXISTS scanner.runtime_sample_2024_10; -- Older than 90 days
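
The statements above hard-code partition names. A cron job would more likely compute them from the current date; the anonymous PL/pgSQL block below is a sketch of that idea. It follows the runtime_sample_YYYY_MM naming used above, and the variable names are illustrative.

```sql
-- Create next month's partition if it does not exist yet.
DO $$
DECLARE
    next_start date := date_trunc('month', now() + interval '1 month')::date;
    next_end   date := (next_start + interval '1 month')::date;
    part_name  text := 'runtime_sample_' || to_char(next_start, 'YYYY_MM');
BEGIN
    -- %I quotes the identifier, %L quotes the date bounds as literals.
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS scanner.%I PARTITION OF scanner.runtime_sample
             FOR VALUES FROM (%L) TO (%L)',
        part_name, next_start, next_end);
END
$$;
```

Because the block uses IF NOT EXISTS, running it daily is harmless; the partition is created once, well before the month begins.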

Compliance & Auditing

DSSE Signatures

All proof bundles and manifests include DSSE signatures:

manifest_dsse_json in scan_manifest
proof_root_dsse_json in proof_bundle

Verification:

Signatures verified on read using IContentSigner.Verify
Invalid signatures → reject proof bundle

Immutability

Immutable tables:

scan_manifest — No updates allowed after insert
proof_bundle — No updates allowed after insert

Enforcement: Privilege-based (production roles are granted no UPDATE on these tables)
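
Withholding UPDATE can be expressed directly as a privilege statement. In this sketch the role name scanner_app is hypothetical; DELETE is left untouched because the retention policies rely on it.

```sql
-- scanner_app is a hypothetical production role name.
REVOKE UPDATE ON scanner.scan_manifest, scanner.proof_bundle FROM scanner_app;
```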

Retention Policies

Table	Retention	Enforcement
`scan_manifest`	180 days	DELETE WHERE created_at_utc < NOW() - INTERVAL '180 days'
`proof_bundle`	365 days	DELETE WHERE created_at_utc < NOW() - INTERVAL '365 days'
`cg_node`	90 days	CASCADE delete on scan_manifest
`cg_edge`	90 days	CASCADE delete on scan_manifest
`runtime_sample`	90 days	DROP PARTITION (monthly)
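
The "CASCADE delete on scan_manifest" entries imply foreign keys declared with ON DELETE CASCADE. The sketch below shows one way to wire this up, with hypothetical constraint names and assuming PostgreSQL 11 or later if the call-graph tables are partitioned.

```sql
-- Hypothetical constraint names.
ALTER TABLE scanner.cg_node
    ADD CONSTRAINT fk_cg_node_scan_manifest
    FOREIGN KEY (scan_id) REFERENCES scanner.scan_manifest (scan_id)
    ON DELETE CASCADE;

ALTER TABLE scanner.cg_edge
    ADD CONSTRAINT fk_cg_edge_scan_manifest
    FOREIGN KEY (scan_id) REFERENCES scanner.scan_manifest (scan_id)
    ON DELETE CASCADE;

-- Retention job: removing expired manifests also removes their call-graph rows.
DELETE FROM scanner.scan_manifest
WHERE created_at_utc < NOW() - INTERVAL '180 days';
```

A cascade fires only when the parent manifest row is deleted, so enforcing the shorter 90-day window for call-graph rows would additionally need a scheduled delete of its own.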

Monitoring

Key Metrics

Table sizes:

SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables WHERE schemaname = 'scanner';

Index usage:

SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'scanner'
ORDER BY idx_scan DESC;

Partition sizes:

SELECT tablename, pg_size_pretty(pg_total_relation_size('scanner.'||tablename))
FROM pg_tables
WHERE schemaname = 'scanner' AND tablename LIKE 'runtime_sample_%'
ORDER BY tablename DESC;

Alerts

Table growth: Alert if cg_edge >10GB per scan
Index bloat: Alert if index size >2x expected
Partition creation: Alert if next month's partition not created 7 days in advance
Vacuum lag: Alert if the last autovacuum ran more than 7 days ago
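
The vacuum-lag alert can be driven by the standard pg_stat_user_tables view. A sketch of such a check, treating tables that have never been autovacuumed as lagging:

```sql
-- Tables in the scanner schema with no autovacuum in the last 7 days.
SELECT relname, last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'scanner'
  AND (last_autovacuum IS NULL
       OR last_autovacuum < now() - interval '7 days')
ORDER BY last_autovacuum NULLS FIRST;
```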

References

docs/07_HIGH_LEVEL_ARCHITECTURE.md — Schema isolation design
docs/db/SPECIFICATION.md — Database specification
docs/operations/postgresql-guide.md — Operations guide
SPRINT_3500_0002_0001_score_proofs_foundations.md — Implementation sprint
SPRINT_3500_0003_0002_reachability_dotnet_call_graphs.md — Call-graph implementation

Last Updated: 2025-12-17
Schema Version: 1.0
Next Review: Sprint 3500.0003.0004 (partition strategy)
