Files
git.stella-ops.org/docs/db/schemas/scanner_schema_specification.md
master 8bbfe4d2d2 feat(rate-limiting): Implement core rate limiting functionality with configuration, decision-making, metrics, middleware, and service registration
- Add RateLimitConfig for configuration management with YAML binding support.
- Introduce RateLimitDecision to encapsulate the result of rate limit checks.
- Implement RateLimitMetrics for OpenTelemetry metrics tracking.
- Create RateLimitMiddleware for enforcing rate limits on incoming requests.
- Develop RateLimitService to orchestrate instance and environment rate limit checks.
- Add RateLimitServiceCollectionExtensions for dependency injection registration.
2025-12-17 18:02:37 +02:00

15 KiB

Scanner Schema Specification

Schema: scanner Owner: Scanner.WebService Purpose: Scan orchestration, call-graphs, proof bundles, reachability analysis Sprint: SPRINT_3500_0002_0001, SPRINT_3500_0003_0002


Overview

The scanner schema contains all tables related to:

  1. Scan manifests and deterministic replay
  2. Proof bundles (content-addressed storage metadata)
  3. Call-graph nodes and edges (reachability analysis)
  4. Entrypoints (framework-specific entry discovery)
  5. Runtime samples (profiling data for reachability validation)

Design Principles:

  • All tables use scan_id as primary partition key for scan isolation
  • Deterministic data only (no timestamps in core algorithms)
  • Content-addressed references (hashes, not paths)
  • Forward-only schema evolution

Tables

1. scan_manifest

Purpose: Stores immutable scan manifests capturing all inputs for deterministic replay.

Schema:

Column Type Nullable Description
scan_id text NOT NULL Primary key; UUID format
created_at_utc timestamptz NOT NULL Scan creation timestamp
artifact_digest text NOT NULL Image/artifact digest (sha256:...)
artifact_purl text NULL PURL identifier (pkg:oci/...)
scanner_version text NOT NULL Scanner.WebService version
worker_version text NOT NULL Scanner.Worker version
concelier_snapshot_hash text NOT NULL Concelier feed snapshot digest
excititor_snapshot_hash text NOT NULL Excititor VEX snapshot digest
lattice_policy_hash text NOT NULL Policy bundle digest
deterministic boolean NOT NULL Whether scan used deterministic mode
seed bytea NOT NULL 32-byte deterministic seed
knobs jsonb NULL Configuration knobs (depth limits, etc.)
manifest_hash text NOT NULL SHA-256 of canonical manifest JSON (UNIQUE)
manifest_json jsonb NOT NULL Canonical JSON manifest
manifest_dsse_json jsonb NOT NULL DSSE signature envelope

Indexes:

CREATE INDEX idx_scan_manifest_artifact ON scanner.scan_manifest(artifact_digest);
CREATE INDEX idx_scan_manifest_snapshots ON scanner.scan_manifest(concelier_snapshot_hash, excititor_snapshot_hash);
CREATE INDEX idx_scan_manifest_created ON scanner.scan_manifest(created_at_utc DESC);
CREATE UNIQUE INDEX idx_scan_manifest_hash ON scanner.scan_manifest(manifest_hash);

Constraints:

  • manifest_hash format: sha256:[0-9a-f]{64}
  • seed must be exactly 32 bytes
  • scan_id format: UUID v4

Partitioning: None (lookup table, <100k rows expected)

Retention: 180 days (drop scans older than 180 days)


2. proof_bundle

Purpose: Metadata for content-addressed proof bundles (zip archives).

Schema:

Column Type Nullable Description
scan_id text NOT NULL Foreign key to scan_manifest.scan_id
root_hash text NOT NULL Merkle root hash of bundle contents
bundle_uri text NOT NULL File path or S3 URI to bundle zip
proof_root_dsse_json jsonb NOT NULL DSSE signature of root hash
created_at_utc timestamptz NOT NULL Bundle creation timestamp

Primary Key: (scan_id, root_hash)

Indexes:

CREATE INDEX idx_proof_bundle_scan ON scanner.proof_bundle(scan_id);
CREATE INDEX idx_proof_bundle_created ON scanner.proof_bundle(created_at_utc DESC);

Constraints:

  • root_hash format: sha256:[0-9a-f]{64}
  • bundle_uri must be accessible file path or S3 URI

Partitioning: None (<100k rows expected)

Retention: 365 days (compliance requirement for signed bundles)


3. cg_node (call-graph nodes)

Purpose: Stores call-graph nodes (methods/functions) extracted from artifacts.

Schema:

Column Type Nullable Description
scan_id text NOT NULL Partition key
node_id text NOT NULL Deterministic node ID (hash-based)
artifact_key text NOT NULL Artifact identifier (assembly name, JAR, etc.)
symbol_key text NOT NULL Canonical symbol name (Namespace.Type::Method)
visibility text NOT NULL public, internal, private, unknown
flags integer NOT NULL Bitfield: IS_ENTRYPOINT_CANDIDATE=1, IS_VIRTUAL=2, etc.

Primary Key: (scan_id, node_id)

Indexes:

CREATE INDEX idx_cg_node_artifact ON scanner.cg_node(scan_id, artifact_key);
CREATE INDEX idx_cg_node_symbol ON scanner.cg_node(scan_id, symbol_key);
CREATE INDEX idx_cg_node_flags ON scanner.cg_node(scan_id, flags) WHERE (flags & 1) = 1; -- Entrypoint candidates

Constraints:

  • node_id format: sha256:[0-9a-f]{64} (deterministic hash)
  • visibility must be one of: public, internal, private, unknown

Partitioning: Hash partition by scan_id (for scans with >100k nodes)

Retention: 90 days (call-graphs recomputed on rescan)


4. cg_edge (call-graph edges)

Purpose: Stores call-graph edges (invocations) between nodes.

Schema:

Column Type Nullable Description
scan_id text NOT NULL Partition key
from_node_id text NOT NULL Caller node ID
to_node_id text NOT NULL Callee node ID
kind smallint NOT NULL 1=static, 2=heuristic
reason smallint NOT NULL 1=direct_call, 2=virtual_call, 3=reflection_string, etc.
weight real NOT NULL Edge confidence weight (0.0-1.0)

Primary Key: (scan_id, from_node_id, to_node_id, kind, reason)

Indexes:

CREATE INDEX idx_cg_edge_from ON scanner.cg_edge(scan_id, from_node_id);
CREATE INDEX idx_cg_edge_to ON scanner.cg_edge(scan_id, to_node_id);
CREATE INDEX idx_cg_edge_static ON scanner.cg_edge(scan_id, kind) WHERE kind = 1;
CREATE INDEX idx_cg_edge_heuristic ON scanner.cg_edge(scan_id, kind) WHERE kind = 2;

Constraints:

  • kind must be 1 (static) or 2 (heuristic)
  • reason must be in range 1-10 (enum defined in code)
  • weight must be in range [0.0, 1.0]

Partitioning: Hash partition by scan_id (for scans with >500k edges)

Retention: 90 days

Notes:

  • High-volume table (1M+ rows per large scan)
  • Use partial indexes for kind to optimize static-only queries
  • Consider GIN index on (from_node_id, to_node_id) for bidirectional BFS

5. entrypoint

Purpose: Stores discovered entrypoints (HTTP routes, CLI commands, background jobs).

Schema:

Column Type Nullable Description
scan_id text NOT NULL Partition key
node_id text NOT NULL Reference to cg_node.node_id
kind text NOT NULL http, grpc, cli, job, event, unknown
framework text NOT NULL aspnetcore, spring, express, etc.
route text NULL HTTP route pattern (e.g., /api/orders/{id})
metadata jsonb NULL Framework-specific metadata

Primary Key: (scan_id, node_id, kind, framework, route)

Indexes:

CREATE INDEX idx_entrypoint_scan ON scanner.entrypoint(scan_id);
CREATE INDEX idx_entrypoint_kind ON scanner.entrypoint(scan_id, kind);
CREATE INDEX idx_entrypoint_framework ON scanner.entrypoint(scan_id, framework);

Constraints:

  • kind must be one of: http, grpc, cli, job, event, unknown
  • route required for kind='http' or kind='grpc'

Partitioning: None (<10k rows per scan)

Retention: 90 days


6. runtime_sample

Purpose: Stores runtime profiling samples (stack traces) for reachability validation.

Schema:

Column Type Nullable Description
scan_id text NOT NULL Partition key (links to scan)
collected_at timestamptz NOT NULL Sample collection timestamp
env_hash text NOT NULL Environment hash (k8s ns+pod+container)
sample_id bigserial NOT NULL Auto-incrementing sample ID
timestamp timestamptz NOT NULL Sample timestamp
pid integer NOT NULL Process ID
thread_id integer NOT NULL Thread ID
frames text[] NOT NULL Array of node IDs (stack trace)
weight real NOT NULL Sample weight (1.0 for discrete samples)

Primary Key: (scan_id, sample_id)

Indexes:

CREATE INDEX idx_runtime_sample_scan ON scanner.runtime_sample(scan_id, collected_at DESC);
CREATE INDEX idx_runtime_sample_frames ON scanner.runtime_sample USING GIN(frames);
CREATE INDEX idx_runtime_sample_env ON scanner.runtime_sample(scan_id, env_hash);

Constraints:

  • frames array length must be >0 and <1000
  • weight must be >0.0

Partitioning: TIME-BASED (monthly partitions by collected_at)

CREATE TABLE scanner.runtime_sample_2025_01 PARTITION OF scanner.runtime_sample
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Retention: 90 days (drop old partitions automatically)

Notes:

  • Highest volume table (10M+ rows for long-running services)
  • GIN index on frames[] enables fast "find samples containing node X" queries
  • Partition pruning critical for performance

Enums (Defined in Code)

cg_edge.kind

Value Name Description
1 static Statically proven call edge
2 heuristic Heuristic/inferred edge (reflection, DI, dynamic)

cg_edge.reason

Value Name Description
1 direct_call Direct method invocation
2 virtual_call Virtual/interface dispatch
3 reflection_string Reflection with string name
4 di_binding Dependency injection registration
5 dynamic_import Dynamic module import (JS/Python)
6 delegate_invoke Delegate/lambda invocation
7 async_await Async method call
8 constructor Object constructor invocation
9 plt_got PLT/GOT indirect call (native binaries)
10 unknown Unknown edge type

cg_node.flags (Bitfield)

Bit Flag Description
0 IS_ENTRYPOINT_CANDIDATE Node could be an entrypoint
1 IS_VIRTUAL Virtual or interface method
2 IS_ASYNC Async method
3 IS_CONSTRUCTOR Constructor method
4 IS_EXPORTED Publicly exported (for native binaries)

Schema Evolution

Migration Categories

Per docs/db/SPECIFICATION.md:

Category Prefix Execution Description
Startup (A) 001-099 Automatic at boot Non-breaking DDL (CREATE IF NOT EXISTS)
Release (B) 100-199 Manual via CLI Breaking changes (requires maintenance window)
Seed S001-S999 After schema Reference data with ON CONFLICT DO NOTHING
Data (C) DM001-DM999 Background job Batched data transformations

Upcoming Migrations

Migration Category Sprint Description
010_scanner_schema.sql Startup (A) 3500.0002.0001 Create scanner schema, scan_manifest, proof_bundle
011_call_graph_tables.sql Startup (A) 3500.0003.0002 Create cg_node, cg_edge, entrypoint
012_runtime_sample_partitions.sql Startup (A) 3500.0003.0004 Create runtime_sample with monthly partitions
S001_seed_edge_reasons.sql Seed 3500.0003.0002 Seed edge reason lookup table

Performance Considerations

Query Patterns

High-frequency queries:

  1. Scan manifest lookup by artifact:

    SELECT * FROM scanner.scan_manifest
    WHERE artifact_digest = $1
    ORDER BY created_at_utc DESC LIMIT 1;
    
    • Index: idx_scan_manifest_artifact
  2. Reachability BFS (forward):

    SELECT to_node_id FROM scanner.cg_edge
    WHERE scan_id = $1 AND from_node_id = ANY($2) AND kind = 1;
    
    • Index: idx_cg_edge_from
  3. Reachability BFS (backward):

    SELECT from_node_id FROM scanner.cg_edge
    WHERE scan_id = $1 AND to_node_id = $2 AND kind = 1;
    
    • Index: idx_cg_edge_to
  4. Find runtime samples containing node:

    SELECT * FROM scanner.runtime_sample
    WHERE scan_id = $1 AND $2 = ANY(frames);
    
    • Index: idx_runtime_sample_frames (GIN)

Index Maintenance

Reindex schedule:

  • cg_edge indexes: Weekly (high churn)
  • runtime_sample GIN index: Monthly (after partition drops)

Vacuum:

  • Autovacuum enabled for all tables
  • Manual VACUUM ANALYZE after bulk inserts (>1M rows)

Partition Management

Automated partition creation (cron job):

-- Create next month's partition 7 days in advance
CREATE TABLE IF NOT EXISTS scanner.runtime_sample_2025_02 PARTITION OF scanner.runtime_sample
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

Automated partition dropping (90-day retention):

DROP TABLE IF EXISTS scanner.runtime_sample_2024_10; -- Older than 90 days

Compliance & Auditing

DSSE Signatures

All proof bundles and manifests include DSSE signatures:

  • manifest_dsse_json in scan_manifest
  • proof_root_dsse_json in proof_bundle

Verification:

  • Signatures verified on read using IContentSigner.Verify
  • Invalid signatures → reject proof bundle

Immutability

Immutable tables:

  • scan_manifest — No updates allowed after insert
  • proof_bundle — No updates allowed after insert

Enforcement: Application-level (no UPDATE grants in production)

Retention Policies

Table Retention Enforcement
scan_manifest 180 days DELETE WHERE created_at_utc < NOW() - INTERVAL '180 days'
proof_bundle 365 days DELETE WHERE created_at_utc < NOW() - INTERVAL '365 days'
cg_node 90 days CASCADE delete on scan_manifest
cg_edge 90 days CASCADE delete on scan_manifest
runtime_sample 90 days DROP PARTITION (monthly)

Monitoring

Key Metrics

  1. Table sizes:

    SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
    FROM pg_tables WHERE schemaname = 'scanner';
    
  2. Index usage:

    SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
    FROM pg_stat_user_indexes
    WHERE schemaname = 'scanner'
    ORDER BY idx_scan DESC;
    
  3. Partition sizes:

    SELECT tablename, pg_size_pretty(pg_total_relation_size('scanner.'||tablename))
    FROM pg_tables
    WHERE schemaname = 'scanner' AND tablename LIKE 'runtime_sample_%'
    ORDER BY tablename DESC;
    

Alerts

  • Table growth: Alert if cg_edge >10GB per scan
  • Index bloat: Alert if index size >2x expected
  • Partition creation: Alert if next month's partition not created 7 days in advance
  • Vacuum lag: Alert if last autovacuum >7 days

References

  • docs/07_HIGH_LEVEL_ARCHITECTURE.md — Schema isolation design
  • docs/db/SPECIFICATION.md — Database specification
  • docs/operations/postgresql-guide.md — Operations guide
  • SPRINT_3500_0002_0001_score_proofs_foundations.md — Implementation sprint
  • SPRINT_3500_0003_0002_reachability_dotnet_call_graphs.md — Call-graph implementation

Last Updated: 2025-12-17 Schema Version: 1.0 Next Review: Sprint 3500.0003.0004 (partition strategy)