Files
git.stella-ops.org/docs/operations/postgresql-patterns-runbook.md
StellaOps Bot b058dbe031 up
2025-12-14 23:20:14 +02:00

426 lines
10 KiB
Markdown

# PostgreSQL Patterns Operational Runbook
**Version:** 1.0
**Last Updated:** 2025-12-14
**Related Sprints:** SPRINT_3420, SPRINT_3421, SPRINT_3422, SPRINT_3423
---
## Table of Contents
1. [Row-Level Security (RLS)](#1-row-level-security-rls)
2. [Bitemporal Unknowns](#2-bitemporal-unknowns)
3. [Time-Based Partitioning](#3-time-based-partitioning)
4. [Generated Columns](#4-generated-columns)
5. [Troubleshooting](#5-troubleshooting)
---
## 1. Row-Level Security (RLS)
### 1.1 Overview
RLS provides database-level tenant isolation. All tenant-scoped tables enforce that queries can only access rows matching the current tenant context.
### 1.2 Setting Tenant Context
Before executing queries, the application must set the tenant context:
```sql
-- Set tenant context for the session
SELECT set_config('app.tenant_id', 'my-tenant-id', false);
-- Verify tenant context
SELECT current_setting('app.tenant_id', true);
```
### 1.3 Validating RLS Configuration
Run the validation script to check RLS status:
```bash
psql -f deploy/postgres-validation/001_validate_rls.sql
```
Expected output shows:
- All tenant-scoped tables with `rls_enabled = true`
- All tenant-scoped tables with `rls_forced = true`
- One policy per table named `*_tenant_isolation`
### 1.4 Bypassing RLS (Admin Operations)
For migrations or cross-tenant operations, use the admin role:
```sql
-- Grant admin role to current user (one-time setup)
GRANT scheduler_admin TO your_migration_user;
-- Execute as admin (bypasses RLS)
SET ROLE scheduler_admin;
SELECT * FROM scheduler.runs; -- Returns all tenants
-- Return to normal role
RESET ROLE;
```
### 1.5 Common Issues
**Issue:** Query returns no rows unexpectedly
```sql
-- Check if tenant context is set
SELECT current_setting('app.tenant_id', true);
-- If NULL or empty, set it before querying
```
**Issue:** "app.tenant_id session variable not set" error
```sql
-- Set the tenant context
SELECT set_config('app.tenant_id', 'tenant-123', false);
```
**Issue:** Need to query across tenants
```sql
-- Use admin bypass role
SET ROLE scheduler_admin;
-- Run your cross-tenant query
RESET ROLE;
```
---
## 2. Bitemporal Unknowns
### 2.1 Overview
The `unknowns` schema tracks ambiguities discovered during vulnerability scanning using bitemporal semantics:
- **Valid time** (`valid_from`, `valid_to`): When the unknown was relevant in the real world
- **System time** (`sys_from`, `sys_to`): When the system knew about the unknown
### 2.2 Querying Current Unknowns
```sql
-- Get all current open unknowns for a tenant
SELECT * FROM unknowns.current
WHERE tenant_id = 'my-tenant';
-- Or use the view with tenant context set
SELECT set_config('app.tenant_id', 'my-tenant', false);
SELECT * FROM unknowns.current;
```
### 2.3 Point-in-Time Queries
Query the state of unknowns at any point in history:
```sql
-- What unknowns existed on January 1, 2025?
SELECT * FROM unknowns.as_of(
'my-tenant',
'2025-01-01 00:00:00+00'::TIMESTAMPTZ
);
-- What did we know about unknowns on Jan 1, as of Jan 15?
SELECT * FROM unknowns.as_of(
'my-tenant',
'2025-01-01 00:00:00+00'::TIMESTAMPTZ, -- valid_at
'2025-01-15 00:00:00+00'::TIMESTAMPTZ -- sys_at
);
```
### 2.4 Resolving Unknowns
```sql
-- Resolve an unknown (sets valid_to and resolution fields)
UPDATE unknowns.unknown
SET resolved_at = NOW(),
resolution_type = 'feed_updated',
resolution_ref = 'nvd-feed-2025-01',
resolution_notes = 'NVD now covers this package',
valid_to = NOW()
WHERE id = 'unknown-uuid'
AND tenant_id = 'my-tenant';
```
### 2.5 Metrics Queries
```sql
-- Count unknowns by kind
SELECT * FROM unknowns.count_by_kind('my-tenant');
-- Count unknowns by severity
SELECT * FROM unknowns.count_by_severity('my-tenant');
```
---
## 3. Time-Based Partitioning
### 3.1 Overview
High-volume tables are partitioned by month for:
- Efficient time-range queries (partition pruning)
- O(1) data retention (drop partition vs DELETE)
- Improved vacuum performance
### 3.2 Checking Partition Status
Run the validation script:
```bash
psql -f deploy/postgres-validation/002_validate_partitions.sql
```
Or query directly:
```sql
-- View all partitions with sizes
SELECT * FROM partition_mgmt.partition_stats
WHERE schema_name = 'scheduler'
ORDER BY table_name, partition_name;
```
### 3.3 Creating Future Partitions
Partitions should be created ahead of time. Run monthly:
```sql
-- Create partitions for next 3 months
SELECT partition_mgmt.create_monthly_partitions(
'scheduler', -- schema
'audit', -- table
'created_at', -- partition column
CURRENT_DATE, -- start from now
3 -- months ahead
);
```
### 3.4 Archiving Old Partitions
Preview what would be archived:
```sql
-- Dry run: see what would be archived (12 month retention)
SELECT * FROM partition_mgmt.cleanup_old_partitions(
'scheduler', 'audit', 12, 'archive', TRUE
);
```
Execute archiving:
```sql
-- Actually archive partitions older than 12 months
SELECT * FROM partition_mgmt.cleanup_old_partitions(
'scheduler', 'audit', 12, 'archive', FALSE
);
```
### 3.5 Dropping Archived Partitions
After verifying archived data is backed up:
```sql
-- Drop archived partition
DROP TABLE archive.audit_2024_01;
```
### 3.6 Maintenance Checklist
**Weekly:**
- [ ] Check `partition_mgmt.partition_stats` for unexpected growth
- [ ] Verify default partitions are empty
**Monthly:**
- [ ] Create partitions for 3 months ahead
- [ ] Review partitions approaching retention cutoff
**Quarterly:**
- [ ] Archive partitions beyond retention period
- [ ] Verify archived data is backed up
- [ ] Drop archived partitions if space needed
---
## 4. Generated Columns
### 4.1 Overview
Generated columns extract frequently-queried JSONB fields into indexed columns for 10-50x query speedup.
### 4.2 Affected Tables
| Table | Generated Columns | Source |
|-------|-------------------|--------|
| `scheduler.runs` | `finding_count`, `critical_count`, `high_count`, `medium_count`, `low_count` | `stats` JSONB |
| `vuln.advisories` | `provenance_source_key`, `provenance_feed_id`, `provenance_ingested_at` | `provenance` JSONB |
| `vuln.advisory_affected` | `purl_type`, `purl_name` | `purl` TEXT |
### 4.3 Query Optimization
**Before (slow):**
```sql
SELECT * FROM scheduler.runs
WHERE (stats->>'findingCount')::int > 100;
```
**After (fast):**
```sql
SELECT * FROM scheduler.runs
WHERE finding_count > 100;
```
### 4.4 Adding New Generated Columns
```sql
-- Add a new generated column
ALTER TABLE your_table
ADD COLUMN new_field TEXT
GENERATED ALWAYS AS (jsonb_column->>'field_name') STORED;
-- Create index
CREATE INDEX ix_your_table_new_field
ON your_table (new_field)
WHERE new_field IS NOT NULL;
```
### 4.5 Monitoring
Check column usage in queries:
```sql
-- See which indexes are being used
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'scheduler'
ORDER BY idx_scan DESC;
```
---
## 5. Troubleshooting
### 5.1 RLS Issues
**Symptom:** Application gets "permission denied" or empty results
**Diagnosis:**
```sql
-- Check if RLS is enabled
SELECT tablename, rowsecurity, forcerowsecurity
FROM pg_tables WHERE schemaname = 'scheduler';
-- Check policies
SELECT * FROM pg_policies WHERE schemaname = 'scheduler';
-- Check tenant context
SELECT current_setting('app.tenant_id', true);
```
**Resolution:**
1. Ensure `set_config('app.tenant_id', ...)` is called before queries
2. Verify the tenant ID matches data in the table
3. For admin operations, use bypass role
### 5.2 Partition Issues
**Symptom:** Inserts fail with "no partition of relation"
**Diagnosis:**
```sql
-- Check existing partitions
SELECT * FROM partition_mgmt.partition_stats
WHERE table_name = 'audit';
-- Check default partition for orphaned data
SELECT COUNT(*) FROM scheduler.audit_default;
```
**Resolution:**
```sql
-- Create missing partition
SELECT partition_mgmt.create_partition(
'scheduler', 'audit', 'created_at',
'2025-12-01'::DATE, '2026-01-01'::DATE
);
```
**Symptom:** Queries scanning all partitions (slow)
**Diagnosis:**
```sql
-- Check query plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM scheduler.audit
WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01';
```
**Resolution:**
- Ensure WHERE clause includes partition key
- Use `>=` and `<` instead of BETWEEN for clearer pruning
### 5.3 Bitemporal Query Issues
**Symptom:** `as_of()` returns unexpected results
**Diagnosis:**
```sql
-- Check raw temporal data
SELECT id, valid_from, valid_to, sys_from, sys_to
FROM unknowns.unknown
WHERE tenant_id = 'my-tenant'
ORDER BY valid_from;
```
**Resolution:**
- Verify timestamps are in UTC
- Check that `valid_from <= query_time < valid_to`
- Check that `sys_from <= query_time < sys_to`
### 5.4 Generated Column Issues
**Symptom:** Generated column has NULL when source field exists
**Diagnosis:**
```sql
-- Check source field format
SELECT provenance->>'source_key', provenance_source_key
FROM vuln.advisories
WHERE id = 'advisory-uuid';
```
**Resolution:**
- Verify JSONB field name matches exactly (case-sensitive)
- Check for type casting issues in generation expression
---
## Appendix A: Quick Reference
### Schema Admin Roles
| Schema | Admin Role | Bypass RLS |
|--------|------------|------------|
| scheduler | `scheduler_admin` | Yes |
| notify | `notify_admin` | Yes |
| authority | `authority_admin` | Yes |
| vex | `vex_admin` | Yes |
| policy | `policy_admin` | Yes |
| unknowns | `unknowns_admin` | Yes |
### Helper Functions
| Function | Purpose |
|----------|---------|
| `*_app.require_current_tenant()` | Returns tenant from session, raises if not set |
| `unknowns.as_of(tenant, valid_at, sys_at)` | Point-in-time temporal query |
| `unknowns.count_by_kind(tenant)` | Aggregate unknowns by kind |
| `unknowns.count_by_severity(tenant)` | Aggregate unknowns by severity |
| `partition_mgmt.create_monthly_partitions(...)` | Create partitions ahead |
| `partition_mgmt.cleanup_old_partitions(...)` | Archive old partitions |
### Validation Scripts
| Script | Purpose |
|--------|---------|
| `deploy/postgres-validation/001_validate_rls.sql` | Verify RLS configuration |
| `deploy/postgres-validation/002_validate_partitions.sql` | Verify partition health |