audit work, fixed StellaOps.sln warnings/errors, fixed tests, sprints work, new advisories
docs/modules/scheduler/hlc-ordering.md
@@ -0,0 +1,176 @@
# Scheduler HLC Ordering Architecture

This document describes the Hybrid Logical Clock (HLC) based ordering system used by the StellaOps Scheduler for audit-safe job queue operations.

## Overview

The Scheduler uses HLC timestamps instead of wall-clock time to ensure:

1. **Total ordering** of jobs across distributed nodes
2. **Audit-safe sequencing** with cryptographic chain linking
3. **Deterministic merge** when offline nodes reconnect
4. **Clock skew tolerance** in distributed deployments

## HLC Timestamp Format

An HLC timestamp consists of three components:

```
(PhysicalTime, LogicalCounter, NodeId)
```

| Component | Description | Example |
|-----------|-------------|---------|
| PhysicalTime | Unix milliseconds (UTC) | `1704585600000` |
| LogicalCounter | Monotonic counter for same-millisecond events | `0`, `1`, `2`... |
| NodeId | Unique identifier for the node | `scheduler-prod-01` |

**String format:** `{physical}:{logical}:{nodeId}`

Example: `1704585600000:0:scheduler-prod-01`

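The string format above can be turned back into its three components for comparison. The following is a minimal sketch, assuming a hypothetical `HlcTimestamp` type (it is not the type exposed by the HLC core library) and assuming ties are broken by physical time, then logical counter, then node ID. It compares parsed components rather than the raw text, because plain string comparison would sort a logical counter of `10` before `9`.

```csharp
using System;
using System.Globalization;

// Hypothetical type for illustration; not the StellaOps.HybridLogicalClock API.
public readonly record struct HlcTimestamp(long PhysicalTime, int LogicalCounter, string NodeId)
    : IComparable<HlcTimestamp>
{
    // Parses "{physical}:{logical}:{nodeId}". The split is capped at three parts,
    // so a node ID that itself contains ':' stays intact.
    public static HlcTimestamp Parse(string value)
    {
        var parts = value.Split(':', 3);
        if (parts.Length != 3)
            throw new FormatException($"Invalid HLC timestamp: '{value}'");

        return new HlcTimestamp(
            long.Parse(parts[0], CultureInfo.InvariantCulture),
            int.Parse(parts[1], CultureInfo.InvariantCulture),
            parts[2]);
    }

    // Assumed ordering: physical time, then logical counter, then node ID (ordinal).
    public int CompareTo(HlcTimestamp other)
    {
        var byPhysical = PhysicalTime.CompareTo(other.PhysicalTime);
        if (byPhysical != 0) return byPhysical;

        var byLogical = LogicalCounter.CompareTo(other.LogicalCounter);
        if (byLogical != 0) return byLogical;

        return string.CompareOrdinal(NodeId, other.NodeId);
    }

    // Round-trips to the documented string format.
    public override string ToString() =>
        string.Create(CultureInfo.InvariantCulture, $"{PhysicalTime}:{LogicalCounter}:{NodeId}");
}
```

For example, `HlcTimestamp.Parse("1704585600000:9:scheduler-prod-01")` compares as earlier than `HlcTimestamp.Parse("1704585600000:10:scheduler-prod-01")`, even though the two raw strings sort the other way round.
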
## Database Schema

### scheduler_log Table

```sql
CREATE TABLE scheduler.scheduler_log (
    id              BIGSERIAL PRIMARY KEY,
    t_hlc           TEXT NOT NULL,        -- HLC timestamp
    job_id          TEXT NOT NULL,        -- Job identifier
    action          TEXT NOT NULL,        -- ENQUEUE, DEQUEUE, EXECUTE, COMPLETE, FAIL
    prev_chain_link TEXT,                 -- Hash of previous entry
    chain_link      TEXT NOT NULL,        -- Hash of this entry
    payload         JSONB NOT NULL,       -- Job metadata
    tenant_id       TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_scheduler_log_hlc ON scheduler.scheduler_log (t_hlc);
CREATE INDEX idx_scheduler_log_tenant_hlc ON scheduler.scheduler_log (tenant_id, t_hlc);
CREATE INDEX idx_scheduler_log_job ON scheduler.scheduler_log (job_id);
```

### batch_snapshot Table

```sql
CREATE TABLE scheduler.batch_snapshot (
    id              BIGSERIAL PRIMARY KEY,
    snapshot_hlc    TEXT NOT NULL,        -- HLC at snapshot time
    from_chain_link TEXT NOT NULL,        -- First entry in batch
    to_chain_link   TEXT NOT NULL,        -- Last entry in batch
    entry_count     INTEGER NOT NULL,
    merkle_root     TEXT NOT NULL,        -- Merkle root of entries
    dsse_envelope   JSONB,                -- DSSE-signed attestation
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

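The schema stores a `merkle_root` per batch but does not say how the tree is built. The sketch below shows one possible construction, purely as an assumption: leaves are the SHA-256 of each entry's chain link, parents are the SHA-256 of the two child hashes concatenated, and an unpaired node is carried up unchanged. The `MerkleSketch` class and its method are hypothetical names, and the actual Scheduler implementation may differ in every one of these choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration; the real tree construction is not documented here.
public static class MerkleSketch
{
    // Assumed construction: leaf = SHA-256(UTF-8 chain link),
    // parent = SHA-256(left || right), unpaired node promoted unchanged.
    public static string ComputeRoot(IReadOnlyList<string> chainLinks)
    {
        if (chainLinks.Count == 0)
            throw new ArgumentException("At least one entry is required.", nameof(chainLinks));

        var level = chainLinks
            .Select(link => SHA256.HashData(Encoding.UTF8.GetBytes(link)))
            .ToList();

        // Collapse one level at a time until a single root hash remains.
        while (level.Count > 1)
        {
            var next = new List<byte[]>((level.Count + 1) / 2);
            for (var i = 0; i < level.Count; i += 2)
            {
                if (i + 1 == level.Count)
                {
                    next.Add(level[i]); // odd node out: carry up as-is
                    continue;
                }

                next.Add(SHA256.HashData(level[i].Concat(level[i + 1]).ToArray()));
            }

            level = next;
        }

        return Convert.ToHexString(level[0]).ToLowerInvariant();
    }
}
```

Because the leaves are taken in order, the same batch of entries always yields the same root, and reordering or altering any entry changes it.
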
### chain_heads Table

```sql
CREATE TABLE scheduler.chain_heads (
    tenant_id       TEXT PRIMARY KEY,
    head_chain_link TEXT NOT NULL,        -- Current chain head
    head_hlc        TEXT NOT NULL,        -- HLC of chain head
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

## Chain Link Computation

Each log entry is cryptographically linked to its predecessor:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hashes the entry's fields together with the previous link, so altering any
// earlier entry changes every later chain link.
public static string ComputeChainLink(
    string tHlc,
    string jobId,
    string action,
    string? prevChainLink,
    string payloadDigest)
{
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
    hasher.AppendData(Encoding.UTF8.GetBytes(tHlc));
    hasher.AppendData(Encoding.UTF8.GetBytes(jobId));
    hasher.AppendData(Encoding.UTF8.GetBytes(action));
    // The first entry has no predecessor; the literal "genesis" stands in for it.
    hasher.AppendData(Encoding.UTF8.GetBytes(prevChainLink ?? "genesis"));
    hasher.AppendData(Encoding.UTF8.GetBytes(payloadDigest));
    // Lowercase hex encoding of the SHA-256 digest.
    return Convert.ToHexString(hasher.GetHashAndReset()).ToLowerInvariant();
}
```

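Verifying a chain amounts to recomputing each link and checking that every entry points at its predecessor. The following is a rough sketch under stated assumptions: `LogEntry` is a hypothetical in-memory shape for a `scheduler_log` row, the entries are supplied in chain order starting from the first entry (whose `prev_chain_link` is null), and the payload digest is already available alongside each entry. `ChainVerifier` and `FindFirstBreak` are illustrative names, not part of the Scheduler API.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical in-memory shape of a scheduler_log row, for illustration only.
public sealed record LogEntry(
    string THlc,
    string JobId,
    string Action,
    string? PrevChainLink,
    string ChainLink,
    string PayloadDigest);

public static class ChainVerifier
{
    // Same computation as ComputeChainLink in this section, repeated so the sketch compiles alone.
    public static string ComputeChainLink(
        string tHlc, string jobId, string action, string? prevChainLink, string payloadDigest)
    {
        using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        hasher.AppendData(Encoding.UTF8.GetBytes(tHlc));
        hasher.AppendData(Encoding.UTF8.GetBytes(jobId));
        hasher.AppendData(Encoding.UTF8.GetBytes(action));
        hasher.AppendData(Encoding.UTF8.GetBytes(prevChainLink ?? "genesis"));
        hasher.AppendData(Encoding.UTF8.GetBytes(payloadDigest));
        return Convert.ToHexString(hasher.GetHashAndReset()).ToLowerInvariant();
    }

    // Returns the index of the first broken entry, or -1 if the chain is intact.
    public static int FindFirstBreak(IReadOnlyList<LogEntry> entries)
    {
        string? expectedPrev = null; // the first entry must have no predecessor

        for (var i = 0; i < entries.Count; i++)
        {
            var entry = entries[i];
            var recomputed = ComputeChainLink(
                entry.THlc, entry.JobId, entry.Action, entry.PrevChainLink, entry.PayloadDigest);

            // Broken if the entry does not point at its predecessor,
            // or if its stored hash no longer matches its contents.
            if (entry.PrevChainLink != expectedPrev || entry.ChainLink != recomputed)
                return i;

            expectedPrev = entry.ChainLink;
        }

        return -1;
    }
}
```

Returning the index of the first mismatch, rather than a plain boolean, makes it easier to report where tampering or corruption begins.
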
## Configuration Options

```yaml
# etc/scheduler.yaml
scheduler:
  hlc:
    enabled: true                    # Enable HLC ordering (default: true)
    nodeId: "scheduler-prod-01"      # Unique node identifier
    maxClockSkew: "00:00:05"         # Maximum tolerable clock skew (5 seconds)
    persistenceInterval: "00:01:00"  # HLC state persistence interval

  chain:
    enabled: true                    # Enable chain linking (default: true)
    batchSize: 1000                  # Entries per batch snapshot
    batchInterval: "00:05:00"        # Batch snapshot interval
    signSnapshots: true              # DSSE-sign batch snapshots
    keyId: "scheduler-signing-key"   # Key for snapshot signing
```

## Operational Considerations

### Clock Skew Handling

The HLC algorithm tolerates clock skew by:

1. Advancing the logical counter when physical time has not progressed
2. Rejecting events with excessive clock skew (> `maxClockSkew`)
3. Emitting the `hlc_clock_skew_rejections_total` metric for monitoring

**Alert:** `HlcClockSkewExceeded` triggers when the observed skew exceeds the configured tolerance.

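The behaviour listed above follows the general hybrid logical clock update rules. The sketch below is an illustration only: `HlcClockSketch` is a hypothetical class, not the StellaOps implementation. It assumes an injectable wall clock in Unix milliseconds, assumes that only remote timestamps ahead of the local wall clock by more than `maxClockSkew` are rejected, and is not thread-safe.

```csharp
using System;

// Hypothetical clock for illustration; not the StellaOps.HybridLogicalClock API.
public sealed class HlcClockSketch
{
    private readonly Func<long> _nowUnixMs; // injectable wall clock (Unix milliseconds)
    private readonly long _maxSkewMs;
    private long _physical;
    private int _logical;

    public HlcClockSketch(Func<long> nowUnixMs, TimeSpan maxClockSkew)
    {
        _nowUnixMs = nowUnixMs;
        _maxSkewMs = (long)maxClockSkew.TotalMilliseconds;
    }

    // Local event: take the wall clock if it moved forward, otherwise bump the counter.
    public (long Physical, int Logical) Tick()
    {
        var now = _nowUnixMs();
        if (now > _physical)
        {
            _physical = now;
            _logical = 0;
        }
        else
        {
            _logical++;
        }

        return (_physical, _logical);
    }

    // Remote event: merge the remote timestamp, rejecting it if it is too far ahead.
    public (long Physical, int Logical) Receive(long remotePhysical, int remoteLogical)
    {
        var now = _nowUnixMs();
        if (remotePhysical - now > _maxSkewMs)
            throw new InvalidOperationException(
                $"Remote clock is {remotePhysical - now} ms ahead; limit is {_maxSkewMs} ms.");

        var previous = _physical;
        var merged = Math.Max(Math.Max(previous, remotePhysical), now);

        if (merged == previous && merged == remotePhysical)
            _logical = Math.Max(_logical, remoteLogical) + 1; // both clocks tie
        else if (merged == previous)
            _logical++;                                       // local clock leads
        else if (merged == remotePhysical)
            _logical = remoteLogical + 1;                     // remote clock leads
        else
            _logical = 0;                                     // wall clock leads

        _physical = merged;
        return (_physical, _logical);
    }
}
```

A `maxClockSkew` value such as `"00:00:05"` parses with `TimeSpan.Parse` to five seconds, so a clock built with that tolerance would accept a remote timestamp one second ahead and reject one that is six seconds ahead. In a real deployment the rejection path is also where a counter like `hlc_clock_skew_rejections_total` would be incremented.
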
### Chain Verification

Verify chain integrity on startup and periodically:

```bash
# CLI command
stella scheduler chain verify --tenant-id <tenant>

# API endpoint
GET /api/v1/scheduler/chain/verify?tenantId=<tenant>
```

### Offline Merge

When offline nodes reconnect:

1. Export the local job log as a bundle
2. Import the bundle on a connected node
3. The HLC-based merge produces a deterministic ordering
4. The chain is extended with the merged entries

See `docs/operations/airgap-operations-runbook.md` for details.

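The deterministic ordering in step 3 can be sketched as a sort on the HLC components. This is a simplified illustration, not the Scheduler's merge code: `MergeEntry` and `OfflineMergeSketch` are hypothetical names, entries seen in both logs are assumed to be exact duplicates, and extending the chain (step 4) is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log entry shape for illustration only.
public sealed record MergeEntry(
    long PhysicalTime,
    int LogicalCounter,
    string NodeId,
    string JobId,
    string Action);

public static class OfflineMergeSketch
{
    // Merges two logs into one sequence whose order depends only on entry contents,
    // never on which log was supplied first.
    public static IReadOnlyList<MergeEntry> Merge(
        IEnumerable<MergeEntry> local,
        IEnumerable<MergeEntry> imported) =>
        local.Concat(imported)
            .Distinct() // record value equality drops entries present in both logs
            .OrderBy(e => e.PhysicalTime)
            .ThenBy(e => e.LogicalCounter)
            .ThenBy(e => e.NodeId, StringComparer.Ordinal)
            // Extra tie-breakers keep the result deterministic even if two
            // entries were ever to share the same HLC timestamp.
            .ThenBy(e => e.JobId, StringComparer.Ordinal)
            .ThenBy(e => e.Action, StringComparer.Ordinal)
            .ToList();
}
```

Because the sort key is derived entirely from the entries themselves, merging log A into log B gives the same sequence as merging B into A, which is the property the audit chain relies on.
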
## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `hlc_ticks_total` | Counter | Total HLC tick operations |
| `hlc_clock_skew_rejections_total` | Counter | Events rejected due to clock skew |
| `hlc_physical_offset_seconds` | Gauge | Current physical time offset |
| `scheduler_chain_entries_total` | Counter | Total chain log entries |
| `scheduler_chain_verifications_total` | Counter | Chain verification operations |
| `scheduler_chain_verification_failures_total` | Counter | Failed verifications |
| `scheduler_batch_snapshots_total` | Counter | Batch snapshots created |

## Grafana Dashboard

See `devops/observability/grafana/hlc-queue-metrics.json` for the HLC monitoring dashboard.

## Related Documentation

- [HLC Core Library](../../../src/__Libraries/StellaOps.HybridLogicalClock/README.md)
- [HLC Migration Guide](./hlc-migration-guide.md)
- [Air-Gap Operations Runbook](../../operations/airgap-operations-runbook.md)
- [HLC Troubleshooting](../../operations/runbooks/hlc-troubleshooting.md)