doctor enhancements, setup, enhancements, ui functionality and design consolidation and , test projects fixes , product advisory attestation/rekor and delta verfications enhancements

2026-01-19 09:02:59 +02:00
parent 8c4bf54aed
commit 17419ba7c4
809 changed files with 170738 additions and 12244 deletions
--- a/docs/operations/artifact-migration-runbook.md
+++ b/docs/operations/artifact-migration-runbook.md
@@ -0,0 +1,212 @@
+# Artifact Store Migration Runbook
+
+Sprint: SPRINT_20260118_017_Evidence_artifact_store_unification (AS-006)
+
+## Overview
+
+This runbook covers the migration of existing evidence from legacy artifact stores to the unified ArtifactStore.
+
+## Migration Sources
+
+| Source | Legacy Path | Description |
+|--------|-------------|-------------|
+| EvidenceLocker | `tenants/{tenantId}/bundles/{bundleId}/{sha256}-{name}` | Evidence bundles |
+| Attestor | `attest/dsse/{bundleSha256}.json` | DSSE envelopes |
+| Vex | `{prefix}/{format}/{digest}.{ext}` | VEX documents |
+
+## Target Path Convention
+
+All artifacts are migrated to: `/artifacts/{bom-ref-encoded}/{serialNumber}/{artifactId}.json`
+
+## Pre-Migration Checklist
+
+- [ ] Backup existing S3 buckets
+- [ ] Verify PostgreSQL backup is current
+- [ ] Ensure sufficient storage for duplicated data
+- [ ] Review migration in dry-run mode first
+- [ ] Notify stakeholders of potential service impact
+
+## Running the Migration
+
+### Dry Run (Recommended First Step)
+
+```bash
+stella artifacts migrate --source all --dry-run --output migration-preview.json
+```
+
+### Full Migration
+
+```bash
+# Migrate all sources with default settings
+stella artifacts migrate --source all
+
+# Migrate with increased parallelism
+stella artifacts migrate --source all --parallelism 8 --batch-size 200
+
+# Migrate specific source
+stella artifacts migrate --source evidence --output migration-report.json
+
+# Migrate specific tenant
+stella artifacts migrate --source all --tenant <tenant-uuid>
+```
+
+### Resuming Failed Migration
+
+```bash
+# Use checkpoint ID from previous run
+stella artifacts migrate --source all --resume-from <checkpoint-id>
+```
+
+## Progress Monitoring
+
+The CLI displays real-time progress:
+
+```
+  Progress: 1500/10000 (15.0%) - Success: 1495, Failed: 3, Skipped: 2
+```
+
+## Rollback Procedure
+
+### When to Rollback
+
+- Migration corrupted data
+- Performance degradation after migration
+- Business-critical bug discovered
+
+### Rollback Steps
+
+#### 1. Stop New Writes to Unified Store
+
+```bash
+# Disable unified store in configuration
+kubectl set env deployment/evidence-locker ARTIFACT_STORE_UNIFIED_ENABLED=false
+kubectl set env deployment/attestor ARTIFACT_STORE_UNIFIED_ENABLED=false
+```
+
+#### 2. Revert Application Configuration
+
+```yaml
+# etc/appsettings.yaml
+artifactStore:
+  useUnifiedStore: false
+  legacyMode: true
+```
+
+#### 3. Clear Unified Store Index
+
+```sql
+-- Clear PostgreSQL index (preserves S3 data)
+TRUNCATE TABLE artifact_store.artifacts;
+```
+
+#### 4. (Optional) Remove Migrated S3 Objects
+
+```bash
+# Only if disk space is critical and you're certain about rollback
+# WARNING: This is destructive!
+aws s3 rm s3://artifacts-bucket/artifacts/ --recursive
+```
+
+#### 5. Restart Services
+
+```bash
+kubectl rollout restart deployment/evidence-locker
+kubectl rollout restart deployment/attestor
+```
+
+#### 6. Verify Legacy Stores Work
+
+```bash
+# Test evidence retrieval
+stella evidence get --bundle-id <test-bundle>
+
+# Test attestation retrieval  
+stella attestor get --digest <test-digest>
+```
+
+## Post-Migration Validation
+
+### Verify Artifact Counts
+
+```sql
+-- Count migrated artifacts by source
+SELECT 
+  CASE 
+    WHEN storage_key LIKE '%evidence%' THEN 'evidence'
+    WHEN storage_key LIKE '%dsse%' THEN 'attestor'
+    WHEN storage_key LIKE '%vex%' THEN 'vex'
+    ELSE 'unknown'
+  END as source,
+  COUNT(*) as count
+FROM artifact_store.artifacts
+GROUP BY 1;
+```
+
+### Verify bom-ref Extraction
+
+```sql
+-- Check for artifacts with synthetic bom-refs (extraction failed)
+SELECT COUNT(*) as synthetic_count
+FROM artifact_store.artifacts
+WHERE bom_ref LIKE 'sha256:%';
+```
+
+### Test Retrieval
+
+```bash
+# Query by bom-ref
+curl "https://api.example.com/api/v1/artifacts?bom_ref=pkg:docker/acme/api@sha256:abc123"
+
+# Verify content matches original
+stella artifacts compare \
+  --original tenants/xxx/bundles/yyy/sha256-sbom.json \
+  --migrated /artifacts/encoded-ref/serial/artifact.json
+```
+
+## Troubleshooting
+
+### Migration Stuck
+
+```bash
+# Check for stuck workers
+ps aux | grep migrate
+
+# Check migration checkpoints
+cat /var/lib/stella/migration-checkpoint.json
+```
+
+### High Failure Rate
+
+1. Check migration report for common errors
+2. Verify source store connectivity
+3. Check for corrupted source artifacts
+4. Increase batch size for memory issues
+
+### Slow Migration
+
+1. Increase parallelism (up to CPU count)
+2. Run during off-peak hours
+3. Consider migrating by tenant in parallel
+4. Verify network bandwidth to S3
+
+## Representative Dataset Testing
+
+Before production migration, test with representative dataset:
+
+```bash
+# Export sample from each source
+stella evidence list --limit 100 --output sample-evidence.json
+stella attestor list --limit 100 --output sample-attestor.json
+
+# Create test environment with samples
+stella artifacts migrate --source all --tenant test-tenant --output test-report.json
+
+# Verify counts and content
+diff <(cat sample-evidence.json | jq '.total') <(cat test-report.json | jq '.succeeded')
+```
+
+## Related Documentation
+
+- [Artifact Store API](../api/artifact-store-api.yaml)
+- [IArtifactStore Interface](../../src/__Libraries/StellaOps.Artifact.Core/IArtifactStore.cs)
+- [PostgreSQL Index Schema](../../src/__Libraries/StellaOps.Artifact.Infrastructure/Migrations/001_artifact_index_schema.sql)
--- a/docs/operations/unknowns-queue-runbook.md
+++ b/docs/operations/unknowns-queue-runbook.md
@@ -494,71 +494,142 @@ stella unknowns resolve unk-... \

 ## 7. Monitoring & Alerting

+> **Updated**: Sprint SPRINT_20260118_018_Unknowns_queue_enhancement (UQ-007)
+
 ### 7.1 Key Metrics

 | Metric | Description | Alert Threshold |
 |--------|-------------|-----------------|
-| `unknowns_total` | Total unknowns in queue | > 500 |
-| `unknowns_hot_count` | HOT band count | > 20 |
-| `unknowns_sla_breached` | SLA breaches | > 0 |
-| `unknowns_resolution_rate` | Daily resolutions | < 5 |
-| `unknowns_escalation_failures` | Failed escalations | > 0 |
-| `unknowns_avg_age_hours` | Average unknown age | > 168 (1 week) |
+| `unknowns_queue_depth_hot` | HOT band queue depth | > 5 critical, > 0 for 1h warning |
+| `unknowns_queue_depth_warm` | WARM band queue depth | > 25 warning |
+| `unknowns_queue_depth_cold` | COLD band queue depth | > 100 warning |
+| `unknowns_sla_compliance` | SLA compliance rate (0-1) | < 0.80 critical, < 0.95 warning |
+| `unknowns_sla_breach_total` | Total SLA breaches (counter) | increase > 0 |
+| `unknowns_escalated_total` | Escalations (counter) | rate > 10/hour |
+| `unknowns_demoted_total` | Demotions (counter) | - |
+| `unknowns_expired_total` | Expirations (counter) | - |
+| `unknowns_processing_time_seconds` | Processing time histogram | p95 > 30s |
+| `unknowns_resolution_time_hours` | Resolution time by band | p95 > SLA |
+| `unknowns_state_transitions_total` | State transitions (by from/to) | - |
+| `greyqueue_stuck_total` | Stuck processing entries | > 0 |
+| `greyqueue_timeout_total` | Processing timeouts | > 5/hour |
+| `greyqueue_processing_count` | Currently processing | > 10 for 30m |

 ### 7.2 Grafana Dashboard

-```
-Dashboard: Unknowns Queue Health
-Panels:
- Queue size by band (HOT/WARM/COLD)
- SLA compliance rate
- Unknowns by reason code
- Resolution velocity
- Escalation success rate
- Queue age distribution
- KEV item tracking
-```
+Import dashboard from: `devops/observability/grafana/dashboards/unknowns-queue-dashboard.json`
+
+**Dashboard Panels:**
+
+| Panel | Description |
+|-------|-------------|
+| Total Queue Depth | Stat showing total across all bands |
+| HOT/WARM/COLD Unknowns | Individual band stats with thresholds |
+| SLA Compliance | Gauge showing compliance percentage |
+| Queue Depth Over Time | Time series by band |
+| SLA Compliance Over Time | Trending compliance |
+| State Transitions | Rate of state changes |
+| Processing Time (p95) | Performance histogram |
+| Escalations & Failures | Lifecycle events |
+| Resolution Time by Band | Time-to-resolution |
+| Stuck & Timeout Events | Watchdog metrics |
+| SLA Breaches Today | 24h breach counter |

 ### 7.3 Alerting Rules

-```yaml
-groups:
-  - name: unknowns-queue
-    rules:
-      - alert: UnknownsHotBandHigh
-        expr: unknowns_hot_count > 20
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: "HOT unknowns queue is high ({{ $value }} items)"
-          
-      - alert: UnknownsSLABreach
-        expr: unknowns_sla_breached > 0
-        for: 1m
-        labels:
-          severity: critical
-        annotations:
-          summary: "{{ $value }} unknowns have breached SLA"
-          
-      - alert: UnknownsQueueGrowing
-        expr: rate(unknowns_total[1h]) > 10
-        for: 30m
-        labels:
-          severity: warning
-        annotations:
-          summary: "Unknowns queue is growing rapidly"
-          
-      - alert: UnknownsKEVPending
-        expr: unknowns_kev_count > 0 and unknowns_kev_unresolved_age_hours > 24
-        for: 5m
-        labels:
-          severity: critical
-        annotations:
-          summary: "KEV unknown pending for over 24 hours"
+Alert rules deployed from: `devops/observability/prometheus/rules/unknowns-queue-alerts.yaml`
+
+**Critical Alerts:**
+
+| Alert | Condition | Response |
+|-------|-----------|----------|
+| `UnknownsSlaBreachCritical` | compliance < 80% | Immediate escalation to security team |
+| `UnknownsHotQueueHigh` | HOT > 5 for 10m | Prioritize resolution |
+| `UnknownsProcessingFailures` | Failed entries in 1h | Manual intervention required |
+| `UnknownsSlaMonitorDown` | No metrics for 5m | Check service health |
+| `UnknownsHealthCheckUnhealthy` | Health check failing | Check SLA breaches |
+
+**Warning Alerts:**
+
+| Alert | Condition | Response |
+|-------|-----------|----------|
+| `UnknownsSlaBreachWarning` | 80% ≤ compliance < 95% | Review queue health |
+| `UnknownsHotQueuePresent` | HOT > 0 for 1h | Check progress |
+| `UnknownsQueueBacklog` | Total > 100 for 30m | Scale processing |
+| `UnknownsStuckProcessing` | Processing > 10 for 30m | Check bottlenecks |
+| `UnknownsProcessingTimeout` | Timeouts > 5/hour | Review automation |
+| `UnknownsEscalationRate` | Escalations > 10/hour | Review criteria |
+
+### 7.4 Metric-Based Troubleshooting
+
+#### SLA Breach Investigation
+
+```bash
+# 1. Check current breach status
+curl -s "http://prometheus:9090/api/v1/query?query=unknowns_sla_compliance" | jq
+
+# 2. Identify breached entries
+curl -s "$UNKNOWNS_API/grey-queue?status=pending" | \
+  jq '.items[] | select(.sla_breached == true)'
+
+# 3. Check SLA health endpoint
+curl -s "$UNKNOWNS_API/health/sla" | jq
+
+# 4. Review breach timeline
+# In Grafana: SLA Compliance Over Time panel, last 24h
 ```

-### 7.4 Daily Report
+#### Stuck Processing Investigation
+
+```bash
+# 1. Check processing count
+curl -s "http://prometheus:9090/api/v1/query?query=greyqueue_processing_count" | jq
+
+# 2. List stuck entries
+curl -s "$UNKNOWNS_API/grey-queue?status=Processing" | \
+  jq '.items[] | select((.last_processed_at | fromdateiso8601) < (now - 3600))'
+
+# 3. Check watchdog metrics
+curl -s "http://prometheus:9090/api/v1/query?query=rate(greyqueue_stuck_total[1h])" | jq
+
+# 4. Force retry if needed
+curl -X POST "$UNKNOWNS_API/grey-queue/{id}/retry"
+```
+
+#### High Escalation Rate
+
+```bash
+# 1. Check escalation rate
+curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_escalated_total[1h])" | jq
+
+# 2. Review escalation reasons
+curl -s "$UNKNOWNS_API/grey-queue?status=Escalated" | \
+  jq 'group_by(.escalation_reason) | map({reason: .[0].escalation_reason, count: length})'
+
+# 3. Check for EPSS/KEV spikes
+# Events triggering escalations:
+# - epss.updated with score increase
+# - kev.added events
+# - deployment.created with affected components
+```
+
+#### Queue Growth Analysis
+
+```bash
+# 1. Check inflow rate
+curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_enqueued_total[1h])" | jq
+
+# 2. Check resolution rate
+curl -s "http://prometheus:9090/api/v1/query?query=rate(unknowns_resolved_total[1h])" | jq
+
+# 3. Calculate net growth
+# growth_rate = inflow_rate - resolution_rate
+
+# 4. Review reasons for new unknowns
+curl -s "$UNKNOWNS_API/grey-queue/summary" | jq '.by_reason'
+```
+
+### 7.5 Daily Report

 ```bash
 # Generate daily report