# 12 - Performance Workbook

*Purpose*: define **repeatable, data-driven** benchmarks that guard the StellaOps core pledge:

> *“P95 vulnerability feedback in ≤ 5 seconds.”*

---
## 0 Benchmark Scope

| Area            | Included                                | Excluded                   |
|-----------------|-----------------------------------------|----------------------------|
| SBOM-first scan | Trivy engine w/ warmed DB               | Full image unpack ≥ 300 MB |
| Delta SBOM ⭑    | Missing-layer lookup & merge            | Multi-arch images          |
| Policy eval ⭑   | YAML → JSON → rule match                | Rego (until GA)            |
| Feed merge      | NVD JSON 2023-2025                      | GHSA GraphQL (plugin)      |
| Quota wait-path | 5 s soft-wait, 60 s hard-wait behaviour | Paid tiers (unlimited)     |
| API latency     | REST `/scan`, `/layers/missing`         | UI SPA calls               |

⭑ = new in July 2025.

---
## 1 Hardware Baseline (Reference Rig)

| Element   | Spec                            |
|-----------|---------------------------------|
| CPU       | 8 vCPU (Intel Ice Lake equiv.)  |
| Memory    | 16 GiB                          |
| Disk      | NVMe SSD, 3 GB/s R/W            |
| Network   | 1 Gbit virt. switch             |
| Container | Docker 25.0 + overlay2          |
| OS        | Ubuntu 22.04 LTS (kernel 6.8)   |

*All P95 targets assume a **single-node** deployment on this rig unless stated.*

---
## 2 Phase Targets & Gates

| Phase (ID)        | Target P95 | Gate (CI) | Rationale                              |
|-------------------|-----------:|-----------|----------------------------------------|
| **SBOM_FIRST**    | ≤ 5 s      | `hard`    | Core UX promise.                       |
| **IMAGE_UNPACK**  | ≤ 10 s     | `soft`    | Fallback path for legacy flows.        |
| **DELTA_SBOM** ⭑  | ≤ 1 s      | `hard`    | Needed to stay sub-5 s for big bases.  |
| **POLICY_EVAL** ⭑ | ≤ 50 ms    | `hard`    | Keeps gate latency invisible to users. |
| **QUOTA_WAIT** ⭑  | *soft* 5 s<br>*hard* 60 s | `hard` | Ensures graceful Free-tier throttling. |
| **SCHED_RESCAN**  | ≤ 30 s     | `soft`    | Nightly batch, not user-facing.        |
| **FEED_MERGE**    | ≤ 60 s     | `soft`    | Off-peak cron @ 01:00.                 |
| **API_P95**       | ≤ 200 ms   | `hard`    | UI snappiness.                         |

*Gate* legend: `hard` = break CI if regression > 3 × target; `soft` = raise warning & issue ticket.

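
The gate legend can be read as a small decision rule. The sketch below is one possible reading, not the CI job's actual logic: it assumes "regression > 3 × target" means the measured P95 exceeds three times the phase target, and it assumes a `hard` phase that misses its target by less than that only warns. `Gate`, `GATES`, and `evaluate_gate` are hypothetical names; QUOTA_WAIT is left out because it carries two targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    target_s: float  # P95 target in seconds, copied from the table above
    kind: str        # "hard" or "soft"


GATES = {
    "SBOM_FIRST": Gate(5.0, "hard"),
    "IMAGE_UNPACK": Gate(10.0, "soft"),
    "DELTA_SBOM": Gate(1.0, "hard"),
    "POLICY_EVAL": Gate(0.050, "hard"),
    "SCHED_RESCAN": Gate(30.0, "soft"),
    "FEED_MERGE": Gate(60.0, "soft"),
    "API_P95": Gate(0.200, "hard"),
}

HARD_FAIL_FACTOR = 3.0  # legend: regression > 3 x target breaks CI


def evaluate_gate(phase: str, p95_s: float) -> str:
    """Return "ok", "warn", or "fail" for a measured P95 (hypothetical helper)."""
    gate = GATES[phase]
    if p95_s <= gate.target_s:
        return "ok"
    # Assumption: only hard gates can break CI, and only beyond 3 x target.
    if gate.kind == "hard" and p95_s > HARD_FAIL_FACTOR * gate.target_s:
        return "fail"
    # Assumption: every other miss raises a warning.
    return "warn"
```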
---
## 3 Test Harness

* **Runner**: `perf/run.sh`, accepts `--phase` and `--samples`.
* **Metrics**: Prometheus + `jq` extracts; aggregated via `scripts/aggregate.ts`.
* **CI**: GitLab CI job *benchmark* publishes JSON to `bench-artifacts/`.
* **Visualisation**: Grafana dashboard *StellaPerf* (provisioned JSON).

> **Note**: the harness mounts `/var/cache/trivy` as tmpfs to avoid disk noise.
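
Every target in this workbook is a P95, so it helps to pin down what that means for a list of samples. The contents of `scripts/aggregate.ts` are not shown here; the Python sketch below uses the nearest-rank definition as an assumption, and the real aggregation may interpolate instead. `percentile` is a hypothetical helper. Under this definition, a phase with only three samples reports its maximum as P95.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```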
---
## 4 Current Results (July 2025)

| Phase           | Samples | Mean (s) | P95 (s) | Target OK? |
|-----------------|--------:|---------:|--------:|-----------:|
| SBOM_FIRST      | 100     | 3.7      | 4.9     | ✅ |
| IMAGE_UNPACK    | 50      | 6.4      | 9.2     | ✅ |
| **DELTA_SBOM**  | 100     | 0.46     | 0.83    | ✅ |
| **POLICY_EVAL** | 1000    | 0.021    | 0.041   | ✅ |
| **QUOTA_WAIT**  | 80      | 4.0*     | 4.9*    | ✅ |
| SCHED_RESCAN    | 10      | 18.3     | 24.9    | ✅ |
| FEED_MERGE      | 3       | 38.1     | 41.0    | ✅ |
| API_P95         | 20000   | 0.087    | 0.143   | ✅ |

*Data files:* `bench-artifacts/20250714/phasestats.json`.

---
## 5 Δ-SBOM Micro-Benchmark Detail
### 5.1 Scenario
1. Base image `python:3.12-slim` already scanned (all layers cached).
2. Application layer (`COPY . /app`) triggers new digest.
3. Santech lists **7** layers, backend replies *6 hit*, *1 miss*.
4. Builder scans **only 1 layer** (~9 MiB, 217 files) and uploads the delta.
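
The scenario above boils down to a set-membership check on layer digests. The sketch below illustrates the idea only: `split_layers` and the digest strings are hypothetical, and the real `/layers/missing` request and response format is not described in this workbook.

```python
def split_layers(image_layers, cached_digests):
    """Partition layer digests into cache hits and misses, keeping image order."""
    hits = [d for d in image_layers if d in cached_digests]
    missing = [d for d in image_layers if d not in cached_digests]
    return hits, missing


# Hypothetical digests: six cached base layers plus one new application layer.
base = [f"sha256:base{i}" for i in range(6)]
app = "sha256:app0"

hits, missing = split_layers(base + [app], set(base))
print(len(hits), "hit,", len(missing), "miss")  # prints: 6 hit, 1 miss
```

Only the digests in `missing` would then need a scan, which is why the delta path handles a single layer here.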
### 5.2 Key Timings
| Step | Time (ms) |
|---------------------|----------:|
| `/layers/missing` | 13 |
| Trivy single layer | 655 |
| Upload delta blob | 88 |
| Backend merge + CVE | 74 |
| **Total wall-time** | **830** |
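
As a quick check on the table above, the step timings can be summed and compared with the 1 s DELTA_SBOM target. This is plain arithmetic on the published numbers; `steps_ms` and `target_ms` are names chosen for illustration. The four steps add up to 830 ms, leaving 170 ms of headroom, with the single-layer Trivy scan accounting for roughly 79% of the total.

```python
steps_ms = {
    "/layers/missing": 13,
    "Trivy single layer": 655,
    "Upload delta blob": 88,
    "Backend merge + CVE": 74,
}
target_ms = 1000  # DELTA_SBOM target of 1 s from the Phase Targets table

total_ms = sum(steps_ms.values())
for name, ms in steps_ms.items():
    # Share of total wall-time spent in each step.
    print(f"{name:<20} {ms:>4} ms  {ms / total_ms:6.1%}")
print(f"total {total_ms} ms, headroom {target_ms - total_ms} ms")
```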
---
## 6 Quota Wait-Path Benchmark Detail

### 6.1 Scenario

1. Free-tier token reaches **scan #200**; dashboard shows yellow banner.

### 6.2 Key Timings

| Step                               | Time (ms) |
|------------------------------------|----------:|
| `/quota/check` Redis Lua INCR      | 0.8       |
| Soft wait sleep (server)           | 5000      |
| Hard wait sleep (server)           | 60000     |
| End-to-end wall-time (soft-hit)    | 5003      |
| End-to-end wall-time (hard-hit)    | 60004     |
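
The wait-path behaviour can be pictured as a lookup from scan count to server-side delay. The sketch below is an assumption-heavy illustration: the 5 s and 60 s delays come from the table above, but `wait_seconds`, `soft_limit`, and `hard_limit` are hypothetical, and this workbook does not state the count at which the hard wait starts, so both limits are left as parameters.

```python
SOFT_WAIT_S = 5   # soft wait sleep from the table above
HARD_WAIT_S = 60  # hard wait sleep from the table above


def wait_seconds(scan_count, soft_limit, hard_limit):
    """Return the delay in seconds for the current scan count (hypothetical helper).

    Both limits are assumptions supplied by the caller; hard_limit is
    expected to be greater than or equal to soft_limit.
    """
    if hard_limit < soft_limit:
        raise ValueError("hard_limit must not be below soft_limit")
    if scan_count >= hard_limit:
        return HARD_WAIT_S
    if scan_count >= soft_limit:
        return SOFT_WAIT_S
    return 0
```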
---
## 7 Policy Eval Bench
### 7.1 Setup
* Policy YAML: **28** rules, mixing severity and package conditions.
* Input: scan result JSON with **1026** findings.
* Evaluator: custom rules engine (Go structs → map lookups).
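
To show why map lookups keep evaluation cheap, here is a minimal sketch of a rule matcher that indexes rules by severity, so each finding only visits the rules that could apply to it. It is written in Python for brevity, whereas the evaluator described above is Go. The field names (`severity`, `package`, `action`, `id`) and both functions are hypothetical and do not reflect the real policy schema.

```python
from collections import defaultdict


def index_rules(rules):
    """Group rules by severity so each finding needs one dict access."""
    by_severity = defaultdict(list)
    for rule in rules:
        by_severity[rule["severity"]].append(rule)
    return by_severity


def evaluate(findings, rules):
    """Return (finding id, action) for every rule that matches a finding."""
    by_severity = index_rules(rules)
    decisions = []
    for finding in findings:
        # .get avoids creating empty entries for unseen severities.
        for rule in by_severity.get(finding["severity"], []):
            package = rule.get("package")  # None means "any package"
            if package is None or package == finding["package"]:
                decisions.append((finding["id"], rule["action"]))
    return decisions
```

With an index like this, cost grows with the number of findings times the rules per severity bucket, not with the full rule count.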
### 7.2 Latency Histogram
```
 0-10 ms ▇▇▇▇▇▇▇▇▇▇ 38%
10-20 ms ▇▇▇▇▇▇▇▇▇▇ 42%
20-40 ms ▇▇▇▇▇▇ 17%
40-50 ms ▇ 3%
```

P99 = 48 ms. Meets the 50 ms gate.

---
## 8 Trend Snapshot

![Perf trend sparkline placeholder](perftrend.svg)
_Plot generated weekly by `scripts/updatetrend.py`; shows the last 12 weeks of P95 per phase._

---
## 9 Action Items

1. **Image Unpack**: evaluate zstd for layer decompression; aim to shave 1 s.
2. **Feed Merge**: parallelise BDU XML parse (plugin) once stable.
3. **Rego Support**: prototype OPA sidecar; target ≤ 100 ms eval.
4. **Concurrency**: stress-test 100 rps on a 4-node Redis cluster (Q4 2025).
---
## 10 Change Log

| Date       | Note                                                                   |
|------------|------------------------------------------------------------------------|
| 2025-07-14 | Added Δ-SBOM & Policy Eval phases; updated targets & current results.  |
| 2025-07-12 | First public workbook (SBOM-first, image-unpack, feed merge).          |
---