# Ruby Analyzer Parity Design (SCANNER-ENG-0009)
**Status:** Implemented • Owner: Ruby Analyzer Guild • Updated: 2025-11-10
## 1. Goals & Non-Goals
- **Goals**
  - Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
  - Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
  - Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
  - Provide CLI verbs (`stella ruby inspect`, `stella ruby resolve`) and Offline Kit parity for air-gapped deployments.
- **Non-Goals**
  - Shipping dynamic runtime profilers (log-based or APM) in this iteration.
  - Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.
## 2. Scope & Inputs
| Input | Location | Notes |
|-------|----------|-------|
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (`vendor/bundle`, `.bundle/config`) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| `.gemspec` files / cached specs | `~/.bundle/cache`, `vendor/cache`, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | `config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |
## 3. High-Level Architecture
```
┌─────────────────────────┐        ┌────────────────────┐
│ Bundler Lock Collector  │───────▶│ Package Graph      │
└─────────────────────────┘        │ Aggregator         │
                                   └─────────┬──────────┘
┌─────────────────────────┐                  │
│ Gemspec Inspector       │─────────────────▶│
└─────────────────────────┘                  │
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│ Require/Autoload Scan   │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘
```
## 4. Detailed Components
### 4.1 Bundler Lock Collector
- Parse `Gemfile.lock` deterministically (no network) using the new `RubyLockCollector` under `StellaOps.Scanner.Analyzers.Lang.Ruby`.
- Support alternative manifests (`gems.rb`, `gems.locked`) and workspace overrides.
- Emit package nodes with fields: `name`, `version`, `source` (path/git/rubygems), `bundlerGroup[]`, `platform`, `declaredOnly` flag.
- Implementation:
  - Reuse the parsing strategy from Trivy (`pkg/fanal/analyzer/language/ruby/bundler`), ported to C# with a streaming reader and stable ordering.
  - Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.
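
The lockfile pass described above can be illustrated with a minimal sketch. It is written in Python for brevity (the design calls for a C# port), and `LockedGem`, `parse_lockfile`, and the `"ruby"` default platform label are hypothetical names chosen for this example, not the analyzer's actual API. It covers only the `GEM`, `GIT`, and `PATH` sections and treats everything after the first `-` in a version string as the platform, which is a simplification.

```python
import re
from dataclasses import dataclass

# Spec entries sit at exactly four spaces of indent; their dependency
# constraints sit at six spaces and are intentionally not matched.
SPEC_RE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$")

# Lockfile section header -> source kind used in the package node.
SOURCE_KINDS = {"GEM": "rubygems", "GIT": "git", "PATH": "path"}


@dataclass(frozen=True, order=True)
class LockedGem:
    """Hypothetical package node; field order doubles as the sort order."""
    name: str
    version: str
    platform: str
    source: str


def parse_lockfile(text):
    """Return locked gems in a stable order, without touching the network."""
    gems = set()
    section = None
    for line in text.splitlines():
        if line and not line.startswith(" "):
            section = line.strip()  # e.g. GEM, GIT, PATH, DEPENDENCIES
            continue
        match = SPEC_RE.match(line)
        if match and section in SOURCE_KINDS:
            name, raw_version = match.groups()
            # Assumption: "1.15.4-x86_64-linux" splits into version and platform.
            version, _, platform = raw_version.partition("-")
            gems.add(LockedGem(name, version, platform or "ruby", SOURCE_KINDS[section]))
    return sorted(gems)
```

Sorting the result rather than preserving file order is one way to get the stable ordering the design asks for; bundler groups are not recoverable from the lockfile alone and would have to come from the Gemfile.
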
### 4.2 Gemspec Inspector
- Scan cached specs under `vendor/cache`, `.bundle/cache`, and gem directories to pick up transitive packages when lockfiles are missing.
- Parse without executing Ruby by using a deterministic DSL subset (similar to Trivy's gemspec parser).
- Link results to lockfile entries by `<name, version, platform>`; create new records flagged `InferredFromSpec` when the lockfile is absent.
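
As a rough illustration of parsing a gemspec without executing it, the sketch below extracts only literal string assignments with a regular expression. This is an assumption about what a "deterministic DSL subset" could look like, not the analyzer's actual grammar; `read_gemspec_fields` is a hypothetical helper. Values computed at runtime (for example `s.version = Foo::VERSION`) are deliberately left unresolved.

```python
import re

# Matches `<receiver>.name = "literal"` style assignments with a plain
# quoted string. Interpolated or computed values do not match.
FIELD_RE = re.compile(
    r"""^\s*\w+\.(name|version|platform)\s*=\s*(["'])([^"'#]+)\2""",
    re.MULTILINE,
)


def read_gemspec_fields(text):
    """Return literal name/version/platform values found in a gemspec."""
    fields = {}
    for key, _quote, value in FIELD_RE.findall(text):
        fields.setdefault(key, value)  # first assignment wins
    return fields
```

A caller would treat a missing `version` key as "not statically resolvable" rather than guessing, which keeps the output deterministic.
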
### 4.3 Package Aggregator
- New orchestrator `RubyPackageAggregator` merges lock and gemspec data with installed gems from container layers (once runtime analyzer ships).
- Precedence: Installed > Lockfile > Gemspec.
- Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
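
The precedence and deduplication rules above can be sketched as follows. The record shape (plain dicts with an `origin` field) and the function name `aggregate` are assumptions made for this example; the real `RubyPackageAggregator` is a C# component whose types are not shown in this document.

```python
# Lower number wins: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


def aggregate(records):
    """Merge records by (name, version, platform), keeping the strongest origin.

    Every origin that reported the package is retained in `provenance`,
    ordered from strongest to weakest.
    """
    merged = {}
    for rec in records:
        key = (rec["name"], rec["version"], rec["platform"])
        current = merged.get(key)
        if current is None:
            merged[key] = {**rec, "provenance": [rec["origin"]]}
            continue
        provenance = sorted(
            set(current["provenance"]) | {rec["origin"]}, key=PRECEDENCE.get
        )
        stronger = PRECEDENCE[rec["origin"]] < PRECEDENCE[current["origin"]]
        winner = rec if stronger else current
        merged[key] = {**winner, "provenance": provenance}
    # Sort by key so repeated runs emit identical output.
    return [merged[key] for key in sorted(merged)]
```

Feeding the same package from all three origins, in any order, yields one record whose `origin` is `installed` and whose `provenance` lists all three.
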
### 4.4 Runtime Graph Builder
- Static analysis for `require`, `require_relative`, `autoload`, Zeitwerk conventions, and Rails initialisers.
- Implementation phases:
  1. **MVP (shipped in Sprint 138):** perform lightweight scanning using deterministic regex patterns scoped to Ruby sources. Captures explicit `require*` and `autoload` statements, records referencing files, and links back to packages when a matching lock entry exists.
  2. **Planned follow-up:** integrate tree-sitter Ruby under `StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax` for full AST coverage (Zeitwerk constants, conditional requires, dynamic module loading). This phase remains tracked under SCANNER-ANALYZERS-RUBY-28-003.
- Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine. Entrypoint detection currently keys off file location plus usage hints; richer framework-aware mapping will accompany the tree-sitter phase.
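
A regex-based scan of the kind the MVP phase describes might look like the sketch below. The patterns and the `scan_requires` helper are illustrative assumptions, not the shipped rules. Edges carry `from`, `to`, and `reason` as in the data contract, plus a `line` field added here for illustration; `confidence` is omitted because this document does not say how it is computed.

```python
import re

# Explicit string-literal requires only; dynamic requires are out of scope
# for a regex pass. `require_relative` is listed first so it is not read
# as a bare `require`.
REQUIRE_RE = re.compile(
    r"""^\s*(require_relative|require)\s*\(?\s*(["'])([^"']+)\2"""
)
AUTOLOAD_RE = re.compile(
    r"""^\s*autoload\s*\(?\s*:(\w+)\s*,\s*(["'])([^"']+)\2"""
)


def scan_requires(path, source):
    """Return one edge per explicit require/require_relative/autoload line."""
    edges = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = REQUIRE_RE.match(line)
        if match:
            edges.append(
                {"from": path, "to": match.group(3), "reason": match.group(1), "line": lineno}
            )
            continue
        match = AUTOLOAD_RE.match(line)
        if match:
            edges.append(
                {"from": path, "to": match.group(3), "reason": "autoload", "line": lineno}
            )
    return edges
```

Because each pattern is anchored at the start of the line, commented-out requires and calls such as `Bundler.require` are skipped, which illustrates why the tree-sitter phase is needed for conditional and dynamic loading.
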
### 4.5 Capability & Surface Signals
- Emit evidence documents for:
  - Process/exec usage (`Kernel.system`, `` `cmd` ``, `Open3`).
  - Network clients (`Net::HTTP`, `Faraday`, `Redis`, `ActiveRecord::Base.establish_connection`).
  - Serialization sinks (`Marshal.load`, `YAML.load`, `Oj.load`).
  - Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
- Capability records flow to Policy Engine under `capability.ruby.*` namespaces to allow gating on dangerous constructs.
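
To make the shape of a capability record concrete, here is a small pattern-table sketch. The patterns cover only a few of the constructs listed above, the `kind` labels reuse the exec/net/serialization names from the goals, and the way `evidenceHash` is derived (SHA-256 over kind, path, line number, and source line) is purely an assumption for illustration.

```python
import hashlib
import re

# Hypothetical pattern table: (capability kind, detector).
CAPABILITY_PATTERNS = [
    ("exec", re.compile(r"\bKernel\.system\b|\bOpen3\.\w+|`[^`]+`")),
    ("net", re.compile(r"\bNet::HTTP\b|\bFaraday\b")),
    ("serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]


def scan_capabilities(path, source):
    """Return capability records {kind, location, evidenceHash} for one file."""
    records = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and whole-line comments
        for kind, pattern in CAPABILITY_PATTERNS:
            if pattern.search(stripped):
                evidence = f"{kind}\n{path}\n{lineno}\n{stripped}"
                records.append(
                    {
                        "kind": kind,
                        "location": f"{path}:{lineno}",
                        "evidenceHash": hashlib.sha256(evidence.encode("utf-8")).hexdigest(),
                    }
                )
    return records
```

Hashing only stable inputs (no timestamps or absolute host paths) means the same file produces the same evidence hashes on every run.
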
### 4.6 CLI & Offline Integration
- Add CLI verbs:
  - `stella ruby inspect <path>` runs the collector locally and outputs a JSON summary with provenance.
  - `stella ruby resolve --image <ref>` fetches scan artifacts and prints the dependency graph grouped by bundler group/platform.
- Ship analyzer DLLs and rules in the Offline Kit manifest; include autoload/Zeitwerk fingerprints and heuristics, hashed for determinism.
## 5. Data Contracts
| Artifact | Shape | Consumer |
|----------|-------|----------|
| `ruby_packages.json` | `RubyPackageInventory { scanId, imageDigest, generatedAt, packages[] }` where each package mirrors `{id, name, version, source, provenance, groups[], platform, runtime.*}` | SBOM Composer, Policy Engine |
| `ruby_runtime_edges.json` | Edges `{from, to, reason, confidence}` | EntryTrace overlay, Policy explain traces |
| `ruby_capabilities.json` | Capability `{kind, location, evidenceHash, params}` | Policy Engine (capability predicates) |
| `ruby_observation.json` | Summary document (packages, runtime edges, capability flags) | Surface manifest, Policy explain traces |

`ruby_packages.json` records are persisted in Mongo's `ruby.packages` collection via the `RubyPackageInventoryStore`. Scanner.WebService exposes the same payload through `GET /api/scans/{scanId}/ruby-packages` so Policy, CLI, and Offline Kit consumers can reuse the canonical inventory without re-running the analyzer. Each document is keyed by `scanId` and includes the resolved `imageDigest` plus the UTC timestamp recorded by the Worker.

All records follow AOC appender rules (immutable, tenant-scoped) and include `hash`, `layerDigest`, and `timestamp` normalized to UTC ISO-8601.
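
The inventory envelope and the UTC ISO-8601 normalization can be sketched as below. The field names come from the `RubyPackageInventory` shape in the table; the helper names `to_utc_iso8601` and `build_inventory`, the second-level precision, and the `Z` suffix are assumptions for this example rather than the stored format.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso8601(moment):
    """Normalize an aware datetime to a UTC ISO-8601 string with a Z suffix."""
    if moment.tzinfo is None:
        raise ValueError("naive timestamps are ambiguous; attach a timezone first")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_inventory(scan_id, image_digest, generated_at, packages):
    """Assemble a RubyPackageInventory-shaped dict with stable package order."""
    return {
        "scanId": scan_id,
        "imageDigest": image_digest,
        "generatedAt": to_utc_iso8601(generated_at),
        "packages": sorted(
            packages,
            key=lambda p: (p["name"], p["version"], p.get("platform", "")),
        ),
    }


# Usage: a timestamp recorded at +02:00 is stored as the equivalent UTC instant.
local = datetime(2025, 11, 13, 0, 20, 33, tzinfo=timezone(timedelta(hours=2)))
example = build_inventory("scan-1", "sha256:abc", local, [{"name": "rack", "version": "2.2.8"}])
```

Rejecting naive datetimes instead of assuming a zone avoids silently recording local time as UTC.
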
## 6. Testing Strategy
- **Fixtures**: Extend `fixtures/lang/ruby` with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
- **Fixtures**: Added `git-sources` scenario covering git/path dependencies, bundler groups, and vendor cache evidence for declared-only toggling.
- **Determinism**: Golden snapshots for package lists and capability outputs across repeated runs.
- **Integration**: Worker e2e to ensure per-layer aggregation; CLI golden outputs (`stella ruby inspect`).
- **Policy**: Unit tests verifying new predicates (`ruby.group`, `ruby.capability.exec`, etc.) in Policy Engine test suite.
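
One common way to implement the golden-snapshot determinism check is to compare digests of a canonical JSON serialization. The sketch below assumes that approach; `canonical_digest` and `matches_golden` are hypothetical helpers, and the actual test harness may compare snapshot files directly instead.

```python
import hashlib
import json


def canonical_digest(document):
    """Hash a JSON-serializable document independent of dict key order."""
    payload = json.dumps(
        document, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def matches_golden(document, golden_digest):
    """True when the analyzer output still matches the recorded snapshot."""
    return canonical_digest(document) == golden_digest
```

Key order is normalized here, but list order is not, so producers still need to emit packages and capabilities in a stable order for repeated runs to match.
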
## 7. Rollout Plan & Dependencies
1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.
**Dependencies**
- Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
- Policy Engine support for new predicates and capability schemas.
- Surface.Validation updates for git/path gem sources and secret resolution.
## 8. Open Questions
- Do we require dynamic runtime logs (e.g., `ActiveSupport::Notifications`) for confidence boosts? (defer to future iteration)
- Should we enforce signed gem provenance in MVP? Pending Product decision.
- Need alignment with Export Center on Ruby-specific manifest emissions.
## 9. Licensing & Offline Packaging (SCANNER-LIC-0001)
- **License**: tree-sitter core and `tree-sitter-ruby` grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-10).
- **Obligations**:
  1. Keep MIT license texts in `/third-party-licenses/` and ship them with Offline Kits (fulfilled via `build_offline_kit.py` copying the directory into staging).
  2. Track acknowledgements in `NOTICE.md` (completed).
  3. Record grammar provenance in build metadata once native parsers ship; current MVP uses regex-only parsing and does **not** bundle tree-sitter artifacts yet, so no generated sources are redistributed.
  4. When tree-sitter integration lands, ensure `tree-sitter-cli` remains a build-time tool only.
- **Deliverables**:
  - SCANNER-LIC-0001 tracks Legal sign-off; Offline Kit packaging now mirrors `third-party-licenses/`.
  - The Export Center recipe inherits the copied directory with deterministic hashing.
---
*References:*
- Trivy: `pkg/fanal/analyzer/language/ruby/bundler`, `pkg/fanal/analyzer/language/ruby/gemspec`
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk`