# Ruby Analyzer Parity Design (SCANNER-ENG-0009)

**Status:** Implemented • Owner: Ruby Analyzer Guild • Updated: 2025-11-10

## 1. Goals & Non-Goals
- **Goals**
  - Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
  - Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
  - Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
  - Provide CLI verbs (`stella ruby inspect`, `stella ruby resolve`) and Offline Kit parity for air-gapped deployments.
- **Non-Goals**
  - Shipping dynamic runtime profilers (log-based or APM) in this iteration.
  - Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.

## 2. Scope & Inputs
| Input | Location | Notes |
|-------|----------|-------|
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (`vendor/bundle`, `.bundle/config`) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| `.gemspec` files / cached specs | `~/.bundle/cache`, `vendor/cache`, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | `config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |

## 3. High-Level Architecture
```
┌─────────────────────────┐        ┌────────────────────┐
│ Bundler Lock Collector  │───────▶│ Package Graph      │
└─────────────────────────┘        │ Aggregator         │
                                   └─────────┬──────────┘
┌─────────────────────────┐                  │
│ Gemspec Inspector       │─────────────────▶│
└─────────────────────────┘                  │
                                             ▼
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│ Require/Autoload Scan   │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                             │
                                             ▼
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                             │
                                             ▼
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘
```

## 4. Detailed Components
### 4.1 Bundler Lock Collector
- Parse `Gemfile.lock` deterministically (no network) using the new `RubyLockCollector` under `StellaOps.Scanner.Analyzers.Lang.Ruby`.
- Support alternative manifests (`gems.rb`, `gems.locked`) and workspace overrides.
- Emit package nodes with fields: `name`, `version`, `source` (path/git/rubygems), `bundlerGroup[]`, `platform`, `declaredOnly` flag.
- Implementation:
  - Reuse the parsing strategy from Trivy (`pkg/fanal/analyzer/language/ruby/bundler`), ported to C# with a streaming reader and stable ordering.
  - Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.

### 4.2 Gemspec Inspector
- Scan cached specs under `vendor/cache`, `.bundle/cache`, and gem directories to pick up transitive packages when lockfiles are missing.
- Parse without executing Ruby by using a deterministic DSL subset (similar to the Trivy gemspec parser).
- Link results to lockfile entries by `<name, version, platform>`; create new records flagged `InferredFromSpec` when the lockfile is absent.
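
One way to read a gemspec without executing it is to accept only plain string-literal assignments and ignore everything else. The Python sketch below takes that approach; it is an illustration, not the shipped parser. `inspect_gemspec`, the `locked_keys` argument, and the returned dictionary shape are hypothetical, and the default platform `ruby` is an assumption.

```python
import re

# Accept only literal assignments such as `s.name = "rack"` or
# `spec.version = '3.0.8'`. Anything computed in Ruby is ignored.
ASSIGNMENT = re.compile(
    r"""^\s*\w+\.(?P<field>name|version|platform)\s*=\s*"""
    r"""(?P<quote>["'])(?P<value>[^"']*)(?P=quote)"""
)


def inspect_gemspec(text, locked_keys):
    """Return a record keyed by (name, version, platform), or None.

    None means name or version was not a plain literal, so the spec
    cannot be resolved without running Ruby.
    """
    fields = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and match["field"] not in fields:
            fields[match["field"]] = match["value"]  # first literal wins
    if "name" not in fields or "version" not in fields:
        return None
    key = (fields["name"], fields["version"], fields.get("platform", "ruby"))
    # Flag the record when no lockfile entry exists for the same key.
    return {"key": key, "inferredFromSpec": key not in locked_keys}
```

A spec whose version is computed (for example `s.version = Foo::VERSION`) yields `None` here rather than a guess, which keeps the result deterministic.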

### 4.3 Package Aggregator
- The new orchestrator `RubyPackageAggregator` merges lock and gemspec data with installed gems from container layers (once the runtime analyzer ships).
- Precedence: Installed > Lockfile > Gemspec.
- Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
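
The precedence and deduplication rules can be sketched as a single merge pass. This Python example is illustrative only: the record shape, the `origin` field, and the `aggregate` function are assumptions, and the real `RubyPackageAggregator` may represent provenance differently.

```python
# Lower number wins: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


def aggregate(records):
    """Merge records by (name, version, platform), keeping the strongest origin."""
    merged = {}
    for record in records:
        key = (record["name"], record["version"], record.get("platform", "ruby"))
        entry = merged.setdefault(
            key, {"key": key, "origin": record["origin"], "provenance": set()}
        )
        # Remember every source that reported this package.
        entry["provenance"].add(record["origin"])
        if PRECEDENCE[record["origin"]] < PRECEDENCE[entry["origin"]]:
            entry["origin"] = record["origin"]
    # Sort by key and order provenance by precedence for stable output.
    return [
        {**entry, "provenance": sorted(entry["provenance"], key=PRECEDENCE.__getitem__)}
        for _, entry in sorted(merged.items())
    ]
```

Keeping the full provenance set alongside the winning origin lets a consumer see that a package was, for instance, both installed and declared in a gemspec.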

### 4.4 Runtime Graph Builder
- Static analysis for `require`, `require_relative`, `autoload`, Zeitwerk conventions, and Rails initialisers.
- Implementation phases:
  1. **MVP (shipped in Sprint 138):** performs lightweight scanning using deterministic regex patterns scoped to Ruby sources. It captures explicit `require*` and `autoload` statements, records the referencing files, and links back to packages when a matching lock entry exists.
  2. **Planned follow-up:** integrate tree-sitter Ruby under `StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax` for full AST coverage (Zeitwerk constants, conditional requires, dynamic module loading). This phase remains tracked under SCANNER-ANALYZERS-RUBY-28-003.
- Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine. Entrypoint detection currently keys off file location plus usage hints; richer framework-aware mapping will accompany the tree-sitter phase.
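
To make the MVP phase concrete, here is a minimal Python sketch of a regex scan over one Ruby source file. The pattern and the `scan_requires` function are hypothetical and do not reproduce the shipped patterns. Mapping a require target to a package by its first path segment is an assumption, since gem names and require paths do not always match.

```python
import re

# Explicit forms only: require "x", require("x"), require_relative "x",
# and autoload :Const, "x". Dynamic requires are out of scope.
REQUIRE = re.compile(
    r"""^\s*(?P<kind>require_relative|require|autoload)\b[\s(]*"""
    r"""(?::\w+\s*,\s*)?["'](?P<target>[^"']+)["']"""
)


def scan_requires(path, source, gem_by_feature):
    """Return edges {from, to, reason, line, package} for one source file."""
    edges = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = REQUIRE.match(line)
        if not match:
            continue
        kind, target = match["kind"], match["target"]
        # require_relative points at local files, never at a locked gem.
        package = None
        if kind != "require_relative":
            package = gem_by_feature.get(target.split("/")[0])
        edges.append(
            {"from": path, "to": target, "reason": kind,
             "line": lineno, "package": package}
        )
    return edges
```

Because the pattern is anchored at the start of the line, commented-out requires and the word `require` inside string literals are not reported. Conditional or computed requires are missed entirely, which is the gap the tree-sitter phase is meant to close.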

### 4.5 Capability & Surface Signals
- Emit evidence documents for:
  - Process/exec usage (`Kernel.system`, `` `cmd` ``, `Open3`).
  - Network clients (`Net::HTTP`, `Faraday`, `Redis`, `ActiveRecord::Base.establish_connection`).
  - Serialization sinks (`Marshal.load`, `YAML.load`, `Oj.load`).
  - Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
- Capability records flow to Policy Engine under `capability.ruby.*` namespaces to allow gating on dangerous constructs.
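
A capability emitter of this kind can be approximated as a table of patterns applied line by line. The Python sketch below is only an illustration: the three kind names under `capability.ruby.*`, the patterns themselves, and the way `evidenceHash` is derived are all assumptions, and job schedulers and `params` are omitted.

```python
import hashlib
import re

# Hypothetical kinds and heuristics; a real rule set would be far larger.
CAPABILITY_PATTERNS = [
    ("capability.ruby.exec", re.compile(r"\bsystem\b|\bOpen3\.|`[^`]+`")),
    ("capability.ruby.net", re.compile(r"\bNet::HTTP\b|\bFaraday\b|\bRedis\b")),
    ("capability.ruby.serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]


def emit_capabilities(path, source):
    """Return one record per (line, capability kind) match."""
    records = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for kind, pattern in CAPABILITY_PATTERNS:
            if pattern.search(line):
                # Assumption: hash the location plus the trimmed line text.
                evidence = f"{path}:{lineno}:{line.strip()}"
                records.append(
                    {
                        "kind": kind,
                        "location": f"{path}:{lineno}",
                        "evidenceHash": hashlib.sha256(
                            evidence.encode("utf-8")
                        ).hexdigest(),
                    }
                )
    return records
```

Since the hash depends only on the path, line number, and line text, repeated runs over the same source produce identical records.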

### 4.6 CLI & Offline Integration
- Add CLI verbs:
  - `stella ruby inspect <path>` – runs the collector locally and outputs a JSON summary with provenance.
  - `stella ruby resolve --image <ref>` – fetches scan artifacts and prints the dependency graph grouped by bundler group/platform.
- Ship analyzer DLLs and rules in the Offline Kit manifest; include autoload/Zeitwerk fingerprints and heuristics hashed for determinism.

## 5. Data Contracts
| Artifact | Shape | Consumer |
|----------|-------|----------|
| `ruby_packages.json` | `RubyPackageInventory { scanId, imageDigest, generatedAt, packages[] }` where each package mirrors `{id, name, version, source, provenance, groups[], platform, runtime.*}` | SBOM Composer, Policy Engine |
| `ruby_runtime_edges.json` | Edges `{from, to, reason, confidence}` | EntryTrace overlay, Policy explain traces |
| `ruby_capabilities.json` | Capability `{kind, location, evidenceHash, params}` | Policy Engine (capability predicates) |
| `ruby_observation.json` | Summary document (packages, runtime edges, capability flags) | Surface manifest, Policy explain traces |

`ruby_packages.json` records are persisted in Mongo’s `ruby.packages` collection via the `RubyPackageInventoryStore`. Scanner.WebService exposes the same payload through `GET /api/scans/{scanId}/ruby-packages` so Policy, CLI, and Offline Kit consumers can reuse the canonical inventory without re-running the analyzer. Each document is keyed by `scanId` and includes the resolved `imageDigest` plus the UTC timestamp recorded by the Worker.

All records follow AOC appender rules (immutable, tenant-scoped) and include `hash`, `layerDigest`, and `timestamp` normalized to UTC ISO-8601.
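
The three envelope fields can be illustrated with a small Python sketch. This is not the project's hashing scheme: `seal_record` is a hypothetical helper, and hashing a key-sorted JSON form that includes `layerDigest` and `timestamp` is an assumption made so the example is self-contained.

```python
import hashlib
import json
from datetime import timezone


def seal_record(payload, layer_digest, observed_at):
    """Attach layerDigest, a UTC ISO-8601 timestamp, and a content hash."""
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    # Normalize any offset to UTC and render with a trailing Z.
    timestamp = observed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = {**payload, "layerDigest": layer_digest, "timestamp": timestamp}
    # Sorted keys and fixed separators make the bytes independent of key order.
    canonical = json.dumps(
        body, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**body, "hash": f"sha256:{digest}"}
```

Rejecting naive datetimes avoids silently treating local time as UTC, and the canonical form means two payloads with the same fields in a different order receive the same hash.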

## 6. Testing Strategy
- **Fixtures**: Extend `fixtures/lang/ruby` with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
- **Fixtures**: Added `git-sources` scenario covering git/path dependencies, bundler groups, and vendor cache evidence for declared-only toggling.
- **Determinism**: Golden snapshots for package lists and capability outputs across repeated runs.
- **Integration**: Worker e2e to ensure per-layer aggregation; CLI golden outputs (`stella ruby inspect`).
- **Policy**: Unit tests verifying new predicates (`ruby.group`, `ruby.capability.exec`, etc.) in Policy Engine test suite.

## 7. Rollout Plan & Dependencies
1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.

**Dependencies**
- Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
- Policy Engine support for new predicates and capability schemas.
- Surface.Validation updates for git/path gem sources and secret resolution.

## 8. Open Questions
- Do we require dynamic runtime logs (e.g., `ActiveSupport::Notifications`) for confidence boosts? (Deferred to a future iteration.)
- Should we enforce signed gem provenance in MVP? Pending Product decision.
- Need alignment with Export Center on Ruby-specific manifest emissions.

## 9. Licensing & Offline Packaging (SCANNER-LIC-0001)
- **License**: tree-sitter core and the `tree-sitter-ruby` grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-10).
- **Obligations**:
  1. Keep MIT license texts in `/third-party-licenses/` and ship them with Offline Kits (fulfilled via `build_offline_kit.py` copying the directory into staging).
  2. Track acknowledgements in `NOTICE.md` (completed).
  3. Record grammar provenance in build metadata once native parsers ship; current MVP uses regex-only parsing and does **not** bundle tree-sitter artifacts yet, so no generated sources are redistributed.
  4. When tree-sitter integration lands, ensure `tree-sitter-cli` remains a build-time tool only.
- **Deliverables**:
  - SCANNER-LIC-0001 tracks Legal sign-off; Offline Kit packaging now mirrors `third-party-licenses/`.
  - Export Center recipe inherits the copied directory with deterministic hashing.

---
*References:*
- Trivy: `pkg/fanal/analyzer/language/ruby/bundler`, `pkg/fanal/analyzer/language/ruby/gemspec`
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk`