# Ruby Analyzer Parity Design (SCANNER-ENG-0009)
**Status:** Implemented • Owner: Ruby Analyzer Guild • Updated: 2025-11-10
## 1. Goals & Non-Goals
- **Goals**
  - Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
  - Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
  - Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
  - Provide CLI verbs (`stella ruby inspect`, `stella ruby resolve`) and Offline Kit parity for air-gapped deployments.
- **Non-Goals**
  - Shipping dynamic runtime profilers (log-based or APM) in this iteration.
  - Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.
## 2. Scope & Inputs
| Input | Location | Notes |
|-------|----------|-------|
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (`vendor/bundle`, `.bundle/config`) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| `.gemspec` files / cached specs | `~/.bundle/cache`, `vendor/cache`, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | `config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |
## 3. High-Level Architecture
```
┌─────────────────────────┐        ┌────────────────────┐
│ Bundler Lock Collector  │───────▶│ Package Graph      │
└─────────────────────────┘        │ Aggregator         │
                                   └─────────┬──────────┘
┌─────────────────────────┐                  │
│ Gemspec Inspector       │─────────────────▶│
└─────────────────────────┘                  │
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│ Require/Autoload Scan   │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘
```
## 4. Detailed Components
### 4.1 Bundler Lock Collector
- Parse `Gemfile.lock` deterministically (no network access) using the new `RubyLockCollector` under `StellaOps.Scanner.Analyzers.Lang.Ruby`.
- Support alternative manifests (`gems.rb`, `gems.locked`) and workspace overrides.
- Emit package nodes with fields: `name`, `version`, `source` (path/git/rubygems), `bundlerGroup[]`, `platform`, `declaredOnly` flag.
- Implementation:
  - Reuse parsing strategy from Trivy (`pkg/fanal/analyzer/language/ruby/bundler`) but port to C# with streaming reader and stable ordering.
  - Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.
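
As an illustration of the lockfile pass described above, the following is a minimal Python sketch, not the C# `RubyLockCollector` itself. It assumes the standard Bundler lockfile layout (section headers at column 0, spec lines indented by four spaces) and treats a `-suffix` on the version as the platform. The names `PackageNode` and `parse_lockfile` are hypothetical; `bundlerGroup[]` and `declaredOnly` are left out because they cannot be derived from the lockfile alone.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Lockfile section header -> `source` value used on package nodes.
SECTION_SOURCES = {"GEM": "rubygems", "GIT": "git", "PATH": "path"}
# Resolved specs sit at exactly four spaces; their dependencies sit at six.
SPEC_LINE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$")


@dataclass(frozen=True, order=True)
class PackageNode:  # hypothetical shape, for illustration only
    name: str
    version: str
    platform: str
    source: str


def parse_lockfile(text: str) -> list[PackageNode]:
    """Parse lockfile text without network access; output order is stable."""
    nodes = set()
    source = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line and not line[0].isspace():
            # New section; PLATFORMS, DEPENDENCIES, BUNDLED WITH map to None.
            source = SECTION_SOURCES.get(line)
            continue
        match = SPEC_LINE.match(line)
        if source is None or match is None:
            continue
        name, raw_version = match.groups()
        # Assumption: platform-specific gems appear as "1.15.4-x86_64-linux".
        version, _, platform = raw_version.partition("-")
        nodes.add(PackageNode(name, version, platform or "ruby", source))
    return sorted(nodes)
```

Sorting the frozen dataclass instances gives the same output order regardless of how sections are arranged in the file, which is the property the "stable ordering" requirement asks for.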
### 4.2 Gemspec Inspector
- Scan cached specs under `vendor/cache`, `.bundle/cache`, and gem directories to pick up transitive packages when lockfiles are missing.
- Parse without executing Ruby by using a deterministic DSL subset (similar to the Trivy gemspec parser).
- Link results to lockfile entries by `<name, version, platform>`; create new records flagged `InferredFromSpec` when the lockfile is absent.
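
One way to read a gemspec without executing it is to accept only literal string assignments and skip anything dynamic. The sketch below illustrates that idea in Python under that assumption; `inspect_gemspec` and the returned dictionary shape are hypothetical, and the real DSL subset may cover more attributes.

```python
import re

# Accept only literal assignments such as `s.name = "foo"` or
# `s.version = '1.2.3'.freeze`; interpolated or computed values are ignored.
ASSIGNMENT = re.compile(
    r"""^\s*\w+\.(name|version|platform)\s*=\s*(["'])([^"'#]+)\2(?:\.freeze)?\s*$"""
)


def inspect_gemspec(text):
    """Return a package record, or None when name/version are not literals."""
    fields = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and match.group(1) not in fields:
            fields[match.group(1)] = match.group(3)
    if "name" not in fields or "version" not in fields:
        return None  # dynamic spec: never evaluate Ruby to resolve it
    return {
        "name": fields["name"],
        "version": fields["version"],
        "platform": fields.get("platform", "ruby"),
        "provenance": "InferredFromSpec",
    }
```

Returning `None` for specs such as `s.version = Foo::VERSION` keeps the pass deterministic at the cost of coverage; such gems would have to be picked up from a lockfile or installed layer instead.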
### 4.3 Package Aggregator
- New orchestrator `RubyPackageAggregator` merges lock and gemspec data with installed gems from container layers (once runtime analyzer ships).
- Precedence: Installed > Lockfile > Gemspec.
- Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
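
The precedence and deduplication rules above can be sketched as follows. This is an illustrative Python version, not `RubyPackageAggregator`; the `origin` field and its values (`installed`, `lockfile`, `gemspec`) are assumed names for the three inputs.

```python
# Lower rank wins: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


def aggregate(records):
    """Merge records by (name, version, platform), keeping the best origin."""
    merged = {}
    for record in records:
        key = (record["name"], record["version"], record["platform"])
        current = merged.get(key)
        origins = {record["origin"]}
        if current is not None:
            origins |= set(current["provenance"])
        if current is None or PRECEDENCE[record["origin"]] < PRECEDENCE[current["origin"]]:
            current = dict(record)  # copy so caller-owned input is not mutated
        # Provenance lists every origin that reported the package, best first.
        current["provenance"] = sorted(origins, key=PRECEDENCE.__getitem__)
        merged[key] = current
    # Sort by key so the result does not depend on input order.
    return [merged[key] for key in sorted(merged)]
```

The winning record supplies the field values, while `provenance` still records every collector that saw the package, so a policy rule can distinguish "declared and installed" from "declared only".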
### 4.4 Runtime Graph Builder
- Static analysis for `require`, `require_relative`, `autoload`, Zeitwerk conventions, and Rails initialisers.
- Implementation phases:
  1. **MVP (shipped in Sprint 138):** perform lightweight scanning using deterministic regex patterns scoped to Ruby sources. Captures explicit `require*` and `autoload` statements, records referencing files, and links back to packages when a matching lock entry exists.
  2. **Planned follow-up:** integrate tree-sitter Ruby under `StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax` for full AST coverage (Zeitwerk constants, conditional requires, dynamic module loading). This phase remains tracked under SCANNER-ANALYZERS-RUBY-28-003.
- Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine. Entrypoint detection currently keys off file location plus usage hints; richer framework-aware mapping will accompany the tree-sitter phase.
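
A rough Python sketch of the MVP regex pass is shown below. The patterns, the `scan_source` helper, and the rule "the first path segment of a `require` target names the gem" are assumptions made for illustration; the shipped patterns may differ, and the `confidence` field from the data contract is omitted here.

```python
import re

# Literal-string forms only; dynamic requires are left to the AST phase.
REQUIRE = re.compile(r"""^\s*(require_relative|require)\s*\(?\s*(["'])([^"'#]+)\2""")
AUTOLOAD = re.compile(r"""^\s*autoload\s*\(?\s*:(\w+)\s*,\s*(["'])([^"'#]+)\2""")


def scan_source(path, text, lock_names):
    """Return runtime edges for one Ruby file, in source order."""
    edges = []
    for line in text.splitlines():
        match = REQUIRE.match(line)
        if match:
            reason, target = match.group(1), match.group(3)
        else:
            match = AUTOLOAD.match(line)
            if match is None:
                continue
            reason, target = "autoload", match.group(3)
        # Assumed heuristic: "httparty/request" belongs to gem "httparty".
        # require_relative always points inside the application itself.
        candidate = None if reason == "require_relative" else target.split("/")[0]
        edges.append({
            "from": path,
            "to": target,
            "reason": reason,
            "package": candidate if candidate in lock_names else None,
        })
    return edges
```

Anchoring each pattern at the start of the line means commented-out statements such as `# require "x"` are skipped, but trailing forms such as `foo if require "x"` are also missed, which is one of the gaps the tree-sitter phase is meant to close.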
### 4.5 Capability & Surface Signals
- Emit evidence documents for:
  - Process/exec usage (`Kernel.system`, `` `cmd` ``, `Open3`).
  - Network clients (`Net::HTTP`, `Faraday`, `Redis`, `ActiveRecord::Base.establish_connection`).
  - Serialization sinks (`Marshal.load`, `YAML.load`, `Oj.load`).
  - Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
- Capability records flow to Policy Engine under `capability.ruby.*` namespaces to allow gating on dangerous constructs.
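
To make the evidence shape concrete, here is a small Python sketch that maps a few of the constructs listed above to capability records. The rule set is deliberately partial, the `kind` suffixes (`exec`, `net`, `serialization`) are assumed names under the `capability.ruby.*` namespace, and hashing the trimmed source line is only one possible definition of `evidenceHash`.

```python
import hashlib
import re

# Partial, illustrative rule set; a real analyzer would need far more patterns.
RULES = [
    ("exec", re.compile(r"\b(?:Kernel\.)?system[\s(]|\bOpen3\.|`[^`]+`")),
    ("net", re.compile(r"\bNet::HTTP\b|\bFaraday\b|\bRedis\b")),
    ("serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]


def detect_capabilities(path, text):
    """Return capability records for one file, in source order."""
    records = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        for name, pattern in RULES:
            match = pattern.search(line)
            if match is None:
                continue
            records.append({
                "kind": f"capability.ruby.{name}",
                "location": f"{path}:{lineno}",
                # Assumption: evidence is identified by the trimmed line's SHA-256.
                "evidenceHash": hashlib.sha256(line.encode("utf-8")).hexdigest(),
                "params": {"match": match.group(0)},
            })
    return records
```

Because the records carry only a location and a hash, the same input always yields byte-identical output, which lets the golden-snapshot tests in section 6 cover capability detection as well.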
### 4.6 CLI & Offline Integration
- Add CLI verbs:
  - `stella ruby inspect <path>` runs collector locally, outputs JSON summary with provenance.
  - `stella ruby resolve --image <ref>` fetches scan artifacts, prints dependency graph grouped by bundler group/platform.
- Ship analyzer DLLs and rules in Offline Kit manifest; include autoload/zeitwerk fingerprints and heuristics hashed for determinism.
## 5. Data Contracts
| Artifact | Shape | Consumer |
|----------|-------|----------|
| `ruby_packages.json` | Array `{id, name, version, source, provenance, groups[], platform}` | SBOM Composer, Policy Engine |
| `ruby_runtime_edges.json` | Edges `{from, to, reason, confidence}` | EntryTrace overlay, Policy explain traces |
| `ruby_capabilities.json` | Capability `{kind, location, evidenceHash, params}` | Policy Engine (capability predicates) |
| `ruby_observation.json` | Summary document (packages, runtime edges, capability flags) | Surface manifest, Policy explain traces |
All records follow AOC appender rules (immutable, tenant-scoped) and include `hash`, `layerDigest`, and `timestamp` normalized to UTC ISO-8601.
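
The common envelope fields can be illustrated with a short Python sketch. It assumes that `hash` is a SHA-256 over a canonical JSON form (sorted keys, compact separators) of every other field; the document does not specify the hash input, so treat that choice, along with the helper name `seal_record`, as hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def seal_record(payload, layer_digest, observed_at):
    """Attach layerDigest, a UTC ISO-8601 timestamp, and a content hash."""
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    record = dict(payload)  # leave the caller's payload untouched
    record["layerDigest"] = layer_digest
    # Normalize any offset to UTC and render with a trailing Z.
    record["timestamp"] = observed_at.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    # Canonical form: key order and whitespace must not influence the hash.
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    record["hash"] = "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

With this scheme, two records that differ only in key order or in the timezone offset used to express the same instant produce the same `hash`.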
## 6. Testing Strategy
- **Fixtures**: Extend `fixtures/lang/ruby` with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
- **Fixtures**: Added `git-sources` scenario covering git/path dependencies, bundler groups, and vendor cache evidence for declared-only toggling.
- **Determinism**: Golden snapshots for package lists and capability outputs across repeated runs.
- **Integration**: Worker e2e to ensure per-layer aggregation; CLI golden outputs (`stella ruby inspect`).
- **Policy**: Unit tests verifying new predicates (`ruby.group`, `ruby.capability.exec`, etc.) in Policy Engine test suite.
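
The determinism check can be sketched as a small helper that serializes analyzer output canonically and compares it with a stored snapshot over several runs. This is a Python illustration only; `canonical` and `assert_deterministic` are hypothetical names, and the real suite presumably runs inside the C# test projects.

```python
import json


def canonical(document):
    """Serialize with sorted keys so dictionary ordering cannot leak in."""
    return json.dumps(document, sort_keys=True, indent=2) + "\n"


def assert_deterministic(produce, golden, runs=3):
    """Call `produce` repeatedly; every run must match the golden text."""
    for attempt in range(1, runs + 1):
        actual = canonical(produce())
        if actual != golden:
            raise AssertionError(f"run {attempt} diverged from golden snapshot")
```

In use, `produce` would wrap one analyzer invocation over a fixture directory and `golden` would be the checked-in expected JSON, so any ordering or timestamp leak shows up as a failing comparison.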
## 7. Rollout Plan & Dependencies
1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.
**Dependencies**
- Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
- Policy Engine support for new predicates and capability schemas.
- Surface.Validation updates for git/path gem sources and secret resolution.
## 8. Open Questions
- Do we require dynamic runtime logs (e.g., `ActiveSupport::Notifications`) for confidence boosts? (defer to future iteration)
- Should we enforce signed gem provenance in MVP? Pending Product decision.
- Need alignment with Export Center on Ruby-specific manifest emissions.
## 9. Licensing & Offline Packaging (SCANNER-LIC-0001)
- **License**: tree-sitter core and `tree-sitter-ruby` grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-10).
- **Obligations**:
  1. Keep MIT license texts in `/third-party-licenses/` and ship them with Offline Kits (fulfilled via `build_offline_kit.py` copying the directory into staging).
  2. Track acknowledgements in `NOTICE.md` (completed).
  3. Record grammar provenance in build metadata once native parsers ship; current MVP uses regex-only parsing and does **not** bundle tree-sitter artifacts yet, so no generated sources are redistributed.
  4. When tree-sitter integration lands, ensure `tree-sitter-cli` remains a build-time tool only.
- **Deliverables**:
  - SCANNER-LIC-0001 tracks Legal sign-off; Offline Kit packaging now mirrors `third-party-licenses/`.
  - Export Center recipe inherits the copied directory with deterministic hashing.
---
*References:*
- Trivy: `pkg/fanal/analyzer/language/ruby/bundler`, `pkg/fanal/analyzer/language/ruby/gemspec`
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk`