

# Ruby Analyzer Parity Design (SCANNER-ENG-0009)
**Status:** Draft • Owner: Ruby Analyzer Guild • Updated: 2025-11-02
## 1. Goals & Non-Goals
- **Goals**
  - Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
  - Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
  - Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
  - Provide CLI verbs (`stella ruby inspect`, `stella ruby resolve`) and Offline Kit parity for air-gapped deployments.
- **Non-Goals**
  - Shipping dynamic runtime profilers (log-based or APM) in this iteration.
  - Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.
## 2. Scope & Inputs
| Input | Location | Notes |
|-------|----------|-------|
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (`vendor/bundle`, `.bundle/config`) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| `.gemspec` files / cached specs | `~/.bundle/cache`, `vendor/cache`, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | `config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |
## 3. High-Level Architecture
```
┌─────────────────────────┐        ┌────────────────────┐
│ Bundler Lock Collector  │───────▶│ Package Graph      │
└─────────────────────────┘        │ Aggregator         │
                                   └─────────┬──────────┘
┌─────────────────────────┐                  │
│ Gemspec Inspector       │─────────────────▶│
└─────────────────────────┘                  │
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│ Require/Autoload Scan   │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘
```
## 4. Detailed Components
### 4.1 Bundler Lock Collector
- Parse `Gemfile.lock` deterministically (no network) using a new `RubyLockCollector` under `StellaOps.Scanner.Analyzers.Lang.Ruby`.
- Support alternative manifests (`gems.rb`, `gems.locked`) and workspace overrides.
- Emit package nodes with fields: `name`, `version`, `source` (path/git/rubygems), `bundlerGroup[]`, `platform`, `declaredOnly` flag.
- Implementation:
  - Reuse the parsing strategy from Trivy (`pkg/fanal/analyzer/language/ruby/bundler`), ported to C# with a streaming reader and stable ordering.
  - Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.
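
To make the expected output of the lock collector concrete, the following is a minimal Python sketch of deterministic lockfile parsing. It is illustrative only: the real collector is planned in C#, and `LockEntry`, `parse_gemfile_lock`, and the `"ruby"` default platform are hypothetical names and choices, not part of the design. Bundler groups are omitted because they are declared in the `Gemfile` rather than in the lockfile.

```python
import re
from typing import NamedTuple

# Spec lines sit at exactly four spaces of indent: "    name (version[-platform])".
# Six-space lines are per-gem dependency constraints and are skipped here.
SPEC_LINE = re.compile(r"^ {4}(?P<name>\S+) \((?P<version>[^)]+)\)$")

# Lockfile section headers mapped to the `source` values named in this design.
SOURCE_KINDS = {"GEM": "rubygems", "GIT": "git", "PATH": "path"}


class LockEntry(NamedTuple):
    name: str
    version: str
    platform: str
    source: str


def parse_gemfile_lock(text: str) -> list[LockEntry]:
    """Return lockfile packages in a stable, sorted order (no network, no Ruby)."""
    entries = set()
    section = None
    for line in text.splitlines():
        if line and not line.startswith(" "):
            section = line.strip()  # e.g. GEM, GIT, PATH, PLATFORMS, DEPENDENCIES
            continue
        match = SPEC_LINE.match(line)
        if match and section in SOURCE_KINDS:
            # The platform suffix follows the first hyphen, e.g. "1.15.4-x86_64-linux".
            version, _, platform = match["version"].partition("-")
            entries.add(
                LockEntry(match["name"], version, platform or "ruby", SOURCE_KINDS[section])
            )
    return sorted(entries)
```

Collecting into a set and sorting before returning keeps the output independent of the order in which sections or specs appear in the file, which is the property the golden snapshots in the testing strategy rely on.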
### 4.2 Gemspec Inspector
- Scan cached specs under `vendor/cache`, `.bundle/cache`, and gem directories to pick up transitive packages when lockfiles are missing.
- Parse without executing Ruby by using a deterministic DSL subset (similar to the Trivy gemspec parser).
- Link results to lockfile entries by `<name, version, platform>`; create new records flagged `InferredFromSpec` when the lockfile is absent.
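
One way to read "deterministic DSL subset" is to accept only literal string assignments and leave everything else unresolved. The Python sketch below illustrates that idea under this assumption; `inspect_gemspec` and the regular expression are hypothetical and much narrower than a production parser would need to be.

```python
import re
from typing import Optional

# Accept only literal assignments such as `s.name = "rack"` or `spec.version = '2.2.8'`.
# Anything computed (for example `s.version = Rack::VERSION`) is deliberately not evaluated.
LITERAL_ASSIGNMENT = re.compile(
    r"""^\s*\w+\.(name|version|platform)\s*=\s*(["'])(.+?)\2""",
    re.MULTILINE,
)


def inspect_gemspec(text: str) -> dict[str, Optional[str]]:
    """Extract literal name/version/platform values without executing any Ruby."""
    fields: dict[str, Optional[str]] = {"name": None, "version": None, "platform": None}
    for key, _quote, value in LITERAL_ASSIGNMENT.findall(text):
        if fields[key] is None:  # the first literal assignment wins
            fields[key] = value
    return fields
```

A field left as `None` signals that the value was dynamic and could not be resolved statically, which a caller could surface as lower-confidence evidence rather than guessing.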
### 4.3 Package Aggregator
- A new orchestrator, `RubyPackageAggregator`, merges lock and gemspec data with installed gems from container layers (once the runtime analyzer ships).
- Precedence: Installed > Lockfile > Gemspec.
- Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
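
The precedence and deduplication rules above can be sketched as follows. This is a Python illustration, not the planned C# implementation; `PackageRecord`, `aggregate`, and the lowercase provenance labels are hypothetical names chosen for the example.

```python
from typing import Iterable, NamedTuple

# Lower rank wins, matching the stated precedence: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


class PackageRecord(NamedTuple):
    name: str
    version: str
    platform: str
    provenance: str  # one of the PRECEDENCE keys


def aggregate(records: Iterable[PackageRecord]) -> list[PackageRecord]:
    """Keep one record per (name, version, platform), preferring stronger provenance."""
    best: dict[tuple[str, str, str], PackageRecord] = {}
    for record in records:
        key = (record.name, record.version, record.platform)
        current = best.get(key)
        if current is None or PRECEDENCE[record.provenance] < PRECEDENCE[current.provenance]:
            best[key] = record
    # Sort by package key so the output does not depend on input order.
    return [best[key] for key in sorted(best)]
```

A fuller version would likely retain every provenance seen for a key instead of only the winning one, so that Policy Engine can explain why a package was included.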
### 4.4 Runtime Graph Builder
- Static analysis for `require`, `require_relative`, `autoload`, Zeitwerk conventions, and Rails initialisers.
- Implementation phases:
  1. Parse AST using tree-sitter Ruby embedded under `StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax` with deterministic bindings.
  2. Generate edges `entrypoint -> file` and `file -> package` with reason codes (`require-static`, `autoload-zeitwerk`, `autoload-const_missing`).
  3. Identify framework entrypoints (Rails controllers, Rack middleware, Sidekiq workers) via heuristics defined in `SCANNER-ANALYZERS-RUBY-28-*` tasks.
- Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine.
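
As a rough illustration of edge generation, the Python sketch below extracts static `require` and `require_relative` calls and emits edges with the `require-static` reason code. It uses a regular expression as a stand-in for the tree-sitter AST pass described above, so it is a simplification rather than the planned technique. `scan_requires` is a hypothetical name, and the `confidence` field from the data contract is omitted because its scoring is not defined here.

```python
import posixpath
import re

# Literal string arguments only; dynamic forms such as `require File.join(...)`
# and interpolated strings are intentionally not matched.
REQUIRE_CALL = re.compile(
    r"""^\s*(require_relative|require)\s*\(?\s*(["'])([^"'#]+)\2""",
    re.MULTILINE,
)


def scan_requires(file_path: str, source: str) -> list[dict]:
    """Return `{from, to, reason}` edges for the static requires in one file."""
    edges = []
    for kind, _quote, target in REQUIRE_CALL.findall(source):
        if kind == "require_relative":
            # Resolve relative to the directory of the requiring file.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(file_path), target)
            )
        edges.append({"from": file_path, "to": target, "reason": "require-static"})
    # Stable ordering keeps repeated runs byte-for-byte comparable.
    return sorted(edges, key=lambda edge: (edge["to"], edge["from"]))
```

Commented-out requires are skipped because the pattern is anchored to the start of the line. An AST-based pass would additionally handle heredocs, conditionals, and multi-line calls, which this stand-in does not.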
### 4.5 Capability & Surface Signals
- Emit evidence documents for:
  - Process/exec usage (`Kernel.system`, `` `cmd` ``, `Open3`).
  - Network clients (`Net::HTTP`, `Faraday`, `Redis`, `ActiveRecord::Base.establish_connection`).
  - Serialization sinks (`Marshal.load`, `YAML.load`, `Oj.load`).
  - Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
- Capability records flow to Policy Engine under `capability.ruby.*` namespaces to allow gating on dangerous constructs.
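
The sketch below shows how a handful of the constructs listed above might be turned into capability records shaped like the `{kind, location, evidenceHash, params}` contract. It is a line-based Python approximation under several assumptions: the `capability.ruby.<kind>` naming, the `symbol` parameter, and the way `evidenceHash` is derived are all hypothetical, and only a subset of the listed patterns is covered.

```python
import hashlib
import re

# A small, illustrative subset of the constructs named in this section.
CAPABILITY_PATTERNS = [
    ("exec", re.compile(r"\b(?:Kernel\.system|Open3\.\w+)")),
    ("net", re.compile(r"\b(?:Net::HTTP|Faraday)\b")),
    ("serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]


def scan_capabilities(file_path: str, source: str) -> list[dict]:
    """Emit one record per (line, capability kind) match, in source order."""
    records = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for kind, pattern in CAPABILITY_PATTERNS:
            match = pattern.search(line)
            if match is None:
                continue
            location = f"{file_path}:{line_no}"
            # Assumed derivation: hash of the location plus the matched symbol.
            evidence = f"{location}:{match.group(0)}"
            records.append({
                "kind": f"capability.ruby.{kind}",
                "location": location,
                "evidenceHash": hashlib.sha256(evidence.encode("utf-8")).hexdigest(),
                "params": {"symbol": match.group(0)},
            })
    return records
```

The trailing word boundary in the serialization pattern means `YAML.load_file` and `YAML.safe_load` are not reported by this sketch; whether those should be separate signals is a policy decision outside its scope.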
### 4.6 CLI & Offline Integration
- Add CLI verbs:
  - `stella ruby inspect <path>` runs the collector locally and outputs a JSON summary with provenance.
  - `stella ruby resolve --image <ref>` fetches scan artifacts and prints the dependency graph grouped by Bundler group and platform.
- Ship analyzer DLLs and rules in the Offline Kit manifest; include autoload/Zeitwerk fingerprints and heuristics, hashed for determinism.
## 5. Data Contracts
| Artifact | Shape | Consumer |
|----------|-------|----------|
| `ruby_packages.json` | Array `{id, name, version, source, provenance, groups[], platform}` | SBOM Composer, Policy Engine |
| `ruby_runtime_edges.json` | Edges `{from, to, reason, confidence}` | EntryTrace overlay, Policy explain traces |
| `ruby_capabilities.json` | Capability `{kind, location, evidenceHash, params}` | Policy Engine (capability predicates) |

All records follow AOC appender rules (immutable, tenant-scoped) and include `hash`, `layerDigest`, and a `timestamp` normalized to UTC in ISO-8601 format.
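
A minimal Python sketch of attaching the three common fields to a record follows. The choices here are assumptions rather than the AOC specification: `seal_record` is a hypothetical helper, the hash is SHA-256 over a canonical JSON form of the payload only, and the layer digest is a placeholder.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def seal_record(payload: dict, layer_digest: str, observed_at: datetime) -> dict:
    """Return a new record carrying `hash`, `layerDigest`, and a UTC `timestamp`."""
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    # Canonical form: sorted keys and fixed separators, so key order cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    timestamp = observed_at.astimezone(timezone.utc).isoformat(timespec="seconds")
    return {
        **payload,
        "hash": "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "layerDigest": layer_digest,
        "timestamp": timestamp.replace("+00:00", "Z"),
    }


# Example: an observation made at 14:30 in a +02:00 zone is normalised to 12:30 UTC.
example = seal_record(
    {"name": "rack", "version": "2.2.8"},
    "sha256:<layer-digest>",  # placeholder, not a real digest
    datetime(2025, 11, 2, 14, 30, 0, tzinfo=timezone(timedelta(hours=2))),
)
```

Because the function returns a new dictionary instead of mutating its input, it fits the immutable, append-only handling the records are meant to follow.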
## 6. Testing Strategy
- **Fixtures**: Extend `fixtures/lang/ruby` with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
- **Determinism**: Golden snapshots for package lists and capability outputs across repeated runs.
- **Integration**: Worker e2e to ensure per-layer aggregation; CLI golden outputs (`stella ruby inspect`).
- **Policy**: Unit tests verifying new predicates (`ruby.group`, `ruby.capability.exec`, etc.) in Policy Engine test suite.
## 7. Rollout Plan & Dependencies
1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.

**Dependencies**
- Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
- Policy Engine support for new predicates and capability schemas.
- Surface.Validation updates for git/path gem sources and secret resolution.
## 8. Open Questions
- Do we require dynamic runtime logs (e.g., `ActiveSupport::Notifications`) for confidence boosts? (defer to future iteration)
- Should we enforce signed gem provenance in MVP? Pending Product decision.
- Need alignment with Export Center on Ruby-specific manifest emissions.
## 9. Licensing & Offline Packaging (SCANNER-LIC-0001)
- **License**: tree-sitter core and `tree-sitter-ruby` grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-02).
- **Obligations**:
  1. Include both MIT license texts in `/third-party-licenses/` and in Offline Kit manifests.
  2. Update `NOTICE.md` to acknowledge embedded grammars per company policy.
  3. Record the grammar commit hashes in build metadata; regenerate generated C/WASM artifacts deterministically.
  4. Ensure build pipeline uses `tree-sitter-cli` only as a build-time tool (not redistributed) to avoid extra licensing obligations.
- **Deliverables**:
  - SCANNER-LIC-0001 to capture Legal sign-off and update packaging scripts.
  - Export Center to mirror license files into Offline Kit bundle.
---
*References:*
- Trivy: `pkg/fanal/analyzer/language/ruby/bundler`, `pkg/fanal/analyzer/language/ruby/gemspec`
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk`