# Ruby Analyzer Parity Design (SCANNER-ENG-0009)

**Status:** Draft • Owner: Ruby Analyzer Guild • Updated: 2025-11-02

## 1. Goals & Non-Goals

- **Goals**
  - Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
  - Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
  - Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
  - Provide CLI verbs (`stella ruby inspect`, `stella ruby resolve`) and Offline Kit parity for air-gapped deployments.
- **Non-Goals**
  - Shipping dynamic runtime profilers (log-based or APM) in this iteration.
  - Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.

## 2. Scope & Inputs

| Input | Location | Notes |
|-------|----------|-------|
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (`vendor/bundle`, `.bundle/config`) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| `.gemspec` files / cached specs | `~/.bundle/cache`, `vendor/cache`, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | `config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |

## 3. High-Level Architecture

```
┌─────────────────────────┐        ┌────────────────────┐
│ Bundler Lock Collector  │───────▶│ Package Graph      │
└─────────────────────────┘        │ Aggregator         │
                                   └─────────┬──────────┘
┌─────────────────────────┐                  │
│ Gemspec Inspector       │─────────────────▶│
└─────────────────────────┘                  │
                                             ▼
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│ Require/Autoload Scan   │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                             │
                                             ▼
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                             │
                                             ▼
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘
```

## 4. Detailed Components

### 4.1 Bundler Lock Collector

- Parse `Gemfile.lock` deterministically (no network) using new `RubyLockCollector` under `StellaOps.Scanner.Analyzers.Lang.Ruby`.
- Support alternative manifests (`gems.rb`, `gems.locked`) and workspace overrides.
- Emit package nodes with fields: `name`, `version`, `source` (path/git/rubygems), `bundlerGroup[]`, `platform`, `declaredOnly` flag.
- Implementation:
  - Reuse parsing strategy from Trivy (`pkg/fanal/analyzer/language/ruby/bundler`) but port to C# with streaming reader and stable ordering.
  - Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.

### 4.2 Gemspec Inspector

- Scan cached specs under `vendor/cache`, `.bundle/cache`, and gem directories to pick up transitive packages when lockfiles missing.
- Parse without executing Ruby by using a deterministic DSL subset (similar to Trivy gemspec parser).
- Link results to lockfile entries by ``; create new records flagged `InferredFromSpec` when lockfile absent.

### 4.3 Package Aggregator

- New orchestrator `RubyPackageAggregator` merges lock and gemspec data with installed gems from container layers (once runtime analyzer ships).
- Precedence: Installed > Lockfile > Gemspec.
- Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
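
The precedence and deduplication rules for the Package Aggregator can be sketched as follows. This is a minimal illustration in Python, not the planned C# `RubyPackageAggregator`: the `PackageRecord` type, the `origin` labels, and the `aggregate` function are hypothetical names introduced here, and the sketch assumes that "provenance bits" can be modelled as the list of sources that reported a package.

```python
from dataclasses import dataclass, field

# Precedence from section 4.3: Installed > Lockfile > Gemspec (lower rank wins).
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


@dataclass
class PackageRecord:
    """Hypothetical record shape; field names are illustrative only."""
    name: str
    version: str
    platform: str  # e.g. "ruby" or "x86_64-linux"
    origin: str    # one of the PRECEDENCE keys
    provenance: list = field(default_factory=list)


def aggregate(records):
    """Deduplicate by (name, version, platform) and keep the strongest origin."""
    merged = {}
    for rec in records:
        key = (rec.name, rec.version, rec.platform)
        current = merged.get(key)
        if current is None:
            # Copy so the caller's input records are never mutated.
            merged[key] = PackageRecord(
                rec.name, rec.version, rec.platform, rec.origin, [rec.origin]
            )
            continue
        if rec.origin not in current.provenance:
            current.provenance.append(rec.origin)
        if PRECEDENCE[rec.origin] < PRECEDENCE[current.origin]:
            current.origin = rec.origin
    for rec in merged.values():
        # Order provenance by precedence so output does not depend on input order.
        rec.provenance.sort(key=PRECEDENCE.__getitem__)
    # Sort by package key so repeated runs emit identical output.
    return [merged[key] for key in sorted(merged)]
```

Sorting both the provenance list and the final output by fixed keys keeps the result independent of the order in which collectors report packages, which is one way to meet the determinism goal.
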
### 4.4 Runtime Graph Builder

- Static analysis for `require`, `require_relative`, `autoload`, Zeitwerk conventions, and Rails initialisers.
- Implementation phases:
  1. Parse AST using tree-sitter Ruby embedded under `StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax` with deterministic bindings.
  2. Generate edges `entrypoint -> file` and `file -> package` with reason codes (`require-static`, `autoload-zeitwerk`, `autoload-const_missing`).
  3. Identify framework entrypoints (Rails controllers, Rack middleware, Sidekiq workers) via heuristics defined in `SCANNER-ANALYZERS-RUBY-28-*` tasks.
- Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine.

### 4.5 Capability & Surface Signals

- Emit evidence documents for:
  - Process/exec usage (`Kernel.system`, `` `cmd` ``, `Open3`).
  - Network clients (`Net::HTTP`, `Faraday`, `Redis`, `ActiveRecord::Base.establish_connection`).
  - Serialization sinks (`Marshal.load`, `YAML.load`, `Oj.load`).
  - Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
- Capability records flow to Policy Engine under `capability.ruby.*` namespaces to allow gating on dangerous constructs.

### 4.6 CLI & Offline Integration

- Add CLI verbs:
  - `stella ruby inspect ` – runs collector locally, outputs JSON summary with provenance.
  - `stella ruby resolve --image ` – fetches scan artifacts, prints dependency graph grouped by bundler group/platform.
- Ship analyzer DLLs and rules in Offline Kit manifest; include autoload/zeitwerk fingerprints and heuristics hashed for determinism.

## 5. Data Contracts

| Artifact | Shape | Consumer |
|----------|-------|----------|
| `ruby_packages.json` | Array `{id, name, version, source, provenance, groups[], platform}` | SBOM Composer, Policy Engine |
| `ruby_runtime_edges.json` | Edges `{from, to, reason, confidence}` | EntryTrace overlay, Policy explain traces |
| `ruby_capabilities.json` | Capability `{kind, location, evidenceHash, params}` | Policy Engine (capability predicates) |

All records follow AOC appender rules (immutable, tenant-scoped) and include `hash`, `layerDigest`, and `timestamp` normalized to UTC ISO-8601.

## 6. Testing Strategy

- **Fixtures**: Extend `fixtures/lang/ruby` with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
- **Determinism**: Golden snapshots for package lists and capability outputs across repeated runs.
- **Integration**: Worker e2e to ensure per-layer aggregation; CLI golden outputs (`stella ruby inspect`).
- **Policy**: Unit tests verifying new predicates (`ruby.group`, `ruby.capability.exec`, etc.) in Policy Engine test suite.

## 7. Rollout Plan & Dependencies

1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once analyzer alpha ready.

**Dependencies**

- Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
- Policy Engine support for new predicates and capability schemas.
- Surface.Validation updates for git/path gem sources and secret resolution.

## 8. Open Questions

- Do we require dynamic runtime logs (e.g., `ActiveSupport::Notifications`) for confidence boosts? (defer to future iteration)
- Should we enforce signed gem provenance in MVP? Pending Product decision.
- Need alignment with Export Center on Ruby-specific manifest emissions.

## 9. Licensing & Offline Packaging (SCANNER-LIC-0001)

- **License**: tree-sitter core and `tree-sitter-ruby` grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-02).
- **Obligations**:
  1. Include both MIT license texts in `/third-party-licenses/` and in Offline Kit manifests.
  2. Update `NOTICE.md` to acknowledge embedded grammars per company policy.
  3. Record the grammar commit hashes in build metadata; regenerate generated C/WASM artifacts deterministically.
  4. Ensure build pipeline uses `tree-sitter-cli` only as a build-time tool (not redistributed) to avoid extra licensing obligations.
- **Deliverables**:
  - SCANNER-LIC-0001 to capture Legal sign-off and update packaging scripts.
  - Export Center to mirror license files into Offline Kit bundle.

---

*References:*

- Trivy: `pkg/fanal/analyzer/language/ruby/bundler`, `pkg/fanal/analyzer/language/ruby/gemspec`
- Gap analysis: `docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk`
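
*Illustrative sketch (Data Contracts):* section 5 requires every record to carry `hash`, `layerDigest`, and a `timestamp` normalized to UTC ISO-8601. The Python sketch below shows one way such a record could be sealed. The helper names `normalize_timestamp` and `seal_record` are hypothetical, and the hashing scheme (SHA-256 over canonical JSON of the payload plus layer digest, excluding the timestamp) is an assumption made for illustration, not a documented AOC rule.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def normalize_timestamp(ts: datetime) -> str:
    """Return the timestamp as UTC ISO-8601 with a trailing 'Z'."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def seal_record(payload: dict, layer_digest: str, ts: datetime) -> dict:
    """Attach layerDigest, timestamp, and a content hash to a record payload."""
    # Assumption: the hash covers payload and layer digest only, so the same
    # content scanned at different times yields the same hash.
    canonical = json.dumps(
        {"payload": payload, "layerDigest": layer_digest},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        **payload,
        "layerDigest": layer_digest,
        "timestamp": normalize_timestamp(ts),
        "hash": "sha256:" + digest,
    }


if __name__ == "__main__":
    # A scan time recorded at UTC+02:00 is emitted as 10:00:00Z.
    local = datetime(2025, 11, 2, 12, 0, 0, tzinfo=timezone(timedelta(hours=2)))
    record = seal_record({"name": "rack", "version": "3.0.8"}, "sha256:abc", local)
    print(json.dumps(record, sort_keys=True, indent=2))
```

Serializing with sorted keys and fixed separators means the hash does not depend on the order in which fields were added to the payload.
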