git.stella-ops.org/docs/modules/scanner/design/ruby-analyzer.md
master bf2bf4b395 Add Ruby language analyzer and related functionality
- Introduced global usings for Ruby analyzer.
- Implemented RubyLockData, RubyLockEntry, and RubyLockParser for handling Gemfile.lock files.
- Created RubyPackage and RubyPackageCollector to manage Ruby packages and vendor cache.
- Developed RubyAnalyzerPlugin and RubyLanguageAnalyzer for analyzing Ruby projects.
- Added tests for Ruby language analyzer with sample Gemfile.lock and expected output.
- Included necessary project files and references for the Ruby analyzer.
- Added third-party licenses for tree-sitter dependencies.
2025-11-03 01:15:43 +02:00

Ruby Analyzer Parity Design (SCANNER-ENG-0009)

Status: Draft • Owner: Ruby Analyzer Guild • Updated: 2025-11-02

1. Goals & Non-Goals

  • Goals
    • Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
    • Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
    • Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
    • Provide CLI verbs (stella ruby inspect, stella ruby resolve) and Offline Kit parity for air-gapped deployments.
  • Non-Goals
    • Shipping dynamic runtime profilers (log-based or APM) in this iteration.
    • Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.

2. Scope & Inputs

| Input | Location | Notes |
| --- | --- | --- |
| Gemfile / Gemfile.lock | Source tree, layer filesystem | Handle multiple apps per repo; honour Bundler groups. |
| Vendor bundles (vendor/bundle, .bundle/config) | Layer filesystem | Needed for offline/built images; avoid double-counting platform-specific gems. |
| .gemspec files / cached specs | ~/.bundle/cache, vendor/cache, gems in layers | Support deterministic parsing without executing gem metadata. |
| Framework configs | config/application.rb, config/routes.rb, config/sidekiq.yml, etc. | Feed framework surface mapper. |
| Container metadata | Layer digests via RustFS CAS | Support incremental composition per layer. |

3. High-Level Architecture

┌─────────────────────────┐        ┌────────────────────┐
│  Bundler Lock Collector │───────▶│  Package Graph     │
└─────────────────────────┘        │  Aggregator        │
                                   └─────────┬──────────┘
┌─────────────────────────┐                │
│  Gemspec Inspector      │───────────────▶│
└─────────────────────────┘                │
                                           ▼
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│  Require/Autoload Scan  │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                           │
                                           ▼
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                           │
                                           ▼
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘

4. Detailed Components

4.1 Bundler Lock Collector

  • Parse Gemfile.lock deterministically (no network access) using a new RubyLockCollector under StellaOps.Scanner.Analyzers.Lang.Ruby.
  • Support alternative manifests (gems.rb, gems.locked) and workspace overrides.
  • Emit package nodes with fields: name, version, source (path/git/rubygems), bundlerGroup[], platform, declaredOnly flag.
  • Implementation:
    • Reuse the parsing strategy from Trivy (pkg/fanal/analyzer/language/ruby/bundler), ported to C# with a streaming reader and stable output ordering.
    • Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.
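
The lockfile pass described above can be illustrated with a minimal sketch. It is written in Python for brevity (the analyzer itself targets C#), and the names LockEntry and parse_lock are hypothetical, not part of the RubyLockCollector design. It assumes the usual Bundler layout in which resolved specs sit under GEM, GIT, or PATH sections, indented by exactly four spaces, and it treats any suffix after the first hyphen in the version as the platform, which is a simplification.

```python
import re
from typing import NamedTuple

class LockEntry(NamedTuple):
    name: str
    version: str
    platform: str
    source: str  # "rubygems", "git", or "path"

# Section headers that introduce resolved specs, mapped to a source kind.
_SECTION_SOURCES = {"GEM": "rubygems", "GIT": "git", "PATH": "path"}
# Resolved specs are indented exactly four spaces: "    name (version[-platform])".
# Their dependency constraints are indented six spaces and therefore do not match.
_SPEC_LINE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$")

def parse_lock(text: str) -> list[LockEntry]:
    """Return lockfile entries in a stable (sorted) order, without network access."""
    entries = set()
    source = None
    for line in text.splitlines():
        if line and not line.startswith(" "):
            # A new top-level section; None for PLATFORMS, DEPENDENCIES, etc.
            source = _SECTION_SOURCES.get(line.strip())
            continue
        match = _SPEC_LINE.match(line)
        if source and match:
            name, raw = match.groups()
            # Assumption: "1.15.4-x86_64-linux" splits into version and platform.
            version, _, platform = raw.partition("-")
            entries.add(LockEntry(name, version, platform or "ruby", source))
    return sorted(entries)
```

Sorting the de-duplicated entries is what gives stable ordering here; a streaming C# port would need an equivalent ordering step before emitting package nodes.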

4.2 Gemspec Inspector

  • Scan cached specs under vendor/cache, .bundle/cache, and gem directories to pick up transitive packages when lockfiles are missing.
  • Parse without executing Ruby by using a deterministic DSL subset (similar to Trivy's gemspec parser).
  • Link results to lockfile entries by <name, version, platform>; create new records flagged InferredFromSpec when the lockfile is absent.
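
As a rough illustration of parsing a gemspec without executing it, the sketch below reads only literal string assignments and gives up on anything computed. It is Python rather than the C# the analyzer would use; inspect_gemspec, the lock_keys parameter, and the inferredFromSpec key are hypothetical names chosen to mirror the bullets above, and the regular expression is a stand-in for whatever DSL subset the real inspector supports.

```python
import re
from typing import Optional, Set, Tuple

# Matches literal assignments such as `s.name = "rack"`. Anything computed
# (for example `s.version = Rack::VERSION`) is deliberately left unresolved.
_ASSIGN = re.compile(
    r"""^\s*\w+\.(name|version|platform)\s*=\s*(['"])(.*?)\2""", re.MULTILINE
)

def inspect_gemspec(
    text: str, lock_keys: Set[Tuple[str, str, str]]
) -> Optional[dict]:
    """Extract <name, version, platform> from gemspec text without running Ruby."""
    fields = {}
    for field, _quote, value in _ASSIGN.findall(text):
        fields.setdefault(field, value)  # first literal assignment wins
    if "name" not in fields or "version" not in fields:
        return None  # not statically resolvable; never fall back to evaluation
    platform = fields.get("platform", "ruby")
    key = (fields["name"], fields["version"], platform)
    return {
        "name": fields["name"],
        "version": fields["version"],
        "platform": platform,
        # Flag records that no lockfile entry accounts for.
        "inferredFromSpec": key not in lock_keys,
    }
```

Returning None for non-literal versions keeps the result deterministic: the same file always yields the same record, regardless of what Ruby constants would evaluate to at runtime.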

4.3 Package Aggregator

  • A new orchestrator, RubyPackageAggregator, merges lock and gemspec data with installed gems from container layers (once the runtime analyzer ships).
  • Precedence: Installed > Lockfile > Gemspec.
  • Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
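
The precedence and de-duplication rules above could look roughly like the following sketch. It is Python for illustration only; the aggregate function, the origin field, and the string labels "installed", "lockfile", and "gemspec" are assumptions, not names taken from RubyPackageAggregator.

```python
# Lower rank wins: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}

def aggregate(records):
    """Merge package records by (name, version, platform), keeping provenance."""
    merged = {}
    for record in records:
        key = (record["name"], record["version"], record["platform"])
        slot = merged.setdefault(key, {"winner": record, "provenance": set()})
        slot["provenance"].add(record["origin"])
        if PRECEDENCE[record["origin"]] < PRECEDENCE[slot["winner"]["origin"]]:
            slot["winner"] = record
    # Sort by package key so repeated runs emit identical output.
    return [
        {**slot["winner"], "provenance": sorted(slot["provenance"], key=PRECEDENCE.get)}
        for _key, slot in sorted(merged.items())
    ]
```

Each output record carries the fields of the highest-precedence observation plus the full list of origins that reported it, which is one way to expose provenance bits to a downstream consumer.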

4.4 Runtime Graph Builder

  • Static analysis for require, require_relative, autoload, Zeitwerk conventions, and Rails initialisers.
  • Implementation phases:
    1. Parse AST using tree-sitter Ruby embedded under StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax with deterministic bindings.
    2. Generate edges entrypoint -> file and file -> package with reason codes (require-static, autoload-zeitwerk, autoload-const_missing).
    3. Identify framework entrypoints (Rails controllers, Rack middleware, Sidekiq workers) via heuristics defined in SCANNER-ANALYZERS-RUBY-28-* tasks.
  • The output is merged with EntryTrace usage hints to support runtime filtering in Policy Engine.
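
To make the edge shape concrete, here is a small sketch that extracts static require and require_relative calls from a source file. It substitutes a regular expression for the tree-sitter AST described above, so it is only an approximation of the planned approach; scan_requires is a hypothetical name, and tagging require_relative with the require-static reason code is an assumption.

```python
import posixpath
import re

# Literal-string requires only; `require name` or interpolated strings are
# dynamic and are skipped rather than guessed at.
_REQUIRE = re.compile(
    r"""^\s*(require_relative|require)\s*\(?\s*(['"])([^'"#]+)\2""", re.MULTILINE
)

def scan_requires(path: str, source: str) -> list:
    """Return file -> target edges for statically resolvable requires."""
    edges = []
    for kind, _quote, target in _REQUIRE.findall(source):
        if kind == "require_relative":
            # Resolve against the requiring file's directory.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), target)
            )
            if not target.endswith(".rb"):
                target += ".rb"
        edges.append({"from": path, "to": target, "reason": "require-static"})
    return sorted(edges, key=lambda edge: (edge["from"], edge["to"]))
```

For a plain require, the "to" value is the required feature name; mapping that name to a package node would be a separate step against the aggregated package list.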

4.5 Capability & Surface Signals

  • Emit evidence documents for:
    • Process/exec usage (Kernel.system, backtick commands such as `cmd`, Open3).
    • Network clients (Net::HTTP, Faraday, Redis, ActiveRecord::Base.establish_connection).
    • Serialization sinks (Marshal.load, YAML.load, Oj.load).
    • Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
  • Capability records flow to Policy Engine under capability.ruby.* namespaces to allow gating on dangerous constructs.
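
A minimal sketch of capability detection follows, assuming line-based pattern matching rather than AST queries. It is illustrative Python; the rule table, the scan_capabilities name, and the choice of SHA-256 over the trimmed source line as evidenceHash are all assumptions, and only a few of the constructs listed above are covered.

```python
import hashlib
import re

# (kind, pattern) pairs, checked in this order for every line.
_RULES = [
    ("exec", re.compile(r"\bsystem\b|\bOpen3\b|`[^`]+`")),
    ("net", re.compile(r"\bNet::HTTP\b|\bFaraday\b")),
    ("serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]

def scan_capabilities(path: str, source: str) -> list:
    """Emit one record per (line, rule) hit, in source order."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip whole-line comments; trailing comments are not handled
        for kind, pattern in _RULES:
            if pattern.search(line):
                digest = hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
                found.append(
                    {"kind": kind, "location": f"{path}:{number}", "evidenceHash": digest}
                )
    return found
```

Hashing the evidence line rather than embedding it keeps records compact while still letting two runs be compared for equality.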

4.6 CLI & Offline Integration

  • Add CLI verbs:
    • stella ruby inspect <path>: runs the collector locally and outputs a JSON summary with provenance.
    • stella ruby resolve --image <ref>: fetches scan artifacts and prints the dependency graph grouped by Bundler group and platform.
  • Ship analyzer DLLs and rules in the Offline Kit manifest; include autoload/Zeitwerk fingerprints and heuristics, hashed for determinism.

5. Data Contracts

| Artifact | Shape | Consumer |
| --- | --- | --- |
| ruby_packages.json | Array {id, name, version, source, provenance, groups[], platform} | SBOM Composer, Policy Engine |
| ruby_runtime_edges.json | Edges {from, to, reason, confidence} | EntryTrace overlay, Policy explain traces |
| ruby_capabilities.json | Capability {kind, location, evidenceHash, params} | Policy Engine (capability predicates) |

All records follow AOC appender rules (immutable, tenant-scoped) and include hash, layerDigest, and timestamp normalized to UTC ISO-8601.
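
One way to attach the common fields is sketched below in Python. The seal_record helper is hypothetical, and hashing a canonical JSON form (sorted keys, compact separators) with SHA-256 is an assumption about how hash might be computed; the source only states that the field exists.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

def seal_record(payload: dict, layer_digest: str, observed_at: datetime) -> dict:
    """Return a copy of payload with hash, layerDigest, and a UTC timestamp."""
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    # Canonical form: key order and whitespace cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        **payload,
        "hash": "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "layerDigest": layer_digest,
        "timestamp": observed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

# Example: an observation made at 03:15:43 +02:00 is stored as 01:15:43Z.
example = seal_record(
    {"name": "rack", "version": "2.2.8"},
    "sha256:<layer>",  # placeholder digest, not a real layer
    datetime(2025, 11, 2, 3, 15, 43, tzinfo=timezone(timedelta(hours=2))),
)
```

Because the function returns a new dictionary instead of mutating its input, it fits the append-only handling the records are meant to follow.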

6. Testing Strategy

  • Fixtures: Extend fixtures/lang/ruby with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
  • Determinism: Golden snapshots for package lists and capability outputs across repeated runs.
  • Integration: Worker end-to-end tests to verify per-layer aggregation; CLI golden outputs (stella ruby inspect).
  • Policy: Unit tests verifying new predicates (ruby.group, ruby.capability.exec, etc.) in Policy Engine test suite.

7. Rollout Plan & Dependencies

  1. Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
  2. Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
  3. Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
  4. Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.

Dependencies

  • Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
  • Policy Engine support for new predicates and capability schemas.
  • Surface.Validation updates for git/path gem sources and secret resolution.

8. Open Questions

  • Do we require dynamic runtime logs (e.g., ActiveSupport::Notifications) for confidence boosts? Deferred to a future iteration.
  • Should we enforce signed gem provenance in the MVP? Pending a Product decision.
  • How should Ruby-specific manifest emissions be aligned with Export Center?

9. Licensing & Offline Packaging (SCANNER-LIC-0001)

  • License: The tree-sitter core and the tree-sitter-ruby grammar are MIT-licensed (confirmed via upstream LICENSE files retrieved 2025-11-02).
  • Obligations:
    1. Include both MIT license texts in /third-party-licenses/ and in Offline Kit manifests.
    2. Update NOTICE.md to acknowledge embedded grammars per company policy.
    3. Record the grammar commit hashes in build metadata; regenerate the C/WASM artifacts deterministically.
    4. Ensure the build pipeline uses tree-sitter-cli only as a build-time tool (not redistributed) to avoid extra licensing obligations.
  • Deliverables:
    • SCANNER-LIC-0001 to capture Legal sign-off and update packaging scripts.
    • Export Center to mirror license files into Offline Kit bundle.

References:

  • Trivy: pkg/fanal/analyzer/language/ruby/bundler, pkg/fanal/analyzer/language/ruby/gemspec
  • Gap analysis: docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk