
master 7040984215 Add inline DSSE provenance documentation and Mongo schema

- Introduced a new document outlining the inline DSSE provenance for SBOM, VEX, scan, and derived events.
- Defined the Mongo schema for event patches, including key fields for provenance and trust verification.
- Documented the write path for ingesting provenance metadata and backfilling historical events.
- Created CI/CD snippets for uploading DSSE attestations and generating provenance metadata.
- Established Mongo indexes for efficient provenance queries and provided query recipes for various use cases.
- Outlined policy gates for managing VEX decisions based on provenance verification.
- Included UI nudges for displaying provenance information and implementation tasks for future enhancements.

---

Implement reachability lattice and scoring model

- Developed a comprehensive document detailing the reachability lattice and scoring model.
- Defined core types for reachability states, evidence, and mitigations with corresponding C# models.
- Established a scoring policy with base score contributions from various evidence classes.
- Mapped reachability states to VEX gates and provided a clear overview of evidence sources.
- Documented the event graph schema for persisting reachability data in MongoDB.
- Outlined the integration of runtime probes for evidence collection and defined a roadmap for future tasks.

---

Introduce uncertainty states and entropy scoring

- Created a draft document for tracking uncertainty states and their impact on risk scoring.
- Defined core uncertainty states with associated entropy values and evidence requirements.
- Established a schema for storing uncertainty states alongside findings.
- Documented the risk score calculation incorporating uncertainty and its effect on final risk assessments.
- Provided policy guidelines for handling uncertainty in decision-making processes.
- Outlined UI guidelines for displaying uncertainty information and suggested remediation actions.

---

Add Ruby package inventory management

- Implemented Ruby package inventory management with corresponding data models and storage mechanisms.
- Created C# records for Ruby package inventory, artifacts, provenance, and runtime details.
- Developed a repository for managing Ruby package inventory documents in MongoDB.
- Implemented a service for storing and retrieving Ruby package inventories.
- Added unit tests for the Ruby package inventory store to ensure functionality and data integrity.

2025-11-13 00:20:33 +02:00


Ruby Analyzer Parity Design (SCANNER-ENG-0009)

Status: Implemented • Owner: Ruby Analyzer Guild • Updated: 2025-11-10

1. Goals & Non-Goals

Goals
- Deterministically catalogue Ruby application dependencies (Gemfile/Gemfile.lock, vendored specs, .gem archives) for container layers and local workspaces.
- Build runtime usage graphs (require/require_relative, Zeitwerk autoloads, Rack boot chains, Sidekiq/ActiveJob schedulers).
- Emit capability signals (exec/fs/net/serialization, framework fingerprints, job schedulers) consumable by Policy Engine and explain traces.
- Provide CLI verbs (stella ruby inspect, stella ruby resolve) and Offline Kit parity for air-gapped deployments.
Non-Goals
- Shipping dynamic runtime profilers (log-based or APM) in this iteration.
- Implementing UI changes beyond exposing explain traces the Policy/UI guilds already support.

2. Scope & Inputs

Input	Location	Notes
Gemfile / Gemfile.lock	Source tree, layer filesystem	Handle multiple apps per repo; honour Bundler groups.
Vendor bundles (`vendor/bundle`, `.bundle/config`)	Layer filesystem	Needed for offline/built images; avoid double-counting platform-specific gems.
`.gemspec` files / cached specs	`~/.bundle/cache`, `vendor/cache`, gems in layers	Support deterministic parsing without executing gem metadata.
Framework configs	`config/application.rb`, `config/routes.rb`, `config/sidekiq.yml`, etc.	Feed framework surface mapper.
Container metadata	Layer digests via RustFS CAS	Support incremental composition per layer.

3. High-Level Architecture

┌─────────────────────────┐        ┌────────────────────┐
│  Bundler Lock Collector │───────▶│  Package Graph     │
└─────────────────────────┘        │  Aggregator        │
                                   └─────────┬──────────┘
┌─────────────────────────┐                │
│  Gemspec Inspector      │───────────────▶│
└─────────────────────────┘                │
                                           ▼
                                   ┌────────────────────┐
┌─────────────────────────┐        │ Runtime Graph      │
│  Require/Autoload Scan  │───────▶│ Builder (Zeitwerk) │
└─────────────────────────┘        └─────────┬──────────┘
                                           │
                                           ▼
                                   ┌────────────────────┐
                                   │ Capability Emitter │
                                   └─────────┬──────────┘
                                           │
                                           ▼
                                   ┌────────────────────┐
                                   │ SBOM Writer        │
                                   │ + Policy Signals   │
                                   └────────────────────┘

4. Detailed Components

4.1 Bundler Lock Collector

Parse Gemfile.lock deterministically (no network access) using the new RubyLockCollector under StellaOps.Scanner.Analyzers.Lang.Ruby.
Support alternative manifests (gems.rb, gems.locked) and workspace overrides.
Emit package nodes with fields: name, version, source (path/git/rubygems), bundlerGroup[], platform, declaredOnly flag.
Implementation:
- Reuse parsing strategy from Trivy (pkg/fanal/analyzer/language/ruby/bundler) but port to C# with streaming reader and stable ordering.
- Integrate with Surface.Validation to enforce size limits and tenant allowlists for git/path sources.
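
As a rough illustration of the lockfile pass described above, the following Python sketch reads the spec entries of a Gemfile.lock without touching the network and returns them in a stable order. It is not the C# RubyLockCollector; the names `LockedGem` and `parse_lockfile`, and the choice to split the platform suffix at the first hyphen, are assumptions made for this example only.

```python
import re
from typing import NamedTuple


class LockedGem(NamedTuple):
    # Hypothetical record type; field names loosely follow the package node fields.
    name: str
    version: str
    platform: str
    source: str  # "rubygems", "git", or "path"


# Top-level Gemfile.lock sections that carry a "specs:" list.
_SECTION_SOURCES = {"GEM": "rubygems", "GIT": "git", "PATH": "path"}
# Resolved specs sit at exactly four spaces; their dependencies at six.
_SPEC_LINE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$")


def parse_lockfile(text: str) -> list[LockedGem]:
    """Collect resolved specs from lockfile text, sorted for stable output."""
    gems = set()
    source = None
    for line in text.splitlines():
        if line and not line.startswith(" "):
            # A new top-level section; only GEM/GIT/PATH contain specs.
            source = _SECTION_SOURCES.get(line.strip())
            continue
        if source is None:
            continue
        match = _SPEC_LINE.match(line)
        if match:
            # Assumption: a platform suffix follows the first hyphen.
            version, _, platform = match.group(2).partition("-")
            gems.add(LockedGem(match.group(1), version, platform or "ruby", source))
    return sorted(gems)
```

Sorting the tuples gives the same output regardless of the order in which entries appear in the file, which is the property the golden-snapshot tests rely on.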

4.2 Gemspec Inspector

Scan cached specs under vendor/cache, .bundle/cache, and gem directories to pick up transitive packages when lockfiles are missing.
Parse without executing Ruby by using a deterministic DSL subset (similar to Trivy gemspec parser).
Link results to lockfile entries by <name, version, platform>; create new records flagged InferredFromSpec when the lockfile is absent.
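
A minimal Python sketch of the "parse without executing" idea follows. It recognises only literal string assignments to name, version, and platform, and treats anything computed (for example a version read from a constant) as unresolvable. The function name `inspect_gemspec`, the returned dictionary shape, and the default platform of "ruby" are assumptions for illustration, not the shipped inspector.

```python
import re

# Only literal string assignments are recognised; computed values are skipped,
# so no Ruby code is ever evaluated.
_ASSIGN = re.compile(
    r"""^\s*\w+\.(name|version|platform)\s*=\s*['"]([^'"]+)['"]""",
    re.MULTILINE,
)


def inspect_gemspec(text, lock_keys):
    """Return a record keyed by (name, version, platform), or None if unresolved.

    lock_keys is a set of (name, version, platform) tuples already known from
    a lockfile; it is empty when no lockfile was found.
    """
    fields = {}
    for attr, value in _ASSIGN.findall(text):
        fields.setdefault(attr, value)  # first literal assignment wins
    if "name" not in fields or "version" not in fields:
        return None
    key = (fields["name"], fields["version"], fields.get("platform", "ruby"))
    # Flag the record when it cannot be linked to a lockfile entry.
    return {"key": key, "inferredFromSpec": key not in lock_keys}
```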

4.3 Package Aggregator

New orchestrator RubyPackageAggregator merges lock and gemspec data with installed gems from container layers (once runtime analyzer ships).
Precedence: Installed > Lockfile > Gemspec.
Deduplicate by package key (name+version+platform) and attach provenance bits for Policy Engine.
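
The precedence and deduplication rules can be sketched as below. This is a Python illustration under stated assumptions, not the RubyPackageAggregator itself: `PackageRecord`, the `origin` strings, and the representation of provenance as a list of contributing origins are all hypothetical.

```python
from dataclasses import dataclass, field

# Lower rank wins: Installed > Lockfile > Gemspec.
PRECEDENCE = {"installed": 0, "lockfile": 1, "gemspec": 2}


@dataclass
class PackageRecord:
    name: str
    version: str
    platform: str
    origin: str  # one of the PRECEDENCE keys
    provenance: list = field(default_factory=list)


def aggregate(records):
    """Merge records by (name, version, platform), keeping the highest-precedence origin."""
    merged = {}
    # Visiting records in precedence order means the first one seen per key wins.
    for rec in sorted(records, key=lambda r: PRECEDENCE[r.origin]):
        key = (rec.name, rec.version, rec.platform)
        winner = merged.setdefault(
            key, PackageRecord(rec.name, rec.version, rec.platform, rec.origin)
        )
        # Every contributing source is remembered as a provenance bit.
        winner.provenance.append(rec.origin)
    # Stable ordering by package key keeps the output deterministic.
    return [merged[key] for key in sorted(merged)]
```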

4.4 Runtime Graph Builder

Static analysis for require, require_relative, autoload, Zeitwerk conventions, and Rails initialisers.
Implementation phases:
1. MVP (shipped in Sprint 138): perform lightweight scanning using deterministic regex patterns scoped to Ruby sources. The scan captures explicit require* and autoload statements, records the referencing files, and links each statement back to a package when a matching lock entry exists.
2. Planned follow-up: integrate tree-sitter Ruby under StellaOps.Scanner.Analyzers.Lang.Ruby.Syntax for full AST coverage (Zeitwerk constants, conditional requires, dynamic module loading). This phase remains tracked under SCANNER-ANALYZERS-RUBY-28-003.
Output merges with EntryTrace usage hints to support runtime filtering in Policy Engine. Entrypoint detection currently keys off file location plus usage hints; richer framework-aware mapping will accompany the tree-sitter phase.
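
The regex-based MVP scan might look roughly like the following Python sketch. The patterns, the `scan_source` name, the extra `line` and `package` fields, and the heuristic of matching the first path segment of a require target against lockfile gem names are all assumptions; the actual patterns used by the analyzer are not reproduced here.

```python
import re

# require_relative is listed first so the longer keyword is matched in full.
_REQUIRE = re.compile(r"""^\s*(require_relative|require)\s*\(?\s*['"]([^'"]+)['"]""")
_AUTOLOAD = re.compile(r"""^\s*autoload\s*\(?\s*:(\w+)\s*,\s*['"]([^'"]+)['"]""")


def scan_source(path, text, lock_names):
    """Return edges for explicit require/require_relative/autoload statements."""
    edges = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip full-line comments
        match = _REQUIRE.match(line)
        if match:
            kind, target = match.group(1), match.group(2)
        else:
            match = _AUTOLOAD.match(line)
            if not match:
                continue
            kind, target = "autoload", match.group(2)
        # Heuristic (assumption): the first path segment names the gem.
        root = target.split("/")[0]
        linked = kind != "require_relative" and root in lock_names
        edges.append({
            "from": path,
            "to": target,
            "reason": kind,
            "line": lineno,
            "package": root if linked else None,
        })
    return edges
```

Conditional requires and dynamically built paths are invisible to a line-oriented scan like this one, which is the gap the planned tree-sitter phase is meant to close.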

4.5 Capability & Surface Signals

Emit evidence documents for:
- Process/exec usage (Kernel.system, backtick commands such as `cmd`, Open3).
- Network clients (Net::HTTP, Faraday, Redis, ActiveRecord::Base.establish_connection).
- Serialization sinks (Marshal.load, YAML.load, Oj.load).
- Job schedulers (Sidekiq, Resque, ActiveJob, Whenever, Clockwork) with schedule metadata.
Capability records flow to Policy Engine under capability.ruby.* namespaces to allow gating on dangerous constructs.
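
To make the capability records concrete, here is a hedged Python sketch that matches a few of the constructs listed above and emits records carrying the `{kind, location, evidenceHash, params}` fields. The regex patterns, the exact kind strings under the capability.ruby.* namespace, and the inputs to the evidence hash are assumptions chosen for illustration.

```python
import hashlib
import re

# A small, illustrative subset of the constructs listed above.
_PATTERNS = [
    ("exec", re.compile(r"\b(?:Kernel\.)?system\b|\bOpen3\.\w+|`[^`]+`")),
    ("net", re.compile(r"\bNet::HTTP\b|\bFaraday\b")),
    ("serialization", re.compile(r"\b(?:Marshal|YAML|Oj)\.load\b")),
]


def detect_capabilities(path, text):
    """Return capability records for each matching line, in file order."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in _PATTERNS:
            match = pattern.search(line)
            if not match:
                continue
            # Assumption: the hash covers the location and the matched text.
            evidence = f"{path}:{lineno}:{match.group(0)}"
            found.append({
                "kind": f"capability.ruby.{kind}",
                "location": f"{path}:{lineno}",
                "evidenceHash": hashlib.sha256(evidence.encode("utf-8")).hexdigest(),
                "params": {"match": match.group(0)},
            })
    return found
```

Because the hash depends only on the file path, line number, and matched text, repeated runs over the same sources yield identical records.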

4.6 CLI & Offline Integration

Add CLI verbs:
- stella ruby inspect <path> – runs collector locally, outputs JSON summary with provenance.
- stella ruby resolve --image <ref> – fetches scan artifacts, prints dependency graph grouped by bundler group/platform.
Ship analyzer DLLs and rules in Offline Kit manifest; include autoload/zeitwerk fingerprints and heuristics hashed for determinism.

5. Data Contracts

Artifact	Shape	Consumer
`ruby_packages.json`	`RubyPackageInventory { scanId, imageDigest, generatedAt, packages[] }` where each package mirrors `{id, name, version, source, provenance, groups[], platform, runtime.*}`	SBOM Composer, Policy Engine
`ruby_runtime_edges.json`	Edges `{from, to, reason, confidence}`	EntryTrace overlay, Policy explain traces
`ruby_capabilities.json`	Capability `{kind, location, evidenceHash, params}`	Policy Engine (capability predicates)
`ruby_observation.json`	Summary document (packages, runtime edges, capability flags)	Surface manifest, Policy explain traces

ruby_packages.json records are persisted in Mongo’s ruby.packages collection via the RubyPackageInventoryStore. Scanner.WebService exposes the same payload through GET /api/scans/{scanId}/ruby-packages so Policy, CLI, and Offline Kit consumers can reuse the canonical inventory without re-running the analyzer. Each document is keyed by scanId and includes the resolved imageDigest plus the UTC timestamp recorded by the Worker.

All records follow AOC appender rules (immutable, tenant-scoped) and include hash, layerDigest, and timestamp normalized to UTC ISO-8601.
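
The UTC normalization and hashing requirement can be illustrated with a short Python sketch that assembles an inventory document with the `scanId`, `imageDigest`, `generatedAt`, and `packages[]` fields. The helper names, the canonical JSON form (sorted keys, compact separators), and the use of SHA-256 are assumptions; the actual hashing scheme is not specified in this document.

```python
import hashlib
import json
from datetime import datetime, timezone


def normalize_utc(ts: datetime) -> str:
    """Render an aware timestamp as UTC ISO-8601 with a trailing Z."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_inventory(scan_id, image_digest, generated_at, packages):
    """Return (document, sha256 hex digest of its canonical JSON form)."""
    doc = {
        "scanId": scan_id,
        "imageDigest": image_digest,
        "generatedAt": normalize_utc(generated_at),
        # Sorting packages makes the output independent of discovery order.
        "packages": sorted(
            packages,
            key=lambda p: (p["name"], p["version"], p.get("platform", "")),
        ),
    }
    # Assumption: canonical form is key-sorted, compact JSON.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return doc, hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A timestamp supplied with a +02:00 offset is rewritten to the equivalent UTC instant, so two workers in different time zones produce byte-identical documents for the same scan.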

6. Testing Strategy

Fixtures: Extend fixtures/lang/ruby with Rails, Sinatra, Sidekiq, Rack, container images (with/without vendor cache).
Fixtures: Added git-sources scenario covering git/path dependencies, bundler groups, and vendor cache evidence for declared-only toggling.
Determinism: Golden snapshots for package lists and capability outputs across repeated runs.
Integration: Worker e2e to ensure per-layer aggregation; CLI golden outputs (stella ruby inspect).
Policy: Unit tests verifying new predicates (ruby.group, ruby.capability.exec, etc.) in Policy Engine test suite.

7. Rollout Plan & Dependencies

Implement collectors and aggregators (SCANNER-ANALYZERS-RUBY-28-001..004).
Add capability analyzer and observations (SCANNER-ANALYZERS-RUBY-28-005..008).
Wire CLI commands and Offline Kit packaging (SCANNER-ANALYZERS-RUBY-28-011).
Update docs (DOCS-SCANNER-BENCH-62-009 follow-up) once the analyzer alpha is ready.

Dependencies

Tree-sitter Ruby grammar inclusion (needs Offline Kit packaging and licensing check).
Policy Engine support for new predicates and capability schemas.
Surface.Validation updates for git/path gem sources and secret resolution.

8. Open Questions

Do we require dynamic runtime logs (e.g., ActiveSupport::Notifications) for confidence boosts? (defer to future iteration)
Should we enforce signed gem provenance in MVP? Pending Product decision.
Need alignment with Export Center on Ruby-specific manifest emissions.

9. Licensing & Offline Packaging (SCANNER-LIC-0001)

License: tree-sitter core and tree-sitter-ruby grammar are MIT licensed (confirmed via upstream LICENSE files retrieved 2025-11-10).
Obligations:
1. Keep MIT license texts in /third-party-licenses/ and ship them with Offline Kits (fulfilled via build_offline_kit.py copying the directory into staging).
2. Track acknowledgements in NOTICE.md (completed).
3. Record grammar provenance in build metadata once native parsers ship; current MVP uses regex-only parsing and does not bundle tree-sitter artifacts yet, so no generated sources are redistributed.
4. When tree-sitter integration lands, ensure tree-sitter-cli remains a build-time tool only.
Deliverables:
- SCANNER-LIC-0001 tracks Legal sign-off; Offline Kit packaging now mirrors third-party-licenses/.
- Export Center recipe inherits the copied directory with deterministic hashing.

References:

Trivy: pkg/fanal/analyzer/language/ruby/bundler, pkg/fanal/analyzer/language/ruby/gemspec
Gap analysis: docs/benchmarks/scanner/scanning-gaps-stella-misses-from-competitors.md#ruby-analyzer-parity-trivy-grype-snyk
