Here's a quick, practical idea to make your version-range modeling cleaner and faster to query.

[Diagram: a Vulnerability doc with an embedded normalizedVersions array next to a pipeline icon labeled “simpler aggregations”.]

Rethinking SemVerRangeBuilder + MongoDB

Problem (today): Version normalization rules live as a nested object (and often as a bespoke structure per source). This can force awkward $objectToArray, $map, and conditional logic in pipelines when you need to:

  • match “is version X affected?”
  • flatten ranges for analytics
  • de-duplicate across sources

Proposal: Store normalized version rules as an embedded collection (array of small docs) instead of a single nested object.

Minimal background

  • SemVer normalization: converting all source-specific version notations into a single, strict representation (e.g., >=1.2.3 <2.0.0, exact pins, wildcards).
  • Embedded collection: an array of consistently shaped items inside the parent doc—great for $unwind-centric analytics and direct matches.

Suggested shape

{
  "_id": "VULN-123",
  "packageId": "pkg:npm/lodash",
  "source": "NVD",
  "normalizedVersions": [
    {
      "scheme": "semver",
      "type": "range",                 // "range" | "exact" | "lt" | "lte" | "gt" | "gte"
      "min": "1.2.3",                  // optional
      "minInclusive": true,            // optional
      "max": "2.0.0",                  // optional
      "maxInclusive": false,           // optional
      "notes": "from GHSA GHSA-xxxx"   // traceability
    },
    {
      "scheme": "semver",
      "type": "exact",
      "value": "1.5.0"
    }
  ],
  "metadata": { "ingestedAt": "2025-10-10T12:00:00Z" }
}

Why this helps

  • Simpler queries

    • Is v affected?

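      // Caveat: $lte/$gt compare strings lexicographically ("1.10.0" sorts before
      // "1.9.0"), so this is only correct when min/max/value are stored in a
      // sortable form such as zero-padded components. It also hard-codes an
      // inclusive min and exclusive max and requires both bounds to be present.
      db.vulns.aggregate([
        { $match: { packageId: "pkg:npm/lodash" } },
        { $unwind: "$normalizedVersions" },
        { $match: {
            $or: [
              { "normalizedVersions.type": "exact", "normalizedVersions.value": "1.5.0" },
              { "normalizedVersions.type": "range",
                "normalizedVersions.min": { $lte: "1.5.0" },
                "normalizedVersions.max": { $gt:  "1.5.0" } }
            ]
        }},
        { $project: { _id: 1 } }
      ])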
      
    • No $objectToArray, fewer $conds.

  • Cheaper storage

    • Arrays of small, uniformly shaped docs tend to compress well, and optional fields can be omitted instead of stored as nulls in a wide nested structure.

  • Easier dedup/merge

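    • $unwind → normalize → $group by {scheme,type,min,minInclusive,max,maxInclusive,value} to collapse equivalent rules across sources. The inclusivity flags belong in the key; without them >=1.2.3 <2.0.0 and >=1.2.3 <=2.0.0 would collapse into one rule.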
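As a reference for what the "is v affected?" query is meant to compute, here is a minimal in-memory sketch in plain JavaScript. It compares versions by SemVer 2.0.0 precedence instead of string order and honors minInclusive/maxInclusive. The function names (parseSemVer, compareSemVer, ruleMatches, isAffected) are made up for illustration, and it assumes "lt"/"lte"/"gt"/"gte" rules are stored as one-sided intervals using the same min/max fields.

```javascript
// Parse "MAJOR.MINOR.PATCH[-prerelease][+build]" into comparable parts.
// Build metadata is dropped because SemVer ignores it for precedence.
function parseSemVer(version) {
  const m = /^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$/.exec(version);
  if (!m) throw new Error(`not a strict SemVer string: ${version}`);
  return {
    core: [Number(m[1]), Number(m[2]), Number(m[3])],
    pre: m[4] ? m[4].split(".") : [],
  };
}

// Returns a negative number, zero, or a positive number (SemVer 2.0.0 precedence).
function compareSemVer(a, b) {
  const x = parseSemVer(a);
  const y = parseSemVer(b);
  for (let i = 0; i < 3; i++) {
    if (x.core[i] !== y.core[i]) return x.core[i] - y.core[i];
  }
  // A version without a prerelease tag outranks one with a tag.
  if (x.pre.length === 0 || y.pre.length === 0) return y.pre.length - x.pre.length;
  for (let i = 0; i < Math.max(x.pre.length, y.pre.length); i++) {
    if (i >= x.pre.length) return -1; // fewer identifiers sorts first
    if (i >= y.pre.length) return 1;
    const p = x.pre[i];
    const q = y.pre[i];
    const pn = /^\d+$/.test(p);
    const qn = /^\d+$/.test(q);
    if (pn && qn && Number(p) !== Number(q)) return Number(p) - Number(q);
    if (pn !== qn) return pn ? -1 : 1; // numeric identifiers sort before alphanumeric
    if (!pn && p !== q) return p < q ? -1 : 1; // plain ASCII order
  }
  return 0;
}

// Assumption: every non-exact rule is an interval with optional min/max bounds.
function ruleMatches(rule, version) {
  if (rule.type === "exact") return compareSemVer(version, rule.value) === 0;
  if (rule.min != null) {
    const c = compareSemVer(version, rule.min);
    if (c < 0 || (c === 0 && !rule.minInclusive)) return false;
  }
  if (rule.max != null) {
    const c = compareSemVer(version, rule.max);
    if (c > 0 || (c === 0 && !rule.maxInclusive)) return false;
  }
  return true;
}

function isAffected(doc, version) {
  return doc.normalizedVersions.some((rule) => ruleMatches(rule, version));
}

const sample = {
  normalizedVersions: [
    { scheme: "semver", type: "range", min: "1.2.3", minInclusive: true, max: "2.0.0", maxInclusive: false },
  ],
};
console.log(isAffected(sample, "1.10.0")); // true
console.log("1.10.0" < "1.9.0");           // true: why raw string comparison is unsafe
```

One policy question this surfaces: under pure precedence, 2.0.0-rc.1 falls inside >=1.2.3 <2.0.0. Whether prereleases of the fixed version should count as affected is worth deciding explicitly and covering in tests.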

Builder changes (SemVerRangeBuilder)

  • Emit items, not a monolith: have the builder return IEnumerable<NormalizedVersionRule>.
  • Normalize early: resolve “aliases” (1.2.x, ^1.2.3, distro styles) into canonical (type,min,max,…) before persistence.
  • Traceability: include notes/sourceRef on each rule so you can re-materialize provenance during audits.
  • Lean projection helper: when you only need normalized rules (and not the intermediate primitives), prefer SemVerRangeRuleBuilder.BuildNormalizedRules(rawRange, patchedVersion, provenanceNote) to skip manual projections.

C# sketch

using System.Collections.Generic;

public record NormalizedVersionRule(
    string Scheme,           // "semver"
    string Type,             // "range" | "exact" | ...
    string? Min = null,
    bool? MinInclusive = null,
    string? Max = null,
    bool? MaxInclusive = null,
    string? Value = null,
    string? Notes = null
);

public static class SemVerRangeBuilder
{
    public static IEnumerable<NormalizedVersionRule> Build(string raw)
    {
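        // TODO: parse raw (^1.2.3, 1.2.x, <=2.0.0, etc.) and yield one
        // canonical rule per parsed constraint. The rule below is a
        // hard-coded placeholder that only shows the output shape.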
        yield return new NormalizedVersionRule(
            Scheme: "semver",
            Type: "range",
            Min: "1.2.3",
            MinInclusive: true,
            Max: "2.0.0",
            MaxInclusive: false,
            Notes: "nvd:ABC-123"
        );
    }
}
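The parsing step that the sketch above leaves as a comment could look roughly like this. It is written in JavaScript so it can be tried quickly in Node, and it covers only four input forms: caret ranges, x-ranges on the patch component, single comparators, and exact pins. It assumes npm-style semantics for ^ and x, and it assumes one-sided rules ("lt", "lte", "gt", "gte") are stored with just the relevant bound. normalizeRange and makeRange are hypothetical names, not part of the existing builder.

```javascript
// Build a half-open range rule: min inclusive, max exclusive.
function makeRange(min, max, notes) {
  return { scheme: "semver", type: "range", min, minInclusive: true, max, maxInclusive: false, notes };
}

// Convert one raw constraint into an array of canonical rules.
function normalizeRange(raw, notes) {
  const s = raw.trim();
  let m;

  // ^1.2.3 -> >=1.2.3 <2.0.0 (npm-style: bump the left-most non-zero component)
  if ((m = /^\^(\d+)\.(\d+)\.(\d+)$/.exec(s))) {
    const [major, minor, patch] = m.slice(1).map(Number);
    const max = major > 0 ? `${major + 1}.0.0`
      : minor > 0 ? `0.${minor + 1}.0`
      : `0.0.${patch + 1}`;
    return [makeRange(`${major}.${minor}.${patch}`, max, notes)];
  }

  // 1.2.x -> >=1.2.0 <1.3.0
  if ((m = /^(\d+)\.(\d+)\.[xX*]$/.exec(s))) {
    const [major, minor] = m.slice(1).map(Number);
    return [makeRange(`${major}.${minor}.0`, `${major}.${minor + 1}.0`, notes)];
  }

  // <=2.0.0, <2.0.0, >=1.0.0, >1.0.0 -> one-sided rule
  if ((m = /^(<=|>=|<|>)\s*(\d+\.\d+\.\d+)$/.exec(s))) {
    const op = m[1];
    const version = m[2];
    const type = { "<": "lt", "<=": "lte", ">": "gt", ">=": "gte" }[op];
    const rule = { scheme: "semver", type, notes };
    if (op.startsWith("<")) {
      rule.max = version;
      rule.maxInclusive = op === "<=";
    } else {
      rule.min = version;
      rule.minInclusive = op === ">=";
    }
    return [rule];
  }

  // 1.5.0 -> exact pin
  if (/^\d+\.\d+\.\d+$/.test(s)) {
    return [{ scheme: "semver", type: "exact", value: s, notes }];
  }

  // Anything else (tilde, hyphen ranges, distro styles) is out of scope here.
  throw new Error(`unsupported range syntax: ${raw}`);
}

console.log(normalizeRange("^0.2.3", "nvd:ABC-123"));
```

Throwing on unsupported input, rather than guessing, keeps bad rules out of the store; a real builder would probably record the raw string for later review instead.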

Aggregation patterns you unlock

  • Fast “affected version” lookups via $unwind + $match. Range comparisons on raw version strings sort lexicographically, so pair them with a computed, sortable key.
  • Rollups: count of vulns per (major,minor) by mapping each rule into bucketed segments.
  • Cross-source reconciliation: group identical rules to de-duplicate.
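Two of these patterns are easy to prototype outside the database. The sketch below shows one possible computed sort key (zero-padded components, so plain string comparison orders versions correctly) and an in-memory equivalent of the $group step for cross-source de-duplication. toSortKey, dedupeRules, and the sources field are hypothetical names. The key orders prereleases before their release but does not order prerelease tags among themselves by full SemVer rules, and it assumes each numeric component has at most 8 digits.

```javascript
// "1.10.0" -> "00000001.00000010.00000000~"
// Releases end in "~", which sorts after the "-" that starts a prerelease tag.
function toSortKey(version) {
  const m = /^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+.*)?$/.exec(version);
  if (!m) throw new Error(`not a strict SemVer string: ${version}`);
  const core = [m[1], m[2], m[3]].map((n) => n.padStart(8, "0")).join(".");
  return m[4] ? `${core}-${m[4]}` : `${core}~`;
}

// Collapse rules that describe the same constraint, keeping every provenance note.
function dedupeRules(rules) {
  const groups = new Map();
  for (const rule of rules) {
    const { notes, ...shape } = rule;
    // The key includes the inclusivity flags so [a, b) and [a, b] stay distinct.
    const key = JSON.stringify([
      shape.scheme, shape.type,
      shape.min ?? null, shape.minInclusive ?? null,
      shape.max ?? null, shape.maxInclusive ?? null,
      shape.value ?? null,
    ]);
    let group = groups.get(key);
    if (!group) {
      group = { ...shape, sources: [] };
      groups.set(key, group);
    }
    if (notes) group.sources.push(notes);
  }
  return [...groups.values()];
}

console.log(toSortKey("1.9.0") < toSortKey("1.10.0")); // true
```

If a key like this is stored next to min, max, and value, the $lte/$gt comparisons in the lookup query can target the key fields instead of the raw strings.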

Indexing tips

  • Compound index on { packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 }.
  • If lookups by exact value are common: add a sparse index on "normalizedVersions.value".
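In mongosh, the two indexes above might be created as follows. The calls are wrapped in a function (createVersionIndexes is a made-up name) so the snippet can be pasted into the shell and invoked with its db handle; the vulns collection name is taken from the query earlier in this note.

```javascript
// Create the indexes suggested above on the given database handle.
function createVersionIndexes(db) {
  return [
    // Compound multikey index: both dotted paths live in the same array,
    // which MongoDB allows for a compound index.
    db.vulns.createIndex({
      packageId: 1,
      "normalizedVersions.scheme": 1,
      "normalizedVersions.type": 1,
    }),
    // Sparse index: documents with no exact-value rule are left out of the index.
    db.vulns.createIndex({ "normalizedVersions.value": 1 }, { sparse: true }),
  ];
}

// In mongosh: createVersionIndexes(db)
```

It is worth running explain() on the real lookup queries before and after, since whether these indexes are selected depends on the actual query shapes.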

Migration path (safe + incremental)

  1. Dual-write: keep old nested object while writing the new normalizedVersions array.
  2. Backfill existing docs with a one-time script using your current builder.
  3. Cutover queries/aggregations to the new path (behind a feature flag).
  4. Clean up old field after soak.
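Step 2 could be sketched like this. Since this note does not spell out the shape of the old nested field, the legacy field name is passed in as a parameter and the current builder as a callback; planBackfill and both parameters are hypothetical. The function only plans an update operation, and it leaves the old field in place so the dual-write period from step 1 is preserved.

```javascript
// Return a bulkWrite-style updateOne operation for one document,
// or null when there is nothing to do.
function planBackfill(doc, buildRules, legacyField) {
  // Already migrated: skipping makes the script safe to re-run.
  if (Array.isArray(doc.normalizedVersions)) return null;

  const legacy = doc[legacyField];
  if (legacy === undefined) return null; // nothing to convert

  return {
    updateOne: {
      // The $exists guard avoids overwriting a concurrent dual-write.
      filter: { _id: doc._id, normalizedVersions: { $exists: false } },
      update: { $set: { normalizedVersions: buildRules(legacy) } },
    },
  };
}
```

The resulting operations can be collected in batches and passed to db.vulns.bulkWrite(ops), with null entries filtered out first.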

If you want, I can draft:

  • a one-time Mongo backfill script,
  • the new EF/Mongo C# POCOs, and
  • a test matrix (edge cases: prerelease tags, build metadata, 0.* semantics, distro-style ranges).