Initial commit (history squashed)

2025-10-07 10:14:21 +03:00
commit b97fc7685a
1132 changed files with 117842 additions and 0 deletions
--- a/src/FASTER_MODELING_AND_NORMALIZATION.md
+++ b/src/FASTER_MODELING_AND_NORMALIZATION.md
@@ -0,0 +1,139 @@
+Here’s a quick, practical idea to make your version-range modeling cleaner and faster to query.
+
+![A simple diagram showing a Vulnerability doc with an embedded normalizedVersions array next to a pipeline icon labeled “simpler aggregations”.](https://images.unsplash.com/photo-1515879218367-8466d910aaa4?q=80&w=1470&auto=format&fit=crop)
+
+# Rethinking `SemVerRangeBuilder` + MongoDB
+
+**Problem (today):** Version normalization rules live as a nested object (and often as a bespoke structure per source). This can force awkward `$objectToArray`, `$map`, and conditional logic in pipelines when you need to:
+
+* match “is version X affected?”
+* flatten ranges for analytics
+* de-duplicate across sources
+
+**Proposal:** Store *normalized version rules as an embedded collection (array of small docs)* instead of a single nested object.
+
+## Minimal background
+
+* **SemVer normalization**: converting all source-specific version notations into a single, strict representation (e.g., `>=1.2.3 <2.0.0`, exact pins, wildcards).
+* **Embedded collection**: an array of consistently shaped items inside the parent doc—great for `$unwind`-centric analytics and direct matches.
+
+## Suggested shape
+
+```jsonc
+{
+  "_id": "VULN-123",
+  "packageId": "pkg:npm/lodash",
+  "source": "NVD",
+  "normalizedVersions": [
+    {
+      "scheme": "semver",
+      "type": "range",                 // "range" | "exact" | "lt" | "lte" | "gt" | "gte"
+      "min": "1.2.3",                  // optional
+      "minInclusive": true,            // optional
+      "max": "2.0.0",                  // optional
+      "maxInclusive": false,           // optional
+      "notes": "from GHSA GHSA-xxxx"   // traceability
+    },
+    {
+      "scheme": "semver",
+      "type": "exact",
+      "value": "1.5.0"
+    }
+  ],
+  "metadata": { "ingestedAt": "2025-10-10T12:00:00Z" }
+}
+```
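To make the field semantics concrete, here is a minimal JavaScript sketch of how one rule of this shape could be evaluated in application code. It rests on several assumptions that the note does not state: versions are plain `major.minor.patch` strings, the `lt`/`lte`/`gt`/`gte` types carry their bound in `min` or `max`, a missing `minInclusive` means inclusive, and a missing `maxInclusive` means exclusive. The function names are illustrative only.

```javascript
// Compare two "major.minor.patch" strings numerically.
// Assumption: plain three-part versions; prerelease tags and build metadata
// are not handled and would need full SemVer precedence rules.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Return true when `version` satisfies a single rule of the suggested shape.
function ruleMatches(rule, version) {
  if (rule.type === "exact") {
    return compareVersions(version, rule.value) === 0;
  }
  // Every other type is treated as a pair of optional bounds.
  if (rule.min != null) {
    const c = compareVersions(version, rule.min);
    // Assumed default: the lower bound is inclusive unless flagged otherwise.
    if (c < 0 || (c === 0 && rule.minInclusive === false)) return false;
  }
  if (rule.max != null) {
    const c = compareVersions(version, rule.max);
    // Assumed default: the upper bound is exclusive unless flagged otherwise.
    if (c > 0 || (c === 0 && rule.maxInclusive !== true)) return false;
  }
  return true;
}

// A document is affected when any of its embedded rules matches.
function isAffected(doc, version) {
  return doc.normalizedVersions.some((rule) => ruleMatches(rule, version));
}
```

Because the comparison is numeric per component, `1.10.0` is correctly treated as newer than `1.5.0`, which a plain string comparison would get wrong.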
+
+### Why this helps
+
+* **Simpler queries**
+
+  * *Is v affected?*
+
+    ```js
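+    // Caveat: $lte/$gt on strings compare lexicographically ("1.10.0" sorts before "1.5.0"),
+    // so this is only correct when min/max/value are stored in a sortable encoding.
+    // It also assumes min-inclusive, max-exclusive ranges with both bounds present.
+    db.vulns.aggregate([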
+      { $match: { packageId: "pkg:npm/lodash" } },
+      { $unwind: "$normalizedVersions" },
+      { $match: {
+          $or: [
+            { "normalizedVersions.type": "exact", "normalizedVersions.value": "1.5.0" },
+            { "normalizedVersions.type": "range",
+              "normalizedVersions.min": { $lte: "1.5.0" },
+              "normalizedVersions.max": { $gt:  "1.5.0" } }
+          ]
+      }},
+      { $project: { _id: 1 } }
+    ])
+    ```
+  * No `$objectToArray`, fewer `$cond`s.
+
+* **Cheaper storage**
+
+  * Arrays of small, uniformly shaped docs tend to compress well and avoid wide nested structures with many nulls/keys; the actual saving depends on the data, so measure before relying on it.
+
+* **Easier dedup/merge**
+
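+  * `$unwind` → normalize → `$group` by `{scheme,type,min,minInclusive,max,maxInclusive,value}` to collapse equivalent rules across sources. Leaving the inclusivity flags out of the key would merge ranges that differ only at their bounds.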
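The dedup/merge idea can also be sketched outside the database. The JavaScript below is an illustration under assumptions, not the pipeline itself: `ruleKey` and `dedupeRules` are hypothetical names, the key includes the inclusivity flags on the assumption that ranges differing only at their bounds should stay distinct, and merged rules carry `notes` as an array so that provenance from every source survives.

```javascript
// Build a stable identity for a rule. Provenance (`notes`) is deliberately
// left out so that equal rules from different sources collide.
function ruleKey(rule) {
  return JSON.stringify([
    rule.scheme,
    rule.type,
    rule.min ?? null,
    rule.minInclusive ?? null,
    rule.max ?? null,
    rule.maxInclusive ?? null,
    rule.value ?? null,
  ]);
}

// Collapse equivalent rules, keeping each distinct provenance note once.
// Assumption: the merged rule stores `notes` as an array of strings.
function dedupeRules(rules) {
  const byKey = new Map();
  for (const rule of rules) {
    const key = ruleKey(rule);
    const existing = byKey.get(key);
    if (!existing) {
      byKey.set(key, { ...rule, notes: rule.notes ? [rule.notes] : [] });
    } else if (rule.notes && !existing.notes.includes(rule.notes)) {
      existing.notes.push(rule.notes);
    }
  }
  return [...byKey.values()];
}
```

A `Map` keeps first-seen order, so the output is deterministic for a given input order, which helps when diffing results between ingestion runs.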
+
+## Builder changes (`SemVerRangeBuilder`)
+
+* **Emit items, not a monolith**: have the builder return `IEnumerable<NormalizedVersionRule>`.
+* **Normalize early**: resolve “aliases” (`1.2.x`, `^1.2.3`, distro styles) into canonical `(type,min,max,…)` before persistence.
+* **Traceability**: include `notes`/`sourceRef` on each rule so you can re-materialize provenance during audits.
+
+### C# sketch
+
+```csharp
+using System.Collections.Generic;
+
+public record NormalizedVersionRule(
+    string Scheme,           // "semver"
+    string Type,             // "range" | "exact" | ...
+    string? Min = null,
+    bool? MinInclusive = null,
+    string? Max = null,
+    bool? MaxInclusive = null,
+    string? Value = null,
+    string? Notes = null
+);
+
+public static class SemVerRangeBuilder
+{
+    public static IEnumerable<NormalizedVersionRule> Build(string raw)
+    {
+        // TODO: parse `raw` (^1.2.3, 1.2.x, <=2.0.0, etc.); it is ignored in this sketch.
+        // The hard-coded rule below only illustrates the canonical output shape:
+        yield return new NormalizedVersionRule(
+            Scheme: "semver",
+            Type: "range",
+            Min: "1.2.3",
+            MinInclusive: true,
+            Max: "2.0.0",
+            MaxInclusive: false,
+            Notes: "nvd:ABC-123"
+        );
+    }
+}
+```
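The parsing step that the C# sketch leaves as a comment might look roughly like the following. It is written in JavaScript to match the query examples and is a sketch under assumptions rather than the builder's real logic: `buildRules` is a hypothetical name, only a handful of notations are recognised, caret ranges follow npm-style semantics, and the comparator types (`lt`, `lte`, `gt`, `gte`) are assumed to store their bound in `max` or `min`.

```javascript
// Hypothetical JavaScript counterpart of SemVerRangeBuilder.Build.
// Assumption: plain numeric versions only; prerelease tags and build metadata
// do not match any pattern and are rejected.
function buildRules(raw, notes) {
  const text = raw.trim();
  const range = (min, max) => ({
    scheme: "semver", type: "range",
    min, minInclusive: true, max, maxInclusive: false, notes,
  });

  // Caret range: the upper bound bumps the left-most non-zero component.
  let m = /^\^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (m) {
    const [major, minor, patch] = m.slice(1).map(Number);
    const max = major > 0 ? `${major + 1}.0.0`
      : minor > 0 ? `0.${minor + 1}.0`
      : `0.0.${patch + 1}`;
    return [range(`${major}.${minor}.${patch}`, max)];
  }

  // Patch wildcard such as 1.2.x or 1.2.*
  m = /^(\d+)\.(\d+)\.[xX*]$/.exec(text);
  if (m) {
    const [major, minor] = m.slice(1).map(Number);
    return [range(`${major}.${minor}.0`, `${major}.${minor + 1}.0`)];
  }

  // Single comparator: <=2.0.0, <2.0.0, >=1.0.0, >1.0.0
  m = /^(<=|>=|<|>)\s*(\d+\.\d+\.\d+)$/.exec(text);
  if (m) {
    const [, op, version] = m;
    const type = { "<": "lt", "<=": "lte", ">": "gt", ">=": "gte" }[op];
    return [op.startsWith(">")
      ? { scheme: "semver", type, min: version, minInclusive: op === ">=", notes }
      : { scheme: "semver", type, max: version, maxInclusive: op === "<=", notes }];
  }

  // Exact pin
  if (/^\d+\.\d+\.\d+$/.test(text)) {
    return [{ scheme: "semver", type: "exact", value: text, notes }];
  }

  throw new Error(`Unsupported version expression: ${raw}`);
}
```

Throwing on unrecognised input, rather than guessing, keeps malformed source data out of `normalizedVersions`; a real builder would likely collect such cases for review instead.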
+
+## Aggregation patterns you unlock
+
+* **Fast “affected version” lookups** via `$unwind + $match` (can complement with a computed sort key).
+* **Rollups**: count of vulns per `(major,minor)` by mapping each rule into bucketed segments.
+* **Cross-source reconciliation**: group identical rules to de-duplicate.
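The "computed sort key" mentioned for lookups could be a zero-padded encoding, so that string comparison in the database agrees with numeric version order. The sketch below is one possible encoding, not something the note prescribes: `versionSortKey`, `affectedFilter`, and the `minKey`/`maxKey`/`valueKey` field names are all hypothetical, and the filter assumes half-open ranges with both bounds present.

```javascript
// Encode "major.minor.patch" so that string order equals numeric order.
// Assumption: every component is numeric and fits in `width` digits;
// prerelease tags are rejected rather than encoded.
function versionSortKey(version, width = 6) {
  return version
    .split(".")
    .map((part) => {
      if (!/^\d+$/.test(part) || part.length > width) {
        throw new Error(`Cannot encode version component: ${part}`);
      }
      return part.padStart(width, "0");
    })
    .join(".");
}

// Build a find() filter for "is `version` affected?" against hypothetical
// minKey/maxKey/valueKey fields stored next to min/max/value on each rule.
// $elemMatch keeps all conditions on the same array element.
function affectedFilter(packageId, version) {
  const key = versionSortKey(version);
  return {
    packageId,
    normalizedVersions: {
      $elemMatch: {
        $or: [
          { type: "exact", valueKey: key },
          { type: "range", minKey: { $lte: key }, maxKey: { $gt: key } },
        ],
      },
    },
  };
}
```

With keys like these, a filter such as `affectedFilter("pkg:npm/lodash", "1.10.0")` can be passed to `find` without an `$unwind` stage, since `$elemMatch` evaluates each embedded rule on its own.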
+
+## Indexing tips
+
+* Compound index on `{ packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 }`.
+* If lookups by exact value are common: add a sparse index on `"normalizedVersions.value"` (or a partial index, which the MongoDB documentation recommends over sparse indexes).
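The two indexes above could be declared as data and applied through `createIndex(keys, options)`, which both mongosh collections and the Node.js driver expose. In this sketch the index names and the `applyIndexes` helper are assumptions added for illustration; only the key patterns and the `sparse` option come from the tips above.

```javascript
// Index definitions from the tips above. The `name` values are hypothetical.
const indexSpecs = [
  {
    // Both dotted paths live in the same array, so this is a valid multikey index.
    keys: {
      packageId: 1,
      "normalizedVersions.scheme": 1,
      "normalizedVersions.type": 1,
    },
    options: { name: "pkg_scheme_type" },
  },
  {
    // Sparse: documents with no rule carrying `value` are left out of the index.
    keys: { "normalizedVersions.value": 1 },
    options: { name: "exact_value", sparse: true },
  },
];

// `collection` is anything exposing createIndex(keys, options),
// for example db.vulns in mongosh.
function applyIndexes(collection) {
  return indexSpecs.map(({ keys, options }) =>
    collection.createIndex(keys, options)
  );
}
```

Keeping the specs as plain data makes it straightforward to review them in a migration and to compare them against what `getIndexes()` reports.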
+
+## Migration path (safe + incremental)
+
+1. **Dual-write**: keep old nested object while writing the new `normalizedVersions` array.
+2. **Backfill** existing docs with a one-time script using your current builder.
+3. **Cutover** queries/aggregations to the new path (behind a feature flag).
+4. **Clean up** old field after soak.
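Steps 2 and 3 can be sketched as two small pure functions. Since the note does not define the old nested object's shape, the legacy reader is passed in as a callback here; `backfillDoc`, `readRules`, `readLegacyRanges`, and `buildRules` are hypothetical names used only for illustration.

```javascript
// Step 2 (backfill): add normalizedVersions while leaving the legacy field
// untouched, so dual-write readers keep working.
// Assumptions: `readLegacyRanges(doc)` returns raw range strings from the old
// nested object, and `buildRules(raw)` returns an array of normalized rules.
function backfillDoc(doc, readLegacyRanges, buildRules) {
  // Already migrated: return as-is so the script can be re-run safely.
  if (Array.isArray(doc.normalizedVersions)) return doc;
  const normalizedVersions = readLegacyRanges(doc).flatMap((raw) =>
    buildRules(raw)
  );
  return { ...doc, normalizedVersions };
}

// Step 3 (cutover): choose the read path behind a flag so it can be reverted.
// Falls back to the legacy path for documents that were not backfilled yet.
function readRules(doc, useNormalized, readLegacyRanges, buildRules) {
  if (useNormalized && Array.isArray(doc.normalizedVersions)) {
    return doc.normalizedVersions;
  }
  return readLegacyRanges(doc).flatMap((raw) => buildRules(raw));
}
```

Making the backfill idempotent means an interrupted run can simply be restarted, and the fallback in `readRules` lets the flag be enabled before the backfill has finished.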
+
+If you want, I can draft:
+
+* a one-time Mongo backfill script,
+* the new EF/Mongo C# POCOs, and
+* a test matrix (edge cases: prerelease tags, build metadata, `0.*` semantics, distro-style ranges).