Files
git.stella-ops.org/src/FASTER_MODELING_AND_NORMALIZATION.md

141 lines
5.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Heres a quick, practical idea to make your version-range modeling cleaner and faster to query.
![A simple diagram showing a Vulnerability doc with an embedded normalizedVersions array next to a pipeline icon labeled “simpler aggregations”.](https://images.unsplash.com/photo-1515879218367-8466d910aaa4?q=80\&w=1470\&auto=format\&fit=crop)
# Rethinking `SemVerRangeBuilder` + MongoDB
**Problem (today):** Version normalization rules live as a nested object (and often as a bespoke structure per source). This can force awkward `$objectToArray`, `$map`, and conditional logic in pipelines when you need to:
* match “is version X affected?”
* flatten ranges for analytics
* de-duplicate across sources
**Proposal:** Store *normalized version rules as an embedded collection (array of small docs)* instead of a single nested object.
## Minimal background
* **SemVer normalization**: converting all source-specific version notations into a single, strict representation (e.g., `>=1.2.3 <2.0.0`, exact pins, wildcards).
* **Embedded collection**: an array of consistently shaped items inside the parent doc—great for `$unwind`-centric analytics and direct matches.
## Suggested shape
```json
{
"_id": "VULN-123",
"packageId": "pkg:npm/lodash",
"source": "NVD",
"normalizedVersions": [
{
"scheme": "semver",
"type": "range", // "range" | "exact" | "lt" | "lte" | "gt" | "gte"
"min": "1.2.3", // optional
"minInclusive": true, // optional
"max": "2.0.0", // optional
"maxInclusive": false, // optional
"notes": "from GHSA GHSA-xxxx" // traceability
},
{
"scheme": "semver",
"type": "exact",
"value": "1.5.0"
}
],
"metadata": { "ingestedAt": "2025-10-10T12:00:00Z" }
}
```
### Why this helps
* **Simpler queries**
* *Is v affected?*
```js
db.vulns.aggregate([
{ $match: { packageId: "pkg:npm/lodash" } },
{ $unwind: "$normalizedVersions" },
{ $match: {
$or: [
{ "normalizedVersions.type": "exact", "normalizedVersions.value": "1.5.0" },
{ "normalizedVersions.type": "range",
"normalizedVersions.min": { $lte: "1.5.0" },
"normalizedVersions.max": { $gt: "1.5.0" } }
]
}},
{ $project: { _id: 1 } }
])
```
* No `$objectToArray`, fewer `$cond`s.
* **Cheaper storage**
* Arrays of tiny docs compress well and avoid wide nested structures with many nulls/keys.
* **Easier dedup/merge**
* `$unwind` → normalize → `$group` by `{scheme,type,min,max,value}` to collapse equivalent rules across sources.
## Builder changes (`SemVerRangeBuilder`)
* **Emit items, not a monolith**: have the builder return `IEnumerable<NormalizedVersionRule>`.
* **Normalize early**: resolve “aliases” (`1.2.x`, `^1.2.3`, distro styles) into canonical `(type,min,max,…)` before persistence.
* **Traceability**: include `notes`/`sourceRef` on each rule so you can re-materialize provenance during audits.
* **Lean projection helper**: when you only need normalized rules (and not the intermediate primitives), prefer `SemVerRangeRuleBuilder.BuildNormalizedRules(rawRange, patchedVersion, provenanceNote)` to skip manual projections.
### C# sketch
```csharp
public record NormalizedVersionRule(
string Scheme, // "semver"
string Type, // "range" | "exact" | ...
string? Min = null,
bool? MinInclusive = null,
string? Max = null,
bool? MaxInclusive = null,
string? Value = null,
string? Notes = null
);
public static class SemVerRangeBuilder
{
public static IEnumerable<NormalizedVersionRule> Build(string raw)
{
// parse raw (^1.2.3, 1.2.x, <=2.0.0, etc.)
// yield canonical rules:
yield return new NormalizedVersionRule(
Scheme: "semver",
Type: "range",
Min: "1.2.3",
MinInclusive: true,
Max: "2.0.0",
MaxInclusive: false,
Notes: "nvd:ABC-123"
);
}
}
```
## Aggregation patterns you unlock
* **Fast “affected version” lookups** via `$unwind + $match` (can complement with a computed sort key).
* **Rollups**: count of vulns per `(major,minor)` by mapping each rule into bucketed segments.
* **Cross-source reconciliation**: group identical rules to de-duplicate.
## Indexing tips
* Compound index on `{ packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 }`.
* If lookups by exact value are common: add a sparse index on `"normalizedVersions.value"`.
## Migration path (safe + incremental)
1. **Dual-write**: keep old nested object while writing the new `normalizedVersions` array.
2. **Backfill** existing docs with a one-time script using your current builder.
3. **Cutover** queries/aggregations to the new path (behind a feature flag).
4. **Clean up** old field after soak.
If you want, I can draft:
* a one-time Mongo backfill script,
* the new EF/Mongo C# POCOs, and
* a test matrix (edge cases: prerelease tags, build metadata, `0.*` semantics, distro-style ranges).