Here’s a quick, practical idea to make your version-range modeling cleaner and faster to query.

# Rethinking `SemVerRangeBuilder` + MongoDB

**Problem (today):** Version normalization rules live as a nested object (and often as a bespoke structure per source). This can force awkward `$objectToArray`, `$map`, and conditional logic in pipelines when you need to:

* match “is version X affected?”
* flatten ranges for analytics
* de-duplicate across sources

**Proposal:** Store *normalized version rules as an embedded collection (array of small docs)* instead of a single nested object.

## Minimal background

* **SemVer normalization**: converting all source-specific version notations (wildcards such as `1.2.x`, caret ranges such as `^1.2.3`) into a single, strict representation (e.g., `>=1.2.3 <2.0.0` or an exact pin).
* **Embedded collection**: an array of consistently shaped items inside the parent doc, which suits `$unwind`-centric analytics and direct matches.

## Suggested shape

```jsonc
{
  "_id": "VULN-123",
  "packageId": "pkg:npm/lodash",
  "source": "NVD",
  "normalizedVersions": [
    {
      "scheme": "semver",
      "type": "range",                // "range" | "exact" | "lt" | "lte" | "gt" | "gte"
      "min": "1.2.3",                 // optional
      "minInclusive": true,           // optional
      "max": "2.0.0",                 // optional
      "maxInclusive": false,          // optional
      "notes": "from GHSA GHSA-xxxx"  // traceability
    },
    {
      "scheme": "semver",
      "type": "exact",
      "value": "1.5.0"
    }
  ],
  "metadata": { "ingestedAt": "2025-10-10T12:00:00Z" }
}
```

### Why this helps

* **Simpler queries**

  * *Is v affected?*

    ```js
    db.vulns.aggregate([
      { $match: { packageId: "pkg:npm/lodash" } },
      { $unwind: "$normalizedVersions" },
      { $match: {
          $or: [
            { "normalizedVersions.type": "exact", "normalizedVersions.value": "1.5.0" },
            // Assumes an inclusive min and an exclusive max, as in the sample document.
            // $lte/$gt compare strings lexicographically ("1.10.0" sorts before "1.5.0"),
            // so real data needs a comparable form, such as a zero-padded sort key.
            { "normalizedVersions.type": "range",
              "normalizedVersions.min": { $lte: "1.5.0" },
              "normalizedVersions.max": { $gt: "1.5.0" } }
          ]
      }},
      { $project: { _id: 1 } }
    ])
    ```
  * No `$objectToArray`, fewer `$cond`s.

* **Cheaper storage**

  * Arrays of tiny docs tend to compress well and avoid wide nested structures with many nulls/keys.

* **Easier dedup/merge**

  * `$unwind` → normalize → `$group` by `{scheme,type,min,max,value}` to collapse equivalent rules across sources.

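
The `$group` key for de-duplication can be prototyped outside the database first. Below is a minimal JavaScript sketch, assuming rules shaped like the sample document. `ruleKey` and `dedupeRules` are hypothetical helper names, and adding the inclusivity flags to the key is an assumption (two ranges that differ only in inclusivity are not equivalent).

```javascript
// Hypothetical helper: builds a grouping key from the fields that define a rule.
// The inclusivity flags are added to the key as an assumption.
function ruleKey(rule) {
  return JSON.stringify([
    rule.scheme, rule.type,
    rule.min ?? null, rule.minInclusive ?? null,
    rule.max ?? null, rule.maxInclusive ?? null,
    rule.value ?? null,
  ]);
}

// Collapses equivalent rules and collects the distinct `notes` of each group,
// so provenance from every source survives the merge.
function dedupeRules(rules) {
  const groups = new Map();
  for (const rule of rules) {
    const key = ruleKey(rule);
    if (!groups.has(key)) {
      const { notes, ...shape } = rule;
      groups.set(key, { ...shape, notes: [] });
    }
    const group = groups.get(key);
    if (rule.notes && !group.notes.includes(rule.notes)) {
      group.notes.push(rule.notes);
    }
  }
  return [...groups.values()];
}

const merged = dedupeRules([
  { scheme: "semver", type: "range", min: "1.2.3", minInclusive: true,
    max: "2.0.0", maxInclusive: false, notes: "source A" },
  { scheme: "semver", type: "range", min: "1.2.3", minInclusive: true,
    max: "2.0.0", maxInclusive: false, notes: "source B" },
  { scheme: "semver", type: "exact", value: "1.5.0" },
]);
console.log(merged.length); // 2
```

The same key could serve as the `_id` of a `$group` stage once the field list has been agreed on.
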
## Builder changes (`SemVerRangeBuilder`)

* **Emit items, not a monolith**: have the builder return `IEnumerable<NormalizedVersionRule>`.
* **Normalize early**: resolve “aliases” (`1.2.x`, `^1.2.3`, distro styles) into canonical `(type,min,max,…)` before persistence.
* **Traceability**: include `notes`/`sourceRef` on each rule so you can re-materialize provenance during audits.
* **Lean projection helper**: when you only need normalized rules (and not the intermediate primitives), prefer `SemVerRangeRuleBuilder.BuildNormalizedRules(rawRange, patchedVersion, provenanceNote)` to skip manual projections.

### C# sketch

```csharp
using System.Collections.Generic;

public record NormalizedVersionRule(
    string Scheme,              // "semver"
    string Type,                // "range" | "exact" | ...
    string? Min = null,
    bool? MinInclusive = null,
    string? Max = null,
    bool? MaxInclusive = null,
    string? Value = null,
    string? Notes = null
);

public static class SemVerRangeBuilder
{
    public static IEnumerable<NormalizedVersionRule> Build(string raw)
    {
        // parse raw (^1.2.3, 1.2.x, <=2.0.0, etc.)
        // yield canonical rules (the values below are placeholders for the parsed result):
        yield return new NormalizedVersionRule(
            Scheme: "semver",
            Type: "range",
            Min: "1.2.3",
            MinInclusive: true,
            Max: "2.0.0",
            MaxInclusive: false,
            Notes: "nvd:ABC-123"
        );
    }
}
```
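
The parsing step that the C# sketch leaves as a comment might look roughly like this. It is written in JavaScript, to match the query examples, so it can be run on its own. `normalizeRange` is a hypothetical name; the caret handling follows npm's convention (the leftmost non-zero component stays fixed), and mapping `lt`/`lte`/`gt`/`gte` onto `max`/`min` plus an inclusivity flag is an assumption about the schema. Prerelease tags, build metadata, and distro-style ranges are not handled.

```javascript
// Hypothetical sketch: handles only ^X.Y.Z, X.Y.x, single comparators
// (<, <=, >, >=) and exact versions.
function normalizeRange(raw, notes = null) {
  const s = raw.trim();
  const range = (min, max) => ({
    scheme: "semver", type: "range",
    min, minInclusive: true, max, maxInclusive: false, notes,
  });

  let m = /^\^(\d+)\.(\d+)\.(\d+)$/.exec(s);
  if (m) {
    const [major, minor, patch] = m.slice(1).map(Number);
    // npm-style caret: the leftmost non-zero component may not change.
    const max = major > 0 ? `${major + 1}.0.0`
      : minor > 0 ? `0.${minor + 1}.0`
      : `0.0.${patch + 1}`;
    return range(`${major}.${minor}.${patch}`, max);
  }

  m = /^(\d+)\.(\d+)\.[xX*]$/.exec(s);
  if (m) {
    const [major, minor] = m.slice(1).map(Number);
    return range(`${major}.${minor}.0`, `${major}.${minor + 1}.0`);
  }

  m = /^(<=|>=|<|>)\s*(\d+\.\d+\.\d+)$/.exec(s);
  if (m) {
    const type = { "<": "lt", "<=": "lte", ">": "gt", ">=": "gte" }[m[1]];
    const bound = type.startsWith("l")
      ? { max: m[2], maxInclusive: type === "lte" }
      : { min: m[2], minInclusive: type === "gte" };
    return { scheme: "semver", type, ...bound, notes };
  }

  if (/^\d+\.\d+\.\d+$/.test(s)) {
    return { scheme: "semver", type: "exact", value: s, notes };
  }
  throw new Error(`Unsupported range notation: ${raw}`);
}

console.log(normalizeRange("^1.2.3", "nvd:ABC-123"));
```

Throwing on unrecognized input, rather than guessing, keeps malformed source data visible during ingestion.
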
## Aggregation patterns you unlock

* **Fast “affected version” lookups** via `$unwind` + `$match` (complement with a computed sort key so that version bounds compare correctly).
* **Rollups**: count of vulns per `(major,minor)` by mapping each rule into bucketed segments.
* **Cross-source reconciliation**: group identical rules to de-duplicate.

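
A computed sort key matters because plain string comparison orders `1.10.0` before `1.5.0`. Here is a minimal sketch of one possible key and of a per-rule check that honors the inclusivity flags. `sortKey` and `matchesRule` are hypothetical helpers; the zero-padding width is an assumption, and prerelease tags are not handled.

```javascript
// Hypothetical helper: turns "1.10.0" into "000001.000010.000000" so that plain
// string comparison agrees with numeric order. Assumes purely numeric
// components, each shorter than `width` digits.
function sortKey(version, width = 6) {
  return version.split(".").map((part) => part.padStart(width, "0")).join(".");
}

// Checks one rule against a version. A missing inclusivity flag is treated as
// exclusive, which is an assumption of this sketch.
function matchesRule(rule, version) {
  const v = sortKey(version);
  if (rule.type === "exact") return sortKey(rule.value) === v;
  if (rule.min != null) {
    const min = sortKey(rule.min);
    if (rule.minInclusive ? v < min : v <= min) return false;
  }
  if (rule.max != null) {
    const max = sortKey(rule.max);
    if (rule.maxInclusive ? v > max : v >= max) return false;
  }
  return true;
}

const rule = {
  type: "range",
  min: "1.2.3", minInclusive: true,
  max: "2.0.0", maxInclusive: false,
};
console.log("1.10.0" < "1.5.0");                   // true: raw strings sort in the wrong order
console.log(sortKey("1.10.0") < sortKey("1.5.0")); // false
console.log(matchesRule(rule, "1.10.0"));          // true
console.log(matchesRule(rule, "2.0.0"));           // false
```

Storing the key next to `min`, `max`, and `value` at write time would let `$lte`/`$gt` work on it directly.
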
## Indexing tips

* Compound index on `{ packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 }`.
* If lookups by exact value are common: add a sparse index on `"normalizedVersions.value"`.

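
The two indexes above could be declared as data and applied in one place. This sketch assumes only that the collection object exposes `createIndex(keys, options)`, as both `mongosh` collections and the Node.js driver do. The index names and the `ensureIndexes` helper are hypothetical.

```javascript
// Index specs from the tips above, kept as data so they are easy to review.
// The names are made up for this sketch.
const vulnIndexes = [
  {
    keys: { packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 },
    options: { name: "pkg_scheme_type" },
  },
  {
    keys: { "normalizedVersions.value": 1 },
    options: { name: "rule_value_sparse", sparse: true },
  },
];

// `collection` is anything exposing createIndex(keys, options).
// With the Node.js driver each call returns a Promise, so await the results.
function ensureIndexes(collection) {
  return vulnIndexes.map(({ keys, options }) => collection.createIndex(keys, options));
}
```

In `mongosh` this would be called as `ensureIndexes(db.vulns)`.
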
## Migration path (safe + incremental)

1. **Dual-write**: keep the old nested object while writing the new `normalizedVersions` array.
2. **Backfill** existing docs with a one-time script using your current builder.
3. **Cutover** queries/aggregations to the new path (behind a feature flag).
4. **Clean up** the old field after a soak period.

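
The per-document decision in the backfill step can be sketched as a pure function. The legacy field name `versionRules` and the `buildRules` callback are placeholders, since the current nested shape and builder API are not shown here. The returned filter re-checks that `normalizedVersions` is still absent, so a document already handled by the dual-write path is not overwritten.

```javascript
// Hypothetical sketch: decides what, if anything, to write for one document.
// `legacyField` names the old nested object; `buildRules` stands in for the
// current builder. Both are assumptions.
function planBackfill(doc, buildRules, legacyField = "versionRules") {
  if (Array.isArray(doc.normalizedVersions)) return null; // already migrated
  const legacy = doc[legacyField];
  if (legacy == null) return null;                        // nothing to convert
  return {
    // Guard against a concurrent dual-write having filled the array already.
    filter: { _id: doc._id, normalizedVersions: { $exists: false } },
    // The old field is left in place; it is removed in the clean-up step.
    update: { $set: { normalizedVersions: buildRules(legacy) } },
  };
}
```

Each non-null result can be passed to `updateOne(filter, update)`, and because of the guard the script can be re-run safely.
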
If you want, I can draft:

* a one-time Mongo backfill script,
* the new EF/Mongo C# POCOs, and
* a test matrix (edge cases: prerelease tags, build metadata, `0.*` semantics, distro-style ranges).