Initial commit (history squashed)
	
		
			
	
		
	
	
		
	
		
			Some checks failed
		
		
	
	
		
			
				
	
				Build Test Deploy / authority-container (push) Has been cancelled
				
			
		
			
				
	
				Build Test Deploy / docs (push) Has been cancelled
				
			
		
			
				
	
				Build Test Deploy / deploy (push) Has been cancelled
				
			
		
			
				
	
				Build Test Deploy / build-test (push) Has been cancelled
				
			
		
			
				
	
				Docs CI / lint-and-preview (push) Has been cancelled
				
			
		
		
	
	
				
					
				
			
		
			This commit is contained in:
		
							
								
								
									
										139
									
								
								src/FASTER_MODELING_AND_NORMALIZATION.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
							| @@ -0,0 +1,139 @@ | ||||
| Here’s a quick, practical idea to make your version-range modeling cleaner and faster to query. | ||||
|  | ||||
|  | ||||
|  | ||||
| # Rethinking `SemVerRangeBuilder` + MongoDB | ||||
|  | ||||
| **Problem (today):** Version normalization rules live as a nested object (and often as a bespoke structure per source). This can force awkward `$objectToArray`, `$map`, and conditional logic in pipelines when you need to: | ||||
|  | ||||
| * match “is version X affected?” | ||||
| * flatten ranges for analytics | ||||
| * de-duplicate across sources | ||||
|  | ||||
| **Proposal:** Store *normalized version rules as an embedded collection (array of small docs)* instead of a single nested object. | ||||
|  | ||||
| ## Minimal background | ||||
|  | ||||
| * **SemVer normalization**: converting all source-specific version notations into a single, strict representation (e.g., `>=1.2.3 <2.0.0`, exact pins, wildcards). | ||||
| * **Embedded collection**: an array of consistently shaped items inside the parent doc—great for `$unwind`-centric analytics and direct matches. | ||||
|  | ||||
| ## Suggested shape | ||||
|  | ||||
| ```jsonc | ||||
| { | ||||
|   "_id": "VULN-123", | ||||
|   "packageId": "pkg:npm/lodash", | ||||
|   "source": "NVD", | ||||
|   "normalizedVersions": [ | ||||
|     { | ||||
|       "scheme": "semver", | ||||
|       "type": "range",                 // "range" | "exact" | "lt" | "lte" | "gt" | "gte" | ||||
|       "min": "1.2.3",                  // optional | ||||
|       "minInclusive": true,            // optional | ||||
|       "max": "2.0.0",                  // optional | ||||
|       "maxInclusive": false,           // optional | ||||
|       "notes": "from GHSA GHSA-xxxx"   // traceability | ||||
|     }, | ||||
|     { | ||||
|       "scheme": "semver", | ||||
|       "type": "exact", | ||||
|       "value": "1.5.0" | ||||
|     } | ||||
|   ], | ||||
|   "metadata": { "ingestedAt": "2025-10-10T12:00:00Z" } | ||||
| } | ||||
| ``` | ||||
|  | ||||
| ### Why this helps | ||||
|  | ||||
| * **Simpler queries** | ||||
|  | ||||
|   * *Is v affected?* | ||||
|  | ||||
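|     ```js | ||||
|     // Caveat: $lte/$gt on strings compare lexicographically ("1.10.0" sorts before "1.9.0"), | ||||
|     // so this is only correct if min/max are stored in a sortable form (e.g. zero-padded). | ||||
|     // The range branch also assumes minInclusive: true, maxInclusive: false, and that | ||||
|     // both bounds are present; open-ended rules ("lt", "gte", ...) need their own branches. | ||||
|     db.vulns.aggregate([ | ||||
|       // Narrow to one package first; only this stage can use an index. | ||||
|       { $match: { packageId: "pkg:npm/lodash" } }, | ||||
|       // One output document per rule. | ||||
|       { $unwind: "$normalizedVersions" }, | ||||
|       { $match: { | ||||
|           $or: [ | ||||
|             { "normalizedVersions.type": "exact", "normalizedVersions.value": "1.5.0" }, | ||||
|             { "normalizedVersions.type": "range", | ||||
|               "normalizedVersions.min": { $lte: "1.5.0" }, | ||||
|               "normalizedVersions.max": { $gt:  "1.5.0" } } | ||||
|           ] | ||||
|       }}, | ||||
|       { $project: { _id: 1 } } | ||||
|     ]) | ||||
|     ``` | ||||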
|   * No `$objectToArray`, fewer `$cond`s. | ||||
|  | ||||
| * **Cheaper storage** | ||||
|  | ||||
|   * Each rule carries only the fields it uses, so you avoid wide nested structures padded with nulls and unused keys; small, uniform docs also tend to compress well. | ||||
|  | ||||
| * **Easier dedup/merge** | ||||
|  | ||||
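|   * `$unwind` → normalize → `$group` by `{scheme,type,min,minInclusive,max,maxInclusive,value}` to collapse equivalent rules across sources. Keep the inclusive flags in the key; otherwise `>1.2.3` and `>=1.2.3` would be merged. | ||||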
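
The range match in the query above compares version strings, and plain string comparison does not follow numeric version order. The sketch below shows one way to work around that, assuming versions are strict `MAJOR.MINOR.PATCH` triples. `sortKey` and `isAffected` are hypothetical helper names, the default inclusive flags are assumptions, and prerelease tags and build metadata are deliberately left out.

```javascript
// Hypothetical helper: turn "MAJOR.MINOR.PATCH" into a fixed-width key whose
// string order matches numeric version order (for components up to `width` digits).
function sortKey(version, width = 6) {
  const parts = version.split(".");
  if (parts.length !== 3 || !parts.every((p) => /^\d+$/.test(p))) {
    throw new Error(`unsupported version: ${version}`);
  }
  return parts.map((p) => p.padStart(width, "0")).join(".");
}

// In-memory equivalent of the "is v affected?" match. It honors open-ended
// bounds and the inclusive flags stored on each rule.
function isAffected(rule, version) {
  const v = sortKey(version);
  if (rule.type === "exact") return sortKey(rule.value) === v;
  if (rule.min != null) {
    const min = sortKey(rule.min);
    // Assumption: a missing minInclusive means the lower bound is inclusive.
    if (rule.minInclusive === false ? v <= min : v < min) return false;
  }
  if (rule.max != null) {
    const max = sortKey(rule.max);
    // Assumption: a missing maxInclusive means the upper bound is exclusive.
    if (rule.maxInclusive === true ? v > max : v >= max) return false;
  }
  return true;
}

const sampleRange = {
  scheme: "semver", type: "range",
  min: "1.2.3", minInclusive: true,
  max: "2.0.0", maxInclusive: false,
};

console.log("1.10.0" < "1.9.0");                   // true: raw strings sort incorrectly
console.log(sortKey("1.10.0") < sortKey("1.9.0")); // false: padded keys sort numerically
console.log(isAffected(sampleRange, "1.10.0"));    // true
```

If keys like these were stored next to `min` and `max` (for example as `minKey` and `maxKey`, names that are not part of the shape above), the `$lte`/`$gt` comparisons could run against them instead of the raw strings.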
|  | ||||
| ## Builder changes (`SemVerRangeBuilder`) | ||||
|  | ||||
| * **Emit items, not a monolith**: have the builder return `IEnumerable<NormalizedVersionRule>`. | ||||
| * **Normalize early**: resolve “aliases” (`1.2.x`, `^1.2.3`, distro styles) into canonical `(type,min,max,…)` before persistence. | ||||
| * **Traceability**: include `notes`/`sourceRef` on each rule so you can re-materialize provenance during audits. | ||||
|  | ||||
| ### C# sketch | ||||
|  | ||||
| ```csharp | ||||
| using System.Collections.Generic; | ||||
|  | ||||
| public record NormalizedVersionRule( | ||||
|     string Scheme,           // "semver" | ||||
|     string Type,             // "range" | "exact" | ... | ||||
|     string? Min = null, | ||||
|     bool? MinInclusive = null, | ||||
|     string? Max = null, | ||||
|     bool? MaxInclusive = null, | ||||
|     string? Value = null, | ||||
|     string? Notes = null | ||||
| ); | ||||
|  | ||||
| public static class SemVerRangeBuilder | ||||
| { | ||||
|     public static IEnumerable<NormalizedVersionRule> Build(string raw) | ||||
|     { | ||||
|         // TODO: parse `raw` (^1.2.3, 1.2.x, <=2.0.0, etc.) and yield one canonical rule per clause. | ||||
|         // The rule below is a hard-coded placeholder showing the output shape: | ||||
|         yield return new NormalizedVersionRule( | ||||
|             Scheme: "semver", | ||||
|             Type: "range", | ||||
|             Min: "1.2.3", | ||||
|             MinInclusive: true, | ||||
|             Max: "2.0.0", | ||||
|             MaxInclusive: false, | ||||
|             Notes: "nvd:ABC-123" | ||||
|         ); | ||||
|     } | ||||
| } | ||||
| ``` | ||||
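
The C# sketch leaves the parsing step as a comment. As a rough illustration of what "normalize early" could mean, here is a minimal JavaScript version that handles four notations: caret ranges, trailing wildcards, single comparators, and exact pins. `buildRules` and its helpers are hypothetical names. The caret handling follows the common npm convention (changes are allowed below the left-most non-zero component), which is an assumption about the sources being ingested. Compound ranges, prerelease tags, and distro-style versions are out of scope.

```javascript
// Parse a strict "MAJOR.MINOR.PATCH" string into three numbers.
function parseTriple(text) {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(text);
  if (!m) throw new Error(`unsupported version: ${text}`);
  return m.slice(1).map(Number);
}

// Build a half-open range rule: min inclusive, max exclusive.
function makeRange(min, max, notes) {
  return { scheme: "semver", type: "range", min, minInclusive: true, max, maxInclusive: false, notes };
}

const comparators = {
  "<":  (v) => ({ type: "lt",  max: v, maxInclusive: false }),
  "<=": (v) => ({ type: "lte", max: v, maxInclusive: true }),
  ">":  (v) => ({ type: "gt",  min: v, minInclusive: false }),
  ">=": (v) => ({ type: "gte", min: v, minInclusive: true }),
};

function buildRules(raw, notes = null) {
  const text = raw.trim();

  // Caret: the upper bound bumps the left-most non-zero component.
  if (text.startsWith("^")) {
    const [maj, min, pat] = parseTriple(text.slice(1));
    const upper = maj > 0 ? `${maj + 1}.0.0`
      : min > 0 ? `0.${min + 1}.0`
      : `0.0.${pat + 1}`;
    return [makeRange(`${maj}.${min}.${pat}`, upper, notes)];
  }

  // Trailing wildcard: "1.x", "1.2.x", "1.2.*".
  const wild = /^(\d+)(?:\.(\d+))?\.[xX*]$/.exec(text);
  if (wild) {
    const maj = Number(wild[1]);
    if (wild[2] === undefined) return [makeRange(`${maj}.0.0`, `${maj + 1}.0.0`, notes)];
    const min = Number(wild[2]);
    return [makeRange(`${maj}.${min}.0`, `${maj}.${min + 1}.0`, notes)];
  }

  // Single comparator: "<=2.0.0", ">1.0.0", ...
  const cmp = /^(<=|>=|<|>)\s*(.+)$/.exec(text);
  if (cmp) {
    const version = parseTriple(cmp[2]).join(".");
    return [{ scheme: "semver", ...comparators[cmp[1]](version), notes }];
  }

  // Anything else must be an exact pin.
  return [{ scheme: "semver", type: "exact", value: parseTriple(text).join("."), notes }];
}

console.log(buildRules("^1.2.3", "nvd:ABC-123"));
console.log(buildRules("1.2.x"));
console.log(buildRules("<=2.0.0"));
```

Returning an array even for single-rule inputs mirrors the `IEnumerable<NormalizedVersionRule>` contract, so callers can concatenate the output of several raw strings straight into `normalizedVersions`.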
|  | ||||
| ## Aggregation patterns you unlock | ||||
|  | ||||
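| * **Fast “affected version” lookups** via `$unwind` + `$match`. Comparing version strings directly is lexicographic, so pair each bound with a computed sort key (for example, zero-padded components) to get correct ordering. | ||||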
| * **Rollups**: count of vulns per `(major,minor)` by mapping each rule into bucketed segments. | ||||
| * **Cross-source reconciliation**: group identical rules to de-duplicate. | ||||
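
As one possible shape for the cross-source reconciliation pattern, the sketch below defines a grouping pipeline and a small in-memory equivalent that can be checked without a database. Two assumptions go beyond the text: `packageId` is added to the group key so that rules for different packages are not merged, and the inclusive flags are part of the key. `dedupPipeline` and `dedupe` are hypothetical names.

```javascript
// Aggregation pipeline: one group per distinct rule, listing the sources that reported it.
const dedupPipeline = [
  { $unwind: "$normalizedVersions" },
  { $group: {
      _id: {
        packageId: "$packageId",
        scheme: "$normalizedVersions.scheme",
        type: "$normalizedVersions.type",
        min: "$normalizedVersions.min",
        minInclusive: "$normalizedVersions.minInclusive",
        max: "$normalizedVersions.max",
        maxInclusive: "$normalizedVersions.maxInclusive",
        value: "$normalizedVersions.value",
      },
      sources: { $addToSet: "$source" },
      vulnIds: { $addToSet: "$_id" },
  } },
];

// In-memory equivalent of the pipeline, useful for unit tests.
function dedupe(docs) {
  const groups = new Map();
  for (const doc of docs) {
    for (const r of doc.normalizedVersions ?? []) {
      const key = JSON.stringify([
        doc.packageId, r.scheme, r.type,
        r.min ?? null, r.minInclusive ?? null,
        r.max ?? null, r.maxInclusive ?? null,
        r.value ?? null,
      ]);
      if (!groups.has(key)) groups.set(key, { rule: r, sources: new Set() });
      groups.get(key).sources.add(doc.source);
    }
  }
  return [...groups.values()];
}

const sharedRule = { scheme: "semver", type: "range", min: "1.2.3", minInclusive: true, max: "2.0.0", maxInclusive: false };
const docs = [
  { _id: "VULN-123", packageId: "pkg:npm/lodash", source: "NVD",  normalizedVersions: [sharedRule] },
  { _id: "VULN-456", packageId: "pkg:npm/lodash", source: "GHSA", normalizedVersions: [{ ...sharedRule }] },
];

const merged = dedupe(docs);
console.log(merged.length);           // 1
console.log([...merged[0].sources]);  // [ 'NVD', 'GHSA' ]
```

In mongosh the pipeline would be run as `db.vulns.aggregate(dedupPipeline)`. The `notes` field is intentionally left out of the key, since provenance text differs per source even when the rule itself is identical.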
|  | ||||
| ## Indexing tips | ||||
|  | ||||
| * Compound index on `{ packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 }`. | ||||
| * If lookups by exact value are common: add a sparse index on `"normalizedVersions.value"`. | ||||
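
The two indexes above could be declared as follows. This is a sketch that assumes the collection is named `vulns`, as in the earlier query. The definitions are kept as plain data (`indexes` is a hypothetical name) so they can be reviewed or reused from a driver, and the `createIndex` calls only run when a mongosh-style `db` global is present.

```javascript
// Index definitions as plain data: key pattern plus options.
const indexes = [
  {
    // Compound multikey index; both array paths live under the same array field.
    keys: { packageId: 1, "normalizedVersions.scheme": 1, "normalizedVersions.type": 1 },
    options: {},
  },
  {
    // Sparse: documents with no exact-pin rule are left out of the index.
    keys: { "normalizedVersions.value": 1 },
    options: { sparse: true },
  },
];

// `db` exists only inside mongosh; the guard keeps this file loadable elsewhere.
if (typeof db !== "undefined") {
  for (const { keys, options } of indexes) {
    db.vulns.createIndex(keys, options);
  }
}
```

Only stages that run before `$unwind` can use these indexes, so the leading `$match` is where they pay off. Filtering on the array fields in that first `$match` (for example with `$elemMatch`) lets the index discard non-matching documents before they are unwound.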
|  | ||||
| ## Migration path (safe + incremental) | ||||
|  | ||||
| 1. **Dual-write**: keep old nested object while writing the new `normalizedVersions` array. | ||||
| 2. **Backfill** existing docs with a one-time script using your current builder. | ||||
| 3. **Cutover** queries/aggregations to the new path (behind a feature flag). | ||||
| 4. **Clean up** old field after soak. | ||||
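
For the backfill step, one option is to build the bulk operations separately from executing them, which makes the script easy to dry-run. The sketch below assumes nothing about the legacy field layout: the caller supplies a `convert` function that maps an old document to an array of rules. `buildBackfillOps`, `convert`, and the `legacyVersion` field in the sample are all hypothetical.

```javascript
// Build one updateOne operation per document that has not been migrated yet.
function buildBackfillOps(docs, convert) {
  return docs
    .filter((doc) => !Array.isArray(doc.normalizedVersions)) // skip already-migrated docs
    .map((doc) => ({
      updateOne: {
        // The $exists guard keeps the backfill from overwriting dual-written data.
        filter: { _id: doc._id, normalizedVersions: { $exists: false } },
        update: { $set: { normalizedVersions: convert(doc) } },
      },
    }));
}

// Sample input; `legacyVersion` stands in for whatever the old nested object holds.
const sample = [
  { _id: "VULN-1", legacyVersion: "1.5.0" },
  { _id: "VULN-2", legacyVersion: "2.0.0", normalizedVersions: [] },
];

const ops = buildBackfillOps(sample, (doc) => [
  { scheme: "semver", type: "exact", value: doc.legacyVersion },
]);

console.log(ops.length);                 // 1
console.log(JSON.stringify(ops[0], null, 2));
```

In mongosh the operations could then be applied in batches with `db.vulns.bulkWrite(ops, { ordered: false })`. Because the filter requires the new field to be absent, re-running the script after a partial failure is safe.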
|  | ||||
| If you want, I can draft: | ||||
|  | ||||
| * a one-time Mongo backfill script, | ||||
| * the new EF/Mongo C# POCOs, and | ||||
| * a test matrix (edge cases: prerelease tags, build metadata, `0.*` semantics, distro-style ranges). | ||||