# Contract: Canonical SBOM Identifier (v1)
## Status
- Status: DRAFT (2026-02-19)
- Owners: Scanner Guild, Attestor Guild
- Consumers: Scanner, Attestor, VexLens, EvidenceLocker, Graph, Policy
## Purpose
Define a single, deterministic, cross-module identifier for any CycloneDX SBOM document. All modules that reference SBOMs must use this identifier for cross-module joins, evidence threading, and verification.
## Definition
```
canonical_id := "sha256:" + hex(SHA-256(JCS(sbom_json)))
```
Where:
- `sbom_json` is the raw CycloneDX JSON document (v1.4 through v1.7)
- `JCS` is JSON Canonicalization Scheme per RFC 8785
- `SHA-256` is the hash function
- `hex` is lowercase hexadecimal encoding
- The result is prefixed with `sha256:` for algorithm agility
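The textual shape implied by the definition (a `sha256:` prefix followed by 64 lowercase hex digits) can be checked before an identifier is used in a join. The sketch below is illustrative only; `is_canonical_id` is a hypothetical helper name, not part of this contract, and the sample value is a placeholder rather than a real digest.

```bash
# Hypothetical helper: succeeds only for "sha256:" + 64 lowercase hex digits.
is_canonical_id() {
  printf '%s' "$1" | LC_ALL=C grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Placeholder value (64 zeros), not a real SBOM digest.
sample="sha256:$(printf '%064d' 0)"
is_canonical_id "$sample" && echo "well-formed"
is_canonical_id "sha256:ABCDEF" || echo "rejected"
```

A check like this only validates the format; it says nothing about whether the digest matches a given SBOM.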
## Canonicalization Rules (RFC 8785)
### Object Key Ordering
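All JSON object keys are sorted by their UTF-16 code unit values, as RFC 8785 requires (equivalent to `StringComparer.Ordinal` in .NET). For ASCII-only keys this is identical to Unicode code point order; the two orderings differ only when keys mix supplementary-plane characters with BMP characters at or above U+E000.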
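For ASCII-only key names, ordinal ordering coincides with byte order under the C locale, so it can be illustrated with `sort`. The sketch below sorts a handful of top-level keys; `Zeta` is a made-up key included only to show that uppercase letters sort before lowercase ones, which a locale-aware sort may not do.

```bash
# Ordinal (byte-wise) sort of sample keys; LC_ALL=C forces byte comparison.
# "Zeta" is a hypothetical key, not a CycloneDX field.
sorted_keys=$(printf '%s\n' version specVersion components bomFormat '$schema' Zeta | LC_ALL=C sort)
printf '%s\n' "$sorted_keys"
# Order: $schema, Zeta, bomFormat, components, specVersion, version
```

This shortcut does not hold for keys containing non-ASCII characters, where a real JCS implementation must compare UTF-16 code units.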
### Number Serialization
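- Numbers are serialized as ECMAScript serializes an IEEE 754 double-precision value (the `Number` to string conversion), as RFC 8785 specifies
- Integers: no leading zeros and no fractional part (`1.0` becomes `1`)
- Floating point: use the shortest representation that round-trips exactly
- No exponent notation for integers with magnitude below 10^21; larger magnitudes use exponent form (for example `1e+21`)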
### String Serialization
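- Escape `"` as `\"` and `\` as `\\`
- Control characters (U+0000 through U+001F): use the short forms `\b`, `\t`, `\n`, `\f`, `\r` where they exist, and `\uXXXX` with lowercase hex digits for the rest
- Emit all other characters literally (no unnecessary escape sequences)
- Encode the output as UTF-8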
### Whitespace
- No whitespace between tokens (no indentation, no trailing newlines)
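The trailing-newline rule matters in practice because many shell tools append one. The sketch below, which assumes `sha256sum` and `awk` are available, hashes the same canonical text with and without a trailing newline and shows that the digests differ.

```bash
canonical='{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}'

# printf '%s' writes the bytes exactly; adding '\n' appends one extra byte.
without_newline=$(printf '%s' "$canonical" | sha256sum | awk '{print $1}')
with_newline=$(printf '%s\n' "$canonical" | sha256sum | awk '{print $1}')

if [ "$without_newline" != "$with_newline" ]; then
  echo "a trailing newline changes the digest"
fi
```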
### Null Handling
- Null values are serialized as `null` (not omitted)
- Missing optional fields are omitted entirely
### Array Element Order
- Array elements maintain their original order (arrays are NOT sorted)
- This is critical: CycloneDX component arrays preserve document order
## Implementation Reference
### .NET Implementation
Use `StellaOps.AuditPack.Services.CanonicalJson.Canonicalize(ReadOnlySpan<byte> json)`:
```csharp
using StellaOps.AuditPack.Services;
using System.Security.Cryptography;
byte[] sbomJsonBytes = ...; // raw CycloneDX JSON
byte[] canonicalBytes = CanonicalJson.Canonicalize(sbomJsonBytes);
byte[] hash = SHA256.HashData(canonicalBytes);
string canonicalId = "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
```
### CLI Implementation
```bash
# Using jcs_canonicalize (or equivalent tool). The tool must emit the canonical
# bytes with no trailing newline; otherwise the digest changes.
jcs_canonicalize ./bom.json | sha256sum | awk '{print "sha256:" $1}'
```
## Relationship to Existing Identifiers
### `ContentHash` (CycloneDxArtifact.ContentHash)
- **Current:** `sha256(raw_json_bytes)` -- hash of serialized JSON, NOT canonical
- **Relationship:** `ContentHash` is a serialization-specific hash that depends on whitespace, key ordering, and JSON serializer settings. `canonical_id` is a content-specific hash that is stable across serializers.
- **Coexistence:** Both identifiers are retained. `ContentHash` is used for integrity verification of a specific serialized form. `canonical_id` is used for cross-module reference and evidence threading.
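The serialization dependence of `ContentHash` can be seen by hashing two serializations of the same content directly. The sketch below uses the two key orders from test vector 2; `raw_hash` is a hypothetical helper that mimics a hash over raw bytes and is not the actual `ContentHash` implementation.

```bash
# Same content, different key order.
form_a='{"specVersion":"1.7","bomFormat":"CycloneDX","version":1}'
form_b='{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}'

# Hypothetical helper: SHA-256 over the raw bytes, with no canonicalization.
raw_hash() { printf '%s' "$1" | sha256sum | awk '{print $1}'; }

hash_a=$(raw_hash "$form_a")
hash_b=$(raw_hash "$form_b")

if [ "$hash_a" != "$hash_b" ]; then
  echo "raw hashes differ although the content is the same"
fi
```

Both forms canonicalize to the same bytes, so they share one `canonical_id` even though their raw hashes differ.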
### `stella.contentHash` (SBOM metadata property)
- **Current:** Same as `ContentHash` above, emitted as a metadata property
- **New:** `stella:canonical_id` is emitted as a separate metadata property alongside `stella.contentHash`
### `CompositionRecipeSha256`
- **Purpose:** Hash of the composition recipe (layer ordering), not the SBOM content itself
- **Relationship:** Independent. Composition recipe describes HOW the SBOM was built; `canonical_id` describes WHAT was built.
## DSSE Subject Binding
When a DSSE attestation is created for an SBOM (predicate type `StellaOps.SBOMAttestation@1`), the subject MUST include `canonical_id`:
```json
{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "sbom",
      "digest": {
        "sha256": "<canonical_id_hex_without_prefix>"
      }
    }
  ],
  "predicateType": "StellaOps.SBOMAttestation@1",
  "predicate": { ... }
}
```
Note: The `subject.digest.sha256` value is the hex hash WITHOUT the `sha256:` prefix, following in-toto convention.
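Deriving the subject digest from a `canonical_id` is a matter of stripping the prefix. The sketch below shows one way to do it; `subject_digest_from_id` is a hypothetical helper name, and the identifier is a placeholder, not a real SBOM digest.

```bash
# Hypothetical helper: strip the "sha256:" prefix, rejecting other algorithms.
subject_digest_from_id() {
  case "$1" in
    sha256:*) printf '%s' "${1#sha256:}" ;;
    *) echo "unsupported canonical_id: $1" >&2; return 1 ;;
  esac
}

# Placeholder identifier (64 zeros), not a real SBOM digest.
canonical_id="sha256:$(printf '%064d' 0)"
subject_digest=$(subject_digest_from_id "$canonical_id")
printf '"digest": {"sha256": "%s"}\n' "$subject_digest"
```

Rejecting unknown prefixes, rather than passing them through, keeps a future algorithm change from silently producing a mislabeled `sha256` digest.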
## Stability Guarantee
Given identical CycloneDX content (same components, same metadata, same vulnerabilities), computing `canonical_id` MUST yield the same value:
- Across different machines
- Across different .NET runtime versions
- Across serialization/deserialization round-trips
- Regardless of original JSON formatting (whitespace, key order)
This is the fundamental invariant that enables cross-module evidence joins.
## Test Vectors
### Vector 1: Minimal CycloneDX 1.7
Input:
```json
{"bomFormat":"CycloneDX","specVersion":"1.7","version":1,"components":[]}
```
Expected canonical form: `{"bomFormat":"CycloneDX","components":[],"specVersion":"1.7","version":1}`
Expected canonical_id: `sha256:` followed by the lowercase hex SHA-256 digest of the canonical form's UTF-8 bytes.
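Since the canonical form for this vector is already given, the identifier can be computed without a JCS tool. This is a minimal sketch assuming `sha256sum` and `awk` are available; it hashes the canonical form exactly as written above, with no trailing newline.

```bash
canonical_form='{"bomFormat":"CycloneDX","components":[],"specVersion":"1.7","version":1}'

# printf '%s' avoids the trailing newline that echo would add.
canonical_id="sha256:$(printf '%s' "$canonical_form" | sha256sum | awk '{print $1}')"
echo "$canonical_id"
```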
### Vector 2: Key ordering
Input (keys out of order):
```json
{"specVersion":"1.7","bomFormat":"CycloneDX","version":1}
```
Expected canonical form: `{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}`
Must produce the same `canonical_id` as any other key ordering of the same content.
### Vector 3: Whitespace normalization
Input (pretty-printed):
```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.7",
  "version": 1
}
```
Must produce the same `canonical_id` as the minified form.
## Migration Notes
- New attestations MUST include `canonical_id` in the DSSE subject
- Existing attestations are NOT backfilled (they retain their original subject digests)
- Verification of historical attestations uses their original subject binding
- The `stella:canonical_id` metadata property is added to new SBOMs only
## References
- RFC 8785: JSON Canonicalization Scheme (JCS)
- CycloneDX v1.7 specification
- DSSE v1.0 specification
- in-toto Statement v1 specification