
# Contract: Canonical SBOM Identifier (v1)

## Status
- Status: DRAFT (2026-02-19)
- Owners: Scanner Guild, Attestor Guild
- Consumers: Scanner, Attestor, VexLens, EvidenceLocker, Graph, Policy

## Purpose
Define a single, deterministic, cross-module identifier for any CycloneDX SBOM document. All modules that reference SBOMs must use this identifier for cross-module joins, evidence threading, and verification.

## Definition
```
canonical_id := "sha256:" + hex(SHA-256(JCS(sbom_json)))
```

Where:
- `sbom_json` is the raw CycloneDX JSON document (v1.4 through v1.7)
- `JCS` is JSON Canonicalization Scheme per RFC 8785
- `SHA-256` is the hash function
- `hex` is lowercase hexadecimal encoding
- The result is prefixed with `sha256:` for algorithm agility
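The definition can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: `jcs_subset` is a hypothetical helper that covers only the part of RFC 8785 needed for documents without floating-point numbers, and it rejects floats rather than guessing at their formatting.

```python
import hashlib
import json


def jcs_subset(value) -> str:
    """Approximate RFC 8785 output for JSON values that contain no floats."""
    if isinstance(value, dict):
        # RFC 8785 orders keys by UTF-16 code units.
        items = sorted(value.items(), key=lambda kv: kv[0].encode("utf-16-be"))
        members = (json.dumps(k, ensure_ascii=False) + ":" + jcs_subset(v) for k, v in items)
        return "{" + ",".join(members) + "}"
    if isinstance(value, list):
        # Arrays keep their document order.
        return "[" + ",".join(jcs_subset(v) for v in value) + "]"
    if isinstance(value, float):
        raise ValueError("floats need ECMAScript number formatting, not covered in this sketch")
    # Strings, integers, booleans, and null.
    return json.dumps(value, ensure_ascii=False)


def canonical_id(sbom_json: bytes) -> str:
    """Return "sha256:" + lowercase hex of SHA-256 over the canonical bytes."""
    canonical_bytes = jcs_subset(json.loads(sbom_json)).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical_bytes).hexdigest()


compact = b'{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}'
reordered = b'{ "version": 1, "specVersion": "1.7", "bomFormat": "CycloneDX" }'
print(canonical_id(compact) == canonical_id(reordered))  # True
```

A production implementation should use a complete RFC 8785 library, since SBOMs may contain fractional numbers (for example vulnerability scores).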

## Canonicalization Rules (RFC 8785)

### Object Key Ordering
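All JSON object keys are sorted lexicographically by their UTF-16 code units, as RFC 8785 requires (equivalent to `StringComparer.Ordinal` in .NET). For keys restricted to the Basic Multilingual Plane this is identical to Unicode code point order; the two orders differ only when a key contains supplementary-plane characters. Sorting applies recursively to every nested object.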
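RFC 8785 compares keys as arrays of UTF-16 code units, which is also what ordinal comparison does in .NET. The sketch below, using illustrative keys only, shows where that order diverges from plain code point order: a supplementary-plane character is encoded as a surrogate pair starting at U+D800 or above, so it sorts before BMP characters such as U+FB33.

```python
# Illustrative keys: ASCII, one BMP character (U+FB33), one supplementary character (U+1F600).
keys = ["\U0001F600", "\ufb33", "b", "a", "B"]

# Big-endian UTF-16 bytes compare exactly like sequences of UTF-16 code units.
utf16_order = sorted(keys, key=lambda k: k.encode("utf-16-be"))

# Python's default string comparison uses Unicode code points.
code_point_order = sorted(keys)

print(utf16_order == code_point_order)  # False: the last two keys swap places
```

For ASCII-only keys, which covers the CycloneDX schema's own property names, both orders agree.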

### Number Serialization
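Numbers are serialized using the ECMAScript `Number` to-string rules that RFC 8785 mandates, with every value treated as an IEEE 754 double:
- Integers: no leading zeros and no fractional part (`1.0` becomes `1`)
- Floating point: use the shortest representation that round-trips exactly
- Exponent notation appears only for magnitudes of 1e21 and above, or below 1e-6; integers smaller than 1e21 are never written in scientific notation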
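RFC 8785 delegates number formatting to ECMAScript's Number-to-string algorithm, under which magnitudes of 1e21 and above (or below 1e-6) use exponent notation. The following is a sketch of those rules in Python; `es_number` is a hypothetical helper name, and the code relies on `repr()` producing the shortest round-tripping digits for a double.

```python
import math
from decimal import Decimal


def es_number(x: float) -> str:
    """Format a finite double following ECMAScript's Number-to-string rules."""
    if math.isnan(x) or math.isinf(x):
        raise ValueError("NaN and Infinity are not valid JSON numbers")
    if x == 0:
        return "0"  # also covers -0.0
    sign = "-" if x < 0 else ""
    # Extract the shortest round-trip digits and the decimal exponent.
    _, digit_tuple, exponent = Decimal(repr(abs(x))).normalize().as_tuple()
    digits = "".join(map(str, digit_tuple))
    k = len(digits)
    n = exponent + k  # position of the decimal point relative to the digits
    if k <= n <= 21:
        body = digits + "0" * (n - k)           # plain integer
    elif 0 < n <= 21:
        body = digits[:n] + "." + digits[n:]    # decimal point inside the digits
    elif -6 < n <= 0:
        body = "0." + "0" * (-n) + digits       # small fraction
    else:
        e = n - 1
        mantissa = digits if k == 1 else digits[0] + "." + digits[1:]
        body = f"{mantissa}e{'+' if e > 0 else '-'}{abs(e)}"
    return sign + body


print(es_number(1.0))   # 1
print(es_number(7.5))   # 7.5
print(es_number(1e21))  # 1e+21
```

Note that Python's own `repr()` switches to exponent notation at different thresholds than ECMAScript, which is why the digits are re-laid out here instead of being used directly.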

### String Serialization
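- Escape `"` as `\"` and `\` as `\\`
- Escape control characters (U+0000 through U+001F) as lowercase `\uXXXX`, except backspace, tab, newline, form feed, and carriage return, which use the short forms `\b`, `\t`, `\n`, `\f`, and `\r`
- Use minimal escaping otherwise: all other characters, including `/` and non-ASCII characters, are emitted literally
- Encode the output as UTF-8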
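As an illustration of the string rules, Python's standard `json.dumps` with `ensure_ascii=False` produces, to the best of our understanding, the same escapes that RFC 8785 requires for well-formed strings. The helper name `jcs_string` is hypothetical, and lone surrogates are not considered here.

```python
import json


def jcs_string(s: str) -> str:
    """Serialize a string with minimal escaping; non-ASCII is emitted literally."""
    return json.dumps(s, ensure_ascii=False)


print(jcs_string("line1\nline2"))  # "line1\nline2"  (short escape for newline)
print(jcs_string("\x01"))          # "\u0001"        (lowercase \uXXXX for other controls)
print(jcs_string("é/€"))           # "é/€"           (no escaping of "/" or non-ASCII)

# The canonical bytes are the UTF-8 encoding of the serialized text.
print(jcs_string("é").encode("utf-8"))  # b'"\xc3\xa9"'
```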

### Whitespace
- No whitespace between tokens (no indentation, no trailing newlines)

### Null Handling
- Null values are serialized as `null` (not omitted)
- Missing optional fields are omitted entirely
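Because canonicalization neither adds nor removes members, an explicit `null` and an absent field yield different canonical forms and therefore different identifiers. A short sketch, where `canonical` is a hypothetical helper that approximates JCS with `json.dumps` (adequate only for ASCII keys and documents without floats):

```python
import json


def canonical(doc) -> str:
    """Approximate JCS for ASCII keys and float-free documents."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


with_null = json.loads('{"name": "lib", "version": null}')
without_field = json.loads('{"name": "lib"}')

print(canonical(with_null))      # {"name":"lib","version":null}
print(canonical(without_field))  # {"name":"lib"}
```

Producers should therefore decide consistently whether an unset optional field is written as `null` or omitted.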

### Array Element Order
- Array elements maintain their original order (arrays are NOT sorted)
- This is critical: CycloneDX component arrays preserve document order
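The consequence of preserving array order can be shown with a small sketch: two documents that list the same components in a different order receive different identifiers. The `canonical_id` helper below is an approximation built on `json.dumps` (an assumption that holds only for ASCII keys and float-free documents), and the component names are made up.

```python
import hashlib
import json


def canonical_id(doc) -> str:
    """Approximate canonical_id for ASCII keys and float-free documents."""
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


first = {"bomFormat": "CycloneDX", "components": [{"name": "alpha"}, {"name": "beta"}]}
second = {"bomFormat": "CycloneDX", "components": [{"name": "beta"}, {"name": "alpha"}]}

# Key order inside objects is normalized, element order inside arrays is not.
print(canonical_id(first) == canonical_id(second))  # False
```

Producers that want order-independent identifiers must emit arrays in a deterministic order before the identifier is computed.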

## Implementation Reference

### .NET Implementation
Use `StellaOps.AuditPack.Services.CanonicalJson.Canonicalize(ReadOnlySpan<byte> json)`:
```csharp
using StellaOps.AuditPack.Services;
using System.Security.Cryptography;

byte[] sbomJsonBytes = ...; // raw CycloneDX JSON
byte[] canonicalBytes = CanonicalJson.Canonicalize(sbomJsonBytes);
byte[] hash = SHA256.HashData(canonicalBytes);
string canonicalId = "sha256:" + Convert.ToHexString(hash).ToLowerInvariant();
```

### CLI Implementation
```bash
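# Using jcs_canonicalize (or an equivalent RFC 8785 tool).
# The tool must write the canonical bytes without a trailing newline;
# any extra byte changes the digest.
jcs_canonicalize ./bom.json | sha256sum | awk '{print "sha256:" $1}'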
```

## Relationship to Existing Identifiers

### `ContentHash` (CycloneDxArtifact.ContentHash)
- **Current:** `sha256(raw_json_bytes)` -- hash of serialized JSON, NOT canonical
- **Relationship:** `ContentHash` is a serialization-specific hash that depends on whitespace, key ordering, and JSON serializer settings. `canonical_id` is a content-specific hash that is stable across serializers.
- **Coexistence:** Both identifiers are retained. `ContentHash` is used for integrity verification of a specific serialized form. `canonical_id` is used for cross-module reference and evidence threading.
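The difference between the two identifiers can be illustrated with a short sketch. Both helper names are hypothetical, and `canonical_id` approximates JCS with `json.dumps`, which is adequate only for ASCII keys and float-free documents.

```python
import hashlib
import json


def content_hash(raw_json: bytes) -> str:
    """Serialization-specific: hashes the bytes exactly as stored."""
    return "sha256:" + hashlib.sha256(raw_json).hexdigest()


def canonical_id(raw_json: bytes) -> str:
    """Content-specific: hashes an approximated canonical form."""
    text = json.dumps(json.loads(raw_json), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


minified = b'{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}'
pretty = json.dumps(json.loads(minified), indent=2).encode("utf-8")

print(content_hash(minified) == content_hash(pretty))  # False: whitespace differs
print(canonical_id(minified) == canonical_id(pretty))  # True: same content
```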

### `stella.contentHash` (SBOM metadata property)
- **Current:** Same as `ContentHash` above, emitted as a metadata property
- **New:** `stella:canonical_id` is emitted as a separate metadata property alongside `stella.contentHash`

### `CompositionRecipeSha256`
- **Purpose:** Hash of the composition recipe (layer ordering), not the SBOM content itself
- **Relationship:** Independent. Composition recipe describes HOW the SBOM was built; `canonical_id` describes WHAT was built.

## DSSE Subject Binding

When a DSSE attestation is created for an SBOM (predicate type `StellaOps.SBOMAttestation@1`), the subject MUST include `canonical_id`:

```json
{
  "_type": "https://in-toto.io/Statement/v1",
  "subject": [
    {
      "name": "sbom",
      "digest": {
        "sha256": "<canonical_id_hex_without_prefix>"
      }
    }
  ],
  "predicateType": "StellaOps.SBOMAttestation@1",
  "predicate": { ... }
}
```

Note: The `subject.digest.sha256` value is the hex hash WITHOUT the `sha256:` prefix, following in-toto convention.
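A sketch of building the statement from an already computed `canonical_id` is shown below. The function name `sbom_statement` and its validation are illustrative assumptions; the snippet produces only the in-toto statement, not the signed DSSE envelope around it.

```python
def sbom_statement(canonical_id: str, predicate: dict) -> dict:
    """Build an in-toto v1 statement whose subject digest is the bare hex hash."""
    algorithm, _, hex_digest = canonical_id.partition(":")
    is_lower_hex = len(hex_digest) == 64 and all(c in "0123456789abcdef" for c in hex_digest)
    if algorithm != "sha256" or not is_lower_hex:
        raise ValueError(f"not a valid canonical_id: {canonical_id!r}")
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": "sbom", "digest": {"sha256": hex_digest}}],
        "predicateType": "StellaOps.SBOMAttestation@1",
        "predicate": predicate,
    }


# Placeholder digest for illustration only, not a real SBOM hash.
example_id = "sha256:" + "ab" * 32
statement = sbom_statement(example_id, predicate={})
print(statement["subject"][0]["digest"]["sha256"] == "ab" * 32)  # True
```

Verifiers can reverse the mapping by prepending `sha256:` to the subject digest before joining against other modules.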

## Stability Guarantee

Given identical CycloneDX content (same components, same metadata, same vulnerabilities, with array elements in the same order), computing `canonical_id` MUST yield the same value:
- Across different machines
- Across different .NET runtime versions
- Across serialization/deserialization round-trips
- Regardless of original JSON formatting (whitespace, key order)

This is the fundamental invariant that enables cross-module evidence joins.

## Test Vectors

### Vector 1: Minimal CycloneDX 1.7
Input:
```json
{"bomFormat":"CycloneDX","specVersion":"1.7","version":1,"components":[]}
```
- Expected canonical form: `{"bomFormat":"CycloneDX","components":[],"specVersion":"1.7","version":1}`
- Expected `canonical_id`: `sha256:` followed by the lowercase hex SHA-256 digest of the canonical form's UTF-8 bytes.

### Vector 2: Key ordering
Input (keys out of order):
```json
{"specVersion":"1.7","bomFormat":"CycloneDX","version":1}
```
- Expected canonical form: `{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}`
- Must produce the same `canonical_id` as any other key ordering of the same content.

### Vector 3: Whitespace normalization
Input (pretty-printed):
```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.7",
  "version": 1
}
```
Must produce the same `canonical_id` as the minified form.
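The three vectors can be checked mechanically. The sketch below approximates JCS with `json.dumps` (an assumption that is adequate for these vectors because they contain only ASCII keys and no fractional numbers); it verifies the expected canonical forms and the equality requirements, and it deliberately does not hard-code digest values.

```python
import hashlib
import json


def canonicalize(raw_json: str) -> str:
    """Approximate JCS for ASCII keys and float-free documents."""
    return json.dumps(json.loads(raw_json), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)


def canonical_id(raw_json: str) -> str:
    return "sha256:" + hashlib.sha256(canonicalize(raw_json).encode("utf-8")).hexdigest()


vector1 = '{"bomFormat":"CycloneDX","specVersion":"1.7","version":1,"components":[]}'
vector2 = '{"specVersion":"1.7","bomFormat":"CycloneDX","version":1}'
vector3 = '{\n  "bomFormat": "CycloneDX",\n  "specVersion": "1.7",\n  "version": 1\n}'

expected1 = '{"bomFormat":"CycloneDX","components":[],"specVersion":"1.7","version":1}'
expected2 = '{"bomFormat":"CycloneDX","specVersion":"1.7","version":1}'

print(canonicalize(vector1) == expected1)              # True
print(canonicalize(vector2) == expected2)              # True
print(canonical_id(vector2) == canonical_id(vector3))  # True
```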

## Migration Notes

- New attestations MUST include `canonical_id` in the DSSE subject
- Existing attestations are NOT backfilled (they retain their original subject digests)
- Verification of historical attestations uses their original subject binding
- The `stella:canonical_id` metadata property is added to new SBOMs only

## References
- RFC 8785: JSON Canonicalization Scheme (JCS)
- CycloneDX v1.7 specification
- DSSE v1.0 specification
- in-toto Statement v1 specification