
End-to-End Reproducibility Testing Guide

Sprint: SPRINT_8200_0001_0004_e2e_reproducibility_test
Tasks: E2E-8200-025, E2E-8200-026
Last Updated: 2025-06-15

Overview

StellaOps implements comprehensive end-to-end (E2E) reproducibility testing to ensure that identical inputs always produce identical outputs across:

  • Sequential pipeline runs
  • Parallel pipeline runs
  • Different execution environments (Ubuntu, Windows, macOS)
  • Different points in time (using frozen timestamps)

This document describes the E2E test structure, how to run tests, and how to troubleshoot reproducibility failures.

Test Architecture

Pipeline Stages

The E2E reproducibility tests cover the full security scanning pipeline:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Full E2E Pipeline                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────┐    ┌───────────┐    ┌──────┐    ┌────────┐    ┌──────────┐   │
│  │  Ingest  │───▶│ Normalize │───▶│ Diff │───▶│ Decide │───▶│  Attest  │   │
│  │ Advisory │    │  Merge &  │    │ SBOM │    │ Policy │    │   DSSE   │   │
│  │  Feeds   │    │  Dedup    │    │  vs  │    │ Verdict│    │ Envelope │   │
│  └──────────┘    └───────────┘    │Adviso│    └────────┘    └──────────┘   │
│                                   │ries  │                        │         │
│                                   └──────┘                        ▼         │
│                                                             ┌──────────┐    │
│                                                             │  Bundle  │    │
│                                                             │ Package  │    │
│                                                             └──────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
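
Conceptually, each stage consumes the previous stage's output. The sketch below is illustrative only (the real fixture wires concrete services; these delegates are invented for the example), but it shows the linear shape the tests exercise:

using System;
using System.Collections.Generic;
using System.Linq;

// Each stage is modeled as a pure function so the linear shape is visible.
// Stage labels mirror the diagram above; none of this is the fixture's API.
var stages = new List<Func<string, string>>
{
    s => $"ingest({s})",     // Ingest advisory feeds
    s => $"normalize({s})",  // Normalize: merge & dedup
    s => $"diff({s})",       // Diff: SBOM vs advisories
    s => $"decide({s})",     // Decide: policy verdict
    s => $"attest({s})",     // Attest: DSSE envelope
    s => $"bundle({s})",     // Bundle: package outputs
};

var output = stages.Aggregate("inputs", (acc, stage) => stage(acc));
Console.WriteLine(output); // bundle(attest(decide(diff(normalize(ingest(inputs))))))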

Key Components

| Component | File | Purpose |
| --- | --- | --- |
| Test Project | StellaOps.Integration.E2E.csproj | MSBuild project for E2E tests |
| Test Fixture | E2EReproducibilityTestFixture.cs | Pipeline composition and execution |
| Tests | E2EReproducibilityTests.cs | Reproducibility verification tests |
| Comparer | ManifestComparer.cs | Byte-for-byte manifest comparison |
| CI Workflow | .gitea/workflows/e2e-reproducibility.yml | Cross-platform CI pipeline |

Running E2E Tests

Prerequisites

  • .NET 10.0 SDK
  • Docker (for PostgreSQL container)
  • At least 4GB RAM available

Local Execution

# Run all E2E reproducibility tests
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --logger "console;verbosity=detailed"

# Run specific test category
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --filter "Category=Integration" \
  --logger "console;verbosity=detailed"

# Run with code coverage
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --collect:"XPlat Code Coverage" \
  --results-directory ./TestResults

CI Execution

E2E tests run automatically on:

  • Pull requests affecting src/** or tests/integration/**
  • Pushes to main and develop branches
  • Nightly at 2:00 AM UTC (full cross-platform suite)
  • Manual trigger with optional cross-platform flag

Test Categories

1. Sequential Reproducibility (Tasks 11-14)

Tests that the pipeline produces identical results when run multiple times:

[Fact]
public async Task FullPipeline_ProducesIdenticalVerdictHash_AcrossRuns()
{
    // Arrange
    var inputs = await _fixture.SnapshotInputsAsync();

    // Act - Run twice
    var result1 = await _fixture.RunFullPipelineAsync(inputs);
    var result2 = await _fixture.RunFullPipelineAsync(inputs);

    // Assert
    result1.VerdictId.Should().Be(result2.VerdictId);
    result1.BundleManifestHash.Should().Be(result2.BundleManifestHash);
}

2. Parallel Reproducibility (Task 14)

Tests that concurrent execution produces identical results:

[Fact]
public async Task FullPipeline_ParallelExecution_10Concurrent_AllIdentical()
{
    var inputs = await _fixture.SnapshotInputsAsync();
    const int concurrentRuns = 10;

    var tasks = Enumerable.Range(0, concurrentRuns)
        .Select(_ => _fixture.RunFullPipelineAsync(inputs));

    var results = await Task.WhenAll(tasks);
    var comparison = ManifestComparer.CompareMultiple(results.ToList());
    
    comparison.AllMatch.Should().BeTrue();
}

3. Cross-Platform Reproducibility (Tasks 15-18)

Tests that identical inputs produce identical outputs on different operating systems:

| Platform | Runner | Status |
| --- | --- | --- |
| Ubuntu | ubuntu-latest | Primary (runs on every PR) |
| Windows | windows-latest | Nightly / On-demand |
| macOS | macos-latest | Nightly / On-demand |
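
The cross-platform-compare job downloads each platform's hash artifact and fails the run if any value diverges. A minimal sketch of that comparison; the directory and file names are assumptions, not the workflow's actual layout:

using System;
using System.IO;
using System.Linq;

// Assumed layout: one directory per downloaded platform artifact,
// each containing the verdict hash captured on that platform.
var platforms = new[] { "ubuntu", "windows", "macos" };
var hashes = platforms
    .Select(p => (Platform: p, Hash: File.ReadAllText(Path.Combine(p, "verdict_hash.txt")).Trim()))
    .ToList();

if (hashes.Select(h => h.Hash).Distinct(StringComparer.Ordinal).Count() == 1)
{
    Console.WriteLine($"All platforms match: {hashes[0].Hash}");
}
else
{
    foreach (var (platform, hash) in hashes)
        Console.Error.WriteLine($"{platform}: {hash}");
    Environment.Exit(1); // divergence fails the gate
}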

4. Golden Baseline Verification (Tasks 19-21)

Tests that current results match a pre-approved baseline:

// bench/determinism/golden-baseline/e2e-hashes.json
{
  "verdict_hash": "sha256:abc123...",
  "manifest_hash": "sha256:def456...",
  "envelope_hash": "sha256:ghi789...",
  "updated_at": "2025-06-15T12:00:00Z",
  "updated_by": "ci",
  "commit": "abc123def456"
}
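
A baseline check can then load this file and assert the current run against it. A minimal sketch, assuming the fixture API shown earlier; the GoldenBaseline record is illustrative, not an existing type:

// Illustrative record mirroring the fields of e2e-hashes.json above
public sealed record GoldenBaseline(
    [property: JsonPropertyName("verdict_hash")] string VerdictHash,
    [property: JsonPropertyName("manifest_hash")] string ManifestHash);

[Fact]
public async Task FullPipeline_MatchesGoldenBaseline()
{
    var baseline = JsonSerializer.Deserialize<GoldenBaseline>(
        await File.ReadAllTextAsync("bench/determinism/golden-baseline/e2e-hashes.json"))!;

    var result = await _fixture.RunFullPipelineAsync(await _fixture.SnapshotInputsAsync());

    result.VerdictId.Should().Be(baseline.VerdictHash);
    result.BundleManifestHash.Should().Be(baseline.ManifestHash);
}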

Troubleshooting Reproducibility Failures

Common Causes

1. Non-Deterministic Ordering

Symptom: Different verdict hashes despite identical inputs.

Diagnosis:

// Check if collections are being ordered
var comparison = ManifestComparer.Compare(result1, result2);
var report = ManifestComparer.GenerateDiffReport(comparison);
Console.WriteLine(report);

Solution: Ensure all collections are sorted before hashing:

// Bad - non-deterministic
var findings = results.ToList();

// Good - deterministic
var findings = results.OrderBy(f => f.CveId, StringComparer.Ordinal)
                      .ThenBy(f => f.Purl, StringComparer.Ordinal)
                      .ToList();

2. Timestamp Drift

Symptom: Bundle manifests differ in createdAt field.

Diagnosis:

var jsonComparison = ManifestComparer.CompareJson(
    result1.BundleManifest, 
    result2.BundleManifest);

Solution: Use frozen timestamps in tests:

// In test fixture
public DateTimeOffset FrozenTimestamp { get; } = 
    new DateTimeOffset(2025, 6, 15, 12, 0, 0, TimeSpan.Zero);
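
One way to thread the frozen value through the pipeline is the TimeProvider abstraction from .NET 8+, assuming components resolve time through it rather than calling DateTimeOffset.UtcNow directly; a sketch, not the fixture's actual wiring:

// Returns the same instant on every call, so createdAt fields are stable
public sealed class FrozenTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _instant;

    public FrozenTimeProvider(DateTimeOffset instant) => _instant = instant;

    public override DateTimeOffset GetUtcNow() => _instant;
}

// Registration sketch (dependency injection assumed):
// services.AddSingleton<TimeProvider>(new FrozenTimeProvider(FrozenTimestamp));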

3. Platform-Specific Behavior

Symptom: Tests pass on Ubuntu but fail on Windows/macOS.

Common causes:

  • Line ending differences (\n vs \r\n)
  • Path separator differences (/ vs \)
  • Unicode normalization differences
  • Floating-point representation differences

Diagnosis:

# Download artifacts from all platforms
# Compare hex dumps
xxd ubuntu-manifest.bin > ubuntu.hex
xxd windows-manifest.bin > windows.hex
diff ubuntu.hex windows.hex

Solution: Use platform-agnostic serialization:

// Use canonical JSON
var json = CanonJson.Serialize(data);

// Normalize line endings
var normalized = content.Replace("\r\n", "\n");
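
CanonJson's exact rules are project-specific, but the core of most canonical JSON schemes is recursive, ordinal key sorting. A sketch of that idea with System.Text.Json.Nodes (not CanonJson's actual implementation):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json.Nodes;

// Rebuilds a JSON tree with object keys in ordinal order so that
// serializing the result yields identical bytes on every platform.
static JsonNode? Canonicalize(JsonNode? node) => node switch
{
    JsonObject obj => new JsonObject(obj
        .OrderBy(p => p.Key, StringComparer.Ordinal)
        .Select(p => KeyValuePair.Create(p.Key, Canonicalize(p.Value)))),
    JsonArray arr => new JsonArray(arr.Select(Canonicalize).ToArray()),
    _ => node?.DeepClone(),
};

// Usage: var canonical = Canonicalize(JsonNode.Parse(json))!.ToJsonString();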

4. Key/Signature Differences

Symptom: Envelope hashes differ despite identical payloads.

Diagnosis:

// Compare envelope structure
var envelope1 = JsonSerializer.Deserialize<DsseEnvelope>(result1.EnvelopeBytes);
var envelope2 = JsonSerializer.Deserialize<DsseEnvelope>(result2.EnvelopeBytes);

// Check if payloads match
envelope1.Payload.SequenceEqual(envelope2.Payload).Should().BeTrue();

Solution: Use deterministic key generation so every run signs with the same key (note that plain ECDSA signing is itself randomized, so byte-identical envelopes also require a deterministic signing scheme such as RFC 6979, or comparison at the payload level):

// Generate key from fixed seed for reproducibility
private static ECDsa GenerateDeterministicKey(int seed)
{
    var rng = new DeterministicRng(seed);
    var keyBytes = new byte[32];
    rng.GetBytes(keyBytes);
    // ... create key from bytes
}

Debugging Tools

ManifestComparer

// Full comparison
var comparison = ManifestComparer.Compare(expected, actual);

// Multiple results
var multiComparison = ManifestComparer.CompareMultiple(results);

// Detailed report
var report = ManifestComparer.GenerateDiffReport(comparison);

// Hex dump for byte-level debugging
var hexDump = ManifestComparer.GenerateHexDump(expected.BundleManifest, actual.BundleManifest);

JSON Comparison

var jsonComparison = ManifestComparer.CompareJson(
    expected.BundleManifest,
    actual.BundleManifest);

foreach (var diff in jsonComparison.Differences)
{
    Console.WriteLine($"Path: {diff.Path}");
    Console.WriteLine($"Expected: {diff.Expected}");
    Console.WriteLine($"Actual: {diff.Actual}");
}

Updating the Golden Baseline

When intentional changes affect reproducibility (e.g., new fields, algorithm changes):

1. Manual Update

# Run tests and capture new hashes
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --results-directory ./TestResults

# Update baseline
cp ./TestResults/verdict_hash.txt ./bench/determinism/golden-baseline/
# ... update e2e-hashes.json

2. CI-Triggered Update

Trigger the workflow with the update flag set:

# Via Gitea UI: Actions → E2E Reproducibility → Run workflow
# Set update_baseline = true

3. Approval Process

  1. Create PR with baseline update
  2. Explain why the change is intentional
  3. Verify all platforms produce consistent results
  4. Get approval from Platform Guild lead
  5. Merge after CI passes

CI Workflow Reference

Jobs

| Job | Runs On | Trigger | Purpose |
| --- | --- | --- | --- |
| reproducibility-ubuntu | Every PR | PR/Push | Primary reproducibility check |
| reproducibility-windows | Nightly | Schedule/Manual | Cross-platform Windows |
| reproducibility-macos | Nightly | Schedule/Manual | Cross-platform macOS |
| cross-platform-compare | After platform jobs | Schedule/Manual | Compare hashes |
| golden-baseline | After Ubuntu | Always | Baseline verification |
| reproducibility-gate | After all | Always | Final status check |

Artifacts

| Artifact | Retention | Contents |
| --- | --- | --- |
| e2e-results-{platform} | 14 days | Test results (.trx), logs |
| hashes-{platform} | 14 days | Hash files for comparison |
| cross-platform-report | 30 days | Markdown comparison report |

Sprint History

  • 8200.0001.0004 - Initial E2E reproducibility test implementation
  • 8200.0001.0001 - VerdictId content-addressing (dependency)
  • 8200.0001.0002 - DSSE round-trip testing (dependency)