# End-to-End Reproducibility Testing Guide

> **Sprint:** SPRINT_8200_0001_0004_e2e_reproducibility_test
> **Tasks:** E2E-8200-025, E2E-8200-026
> **Last Updated:** 2025-06-15

## Overview

StellaOps implements comprehensive end-to-end (E2E) reproducibility testing to ensure that identical inputs always produce identical outputs across:

- Sequential pipeline runs
- Parallel pipeline runs
- Different execution environments (Ubuntu, Windows, macOS)
- Different points in time (using frozen timestamps)

This document describes the E2E test structure, how to run tests, and how to troubleshoot reproducibility failures.

## Test Architecture

### Pipeline Stages

The E2E reproducibility tests cover the full security scanning pipeline:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Full E2E Pipeline                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐    ┌───────────┐    ┌──────┐    ┌────────┐    ┌──────────┐    │
│  │  Ingest  │───▶│ Normalize │───▶│ Diff │───▶│ Decide │───▶│  Attest  │    │
│  │ Advisory │    │  Merge &  │    │ SBOM │    │ Policy │    │   DSSE   │    │
│  │  Feeds   │    │   Dedup   │    │  vs  │    │ Verdict│    │ Envelope │    │
│  └──────────┘    └───────────┘    │Adviso│    └────────┘    └──────────┘    │
│                                   │ries  │                       │          │
│                                   └──────┘                       ▼          │
│                                                             ┌──────────┐    │
│                                                             │  Bundle  │    │
│                                                             │ Package  │    │
│                                                             └──────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Key Components

| Component | File | Purpose |
|-----------|------|---------|
| Test Project | `StellaOps.Integration.E2E.csproj` | MSBuild project for E2E tests |
| Test Fixture | `E2EReproducibilityTestFixture.cs` | Pipeline composition and execution |
| Tests | `E2EReproducibilityTests.cs` | Reproducibility verification tests |
| Comparer | `ManifestComparer.cs` | Byte-for-byte manifest comparison |
| CI Workflow | `.gitea/workflows/e2e-reproducibility.yml` | Cross-platform CI pipeline |
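
The table describes `ManifestComparer` as a byte-for-byte comparer. As a rough illustration of what such a comparison involves, the helper below reports the offset of the first differing byte. This is a sketch, not the project's implementation; `ByteDiffSketch` and `FirstDifference` are hypothetical names introduced here.

```csharp
using System;

// Hypothetical helper for illustration only; not the real ManifestComparer.
public static class ByteDiffSketch
{
    /// <summary>
    /// Returns -1 when both buffers are identical, otherwise the offset
    /// of the first byte at which they differ.
    /// </summary>
    public static int FirstDifference(ReadOnlySpan<byte> expected, ReadOnlySpan<byte> actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }

        // Same prefix: identical if lengths match, otherwise they diverge
        // where the shorter buffer ends.
        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

Knowing the first differing offset usually narrows a failure to a single field or record, which is often faster than reading a full diff.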

## Running E2E Tests

### Prerequisites

- .NET 10.0 SDK
- Docker (for PostgreSQL container)
- At least 4 GB RAM available

### Local Execution

```bash
# Run all E2E reproducibility tests
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --logger "console;verbosity=detailed"

# Run specific test category
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --filter "Category=Integration" \
  --logger "console;verbosity=detailed"

# Run with code coverage
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --collect:"XPlat Code Coverage" \
  --results-directory ./TestResults
```

### CI Execution

E2E tests run automatically on:

- Pull requests affecting `src/**` or `tests/integration/**`
- Pushes to `main` and `develop` branches
- Nightly at 2:00 AM UTC (full cross-platform suite)
- Manual trigger with optional cross-platform flag

## Test Categories

### 1. Sequential Reproducibility (Tasks 11-14)

Tests that the pipeline produces identical results when run multiple times:

```csharp
[Fact]
public async Task FullPipeline_ProducesIdenticalVerdictHash_AcrossRuns()
{
    // Arrange
    var inputs = await _fixture.SnapshotInputsAsync();

    // Act - Run twice
    var result1 = await _fixture.RunFullPipelineAsync(inputs);
    var result2 = await _fixture.RunFullPipelineAsync(inputs);

    // Assert
    result1.VerdictId.Should().Be(result2.VerdictId);
    result1.BundleManifestHash.Should().Be(result2.BundleManifestHash);
}
```

### 2. Parallel Reproducibility (Task 14)

Tests that concurrent execution produces identical results:

```csharp
[Fact]
public async Task FullPipeline_ParallelExecution_10Concurrent_AllIdentical()
{
    var inputs = await _fixture.SnapshotInputsAsync();
    const int concurrentRuns = 10;

    var tasks = Enumerable.Range(0, concurrentRuns)
        .Select(_ => _fixture.RunFullPipelineAsync(inputs));

    var results = await Task.WhenAll(tasks);
    var comparison = ManifestComparer.CompareMultiple(results.ToList());

    comparison.AllMatch.Should().BeTrue();
}
```
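
This guide does not show how `CompareMultiple` decides that every run matched. One plausible approach, sketched here under that assumption, is to hash each run's manifest bytes and check that only one distinct digest remains. `AllMatchSketch` and `AllIdentical` are hypothetical names, not part of `ManifestComparer`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper for illustration; the real CompareMultiple may work differently.
public static class AllMatchSketch
{
    /// <summary>
    /// Returns true when every manifest has the same SHA-256 digest.
    /// An empty list is treated as trivially matching.
    /// </summary>
    public static bool AllIdentical(IReadOnlyList<byte[]> manifests)
    {
        if (manifests.Count == 0)
        {
            return true;
        }

        return manifests
            .Select(m => Convert.ToHexString(SHA256.HashData(m)))
            .Distinct(StringComparer.Ordinal)
            .Count() == 1;
    }
}
```

Comparing digests rather than raw buffers keeps the check cheap for many runs, and the digests can be logged when a mismatch needs investigating.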

### 3. Cross-Platform Reproducibility (Tasks 15-18)

Tests that identical inputs produce identical outputs on different operating systems:

| Platform | Runner | Status |
|----------|--------|--------|
| Ubuntu | `ubuntu-latest` | Primary (runs on every PR) |
| Windows | `windows-latest` | Nightly / On-demand |
| macOS | `macos-latest` | Nightly / On-demand |
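
A cross-platform check reduces to comparing each platform's hash against a reference platform. The sketch below assumes the hashes have already been collected into a dictionary keyed by platform name; `CrossPlatformSketch`, `Mismatches`, and the choice of Ubuntu as the reference are illustrative assumptions, not a description of the CI job.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not the actual CI comparison step.
public static class CrossPlatformSketch
{
    /// <summary>
    /// Returns the platforms whose hash differs from the reference platform,
    /// sorted ordinally so the report itself is deterministic.
    /// </summary>
    public static IReadOnlyList<string> Mismatches(
        IReadOnlyDictionary<string, string> hashesByPlatform,
        string reference = "ubuntu")
    {
        string expected = hashesByPlatform[reference];

        return hashesByPlatform
            .Where(kv => !string.Equals(kv.Value, expected, StringComparison.Ordinal))
            .Select(kv => kv.Key)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

An empty result means every platform agrees with the reference; any returned name identifies where to start the byte-level diagnosis described under Troubleshooting.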

### 4. Golden Baseline Verification (Tasks 19-21)

Tests that current results match a pre-approved baseline stored in `bench/determinism/golden-baseline/e2e-hashes.json`:

```json
{
  "verdict_hash": "sha256:abc123...",
  "manifest_hash": "sha256:def456...",
  "envelope_hash": "sha256:ghi789...",
  "updated_at": "2025-06-15T12:00:00Z",
  "updated_by": "ci",
  "commit": "abc123def456"
}
```
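
Checking a result against the baseline amounts to reading one field from the JSON file and comparing it with a freshly computed digest. The following is a minimal sketch of that idea. `BaselineSketch` and its methods are hypothetical, and the `sha256:` prefix followed by lowercase hex is an assumption inferred from the sample values above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text.Json;

// Hypothetical helper for illustration; the real baseline check may differ.
public static class BaselineSketch
{
    // Assumed label format: "sha256:" followed by lowercase hex.
    public static string Sha256Label(byte[] content) =>
        "sha256:" + Convert.ToHexString(SHA256.HashData(content)).ToLowerInvariant();

    // Reads one string field (for example "verdict_hash") from the baseline JSON.
    public static string ReadBaselineField(string baselineJson, string field)
    {
        using var doc = JsonDocument.Parse(baselineJson);
        return doc.RootElement.GetProperty(field).GetString()
            ?? throw new InvalidOperationException($"Field '{field}' is null.");
    }

    // True when the recorded baseline value equals the digest of the given content.
    public static bool MatchesBaseline(string baselineJson, string field, byte[] content) =>
        string.Equals(
            ReadBaselineField(baselineJson, field),
            Sha256Label(content),
            StringComparison.Ordinal);
}
```

`JsonDocument.Parse` rejects `//` comments under its default options, so the baseline file itself should contain only the JSON object.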

## Troubleshooting Reproducibility Failures

### Common Causes

#### 1. Non-Deterministic Ordering

**Symptom:** Different verdict hashes despite identical inputs.

**Diagnosis:**
```csharp
// Compare two runs and print which parts differ
var comparison = ManifestComparer.Compare(result1, result2);
var report = ManifestComparer.GenerateDiffReport(comparison);
Console.WriteLine(report);
```

**Solution:** Ensure all collections are sorted before hashing:
```csharp
// Bad - non-deterministic: preserves whatever order the source happened to yield
var unorderedFindings = results.ToList();

// Good - deterministic: explicit ordinal ordering on stable keys
var orderedFindings = results.OrderBy(f => f.CveId, StringComparer.Ordinal)
    .ThenBy(f => f.Purl, StringComparer.Ordinal)
    .ToList();
```
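
To see why ordering matters for the hash, the sketch below sorts findings by the same keys before hashing, so two inputs that contain the same findings in a different order produce the same digest. The `Finding` record, `OrderingSketch`, and the `CveId|Purl` line format are hypothetical stand-ins for whatever the pipeline actually hashes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical finding shape for illustration.
public sealed record Finding(string CveId, string Purl);

public static class OrderingSketch
{
    /// <summary>
    /// Hashes findings after sorting them ordinally by CveId, then Purl,
    /// so the digest does not depend on input order.
    /// </summary>
    public static string HashFindings(IEnumerable<Finding> findings)
    {
        var lines = findings
            .OrderBy(f => f.CveId, StringComparer.Ordinal)
            .ThenBy(f => f.Purl, StringComparer.Ordinal)
            .Select(f => $"{f.CveId}|{f.Purl}");

        byte[] bytes = Encoding.UTF8.GetBytes(string.Join("\n", lines));
        return Convert.ToHexString(SHA256.HashData(bytes));
    }
}
```

`StringComparer.Ordinal` is the important detail: culture-sensitive comparers can order strings differently depending on the host's globalization data, which would reintroduce platform-dependent output.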

#### 2. Timestamp Drift

**Symptom:** Bundle manifests differ in `createdAt` field.

**Diagnosis:**
```csharp
var jsonComparison = ManifestComparer.CompareJson(
    result1.BundleManifest,
    result2.BundleManifest);
```

**Solution:** Use frozen timestamps in tests:
```csharp
// In test fixture
public DateTimeOffset FrozenTimestamp { get; } =
    new DateTimeOffset(2025, 6, 15, 12, 0, 0, TimeSpan.Zero);
```
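
A frozen timestamp only helps if every stage reads the clock through the same abstraction. One way to arrange that, sketched here as an assumption rather than a description of the fixture, is to inject the standard `TimeProvider` type (.NET 8 and later) and substitute a fixed implementation in tests. `FrozenTimeProvider` and `ManifestClockSketch` are hypothetical names.

```csharp
using System;
using System.Globalization;

// Hypothetical fixed clock built on the standard TimeProvider abstraction.
public sealed class FrozenTimeProvider : TimeProvider
{
    private readonly DateTimeOffset _now;

    public FrozenTimeProvider(DateTimeOffset now) => _now = now;

    // Always returns the same instant, regardless of wall-clock time.
    public override DateTimeOffset GetUtcNow() => _now;
}

// Hypothetical consumer: takes the clock as a dependency instead of
// calling DateTimeOffset.UtcNow directly.
public sealed class ManifestClockSketch
{
    private readonly TimeProvider _clock;

    public ManifestClockSketch(TimeProvider clock) => _clock = clock;

    // Round-trip ("O") formatting yields a stable, culture-independent string.
    public string CreatedAt() =>
        _clock.GetUtcNow().ToString("O", CultureInfo.InvariantCulture);
}
```

In production code the same class would receive `TimeProvider.System`, so only the tests see the frozen value.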

#### 3. Platform-Specific Behavior

**Symptom:** Tests pass on Ubuntu but fail on Windows/macOS.

**Common causes:**
- Line ending differences (`\n` vs `\r\n`)
- Path separator differences (`/` vs `\`)
- Unicode normalization differences
- Floating-point representation differences

**Diagnosis:**
```bash
# Download artifacts from all platforms
# Compare hex dumps
xxd ubuntu-manifest.bin > ubuntu.hex
xxd windows-manifest.bin > windows.hex
diff ubuntu.hex windows.hex
```

**Solution:** Use platform-agnostic serialization:
```csharp
// Use canonical JSON
var json = CanonJson.Serialize(data);

// Normalize line endings
var normalized = content.Replace("\r\n", "\n");
```
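
The four causes listed above each have a small, standard normalization step. The helper below collects them as a sketch; `PlatformNormalizeSketch` and its method names are hypothetical, and the choice of NFC as the Unicode form is an assumption rather than something this guide specifies.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical normalization helpers for illustration.
public static class PlatformNormalizeSketch
{
    // Converts Windows line endings to "\n" and applies Unicode NFC (assumed form).
    public static string NormalizeText(string content) =>
        content.Replace("\r\n", "\n").Normalize(NormalizationForm.FormC);

    // Uses forward slashes regardless of the host path separator.
    public static string NormalizePath(string path) => path.Replace('\\', '/');

    // Formats numbers with the invariant culture so the decimal separator
    // does not depend on the machine's regional settings.
    public static string FormatNumber(double value) =>
        value.ToString("R", CultureInfo.InvariantCulture);
}
```

Applying these steps before serialization, rather than after, keeps the canonical form in one place and avoids hashing platform-specific bytes by accident.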

#### 4. Key/Signature Differences

**Symptom:** Envelope hashes differ despite identical payloads.

**Diagnosis:**
```csharp
// Compare envelope structure
var envelope1 = JsonSerializer.Deserialize<DsseEnvelope>(result1.EnvelopeBytes);
var envelope2 = JsonSerializer.Deserialize<DsseEnvelope>(result2.EnvelopeBytes);

// Check if payloads match
envelope1.Payload.SequenceEqual(envelope2.Payload).Should().BeTrue();
```

**Solution:** Use deterministic key generation:
```csharp
// Generate key from fixed seed for reproducibility
private static ECDsa GenerateDeterministicKey(int seed)
{
    var rng = new DeterministicRng(seed);
    var keyBytes = new byte[32];
    rng.GetBytes(keyBytes);
    // ... create key from bytes
}
```
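
When diagnosing envelope differences, it can help to separate the key from the signing operation. ECDSA signing in .NET typically draws a fresh random nonce for each signature, so two signatures over the same payload with the same key can both be valid yet differ byte for byte. The sketch below demonstrates this with the standard `ECDsa` API. `SignatureSketch` is a hypothetical name, and whether this applies to the StellaOps signer is an assumption to verify against the attestor implementation.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical demonstration; not the project's signing code.
public static class SignatureSketch
{
    /// <summary>
    /// Signs the same payload twice with one P-256 key and reports
    /// both signatures plus whether each verifies.
    /// </summary>
    public static (byte[] First, byte[] Second, bool BothVerify) SignTwice(string payload)
    {
        using ECDsa key = ECDsa.Create(ECCurve.NamedCurves.nistP256);
        byte[] data = Encoding.UTF8.GetBytes(payload);

        byte[] first = key.SignData(data, HashAlgorithmName.SHA256);
        byte[] second = key.SignData(data, HashAlgorithmName.SHA256);

        bool bothVerify =
            key.VerifyData(data, first, HashAlgorithmName.SHA256) &&
            key.VerifyData(data, second, HashAlgorithmName.SHA256);

        return (first, second, bothVerify);
    }
}
```

If the two signatures differ while the payloads match, comparing payload digests and verifying signatures separately may be a more stable assertion than comparing whole-envelope hashes.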

### Debugging Tools

#### ManifestComparer

```csharp
// Full comparison
var comparison = ManifestComparer.Compare(expected, actual);

// Multiple results
var multiComparison = ManifestComparer.CompareMultiple(results);

// Detailed report
var report = ManifestComparer.GenerateDiffReport(comparison);

// Hex dump for byte-level debugging
var hexDump = ManifestComparer.GenerateHexDump(expected.BundleManifest, actual.BundleManifest);
```

#### JSON Comparison

```csharp
var jsonComparison = ManifestComparer.CompareJson(
    expected.BundleManifest,
    actual.BundleManifest);

foreach (var diff in jsonComparison.Differences)
{
    Console.WriteLine($"Path: {diff.Path}");
    Console.WriteLine($"Expected: {diff.Expected}");
    Console.WriteLine($"Actual: {diff.Actual}");
}
```
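
For readers curious how a path-based JSON comparison can work, the sketch below walks two documents in parallel with `System.Text.Json` and collects the paths where they differ. It is an illustration only: `JsonDiffSketch`, `DiffPaths`, and the `$.a[0]` path syntax are assumptions and may not match what `CompareJson` produces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical JSON diff for illustration; not the real CompareJson.
public static class JsonDiffSketch
{
    // Returns the paths at which the two JSON documents differ.
    public static List<string> DiffPaths(string expectedJson, string actualJson)
    {
        using var expected = JsonDocument.Parse(expectedJson);
        using var actual = JsonDocument.Parse(actualJson);
        var paths = new List<string>();
        Walk(expected.RootElement, actual.RootElement, "$", paths);
        return paths;
    }

    private static void Walk(JsonElement e, JsonElement a, string path, List<string> paths)
    {
        if (e.ValueKind != a.ValueKind)
        {
            paths.Add(path);
            return;
        }

        switch (e.ValueKind)
        {
            case JsonValueKind.Object:
                // Visit the union of property names in a stable order.
                var names = e.EnumerateObject().Select(p => p.Name)
                    .Union(a.EnumerateObject().Select(p => p.Name))
                    .OrderBy(n => n, StringComparer.Ordinal);
                foreach (var name in names)
                {
                    bool inExpected = e.TryGetProperty(name, out var ev);
                    bool inActual = a.TryGetProperty(name, out var av);
                    if (inExpected && inActual) Walk(ev, av, $"{path}.{name}", paths);
                    else paths.Add($"{path}.{name}"); // present on one side only
                }
                break;

            case JsonValueKind.Array:
                if (e.GetArrayLength() != a.GetArrayLength())
                {
                    paths.Add(path);
                    break;
                }
                for (int i = 0; i < e.GetArrayLength(); i++)
                {
                    Walk(e[i], a[i], $"{path}[{i}]", paths);
                }
                break;

            default:
                // Scalars are compared by raw text, matching a byte-oriented check.
                if (e.GetRawText() != a.GetRawText()) paths.Add(path);
                break;
        }
    }
}
```

For a timestamp drift failure, a diff of this kind would typically report a single path such as `$.createdAt`, which points directly at the field to freeze.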

## Updating the Golden Baseline

When intentional changes affect reproducibility (e.g., new fields, algorithm changes):

### 1. Manual Update

```bash
# Run tests and capture new hashes
dotnet test tests/integration/StellaOps.Integration.E2E/ \
  --results-directory ./TestResults

# Update baseline
cp ./TestResults/verdict_hash.txt ./bench/determinism/golden-baseline/
# ... update e2e-hashes.json
```

### 2. CI Update (Recommended)

```bash
# Trigger workflow with update flag
# Via Gitea UI: Actions → E2E Reproducibility → Run workflow
# Set update_baseline = true
```

### 3. Approval Process

1. Create PR with baseline update
2. Explain why the change is intentional
3. Verify all platforms produce consistent results
4. Get approval from Platform Guild lead
5. Merge after CI passes

## CI Workflow Reference

### Jobs

| Job | When It Runs | Trigger | Purpose |
|-----|--------------|---------|---------|
| `reproducibility-ubuntu` | Every PR | PR/Push | Primary reproducibility check |
| `reproducibility-windows` | Nightly | Schedule/Manual | Cross-platform Windows |
| `reproducibility-macos` | Nightly | Schedule/Manual | Cross-platform macOS |
| `cross-platform-compare` | After platform jobs | Schedule/Manual | Compare hashes |
| `golden-baseline` | After Ubuntu | Always | Baseline verification |
| `reproducibility-gate` | After all | Always | Final status check |

### Artifacts

| Artifact | Retention | Contents |
|----------|-----------|----------|
| `e2e-results-{platform}` | 14 days | Test results (.trx), logs |
| `hashes-{platform}` | 14 days | Hash files for comparison |
| `cross-platform-report` | 30 days | Markdown comparison report |

## Related Documentation

- [Reproducibility Architecture](../reproducibility.md)
- [VerdictId Content-Addressing](../modules/policy/architecture.md#verdictid)
- [DSSE Envelope Format](../modules/attestor/architecture.md#dsse)
- [Determinism Testing](./determinism-verification.md)

## Sprint History

- **8200.0001.0004** - Initial E2E reproducibility test implementation
- **8200.0001.0001** - VerdictId content-addressing (dependency)
- **8200.0001.0002** - DSSE round-trip testing (dependency)