CI/CD Troubleshooting Guide

Common issues and solutions for StellaOps CI/CD infrastructure.

Quick Diagnostics

Check Workflow Status

# View recent workflow runs
gh run list --limit 10

# View specific run logs
gh run view <run-id> --log

# Re-run failed workflow
gh run rerun <run-id>

Verify Local Environment

# Check .NET SDK
dotnet --list-sdks

# Check Docker
docker version
docker buildx version

# Check Node.js
node --version
npm --version

# Check required tools
which cosign syft helm
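
The individual checks above can be folded into one helper that reports every missing tool in a single pass. This is a minimal sketch; the function name `missing_tools` is hypothetical and is not one of the repository's scripts.

```shell
# Print each tool from the argument list that is not on PATH, one per line.
# Returns non-zero if at least one tool is missing. (Hypothetical helper.)
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_tools dotnet docker node npm cosign syft helm` prints nothing and returns 0 when the environment is complete.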

Build Failures

NuGet Restore Failures

Symptom: error NU1301: Unable to load the service index

Causes:

  1. Network connectivity issues
  2. NuGet source unavailable
  3. Invalid credentials

Solutions:

# Clear NuGet cache
dotnet nuget locals all --clear

# Check NuGet sources
dotnet nuget list source

# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed

In CI:

- name: Restore with retry
  run: |
    for i in {1..3}; do
      dotnet restore src/StellaOps.sln && exit 0
      echo "Attempt $i of 3 failed"
      sleep 30
    done
    exit 1  # all attempts failed; do not let the step pass
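
If several steps need the same behaviour, the retry loop can be factored into a small function. The sketch below assumes a POSIX-compatible shell; the name `retry` and its argument order (attempts, delay in seconds, command) are illustrative choices, not an existing script in this repository.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success and 1 if every attempt fails.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $attempt/$max_attempts failed: $*" >&2
    # Skip the pause after the final attempt
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

Usage would look like `retry 3 30 dotnet restore src/StellaOps.sln`; the non-zero return on exhaustion makes the CI step fail instead of passing silently.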

SDK Version Mismatch

Symptom: error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found

Solutions:

  1. Check global.json:

    cat global.json
    
  2. Install correct SDK:

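    # CI environment
    # (include-prerelease was removed in setup-dotnet v3; for previews use a
    # floating version such as '10.0.x' together with dotnet-quality: 'preview')
    - uses: actions/setup-dotnet@v4
      with:
        dotnet-version: '10.0.100'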
    
  3. Remove the global.json pin (local diagnosis only; do not commit the change):

    # Without global.json the CLI uses the latest installed SDK
    rm global.json
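
The first two checks can be scripted by comparing the version pinned in global.json with the installed SDKs. This is a rough sketch: both function names are hypothetical, the JSON is parsed with `sed` instead of a real JSON parser, and an exact match is assumed even though a `rollForward` policy in global.json may accept other versions.

```shell
# Print the first "version" value found in a global.json file.
# Assumes the conventional layout where the SDK version is the first such key.
required_sdk_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed if version $1 appears in `dotnet --list-sdks`-style output on stdin
# (lines such as "10.0.100 [/usr/share/dotnet/sdk]").
sdk_is_installed() {
  awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}
```

A typical call is `dotnet --list-sdks | sdk_is_installed "$(required_sdk_version global.json)"`.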
    

Docker Build Failures

Symptom: failed to solve: rpc error: code = Unknown

Causes:

  1. Disk space exhausted
  2. Layer cache corruption
  3. Network timeout

Solutions:

# Clean Docker system
docker system prune -af
docker builder prune -af

# Build without cache
docker build --no-cache -t myimage .

# Create a builder that uses host networking (can help with network timeouts)
docker buildx create --driver-opt network=host --use

Multi-arch Build Failures

Symptom: exec format error or QEMU issues

Solutions:

# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .

Test Failures

Testcontainers Issues

Symptom: Could not find a running Docker daemon

Solutions:

  1. Ensure Docker is running:

    docker info
    
  2. Set Testcontainers host:

    export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
    # or for Linux
    export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
    
  3. Use Ryuk container for cleanup:

    export TESTCONTAINERS_RYUK_DISABLED=false
    
  4. CI configuration:

    services:
      dind:
        image: docker:dind
        privileged: true
    

PostgreSQL Test Failures

Symptom: FATAL: role "postgres" does not exist

Solutions:

  1. Check connection string:

    export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
    
  2. Use Testcontainers PostgreSQL:

    var container = new PostgreSqlBuilder()
        .WithDatabase("test")
        .WithUsername("postgres")
        .WithPassword("postgres")
        .Build();
    
  3. Wait for PostgreSQL readiness:

    until pg_isready -h localhost -p 5432; do
      sleep 1
    done
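
The readiness loop above waits forever if PostgreSQL never comes up. A bounded variant might look like the following sketch; `wait_for` is a hypothetical helper name, and the elapsed time only counts the one-second sleeps, so the real limit is approximate.

```shell
# Poll a command about once per second until it succeeds or roughly $1 seconds pass.
# Returns 0 when the command succeeds, 1 on timeout.
wait_for() {
  timeout_s=$1
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "Timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For the case above: `wait_for 60 pg_isready -h localhost -p 5432`.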
    

Test Timeouts

Symptom: Test exceeded timeout

Solutions:

  1. Raise the hang timeout (the run is aborted and a dump collected once a test exceeds it):

    dotnet test --blame-hang-timeout 10m
    
  2. Run tests in parallel with limited concurrency:

    dotnet test -maxcpucount:2
    
  3. Identify slow tests:

    dotnet test --logger "console;verbosity=detailed" --logger "trx"
    

Determinism Test Failures

Symptom: Output mismatch: expected SHA256 differs

Solutions:

  1. Check for non-deterministic sources:

    • Timestamps
    • Random GUIDs
    • Floating-point operations
    • Dictionary ordering
  2. Run determinism comparison:

    .gitea/scripts/test/determinism-run.sh
    diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
    
  3. Update golden fixtures:

    .gitea/scripts/test/run-fixtures-check.sh --update
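
Before diffing two runs line by line, it can be quicker to compare their SHA-256 digests. The sketch below is independent of the repository's determinism scripts; the function names are hypothetical, and it falls back to `shasum` where `sha256sum` is not installed.

```shell
# Print the SHA-256 hex digest of a file, using whichever tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Succeed if both files have the same digest; otherwise report both digests.
same_digest() {
  digest_a=$(sha256_of "$1")
  digest_b=$(sha256_of "$2")
  if [ "$digest_a" = "$digest_b" ]; then
    return 0
  fi
  echo "mismatch: $1=$digest_a $2=$digest_b" >&2
  return 1
}
```

For example: `same_digest out/scanner-determinism/run1.json out/scanner-determinism/run2.json`.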
    

Deployment Failures

SSH Connection Issues

Symptom: ssh: connect to host X.X.X.X port 22: Connection refused

Solutions:

  1. Verify SSH key:

    ssh-keygen -lf ~/.ssh/id_rsa.pub
    
  2. Test connection:

    ssh -vvv user@host
    
  3. Add host to known_hosts:

    - name: Setup SSH
      run: |
        mkdir -p ~/.ssh
        ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
    

Registry Push Failures

Symptom: unauthorized: authentication required

Solutions:

  1. Login to registry:

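    # Pass the password on stdin so it does not appear in the process list or shell history
    echo "$REGISTRY_PASSWORD" | docker login git.stella-ops.org -u "$REGISTRY_USERNAME" --password-stdin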
    
  2. Check token permissions:

    • write:package scope required (Gitea token scope)
    • Token not expired
  3. Use the login action in CI:

    - name: Login to Registry
      uses: docker/login-action@v3
      with:
        registry: git.stella-ops.org
        username: ${{ secrets.REGISTRY_USERNAME }}
        password: ${{ secrets.REGISTRY_PASSWORD }}
    

Helm Deployment Failures

Symptom: Error: UPGRADE FAILED: cannot patch

Solutions:

  1. Check resource conflicts:

    kubectl get events -n stellaops --sort-by='.lastTimestamp'
    
  2. Force upgrade:

    helm upgrade --install --force stellaops ./devops/helm/stellaops
    
  3. Clean up stuck release:

    helm history stellaops
    helm rollback stellaops <revision>
    # or, as a last resort, delete the release's Helm secrets (this discards release history)
    kubectl delete secret -l name=stellaops,owner=helm
    

Workflow Issues

Workflow Not Triggering

Symptom: Push/PR doesn't trigger workflow

Causes:

  1. Path filter not matching
  2. Branch protection rules
  3. YAML syntax error

Solutions:

  1. Check path filters:

    on:
      push:
        paths:
          - 'src/**'  # Check if files match
    
  2. Validate YAML:

    .gitea/scripts/validate/validate-workflows.sh
    
  3. Check branch rules:

    • Verify workflow permissions
    • Check protected branch settings
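
To check locally whether a push touches the filtered paths, the list of changed files can be tested against the filter's directory prefix. This sketch only approximates a `dir/**` pattern and does not implement the full glob rules used by the workflow engine; `any_path_under` is a hypothetical name.

```shell
# Succeed if any path read from stdin lies under the directory prefix $1.
# Approximates a 'dir/**' filter only.
any_path_under() {
  prefix=${1%/}/
  while IFS= read -r changed; do
    case "$changed" in
      "$prefix"*) return 0 ;;
    esac
  done
  return 1
}
```

For example: `git diff --name-only HEAD~1 HEAD | any_path_under src`.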

Concurrency Issues

Symptom: Duplicate runs or stuck workflows

Solutions:

  1. Add concurrency control:

    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
    
  2. Cancel stale runs manually:

    gh run cancel <run-id>
    

Artifact Upload/Download Failures

Symptom: Unable to find any artifacts

Solutions:

  1. Check artifact names match:

    # Upload
    - uses: actions/upload-artifact@v4
      with:
        name: my-artifact  # Must match
    
    # Download
    - uses: actions/download-artifact@v4
      with:
        name: my-artifact  # Must match
    
  2. Check retention period:

    - uses: actions/upload-artifact@v4
      with:
        retention-days: 90  # Default is 90
    
  3. Verify job dependencies:

    download-job:
      needs: [upload-job]  # Must complete first
    

Runner Issues

Disk Space Exhausted

Symptom: No space left on device

Solutions:

  1. Run cleanup script:

    .gitea/scripts/util/cleanup-runner-space.sh
    
  2. Add cleanup step to workflow:

    - name: Free disk space
      run: |
        docker system prune -af
        rm -rf /tmp/*
        df -h
    
  3. Use larger runner:

    runs-on: ubuntu-latest-4xlarge
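
Cleanup can also be made conditional on actual disk usage. The following sketch reads the usage percentage from `df -P`; both function names are hypothetical, and the parsing assumes the filesystem name contains no spaces.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem holding $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage of the filesystem holding $1 is at or above $2 percent.
disk_is_nearly_full() {
  [ "$(disk_used_pct "$1")" -ge "$2" ]
}
```

A workflow step could then run `if disk_is_nearly_full / 85; then docker system prune -af; fi`, where the 85 percent threshold is an arbitrary example value.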
    

Out of Memory

Symptom: Killed or OOMKilled

Solutions:

  1. Limit parallel jobs:

    strategy:
      max-parallel: 2
    
  2. Limit dotnet memory:

    export DOTNET_GCHeapHardLimit=0x40000000  # 1GB
    
  3. Use swap:

    - name: Create swap
      run: |
        sudo fallocate -l 4G /swapfile
        sudo chmod 600 /swapfile
        sudo mkswap /swapfile
        sudo swapon /swapfile
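
`DOTNET_GCHeapHardLimit` takes a byte count in hexadecimal, which is easy to get wrong by hand. A small conversion helper, sketched here with the hypothetical name `heap_limit_hex`, turns a size in MiB into that format.

```shell
# Convert a size in MiB to the hexadecimal byte count used by DOTNET_GCHeapHardLimit.
heap_limit_hex() {
  printf '0x%X\n' "$(( $1 * 1024 * 1024 ))"
}
```

For example, `export DOTNET_GCHeapHardLimit=$(heap_limit_hex 1024)` sets the same 1 GiB limit shown in step 2.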
    

Runner Not Picking Up Jobs

Symptom: Jobs stuck in queued state

Solutions:

  1. Check runner status:

    # Self-hosted runner
    ./run.sh --check
    
  2. Verify labels match:

    runs-on: [self-hosted, linux, x64]  # All labels must match
    
  3. Restart runner service:

    sudo systemctl restart actions.runner.*.service
    

Signing & Attestation Issues

Cosign Signing Failures

Symptom: error opening key: no such file

Solutions:

  1. Check key configuration:

    # From base64 secret
    echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key
    
    # Verify key
    cosign public-key --key cosign.key
    
  2. Set password:

    export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
    
  3. Use keyless signing:

    - name: Sign with keyless
      env:
        COSIGN_EXPERIMENTAL: 1  # only needed for cosign 1.x; keyless is the default in 2.x
      run: cosign sign --yes "$IMAGE"
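
A frequent cause of key errors is a secret that was not base64-encoded correctly. The sketch below decodes the secret, restricts the file permissions, and rejects output that does not start with a PEM header. The name `write_pem_key` is hypothetical, and the PEM check is a plausibility test only; it does not validate the key itself.

```shell
# Decode the base64-encoded PEM key in $1 into the file $2 with owner-only permissions.
# Fails (and removes the file) if the decoded content does not start with a PEM header.
write_pem_key() {
  ( umask 077 && printf '%s' "$1" | base64 -d > "$2" ) || return 1
  if ! head -n 1 "$2" | grep -q '^-----BEGIN '; then
    echo "decoded key is not PEM; check the secret's encoding" >&2
    rm -f "$2"
    return 1
  fi
}
```

Usage: `write_pem_key "$COSIGN_PRIVATE_KEY_B64" cosign.key`, followed by `cosign public-key --key cosign.key` to confirm that cosign can read it.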
    

SBOM Generation Failures

Symptom: syft: command not found

Solutions:

  1. Install Syft:

    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
    
  2. Use the SBOM action:

    - name: Generate SBOM
      uses: anchore/sbom-action@v0
      with:
        image: ${{ env.IMAGE }}
    

Debugging Tips

Enable Debug Logging

env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true

SSH into Runner

- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()

Collect Diagnostic Info

- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a

View Workflow Logs

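# Watch a run until it completes
gh run watch <run-id>

# Download the artifact named "logs" (assumes the workflow uploads one);
# for the run's own log output use: gh run view <run-id> --log
gh run download <run-id> --name logs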

Getting Help

  1. Check existing issues: Search repository issues
  2. Review workflow history: Look for similar failures
  3. Consult documentation: docs/ and .gitea/docs/
  4. Contact DevOps: Create issue with label ci-cd

Information to Include

  • Workflow name and run ID
  • Error message and stack trace
  • Steps to reproduce
  • Environment details (OS, SDK versions)
  • Recent changes to affected code