CI/CD Troubleshooting Guide
Common issues and solutions for StellaOps CI/CD infrastructure.
Quick Diagnostics
Check Workflow Status
# View recent workflow runs
gh run list --limit 10
# View specific run logs
gh run view <run-id> --log
# Re-run failed workflow
gh run rerun <run-id>
Verify Local Environment
# Check .NET SDK
dotnet --list-sdks
# Check Docker
docker version
docker buildx version
# Check Node.js
node --version
npm --version
# Check required tools
which cosign syft helm
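`which` prints nothing for a missing tool, and its exit status is easy to overlook in a long script. A minimal sketch of a stricter check follows. `require_tools` is a hypothetical helper name, not part of the StellaOps scripts, and it relies only on the POSIX `command -v` builtin.

```bash
# Hypothetical helper: report every missing tool on stderr,
# then return non-zero if at least one was missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible call would be `require_tools cosign syft helm || exit 1`.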
Build Failures
NuGet Restore Failures
Symptom: error NU1301: Unable to load the service index
Causes:
- Network connectivity issues
- NuGet source unavailable
- Invalid credentials
Solutions:
# Clear NuGet cache
dotnet nuget locals all --clear
# Check NuGet sources
dotnet nuget list source
# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed
In CI:
- name: Restore with retry
  run: |
    for i in {1..3}; do
      dotnet restore src/StellaOps.sln && exit 0
      echo "Retry $i..."
      sleep 30
    done
    # All attempts failed: fail the step instead of ending on sleep's exit code 0
    exit 1
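The same retry pattern can be factored into a reusable function so that other flaky steps share it. This is a sketch only: `retry` is a hypothetical helper, and the attempt count and delay are whatever the caller passes in.

```bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 once all attempts have failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while true; do
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      echo "Failed after $n attempts: $*" >&2
      return 1
    fi
    echo "Attempt $n failed, retrying in ${delay}s..." >&2
    sleep "$delay"
    n=$((n + 1))
  done
}
```

For example, `retry 3 30 dotnet restore src/StellaOps.sln` would mirror the CI step above.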
SDK Version Mismatch
Symptom: error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found
Solutions:
- Check global.json:
    cat global.json
- Install correct SDK:
    # CI environment
    - uses: actions/setup-dotnet@v4
      with:
        dotnet-version: '10.0.100'
        include-prerelease: true
- Override SDK version:
    # Remove global.json override
    rm global.json
Docker Build Failures
Symptom: failed to solve: rpc error: code = Unknown
Causes:
- Disk space exhausted
- Layer cache corruption
- Network timeout
Solutions:
# Clean Docker system
docker system prune -af
docker builder prune -af
# Build without cache
docker build --no-cache -t myimage .
# Create a builder that uses host networking (may help with network timeouts)
docker buildx create --driver-opt network=host --use
Multi-arch Build Failures
Symptom: exec format error or QEMU issues
Solutions:
# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap
# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .
Test Failures
Testcontainers Issues
Symptom: Could not find a running Docker daemon
Solutions:
- Ensure Docker is running:
    docker info
- Set Testcontainers host:
    export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
    # or for Linux
    export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
- Use Ryuk container for cleanup:
    export TESTCONTAINERS_RYUK_DISABLED=false
- CI configuration:
    services:
      dind:
        image: docker:dind
        privileged: true
PostgreSQL Test Failures
Symptom: FATAL: role "postgres" does not exist
Solutions:
- Check connection string:
    export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
- Use Testcontainers PostgreSQL:
    var container = new PostgreSqlBuilder()
        .WithDatabase("test")
        .WithUsername("postgres")
        .WithPassword("postgres")
        .Build();
- Wait for PostgreSQL readiness:
    until pg_isready -h localhost -p 5432; do
      sleep 1
    done
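The readiness loop above waits forever if PostgreSQL never comes up, which turns a startup failure into a job timeout. A bounded variant could look like the sketch below. `wait_for` is a hypothetical helper, and the elapsed time is approximate because it counts one-second sleeps rather than measuring the probe itself.

```bash
# Hypothetical helper: wait_for <timeout-seconds> <command...>
# Polls the command once per second until it succeeds or the timeout elapses.
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

A call such as `wait_for 60 pg_isready -h localhost -p 5432` would then fail the step after roughly a minute instead of hanging.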
Test Timeouts
Symptom: Test exceeded timeout
Solutions:
- Increase timeout:
    dotnet test --blame-hang-timeout 10m
- Run tests in parallel with limited concurrency:
    dotnet test -maxcpucount:2
- Identify slow tests:
    dotnet test --logger "console;verbosity=detailed" --logger "trx"
Determinism Test Failures
Symptom: Output mismatch: expected SHA256 differs
Solutions:
- Check for non-deterministic sources:
  - Timestamps
  - Random GUIDs
  - Floating-point operations
  - Dictionary ordering
- Run determinism comparison:
    .gitea/scripts/test/determinism-run.sh
    diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
- Update golden fixtures:
    .gitea/scripts/test/run-fixtures-check.sh --update
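To narrow down which command introduces the non-determinism, it can help to run a single step twice and compare digests of its output. The sketch below is not how `determinism-run.sh` works (its internals are not shown here). `is_deterministic` is a hypothetical helper that assumes GNU `sha256sum` is on the PATH.

```bash
# Hypothetical helper: run a command twice and compare the SHA-256
# digests of its stdout. Returns non-zero when the two runs differ.
is_deterministic() {
  first=$("$@" | sha256sum | cut -d' ' -f1)
  second=$("$@" | sha256sum | cut -d' ' -f1)
  if [ "$first" != "$second" ]; then
    echo "Output mismatch: $first != $second" >&2
    return 1
  fi
}
```

Wrapping a suspect command, for instance `is_deterministic my-scan-command --format json` (a placeholder command), shows quickly whether that step alone is stable.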
Deployment Failures
SSH Connection Issues
Symptom: ssh: connect to host X.X.X.X port 22: Connection refused
Solutions:
- Verify SSH key:
    ssh-keygen -lf ~/.ssh/id_rsa.pub
- Test connection:
    ssh -vvv user@host
- Add host to known_hosts:
    - name: Setup SSH
      run: |
        mkdir -p ~/.ssh
        ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
Registry Push Failures
Symptom: unauthorized: authentication required
Solutions:
- Login to registry:
    docker login git.stella-ops.org -u $REGISTRY_USERNAME -p $REGISTRY_PASSWORD
- Check token permissions:
  - write:packages scope required
  - Token not expired
- Use credential helper:
    - name: Login to Registry
      uses: docker/login-action@v3
      with:
        registry: git.stella-ops.org
        username: ${{ secrets.REGISTRY_USERNAME }}
        password: ${{ secrets.REGISTRY_PASSWORD }}
Helm Deployment Failures
Symptom: Error: UPGRADE FAILED: cannot patch
Solutions:
- Check resource conflicts:
    kubectl get events -n stellaops --sort-by='.lastTimestamp'
- Force upgrade:
    helm upgrade --install --force stellaops ./devops/helm/stellaops
- Clean up stuck release:
    helm history stellaops
    helm rollback stellaops <revision>
    # or
    kubectl delete secret -l name=stellaops,owner=helm
Workflow Issues
Workflow Not Triggering
Symptom: Push/PR doesn't trigger workflow
Causes:
- Path filter not matching
- Branch protection rules
- YAML syntax error
Solutions:
- Check path filters:
    on:
      push:
        paths:
          - 'src/**'  # Check if files match
- Validate YAML:
    .gitea/scripts/validate/validate-workflows.sh
- Check branch rules:
  - Verify workflow permissions
  - Check protected branch settings
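A quick local way to test the path-filter theory is to feed the list of changed files through a glob check. This is a rough sketch, not the runner's actual matching logic: `any_path_matches` is a hypothetical helper, and in a shell `case` pattern `*` also matches `/`, so `'src/*'` here only approximates the workflow filter `'src/**'`.

```bash
# Hypothetical helper: read file paths from stdin (one per line) and
# succeed if at least one matches the given shell glob.
any_path_matches() {
  pattern=$1
  while IFS= read -r file; do
    case "$file" in
      $pattern) return 0 ;;
    esac
  done
  return 1
}
```

One possible use, assuming the base branch is `origin/main`, is `git diff --name-only origin/main...HEAD | any_path_matches 'src/*'`.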
Concurrency Issues
Symptom: Duplicate runs or stuck workflows
Solutions:
- Add concurrency control:
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true
- Cancel stale runs manually:
    gh run cancel <run-id>
Artifact Upload/Download Failures
Symptom: Unable to find any artifacts
Solutions:
- Check artifact names match:
    # Upload
    - uses: actions/upload-artifact@v4
      with:
        name: my-artifact  # Must match
    # Download
    - uses: actions/download-artifact@v4
      with:
        name: my-artifact  # Must match
- Check retention period:
    - uses: actions/upload-artifact@v4
      with:
        retention-days: 90  # Default is 90
- Verify job dependencies:
    download-job:
      needs: [upload-job]  # Must complete first
Runner Issues
Disk Space Exhausted
Symptom: No space left on device
Solutions:
- Run cleanup script:
    .gitea/scripts/util/cleanup-runner-space.sh
- Add cleanup step to workflow:
    - name: Free disk space
      run: |
        docker system prune -af
        rm -rf /tmp/*
        df -h
- Use larger runner:
    runs-on: ubuntu-latest-4xlarge
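Failing early with a clear message is often easier to diagnose than a build that dies halfway with `No space left on device`. The sketch below shows one way to guard a step. `has_free_space` is a hypothetical helper, and the threshold is an assumption to tune per job.

```bash
# Hypothetical helper: succeed if <path> has at least <min-kib> KiB available.
has_free_space() {
  dir=$1
  min_kib=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # on the second line, column 4 is the available space.
  avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kib" -ge "$min_kib" ]
}
```

For example, `has_free_space / 10485760 || { echo "Less than 10 GiB free" >&2; exit 1; }` would stop the job before a large image build starts.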
Out of Memory
Symptom: Killed or OOMKilled
Solutions:
- Limit parallel jobs:
    strategy:
      max-parallel: 2
- Limit dotnet memory:
    export DOTNET_GCHeapHardLimit=0x40000000  # 1GB
- Use swap:
    - name: Create swap
      run: |
        sudo fallocate -l 4G /swapfile
        sudo chmod 600 /swapfile
        sudo mkswap /swapfile
        sudo swapon /swapfile
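`DOTNET_GCHeapHardLimit` takes a hexadecimal byte count, which is awkward to compute by hand for limits other than 1 GB. A small conversion sketch follows. `gc_heap_limit_hex` is a hypothetical helper that takes the limit in MiB.

```bash
# Hypothetical helper: convert a limit in MiB to the hexadecimal
# byte count expected by DOTNET_GCHeapHardLimit.
gc_heap_limit_hex() {
  printf '0x%X\n' "$(( $1 * 1024 * 1024 ))"
}
```

With this, `export DOTNET_GCHeapHardLimit=$(gc_heap_limit_hex 1024)` produces the same `0x40000000` value used above.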
Runner Not Picking Up Jobs
Symptom: Jobs stuck in queued state
Solutions:
- Check runner status:
    # Self-hosted runner
    ./run.sh --check
- Verify labels match:
    runs-on: [self-hosted, linux, x64]  # All labels must match
- Restart runner service:
    sudo systemctl restart actions.runner.*.service
Signing & Attestation Issues
Cosign Signing Failures
Symptom: error opening key: no such file
Solutions:
- Check key configuration:
    # From base64 secret
    echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key
    # Verify key
    cosign public-key --key cosign.key
- Set password:
    export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
- Use keyless signing:
    - name: Sign with keyless
      env:
        COSIGN_EXPERIMENTAL: 1
      run: cosign sign --yes $IMAGE
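When the key is decoded from a secret, an unset secret silently yields an empty key file, and the default umask may leave the file readable by other users. The sketch below guards against both. `write_key_file` is a hypothetical helper, and it assumes a `base64` that accepts `-d` (as GNU coreutils does).

```bash
# Hypothetical helper: decode base64 key material into <dest>,
# refusing empty input and creating the file with owner-only permissions.
write_key_file() {
  encoded=$1
  dest=$2
  if [ -z "$encoded" ]; then
    echo "key material is empty; is the secret set?" >&2
    return 1
  fi
  # The subshell keeps the restrictive umask from leaking into the caller.
  ( umask 077 && printf '%s' "$encoded" | base64 -d > "$dest" )
}
```

A call like `write_key_file "$COSIGN_PRIVATE_KEY_B64" cosign.key` could replace the plain `echo ... | base64 -d` line above.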
SBOM Generation Failures
Symptom: syft: command not found
Solutions:
- Install Syft:
    curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
- Use container:
    - name: Generate SBOM
      uses: anchore/sbom-action@v0
      with:
        image: ${{ env.IMAGE }}
Debugging Tips
Enable Debug Logging
env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true
SSH into Runner
- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()
Collect Diagnostic Info
- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a
View Workflow Logs
# Stream logs
gh run watch <run-id>
# Download the artifact named "logs" (only exists if the workflow uploads one)
gh run download <run-id> --name logs
Getting Help
- Check existing issues: Search repository issues
- Review workflow history: Look for similar failures
- Consult documentation: docs/ and .gitea/docs/
- Contact DevOps: Create issue with label ci-cd
Information to Include
- Workflow name and run ID
- Error message and stack trace
- Steps to reproduce
- Environment details (OS, SDK versions)
- Recent changes to affected code