# CI/CD Troubleshooting Guide

Common issues and solutions for StellaOps CI/CD infrastructure.

## Quick Diagnostics

### Check Workflow Status

```bash
# View recent workflow runs
gh run list --limit 10

# View specific run logs
gh run view --log

# Re-run failed workflow
gh run rerun
```

### Verify Local Environment

```bash
# Check .NET SDK
dotnet --list-sdks

# Check Docker
docker version
docker buildx version

# Check Node.js
node --version
npm --version

# Check required tools
which cosign syft helm
```

---

## Build Failures

### NuGet Restore Failures

**Symptom:** `error NU1301: Unable to load the service index`

**Causes:**
1. Network connectivity issues
2. NuGet source unavailable
3. Invalid credentials

**Solutions:**

```bash
# Clear NuGet cache
dotnet nuget locals all --clear

# Check NuGet sources
dotnet nuget list source

# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed
```

**In CI:**

```yaml
- name: Restore with retry
  run: |
    for i in {1..3}; do
      # Stop as soon as one attempt succeeds
      dotnet restore src/StellaOps.sln && exit 0
      echo "Retry $i..."
      sleep 30
    done
    # All attempts failed: fail the step instead of returning sleep's exit code
    exit 1
```

---

### SDK Version Mismatch

**Symptom:** `error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found`

**Solutions:**

1. Check `global.json`:
   ```bash
   cat global.json
   ```

2. Install correct SDK:
   ```yaml
   # CI environment
   - uses: actions/setup-dotnet@v4
     with:
       dotnet-version: '10.0.100'
       include-prerelease: true
   ```

3. Override SDK version:
   ```bash
   # Remove global.json override
   rm global.json
   ```

---

### Docker Build Failures

**Symptom:** `failed to solve: rpc error: code = Unknown`

**Causes:**
1. Disk space exhausted
2. Layer cache corruption
3. Network timeout

**Solutions:**

```bash
# Clean Docker system
docker system prune -af
docker builder prune -af

# Build without cache
docker build --no-cache -t myimage .

# Recreate the buildx builder on the host network (can help with network timeouts)
docker buildx create --driver-opt network=host --use
```

---

### Multi-arch Build Failures

**Symptom:** `exec format error` or QEMU issues

**Solutions:**

```bash
# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .
```

---

## Test Failures

### Testcontainers Issues

**Symptom:** `Could not find a running Docker daemon`

**Solutions:**

1. Ensure Docker is running:
   ```bash
   docker info
   ```

2. Set Testcontainers host:
   ```bash
   export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
   # or for Linux
   export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
   ```

3. Keep the Ryuk cleanup container enabled:
   ```bash
   export TESTCONTAINERS_RYUK_DISABLED=false
   ```

4. CI configuration:
   ```yaml
   services:
     dind:
       image: docker:dind
       privileged: true
   ```

---

### PostgreSQL Test Failures

**Symptom:** `FATAL: role "postgres" does not exist`

**Solutions:**

1. Check connection string:
   ```bash
   export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
   ```

2. Use Testcontainers PostgreSQL:
   ```csharp
   var container = new PostgreSqlBuilder()
       .WithDatabase("test")
       .WithUsername("postgres")
       .WithPassword("postgres")
       .Build();
   ```

3. Wait for PostgreSQL readiness:
   ```bash
   until pg_isready -h localhost -p 5432; do
     sleep 1
   done
   ```

---

### Test Timeouts

**Symptom:** `Test exceeded timeout`

**Solutions:**

1. Increase timeout:
   ```bash
   dotnet test --blame-hang-timeout 10m
   ```

2. Run tests in parallel with limited concurrency:
   ```bash
   dotnet test -maxcpucount:2
   ```

3. Identify slow tests:
   ```bash
   dotnet test --logger "console;verbosity=detailed" --logger "trx"
   ```

---

### Determinism Test Failures

**Symptom:** `Output mismatch: expected SHA256 differs`

**Solutions:**

1. Check for non-deterministic sources:
   - Timestamps
   - Random GUIDs
   - Floating-point operations
   - Dictionary ordering

2. Run determinism comparison:
   ```bash
   .gitea/scripts/test/determinism-run.sh
   diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
   ```

3. Update golden fixtures:
   ```bash
   .gitea/scripts/test/run-fixtures-check.sh --update
   ```

---

## Deployment Failures

### SSH Connection Issues

**Symptom:** `ssh: connect to host X.X.X.X port 22: Connection refused`

**Solutions:**

1. Verify SSH key:
   ```bash
   ssh-keygen -lf ~/.ssh/id_rsa.pub
   ```

2. Test connection:
   ```bash
   ssh -vvv user@host
   ```

3. Add host to known_hosts:
   ```yaml
   - name: Setup SSH
     run: |
       mkdir -p ~/.ssh
       ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
   ```

---

### Registry Push Failures

**Symptom:** `unauthorized: authentication required`

**Solutions:**

1. Log in to the registry:
   ```bash
   # Pass the password on stdin so it does not appear in the process list or shell history
   echo "$REGISTRY_PASSWORD" | docker login git.stella-ops.org -u "$REGISTRY_USERNAME" --password-stdin
   ```

2. Check token permissions:
   - `write:packages` scope required
   - Token not expired

3. Use the login action in CI:
   ```yaml
   - name: Login to Registry
     uses: docker/login-action@v3
     with:
       registry: git.stella-ops.org
       username: ${{ secrets.REGISTRY_USERNAME }}
       password: ${{ secrets.REGISTRY_PASSWORD }}
   ```

---

### Helm Deployment Failures

**Symptom:** `Error: UPGRADE FAILED: cannot patch`

**Solutions:**

1. Check resource conflicts:
   ```bash
   kubectl get events -n stellaops --sort-by='.lastTimestamp'
   ```

2. Force upgrade:
   ```bash
   helm upgrade --install --force stellaops ./devops/helm/stellaops
   ```

3. Clean up stuck release:
   ```bash
   helm history stellaops
   helm rollback stellaops
   # or
   kubectl delete secret -l name=stellaops,owner=helm
   ```

---

## Workflow Issues

### Workflow Not Triggering

**Symptom:** Push/PR doesn't trigger workflow

**Causes:**
1. Path filter not matching
2. Branch protection rules
3. YAML syntax error

**Solutions:**

1. Check path filters:
   ```yaml
   on:
     push:
       paths:
         - 'src/**'  # Check if files match
   ```

2. Validate YAML:
   ```bash
   .gitea/scripts/validate/validate-workflows.sh
   ```

3. Check branch rules:
   - Verify workflow permissions
   - Check protected branch settings

---

### Concurrency Issues

**Symptom:** Duplicate runs or stuck workflows

**Solutions:**

1. Add concurrency control:
   ```yaml
   concurrency:
     group: ${{ github.workflow }}-${{ github.ref }}
     cancel-in-progress: true
   ```

2. Cancel stale runs manually:
   ```bash
   gh run cancel
   ```

---

### Artifact Upload/Download Failures

**Symptom:** `Unable to find any artifacts`

**Solutions:**

1. Check artifact names match:
   ```yaml
   # Upload
   - uses: actions/upload-artifact@v4
     with:
       name: my-artifact  # Must match

   # Download
   - uses: actions/download-artifact@v4
     with:
       name: my-artifact  # Must match
   ```

2. Check retention period:
   ```yaml
   - uses: actions/upload-artifact@v4
     with:
       retention-days: 90  # Default is 90
   ```

3. Verify job dependencies:
   ```yaml
   download-job:
     needs: [upload-job]  # Must complete first
   ```

---

## Runner Issues

### Disk Space Exhausted

**Symptom:** `No space left on device`

**Solutions:**

1. Run cleanup script:
   ```bash
   .gitea/scripts/util/cleanup-runner-space.sh
   ```

2. Add cleanup step to workflow:
   ```yaml
   - name: Free disk space
     run: |
       docker system prune -af
       rm -rf /tmp/*
       df -h
   ```

3. Use larger runner:
   ```yaml
   runs-on: ubuntu-latest-4xlarge
   ```

---

### Out of Memory

**Symptom:** `Killed` or `OOMKilled`

**Solutions:**

1. Limit parallel jobs:
   ```yaml
   strategy:
     max-parallel: 2
   ```

2. Limit dotnet memory:
   ```bash
   export DOTNET_GCHeapHardLimit=0x40000000  # 1GB
   ```

3. Use swap:
   ```yaml
   - name: Create swap
     run: |
       sudo fallocate -l 4G /swapfile
       sudo chmod 600 /swapfile
       sudo mkswap /swapfile
       sudo swapon /swapfile
   ```

---

### Runner Not Picking Up Jobs

**Symptom:** Jobs stuck in `queued` state

**Solutions:**

1. Check runner status:
   ```bash
   # Self-hosted runner
   ./run.sh --check
   ```

2. Verify labels match:
   ```yaml
   runs-on: [self-hosted, linux, x64]  # All labels must match
   ```

3. Restart runner service:
   ```bash
   sudo systemctl restart actions.runner.*.service
   ```

---

## Signing & Attestation Issues

### Cosign Signing Failures

**Symptom:** `error opening key: no such file`

**Solutions:**

1. Check key configuration:
   ```bash
   # From base64 secret
   echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key

   # Verify key
   cosign public-key --key cosign.key
   ```

2. Set password:
   ```bash
   export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
   ```

3. Use keyless signing:
   ```yaml
   - name: Sign with keyless
     env:
       COSIGN_EXPERIMENTAL: 1
     run: cosign sign --yes $IMAGE
   ```

---

### SBOM Generation Failures

**Symptom:** `syft: command not found`

**Solutions:**

1. Install Syft:
   ```bash
   curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
   ```

2. Use the Anchore SBOM action:
   ```yaml
   - name: Generate SBOM
     uses: anchore/sbom-action@v0
     with:
       image: ${{ env.IMAGE }}
   ```

---

## Debugging Tips

### Enable Debug Logging

```yaml
env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true
```

### SSH into Runner

```yaml
- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()
```

### Collect Diagnostic Info

```yaml
- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a
```

### View Workflow Logs

```bash
# Stream logs
gh run watch

# Download logs
gh run download --name logs
```

---

## Getting Help

1. **Check existing issues:** Search repository issues
2. **Review workflow history:** Look for similar failures
3. **Consult documentation:** `docs/` and `.gitea/docs/`
4. **Contact DevOps:** Create issue with label `ci-cd`

### Information to Include

- Workflow name and run ID
- Error message and stack trace
- Steps to reproduce
- Environment details (OS, SDK versions)
- Recent changes to affected code
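
The environment details requested above can be gathered in one pass and pasted into the issue. The script below is a minimal sketch, not part of the StellaOps tooling: the function names `report_tool` and `collect_env_details` are hypothetical, and the tool list is an assumption based on the checks under "Verify Local Environment". Adjust it to whatever the failing workflow actually uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): prints tool versions
# in a form that can be pasted into a ci-cd issue.

# Print "<label>: <first line of the command's output>",
# or "<label>: not installed" when the command is not on PATH.
report_tool() {
  local label="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
  else
    printf '%s: not installed\n' "$label"
  fi
}

# Collect the OS and SDK/tool versions relevant to the pipelines.
collect_env_details() {
  echo "=== Environment details ==="
  report_tool "OS" uname -sr
  report_tool ".NET SDK" dotnet --version
  report_tool "Docker" docker --version
  report_tool "Node.js" node --version
  report_tool "npm" npm --version
  report_tool "Helm" helm version --short
}

collect_env_details
```

Missing tools are reported as `not installed` rather than aborting the script, so the output stays useful on partially provisioned runners.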