# CI/CD Troubleshooting Guide

Common issues and solutions for StellaOps CI/CD infrastructure.

## Quick Diagnostics

### Check Workflow Status

```bash
# View recent workflow runs
gh run list --limit 10

# View specific run logs
gh run view <run-id> --log

# Re-run failed workflow
gh run rerun <run-id>
```

### Verify Local Environment

```bash
# Check .NET SDK
dotnet --list-sdks

# Check Docker
docker version
docker buildx version

# Check Node.js
node --version
npm --version

# Check required tools
which cosign syft helm
```
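
The individual checks above can be folded into one helper that lists every missing tool at once. This is a minimal sketch, not part of the repository's scripts: the function name `report_missing_tools` is hypothetical, and it only tests whether each tool is on `PATH`, not whether its version is suitable.

```bash
# report_missing_tools TOOL...: print every tool that is not on PATH.
# Returns 0 when all tools are present, 1 otherwise.
# (Hypothetical helper, shown for illustration only.)
report_missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: report_missing_tools dotnet docker node npm cosign syft helm
```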

---

## Build Failures

### NuGet Restore Failures

**Symptom:** `error NU1301: Unable to load the service index`

**Causes:**
1. Network connectivity issues
2. NuGet source unavailable
3. Invalid credentials

**Solutions:**

```bash
# Clear NuGet cache
dotnet nuget locals all --clear

# Check NuGet sources
dotnet nuget list source

# Restore with verbose logging
dotnet restore src/StellaOps.sln -v detailed
```

**In CI:**
```yaml
- name: Restore with retry
  run: |
    for i in 1 2 3; do
      # Stop as soon as one attempt succeeds
      dotnet restore src/StellaOps.sln && exit 0
      echo "Attempt $i failed; retrying in 30s..."
      sleep 30
    done
    # Fail the step if every attempt failed
    echo "Restore failed after 3 attempts" >&2
    exit 1
```
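
The same retry idea can be written as a reusable function with a growing delay between attempts. This is a sketch under stated assumptions: the name `retry` and its argument order are hypothetical, and doubling the delay is one reasonable backoff choice, not something the pipeline prescribes.

```bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
# (Hypothetical helper, shown for illustration only.)
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example: retry 3 30 dotnet restore src/StellaOps.sln
```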

---

### SDK Version Mismatch

**Symptom:** `error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found`

**Solutions:**

1. Check `global.json`:
   ```bash
   cat global.json
   ```

2. Install correct SDK:
   ```yaml
   # CI environment
   - uses: actions/setup-dotnet@v4
     with:
       dotnet-version: '10.0.100'
   ```

3. Bypass the pinned SDK version (local debugging only; do not commit the deletion):
   ```bash
   # Without global.json, the newest installed SDK is used
   rm global.json
   ```

---

### Docker Build Failures

**Symptom:** `failed to solve: rpc error: code = Unknown`

**Causes:**
1. Disk space exhausted
2. Layer cache corruption
3. Network timeout

**Solutions:**

```bash
# Clean Docker system
docker system prune -af
docker builder prune -af

# Build without cache
docker build --no-cache -t myimage .

# Recreate the buildx builder with host networking (can help with network timeouts)
docker buildx create --driver-opt network=host --use
```

---

### Multi-arch Build Failures

**Symptom:** `exec format error` or QEMU issues

**Solutions:**

```bash
# Install QEMU for cross-platform builds
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

# Create new buildx builder
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

# Build for specific platforms
docker buildx build --platform linux/amd64 -t myimage .
```

---

## Test Failures

### Testcontainers Issues

**Symptom:** `Could not find a running Docker daemon`

**Solutions:**

1. Ensure Docker is running:
   ```bash
   docker info
   ```

2. Set Testcontainers host:
   ```bash
   export TESTCONTAINERS_HOST_OVERRIDE=host.docker.internal
   # or for Linux
   export TESTCONTAINERS_HOST_OVERRIDE=$(hostname -I | awk '{print $1}')
   ```

3. Use Ryuk container for cleanup:
   ```bash
   export TESTCONTAINERS_RYUK_DISABLED=false
   ```

4. CI configuration:
   ```yaml
   services:
     dind:
       image: docker:dind
       privileged: true
   ```

---

### PostgreSQL Test Failures

**Symptom:** `FATAL: role "postgres" does not exist`

**Solutions:**

1. Check connection string:
   ```bash
   export STELLAOPS_TEST_POSTGRES_CONNECTION="Host=localhost;Database=test;Username=postgres;Password=postgres"
   ```

2. Use Testcontainers PostgreSQL:
   ```csharp
   var container = new PostgreSqlBuilder()
       .WithDatabase("test")
       .WithUsername("postgres")
       .WithPassword("postgres")
       .Build();
   ```

3. Wait for PostgreSQL readiness:
   ```bash
   until pg_isready -h localhost -p 5432; do
     sleep 1
   done
   ```
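
The readiness loop in step 3 waits forever if the database never comes up, which can hold a CI job until its global timeout. One possible bounded variant is sketched below: the name `wait_until` is hypothetical, and the timeout counts one-second polls, so it is approximate rather than exact.

```bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND about once per second; returns 0 as soon as it succeeds,
# or 1 after roughly TIMEOUT_SECONDS without success.
# (Hypothetical helper, shown for illustration only.)
wait_until() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example: wait_until 60 pg_isready -h localhost -p 5432
```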

---

### Test Timeouts

**Symptom:** `Test exceeded timeout`

**Solutions:**

1. Increase timeout:
   ```bash
   dotnet test --blame-hang-timeout 10m
   ```

2. Run tests in parallel with limited concurrency:
   ```bash
   dotnet test -maxcpucount:2
   ```

3. Identify slow tests:
   ```bash
   dotnet test --logger "console;verbosity=detailed" --logger "trx"
   ```

---

### Determinism Test Failures

**Symptom:** `Output mismatch: expected SHA256 differs`

**Solutions:**

1. Check for non-deterministic sources:
   - Timestamps
   - Random GUIDs
   - Floating-point operations
   - Dictionary ordering

2. Run determinism comparison:
   ```bash
   .gitea/scripts/test/determinism-run.sh
   diff out/scanner-determinism/run1.json out/scanner-determinism/run2.json
   ```

3. Update golden fixtures:
   ```bash
   .gitea/scripts/test/run-fixtures-check.sh --update
   ```
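
Because the symptom is a SHA256 mismatch, it can help to compare digests directly before reading a long `diff`. The sketch below is an illustration only: `sha256_of` and `same_digest` are hypothetical names, and it does not reproduce what `determinism-run.sh` does internally.

```bash
# sha256_of FILE: print the hex SHA-256 digest of FILE.
# Uses sha256sum when available, falling back to shasum (e.g. on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# same_digest FILE_A FILE_B: return 0 if both files hash identically,
# 1 on a mismatch, 2 if either file could not be hashed.
same_digest() {
  a=$(sha256_of "$1")
  b=$(sha256_of "$2")
  if [ -z "$a" ] || [ -z "$b" ]; then
    echo "could not hash $1 or $2" >&2
    return 2
  fi
  if [ "$a" != "$b" ]; then
    echo "digest mismatch: $1=$a $2=$b" >&2
    return 1
  fi
}

# Example:
#   same_digest out/scanner-determinism/run1.json out/scanner-determinism/run2.json
```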

---

## Deployment Failures

### SSH Connection Issues

**Symptom:** `ssh: connect to host X.X.X.X port 22: Connection refused`

**Solutions:**

1. Verify SSH key:
   ```bash
   ssh-keygen -lf ~/.ssh/id_rsa.pub
   ```

2. Test connection:
   ```bash
   ssh -vvv user@host
   ```

3. Add host to known_hosts:
   ```yaml
   - name: Setup SSH
     run: |
       mkdir -p ~/.ssh
       ssh-keyscan -H ${{ secrets.DEPLOY_HOST }} >> ~/.ssh/known_hosts
   ```

---

### Registry Push Failures

**Symptom:** `unauthorized: authentication required`

**Solutions:**

1. Login to registry:
   ```bash
   # Pass the password on stdin so it does not appear in the process list or shell history
   echo "$REGISTRY_PASSWORD" | docker login git.stella-ops.org -u "$REGISTRY_USERNAME" --password-stdin
   ```

2. Check token permissions:
   - `write:packages` scope required
   - Token not expired

3. Use credential helper:
   ```yaml
   - name: Login to Registry
     uses: docker/login-action@v3
     with:
       registry: git.stella-ops.org
       username: ${{ secrets.REGISTRY_USERNAME }}
       password: ${{ secrets.REGISTRY_PASSWORD }}
   ```

---

### Helm Deployment Failures

**Symptom:** `Error: UPGRADE FAILED: cannot patch`

**Solutions:**

1. Check resource conflicts:
   ```bash
   kubectl get events -n stellaops --sort-by='.lastTimestamp'
   ```

2. Force upgrade:
   ```bash
   helm upgrade --install --force stellaops ./devops/helm/stellaops
   ```

3. Clean up stuck release:
   ```bash
   helm history stellaops
   helm rollback stellaops <revision>
   # or
   kubectl delete secret -l name=stellaops,owner=helm
   ```

---

## Workflow Issues

### Workflow Not Triggering

**Symptom:** Push/PR doesn't trigger workflow

**Causes:**
1. Path filter not matching
2. Branch protection rules
3. YAML syntax error

**Solutions:**

1. Check path filters:
   ```yaml
   on:
     push:
       paths:
         - 'src/**'  # Check if files match
   ```

2. Validate YAML:
   ```bash
   .gitea/scripts/validate/validate-workflows.sh
   ```

3. Check branch rules:
   - Verify workflow permissions
   - Check protected branch settings
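
To check locally whether a change set would have matched the `src/**` filter from step 1, the changed paths can be tested against the same prefix. This is a rough sketch: `matches_src_filter` is a hypothetical helper that mimics only that one pattern, the `origin/main` base in the example is an assumption, and the CI server's own glob rules remain authoritative.

```bash
# matches_src_filter: read file paths on stdin (one per line) and succeed
# if at least one lies under src/, approximating the 'src/**' path filter.
# (Hypothetical helper; it does not implement the CI server's glob engine.)
matches_src_filter() {
  while IFS= read -r path; do
    case "$path" in
      src/*) return 0 ;;
    esac
  done
  return 1
}

# Example (the base branch name is an assumption):
#   git diff --name-only origin/main...HEAD | matches_src_filter \
#     && echo "workflow should trigger" \
#     || echo "no changed file matches src/**"
```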

---

### Concurrency Issues

**Symptom:** Duplicate runs or stuck workflows

**Solutions:**

1. Add concurrency control:
   ```yaml
   concurrency:
     group: ${{ github.workflow }}-${{ github.ref }}
     cancel-in-progress: true
   ```

2. Cancel stale runs manually:
   ```bash
   gh run cancel <run-id>
   ```

---

### Artifact Upload/Download Failures

**Symptom:** `Unable to find any artifacts`

**Solutions:**

1. Check artifact names match:
   ```yaml
   # Upload
   - uses: actions/upload-artifact@v4
     with:
       name: my-artifact  # Must match

   # Download
   - uses: actions/download-artifact@v4
     with:
       name: my-artifact  # Must match
   ```

2. Check retention period:
   ```yaml
   - uses: actions/upload-artifact@v4
     with:
       retention-days: 90  # Default is 90
   ```

3. Verify job dependencies:
   ```yaml
   download-job:
     needs: [upload-job]  # Must complete first
   ```

---

## Runner Issues

### Disk Space Exhausted

**Symptom:** `No space left on device`

**Solutions:**

1. Run cleanup script:
   ```bash
   .gitea/scripts/util/cleanup-runner-space.sh
   ```

2. Add cleanup step to workflow:
   ```yaml
   - name: Free disk space
     run: |
       docker system prune -af
       rm -rf /tmp/*
       df -h
   ```

3. Use larger runner:
   ```yaml
   runs-on: ubuntu-latest-4xlarge  # Example label; use one your runners actually advertise
   ```
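
Failing early with a clear message is often easier to diagnose than a `No space left on device` error halfway through a build. A minimal sketch of such a guard follows: `free_kb` and `require_free_gb` are hypothetical names, and the threshold in the example is arbitrary.

```bash
# free_kb PATH: print the free space (in KiB) on the filesystem holding PATH.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# require_free_gb PATH MIN_GB: fail if fewer than MIN_GB GiB are free.
# (Hypothetical helper, shown for illustration only.)
require_free_gb() {
  avail_kb=$(free_kb "$1")
  need_kb=$(( $2 * 1024 * 1024 ))
  if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only $((avail_kb / 1024 / 1024)) GiB free on $1, need $2 GiB" >&2
    return 1
  fi
}

# Example: require_free_gb / 10 || exit 1
```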

---

### Out of Memory

**Symptom:** `Killed` or `OOMKilled`

**Solutions:**

1. Limit parallel jobs:
   ```yaml
   strategy:
     max-parallel: 2
   ```

2. Limit dotnet memory:
   ```bash
   export DOTNET_GCHeapHardLimit=0x40000000  # 1 GiB (the value is a hexadecimal byte count)
   ```

3. Use swap:
   ```yaml
   - name: Create swap
     run: |
       sudo fallocate -l 4G /swapfile
       sudo chmod 600 /swapfile
       sudo mkswap /swapfile
       sudo swapon /swapfile
   ```
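
The hexadecimal byte count used for `DOTNET_GCHeapHardLimit` in step 2 is easy to get wrong by hand. The sketch below derives it from a size in MiB; `heap_limit_hex` is a hypothetical helper name.

```bash
# heap_limit_hex MIB: print MIB mebibytes as a hexadecimal byte count.
# (Hypothetical helper, shown for illustration only.)
heap_limit_hex() {
  printf '0x%X\n' $(( $1 * 1024 * 1024 ))
}

# Example: 1024 MiB gives 0x40000000, the value used in step 2.
#   export DOTNET_GCHeapHardLimit="$(heap_limit_hex 1024)"
```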

---

### Runner Not Picking Up Jobs

**Symptom:** Jobs stuck in `queued` state

**Solutions:**

1. Check runner status:
   ```bash
   # Self-hosted runner
   ./run.sh --check
   ```

2. Verify labels match:
   ```yaml
   runs-on: [self-hosted, linux, x64]  # All labels must match
   ```

3. Restart runner service:
   ```bash
   # Quote the pattern so systemctl expands it, not the shell
   sudo systemctl restart 'actions.runner.*.service'
   ```

---

## Signing & Attestation Issues

### Cosign Signing Failures

**Symptom:** `error opening key: no such file`

**Solutions:**

1. Check key configuration:
   ```bash
   # From base64 secret
   echo "$COSIGN_PRIVATE_KEY_B64" | base64 -d > cosign.key

   # Verify key
   cosign public-key --key cosign.key
   ```

2. Set password:
   ```bash
   export COSIGN_PASSWORD="${{ secrets.COSIGN_PASSWORD }}"
   ```

3. Use keyless signing:
   ```yaml
   - name: Sign with keyless
     env:
       COSIGN_EXPERIMENTAL: 1
     run: cosign sign --yes $IMAGE
   ```
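
When the private key is decoded from a secret as in step 1, the resulting file is created with the runner's default permissions. One way to tighten that is sketched below, assuming a `base64` that accepts `-d`: `write_secret_file` is a hypothetical helper, and the cleanup `trap` in the example is a suggestion rather than existing pipeline behavior.

```bash
# write_secret_file B64_VALUE DEST: decode a base64 secret into DEST,
# creating the file readable and writable by the current user only.
# The subshell keeps the umask change from leaking into the caller.
# (Hypothetical helper, shown for illustration only.)
write_secret_file() {
  (
    umask 077
    printf '%s' "$1" | base64 -d > "$2"
  )
}

# Example (inside a CI step):
#   write_secret_file "$COSIGN_PRIVATE_KEY_B64" cosign.key
#   trap 'rm -f cosign.key' EXIT   # remove the key when the step ends
#   cosign public-key --key cosign.key
```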

---

### SBOM Generation Failures

**Symptom:** `syft: command not found`

**Solutions:**

1. Install Syft:
   ```bash
   curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
   ```

2. Use the SBOM action instead of a local install:
   ```yaml
   - name: Generate SBOM
     uses: anchore/sbom-action@v0
     with:
       image: ${{ env.IMAGE }}
   ```

---

## Debugging Tips

### Enable Debug Logging

```yaml
env:
  ACTIONS_STEP_DEBUG: true
  ACTIONS_RUNNER_DEBUG: true
```

### SSH into Runner

```yaml
- name: Debug SSH
  uses: mxschmitt/action-tmate@v3
  if: failure()
```

### Collect Diagnostic Info

```yaml
- name: Diagnostics
  if: failure()
  run: |
    echo "=== Environment ==="
    env | sort
    echo "=== Disk ==="
    df -h
    echo "=== Memory ==="
    free -m
    echo "=== Docker ==="
    docker info
    docker ps -a
```

### View Workflow Logs

```bash
# Watch a run's progress until it completes
gh run watch <run-id>

# Download the artifact named "logs" (only exists if the workflow uploads one)
gh run download <run-id> --name logs
```

---

## Getting Help

1. **Check existing issues:** Search repository issues
2. **Review workflow history:** Look for similar failures
3. **Consult documentation:** `docs/` and `.gitea/docs/`
4. **Contact DevOps:** Create issue with label `ci-cd`

### Information to Include

- Workflow name and run ID
- Error message and stack trace
- Steps to reproduce
- Environment details (OS, SDK versions)
- Recent changes to affected code